@damian0815 damian0815 commented Oct 17, 2025

  • Basic functionality unit tests
  • warcio implementation
    • Validate output is identical to MRJob output with "test" robotstxt in MRJob repo
    • Validate on recent full-scale crawl output
  • fastwarc implementation
  • Unit test for text encoding edge cases and invalid input (all current test cases are fully valid UTF-8)
  • Check that the output works with crawl-tools/server/seed/sitemaps/sitemaps_robotstxt.py
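
For the encoding item in the checklist, the kind of inputs such a unit test might cover can be sketched roughly as follows. This is only an illustration using the Python standard library: `decode_robotstxt` and `EDGE_CASES` are hypothetical names, not code from this PR, and lenient decoding with replacement characters is an assumption about the desired behaviour rather than what the job actually does.

```python
# Hypothetical helper for illustration only; not taken from this PR.
def decode_robotstxt(payload: bytes) -> str:
    """Decode a robots.txt payload without raising on invalid UTF-8."""
    # Drop a leading UTF-8 byte order mark if one is present.
    if payload.startswith(b"\xef\xbb\xbf"):
        payload = payload[3:]
    # Assumption: undecodable bytes become U+FFFD instead of raising an error.
    return payload.decode("utf-8", errors="replace")


# Candidate edge-case payloads (hypothetical test data).
EDGE_CASES = {
    # Well-formed UTF-8 with a non-ASCII character.
    "valid_utf8": "Sitemap: https://example.com/sitemap-é.xml\n".encode("utf-8"),
    # UTF-8 content preceded by a byte order mark.
    "utf8_bom": b"\xef\xbb\xbfUser-agent: *\n",
    # Latin-1 encoded bytes, which are not valid UTF-8.
    "latin1_bytes": "Sitemap: https://example.com/café.xml\n".encode("latin-1"),
    # A multi-byte sequence cut off at the end of the payload.
    "truncated_multibyte": b"Disallow: /\xe2\x82",
}

if __name__ == "__main__":
    for name, payload in EDGE_CASES.items():
        print(name, repr(decode_robotstxt(payload)))
```

Each payload could become one parametrized test case that checks the job neither crashes nor silently drops the record.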

@damian0815 damian0815 marked this pull request as draft October 17, 2025 15:27
@damian0815 damian0815 marked this pull request as ready for review October 20, 2025 14:24
Signed-off-by: Damian Stewart <[email protected]>

@sebastian-nagel sebastian-nagel left a comment

First pass. I'll continue with testing, but several points need discussion.

Damian Stewart added 8 commits October 30, 2025 15:17

3 participants