Fix revisit content-type #42

Open
lfoppiano wants to merge 6 commits into cc from bugfix/revisit-content-type

Conversation

@lfoppiano

This PR fixes issue #40. I replaced the content type message/http with application/http; msgtype=response.
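
For illustration, here is a minimal Java sketch of the header fields a revisit record could carry with the new content type. The names RevisitHeaderSketch and revisitHeaders are hypothetical and are not the code changed in this PR. Only a subset of the WARC fields is shown; mandatory ones such as WARC-Record-ID, WARC-Date and WARC-Profile are left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not the Nutch WARC writer's actual code. */
final class RevisitHeaderSketch {
    // Content type the PR description says is now written instead of message/http.
    static final String HTTP_RESPONSE_TYPE = "application/http; msgtype=response";

    /**
     * Builds a subset of the named WARC header fields for a revisit record
     * whose block holds the HTTP response headers (for example of a 304 response).
     */
    static Map<String, String> revisitHeaders(String targetUri, int blockLength) {
        Map<String, String> headers = new LinkedHashMap<>(); // keeps insertion order
        headers.put("WARC-Type", "revisit");
        headers.put("WARC-Target-URI", targetUri);
        headers.put("Content-Type", HTTP_RESPONSE_TYPE);
        headers.put("Content-Length", Integer.toString(blockLength));
        return headers;
    }
}
```

A unit test for the writer could then assert on the Content-Type value of the generated record, assuming the test can parse the written headers back into a map like this one.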

@sebastian-nagel sebastian-nagel left a comment

Hi @lfoppiano, thanks! The fix looks good to me.

However, the unit test failed when I tried to run it. Creating an "HTTP 304" response object, or rather the Content object it includes, might be challenging. See the attached data. It's up to you whether to read the response from the segment or to study it to figure out what it should look like. Use nutch readseg for inspection. There are other Nutch unit tests which use segments.

Otherwise, it's much appreciated that the WARC writer now gets its first "deeper" unit test.

@lfoppiano
Author

lfoppiano commented Feb 24, 2026

@sebastian-nagel I left the test failing on purpose, because I first wanted to fix how the tests are run.

I managed (or, better, Claude Opus managed - other models weren't able) to find the cause.
In short, all tests were run with <fork>, and there were two tests in particular, in TestCommonCrawlDataDumper, which called at some point System.exit(), killing the whole forked process. TestMimeUtil had a similar problem, too.

For now I've separated the org.commoncrawl.* tests from the org.apache.* tests. The issue may still occur, though, and prevent other tests in the same group from being executed.
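
To make the failure mode concrete: a test that reaches System.exit() inside a forked JVM terminates that JVM, so the remaining tests in the same fork never run. One common way to avoid this, sketched below under the assumption that the tool can be restructured, is to let the tool return an exit code and call System.exit() only in main. The class ExitFreeToolSketch is hypothetical and is not taken from Nutch or from this PR.

```java
/** Hypothetical command-line tool, used only to illustrate the pattern. */
final class ExitFreeToolSketch {

    /** Does the work and reports the outcome as an exit code instead of exiting. */
    static int run(String[] args) {
        if (args.length == 0) {
            System.err.println("Usage: ExitFreeToolSketch <input>");
            return 1; // usage error, reported to the caller
        }
        // ... the real work would go here ...
        return 0; // success
    }

    /** Only the real entry point terminates the JVM; tests call run() directly. */
    public static void main(String[] args) {
        System.exit(run(args));
    }
}
```

A test would then call ExitFreeToolSketch.run(...) and assert on the returned code, which leaves the forked test JVM alive for the tests that follow.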

@lfoppiano lfoppiano marked this pull request as draft February 25, 2026 07:37
@sebastian-nagel

called at some point System.exit()

Thanks! That's a leftover from NUTCH-2852. It's unclear why it hits our fork but not upstream Nutch. It should be fixed upstream anyway. I'll also comment on PR #44.

@lfoppiano lfoppiano marked this pull request as ready for review February 26, 2026 13:40