Output warc-record-id and warc-ip-address in the CDX index#48

Open
lfoppiano wants to merge 3 commits into cc from bugfix/warc-id-warc-ip

Conversation

@lfoppiano

Ref #39

@lfoppiano lfoppiano force-pushed the bugfix/warc-id-warc-ip branch from cec9c62 to c5a4bd6 Compare February 27, 2026 17:06
@lfoppiano lfoppiano linked an issue Feb 27, 2026 that may be closed by this pull request
@lfoppiano
Author

@sebastian-nagel I did some manual testing with the local Nutch (the same setup we used together) and added some simple unit tests (e.g., checking that we correctly remove the ID prefix). The records may still have some imperfections, as they were built manually.

I did not have time to work on the more detailed integration test we discussed, which would deserialize the Java objects from the segments, but I will open an issue for it, since it should benefit more use cases.

@lfoppiano lfoppiano marked this pull request as ready for review March 1, 2026 06:53

@sebastian-nagel sebastian-nagel left a comment

Thanks, @lfoppiano!

The change is fine.

But the PR should be kept open for a while until the related changes mentioned in #39 are also implemented. I'd prefer to put everything into production in one go.

  data.put("redirect", redirectLocation);
}
if (ip != null) {
  data.put("ipaddress", ip);

See comment in #39 regarding the naming.

Author

Personally, I would prefer to keep the hyphen in it, but as we discussed, we can save quite some space by leaving it out.

*/
public class TestCommonCrawlDataDumper {

@Test

Out of scope.

Author

I cannot access the code. If these are leftover tests, I believe I removed them.

  long length = (countingOut.getByteCount() - offset);
  writeCdxLine(targetUri, date, offset, length, payloadDigest, content, true,
-     null, null);
+     null, null, recordId.toASCIIString(), ip);

This will include the prefix (URI scheme) urn:uuid:. In the WARC file there are even surrounding brackets: <urn:uuid:...>. So, we have three options:

  1. <urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5>
  2. urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5
  3. 824d10d3-131a-4f67-9cbf-e40ecb5f0fa5

I'd prefer option 3, without decorations. The digest (WARC-Payload-Digest) is indexed without the prefix sha1:, so this would be consistent within the index. It would also save some data volume if the URL index server sends the results uncompressed (most clients do not request compression, i.e., they do not send the Accept-Encoding header).

However, the GneissWeb Annotations use option 1.

Needs to be discussed.
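
To make the three options concrete, here is a minimal sketch of how option 3 could be derived from either of the other two forms. It is not the code in this PR: the class name `RecordIdExample` and the method `stripDecorations` are hypothetical and only illustrate the idea.

```java
public class RecordIdExample {

  private static final String URN_UUID_PREFIX = "urn:uuid:";

  /**
   * Hypothetical helper: strips the optional surrounding angle brackets and
   * the "urn:uuid:" scheme prefix from a WARC-Record-ID value.
   */
  public static String stripDecorations(String recordId) {
    String id = recordId.trim();
    // Option 1 -> option 2: drop the surrounding <...>
    if (id.startsWith("<") && id.endsWith(">")) {
      id = id.substring(1, id.length() - 1);
    }
    // Option 2 -> option 3: drop the URI scheme prefix
    if (id.startsWith(URN_UUID_PREFIX)) {
      id = id.substring(URN_UUID_PREFIX.length());
    }
    return id;
  }

  public static void main(String[] args) {
    // All three inputs yield the bare UUID of option 3.
    System.out.println(stripDecorations("<urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5>"));
    System.out.println(stripDecorations("urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5"));
    System.out.println(stripDecorations("824d10d3-131a-4f67-9cbf-e40ecb5f0fa5"));
  }
}
```

Handling both decorated forms in one place would keep the index output the same whether the ID comes from a `URI` object or from a raw WARC header value.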

Author

Yes, the substring is done inside the method, and we need to confirm which option we implement. For now it's option 3, as I understood that the GneissWeb annotations may be changed.


URI targetUri = new URI("https://example.com/revisit");
Date date = new Date();
Content content = createContent("304", "application/http;msgtype=response");

An HTTP 304 response (usually) does not have a Content-Type HTTP header. Consequently, the "Content" object holding it shouldn't have one either. If the responding server needlessly sends a Content-Type header, it might be used for the "Content" object.

application/http;msgtype=response is the Content-Type of the WARC record holding a revisit record. Since WARC borrows the header format from HTTP, there can be two Content-Type headers in a WARC record: one is part of the WARC header, the other part of the HTTP header.

Of course, in theory a crawler can visit a WARC record and archive it. Then the Content-Type of both the WARC and HTTP header would be the same: application/http;msgtype=response.
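
To illustrate the two header levels, here is a small sketch that reads the Content-Type from the WARC header block and from the HTTP header block of a record. The record strings are hand-written and heavily simplified (real records carry more mandatory WARC fields), and `WarcContentTypeExample` with its methods is a hypothetical helper, not Nutch code.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class WarcContentTypeExample {

  /** Parses the "Name: value" lines of one header block, keyed by lower-cased name. */
  public static Map<String, String> parseHeaders(String block) {
    Map<String, String> headers = new LinkedHashMap<>();
    for (String line : block.split("\r\n")) {
      int colon = line.indexOf(':');
      if (colon > 0) { // skips the version/status line, which has no colon here
        headers.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
            line.substring(colon + 1).trim());
      }
    }
    return headers;
  }

  /** Returns { WARC-level Content-Type, HTTP-level Content-Type }; either may be null. */
  public static String[] contentTypes(String record) {
    // Header blocks are separated by an empty line (CRLF CRLF).
    String[] parts = record.split("\r\n\r\n", 3);
    String warcType = parseHeaders(parts[0]).get("content-type");
    String httpType = parts.length > 1 ? parseHeaders(parts[1]).get("content-type") : null;
    return new String[] { warcType, httpType };
  }

  public static void main(String[] args) {
    // Simplified revisit record wrapping a 304 response without an HTTP Content-Type.
    String revisit = "WARC/1.0\r\n"
        + "WARC-Type: revisit\r\n"
        + "WARC-Target-URI: https://example.com/revisit\r\n"
        + "Content-Type: application/http;msgtype=response\r\n"
        + "\r\n"
        + "HTTP/1.1 304 Not Modified\r\n"
        + "\r\n";
    String[] types = contentTypes(revisit);
    System.out.println("WARC header Content-Type: " + types[0]);
    System.out.println("HTTP header Content-Type: " + types[1]);
  }
}
```

In this sketch the HTTP-level value for the 304 record comes back as `null`, which is the case the test data should model.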

Author

@lfoppiano lfoppiano Mar 4, 2026

Thanks! That's clearer now: the record, the WARC, the index. I removed the content.

Once #49 is in place, we might be able to get much better and more realistic data (and I will learn more from it).

Development

Successfully merging this pull request may close these issues.

Add WARC-Record-ID and WARC-IP-Address to CDX files

2 participants