Output warc-record-id and warc-ip-address in the CDX index by lfoppiano · Pull Request #48 · commoncrawl/nutch

lfoppiano · 2026-02-26T20:20:44Z

lfoppiano · 2026-03-01T06:53:55Z

@sebastian-nagel I ran some manual tests with the local Nutch setup (the same one we used together) and added some simple unit tests (e.g. checking that we correctly remove the ID prefix). The records may still have some imperfections, as they were built manually.

I did not have time to work on the more detailed integration test we discussed, which would deserialize the Java objects from the segments, but I will open an issue for that, as it should benefit more use cases.

sebastian-nagel

Thanks, @lfoppiano!

The change is fine.

But the PR should be kept open for a while, until the related changes mentioned in #39 are also implemented. I'd prefer to put everything into production in one go.

sebastian-nagel · 2026-02-27T07:20:56Z

src/java/org/commoncrawl/util/WarcCdxWriter.java

      data.put("redirect", redirectLocation);
    }
+    if (ip != null) {
+      data.put("ipaddress", ip);


See comment in #39 regarding the naming.

Personally, I would prefer to have the hyphen in it, but, following the discussion, we could save quite some space by leaving it out.

sebastian-nagel · 2026-02-27T07:21:28Z

src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java

 */
 public class TestCommonCrawlDataDumper {

+  @Test


Out of scope.

I cannot access the code. If these are leftover tests, I believe I removed them.

sebastian-nagel · 2026-02-27T07:42:30Z

src/java/org/commoncrawl/util/WarcCdxWriter.java

    long length = (countingOut.getByteCount() - offset);
    writeCdxLine(targetUri, date, offset, length, payloadDigest, content, true,
-        null, null);
+        null, null, recordId.toASCIIString(), ip);


This will include the prefix (URI scheme) urn:uuid:. In the WARC file there are even surrounding brackets: <urn:uuid:...>. So, we have three options:

1. <urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5>
2. urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5
3. 824d10d3-131a-4f67-9cbf-e40ecb5f0fa5

I'd prefer option 3, without decorations. The digest (WARC-Payload-Digest) is indexed without the prefix sha1:, so this would be consistent within the index. It would also save some data volume when the URL index server sends the results uncompressed (most clients do not request compression, i.e. do not send the Accept-Encoding header).

However, the GneissWeb Annotations use option 1.

Needs to be discussed.

Yes, the substring is done inside the method, and we need to confirm which option we implement. For now it's option 3, as I understood that the GneissWeb annotations may be changed.
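For reference, the reduction from options 1 and 2 to option 3 can be sketched as below. This is only an illustration under assumptions: the class `RecordIdSketch` and the method `stripDecorations` are hypothetical names and are not taken from `WarcCdxWriter`, whose actual implementation may differ.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not the actual WarcCdxWriter code.
final class RecordIdSketch {

  private static final String URN_UUID_PREFIX = "urn:uuid:";

  /** Reduces "<urn:uuid:X>" (option 1) or "urn:uuid:X" (option 2) to "X" (option 3). */
  static String stripDecorations(String recordId) {
    String id = recordId.trim();
    // Drop the angle brackets used in the WARC-Record-ID header, if present.
    if (id.startsWith("<") && id.endsWith(">")) {
      id = id.substring(1, id.length() - 1);
    }
    // Drop the URI scheme prefix, if present.
    if (id.startsWith(URN_UUID_PREFIX)) {
      id = id.substring(URN_UUID_PREFIX.length());
    }
    return id;
  }

  public static void main(String[] args) {
    URI recordId = URI.create("urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5");
    // The URI form corresponds to option 2; the bracketed form to option 1.
    System.out.println(stripDecorations(recordId.toASCIIString()));
    System.out.println(stripDecorations("<" + recordId.toASCIIString() + ">"));
  }
}
```

An ID that already has no decorations is returned unchanged, so the helper can be applied regardless of which form the caller passes in.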

sebastian-nagel · 2026-03-03T06:30:12Z

src/test/org/commoncrawl/util/TestWarcCdxWriter.java

+
+    URI targetUri = new URI("https://example.com/revisit");
+    Date date = new Date();
+    Content content = createContent("304", "application/http;msgtype=response");


An HTTP 304 response (usually) does not have a Content-Type HTTP header. Consequently, the "Content" object holding it shouldn't have one either. If the responding server needlessly sends a Content-Type header, then it might be used for the "Content" object.

application/http;msgtype=response is the Content-Type of the WARC record holding a revisit record. Since WARC borrows its header format from HTTP, there can be two Content-Type headers in a WARC record: one is part of the WARC header, the other is part of the HTTP header.

Of course, in theory a crawler can visit a WARC record and archive it. Then the Content-Type of both the WARC and HTTP header would be the same: application/http;msgtype=response.
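To make the two header blocks concrete, here is a small sketch that splits a hand-written, abridged revisit record into its WARC header and its HTTP header and looks up Content-Type in each. The record text is illustrative and omits fields a real record carries; the class and method names are hypothetical and not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; the record below is hand-written and abridged.
final class WarcContentTypeSketch {

  static final String RECORD =
      "WARC/1.1\r\n"
      + "WARC-Type: revisit\r\n"
      + "WARC-Target-URI: https://example.com/revisit\r\n"
      + "Content-Type: application/http;msgtype=response\r\n"
      + "\r\n"
      + "HTTP/1.1 304 Not Modified\r\n"
      + "ETag: \"abc\"\r\n"
      + "\r\n";

  /** Parses the "Name: value" lines of one header block; names are lower-cased. */
  static Map<String, String> parseHeaders(String block) {
    Map<String, String> headers = new LinkedHashMap<>();
    for (String line : block.split("\r\n")) {
      int colon = line.indexOf(':');
      if (colon > 0) { // skips the version and status lines, which have no colon
        headers.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
            line.substring(colon + 1).trim());
      }
    }
    return headers;
  }

  /** Content-Type from the WARC header block (before the first empty line). */
  static String warcContentType(String record) {
    return parseHeaders(record.split("\r\n\r\n", 2)[0]).get("content-type");
  }

  /** Content-Type from the HTTP header block, or null if the server sent none. */
  static String httpContentType(String record) {
    String[] parts = record.split("\r\n\r\n", 2);
    return parts.length < 2 ? null : parseHeaders(parts[1]).get("content-type");
  }

  public static void main(String[] args) {
    System.out.println("WARC: " + warcContentType(RECORD));
    System.out.println("HTTP: " + httpContentType(RECORD));
  }
}
```

For this 304 record only the WARC header carries a Content-Type, and that is the value which should not end up in the "Content" object.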

Thanks! That's clearer now: the record, the WARC, the index. I removed the content.

Once #49 is in place, we might be able to get much better and more realistic data (and I will learn more from it).

lfoppiano mentioned this pull request Feb 26, 2026

Add WARC-Record-ID and WARC-IP-Address to CDX files #39

Open

feat: output warc-record-id and warc-ip-address in the CDX index

c5a4bd6

lfoppiano force-pushed the bugfix/warc-id-warc-ip branch from cec9c62 to c5a4bd6 Compare February 27, 2026 17:06

lfoppiano linked an issue Feb 27, 2026 that may be closed by this pull request

Add WARC-Record-ID and WARC-IP-Address to CDX files #39

Open

test: add unit tests for WarcCdxWriter IP and record ID handling

6cc8c54

lfoppiano marked this pull request as ready for review March 1, 2026 06:53

sebastian-nagel approved these changes Mar 3, 2026

fix: remove content-type from a revisit record

b238f95

lfoppiano wants to merge 3 commits into cc from bugfix/warc-id-warc-ip
