Output warc-record-id and warc-ip-address in the CDX index#48
@sebastian-nagel I did some manual tests with the local Nutch setup (the same one we used together) and added some simple unit tests (e.g. checking that we correctly remove the ID prefix). The records may still have some imperfections, as they were built manually. I did not have time to work on the more detailed integration test we discussed, which would deserialize the Java objects from the segments, but I will open an issue for that, since it should benefit more use cases.
sebastian-nagel
left a comment
Thanks, @lfoppiano!
The change is fine.
But the PR should be kept open for a while, until the related changes mentioned in #39 are also implemented. I'd prefer to put everything into production in one go.
    data.put("redirect", redirectLocation);
  }
  if (ip != null) {
    data.put("ipaddress", ip);
See comment in #39 regarding the naming.
Personally, I would prefer to have the hyphen in it, but, as discussed, we could save quite some space by omitting it.
 */
public class TestCommonCrawlDataDumper {

  @Test
I cannot access the code. If these are leftover tests, I believe I removed them.
  long length = (countingOut.getByteCount() - offset);
  writeCdxLine(targetUri, date, offset, length, payloadDigest, content, true,
-     null, null);
+     null, null, recordId.toASCIIString(), ip);
This will include the prefix (URI scheme) urn:uuid:. In the WARC file there are even surrounding brackets: <urn:uuid:...>. So, we have three options:
1. <urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5>
2. urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5
3. 824d10d3-131a-4f67-9cbf-e40ecb5f0fa5
I'd prefer option 3, without decorations. The digest (WARC-Payload-Digest) is indexed without the prefix sha1:, so the index would then be consistent. It would also save some data volume if the URL index server sends the results uncompressed (most clients do not request compression, i.e. do not send the Accept-Encoding header).
However, the GneissWeb Annotations use option 1.
Needs to be discussed.
Yes, the substring is done inside the method, and we need to confirm which option we implement. For now it's option 3, as I understood that the GneissWeb annotations may be changed.
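For illustration only, the stripping discussed here could look roughly like the sketch below. It is not the code in this PR: the class name `RecordIdSketch` and the method name `stripDecorations` are hypothetical, and the sketch assumes the record ID arrives either as a bare `urn:uuid:` URI string or wrapped in angle brackets as in the WARC header.

```java
import java.net.URI;

// Hypothetical helper, not taken from the PR: reduces a WARC record ID
// to the bare UUID (option 3 in the discussion above).
final class RecordIdSketch {
    private static final String URN_UUID_PREFIX = "urn:uuid:";

    static String stripDecorations(String recordId) {
        String id = recordId.trim();
        // Option 1 form: remove the surrounding angle brackets, if present.
        if (id.startsWith("<") && id.endsWith(">")) {
            id = id.substring(1, id.length() - 1);
        }
        // Option 2 form: remove the URI scheme prefix, if present.
        if (id.startsWith(URN_UUID_PREFIX)) {
            id = id.substring(URN_UUID_PREFIX.length());
        }
        return id;
    }

    public static void main(String[] args) {
        URI recordId = URI.create("urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5");
        System.out.println(stripDecorations(recordId.toASCIIString()));
        System.out.println(stripDecorations("<urn:uuid:824d10d3-131a-4f67-9cbf-e40ecb5f0fa5>"));
    }
}
```

Both calls in `main` reduce the ID to the same bare UUID, so the index value would not depend on which decorated form the caller passes in.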
URI targetUri = new URI("https://example.com/revisit");
Date date = new Date();
Content content = createContent("304", "application/http;msgtype=response");
An HTTP 304 response (usually) does not have a Content-Type HTTP header. Consequently, the "Content" object holding it shouldn't have one either. If the responding server needlessly sends a Content-Type header, it might be used for the "Content" object.
application/http;msgtype=response is the Content-Type of the WARC record holding a revisit record. Since WARC borrows the header format from HTTP, there can be two Content-Type headers in a WARC record: one is part of the WARC header, the other part of the HTTP header.
Of course, in theory a crawler can visit a WARC record and archive it. Then the Content-Type of both the WARC and HTTP header would be the same: application/http;msgtype=response.
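To make the distinction between the two header blocks concrete, here is a minimal sketch. The header text is made up for illustration (the ETag value and the class name `WarcContentTypeSketch` are assumptions, not taken from the PR or from real crawl data), and the parser is deliberately naive: it only splits `Name: value` lines.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: the WARC header block and the HTTP header block
// of one revisit record are parsed separately, each with its own Content-Type.
final class WarcContentTypeSketch {

    static Map<String, String> parseHeaders(String block) {
        Map<String, String> headers = new LinkedHashMap<>();
        String[] lines = block.split("\r\n");
        // Line 0 is the version line ("WARC/1.0") or the HTTP status line; skip it.
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(name, lines[i].substring(colon + 1).trim());
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        // Made-up WARC header block of a revisit record.
        String warcHeader = "WARC/1.0\r\n"
                + "WARC-Type: revisit\r\n"
                + "WARC-Target-URI: https://example.com/revisit\r\n"
                + "Content-Type: application/http;msgtype=response\r\n";
        // Made-up HTTP header block of the archived 304 response: no Content-Type.
        String httpHeader = "HTTP/1.1 304 Not Modified\r\n"
                + "ETag: \"abc\"\r\n";

        System.out.println("WARC Content-Type: " + parseHeaders(warcHeader).get("content-type"));
        System.out.println("HTTP Content-Type: " + parseHeaders(httpHeader).get("content-type"));
    }
}
```

In this sketch only the WARC block yields a Content-Type; the lookup on the HTTP block returns null, which is the situation the "Content" object in the test should mirror.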
Thanks! That's clearer now: the record, the WARC, the index. I removed the content.
Once #49 is in place, we might be able to get much better and more realistic data (and I will learn more from it).
Ref #39