Skip to content

Conversation

@damian0815
Copy link
Contributor

@damian0815 damian0815 commented Oct 15, 2025

  1. The word "extract" was overloaded, which could be confusing given that we're also working with gzipped data. I limited its use to just those sections where the literal command warcio extract is run. Other instances of "extract" have been replaced with synonyms or reworded.
  2. I felt like the cdxt usage (Task 6) deserved a bit more space and explanation. I've given it that. (This section would flow better as a notebook rather than a single make cdx_toolkit invocation.)

@wumpus wumpus marked this pull request as draft October 15, 2025 16:22
@handecelikkanat handecelikkanat marked this pull request as ready for review October 15, 2025 17:59
Copy link
Member

@laurieburchell laurieburchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great - I think the expanded explanation of the cdxt command is a great idea.


```
look up this capture in the comoncrawl cdx index for CC-MAIN-2024-22, returning only the first match:
cdxt --limit 1 --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come you've taken out the --from and --to fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some experimentation and discovered that with --limit 1 they weren't needed. when i went through the whirlwind tour myself for the first time these timestamps were the most confusing thing for me, so i thought it better just to drop them from the actual command and then add below a comment saying that you can use them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of from and to is that we want exactly that one record at that date. It's likely it's the only capture of that URL in that crawl, but it still useful to have from and to and maybe an explanation of why it's there.

Copy link
Contributor Author

@damian0815 damian0815 Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I do in fact have such an explanation, just below:
https://github.com/commoncrawl/whirlwind-python/pull/18/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R382
(line 382 if the link doesn't work)

My rationale for removing the --from and --to args from there is that I found myself confused when I was first going through the whirlwind tour as to where these exact timestamps came from. There seemed to be a chicken/egg problem: if I need the timestamps to use iter where do I get them from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an explanation of why it's there

this is in fact the part I don't quite understand. We're already restricting to a single result with --limit 1 --crawl CC-MAIN-2024-22. Why are --from and --to also specified?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want 1 result, we want that exact capture. Captures are named by the surtkey and the time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change made in b2d310b

cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
cat TEST-000000.extracted.warc.cdxj
org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "379", "filename": "TEST-000000.extracted.warc.gz"}
org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "406", "filename": "TEST-000000.extracted.warc.gz"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a mistake in the offset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i just did some digging, at first i noticed that the command is embedded in the output warc.gz file - but then when i ran it repeatedly it came out with a different number each time? I guess it's because the WARC-Date header (the date the warc was generated) compresses to a different number of bytes every time it's run...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. That offset number is indeed not stable, and shouldn't be part of any test. This deserves a note in the README

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ef167da

@laurieburchell
Copy link
Member

I believe that the failing windows test isn't a concern, but @wumpus can confirm.

@handecelikkanat
Copy link
Contributor

I believe that the failing windows test isn't a concern, but @wumpus can confirm.

@laurieburchell Was due to end-of-life py38 not being properly supported anymore, updated github CI tests to drop py38 (PR), which should solve the issue. I think a rebase is needed on this branch to get updated tests, Ill do.

@wumpus wumpus removed their request for review October 17, 2025 03:41
Copy link
Member

@wumpus wumpus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my 2 new comments. Thanks.

@damian0815 damian0815 requested a review from wumpus October 20, 2025 09:08
Copy link
Member

@wumpus wumpus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are 2 changes:

  • re-add from and to
  • add an explanation that captures are named by the surtkey and the time

Everything else looks great! Thanks.

@damian0815 damian0815 requested a review from wumpus October 27, 2025 10:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants