-
Couldn't load subscription status.
- Fork 6
Fix overloaded extract; polish CDXT examples
#18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…on into damian/polish
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great - I think the expanded explanation of the cdxt command is a great idea.
|
|
||
| ``` | ||
| look up this capture in the comoncrawl cdx index for CC-MAIN-2024-22, returning only the first match: | ||
| cdxt --limit 1 --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come you've taken out the --from and --to fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some experimentation and discovered that with --limit 1 they weren't needed. when i went through the whirlwind tour myself for the first time these timestamps were the most confusing thing for me, so i thought it better just to drop them from the actual command and then add below a comment saying that you can use them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of from and to is that we want exactly that one record at that date. It's likely it's the only capture of that URL in that crawl, but it still useful to have from and to and maybe an explanation of why it's there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I do in fact have such an explanation, just below:
https://github.com/commoncrawl/whirlwind-python/pull/18/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R382
(line 382 if the link doesn't work)
My rationale for removing the --from and --to args from there is that I found myself confused when I was first going through the whirlwind tour as to where these exact timestamps came from. There seemed to be a chicken/egg problem: if I need the timestamps to use iter where do I get them from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an explanation of why it's there
this is in fact the part I don't quite understand. We're already restricting to a single result with --limit 1 --crawl CC-MAIN-2024-22. Why are --from and --to also specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want 1 result, we want that exact capture. Captures are named by the surtkey and the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change made in b2d310b
| cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj | ||
| cat TEST-000000.extracted.warc.cdxj | ||
| org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "379", "filename": "TEST-000000.extracted.warc.gz"} | ||
| org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "406", "filename": "TEST-000000.extracted.warc.gz"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there a mistake in the offset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i just did some digging, at first i noticed that the command is embedded in the output warc.gz file - but then when i ran it repeatedly it came out with a different number each time? I guess it's because the WARC-Date header (the date the warc was generated) compresses to a different number of bytes every time it's run...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. That offset number is indeed not stable, and shouldn't be part of any test. This deserves a note in the README
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in ef167da
|
I believe that the failing windows test isn't a concern, but @wumpus can confirm. |
@laurieburchell Was due to end-of-life py38 not being properly supported anymore, updated github CI tests to drop py38 (PR), which should solve the issue. I think a rebase is needed on this branch to get updated tests, Ill do. |
4596d07 to
16e4596
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my 2 new comments. Thanks.
Signed-off-by: Damian Stewart <[email protected]>
…on into damian/polish
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there are 2 changes:
- re-add from and to
- add an explanation that captures are named by the surtkey and the time
Everything else looks great! Thanks.
warcio extractis run. Other instances of "extract" have been replaced with synonyms or reworded.cdxtusage (Task 6) deserved a bit more space and explanation. I've given it that. (This section would flow better as a notebook rather than a singlemake cdx_toolkitinvocation.)