Conversation
…on into damian/polish
laurieburchell
left a comment
There was a problem hiding this comment.
This is great - I think the expanded explanation of the cdxt command is a great idea.
|
|
||
| ``` | ||
| look up this capture in the comoncrawl cdx index for CC-MAIN-2024-22, returning only the first match: | ||
| cdxt --limit 1 --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete |
There was a problem hiding this comment.
How come you've taken out the --from and --to fields?
There was a problem hiding this comment.
I did some experimentation and discovered that with --limit 1 they weren't needed. when i went through the whirlwind tour myself for the first time these timestamps were the most confusing thing for me, so i thought it better just to drop them from the actual command and then add below a comment saying that you can use them.
There was a problem hiding this comment.
The point of from and to is that we want exactly that one record at that date. It's likely it's the only capture of that URL in that crawl, but it still useful to have from and to and maybe an explanation of why it's there.
There was a problem hiding this comment.
👍 I do in fact have such an explanation, just below:
https://github.com/commoncrawl/whirlwind-python/pull/18/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R382
(line 382 if the link doesn't work)
My rationale for removing the --from and --to args from there is that I found myself confused when I was first going through the whirlwind tour as to where these exact timestamps came from. There seemed to be a chicken/egg problem: if I need the timestamps to use iter where do I get them from?
There was a problem hiding this comment.
an explanation of why it's there
this is in fact the part I don't quite understand. We're already restricting to a single result with --limit 1 --crawl CC-MAIN-2024-22. Why are --from and --to also specified?
There was a problem hiding this comment.
We don't want 1 result, we want that exact capture. Captures are named by the surtkey and the time.
| cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj | ||
| cat TEST-000000.extracted.warc.cdxj | ||
| org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "379", "filename": "TEST-000000.extracted.warc.gz"} | ||
| org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "406", "filename": "TEST-000000.extracted.warc.gz"} |
There was a problem hiding this comment.
Was there a mistake in the offset?
There was a problem hiding this comment.
i just did some digging, at first i noticed that the command is embedded in the output warc.gz file - but then when i ran it repeatedly it came out with a different number each time? I guess it's because the WARC-Date header (the date the warc was generated) compresses to a different number of bytes every time it's run...
There was a problem hiding this comment.
Ah. That offset number is indeed not stable, and shouldn't be part of any test. This deserves a note in the README
|
I believe that the failing windows test isn't a concern, but @wumpus can confirm. |
@laurieburchell Was due to end-of-life py38 not being properly supported anymore, updated github CI tests to drop py38 (PR), which should solve the issue. I think a rebase is needed on this branch to get updated tests, Ill do. |
4596d07 to
16e4596
Compare
wumpus
left a comment
There was a problem hiding this comment.
See my 2 new comments. Thanks.
Signed-off-by: Damian Stewart <ot@damianstewart.com>
…on into damian/polish
wumpus
left a comment
There was a problem hiding this comment.
So there are 2 changes:
- re-add from and to
- add an explanation that captures are named by the surtkey and the time
Everything else looks great! Thanks.
|
I think Damian requested a new review from Greg before merging, correct? Shall we go ahead and merge Greg? (This was about the HAI time so I remember me recommending Damian to give Greg time to come back.) |
warcio extractis run. Other instances of "extract" have been replaced with synonyms or reworded.cdxtusage (Task 6) deserved a bit more space and explanation. I've given it that. (This section would flow better as a notebook rather than a singlemake cdx_toolkitinvocation.)