feat: add direct remote access over s3 and https via warcio >= 1.8.0 #25
handecelikkanat wants to merge 4 commits into main
Conversation
This seems unnecessary. You could open the remote files directly via fsspec, with no need for the warcio utils. To illustrate the S3 support of warcio, you could call the warcio CLI directly without the custom Python script.
Previously this used a local file open. I'll check fsspec.
I was now thinking that Any other suggestions?
@malteos Can
EDIT: Greg explained that this can be handled by making the requirement stricter on the whirlwind side ✔️
ac6d444 to 7d99aee
From https://github.com/commoncrawl/issues/issues/684
This PR adds direct remote access (s3, https) to warc/wet/wat files in S3 buckets, using warcio.
Since 1.8.0, warcio supports direct remote file access over s3 and https: https://github.com/webrecorder/warcio/blob/master/CHANGELIST.rst
This PR adds:
- `fsspec.open` call to replace the local open call in `warcio-iterator.py`
- `make iterate-remote` to remotely access the example whirlwind.warc.gz file in the GitHub repo directly over https
- `make cdxj-remote` to index two EoT WARCs over https and s3 (s3 doesn't require AWS credentials)
- `make extract-remote` to extract records from the two EoT WARCs over https and s3
- `warcio[s3]>=1.8.0`

Note:
I still keep the targets to iterate over (`make iterate`), index (`make cdxj`), and extract records (`make extract`) from local files (to be cloned from the GitHub repo). I think this is a gentle start, and it might invite people to inspect the files more closely on their local machine.