feat: add direct remote access over s3 and https via warcio >= 1.8.0#25

Draft
handecelikkanat wants to merge 4 commits into main from feat/s3-access-via-warcio1.8

@handecelikkanat
Contributor

@handecelikkanat handecelikkanat commented Apr 9, 2026

From https://github.com/commoncrawl/issues/issues/684

This PR adds direct remote access (s3, https) to WARC/WET/WAT files, including files in S3 buckets, using warcio.

Since 1.8.0, warcio supports direct remote file access over s3 and https: https://github.com/webrecorder/warcio/blob/master/CHANGELIST.rst

This PR adds:

  • fsspec.open call to replace the local open call in warcio-iterator.py
  • New make targets:
    • make iterate-remote to access the example whirlwind.warc.gz file in the GitHub repo directly over https
    • make cdxj-remote to index two EoT WARCs over https and s3 (s3 access doesn't require AWS credentials)
    • make extract-remote to extract records from the two EoT WARCs over https and s3
  • New requirement warcio[s3]>=1.8.0

Note:
I still keep the tasks/targets to iterate over (make iterate), index (make cdxj), and extract records from (make extract) local files (to be cloned from the GitHub repo).
I think this is a gentle start.
It might also encourage people to inspect the files more closely on their own machines.

@malteos

malteos commented Apr 10, 2026

> fsspec_open call from warcio.utils

This seems unnecessary. You could open the remote files directly via fsspec. No need to use the warcio utils.

To illustrate the S3 support of warcio, you could call the warcio CLI directly without the custom python script.
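As a rough illustration of opening remote files directly via fsspec, here is a minimal sketch. It is not the code in this PR: the helper names `storage_options_for` and `iter_warc_records` are hypothetical, and it assumes that `fsspec`, `warcio`, and the relevant fsspec backends (for example `s3fs` for s3 URLs) are installed, and that the S3 bucket allows anonymous reads.

```python
from urllib.parse import urlparse


def storage_options_for(url):
    """Return fsspec storage options for a URL or local path.

    Assumption: s3:// URLs point at a public bucket, so requests are
    sent unsigned (the s3fs ``anon`` option) and need no AWS credentials.
    """
    if urlparse(url).scheme == "s3":
        return {"anon": True}
    return {}


def iter_warc_records(url):
    """Yield (record type, target URI) for each record in a WARC/WET/WAT file."""
    # Imported lazily because fsspec and warcio are third-party dependencies.
    import fsspec
    from warcio.archiveiterator import ArchiveIterator

    # fsspec.open accepts local paths as well as https:// and s3:// URLs.
    # Compression is left to warcio, which handles gzip members itself.
    with fsspec.open(url, "rb", **storage_options_for(url)) as stream:
        for record in ArchiveIterator(stream):
            yield record.rec_type, record.rec_headers.get_header("WARC-Target-URI")
```

Calling `iter_warc_records` with a local path such as `whirlwind.warc.gz` or with an https or s3 URL should behave the same way, since only the storage options differ.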

@handecelikkanat
Contributor Author

handecelikkanat commented Apr 10, 2026

> fsspec_open call from warcio.utils
>
> This seems unnecessary. You could open the remote files directly via fsspec. No need to use the warcio utils.

Previously this used a local file open; I'll check fsspec.

> To illustrate the S3 support of warcio, you could call the warcio CLI directly without the custom python script.

I'm now thinking that warcio extract should work with remote files as well. I'll modify that task: cdx index extract info -> warcio extract over (local and) remote files.
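To make the "index info -> extract" step concrete, here is a hedged Python sketch of the same idea done with fsspec instead of the warcio CLI. The helper names `parse_cdxj_line` and `read_record` are hypothetical, and the sketch assumes CDXJ lines of the form `key timestamp {json}` whose JSON carries `filename`, `offset`, and `length` fields, and that the remote server supports ranged reads.

```python
import io
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (filename, offset, length).

    Assumption: the line looks like ``<key> <timestamp> <json>`` and the
    JSON block stores offset and length as numeric strings.
    """
    _key, _timestamp, payload = line.strip().split(" ", 2)
    fields = json.loads(payload)
    return fields["filename"], int(fields["offset"]), int(fields["length"])


def read_record(url, offset, length, **storage_options):
    """Fetch a single record from a local or remote WARC by byte range."""
    # Imported lazily because fsspec and warcio are third-party dependencies.
    import fsspec
    from warcio.archiveiterator import ArchiveIterator

    # Read only the bytes of the requested record.
    with fsspec.open(url, "rb", **storage_options) as stream:
        stream.seek(offset)
        raw = stream.read(length)

    # Parse the isolated gzip member as a one-record archive.
    record = next(iter(ArchiveIterator(io.BytesIO(raw))))
    return record.rec_type, record.content_stream().read()
```

Reading only `length` bytes from `offset` keeps the transfer small, which matters more for https and s3 than for local files.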

Any other suggestions? To me, warcio index could easily be confused with the cdx index because of the "index" label.

@handecelikkanat
Contributor Author

handecelikkanat commented Apr 10, 2026

@malteos Can cdxj-indexer work with remote files (maybe through warcio) now?

@handecelikkanat
Contributor Author

handecelikkanat commented Apr 10, 2026

> @malteos Can cdxj-indexer work with remote files (maybe through warcio) now?

I guess this is not guaranteed. I see that they include warcio but not s3, and don't force >= 1.8.0: https://github.com/webrecorder/cdxj-indexer/blob/9ad2b9e1c54d2d20c391050fdb831ca1ee981504/setup.py#L49

I'll continue assuming it needs to work on local files.

EDIT: Greg explained that this can be handled by making the requirement stricter on the whirlwind side ✔️

@handecelikkanat handecelikkanat force-pushed the feat/s3-access-via-warcio1.8 branch from ac6d444 to 7d99aee Compare April 12, 2026 16:29