feat: add direct remote access over s3 and https via warcio >= 1.8.0 #25

Merged
handecelikkanat merged 7 commits into main from feat/s3-access-via-warcio1.8
Jun 15, 2026
handecelikkanat commented Apr 9, 2026

Contributor

From https://github.com/commoncrawl/issues/issues/684

This PR adds direct remote access (s3, https) to warc/wet/wat files in S3 buckets, using warcio.

Since 1.8.0, warcio supports direct remote file access over s3 and https: https://github.com/webrecorder/warcio/blob/master/CHANGELIST.rst

This PR adds:

  • An fsspec.open call that replaces the local open call in warcio-iterator.py
  • New make targets:
    • make iterate-remote to access the example whirlwind.warc.gz file in the GitHub repo directly over https
    • make cdxj-remote-https and make cdxj-remote-s3 to index two EoT WARCs over https and s3
    • make extract-remote-https and make extract-remote-s3 to extract records from the two EoT WARCs over https and s3
    • Note: the tutorial still starts with processing local files, as a gentle introduction.
  • New requirement: warcio[s3]>=1.8.0
  • New CI steps: run: make iterate-remote, run: make cdxj-remote-https, and run: make extract-remote-https (the s3 variants are not tested because they require AWS credentials)
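
For readers who want to see roughly what replacing a local open call with fsspec.open amounts to, here is a minimal sketch. It is not the code from this PR. It assumes fsspec and warcio are installed together with the fsspec backend for the scheme in use, and the names BACKENDS, backend_for, and iter_records are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlparse

# Optional packages that fsspec relies on for each remote scheme.
BACKENDS = {"s3": "s3fs", "http": "aiohttp", "https": "aiohttp"}


def backend_for(path):
    """Return the optional package fsspec needs for this path, or None for local files."""
    return BACKENDS.get(urlparse(path).scheme)


def iter_records(path):
    """Yield (record type, target URI) for each record in a local or remote WARC."""
    # Imported lazily so that backend_for works without the optional dependencies.
    import fsspec
    from warcio.archiveiterator import ArchiveIterator

    # fsspec.open selects the filesystem from the URL scheme; plain paths stay local.
    # The stream is opened in binary mode without decompression, because
    # ArchiveIterator handles gzipped records itself.
    with fsspec.open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            yield record.rec_type, record.rec_headers.get_header("WARC-Target-URI")
```

With this shape, a local path, an https:// URL, and an s3:// URL all go through the same iter_records call; only the installed backend and, for s3, the available credentials differ.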

malteos commented Apr 10, 2026


> fsspec_open call from warcio.utils

This seems unnecessary. You could open the remote files directly via fsspec. No need to use the warcio utils.

To illustrate the S3 support of warcio, you could call the warcio CLI directly without the custom python script.

handecelikkanat commented Apr 10, 2026

Contributor Author

> fsspec_open call from warcio.utils
>
> This seems unnecessary. You could open the remote files directly via fsspec. No need to use the warcio utils.

Previously this used a local file open. I'll check fsspec.

> To illustrate the S3 support of warcio, you could call the warcio CLI directly without the custom python script.

I was just thinking that warcio extract should work with remote files as well. I'll modify that task: cdx index extract info -> warcio extract over (local and) remote files.

Any other suggestions? warcio index looks easy to confuse with cdx index to me, because of the "index" label.
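
As a rough illustration of the "warcio extract over (local and) remote files" idea, the sketch below builds and runs a warcio extract command from Python. It assumes the warcio CLI is on the PATH and that, as this thread expects for warcio >= 1.8.0, it accepts remote URLs in place of a filename. The function names warcio_extract_command and extract_record are hypothetical, and any bucket or offset passed in would come from a CDX index lookup, not from this sketch.

```python
import subprocess


def warcio_extract_command(location, offset, payload_only=False):
    """Build the argv for `warcio extract` on a local path or remote URL."""
    command = ["warcio", "extract"]
    if payload_only:
        # --payload prints only the record payload instead of the full record.
        command.append("--payload")
    command += [location, str(offset)]
    return command


def extract_record(location, offset, payload_only=False):
    """Run the command and return the extracted bytes; raises on a non-zero exit."""
    result = subprocess.run(
        warcio_extract_command(location, offset, payload_only),
        check=True,
        capture_output=True,
    )
    return result.stdout
```

Keeping the argv construction separate from the subprocess call makes it easy to print the exact command, which is also what a make target would run.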

handecelikkanat commented Apr 10, 2026

Contributor Author

@malteos Can cdxj-indexer work with remote files (maybe through warcio) now?

handecelikkanat commented Apr 10, 2026

Contributor Author

> @malteos Can cdxj-indexer work with remote files (maybe through warcio) now?

I guess this is not guaranteed. They depend on warcio, but without the s3 extra, and they don't require >= 1.8.0: https://github.com/webrecorder/cdxj-indexer/blob/9ad2b9e1c54d2d20c391050fdb831ca1ee981504/setup.py#L49

I'll continue assuming it needs to work on local files.

EDIT: Greg explained that this can be handled by making the requirement stricter on the whirlwind side ✔️
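
The stricter requirement itself lives in requirements.txt, but the same condition can also be checked at runtime. The sketch below is one hedged way to do that with the standard library only; it is not part of this PR, and the names MINIMUM, parse_release, supports_remote_access, and installed_warcio_version are hypothetical.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum warcio release with remote access, per the PR description.
MINIMUM = (1, 8, 0)


def parse_release(text):
    """Turn '1.8.0' (or '1.8.0rc1') into a tuple of integers, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    release = tuple(int(part) for part in match.group().split("."))
    # Pad short versions such as '1.8' so tuple comparison behaves as expected.
    return release + (0,) * (len(MINIMUM) - len(release))


def supports_remote_access(installed, minimum=MINIMUM):
    """Return True if the given version string is at least the minimum release.

    Pre-release suffixes are ignored, so '1.8.0rc1' counts as 1.8.0.
    """
    return parse_release(installed) >= minimum


def installed_warcio_version():
    """Return the installed warcio version string, or None if it is not installed."""
    try:
        return version("warcio")
    except PackageNotFoundError:
        return None
```

A script could call supports_remote_access(installed_warcio_version()) after checking for None and fall back to local files when the check fails.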

handecelikkanat force-pushed the feat/s3-access-via-warcio1.8 branch from ac6d444 to 7d99aee on April 12, 2026 16:29
handecelikkanat force-pushed the feat/s3-access-via-warcio1.8 branch from bd16c58 to 157eeec on April 13, 2026 09:27
handecelikkanat marked this pull request as ready for review on April 15, 2026 18:28
wumpus commented May 31, 2026

Member

@malteos you should have reviewed this already

malteos left a comment

Just some minor suggestions. Otherwise LGTM!

Comment thread requirements.txt
Comment thread Makefile
Comment thread Makefile
Comment thread Makefile
Comment thread Makefile
Comment thread README.md Outdated
@handecelikkanat

Contributor Author

Thanks for the good recommendations, @malteos. Applied and merging.

handecelikkanat merged commit 57209f8 into main on Jun 15, 2026
10 checks passed
3 participants