
Commit eb617b0

feat: add direct remote access over s3 and https via warcio >= 1.8.0
1 parent 08dcfbe commit eb617b0

4 files changed

Lines changed: 75 additions & 2 deletions


Makefile

Lines changed: 27 additions & 0 deletions
@@ -22,6 +22,33 @@ iterate:
 	python ./warcio-iterator.py whirlwind.warc.wat.gz
 	@echo
 
+#FIXME: Update s3 locations if moved to public bucket:
+iterate-remote-s3:
+	@echo iterating over remote warcs over s3:
+	@echo
+	@echo warc:
+	python ./warcio-iterator.py s3://commoncrawl-dev/whirlwind-example-files/whirlwind.warc.gz
+	@echo
+	@echo wet:
+	python ./warcio-iterator.py s3://commoncrawl-dev/whirlwind-example-files/whirlwind.warc.wet.gz
+	@echo
+	@echo wat:
+	python ./warcio-iterator.py s3://commoncrawl-dev/whirlwind-example-files/whirlwind.warc.wat.gz
+
+
+#FIXME: We need the example files on public s3 bucket for this:
+#iterate-remote-https:
+# @echo iterating over remote warcs over https:
+# @echo
+# @echo warc:
+# python ./warcio-iterator.py https://data.commoncrawl.org/<HYPOTHETICAL-PREFIX>/whirlwind.warc.gz
+# @echo
+# @echo wet:
+# python ./warcio-iterator.py https://data.commoncrawl.org/<HYPOTHETICAL-PREFIX>/whirlwind.warc.wet.gz
+# @echo
+# @echo wat:
+# python ./warcio-iterator.py https://data.commoncrawl.org/<HYPOTHETICAL-PREFIX>/whirlwind.warc.wat.gz
+
 cdxj:
 	@echo "creating *.cdxj index files from the local warcs"
 	cdxj-indexer whirlwind.warc.gz > whirlwind.warc.cdxj

README.md

Lines changed: 45 additions & 0 deletions
@@ -174,6 +174,51 @@ python ./warcio-iterator.py whirlwind.warc.wat.gz
 
 The output has three sections, one each for the WARC, WET, and WAT. For each one, it prints the record types we saw before, plus the `WARC-Target-URI` for those record types that have it.
 
+### Task 2-i: Iterating over "Remote" Files
+So far we've been working with small local WARC files. But Common Crawl's real WARC files live on AWS S3. Since warcio 1.8, you can iterate over remote files exactly the same way as local ones, with no download step required. We can do this over HTTPS or S3.
+
+If you have AWS credentials configured, you can stream directly from S3, which is faster if you're running on AWS. Although the S3 bucket is public, S3 access still requires AWS credentials.
+
+`make iterate-remote-s3`
+
+<details>
+<summary>Click to view output</summary>
+```
+iterating over remote warcs over s3:
+
+warc:
+python ./warcio-iterator.py s3://commoncrawl-dev/whirlwind-example-files/whirlwind.warc.gz
+WARC-Type: warcinfo
+WARC-Type: request
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+WARC-Type: response
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+WARC-Type: metadata
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+
+wet:
+python ./warcio-iterator.py s3://commoncrawl-dev/whirlwind-example-files/whirlwind.warc.wet.gz
+WARC-Type: warcinfo
+WARC-Type: conversion
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+
+wat:
+python ./warcio-iterator.py s3://commoncrawl-dev/whirlwind-example-files/whirlwind.warc.wat.gz
+WARC-Type: warcinfo
+WARC-Type: metadata
+WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
+```
+</details>
+
+
+If you don't have credentials configured, the HTTPS version works without any authentication.
+
+`make iterate-remote-https`
+
+<details>
+<summary>Click to view output</summary>
+</details>
+
 ## Task 3: Index the WARC, WET, and WAT
 
 The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.
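
Note on the README hunk: the new section distinguishes local paths, `s3://` URIs (which need AWS credentials), and `https://` URLs (which do not). The sketch below shows one way a caller might tell these apart before opening a file, for example to warn early about missing credentials. It uses only the standard library; `classify_source` and `needs_aws_credentials` are hypothetical helper names that are not part of warcio or this commit, and this is not a description of how warcio resolves paths internally.

```python
from urllib.parse import urlparse

# Schemes this sketch treats as remote; anything else is assumed to be a local path.
REMOTE_SCHEMES = {"s3", "http", "https"}


def classify_source(path: str) -> str:
    """Return 's3', 'http', 'https', or 'local' for a WARC path or URI."""
    scheme = urlparse(path).scheme.lower()
    return scheme if scheme in REMOTE_SCHEMES else "local"


def needs_aws_credentials(path: str) -> bool:
    """Following the README text, only s3:// access is assumed to need AWS credentials."""
    return classify_source(path) == "s3"


if __name__ == "__main__":
    for example in (
        "whirlwind.warc.gz",
        "s3://commoncrawl-dev/whirlwind-example-files/whirlwind.warc.gz",
    ):
        print(example, classify_source(example), needs_aws_credentials(example))
```

Treating unknown schemes as local keeps Windows drive letters such as `C:\data\file.warc.gz` from being mistaken for a remote scheme.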

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-warcio
+warcio[s3]>=1.8.0
 cdx_toolkit
 duckdb
 pyarrow

warcio-iterator.py

Lines changed: 2 additions & 1 deletion
@@ -2,10 +2,11 @@
 
 import sys
 
+from warcio.utils import fsspec_open
 from warcio.archiveiterator import ArchiveIterator
 
 for file in sys.argv[1:]:
-    with open(file, 'rb') as stream:
+    with fsspec_open(file, 'rb') as stream:
         for record in ArchiveIterator(stream):
             print(' ', 'WARC-Type:', record.rec_type)
             if record.rec_type in {'request', 'response', 'conversion', 'metadata'}:
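
Note on the `warcio-iterator.py` hunk: the only functional change is replacing the built-in `open` with `fsspec_open`, which this commit imports from `warcio.utils` and which requires `warcio[s3]>=1.8.0` according to the updated `requirements.txt`. A minimal sketch of how a script could degrade gracefully when that version is not installed follows. The fallback function and its error message are assumptions for illustration and are not part of the commit.

```python
try:
    # Import path taken from this commit; requires warcio >= 1.8.0.
    from warcio.utils import fsspec_open as open_source
except ImportError:
    # Hypothetical fallback: without a recent warcio, only local files can be read.
    def open_source(path, mode="rb"):
        if "://" in path:
            raise RuntimeError(
                f"remote path {path!r} needs warcio[s3]>=1.8.0 to be installed"
            )
        return open(path, mode)


def read_prefix(path, size=8):
    """Read the first `size` bytes of a source using whichever opener is available."""
    with open_source(path, "rb") as stream:
        return stream.read(size)
```

With this in place, the loop in the script could call `open_source(file, 'rb')` and still work on local files in an environment that has not yet been upgraded.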
