Commit 7d99aee

fix(README.md): remove remote access details from Task 2 (to be moved to Task 3)
1 parent ab23a9c commit 7d99aee

1 file changed

Lines changed: 8 additions & 30 deletions

README.md

@@ -174,50 +174,28 @@ python ./warcio-iterator.py whirlwind.warc.wat.gz
 
 The output has three sections, one each for the WARC, WET, and WAT. For each one, it prints the record types we saw before, plus the `WARC-Target-URI` for those record types that have it.
 
-### Task 2-i: Iterating over "Remote" Files
-So far we've been working with small local WARC files. But Common Crawl's real WARC files live on AWS S3. Since warcio 1.8, you can iterate over remote files exactly the same way as local ones — no download step required. We can do this over HTTPS or S3.
+warcio also supports working on remote files, so let us try the same command on the remote version of the same WARC file we just iterated locally. We will reach this remote file from the Github repository for this tutorial:
 
-If you have AWS credentials configured, you can stream directly from S3, which is faster if you're running on AWS. Although the S3 bucket is public, but S3 access still requires AWS credentials.
-
-`make iterate-remote-s3`
+`make iterate-remote`
 
+<details>
+<summary>Click to view code</summary>
+python ./warcio-iterator.py https://raw.githubusercontent.com/commoncrawl/whirlwind-python/refs/heads/main/whirlwind.warc.gz
+</details>
+The output should be identical to what you saw from the local file:
 <details>
 <summary>Click to view output</summary>
-```
-iterating over remote warcs over s3:
-
-warc:
-python ./warcio-iterator.py s3://commoncrawl-dev/whirlwind-example-files/whirlwind.warc.gz
 WARC-Type: warcinfo
 WARC-Type: request
 WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
 WARC-Type: response
 WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
 WARC-Type: metadata
 WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
-
-wet:
-python ./warcio-iterator.py s3://commoncrawl-dev/whirlwind-example-files/whirlwind.warc.wet.gz
-WARC-Type: warcinfo
-WARC-Type: conversion
-WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
-
-wat:
-python ./warcio-iterator.py s3://commoncrawl-dev/whirlwind-example-files/whirlwind.warc.wat.gz
-WARC-Type: warcinfo
-WARC-Type: metadata
-WARC-Target-URI https://an.wikipedia.org/wiki/Escopete
-```
 </details>
 
+We get the same output, but this time by streaming the file over HTTPS instead of reading from local disk. Later in the tour, we will use this capability to index and extract records from remote WARC files hosted on AWS S3 buckets.
 
-If you don't have credentials configured, the HTTPS version works without any authentication.
-
-`make iterate-remote-https`
-
-<details>
-<summary>Click to view output</summary>
-</details>
 
 ## Task 3: Index the WARC, WET, and WAT
 
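
Note on the change: the new README text says the remote iteration streams the file over HTTPS rather than reading it from local disk. The sketch below illustrates that idea using only the Python standard library. It is not the tutorial's `warcio-iterator.py` and does not use warcio; the names `iter_warc_headers`, `open_raw`, and `main` are hypothetical, and the header parsing is deliberately minimal (it assumes well-formed records that carry a `Content-Length` header).

```python
import gzip
import sys
import urllib.request

# Record types for which the README output shows a WARC-Target-URI line.
TYPES_WITH_URI = {"request", "response", "conversion", "metadata"}


def iter_warc_headers(stream):
    """Yield a dict of WARC headers for each record in an uncompressed stream.

    Hypothetical helper: reads the header block, then skips the record body
    using Content-Length, so the stream is only ever read forwards.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        remaining = int(headers.get("Content-Length", 0))
        while remaining > 0:  # skip the body in bounded chunks
            chunk = stream.read(min(remaining, 65536))
            if not chunk:
                break
            remaining -= len(chunk)
        yield headers


def open_raw(location):
    """Open a local path or an http(s) URL as a binary file-like object."""
    if location.startswith(("http://", "https://")):
        return urllib.request.urlopen(location)
    return open(location, "rb")


def main(locations):
    for location in locations:
        with open_raw(location) as raw:
            # GzipFile decompresses on the fly and handles multi-member gzip,
            # which is how .warc.gz files are normally written.
            stream = gzip.GzipFile(fileobj=raw) if location.endswith(".gz") else raw
            for headers in iter_warc_headers(stream):
                rec_type = headers.get("WARC-Type")
                print("WARC-Type:", rec_type)
                if rec_type in TYPES_WITH_URI and "WARC-Target-URI" in headers:
                    print("WARC-Target-URI", headers["WARC-Target-URI"])


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a hypothetical name such as `stream-sketch.py`, it could be run with either the local `whirlwind.warc.gz` or the `raw.githubusercontent.com` URL from the added README line as its argument. Because the parser only reads forwards, the same loop serves both cases, which is the property the new paragraph relies on.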
