The output has three sections, one each for the WARC, WET, and WAT. For each one, it prints the record types we saw before, plus the `WARC-Target-URI` for those record types that have it.

### Task 2-i: Iterating over "Remote" Files
So far we've been working with small local WARC files. But Common Crawl's real WARC files live on AWS S3. Since warcio 1.8, you can iterate over remote files exactly the same way as local ones, with no download step required. We can do this over HTTPS or S3.

If you have AWS credentials configured, you can stream directly from S3, which is faster if you're running on AWS. Although the S3 bucket is public, S3 access still requires AWS credentials.
If you don't have credentials configured, the HTTPS version works without any authentication.
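
As a rough sketch of what the HTTPS route can look like in code, the snippet below maps an `s3://commoncrawl/...` path to its HTTPS equivalent and streams records with the long-standing `requests` plus `ArchiveIterator` pattern. It is an illustration under stated assumptions, not necessarily what `make iterate-remote-https` runs: the helper names `s3_to_https` and `iter_remote_records` are hypothetical, the example path is a placeholder, and the newer warcio remote-file support mentioned above is not used here.

```python
from urllib.parse import urlparse

# Public HTTPS front end for the Common Crawl bucket.
HTTPS_PREFIX = "https://data.commoncrawl.org/"
BUCKET = "commoncrawl"


def s3_to_https(s3_uri):
    """Turn s3://commoncrawl/<key> into the matching HTTPS URL."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or parsed.netloc != BUCKET:
        raise ValueError(f"not a Common Crawl S3 URI: {s3_uri}")
    return HTTPS_PREFIX + parsed.path.lstrip("/")


def iter_remote_records(url):
    """Yield (record type, target URI) pairs while streaming over HTTPS."""
    # Third-party imports are kept local so the URL helper works without them.
    import requests
    from warcio.archiveiterator import ArchiveIterator

    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        # resp.raw is a file-like object, so records are parsed as bytes arrive.
        for record in ArchiveIterator(resp.raw):
            yield record.rec_type, record.rec_headers.get_header("WARC-Target-URI")


# Placeholder key, not a real file; substitute a path from a crawl's path listing.
example_url = s3_to_https("s3://commoncrawl/crawl-data/EXAMPLE/example.warc.gz")
```

Calling `iter_remote_records(example_url)` with a real path would stream the file without saving it to disk first.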
`make iterate-remote-https`

<details>
<summary>Click to view output</summary>
</details>

## Task 3: Index the WARC, WET, and WAT

The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.
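
To make the idea of random access concrete, here is a toy sketch using only the standard library. It assumes, as is the case for Common Crawl's WARC files, that each record is compressed as its own gzip member, and that an index stores a byte offset and length for every record. The function names and the in-memory "archive" are illustrative and not part of any real tool.

```python
import gzip
import io


def build_archive(payloads):
    """Concatenate individually gzipped payloads and note where each one sits."""
    buf = io.BytesIO()
    index = []
    for payload in payloads:
        member = gzip.compress(payload)
        index.append((buf.tell(), len(member)))  # (offset, length) of this member
        buf.write(member)
    return buf.getvalue(), index


def read_record(archive, offset, length):
    """Decompress one record without touching the rest of the archive."""
    return gzip.decompress(archive[offset:offset + length])


def range_header(offset, length):
    """HTTP Range header covering exactly one record (end byte is inclusive)."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


archive, index = build_archive([b"record one", b"record two", b"record three"])
offset, length = index[1]
second = read_record(archive, offset, length)
```

With a remote file, the same offset and length would go into an HTTP `Range` request, so only that one record is transferred instead of the whole gigabyte.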