Commit 32ba031

fix: divide remote into remote-https and remote-s3. add ci tests only for https.
1 parent 157eeec commit 32ba031

3 files changed

Lines changed: 54 additions & 21 deletions

.github/workflows/ci.yaml

Lines changed: 4 additions & 4 deletions
@@ -63,14 +63,14 @@ jobs:
       - name: make cdxj
         run: make cdxj
 
-      - name: make cdxj-remote
-        run: make cdxj-remote
+      - name: make cdxj-remote-https
+        run: make cdxj-remote-https
 
       - name: make extract
         run: make extract
 
-      - name: make extract-remote
-        run: make extract-remote
+      - name: make extract-remote-https
+        run: make extract-remote-https
 
       - name: make cdx_toolkit
         run: make cdx_toolkit

Makefile

Lines changed: 19 additions & 3 deletions
@@ -38,10 +38,18 @@ cdxj:
 	cdxj-indexer --records conversion whirlwind.warc.wet.gz > whirlwind.warc.wet.cdxj
 	cdxj-indexer whirlwind.warc.wat.gz > whirlwind.warc.wat.cdxj
 
-cdxj-remote:
+cdxj-remote-https:
 	@echo "indexing End-of-Term-2024 Internet Archive WARC over HTTPS (File size ~1GB, showing first 10 records):"
 	cdxj-indexer $(EOT_IA_WARC_HTTPS) 2>/dev/null | head -n 10 | tee eot-ia.cdxj
 	@echo
+	@echo "indexing End-of-Term-2024 Common Crawl repackage WARC over HTTPS (File size ~1GB, showing first 10 records):"
+	cdxj-indexer $(EOT_CC_WARC_HTTPS) 2>/dev/null | head -n 10 | tee eot-cc.cdxj
+
+cdxj-remote-s3:
+	@echo "!! this step requires authentication via S3 credentials (even though it is free)"
+	@echo "indexing End-of-Term-2024 Internet Archive WARC over S3 (File size ~1GB, showing first 10 records):"
+	cdxj-indexer $(EOT_IA_WARC_S3) 2>/dev/null | head -n 10 | tee eot-ia.cdxj
+	@echo
 	@echo "indexing End-of-Term-2024 Common Crawl repackage WARC over S3 (File size ~1GB, showing first 10 records):"
 	cdxj-indexer $(EOT_CC_WARC_S3) 2>/dev/null | head -n 10 | tee eot-cc.cdxj
 
@@ -52,10 +60,18 @@ extract:
 	warcio extract --payload whirlwind.warc.wat.gz 443 > extraction.json
 	@echo "hint: python -m json.tool extraction.json"
 
-extract-remote:
+extract-remote-https:
 	@echo "extracting hpxml.nrel.gov record from End-of-Term Internet Archive WARC over HTTPS (offset 50755):"
 	warcio extract $(EOT_IA_WARC_HTTPS) 50755
 	@echo
+	@echo "extracting before-you-ship.18f.gov record from End-of-Term Common Crawl repackage WARC over HTTPS (offset 18595):"
+	warcio extract $(EOT_CC_WARC_HTTPS) 18595
+
+extract-remote-s3:
+	@echo "!! this step requires authentication via S3 credentials (even though it is free)"
+	@echo "extracting hpxml.nrel.gov record from End-of-Term Internet Archive WARC over S3 (offset 50755):"
+	warcio extract $(EOT_IA_WARC_S3) 50755
+	@echo
 	@echo "extracting before-you-ship.18f.gov record from End-of-Term Common Crawl repackage WARC over S3 (offset 18595):"
 	warcio extract $(EOT_CC_WARC_S3) 18595
 
@@ -81,7 +97,7 @@ download_collinfo:
 	curl -O https://index.commoncrawl.org/collinfo.json
 
 CC-MAIN-2024-22.warc.paths.gz:
-	@echo "downloading the list from S3 requires S3 auth (even though it is free)"
+	@echo "!! this step requires authentication via S3 credentials (even though it is free)"
 	@echo "note that this file should already be in the repo"
 	aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > CC-MAIN-2024-22.warc.paths.gz
 

README.md

Lines changed: 31 additions & 14 deletions
@@ -257,19 +257,34 @@ The JSON blob has enough information to cleanly isolate the raw data of a single
 Through warcio's remote file handling capabilities, `cdxj-indexer` too can work on remote files, and this is true not just Common Crawl's, but any WARC files accessible over HTTPS or S3. As an example, let us check two WARC files from the End-of-Term Web Archive, which preserves U.S. government websites around presidential transitions. We will check one WARC file crawled by the Internet Archive (in the IA-000 segment), and another one repackaged from Common Crawl data (in the CC-000 segment). Let's index a few records from each.
 
 Run:
-
-`make cdxj-remote`
+`make cdxj-remote-https`
 
 <details>
 <summary>Click to view code</summary>
 
 ```
 cdxj-indexer https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2024/segments/IA-000/warc/EOT24PRE-20240926172119-crawl804_EOT24PRE-20240926172119-00000.warc.gz 2>/dev/null | head -n 10 | tee eot-ia.cdxj
-cdxj-indexer s3://eotarchive/crawl-data/EOT-2024/segments/CC-000/warc/EOT-2024-REPACKAGE-CC-MAIN-2024-42-GOV-000000-001.warc.gz 2>/dev/null | head -n 10 | tee eot-cc.cdxj
+cdxj-indexer https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2024/segments/CC-000/warc/EOT-2024-REPACKAGE-CC-MAIN-2024-42-GOV-000000-001.warc.gz 2>/dev/null | head -n 10 | tee eot-cc.cdxj
 ```
 </details>
 
-The first command fetches and indexes a WARC over HTTPS, the second over S3. These real-life WARC files are around 1GB each, so we display and save only the first 10 records.
+The first command fetches and indexes these two WARC over HTTPS. Since they are both around 1GB each, so we display and save only the first 10 records.
+
+If you have AWS credentials configured, you can also access the same files over S3, which is faster when running on AWS. Even though you will need AWS credentials for authentication purposes, this process is still free of charge since these are public buckets.
+If you do not have AWS credentials, you can access the same information over HTTPS as described above.
+
+Run:
+
+`make cdxj-remote-s3`
+
+<details>
+<summary>Click to view code</summary>
+
+```
+cdxj-indexer s3://eotarchive/crawl-data/EOT-2024/segments/IA-000/warc/EOT24PRE-20240926172119-crawl804_EOT24PRE-20240926172119-00000.warc.gz 2>/dev/null | head -n 10 | tee eot-ia.cdxj
+cdxj-indexer s3://eotarchive/crawl-data/EOT-2024/segments/CC-000/warc/EOT-2024-REPACKAGE-CC-MAIN-2024-42-GOV-000000-001.warc.gz 2>/dev/null | head -n 10 | tee eot-cc.cdxj
+```
+</details>
 
 
 ## Task 4: Use the CDXJ index to extract a subset of raw content from the local WARC, WET, and WAT
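
The HTTPS and S3 commands in the hunk above address the same objects; only the URL form differs. A minimal Python sketch of that mapping, assuming the `<bucket>.s3.amazonaws.com` virtual-hosted form seen in the README URLs. The helper name `s3_to_https` is hypothetical and not part of the repository, and buckets in other regions or with dots in their names may need a different endpoint:

```python
from urllib.parse import urlsplit


def s3_to_https(s3_url: str) -> str:
    """Rewrite s3://<bucket>/<key> as https://<bucket>.s3.amazonaws.com/<key>.

    Hypothetical helper for illustration; assumes the virtual-hosted URL
    style used by the eotarchive examples in the README.
    """
    parts = urlsplit(s3_url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    return f"https://{parts.netloc}.s3.amazonaws.com{parts.path}"


# S3 path copied from the README hunk above.
CC_WARC_S3 = (
    "s3://eotarchive/crawl-data/EOT-2024/segments/CC-000/warc/"
    "EOT-2024-REPACKAGE-CC-MAIN-2024-42-GOV-000000-001.warc.gz"
)
print(s3_to_https(CC_WARC_S3))
```

The printed URL should match the HTTPS form that `make cdxj-remote-https` uses for the Common Crawl repackage WARC.
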
@@ -316,24 +331,26 @@ Notice that we extracted HTML from the WARC, text from WET, and JSON from the WA
 The same random access trick works on remote files. By indexing deeper into the EOT WARC files from Task 3 (try increasing the head count, or removing it entirely if you're patient), we can find offsets for specific records and extract them directly — without downloading the entire file.
 
 Run:
-
-`make extract-remote`
+`make extract-remote-https`
 
 <details>
 <summary>Click to view code</summary>
 
 ```
 warcio extract https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2024/segments/IA-000/warc/EOT24PRE-20240926172119-crawl804_EOT24PRE-20240926172119-00000.warc.gz 50755
-warcio extract s3://eotarchive/crawl-data/EOT-2024/segments/CC-000/warc/EOT-2024-REPACKAGE-CC-MAIN-2024-42-GOV-000000-001.warc.gz 18595
+warcio extract https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2024/segments/CC-000/warc/EOT-2024-REPACKAGE-CC-MAIN-2024-42-GOV-000000-001.warc.gz 18595
 ```
 </details>
 
-The first command extracts the record for `https://hpxml.nrel.gov/` (HPXML Toolbox, hosted by the National Renewable Energy Laboratory) from an Internet Archive crawl, fetched over HTTPS. The second extracts the record for `https://before-you-ship.18f.gov/` (18F's pre-launch checklist for government services) from a Common Crawl repackage, fetched over S3.
+The first command extracts the record for https://hpxml.nrel.gov/ (HPXML Toolbox, hosted by the National Renewable Energy Laboratory) from an Internet Archive crawl. The second extracts the record for https://before-you-ship.18f.gov/ (18F's pre-launch checklist for government services) from a Common Crawl repackage.
+
+As with indexing, you can also use S3 paths if you have AWS credentials configured:
 
-In both cases, warcio uses the byte offset to seek directly to the right position in the remote file and decompress just that one record. Later in this tutorial we will see the same mechanism being used by `cdx_toolkit` to fetch a specific capture, by looking up the offset in the CDX index, then make a byte-range request to retrieve just the record you want.
+`make extract-remote-s3`
 
-**Note:** If you look at the output of the second extraction (`before-you-ship.18f.gov`), you'll notice that despite having an HTTP 200 status in the index, the actual HTML content is just a redirect page pointing to `handbook.tts.gsa.gov`. This is a good reminder that real crawl data is messy, a 200 status in the index doesn't always mean you'll get a full page of content!
+In both cases, warcio uses the byte offset to seek directly to the right position in the remote file and decompress just that one record. Later in this tutorial we will see the same mechanism being used by `cdx_toolkit` to fetch a specific capture, by looking up the offset in the CDX index, then making a byte-range request to retrieve just the record you want.
 
+**Note:** If you look at the output of the second extraction (before-you-ship.18f.gov), you'll notice that despite having an HTTP 200 status in the index, the actual HTML content is just a redirect page pointing to handbook.tts.gsa.gov. This is a good reminder that real crawl data is messy — a 200 status in the index doesn't always mean you'll get a full page of content!
 
 ## Task 5: Wreck the WARC by compressing it wrong
 
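
The "seek to the byte offset and decompress just that one record" behaviour described in the hunk above relies on each record being stored as its own gzip member. A minimal sketch of that idea using only the Python standard library; it is not warcio's actual implementation, the three byte strings merely stand in for WARC records, and the helper name `read_member_at` is hypothetical:

```python
import gzip
import zlib


def read_member_at(blob: bytes, offset: int) -> bytes:
    """Decompress exactly one gzip member that starts at `offset`."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    # Decompression stops at the end of the first member; bytes that belong
    # to later members are left untouched in decompressor.unused_data.
    return decompressor.decompress(blob[offset:])


# Stand-ins for WARC records, each compressed as a separate gzip member.
records = [b"record one\n", b"record two\n", b"record three\n"]
members = [gzip.compress(record) for record in records]

# Record where each member starts, as a CDXJ index would.
offsets = []
position = 0
for member in members:
    offsets.append(position)
    position += len(member)

blob = b"".join(members)
print(read_member_at(blob, offsets[1]))  # b'record two\n'
```

Here the slice `blob[offset:]` plays the role of the remote read. A real client would request only the bytes from that offset onward rather than holding the whole file in memory.
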
@@ -472,7 +489,7 @@ We check for capture results using the `cdxt` command `iter`, specifying the exa
 #### Retrieve the fetched content as WARC
 
 Next, we use the `cdxt` command `warc` to retrieve the content and save it locally as a new WARC file, again specifying the exact URL, crawl identifier, and timestamp range. This creates the WARC file `TEST-000000.extracted.warc.gz` which contains a `warcinfo` record explaining what the WARC is, followed by the `response` record we requested.
-* If you dig into cdx_toolkit's code, you'll find that it is using the offset and length of the WARC record (as returned by the CDX index query) to make a HTTP byte range request to S3 that isolates and returns just the single record we want from the full file. It only downloads the response WARC record because our CDX index only has the response records indexed.
+* If you dig into `cdx_toolkit`'s code, you'll find that it is using the offset and length of the WARC record (as returned by the CDX index query) to make a HTTP byte range request to S3 that isolates and returns just the single record we want from the full file. It only downloads the response WARC record because our CDX index only has the response records indexed.
 * By default `cdxt` avoids overwriting existing files by automatically incrementing the counter in the filename. If you run this again without deleting `TEST-000000.extracted.warc.gz`, the data will be written again to a new file `TEST-000001.extracted.warc.gz`.
 * Limit, timestamp, and crawl index args, as well as URL wildcards, work as for `iter`.
 
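
The byte-range request mentioned in the first bullet of the hunk above can be sketched as follows. This only builds the request header and is not `cdx_toolkit`'s actual code. The helper name `byte_range_header` and the length of 1000 are hypothetical; only the offset 50755 appears elsewhere in this commit:

```python
def byte_range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header covering `length` bytes from `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends, hence the "- 1".
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# The length here is a made-up value for illustration.
print(byte_range_header(50755, 1000))  # {'Range': 'bytes=50755-51754'}
```

A client would send this header with a GET for the WARC URL and pass the returned bytes to a WARC reader, which then sees only the one compressed record.
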
@@ -486,7 +503,7 @@ Now let's look at the columnar index, the other kind of index that Common Crawl
 
 We could read the data directly from our index in our S3 bucket and analyse it in the cloud through AWS Athena. However, this is a managed service that costs money to use (though usually a small amount). [You can read about using it here.](https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format) This whirlwind tour will only use the free method of either fetching data from outside of AWS (which is kind of slow), or making a local copy of a single columnar index (300 gigabytes per monthly crawl), and then using that.
 
-The columnar index is divided up into a separate index per crawl, which Athena or duckdb can stitch together. The cdx index is similarly divided up, but cdx_toolkit hides that detail from you.
+The columnar index is divided up into a separate index per crawl, which Athena or duckdb can stitch together. The cdx index is similarly divided up, but `cdx_toolkit` hides that detail from you.
 
 For the purposes of this whirlwind tour, we don't want to configure all the crawl indices because it would be slow. So let's start by figuring out which crawl was ongoing on the date 20240518015810, and then we'll work with just that one crawl.
 
@@ -640,8 +657,8 @@ All of these scripts run the same SQL query and should return the same record (w
 
 1. Use the DuckDb techniques from [Task 8](#task-8-query-using-the-columnar-index--duckdb-from-outside-aws) and the [Index Server](https://index.commoncrawl.org) to find a new webpage in the archives.
 2. Note its url, warc, and timestamp.
-3. Now open up the Makefile from [Task 6](#task-6-use-cdx_toolkit-to-query-the-full-cdx-index-and-download-those-captures-from-aws-s3) and look at the actions from the cdx_toolkit section.
-4. Repeat the cdx_toolkit steps, but for the page and date range you found above.
+3. Now open up the Makefile from [Task 6](#task-6-use-cdx_toolkit-to-query-the-full-cdx-index-and-download-those-captures-from-aws-s3) and look at the actions from the `cdx_toolkit` section.
+4. Repeat the `cdx_toolkit` steps, but for the page and date range you found above.
 
 ## Congratulations!
 
