
Commit 4c97de4

Task 3, 4, and 7 (#8)
* ignore .idea, target
* add pom.xml, Readme.md and the data files
* add makefile
* add read warc
* add CI + spotless
* add figures, editorconfig, .gitignore from the python repository brother
* remove unclear make install, remove venv info from readme
* update read class, add recompress
* cleanup, removing the rest of the python stuff for task 0,1,2
* fix missing make install
* move data under 'data' directory
* add Apache header in the code
* make sure we build before running
* update .gitignore
* Implement WARC compression validation for Task 5
* Ignore gzip validation if uncompressed
* fix compression check, update Readme.md
* add missing apache licence
* add commons-compress library
* place GitHub Actions in the correct directory
* Add CDXJ indexer using unreleased JWARC code
* Implement Task 3 and 4
* fix: CI build
* fix: Reformat with spotless
* fix: Rename class
* feat: task 7
* fix(makefile): write stuff in data/
* fix(makefile): avoid reimplementing the wheel
1 parent 8244615 commit 4c97de4

5 files changed

Lines changed: 486 additions & 32 deletions

File tree

Makefile

Lines changed: 30 additions & 29 deletions
```diff
@@ -1,19 +1,19 @@
 build:
 	mvn clean package
 
-# cdxj:
-# 	@echo "creating *.cdxj index files from the local warcs"
-# 	cdxj-indexer whirlwind.warc.gz > whirlwind.warc.cdxj
-# 	cdxj-indexer --records conversion whirlwind.warc.wet.gz > whirlwind.warc.wet.cdxj
-# 	cdxj-indexer whirlwind.warc.wat.gz > whirlwind.warc.wat.cdxj
+cdxj: build ensure_jwarc
+	@echo "creating *.cdxj index files from the local warcs"
+	java -jar jwarc.jar cdxj data/whirlwind.warc.gz > data/whirlwind.warc.cdxj
+	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wet.gz --records conversion" > data/whirlwind.warc.wet.cdxj
+	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wat.gz --records metadata" > data/whirlwind.warc.wat.cdxj
+
+extract:
+	@echo "creating extraction.* from local warcs, the offset numbers are from the cdxj index"
+	java -jar jwarc.jar extract --payload data/whirlwind.warc.gz 1023 > extraction.html
+	java -jar jwarc.jar extract --payload data/whirlwind.warc.wet.gz 466 > extraction.txt
+	java -jar jwarc.jar extract --payload data/whirlwind.warc.wat.gz 443 > extraction.json
+	@echo "hint: python -m json.tool extraction.json"
 
-# extract:
-# 	@echo "creating extraction.* from local warcs, the offset numbers are from the cdxj index"
-# 	warcio extract --payload whirlwind.warc.gz 1023 > extraction.html
-# 	warcio extract --payload whirlwind.warc.wet.gz 466 > extraction.txt
-# 	warcio extract --payload whirlwind.warc.wat.gz 443 > extraction.json
-# 	@echo "hint: python -m json.tool extraction.json"
-#
 # cdx_toolkit:
 # 	@echo demonstrate that we have this entry in the index
 # 	cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
@@ -31,15 +31,15 @@ build:
 # 	python ./warcio-iterator.py TEST-000000.extracted.warc.gz
 # 	@echo
 #
-# download_collinfo:
-# 	@echo "downloading collinfo.json so we can find out the crawl name"
-# 	curl -O https://index.commoncrawl.org/collinfo.json
-#
-# CC-MAIN-2024-22.warc.paths.gz:
-# 	@echo "downloading the list from s3, requires s3 auth even though it is free"
-# 	@echo "note that this file should be in the repo"
-# 	aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > CC-MAIN-2024-22.warc.paths.gz
-#
+download_collinfo:
+	@echo "downloading collinfo.json so we can find out the crawl name"
+	curl -o data/collinfo.json https://index.commoncrawl.org/collinfo.json
+
+CC-MAIN-2024-22.warc.paths.gz:
+	@echo "downloading the list from s3, requires s3 auth even though it is free"
+	@echo "note that this file should be in the repo"
+	aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > data/CC-MAIN-2024-22.warc.paths.gz
+
 # duck_local_files:
 # 	@echo "warning! 300 gigabyte download"
 # 	python duck.py local_files
@@ -52,11 +52,12 @@ build:
 # 	@echo "warning! this might take 1-10 minutes"
 # 	python duck.py cloudfront
 #
-get_jwarc:
+
+jwarc.jar:
 	@echo "downloading JWarc JAR"
 	curl -fL -o jwarc.jar https://github.com/iipc/jwarc/releases/download/v0.33.0/jwarc-0.33.0.jar
 
-wreck_the_warc: build get_jwarc
+wreck_the_warc: build jwarc.jar
 	@echo
 	@echo we will break and then fix this warc
 	cp data/whirlwind.warc.gz data/testing.warc.gz
@@ -67,24 +68,24 @@ wreck_the_warc: build get_jwarc
 	gzip data/testing.warc
 	@echo
 	@echo showing the records in the compressed warc - note the offsets of request and response are
-	java -jar jwarc-0.33.0.jar ls data/testing.warc.gz
+	java -jar jwarc.jar ls data/testing.warc.gz
 	@echo
 	@echo access the request record - failing
-	java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 3734 || /usr/bin/true
+	java -jar jwarc.jar extract data/testing.warc.gz 3734 || /usr/bin/true
 	@echo
 	@echo access the response record - failing
-	java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 3734 || /usr/bin/true
+	java -jar jwarc.jar extract data/testing.warc.gz 3734 || /usr/bin/true
 	@echo
 	@echo "now let's do it the right way"
 	gzip -d data/testing.warc.gz
 	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.RecompressWARC -Dexec.args="data/testing.warc data/testing.warc.gz"
 	@echo
 	@echo showing the records in the compressed warc - note the skewed offsets of request and response
-	java -jar jwarc-0.33.0.jar ls data/testing.warc.gz
+	java -jar jwarc.jar ls data/testing.warc.gz
 	@echo
 	@echo access the request record - works
-	java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 518 | head
+	java -jar jwarc.jar extract data/testing.warc.gz 518 | head
 	@echo
 	@echo access the response record - works
-	java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 1027 | head -n 20
+	java -jar jwarc.jar extract data/testing.warc.gz 1027 | head -n 20
 	@echo
```

README.md

Lines changed: 69 additions & 2 deletions
## Task 3: Index the WARC, WET, and WAT

The example WARC files we've been using are tiny and easy to work with. Real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, there are around 24 million of these files! We could read them all by iterating, but what if we want random access, so that we can read just one particular record? For that, we use an index.
```mermaid
flowchart LR
    warc --> indexer --> cdxj & columnar
    warc@{ shape: cyl }
    cdxj@{ shape: stored-data }
    columnar@{ shape: stored-data }
```
We have two versions of the index: the CDX index and the columnar index. The CDX index is useful for looking up single pages, whereas the columnar index is better suited to analytical and bulk queries. We'll look at both in this tour, starting with the CDX index.

### CDX(J) index

The CDX index files are sorted plain-text files, with each line containing information about a single capture in the WARC. Technically, Common Crawl uses CDXJ index files, since the information about each capture is formatted as JSON. We'll use CDX and CDXJ interchangeably in this tour, for legacy reasons 💅
We can create our own CDXJ index from the local WARCs by running:

```make cdxj```

This uses the JWARC library, together with some home-cooked code we wrote to support WET and WAT records, to generate CDXJ index files for our WARC files by running the commands below:
<details>
<summary>Click to view code</summary>

```
creating *.cdxj index files from the local warcs
java -jar jwarc.jar cdxj data/whirlwind.warc.gz > data/whirlwind.warc.cdxj
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wet.gz --records conversion" > data/whirlwind.warc.wet.cdxj
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wat.gz --records metadata" > data/whirlwind.warc.wat.cdxj
```

</details>
Now look at the `.cdxj` files with `cat data/whirlwind*.cdxj`. You'll see that each file has one entry in the index. The WARC only has the response record indexed, since by default the indexer guesses that you won't ever want to random-access the request or metadata records. The WET and WAT have the conversion and metadata records indexed (Common Crawl doesn't publish a WET or WAT index, just WARC).

For each of these records, there's one text line in the index - yes, it's a flat file! It starts with a string like `org,wikipedia,an)/wiki/escopete 20240518015810`, followed by a JSON blob. The starting string is the primary key of the index. The first field is a [SURT](http://crawler.archive.org/articles/user_manual/glossary.html#surt) (Sort-friendly URI Reordering Transform). The long integer is a timestamp: an ISO-8601 date-time with the delimiters removed.
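The host-reversal idea behind SURT can be sketched in a few lines. This is a toy version (the class and method names are made up for illustration, and it only handles the host-plus-path case, not schemes, ports, or userinfo); it shows why sorting on the key groups all of a domain's subdomains together:

```java
// Toy SURT sketch (hypothetical helper, not the project's code): reverse the
// host labels, lowercase them, and append the path after a ")" separator.
public class SurtSketch {
    public static String toSurt(String host, String path) {
        String[] labels = host.toLowerCase().split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            if (key.length() > 0) key.append(',');
            key.append(labels[i]);           // org, then wikipedia, then an
        }
        return key.append(')').append(path).toString();
    }

    public static void main(String[] args) {
        // Matches the primary-key example above:
        System.out.println(toSurt("an.wikipedia.org", "/wiki/escopete"));
        // org,wikipedia,an)/wiki/escopete
    }
}
```

Because the labels are reversed, `org,wikipedia,an)` and `org,wikipedia,en)` sort next to each other, which plain host order would not give you.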
What is the purpose of this funky format? These flat files (300 gigabytes total per crawl) can be sorted on the primary key using any out-of-core sort utility, e.g. the standard Linux `sort`, or one of the Hadoop-based out-of-core sort functions.

The JSON blob has enough information to cleanly isolate the raw data of a single record: it names the WARC file the record is in, and gives the byte offset and length of the record within that file. We'll use that in the next section.
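To show how little machinery a reader of this flat file needs, here's a hypothetical line splitter (the class name, the sample values, and the no-JSON-library field extraction are all illustrative; the `filename`/`offset`/`length` keys are the ones described above):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical CDXJ line reader: a line is "<SURT key> <timestamp> <JSON blob>",
// and the blob carries the filename, offset, and length needed for random access.
public class CdxjLineSketch {
    final String surt, timestamp, json;

    CdxjLineSketch(String line) {
        // The SURT key and timestamp never contain spaces, so two splits suffice.
        int a = line.indexOf(' ');
        int b = line.indexOf(' ', a + 1);
        surt = line.substring(0, a);
        timestamp = line.substring(a + 1, b);
        json = line.substring(b + 1);
    }

    // Pull one field out of the blob; good enough for a sketch,
    // not a substitute for real JSON parsing.
    String field(String key) {
        Matcher m = Pattern.compile("\"" + key + "\"\\s*:\\s*\"?([^\",}]+)").matcher(json);
        return m.find() ? m.group(1) : null;
    }
}
```

With `field("filename")`, `field("offset")`, and `field("length")` in hand, a reader can seek straight to one record without touching the rest of the crawl.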
## Task 4: Use the CDXJ index to extract a subset of raw content from the local WARC, WET, and WAT

Normally, compressed files aren't random access. WARC files use a trick to make it possible: every record is compressed separately. The `gzip` format supports this (gzip members can be concatenated into one file), but the feature is rarely used.
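The trick is easy to demonstrate with plain `java.util.zip` (a self-contained sketch; the class name and the fake record strings are made up): each record becomes its own gzip member, the members are concatenated, and any one of them can be decompressed from just its (offset, length) slice:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of per-record compression: each record is a complete gzip member,
// so a slice of the file at (offset, length) is itself a valid gzip stream.
public class MemberGzipSketch {
    static byte[] gzipMember(byte[] record) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(record);
        }
        return out.toByteArray();
    }

    static byte[] gunzipSlice(byte[] file, int offset, int length) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(file, offset, length))) {
            gz.transferTo(out);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] req = gzipMember("fake request record".getBytes());
        byte[] resp = gzipMember("fake response record".getBytes());
        byte[] warc = new byte[req.length + resp.length];  // "the WARC file"
        System.arraycopy(req, 0, warc, 0, req.length);
        System.arraycopy(resp, 0, warc, req.length, resp.length);
        // Random access: decompress only the second record, as the index allows.
        System.out.println(new String(gunzipSlice(warc, req.length, resp.length)));
        // fake response record
    }
}
```

Recompressing a WARC as one monolithic gzip stream destroys exactly this property, which is the point of Task 5 below.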
To extract one record from a WARC file, all you need to know is the filename and the offset into the file. If you're reading over the web, it also helps to know the exact length of the record.

Run:

```make extract```

to run a set of extractions from your local `whirlwind.*.gz` files with `JWARC`, using the commands below:
<details>
<summary>Click to view code</summary>

```
creating extraction.* from local warcs, the offset numbers are from the cdxj index
java -jar jwarc.jar extract --payload data/whirlwind.warc.gz 1023 > extraction.html
java -jar jwarc.jar extract --payload data/whirlwind.warc.wet.gz 466 > extraction.txt
java -jar jwarc.jar extract --payload data/whirlwind.warc.wat.gz 443 > extraction.json
hint: python -m json.tool extraction.json
```

</details>
The offset numbers in the Makefile are the same ones as in the index. Look at the three output files: `extraction.html`, `extraction.txt`, and `extraction.json` (pretty-print the JSON with `python -m json.tool extraction.json`).

Notice that we extracted HTML from the WARC, text from the WET, and JSON from the WAT (as the file extensions show). This is because the payload in each file type is formatted differently!

## Task 5: Wreck the WARC by compressing it wrong
