Commit 352b140: Task 8 + Bonus (#11)
1 parent 34c6c87 commit 352b140

3 files changed: 111 additions & 15 deletions

Makefile
Lines changed: 12 additions & 12 deletions
@@ -9,26 +9,26 @@ cdxj: build jwarc.jar
 
 extract: jwarc.jar
 	@echo "creating extraction.* from local warcs, the offset numbers are from the cdxj index"
-	java -jar jwarc.jar extract --payload data/whirlwind.warc.gz 1023 > extraction.html
-	java -jar jwarc.jar extract --payload data/whirlwind.warc.wet.gz 466 > extraction.txt
-	java -jar jwarc.jar extract --payload data/whirlwind.warc.wat.gz 443 > extraction.json
-	@echo "hint: python -m json.tool extraction.json"
+	java -jar jwarc.jar extract --payload data/whirlwind.warc.gz 1023 > data/extraction.html
+	java -jar jwarc.jar extract --payload data/whirlwind.warc.wet.gz 466 > data/extraction.txt
+	java -jar jwarc.jar extract --payload data/whirlwind.warc.wat.gz 443 > data/extraction.json
+	@echo "hint: python -m json.tool data/extraction.json"
 
 cdx_toolkit: jwarc.jar
 	@echo demonstrate that we have this entry in the index
 	curl 'https://index.commoncrawl.org/CC-MAIN-2024-22-index?url=an.wikipedia.org/wiki/Escopete&output=json&from=20240518015810&to=20240518015810'
 	@echo
 	@echo cleanup previous work
-	rm -f TEST-000000.extracted.warc.gz
+	rm -f data/TEST-000000.extracted.warc.gz
 	@echo retrieve the content from the commoncrawl data server
-	curl --request GET --url 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz' --header 'Range: bytes=80610731-80628153' > TEST-000000.extracted.warc.gz
+	curl --request GET --url 'https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz' --header 'Range: bytes=80610731-80628153' > data/TEST-000000.extracted.warc.gz
 	@echo
 	@echo index this new warc
-	java -jar jwarc.jar cdxj TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
-	cat TEST-000000.extracted.warc.cdxj
+	java -jar jwarc.jar cdxj data/TEST-000000.extracted.warc.gz > data/TEST-000000.extracted.warc.cdxj
+	cat data/TEST-000000.extracted.warc.cdxj
 	@echo
 	@echo iterate this new warc
-	java -jar jwarc.jar ls TEST-000000.extracted.warc.gz
+	java -jar jwarc.jar ls data/TEST-000000.extracted.warc.gz
 	@echo
 
 download_collinfo:
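
The byte range in the curl above comes straight from the index fields for this capture: the index reports offset 80610731 and length 17423, and the last byte of the range is offset + length - 1. A quick sanity check, runnable in DuckDB or any SQL shell:

```sql
-- Range end = offset + length - 1; matches bytes=80610731-80628153 above.
SELECT 80610731 AS first_byte,
       80610731 + 17423 - 1 AS last_byte;  -- 80628153
```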
@@ -41,12 +41,12 @@ CC-MAIN-2024-22.warc.paths.gz:
 	aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > data/CC-MAIN-2024-22.warc.paths.gz
 
 duck_ccf_local_files: build
-	@echo "warning! only works on Common Crawl Foundadtion's development machine"
-	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args"ccf_local_files"
+	@echo "warning! only works on Common Crawl Foundation's development machine"
+	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="ccf_local_files"
 
 duck_cloudfront: build
 	@echo "warning! this might take 1-10 minutes"
-	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args"cloudfront"
+	mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.Duck -Dexec.args="cloudfront"
 
 jwarc.jar:
 	@echo "downloading JWarc JAR"

README.md
Lines changed: 97 additions & 2 deletions
@@ -700,11 +700,106 @@ The date of our test record is 20240518015810, which is
 
 ## Task 8: Query using the columnar index + DuckDB from outside AWS
 
-TBA
+A single crawl columnar index is around 300 gigabytes. If you don't have a lot of disk space, but you do have a lot of time, you can directly access the index stored on AWS S3. We're going to do just that, and then use [DuckDB](https://duckdb.org) to make an SQL query against the index to find our webpage. We'll be running the following query:
+
+```sql
+SELECT
+  *
+FROM ccindex
+WHERE subset = 'warc'
+  AND crawl = 'CC-MAIN-2024-22'
+  AND url_host_tld = 'org' -- help the query optimizer
+  AND url_host_registered_domain = 'wikipedia.org' -- ditto
+  AND url = 'https://an.wikipedia.org/wiki/Escopete'
+;
+```
+
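For the query to have something to run against, DuckDB needs a `ccindex` relation covering the remote Parquet files. Below is a minimal sketch of one way to set that up by hand with DuckDB's httpfs extension; the `part-00000-example...` file name is a hypothetical placeholder, and the actual setup lives in `Duck.java`, which builds its file list from the paths listing.

```sql
-- Minimal sketch, not the exact code in Duck.java: define ccindex as a view
-- over the remote Parquet files. The part-00000-example name below is a
-- hypothetical placeholder; the real names come from
-- data/CC-MAIN-2024-22.warc.paths.gz.
INSTALL httpfs;
LOAD httpfs;

CREATE VIEW ccindex AS
SELECT *
FROM read_parquet(
  ['https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/part-00000-example.c000.gz.parquet'],
  hive_partitioning = true  -- recovers the crawl and subset columns from the path
);
```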
+Run
+
+```make duck_cloudfront```
+
+On a machine with a 1 gigabit network connection and many cores, this should take about one minute total and use 8 cores. The output should look like:
+
+<details>
+<summary>Click to view output</summary>
+
+```
+Using algorithm: cloudfront
+Total records for crawl: CC-MAIN-2024-22
+100% ▕████████████████████████████████████████████████████████████▏
+2709877975
+
+Our one row:
+100% ▕████████████████████████████████████████████████████████████▏
+url_surtkey | url | url_host_name | url_host_tld | url_host_2nd_last_part | url_host_3rd_last_part | url_host_4th_last_part | url_host_5th_last_part | url_host_registry_suffix | url_host_registered_domain | url_host_private_suffix | url_host_private_domain | url_host_name_reversed | url_protocol | url_port | url_path | url_query | fetch_time | fetch_status | fetch_redirect | content_digest | content_mime_type | content_mime_detected | content_charset | content_languages | content_truncated | warc_filename | warc_record_offset | warc_record_length | warc_segment | crawl | subset
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+org,wikipedia,an)/wiki/escopete | https://an.wikipedia.org/wiki/Escopete | an.wikipedia.org | org | wikipedia | an | NULL | NULL | org | wikipedia.org | org | wikipedia.org | org.wikipedia.an | https | NULL | /wiki/Escopete | NULL | 2024-05-18T01:58:10Z | 200 | NULL | RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU | text/html | text/html | UTF-8 | spa | NULL | crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz | 80610731 | 17423 | 1715971057216.39 | CC-MAIN-2024-22 | warc
+
+Writing our one row to a local parquet file, whirlwind.parquet
+100% ▕████████████████████████████████████████████████████████████▏
+Total records for local whirlwind.parquet should be 1:
+1
+
+Our one row, locally:
+url_surtkey | url | url_host_name | url_host_tld | url_host_2nd_last_part | url_host_3rd_last_part | url_host_4th_last_part | url_host_5th_last_part | url_host_registry_suffix | url_host_registered_domain | url_host_private_suffix | url_host_private_domain | url_host_name_reversed | url_protocol | url_port | url_path | url_query | fetch_time | fetch_status | fetch_redirect | content_digest | content_mime_type | content_mime_detected | content_charset | content_languages | content_truncated | warc_filename | warc_record_offset | warc_record_length | warc_segment | crawl | subset
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+org,wikipedia,an)/wiki/escopete | https://an.wikipedia.org/wiki/Escopete | an.wikipedia.org | org | wikipedia | an | NULL | NULL | org | wikipedia.org | org | wikipedia.org | org.wikipedia.an | https | NULL | /wiki/Escopete | NULL | 2024-05-18T01:58:10Z | 200 | NULL | RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU | text/html | text/html | UTF-8 | spa | NULL | crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz | 80610731 | 17423 | 1715971057216.39 | CC-MAIN-2024-22 | warc
+
+Complete row:
+url_surtkey                  org,wikipedia,an)/wiki/escopete
+url                          https://an.wikipedia.org/wiki/Escopete
+url_host_name                an.wikipedia.org
+url_host_tld                 org
+url_host_2nd_last_part       wikipedia
+url_host_3rd_last_part       an
+url_host_4th_last_part       null
+url_host_5th_last_part       null
+url_host_registry_suffix     org
+url_host_registered_domain   wikipedia.org
+url_host_private_suffix      org
+url_host_private_domain      wikipedia.org
+url_host_name_reversed       org.wikipedia.an
+url_protocol                 https
+url_port                     null
+url_path                     /wiki/Escopete
+url_query                    null
+fetch_time                   2024-05-18T01:58:10Z
+fetch_status                 200
+fetch_redirect               null
+content_digest               RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU
+content_mime_type            text/html
+content_mime_detected        text/html
+content_charset              UTF-8
+content_languages            spa
+content_truncated            null
+warc_filename                crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz
+warc_record_offset           80610731
+warc_record_length           17423
+warc_segment                 1715971057216.39
+crawl                        CC-MAIN-2024-22
+subset                       warc
+
+Equivalent to CDXJ:
+org,wikipedia,an)/wiki/escopete 20240518015810 {"url":"https://an.wikipedia.org/wiki/Escopete","mime":"text/html","status":"200","digest":"sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU","length":"17423","offset":"80610731","filename":"crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz"}
+```
+</details>
+
+The above command runs code in `Duck.java`, which accesses the relevant part of the index for our crawl (CC-MAIN-2024-22) and then counts the number of records in that crawl (2709877975!). The code then runs the SQL query we saw before, which should match the single response record we want.
+
+The program then writes that one record into a local Parquet file, runs a second query that returns that one record, and shows the full contents of the record. We can see that the complete row contains many columns of information associated with our record. Finally, it converts the row to the CDXJ format we saw before.
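
In plain DuckDB SQL, those two steps (export, then re-query) might look roughly like the sketch below. The real program drives DuckDB from Java, so treat this purely as an illustration; the file name `whirlwind.parquet` is taken from the output above.

```sql
-- Write the single matching record out to a local Parquet file ...
COPY (
  SELECT *
  FROM ccindex
  WHERE subset = 'warc'
    AND crawl = 'CC-MAIN-2024-22'
    AND url_host_tld = 'org'
    AND url_host_registered_domain = 'wikipedia.org'
    AND url = 'https://an.wikipedia.org/wiki/Escopete'
) TO 'whirlwind.parquet' (FORMAT parquet);

-- ... then query the local file; DuckDB treats the file name as a table.
SELECT count(*) FROM 'whirlwind.parquet';  -- should print 1
SELECT * FROM 'whirlwind.parquet';
```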
 
 ### Bonus: download a full crawl index and query with DuckDB
 
-TBA
+If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run
+
+```shell
+aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ .
+```
+
+> [!IMPORTANT]
+> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
+
+All of these scripts run the same SQL query and should return the same record (written as a Parquet file).
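
Once the files are local, the same query can be pointed at them directly. A minimal sketch, assuming the sync above landed the Parquet files in the current directory; the `crawl` and `subset` predicates are no longer needed, because the sync already selected one crawl and one subset:

```sql
-- Query the locally synced index files with a glob instead of the remote view.
SELECT *
FROM read_parquet('*.parquet')
WHERE url_host_tld = 'org'
  AND url_host_registered_domain = 'wikipedia.org'
  AND url = 'https://an.wikipedia.org/wiki/Escopete';
```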
 
 ## Bonus 2: combine some steps
 
src/main/java/org/commoncrawl/whirlwind/Duck.java
Lines changed: 2 additions & 1 deletion
@@ -24,6 +24,7 @@
 import java.nio.charset.StandardCharsets;
 import java.nio.file.Files;
 import java.nio.file.Path;
+import java.nio.file.Paths;
 import java.sql.*;
 import java.time.format.DateTimeFormatter;
 import java.util.*;
@@ -124,7 +125,7 @@ public static List<String> getFiles(Algorithm algo, String crawl) throws IOException
 case CLOUDFRONT: {
     String externalPrefix = String
             .format("https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=%s/subset=warc/", crawl);
-    String pathsFile = crawl + ".warc.paths.gz";
+    String pathsFile = Paths.get("data", crawl + ".warc.paths.gz").toString();
 
     List<String> files = new ArrayList<>();
     try (GZIPInputStream gzis = new GZIPInputStream(new FileInputStream(pathsFile));
