Commit 8adcbb0

chore(docs): update documentation with instruction on how to download the crawl data with and without the AWS CLI
1 parent 6f3028b commit 8adcbb0

1 file changed

Lines changed: 47 additions & 2 deletions

README.md

@@ -549,9 +549,54 @@ The program then writes that one record into a local Parquet file, does a second
549 549

550 550
### Bonus: download a full crawl index and query with DuckDB
551 551

552-
If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run
552+
If you want to run many of these queries and have plenty of disk space, download the 300-gigabyte index once and query your local copy repeatedly.
553 553

554-
```make duck_local_files```
554+
> [!IMPORTANT]
555+
> If you are using the Common Crawl Foundation development server, these files are already downloaded, and you can run `make duck_ccf_local_files`.
556+
557+
There are two ways to download the crawl index. If you have access to the CCF AWS buckets, run:
558+
559+
```shell
560+
mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
561+
aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ 'crawl=CC-MAIN-2024-22/subset=warc'
562+
```
563+
564+
If you don't have access through the AWS CLI, download the files over HTTPS instead:
565+
566+
```shell
567+
mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
568+
cd 'crawl=CC-MAIN-2024-22/subset=warc'
569+
570+
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/cc-index-table.paths.gz
571+
gunzip cc-index-table.paths.gz
572+
573+
grep 'subset=warc' cc-index-table.paths | \
574+
awk '{print "https://data.commoncrawl.org/" $1, $1}' | \
575+
xargs -n 2 -P 10 sh -c '
576+
echo "Downloading: $2"
577+
mkdir -p "$(dirname "$2")" &&
578+
wget -O "$2" "$1"
579+
' _
580+
581+
rm cc-index-table.paths
582+
cd -
583+
```
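
A possible variation on the loop above, for anyone who wants every Parquet file to sit directly in the current directory. It assumes each line of `cc-index-table.paths` is a bucket-relative path with no spaces (for example `cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/<file>.parquet`); under that assumption, saving to `$2` recreates the whole prefix below the current directory. The helper name `paths_to_pairs` is hypothetical. It prints each download URL together with the file's base name only:

```shell
# Hypothetical helper: read index paths on stdin and print
# "<download URL> <base file name>" for every subset=warc entry.
# Assumes one bucket-relative path per line, with no spaces.
paths_to_pairs() {
  grep 'subset=warc' |
    awk '{
      n = split($1, parts, "/")   # parts[n] is the file name
      print "https://data.commoncrawl.org/" $1, parts[n]
    }'
}
```

It could stand in for the `grep ... | awk ...` stage, for example `paths_to_pairs < cc-index-table.paths | xargs -n 2 -P 10 sh -c 'wget -O "$2" "$1"' _`, since no `mkdir` is needed once the target is a bare file name.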
584+
585+
Either way, the file structure should look something like this:
586+
```shell
587+
tree my_data
588+
my_data
589+
└── crawl=CC-MAIN-2024-22
590+
    └── subset=warc
591+
        ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
592+
        ├── part-00001-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
593+
        ├── part-00002-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
594+
```
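
To sanity-check a download before pointing the Makefile at it, a small sketch like the following counts the Parquet files in the expected partition directory. The function name `count_index_files` is hypothetical, and the crawl ID is hard-coded to match the commands above:

```shell
# Hypothetical helper: print how many Parquet files sit directly in
# <dir>/crawl=CC-MAIN-2024-22/subset=warc; fail if that directory is missing.
count_index_files() {
  part_dir="$1/crawl=CC-MAIN-2024-22/subset=warc"
  if [ ! -d "$part_dir" ]; then
    echo "missing directory: $part_dir" >&2
    return 1
  fi
  # -maxdepth is supported by GNU and BSD find, though not required by POSIX.
  # tr removes the padding some wc implementations add around the count.
  n=$(find "$part_dir" -maxdepth 1 -type f -name '*.parquet' | wc -l | tr -d ' ')
  echo "$n"
}
```

For the layout shown above, `count_index_files my_data` would print the number of `part-*.parquet` files; a count of `0` suggests the files landed in a nested directory instead.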
595+
596+
597+

598+
599+
Then run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, this time against your local copy of the index files.
555 600

556 601
If the files aren't already downloaded, this command will give you
557 602
download instructions.
