
Commit 3701812

fix: remove awk scripts and aws instruction and replace them with the cc-downloader
1 parent 25595fc commit 3701812

1 file changed

Lines changed: 19 additions & 30 deletions

File tree

README.md

@@ -795,46 +795,35 @@ In case you want to run many of these queries, and you have a lot of disk space,
 > [!IMPORTANT]
 > If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
-To download the crawl index, there are two options: if you have access to the CCF AWS buckets, run:
+To download the crawl index, please use cc-downloader, a polite downloader for Common Crawl data:
 
 ```shell
-mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
-aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ 'crawl=CC-MAIN-2024-22/subset=warc'
+cargo install cc-downloader
 ```
 
-If, by any other chance, you don't have access through the AWS CLI:
+cc-downloader will not be set up on your PATH by default, but you can run it by prepending the right path.
 
 ```shell
-mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
-cd 'crawl=CC-MAIN-2024-22/subset=warc'
-
-wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/cc-index-table.paths.gz
-gunzip cc-index-table.paths.gz
-
-grep 'subset=warc' cc-index-table.paths | \
-awk '{print "https://data.commoncrawl.org/" $1, $1}' | \
-xargs -n 2 -P 10 sh -c '
-echo "Downloading: $2"
-mkdir -p "$(dirname "$2")" &&
-wget -O "$2" "$1"
-' _
-
-rm cc-index-table.paths
-cd -
+mkdir crawl
+~/.cargo/bin/cc-downloader download-paths CC-MAIN-2024-22 cc-index-table crawl
+~/.cargo/bin/cc-downloader download crawl/cc-index-table.paths.gz --progress crawl
 ```
 
 In both ways, the file structure should be something like this:
 ```shell
-tree my_data
-my_data
-└── crawl=CC-MAIN-2024-22
-    └── subset=warc
-        ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
-        ├── part-00001-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
-        ├── part-00002-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
-```
-
-Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
+tree crawl/
+crawl/
+├── cc-index
+│   └── table
+│       └── cc-main
+│           └── warc
+│               └── crawl=CC-MAIN-2024-22
+│                   └── subset=warc
+│                       ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
+│                       ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c001.gz.parquet
+```
+
+Then, you can run `make duck_local_files LOCAL_DIR=crawl` to run the same query as above, but this time using your local copy of the index files.
 
 Both `make duck_ccf_local_files` and `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` run the same SQL query and should return the same record (written as a parquet file).
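If you'd rather invoke `cc-downloader` without the `~/.cargo/bin/` prefix shown in the diff, one option is to put cargo's binary directory on your `PATH`. This is a sketch that assumes a default cargo setup; if `CARGO_HOME` is set, the binaries live elsewhere:

```shell
# cargo installs binaries into ~/.cargo/bin by default (assumption: no
# custom CARGO_HOME). Prepend it to PATH for the current shell session:
export PATH="$HOME/.cargo/bin:$PATH"

# Confirm the directory is now on the search path.
echo "$PATH"
```

To make this persistent, the same `export` line would typically go in your shell's startup file (e.g. `~/.bashrc`).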
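The Makefile targets encapsulate the actual SQL, but as a rough sketch of how one might point DuckDB at the downloaded files directly: the glob below follows the Hive-partitioned layout shown in the `tree` output, while the query itself is only an illustration, not the query the Makefile runs (`url_host_name` is a standard cc-index table column):

```shell
# Hive-partitioned layout produced by cc-downloader (see the tree above).
INDEX_GLOB='crawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/*.parquet'

# Illustrative query only: count captures per host in the local index shard.
# Guarded so the snippet is a no-op when the duckdb CLI is not installed.
if command -v duckdb >/dev/null; then
  duckdb -c "
    SELECT url_host_name, count(*) AS n
    FROM read_parquet('${INDEX_GLOB}', hive_partitioning = true)
    GROUP BY url_host_name
    ORDER BY n DESC
    LIMIT 10;
  "
fi
```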
