README.md: 47 additions & 2 deletions
@@ -549,9 +549,54 @@ The program then writes that one record into a local Parquet file, does a second
### Bonus: download a full crawl index and query with DuckDB

-If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run
+In case you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly.

-```make duck_local_files```
+> [!IMPORTANT]
+> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
+
+To download the crawl index, there are two options: if you have access to the CCF AWS buckets, run:
+Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.

If the files aren't already downloaded, this command will give you