Replace bash scripts and AWS instructions with cc-downloader (#18)
* fix: remove awk scripts and AWS instructions and replace them with cc-downloader
* doc: add a link to the GitHub repo of cc-downloader
* fix: add info in case cargo is not available
README.md (+20 −30)
@@ -795,46 +795,36 @@ In case you want to run many of these queries, and you have a lot of disk space,
> [!IMPORTANT]
> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run `make duck_ccf_local_files`
- To download the crawl index, there are two options: if you have access to the CCF AWS buckets, run:
+ To download the crawl index, please use [cc-downloader](https://github.com/commoncrawl/cc-downloader), which is a polite downloader for Common Crawl data:
- Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
+ Then, you can run `make duck_local_files LOCAL_DIR=crawl` to run the same query as above, but this time using your local copy of the index files.
Both `make duck_ccf_local_files` and `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` run the same SQL query and should return the same record (written as a parquet file).
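The added line in the diff above ends with a colon: in the README it is followed by the actual download commands, which this diff view does not capture. Below is a hedged sketch of that step, assuming cc-downloader is installed with cargo (the third commit above adds a note for when cargo is missing). The subcommand, crawl id, and data-type argument are illustrative assumptions, so consult the linked repo or `cc-downloader --help` for the real interface:

```bash
# If cargo is missing, install a Rust toolchain first (e.g. via https://rustup.rs),
# then install the downloader from crates.io.
cargo install cc-downloader

# Hypothetical invocation: fetch the index files for one crawl into ./crawl.
# The subcommand and arguments here are assumptions; check `cc-downloader --help`.
cc-downloader download CC-MAIN-2024-30 cc-index crawl/
```

Downloading into a directory named `crawl` would match the `LOCAL_DIR=crawl` value in the updated line above.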
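The Makefile itself is not part of this diff, but since both targets run the same SQL query over the index and write the matching record out as parquet, here is a purely illustrative DuckDB sketch of what `make duck_local_files LOCAL_DIR=crawl` might wrap. The file glob and filter column are invented for illustration; the real query lives in the repository's Makefile:

```bash
# Illustrative only: query a local copy of the columnar index with DuckDB and
# write the matching record back out as a parquet file. The path glob and the
# filter column are assumptions, not the repository's actual query.
duckdb <<'SQL'
COPY (
    SELECT *
    FROM read_parquet('crawl/*.parquet')
    WHERE url_host_registered_domain = 'commoncrawl.org'
    LIMIT 1
) TO 'result.parquet' (FORMAT parquet);
SQL
```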