## Task 8: Query using the columnar index + DuckDB from outside AWS

A single crawl columnar index is around 300 gigabytes. If you don't have a lot of disk space, but you do have a lot of time, you can directly access the index stored on AWS S3. We're going to do just that, and then use [DuckDB](https://duckdb.org) to run an SQL query against the index to find our webpage. We'll be running the following query:
```sql
SELECT
  *
FROM ccindex
WHERE subset = 'warc'
  AND crawl = 'CC-MAIN-2024-22'
  AND url_host_tld = 'org' -- help the query optimizer
  AND url_host_registered_domain = 'wikipedia.org' -- ditto
  AND url = 'https://an.wikipedia.org/wiki/Escopete'
;
```
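This query assumes a `ccindex` table is already defined; in this walkthrough that setup happens inside `Duck.java`. As a sketch of the equivalent DuckDB SQL (the bucket path follows Common Crawl's published layout, but the exact settings may differ from what the Java code does):

```sql
-- Sketch only: the real setup lives in Duck.java.
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';

-- hive_partitioning exposes crawl= and subset= from the directory
-- names, so the WHERE clauses on those columns can prune whole files.
CREATE VIEW ccindex AS
SELECT *
FROM read_parquet(
  's3://commoncrawl/cc-index/table/cc-main/warc/*/*/*.parquet',
  hive_partitioning = true);
```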
Run

```
make duck_cloudfront
```

On a machine with a 1 gigabit network connection and many cores, this should take about one minute total, using 8 cores. The output should look like:
The above command runs code in `Duck.java`, which accesses the relevant part of the index for our crawl (CC-MAIN-2024-22) and then counts the number of records in that crawl (2709877975!). The code then runs the SQL query we saw before, which should match the single response record we want.

The program then writes that one record into a local Parquet file, runs a second query that returns that one record, and shows the full contents of the record. We can see that the complete row contains many columns with different information associated with our record. Finally, it converts the row to the CDXJ format we saw before.
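The write-out-and-requery step can be sketched in DuckDB SQL like this (the output filename `escopete.parquet` and the selected columns are illustrative choices, not necessarily what `Duck.java` uses):

```sql
-- Export the single matching row, then query the local copy.
COPY (
  SELECT * FROM ccindex
  WHERE subset = 'warc'
    AND crawl = 'CC-MAIN-2024-22'
    AND url = 'https://an.wikipedia.org/wiki/Escopete'
) TO 'escopete.parquet' (FORMAT parquet);

-- DuckDB can query a Parquet file directly by name.
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM 'escopete.parquet';
```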
### Bonus: download a full crawl index and query with DuckDB

If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run

> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run `make duck_ccf_local_files`
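Querying the downloaded files looks much like the remote case, just with a local path. A sketch, assuming the index files sit under a hypothetical local `cc-index/` directory:

```sql
-- Point the view at the local copy of the columnar index.
CREATE VIEW ccindex AS
SELECT *
FROM read_parquet('cc-index/table/cc-main/warc/*/*/*.parquet',
                  hive_partitioning = true);

-- Sanity check: the whole crawl should have 2709877975 records.
SELECT count(*) FROM ccindex
WHERE crawl = 'CC-MAIN-2024-22' AND subset = 'warc';
```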
All of these scripts run the same SQL query and should return the same record (written as a Parquet file).