
Commit 0a790a5 (parent 842c9f8)

docs: added bonus multi-task exercise

1 file changed (README.md): 8 additions & 0 deletions
@@ -536,10 +536,18 @@ download instructions.
All of these scripts run the same SQL query and should return the same record, written out as a Parquet file.
## Bonus 2: combine some steps
1. Use the DuckDB techniques from [Task 8](#task-8-query-using-the-columnar-index--duckdb-from-outside-aws) and the [Index Server](https://index.commoncrawl.org) to find a new webpage in the archives.
2. Note its URL, WARC filename, and timestamp.
3. Now open the Makefile from [Task 6](#task-6-use-cdx_toolkit-to-query-the-full-cdx-index-and-download-those-captures-from-aws-s3) and look at the actions in the cdx_toolkit section.
4. Repeat the cdx_toolkit steps, but for the page and date range you found above.
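Whichever route you take in step 1, the index record you note in step 2 is all the download step needs: under the hood, fetching a single capture boils down to an HTTP range request against Common Crawl's public data, addressed by WARC filename, offset, and length. A minimal sketch of building that request (the filename, offset, and length below are placeholders, not a real record):

```python
# Placeholder values standing in for the WARC filename, offset, and length
# you noted in step 2 -- an index record carries all three.
warc_filename = 'crawl-data/CC-MAIN-2024-33/segments/0000/warc/example.warc.gz'
offset, length = 31034902, 5347

url = f'https://data.commoncrawl.org/{warc_filename}'
# HTTP Range offsets are inclusive on both ends.
headers = {'Range': f'bytes={offset}-{offset + length - 1}'}
print(url)
print(headers['Range'])
```

Fetching `url` with those headers (e.g. `requests.get(url, headers=headers)`) returns just the gzipped WARC record for that one capture, which is the download the cdx_toolkit actions perform for you.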
## Congratulations!

You have completed the Whirlwind Tour of Common Crawl's Datasets using Python! You should now understand the different filetypes in our corpus and how to interact with Common Crawl's datasets from Python. To see what other people have done with our data, see the [Examples page](https://commoncrawl.org/examples) on our website. Why not join our Discord through the Community tab?

## Other datasets

We make more datasets available than just the ones discussed in this Whirlwind Tour. Below is a short introduction to some of these other datasets, along with links to where you can find out more.
