
Commit 842c9f8

docs: added flowchart diagrams

1 parent f630817

1 file changed: README.md (31 additions & 2 deletions)
@@ -1,6 +1,18 @@
# Whirlwind Tour of Common Crawl's Datasets using Python

The Common Crawl corpus contains petabytes of crawl data, including raw web page data, metadata extracts, and text extracts. Common Crawl's data storage is a little complicated, as you might expect for such a large and rich dataset. We make our crawl data available in a variety of formats (WARC, WET, WAT), and we also have two indexes of the crawled webpages: CDXJ and columnar.
```mermaid
flowchart TD
    WEB["WEB"] -- crawler --> cc["Common Crawl"]
    cc --> WARC["WARC"] & WAT["WAT"] & WET["WET"] & CDXJ["CDXJ"] & Columnar["Columnar"] & etc["..."]
    WEB@{ shape: cyl}
    WARC@{ shape: stored-data}
    WAT@{ shape: stored-data}
    WET@{ shape: stored-data}
    CDXJ@{ shape: stored-data}
    Columnar@{ shape: stored-data}
    etc@{ shape: stored-data}
```

The goal of this whirlwind tour is to show you how a single webpage appears in all of these different places. That webpage is [https://an.wikipedia.org/wiki/Escopete](https://an.wikipedia.org/wiki/Escopete), which we crawled at 2024-05-18T01:58:10Z. On the way, we'll also explore the file formats we use and learn about some useful tools for interacting with our data!

@@ -96,7 +108,12 @@ Now that we've looked at the uncompressed versions of these files to understand
## Task 2: Iterate over WARC, WET, and WAT files

The [warcio](https://github.com/webrecorder/warcio) Python library lets us read and write WARC files programmatically.

```mermaid
flowchart LR
    user["user (r/w)"] -- warcio (r) --> warc
    user -- warcio (w) --> warc
    warc@{shape: cyl}
```
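
As a quick taste of the API, here's a minimal sketch of printing record types with warcio; it's an illustration only, assuming the archive paths are passed on the command line:

```python
import sys

from warcio.archiveiterator import ArchiveIterator

# ArchiveIterator reads WARC, WET, and WAT files alike
for path in sys.argv[1:]:
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            # rec_type is warcinfo, request, response, metadata, conversion, ...
            print(record.rec_type, record.rec_headers.get_header('WARC-Target-URI'))
```
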
Let's use it to iterate over our WARC, WET, and WAT files and print out the record types we looked at before. First, look at the code in `warcio-iterator.py`:

<details>
@@ -161,6 +178,12 @@ The output has three sections, one each for the WARC, WET, and WAT. For each one
## Task 3: Index the WARC, WET, and WAT

The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.

```mermaid
flowchart LR
    warc[warc] --> indexer --> cdxj[.cdxj] & columnar[.parquet]
    warc@{shape: cyl}
```

We have two versions of the index: the CDX index and the columnar index. The CDX index is useful for looking up single pages, whereas the columnar index is better suited to analytical and bulk queries. We'll look at both in this tour, starting with the CDX index.
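
To make this concrete, a CDXJ index entry is a single line: a SURT-formatted URL key, a capture timestamp, and a JSON blob. For our example page it has roughly this shape (the digest, length, offset, and filename values below are placeholders, not the real entry):

```
org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "...", "length": "...", "offset": "...", "filename": "crawl-data/.../....warc.gz"}
```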

@@ -196,7 +219,7 @@ The JSON blob has enough information to extract individual records: it says whic
## Task 4: Use the CDXJ index to extract raw content from the local WARC, WET, and WAT

Normally, compressed files aren't random access. However, WARC files use a trick to make this possible: every record is compressed separately. The `gzip` compression utility supports this, but it's rarely used.

To extract one record from a WARC file, all you need to know is the filename and the offset into the file. If you're reading over the web, it also really helps to know the exact length of the record.
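
To see how this fits together, here is a minimal sketch of extracting a single record from a local WARC; the filename, offset, and length values are illustrative placeholders that would normally come from a CDXJ index entry:

```python
import gzip

# placeholder values: in the tour, these come from the CDXJ index
filename = 'example.warc.gz'   # hypothetical local WARC file
offset, length = 12345, 6789   # byte offset and record length from the index

with open(filename, 'rb') as f:
    f.seek(offset)               # jump straight to the record
    compressed = f.read(length)  # read exactly one record's worth of bytes

# each record is its own gzip member, so this slice decompresses on its own
record = gzip.decompress(compressed)
print(record.decode('utf-8', errors='replace')[:500])
```

Over HTTP, the same idea works with a `Range` request for bytes `offset` through `offset + length - 1`.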

@@ -312,6 +335,12 @@ Make sure you compress WARCs the right way!
## Task 6: Use cdx_toolkit to query the full CDX index and download those captures from AWS S3

Some of our users only want to download a small subset of the crawl. They want to run queries against an index: either the CDX index we just talked about, or the columnar index, which we'll talk about later.

```mermaid
flowchart LR
    user --cdx_toolkit--> cdxi
    cdxi@{shape: cyl}
```

The [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit) is a set of tools for working with CDX indices of web crawls and archives. It knows how to query the CDX index across all of our crawls, and it can also create WARCs of just the records you want. We will fetch the same record from Wikipedia that we've been using for the whirlwind tour.
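
A minimal sketch of that kind of query with cdx_toolkit, assuming its `CDXFetcher` interface (see the cdx_toolkit README for the full API):

```python
import cdx_toolkit

# source='cc' points the fetcher at Common Crawl's CDX index
cdx = cdx_toolkit.CDXFetcher(source='cc')

url = 'https://an.wikipedia.org/wiki/Escopete'
for obj in cdx.iter(url, limit=1):
    # each capture records which WARC file it lives in, and where
    print(obj['timestamp'], obj['filename'], obj['offset'], obj['length'])
```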
