
Commit 6f6ffa0

docs: cleanup after testing of diagrams
1 parent 0a790a5 commit 6f6ffa0

1 file changed

Lines changed: 6 additions & 10 deletions

README.md

@@ -110,8 +110,8 @@ Now that we've looked at the uncompressed versions of these files to understand
 The [warcio](https://github.com/webrecorder/warcio) Python library lets us read and write WARC files programmatically.
 ```mermaid
 flowchart LR
-user["user (r/w)"]--warcio (r) -->warc
-user--warcio (w) -->warc
+user["userprocess (r/w)"]--warcio (w) -->warc
+warc --warcio (r)--> user
 warc@{shape: cyl}
 ```
 Let's use it to iterate over our WARC, WET, and WAT files and print out the record types we looked at before. First, look at the code in `warcio-iterator.py`:
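
For readers who want a concrete picture, here is a minimal sketch of the kind of iteration `warcio-iterator.py` performs, using warcio's `ArchiveIterator`; the command-line handling is an assumption, not copied from the repo:

```python
# Hypothetical sketch: print the record type of every record in the
# WARC, WET, and WAT files named on the command line.
import sys

from warcio.archiveiterator import ArchiveIterator

for filename in sys.argv[1:]:
    print(filename)
    with open(filename, 'rb') as stream:
        # ArchiveIterator streams records one at a time, gzipped or not
        for record in ArchiveIterator(stream):
            print('  ', record.rec_type)
```
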
@@ -177,11 +177,13 @@ The output has three sections, one each for the WARC, WET, and WAT. For each one
 
 ## Task 3: Index the WARC, WET, and WAT
 
-The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.
+The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.
 ```mermaid
 flowchart LR
-warc[warc] --> indexer --> cdxj[.cdxj] & columnar[.parquet]
+warc --> indexer --> cdxj & columnar
 warc@{shape: cyl}
+cdxj@{ shape: stored-data}
+columnar@{ shape: stored-data}
 ```
 
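
To make the random-access idea concrete, here is a hedged sketch (not from the README) of how the byte offset and length stored in an index entry could be used to read just one record from a large WARC with warcio; the helper function and its arguments are hypothetical:

```python
# Sketch: use an index entry's offset/length to read a single record
# without iterating over the whole WARC.
from io import BytesIO

from warcio.archiveiterator import ArchiveIterator

def read_one_record(warc_path, offset, length):
    """Read the single gzipped WARC record that starts at `offset`."""
    with open(warc_path, 'rb') as f:
        f.seek(offset)
        chunk = f.read(length)  # exactly one gzipped record, per the index
    record = next(iter(ArchiveIterator(BytesIO(chunk))))
    return record.rec_headers.get_header('WARC-Target-URI')
```
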

@@ -335,12 +337,6 @@ Make sure you compress WARCs the right way!
 ## Task 6: Use cdx_toolkit to query the full CDX index and download those captures from AWS S3
 
 Some of our users only want to download a small subset of the crawl. They want to run queries against an index, either the CDX index we just talked about, or in the columnar index, which we'll talk about later.
-```mermaid
-flowchart LR
-user --cdx_toolkit--> cdxi
-cdxi@{shape: cyl}
-```
-
 
 The [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit) is a set of tools for working with CDX indices of web crawls and archives. It knows how to query the CDX index across all of our crawls and also can create WARCs of just the records you want. We will fetch the same record from Wikipedia that we've been using for the whirlwind tour.
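
As an illustration of the kind of query described above, a minimal cdx_toolkit sketch; the URL pattern and printed fields are assumptions rather than the repo's actual example:

```python
# Sketch: query Common Crawl's CDX index for a handful of captures.
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')  # 'cc' selects the Common Crawl index
for obj in cdx.iter('en.wikipedia.org/wiki/*', limit=5):
    print(obj['timestamp'], obj['status'], obj['url'])
```
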
