Commit 0583f4c

chore: update readme
1 parent 86353bc commit 0583f4c

1 file changed

Lines changed: 17 additions & 14 deletions

README.md

@@ -1,6 +1,6 @@
 <img width="400" alt="logo" src="https://github.com/user-attachments/assets/ea105e9e-00d0-4d48-a2a4-006cc4e89848" />
 
-[![Scraper](https://img.shields.io/github/v/release/superdoc-dev/docx-corpus?filter=scraper-v*&label=scraper)](https://github.com/superdoc-dev/docx-corpus/releases)
+[![CLI](https://img.shields.io/github/v/release/superdoc-dev/docx-corpus?filter=cli-v*&label=cli)](https://github.com/superdoc-dev/docx-corpus/releases)
 [![CDX Filter](https://img.shields.io/github/v/release/superdoc-dev/docx-corpus?filter=cdx-filter-v*&label=cdx-filter)](https://github.com/superdoc-dev/docx-corpus/releases)
 [![codecov](https://codecov.io/gh/superdoc-dev/docx-corpus/graph/badge.svg)](https://codecov.io/gh/superdoc-dev/docx-corpus)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -23,21 +23,24 @@ Document rendering is hard. Microsoft Word has decades of edge cases, quirks, an
 ## How It Works
 
 ```
+Phase 1: Index Filtering (Lambda)
 ┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
-│  Common Crawl    │      │  cdx-filter      │      │  scraper         │
-│  (S3 bucket)     │ ───► │  (Lambda)        │ ───► │  (Bun)           │
-│                  │      │                  │      │                  │
-│  CDX indexes     │      │  Filters .docx   │      │  Downloads WARC  │
-│  WARC archives   │      │  Writes to R2    │      │  Validates docx  │
+│  Common Crawl    │      │  cdx-filter      │      │  Cloudflare R2   │
+│  (S3)            │ ───► │  (Lambda)        │ ───► │                  │
+│                  │      │                  │      │  cdx-filtered/   │
+│  CDX indexes     │      │  Filters .docx   │      │  *.jsonl         │
 └──────────────────┘      └──────────────────┘      └──────────────────┘
-                                   │                         │
-                                   ▼                         ▼
-                          ┌──────────────────┐      ┌──────────────────┐
-                          │  Cloudflare R2   │      │  Storage         │
-                          │                  │      │                  │
-                          │  cdx-filtered/   │      │  Local or R2     │
-                          │  *.jsonl         │      │  documents/      │
-                          └──────────────────┘      └──────────────────┘
+
+Phase 2: Document Collection (CLI)
+┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
+│  Cloudflare R2   │      │  corpus CLI      │      │  Storage         │
+│                  │ ───► │  (Bun)           │ ───► │                  │
+│  cdx-filtered/   │      │                  │      │  Local or R2     │
+│  *.jsonl         │      │  Downloads WARC  │      │  documents/      │
+├──────────────────┤      │  Validates docx  │      └──────────────────┘
+│  Common Crawl    │ ───► │  Deduplicates    │
+│  WARC archives   │      │                  │
+└──────────────────┘      └──────────────────┘
 ```
 
 ### Why Common Crawl?
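
Note on the diagram change: the two-phase pipeline it describes (filter index records for .docx, then download, validate, and deduplicate) could be sketched roughly as below. This is an illustration only, not code from the repository. The `CdxRecord` field names, the MIME-type check, the ZIP magic-byte test, and SHA-256 content hashing are all assumptions about how such steps might be done.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of one line in cdx-filtered/*.jsonl (field names assumed).
interface CdxRecord {
  url: string;
  mime: string;
  filename: string; // path of the WARC file holding the capture
  offset: number;   // byte offset of the record inside that WARC file
  length: number;   // record length in bytes
}

const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

// Phase 1 (assumed logic): keep index records that look like Word documents.
function isDocxRecord(record: CdxRecord): boolean {
  const path = (record.url.split(/[?#]/)[0] ?? "").toLowerCase();
  return record.mime === DOCX_MIME || path.endsWith(".docx");
}

// Phase 2: HTTP Range header selecting exactly one WARC record (end is inclusive).
function warcRange(record: CdxRecord): string {
  return `bytes=${record.offset}-${record.offset + record.length - 1}`;
}

// Cheap validity check: a .docx is a ZIP archive, which starts with "PK\x03\x04".
// A full validation would also inspect the archive contents.
function looksLikeDocx(bytes: Uint8Array): boolean {
  return (
    bytes.length >= 4 &&
    bytes[0] === 0x50 &&
    bytes[1] === 0x4b &&
    bytes[2] === 0x03 &&
    bytes[3] === 0x04
  );
}

// Content-hash deduplication: returns true only the first time a payload is seen.
function makeDeduper(): (bytes: Uint8Array) => boolean {
  const seen = new Set<string>();
  return (bytes) => {
    const digest = createHash("sha256").update(bytes).digest("hex");
    if (seen.has(digest)) return false;
    seen.add(digest);
    return true;
  };
}
```

In this sketch a collector would call `isDocxRecord` while scanning the index, request each surviving record with the `warcRange` header, and store a payload only if `looksLikeDocx` passes and the deduper reports it as new. Hashing the content rather than the URL means the same file served from two URLs would be stored once.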
