 <img width="400" alt="logo" src="https://github.com/user-attachments/assets/ea105e9e-00d0-4d48-a2a4-006cc4e89848" />

-[![Scraper](https://img.shields.io/github/v/release/superdoc-dev/docx-corpus?filter=scraper-v*&label=scraper)](https://github.com/superdoc-dev/docx-corpus/releases)
+[![CLI](https://img.shields.io/github/v/release/superdoc-dev/docx-corpus?filter=cli-v*&label=cli)](https://github.com/superdoc-dev/docx-corpus/releases)
 [![CDX Filter](https://img.shields.io/github/v/release/superdoc-dev/docx-corpus?filter=cdx-filter-v*&label=cdx-filter)](https://github.com/superdoc-dev/docx-corpus/releases)
 [![codecov](https://codecov.io/gh/superdoc-dev/docx-corpus/graph/badge.svg)](https://codecov.io/gh/superdoc-dev/docx-corpus)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -23,21 +23,24 @@ Document rendering is hard. Microsoft Word has decades of edge cases, quirks, an
 ## How It Works

 ```
+Phase 1: Index Filtering (Lambda)
 ┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
-│  Common Crawl    │      │  cdx-filter      │      │  scraper         │
-│  (S3 bucket)     │ ───► │  (Lambda)        │ ───► │  (Bun)           │
-│                  │      │                  │      │                  │
-│  CDX indexes     │      │  Filters .docx   │      │  Downloads WARC  │
-│  WARC archives   │      │  Writes to R2    │      │  Validates docx  │
+│  Common Crawl    │      │  cdx-filter      │      │  Cloudflare R2   │
+│  (S3)            │ ───► │  (Lambda)        │ ───► │                  │
+│                  │      │                  │      │  cdx-filtered/   │
+│  CDX indexes     │      │  Filters .docx   │      │  *.jsonl         │
 └──────────────────┘      └──────────────────┘      └──────────────────┘
-                                   │                         │
-                                   ▼                         ▼
-                          ┌──────────────────┐      ┌──────────────────┐
-                          │  Cloudflare R2   │      │  Storage         │
-                          │                  │      │                  │
-                          │  cdx-filtered/   │      │  Local or R2     │
-                          │  *.jsonl         │      │  documents/      │
-                          └──────────────────┘      └──────────────────┘
+
+Phase 2: Document Collection (CLI)
+┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
+│  Cloudflare R2   │      │  corpus CLI      │      │  Storage         │
+│                  │ ───► │  (Bun)           │ ───► │                  │
+│  cdx-filtered/   │      │                  │      │  Local or R2     │
+│  *.jsonl         │      │  Downloads WARC  │      │  documents/      │
+├──────────────────┤      │  Validates docx  │      └──────────────────┘
+│  Common Crawl    │ ───► │  Deduplicates    │
+│  WARC archives   │      │                  │
+└──────────────────┘      └──────────────────┘
 ```

 ### Why Common Crawl?