README.md: 23 additions & 3 deletions (+23 -3)
@@ -38,6 +38,9 @@ bun install
 All pipeline stages are accessible through a single CLI:
 
 ```bash
+corpus cdx-filter # Show available vs filtered crawls
+corpus cdx-filter --crawl CC-MAIN-2026-08 # Filter a specific crawl via Lambda
+corpus cdx-filter --latest 3 # Filter 3 newest missing crawls
 corpus crawls # List available crawls from R2
 corpus scrape --crawl CC-MAIN-2025-51 # Scrape a specific crawl
 corpus scrape --crawl 3 --batch 100 # Latest 3 crawls, 100 docs each
@@ -89,11 +92,25 @@ db/
 Pre-filters Common Crawl CDX indexes for `.docx` URLs. Runs in AWS Lambda (us-east-1) for direct S3 access — minutes instead of days.
 
 ```bash
-cd apps/cdx-filter
-./invoke-all.sh CC-MAIN-2025-51
+corpus cdx-filter # Show what's available vs filtered
+corpus cdx-filter --crawl CC-MAIN-2026-08 # Filter one crawl
+corpus cdx-filter --all # Filter all missing crawls
 ```
 
-See [apps/cdx-filter/README.md](apps/cdx-filter/README.md) for setup.
+**AWS setup**: The Lambda function needs AWS credentials configured locally. See [apps/cdx-filter/README.md](apps/cdx-filter/README.md) for Lambda deployment.
+
+```bash
+# Option 1: AWS CLI profile (recommended)
+aws configure --profile docx-corpus
+export AWS_PROFILE=docx-corpus
+
+# Option 2: Environment variables
+export AWS_ACCESS_KEY_ID=...
+export AWS_SECRET_ACCESS_KEY=...
+export AWS_REGION=us-east-1
+```
+
+The AWS IAM user/role needs `lambda:InvokeFunction` permission on the `cdx-filter` function.
 
 ### 2. Scraping
 
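Reviewer note on the hunk above: the `lambda:InvokeFunction` permission it mentions could be granted with a policy along the following lines. This is a minimal sketch, not the project's actual policy. `buildInvokePolicy` is a hypothetical helper and the account ID `123456789012` is a placeholder; only the action name, the `cdx-filter` function name, and the `us-east-1` region come from the README text.

```typescript
// Hypothetical helper (not part of the corpus CLI): builds a minimal IAM
// policy document that allows invoking exactly one Lambda function.
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string;
  Resource: string;
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

function buildInvokePolicy(
  region: string,
  accountId: string, // placeholder in the example below; use your own account ID
  functionName: string,
): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: "lambda:InvokeFunction",
        // Lambda function ARN layout: arn:aws:lambda:<region>:<account>:function:<name>
        Resource: `arn:aws:lambda:${region}:${accountId}:function:${functionName}`,
      },
    ],
  };
}

// Print the policy as JSON so it can be pasted into an IAM policy editor.
const policy = buildInvokePolicy("us-east-1", "123456789012", "cdx-filter");
console.log(JSON.stringify(policy, null, 2));
```

Scoping `Resource` to the single function ARN, rather than `*`, keeps the credentials used by the CLI from invoking unrelated functions in the same account.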
@@ -205,6 +222,9 @@ STORAGE_PATH=./corpus
 # Embeddings (optional)
 GOOGLE_API_KEY=
 
+# AWS (for cdx-filter Lambda invocation)
+AWS_PROFILE=docx-corpus # or set AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY
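Reviewer note on the environment hunk above: since either `AWS_PROFILE` or the access-key pair may be set, a pre-flight check along these lines could report which option is configured before a Lambda invocation is attempted. This is only a sketch under stated assumptions. `detectCredentialSources` is a hypothetical function, not something the `corpus` CLI is known to contain, and it deliberately does not decide which source wins, because precedence depends on the SDK or CLI that resolves the credentials.

```typescript
// Hypothetical pre-flight check (not part of the corpus CLI): lists which of
// the two credential options described in the README are present in `env`.
type CredentialSource = "profile" | "env-keys";

function detectCredentialSources(
  env: Record<string, string | undefined>,
): CredentialSource[] {
  const sources: CredentialSource[] = [];
  // Option 1: a named AWS CLI profile.
  if (env.AWS_PROFILE) sources.push("profile");
  // Option 2: static keys; both halves are required to be usable.
  if (env.AWS_ACCESS_KEY_ID && env.AWS_SECRET_ACCESS_KEY) sources.push("env-keys");
  return sources;
}

// Example with a literal environment object; pass the real process
// environment in an actual script.
const found = detectCredentialSources({ AWS_PROFILE: "docx-corpus" });
console.log(found.length > 0 ? `Configured: ${found.join(", ")}` : "No AWS credentials configured");
```

An empty result means neither option from the README is set, which is a cheaper failure to diagnose than a rejected Lambda call.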