Commit d8a29aa

docs: add cdx-filter command and AWS setup to README
1 parent a0478c2 commit d8a29aa

1 file changed

Lines changed: 23 additions & 3 deletions

README.md

@@ -38,6 +38,9 @@ bun install
 All pipeline stages are accessible through a single CLI:

 ```bash
+corpus cdx-filter # Show available vs filtered crawls
+corpus cdx-filter --crawl CC-MAIN-2026-08 # Filter a specific crawl via Lambda
+corpus cdx-filter --latest 3 # Filter 3 newest missing crawls
 corpus crawls # List available crawls from R2
 corpus scrape --crawl CC-MAIN-2025-51 # Scrape a specific crawl
 corpus scrape --crawl 3 --batch 100 # Latest 3 crawls, 100 docs each
@@ -89,11 +92,25 @@ db/
 Pre-filters Common Crawl CDX indexes for `.docx` URLs. Runs in AWS Lambda (us-east-1) for direct S3 access — minutes instead of days.

 ```bash
-cd apps/cdx-filter
-./invoke-all.sh CC-MAIN-2025-51
+corpus cdx-filter # Show what's available vs filtered
+corpus cdx-filter --crawl CC-MAIN-2026-08 # Filter one crawl
+corpus cdx-filter --all # Filter all missing crawls
 ```

-See [apps/cdx-filter/README.md](apps/cdx-filter/README.md) for setup.
+**AWS setup**: The Lambda function needs AWS credentials configured locally. See [apps/cdx-filter/README.md](apps/cdx-filter/README.md) for Lambda deployment.
+
+```bash
+# Option 1: AWS CLI profile (recommended)
+aws configure --profile docx-corpus
+export AWS_PROFILE=docx-corpus
+
+# Option 2: Environment variables
+export AWS_ACCESS_KEY_ID=...
+export AWS_SECRET_ACCESS_KEY=...
+export AWS_REGION=us-east-1
+```
+
+The AWS IAM user/role needs `lambda:InvokeFunction` permission on the `cdx-filter` function.

 ### 2. Scraping

@@ -205,6 +222,9 @@ STORAGE_PATH=./corpus
 # Embeddings (optional)
 GOOGLE_API_KEY=

+# AWS (for cdx-filter Lambda invocation)
+AWS_PROFILE=docx-corpus # or set AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY
+
 # Classification (for LLM labeling step only)
 ANTHROPIC_API_KEY=
 ```
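
Reviewer note: the AWS setup added in this commit offers two credential options, a named CLI profile or exported access keys. As a minimal sketch, a preflight check along the following lines could confirm that one of them is present before running `corpus cdx-filter`. The function name `check_aws_credentials` and its output strings are hypothetical and not part of the repository. It only inspects environment variables, so a default profile stored in `~/.aws/credentials` would go undetected, and it does not verify the `lambda:InvokeFunction` permission.

```shell
# Hypothetical helper (an assumption, not repository code): reports which of
# the two credential options from the README the current shell provides.
check_aws_credentials() {
  if [ -n "${AWS_PROFILE:-}" ]; then
    # Option 1: a named AWS CLI profile, e.g. docx-corpus
    echo "profile:${AWS_PROFILE}"
  elif [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    # Option 2: static keys; the README also exports AWS_REGION=us-east-1
    echo "env-keys:${AWS_REGION:-region-unset}"
  else
    # Neither option is visible in the environment
    echo "missing"
    return 1
  fi
}
```

One possible use is to gate the CLI on the check, for example `check_aws_credentials && corpus cdx-filter --latest 3`, so that a missing setup fails before any Lambda invocation is attempted.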
