feat: scraper dedup improvements and failed URL retry
- Add cross-crawl duplicate detection (creates dup records when URL exists under different crawl)
- Content-dedup now matches any existing hash, not just uploaded status
- Clean up stale failed-* records when WARC retry succeeds
- Track processedUrls after successful download to prevent same-URL CDX duplicates
- Failed URLs are no longer skipped on re-runs (different WARC capture might succeed)
- Add deleteDocument and getUrlsForCrawl to DbClient
- Document dedup model in CLAUDE.md
The scraper maintains **exact parity** between CDX URLs and database records: every URL in a crawl's CDX files has exactly one record in the `documents` table under that `crawl_id`.
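
A spot-check of that invariant might look like the sketch below, using the `getUrlsForCrawl` method this change adds to `DbClient`; the import path and return type are assumptions made for the sketch.

```typescript
// Sketch of a parity spot-check: every CDX URL should have a record for the
// crawl, and nothing extra. `getUrlsForCrawl` was added to DbClient in this
// change; its exact signature and the module path below are assumptions.
import { DbClient } from "./db-client";

async function checkParity(db: DbClient, crawlId: string, cdxUrls: Set<string>): Promise<void> {
  const recorded = new Set<string>(await db.getUrlsForCrawl(crawlId));

  const missing = [...cdxUrls].filter((u) => !recorded.has(u)); // in CDX, no record
  const extra = [...recorded].filter((u) => !cdxUrls.has(u));   // record, not in CDX

  if (missing.length > 0 || extra.length > 0) {
    throw new Error(`parity broken: ${missing.length} missing, ${extra.length} extra`);
  }
}
```
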
### Document statuses
- `uploaded` — valid .docx saved to R2, ID is `{contentHash}`
- `failed` — WARC download failed or content is invalid docx, ID is `failed-{urlHash}` (download error) or `{contentHash}` (validation error)
- `duplicate` — same content already exists under a different URL, ID is `dup-{urlHash}`
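
As a rough TypeScript sketch, a row under this model could be typed as follows; the type and field names are assumptions, not the project's actual definitions.

```typescript
// Hypothetical shape of a `documents` row. Only the three status values and
// the ID forms come from the doc above; everything else is assumed.
type DocumentStatus = "uploaded" | "failed" | "duplicate";

interface DocumentRecord {
  id: string;             // `{contentHash}`, `failed-{urlHash}`, or `dup-{urlHash}`
  crawlId: string;        // crawl the record was created under
  sourceUrl: string;      // URL of the WARC capture
  status: DocumentStatus;
}
```
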
### ID scheme
IDs are content-addressed for storage mapping (`documents/{id}.docx`):

| Scenario | ID | Reason |
|---|---|---|
| Uploaded | `{sha256(content)}` | Maps to R2 storage key |
| Download failed | `failed-{sha256(url)}` | No content available, use URL hash |
| Content duplicate | `dup-{sha256(url)}` | Content hash would collide with original |
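
The three ID forms can be derived with Node's `crypto` module, roughly as sketched below; the helper names are illustrative, not the scraper's actual functions.

```typescript
import { createHash } from "node:crypto";

// Illustrative ID derivation matching the table above.
const sha256 = (data: string | Buffer): string =>
  createHash("sha256").update(data).digest("hex");

const uploadedId = (content: Buffer) => sha256(content);     // also the R2 key: documents/{id}.docx
const failedId = (url: string) => `failed-${sha256(url)}`;    // no content available, hash the URL
const duplicateId = (url: string) => `dup-${sha256(url)}`;    // content hash belongs to the original
```
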
### Dedup paths
The scraper handles three dedup scenarios in order (a combined sketch follows the list):

1. **URL-dedup** (instant, no download) — URL already in `processedUrls` Set (loaded from all crawls at startup). If the URL exists under a different `crawl_id`, creates a cross-crawl `duplicate` record under the current crawl. If already under the current crawl, silently skips.
2. **Content-dedup** (requires WARC download) — URL is new but content hash matches an existing document. Creates a `duplicate` record pointing to the original.
3. **Same-URL retry** (within same crawl) — Same URL appears multiple times in CDX files (different WARC captures). After a successful WARC download, the URL is added to `processedUrls` so subsequent entries are skipped. Failed downloads do NOT add to `processedUrls`, allowing retry from a different WARC capture.
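
Putting the three paths together, the per-entry flow might look roughly like this sketch; apart from the paths themselves, every function and method name here is an assumption, and the invalid-docx validation path is omitted.

```typescript
import { createHash } from "node:crypto";

// Rough sketch of the per-CDX-entry flow. The three dedup paths come from the
// doc; ScraperDb, downloadFromWarc, uploadToR2, etc. are assumed names.
interface ScraperDb {
  insertDocument(row: { id: string; crawlId: string; sourceUrl: string; status: string }): Promise<void>;
  findByContentHash(hash: string): Promise<boolean>;
}

const hash = (data: string | Buffer) => createHash("sha256").update(data).digest("hex");

async function processCdxEntry(
  url: string,
  crawlId: string,
  db: ScraperDb,
  processedUrls: Set<string>,                                // URLs from all crawls, loaded at startup
  crawlIdForUrl: Map<string, string>,                        // crawl each known URL was first seen under
  downloadFromWarc: (url: string) => Promise<Buffer | null>,
  uploadToR2: (key: string, body: Buffer) => Promise<void>,
): Promise<void> {
  // 1. URL-dedup: instant, no WARC download.
  if (processedUrls.has(url)) {
    if (crawlIdForUrl.get(url) !== crawlId) {
      // Known under a different crawl: record a cross-crawl duplicate here.
      await db.insertDocument({ id: `dup-${hash(url)}`, crawlId, sourceUrl: url, status: "duplicate" });
    }
    return; // already under the current crawl: silently skip
  }

  const content = await downloadFromWarc(url);
  if (content === null) {
    // Download failed: record it, but leave the URL out of processedUrls so a
    // later CDX entry (different WARC capture) can retry it.
    await db.insertDocument({ id: `failed-${hash(url)}`, crawlId, sourceUrl: url, status: "failed" });
    return;
  }

  // 2. Content-dedup: new URL, but the content hash already exists.
  const contentHash = hash(content);
  if (await db.findByContentHash(contentHash)) {
    await db.insertDocument({ id: `dup-${hash(url)}`, crawlId, sourceUrl: url, status: "duplicate" });
  } else {
    await uploadToR2(`documents/${contentHash}.docx`, content);
    await db.insertDocument({ id: contentHash, crawlId, sourceUrl: url, status: "uploaded" });
  }

  // 3. Same-URL retry guard: only successful downloads mark the URL processed.
  processedUrls.add(url);
}
```
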
### Stale record cleanup
When a WARC download succeeds, the scraper deletes any previous `failed-{urlHash}` record for that URL. This prevents duplicate records when a URL fails on one attempt but succeeds on a later retry (since the failed and successful records have different IDs).
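
A minimal sketch of that cleanup, using the `deleteDocument` method this change adds to `DbClient`; deleting by ID, tolerating a missing row, and the module path are assumptions.

```typescript
import { createHash } from "node:crypto";
import { DbClient } from "./db-client"; // assumed module path

// After a successful WARC download for `url`, drop the stale record left by an
// earlier failed attempt. `deleteDocument` comes from this change; its exact
// signature (delete by record ID) is assumed here.
async function cleanupStaleFailure(db: DbClient, url: string): Promise<void> {
  const urlHash = createHash("sha256").update(url).digest("hex");
  await db.deleteDocument(`failed-${urlHash}`);
}
```
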
### Re-run safety
Running the scraper on the same crawl again is safe (a sketch of the `--force` guard follows the list):
- `--force` flag: re-downloads everything but checks `source_url` before creating dup records, so existing records aren't duplicated
- Without `--force`: all existing URLs are in `processedUrls` and skipped instantly
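
A sketch of the `--force` guard from the first bullet; the URL is still re-downloaded, and only the decision to write a record is gated. `hasRecordForSourceUrl` is an assumed helper, not a documented `DbClient` method.

```typescript
// Under --force every URL is re-processed; this guard only decides whether a
// new record may be written for it. `hasRecordForSourceUrl` is assumed.
async function shouldWriteRecord(
  db: { hasRecordForSourceUrl(crawlId: string, sourceUrl: string): Promise<boolean> },
  crawlId: string,
  sourceUrl: string,
): Promise<boolean> {
  return !(await db.hasRecordForSourceUrl(crawlId, sourceUrl)); // existing record: don't duplicate it
}
```
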
## Database
Single `documents` table in PostgreSQL (NeonDB) with pgvector. All pipeline stages write to this table.