feat: add --retry-failed flag, skip failed URLs by default
Default runs now skip all known URLs (uploaded + duplicate + failed)
for fast scans. Use --retry-failed to re-download only failures, or
--force to re-process everything from scratch.
CLAUDE.md: 5 additions & 4 deletions
@@ -54,11 +54,11 @@ IDs are content-addressed for storage mapping (`documents/{id}.docx`):
The scraper handles three dedup scenarios in order:
- 1. **URL-dedup** (instant, no download) — URL already in `processedUrls` Set (loaded from all crawls at startup). If the URL exists under a different `crawl_id`, creates a cross-crawl `duplicate` record under the current crawl. If already under the current crawl, silently skips.
+ 1. **URL-dedup** (instant, no download) — URL already in `processedHashes` Set (md5 hashes loaded from all crawls at startup). Includes uploaded, duplicate, AND failed URLs by default. If the URL exists under a different `crawl_id`, creates a cross-crawl `duplicate` record under the current crawl. If already under the current crawl, silently skips.
2. **Content-dedup** (requires WARC download) — URL is new but content hash matches an existing document. Creates a `duplicate` record pointing to the original.
- 3. **Same-URL retry** (within same crawl) — Same URL appears multiple times in CDX files (different WARC captures). After a successful WARC download, the URL is added to `processedUrls` so subsequent entries are skipped. Failed downloads do NOT add to `processedUrls`, allowing retry from a different WARC capture.
+ 3. **Same-URL retry** (within same crawl) — Same URL appears multiple times in CDX files (different WARC captures). After a successful WARC download, the URL is added to `processedHashes` so subsequent entries are skipped.
### Stale record cleanup
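For reference, a minimal TypeScript sketch of the URL-dedup path described in the updated section, assuming a Node-style scraper. `processedHashes`, the md5 hashing, and the cross-crawl `duplicate` record come from the doc; the `DocRecord` shape and the in-memory `records` store are hypothetical stand-ins for the scraper's real storage layer.

```ts
import { createHash } from "node:crypto";

// Hypothetical in-memory stand-in for the scraper's record store;
// the real record shape and persistence layer are not shown in this diff.
interface DocRecord {
  crawl_id: string;
  source_url: string;
  status: "uploaded" | "duplicate" | "failed";
}
const records: DocRecord[] = [];

const md5 = (s: string) => createHash("md5").update(s).digest("hex");

/**
 * URL-dedup (scenario 1): instant, no WARC download.
 * `processedHashes` holds md5 hashes of every known URL
 * (uploaded + duplicate + failed), loaded from all crawls at startup.
 * Returns true when the caller should skip this URL.
 */
function urlDedup(url: string, currentCrawlId: string, processedHashes: Set<string>): boolean {
  const urlHash = md5(url);
  if (!processedHashes.has(urlHash)) return false; // unseen URL: caller downloads the WARC

  const existing = records.find((r) => md5(r.source_url) === urlHash);
  if (existing && existing.crawl_id !== currentCrawlId) {
    // Known only from another crawl: record a cross-crawl `duplicate` under the current crawl.
    records.push({ crawl_id: currentCrawlId, source_url: url, status: "duplicate" });
  }
  // Already known under the current crawl: silently skip.
  return true;
}
```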
@@ -67,8 +67,9 @@ When a WARC download succeeds, the scraper deletes any previous `failed-{urlHash
### Re-run safety
Running the scraper on the same crawl again is safe:
- - `--force` flag: re-downloads everything but checks `source_url` before creating dup records, so existing records aren't duplicated
- - Without `--force`: all existing URLs are in `processedUrls` and skipped instantly
+ - `--force`: re-downloads everything from scratch
+ - `--retry-failed`: re-downloads only previously failed URLs
+ - Default: all known URLs (uploaded + duplicate + failed) are skipped instantly
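And a sketch of the startup preload implied by the new re-run bullets, again in TypeScript. The flag behaviour follows the commit message; `KnownUrl`, its `status` field, and `loadProcessedHashes` are hypothetical names, not the scraper's actual API.

```ts
import { createHash } from "node:crypto";

type Status = "uploaded" | "duplicate" | "failed";

// Hypothetical shape of the records loaded from all prior crawls at startup.
interface KnownUrl {
  source_url: string;
  status: Status;
}

const md5 = (s: string) => createHash("md5").update(s).digest("hex");

// --force        -> preload nothing: everything is re-processed from scratch
// --retry-failed -> preload uploaded + duplicate only, so failed URLs are downloaded again
// default        -> preload uploaded + duplicate + failed: all known URLs skipped instantly
function loadProcessedHashes(
  known: KnownUrl[],
  opts: { force?: boolean; retryFailed?: boolean },
): Set<string> {
  if (opts.force) return new Set();
  const skippable = opts.retryFailed
    ? known.filter((k) => k.status !== "failed")
    : known;
  return new Set(skippable.map((k) => md5(k.source_url)));
}
```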