Harden bounded authority crawl: atomic dequeue + per-corpus Analysis reuse (#2027)#2052
Conversation
…reuse Addresses the code review in issue #2027 of the Phase-5 BFS crawl engine. Issue #1 (atomic dequeue): AuthorityFrontierService.dequeue_queued() was a plain filter(discovery_state="queued") read — the in_progress transition only happened later in discover_and_bootstrap, leaving a window where two concurrent crawl_authorities tasks could dequeue and bootstrap the same frontier row (wasted provider calls, distorted counters). It now claims the rows it returns inside a single SELECT ... FOR UPDATE SKIP LOCKED transaction, flipping them to in_progress; a second worker skips locked rows and grabs the next ones. Issue #2 (Analysis-per-section bloat): every section of an authority bootstraps into ONE corpus (the provider title is a constant, so all usc-* sections land in the single "United States Code" corpus), so the BFS calls apply() on that corpus once per ingested section. Each call previously minted a fresh Analysis via _get_analysis. crawl() now caches the Analysis the first apply creates per corpus and threads it back through apply(analysis=...), capping it at one provenance row per corpus. Issue #3 (blocked_by_bound): clarified in comments that min_demand_or_depth is populated only on the frontier_drained stop — where every residual queued row is provably bound-excluded — and intentionally not on the max_authorities / token_budget early stops, whose unreached-but-eligible rows are accounted for by the frontier_residual census instead. Minor: crawl_authorities / acrawl_authorities tool params now apply the C.CRAWL_DEFAULT_* constants uniformly instead of None sentinels for two of five. Regression tests: dequeue atomically claims returned rows / leaves filtered-out rows queued (test_authority_frontier.py); a crawl ingesting multiple sections of one authority reuses a single provenance Analysis (test_crawl_authorities.py ApplyAnalysisReuseTests). Closes #2027
Code ReviewThis PR addresses three well-scoped issues from #2027: an atomic dequeue claim for the frontier, per-corpus 1.
|
The new test typed the captured-analyses list as list[object], so mypy rejected the .id access in the reuse assertion ([attr-defined]). Type it list[Analysis | None] and narrow the reused element with an assert before reading .id.
Code ReviewThis PR hardens the bounded authority crawl in three areas: an atomic 1 —
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Summary
Addresses the code review in #2027 (Authority Discovery Phase 5 — Bounded Recursive Crawl). I re-verified each flagged item against the current codebase and fixed the three substantive ones plus the minor sentinel observation.
Issue #1 —
dequeue_queuedhas no atomic dequeue guard (moderate) ✅AuthorityFrontierService.dequeue_queued()was a plainfilter(discovery_state="queued")read; thein_progresstransition only happened later insidediscover_and_bootstrap. So two concurrentcrawl_authoritiestasks (e.g. two manual triggers on the same corpus) could dequeue the same frontier row anddiscover_and_bootstrapit twice — wasted provider calls and distorted summary counters.Fix: the method now claims the rows it returns inside a single
SELECT … FOR UPDATE SKIP LOCKEDtransaction, flipping them toin_progress.skip_lockedlets a second worker grab the next available rows instead of blocking. Rows excluded bymax_depth/min_demandare never claimed, so the crawl'sfrontier_drainedresidual census still sees them asqueued.Issue #2 — fresh
Analysisperapply()(low-moderate) ✅ — confirmed, and broader than suspectedThe review asked whether
_get_analysisis idempotent per corpus. It isn't (Analysis.objects.create(...)), and the impact is larger than a per-section row count:provider.titleis a constant ("United States Code","Code of Federal Regulations", …) and the authority corpus isget_or_create(title=corpus_title, creator=user)— so everyusc-*section lands in one corpus. The BFS therefore callsapply()on that single corpus once per ingested section, each call minting a newAnalysis.Fix:
crawl()caches theAnalysisthe first apply creates per corpus (via the existingapply_res["analysis_id"]) and threads it back throughapply(analysis=…)for that corpus's remaining sections — capping it at one provenanceAnalysisper corpus. The writer's claim/finalize rules already make reusing oneAnalysisacross applies safe. The stale "this apply scan is bounded (one small document per section)" comment was corrected.Issue #3 —
blocked_by_boundonly set onfrontier_drained(minor) ✅The review tentatively suggested also populating
blocked_by_bound["min_demand_or_depth"]on themax_authorities/token_budgetstops. That would be incorrect: rows stillqueuedwhen a cap fires were simply not reached and may be perfectly eligible (abovemin_demand, withinmax_depth) — attributing them to a bound is a lie. The attribution is only valid onfrontier_drained, where dequeue returning nothing proves every residualqueuedrow failed the filters. Added precise comments documenting this; thefrontier_residualcensus already covers the early-stop residue (no silent truncation).Minor — sentinel inconsistency ✅
crawl_authorities/acrawl_authorities(corpus_references.py) now apply theC.CRAWL_DEFAULT_*constants to all five bound parameters instead of mixingNonesentinels for two of them with constants for the other three. The tool schema is declared manually intool_registry.py, so the agent-facing surface is unchanged.candidate_keysobservationNo change — the review confirmed the suffix-stripping regex works as intended with good coverage.
Tests
All run green inside the test image (
120 passed):test_authority_frontier.py— new: dequeue atomically claims returned rows (in_progress, second dequeue skips them); filtered-out rows stayqueued.test_crawl_authorities.py::ApplyAnalysisReuseTests— new: a crawl ingesting multiple sections of one authority reuses a single provenanceAnalysis.test_authority_frontier,test_authority_frontier_actions,test_crawl_authorities,test_enrichment_tools,test_authority_discovery_subset.black,isort,flake8clean. Changelog fragment added atchangelog.d/2027-authority-crawl.fixed.md.Closes #2027
Generated by Claude Code