Skip to content

Harden bounded authority crawl: atomic dequeue + per-corpus Analysis reuse (#2027)#2052

Merged
JSv4 merged 2 commits into
mainfrom
claude/serene-pascal-4103pn
Jun 24, 2026
Merged

Harden bounded authority crawl: atomic dequeue + per-corpus Analysis reuse (#2027)#2052
JSv4 merged 2 commits into
mainfrom
claude/serene-pascal-4103pn

Conversation

@JSv4

@JSv4 JSv4 commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

Addresses the code review in #2027 (Authority Discovery Phase 5 — Bounded Recursive Crawl). I re-verified each flagged item against the current codebase and fixed the three substantive ones plus the minor sentinel observation.

Issue #1dequeue_queued has no atomic dequeue guard (moderate) ✅

AuthorityFrontierService.dequeue_queued() was a plain filter(discovery_state="queued") read; the in_progress transition only happened later inside discover_and_bootstrap. So two concurrent crawl_authorities tasks (e.g. two manual triggers on the same corpus) could dequeue the same frontier row and discover_and_bootstrap it twice — wasted provider calls and distorted summary counters.

Fix: the method now claims the rows it returns inside a single SELECT … FOR UPDATE SKIP LOCKED transaction, flipping them to in_progress. skip_locked lets a second worker grab the next available rows instead of blocking. Rows excluded by max_depth/min_demand are never claimed, so the crawl's frontier_drained residual census still sees them as queued.

Issue #2 — fresh Analysis per apply() (low-moderate) ✅ — confirmed, and broader than suspected

The review asked whether _get_analysis is idempotent per corpus. It isn't (Analysis.objects.create(...)), and the impact is larger than a per-section row count: provider.title is a constant ("United States Code", "Code of Federal Regulations", …) and the authority corpus is get_or_create(title=corpus_title, creator=user) — so every usc-* section lands in one corpus. The BFS therefore calls apply() on that single corpus once per ingested section, each call minting a new Analysis.

Fix: crawl() caches the Analysis the first apply creates per corpus (via the existing apply_res["analysis_id"]) and threads it back through apply(analysis=…) for that corpus's remaining sections — capping it at one provenance Analysis per corpus. The writer's claim/finalize rules already make reusing one Analysis across applies safe. The stale "this apply scan is bounded (one small document per section)" comment was corrected.

Issue #3blocked_by_bound only set on frontier_drained (minor) ✅

The review tentatively suggested also populating blocked_by_bound["min_demand_or_depth"] on the max_authorities/token_budget stops. That would be incorrect: rows still queued when a cap fires were simply not reached and may be perfectly eligible (above min_demand, within max_depth) — attributing them to a bound is a lie. The attribution is only valid on frontier_drained, where dequeue returning nothing proves every residual queued row failed the filters. Added precise comments documenting this; the frontier_residual census already covers the early-stop residue (no silent truncation).

Minor — sentinel inconsistency ✅

crawl_authorities / acrawl_authorities (corpus_references.py) now apply the C.CRAWL_DEFAULT_* constants to all five bound parameters instead of mixing None sentinels for two of them with constants for the other three. The tool schema is declared manually in tool_registry.py, so the agent-facing surface is unchanged.

candidate_keys observation

No change — the review confirmed the suffix-stripping regex works as intended with good coverage.

Tests

All run green inside the test image (120 passed):

  • test_authority_frontier.pynew: dequeue atomically claims returned rows (in_progress, second dequeue skips them); filtered-out rows stay queued.
  • test_crawl_authorities.py::ApplyAnalysisReuseTestsnew: a crawl ingesting multiple sections of one authority reuses a single provenance Analysis.
  • Existing suites unaffected: test_authority_frontier, test_authority_frontier_actions, test_crawl_authorities, test_enrichment_tools, test_authority_discovery_subset.

black, isort, flake8 clean. Changelog fragment added at changelog.d/2027-authority-crawl.fixed.md.

Closes #2027


Generated by Claude Code

…reuse

Addresses the code review in issue #2027 of the Phase-5 BFS crawl engine.

Issue #1 (atomic dequeue): AuthorityFrontierService.dequeue_queued() was a
plain filter(discovery_state="queued") read — the in_progress transition only
happened later in discover_and_bootstrap, leaving a window where two concurrent
crawl_authorities tasks could dequeue and bootstrap the same frontier row
(wasted provider calls, distorted counters). It now claims the rows it returns
inside a single SELECT ... FOR UPDATE SKIP LOCKED transaction, flipping them to
in_progress; a second worker skips locked rows and grabs the next ones.

Issue #2 (Analysis-per-section bloat): every section of an authority
bootstraps into ONE corpus (the provider title is a constant, so all usc-*
sections land in the single "United States Code" corpus), so the BFS calls
apply() on that corpus once per ingested section. Each call previously minted a
fresh Analysis via _get_analysis. crawl() now caches the Analysis the first
apply creates per corpus and threads it back through apply(analysis=...),
capping it at one provenance row per corpus.

Issue #3 (blocked_by_bound): clarified in comments that
min_demand_or_depth is populated only on the frontier_drained stop — where
every residual queued row is provably bound-excluded — and intentionally not on
the max_authorities / token_budget early stops, whose unreached-but-eligible
rows are accounted for by the frontier_residual census instead.

Minor: crawl_authorities / acrawl_authorities tool params now apply the
C.CRAWL_DEFAULT_* constants uniformly instead of None sentinels for two of five.

Regression tests: dequeue atomically claims returned rows / leaves filtered-out
rows queued (test_authority_frontier.py); a crawl ingesting multiple sections
of one authority reuses a single provenance Analysis (test_crawl_authorities.py
ApplyAnalysisReuseTests).

Closes #2027
@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown

Code Review

This PR addresses three well-scoped issues from #2027: an atomic dequeue claim for the frontier, per-corpus Analysis reuse, and uniform tool-function defaults. The concurrency fix for dequeue_queued is well-reasoned — SELECT … FOR UPDATE SKIP LOCKED is the right primitive, the bulk_update correctly stamps modified to match the single-row mark() writer, and the two new tests cover the core behaviour. A few issues worth addressing before merge:


1. per_jurisdiction_cap=0 silently parks every row — crawl_authorities_service.py:191

The PR removes the None→constant substitution for per_jurisdiction_cap and token_budget in the tool wrapper, which is correct for the stated goal. However, it doesn't guard against 0 being passed for per_jurisdiction_cap. At line 191:

if per_juris[jkey] >= per_jurisdiction_cap:   # per_juris starts at 0

A Counter starts at 0 for every key, so 0 >= 0 is True on the very first row encountered in each jurisdiction. The crawl terminates with zero ingests, stop_reason = "frontier_drained", and no error is raised — a completely silent failure. token_budget <= 0 is documented as "unbounded", but per_jurisdiction_cap <= 0 has no such escape hatch. Either document that 0 means "block all" and enforce >= 1 at the call sites, or add a <= 0 guard to skip the cap check (matching the token_budget pattern on line 163).


2. Analysis.objects.get() bypasses service layer and raises unhandled DoesNotExistcrawl_authorities_service.py:248

apply_analyses[authority_corpus_id] = Analysis.objects.get(
    pk=new_analysis_id
)

Two issues here:

a) Service-layer policy. CLAUDE.md: "Code with a user context … must reach models through opencontractserver/<app>/services/." crawl() takes creator_id and is a user-context classmethod. BaseService.get_or_none (inherited by AnalysisService) is the policy-compliant path.

b) Unguarded DoesNotExist. If the Analysis row is deleted between apply() returning and this .get() executing (e.g. by a concurrent admin action), DoesNotExist propagates unhandled, aborts the crawl, and leaves all currently-claimed frontier rows stranded in in_progress with no requeue path. Using AnalysisService.get_or_none(pk=new_analysis_id) returns None instead of raising, letting you log and continue (or at least surface a useful error).


3. No cleanup of claimed rows on BFS exception — crawl_authorities_service.py:152

dequeue_queued() now atomically marks rows in_progress before discover_and_bootstrap() runs. If discover_and_bootstrap() (line 199) raises an unhandled exception, the BFS loop exits without calling mark(row, …) to a terminal state, leaving the row permanently in_progress. Pre-PR, a failure before discover_and_bootstrap's own in-progress transition left the row queued and therefore retryable on the next crawl run.

A try/except around the loop body that calls AuthorityFrontierService.mark(row, "failed") on unexpected exceptions (or a try/finally to requeue) would preserve the retryability property the old non-atomic approach had.


4. Inline import without circular-import justification — crawl_authorities_service.py:148

# inside crawl() body:
from opencontractserver.analyzer.models import Analysis

The sibling enrichment_service.py imports from opencontractserver.analyzer.models at module level without issue, so there's no obvious circular-import reason for deferring this. The inline placement makes the dependency invisible at the top of the file and is inconsistent with module-level imports elsewhere. Move to the top-of-file import block or add a comment explaining the circular-import constraint if one exists.


5. Old None-sentinel callers silently broken — corpus_references.py:250-251

The old signatures allowed crawl_authorities(per_jurisdiction_cap=None) as a "use the constant default" idiom (the conditional in the wrapper substituted the constant). Any Python caller using that pattern now silently passes None through to CrawlAuthoritiesService.crawl(), where per_juris[jkey] >= None raises TypeError at line 191. No callers in this diff use None explicitly, but it's a latent break worth noting if the function is called from scripts or tests outside this repo.

The new test typed the captured-analyses list as list[object], so mypy
rejected the .id access in the reuse assertion ([attr-defined]). Type it
list[Analysis | None] and narrow the reused element with an assert before
reading .id.
@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown

Code Review

This PR hardens the bounded authority crawl in three areas: an atomic SELECT FOR UPDATE SKIP LOCKED claim in dequeue_queued, per-corpus Analysis reuse to cap provenance bloat, and a uniform tool signature replacing mixed None-sentinel/constant defaults. The core mechanisms are sound — the SELECT/FOR UPDATE pattern is the right fix for the TOCTOU race, and the Analysis-reuse cache is correctly placed before the BFS loop. A few issues are worth addressing before merge.


1 — corpus_analysis_tasks.py not updated — asymmetric None-sentinel API (medium)

opencontractserver/tasks/corpus_analysis_tasks.py, lines 124–125, 168–174

The Celery task still declares:

per_jurisdiction_cap: int | None = None,
token_budget: int | None = None,

and translates them via ternary before forwarding to CrawlAuthoritiesService.crawl. The PR updated corpus_references.py to drop None sentinels but left the task layer untouched. This creates two problems:

  • A direct caller passing CrawlAuthoritiesService.crawl(per_jurisdiction_cap=None) (bypassing the task) gets a TypeError at line 191 (Counter()[key] >= None). The task ternary is the only guard.
  • A caller passing per_jurisdiction_cap=0 through the task reaches the service as 0 (because 0 is not None). Counter() defaults to 0, so per_juris[jkey] >= 0 is always true on the first row — every row is immediately parked at deferred_cap and nothing is ingested, with stop_reason = "frontier_drained" obscuring the real cause.

The task signature should match the service (use int = C.CRAWL_DEFAULT_* and drop the ternaries), or the service should document that 0 is not a valid per-jurisdiction cap.


2 — discover_and_bootstrap re-marks in_progress after dequeue_queued already did it (low)

opencontractserver/enrichment/services/authority_discovery_service.py, line 237

discover_and_bootstrap unconditionally calls AuthorityFrontierService.mark(frontier_row, "in_progress"). After this PR, dequeue_queued already atomically bulk-updates the row to in_progress before returning it. The normal crawl path therefore produces three separate DB writes per row: bulk_update (claim) → save(provider)mark(in_progress). The second mark also re-stamps last_attempt at a later clock instant than the claim, so last_attempt reflects the provider-selection moment rather than the dequeue moment, which is actually slightly more accurate — but both timestamps exist in-flight without documentation.

The mark("in_progress") call can't be removed from discover_and_bootstrap because discover_selected calls it directly without going through dequeue_queued. Worth noting in the docstring of discover_and_bootstrap that the in_progress mark is redundant (but safe) when called from the standard dequeue_queued path.


3 — Gratuitous Analysis.objects.get() query per new authority corpus (low)

opencontractserver/enrichment/services/crawl_authorities_service.py, lines 246–250

apply_analyses[authority_corpus_id] = Analysis.objects.get(pk=new_analysis_id)

EnrichmentService.apply() already holds the Analysis object (it creates it) and returns only analysis.id in the dict (line 749 of enrichment_service.py). The crawl then re-fetches it by PK. This is one extra SELECT per new authority corpus. A simpler fix: add "analysis": analysis alongside "analysis_id" in apply()'s return dict and read apply_res.get("analysis") in the crawl.


4 — Deferred Analysis import inside crawl() body (low / conventions)

opencontractserver/enrichment/services/crawl_authorities_service.py, line 148

from opencontractserver.analyzer.models import Analysis

The two Analyzer deferred imports above it (lines 44, 58) are inside classmethods that avoid a circular import. Analysis has no such constraint — it's imported only for the type annotation on apply_analyses. Python caches imports, so there's no performance impact, but it signals a non-existent circular dependency to readers and means the import runs inside the hot crawl() function call. Move it to the module-level import block alongside Analyzer's peer models.


5 — token_budget=0 magic number in tests (low / conventions)

opencontractserver/tests/test_crawl_authorities.py, multiple test calls

token_budget=0 is the "unbounded" sentinel (the crawl loop checks token_budget > 0 before comparing). C.CRAWL_DEFAULT_TOKEN_BUDGET is 2_000_000 — far from 0. Tests that want no budget constraint pass bare 0, which reads as "zero tokens allowed = immediately exhausted" to a reader unfamiliar with the semantics. CLAUDE.md §4: "No magic numbers — use constants files." Add CRAWL_TOKEN_BUDGET_UNBOUNDED = 0 to opencontractserver/enrichment/constants.py and use it in all affected test calls.


6 — test_authority_frontier_actions.py:62 assertion now mutates row state (test smell)

opencontractserver/tests/test_authority_frontier_actions.py, line 62

assert row.pk in {r.pk for r in AuthorityFrontierService.dequeue_queued()}

Before this PR, dequeue_queued was a pure read. Now it atomically transitions matched rows to in_progress. This assertion has a silent side-effect: after line 62, row.discovery_state is in_progress in the database, even though the line looks like a pure membership check. The test happens to pass because line 69 checks for "ingested" (not "queued"), but any future assertion between lines 62 and 69 that reads row.discovery_state from the DB will see in_progress, not queued. Either use a fresh dequeue_queued call earlier in the test setup, or add a comment explaining the side-effect.

@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@JSv4 JSv4 merged commit 1c43a3f into main Jun 24, 2026
14 checks passed
@JSv4 JSv4 deleted the claude/serene-pascal-4103pn branch June 24, 2026 11:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

## Code Review: Authority Discovery Phase 5 — Bounded Recursive Crawl

2 participants