Harden bounded authority crawl: atomic dequeue + per-corpus Analysis reuse

claude · claude · commit e232766e2ddb · 2026-06-23T05:18:48.000Z
Addresses the code review in issue #2027 of the Phase-5 BFS crawl engine.

Issue #1 (atomic dequeue): AuthorityFrontierService.dequeue_queued() was a plain filter(discovery_state="queued") read. The in_progress transition only happened later in discover_and_bootstrap, leaving a window where two concurrent crawl_authorities tasks could dequeue and bootstrap the same frontier row (wasted provider calls, distorted counters). It now claims the rows it returns inside a single SELECT ... FOR UPDATE SKIP LOCKED transaction, flipping them to in_progress; a second worker skips locked rows and grabs the next ones.

Issue #2 (Analysis-per-section bloat): every section of an authority bootstraps into ONE corpus (the provider title is a constant, so all usc-* sections land in the single "United States Code" corpus), so the BFS calls apply() on that corpus once per ingested section. Each call previously minted a fresh Analysis via _get_analysis. crawl() now caches the Analysis the first apply creates per corpus and threads it back through apply(analysis=...), capping it at one provenance row per corpus.

Issue #3 (blocked_by_bound): clarified in comments that min_demand_or_depth is populated only on the frontier_drained stop, where every residual queued row is provably bound-excluded, and intentionally not on the max_authorities / token_budget early stops, whose unreached-but-eligible rows are accounted for by the frontier_residual census instead.

Minor: crawl_authorities / acrawl_authorities tool params now apply the C.CRAWL_DEFAULT_* constants uniformly instead of None sentinels for two of five.

Regression tests: dequeue atomically claims returned rows / leaves filtered-out rows queued (test_authority_frontier.py); a crawl ingesting multiple sections of one authority reuses a single provenance Analysis (test_crawl_authorities.py ApplyAnalysisReuseTests).

Closes #2027
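
Illustrative sketch of the claim-on-dequeue idea from Issue #1. This is not the Django implementation: FrontierRow, InMemoryFrontier, and run_workers are hypothetical names, and a process-local threading.Lock stands in for the SELECT ... FOR UPDATE SKIP LOCKED row locks (so workers here wait on the lock rather than skipping locked rows). It only shows that selecting and flipping rows inside one critical section keeps two workers from receiving the same row, and that rows filtered out by the bounds stay queued.

```python
import threading
from dataclasses import dataclass


@dataclass
class FrontierRow:
    """Hypothetical in-memory stand-in for an AuthorityFrontier row."""

    canonical_key: str
    mention_count: int
    depth: int = 0
    discovery_state: str = "queued"


class InMemoryFrontier:
    """Illustration only: one threading.Lock replaces the database row locks."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._lock = threading.Lock()

    def dequeue_queued(self, limit, max_depth=None, min_demand=0):
        # Select and flip under the same lock, so no other worker can observe
        # a returned row while it is still "queued".
        with self._lock:
            eligible = [
                row
                for row in self._rows
                if row.discovery_state == "queued"
                and (max_depth is None or row.depth <= max_depth)
                and row.mention_count >= min_demand
            ]
            eligible.sort(key=lambda row: row.mention_count, reverse=True)
            claimed = eligible[:limit]
            for row in claimed:
                row.discovery_state = "in_progress"
            return claimed


def run_workers(frontier, workers=4):
    """Drain the frontier from several threads; return the keys each claimed."""
    results = [[] for _ in range(workers)]

    def worker(slot):
        while True:
            rows = frontier.dequeue_queued(limit=1, max_depth=2, min_demand=2)
            if not rows:
                return
            results[slot].append(rows[0].canonical_key)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

If the state flip were moved outside the `with self._lock:` block, two workers could both read the same row as queued before either marked it, which is the window the real fix closes at the database level.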
diff --git a/changelog.d/2027-authority-crawl.fixed.md b/changelog.d/2027-authority-crawl.fixed.md
@@ -0,0 +1,41 @@
+- Harden the bounded authority crawl against a concurrency race and provenance
+  bloat (issue #2027, a code review of the Phase-5 BFS engine):
+  - **Atomic dequeue claim.**
+    `AuthorityFrontierService.dequeue_queued()`
+    (`opencontractserver/enrichment/services/authority_frontier_service.py`)
+    was a plain `filter(discovery_state="queued")` read — the `in_progress`
+    transition only happened later inside `discover_and_bootstrap`, leaving a
+    TOCTOU window where two concurrent `crawl_authorities` tasks (e.g. two
+    manual triggers on the same corpus) could dequeue the SAME frontier row and
+    `discover_and_bootstrap` it twice (wasted provider calls, distorted summary
+    counters). It now claims the rows it returns inside a single
+    `SELECT … FOR UPDATE SKIP LOCKED` transaction, flipping them to
+    `in_progress`; a second worker skips locked rows and grabs the next ones.
+    Rows excluded by `max_depth` / `min_demand` are never claimed, so the
+    `frontier_drained` residual census still counts them as `queued`.
+  - **One provenance `Analysis` per authority corpus.** Every section of an
+    authority bootstraps into ONE corpus (the provider `title` is a constant —
+    all `usc-*` sections land in the single "United States Code" corpus), so the
+    BFS calls `EnrichmentService.apply()` on that corpus once per ingested
+    section. Each call previously minted a fresh `Analysis`
+    (`_get_analysis` → `Analysis.objects.create`), so a deep crawl left dozens
+    of provenance rows on one corpus. `CrawlAuthoritiesService.crawl()`
+    (`opencontractserver/enrichment/services/crawl_authorities_service.py`) now
+    caches the `Analysis` the first apply creates per corpus and passes it to
+    later calls via `apply(analysis=…)`, so each corpus keeps one `Analysis`. A
+    misleading "this apply scan is bounded (one small document per section)"
+    comment was corrected.
+  - **Honest `blocked_by_bound` accounting.** Clarified (in comments) that
+    `blocked_by_bound["min_demand_or_depth"]` is populated only on the
+    `frontier_drained` stop — where every residual `queued` row is provably
+    bound-excluded — and intentionally NOT on the `max_authorities` /
+    `token_budget` early stops, whose unreached rows may be perfectly eligible
+    and are accounted for by the `frontier_residual` census instead.
+  - **Uniform tool signature.** `crawl_authorities` / `acrawl_authorities`
+    (`opencontractserver/llms/tools/core_tools/corpus_references.py`) now apply
+    `C.CRAWL_DEFAULT_*` constants to all five bound parameters instead of using
+    `None` sentinels for two of them and constants for the other three.
+- Regression tests: `test_authority_frontier.py` (dequeue atomically claims
+  returned rows / leaves filtered-out rows `queued`) and
+  `test_crawl_authorities.py::ApplyAnalysisReuseTests` (a crawl that ingests
+  multiple sections of one authority reuses a single provenance `Analysis`).
diff --git a/opencontractserver/enrichment/services/authority_frontier_service.py b/opencontractserver/enrichment/services/authority_frontier_service.py
@@ -10,6 +10,7 @@
 from collections.abc import Mapping
 from dataclasses import dataclass
 
+from django.db import transaction
 from django.db.models import Count, Q
 from django.utils import timezone
 
@@ -143,19 +144,47 @@ def dequeue_queued(
         max_depth: int | None = None,
         min_demand: int = 0,
     ) -> list[AuthorityFrontier]:
-        """Highest-demand queued rows regardless of assigned provider.
+        """Atomically CLAIM the highest-demand queued rows for the crawl driver.
 
         Unlike ``dequeue_for_provider`` (which requires a stamped provider),
         this serves the crawl driver: it picks ``discovery_state="queued"`` rows
         ranked by ``-mention_count``, optionally bounded by depth and a minimum
         demand floor.  Provider selection happens later in the discovery service.
+
+        The returned rows are transitioned to ``in_progress`` inside a single
+        ``SELECT ... FOR UPDATE SKIP LOCKED`` transaction, so the dequeue is an
+        atomic *claim* rather than a plain read: two ``crawl_authorities`` tasks
+        running concurrently (e.g. two manual triggers on the same corpus) can
+        never return — and therefore never ``discover_and_bootstrap`` — the same
+        frontier row twice (issue #2027). ``skip_locked`` lets a second worker
+        pick the next available rows instead of blocking on the first worker's
+        lock. Rows excluded by ``max_depth`` / ``min_demand`` are never claimed,
+        so the crawl's ``frontier_drained`` residual census still sees them as
+        ``queued``.
         """
         qs = AuthorityFrontier.objects.filter(discovery_state="queued")
         if max_depth is not None:
             qs = qs.filter(depth__lte=max_depth)
         if min_demand:
             qs = qs.filter(mention_count__gte=min_demand)
-        return list(qs.order_by("-mention_count")[:limit])
+        with transaction.atomic():
+            rows = list(
+                qs.select_for_update(skip_locked=True).order_by("-mention_count")[
+                    :limit
+                ]
+            )
+            if rows:
+                now = timezone.now()
+                for row in rows:
+                    row.discovery_state = "in_progress"
+                    row.last_attempt = now
+                    # bulk_update bypasses auto_now — stamp ``modified`` so the
+                    # claim matches the single-row ``mark()`` writer.
+                    row.modified = now
+                AuthorityFrontier.objects.bulk_update(
+                    rows, ["discovery_state", "last_attempt", "modified"]
+                )
+        return rows
 
     @classmethod
     def seed_child_keys(
diff --git a/opencontractserver/enrichment/services/crawl_authorities_service.py b/opencontractserver/enrichment/services/crawl_authorities_service.py
@@ -136,8 +136,27 @@ def crawl(
         # iterations rather than constructing a fresh object on every BFS hop.
         enrichment = EnrichmentService()
 
+        # One provenance Analysis per authority corpus, reused across the run.
+        # Every section of a given authority bootstraps into ONE corpus — the
+        # provider's ``title`` is a constant, so every ``usc-*`` section lands in
+        # the single "United States Code" corpus — so a crawl that ingests N
+        # sections of an authority calls apply() on the SAME corpus N times.
+        # Letting each call mint its own Analysis (apply()'s default when
+        # ``analysis=None``) would leave N provenance rows on that one corpus;
+        # instead we capture the Analysis the first apply creates and feed it back
+        # into the rest, capping it at one per corpus (issue #2027).
+        from opencontractserver.analyzer.models import Analysis
+
+        apply_analyses: dict[int, Analysis] = {}
+
         while True:
-            # Hard cap checks before dequeue so the summary is honest.
+            # Hard cap checks before dequeue so the summary is honest. On these
+            # early stops we intentionally do NOT populate
+            # blocked_by_bound["min_demand_or_depth"]: rows still queued when a cap
+            # fires were simply not reached, and may be perfectly eligible (above
+            # min_demand, within max_depth) — attributing them to a bound would be
+            # a lie. The frontier_residual census (computed below for EVERY stop
+            # reason) accounts for them, so the summary is still non-silent.
             if ingested >= max_authorities:
                 stop_reason = "max_authorities"
                 break
@@ -150,10 +169,13 @@ def crawl(
                 limit=1, max_depth=max_depth, min_demand=min_demand
             )
             if not rows:
-                # Count how many queued rows remain so the summary is non-silent
-                # about what was left. This is the UNION of rows excluded by the
-                # min_demand floor and/or the max_depth bound — the single key
-                # does not attribute each row to one cause or the other.
+                # frontier_drained: dequeue returned nothing, so EVERY remaining
+                # queued row failed at least one of the min_demand / max_depth
+                # filters. Here, and only here, is attributing the residual queued
+                # count to those bounds correct (the early max_authorities /
+                # token_budget breaks above leave their unreached-but-eligible rows
+                # to frontier_residual instead). The single key is the UNION of the
+                # two exclusions; it does not split each row by cause.
                 blocked_by_bound["min_demand_or_depth"] = (
                     AuthorityFrontier.objects.filter(discovery_state="queued").count()
                 )
@@ -206,14 +228,26 @@ def crawl(
             # Re-extract the authority's OWN outbound citations and seed the
             # frontier at depth+1 — only when we haven't reached max_depth.
             if row.depth < max_depth:
-                # Authority corpora hold one small document per statute section,
-                # so this apply scan is bounded (not a large-corpus scan).
+                # Reuse this corpus's provenance Analysis across sections (see the
+                # apply_analyses note above) so the BFS doesn't accumulate one
+                # Analysis row per section on a shared authority corpus.
+                apply_analysis = apply_analyses.get(authority_corpus_id)
                 apply_res = enrichment.apply(
                     corpus_id=authority_corpus_id,
                     creator_id=creator_id,
                     types=[C.REF_LAW],
                     extra_tiers=[C.DETECTION_TIER_GRAMMAR],
+                    analysis=apply_analysis,
                 )
+                if apply_analysis is None:
+                    # First apply on this corpus created the provenance Analysis;
+                    # cache it so the corpus's remaining sections reattach to it
+                    # instead of each minting a fresh one.
+                    new_analysis_id = apply_res.get("analysis_id")
+                    if new_analysis_id is not None:
+                        apply_analyses[authority_corpus_id] = Analysis.objects.get(
+                            pk=new_analysis_id
+                        )
 
                 outbound = list(
                     CorpusReferenceService.for_corpus(user, authority_corpus_id)
diff --git a/opencontractserver/llms/tools/core_tools/corpus_references.py b/opencontractserver/llms/tools/core_tools/corpus_references.py
@@ -247,14 +247,18 @@ def crawl_authorities(
     max_depth: int = C.CRAWL_DEFAULT_MAX_DEPTH,
     min_demand: int = C.CRAWL_DEFAULT_MIN_DEMAND,
     max_authorities: int = C.CRAWL_DEFAULT_MAX_AUTHORITIES,
-    per_jurisdiction_cap: int | None = None,
-    token_budget: int | None = None,
+    per_jurisdiction_cap: int = C.CRAWL_DEFAULT_PER_JURISDICTION_CAP,
+    token_budget: int = C.CRAWL_DEFAULT_TOKEN_BUDGET,
 ) -> dict:
     """Bounded recursive crawl: discover & ingest the authorities a corpus
     cites, then the authorities THOSE cite, up to ``max_depth`` hops. Returns a
     summary with per-state counts, per-jurisdiction tallies, the stop reason,
     and the full frontier residual census. Idempotent: already-ingested
     authorities are skipped, re-crawling creates zero duplicate documents.
+
+    All five bound parameters default to the ``C.CRAWL_DEFAULT_*`` constants, so
+    the signature shows the same default style for every bound instead of
+    ``None`` sentinels for two of them and constants for the other three.
     """
     from opencontractserver.enrichment.services.crawl_authorities_service import (
         CrawlAuthoritiesService,
@@ -266,14 +270,8 @@ def crawl_authorities(
         max_depth=max_depth,
         min_demand=min_demand,
         max_authorities=max_authorities,
-        per_jurisdiction_cap=(
-            per_jurisdiction_cap
-            if per_jurisdiction_cap is not None
-            else C.CRAWL_DEFAULT_PER_JURISDICTION_CAP
-        ),
-        token_budget=(
-            token_budget if token_budget is not None else C.CRAWL_DEFAULT_TOKEN_BUDGET
-        ),
+        per_jurisdiction_cap=per_jurisdiction_cap,
+        token_budget=token_budget,
     )
 
 
@@ -284,8 +282,8 @@ async def acrawl_authorities(
     max_depth: int = C.CRAWL_DEFAULT_MAX_DEPTH,
     min_demand: int = C.CRAWL_DEFAULT_MIN_DEMAND,
     max_authorities: int = C.CRAWL_DEFAULT_MAX_AUTHORITIES,
-    per_jurisdiction_cap: int | None = None,
-    token_budget: int | None = None,
+    per_jurisdiction_cap: int = C.CRAWL_DEFAULT_PER_JURISDICTION_CAP,
+    token_budget: int = C.CRAWL_DEFAULT_TOKEN_BUDGET,
 ) -> dict:
     return await _db_sync_to_async(crawl_authorities)(
         creator_id=creator_id,
diff --git a/opencontractserver/tests/test_authority_frontier.py b/opencontractserver/tests/test_authority_frontier.py
@@ -509,6 +509,59 @@ def test_combined_max_depth_and_min_demand(self):
         self.assertNotIn("usc-15:7b", keys)
         self.assertNotIn("usc-15:7c", keys)
 
+    def test_dequeue_atomically_claims_rows_in_progress(self):
+        """dequeue_queued is an atomic CLAIM, not a plain read (issue #2027).
+
+        Each returned row must be flipped to ``in_progress`` — both in the
+        returned object and in the DB — so a second concurrent dequeue cannot
+        re-return it and re-run ``discover_and_bootstrap`` on the same key.
+        """
+        self._make_row("usc-15:claim-a", mention_count=10)
+        self._make_row("usc-15:claim-b", mention_count=5)
+
+        first = AuthorityFrontierService.dequeue_queued(limit=1)
+        self.assertEqual(len(first), 1)
+        self.assertEqual(first[0].canonical_key, "usc-15:claim-a")
+        # Claimed in the returned object AND persisted to the DB.
+        self.assertEqual(first[0].discovery_state, "in_progress")
+        self.assertEqual(
+            AuthorityFrontier.objects.get(
+                canonical_key="usc-15:claim-a"
+            ).discovery_state,
+            "in_progress",
+        )
+        self.assertIsNotNone(first[0].last_attempt)
+
+        # A second dequeue must skip the already-claimed row and pick the next.
+        second = AuthorityFrontierService.dequeue_queued(limit=10)
+        keys = {r.canonical_key for r in second}
+        self.assertNotIn("usc-15:claim-a", keys)
+        self.assertIn("usc-15:claim-b", keys)
+
+    def test_filtered_out_rows_are_not_claimed(self):
+        """Rows excluded by min_demand/max_depth must stay ``queued`` (unclaimed).
+
+        The crawl's frontier_drained residual census counts ``queued`` rows, so
+        the claim must touch only rows it actually returns.
+        """
+        self._make_row("usc-15:keep", mention_count=5, depth=0)
+        self._make_row("usc-15:low", mention_count=1, depth=0)  # below min_demand
+        self._make_row("usc-15:deep", mention_count=5, depth=9)  # beyond max_depth
+
+        claimed = AuthorityFrontierService.dequeue_queued(
+            limit=10, max_depth=2, min_demand=2
+        )
+        self.assertEqual({r.canonical_key for r in claimed}, {"usc-15:keep"})
+        # The excluded rows are untouched — still queued for a later, looser pass.
+        self.assertEqual(
+            AuthorityFrontier.objects.get(canonical_key="usc-15:low").discovery_state,
+            "queued",
+        )
+        self.assertEqual(
+            AuthorityFrontier.objects.get(canonical_key="usc-15:deep").discovery_state,
+            "queued",
+        )
+
 
 class SeedChildKeysTests(TestCase):
     """Tests for AuthorityFrontierService.seed_child_keys (Phase-5 idempotent seeding)."""
diff --git a/opencontractserver/tests/test_crawl_authorities.py b/opencontractserver/tests/test_crawl_authorities.py