Skip to content

Commit 8873b05

Browse files
authored
Merge pull request #2050 from Open-Source-Legal/claude/upbeat-euler-wvsuth
Address issue #2020 authority-frontier code review findings
2 parents 8d91ace + f0e2d1a commit 8873b05

12 files changed

Lines changed: 336 additions & 95 deletions
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
- Moved the `AuthorityFrontier` discovery-state vocabulary into `opencontractserver/enrichment/constants.py` as a single source of truth (`DISCOVERY_STATE_*` constants, `DISCOVERY_STATE_CHOICES`, and `DISCOVERY_SUCCESS_STATES`). The model field (`opencontractserver/annotations/models.py`), the frontier service transition verbs, the discovery orchestrator, the crawl driver, and the verify/license gate's `GATE_*` verdicts now reference these constants instead of bare string literals, so a rename is a one-line edit and the parallel definitions cannot drift (CLAUDE.md item 4; issue #2020 finding 7). The model's `DISCOVERY_STATE_CHOICES` is re-exported from the constants with identical values, so no migration is generated.
2+
- Renamed `enrichment.constants._USC_PREFIX_RE` / `_CFR_PREFIX_RE` to `USC_PREFIX_RE` / `CFR_PREFIX_RE` (no leading underscore). They are imported cross-module by the USC/CFR authority source providers' `can_handle` overrides, so the underscore (signalling module-private) was misleading; updated both providers' imports (issue #2020 finding 3).
3+
- Optimised `AuthorityDiscoveryService._provider_for` (`opencontractserver/enrichment/services/authority_discovery_service.py`) to instantiate each enabled provider once per call and reuse the instances across every candidate key, instead of re-instantiating every provider for each candidate. `can_handle` is pure given the key, so the prior per-candidate instantiation was wasted work that grew with the candidate-key fan-out (issue #2020 finding 6).
4+
- Expanded the `dequeue_for_provider` docstring to document its crash-recovery purpose (freshly-seeded rows carry `provider=None` and are intentionally skipped; the primary dispatch path is the provider-agnostic `dequeue_queued`) and clarified the `AuthorityKeyEquivalence` model docstring that its direction convention is documentation-only because callers query both columns (issue #2020 findings 9, 10).
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
- Fixed a TOCTOU race in `AuthorityFrontierService.seed_from_wanted_authorities` (`opencontractserver/enrichment/services/authority_frontier_service.py`): the per-key create-or-refresh now runs inside a single `transaction.atomic()` + `select_for_update()` critical section instead of `get_or_create` followed by an unconditional `save`. Two concurrent seed passes could previously both clear `get_or_create` and have one silently clobber the other's count update, and a freshly created row was briefly visible at `mention_count=0` (the field default) between the insert and its follow-up save. The lock serialises the second writer and seeds the real counts atomically. The "prefer existing non-null" semantics for `jurisdiction`/`authority_type` are preserved (issue #2020 finding 1).
2+
- Fixed stale `last_error` on a successful retry in `AuthorityFrontierService.mark()` (same file): a row transitioning `failed (error=...) -> ingested` previously retained the old error string, so downstream health checks reading `last_error` misread a healthy row as broken. `mark()` now clears `last_error` implicitly when transitioning into a terminal success state (`enrichment.constants.DISCOVERY_SUCCESS_STATES`, i.e. `ingested`); the failure history is still preserved in the append-only `candidate_sources` audit trail (issue #2020 finding 2).
3+
- Fixed a double XML traversal in `CFRAuthoritySourceProvider._fetch_impl` (`opencontractserver/pipeline/authority_source_providers/cfr_provider.py`): the `<P>` flatten list comprehension called `_flatten_element_text(p_el)` twice per element (once for the value, once for the truthiness filter). A walrus binds the result once so each `<P>` is traversed a single time (issue #2020 finding 5).
4+
- Added regression tests in `opencontractserver/tests/test_authority_frontier.py`: the seed upsert takes a `SELECT ... FOR UPDATE` row lock, a `failed -> ingested` transition clears the stale error (while a non-success transition preserves it), and `dequeue_for_provider` excludes freshly-seeded `provider=None` rows (issue #2020 findings 11, 12).

opencontractserver/annotations/models.py

Lines changed: 15 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -2187,28 +2187,18 @@ class AuthorityFrontier(django.db.models.Model):
21872187
# ("top_detection_tier" is one of enrichment_constants.ALL_DETECTION_TIERS).
21882188
candidate_sources = django.db.models.JSONField(default=list, blank=True)
21892189

2190-
# Discovery state machine. NOTE: the historical ``discovered`` and
2191-
# ``resolved`` states were retired (Authority Console Phase 4): no production
2192-
# code path ever assigned them (discovery jumps in_progress -> ingested, and
2193-
# the resolution outcome lives on the relink result / Analysis, not the
2194-
# frontier row), so carrying them as choices was a dead-vocabulary trap.
2195-
DISCOVERY_STATE_CHOICES = [
2196-
("queued", "Queued"),
2197-
("in_progress", "In progress"),
2198-
("ingested", "Document imported"),
2199-
("failed", "No source found"),
2200-
("unsupported", "No provider can_handle"),
2201-
# Phase 4: visible, non-silent gate outcomes
2202-
("blocked_license", "Provider license is not public-domain"),
2203-
("blocked_domain", "Source domain not on the public-domain allowlist"),
2204-
("unlocated", "Located text did not verify against the requested key"),
2205-
("pending_approval", "Found, awaiting human approval before ingest"),
2206-
# Phase 5: per-jurisdiction cap reached; row parked so dequeue can skip it
2207-
("deferred_cap", "Deferred: per-jurisdiction cap reached"),
2208-
]
2190+
# Discovery state machine. The state vocabulary + labels are the single
2191+
# source of truth in ``enrichment.constants`` (CLAUDE.md item 4: no magic
2192+
# strings) so the model, the frontier service transition verbs, the
2193+
# discovery orchestrator, and the gate cannot drift. Re-exported here as a
2194+
# class attribute so existing references (e.g.
2195+
# config/graphql/annotation_types.py) keep resolving
2196+
# ``AuthorityFrontier.DISCOVERY_STATE_CHOICES``. See the constants module for
2197+
# the note on the retired ``discovered``/``resolved`` states.
2198+
DISCOVERY_STATE_CHOICES = enrichment_constants.DISCOVERY_STATE_CHOICES
22092199
discovery_state = django.db.models.CharField(
22102200
max_length=32,
2211-
default="queued",
2201+
default=enrichment_constants.DISCOVERY_STATE_QUEUED,
22122202
db_index=True,
22132203
choices=DISCOVERY_STATE_CHOICES,
22142204
)
@@ -2261,6 +2251,11 @@ class AuthorityKeyEquivalence(django.db.models.Model):
22612251
that they denote the same authority so find_authority_target can hop across
22622252
namespaces. Plain Model: derived reference data (populated from USLM
22632253
@identifier + <sourceCredit> <ref>s), never user content.
2254+
2255+
Direction convention (``from_key`` = popular-name act citation, ``to_key`` =
2256+
USC codification) is for documentation only: every caller queries both
2257+
columns, so equivalence is treated as symmetric and a developer reading the
2258+
model in isolation should not assume lookups are one-directional.
22642259
"""
22652260

22662261
# Single-column indexes are declared once in Meta.indexes below (a bare

opencontractserver/enrichment/constants.py

Lines changed: 60 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -215,8 +215,11 @@ def doc_max_concurrency() -> int:
215215

216216

217217
# --- Phase 3: prefix classifier ---------------------------------------- #
218-
_USC_PREFIX_RE = _re.compile(r"^usc-\d+$")
219-
_CFR_PREFIX_RE = _re.compile(r"^cfr-\d+$")
218+
# Public (no leading underscore): these are imported cross-module by the USC /
219+
# CFR authority source providers' ``can_handle`` overrides, so they are part of
220+
# the package's intentional surface, not module-private helpers.
221+
USC_PREFIX_RE = _re.compile(r"^usc-\d+$")
222+
CFR_PREFIX_RE = _re.compile(r"^cfr-\d+$")
220223
# Municipal grammar keys (issue #1995): the ``muni`` catch-all (bare "Municipal
221224
# Code § N") and per-city ``muni-<city-slug>`` keys (both the table-keyed codes
222225
# and open-vocab city captures). Matched by shape so a city added to the table
@@ -237,6 +240,59 @@ def doc_max_concurrency() -> int:
237240
GRAMMAR_STATUTE_META_PREFIXES = frozenset({"act", "publ", "stat"})
238241

239242

243+
# --- Phase 3/5: AuthorityFrontier discovery-state machine ------------------ #
244+
# Single source of truth for the frontier state vocabulary (CLAUDE.md item 4:
245+
# no magic strings). The model field choices (annotations.models.AuthorityFrontier
246+
# .DISCOVERY_STATE_CHOICES), the transition primitive (AuthorityFrontierService
247+
# .mark), the discovery orchestrator, and the crawl driver all reference these
248+
# names so a rename is a one-line edit. The verify+license gate
249+
# (AuthorityGateService) reuses the overlapping subset for its GATE_* verdicts.
250+
DISCOVERY_STATE_QUEUED = "queued"
251+
DISCOVERY_STATE_IN_PROGRESS = "in_progress"
252+
DISCOVERY_STATE_INGESTED = "ingested"
253+
DISCOVERY_STATE_FAILED = "failed"
254+
DISCOVERY_STATE_UNSUPPORTED = "unsupported"
255+
# Phase 4: visible, non-silent gate outcomes.
256+
DISCOVERY_STATE_BLOCKED_LICENSE = "blocked_license"
257+
DISCOVERY_STATE_BLOCKED_DOMAIN = "blocked_domain"
258+
DISCOVERY_STATE_UNLOCATED = "unlocated"
259+
DISCOVERY_STATE_PENDING_APPROVAL = "pending_approval"
260+
# Phase 5: per-jurisdiction cap reached; row parked so dequeue can skip it.
261+
DISCOVERY_STATE_DEFERRED_CAP = "deferred_cap"
262+
263+
# (value, human label) pairs for the model field. The labels live with the
264+
# vocabulary so the model, admin, and any serializer share one definition.
265+
# NOTE: the historical ``discovered`` and ``resolved`` states were retired
266+
# (Authority Console Phase 4): no production code path ever assigned them
267+
# (discovery jumps in_progress -> ingested, and the resolution outcome lives on
268+
# the relink result / Analysis, not the frontier row), so carrying them as
269+
# choices was a dead-vocabulary trap.
270+
DISCOVERY_STATE_CHOICES = [
271+
(DISCOVERY_STATE_QUEUED, "Queued"),
272+
(DISCOVERY_STATE_IN_PROGRESS, "In progress"),
273+
(DISCOVERY_STATE_INGESTED, "Document imported"),
274+
(DISCOVERY_STATE_FAILED, "No source found"),
275+
(DISCOVERY_STATE_UNSUPPORTED, "No provider can_handle"),
276+
(DISCOVERY_STATE_BLOCKED_LICENSE, "Provider license is not public-domain"),
277+
(
278+
DISCOVERY_STATE_BLOCKED_DOMAIN,
279+
"Source domain not on the public-domain allowlist",
280+
),
281+
(
282+
DISCOVERY_STATE_UNLOCATED,
283+
"Located text did not verify against the requested key",
284+
),
285+
(DISCOVERY_STATE_PENDING_APPROVAL, "Found, awaiting human approval before ingest"),
286+
(DISCOVERY_STATE_DEFERRED_CAP, "Deferred: per-jurisdiction cap reached"),
287+
]
288+
289+
# States that represent a successful terminal ingest. ``mark()`` clears
290+
# ``last_error`` when transitioning into one of these, so a healthy row never
291+
# retains a stale error string from an earlier failed attempt. Only "ingested"
292+
# qualifies today ("resolved" was retired — see the note above).
293+
DISCOVERY_SUCCESS_STATES = frozenset({DISCOVERY_STATE_INGESTED})
294+
295+
240296
def classify_prefix(prefix: str) -> tuple:
241297
"""(jurisdiction, authority_type) for a canonical_key prefix.
242298
@@ -249,9 +305,9 @@ def classify_prefix(prefix: str) -> tuple:
249305
Falls back to ``PREFIX_CLASSIFICATION`` for named registry bodies (dgcl,
250306
exchange-act, irc, …) and returns ``(None, None)`` for unknown prefixes.
251307
"""
252-
if _USC_PREFIX_RE.match(prefix):
308+
if USC_PREFIX_RE.match(prefix):
253309
return (JURISDICTION_US_FEDERAL, AUTHORITY_TYPE_STATUTE)
254-
if _CFR_PREFIX_RE.match(prefix):
310+
if CFR_PREFIX_RE.match(prefix):
255311
return (JURISDICTION_US_FEDERAL, AUTHORITY_TYPE_REGULATION)
256312
if prefix == "fedreg":
257313
return (JURISDICTION_US_FEDERAL, AUTHORITY_TYPE_ADMIN_RULE)

opencontractserver/enrichment/services/authority_discovery_service.py

Lines changed: 35 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@
2525
from django.utils import timezone
2626

2727
from opencontractserver.annotations.models import AuthorityFrontier
28+
from opencontractserver.enrichment import constants as C
2829
from opencontractserver.shared.services.base import BaseService
2930

3031
logger = logging.getLogger(__name__)
@@ -83,15 +84,21 @@ def _provider_for(cls, canonical_key: str) -> tuple[str | None, Any, str | None]
8384
key=lambda d: getattr(d.component_class, "priority", 100),
8485
)
8586

87+
# Instantiate each enabled provider ONCE per call and reuse the instances
88+
# across every candidate key below. ``can_handle`` is pure given the key,
89+
# so re-instantiating a provider for each candidate (the prior behaviour)
90+
# was wasted work that grew with the candidate-key fan-out.
91+
providers = [
92+
(defn.name, defn.component_class())
93+
for defn in defns
94+
if defn.component_class is not None
95+
and getattr(defn.component_class, "enabled", True)
96+
]
97+
8698
def _match(key: str):
87-
for defn in defns:
88-
if defn.component_class is None:
89-
continue
90-
if not getattr(defn.component_class, "enabled", True):
91-
continue
92-
provider = defn.component_class()
99+
for name, provider in providers:
93100
if provider.can_handle(key):
94-
return defn.name, provider
101+
return name, provider
95102
return None
96103

97104
from opencontractserver.enrichment.data import mappings as _mappings
@@ -221,8 +228,11 @@ def discover_and_bootstrap(
221228
name, provider, fetch_key = cls._provider_for(canonical_key)
222229

223230
if provider is None:
224-
AuthorityFrontierService.mark(frontier_row, "unsupported")
225-
return {"status": "unsupported", "canonical_key": canonical_key}
231+
AuthorityFrontierService.mark(frontier_row, C.DISCOVERY_STATE_UNSUPPORTED)
232+
return {
233+
"status": C.DISCOVERY_STATE_UNSUPPORTED,
234+
"canonical_key": canonical_key,
235+
}
226236

227237
# ``_provider_for`` only returns a non-None provider alongside a non-None
228238
# ``fetch_key`` (the matched candidate key); the ``(None, None, None)``
@@ -234,7 +244,7 @@ def discover_and_bootstrap(
234244
# Record which provider was selected and mark in-flight.
235245
frontier_row.provider = name
236246
frontier_row.save(update_fields=["provider", "modified"])
237-
AuthorityFrontierService.mark(frontier_row, "in_progress")
247+
AuthorityFrontierService.mark(frontier_row, C.DISCOVERY_STATE_IN_PROGRESS)
238248

239249
# --- fetch -----------------------------------------------------------
240250
try:
@@ -257,17 +267,17 @@ def discover_and_bootstrap(
257267
candidate_record = cls._audit_record(
258268
provider_name=name,
259269
provider_license=provider.license,
260-
outcome="failed",
270+
outcome=C.DISCOVERY_STATE_FAILED,
261271
error=str(exc),
262272
)
263273
AuthorityFrontierService.mark(
264274
frontier_row,
265-
"failed",
275+
C.DISCOVERY_STATE_FAILED,
266276
error=str(exc),
267277
candidate_record=candidate_record,
268278
)
269279
return {
270-
"status": "failed",
280+
"status": C.DISCOVERY_STATE_FAILED,
271281
"error": str(exc),
272282
"canonical_key": canonical_key,
273283
}
@@ -298,7 +308,11 @@ def discover_and_bootstrap(
298308
provider_license=provider.license,
299309
source_domain=decision.source_domain,
300310
verify=decision.verify,
301-
outcome=decision.verdict if decision.verdict != GATE_OK else "ingested",
311+
outcome=(
312+
decision.verdict
313+
if decision.verdict != GATE_OK
314+
else C.DISCOVERY_STATE_INGESTED
315+
),
302316
error=None if decision.verdict == GATE_OK else decision.reason,
303317
)
304318
if decision.verdict != GATE_OK:
@@ -378,9 +392,11 @@ def discover_and_bootstrap(
378392
result["equivalence_relink"] = relink_result
379393

380394
# --- mark ingested -----------------------------------------------
395+
# mark() clears any stale last_error from a prior failed attempt on
396+
# this SUCCESS transition (see C.DISCOVERY_SUCCESS_STATES).
381397
AuthorityFrontierService.mark(
382398
frontier_row,
383-
"ingested",
399+
C.DISCOVERY_STATE_INGESTED,
384400
document_id=ingested_doc.id if ingested_doc else None,
385401
candidate_record=candidate_record,
386402
)
@@ -394,25 +410,25 @@ def discover_and_bootstrap(
394410
)
395411
AuthorityFrontierService.mark(
396412
frontier_row,
397-
"failed",
413+
C.DISCOVERY_STATE_FAILED,
398414
error=str(exc),
399415
candidate_record=cls._audit_record(
400416
provider_name=name,
401417
provider_license=provider.license,
402418
source_domain=decision.source_domain,
403419
verify=decision.verify,
404-
outcome="failed",
420+
outcome=C.DISCOVERY_STATE_FAILED,
405421
error=str(exc),
406422
),
407423
)
408424
return {
409-
"status": "failed",
425+
"status": C.DISCOVERY_STATE_FAILED,
410426
"error": str(exc),
411427
"canonical_key": canonical_key,
412428
}
413429

414430
return {
415-
"status": "ingested",
431+
"status": C.DISCOVERY_STATE_INGESTED,
416432
**result,
417433
"relinked_count": relinked_count,
418434
"canonical_key": canonical_key,

0 commit comments

Comments
 (0)