Skip to content

Authority packs: self-contained per-pack config, bug fixes, docs#2064

Merged
JSv4 merged 2 commits into
mainfrom
feature/authority-packs-config
Jun 25, 2026
Merged

Authority packs: self-contained per-pack config, bug fixes, docs#2064
JSv4 merged 2 commits into
mainfrom
feature/authority-packs-config

Conversation

@JSv4

@JSv4 JSv4 commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary

Follow-up to #2053 that makes an authority pack a self-contained, copyable directory — everything for one authority (taxonomy, content, personas, its scraper, and its source hosts) lives in the pack and travels with it: copy the directory, get the authority. Also fixes a namespace re-seed bug and centralizes the docs.

Changes

Self-contained packs

  • Provider discovery from <pack>/providers/ — in-tree packs and out-of-tree dirs on the new AUTHORITY_PACK_PATHS setting (imported by file path, failures logged + skipped). A provider now lives with its authority instead of in core's authority_source_providers/ package; duplicate supported_prefixes across providers now warn.
  • pack.yaml source_hosts: are merged into the SSRF allowlist at runtime via an injected provider; the gate uses the same effective set. The SSRF mechanism (HTTPS-only, public-IP, per-redirect-hop revalidation, size caps) is unchanged — a pack only widens which hosts are reachable, and only once installed. Secrets stay in the PipelineSettings encrypted vault.
  • A pack's mappings YAML can declare shape_rules: (classify a numbered prefix family) and abbreviations: (state/municipal), merged onto the Python baseline at runtime (baseline wins collisions).

Bug fixes

  • _namespace_seed post_migrate re-seed no longer clobbers source="manual" / corpus-linked AuthorityNamespace rows on every migrate/flush (restores the console's "a re-load can't overwrite curator edits" guarantee).
  • Reject self-referential equivalences fail-fast; remove the dead _park_for_cap crawl helper.

Docs

  • New docs/guides/authoring-authority-packs.md; wired into the nav along with the packs proposal; cross-linked from the Authority Console doc + bolivia README. The proposal is re-framed as design-history.
  • CLAUDE.md doc-hygiene principle (keep docs concise/current; code pointers over pasted code).

Testing

  • The authority/enrichment/provider suites pass; new tests: test_authority_pack_providers.py, test_authority_source_hosts.py, test_authority_pack_taxonomy.py, plus additions to test_authority_mapping_loader.py / test_authority_pack.py.
  • Live runtime smokes confirmed provider discovery, host-allowlist widening, and the shape-rule/abbreviation merge in a fresh process.

Left as core (by design)

The shared authority_type vocabulary (wired to model choices) and the Tier-2a citation-form parsing grammars stay in the engine — shared vocabulary / parsing logic, not per-authority config.

- Discover provider modules from a pack's providers/ dir (in-tree packs and
  out-of-tree dirs on the AUTHORITY_PACK_PATHS setting); warn on duplicate
  provider prefixes.
- Merge a pack's pack.yaml source_hosts into the SSRF allowlist at runtime via
  an injected provider; the gate uses the same effective allowlist.
- A pack's mappings YAML can declare shape_rules and state/municipal
  abbreviations, merged onto the Python baseline at runtime.
- Fix the AuthorityNamespace post_migrate re-seed clobbering manual/corpus-linked
  rows on every migrate/flush.
- Reject self-referential equivalences fail-fast; remove dead crawl helper.
- Add docs/guides/authoring-authority-packs.md, wire it + the packs proposal into
  the nav, and cross-link from the console doc and bolivia README.
- Add a CLAUDE.md doc-hygiene principle (concise, current, code pointers).
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Code Review

This PR makes authority packs self-contained: scrapers, SSRF host declarations, and citation vocabulary now travel with the pack directory rather than requiring core edits. The namespace re-seed bug fix is well-targeted, the SSRF injection mechanism is appropriately fail-closed, and the test coverage for the new modules is solid. A few issues worth addressing before merge:


1. Missing isinstance(data, dict) guard — two places, different blast radii (CONFIRMED bug)

opencontractserver/enrichment/services/authority_source_hosts.py ~line 66 and opencontractserver/enrichment/services/authority_pack_config.py ~line 65 (_iter_pack_mapping_files) both do:

data = yaml.safe_load(manifest.read_text(encoding="utf-8")) or {}

The or {} only catches falsy values (None, empty string). A syntactically valid pack.yaml whose top-level is a YAML list is truthy, so data becomes a Python list. The next line — data.get("source_hosts") / data.get("mappings") — raises AttributeError, which is not caught by the except yaml.YAMLError block. This isn't guarded elsewhere in the call stack.

Impact by site:

  • authority_source_hosts.py: crashes pack_declared_source_hosts()effective_source_allowlist() fails → every call to host_on_allowlist() / validate_url() / safe_fetch_bytes() that uses the dynamic default crashes. In a production deployment with a malformed pack, every authority fetch would break.
  • authority_pack_config.py (_iter_pack_mapping_files): crashes pack_declared_shape_rules() and pack_declared_abbreviations(), which propagates into classify_prefix() and GenericCitationExtractor.__init__().

The fix already exists in the same PR — _load_yaml() in authority_pack_config.py (line ~80) has the correct guard:

return data if isinstance(data, dict) else {}

Apply it consistently in both places:

# authority_source_hosts.py, after the try/except:
if not isinstance(data, dict):
    logger.warning("%s: pack manifest must be a YAML mapping; ignoring", manifest)
    continue

The same guard should be added to _iter_pack_mapping_files() (same file, same pattern). The three-copy duplication of this YAML-loading pattern (in _iter_pack_mapping_files, authority_source_hosts.pack_declared_source_hosts, and authority_source_hosts._load_yaml) is what allowed the inconsistency to creep in — see finding #4.


2. classify_prefix silently changed its contract from "never raises" to "may raise" (PLAUSIBLE)

opencontractserver/enrichment/constants.py ~line 335

The old code:

return PREFIX_CLASSIFICATION.get(prefix, (None, None))

could never raise. The new code calls pack_declared_shape_rules(), which internally reads YAML files via _iter_pack_mapping_files(). Neither function wraps the manifest.read_text() call in an OSError catch — only yaml.YAMLError is caught. A race condition or permission change on a pack file (file visible to is_file() but fails on read_text()) would propagate an OSError through pack_declared_shape_rules() into classify_prefix().

Five call sites unpack the result without any exception guard:

  • grammars.py:262jur, typ = C.classify_prefix(canonical) (citation extraction hot path)
  • authorities.py:187, governance_graph_service.py:193/216, authority_frontier_service.py:61/224

A fix that keeps the "never raises" contract:

try:
    rules = pack_declared_shape_rules()
except Exception:
    logger.warning("pack_declared_shape_rules raised; falling back to (None, None)", exc_info=True)
    rules = ()
for pattern, jur, typ in rules:
    ...

3. Fragmented cache resets create test-isolation risk (PLAUSIBLE)

opencontractserver/pipeline/registry.py, authority_pack_config.py, authority_source_hosts.py

There are now three independent lru_cache namespaces and three separate reset functions:

  • reset_registry() — clears only the registry
  • reset_pack_config_cache() — clears pack_declared_shape_rules + pack_declared_abbreviations
  • reset_source_hosts_cache() — clears pack_declared_source_hosts

The new test files call them independently:

  • test_authority_pack_providers.py calls only reset_registry()
  • test_authority_pack_taxonomy.py calls only reset_pack_config_cache()
  • test_authority_source_hosts.py calls only reset_source_hosts_cache()

A test that installs a mock pack with override_settings(AUTHORITY_PACK_PATHS=[...]) and calls only reset_registry() leaves stale shape rules and source host caches from a prior test on the same worker — the next test's classify_prefix("bo-ley-1234") returns the cached ("bo", "statute") even if the pack is no longer configured. In parallel test runs (pytest -n 4), each worker has its own process so cross-worker contamination isn't a risk, but within a single worker test ordering matters.

A reset_all_pack_caches() helper that clears all three in one call would be the right fix, exported from a single location and called in setUp for tests that touch pack state.


4. Pack manifest YAML loading duplicated 3× with inconsistent guards (DRY)

_iter_pack_mapping_files() (authority_pack_config.py), pack_declared_source_hosts() (authority_source_hosts.py), and _load_yaml() (authority_pack_config.py) all implement the same four-line pattern: open pack.yaml, yaml.safe_load, catch YAMLError, return {}. The pattern exists three times and the isinstance(data, dict) guard is applied only in one of them (finding #1 above). There's already a _load_yaml() helper in authority_pack_config.py; extending it (or moving it to a shared location) and calling it from both _iter_pack_mapping_files() and pack_declared_source_hosts() would centralize the guard and eliminate the duplication.


5. Magic string literals "baseline" and "manual" in _namespace_seed.py (CONVENTIONS)

opencontractserver/enrichment/_namespace_seed.py lines ~44, ~65

The source-ownership tokens "manual" and "baseline" are the same values defined (separately) as:

  • AuthorityMappingLoader.MANUAL = "manual" and AuthorityMappingLoader.BASELINE = "baseline" (authority_mapping_loader.py:33-34)
  • MANUAL = "manual" in both authority_namespace_service.py:49 and authority_mapping_service.py:31

The inline raw strings are a fourth copy. Importing from the service modules would be unsafe in a migration-context helper (live model dependency), so the right fix is to hoist NAMESPACE_SOURCE_BASELINE = "baseline" and NAMESPACE_SOURCE_MANUAL = "manual" into opencontractserver/enrichment/constants.py (which _namespace_seed.py already safely imports inside function bodies), then replace the three independent definitions in the service modules with imports from there.

CLAUDE.md: "No magic numbers — we have constants files in opencontractserver/constants/. Use them for any hardcoded values."


6. is_valid_source_host() belongs in utils/safe_http.py (CONVENTIONS)

opencontractserver/enrichment/services/authority_source_hosts.py line ~42

is_valid_source_host() is a pure hostname-validation function (regex match, no enrichment domain logic, no ORM). It is already imported by two call sites (authority_source_hosts.py itself and load_authority_pack.py). The natural home is opencontractserver/utils/safe_http.py, which already owns PUBLIC_DOMAIN_SOURCE_HOSTS, host_on_allowlist, validate_url, and the SSRFValidationError hierarchy. Placing it in an enrichment-domain service module forces load_authority_pack.py to reach into the enrichment service layer for what is a generic SSRF utility.

CLAUDE.md: "Utility functions belong in utility files: Before writing new utilities, check existing utility files first (opencontractserver/utils/)."

@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 99.63100% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
opencontractserver/pipeline/registry.py 98.03% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

…d registry fault isolation

Covers the patch lines codecov/patch/Backend flagged on this PR:

- test_authority_pack_config.py (new): the fail-fast validators
  (iter_shape_rules / iter_abbreviations) on every malformed shape, the
  never-raise _load_yaml / _iter_pack_mapping_files skip paths, and the
  runtime-scan fault isolation that downgrades a bad pack to log-and-skip.
- test_authority_source_hosts.py: manifest-skip branches in
  pack_declared_source_hosts (no pack.yaml, malformed YAML, non-list
  source_hosts).
- test_authority_pack.py: Command._validate_source_hosts rejects a
  non-list source_hosts declaration.
- test_authority_pack_providers.py: a broken in-pack provider module is
  logged + skipped without breaking sibling discovery; a duplicate
  supported-prefix install is warned; authority_pack_dirs skips a
  non-directory AUTHORITY_PACK_PATHS entry.
- test_safe_http.py: _resolve_allowlist fails closed to the baseline when
  no dynamic provider is registered.

Tests only; no production code change.
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Code Review

This PR makes authority packs self-contained (provider + source hosts + citation vocabulary all travel with the pack directory), fixes the namespace re-seed clobber bug, and centralises docs. The core design is well-reasoned: fail-closed allowlist fallback, fail-fast validation at load time with log-and-skip at runtime, and the lru_cache + explicit reset pattern keeps the hot paths cheap. The test coverage is thorough. A few issues worth addressing before merge:


1. Silent swallow of allowlist provider registration — opencontractserver/annotations/apps.py:87-101

The new register_allowlist_provider call shares a single try/except ImportError: pass block with the ensure_seeded signal registration that already lived there. If effective_source_allowlist or register_allowlist_provider fails to import (e.g. a partial install, a future rename, or a missing yaml transitive dep), the exception is swallowed with no log entry. _allowlist_provider stays None, every safe_fetch_* / host_on_allowlist call silently falls back to the hardcoded baseline, and every domain declared only in a pack's source_hosts is rejected with GATE_BLOCKED_DOMAIN — with no actionable error anywhere.

# current: one broad block, silent failure of the new registration
try:
    from opencontractserver.enrichment._namespace_seed import ensure_seeded
    post_migrate.connect(...)
    from opencontractserver.enrichment.services.authority_source_hosts import effective_source_allowlist
    from opencontractserver.utils.safe_http import register_allowlist_provider
    register_allowlist_provider(effective_source_allowlist)
except ImportError:
    pass

# suggested: separate try/except with a warning so the failure is visible
try:
    from opencontractserver.enrichment._namespace_seed import ensure_seeded
    post_migrate.connect(...)
except ImportError:
    pass

try:
    from opencontractserver.enrichment.services.authority_source_hosts import effective_source_allowlist
    from opencontractserver.utils.safe_http import register_allowlist_provider
    register_allowlist_provider(effective_source_allowlist)
except ImportError:
    import logging
    logging.getLogger(__name__).warning(
        "Could not register pack-aware SSRF allowlist provider; "
        "safe_fetch will fall back to the hardcoded baseline."
    )

2. reset_registry() leaves pack-config and source-host caches stale — opencontractserver/pipeline/registry.py:937-947

reset_registry() clears the three registry caches but does not call reset_pack_config_cache() or reset_source_hosts_cache(). After a reset, the registry reflects the new pack set, but classify_prefix(), GenericCitationExtractor, and effective_source_allowlist() all still serve values from before the reset — a silent three-way divergence.

The tests work around this by calling all three resets independently in addCleanup, but that puts the burden on every caller to know the full dependency graph. Any production code (e.g., a management command that hot-reloads packs) that calls only reset_registry() will see incoherent state.

# registry.py
def reset_registry() -> None:
    PipelineComponentRegistry._instance = None
    PipelineComponentRegistry._initialized = False
    get_registry.cache_clear()
    get_supported_mime_types.cache_clear()
    get_allowed_mime_types.cache_clear()
    # Add these:
    from opencontractserver.enrichment.services.authority_pack_config import reset_pack_config_cache
    from opencontractserver.enrichment.services.authority_source_hosts import reset_source_hosts_cache
    reset_pack_config_cache()
    reset_source_hosts_cache()

Or export a single reset_all_pack_state() that all three tests can call, and have reset_registry() delegate to it.


3. Module-level enrichment → pipeline import — authority_pack_config.py:47, authority_source_hosts.py:30

Both new enrichment/services modules import authority_pack_dirs at the module top level from opencontractserver.pipeline.registry. The existing enrichment services avoid this by using lazy function-level imports (see authority_discovery_service.py:78, authority_frontier_service.py:469), precisely because the pipeline registry imports the full component machinery.

authority_pack_dirs itself only needs two things: the in-tree packs root path (a Path constant) and the AUTHORITY_PACK_PATHS setting. It belongs in an enrichment-layer utility (e.g. opencontractserver/enrichment/services/authority_pack_dirs.py) so enrichment services can import it without pulling in the pipeline registry's ComponentType enums, PipelineComponentDefinition dataclasses, and registry singleton at module load time.

The existing docstring on authority_pack_config.py acknowledges the indirection but frames it as a "lazy import in the consumer" solution; the cleaner fix at the right altitude is to move authority_pack_dirs (and _AUTHORITY_PACKS_ROOT) out of registry.py into an enrichment utility, then have registry.py import it from there instead of the reverse.


4. Stale compiled regexes in reused GenericCitationExtractor instances — opencontractserver/enrichment/grammars.py:293

GenericCitationExtractor.__init__ merges pack_declared_abbreviations() into the baseline tables and compiles all regex alternations into self._state_re / self._muni_re. This is correct for a freshly-constructed instance, but any code that reuses an existing extractor instance across a reset_pack_config_cache() call (e.g., a long-lived Celery task that caches the extractor) will silently continue using the old compiled regex regardless of pack changes.

This pattern predates this PR (the baseline was already baked at __init__), but the PR makes it observable: pack abbreviation changes now require both a cache reset AND re-instantiation. Worth noting in a docstring comment on __init__ so future callers know not to reuse instances across pack changes.


Minor nit

authority_source_hosts.py logs "Authority pack %r widens the SSRF allowlist with source host: %s" at INFO level only on the first encounter (if host not in hosts and host not in PUBLIC_DOMAIN_SOURCE_HOSTS). Good — but the check is host not in hosts, so if a pack re-declares a host it already declared in the same scan (two pack.yaml files listing the same host), the second one is silently de-duped without a log. Probably intentional, just worth confirming.

@JSv4 JSv4 merged commit 94c67a8 into main Jun 25, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant