Authority packs: self-contained per-pack config, bug fixes, docs #2064
- Discover provider modules from a pack's providers/ dir (in-tree packs and out-of-tree dirs on the AUTHORITY_PACK_PATHS setting); warn on duplicate provider prefixes.
- Merge a pack's pack.yaml source_hosts into the SSRF allowlist at runtime via an injected provider; the gate uses the same effective allowlist.
- A pack's mappings YAML can declare shape_rules and state/municipal abbreviations, merged onto the Python baseline at runtime.
- Fix the AuthorityNamespace post_migrate re-seed clobbering manual/corpus-linked rows on every migrate/flush.
- Reject self-referential equivalences fail-fast; remove dead crawl helper.
- Add docs/guides/authoring-authority-packs.md, wire it + the packs proposal into the nav, and cross-link from the console doc and bolivia README.
- Add a CLAUDE.md doc-hygiene principle (concise, current, code pointers).
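The provider-discovery behaviour described in the first bullet can be pictured roughly as follows. This is a minimal sketch, not the PR's code: the function names `load_provider_modules` and `duplicate_prefixes`, the module-level `SUPPORTED_PREFIXES` attribute, and the `pack_provider_*` module-naming scheme are all assumptions made for illustration. It only shows the general technique of importing modules by file path with `importlib`, logging and skipping any that fail, and warning when two providers claim the same prefix.

```python
import importlib.util
import logging
from pathlib import Path
from types import ModuleType

logger = logging.getLogger("authority_packs")


def load_provider_modules(pack_dirs: list[Path]) -> list[ModuleType]:
    """Import every providers/*.py under the given pack dirs, skipping failures."""
    modules: list[ModuleType] = []
    for pack_dir in pack_dirs:
        providers_dir = pack_dir / "providers"
        if not providers_dir.is_dir():
            continue  # a pack without a providers/ dir is simply skipped
        for path in sorted(providers_dir.glob("*.py")):
            if path.name.startswith("_"):
                continue  # ignore __init__.py and private helpers
            # Hypothetical naming scheme; any unique module name would do.
            name = f"pack_provider_{pack_dir.name}_{path.stem}"
            try:
                spec = importlib.util.spec_from_file_location(name, path)
                if spec is None or spec.loader is None:
                    raise ImportError(f"no loader for {path}")
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)
            except Exception:
                # One broken provider must not break its siblings.
                logger.exception("skipping broken provider %s", path)
                continue
            modules.append(module)
    return modules


def duplicate_prefixes(modules: list[ModuleType]) -> dict[str, list[str]]:
    """Return each prefix claimed by more than one module, with its claimants."""
    owners: dict[str, list[str]] = {}
    for module in modules:
        # SUPPORTED_PREFIXES is an assumed module-level attribute.
        for prefix in getattr(module, "SUPPORTED_PREFIXES", ()):
            owners.setdefault(prefix, []).append(module.__name__)
    dupes = {p: names for p, names in owners.items() if len(names) > 1}
    for prefix, names in dupes.items():
        logger.warning("prefix %r claimed by multiple providers: %s", prefix, names)
    return dupes
```

Loading by file path rather than by package name is what lets an out-of-tree directory work without being on `sys.path`; the trade-off is that the loader has to invent a module name itself.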
Code Review
This PR makes authority packs self-contained: scrapers, SSRF host declarations, and citation vocabulary now travel with the pack directory rather than requiring core edits. The namespace re-seed bug fix is well-targeted, the SSRF injection mechanism is appropriately fail-closed, and the test coverage for the new modules is solid. A few issues worth addressing before merge:
1. Missing
Codecov Report
❌ Patch coverage is
…d registry fault isolation
Covers the patch lines codecov/patch/Backend flagged on this PR:
- test_authority_pack_config.py (new): the fail-fast validators (iter_shape_rules / iter_abbreviations) on every malformed shape, the never-raise _load_yaml / _iter_pack_mapping_files skip paths, and the runtime-scan fault isolation that downgrades a bad pack to log-and-skip.
- test_authority_source_hosts.py: manifest-skip branches in pack_declared_source_hosts (no pack.yaml, malformed YAML, non-list source_hosts).
- test_authority_pack.py: Command._validate_source_hosts rejects a non-list source_hosts declaration.
- test_authority_pack_providers.py: a broken in-pack provider module is logged + skipped without breaking sibling discovery; a duplicate supported-prefix install is warned; authority_pack_dirs skips a non-directory AUTHORITY_PACK_PATHS entry.
- test_safe_http.py: _resolve_allowlist fails closed to the baseline when no dynamic provider is registered.
Tests only; no production code change.
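The fail-closed behaviour that the test_safe_http.py item exercises can be sketched as below. This is an illustration under stated assumptions, not the project's `_resolve_allowlist`: the names `BASELINE_HOSTS`, `register_allowlist_provider`, `resolve_allowlist`, and `is_allowed` are hypothetical, the hosts are placeholders, and falling back to the baseline when the provider raises is an extra assumption that the source does not confirm.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger("safe_http")

# Placeholder baseline; the real baseline hosts are not shown in the PR text.
BASELINE_HOSTS: frozenset[str] = frozenset({"baseline.example.org"})

# The injected provider returns extra hosts (e.g. from installed packs).
_provider: Optional[Callable[[], Iterable[str]]] = None


def register_allowlist_provider(provider: Optional[Callable[[], Iterable[str]]]) -> None:
    """Install (or clear, with None) the dynamic host provider."""
    global _provider
    _provider = provider


def resolve_allowlist() -> frozenset[str]:
    """Baseline plus provider hosts; baseline only if there is no usable provider."""
    if _provider is None:
        return BASELINE_HOSTS  # fail closed: nothing registered, nothing widened
    try:
        extra = frozenset(host.lower() for host in _provider())
    except Exception:
        # Assumed behaviour: a faulty provider never widens the allowlist.
        logger.exception("allowlist provider failed; using baseline only")
        return BASELINE_HOSTS
    return BASELINE_HOSTS | extra


def is_allowed(host: str) -> bool:
    """Gate check; uses the same effective set that resolve_allowlist() reports."""
    return host.lower() in resolve_allowlist()
```

The point of routing both the gate and any reporting code through one resolver is that the two cannot drift apart: whatever set is displayed is the set that is enforced.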
Code Review
This PR makes authority packs self-contained (provider + source hosts + citation vocabulary all travel with the pack directory), fixes the namespace re-seed clobber bug, and centralises docs. The core design is well-reasoned: fail-closed allowlist fallback, fail-fast validation at load time with log-and-skip at runtime, and the
1. Silent swallow of allowlist provider registration:
Summary
Follow-up to #2053 that makes an authority pack a self-contained, copyable directory — everything for one authority (taxonomy, content, personas, its scraper, and its source hosts) lives in the pack and travels with it: copy the directory, get the authority. Also fixes a namespace re-seed bug and centralizes the docs.
Changes
Self-contained packs
- Provider modules are discovered from `<pack>/providers/`: in-tree packs and out-of-tree dirs on the new `AUTHORITY_PACK_PATHS` setting (imported by file path, failures logged + skipped). A provider now lives with its authority instead of in core's `authority_source_providers/` package; duplicate `supported_prefixes` across providers now warn.
- `pack.yaml` `source_hosts:` are merged into the SSRF allowlist at runtime via an injected provider; the gate uses the same effective set. The SSRF mechanism (HTTPS-only, public-IP, per-redirect-hop revalidation, size caps) is unchanged: a pack only widens which hosts are reachable, and only once installed. Secrets stay in the `PipelineSettings` encrypted vault.
- A pack's mappings YAML can declare `shape_rules:` (classify a numbered prefix family) and `abbreviations:` (state/municipal), merged onto the Python baseline at runtime (baseline wins collisions).
Bug fixes
- The `_namespace_seed` `post_migrate` re-seed no longer clobbers `source="manual"` / corpus-linked `AuthorityNamespace` rows on every migrate/flush (restores the console's "a re-load can't overwrite curator edits" guarantee).
- Self-referential equivalences are rejected fail-fast; the dead `_park_for_cap` crawl helper is removed.
Docs
- Added `docs/guides/authoring-authority-packs.md`; wired into the nav along with the packs proposal; cross-linked from the Authority Console doc + bolivia README. The proposal is re-framed as design-history.
Testing
- `test_authority_pack_providers.py`, `test_authority_source_hosts.py`, `test_authority_pack_taxonomy.py`, plus additions to `test_authority_mapping_loader.py` / `test_authority_pack.py`.
Left as core (by design)
The shared `authority_type` vocabulary (wired to model `choices`) and the Tier-2a citation-form parsing grammars stay in the engine: shared vocabulary / parsing logic, not per-authority config.
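To make the "baseline wins collisions" rule for pack-declared abbreviations concrete, here is a small sketch of fail-fast validation followed by a baseline-preserving merge. It is an assumption-laden illustration rather than the PR's loader: `validate_abbreviations` and `merge_abbreviations` are hypothetical names, the `abbreviations:` block is assumed to be a flat mapping of abbreviation to expansion, and letting the first pack win over later packs is also an assumption.

```python
from typing import Any, Iterable, Iterator, Mapping


def validate_abbreviations(raw: Any) -> Iterator[tuple[str, str]]:
    """Yield (abbreviation, expansion) pairs; raise ValueError on a malformed block."""
    if raw is None:
        return  # a pack that declares no abbreviations is valid
    if not isinstance(raw, Mapping):
        raise ValueError(f"abbreviations must be a mapping, got {type(raw).__name__}")
    for key, value in raw.items():
        if not isinstance(key, str) or not key.strip():
            raise ValueError(f"abbreviation key must be a non-empty string: {key!r}")
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"expansion for {key!r} must be a non-empty string")
        yield key.strip(), value.strip()


def merge_abbreviations(
    baseline: Mapping[str, str],
    pack_documents: Iterable[Mapping[str, Any]],
) -> dict[str, str]:
    """Layer pack abbreviations onto the baseline without overriding it."""
    merged = dict(baseline)
    for document in pack_documents:
        for key, value in validate_abbreviations(document.get("abbreviations")):
            # setdefault keeps any existing entry, so the baseline
            # (and, by assumption, earlier packs) win on collision.
            merged.setdefault(key, value)
    return merged
```

With this shape a pack can only add vocabulary; it cannot silently redefine an abbreviation the engine already understands.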