Authority Packs: proposal + Phase 1 implementation (Bolivia pack + load_authority_pack) #2053
…w-up)

Design doc only: no code, migrations, or tests (mirrors #1444's convention).

- docs/architecture/proposals/0002-authority-packs.md: reframes the 'authority pack' drop-in concept on the shipped Authority architecture (the provider rail that #1444 could not yet assume). Defines the four extension seams, the pack anatomy, the drop-in lifecycle, the #1305 component->slot mapping (full/partial/gap), the residual gaps with a 4-phase recommendation, and the re-homing of #1444 Phase A/B onto this rail.
- docs/architecture/proposals/0002-authority-packs-bolivia-spec.md: the buildable Phase 1 artifact, comprising the pack layout, authority_mappings.bolivia.yaml, a BaseAuthoritySourceProvider skeleton, a bootstrap section-spec JSON, a Spanish persona, the one required host-allowlist edit, and the drop-in commands.

All code references fact-checked against current source (provider interface, AuthoritySection, bootstrap_authority spec, ALL_AUTHORITY_TYPES, the CorpusGroup/scheduling gaps). Credits @jseborga per the #1444 migration story.
Review: Authority Packs proposal + Bolivia spec

This PR is a well-researched, well-structured design doc. The author verified 10 of 11 code-shape claims against the current source before publishing, and the gap-and-phasing analysis in §7 is accurate (the two missing primitives were confirmed absent in the repo). Overall a clean contribution, with a few issues worth addressing before merge.

1. Broken cross-reference
…low-up)

Seed-based pack format on the shipped Authority architecture, with no bespoke app.

- load_authority_pack management command (generic, jurisdiction-agnostic): reads a pack.yaml manifest and idempotently loads the taxonomy YAML (AuthorityMappingLoader.load_all), bootstraps one authority corpus per legal area from a JSON section spec (bootstrap_authority_corpus), and writes each area persona into Corpus.corpus_agent_instructions. --path accepts any dir (out-of-tree packs supported).
- Reference Bolivia pack under enrichment/data/authority_packs/bolivia/: 5-prefix taxonomy (jurisdiction=bo, Spanish aliases), seeded constitucional corpus (CPE articles), Spanish persona, manifest, README. Repackages PR #1305 (@jseborga) as data instead of a standalone Django app.
- Tests (test_authority_pack.py): static pack-validity (manifest/mappings/spec integrity, authority_type vocab, declared-prefix coverage) + command end-to-end (namespaces, corpus, persona, document count, idempotency). 5 pass.
- Updated proposal + spec docs to record the as-built Phase 1 and that the live-fetch provider folds into Phase 2 (#2054): the Bolivian publishers are listing-page, not key-addressable, so a citation-keyed provider cannot fetch yet. Seed-based Phase 1 needs no provider and no host-allowlist edit.
- Changelog fragment.

Credits @jseborga per the #1444 migration story.
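The load order this commit describes (taxonomy first, then one corpus per legal area, then its persona) can be sketched as a small planning function. This is a minimal illustration, not the command's actual code: `plan_pack_load`, the step names, and the per-entry `spec` / `persona` keys are hypothetical, and the manifest is assumed to be already parsed into a dict.

```python
from pathlib import Path


def plan_pack_load(manifest: dict, pack_dir: Path) -> list[tuple[str, Path]]:
    """Return the ordered (step, file) pairs a pack load would perform.

    Hypothetical helper: the "spec" and "persona" entry keys are assumptions
    made for illustration, not the documented manifest schema.
    """
    mappings = manifest.get("mappings")
    corpora = manifest.get("corpora", [])
    if mappings is None and not corpora:
        # A manifest that declares nothing is more likely a typo than a no-op.
        raise ValueError("manifest declares neither mappings nor corpora")

    steps: list[tuple[str, Path]] = []
    if mappings is not None:
        # Taxonomy goes first so the namespaces exist before any corpus is seeded.
        steps.append(("load_taxonomy", pack_dir / mappings))
    for entry in corpora:
        steps.append(("bootstrap_corpus", pack_dir / entry["spec"]))
        if entry.get("persona"):
            steps.append(("write_persona", pack_dir / entry["persona"]))
    return steps
```

Because the function only takes a directory and a dict, an out-of-tree pack would be planned the same way as the bundled one; only `pack_dir` changes.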
Code Review

This PR introduces the authority-pack format (Phase 1): a generic

1. Relink sweeps run once per corpus instead of once total

File:
all_keys: list[str] = []
for entry in corpora:
    ...
    out = bootstrap_authority_corpus(..., relink=False)  # defer
    all_keys.extend(sec.key for sec in sections)
    ...
if not options["no_relink"] and all_keys:
    from opencontractserver.enrichment.services import EnrichmentService
    EnrichmentService().relink_corpora_for_keys(all_keys)

2.
…pec parser

Code review (load_authority_pack.py):
- Defer the reactive re-link to a single sweep after the whole pack loads instead of one full-table sweep per corpus; print the relink summary.
- Persist `modified` when corpus overrides are applied (add it to update_fields so Corpus.save()'s timestamp bump is not filtered out).
- Resolve the declared persona BEFORE bootstrapping so a missing persona file can't strand a half-loaded corpus.
- Distinguish missing/null/wrong-type `corpora` and reject a manifest that declares neither mappings nor corpora (was a silent exit-zero on a typo).
- Make override application idempotent: skip the SELECT when nothing is declared and the UPDATE when every declared value already matches.
- Extract the duplicated JSON section-spec parsing into enrichment.authorities.parse_section_spec / read_section_spec; both bootstrap_authority and load_authority_pack now share one contract (DRY).

Docs review (0002-authority-packs-bolivia-spec.md):
- Document AuthorityRequest.params in the Slot 2 narrative.
- Drop the misleading re.IGNORECASE on the lowercase-only _NUMBER_RE.
- Guard the colon-less canonical_key split in the _locate_impl skeleton.

Tests: add LoadAuthorityPackEdgeCaseTests covering --public, deferred single relink, --no-relink, taxonomy-only packs, persona idempotency + modified persistence, model overrides, and the manifest/spec/persona error guards.
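The idempotent-override rule above (no read when nothing is declared, no write when every declared value already matches, and `modified` included in `update_fields`) can be illustrated without a database. This is a sketch under stated assumptions: `FakeCorpus` is a stand-in for the real Corpus model and `apply_overrides` is a hypothetical helper name, not the command's actual function.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCorpus:
    """Stand-in for the Corpus model; it only records how save() was called."""

    corpus_agent_instructions: str = ""
    is_public: bool = False
    saves: int = 0
    last_update_fields: list = field(default_factory=list)

    def save(self, update_fields):
        self.saves += 1
        self.last_update_fields = list(update_fields)


def apply_overrides(corpus, overrides: dict) -> list[str]:
    """Apply declared overrides and return the names of fields that changed."""
    if not overrides:
        return []  # nothing declared: skip the work entirely
    changed = [
        name for name, value in overrides.items() if getattr(corpus, name) != value
    ]
    if not changed:
        return []  # every declared value already matches: skip the write
    for name in changed:
        setattr(corpus, name, overrides[name])
    # With update_fields, only the listed columns are written, so the
    # timestamp column has to be listed for its bump to be persisted.
    corpus.save(update_fields=[*changed, "modified"])
    return changed
```

Running the same overrides twice should produce exactly one save, which is the property the persona-idempotency test in this commit is described as checking.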
Code Review: Authority Packs, Phase 1

This PR cleanly repackages PR #1305's Bolivia-law work as a data bundle (taxonomy YAML + section specs + personas) consumed by a new generic

1.
…output

- load_authority_pack: validate the whole pack (mappings file, corpora shape, per-entry spec + persona) BEFORE any DB write, so a malformed corpora entry can no longer abort with taxonomy already committed and zero corpora created. Split _load_taxonomy into _resolve_mappings_path (validation) + _load_taxonomy (load) and extract _validate_corpus_entry.
- parse_section_spec: reject a non-list-of-strings 'aliases' (a bare string was iterated char-by-char downstream, corrupting the authority alias registry).
- load_authority_pack: include 'restamped' in the per-corpus success line so a re-run after a text change reports the restamp instead of all-zeros.
- tests: assert --public cascades is_public to the seeded documents; add taxonomy-not-loaded-on-invalid-corpora and aliases-wrong-type cases; use the get_corpus_documents_visible_to_user variant for the user-facing count asserts.
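The aliases bug described above comes from the fact that a Python string is itself iterable, so a bare string handed to code expecting a list is consumed one character at a time. A minimal sketch of the guard follows; `validate_aliases` is a hypothetical name, and treating a missing value as an empty list is an assumption, not necessarily what parse_section_spec does.

```python
def validate_aliases(raw: object) -> list[str]:
    """Return a clean alias list, or raise if the value is not a list of strings."""
    if raw is None:
        return []  # assumption: an absent 'aliases' key means no aliases
    if not isinstance(raw, list):
        # Catches the bare-string case: list("CPE") would be ["C", "P", "E"].
        raise ValueError(
            f"'aliases' must be a list of strings, got {type(raw).__name__}"
        )
    if not all(isinstance(alias, str) for alias in raw):
        raise ValueError("'aliases' must contain only strings")
    return list(raw)
```

Rejecting the value at parse time keeps the failure next to the malformed spec instead of surfacing later as single-letter entries in the alias registry.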
Code Review

This PR cleanly packages the Bolivia-law concept from #1305 as data on the existing Authority rail rather than a bespoke Django app. The

1. Non-Mapping JSON spec crashes with
Why
PR #1305 (@jseborga) proposed Bolivian-law support as a standalone bolivian_laws/ Django app, with its own scraper base, dedup, 11 hard-coded legal areas, personas, and a bespoke GraphQL mutation. The Authority architecture (#1990 / #1997 / #2037) now ships the primitives that app hand-rolled. This PR (1) captures the "authority pack" design that exploits that rail and (2) implements Phase 1, a seed-based pack format plus a reference Bolivia pack, so #1305's work is repackaged as data, not an app.

What
Design docs
- docs/architecture/proposals/0002-authority-packs.md: the proposal, covering the four extension seams, pack anatomy, drop-in lifecycle, the #1305 → pack mapping (full / partial / gap), a 4-phase roadmap, and §9 re-homing #1444's missing primitives onto this rail.
- docs/architecture/proposals/0002-authority-packs-bolivia-spec.md: the concrete Bolivia spec (taxonomy YAML, section-spec JSON, persona, and the Phase-2 provider skeleton reference).

Phase 1 implementation (seed-based)
- load_authority_pack management command (generic, jurisdiction-agnostic): reads a pack.yaml manifest and idempotently (1) loads the pack's authority_mappings YAML into AuthorityNamespace via AuthorityMappingLoader.load_all(path=…), (2) bootstraps one authority corpus per legal area from a JSON section spec via bootstrap_authority_corpus, (3) writes each persona into Corpus.corpus_agent_instructions. --path accepts any directory, so out-of-tree packs load identically.
- Reference Bolivia pack (opencontractserver/enrichment/data/authority_packs/bolivia/): five-prefix taxonomy (jurisdiction: bo, Spanish aliases), a seeded constitucional corpus (CPE articles), a Spanish persona, manifest + README.
- Tests (opencontractserver/tests/test_authority_pack.py): static pack-validity (manifest / mappings schema / authority_type vocab / declared-prefix coverage) + the command end-to-end (namespaces, corpus, persona, document count, idempotency). 5 pass.

The one finding that shaped Phase 1
Reading #1305's actual scrapers confirmed the Bolivian sources (Gaceta Oficial / TSJ / TCP) are listing-page publishers, not key-addressable, so a deterministic canonical_key → URL provider can't fetch yet. That capability is the Phase-2 discovery gap (#2054). Phase 1 therefore ships taxonomy + curated content + personas with no live fetch (and so no host-allowlist edit); the provider is retained in the spec as the Phase-2 reference skeleton.

Roadmap: generalizable follow-up issues (not Bolivia-specific)
CorpusGroup + a search_across_corpora (= #1444 Phase B)

Provenance
Personas, the eleven-area taxonomy, and the source-publisher knowledge are ported from PR #1305; @jseborga credited per #1444's migration story. Code references in the docs were fact-checked against current source.