Skip to content

Authority Packs: proposal + Phase 1 implementation (Bolivia pack + load_authority_pack)#2053

Merged
JSv4 merged 6 commits into
mainfrom
docs/authority-packs-proposal
Jun 24, 2026
Merged

Authority Packs: proposal + Phase 1 implementation (Bolivia pack + load_authority_pack)#2053
JSv4 merged 6 commits into
mainfrom
docs/authority-packs-proposal

Conversation

@JSv4

@JSv4 JSv4 commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Why

PR #1305 (@jseborga) proposed Bolivian-law support as a standalone bolivian_laws/ Django app — its own scraper base, dedup, 11 hard-coded legal areas, personas, and a bespoke GraphQL mutation. The Authority architecture (#1990 / #1997 / #2037) now ships the primitives that app hand-rolled. This PR (1) captures the "authority pack" design that exploits that rail and (2) implements Phase 1 — a seed-based pack format + a reference Bolivia pack — so #1305's work is repackaged as data, not an app.

What

Design docs

Phase 1 implementation (seed-based)

  • load_authority_pack management command (generic, jurisdiction-agnostic): reads a pack.yaml manifest and idempotently (1) loads the pack's authority_mappings YAML into AuthorityNamespace via AuthorityMappingLoader.load_all(path=…), (2) bootstraps one authority corpus per legal area from a JSON section spec via bootstrap_authority_corpus, (3) writes each persona into Corpus.corpus_agent_instructions. --path accepts any directory → out-of-tree packs load identically.
  • Reference Bolivia pack (opencontractserver/enrichment/data/authority_packs/bolivia/): five-prefix taxonomy (jurisdiction: bo, Spanish aliases), a seeded constitucional corpus (CPE articles), a Spanish persona, manifest + README.
  • Tests (opencontractserver/tests/test_authority_pack.py): static pack-validity (manifest / mappings schema / authority_type vocab / declared-prefix coverage) + the command end-to-end (namespaces, corpus, persona, document count, idempotency). 5 pass.

The one finding that shaped Phase 1

Reading #1305's actual scrapers confirmed the Bolivian sources (Gaceta Oficial / TSJ / TCP) are listing-page publishers, not key-addressable — so a deterministic canonical_key → URL provider can't fetch yet. That capability is the Phase-2 discovery gap (#2054). Phase 1 therefore ships taxonomy + curated content + personas with no live fetch (and so no host-allowlist edit); the provider is retained in the spec as the Phase-2 reference skeleton.

Roadmap — generalizable follow-up issues (not Bolivia-specific)

Provenance

Personas, the eleven-area taxonomy, and the source-publisher knowledge are ported from PR #1305; @jseborga credited per #1444's migration story. Code references in the docs were fact-checked against current source.

…w-up)

Design doc only — no code, migrations, or tests (mirrors #1444's convention).

- docs/architecture/proposals/0002-authority-packs.md: reframes the 'authority
  pack' drop-in concept on the shipped Authority architecture (the provider rail
  that #1444 could not yet assume). Defines the four extension seams, the pack
  anatomy, the drop-in lifecycle, the #1305 component->slot mapping
  (full/partial/gap), the residual gaps with a 4-phase recommendation, and the
  re-homing of #1444 Phase A/B onto this rail.
- docs/architecture/proposals/0002-authority-packs-bolivia-spec.md: the buildable
  Phase 1 artifact — pack layout, authority_mappings.bolivia.yaml, a
  BaseAuthoritySourceProvider skeleton, a bootstrap section-spec JSON, a Spanish
  persona, the one required host-allowlist edit, and the drop-in commands.

All code references fact-checked against current source (provider interface,
AuthoritySection, bootstrap_authority spec, ALL_AUTHORITY_TYPES, the
CorpusGroup/scheduling gaps). Credits @jseborga per the #1444 migration story.
@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review — Authority Packs proposal + Bolivia spec

This PR is a well-researched, well-structured design doc. The author verified 10 of 11 code-shape claims against the current source before publishing, and the gap-and-phasing analysis in §7 is accurate (the two missing primitives were confirmed absent in the repo). Overall a clean contribution — a few issues worth addressing before merge.

1. Broken cross-reference — authority-console.md does not exist

File: docs/architecture/proposals/0002-authority-packs.md, metadata "Relates to" row

The link points to ../authority-console.md, but docs/architecture/authority-console.md does not exist in the repo. docs/architecture/reference-web-enrichment.md (the other link in the same row) does exist.

Fix: remove the link, correct it if a different file was intended, or note it as "coming soon" the way the 0001 entry is handled.

2. Missing changelog fragment

CLAUDE.md requires a changelog fragment under changelog.d/ for significant additions — no fragment exists for this PR. A doc that names and specifies a new architectural concept ("authority packs") is substantive enough to warrant one, and there is repo precedent for docs additions getting fragments.

Suggested: changelog.d/2053-authority-packs.added.md

3. AuthorityRequest.params field absent from the Bolivia spec narrative

File: docs/architecture/proposals/0002-authority-packs-bolivia-spec.md, Slot 2 section

The actual dataclass in opencontractserver/pipeline/base/base_authority_source_provider.py has five fields:

@dataclass
class AuthorityRequest:
    canonical_key: str
    url: str
    params: dict = field(default_factory=dict)  # not mentioned in the spec
    citation: str | None = None
    extra: dict = field(default_factory=dict)

The skeleton omitting params is fine (it defaults to {}), but a contributor building a provider against a query-string API needs to know the field exists. The "buildable spec" goal is slightly undercut without it.

Fix: one-line mention of params: dict in the _locate_impl narrative.

Minor notes (no action required)

  • re.IGNORECASE on _NUMBER_RE: the pattern ^[0-9][0-9a-z\-]*$ already restricts to lowercase; the flag silently widens it to accept uppercase too. Harmless but misleading — dropping it would make the intent explicit.
  • No guard on colon-less key in _locate_impl: safe at runtime (guarded by can_handle()), but canonical_key.split(":", 1) with no colon gives a confusing ValueError: not enough values to unpack if called directly in a unit test. A one-line guard before the split would improve the skeleton's value as a template.

Everything else — file paths, class and method signatures, ALL_AUTHORITY_TYPES values, YAML structure, registry auto-discovery mechanism, bootstrap_authority command — was verified against the current source and is accurate.

…low-up)

Seed-based pack format on the shipped Authority architecture — no bespoke app.

- load_authority_pack management command (generic, jurisdiction-agnostic): reads a
  pack.yaml manifest and idempotently loads the taxonomy YAML
  (AuthorityMappingLoader.load_all), bootstraps one authority corpus per legal
  area from a JSON section spec (bootstrap_authority_corpus), and writes each
  area persona into Corpus.corpus_agent_instructions. --path accepts any dir
  (out-of-tree packs supported).
- Reference Bolivia pack under enrichment/data/authority_packs/bolivia/: 5-prefix
  taxonomy (jurisdiction=bo, Spanish aliases), seeded constitucional corpus (CPE
  articles), Spanish persona, manifest, README. Repackages PR #1305 (@jseborga)
  as data instead of a standalone Django app.
- Tests (test_authority_pack.py): static pack-validity (manifest/mappings/spec
  integrity, authority_type vocab, declared-prefix coverage) + command
  end-to-end (namespaces, corpus, persona, document count, idempotency). 5 pass.
- Updated proposal + spec docs to record the as-built Phase 1 and that the
  live-fetch provider folds into Phase 2 (#2054): the Bolivian publishers are
  listing-page, not key-addressable, so a citation-keyed provider cannot fetch
  yet. Seed-based Phase 1 needs no provider and no host-allowlist edit.
- Changelog fragment.

Credits @jseborga per the #1444 migration story.
@JSv4 JSv4 changed the title Authority Packs: proposal + concrete Bolivia pack spec (#1305 / #1444 follow-up) Authority Packs: proposal + Phase 1 implementation (Bolivia pack + load_authority_pack) Jun 23, 2026
@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown

Code Review

This PR introduces the authority-pack format (Phase 1): a generic load_authority_pack management command that reads a pack.yaml manifest and idempotently bootstraps taxonomy namespaces + per-area authority corpora, plus a reference Bolivia pack demonstrating the pattern. The architecture — repackaging a jurisdiction's body of law as data rather than a bespoke app — is clean, and the design documents are thorough. A few actionable issues below, ranked by severity.


1. Relink sweeps run once per corpus instead of once total

File: opencontractserver/corpuses/management/commands/load_authority_pack.py, line ~167

bootstrap_authority_corpus(relink=not options["no_relink"]) is called inside the for entry in corpora: loop. When relink=True, bootstrap_authority_corpus calls EnrichmentService().relink_corpora_for_keys(keys) before returning — so an 11-corpus Bolivia pack triggers 11 separate relink sweeps, each scanning the full CorpusReference table for its own narrow key set. The correct shape is to collect all section keys across the loop and run one relink after the loop completes:

all_keys: list[str] = []
for entry in corpora:
    ...
    out = bootstrap_authority_corpus(..., relink=False)   # defer
    all_keys.extend(sec.key for sec in sections)
    ...

if not options["no_relink"] and all_keys:
    from opencontractserver.enrichment.services import EnrichmentService
    EnrichmentService().relink_corpora_for_keys(all_keys)

2. modified timestamp not persisted when overrides are applied

File: opencontractserver/corpuses/management/commands/load_authority_pack.py, _apply_corpus_overrides (the corpus.save(update_fields=fields) call near the end)

Corpus.save() unconditionally sets self.modified = timezone.now() (line 472 of models.py), but because "modified" is never appended to fields, Django's update_fields filter prevents the value from reaching the database. The corpus's modified column stays at its pre-override value. Compare with bootstrap_authority_corpus in authorities.py line 312, which correctly uses corpus.save(update_fields=["is_public", "modified"]).

Fix: fields.append("modified") before the save() call, or add it unconditionally to update_fields.


3. Partial DB state when persona file is missing

File: opencontractserver/corpuses/management/commands/load_authority_pack.py, handle() loop (~lines 161–169)

bootstrap_authority_corpus() creates the corpus and all its section documents and commits them to the database before _apply_corpus_overrides() is called. If the persona file referenced in pack.yaml doesn't exist, _apply_corpus_overrides raises CommandError — but the corpus and documents already exist in the DB. There is no transaction.atomic() wrapping the pair.

The practical risk is an operator who sees the command fail but doesn't realise the corpus was actually created. On a re-run the sections skip correctly (idempotent), but the error message is misleading on the first run. The fix is either to validate all referenced paths (persona, spec) before calling bootstrap_authority_corpus, or to wrap the pair in transaction.atomic().


4. Silent exit-zero when corpora key is missing or null

File: opencontractserver/corpuses/management/commands/load_authority_pack.py, lines 147–149

corpora = manifest.get("corpora") or []
if not isinstance(corpora, list):
    raise CommandError("Manifest 'corpora' must be a list.")

When pack.yaml has corpora: with nothing after it (YAML null → Python None), or the key is absent entirely, None or [] gives [], the isinstance guard passes, and the loop runs zero times. The command exits 0 having bootstrapped no corpora, with no warning. A simple typo in the key name produces a silent no-op.

Fix: distinguish missing-key from wrong-type:

raw_corpora = manifest.get("corpora")
if raw_corpora is None:
    raise CommandError("Manifest must contain a 'corpora' key.")
if not isinstance(raw_corpora, list):
    raise CommandError("Manifest 'corpora' must be a list.")
corpora = raw_corpora

5. _parse_sections duplicates bootstrap_authority.py (DRY violation)

Files: load_authority_pack.py lines 53–86, bootstrap_authority.py lines 87–112

The JSON→AuthoritySection parsing routine is copy-pasted nearly verbatim: same file-read + exception handling, same isinstance(raw_sections, list) guard, same per-element isinstance(sec, dict) and all(isinstance(sec.get(f), str) and sec[f].strip() for f in ("key", "heading", "text")) check, same AuthoritySection(...) construction. The _parse_sections docstring acknowledges this explicitly: "Mirrors the validation in the bootstrap_authority command."

The fix is to extract the function to opencontractserver/enrichment/authorities.py (alongside AuthoritySection and bootstrap_authority_corpus) and have both commands import it. This is called out in CLAUDE.md: "DRY — please always architect code for maximal dryness and always see if you can consolidate related code."


6. Relink summary completely suppressed from command output

File: opencontractserver/corpuses/management/commands/load_authority_pack.py, lines ~171–178

bootstrap_authority_corpus returns out["relink"] when relink=True, containing corpora_relinked, law_references_linked, links_restamped, and corpora_failed. bootstrap_authority.py prints a second success line with these values (lines 134–143). load_authority_pack.py never reads out["relink"] at all — failures in corpora_failed are silently discarded. The minimal fix is to mirror the if relink := out.get("relink"): self.stdout.write(...) block from bootstrap_authority.py.


7. Persona written unconditionally on every run (idempotency claim violated)

File: opencontractserver/corpuses/management/commands/load_authority_pack.py, _apply_corpus_overrides (~lines 193–201)

Every run unconditionally reads the persona file from disk and calls corpus.save(update_fields=["corpus_agent_instructions"]) — even when the content is identical to what is already stored. This contradicts the module docstring's claim of idempotency ("Idempotent and re-runnable") and emits a spurious UPDATE per re-run. By contrast, AuthorityCorpusBootstrapper.bootstrap() explicitly compares if current == sec.text: skipped += 1 before writing section text.

Fix: read the existing value first and skip the save if unchanged:

new_persona = persona_path.read_text(encoding="utf-8").strip()
if corpus.corpus_agent_instructions != new_persona:
    corpus.corpus_agent_instructions = new_persona
    fields.append("corpus_agent_instructions")

8. Double fetch + save per corpus when --public and overrides are both active

File: opencontractserver/corpuses/management/commands/load_authority_pack.py, handle() + _apply_corpus_overrides

When both --public and a persona are declared, bootstrap_authority_corpus fetches the corpus and saves it for is_public (line 307 of authorities.py), then _apply_corpus_overrides immediately fetches the same row again (Corpus.objects.get(pk=corpus_id)) and saves it a second time. These two round-trips touch the same row and could be collapsed into one. As a minor related point, even when there are no overrides to apply, _apply_corpus_overrides always issues the Corpus.objects.get(pk=corpus_id) SELECT — this fetch should be deferred until we know there is something to write.

@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 98.49624% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...orpuses/management/commands/load_authority_pack.py 98.13% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

JSv4 added 2 commits June 24, 2026 05:54
…pec parser

Code review (load_authority_pack.py):
- Defer the reactive re-link to a single sweep after the whole pack loads
  instead of one full-table sweep per corpus; print the relink summary.
- Persist `modified` when corpus overrides are applied (add it to update_fields
  so Corpus.save()'s timestamp bump is not filtered out).
- Resolve the declared persona BEFORE bootstrapping so a missing persona file
  can't strand a half-loaded corpus.
- Distinguish missing/null/wrong-type `corpora` and reject a manifest that
  declares neither mappings nor corpora (was a silent exit-zero on a typo).
- Make override application idempotent: skip the SELECT when nothing is
  declared and the UPDATE when every declared value already matches.
- Extract the duplicated JSON section-spec parsing into
  enrichment.authorities.parse_section_spec / read_section_spec; both
  bootstrap_authority and load_authority_pack now share one contract (DRY).

Docs review (0002-authority-packs-bolivia-spec.md):
- Document AuthorityRequest.params in the Slot 2 narrative.
- Drop the misleading re.IGNORECASE on the lowercase-only _NUMBER_RE.
- Guard the colon-less canonical_key split in the _locate_impl skeleton.

Tests: add LoadAuthorityPackEdgeCaseTests covering --public, deferred single
relink, --no-relink, taxonomy-only packs, persona idempotency + modified
persistence, model overrides, and the manifest/spec/persona error guards.
@claude

claude Bot commented Jun 24, 2026

Copy link
Copy Markdown

Code Review — Authority Packs: Phase 1

This PR cleanly repackages PR #1305's Bolivia-law work as a data bundle (taxonomy YAML + section specs + personas) consumed by a new generic load_authority_pack management command, rather than a bespoke Django app. The DRY extraction of spec validation into parse_section_spec/read_section_spec and the one-sweep relink optimisation are both good calls. Three confirmed bugs worth fixing before merge:


1. --public skips the Celery cascade — child documents stay private (HIGH)

opencontractserver/enrichment/authorities.py ~line 363

bootstrap_authority_corpus sets Corpus.is_public = True via a raw corpus.save(update_fields=[...]), bypassing CorpusService.set_visibility. The inline comment even claims this propagates to documents — it doesn't. CorpusService.set_visibility is the only path that enqueues make_corpus_public_task, which bulk-updates is_public=True on child documents, annotations, labels, analyses, extracts, relationships, datacells, and more. Skipping it means every authority corpus loaded with --public will have Corpus.is_public = True but all its seeded documents will remain private.

# Current (authorities.py ~line 363)
if make_public:
    corpus = Corpus.objects.get(pk=result["corpus_id"])
    if not corpus.is_public:
        corpus.is_public = True
        corpus.save(update_fields=["is_public", "modified"])  # cascade never fires

Fix: after bootstrap_authority_corpus creates the corpus, call CorpusService(request=None).set_visibility(corpus, is_public=True) (or the equivalent internal path that enqueues the task), and delete the misleading comment.


2. handle() is not atomic — taxonomy persists even when corpora validation fails (MEDIUM)

opencontractserver/corpuses/management/commands/load_authority_pack.py ~line 93

_load_taxonomy() is called first and immediately commits every AuthorityNamespace/AuthorityKeyEquivalence row (each upsert_equivalence call is wrapped in its own transaction.atomic() savepoint). Only then does _manifest_corpora() run. If the manifest has valid mappings: but a malformed corpora: value (e.g. corpora: null with the key present, or a non-list), CommandError is raised after the taxonomy is already durably written — with no rollback.

The command is documented as idempotent and re-runnable, but a failed first run leaves the DB in a hybrid state: taxonomy loaded, zero corpora created.

Fix: wrap handle() in transaction.atomic(), or — simpler — move all validation (including _manifest_corpora() and per-entry spec/persona file checks) before any DB writes.


3. parse_section_spec doesn't validate aliases type — a string value corrupts the alias registry (MEDIUM)

opencontractserver/enrichment/authorities.py ~line 92

sections is validated with isinstance(raw_sections, list), but aliases is returned raw:

return sections, spec.get("aliases")   # no isinstance check

If a spec file contains "aliases": "CPE" (a string, not a list), the string propagates to _stamp_keymeta["authority_aliases"] = "CPE". authority_alias_registry then iterates for alias in meta.get("authority_aliases") or [] — a non-empty string is truthy, so or [] is skipped and Python iterates character-by-character: 'C', 'P', 'E' each become registered alias keys, corrupting the alias map.

Fix: add a type check in parse_section_spec right before return:

raw_aliases = spec.get("aliases")
if raw_aliases is not None and not (
    isinstance(raw_aliases, list)
    and all(isinstance(a, str) for a in raw_aliases)
):
    raise ValueError(f"{label}: 'aliases' must be a list of strings, got {type(raw_aliases).__name__!r}.")
return sections, raw_aliases

Minor: test assertion uses the wrong CorpusDocumentService variant

opencontractserver/tests/test_authority_pack.py ~line 133

CorpusDocumentService.get_corpus_documents(self.owner, corpus).count(), 4

CLAUDE.md (issue #1682) reserves get_corpus_documents for pipeline/MCP callers (corpus-as-gate semantics) and mandates get_corpus_documents_visible_to_user for user-facing assertions. The two methods are equivalent for the current fixture (owner has full permissions), but using the corpus-as-gate variant here sets a misleading precedent and would silently pass even if the seeded documents were accidentally created without document-level permissions for self.owner.


Nit: per-corpus output omits documents_restamped

load_authority_pack's per-corpus success message prints created / updated / skipped but not restamped. The sibling bootstrap_authority.py command includes it. Operators re-running after a text change would see 0 created, 0 updated, 0 skipped and assume nothing happened.

JSv4 added 2 commits June 24, 2026 07:02
…output

- load_authority_pack: validate the whole pack (mappings file, corpora shape,
  per-entry spec + persona) BEFORE any DB write, so a malformed corpora entry
  can no longer abort with taxonomy already committed and zero corpora created.
  Split _load_taxonomy into _resolve_mappings_path (validation) + _load_taxonomy
  (load) and extract _validate_corpus_entry.
- parse_section_spec: reject a non-list-of-strings 'aliases' (a bare string was
  iterated char-by-char downstream, corrupting the authority alias registry).
- load_authority_pack: include 'restamped' in the per-corpus success line so a
  re-run after a text change reports the restamp instead of all-zeros.
- tests: assert --public cascades is_public to the seeded documents; add
  taxonomy-not-loaded-on-invalid-corpora and aliases-wrong-type cases; use the
  get_corpus_documents_visible_to_user variant for the user-facing count asserts.
@JSv4 JSv4 merged commit e5e96e6 into main Jun 24, 2026
9 checks passed
@JSv4 JSv4 deleted the docs/authority-packs-proposal branch June 24, 2026 12:29
@claude

claude Bot commented Jun 24, 2026

Copy link
Copy Markdown

Code Review

This PR cleanly packages the Bolivia-law concept from #1305 as data on the existing Authority rail rather than a bespoke Django app. The load_authority_pack management command is well-structured, the pre-flight validation-before-write design is thoughtful, and the test suite is thorough. Four actionable issues survived verification.


1. Non-Mapping JSON spec crashes with AttributeError instead of a clean error

opencontractserver/enrichment/authorities.py, line 120 / read_section_spec

read_section_spec calls json.loads() and passes the result directly to parse_section_spec without checking that it is a Mapping. If a spec file contains valid JSON that is not an object — e.g. [] or 42json.loads succeeds (no JSONDecodeError), and parse_section_spec then calls spec.get("sections") on a list/int, raising AttributeError. That is not a ValueError, so it escapes _validate_corpus_entry's except ValueError, bypasses the entire pre-flight block, and crashes the command with an unformatted traceback after zero useful context.

Fix: add a isinstance(spec, Mapping) guard immediately after json.loads in read_section_spec, raising ValueError for non-object JSON. One line.


2. preferred_embedder override silently bypasses the embedder-freeze guard

opencontractserver/corpuses/management/commands/load_authority_pack.py, line 959 / _apply_corpus_overrides

_apply_corpus_overrides writes preferred_embedder via raw setattr + Corpus.save(). CorpusService.assert_embedder_change_allowed (corpus_service.py:355) exists specifically to reject this change once the corpus has documents — "Cannot change preferred_embedder after documents have been added … use reEmbedCorpus to migrate". Because the pack command calls bootstrap_authority_corpus first (creating documents), and then calls _apply_corpus_overrides, any re-run that changes preferred_embedder silently creates a corpus with mixed embeddings: documents seeded on the first run have vectors from the old embedder; new or re-seeded documents get the new one.

Fix: call CorpusService.assert_embedder_change_allowed(corpus, new_embedder) before writing the field; raise CommandError on the returned error string. Or, better, add preferred_embedder as a parameter to bootstrap_authority_corpus so the freeze check fires before any documents are created.


3. A non-mapping pack.yaml (or an OSError reading it) produces an AttributeError

opencontractserver/corpuses/management/commands/load_authority_pack.py, line 83 / handle

Two problems in the same line:

manifest = yaml.safe_load(manifest_path.read_text(encoding="utf-8")) or {}
  • OSError: read_text() can raise OSError (e.g. permissions denied). Only yaml.YAMLError is caught; OSError propagates as an unhandled exception.
  • Non-mapping YAML: if pack.yaml is syntactically valid YAML but not a mapping (e.g. - item is a list), yaml.safe_load returns ['item'], which is truthy, so or {} is not applied. The next call to manifest.get("mappings") raises AttributeError (lists have no .get), not caught by either yaml.YAMLError or the downstream ValueError handlers.

Fix: catch (OSError, yaml.YAMLError) together, and after yaml.safe_load assert isinstance(manifest, dict).


4. Invalid preferred_llm in a pack entry aborts mid-pack after partial DB writes

opencontractserver/corpuses/management/commands/load_authority_pack.py, line 962 / corpus.save()

Corpus.save() (models.py:417–423) validates preferred_llm and raises django.core.exceptions.ValidationError for malformed or unregistered model specs. This ValidationError propagates out of _apply_corpus_overrides() uncaught — handle() has no try/except around the call. The pre-flight validation block (line 92's comment explains the intent) reads persona files and spec schemas, but never validates preferred_llm syntax. A pack entry with preferred_llm: "not:a:valid:spec" will load taxonomy rows and all earlier corpora before dying on the offending entry, leaving a half-committed pack that the idempotent re-run can't surface cleanly.

Fix: validate preferred_llm during _validate_corpus_entry using validate_model_spec (already importable from opencontractserver.llms.llm_registry), raising CommandError before any DB write.


Minor: duplicate section-validation loop in the LLM tool layer (DRY)

opencontractserver/llms/tools/core_tools/corpus_references.py, lines 143–162

The PR extracted section-spec validation into parse_section_spec, but the LLM tool wrapper for bootstrap_authority_corpus (corpus_references.py:143–162) still contains an identical validation loop that was not updated. The two validators can now diverge silently: if parse_section_spec gains a new required field check, the tool layer won't inherit it. The fix is to call parse_section_spec from the tool wrapper's list of dict inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant