Commit 94c67a8

Merge pull request #2064 from Open-Source-Legal/feature/authority-packs-config
Authority packs: self-contained per-pack config, bug fixes, docs
2 parents eeaf20b + 0fafcf4 commit 94c67a8

31 files changed

Lines changed: 1760 additions & 35 deletions

CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -238,6 +238,7 @@ docker compose -f production.yml up
 - **Frontend utilities**: `frontend/src/utils/formatters.ts` (formatting), `frontend/src/utils/files.ts` (file operations), etc.
 - **Backend utilities**: `opencontractserver/utils/` contains permissioning, PDF processing, and other shared utilities
 7. **Always go through the app's `services/` package.** Code with a user context (GraphQL resolvers, MCP tools, LLM tools, REST views, Celery tasks invoked with a user) **must** reach models through `opencontractserver/<app>/services/` — never compose `visible_to_user` / `user_can` / `user_has_permission_for_obj` inline. The shared base (`opencontractserver.shared.services.base.BaseService`) exposes `get_or_none`, `filter_visible`, `require_permission`, and `user_has` for the cases where a dedicated per-app method is overkill. The invariant is enforced twice — by a pytest test (`opencontractserver/tests/architecture/test_graphql_service_layer.py`) and by a Django system check (`opencontractserver/shared/checks.py`, `opencontracts.E001`) that fails `manage.py` startup on any inline Tier-0 use. **Scope of mechanical enforcement:** both the test and the check scan **`config/graphql/` only** (recursively). The rule above is *policy* for the other user-context surfaces — MCP tools (`opencontractserver/mcp/`), LLM tools (`opencontractserver/llms/tools/`), REST views, and user-context Celery tasks — but those are **not** scanned today and still contain correct-but-inline Tier-0 calls. Treat a green E001 as "no inline Tier-0 in `config/graphql/`," not "none anywhere." Failure messages carry the copy-pasteable recipe inline; see `docs/development/architecture_invariants.md` for the invariant catalogue and `docs/architecture/query_permission_patterns.md` for the per-app service catalogue + migration recipes.
+8. **Docs: keep them current, concise, and pointer-based.** When you change behavior, update the relevant doc (or add one if none fits) as part of the same change — stale docs are worse than none. But keep docs *concise and prune as you go*: a doc that only ever grows rots into noise, so trim superseded content rather than appending forever. **Favor code pointers over pasted code** — reference files + symbols (e.g. `opencontractserver/enrichment/authorities.py::bootstrap_authority_corpus`) instead of copying snippets in. Pasted code drifts out of sync the moment the source changes and becomes dead documentation; a pointer stays live. (See `## Documentation Locations` for where things live, and `## Changelog Maintenance` for change records.)
 
 ## Testing Patterns
 
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+- **Authority namespace re-seed no longer clobbers curator edits.** The
+`post_migrate` convergence in `opencontractserver/enrichment/_namespace_seed.py`
+(`ensure_seeded` → `seed`) ran on every production `migrate` and every test
+flush and `update_or_create`d every shipped-prefix `AuthorityNamespace`
+unconditionally — silently reverting a curator's `source="manual"` console edits
+(display_name / jurisdiction / authority_type / aliases) and re-forcing
+`is_global=True` on the next deploy, defeating the Authority Console's headline
+"a re-load can no longer clobber a curator's runtime edits" guarantee. The seed
+now honours the same source-ownership partition as
+`AuthorityMappingLoader.load_namespaces` (skip `source="manual"` and
+corpus-linked rows), guarded on the `source` column's presence so the historical
+0082/0085/0086/0090 seed states are unaffected. Regression tests in
+`test_authority_mapping_loader.py::NamespaceReseedOwnershipTests`.
+- **`authority_mappings` reader rejects self-referential equivalences.**
+`enrichment/data/mappings.py::iter_equivalences` now raises `ValueError` for a
+`from_key == to_key` pair (matching the DB `CheckConstraint`) instead of
+counting it into `total` and then silently dropping it in the loader.
+- **Removed dead code in the authority crawl.** The per-jurisdiction-cap parking
+in `crawl_authorities_service.py` now routes through the previously-unused
+`_park_for_cap` helper (DRY; behaviour-preserving).
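
The ownership rule in the first entry above (skip `source="manual"` and corpus-linked rows when re-seeding) might look roughly like the sketch below. It is not the code in `_namespace_seed.py`: `NamespaceRow`, `reseed`, the `corpus_id` field, and the sample prefixes are hypothetical stand-ins, and an in-memory dict replaces the ORM.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class NamespaceRow:
    """Hypothetical stand-in for an AuthorityNamespace row."""
    prefix: str
    display_name: str
    source: str = "seed"             # "manual" marks a curator-owned row
    corpus_id: Optional[int] = None  # assumed marker for a corpus-linked row


def reseed(existing: Dict[str, NamespaceRow], shipped: Iterable[NamespaceRow]) -> List[str]:
    """Apply shipped rows, leaving curator-owned and corpus-linked rows alone.

    Returns the prefixes that were skipped.
    """
    skipped: List[str] = []
    for row in shipped:
        current = existing.get(row.prefix)
        owned_elsewhere = current is not None and (
            current.source == "manual" or current.corpus_id is not None
        )
        if owned_elsewhere:
            skipped.append(row.prefix)
            continue
        # Not curator-owned: create the row or converge it to the shipped values.
        existing[row.prefix] = row
    return skipped
```

With a curator-edited row already present, a re-seed leaves it untouched and still creates any missing shipped prefix.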
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+- **Centralized authority-system documentation.** Added a single durable
+operator/author guide, `docs/guides/authoring-authority-packs.md` (pack layout,
+shipping a scraper inside a pack, `source_hosts`, add/remove/copy-to-port,
+what's still core), wired into the mkdocs nav alongside the existing
+Authority Console and Reference-Web Enrichment architecture docs. The
+`0002-authority-packs` proposal is re-framed as design-history (it points to the
+guide for the how-to and notes the now-self-contained pack layout), its dangling
+`0001-…` reference de-linked, and the proposal added to the nav so it is no
+longer orphaned. The guide and the proposal favour pointers to real files over
+pasted code to limit staleness.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+- **Authority packs can now ship their own source provider.** The pipeline
+registry discovers `BaseAuthoritySourceProvider` subclasses from
+`<pack>/providers/*.py` — both for in-tree packs under
+`opencontractserver/enrichment/data/authority_packs/` and for out-of-tree pack
+directories listed in the new `AUTHORITY_PACK_PATHS` setting (env var). A pack's
+scraper now lives *with* its authority, so copying the pack directory to another
+OpenContracts install brings the provider with it — no more dropping a `.py` into
+core's `pipeline/authority_source_providers/` package
+(`opencontractserver/pipeline/registry.py`, `config/settings/base.py`). Provider
+modules are imported by file path under a collision-free synthetic module name;
+an import failure is logged and skipped, never crashing registry build. Secrets
+stay in the `PipelineSettings` encrypted vault (keyed by provider class path),
+never in pack files.
+- **Duplicate authority-provider prefixes now warn at registry build.** Two
+providers claiming the same `supported_prefixes` family (e.g. a pack provider
+shadowing a core one) previously registered silently; `_provider_for` then
+resolved them non-deterministically by priority-then-discovery-order. The
+registry now logs a warning identifying both claimants.
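
For readers unfamiliar with importing a module by file path, the discovery step described in the first entry above could be sketched as follows. This is an illustration only, not the code in `opencontractserver/pipeline/registry.py`: the function name `load_pack_provider_modules` and the hash-based module-naming scheme are assumptions; only the `importlib` calls are standard-library API.

```python
import hashlib
import importlib.util
import logging
import sys
from pathlib import Path
from types import ModuleType
from typing import List

logger = logging.getLogger(__name__)


def load_pack_provider_modules(pack_dir: Path) -> List[ModuleType]:
    """Import every <pack>/providers/*.py by file path; skip any that fail."""
    loaded: List[ModuleType] = []
    providers_dir = pack_dir / "providers"
    if not providers_dir.is_dir():
        return loaded
    for path in sorted(providers_dir.glob("*.py")):
        # Hash the resolved path so same-named files in different packs never
        # share a module name (this naming scheme is an assumption).
        digest = hashlib.sha256(str(path.resolve()).encode("utf-8")).hexdigest()[:12]
        module_name = f"_authority_pack_provider_{digest}"
        try:
            spec = importlib.util.spec_from_file_location(module_name, path)
            if spec is None or spec.loader is None:
                raise ImportError(f"cannot build an import spec for {path}")
            module = importlib.util.module_from_spec(spec)
            sys.modules[module_name] = module
            spec.loader.exec_module(module)
        except Exception:
            # One broken pack must not take the whole registry build down.
            sys.modules.pop(module_name, None)
            logger.exception("Skipping authority-pack provider module %s", path)
            continue
        loaded.append(module)
    return loaded
```

A registry built this way would then scan each returned module for `BaseAuthoritySourceProvider` subclasses; that step is omitted here because the base class's interface is not shown in this commit.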
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+- **Authority packs can declare their own SSRF source hosts.** A scraping pack
+lists the hosts it fetches from in `pack.yaml` (`source_hosts: [...]`); those are
+read from every installed pack (in-tree under `authority_packs/` or on the
+`AUTHORITY_PACK_PATHS` setting) and unioned with the hardcoded
+`PUBLIC_DOMAIN_SOURCE_HOSTS` baseline at runtime, so a fetching pack is portable
+as a directory without editing `constants/safe_http.py`. The union is injected
+into the pure `safe_http` util via a registered provider
+(`register_allowlist_provider`) so the util never imports the enrichment layer;
+every pack-added host is logged. The SSRF mechanism is unchanged (HTTPS-only,
+public-IP, per-redirect-hop revalidation, size caps) — a pack only widens *which*
+hosts are reachable, and only once installed (the install is the trust decision).
+`AuthorityGateService` now consults the same effective allowlist as the fetch.
+`load_authority_pack` validates `source_hosts` shape fail-fast. New module
+`opencontractserver/enrichment/services/authority_source_hosts.py`; tests in
+`test_authority_source_hosts.py`.
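
The inversion described above, where the utility exposes a registration hook and the enrichment layer plugs into it, can be sketched as below. The names `register_allowlist_provider` and `PUBLIC_DOMAIN_SOURCE_HOSTS` come from the changelog entry, but their signatures and contents here are assumptions; `effective_allowlist` and the placeholder hosts are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, List

# Placeholder baseline; the real constant lives in constants/safe_http.py.
PUBLIC_DOMAIN_SOURCE_HOSTS: FrozenSet[str] = frozenset({"baseline.example.org"})

# Callables registered by higher layers. The util never imports those layers;
# it only calls whatever has been registered.
_allowlist_providers: List[Callable[[], Iterable[str]]] = []


def register_allowlist_provider(provider: Callable[[], Iterable[str]]) -> None:
    """Register a callable that returns extra allowed hosts (assumed signature)."""
    _allowlist_providers.append(provider)


def effective_allowlist() -> FrozenSet[str]:
    """Union the hardcoded baseline with every registered provider's hosts."""
    hosts = set(PUBLIC_DOMAIN_SOURCE_HOSTS)
    for provider in _allowlist_providers:
        # Hostnames are case-insensitive, so normalise before comparing.
        hosts.update(host.strip().lower() for host in provider())
    return frozenset(hosts)
```

Because providers are consulted on each call in this sketch, a gate check and a fetch that both call `effective_allowlist()` see the same set of hosts.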
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+- **Authority packs can carry their jurisdiction's citation vocabulary.** A pack's
+authority-mappings YAML may now declare two optional sections — `shape_rules:`
+(classify a numbered prefix family, e.g. `bo-ley-<n>`, without a core edit) and
+`abbreviations:` (`state`/`municipal` Bluebook abbreviations the Tier-2a
+extractor matches). They are read from every installed pack (in-tree +
+`AUTHORITY_PACK_PATHS`) and merged onto the shipped Python baseline at runtime
+(`classify_prefix` consults pack shape rules; `GenericCitationExtractor` merges
+pack abbreviations), so a jurisdiction's citation vocabulary travels *with* the
+pack — the baseline always wins a collision (a pack extends, never overrides).
+Malformed entries are logged + skipped at runtime and rejected fail-fast by
+`load_authority_pack`. New module
+`opencontractserver/enrichment/services/authority_pack_config.py`; tests in
+`test_authority_pack_taxonomy.py`. The shared `authority_type` vocabulary and the
+citation-*form* parsing grammars remain core (shared vocabulary / parsing logic,
+not per-authority config).
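
The "baseline always wins a collision" merge, with malformed entries logged and skipped at runtime, can be illustrated with a small sketch. The function `merge_abbreviations`, the flat string-to-string shape of the entries, and the sample abbreviations are assumptions made for illustration; the real structures in `authority_pack_config.py` may differ.

```python
import logging
from typing import Dict, Mapping

logger = logging.getLogger(__name__)


def merge_abbreviations(
    baseline: Mapping[str, str], pack_entries: Mapping[object, object]
) -> Dict[str, str]:
    """Extend the baseline with pack abbreviations; the baseline wins collisions."""
    merged = dict(baseline)
    for abbreviation, expansion in pack_entries.items():
        well_formed = (
            isinstance(abbreviation, str)
            and isinstance(expansion, str)
            and abbreviation.strip() != ""
        )
        if not well_formed:
            # Runtime behaviour per the changelog: log and skip, do not raise.
            logger.warning("Skipping malformed pack abbreviation %r", abbreviation)
            continue
        # setdefault only inserts when the key is absent, so a pack can add
        # an abbreviation but can never replace a baseline one.
        merged.setdefault(abbreviation, expansion)
    return merged
```

A stricter loader such as `load_authority_pack` would raise on the same malformed entries instead of skipping them, which matches the fail-fast behaviour the entry describes.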

config/settings/base.py

Lines changed: 11 additions & 0 deletions
@@ -1665,6 +1665,17 @@
 # latency against DB/connection pressure.
 ENRICHMENT_DOC_MAX_CONCURRENCY = env.int("ENRICHMENT_DOC_MAX_CONCURRENCY", default=None)
 
+# Out-of-tree authority-pack directories. Each entry is a self-contained pack
+# directory (pack.yaml + optional providers/ + mappings/specs/personas). The
+# pipeline registry scans every pack here — in addition to the in-tree packs under
+# opencontractserver/enrichment/data/authority_packs/ — for provider modules under
+# <pack>/providers/, so an authority pack copied to this install brings its scraper
+# with it WITHOUT dropping a .py into core. Comma-separated absolute paths in the
+# AUTHORITY_PACK_PATHS env var; empty by default. (Provider discovery happens at
+# registry build, so adding a path needs a worker/web restart — same as any new
+# in-tree provider.)
+AUTHORITY_PACK_PATHS = env.list("AUTHORITY_PACK_PATHS", default=[])
+
 # Rate Limiting Configuration
 # ------------------------------------------------------------------------------
 # Import rate limiting settings
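
The settings comment in this hunk describes `AUTHORITY_PACK_PATHS` as a comma-separated list of pack directories. As a plain-Python illustration of that format (not django-environ's `env.list`, whose exact whitespace handling is not restated here), a hypothetical helper `parse_pack_paths` could split the raw value like this:

```python
from pathlib import Path
from typing import List


def parse_pack_paths(raw: str) -> List[Path]:
    """Split a comma-separated value into pack directories, ignoring empty parts."""
    return [Path(part.strip()) for part in raw.split(",") if part.strip()]
```

An unset or empty variable yields an empty list, which matches the "empty by default" behaviour in the comment.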

docs/architecture/authority-console.md

Lines changed: 3 additions & 1 deletion
@@ -118,7 +118,9 @@ prefixes, license, priority, the enabled and requires-approval flags, and whethe
 the secrets vault holds credentials. Enabling/disabling stays in code; credentials
 are edited through System Settings' component-secrets vault
 (`updateComponentSecrets`), **not** here — the console never invents a parallel
-credential store.
+credential store. Providers can be shipped *inside* an authority pack
+(`<pack>/providers/`) so a scraper travels with its jurisdiction — see
+[Authoring an Authority Pack](../guides/authoring-authority-packs.md).
 
 ### Runs tab
 

docs/architecture/proposals/0002-authority-packs.md

Lines changed: 15 additions & 2 deletions
@@ -2,8 +2,8 @@
 
 | | |
 |---|---|
-| **Status** | Proposed (design doc only — no code, migrations, or tests in this PR) |
-| **Supersedes / builds on** | 0001 — Generic scheduled scraping + corpus groups ([PR #1444](https://github.com/Open-Source-Legal/OpenContracts/pull/1444); doc lands at `./0001-scheduled-scraping-and-corpus-groups.md` once merged) |
+| **Status** | Partially implemented — this doc is the original design rationale + gap analysis. For the operator/author how-to (and the now-self-contained pack layout: in-pack providers + `source_hosts`), see the guide: [Authoring an Authority Pack](../../guides/authoring-authority-packs.md). |
+| **Supersedes / builds on** | 0001 — Generic scheduled scraping + corpus groups (PR #1444; the `0001-` proposal doc is not yet written) |
 | **Relates to** | PR #1305 (Bolivian-law contributor PR, closed/reference), the Authority architecture (PRs #1990 / #1997 / #2037), [`authority-console.md`](../authority-console.md), [`reference-web-enrichment.md`](../reference-web-enrichment.md) |
 | **Author** | follow-up to #1305 / #1444 |
 
@@ -37,6 +37,19 @@ rather than against a `scraping/` app that was never built.
 > therefore ships taxonomy + curated content + personas (no live fetch, so no
 > host-allowlist edit is needed yet).
 
+> **Update — packs are now self-contained (gaps 1 & 6 closed).** A pack may now
+> ship its scraper inside the pack (`<pack>/providers/*.py`, discovered by the
+> pipeline registry from in-tree packs and out-of-tree dirs on the
+> `AUTHORITY_PACK_PATHS` setting) and declare the hosts it fetches from in
+> `pack.yaml` (`source_hosts:`, merged into the SSRF allowlist at runtime). The
+> "one un-packable edit" of §3 (the hardcoded host allowlist) and the
+> "single hardcoded package" of gap 6 (§7) no longer hold — a fetching pack is
+> portable as a directory, secrets still living in the `PipelineSettings` vault.
+> See [Authoring an Authority Pack](../../guides/authoring-authority-packs.md)
+> (tests: `test_authority_pack_providers.py`, `test_authority_source_hosts.py`).
+> The remaining gaps (scheduled scraping, multi-corpus orchestration,
+> config-declarable `authority_type`/shape grammars) are unchanged.
+
 ## 1. Context — three artifacts, one intent
 
 | Artifact | What it is | Status |
