Merged
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -238,6 +238,7 @@ docker compose -f production.yml up
- **Frontend utilities**: `frontend/src/utils/formatters.ts` (formatting), `frontend/src/utils/files.ts` (file operations), etc.
- **Backend utilities**: `opencontractserver/utils/` contains permissioning, PDF processing, and other shared utilities
7. **Always go through the app's `services/` package.** Code with a user context (GraphQL resolvers, MCP tools, LLM tools, REST views, Celery tasks invoked with a user) **must** reach models through `opencontractserver/<app>/services/` — never compose `visible_to_user` / `user_can` / `user_has_permission_for_obj` inline. The shared base (`opencontractserver.shared.services.base.BaseService`) exposes `get_or_none`, `filter_visible`, `require_permission`, and `user_has` for the cases where a dedicated per-app method is overkill. The invariant is enforced twice — by a pytest test (`opencontractserver/tests/architecture/test_graphql_service_layer.py`) and by a Django system check (`opencontractserver/shared/checks.py`, `opencontracts.E001`) that fails `manage.py` startup on any inline Tier-0 use. **Scope of mechanical enforcement:** both the test and the check scan **`config/graphql/` only** (recursively). The rule above is *policy* for the other user-context surfaces — MCP tools (`opencontractserver/mcp/`), LLM tools (`opencontractserver/llms/tools/`), REST views, and user-context Celery tasks — but those are **not** scanned today and still contain correct-but-inline Tier-0 calls. Treat a green E001 as "no inline Tier-0 in `config/graphql/`," not "none anywhere." Failure messages carry the copy-pasteable recipe inline; see `docs/development/architecture_invariants.md` for the invariant catalogue and `docs/architecture/query_permission_patterns.md` for the per-app service catalogue + migration recipes.
8. **Docs: keep them current, concise, and pointer-based.** When you change behavior, update the relevant doc (or add one if none fits) as part of the same change — stale docs are worse than none. But keep docs *concise and prune as you go*: a doc that only ever grows rots into noise, so trim superseded content rather than appending forever. **Favor code pointers over pasted code** — reference files + symbols (e.g. `opencontractserver/enrichment/authorities.py::bootstrap_authority_corpus`) instead of copying snippets in. Pasted code drifts out of sync the moment the source changes and becomes dead documentation; a pointer stays live. (See `## Documentation Locations` for where things live, and `## Changelog Maintenance` for change records.)

## Testing Patterns

20 changes: 20 additions & 0 deletions changelog.d/authority-config-phase-a.fixed.md
@@ -0,0 +1,20 @@
- **Authority namespace re-seed no longer clobbers curator edits.** The
`post_migrate` convergence in `opencontractserver/enrichment/_namespace_seed.py`
(`ensure_seeded` → `seed`) ran on every production `migrate` and every test
flush and `update_or_create`d every shipped-prefix `AuthorityNamespace`
unconditionally — silently reverting a curator's `source="manual"` console edits
(display_name / jurisdiction / authority_type / aliases) and re-forcing
`is_global=True` on the next deploy, defeating the Authority Console's headline
"a re-load can no longer clobber a curator's runtime edits" guarantee. The seed
now honours the same source-ownership partition as
`AuthorityMappingLoader.load_namespaces` (skip `source="manual"` and
corpus-linked rows), guarded on the `source` column's presence so the historical
0082/0085/0086/0090 seed states are unaffected. Regression tests in
`test_authority_mapping_loader.py::NamespaceReseedOwnershipTests`.
- **`authority_mappings` reader rejects self-referential equivalences.**
`enrichment/data/mappings.py::iter_equivalences` now raises `ValueError` for a
`from_key == to_key` pair (matching the DB `CheckConstraint`) instead of
counting it into `total` and then silently dropping it in the loader.
- **Wired in a previously dead helper in the authority crawl.** The
per-jurisdiction-cap parking in `crawl_authorities_service.py` now routes
through the `_park_for_cap` helper, which existed but was never called (DRY;
behaviour-preserving).
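
The fail-fast behaviour described for `iter_equivalences` can be sketched as follows. This is a minimal illustration, not the code in `enrichment/data/mappings.py`: the function name `iter_equivalences_sketch` and the row shape (a dict with `from_key` and `to_key`) are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple


def iter_equivalences_sketch(
    rows: Iterable[Dict[str, str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (from_key, to_key) pairs, rejecting self-referential ones.

    Hypothetical stand-in for the real reader; the row shape is assumed.
    """
    for index, row in enumerate(rows):
        from_key, to_key = row["from_key"], row["to_key"]
        if from_key == to_key:
            # Fail at read time so the pair is never counted into a total
            # and then silently dropped further down the pipeline.
            raise ValueError(
                f"equivalence #{index} is self-referential: {from_key!r}"
            )
        yield from_key, to_key
```

Raising in the reader keeps the reported count and the loaded rows in agreement, and surfaces the bad entry before the database constraint would.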
10 changes: 10 additions & 0 deletions changelog.d/authority-docs.changed.md
@@ -0,0 +1,10 @@
- **Centralized authority-system documentation.** Added a single durable
operator/author guide, `docs/guides/authoring-authority-packs.md` (pack layout,
shipping a scraper inside a pack, `source_hosts`, add/remove/copy-to-port,
what's still core), wired into the mkdocs nav alongside the existing
Authority Console and Reference-Web Enrichment architecture docs. The
`0002-authority-packs` proposal is re-framed as design-history (it points to the
guide for the how-to and notes the now-self-contained pack layout), its dangling
`0001-…` reference de-linked, and the proposal added to the nav so it is no
longer orphaned. The guide and the proposal favour pointers to real files over
pasted code to limit staleness.
18 changes: 18 additions & 0 deletions changelog.d/authority-pack-providers.added.md
@@ -0,0 +1,18 @@
- **Authority packs can now ship their own source provider.** The pipeline
registry discovers `BaseAuthoritySourceProvider` subclasses from
`<pack>/providers/*.py` — both for in-tree packs under
`opencontractserver/enrichment/data/authority_packs/` and for out-of-tree pack
directories listed in the new `AUTHORITY_PACK_PATHS` setting (env var). A pack's
scraper now lives *with* its authority, so copying the pack directory to another
OpenContracts install brings the provider with it — no more dropping a `.py` into
core's `pipeline/authority_source_providers/` package
(`opencontractserver/pipeline/registry.py`, `config/settings/base.py`). Provider
modules are imported by file path under a collision-free synthetic module name;
an import failure is logged and skipped, never crashing registry build. Secrets
stay in the `PipelineSettings` encrypted vault (keyed by provider class path),
never in pack files.
- **Duplicate authority-provider prefixes now warn at registry build.** Two
providers claiming the same `supported_prefixes` family (e.g. a pack provider
shadowing a core one) previously registered silently; `_provider_for` then
resolved them non-deterministically by priority-then-discovery-order. The
registry now logs a warning identifying both claimants.
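
Importing a provider file by path under a collision-free synthetic module name, with failures logged and skipped, might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `opencontractserver/pipeline/registry.py`: the helper names, the hash-based naming scheme, and the rule of skipping underscore-prefixed files are all hypothetical.

```python
import hashlib
import importlib.util
import logging
from pathlib import Path
from types import ModuleType
from typing import List, Optional

logger = logging.getLogger(__name__)


def synthetic_module_name(path: Path) -> str:
    """Derive a name from the full path (assumed scheme), so two packs that
    both ship providers/scraper.py get distinct module names."""
    digest = hashlib.sha256(str(path.resolve()).encode("utf-8")).hexdigest()[:12]
    return f"_authority_pack_provider_{digest}_{path.stem}"


def import_provider_module(path: Path) -> Optional[ModuleType]:
    """Import one provider file by path; log and return None on any failure."""
    name = synthetic_module_name(path)
    try:
        spec = importlib.util.spec_from_file_location(name, str(path))
        if spec is None or spec.loader is None:
            raise ImportError(f"no import spec for {path}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
    except Exception:
        # A broken pack must not take down discovery of the other packs.
        logger.exception("Skipping authority pack provider %s", path)
        return None


def discover_provider_modules(pack_dirs: List[Path]) -> List[ModuleType]:
    """Import every <pack>/providers/*.py that loads cleanly."""
    modules: List[ModuleType] = []
    for pack_dir in pack_dirs:
        providers_dir = pack_dir / "providers"
        if not providers_dir.is_dir():
            continue
        for path in sorted(providers_dir.glob("*.py")):
            if path.name.startswith("_"):
                continue  # assumed convention: skip __init__.py and private files
            module = import_provider_module(path)
            if module is not None:
                modules.append(module)
    return modules
```

The sketch stops at importing modules; finding `BaseAuthoritySourceProvider` subclasses inside them and registering those is left to the real registry.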
15 changes: 15 additions & 0 deletions changelog.d/authority-pack-source-hosts.added.md
@@ -0,0 +1,15 @@
- **Authority packs can declare their own SSRF source hosts.** A scraping pack
lists the hosts it fetches from in `pack.yaml` (`source_hosts: [...]`); those are
read from every installed pack (in-tree under `authority_packs/` or on the
`AUTHORITY_PACK_PATHS` setting) and unioned with the hardcoded
`PUBLIC_DOMAIN_SOURCE_HOSTS` baseline at runtime, so a fetching pack is portable
as a directory without editing `constants/safe_http.py`. The union is injected
into the pure `safe_http` util via a registered provider
(`register_allowlist_provider`) so the util never imports the enrichment layer;
every pack-added host is logged. The SSRF mechanism is unchanged (HTTPS-only,
public-IP, per-redirect-hop revalidation, size caps) — a pack only widens *which*
hosts are reachable, and only once installed (the install is the trust decision).
`AuthorityGateService` now consults the same effective allowlist as the fetch.
`load_authority_pack` validates `source_hosts` shape fail-fast. New module
`opencontractserver/enrichment/services/authority_source_hosts.py`; tests in
`test_authority_source_hosts.py`.
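
The inversion described above, where the enrichment layer registers a provider with the util instead of the util importing enrichment code, can be sketched like this. Only the name `register_allowlist_provider` comes from the changelog entry; its signature, the `BASELINE_HOSTS` stand-in, and the other helpers are assumptions for illustration.

```python
from typing import Callable, FrozenSet, Iterable, List

# Stand-in for the hardcoded baseline; the real PUBLIC_DOMAIN_SOURCE_HOSTS
# lives elsewhere and has different contents.
BASELINE_HOSTS: FrozenSet[str] = frozenset({"example.gov"})

# Callables registered by higher layers. The util only ever calls them,
# so it never needs to import the enrichment package.
_allowlist_providers: List[Callable[[], Iterable[str]]] = []


def register_allowlist_provider(provider: Callable[[], Iterable[str]]) -> None:
    """Register a callable returning extra allowed hosts (assumed signature)."""
    _allowlist_providers.append(provider)


def effective_allowlist() -> FrozenSet[str]:
    """Union the baseline with every host contributed by a provider."""
    hosts = set(BASELINE_HOSTS)
    for provider in _allowlist_providers:
        hosts.update(host.strip().lower() for host in provider())
    return frozenset(hosts)


def is_host_allowed(host: str) -> bool:
    """Check a host against the effective (baseline plus packs) allowlist."""
    return host.strip().lower() in effective_allowlist()
```

In this arrangement a pack loader would call something like `register_allowlist_provider(lambda: ["pack.example.org"])` once at startup. Providers can only add hosts; the rest of the SSRF checks run unchanged on every request.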
15 changes: 15 additions & 0 deletions changelog.d/authority-pack-taxonomy.added.md
@@ -0,0 +1,15 @@
- **Authority packs can carry their jurisdiction's citation vocabulary.** A pack's
authority-mappings YAML may now declare two optional sections — `shape_rules:`
(classify a numbered prefix family, e.g. `bo-ley-<n>`, without a core edit) and
`abbreviations:` (`state`/`municipal` Bluebook abbreviations the Tier-2a
extractor matches). They are read from every installed pack (in-tree +
`AUTHORITY_PACK_PATHS`) and merged onto the shipped Python baseline at runtime
(`classify_prefix` consults pack shape rules; `GenericCitationExtractor` merges
pack abbreviations), so a jurisdiction's citation vocabulary travels *with* the
pack — the baseline always wins a collision (a pack extends, never overrides).
Malformed entries are logged + skipped at runtime and rejected fail-fast by
`load_authority_pack`. New module
`opencontractserver/enrichment/services/authority_pack_config.py`; tests in
`test_authority_pack_taxonomy.py`. The shared `authority_type` vocabulary and the
citation-*form* parsing grammars remain core (shared vocabulary / parsing logic,
not per-authority config).
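
The "pack extends, never overrides" merge rule can be illustrated with a small sketch. The function name `merge_abbreviations` and the flat string-to-string mapping are assumptions; the real structures in `authority_pack_config.py` may be shaped differently.

```python
import logging
from typing import Dict, Mapping

logger = logging.getLogger(__name__)


def merge_abbreviations(
    baseline: Mapping[str, str],
    pack_entries: Mapping[object, object],
) -> Dict[str, str]:
    """Merge pack abbreviations onto the baseline; the baseline wins collisions.

    Hypothetical helper: malformed entries are logged and skipped, mirroring
    the runtime behaviour described in the changelog entry.
    """
    merged = dict(baseline)  # copy, so the shipped baseline is never mutated
    for abbreviation, expansion in pack_entries.items():
        well_formed = (
            isinstance(abbreviation, str)
            and isinstance(expansion, str)
            and bool(abbreviation)
            and bool(expansion)
        )
        if not well_formed:
            logger.warning("Skipping malformed pack abbreviation %r", abbreviation)
            continue
        # setdefault writes only when the key is absent, so a pack can add
        # new abbreviations but cannot replace a baseline one.
        merged.setdefault(abbreviation, expansion)
    return merged
```

A strict loader, in the spirit of `load_authority_pack`, would raise on the malformed entry instead of logging it.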
11 changes: 11 additions & 0 deletions config/settings/base.py
@@ -1665,6 +1665,17 @@
# latency against DB/connection pressure.
ENRICHMENT_DOC_MAX_CONCURRENCY = env.int("ENRICHMENT_DOC_MAX_CONCURRENCY", default=None)

# Out-of-tree authority-pack directories. Each entry is a self-contained pack
# directory (pack.yaml + optional providers/ + mappings/specs/personas). The
# pipeline registry scans every pack here — in addition to the in-tree packs under
# opencontractserver/enrichment/data/authority_packs/ — for provider modules under
# <pack>/providers/, so an authority pack copied to this install brings its scraper
# with it WITHOUT dropping a .py into core. Comma-separated absolute paths in the
# AUTHORITY_PACK_PATHS env var; empty by default. (Provider discovery happens at
# registry build, so adding a path needs a worker/web restart — same as any new
# in-tree provider.)
AUTHORITY_PACK_PATHS = env.list("AUTHORITY_PACK_PATHS", default=[])

# Rate Limiting Configuration
# ------------------------------------------------------------------------------
# Import rate limiting settings
4 changes: 3 additions & 1 deletion docs/architecture/authority-console.md
@@ -118,7 +118,9 @@ prefixes, license, priority, the enabled and requires-approval flags, and whethe
the secrets vault holds credentials. Enabling/disabling stays in code; credentials
are edited through System Settings' component-secrets vault
(`updateComponentSecrets`), **not** here — the console never invents a parallel
credential store.
credential store. Providers can be shipped *inside* an authority pack
(`<pack>/providers/`) so a scraper travels with its jurisdiction — see
[Authoring an Authority Pack](../guides/authoring-authority-packs.md).

### Runs tab

17 changes: 15 additions & 2 deletions docs/architecture/proposals/0002-authority-packs.md
@@ -2,8 +2,8 @@

| | |
|---|---|
| **Status** | Proposed (design doc only — no code, migrations, or tests in this PR) |
| **Supersedes / builds on** | 0001 — Generic scheduled scraping + corpus groups ([PR #1444](https://github.com/Open-Source-Legal/OpenContracts/pull/1444); doc lands at `./0001-scheduled-scraping-and-corpus-groups.md` once merged) |
| **Status** | Partially implemented — this doc is the original design rationale + gap analysis. For the operator/author how-to (and the now-self-contained pack layout: in-pack providers + `source_hosts`), see the guide: [Authoring an Authority Pack](../../guides/authoring-authority-packs.md). |
| **Supersedes / builds on** | 0001 — Generic scheduled scraping + corpus groups (PR #1444; the `0001-…` proposal doc is not yet written) |
| **Relates to** | PR #1305 (Bolivian-law contributor PR, closed/reference), the Authority architecture (PRs #1990 / #1997 / #2037), [`authority-console.md`](../authority-console.md), [`reference-web-enrichment.md`](../reference-web-enrichment.md) |
| **Author** | follow-up to #1305 / #1444 |

@@ -37,6 +37,19 @@ rather than against a `scraping/` app that was never built.
> therefore ships taxonomy + curated content + personas (no live fetch, so no
> host-allowlist edit is needed yet).

> **Update — packs are now self-contained (gaps 1 & 6 closed).** A pack may now
> ship its scraper inside the pack (`<pack>/providers/*.py`, discovered by the
> pipeline registry from in-tree packs and out-of-tree dirs on the
> `AUTHORITY_PACK_PATHS` setting) and declare the hosts it fetches from in
> `pack.yaml` (`source_hosts:`, merged into the SSRF allowlist at runtime). The
> "one un-packable edit" of §3 (the hardcoded host allowlist) and the
> "single hardcoded package" of gap 6 (§7) no longer hold — a fetching pack is
> portable as a directory, secrets still living in the `PipelineSettings` vault.
> See [Authoring an Authority Pack](../../guides/authoring-authority-packs.md)
> (tests: `test_authority_pack_providers.py`, `test_authority_source_hosts.py`).
> The remaining gaps (scheduled scraping, multi-corpus orchestration,
> config-declarable `authority_type`/shape grammars) are unchanged.

## 1. Context — three artifacts, one intent

| Artifact | What it is | Status |