[Proposal] Generic scheduled scraping + corpus groups (alternative landing path for #1305)#1444
JSv4 wants to merge 1 commit into
… yet) This proposal extracts the genuinely missing primitives from PR #1305 (scheduled scraping, multi-corpus retrieval) into reusable OC-native infrastructure, in two sequential phases. Phase A is the scraping app; Phase B is the corpus-group / multi-corpus tool concept. No implementation in this PR -- design doc only, intended to anchor discussion with the #1305 contributor before any code lands.
Code Review: PR #1444 (Design Doc: Scheduled Scraping + Corpus Groups)
This is a well-constructed architecture proposal. The overlap analysis is honest and accurate, the migration story for PR #1305 is respectful of the contributor's work, and the two-phase split is the right call: Phase A is self-contained and Phase B can safely wait. Notes below are ordered by weight.
Issues worth resolving before implementation
1. Storing cron strings as
The beat-sync signal should also handle
2.
This means adding a new corpus to a
3. Dedup scope is per-source, not global
4. The free-form
Design questions worth answering before Phase A lands
5. When
6. Max-failure threshold in
The test spec says "partial failure does not abort batch," which is correct for resilience. The runner design should also specify whether there is a failure-rate threshold above which the batch is aborted and
7. Beat schedule reload without worker restart
8. Phase B Conversation binding
The open question is flagged but it has architectural weight. Option (b)
Small nits
What's good
Overall this proposal is ready for maintainer and contributor discussion. Resolving items 1–4 above in the doc before implementation starts would prevent the most likely implementation-time surprises.
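Item 6 of the review asks whether the runner should abort a batch once failures pass some rate. The sketch below shows one possible shape for that rule, not the design doc's actual runner: `run_batch`, `BatchResult`, `max_failure_rate`, and `min_attempts` are all hypothetical names chosen for illustration. A single failure is recorded and skipped, and the batch stops only when the observed failure rate exceeds the threshold after a minimum number of attempts.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class BatchResult:
    """Outcome of one scrape batch (hypothetical structure)."""

    succeeded: list[str] = field(default_factory=list)
    failed: dict[str, str] = field(default_factory=dict)  # item -> error message
    aborted: bool = False


def run_batch(
    items: Iterable[str],
    fetch: Callable[[str], None],
    max_failure_rate: float = 0.5,  # assumed default, not taken from the proposal
    min_attempts: int = 5,          # avoids aborting on one early failure
) -> BatchResult:
    """Process every item; abort only when the failure rate gets too high."""
    result = BatchResult()
    for attempted, item in enumerate(items, start=1):
        try:
            fetch(item)
            result.succeeded.append(item)
        except Exception as exc:  # partial failure must not abort the batch
            result.failed[item] = str(exc)

        failure_rate = len(result.failed) / attempted
        if attempted >= min_attempts and failure_rate > max_failure_rate:
            result.aborted = True
            break
    return result
```

With these assumed defaults, one bad document out of six leaves the batch running to completion, while a source that fails on every request is cut off after the fifth attempt. What should happen to the source after an abort is left open here, as it is in the review.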
This may share some lineage with #1976; worth exploring further as a generic way to define and source law.
…low-up)
Seed-based pack format on the shipped Authority architecture, with no bespoke app.
- load_authority_pack management command (generic, jurisdiction-agnostic): reads a pack.yaml manifest and idempotently loads the taxonomy YAML (AuthorityMappingLoader.load_all), bootstraps one authority corpus per legal area from a JSON section spec (bootstrap_authority_corpus), and writes each area persona into Corpus.corpus_agent_instructions. --path accepts any dir (out-of-tree packs supported).
- Reference Bolivia pack under enrichment/data/authority_packs/bolivia/: 5-prefix taxonomy (jurisdiction=bo, Spanish aliases), seeded constitucional corpus (CPE articles), Spanish persona, manifest, README. Repackages PR #1305 (@jseborga) as data instead of a standalone Django app.
- Tests (test_authority_pack.py): static pack-validity (manifest/mappings/spec integrity, authority_type vocab, declared-prefix coverage) + command end-to-end (namespaces, corpus, persona, document count, idempotency). 5 pass.
- Updated proposal + spec docs to record the as-built Phase 1 and that the live-fetch provider folds into Phase 2 (#2054): the Bolivian publishers are listing-page, not key-addressable, so a citation-keyed provider cannot fetch yet. Seed-based Phase 1 needs no provider and no host-allowlist edit.
- Changelog fragment.
Credits @jseborga per the #1444 migration story.
…-Legal#1305 follow-up)
Design doc only: no code, migrations, or tests (mirrors Open-Source-Legal#1444's convention).
- docs/architecture/proposals/0002-authority-packs.md: reframes the 'authority pack' drop-in concept on the shipped Authority architecture (the provider rail that Open-Source-Legal#1444 could not yet assume). Defines the four extension seams, the pack anatomy, the drop-in lifecycle, the Open-Source-Legal#1305 component->slot mapping (full/partial/gap), the residual gaps with a 4-phase recommendation, and the re-homing of Open-Source-Legal#1444 Phase A/B onto this rail.
- docs/architecture/proposals/0002-authority-packs-bolivia-spec.md: the buildable Phase 1 artifact, covering pack layout, authority_mappings.bolivia.yaml, a BaseAuthoritySourceProvider skeleton, a bootstrap section-spec JSON, a Spanish persona, the one required host-allowlist edit, and the drop-in commands.
All code references fact-checked against current source (provider interface, AuthoritySection, bootstrap_authority spec, ALL_AUTHORITY_TYPES, the CorpusGroup/scheduling gaps).
Credits @jseborga per the Open-Source-Legal#1444 migration story.
Summary
Design doc only — no code, no migrations, no tests in this PR. Intended to anchor discussion before any implementation lands.
This proposal extracts the genuinely missing primitives from #1305 (scheduled scraping, multi-corpus retrieval) into reusable OC-native infrastructure, in two sequential phases:
Phase A: Scheduled scraping. A generic opencontractserver/scraping/ app: BaseScraper + auto-discovery registry, ScrapedSource + ScrapedDocument models, atomic ingestion service (closes a race window), DB-driven Beat schedules, generic management commands, GraphQL surface with permission gating. PR #1305's three scrapers move into this app verbatim as scraping/scrapers/bolivia/{gaceta,tsj,tcp}.py.
Phase B: Corpus Groups + multi-corpus retrieval. A CorpusGroup model bundles N corpora; an async asearch_across_corpora tool searches across them with per-user visibility filtering. Bound to an AgentConfiguration whose system prompt is PR #1305's orchestrator text. The existing ws/agent-chat/?agent_id=X route handles streaming + persistence, so no new transport is needed.
Why this approach
PR #1305 is well-built: three working scrapers, defensive parsing with httpx.MockTransport testability, eleven thoughtful specialist personas, and a working orchestrator pattern. The architectural concern is overlap, not implementation quality:
Corpus.corpus_agent_instructions) and are auto-injected by CoreCorpusAgentFactory.
UnifiedAgentConsumer over ws/agent-chat/?corpus_id=X).
Conversation.chat_with_corpus + django-guardian.
What OC genuinely lacks today: scheduled scraping into a Corpus (Phase A) and multi-corpus retrieval (Phase B). Once those exist as generic primitives, PR #1305 collapses into ~20 lines of fixture data, and the same pattern works for any future deployment (Brazilian jurisprudence, EU regulations, internal compliance feeds, etc.) without copy-pasting an app.
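To make the Phase A primitives more concrete, here is a minimal, framework-free sketch of a scraper registry with per-source dedup. Only the names `BaseScraper` and the `bolivia/gaceta` scraper come from the proposal; the `slug` attribute, `ScrapedItem`, `SCRAPER_REGISTRY`, `ingest`, the content-hash dedup key, and the example URLs are assumptions made for illustration. The in-memory `seen` set stands in for what would presumably be a database uniqueness constraint in the real ingestion service.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass

# Hypothetical registry: slug -> scraper class, filled in at class-definition time.
SCRAPER_REGISTRY: dict[str, type[BaseScraper]] = {}


@dataclass(frozen=True)
class ScrapedItem:
    url: str
    content: bytes


class BaseScraper:
    """Subclasses that set a non-empty slug register themselves automatically."""

    slug: str = ""

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        if cls.slug:
            if cls.slug in SCRAPER_REGISTRY:
                raise ValueError(f"duplicate scraper slug: {cls.slug}")
            SCRAPER_REGISTRY[cls.slug] = cls

    def fetch(self) -> list[ScrapedItem]:
        raise NotImplementedError


class GacetaScraper(BaseScraper):
    """Stand-in for a real scraper; returns canned items instead of doing HTTP."""

    slug = "bolivia.gaceta"

    def fetch(self) -> list[ScrapedItem]:
        return [
            ScrapedItem("https://example.invalid/ley-1", b"Ley 1"),
            ScrapedItem("https://example.invalid/ley-1-mirror", b"Ley 1"),
        ]


def ingest(source_slug: str, seen: set[tuple[str, str]]) -> list[ScrapedItem]:
    """Return only items not yet seen for this source (dedup scope: per source)."""
    scraper = SCRAPER_REGISTRY[source_slug]()
    new_items: list[ScrapedItem] = []
    for item in scraper.fetch():
        key = (source_slug, hashlib.sha256(item.content).hexdigest())
        if key in seen:
            continue
        seen.add(key)
        new_items.append(item)
    return new_items
```

Running `ingest("bolivia.gaceta", seen)` twice against the same `seen` set yields one new item the first time and none the second, which is the behavior a scheduled job relies on. Because the key includes the source slug, the same document published by two sources would be ingested twice; that matches the per-source dedup scope the review raises in item 3, and a global scope would simply drop the slug from the key.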
Migration story for PR #1305
The intent is to credit @jseborga as co-author on the Phase A implementation PR: the three scrapers, dedup approach, persona text, and httpx.MockTransport testing pattern all port over. The full preservation list is in the doc.
PR #1305 stays open as the reference implementation while this proposal is reviewed; once Phase A merges, PR #1305 either closes or rebases into a small fixture PR creating eleven Bolivian corpora + three ScrapedSource rows.
What's in this PR
docs/architecture/proposals/0001-scheduled-scraping-and-corpus-groups.md: full design doc, including:
Test plan
Generated by Claude Code