SVVSD corpus indexer and MCP server. Parses board agenda staff tables, builds
meeting/minutes/video indexes from raw packets on /Volumes/BJH, and exposes
them as FastMCP search tools registered in OpenClaw / Claude Desktop.
14 years shipped · 14,736 staff records · 1,068 tests · 35 MCP tools · schema v1.0
| School year | ERA | Records |
|---|---|---|
| 2012-2013 | ERA_TABLE_2012 | 392 |
| 2013-2014 | ERA_TABLE_2012 | 984 |
| 2014-2015 | ERA_TABLE_2017 | 1,103 |
| 2015-2016 | ERA_TABLE_2017 | 0 (confirmed — no staff tables in corpus) |
| 2016-2017 | ERA_TABLE_2017 | 0 (confirmed — no staff tables in corpus) |
| 2017-2018 | ERA_TABLE_2017 | 1,105 |
| 2018-2019 | ERA_TABLE_2018 | 1,436 |
| 2019-2020 | ERA_TABLE_2018/2020/2021 (mixed) | 1,206 |
| 2020-2021 | ERA_TABLE_2020 | 1,475 |
| 2021-2022 | ERA_MEMO_2021 | 1,349 |
| 2022-2023 | ERA_TABLE_2020/2018/2021 (mixed) | 1,302 |
| 2023-2024 | ERA_TABLE_2020/2018/2021 (mixed) | 1,363 |
| 2024-2025 | ERA_TABLE_2020/2018/2021 (mixed) | 1,066 |
| 2025-2026 | ERA_TABLE_2020/2018/2021 (mixed) | 934 (partial year) |
| Total | 13,715 |
Output: /Volumes/BJH/data-hub-svvsd/staff/appointments/staff_YYYY-YYYY.csv — one file per year, frozen schema v1.0.
Years 2019-2020 onward use --era auto (mixed-era pages within a single year). See
docs/agenda_schema.md for ERA reference and .claude/skills/research/svvsd-board-agenda-parse.md
for the full era mix table.
SQLite database built from all board packets 2011–2026.
| Table | Rows | Description |
|---|---|---|
| meetings | 634 | All board meetings with date, type, agenda/minutes URLs; Phase 12 columns: meeting_type_canonical, is_primary_meeting, was_televised |
| people | 23 | Board members + superintendents with canonical names |
| people_aliases | 26 | Name variants + "Director X" aliases |
| attendance | 3,172 | Per-member attendance at each meeting |
| agenda_items | 6,501 | All agenda + minutes items with title, body_text, motion_text, category |
| votes | 11,019 | Per-member roll-call votes (yes/no/absent/recusal) |
| vote_summary | — (view) | Joins votes → people → meetings |
| public_comments | 688 | Named visitors and audience speakers |
| vendors | 36 | Canonical vendor/contractor registry |
| vendor_aliases | 5 | Name normalisation aliases |
| projects | 10 | Capital construction projects |
| contracts | 101 | Board-approved NTE contracts; 25 vendor_id=NULL (intentional — non-vendor contracts) |
| change_orders | 17 | CMGC construction change orders with delta and cumulative values |
| coi_exceptions | 34 | Conflict-of-interest (GBEA) per-person exceptions 2011–2024 |
| policy_actions | 78 | Board actions on named policies (CHSAA resolutions, adoptions, etc.) |
| policy_index | 388 | Board policies scraped from svvsd.org (section, doc_type, pdf_url) |
| student_advisory_appearances | 132 | HSAC student names + schools, 37 meetings |
| staff_events | 14,736 | Staff appointments/departures bridged from CSV to meetings by meeting_date |
| vendor_spend_summary | — (view) | Total NTE by vendor |
| school_closures | 7 | Board-approved school closure/consolidation/boundary events (Phase 15 — 120 → 7 confirmed genuine) |
| executive_sessions | 10 | Executive/closed session entries with legal basis |
| incidents | 0 | Incident/accident reports (intentionally empty; SVVSD handles incidents administratively) |
| org_changes | 2 | Organizational structure changes (position additions/eliminations) |
| enrollment_reports | 19 | Enrollment report presentations; Pass 1 metadata only |
| finance_reports | 137 | Financial report metadata; 15 budget_adoption rows have total_amount (General Fund) |
| board_communications | 2,971 | Per-member Board Reports/Communications bullet items (2010–2019 era) |
| _schema_meta | — | Schema version tracking |
Index CSVs: docs/meeting_index.csv, docs/minutes_index.csv, docs/video_index.csv,
docs/policy_index.csv (source for the policy_index DB table above).
SQLite database tracking court records, news, CORA requests, and regulatory
filings — separate from agenda_index.db by design.
| Table | Rows | Description |
|---|---|---|
| sources | 27 | Downloaded artifacts with URL, sha256, raw_path, tier |
| court_cases | 20 | Federal + state cases (civil, criminal, ADA, OCR, settlement) |
| case_sources | — | Many-to-many: court_cases ↔ sources |
| source_entities | — | Named entities extracted from sources |
| cora_requests | 5 | CORA request lifecycle tracker (draft → fulfilled/denied) |
| rag_chunks | 263 | Text chunks for LanceDB embedding pipeline |
| audit_log | — | Append-only provenance trail |
Built by build_research_rag_index.py. 22 sources → 263 chunks (avg 468 tokens).
- Embedding model:
nomic-ai/nomic-embed-text-v1.5via sentence-transformers (on-device) - Vector store: LanceDB, tier-split:
data-hub-svvsd/data/lancedb-public/(253 chunks) andlancedb-private/(10 chunks) - Chunk strategy: 512-token target, 64-token overlap, paragraph-then-sentence splitting
Registered in OpenClaw as svvsd-corpus.
CSV-backed:
| Tool | Source | Key filters |
|---|---|---|
search_minutes |
minutes_index.csv | keyword, date range, status |
search_videos |
video_index.csv | keyword, date range |
search_meetings |
meeting_index.csv | date range, type, gaps_only |
search_staff |
BJH/data-hub-svvsd/staff/appointments/staff_*.csv | name, position, location, action, year |
get_coverage_summary |
meeting_index.md | — |
DB-backed:
| Tool | Table(s) | Key filters |
|---|---|---|
search_votes |
vote_summary view | person, vote_value, keyword, date range |
search_public_comments |
public_comments | speaker, keyword, date range |
search_contracts |
contracts + vendors | vendor, contract_type, fiscal_year, min_nte |
search_coi_exceptions |
coi_exceptions | person, employer, fiscal_year |
search_policy_actions |
policy_actions + policy_index | policy_code, action_type; returns pdf_url |
search_policy_index |
policy_index | code, section, doc_type, keyword, is_local |
search_attendance |
attendance + people | person_name, status, school_year, date range |
search_change_orders |
change_orders + contracts | project, vendor, order_number |
search_student_advisory |
student_advisory_appearances | student_name, school, date range |
search_staff_events |
staff_events | name, position, location, action_type, year |
search_vendor_summary |
vendor_spend_summary | — (ordered by total_nte DESC) |
get_board_member_profile |
people + votes + attendance | name (resolved via aliases) |
search_policy_lifecycle |
policy_actions + policy_index | policy_code (chronological history) |
search_agenda_items |
agenda_items | keyword, category, outcome, school_year, doc_type |
search_tabled_items |
agenda_items | keyword, resolved, school_year |
search_school_closures |
school_closures | school_name, action_type, keyword |
search_finance_reports |
finance_reports | report_type, fiscal_year, fund_type |
search_executive_sessions |
executive_sessions | topic_category, legal_basis, keyword |
search_incidents |
incidents | incident_type, severity, location |
search_org_changes |
org_changes | change_type, department, keyword |
search_enrollment_reports |
enrollment_reports | report_type, school_year |
search_board_documents |
.md corpus (filesystem) | keyword (full-text), doc_type, school_year |
search_board_communications |
board_communications | person_name, keyword, school_year |
Research corpus (Phase 4 — research_index.db + LanceDB RAG):
| Tool | Source | Key filters |
|---|---|---|
search_research_sources |
LanceDB vector index | query (semantic), source_type, tier, outlet, date range |
search_court_cases |
court_cases | case_name, docket, court, case_type, disposition, svvsd_role |
search_cora_requests |
cora_requests | status, agency, priority, keyword |
search_research_sources embeds queries with nomic-embed-text-v1.5 and returns
ranked text chunks with scores, provenance URLs, and source metadata. Private-tier
results require SVVSD_PRIVATE_ACCESS=1 env var.
All tools return a _meta sentinel as the last item: {"_meta": true, "_total": N, "_offset": 0, ...}.
Use limit + offset for pagination. No hard caps.
Run stages before phases. Within Phase 7, order matters — run 7c before 7e.
# Stages — CSV indexes (run first)
build_minutes_index.py Stage 1a — minutes CSV index
build_video_index.py Stage 1b — video CSV index
build_meeting_index.py Stage 2 — unified meeting CSV index
# Phases — SQLite DB population (run in order)
build_agenda_index.py Phase 0 — register meetings in DB
build_people_registry.py Phase 1 — seed canonical people + aliases
build_attendance.py Phase 2 — board attendance from minutes PDFs
build_agenda_items.py Phase 3 — parse agenda + minutes PDFs → agenda_items
build_votes.py Phase 4 — motion attribution, bracket votes, pulled items
build_public_comments.py Phase 5 — VISITORS + AUDIENCE PARTICIPATION → public_comments
build_coi.py Phase 7a — COI (GBEA) per-person exceptions
build_policy_actions.py Phase 7b — CHSAA resolutions + general policy actions
build_change_orders.py Phase 7c — CMGC change orders ← must run BEFORE build_contracts.py
build_hsac.py Phase 7d — HSAC student advisory appearances
build_contracts.py Phase 7e — NTE contracts (run AFTER build_change_orders.py)
build_policy_index.py Phase 8 — scrape svvsd.org/board-policies → policy_index.csv + DB table
build_staff_events.py Phase 10 — bridge staff CSV → staff_events table by meeting_date
build_agenda_classification.py Phase 11 — keyword classifier for agenda_items.category
build_tabled_resolution.py Phase 11 — resolve tabled items to downstream outcomes
build_meeting_type_normalization.py Phase 12a — meeting_type_canonical + is_primary_meeting
build_rescheduled_links.py Phase 12b — rescheduled_from/to FK pairs
build_televised_flag.py Phase 12d — was_televised flag on pre-video-era meetings
build_vendor_null_resolution.py Phase 12e — curate vendor_id=NULL contracts
build_school_closures.py Phase 12c — school closure/consolidation events (Tool 21)
build_finance_reports.py Phase 12d — financial report metadata, Pass 1 (Tool 22)
populate_finance_amounts.py Phase 12d Pass 2 — General Fund appropriation amounts from .md corpus
build_executive_sessions.py Phase 13a — executive/closed-session entries (Tool 26)
build_incidents.py Phase 13b — incident/accident reports (Tool 27)
build_org_changes.py Phase 13c — organizational chart changes (Tool 28)
build_enrollment_reports.py Phase 13d — enrollment report presentations (Tool 29)
build_board_communications.py — Board Reports/Communications → board_communications table (Tool 32)
⚠️ Build order note:build_change_orders.py(7c) must run beforebuild_contracts.py(7e). The contracts builder uses a DB guard to skip items already classified as change orders. Reversing this order drops CO rows and creates duplicate contract stubs.
cd ~/Laboratory/project-svvsd
make install # python3.12 venv + deps
make build # rebuild all index CSVs
make run # start FastMCP server
make test # pytest (824 tests)
make lint # ruff
# Staff parser (individual year — all years already shipped)
source .venv/bin/activate
python parse_staff.py /Volumes/BJH/data-hub-svvsd/raw/Pdf/YYYY-YYYY \
--era auto --section all --out /Volumes/BJH/data-hub-svvsd/staff/appointments/staff_YYYY-YYYY.csvRaw board packets: /Volumes/BJH/data-hub-svvsd/raw/ — READ-ONLY, never write here.
data-hub-staging/ lives in ~/Laboratory/project-data-cleanup/ (migrated 2026-05-27).
/Volumes/BJH/data-hub-svvsd/raw/ ← source packets (READ-ONLY)
project-svvsd ← corpus indexer (this repo)
↓ symlinked into
audio-agent/projects/svvsd/ ← audio-agent corpus access
research-agent/projects/svvsd/ ← research-agent corpus access
MCP config: ~/.openclaw/openclaw.json (OpenClaw) and
~/Library/Application Support/Claude/claude_desktop_config.json (Claude Desktop).
These are confirmed data limitations — do not re-investigate without new source PDFs.
| Gap | Detail |
|---|---|
| CHSAA 2013–2016 | October minutes PDFs for these years are not in the indexed corpus; no parser fix will recover them — source documents needed |
| CHSAA 2020-2021 | Recovered via agenda fallback (Pass 1b) — minutes were not indexed for that meeting |
| Vendor null (25 contracts) | 25 of 101 contracts have vendor_id = NULL; intentional — non-vendor contracts (federal grants, umbrella approvals, charter school agreements). Documented in docs/vendor_resolution_map.csv |
| Staff years 2015-2016, 2016-2017 | Confirmed no staff tables exist in raw corpus for these years |
| Board Communications 2020–2024 | Post-COVID minutes dropped the Board Reports section; no per-member items exist for these years |
| School closures pre-2011 | Corpus coverage starts 2011; no pre-corpus closure events can be recovered from these sources |
| HSAC 2022–2026 | Post-2022, student advisory reports were folded into VISITORS and SUPERINTENDENT UPDATE sections; no dedicated HSAC agenda items exist. Students still appear in body_text but not as standalone items |
| COI person_id (all 34 NULL) | COI exceptions are staff disclosures parsed from employee names; person_id linkage to board members intentionally deferred (low volume, ambiguous name overlap) |
| Phase | What shipped |
|---|---|
| 0–2 | Meeting index, people registry, attendance |
| 3–4 | Agenda items parser, votes with roll-call attribution |
| 5 | Public comments (VISITORS + AUDIENCE PARTICIPATION) |
| 6 | Video index, policy index scraper |
| 7 | Investigative registries: COI exceptions, policy actions, change orders, HSAC appearances, NTE contracts |
| 8 | policy_index table + loader; 4 new MCP tools; mcp_db fixture + test_mcp_tools.py |
| 10 | staff_events table (13,712 rows); vendor_spend_summary view; 4 new MCP tools |
| 11 | Agenda item classification (10 categories); build_agenda_classification.py; search_agenda_items, search_tabled_items |
| 12 | Meeting type normalization, rescheduled links, televised flag, vendor null resolution, voice vote audit; search_school_closures, search_finance_reports (107 rows, 14 with General Fund amounts) |
| 13 Track A | search_executive_sessions (10 rows), search_incidents, search_org_changes, search_enrollment_reports |
| 31–32 | search_board_documents (full-text .md corpus search); search_board_communications (2,961 per-member items, 2010–2019) |
| 14 | Track A curation confirmed (org_changes 2 rows, incidents 0 intentional, enrollment_reports 19 rows); board_communications name resolution (601 rows backfilled, 5 new board members, 0 NULL person_ids); 834 tests |
| 15 | build_school_closures.py keyword regex tightened (120 → 7 candidates); 7 confirmed genuine rows written; 841 tests |
| 16 | Classifier: policy-in-consent override (149 items reclassified consent → policy); contract cleanup (mill-levy false positive deleted, 101 descriptions backfilled); HSAC gap confirmed real-world; 1,068 tests |
| Research 1–3 | research_index.db schema + helpers; ingest_sources.py provenance gate (22 sources resolved, CourtListener API); build_rag_index.py (263 chunks, nomic-embed-text-v1.5, LanceDB) |
| Research 4 | research_rag.py RAG query engine; 3 new MCP tools: search_research_sources, search_court_cases, search_cora_requests; 35 tools total |
Migrated from research-agent on 2026-05-17 via git-filter-repo.
Original SHA: 939dcbd83a9959cd8d1c33e04ebce13192627ab2.
Handoff documents in docs/HANDOFF_*.md are historical records of session context.