Skip to content

redbeard-alt/project-svvsd

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

230 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

project-svvsd

SVVSD corpus indexer and MCP server. Parses board agenda staff tables, builds meeting/minutes/video indexes from raw packets on /Volumes/BJH, and exposes them as FastMCP search tools registered in OpenClaw / Claude Desktop.

Status (2026-05-27)

14 years shipped · 14,736 staff records · 1,068 tests · 35 MCP tools · schema v1.0

Staff Corpus

School year ERA Records
2012-2013 ERA_TABLE_2012 392
2013-2014 ERA_TABLE_2012 984
2014-2015 ERA_TABLE_2017 1,103
2015-2016 ERA_TABLE_2017 0 (confirmed — no staff tables in corpus)
2016-2017 ERA_TABLE_2017 0 (confirmed — no staff tables in corpus)
2017-2018 ERA_TABLE_2017 1,105
2018-2019 ERA_TABLE_2018 1,436
2019-2020 ERA_TABLE_2018/2020/2021 (mixed) 1,206
2020-2021 ERA_TABLE_2020 1,475
2021-2022 ERA_MEMO_2021 1,349
2022-2023 ERA_TABLE_2020/2018/2021 (mixed) 1,302
2023-2024 ERA_TABLE_2020/2018/2021 (mixed) 1,363
2024-2025 ERA_TABLE_2020/2018/2021 (mixed) 1,066
2025-2026 ERA_TABLE_2020/2018/2021 (mixed) 934 (partial year)
Total 13,715

Output: /Volumes/BJH/data-hub-svvsd/staff/appointments/staff_YYYY-YYYY.csv — one file per year, frozen schema v1.0.

Years 2019-2020 onward use --era auto (mixed-era pages within a single year). See docs/agenda_schema.md for ERA reference and .claude/skills/research/svvsd-board-agenda-parse.md for the full era mix table.

Investigative Database (docs/agenda_index.db)

SQLite database built from all board packets 2011–2026.

Table Rows Description
meetings 634 All board meetings with date, type, agenda/minutes URLs; Phase 12 columns: meeting_type_canonical, is_primary_meeting, was_televised
people 23 Board members + superintendents with canonical names
people_aliases 26 Name variants + "Director X" aliases
attendance 3,172 Per-member attendance at each meeting
agenda_items 6,501 All agenda + minutes items with title, body_text, motion_text, category
votes 11,019 Per-member roll-call votes (yes/no/absent/recusal)
vote_summary — (view) Joins votes → people → meetings
public_comments 688 Named visitors and audience speakers
vendors 36 Canonical vendor/contractor registry
vendor_aliases 5 Name normalisation aliases
projects 10 Capital construction projects
contracts 101 Board-approved NTE contracts; 25 vendor_id=NULL (intentional — non-vendor contracts)
change_orders 17 CMGC construction change orders with delta and cumulative values
coi_exceptions 34 Conflict-of-interest (GBEA) per-person exceptions 2011–2024
policy_actions 78 Board actions on named policies (CHSAA resolutions, adoptions, etc.)
policy_index 388 Board policies scraped from svvsd.org (section, doc_type, pdf_url)
student_advisory_appearances 132 HSAC student names + schools, 37 meetings
staff_events 14,736 Staff appointments/departures bridged from CSV to meetings by meeting_date
vendor_spend_summary — (view) Total NTE by vendor
school_closures 7 Board-approved school closure/consolidation/boundary events (Phase 15 — 120 → 7 confirmed genuine)
executive_sessions 10 Executive/closed session entries with legal basis
incidents 0 Incident/accident reports (intentionally empty; SVVSD handles incidents administratively)
org_changes 2 Organizational structure changes (position additions/eliminations)
enrollment_reports 19 Enrollment report presentations; Pass 1 metadata only
finance_reports 137 Financial report metadata; 15 budget_adoption rows have total_amount (General Fund)
board_communications 2,971 Per-member Board Reports/Communications bullet items (2010–2019 era)
_schema_meta Schema version tracking

Index CSVs: docs/meeting_index.csv, docs/minutes_index.csv, docs/video_index.csv, docs/policy_index.csv (source for the policy_index DB table above).

Research Corpus (docs/research_index.db)

SQLite database tracking court records, news, CORA requests, and regulatory filings — separate from agenda_index.db by design.

Table Rows Description
sources 27 Downloaded artifacts with URL, sha256, raw_path, tier
court_cases 20 Federal + state cases (civil, criminal, ADA, OCR, settlement)
case_sources Many-to-many: court_cases ↔ sources
source_entities Named entities extracted from sources
cora_requests 5 CORA request lifecycle tracker (draft → fulfilled/denied)
rag_chunks 263 Text chunks for LanceDB embedding pipeline
audit_log Append-only provenance trail

RAG Index

Built by build_research_rag_index.py. 22 sources → 263 chunks (avg 468 tokens).

  • Embedding model: nomic-ai/nomic-embed-text-v1.5 via sentence-transformers (on-device)
  • Vector store: LanceDB, tier-split: data-hub-svvsd/data/lancedb-public/ (253 chunks) and lancedb-private/ (10 chunks)
  • Chunk strategy: 512-token target, 64-token overlap, paragraph-then-sentence splitting

MCP Server — 35 Tools

Registered in OpenClaw as svvsd-corpus.

CSV-backed:

Tool Source Key filters
search_minutes minutes_index.csv keyword, date range, status
search_videos video_index.csv keyword, date range
search_meetings meeting_index.csv date range, type, gaps_only
search_staff BJH/data-hub-svvsd/staff/appointments/staff_*.csv name, position, location, action, year
get_coverage_summary meeting_index.md

DB-backed:

Tool Table(s) Key filters
search_votes vote_summary view person, vote_value, keyword, date range
search_public_comments public_comments speaker, keyword, date range
search_contracts contracts + vendors vendor, contract_type, fiscal_year, min_nte
search_coi_exceptions coi_exceptions person, employer, fiscal_year
search_policy_actions policy_actions + policy_index policy_code, action_type; returns pdf_url
search_policy_index policy_index code, section, doc_type, keyword, is_local
search_attendance attendance + people person_name, status, school_year, date range
search_change_orders change_orders + contracts project, vendor, order_number
search_student_advisory student_advisory_appearances student_name, school, date range
search_staff_events staff_events name, position, location, action_type, year
search_vendor_summary vendor_spend_summary — (ordered by total_nte DESC)
get_board_member_profile people + votes + attendance name (resolved via aliases)
search_policy_lifecycle policy_actions + policy_index policy_code (chronological history)
search_agenda_items agenda_items keyword, category, outcome, school_year, doc_type
search_tabled_items agenda_items keyword, resolved, school_year
search_school_closures school_closures school_name, action_type, keyword
search_finance_reports finance_reports report_type, fiscal_year, fund_type
search_executive_sessions executive_sessions topic_category, legal_basis, keyword
search_incidents incidents incident_type, severity, location
search_org_changes org_changes change_type, department, keyword
search_enrollment_reports enrollment_reports report_type, school_year
search_board_documents .md corpus (filesystem) keyword (full-text), doc_type, school_year
search_board_communications board_communications person_name, keyword, school_year

Research corpus (Phase 4 — research_index.db + LanceDB RAG):

Tool Source Key filters
search_research_sources LanceDB vector index query (semantic), source_type, tier, outlet, date range
search_court_cases court_cases case_name, docket, court, case_type, disposition, svvsd_role
search_cora_requests cora_requests status, agency, priority, keyword

search_research_sources embeds queries with nomic-embed-text-v1.5 and returns ranked text chunks with scores, provenance URLs, and source metadata. Private-tier results require SVVSD_PRIVATE_ACCESS=1 env var.

All tools return a _meta sentinel as the last item: {"_meta": true, "_total": N, "_offset": 0, ...}. Use limit + offset for pagination. No hard caps.

Build Pipeline

Run stages before phases. Within Phase 7, order matters — run 7c before 7e.

# Stages — CSV indexes (run first)
build_minutes_index.py      Stage 1a — minutes CSV index
build_video_index.py        Stage 1b — video CSV index
build_meeting_index.py      Stage 2  — unified meeting CSV index

# Phases — SQLite DB population (run in order)
build_agenda_index.py       Phase 0 — register meetings in DB
build_people_registry.py    Phase 1 — seed canonical people + aliases
build_attendance.py         Phase 2 — board attendance from minutes PDFs
build_agenda_items.py       Phase 3 — parse agenda + minutes PDFs → agenda_items
build_votes.py              Phase 4 — motion attribution, bracket votes, pulled items
build_public_comments.py    Phase 5 — VISITORS + AUDIENCE PARTICIPATION → public_comments
build_coi.py                Phase 7a — COI (GBEA) per-person exceptions
build_policy_actions.py     Phase 7b — CHSAA resolutions + general policy actions
build_change_orders.py      Phase 7c — CMGC change orders  ← must run BEFORE build_contracts.py
build_hsac.py               Phase 7d — HSAC student advisory appearances
build_contracts.py          Phase 7e — NTE contracts (run AFTER build_change_orders.py)
build_policy_index.py       Phase 8 — scrape svvsd.org/board-policies → policy_index.csv + DB table
build_staff_events.py       Phase 10 — bridge staff CSV → staff_events table by meeting_date
build_agenda_classification.py Phase 11 — keyword classifier for agenda_items.category
build_tabled_resolution.py  Phase 11 — resolve tabled items to downstream outcomes
build_meeting_type_normalization.py Phase 12a — meeting_type_canonical + is_primary_meeting
build_rescheduled_links.py  Phase 12b — rescheduled_from/to FK pairs
build_televised_flag.py     Phase 12d — was_televised flag on pre-video-era meetings
build_vendor_null_resolution.py Phase 12e — curate vendor_id=NULL contracts
build_school_closures.py    Phase 12c — school closure/consolidation events (Tool 21)
build_finance_reports.py    Phase 12d — financial report metadata, Pass 1 (Tool 22)
populate_finance_amounts.py Phase 12d Pass 2 — General Fund appropriation amounts from .md corpus
build_executive_sessions.py Phase 13a — executive/closed-session entries (Tool 26)
build_incidents.py          Phase 13b — incident/accident reports (Tool 27)
build_org_changes.py        Phase 13c — organizational chart changes (Tool 28)
build_enrollment_reports.py Phase 13d — enrollment report presentations (Tool 29)
build_board_communications.py — Board Reports/Communications → board_communications table (Tool 32)

⚠️ Build order note: build_change_orders.py (7c) must run before build_contracts.py (7e). The contracts builder uses a DB guard to skip items already classified as change orders. Reversing this order drops CO rows and creates duplicate contract stubs.

Dev Commands

cd ~/Laboratory/project-svvsd
make install        # python3.12 venv + deps
make build          # rebuild all index CSVs
make run            # start FastMCP server
make test           # pytest (824 tests)
make lint           # ruff

# Staff parser (individual year — all years already shipped)
source .venv/bin/activate
python parse_staff.py /Volumes/BJH/data-hub-svvsd/raw/Pdf/YYYY-YYYY \
  --era auto --section all --out /Volumes/BJH/data-hub-svvsd/staff/appointments/staff_YYYY-YYYY.csv

Source Data

Raw board packets: /Volumes/BJH/data-hub-svvsd/raw/READ-ONLY, never write here.

data-hub-staging/ lives in ~/Laboratory/project-data-cleanup/ (migrated 2026-05-27).

Architecture

/Volumes/BJH/data-hub-svvsd/raw/   ← source packets (READ-ONLY)
project-svvsd                       ← corpus indexer (this repo)
  ↓ symlinked into
audio-agent/projects/svvsd/         ← audio-agent corpus access
research-agent/projects/svvsd/      ← research-agent corpus access

MCP config: ~/.openclaw/openclaw.json (OpenClaw) and ~/Library/Application Support/Claude/claude_desktop_config.json (Claude Desktop).

Known Gaps

These are confirmed data limitations — do not re-investigate without new source PDFs.

Gap Detail
CHSAA 2013–2016 October minutes PDFs for these years are not in the indexed corpus; no parser fix will recover them — source documents needed
CHSAA 2020-2021 Recovered via agenda fallback (Pass 1b) — minutes were not indexed for that meeting
Vendor null (25 contracts) 25 of 101 contracts have vendor_id = NULL; intentional — non-vendor contracts (federal grants, umbrella approvals, charter school agreements). Documented in docs/vendor_resolution_map.csv
Staff years 2015-2016, 2016-2017 Confirmed no staff tables exist in raw corpus for these years
Board Communications 2020–2024 Post-COVID minutes dropped the Board Reports section; no per-member items exist for these years
School closures pre-2011 Corpus coverage starts 2011; no pre-corpus closure events can be recovered from these sources
HSAC 2022–2026 Post-2022, student advisory reports were folded into VISITORS and SUPERINTENDENT UPDATE sections; no dedicated HSAC agenda items exist. Students still appear in body_text but not as standalone items
COI person_id (all 34 NULL) COI exceptions are staff disclosures parsed from employee names; person_id linkage to board members intentionally deferred (low volume, ambiguous name overlap)

Phase Log

Phase What shipped
0–2 Meeting index, people registry, attendance
3–4 Agenda items parser, votes with roll-call attribution
5 Public comments (VISITORS + AUDIENCE PARTICIPATION)
6 Video index, policy index scraper
7 Investigative registries: COI exceptions, policy actions, change orders, HSAC appearances, NTE contracts
8 policy_index table + loader; 4 new MCP tools; mcp_db fixture + test_mcp_tools.py
10 staff_events table (13,712 rows); vendor_spend_summary view; 4 new MCP tools
11 Agenda item classification (10 categories); build_agenda_classification.py; search_agenda_items, search_tabled_items
12 Meeting type normalization, rescheduled links, televised flag, vendor null resolution, voice vote audit; search_school_closures, search_finance_reports (107 rows, 14 with General Fund amounts)
13 Track A search_executive_sessions (10 rows), search_incidents, search_org_changes, search_enrollment_reports
31–32 search_board_documents (full-text .md corpus search); search_board_communications (2,961 per-member items, 2010–2019)
14 Track A curation confirmed (org_changes 2 rows, incidents 0 intentional, enrollment_reports 19 rows); board_communications name resolution (601 rows backfilled, 5 new board members, 0 NULL person_ids); 834 tests
15 build_school_closures.py keyword regex tightened (120 → 7 candidates); 7 confirmed genuine rows written; 841 tests
16 Classifier: policy-in-consent override (149 items reclassified consent → policy); contract cleanup (mill-levy false positive deleted, 101 descriptions backfilled); HSAC gap confirmed real-world; 1,068 tests
Research 1–3 research_index.db schema + helpers; ingest_sources.py provenance gate (22 sources resolved, CourtListener API); build_rag_index.py (263 chunks, nomic-embed-text-v1.5, LanceDB)
Research 4 research_rag.py RAG query engine; 3 new MCP tools: search_research_sources, search_court_cases, search_cora_requests; 35 tools total

History

Migrated from research-agent on 2026-05-17 via git-filter-repo. Original SHA: 939dcbd83a9959cd8d1c33e04ebce13192627ab2. Handoff documents in docs/HANDOFF_*.md are historical records of session context.

About

SVVSD Board Agenda Staff Table Parser and corpus processing pipeline

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors