BoardBreeze Concierge

The customer success team I couldn't afford to hire — running on Claude Opus 4.7.

A voice concierge for appboardbreeze.com, the SaaS that helps California public-agency boards run Brown-Act-compliant meetings. Callers dial one phone number; Claude Opus 4.7 answers governance questions, resolves product support, closes deals, and escalates to the founder when a human is actually needed.

Started as my entry to the Anthropic × Cerebral Valley Global Hackathon (Apr 21–27 2026). Built with Claude Opus 4.7 and Claude Code. Continuing post-hackathon as a real BoardBreeze product surface.


Try it now

Running 24/7 on AWS ECS Fargate (us-east-2) behind an Application Load Balancer with an Amazon-issued TLS certificate.


What it does

  1. One phone number. Board secretaries, clerks, and prospects call; the concierge answers 24/7.
  2. Five specialist modes, picked per turn. A Haiku 4.5 classifier reads each caller turn and routes to one of five focused mode prompts — Governance Expert, Product Expert, Tech Support, Sales Closer, Escalation — each with its own tool subset. Application-level multi-agent on the voice path; faster and cheaper than threading every turn through a coordinator.
  3. Citations get verified before they ship. Every statutory citation passes through verify_citation — section-exact KB lookup → Haiku 4.5 claim-support classifier — before the agent reads it aloud. Enforced as a tool contract, not a prompt instruction we hope the model follows.
  4. Sub-3-second voice replies. Direct Messages API streaming, sentence-level ElevenLabs synth, and chained TwiML so Twilio doesn't buffer. Caller hears a filler within ~500 ms; real reply lands around ~2.5 s on governance questions (splitter sketch after this list).
  5. Hot leads page Grace. When the agent escalates, escalate_to_grace emails Grace via Resend with the caller's phone, urgency, and a clean summary. Email reaches her phone reliably; the alternative SMS path is a wired-but-dormant best-effort second leg.
  6. Opus when it matters, Haiku where it pays off. Opus 4.7 runs the main turn; Haiku 4.5 handles the per-turn mode classifier and the verify_citation claim-support check. Executor model is env-configurable (VOICE_EXECUTOR_MODEL) for A/B testing.
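
A minimal sketch of the sentence-level splitter from item 4 (names invented; the real logic lives in voice_pipeline.py). It splits on sentence and em-dash boundaries so each clause can go to TTS the moment it completes:

import re

# Illustrative boundary pattern, not the repo's actual regex: break after
# sentence enders and around spaced em-dashes.
BOUNDARY = re.compile(r"(?<=[.!?])\s+|\s+—\s+")

def stream_clauses(token_stream):
    """Yield complete clauses as tokens arrive so TTS can start early."""
    buf = ""
    for token in token_stream:            # e.g. Messages API text deltas
        buf += token
        while (m := BOUNDARY.search(buf)):
            clause, buf = buf[:m.start()], buf[m.end():]
            if clause.strip():
                yield clause              # hand off to ElevenLabs immediately
    if buf.strip():
        yield buf                         # flush the trailing fragment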

Architecture

   Twilio call ─▶ /gather (caller speaks)
                       │
                       ▼
              ┌─────────────────────────┐
              │  voice_pipeline.run_turn│
              └────────────┬────────────┘
                           │
                           ▼
              ┌─────────────────────────┐
              │  Haiku 4.5 classifier   │  ~6 tokens, ~0.5 s
              │  picks one of 5 modes   │
              └────────────┬────────────┘
                           │
                           ▼
   ┌──────────────────────────────────────────────────────┐
   │  ONE FOCUSED MODE per turn — own prompt + tool set:  │
   │   • Governance Expert  (Brown Act, Bagley-Keene,     │
   │                         Robert's Rules)              │
   │   • Product Expert     (features, plans, how-to)     │
   │   • Tech Support       (bug triage)                  │
   │   • Sales Closer       (pricing, demos)              │
   │   • Escalation Handler (single job: page Grace)      │
   │                                                      │
   │  Opus 4.7 runs the turn (env-configurable to Haiku). │
   │  Sentence-streamed to ElevenLabs as it generates.    │
   └────────────┬─────────────────────────────────────────┘
                │
                ▼
   ┌────────────────────────┐  ┌────────────────────────┐
   │ search_governance_kb / │  │  verify_citation       │
   │ search_product_kb      │  │  (Haiku 4.5 — section  │
   │ (Voyage 512-dim +      │  │  lookup + claim-       │
   │  Supabase pgvector)    │  │  support classifier)   │
   └────────────────────────┘  └────────────────────────┘

   ┌────────────────────────┐  ┌────────────────────────┐
   │  consult_advisor       │  │  escalate_to_grace     │
   │  (Opus 4.7 second      │  │  (Resend email — SMS   │
   │  opinion when needed)  │  │  leg dormant)          │
   └────────────────────────┘  └────────────────────────┘

   ▼ (parallel) ─ TTS path
   ElevenLabs eleven_flash_v2_5 (Polly fallback)
   Chained TwiML: filler → background turn → MP3 → next gather

Per-mode dispatch ("Phase 2.5"). Each caller turn opens with a Haiku classification call (~6 output tokens, ~0.5 s of latency) that picks one of governance / product / tech_support / sales / escalation. The dispatched mode owns its system prompt (in app/managed_agents/agents/<mode>.py) and gets only the tool subset it needs — escalation_handler has just escalate_to_grace; governance_expert has search_governance_kb + verify_citation + consult_advisor; and so on. Focused prompts cut the cross-rule interference that monolithic prompts produce on Haiku-class executors. Universal rules (voice budget, no-SMS, escalation pair, ground rules) live in one shared module so the per-mode prompts don't drift.
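
A hedged sketch of that dispatch loop. MODE_PROMPTS, TOOL_SUBSETS, and the model id strings are illustrative stand-ins; the real loop lives in voice_pipeline.py:

import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY

MODES = ["governance", "product", "tech_support", "sales", "escalation"]
MODE_PROMPTS = {m: f"(focused {m} prompt + shared universal rules)" for m in MODES}
TOOL_SUBSETS = {m: [] for m in MODES}     # real app: per-mode tool definitions

def classify_mode(caller_turn: str) -> str:
    """One cheap classifier call per turn; emits only the mode token."""
    resp = client.messages.create(
        model="claude-haiku-4-5",          # assumed id for Haiku 4.5
        max_tokens=8,
        system="Classify the caller's turn as exactly one of: "
               + ", ".join(MODES) + ". Reply with the mode name only.",
        messages=[{"role": "user", "content": caller_turn}],
    )
    mode = resp.content[0].text.strip().lower()
    return mode if mode in MODES else "governance"   # safe default on a misfire

def run_turn(history: list[dict], caller_turn: str):
    mode = classify_mode(caller_turn)
    return client.messages.create(
        model="claude-opus-4-7",           # assumed id; VOICE_EXECUTOR_MODEL in prod
        max_tokens=300,                    # short replies fit the voice budget
        system=MODE_PROMPTS[mode],         # only the active mode's prompt
        tools=TOOL_SUBSETS[mode],          # only the tools that mode owns
        messages=history + [{"role": "user", "content": caller_turn}],
    )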

Why not Claude Managed Agents for voice? We tried. CMA's harness adds ~10–14 s of cold-start TTFT even on Haiku 4.5, blowing through the 12 s we can afford inside Twilio's ~15 s webhook window. Bench in notes/voice_cma_bench_results.md. The CMA multi-agent topology (coordinator + 5 sub-agents using SDK 0.100.0's first-class multiagent kwarg) is fully built in app/managed_agents/agents/ and one shell command away from deployment via scripts/upgrade_to_multiagent.py — kept as a working reference architecture in case SMS becomes a real surface, but voice ships on the direct Messages API.

The KB. A single Supabase governance_kb table backs both retrieval tools. Rows tagged jurisdiction='CA' / 'CA_STATE' / 'any' are public-meeting law (Brown Act, Bagley-Keene, Robert's Rules, Ed Code) — 20 hand-curated chunks with exact statutory citations. Rows tagged jurisdiction='product' are the BoardBreeze product FAQ — 61 chunks covering plans, pricing, free trial, auth, audio formats, transcription, minutes formatting, security, and the glossary. Embeddings are Voyage voyage-3-lite (512-dim); retrieval is pgvector cosine similarity via the match_governance_kb RPC.
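
A sketch of the retrieval path under stated assumptions: the RPC argument names are guesses, and the client wiring is illustrative rather than the repo's actual code:

import os
import voyageai
from supabase import create_client

vo = voyageai.Client()                       # reads VOYAGE_API_KEY
sb = create_client(os.environ["SUPABASE_URL"],
                   os.environ["SUPABASE_SERVICE_ROLE_KEY"])

def search_kb(query: str, jurisdiction: str = "CA", k: int = 5) -> list[dict]:
    # voyage-3-lite embeds at 512 dims, matching the pgvector column
    emb = vo.embed([query], model="voyage-3-lite").embeddings[0]
    resp = sb.rpc("match_governance_kb", {
        "query_embedding": emb,              # RPC arg names are assumptions
        "match_count": k,
        "jurisdiction": jurisdiction,        # 'CA', 'CA_STATE', 'any', or 'product'
    }).execute()
    return resp.data                         # rows ranked by cosine similarity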


How we used Opus 4.7

We chose Opus 4.7 specifically for three properties that 4.6 couldn't match:

  1. Mode coherence within a focused prompt. Each per-mode prompt is ~3K characters of focused guidance (governance KB rules, voice budget, escalation pair, etc.). 4.7 follows the focused prompt through tool round-trips without drift; 4.6 kept reverting to whichever mode appeared first in the prompt after a tool call returned. This was the original motivation for Phase 2.5 — even with focused per-mode prompts, model choice still matters for staying in mode.
  2. Precise instruction following. Each mode's prompt specifies exactly when to cite, when to defer to counsel, when to call verify_citation, when to escalate. 4.7 follows the conditionals without drift.
  3. Adapted prompts to literal interpretation. Per Tark's AMA, we rewrote NEVER/ALWAYS prohibitions as conditional language ("avoid X unless Y") to prevent 4.7 from over-triggering on adjacent legitimate behavior. Governance Expert mode explains what the Brown Act is without refusing, while still declining jurisdiction-specific legal advice.

The same pattern is enforced as a project rule in CLAUDE.md so it survives future prompt edits.
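
As a concrete illustration of the convention (wording invented here, not the shipped prompt):

# Hypothetical before/after: trade absolute prohibitions for conditionals
# with an explicit edge-case path. The shipped prompts differ.
BEFORE = "NEVER give legal advice."
AFTER = (
    "Explain what the Brown Act generally requires. Avoid advice that "
    "interprets a statute against this caller's specific facts; in that "
    "case, offer to connect the caller with counsel or escalate to Grace."
)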


What Opus 4.6 couldn't do

Three concrete behaviors made 4.6 the wrong fit for this product. Each one mapped to a specific change we made for 4.7.

1. Mode coherence under tool use. The original v1–v6 architecture held five specialist modes in a single system prompt. On 4.6, after a tool call returned (e.g. search_governance_kb mid-sales-conversation), the model drifted back to whichever mode the prompt mentioned first — typically Governance — and re-introduced a citation the caller hadn't asked for. 4.7 stays locked on the active mode through the tool round-trip. The Phase 2.5 per-mode dispatch (v11) made this even cleaner — each turn runs against only the relevant mode's prompt, so cross-mode interference can't happen at all. Same KB, different model + cleaner runtime: 4.7 ships, 4.6 doesn't.

2. Long-call thread continuity. A real support call can run 10+ minutes and 30–40 turns, a length at which (per the framing of 4.7's release) 4.6 loses the thread, so we lean on 4.7's longer coherent context. In-call history is held per-CallSid in voice_pipeline.py and survives mode switches mid-call: a caller can ask a Brown Act question and then ask for Grace; the classifier shifts modes between turns, and the full prior context flows through.

3. Literal instruction-following as a feature, not a bug. Prompt writers used to over-emphasize prohibitions in caps lock ("NEVER do X", "ALWAYS do Y") because earlier models were loose. 4.7 follows those instructions literally, which means a NEVER instruction over-triggers and refuses adjacent legitimate behavior. We adapted: the Concierge system prompt uses conditional language throughout ("avoid X unless Y; here's how to handle the edge case"). The Governance Expert mode now explains what the Brown Act is without refusing, while still declining jurisdiction-specific legal advice that requires interpreting a statute against a caller's specific facts. CLAUDE.md rule #1 enforces this convention for future prompt edits.

The verify_citation layer is what makes the literal-following bet safe in production: even when 4.7 is willing to cite a section, the tool contract gates the citation against the actual KB text. That contract isn't a "should" the model needs to remember — it's a function call the agent has to make before it speaks.


How we used Claude Managed Agents (and where we landed)

After Michael Cohen's Thursday session at the hackathon, we pivoted from a planned six-agent supervisor topology to one Managed Agent + specialist modes — his recommended pattern at the time. See notes/cohen-managed-agents.md for the direct quotes.

When SDK 0.100.0 shipped Anthropic's first-class multi-agent primitive (a multiagent={"type":"coordinator","agents":[...]} kwarg on agents.create), we built the topology: 1 coordinator + 5 specialist sub-agents, each with focused prompts and dedicated tool subsets. The full implementation lives in app/managed_agents/agents/ and a one-shot rollout script in scripts/upgrade_to_multiagent.py.

Then we benchmarked. Detailed numbers in notes/voice_cma_bench_results.md, but the headline:

Model       Mode    Mean TTFT   Max TTFT
Opus 4.7    cold    13.86 s     20.67 s
Opus 4.7    warm     5.19 s      7.99 s
Haiku 4.5   cold    10.99 s     14.72 s
Haiku 4.5   warm     9.32 s     13.02 s

Twilio voice webhooks time out at roughly 15 s, and the voice path budgets only 12 s of that for the model turn. CMA's harness overhead is structural, not model-dependent — Haiku WARM (9.32 s) is actually slower than Opus WARM (5.19 s) on the same path, because per-event scheduling dominates inference time. Voice cannot ride CMA today.

So we kept the prompt design and dropped the runtime. The 5 focused mode prompts now serve the voice channel via a Haiku classifier in voice_pipeline.py — application-level routing, no harness overhead, ~0.5 s of classifier latency per turn. Same focused-prompt quality benefit; runs on the channel that's actually live. The CMA wiring stays as a working reference architecture in case the harness ever ships sub-3 s TTFT, or SMS becomes a real product surface (today it isn't; email handles outbound, callers call).

The custom tools (search_governance_kb, search_product_kb, verify_citation, consult_advisor, escalate_to_grace) live in app/managed_agents/tools_registry.py as a single source of truth — both paths (voice direct API and dormant CMA topology) materialize their tool subsets from the same registry, so descriptions don't drift.
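
A shape sketch of that registry (tool names come from the README; the schema layout is assumed):

# Single-source registry: both runtimes materialize tool subsets from one
# dict, so descriptions can't drift between the voice path and CMA.
TOOLS: dict[str, dict] = {
    "verify_citation": {
        "name": "verify_citation",
        "description": "Check a statutory citation against the curated KB "
                       "before it is spoken to the caller.",
        "input_schema": {
            "type": "object",
            "properties": {
                "citation": {"type": "string"},
                "claim": {"type": "string"},
            },
            "required": ["citation", "claim"],
        },
    },
    # search_governance_kb, search_product_kb, consult_advisor, and
    # escalate_to_grace are registered the same way.
}

def tool_subset(names: list[str]) -> list[dict]:
    """Materialize one mode's tool list from the shared registry."""
    return [TOOLS[n] for n in names]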


How we caught hallucinations: the verification layer

Section 16.5 of the playbook articulated the single highest-leverage piece of the architecture: the Governance Expert mode cannot ship a citation until a separate verifier confirms (a) the cited section exists in our curated KB and (b) its actual text supports the claim.

The flow:

  1. Draft. Governance Expert mode generates a reply with a citation (e.g., "Government Code §54954.2 requires 72-hour posting").
  2. Verify. The agent calls verify_citation(citation, claim). The verifier extracts the section number, looks it up in governance_kb, and asks Haiku 4.5: "Does this passage support this claim? yes / no / partial." Returns {verified, actual_text} on a hit, {verified: false, suggested_rewrite} on a miss. Haiku keeps the round trip under ~300 ms — the agent gets the answer well inside the latency budget.
  3. Rewrite or hedge. If verification fails, the agent rewrites with the suggested safe phrasing or hedges: "I'm not certain on the specific section — let me connect you with someone who can confirm."
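
A minimal sketch of that contract, assuming a kb_lookup stand-in for the section-exact governance_kb query and an illustrative Haiku model id:

import re
import anthropic

client = anthropic.Anthropic()

def verify_citation(citation: str, claim: str, kb_lookup) -> dict:
    """Gate a citation on the KB: exact section lookup, then claim support."""
    m = re.search(r"(\d+(?:\.\d+)?)", citation)        # pull the section number
    if m is None:
        return {"verified": False,
                "suggested_rewrite": "State the rule without a section number."}
    row = kb_lookup(section=m.group(1))                # exact match, no similarity search
    if row is None:
        return {"verified": False,
                "suggested_rewrite": "That section isn't in the KB; hedge and offer escalation."}
    resp = client.messages.create(
        model="claude-haiku-4-5",                      # assumed id for Haiku 4.5
        max_tokens=4,
        system="Does the passage support the claim? Answer yes, no, or partial.",
        messages=[{"role": "user",
                   "content": f"Passage:\n{row['text']}\n\nClaim:\n{claim}"}],
    )
    verdict = resp.content[0].text.strip().lower()
    if verdict == "yes":
        return {"verified": True, "actual_text": row["text"]}
    return {"verified": False,
            "suggested_rewrite": "I'm not certain on the specific section."}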

Result on the golden Q&A suite: 10/10 pass. True positives on 72-hour posting, 24-hour special notice, 10-day Bagley-Keene, 2/3 vote to close debate, and open-meetings questions. Zero false positives on adversarial probes (wrong-hours, wrong-threshold, unknown-section).

The rule is enforced as a tool contract, not as prompt instruction the model "should" follow — see CLAUDE.md rule #2.

Code: app/managed_agents/custom_tools.py::_verify_citation. System prompt rule: app/managed_agents/agent_spec.py.


Voice latency: getting under Twilio's ceiling

Twilio webhooks have a hard ~15 s response budget; live voice needs to feel snappy or callers hang up. Two production-blocking problems and the fixes:

  1. CMA event-stream overhead. Measured at ~6 s on top of the underlying model on identical prompts. Voice switched to the direct Messages API with sentence-level streaming: tokens are split on sentence and em-dash boundaries, and each clause is synthesized by ElevenLabs (eleven_flash_v2_5) and yielded as it's ready. The tool-use loop runs inline against the same custom-tool handlers CMA uses.
  2. Twilio's <Play> buffers the whole MP3. Live test caught Twilio waiting for the full streamed MP3 before starting playback — caller heard 6 s of silence even though our server was emitting bytes at 2.5 s. Fixed by chained TwiML: /gather plays a pre-synthesized filler and <Redirect>s to /continue/{turn_id}; the Claude turn runs in a background ThreadPool; /continue blocks up to 12 s on the Future and returns the reply MP3 + a fresh <Gather>. Twilio sees two short, complete MP3s and plays each immediately.
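
A hedged sketch of the chained-TwiML shape (route paths and helper names invented; the real handlers live in app/channels/voice.py):

from concurrent.futures import Future
from twilio.twiml.voice_response import VoiceResponse, Gather

def gather_twiml(turn_id: str) -> str:
    """Answer the webhook immediately: pre-synthesized filler, then redirect."""
    vr = VoiceResponse()
    vr.play("https://concierge.appboardbreeze.com/static/filler.mp3")  # assumed path
    vr.redirect(f"/continue/{turn_id}")
    return str(vr)

def continue_twiml(turn_future: Future, gather_action: str) -> str:
    """Block briefly on the background Claude turn, play it, and re-gather."""
    reply_mp3_url = turn_future.result(timeout=12)     # ThreadPool Future from /gather
    vr = VoiceResponse()
    vr.play(reply_mp3_url)                             # a short, complete MP3
    vr.append(Gather(input="speech", action=gather_action, method="POST"))
    return str(vr)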

Net: filler audio starts within ~500 ms of the caller finishing speaking; perceived first-audio of the real reply lands around ~2.5–3 s on governance questions (Phase 2.5 adds ~0.5 s for the per-turn mode classifier on top of the v5 baseline; the focused-prompt quality gain was worth it).

Graceful degradation: ElevenLabs synth is wrapped in a Polly fallback path — quota errors, rate limits, or outages emit <Say voice="Polly.Joanna"> TwiML rather than 500ing the call.
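
A sketch of that fallback, with an invented synthesize_mp3 helper standing in for the ElevenLabs path in app/channels/tts.py:

from twilio.twiml.voice_response import VoiceResponse

def speak(text: str, synthesize_mp3) -> str:
    """Prefer ElevenLabs audio; degrade to Twilio <Say> rather than 500ing."""
    try:
        url = synthesize_mp3(text)          # ElevenLabs eleven_flash_v2_5 path
        vr = VoiceResponse()
        vr.play(url)
    except Exception:                       # quota errors, rate limits, outages
        vr = VoiceResponse()
        vr.say(text, voice="Polly.Joanna")  # Twilio-native fallback voice
    return str(vr)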


Production deployment

Live in us-east-2 on AWS ECS Fargate, in the same cluster as the main appboardbreeze.com service.

  • Image: Python 3.11 slim, ~478 MB, pushed to Amazon ECR (boardbreeze-concierge repo).
  • Task: Fargate, 1 vCPU / 2 GB, port 8000, awsvpc networking, deployment circuit-breaker with auto-rollback.
  • Secrets: 11 API keys (Anthropic, Twilio×3, Supabase×2, Voyage, ElevenLabs×2, Grace contact) live in AWS Secrets Manager as one JSON secret; the task definition projects each key into its own env var via valueFrom. The execution role has read access scoped to that one secret ARN — least privilege.
  • Networking: dedicated Application Load Balancer (separate from the existing CDK-managed ALB so the prod stack stays untouched), TLS via an Amazon-issued ACM certificate for concierge.appboardbreeze.com, HTTP→HTTPS 301 redirect on :80. Two security groups: ALB SG (public 80/443), task SG (8000 only from ALB SG).
  • DNS: concierge.appboardbreeze.com → ALB DNS via a Vercel CNAME (the apex domain registers with Vercel).
  • Logs: CloudWatch /ecs/boardbreeze-concierge, 30-day retention.
  • Container health: Docker HEALTHCHECK curls /health every 30 s. ALB target group hits the same path on a 15 s interval.
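
A hedged boto3 sketch of the valueFrom projection and health check described above (account IDs and ARNs are placeholders, not the real values):

import boto3

ecs = boto3.client("ecs", region_name="us-east-2")
SECRET = "arn:aws:secretsmanager:us-east-2:111111111111:secret:concierge-XXXXXX"

ecs.register_task_definition(
    family="boardbreeze-concierge",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024", memory="2048",                        # 1 vCPU / 2 GB
    executionRoleArn="arn:aws:iam::111111111111:role/conciergeExecRole",
    containerDefinitions=[{
        "name": "concierge",
        "image": "111111111111.dkr.ecr.us-east-2.amazonaws.com/boardbreeze-concierge:latest",
        "portMappings": [{"containerPort": 8000}],
        # <secret ARN>:<json key>:: projects one key of the JSON secret into an env var
        "secrets": [
            {"name": "ANTHROPIC_API_KEY",  "valueFrom": f"{SECRET}:ANTHROPIC_API_KEY::"},
            {"name": "ELEVENLABS_API_KEY", "valueFrom": f"{SECRET}:ELEVENLABS_API_KEY::"},
        ],
        "healthCheck": {
            "command": ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"],
            "interval": 30,
        },
    }],
)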

The full step-by-step playbook with every CLI command pre-filled is in Deployment.md. The original plan was AWS App Runner; we pivoted to ECS Fargate mid-hackathon to match the rest of the BoardBreeze stack and avoid running two operational mental models. The pivot took one Saturday afternoon.


Cost & economics

The concierge runs as a paid product surface, so unit economics matter. This section covers what an inbound call actually costs, what runs 24/7 even when nobody's calling, and why we deliberately pay for the more expensive choice on telephony and compute.

Why we don't run SMS

Inbound SMS code lives in app/channels/sms.py and the CMA multi-agent topology supports it end-to-end, but production answers no SMS traffic. This is a cost decision as much as a product one:

  • SMS is per-segment billed. Twilio US is ~$0.0079 per inbound segment + ~$0.0079 per outbound segment, plus carrier fees. A typical back-and-forth conversation costs more than the 90-second voice call that would have answered the same question, and gives the caller a worse experience.
  • It would force 10DLC / A2P registration work for outbound messaging, which is real engineering and compliance overhead with no offsetting revenue while voice covers the same callers.
  • Email is sufficient for the only async path we actually need — outbound escalation to Grace via Resend. There's no inbound async surface that boards use enough to justify the bill.

The CMA topology stays as code (one shell command from rollout via scripts/upgrade_to_multiagent.py) for the day SMS becomes a real surface. Today, every dollar goes to the channel callers actually use.

Per-call cost (typical 90-second governance question)

Estimates based on public list pricing — verify against your own console bills before quoting.

Component                                         Cost / call   Notes
Twilio inbound voice (PSTN)                       ~$0.013       ~$0.0085/min × ~1.5 min
Twilio <Gather> speech recognition                ~$0.030       ~$0.02/min × ~1.5 min
Anthropic Opus 4.7 (main turn)                    ~$0.015       ~600 tokens in / ~80 out at public Opus 4 list rates
Anthropic Haiku 4.5 (mode classifier, ~2 turns)   <$0.001       ~200 in / 6 out per turn — negligible
Anthropic Haiku 4.5 (verify_citation)             <$0.001       fires only on answers with statutory citations
ElevenLabs eleven_flash_v2_5 synth                ~$0.005       ~250 chars at Creator-tier effective rate
Total per call                                    ~$0.06        at ~30 calls/day, ~$55/month variable
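
A back-of-envelope check of the table (the <$0.001 rows are rounded up to $0.001 here, which is why this lands a touch above the table's rounded ~$0.06 and ~$55 figures):

# Rough per-call economics from the table above; list-price estimates only.
per_call = {
    "twilio_pstn": 0.013,
    "gather_speech_rec": 0.030,
    "opus_main_turn": 0.015,
    "haiku_classifier": 0.001,   # table says <$0.001
    "haiku_verify": 0.001,       # table says <$0.001
    "elevenlabs_tts": 0.005,
}
total = sum(per_call.values())                     # ≈ $0.065 per call
monthly = total * 30 * 30                          # ~30 calls/day, 30 days
print(f"~${total:.3f}/call, ~${monthly:.0f}/month variable")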

Sub-10¢ per call is a load-bearing assumption: if per-call cost drifted into double digits, support cost would invert the ROI on a $499/month BoardBreeze plan.

Monthly fixed cost (always-on)

Component                                  Cost          Notes
AWS ECS Fargate (1 vCPU / 2 GB, 24/7)      ~$25          one task, always running
AWS Application Load Balancer              ~$22          TLS termination, ACM cert, health checks
AWS Secrets Manager (1 JSON secret)        ~$0.40        all 11 API keys in one rotated secret
AWS CloudWatch Logs (30-day retention)     ~$2           /ecs/boardbreeze-concierge
Twilio toll-free number rental             ~$2           +1 (844) 786-2076
ElevenLabs Creator subscription            ~$22          ~100K characters/month, comfortable for current volume
Total fixed                                ~$73/month    plus per-call variable above

All-in at current call volume: ~$130/month. Linear scaling on the variable side; the fixed footprint doesn't change until call volume forces a bigger Fargate task or a Pro-tier ElevenLabs plan.

Why we pay for Fargate (not a $5 VPS)

Fargate is 2–4× the cost of running the same container on a single EC2 t4g.small or a Hetzner VPS. We pay it on purpose:

  • VPC isolation, awsvpc networking. Each task gets its own ENI; the task security group accepts port 8000 only from the ALB's security group. No public IP on the task, no SSH, no shared host with other workloads.
  • Secrets injected at runtime via IAM, not baked into the image. The execution role has secretsmanager:GetSecretValue scoped to one secret ARN. No .env on disk, no secrets in CloudWatch, no secrets in the ECR image. If the image leaks, the keys don't.
  • Zero-downtime rolling deploys with auto-rollback. The ECS deployment circuit breaker reverts on health-check failure; bad deploys self-heal in ~2 min.
  • One operational mental model with the rest of BoardBreeze. The main appboardbreeze.com SaaS already runs ECS Fargate in the same VPC. Shared logging, shared IAM patterns, one console.

A $5/month VPS would technically serve a hackathon demo. It would not survive a security review by a public-agency procurement team — which is the actual customer.

Why we pay for Twilio (not a cheaper softphone API)

Same logic. Twilio is more expensive than Telnyx or building on raw SIP, but:

  • Real PSTN with carrier-of-record obligations. Calls to 1-844-786-2076 ride the same network as 911. SOC 2, HIPAA-eligible voice tier, STIR/SHAKEN attestation — required for any board secretary calling on a government-issued line.
  • Signed webhooks (X-Twilio-Signature) plus full audit trail on every call leg, available to procurement on request.
  • One vendor for voice today, SMS tomorrow if we need it, and PSTN dial-out — no provider swap when scope grows.

For a B2G product, "the cheaper option exists but procurement won't accept it" is a real and frequent failure mode. Twilio + Fargate are priced to be procurement-acceptable, which is what makes the rest of the unit economics meaningful.


The evolution (Keep Thinking)

  • v0 (Wed). Hand-rolled multi-agent supervisor in Python — six specialist files, keyword-routed handoffs. Reference loop kept in app/agents/_governance_reference_loop.py so the journey is legible.
  • v1 (Thu morning). Pivoted to one Claude Managed Agent + specialist modes after Michael Cohen's session. Five .py files deleted; the system prompt became the routing layer. First governance question routed to Governance Expert mode unprompted.
  • v2 (Thu midday). Real verify_citation (KB lookup + Haiku 4.5 claim classifier, 10/10 golden Q&A) and real escalate_to_grace (Twilio SMS to Grace) replaced the safe stubs. Anti-hallucination guardrail and escalation path are real, not aspirational.
  • v3 (Thu afternoon). ElevenLabs replaced Polly on voice for quality; tightened reply cap to ~30 words to fit the latency budget; added Polly fallback so voice degrades gracefully under ElevenLabs outages.
  • v4 (Thu evening). Measured CMA at ~6 s overhead and moved the voice path to the direct Messages API with sentence-level streaming. SMS stayed on CMA where cross-session continuity matters more than first-token latency.
  • v5 (Thu late evening). Live test exposed Twilio's <Play> buffer eating 6 s of audio. Restructured to chained TwiML (/gather filler → background turn → /continue reply). Filler at ~500 ms, real reply at ~2.5 s.
  • v6 (Sat). Closed the Product Expert mode's KB hole. Grace's internal BoardBreeze FAQ (28 sections) chunked into 61 product rows alongside the 20 governance rows in the same governance_kb table, tagged jurisdiction='product'. New search_product_kb tool (same Supabase RPC, jurisdiction-pinned) so Product Expert mode answers pricing / plan / feature questions from an authoritative source rather than from model recall. Without this, the agent dodged "what's your pricing" with a callback offer; with this, it cites the actual $29.99 / $99 / $499 tiers.
  • v7 (Sat night). Production deployment. Started on AWS App Runner; pivoted to ECS Fargate mid-day to match the existing appboardbreeze.com stack pattern and consolidate ops. In one afternoon: containerized the app (Dockerfile + .dockerignore), pushed the image to a new ECR repo, moved 11 env values into a single AWS Secrets Manager JSON entry, built two least-privilege IAM roles, registered the task definition with Secrets Manager valueFrom projections, created a dedicated ALB + target group + security groups (separate from the CDK-managed ALB so the prod stack stays untouched), got an Amazon-issued ACM cert for concierge.appboardbreeze.com (the first attempt failed CAA_ERROR because Vercel's default CAA on the apex didn't authorize Amazon — added 4 records, waited 5 min for AWS's internal CAA cache to clear, retried, issued in 19 s), pointed Vercel DNS at the ALB, flipped Twilio webhooks. First production call completed end-to-end (Twilio → ALB → Fargate task → Messages API → Supabase KB → ElevenLabs → caller's ear) in ~5 s. Live at https://concierge.appboardbreeze.com. Full playbook in Deployment.md.

Post-hackathon (the concierge becomes a real product)

  • v8 (May, Phase 1 — Advisor strategy). Wired the keynote's advisor pattern: consult_advisor as a custom tool that routes hard turns to Opus 4.7 from a Haiku-class executor. Voice executor now VOICE_EXECUTOR_MODEL-configurable. Bench (notes/advisor_bench_results.md) showed Haiku 4.5 with the advisor available cuts mean TTFT 45% vs Opus alone, with 0% advisor-invocation rate — Haiku felt confident enough alone that the advisor sits as a safety net. Held back from defaulting to Haiku because of an escalation regression on the gold set; revisit after Phase 2.5 settles in production.
  • v9 (May, Phase 2 — CMA multi-agent topology built). SDK 0.100.0 shipped first-class multi-agent (multiagent={"type":"coordinator","agents":[...]} kwarg). Built the full topology: 1 coordinator + 5 specialist sub-agents, each with own focused prompt and tool subset. Single source of truth for tool defs in tools_registry.py; shared rules (voice budget, escalation pair, no-SMS, ground rules) factored into agents/_shared.py so per-agent prompts don't drift on universal rules.
  • v10 (May, Phase 3 — voice ↔ CMA bench). Open question: with multi-agent now first-class, can voice ride CMA? Answer: no. Bench showed CMA cold-start TTFT at 10–14 s on Haiku and warm at ~9 s — Haiku WARM is slower than Opus WARM on CMA, because per-event harness scheduling dominates inference. Voice keeps the direct Messages API path. The CMA topology stays as code (one shell command from rollout) for the day SMS becomes a real surface.
  • v11 (May, Phase 2.5 — voice gets per-mode dispatch). Took the 5 focused mode prompts from Phase 2 and applied them voice-side: a Haiku classifier picks the mode each turn, voice runs against that focused prompt + tool subset. Same Phase 2 quality benefit, no CMA harness overhead. Adds ~0.5 s of classifier latency per turn (well within the voice budget). Escape hatch via VOICE_DISABLE_MODE_DISPATCH=1 in case the classifier misroutes on real calls.
  • v12 (May, voice-only product confirmed). Talked through whether SMS belongs on the roadmap. Conclusion: no. Email is sufficient for outbound paging (already the primary leg in _escalate_to_grace); inbound SMS isn't a realistic surface for board secretaries who already pick up the phone. Trimmed prompts to never promise to "text" the caller; updated CLAUDE.md and the upgrade plan to reflect voice-only direction.

The current production stack is the v11 voice-side dispatch on the v7 Fargate deployment.

See Progress.md for the day-by-day narrative, CHANGELOG.md for per-commit detail, and notes/concierge-upgrade-plan.md for the live upgrade plan with status per phase.


Repo layout

boardbreeze-concierge-voice/
├── app/
│   ├── main.py                   FastAPI entrypoint (auto-loads .env)
│   ├── config.py                 pydantic-settings env loader
│   ├── voice_pipeline.py         direct Messages API turn loop for voice —
│   │                             sentence streaming, inline tool dispatch,
│   │                             ThreadPool-backed queue_turn_async
│   ├── managed_agents/           CMA topology (built dormant) + shared spec
│   │   ├── agent_spec.py           legacy single-agent prompt (still used by
│   │   │                           voice's fallback path; the active voice
│   │   │                           runtime uses per-mode dispatch via agents/)
│   │   ├── tools_registry.py       single source of truth for custom tool defs
│   │   ├── agents/                 per-mode prompts (used by voice classifier
│   │   │   │                       AND the dormant CMA coordinator)
│   │   │   ├── coordinator.py        CMA front-door + multiagent_config helper
│   │   │   ├── governance_expert.py  Brown Act / Robert's Rules
│   │   │   ├── product_expert.py     features, plans, how-to
│   │   │   ├── tech_support.py       bug triage
│   │   │   ├── sales_closer.py       pricing, demos
│   │   │   ├── escalation_handler.py single job: page Grace
│   │   │   └── _shared.py            voice budget, escalation pair, no-SMS,
│   │   │                             ground rules — universal rules in one place
│   │   ├── client.py               ensure_agents/environment/session +
│   │   │                           handle_message
│   │   └── custom_tools.py         backend dispatch for KB search,
│   │                               verify_citation, consult_advisor,
│   │                               escalate_to_grace
│   ├── agents/                   v0 reference loop only — nothing imports it
│   ├── tools/governance_tools/   RAG + jurisdiction tools used by custom_tools
│   ├── channels/
│   │   ├── sms.py                  Twilio SMS webhook → CMA
│   │   ├── voice.py                Twilio Voice: chained TwiML, Polly fallback
│   │   └── tts.py                  ElevenLabs synth + static/dynamic caches
│   ├── db/                       Supabase schema + migrations (phone_sessions)
│   └── kb/                       governance_kb seed (statute chunks + the
│                                 BoardBreeze FAQ chunker — both go into one
│                                 table, distinguished by `jurisdiction`)
├── Dockerfile                    python:3.11-slim, port 8000, /health curl healthcheck
├── .dockerignore                 keeps .env, KB sources, demo assets out of image
├── Deployment.md                 12-phase ECS Fargate playbook, every CLI command pre-filled
├── .claude/skills/               /interview, /governance-verify, /status
├── notes/                        external-intel notes (Cohen talk, etc.)
├── tests/                        offline tests, no network/keys required
├── CLAUDE.md                     project rules for Claude Code sessions
├── CHANGELOG.md                  per-commit detail
├── Progress.md                   day-by-day "Keep Thinking" log
└── README.md                     you are here

Setup

# 1. Python 3.11 venv (the SDKs we use require ≥3.10; 3.11 is what we develop on)
python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Environment
cp .env.example .env
# fill in keys — Anthropic, Supabase (URL + service role), Voyage,
# Twilio (SID + auth + phone), ElevenLabs (API key + voice ID),
# and Grace's phone for escalations.

# 3. Supabase schema
# Run app/db/schema.sql then app/db/migrations/001_phone_sessions.sql
# in the Supabase SQL editor.

# 4. Seed the KB (governance statutes + BoardBreeze product FAQ).
# Place the FAQ markdown at the repo root as
# "BoardBreeze Comprehensive FAQ — AI Agent Knowledge Base.md"
# (gitignored — supply your own product KB if reproducing). The
# governance chunks ship inside seed_kb.py and need no extra files.
python -m app.kb.seed_kb

# 5. Run tests (offline, no keys needed)
python -m pytest tests/

# 6. Sanity-check that the CMA agent + environment exist
python -c "from dotenv import load_dotenv; load_dotenv('.env'); \
  from app.managed_agents.client import ensure_agent, ensure_environment; \
  print(ensure_agent(), ensure_environment())"

# 7. Start the server (local dev)
uvicorn app.main:app --reload --port 8000

# 8. Expose to Twilio (local dev only)
# In a second terminal:  ngrok http 8000
# Point the Twilio phone number's Voice webhook at  {ngrok}/twilio/voice/inbound
# (SMS webhook routes exist as `app/channels/sms.py` for completeness, but
#  SMS isn't a live product surface — production points only the voice URL.)

For production deployment (AWS ECS Fargate, the stack actually serving https://concierge.appboardbreeze.com), follow Deployment.md — 12 phases from Dockerfile to live HTTPS, with every CLI command pre-filled with the relevant account, cluster, VPC, and subnet IDs.


Built by

Grace Esteban — solo founder of BoardBreeze. Domain expert in California public-agency governance (Brown Act, Bagley-Keene, community college districts, special districts).

Paired with Claude Code (Opus 4.7) as development partner throughout the hackathon. See Co-Authored-By trailers in git log for per-commit attribution.

License

MIT — see LICENSE.
