feat(voice): backend voice picker + advanced controls behind disclosure (#742) by heavygee · Pull Request #743 · tiann/hapi

heavygee · 2026-05-30T22:49:10Z

Reviewers: do not use the default "Files changed" tab on this PR.
This branch is stacked on #692 (feat/pluggable-voice-backend). Against tiann/main GitHub will show the union of #692 + this PR.
Review only: heavygee/hapi@feat/pluggable-voice-backend...feat/voice-selection-all-backends
Do not merge until #692 lands, then rebase onto main so this PR shrinks to its own delta only.

Summary

Two voice features against #742, stacked on #692. The visible default surface stays small (one picker, one toggle); everything else lives behind a single collapsed disclosure so a user who just wants to pick a voice never sees the tuning UI.

Backend-aware voice picker

shared/voicePickerCatalog.ts - static Gemini/Qwen voice lists, per-backend localStorage keys, resolve helpers
Hub GET /api/voice/backend returns { backend, backends } when multiple providers are configured
Hub GET /api/voice/voices - ElevenLabs list available even when default backend is Gemini
Hub gemini-ws - ?voice= query param wired into setup message
Web Settings - voice backend chooser (when 2+ backends), voice list follows selection, Gemini/Qwen descriptions, preview-is-EL-only hint
Voice sessions - VoiceSessionConfig.voiceName + stored preference per backend
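
For readers who want a concrete picture of the catalog's resolve helpers, here is a minimal sketch. The key scheme, the function names, and the Gemini entries are assumptions for illustration (only Cherry is named in this thread); the real lists and helpers live in shared/voicePickerCatalog.ts.

```typescript
type StaticBackend = 'gemini-live' | 'qwen-realtime'

interface VoiceOption {
    id: string
    description: string
}

// Illustrative entries only, not the shipped catalog.
const VOICE_CATALOG_SKETCH: Record<StaticBackend, VoiceOption[]> = {
    'gemini-live': [
        { id: 'Puck', description: 'example entry' },
        { id: 'Kore', description: 'example entry' }
    ],
    'qwen-realtime': [{ id: 'Cherry', description: 'example entry' }]
}

// Hypothetical key scheme: one localStorage key per backend.
function voiceStorageKey(backend: StaticBackend): string {
    return `hapi-voice-name:${backend}`
}

// Anything with getItem works here, including window.localStorage.
interface ReadableStore {
    getItem(key: string): string | null
}

// Return the stored voice if the catalog still knows it, else the first entry.
function resolveVoiceName(backend: StaticBackend, store: ReadableStore): string {
    const options = VOICE_CATALOG_SKETCH[backend]
    const stored = store.getItem(voiceStorageKey(backend))
    if (stored !== null && options.some((option) => option.id === stored)) {
        return stored
    }
    return options[0].id
}
```

Validating the stored value against the catalog means a voice removed from a later catalog revision degrades to the default instead of being sent upstream.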

Composed system prompt + bootstrap-and-stream context (folded in from `feat/voice-advanced-controls`)

Layered prompt in shared/voicePromptLayers.ts: platform fixtures (read-only - tool contracts, routing, TTS rules) + provider guardrails + user-editable identity + user-editable character. composeVoiceAgentPrompt merges them.
Bootstrap + stream context: small initial conversation payload at handshake (~4 KB) plus streaming chunks via sendContextualUpdate after connect. Honest UI wire-budget hints.
All three backends: ElevenLabs ConvAI, Gemini Live, and Qwen Realtime each compose + bootstrap + stream.
ElevenLabs minimal-overrides discipline: empty prefs produce a minimal {agent:{language:'en'}} payload (byte-parity with upstream baseline). Custom layers/sliders/voiceId opt in their respective override fields. Fixes the unauthorized-override crash (Cannot read properties of undefined (reading 'error_type')).
Hub-side ConvAI override reconciliation: on every /voice/token resolution the hub PATCHes the agent's platform_settings.overrides to match buildVoiceAgentConfig() (agent.prompt + tts.voice_id/stability/similarity_boost/style/speed). Idempotent per-process and best-effort - existing agents that predate the override declaration now self-heal on next session start instead of requiring operator-side console edits.
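
The bootstrap-and-stream split described above can be sketched as a small planning function. This is an illustration under stated assumptions, not the code in this PR: the function name, the greedy in-order policy, and the 4096-byte chunk budget are hypothetical; only the roughly 4 KB handshake budget comes from the description.

```typescript
const encoder = new TextEncoder()

// UTF-8 size, since wire budgets are in bytes rather than characters.
function byteLength(text: string): number {
    return encoder.encode(text).length
}

interface ContextPlanSketch {
    bootstrap: string[]   // sent at handshake
    chunks: string[][]    // streamed after connect, one contextual update per chunk
}

function planContextSketch(
    entries: string[],
    bootstrapBytes = 4096,
    chunkBytes = 4096
): ContextPlanSketch {
    const bootstrap: string[] = []
    let used = 0
    let index = 0
    // Fill the handshake payload in order until the next entry would overflow it.
    while (index < entries.length && used + byteLength(entries[index]) <= bootstrapBytes) {
        used += byteLength(entries[index])
        bootstrap.push(entries[index])
        index++
    }

    const chunks: string[][] = []
    let current: string[] = []
    let currentBytes = 0
    for (; index < entries.length; index++) {
        const size = byteLength(entries[index])
        // Close the current chunk when this entry would push it over budget.
        if (current.length > 0 && currentBytes + size > chunkBytes) {
            chunks.push(current)
            current = []
            currentBytes = 0
        }
        // An entry larger than the budget still goes out, alone in its own chunk.
        current.push(entries[index])
        currentBytes += size
    }
    if (current.length > 0) chunks.push(current)
    return { bootstrap, chunks }
}
```

Each element of `chunks` would map to one sendContextualUpdate call after the session connects.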

UI discipline

New web/src/components/settings/VoiceAdvancedControls.tsx wraps fixtures preview, identity editor, character editor, delivery preset selector, and tuning sliders inside one master Advanced voice settings disclosure (collapsed by default).
Sub-sections (fixtures / identity / character / delivery / tuning) start collapsed when the master opens.
A customized badge appears next to the master title if any layer differs from defaults so power users still find their tweaks.
Backend picker + voice picker remain at the top, outside the disclosure.

Test plan

bunx tsc --noEmit (hub + web)
bun test voice routes (hub) - 20 pass, including the new "reconciles platform_settings.overrides on existing agents" test
bun test voice client tests (web) - voicePersonalitySession + voiceContextPlan green
hapi-driver-rebuild --build-web --verify then dogfood the three backends end-to-end on driver
Operator dogfood on driver soup (PR review gate)

Merge order

1. Merge #692, feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime (feat/pluggable-voice-backend)
2. Rebase this branch onto upstream/main (PR diff should drop to this PR's own delta), then merge #743, feat(voice): backend voice picker + advanced controls behind disclosure (#742)

Issues

Ref #742
Blocked by #692

heavygee · 2026-05-31T00:53:06Z

Stack note: This PR is blocked by #692. For review, prefer the incremental diff on the fork:

heavygee/hapi@feat/pluggable-voice-backend...feat/voice-selection-all-backends

(4 commits: catalog scaffold, backend chooser, Gemini/Qwen descriptions, Playwright dogfood script.)

github-actions

Findings

[Major] Qwen proxy forwards arbitrary client frames with the hub API key — POST /api/voice/qwen-token correctly keeps the DashScope key server-side, but the new /api/voice/qwen-ws proxy then opens DashScope with that key and blindly forwards every browser frame to upstream. Unlike the Gemini path, which rejects client-provided setup frames, any authenticated web client can connect directly and send its own session.update/response.create payloads, turning the hub into a generic DashScope proxy and allowing client-controlled instructions/tools under server credentials. Evidence: hub/src/web/server.ts:159.
Suggested fix:

const allowedQwenRuntimeEvents = new Set([
    'input_audio_buffer.append',
    'input_audio_buffer.commit',
    'response.create',
    'conversation.item.create'
])

function parseQwenClientEvent(message: string | ArrayBuffer | Uint8Array): { type?: string } | null {
    try {
        return JSON.parse(decodeWsText(message)) as { type?: string }
    } catch {
        return null
    }
}

message(clientWs: ServerWebSocket<unknown>, message: string | ArrayBuffer | Uint8Array) {
    const event = parseQwenClientEvent(message)
    if (!event?.type || !allowedQwenRuntimeEvents.has(event.type)) {
        try { clientWs.close(1008, 'Client-provided Qwen setup is not allowed') } catch { /* */ }
        return
    }

    const upstream = upstreamMap.get(clientWs)
    if (upstream?.readyState === WebSocket.OPEN) {
        upstream.send(message)
    }
}

Move the initial session.update construction into the hub proxy, as Gemini does with buildGeminiLiveSetupMessage, and only let the browser send runtime audio/tool-response events afterward.

Questions

None.

Summary

Review mode: initial
One issue found: Qwen realtime proxy needs the same server-owned setup boundary as Gemini before this is safe to merge.

Testing

Not run (automation; static review only, per PR security instructions).

HAPI Bot

heavygee · 2026-05-31T01:06:56Z

Noise check (re: default diff vs main)

| Compare base | Commits | Files |
| --- | --- | --- |
| `tiann/main` (GitHub default) | 29 | 35 |
| `feat/pluggable-voice-backend` (#692 tip) | 4 | 23 |

GitHub would not let us retarget base to feat/pluggable-voice-backend on tiann/hapi (branch only exists on heavygee/hapi). Until #692 merges, use the compare link in the PR description for review.

Automated review note: The github-actions MAJOR on Qwen proxy (hub/src/web/server.ts) is from the #692 stack in this branch, not from the 4 #742 commits — please route that feedback to #692 if still open.

Rebased from Overbaker/hapi#401 onto current main. Adds a pluggable voice backend architecture that extends the existing ElevenLabs integration:

- Gemini 2.5 Live (gemini-live): Google real-time audio via WebSocket with full function calling (messageCodingAgent, processPermissionRequest)
- Qwen Realtime (qwen-realtime): Alibaba DashScope via hub WebSocket proxy (browser cannot set Authorization header directly)
- VoiceBackendSession: dynamic backend selector with React.lazy loading, gates voice button until backend module is registered
- Hub WS proxies: JWT-authenticated /api/voice/gemini-ws and /api/voice/qwen-ws endpoints in Bun.serve, with message queueing during upstream connect to prevent dropped setup frames
- AudioWorklet pipeline: inline Blob URL recorder, 24 kHz PCM player, serial tool call execution, AudioContext created in user gesture for mobile
- Backend discovery: GET /voice/backend + POST /voice/gemini-token / POST /voice/qwen-token hub routes; frontend auto-detects active backend

Merge notes:

- Rebased 135 upstream commits cleanly; HappyComposer keeps upstream's configurable enter-behavior setting (supersedes hard-coded Ctrl+Enter)
- Converted gemini test files from bun:test to vitest (web package uses vitest)
- All 221 hub tests and 636 web tests pass; TypeScript clean

turnComplete handler was unconditionally calling setMuted(false), which re-enabled the mic track even when the user had manually muted. Now restores to state.micMuted instead.

buildGeminiLiveConfig was appending VOICE_CHINESE_LANGUAGE_BLOCK which forced Gemini to always respond in Mandarin regardless of user locale. Gemini now uses the neutral base prompt and responds in the language the user speaks to it, consistent with the ElevenLabs behaviour.

If the session closes while Gemini is mid-speech, cleanup() left state.modelSpeaking=true. The next startSession() would then drop all mic audio in sendAudioChunk() until a model turn eventually flipped the flag — effectively deaf until page reload.

ws.onclose operated on module-level state.ws, not the socket that fired the event. A rapid stop/restart could cause the old socket's onclose to call cleanup() after the new socket was assigned, tearing down the live session. Guard with `if (state.ws !== ws) return` before cleanup. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
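
A minimal sketch of that stale-socket guard, using a stand-in socket type so it runs outside the browser. The `SocketLike` shape, the `cleanups` counter, and the function names are hypothetical; only the `state.ws !== ws` comparison is taken from the commit message.

```typescript
// Stand-in for the parts of WebSocket this sketch needs.
interface SocketLike {
    onclose: (() => void) | null
}

interface SessionStateSketch {
    ws: SocketLike | null
    cleanups: number
}

const state: SessionStateSketch = { ws: null, cleanups: 0 }

function cleanupSession(): void {
    state.ws = null
    state.cleanups++
}

function attachSocket(ws: SocketLike): void {
    state.ws = ws
    ws.onclose = () => {
        // The closure captures the socket that registered the handler, so a
        // late close from a replaced socket is ignored instead of tearing
        // down the session that superseded it.
        if (state.ws !== ws) return
        cleanupSession()
    }
}
```

The key point is comparing against the captured `ws` rather than reading module-level state inside the handler.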

Matches the Gemini fix — both backends now use VOICE_SYSTEM_PROMPT without the Chinese language block, giving consistent English-default behaviour across all non-ElevenLabs backends. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Adds a "Proactive voice" toggle (default: off = reactive) to the Voice Assistant settings section. Reactive (default): initial context and agent-ready events are fed silently; the assistant waits for the user to speak first. Proactive: original behaviour — Gemini/Qwen narrate context on connect and speak unprompted when the agent finishes a task. ElevenLabs is also affected via onReady sending a user message rather than a silent update. Covers all three backends uniformly. localStorage key: hapi-voice-proactive. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

… visibility

- hub/server.ts: add toClientCloseCode() to normalize reserved upstream close codes (1005/1006/1015) to 1011 before forwarding to browser; abnormal upstream drops (1006) would otherwise throw on clientWs.close() and leave the browser socket open
- realtime/index.ts: remove static GeminiLiveVoiceSession and QwenVoiceSession barrel exports; VoiceBackendSession lazy-imports both, so barrel re-exports created static dependencies that defeated the intended code-split
- App.tsx: gate global useVisibilityReporter on !sessionEventSubscription so the always-on SSE connection does not suppress native Web Push notifications for sessions the user is not currently viewing

via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>

…toggle label

- buildGeminiLiveConfig() now accepts optional language param; appends VOICE_CHINESE_LANGUAGE_BLOCK only when language === 'zh'
- GeminiLiveVoiceSession passes config.language through
- QwenVoiceSession conditionally builds basePrompt from language setting
- Fixes silent no-op when user selects Chinese in voice settings on Gemini/Qwen backends (was ElevenLabs-only)
- Rename voice-start toggle label to 'Start voice session with summary'
- Fix description: clarifies the choice is about session-open behaviour (summary vs greeting), not ongoing narration

via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>

Gemini Live has no built-in first-message like ElevenLabs agents do; without an explicit turnComplete:true it sits silently. In reactive mode (default, toggle off) now sends a greeting instruction after any silent context feed so Gemini introduces itself and invites the user to speak. Proactive mode is unchanged: the context summary is the opening speech. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

…reeting - VOICE_SYSTEM_PROMPT: explicit instruction never to call itself Gemini, Google, or any underlying model/provider name — always HAPI - Greeting trigger text: instruct to greet as HAPI only, suppress model name and any reference to context/recent activity in the opening line via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Gemini + Qwen client:
- onerror now sets setupDone/sessionReady and nulls state.ws before calling reject(), so the stale-close guard trips in onclose and prevents a duplicate statusCallback('error') on WS failure

Gemini client:
- Proactive mode with no initialContext now falls through to the greeting trigger instead of sitting silently
- Remove unused handleBargeIn callback (dead code)

Qwen client:
- Add input_audio_sample_rate: 16000 to session.update so PCM rate is declared explicitly rather than relying on DashScope's default

Hub proxy:
- Remove no-op ternary in Gemini flush loop and message handler (typeof x === 'string' ? x : x); use upstream.send(msg) directly
- Qwen onerror now calls upstreamMap.delete() before closing client, eliminating the stale map entry window
- Align Qwen hub fallback model string with QWEN_REALTIME_MODEL constant ('qwen3-omni-flash-realtime')

via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>

hub/voice.ts:
- Replace string-concat WS URL construction with buildVoiceWsUrl() which uses URL API to set protocol/pathname cleanly; fixes double-slash when HAPI_PUBLIC_URL has a trailing slash (would silently skip the proxy route)

QwenVoiceSession.tsx:
- Wrap tool definitions in {type:'function', function:{...}} as required by Qwen-Omni realtime schema; previous flat shape caused session.update rejection before audio capture could start
- Use pcm16/pcm24 audio formats matching DashScope spec; remove input_audio_sample_rate (encoded in format name)

via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
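
To illustrate why the URL API avoids the double slash, here is a hedged sketch. The function name, its signature, and the example host are assumptions; the real buildVoiceWsUrl() in hub/voice.ts may differ, for example in how it treats a public URL that carries a path prefix.

```typescript
// Sketch only: maps an http(s) public URL to the matching ws(s) proxy URL.
function buildVoiceWsUrlSketch(publicUrl: string, path: string): string {
    const url = new URL(publicUrl)
    // http -> ws, https -> wss
    url.protocol = url.protocol === 'https:' ? 'wss:' : 'ws:'
    // Assigning pathname replaces the whole path, so a trailing slash on the
    // base URL cannot combine with the leading slash of the route.
    url.pathname = path
    url.search = ''
    return url.toString()
}

// Naive concatenation for comparison: 'https://hapi.example.com/' + '/api/voice/gemini-ws'
// yields 'https://hapi.example.com//api/voice/gemini-ws', which no longer matches the route.
```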

…ose codes

GeminiLiveVoiceSession + QwenVoiceSession:
- startAudioCapture() is now async and awaits recorder.start() before calling setMuted(); previously setMuted ran before getUserMedia resolved so a session restarted while muted would open the mic anyway
- statusCallback('connected') now fires after audio is ready
- setMuted() called unconditionally (not just when true) to correctly apply saved state in either direction

hub/src/web/server.ts:
- Both Gemini and Qwen close() handlers now pass the client code through toClientCloseCode() before forwarding to upstream; prevents reserved codes (e.g. 1006) from causing WebSocket.close() to throw and leave the upstream session open until provider timeout
- Reason string capped at 123 bytes (WebSocket protocol limit)

via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
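
The close-code handling can be sketched as two small helpers. The 1005/1006/1015 to 1011 mapping and the 123-byte cap come from the commit messages; the function names and the truncation strategy are assumptions. The 123 figure follows from the 125-byte control-frame payload limit minus the 2-byte status code.

```typescript
// Codes that may be observed locally but must never be sent in a Close frame.
const RESERVED_CLOSE_CODES = new Set([1005, 1006, 1015])

// Map reserved codes to 1011 (internal error); pass everything else through.
function toClientCloseCodeSketch(code: number): number {
    return RESERVED_CLOSE_CODES.has(code) ? 1011 : code
}

// Trim a close reason to at most maxBytes of UTF-8.
function capCloseReason(reason: string, maxBytes = 123): string {
    const encoder = new TextEncoder()
    if (encoder.encode(reason).length <= maxBytes) return reason
    // Drop whole code points from the end so the result never ends in a
    // partial multi-byte sequence.
    const codePoints = Array.from(reason)
    while (codePoints.length > 0 && encoder.encode(codePoints.join('')).length > maxBytes) {
        codePoints.pop()
    }
    return codePoints.join('')
}
```

A handler would then call something like `ws.close(toClientCloseCodeSketch(code), capCloseReason(reason))` on the opposite side of the proxy.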

An unhandled rejection inside the async onmessage callback does not propagate to the outer startSession Promise — the UI hangs on 'connecting' and the provider socket stays partially open. Wrapping the await in try/catch calls cleanup()/statusCallback('error')/reject() so failures surface correctly. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

…alling back to ElevenLabs fetchVoiceBackend no longer catches errors and defaults to 'elevenlabs' — any network or server failure now throws so VoiceBackendSession can surface it via onStatusChange('error', ...) rather than silently mounting the wrong backend. VoiceBackendSession also resets backend state to null when api changes, so a stale ElevenLabs registration from a prior discovery cannot persist into a new session. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

…alling back to ElevenLabs Unknown backend strings (future values, typos) now throw rather than defaulting to elevenlabs, closing the narrow remaining form of the original misrouting bug. Also removes the unnecessary `as VoiceBackendResponse` cast. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
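
A sketch of what strict backend parsing might look like. The backend identifiers appear elsewhere in this thread; the function name and the error text are assumptions, not the code in fetchVoiceBackend.

```typescript
const KNOWN_VOICE_BACKENDS = ['elevenlabs', 'gemini-live', 'qwen-realtime'] as const
type VoiceBackendId = typeof KNOWN_VOICE_BACKENDS[number]

// Throw on anything unrecognized instead of silently defaulting to elevenlabs.
function parseVoiceBackendSketch(value: unknown): VoiceBackendId {
    const match = KNOWN_VOICE_BACKENDS.find((backend) => backend === value)
    if (match === undefined) {
        throw new Error(`Unknown voice backend: ${String(value)}`)
    }
    return match
}
```

Returning the matched literal also removes the need for a type cast on the response.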

…r base64 uploads Qwen session.updated handler now sends the same proactive summary or greeting trigger that Gemini does — previously it started silently in both proactive and reactive modes. maxHttpBufferSize raised to 68 MiB to account for base64 expansion: 50 MiB decoded files become ~66.7 MiB as base64 JSON, so the previous 55 MiB ceiling would disconnect uploads above ~41 MiB before they reached the CLI. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
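
The buffer-size figures above follow from base64's 4-for-3 expansion. A small worked check, ignoring the few extra bytes of JSON envelope (the helper names are illustrative):

```typescript
const MiB = 1024 * 1024

// Base64 emits 4 output characters for every 3 input bytes, padded up.
function base64Length(rawBytes: number): number {
    return 4 * Math.ceil(rawBytes / 3)
}

// Largest raw payload whose base64 form fits within a byte budget.
function maxRawBytesFor(base64Budget: number): number {
    return Math.floor(base64Budget / 4) * 3
}

const encoded50 = base64Length(50 * MiB)       // about 66.7 MiB
const rawUnder55 = maxRawBytesFor(55 * MiB)    // 41.25 MiB
const fitsIn68 = encoded50 <= 68 * MiB         // true
```

So a 55 MiB ceiling admits only about 41 MiB of decoded file, while 68 MiB leaves a little headroom over the 66.7 MiB a 50 MiB file needs.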

….update for Qwen text Qwen's realtime API only supports conversation.item.create for function_call_output. Sending it with type:'message' for greetings/context was invalid and could fail before the user spoke. sendTextMessage and sendContextualUpdate now update session instructions via session.update (accumulating context into the system prompt) and trigger response.create only when a spoken reply is needed — matching Qwen's supported client event surface. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

…n start session.updated now returns early after the first ack — subsequent session.update calls (instruction appends) also echo session.updated but must not re-trigger audio capture or the greeting path. currentSessionConfig is now reset to null at the top of startSession so a stale config from a failed previous session cannot leak into the new one. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Without this guard, a missing wsUrl in the hub token response would silently attempt to connect directly to Google with "proxied" as the API key — producing a confusing auth failure instead of a clear error. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

DashScope realtime API accepts only 'pcm' for both input and output audio formats. The pcm16/pcm24 values caused session.update rejection before audio capture could start, leaving the Qwen backend unusable. Also updates the default voice from Mia (not in the qwen3-omni-flash-realtime voice list) to Cherry, which is documented as supported. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Failed token fetch, microphone denial, or WebSocket error during setup left state.playbackContext open. Each failure path now calls cleanup() before throwing/rejecting, preventing AudioContext leaks on mobile browsers with hard limits on concurrent contexts. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Reverts changes to files that shouldn't differ from upstream:
- .gitignore: remove fork-only AGENTS.local.md entry
- web/src/App.tsx: restore dual-subscription SSE pattern (scope-aware)
- web/src/hooks/useSSE.ts: restore SSEScope/scope parameter
- web/src/hooks/useSSE.test.ts: restore (was accidentally deleted)
- web/src/lib/appSseSubscriptions.ts: restore (was accidentally deleted)
- web/src/lib/appSseSubscriptions.test.ts: restore (was accidentally deleted)
- hub/src/sync/syncEngine.ts: restore (off-topic change)

Hub sends HAPI-owned Gemini setup on proxy connect and rejects client setup frames. Qwen proxy always uses QWEN_REALTIME_MODEL instead of a client query parameter. Shared buildGeminiLiveSetupMessage() keeps wire format in one place. Co-authored-by: Cursor <cursoragent@cursor.com>

Mirror the Gemini proxy security model for Qwen:
- Hub sends initial session.update (voice/tools/instructions) on upstream connect so the browser cannot override config fields.
- Proxy message() now calls isQwenSafeClientFrame() and closes the connection (1008) if a client session.update touches any field other than 'instructions' (blocks tool/voice/modality overrides).
- QwenVoiceSession no longer sends session.update on session.created; it waits for the hub-relayed session.updated and then sends only instruction-only updates for context/proactive content.
- Language passed as query param (?language=zh) so hub builds the correct Chinese system prompt without a client-supplied session.update.
- buildQwenSessionUpdateMessage() and isQwenSafeClientFrame() added to @hapi/protocol/voice; 9 new unit tests cover filter edge cases.
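
A rough sketch of the kind of filter this describes. It is not the isQwenSafeClientFrame() shipped in @hapi/protocol/voice: the name, the handling of non-JSON frames, and the decision to pass every other event type through are assumptions made for illustration.

```typescript
// Returns true when a client frame may be forwarded to the upstream socket.
function isSafeQwenFrameSketch(raw: string): boolean {
    let event: unknown
    try {
        event = JSON.parse(raw)
    } catch {
        return false // assumption: non-JSON frames are rejected
    }
    if (typeof event !== 'object' || event === null || Array.isArray(event)) return false

    const { type, session } = event as { type?: unknown; session?: unknown }
    if (typeof type !== 'string') return false

    // Assumption: runtime events other than session.update are allowed as-is.
    if (type !== 'session.update') return true

    // A client session.update may carry instructions and nothing else, so it
    // cannot replace the hub-owned tools, voice, or modalities.
    if (typeof session !== 'object' || session === null || Array.isArray(session)) return false
    return Object.keys(session).every((key) => key === 'instructions')
}
```

On a `false` result the proxy would close the client socket with code 1008, as the commit describes.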

…ring

DashScope requires session.update to be sent AFTER session.created is received, not immediately on WebSocket open. Previously the hub sent session.update in upstream.onopen, which violated this ordering and risked the config being processed in an uninitialized session context.

Add pendingSetupMap to buffer the hub-owned session.update payload. The onmessage handler now relays session.created to the browser first, then immediately sends the pending session.update to DashScope, matching the protocol ordering the old browser-side code used (which waited for session.created before sending session.update).

Also remove maxHttpBufferSize from the socket.io Engine config. That setting is unrelated to voice backends; upstream/main had no such limit set and it is not introduced by this PR.

heavygee · 2026-05-31T17:33:08Z

Status update on the open Major (Qwen WS proxy)

Coordinated with the #692 agent — they're already implementing the fix in their worktree:

buildQwenSessionUpdateMessage(language) in shared/src/voice.ts (parallel to buildGeminiLiveSetupMessage)
isQwenSafeClientFrame(message) — allowlist; only session.update frames with only instructions pass through
Hub createQwenProxyWebSocketHandler sends hub-owned setup on upstream open; rejects with WS close 1008 'Client session.update may only modify instructions'
QwenVoiceSession.tsx updated to drop the now-redundant client-side initial setup

This is correctly #692 scope — the Qwen proxy message() forwarding-everything behaviour was introduced in #692; my #742 delta to hub/src/web/server.ts is 5 lines, all on the Gemini path (threading voiceName through buildGeminiLiveSetupMessage). No Qwen changes in this branch.

Plan: Stand by until #692 pushes the fix, then rebase feat/voice-selection-all-backends on the new tip. Their hardening flows through to this PR automatically (we stack on it). The bot's Major thread should then be resolvable on #692's commit, not on a duplicate fix here.

CI green; no other open Major findings. Holding for the upstream layer to land.

…-completions) Qwen Realtime session.update expects tools as flat objects: { type: 'function', name, description, parameters } The previous code used the chat-completions shape: { type: 'function', function: { name, description, parameters } } DashScope may reject session.update or silently ignore tools with the nested shape, causing tool calls to fail at runtime. Fix applied in buildQwenSessionUpdateMessage(); test updated to assert flat shape and that no nested `function` key is present.
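
The difference between the two tool shapes, as a small sketch. The type and function names are hypothetical, and the parameters schema is a placeholder; the tool name messageCodingAgent is the one mentioned earlier in this thread.

```typescript
// Chat-completions style: details nested under a `function` key.
interface NestedToolSketch {
    type: 'function'
    function: { name: string; description: string; parameters: Record<string, unknown> }
}

// Realtime style described above: same fields, flattened to the top level.
interface FlatToolSketch {
    type: 'function'
    name: string
    description: string
    parameters: Record<string, unknown>
}

function toFlatToolSketch(tool: NestedToolSketch): FlatToolSketch {
    return { type: 'function', ...tool.function }
}

const nestedExample: NestedToolSketch = {
    type: 'function',
    function: {
        name: 'messageCodingAgent',
        description: 'placeholder description',
        parameters: { type: 'object', properties: {} }
    }
}

const flatExample = toFlatToolSketch(nestedExample)
```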

Static voice lists and per-backend localStorage keys. Settings UI wiring follows in this branch after tiann#692 lands. Refs tiann#742 Co-authored-by: Cursor <cursoragent@cursor.com>

Expose configured backends from hub, let Settings pick provider when multiple API keys exist, and wire voice selection through Gemini/Qwen/ElevenLabs paths. Co-authored-by: Cursor <cursoragent@cursor.com>

Surface catalog descriptions on the voice row and in the picker, with a hint when preview is ElevenLabs-only. Disabled preview buttons stay visible. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

When hub/.env (operator-local) sets GEMINI_API_KEY, the 'falls back to elevenlabs for unknown VOICE_BACKEND values' test leaks gemini-live into the backends list. Delete the four non-elevenlabs key env vars defensively at the start of the test, mirroring the cleanup pattern used by the other tests in the same describe block. No behavior change. Co-authored-by: Cursor <cursoragent@cursor.com>

Stacks on voice-selection-all-backends. Adds shared voice-personality presets, Settings accordions (character + backend-specific tuning), and ElevenLabs session overrides from stored preferences. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace append-only personality notes with the bundled HAPI voice system prompt in an advanced editor. User edits replace the base instruction for ElevenLabs, Gemini (incl. hub proxy), and Qwen. Presets only drive TTS sliders unless the user appends delivery text explicitly. Co-authored-by: Cursor <cursoragent@cursor.com>

Split voice instructions into platform fixtures, provider guardrails, and editable identity/character; compose at runtime for ElevenLabs, Gemini, and Qwen. Session history uses a small connect bootstrap plus deferred contextual chunks, gated by the proactive-summary setting. Settings UI shows wire budgets and read-only fixtures. Co-authored-by: Cursor <cursoragent@cursor.com>
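
A minimal sketch of runtime composition, assuming the four layers are simply concatenated in a fixed order with blank layers skipped. The interface, the function name, and the ordering are assumptions; the actual composeVoiceAgentPrompt in shared/voicePromptLayers.ts may merge them differently.

```typescript
interface VoicePromptLayersSketch {
    fixtures: string     // read-only platform rules (tool contracts, routing, TTS rules)
    guardrails: string   // per-provider guardrails
    identity: string     // user-editable
    character: string    // user-editable
}

// Join non-empty layers, platform text first so user edits cannot displace it.
function composeLayersSketch(layers: VoicePromptLayersSketch): string {
    return [layers.fixtures, layers.guardrails, layers.identity, layers.character]
        .map((layer) => layer.trim())
        .filter((layer) => layer.length > 0)
        .join('\n\n')
}
```

Keeping fixtures and guardrails out of the editable fields is what lets the Settings UI show them read-only while still shipping them in every composed prompt.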

The web workspace runs tests via vitest run. Both voice tests imported describe/expect/test from 'bun:test', which vite cannot bundle and caused 'Cannot bundle built-in module bun:test' transform failures during driver soup verify. Only generic test APIs are used, so switching the import to 'vitest' does not change behavior. Co-authored-by: Cursor <cursoragent@cursor.com>

heavygee · 2026-05-31T23:12:48Z

Heads-up from a downstream soup-rebuild (heavygee/hapi driver/integration, 2026-05-31):

To get this branch + the voice-advanced layer building cleanly on the current feat/pluggable-voice-backend tip, two pre-existing test bugs needed fixing. Both are local-only commits on the rebased branches and will be lost on the next force-push unless preserved.

1. On feat/voice-selection-all-backends (this PR)

238ad4c test(voice): isolate VOICE_BACKEND fallback test from leaked env vars

hub/src/web/routes/voice.test.ts | 4 ++++

The "falls back to elevenlabs for unknown VOICE_BACKEND values" test never deleted GEMINI_API_KEY / QWEN_API_KEY / OPENAI_API_KEY / QWEN_REALTIME_API_KEY before asserting. When hub/.env has any of those set, gemini-live (or another backend) leaks into the resolved backends list and the test fails. The fix adds four delete process.env.* lines at the top of the test, mirroring the cleanup pattern already used by the other tests in the same describe block. No behavior change.

2. On feat/voice-advanced-controls (sibling branch, stacks on this one)

f3ab0f3 test(web): import from vitest instead of bun:test

web/src/lib/voicePersonalitySession.test.ts        | 2 +-
web/src/realtime/hooks/voiceContextPlan.test.ts    | 2 +-

Both files imported describe/expect/test from 'bun:test'. The web workspace runs vitest run, which cannot bundle the built-in bun:test module and fails the transform during bun run build. Only generic test APIs are used, so the import swap does not change behavior.

Surfacing in case you re-do the rebase from scratch and overwrite these locally - they would silently come back as test failures on the next CI/build. No action needed if you keep them.

github-actions

Findings

[Major] Qwen proxy still forwards runtime frames before setup is acknowledged — the browser waits for session.updated before starting audio, but an authenticated client can connect directly and send response.create, conversation.item.create, or instruction-only session.update while pendingSetupMap is still set. Those frames are forwarded immediately at hub/src/web/server.ts:188, before the hub-owned session.update sent at hub/src/web/server.ts:159 is acknowledged. That can run the upstream session under the provider default config or race ahead of HAPI's locked tools/voice/instructions, reopening part of the generic-proxy boundary the previous review called out.
Suggested fix:

const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

// after session.created
pendingSetupMap.delete(clientWs)
setupAckMap.set(clientWs, false)
upstream.send(pendingSetup)

// after session.updated for the hub-owned setup
setupAckMap.set(clientWs, true)
for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
pendingClientFrames.delete(clientWs)

message(clientWs, message) {
    if (!isQwenSafeClientFrame(message)) { /* close */ return }
    const upstream = upstreamMap.get(clientWs)
    if (upstream?.readyState !== WebSocket.OPEN) return
    if (pendingSetupMap.has(clientWs) || setupAckMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
}

Questions

None.

Summary

Review mode: follow-up after new commits
The previous Qwen proxy issue is partly addressed, but the proxy still needs to gate client frames until the hub-owned Qwen setup has been acknowledged.

Testing

Not run (automation; static review only, per PR security instructions).

HAPI Bot

The previous Qwen proxy hardening sent the hub-owned session.update once upstream's session.created landed, but never waited for session.updated before forwarding client frames. An authenticated client could push response.create / conversation.item.create / instruction-only session.update during that window and run the upstream session under the provider default config or partially-applied state - reopening part of the generic-proxy boundary the prior review called out.

Changes:
- Extract createQwenProxyWebSocketHandler into hub/src/web/qwenProxyHandler.ts so the gating behaviour can be unit-tested without spinning up Bun.serve.
- Track a per-client setupAcked WeakMap and a pendingClientFrames queue. Client frames are queued (after passing isQwenSafeClientFrame) until the upstream emits session.updated; on ack the queue is flushed in order.
- Clear the WeakMaps on upstream error / upstream close / client close so no state leaks across reused references.
- Inject the upstream WebSocket constructor for tests; production keeps the real global WebSocket.

Tests: 4 new ack-gate cases + existing 245 hub tests still green.

Resolves the [Major] thread on PR tiann#743 follow-up review.

Co-authored-by: Cursor <cursoragent@cursor.com>
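
The ack gate itself can be sketched independently of the sockets. This is an illustration, not the handler in hub/src/web/qwenProxyHandler.ts: it uses one gate object per connection rather than WeakMaps keyed by client, and the class and method names are hypothetical.

```typescript
// Holds client frames until the upstream has acknowledged the hub-owned setup.
class SetupAckGateSketch<Frame> {
    private acked = false
    private readonly queue: Frame[] = []
    private readonly forward: (frame: Frame) => void

    constructor(forward: (frame: Frame) => void) {
        this.forward = forward
    }

    // Call for every client frame that already passed the safety filter.
    submit(frame: Frame): void {
        if (this.acked) {
            this.forward(frame)
            return
        }
        this.queue.push(frame)
    }

    // Call once when the upstream emits its setup acknowledgement
    // (session.updated for Qwen). Later calls are no-ops.
    acknowledge(): void {
        if (this.acked) return
        this.acked = true
        // splice(0) empties the queue and returns the frames in arrival order.
        for (const frame of this.queue.splice(0)) {
            this.forward(frame)
        }
    }
}
```

Because `acknowledge()` is idempotent, the echo of a later instruction-only session.update does not replay or reorder anything.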

github-actions

Findings

[Major] Gate Gemini client frames until setupComplete — the proxy sends HAPI's setup on upstream open, but it then flushes queued non-setup frames immediately and forwards later frames whenever the upstream socket is open. A direct authenticated client can therefore send clientContent/tool frames before Google acknowledges the hub-owned setup with setupComplete, racing execution ahead of HAPI's locked model, tools, system prompt, language, and voice config. Evidence: hub/src/web/server.ts:77 and hub/src/web/server.ts:107.
Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
}

upstream.onmessage = (event) => {
    const text = typeof event.data === 'string'
        ? event.data
        : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
    if (setupCompleteMap.get(clientWs) === false && JSON.parse(text).setupComplete) {
        setupCompleteMap.set(clientWs, true)
        for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
        pendingClientFrames.delete(clientWs)
    }
    if (clientWs.readyState === 1) clientWs.send(event.data)
}

message(clientWs, message) {
    if (isGeminiSetupFrame(message)) {
        clientWs.close(1008, 'Client-provided Gemini setup is not allowed')
        return
    }
    const upstream = upstreamMap.get(clientWs)
    if (!upstream || upstream.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
}

Questions

None.

Summary

Review mode: follow-up after new commits
The previous Qwen ack-gate issue appears addressed in the follow-up, but the newly added Gemini proxy still has the same setup-ack race at its proxy boundary.

Testing

Not run (automation; static review only, per PR security instructions).

HAPI Bot

github-actions · 2026-06-01T02:41:54Z

+            upstream.onopen = () => {
+                // Hub-owned setup only — never forward client setup (prevents generic Gemini proxy abuse).
+                upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
+                for (const queued of pending.splice(0)) {


[Major] Gate Gemini client frames until setupComplete

This flushes queued non-setup frames immediately after sending HAPI's setup, before Google has acknowledged it with setupComplete; once the upstream socket is open, message() also forwards later frames directly. A direct authenticated client can send clientContent or tool frames before HAPI's hub-owned model, tools, system prompt, language, and voice setup is installed, so the session can race ahead under default or partially-applied config.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
}

// On the first upstream setupComplete, flip the gate and flush queued client frames.
if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
    setupCompleteMap.set(clientWs, true)
    for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
    pendingClientFrames.delete(clientWs)
}

message(clientWs, message) {
    if (isGeminiSetupFrame(message)) {
        clientWs.close(1008, 'Client-provided Gemini setup is not allowed')
        return
    }
    const upstream = upstreamMap.get(clientWs)
    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
}

ElevenLabs convai rejects sessions whose `overrides` payload references any property the agent has not explicitly authorized. The server reply is a malformed error packet on the LiveKit data channel that the convai-react SDK then deref's as `event.error_type` (undefined -> TypeError -> disconnect), so the symptom is a generic "Unknown DataChannel error" with no clue.

Empty-prefs sessions now emit exactly `{ agent: { language } }` for byte-parity with the upstream/main baseline that the configured agent permits today:

- agent.prompt: only when user has customized identity or character
- tts (full slider object): only when preset diverges from balanced defaults
- tts: { voice_id }: only when a voice was picked and sliders are at defaults

Co-authored-by: Cursor <cursoragent@cursor.com>
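
The emission rules in this commit message can be sketched as a pure function. Everything named here is hypothetical (`VoicePrefs`, `buildOverridesSketch`, the `BALANCED` values, and the `prompt: { prompt }` shape are assumptions for illustration; the PR's real builder is `buildVoiceAgentConfig()`, whose code is not shown in this thread). The sketch also assumes the full slider object carries `voice_id` when a voice was picked.

```typescript
// Hypothetical types and defaults; the PR's actual "balanced" values are not shown here.
interface TtsSliders {
    stability: number
    similarity_boost: number
    style: number
    speed: number
}

const BALANCED: TtsSliders = { stability: 0.5, similarity_boost: 0.75, style: 0, speed: 1 }

interface VoicePrefs {
    language: string
    customPrompt?: string // set only when identity or character was edited
    voiceId?: string
    sliders?: TtsSliders
}

interface OverridesSketch {
    agent: { language: string; prompt?: { prompt: string } }
    tts?: Partial<TtsSliders> & { voice_id?: string }
}

function slidersDiverge(s: TtsSliders): boolean {
    return (Object.keys(BALANCED) as Array<keyof TtsSliders>).some((k) => s[k] !== BALANCED[k])
}

function buildOverridesSketch(prefs: VoicePrefs): OverridesSketch {
    // Baseline: only the language, so an agent with no extra authorizations accepts it.
    const out: OverridesSketch = { agent: { language: prefs.language } }
    if (prefs.customPrompt) out.agent.prompt = { prompt: prefs.customPrompt }
    if (prefs.sliders && slidersDiverge(prefs.sliders)) {
        // Tuned sliders: send the full object (plus the voice, if one was picked).
        out.tts = { ...prefs.sliders, ...(prefs.voiceId ? { voice_id: prefs.voiceId } : {}) }
    } else if (prefs.voiceId) {
        // Default sliders: send the voice alone.
        out.tts = { voice_id: prefs.voiceId }
    }
    return out
}
```

Leaving absent keys out, rather than setting them to `undefined` or to default values, is what keeps the empty-prefs payload identical to the baseline.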

…onvAI overrides

UI: wrap fixtures, identity, character, delivery preset, and tuning sliders in a single collapsed "Advanced voice settings" disclosure. Defaults stay quiet; sub-sections start collapsed when the master opens. Shows a "customized" badge if any layer differs from defaults so a user who tweaked settings still knows where to find them.

Hub: on every /voice/token resolution, PATCH the resolved ConvAI agent's platform_settings.overrides to match buildVoiceAgentConfig() (agent.prompt, tts.voice_id/stability/similarity_boost/style/speed). Idempotent per-process (cached per agent_id), best-effort (non-fatal on PATCH error).

Fixes the "Cannot read properties of undefined (reading 'error_type')" crash on operators who have an existing agent that predates the override declaration.

Co-authored-by: Cursor <cursoragent@cursor.com>
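
The hub-side reconciliation described above ("cached per agent_id", "non-fatal on PATCH error") might look roughly like the following. This is a sketch under stated assumptions: `reconcileAgentOverrides`, `PatchAgentFn`, and `ReconcileResult` are hypothetical names, the HTTP call is injected rather than shown, and the choice to leave a failed agent uncached so the next token request retries is a guess, not something the commit message states.

```typescript
// Hypothetical signature: the real hub code performs the PATCH itself.
type PatchAgentFn = (agentId: string, overrides: unknown) => Promise<void>
type ReconcileResult = 'patched' | 'cached' | 'failed'

// Per-process cache: agents that were already reconciled since startup.
const reconciledAgents = new Set<string>()

async function reconcileAgentOverrides(
    agentId: string,
    overrides: unknown,
    patchAgent: PatchAgentFn
): Promise<ReconcileResult> {
    if (reconciledAgents.has(agentId)) return 'cached'
    try {
        await patchAgent(agentId, overrides)
        reconciledAgents.add(agentId)
        return 'patched'
    } catch {
        // Best-effort: token resolution should continue even if the PATCH fails.
        // Assumption: failures are not cached, so a later call tries again.
        return 'failed'
    }
}
```

Returning a status instead of throwing keeps the caller's token path unconditional: it can log the result and hand out the token either way.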

Brings layered voice prompt UI, context bootstrap + streaming, and ConvAI agent override reconciliation into the existing voice picker PR. All new surfaces live behind a single collapsed "Advanced voice settings" disclosure; defaults stay quiet for users who only want to pick a voice.

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	hub/src/web/server.ts

github-actions

Findings

[Major] Gate Gemini client frames until setupComplete — the proxy sends HAPI's setup, then immediately flushes queued client frames and forwards later frames as soon as the upstream socket is open. A direct authenticated client can race clientContent or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's model, tools, prompt, language, and voice configuration are installed. Evidence: hub/src/web/server.ts:88 and hub/src/web/server.ts:118.
Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
}

upstream.onmessage = (event) => {
    const text = typeof event.data === 'string'
        ? event.data
        : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
    const parsed = JSON.parse(text) as { setupComplete?: unknown }
    if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
        setupCompleteMap.set(clientWs, true)
        for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
        pendingClientFrames.delete(clientWs)
    }
    if (clientWs.readyState === 1) clientWs.send(event.data)
}

if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}

[Major] Pass decoded systemInstruction into Qwen setup — /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client sends that param for advanced voice settings, but createQwenProxyWebSocketHandler() casts the data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName). Qwen sessions therefore ignore the user-edited identity/character/delivery prompt while Gemini uses it. Evidence: hub/src/web/qwenProxyHandler.ts:53 and hub/src/web/qwenProxyHandler.ts:61.
Suggested fix:
```
const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))
```
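
To make the effect of the missing argument concrete, here is a minimal stand-in for the builder. It is not the PR's `buildQwenSessionUpdateMessage`: the function name, the `defaultInstructions` parameter, and the `session.instructions` / `session.voice` field names are assumptions chosen for illustration. Only the `session.update` message type comes from this review.

```typescript
// Hypothetical shape of the per-connection data, mirroring the cast suggested above.
interface QwenProxyData {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

// Illustrative builder: prefers the decoded custom prompt, falls back to a default.
function buildSessionUpdateSketch(data: QwenProxyData, defaultInstructions: string) {
    const custom = data.systemInstruction?.trim()
    return {
        type: 'session.update' as const,
        session: {
            instructions: custom ? custom : defaultInstructions,
            // Only include a voice when one was selected.
            ...(data.voiceName ? { voice: data.voiceName } : {}),
        },
    }
}
```

If the handler never passes `systemInstruction`, a builder like this always takes the fallback branch, which matches the reported symptom: the user-edited prompt is silently ignored.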

Questions

None.

Summary
Review mode: follow-up after new commits
The previous Gemini setup-ack race remains present, and the new Qwen prompt transport path drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.

Testing

Not run (automation; static review only).

HAPI Bot

github-actions · 2026-06-01T07:16:16Z

+                    data.voiceName,
+                    data.systemInstruction
+                )))
+                for (const queued of pending.splice(0)) {


[Major] Gate Gemini client frames until setupComplete

This flushes queued client frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's model, tools, prompt, language, and voice setup is installed.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
}

if (setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}

// Flip the gate and flush only after the first upstream setupComplete frame.

github-actions · 2026-06-01T07:16:16Z

+            })
+
+            upstreamMap.set(clientWs, upstream)
+            pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))


[Major] Pass the decoded prompt into Qwen's hub-owned setup

The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.

Suggested fix:

const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))

heavygee marked this pull request as ready for review May 31, 2026 00:53

github-actions Bot reviewed May 31, 2026

Comment thread hub/src/web/server.ts Outdated

heavygee mentioned this pull request May 31, 2026

feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime #692

Open

9 tasks

heavygee and others added 25 commits May 31, 2026 15:39

fix(voice): restore user mic mute state after Gemini turn completes

58ebf8e

turnComplete handler was unconditionally calling setMuted(false), which re-enabled the mic track even when the user had manually muted. Now restores to state.micMuted instead.
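
A minimal sketch of the corrected handler, assuming a state object with a `micMuted` flag and a `setMuted` callback as named in the commit message. The `MicState` interface and the `onTurnComplete` function name are hypothetical.

```typescript
// Hypothetical shape: only micMuted is taken from the commit message.
interface MicState {
    micMuted: boolean // the user's own mute choice
}

function onTurnComplete(state: MicState, setMuted: (muted: boolean) => void): void {
    // Before the fix this was setMuted(false), which overrode a manual mute.
    setMuted(state.micMuted)
}
```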

heavygee added 2 commits May 31, 2026 16:34

heavygee and others added 13 commits May 31, 2026 19:39

feat(voice): scaffold picker catalogs for Gemini and Qwen (tiann#742)

97935c6

Static voice lists and per-backend localStorage keys. Settings UI wiring follows in this branch after tiann#692 lands. Refs tiann#742 Co-authored-by: Cursor <cursoragent@cursor.com>

feat(voice): backend chooser and per-provider voice settings (tiann#742)

6036734

Expose configured backends from hub, let Settings pick provider when multiple API keys exist, and wire voice selection through Gemini/Qwen/ElevenLabs paths. Co-authored-by: Cursor <cursoragent@cursor.com>

fix(web): show Gemini/Qwen voice descriptions in Settings

abe1ef5

Surface catalog descriptions on the voice row and in the picker, with a hint when preview is ElevenLabs-only. Disabled preview buttons stay visible. Co-authored-by: Cursor <cursoragent@cursor.com>

chore(dev): harden voice settings Playwright dogfood wait strategy

d24d993

Co-authored-by: Cursor <cursoragent@cursor.com>

fix(web): VoiceAdvancedControls checkbox handler typo

c841ae3

fix(web): ElevenLabs overrides keep client tools, skip default prompt

daa838e

fix(web): cap ElevenLabs WebRTC payload (65KB message limit)

cbfbd5b

heavygee force-pushed the feat/voice-selection-all-backends branch from 7469b27 to 238ad4c Compare June 1, 2026 02:19

github-actions Bot reviewed Jun 1, 2026

Comment thread hub/src/web/server.ts Outdated

github-actions Bot reviewed Jun 1, 2026

heavygee and others added 3 commits June 1, 2026 07:31

heavygee changed the title ~~feat(voice): backend-aware voice picker for Gemini and Qwen (#742)~~ feat(voice): backend voice picker + advanced controls behind disclosure (#742) Jun 1, 2026

github-actions Bot reviewed Jun 1, 2026

feat(voice): backend voice picker + advanced controls behind disclosure (#742)#743
heavygee wants to merge 44 commits into tiann:main from heavygee:feat/voice-selection-all-backends

1 participant

heavygee commented May 30, 2026 (edited)

Composed system prompt + bootstrap-and-stream context (folded in from `feat/voice-advanced-controls`)