feat(voice): backend voice picker + advanced controls behind disclosure (#742) #743

Open
heavygee wants to merge 44 commits into tiann:main from heavygee:feat/voice-selection-all-backends

@heavygee heavygee commented May 30, 2026

Reviewers: do not use the default "Files changed" tab on this PR.
This branch is stacked on #692 (feat/pluggable-voice-backend). Against tiann/main GitHub will show the union of #692 + this PR.
Review only: heavygee/hapi@feat/pluggable-voice-backend...feat/voice-selection-all-backends
Do not merge until #692 lands, then rebase onto main so this PR shrinks to its own delta only.

Summary

Two voice features against #742, stacked on #692. The visible default surface stays small (one picker, one toggle); everything else lives behind a single collapsed disclosure so a user who just wants to pick a voice never sees the tuning UI.

Backend-aware voice picker

  • shared/voicePickerCatalog.ts - static Gemini/Qwen voice lists, per-backend localStorage keys, resolve helpers
  • Hub GET /api/voice/backend returns { backend, backends } when multiple providers are configured
  • Hub GET /api/voice/voices - ElevenLabs list available even when default backend is Gemini
  • Hub gemini-ws - ?voice= query param wired into setup message
  • Web Settings - voice backend chooser (when 2+ backends), voice list follows selection, Gemini/Qwen descriptions, preview-is-EL-only hint
  • Voice sessions - VoiceSessionConfig.voiceName + stored preference per backend
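
A minimal sketch of how a per-backend stored preference could be resolved against a static catalog. The key format, the helper names, and the voice subsets are assumptions for illustration, not the contents of shared/voicePickerCatalog.ts; only the backend ids and the Cherry default come from this PR.

```typescript
// Hypothetical sketch: names and key format are illustrative only.
type VoiceBackend = 'elevenlabs' | 'gemini-live' | 'qwen-realtime'
type StaticBackend = Exclude<VoiceBackend, 'elevenlabs'>

// Minimal storage surface so the helper works with localStorage or a test double.
interface KeyValueStore {
    getItem(key: string): string | null
    setItem(key: string, value: string): void
}

// Illustrative subsets; the real catalog is assumed to be longer.
const STATIC_VOICES: Record<StaticBackend, readonly string[]> = {
    'gemini-live': ['Puck', 'Kore'],
    'qwen-realtime': ['Cherry']
}

// One storage key per backend, so switching backends never clobbers a choice.
function voiceStorageKey(backend: VoiceBackend): string {
    return `hapi-voice-name:${backend}`
}

function resolveVoiceName(backend: VoiceBackend, store: KeyValueStore): string | undefined {
    const stored = store.getItem(voiceStorageKey(backend))
    if (backend === 'elevenlabs') {
        // ElevenLabs voices come from the hub at runtime, so nothing to validate against here.
        return stored ?? undefined
    }
    const catalog = STATIC_VOICES[backend]
    // Fall back to the first catalog entry when the stored value is missing or stale.
    return stored !== null && catalog.includes(stored) ? stored : catalog[0]
}
```

Falling back to the first catalog entry keeps a stale stored value (for example a voice that a later catalog drops) from reaching the provider.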

Composed system prompt + bootstrap-and-stream context (folded in from feat/voice-advanced-controls)

  • Layered prompt in shared/voicePromptLayers.ts: platform fixtures (read-only - tool contracts, routing, TTS rules) + provider guardrails + user-editable identity + user-editable character. composeVoiceAgentPrompt merges them.
  • Bootstrap + stream context: small initial conversation payload at handshake (~4 KB) plus streaming chunks via sendContextualUpdate after connect. The Settings UI shows the resulting wire budgets.
  • All three backends: ElevenLabs ConvAI, Gemini Live, and Qwen Realtime each compose + bootstrap + stream.
  • ElevenLabs minimal-overrides discipline: empty prefs produce a minimal {agent:{language:'en'}} payload (byte-parity with upstream baseline). Custom layers/sliders/voiceId opt in to their respective override fields. Fixes the unauthorized-override crash (Cannot read properties of undefined ('error_type')).
  • Hub-side ConvAI override reconciliation: on every /voice/token resolution the hub PATCHes the agent's platform_settings.overrides to match buildVoiceAgentConfig() (agent.prompt + tts.voice_id/stability/similarity_boost/style/speed). Idempotent per-process and best-effort - existing agents that predate the override declaration now self-heal on next session start instead of requiring operator-side console edits.
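
The minimal-overrides rule above can be sketched as a builder that only emits a field when the user actually set it. The VoicePrefs shape and the function name are assumptions for illustration; the override field names follow the ones listed in this description, and the real buildVoiceAgentConfig() may differ.

```typescript
// Hypothetical preference shape; the real stored prefs may carry more fields.
interface VoicePrefs {
    prompt?: string
    voiceId?: string
    stability?: number
    speed?: number
}

interface TtsOverrides {
    voice_id?: string
    stability?: number
    speed?: number
}

interface SessionOverrides {
    agent: { language: string; prompt?: { prompt: string } }
    tts?: TtsOverrides
}

function buildSessionOverrides(prefs: VoicePrefs): SessionOverrides {
    // Baseline payload: nothing but the language, matching the upstream default.
    const overrides: SessionOverrides = { agent: { language: 'en' } }

    // A prompt override is only sent when the user wrote a non-empty one.
    if (prefs.prompt !== undefined && prefs.prompt.trim() !== '') {
        overrides.agent.prompt = { prompt: prefs.prompt }
    }

    // Each TTS field opts in individually.
    const tts: TtsOverrides = {}
    if (prefs.voiceId !== undefined) tts.voice_id = prefs.voiceId
    if (prefs.stability !== undefined) tts.stability = prefs.stability
    if (prefs.speed !== undefined) tts.speed = prefs.speed

    // Omit the tts key entirely when no TTS field was customized.
    if (Object.keys(tts).length > 0) overrides.tts = tts
    return overrides
}
```

Leaving untouched fields out of the payload, instead of sending defaults, is what keeps an agent that has not declared those overrides from rejecting the session.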

UI discipline

  • New web/src/components/settings/VoiceAdvancedControls.tsx wraps fixtures preview, identity editor, character editor, delivery preset selector, and tuning sliders inside one master Advanced voice settings disclosure (collapsed by default).
  • Sub-sections (fixtures / identity / character / delivery / tuning) start collapsed when the master opens.
  • A customized badge appears next to the master title if any layer differs from defaults so power users still find their tweaks.
  • Backend picker + voice picker remain at the top, outside the disclosure.
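
A rough sketch of the check behind the customized badge, assuming every editable layer is stored as a string and compared against a defaults object. The VoiceLayers shape and the default values are hypothetical.

```typescript
// Hypothetical layer shape; the real component may track more fields.
interface VoiceLayers {
    identity: string
    character: string
    deliveryPreset: string
}

// Assumed defaults, for illustration only.
const DEFAULT_LAYERS: VoiceLayers = { identity: '', character: '', deliveryPreset: 'default' }

// True when any layer differs from its default, ignoring surrounding whitespace.
function isCustomized(current: VoiceLayers, defaults: VoiceLayers = DEFAULT_LAYERS): boolean {
    const keys = Object.keys(defaults) as Array<keyof VoiceLayers>
    return keys.some((key) => current[key].trim() !== defaults[key].trim())
}
```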

Test plan

  • bunx tsc --noEmit (hub + web)
  • bun test voice routes (hub) - 20 pass including new reconciles platform_settings.overrides on existing agents test
  • bun test voice client tests (web) - voicePersonalitySession + voiceContextPlan green
  • hapi-driver-rebuild --build-web --verify then dogfood the three backends end-to-end on driver
  • Operator dogfood on driver soup (PR review gate)

Merge order

  1. Merge feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime #692 (feat/pluggable-voice-backend)
  2. Rebase this branch onto upstream/main (PR diff should drop to this PR's own delta), merge feat(voice): backend voice picker + advanced controls behind disclosure (#742) #743

Issues

Ref #742
Blocked by #692

@heavygee heavygee marked this pull request as ready for review May 31, 2026 00:53
@heavygee
Contributor Author

Stack note: This PR is blocked by #692. For review, prefer the incremental diff on the fork:

heavygee/hapi@feat/pluggable-voice-backend...feat/voice-selection-all-backends

(4 commits: catalog scaffold, backend chooser, Gemini/Qwen descriptions, Playwright dogfood script.)

@github-actions github-actions Bot left a comment

Findings

  • [Major] Qwen proxy forwards arbitrary client frames with the hub API key — POST /api/voice/qwen-token correctly keeps the DashScope key server-side, but the new /api/voice/qwen-ws proxy then opens DashScope with that key and blindly forwards every browser frame to upstream. Unlike the Gemini path, which rejects client-provided setup frames, any authenticated web client can connect directly and send its own session.update/response.create payloads, turning the hub into a generic DashScope proxy and allowing client-controlled instructions/tools under server credentials. Evidence: hub/src/web/server.ts:159.
    Suggested fix:
    const allowedQwenRuntimeEvents = new Set([
        'input_audio_buffer.append',
        'input_audio_buffer.commit',
        'response.create',
        'conversation.item.create'
    ])
    
    function parseQwenClientEvent(message: string | ArrayBuffer | Uint8Array): { type?: string } | null {
        try {
            return JSON.parse(decodeWsText(message)) as { type?: string }
        } catch {
            return null
        }
    }
    
    message(clientWs: ServerWebSocket<unknown>, message: string | ArrayBuffer | Uint8Array) {
        const event = parseQwenClientEvent(message)
        if (!event?.type || !allowedQwenRuntimeEvents.has(event.type)) {
            try { clientWs.close(1008, 'Client-provided Qwen setup is not allowed') } catch { /* */ }
            return
        }
    
        const upstream = upstreamMap.get(clientWs)
        if (upstream?.readyState === WebSocket.OPEN) {
            upstream.send(message)
        }
    }
    Move the initial session.update construction into the hub proxy, as Gemini does with buildGeminiLiveSetupMessage, and only let the browser send runtime audio/tool-response events afterward.

Questions

  • None.

Summary

  • Review mode: initial
  • One issue found: Qwen realtime proxy needs the same server-owned setup boundary as Gemini before this is safe to merge.

Testing

  • Not run (automation; static review only, per PR security instructions).

HAPI Bot

Comment thread hub/src/web/server.ts Outdated
@heavygee
Contributor Author

Noise check (re: default diff vs main)

| Compare base | Commits | Files |
| --- | --- | --- |
| tiann/main (GitHub default) | 29 | 35 |
| feat/pluggable-voice-backend (#692 tip) | 4 | 23 |

GitHub would not let us retarget base to feat/pluggable-voice-backend on tiann/hapi (branch only exists on heavygee/hapi). Until #692 merges, use the compare link in the PR description for review.

Automated review note: The github-actions MAJOR on Qwen proxy (hub/src/web/server.ts) is from the #692 stack in this branch, not from the 4 #742 commits — please route that feedback to #692 if still open.

heavygee and others added 25 commits May 31, 2026 15:39
Rebased from Overbaker/hapi#401 onto current main. Adds a pluggable voice
backend architecture that extends the existing ElevenLabs integration:

- Gemini 2.5 Live (gemini-live): Google real-time audio via WebSocket
  with full function calling (messageCodingAgent, processPermissionRequest)
- Qwen Realtime (qwen-realtime): Alibaba DashScope via hub WebSocket
  proxy (browser cannot set Authorization header directly)
- VoiceBackendSession: dynamic backend selector with React.lazy loading,
  gates voice button until backend module is registered
- Hub WS proxies: JWT-authenticated /api/voice/gemini-ws and
  /api/voice/qwen-ws endpoints in Bun.serve, with message queueing during
  upstream connect to prevent dropped setup frames
- AudioWorklet pipeline: inline Blob URL recorder, 24 kHz PCM player,
  serial tool call execution, AudioContext created in user gesture for mobile
- Backend discovery: GET /voice/backend + POST /voice/gemini-token /
  POST /voice/qwen-token hub routes; frontend auto-detects active backend

Merge notes:
- Rebased 135 upstream commits cleanly; HappyComposer keeps upstream's
  configurable enter-behavior setting (supersedes hard-coded Ctrl+Enter)
- Converted gemini test files from bun:test to vitest (web package uses vitest)
- All 221 hub tests and 636 web tests pass; TypeScript clean

turnComplete handler was unconditionally calling setMuted(false), which
re-enabled the mic track even when the user had manually muted. Now
restores to state.micMuted instead.

buildGeminiLiveConfig was appending VOICE_CHINESE_LANGUAGE_BLOCK which
forced Gemini to always respond in Mandarin regardless of user locale.
Gemini now uses the neutral base prompt and responds in the language the
user speaks to it, consistent with the ElevenLabs behaviour.

If the session closes while Gemini is mid-speech, cleanup() left
state.modelSpeaking=true. The next startSession() would then drop all
mic audio in sendAudioChunk() until a model turn eventually flipped
the flag — effectively deaf until page reload.

ws.onclose operated on module-level state.ws, not the socket that fired
the event. A rapid stop/restart could cause the old socket's onclose to
call cleanup() after the new socket was assigned, tearing down the live
session. Guard with `if (state.ws !== ws) return` before cleanup.
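
The stale-close guard can be illustrated with a reduced model that uses plain objects in place of real WebSockets. SocketLike, attachSocket, and the cleanup counter are stand-ins invented for this sketch; only the identity check itself comes from the commit.

```typescript
// Stand-in for a WebSocket: only the close callback matters here.
interface SocketLike {
    onclose: (() => void) | null
}

// Module-level state, mirroring the pattern the commit describes.
const state: { ws: SocketLike | null; cleanups: number } = { ws: null, cleanups: 0 }

function cleanup(): void {
    state.ws = null
    state.cleanups += 1
}

function attachSocket(ws: SocketLike): void {
    state.ws = ws
    ws.onclose = () => {
        // Ignore close events from a socket that is no longer the active one.
        if (state.ws !== ws) return
        cleanup()
    }
}
```

Because the handler closes over its own ws, a late close from an earlier socket compares unequal to state.ws and returns before touching the live session.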

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Matches the Gemini fix — both backends now use VOICE_SYSTEM_PROMPT
without the Chinese language block, giving consistent English-default
behaviour across all non-ElevenLabs backends.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Adds a "Proactive voice" toggle (default: off = reactive) to the Voice
Assistant settings section.

Reactive (default): initial context and agent-ready events are fed
silently; the assistant waits for the user to speak first.

Proactive: original behaviour — Gemini/Qwen narrate context on connect
and speak unprompted when the agent finishes a task. ElevenLabs is also
affected via onReady sending a user message rather than a silent update.

Covers all three backends uniformly. localStorage key: hapi-voice-proactive.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
… visibility

- hub/server.ts: add toClientCloseCode() to normalize reserved upstream
  close codes (1005/1006/1015) to 1011 before forwarding to browser;
  abnormal upstream drops (1006) would otherwise throw on clientWs.close()
  and leave the browser socket open

- realtime/index.ts: remove static GeminiLiveVoiceSession and QwenVoiceSession
  barrel exports; VoiceBackendSession lazy-imports both, so barrel re-exports
  created static dependencies that defeated the intended code-split

- App.tsx: gate global useVisibilityReporter on !sessionEventSubscription so
  the always-on SSE connection does not suppress native Web Push notifications
  for sessions the user is not currently viewing
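
For the close-code normalization in the first item above, a minimal sketch might look like the following. It covers only the three reserved codes named in this commit; whether the real toClientCloseCode() handles other out-of-range values is not stated here.

```typescript
// Codes that may be observed on a close event but must never be sent in a
// Close frame (RFC 6455 section 7.4.1).
const RESERVED_CLOSE_CODES = new Set<number>([1005, 1006, 1015])

// Map reserved codes to 1011 (internal error) so close() does not throw;
// every other code is passed through unchanged.
function toClientCloseCode(upstreamCode: number): number {
    return RESERVED_CLOSE_CODES.has(upstreamCode) ? 1011 : upstreamCode
}
```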

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…toggle label

- buildGeminiLiveConfig() now accepts optional language param; appends
  VOICE_CHINESE_LANGUAGE_BLOCK only when language === 'zh'
- GeminiLiveVoiceSession passes config.language through
- QwenVoiceSession conditionally builds basePrompt from language setting
- Fixes silent no-op when user selects Chinese in voice settings on
  Gemini/Qwen backends (was ElevenLabs-only)

- Rename voice-start toggle label to 'Start voice session with summary'
- Fix description: clarifies the choice is about session-open behaviour
  (summary vs greeting), not ongoing narration

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Gemini Live has no built-in first-message like ElevenLabs agents do;
without an explicit turnComplete:true it sits silently. In reactive mode
(default, toggle off) now sends a greeting instruction after any silent
context feed so Gemini introduces itself and invites the user to speak.

Proactive mode is unchanged: the context summary is the opening speech.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…reeting

- VOICE_SYSTEM_PROMPT: explicit instruction never to call itself Gemini,
  Google, or any underlying model/provider name — always HAPI
- Greeting trigger text: instruct to greet as HAPI only, suppress model
  name and any reference to context/recent activity in the opening line

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Gemini + Qwen client:
- onerror now sets setupDone/sessionReady and nulls state.ws before
  calling reject(), so the stale-close guard trips in onclose and
  prevents a duplicate statusCallback('error') on WS failure

Gemini client:
- Proactive mode with no initialContext now falls through to the
  greeting trigger instead of sitting silently
- Remove unused handleBargeIn callback (dead code)

Qwen client:
- Add input_audio_sample_rate: 16000 to session.update so PCM rate
  is declared explicitly rather than relying on DashScope's default

Hub proxy:
- Remove no-op ternary in Gemini flush loop and message handler
  (typeof x === 'string' ? x : x); use upstream.send(msg) directly
- Qwen onerror now calls upstreamMap.delete() before closing client,
  eliminating the stale map entry window
- Align Qwen hub fallback model string with QWEN_REALTIME_MODEL
  constant ('qwen3-omni-flash-realtime')

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
hub/voice.ts:
- Replace string-concat WS URL construction with buildVoiceWsUrl() which
  uses URL API to set protocol/pathname cleanly — fixes double-slash when
  HAPI_PUBLIC_URL has a trailing slash (would silently skip the proxy route)
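
A sketch of the URL-API approach, assuming the helper takes the public base URL and an absolute route path. The signature and the example hostnames are assumptions, and this version replaces the whole pathname, so it would drop any path prefix on the public URL.

```typescript
// Build a WebSocket URL from an HTTP(S) base without string concatenation.
function buildVoiceWsUrl(publicUrl: string, routePath: string): string {
    const url = new URL(publicUrl)
    // https -> wss, anything else (http) -> ws.
    url.protocol = url.protocol === 'https:' ? 'wss:' : 'ws:'
    // Assigning pathname replaces the old path, so a trailing slash on the
    // base can no longer produce a double slash.
    url.pathname = routePath
    url.search = ''
    url.hash = ''
    return url.toString()
}
```

With this shape, `https://hapi.example.com/` and `https://hapi.example.com` resolve to the same proxy URL, which is the case the string-concat version got wrong.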

QwenVoiceSession.tsx:
- Wrap tool definitions in {type:'function', function:{...}} as required
  by Qwen-Omni realtime schema — previous flat shape caused session.update
  rejection before audio capture could start
- Use pcm16/pcm24 audio formats matching DashScope spec; remove
  input_audio_sample_rate (encoded in format name)

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…ose codes

GeminiLiveVoiceSession + QwenVoiceSession:
- startAudioCapture() is now async and awaits recorder.start() before
  calling setMuted() — previously setMuted ran before getUserMedia resolved
  so a session restarted while muted would open the mic anyway
- statusCallback('connected') now fires after audio is ready
- setMuted() called unconditionally (not just when true) to correctly
  apply saved state in either direction

hub/src/web/server.ts:
- Both Gemini and Qwen close() handlers now pass the client code through
  toClientCloseCode() before forwarding to upstream — prevents reserved
  codes (e.g. 1006) from causing WebSocket.close() to throw and leave
  the upstream session open until provider timeout
- Reason string capped at 123 bytes (WebSocket protocol limit)

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
An unhandled rejection inside the async onmessage callback does not
propagate to the outer startSession Promise — the UI hangs on
'connecting' and the provider socket stays partially open. Wrapping
the await in try/catch calls cleanup()/statusCallback('error')/reject()
so failures surface correctly.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…alling back to ElevenLabs

fetchVoiceBackend no longer catches errors and defaults to 'elevenlabs' — any
network or server failure now throws so VoiceBackendSession can surface it via
onStatusChange('error', ...) rather than silently mounting the wrong backend.

VoiceBackendSession also resets backend state to null when api changes, so
a stale ElevenLabs registration from a prior discovery cannot persist into
a new session.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…alling back to ElevenLabs

Unknown backend strings (future values, typos) now throw rather than defaulting
to elevenlabs, closing the narrow remaining form of the original misrouting bug.
Also removes the unnecessary `as VoiceBackendResponse` cast.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…r base64 uploads

Qwen session.updated handler now sends the same proactive summary or greeting
trigger that Gemini does — previously it started silently in both proactive and
reactive modes.

maxHttpBufferSize raised to 68 MiB to account for base64 expansion: 50 MiB
decoded files become ~66.7 MiB as base64 JSON, so the previous 55 MiB ceiling
would disconnect uploads above ~41 MiB before they reached the CLI.
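
The figures above follow from base64 turning every 3 raw bytes into 4 characters. A small check of the arithmetic, ignoring the JSON envelope around the payload (which adds a little more); the function names are illustrative.

```typescript
const MIB = 1024 * 1024

// Padded base64 length for a payload of rawBytes bytes.
function base64EncodedLength(rawBytes: number): number {
    return Math.ceil(rawBytes / 3) * 4
}

// Largest raw payload whose base64 form still fits under limitBytes.
function maxRawBytesUnder(limitBytes: number): number {
    return Math.floor(limitBytes / 4) * 3
}

// 50 MiB of file data grows to about 66.67 MiB once encoded.
const encodedMib = base64EncodedLength(50 * MIB) / MIB

// A 55 MiB buffer ceiling admits at most 41.25 MiB of raw data.
const maxRawMib = maxRawBytesUnder(55 * MIB) / MIB
```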

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
….update for Qwen text

Qwen's realtime API only supports conversation.item.create for function_call_output.
Sending it with type:'message' for greetings/context was invalid and could fail
before the user spoke.

sendTextMessage and sendContextualUpdate now update session instructions via
session.update (accumulating context into the system prompt) and trigger
response.create only when a spoken reply is needed — matching Qwen's supported
client event surface.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…n start

session.updated now returns early after the first ack — subsequent session.update
calls (instruction appends) also echo session.updated but must not re-trigger
audio capture or the greeting path.

currentSessionConfig is now reset to null at the top of startSession so a stale
config from a failed previous session cannot leak into the new one.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Without this guard, a missing wsUrl in the hub token response would
silently attempt to connect directly to Google with "proxied" as the
API key — producing a confusing auth failure instead of a clear error.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
DashScope realtime API accepts only 'pcm' for both input and output
audio formats. The pcm16/pcm24 values caused session.update rejection
before audio capture could start, leaving the Qwen backend unusable.

Also updates the default voice from Mia (not in the qwen3-omni-flash-
realtime voice list) to Cherry, which is documented as supported.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Failed token fetch, microphone denial, or WebSocket error during
setup left state.playbackContext open. Each failure path now calls
cleanup() before throwing/rejecting, preventing AudioContext leaks
on mobile browsers with hard limits on concurrent contexts.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Reverts changes to files that shouldn't differ from upstream:
- .gitignore: remove fork-only AGENTS.local.md entry
- web/src/App.tsx: restore dual-subscription SSE pattern (scope-aware)
- web/src/hooks/useSSE.ts: restore SSEScope/scope parameter
- web/src/hooks/useSSE.test.ts: restore (was accidentally deleted)
- web/src/lib/appSseSubscriptions.ts: restore (was accidentally deleted)
- web/src/lib/appSseSubscriptions.test.ts: restore (was accidentally deleted)
- hub/src/sync/syncEngine.ts: restore (off-topic change)
Hub sends HAPI-owned Gemini setup on proxy connect and rejects client
setup frames. Qwen proxy always uses QWEN_REALTIME_MODEL instead of a
client query parameter. Shared buildGeminiLiveSetupMessage() keeps wire
format in one place.

Co-authored-by: Cursor <cursoragent@cursor.com>
heavygee added 2 commits May 31, 2026 16:34
Mirror the Gemini proxy security model for Qwen:

- Hub sends initial session.update (voice/tools/instructions) on upstream
  connect so the browser cannot override config fields.
- Proxy message() now calls isQwenSafeClientFrame() and closes the
  connection (1008) if a client session.update touches any field other
  than 'instructions' (blocks tool/voice/modality overrides).
- QwenVoiceSession no longer sends session.update on session.created;
  it waits for the hub-relayed session.updated and then sends only
  instruction-only updates for context/proactive content.
- Language passed as query param (?language=zh) so hub builds the
  correct Chinese system prompt without a client-supplied session.update.
- buildQwenSessionUpdateMessage() and isQwenSafeClientFrame() added to
  @hapi/protocol/voice; 9 new unit tests cover filter edge cases.
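
A rough sketch of the kind of filter described above, not the actual isQwenSafeClientFrame() from @hapi/protocol/voice. It assumes text frames carrying JSON, lets non-session.update events through, and rejects anything it cannot parse; those choices are assumptions.

```typescript
// Returns true when a client frame may be forwarded upstream.
function isInstructionOnlyFrame(raw: string): boolean {
    let parsed: unknown
    try {
        parsed = JSON.parse(raw)
    } catch {
        return false // assumption: unparseable frames are rejected
    }
    if (typeof parsed !== 'object' || parsed === null) return false

    const event = parsed as { type?: unknown; session?: unknown }
    if (typeof event.type !== 'string') return false

    // Assumption: runtime events other than session.update are not filtered here.
    if (event.type !== 'session.update') return true

    const session = event.session
    if (typeof session !== 'object' || session === null || Array.isArray(session)) return false

    // Any key besides 'instructions' (voice, tools, modalities, ...) fails the check.
    return Object.keys(session).every((key) => key === 'instructions')
}
```

A caller would close the client socket with code 1008 when this returns false, as the commit describes.
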
…ring

DashScope requires session.update to be sent AFTER session.created is
received, not immediately on WebSocket open. Previously the hub sent
session.update in upstream.onopen, which violated this ordering and
risked the config being processed in an uninitialized session context.

Add pendingSetupMap to buffer the hub-owned session.update payload.
The onmessage handler now relays session.created to the browser first,
then immediately sends the pending session.update to DashScope — matching
the protocol ordering the old browser-side code used (which waited for
session.created before sending session.update).

Also remove maxHttpBufferSize from the socket.io Engine config. That
setting is unrelated to voice backends; upstream/main had no such limit
set and it is not introduced by this PR.
@heavygee
Contributor Author

Status update on the open Major (Qwen WS proxy)

Coordinated with the #692 agent — they're already implementing the fix in their worktree:

  • buildQwenSessionUpdateMessage(language) in shared/src/voice.ts (parallel to buildGeminiLiveSetupMessage)
  • isQwenSafeClientFrame(message) — allowlist; only session.update frames with only instructions pass through
  • Hub createQwenProxyWebSocketHandler sends hub-owned setup on upstream open; rejects with WS close 1008 'Client session.update may only modify instructions'
  • QwenVoiceSession.tsx updated to drop the now-redundant client-side initial setup

This is correctly #692 scope — the Qwen proxy message() forwarding-everything behaviour was introduced in #692; my #742 delta to hub/src/web/server.ts is 5 lines, all on the Gemini path (threading voiceName through buildGeminiLiveSetupMessage). No Qwen changes in this branch.

Plan: Stand by until #692 pushes the fix, then rebase feat/voice-selection-all-backends on the new tip. Their hardening flows through to this PR automatically (we stack on it). The bot's Major thread should then be resolvable on #692's commit, not on a duplicate fix here.

CI green; no other open Major findings. Holding for the upstream layer to land.

heavygee and others added 13 commits May 31, 2026 19:39
…-completions)

Qwen Realtime session.update expects tools as flat objects:
  { type: 'function', name, description, parameters }

The previous code used the chat-completions shape:
  { type: 'function', function: { name, description, parameters } }

DashScope may reject session.update or silently ignore tools with the
nested shape, causing tool calls to fail at runtime. Fix applied in
buildQwenSessionUpdateMessage(); test updated to assert flat shape and
that no nested `function` key is present.
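
The two shapes can be written out as types, with a converter between them. The interface names and the example parameter schema are illustrative; the tool name messageCodingAgent is taken from the #692 commit message.

```typescript
interface ToolSpec {
    name: string
    description: string
    parameters: Record<string, unknown>
}

// Chat-completions style: the spec is wrapped in a `function` key.
interface NestedTool {
    type: 'function'
    function: ToolSpec
}

// Realtime session.update style: the spec fields sit next to `type`.
interface FlatTool extends ToolSpec {
    type: 'function'
}

// Lift the nested spec to the top level and drop the wrapper key.
function toFlatTool(tool: NestedTool): FlatTool {
    return { type: 'function', ...tool.function }
}

const nested: NestedTool = {
    type: 'function',
    function: {
        name: 'messageCodingAgent',
        description: 'Send a message to the coding agent',
        parameters: { type: 'object', properties: { message: { type: 'string' } } }
    }
}
const flat = toFlatTool(nested)
```
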
Static voice lists and per-backend localStorage keys. Settings UI wiring
follows in this branch after tiann#692 lands.

Refs tiann#742

Co-authored-by: Cursor <cursoragent@cursor.com>
Expose configured backends from hub, let Settings pick provider when multiple
API keys exist, and wire voice selection through Gemini/Qwen/ElevenLabs paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
Surface catalog descriptions on the voice row and in the picker, with a
hint when preview is ElevenLabs-only. Disabled preview buttons stay visible.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
When hub/.env (operator-local) sets GEMINI_API_KEY, the 'falls back to
elevenlabs for unknown VOICE_BACKEND values' test leaks gemini-live
into the backends list. Delete the four non-elevenlabs key env vars
defensively at the start of the test, mirroring the cleanup pattern
used by the other tests in the same describe block. No behavior change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Stacks on voice-selection-all-backends. Adds shared voice-personality presets,
Settings accordions (character + backend-specific tuning), and ElevenLabs session
overrides from stored preferences.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace append-only personality notes with the bundled HAPI voice system
prompt in an advanced editor. User edits replace the base instruction for
ElevenLabs, Gemini (incl. hub proxy), and Qwen. Presets only drive TTS
sliders unless the user appends delivery text explicitly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Split voice instructions into platform fixtures, provider guardrails, and
editable identity/character; compose at runtime for ElevenLabs, Gemini, and Qwen.
Session history uses a small connect bootstrap plus deferred contextual chunks,
gated by the proactive-summary setting. Settings UI shows wire budgets and
read-only fixtures.

Co-authored-by: Cursor <cursoragent@cursor.com>
The web workspace runs tests via vitest run. Both voice tests imported
describe/expect/test from 'bun:test', which vite cannot bundle and
caused 'Cannot bundle built-in module bun:test' transform failures
during driver soup verify. Only generic test APIs are used, so switching
the import to 'vitest' does not change behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
@heavygee
Contributor Author

Heads-up from a downstream soup-rebuild (heavygee/hapi driver/integration, 2026-05-31):

To get this branch + the voice-advanced layer building cleanly on the current feat/pluggable-voice-backend tip, two pre-existing test bugs needed fixing. Both are local-only commits on the rebased branches and will be lost on the next force-push unless preserved.

1. On feat/voice-selection-all-backends (this PR)

238ad4c test(voice): isolate VOICE_BACKEND fallback test from leaked env vars

hub/src/web/routes/voice.test.ts | 4 ++++

The "falls back to elevenlabs for unknown VOICE_BACKEND values" test never deleted GEMINI_API_KEY / QWEN_API_KEY / OPENAI_API_KEY / QWEN_REALTIME_API_KEY before asserting. When hub/.env has any of those set, gemini-live (or another backend) leaks into the resolved backends list and the test fails. The fix adds four delete process.env.* lines at the top of the test, mirroring the cleanup pattern already used by the other tests in the same describe block. No behavior change.
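
The isolation pattern, sketched against a plain env record instead of process.env so it runs anywhere. listBackends is a hypothetical stand-in for the hub's discovery logic, not the real route code.

```typescript
type Env = Record<string, string | undefined>

// The four non-ElevenLabs keys named above.
const NON_ELEVENLABS_KEYS = [
    'GEMINI_API_KEY',
    'QWEN_API_KEY',
    'OPENAI_API_KEY',
    'QWEN_REALTIME_API_KEY'
] as const

// Remove provider keys that an operator-local .env may have injected.
function clearLeakedProviderKeys(env: Env): void {
    for (const key of NON_ELEVENLABS_KEYS) delete env[key]
}

// Hypothetical discovery: a backend is listed when its key is present.
function listBackends(env: Env): string[] {
    const backends = ['elevenlabs']
    if (env.GEMINI_API_KEY) backends.push('gemini-live')
    return backends
}
```

In the real test the same deletes would target process.env at the top of the test body, before the assertion on the resolved backends.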

2. On feat/voice-advanced-controls (sibling branch, stacks on this one)

f3ab0f3 test(web): import from vitest instead of bun:test

web/src/lib/voicePersonalitySession.test.ts        | 2 +-
web/src/realtime/hooks/voiceContextPlan.test.ts    | 2 +-

Both files imported describe/expect/test from 'bun:test'. The web workspace runs vitest run, which cannot bundle the built-in bun:test module and fails the transform during bun run build. Only generic test APIs are used, so the import swap does not change behavior.

Surfacing in case you re-do the rebase from scratch and overwrite these locally - they would silently come back as test failures on the next CI/build. No action needed if you keep them.

@heavygee heavygee force-pushed the feat/voice-selection-all-backends branch from 7469b27 to 238ad4c Compare June 1, 2026 02:19
@github-actions github-actions Bot left a comment

Findings

  • [Major] Qwen proxy still forwards runtime frames before setup is acknowledged — the browser waits for session.updated before starting audio, but an authenticated client can connect directly and send response.create, conversation.item.create, or instruction-only session.update while pendingSetupMap is still set. Those frames are forwarded immediately at hub/src/web/server.ts:188, before the hub-owned session.update sent at hub/src/web/server.ts:159 is acknowledged. That can run the upstream session under the provider default config or race ahead of HAPI's locked tools/voice/instructions, reopening part of the generic-proxy boundary the previous review called out.
    Suggested fix:
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
    
    // after session.created
    pendingSetupMap.delete(clientWs)
    setupAckMap.set(clientWs, false)
    upstream.send(pendingSetup)
    
    // after session.updated for the hub-owned setup
    setupAckMap.set(clientWs, true)
    for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
    pendingClientFrames.delete(clientWs)
    
    message(clientWs, message) {
        if (!isQwenSafeClientFrame(message)) { /* close */ return }
        const upstream = upstreamMap.get(clientWs)
        if (upstream?.readyState !== WebSocket.OPEN) return
        if (pendingSetupMap.has(clientWs) || setupAckMap.get(clientWs) !== true) {
            const pending = pendingClientFrames.get(clientWs) ?? []
            pending.push(message)
            pendingClientFrames.set(clientWs, pending)
            return
        }
        upstream.send(message)
    }

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous Qwen proxy issue is partly addressed, but the proxy still needs to gate client frames until the hub-owned Qwen setup has been acknowledged.

Testing

  • Not run (automation; static review only, per PR security instructions).

HAPI Bot

Comment thread hub/src/web/server.ts Outdated
The previous Qwen proxy hardening sent the hub-owned session.update once
upstream's session.created landed, but never waited for session.updated
before forwarding client frames. An authenticated client could push
response.create / conversation.item.create / instruction-only
session.update during that window and run the upstream session under the
provider default config or partially-applied state - reopening part of
the generic-proxy boundary the prior review called out.

Changes:
- Extract createQwenProxyWebSocketHandler into hub/src/web/qwenProxyHandler.ts
  so the gating behaviour can be unit-tested without spinning up Bun.serve.
- Track a per-client setupAcked WeakMap and a pendingClientFrames queue.
  Client frames are queued (after passing isQwenSafeClientFrame) until the
  upstream emits session.updated; on ack the queue is flushed in order.
- Clear the WeakMaps on upstream error / upstream close / client close so
  no state leaks across reused references.
- Inject the upstream WebSocket constructor for tests; production keeps
  the real global WebSocket.
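
The gating described above, reduced to a small state machine that can be exercised without sockets. QwenSetupGate and its method names are invented for this sketch; the real handler keeps this state in per-client WeakMaps and also runs isQwenSafeClientFrame before queueing.

```typescript
type Frame = string

class QwenSetupGate {
    private setupSent = false
    private acked = false
    private readonly queue: Frame[] = []
    private readonly setup: Frame
    private readonly sendUpstream: (frame: Frame) => void

    constructor(setup: Frame, sendUpstream: (frame: Frame) => void) {
        this.setup = setup
        this.sendUpstream = sendUpstream
    }

    // Call with the `type` of each event received from upstream.
    onUpstreamEvent(type: string): void {
        if (type === 'session.created' && !this.setupSent) {
            // Hub-owned setup goes out only after session.created.
            this.setupSent = true
            this.sendUpstream(this.setup)
        } else if (type === 'session.updated' && this.setupSent && !this.acked) {
            // First session.updated acknowledges the setup: flush in order.
            this.acked = true
            for (const frame of this.queue.splice(0)) this.sendUpstream(frame)
        }
    }

    // Call for each client frame that already passed the safety filter.
    onClientFrame(frame: Frame): void {
        if (!this.acked) {
            this.queue.push(frame)
            return
        }
        this.sendUpstream(frame)
    }
}
```

Later session.updated events (echoes of instruction-only updates) fall through both branches, so they cannot trigger a second flush.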

Tests: 4 new ack-gate cases + existing 245 hub tests still green.
Resolves the [Major] thread on PR tiann#743 follow-up review.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions github-actions Bot left a comment

Findings

  • [Major] Gate Gemini client frames until setupComplete — the proxy sends HAPI's setup on upstream open, but it then flushes queued non-setup frames immediately and forwards later frames whenever the upstream socket is open. A direct authenticated client can therefore send clientContent/tool frames before Google acknowledges the hub-owned setup with setupComplete, racing execution ahead of HAPI's locked model, tools, system prompt, language, and voice config. Evidence: hub/src/web/server.ts:77 and hub/src/web/server.ts:107.
    Suggested fix:
    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
    
    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
    }
    
    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        if (setupCompleteMap.get(clientWs) === false && JSON.parse(text).setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }
    
    message(clientWs, message) {
        if (isGeminiSetupFrame(message)) {
            clientWs.close(1008, 'Client-provided Gemini setup is not allowed')
            return
        }
        const upstream = upstreamMap.get(clientWs)
        if (!upstream || upstream.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
            const pending = pendingClientFrames.get(clientWs) ?? []
            pending.push(message)
            pendingClientFrames.set(clientWs, pending)
            return
        }
        upstream.send(message)
    }

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous Qwen ack-gate issue appears addressed in the follow-up, but the newly added Gemini proxy still has the same setup-ack race at its proxy boundary.

Testing

  • Not run (automation; static review only, per PR security instructions).

HAPI Bot

Comment thread hub/src/web/server.ts
upstream.onopen = () => {
// Hub-owned setup only — never forward client setup (prevents generic Gemini proxy abuse).
upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
for (const queued of pending.splice(0)) {

[Major] Gate Gemini client frames until setupComplete

This flushes queued non-setup frames immediately after sending HAPI's setup, before Google has acknowledged it with setupComplete; once the upstream socket is open, message() also forwards later frames directly. A direct authenticated client can send clientContent or tool frames before HAPI's hub-owned model, tools, system prompt, language, and voice setup is installed, so the session can race ahead under default or partially-applied config.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
}

// On the first upstream setupComplete, flip the gate and flush queued client frames.
if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
    setupCompleteMap.set(clientWs, true)
    for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
    pendingClientFrames.delete(clientWs)
}

message(clientWs, message) {
    if (isGeminiSetupFrame(message)) {
        clientWs.close(1008, 'Client-provided Gemini setup is not allowed')
        return
    }
    const upstream = upstreamMap.get(clientWs)
    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
}

heavygee and others added 3 commits June 1, 2026 07:31
ElevenLabs convai rejects sessions whose `overrides` payload references any
property the agent has not explicitly authorized. The server replies with a malformed
error packet on the LiveKit data channel, which the convai-react SDK then dereferences
as `event.error_type` (undefined -> TypeError -> disconnect), so the only symptom is
a generic "Unknown DataChannel error" with no indication of the cause.

Empty-prefs sessions now emit exactly `{ agent: { language } }` for byte-parity
with the upstream/main baseline that the configured agent permits today:

- agent.prompt: only when user has customized identity or character
- tts (full slider object): only when preset diverges from balanced defaults
- tts: { voice_id }: only when a voice was picked and sliders are at defaults

Co-authored-by: Cursor <cursoragent@cursor.com>
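
The three rules above could be expressed along these lines. This is a sketch under stated assumptions, not the PR's `buildVoiceAgentConfig()`: the `VoicePrefs` shape, the `BALANCED` default values, the `{ prompt: { prompt } }` nesting, and the choice to carry `voice_id` inside the full slider object are all illustrative guesses. Only the field names (`voice_id`, `stability`, `similarity_boost`, `style`, `speed`) come from the commit messages.

```typescript
interface TtsSliders {
    stability: number
    similarity_boost: number
    style: number
    speed: number
}

// Hypothetical "balanced" defaults; the real values are not given in the commit message.
const BALANCED: TtsSliders = { stability: 0.5, similarity_boost: 0.75, style: 0, speed: 1 }

interface VoicePrefs {
    language: string
    customPrompt?: string // set only when identity or character was edited
    voiceId?: string
    sliders?: TtsSliders
}

interface Overrides {
    agent: { language: string; prompt?: { prompt: string } } // prompt nesting is assumed
    tts?: Partial<TtsSliders> & { voice_id?: string }
}

function buildOverrides(prefs: VoicePrefs): Overrides {
    // Baseline: exactly { agent: { language } } when nothing was customized.
    const overrides: Overrides = { agent: { language: prefs.language } }

    if (prefs.customPrompt) {
        overrides.agent.prompt = { prompt: prefs.customPrompt }
    }

    const sliders = prefs.sliders ?? BALANCED
    const keys = Object.keys(BALANCED) as Array<keyof TtsSliders>
    const slidersCustomized = keys.some((key) => sliders[key] !== BALANCED[key])

    if (slidersCustomized) {
        // Full slider object; including voice_id here is an assumption.
        overrides.tts = { ...sliders, ...(prefs.voiceId ? { voice_id: prefs.voiceId } : {}) }
    } else if (prefs.voiceId) {
        // Voice picked, sliders at defaults: send only the voice.
        overrides.tts = { voice_id: prefs.voiceId }
    }
    return overrides
}
```

The point of building the payload incrementally is that an untouched session never mentions `agent.prompt` or `tts`, so an agent that has not authorized those overrides is not asked to accept them.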
…onvAI overrides

UI: wrap fixtures, identity, character, delivery preset, and tuning sliders
in a single collapsed "Advanced voice settings" disclosure. Defaults stay
quiet; sub-sections start collapsed when the master opens. Shows a
"customized" badge if any layer differs from defaults so a user who tweaked
settings still knows where to find them.

Hub: on every /voice/token resolution, PATCH the resolved ConvAI agent's
platform_settings.overrides to match buildVoiceAgentConfig() (agent.prompt,
tts.voice_id/stability/similarity_boost/style/speed). Idempotent per-process
(cached per agent_id), best-effort (non-fatal on PATCH error). Fixes the
"Cannot read properties of undefined (reading 'error_type')" crash for
operators whose existing agent predates the override declaration.

Co-authored-by: Cursor <cursoragent@cursor.com>
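
The "idempotent per-process, best-effort" behaviour can be illustrated with a small sketch. Everything named here (`createOverrideReconciler`, `PatchFn`, the result strings) is hypothetical, and the HTTP call is injected rather than written out so that no ElevenLabs endpoint or payload is guessed at. Whether the real hub retries after a failed PATCH is not stated in the commit; this sketch assumes failures are not cached.

```typescript
// Injected transport: performs the actual PATCH for one agent (hypothetical signature).
type PatchFn = (agentId: string, overrides: unknown) => Promise<void>

type ReconcileResult = 'patched' | 'cached' | 'failed'

function createOverrideReconciler(patch: PatchFn) {
    // Per-process cache of agent ids that were already reconciled successfully.
    const reconciled = new Set<string>()

    return async function reconcile(agentId: string, overrides: unknown): Promise<ReconcileResult> {
        if (reconciled.has(agentId)) return 'cached'
        try {
            await patch(agentId, overrides)
            reconciled.add(agentId)
            return 'patched'
        } catch (error) {
            // Best-effort: log and continue so token resolution is never blocked.
            const reason = error instanceof Error ? error.message : String(error)
            console.warn(`override reconciliation failed for ${agentId}: ${reason}`)
            return 'failed'
        }
    }
}
```

Two concurrent first requests for the same agent would both issue a PATCH in this version; that is harmless if the PATCH itself is idempotent, which is assumed here.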
Brings layered voice prompt UI, context bootstrap + streaming, and
ConvAI agent override reconciliation into the existing voice picker
PR. All new surfaces live behind a single collapsed "Advanced voice
settings" disclosure; defaults stay quiet for users who only want to
pick a voice.

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	hub/src/web/server.ts
@heavygee heavygee changed the title feat(voice): backend-aware voice picker for Gemini and Qwen (#742) feat(voice): backend voice picker + advanced controls behind disclosure (#742) Jun 1, 2026

@github-actions github-actions Bot left a comment

Findings

  • [Major] Gate Gemini client frames until setupComplete — the proxy sends HAPI's setup, then immediately flushes queued client frames and forwards later frames as soon as the upstream socket is open. A direct authenticated client can race clientContent or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's model, tools, prompt, language, and voice configuration are installed. Evidence: hub/src/web/server.ts:88 and hub/src/web/server.ts:118.
    Suggested fix:

    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
    
    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
    }
    
    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        const parsed = JSON.parse(text) as { setupComplete?: unknown }
        if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }
    
    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
  • [Major] Pass decoded systemInstruction into Qwen setup — /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client sends that param for advanced voice settings, but createQwenProxyWebSocketHandler() casts the data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName). Qwen sessions therefore ignore the user-edited identity/character/delivery prompt while Gemini uses it. Evidence: hub/src/web/qwenProxyHandler.ts:53 and hub/src/web/qwenProxyHandler.ts:61.
    Suggested fix:

    const data = clientWs.data as {
        apiKey: string
        model: string
        language?: string
        voiceName?: string
        systemInstruction?: string
    }
    
    pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous Gemini setup-ack race remains present, and the new Qwen prompt transport path drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.

Testing

  • Not run (automation; static review only).

HAPI Bot

Comment thread hub/src/web/server.ts
data.voiceName,
data.systemInstruction
)))
for (const queued of pending.splice(0)) {

[Major] Gate Gemini client frames until setupComplete

This flushes queued client frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's model, tools, prompt, language, and voice setup is installed.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
}

if (setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
// Flip the gate and flush only after the first upstream setupComplete frame.

})

upstreamMap.set(clientWs, upstream)
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))

[Major] Pass the decoded prompt into Qwen's hub-owned setup

The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.

Suggested fix:

const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))
