feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime#692
feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime#692heavygee wants to merge 29 commits into
Conversation
21d2417 to
9081a9b
Compare
There was a problem hiding this comment.
Findings
- [Major] Gemini turn completion ignores the user mute state — when Gemini starts speaking, the recorder is force-muted, but
turnCompletealways callssetMuted(false). If the user had muted the mic, the next model turn re-enables the MediaStream track and can stream microphone audio while the UI still shows muted, evidenceweb/src/realtime/GeminiLiveVoiceSession.tsx:217.
Suggested fix:state.modelSpeaking = false state.recorder?.setMuted(state.micMuted)
Summary
- Review mode: initial
- One major issue found in the new Gemini voice backend mute handling.
Testing
- Not run (automation;
bunis not installed in this runner).
HAPI Bot
There was a problem hiding this comment.
Findings
- [Major] Gemini restart can keep microphone input disabled after stopping during playback —
cleanup()closes the old session but leavesstate.modelSpeakingunchanged. If the user stops or the socket closes while Gemini is speaking, the nextstartSession()begins withmodelSpeaking === true, andsendAudioChunk()drops every mic chunk until a page reload or a later model turn flips the flag, evidenceweb/src/realtime/GeminiLiveVoiceSession.tsx:60and related guard atweb/src/realtime/GeminiLiveVoiceSession.tsx:301.
Suggested fix:function cleanup() { // existing recorder/player/ws cleanup... state.ws = null state.modelSpeaking = false }
Summary
- Review mode: follow-up after new commits
- One major issue remains in the new Gemini Live backend lifecycle handling. The previous mute-state finding is addressed by the latest commit.
Testing
- Not run (automation;
bunis not installed in this runner).
HAPI Bot
There was a problem hiding this comment.
Findings
- [Major] Stale Gemini close handlers can tear down the next session —
ws.onclosealways calls module-levelcleanup(), butcleanup()operates onstate.ws, not necessarily the socket that fired the event. If the user stops/restarts quickly, the old socket can close after the new socket is assigned and close the new session, leaving voice disconnected, evidenceweb/src/realtime/GeminiLiveVoiceSession.tsx:258.
Suggested fix:ws.onclose = (event) => { if (state.ws !== ws) return cleanup() resetRealtimeSessionState() // existing setupDone/status handling... }
- [Major] Stale Qwen close handlers can tear down the next session — same module-level cleanup race exists in the Qwen backend, so an old socket close can reset/close a newer session after a quick reconnect, evidence
web/src/realtime/QwenVoiceSession.tsx:285.
Suggested fix:ws.onclose = (event) => { if (state.ws !== ws) return cleanup() resetRealtimeSessionState() // existing sessionReady/status handling... }
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous Gemini
modelSpeakingrestart issue is fixed in the latest head. Two lifecycle races remain in the new WebSocket-backed voice sessions.
Testing
- Not run (automation;
bunis not installed in this runner).
HAPI Bot
Follow-on: per-user backend picker in SettingsThe current Rough shape:
This composes cleanly with the voice picker from #690 — if you switch back to ElevenLabs, the voice picker reappears with your account voices (including clones). Gemini/Qwen don't have per-voice selection yet, so that section would hide itself when a non-ElevenLabs backend is active. Will file this as a follow-on once #690 and this PR are settled, since both touch the settings page and it makes sense to land them together. |
There was a problem hiding this comment.
Findings
- [Minor] Gemini Live omits the non-ElevenLabs language block —
VOICE_CHINESE_LANGUAGE_BLOCKis documented as the block appended for Gemini/Qwen backends, and Qwen appends it before sending session instructions, butbuildGeminiLiveConfig()still sends onlyVOICE_SYSTEM_PROMPT. WithVOICE_BACKEND=gemini-live, the assistant will not get the Mandarin response instruction that the new Qwen path gets, evidenceshared/src/voice.ts:347.
Suggested fix:export function buildGeminiLiveConfig(): GeminiLiveConfig { return { model: GEMINI_LIVE_MODEL, systemInstruction: VOICE_SYSTEM_PROMPT + VOICE_CHINESE_LANGUAGE_BLOCK, tools: [ { functionDeclarations: buildGeminiLiveFunctionDeclarations() } ], responseModalities: ['AUDIO'] } }
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous stale-close findings are fixed in the latest head. One Gemini/Qwen behavior mismatch remains.
Testing
- Not run (automation; PR code not executed).
HAPI Bot
|
Qwen Realtime untested — DashScope API signup was inaccessible during dogfooding. Will update once a key is obtainable. |
|
On removing
ElevenLabs handles this separately via its |
There was a problem hiding this comment.
Findings
- [Major] Global visibility suppresses native push for unrelated sessions — the new always-on
{ all: true }SSE connection is also reported as visible, soPushNotificationChannelsees the namespace as visible and sends the toast to that global connection instead of falling back to Web Push. When the user is viewing session A, a ready/permission/task notification for session B is delivered only to the open tab and no native push is sent, even though the selected-session connection is not subscribed to B. Evidenceweb/src/App.tsx:327and existing push fallback logic inhub/src/push/pushNotificationChannel.ts:39.
Suggested fix:const isGlobalVisibilityAuthoritative = !sessionEventSubscription useVisibilityReporter({ api, subscriptionId: globalSubscriptionId, enabled: sseEnabled && isGlobalVisibilityAuthoritative }) useVisibilityReporter({ api, subscriptionId: sessionSubscriptionId, enabled: sseEnabled && Boolean(sessionEventSubscription) })
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The new split-SSE wiring still keeps session-list updates live, but its visibility reporting can now swallow native notifications for sessions other than the selected one.
Testing
- Not run (automation; PR code not executed). Add coverage for the split-SSE visibility case: selected session visible, notification for another session should still call Web Push rather than only delivering an in-app toast.
HAPI Bot
|
On the proactive/reactive toggle (scope increase explanation) The HAPI Bot flagged that But that review led somewhere useful: digging into why the original code behaved that way, it became clear the Chinese block wasn't just a language setting — it was tied to a proactive narration style where the assistant speaks unprompted when context changes or the agent finishes a task. That's a legitimate and useful mode; it's just not the right default for everyone. Rather than silently drop that behaviour (which was part of @Overbaker's original design intent in #401), we've surfaced it as a Proactive voice toggle in Settings, defaulting to off (reactive, matching current The toggle applies uniformly to all three backends: Gemini, Qwen, and ElevenLabs. |
There was a problem hiding this comment.
Findings
- [Major] Normalize upstream close codes before closing the client WebSocket — both new voice proxies forward
event.codefrom the upstream socket. Abnormal upstream closes commonly surface as1006, and1005/1006/1015are reserved codes that cannot be sent in a close frame. IfclientWs.close(event.code, ...)throws, the catch swallows it after the upstream is removed, leaving the browser socket open while later messages are dropped. Evidencehub/src/web/server.ts:68.
Suggested fix:function toClientCloseCode(code: number): number { return code >= 1000 && code <= 4999 && code !== 1005 && code !== 1006 && code !== 1015 ? code : 1011 } function closeClientFromUpstream(clientWs: ServerWebSocket<unknown>, event: CloseEvent): void { try { clientWs.close(toClientCloseCode(event.code), event.reason || 'Upstream closed') } catch { /* client gone */ } } upstream.onclose = (event) => { pendingMap.delete(clientWs) closeClientFromUpstream(clientWs, event) upstreamMap.delete(clientWs) }
- [Minor] Barrel export defeats lazy backend loading —
VoiceBackendSessionlazy-imports Gemini/Qwen to keep alternate backends out of the initial path, but the barrel now statically re-exports those same modules. Any@/realtimeimport creates static dependencies on both backend modules, so the lazy split is unreliable. Evidenceweb/src/realtime/index.ts:20.
Suggested fix:export { RealtimeVoiceSession, type RealtimeVoiceSessionProps } from './RealtimeVoiceSession' export { VoiceBackendSession, type VoiceBackendSessionProps } from './VoiceBackendSession'
Questions
- None.
Summary
- Review mode: follow-up after new commits
- Found two issues in the final diff: upstream abnormal-close handling can leave proxied voice sockets hung, and the realtime barrel export undermines the intended lazy backend split.
Testing
- Not run (automation; PR code not executed). Static fallback attempted:
bun typecheckis unavailable in this runner (bun: command not found), and directtsc -p web/tsconfig.json --noEmit/tsc -p hub/tsconfig.json --noEmitare blocked by missing workspace type packages (vite/client,bun-types,node).
HAPI Bot
There was a problem hiding this comment.
Findings
- [Major] Keep the session SSE connection responsible for session catch-up — after the split, only the global SSE passes
onConnect, so reconnecting while a session is open invalidates the session query and fetches latest messages from the global connection lifecycle. That global connection does not restart whenselectedSessionIdchanges or when the session-scoped SSE reconnects, so a dropped/api/events?sessionId=...stream can reconnect atweb/src/hooks/useSSE.ts:543without running the catch-up atweb/src/App.tsx:206, leaving the active chat stale until another event arrives. Evidenceweb/src/App.tsx:312.
Suggested fix:const handleSessionSseConnect = useCallback(() => { if (!selectedSessionId || !api) return void Promise.all([ queryClient.invalidateQueries({ queryKey: queryKeys.session(selectedSessionId) }), fetchLatestMessages(api, selectedSessionId) ]).catch((error) => { console.error('Failed to refresh session after SSE connect:', error) }) }, [api, queryClient, selectedSessionId]) const { subscriptionId: sessionSubscriptionId } = useSSE({ enabled: sseEnabled && Boolean(sessionEventSubscription), token: token ?? '', baseUrl, subscription: sessionEventSubscription ?? undefined, scope: 'full', onConnect: handleSessionSseConnect, onEvent: handleSseEvent })
- [Major] Do not disable visibility reporting for the global subscription while a session is open —
useVisibilityReporteronly clears its retry/local refs when disabled; it does not POST hidden or unregister the SSE subscription. With this newenabled: sseEnabled && !sessionEventSubscription, the all-sessions connection stays registered asvisiblefrom its initialvisibilityquery, andSSEManager.sendToastdelivers to any visible connection in the namespace. Result: while the user is reading a session, global toast delivery can count as visible and suppress push/Telegram fallback even if the session-scoped reporter later goes hidden or fails. Evidenceweb/src/App.tsx:330.
Suggested fix:useVisibilityReporter({ api, subscriptionId: globalSubscriptionId, enabled: sseEnabled }) useVisibilityReporter({ api, subscriptionId: sessionSubscriptionId, enabled: sseEnabled && Boolean(sessionEventSubscription) })
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous bot findings were addressed, but the SSE split introduces two regressions: active-session reconnects no longer run session catch-up, and the global SSE subscription can remain visible without a reporter while a session is open.
Testing
- Not run (automation;
bunis unavailable in this runner:bun: command not found).
HAPI Bot
…#694 The following files were accidentally included from work intended for a separate PR (session-list-status / SSE split subscriptions) due to that branch being submitted from main rather than its own worktree branch: - web/src/App.tsx (reverted to upstream/main) - web/src/hooks/useSSE.ts (reverted to upstream/main) - web/src/hooks/useSSE.test.ts (removed — added by other branch) - web/src/lib/appSseSubscriptions.ts (removed — added by other branch) - web/src/lib/appSseSubscriptions.test.ts (removed — added by other branch) - hub/src/sync/syncEngine.ts (reverted to upstream/main) Apologies for the noise. This PR (tiann#692) is scoped to the pluggable voice backend only. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
|
Apologies for the noise on the previous push — a separate agent was working on SSE split-subscription changes (for a different PR) and accidentally submitted from |
There was a problem hiding this comment.
Findings
- [Minor] Preserve the voice language preference for Gemini/Qwen —
VoiceProviderstill passes the selected ElevenLabs language code intostartRealtimeSession, but the new Gemini and Qwen backends never readconfig.language. The PR also addsVOICE_CHINESE_LANGUAGE_BLOCKspecifically for Gemini/Qwen, butbuildGeminiLiveConfig()is called without the language and Qwen buildsinstructionsfromVOICE_SYSTEM_PROMPTonly, so selecting Chinese in settings silently works only for ElevenLabs. Evidenceweb/src/realtime/GeminiLiveVoiceSession.tsx:127,web/src/realtime/QwenVoiceSession.tsx:160.
Suggested fix:// shared/src/voice.ts export function buildGeminiLiveConfig(language?: string): GeminiLiveConfig { const systemInstruction = language === 'zh' ? `${VOICE_SYSTEM_PROMPT}${VOICE_CHINESE_LANGUAGE_BLOCK}` : VOICE_SYSTEM_PROMPT return { model: GEMINI_LIVE_MODEL, systemInstruction, tools: [{ functionDeclarations: buildGeminiLiveFunctionDeclarations() }], responseModalities: ['AUDIO'] } } // web/src/realtime/GeminiLiveVoiceSession.tsx const liveConfig = buildGeminiLiveConfig(config.language) // web/src/realtime/QwenVoiceSession.tsx const basePrompt = config.language === 'zh' ? `${VOICE_SYSTEM_PROMPT}${VOICE_CHINESE_LANGUAGE_BLOCK}` : VOICE_SYSTEM_PROMPT
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous bot findings were reviewed as context. I found one integration gap in the new voice backends: the existing voice language setting is ignored outside ElevenLabs.
Testing
- Not run (automation;
bunis unavailable in this runner).
HAPI Bot
|
Re: SSE visibility regression (Major) — this finding is against the split-subscription code that was already reverted in commit Re: voice language ignored for Gemini/Qwen (Minor) — fixed in Also corrected the UI label for the voice-start toggle ('Proactive voice' → 'Start voice session with summary') to accurately describe what the setting controls: whether a voice session opens with a spoken summary of current agent activity, or with a greeting and a wait for the user to speak. |
There was a problem hiding this comment.
Findings
No findings.
Questions
None.
Summary
Review mode: follow-up after new commits
No issues found in the current diff. The previous HAPI Bot finding around Gemini/Qwen language propagation is addressed in the latest head. Residual risk: the new Gemini/Qwen WebSocket proxy and browser audio paths were reviewed statically only.
Testing
Not run (automation; bun is unavailable in this runner).
HAPI Bot
There was a problem hiding this comment.
Findings
No findings.
Questions
None.
Summary
Review mode: follow-up after new commits
No issues found in the current diff. I reviewed the full latest PR diff, including the two commits added after the previous HAPI Bot review. Residual risk: the new Gemini/Qwen WebSocket proxy and browser audio paths were reviewed statically only.
Testing
Not run (automation; bun is unavailable in this runner).
HAPI Bot
There was a problem hiding this comment.
Findings
- [Major] Qwen
session.updateuses a schema the upstream service rejects — the Qwen backend configuresqwen3-omni-flash-realtime, but sends flattened tool definitions pluspcmaudio formats. Qwen validatessession.updatebefore returningsession.updated, so this can keep the Qwen voice backend from ever reachingconnected. Evidenceweb/src/realtime/QwenVoiceSession.tsx:153,web/src/realtime/QwenVoiceSession.tsx:172.
Suggested fix:const tools = VOICE_TOOL_DEFINITIONS.map((td) => ({ type: 'function' as const, function: { name: td.name, description: td.description, parameters: td.parameters } })) // ... input_audio_format: 'pcm16', output_audio_format: 'pcm24', tools,
- [Minor]
HAPI_PUBLIC_URLwith a trailing slash generates a double-slash WebSocket path —https://host/becomeswss://host//api/voice/..., but the Bun fetch handler only matches/api/voice/gemini-wsand/api/voice/qwen-ws, so the proxy upgrade misses. Evidencehub/src/web/routes/voice.ts:149,hub/src/web/routes/voice.ts:172.
Suggested fix:function buildVoiceWsUrl(base: string, pathname: string): string { const url = new URL(base) url.protocol = url.protocol === 'https:' ? 'wss:' : 'ws:' url.pathname = pathname url.search = '' url.hash = '' return url.toString() } const wsProxyUrl = buildVoiceWsUrl(publicUrl, '/api/voice/gemini-ws')
Questions
None.
Summary
Review mode: follow-up after new commits
Two issues found in the current full diff. The main risk is Qwen Realtime failing during session setup; the smaller risk is proxy URL generation for common public URL formatting.
Testing
Not run (security; review prompt prohibits executing code from PR content).
HAPI Bot
There was a problem hiding this comment.
Findings
- [Major] Initial muted state is applied before the recorder has a media stream, so a session restarted while muted can still open the microphone.
GeminiAudioRecorder.start()assignsmediaStreamonly after its firstawait, butstartAudioCapture()does not await it before callingsetMuted; the same pattern exists in Qwen. Evidenceweb/src/realtime/GeminiLiveVoiceSession.tsx:330, relatedweb/src/realtime/QwenVoiceSession.tsx:341.
Suggested fix:async function startAudioCapture(playbackContext: AudioContext): Promise<void> { state.player = new GeminiAudioPlayer(playbackContext) state.recorder = new GeminiAudioRecorder() await state.recorder.start(onChunk, onError) state.recorder.setMuted(state.micMuted) } await startAudioCapture(state.playbackContext!) state.statusCallback?.('connected')
- [Major] Client close codes are forwarded directly to upstream providers, which can leave paid realtime sessions running. Browser/server close callbacks can report reserved codes like
1006;WebSocket.close(1006, ...)throws, the catch swallows it, and the map entry is deleted without closing the Gemini/Qwen upstream socket. Evidencehub/src/web/server.ts:95, relatedhub/src/web/server.ts:146.
Suggested fix:function closeUpstream(upstream: WebSocket, code: number, reason: string) { const safeCode = toClientCloseCode(code) const safeReason = (reason || 'Client closed').slice(0, 123) try { upstream.close(safeCode, safeReason) } catch { upstream.close(1011, 'Client closed') } } closeUpstream(upstream, code, reason)
Questions
None.
Summary
Review mode: follow-up after new commits
Two issues remain in the current full diff. The main risks are privacy/user expectation around muted voice restarts and orphaned upstream realtime voice sessions after abnormal client disconnects.
Testing
Not run (automation review; no PR code executed).
HAPI Bot
There was a problem hiding this comment.
Findings
- [Major] Audio capture failures after WebSocket setup can leave voice startup permanently pending.
GeminiAudioRecorder.start()still performs a secondgetUserMediacall and builds the audio graph; after the follow-up change,await startAudioCapture(...)runs inside the asynconmessagehandler, but any rejection there is not caught or passed to the outerPromisereject path. The UI can remain inconnecting,startRealtimeSession()never finishes, and the partially-open provider socket is not cleaned up. Same pattern inweb/src/realtime/QwenVoiceSession.tsx:195. Evidenceweb/src/realtime/GeminiLiveVoiceSession.tsx:181.
Suggested fix:try { await startAudioCapture(state.playbackContext!) } catch (error) { const message = error instanceof Error ? error.message : "Microphone error" cleanup() state.statusCallback?.("error", message) reject(error instanceof Error ? error : new Error(message)) return } state.statusCallback?.("connected")
Questions
None.
Summary
Review mode: follow-up after new commits
The previous two HAPI Bot findings are addressed, but the new awaited audio-capture path needs an explicit rejection/cleanup path in both Gemini and Qwen. Residual testing gap: add a startup failure test/mock for GeminiAudioRecorder.start() rejecting after the provider WebSocket reaches ready state.
Testing
Not run (automation review; no PR code executed).
HAPI Bot
There was a problem hiding this comment.
Findings
- [Major] Socket.IO limit still rejects valid 50 MiB uploads —
MAX_UPLOAD_BYTESallows 50 MiB decoded files, but the RPC payload carries that file as base64 JSON through Socket.IO. A 50 MiB file expands to about 66.7 MiB before JSON overhead, so the new 55 MiBmaxHttpBufferSizestill makes files above roughly 41 MiB pass the REST check and then fail/disconnect during the hub-to-CLI RPC. Evidencehub/src/socket/server.ts:69; related size checkhub/src/web/routes/sessions.ts:22.
Suggested fix:maxHttpBufferSize: 68 * 1024 * 1024, // 50 MiB decoded uploads are ~66.7 MiB as base64 JSON
Questions
None.
Summary
Review mode: follow-up after new commits
The prior HAPI Bot audio-startup finding is addressed. One upload-size regression remains in the updated Socket.IO buffer limit.
Testing
Not run (automation review; no PR code executed).
HAPI Bot
Cold review: Gemini hub config still hits ElevenLabs token pathSeverity: functional bug on non-default backends (reproduced on daily driver with SymptomUser taps voice with hub configured for Gemini Live ( That error string cannot come from the Gemini path — it is emitted only by the ElevenLabs token route in What we verified (not hub config)Live hub on daily driver:
Conclusion: hub side is correct. Web client is mounting the ElevenLabs session implementation instead of Root cause (this PR, not upstream main)Upstream Smoking gun — export async function fetchVoiceBackend(api: ApiClient): Promise<VoiceBackendResponse> {
try {
const result = await api.fetchVoiceBackend()
// ...
} catch {
return { backend: 'elevenlabs' }
}
}Any failure discovering the backend silently defaults to ElevenLabs. Likely triggers for the catch (any one suffices):
Recommended fix (implement on this branch)
Falsification test (before/after)DevTools → Network → tap voice mic:
Scope
Please do not “fix” by uncommenting Posted from tooling triage (daily driver repro). Implementing agent: treat as blocking for Gemini/Qwen backend selection. |
|
Fixed in Two commits:
Falsification test: DevTools → Network → tap voice mic with |
There was a problem hiding this comment.
Findings
-
[Major] Qwen startup never triggers the initial spoken summary/greeting — after
session.updated, Qwen only starts capture, marks connected, and resolves. Unlike the Gemini path, it never sends aconversation.item.createplusresponse.create, sohapi-voice-proactive=truedoes not speak the startup summary and the reactive path does not greet before waiting. Evidenceweb/src/realtime/QwenVoiceSession.tsx:204.
Suggested fix:const proactive = localStorage.getItem('hapi-voice-proactive') === 'true' if (proactive && config.initialContext) { this.sendTextMessage(`[Context] ${config.initialContext}`) } else { if (config.initialContext) { this.sendContextualUpdate(config.initialContext) } this.sendTextMessage('[Greet the user as HAPI. Say a brief hello and invite them to speak. Do not mention Qwen or any model name.]') }
-
[Major] Socket.IO limit still rejects valid 50 MiB uploads —
MAX_UPLOAD_BYTESvalidates decoded files up to 50 MiB, but the hub sends the file to the CLI as base64 JSON over Socket.IO. A 50 MiB file expands to about 66.7 MiB before JSON overhead, so the new 55 MiB limit still lets REST accept files that fail during RPC. Evidencehub/src/socket/server.ts:69; related size checkhub/src/web/routes/sessions.ts:22.
Suggested fix:maxHttpBufferSize: 68 * 1024 * 1024, // 50 MiB decoded uploads are ~66.7 MiB as base64 JSON
Questions
None.
Summary
Review mode: follow-up after new commits
The prior bot pass is still relevant for the upload-size mismatch, and the updated Qwen backend misses the startup speech path covered by the new proactive setting. No additional findings above 80% confidence.
Testing
Not run (automation review; no PR code executed).
HAPI Bot
There was a problem hiding this comment.
Findings
- [Major] Qwen sends unsupported
conversation.item.createmessages — Qwen's client-events reference only supportsconversation.item.createforfunction_call_output, but this implementation uses it withitem.type: 'message'for startup greetings, proactive updates, and contextual updates. The follow-up change now callssendTextMessage()immediately aftersession.updated, soVOICE_BACKEND=qwen-realtimecan fail before the user speaks. Evidenceweb/src/realtime/QwenVoiceSession.tsx:336. Reference: https://docs.qwencloud.com/api-reference/real-time-multimodal/client-events#conversation-item-create
Suggested fix:type QwenSessionConfig = { modalities: ['text', 'audio'] voice: string input_audio_format: 'pcm16' output_audio_format: 'pcm24' instructions: string temperature: number turn_detection: { type: 'server_vad' threshold: number silence_duration_ms: number prefix_padding_ms: number } tools: Array<unknown> tool_choice: 'auto' } class QwenVoiceSessionImpl implements VoiceSession { private currentSessionConfig: QwenSessionConfig | null = null private updateInstructions(update: string): void { if (!this.currentSessionConfig) return this.currentSessionConfig = { ...this.currentSessionConfig, instructions: `${this.currentSessionConfig.instructions}\n\n[System Context Update]\n${update}` } sendEvent('session.update', { session: this.currentSessionConfig }) } sendTextMessage(message: string): void { this.updateInstructions(message) sendEvent('response.create') } sendContextualUpdate(update: string): void { this.updateInstructions(update) } }
Questions
None.
Summary
Review mode: follow-up after new commits
The earlier upload-size finding is fixed, and the Qwen startup-greeting finding was addressed structurally, but the Qwen implementation still sends unsupported user-message items to the realtime API. No other findings above 80% confidence.
Testing
Not run (automation review; no PR code executed).
HAPI Bot
a0d7905 to
7f70bea
Compare
There was a problem hiding this comment.
Findings
No issues found above 80% confidence.
Questions
None.
Summary
Review mode: follow-up after new commits
No actionable regressions found in the latest full diff. The previous Qwen conversation.item.create issue appears addressed; residual risk remains around provider integration coverage, especially Qwen Realtime end-to-end, which the PR test plan marks untested.
Testing
Not run (automation review; no PR code executed).
HAPI Bot
…toggle label - buildGeminiLiveConfig() now accepts optional language param; appends VOICE_CHINESE_LANGUAGE_BLOCK only when language === 'zh' - GeminiLiveVoiceSession passes config.language through - QwenVoiceSession conditionally builds basePrompt from language setting - Fixes silent no-op when user selects Chinese in voice settings on Gemini/Qwen backends (was ElevenLabs-only) - Rename voice-start toggle label to 'Start voice session with summary' - Fix description: clarifies the choice is about session-open behaviour (summary vs greeting), not ongoing narration via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
Gemini Live has no built-in first-message like ElevenLabs agents do; without an explicit turnComplete:true it sits silently. In reactive mode (default, toggle off) now sends a greeting instruction after any silent context feed so Gemini introduces itself and invites the user to speak. Proactive mode is unchanged: the context summary is the opening speech. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
…reeting - VOICE_SYSTEM_PROMPT: explicit instruction never to call itself Gemini, Google, or any underlying model/provider name — always HAPI - Greeting trigger text: instruct to greet as HAPI only, suppress model name and any reference to context/recent activity in the opening line via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
Gemini + Qwen client:
- onerror now sets setupDone/sessionReady and nulls state.ws before
calling reject(), so the stale-close guard trips in onclose and
prevents a duplicate statusCallback('error') on WS failure
Gemini client:
- Proactive mode with no initialContext now falls through to the
greeting trigger instead of sitting silently
- Remove unused handleBargeIn callback (dead code)
Qwen client:
- Add input_audio_sample_rate: 16000 to session.update so PCM rate
is declared explicitly rather than relying on DashScope's default
Hub proxy:
- Remove no-op ternary in Gemini flush loop and message handler
(typeof x === 'string' ? x : x); use upstream.send(msg) directly
- Qwen onerror now calls upstreamMap.delete() before closing client,
eliminating the stale map entry window
- Align Qwen hub fallback model string with QWEN_REALTIME_MODEL
constant ('qwen3-omni-flash-realtime')
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
hub/voice.ts:
- Replace string-concat WS URL construction with buildVoiceWsUrl() which
uses URL API to set protocol/pathname cleanly — fixes double-slash when
HAPI_PUBLIC_URL has a trailing slash (would silently skip the proxy route)
QwenVoiceSession.tsx:
- Wrap tool definitions in {type:'function', function:{...}} as required
by Qwen-Omni realtime schema — previous flat shape caused session.update
rejection before audio capture could start
- Use pcm16/pcm24 audio formats matching DashScope spec; remove
input_audio_sample_rate (encoded in format name)
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
…ose codes
GeminiLiveVoiceSession + QwenVoiceSession:
- startAudioCapture() is now async and awaits recorder.start() before
calling setMuted() — previously setMuted ran before getUserMedia resolved
so a session restarted while muted would open the mic anyway
- statusCallback('connected') now fires after audio is ready
- setMuted() called unconditionally (not just when true) to correctly
apply saved state in either direction
hub/src/web/server.ts:
- Both Gemini and Qwen close() handlers now pass the client code through
toClientCloseCode() before forwarding to upstream — prevents reserved
codes (e.g. 1006) from causing WebSocket.close() to throw and leave
the upstream session open until provider timeout
- Reason string capped at 123 bytes (WebSocket protocol limit)
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
An unhandled rejection inside the async onmessage callback does not
propagate to the outer startSession Promise — the UI hangs on
'connecting' and the provider socket stays partially open. Wrapping
the await in try/catch calls cleanup()/statusCallback('error')/reject()
so failures surface correctly.
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
…alling back to ElevenLabs
fetchVoiceBackend no longer catches errors and defaults to 'elevenlabs' — any
network or server failure now throws so VoiceBackendSession can surface it via
onStatusChange('error', ...) rather than silently mounting the wrong backend.
VoiceBackendSession also resets backend state to null when api changes, so
a stale ElevenLabs registration from a prior discovery cannot persist into
a new session.
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
…alling back to ElevenLabs Unknown backend strings (future values, typos) now throw rather than defaulting to elevenlabs, closing the narrow remaining form of the original misrouting bug. Also removes the unnecessary `as VoiceBackendResponse` cast. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
…r base64 uploads Qwen session.updated handler now sends the same proactive summary or greeting trigger that Gemini does — previously it started silently in both proactive and reactive modes. maxHttpBufferSize raised to 68 MiB to account for base64 expansion: 50 MiB decoded files become ~66.7 MiB as base64 JSON, so the previous 55 MiB ceiling would disconnect uploads above ~41 MiB before they reached the CLI. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
….update for Qwen text Qwen's realtime API only supports conversation.item.create for function_call_output. Sending it with type:'message' for greetings/context was invalid and could fail before the user spoke. sendTextMessage and sendContextualUpdate now update session instructions via session.update (accumulating context into the system prompt) and trigger response.create only when a spoken reply is needed — matching Qwen's supported client event surface. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
…n start session.updated now returns early after the first ack — subsequent session.update calls (instruction appends) also echo session.updated but must not re-trigger audio capture or the greeting path. currentSessionConfig is now reset to null at the top of startSession so a stale config from a failed previous session cannot leak into the new one. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
Without this guard, a missing wsUrl in the hub token response would silently attempt to connect directly to Google with "proxied" as the API key — producing a confusing auth failure instead of a clear error. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
DashScope realtime API accepts only 'pcm' for both input and output audio formats. The pcm16/pcm24 values caused session.update rejection before audio capture could start, leaving the Qwen backend unusable. Also updates the default voice from Mia (not in the qwen3-omni-flash- realtime voice list) to Cherry, which is documented as supported. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
Failed token fetch, microphone denial, or WebSocket error during setup left state.playbackContext open. Each failure path now calls cleanup() before throwing/rejecting, preventing AudioContext leaks on mobile browsers with hard limits on concurrent contexts. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
Reverts changes to files that shouldn't differ from upstream: - .gitignore: remove fork-only AGENTS.local.md entry - web/src/App.tsx: restore dual-subscription SSE pattern (scope-aware) - web/src/hooks/useSSE.ts: restore SSEScope/scope parameter - web/src/hooks/useSSE.test.ts: restore (was accidentally deleted) - web/src/lib/appSseSubscriptions.ts: restore (was accidentally deleted) - web/src/lib/appSseSubscriptions.test.ts: restore (was accidentally deleted) - hub/src/sync/syncEngine.ts: restore (off-topic change)
Hub sends HAPI-owned Gemini setup on proxy connect and rejects client setup frames. Qwen proxy always uses QWEN_REALTIME_MODEL instead of a client query parameter. Shared buildGeminiLiveSetupMessage() keeps wire format in one place. Co-authored-by: Cursor <cursoragent@cursor.com>
57fb9f8 to
a4e8b25
Compare
There was a problem hiding this comment.
Findings
None.
Questions
None.
Summary
Review mode: follow-up after new commits
No high-confidence issues found in the latest full diff. Residual risk: Qwen Realtime remains unverified end-to-end per the PR test plan, and the new hub WebSocket proxy paths would still benefit from integration coverage for upgrade/auth/upstream relay behavior, but I did not find a concrete regression in the changed lines.
Testing
Not run (automation).
HAPI Bot
Mirror the Gemini proxy security model for Qwen: - Hub sends initial session.update (voice/tools/instructions) on upstream connect so the browser cannot override config fields. - Proxy message() now calls isQwenSafeClientFrame() and closes the connection (1008) if a client session.update touches any field other than 'instructions' (blocks tool/voice/modality overrides). - QwenVoiceSession no longer sends session.update on session.created; it waits for the hub-relayed session.updated and then sends only instruction-only updates for context/proactive content. - Language passed as query param (?language=zh) so hub builds the correct Chinese system prompt without a client-supplied session.update. - buildQwenSessionUpdateMessage() and isQwenSafeClientFrame() added to @hapi/protocol/voice; 9 new unit tests cover filter edge cases.
|
Fixed in The Qwen proxy now mirrors the Gemini pattern:
|
There was a problem hiding this comment.
Findings
- [Major] Qwen tool declarations use the chat-completions shape — the Qwen path is sending Realtime
session.updateevents, but the new hub-owned setup nests each tool underfunction. Realtime tool entries need the callable fields on the tool object itself; with the current payload DashScope can reject setup or never exposemessageCodingAgent/processPermissionRequest, so voice requests and permission approvals will not reach HAPI. Evidenceshared/src/voice.ts:376.
Suggested fix:const tools = VOICE_TOOL_DEFINITIONS.map((td) => ({ type: 'function' as const, name: td.name, description: td.description, parameters: td.parameters }))
Questions
None.
Summary
Review mode: follow-up after new commits
Found one major issue in the latest full diff. Residual risk: Qwen Realtime is still marked untested end-to-end in the PR test plan, and the new hub WebSocket proxy paths would benefit from integration coverage around upgrade/auth/upstream relay behavior.
Testing
Not run (automation; review only).
HAPI Bot
…ring DashScope requires session.update to be sent AFTER session.created is received, not immediately on WebSocket open. Previously the hub sent session.update in upstream.onopen, which violated this ordering and risked the config being processed in an uninitialized session context. Add pendingSetupMap to buffer the hub-owned session.update payload. The onmessage handler now relays session.created to the browser first, then immediately sends the pending session.update to DashScope — matching the protocol ordering the old browser-side code used (which waited for session.created before sending session.update). Also remove maxHttpBufferSize from the socket.io Engine config. That setting is unrelated to voice backends; upstream/main had no such limit set and it is not introduced by this PR.
|
Two additional fixes in Fix 1 — Qwen session.created→session.update protocol orderingThe hub was sending the hub-owned Fix: added Fix 2 — Remove off-topic socket.io buffer change
|
There was a problem hiding this comment.
Findings
- [Major] Qwen setup still uses the chat-completions tool shape, so the Realtime session can start without callable HAPI tools or reject the setup entirely. Evidence
shared/src/voice.ts:376.
Suggested fix:const tools = VOICE_TOOL_DEFINITIONS.map((td) => ({ type: 'function' as const, name: td.name, description: td.description, parameters: td.parameters }))
Questions
None.
Summary
Review mode: follow-up after new commits
Found one unresolved major issue in the latest full diff. The follow-up commits improved Qwen setup ordering, but the hub-owned session.update payload still declares tools under function, and the new test locks in that incompatible shape. Residual risk: Qwen Realtime remains untested end-to-end in the PR plan, and the new WebSocket proxy paths still lack integration coverage around upgrade/auth/upstream relay behavior.
Testing
Not run (automation; review only).
HAPI Bot
…-completions)
Qwen Realtime session.update expects tools as flat objects:
{ type: 'function', name, description, parameters }
The previous code used the chat-completions shape:
{ type: 'function', function: { name, description, parameters } }
DashScope may reject session.update or silently ignore tools with the
nested shape, causing tool calls to fail at runtime. Fix applied in
buildQwenSessionUpdateMessage(); test updated to assert flat shape and
that no nested `function` key is present.
There was a problem hiding this comment.
Findings
None.
Questions
None.
Summary
Review mode: follow-up after new commits
No high-confidence issues found in the latest full diff. The prior HAPI Bot finding on Qwen Realtime tool shape appears fixed: shared/src/voice.ts now emits flat Realtime tool definitions and shared/src/voice.gemini.test.ts asserts there is no nested function field. Residual risk remains around the new Gemini/Qwen WebSocket proxy paths because the PR adds route-level tests and shared payload tests, but not an integration test that exercises upgrade/auth/upstream relay behavior.
Testing
Not run (automation; local bun was unavailable in this review environment).
HAPI Bot
…service Live-tested against DashScope international API: - Model: qwen3-omni-flash-realtime → qwen3.5-omni-flash-realtime (previous model ID did not exist on DashScope) - Default voice: Cherry → Tina (confirmed from session.created response on qwen3.5-omni-flash-realtime) - Default WS base: dashscope.aliyuncs.com → dashscope-intl.aliyuncs.com (international accounts use the -intl endpoint; China endpoint rejects international API keys; QWEN_REALTIME_WS_URL env var still overrides)
There was a problem hiding this comment.
Findings
- [Major] Qwen tools are declared with the wrong wire shape — Qwen Realtime expects function tools under a nested
functionobject, but this maps them as flat{ type, name, description, parameters }. That means DashScope will not registermessageCodingAgent/processPermissionRequest, so the new Qwen backend can talk but cannot reliably drive the coding agent or handle approvals. Evidence:shared/src/voice.ts:375.
Suggested fix:const tools = VOICE_TOOL_DEFINITIONS.map((td) => ({ type: 'function' as const, function: { name: td.name, description: td.description, parameters: td.parameters } }))
Questions
None.
Summary
Review mode: follow-up after new commits
The previous bot pass had no findings. Reviewing the full current diff, I found one Qwen backend regression: the session setup advertises tools in a shape the Qwen Realtime API does not consume, so function calling for sending messages and processing permissions is effectively disabled for that backend. Residual risk remains around Gemini/Qwen WebSocket proxy behavior because local automation could not run here and there is still no integration test covering upgrade/auth/upstream relay.
Testing
Not run (automation; bun is unavailable in this review environment).
HAPI Bot
| ? `${VOICE_SYSTEM_PROMPT}${VOICE_CHINESE_LANGUAGE_BLOCK}` | ||
| : VOICE_SYSTEM_PROMPT | ||
| // Qwen Realtime uses the flat Realtime shape, not the chat-completions nested {function:{...}} shape. | ||
| const tools = VOICE_TOOL_DEFINITIONS.map((td) => ({ |
There was a problem hiding this comment.
[Major] Qwen tools are declared with the wrong wire shape. Qwen Realtime function tools need the metadata nested under function, but this maps each tool as flat { type, name, description, parameters }, so DashScope will not register messageCodingAgent / processPermissionRequest and the Qwen backend cannot reliably drive the coding agent or handle approvals.
Suggested fix:
const tools = VOICE_TOOL_DEFINITIONS.map((td) => ({
type: 'function' as const,
function: {
name: td.name,
description: td.description,
parameters: td.parameters
}
}))
Overview
Rebased and completed @Overbaker's #401 onto current `main` after it went fallow (~4 weeks). All original design and implementation credit belongs to @Overbaker and @TennyDDDD — I've only done the merge work and fixed up the test runner.
Adds a pluggable voice backend architecture extending the existing ElevenLabs integration with two new providers:
Changes from original PR
Configuration
```bash
Gemini Live (free tier available, full function calling)
VOICE_BACKEND=gemini-live
GEMINI_API_KEY=your-google-api-key # also accepts GOOGLE_API_KEY
Qwen Realtime (voice conversation; function calling limited by model support)
VOICE_BACKEND=qwen-realtime
DASHSCOPE_API_KEY=your-dashscope-key # also accepts QWEN_API_KEY
ElevenLabs (default — unchanged)
VOICE_BACKEND=elevenlabs
ELEVENLABS_API_KEY=your-elevenlabs-key
```
Files changed
Voice settings additions
"Start voice session with summary" toggle (Settings → Voice Assistant, default off)
Surfaced from @Overbaker's original design intent in #401: rather than silently dropping the proactive narration behaviour, it's now user-configurable.
Backend behaviour notes:
Language setting now respected by all three backends. Selecting Chinese in voice settings appends `VOICE_CHINESE_LANGUAGE_BLOCK` to the system prompt for Gemini and Qwen (previously ignored; ElevenLabs uses its own language override field).
Test plan