fix(google): surface context exhaustion errors#1841
fix(google): surface context exhaustion errors#1841rosetta-livekit-bot[bot] wants to merge 2 commits into
Conversation
🦋 Changeset detectedLatest commit: bf0905d The changes in this PR will be included in the next version bump. This PR includes changesets to release 35 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
There was a problem hiding this comment.
🟡 Stale sessionError not cleared between retry iterations can terminate a healthy session
The new sessionError field is set by onclose, sendTask, and onReceiveMessage callbacks, and is intended to be consumed (and cleared) at plugins/google/src/realtime/realtime_api.ts:1097-1101. However, if cancelAndWait at line 1095 throws (e.g., due to a 2-second timeout), execution jumps directly to the catch block, skipping the sessionError check. The stale sessionError is never cleared — neither by closeActiveSession() (plugins/google/src/realtime/realtime_api.ts:523-544) nor at the top of the next while iteration (line 993-997). On the next iteration, after a potentially successful new session, the stale sessionError is discovered at line 1097, thrown, and processed. If the stale error was a 1007 context-exhaustion error, isContextExhaustedError would return true at line 1110, causing the perfectly healthy new session to be terminated.
Was this helpful? React with 👍 or 👎 to provide feedback.
|
|
||
| // WebSocket close codes (RFC 6455) | ||
| const WS_CLOSE_NORMAL = 1000; | ||
| const WS_CLOSE_CONTEXT_EXHAUSTED = 1007; |
There was a problem hiding this comment.
🚩 WebSocket code 1007 is RFC 6455 'Invalid frame payload data', not a standard context exhaustion code
The constant WS_CLOSE_CONTEXT_EXHAUSTED = 1007 is placed under the comment // WebSocket close codes (RFC 6455). However, RFC 6455 defines code 1007 as 'Invalid frame payload data' (non-UTF-8 data in a text frame). Google's use of 1007 for context exhaustion is a non-standard application-specific meaning. The naming and comment could mislead future maintainers into thinking this is a standard code. Consider adding a note like // Google-specific: Gemini uses 1007 for context exhaustion (RFC 6455 defines 1007 as 'Invalid frame payload data').
Was this helpful? React with 👍 or 👎 to provide feedback.
502a2b8 to
b8d5af0
Compare
| private isContextExhaustedError(error: unknown): boolean { | ||
| return ( | ||
| (typeof error === 'object' && | ||
| error !== null && | ||
| 'statusCode' in error && | ||
| error.statusCode === WS_CLOSE_CONTEXT_EXHAUSTED) || | ||
| String(error).includes(String(WS_CLOSE_CONTEXT_EXHAUSTED)) | ||
| ); | ||
| } |
There was a problem hiding this comment.
🔴 Overly broad string-based check in isContextExhaustedError can cause false-positive session termination
The fallback branch String(error).includes(String(WS_CLOSE_CONTEXT_EXHAUSTED)) expands to String(error).includes('1007'), which matches ANY error whose string representation contains the substring "1007" — not just context exhaustion errors. This is problematic because sessionError can be set from sendTask (line 1234) or onReceiveMessage (line 1348), where the error could be an arbitrary SDK/network error. If such an error's message or stringified form incidentally contains "1007" (e.g., token counts like 10071, request IDs, byte counts like "frame size 10070"), the session will be incorrectly terminated as "context exhausted" instead of retried.
How the false positive leads to session termination
sendTaskcatches a non-context-exhaustion error (e.g., fromsession.sendRealtimeInput)- Sets
this.sessionError = this.toError(e)(a plainErrorwithoutstatusCode) - In
#mainTask, the error is thrown and caught isContextExhaustedError(err)— first branch fails (nostatusCode), butString(err).includes('1007')matches- Session terminates with "context exhausted" instead of retrying
The primary onclose code path already handles the real 1007 case via the statusCode property check, making the string fallback unnecessary for the intended flow but dangerous for other errors.
| private isContextExhaustedError(error: unknown): boolean { | |
| return ( | |
| (typeof error === 'object' && | |
| error !== null && | |
| 'statusCode' in error && | |
| error.statusCode === WS_CLOSE_CONTEXT_EXHAUSTED) || | |
| String(error).includes(String(WS_CLOSE_CONTEXT_EXHAUSTED)) | |
| ); | |
| } | |
| private isContextExhaustedError(error: unknown): boolean { | |
| return ( | |
| typeof error === 'object' && | |
| error !== null && | |
| 'statusCode' in error && | |
| error.statusCode === WS_CLOSE_CONTEXT_EXHAUSTED | |
| ); | |
| } | |
Was this helpful? React with 👍 or 👎 to provide feedback.
| this.sessionError = this.toError(e); | ||
| this.markRestartNeeded(); | ||
| } |
There was a problem hiding this comment.
🚩 Behavioral change: sendTask/onReceiveMessage errors now count against retry budget
Prior to this PR, errors in sendTask and onReceiveMessage were logged and caused a silent restart (via markRestartNeeded()) without entering the catch block or incrementing numRetries. With the new sessionError mechanism (lines 1234, 1348), these errors are now thrown into the catch block (plugins/google/src/realtime/realtime_api.ts:1108-1111) and go through the full retry logic including numRetries++ at line 1159. This means repeated transient send/receive failures (e.g., network blips mid-session) will now accumulate toward maxRetries and eventually terminate the session. Previously they would retry indefinitely. The numRetries counter is reset at plugins/google/src/realtime/realtime_api.ts:1342-1343 only when a message is successfully received, so a session that repeatedly fails mid-send before receiving any response from the new connection will now terminate. This may be intentional (preventing infinite retry loops, as the existing TODO(brian): handle error from tasks comment at line 1100 suggests), but it's a significant behavioral change beyond the stated scope of handling 1007 errors.
Was this helpful? React with 👍 or 👎 to provide feedback.
| }); | ||
| } | ||
|
|
||
| this.emitError(err, true); |
There was a problem hiding this comment.
🚩 New emitError(err, true) call emits recoverable errors on every retry attempt
Line 1147 adds this.emitError(err, true) which is new behavior — previously, errors were only emitted as non-recoverable when retries were exhausted. Now every retry attempt emits a recoverable error event to consumers. This is likely intentional for better observability, but callers subscribed to the 'error' event will now receive more frequent error notifications during transient network issues. If downstream consumers take action on every error event (e.g., logging prominently, alerting), this could increase noise.
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
1007as an unrecoverable context exhaustion error.Testing
pnpm exec prettier --check plugins/google/src/realtime/realtime_api.ts .changeset/google-context-exhaustion.mdpnpm build:agentspnpm --filter @livekit/agents-plugins-test buildpnpm --filter @livekit/agents-plugin-silero buildpnpm --filter @livekit/agents-plugin-openai build:typespnpm --filter @livekit/agents-plugin-google build:typespnpm --filter @livekit/agents-plugin-google lint(warnings only: pre-existingno-explicit-anywarnings in realtime logging helpers)Ported from livekit/agents#6144
Original PR description
No description.