bug(live): zombie WebSocket session after LiveRequestQueue.close() — periodic 'operation cancelled' errors in Cloud Audit Logs #5228

@TonyLee-AI

Describe the bug

run_live() contains a while True: reconnection loop intended for session resumption. However, this loop has no way to distinguish between:

  • An intentional client-side shutdown via LiveRequestQueue.close()
  • An unintentional network drop that should trigger reconnection

As a result, calling LiveRequestQueue.close() does not actually terminate the live session. After the application code believes the session has ended, run_live() silently re-establishes a new WebSocket connection (a "zombie" session) without any notification to the caller.

Steps to reproduce

  1. Start a live session with agent_runner.run_live() and consume events via async for event in live_events:
  2. Call live_request_queue.close() to signal session end
  3. Wait — the underlying WebSocket connection will be re-established automatically by the reconnect loop
  4. Observe in Cloud Audit Logs (or server-side logs): periodic "The operation was cancelled." errors at ~10-minute intervals indefinitely

A minimal reproduction is possible with the official bidi-demo sample:
https://github.com/google/adk-samples/tree/main/python/agents/bidi-demo/app

After sending a message and then leaving the session idle, the zombie session and its periodic cancellations persist indefinitely.

Expected behavior

Calling LiveRequestQueue.close() should fully terminate the live session. run_live() should exit cleanly without reconnecting.
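
The expectation can be phrased as a check: once a close has been requested, draining the event stream should finish within a short timeout. A minimal sketch follows, using a stand-in async generator rather than ADK itself. fake_run_live, the _CLOSE sentinel and the 1-second timeout are illustrative assumptions, not ADK code.

```python
import asyncio

_CLOSE = object()  # hypothetical sentinel standing in for a close request


async def fake_run_live(queue: asyncio.Queue):
    """Stand-in for run_live(): yields events until a close request arrives."""
    while True:
        item = await queue.get()
        if item is _CLOSE:
            return  # expected behavior: exit cleanly, no reconnect
        yield f"event:{item}"


async def drain(queue: asyncio.Queue) -> list[str]:
    """Consume the stream the way application code would."""
    return [event async for event in fake_run_live(queue)]


async def session_terminates(timeout_s: float = 1.0) -> bool:
    queue: asyncio.Queue = asyncio.Queue()
    queue.put_nowait("hello")
    queue.put_nowait(_CLOSE)
    try:
        events = await asyncio.wait_for(drain(queue), timeout=timeout_s)
    except asyncio.TimeoutError:
        return False  # stream still alive after close: the zombie case
    return events == ["event:hello"]


terminated = asyncio.run(session_terminates())
```

Against the real run_live(), the same shape of check would hit the timeout branch whenever the reconnect loop keeps the generator alive.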

Observed behavior

Scenario A (session resumption handle present): run_live() catches APIError(1000) (normal WebSocket close), finds a session handle, and calls continue, reconnecting despite the intentional close.

Scenario B (no session resumption handle): run_live() catches APIError(1000), finds no handle, logs a spurious ERROR ("APIError in live flow: 1000 None."), and re-raises, treating a clean close as an error.

In both cases, a zombie connection is either kept alive or repeatedly re-established. The Gemini Live server cancels idle connections after ~10 minutes, which surfaces as:

ERROR: "The operation was cancelled." (gRPC code 1)

in Cloud Audit Logs — repeated indefinitely at ~10-minute intervals, even long after the application believes the session has ended.

The Google auth token refresh cycle visible in debug logs confirms the zombie connection remains active:

[DEBUG] google.auth.transport.requests: Making request...    # every ~10 min
[DEBUG] google.auth.transport.requests: Response received...

No application-level logs appear — the zombie reconnect is completely transparent to user code.
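
To surface these refresh lines while investigating, DEBUG output can be enabled for that one logger using only the standard logging module. The logger name is taken from the log lines above; keeping the root level at INFO is an arbitrary choice for this sketch.

```python
import logging

# Root logger stays at INFO so unrelated libraries remain quiet.
logging.basicConfig(level=logging.INFO)

# Only the auth transport logger named in the excerpt is raised to DEBUG.
auth_logger = logging.getLogger("google.auth.transport.requests")
auth_logger.setLevel(logging.DEBUG)
```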

Environment

  • google-adk version: 1.22.1 (also reproduced on latest)
  • google-genai version: 1.59.0+
  • Python version: 3.12
  • OS: Linux
  • Model: gemini-live-2.5-flash (Vertex AI)
  • Method: google.cloud.aiplatform.v1beta1.LlmBidiService.BidiGenerateContent

Regression

This affects all versions of google-adk that include the while True: reconnection loop in run_live() (introduced with session resumption support). PR #5007 did not address this case: it fixed the opposite problem, where the session resumption loop never iterated.

Logs

Cloud Audit Log (repeated every ~10 minutes after session is believed closed):

{
  "protoPayload": {
    "status": { "code": 1, "message": "The operation was cancelled." },
    "methodName": "google.cloud.aiplatform.v1beta1.LlmBidiService.BidiGenerateContent"
  },
  "severity": "ERROR"
}
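
To confirm the ~10-minute cadence from exported entries, the matching errors can be filtered and the gaps between them computed. This is a rough sketch: the "timestamp" field and the sample entries are assumptions for illustration (the excerpt above does not show a timestamp), and the helper names are made up.

```python
import json
from datetime import datetime

METHOD = "google.cloud.aiplatform.v1beta1.LlmBidiService.BidiGenerateContent"


def is_cancelled_bidi_error(entry: dict) -> bool:
    """True for entries shaped like the audit log excerpt above."""
    payload = entry.get("protoPayload", {})
    return (
        entry.get("severity") == "ERROR"
        and payload.get("status", {}).get("code") == 1
        and payload.get("methodName") == METHOD
    )


def gaps_in_minutes(entries: list[dict]) -> list[float]:
    """Minutes between consecutive matching entries, oldest first."""
    times = sorted(
        datetime.fromisoformat(e["timestamp"])  # assumed field name and format
        for e in entries
        if is_cancelled_bidi_error(e)
    )
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


# Made-up sample data, for illustration only.
sample = json.loads("""
[
  {"timestamp": "2025-01-01T12:00:00+00:00", "severity": "ERROR",
   "protoPayload": {"status": {"code": 1, "message": "The operation was cancelled."},
                    "methodName": "google.cloud.aiplatform.v1beta1.LlmBidiService.BidiGenerateContent"}},
  {"timestamp": "2025-01-01T12:05:00+00:00", "severity": "INFO", "protoPayload": {}},
  {"timestamp": "2025-01-01T12:10:00+00:00", "severity": "ERROR",
   "protoPayload": {"status": {"code": 1, "message": "The operation was cancelled."},
                    "methodName": "google.cloud.aiplatform.v1beta1.LlmBidiService.BidiGenerateContent"}}
]
""")
gaps = gaps_in_minutes(sample)
```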

Application debug logs (every ~10 minutes — auth refresh for zombie connection):

[DEBUG] google.auth.transport.requests: Making request...
[DEBUG] google.auth.transport.requests: Response received...

Root cause

In base_llm_flow.py, run_live()'s exception handlers cannot tell whether APIError(1000) / ConnectionClosed originated from:

  • LiveRequestQueue.close() calling llm_connection.close() (intentional)
  • A server-side or network-triggered close (unintentional)

except errors.APIError as e:
    if e.code in [1000, 1006]:
        if invocation_context.live_session_resumption_handle:
            continue  # reconnects even after intentional close!
    logger.error('APIError in live flow: %s', e)  # spurious error if no handle
    raise
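
The branch structure of that excerpt can be restated as a small pure function, which makes the gap easier to see: the only inputs are the close code and the resumption handle, so nothing tells the handler who initiated the close. close_outcome and its return strings are hypothetical names used for illustration only.

```python
from typing import Optional


def close_outcome(code: int, resumption_handle: Optional[str]) -> str:
    """Mirror the decision made by the handler excerpt above."""
    if code in (1000, 1006) and resumption_handle:
        return "reconnect"  # Scenario A: loop continues
    return "log-error-and-raise"  # Scenario B: clean close reported as error


# The same inputs arise for a client-side close() and for a network drop.
scenario_a = close_outcome(1000, "some-handle")
scenario_b = close_outcome(1000, None)
```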

Proposed fix

PR #5226 addresses this by adding an is_closed flag to LiveRequestQueue that is set synchronously in close(). run_live()'s exception handlers check this flag before attempting to reconnect:

if e.code == 1000 and invocation_context.live_request_queue.is_closed:
    logger.info('Live session for agent %s closed by client request.', ...)
    return  # clean exit, no reconnect
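
The effect of the flag can be sketched in isolation. Everything below is a stand-in, not the code in PR #5226: SketchRequestQueue, FakeAPIError and run_live_sketch are hypothetical, every simulated connection ends with a normal close (1000), and max_attempts replaces while True so that the example terminates.

```python
import asyncio


class FakeAPIError(Exception):
    """Stand-in for errors.APIError carrying a WebSocket close code."""

    def __init__(self, code: int):
        super().__init__(f"{code} None")
        self.code = code


class SketchRequestQueue:
    """Hypothetical stand-in for LiveRequestQueue with the proposed flag."""

    def __init__(self) -> None:
        self.is_closed = False

    def close(self) -> None:
        self.is_closed = True  # set synchronously, before any await


async def run_live_sketch(queue, resumption_handle, max_attempts=3) -> list[str]:
    trace = []
    for attempt in range(1, max_attempts + 1):
        trace.append(f"connect#{attempt}")
        try:
            await asyncio.sleep(0)  # placeholder for streaming work
            raise FakeAPIError(1000)  # simulated normal close
        except FakeAPIError as e:
            if e.code == 1000 and queue.is_closed:
                trace.append("clean-exit")  # intentional close: stop here
                return trace
            if e.code in (1000, 1006) and resumption_handle:
                continue  # treated as a drop: reconnect
            raise
    trace.append("gave-up")
    return trace


closed_queue = SketchRequestQueue()
closed_queue.close()
closed_trace = asyncio.run(run_live_sketch(closed_queue, "handle"))
open_trace = asyncio.run(run_live_sketch(SketchRequestQueue(), "handle"))
```

With the queue closed, the sketch connects once and exits. With the queue still open, it keeps reconnecting until the attempt cap, which is the behavior the unbounded loop shows today.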

Additional context

  • Google Cloud Support confirmed: "simply pushing a 'close' message or sentinel to the LiveRequestQueue is not sufficient to fully terminate the underlying bidirectional streaming connection"
  • Discussed in GitHub Discussion #4156

Labels: live ([Component] This issue is related to live, voice and video chat)
