LLM providers: detect dead/half-open backend connections fast so failover actually triggers #1253

@Aaronontheweb

Observed failure

A main-session turn (session D0AC6CKBK5K/1780249736.522049, not a sub-agent) fired an LLM streaming call to its Main provider (spark-362c, vLLM Qwen/Qwen3.6-35B-A3B-FP8) at 19:33:44. The session was concurrently doing DGX Spark vLLM reconfiguration, and spark-362c was restarted/reloaded while the request was in flight (its /v1/models reported a load time of ~19:50).

The in-flight request kept an ESTABLISHED TCP connection on which nothing arrived: zero bytes, no tokens, no prompt_progress keepalives, and no RST. It hung for a full 20 minutes until the session's no-progress watchdog cancelled it:

19:33:44.889  turn_llm_call_start
19:53:44.897  Processing watchdog expired ... (noProgress=True)
19:53:44.899  turn_failed category=Timeout / turn_buffer_drain count=3

The turn failed (and 3 buffered follow-ups were released) — but there was no failover to the configured, healthy fallback provider (spark-acad), even though spark-acad was up the entire time.

Root cause

Two things combine:

  1. No transport-level dead-peer detection. ProviderPluginBase.CreateLlmHttpClient (src/Netclaw.Providers/ProviderPluginBase.cs:48-53) builds the LLM HttpClient on a plain HttpClientHandler with only Timeout = TimeSpan.FromHours(1): no SocketsHttpHandler.KeepAlivePingPolicy, no TCP keepalive, no connect/time-to-first-byte timeout. A half-open connection therefore makes the streaming MoveNextAsync() block, without throwing, until something else cancels it or the 1-hour ceiling is reached.

  2. Failover is exception-driven. FailoverChatClient.StreamWithFailoverAsync (src/Netclaw.Daemon/Configuration/FailoverChatClient.cs:124-155) fails over to the fallback when the primary throws before the first chunk. A silent hang throws nothing, so this path is never reached. The only thing that eventually breaks the hang is the ProcessingWatchdog cancellation (~20 min), which is — correctly — excluded from failover (!cancellationToken.IsCancellationRequested), since cancellation signals a deliberate abort. So the turn just fails; the fallback is never tried.
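The interaction in point 2 can be illustrated with a minimal, self-contained sketch. This is not the actual FailoverChatClient code; the names FailoverSketch, Backend, and Provider are hypothetical stand-ins, and chunks are modelled as plain strings. It only shows the general pattern: fail over when the first MoveNextAsync() throws, unless the caller's token was cancelled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical backend states used only for this illustration.
public enum Backend { Healthy, Refuses, HalfOpen }

public static class FailoverSketch
{
    // Hypothetical stand-in for a provider's streaming call.
    public static async IAsyncEnumerable<string> Provider(
        string name, Backend state,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        // A dead backend that resets the connection surfaces as an exception.
        if (state == Backend.Refuses) throw new IOException($"{name}: connection reset");
        // A half-open connection is silent: nothing is thrown until cancellation.
        if (state == Backend.HalfOpen) await Task.Delay(Timeout.Infinite, ct);
        await Task.Yield();
        yield return $"{name}:chunk";
    }

    // Pre-first-chunk failover: only an exception on the first read
    // (and not a cancellation) switches to the fallback.
    public static async IAsyncEnumerable<string> StreamWithFailoverAsync(
        Func<CancellationToken, IAsyncEnumerable<string>> primary,
        Func<CancellationToken, IAsyncEnumerable<string>> fallback,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        IAsyncEnumerator<string> source = primary(ct).GetAsyncEnumerator(ct);
        try
        {
            bool hasItem;
            try
            {
                hasItem = await source.MoveNextAsync();
            }
            catch (Exception) when (!ct.IsCancellationRequested)
            {
                // Primary failed before producing anything: try the fallback.
                await source.DisposeAsync();
                source = fallback(ct).GetAsyncEnumerator(ct);
                hasItem = await source.MoveNextAsync();
            }

            while (hasItem)
            {
                yield return source.Current;
                hasItem = await source.MoveNextAsync();
            }
        }
        finally
        {
            await source.DisposeAsync();
        }
    }
}
```

In this sketch, a primary in the Refuses state yields the fallback's chunk, while a primary in the HalfOpen state produces nothing until the token is cancelled; the resulting OperationCanceledException is then excluded by the filter and the fallback is never consulted, which mirrors the observed failure.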

The no-progress watchdog (the recently-added keepalive-immune deadline) did its job as a backstop by preventing an open-ended hang, but it is intentionally generous: it must tolerate legitimately slow cold prefills, which can be silent for 10+ minutes. It is not, and should not be, a fast failure detector.

Impact

Any backend that dies or restarts mid-request — common during model reloads, node maintenance, or autoscaling on self-hosted fleets — wedges the turn for up to the no-progress budget (default 20 min) with no automatic failover, despite a healthy fallback provider being configured. The user must resend.

Proposed fix

Make a dead/half-open backend connection surface as a connection exception in seconds, so the existing FailoverChatClient pre-first-chunk path fails over to the fallback automatically:

  1. Primary: transport-level dead-peer detection on the LLM HttpClient. Switch CreateLlmHttpClient to a configured SocketsHttpHandler with:

    • KeepAlivePingPolicy / KeepAlivePingDelay / KeepAlivePingTimeout for HTTP/2 backends, and/or
    • TCP keepalive (via ConnectCallback setting SO_KEEPALIVE + aggressive idle/interval) for HTTP/1.1 backends such as vLLM (uvicorn) and llama-server, and
    • a ConnectTimeout.

    Key property that resolves the tension with slow prefill: a keepalive ping probes whether the peer is alive, independent of whether the model is producing tokens. A legitimately slow 10-minute cold prefill answers pings (peer alive → no false timeout); a restarted/dead backend does not (ping unanswered → connection aborted in seconds → exception → failover). This is exactly the signal we want, and it does not penalize a slow-but-healthy prefill the way a blanket data-inactivity timeout would.

  2. Optional/complementary: consider distinguishing a no-progress watchdog timeout from a user cancel so the watchdog path can also trigger a failover attempt rather than failing the turn outright. This is more invasive (the watchdog lives in the session/sub-agent actor; failover lives in the chat-client decorator), so fix 1 is the cleaner primary lever and reuses the failover logic that already exists.
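As a rough sketch of what fix 1 might look like, the handler below combines the three settings listed above. The class name LlmHttpClientSketch and every timing value (10 s connect, 15 s idle, 5 s interval, 3 probes) are illustrative assumptions, not tuned recommendations and not the existing CreateLlmHttpClient code. The HTTP/2 ping properties have no effect on HTTP/1.1 connections, which is why the TCP keepalive options are set in ConnectCallback as well.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class LlmHttpClientSketch
{
    public static SocketsHttpHandler CreateHandler()
    {
        return new SocketsHttpHandler
        {
            // Bound the time spent establishing a connection (assumed value).
            ConnectTimeout = TimeSpan.FromSeconds(10),

            // HTTP/2 only: send PING frames while a request is in flight.
            KeepAlivePingPolicy = HttpKeepAlivePingPolicy.WithActiveRequests,
            KeepAlivePingDelay = TimeSpan.FromSeconds(15),
            KeepAlivePingTimeout = TimeSpan.FromSeconds(10),

            // HTTP/1.1 and HTTP/2: enable OS-level TCP keepalive on each socket.
            ConnectCallback = ConnectWithTcpKeepAliveAsync,
        };
    }

    private static async ValueTask<Stream> ConnectWithTcpKeepAliveAsync(
        SocketsHttpConnectionContext context, CancellationToken cancellationToken)
    {
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            // SO_KEEPALIVE plus per-socket timings, in seconds (assumed values).
            socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 15);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 5);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount, 3);

            await socket.ConnectAsync(context.DnsEndPoint, cancellationToken).ConfigureAwait(false);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
}
```

With these assumed values, an idle connection to an unresponsive peer would be probed after about 15 seconds and dropped after three unanswered probes 5 seconds apart, so on the order of 30 seconds rather than 20 minutes. The client would be built as new HttpClient(LlmHttpClientSketch.CreateHandler()) with the existing Timeout retained as the outer ceiling. The exact probe timing is enforced by the operating system, so it should be verified on the deployment platform.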

Notes

  • HttpClient.Timeout = 1h is currently the only transport-layer backstop; the ProcessingWatchdog (prefill 30 min / no-progress 20 min) is the app-layer backstop. Neither is a fast detector.
  • Related but distinct: #718 (timeout error messages are misleading and should include watchdog context). That issue is about messaging; this one is about detection + failover.
  • Verify behavior against both HTTP/1.1 (vLLM/llama-server) and HTTP/2 backends, and confirm a slow cold prefill (silent 10+ min, peer alive) is NOT killed by the new keepalive settings.
