LLM providers: detect dead/half-open backend connections fast so failover actually triggers

## Observed failure

A main-session turn (session `D0AC6CKBK5K/1780249736.522049`, **not** a sub-agent) fired an LLM streaming call to its Main provider (`spark-362c`, vLLM `Qwen/Qwen3.6-35B-A3B-FP8`) at 19:33:44. The session was concurrently doing DGX Spark vLLM reconfiguration, and `spark-362c` was restarted/reloaded **while the request was in flight** (its `/v1/models` reported a load time of ~19:50).

The in-flight request's TCP connection stayed **ESTABLISHED** but carried nothing: **zero bytes, no tokens, no `prompt_progress` keepalives, and no RST**. It hung for a full **20 minutes** until the session's no-progress watchdog cancelled it:

```
19:33:44.889  turn_llm_call_start
19:53:44.897  Processing watchdog expired ... (noProgress=True)
19:53:44.899  turn_failed category=Timeout / turn_buffer_drain count=3
```

The turn **failed** (and 3 buffered follow-ups were released) — but there was **no failover** to the configured, healthy fallback provider (`spark-acad`), even though `spark-acad` was up the entire time.

## Root cause

Two things combine:

1. **No transport-level dead-peer detection.** `ProviderPluginBase.CreateLlmHttpClient` (`src/Netclaw.Providers/ProviderPluginBase.cs:48-53`) builds the LLM `HttpClient` on a plain `HttpClientHandler`, and the client's `Timeout = TimeSpan.FromHours(1)` is the only limit: no `SocketsHttpHandler.KeepAlivePingPolicy`, no TCP keepalive, no connect or time-to-first-byte timeout. On a half-open connection the streaming `MoveNextAsync()` therefore blocks silently until that 1-hour ceiling, throwing nothing in the meantime.

2. **Failover is exception-driven.** `FailoverChatClient.StreamWithFailoverAsync` (`src/Netclaw.Daemon/Configuration/FailoverChatClient.cs:124-155`) fails over to the fallback only when the primary **throws before the first chunk**. A silent hang throws nothing, so this path is never reached. The only thing that eventually breaks the hang is the `ProcessingWatchdog` cancellation (~20 min), and cancellation is deliberately excluded from failover (`!cancellationToken.IsCancellationRequested`) because it normally signals a deliberate abort. So the turn just fails; the fallback is never tried.
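
To make the second point concrete, here is a minimal sketch of a pre-first-chunk failover decorator. It is not the actual `FailoverChatClient` code: the `FailoverSketch` class, the `string` chunk type, and the delegate-based provider shape are all hypothetical, chosen only to keep the example self-contained. The cancellation guard mirrors the one quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class FailoverSketch
{
    // Hypothetical helper: streams from `primary` and switches to `fallback`
    // only if the primary throws before yielding its first chunk.
    public static async IAsyncEnumerable<string> StreamWithFailoverAsync(
        Func<CancellationToken, IAsyncEnumerable<string>> primary,
        Func<CancellationToken, IAsyncEnumerable<string>> fallback,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        IAsyncEnumerator<string> stream =
            primary(cancellationToken).GetAsyncEnumerator(cancellationToken);
        bool hasNext;
        try
        {
            // On a half-open connection this await never completes and never
            // throws, so the catch below is not entered until something
            // cancels the token.
            hasNext = await stream.MoveNextAsync();
        }
        catch (Exception) when (!cancellationToken.IsCancellationRequested)
        {
            // Primary failed before the first chunk and nobody cancelled:
            // retry the whole call on the fallback.
            await stream.DisposeAsync();
            stream = fallback(cancellationToken).GetAsyncEnumerator(cancellationToken);
            hasNext = await stream.MoveNextAsync();
        }

        try
        {
            while (hasNext)
            {
                yield return stream.Current;
                hasNext = await stream.MoveNextAsync();
            }
        }
        finally
        {
            await stream.DisposeAsync();
        }
    }
}
```

In this shape a connection error on the first `MoveNextAsync()` reaches the fallback, while a watchdog cancellation makes the exception filter evaluate to false, so the cancellation propagates and the fallback is skipped. That matches the behaviour observed in the incident.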

The no-progress watchdog (the recently-added keepalive-immune deadline) did its job as a **backstop**: it prevented an open-ended hang. But it is intentionally generous, because it must tolerate legitimately slow cold prefills, which can be silent for 10+ minutes. It is not, and should not be, a fast failure detector.

## Impact

Any backend that dies or restarts **mid-request** — common during model reloads, node maintenance, or autoscaling on self-hosted fleets — wedges the turn for up to the no-progress budget (default 20 min) with **no automatic failover**, despite a healthy fallback provider being configured. The user must resend.

## Proposed fix

Make a dead/half-open backend connection surface as a **connection exception in seconds**, so the *existing* `FailoverChatClient` pre-first-chunk path fails over to the fallback automatically:

1. **Primary: transport-level dead-peer detection** on the LLM `HttpClient`. Switch `CreateLlmHttpClient` to a configured `SocketsHttpHandler` with:
   - `KeepAlivePingPolicy` / `KeepAlivePingDelay` / `KeepAlivePingTimeout` for HTTP/2 backends, and/or
   - TCP keepalive (via a `ConnectCallback` that sets `SO_KEEPALIVE` plus aggressive idle/interval values) for HTTP/1.1 backends such as vLLM (served by uvicorn) and llama-server, and
   - a `ConnectTimeout`.

   **Key property that resolves the tension with slow prefill:** a keepalive *ping* probes whether the **peer is alive**, independent of whether the model is producing tokens. A legitimately slow 10-minute cold prefill answers pings (peer alive → no false timeout); a restarted/dead backend does not (ping unanswered → connection aborted in seconds → exception → failover). This is exactly the signal we want, and it does not penalize a slow-but-healthy prefill the way a blanket data-inactivity timeout would.

2. **Optional/complementary:** consider distinguishing a no-progress watchdog timeout from a user cancel so the watchdog path can *also* trigger a failover attempt rather than failing the turn outright. This is more invasive, because the watchdog lives in the session/sub-agent actor while failover lives in the chat-client decorator. Item 1 is the cleaner primary lever and reuses the failover logic that already exists.
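
The transport settings from item 1 could look roughly like the following. This is a minimal sketch, not the current `CreateLlmHttpClient`: the names `LlmHttpClientSketch`, `CreateHandler`, and `ConnectWithTcpKeepAliveAsync` are hypothetical, and every timing value is an untuned assumption that would need validating against real backends. The `SocketsHttpHandler` properties and socket options themselves are standard .NET APIs; the per-socket TCP keepalive options depend on OS support.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class LlmHttpClientSketch
{
    public static SocketsHttpHandler CreateHandler() => new SocketsHttpHandler
    {
        // Fail fast if the backend cannot be reached at connect time.
        ConnectTimeout = TimeSpan.FromSeconds(10),

        // HTTP/2 only: while a request is active, send a PING after 15 s
        // without inbound frames and abort the connection if no reply
        // arrives within 10 s. These settings do not affect HTTP/1.1.
        KeepAlivePingPolicy = HttpKeepAlivePingPolicy.WithActiveRequests,
        KeepAlivePingDelay = TimeSpan.FromSeconds(15),
        KeepAlivePingTimeout = TimeSpan.FromSeconds(10),

        // HTTP/1.1 and HTTP/2: enable TCP keepalive on every new socket.
        ConnectCallback = ConnectWithTcpKeepAliveAsync,
    };

    private static async ValueTask<Stream> ConnectWithTcpKeepAliveAsync(
        SocketsHttpConnectionContext context, CancellationToken cancellationToken)
    {
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
            // Assumed values: first probe after 15 s idle, then every 5 s,
            // giving up after 3 unanswered probes (roughly 30 s in total).
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 15);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 5);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount, 3);

            await socket.ConnectAsync(context.DnsEndPoint, cancellationToken).ConfigureAwait(false);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }

    // The 1-hour ceiling stays as the outer backstop described in this issue.
    public static HttpClient CreateClient() =>
        new HttpClient(CreateHandler()) { Timeout = TimeSpan.FromHours(1) };
}
```

TCP keepalive probes are acknowledged by the peer's network stack rather than by the model server's request loop, which is why a slow but live prefill should keep answering them. Whether the connection error then surfaces on the first `MoveNextAsync()` early enough for the existing failover path still needs to be confirmed by testing.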

## Notes

- `HttpClient.Timeout = 1h` is currently the only transport-layer backstop; the `ProcessingWatchdog` (prefill 30 min / no-progress 20 min) is the app-layer backstop. Neither is a fast detector.
- Related but distinct: #718 (timeout error messages should include watchdog context) — that is about messaging; this is about detection + failover.
- Verify behavior against both HTTP/1.1 (vLLM/llama-server) and HTTP/2 backends, and confirm a slow cold prefill (silent 10+ min, peer alive) is NOT killed by the new keepalive settings.

LLM providers: detect dead/half-open backend connections fast so failover actually triggers #1253
