Observed failure
A main-session turn (session D0AC6CKBK5K/1780249736.522049, not a sub-agent) fired an LLM streaming call to its Main provider (spark-362c, vLLM Qwen/Qwen3.6-35B-A3B-FP8) at 19:33:44. The session was concurrently doing DGX Spark vLLM reconfiguration, and spark-362c was restarted/reloaded while the request was in flight (its /v1/models reported a load time of ~19:50).
The in-flight request kept an ESTABLISHED TCP connection with zero bytes, no tokens, no prompt_progress keepalives, and no RST. It hung for a full 20 minutes until the session's no-progress watchdog cancelled it:
The turn failed (and 3 buffered follow-ups were released) — but there was no failover to the configured, healthy fallback provider (spark-acad), even though spark-acad was up the entire time.
Root cause
Two things combine:
No transport-level dead-peer detection. ProviderPluginBase.CreateLlmHttpClient (src/Netclaw.Providers/ProviderPluginBase.cs:48-53) builds the LLM HttpClient on a plain HttpClientHandler with only Timeout = TimeSpan.FromHours(1): no SocketsHttpHandler.KeepAlivePingPolicy, no TCP keepalive, no connect/time-to-first-byte timeout. A half-open connection therefore leaves the streaming MoveNextAsync() call blocked, with no exception, until the 1-hour ceiling.
Failover is exception-driven. FailoverChatClient.StreamWithFailoverAsync (src/Netclaw.Daemon/Configuration/FailoverChatClient.cs:124-155) fails over to the fallback when the primary throws before the first chunk. A silent hang throws nothing, so this path is never reached. The only thing that eventually breaks the hang is the ProcessingWatchdog cancellation (~20 min), which is correctly excluded from failover (!cancellationToken.IsCancellationRequested), since cancellation signals a deliberate abort. So the turn just fails; the fallback is never tried.
The no-progress watchdog (the recently-added keepalive-immune deadline) did its job as a backstop — it prevented an open-ended hang — but it is intentionally generous (it must tolerate legitimately slow cold prefills, which can be silent 10+ minutes). It is not, and should not be, a fast failure detector.
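To make the second point concrete, here is a minimal sketch of an exception-driven, pre-first-chunk failover. It is not the actual FailoverChatClient code: the class name FailoverSketch, the string chunk type, and the delegate-based provider shape are assumptions made for illustration. The thing to notice is that the catch clause can only run if the first MoveNextAsync() throws; if that call never completes, nothing below it runs until the token is cancelled, and the exception filter then rules out failover.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration only; names and types are not from the Netclaw codebase.
public static class FailoverSketch
{
    // Opens a stream and awaits its first chunk, disposing the enumerator if that fails.
    private static async Task<(IAsyncEnumerator<string> Stream, bool HasItem)> OpenAsync(
        Func<CancellationToken, IAsyncEnumerable<string>> source, CancellationToken ct)
    {
        var stream = source(ct).GetAsyncEnumerator(ct);
        try
        {
            // A half-open connection parks here: no result, no exception.
            return (stream, await stream.MoveNextAsync());
        }
        catch
        {
            await stream.DisposeAsync();
            throw;
        }
    }

    public static async IAsyncEnumerable<string> StreamWithFailover(
        Func<CancellationToken, IAsyncEnumerable<string>> primary,
        Func<CancellationToken, IAsyncEnumerable<string>> fallback,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        (IAsyncEnumerator<string> Stream, bool HasItem) opened;
        try
        {
            opened = await OpenAsync(primary, cancellationToken);
        }
        catch (Exception) when (!cancellationToken.IsCancellationRequested)
        {
            // Reached only when the primary throws before its first chunk
            // and the caller has not cancelled.
            opened = await OpenAsync(fallback, cancellationToken);
        }

        var (stream, hasItem) = opened;
        try
        {
            while (hasItem)
            {
                yield return stream.Current;
                hasItem = await stream.MoveNextAsync();
            }
        }
        finally
        {
            await stream.DisposeAsync();
        }
    }
}
```

With this shape, a primary that throws a connection error before its first chunk is replaced by the fallback transparently, while a primary that hangs holds the caller at the first await until cancellation, which matches the behaviour observed above.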
Impact
Any backend that dies or restarts mid-request — common during model reloads, node maintenance, or autoscaling on self-hosted fleets — wedges the turn for up to the no-progress budget (default 20 min) with no automatic failover, despite a healthy fallback provider being configured. The user must resend.
Proposed fix
Make a dead/half-open backend connection surface as a connection exception in seconds, so the existing FailoverChatClient pre-first-chunk path fails over to the fallback automatically:
Primary: transport-level dead-peer detection on the LLM HttpClient. Switch CreateLlmHttpClient to a configured SocketsHttpHandler with:
KeepAlivePingPolicy / KeepAlivePingDelay / KeepAlivePingTimeout for HTTP/2 backends, and/or
TCP keepalive (via a ConnectCallback that sets SO_KEEPALIVE plus aggressive idle/interval values) for HTTP/1.1 backends such as vLLM (served by uvicorn) or llama-server, and
a ConnectTimeout.
Key property that resolves the tension with slow prefill: a keepalive ping probes whether the peer is alive, independent of whether the model is producing tokens. A legitimately slow 10-minute cold prefill answers pings (peer alive → no false timeout); a restarted/dead backend does not (ping unanswered → connection aborted in seconds → exception → failover). This is exactly the signal we want, and it does not penalize a slow-but-healthy prefill the way a blanket data-inactivity timeout would.
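A minimal sketch of what such a handler could look like follows. The class and method names (LlmHandlerSketch, CreateLlmHandler) and every timing value are assumptions chosen for illustration, not taken from the repository; the actual budgets would need tuning against real backends. The SocketsHttpHandler properties and socket options used are standard .NET APIs (ConnectCallback requires .NET 5 or later). Note that the KeepAlivePing* settings only apply to HTTP/2 connections, so HTTP/1.1 backends rely on the TCP keepalive options set in the callback.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Sockets;

// Hypothetical illustration only; names and timing values are assumptions.
public static class LlmHandlerSketch
{
    public static SocketsHttpHandler CreateLlmHandler()
    {
        return new SocketsHttpHandler
        {
            // Fail fast when the backend cannot be reached at all.
            ConnectTimeout = TimeSpan.FromSeconds(10),

            // HTTP/2 only: send PING frames while a request is in flight and
            // abort the connection if a PING goes unanswered.
            KeepAlivePingPolicy = HttpKeepAlivePingPolicy.WithActiveRequests,
            KeepAlivePingDelay = TimeSpan.FromSeconds(15),
            KeepAlivePingTimeout = TimeSpan.FromSeconds(10),

            // HTTP/1.1 and HTTP/2: enable kernel-level TCP keepalive on the socket.
            ConnectCallback = async (context, cancellationToken) =>
            {
                var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
                try
                {
                    socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
                    // Values are in seconds: first probe after 15 s idle,
                    // then every 5 s, giving up after 3 unanswered probes.
                    socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 15);
                    socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 5);
                    socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount, 3);

                    await socket.ConnectAsync(context.DnsEndPoint, cancellationToken).ConfigureAwait(false);
                    return new NetworkStream(socket, ownsSocket: true);
                }
                catch
                {
                    socket.Dispose();
                    throw;
                }
            }
        };
    }
}
```

Under these assumed values, a peer that stops answering probes would be declared dead roughly 30 seconds after the connection goes idle (15 s + 3 × 5 s), at which point the pending read fails with an exception. A backend that is alive but silently prefilling still acknowledges the probes at the TCP layer, so the connection stays up. The existing HttpClient.Timeout can remain as an outer ceiling.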
Optional/complementary: consider distinguishing a no-progress watchdog timeout from a user cancel so the watchdog path can also trigger a failover attempt rather than failing the turn outright. This is more invasive (the watchdog lives in the session/sub-agent actor; failover lives in the chat-client decorator), so fix 1 (the transport-level change above) is the cleaner primary lever and reuses the failover logic that already exists.
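One way to tell the two cancellation sources apart is to give the watchdog its own CancellationTokenSource and link it with the user's token. The sketch below is hypothetical: WatchdogFailoverSketch, CallAsync, and the delegate-based provider shape are invented for illustration, the call is non-streaming for brevity, and CancelAfter stands in for the real no-progress watchdog (which resets on progress rather than acting as a fixed deadline).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration only; not the Netclaw watchdog or failover code.
public static class WatchdogFailoverSketch
{
    public static async Task<string> CallAsync(
        Func<CancellationToken, Task<string>> primary,
        Func<CancellationToken, Task<string>> fallback,
        TimeSpan noProgressBudget,
        CancellationToken userToken)
    {
        // Separate source for the watchdog so its cancellation is
        // distinguishable from a deliberate user abort.
        using var watchdog = new CancellationTokenSource();
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(userToken, watchdog.Token);
        watchdog.CancelAfter(noProgressBudget); // simplification: fixed deadline

        try
        {
            return await primary(linked.Token);
        }
        catch (OperationCanceledException)
            when (watchdog.IsCancellationRequested && !userToken.IsCancellationRequested)
        {
            // The watchdog fired and the user did not cancel: try the fallback.
            return await fallback(userToken);
        }
    }
}
```

A user cancel still propagates as an OperationCanceledException because the filter rejects it, which preserves the existing rule that a deliberate abort never triggers failover.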
Notes
HttpClient.Timeout = 1h is currently the only transport-layer backstop; the ProcessingWatchdog (prefill 30 min / no-progress 20 min) is the app-layer backstop. Neither is a fast detector.
Verify behavior against both HTTP/1.1 (vLLM/llama-server) and HTTP/2 backends, and confirm a slow cold prefill (silent 10+ min, peer alive) is NOT killed by the new keepalive settings.