fix(providers): detect dead/half-open backend connections via TCP keepalive #1272

Draft
Aaronontheweb wants to merge 1 commit into netclaw-dev:dev from Aaronontheweb:claude-wt-network-timeout-halfopen

@Aaronontheweb (Collaborator)

Summary

  • Switch LLM HttpClient from bare HttpClientHandler to SocketsHttpHandler with aggressive TCP keepalive (idle=10s, interval=5s, probes=3) so a dead or restarted backend is detected in ~25s instead of hanging for the full 20-minute watchdog budget
  • The resulting IOException lets the existing FailoverChatClient pre-first-chunk path fail over to the configured fallback provider automatically
  • Adds ConnectTimeout (10s), PooledConnectionIdleTimeout (60s), and HTTP/2 PING settings for cloud providers that negotiate h2 over TLS
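
The diff itself is not shown here, so the following is only a minimal sketch of how the settings listed above could be wired up on .NET 5 or later. It assumes the keepalive options are applied through `SocketsHttpHandler.ConnectCallback`; the class name `KeepAliveHandlerSketch` is hypothetical, and the HTTP/2 PING values (15s delay, 5s timeout) are placeholders because the PR does not state them.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; not taken from the Netclaw.Providers source.
public static class KeepAliveHandlerSketch
{
    public static SocketsHttpHandler Create() => new SocketsHttpHandler
    {
        ConnectTimeout = TimeSpan.FromSeconds(10),
        PooledConnectionIdleTimeout = TimeSpan.FromSeconds(60),

        // HTTP/2 PING only matters for providers that negotiate h2.
        // These two durations are assumed values, not from the PR.
        KeepAlivePingDelay = TimeSpan.FromSeconds(15),
        KeepAlivePingTimeout = TimeSpan.FromSeconds(5),
        KeepAlivePingPolicy = HttpKeepAlivePingPolicy.WithActiveRequests,

        ConnectCallback = ConnectWithKeepAliveAsync,
    };

    // Opens the socket ourselves so TCP keepalive can be set before connecting.
    private static async ValueTask<Stream> ConnectWithKeepAliveAsync(
        SocketsHttpConnectionContext context, CancellationToken ct)
    {
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
            // idle=10s, interval=5s, probes=3 (values from the PR summary)
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 10);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 5);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount, 3);

            await socket.ConnectAsync(context.DnsEndPoint, ct).ConfigureAwait(false);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
}
```

A client built as `new HttpClient(KeepAliveHandlerSketch.Create())` would then inherit these settings for every pooled connection.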

Context

When a self-hosted LLM backend (vLLM, llama.cpp) restarts mid-request, the TCP connection goes half-open: the client sees no RST, no data, and no exception. The existing FailoverChatClient only fails over on exceptions, so the turn hangs for up to 20 minutes (until ProcessingWatchdog cancels it) and never fails over to the healthy fallback. TCP keepalive probes detect peer death in ~25s and surface an IOException that the existing failover logic already handles.
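
To illustrate why an exception is enough to trigger failover, here is a rough sketch of a pre-first-chunk failover loop. It is not the actual FailoverChatClient code: `FailoverSketch`, `StreamWithFailoverAsync`, and the `string` chunk type are hypothetical stand-ins. It also assumes the IOException reaches the caller directly, whereas `HttpClient` may wrap it in another exception type.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration of failing over before the first chunk arrives.
public static class FailoverSketch
{
    public static async IAsyncEnumerable<string> StreamWithFailoverAsync(
        Func<CancellationToken, IAsyncEnumerable<string>> primary,
        Func<CancellationToken, IAsyncEnumerable<string>> fallback,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var enumerator = primary(ct).GetAsyncEnumerator(ct);
        var yieldedAny = false;     // true once any chunk reached the caller
        var usingFallback = false;  // fail over at most once
        try
        {
            while (true)
            {
                bool hasNext;
                try
                {
                    hasNext = await enumerator.MoveNextAsync();
                }
                catch (IOException) when (!yieldedAny && !usingFallback)
                {
                    // Nothing was emitted yet, so switching providers is safe.
                    await enumerator.DisposeAsync();
                    enumerator = fallback(ct).GetAsyncEnumerator(ct);
                    usingFallback = true;
                    continue;
                }

                if (!hasNext) yield break;
                yieldedAny = true;
                yield return enumerator.Current;
            }
        }
        finally
        {
            await enumerator.DisposeAsync();
        }
    }
}
```

Without keepalive, `MoveNextAsync` on a half-open connection never throws, so the `catch` branch is never reached. An IOException after the first chunk is deliberately rethrown here, since partial output has already been delivered.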

Key insight: TCP keepalive probes peer liveness independently of data flow. A slow-but-alive prefill answers probes (no false kill); a dead/restarted backend doesn't (fast detection → exception → failover).
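
The ~25s figure follows from the three keepalive parameters. As a rough model (actual timing depends on the OS TCP stack), detection takes the idle period plus one interval per unanswered probe. The helper name `KeepAliveBudget` below is hypothetical.

```csharp
using System;

public static class KeepAliveBudget
{
    // Rough upper bound: idle time before the first probe, plus one
    // interval for each probe that goes unanswered.
    public static TimeSpan DetectionTime(TimeSpan idle, TimeSpan interval, int probes)
        => idle + TimeSpan.FromTicks(interval.Ticks * probes);
}
```

With the PR's values, `DetectionTime(TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(5), 3)` gives 10s + 3 × 5s = 25s.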

Verified: vLLM (uvicorn) and llama.cpp (cpp-httplib) are HTTP/1.1 only, so HTTP/2 PING frames won't help for self-hosted backends. TCP keepalive is the primary mechanism.

Closes #1253

Test plan

  • dotnet build src/Netclaw.Providers/ compiles cleanly
  • Existing tests pass (TCP keepalive doesn't affect loopback test connections)
  • Operational validation: restart a vLLM backend mid-streaming-request and confirm failover triggers in ~25s instead of hanging 20 minutes

…palive

Switch LLM HttpClient from bare HttpClientHandler to SocketsHttpHandler
with aggressive TCP keepalive (idle=10s, interval=5s, probes=3) so that
a dead or restarted backend is detected in ~25s instead of hanging for
the full 20-minute watchdog budget. The resulting IOException lets the
existing FailoverChatClient pre-first-chunk path fail over to the
configured fallback provider automatically.

Also adds ConnectTimeout (10s), PooledConnectionIdleTimeout (60s), and
HTTP/2 PING settings for cloud providers that negotiate h2 over TLS.

Closes netclaw-dev#1253

Development

Successfully merging this pull request may close these issues.

LLM providers: detect dead/half-open backend connections fast so failover actually triggers
