fix(providers): detect dead/half-open backend connections via TCP keepalive by Aaronontheweb · Pull Request #1272 · netclaw-dev/netclaw

Aaronontheweb · 2026-06-01T11:51:46Z

Summary

Switch the LLM HttpClient from a bare HttpClientHandler to SocketsHttpHandler with aggressive TCP keepalive (idle=10s, interval=5s, probes=3), so a dead or restarted backend is detected in ~25s (10s idle + 3 probes × 5s) instead of hanging for the full 20-minute watchdog budget.
The resulting IOException lets the existing FailoverChatClient pre-first-chunk path fail over to the configured fallback provider automatically.
Add ConnectTimeout (10s), PooledConnectionIdleTimeout (60s), and HTTP/2 PING settings for cloud providers that negotiate h2 over TLS.
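
The ~25s figure follows directly from the keepalive parameters. The sketch below is not the C# change in this PR; it is a language-neutral illustration in Python of the same OS-level socket options. The helper names `detection_budget_s` and `enable_keepalive` are hypothetical, and the `TCP_KEEP*` constants are the Linux names, guarded because other platforms expose different ones.

```python
import socket

# Values quoted in the summary; the constant names are illustrative only.
KEEPALIVE_IDLE_S = 10      # idle time before the first probe is sent
KEEPALIVE_INTERVAL_S = 5   # gap between unanswered probes
KEEPALIVE_PROBES = 3       # unanswered probes before the peer is declared dead


def detection_budget_s(idle_s: int, interval_s: int, probes: int) -> int:
    """Approximate seconds from last traffic until the OS reports a dead peer."""
    return idle_s + interval_s * probes


def enable_keepalive(sock: socket.socket, idle_s: int, interval_s: int, probes: int) -> None:
    """Enable TCP keepalive and, where the platform allows, set its timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timing knobs are platform-specific; skip any that are missing
    # so the sketch still runs (with OS default timings) elsewhere.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
```

With the values from the summary, `detection_budget_s(10, 5, 3)` evaluates to 25, which is where the ~25s estimate comes from.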

Context

When a self-hosted LLM backend (vLLM, llama.cpp) restarts mid-request, the TCP connection goes half-open — no RST, no data, no exception. The existing FailoverChatClient only fails over on exceptions, so the turn hangs for up to 20 minutes (until ProcessingWatchdog cancels) with no failover to the healthy fallback. TCP keepalive probes detect peer death in ~25s, surfacing an IOException that the existing failover logic already handles.
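
To illustrate why surfacing an exception is enough, here is a minimal Python sketch of exception-driven, pre-first-chunk failover. It is not the actual FailoverChatClient (which is C# and not shown in this PR). The function name `stream_with_failover` is hypothetical, `OSError` stands in for .NET's IOException, and re-raising after the first chunk is an assumption about how a mid-stream failure would be handled.

```python
from typing import Callable, Iterable, Iterator


def stream_with_failover(
    primary: Callable[[], Iterable[str]],
    fallback: Callable[[], Iterable[str]],
) -> Iterator[str]:
    """Yield chunks from primary; switch to fallback only if primary
    fails before producing its first chunk."""
    emitted = False
    try:
        for chunk in primary():
            emitted = True
            yield chunk
        return  # primary completed normally; fallback is never touched
    except OSError:
        # Stand-in for the IOException raised once keepalive declares the peer dead.
        if emitted:
            # Assumption: partial output already reached the caller, so
            # replaying on another provider would duplicate text.
            raise
    # No chunk was emitted, so it is safe to restart on the fallback.
    yield from fallback()
```

A hung connection that never raises would leave this loop blocked inside `primary()`. That is the 20-minute hang described above, and keepalive removes it by turning the silence into an exception.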

Key insight: TCP keepalive probes peer liveness independently of data flow. A slow-but-alive prefill answers probes (no false kill); a dead/restarted backend doesn't (fast detection → exception → failover).

Verified: vLLM (uvicorn) and llama.cpp (cpp-httplib) are HTTP/1.1 only, so HTTP/2 PING frames won't help for self-hosted backends. TCP keepalive is the primary mechanism there.

Closes #1253

Test plan

dotnet build src/Netclaw.Providers/ compiles cleanly
Existing tests pass (TCP keepalive doesn't affect loopback test connections)
Operational validation: restart a vLLM backend mid-streaming-request and confirm failover triggers in ~25s instead of hanging 20 minutes


Aaronontheweb wants to merge 1 commit into netclaw-dev:dev from Aaronontheweb:claude-wt-network-timeout-halfopen
