fix(gateway): stop flagging RPC timeouts while gateway is starting/restarting by hazeone · Pull Request #906 · ValueCell-ai/ClawX

hazeone · 2026-04-24T03:18:27Z

Problem

On Windows + ClawX 0.3.10, many users see a persistent red banner on the Channels page:

网关状态异常 · 最近的网关 RPC 调用发生超时 ("Gateway status abnormal · A recent gateway RPC call timed out")

Analysis of user-submitted logs (which captured spawnToReadyMs of 50–130 seconds on Windows) showed that the banner is not a single bug but the product of several Windows-specific lifecycle choices combining with a UI health judgement that has no cooldown:

Cold-start dwarfs RPC timeouts. On Windows, Gateway bootstrap (uv python download, Defender scans, bonjour probing, model-pricing fetch, plugin init) routinely takes 30 s to 2 min. The first channels.status (5–8 s timeout) and chat.history (30–35 s timeout) fire long before the gateway is actually ready, rejecting with Gateway not connected / Gateway stopped / RPC timeout.
Windows reload is forced into restart. SIGUSR1 is not supported, so every provider + channel save triggers a full cold-start, cascading the previous problem.
Windows heartbeat recovery was disabled outright. Once the Gateway deadlocked (Defender / plugin / IO stall), the manager logged Gateway heartbeat recovery skipped (platform=win32) and did nothing until the user manually clicked "重启网关" ("Restart Gateway").
Handshake / challenge ceilings were platform-agnostic (10 s / 20 s). On Windows those were shorter than real cold-starts, producing a false Connect handshake timeout, which startup-orchestrator treated as a transient error and retried up to 3× — stacking another >1 min of delay.
Health judgement was too hot. Any consecutiveRpcFailures > 0 immediately produced the rpc_timeout reason, with no cooldown. A single transport failure while the gateway was still booting kept the banner up until the next successful RPC.
Channels page re-fired 5 refresh ticks on page load (1.2 / 2.6 / 4.5 / 7 / 10.5 s), compounding the counter during cold-start.
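To make the timing mismatch concrete, the sketch below counts how many of the page-load refresh ticks fire before the gateway is ready. The tick offsets and the 50 s figure (the low end of the reported spawnToReadyMs range) come from this description; the helper name `ticksBeforeReady` and the assumption that the page is opened at the moment the gateway is spawned are illustrative only.

```typescript
// Refresh offsets (ms after page load) listed in the problem description.
const REFRESH_TICKS_MS: readonly number[] = [1_200, 2_600, 4_500, 7_000, 10_500];

// Hypothetical helper: counts the ticks that fire while the gateway is still booting.
// Assumes the Channels page is opened at the moment the gateway process is spawned.
function ticksBeforeReady(ticksMs: readonly number[], spawnToReadyMs: number): number {
  return ticksMs.filter((tickMs) => tickMs < spawnToReadyMs).length;
}

// Even the fastest reported Windows cold-start (about 50 s) outlasts every tick.
console.log(ticksBeforeReady(REFRESH_TICKS_MS, 50_000)); // 5
```

Under these assumptions every tick lands inside the cold-start window, so each one can only add to the failure counter.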

Log evidence (abridged, from attached session transcripts):

[metric] gateway.startup { spawnToReadyMs: 128245, totalMs: 128866 }
[channels.accounts] channels.status probe=0 failed after 8013ms
[gateway:rpc] chat.history failed (timeoutMs=35000): Error: Gateway stopped
[ERROR] Gateway connect handshake timed out
[WARN ] Transient start error: Error: Connect handshake timeout. Retrying... (1/3)
Gateway heartbeat: 5 consecutive pong misses … autoReconnect=true
Gateway heartbeat recovery skipped (platform=win32)

Fix

electron/gateway/manager.ts

isTransportRpcFailure() narrowed to only "RPC timeout:". Gateway not connected / Gateway stopped / send failures are expected during lifecycle transitions and are already surfaced via status.state (gateway_not_running / gateway_error reasons).
recordRpcFailure() gates incrementing consecutiveRpcFailures on status.state === 'running', gatewayReady === true, AND 45 s elapsed since connectedAt — matching the renderer-side chat.history startup retry window.
recordRpcSuccess() now infers gatewayReady = true on the first successful RPC, so builds that never emit (or emit too-early) gateway.ready no longer wait on the 30 s fallback.
Windows heartbeat recovery is re-enabled but behind a 5-minute silence guard (WINDOWS_RECOVERY_SILENCE_MS). Short Defender-induced blips (which emit another message within a minute or two) keep the existing deferral; genuine deadlocks now self-heal after 5 min.
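The failure-counting rules above can be sketched as follows. This is a minimal illustration, not the code in manager.ts: the `HealthState` shape, the `RPC_FAILURE_GRACE_MS` constant name, the 'stopped' / 'error' state names, matching "RPC timeout:" as a prefix, and resetting the counter on success are all assumptions.

```typescript
type GatewayState = "starting" | "running" | "stopped" | "error"; // last two names are assumed

// Hypothetical, flattened view of the fields the description mentions.
interface HealthState {
  state: GatewayState;
  gatewayReady: boolean;
  connectedAt: number | null; // epoch ms of the last connect, null if never connected
  consecutiveRpcFailures: number;
  lastRpcFailureAt: number | null;
}

const RPC_FAILURE_GRACE_MS = 45_000; // hypothetical name; the 45 s value is from the description

// Only genuine RPC timeouts count; "Gateway stopped" and similar do not.
function isTransportRpcFailure(error: Error): boolean {
  return error.message.startsWith("RPC timeout:"); // prefix match is an assumption
}

function recordRpcFailure(health: HealthState, error: Error, now: number): void {
  if (!isTransportRpcFailure(error)) return;
  // Diagnostics stay populated even when the failure is not counted.
  health.lastRpcFailureAt = now;
  const pastGrace =
    health.connectedAt !== null && now - health.connectedAt >= RPC_FAILURE_GRACE_MS;
  if (health.state === "running" && health.gatewayReady && pastGrace) {
    health.consecutiveRpcFailures += 1;
  }
}

function recordRpcSuccess(health: HealthState): void {
  health.gatewayReady = true; // the first successful RPC implies readiness
  health.consecutiveRpcFailures = 0; // assumed: a success clears the counter
}
```

With this gating, a timeout during 'starting' or within 45 s of connecting leaves the counter at zero, so the rpc_timeout reason cannot be derived from it.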

electron/gateway/ws-client.ts

New GATEWAY_CHALLENGE_TIMEOUT_MS_WIN = 30_000 and GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS_WIN = 45_000, selected automatically via getPlatformChallenge/ConnectHandshakeTimeoutMs() based on the caller's platform. Linux / macOS retain the stricter 10 s / 20 s defaults.
Prevents the retry loop that previously stacked three 20 s handshake-timeout failures on top of a legitimate Windows cold-start.
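A minimal sketch of the platform-aware selection, assuming a simple lookup keyed on the platform string. The two `_WIN` constants and their values are named in this description; `GATEWAY_CHALLENGE_TIMEOUT_MS`, the helper names `challengeTimeoutFor` / `handshakeTimeoutFor`, and `resolveTimeouts` are hypothetical stand-ins, not the real exports of ws-client.ts.

```typescript
// Defaults retained for Linux / macOS (10 s / 20 s per the description).
const GATEWAY_CHALLENGE_TIMEOUT_MS = 10_000; // constant name assumed by analogy
const GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS = 20_000;
// Windows ceilings introduced by this PR.
const GATEWAY_CHALLENGE_TIMEOUT_MS_WIN = 30_000;
const GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS_WIN = 45_000;

// Hypothetical helpers standing in for the platform getters.
function challengeTimeoutFor(platform: string): number {
  return platform === "win32" ? GATEWAY_CHALLENGE_TIMEOUT_MS_WIN : GATEWAY_CHALLENGE_TIMEOUT_MS;
}

function handshakeTimeoutFor(platform: string): number {
  return platform === "win32"
    ? GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS_WIN
    : GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS;
}

interface TimeoutOverrides {
  challengeTimeoutMs?: number;
  connectTimeoutMs?: number;
}

// An explicit caller override wins; otherwise the platform default applies.
function resolveTimeouts(platform: string, overrides: TimeoutOverrides = {}) {
  return {
    challengeTimeoutMs: overrides.challengeTimeoutMs ?? challengeTimeoutFor(platform),
    connectTimeoutMs: overrides.connectTimeoutMs ?? handshakeTimeoutFor(platform),
  };
}

console.log(resolveTimeouts("win32").connectTimeoutMs); // 45000
console.log(resolveTimeouts("darwin").connectTimeoutMs); // 20000
```

For scale, three retries at the old 20 s handshake ceiling add up to 60 s of waiting before the orchestrator gives up, which is the stacking this change is meant to avoid.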

electron/api/routes/channels.ts

buildChannelAccountsView() short-circuits the runtime channels.status RPC unless status.state === 'running' and gatewayReady === true. Startup renders now rely on the config view (already fully populated) instead of firing requests that are guaranteed to fail.
Channel save debounce raised from 150 ms → 2000 ms (CHANNEL_SAVE_RESTART_DEBOUNCE_MS). A typical provider + channel setup flow (which can trigger 3–5 saves within 1 s of each other) now coalesces into a single Gateway restart instead of stacking several cold-starts on Windows.
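The effect of the longer debounce can be modelled without real timers. The sketch below assumes a trailing-edge debounce (each save re-arms the timer, and a restart fires once the timer expires); `countDebouncedRestarts` and the save timestamps are hypothetical and only illustrate the coalescing.

```typescript
// Counts the restarts a trailing-edge debounce would fire for a series of saves.
// A restart fires whenever the gap to the next save is at least debounceMs,
// plus once more after the final save.
function countDebouncedRestarts(saveTimesMs: readonly number[], debounceMs: number): number {
  const sorted = [...saveTimesMs].sort((a, b) => a - b);
  let restarts = 0;
  let previous: number | null = null;
  for (const saveTime of sorted) {
    // The timer armed by the previous save expired before this save arrived.
    if (previous !== null && saveTime - previous >= debounceMs) restarts += 1;
    previous = saveTime;
  }
  // The timer armed by the last save always fires eventually.
  if (previous !== null) restarts += 1;
  return restarts;
}

// Hypothetical setup flow: five saves spread over roughly one second.
const saveTimes = [0, 200, 450, 700, 950];
console.log(countDebouncedRestarts(saveTimes, 150)); // 5
console.log(countDebouncedRestarts(saveTimes, 2_000)); // 1
```

With saves spaced like this, the 150 ms window expires between every pair of saves, while the 2000 ms window covers the whole burst.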

electron/main/ipc-handlers.ts

Same 2000 ms debounce applied to the legacy IPC code path that also drives channel-save refreshes.

src/pages/Channels/index.tsx

scheduleConvergenceRefresh() timers re-read the live gateway status before firing. Ticks are skipped (with a clear info log) unless the gateway is running and gatewayReady is true.
Added a second trigger so that when gatewayReady flips true while status.state is already 'running', the page re-fetches runtime data and re-schedules convergence. This covers the case where a page loaded during 'starting' and transitioned via gateway.ready without ever changing status.state.
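Both renderer-side rules reduce to small predicates over the gateway status. The sketch below is illustrative only: the `GatewayStatusView` shape, the function names, and the log wording are assumptions, and the real page wires this into React state rather than plain callbacks.

```typescript
// Hypothetical minimal view of the status the page reads.
interface GatewayStatusView {
  state: string;
  gatewayReady?: boolean;
}

function isReadyForRuntimeData(current: GatewayStatusView): boolean {
  return current.state === "running" && current.gatewayReady === true;
}

// Runs one convergence tick, re-reading the live status at fire time.
// Returns true if the fetch ran, false if the tick was skipped.
function runConvergenceTick(
  getStatus: () => GatewayStatusView,
  fetchPageData: () => void,
  logInfo: (message: string) => void,
): boolean {
  const current = getStatus();
  if (!isReadyForRuntimeData(current)) {
    logInfo(`convergence tick skipped (state=${current.state}, gatewayReady=${String(current.gatewayReady)})`);
    return false;
  }
  fetchPageData();
  return true;
}

// Second trigger: gatewayReady flipped to true while state stayed 'running'.
function shouldRefetchOnReadyFlip(previous: GatewayStatusView, next: GatewayStatusView): boolean {
  return (
    previous.state === "running" &&
    next.state === "running" &&
    previous.gatewayReady !== true &&
    next.gatewayReady === true
  );
}
```

The second predicate exists because a watcher keyed only on status.state would never fire for a readiness flip that leaves the state unchanged.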

Non-fix decisions documented

UtilityProcess IPC reload (PLAN item Dev #4): declined after a code audit of node_modules/openclaw/dist. OpenClaw only handles reload via in-process SIGUSR1 (process.listenerCount('SIGUSR1') > 0 ? emit : kill) and does not listen for parent process.on('message') IPC. There is no gateway.reload / admin.reload RPC; secrets.reload and config.apply both funnel back into the same SIGUSR1 path. Even if an IPC reload were implemented, the in-process SIGUSR1 reload re-runs the same plugin / bonjour / model-pricing init that accounts for the cold-start cost. The debounce bump from 150 ms → 2 s (above) captures the much larger win: one restart per typical setup flow instead of 3–5.

Tests

tests/unit/gateway-manager-diagnostics.test.ts (now 6 tests):
- new: RPC timeout during state: 'starting' does NOT pollute consecutiveRpcFailures / rpc_timeout reason.
- new: RPC timeout within the 45 s post-connect grace window does NOT count.
- new: First successful RPC flips gatewayReady = true.
- Existing cases updated to set connectedAt / gatewayReady so recordRpcFailure still fires where intended.
tests/unit/gateway-manager-heartbeat.test.ts (now 5 tests):
- Windows case split into "deferred while <5 min silence guard" + "fires after prolonged silence".
tests/unit/gateway-ws-client.test.ts (now 12 tests):
- new: getPlatformChallenge/ConnectHandshakeTimeoutMs returns the Windows constants on win32 and defaults elsewhere.
- new: A win32 handshake still succeeds after advancing past the non-windows default (+5 s) because the Windows ceiling is in effect.
tests/unit/channel-routes.test.ts:
- 25 getStatus mocks extended with gatewayReady: true so channels.status still dispatches in tests.
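The two Windows heartbeat cases listed above ("deferred while <5 min silence guard" and "fires after prolonged silence") can be illustrated with a small decision function. WINDOWS_RECOVERY_SILENCE_MS is named in the Fix section; the `HeartbeatSnapshot` shape, `PONG_MISS_THRESHOLD` (taken from the "5 consecutive pong misses" log line), and the rule that other platforms recover as soon as the threshold is reached are assumptions made for this sketch.

```typescript
const WINDOWS_RECOVERY_SILENCE_MS = 5 * 60_000; // 5-minute silence guard
const PONG_MISS_THRESHOLD = 5; // assumed from the "5 consecutive pong misses" log line

// Hypothetical snapshot of what the heartbeat monitor knows.
interface HeartbeatSnapshot {
  platform: string;
  consecutivePongMisses: number;
  lastInboundAt: number; // epoch ms of the last pong or inbound message
}

// Decides whether heartbeat-driven recovery should restart the gateway.
function shouldRecover(snapshot: HeartbeatSnapshot, now: number): boolean {
  if (snapshot.consecutivePongMisses < PONG_MISS_THRESHOLD) return false;
  // Assumption: non-Windows platforms recover as soon as the threshold is hit.
  if (snapshot.platform !== "win32") return true;
  // Windows: defer until the gateway has been completely silent for 5 minutes.
  return now - snapshot.lastInboundAt >= WINDOWS_RECOVERY_SILENCE_MS;
}

const stalled: HeartbeatSnapshot = { platform: "win32", consecutivePongMisses: 5, lastInboundAt: 0 };
console.log(shouldRecover(stalled, 90_000)); // false
console.log(shouldRecover(stalled, 300_000)); // true
```

A short stall that still produces an inbound message resets lastInboundAt and keeps recovery deferred, while a true deadlock crosses the 5-minute line and triggers a restart.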

Verification

Check	Result
`pnpm test`	84 files, 542 tests passed
`pnpm typecheck`	clean
`pnpm run lint`	clean
`pnpm run comms:replay` + `comms:compare`	all metrics PASS (rpc_timeout_rate, duplicate_event_rate, event_fanout_ratio, history_inflight_max, rpc_p95_ms 0% delta vs baseline)

Notes for reviewers

No change to the banner UI itself or to user-facing copy.
The startup-grace change intentionally keeps lastRpcFailureAt / lastRpcFailureMethod diagnostics populated for debugging — only the health summary is gated.
No renderer-side change is needed for the history path: src/stores/chat/history-startup-retry.ts already classifies these failures and retries for up to ~45 s, so timeouts that used to poison the banner now retry transparently.

…starting

Windows / 0.3.10 users see a persistent "网关状态异常 · 最近的网关 RPC 调用发生超时" ("Gateway status abnormal · A recent gateway RPC call timed out") banner on the Channels page because cold-start on Windows routinely takes 30s to 2min (uv python download, Defender, bonjour probing, model-pricing fetch, plugin init), which is longer than the channels.status / chat.history RPC timeouts. Any RPC fired during that window rejected with "Gateway not connected" / "Gateway stopped" / "RPC timeout", each of which incremented consecutiveRpcFailures and immediately tripped the rpc_timeout health reason. With no cooldown, the banner stuck around.

Changes:
- isTransportRpcFailure() now only counts "RPC timeout:" messages. "Gateway not connected" / "Gateway stopped" / send failures are expected during lifecycle transitions and are already surfaced via status.state (gateway_not_running / gateway_error reasons).
- recordRpcFailure() gates counting on status.state === 'running', gatewayReady === true, AND 45s elapsed since connectedAt, matching the renderer-side chat.history startup retry window.
- On Windows we no longer disable heartbeat-driven recovery outright. Recovery requires 5 minutes of true silence (no pong, no inbound message) before the manager restarts, which keeps Defender-induced blips from false-positive restarts while still self-healing real deadlocks.
- buildChannelAccountsView() skips the channels.status RPC entirely unless the gateway is running AND gatewayReady, so bootstrap renders no longer fire requests that are guaranteed to fail.
- gatewayReady now also flips on the first successful RPC, so builds that don't emit gateway.ready no longer wait 30s for the fallback.
- Channel save debounce raised from 150ms to 2000ms so a typical provider+channel setup flow coalesces into a single Gateway restart instead of stacking several on Windows (where reload=restart).

Code paths:
- electron/gateway/manager.ts
- electron/api/routes/channels.ts
- electron/main/ipc-handlers.ts

Co-authored-by: Haze <hazeone@users.noreply.github.com>

…ndows silence-guarded recovery

- gateway-manager-diagnostics: verify that RPC timeouts during state=starting or within the 45s post-connect grace window don't pollute consecutiveRpcFailures / rpc_timeout reason; verify gatewayReady flips to true on first successful RPC.
- gateway-manager-heartbeat: split the old win32-disabled-recovery test into (1) defers while <5min silence guard, (2) restarts after prolonged silence.
- channel-routes: include gatewayReady=true in all mocked getStatus() returns so buildChannelAccountsView continues firing channels.status in tests (the new implementation short-circuits while gateway is not yet ready).

Co-authored-by: Haze <hazeone@users.noreply.github.com>

Gateway cold-start on Windows is dominated by Defender scans, npx/plugin unpacking, bonjour probing, and model-pricing fetch. The existing 10s/20s challenge/handshake ceilings regularly fired during normal startup, which startup-orchestrator then classified as a transient error and retried up to 3x, stacking an extra >1 min of false cold-start latency on top of the real boot time.

Introduce platform-aware ceilings:
- challengeTimeoutMs: 10s (other) -> 30s (win32)
- connectTimeoutMs: 20s (other) -> 45s (win32)

connectGatewaySocket() now falls back to getPlatformChallenge/Connect* helpers when the caller does not override the timeout explicitly. Linux and macOS retain the stricter defaults.

Also adds two regression tests:
- getPlatformChallenge/ConnectTimeoutMs returns the Windows constants on win32 and the defaults elsewhere.
- A win32 handshake still succeeds after advancing past the non-windows default (GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS + 5s) because the Windows ceiling (45s) is now in effect.

Co-authored-by: Haze <hazeone@users.noreply.github.com>

…ready

Entering the Channels page previously scheduled five convergence refreshes at 1.2 / 2.6 / 4.5 / 7 / 10.5 s, each issuing a channels.status RPC regardless of the gateway lifecycle state. On Windows cold-start (30s-130s) every one of those ticks was guaranteed to fail or be skipped by the main-process RPC gate (just added), flooding the log with noise and producing a poor UX.

Changes:
- scheduleConvergenceRefresh() timers now re-read the live gateway status before firing. If state!='running' or gatewayReady!==true the tick is skipped with a clear info log instead of calling fetchPageData.
- Added a second trigger: when gatewayReady flips true while status.state is already 'running' (e.g. after a page that loaded during 'starting' transitioned via gateway.ready), re-run fetchPageData + scheduleConvergenceRefresh so the runtime view converges without requiring a state transition.

This complements the main-process channels.status short-circuit added in the prior commit and keeps the UI in sync with actual gateway readiness.

Co-authored-by: Haze <hazeone@users.noreply.github.com>

cursoragent and others added 4 commits April 24, 2026 03:09

hazeone wants to merge 4 commits into main from cursor/rpc-e46d
