
fix(gateway): stop flagging RPC timeouts while gateway is starting/restarting#906

Draft
hazeone wants to merge 4 commits into main from cursor/rpc-e46d

@hazeone hazeone commented Apr 24, 2026

Problem

On Windows + ClawX 0.3.10, many users see a persistent red banner on the Channels page:

网关状态异常 · 最近的网关 RPC 调用发生超时 ("Gateway status abnormal · a recent gateway RPC call timed out")

Analysis of user-submitted logs (captured spawnToReadyMs of 50–130 seconds on Windows) showed that the banner is not a single bug but the product of several Windows-specific lifecycle choices combining with a UI health judgement that has no cooldown:

  1. Cold-start dwarfs RPC timeouts. On Windows, Gateway bootstrap (uv python download, Defender scans, bonjour probing, model-pricing fetch, plugin init) routinely takes 30 s to 2 min. The first channels.status (5–8 s timeout) and chat.history (30–35 s timeout) fire long before the gateway is actually ready, rejecting with Gateway not connected / Gateway stopped / RPC timeout.
  2. Windows reload is forced into restart. SIGUSR1 is not supported, so every provider + channel save triggers a full cold-start, cascading the previous problem.
  3. Windows heartbeat recovery was disabled outright. Once the Gateway deadlocked (Defender / plugin / IO stall), the manager logged Gateway heartbeat recovery skipped (platform=win32) and did nothing until the user manually clicked "重启网关" ("Restart Gateway").
  4. Challenge / handshake ceilings were platform-agnostic (10 s / 20 s). On Windows those were shorter than real cold-starts, producing a false Connect handshake timeout, which startup-orchestrator treated as a transient error and retried up to 3×, stacking another >1 min of delay.
  5. Health judgement was too hot. consecutiveRpcFailures > 0 immediately produced the rpc_timeout reason, with no cooldown. A single transport failure while the gateway was still booting kept the banner up until the next successful RPC.
  6. Channels page re-fired 5 refresh ticks on page load (1.2 / 2.6 / 4.5 / 7 / 10.5 s), compounding the counter during cold-start.

Log evidence (abridged, from attached session transcripts):

[metric] gateway.startup { spawnToReadyMs: 128245, totalMs: 128866 }
[channels.accounts] channels.status probe=0 failed after 8013ms
[gateway:rpc] chat.history failed (timeoutMs=35000): Error: Gateway stopped
[ERROR] Gateway connect handshake timed out
[WARN ] Transient start error: Error: Connect handshake timeout. Retrying... (1/3)
Gateway heartbeat: 5 consecutive pong misses … autoReconnect=true
Gateway heartbeat recovery skipped (platform=win32)

Fix

electron/gateway/manager.ts

  • isTransportRpcFailure() narrowed to only "RPC timeout:". Gateway not connected / Gateway stopped / send failures are expected during lifecycle transitions and are already surfaced via status.state (gateway_not_running / gateway_error reasons).
  • recordRpcFailure() gates incrementing consecutiveRpcFailures on status.state === 'running', gatewayReady === true, AND 45 s elapsed since connectedAt — matching the renderer-side chat.history startup retry window.
  • recordRpcSuccess() now infers gatewayReady = true on the first successful RPC, so builds that never emit gateway.ready (or emit it too early) no longer wait on the 30 s fallback.
  • Windows heartbeat recovery is re-enabled but behind a 5-minute silence guard (WINDOWS_RECOVERY_SILENCE_MS). Short Defender-induced blips (which emit another message within a minute or two) keep the existing deferral; genuine deadlocks now self-heal after 5 min.
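
The failure-counting gate described above can be sketched roughly as follows. This is a minimal illustration, not the actual manager.ts code: the class name RpcHealthTracker, the constant name RPC_FAILURE_GRACE_MS, the GatewayStatus shape, the injected `now` parameter, and the exact point at which diagnostics are recorded are all assumptions. Only the function names, the "RPC timeout:" prefix, and the three gate conditions are taken from this PR.

```typescript
// Sketch only: shapes and names not mentioned in the PR are hypothetical.
type GatewayState = 'stopped' | 'starting' | 'running' | 'error';

interface GatewayStatus {
  state: GatewayState;
  gatewayReady: boolean;
  connectedAt?: number; // epoch ms of the last successful connect
}

// Hypothetical constant name; the 45 s value is the grace window from the PR.
const RPC_FAILURE_GRACE_MS = 45_000;

// Only genuine timeouts count. "Gateway not connected" / "Gateway stopped"
// are lifecycle noise that status.state already reports.
function isTransportRpcFailure(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return message.includes('RPC timeout:');
}

class RpcHealthTracker {
  status: GatewayStatus;
  consecutiveRpcFailures = 0;
  lastRpcFailureAt: number | undefined = undefined;
  lastRpcFailureMethod: string | undefined = undefined;

  constructor(status: GatewayStatus) {
    this.status = status;
  }

  recordRpcFailure(method: string, error: unknown, now: number = Date.now()): void {
    if (!isTransportRpcFailure(error)) return;

    // Diagnostics stay populated even while the health counter is gated
    // (assumed placement; the PR only says they remain populated).
    this.lastRpcFailureAt = now;
    this.lastRpcFailureMethod = method;

    const { state, gatewayReady, connectedAt } = this.status;
    if (state !== 'running' || !gatewayReady) return;
    if (connectedAt === undefined || now - connectedAt < RPC_FAILURE_GRACE_MS) return;

    this.consecutiveRpcFailures += 1;
  }

  recordRpcSuccess(): void {
    this.consecutiveRpcFailures = 0;
    // The first successful RPC implies readiness, even without gateway.ready.
    this.status.gatewayReady = true;
  }
}
```

Passing `now` explicitly keeps the gate testable without fake timers; the real implementation may read the clock directly.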

electron/gateway/ws-client.ts

  • New GATEWAY_CHALLENGE_TIMEOUT_MS_WIN = 30_000 and GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS_WIN = 45_000, selected automatically via getPlatformChallenge/ConnectHandshakeTimeoutMs() based on the caller's platform. Linux / macOS retain the stricter 10 s / 20 s defaults.
  • Prevents the retry loop that previously stacked three 20 s handshake-timeout failures on top of a legitimate Windows cold-start.
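
A rough sketch of the platform-aware selection, under stated assumptions: the PR abbreviates the helper names as getPlatformChallenge/ConnectHandshakeTimeoutMs(), so the two expanded names below are a guess. GATEWAY_CHALLENGE_TIMEOUT_MS is assumed by analogy with the other constants, and resolveGatewayTimeouts with its TimeoutOverrides type is hypothetical. The platform is passed in as a string here; the real code presumably derives it from the running process.

```typescript
// Defaults retained on Linux / macOS.
const GATEWAY_CHALLENGE_TIMEOUT_MS = 10_000; // name assumed by analogy
const GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS = 20_000;
// Windows ceilings introduced by this PR.
const GATEWAY_CHALLENGE_TIMEOUT_MS_WIN = 30_000;
const GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS_WIN = 45_000;

function getPlatformChallengeTimeoutMs(platform: string): number {
  return platform === 'win32'
    ? GATEWAY_CHALLENGE_TIMEOUT_MS_WIN
    : GATEWAY_CHALLENGE_TIMEOUT_MS;
}

function getPlatformConnectHandshakeTimeoutMs(platform: string): number {
  return platform === 'win32'
    ? GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS_WIN
    : GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS;
}

// Hypothetical helper: explicit caller overrides win, otherwise fall back
// to the platform defaults.
interface TimeoutOverrides {
  challengeTimeoutMs?: number;
  connectTimeoutMs?: number;
}

function resolveGatewayTimeouts(
  platform: string,
  overrides: TimeoutOverrides = {},
): { challengeTimeoutMs: number; connectTimeoutMs: number } {
  return {
    challengeTimeoutMs:
      overrides.challengeTimeoutMs ?? getPlatformChallengeTimeoutMs(platform),
    connectTimeoutMs:
      overrides.connectTimeoutMs ?? getPlatformConnectHandshakeTimeoutMs(platform),
  };
}
```

For scale: with the old 20 s ceiling, three retried handshake timeouts add 3 × 20 s = 60 s of waiting on top of the real boot time, which matches the ">1 min" of stacked delay noted in the problem description.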

electron/api/routes/channels.ts

  • buildChannelAccountsView() short-circuits the runtime channels.status RPC unless the gateway is truly running && gatewayReady. Startup renders now rely on the config view (already fully populated) instead of firing requests that are guaranteed to fail.
  • Channel save debounce raised from 150 ms → 2000 ms (CHANNEL_SAVE_RESTART_DEBOUNCE_MS). A typical provider + channel setup flow (which can trigger 3–5 saves within 1 s of each other) now coalesces into a single Gateway restart instead of stacking several cold-starts on Windows.
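
How a longer debounce coalesces a burst of saves into one restart can be illustrated with a trailing-edge debounce. This is a sketch, not the code in channels.ts: createDebouncedRestart and the injectable TimerApi are hypothetical, and only CHANNEL_SAVE_RESTART_DEBOUNCE_MS and its 2000 ms value come from the PR.

```typescript
// Hypothetical timer abstraction so the sketch can be driven by fake timers.
interface TimerApi {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realTimers: TimerApi = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

const CHANNEL_SAVE_RESTART_DEBOUNCE_MS = 2000;

// Returns a function to call on every save. Each call cancels the pending
// restart and re-arms the timer, so only the last save in a burst restarts
// the gateway.
function createDebouncedRestart(
  restart: () => void,
  delayMs: number = CHANNEL_SAVE_RESTART_DEBOUNCE_MS,
  timers: TimerApi = realTimers,
): () => void {
  let pending: unknown = undefined;
  let hasPending = false;

  return () => {
    if (hasPending) timers.clear(pending);
    hasPending = true;
    pending = timers.set(() => {
      hasPending = false;
      restart();
    }, delayMs);
  };
}
```

With a 150 ms window, saves spaced a few hundred milliseconds apart each fire their own restart; at 2000 ms, the 3–5 saves of a typical setup flow fall inside one window.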

electron/main/ipc-handlers.ts

  • Same 2000 ms debounce applied to the legacy IPC code path that also drives channel-save refreshes.

src/pages/Channels/index.tsx

  • scheduleConvergenceRefresh() timers re-read the live gateway status before firing. Ticks are skipped (with a clear info log) until the gateway is both running and gatewayReady.
  • Added a second trigger so that when gatewayReady flips true while status.state is already 'running', the page re-fetches runtime data and re-schedules convergence. This covers the case where a page loaded during 'starting' and transitioned via gateway.ready without ever changing status.state.
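
The two renderer-side behaviours above can be sketched as plain functions, independent of React. This is an assumption-laden illustration rather than the code in index.tsx: GatewayStatusView, TimerApi, isGatewayLive, shouldRefetchOnReadyFlip, and the parameter list of scheduleConvergenceRefresh are hypothetical; the five delays and the skip condition are taken from the PR.

```typescript
// Hypothetical shapes for illustration.
interface GatewayStatusView {
  state: string;
  gatewayReady?: boolean;
}

interface TimerApi {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

// Convergence ticks from the PR: 1.2 / 2.6 / 4.5 / 7 / 10.5 s.
const CONVERGENCE_DELAYS_MS = [1200, 2600, 4500, 7000, 10500];

function isGatewayLive(status: GatewayStatusView): boolean {
  return status.state === 'running' && status.gatewayReady === true;
}

// Each tick re-reads the live status when it fires, not when it was scheduled.
function scheduleConvergenceRefresh(
  getStatus: () => GatewayStatusView,
  fetchPageData: () => void,
  timers: TimerApi,
  log: (message: string) => void = () => {},
): () => void {
  const handles = CONVERGENCE_DELAYS_MS.map((delayMs) =>
    timers.set(() => {
      if (!isGatewayLive(getStatus())) {
        log(`convergence refresh at ${delayMs}ms skipped: gateway not ready`);
        return;
      }
      fetchPageData();
    }, delayMs),
  );
  // Cancel function for unmount or re-scheduling.
  return () => handles.forEach((handle) => timers.clear(handle));
}

// Second trigger: gatewayReady flips true while state was already 'running'.
function shouldRefetchOnReadyFlip(
  prev: GatewayStatusView,
  next: GatewayStatusView,
): boolean {
  return prev.state === 'running' && prev.gatewayReady !== true && isGatewayLive(next);
}
```

Reading the status inside the timer callback is the point of the change: a tick scheduled during 'starting' can still fire usefully if the gateway becomes ready before its delay elapses.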

Non-fix decisions documented

  • UtilityProcess IPC reload (PLAN item Dev #4): declined after code audit of node_modules/openclaw/dist. OpenClaw only handles reload via in-process SIGUSR1 (process.listenerCount('SIGUSR1') > 0 ? emit : kill) and does not listen for parent process.on('message') IPC. There is no gateway.reload / admin.reload RPC; secrets.reload and config.apply both funnel back into the same SIGUSR1 path. Even if implemented, an in-process SIGUSR1 reload would re-run the same plugin / bonjour / model-pricing init that accounts for the cost we see, so it would not avoid the delay. The debounce bump from 150 ms → 2 s (above) captures the much larger win: one restart per typical setup flow instead of 3–5.

Tests

  • tests/unit/gateway-manager-diagnostics.test.ts (now 6 tests):
    • new: RPC timeout during state: 'starting' does NOT pollute consecutiveRpcFailures / rpc_timeout reason.
    • new: RPC timeout within the 45 s post-connect grace window does NOT count.
    • new: First successful RPC flips gatewayReady = true.
    • Existing cases updated to set connectedAt / gatewayReady so recordRpcFailure still fires where intended.
  • tests/unit/gateway-manager-heartbeat.test.ts (now 5 tests):
    • Windows case split into "deferred while <5 min silence guard" + "fires after prolonged silence".
  • tests/unit/gateway-ws-client.test.ts (now 12 tests):
    • new: getPlatformChallenge/ConnectHandshakeTimeoutMs returns the Windows constants on win32 and defaults elsewhere.
    • new: A win32 handshake still succeeds after advancing past the non-Windows default (+5 s) because the Windows ceiling is in effect.
  • tests/unit/channel-routes.test.ts:
    • 25 getStatus mocks extended with gatewayReady: true so channels.status still dispatches in tests.
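
The two Windows heartbeat cases listed above ("deferred while <5 min silence guard" and "fires after prolonged silence") reduce to a small predicate. The sketch below is hypothetical apart from the WINDOWS_RECOVERY_SILENCE_MS name and its 5-minute value: HeartbeatSnapshot and shouldRecoverAfterPongMisses are illustrative names, and it assumes non-Windows platforms recover as soon as the pong-miss threshold is hit.

```typescript
const WINDOWS_RECOVERY_SILENCE_MS = 5 * 60_000;

// Hypothetical snapshot of what the manager knows when pong misses pile up.
interface HeartbeatSnapshot {
  platform: string;
  lastPongAt: number; // epoch ms
  lastInboundMessageAt: number; // epoch ms
}

// Called once the consecutive pong-miss threshold has been reached.
function shouldRecoverAfterPongMisses(snapshot: HeartbeatSnapshot, now: number): boolean {
  // Assumption: other platforms keep their existing immediate recovery.
  if (snapshot.platform !== 'win32') return true;

  // "True silence" means neither a pong nor any inbound message arrived.
  const lastActivityAt = Math.max(snapshot.lastPongAt, snapshot.lastInboundMessageAt);
  return now - lastActivityAt >= WINDOWS_RECOVERY_SILENCE_MS;
}
```

A Defender-induced stall that still delivers a message every minute or two never reaches the 5-minute threshold, so recovery stays deferred; a genuine deadlock does reach it and triggers a restart.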

Verification

| Check | Result |
| --- | --- |
| pnpm test | 84 files, 542 tests passed |
| pnpm typecheck | clean |
| pnpm run lint | clean |
| pnpm run comms:replay + comms:compare | all metrics PASS (rpc_timeout_rate, duplicate_event_rate, event_fanout_ratio, history_inflight_max, rpc_p95_ms 0% delta vs baseline) |

Notes for reviewers

  • No change to the banner UI itself or to user-facing copy.
  • The startup-grace change intentionally keeps lastRpcFailureAt / lastRpcFailureMethod diagnostics populated for debugging — only the health summary is gated.
  • No renderer-side change needed for history path: src/stores/chat/history-startup-retry.ts already classifies these failures and retries for up to ~45 s, so timeouts that used to poison the banner now retry transparently.

cursoragent and others added 4 commits April 24, 2026 03:09
…starting

Windows / 0.3.10 users see a persistent "网关状态异常 · 最近的网关 RPC 调用发生超时"
("Gateway status abnormal · a recent gateway RPC call timed out")
banner on the Channels page because cold-start on Windows routinely takes
30s to 2min (uv python download, Defender, bonjour probing, model-pricing
fetch, plugin init), which is longer than the channels.status / chat.history
RPC timeouts.  Any RPC fired during that window rejected with
"Gateway not connected" / "Gateway stopped" / "RPC timeout", each of
which incremented consecutiveRpcFailures and immediately tripped the
rpc_timeout health reason — with no cooldown, the banner stuck around.

Changes:
- isTransportRpcFailure() now only counts "RPC timeout:" messages.
  "Gateway not connected" / "Gateway stopped" / send failures are expected
  during lifecycle transitions and are already surfaced via status.state
  (gateway_not_running / gateway_error reasons).
- recordRpcFailure() gates counting on status.state === 'running',
  gatewayReady === true, AND 45s elapsed since connectedAt — matching the
  renderer-side chat.history startup retry window.
- On Windows we no longer disable heartbeat-driven recovery outright.
  Recovery requires 5 minutes of true silence (no pong, no inbound message)
  before the manager restarts, which keeps Defender-induced blips from
  false-positive restarts while still self-healing real deadlocks.
- buildChannelAccountsView() skips the channels.status RPC entirely unless
  the gateway is running AND gatewayReady, so bootstrap renders no longer
  fire requests that are guaranteed to fail.
- gatewayReady now also flips on the first successful RPC, so builds that
  don't emit gateway.ready no longer wait 30s for the fallback.
- Channel save debounce raised from 150ms to 2000ms so a typical
  provider+channel setup flow coalesces into a single Gateway restart
  instead of stacking several on Windows (where reload=restart).

Code paths:
- electron/gateway/manager.ts
- electron/api/routes/channels.ts
- electron/main/ipc-handlers.ts

Co-authored-by: Haze <hazeone@users.noreply.github.com>
…ndows silence-guarded recovery

- gateway-manager-diagnostics: verify that RPC timeouts during state=starting
  or within the 45s post-connect grace window don't pollute
  consecutiveRpcFailures / rpc_timeout reason; verify gatewayReady flips to
  true on first successful RPC.
- gateway-manager-heartbeat: split the old win32-disabled-recovery test into
  (1) defers while <5min silence guard, (2) restarts after prolonged silence.
- channel-routes: include gatewayReady=true in all mocked getStatus() returns
  so buildChannelAccountsView continues firing channels.status in tests (the
  new implementation short-circuits while gateway is not yet ready).

Co-authored-by: Haze <hazeone@users.noreply.github.com>
Gateway cold-start on Windows is dominated by Defender scans, npx/plugin
unpacking, bonjour probing, and model-pricing fetch.  The existing
10s/20s challenge/handshake ceilings regularly fired during normal
startup, which startup-orchestrator then classified as a transient error
and retried up to 3x — stacking an extra >1 min of false cold-start
latency on top of the real boot time.

Introduce platform-aware ceilings:
- challengeTimeoutMs:  10s (other)  -> 30s (win32)
- connectTimeoutMs:    20s (other)  -> 45s (win32)

connectGatewaySocket() now falls back to getPlatformChallenge/Connect*
helpers when the caller does not override the timeout explicitly.  Linux
and macOS retain the stricter defaults.

Also adds two regression tests:
- getPlatformChallenge/ConnectTimeoutMs returns the Windows constants on
  win32 and the defaults elsewhere.
- A win32 handshake still succeeds after advancing past the non-windows
  default (GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS + 5s) because the Windows
  ceiling (45s) is now in effect.

Co-authored-by: Haze <hazeone@users.noreply.github.com>
…ready

Entering the Channels page previously scheduled five convergence refreshes
at 1.2 / 2.6 / 4.5 / 7 / 10.5 s, each issuing a channels.status RPC
regardless of the gateway lifecycle state.  On Windows cold-start
(30s-130s) every one of those ticks was guaranteed to fail or be skipped
by the main-process RPC gate (just added), flooding the log with noise
and producing a poor UX.

Changes:

- scheduleConvergenceRefresh() timers now re-read the live gateway status
  before firing.  If state!='running' or gatewayReady!==true the tick is
  skipped with a clear info log instead of calling fetchPageData.
- Added a second trigger: when gatewayReady flips true while status.state
  is already 'running' (e.g. after a page that loaded during 'starting'
  transitioned via gateway.ready), re-run fetchPageData +
  scheduleConvergenceRefresh so the runtime view converges without
  requiring a state transition.

This complements the main-process channels.status short-circuit added in
the prior commit and keeps the UI in sync with actual gateway readiness.

Co-authored-by: Haze <hazeone@users.noreply.github.com>