fix(gateway): stop flagging RPC timeouts while gateway is starting/restarting #906
Draft
hazeone wants to merge 4 commits into
…starting

Windows / 0.3.10 users see a persistent "Gateway status abnormal · A recent gateway RPC call timed out" banner on the Channels page because cold-start on Windows routinely takes 30s to 2min (uv python download, Defender, bonjour probing, model-pricing fetch, plugin init), which is longer than the channels.status / chat.history RPC timeouts. Any RPC fired during that window was rejected with "Gateway not connected" / "Gateway stopped" / "RPC timeout", each of which incremented consecutiveRpcFailures and immediately tripped the rpc_timeout health reason. With no cooldown, the banner stuck around.

Changes:
- isTransportRpcFailure() now only counts "RPC timeout:" messages. "Gateway not connected" / "Gateway stopped" / send failures are expected during lifecycle transitions and are already surfaced via status.state (gateway_not_running / gateway_error reasons).
- recordRpcFailure() gates counting on status.state === 'running', gatewayReady === true, AND 45s elapsed since connectedAt, matching the renderer-side chat.history startup retry window.
- On Windows we no longer disable heartbeat-driven recovery outright. Recovery requires 5 minutes of true silence (no pong, no inbound message) before the manager restarts, which keeps Defender-induced blips from triggering false-positive restarts while still self-healing real deadlocks.
- buildChannelAccountsView() skips the channels.status RPC entirely unless the gateway is running AND gatewayReady, so bootstrap renders no longer fire requests that are guaranteed to fail.
- gatewayReady now also flips on the first successful RPC, so builds that don't emit gateway.ready no longer wait 30s for the fallback.
- Channel save debounce raised from 150ms to 2000ms so a typical provider+channel setup flow coalesces into a single Gateway restart instead of stacking several on Windows (where reload=restart).

Code paths:
- electron/gateway/manager.ts
- electron/api/routes/channels.ts
- electron/main/ipc-handlers.ts

Co-authored-by: Haze <hazeone@users.noreply.github.com>
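The counting rules in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in electron/gateway/manager.ts: the `RpcHealthTracker` class, the `GatewayStatus` shape, and the `RPC_FAILURE_GRACE_MS` name are hypothetical, and only the "RPC timeout:" prefix, the three gate conditions, and the 45 s window come from the commit message.

```typescript
// Hypothetical shapes for illustration; the real types live in electron/gateway/manager.ts.
type GatewayState = 'stopped' | 'starting' | 'running' | 'error';

interface GatewayStatus {
  state: GatewayState;
  gatewayReady: boolean;
  connectedAt?: number; // epoch ms of the current connection, if any
}

// Assumed constant name; the commit message only gives the 45 s value.
const RPC_FAILURE_GRACE_MS = 45_000;

// Only genuine timeouts count. "Gateway not connected" / "Gateway stopped"
// are lifecycle noise that is already reported through status.state.
function isTransportRpcFailure(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return message.includes('RPC timeout:');
}

class RpcHealthTracker {
  status: GatewayStatus;
  consecutiveRpcFailures = 0;
  lastRpcFailureAt: number | undefined;
  lastRpcFailureMethod: string | undefined;

  constructor(status: GatewayStatus) {
    this.status = status;
  }

  recordRpcFailure(method: string, error: unknown, now: number): void {
    if (!isTransportRpcFailure(error)) return;

    // Assumed ordering: diagnostics are kept even when the failure is not counted.
    this.lastRpcFailureAt = now;
    this.lastRpcFailureMethod = method;

    const { state, gatewayReady, connectedAt } = this.status;
    const pastGrace =
      connectedAt !== undefined && now - connectedAt >= RPC_FAILURE_GRACE_MS;

    // Count only once the gateway is running, ready, and past the grace window.
    if (state === 'running' && gatewayReady && pastGrace) {
      this.consecutiveRpcFailures += 1;
    }
  }

  recordRpcSuccess(): void {
    this.consecutiveRpcFailures = 0;
    // A successful RPC implies readiness, even if gateway.ready never arrived.
    this.status.gatewayReady = true;
  }
}
```

With this shape, a timeout recorded while `state` is 'starting', or within 45 s of `connectedAt`, updates the diagnostics fields but leaves `consecutiveRpcFailures` at zero, so the rpc_timeout reason is not raised.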
…ndows silence-guarded recovery

- gateway-manager-diagnostics: verify that RPC timeouts during state=starting or within the 45s post-connect grace window don't pollute consecutiveRpcFailures / rpc_timeout reason; verify gatewayReady flips to true on first successful RPC.
- gateway-manager-heartbeat: split the old win32-disabled-recovery test into (1) defers while <5min silence guard, (2) restarts after prolonged silence.
- channel-routes: include gatewayReady=true in all mocked getStatus() returns so buildChannelAccountsView continues firing channels.status in tests (the new implementation short-circuits while gateway is not yet ready).

Co-authored-by: Haze <hazeone@users.noreply.github.com>
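The two heartbeat cases these tests exercise (defer under the silence guard, restart after prolonged silence) correspond to a decision along these lines. This is a minimal sketch: `HeartbeatSnapshot` and `decideHeartbeatRecovery` are hypothetical names, and the immediate restart on non-Windows platforms is an assumption. The constant name WINDOWS_RECOVERY_SILENCE_MS and the 5-minute value come from the PR itself.

```typescript
// Name from the PR description; value (5 minutes) from the commit message.
const WINDOWS_RECOVERY_SILENCE_MS = 5 * 60_000;

// Hypothetical snapshot of what the heartbeat monitor knows at decision time.
interface HeartbeatSnapshot {
  platform: string;             // e.g. the value of process.platform
  lastPongAt: number;           // epoch ms of the last pong received
  lastInboundMessageAt: number; // epoch ms of the last inbound message of any kind
}

type RecoveryDecision = 'restart' | 'defer';

function decideHeartbeatRecovery(
  snapshot: HeartbeatSnapshot,
  now: number,
): RecoveryDecision {
  // Assumption: other platforms restart as soon as the heartbeat monitor gives up.
  if (snapshot.platform !== 'win32') return 'restart';

  // "True silence" means neither a pong nor any inbound message was seen.
  const lastActivityAt = Math.max(snapshot.lastPongAt, snapshot.lastInboundMessageAt);
  return now - lastActivityAt >= WINDOWS_RECOVERY_SILENCE_MS ? 'restart' : 'defer';
}
```

Taking the later of the two timestamps means a single inbound message resets the guard, which is what lets short Defender-induced stalls pass without a restart.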
Gateway cold-start on Windows is dominated by Defender scans, npx/plugin unpacking, bonjour probing, and model-pricing fetch. The existing 10s/20s challenge/handshake ceilings regularly fired during normal startup, which startup-orchestrator then classified as a transient error and retried up to 3x, stacking an extra >1 min of false cold-start latency on top of the real boot time.

Introduce platform-aware ceilings:
- challengeTimeoutMs: 10s (other) -> 30s (win32)
- connectTimeoutMs: 20s (other) -> 45s (win32)

connectGatewaySocket() now falls back to getPlatformChallenge/Connect* helpers when the caller does not override the timeout explicitly. Linux and macOS retain the stricter defaults.

Also adds two regression tests:
- getPlatformChallenge/ConnectTimeoutMs returns the Windows constants on win32 and the defaults elsewhere.
- A win32 handshake still succeeds after advancing past the non-windows default (GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS + 5s) because the Windows ceiling (45s) is now in effect.

Co-authored-by: Haze <hazeone@users.noreply.github.com>
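The fallback described above might look roughly like this. It is a sketch under assumptions: the helper names are expanded from the "getPlatformChallenge/Connect*" shorthand and may not match the real exports, GATEWAY_CHALLENGE_TIMEOUT_MS is an assumed name for the 10 s default, and `ConnectOptions` / `resolveTimeouts` are hypothetical. The numeric values are the ones given in this commit.

```typescript
// Defaults for Linux / macOS. GATEWAY_CHALLENGE_TIMEOUT_MS is an assumed name.
const GATEWAY_CHALLENGE_TIMEOUT_MS = 10_000;
const GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS = 20_000;

// Windows ceilings introduced by this commit.
const GATEWAY_CHALLENGE_TIMEOUT_MS_WIN = 30_000;
const GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS_WIN = 45_000;

// Helper names are assumed expansions of the shorthand used in the commit message.
function getPlatformChallengeTimeoutMs(platform: string): number {
  return platform === 'win32'
    ? GATEWAY_CHALLENGE_TIMEOUT_MS_WIN
    : GATEWAY_CHALLENGE_TIMEOUT_MS;
}

function getPlatformConnectHandshakeTimeoutMs(platform: string): number {
  return platform === 'win32'
    ? GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS_WIN
    : GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS;
}

// Hypothetical options bag; explicit caller overrides always win.
interface ConnectOptions {
  platform: string; // e.g. the value of process.platform
  challengeTimeoutMs?: number;
  connectTimeoutMs?: number;
}

function resolveTimeouts(options: ConnectOptions): {
  challengeTimeoutMs: number;
  connectTimeoutMs: number;
} {
  return {
    challengeTimeoutMs:
      options.challengeTimeoutMs ?? getPlatformChallengeTimeoutMs(options.platform),
    connectTimeoutMs:
      options.connectTimeoutMs ?? getPlatformConnectHandshakeTimeoutMs(options.platform),
  };
}
```

Using `??` rather than `||` keeps an explicit override of 0 intact, should a caller ever pass one.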
…ready

Entering the Channels page previously scheduled five convergence refreshes at 1.2 / 2.6 / 4.5 / 7 / 10.5 s, each issuing a channels.status RPC regardless of the gateway lifecycle state. On Windows cold-start (30s-130s) every one of those ticks was guaranteed to fail or be skipped by the main-process RPC gate (just added), flooding the log with noise and producing a poor UX.

Changes:
- scheduleConvergenceRefresh() timers now re-read the live gateway status before firing. If state != 'running' or gatewayReady !== true the tick is skipped with a clear info log instead of calling fetchPageData.
- Added a second trigger: when gatewayReady flips true while status.state is already 'running' (e.g. after a page that loaded during 'starting' transitioned via gateway.ready), re-run fetchPageData + scheduleConvergenceRefresh so the runtime view converges without requiring a state transition.

This complements the main-process channels.status short-circuit added in the prior commit and keeps the UI in sync with actual gateway readiness.

Co-authored-by: Haze <hazeone@users.noreply.github.com>
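The two renderer-side rules above can be sketched as plain functions, independent of React. Everything here is illustrative: the `GatewayStatus` shape, `runConvergenceTick`, `shouldRefetchOnReadyFlip`, and the function signatures are hypothetical. Only the five delays and the running + gatewayReady condition come from the commit message.

```typescript
// Hypothetical status shape as seen by the renderer.
interface GatewayStatus {
  state: string;
  gatewayReady?: boolean;
}

// Delays from the commit message: 1.2 / 2.6 / 4.5 / 7 / 10.5 s.
const CONVERGENCE_DELAYS_MS = [1_200, 2_600, 4_500, 7_000, 10_500];

function isReadyForRuntimeRpc(status: GatewayStatus): boolean {
  return status.state === 'running' && status.gatewayReady === true;
}

// One tick: re-read the live status and skip (with an info log) when not ready.
// Returns true when fetchPageData was actually called.
function runConvergenceTick(
  getStatus: () => GatewayStatus,
  fetchPageData: () => void,
  log: (message: string) => void,
): boolean {
  const status = getStatus();
  if (!isReadyForRuntimeRpc(status)) {
    log(`convergence refresh skipped (state=${status.state}, gatewayReady=${String(status.gatewayReady)})`);
    return false;
  }
  fetchPageData();
  return true;
}

// Schedules all ticks and returns a cancel function for unmount / re-schedule.
function scheduleConvergenceRefresh(
  getStatus: () => GatewayStatus,
  fetchPageData: () => void,
  log: (message: string) => void,
): () => void {
  const timers = CONVERGENCE_DELAYS_MS.map((delay) =>
    setTimeout(() => runConvergenceTick(getStatus, fetchPageData, log), delay),
  );
  return () => timers.forEach((timer) => clearTimeout(timer));
}

// Second trigger: gatewayReady flips to true while state was already 'running'.
function shouldRefetchOnReadyFlip(prev: GatewayStatus, next: GatewayStatus): boolean {
  return (
    prev.state === 'running' &&
    prev.gatewayReady !== true &&
    isReadyForRuntimeRpc(next)
  );
}
```

Reading the status inside the timer callback, rather than capturing it when the timers are created, is what lets a tick scheduled during 'starting' still fire correctly once the gateway becomes ready.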
Problem
On Windows + ClawX 0.3.10, many users see a persistent red banner on the Channels page:
Analysis of user-submitted logs (captured spawnToReadyMs of 50–130 seconds on Windows) showed the banner is not a single bug, but the product of several Windows-specific lifecycle choices combining against a UI judgement with no cooldown:

- channels.status (5–8 s timeout) and chat.history (30–35 s timeout) fire long before the gateway is actually ready, rejecting with "Gateway not connected" / "Gateway stopped" / "RPC timeout".
- On Windows, reload is forced into restart. SIGUSR1 is not supported, so every provider + channel save triggers a full cold-start, cascading the previous problem.
- Heartbeat recovery was disabled on Windows: the manager logged "Gateway heartbeat recovery skipped (platform=win32)" and did nothing until the user manually clicked "Restart gateway".
- The 10 s / 20 s challenge/handshake ceilings regularly fired during normal startup as "Connect handshake timeout", which startup-orchestrator treated as a transient error and retried up to 3×, stacking another >1 min of delay.
- The health summary mapped consecutiveRpcFailures > 0 → rpc_timeout reason without any cooldown. A single transport failure while the gateway was still booting stuck the banner until the next successful RPC.

Log evidence (abridged, from attached session transcripts):
Fix
electron/gateway/manager.ts
- isTransportRpcFailure() narrowed to only "RPC timeout:". "Gateway not connected" / "Gateway stopped" / send failures are expected during lifecycle transitions and are already surfaced via status.state (gateway_not_running / gateway_error reasons).
- recordRpcFailure() gates incrementing consecutiveRpcFailures on status.state === 'running', gatewayReady === true, AND 45 s elapsed since connectedAt, matching the renderer-side chat.history startup retry window.
- recordRpcSuccess() now infers gatewayReady = true on the first successful RPC, so builds that never emit (or emit too-early) gateway.ready no longer wait on the 30 s fallback.
- Heartbeat recovery on Windows is no longer disabled outright; it now requires 5 min of true silence, with no pong and no inbound message (WINDOWS_RECOVERY_SILENCE_MS). Short Defender-induced blips (which emit another message within a minute or two) keep the existing deferral; genuine deadlocks now self-heal after 5 min.

electron/gateway/ws-client.ts
- GATEWAY_CHALLENGE_TIMEOUT_MS_WIN = 30_000 and GATEWAY_CONNECT_HANDSHAKE_TIMEOUT_MS_WIN = 45_000, selected automatically via getPlatformChallenge/ConnectHandshakeTimeoutMs() based on the caller's platform. Linux / macOS retain the stricter 10 s / 20 s defaults.

electron/api/routes/channels.ts
- buildChannelAccountsView() short-circuits the runtime channels.status RPC unless the gateway is truly running && gatewayReady. Startup renders now rely on the config view (already fully populated) instead of firing requests that are guaranteed to fail.
- Channel save debounce raised from 150 ms to 2000 ms (CHANNEL_SAVE_RESTART_DEBOUNCE_MS). A typical provider + channel setup flow (which can trigger 3–5 saves within 1 s of each other) now coalesces into a single Gateway restart instead of stacking several cold-starts on Windows.

electron/main/ipc-handlers.ts

src/pages/Channels/index.tsx
- scheduleConvergenceRefresh() timers re-read the live gateway status before firing. Ticks are skipped (with a clear info log) while the gateway is not yet running + gatewayReady.
- When gatewayReady flips true while status.state is already 'running', the page re-fetches runtime data and re-schedules convergence. This covers the case where a page loaded during 'starting' and transitioned via gateway.ready without ever changing status.state.

Non-fix decisions documented
- node_modules/openclaw/dist. OpenClaw only handles reload via in-process SIGUSR1 (process.listenerCount('SIGUSR1') > 0 ? emit : kill) and does not listen for parent process.on('message') IPC. There is no gateway.reload / admin.reload RPC; secrets.reload and config.apply both funnel back into the same SIGUSR1 path. Even if implemented, the SIGUSR1 in-process reload re-runs the same plugin/bonjour/model-pricing init that causes the cost we see anyway. The debounce bump from 150 ms → 2 s (above) captures the much larger win: one restart per typical setup flow instead of 3–5.

Tests
tests/unit/gateway-manager-diagnostics.test.ts (now 6 tests):
- An RPC timeout during state: 'starting' does NOT pollute consecutiveRpcFailures / rpc_timeout reason.
- The first successful RPC flips gatewayReady = true.
- Existing tests set connectedAt / gatewayReady so recordRpcFailure still fires where intended.

tests/unit/gateway-manager-heartbeat.test.ts (now 5 tests):
- The old win32-disabled-recovery test is split into (1) defers while under the 5 min silence guard and (2) restarts after prolonged silence.

tests/unit/gateway-ws-client.test.ts (now 12 tests):
- getPlatformChallenge/ConnectHandshakeTimeoutMs returns the Windows constants on win32 and defaults elsewhere.

tests/unit/channel-routes.test.ts:
- getStatus mocks extended with gatewayReady: true so channels.status still dispatches in tests.

Verification
- pnpm test
- pnpm typecheck
- pnpm run lint
- pnpm run comms:replay + comms:compare

Notes for reviewers
- The lastRpcFailureAt / lastRpcFailureMethod diagnostics are still populated for debugging; only the health summary is gated.
- src/stores/chat/history-startup-retry.ts already classifies these failures and retries for up to ~45 s, so timeouts that used to poison the banner now retry transparently.