fix(browse): daemon resilience on loaded machines — Bun ConnectionRefused, stop/restart response flush, startup + git-root timeouts#1732

Open
mplatts wants to merge 2 commits into garrytan:main from mplatts:fix/browse-daemon-resilience

mplatts commented May 26, 2026

Problem

On a dev machine under sustained load (think 10 local servers + compiles, load avg 12+), the browse daemon was unreliable in three distinct ways:

  1. browse restart and browse stop printed "Unable to connect. Is the computer able to access the url?" and exited 1 instead of restarting or stopping the daemon.
  2. browse goto <url> intermittently failed with "Server failed to start within 8s", even though the daemon came up a second later.
  3. goto would report a 200, but a following url returned about:blank, because the two commands were talking to different daemons.

All three come from timeout and error-handling assumptions that hold on an idle machine and break under load. Root causes, with measurements from a machine at load average ~12:

Root causes + fixes

1. Bun reports refused/dropped sockets differently than Node. The compiled CLI runs on Bun, whose fetch throws err.code === 'ConnectionRefused' / 'ConnectionClosed' (message "Unable to connect..."), not Node's ECONNREFUSED/ECONNRESET. The crash-retry guard in sendCommand (cli.ts) only matched the Node codes, so a mid-command daemon drop leaked the raw Bun error and exited 1 instead of restarting-and-retrying. Broadened the guard to match both. (The repo's own e2e helpers already check for both 'ConnectionRefused' and 'Unable to connect', so this shape was known elsewhere.)
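
A minimal sketch of what a broadened guard could look like, assuming the check is factored into a helper. The name isDaemonGone is hypothetical and is not taken from cli.ts; the code strings and the "Unable to connect" message are the ones quoted above. The walk through err.cause is an extra assumption that covers Node's fetch, which wraps the socket error.

```typescript
// Codes Node reports for a refused or reset socket.
const NODE_CODES = new Set(["ECONNREFUSED", "ECONNRESET"]);
// Codes this PR reports from Bun's fetch for the same conditions.
const BUN_CODES = new Set(["ConnectionRefused", "ConnectionClosed"]);

// Hypothetical helper (not the real cli.ts code): true when the error looks
// like "the daemon went away", regardless of which runtime raised it.
function isDaemonGone(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { code?: unknown; message?: unknown; cause?: unknown };
  if (typeof e.code === "string" && (NODE_CODES.has(e.code) || BUN_CODES.has(e.code))) {
    return true;
  }
  // Fall back to the Bun message text quoted in the PR description.
  if (typeof e.message === "string" && e.message.includes("Unable to connect")) {
    return true;
  }
  // Assumption: also inspect a wrapped error, as Node's fetch attaches one as `cause`.
  return e.cause !== undefined && isDaemonGone(e.cause);
}
```

A retry path would call this in its catch block and restart the daemon only when it returns true, letting every other error propagate unchanged.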

2. stop / restart never flushed their HTTP response. Both handlers did await shutdown(); return '...', but shutdown() calls process.exit() inline, so the return was dead code and the response never reached the CLI. The CLI saw a dropped socket, which combined with (1) surfaced as the raw error. Deferred shutdown() by 100 ms (setTimeout(..., 100)) so the 200 flushes first; the daemon then exits and the next command lazily cold-starts.
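
The ordering problem can be illustrated with a small sketch. Everything here is hypothetical: makeStopHandler, the injectable schedule parameter, and the "stopped" reply are illustration names, not the handlers in meta-commands.ts. Only the idea of scheduling shutdown() about 100 ms later instead of awaiting it comes from the PR.

```typescript
type Scheduler = (fn: () => void, ms: number) => unknown;

// Hypothetical handler factory. `shutdown` stands in for the daemon's real
// shutdown(), which per the PR calls process.exit() and so never returns.
function makeStopHandler(
  shutdown: () => void,
  schedule: Scheduler = (fn, ms) => setTimeout(fn, ms),
  delayMs = 100,
): () => string {
  return () => {
    // Schedule the exit rather than awaiting it, so this function can return
    // and the server gets a chance to write the response first.
    schedule(shutdown, delayMs);
    return "stopped"; // placeholder reply text, not the real message
  };
}
```

Injecting the scheduler is only there to make the ordering observable in a test; a real handler would presumably call setTimeout directly.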

3. Startup + git-root timeouts too tight under load.

  • MAX_START_WAIT was 8s on macOS/Linux. A cold Chromium launch measured ~5.7s at load 10 and exceeds 8s at load 12+, so the CLI abandoned a daemon that was still booting. Raised to 15s (matches the existing Windows budget). The poll loop returns the instant the daemon is healthy, so this only costs time in a genuine failure.
  • getGitRoot()'s git rev-parse timeout was 2s; under load the call spikes past that (~6s observed), so getGitRoot() returns null and resolveConfig falls back to process.cwd(). That scatters .gstack/browse.json across cwds, so goto and url hit different daemons. Raised to 8s (still bounds a genuinely broken .git).
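
Both timeout points can be sketched as follows. This is an illustration under stated assumptions, not the code in cli.ts or config.ts: waitForHealthy, gitRoot, stateFile, and their parameters are hypothetical names, and placing the state file at `<root>/.gstack/browse.json` is inferred from the description above.

```typescript
import { execFileSync } from "node:child_process";

interface WaitOptions {
  maxWaitMs: number;
  intervalMs: number;
  now?: () => number; // injectable clock, for tests
  sleep?: (ms: number) => Promise<void>; // injectable delay, for tests
}

// Hypothetical poll loop: resolves true as soon as `isHealthy` succeeds and
// false only once the whole budget is spent, so a larger budget costs
// nothing when the daemon comes up quickly.
async function waitForHealthy(
  isHealthy: () => Promise<boolean>,
  opts: WaitOptions,
): Promise<boolean> {
  const now = opts.now ?? Date.now;
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const deadline = now() + opts.maxWaitMs;
  while (now() < deadline) {
    if (await isHealthy()) return true;
    await sleep(opts.intervalMs);
  }
  return false;
}

// Hypothetical stand-in for getGitRoot(): any failure, including the timeout
// firing on a slow machine, is reported as null.
function gitRoot(
  timeoutMs: number,
  run: (t: number) => string = (t) =>
    execFileSync("git", ["rev-parse", "--show-toplevel"], {
      encoding: "utf8",
      timeout: t,
      stdio: ["ignore", "pipe", "ignore"],
    }),
): string | null {
  try {
    return run(timeoutMs).trim() || null;
  } catch {
    return null;
  }
}

// Assumed layout: state lives under the git root, or under cwd when the root
// lookup failed. Two cwds in one repo then map to two different files.
function stateFile(cwd: string, root: string | null): string {
  return `${root ?? cwd}/.gstack/browse.json`;
}
```

With a null root, stateFile("/repo/a", null) and stateFile("/repo/b", null) differ, which is the mechanism behind goto and url reaching different daemons; with the root resolved, both collapse to the same path.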

Test plan

  • browse restart → exits 0, prints a clear message, next command cold-starts a fresh daemon
  • browse stop → exits 0, daemon and Chromium fully torn down (0 leftover processes)
  • browse goto / url / screenshot → green, single daemon, no stray state files
  • Targeted rebuild (bun build --compile browse/src/cli.ts) + daemon reload from source, verified end-to-end on macOS/arm64, Bun 1.2.18
  • No existing tests asserted the old stop/restart strings

Changes are source-only across browse/src/{cli,config,meta-commands}.ts. No dependency or build-script changes.

mplatts and others added 2 commits May 27, 2026 08:11
…esponse

The compiled CLI runs on Bun, whose fetch reports a refused/dropped socket
as err.code 'ConnectionRefused'/'ConnectionClosed' (message "Unable to
connect..."), not Node's ECONNREFUSED/ECONNRESET. The crash-retry catch in
sendCommand only knew the Node codes, so daemon crashes (and `browse restart`)
leaked the raw Bun error and exited 1 instead of restarting.

The stop/restart meta-command handlers did `await shutdown(); return ...`, but
shutdown() calls process.exit() inline — the response never flushed, so the
CLI saw a dropped socket. Defer shutdown one tick (setTimeout 100ms) so the
200 flushes first; the daemon then exits and the next command cold-starts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both timeouts were tuned for an idle machine and lose under sustained load
(10+ dev servers): cold Chromium launch measured ~5.7s at load 10 and exceeds
the 8s start budget at load 12+, and `git rev-parse` spikes past the 2s
git-root timeout ~30% of the time. The latter falls back to cwd, scattering
per-cwd state files so `goto` and `url` hit different daemons (about:blank).

- MAX_START_WAIT: macOS/Linux 8s -> 15s (matches the existing Windows budget).
  Poll loop returns the instant the daemon is healthy, so this only costs time
  on a genuine failure.
- getGitRoot timeout: 2s -> 8s. Still bounds a broken .git from hanging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>