browse CLI reports "Server failed to start within Ns" when the detached daemon is actually healthy (Windows-under-load gap left by #1732) #1846

@joshwilks111-max

Summary

The CLI's server-startup probe (browse/src/cli.ts) reports Server failed to start within Ns as a hard error even when the detached daemon comes up healthy a second later. The error is a probe-timeout, not a launch failure: the spawned server is detached: true + .unref()'d, so it keeps booting independently of whether the CLI's poll loop gave up. The very next browse status then shows Mode: headed, healthy.
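To illustrate why the daemon keeps booting after the probe gives up, here is a minimal sketch of a detached, unref'd launch using Node's child_process API (which Bun also implements). The actual spawn call in cli.ts is not shown in this issue, so the helper name spawnDetached and its options are assumptions, not the project's code.

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper (not from cli.ts): start a process that is allowed to
// outlive the caller, so its lifetime is independent of any poll loop.
function spawnDetached(command: string, args: string[]): number | undefined {
  const child = spawn(command, args, {
    detached: true,  // lets the child keep running after the parent exits
    stdio: "ignore", // no inherited pipes tying the child to the parent
  });
  child.unref();     // the parent's event loop no longer waits for the child
  return child.pid;
}
```

Once launched this way, nothing the CLI does afterwards (including timing out) stops the child, which is consistent with browse status reporting a healthy server right after the error.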

This is the Windows-under-load twin of the macOS/Linux symptom #1732 fixes. #1732 raises MAX_START_WAIT (8s → 15s on macOS/Linux to "match the Windows budget") and broadens the Bun ConnectionRefused guard — both good. But it leaves two gaps that this issue is about:

  1. The throw site itself is untouched. #1732 only changes the constant (line 24) and the sendCommand error guard (~line 495). The fall-through throw new Error(`Server failed to start within ${MAX_START_WAIT / 1000}s`) (cli.ts:372) still fires on timeout regardless of whether the daemon actually came up. So #1732 makes the window wider, but the false-negative at the edge of any window remains.
  2. MAX_START_WAIT has no env override. Every other tunable in server.ts reads from process.env.BROWSE_* (BROWSE_PORT, BROWSE_IDLE_TIMEOUT, etc.), but the startup budget is a hardcoded magic number with no escape hatch when the chosen ceiling still isn't enough.

Repro (Windows 11, gstack v1.55.0.0)

On a busy machine (20+ chrome.exe, ~12 node.exe running), browse connect:

Launching headed Chromium with extension + terminal agent...
[browse] Connect failed: Server failed to start within 15s

Immediately after, browse status:

Status: healthy
Mode: headed
URL: http://127.0.0.1:34567/welcome
Tabs: 1
PID: <pid>

netstat confirms the server is LISTENING on 34567 with ESTABLISHED connections from the extension + terminal-agent. So the launch fully succeeded; only the CLI's 15s health-probe gave up early. This is the existing Windows 15s budget — i.e. #1732's "raise everyone to 15s" ceiling does not resolve it on Windows under load.

Root cause

browse/src/cli.ts, the ensureServer() poll loop:

const start = Date.now();
while (Date.now() - start < MAX_START_WAIT) {
  const state = readState();
  if (state && await isServerHealthy(state.port)) {
    return state;
  }
  await Bun.sleep(100);
}
// ... reads startup-error.log if present, else:
throw new Error(`Server failed to start within ${MAX_START_WAIT / 1000}s`);

The loop polls health every 100ms, then throws on timeout. Because the daemon is detached, it can (and on loaded Windows boxes, does) become healthy after the loop exits. Nothing re-checks before throwing.

Proposed fix (two parts, both small, source-only in cli.ts)

1. Final health-check before throwing (structural — removes the false-negative at any load/platform). Before the timeout throw, do one last readState() + isServerHealthy(); if healthy, return it. Both helpers already exist (readState(): ServerState | null at cli.ts:109, isServerHealthy(port): Promise<boolean> at cli.ts:124). Roughly:

// One final check: the detached daemon may have become healthy after the
// poll loop's last tick (common on loaded machines — the server keeps
// booting independent of this probe). Don't report a false failure.
const finalState = readState();
if (finalState && await isServerHealthy(finalState.port)) {
  return finalState;
}
throw new Error(`Server failed to start within ${MAX_START_WAIT / 1000}s`);

2. Env override on the budget (escape hatch — matches the existing BROWSE_* pattern).

const MAX_START_WAIT = parseInt(process.env.BROWSE_START_WAIT || '', 10)
  || (IS_WINDOWS ? 15000 : (process.env.CI ? 30000 : 15000));
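One caveat with the parseInt(...) || fallback form: a negative value such as BROWSE_START_WAIT=-5 is truthy, so it would be accepted and the poll loop would never run. A slightly more defensive variant might look like the sketch below. The helper name resolveStartWait and the choice to reject non-positive values are assumptions for illustration; the fallback numbers are the ones from the snippet above.

```typescript
// Hypothetical helper: resolve the startup budget in milliseconds from an
// environment map, falling back to the defaults quoted in this issue.
function resolveStartWait(
  env: Record<string, string | undefined>,
  isWindows: boolean,
): number {
  const fallback = isWindows ? 15000 : env.CI ? 30000 : 15000;
  const parsed = Number.parseInt(env.BROWSE_START_WAIT ?? "", 10);
  // Reject NaN, zero, and negatives so a bad value cannot skip the poll loop.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}
```

In cli.ts this would presumably be called once as resolveStartWait(process.env, IS_WINDOWS); taking the environment as a parameter just keeps the function easy to test.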

Part 1 is the real fix (eliminates the misleading message without tuning a magic number); part 2 gives power users a knob for genuinely slow environments. They compose cleanly on top of #1732: #1732 widens the window, this removes the false-negative at the edge of whatever window exists.
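The effect of part 1 can be checked without a real daemon. The sketch below restates the poll loop with its dependencies injected, then drives it with a fake clock in which every 100ms sleep overshoots, as it might on a loaded machine. The names waitForServer, ProbeDeps, and fakeDeps are hypothetical, and the ServerState shape is reduced to the one field used here; only readState, isServerHealthy, and the error message come from the issue.

```typescript
interface ServerState { port: number; }

// Everything the loop touches is injected so it can run against a fake clock.
interface ProbeDeps {
  readState: () => ServerState | null;
  isServerHealthy: (port: number) => Promise<boolean>;
  sleep: (ms: number) => Promise<void>;
  now: () => number;
}

async function waitForServer(
  deps: ProbeDeps,
  maxWaitMs: number,
  finalCheck = true, // false reproduces the current behaviour
): Promise<ServerState> {
  const start = deps.now();
  while (deps.now() - start < maxWaitMs) {
    const state = deps.readState();
    if (state && (await deps.isServerHealthy(state.port))) return state;
    await deps.sleep(100);
  }
  if (finalCheck) {
    // The daemon may have become healthy after the loop's last probe.
    const finalState = deps.readState();
    if (finalState && (await deps.isServerHealthy(finalState.port))) {
      return finalState;
    }
  }
  throw new Error(`Server failed to start within ${maxWaitMs / 1000}s`);
}

// Fake environment: time only advances inside sleep(), each sleep overshoots
// by sleepOvershootMs, and the daemon turns healthy at healthyAtMs.
function fakeDeps(healthyAtMs: number, sleepOvershootMs: number): ProbeDeps {
  let t = 0;
  return {
    now: () => t,
    sleep: async (ms) => { t += ms + sleepOvershootMs; },
    readState: () => (t >= healthyAtMs ? { port: 34567 } : null),
    isServerHealthy: async () => t >= healthyAtMs,
  };
}
```

With fakeDeps(14800, 400) and a 15000ms budget, the last in-loop probe happens at t=14500 and the loop exits at t=15000, so the version without the final check throws while the version with it returns the state. Note that the final check only covers the gap between the last probe and the throw; a daemon that becomes healthy later than that still produces the error, which is the case the env override in part 2 is meant for.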

Notes
