Skip to content

fix(browse): recover a late-healthy detached daemon instead of a false "Server failed to start" (#1846)#1847

Open
harjothkhara wants to merge 1 commit into
garrytan:mainfrom
harjothkhara:fix/1846-browse-startup-false-negative
Open

fix(browse): recover a late-healthy detached daemon instead of a false "Server failed to start" (#1846)#1847
harjothkhara wants to merge 1 commit into
garrytan:mainfrom
harjothkhara:fix/1846-browse-startup-false-negative

Conversation

@harjothkhara
Copy link
Copy Markdown

Summary

browse's startServer() polls /health for MAX_START_WAIT, then throws Server failed to start within Ns on timeout. But the server is spawned detached + .unref()'d, so it keeps booting independently of the CLI's poll loop. On a loaded machine the daemon can become healthy in the gap between the loop's final tick and the throw — the very next browse status shows it listening and healthy. The probe timed out; the launch did not.

Closes #1846.

What this changes

  1. Final health re-check before the throw (the actual fix). Before the timeout throw in startServer, do one last readState() + isServerHealthy(); if healthy, return that state. This mirrors the post-loop recovery already present in ensureServer() (cli.ts:482-483) — startServer was simply missing the same guard. A genuinely failed server stays unhealthy and falls straight through to the existing startup-error-log report, so real failures are unaffected.

  2. BROWSE_START_TIMEOUT env override. MAX_START_WAIT was a hardcoded magic number — the only browse startup tunable without a BROWSE_* escape hatch (unlike BROWSE_PORT, BROWSE_IDLE_TIMEOUT, …). It's now produced by a small pure, exported resolveStartTimeout(env) that honors BROWSE_START_TIMEOUT (ms) and falls back to the platform default for absent/non-positive/unparseable values.

Relationship to #1732

Complementary, not overlapping. #1732 only widens the constant (8s → 15s for non-Windows). This PR removes the false-negative at the throw site itself, so it holds at any budget/platform/load, and adds the env knob #1732 left out. The two changes touch different lines and compose cleanly.

Tests

browse/test/cli-start-final-healthcheck.test.ts:

  • Behavioral coverage of resolveStartTimeout — positive override honored; 0 / negative / garbage / empty fall back to the platform default.
  • Static invariants that the final readState()+isServerHealthy() re-check sits between the poll loop and the throw, returns the recovered state, and that MAX_START_WAIT is wired through resolveStartTimeout().
$ bun test browse/test/cli-start-final-healthcheck.test.ts
 6 pass  0 fail

Existing cli-* / restart-env tests stay green. (The unrelated commands.test.ts failure on a fresh checkout is a missing-npx playwright install browser binary — reproduces identically on main.)

🤖 Generated with Claude Code

…e "failed to start" (garrytan#1846)

startServer polls /health for MAX_START_WAIT, then throws "Server failed to
start within Ns" on timeout. But the server is spawned detached + unref'd, so
it keeps booting independently of the CLI's poll loop. On a loaded machine the
daemon can become healthy in the gap between the loop's final tick and the
throw — the very next `browse status` shows it listening and healthy. The probe
timed out; the launch did not.

- Add a final readState() + isServerHealthy() re-check before the timeout throw,
  mirroring the post-loop recovery already in ensureServer(). A genuinely failed
  server stays unhealthy and falls through to the existing error report.
- Make the startup budget overridable via BROWSE_START_TIMEOUT (ms) through a
  new pure, exported resolveStartTimeout() — matching the BROWSE_* tunable
  convention in server.ts — so hosts where even the platform ceiling is too low
  have an escape hatch.

Complements garrytan#1732 (which only raised the constant): this removes the
false-negative at the throw site itself, on any platform/load.

Test: browse/test/cli-start-final-healthcheck.test.ts — behavioral coverage of
resolveStartTimeout overrides + a static invariant that the final re-check sits
between the poll loop and the throw.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@trunk-io
Copy link
Copy Markdown

trunk-io Bot commented Jun 3, 2026

❌ This pull request has been removed from the merge queue because required statuses are not defined in .trunk/trunk.yaml, your repo's branch protection rules for main, or the merge queue specific settings. See more details here.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@jbetala7
Copy link
Copy Markdown
Contributor

jbetala7 commented Jun 3, 2026

Independently reviewed against current main (c43c850c) — looks correct and complete.

The asymmetry the PR describes is real on main: startServer's poll loop (browse/src/cli.ts:352-358) drops straight to the throw new Error('Server failed to start within ...') at :372 with no final probe, while ensureServer re-reads state and re-probes health (:482-485) before it ever decides to restart. Since the daemon is spawned detached: true + .unref()'d (:344-345), it keeps booting independently of the CLI's loop, so on a loaded box it can become healthy in the gap between the loop's last tick and the throw — exactly the #1846 repro. The added readState() + isServerHealthy() recheck before the throw recovers that false negative, and a genuinely-failed server stays unhealthy and falls through to the existing error-log report, so there's no masking of real failures.

The BROWSE_START_TIMEOUT override is NaN-safe (Number.isFinite(override) && override > 0, else platform default) and follows the established BROWSE_* tunable convention. Extracting it as a pure exported resolveStartTimeout keeps it unit-testable; using a static-source-invariant test for the recheck (rather than a live spawn) is the right call given live spawn cycles belong in the e2e tier.

Collision check: no competing open PR for #1846 / startServer health. The env-override is slightly beyond the strict fix but is directly motivated by the Windows-under-load repro in the issue and is well-contained. LGTM.

Copy link
Copy Markdown
Author

/trunk merge

@harjothkhara
Copy link
Copy Markdown
Author

These look ready from my side — the branch is clean/mergeable and an independent review confirmed the fix is correct and complete. The only red check is the Trunk merge-queue gate, which from what I can tell just needs a maintainer to enqueue it (no code/test failure). @garrytan whenever you have a moment, could these be added to the merge queue? Happy to rebase if anything is needed. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

browse CLI reports "Server failed to start within Ns" when the detached daemon is actually healthy (Windows-under-load gap left by #1732)

2 participants