fix(browse): recover a late-healthy detached daemon instead of a false "Server failed to start" (#1846) by harjothkhara · Pull Request #1847 · garrytan/gstack

harjothkhara · 2026-06-03T05:06:13Z

Summary

browse's startServer() polls /health for MAX_START_WAIT, then throws Server failed to start within Ns on timeout. But the server is spawned detached + .unref()'d, so it keeps booting independently of the CLI's poll loop. On a loaded machine the daemon can become healthy in the gap between the loop's final tick and the throw — the very next browse status shows it listening and healthy. The probe timed out; the launch did not.

Closes #1846.

What this changes

Final health re-check before the throw (the actual fix). Before the timeout throw in startServer, do one last readState() + isServerHealthy(); if healthy, return that state. This mirrors the post-loop recovery already present in ensureServer() (cli.ts:482-483) — startServer was simply missing the same guard. A genuinely failed server stays unhealthy and falls straight through to the existing startup-error-log report, so real failures are unaffected.
BROWSE_START_TIMEOUT env override. MAX_START_WAIT was a hardcoded magic number — the only browse startup tunable without a BROWSE_* escape hatch (unlike BROWSE_PORT, BROWSE_IDLE_TIMEOUT, …). It's now produced by a small pure, exported resolveStartTimeout(env) that honors BROWSE_START_TIMEOUT (ms) and falls back to the platform default for absent/non-positive/unparseable values.

Relationship to #1732

Complementary, not overlapping. #1732 only widens the constant (8s → 15s for non-Windows). This PR removes the false-negative at the throw site itself, so it holds at any budget/platform/load, and adds the env knob #1732 left out. The two changes touch different lines and compose cleanly.

Tests

browse/test/cli-start-final-healthcheck.test.ts:

Behavioral coverage of resolveStartTimeout — positive override honored; 0 / negative / garbage / empty fall back to the platform default.
Static invariants that the final readState()+isServerHealthy() re-check sits between the poll loop and the throw, returns the recovered state, and that MAX_START_WAIT is wired through resolveStartTimeout().

$ bun test browse/test/cli-start-final-healthcheck.test.ts
 6 pass  0 fail

Existing cli-* / restart-env tests stay green. (The unrelated commands.test.ts failure on a fresh checkout is a missing-npx playwright install browser binary — reproduces identically on main.)

🤖 Generated with Claude Code

…e "failed to start" (garrytan#1846) startServer polls /health for MAX_START_WAIT, then throws "Server failed to start within Ns" on timeout. But the server is spawned detached + unref'd, so it keeps booting independently of the CLI's poll loop. On a loaded machine the daemon can become healthy in the gap between the loop's final tick and the throw — the very next `browse status` shows it listening and healthy. The probe timed out; the launch did not. - Add a final readState() + isServerHealthy() re-check before the timeout throw, mirroring the post-loop recovery already in ensureServer(). A genuinely failed server stays unhealthy and falls through to the existing error report. - Make the startup budget overridable via BROWSE_START_TIMEOUT (ms) through a new pure, exported resolveStartTimeout() — matching the BROWSE_* tunable convention in server.ts — so hosts where even the platform ceiling is too low have an escape hatch. Complements garrytan#1732 (which only raised the constant): this removes the false-negative at the throw site itself, on any platform/load. Test: browse/test/cli-start-final-healthcheck.test.ts — behavioral coverage of resolveStartTimeout overrides + a static invariant that the final re-check sits between the poll loop and the throw. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

trunk-io · 2026-06-03T05:06:17Z

❌ This pull request has been removed from the merge queue because required statuses are not defined in .trunk/trunk.yaml, your repo's branch protection rules for main, or the merge queue specific settings. See more details here.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

jbetala7 · 2026-06-03T05:59:29Z

Independently reviewed against current main (c43c850c) — looks correct and complete.

The asymmetry the PR describes is real on main: startServer's poll loop (browse/src/cli.ts:352-358) drops straight to the throw new Error('Server failed to start within ...') at :372 with no final probe, while ensureServer re-reads state and re-probes health (:482-485) before it ever decides to restart. Since the daemon is spawned detached: true + .unref()'d (:344-345), it keeps booting independently of the CLI's loop, so on a loaded box it can become healthy in the gap between the loop's last tick and the throw — exactly the #1846 repro. The added readState() + isServerHealthy() recheck before the throw recovers that false negative, and a genuinely-failed server stays unhealthy and falls through to the existing error-log report, so there's no masking of real failures.

The BROWSE_START_TIMEOUT override is NaN-safe (Number.isFinite(override) && override > 0, else platform default) and follows the established BROWSE_* tunable convention. Extracting it as a pure exported resolveStartTimeout keeps it unit-testable; using a static-source-invariant test for the recheck (rather than a live spawn) is the right call given live spawn cycles belong in the e2e tier.

Collision check: no competing open PR for #1846 / startServer health. The env-override is slightly beyond the strict fix but is directly motivated by the Windows-under-load repro in the issue and is well-contained. LGTM.

harjothkhara · 2026-06-04T01:22:09Z

/trunk merge

harjothkhara · 2026-06-05T18:17:04Z

These look ready from my side — the branch is clean/mergeable and an independent review confirmed the fix is correct and complete. The only red check is the Trunk merge-queue gate, which from what I can tell just needs a maintainer to enqueue it (no code/test failure). @garrytan whenever you have a moment, could these be added to the merge queue? Happy to rebase if anything is needed. Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(browse): recover a late-healthy detached daemon instead of a false "Server failed to start" (#1846)#1847

fix(browse): recover a late-healthy detached daemon instead of a false "Server failed to start" (#1846)#1847
harjothkhara wants to merge 1 commit into
garrytan:mainfrom
harjothkhara:fix/1846-browse-startup-false-negative

harjothkhara commented Jun 3, 2026

Uh oh!

trunk-io Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

jbetala7 commented Jun 3, 2026

Uh oh!

harjothkhara commented Jun 4, 2026

Uh oh!

harjothkhara commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

harjothkhara commented Jun 3, 2026

Summary

What this changes

Relationship to #1732

Tests

Uh oh!

trunk-io Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jbetala7 commented Jun 3, 2026

Uh oh!

harjothkhara commented Jun 4, 2026

Uh oh!

harjothkhara commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

trunk-io Bot commented Jun 3, 2026 •

edited

Loading