Skip to content

Commit 3ca0415

Browse files
fix(server): tighten startChromiumAndWait readiness gate (#257)
# Checklist - [ ] A link to a related issue in our repository. - [x] A description of the changes proposed in the pull request. - [ ] @mentions of the person or team responsible for reviewing proposed changes. ## Summary `startChromiumAndWait` could return `"devtools ready"` within ~135 ms of `supervisorctl start` ack'ing — long before the new Chromium had actually bound its DevTools listener. Two issues combined to cause this: 1. `tryReady`'s `allowCurrent=true` shortcut dialed the **previous** Chromium's URL the instant `supervisorctl start` returned. That URL points at the just-killed Chromium, and a bare `websocket.Dial` could complete against a half-closed socket of the dying process (or against a freshly bound but not-yet-routed listener of the new one), producing a false-positive ready. 2. `tryReady` never made a real CDP round-trip after dial, so it couldn't distinguish a live Chromium from a stale listener. End-to-end this caused `POST /configure` (and `PATCH /chromium/*` / `PATCH /display`) to return `200` several seconds before Chromium could actually serve requests. We measured this both in production (~5 s gap) and via a controlled local benchmark (~8.5 s gap on Apple Silicon under Rosetta). ## What changes ### `server/lib/cdpclient` * New `Client.GetBrowserVersion(ctx)` that sends a single `Browser.getVersion` CDP message. A successful round-trip proves the WebSocket is wired to a live, CDP-responsive Chromium. * Test coverage for both happy path and CDP-error-from-chromium. ### `server/cmd/api/api/chromium.go` * `startChromiumAndWait`: * Drop the `allowCurrent` parameter from `tryReady`. The readiness check is now: upstream URL must differ from `prevUpstream` AND a `Browser.getVersion` round-trip must succeed. * `doneCh` (supervisorctl start command return) is used only to short-circuit on start failure — it no longer attempts readiness against the current upstream. * `stopChromium`: * Add an `"chromium stopped"` info log on the success paths with `outcome`, `elapsed`, and `elapsed_ms` attributes. First numeric duration field on the stop path; enables `p99(elapsed_ms)` aggregations in SigNoz without parsing the existing Go duration strings. ## Test plan * `go vet ./...` ✅ * `go test -race $(go list ./... | grep -v /e2e$)` ✅ — all packages pass ### Benchmark Two `kernel-images-headless` images built locally (linux/amd64 under Rosetta on Apple Silicon): * `kernel-images-headless:before` — upstream `main` * `kernel-images-headless:after` — this PR For each, the container is started, the initial Chromium is verified CDP-responsive, then `POST /configure` is issued with `display {1280x800@60}` + `chromium_flags ["--kiosk"]` (which triggers the stop/start path). The benchmark times: * `configureMs` — when `/configure` returns 200 * `newUUIDMs` — when `/json/version` reports a new `webSocketDebuggerUrl` (proves new chromium emitted its `DevTools listening` line) * `cdpReadyMs` — when `Browser.getVersion` succeeds against that new UUID * `GAP = cdpReadyMs − configureMs` — the user-visible bug magnitude n=3 per image: | | before (median) | after (median) | | --- | --- | --- | | `/configure` returns 200 | 3,586 ms | 12,046 ms | | New UUID observed | 12,006 ms | 12,009 ms | | `Browser.getVersion` ok | 12,074 ms | 12,093 ms | | **GAP** | **8,516 ms** | **45 ms** | Total time-to-real-readiness is identical (~12 s — Chromium itself takes that long to fully restart with kiosk + viewport under Rosetta). The fix doesn't make Chromium faster; it stops the api from lying about when Chromium is ready. The remaining 45 ms in `after` is just the 25 ms probe interval + WebSocket round-trip — measurement noise. Benchmark script lives under `bench/bench-configure-readiness.ts`. Re-run with: ```bash bun bench/bench-configure-readiness.ts --image kernel-images-headless:before --label before bun bench/bench-configure-readiness.ts --image kernel-images-headless:after --label after ``` ## Not addressed (suggested follow-ups) * `stopsignal=KILL stopwaitsecs=0` in `images/chromium-{headful,headless}/.../supervisor/services/chromium.conf` causes SIGKILL spam and skips Chromium's graceful shutdown path. Recommend `stopsignal=TERM stopwaitsecs=5 killasgroup=true stopasgroup=true` plus a `Setpgid: true` on the chromium exec in `chromium-launcher` so `killasgroup` has a group to target. Worth a separate PR because it changes shutdown semantics (graceful exit can let us drop the `SingletonLock` cleanup hack in `chromium-launcher/main.go`). * The new `elapsed_ms` numeric attribute on `stopChromium` could be retrofitted onto the other six `elapsed=time.Since(start).String()` log lines in this file for consistent aggregation in SigNoz. Made with [Cursor](https://cursor.com) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes core restart readiness for configure and related Chromium APIs; responses may take longer but should be more accurate; mis-tuned timeouts could increase 500s on slow restarts. > > **Overview** > Fixes **false-positive DevTools readiness** after Chromium restarts so configure/flags/policy/extension endpoints no longer return success while the browser is still unusable. > > **`startChromiumAndWait`** now treats Chromium as ready only when the upstream DevTools URL **differs from the pre-restart URL** and a **`Browser.getVersion` CDP round-trip** succeeds on that URL. It removes the `allowCurrent` shortcut and no longer declares ready when `supervisorctl start` returns (which only means the process was forked). **`stopChromium`** gains structured **`chromium stopped`** logs with `outcome`, `elapsed`, and **`elapsed_ms`** for observability. > > **`cdpclient`** adds **`GetBrowserVersion`** (plus unit tests). A new **`bench/bench-configure-readiness.ts`** script measures the gap between `POST /configure` returning 200 and real CDP readiness for before/after images. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit df450e2. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 6ba41bd commit 3ca0415

4 files changed

Lines changed: 524 additions & 20 deletions

File tree

bench/bench-configure-readiness.ts

Lines changed: 365 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,365 @@
1+
#!/usr/bin/env bun
2+
3+
/*
4+
Benchmark the readiness contract of POST /configure on the kernel-images
5+
container by:
6+
1. Starting one of the kernel-images-headless docker images.
7+
2. Waiting for the initial chromium to be fully CDP-responsive.
8+
3. Issuing a single POST /configure with both viewport (1280x800@60) and
9+
--kiosk chromium_flags so the VM has to stop chromium and start it
10+
back up (the "stop/start path" inside ChromiumConfigure).
11+
4. Concurrently probing:
12+
(a) when /json/version starts reporting a NEW browser UUID
13+
(i.e. the new chromium has emitted "DevTools listening on ...")
14+
(b) when a real Browser.getVersion CDP round-trip succeeds against
15+
that new UUID (i.e. chromium is actually serving CDP).
16+
and timing the gap between the api returning 200 and (a)/(b) being
17+
true.
18+
19+
This isolates the bug fixed in fix/chromium-readiness-false-positive:
20+
the BEFORE image returns 200 several seconds before chromium can actually
21+
serve requests; the AFTER image holds 200 until chromium is real.
22+
23+
Usage:
24+
bun bench/bench-configure-readiness.ts \
25+
--image kernel-images-headless:before --label before
26+
bun bench/bench-configure-readiness.ts \
27+
--image kernel-images-headless:after --label after
28+
*/
29+
30+
interface Args {
31+
image: string;
32+
label: string;
33+
apiPort: number;
34+
devToolsPort: number;
35+
containerName: string;
36+
iterations: number;
37+
}
38+
39+
function parseArgs(): Args {
40+
const argv = process.argv.slice(2);
41+
const get = (flag: string, def?: string) => {
42+
const i = argv.indexOf(flag);
43+
if (i === -1) return def;
44+
return argv[i + 1];
45+
};
46+
const image = get('--image');
47+
if (!image) {
48+
console.error('--image is required');
49+
process.exit(1);
50+
}
51+
return {
52+
image,
53+
label: get('--label', image) ?? image,
54+
apiPort: Number(get('--api-port', '19001')),
55+
devToolsPort: Number(get('--devtools-port', '19222')),
56+
containerName: get('--name', `kibench-${Date.now()}`)!,
57+
iterations: Number(get('--iterations', '3')),
58+
};
59+
}
60+
61+
const args = parseArgs();
62+
63+
async function sh(cmd: string[], opts?: { quiet?: boolean }): Promise<string> {
64+
const proc = Bun.spawn(cmd, {
65+
stdout: 'pipe',
66+
stderr: opts?.quiet ? 'ignore' : 'pipe',
67+
});
68+
const out = await new Response(proc.stdout).text();
69+
await proc.exited;
70+
if (proc.exitCode !== 0) {
71+
throw new Error(`${cmd.join(' ')} exited ${proc.exitCode}: ${out}`);
72+
}
73+
return out;
74+
}
75+
76+
async function dockerRm(name: string): Promise<void> {
77+
try {
78+
await sh(['docker', 'rm', '-f', name], { quiet: true });
79+
} catch {
80+
// ignore
81+
}
82+
}
83+
84+
async function startContainer(): Promise<void> {
85+
await dockerRm(args.containerName);
86+
console.log(`[start] docker run ${args.image}`);
87+
// Detached run; we'll exec on it later if we need anything from inside.
88+
await sh([
89+
'docker',
90+
'run',
91+
'-d',
92+
'--rm',
93+
'--name',
94+
args.containerName,
95+
'--platform',
96+
'linux/amd64',
97+
'--privileged',
98+
'--tmpfs',
99+
'/dev/shm:size=2g',
100+
'-p',
101+
`${args.devToolsPort}:9222`,
102+
'-p',
103+
`${args.apiPort}:10001`,
104+
args.image,
105+
]);
106+
}
107+
108+
async function getBrowserUUID(): Promise<string | null> {
109+
try {
110+
const res = await fetch(`http://localhost:${args.devToolsPort}/json/version`, {
111+
signal: AbortSignal.timeout(1500),
112+
});
113+
if (!res.ok) return null;
114+
const data = (await res.json()) as { webSocketDebuggerUrl?: string };
115+
const m = data.webSocketDebuggerUrl?.match(/\/devtools\/browser\/([^/?]+)/);
116+
return m?.[1] ?? null;
117+
} catch {
118+
return null;
119+
}
120+
}
121+
122+
async function browserGetVersionCDP(uuid: string, timeoutMs = 2000): Promise<boolean> {
123+
// Send a single Browser.getVersion message on a fresh WS.
124+
return new Promise<boolean>((resolve) => {
125+
let settled = false;
126+
const settle = (ok: boolean) => {
127+
if (!settled) {
128+
settled = true;
129+
try {
130+
ws.close();
131+
} catch {
132+
// ignore
133+
}
134+
resolve(ok);
135+
}
136+
};
137+
let ws: WebSocket;
138+
try {
139+
ws = new WebSocket(`ws://localhost:${args.devToolsPort}/devtools/browser/${uuid}`);
140+
} catch {
141+
resolve(false);
142+
return;
143+
}
144+
const t = setTimeout(() => settle(false), timeoutMs);
145+
ws.addEventListener('open', () => {
146+
try {
147+
ws.send(JSON.stringify({ id: 1, method: 'Browser.getVersion' }));
148+
} catch {
149+
clearTimeout(t);
150+
settle(false);
151+
}
152+
});
153+
ws.addEventListener('message', (ev) => {
154+
try {
155+
const msg = JSON.parse(String(ev.data));
156+
if (msg.id === 1 && msg.result?.product) {
157+
clearTimeout(t);
158+
settle(true);
159+
}
160+
} catch {
161+
// ignore
162+
}
163+
});
164+
ws.addEventListener('error', () => {
165+
clearTimeout(t);
166+
settle(false);
167+
});
168+
ws.addEventListener('close', () => {
169+
clearTimeout(t);
170+
settle(false);
171+
});
172+
});
173+
}
174+
175+
async function waitUntil(
176+
fn: () => Promise<boolean | string | null>,
177+
timeoutMs: number,
178+
intervalMs = 100,
179+
): Promise<{ ok: boolean; ms: number; value: boolean | string | null }> {
180+
const t0 = Date.now();
181+
while (Date.now() - t0 < timeoutMs) {
182+
const v = await fn();
183+
if (v) {
184+
return { ok: true, ms: Date.now() - t0, value: v };
185+
}
186+
await new Promise((r) => setTimeout(r, intervalMs));
187+
}
188+
return { ok: false, ms: Date.now() - t0, value: null };
189+
}
190+
191+
function makeConfigureBody(width: number, height: number, refresh: number): { body: Blob; contentType: string } {
192+
const fd = new FormData();
193+
fd.append('display', JSON.stringify({ width, height, refresh_rate: refresh, restart_chromium: true, require_idle: false }));
194+
fd.append('chromium_flags', JSON.stringify({ flags: ['--kiosk'] }));
195+
// Bun's FormData via Request handles boundary automatically; we capture both
196+
// body and content-type via a dummy Request.
197+
const req = new Request('http://x/', { method: 'POST', body: fd });
198+
return { body: req.body as unknown as Blob, contentType: req.headers.get('content-type')! };
199+
}
200+
201+
async function postConfigure(): Promise<{ status: number; ms: number; body: string }> {
202+
const { body, contentType } = makeConfigureBody(1280, 800, 60);
203+
const t0 = Date.now();
204+
let res: Response;
205+
try {
206+
res = await fetch(`http://localhost:${args.apiPort}/configure`, {
207+
method: 'POST',
208+
headers: { 'content-type': contentType },
209+
body,
210+
signal: AbortSignal.timeout(60_000),
211+
});
212+
} catch (e) {
213+
return { status: -1, ms: Date.now() - t0, body: `fetch error: ${(e as Error).message}` };
214+
}
215+
const ms = Date.now() - t0;
216+
const txt = await res.text();
217+
return { status: res.status, ms, body: txt };
218+
}
219+
220+
interface Sample {
221+
iter: number;
222+
initialUUID: string;
223+
configureMs: number;
224+
configureStatus: number;
225+
newUUIDMs: number | null;
226+
newUUID: string | null;
227+
cdpReadyMs: number | null;
228+
gapMs: number | null; // cdpReadyMs - configureMs
229+
}
230+
231+
async function runOneIteration(iter: number): Promise<Sample> {
232+
// Wait for the initial chromium to be fully ready (UUID present + CDP responsive).
233+
console.log(`[iter ${iter}] waiting for initial chromium ready...`);
234+
const initialResult = await waitUntil(
235+
async () => {
236+
const u = await getBrowserUUID();
237+
if (!u) return null;
238+
const ok = await browserGetVersionCDP(u, 1500);
239+
return ok ? u : null;
240+
},
241+
90_000,
242+
250,
243+
);
244+
if (!initialResult.ok || typeof initialResult.value !== 'string') {
245+
throw new Error('initial chromium never became ready');
246+
}
247+
const initialUUID = initialResult.value;
248+
console.log(`[iter ${iter}] initial chromium ready in ${initialResult.ms}ms uuid=${initialUUID.slice(0, 8)}…`);
249+
250+
// Fire POST /configure and concurrently start probing for the new UUID.
251+
console.log(`[iter ${iter}] POST /configure (viewport 1280x800@60 + --kiosk)...`);
252+
const configureStart = Date.now();
253+
const configurePromise = postConfigure();
254+
255+
// Poll fast for the new UUID to appear on /json/version.
256+
let newUUID: string | null = null;
257+
let newUUIDMs: number | null = null;
258+
const newUUIDProbe = (async () => {
259+
while (Date.now() - configureStart < 60_000) {
260+
const u = await getBrowserUUID();
261+
if (u && u !== initialUUID) {
262+
newUUID = u;
263+
newUUIDMs = Date.now() - configureStart;
264+
return;
265+
}
266+
await new Promise((r) => setTimeout(r, 25));
267+
}
268+
})();
269+
270+
// Wait for /configure to return.
271+
const configResult = await configurePromise;
272+
const configureMs = configResult.ms;
273+
console.log(
274+
`[iter ${iter}] /configure returned ${configResult.status} in ${configureMs}ms${configResult.status !== 200 ? ` body=${configResult.body.slice(0, 200)}` : ''}`,
275+
);
276+
277+
// Wait for the new UUID probe to land (or time out).
278+
await newUUIDProbe;
279+
if (newUUID === null) {
280+
return {
281+
iter,
282+
initialUUID,
283+
configureMs,
284+
configureStatus: configResult.status,
285+
newUUIDMs: null,
286+
newUUID: null,
287+
cdpReadyMs: null,
288+
gapMs: null,
289+
};
290+
}
291+
console.log(`[iter ${iter}] new UUID observed at +${newUUIDMs}ms uuid=${newUUID!.slice(0, 8)}…`);
292+
293+
// Now wait for a real CDP round-trip to succeed against the new UUID.
294+
let cdpReadyMs: number | null = null;
295+
const cdpDeadline = configureStart + 60_000;
296+
while (Date.now() < cdpDeadline) {
297+
const ok = await browserGetVersionCDP(newUUID!, 1500);
298+
if (ok) {
299+
cdpReadyMs = Date.now() - configureStart;
300+
break;
301+
}
302+
await new Promise((r) => setTimeout(r, 25));
303+
}
304+
console.log(`[iter ${iter}] Browser.getVersion succeeded at +${cdpReadyMs}ms`);
305+
306+
const gapMs = cdpReadyMs !== null ? cdpReadyMs - configureMs : null;
307+
return {
308+
iter,
309+
initialUUID,
310+
configureMs,
311+
configureStatus: configResult.status,
312+
newUUIDMs,
313+
newUUID,
314+
cdpReadyMs,
315+
gapMs,
316+
};
317+
}
318+
319+
async function main(): Promise<void> {
320+
console.log(`=== benchmark ${args.label} (image=${args.image}) ===`);
321+
console.log(`apiPort=${args.apiPort} devToolsPort=${args.devToolsPort} iters=${args.iterations}`);
322+
323+
await startContainer();
324+
325+
const samples: Sample[] = [];
326+
try {
327+
for (let i = 1; i <= args.iterations; i++) {
328+
const s = await runOneIteration(i);
329+
samples.push(s);
330+
}
331+
} finally {
332+
console.log('\n[cleanup] tearing down container');
333+
await dockerRm(args.containerName);
334+
}
335+
336+
console.log(`\n=== ${args.label} summary ===`);
337+
console.log(
338+
`${'iter'.padEnd(5)}${'configMs'.padEnd(11)}${'newUUID@'.padEnd(11)}${'cdpReady@'.padEnd(11)}${'GAP'.padEnd(8)}`,
339+
);
340+
console.log('-'.repeat(46));
341+
for (const s of samples) {
342+
console.log(
343+
String(s.iter).padEnd(5) +
344+
String(s.configureMs).padEnd(11) +
345+
String(s.newUUIDMs ?? '?').padEnd(11) +
346+
String(s.cdpReadyMs ?? '?').padEnd(11) +
347+
String(s.gapMs ?? '?').padEnd(8),
348+
);
349+
}
350+
console.log();
351+
const gaps = samples.map((s) => s.gapMs).filter((v): v is number => v !== null);
352+
if (gaps.length) {
353+
const sum = gaps.reduce((a, b) => a + b, 0);
354+
const min = Math.min(...gaps);
355+
const max = Math.max(...gaps);
356+
const sorted = [...gaps].sort((a, b) => a - b);
357+
const median = sorted[Math.floor(sorted.length / 2)]!;
358+
console.log(`GAP stats (ms): n=${gaps.length} min=${min} median=${median} mean=${Math.round(sum / gaps.length)} max=${max}`);
359+
console.log('(GAP = cdpReadyMs − configureMs. Positive = api returned 200 before chromium was really ready.)');
360+
} else {
361+
console.log('no gap measurements collected');
362+
}
363+
}
364+
365+
void main();

0 commit comments

Comments
 (0)