Skip to content

Commit ec8e06a

Browse files
heavygeecursoragent
andcommitted
feat(runner): serialize start-sync invocations via setup-window lock
The mtime-driven self-restart + the operator drop-in's ExecStartPre together created a race window where two `hapi runner start-sync` invocations could each see the other's runner as stale, kill it, and then race for the runtime lock. The 2026-05-31 22:40 BST incident captured the worst-case outcome: invocation A killed the live runner, took SIGTERM ~200ms later from invocation B's setup, and exited cleanly - leaving the machine offline because systemd's Restart=on-failure does not recover from exit code 0. Add a separate `runner.start.lock` held by `startRunner()` for the duration of its setup window only (version check + stopRunner + new runtime-lock acquisition + state-file write). Released as soon as the state file is owned, so legitimate next-invocations only delay a few hundred ms. Stale-cleanup after 15s in case a setup crashes between acquire and release. When a second start-sync arrives while the first is mid-setup: 1. It blocks on acquireRunnerStartLock for up to ~6s. 2. By release, the first invocation owns the runtime lock + state file with the current version. 3. The blocked invocation's subsequent isRunnerRunningCurrentlyInstalledHappyVersion() returns "matching", so it exits cleanly via the existing "Runner already running with matching version" path - no kill, no race. If the lock cannot be acquired within the window (another invocation is genuinely stuck), bail out instead of compounding the problem. Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 7d3bb7a commit ec8e06a

3 files changed

Lines changed: 128 additions & 1 deletion

File tree

cli/src/configuration.ts

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -52,6 +52,7 @@ class Configuration {
5252
public readonly privateKeyFile: string
5353
public readonly runnerStateFile: string
5454
public readonly runnerLockFile: string
55+
public readonly runnerStartLockFile: string
5556
public readonly currentCliVersion: string
5657

5758
public readonly isExperimentalEnabled: boolean
@@ -80,6 +81,14 @@ class Configuration {
8081
this.privateKeyFile = join(this.happyHomeDir, 'access.key')
8182
this.runnerStateFile = join(this.happyHomeDir, 'runner.state.json')
8283
this.runnerLockFile = join(this.happyHomeDir, 'runner.state.json.lock')
84+
// Held by `startRunner()` during its setup window only (version check +
85+
// stopRunner + new runtime-lock acquisition + state-file write). NOT
86+
// held for the runner's lifetime - that's runnerLockFile's job. The
87+
// start lock serialises concurrent `runner start-sync` invocations so
88+
// they cannot race past each other into the kill-old / start-new
89+
// sequence. Stale-clean via mtime: any holder older than 15s is
90+
// assumed dead (the setup window is well under 1s on healthy hosts).
91+
this.runnerStartLockFile = join(this.happyHomeDir, 'runner.start.lock')
8392

8493
this.isExperimentalEnabled = ['true', '1', 'yes'].includes(process.env.HAPI_EXPERIMENTAL?.toLowerCase() || '')
8594

cli/src/persistence.ts

Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -274,3 +274,73 @@ export async function releaseRunnerLock(lockHandle: FileHandle): Promise<void> {
274274
}
275275
} catch { }
276276
}
277+
278+
/**
279+
* Acquire the runner *start* lock - held by `startRunner()` for the duration
280+
* of its setup window (version check + stopRunner + new runtime-lock + state
281+
* write). Returns the open FileHandle on success, or null when another
282+
* start-sync invocation is already inside its setup window.
283+
*
284+
* This is intentionally separate from `acquireRunnerLock()`:
285+
* * runnerLockFile (runner.state.json.lock) is held for the *lifetime* of a
286+
* running runner - it's the "I am the live runner" marker.
287+
* * runnerStartLockFile (runner.start.lock) is held only during the few-ms
288+
* setup window of a `startRunner()` invocation - it serialises *invocations*
289+
* so two `hapi runner start-sync` calls cannot race past each other into
290+
* the kill-old / start-new sequence (which caused the 2026-05-31 22:40
291+
* incident: invocation A killed the live runner; invocation B's racing
292+
* SIGTERM then knocked A over before its replacement could register).
293+
*
294+
* Stale cleanup: lock file mtime > staleAfterMs (default 15s) is treated as
295+
* abandoned and removed. The genuine setup window completes in well under 1s
296+
* even on slow hardware, so 15s is comfortably loose.
297+
*/
298+
export async function acquireRunnerStartLock(options: {
299+
maxAttempts?: number;
300+
delayMs?: number;
301+
staleAfterMs?: number;
302+
} = {}): Promise<FileHandle | null> {
303+
const maxAttempts = options.maxAttempts ?? 30; // up to ~6s of waiting
304+
const delayMs = options.delayMs ?? 200;
305+
const staleAfterMs = options.staleAfterMs ?? 15_000;
306+
307+
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
308+
try {
309+
const fileHandle = await open(configuration.runnerStartLockFile, 'wx');
310+
await fileHandle.writeFile(`${process.pid} ${new Date().toISOString()}`);
311+
return fileHandle;
312+
} catch (error: any) {
313+
if (error.code !== 'EEXIST') {
314+
throw error;
315+
}
316+
// Stale-cleanup probe before sleeping.
317+
try {
318+
const stats = await stat(configuration.runnerStartLockFile);
319+
if (Date.now() - stats.mtimeMs > staleAfterMs) {
320+
await unlink(configuration.runnerStartLockFile).catch(() => { });
321+
continue;
322+
}
323+
} catch { /* lock vanished between EEXIST and stat - retry */ }
324+
if (attempt === maxAttempts) {
325+
return null;
326+
}
327+
await new Promise(resolve => setTimeout(resolve, delayMs));
328+
}
329+
}
330+
return null;
331+
}
332+
333+
/**
334+
* Release the start lock acquired via acquireRunnerStartLock.
335+
* Idempotent on missing files.
336+
*/
337+
export async function releaseRunnerStartLock(lockHandle: FileHandle): Promise<void> {
338+
try {
339+
await lockHandle.close();
340+
} catch { }
341+
try {
342+
if (existsSync(configuration.runnerStartLockFile)) {
343+
unlinkSync(configuration.runnerStartLockFile);
344+
}
345+
} catch { }
346+
}

cli/src/runner/run.ts

Lines changed: 49 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ import { configuration } from '@/configuration';
1111
import packageJson from '../../package.json';
1212
import { getEnvironmentInfo } from '@/ui/doctor';
1313
import { spawnHappyCLI } from '@/utils/spawnHappyCLI';
14-
import { writeRunnerState, RunnerLocallyPersistedState, readRunnerState, readSettings, acquireRunnerLock, releaseRunnerLock } from '@/persistence';
14+
import { writeRunnerState, RunnerLocallyPersistedState, readRunnerState, readSettings, acquireRunnerLock, releaseRunnerLock, acquireRunnerStartLock, releaseRunnerStartLock } from '@/persistence';
1515
import { isProcessAlive, isWindows, killProcess, killProcessByChildProcess } from '@/utils/process';
1616
import { PERMISSION_MODES } from '@hapi/protocol/modes';
1717
import { withRetry } from '@/utils/time';
@@ -98,6 +98,38 @@ export async function startRunner(options: { workspaceRoots?: string[] } = {}):
9898
logger.debug('[RUNNER RUN] Starting runner process...');
9999
logger.debugLargeJson('[RUNNER RUN] Environment', getEnvironmentInfo());
100100

101+
// Serialise concurrent `runner start-sync` invocations. The 2026-05-31 22:40
102+
// BST incident reproduced what happens without this lock: a terminal-launched
103+
// start-sync (no HAPI_DISABLE_VERSION_HANDOFF) killed the systemd-owned live
104+
// runner via stopRunner(), then took an external SIGTERM ~200ms later before
105+
// its replacement could register - net result was a 3-minute machine outage.
106+
//
107+
// With the start lock held: if any other startRunner() is already inside its
108+
// setup window (kill-old + acquire-runtime-lock + write-state), we wait up
109+
// to ~6s for it to release. By then the winning invocation has a running
110+
// runner; our subsequent isRunnerRunningCurrentlyInstalledHappyVersion()
111+
// check returns "matching", and we exit cleanly with no kill action.
112+
//
113+
// If the start lock cannot be acquired within the window, another invocation
114+
// is genuinely stuck - we bail out instead of compounding the problem.
115+
const startLockHandle = await acquireRunnerStartLock();
116+
if (!startLockHandle) {
117+
logger.debug('[RUNNER RUN] Another runner start-sync invocation is in its setup window; bailing out');
118+
console.log('Another `hapi runner start-sync` invocation is currently starting up; not racing it');
119+
process.exit(0);
120+
}
121+
122+
// From here until we either bail or finish writing runner.state.json with
123+
// the runtime lock held, any early `process.exit()` must release the start
124+
// lock first. The lock file is also stale-cleaned after 15s (see
125+
// acquireRunnerStartLock) so a crashed setup cannot block the next attempt.
126+
let startLockReleased = false;
127+
const releaseStartLockOnce = async () => {
128+
if (startLockReleased) return;
129+
startLockReleased = true;
130+
await releaseRunnerStartLock(startLockHandle);
131+
};
132+
101133
// Check if already running
102134
// Check if running runner version matches current CLI version
103135
const runningRunnerVersionMatches = await isRunnerRunningCurrentlyInstalledHappyVersion();
@@ -107,13 +139,15 @@ export async function startRunner(options: { workspaceRoots?: string[] } = {}):
107139
} else {
108140
logger.debug('[RUNNER RUN] Runner version matches, keeping existing runner');
109141
console.log('Runner already running with matching version');
142+
await releaseStartLockOnce();
110143
process.exit(0);
111144
}
112145

113146
// Acquire exclusive lock (proves runner is running)
114147
const runnerLockHandle = await acquireRunnerLock(5, 200);
115148
if (!runnerLockHandle) {
116149
logger.debug('[RUNNER RUN] Runner lock file already held, another runner is running');
150+
await releaseStartLockOnce();
117151
process.exit(0);
118152
}
119153

@@ -684,6 +718,13 @@ export async function startRunner(options: { workspaceRoots?: string[] } = {}):
684718
writeRunnerState(fileState);
685719
logger.debug('[RUNNER RUN] Runner state written');
686720

721+
// Setup window complete: runtime lock + state file are now owned by us, so
722+
// any subsequent start-sync invocation will see "Runner already running with
723+
// matching version" and bow out without racing. Release the start lock to
724+
// let any queued retries proceed (they will exit cleanly via the version
725+
// check). Holding it longer would just delay legitimate next-invocations.
726+
await releaseStartLockOnce();
727+
687728
// Prepare initial runner state
688729
const initialRunnerState: RunnerState = {
689730
status: 'offline',
@@ -942,6 +983,10 @@ export async function startRunner(options: { workspaceRoots?: string[] } = {}):
942983
await stopControlServer();
943984
await cleanupRunnerState();
944985
await releaseRunnerLock(runnerLockHandle);
986+
// Defensive: start-lock should already be released after writeRunnerState,
987+
// but if cleanupAndShutdown fires before that point (rare, e.g. SIGTERM
988+
// during auth setup) we must not leave it on disk for 15s.
989+
await releaseStartLockOnce();
945990

946991
logger.debug('[RUNNER RUN] Cleanup completed, exiting process');
947992
process.exit(0);
@@ -954,6 +999,9 @@ export async function startRunner(options: { workspaceRoots?: string[] } = {}):
954999
await cleanupAndShutdown(shutdownRequest.source, shutdownRequest.errorMessage);
9551000
} catch (error) {
9561001
logger.debug('[RUNNER RUN][FATAL] Failed somewhere unexpectedly - exiting with code 1', error);
1002+
// Best-effort start-lock release on fatal error (don't await in case the
1003+
// filesystem itself is the problem; stale-cleanup will catch it in 15s).
1004+
void releaseStartLockOnce().catch(() => { });
9571005
process.exit(1);
9581006
}
9591007
}

0 commit comments

Comments
 (0)