Skip to content

Commit 60d270f

Browse files
committed
fix: cancel pending delayed snapshot before the async completion call
The timer wheel can tick during the awaited completeRunAttempt HTTP round trip, so cancelling after it leaves a window where a due snapshot dispatches anyway and pauses a VM that has moved on. Cancel first - the runnerId guard is ordering-independent, and the runner can't schedule a new suspend until it receives this route's reply, so nothing legitimate can be cancelled early. Matches the existing /continue ordering. Addresses Devin review feedback.
1 parent 3b4c9db commit 60d270f

1 file changed

Lines changed: 13 additions & 11 deletions

File tree

  • apps/supervisor/src/workloadServer

apps/supervisor/src/workloadServer/index.ts

Lines changed: 13 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -288,24 +288,26 @@ export class WorkloadServer extends EventEmitter<WorkloadServerEvents> {
288288
async () => {
289289
const { req, reply, params, body } = ctx;
290290
const runnerId = this.runnerIdFromRequest(req);
291+
292+
// A completion attempt invalidates any pending delayed snapshot
293+
// regardless of outcome: the runner has finished executing, so the
294+
// suspended state the snapshot was scheduled to capture no longer
295+
// exists. Cancel BEFORE the async completion call - the timer
296+
// wheel can tick during the await, so cancelling after it leaves
297+
// a real window for a due snapshot to dispatch and pause a VM
298+
// that has moved on. The runnerId guard keeps a stale duplicate
299+
// runner's completion from cancelling a fresh runner's snapshot,
300+
// and the runner can't schedule a new suspend until it receives
301+
// this route's reply, so nothing legitimate can be cancelled here.
302+
this.snapshotService?.cancel(params.runFriendlyId, runnerId);
303+
291304
const completeResponse = await this.workerClient.completeRunAttempt(
292305
params.runFriendlyId,
293306
params.snapshotFriendlyId,
294307
body,
295308
runnerId
296309
);
297310

298-
// A completion attempt invalidates any pending delayed snapshot
299-
// regardless of outcome: the runner has finished executing, so the
300-
// suspended state the snapshot was scheduled to capture no longer
301-
// exists. Without this, the snapshot fires up to snapshotDelayMs
302-
// later and pauses a VM that has long moved on - and on a transient
303-
// completion failure the runner retries, so waiting for success
304-
// would leave the stale snapshot armed in the meantime. The
305-
// runnerId guard keeps a stale duplicate runner's failed completion
306-
// from cancelling a fresh runner's snapshot.
307-
this.snapshotService?.cancel(params.runFriendlyId, runnerId);
308-
309311
if (!completeResponse.success) {
310312
this.logger.error("Failed to complete run", {
311313
params,

0 commit comments

Comments
 (0)