feat(supervisor): verify warm-start delivery, cold-start silently lost dispatches#3918
feat(supervisor): verify warm-start delivery, cold-start silently lost dispatches#3918myftija wants to merge 3 commits into
Conversation
Firestarter's didWarmStart: true means the response was written to a socket, not that the runner received it. A silently dead poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run stuck in PENDING_EXECUTING until the run engine's heartbeat redrive minutes later, burning a queue redelivery toward TASK_RUN_DEQUEUED_MAX_RETRIES each time. After a warm-start hit the supervisor now retains the DequeuedMessage, waits TRIGGER_WARM_START_VERIFY_DELAY_MS (default 10s), then asks the platform for the run's latest snapshot. If it is still the exact snapshot that was dequeued, no runner ever started the attempt - the run falls through to the regular cold-create path. Double-starts are prevented by the engine: startRunAttempt runs under a per-run lock and rejects stale snapshot ids, so a reviving runner and the fallback workload can't both execute. On probe errors nothing happens - during platform brownouts healthy runners legitimately act late, and falling back on uncertainty would stampede duplicates; the heartbeat redrive stays as the backstop. Off by default; enable with TRIGGER_WARM_START_VERIFY_ENABLED. When disabled the code path is a no-op. Emits warmstart.verify wide events (outcome: delivered / fallback / probe_error). Resolves TRI-10659.
Review follow-ups: the workload-create error log now carries the run id (fallback creates run outside the dequeue wide event, so the log was the only attribution), and the verifier stops before the workload server and session so its timer can't cold-create a workload mid-shutdown.
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🧰 Additional context used📓 Path-based instructions (9)**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.{ts,tsx,js,jsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.ts📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
Files:
apps/supervisor/src/env.ts📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)
Files:
**/*.{js,ts,tsx,jsx,css,json,md}📄 CodeRabbit inference engine (AGENTS.md)
Files:
apps/supervisor/src/services/**/*.{js,ts}📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)
Files:
**/*.{test,spec}.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.test.{ts,tsx}📄 CodeRabbit inference engine (CLAUDE.md)
Files:
**/*.test.{js,ts,tsx}📄 CodeRabbit inference engine (AGENTS.md)
Files:
🧠 Learnings (8)📚 Learning: 2026-05-14T14:54:39.095ZApplied to files:
📚 Learning: 2026-03-22T13:26:12.060ZApplied to files:
📚 Learning: 2026-03-22T19:24:14.403ZApplied to files:
📚 Learning: 2026-05-18T08:21:27.694ZApplied to files:
📚 Learning: 2026-05-18T08:21:27.694ZApplied to files:
📚 Learning: 2026-06-04T18:16:35.386ZApplied to files:
📚 Learning: 2026-06-09T17:58:04.699ZApplied to files:
📚 Learning: 2026-05-18T14:40:02.173ZApplied to files:
🔇 Additional comments (11)
WalkthroughThis pull request adds an opt-in warm-start delivery verification feature to the supervisor. The feature validates whether warm-start dispatches reached runners and automatically falls back to cold-start workload creation if delivery is not confirmed within a configurable delay window. Configuration is gated by 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint install timed out. The project may have too many dependencies for the sandbox. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| recordPhaseSince("workload_create", createStart, undefined); | ||
|
|
||
| // Disabled for now | ||
| // this.resourceMonitor.blockResources({ | ||
| // cpu: message.run.machine.cpu, | ||
| // memory: message.run.machine.memory, | ||
| // }); | ||
| } catch (error) { | ||
| recordPhaseSince( | ||
| "workload_create", | ||
| createStart, | ||
| error instanceof Error ? error : new Error(String(error)) | ||
| ); |
There was a problem hiding this comment.
🚩 recordPhaseSince called outside runWideEvent context in fallback path
When createWorkload is called from the verification service's timer-driven fallback (warmStartVerificationService.ts:146), it runs outside any runWideEvent async context. The recordPhaseSince calls at apps/supervisor/src/index.ts:576 and apps/supervisor/src/index.ts:584 were originally always inside the runWideEvent callback but now also execute from the verification fallback path where no wide-event AsyncLocalStorage context exists.
If recordPhaseSince gracefully handles a missing context (no-op), this is fine. If it throws, the behavior depends on the path:
- Success path (line 576): workload is already created, but the throw enters the catch block which calls
recordPhaseSinceagain (line 584), causing a second throw that propagates out. ThetryCatchin the verification service catches it and logs a misleading error. - Error path (line 584): the original error is swallowed and replaced with a context-missing error.
In either case the workload creation itself succeeds (or fails for its own reasons), so functional correctness is preserved. But the observability data and error messages would be wrong in the fallback path.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
The condition you flagged holds: recordPhaseSince -> recordPhase guards with if (!state) return (apps/supervisor/src/wideEvents/record.ts:27), and fromContext() returns null outside the ALS scope (wideEvents/context.ts:12) - so from the verifier's timer path it's a clean no-op, no throw, no corrupted phase data. The fallback path's observability instead comes from the warmstart.verify wide event plus the runId-attributed error log in createWorkload's catch.
Problem
Firestarter's
didWarmStart: truemeans the response was written to a long-poll socket — not that the runner received it. A silently dead poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run stuck inPENDING_EXECUTINGuntil the run engine's heartbeat redrive (60s default / 300s in prod), and each redrive burns a queue redelivery towardTASK_RUN_DEQUEUED_MAX_RETRIES. Observed ~1/12h on compute-test after the TRI-10293 fixes. Resolves TRI-10659.Change
After a warm-start hit, the supervisor retains the
DequeuedMessage(TimerWheel, default 10s), then probes the existinggetLatestSnapshotAPI. If the run is still on the exact dequeued snapshot, no runner ever acted — it falls through to the regular cold-create path. Recovery: ~10s + cold start, no new APIs, no CLI changes.startRunAttemptruns under a per-run lock and 409s stale snapshot ids, so a reviving runner and the fallback workload can't both execute; the loser exits before running anything.TRIGGER_WARM_START_VERIFY_ENABLED(+TRIGGER_WARM_START_VERIFY_DELAY_MS, 1–60s, default 10s). Disabled = complete no-op. Works for all workload managers (compute/k8s/docker) since it hooks the shared dequeue path.warmstart.verifywide events (outcome: delivered | fallback | probe_error), making the silent-loss rate directly measurable.