feat(supervisor): verify warm-start delivery, cold-start silently lost dispatches#3918

Open
myftija wants to merge 3 commits into main from tri-10659-warm-start-delivery-verification

@myftija myftija commented Jun 12, 2026

Collaborator

Problem

Firestarter's didWarmStart: true means the response was written to a long-poll socket — not that the runner received it. A silently dead poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run stuck in PENDING_EXECUTING until the run engine's heartbeat redrive (60s default / 300s in prod), and each redrive burns a queue redelivery toward TASK_RUN_DEQUEUED_MAX_RETRIES. Observed ~1/12h on compute-test after the TRI-10293 fixes. Resolves TRI-10659.

Change

After a warm-start hit, the supervisor retains the DequeuedMessage (TimerWheel, default 10s), then probes the existing getLatestSnapshot API. If the run is still on the exact dequeued snapshot, no runner ever acted — it falls through to the regular cold-create path. Recovery: ~10s + cold start, no new APIs, no CLI changes.
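
For readers following along, the probe decision described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in warmStartVerificationService.ts: the `ProbeResult` type and the `decideOutcome` name are hypothetical, and only the three outcome labels come from this PR.

```typescript
// Hypothetical types for illustration; not taken from the PR's source.
type ProbeResult =
  | { ok: true; latestSnapshotId: string }
  | { ok: false; error: string };

type VerifyOutcome = "delivered" | "fallback" | "probe_error";

// Decide what to do once the verification delay has elapsed.
function decideOutcome(dequeuedSnapshotId: string, probe: ProbeResult): VerifyOutcome {
  // A failed probe is inconclusive: do nothing and leave recovery to the heartbeat redrive.
  if (!probe.ok) return "probe_error";
  // Unchanged snapshot: no runner acted on the dispatch, so fall back to cold-create.
  if (probe.latestSnapshotId === dequeuedSnapshotId) return "fallback";
  // Any other snapshot means the run moved on after the dequeue.
  return "delivered";
}
```

The comparison is an exact id match on purpose: any newer snapshot counts as evidence that something acted on the run, so the fallback only fires when nothing has changed at all.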

  • Double-start safe: startRunAttempt runs under a per-run lock and 409s stale snapshot ids, so a reviving runner and the fallback workload can't both execute; the loser exits before running anything.
  • Probe errors → do nothing: healthy runners legitimately act late during platform brownouts (nested attempt-start retries), so falling back on uncertainty would stampede duplicates. The heartbeat redrive stays as the backstop (also covers supervisor restarts dropping timers).
  • Off by default: TRIGGER_WARM_START_VERIFY_ENABLED (+ TRIGGER_WARM_START_VERIFY_DELAY_MS, 1–60s, default 10s). Disabled = complete no-op. Works for all workload managers (compute/k8s/docker) since it hooks the shared dequeue path.
  • Emits warmstart.verify wide events (outcome: delivered | fallback | probe_error), making the silent-loss rate directly measurable.
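
A rough sketch of how the flag and the delay bounds listed above might be read from the environment. The two variable names, the 1–60s range, and the 10s default come from this PR; the `parseWarmStartVerifyConfig` helper, the accepted truthy strings, and the choice to reject out-of-range values rather than clamp them are assumptions, and the real definitions in apps/supervisor/src/env.ts may look quite different.

```typescript
type WarmStartVerifyConfig = { enabled: boolean; delayMs: number };

const DEFAULT_DELAY_MS = 10_000;
const MIN_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;

// Hypothetical parser; illustrates "off by default, 1-60s, default 10s".
function parseWarmStartVerifyConfig(
  env: Record<string, string | undefined>
): WarmStartVerifyConfig {
  // Assumption: only the literal strings "true" and "1" enable the feature.
  const rawEnabled = env["TRIGGER_WARM_START_VERIFY_ENABLED"];
  const enabled = rawEnabled === "true" || rawEnabled === "1";

  const rawDelay = env["TRIGGER_WARM_START_VERIFY_DELAY_MS"];
  if (rawDelay === undefined || rawDelay === "") {
    return { enabled, delayMs: DEFAULT_DELAY_MS };
  }

  const delayMs = Number(rawDelay);
  // Assumption: out-of-range or non-integer values are rejected, not clamped.
  if (!Number.isInteger(delayMs) || delayMs < MIN_DELAY_MS || delayMs > MAX_DELAY_MS) {
    throw new Error(
      `TRIGGER_WARM_START_VERIFY_DELAY_MS must be an integer between ${MIN_DELAY_MS} and ${MAX_DELAY_MS}`
    );
  }
  return { enabled, delayMs };
}
```

With an empty environment this yields `{ enabled: false, delayMs: 10000 }`, which matches the "disabled = complete no-op" default described above.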

myftija added 3 commits June 12, 2026 13:02
Firestarter's didWarmStart: true means the response was written to a
socket, not that the runner received it. A silently dead poller (no FIN,
e.g. a VM torn down mid-poll) leaves the dispatched run stuck in
PENDING_EXECUTING until the run engine's heartbeat redrive minutes
later, burning a queue redelivery toward TASK_RUN_DEQUEUED_MAX_RETRIES
each time.

After a warm-start hit the supervisor now retains the DequeuedMessage,
waits TRIGGER_WARM_START_VERIFY_DELAY_MS (default 10s), then asks the
platform for the run's latest snapshot. If it is still the exact
snapshot that was dequeued, no runner ever started the attempt - the
run falls through to the regular cold-create path. Double-starts are
prevented by the engine: startRunAttempt runs under a per-run lock and
rejects stale snapshot ids, so a reviving runner and the fallback
workload can't both execute. On probe errors nothing happens - during
platform brownouts healthy runners legitimately act late, and falling
back on uncertainty would stampede duplicates; the heartbeat redrive
stays as the backstop.

Off by default; enable with TRIGGER_WARM_START_VERIFY_ENABLED. When
disabled the code path is a no-op. Emits warmstart.verify wide events
(outcome: delivered / fallback / probe_error). Resolves TRI-10659.

Review follow-ups: the workload-create error log now carries the run id
(fallback creates run outside the dequeue wide event, so the log was the
only attribution), and the verifier stops before the workload server and
session so its timer can't cold-create a workload mid-shutdown.

changeset-bot Bot commented Jun 12, 2026

⚠️ No Changeset found

Latest commit: b6c35ac

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai Bot commented Jun 12, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4ddb876f-e151-42d1-9c9e-1301ec2bae98

📥 Commits

Reviewing files that changed from the base of the PR and between 78b7136 and b6c35ac.

📒 Files selected for processing (5)
  • .server-changes/warm-start-delivery-verification.md
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: typecheck / typecheck
🧰 Additional context used
📓 Path-based instructions (9)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Import from @trigger.dev/sdk when writing Trigger.dev tasks. Never use @trigger.dev/sdk/v3 or deprecated client.defineJob

Files:

  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamic import() when circular dependencies cannot be resolved, code splitting is needed for performance, or the module must be loaded conditionally at runtime
Import subpaths only from packages/core (@trigger.dev/core), never import from the root

Files:

  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
apps/supervisor/src/env.ts

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Environment configuration should be defined in src/env.ts

Files:

  • apps/supervisor/src/env.ts
**/*.{js,ts,tsx,jsx,css,json,md}

📄 CodeRabbit inference engine (AGENTS.md)

Use Prettier for code formatting and run pnpm run format before committing

Files:

  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
apps/supervisor/src/services/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Core service logic should be organized in the src/services/ directory

Files:

  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • apps/supervisor/src/services/warmStartVerificationService.test.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.test.{ts,tsx}: Never mock anything in tests - use testcontainers instead
Test files should be placed next to source files (e.g., MyService.ts -> MyService.test.ts)

Files:

  • apps/supervisor/src/services/warmStartVerificationService.test.ts
**/*.test.{js,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{js,ts,tsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Use vitest for unit testing
Tests should avoid mocks or stubs and use helpers from @internal/testcontainers when Redis or Postgres are needed

Files:

  • apps/supervisor/src/services/warmStartVerificationService.test.ts
🧠 Learnings (8)
📚 Learning: 2026-05-14T14:54:39.095Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3545
File: .server-changes/agent-view-sessions.md:10-10
Timestamp: 2026-05-14T14:54:39.095Z
Learning: In the `trigger.dev` repository, do not flag inconsistent dot vs slash notation in route/path strings inside `.server-changes/*.md` files. These markdown files are consumed verbatim into the changelog, so the mixed notation (e.g., `resources.orgs.../runs.$runParam/...`) is intentional and should be preserved as-is.

Applied to files:

  • .server-changes/warm-start-delivery-verification.md
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
📚 Learning: 2026-06-04T18:16:35.386Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3836
File: apps/supervisor/src/backpressure/backpressureMonitor.ts:3-5
Timestamp: 2026-06-04T18:16:35.386Z
Learning: When reviewing TypeScript in this repo, apply the rule “prefer type aliases over interfaces” only to data/object shapes and union/intersection type modeling. If an interface is being used as a behavioral contract for collaborators to implement (e.g., method-shape interfaces that define required behavior, such as `BackpressureLogger` / `BackpressureSignalSource` in `apps/supervisor/src/backpressure/backpressureMonitor.ts`), keep it as an `interface` and do not flag it as a type-alias-vs-interface violation.

Applied to files:

  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
📚 Learning: 2026-06-09T17:58:04.699Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3879
File: apps/webapp/app/models/vercelIntegration.server.ts:619-630
Timestamp: 2026-06-09T17:58:04.699Z
Learning: In this codebase, outbound raw `fetch` calls should typically rely on Node/undici’s default request timeout (about ~300s) rather than adding a per-call `AbortController` + `setTimeout` wrapper inside individual functions (e.g. in files like `apps/webapp/app/models/vercelIntegration.server.ts`). During code review, do not flag the absence of a per-call timeout on a single `fetch` as an issue; if per-call timeouts are needed, they should be implemented via a codebase-wide convention (e.g., a shared fetch wrapper or documented pattern) rather than ad-hoc per-function changes.

Applied to files:

  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
📚 Learning: 2026-05-18T14:40:02.173Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3658
File: packages/core/src/v3/realtimeStreams/manager.test.ts:1-147
Timestamp: 2026-05-18T14:40:02.173Z
Learning: In the triggerdotdev/trigger.dev repo, the policy “Never mock anything — use testcontainers instead” should only be enforced for integration tests that interact with real external services (e.g., Redis, Postgres) via actual infrastructure. For unit tests that exercise pure in-memory logic (e.g., cache semantics) it is OK to stub collaborators such as `ApiClient` using Vitest (`vi.fn()`) to assert call counts or control behavior. Do not flag `vi.fn()`-based `ApiClient` stubs in unit tests as violations of the testcontainers policy.

Applied to files:

  • apps/supervisor/src/services/warmStartVerificationService.test.ts
🔇 Additional comments (11)
.server-changes/warm-start-delivery-verification.md (1)

1-7: LGTM!

apps/supervisor/src/env.ts (1)

82-90: LGTM!

apps/supervisor/src/services/warmStartVerificationService.ts (1)

1-174: LGTM!

apps/supervisor/src/services/warmStartVerificationService.test.ts (1)

1-133: LGTM!

apps/supervisor/src/index.ts (7)

30-33: LGTM!


61-61: LGTM!


307-318: LGTM!


476-494: LGTM!


526-531: LGTM!


538-594: LGTM!


677-679: LGTM!


Walkthrough

This pull request adds an opt-in warm-start delivery verification feature to the supervisor. The feature validates whether warm-start dispatches reached runners and automatically falls back to cold-start workload creation if delivery is not confirmed within a configurable delay window. Configuration is gated by TRIGGER_WARM_START_VERIFY_ENABLED (default false) with a configurable probe delay between 1 and 60 seconds (default 10 seconds). The new WarmStartVerificationService uses timer-wheel scheduling and limits concurrent snapshot probes to 10. Integration into the supervisor includes conditional service initialization, scheduling verification on successful warm-start, cancellation when a run connects, and graceful shutdown ordering. A createWorkload helper was extracted to centralize cold-create validation and logging.
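
The schedule-on-warm-start, cancel-on-connect, and stop-on-shutdown lifecycle summarised in the walkthrough can be pictured with a small sketch. It is illustrative only: the PR's service uses a TimerWheel and caps concurrent snapshot probes at 10, whereas this hypothetical `PendingVerifications` class swaps in plain `setTimeout` handles and leaves the concurrency cap out.

```typescript
// Hypothetical helper; uses setTimeout instead of the TimerWheel named in the PR.
class PendingVerifications {
  private readonly timers = new Map<string, ReturnType<typeof setTimeout>>();
  private readonly delayMs: number;

  constructor(delayMs: number) {
    this.delayMs = delayMs;
  }

  // Called after a warm-start hit: run `verify` once the delay has elapsed.
  schedule(runId: string, verify: () => void): void {
    this.cancel(runId); // keep at most one pending verification per run
    const timer = setTimeout(() => {
      this.timers.delete(runId);
      verify();
    }, this.delayMs);
    this.timers.set(runId, timer);
  }

  // Called when the run connects: delivery is confirmed, so drop the timer.
  cancel(runId: string): boolean {
    const timer = this.timers.get(runId);
    if (timer === undefined) return false;
    clearTimeout(timer);
    this.timers.delete(runId);
    return true;
  }

  // Called first during shutdown so no timer can cold-create a workload afterwards.
  stop(): void {
    this.timers.forEach((timer) => clearTimeout(timer));
    this.timers.clear();
  }

  get size(): number {
    return this.timers.size;
  }
}
```

Because pending entries live only in memory, a supervisor restart drops them, which is why the heartbeat redrive remains the backstop.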

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description is incomplete; it lacks the required Testing, Changelog, and Screenshots sections from the template, and the Checklist section is missing. | Fill out the Testing section with test steps, add a Changelog summary, and complete the Checklist to confirm adherence to contributing guidelines. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main feature: warm-start delivery verification and cold-start fallback for silently lost dispatches. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install timed out. The project may have too many dependencies for the sandbox.


@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 1 potential issue.

Comment on lines +576 to +588
  recordPhaseSince("workload_create", createStart, undefined);

  // Disabled for now
  // this.resourceMonitor.blockResources({
  //   cpu: message.run.machine.cpu,
  //   memory: message.run.machine.memory,
  // });
} catch (error) {
  recordPhaseSince(
    "workload_create",
    createStart,
    error instanceof Error ? error : new Error(String(error))
  );

Contributor

🚩 recordPhaseSince called outside runWideEvent context in fallback path

When createWorkload is called from the verification service's timer-driven fallback (warmStartVerificationService.ts:146), it runs outside any runWideEvent async context. The recordPhaseSince calls at apps/supervisor/src/index.ts:576 and apps/supervisor/src/index.ts:584 were originally always inside the runWideEvent callback but now also execute from the verification fallback path where no wide-event AsyncLocalStorage context exists.

If recordPhaseSince gracefully handles a missing context (no-op), this is fine. If it throws, the behavior depends on the path:

  • Success path (line 576): workload is already created, but the throw enters the catch block which calls recordPhaseSince again (line 584), causing a second throw that propagates out. The tryCatch in the verification service catches it and logs a misleading error.
  • Error path (line 584): the original error is swallowed and replaced with a context-missing error.

In either case the workload creation itself succeeds (or fails for its own reasons), so functional correctness is preserved. But the observability data and error messages would be wrong in the fallback path.

Collaborator Author

The condition you flagged holds: recordPhaseSince -> recordPhase guards with if (!state) return (apps/supervisor/src/wideEvents/record.ts:27), and fromContext() returns null outside the ALS scope (wideEvents/context.ts:12) - so from the verifier's timer path it's a clean no-op, no throw, no corrupted phase data. The fallback path's observability instead comes from the warmstart.verify wide event plus the runId-attributed error log in createWorkload's catch.
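
To make the guard concrete, here is a minimal sketch of the pattern, assuming Node's `AsyncLocalStorage`. The names `fromContext`, `recordPhase`, and `runWideEvent` mirror the ones mentioned in this thread, but the signatures and the `WideEventState` shape are assumptions for illustration, not the contents of wideEvents/record.ts or wideEvents/context.ts.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical state shape; the real wide-event state is not shown in this PR.
type WideEventState = { phases: Record<string, number> };

const storage = new AsyncLocalStorage<WideEventState>();

// Returns the active wide-event state, or null when called outside any scope.
function fromContext(): WideEventState | null {
  return storage.getStore() ?? null;
}

// Records a phase duration; a clean no-op when there is no active wide event.
function recordPhase(name: string, durationMs: number): void {
  const state = fromContext();
  if (!state) return;
  state.phases[name] = durationMs;
}

// Runs `fn` inside a fresh wide-event scope and returns what was recorded.
function runWideEvent<T>(fn: () => T): { result: T; state: WideEventState } {
  const state: WideEventState = { phases: {} };
  const result = storage.run(state, fn);
  return { result, state };
}
```

A timer callback scheduled outside `runWideEvent` sees no store, so a `recordPhase` call from the verifier's fallback path returns early instead of throwing.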
