fix(webapp,core): retry run resume through transient database outages by matt-aitken · Pull Request #4161 · triggerdotdev/trigger.dev

matt-aitken · 2026-07-05T14:26:53Z

Summary

When the platform is briefly unreachable while a run is resuming from a wait, the run no longer fails with TASK_EXECUTION_ABORTED. The worker now retries the resume through the outage instead of aborting on the first blip.

Root cause

Resuming a run calls the engine's continue worker-action endpoint. That route caught every error and returned a 422, which the worker's HTTP client treats as non-retryable. So a transient Prisma infrastructure error (for example P1001 "Can't reach database server") was flattened into a permanent failure: the worker gave up, force-killed the run process, and completed it with TASK_EXECUTION_ABORTED.
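The retry decision described above can be sketched as a status check. This is a minimal illustration, not the worker's actual HTTP client: `classifyStatus` and `RetryDecision` are hypothetical names, and the only behaviour taken from this PR is that a 422 is not retried while a 500 is.

```typescript
// Hypothetical helper for illustration only; not the client in packages/core.
type RetryDecision = "retry" | "give-up";

function classifyStatus(status: number): RetryDecision {
  // Assumption: every 5xx is treated as a transient server-side failure.
  if (status >= 500) return "retry";
  // Anything else, including 422, is treated as a permanent answer.
  return "give-up";
}

// Before the fix, a transient P1001 surfaced as 422, so the worker gave up.
console.log(classifyStatus(422)); // give-up
// After the fix, the same outage surfaces as 500, so the worker retries.
console.log(classifyStatus(500)); // retry
```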

Fix

The continue route now lets infrastructure errors propagate to the generic 500 handler (message scrubbed, and retryable by the worker's HTTP client), the same treatment the trigger path already gives them via isInfrastructureError. Genuine validation errors (snapshot mismatch, invalid state) still return 422, so a stale retry stays non-retryable. Resuming is idempotent server-side (guarded by the snapshot id), so retrying is safe.
The worker's continueRunExecution call retries with a longer, jittered backoff so it can ride out an outage lasting tens of seconds, and the jitter keeps a fleet of resuming runs from stampeding the database the moment it recovers.
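The route-side change can be sketched roughly as follows. This is an assumption-laden outline rather than the actual route code: `isInfraErrorSketch` is a hypothetical stand-in for the real `isInfrastructureError` helper (whose exact checks are not shown in this PR description), `handleContinueError` is an invented name, and the set of codes contains only the P1001 example quoted above.

```typescript
// Hypothetical stand-in for isInfrastructureError. The real helper may
// recognise more error shapes; only P1001 is taken from the PR description.
const INFRA_CODES = new Set<string>(["P1001"]);

function isInfraErrorSketch(error: unknown): boolean {
  if (typeof error !== "object" || error === null || !("code" in error)) {
    return false;
  }
  return INFRA_CODES.has(String((error as { code: unknown }).code));
}

// Called from the route's catch block (hypothetical name).
function handleContinueError(error: unknown): { status: 422; error: string } {
  // Infrastructure errors propagate so the generic handler can answer
  // with a scrubbed, retryable 500.
  if (isInfraErrorSketch(error)) {
    throw error;
  }
  // Validation errors (snapshot mismatch, invalid state) stay non-retryable.
  // The message is a fixed string so internal details are not leaked.
  return { status: 422, error: "Failed to continue run execution" };
}
```

Keeping the 422 branch unchanged is what stops a stale retry from looping: once the snapshot has moved on, the retried request fails validation and the worker stops.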

Builds on #3960, which scrubbed the leaked message on these routes but left the status non-retryable.

Resuming a run after a wait calls the engine's continue endpoint. When the database was briefly unreachable, that route caught the Prisma infrastructure error and returned a non-retryable 422, so the worker aborted the run with TASK_EXECUTION_ABORTED over a transient blip. The continue route now lets infrastructure errors propagate to the generic 500 handler (scrubbed and retryable), matching how the trigger path already treats them. The worker's continue call also retries with a longer, jittered backoff so it can ride out an outage lasting tens of seconds without stampeding the database on recovery. Genuine validation errors still return 422.
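A jittered exponential backoff of the kind described here might look like the sketch below. The option names, the numeric values, and the "full jitter" strategy (a uniform pick between zero and the exponential ceiling) are all assumptions made for illustration; the PR description does not state the actual retry parameters.

```typescript
// All names and values here are hypothetical, not the PR's configuration.
interface BackoffOptions {
  minTimeoutInMs: number; // ceiling for the first retry
  maxTimeoutInMs: number; // cap on the ceiling for later retries
  factor: number;         // growth factor per attempt
}

// Delay before retry number `attempt` (0-based). `random` is injectable
// so the function can be checked deterministically.
function backoffDelayMs(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(
    opts.maxTimeoutInMs,
    opts.minTimeoutInMs * Math.pow(opts.factor, attempt)
  );
  // Full jitter: spread callers uniformly over [0, ceiling) so they do
  // not all hit the database at the same instant when it recovers.
  return Math.floor(random() * ceiling);
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  shouldRetry: (error: unknown) => boolean,
  maxAttempts: number,
  opts: BackoffOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Stop when the budget is spent or the error is permanent (e.g. a 422).
      if (attempt + 1 >= maxAttempts || !shouldRetry(error)) throw error;
      await sleep(backoffDelayMs(attempt, opts));
    }
  }
}
```

With a capped ceiling, the total wait across attempts determines how long an outage the call can ride out, so the attempt count and cap would need to be chosen together to cover tens of seconds.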

changeset-bot · 2026-07-05T14:26:58Z

🦋 Changeset detected

Latest commit: f73646b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 28 packages

Name	Type
@trigger.dev/core	Patch
@trigger.dev/build	Patch
trigger.dev	Patch
@trigger.dev/plugins	Patch
@trigger.dev/python	Patch
@trigger.dev/redis-worker	Patch
@trigger.dev/schema-to-json	Patch
@trigger.dev/sdk	Patch
@internal/cache	Patch
@internal/clickhouse	Patch
@internal/llm-model-catalog	Patch
@trigger.dev/rbac	Patch
@internal/redis	Patch
@internal/replication	Patch
@internal/run-engine	Patch
@internal/run-store	Patch
@internal/schedule-engine	Patch
@trigger.dev/sso	Patch
@internal/testcontainers	Patch
@internal/tracing	Patch
@internal/tsql	Patch
@internal/zod-worker	Patch
@internal/dashboard-agent	Patch
@internal/sdk-compat-tests	Patch
@trigger.dev/react-hooks	Patch
@trigger.dev/rsc	Patch
@trigger.dev/database	Patch
@trigger.dev/otlp-importer	Patch


coderabbitai · 2026-07-05T14:29:38Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8b40410b-b22e-4070-b271-f7995c043abf

📥 Commits

Reviewing files that changed from the base of the PR and between 8334820 and f73646b.

📒 Files selected for processing (2)

.changeset/resume-retry-transient-db.md
packages/core/src/v3/runEngineWorker/workload/http.ts

✅ Files skipped from review due to trivial changes (1)

.changeset/resume-retry-transient-db.md

🚧 Files skipped from review as they are similar to previous changes (1)

packages/core/src/v3/runEngineWorker/workload/http.ts

📜 Recent review details

⏰ Context from checks skipped due to timeout. (18)

GitHub Check: packages / 📊 Merge Reports
GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
GitHub Check: internal / 🧪 Unit Tests: Internal (4, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (6, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (10, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (5, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (11, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (3, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (8, 12)
GitHub Check: internal / 🧪 Unit Tests: Internal (9, 12)
GitHub Check: e2e / 🧪 CLI v3 tests (blacksmith-4vcpu-windows-2025 - npm)

⚠️ CI failures not shown inline (4)

GitHub Actions: 📝 Agent Instructions Audit / audit: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 25
--model claude-opus-4-8
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are reviewing a PR to check whether any agent instruction files need updating.
In this repo:
- Root shared agent guidance lives in `AGENTS.md`.
- Root `CLAUDE.md` is only a Claude Code adapter that imports `AGENTS.md`.
- Subdirectories may still have scoped `CLAUDE.md` files.
- `.claude/rules/` contains additional Claude Code guidance.
## Your task
1. Run `git diff origin/main...HEAD --name-only` to see which files changed in this PR.
2. For each changed directory, check the applicable instruction files: root `AGENTS.md`, any `CLAUDE.md` in that directory or a parent directory, and relevant `.claude/rules/` files.
3. Determine if any instruction file should be updated based on the changes. Consider:
   - New files/directories that aren't covered by existing documentation
   - Changed architecture or patterns that contradict current agent guidance
   - New dependencies, services, or infrastructure that agents should know about
   - Renamed or moved files that are referenced in an instruction file
   - Changes to build commands, test patterns, or development workflows
## Response format
If NO updates are needed, respond with exactly:
✅ Agent instruction files look current for this PR.
If updates ARE needed, respond with a short list:
📝 **Agent instruction updates suggested:**
- `AGENTS.md`: [what should be added/changed]
- `path/to/CLAUDE.md`: [what should be added/changed]
- `.claude/rules/file.md`: [what should be added/changed]
Keep suggestions specific and brief. Only flag things that would actually mislead agents in future sessions.
Do NOT suggest updates for trivial changes (bug fixes, small refactors within existing patterns).
Do NOT suggest creating new...

GitHub Actions: 📝 Agent Instructions Audit / 0_audit.txt: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

atching-rc.1         -> build-batching-rc.1
  * [new tag]             build-batching-rc.2         -> build-batching-rc.2
  * [new tag]             build-billing-0.0.1         -> build-billing-0.0.1
  * [new tag]             build-billing-0.0.2         -> build-billing-0.0.2
  * [new tag]             build-billing-0.0.3         -> build-billing-0.0.3
  * [new tag]             build-buildinfo-rc.0        -> build-buildinfo-rc.0
  * [new tag]             build-buildinfo-rc.1        -> build-buildinfo-rc.1
  * [new tag]             build-checkpoint-failover-rc.1 -> build-checkpoint-failover-rc.1
  * [new tag]             build-checkpoint-race-condition-1 -> build-checkpoint-race-condition-1
  * [new tag]             build-checkpoint-race-condition-2 -> build-checkpoint-race-condition-2
  * [new tag]             build-checkpoint-race-condition-3 -> build-checkpoint-race-condition-3
  * [new tag]             build-chris-test-blacksmith -> build-chris-test-blacksmith
  * [new tag]             build-chris-test-blacksmith-2 -> build-chris-test-blacksmith-2
  * [new tag]             build-cli-build-upgrade-rc.1 -> build-cli-build-upgrade-rc.1
  * [new tag]             build-clickhouse-reads-rc0  -> build-clickhouse-reads-rc0
  * [new tag]             build-clickhouse-reads-rc1  -> build-clickhouse-reads-rc1
  * [new tag]             build-compute.rc0           -> build-compute.rc0
  * [new tag]             build-compute.rc1           -> build-compute.rc1
  * [new tag]             build-compute.rc2           -> build-compute.rc2
  * [new tag]             build-compute.rc3           -> build-compute.rc3
  * [new tag]             build-compute.rc4           -> build-compute.rc4
  * [new tag]             build-compute.rc5           -> build-compute.rc5
  * [new tag]             build-compute.rc6           -> build-compute.rc6
  * [new tag]             build-corepack-offline-rc.0 -> build-corepack-offline-rc.0
  * [new tag]             build-current-deployment-rc.0 -> build-cur...

GitHub Actions: 🔎 REVIEW.md Drift Audit / 0_audit.txt: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

 build-legacy-run-engine.fix3
  * [new tag]             build-manual-checkpoints.rc1 -> build-manual-checkpoints.rc1
  * [new tag]             build-metadata-upgrade-logging.rc1 -> build-metadata-upgrade-logging.rc1
  * [new tag]             build-metadata-upgrade-logging.rc2 -> build-metadata-upgrade-logging.rc2
  * [new tag]             build-metadata-upgrade-logging.rc3 -> build-metadata-upgrade-logging.rc3
  * [new tag]             build-new-build-system.rc.1 -> build-new-build-system.rc.1
  * [new tag]             build-otel-upgrade-rc.0     -> build-otel-upgrade-rc.0
  * [new tag]             build-otel-upgrade-rc.1     -> build-otel-upgrade-rc.1
  * [new tag]             build-pre-pull-deployments-rc.1 -> build-pre-pull-deployments-rc.1
  * [new tag]             build-prod-rescue-rc.1      -> build-prod-rescue-rc.1
  * [new tag]             build-rate-limiter-fix-rc.1 -> build-rate-limiter-fix-rc.1
  * [new tag]             build-re2.rc0               -> build-re2.rc0
  * [new tag]             build-realtime-v2-stream-fix -> build-realtime-v2-stream-fix
  * [new tag]             build-realtime-v2-stream-fix-2 -> build-realtime-v2-stream-fix-2
  * [new tag]             build-realtime-v2-stream-fix-3 -> build-realtime-v2-stream-fix-3
  * [new tag]             build-realtime-v2-stream-fix-4 -> build-realtime-v2-stream-fix-4
  * [new tag]             build-realtime-v2-stream-fix-5 -> build-realtime-v2-stream-fix-5
  * [new tag]             build-realtimestreams-dedupe -> build-realtimestreams-dedupe
  * [new tag]             build-registry-maintenance-rc.1 -> build-registry-maintenance-rc.1
  * [new tag]             build-registry-maintenance-rc.2 -> build-registry-maintenance-rc.2
  * [new tag]             build-remote-ecr-rc.0       -> build-remote-ecr-rc.0
  * [new tag]             build-reschedule-hotfix.rc1 -> build-reschedule-hotfix.rc1
  * [new tag]             build-resume-fixes.rc1      -> build-resume-fixes.rc1
  * [new tag]             build-resume-...

GitHub Actions: 🔎 REVIEW.md Drift Audit / audit: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 30
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are auditing this PR for drift against `.claude/REVIEW.md`.
## Context
`.claude/REVIEW.md` is the repo's source of truth for what AI / agent code reviewers should treat as critical findings (rolling-deploy safety, hot-table indexes, recovery-path queries, testcontainers usage, Lua versioning, etc.). It is consumed by review agents to calibrate severity. If REVIEW.md goes stale, every future agent review degrades.
## Strategy — read this first
You have a hard turn budget. Spend it on signal, not coverage. The audit is allowed to miss things; it is NOT allowed to time out.
1. Read `.claude/REVIEW.md` once, in full.
2. Run `git diff origin/main...HEAD --name-only` to get the list of changed files. Do NOT read the diff content yet.
3. Scan the file-list for relevance to REVIEW.md scope. Relevance signals: changes to Prisma schema, Redis / queue / Lua code, hot tables, recovery / restart loops, new packages, deletions of paths REVIEW.md cites. Skim everything else.
4. Open at most **5 files** total — only the ones most likely to surface a real signal. If nothing in the file-list looks relevant to any REVIEW.md rule, do NOT read any files; go straight to the verdict.
5. Form a verdict and stop. Do not exhaust the turn budget exploring.
Large PRs (>50 files changed) are a strong signal to be MORE selective, not more thorough. Pick 3-5 files at most.
## What to look for
- **Stale references** — does any REVIEW.md rule cite a file, directory, function, table, Prisma model, or package name that has been removed or renamed in this PR (or is already gone from `main`)?
- **Contradictions** — does code in this PR clearly violate a current REVIEW.md rule? (Don't re-review the PR. Only flag if REVIE...

Walkthrough

The continue run flow now treats Prisma infrastructure errors as retryable instead of mapping them to 422 responses. The webapp continue route re-throws detected infrastructure errors so generic 500 handling can apply, and both HTTP clients add retry policies with exponential backoff and jitter for continueRunExecution. A changeset records the patch version update for @trigger.dev/core.

Changes

Cohort / File(s)	Summary
Webapp continue route — `apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.continue.ts`	Imports `isInfrastructureError`, re-throws infrastructure errors, and updates the warning log message.
ContinueRunExecution retry configuration — `packages/core/src/v3/runEngineWorker/workload/http.ts`, `packages/core/src/v3/runEngineWorker/supervisor/http.ts`	Adds jittered exponential retry policy to `continueRunExecution` in both HTTP clients.
Changeset — `.changeset/resume-retry-transient-db.md`	Documents the patch bump and resume retry behavior.

Sequence Diagram(s)

sequenceDiagram
  participant ContinueRoute
  participant PrismaDB
  participant WorkerClient

  WorkerClient->>ContinueRoute: POST continue run execution
  ContinueRoute->>PrismaDB: perform continuation logic
  PrismaDB-->>ContinueRoute: infrastructure error
  ContinueRoute->>ContinueRoute: isInfrastructureError check
  alt infrastructure error
    ContinueRoute-->>WorkerClient: rethrow for generic 500 handling
  else other error
    ContinueRoute-->>WorkerClient: 422 response
  end

Related issues: None specified.

Related PRs: None specified.

Suggested labels: bug, core, webapp

Suggested reviewers: None specified.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description covers the change well, but it does not follow the required template and is missing the issue link and standard sections.	Add the required template sections: Closes #, checklist items, Testing, Changelog, and Screenshots (or mark them N/A).

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and accurately summarizes the main change: retrying run resumes through transient database outages.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


The supervisor-to-engine hop is the one that reaches the continue endpoint, so it is where a transient database outage surfaces as a retryable 5xx. Give its continueRunExecution the same longer, jittered retry budget as the workload client so it can ride out the outage.

pkg-pr-new · 2026-07-05T14:35:27Z

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@81ffb3f

trigger.dev

npm i https://pkg.pr.new/trigger.dev@81ffb3f

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@81ffb3f

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@81ffb3f

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@81ffb3f

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@81ffb3f

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@81ffb3f

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@81ffb3f

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@81ffb3f

commit: 81ffb3f

The database-outage retry lives on the supervisor-to-engine hop; the workload client only reaches the supervisor's workload server, so its retry rides out supervisor blips (e.g. a restart), not DB outages. Fix the comment to say so.

matt-aitken force-pushed the fix/resume-retriable-on-transient-db-errors branch from 7b8d0a0 to 8334820 Compare July 5, 2026 14:33

matt-aitken added 2 commits July 5, 2026 15:48

Improved changeset message

f73646b

ericallam approved these changes Jul 5, 2026

matt-aitken wants to merge 4 commits into main from fix/resume-retriable-on-transient-db-errors
