fix(webapp,core): retry run resume through transient database outages #4161

Open
matt-aitken wants to merge 4 commits into main from fix/resume-retriable-on-transient-db-errors

@matt-aitken matt-aitken commented Jul 5, 2026

Member

Summary

When the platform is briefly unreachable while a run is resuming from a wait, the run no longer fails with TASK_EXECUTION_ABORTED. The worker now retries the resume through the outage instead of aborting on the first blip.

Root cause

Resuming a run calls the engine's continue worker-action endpoint. That route caught every error and returned a 422, which the worker's HTTP client treats as non-retryable. So a transient Prisma infrastructure error (for example P1001 "Can't reach database server") was flattened into a permanent failure: the worker gave up, force-killed the run process, and completed it with TASK_EXECUTION_ABORTED.
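
The failure mode above comes down to how a retrying HTTP client classifies status codes. A minimal sketch, assuming a client that retries 5xx responses plus a couple of 4xx codes; the function name `isRetryableStatus` and the exact set of retryable codes are hypothetical and may not match the worker client's actual policy:

```typescript
// Hypothetical classification; the real worker client's rules may differ.
function isRetryableStatus(status: number): boolean {
  // Server-side failures are assumed to be transient.
  if (status >= 500 && status <= 599) return true;
  // Timeouts and rate limits are commonly retried as well (assumption).
  if (status === 408 || status === 429) return true;
  // Everything else, including 422, is treated as a permanent failure.
  return false;
}

// A P1001 flattened into a 422 is never retried...
console.log(isRetryableStatus(422)); // false
// ...whereas the same error surfaced as a 500 would be.
console.log(isRetryableStatus(500)); // true
```

Under this kind of policy, the status code chosen by the route decides whether the worker ever gets a second attempt.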

Fix

  • The continue route now lets infrastructure errors propagate to the generic 500 handler (message scrubbed, and retryable by the worker's HTTP client), the same treatment the trigger path already gives them via isInfrastructureError. Genuine validation errors (snapshot mismatch, invalid state) still return 422, so a stale retry stays non-retryable. Resuming is idempotent server-side (guarded by the snapshot id), so retrying is safe.
  • The worker's continueRunExecution call retries with a longer, jittered backoff so it can ride out an outage lasting tens of seconds, and the jitter keeps a fleet of resuming runs from stampeding the database the moment it recovers.
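
The route-side half of the fix can be sketched roughly as follows. This is an illustration under stated assumptions, not the route's actual code: `looksLikeInfrastructureError` is a hypothetical stand-in for the repo's `isInfrastructureError` (whose real checks are not shown in this PR description), the list of Prisma codes is illustrative, and `respondToContinueFailure` is an invented name for the catch-block logic.

```typescript
// Illustrative set of Prisma connectivity/timeout codes (assumption).
const TRANSIENT_PRISMA_CODES = new Set(["P1001", "P1002", "P1008", "P1017"]);

// Hypothetical stand-in for isInfrastructureError; the real helper may differ.
function looksLikeInfrastructureError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const code = (error as { code?: unknown }).code;
  return typeof code === "string" && TRANSIENT_PRISMA_CODES.has(code);
}

// Hypothetical catch-block logic for the continue route.
function respondToContinueFailure(error: unknown): { status: number; error: string } {
  // Infrastructure errors are re-thrown so a generic handler can answer 500,
  // which a retrying client treats as retryable.
  if (looksLikeInfrastructureError(error)) throw error;
  // Validation errors (snapshot mismatch, invalid state) stay non-retryable.
  return { status: 422, error: "Failed to continue run execution" };
}
```

Keeping 422 for validation failures means a stale retry that arrives after the snapshot has moved on is rejected once instead of being retried again.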

Builds on #3960, which scrubbed the leaked message on these routes but left the status non-retryable.

Resuming a run after a wait calls the engine's continue endpoint. When the
database was briefly unreachable, that route caught the Prisma infrastructure
error and returned a non-retryable 422, so the worker aborted the run with
TASK_EXECUTION_ABORTED over a transient blip.

The continue route now lets infrastructure errors propagate to the generic 500
handler (scrubbed and retryable), matching how the trigger path already treats
them. The worker's continue call also retries with a longer, jittered backoff
so it can ride out an outage lasting tens of seconds without stampeding the
database on recovery. Genuine validation errors still return 422.

changeset-bot Bot commented Jul 5, 2026


🦋 Changeset detected

Latest commit: f73646b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 28 packages
Name Type
@trigger.dev/core Patch
@trigger.dev/build Patch
trigger.dev Patch
@trigger.dev/plugins Patch
@trigger.dev/python Patch
@trigger.dev/redis-worker Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-model-catalog Patch
@trigger.dev/rbac Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/run-store Patch
@internal/schedule-engine Patch
@trigger.dev/sso Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
@internal/dashboard-agent Patch
@internal/sdk-compat-tests Patch
@trigger.dev/react-hooks Patch
@trigger.dev/rsc Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch


coderabbitai Bot commented Jul 5, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8b40410b-b22e-4070-b271-f7995c043abf

📥 Commits

Reviewing files that changed from the base of the PR and between 8334820 and f73646b.

📒 Files selected for processing (2)
  • .changeset/resume-retry-transient-db.md
  • packages/core/src/v3/runEngineWorker/workload/http.ts
✅ Files skipped from review due to trivial changes (1)
  • .changeset/resume-retry-transient-db.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/core/src/v3/runEngineWorker/workload/http.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout. (18)
  • GitHub Check: packages / 📊 Merge Reports
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (4, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (6, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (10, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (5, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (11, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (3, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (8, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (9, 12)
  • GitHub Check: e2e / 🧪 CLI v3 tests (blacksmith-4vcpu-windows-2025 - npm)
⚠️ CI failures not shown inline (4)

GitHub Actions: 📝 Agent Instructions Audit / audit: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 25
--model claude-opus-4-8
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are reviewing a PR to check whether any agent instruction files need updating.
In this repo:
- Root shared agent guidance lives in `AGENTS.md`.
- Root `CLAUDE.md` is only a Claude Code adapter that imports `AGENTS.md`.
- Subdirectories may still have scoped `CLAUDE.md` files.
- `.claude/rules/` contains additional Claude Code guidance.
## Your task
1. Run `git diff origin/main...HEAD --name-only` to see which files changed in this PR.
2. For each changed directory, check the applicable instruction files: root `AGENTS.md`, any `CLAUDE.md` in that directory or a parent directory, and relevant `.claude/rules/` files.
3. Determine if any instruction file should be updated based on the changes. Consider:
   - New files/directories that aren't covered by existing documentation
   - Changed architecture or patterns that contradict current agent guidance
   - New dependencies, services, or infrastructure that agents should know about
   - Renamed or moved files that are referenced in an instruction file
   - Changes to build commands, test patterns, or development workflows
## Response format
If NO updates are needed, respond with exactly:
✅ Agent instruction files look current for this PR.
If updates ARE needed, respond with a short list:
📝 **Agent instruction updates suggested:**
- `AGENTS.md`: [what should be added/changed]
- `path/to/CLAUDE.md`: [what should be added/changed]
- `.claude/rules/file.md`: [what should be added/changed]
Keep suggestions specific and brief. Only flag things that would actually mislead agents in future sessions.
Do NOT suggest updates for trivial changes (bug fixes, small refactors within existing patterns).
Do NOT suggest creating new...

GitHub Actions: 📝 Agent Instructions Audit / 0_audit.txt: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

atching-rc.1         -> build-batching-rc.1
  * [new tag]             build-batching-rc.2         -> build-batching-rc.2
  * [new tag]             build-billing-0.0.1         -> build-billing-0.0.1
  * [new tag]             build-billing-0.0.2         -> build-billing-0.0.2
  * [new tag]             build-billing-0.0.3         -> build-billing-0.0.3
  * [new tag]             build-buildinfo-rc.0        -> build-buildinfo-rc.0
  * [new tag]             build-buildinfo-rc.1        -> build-buildinfo-rc.1
  * [new tag]             build-checkpoint-failover-rc.1 -> build-checkpoint-failover-rc.1
  * [new tag]             build-checkpoint-race-condition-1 -> build-checkpoint-race-condition-1
  * [new tag]             build-checkpoint-race-condition-2 -> build-checkpoint-race-condition-2
  * [new tag]             build-checkpoint-race-condition-3 -> build-checkpoint-race-condition-3
  * [new tag]             build-chris-test-blacksmith -> build-chris-test-blacksmith
  * [new tag]             build-chris-test-blacksmith-2 -> build-chris-test-blacksmith-2
  * [new tag]             build-cli-build-upgrade-rc.1 -> build-cli-build-upgrade-rc.1
  * [new tag]             build-clickhouse-reads-rc0  -> build-clickhouse-reads-rc0
  * [new tag]             build-clickhouse-reads-rc1  -> build-clickhouse-reads-rc1
  * [new tag]             build-compute.rc0           -> build-compute.rc0
  * [new tag]             build-compute.rc1           -> build-compute.rc1
  * [new tag]             build-compute.rc2           -> build-compute.rc2
  * [new tag]             build-compute.rc3           -> build-compute.rc3
  * [new tag]             build-compute.rc4           -> build-compute.rc4
  * [new tag]             build-compute.rc5           -> build-compute.rc5
  * [new tag]             build-compute.rc6           -> build-compute.rc6
  * [new tag]             build-corepack-offline-rc.0 -> build-corepack-offline-rc.0
  * [new tag]             build-current-deployment-rc.0 -> build-cur...

GitHub Actions: 🔎 REVIEW.md Drift Audit / 0_audit.txt: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

 build-legacy-run-engine.fix3
  * [new tag]             build-manual-checkpoints.rc1 -> build-manual-checkpoints.rc1
  * [new tag]             build-metadata-upgrade-logging.rc1 -> build-metadata-upgrade-logging.rc1
  * [new tag]             build-metadata-upgrade-logging.rc2 -> build-metadata-upgrade-logging.rc2
  * [new tag]             build-metadata-upgrade-logging.rc3 -> build-metadata-upgrade-logging.rc3
  * [new tag]             build-new-build-system.rc.1 -> build-new-build-system.rc.1
  * [new tag]             build-otel-upgrade-rc.0     -> build-otel-upgrade-rc.0
  * [new tag]             build-otel-upgrade-rc.1     -> build-otel-upgrade-rc.1
  * [new tag]             build-pre-pull-deployments-rc.1 -> build-pre-pull-deployments-rc.1
  * [new tag]             build-prod-rescue-rc.1      -> build-prod-rescue-rc.1
  * [new tag]             build-rate-limiter-fix-rc.1 -> build-rate-limiter-fix-rc.1
  * [new tag]             build-re2.rc0               -> build-re2.rc0
  * [new tag]             build-realtime-v2-stream-fix -> build-realtime-v2-stream-fix
  * [new tag]             build-realtime-v2-stream-fix-2 -> build-realtime-v2-stream-fix-2
  * [new tag]             build-realtime-v2-stream-fix-3 -> build-realtime-v2-stream-fix-3
  * [new tag]             build-realtime-v2-stream-fix-4 -> build-realtime-v2-stream-fix-4
  * [new tag]             build-realtime-v2-stream-fix-5 -> build-realtime-v2-stream-fix-5
  * [new tag]             build-realtimestreams-dedupe -> build-realtimestreams-dedupe
  * [new tag]             build-registry-maintenance-rc.1 -> build-registry-maintenance-rc.1
  * [new tag]             build-registry-maintenance-rc.2 -> build-registry-maintenance-rc.2
  * [new tag]             build-remote-ecr-rc.0       -> build-remote-ecr-rc.0
  * [new tag]             build-reschedule-hotfix.rc1 -> build-reschedule-hotfix.rc1
  * [new tag]             build-resume-fixes.rc1      -> build-resume-fixes.rc1
  * [new tag]             build-resume-...

GitHub Actions: 🔎 REVIEW.md Drift Audit / audit: fix(webapp,core): retry run resume through transient database outages

Conclusion: failure

View job details

##[group]Run anthropics/claude-code-action@428971d2ecd6e3a7cb0ee0da2a3a8b33fdb3678d
 with:
   anthropic_***REDACTED***
   use_sticky_comment: true
   allowed_bots: devin-ai-integration[bot]
   claude_args: --max-turns 30
--allowedTools "Read,Glob,Grep,Bash(git diff:*)"
   prompt: You are auditing this PR for drift against `.claude/REVIEW.md`.
## Context
`.claude/REVIEW.md` is the repo's source of truth for what AI / agent code reviewers should treat as critical findings (rolling-deploy safety, hot-table indexes, recovery-path queries, testcontainers usage, Lua versioning, etc.). It is consumed by review agents to calibrate severity. If REVIEW.md goes stale, every future agent review degrades.
## Strategy — read this first
You have a hard turn budget. Spend it on signal, not coverage. The audit is allowed to miss things; it is NOT allowed to time out.
1. Read `.claude/REVIEW.md` once, in full.
2. Run `git diff origin/main...HEAD --name-only` to get the list of changed files. Do NOT read the diff content yet.
3. Scan the file-list for relevance to REVIEW.md scope. Relevance signals: changes to Prisma schema, Redis / queue / Lua code, hot tables, recovery / restart loops, new packages, deletions of paths REVIEW.md cites. Skim everything else.
4. Open at most **5 files** total — only the ones most likely to surface a real signal. If nothing in the file-list looks relevant to any REVIEW.md rule, do NOT read any files; go straight to the verdict.
5. Form a verdict and stop. Do not exhaust the turn budget exploring.
Large PRs (>50 files changed) are a strong signal to be MORE selective, not more thorough. Pick 3-5 files at most.
## What to look for
- **Stale references** — does any REVIEW.md rule cite a file, directory, function, table, Prisma model, or package name that has been removed or renamed in this PR (or is already gone from `main`)?
- **Contradictions** — does code in this PR clearly violate a current REVIEW.md rule? (Don't re-review the PR. Only flag if REVIE...

Walkthrough

The continue run flow now treats Prisma infrastructure errors as retryable instead of mapping them to 422 responses. The webapp continue route re-throws detected infrastructure errors so generic 500 handling can apply, and both HTTP clients add retry policies with exponential backoff and jitter for continueRunExecution. A changeset records the patch version update for @trigger.dev/core.
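
For readers unfamiliar with the technique, exponential backoff with jitter can be sketched as below. This uses the "full jitter" variant (a uniform draw between zero and the capped exponential delay), which is one common choice and not necessarily the one used in this PR; `BackoffOptions`, `jitteredDelayMs`, and the numeric values are hypothetical.

```typescript
// Hypothetical option names; the PR's actual retry config is not shown here.
interface BackoffOptions {
  minDelayMs: number; // delay ceiling for the first retry
  maxDelayMs: number; // upper bound on any single delay
  factor: number; // growth multiplier per attempt
}

// Full jitter: choose uniformly in [0, capped exponential delay).
// `random` is injectable so the function can be tested deterministically.
function jitteredDelayMs(
  attempt: number, // 1-based retry attempt
  opts: BackoffOptions,
  random: () => number = Math.random
): number {
  const exponential = opts.minDelayMs * Math.pow(opts.factor, attempt - 1);
  const capped = Math.min(exponential, opts.maxDelayMs);
  return Math.floor(random() * capped);
}

// Illustrative values only.
const exampleOptions: BackoffOptions = { minDelayMs: 1000, maxDelayMs: 30000, factor: 2 };
for (let attempt = 1; attempt <= 6; attempt++) {
  console.log(attempt, jitteredDelayMs(attempt, exampleOptions));
}
```

With these example options and `random` fixed at 0.5, attempts 1, 3, and 10 yield 500, 2000, and 15000 ms. Because each client draws its own random delay, a fleet of resuming runs spreads its retries out instead of hitting the database at the same instant.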

Changes

| Cohort / File(s) | Summary |
|---|---|
| Webapp continue route: apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.continue.ts | Imports isInfrastructureError, re-throws infrastructure errors, and updates the warning log message. |
| ContinueRunExecution retry configuration: packages/core/src/v3/runEngineWorker/workload/http.ts, packages/core/src/v3/runEngineWorker/supervisor/http.ts | Adds jittered exponential retry policy to continueRunExecution in both HTTP clients. |
| Changeset: .changeset/resume-retry-transient-db.md | Documents the patch bump and resume retry behavior. |

Sequence Diagram(s)

sequenceDiagram
  participant ContinueRoute
  participant PrismaDB
  participant WorkerClient

  WorkerClient->>ContinueRoute: POST continue run execution
  ContinueRoute->>PrismaDB: perform continuation logic
  PrismaDB-->>ContinueRoute: infrastructure error
  ContinueRoute->>ContinueRoute: isInfrastructureError check
  alt infrastructure error
    ContinueRoute-->>WorkerClient: rethrow for generic 500 handling
  else other error
    ContinueRoute-->>WorkerClient: 422 response
  end

Related issues: None specified.

Related PRs: None specified.

Suggested labels: bug, core, webapp

Suggested reviewers: None specified.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description covers the change well, but it does not follow the required template and is missing the issue link and standard sections. | Add the required template sections: Closes #, checklist items, Testing, Changelog, and Screenshots (or mark them N/A). |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: retrying run resumes through transient database outages. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
devin-ai-integration[bot]

This comment was marked as resolved.

The supervisor-to-engine hop is the one that reaches the continue endpoint,
so it is where a transient database outage surfaces as a retryable 5xx. Give
its continueRunExecution the same longer, jittered retry budget as the
workload client so it can ride out the outage.

matt-aitken force-pushed the fix/resume-retriable-on-transient-db-errors branch from 7b8d0a0 to 8334820 on July 5, 2026 14:33

pkg-pr-new Bot commented Jul 5, 2026

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@81ffb3f

trigger.dev

npm i https://pkg.pr.new/trigger.dev@81ffb3f

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@81ffb3f

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@81ffb3f

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@81ffb3f

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@81ffb3f

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@81ffb3f

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@81ffb3f

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@81ffb3f

commit: 81ffb3f

devin-ai-integration[bot]

This comment was marked as resolved.

The database-outage retry lives on the supervisor-to-engine hop; the workload
client only reaches the supervisor's workload server, so its retry rides out
supervisor blips (e.g. a restart), not DB outages. Fix the comment to say so.

2 participants