fix(webapp,core): retry run resume through transient database outages#4161
matt-aitken wants to merge 4 commits into
Resuming a run after a wait calls the engine's continue endpoint. When the database was briefly unreachable, that route caught the Prisma infrastructure error and returned a non-retryable 422, so the worker aborted the run with TASK_EXECUTION_ABORTED over a transient blip. The continue route now lets infrastructure errors propagate to the generic 500 handler (scrubbed and retryable), matching how the trigger path already treats them. The worker's continue call also retries with a longer, jittered backoff so it can ride out an outage lasting tens of seconds without stampeding the database on recovery. Genuine validation errors still return 422.
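A minimal sketch of the route-side behavior described above, not the actual handler. `isInfrastructureError` is named in this PR, but its implementation is not shown here, so the version below is a hypothetical stand-in that only recognizes the Prisma `P1001` code. `handleContinue`, `continueRun`, and the 422 body text are likewise illustrative names.

```typescript
// Hypothetical stand-in: the real isInfrastructureError lives in the webapp
// and may classify more error types than this.
function isInfrastructureError(error: unknown): boolean {
  const code = (error as { code?: unknown } | null)?.code;
  // P1001 is Prisma's "Can't reach database server" error code.
  return code === "P1001";
}

type ContinueResult = { status: number; body: unknown };

// Illustrative handler shape: `continueRun` stands in for the engine call.
async function handleContinue(
  continueRun: () => Promise<unknown>
): Promise<ContinueResult> {
  try {
    return { status: 200, body: await continueRun() };
  } catch (error) {
    if (isInfrastructureError(error)) {
      // Re-throw so the generic handler can answer with a retryable 500.
      throw error;
    }
    // Validation failures (snapshot mismatch, invalid state) stay non-retryable.
    return { status: 422, body: { error: "Failed to continue run execution" } };
  }
}
```

The point of the split is that the caller sees two distinct outcomes: a thrown error that ends up as a 5xx it may retry, and a 422 it should treat as final.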
🦋 Changeset detected
Latest commit: f73646b
The changes in this PR will be included in the next version bump.
This PR includes changesets to release 28 packages
No actionable comments were generated in the recent review. 🎉
| Cohort / File(s) | Summary |
|---|---|
| Webapp continue route: `apps/webapp/app/routes/engine.v1.worker-actions.runs.$runFriendlyId.snapshots.$snapshotFriendlyId.continue.ts` | Imports `isInfrastructureError`, re-throws infrastructure errors, and updates the warning log message. |
| ContinueRunExecution retry configuration: `packages/core/src/v3/runEngineWorker/workload/http.ts`, `packages/core/src/v3/runEngineWorker/supervisor/http.ts` | Adds jittered exponential retry policy to `continueRunExecution` in both HTTP clients. |
| Changeset: `.changeset/resume-retry-transient-db.md` | Documents the patch bump and resume retry behavior. |
Sequence Diagram(s)
sequenceDiagram
participant ContinueRoute
participant PrismaDB
participant WorkerClient
WorkerClient->>ContinueRoute: POST continue run execution
ContinueRoute->>PrismaDB: perform continuation logic
PrismaDB-->>ContinueRoute: infrastructure error
ContinueRoute->>ContinueRoute: isInfrastructureError check
alt infrastructure error
ContinueRoute-->>WorkerClient: rethrow for generic 500 handling
else other error
ContinueRoute-->>WorkerClient: 422 response
end
Related issues: None specified.
Related PRs: None specified.
Suggested labels: bug, core, webapp
Suggested reviewers: None specified.
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description covers the change well, but it does not follow the required template and is missing the issue link and standard sections. | Add the required template sections: Closes #, checklist items, Testing, Changelog, and Screenshots (or mark them N/A). |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: retrying run resumes through transient database outages. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
The supervisor-to-engine hop is the one that reaches the continue endpoint, so it is where a transient database outage surfaces as a retryable 5xx. Give its continueRunExecution the same longer, jittered retry budget as the workload client so it can ride out the outage.
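To illustrate what a longer, jittered retry budget could look like, here is a sketch that computes the delay before each retry. The option names (`maxAttempts`, `minDelayMs`, `maxDelayMs`, `factor`), the example values, and the specific jitter strategy (half the delay fixed, half randomized) are all assumptions for illustration, not the settings this PR applies to `continueRunExecution`.

```typescript
// Hypothetical option names and values; the PR's actual settings are not shown here.
interface BackoffOptions {
  maxAttempts: number;
  minDelayMs: number;
  maxDelayMs: number;
  factor: number;
}

// Delay before retry number `attempt` (1-based): exponential growth capped at
// maxDelayMs, with half of the delay fixed and the other half randomized.
function backoffDelayMs(
  attempt: number,
  options: BackoffOptions,
  random: () => number = Math.random
): number {
  const exponential = options.minDelayMs * Math.pow(options.factor, attempt - 1);
  const capped = Math.min(exponential, options.maxDelayMs);
  return Math.round(capped / 2 + (capped / 2) * random());
}

const exampleOptions: BackoffOptions = {
  maxAttempts: 8,
  minDelayMs: 500,
  maxDelayMs: 15000,
  factor: 2,
};

// Upper bound on the total wait if every retry is used.
function maxTotalWaitMs(options: BackoffOptions): number {
  let total = 0;
  // There is one delay between each pair of consecutive attempts.
  for (let attempt = 1; attempt < options.maxAttempts; attempt++) {
    total += backoffDelayMs(attempt, options, () => 1);
  }
  return total;
}
```

With these illustrative values the upper bound across seven retries is 45.5 seconds, which is the "tens of seconds" range the commit message targets. Because each client draws its own random factor, a fleet of resuming runs spreads its retries out instead of hitting the database at the same instant on recovery.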
7b8d0a0 to 8334820
@trigger.dev/build
trigger.dev
@trigger.dev/core
@trigger.dev/python
@trigger.dev/react-hooks
@trigger.dev/redis-worker
@trigger.dev/rsc
@trigger.dev/schema-to-json
@trigger.dev/sdk
The database-outage retry lives on the supervisor-to-engine hop; the workload client only reaches the supervisor's workload server, so its retry rides out supervisor blips (e.g. a restart), not DB outages. Fix the comment to say so.
Summary
When the platform is briefly unreachable while a run is resuming from a wait, the run no longer fails with `TASK_EXECUTION_ABORTED`. The worker now retries the resume through the outage instead of aborting on the first blip.

Root cause
Resuming a run calls the engine's `continue` worker-action endpoint. That route caught every error and returned a `422`, which the worker's HTTP client treats as non-retryable. So a transient Prisma infrastructure error (for example `P1001` "Can't reach database server") was flattened into a permanent failure: the worker gave up, force-killed the run process, and completed it with `TASK_EXECUTION_ABORTED`.

Fix
- The `continue` route now lets infrastructure errors propagate to the generic 500 handler (message scrubbed, and retryable by the worker's HTTP client), the same treatment the trigger path already gives them via `isInfrastructureError`. Genuine validation errors (snapshot mismatch, invalid state) still return `422`, so a stale retry stays non-retryable. Resuming is idempotent server-side (guarded by the snapshot id), so retrying is safe.
- The `continueRunExecution` call retries with a longer, jittered backoff so it can ride out an outage lasting tens of seconds, and the jitter keeps a fleet of resuming runs from stampeding the database the moment it recovers.

Builds on #3960, which scrubbed the leaked message on these routes but left the status non-retryable.
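The client-side consequence of the status split can be sketched as a small retry loop. This is an illustration under stated assumptions, not the worker's actual HTTP client: `sendWithRetry`, `isRetryableStatus`, and the injected `delayMs` and `sleep` parameters are hypothetical names, and the sketch treats every non-5xx response as final while ignoring thrown network errors, which a real client may also retry.

```typescript
type HttpResult = { status: number };

// Assumed classification for this sketch: 5xx is transient, anything else
// (including 422) is final.
function isRetryableStatus(status: number): boolean {
  return status >= 500 && status <= 599;
}

// Calls `send` until it returns a non-retryable status or attempts run out.
async function sendWithRetry(
  send: () => Promise<HttpResult>,
  maxAttempts: number,
  delayMs: (attempt: number) => number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<HttpResult> {
  let result = await send();
  for (
    let attempt = 1;
    attempt < maxAttempts && isRetryableStatus(result.status);
    attempt++
  ) {
    // Wait before the next attempt so a recovering server is not hammered.
    await sleep(delayMs(attempt));
    result = await send();
  }
  return result;
}
```

Under this classification, a `continue` call that sees 500 during an outage keeps trying until the budget is spent, while a 422 from a snapshot mismatch is returned after a single request.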