5 changes: 5 additions & 0 deletions .changeset/resume-retry-transient-db.md
@@ -0,0 +1,5 @@
---
"@trigger.dev/core": patch
---

Runs resuming after a wait now survive transient platform issues instead of failing with `TASK_EXECUTION_ABORTED`. The worker retries the resume call generously with jittered backoff, so a brief blip while the run is being continued no longer aborts it.
@@ -4,7 +4,7 @@ import type { WorkerApiContinueRunExecutionRequestBody } from "@trigger.dev/core
import { z } from "zod";
import { logger } from "~/services/logger.server";
import { createLoaderWorkerApiRoute } from "~/services/routeBuilders/apiBuilder.server";
-import { clientSafeErrorMessage } from "~/utils/prismaErrors";
+import { clientSafeErrorMessage, isInfrastructureError } from "~/utils/prismaErrors";

export const loader = createLoaderWorkerApiRoute(
{
@@ -31,7 +31,21 @@ export const loader = createLoaderWorkerApiRoute(

return json(continuationResult);
} catch (error) {
-logger.warn("Failed to suspend run", { runFriendlyId, snapshotFriendlyId, error });
+logger.warn("Failed to continue run execution", {
+  runFriendlyId,
+  snapshotFriendlyId,
+  error,
+});
+
+// A Prisma infrastructure error (e.g. P1001 "Can't reach database
+// server") means the DB was transiently unreachable while resuming. A 422
+// is non-retryable, so the worker would permanently abort a run over a
+// blip. Let it propagate to the generic 500 handler, which scrubs the
+// message and is retried by the worker's HTTP client.
+if (isInfrastructureError(error)) {
+  throw error;
+}
+
if (error instanceof Error) {
throw json({ error: clientSafeErrorMessage(error) }, { status: 422 });
}
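
The catch block above splits failures into two classes: infrastructure errors that should be retried, and everything else. The sketch below illustrates that split in isolation. The names `looksLikeInfrastructureError` and `statusForContinueError`, the `ErrorWithCode` type, and the list of Prisma codes are assumptions made for illustration; the actual `isInfrastructureError` helper in `~/utils/prismaErrors` is not shown in this diff and may check different things.

```typescript
// Hypothetical stand-in for a Prisma error: only the `code` field matters here.
type ErrorWithCode = Error & { code?: string };

// Assumed set of connection-level Prisma codes (P1001 is the one named in the
// diff comment); the real helper may use a different list or other signals.
const TRANSIENT_CODES = new Set(["P1001", "P1002", "P1008", "P1017"]);

function looksLikeInfrastructureError(error: unknown): boolean {
  if (!(error instanceof Error)) return false;
  const code = (error as ErrorWithCode).code;
  return typeof code === "string" && TRANSIENT_CODES.has(code);
}

// Mirrors the branch order in the handler: an infrastructure error ends up as
// a retryable 500, any other Error stays a non-retryable 422.
function statusForContinueError(error: Error): 500 | 422 {
  return looksLikeInfrastructureError(error) ? 500 : 422;
}

const dbUnreachable: ErrorWithCode = Object.assign(
  new Error("Can't reach database server"),
  { code: "P1001" }
);
const invalidRequest = new Error("Snapshot does not belong to this run");

console.log(statusForContinueError(dbUnreachable));
console.log(statusForContinueError(invalidRequest));
```

The order of the checks matters: testing for the infrastructure case first keeps a database blip from being reported as a client error that the worker would never retry.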
15 changes: 15 additions & 0 deletions packages/core/src/v3/runEngineWorker/supervisor/http.ts
@@ -245,6 +245,21 @@ export class SupervisorHttpClient {
...this.defaultHeaders,
...this.runnerIdHeader(runnerId),
},
},
{
// This is the hop that reaches the engine, so it's where a transient
// database outage during resume surfaces (as a retryable 5xx). Resuming
// is idempotent server-side (guarded by the snapshot id), so retry
// generously to ride out the outage rather than aborting the run.
// `randomize` jitters the delay so a fleet of runs resuming at once
// doesn't stampede the DB the moment it recovers.
retry: {
minTimeoutInMs: 500,
maxTimeoutInMs: 10_000,
maxAttempts: 8,
factor: 2,
randomize: true,
},
}
);
}
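
Retrying eight times is only safe because, as the comment says, resuming is idempotent and guarded by the snapshot id. The engine's actual logic is not part of this diff, so the following is a minimal in-memory sketch of what such a guard could look like. `RunState`, `ContinueResult`, and `continueRun` are hypothetical names, and the real implementation may behave differently.

```typescript
// Hypothetical run record; `transitions` exists only to make repeats visible.
type RunState = {
  latestSnapshotId: string;
  status: "WAITING" | "EXECUTING";
  transitions: number;
};

type ContinueResult =
  | { ok: true; status: "EXECUTING" }
  | { ok: false; reason: string };

function continueRun(run: RunState, snapshotId: string): ContinueResult {
  // Guard: only a request that names the latest snapshot may continue the run.
  if (snapshotId !== run.latestSnapshotId) {
    return { ok: false, reason: "stale snapshot" };
  }
  // Perform the state change once; a retried request finds the run already
  // executing and simply gets the same answer back.
  if (run.status === "WAITING") {
    run.status = "EXECUTING";
    run.transitions += 1;
  }
  return { ok: true, status: "EXECUTING" };
}

const run: RunState = { latestSnapshotId: "snapshot_1", status: "WAITING", transitions: 0 };
const firstCall = continueRun(run, "snapshot_1");
const retriedCall = continueRun(run, "snapshot_1"); // e.g. the response to the first call was lost
const staleCall = continueRun(run, "snapshot_0");

console.log(firstCall, retriedCall, staleCall, run.transitions);
```

With a guard of this kind, a retry whose earlier attempt actually succeeded cannot continue the run a second time, which is the property the generous retry policy relies on.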
14 changes: 14 additions & 0 deletions packages/core/src/v3/runEngineWorker/workload/http.ts
@@ -132,6 +132,20 @@ export class WorkloadHttpClient {
headers: {
...this.defaultHeaders(),
},
},
{
// This hop only reaches the supervisor's workload server, so retry
// generously with jittered backoff to ride out a transient blip
// talking to the supervisor (e.g. a restart) rather than aborting the
// run. Database outages surface one hop further in, on the
// supervisor-to-engine call, which carries its own retry for them.
retry: {
minTimeoutInMs: 500,
maxTimeoutInMs: 10_000,
maxAttempts: 8,
factor: 2,
randomize: true,
},
}
)
);
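
Both hops use the same retry settings, so it helps to see roughly how long they wait. The sketch below computes a backoff schedule from those values. The formula is an assumption (a common exponential backoff, with `randomize` multiplying each delay by a factor in [1, 2)); `nextDelayMs` and `schedule` are hypothetical helpers, and the HTTP client's real calculation may differ.

```typescript
type RetryOptions = {
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  maxAttempts: number;
  factor: number;
  randomize: boolean;
};

// Delay to wait after failed attempt number `attempt` (1-based), or undefined
// once the attempts are exhausted. `random` is injectable for predictability.
function nextDelayMs(
  opts: RetryOptions,
  attempt: number,
  random: () => number = Math.random
): number | undefined {
  if (attempt >= opts.maxAttempts) return undefined;
  // Assumed jitter model: scale the delay by a factor in [1, 2).
  const jitter = opts.randomize ? 1 + random() : 1;
  const raw = jitter * opts.minTimeoutInMs * Math.pow(opts.factor, attempt - 1);
  return Math.round(Math.min(opts.maxTimeoutInMs, raw));
}

// Collects every delay until the attempts run out.
function schedule(opts: RetryOptions, random: () => number = Math.random): number[] {
  const delays: number[] = [];
  for (let attempt = 1; ; attempt++) {
    const delay = nextDelayMs(opts, attempt, random);
    if (delay === undefined) break;
    delays.push(delay);
  }
  return delays;
}

// The values from the diff.
const resumeRetry: RetryOptions = {
  minTimeoutInMs: 500,
  maxTimeoutInMs: 10_000,
  maxAttempts: 8,
  factor: 2,
  randomize: true,
};

const withoutJitter = schedule({ ...resumeRetry, randomize: false });
const total = withoutJitter.reduce((sum, ms) => sum + ms, 0);
console.log(withoutJitter, total);
// → [500, 1000, 2000, 4000, 8000, 10000, 10000] 35500
```

Under these assumptions, eight attempts mean seven waits totalling about 35.5 s without jitter, and at most about 45 s with it, since each wait can nearly double before hitting the 10 s cap.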