Fix hidden queued deployments in deployment history #4146
M-Hassan-Raza wants to merge 6 commits into Dokploy:canary from
Conversation
add queued status
wire queued deployment flow
show queued deployment status
update drizzle metadata
```typescript
	.set({
		status: "cancelled",
		finishedAt: cancelledAt,
	})
	.where(eq(deployments.status, "running"))
	.returning();
```
queued deployments not cancelled on server restart
initCancelDeployments only cancels deployments with status = 'running'. After this PR, a deployment record can also be in 'queued' status. There is a time window between when the queued DB record is written (in attachQueuedDeployment) and when the BullMQ job is actually enqueued (in myQueue.add). If the server crashes in that window, the record will be stuck in 'queued' forever because:
- `initCancelDeployments` runs on startup but ignores `queued` rows
- No BullMQ job exists to transition it to `running`
The same problem occurs if an admin uses the "Clean Redis" action (cleanRedis mutation in settings.ts), which flushes all Redis data and destroys any waiting BullMQ jobs, leaving their corresponding queued DB records permanently stuck.
The fix is to also cancel queued deployments on startup:
```diff
 const result = await db
 	.update(deployments)
 	.set({
 		status: "cancelled",
 		finishedAt: cancelledAt,
 	})
-	.where(eq(deployments.status, "running"))
+	.where(inArray(deployments.status, ["running", "queued"]))
 	.returning();
```
(Requires importing inArray from drizzle-orm.)
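The same guard is needed for the Clean Redis path mentioned above: cancel any `running`/`queued` rows before flushing, so no DB record outlives its BullMQ job. Below is a minimal sketch of that ordering; the `table`, `cancelStuckDeployments`, and `flushAll` names are illustrative stubs standing in for the real Drizzle update and Redis client, not the actual Dokploy code.

```typescript
// Sketch only: in-memory stand-ins for the Drizzle table and Redis client.
type Deployment = { deploymentId: string; status: string; finishedAt?: string };

export const table: Deployment[] = [
	{ deploymentId: "d1", status: "queued" },
	{ deploymentId: "d2", status: "done" },
];

// Stub for the startup-style cancellation, extended to cover `queued`.
const cancelStuckDeployments = async (): Promise<Deployment[]> => {
	const cancelledAt = new Date().toISOString();
	const hit = table.filter((d) => ["running", "queued"].includes(d.status));
	for (const d of hit) {
		d.status = "cancelled";
		d.finishedAt = cancelledAt;
	}
	return hit;
};

// Stub for redis.flushall(); only records that the flush happened.
export let flushed = false;
const flushAll = async () => {
	flushed = true;
};

// Clean Redis should cancel stranded DB rows *before* destroying the jobs,
// so a crash between the two steps never leaves orphaned `queued` rows.
export const cleanRedis = async () => {
	await cancelStuckDeployments();
	await flushAll();
	return table;
};
```

Finished rows (`done`, `error`, `cancelled`) are untouched; only rows that still depend on a live BullMQ job are cancelled.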
```typescript
export const enqueueDeploymentJob = async (jobData: DeploymentJob) => {
	const queuedJobData = await attachQueuedDeployment(jobData);

	try {
		if (IS_CLOUD && queuedJobData.serverId) {
			await deploy(queuedJobData);
			return queuedJobData;
		}

		await myQueue.add(
			"deployments",
			{ ...queuedJobData },
			{
				removeOnComplete: true,
				removeOnFail: true,
			},
		);

		return queuedJobData;
	} catch (error) {
		if (queuedJobData.deploymentId) {
			await failQueuedDeployment(queuedJobData.deploymentId, error);
		}

		throw error;
	}
};
```
attachQueuedDeployment errors bypass failQueuedDeployment
If attachQueuedDeployment itself throws (e.g., the database is momentarily unreachable or writeDeploymentLogPreamble fails mid-flight), the exception propagates before the try block is entered. The catch block that calls failQueuedDeployment is therefore never reached.
In createDeployment's own catch clause, a new deployment row with status: 'error' is inserted. However, because attachQueuedDeployment threw, queuedJobData was never assigned, so the outer caller has no deploymentId to work with either.
The current structure:
```typescript
const queuedJobData = await attachQueuedDeployment(jobData); // ← if this throws, catch below is skipped
try {
	// ...
} catch (error) {
	if (queuedJobData.deploymentId) {
		await failQueuedDeployment(queuedJobData.deploymentId, error); // never reached
	}
}
```

Consider wrapping the entire body (including `attachQueuedDeployment`) in the try block, or adding a separate try/catch around `attachQueuedDeployment` to ensure all failure paths are handled consistently.
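One possible restructuring is sketched below: the assignment moves inside the try, and the optional-chained guard keeps the catch safe even when `attachQueuedDeployment` throws before `queuedJobData` is assigned. The helper implementations here are stubs for illustration, not the PR's actual code.

```typescript
// Stubs standing in for the real deployment-service helpers.
type DeploymentJob = { applicationId?: string; deploymentId?: string };

export const failedIds: string[] = []; // records failQueuedDeployment calls

// Stub: the DB write succeeds and assigns a deployment id.
const attachQueuedDeployment = async (
	job: DeploymentJob,
): Promise<DeploymentJob> => ({ ...job, deploymentId: "dep-1" });

// Stub: the queue is unavailable (stands in for myQueue.add failing).
const addToQueue = async (_job: DeploymentJob): Promise<void> => {
	throw new Error("redis unreachable");
};

const failQueuedDeployment = async (id: string, _error: unknown) => {
	failedIds.push(id);
};

export const enqueueDeploymentJob = async (jobData: DeploymentJob) => {
	// Declared outside the assignment so the catch can still see it when
	// attachQueuedDeployment itself throws (it stays undefined in that case).
	let queuedJobData: DeploymentJob | undefined;
	try {
		queuedJobData = await attachQueuedDeployment(jobData);
		await addToQueue(queuedJobData);
		return queuedJobData;
	} catch (error) {
		if (queuedJobData?.deploymentId) {
			await failQueuedDeployment(queuedJobData.deploymentId, error);
		}
		throw error;
	}
};
```

With this shape, a failure in `attachQueuedDeployment` simply rethrows (there is no deployment row to mark yet), while any later failure marks the already-created row as failed before propagating.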
Just noticed there was a conflict in the branch; I have rebased now. Hope this PR gets a look. It may look like a common slop PR given its size (the size is mainly just the db snapshot), but I did test the code and refactored wherever I saw some slopification happening.
Closes #4072
Summary
When multiple deployments are triggered at the same time, only the active deployment gets a deployment record immediately. Waiting jobs are queued in BullMQ, but they do not appear in the deployment history until a worker starts processing them.
This change creates the deployment record before the job is added to the queue and keeps that same record through the rest of the deployment lifecycle.
What changed
- New `queued` deployment status
- `deploymentId` is passed through the queue and remote deploy API so workers update the existing row instead of creating a new one
- Cancelling a queued job marks the deployment record as `cancelled`

Manual testing
- Waiting deployments appear in the history with status `queued`
- A deployment transitions from `queued` to `running` to `done`
- Cancel Queues updates a waiting deployment from `queued` to `cancelled`

Screenshots
Service deployment view
Caption: Queued deployment visible while another deployment is running
Centralized deployments view
Caption: Queued status shown in the centralized deployments view
Queue cancellation
Caption: Queued deployment updated to cancelled after queue cleanup

Notes
Greptile Summary
This PR introduces a `queued` deployment status so that BullMQ-waiting deployments are visible in the history immediately, rather than only appearing once a worker picks them up. The approach is well-thought-out: a deployment record is created as `queued` before the job is enqueued, the same record is transitioned to `running` when the worker starts processing it, and cancellation/failure paths update the record accordingly. The cloud (`IS_CLOUD`) remote-server path correctly propagates `deploymentId` through the API schema and handler.

Key issues found:

- `queued` records stuck: `initCancelDeployments` only cancels `status = 'running'` rows. If the server crashes between writing the `queued` DB record and successfully calling `myQueue.add`, or if an admin flushes Redis via the "Clean Redis" action (`cleanRedis` mutation in `settings.ts`), every `queued` row in the database will remain permanently stuck.
- `cleanRedis` doesn't cancel outstanding `queued` deployment records: Flushing Redis destroys in-flight BullMQ jobs but leaves their corresponding `queued` DB rows orphaned.
- `startQueuedDeployment` throws `TRPCError` from worker context: This function is invoked inside BullMQ workers, not tRPC procedures. Throwing `TRPCError` causes the job to be marked as failed with misleading error output.
- `attachQueuedDeployment` error bypasses `failQueuedDeployment`: In `enqueueDeploymentJob`, the outer `catch` that calls `failQueuedDeployment` is only reachable if `attachQueuedDeployment` itself succeeds.

Confidence Score: 2/5
Not safe to merge as-is — server restarts and the existing 'Clean Redis' admin action can permanently strand deployment records in the queued state with no recovery path.
The core feature logic is sound and well-structured, but there are multiple independent scenarios where queued deployment records can become permanently stuck. The startup cancellation routine and the Redis-flush mutation both need to be updated to handle the new queued status before this is safe to ship.
packages/server/src/utils/startup/cancel-deployments.ts and apps/dokploy/server/api/routers/settings.ts (the cleanRedis mutation) require the most attention.
Comments Outside Diff (2)
packages/server/src/services/deployment.ts, line 1070-1075 (link)

`TRPCError` thrown from BullMQ worker context

`startQueuedDeployment` is called from BullMQ worker job handlers (e.g., `deployApplication`, `rebuildCompose`), which are not tRPC procedures. Throwing `TRPCError` here causes BullMQ to mark the job as failed (rather than handling the case gracefully), and emits confusing error stack traces in the worker logs that expose tRPC internals.

More concretely: if the deployment was cancelled between job enqueue and job processing (a valid race condition despite job removal being attempted), the worker crashes the job unnecessarily. The deployment DB row is already `'cancelled'`, so the correct behaviour is a silent no-op.

Consider returning the existing deployment record (or `null`) and letting the caller decide whether to abort. The callers (`deployApplication`, `rebuildApplication`, etc.) can then bail out early when `null` is returned, rather than letting a `TRPCError` unwind the BullMQ job.

apps/dokploy/server/api/routers/settings.ts, line 104-126 (link)

`cleanRedis` leaves `queued` deployment records permanently stuck

`cleanRedis` flushes all Redis data via `FLUSHALL`. After this PR, every in-flight BullMQ job has a corresponding DB row with `status = 'queued'`. Flushing Redis destroys those jobs without updating the deployment records, so every `queued` row in the database will never transition to `running`, `done`, or `cancelled`.

The fix is to cancel all `queued` deployments in the database before (or right after) flushing Redis, analogous to what `cancelQueuedJobs` does for jobs with known `deploymentId`s. Alternatively, `cleanAllDeploymentQueue` could be called first to drain and cancel the queue entries before flushing Redis.

Reviews (1): Last reviewed commit: "chore(database): update drizzle metadata"
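The null-return contract suggested for `startQueuedDeployment` could look like the following sketch. The in-memory `rows` map and the `deployApplicationSketch` caller are hypothetical stand-ins for the real Drizzle query and BullMQ worker handler.

```typescript
// Sketch of the suggested contract: return null instead of throwing when the
// row was cancelled between enqueue and processing.
type Deployment = { deploymentId: string; status: string };

// Stand-in for the deployments table.
export const rows = new Map<string, Deployment>([
	["d1", { deploymentId: "d1", status: "cancelled" }],
	["d2", { deploymentId: "d2", status: "queued" }],
]);

export const startQueuedDeployment = async (
	deploymentId: string,
): Promise<Deployment | null> => {
	const deployment = rows.get(deploymentId);
	// A cancelled (or missing) row is a valid race, not an error: signal the
	// worker to no-op instead of failing the BullMQ job with a TRPCError.
	if (!deployment || deployment.status !== "queued") {
		return null;
	}
	deployment.status = "running";
	return deployment;
};

// Worker-side usage: bail out early instead of unwinding the job.
export const deployApplicationSketch = async (deploymentId: string) => {
	const deployment = await startQueuedDeployment(deploymentId);
	if (!deployment) return "skipped";
	// ... actual deploy work would run here ...
	return "started";
};
```

This keeps the race handling in one place: callers only need a single null check, and worker logs stay free of tRPC stack traces.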