Fix hidden queued deployments in deployment history #4146

Open
M-Hassan-Raza wants to merge 6 commits into Dokploy:canary from M-Hassan-Raza:feat/queued-deployment-status

Conversation

@M-Hassan-Raza M-Hassan-Raza commented Apr 4, 2026

Closes #4072

Summary

When multiple deployments are triggered at the same time, only the active deployment gets a deployment record immediately. Waiting jobs are queued in BullMQ, but they do not appear in the deployment history until a worker starts processing them.

This change creates the deployment record before the job is added to the queue and keeps that same record through the rest of the deployment lifecycle.

What changed

  • add a queued deployment status
  • create deployment rows before enqueueing application, compose, and preview deployment jobs
  • pass deploymentId through the queue and remote deploy API so workers update the existing row instead of creating a new one
  • update queue cleanup actions so waiting deployments are marked as cancelled
  • show queued and cancelled deployment states in the deployment views
  • make preview deployment cards reflect the latest deployment attempt status
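The lifecycle described above can be sketched as follows. The type and helper names here are illustrative stand-ins, not the actual identifiers from the PR:

```typescript
// Hypothetical deployment status lifecycle, mirroring the flow described above.
type DeploymentStatus = "queued" | "running" | "done" | "error" | "cancelled";

interface DeploymentRecord {
  deploymentId: string;
  status: DeploymentStatus;
}

// In-memory stand-in for the deployments table.
const deploymentsTable = new Map<string, DeploymentRecord>();

// Create the record *before* enqueueing, so it is visible in the history immediately.
function createQueuedDeployment(deploymentId: string): DeploymentRecord {
  const record: DeploymentRecord = { deploymentId, status: "queued" };
  deploymentsTable.set(deploymentId, record);
  return record;
}

// The worker later transitions the *same* row instead of inserting a new one.
function transition(deploymentId: string, status: DeploymentStatus): void {
  const record = deploymentsTable.get(deploymentId);
  if (record) record.status = status;
}

const rec = createQueuedDeployment("dep-1");
transition("dep-1", "running");
transition("dep-1", "done");
console.log(rec.status); // "done" — one row covers the whole lifecycle
```

The key point is that `deploymentId` travels with the job payload, so the worker updates an existing row rather than creating a second one.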

Manual testing

  • triggered two compose deployments and confirmed the second deployment appears immediately as queued
  • confirmed the same deployment row transitions from queued to running to done
  • confirmed Cancel Queues updates a waiting deployment from queued to cancelled
  • confirmed webhook-triggered application deployments also show queued
  • confirmed global queue cleanup cancels queued application, compose, and preview deployment rows

Screenshots

Service deployment view

Caption: Queued deployment visible while another deployment is running


Centralized deployments view

Caption: Queued status shown in the centralized deployments view


Queue cancellation

Caption: Queued deployment updated to cancelled after queue cleanup

Notes

  • the last commit only contains generated Drizzle metadata which blew up the size :D

Greptile Summary

This PR introduces a queued deployment status so that BullMQ-waiting deployments are visible in the history immediately, rather than only appearing once a worker picks them up. The approach is well-thought-out: a deployment record is created as queued before the job is enqueued, the same record is transitioned to running when the worker starts processing it, and cancellation/failure paths update the record accordingly. The cloud (IS_CLOUD) remote-server path correctly propagates deploymentId through the API schema and handler.

Key issues found:

  • Server-restart / Redis-flush leaves queued records stuck: initCancelDeployments only cancels status = 'running' rows. If the server crashes between writing the queued DB record and successfully calling myQueue.add, or if an admin flushes Redis via the "Clean Redis" action (cleanRedis mutation in settings.ts), every queued row in the database will remain permanently stuck.
  • cleanRedis doesn't cancel outstanding queued deployment records: Flushing Redis destroys in-flight BullMQ jobs but leaves their corresponding queued DB rows orphaned.
  • startQueuedDeployment throws TRPCError from worker context: This function is invoked inside BullMQ workers, not tRPC procedures. Throwing TRPCError causes the job to be marked as failed with misleading error output.
  • attachQueuedDeployment error bypasses failQueuedDeployment: In enqueueDeploymentJob, the outer catch that calls failQueuedDeployment is only reachable if attachQueuedDeployment itself succeeds.

Confidence Score: 2/5

Not safe to merge as-is — server restarts and the existing 'Clean Redis' admin action can permanently strand deployment records in the queued state with no recovery path.

The core feature logic is sound and well-structured, but there are multiple independent scenarios where queued deployment records can become permanently stuck. The startup cancellation routine and the Redis-flush mutation both need to be updated to handle the new queued status before this is safe to ship.

packages/server/src/utils/startup/cancel-deployments.ts and apps/dokploy/server/api/routers/settings.ts (the cleanRedis mutation) require the most attention.

Comments Outside Diff (2)

  1. packages/server/src/services/deployment.ts, line 1070-1075 (link)

    P1 TRPCError thrown from BullMQ worker context

    startQueuedDeployment is called from BullMQ worker job handlers (e.g., deployApplication, rebuildCompose), which are not tRPC procedures. Throwing TRPCError here causes BullMQ to mark the job as failed (rather than handling the case gracefully), and emits confusing error stack traces in the worker logs that expose tRPC internals.

    More concretely: if the deployment was cancelled between job enqueue and job processing (a valid race condition despite job removal being attempted), the worker crashes the job unnecessarily. The deployment DB row is already 'cancelled', so the correct behaviour is a silent no-op.

    Consider returning the existing deployment record (or null) and letting the caller decide whether to abort:

    if (deployment.length === 0 || !deployment[0]) {
        // Deployment was already cancelled/processed — skip silently
        return null;
    }

    The callers (deployApplication, rebuildApplication, etc.) can then bail out early when null is returned, rather than letting a TRPCError unwind the BullMQ job.
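A minimal sketch of that caller pattern, using stubbed stand-ins for the real drizzle queries and worker handlers (the names and store are hypothetical):

```typescript
type Deployment = { deploymentId: string; status: string };

// Stand-in store; the real code queries the deployments table via drizzle.
const store = new Map<string, Deployment>();

// Returns null instead of throwing when the row is gone or already cancelled,
// so BullMQ never marks the job as failed for this benign race.
function startQueuedDeployment(deploymentId: string): Deployment | null {
  const row = store.get(deploymentId);
  if (!row || row.status !== "queued") {
    // Already cancelled/processed, so this is a silent no-op.
    return null;
  }
  row.status = "running";
  return row;
}

// Worker handler bails out early on null rather than crashing the job.
function deployApplication(deploymentId: string): string {
  const deployment = startQueuedDeployment(deploymentId);
  if (deployment === null) return "skipped";
  // ...actual deployment work would go here...
  return "deployed";
}
```

With this shape, a deployment cancelled between enqueue and processing simply results in a skipped job instead of a misleading failure.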

  2. apps/dokploy/server/api/routers/settings.ts, line 104-126 (link)

    P1 cleanRedis leaves queued deployment records permanently stuck

    cleanRedis flushes all Redis data via FLUSHALL. After this PR, every in-flight BullMQ job has a corresponding DB row with status = 'queued'. Flushing Redis destroys those jobs without updating the deployment records, so every queued row in the database will never transition to running, done, or cancelled.

    The fix is to cancel all queued deployments in the database before (or right after) flushing Redis, analogous to what cancelQueuedJobs does for jobs with known deploymentIds. Alternatively, cleanAllDeploymentQueue could be called first to drain and cancel the queue entries before flushing Redis.
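The suggested ordering can be sketched as follows, with stubbed dependencies standing in for the real drizzle update and Redis client (`cancelQueuedDeployments` and `flushAll` are hypothetical names):

```typescript
// In-memory stand-in for the deployments table.
const rows = [
  { id: "a", status: "queued" },
  { id: "b", status: "running" },
];
let redisFlushed = false;

// Stand-in for: UPDATE deployments SET status = 'cancelled' WHERE status = 'queued'
async function cancelQueuedDeployments(): Promise<number> {
  let cancelled = 0;
  for (const row of rows) {
    if (row.status === "queued") {
      row.status = "cancelled";
      cancelled++;
    }
  }
  return cancelled;
}

// Stand-in for redisClient.flushall(), which destroys all waiting BullMQ jobs.
async function flushAll(): Promise<void> {
  redisFlushed = true;
}

// Cancel queued DB rows *before* destroying their BullMQ jobs,
// so no row is left permanently stuck in 'queued'.
async function cleanRedis(): Promise<void> {
  await cancelQueuedDeployments();
  await flushAll();
}
```

Running the cancellation first means that even if the flush itself fails, no deployment row references a job that no longer exists.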

Reviews (1): Last reviewed commit: "chore(database): update drizzle metadata"

Greptile also left 2 inline comments on this PR.


@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. enhancement New feature or request labels Apr 4, 2026
Comment on lines 12 to 17
.set({
status: "cancelled",
finishedAt: cancelledAt,
})
.where(eq(deployments.status, "running"))
.returning();

P1 queued deployments not cancelled on server restart

initCancelDeployments only cancels deployments with status = 'running'. After this PR, a deployment record can also be in 'queued' status. There is a time window between when the queued DB record is written (in attachQueuedDeployment) and when the BullMQ job is actually enqueued (in myQueue.add). If the server crashes in that window, the record will be stuck in 'queued' forever because:

  • initCancelDeployments runs on startup but ignores queued rows
  • No BullMQ job exists to transition it to running

The same problem occurs if an admin uses the "Clean Redis" action (cleanRedis mutation in settings.ts), which flushes all Redis data and destroys any waiting BullMQ jobs, leaving their corresponding queued DB records permanently stuck.

The fix is to also cancel queued deployments on startup:

Suggested change:

// Before
.set({
    status: "cancelled",
    finishedAt: cancelledAt,
})
.where(eq(deployments.status, "running"))
.returning();

// After
const result = await db
    .update(deployments)
    .set({
        status: "cancelled",
        finishedAt: cancelledAt,
    })
    .where(inArray(deployments.status, ["running", "queued"]))
    .returning();

(Requires importing inArray from drizzle-orm.)

Comment on lines +53 to +79
export const enqueueDeploymentJob = async (jobData: DeploymentJob) => {
const queuedJobData = await attachQueuedDeployment(jobData);

try {
if (IS_CLOUD && queuedJobData.serverId) {
await deploy(queuedJobData);
return queuedJobData;
}

await myQueue.add(
"deployments",
{ ...queuedJobData },
{
removeOnComplete: true,
removeOnFail: true,
},
);

return queuedJobData;
} catch (error) {
if (queuedJobData.deploymentId) {
await failQueuedDeployment(queuedJobData.deploymentId, error);
}

throw error;
}
};

P1 attachQueuedDeployment errors bypass failQueuedDeployment

If attachQueuedDeployment itself throws (e.g., the database is momentarily unreachable or writeDeploymentLogPreamble fails mid-flight), the exception propagates before the try block is entered. The catch block that calls failQueuedDeployment is therefore never reached.

In createDeployment's own catch clause, a new deployment row with status: 'error' is inserted. However, because attachQueuedDeployment threw, queuedJobData was never assigned, so the outer caller has no deploymentId to work with either.

The current structure:

const queuedJobData = await attachQueuedDeployment(jobData); // ← if this throws, catch below is skipped

try {
    ...
} catch (error) {
    if (queuedJobData.deploymentId) {
        await failQueuedDeployment(queuedJobData.deploymentId, error); // never reached
    }
}

Consider wrapping the entire body (including attachQueuedDeployment) in the try block, or adding a separate try/catch around attachQueuedDeployment to ensure all failure paths are handled consistently.
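A sketch of the first option, with stubs standing in for the real helpers (`attachQueuedDeployment`, `failQueuedDeployment`, and the queue call are simplified stand-ins; a pre-existing `deploymentId` on the input makes the failure path reachable on every branch):

```typescript
interface DeploymentJob {
  deploymentId?: string;
}

let failedId: string | null = null;

// Stub: creates the queued DB row; throws when it is set up to fail.
async function attachQueuedDeployment(job: DeploymentJob): Promise<DeploymentJob> {
  if (job.deploymentId === "boom") throw new Error("db unreachable");
  return { ...job, deploymentId: job.deploymentId ?? "new-id" };
}

// Stub: marks the deployment row as failed.
async function failQueuedDeployment(id: string, _error: unknown): Promise<void> {
  failedId = id;
}

// Stub: stands in for myQueue.add / the remote deploy call.
async function addToQueue(_job: DeploymentJob): Promise<void> {}

// All failure paths, including attachQueuedDeployment itself, now reach the catch.
async function enqueueDeploymentJob(jobData: DeploymentJob): Promise<DeploymentJob> {
  let queuedJobData: DeploymentJob | undefined;
  try {
    queuedJobData = await attachQueuedDeployment(jobData);
    await addToQueue(queuedJobData);
    return queuedJobData;
  } catch (error) {
    const id = queuedJobData?.deploymentId ?? jobData.deploymentId;
    if (id) await failQueuedDeployment(id, error);
    throw error;
  }
}
```

Declaring `queuedJobData` before the `try` lets the `catch` fall back to whatever `deploymentId` the caller already knows about when the attach step itself throws.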

@M-Hassan-Raza (Author)

Just noticed there was a conflict in the branch; I have rebased now. Hope this PR gets a look. From its size it may look like a typical low-effort PR (the size is mainly the generated DB snapshot), but I did test the code and refactored wherever I saw sloppy patterns creeping in.



Development

Successfully merging this pull request may close these issues.

Concurrent deployments are silently queued with no UI feedback