
fix: avoid preview service resurrection after delete #4209

Draft
budivoogt wants to merge 18 commits into Dokploy:canary from budivoogt:fix/preview-teardown-race

Conversation

@budivoogt

Summary

This guards preview deploy/redeploy jobs so they cannot recreate a Swarm service after the preview deployment record has already been deleted.

Why

We reproduced a race where Dokploy removes the preview record on PR close, but an in-flight preview build can still finish later and create the Swarm service anyway, leaving an orphan service behind.

Issue: #4203

What changed

  • skip preview queue jobs whose preview deployment record is already gone
  • re-check the preview deployment record right before mechanizeDockerContainer() in preview deploy/rebuild flows
  • avoid posting error comments when the preview record/comment context is already gone
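The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `mechanizeDockerContainer` is the function named in this PR, but the record lookup and its signature here are stand-ins passed as parameters.

```typescript
// Shape of a preview deployment record (illustrative subset of fields).
type PreviewDeployment = { previewDeploymentId: string; previewStatus: string };

// Re-check the preview deployment record immediately before creating the
// Swarm service. If the record was deleted while the build ran (e.g. the PR
// closed mid-build), skip creation instead of resurrecting an orphan service.
async function guardedCreate(
	previewDeploymentId: string,
	findPreviewDeploymentById: (id: string) => Promise<PreviewDeployment | null>,
	mechanizeDockerContainer: () => Promise<void>,
): Promise<"created" | "skipped"> {
	const record = await findPreviewDeploymentById(previewDeploymentId);
	if (record === null) {
		// Record already gone: do not recreate the service.
		return "skipped";
	}
	await mechanizeDockerContainer();
	return "created";
}
```

The key property is that the check happens after the build, at the last moment before service creation, which is where the race window sits.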

Notes

  • This is intentionally a minimal guard, not a full queue-cancellation refactor.
  • #2453 improved preview deletion ordering, but the create-vs-delete race still exists in current canary.

Validation

  • pnpm exec biome check apps/dokploy/server/queues/deployments-queue.ts packages/server/src/services/application.ts packages/server/src/services/preview-deployment.ts
  • pnpm --filter=server typecheck
  • Full monorepo pnpm typecheck still fails in unrelated existing apps/api files on a fresh clone, so I did not use it as the regression signal for this PR.

@budivoogt
Author

budivoogt commented Apr 13, 2026

Ran this patch on our live single-node Swarm Dokploy instance. Controlled test result:

  • opened a throwaway PR in a private app repo
  • Dokploy created a preview record
  • closed the PR while the preview record existed and before any Swarm service existed
  • after close, no preview Swarm service ever appeared

So this patch does appear to fix the specific late-worker orphan-service resurrection race we reported.

However, the smoke test also exposed a second teardown problem:

  • the preview record remained in Dokploy as idle after PR close instead of being deleted
  • previewDeployment.delete for that stuck record returned 500
  • there was still no matching Swarm service

So from our side the patch looked directionally correct, but not yet sufficient for full preview teardown correctness. We deleted the stuck test record directly from Dokploy Postgres after capturing the result.

If useful I can open a focused follow-up PR once I trace that second failure mode in source.

@budivoogt
Author

budivoogt commented Apr 13, 2026

Superseded by the sanitized follow-up comment below. The important result was that, after the second patch iteration, the close-during-create smoke test no longer reproduced either the late orphan-service race or the stuck preview-record case on our self-hosted Swarm instance.

@budivoogt
Author

budivoogt commented Apr 13, 2026

Ran a second live smoke test against a Hetzner Dokploy instance after deploying this branch as a patched Dokploy image.

This time the teardown path behaved cleanly under the same race window:

  • opened a disposable PR in a private app repo
  • waited until Dokploy created a preview record with previewStatus=idle
  • confirmed no Swarm service existed yet
  • closed the PR immediately
  • polled Dokploy Postgres and docker service ls for 60 seconds

Observed result:

  • preview record count dropped to 0 immediately after close
  • no preview Swarm service ever appeared
  • the earlier stuck-idle-record / previewDeployment.delete 500 case was not reproduced with this follow-up patch

So with the two changes together, the full close-during-create smoke test passes on our server:

  • no late orphan service
  • no stuck preview record

We are still keeping our host-side reconciliation cron in place for now as defense in depth, but this patch set appears to fix the specific preview teardown race we reported in #4203.

@budivoogt
Author

budivoogt commented Apr 14, 2026

Added a follow-up patch for a stale-preview-on-push issue we reproduced in a private app repo.

What changed in 5f1b0608:

  • existing preview deployments triggered by pull_request.synchronize now enqueue type: "redeploy" instead of type: "deploy"
  • preview queue submission now logs explicit context on enqueue and on queue failure (action, appName, applicationId, previewDeploymentId, pullRequestId, jobType)
  • added a small unit test for the job-type decision

Why this mattered in production:

  • Dokploy was receiving PR push webhooks and passing collaborator auth
  • preview labels and preview limits were not blocking the app
  • but no new preview deployment row was created for the existing preview after push, so the environment stayed stale

This patch should make the existing-preview path take the explicit rebuild branch and make the next failure observable if queue submission is the remaining problem.
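The job-type decision from 5f1b0608 can be sketched like this (the function name and the string-typed `action` parameter are illustrative; only the deploy/redeploy rule comes from the PR text):

```typescript
type JobType = "deploy" | "redeploy";

// A pull_request.synchronize event on a PR that already has a preview
// deployment should enqueue an explicit rebuild, not a fresh deploy.
function decidePreviewJobType(action: string, hasExistingPreview: boolean): JobType {
	if (action === "synchronize" && hasExistingPreview) {
		return "redeploy";
	}
	return "deploy";
}
```

A first-time `opened` event, or a `synchronize` with no existing preview record, still takes the plain `deploy` path.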

Enables the GitHub Deployments service to post deployment objects and
status updates against PR preview commits. Requires existing installs
to reauthorize the app for the new permission to take effect.
Exposes createGithubDeployment, setGithubDeploymentStatus, and
deactivateGithubDeployments, built on the existing authGithub Octokit
flow. Every call is wrapped in try/catch that logs and returns — a
GitHub API outage must never break a Dokploy deploy.

Defaults for previews: transient_environment=true,
production_environment=false, auto_inactive=true on success so
replacing a preview automatically marks the prior deployment inactive.
Hooks into deployPreviewApplication and rebuildPreviewApplication:
- creates a transient GitHub deployment after the preview metadata
  loads, keyed to `<app>-pr-<PR-number>`
- posts in_progress before clone/build
- posts success (with the preview URL) alongside the existing "done"
  updates
- posts failure in catch paths and in the mid-build preview-removed
  early return

All GitHub API calls route through the defensive github-deployment
service, so GitHub outages can never fail a deploy.

When a PR closes, look up each preview deployment's application and
call deactivateGithubDeployments with the matching environment name
before removing the preview. Keeps the repo's Environments tab from
accumulating stale entries over the life of the project.

Failures here only warn — a GitHub API problem must not block the
underlying preview teardown.
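The close-path hook can be sketched as follows. `deactivateGithubDeployments` is the function named in the commit message, but here it and the delete step are stub parameters, and the preview shape is an illustrative subset:

```typescript
type Preview = { appName: string; prNumber: number };

// On PR close: deactivate the matching GitHub environment for each preview,
// then delete the preview. GitHub failures only warn and never block teardown.
async function teardownPreviews(
	previews: Preview[],
	deactivateGithubDeployments: (environment: string) => Promise<void>,
	deletePreview: (preview: Preview) => Promise<void>,
): Promise<void> {
	for (const preview of previews) {
		const environment = `${preview.appName}-pr-${preview.prNumber}`;
		try {
			await deactivateGithubDeployments(environment);
		} catch (error) {
			// A GitHub API problem must not block the preview teardown itself.
			console.warn(`deactivate failed for ${environment}`, error);
		}
		await deletePreview(preview);
	}
}
```

The try/catch placement is the whole point: the delete runs whether or not GitHub cooperated.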

None of these ever ran on this fork (zero GitHub Actions runs in the
fork's history) and all targeted upstream's branches, secrets, or
namespaces:

- dokploy.yml / deploy.yml: push to dokploy/dokploy and siumauricio/*
  on Docker Hub; we can't write to either
- pr-quality.yml: blocks commits from AI authors, incompatible with
  this fork's workflow
- format.yml, pull-request.yml, create-pr.yml: target main/canary; we
  branch off fix/preview-teardown-race and feat/* instead
- monitoring.yml, sync-openapi-docs.yml: upstream-only housekeeping

Cleared out before layering in our own image build pipeline.

Build + push multi-arch image to ghcr.io/budivoogt/dokploy on every
push to feat/*, fix/*, or canary-ctd (plus manual dispatch). Tags:

- vX.Y.Z-ctd<sha7>  — stable tag for a specific commit (use for rollouts)
- <branch-slug>     — rolling tag for the branch tip (use for dev/staging)

Uses GHCR_PAT (classic PAT with write:packages, read:packages) stored
as a repo secret. GitHub Actions cache (type=gha) cuts rebuild time on
layer hits.

Job summary prints the deploy command so tags are copy-pasteable from
the Actions run page.

Thin wrapper around `docker service update` on the Hetzner swarm node,
invoked over Tailscale SSH. Prints rollout status so you know whether
the new task scheduled cleanly.

GitHub Actions builds the image and pushes to GHCR; this script flips
the live service. Keeping deploy separate from build means GH Actions
never needs Tailnet access or production credentials.

Captures: image build trigger, tag conventions, deploy script usage,
GHCR visibility caveats, rebase recipe, and an inventory of active
fork patches with the files they live in. Kept short on purpose —
it's a runbook, not a handbook.

Also says explicitly when this file should be deleted: when upstream
merges equivalents of every row in the active-customizations table.

Tailscale MagicDNS resolves contracko-01 to the Hetzner box the same
way the IP did, and it reads better in docs and shell history. User
config already carries the identity and user, so we can drop the
explicit USER override too.

The Hetzner target is x86_64. QEMU-emulated arm64 builds of the
Dokploy Node/Next.js image are disproportionately slow — the first
run stalled past 40 minutes on the arm64 leg alone. We don't run
Dokploy anywhere else, so building amd64 only cuts end-to-end CI
time to roughly 6–8 minutes per image. Add arm64 back if we ever
deploy to ARM hardware.

Dokploy publishes port 3000 in host mode, so start-first deadlocks:
the replacement task cannot bind while the old task still holds the
port. During the first real rollout we hit this and had to force a
convergent update by hand. Switching the default to stop-first trades
30-60s of UI downtime for a deploy that actually converges. Preview
deploys in flight will queue behind the restart, which is fine.

getDomainHost already returns the full URL with scheme, so prepending
another https:// produced 'https://https://preview-...'. Live GitHub
Deployments showed the broken URL and the 'View deployment' button
404'd. Six call sites patched across deployPreviewApplication and
rebuildPreviewApplication.

Caught during the CTD-2065 smoke test on PR Dokploy#2691 — the deployment
itself landed and transitioned states correctly; only the click-through
URL was malformed.
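One way to make the call sites idempotent against this class of bug is a tiny normalizer like the following (illustrative only; the actual fix described above simply stopped prepending the scheme at the six call sites):

```typescript
// Return a URL with exactly one scheme: leave full URLs alone,
// prepend https:// only to bare hosts.
function toPreviewUrl(hostOrUrl: string): string {
	return /^https?:\/\//.test(hostOrUrl) ? hostOrUrl : `https://${hostOrUrl}`;
}
```

With this shape, passing either the bare host or the full URL from getDomainHost produces the same well-formed result, so the "https://https://…" double-prefix cannot recur.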

Extends the GitHub Deployments API integration from preview-only
(CTD-2065) to the regular deployApplication path. When Dokploy
builds a GitHub-sourced app (e.g. staging push), it now creates a
GitHub Deployment with the app name as environment and posts
in_progress → success/failure statuses.

Unlike previews: transient_environment=false (persistent env),
environment name is the app name (no -pr-N suffix), and the
environment_url uses the app's first configured domain.

This brings parity with Railway's staging deployment entries in the
repo's Environments tab and PR merge timelines.
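The environment-naming convention across the two paths can be summarized in one helper (the function name is illustrative; the naming rule itself comes from the commit messages above):

```typescript
// Previews use `<app>-pr-<PR-number>`; regular deploys use the bare app name.
function githubEnvironmentName(appName: string, prNumber?: number): string {
	return prNumber === undefined ? appName : `${appName}-pr-${prNumber}`;
}
```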
