fix: avoid preview service resurrection after delete #4209
budivoogt wants to merge 18 commits into Dokploy:canary from
Conversation
Ran this patch on our live single-node Swarm Dokploy instance. Controlled test result:
So this patch does appear to fix the specific late-worker orphan-service resurrection race we reported. However, the smoke test also exposed a second teardown problem:
So from our side the patch looked directionally correct, but not yet sufficient for full preview teardown correctness. We deleted the stuck test record directly from Dokploy Postgres after capturing the result. If useful, I can open a focused follow-up PR once I trace that second failure mode in source.
Superseded by the sanitized follow-up comment below. The important result was that, after the second patch iteration, the close-during-create smoke test no longer reproduced either the late orphan-service race or the stuck preview-record case on our self-hosted Swarm instance.
Ran a second live smoke test against a Hetzner Dokploy instance after deploying this branch as a patched Dokploy image. This time the teardown path behaved cleanly under the same race window:
Observed result:
So with the two changes together, the full close-during-create smoke test passes on our server:
We are still keeping our host-side reconciliation cron in place for now as defense in depth, but this patch set appears to fix the specific preview teardown race we reported in #4203.
Added a follow-up patch for a stale-preview-on-push issue we reproduced in a private app repo.

What changed in
Why this mattered in production:
This patch should make the existing-preview path take the explicit rebuild branch and make the next failure observable if queue submission is the remaining problem.
Enables the GitHub Deployments service to post deployment objects and status updates against PR preview commits. Requires existing installs to reauthorize the app for the new permission to take effect.
Exposes createGithubDeployment, setGithubDeploymentStatus, and deactivateGithubDeployments, built on the existing authGithub Octokit flow. Every call is wrapped in try/catch that logs and returns — a GitHub API outage must never break a Dokploy deploy. Defaults for previews: transient_environment=true, production_environment=false, auto_inactive=true on success so replacing a preview automatically marks the prior deployment inactive.
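The "logs and returns" defensive pattern this commit describes can be sketched with a generic helper. This is an illustrative sketch, not the actual Dokploy source; `safeGithubCall` is a made-up name, and the real service would invoke Octokit inside the callback.

```typescript
// Hypothetical sketch of the "log and return, never throw" wrapper.
async function safeGithubCall<T>(
  label: string,
  call: () => Promise<T>,
): Promise<T | undefined> {
  try {
    return await call();
  } catch (error) {
    // A GitHub API outage must never break a Dokploy deploy: log the
    // failure and swallow it instead of rethrowing.
    console.warn(`GitHub deployment call failed (${label}):`, error);
    return undefined;
  }
}

// Usage sketch: a failing GitHub call yields undefined, not an exception.
const result = await safeGithubCall("createGithubDeployment", async () => {
  throw new Error("simulated GitHub outage");
});
console.log(result); // undefined; the surrounding deploy continues
```

Callers then treat `undefined` as "GitHub status unavailable" and carry on with the deploy.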
Hooks into deployPreviewApplication and rebuildPreviewApplication:
- creates a transient GitHub deployment after the preview metadata loads, keyed to `<app>-pr-<PR-number>`
- posts in_progress before clone/build
- posts success (with the preview URL) alongside the existing "done" updates
- posts failure in catch paths and in the mid-build preview-removed early return

All GitHub API calls route through the defensive github-deployment service, so GitHub outages can never fail a deploy.
When a PR closes, look up each preview deployment's application and call deactivateGithubDeployments with the matching environment name before removing the preview. Keeps the repo's Environments tab from accumulating stale entries over the life of the project. Failures here only warn — a GitHub API problem must not block the underlying preview teardown.
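The warn-only ordering described here can be modeled with stubbed services (all names below are illustrative, not the actual Dokploy functions):

```typescript
// Sketch: deactivate GitHub deployments before removing each preview,
// but never let a GitHub failure block the teardown itself.
interface PreviewDeployment {
  appName: string;
  prNumber: number;
}

async function closePreviews(
  previews: PreviewDeployment[],
  deactivateGithubDeployments: (env: string) => Promise<void>,
  removePreview: (p: PreviewDeployment) => Promise<void>,
): Promise<void> {
  for (const preview of previews) {
    const env = `${preview.appName}-pr-${preview.prNumber}`;
    try {
      await deactivateGithubDeployments(env);
    } catch (error) {
      // Failures here only warn; GitHub must not block the real teardown.
      console.warn(`could not deactivate GitHub deployments for ${env}:`, error);
    }
    await removePreview(preview); // always runs, even after a GitHub error
  }
}

// Usage: even if GitHub rejects every call, the preview is still removed.
const removed: string[] = [];
await closePreviews(
  [{ appName: "web", prNumber: 12 }],
  async () => { throw new Error("simulated GitHub error"); },
  async (p) => { removed.push(`${p.appName}-pr-${p.prNumber}`); },
);
console.log(removed); // ["web-pr-12"]
```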
None of these ever ran on this fork (zero GitHub Actions runs in the fork's history) and all targeted upstream's branches, secrets, or namespaces:
- dokploy.yml / deploy.yml: push to dokploy/dokploy and siumauricio/* on Docker Hub; we can't write to either
- pr-quality.yml: blocks commits from AI authors, incompatible with this fork's workflow
- format.yml, pull-request.yml, create-pr.yml: target main/canary; we branch off fix/preview-teardown-race and feat/* instead
- monitoring.yml, sync-openapi-docs.yml: upstream-only housekeeping

Cleared out before layering in our own image build pipeline.
Build + push multi-arch image to ghcr.io/budivoogt/dokploy on every push to feat/*, fix/*, or canary-ctd (plus manual dispatch). Tags:
- vX.Y.Z-ctd<sha7> — stable tag for a specific commit (use for rollouts)
- <branch-slug> — rolling tag for the branch tip (use for dev/staging)

Uses GHCR_PAT (classic PAT with write:packages, read:packages) stored as a repo secret. GitHub Actions cache (type=gha) cuts rebuild time on layer hits. Job summary prints the deploy command so tags are copy-pasteable from the Actions run page.
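The stable-tag convention can be reproduced locally, e.g. when pinning a rollout. The version and SHA below are made-up example values:

```shell
# Compute the vX.Y.Z-ctd<sha7> stable tag for a given commit (example values).
VERSION="0.17.0"                # hypothetical Dokploy version
SHA="0123456789abcdef0123"      # hypothetical commit SHA
TAG="v${VERSION}-ctd${SHA:0:7}" # first 7 chars of the SHA
echo "ghcr.io/budivoogt/dokploy:${TAG}"
```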
Thin wrapper around `docker service update` on the Hetzner swarm node, invoked over Tailscale SSH. Prints rollout status so you know whether the new task scheduled cleanly. GitHub Actions builds the image and pushes to GHCR; this script flips the live service. Keeping deploy separate from build means GH Actions never needs Tailnet access or production credentials.
Captures: image build trigger, tag conventions, deploy script usage, GHCR visibility caveats, rebase recipe, and an inventory of active fork patches with the files they live in. Kept short on purpose — it's a runbook, not a handbook. Also says explicitly when this file should be deleted: when upstream merges equivalents of every row in the active-customizations table.
Tailscale MagicDNS resolves contracko-01 to the Hetzner box the same way the IP did, and it reads better in docs and shell history. User config already carries the identity and user, so we can drop the explicit USER override too.
The Hetzner target is x86_64. QEMU-emulated arm64 builds of the Dokploy Node/Next.js image are disproportionately slow — the first run stalled past 40 minutes on the arm64 leg alone. We don't run Dokploy anywhere else, so building amd64 only cuts end-to-end CI time to roughly 6–8 minutes per image. Add arm64 back if we ever deploy to ARM hardware.
Dokploy publishes port 3000 in host mode, so start-first deadlocks: the replacement task cannot bind while the old task still holds the port. During the first real rollout we hit this and had to force a convergent update by hand. Switching the default to stop-first trades 30-60s of UI downtime for a deploy that actually converges. Preview deploys in flight will queue behind the restart, which is fine.
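Assuming the service is managed via a stack file, the equivalent setting looks roughly like the fragment below; for an already-running service, `docker service update --update-order stop-first <service>` achieves the same. Service and port details are illustrative.

```yaml
# Stack-file fragment (sketch): stop the old task before starting the new
# one, so the host-mode port 3000 binding is free for the replacement.
services:
  dokploy:
    ports:
      - target: 3000
        published: 3000
        mode: host
    deploy:
      update_config:
        order: stop-first
```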
getDomainHost already returns the full URL with scheme, so prepending another https:// produced 'https://https://preview-...'. Live GitHub Deployments showed the broken URL and the 'View deployment' button 404'd. Six call sites patched across deployPreviewApplication and rebuildPreviewApplication. Caught during the CTD-2065 smoke test on PR Dokploy#2691 — the deployment itself landed and transitioned states correctly; only the click-through URL was malformed.
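This class of bug is easy to demonstrate, and to guard against with a small normalizer. The helper below is illustrative only; the actual patch simply stops prepending the scheme at the six call sites:

```typescript
// getDomainHost (in Dokploy) already returns a full URL with scheme;
// blindly prepending another one doubles it.
function withScheme(url: string): string {
  // Only prepend https:// when no scheme is present (hypothetical helper).
  return /^https?:\/\//.test(url) ? url : `https://${url}`;
}

const fromGetDomainHost = "https://preview-web-pr-12.example.com"; // example value

// The buggy call sites produced a doubled scheme:
const buggy = `https://${fromGetDomainHost}`;
console.log(buggy); // "https://https://preview-web-pr-12.example.com"

// The guarded version leaves an already-complete URL alone:
console.log(withScheme(fromGetDomainHost)); // "https://preview-web-pr-12.example.com"
```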
Extends the GitHub Deployments API integration from preview-only (CTD-2065) to the regular deployApplication path. When Dokploy builds a GitHub-sourced app (e.g. staging push), it now creates a GitHub Deployment with the app name as environment and posts in_progress → success/failure statuses. Unlike previews: transient_environment=false (persistent env), environment name is the app name (no -pr-N suffix), and the environment_url uses the app's first configured domain. This brings parity with Railway's staging deployment entries in the repo's Environments tab and PR merge timelines.
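The preview-vs-regular differences listed here can be captured in one small helper. This is a sketch with invented names, not the actual Dokploy code:

```typescript
// Hypothetical helper: choose GitHub Deployment parameters per build kind.
type DeployKind = { preview: true; prNumber: number } | { preview: false };

function githubEnvFor(appName: string, kind: DeployKind) {
  if (kind.preview) {
    // Previews: transient environment named <app>-pr-<N>.
    return {
      environment: `${appName}-pr-${kind.prNumber}`,
      transient_environment: true,
      production_environment: false,
    };
  }
  // Regular deploys: persistent environment named after the app itself.
  return {
    environment: appName,
    transient_environment: false,
    production_environment: false,
  };
}

console.log(githubEnvFor("web", { preview: true, prNumber: 7 }).environment); // "web-pr-7"
console.log(githubEnvFor("web", { preview: false }).environment); // "web"
```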
Summary
This guards preview deploy/redeploy jobs so they cannot recreate a Swarm service after the preview deployment record has already been deleted.
Why
We reproduced a race where Dokploy removes the preview record on PR close, but an in-flight preview build can still finish later and create the Swarm service anyway, leaving an orphan service behind.
Issue: #4203
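The race and the kind of guard that closes it can be modeled minimally with in-memory stand-ins. The real code lives in the preview deploy/rebuild flows; the Map below stands in for the preview deployment records and `createSwarmService` for `mechanizeDockerContainer()`:

```typescript
// Minimal model of the create-vs-delete race and a re-check guard.
const previewRecords = new Map<string, { appName: string }>();

async function deployPreview(
  previewId: string,
  createSwarmService: (appName: string) => Promise<void>,
): Promise<"deployed" | "aborted"> {
  // ...clone + build would happen here, taking arbitrary time...

  // Guard: re-fetch the record right before creating the service. If the
  // PR was closed mid-build and the record deleted, abort instead of
  // resurrecting an orphan Swarm service.
  const record = previewRecords.get(previewId);
  if (!record) {
    console.warn(`preview ${previewId} deleted mid-build; skipping service create`);
    return "aborted";
  }
  await createSwarmService(record.appName);
  return "deployed";
}

// Simulate the race: the record is deleted while the build is in flight.
previewRecords.set("p1", { appName: "web-pr-12" });
previewRecords.delete("p1"); // PR closed during the build
const outcome = await deployPreview("p1", async () => {});
console.log(outcome); // "aborted"; no orphan service is created
```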
What changed
- Guard around `mechanizeDockerContainer()` in preview deploy/rebuild flows

Notes
#2453 improved preview deletion ordering, but the create-vs-delete race still exists in current canary.

Validation
- `pnpm exec biome check apps/dokploy/server/queues/deployments-queue.ts packages/server/src/services/application.ts packages/server/src/services/preview-deployment.ts`
- `pnpm --filter=server typecheck`
- `pnpm typecheck` still fails in unrelated existing `apps/api` files on a fresh clone, so I did not use it as the regression signal for this PR.