
fix: avoid preview service resurrection after delete #4209

Draft
budivoogt wants to merge 18 commits into Dokploy:canary from budivoogt:fix/preview-teardown-race

Conversation

@budivoogt

Summary

This guards preview deploy/redeploy jobs so they cannot recreate a Swarm service after the preview deployment record has already been deleted.

Why

We reproduced a race where Dokploy removes the preview record on PR close, but an in-flight preview build can still finish later and create the Swarm service anyway, leaving an orphan service behind.

Issue: #4203

What changed

  • skip preview queue jobs whose preview deployment record is already gone
  • re-check the preview deployment record right before mechanizeDockerContainer() in preview deploy/rebuild flows
  • avoid posting error comments when the preview record/comment context is already gone
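The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `mechanizeDockerContainer` is the function named in this PR, but the record lookup and its signature here are stand-ins passed as parameters.

```typescript
// Shape of a preview deployment record (illustrative subset of fields).
type PreviewDeployment = { previewDeploymentId: string; previewStatus: string };

// Re-check the preview deployment record immediately before creating the
// Swarm service. If the record was deleted while the build ran (e.g. the PR
// closed mid-build), skip creation instead of resurrecting an orphan service.
async function guardedCreate(
	previewDeploymentId: string,
	findPreviewDeploymentById: (id: string) => Promise<PreviewDeployment | null>,
	mechanizeDockerContainer: () => Promise<void>,
): Promise<"created" | "skipped"> {
	const record = await findPreviewDeploymentById(previewDeploymentId);
	if (record === null) {
		// Record already gone: do not recreate the service.
		return "skipped";
	}
	await mechanizeDockerContainer();
	return "created";
}
```

The key property is that the check happens after the build, at the last moment before service creation, which is where the race window sits.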

Notes

  • This is intentionally a minimal guard, not a full queue-cancellation refactor.
  • #2453 improved preview deletion ordering, but the create-vs-delete race still exists in current canary.

Validation

  • pnpm exec biome check apps/dokploy/server/queues/deployments-queue.ts packages/server/src/services/application.ts packages/server/src/services/preview-deployment.ts
  • pnpm --filter=server typecheck
  • Full monorepo pnpm typecheck still fails in unrelated existing apps/api files on a fresh clone, so I did not use it as the regression signal for this PR.

@budivoogt
Author

budivoogt commented Apr 13, 2026

Ran this patch on our live single-node Swarm Dokploy instance. Controlled test result:

  • opened a throwaway PR in a private app repo
  • Dokploy created a preview record
  • closed the PR while the preview record existed and before any Swarm service existed
  • after close, no preview Swarm service ever appeared

So this patch does appear to fix the specific late-worker orphan-service resurrection race we reported.

However, the smoke test also exposed a second teardown problem:

  • the preview record remained in Dokploy as idle after PR close instead of being deleted
  • previewDeployment.delete for that stuck record returned 500
  • there was still no matching Swarm service

So from our side the patch looked directionally correct, but not yet sufficient for full preview teardown correctness. We deleted the stuck test record directly from Dokploy Postgres after capturing the result.

If useful I can open a focused follow-up PR once I trace that second failure mode in source.

@budivoogt
Author

budivoogt commented Apr 13, 2026

Superseded by the sanitized follow-up comment below. The important result was that, after the second patch iteration, the close-during-create smoke test no longer reproduced either the late orphan-service race or the stuck preview-record case on our self-hosted Swarm instance.

@budivoogt
Author

budivoogt commented Apr 13, 2026

Ran a second live smoke test against a Hetzner Dokploy instance after deploying this branch as a patched Dokploy image.

This time the teardown path behaved cleanly under the same race window:

  • opened a disposable PR in a private app repo
  • waited until Dokploy created a preview record with previewStatus=idle
  • confirmed no Swarm service existed yet
  • closed the PR immediately
  • polled Dokploy Postgres and docker service ls for 60 seconds

Observed result:

  • preview record count dropped to 0 immediately after close
  • no preview Swarm service ever appeared
  • the earlier stuck-idle-record / previewDeployment.delete 500 case was not reproduced with this follow-up patch

So with the two changes together, the full close-during-create smoke test passes on our server:

  • no late orphan service
  • no stuck preview record

We are still keeping our host-side reconciliation cron in place for now as defense in depth, but this patch set appears to fix the specific preview teardown race we reported in #4203.

@budivoogt
Author

budivoogt commented Apr 14, 2026

Added a follow-up patch for a stale-preview-on-push issue we reproduced in a private app repo.

What changed in 5f1b0608:

  • existing preview deployments triggered by pull_request.synchronize now enqueue type: "redeploy" instead of type: "deploy"
  • preview queue submission now logs explicit context on enqueue and on queue failure (action, appName, applicationId, previewDeploymentId, pullRequestId, jobType)
  • added a small unit test for the job-type decision

Why this mattered in production:

  • Dokploy was receiving PR push webhooks and passing collaborator auth
  • preview labels and preview limits were not blocking the app
  • but no new preview deployment row was created for the existing preview after push, so the environment stayed stale

This patch should make the existing-preview path take the explicit rebuild branch and make the next failure observable if queue submission is the remaining problem.
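The job-type decision from 5f1b0608 can be sketched like this (the function name and the string-typed `action` parameter are illustrative; only the deploy/redeploy rule comes from the PR text):

```typescript
type JobType = "deploy" | "redeploy";

// A pull_request.synchronize event on a PR that already has a preview
// deployment should enqueue an explicit rebuild, not a fresh deploy.
function decidePreviewJobType(action: string, hasExistingPreview: boolean): JobType {
	if (action === "synchronize" && hasExistingPreview) {
		return "redeploy";
	}
	return "deploy";
}
```

A first-time `opened` event, or a `synchronize` with no existing preview record, still takes the plain `deploy` path.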

Enables the GitHub Deployments service to post deployment objects and
status updates against PR preview commits. Requires existing installs
to reauthorize the app for the new permission to take effect.
Exposes createGithubDeployment, setGithubDeploymentStatus, and
deactivateGithubDeployments, built on the existing authGithub Octokit
flow. Every call is wrapped in try/catch that logs and returns — a
GitHub API outage must never break a Dokploy deploy.

Defaults for previews: transient_environment=true,
production_environment=false, auto_inactive=true on success so
replacing a preview automatically marks the prior deployment inactive.
Hooks into deployPreviewApplication and rebuildPreviewApplication:
- creates a transient GitHub deployment after the preview metadata
  loads, keyed to `<app>-pr-<PR-number>`
- posts in_progress before clone/build
- posts success (with the preview URL) alongside the existing "done"
  updates
- posts failure in catch paths and in the mid-build preview-removed
  early return

All GitHub API calls route through the defensive github-deployment
service, so GitHub outages can never fail a deploy.

When a PR closes, look up each preview deployment's application and
call deactivateGithubDeployments with the matching environment name
before removing the preview. Keeps the repo's Environments tab from
accumulating stale entries over the life of the project.

Failures here only warn — a GitHub API problem must not block the
underlying preview teardown.
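The close-path hook can be sketched as follows. `deactivateGithubDeployments` is the function named in the commit message, but here it and the delete step are stub parameters, and the preview shape is an illustrative subset:

```typescript
type Preview = { appName: string; prNumber: number };

// On PR close: deactivate the matching GitHub environment for each preview,
// then delete the preview. GitHub failures only warn and never block teardown.
async function teardownPreviews(
	previews: Preview[],
	deactivateGithubDeployments: (environment: string) => Promise<void>,
	deletePreview: (preview: Preview) => Promise<void>,
): Promise<void> {
	for (const preview of previews) {
		const environment = `${preview.appName}-pr-${preview.prNumber}`;
		try {
			await deactivateGithubDeployments(environment);
		} catch (error) {
			// A GitHub API problem must not block the preview teardown itself.
			console.warn(`deactivate failed for ${environment}`, error);
		}
		await deletePreview(preview);
	}
}
```

The try/catch placement is the whole point: the delete runs whether or not GitHub cooperated.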

None of these ever ran on this fork (zero GitHub Actions runs in the
fork's history) and all targeted upstream's branches, secrets, or
namespaces:

- dokploy.yml / deploy.yml: push to dokploy/dokploy and siumauricio/*
  on Docker Hub; we can't write to either
- pr-quality.yml: blocks commits from AI authors, incompatible with
  this fork's workflow
- format.yml, pull-request.yml, create-pr.yml: target main/canary; we
  branch off fix/preview-teardown-race and feat/* instead
- monitoring.yml, sync-openapi-docs.yml: upstream-only housekeeping

Cleared out before layering in our own image build pipeline.

Build + push multi-arch image to ghcr.io/budivoogt/dokploy on every
push to feat/*, fix/*, or canary-ctd (plus manual dispatch). Tags:

- vX.Y.Z-ctd<sha7>  — stable tag for a specific commit (use for rollouts)
- <branch-slug>     — rolling tag for the branch tip (use for dev/staging)

Uses GHCR_PAT (classic PAT with write:packages, read:packages) stored
as a repo secret. GitHub Actions cache (type=gha) cuts rebuild time on
layer hits.

Job summary prints the deploy command so tags are copy-pasteable from
the Actions run page.

Thin wrapper around `docker service update` on the Hetzner swarm node,
invoked over Tailscale SSH. Prints rollout status so you know whether
the new task scheduled cleanly.

GitHub Actions builds the image and pushes to GHCR; this script flips
the live service. Keeping deploy separate from build means GH Actions
never needs Tailnet access or production credentials.

Captures: image build trigger, tag conventions, deploy script usage,
GHCR visibility caveats, rebase recipe, and an inventory of active
fork patches with the files they live in. Kept short on purpose —
it's a runbook, not a handbook.

Also says explicitly when this file should be deleted: when upstream
merges equivalents of every row in the active-customizations table.

Tailscale MagicDNS resolves contracko-01 to the Hetzner box the same
way the IP did, and it reads better in docs and shell history. User
config already carries the identity and user, so we can drop the
explicit USER override too.

The Hetzner target is x86_64. QEMU-emulated arm64 builds of the
Dokploy Node/Next.js image are disproportionately slow — the first
run stalled past 40 minutes on the arm64 leg alone. We don't run
Dokploy anywhere else, so building amd64 only cuts end-to-end CI
time to roughly 6–8 minutes per image. Add arm64 back if we ever
deploy to ARM hardware.

Dokploy publishes port 3000 in host mode, so start-first deadlocks:
the replacement task cannot bind while the old task still holds the
port. During the first real rollout we hit this and had to force a
convergent update by hand. Switching the default to stop-first trades
30-60s of UI downtime for a deploy that actually converges. Preview
deploys in flight will queue behind the restart, which is fine.

getDomainHost already returns the full URL with scheme, so prepending
another https:// produced 'https://https://preview-...'. Live GitHub
Deployments showed the broken URL and the 'View deployment' button
404'd. Six call sites patched across deployPreviewApplication and
rebuildPreviewApplication.

Caught during the CTD-2065 smoke test on PR Dokploy#2691 — the deployment
itself landed and transitioned states correctly; only the click-through
URL was malformed.
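One way to make the call sites idempotent against this class of bug is a tiny normalizer like the following (illustrative only; the actual fix described above simply stopped prepending the scheme at the six call sites):

```typescript
// Return a URL with exactly one scheme: leave full URLs alone,
// prepend https:// only to bare hosts.
function toPreviewUrl(hostOrUrl: string): string {
	return /^https?:\/\//.test(hostOrUrl) ? hostOrUrl : `https://${hostOrUrl}`;
}
```

With this shape, passing either the bare host or the full URL from getDomainHost produces the same well-formed result, so the "https://https://…" double-prefix cannot recur.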

Extends the GitHub Deployments API integration from preview-only
(CTD-2065) to the regular deployApplication path. When Dokploy
builds a GitHub-sourced app (e.g. staging push), it now creates a
GitHub Deployment with the app name as environment and posts
in_progress → success/failure statuses.

Unlike previews: transient_environment=false (persistent env),
environment name is the app name (no -pr-N suffix), and the
environment_url uses the app's first configured domain.

This brings parity with Railway's staging deployment entries in the
repo's Environments tab and PR merge timelines.
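The environment-naming convention across the two paths can be summarized in one helper (the function name is illustrative; the naming rule itself comes from the commit messages above):

```typescript
// Previews use `<app>-pr-<PR-number>`; regular deploys use the bare app name.
function githubEnvironmentName(appName: string, prNumber?: number): string {
	return prNumber === undefined ? appName : `${appName}-pr-${prNumber}`;
}
```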
