Skip to content

fix(cli): make the deploy lock all-or-nothing across hosts and expand its metadata#3163

Merged
bpamiri merged 4 commits into
developfrom
peter/deploy-w1-lock-correctness
Jun 13, 2026
Merged

fix(cli): make the deploy lock all-or-nothing across hosts and expand its metadata#3163
bpamiri merged 4 commits into
developfrom
peter/deploy-w1-lock-correctness

Conversation

@bpamiri

@bpamiri bpamiri commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

Wave 1 of the #2957 deploy-orchestration campaign: lock correctness.

Closes these items from the re-scoped checklist on #2957 (Status 2026-06-12 section):

  • DEP-1a Acquire the deploy lock on a deterministic primary host (Kamal semantics: first host of the primary role) via onEach/sequential with raise=true — not $dispatchAny (DeployMainCli.cfc:88-92; SshPool.onAny swallows per-host failures, SshPool.cfc:101-108).
  • DEP-1b Release on the same host that acquired (DeployMainCli.cfc:113).
  • DEP-10 Double-quote the lock symlink target so $(hostname)/$(date) expand (LockCommands.cfc:39; keep user/message escaping from fix(cli): deploy remote-exec security — stdin registry login, shell escaping, env.secret fail-fast #3008, fix the stale comment at :32-36).

DEP-1 — multi-host lock bypass

Before: deploy() acquired/released the lock via $dispatchAnySshPool.onAny, which is first-success-wins and catches every per-host exception. On a 2+ host fleet, contention on host 1 raised, onAny swallowed it, acquired a fresh lock on host 2, and the concurrent deploy proceeded — mutual exclusion held only for single-host configs. Release was also onAny, so the released host could differ from the acquired host, stranding stale locks.

Now (a strict superset of the checklist's primary-host suggestion — per-wave direction was all-or-nothing with rollback):

  • the lock is acquired on every host, deduped (a host serving two roles must not contend with itself), sequentially in config order with raise=true — two concurrent deploys probe hosts in the same order, so exactly one wins the first host and the other aborts there;
  • a partial failure rolls back only the already-acquired locks (the contended host's lock belongs to the other deploy and is never touched) and aborts with Wheels.Deploy.LockAcquireFailed naming the failing host, with the per-host cause and stderr detail surfaced;
  • release in the finally fans out via onEach to the exact hosts the lock was acquired on (DEP-1b), still allowFail so it can never shadow the original deploy error (Deploy commands report successful execution even when no deployment action actually occurs #2696 contract preserved);
  • the manual wheels deploy lock acquire/release/status verbs (DeployLockCli) follow the same fleet-wide semantics — a manual release that cleared one host of a fleet-wide lock would strand the rest.

SshPool.onAny itself is deliberately unchanged: its first-success-wins semantics are still the right contract for the proxy boot check, which is Wave 2 scope (DEP-5b).

DEP-10 — lock metadata never expanded

LockCommands.acquire wrapped the whole symlink target in single quotes while the (now-rewritten) comment claimed $(hostname)/$(date) were "resolved by the remote shell" — single quotes suppress command substitution, so lock metadata recorded the literal $(hostname)/$(date --iso-8601=seconds) text. The target is now three concatenated shell words: shellEscape(user) + a double-quoted @$(hostname)/$(date --iso-8601=seconds)/ segment + shellEscape(message) — adjacent quoted segments join into one argument, so the substitutions expand while hostile metacharacters in user/message stay inert (keeps the #3008 Base.shellEscape escaping).

Testing

TDD red-first. Real SSH / docker-remote is unverifiable in this harness — FakeSshPool specs + dry-run flows are the bar here (the env-gated E2EDeploySpec covers the live SSH path separately).

New specs (11, all red before the fix, all green after):

  • cli/lucli/tests/specs/deploy/cli/DeployLockAcquisitionSpec.cfc (new) — acquire on every host before pull + release on every host; partial-acquire rollback releases only the acquired host and aborts before any docker pull; contended first host aborts with zero rollback and never probes host 2; shared-host (web+job) config locks once (no self-contention); dry-run renders per-host lock lines with zero pool calls.
  • DeployLockCliSpec.cfc — manual acquire fans out in order; manual acquire rollback on contention; release/status fan out to every host.
  • LockCommandsSpec.cfc$(hostname)/$(date) land in a double-quoted segment; hostile $(...) in user and in message stay single-quoted/inert.
  • New fixtures: multi-host.yml, shared-host.yml.

Full CLI suite in the lucee7 docker harness (wheels-test-lucee7:v1.0.0, /wheels/cli/tests?format=json):

  • baseline (clean develop + new specs): 992 pass / 11 fail (the new red specs) / 2 errors
  • after fix: 1003 pass / 0 fail / 2 errors

The 2 errors are the known docker-env artifacts (SshClientSpec / SshPoolSpec docker-not-found), identical before and after.

Refs #2957

🤖 Generated with Claude Code

… its metadata

DEP-1 (#2957): lock acquire/release went through $dispatchAny ->
SshPool.onAny, which is first-success-wins and swallows per-host
failures. On any multi-host fleet, contention on host 1 raised, onAny
caught it, acquired a fresh lock on host 2, and the concurrent deploy
proceeded — mutual exclusion held only for single-host configs. Release
was also onAny, so the released host could differ from the acquired
host, stranding stale locks.

Now the deploy lock is acquired on EVERY (deduped) host, sequentially
in config order with raise=true; on a partial failure the
already-acquired locks are rolled back (never the contended host's —
that lock belongs to the other deploy) and the per-host error surfaces
as Wheels.Deploy.LockAcquireFailed naming the failing host. Release
fans out to the exact hosts the lock was acquired on. The manual
'wheels deploy lock acquire/release/status' verbs follow the same
fleet-wide semantics. SshPool.onAny itself is unchanged — the proxy
boot check still legitimately uses it (Wave 2 scope).

DEP-10 (#2957): LockCommands.acquire wrapped the whole symlink target
in single quotes while claiming $(hostname)/$(date) were resolved by
the remote shell — single quotes suppress command substitution, so lock
metadata recorded the literal text. The target is now three
concatenated shell words: shellEscape(user) + a double-quoted
substitution segment + shellEscape(message), so the metadata expands
while hostile metacharacters in user/message stay inert.

Refs #2957 (Wave 1: DEP-1a, DEP-1b, DEP-10)

Signed-off-by: Peter Amiri <peter@alurium.com>
@bpamiri bpamiri enabled auto-merge (squash) June 12, 2026 19:54
@github-actions github-actions Bot added the docs label Jun 12, 2026

@wheels-bot wheels-bot Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wheels Bot — Reviewer

TL;DR: This PR closes the multi-host deploy-lock bypass (#2957 DEP-1/1b) by making acquisition all-or-nothing — sequential, config-ordered, raise=true, with rollback of only the already-acquired locks — and fixes DEP-10 so $(hostname)/$(date) actually expand in lock metadata while user/message stay shell-inert. I verified the load-bearing structure: the acquire at DeployMainCli.cfc:93 sits outside the inner try/finally, so a failed acquire never reaches the blanket release at line 121 and the contended host's lock is never touched (matching the spec assertion releaseHosts == ["10.0.0.1"]). The three-segment quoting in LockCommands.acquire is correct CFML ("""…""" emits one double-quoted shell word that joins with the single-quoted neighbors), and the specs cover hostile $(…) in both user and message. Test coverage is exemplary — red-first, happy path, partial-failure rollback, first-host contention, dedupe, and dry-run. Verdict: comment — two non-blocking findings below; nothing blocks merge.

Correctness

Non-blocking hardening suggestioncli/lucli/services/deploy/cli/DeployMainCli.cfc:121:

$dispatch(lockHosts, lock.release(), dryRun, true);

The comment above it says release failures "can never shadow the original deploy exception", but allowFail=true only sets raise=false, which tolerates nonzero exit codes. A transport-level failure (dead connection/session on a host that died mid-deploy) still throws from SshPool.onEach — and onEach rethrows on any host failure, where the old onAny release threw only when all hosts failed. An exception escaping this finally replaces the in-flight deploy error. The PR already has the right pattern one screen down: $rollbackAcquiredLocks (DeployMainCli.cfc:511) wraps the identical onEach release in try/catch precisely so "a rollback failure must never shadow" the real error. Suggest wrapping this finally release the same way (a writeOutput warning on swallow would keep the operator informed). Same applies to the DeployLockCli twin if you want the mirrors in lockstep. Not blocking: the window is narrow (a host must die between acquire and release), and the shadowing error is usually that same host's failure anyway.

Docs

The behavior change makes three published guide pages wrong, and this PR (already labeled docs) doesn't touch them:

  • web/sites/guides/src/content/docs/v4-0-0/command-line-tools/commands/deploy/lock/acquire.mdx:3 — "Manually take the per-service deploy lock on one host." and :28 — "…on any single host." Manual acquire now fans out to every deduped host with rollback (DeployLockCli.cfc:32).
  • web/sites/guides/src/content/docs/v4-0-0/command-line-tools/commands/deploy/lock/release.mdx:33 — "Removes … on a single host. Lock state is per-service, stored at one well-known path — any host sees the same lock." That last clause is exactly the false mental model DEP-1 fixed: the lock file is per-host local /tmp, which is why single-host release stranded stale locks. Release now fans out (DeployLockCli.cfc:45).
  • web/sites/guides/src/content/docs/v4-0-0/command-line-tools/commands/deploy/lock/status.mdx:27 — "Reads … on a single host." Status now checks every host (DeployLockCli.cfc:58).

Please update these three pages (and their examples if they show single-host output) so the published docs match the shipped fleet-wide semantics.

Tests

No findings — noting for the record that the suite is the strongest part of the PR: DeployLockAcquisitionSpec.cfc pins the full DEP-1 contract (ordering, bracketing around docker pull, rollback scope, first-host abort with zero releases, shared-host dedupe, dry-run with zero pool calls), DeployLockCliSpec.cfc mirrors it for the manual verbs, and LockCommandsSpec.cfc covers both the expansion fix and the injection-resistance of user and message (pwn$(touch /tmp/pwned) stays single-quoted). The state struct in the catch blocks follows the BoxLang-safe pattern (Cross-Engine Invariant 11), and the closure captures (var c, releaseCmd) match the prior art in $dispatch.

Security

No findings. The DEP-10 quoting change was the riskiest part of the diff and it holds up: shellEscape (Base.cfc:43, single-quote with '\'' re-open) keeps operator-controlled user/message inert, the double-quoted middle segment is a fixed literal containing only the two intended substitutions, and the new specs lock both properties down.

bpamiri and others added 3 commits June 12, 2026 13:24
…and the fleet

Review findings on the all-or-nothing lock PR: allowFail only mapped to
{raise: false} inside SshClient.run, which suppresses nonzero exit codes
but not transport failures.

- DeployLockCli release/status fanned out via SshPool.onEach, which
  pre-resolves a connection for every host before running anything, so
  one unreachable host aborted the whole verb with zero commands
  executed — turning the prescribed stale-lock recovery path into a
  dead-end exactly when a host died mid-deploy. Both verbs now dispatch
  per host with a per-host try/catch and report skipped hosts in the
  summary instead of throwing.

- DeployMainCli's finally-block release claimed it could never shadow
  the original deploy exception, but a transport failure (dead cached
  connection, startSession throws inside the onEach task, rethrown from
  future.get) propagated out of the finally and replaced the in-flight
  deploy error. The release is now per-host best-effort; skipped hosts
  are logged, never thrown, and every healthy host is still released.

- $rollbackAcquiredLocks in both CFCs is host-granular for the same
  reason: one dead host no longer stops the rollback from clearing the
  locks on the remaining healthy hosts.

- FakeSshPool now mirrors the real pool's transport semantics so specs
  can exercise these paths: failConnection(host) models the eager
  connect throw (onEach aborts wholesale, sequential fails lazily,
  onAny skips to the next host) and a scripted transportError result
  throws from run() regardless of raise=false. Verified the new specs
  fail against the previous production code.

Real SSH remains unverifiable in the harness — FakeSshPool mirrors the
verified real-pool semantics (SshPool.onEach pre-resolve, SshClient.run
transport throws) and the full CLI suite passes.

Signed-off-by: Peter Amiri <peter@alurium.com>
Signed-off-by: Peter Amiri <petera@pai.com>

# Conflicts:
#	cli/lucli/services/deploy/cli/DeployLockCli.cfc
#	cli/lucli/services/deploy/cli/DeployMainCli.cfc
#	cli/lucli/tests/specs/deploy/cli/DeployLockCliSpec.cfc
#	cli/lucli/tests/specs/deploy/cli/DeployMainCliSpec.cfc
Signed-off-by: Peter Amiri <petera@pai.com>

# Conflicts:
#	cli/lucli/services/deploy/cli/DeployMainCli.cfc
@bpamiri bpamiri merged commit 5fb4450 into develop Jun 13, 2026
8 checks passed
@bpamiri bpamiri deleted the peter/deploy-w1-lock-correctness branch June 13, 2026 05:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant