fix(cli): make the deploy lock all-or-nothing across hosts and expand its metadata#3163
Conversation
… its metadata DEP-1 (#2957): lock acquire/release went through $dispatchAny -> SshPool.onAny, which is first-success-wins and swallows per-host failures. On any multi-host fleet, contention on host 1 raised, onAny caught it, acquired a fresh lock on host 2, and the concurrent deploy proceeded — mutual exclusion held only for single-host configs. Release was also onAny, so the released host could differ from the acquired host, stranding stale locks. Now the deploy lock is acquired on EVERY (deduped) host, sequentially in config order with raise=true; on a partial failure the already-acquired locks are rolled back (never the contended host's — that lock belongs to the other deploy) and the per-host error surfaces as Wheels.Deploy.LockAcquireFailed naming the failing host. Release fans out to the exact hosts the lock was acquired on. The manual 'wheels deploy lock acquire/release/status' verbs follow the same fleet-wide semantics. SshPool.onAny itself is unchanged — the proxy boot check still legitimately uses it (Wave 2 scope). DEP-10 (#2957): LockCommands.acquire wrapped the whole symlink target in single quotes while claiming $(hostname)/$(date) were resolved by the remote shell — single quotes suppress command substitution, so lock metadata recorded the literal text. The target is now three concatenated shell words: shellEscape(user) + a double-quoted substitution segment + shellEscape(message), so the metadata expands while hostile metacharacters in user/message stay inert. Refs #2957 (Wave 1: DEP-1a, DEP-1b, DEP-10) Signed-off-by: Peter Amiri <peter@alurium.com>
There was a problem hiding this comment.
Wheels Bot — Reviewer
TL;DR: This PR closes the multi-host deploy-lock bypass (#2957 DEP-1/1b) by making acquisition all-or-nothing — sequential, config-ordered, raise=true, with rollback of only the already-acquired locks — and fixes DEP-10 so $(hostname)/$(date) actually expand in lock metadata while user/message stay shell-inert. I verified the load-bearing structure: the acquire at DeployMainCli.cfc:93 sits outside the inner try/finally, so a failed acquire never reaches the blanket release at line 121 and the contended host's lock is never touched (matching the spec assertion releaseHosts == ["10.0.0.1"]). The three-segment quoting in LockCommands.acquire is correct CFML ("""…""" emits one double-quoted shell word that joins with the single-quoted neighbors), and the specs cover hostile $(…) in both user and message. Test coverage is exemplary — red-first, happy path, partial-failure rollback, first-host contention, dedupe, and dry-run. Verdict: comment — two non-blocking findings below; nothing blocks merge.
Correctness
Non-blocking hardening suggestion — cli/lucli/services/deploy/cli/DeployMainCli.cfc:121:
$dispatch(lockHosts, lock.release(), dryRun, true);The comment above it says release failures "can never shadow the original deploy exception", but allowFail=true only sets raise=false, which tolerates nonzero exit codes. A transport-level failure (dead connection/session on a host that died mid-deploy) still throws from SshPool.onEach — and onEach rethrows on any host failure, where the old onAny release threw only when all hosts failed. An exception escaping this finally replaces the in-flight deploy error. The PR already has the right pattern one screen down: $rollbackAcquiredLocks (DeployMainCli.cfc:511) wraps the identical onEach release in try/catch precisely so "a rollback failure must never shadow" the real error. Suggest wrapping this finally release the same way (a writeOutput warning on swallow would keep the operator informed). Same applies to the DeployLockCli twin if you want the mirrors in lockstep. Not blocking: the window is narrow (a host must die between acquire and release), and the shadowing error is usually that same host's failure anyway.
Docs
The behavior change makes three published guide pages wrong, and this PR (already labeled docs) doesn't touch them:
web/sites/guides/src/content/docs/v4-0-0/command-line-tools/commands/deploy/lock/acquire.mdx:3— "Manually take the per-service deploy lock on one host." and:28— "…on any single host." Manual acquire now fans out to every deduped host with rollback (DeployLockCli.cfc:32).web/sites/guides/src/content/docs/v4-0-0/command-line-tools/commands/deploy/lock/release.mdx:33— "Removes … on a single host. Lock state is per-service, stored at one well-known path — any host sees the same lock." That last clause is exactly the false mental model DEP-1 fixed: the lock file is per-host local/tmp, which is why single-host release stranded stale locks. Release now fans out (DeployLockCli.cfc:45).web/sites/guides/src/content/docs/v4-0-0/command-line-tools/commands/deploy/lock/status.mdx:27— "Reads … on a single host." Status now checks every host (DeployLockCli.cfc:58).
Please update these three pages (and their examples if they show single-host output) so the published docs match the shipped fleet-wide semantics.
Tests
No findings — noting for the record that the suite is the strongest part of the PR: DeployLockAcquisitionSpec.cfc pins the full DEP-1 contract (ordering, bracketing around docker pull, rollback scope, first-host abort with zero releases, shared-host dedupe, dry-run with zero pool calls), DeployLockCliSpec.cfc mirrors it for the manual verbs, and LockCommandsSpec.cfc covers both the expansion fix and the injection-resistance of user and message (pwn$(touch /tmp/pwned) stays single-quoted). The state struct in the catch blocks follows the BoxLang-safe pattern (Cross-Engine Invariant 11), and the closure captures (var c, releaseCmd) match the prior art in $dispatch.
Security
No findings. The DEP-10 quoting change was the riskiest part of the diff and it holds up: shellEscape (Base.cfc:43, single-quote with '\'' re-open) keeps operator-controlled user/message inert, the double-quoted middle segment is a fixed literal containing only the two intended substitutions, and the new specs lock both properties down.
…and the fleet
Review findings on the all-or-nothing lock PR: allowFail only mapped to
{raise: false} inside SshClient.run, which suppresses nonzero exit codes
but not transport failures.
- DeployLockCli release/status fanned out via SshPool.onEach, which
pre-resolves a connection for every host before running anything, so
one unreachable host aborted the whole verb with zero commands
executed — turning the prescribed stale-lock recovery path into a
dead-end exactly when a host died mid-deploy. Both verbs now dispatch
per host with a per-host try/catch and report skipped hosts in the
summary instead of throwing.
- DeployMainCli's finally-block release claimed it could never shadow
the original deploy exception, but a transport failure (dead cached
connection, startSession throws inside the onEach task, rethrown from
future.get) propagated out of the finally and replaced the in-flight
deploy error. The release is now per-host best-effort; skipped hosts
are logged, never thrown, and every healthy host is still released.
- $rollbackAcquiredLocks in both CFCs is host-granular for the same
reason: one dead host no longer stops the rollback from clearing the
locks on the remaining healthy hosts.
- FakeSshPool now mirrors the real pool's transport semantics so specs
can exercise these paths: failConnection(host) models the eager
connect throw (onEach aborts wholesale, sequential fails lazily,
onAny skips to the next host) and a scripted transportError result
throws from run() regardless of raise=false. Verified the new specs
fail against the previous production code.
Real SSH remains unverifiable in the harness — FakeSshPool mirrors the
verified real-pool semantics (SshPool.onEach pre-resolve, SshClient.run
transport throws) and the full CLI suite passes.
Signed-off-by: Peter Amiri <peter@alurium.com>
Signed-off-by: Peter Amiri <petera@pai.com> # Conflicts: # cli/lucli/services/deploy/cli/DeployLockCli.cfc # cli/lucli/services/deploy/cli/DeployMainCli.cfc # cli/lucli/tests/specs/deploy/cli/DeployLockCliSpec.cfc # cli/lucli/tests/specs/deploy/cli/DeployMainCliSpec.cfc
Signed-off-by: Peter Amiri <petera@pai.com> # Conflicts: # cli/lucli/services/deploy/cli/DeployMainCli.cfc
Summary
Wave 1 of the #2957 deploy-orchestration campaign: lock correctness.
Closes these items from the re-scoped checklist on #2957 (Status 2026-06-12 section):
DEP-1 — multi-host lock bypass
Before:
deploy()acquired/released the lock via$dispatchAny→SshPool.onAny, which is first-success-wins and catches every per-host exception. On a 2+ host fleet, contention on host 1 raised,onAnyswallowed it, acquired a fresh lock on host 2, and the concurrent deploy proceeded — mutual exclusion held only for single-host configs. Release was alsoonAny, so the released host could differ from the acquired host, stranding stale locks.Now (a strict superset of the checklist's primary-host suggestion — per-wave direction was all-or-nothing with rollback):
raise=true— two concurrent deploys probe hosts in the same order, so exactly one wins the first host and the other aborts there;Wheels.Deploy.LockAcquireFailednaming the failing host, with the per-host cause and stderr detail surfaced;finallyfans out viaonEachto the exact hosts the lock was acquired on (DEP-1b), stillallowFailso it can never shadow the original deploy error (Deploy commands report successful execution even when no deployment action actually occurs #2696 contract preserved);wheels deploy lock acquire/release/statusverbs (DeployLockCli) follow the same fleet-wide semantics — a manualreleasethat cleared one host of a fleet-wide lock would strand the rest.SshPool.onAnyitself is deliberately unchanged: its first-success-wins semantics are still the right contract for the proxy boot check, which is Wave 2 scope (DEP-5b).DEP-10 — lock metadata never expanded
LockCommands.acquirewrapped the whole symlink target in single quotes while the (now-rewritten) comment claimed$(hostname)/$(date)were "resolved by the remote shell" — single quotes suppress command substitution, so lock metadata recorded the literal$(hostname)/$(date --iso-8601=seconds)text. The target is now three concatenated shell words:shellEscape(user)+ a double-quoted@$(hostname)/$(date --iso-8601=seconds)/segment +shellEscape(message)— adjacent quoted segments join into one argument, so the substitutions expand while hostile metacharacters in user/message stay inert (keeps the #3008Base.shellEscapeescaping).Testing
TDD red-first. Real SSH / docker-remote is unverifiable in this harness — FakeSshPool specs + dry-run flows are the bar here (the env-gated
E2EDeploySpeccovers the live SSH path separately).New specs (11, all red before the fix, all green after):
cli/lucli/tests/specs/deploy/cli/DeployLockAcquisitionSpec.cfc(new) — acquire on every host before pull + release on every host; partial-acquire rollback releases only the acquired host and aborts before anydocker pull; contended first host aborts with zero rollback and never probes host 2; shared-host (web+job) config locks once (no self-contention); dry-run renders per-host lock lines with zero pool calls.DeployLockCliSpec.cfc— manual acquire fans out in order; manual acquire rollback on contention; release/status fan out to every host.LockCommandsSpec.cfc—$(hostname)/$(date)land in a double-quoted segment; hostile$(...)in user and in message stay single-quoted/inert.multi-host.yml,shared-host.yml.Full CLI suite in the lucee7 docker harness (
wheels-test-lucee7:v1.0.0,/wheels/cli/tests?format=json):The 2 errors are the known docker-env artifacts (
SshClientSpec/SshPoolSpecdocker-not-found), identical before and after.Refs #2957
🤖 Generated with Claude Code