fix(cli): make the deploy lock all-or-nothing across hosts and expand its metadata by bpamiri · Pull Request #3163 · wheels-dev/wheels

bpamiri · 2026-06-12T19:54:31Z

Summary

Wave 1 of the #2957 deploy-orchestration campaign: lock correctness.

Closes these items from the re-scoped checklist on #2957 (Status 2026-06-12 section):

DEP-1a Acquire the deploy lock on a deterministic primary host (Kamal semantics: first host of the primary role) via onEach/sequential with raise=true — not $dispatchAny (DeployMainCli.cfc:88-92; SshPool.onAny swallows per-host failures, SshPool.cfc:101-108).

DEP-1b Release on the same host that acquired (DeployMainCli.cfc:113).

DEP-10 Double-quote the lock symlink target so $(hostname)/$(date) expand (LockCommands.cfc:39; keep user/message escaping from fix(cli): deploy remote-exec security — stdin registry login, shell escaping, env.secret fail-fast #3008, fix the stale comment at :32-36).

DEP-1 — multi-host lock bypass

Before: deploy() acquired/released the lock via $dispatchAny → SshPool.onAny, which is first-success-wins and catches every per-host exception. On a 2+ host fleet, contention on host 1 raised, onAny swallowed it, acquired a fresh lock on host 2, and the concurrent deploy proceeded — mutual exclusion held only for single-host configs. Release was also onAny, so the released host could differ from the acquired host, stranding stale locks.

Now (a strict superset of the checklist's primary-host suggestion — per-wave direction was all-or-nothing with rollback):

the lock is acquired on every host, deduped (a host serving two roles must not contend with itself), sequentially in config order with raise=true — two concurrent deploys probe hosts in the same order, so exactly one wins the first host and the other aborts there;
a partial failure rolls back only the already-acquired locks (the contended host's lock belongs to the other deploy and is never touched) and aborts with Wheels.Deploy.LockAcquireFailed naming the failing host, with the per-host cause and stderr detail surfaced;
release in the finally fans out via onEach to the exact hosts the lock was acquired on (DEP-1b), still allowFail so it can never shadow the original deploy error (Deploy commands report successful execution even when no deployment action actually occurs #2696 contract preserved);
the manual wheels deploy lock acquire/release/status verbs (DeployLockCli) follow the same fleet-wide semantics — a manual release that cleared one host of a fleet-wide lock would strand the rest.

SshPool.onAny itself is deliberately unchanged: its first-success-wins semantics are still the right contract for the proxy boot check, which is Wave 2 scope (DEP-5b).

DEP-10 — lock metadata never expanded

LockCommands.acquire wrapped the whole symlink target in single quotes while the (now-rewritten) comment claimed $(hostname)/$(date) were "resolved by the remote shell" — single quotes suppress command substitution, so lock metadata recorded the literal $(hostname)/$(date --iso-8601=seconds) text. The target is now three concatenated shell words: shellEscape(user) + a double-quoted @$(hostname)/$(date --iso-8601=seconds)/ segment + shellEscape(message) — adjacent quoted segments join into one argument, so the substitutions expand while hostile metacharacters in user/message stay inert (keeps the #3008 Base.shellEscape escaping).

Testing

TDD red-first. Real SSH / docker-remote is unverifiable in this harness — FakeSshPool specs + dry-run flows are the bar here (the env-gated E2EDeploySpec covers the live SSH path separately).

New specs (11, all red before the fix, all green after):

cli/lucli/tests/specs/deploy/cli/DeployLockAcquisitionSpec.cfc (new) — acquire on every host before pull + release on every host; partial-acquire rollback releases only the acquired host and aborts before any docker pull; contended first host aborts with zero rollback and never probes host 2; shared-host (web+job) config locks once (no self-contention); dry-run renders per-host lock lines with zero pool calls.
DeployLockCliSpec.cfc — manual acquire fans out in order; manual acquire rollback on contention; release/status fan out to every host.
LockCommandsSpec.cfc — $(hostname)/$(date) land in a double-quoted segment; hostile $(...) in user and in message stay single-quoted/inert.
New fixtures: multi-host.yml, shared-host.yml.

Full CLI suite in the lucee7 docker harness (wheels-test-lucee7:v1.0.0, /wheels/cli/tests?format=json):

baseline (clean develop + new specs): 992 pass / 11 fail (the new red specs) / 2 errors
after fix: 1003 pass / 0 fail / 2 errors

The 2 errors are the known docker-env artifacts (SshClientSpec / SshPoolSpec docker-not-found), identical before and after.

Refs #2957

🤖 Generated with Claude Code

… its metadata DEP-1 (#2957): lock acquire/release went through $dispatchAny -> SshPool.onAny, which is first-success-wins and swallows per-host failures. On any multi-host fleet, contention on host 1 raised, onAny caught it, acquired a fresh lock on host 2, and the concurrent deploy proceeded — mutual exclusion held only for single-host configs. Release was also onAny, so the released host could differ from the acquired host, stranding stale locks. Now the deploy lock is acquired on EVERY (deduped) host, sequentially in config order with raise=true; on a partial failure the already-acquired locks are rolled back (never the contended host's — that lock belongs to the other deploy) and the per-host error surfaces as Wheels.Deploy.LockAcquireFailed naming the failing host. Release fans out to the exact hosts the lock was acquired on. The manual 'wheels deploy lock acquire/release/status' verbs follow the same fleet-wide semantics. SshPool.onAny itself is unchanged — the proxy boot check still legitimately uses it (Wave 2 scope). DEP-10 (#2957): LockCommands.acquire wrapped the whole symlink target in single quotes while claiming $(hostname)/$(date) were resolved by the remote shell — single quotes suppress command substitution, so lock metadata recorded the literal text. The target is now three concatenated shell words: shellEscape(user) + a double-quoted substitution segment + shellEscape(message), so the metadata expands while hostile metacharacters in user/message stay inert. Refs #2957 (Wave 1: DEP-1a, DEP-1b, DEP-10) Signed-off-by: Peter Amiri <peter@alurium.com>

wheels-bot

Wheels Bot — Reviewer

TL;DR: This PR closes the multi-host deploy-lock bypass (#2957 DEP-1/1b) by making acquisition all-or-nothing — sequential, config-ordered, raise=true, with rollback of only the already-acquired locks — and fixes DEP-10 so $(hostname)/$(date) actually expand in lock metadata while user/message stay shell-inert. I verified the load-bearing structure: the acquire at DeployMainCli.cfc:93 sits outside the inner try/finally, so a failed acquire never reaches the blanket release at line 121 and the contended host's lock is never touched (matching the spec assertion releaseHosts == ["10.0.0.1"]). The three-segment quoting in LockCommands.acquire is correct CFML ("""…""" emits one double-quoted shell word that joins with the single-quoted neighbors), and the specs cover hostile $(…) in both user and message. Test coverage is exemplary — red-first, happy path, partial-failure rollback, first-host contention, dedupe, and dry-run. Verdict: comment — two non-blocking findings below; nothing blocks merge.

Correctness

Non-blocking hardening suggestion — cli/lucli/services/deploy/cli/DeployMainCli.cfc:121:

$dispatch(lockHosts, lock.release(), dryRun, true);

The comment above it says release failures "can never shadow the original deploy exception", but allowFail=true only sets raise=false, which tolerates nonzero exit codes. A transport-level failure (dead connection/session on a host that died mid-deploy) still throws from SshPool.onEach — and onEach rethrows on any host failure, where the old onAny release threw only when all hosts failed. An exception escaping this finally replaces the in-flight deploy error. The PR already has the right pattern one screen down: $rollbackAcquiredLocks (DeployMainCli.cfc:511) wraps the identical onEach release in try/catch precisely so "a rollback failure must never shadow" the real error. Suggest wrapping this finally release the same way (a writeOutput warning on swallow would keep the operator informed). Same applies to the DeployLockCli twin if you want the mirrors in lockstep. Not blocking: the window is narrow (a host must die between acquire and release), and the shadowing error is usually that same host's failure anyway.

Docs

The behavior change makes three published guide pages wrong, and this PR (already labeled docs) doesn't touch them:

web/sites/guides/src/content/docs/v4-0-0/command-line-tools/commands/deploy/lock/acquire.mdx:3 — "Manually take the per-service deploy lock on one host." and :28 — "…on any single host." Manual acquire now fans out to every deduped host with rollback (DeployLockCli.cfc:32).
web/sites/guides/src/content/docs/v4-0-0/command-line-tools/commands/deploy/lock/release.mdx:33 — "Removes … on a single host. Lock state is per-service, stored at one well-known path — any host sees the same lock." That last clause is exactly the false mental model DEP-1 fixed: the lock file is per-host local /tmp, which is why single-host release stranded stale locks. Release now fans out (DeployLockCli.cfc:45).
web/sites/guides/src/content/docs/v4-0-0/command-line-tools/commands/deploy/lock/status.mdx:27 — "Reads … on a single host." Status now checks every host (DeployLockCli.cfc:58).

Please update these three pages (and their examples if they show single-host output) so the published docs match the shipped fleet-wide semantics.

Tests

No findings — noting for the record that the suite is the strongest part of the PR: DeployLockAcquisitionSpec.cfc pins the full DEP-1 contract (ordering, bracketing around docker pull, rollback scope, first-host abort with zero releases, shared-host dedupe, dry-run with zero pool calls), DeployLockCliSpec.cfc mirrors it for the manual verbs, and LockCommandsSpec.cfc covers both the expansion fix and the injection-resistance of user and message (pwn$(touch /tmp/pwned) stays single-quoted). The state struct in the catch blocks follows the BoxLang-safe pattern (Cross-Engine Invariant 11), and the closure captures (var c, releaseCmd) match the prior art in $dispatch.

Security

No findings. The DEP-10 quoting change was the riskiest part of the diff and it holds up: shellEscape (Base.cfc:43, single-quote with '\'' re-open) keeps operator-controlled user/message inert, the double-quoted middle segment is a fixed literal containing only the two intended substitutions, and the new specs lock both properties down.

…and the fleet Review findings on the all-or-nothing lock PR: allowFail only mapped to {raise: false} inside SshClient.run, which suppresses nonzero exit codes but not transport failures. - DeployLockCli release/status fanned out via SshPool.onEach, which pre-resolves a connection for every host before running anything, so one unreachable host aborted the whole verb with zero commands executed — turning the prescribed stale-lock recovery path into a dead-end exactly when a host died mid-deploy. Both verbs now dispatch per host with a per-host try/catch and report skipped hosts in the summary instead of throwing. - DeployMainCli's finally-block release claimed it could never shadow the original deploy exception, but a transport failure (dead cached connection, startSession throws inside the onEach task, rethrown from future.get) propagated out of the finally and replaced the in-flight deploy error. The release is now per-host best-effort; skipped hosts are logged, never thrown, and every healthy host is still released. - $rollbackAcquiredLocks in both CFCs is host-granular for the same reason: one dead host no longer stops the rollback from clearing the locks on the remaining healthy hosts. - FakeSshPool now mirrors the real pool's transport semantics so specs can exercise these paths: failConnection(host) models the eager connect throw (onEach aborts wholesale, sequential fails lazily, onAny skips to the next host) and a scripted transportError result throws from run() regardless of raise=false. Verified the new specs fail against the previous production code. Real SSH remains unverifiable in the harness — FakeSshPool mirrors the verified real-pool semantics (SshPool.onEach pre-resolve, SshClient.run transport throws) and the full CLI suite passes. Signed-off-by: Peter Amiri <peter@alurium.com>

Signed-off-by: Peter Amiri <petera@pai.com> # Conflicts: # cli/lucli/services/deploy/cli/DeployLockCli.cfc # cli/lucli/services/deploy/cli/DeployMainCli.cfc # cli/lucli/tests/specs/deploy/cli/DeployLockCliSpec.cfc # cli/lucli/tests/specs/deploy/cli/DeployMainCliSpec.cfc

Signed-off-by: Peter Amiri <petera@pai.com> # Conflicts: # cli/lucli/services/deploy/cli/DeployMainCli.cfc

bpamiri enabled auto-merge (squash) June 12, 2026 19:54

github-actions Bot added the docs label Jun 12, 2026

wheels-bot Bot reviewed Jun 12, 2026

View reviewed changes

bpamiri and others added 3 commits June 12, 2026 13:24

Merge remote-tracking branch 'origin/develop' into HEAD

8de1a75

Signed-off-by: Peter Amiri <petera@pai.com> # Conflicts: # cli/lucli/services/deploy/cli/DeployMainCli.cfc

bpamiri merged commit 5fb4450 into develop Jun 13, 2026
8 checks passed

bpamiri deleted the peter/deploy-w1-lock-correctness branch June 13, 2026 05:04

bpamiri mentioned this pull request Jun 13, 2026

review remediation: deploy CLI orchestration correctness — multi-host lock bypass, fresh-host bootstrap, discarded stdout/audit log, flat-alias ssh config, roll-up defects #2957

Closed

23 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(cli): make the deploy lock all-or-nothing across hosts and expand its metadata#3163

fix(cli): make the deploy lock all-or-nothing across hosts and expand its metadata#3163
bpamiri merged 4 commits into
developfrom
peter/deploy-w1-lock-correctness

bpamiri commented Jun 12, 2026

Uh oh!

wheels-bot Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

bpamiri commented Jun 12, 2026

Summary

DEP-1 — multi-host lock bypass

DEP-10 — lock metadata never expanded

Testing

Uh oh!

wheels-bot Bot left a comment

Choose a reason for hiding this comment

Wheels Bot — Reviewer

Correctness

Docs

Tests

Security

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant