Skip to content

Commit 5fb4450

Browse files
bpamiriPeter Amiri
andauthored
fix(cli): make the deploy lock all-or-nothing across hosts and expand its metadata (#3163)
* fix(cli): make the deploy lock all-or-nothing across hosts and expand its metadata DEP-1 (#2957): lock acquire/release went through $dispatchAny -> SshPool.onAny, which is first-success-wins and swallows per-host failures. On any multi-host fleet, contention on host 1 raised, onAny caught it, acquired a fresh lock on host 2, and the concurrent deploy proceeded — mutual exclusion held only for single-host configs. Release was also onAny, so the released host could differ from the acquired host, stranding stale locks. Now the deploy lock is acquired on EVERY (deduped) host, sequentially in config order with raise=true; on a partial failure the already-acquired locks are rolled back (never the contended host's — that lock belongs to the other deploy) and the per-host error surfaces as Wheels.Deploy.LockAcquireFailed naming the failing host. Release fans out to the exact hosts the lock was acquired on. The manual 'wheels deploy lock acquire/release/status' verbs follow the same fleet-wide semantics. SshPool.onAny itself is unchanged — the proxy boot check still legitimately uses it (Wave 2 scope). DEP-10 (#2957): LockCommands.acquire wrapped the whole symlink target in single quotes while claiming $(hostname)/$(date) were resolved by the remote shell — single quotes suppress command substitution, so lock metadata recorded the literal text. The target is now three concatenated shell words: shellEscape(user) + a double-quoted substitution segment + shellEscape(message), so the metadata expands while hostile metacharacters in user/message stay inert. Refs #2957 (Wave 1: DEP-1a, DEP-1b, DEP-10) Signed-off-by: Peter Amiri <peter@alurium.com> * fix(cli): per-host best-effort lock release so a dead host cannot strand the fleet Review findings on the all-or-nothing lock PR: allowFail only mapped to {raise: false} inside SshClient.run, which suppresses nonzero exit codes but not transport failures. - DeployLockCli release/status fanned out via SshPool.onEach, which pre-resolves a connection for every host before running anything, so one unreachable host aborted the whole verb with zero commands executed — turning the prescribed stale-lock recovery path into a dead-end exactly when a host died mid-deploy. Both verbs now dispatch per host with a per-host try/catch and report skipped hosts in the summary instead of throwing. - DeployMainCli's finally-block release claimed it could never shadow the original deploy exception, but a transport failure (dead cached connection, startSession throws inside the onEach task, rethrown from future.get) propagated out of the finally and replaced the in-flight deploy error. The release is now per-host best-effort; skipped hosts are logged, never thrown, and every healthy host is still released. - $rollbackAcquiredLocks in both CFCs is host-granular for the same reason: one dead host no longer stops the rollback from clearing the locks on the remaining healthy hosts. - FakeSshPool now mirrors the real pool's transport semantics so specs can exercise these paths: failConnection(host) models the eager connect throw (onEach aborts wholesale, sequential fails lazily, onAny skips to the next host) and a scripted transportError result throws from run() regardless of raise=false. Verified the new specs fail against the previous production code. Real SSH remains unverifiable in the harness — FakeSshPool mirrors the verified real-pool semantics (SshPool.onEach pre-resolve, SshClient.run transport throws) and the full CLI suite passes. Signed-off-by: Peter Amiri <peter@alurium.com> --------- Signed-off-by: Peter Amiri <peter@alurium.com> Co-authored-by: Peter Amiri <petera@pai.com>
1 parent 986e68a commit 5fb4450

12 files changed

Lines changed: 928 additions & 53 deletions

File tree

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
- `wheels deploy` lock acquisition is now all-or-nothing across the fleet: the lock is acquired on every (deduped) host sequentially in config order with failures surfaced, already-acquired locks are rolled back on a partial failure (`Wheels.Deploy.LockAcquireFailed` names the contended host; the contended host's own lock is never touched), and release fans out to every acquired host. Previously the first-success-wins `onAny` dispatch swallowed contention on one host and silently re-acquired on another, so concurrent deploys were only mutually excluded on single-host configs — and release could target a different host than acquire, stranding stale locks. The manual `wheels deploy lock acquire/release/status` verbs follow the same fleet-wide semantics (#2957)
2+
- Deploy lock metadata now actually expands `$(hostname)` and `$(date --iso-8601=seconds)` on the remote: the symlink target double-quotes the substitution segment while keeping the user and message inert via `shellEscape` single-quoting — previously the whole target was single-quoted, which suppressed command substitution and recorded the literal `$(hostname)` text (#2957)

cli/lucli/services/deploy/cli/DeployLockCli.cfc

Lines changed: 184 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,10 @@ component {
2626
user: $currentUser(),
2727
message: arguments.opts.message ?: "manual acquire"
2828
});
29-
$dispatchAny($allHosts(cfg), cmd, dryRun);
29+
// The deploy flow holds the lock on EVERY host (##2957 DEP-1), so a
30+
// manual acquire must match — locking a single host would let a
31+
// concurrent deploy that probes another host proceed.
32+
$acquireLockAllOrNothing($uniqueHosts($allHosts(cfg)), cmd, lock, dryRun);
3033
return $renderResult(arguments.opts, "Acquired deploy lock for " & cfg.service());
3134
}
3235

@@ -37,8 +40,17 @@ component {
3740
var lock = new modules.wheels.services.deploy.commands.LockCommands(cfg);
3841
// rm -f is idempotent; surfacing a failure here only obscures the
3942
// operator's intent ("clear the lock if it's there"). #2696.
40-
$dispatchAny($allHosts(cfg), lock.release(), dryRun, true);
41-
return $renderResult(arguments.opts, "Released deploy lock for " & cfg.service());
43+
// Fan out to every host — the lock lives fleet-wide (##2957 DEP-1),
44+
// so clearing one host would strand stale locks on the rest — but
45+
// per host, best-effort: fleet-wide stale locks most plausibly
46+
// exist BECAUSE a host died mid-deploy, so the recovery path must
47+
// keep working around an unreachable host instead of aborting on it.
48+
var failed = $dispatchPerHostTolerant($uniqueHosts($allHosts(cfg)), lock.release(), dryRun);
49+
return $renderResult(
50+
arguments.opts,
51+
"Released deploy lock for " & cfg.service()
52+
& $skippedHostsSuffix(failed, "the lock was NOT released there; re-run 'wheels deploy lock release' when the host is back")
53+
);
4254
}
4355

4456
public string function status(required struct opts) {
@@ -48,12 +60,16 @@ component {
4860
var lock = new modules.wheels.services.deploy.commands.LockCommands(cfg);
4961
// readlink exits nonzero when the lock file is missing — which is
5062
// exactly what the operator wants to learn from `status`. Treat that
51-
// as advisory output, not a thrown error. #2696.
52-
// #2957 DEP-6a: surface readlink's output (the lock holder on stdout,
53-
// or the "No such file" diagnostic on stderr) instead of dropping it.
54-
var lines = $dispatchAnyCollect($allHosts(cfg), lock.status(), dryRun, true);
63+
// as advisory output, not a thrown error. #2696. Checked on every
64+
// host since the lock lives fleet-wide (##2957 DEP-1) — and the same
65+
// advisory contract covers an unreachable host: report it, don't throw.
66+
// #2957 DEP-6a: also surface readlink's output (the lock holder on
67+
// stdout, or the "No such file" diagnostic on stderr) per host instead
68+
// of dropping it.
69+
var collected = $collectPerHostTolerant($uniqueHosts($allHosts(cfg)), lock.status(), dryRun);
5570
var summary = "Checked deploy lock status for " & cfg.service();
56-
if (arrayLen(lines)) summary &= chr(10) & arrayToList(lines, chr(10));
71+
if (arrayLen(collected.lines)) summary &= chr(10) & arrayToList(collected.lines, chr(10));
72+
summary &= $skippedHostsSuffix(collected.failed, "the lock state there is unknown");
5773
return $renderResult(arguments.opts, summary);
5874
}
5975

@@ -77,47 +93,181 @@ component {
7793
return out;
7894
}
7995

80-
private void function $dispatchAny(required array hosts, required string cmd, required boolean dryRun, boolean allowFail = false) {
96+
/**
97+
* All-or-nothing lock acquisition across every host, in config order,
98+
* with rollback of already-acquired locks on the first failure.
99+
*
100+
* MIRROR: DeployMainCli.$acquireLockAllOrNothing is the deploy-flow
101+
* twin of this contract (##2957 DEP-1) — keep them in lockstep.
102+
*/
103+
private void function $acquireLockAllOrNothing(
104+
required array hosts,
105+
required string acquireCmd,
106+
required any lock,
107+
required boolean dryRun
108+
) {
81109
if (arguments.dryRun) {
82-
if (arrayLen(arguments.hosts)) {
83-
arrayAppend(variables.dryRunBuffer, "[" & arguments.hosts[1] & "] " & arguments.cmd);
110+
for (var h in arguments.hosts) {
111+
arrayAppend(variables.dryRunBuffer, "[" & h & "] " & arguments.acquireCmd);
84112
}
85113
return;
86114
}
87-
// Lock ops target just one host (the lock file lives on one path; any host works).
88-
// #2696: acquire stays strict (contention should surface); release/status tolerate.
89-
var c = arguments.cmd;
90-
var doRaise = !arguments.allowFail;
91-
variables.sshPool.onAny(arguments.hosts, function(ssh, host) { ssh.run(c, {raise: doRaise}); });
115+
var c = arguments.acquireCmd;
116+
// Shared struct so the callback can record progress — closures can't
117+
// reliably mutate outer scalars across engines (anti-pattern ##10).
118+
var state = {acquired: [], lastHost: ""};
119+
try {
120+
variables.sshPool.sequential(arguments.hosts, function(ssh, host) {
121+
state.lastHost = host;
122+
ssh.run(c, {raise: true});
123+
arrayAppend(state.acquired, host);
124+
});
125+
} catch (any e) {
126+
$rollbackAcquiredLocks(state.acquired, arguments.lock);
127+
throw(
128+
type = "Wheels.Deploy.LockAcquireFailed",
129+
message = "Could not acquire the deploy lock on " & state.lastHost
130+
& " — another deploy may hold it. Rolled back "
131+
& arrayLen(state.acquired) & " already-acquired lock(s). "
132+
& "Inspect with 'wheels deploy lock status'; clear a stale lock with "
133+
& "'wheels deploy lock release'. Cause: " & e.message,
134+
detail = e.detail ?: ""
135+
);
136+
}
137+
}
138+
139+
/**
140+
* Best-effort release of the locks a partially-failed acquire already
141+
* placed. A rollback failure must never shadow the LockAcquireFailed
142+
* the caller is about to throw. Host-granular: one unreachable host
143+
* must not stop the rollback from clearing the remaining healthy hosts.
144+
*/
145+
private void function $rollbackAcquiredLocks(required array hosts, required any lock) {
146+
if (!arrayLen(arguments.hosts)) return;
147+
// $dispatchPerHostTolerant never throws — per-host failures are
148+
// swallowed deliberately; the acquire error is the one the operator
149+
// needs to see.
150+
$dispatchPerHostTolerant(arguments.hosts, arguments.lock.release(), false);
151+
}
152+
153+
/** Order-preserving dedupe — a host serving several roles appears once. */
154+
private array function $uniqueHosts(required array hosts) {
155+
var seen = {};
156+
var out = [];
157+
for (var h in arguments.hosts) {
158+
if (!structKeyExists(seen, h)) {
159+
seen[h] = true;
160+
arrayAppend(out, h);
161+
}
162+
}
163+
return out;
92164
}
93165

94166
/**
95-
* Like $dispatchAny, but returns the remote output host-prefixed
96-
* (`[host] line`). Stdout wins; stderr is the fallback so tolerated
97-
* failures (allowFail) still surface their diagnostic. #2957 DEP-6a.
167+
* Per-host best-effort dispatch. allowFail-style onEach is NOT enough
168+
* for tolerant fan-out: the real SshPool.onEach pre-resolves a
169+
* connection for EVERY host before submitting any task, so a single
170+
* unreachable host throws before the command runs anywhere, and a
171+
* transport failure inside a task (dead cached connection) is rethrown
172+
* from future.get() regardless of {raise: false}. Dispatching each host
173+
* in its own sequential([host]) call with a per-host try/catch confines
174+
* every failure mode — connect and transport alike — to its host.
175+
*
176+
* @return array of {host, message} structs for hosts that failed.
177+
*
178+
* MIRROR: DeployMainCli.$dispatchPerHostTolerant is the deploy-flow
179+
* twin of this helper — keep them in lockstep.
98180
*/
99-
private array function $dispatchAnyCollect(required array hosts, required string cmd, required boolean dryRun, boolean allowFail = false) {
181+
private array function $dispatchPerHostTolerant(
182+
required array hosts,
183+
required string cmd,
184+
required boolean dryRun
185+
) {
100186
if (arguments.dryRun) {
101-
if (arrayLen(arguments.hosts)) {
102-
arrayAppend(variables.dryRunBuffer, "[" & arguments.hosts[1] & "] " & arguments.cmd);
187+
for (var h in arguments.hosts) {
188+
arrayAppend(variables.dryRunBuffer, "[" & h & "] " & arguments.cmd);
103189
}
104190
return [];
105191
}
106192
var c = arguments.cmd;
107-
var doRaise = !arguments.allowFail;
193+
var failed = [];
194+
for (var h in arguments.hosts) {
195+
try {
196+
variables.sshPool.sequential([h], function(ssh, host) {
197+
ssh.run(c, {raise: false});
198+
});
199+
} catch (any e) {
200+
arrayAppend(failed, {host: h, message: e.message});
201+
}
202+
}
203+
return failed;
204+
}
205+
206+
/**
207+
* Render the unreachable-host warning appended to a verb's summary.
208+
* Empty string when nothing failed.
209+
*/
210+
private string function $skippedHostsSuffix(required array failed, required string consequence) {
211+
if (!arrayLen(arguments.failed)) return "";
212+
var parts = [];
213+
for (var f in arguments.failed) {
214+
arrayAppend(parts, f.host & " (" & f.message & ")");
215+
}
216+
return chr(10) & "WARNING: skipped " & arrayLen(arguments.failed)
217+
& " unreachable host(s): " & arrayToList(parts, "; ")
218+
& "" & arguments.consequence & ".";
219+
}
220+
221+
/**
222+
* Per-host best-effort dispatch that ALSO collects the remote output,
223+
* host-prefixed (`[host] line`). Combines two ##2957 contracts that the
224+
* `lock status` verb needs at once:
225+
* - DEP-1: read the lock on EVERY host (the lock lives fleet-wide), and
226+
* tolerate an unreachable host — report it, don't throw. Each host
227+
* runs in its own sequential([host]) with a per-host try/catch so a
228+
* dead connect or a dead cached session is confined to that host (see
229+
* $dispatchPerHostTolerant for why allowFail-style onEach/onAny is not
230+
* enough).
231+
* - DEP-6a: surface what readlink actually said. Stdout wins (the lock
232+
* holder); stderr is the fallback so the "No such file" diagnostic on
233+
* an unheld lock still reaches the operator. The command is run with
234+
* {raise: false} so a nonzero exit (no lock held) is advisory output,
235+
* not a thrown error.
236+
*
237+
* @return struct {lines: array of "[host] line", failed: array of
238+
* {host, message} for hosts that were unreachable}.
239+
*/
240+
private struct function $collectPerHostTolerant(
241+
required array hosts,
242+
required string cmd,
243+
required boolean dryRun
244+
) {
245+
if (arguments.dryRun) {
246+
for (var h in arguments.hosts) {
247+
arrayAppend(variables.dryRunBuffer, "[" & h & "] " & arguments.cmd);
248+
}
249+
return {lines: [], failed: []};
250+
}
251+
var c = arguments.cmd;
108252
// Closures can't write outer locals reliably — collect via a shared struct.
109-
var ctx = {lines: []};
110-
variables.sshPool.onAny(arguments.hosts, function(ssh, host) {
111-
var res = ssh.run(c, {raise: doRaise});
112-
var text = trim(res.stdout ?: "");
113-
if (!len(text)) text = trim(res.stderr ?: "");
114-
if (!len(text)) return;
115-
text = replace(text, chr(13), "", "all");
116-
for (var line in listToArray(text, chr(10))) {
117-
arrayAppend(ctx.lines, "[" & host & "] " & line);
253+
var ctx = {lines: [], failed: []};
254+
for (var h in arguments.hosts) {
255+
try {
256+
variables.sshPool.sequential([h], function(ssh, host) {
257+
var res = ssh.run(c, {raise: false});
258+
var text = trim(res.stdout ?: "");
259+
if (!len(text)) text = trim(res.stderr ?: "");
260+
if (!len(text)) return;
261+
text = replace(text, chr(13), "", "all");
262+
for (var line in listToArray(text, chr(10))) {
263+
arrayAppend(ctx.lines, "[" & host & "] " & line);
264+
}
265+
});
266+
} catch (any e) {
267+
arrayAppend(ctx.failed, {host: h, message: e.message});
118268
}
119-
});
120-
return ctx.lines;
269+
}
270+
return ctx;
121271
}
122272

123273
private string function $currentUser() {

0 commit comments

Comments
 (0)