You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
phase/uninstall_mke: fall back to forced swarm dissolution on timeout
The uninstall-ucp bootstrapper deploys ucp-uninstall-agent as a global
Swarm service, then waits (~2 min hardcoded) for every node to report
back. On large or mixed-OS clusters with cold image caches this deadline
is missed, causing Reset() to fail even though the infrastructure will
be torn down by terraform destroy anyway.
Observed in CI:
smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
smoke-windows (MKE 3.8.8, Win2025): Win2025 node missed the deadline
MKE itself documents the recovery path when this happens:
1. Remove the stuck ucp-uninstall-agent service.
2. Force every node to leave the swarm.
Implement that as an automatic fallback inside UninstallMKE.Run():
- isUninstallTimeout() detects the specific 'Uninstalling UCP took too
long' fatal line that Bootstrap surfaces from the MKE container.
- dissolveSwarm() removes the stuck service (best-effort), then forces
all non-leader nodes to leave in parallel, then forces the leader to
leave last. Per-node failures are logged as warnings so that a single
unresponsive host does not block the rest.
Other uninstall-ucp errors (connection failures, image pull errors, etc.)
are still returned as hard failures unchanged.
0 commit comments