Commit 79aaded

vakwetu and claude authored and committed
[multiple] Fix MCO stuck-uncordon deadlock
MachineConfigs applied during devscripts install trigger an MCO update cycle that runs asynchronously after the cluster becomes reachable. On compact 3-master clusters the MCO controller can enter a permanent deadlock: all nodes reboot, apply the new config, and report state=Done with desiredDrain=lastAppliedDrain=uncordon-*, but the controller never issues the final kubectl uncordon. This leaves all nodes SchedulingDisabled indefinitely, causing every subsequent cluster operator to degrade and the deployment to time out.

Add a retry loop in wait_for_cluster.yml (run as part of the openshift_adm 'stable' operation after devscripts post-install) that:
- Polls MachineConfigPool status every 30 s for up to 30 minutes.
- If a pool is updating normally (nodes being drained/rebooted in sequence), waits without interrupting the MCO mid-cycle.
- If it detects the stuck state (updatedMachineCount == machineCount but readyMachineCount == 0), runs 'oc adm uncordon' on all nodes to break the deadlock, then continues polling.
- Only proceeds to 'oc adm wait-for-stable-cluster' once all pools report Updated=True.

Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
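The detection rules described in this commit message can be sketched as a small pure function. This is a minimal illustration and not the committed code: `classify_pool`, `make_pool`, and `SAMPLE_POOLS` are hypothetical names introduced here, and the sample pools are invented. The field names (`machineCount`, `updatedMachineCount`, `readyMachineCount`, and the `Updating` condition) are the ones read by the task in the diff. Unlike the committed script, which evaluates the two checks in separate passes over all pools, this sketch folds them into one per-pool decision.

```python
def classify_pool(pool):
    """Classify one MachineConfigPool dict as 'not_updating', 'stuck', or 'updating'."""
    status = pool.get("status", {})
    # A pool counts as updating only if its 'Updating' condition is the string 'True'.
    updating = any(
        c.get("type") == "Updating" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    if not updating:
        return "not_updating"
    total = status.get("machineCount", 0)
    updated = status.get("updatedMachineCount", 0)
    ready = status.get("readyMachineCount", 0)
    # Stuck signature from the commit: every machine updated, none ready.
    if total > 0 and updated == total and ready == 0:
        return "stuck"
    return "updating"


def make_pool(name, updating, total, updated, ready):
    """Build a minimal pool dict (hypothetical helper, for illustration only)."""
    return {
        "metadata": {"name": name},
        "status": {
            "conditions": [
                {"type": "Updating", "status": "True" if updating else "False"}
            ],
            "machineCount": total,
            "updatedMachineCount": updated,
            "readyMachineCount": ready,
        },
    }


# Invented sample data covering the three outcomes.
SAMPLE_POOLS = [
    make_pool("master", True, 3, 3, 0),
    make_pool("worker", True, 3, 1, 2),
    make_pool("infra", False, 0, 0, 0),
]

if __name__ == "__main__":
    for pool in SAMPLE_POOLS:
        print(pool["metadata"]["name"], classify_pool(pool))
```

Keeping the decision in a pure function makes the stuck signature easy to unit-test against captured `oc get mcp -o json` output without touching a cluster.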
1 parent e15b394 commit 79aaded

1 file changed

Lines changed: 67 additions & 0 deletions

roles/openshift_adm/tasks/wait_for_cluster.yml

@@ -50,6 +50,73 @@
   retries: "{{ cifmw_openshift_adm_retry_count }}"
   delay: 30

+# MachineConfigs applied during devscripts install (e.g. iSCSI, Cinder LVM)
+# trigger an MCO update cycle that continues asynchronously after the cluster
+# is first reachable. On compact (3-master) clusters the MCO controller can
+# get stuck: all nodes reboot and report state=Done / desiredDrain=uncordon-*,
+# but the controller never issues the final kubectl-uncordon, leaving every
+# node SchedulingDisabled indefinitely. We handle this with a loop that:
+#   1. Waits until no MCP is mid-update (unavailableMachineCount drops to 0)
+#      OR detects the stuck state (all updated, none ready).
+#   2. If stuck, uncordons all nodes to break the deadlock.
+#   3. Repeats until all MCPs report Updated=True.
+- name: Wait for MachineConfigPools to complete, fixing stuck cordons if needed.
+  when:
+    - not cifmw_openshift_adm_dry_run
+  environment:
+    KUBECONFIG: "{{ cifmw_openshift_kubeconfig }}"
+    PATH: "{{ cifmw_path }}"
+  ansible.builtin.shell: |
+    set -eo pipefail
+    MCP_JSON=$(oc get mcp -o json)
+
+    UPDATING=$(echo "$MCP_JSON" | \
+      python3 -c "
+    import json, sys
+    data = json.load(sys.stdin)
+    updating = [
+        i['metadata']['name'] for i in data['items']
+        if next((c['status'] for c in i['status'].get('conditions', [])
+                 if c['type'] == 'Updating'), 'False') == 'True'
+    ]
+    print('\n'.join(updating))
+    ")
+
+    if [ -z "$UPDATING" ]; then
+      echo "All MCPs are up to date."
+      exit 0
+    fi
+
+    # At least one MCP is still Updating. Check for the stuck-uncordon case:
+    # updatedMachineCount == machineCount but readyMachineCount == 0.
+    STUCK=$(echo "$MCP_JSON" | \
+      python3 -c "
+    import json, sys
+    data = json.load(sys.stdin)
+    stuck = [
+        i['metadata']['name'] for i in data['items']
+        if (i['status'].get('updatedMachineCount', 0) ==
+            i['status'].get('machineCount', 0) and
+            i['status'].get('readyMachineCount', 0) == 0 and
+            i['status'].get('machineCount', 0) > 0)
+    ]
+    print('\n'.join(stuck))
+    ")
+
+    if [ -n "$STUCK" ]; then
+      echo "Stuck MCPs detected: $STUCK -- uncordoning all nodes to break deadlock."
+      oc adm uncordon $(oc get nodes -o jsonpath='{.items[*].metadata.name}')
+    else
+      echo "MCPs still updating (normal progress): $UPDATING"
+    fi
+    exit 1
+  register: _mcp_wait
+  until: _mcp_wait.rc == 0
+  retries: 60
+  delay: 30
+  changed_when: "'uncordoning' in _mcp_wait.stdout"
+  failed_when: false
+
 - name: Check for pending certificate approval.
   when:
     - _openshift_adm_check_cert_approve | default(false) | bool
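
The control flow of the new task (poll, uncordon if stuck, wait, repeat) can be simulated without a cluster. The sketch below is an assumption-laden illustration, not the Ansible implementation: `poll_until_settled`, `check`, `fix`, and the state strings are hypothetical names, and the loop treats `retries` as a plain attempt limit rather than reproducing Ansible's exact `until`/`retries` accounting. The defaults of 60 attempts and a 30-second delay mirror the values in the task.

```python
import time


def poll_until_settled(check, fix, retries=60, delay=30, sleep=time.sleep):
    """Poll check() up to `retries` times, calling fix() whenever it reports 'stuck'.

    Returns (True, attempt) once check() returns 'settled',
    or (False, retries) if the attempt limit is reached first.
    """
    for attempt in range(1, retries + 1):
        state = check()
        if state == "settled":
            return True, attempt
        if state == "stuck":
            # In the real task this step is 'oc adm uncordon' on all nodes.
            fix()
        if attempt < retries:
            sleep(delay)
    return False, retries


def simulate(states, retries=60, delay=30):
    """Run the loop against a scripted list of states; no cluster or real sleeping."""
    remaining = iter(states)
    fixes = []
    sleeps = []
    result = poll_until_settled(
        check=lambda: next(remaining),
        fix=lambda: fixes.append("uncordon"),
        retries=retries,
        delay=delay,
        sleep=sleeps.append,  # record the requested delay instead of sleeping
    )
    return result, len(fixes), sum(sleeps)


if __name__ == "__main__":
    # Invented scenario: one normal poll, one stuck poll, one more normal poll, then done.
    print(simulate(["updating", "stuck", "updating", "settled"]))
```

Injecting `check`, `fix`, and `sleep` as callables keeps the loop testable; a real caller would pass functions that shell out to `oc`.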
