Commit 023f909

Merge pull request #112828 from bjahagir-OpenShift/bjahagir-OSDOCS-18856-cquorum
[OSDOCS#18856]: Implemented manually restoring cquorum feature for 4.22
2 parents 7bb88cd + 82e6641 commit 023f909

1 file changed

Lines changed: 239 additions & 19 deletions

modules/installation-manual-recovering-when-auto-recovery-is-unavail.adoc

@@ -1,4 +1,4 @@
1-
/Module included in the following assemblies:
1+
//Module included in the following assemblies:
22
//
33
// *installing/installing_two_node_cluster/installing_tnf/install-post-tnf.adoc
44

@@ -7,23 +7,19 @@
77
= Manually recovering from a disruption event when automated recovery is unavailable
88

99
[role="_abstract"]
10-
You might need to perform manual recovery steps if a disruption event prevents fencing from functioning correctly. In this case, you can run commands directly on the control plane nodes to recover the cluster. There are four main recovery scenarios, which should be attempted in the following order:
10+
You might need to perform manual recovery steps if a disruption event prevents fencing from functioning correctly. In this case, you can run commands directly on the control plane nodes to recover the cluster. There are five main recovery scenarios, which should be attempted in the following order:
1111

1212
. Update fencing secrets: Refresh the Baseboard Management Controller (BMC) credentials if they are incorrect or outdated.
1313
. Recover from a single-node failure: Restore functionality when only one control plane node is down.
14-
. Recover from a complete node failure: Restore functionality when both control plane nodes are down.
14+
. Recover from dual node power loss: Restore functionality when both control plane nodes are down and can be restarted.
15+
. Restore corosync quorum after dual node power loss: Restore corosync quorum when both control plane nodes lost power but only one node can be restarted.
1516
. Replace a control plane node that cannot be recovered: Replace the node to restore cluster functionality.
1617

1718
.Prerequisites
1819

1920
* You have administrative access to the control plane nodes.
2021
* You can connect to the nodes by using SSH.
2122
22-
[NOTE]
23-
====
24-
Do an etcd backup before proceeding to ensure that you can restore the cluster if any issues occur.
25-
====
26-
2723
.Procedure
2824

2925
. Update the fencing secrets:
@@ -120,21 +116,21 @@ This command provides a detailed view of the current cluster and resource states
120116
121117
.. Run the following additional diagnostic commands, if necessary:
122118
+
123-
Reset the resources on your cluster and instruct Pacemaker to attempt to start them fresh by running the following command:
119+
Reset the resources on your cluster by running the following command:
124120
+
125121
[source,terminal]
126122
----
127123
$ sudo pcs resource cleanup
128124
----
129-
+
130-
Review all Pacemaker activity on the node by running the following command:
125+
126+
.. Review all Pacemaker activity on the node by running the following command:
131127
+
132128
[source,terminal]
133129
----
134130
$ sudo journalctl -u pacemaker
135131
----
136132
+
137-
Diagnose etcd resource startup issues by running the following command:
133+
.. Diagnose etcd resource startup issues by running the following command:
138134
+
139135
[source,terminal]
140136
----
@@ -149,6 +145,24 @@ $ sudo pcs stonith config <node_name>_redfish
149145
----
150146
+
151147
If fencing is required but is not functioning, ensure that the Redfish fencing endpoint is accessible and verify that the credentials are correct.
148+
+
149+
If the failed node is permanently inaccessible and automated fencing cannot function, verify that the failed node meets all of the following conditions before you manually confirm fencing:
150+
151+
* The node is powered off and cannot be restarted.
152+
* The node cannot access any shared storage or cluster resources.
153+
* The node is completely isolated from the cluster network.
154+
155+
.. Confirm the node is fenced by running the following command:
156+
+
157+
[source,terminal]
158+
----
159+
$ sudo pcs stonith confirm <failed_node_name>
160+
----
161+
+
162+
[WARNING]
163+
====
164+
If the failed node is accessible or can access shared resources, confirming fencing can cause data corruption and cluster failure.
165+
====
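The checklist above can be enforced with a small guard that runs before the confirmation command. The following is a minimal sketch and not part of the documented procedure: the function name `confirm_node_is_dead` is hypothetical, and the guard only records the operator's answers. It cannot verify the power or network state of the node by itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of pcs): ask the three checklist questions
# and succeed only if every answer read from stdin is exactly "yes".
confirm_node_is_dead() {
  node=$1
  for question in \
    "Is $node powered off and unable to restart?" \
    "Is $node cut off from all shared storage and cluster resources?" \
    "Is $node completely isolated from the cluster network?"
  do
    printf '%s [yes/no] ' "$question" >&2
    read -r answer || return 1        # end of input counts as "no"
    [ "$answer" = "yes" ] || return 1 # any other answer aborts
  done
  return 0
}

# Example use (master-1 is a placeholder node name):
#   confirm_node_is_dead master-1 && sudo pcs stonith confirm master-1
```

Because the guard stops at the first answer that is not `yes`, the confirmation command is never reached unless all three conditions are affirmed.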
152166
153167
.. If etcd is not starting despite fencing being operational, restore etcd from a backup by running the following commands:
154168
+
@@ -165,9 +179,10 @@ $ sudo chown -R etcd:etcd /var/lib/etcd
165179
If the recovery is successful, no further action is required. If the issue persists, proceed to the next step.
166180
--
167181
168-
. Recover from a complete node failure:
182+
. Recover from dual node power loss where both nodes are recoverable:
169183
+
170-
--
184+
This procedure applies when both control plane nodes lost power and both nodes can be restarted. If only one node can be restarted, proceed to step 4.
185+
171186
.. Power on both control plane nodes.
172187
+
173188
Pacemaker starts automatically and begins the recovery operation when it detects both nodes are online. If the recovery does not start as expected, use the diagnostic commands described in the previous step to investigate the issue.
@@ -208,17 +223,222 @@ $ sudo pcs stonith config
208223
----
209224
+
210225
If the recovery is successful, no further action is required. If the issue persists, perform manual recovery as described in the next step.
211-
--
226+
227+
228+
. Restore corosync quorum after dual node power loss (single node recoverable):
229+
+
230+
231+
This procedure applies when both control plane nodes lost power and only one node can be restarted. In this scenario, the cluster has lost corosync quorum because the last known state showed both nodes were online before the power loss.
232+
+
233+
[IMPORTANT]
234+
====
235+
Perform this procedure only when both of the following conditions are met:
236+
237+
- Both control plane nodes lost power
238+
- Only one control plane node can be restarted
239+
240+
====
241+
+
242+
This scenario typically occurs when you need to replace a control plane node (one node is not recoverable) and the surviving node lost power before the replacement procedure.
243+
244+
.. Verify that only one node is online by running the following command on the surviving node:
245+
+
246+
[source,terminal]
247+
----
248+
$ sudo pcs status --full
249+
----
250+
+
251+
The output shows only one node online. The sample output is as follows:
252+
+
253+
[source,terminal]
254+
----
255+
Cluster name: TNF
256+
Cluster Summary:
257+
* Stack: corosync (Pacemaker is running)
258+
* Current DC: NONE
259+
* Last updated: Wed Apr 29 16:21:17 2026 on master-0.ostest.test.metalkube.org
260+
* Last change: Wed Apr 29 16:19:25 2026 by root via root on master-1.ostest.test.metalkube.org
261+
* 2 nodes configured
262+
* 6 resource instances configured
263+
264+
Node List:
265+
* Node master-0.ostest.test.metalkube.org (1): UNCLEAN (offline)
266+
* Node master-1.ostest.test.metalkube.org (2): UNCLEAN (offline)
267+
268+
Full List of Resources:
269+
* Clone Set: kubelet-clone [kubelet]:
270+
* kubelet (systemd:kubelet): Stopped
271+
* kubelet (systemd:kubelet): Stopped
272+
* master-0.ostest.test.metalkube.org_redfish (stonith:fence_redfish): Stopped
273+
* master-1.ostest.test.metalkube.org_redfish (stonith:fence_redfish): Stopped
274+
* Clone Set: etcd-clone [etcd]:
275+
* etcd (ocf:heartbeat:podman-etcd): Stopped
276+
* etcd (ocf:heartbeat:podman-etcd): Stopped
277+
278+
Tickets:
279+
280+
PCSD Status:
281+
master-0.ostest.test.metalkube.org: Online
282+
master-1.ostest.test.metalkube.org: Offline
283+
284+
Daemon Status:
285+
corosync: active/enabled
286+
pacemaker: active/enabled
287+
pcsd: active/enabled
288+
----
289+
+
290+
In the `PCSD Status` section, the `master-0` node is `Online` and the `master-1` node is `Offline`.
291+
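In the `Node List` section, both nodes are reported as `UNCLEAN (offline)` because the cluster does not have quorum. The following sample output of the `pcs quorum status --debug` command confirms that the surviving node is not quorate: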
292+
+
293+
[source,terminal]
294+
----
295+
[core@master-0 ~]$ sudo pcs quorum status --debug
296+
Running: /usr/sbin/corosync-quorumtool -p
297+
Environment:
298+
LC_ALL=C
299+
300+
Finished running: /usr/sbin/corosync-quorumtool -p
301+
Return value: 2
302+
--Debug Stdout Start--
303+
Quorum information
304+
------------------
305+
Date: Wed Apr 29 16:25:55 2026
306+
Quorum provider: corosync_votequorum
307+
Nodes: 1
308+
Node ID: 1
309+
Ring ID: 1.e
310+
Quorate: No
311+
312+
Votequorum information
313+
----------------------
314+
Expected votes: 2
315+
Highest expected: 2
316+
Total votes: 1
317+
Quorum: 1 Activity blocked
318+
Flags: 2Node WaitForAll
319+
320+
Membership information
321+
----------------------
322+
Nodeid Votes Qdevice Name
323+
1 1 NR master-0.ostest.test.metalkube.org (local)
324+
325+
--Debug Stdout End--
326+
--Debug Stderr Start--
327+
328+
--Debug Stderr End--
329+
330+
Error: Unable to get quorum status:
331+
----
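To check the `PCSD Status` section of the `pcs status --full` output from this step without reading it by eye, the section can be filtered with `awk`. This is a minimal sketch under stated assumptions: the function name `pcsd_online_nodes` is hypothetical, and the parsing assumes the layout shown in the sample output (a `PCSD Status:` heading, one `<node>: Online` or `<node>: Offline` line per node, and a blank line after the section).

```shell
#!/bin/sh
# Print the names of nodes listed as "Online" in the "PCSD Status" section
# of `pcs status --full` output read from stdin.
pcsd_online_nodes() {
  awk '
    /^PCSD Status:/    { in_pcsd = 1; next }  # section heading found
    in_pcsd && NF == 0 { in_pcsd = 0 }        # a blank line ends the section
    in_pcsd && $NF == "Online" {
      name = $1
      sub(/:$/, "", name)                     # drop the trailing colon
      print name
    }
  '
}

# Example use on the surviving node:
#   sudo pcs status --full | pcsd_online_nodes
```

In the scenario that this step describes, the function is expected to print exactly one node name.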
332+
333+
.. Verify that the failed node is permanently inaccessible before proceeding.
334+
+
335+
Before confirming to Pacemaker that the failed node is fenced, you must ensure that the failed node meets ALL of the following conditions:
336+
337+
- The node is powered off and cannot be restarted
338+
- The node cannot access any shared storage or cluster resources
339+
- The node is completely isolated from the cluster network
340+
+
341+
If the failed node is accessible or can access shared resources, DO NOT proceed with this step. Confirming fencing for a node that is still active can cause data corruption and cluster failure.
342+
343+
.. Confirm to Pacemaker that the failed node is fenced by running the following command:
344+
+
345+
[source,terminal]
346+
----
347+
$ sudo pcs quorum unblock
348+
----
349+
+
350+
The command shows the following sample output:
351+
+
352+
[source,terminal]
353+
----
354+
WARNING: If node 'master-1' is not powered off or it does have access to shared resources, data corruption and/or cluster failure may occur
355+
Type 'yes' or 'y' to proceed, anything else to cancel:
356+
----
357+
+
358+
The `pcs quorum unblock` command does not take a node name. Enter `yes` or `y` at the prompt to proceed.
359+
360+
.. Verify that quorum is restored by running the following command:
361+
+
362+
[source,terminal]
363+
----
364+
$ sudo pcs quorum status
365+
----
366+
+
367+
The command shows the following sample output:
368+
+
369+
.Example output
370+
[source,terminal]
371+
----
372+
Quorum information
373+
------------------
374+
Date: Fri Oct 3 14:15:31 2025
375+
Quorum provider: corosync_votequorum
376+
Nodes: 1
377+
Node ID: 1
378+
Ring ID: 1.16
379+
Quorate: Yes
380+
381+
Votequorum information
382+
----------------------
383+
Expected votes: 2
384+
Highest expected: 2
385+
Total votes: 1
386+
Quorum: 1
387+
Flags: 2Node Quorate
388+
----
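The `Quorate` and `Flags` fields in the output above are the ones that change when quorum is restored (`No` to `Yes`, and `2Node WaitForAll` to `2Node Quorate`). A minimal sketch of extracting them follows. The helper names `quorum_field` and `is_quorate` are hypothetical, and the parsing assumes the `KEY: value` layout shown in the sample output.

```shell
#!/bin/sh
# Print the value of the first "KEY: value" line (for example "Quorate")
# found in quorum status text read from stdin.
quorum_field() {
  awk -v key="$1" '
    index($0, key ":") == 1 {
      sub(/^[^:]*:[ \t]*/, "")  # strip "KEY:" and the padding after it
      print
      exit
    }
  '
}

# Succeed only when the text on stdin reports "Quorate: Yes".
is_quorate() {
  [ "$(quorum_field Quorate)" = "Yes" ]
}

# Example use on the surviving node:
#   sudo pcs quorum status | is_quorate && echo "quorum restored"
```

If the status command prints nothing on standard output, which can happen while the node is not quorate, `is_quorate` returns a non-zero status.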
389+
390+
.. Wait 30 seconds for Pacemaker to process the fencing confirmation and begin recovery.
391+
392+
.. Verify that etcd is running on the surviving node by running the following command:
393+
+
394+
[source,terminal]
395+
----
396+
$ sudo pcs resource status etcd
397+
----
398+
+
399+
If etcd is not running, restart it by running the following command:
400+
+
401+
[source,terminal]
402+
----
403+
$ sudo pcs resource cleanup etcd
404+
----
405+
+
406+
Wait up to 5 minutes for etcd to start. Check the status periodically by running the following command:
407+
+
408+
[source,terminal]
409+
----
410+
$ sudo pcs resource status etcd
411+
----
412+
+
413+
The command shows that the `podman-etcd` resource is started.
414+
If the container started successfully, view its logs by running the following command:
415+
+
416+
[source,terminal]
417+
----
418+
$ sudo podman logs etcd
419+
----
420+
+
421+
If the container did not start, review the Pacemaker journal entries for the `podman-etcd` resource agent by running the following command:
422+
+
423+
[source,terminal]
424+
----
425+
$ journalctl -u pacemaker | grep podman-etcd
426+
----
427+
+
428+
The relevant logs are located in `/var/log/pacemaker/pacemaker.log`.
429+
The output must show that etcd is started on the surviving node.
430+
+
431+
After restoring corosync quorum and confirming etcd is running, proceed to step 5 to replace the failed control plane node.
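The periodic check in this step (up to 5 minutes for etcd to start) can be scripted as a bounded polling loop. This is a minimal sketch: `wait_until` and `etcd_started` are hypothetical helper names, and matching the word `Started` in the `pcs resource status etcd` output is an assumption about the output format, not something this procedure specifies.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds; return 1 once TIMEOUT_SECONDS has passed.
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      return 1                  # timed out without a successful check
    fi
    sleep "$wu_interval"
  done
  return 0
}

# Assumption: the status output contains "Started" when etcd is running.
etcd_started() {
  sudo pcs resource status etcd | grep -q 'Started'
}

# Example use: poll every 10 seconds for up to 5 minutes.
#   wait_until 300 10 etcd_started || echo "etcd did not start in time"
```

Keeping the timeout and interval as arguments makes it possible to reuse the same loop for other waits in the procedure.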
212432
213433
. If you need to manually recover from an event when one of the nodes is not recoverable, follow the procedure in "Replacing control plane nodes in a two-node OpenShift cluster".
214434
+
215-
When a cluster loses a single node, it enters the degraded mode. In this state, Pacemaker automatically unblocks quorum and allows the cluster to temporarily operate on the remaining node.
435+
When a cluster loses a single node, it enters degraded mode. In this state, Pacemaker automatically unblocks quorum and allows the cluster to temporarily operate on the remaining node.
216436
+
217-
If both nodes fail, you must restart both nodes to reestablish quorum so that Pacemaker can resume normal cluster operations.
437+
If both nodes fail and both can be restarted, Pacemaker reestablishes quorum automatically when both nodes are online.
218438
+
219-
If only one of the two nodes can be restarted, follow the node replacement procedure to manually reestablish quorum on the surviving node.
439+
If only one node can be restarted, restore corosync quorum manually as described in step 4.
220440
+
221-
If manual recovery is still required and it fails, collect a must-gather and SOS report, and file a bug.
441+
If manual recovery is still required and it fails, collect a must-gather and sosreport, and file a bug.
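The branching described in the preceding paragraphs can be summarized as a small lookup. This is an illustrative sketch only: the function name `next_action` and its two arguments (the number of failed nodes and the number of nodes that can be restarted) are hypothetical, and combinations that the text does not describe are reported as not covered.

```shell
#!/bin/sh
# next_action FAILED_NODES RESTARTABLE_NODES
# Print the recovery path that the text describes for the given situation.
next_action() {
  case "$1:$2" in
    1:*) echo "degraded mode: Pacemaker unblocks quorum automatically" ;;
    2:2) echo "power on both nodes: Pacemaker reestablishes quorum" ;;
    2:1) echo "step 4: restore corosync quorum manually, then replace the failed node" ;;
    *)   echo "not covered by this procedure"
         return 1 ;;
  esac
}

# Example use:
#   next_action 2 1
```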
222442
223443
.Verification
224444
