Merge pull request #112828 from bjahagir-OpenShift/bjahagir-OSDOCS-18856-cquorum

skopacz1 · web-flow · commit 023f909c079b · 2026-06-05T15:42:14.000-04:00
[OSDOCS#18856]: Implemented manually restoring cquorum feature for 4.22
diff --git a/modules/installation-manual-recovering-when-auto-recovery-is-unavail.adoc b/modules/installation-manual-recovering-when-auto-recovery-is-unavail.adoc
@@ -1,4 +1,4 @@
-/Module included in the following assemblies:
+//Module included in the following assemblies:
 //
 // *installing/installing_two_node_cluster/installing_tnf/install-post-tnf.adoc
 
@@ -7,23 +7,19 @@
 = Manually recovering from a disruption event when automated recovery is unavailable
 
 [role="_abstract"]
-You might need to perform manual recovery steps if a disruption event prevents fencing from functioning correctly. In this case, you can run commands directly on the control plane nodes to recover the cluster. There are four main recovery scenarios, which should be attempted in the following order:
+You might need to perform manual recovery steps if a disruption event prevents fencing from functioning correctly. In this case, you can run commands directly on the control plane nodes to recover the cluster. There are five main recovery scenarios, which should be attempted in the following order:
 
 . Update fencing secrets: Refresh the Baseboard Management Console (BMC) credentials if they are incorrect or outdated.
 . Recover from a single-node failure: Restore functionality when only one control plane node is down.
-. Recover from a complete node failure: Restore functionality when both control plane nodes are down.
+. Recover from dual node power loss: Restore functionality when both control plane nodes are down and can be restarted.
+. Restore corosync quorum after dual node power loss: Restore corosync quorum when both control plane nodes lost power but only one node can be restarted.
 . Replace a control plane node that cannot be recovered: Replace the node to restore cluster functionality.
 
 .Prerequisites
 
 * You have administrative access to the control plane nodes.
 * You can connect to the nodes by using SSH.
 
-[NOTE]
-====
-Do an etcd backup before proceeding to ensure that you can restore the cluster if any issues occur.
-====
-
 .Procedure
 
 . Update the fencing secrets:
@@ -120,21 +116,21 @@ This command provides a detailed view of the current cluster and resource states
 
 .. Run the following additional diagnostic commands, if necessary:
 +
-Reset the resources on your cluster and instruct Pacemaker to attempt to start them fresh by running the following command:
+Reset the resources on your cluster by running the following command:
 +
 [source,terminal]
 ----
 $ sudo pcs resource cleanup
 ----
-+
-Review all Pacemaker activity on the node by running the following command:
+
+.. Review all Pacemaker activity on the node by running the following command:
 +
 [source,terminal]
 ----
 $ sudo journalctl -u pacemaker
 ----
 +
-Diagnose etcd resource startup issues by running the following command:
+.. Diagnose etcd resource startup issues by running the following command:
 +
 [source,terminal]
 ----
@@ -149,6 +145,24 @@ $ sudo pcs stonith config <node_name>_redfish
 ----
 +
 If fencing is required but is not functioning, ensure that the Redfish fencing endpoint is accessible and verify that the credentials are correct.
++
+If the failed node is permanently inaccessible and automated fencing cannot function, verify that the failed node meets all of the following conditions:
+
+* The node is powered off and cannot be restarted.
+* The node cannot access any shared storage or cluster resources.
+* The node is completely isolated from the cluster network.
+
+.. Confirm the node is fenced by running the following command:
++
+[source,terminal]
+----
+$ sudo pcs stonith confirm <failed_node_name>
+----
++
+[WARNING]
+====
+If the failed node is accessible or can access shared resources, confirming fencing can cause data corruption and cluster failure.
+====
 
 .. If etcd is not starting despite fencing being operational, restore etcd from a backup by running the following commands:
 +
@@ -165,9 +179,10 @@ $ sudo chown -R etcd:etcd /var/lib/etcd
 If the recovery is successful, no further action is required. If the issue persists, proceed to the next step.
 --
 
-. Recover from a complete node failure:
+. Recover from dual node power loss where both nodes are recoverable:
 +
---
+This procedure applies when both control plane nodes lost power and both nodes can be restarted. If only one node can be restarted, proceed to step 4.
+
 .. Power on both control plane nodes.
 +
 Pacemaker starts automatically and begins the recovery operation when it detects both nodes are online. If the recovery does not start as expected, use the diagnostic commands described in the previous step to investigate the issue.
@@ -208,17 +223,222 @@ $ sudo pcs stonith config
 ----
 +
 If the recovery is successful, no further action is required. If the issue persists, perform manual recovery as described in the next step.
---
+
+
+. Restore corosync quorum after dual node power loss (single node recoverable):
++
+This procedure applies when both control plane nodes lost power and only one node can be restarted. In this scenario, the cluster has lost corosync quorum because the last known state showed both nodes were online before the power loss.
++
+[IMPORTANT]
+====
+Perform this procedure only when both of the following conditions are met:
+
+* Both control plane nodes lost power.
+* Only one control plane node can be restarted.
+
+====
++
+This scenario typically occurs when one control plane node is not recoverable and must be replaced, and the surviving node loses power before the replacement procedure is performed.
+
+.. Verify that only one node is online by running the following command on the surviving node:
++
+[source,terminal]
+----
+$ sudo pcs status --full
+----
++
+The PCSD status in the output shows only one node online. The sample output is as follows:
++
+[source,terminal]
+----
+Cluster name: TNF
+Cluster Summary:
+  * Stack: corosync (Pacemaker is running)
+  * Current DC: NONE
+  * Last updated: Wed Apr 29 16:21:17 2026 on master-0.ostest.test.metalkube.org
+  * Last change:  Wed Apr 29 16:19:25 2026 by root via root on master-1.ostest.test.metalkube.org
+  * 2 nodes configured
+  * 6 resource instances configured
+
+Node List:
+  * Node master-0.ostest.test.metalkube.org (1): UNCLEAN (offline)
+  * Node master-1.ostest.test.metalkube.org (2): UNCLEAN (offline)
+
+Full List of Resources:
+  * Clone Set: kubelet-clone [kubelet]:
+    * kubelet	(systemd:kubelet):	 Stopped
+    * kubelet	(systemd:kubelet):	 Stopped
+  * master-0.ostest.test.metalkube.org_redfish	(stonith:fence_redfish):	 Stopped
+  * master-1.ostest.test.metalkube.org_redfish	(stonith:fence_redfish):	 Stopped
+  * Clone Set: etcd-clone [etcd]:
+    * etcd	(ocf:heartbeat:podman-etcd):	 Stopped
+    * etcd	(ocf:heartbeat:podman-etcd):	 Stopped
+
+Tickets:
+
+PCSD Status:
+  master-0.ostest.test.metalkube.org: Online
+  master-1.ostest.test.metalkube.org: Offline
+
+Daemon Status:
+  corosync: active/enabled
+  pacemaker: active/enabled
+  pcsd: active/enabled
+----
++
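+The `PCSD Status` section shows that the `master-0` node is online and the `master-1` node is offline.
+Both nodes in the `Node List` section are reported as `UNCLEAN (offline)` because neither node has quorum.
++
+To confirm that quorum is lost, check the quorum status with the `--debug` option. The sample output is as follows: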
++
+[source,terminal]
+----
+$ sudo pcs quorum status --debug
+Running: /usr/sbin/corosync-quorumtool -p
+Environment:
+  LC_ALL=C
+
+Finished running: /usr/sbin/corosync-quorumtool -p
+Return value: 2
+--Debug Stdout Start--
+Quorum information
+------------------
+Date:             Wed Apr 29 16:25:55 2026
+Quorum provider:  corosync_votequorum
+Nodes:            1
+Node ID:          1
+Ring ID:          1.e
+Quorate:          No
+
+Votequorum information
+----------------------
+Expected votes:   2
+Highest expected: 2
+Total votes:      1
+Quorum:           1 Activity blocked
+Flags:            2Node WaitForAll
+
+Membership information
+----------------------
+    Nodeid      Votes    Qdevice Name
+         1          1         NR master-0.ostest.test.metalkube.org (local)
+
+--Debug Stdout End--
+--Debug Stderr Start--
+
+--Debug Stderr End--
+
+Error: Unable to get quorum status:
+----
+
+.. Verify that the failed node is permanently inaccessible before proceeding.
++
+Before you confirm to Pacemaker that the failed node is fenced, ensure that the failed node meets all of the following conditions:
+
+* The node is powered off and cannot be restarted.
+* The node cannot access any shared storage or cluster resources.
+* The node is completely isolated from the cluster network.
++
+If the failed node is accessible or can access shared resources, do not proceed with this step. Confirming fencing for a node that is still active can cause data corruption and cluster failure.
+
+.. Confirm to Pacemaker that the failed node is fenced by running the following command:
++
+[source,terminal]
+----
+$ sudo pcs quorum unblock
+----
++
+The command shows the following sample output:
++
+[source,terminal]
+----
+WARNING: If node 'master-1' is not powered off or it does have access to shared resources, data corruption and/or cluster failure may occur
+Type 'yes' or 'y' to proceed, anything else to cancel:
+----
++
+Enter `yes` or `y` to proceed only after you verify that the failed node meets all of the conditions in the previous step.
+
+.. Verify that quorum is restored by running the following command:
++
+[source,terminal]
+----
+$ sudo pcs quorum status
+----
++
+.Example output
+[source,terminal]
+----
+Quorum information
+------------------
+Date:             Fri Oct  3 14:15:31 2025
+Quorum provider:  corosync_votequorum
+Nodes:            1
+Node ID:          1
+Ring ID:          1.16
+Quorate:          Yes
+
+Votequorum information
+----------------------
+Expected votes:   2
+Highest expected: 2
+Total votes:      1
+Quorum:           1
+Flags:            2Node Quorate
+----
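The quorum check in the preceding step can also be scripted. The following is a minimal sketch and not part of the documented procedure: the `is_quorate` helper name is hypothetical, and it assumes that `pcs quorum status` prints a `Quorate:` field as in the sample output above.

```shell
# Hypothetical helper (not part of pcs): reads `pcs quorum status` output on
# stdin and succeeds only if the "Quorate:" field is present and set to "Yes".
is_quorate() {
  awk '$1 == "Quorate:" { found = 1; ok = ($2 == "Yes") }
       END { exit !(found && ok) }'
}

# Usage on the surviving node (assumes sudo access):
#   sudo pcs quorum status | is_quorate && echo "quorum restored"
```

The helper matches only lines whose first field is `Quorate:`, so the `Flags: 2Node Quorate` line does not produce a false positive.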
+
+.. Wait 30 seconds for Pacemaker to process the fencing confirmation and begin recovery.
+
+.. Verify that etcd is running on the surviving node by running the following command:
++
+[source,terminal]
+----
+$ sudo pcs resource status etcd
+----
++
+If etcd is not running, restart it by running the following command:
++
+[source,terminal]
+----
+$ sudo pcs resource cleanup etcd
+----
++
+Wait up to 5 minutes for etcd to start. Check the status periodically by running the following command:
++
+[source,terminal]
+----
+$ sudo pcs resource status etcd
+----
++
+The output shows that the `podman-etcd` resource is started.
+If the container started successfully, view the etcd logs by running the following command:
++
+[source,terminal]
+----
+$ sudo podman logs etcd
+----
++
+If the container did not start, view the Pacemaker log entries for the `podman-etcd` resource agent by running the following command:
++
+[source,terminal]
+----
+$ sudo journalctl -u pacemaker | grep podman-etcd
+----
++
+Pacemaker also writes logs to `/var/log/pacemaker/pacemaker.log`.
+Before you continue, confirm that the `pcs resource status etcd` output shows that etcd is started on the surviving node.
++
+After restoring corosync quorum and confirming etcd is running, proceed to step 5 to replace the failed control plane node.
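The wait-and-check sequence for etcd earlier in this step can be sketched as a polling loop. This is an illustrative sketch under stated assumptions, not part of the documented procedure: `wait_for_started` is a hypothetical helper, the 300-second timeout in the usage comment mirrors the five-minute wait described in the step, the 10-second interval is arbitrary, and the check assumes that the `pcs resource status etcd` output contains the word `Started` when the resource is running.

```shell
# Hypothetical helper: run a status command repeatedly until its output
# contains "Started" or the timeout (in seconds) expires.
# Usage: wait_for_started <timeout_seconds> <interval_seconds> <command...>
wait_for_started() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while :; do
    # A failing or silent status command is treated as "not started yet".
    if "$@" 2>/dev/null | grep -q 'Started'; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}

# Example on the surviving node: poll for up to 5 minutes, every 10 seconds.
#   wait_for_started 300 10 sudo pcs resource status etcd
```

Passing the status command as arguments keeps the loop independent of `pcs`, so the same helper can be exercised with a stub command before it is used on a cluster node.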
 
 . If you need to manually recover from an event when one of the nodes is not recoverable, follow the procedure in "Replacing control plane nodes in a two-node OpenShift cluster".
 +
-When a cluster loses a single node, it enters the degraded mode. In this state, Pacemaker automatically unblocks quorum and allows the cluster to temporarily operate on the remaining node.
+When a cluster loses a single node, it enters degraded mode. In this state, Pacemaker automatically unblocks quorum and allows the cluster to temporarily operate on the remaining node.
 +
-If both nodes fail, you must restart both nodes to reestablish quorum so that Pacemaker can resume normal cluster operations.
+If both nodes fail and both can be restarted, Pacemaker reestablishes quorum automatically when both nodes are online.
 +
-If only one of the two nodes can be restarted, follow the node replacement procedure to manually reestablish quorum on the surviving node.
+If only one node can be restarted, proceed to step 4 to restore corosync quorum manually.
 +
-If manual recovery is still required and it fails, collect a must-gather and SOS report, and file a bug.
+If manual recovery is still required and it fails, collect a must-gather and sosreport, and file a bug.
 
 .Verification