= Manually recovering from a disruption event when automated recovery is unavailable
8
8
9
9
[role="_abstract"]
10
-
You might need to perform manual recovery steps if a disruption event prevents fencing from functioning correctly. In this case, you can run commands directly on the control plane nodes to recover the cluster. There are four main recovery scenarios, which should be attempted in the following order:
10
+
You might need to perform manual recovery steps if a disruption event prevents fencing from functioning correctly. In this case, you can run commands directly on the control plane nodes to recover the cluster. There are five main recovery scenarios, which should be attempted in the following order:
11
11
12
12
. Update fencing secrets: Refresh the Baseboard Management Controller (BMC) credentials if they are incorrect or outdated.
13
13
. Recover from a single-node failure: Restore functionality when only one control plane node is down.
14
-
. Recover from a complete node failure: Restore functionality when both control plane nodes are down.
14
+
. Recover from dual node power loss: Restore functionality when both control plane nodes are down and can be restarted.
15
+
. Restore corosync quorum after dual node power loss: Restore corosync quorum when both control plane nodes lost power but only one node can be restarted.
15
16
. Replace a control plane node that cannot be recovered: Replace the node to restore cluster functionality.
16
17
17
18
.Prerequisites
18
19
19
20
* You have administrative access to the control plane nodes.
20
21
* You can connect to the nodes by using SSH.
21
22
22
-
[NOTE]
23
-
====
24
-
Do an etcd backup before proceeding to ensure that you can restore the cluster if any issues occur.
25
-
====
26
-
27
23
.Procedure
28
24
29
25
. Update the fencing secrets:
@@ -120,21 +116,21 @@ This command provides a detailed view of the current cluster and resource states
120
116
121
117
.. Run the following additional diagnostic commands, if necessary:
122
118
+
123
-
Reset the resources on your cluster and instruct Pacemaker to attempt to start them fresh by running the following command:
119
+
Reset the resources on your cluster by running the following command:
124
120
+
125
121
[source,terminal]
126
122
----
127
123
$ sudo pcs resource cleanup
128
124
----
129
-
+
130
-
Review all Pacemaker activity on the node by running the following command:
125
+
126
+
.. Review all Pacemaker activity on the node by running the following command:
131
127
+
132
128
[source,terminal]
133
129
----
134
130
$ sudo journalctl -u pacemaker
135
131
----
136
132
+
137
-
Diagnose etcd resource startup issues by running the following command:
133
+
.. Diagnose etcd resource startup issues by running the following command:
If fencing is required but is not functioning, ensure that the Redfish fencing endpoint is accessible and verify that the credentials are correct.
148
+
+
149
+
If automated fencing cannot function and you believe that the failed node is permanently inaccessible, verify that the failed node meets all of the following conditions before you manually confirm fencing:
150
+
151
+
* The node is powered off and cannot be restarted.
152
+
* The node cannot access any shared storage or cluster resources.
153
+
* The node is completely isolated from the cluster network.
154
+
155
+
.. Confirm the node is fenced by running the following command:
156
+
+
157
+
[source,terminal]
158
+
----
159
+
$ sudo pcs stonith confirm <failed_node_name>
160
+
----
161
+
+
162
+
[WARNING]
163
+
====
164
+
If the failed node is accessible or can access shared resources, confirming fencing can cause data corruption and cluster failure.
165
+
====
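The checklist above can be enforced with a small guard before the confirmation command is run. The following is a minimal sketch, not part of the documented procedure: `confirm_all` is a hypothetical helper name, and it only records the operator's answers. It does not verify the node's power or network state itself.

```shell
#!/bin/sh
# confirm_all (hypothetical helper): ask each question passed as an argument
# and succeed only if every answer read from stdin is exactly "yes".
confirm_all() {
  for question in "$@"; do
    printf '%s [yes/no]: ' "$question" >&2
    # Fail if stdin ends early or the answer is anything other than "yes".
    read -r answer || return 1
    [ "$answer" = "yes" ] || return 1
  done
}

# Example usage (run manually on the surviving node; replace the placeholder):
#   confirm_all \
#     "Is the node powered off and unable to restart?" \
#     "Is the node cut off from shared storage and cluster resources?" \
#     "Is the node completely isolated from the cluster network?" \
#     && sudo pcs stonith confirm <failed_node_name>
```

Because the guard stops at the first answer that is not `yes`, the confirmation command is never reached unless all three conditions are acknowledged.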
152
166
153
167
.. If etcd is not starting despite fencing being operational, restore etcd from a backup by running the following commands:
If the recovery is successful, no further action is required. If the issue persists, proceed to the next step.
166
180
--
167
181
168
-
. Recover from a complete node failure:
182
+
. Recover from dual node power loss where both nodes are recoverable:
169
183
+
170
-
--
184
+
This procedure applies when both control plane nodes lost power and both nodes can be restarted. If only one node can be restarted, proceed to step 4.
185
+
171
186
.. Power on both control plane nodes.
172
187
+
173
188
Pacemaker starts automatically and begins the recovery operation when it detects both nodes are online. If the recovery does not start as expected, use the diagnostic commands described in the previous step to investigate the issue.
@@ -208,17 +223,222 @@ $ sudo pcs stonith config
208
223
----
209
224
+
210
225
If the recovery is successful, no further action is required. If the issue persists, perform manual recovery as described in the next step.
211
-
--
226
+
227
+
228
+
. Restore corosync quorum after dual node power loss (single node recoverable):
229
+
+
230
+
231
+
This procedure applies when both control plane nodes lost power and only one node can be restarted. In this scenario, the cluster has lost corosync quorum because the last known state showed both nodes were online before the power loss.
232
+
+
233
+
[IMPORTANT]
234
+
====
235
+
Perform this procedure only when both of the following conditions are met:
236
+
237
+
- Both control plane nodes lost power
238
+
- Only one control plane node can be restarted
239
+
240
+
====
241
+
+
242
+
This scenario typically occurs when you need to replace a control plane node (one node is not recoverable) and the surviving node lost power before the replacement procedure.
243
+
244
+
.. Verify that only one node is online by running the following command on the surviving node:
245
+
+
246
+
[source,terminal]
247
+
----
248
+
$ sudo pcs status --full
249
+
----
250
+
+
251
+
The output shows only one node online. The sample output is as follows:
252
+
+
253
+
[source,terminal]
254
+
----
255
+
Cluster name: TNF
256
+
Cluster Summary:
257
+
* Stack: corosync (Pacemaker is running)
258
+
* Current DC: NONE
259
+
* Last updated: Wed Apr 29 16:21:17 2026 on master-0.ostest.test.metalkube.org
260
+
* Last change: Wed Apr 29 16:19:25 2026 by root via root on master-1.ostest.test.metalkube.org
.. Verify that the failed node is permanently inaccessible before proceeding.
334
+
+
335
+
Before confirming to Pacemaker that the failed node is fenced, you must ensure that the failed node meets ALL of the following conditions:
336
+
337
+
- The node is powered off and cannot be restarted
338
+
- The node cannot access any shared storage or cluster resources
339
+
- The node is completely isolated from the cluster network
340
+
+
341
+
If the failed node is accessible or can access shared resources, DO NOT proceed with this step. Confirming fencing for a node that is still active can cause data corruption and cluster failure.
342
+
343
+
.. Confirm to Pacemaker that the failed node is fenced by running the following command:
344
+
+
345
+
[source,terminal]
346
+
----
347
+
$ sudo pcs quorum unblock
348
+
----
349
+
+
350
+
The command shows the following sample output:
351
+
+
352
+
[source,terminal]
353
+
----
354
+
WARNING: If node 'master-1' is not powered off or it does have access to shared resources, data corruption and/or cluster failure may occur
355
+
Type 'yes' or 'y' to proceed, anything else to cancel:
356
+
----
357
+
+
358
+
Replace <failed_node_name> with the name of the failed control plane node (for example, control-plane-1).
359
+
360
+
.. Verify that quorum is restored by running the following command:
361
+
+
362
+
[source,terminal]
363
+
----
364
+
$ sudo pcs quorum status
365
+
----
366
+
+
367
+
The command shows the following sample output:
368
+
+
369
+
.Example output
370
+
[source,terminal]
371
+
----
372
+
Quorum information
373
+
------------------
374
+
Date: Fri Oct 3 14:15:31 2025
375
+
Quorum provider: corosync_votequorum
376
+
Nodes: 1
377
+
Node ID: 1
378
+
Ring ID: 1.16
379
+
Quorate: Yes
380
+
381
+
Votequorum information
382
+
----------------------
383
+
Expected votes: 2
384
+
Highest expected: 2
385
+
Total votes: 1
386
+
Quorum: 1
387
+
Flags: 2Node Quorate
388
+
----
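If you want to check the quorum state from a script rather than by reading the output, the `Quorate:` line can be tested directly. This is a minimal sketch that assumes the output format shown in the example above. `is_quorate` is a hypothetical helper name, not a `pcs` subcommand.

```shell
#!/bin/sh
# is_quorate (hypothetical helper): read `pcs quorum status` output on stdin
# and succeed only if it contains a "Quorate: Yes" line.
# The "Flags: 2Node Quorate" line is deliberately not matched.
is_quorate() {
  grep -Eq '^[[:space:]]*Quorate:[[:space:]]+Yes[[:space:]]*$'
}

# Example usage on the surviving node:
#   sudo pcs quorum status | is_quorate && echo "quorum restored"
```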
389
+
390
+
.. Wait 30 seconds for Pacemaker to process the fencing confirmation and begin recovery.
391
+
392
+
.. Verify that etcd is running on the surviving node by running the following command:
393
+
+
394
+
[source,terminal]
395
+
----
396
+
$ sudo pcs resource status etcd
397
+
----
398
+
+
399
+
If etcd is not running, restart it by running the following command:
400
+
+
401
+
[source,terminal]
402
+
----
403
+
$ sudo pcs resource cleanup etcd
404
+
----
405
+
+
406
+
Wait up to 5 minutes for etcd to start. Check the status periodically by running the following command:
407
+
+
408
+
[source,terminal]
409
+
----
410
+
$ sudo pcs resource status etcd
411
+
----
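The periodic check can be automated with a simple polling loop. The following sketch assumes that the `pcs resource status etcd` output contains the word `Started` once the resource is running. Both `wait_until` and `etcd_started` are hypothetical helper names introduced for illustration.

```shell
#!/bin/sh
# wait_until (hypothetical helper): rerun a command until it succeeds or
# until the timeout, given in seconds, has elapsed.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  until "$@"; do
    # Give up once the deadline has passed.
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      return 1
    fi
    sleep "$wu_interval"
  done
}

# etcd_started (hypothetical check): assumes the status output includes
# "Started" when the etcd resource is running.
etcd_started() {
  sudo pcs resource status etcd | grep -q 'Started'
}

# Example usage: poll every 15 seconds for up to 5 minutes.
#   wait_until 300 15 etcd_started && echo "etcd is started"
```

A 15-second interval is an arbitrary choice. Adjust it to suit your environment.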
412
+
+
413
+
The command shows that the `podman-etcd` resource is started.
414
+
If the container started successfully, view its logs by running the following command:
415
+
+
416
+
[source,terminal]
417
+
----
418
+
$ sudo podman logs etcd
419
+
----
420
+
+
421
+
If the container did not start, review the Pacemaker journal entries for the `podman-etcd` resource by running the following command:
422
+
+
423
+
[source,terminal]
424
+
----
425
+
$ sudo journalctl -u pacemaker | grep podman-etcd
426
+
----
427
+
+
428
+
The relevant logs are also written to `/var/log/pacemaker/pacemaker.log`.
429
+
The output must show that etcd is started on the surviving node.
430
+
+
431
+
After restoring corosync quorum and confirming etcd is running, proceed to step 5 to replace the failed control plane node.
212
432
213
433
. If you need to manually recover from an event when one of the nodes is not recoverable, follow the procedure in "Replacing control plane nodes in a two-node OpenShift cluster".
214
434
+
215
-
When a cluster loses a single node, it enters the degraded mode. In this state, Pacemaker automatically unblocks quorum and allows the cluster to temporarily operate on the remaining node.
435
+
When a cluster loses a single node, it enters degraded mode. In this state, Pacemaker automatically unblocks quorum and allows the cluster to temporarily operate on the remaining node.
216
436
+
217
-
If both nodes fail, you must restart both nodes to reestablish quorum so that Pacemaker can resume normal cluster operations.
437
+
If both nodes fail and both can be restarted, Pacemaker reestablishes quorum automatically when both nodes are online.
218
438
+
219
-
If only one of the two nodes can be restarted, follow the node replacement procedure to manually reestablish quorum on the surviving node.
439
+
If only one node can be restarted, proceed to step 4 to restore corosync quorum manually.
220
440
+
221
-
If manual recovery is still required and it fails, collect a must-gather and SOS report, and file a bug.
441
+
If manual recovery is still required and it fails, collect a must-gather and sosreport, and file a bug.