K8SPXC-1828: Make operator aware of last recovered seqno for auto recovery#2467
egegunes wants to merge 7 commits into
Conversation
Pull request overview
This PR improves full PXC cluster crash recovery by enriching the recovery marker with the Galera cluster UUID, adding safety checks to prevent bootstrapping from stale/unrelated data, and persisting the last recovery decision in CR status for future comparisons.
Changes:
- Extend the crash-recovery log marker to include `UUID:SEQNO` and update log parsing to support both new and legacy formats.
- Add a `status.recovery` field to the CRD and persist the last recovery pod/UUID/seqno/time after triggering bootstrap.
- Add a safety gate (`isAutomaticRecoverySafe`) to refuse automatic recovery on UUID mismatch or seqno regression.
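The parser itself isn't shown in this overview, but a minimal sketch of UUID-aware parsing with a legacy fallback might look like the following. The function name follows `parseRecoveredPosition` from the tests in the diff; the exact marker formats (`uuid:seqno` for new markers, a bare seqno for legacy ones) are assumptions based on the PR description:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRecoveredPosition extracts the Galera position from the crash-recovery
// marker. New markers carry "uuid:seqno"; legacy markers carry only the seqno.
// Sketch only: the real function lives in full_crash_recovery.go and may differ.
func parseRecoveredPosition(marker string) (uuid string, seq int64, err error) {
	marker = strings.TrimSpace(marker)
	if i := strings.LastIndex(marker, ":"); i >= 0 {
		// New format: <cluster UUID>:<seqno>.
		seq, err = strconv.ParseInt(marker[i+1:], 10, 64)
		if err != nil {
			return "", 0, fmt.Errorf("parse seqno: %w", err)
		}
		return marker[:i], seq, nil
	}
	// Legacy format: bare seqno, no UUID available.
	seq, err = strconv.ParseInt(marker, 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("parse seqno: %w", err)
	}
	return "", seq, nil
}

func main() {
	u, s, _ := parseRecoveredPosition("some-cluster-uuid:102")
	fmt.Println(u, s) // some-cluster-uuid 102
	_, s2, _ := parseRecoveredPosition("97")
	fmt.Println(s2) // 97
}
```

Returning an empty UUID for legacy markers lets the caller skip the UUID comparison when only a seqno is available.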
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/controller/pxc/status.go | Adjusts recovery-log parsing call site for the updated return signature. |
| pkg/controller/pxc/replication.go | Skips replication reconcile when no proxy is enabled, avoiding proxy-dependent primary detection. |
| pkg/controller/pxc/full_crash_recovery.go | Implements UUID-aware parsing, recovery safety checks, eventing, and status persistence. |
| pkg/controller/pxc/full_crash_recovery_test.go | Adds unit tests for log parsing and safety-check behavior. |
| pkg/apis/pxc/v1/pxc_types.go | Introduces RecoveryStatus and wires it into cluster status. |
| pkg/apis/pxc/v1/zz_generated.deepcopy.go | Adds generated deepcopy support for the new status type. |
| deploy/cw-bundle.yaml | Updates CRD schema to expose status.recovery. |
| deploy/crd.yaml | Updates CRD schema to expose status.recovery. |
| deploy/bundle.yaml | Updates CRD schema to expose status.recovery. |
| config/crd/bases/pxc.percona.com_perconaxtradbclusters.yaml | Updates CRD schema to expose status.recovery. |
| build/pxc-entrypoint.sh | Emits UUID in crash marker and adds an informational “Cluster UUID” log line. |
For auto recovery from a full cluster crash, the operator selects the PXC pod
with the highest seqno (wsrep_last_applied) and re-bootstraps the cluster
from there. This logic is sound, but there's a problem: the operator
immediately forgets the position (uuid:seqno) it used to recover the
cluster.
Imagine the scenario:
Crash happens, pods report their positions:
pod-0 -> uuid:100
pod-1 -> uuid:97
pod-2 -> uuid:102
Operator picks pod-2 to recover.
Another crash happens, pods report their positions:
pod-0 -> uuid:91
pod-1 -> uuid:88
pod-2 -> uuid:89
The operator picks pod-0 to recover. But the position has actually regressed,
and recovering from this regressed position will result in data loss.
(Why wsrep_last_applied would regress is a question I don't know the
answer to, but we've seen it in highly unstable environments where the
operator needed to perform recovery repeatedly for prolonged periods.)
With these changes, we make the operator aware of the last recovered
position and add a guardrail to the auto recovery logic.
The operator will store the last recovery information in
`.status.recovery`:
```go
RecoveryStatus{
	clusterUUID       // Galera cluster UUID reported by the pod.
	lastRecoveryTime  // the time when the operator triggered the most recent recovery.
	lastRecoveryPod   // the pod the operator picked to bootstrap from (the one with the highest reported seqno).
	lastRecoverySeqNo // wsrep sequence number of the pod that was used to bootstrap.
}
```
This information will be used in subsequent recoveries to ensure the
recovery position doesn't regress. If it does, the operator will refuse
to do the recovery itself. In that case, a human needs to step in and
do the recovery manually.
The anti-regression guardrail depends on the fact that wsrep_last_applied
(seqno) is monotonic for the same cluster UUID. The operator always
recovers the cluster with the same UUID, so the UUID stays the same for the
whole lifecycle of a PXC cluster on K8s. But users may do something manually
that changes the cluster UUID; in that case they will need to update or clean
up the last recovery info in the PerconaXtraDBCluster object's status.
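The guardrail described above can be sketched as a pure function over the stored status and the newly reported position. The names follow `RecoveryStatus` and `isAutomaticRecoverySafe` from the diff, but the exact signature and the stand-alone types here are assumptions for illustration:

```go
package main

import "fmt"

// RecoveryStatus mirrors the new .status.recovery field described above.
type RecoveryStatus struct {
	ClusterUUID       string
	LastRecoveryPod   string
	LastRecoverySeqNo int64
}

// isAutomaticRecoverySafe refuses automatic recovery when the candidate
// position could lose data: either the cluster UUID changed (positions are
// not comparable across UUIDs) or the seqno regressed below the last
// recovery point. Sketch only; the real check lives in full_crash_recovery.go.
func isAutomaticRecoverySafe(last *RecoveryStatus, uuid string, seq int64) (bool, string) {
	if last == nil {
		return true, "" // no previous recovery recorded, nothing to compare
	}
	if last.ClusterUUID != uuid {
		return false, fmt.Sprintf("cluster UUID changed: %s -> %s", last.ClusterUUID, uuid)
	}
	if seq < last.LastRecoverySeqNo {
		return false, fmt.Sprintf("seqno regressed: %d < %d", seq, last.LastRecoverySeqNo)
	}
	return true, ""
}

func main() {
	// The second crash in the scenario above: last recovery was pod-2 at 102,
	// and now the best candidate only reports 91.
	last := &RecoveryStatus{ClusterUUID: "uuid", LastRecoveryPod: "pod-2", LastRecoverySeqNo: 102}
	ok, reason := isAutomaticRecoverySafe(last, "uuid", 91)
	fmt.Println(ok, reason) // false seqno regressed: 91 < 102
}
```

Returning a reason string alongside the verdict makes it easy to reuse the same message for both the Kubernetes event and the returned error.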
```go
for name, tt := range tests {
	t.Run(name, func(t *testing.T) {
		uuid, seq, err := parseRecoveredPosition(tt.log)
		if tt.wantErr {
```
`wantErr` is completely unneeded. We can determine the need to assert an error through `errSubstr`.
```go
}

func (r *ReconcilePerconaXtraDBCluster) doFullCrashRecovery(ctx context.Context, cr *pxcv1.PerconaXtraDBCluster) error {
	maxSeq := int64(-100)
```
Why `-100` and not something like `math.MinInt64`?
```shell
wait_for_running "$cluster-haproxy" 1
wait_for_running "$cluster-pxc" 3
```
Is this related to the changes in this PR?
Not really, but the tls-issue-cert-manager failure was blocking this PR.
The issue was introduced by 7f885a1; it probably slipped through in one of the retries and we merged it. In the test, to rotate the certificates, we delete the pods altogether and then run `wait_for_running`, which runs `wait_pod`, which checks for a full cluster crash and exits immediately if it detects one. But we're causing the crash deliberately, so I removed these calls and opted for waiting on cluster readiness only.
```go
orig := cr.DeepCopy()

cr.Status.Recovery = &pxcv1.RecoveryStatus{
	ClusterUUID:       info.uuid,
	LastRecoverySeqNo: info.seq,
	LastRecoveryPod:   recoveryPod,
	LastRecoveryTime:  metav1.Now(),
}

return cl.Status().Patch(ctx, cr, client.MergeFrom(orig))
```
Do we need to wrap this in a retry block? Otherwise the status may not be written if there are transient failures like conflicts
Patch is not prone to conflict errors, so I don't think it's needed.
```go
	recoveryInfo.seq,
)
r.recorder.Event(cr, corev1.EventTypeWarning, "AutomaticRecoveryRefused", msg)
return errors.New(msg)
```
If we return an error here, won't the reconciler keep retrying this code path and flood with events? Is it possible to stop retrying once the operator detects that automatic recovery is not safe?
I don't think it's possible to stop retrying without adding additional logic to override the block, which the user would need to trigger manually. I don't think it would be a good improvement.
```go
if err := updateRecoveryStatus(ctx, r.client, cr, recoveryInfo, recoveryPod); err != nil {
	log.Error(err, "update recovery status")
	// we already signalled the pod
	// not returning here so we can sleep
}
```
```shell
echo 'Cluster will recover automatically from the crash now.'
echo 'If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:'
echo "kubectl -n $POD_NAMESPACE exec $(hostname) -c pxc -- sh -c 'kill -s USR1 1'"
# DO NOT CHANGE THE LINE BELOW. OUR AUTO-RECOVERY IS USING IT TO DETECT SEQNO OF CURRENT NODE. See K8SPXC-564
```
commit: 3dc7f02 |