K8SPXC-1828: Make operator aware of last recovered seqno for auto recovery#2467

Open
egegunes wants to merge 7 commits into main from K8SPXC-1828

Conversation

egegunes (Contributor) commented May 7, 2026

CHANGE DESCRIPTION

For auto recovery from a full cluster crash, the operator selects the PXC pod with the highest seqno (wsrep_last_applied) and re-bootstraps the cluster from it. This logic is sound, but there's a problem: the operator immediately forgets the position (uuid:seqno) it used to recover the cluster.

Imagine the scenario:

Crash happens, pods report their positions:
pod-0 -> uuid:100
pod-1 -> uuid:97
pod-2 -> uuid:102

Operator picks pod-2 to recover.

Another crash happens, pods report their positions:
pod-0 -> uuid:91
pod-1 -> uuid:88
pod-2 -> uuid:89

The operator picks pod-0 to recover. But the position has actually regressed, and recovering from this regressed position will result in data loss.

(Why wsrep_last_applied would regress is a question I don't have an answer to, but we've seen it happen in highly unstable environments where the operator needed to perform recovery repeatedly for prolonged periods.)

With these changes, we make the operator aware of the last recovered position and add a guardrail to the auto recovery logic.

The operator is going to store the last recovery information in .status.recovery:

RecoveryStatus{
  clusterUUID        // Galera cluster UUID reported by the pod.
  lastRecoveryTime   // the time when the operator triggered the most recent recovery.
  lastRecoveryPod    // the pod the operator picked to bootstrap from (the one with the highest reported seqno).
  lastRecoverySeqNo  // wsrep sequence number of the pod that was used to bootstrap.
}
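
For illustration, a minimal Go sketch of how this status type could look (the field names match the review snippets in this PR; the json tags, comments, and exact layout are my assumptions, and the real definition lives in pkg/apis/pxc/v1/pxc_types.go):

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// RecoveryStatus records the position the operator last used to bootstrap
// the cluster after a full crash. Sketch only, not the exact code from this PR.
type RecoveryStatus struct {
	// ClusterUUID is the Galera cluster UUID reported by the pod.
	ClusterUUID string `json:"clusterUUID,omitempty"`
	// LastRecoveryTime is when the operator triggered the most recent recovery.
	LastRecoveryTime metav1.Time `json:"lastRecoveryTime,omitempty"`
	// LastRecoveryPod is the pod picked to bootstrap from (highest reported seqno).
	LastRecoveryPod string `json:"lastRecoveryPod,omitempty"`
	// LastRecoverySeqNo is the wsrep sequence number of the bootstrap pod.
	LastRecoverySeqNo int64 `json:"lastRecoverySeqNo,omitempty"`
}
```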

This information will be used in subsequent recoveries to ensure the recovery position doesn't regress. If it does regress, the operator will refuse to perform the recovery itself; in that case, a human needs to step in and do the recovery manually.

The anti-regression guardrail relies on wsrep_last_applied (seqno) being monotonic for a given cluster UUID. The operator always recovers the cluster with the same UUID, so the UUID stays the same throughout the whole lifecycle of a PXC cluster on Kubernetes. However, users may do something manually that changes the cluster UUID; in that case, they will need to update or clean up the last recovery info in the PerconaXtraDBCluster object's status.
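
A simplified sketch of that guardrail under these assumptions (the real check in this PR is isAutomaticRecoverySafe in pkg/controller/pxc/full_crash_recovery.go; the function name, signature, and error messages below are illustrative, not the actual code):

```go
import "fmt"

// checkRecoveryPosition refuses automatic recovery when the candidate
// position regressed or the cluster UUID changed since the last recorded
// recovery. Hypothetical helper for illustration only.
func checkRecoveryPosition(last *RecoveryStatus, uuid string, seqNo int64) error {
	if last == nil {
		// No previous recovery recorded, nothing to compare against.
		return nil
	}
	if last.ClusterUUID != uuid {
		return fmt.Errorf("cluster UUID changed (%s -> %s); clean up .status.recovery and recover manually", last.ClusterUUID, uuid)
	}
	if seqNo < last.LastRecoverySeqNo {
		return fmt.Errorf("seqno regressed (%d < %d, last recovered from pod %s); refusing automatic recovery", seqNo, last.LastRecoverySeqNo, last.LastRecoveryPod)
	}
	return nil
}
```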

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PXC version?
  • Does the change support oldest and newest supported Kubernetes version?

Copilot AI review requested due to automatic review settings May 7, 2026 11:35
pull-request-size Bot added the size/L 100-499 lines label May 7, 2026

Copilot AI left a comment

Pull request overview

This PR improves full PXC cluster crash recovery by enriching the recovery marker with the Galera cluster UUID, adding safety checks to prevent bootstrapping from stale/unrelated data, and persisting the last recovery decision in CR status for future comparisons.

Changes:

  • Extend the crash-recovery log marker to include UUID:SEQNO and update log parsing to support both the new and legacy formats (a parsing sketch follows this list).
  • Add a status.recovery field to the CRD and persist the last recovery pod/UUID/seqno/time after triggering bootstrap.
  • Add a safety gate (isAutomaticRecoverySafe) to refuse automatic recovery on UUID mismatch or seqno regression.
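
A rough sketch of how parsing both marker formats could work (the function in this PR is parseRecoveredPosition; the helper name and exact marker handling below are assumptions, shown only to illustrate the new UUID:SEQNO format falling back to the legacy SEQNO-only format):

```go
import (
	"fmt"
	"strconv"
	"strings"
)

// parsePosition parses a recovered-position value that is either the new
// "UUID:SEQNO" format or the legacy "SEQNO" format. Illustrative sketch only.
func parsePosition(value string) (uuid string, seqNo int64, err error) {
	seqStr := value
	if i := strings.LastIndex(value, ":"); i >= 0 {
		// New format: everything before the last colon is the cluster UUID.
		uuid, seqStr = value[:i], value[i+1:]
	}
	seqNo, err = strconv.ParseInt(strings.TrimSpace(seqStr), 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("parse seqno from %q: %w", value, err)
	}
	return uuid, seqNo, nil
}
```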

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pkg/controller/pxc/status.go | Adjusts recovery-log parsing call site for the updated return signature. |
| pkg/controller/pxc/replication.go | Skips replication reconcile when no proxy is enabled, avoiding proxy-dependent primary detection. |
| pkg/controller/pxc/full_crash_recovery.go | Implements UUID-aware parsing, recovery safety checks, eventing, and status persistence. |
| pkg/controller/pxc/full_crash_recovery_test.go | Adds unit tests for log parsing and safety-check behavior. |
| pkg/apis/pxc/v1/pxc_types.go | Introduces RecoveryStatus and wires it into cluster status. |
| pkg/apis/pxc/v1/zz_generated.deepcopy.go | Adds generated deepcopy support for the new status type. |
| deploy/cw-bundle.yaml | Updates CRD schema to expose status.recovery. |
| deploy/crd.yaml | Updates CRD schema to expose status.recovery. |
| deploy/bundle.yaml | Updates CRD schema to expose status.recovery. |
| config/crd/bases/pxc.percona.com_perconaxtradbclusters.yaml | Updates CRD schema to expose status.recovery. |
| build/pxc-entrypoint.sh | Emits UUID in crash marker and adds an informational "Cluster UUID" log line. |


Comment thread pkg/controller/pxc/full_crash_recovery.go Outdated
Comment thread pkg/controller/pxc/full_crash_recovery.go Outdated
Comment thread pkg/controller/pxc/full_crash_recovery.go Outdated
Comment thread pkg/apis/pxc/v1/pxc_types.go Outdated
Copilot AI review requested due to automatic review settings May 11, 2026 19:20

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

hors added this to the v1.20.0 milestone May 12, 2026
Copilot AI review requested due to automatic review settings May 12, 2026 16:45

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 3 comments.

Files not reviewed (1)
  • pkg/apis/pxc/v1/zz_generated.deepcopy.go: Language not supported

Comment thread pkg/controller/pxc/replication.go Outdated
Comment thread pkg/controller/pxc/full_crash_recovery.go Outdated
Comment thread pkg/apis/pxc/v1/pxc_types.go Outdated
egegunes changed the title from "K8SPXC-1828: Improve full cluster crash recovery" to "K8SPXC-1828: Make operator aware of last recovered seqno for auto recovery" May 13, 2026
egegunes requested a review from Copilot May 13, 2026 07:18
egegunes marked this pull request as ready for review May 13, 2026 07:18

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • pkg/apis/pxc/v1/zz_generated.deepcopy.go: Language not supported

Comment thread pkg/controller/pxc/full_crash_recovery.go
Comment thread pkg/apis/pxc/v1/pxc_types.go Outdated
for name, tt := range tests {
	t.Run(name, func(t *testing.T) {
		uuid, seq, err := parseRecoveredPosition(tt.log)
		if tt.wantErr {

Contributor

wantErr is completely unneeded. We can determine the need to assert an error from errSubstr.

Contributor Author

updated

}

func (r *ReconcilePerconaXtraDBCluster) doFullCrashRecovery(ctx context.Context, cr *pxcv1.PerconaXtraDBCluster) error {
	maxSeq := int64(-100)

Contributor

why -100 and not something like math.MinInt64

Contributor Author

updated

Comment on lines -152 to -153
wait_for_running "$cluster-haproxy" 1
wait_for_running "$cluster-pxc" 3

Member

Is this related to the changes in this PR?

Contributor Author

Not really, but the tls-issue-cert-manager failure was blocking this PR.

The issue was introduced by 7f885a1; it probably slipped through in one of the retries and we merged it. In the test, to rotate the certificates, we delete the pods altogether and then run wait_for_running, which runs wait_pod, which checks for a full cluster crash and exits immediately if it detects one. But we're causing the crash deliberately, so I removed these calls and opted to wait for cluster readiness only.

Comment on lines +283 to +292
orig := cr.DeepCopy()

cr.Status.Recovery = &pxcv1.RecoveryStatus{
	ClusterUUID:       info.uuid,
	LastRecoverySeqNo: info.seq,
	LastRecoveryPod:   recoveryPod,
	LastRecoveryTime:  metav1.Now(),
}

return cl.Status().Patch(ctx, cr, client.MergeFrom(orig))

Member

Do we need to wrap this in a retry block? Otherwise the status may not be written if there are transient failures like conflicts

Contributor Author

A merge patch is not prone to conflict errors, so I don't think it's needed.
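
(For reference, the retry wrapper suggested above would look roughly like the sketch below, using retry.RetryOnConflict from k8s.io/client-go/util/retry. The PR keeps the plain merge patch, so this is only an illustration of the suggestion, and the helper name is hypothetical.)

```go
import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// import path assumed from the repo layout (pkg/apis/pxc/v1)
	pxcv1 "github.com/percona/percona-xtradb-cluster-operator/pkg/apis/pxc/v1"
)

// patchRecoveryStatusWithRetry retries the status patch on transient
// conflicts. Hypothetical alternative; not what this PR does.
func patchRecoveryStatusWithRetry(ctx context.Context, cl client.Client, cr *pxcv1.PerconaXtraDBCluster, status *pxcv1.RecoveryStatus) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-fetch the latest object so the patch base is current.
		if err := cl.Get(ctx, client.ObjectKeyFromObject(cr), cr); err != nil {
			return err
		}
		orig := cr.DeepCopy()
		cr.Status.Recovery = status
		return cl.Status().Patch(ctx, cr, client.MergeFrom(orig))
	})
}
```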

recoveryInfo.seq,
)
r.recorder.Event(cr, corev1.EventTypeWarning, "AutomaticRecoveryRefused", msg)
return errors.New(msg)

Member

If we return an error here, won't the reconciler keep retrying this code path and flood with events? Is it possible to stop retrying once the operator detects that automatic recovery is not safe?

Contributor Author

I don't think it's possible to stop retrying without adding additional logic to override the block, which the user would need to do manually. I don't think that would be a good improvement.

Copilot AI review requested due to automatic review settings May 14, 2026 07:32
mayankshah1607 previously approved these changes May 14, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated no new comments.

Files not reviewed (1)
  • pkg/apis/pxc/v1/zz_generated.deepcopy.go: Language not supported

Copilot AI review requested due to automatic review settings May 16, 2026 17:31

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • pkg/apis/pxc/v1/zz_generated.deepcopy.go: Language not supported

Comment on lines +234 to +238
if err := updateRecoveryStatus(ctx, r.client, cr, recoveryInfo, recoveryPod); err != nil {
	log.Error(err, "update recovery status")
	// we already signalled the pod
	// not returning here so we can sleep
}
Comment thread build/pxc-entrypoint.sh
echo 'Cluster will recover automatically from the crash now.'
echo 'If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:'
echo "kubectl -n $POD_NAMESPACE exec $(hostname) -c pxc -- sh -c 'kill -s USR1 1'"
#DO NOT CHANGE THE LINE BELOW. OUR AUTO-RECOVERY IS USING IT TO DETECT SEQNO OF CURRENT NODE. See K8SPXC-564

@JNKPercona (Collaborator)

Test Name Result Time
auto-tuning-8-0 passed 00:22:21
allocator-8-0 passed 00:14:46
allocator-8-4 passed 00:15:11
backup-storage-tls-8-0 passed 00:25:54
cross-site-8-0 passed 00:40:46
cross-site-proxysql-8-0 passed 00:45:09
cross-site-proxysql-8-4 passed 00:44:48
custom-users-8-0 passed 00:15:28
demand-backup-cloud-8-0 passed 01:03:40
demand-backup-cloud-8-4 passed 01:06:16
demand-backup-cloud-pxb-8-0 failure 00:42:30
demand-backup-encrypted-with-tls-5-7 passed 00:52:00
demand-backup-encrypted-with-tls-8-0 passed 00:52:15
demand-backup-encrypted-with-tls-8-4 passed 00:51:59
demand-backup-encrypted-with-tls-pxb-5-7 passed 00:22:00
demand-backup-encrypted-with-tls-pxb-8-0 passed 00:19:29
demand-backup-encrypted-with-tls-pxb-8-4 passed 00:20:00
demand-backup-8-0 passed 00:47:30
demand-backup-flow-control-8-0 passed 00:13:01
demand-backup-flow-control-8-4 passed 00:13:07
demand-backup-parallel-8-0 passed 00:12:52
demand-backup-parallel-8-4 passed 00:10:26
demand-backup-without-passwords-8-0 passed 00:20:03
demand-backup-without-passwords-8-4 passed 00:19:06
extra-pvc-8-0 passed 00:29:01
haproxy-5-7 passed 00:20:00
haproxy-8-0 passed 00:18:42
haproxy-8-4 passed 00:16:50
init-deploy-5-7 passed 00:18:20
init-deploy-8-0 passed 00:23:07
limits-8-0 passed 00:12:06
monitoring-2-0-8-0 passed 00:48:38
monitoring-pmm3-8-0 passed 00:22:23
monitoring-pmm3-8-4 passed 00:22:41
one-pod-5-7 passed 00:19:01
one-pod-8-0 passed 00:18:10
pitr-8-0 failure 00:58:40
pitr-8-4 passed 01:06:49
pitr-pxb-8-0 passed 01:06:17
pitr-pxb-8-4 failure 00:57:54
pitr-gap-errors-8-0 passed 00:55:46
pitr-gap-errors-8-4 passed 00:58:30
proxy-protocol-8-0 passed 00:11:19
proxy-switch-8-0 passed 00:16:16
proxysql-sidecar-res-limits-8-0 passed 00:09:16
proxysql-scheduler-8-0 passed 00:28:16
pvc-resize-5-7 passed 00:22:34
pvc-resize-8-0 passed 00:23:16
recreate-8-0 passed 00:22:37
restore-to-encrypted-cluster-8-0 passed 00:32:55
restore-to-encrypted-cluster-8-4 passed 00:27:36
restore-to-encrypted-cluster-pxb-8-0 passed 00:22:01
restore-to-encrypted-cluster-pxb-8-4 passed 00:17:22
scaling-proxysql-8-0 passed 00:09:48
scaling-8-0 passed 00:12:55
scheduled-backup-5-7 passed 01:03:56
scheduled-backup-8-0 failure 00:28:56
scheduled-backup-8-4 passed 01:13:23
security-context-8-0 passed 00:29:04
smart-update1-8-0 passed 00:37:17
smart-update1-8-4 passed 00:38:40
smart-update2-8-0 passed 00:46:03
smart-update2-8-4 passed 00:44:27
smart-update3-8-0 passed 00:20:52
sst-retry-limit-8-0 passed 00:19:12
sst-retry-limit-8-4 passed 00:15:40
storage-8-0 passed 00:11:23
tls-issue-cert-manager-ref-8-0 passed 00:14:59
tls-issue-cert-manager-8-0 passed 00:17:15
tls-issue-self-8-0 failure 00:13:28
upgrade-consistency-8-0 passed 00:13:39
upgrade-consistency-8-4 passed 00:14:44
upgrade-haproxy-5-7 passed 00:28:08
upgrade-haproxy-8-0 passed 00:27:13
upgrade-proxysql-5-7 passed 00:15:10
upgrade-proxysql-8-0 passed 00:21:33
users-5-7 failure 00:00:53
users-8-0 failure 00:00:49
users-scheduler-8-4 failure 00:00:53
validation-hook-8-0 passed 00:04:13
Summary Value
Tests Run 80/80
Job Duration 04:16:58
Total Test Time 37:22:08

commit: 3dc7f02
image: perconalab/percona-xtradb-cluster-operator:PR-2467-3dc7f023

Labels: size/L 100-499 lines

6 participants