K8SPXC-1828: Make operator aware of last recovered seqno for auto recovery#2467
egegunes wants to merge 7 commits into
Conversation
Pull request overview
This PR improves full PXC cluster crash recovery by enriching the recovery marker with the Galera cluster UUID, adding safety checks to prevent bootstrapping from stale/unrelated data, and persisting the last recovery decision in CR status for future comparisons.
Changes:
- Extend the crash-recovery log marker to include `UUID:SEQNO` and update log parsing to support both new and legacy formats.
- Add a `status.recovery` field to the CRD and persist the last recovery pod/UUID/seqno/time after triggering bootstrap.
- Add a safety gate (`isAutomaticRecoverySafe`) to refuse automatic recovery on UUID mismatch or seqno regression.
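The parser itself isn't shown in this overview, but a minimal sketch of UUID-aware parsing with a legacy fallback might look like the following. The function name follows `parseRecoveredPosition` from the tests in the diff; the exact marker formats (`uuid:seqno` for new markers, a bare seqno for legacy ones) are assumptions based on the PR description:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRecoveredPosition extracts the Galera position from the crash-recovery
// marker. New markers carry "uuid:seqno"; legacy markers carry only the seqno.
// Sketch only: the real function lives in full_crash_recovery.go and may differ.
func parseRecoveredPosition(marker string) (uuid string, seq int64, err error) {
	marker = strings.TrimSpace(marker)
	if i := strings.LastIndex(marker, ":"); i >= 0 {
		// New format: <cluster UUID>:<seqno>.
		seq, err = strconv.ParseInt(marker[i+1:], 10, 64)
		if err != nil {
			return "", 0, fmt.Errorf("parse seqno: %w", err)
		}
		return marker[:i], seq, nil
	}
	// Legacy format: bare seqno, no UUID available.
	seq, err = strconv.ParseInt(marker, 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("parse seqno: %w", err)
	}
	return "", seq, nil
}

func main() {
	u, s, _ := parseRecoveredPosition("some-cluster-uuid:102")
	fmt.Println(u, s) // some-cluster-uuid 102
	_, s2, _ := parseRecoveredPosition("97")
	fmt.Println(s2) // 97
}
```

Returning an empty UUID for legacy markers lets the caller skip the UUID comparison when only a seqno is available.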
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/controller/pxc/status.go | Adjusts recovery-log parsing call site for the updated return signature. |
| pkg/controller/pxc/replication.go | Skips replication reconcile when no proxy is enabled, avoiding proxy-dependent primary detection. |
| pkg/controller/pxc/full_crash_recovery.go | Implements UUID-aware parsing, recovery safety checks, eventing, and status persistence. |
| pkg/controller/pxc/full_crash_recovery_test.go | Adds unit tests for log parsing and safety-check behavior. |
| pkg/apis/pxc/v1/pxc_types.go | Introduces RecoveryStatus and wires it into cluster status. |
| pkg/apis/pxc/v1/zz_generated.deepcopy.go | Adds generated deepcopy support for the new status type. |
| deploy/cw-bundle.yaml | Updates CRD schema to expose status.recovery. |
| deploy/crd.yaml | Updates CRD schema to expose status.recovery. |
| deploy/bundle.yaml | Updates CRD schema to expose status.recovery. |
| config/crd/bases/pxc.percona.com_perconaxtradbclusters.yaml | Updates CRD schema to expose status.recovery. |
| build/pxc-entrypoint.sh | Emits UUID in crash marker and adds an informational “Cluster UUID” log line. |
For auto recovery from a full cluster crash, the operator selects the PXC pod
with the highest seqno (wsrep_last_applied) and re-bootstraps the cluster
from there. This logic is sound, but there's a problem: the operator
immediately forgets the position (uuid:seqno) it used to recover the
cluster.
Imagine the scenario:
Crash happens, pods report their positions:
pod-0 -> uuid:100
pod-1 -> uuid:97
pod-2 -> uuid:102
Operator picks pod-2 to recover.
Another crash happens, pods report their positions:
pod-0 -> uuid:91
pod-1 -> uuid:88
pod-2 -> uuid:89
The operator picks pod-0 to recover. But the position has actually regressed,
and recovering from this regressed position will result in data loss.
(Why wsrep_last_applied would regress is a question I don't know the
answer to, but we've seen it in highly unstable environments where the
operator needed to perform recovery repeatedly for prolonged periods.)
With these changes, we make the operator aware of the last recovered
position and add a guardrail to the auto recovery logic.
The operator will store the last recovery information in
`.status.recovery`:
```go
RecoveryStatus{
	clusterUUID       // Galera cluster UUID reported by the pod.
	lastRecoveryTime  // the time when the operator triggered the most recent recovery.
	lastRecoveryPod   // the pod the operator picked to bootstrap from (the one with the highest reported seqno).
	lastRecoverySeqNo // wsrep sequence number of the pod that was used to bootstrap.
}
```
This information will be used in subsequent recoveries to ensure the
recovery position doesn't regress. If it does, the operator will refuse
to do the recovery itself. In that case, a human needs to step in and
do the recovery manually.
The anti-regression guardrail depends on the fact that wsrep_last_applied
(seqno) is monotonic for the same cluster UUID. The operator always
recovers the cluster with the same UUID, so the UUID stays the same for the
whole lifecycle of a PXC cluster on K8s. But users may do something manually
that changes the cluster UUID; in that case they will need to update or clean
up the last recovery info in the PerconaXtraDBCluster object's status.
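The guardrail described above can be sketched as a pure function over the stored status and the newly reported position. The names follow `RecoveryStatus` and `isAutomaticRecoverySafe` from the diff, but the exact signature and the stand-alone types here are assumptions for illustration:

```go
package main

import "fmt"

// RecoveryStatus mirrors the new .status.recovery field described above.
type RecoveryStatus struct {
	ClusterUUID       string
	LastRecoveryPod   string
	LastRecoverySeqNo int64
}

// isAutomaticRecoverySafe refuses automatic recovery when the candidate
// position could lose data: either the cluster UUID changed (positions are
// not comparable across UUIDs) or the seqno regressed below the last
// recovery point. Sketch only; the real check lives in full_crash_recovery.go.
func isAutomaticRecoverySafe(last *RecoveryStatus, uuid string, seq int64) (bool, string) {
	if last == nil {
		return true, "" // no previous recovery recorded, nothing to compare
	}
	if last.ClusterUUID != uuid {
		return false, fmt.Sprintf("cluster UUID changed: %s -> %s", last.ClusterUUID, uuid)
	}
	if seq < last.LastRecoverySeqNo {
		return false, fmt.Sprintf("seqno regressed: %d < %d", seq, last.LastRecoverySeqNo)
	}
	return true, ""
}

func main() {
	// The second crash in the scenario above: last recovery was pod-2 at 102,
	// and now the best candidate only reports 91.
	last := &RecoveryStatus{ClusterUUID: "uuid", LastRecoveryPod: "pod-2", LastRecoverySeqNo: 102}
	ok, reason := isAutomaticRecoverySafe(last, "uuid", 91)
	fmt.Println(ok, reason) // false seqno regressed: 91 < 102
}
```

Returning a reason string alongside the verdict makes it easy to reuse the same message for both the Kubernetes event and the returned error.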
```go
for name, tt := range tests {
	t.Run(name, func(t *testing.T) {
		uuid, seq, err := parseRecoveredPosition(tt.log)
		if tt.wantErr {
```
`wantErr` is completely unneeded. We can determine the need to assert an error through `errSubstr`.
```go
}

func (r *ReconcilePerconaXtraDBCluster) doFullCrashRecovery(ctx context.Context, cr *pxcv1.PerconaXtraDBCluster) error {
	maxSeq := int64(-100)
```
Why `-100` and not something like `math.MinInt64`?
```shell
wait_for_running "$cluster-haproxy" 1
wait_for_running "$cluster-pxc" 3
```
Is this related to the changes in this PR?
Not really, but the tls-issue-cert-manager failure was blocking this PR.
The issue was introduced by 7f885a1; it probably slipped through in one of the retries and we merged it. In the test, to rotate the certificates, we delete the pods altogether and then run `wait_for_running`, which runs `wait_pod`, which checks for a full cluster crash and exits immediately if it detects one. But we're causing the crash deliberately, so I removed these calls and opted for waiting on cluster readiness only.
```go
orig := cr.DeepCopy()

cr.Status.Recovery = &pxcv1.RecoveryStatus{
	ClusterUUID:       info.uuid,
	LastRecoverySeqNo: info.seq,
	LastRecoveryPod:   recoveryPod,
	LastRecoveryTime:  metav1.Now(),
}

return cl.Status().Patch(ctx, cr, client.MergeFrom(orig))
```
Do we need to wrap this in a retry block? Otherwise the status may not be written if there are transient failures like conflicts
Patch is not prone to conflict errors, so I don't think it's needed.
```go
	recoveryInfo.seq,
)
r.recorder.Event(cr, corev1.EventTypeWarning, "AutomaticRecoveryRefused", msg)
return errors.New(msg)
```
If we return an error here, won't the reconciler keep retrying this code path and flood with events? Is it possible to stop retrying once the operator detects that automatic recovery is not safe?
I don't think it's possible to stop retrying without adding additional logic to override the block, which the user would need to trigger manually. I don't think it would be a good improvement.
```go
if err := updateRecoveryStatus(ctx, r.client, cr, recoveryInfo, recoveryPod); err != nil {
	log.Error(err, "update recovery status")
	// we already signalled the pod
	// not returning here so we can sleep
}
```
```shell
echo 'Cluster will recover automatically from the crash now.'
echo 'If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:'
echo "kubectl -n $POD_NAMESPACE exec $(hostname) -c pxc -- sh -c 'kill -s USR1 1'"
# DO NOT CHANGE THE LINE BELOW. OUR AUTO-RECOVERY IS USING IT TO DETECT SEQNO OF CURRENT NODE. See K8SPXC-564
```
commit: 3dc7f02 |