Commit dc043b7
K8SPXC-1828: Make operator aware of last recovered seqno for auto recovery
For auto recovery from a full cluster crash, the operator selects the PXC
pod with the highest seqno (wsrep_last_applied) and re-bootstraps the
cluster from there. This logic is sound, but there is a problem: the
operator immediately forgets the position (uuid:seqno) it used to recover
the cluster.
Imagine this scenario:
Crash happens, pods report their positions:
pod-0 -> uuid:100
pod-1 -> uuid:97
pod-2 -> uuid:102
Operator picks pod-2 to recover.
Another crash happens, pods report their positions:
pod-0 -> uuid:91
pod-1 -> uuid:88
pod-2 -> uuid:89
Operator picks pod-0 to recover. But the position has actually regressed
(91 is lower than the 102 used in the previous recovery), and recovering
from this regressed position will result in data loss.
(Why wsrep_last_applied would regress is a question I don't know the
answer to, but we've seen it in highly unstable environments where the
operator needed to perform recovery repeatedly for prolonged periods.)
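The selection step in the scenario above can be sketched in Go as follows. This is only an illustration: `podPosition` and `pickHighest` are hypothetical names invented for this sketch, not the operator's actual types or functions.

```go
package main

import "fmt"

// podPosition is a hypothetical type for this sketch: the position
// (uuid:seqno) that a single PXC pod reports after a crash.
type podPosition struct {
	Pod   string
	UUID  string
	SeqNo int64
}

// pickHighest returns the reported position with the highest seqno.
// The boolean is false when no positions were reported.
func pickHighest(positions []podPosition) (podPosition, bool) {
	if len(positions) == 0 {
		return podPosition{}, false
	}
	best := positions[0]
	for _, p := range positions[1:] {
		if p.SeqNo > best.SeqNo {
			best = p
		}
	}
	return best, true
}

func main() {
	firstCrash := []podPosition{
		{Pod: "pod-0", UUID: "uuid", SeqNo: 100},
		{Pod: "pod-1", UUID: "uuid", SeqNo: 97},
		{Pod: "pod-2", UUID: "uuid", SeqNo: 102},
	}
	secondCrash := []podPosition{
		{Pod: "pod-0", UUID: "uuid", SeqNo: 91},
		{Pod: "pod-1", UUID: "uuid", SeqNo: 88},
		{Pod: "pod-2", UUID: "uuid", SeqNo: 89},
	}

	// Selection alone has no memory of the first recovery, so the second
	// pick is made without noticing that every seqno is now lower.
	for i, crash := range [][]podPosition{firstCrash, secondCrash} {
		if best, ok := pickHighest(crash); ok {
			fmt.Printf("crash %d: bootstrap from %s at %s:%d\n", i+1, best.Pod, best.UUID, best.SeqNo)
		}
	}
}
```

Because each call only sees the positions from the current crash, nothing in this step can detect a regression relative to an earlier recovery.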
With these changes, we are making the operator aware of the last
recovered position and adding a guardrail to the auto recovery logic.
The operator is going to store the last recovery information in
`.status.recovery`:
RecoveryStatus{
    clusterUUID       // Galera cluster UUID reported by the pod.
    lastRecoveryTime  // the time when the operator triggered the most recent recovery.
    lastRecoveryPod   // the pod the operator picked to bootstrap from (the one with the highest reported seqno).
    lastRecoverySeqNo // wsrep sequence number of the pod that was used to bootstrap.
}
This information will be used in subsequent recoveries to ensure the
recovery position doesn't regress. If it does, the operator will refuse
to perform the recovery itself. In this case, a human needs to step in
and recover the cluster manually.
The anti-regression guardrail depends on the fact that wsrep_last_applied
(seqno) is monotonic within the same cluster UUID. The operator always
recovers the cluster with the same UUID, so the UUID stays the same for
the whole lifecycle of a PXC cluster on K8s. But users may do something
manually that changes the cluster UUID. In this case, they will need to
update or clean up the last recovery info in the PerconaXtraDBCluster
object's status.
1 parent 98c52d2 commit dc043b7
11 files changed
Lines changed: 473 additions & 36 deletions
File tree
- build
- config/crd/bases
- deploy
- e2e-tests/tls-issue-cert-manager
- pkg
  - apis/pxc/v1
  - controller/pxc
Lines changed: 13 additions & 0 deletions