fix: persist member join status with conflict retry #10357
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #10357 +/- ##
==========================================
+ Coverage 53.15% 53.20% +0.04%
==========================================
Files 533 533
Lines 63457 63470 +13
==========================================
+ Hits 33733 33767 +34
+ Misses 26277 26259 -18
+ Partials 3447 3444 -3
Runtime validation update for the release-1.0 backport path: Scope:
Result:
Evidence verified:
Boundary:
	joinErrors = append(joinErrors, fmt.Errorf("pod %s: %w", pod.Name, err))
} else {
	key := types.NamespacedName{Namespace: r.protoITS.Namespace, Name: r.protoITS.Name}
	if err := component.UpdateReplicaStatusWithRetry(r.transCtx.Context, r.cli, key, pod.Name, func(status *component.ReplicaStatus) error {
[P1] This writes the live InstanceSet from inside the transformer after memberJoin succeeds, which is the wrong layer for this failure. A conflict while persisting memberJoined is a normal reconcile failure: the controller should retry the whole reconcile and rely on the memberJoin action idempotency, not introduce a second write path outside the DAG/plan execution model. Keep replicas-status persistence in the normal DAG/update flow instead of special-casing this annotation write.
Agreed. Fixed in commit 07045ea.
Removed UpdateReplicaStatusWithRetry entirely — the plan object update (replicas.Status[i].MemberJoined = ptr.To(true)) goes through the normal DAG/update flow. If resourceVersion conflict occurs, the controller retries the whole reconcile.
memberJoin action idempotency is guaranteed by addon contract (kubeblocks-addon-docs addon-lifecycle-action-idempotent-nonblocking-best-practice-guide.md): if the member already exists, the action returns success. So a retry after conflict safely re-invokes memberJoin and the status converges.
Removed:
- UpdateReplicaStatusWithRetry function from replicas.go
- Direct write call in joinMember4ScaleOut()
- conflictOnceClient test helper and retry test from replicas_test.go
- Unused retry and types imports
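The convergence argument above can be illustrated with a toy model. This is a minimal sketch, not KubeBlocks code: `engine`, `store`, `reconcile`, and `runUntilSettled` are hypothetical stand-ins for the database membership, the InstanceSet update path, one reconcile pass, and the controller's requeue behavior. It assumes that a conflict fails the whole pass and that memberJoin is idempotent, as the addon contract cited above requires.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a resourceVersion conflict (hypothetical).
var errConflict = errors.New("conflict: object has been modified")

// engine is a toy stand-in for the database cluster membership.
type engine struct {
	members map[string]bool
	calls   int
}

// memberJoin is idempotent: joining an existing member still succeeds.
func (e *engine) memberJoin(pod string) error {
	e.calls++
	e.members[pod] = true
	return nil
}

// store is a toy object store that rejects the first conflictsLeft updates.
type store struct {
	conflictsLeft int
	memberJoined  map[string]bool
}

func (s *store) update(pod string, joined bool) error {
	if s.conflictsLeft > 0 {
		s.conflictsLeft--
		return errConflict
	}
	s.memberJoined[pod] = joined
	return nil
}

// reconcile runs one pass: join the member, then persist the status
// through the single update path. Any error fails the whole pass.
func reconcile(e *engine, s *store, pod string) error {
	if s.memberJoined[pod] {
		return nil // status already persisted, nothing to do
	}
	if err := e.memberJoin(pod); err != nil {
		return fmt.Errorf("pod %s: %w", pod, err)
	}
	return s.update(pod, true)
}

// runUntilSettled re-runs the whole reconcile after a conflict,
// the way a controller requeue would, up to maxPasses passes.
func runUntilSettled(e *engine, s *store, pod string, maxPasses int) (int, error) {
	for pass := 1; pass <= maxPasses; pass++ {
		err := reconcile(e, s, pod)
		if err == nil {
			return pass, nil
		}
		if !errors.Is(err, errConflict) {
			return pass, err
		}
	}
	return maxPasses, errors.New("did not converge")
}

func main() {
	e := &engine{members: map[string]bool{}}
	s := &store{conflictsLeft: 1, memberJoined: map[string]bool{}}
	passes, err := runUntilSettled(e, s, "oracle-2", 3)
	fmt.Println(passes, err, e.calls, s.memberJoined["oracle-2"])
}
```

With one injected conflict, the second pass calls memberJoin again and then persists the status, so the stored state converges without a second write path. The cost is one extra memberJoin invocation, which is only safe because the action is idempotent.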
[P1] The production failure is caused by accepting scale-in while the previous scale-out member lifecycle is still in progress, not by the resourceVersion conflict itself. If any replica still has pending DataLoaded or MemberJoined state, scale-in can delete it and skip or race the engine memberLeave path regardless of this retry. Add an operation/state gate so scale-in waits for scale-out member lifecycle completion, or define an explicit safe cancellation path, before deleting those replicas.
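One way to read the requested gate is as a precondition that scale-in checks before deleting anything. The sketch below is illustrative only: `replicaStatus`, `pendingLifecycle`, and `checkScaleIn` are hypothetical names, and it assumes that a nil field means the lifecycle step does not apply to that replica while a false value means the step is still pending.

```go
package main

import "fmt"

// replicaStatus is a simplified, hypothetical mirror of per-replica
// lifecycle state. Assumption: nil means "step not applicable",
// false means "step still pending".
type replicaStatus struct {
	Name         string
	DataLoaded   *bool
	MemberJoined *bool
}

// isPending reports whether a lifecycle step applies and has not completed.
func isPending(step *bool) bool { return step != nil && !*step }

// pendingLifecycle returns the names of replicas whose scale-out
// lifecycle (data load or member join) has not finished.
func pendingLifecycle(replicas []replicaStatus) []string {
	var names []string
	for _, r := range replicas {
		if isPending(r.DataLoaded) || isPending(r.MemberJoined) {
			names = append(names, r.Name)
		}
	}
	return names
}

// checkScaleIn returns an error while any replica is mid-lifecycle,
// so the caller can defer the scale-in and retry later.
func checkScaleIn(replicas []replicaStatus) error {
	if names := pendingLifecycle(replicas); len(names) > 0 {
		return fmt.Errorf("scale-in deferred: scale-out lifecycle pending for %v", names)
	}
	return nil
}

func boolPtr(b bool) *bool { return &b }

func main() {
	replicas := []replicaStatus{
		{Name: "oracle-0", MemberJoined: boolPtr(true)},
		{Name: "oracle-2", DataLoaded: boolPtr(true), MemberJoined: boolPtr(false)},
	}
	fmt.Println(checkScaleIn(replicas)) // deferred: oracle-2 has not joined yet

	replicas[1].MemberJoined = boolPtr(true)
	fmt.Println(checkScaleIn(replicas)) // gate opens once the join is recorded
}
```

A gate of this shape has no exit for a replica whose join never completes, so it would need a cancellation or timeout path before it could be used safely.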
Handled Leon's P1 on the scale-in lifecycle gate. What changed:
How it was verified:
Commit / update point:
Boundary:
Update after focused review of the broader scale-in gate:
Boundary: this PR no longer claims to solve every scale-in-while-scale-out-lifecycle-pending case. The removed scale-in gate needs direct evidence or a bounded failed-join escape before it should be included.
Address Leon11 P1: UpdateReplicaStatusWithRetry wrote the live InstanceSet from inside the transformer, bypassing the DAG/plan execution model. ResourceVersion conflict is a normal reconcile failure — the controller retries the whole reconcile and the memberJoin action is idempotent (addon contract requirement), so the status converges naturally through the DAG update path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaced by PR #10448, which implements the bounded scale-in gate with envtest coverage. This PR was net-zero after multiple iterations.
Problem
Oracle 3-to-2 horizontal scaling hit a controller race in two reproduced runs. The component controller successfully executed memberJoin for the new pod, then the same reconcile lost the following InstanceSet annotation update to a resourceVersion conflict. A later scale-in read memberJoined=false from the stale replicas-status annotation and skipped memberLeave, leaving a stale Oracle DG broker member for the deleted pod.
Observed evidence from Oracle validation:
- succeed to join member for pod ...-oracle-2
- Operation cannot be fulfilled on instancesets... object has been modified
- joined replicas: [] while memberLeave was defined
- no memberLeave kbagent action was executed
Fix
- Add UpdateReplicaStatusWithRetry for InstanceSet replicas-status annotation writes.
- After memberJoin succeeds, persist memberJoined=true through a fresh get/update retry loop instead of depending only on the later graph update.
Review boundary
A broader scale-in gate for replicas with pending DataLoaded/MemberJoined lifecycle state was prototyped, but it is not included in the current PR head. Focused review found a real side-effect risk: a replica whose memberJoin permanently fails could be blocked from scale-in cleanup. That behavior change needs direct evidence or a bounded escape before it belongs in a controller fix.
The current PR is intentionally narrowed back to the confirmed low-risk persistence race.
Tests
- make test-go-generate
- KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -run TestAPIs/replicas -count=1 -timeout=600s
- KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -run 'TestAPIs/transformer/workload/scale' -count=1 -timeout=600s
- KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -count=1 -timeout=600s
- KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -count=1 -timeout=600s
- git diff --check
Validation boundary
This closes the controller persistence race for new memberJoin executions. It does not claim to solve every possible scale-in while scale-out lifecycle is still pending.
The previous release-1.0 backport runtime validation covered commit 20816453a171a2eab278ed6ef06402109dad95f6. The current PR head should remain draft until focused review and Oracle current-head runtime validation/owner confirmation close.
This is not an Oracle full acceptance result and not a release-ready claim.
Fixes #10359