fix: persist member join status with conflict retry by weicao · Pull Request #10357 · apecloud/kubeblocks

weicao · 2026-06-11T07:26:34Z

Problem

Oracle 3-to-2 horizontal scaling hit a controller race in two reproduced runs. The component controller successfully executed memberJoin for the new pod, then the same reconcile lost the following InstanceSet annotation update to a resourceVersion conflict. A later scale-in read memberJoined=false from the stale replicas-status annotation and skipped memberLeave, leaving a stale Oracle DG broker member for the deleted pod.

Observed evidence from Oracle validation:

succeed to join member for pod ...-oracle-2
immediately followed by Operation cannot be fulfilled on instancesets... object has been modified
later scale-in logged joined replicas: [] while memberLeave was defined
no primary-side memberLeave kbagent action was executed

Fix

Add UpdateReplicaStatusWithRetry for InstanceSet replicas-status annotation writes.
After memberJoin succeeds, persist memberJoined=true through a fresh get/update retry loop instead of depending only on the later graph update.
Keep the in-memory proto InstanceSet status aligned for the current reconcile.

Review boundary

A broader scale-in gate for replicas with pending DataLoaded / MemberJoined lifecycle state was prototyped, but it is not included in the current PR head. Focused review found a real side-effect risk: a replica whose memberJoin permanently fails could be blocked from scale-in cleanup. That behavior change needs direct evidence or a bounded escape before it belongs in a controller fix.

The current PR is intentionally narrowed back to the confirmed low-risk persistence race.

Tests

make test-go-generate
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -run TestAPIs/replicas -count=1 -timeout=600s
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -run 'TestAPIs/transformer/workload/scale' -count=1 -timeout=600s
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -count=1 -timeout=600s
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -count=1 -timeout=600s
git diff --check

Validation boundary

This closes the controller persistence race for new memberJoin executions. It does not claim to solve every possible scale-in while scale-out lifecycle is still pending.

The previous release-1.0 backport runtime validation covered commit 20816453a171a2eab278ed6ef06402109dad95f6. The current PR head should remain draft until focused review and Oracle current-head runtime validation/owner confirmation close.

This is not an Oracle full acceptance result and not a release-ready claim.

Fixes #10359

apecloud-bot · 2026-06-11T07:26:43Z

Auto Cherry-pick Instructions

Usage:
  - /nopick: Not auto cherry-pick when PR merged.
  - /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch when PR merged.

Example:
  - /nopick
  - /pick release-1.1

CLA Recheck Instructions

Usage:
  - /recheck-cla: Trigger a re-check of CLA status for this pull request.
Example:
  - /recheck-cla

codecov · 2026-06-11T07:35:31Z

Codecov Report

❌ Patch coverage is 28.57143% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.20%. Comparing base (1d86fc0) to head (37d4229).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
...ps/component/transformer_component_workload_ops.go	0.00%	7 Missing ⚠️
pkg/controller/component/replicas.go	57.14%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10357      +/-   ##
==========================================
+ Coverage   53.15%   53.20%   +0.04%     
==========================================
  Files         533      533              
  Lines       63457    63470      +13     
==========================================
+ Hits        33733    33767      +34     
+ Misses      26277    26259      -18     
+ Partials     3447     3444       -3

Flag	Coverage Δ
unittests	`53.20% <28.57%> (+0.04%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

weicao · 2026-06-11T22:13:27Z

Runtime validation update for the release-1.0 backport path:

Scope:

PR fix: persist member join status with conflict retry #10357 head: 84c8e12
release-1.0 backport commit validated by Oracle team: 2081645
controller image: docker.io/library/kubeblocks:r10-memberjoin-retry-20816453-amd64
live imageID: sha256:680243a878442af10f193aea4cc1e02b60c670c8e66c50e374def0cf33be9397
controller pod startTime: 2026-06-11T20:56:56Z
controller restart count: 0
rollback verified to stock image apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/kubeblocks:1.0.3-beta.9

Result:

N=3 3->2 scale-in rounds passed
joined replicas was non-empty in all 3 rounds: ["ora-w-2800-oracle-2"]
succeed to call leave member action appeared in all 3 rounds
memberLeave skip count: 0
controller-window sweep observed 28 resourceVersion conflicts during the window
joined replicas: [] count in the sweep: 0
DGMGRL post-scale-in checks showed 2 members, no ORCLCDB_2 zombie member, and SUCCESS in all 3 rounds
cleanup/rollback completed; test namespace oracle-w10357 was deleted by the Oracle team

Evidence verified:

evidence tar sha256: 922462216d52690e73eb7b795a6ecbfc0b79a5f90b84558c9ab7ef37c91e9f4f
identity supplement tar sha256: 5959b80641358aa47b421d244d8efa4e6606a08925472bf6d4c30abca548ea9a
final MANIFEST.sha256 self sha256: 20160103442399c9726acbef7b50f89fbe718f570da6f436f4ed08933cd16c31
shasum -c MANIFEST.sha256 passed for the extracted evidence set

Boundary:

This is a focused release-1.0 backport runtime validation for the memberJoin status-conflict retry fix.
It is not an Oracle full acceptance result and not a release-ready claim.

leon-ape · 2026-06-12T02:05:05Z

 				joinErrors = append(joinErrors, fmt.Errorf("pod %s: %w", pod.Name, err))
 			} else {
+				key := types.NamespacedName{Namespace: r.protoITS.Namespace, Name: r.protoITS.Name}
+				if err := component.UpdateReplicaStatusWithRetry(r.transCtx.Context, r.cli, key, pod.Name, func(status *component.ReplicaStatus) error {


[P1] This writes the live InstanceSet from inside the transformer after memberJoin succeeds, which is the wrong layer for this failure. A conflict while persisting memberJoined is a normal reconcile failure: the controller should retry the whole reconcile and rely on the memberJoin action idempotency, not introduce a second write path outside the DAG/plan execution model. Keep replicas-status persistence in the normal DAG/update flow instead of special-casing this annotation write.

Agreed. Fixed in commit 07045ea.

Removed UpdateReplicaStatusWithRetry entirely — the plan object update (replicas.Status[i].MemberJoined = ptr.To(true)) goes through the normal DAG/update flow. If resourceVersion conflict occurs, the controller retries the whole reconcile.

memberJoin action idempotency is guaranteed by addon contract (kubeblocks-addon-docs addon-lifecycle-action-idempotent-nonblocking-best-practice-guide.md): if the member already exists, the action returns success. So a retry after conflict safely re-invokes memberJoin and the status converges.

Removed:

UpdateReplicaStatusWithRetry function from replicas.go

Direct write call in joinMember4ScaleOut()

conflictOnceClient test helper and retry test from replicas_test.go

Unused retry and types imports

leon-ape · 2026-06-12T02:05:15Z

[P1] The production failure is caused by accepting scale-in while the previous scale-out member lifecycle is still in progress, not by the resourceVersion conflict itself. If any replica still has pending DataLoaded or MemberJoined state, scale-in can delete it and skip or race the engine memberLeave path regardless of this retry. Add an operation/state gate so scale-in waits for scale-out member lifecycle completion, or define an explicit safe cancellation path, before deleting those replicas.

weicao · 2026-06-15T03:21:13Z

Handled Leon's P1 on the scale-in lifecycle gate.

What changed:

Added PendingLifecycleReplicas to detect replicas whose scale-out lifecycle is still pending (dataLoaded=false or memberJoined=false).
scaleIn() now requeues before deleting replicas-status or letting workload deletion continue when the target replica still has pending lifecycle state.
Kept the existing conflict-retry fix for persisting memberJoined=true after successful memberJoin.
Added a regression test for the pending-lifecycle scale-in gate.

How it was verified:

make test-go-generate
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -run TestAPIs/replicas -count=1 -timeout=600s
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -run 'TestAPIs/Component_Workload_Operations_Test/Member_Leave_Operations/should_wait_before_scale_in_when_target_replica_lifecycle_is_pending' -count=1 -timeout=600s
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -count=1 -timeout=600s
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -count=1 -timeout=600s
git diff --check

Commit / update point:

dacdc2913de252862e4599f0eb4310560fb6b637 (fix: gate scale-in on pending member lifecycle)

Boundary:

PR is back to draft.
Current head still needs fresh Oracle runtime validation before ready-for-review / merge.

weicao · 2026-06-15T04:02:09Z

Update after focused review of the broader scale-in gate:

What changed

Narrowed the PR back to the confirmed memberJoin status persistence race.
Removed the prototype scale-in gate that blocked deleting replicas with pending DataLoaded / MemberJoined state.
Kept UpdateReplicaStatusWithRetry and the post-memberJoin fresh get/update retry that persists memberJoined=true.

Verification

make test-go-generate
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -run TestAPIs/replicas -count=1 -timeout=600s
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -run 'TestAPIs/transformer/workload/scale' -count=1 -timeout=600s
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -count=1 -timeout=600s
KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -count=1 -timeout=600s
git diff --check

Commit / PR update

New head: 37d4229
Follow-up commit: 37d4229 fix: narrow member join status persistence

Boundary: this PR no longer claims to solve every scale-in-while-scale-out-lifecycle-pending case. The removed scale-in gate needs direct evidence or a bounded failed-join escape before it should be included.

Address Leon11 P1: UpdateReplicaStatusWithRetry wrote the live InstanceSet from inside the transformer, bypassing the DAG/plan execution model. ResourceVersion conflict is a normal reconcile failure — the controller retries the whole reconcile and the memberJoin action is idempotent (addon contract requirement), so the status converges naturally through the DAG update path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

weicao · 2026-06-23T21:41:40Z

Replaced by PR #10448 which implements the bounded scale-in gate with envtest coverage. This PR was net-zero after multiple iterations.

fix: persist member join status with conflict retry

84c8e12

github-actions Bot added the size/M Denotes a PR that changes 30-99 lines. label Jun 11, 2026

weicao marked this pull request as ready for review June 11, 2026 22:13

weicao requested review from a team and leon-ape as code owners June 11, 2026 22:13

leon-ape reviewed Jun 12, 2026

View reviewed changes

weicao marked this pull request as draft June 15, 2026 03:14

fix: gate scale-in on pending member lifecycle

dacdc29

github-actions Bot added size/L Denotes a PR that changes 100-499 lines. and removed size/M Denotes a PR that changes 30-99 lines. labels Jun 15, 2026

weicao added the nopick Not auto cherry-pick when PR merged label Jun 15, 2026

fix: narrow member join status persistence

37d4229

github-actions Bot added size/M Denotes a PR that changes 30-99 lines. and removed size/L Denotes a PR that changes 100-499 lines. labels Jun 15, 2026

github-actions Bot added size/XS Denotes a PR that changes 0-9 lines. and removed size/M Denotes a PR that changes 30-99 lines. labels Jun 23, 2026

weicao mentioned this pull request Jun 23, 2026

fix: gate scale-in on pending member lifecycle with bounded timeout #10448

Open

10 tasks

weicao closed this Jun 23, 2026

github-actions Bot added this to the Release 1.2.0 milestone Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: persist member join status with conflict retry#10357

fix: persist member join status with conflict retry#10357
weicao wants to merge 4 commits into
mainfrom
bugfix/memberjoin-status-conflict

weicao commented Jun 11, 2026 •

edited

Loading

Uh oh!

apecloud-bot commented Jun 11, 2026

Uh oh!

codecov Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

weicao commented Jun 11, 2026

Uh oh!

leon-ape Jun 12, 2026

Uh oh!

weicao Jun 23, 2026

Uh oh!

leon-ape commented Jun 12, 2026

Uh oh!

weicao commented Jun 15, 2026

Uh oh!

weicao commented Jun 15, 2026

Uh oh!

weicao commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

weicao commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Review boundary

Tests

Validation boundary

Uh oh!

apecloud-bot commented Jun 11, 2026

Uh oh!

codecov Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

weicao commented Jun 11, 2026

Uh oh!

leon-ape Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

weicao Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

leon-ape commented Jun 12, 2026

Uh oh!

weicao commented Jun 15, 2026

Uh oh!

weicao commented Jun 15, 2026

Uh oh!

weicao commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

weicao commented Jun 11, 2026 •

edited

Loading

codecov Bot commented Jun 11, 2026 •

edited

Loading