Skip to content

fix: persist member join status with conflict retry#10357

Closed
weicao wants to merge 4 commits into
mainfrom
bugfix/memberjoin-status-conflict
Closed

fix: persist member join status with conflict retry#10357
weicao wants to merge 4 commits into
mainfrom
bugfix/memberjoin-status-conflict

Conversation

@weicao

@weicao weicao commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Problem

Oracle 3-to-2 horizontal scaling hit a controller race in two reproduced runs. The component controller successfully executed memberJoin for the new pod, then the same reconcile lost the following InstanceSet annotation update to a resourceVersion conflict. A later scale-in read memberJoined=false from the stale replicas-status annotation and skipped memberLeave, leaving a stale Oracle DG broker member for the deleted pod.

Observed evidence from Oracle validation:

  • succeed to join member for pod ...-oracle-2
  • immediately followed by Operation cannot be fulfilled on instancesets... object has been modified
  • later scale-in logged joined replicas: [] while memberLeave was defined
  • no primary-side memberLeave kbagent action was executed

Fix

  • Add UpdateReplicaStatusWithRetry for InstanceSet replicas-status annotation writes.
  • After memberJoin succeeds, persist memberJoined=true through a fresh get/update retry loop instead of depending only on the later graph update.
  • Keep the in-memory proto InstanceSet status aligned for the current reconcile.

Review boundary

A broader scale-in gate for replicas with pending DataLoaded / MemberJoined lifecycle state was prototyped, but it is not included in the current PR head. Focused review found a real side-effect risk: a replica whose memberJoin permanently fails could be blocked from scale-in cleanup. That behavior change needs direct evidence or a bounded escape before it belongs in a controller fix.

The current PR is intentionally narrowed back to the confirmed low-risk persistence race.

Tests

  • make test-go-generate
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -run TestAPIs/replicas -count=1 -timeout=600s
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -run 'TestAPIs/transformer/workload/scale' -count=1 -timeout=600s
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -count=1 -timeout=600s
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -count=1 -timeout=600s
  • git diff --check

Validation boundary

This closes the controller persistence race for new memberJoin executions. It does not claim to solve every possible scale-in while scale-out lifecycle is still pending.

The previous release-1.0 backport runtime validation covered commit 20816453a171a2eab278ed6ef06402109dad95f6. The current PR head should remain draft until focused review and Oracle current-head runtime validation/owner confirmation close.

This is not an Oracle full acceptance result and not a release-ready claim.

Fixes #10359

@apecloud-bot

Copy link
Copy Markdown
Collaborator

Auto Cherry-pick Instructions

Usage:
  - /nopick: Not auto cherry-pick when PR merged.
  - /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch when PR merged.

Example:
  - /nopick
  - /pick release-1.1

CLA Recheck Instructions

Usage:
  - /recheck-cla: Trigger a re-check of CLA status for this pull request.
Example:
  - /recheck-cla

@github-actions github-actions Bot added the size/M Denotes a PR that changes 30-99 lines. label Jun 11, 2026
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 28.57143% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.20%. Comparing base (1d86fc0) to head (37d4229).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...ps/component/transformer_component_workload_ops.go 0.00% 7 Missing ⚠️
pkg/controller/component/replicas.go 57.14% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10357      +/-   ##
==========================================
+ Coverage   53.15%   53.20%   +0.04%     
==========================================
  Files         533      533              
  Lines       63457    63470      +13     
==========================================
+ Hits        33733    33767      +34     
+ Misses      26277    26259      -18     
+ Partials     3447     3444       -3     
Flag Coverage Δ
unittests 53.20% <28.57%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@weicao

weicao commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Runtime validation update for the release-1.0 backport path:

Scope:

  • PR fix: persist member join status with conflict retry #10357 head: 84c8e12
  • release-1.0 backport commit validated by Oracle team: 2081645
  • controller image: docker.io/library/kubeblocks:r10-memberjoin-retry-20816453-amd64
  • live imageID: sha256:680243a878442af10f193aea4cc1e02b60c670c8e66c50e374def0cf33be9397
  • controller pod startTime: 2026-06-11T20:56:56Z
  • controller restart count: 0
  • rollback verified to stock image apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/kubeblocks:1.0.3-beta.9

Result:

  • N=3 3->2 scale-in rounds passed
  • joined replicas was non-empty in all 3 rounds: ["ora-w-2800-oracle-2"]
  • succeed to call leave member action appeared in all 3 rounds
  • memberLeave skip count: 0
  • controller-window sweep observed 28 resourceVersion conflicts during the window
  • joined replicas: [] count in the sweep: 0
  • DGMGRL post-scale-in checks showed 2 members, no ORCLCDB_2 zombie member, and SUCCESS in all 3 rounds
  • cleanup/rollback completed; test namespace oracle-w10357 was deleted by the Oracle team

Evidence verified:

  • evidence tar sha256: 922462216d52690e73eb7b795a6ecbfc0b79a5f90b84558c9ab7ef37c91e9f4f
  • identity supplement tar sha256: 5959b80641358aa47b421d244d8efa4e6606a08925472bf6d4c30abca548ea9a
  • final MANIFEST.sha256 self sha256: 20160103442399c9726acbef7b50f89fbe718f570da6f436f4ed08933cd16c31
  • shasum -c MANIFEST.sha256 passed for the extracted evidence set

Boundary:

  • This is a focused release-1.0 backport runtime validation for the memberJoin status-conflict retry fix.
  • It is not an Oracle full acceptance result and not a release-ready claim.

@weicao weicao marked this pull request as ready for review June 11, 2026 22:13
@weicao weicao requested review from a team and leon-ape as code owners June 11, 2026 22:13
joinErrors = append(joinErrors, fmt.Errorf("pod %s: %w", pod.Name, err))
} else {
key := types.NamespacedName{Namespace: r.protoITS.Namespace, Name: r.protoITS.Name}
if err := component.UpdateReplicaStatusWithRetry(r.transCtx.Context, r.cli, key, pod.Name, func(status *component.ReplicaStatus) error {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] This writes the live InstanceSet from inside the transformer after memberJoin succeeds, which is the wrong layer for this failure. A conflict while persisting memberJoined is a normal reconcile failure: the controller should retry the whole reconcile and rely on the memberJoin action idempotency, not introduce a second write path outside the DAG/plan execution model. Keep replicas-status persistence in the normal DAG/update flow instead of special-casing this annotation write.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Fixed in commit 07045ea.

Removed UpdateReplicaStatusWithRetry entirely — the plan object update (replicas.Status[i].MemberJoined = ptr.To(true)) goes through the normal DAG/update flow. If resourceVersion conflict occurs, the controller retries the whole reconcile.

memberJoin action idempotency is guaranteed by addon contract (kubeblocks-addon-docs addon-lifecycle-action-idempotent-nonblocking-best-practice-guide.md): if the member already exists, the action returns success. So a retry after conflict safely re-invokes memberJoin and the status converges.

Removed:

  • UpdateReplicaStatusWithRetry function from replicas.go
  • Direct write call in joinMember4ScaleOut()
  • conflictOnceClient test helper and retry test from replicas_test.go
  • Unused retry and types imports

@leon-ape

Copy link
Copy Markdown
Contributor

[P1] The production failure is caused by accepting scale-in while the previous scale-out member lifecycle is still in progress, not by the resourceVersion conflict itself. If any replica still has pending DataLoaded or MemberJoined state, scale-in can delete it and skip or race the engine memberLeave path regardless of this retry. Add an operation/state gate so scale-in waits for scale-out member lifecycle completion, or define an explicit safe cancellation path, before deleting those replicas.

@weicao weicao marked this pull request as draft June 15, 2026 03:14
@github-actions github-actions Bot added size/L Denotes a PR that changes 100-499 lines. and removed size/M Denotes a PR that changes 30-99 lines. labels Jun 15, 2026
@weicao

weicao commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

Handled Leon's P1 on the scale-in lifecycle gate.

What changed:

  • Added PendingLifecycleReplicas to detect replicas whose scale-out lifecycle is still pending (dataLoaded=false or memberJoined=false).
  • scaleIn() now requeues before deleting replicas-status or letting workload deletion continue when the target replica still has pending lifecycle state.
  • Kept the existing conflict-retry fix for persisting memberJoined=true after successful memberJoin.
  • Added a regression test for the pending-lifecycle scale-in gate.

How it was verified:

  • make test-go-generate
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -run TestAPIs/replicas -count=1 -timeout=600s
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -run 'TestAPIs/Component_Workload_Operations_Test/Member_Leave_Operations/should_wait_before_scale_in_when_target_replica_lifecycle_is_pending' -count=1 -timeout=600s
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -count=1 -timeout=600s
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -count=1 -timeout=600s
  • git diff --check

Commit / update point:

  • dacdc2913de252862e4599f0eb4310560fb6b637 (fix: gate scale-in on pending member lifecycle)

Boundary:

  • PR is back to draft.
  • Current head still needs fresh Oracle runtime validation before ready-for-review / merge.

@weicao weicao added the nopick Not auto cherry-pick when PR merged label Jun 15, 2026
@github-actions github-actions Bot added size/M Denotes a PR that changes 30-99 lines. and removed size/L Denotes a PR that changes 100-499 lines. labels Jun 15, 2026
@weicao

weicao commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

Update after focused review of the broader scale-in gate:

  1. What changed
  • Narrowed the PR back to the confirmed memberJoin status persistence race.
  • Removed the prototype scale-in gate that blocked deleting replicas with pending DataLoaded / MemberJoined state.
  • Kept UpdateReplicaStatusWithRetry and the post-memberJoin fresh get/update retry that persists memberJoined=true.
  1. Verification
  • make test-go-generate
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -run TestAPIs/replicas -count=1 -timeout=600s
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -run 'TestAPIs/transformer/workload/scale' -count=1 -timeout=600s
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/controller/component -count=1 -timeout=600s
  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./controllers/apps/component -count=1 -timeout=600s
  • git diff --check
  1. Commit / PR update
  • New head: 37d4229
  • Follow-up commit: 37d4229 fix: narrow member join status persistence

Boundary: this PR no longer claims to solve every scale-in-while-scale-out-lifecycle-pending case. The removed scale-in gate needs direct evidence or a bounded failed-join escape before it should be included.

Address Leon11 P1: UpdateReplicaStatusWithRetry wrote the live
InstanceSet from inside the transformer, bypassing the DAG/plan
execution model. ResourceVersion conflict is a normal reconcile
failure — the controller retries the whole reconcile and the
memberJoin action is idempotent (addon contract requirement),
so the status converges naturally through the DAG update path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added size/XS Denotes a PR that changes 0-9 lines. and removed size/M Denotes a PR that changes 30-99 lines. labels Jun 23, 2026
@weicao

weicao commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Replaced by PR #10448 which implements the bounded scale-in gate with envtest coverage. This PR was net-zero after multiple iterations.

@weicao weicao closed this Jun 23, 2026
@github-actions github-actions Bot added this to the Release 1.2.0 milestone Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

nopick Not auto cherry-pick when PR merged size/XS Denotes a PR that changes 0-9 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

memberJoin success can lose joined status on conflict

3 participants