Skip to content

chore: patch metadata-only pod updates#10398

Closed
weicao wants to merge 2 commits into
mainfrom
bugfix/instanceset-normal-update-before-resize
Closed

chore: patch metadata-only pod updates#10398
weicao wants to merge 2 commits into
mainfrom
bugfix/instanceset-normal-update-before-resize

Conversation

@weicao

@weicao weicao commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Problem

Targets #10369.

This PR is stacked on #10394 so the resource-comparison fix stays separate from the config-hash/status convergence fix.

After a successful reconfigure, the remaining in-place Pod diff can be only metadata such as the config-hash annotation. Planning that as a full Pod update can race with concurrent status writes and leave the config-hash/status stale, which keeps the OpsRequest running even though the database-side reconfigure action returned OK.

The previous head 04f6d7e90770ed07d32a9c9b9435cab45782692d used a normal-update-before-resize approach and failed the Redis Parameters x replication-twemproxy focused runtime gate. This head replaces that attempt with the narrower metadata-only patch path.

Changes

  • Add an explicit ObjectTree option for planning PATCH actions.
  • Use PATCH for safe metadata-only in-place Pod updates in the InstanceSet and Instance reconcilers.
  • Keep resource-only changes on the /resize subresource path.
  • Keep non-metadata Pod changes on the existing switchover/full-update path.
  • Add kubebuilderx plan-builder coverage and update the focused InstanceSet test to assert metadata-only PATCH planning.

Tests

  • go test ./pkg/controller/kubebuilderx ./pkg/controller/instance ./pkg/controller/instanceset -count=1
  • git diff --check origin/bugfix/instanceset-resize-resource-compare..HEAD
  • CI for head 31908d0cb7aa64c79c882e6ee71dfbce0e668516: green / expected-skip, including make-test and push-pre-check (test).

Runtime status

  • Previous head 04f6d7e90770ed07d32a9c9b9435cab45782692d: exact-image Redis Parameters x replication-twemproxy focused gate failed, PASS 14 / FAIL 1 / SKIP 0; OpsRequest stayed Running; Component/InstanceSet status and config-hash did not converge.
  • Current head 31908d0cb7aa64c79c882e6ee71dfbce0e668516: exact-image Redis Parameters x replication-twemproxy focused gate passed, PASS 34 / FAIL 0 / SKIP 0.
  • Runtime identity for the passing run: local sideload image built from this head, imagePullPolicy: Never, manager pod kubeblocks-6dfbccf448-8vbpw, live imageID sha256:80435fad50cb92d42df0aaff1bb7a23a8650559d44ff8c4f0bbb1f7ae15ff523, restartCount 0.
  • Convergence evidence: 3 reconfigure OpsRequests reached Succeed; Cluster rds-params-twemproxy reached Running; Redis Component generation/observedGeneration reached 5/5; Redis InstanceSet generation/observedGeneration reached 4/4 with ready/updated/available replicas 2/2 and configHash 6b4845ccf7.
  • Data-path evidence: both Redis pods reported maxmemory-policy=allkeys-lru, hz=50, and databases=32; dynamic changes kept pod UIDs unchanged; the static databases change rolled pods as expected; twemproxy SET/GET and replica readback passed.
  • Cleanup evidence: namespace redis-params-twemproxy-r12-31908-r1 was NotFound after cleanup and PV residue was empty.

Boundaries

  • Ready for human review for this stacked delta.
  • Stacked on fix: converge omitted-request resize resources #10394; final merge path still depends on accepting, rebasing, or retargeting the base chain.
  • This is patch-version focused validation for Redis Parameters x replication-twemproxy only, not a Redis addon release-ready claim or a broader matrix result.

When a desired pod template sets CPU or memory limits but omits requests, Kubernetes can preserve a live request value defaulted from an older limit. The in-place resource comparison should not infer a desired request from the desired limit in that case; otherwise a post-resize reconcile keeps reporting a request difference even though the desired template does not own that request.

Compare CPU and memory requests only when the desired template explicitly sets that request. Continue to compare limits normally.
@codecov

codecov Bot commented Jun 17, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 79.54545% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.90%. Comparing base (00dc1b7) to head (31908d0).

Files with missing lines Patch % Lines
pkg/controller/instance/reconciler_update.go 0.00% 7 Missing ⚠️
pkg/controller/instanceset/reconciler_update.go 71.42% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10398      +/-   ##
==========================================
+ Coverage   61.87%   61.90%   +0.02%     
==========================================
  Files         533      533              
  Lines       63609    63623      +14     
==========================================
+ Hits        39360    39384      +24     
+ Misses      20661    20651      -10     
  Partials     3588     3588              
Flag Coverage Δ
unittests 61.90% <79.54%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@weicao

weicao commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Runtime validation for head 04f6d7e90770ed07d32a9c9b9435cab45782692d is FAIL.

Exact image / controller identity:

  • Image: kubeblocks:pr-10398-04f6d7e-stella with imagePullPolicy: Never.
  • Image tar sha256 used for sideload: dbb5ca2ab49bbeedbacf8f0d63d46d8d414a0f76071c791912bc53adf32aca2e.
  • Live manager pod: kubeblocks-7dcfbd744c-5k278, imageID sha256:73f857a01e42c780257ac4755dd01b26bed8591e50b15e71f43c52bb4a98d8a3, restartCount 0, startTime 2026-06-17T17:07:29Z.
  • The tag was imported on all 3 vcluster nodes.

Focused gate result:

  • Redis Parameters x replication-twemproxy: PASS 14 / FAIL 1 / SKIP 0.
  • Failed case: A02 dynamic maxmemory-policy.
  • OpsRequest rds-params-twemproxy-reconf-87783-13901 stayed Running for 900s with WaitForProgressing=True, Validated=True, and ReconfigureStarted=True.
  • Cluster rds-params-twemproxy stayed Updating.
  • Component rds-params-twemproxy-redis stayed Updating, Progressing=True, reason WorkloadNotUpdated.
  • InstanceSet rds-params-twemproxy-redis had generation 2, but status observedGeneration stayed 1; pod config-hash status stayed on old 7fd467d678 while desired spec config hash was 75d4889cd7.
  • Redis data pods stayed Running with restartCount 0, and the pod-side udf-reconfigure-cmpd-redis-replication-config action returned OK.
  • Controller logs show the component controller patched/updated the InstanceSet at 2026-06-17T17:13:25Z, then the InstanceSet controller repeatedly reported successful pod reconfigure for configHash 75d4889cd7 without status convergence.

First-blocker classification: control plane. The setup, exact image identity, runner wait, and probe target are sufficient for this focused result; the Redis product/action path is not the first blocker in this run.

Boundary: #10398 remains draft. This head does not unblock the Redis addon runtime gate or #2819; a follow-up controller fix is required before another exact-image validation.

After a successful reconfigure, the remaining pod diff can be only config-hash metadata. A full Pod update can conflict with concurrent kubelet/status writes and leave the annotation stale, which keeps the OpsRequest running even after the runtime config changed.

Add an explicit ObjectTree patch option and use it for metadata-only in-place pod updates in the InstanceSet and Instance reconcilers.
@weicao weicao force-pushed the bugfix/instanceset-normal-update-before-resize branch from 04f6d7e to 31908d0 Compare June 17, 2026 17:50
@weicao weicao changed the title fix: commit pod metadata before resize fix: patch metadata-only pod updates Jun 17, 2026
@weicao

weicao commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Updated this draft PR to a narrower follow-up head: 31908d0cb7aa64c79c882e6ee71dfbce0e668516.

What changed from the failed 04f6d7e head:

  • Removed the normal-update-before-resize approach.
  • Added ObjectTree PATCH planning and use it only for safe metadata-only in-place Pod updates.
  • Kept resource-only changes on /resize and non-metadata Pod changes on the existing switchover/full-update path.

Local checks for the new head:

  • go test ./pkg/controller/kubebuilderx ./pkg/controller/instance ./pkg/controller/instanceset -count=1 PASS.
  • git diff --check origin/bugfix/instanceset-resize-resource-compare..HEAD clean.

Boundary: this is still draft. The previous exact-image runtime run failed, and the new head still needs CI plus exact-image Redis Parameters x replication-twemproxy validation before it can be treated as a candidate fix for #10369.

@weicao

weicao commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Runtime validation for head 31908d0cb7aa64c79c882e6ee71dfbce0e668516 is PASS.

Exact controller identity:

  • Local sideload image built from this head with imagePullPolicy: Never.
  • Live manager pod: kubeblocks-6dfbccf448-8vbpw.
  • Live imageID: sha256:80435fad50cb92d42df0aaff1bb7a23a8650559d44ff8c4f0bbb1f7ae15ff523.
  • restartCount 0.

Focused gate result:

  • Redis Parameters x replication-twemproxy: PASS 34 / FAIL 0 / SKIP 0.
  • A02 dynamic maxmemory-policy: OpsRequest Succeed, ConfigMap updated, both Redis pods reported CONFIG maxmemory-policy=allkeys-lru, pod UIDs unchanged, twemproxy SET/GET passed.
  • A03 dynamic hz: OpsRequest Succeed, ConfigMap updated, both Redis pods reported CONFIG hz=50, pod UIDs unchanged, twemproxy SET/GET passed.
  • A04 static databases: OpsRequest Succeed, ConfigMap updated, both Redis pods reported CONFIG databases=32, pod UIDs changed as expected for rolling restart, twemproxy SET/GET passed.
  • A05 replica readback passed.
  • A06 cleanup passed: namespace redis-params-twemproxy-r12-31908-r1 was NotFound and PV residue was empty.

Convergence evidence:

  • Cluster rds-params-twemproxy reached Running.
  • Redis Component generation/observedGeneration reached 5/5.
  • Redis InstanceSet generation/observedGeneration reached 4/4.
  • InstanceSet status shows readyReplicas 2, updatedReplicas 2, availableReplicas 2, and configHash 6b4845ccf7 on both Redis pods.

Acceptance classification: patch-version focused validation passed for the Redis Parameters x replication-twemproxy gate. No first blocker was observed in this focused run.

Boundary: this validates the current #10398 metadata-only PATCH path for this #10369-focused gate, but it is not a Redis addon release-ready claim or a broader matrix result. The PR is still stacked on #10394, so the final merge path depends on accepting, rebasing, or retargeting the base chain.

leon-ape
leon-ape previously approved these changes Jun 18, 2026
@apecloud-bot apecloud-bot added the approved PR Approved Test label Jun 18, 2026
@leon-ape leon-ape added nopick Not auto cherry-pick when PR merged and removed approved PR Approved Test labels Jun 18, 2026
@leon-ape leon-ape changed the title fix: patch metadata-only pod updates chore: patch metadata-only pod updates Jun 18, 2026
@leon-ape leon-ape changed the base branch from bugfix/instanceset-resize-resource-compare to main June 18, 2026 05:50
@leon-ape leon-ape dismissed their stale review June 18, 2026 05:50

The base branch was changed.

@github-actions github-actions Bot added the size/L Denotes a PR that changes 100-499 lines. label Jun 18, 2026
}
if !equalField(oc.Resources.Requests, realRequests) {
return false
for _, resourceName := range []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory} {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Preserve explicit non-CPU/memory resource changes. VerticalScaling validation still accepts hugepages-* resources, but this loop now compares only CPU and memory requests, and equalField for ResourceList also ignores non-CPU/memory limits. A desired pod that changes a hugepages-* request/limit can therefore be treated as equal, so getPodUpdatePolicy returns noOps and the scaling operation can converge without applying the requested resource change. Keep the new omitted CPU/memory request handling scoped to CPU/memory defaulting, but continue comparing explicit non-CPU/memory requests/limits or reject them before this path. The same issue exists in pkg/controller/instance/in_place_update_utils.go.

@leon-ape leon-ape closed this Jun 18, 2026
@github-actions github-actions Bot added this to the Release 1.2.0 milestone Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

nopick Not auto cherry-pick when PR merged size/L Denotes a PR that changes 100-499 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants