
chore: add reconfigure diagnostic logs#10397

Draft
weicao wants to merge 3 commits into main from bugfix/10384-diagnostic-logs

Conversation

@weicao weicao commented Jun 17, 2026

Contributor

Status

Draft diagnostic PR only. This is not a functional fix and should not be routed as ready for human review or merge until a fresh reproduction uses the logs and produces direct evidence.

This replaces #10396 only to use a branch name that passes the repository pre-check. The code commit is unchanged: cd2a82f8eaade47438b41f240329dbf79392190d.

Why

The current #10384 evidence proves that live pods can remain on the old config hash after repeated successful reconfigure actions, but it does not directly prove which controller write step failed or was skipped. This PR adds temporary logs to close that evidence gap.

What this logs

  • pending reconfigure configs with pod resourceVersion, current pod config hash, and desired config hash
  • reconfigure success with pod resourceVersion, config name, and desired config hash
  • update decision with updatePolicy, specUpdatePolicy, current hash, desired hash, and allConfigUpdated
  • scheduled pod config hash update before tree.Update
  • actual plan submit result for pod config hash updates, including failure details
  • successful submit readback with live pod resourceVersion and live config hash
  • formerly silent plan conflicts now log request and error before requeue
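The first log item above, pending configs with current and desired hashes, can be sketched in a few lines. This is only an illustration under stated assumptions: `configHashes`, `pendingConfigs`, `pendingLogLine`, and the sample pod and config names are hypothetical and are not taken from the KubeBlocks code, which uses its own logger and types.

```go
package main

import (
	"fmt"
	"sort"
)

// configHashes maps a config name to its hash.
// Hypothetical type for illustration, not a KubeBlocks type.
type configHashes map[string]string

// pendingConfigs returns the sorted names of configs whose hash on the pod
// differs from the desired hash, including configs missing on the pod.
func pendingConfigs(current, desired configHashes) []string {
	var pending []string
	for name, want := range desired {
		if got, ok := current[name]; !ok || got != want {
			pending = append(pending, name)
		}
	}
	sort.Strings(pending)
	return pending
}

// pendingLogLine renders one diagnostic line for a pending config so that
// the pod resourceVersion and both hashes appear side by side.
func pendingLogLine(pod, resourceVersion, name string, current, desired configHashes) string {
	return fmt.Sprintf("pending reconfigure pod=%s resourceVersion=%s config=%s currentHash=%q desiredHash=%q",
		pod, resourceVersion, name, current[name], desired[name])
}

func main() {
	// Sample data only; names and hashes are made up.
	current := configHashes{"mssql-config": "aaa111", "log-config": "ccc333"}
	desired := configHashes{"mssql-config": "bbb222", "log-config": "ccc333"}
	for _, name := range pendingConfigs(current, desired) {
		fmt.Println(pendingLogLine("mssql-0", "12345", name, current, desired))
	}
}
```

Logging the resourceVersion next to both hashes is what lets a later log line for the same pod be ordered against this one.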

Boundaries

  • no change to resource comparison
  • no change to reconfigure execution
  • no change to update policy selection
  • no change to tree commit behavior
  • one extra read-only pod GET after a successful pod config hash update, for diagnostic evidence only
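The extra read-only GET in the last boundary can be pictured roughly as follows. This is a minimal sketch, assuming a hypothetical `podReader` interface and an in-memory `fakeReader`; the real controller reads through its Kubernetes client, and none of these names come from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// podSnapshot holds the two fields the diagnostic readback cares about.
type podSnapshot struct {
	ResourceVersion string
	ConfigHash      string
}

// podReader is a hypothetical read-only view of the API server.
type podReader interface {
	GetPod(namespace, name string) (podSnapshot, error)
}

// readbackLine performs one read-only GET after a successful write and
// reports whether the live hash matches the hash that was just written.
// It never mutates the pod, so reconcile behavior is unchanged.
func readbackLine(r podReader, namespace, name, writtenHash string) string {
	live, err := r.GetPod(namespace, name)
	if err != nil {
		return fmt.Sprintf("readback failed pod=%s/%s err=%v", namespace, name, err)
	}
	return fmt.Sprintf("readback pod=%s/%s resourceVersion=%s liveHash=%s writtenHash=%s match=%t",
		namespace, name, live.ResourceVersion, live.ConfigHash, writtenHash, live.ConfigHash == writtenHash)
}

// fakeReader is an in-memory stand-in for the API server, keyed by "namespace/name".
type fakeReader map[string]podSnapshot

func (f fakeReader) GetPod(namespace, name string) (podSnapshot, error) {
	pod, ok := f[namespace+"/"+name]
	if !ok {
		return podSnapshot{}, errors.New("pod not found")
	}
	return pod, nil
}

func main() {
	// Sample data only: the live pod still carries the old hash.
	reader := fakeReader{"default/mssql-0": {ResourceVersion: "12346", ConfigHash: "aaa111"}}
	fmt.Println(readbackLine(reader, "default", "mssql-0", "bbb222"))
}
```

A `match=false` line after a reported successful submit would be the kind of direct evidence the PR description says is currently missing.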

Local validation

  • go test ./pkg/controller/kubebuilderx ./pkg/controller/instanceset ./pkg/controller/instance -count=1

Local diagnostic image

Built locally for pre-merge sideload testing:

  • tag: kubeblocks:diag-10384-cd2a82f-lily
  • source commit: cd2a82f8eaade47438b41f240329dbf79392190d
  • platform: linux/amd64
  • local image digest: sha256:5ecd838db0aee89e770b11cff4328479c97dfda36d72ee66c4fbf34b4725b4bb

The current machine has no Kubernetes context configured, so IDC sideload/import and focused SQL Server C01 reproduction still need to run from a machine or owner with the target cluster context.

@apecloud-bot

Collaborator

Auto Cherry-pick Instructions

Usage:
  - /nopick: Do not auto cherry-pick when the PR is merged.
  - /pick release-x.x [release-x.x]: Auto cherry-pick to the specified branches when the PR is merged.

Example:
  - /nopick
  - /pick release-1.1

CLA Recheck Instructions

Usage:
  - /recheck-cla: Trigger a re-check of CLA status for this pull request.
Example:
  - /recheck-cla

@github-actions github-actions Bot added the size/L Denotes a PR that changes 100-499 lines. label Jun 17, 2026
@weicao weicao added kind/bug Something isn't working nopick Not auto cherry-pick when PR merged labels Jun 18, 2026

weicao commented Jun 18, 2026

Contributor Author

Updated diagnostic branch to head 24324fd6c.

This is still diagnostic-only and does not change reconcile behavior. The new commit adds timing fields around the reconfigure/configHash path so a C01 rerun can separate:

  • pending configHash detection time and pod resourceVersion;
  • controller-side reconfigure action start/end duration;
  • configHash write scheduling time;
  • final pod Update/Patch commit start time, duration, and success/failure result.

Local validation passed:

go test ./pkg/controller/kubebuilderx ./pkg/controller/instanceset ./pkg/controller/instance -count=1
git diff --check HEAD~1..HEAD

Labels updated with kind/bug and nopick. PR remains draft because this branch is for runtime diagnostics, not a production fix.

codecov Bot commented Jun 18, 2026


Codecov Report

❌ Patch coverage is 21.60000% with 196 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.68%. Comparing base (00dc1b7) to head (dc3b818).
⚠️ Report is 1 commit behind head on main.

Files with missing lines                          Patch %   Lines
pkg/controller/instance/reconciler_update.go      0.00%     98 Missing ⚠️
pkg/controller/instanceset/reconciler_update.go   46.53%    52 Missing and 2 partials ⚠️
pkg/controller/kubebuilderx/plan_builder.go       14.00%    43 Missing ⚠️
pkg/controller/kubebuilderx/controller.go         0.00%     1 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10397      +/-   ##
==========================================
- Coverage   61.87%   61.68%   -0.20%     
==========================================
  Files         533      533              
  Lines       63609    63857     +248     
==========================================
+ Hits        39360    39389      +29     
- Misses      20661    20875     +214     
- Partials     3588     3593       +5     

Flag        Coverage Δ
unittests   61.68% <21.60%> (-0.20%) ⬇️

@weicao weicao changed the title chore: add reconfigure config hash diagnostic logs chore: add reconfigure diagnostic logs Jun 18, 2026

weicao commented Jun 18, 2026

Contributor Author

Diagnostic update for head dc3b818296cb6f4b9661f94d836be4ea92d4b147:

  • Added controller-side pod reconfigure reconcile timing logs around each per-Pod reconfigure call in the Instance and InstanceSet update reconcilers.
  • The log records Pod name/resourceVersion, current and desired config hash summary, update policy, start time, duration string, and duration milliseconds.
  • Existing PlanBuilder logs still cover Pod metadata writeback commit duration, so the next run can separate controller reconfigure calculation time from Pod update/patch commit time.
  • Local validation: git diff --check; go test ./pkg/controller/instanceset ./pkg/controller/instance -count=1.
  • GitHub checks are passing after rerunning the initial transient failures in push-pre-check (manifests) and push-pre-check (test).
  • Temporary controller image for runtime diagnostics: apecloud/kubeblocks:bugfix-10384-diagnostic-dc3b8182; China registry: apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/kubeblocks:bugfix-10384-diagnostic-dc3b8182; image index digest: sha256:580b6ca968884e2bfca840fd3f4a132f57f904316c8b2a471db62dc2eda939e3.

Boundary: this remains a draft/nopick diagnostic PR for temporary runtime evidence collection, not a root-cause fix or merge request.

Labels

kind/bug Something isn't working nopick Not auto cherry-pick when PR merged size/L Denotes a PR that changes 100-499 lines.

2 participants