Skip to content

chore(zookeeper): render dynamic config on hscale#2727

Merged
leon-ape merged 6 commits into
mainfrom
logan/zookeeper-hscale-memberjoin
Jun 18, 2026
Merged

chore(zookeeper): render dynamic config on hscale#2727
leon-ape merged 6 commits into
mainfrom
logan/zookeeper-hscale-memberjoin

Conversation

@weicao

@weicao weicao commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • run ZooKeeper memberJoin/memberLeave actions on the current leader instead of the joining/leaving pod
  • remove RuntimeReady preCondition from memberJoin/memberLeave so hscale membership work is not blocked by the new pod being unready
  • let the dynamic config template be rendered on hscale so the joining pod local zoo.cfg.dynamic contains its own server entry
  • add observer entries for ordinal >= 3, matching the existing dynamic config template and README scale-out semantics
  • use KB_JOIN_MEMBER_* / KB_LEAVE_MEMBER_* action variables and add idempotent readback checks around /zookeeper/config

Evidence

  • Original failure tracked in zookeeper: HorizontalScaling scale-out pod crashes because dynamic config lacks new server #2726: scale-out pod zk-hscale-zookeeper-3 crashed with My id 3 not in the peer list; dynamic config only had server.0/1/2 while Component/InstanceSet replicas were 4.
  • Intermediate r2 on bcc6d4c1 proved memberJoin executed on leader and wrote /zookeeper/config, but the joining pod local mounted ConfigMap still lacked server.3; evidence:
    • /work/runs/zk-pr152-hscale-r2-patch-bcc6d4c1-20260601T181503Z/live-probe-20260601T182557Z-focused
    • /work/runs/zk-pr152-hscale-r2-patch-bcc6d4c1-20260601T181503Z/live-probe-20260601T182822Z-config-cm, SHA256SUMS sha 6f29672d0c592a8c4f81ee93af3428c0dfd52287c4a5de66db2d8df9aeab95bf
  • Focused r3 on PR head 8de88769 plus kubeblocks-tests head 6458bbc: hscale PASS 24 / FAIL 0 / SKIP 0; scale-out reached server.3=...:observer, role labels matched mntr, data survived scale-out/in, cleanup residual 0. Evidence path /work/runs/zk-pr152-hscale-r3-patch-8de88769-tests-6458bbc-20260601T183240Z/evidence, SHA256SUMS sha fc38e59d44e15f698260bb714d29ae57760aaf9cf3bd16cf5dd5990dfd55dd65.

Validation

  • bash -n addons/zookeeper/scripts/member_join.sh addons/zookeeper/scripts/member_leave.sh addons/zookeeper/scripts/roleprobe.sh addons/zookeeper/scripts/startup.sh
  • helm dependency build addons/zookeeper
  • helm lint addons/zookeeper
  • helm template zookeeper addons/zookeeper --namespace kb-system
  • git diff --check

Follow-up, not blocking this fix: >= 3 quorum/observer threshold is duplicated in config/zookeeper-dynamic.tpl and scripts/member_join.sh; we should later move it to one component variable.

Closes #2726

@codecov-commenter

codecov-commenter commented Jun 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 94 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (da6cdc1) to head (8d9a3cc).
⚠️ Report is 21 commits behind head on main.

Files with missing lines Patch % Lines
...dons/zookeeper/scripts-ut-spec/member_join_spec.sh 0.00% 56 Missing ⚠️
...ons/zookeeper/scripts-ut-spec/member_leave_spec.sh 0.00% 38 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main   #2727   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         73      75    +2     
  Lines       9236    9330   +94     
=====================================
- Misses      9236    9330   +94     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@weicao weicao changed the title fix(zookeeper): run hscale membership changes on leader fix(zookeeper): render dynamic config on hscale Jun 1, 2026
weicao

This comment was marked as duplicate.

@weicao

weicao commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Fixed the REQUEST_CHANGES blockers in commit 7334af62.

What changed:

  • member_leave.sh: replaced the raw /zookeeper/config read with get_dynamic_config_or_die; it now fails closed on empty output, KeeperErrorCode/NoAuth-style output, connection/auth errors, or output with no server.N= dynamic-config lines before deciding a member is already absent.
  • member_join.sh: uses the same fail-closed dynamic-config read guard.
  • member_join.sh: the already-exists path now requires the exact target server.N=fqdn:2888:3888:participant|observer;2181 line. Same index/FQDN with the wrong participant/observer type now fails instead of false-greening.
  • Added focused ShellSpec coverage for both review cases.

Local evidence:

  • /tmp/logan-shellspec-0.28.1/shellspec/shellspec --load-path ./shellspec --shell /opt/homebrew/bin/bash addons/zookeeper/scripts-ut-spec -> 23 examples, 0 failures.
  • shellcheck addons/zookeeper/scripts/member_join.sh addons/zookeeper/scripts/member_leave.sh -> PASS.
  • bash -n addons/zookeeper/scripts/member_join.sh addons/zookeeper/scripts/member_leave.sh -> PASS.
  • git diff --check -> PASS.

CI has started on the pushed branch; latest view showed shell-check in progress and the remaining checks queued.

@weicao weicao added nopick Not auto cherry-pick when PR merged chart-release Auto release chart when PR merged main. labels Jun 16, 2026
@leon-ape

leon-ape commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Review findings:

[P2] The dynamic config template decides observer membership from the FQDN list position ($i >= 3), while member_join.sh decides it from the parsed pod ordinal (ordinal >= 3). These are equivalent only for continuous ordinals. With non-contiguous ordinals, offline instances, or instance-template/flat-ordinal layouts, the same pod can be rendered locally as participant but added to /zookeeper/config as observer. Please use the same rule in both places, preferably based on the parsed ordinal, and add coverage for a non-contiguous ordinal case.

[P2] Removing externalManaged fixes the new-cluster rendering path, but the upgrade path for existing clusters is not proven. Existing Components may already have a spec.configs entry for this template with an external ConfigMap override; KubeBlocks merge/synthesize logic can continue to honor that Component-level config instead of switching back to the new CMPD template. Please provide upgrade evidence for an existing ZooKeeper cluster, or document/add a migration path that clears the old external-managed config state.

[P3] The dynamic parameter list only covers server.1 through server.5, while the addon does not declare a matching maximum replica/ordinal boundary. The template and memberJoin path can generate server.6+, but those keys are outside the ParametersDefinition contract. Please either extend/generalize the supported parameter boundary or explicitly declare and enforce the supported maximum scale range.

@weicao

weicao commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Addressed the latest review findings in commit 86358fab.

What changed:

  • Dynamic config rendering now classifies participant/observer by parsed pod ordinal, not FQDN-list position: ge (int $ordinal) 3.
  • member_join.sh already used the parsed ordinal rule; added ShellSpec coverage for a non-contiguous ordinal (zookeeper-4) to verify it joins as observer.
  • Added replicasLimit to the Zookeeper ComponentDefinition: minReplicas: 1, maxReplicas: 6, matching the declared ParametersDefinition boundary (server.1 through server.5; server IDs are zero-based, so server.6+ is outside the supported range).
  • Documented the upgrade-path boundary for older clusters that may still carry a Component-level external config override from the previous externalManaged path. This PR does not claim live upgrade evidence for that old override state; operators must verify/remove the stale override or recreate through the upgraded chart before treating hscale as validated on such clusters.

Local evidence on 86358fab:

  • /tmp/logan-shellspec-0.28.1/shellspec/shellspec --load-path ./shellspec --shell /opt/homebrew/bin/bash addons/zookeeper/scripts-ut-spec -> 24 examples, 0 failures.
  • shellcheck addons/zookeeper/scripts/member_join.sh addons/zookeeper/scripts/member_leave.sh -> PASS.
  • bash -n addons/zookeeper/scripts/member_join.sh addons/zookeeper/scripts/member_leave.sh -> PASS.
  • helm lint addons/zookeeper -> PASS.
  • helm template zookeeper addons/zookeeper --namespace kb-system -> PASS; render includes replicasLimit.maxReplicas: 6 and ParametersDefinition through server.5.
  • git diff --check -> PASS.

Current CI is running on the new head 86358faba0902c057d4631991ec56b3d6cb82ebe.

@weicao

weicao commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up: GitHub CI is now green on 86358faba0902c057d4631991ec56b3d6cb82ebe (shellspec-test, shell-check, chart checks, label check all SUCCESS).

@weicao

weicao commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up after current-head runtime validation.

Observed failure on previous head 86358fab:

  • idc3 hscale r1 failed after scale-out. The new observer pod joined ZooKeeper and had role observer, but the Component stayed Updating until the runner timeout.
  • Controller log showed repeated memberJoin failures: server.3 already exists with a different endpoint or member type.
  • Direct ZooKeeper config read showed why: ZooKeeper stores the added member as server.3=...:observer;0.0.0.0:2181, while the script compared against server.3=...:observer;2181.

Fix in 8d9a3ccc:

  • member_join.sh still sends reconfig -add ...;2181.
  • The idempotency check now compares the stable peer part before ;, so ZooKeeper's normalized client address is accepted as the same member.
  • Added ShellSpec coverage for the ;0.0.0.0:2181 readback case.

Evidence on current head 8d9a3ccc05bf08e06e3d013b540354b2cc650e1e:

  • Local ShellSpec: 25 examples, 0 failures.
  • shellcheck, bash -n, git diff --check: PASS.
  • GitHub CI: all checks SUCCESS.
  • idc3 hscale r2 on addon 8d9a3ccc + tests c3904d0: PASS 24 / FAIL 0 / SKIP 0.
  • Runtime evidence: /work/runs/zk-pr2727-current-head-hscale-r2-idc3-patch-8d9a3ccc-tests-c3904d0-20260617T035011Z/evidence, SHA256SUMS sha256 3ff9d1c698927292dadb5704d8e211ff9616a94706fd688ca36c19479a7cf7e7.
  • Cleanup: namespace zk-pr2727-hscale-r2-idc3-035011 is NotFound; residue grep for Cluster/Component/InstanceSet/Pod/PVC/Ops/Backup/Restore/PV is empty.

Boundary:

  • This validates the impacted hscale path for the new PR head. It is not a new repeated full-suite release-acceptance count.
  • The documented old externalManaged upgrade caveat remains unchanged: this PR does not claim live upgrade evidence for clusters that already carry a stale Component-level external override.

Please re-review the focused script behavior in addons/zookeeper/scripts/member_join.sh and the new ShellSpec case in addons/zookeeper/scripts-ut-spec/member_join_spec.sh.

@leon-ape leon-ape changed the title fix(zookeeper): render dynamic config on hscale chore(zookeeper): render dynamic config on hscale Jun 18, 2026
@leon-ape leon-ape merged commit 4f96d62 into main Jun 18, 2026
14 checks passed
@leon-ape leon-ape deleted the logan/zookeeper-hscale-memberjoin branch June 18, 2026 06:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

chart-release Auto release chart when PR merged main. nopick Not auto cherry-pick when PR merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

zookeeper: HorizontalScaling scale-out pod crashes because dynamic config lacks new server

3 participants