zookeeper: HorizontalScaling scale-out pod crashes because dynamic config lacks new server #2726

@weicao

Summary

The ZooKeeper addon (main, chart 1.2.0-alpha.0) fails horizontal scale-out from 3 to 4 replicas in the KubeBlocks main 1.2.0 test environment.

The newly created pod starts with myid=3, but the mounted dynamic config only contains server.0, server.1, and server.2. ZooKeeper exits with:

java.lang.RuntimeException: My id 3 not in the peer list

The memberJoin action is configured with preCondition: RuntimeReady, so it never gets a chance to run: the new pod cannot become runtime-ready before its server entry exists in the peer list.

Evidence

Runtime run:

  • kubeblocks-tests PR: apecloud/kubeblocks-tests#152
  • PR head: 8fc8b98
  • Suite: hscale,ops
  • Environment: idc1 vcluster zookeeper-pr152-idc1-vc
  • Namespace: zk-pr152-ho-r1-173904
  • Evidence root: /work/runs/zk-pr152-hscale-ops-r1-20260601T173904Z/evidence
  • Live probe: /work/runs/zk-pr152-hscale-ops-r1-20260601T173904Z/live-probe-20260601T174747Z-hscale-crashloop
  • Evidence SHA256SUMS sha: 3b1c2024bbb575793e3796667dac71c20f9c9f689a7e6b8ce82ec8450e6d93cd

Observed state:

Component.spec.replicas=4
InstanceSet.spec.replicas=4
InstanceSet.ready=3
pod/zk-hscale-zookeeper-3 1/2 CrashLoopBackOff
OpsRequest zk-scale-out phase=Running progress=0/1

Dynamic config ConfigMap zk-hscale-zookeeper-zookeeper-dynamic-config:

server.0=zk-hscale-zookeeper-0...:2888:3888:participant
server.1=zk-hscale-zookeeper-1...:2888:3888:participant
server.2=zk-hscale-zookeeper-2...:2888:3888:participant

Missing:

server.3=zk-hscale-zookeeper-3...:2888:3888:participant
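The gap above can be checked mechanically. The following is a minimal sketch, not part of the addon or the test suite: `missing_server_ids` is a hypothetical helper that takes a file holding the dynamic config contents and the desired replica count, and prints every id in 0..replicas-1 that has no `server.<id>=` line.

```shell
# Hypothetical helper (not from the addon): print every server id in
# 0..replicas-1 that has no "server.<id>=" line in the given dynamic
# config file. Prints nothing when the peer list is complete.
# The function body runs in a subshell so its variables stay local.
missing_server_ids() (
  cfg_file=$1
  replicas=$2
  id=0
  while [ "$id" -lt "$replicas" ]; do
    if ! grep -q "^server\.${id}=" "$cfg_file"; then
      echo "$id"
    fi
    id=$((id + 1))
  done
)
```

Run against a file with the three entries listed above and a replica count of 4, it prints `3`, which matches the id ZooKeeper reports as missing from the peer list.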

Addon CmpD:

lifecycleActions:
  memberJoin:
    preCondition: RuntimeReady
    exec:
      container: zookeeper
      command:
        - /bin/bash
        - -c
        - /kubeblocks/scripts/member_join.sh

member_join.sh would run reconfig -add server.${current_member_index}=..., but the pod must already be runtime ready before the action can execute.
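For illustration only, the command shape described above can be sketched as follows. This is not the content of member_join.sh, which is not reproduced in this issue beyond the fragment quoted; `build_server_spec` and `print_join_command` are hypothetical names, the host argument is a placeholder, and the ports and role are copied from the existing entries. Whether a client-port suffix is also required depends on the deployment's configuration and is not assumed here.

```shell
# Hypothetical helpers (not from member_join.sh).
# Build the server spec for a new member. Ports and role mirror the
# existing entries in this issue; the host is supplied by the caller.
build_server_spec() (
  printf 'server.%s=%s:2888:3888:participant' "$1" "$2"
)

# Print, without executing, the reconfig command a join step would issue.
print_join_command() (
  printf 'reconfig -add %s\n' "$(build_server_spec "$1" "$2")"
)
```

The printed command has to be sent to a running ensemble member. With the RuntimeReady gate, the only container the action targets is the one that cannot start, which is why the join never happens.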

First blocker classification

Layer 5 addon/product contract gap, not environment:

  • Env ready: source 3-replica cluster was Running; images pulled from local registry cache; PVCs bound.
  • Runner ready: hscale suite created source cluster, wrote/read znode, and verified initial dynamic config server count=3 before scale-out.
  • Probe semantics correct: failure is from ZooKeeper process log and live dynamic config.
  • Control plane created the 4th pod and kept OpsRequest Running because runtime readiness never converged.
  • Addon scale-out contract is not satisfied: new member cannot start because peer list does not include its own server id, and memberJoin is gated on RuntimeReady.

Expected

Scale-out should either:

  • ensure the new pod sees a dynamic config containing its own server.N before the ZooKeeper process starts, or
  • run an equivalent join path from an existing ready member before requiring the new member to be runtime ready.

After scale-out, both /zookeeper/config and the mounted dynamic config should include the new server, and the new pod should join as a follower or observer as designed.
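As a rough sketch of the first option, the peer list a new pod needs could be rendered from the desired replica count before the ZooKeeper process starts. `render_peer_list`, `host_prefix`, and `host_suffix` are hypothetical; the real pod hostname pattern is elided in this issue and is not assumed here. Ports and role follow the existing entries.

```shell
# Hypothetical helper (not from the addon): print one "server.<id>=" line
# per replica, ids 0..replicas-1. host_prefix and host_suffix are
# placeholders for the pod hostname pattern, which this issue does not show.
render_peer_list() (
  replicas=$1
  host_prefix=$2
  host_suffix=$3
  id=0
  while [ "$id" -lt "$replicas" ]; do
    printf 'server.%s=%s-%s%s:2888:3888:participant\n' \
      "$id" "$host_prefix" "$id" "$host_suffix"
    id=$((id + 1))
  done
)
```

Called with a replica count of 4, the output includes a `server.3=` line, so a pod starting with myid=3 would find itself in the peer list. Whether rendering the list at startup or joining from an existing ready member is the better fit is a design decision for the addon.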
