
chore(rabbitmq): start replicas in parallel #2718

Merged: leon-ape merged 1 commit into main from jasper/rabbitmq-parallel-pod-management on Jun 18, 2026


weicao commented Jun 1, 2026

Summary

  • Add podManagementPolicy: Parallel to RabbitMQ ComponentDefinition

With OrderedReady pod management, Stop/Start deadlocks permanently on clusters with N>=3 replicas: pod 0 starts first but cannot form a Raft quorum alone, so it never becomes Ready and the remaining pods are never started. The Parallel policy starts all replicas simultaneously so that quorum can re-form.
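
The deadlock described above can be illustrated with a toy model. This is a minimal sketch, not KubeBlocks or RabbitMQ code: `simulate_start` and its `quorum_size` parameter are hypothetical, and the model assumes that no pod reports Ready until `quorum_size` members are running.

```python
def simulate_start(replicas: int, quorum_size: int, policy: str) -> dict:
    """Toy model of pod creation after a full stop (hypothetical helper).

    Assumption: no pod becomes Ready until at least `quorum_size` pods
    have been created and are running.
    """
    if policy == "Parallel":
        # All replicas are created at once.
        created = replicas
    elif policy == "OrderedReady":
        # The next ordinal is created only after the previous one is Ready.
        created = 0
        for _ in range(replicas):
            created += 1
            if created < quorum_size:
                break  # this pod never becomes Ready, so creation stalls here
    else:
        raise ValueError(f"unknown policy: {policy}")
    ready = created if created >= quorum_size else 0
    return {"created": created, "ready": ready}


ordered = simulate_start(replicas=3, quorum_size=2, policy="OrderedReady")
parallel = simulate_start(replicas=3, quorum_size=2, policy="Parallel")
print(ordered)   # {'created': 1, 'ready': 0}
print(parallel)  # {'created': 3, 'ready': 3}
```

The quorum size is passed in explicitly rather than derived, because the sketch makes no claim about how RabbitMQ decides readiness for a given cluster size.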

Evidence

A/B proof on vcluster with chaos_stopstart suite (SS01-SS06):

  • OrderedReady (main): SS04 Start deadlocks - pod 0 cannot form quorum alone, cluster never recovers
  • Parallel (this PR): SS04 PASS - all 3 pods start simultaneously, quorum recovered, no committed message loss

N=2 ALL PASS (26/0/0). Run artifacts: artifacts/chaos-stopstart-pr2718-run{1,2}-*.tar.gz

Test plan

  • chaos_stopstart N=2 PASS (26/0/0)
  • smoke + day2 on PR branch (Ava executing)
  • chaos_stopstart N=3 after merge

@weicao weicao requested review from a team, leon-ape and xuriwuyun as code owners June 1, 2026 10:39
codecov-commenter commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (ac41a5b) to head (676cba7).

Additional details and impacted files
@@          Coverage Diff          @@
##            main   #2718   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         84      84           
  Lines      11664   11664           
=====================================
  Misses     11664   11664           


weicao commented Jun 1, 2026

RabbitMQ 4.3.1 candidate validation update

Result from the latest scoped Rule 12 soak: GREEN / PASS for restart + poddelete sustained-load over 6h, using this PR head.

Pinned candidate:

Validation evidence:

  • Precheck PASS, including RabbitMQ CmpD/CmpV Available and 4.3.1 ACR override.
  • SK03.1-SK03.10 PASS: alternating restart/poddelete; every classifier had alarm_set=0s, duty=0%, queue start/end/max=0, nacked=0.
  • SK04 PASS: confirmed=345 consumed=348 queue_depth=25; no committed durable message loss; no alarms at soak end.
  • SK05 PASS: cleanup completed; final rabbitmq-test had no cluster/component/instanceset/pod/job/ops/pvc residue.
  • Test Summary: PASS: 9 FAIL: 0 SKIP: 0.

Artifact:

  • runner tar: /artifacts/tars/rabbitmq-431-attempt8-soak-20260601T120031Z-closeout-20260601T181305Z.tar.gz
  • sha256: 39747ff1dcfd236b8447d8edd97b9c2cfe86beab1dc598996b455153a6820227
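
A downloaded copy of the runner tar could be checked against the digest listed above with something like the following. This is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `matches` are illustrative, and the local path you pass in is up to you.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling `matches(local_tar_path, "39747ff1dcfd236b8447d8edd97b9c2cfe86beab1dc598996b455153a6820227")` returns `True` only if the local file is byte-identical to the artifact that produced the published digest.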

Boundary:

  • This validates #2718 (chore(rabbitmq): start replicas in parallel) in the current scoped sustained-load restart + poddelete soak.
  • Historical attempt6 remains classified as PRODUCT_CHART_CANDIDATE (high confidence, evidence gap noted) rather than a confirmed product bug, because the original live scene was cleaned up before all supplemental evidence was collected. The single-variable probe was still strong: changing only podManagementPolicy to Parallel allowed the same InstanceSet to create ordinals 1/2 and reach 3 pods Ready quickly.
  • Full-cluster Stop/Start was not part of the attempt8 sustained-client scope; attempt7 classified that path as VOID/HARNESS_RED due to a PerfTest rc=0 early exit during the all-broker outage.

weicao left a comment

Review: PR #2718 — fix(rabbitmq): start replicas in parallel

Reviewer: @JADE (PR Reviewer)
Stats: 1 file changed, +1 / -0


Summary

Single-line change: adds podManagementPolicy: Parallel to the RabbitMQ ComponentDefinition spec in cmpd.yaml.

Fixes a deadlock during Stop/Start with quorum queues: with the default OrderedReady, only ordinal 0 is created, but RabbitMQ 4.3.1 cannot boot quorum/coordination state without peers, so ordinal 0 never reaches Ready and ordinals 1/2 never spawn.


API Contract: PASS

podManagementPolicy is a first-class optional field on ComponentDefinition.spec in KubeBlocks apps/v1. Parallel is valid. Kafka, Elasticsearch, and PostgreSQL addons in this repo already use it.

Safety: PASS

  • RabbitMQ rabbit_peer_discovery_k8s plugin with address_type = hostname is designed for parallel startup. It uses randomized delay + internal locking to prevent split-brain during formation.
  • Official Bitnami and RabbitMQ Cluster Operator charts both use Parallel.
  • The deadlock it fixes (quorum requires majority, OrderedReady only creates pod-0) is worse than any theoretical parallel startup risk.
  • 6-hour soak run with restart + pod-delete faults shows zero alarms, zero nacked messages.

Occam's Razor: PASS

Simplest possible fix — one declarative field. No workarounds, no conditional logic.

Single Purpose: PASS

One file, one line, one concern.


Non-blocking

  1. PR targets jasper/rabbitmq-4.3.1-support not main — confirm merge path.
  2. Change applies to all RabbitMQ versions via shared cmpd.yaml, not just 4.3.1 — is this intended? (Likely fine since the peer discovery plugin handles it for all supported versions.)

Scale-down question

With Parallel, K8s may delete pods simultaneously on scale-down. Does memberLeave (forget_cluster_node) still work when multiple pods terminate concurrently? Pre-existing concern, but more relevant with Parallel.


Verdict: APPROVE

Correct, minimal, well-evidenced. Solves a real deadlock with the standard Kubernetes approach for RabbitMQ. Multiple other addons in this repo use the same pattern.

Base automatically changed from jasper/rabbitmq-4.3.1-support to jasper/rabbitmq-init-account-compat June 3, 2026 16:26
weicao commented Jun 4, 2026

Follow-up for the scale-down/memberLeave question on podManagementPolicy: Parallel:

What changed:

  • Added supplemental RabbitMQ day2 coverage in kubeblocks-tests PR #186.
  • The day2 5 -> 3 scale-in path now asserts the scaled-in RabbitMQ nodes are absent from RabbitMQ cluster_status membership, not only that the online node count is 3.
  • The checked removed nodes were rabbit@rmq-d2-9660-rabbitmq-3 and rabbit@rmq-d2-9660-rabbitmq-4.
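
The difference between counting online nodes and asserting membership can be sketched as follows. This is an illustration only, not the actual kubeblocks-tests code: `check_scale_in` is a hypothetical helper, the inputs are plain lists rather than parsed `cluster_status` output, and the names of the three remaining nodes are assumed from the naming pattern of the two removed nodes.

```python
def check_scale_in(members, running, removed, expected_online):
    """Hypothetical day2 check: online count plus membership absence."""
    # Any removed node still listed as a cluster member is a stale entry.
    stale = sorted(set(members) & set(removed))
    return {
        "online_ok": len(running) == expected_online,
        "stale_members": stale,
    }


prefix = "rabbit@rmq-d2-9660-rabbitmq-"
kept = [f"{prefix}{i}" for i in range(3)]  # assumed names for ordinals 0-2
removed = [f"{prefix}3", f"{prefix}4"]

# A node that stopped but was never forgotten: the online count alone looks fine.
result = check_scale_in(
    members=kept + [removed[0]],
    running=kept,
    removed=removed,
    expected_online=3,
)
print(result)
# {'online_ok': True, 'stale_members': ['rabbit@rmq-d2-9660-rabbitmq-3']}
```

In this made-up scenario the online-count check passes while a removed node is still a member, which is the gap the stricter assertion is meant to close.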

How it was verified:

  • Approved runner: rabbitmq-runner/rabbitmq-soak-runner on rabbitmq-431-soak-vc.
  • Run id: rabbitmq-pr186-memberleave-20260604T140613Z.
  • Service version: 4.3.1.
  • Storage: 1Gi per pod, StorageClass apelocal-hostpath-default.
  • Result: PASS=14 FAIL=0 SKIP=0.
  • Key assertion passed: D03: scaled-in RabbitMQ nodes removed from membership.
  • Cleanup postcheck: post_resource_count=0.

Commit / evidence:

  • kubeblocks-tests PR #186 head: be779314bc003c6298e4443bfbe31de84d9d6667.
  • Evidence tar: /artifacts/tars/rabbitmq-pr186-memberleave-20260604T140613Z.tar.gz.
  • sha256: c4d585fa50d8d9bdaa0daf7c4f18cadeac26a2b5b17a21bd340a5b89aae1231b.

Boundary:

  • This is focused supplemental evidence for the Parallel scale-down/memberLeave risk on this PR. It does not by itself claim full release readiness.

@weicao weicao marked this pull request as draft June 10, 2026 02:04
@weicao weicao force-pushed the jasper/rabbitmq-init-account-compat branch from 597134e to 00ff1c7 Compare June 16, 2026 13:10
OrderedReady pod management causes permanent deadlock during Stop/Start
for N>=3 clusters: pod 0 starts first but cannot form quorum alone,
blocking subsequent pods from starting. Parallel policy starts all
replicas simultaneously so Raft quorum can reform.

A/B proof: OrderedReady → SS04 Start deadlock (pod 0 stuck);
Parallel → SS04 PASS (all pods start, quorum recovered, no message loss).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@weicao weicao force-pushed the jasper/rabbitmq-parallel-pod-management branch from 7b080ff to 676cba7 Compare June 16, 2026 14:34
@weicao weicao changed the base branch from jasper/rabbitmq-init-account-compat to main June 16, 2026 14:34
@weicao weicao marked this pull request as ready for review June 16, 2026 15:18
@weicao weicao marked this pull request as draft June 16, 2026 15:21
@weicao weicao marked this pull request as ready for review June 16, 2026 20:13
@weicao weicao added labels Jun 16, 2026: pick-1.0 (Auto cherry-pick to release-1.0 when PR merged), pick-1.1 (Auto cherry-pick to release-1.1 when PR merged), bug (Something isn't working)
weicao commented Jun 17, 2026

Thanks for the thorough review!

Addressing the non-blocking items:

  1. Merge path: PR targets main (base: main, head: jasper/rabbitmq-parallel-pod-management). The branch name doesn't reflect 4.3.1 — it's a standalone fix.

  2. Applies to all versions: Intentional. rabbit_peer_discovery_k8s with address_type = hostname handles parallel startup for all supported versions (3.8–4.3.1). The peer discovery randomized delay + internal locking prevents split-brain during formation on every version.

  3. Scale-down with Parallel: Scale-down in KubeBlocks goes through InstanceSet, which deletes pods one at a time in reverse ordinal order (highest first). podManagementPolicy: Parallel only affects pod creation during scale-up/start, not deletion during scale-down. Additionally, PR #2814 (chore(rabbitmq): make member_leave.sh re-entrant with timeout guards) adds re-entrant safety to memberLeave for concurrent scenarios.
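
The deletion order claimed in item 3 can be written down as a small sketch. This takes the reverse-ordinal behavior at face value from the comment above and is not InstanceSet code; `scale_in_order` is a hypothetical helper.

```python
def scale_in_order(current: int, target: int) -> list[int]:
    """Ordinals removed when scaling in, highest first (assumed behavior)."""
    if not 0 <= target <= current:
        raise ValueError("target must be between 0 and the current replica count")
    return list(range(current - 1, target - 1, -1))


print(scale_in_order(5, 3))  # [4, 3]
```

For the 5 -> 3 day2 case this yields ordinal 4 and then ordinal 3, one at a time, which is why memberLeave would not see concurrent terminations under this assumption.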

Evidence: chaos_stopstart 13/0/0 × 3 runs + chaos_network 15/0/0 + day2 D01-D07 (including D03 scale-in 5→3) = 77/0/0 total, all on Parallel.

@leon-ape leon-ape changed the title from "fix(rabbitmq): start replicas in parallel" to "chore(rabbitmq): start replicas in parallel" Jun 18, 2026
@leon-ape leon-ape merged commit 4cbb29f into main Jun 18, 2026
16 checks passed
@leon-ape leon-ape deleted the jasper/rabbitmq-parallel-pod-management branch June 18, 2026 03:22
@apecloud-bot

/cherry-pick release-1.1

@apecloud-bot

/cherry-pick release-1.0

@apecloud-bot

🤖 says: cherry pick action finished successfully 🎉!
See: https://github.com/apecloud/kubeblocks-addons/actions/runs/27734567031

apecloud-bot pushed a commit that referenced this pull request Jun 18, 2026
@apecloud-bot

🤖 says: cherry pick action finished successfully 🎉!
See: https://github.com/apecloud/kubeblocks-addons/actions/runs/27734567658

apecloud-bot pushed a commit that referenced this pull request Jun 18, 2026