chore(rabbitmq): start replicas in parallel #2718
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #2718 +/- ##
=====================================
Coverage 0.00% 0.00%
=====================================
Files 84 84
Lines 11664 11664
=====================================
Misses 11664 11664
RabbitMQ 4.3.1 candidate validation update
Result from the latest scoped Rule 12 soak: GREEN / PASS for
Pinned candidate:
Validation evidence:
Artifact:
Boundary:
weicao left a comment
Review: PR #2718 — fix(rabbitmq): start replicas in parallel
Reviewer: @JADE (PR Reviewer)
Stats: 1 file changed, +1 / -0
Summary
Single-line change: adds podManagementPolicy: Parallel to the RabbitMQ ComponentDefinition spec in cmpd.yaml.
Fixes a deadlock during Stop/Start with quorum queues: with the default OrderedReady, only ordinal 0 is created, but RabbitMQ 4.3.1 cannot boot quorum/coordination state without peers, so ordinal 0 never reaches Ready and ordinals 1/2 never spawn.
API Contract: PASS
podManagementPolicy is a first-class optional field on ComponentDefinition.spec in KubeBlocks apps/v1. Parallel is valid. Kafka, Elasticsearch, and PostgreSQL addons in this repo already use it.
Safety: PASS
- RabbitMQ rabbit_peer_discovery_k8s plugin with address_type = hostname is designed for parallel startup. It uses randomized delay + internal locking to prevent split-brain during formation.
- Official Bitnami and RabbitMQ Cluster Operator charts both use Parallel.
- The deadlock it fixes (quorum requires majority, OrderedReady only creates pod-0) is worse than any theoretical parallel startup risk.
- 6-hour soak run with restart + pod-delete faults shows zero alarms, zero nacked messages.
Occam's Razor: PASS
Simplest possible fix — one declarative field. No workarounds, no conditional logic.
Single Purpose: PASS
One file, one line, one concern.
Non-blocking
- PR targets jasper/rabbitmq-4.3.1-support, not main; confirm merge path.
- Change applies to all RabbitMQ versions via shared cmpd.yaml, not just 4.3.1. Is this intended? (Likely fine since the peer discovery plugin handles it for all supported versions.)
Scale-down question
With Parallel, K8s may delete pods simultaneously on scale-down. Does memberLeave (forget_cluster_node) still work when multiple pods terminate concurrently? Pre-existing concern, but more relevant with Parallel.
Verdict: APPROVE
Correct, minimal, well-evidenced. Solves a real deadlock with the standard Kubernetes approach for RabbitMQ. Multiple other addons in this repo use the same pattern.
Follow-up for the scale-down/memberLeave question on
What changed:
How it was verified:
Commit / evidence:
Boundary:
597134e to 00ff1c7
OrderedReady pod management causes permanent deadlock during Stop/Start for N>=3 clusters: pod 0 starts first but cannot form quorum alone, blocking subsequent pods from starting. Parallel policy starts all replicas simultaneously so Raft quorum can reform.
A/B proof: OrderedReady → SS04 Start deadlock (pod 0 stuck); Parallel → SS04 PASS (all pods start, quorum recovered, no message loss).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7b080ff to 676cba7
Thanks for the thorough review! Addressing the non-blocking items:
Evidence: chaos_stopstart 13/0/0 × 3 runs + chaos_network 15/0/0 + day2 D01-D07 (including D03 scale-in 5→3) = 77/0/0 total, all on
/cherry-pick release-1.1
/cherry-pick release-1.0
🤖 says: cherry pick action finished successfully 🎉!
(cherry picked from commit 4cbb29f)
🤖 says: cherry pick action finished successfully 🎉!
(cherry picked from commit 4cbb29f)
Summary
OrderedReady pod management causes permanent deadlock during Stop/Start for N>=3 clusters: pod 0 starts first but cannot form Raft quorum alone, blocking subsequent pods from ever starting. Parallel policy starts all replicas simultaneously so quorum can reform.
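The deadlock described in this summary can be illustrated with a toy model. This is only a sketch, not KubeBlocks or RabbitMQ code. It assumes that a running pod becomes Ready only once a strict majority of the N replicas are running, that OrderedReady creates ordinal i only after every lower ordinal is Ready, and that Parallel creates all ordinals at once. The names `majority` and `simulate` are hypothetical, and the model is meant for the N>=3 case discussed here.

```python
from __future__ import annotations


def majority(n: int) -> int:
    """Smallest member count that forms a strict majority of n."""
    return n // 2 + 1


def simulate(n: int, policy: str, max_rounds: int = 10) -> list[bool]:
    """Return the Ready flag per ordinal once the toy model stops changing.

    Assumption (toy model only): a running pod turns Ready when at least
    majority(n) pods are running, mimicking a node that waits for quorum.
    """
    if policy not in ("OrderedReady", "Parallel"):
        raise ValueError(f"unknown policy: {policy}")
    running = [False] * n
    ready = [False] * n
    for _ in range(max_rounds):
        before = (running.copy(), ready.copy())
        if policy == "Parallel":
            # All ordinals are created at once, regardless of readiness.
            running = [True] * n
        else:
            # OrderedReady: look at the lowest ordinal that is not running yet
            # and create it only if every lower ordinal is already Ready.
            for i in range(n):
                if not running[i]:
                    if all(ready[:i]):
                        running[i] = True
                    break
        # Quorum check: running pods become Ready only with a majority present.
        if sum(running) >= majority(n):
            ready = running.copy()
        if (running, ready) == before:
            break  # nothing changed this round, so the state is final
    return ready


if __name__ == "__main__":
    print(simulate(3, "OrderedReady"))  # [False, False, False]
    print(simulate(3, "Parallel"))      # [True, True, True]
```

Under these assumptions, OrderedReady with three replicas stalls with only ordinal 0 running and nothing Ready, while Parallel reaches a majority in the first round.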
Evidence
A/B proof on vcluster with chaos_stopstart suite (SS01-SS06):
N=2 ALL PASS (26/0/0). Run artifacts: artifacts/chaos-stopstart-pr2718-run{1,2}-*.tar.gz
Test plan