Bug Description
After a LogSet Pod is InPlace-restarted (image change) several times within a short window, the Raft membership change for one shard deadlocks and that shard stays permanently pending without self-healing.
Environment
- Version: v3.0.0-b79773d55-2026-04-25
- Cluster: dev freetier-01, 3-replica LogSet
- Trigger: InPlace pod restart (image change) 2 times within ~13 minutes
Steps to Reproduce
- 3-replica LogSet running normally (log-0, log-1, log-2)
- Change LogSet image (InPlace update) → log-2 restarts (container kill + restart)
- While log-2 shard 1 is still recovering, change image again → log-2 restarts again
- After second restart, shard 1 on log-2 enters permanent pending state
Observed Behavior
log-2 continuously outputs (every second, indefinitely):
shard 1 is pending, not included into the heartbeat
log-0 (shard 1 leader) fails every L/Add attempt:
ERROR logservice/service_commands.go:80 failed to add replica
error: request rejected
dragonboat/v4@.../request.go:79
HAKeeper keeps retrying with incrementing replicaID (5420635 → 5420781+), every ~18 seconds, all rejected.
Deleting and recreating log-2 Pod does not fix the issue — the pending config change is persisted in log-0/log-1 Raft log.
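To confirm the stall from a healthy replica, the applied membership of shard 1 can be dumped and compared across HAKeeper retries: if the retried replicaIDs never show up and the membership's config-change index never moves, the L/Add proposals are being refused before they can be applied. A minimal Go sketch, assuming direct access to the *dragonboat.NodeHost on log-0/log-1 and that dragonboat v4 exposes SyncGetShardMembership (name taken from the v4 Cluster→Shard rename; it may differ between versions):

package sketch

import (
    "context"
    "fmt"
    "time"

    "github.com/lni/dragonboat/v4"
)

// dumpShard1Membership prints the applied membership of shard 1 as seen by a
// healthy replica (log-0 / log-1). If this output stays identical across
// HAKeeper retries, the L/Add proposals never reach the applied config.
func dumpShard1Membership(nh *dragonboat.NodeHost) error {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()
    m, err := nh.SyncGetShardMembership(ctx, 1) // shard 1 from this report
    if err != nil {
        return err
    }
    fmt.Printf("shard 1 membership: %+v\n", m)
    return nil
}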
Expected Behavior
- shard 1 should eventually recover after Pod restart
- If a config change is stuck, there should be a timeout/cleanup mechanism
- At minimum, HAKeeper should detect the deadlock and take corrective action (e.g., remove the stuck pending config change before retrying L/Add)
Root Cause Analysis
dragonboat rejects AddReplica requests when there is an ongoing (pending) config change in the Raft group. The sequence (a compressed sketch follows this list):
- log-2 restarts → HAKeeper detects shard 1 replica down → sends L/Add with new replicaID
- log-0 starts executing the config change (AddReplica)
- Before config change completes, log-2 restarts again
- HAKeeper sends another L/Add with a newer replicaID
- dragonboat rejects because previous config change is still pending
- The old config change never completes (target node restarted with different state)
- Deadlock: old config change blocks new ones, but old one cannot complete
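A compressed illustration of that overlap as a minimal Go sketch, not the actual logservice code: the first add-replica proposal is left in flight and a second one is submitted immediately, which per the analysis above is refused ("request rejected"). The nh handle, shard ID, replica IDs and target address are placeholders; RequestAddReplica, SyncRequestAddReplica, ResultC/Release and ErrRejected follow the dragonboat v4 naming, so exact signatures may differ between versions.

package sketch

import (
    "context"
    "errors"
    "fmt"
    "time"

    "github.com/lni/dragonboat/v4"
)

// demoOverlappingAdd compresses the timeline above into one function: a first
// add-replica proposal is left in flight and a second one is submitted right
// away with a newer replicaID, mirroring the two HAKeeper L/Add attempts.
func demoOverlappingAdd(nh *dragonboat.NodeHost) {
    const target = "log-2-addr:32001" // placeholder log store address

    // First L/Add (async), standing in for the config change that was still
    // in flight when log-2 restarted the second time.
    rs, err := nh.RequestAddReplica(1, 5420635, target, 0, 10*time.Second)
    if err != nil {
        fmt.Println("first add could not be submitted:", err)
        return
    }
    go func() {
        r := <-rs.ResultC() // consume the first result in the background
        fmt.Println("first add completed:", r.Completed())
        rs.Release()
    }()

    // Second L/Add with a newer replicaID while the first is still pending.
    // Per the analysis above this is refused, matching the "request rejected"
    // errors seen on log-0.
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if err := nh.SyncRequestAddReplica(ctx, 1, 5420781, target, 0); err != nil {
        fmt.Println("second add refused:", err,
            "is ErrRejected:", errors.Is(err, dragonboat.ErrRejected))
    }
}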
Impact
- shard 1 runs with only 2/3 replicas (degraded, no redundancy)
- LogSet CR status shows Ready=True, 3/3 Up, which is misleading: it only checks the store-level heartbeat (shard 0 / the HAKeeper shard is fine)
- If one more log node fails, shard 1 loses majority → data unavailable
Suggested Fix Areas
- dragonboat: Add timeout for pending config changes, auto-abort if not completed within threshold
- logservice: Before retrying L/Add, check if there is a pending config change and wait/abort it first
- HAKeeper: Detect repeated L/Add failures and escalate (e.g., L/Remove the old replica first, then L/Add; see the sketch after this list)
- LogSet status: Report per-shard health, not just store-level heartbeat
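A rough sketch of the logservice/HAKeeper guard and escalation suggested above, not the existing code: it reads the applied membership, passes its config-change index as the ordering value, and on a rejected add removes the stale replica before retrying (the "L/Remove old replica first, then L/Add" idea). addReplicaWithEscalation and all IDs/addresses are hypothetical; SyncGetShardMembership, SyncRequestAddReplica, SyncRequestDeleteReplica and Membership.ConfigChangeID are assumed from the dragonboat v4 rename of the v3 API and may differ.

package sketch

import (
    "context"
    "errors"
    "fmt"
    "time"

    "github.com/lni/dragonboat/v4"
)

// addReplicaWithEscalation sketches the guard described above: try a plain
// add first and, if dragonboat answers "request rejected", remove the stale
// replica left behind by the restarted store before retrying the add once.
func addReplicaWithEscalation(nh *dragonboat.NodeHost,
    shardID, staleReplicaID, newReplicaID uint64) error {
    const target = "log-2-addr:32001" // placeholder log store address
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // Read the applied membership first; its config-change index is the
    // ordering value dragonboat checks, so a change built on stale membership
    // is refused instead of being applied out of order.
    m, err := nh.SyncGetShardMembership(ctx, shardID)
    if err != nil {
        return err
    }

    err = nh.SyncRequestAddReplica(ctx, shardID, newReplicaID, target, m.ConfigChangeID)
    if err == nil || !errors.Is(err, dragonboat.ErrRejected) {
        return err
    }

    // Escalation ("L/Remove old replica first, then L/Add"): drop the stale
    // replica, refresh the ordering value, then retry the add once.
    fmt.Printf("add of replica %d rejected, removing stale replica %d first\n",
        newReplicaID, staleReplicaID)
    if err := nh.SyncRequestDeleteReplica(ctx, shardID, staleReplicaID,
        m.ConfigChangeID); err != nil {
        return err
    }
    if m, err = nh.SyncGetShardMembership(ctx, shardID); err != nil {
        return err
    }
    return nh.SyncRequestAddReplica(ctx, shardID, newReplicaID, target, m.ConfigChangeID)
}

Gating this guard behind a deadline would also cover the timeout/auto-abort idea listed for dragonboat itself.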
Logs
Full analysis with timeline: internal doc handbooks:docs/analysis/20260426-dev-freetier01-logset-shard1-raft-deadlock.md