[Bug] LogSet shard pending indefinitely after rapid InPlace restarts: Raft config change deadlock #24203

@xzxiong

Description

Bug Description

After a LogSet Pod goes through multiple InPlace restarts (image changes) within a short window, the Raft membership change for one shard deadlocks and the shard stays permanently pending, never self-healing.

Environment

  • Version: v3.0.0-b79773d55-2026-04-25
  • Cluster: dev freetier-01, 3-replica LogSet
  • Trigger: two InPlace pod restarts (image changes) within ~13 minutes

Steps to Reproduce

  1. 3-replica LogSet running normally (log-0, log-1, log-2)
  2. Change LogSet image (InPlace update) → log-2 restarts (container kill + restart)
  3. While log-2 shard 1 is still recovering, change image again → log-2 restarts again
  4. After second restart, shard 1 on log-2 enters permanent pending state

Observed Behavior

log-2 continuously outputs (every second, indefinitely):

shard 1 is pending, not included into the heartbeat

log-0 (shard 1 leader) fails every L/Add attempt:

ERROR logservice/service_commands.go:80 failed to add replica
error: request rejected
  dragonboat/v4@.../request.go:79

HAKeeper keeps retrying with an incrementing replicaID (5420635 → 5420781+) every ~18 seconds; every attempt is rejected.

Deleting and recreating the log-2 Pod does not fix the issue, because the pending config change is persisted in the log-0/log-1 Raft logs.

Expected Behavior

  • shard 1 should eventually recover after a Pod restart
  • If a config change is stuck, there should be a timeout/cleanup mechanism
  • At minimum, HAKeeper should detect the deadlock and take corrective action (e.g., remove the stuck pending config change before retrying L/Add)

Root Cause Analysis

dragonboat rejects AddReplica requests while the Raft group has an ongoing (pending) config change. The sequence (a minimal sketch of the rejection path follows this list):

  1. log-2 restarts → HAKeeper detects shard 1 replica down → sends L/Add with new replicaID
  2. log-0 starts executing the config change (AddReplica)
  3. Before config change completes, log-2 restarts again
  4. HAKeeper sends another L/Add with a newer replicaID
  5. dragonboat rejects it because the previous config change is still pending
  6. The old config change never completes (the target node restarted with different state)
  7. Deadlock: the old config change blocks new ones, yet can itself never complete
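
For illustration, here is a minimal sketch of how this rejection surfaces to the caller, assuming dragonboat v4's NodeHost API (SyncGetShardMembership, SyncRequestAddReplica, ErrRejected); the function name and IDs are placeholders, not the actual logservice code:

package example

import (
    "context"
    "errors"
    "fmt"
    "time"

    "github.com/lni/dragonboat/v4"
)

// tryAddReplica mirrors what the L/Add path does conceptually: read the
// current membership, then propose AddReplica with its ConfigChangeID.
func tryAddReplica(nh *dragonboat.NodeHost, shardID, replicaID uint64, target string) error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    m, err := nh.SyncGetShardMembership(ctx, shardID)
    if err != nil {
        return err
    }
    // Step 5 above: while the earlier config change is still pending,
    // dragonboat answers with ErrRejected ("request rejected"), the same
    // error log-0 keeps printing.
    err = nh.SyncRequestAddReplica(ctx, shardID, replicaID, target, m.ConfigChangeID)
    if errors.Is(err, dragonboat.ErrRejected) {
        return fmt.Errorf("shard %d: add replica %d rejected, config change pending: %w",
            shardID, replicaID, err)
    }
    return err
}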

Impact

  • shard 1 runs with only 2/3 replicas (degraded, no redundancy)
  • LogSet CR status still shows Ready=True, 3/3 Up, which is misleading since it only checks the store-level heartbeat (shard 0 / the HAKeeper shard is fine)
  • If one more log node fails, shard 1 loses majority → data unavailable

Suggested Fix Areas

  1. dragonboat: Add a timeout for pending config changes, auto-aborting any that do not complete within the threshold
  2. logservice: Before retrying L/Add, check whether a pending config change exists and wait for or abort it first
  3. HAKeeper: Detect repeated L/Add failures and escalate (e.g., L/Remove the old replica first, then L/Add); see the sketch after this list
  4. LogSet status: Report per-shard health, not just store-level heartbeat
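
As a hedged illustration of items 2 and 3 combined: if the shard's ConfigChangeID stops advancing across rejected retries, treat the config change as stuck and escalate by removing the dead replica before re-adding. ensureReplica, maxStuckRetries, and the escalation policy are hypothetical, not the current logservice/HAKeeper code; the dragonboat calls again assume the v4 NodeHost API:

package example

import (
    "context"
    "errors"
    "time"

    "github.com/lni/dragonboat/v4"
)

const maxStuckRetries = 3 // illustrative threshold

func ensureReplica(nh *dragonboat.NodeHost, shardID, deadReplicaID, newReplicaID uint64, target string) error {
    var lastCCID uint64
    stuck := 0
    for {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        m, err := nh.SyncGetShardMembership(ctx, shardID)
        if err != nil {
            cancel()
            return err
        }
        // Item 2: notice when the membership stops changing between
        // retries instead of blindly re-sending L/Add.
        if m.ConfigChangeID == lastCCID {
            stuck++
        } else {
            stuck, lastCCID = 0, m.ConfigChangeID
        }
        if stuck >= maxStuckRetries {
            // Item 3: escalate with L/Remove of the dead replica, then
            // loop back and retry L/Add against the fresh membership.
            err = nh.SyncRequestDeleteReplica(ctx, shardID, deadReplicaID, m.ConfigChangeID)
            cancel()
            if err != nil && !errors.Is(err, dragonboat.ErrRejected) {
                return err
            }
            time.Sleep(2 * time.Second)
            continue
        }
        err = nh.SyncRequestAddReplica(ctx, shardID, newReplicaID, target, m.ConfigChangeID)
        cancel()
        if err == nil {
            return nil
        }
        if !errors.Is(err, dragonboat.ErrRejected) {
            return err
        }
        time.Sleep(2 * time.Second) // backoff before the next attempt
    }
}

Escalating via L/Remove is a blunt workaround; a first-class abort for a pending config change in dragonboat (item 1) would make the wait/abort branch of item 2 possible without it.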

Logs

Full analysis with timeline: internal doc handbooks:docs/analysis/20260426-dev-freetier01-logset-shard1-raft-deadlock.md

Labels

kind/bug (Something isn't working)