Commit ab8eac8
ops(rolling-update): bump raftadmin RPC timeout + retry transfer (#799)
## Summary
Make `scripts/rolling-update.sh` survive the post-restart catch-up
window during multi-node rolling updates:
1. Raise default `RAFTADMIN_RPC_TIMEOUT_SECONDS` from `5` → `15`
(single-RPC headroom).
2. Add `LEADERSHIP_TRANSFER_RETRY_ATTEMPTS` (default `3`) and
`LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS` (default `5`). The targeted
`leadership_transfer_to_server` RPC is now retried with backoff before
falling back to the generic transfer; the generic fallback is only used
after all targeted retries are exhausted.
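The retry-then-fallback flow in item 2 can be sketched as follows. This is a minimal illustration, not the code from `scripts/rolling-update.sh`: `try_targeted_transfer`, `try_generic_transfer`, `transfer_with_retry`, and the `TARGETED_*` counters are hypothetical stand-ins for the real raftadmin calls, and the log wording is assumed apart from the `attempt N/3` shape.

```shell
#!/usr/bin/env bash
# Sketch only. The two try_* functions are hypothetical stubs standing in for
# the script's real targeted and generic raftadmin transfer calls.

LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS:-3}"
LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS:-5}"

# Stub: the targeted transfer fails until the simulated follower has caught up,
# i.e. until it has been called TARGETED_SUCCEEDS_ON times.
TARGETED_CALLS=0
TARGETED_SUCCEEDS_ON="${TARGETED_SUCCEEDS_ON:-2}"
try_targeted_transfer() {
  TARGETED_CALLS=$((TARGETED_CALLS + 1))
  [ "$TARGETED_CALLS" -ge "$TARGETED_SUCCEEDS_ON" ]
}

# Stub: the generic transfer always fails here, as it did in the incident.
try_generic_transfer() {
  return 1
}

# Retry the targeted transfer with a fixed backoff; use the generic transfer
# only after every targeted attempt has failed.
transfer_with_retry() {
  local target="$1" attempt
  for ((attempt = 1; attempt <= LEADERSHIP_TRANSFER_RETRY_ATTEMPTS; attempt++)); do
    if try_targeted_transfer "$target"; then
      echo "targeted transfer to $target succeeded (attempt $attempt/$LEADERSHIP_TRANSFER_RETRY_ATTEMPTS)"
      return 0
    fi
    echo "targeted transfer failed (attempt $attempt/$LEADERSHIP_TRANSFER_RETRY_ATTEMPTS)" >&2
    # No sleep after the final attempt; go straight to the fallback.
    if ((attempt < LEADERSHIP_TRANSFER_RETRY_ATTEMPTS)); then
      sleep "$LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS"
    fi
  done
  echo "falling back to generic leadership transfer" >&2
  try_generic_transfer
}
```

With the defaults, a follower that needs about 10 s to catch up would be covered by the two backoff sleeps between the three targeted attempts, in addition to whatever time each RPC spends waiting on its own timeout.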
## Why
Log from the 2026-05-21 production re-deploy that reproduced the failure:
```
==> [n2@192.168.0.211] start
node is leader; transferring leadership to n1@192.168.0.210:50051
targeted leadership transfer RPC failed: rpc error: code = FailedPrecondition desc = etcd raft leadership transfer aborted
falling back to generic leadership transfer
generic leadership transfer RPC failed: rpc error: code = FailedPrecondition desc = etcd raft leadership transfer aborted
[bailed out, cluster half-deployed]
```
n1 had been restarted by the rolling update ~10 s earlier and its log
had not yet caught up. Raft refused both the targeted and the generic
transfer for the same reason. Manual recovery required
`RAFTADMIN_RPC_TIMEOUT_SECONDS=30` plus a hand-issued `raftadmin` call.
## Caller audit
- `leadership_transfer_to_server` retry: callers
(`maybe_transfer_leadership`) interpret any return failure as a refusal
to restart. The change only delays that decision under transient
failure, never widens its scope.
- `RAFTADMIN_RPC_TIMEOUT_SECONDS`: every raftadmin RPC respects this.
Raising the default does not change which RPCs succeed — only widens the
kill window for a slow RPC.
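To illustrate what "widens the kill window" means, here is a minimal sketch of a shared deadline wrapper. It assumes the GNU coreutils `timeout` command; the commit does not show how the script actually enforces the deadline, and `run_with_rpc_timeout` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_rpc_timeout is a hypothetical helper, and enforcing the
# deadline with coreutils `timeout` is an assumption, not the script's code.

RAFTADMIN_RPC_TIMEOUT_SECONDS="${RAFTADMIN_RPC_TIMEOUT_SECONDS:-15}"

# Run any command under the shared deadline. `timeout` exits with status 124
# when it has to kill the command; otherwise it passes through the command's
# own exit status.
run_with_rpc_timeout() {
  timeout "${RAFTADMIN_RPC_TIMEOUT_SECONDS}s" "$@"
}
```

A command that finishes in time returns the same status under a 5 s or a 15 s limit; only a command that is still running at the deadline is affected, which matches the claim that the larger default does not change which RPCs succeed.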
## Test plan
- [x] `bash -n scripts/rolling-update.sh` — clean
- [ ] Production re-run exercises retry path (would surface as
`attempt N/3` log lines if FailedPrecondition recurs)

2 files changed
Lines changed: 56 additions & 5 deletions