Commit 12e5fdf
ops(rolling-update): bump raftadmin RPC timeout + retry transfer
2026-05-21 production rolling-update reproduction (`d.sh` re-run
after main advanced) aborted on the n2 → n1 leadership transfer
with:
    targeted leadership transfer RPC failed:
    rpc error: code = FailedPrecondition
    desc = etcd raft leadership transfer aborted
n1 had just been restarted on the previous iteration
(updated 17:16:25) and was still in its post-restart, pre-stable
state when n2 tried to hand leadership over. Raft refused the
transfer because the candidate's log was not yet caught up. The
script then fell back to generic transfer, which also returned
the same FailedPrecondition, and bailed out, leaving the
cluster half-deployed (n1 new, n2-n5 old). Manual recovery
required a re-run with `RAFTADMIN_RPC_TIMEOUT_SECONDS=30` plus
a manual raftadmin call to nudge leadership.
Two changes:
1. Raise the default `RAFTADMIN_RPC_TIMEOUT_SECONDS` from 5 → 15.
   5 seconds gave the transfer RPC no headroom over even a brief
   raft-internal abort. 15 s is still small enough that a truly
   stuck call surfaces fast, while comfortably covering the
   ~10 s catch-up window of a freshly-restarted candidate.
2. Retry the targeted `leadership_transfer_to_server` RPC up to
   `LEADERSHIP_TRANSFER_RETRY_ATTEMPTS` (default 3) times with
   `LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS` (default 5) of
   backoff between attempts. Only after all targeted retries are
   exhausted does the script fall back to generic transfer.
The retry is on the targeted call, not the generic one,
because the generic fallback chooses whatever the engine
thinks is a healthy candidate — that decision can race with
the same post-restart catch-up window and the same
FailedPrecondition arrives again. Retrying the targeted call
gives the original chosen candidate the few extra seconds it
needs to catch up.
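
For illustration, the retry-then-fallback order described above might look
roughly like the sketch below. It is not the code from this commit: only
`leadership_transfer_to_server` and the two env var names come from the
message, while `transfer_with_retry`, `transfer_leadership`, and
`generic_leadership_transfer` are hypothetical names, and both transfer
functions are assumed to be defined elsewhere in the script.

```shell
#!/usr/bin/env bash
# Sketch only. Defaults mirror the ones stated in the commit message.
LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS:-3}"
LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS:-5}"

# Try the targeted transfer up to N times, sleeping between attempts.
# Returns 0 on the first success, 1 once every attempt has failed.
# Assumes leadership_transfer_to_server is provided by the script.
transfer_with_retry() {
  local target="$1"
  local attempt
  for ((attempt = 1; attempt <= LEADERSHIP_TRANSFER_RETRY_ATTEMPTS; attempt++)); do
    echo "leadership transfer to ${target}: attempt ${attempt}/${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS}" >&2
    if leadership_transfer_to_server "${target}"; then
      return 0
    fi
    # Skip the backoff after the final failed attempt.
    if ((attempt < LEADERSHIP_TRANSFER_RETRY_ATTEMPTS)); then
      sleep "${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS}"
    fi
  done
  return 1
}

# Targeted retries first; the generic fallback (hypothetical name) runs
# only after they are exhausted.
transfer_leadership() {
  local target="$1"
  transfer_with_retry "${target}" || generic_leadership_transfer
}
```

With the defaults, a candidate that keeps failing gets three attempts
spaced 5 s apart (about 10 s of backoff in total) before the generic path
is tried.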
The env example is updated to match the new defaults and document
the new tunables. The new env vars are forwarded to the remote ssh
sub-process so node-side script invocations see them.
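
One way such forwarding could be done is to prefix the remote command with
quoted `NAME=value` pairs. The sketch below is an assumption about the
mechanism, not a copy of the script: `forwarded_env` is a hypothetical
helper, and the `ssh` line in the trailing comment is only an illustrative
invocation.

```shell
#!/usr/bin/env bash
# Sketch only. Defaults mirror the ones stated in the commit message.
RAFTADMIN_RPC_TIMEOUT_SECONDS="${RAFTADMIN_RPC_TIMEOUT_SECONDS:-15}"
LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS:-3}"
LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS:-5}"

# Hypothetical helper: print the tunables as shell-quoted NAME=value
# pairs (each followed by a space) so they can prefix a remote command.
forwarded_env() {
  local name
  for name in \
    RAFTADMIN_RPC_TIMEOUT_SECONDS \
    LEADERSHIP_TRANSFER_RETRY_ATTEMPTS \
    LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS; do
    # ${!name} is bash indirect expansion; %q quotes the value for reuse.
    printf '%s=%q ' "${name}" "${!name}"
  done
}

# Illustrative use (node name and remote command are placeholders):
#   ssh "${node}" "$(forwarded_env) bash scripts/rolling-update.sh"
```

Operator overrides set in the local environment then reach the node-side
invocation instead of being silently replaced by the remote defaults.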
Caller audit:
- `leadership_transfer_to_server` retry: the behavior change is
  "may retry up to N times before failing". Callers
  (`maybe_transfer_leadership`) treat any failed return as a
  refusal to restart; the change only delays that decision under
  transient failure and never widens its scope.
- `RAFTADMIN_RPC_TIMEOUT_SECONDS`: every raftadmin RPC respects
  this. Raising the default does not change which RPCs succeed;
  it only widens the window before a slow RPC is killed. Safe.
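
As a rough illustration of why a larger timeout only widens the kill
window, a deadline wrapper could be built on GNU coreutils `timeout`. How
the script actually enforces the limit is not shown in this commit, so
`run_with_rpc_timeout` is a hypothetical name and the use of `timeout` is
an assumption.

```shell
#!/usr/bin/env bash
# Sketch only. Default mirrors the one stated in the commit message.
RAFTADMIN_RPC_TIMEOUT_SECONDS="${RAFTADMIN_RPC_TIMEOUT_SECONDS:-15}"

# Hypothetical wrapper: run any command under the shared RPC deadline.
# GNU `timeout` exits with status 124 when it has to kill the command;
# otherwise the command's own exit status is passed through unchanged.
run_with_rpc_timeout() {
  timeout "${RAFTADMIN_RPC_TIMEOUT_SECONDS}" "$@"
}
```

A command that finishes inside the window returns the same status under a
5 s or a 15 s limit; only calls that would otherwise have been killed
between 5 s and 15 s behave differently.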
Test:
`bash -n scripts/rolling-update.sh`: clean.
Production re-run pending (next d.sh invocation will exercise
the retry path; logs will show `attempt N/3` lines if it
fires).
1 parent b218dd3 commit 12e5fdf
2 files changed
Lines changed: 56 additions & 5 deletions