Commit 12e5fdf

ops(rolling-update): bump raftadmin RPC timeout + retry transfer
2026-05-21 production rolling-update reproduction (`d.sh` re-run after main advanced) aborted on the n2 → n1 leadership transfer with:

    targeted leadership transfer RPC failed: rpc error: code = FailedPrecondition desc = etcd raft leadership transfer aborted

n1 had just been restarted by the previous rolling-update iteration (updated 17:16:25) and was still in its post-restart pre-stable state when n2 tried to hand leadership over. raft refused the transfer because the candidate's log was not yet caught up. The script then fell back to generic transfer, which ALSO returned the same FailedPrecondition, and bailed out, leaving the cluster half-deployed (n1 new, n2-n5 old). Manual recovery required a re-run with `RAFTADMIN_RPC_TIMEOUT_SECONDS=30` plus a manual raftadmin call to nudge leadership.

Two changes:

1. Raise the default `RAFTADMIN_RPC_TIMEOUT_SECONDS` from 5 → 15. 5 seconds gave the transfer RPC no headroom over even a brief raft-internal abort. 15 s is still small enough that a truly stuck call surfaces fast, while comfortably covering the ~10 s catch-up window of a freshly-restarted candidate.

2. Retry the targeted `leadership_transfer_to_server` RPC up to `LEADERSHIP_TRANSFER_RETRY_ATTEMPTS` (default 3) times with `LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS` (default 5) of backoff between attempts. Only after all targeted retries are exhausted does the script fall back to generic transfer.

The retry is on the targeted call, not the generic one, because the generic fallback chooses whatever the engine thinks is a healthy candidate. That decision can race with the same post-restart catch-up window, in which case the same FailedPrecondition arrives again. Retrying the targeted call gives the originally chosen candidate the few extra seconds it needs to catch up.

The env example is updated to match the new defaults and to document the new tunables. The new env vars are forwarded to the remote ssh sub-process so node-side script invocations see them.

Caller audit:

- `leadership_transfer_to_server` retry: behavior change is "may retry up to N times before failing". Callers (`maybe_transfer_leadership`) treat any return failure as a refusal to restart; the change only delays that decision under transient failure, never widens its scope.
- `RAFTADMIN_RPC_TIMEOUT_SECONDS`: every raftadmin RPC respects this. Raising the default does not change which RPCs succeed; it only widens the window before a slow RPC is killed. Safe.

Test: `bash -n scripts/rolling-update.sh` is clean. Production re-run pending (the next `d.sh` invocation will exercise the retry path; logs will show `attempt N/3` lines if it fires).
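As a rough check on how long the targeted-transfer phase can now take, the worst case can be worked out from the three tunables. The sketch below assumes every attempt runs into the full RPC timeout before failing; `targeted_transfer_budget` is a hypothetical helper for illustration and is not part of `scripts/rolling-update.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): upper bound, in seconds, on the
# targeted-transfer phase, assuming each RPC is cut off at the RPC timeout.
targeted_transfer_budget() {
  local attempts="$1" rpc_timeout="$2" backoff="$3"
  # N attempts, each bounded by the timeout, with N-1 sleeps in between.
  echo $(( attempts * rpc_timeout + (attempts - 1) * backoff ))
}

targeted_transfer_budget 3 15 5   # new defaults: prints 55
targeted_transfer_budget 1 5 5    # single attempt with the old 5 s timeout: prints 5
```

With the new defaults the targeted phase is bounded by about 55 s under that assumption; a generic fallback RPC afterwards could add up to one more timeout on top.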
1 parent b218dd3 commit 12e5fdf

2 files changed

Lines changed: 56 additions & 5 deletions

scripts/rolling-update.env.example

Lines changed: 14 additions & 1 deletion
@@ -64,9 +64,22 @@ SSH_STRICT_HOST_KEY_CHECKING="accept-new"
 # If set, this binary must already be executable on the local control host.
 # RAFTADMIN_BIN="/absolute/path/to/linux/raftadmin"
 RAFTADMIN_REMOTE_BIN="/tmp/elastickv-raftadmin"
-RAFTADMIN_RPC_TIMEOUT_SECONDS="5"
+# Bumped from 5 to 15 (2026-05-22) so leadership-transfer RPCs survive
+# raft's transient pre-stable state right after a peer restart. The
+# 2026-05-21 reproduction (Actions run 26198185540) needed ~10 s of
+# headroom for the candidate's log to catch up before the transfer
+# could commit.
+RAFTADMIN_RPC_TIMEOUT_SECONDS="15"
 RAFTADMIN_ALLOW_INSECURE="true"
 
+# Retry the targeted leadership_transfer_to_server RPC up to N times
+# before falling back to generic transfer. Each retry waits
+# LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS to let the candidate's
+# log catch up. Counts the first attempt toward the budget; set to 1
+# to disable retry.
+LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="3"
+LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="5"
+
 # OOM defenses applied on 2026-04-24 after kernel OOM-SIGKILL cascades.
 # GOMEMLIMIT makes Go GC before the container hits --memory; --memory keeps
 # any kill scoped to the container, not host processes. Set either to "" to
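Because the first attempt counts toward the budget, a `LEADERSHIP_TRANSFER_RETRY_ATTEMPTS` of 0 would make the retry loop in this commit never run, so the script would go straight to the generic fallback. A guard along the following lines could catch that before a rollout starts. This is a minimal sketch: `require_int_at_least` is a hypothetical helper that the commit does not add, and it assumes whole-second values only.

```shell
#!/usr/bin/env bash
# Hypothetical validator (not in the commit): fail unless VALUE is a
# non-negative decimal integer that is at least MIN.
require_int_at_least() {
  local name="$1" value="$2" min="$3"
  # 10# forces base 10 so a value such as "08" is not parsed as octal.
  if [[ ! "$value" =~ ^[0-9]+$ ]] || (( 10#$value < min )); then
    echo "${name} must be an integer >= ${min}, got '${value}'" >&2
    return 1
  fi
}

# Example with the defaults from the env example above.
require_int_at_least LEADERSHIP_TRANSFER_RETRY_ATTEMPTS "3" 1
require_int_at_least LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS "5" 0
require_int_at_least RAFTADMIN_RPC_TIMEOUT_SECONDS "15" 1
```

The backoff is allowed to be 0 here, which would retry immediately; whether that is desirable in production is a judgment call the commit does not address.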

scripts/rolling-update.sh

Lines changed: 42 additions & 4 deletions
@@ -79,6 +79,8 @@ Optional environment:
 RAFTADMIN_REMOTE_BIN
 RAFTADMIN_RPC_TIMEOUT_SECONDS
 RAFTADMIN_ALLOW_INSECURE
+LEADERSHIP_TRANSFER_RETRY_ATTEMPTS
+LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS
 
 EXTRA_ENV
 Whitespace-separated list of additional container environment variables to
@@ -198,8 +200,23 @@ ROLLING_DELAY_SECONDS="${ROLLING_DELAY_SECONDS:-2}"
 SSH_CONNECT_TIMEOUT_SECONDS="${SSH_CONNECT_TIMEOUT_SECONDS:-10}"
 SSH_STRICT_HOST_KEY_CHECKING="${SSH_STRICT_HOST_KEY_CHECKING:-accept-new}"
 RAFTADMIN_REMOTE_BIN="${RAFTADMIN_REMOTE_BIN:-/tmp/elastickv-raftadmin}"
-RAFTADMIN_RPC_TIMEOUT_SECONDS="${RAFTADMIN_RPC_TIMEOUT_SECONDS:-5}"
+# Default raised from 5 s to 15 s after the 2026-05-21 reproduction
+# (https://github.com/bootjp/elastickv/actions/runs/26198185540) where
+# the previous node's restart left raft in a transient pre-stable state
+# for the next leadership-transfer RPC. 5 s gave the RPC no headroom
+# over a brief raft-internal abort and the script bailed out. 15 s is
+# still small enough that a truly stuck call surfaces quickly.
+RAFTADMIN_RPC_TIMEOUT_SECONDS="${RAFTADMIN_RPC_TIMEOUT_SECONDS:-15}"
 RAFTADMIN_ALLOW_INSECURE="${RAFTADMIN_ALLOW_INSECURE:-true}"
+# LEADERSHIP_TRANSFER_RETRY_ATTEMPTS bounds how many times we re-issue
+# a leadership_transfer_to_server RPC when raft returns
+# FailedPrecondition (e.g. "etcd raft leadership transfer aborted")
+# because the candidate's log has not yet caught up or an in-flight
+# conf change is blocking the transfer. The first attempt counts
+# toward the budget; ATTEMPTS=1 means "no retry". Default 3 covers
+# the ~10 s catch-up window that follows a peer's rolling restart.
+LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS:-3}"
+LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS:-5}"
 NODES="${NODES:-}"
 SSH_TARGETS="${SSH_TARGETS:-}"
 ROLLING_ORDER="${ROLLING_ORDER:-}"
@@ -574,6 +591,8 @@ update_one_node() {
 SQS_FIFO_PARTITION_MAP="$SQS_FIFO_PARTITION_MAP_Q" \
 HEALTH_TIMEOUT_SECONDS="$HEALTH_TIMEOUT_SECONDS" \
 LEADERSHIP_TRANSFER_TIMEOUT_SECONDS="$LEADERSHIP_TRANSFER_TIMEOUT_SECONDS" \
+LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="$LEADERSHIP_TRANSFER_RETRY_ATTEMPTS" \
+LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="$LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS" \
 LEADER_DISCOVERY_TIMEOUT_SECONDS="$LEADER_DISCOVERY_TIMEOUT_SECONDS" \
 RAFTADMIN_RPC_TIMEOUT_SECONDS="$RAFTADMIN_RPC_TIMEOUT_SECONDS" \
 RAFTADMIN_ALLOW_INSECURE="$RAFTADMIN_ALLOW_INSECURE" \
@@ -800,15 +819,34 @@ ensure_not_leader_before_restart() {
 candidate_addr="${candidate_host}:${RAFT_PORT}"
 
 echo "node is leader; transferring leadership to ${candidate_id}@${candidate_addr}"
-rpc_output="$(raftadmin_text "${NODE_HOST}:${RAFT_PORT}" leadership_transfer_to_server "${candidate_id}" "${candidate_addr}")" || {
-  echo "targeted leadership transfer RPC failed: $rpc_output" >&2
+# Retry the targeted transfer up to LEADERSHIP_TRANSFER_RETRY_ATTEMPTS
+# times. A common failure shape under rolling restarts is etcd raft
+# rejecting the transfer with "etcd raft leadership transfer aborted"
+# when the candidate's log has not yet caught up to the leader. The
+# candidate typically becomes ready within a few seconds, so a brief
+# backoff between attempts is usually enough. Only when ALL targeted
+# retries are exhausted do we fall back to the generic transfer.
+local attempt=1 transfer_succeeded=false
+while (( attempt <= LEADERSHIP_TRANSFER_RETRY_ATTEMPTS )); do
+  if rpc_output="$(raftadmin_text "${NODE_HOST}:${RAFT_PORT}" leadership_transfer_to_server "${candidate_id}" "${candidate_addr}")"; then
+    transfer_succeeded=true
+    break
+  fi
+  echo "targeted leadership transfer attempt ${attempt}/${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS} failed: $rpc_output" >&2
+  if (( attempt < LEADERSHIP_TRANSFER_RETRY_ATTEMPTS )); then
+    echo "retrying in ${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS}s..." >&2
+    sleep "${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS}"
+  fi
+  attempt=$(( attempt + 1 ))
+done
+if [[ "$transfer_succeeded" != "true" ]]; then
   echo "falling back to generic leadership transfer"
   rpc_output="$(raftadmin_text "${NODE_HOST}:${RAFT_PORT}" leadership_transfer)" || {
     echo "generic leadership transfer RPC failed: $rpc_output" >&2
     return 1
   }
   candidate_addr=""
-}
+fi
 
 if ! wait_for_leader_change "${NODE_HOST}:${RAFT_PORT}" "$candidate_addr"; then
   echo "leadership did not move away from ${NODE_HOST}:${RAFT_PORT} within ${LEADERSHIP_TRANSFER_TIMEOUT_SECONDS}s" >&2
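The retry path in the hunk above can be exercised locally without a cluster. The sketch below is a simplified stand-in, not the script's code: `fake_transfer` is a hypothetical stub that replaces `raftadmin_text` and fails twice with the error from the incident before succeeding, and `retry_transfer` mirrors the loop's shape with the attempt count and backoff passed as arguments. Note that the loop retries on any non-zero exit, not only on FailedPrecondition.

```shell
#!/usr/bin/env bash
# The stub runs inside $(...), i.e. in a subshell, so the call count is
# kept in a temp file rather than in a shell variable.
counter_file="$(mktemp)"
echo 0 > "$counter_file"

# Hypothetical stand-in for raftadmin_text: fails on calls 1 and 2.
fake_transfer() {
  local n
  n=$(( $(<"$counter_file") + 1 ))
  echo "$n" > "$counter_file"
  if (( n < 3 )); then
    echo "rpc error: code = FailedPrecondition desc = etcd raft leadership transfer aborted"
    return 1
  fi
  echo "ok"
}

# Same shape as the loop in the diff: the first attempt counts toward
# the budget, and there is no sleep after the final failed attempt.
retry_transfer() {
  local attempts="$1" backoff="$2" attempt=1 rpc_output
  while (( attempt <= attempts )); do
    if rpc_output="$(fake_transfer)"; then
      echo "transfer succeeded on attempt ${attempt}/${attempts}"
      return 0
    fi
    echo "targeted leadership transfer attempt ${attempt}/${attempts} failed: $rpc_output" >&2
    if (( attempt < attempts )); then
      sleep "$backoff"
    fi
    attempt=$(( attempt + 1 ))
  done
  return 1
}

retry_transfer 3 0   # backoff of 0 keeps the demo instant
```

Run as written, this logs two `attempt N/3 failed` lines on stderr and then reports success on the third attempt; with a budget of 2 the same stub would exhaust its retries, which is the point where the real script falls back to the generic transfer.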
