Commit ab8eac8

ops(rolling-update): bump raftadmin RPC timeout + retry transfer (#799)
## Summary

Make `scripts/rolling-update.sh` survive the post-restart catch-up window during multi-node rolling updates:

1. Raise default `RAFTADMIN_RPC_TIMEOUT_SECONDS` from `5` → `15` (single-RPC headroom).
2. Add `LEADERSHIP_TRANSFER_RETRY_ATTEMPTS` (default `3`) and `LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS` (default `5`). The targeted `leadership_transfer_to_server` RPC is now retried with backoff before falling back to the generic transfer; the generic fallback is only used after all targeted retries are exhausted.

## Why

The 2026-05-21 production re-deploy reproduction:

```
==> [n2@192.168.0.211] start
node is leader; transferring leadership to n1@192.168.0.210:50051
targeted leadership transfer RPC failed: rpc error: code = FailedPrecondition desc = etcd raft leadership transfer aborted
falling back to generic leadership transfer
generic leadership transfer RPC failed: rpc error: code = FailedPrecondition desc = etcd raft leadership transfer aborted
[bailed out, cluster half-deployed]
```

n1 had gone through its rolling restart ~10 s earlier and its log had not yet caught up. Raft refused both the targeted and the generic transfer for the same reason. Manual recovery required `RAFTADMIN_RPC_TIMEOUT_SECONDS=30` plus a hand-issued `raftadmin` call.

## Caller audit

- `leadership_transfer_to_server` retry: callers (`maybe_transfer_leadership`) interpret any return failure as a refusal to restart. The change only delays that decision under transient failure, never widens its scope.
- `RAFTADMIN_RPC_TIMEOUT_SECONDS`: every raftadmin RPC respects this. Raising the default does not change which RPCs succeed; it only widens the kill window for a slow RPC.

## Test plan

- [x] `bash -n scripts/rolling-update.sh`: clean
- [ ] Production re-run exercises retry path (would surface as `attempt N/3` log lines if FailedPrecondition recurs)
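To put a number on how long the retry path can delay the "refuse to restart" decision, the sketch below adds up the worst case. It assumes each raftadmin RPC is cut off at `RAFTADMIN_RPC_TIMEOUT_SECONDS`, as the caller audit states, and it ignores the later `wait_for_leader_change` phase. The function name `worst_case_transfer_seconds` is hypothetical and does not exist in the script.

```shell
# Hypothetical helper, not part of scripts/rolling-update.sh.
# Worst-case seconds spent issuing transfer RPCs before the script gives up,
# assuming every RPC runs all the way into the RPC timeout.
worst_case_transfer_seconds() {
  local attempts="$1" backoff="$2" rpc_timeout="$3"
  # N targeted attempts, N-1 sleeps between them, then one generic fallback RPC.
  echo $(( attempts * rpc_timeout + (attempts - 1) * backoff + rpc_timeout ))
}

worst_case_transfer_seconds 3 5 15   # new defaults
worst_case_transfer_seconds 1 5 5    # retry disabled, old 5 s timeout
```

With the new defaults this comes to 70 s, against 10 s for the old behavior. Both figures are upper bounds under the stated assumption, not measurements.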
2 parents fde4ee3 + 12e5fdf commit ab8eac8

2 files changed

Lines changed: 56 additions & 5 deletions

scripts/rolling-update.env.example

Lines changed: 14 additions & 1 deletion
@@ -64,9 +64,22 @@ SSH_STRICT_HOST_KEY_CHECKING="accept-new"
 # If set, this binary must already be executable on the local control host.
 # RAFTADMIN_BIN="/absolute/path/to/linux/raftadmin"
 RAFTADMIN_REMOTE_BIN="/tmp/elastickv-raftadmin"
-RAFTADMIN_RPC_TIMEOUT_SECONDS="5"
+# Bumped from 5 to 15 (2026-05-22) so leadership-transfer RPCs survive
+# raft's transient pre-stable state right after a peer restart. The
+# 2026-05-21 reproduction (Actions run 26198185540) needed ~10 s of
+# headroom for the candidate's log to catch up before the transfer
+# could commit.
+RAFTADMIN_RPC_TIMEOUT_SECONDS="15"
 RAFTADMIN_ALLOW_INSECURE="true"

+# Retry the targeted leadership_transfer_to_server RPC up to N times
+# before falling back to generic transfer. Each retry waits
+# LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS to let the candidate's
+# log catch up. Counts the first attempt toward the budget; set to 1
+# to disable retry.
+LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="3"
+LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="5"
+
 # OOM defenses applied on 2026-04-24 after kernel OOM-SIGKILL cascades.
 # GOMEMLIMIT makes Go GC before the container hits --memory; --memory keeps
 # any kill scoped to the container, not host processes. Set either to "" to
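Because the retry loop compares against `LEADERSHIP_TRANSFER_RETRY_ATTEMPTS` in bash arithmetic, a value of `0` would skip the targeted transfer entirely and go straight to the generic fallback, and a non-numeric value would be read as a variable name rather than a number. The diff does not validate these settings; the following is only a sketch of how an operator might check them up front. The helper `require_int_at_least` is hypothetical.

```shell
# Hypothetical helper, not part of scripts/rolling-update.sh.
# Succeeds only when value is a decimal integer that is >= min.
require_int_at_least() {
  local name="$1" value="$2" min="$3"
  # 10# forces base 10 so a value such as "08" is not parsed as octal.
  if [[ ! "$value" =~ ^[0-9]+$ ]] || (( 10#$value < min )); then
    echo "${name} must be an integer >= ${min}, got '${value}'" >&2
    return 1
  fi
}

# Usage with the defaults from the example env file.
require_int_at_least LEADERSHIP_TRANSFER_RETRY_ATTEMPTS "3" 1
require_int_at_least LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS "5" 0
```

The minimum of 1 for the attempt count follows the documented semantics that the first attempt counts toward the budget; a backoff of 0 is allowed here on the assumption that retrying without a pause is a legitimate choice.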

scripts/rolling-update.sh

Lines changed: 42 additions & 4 deletions
@@ -79,6 +79,8 @@ Optional environment:
 RAFTADMIN_REMOTE_BIN
 RAFTADMIN_RPC_TIMEOUT_SECONDS
 RAFTADMIN_ALLOW_INSECURE
+LEADERSHIP_TRANSFER_RETRY_ATTEMPTS
+LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS

 EXTRA_ENV
 Whitespace-separated list of additional container environment variables to
@@ -198,8 +200,23 @@ ROLLING_DELAY_SECONDS="${ROLLING_DELAY_SECONDS:-2}"
 SSH_CONNECT_TIMEOUT_SECONDS="${SSH_CONNECT_TIMEOUT_SECONDS:-10}"
 SSH_STRICT_HOST_KEY_CHECKING="${SSH_STRICT_HOST_KEY_CHECKING:-accept-new}"
 RAFTADMIN_REMOTE_BIN="${RAFTADMIN_REMOTE_BIN:-/tmp/elastickv-raftadmin}"
-RAFTADMIN_RPC_TIMEOUT_SECONDS="${RAFTADMIN_RPC_TIMEOUT_SECONDS:-5}"
+# Default raised from 5 s to 15 s after the 2026-05-21 reproduction
+# (https://github.com/bootjp/elastickv/actions/runs/26198185540) where
+# the previous node's restart left raft in a transient pre-stable state
+# for the next leadership-transfer RPC. 5 s gave the RPC no headroom
+# over a brief raft-internal abort and the script bailed out. 15 s is
+# still small enough that a truly stuck call surfaces quickly.
+RAFTADMIN_RPC_TIMEOUT_SECONDS="${RAFTADMIN_RPC_TIMEOUT_SECONDS:-15}"
 RAFTADMIN_ALLOW_INSECURE="${RAFTADMIN_ALLOW_INSECURE:-true}"
+# LEADERSHIP_TRANSFER_RETRY_ATTEMPTS bounds how many times we re-issue
+# a leadership_transfer_to_server RPC when raft returns
+# FailedPrecondition (e.g. "etcd raft leadership transfer aborted")
+# because the candidate's log has not yet caught up or an in-flight
+# conf change is blocking the transfer. The first attempt counts
+# toward the budget; ATTEMPTS=1 means "no retry". Default 3 covers
+# the ~10 s catch-up window that follows a peer's rolling restart.
+LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS:-3}"
+LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS:-5}"
 NODES="${NODES:-}"
 SSH_TARGETS="${SSH_TARGETS:-}"
 ROLLING_ORDER="${ROLLING_ORDER:-}"
@@ -574,6 +591,8 @@ update_one_node() {
 SQS_FIFO_PARTITION_MAP="$SQS_FIFO_PARTITION_MAP_Q" \
 HEALTH_TIMEOUT_SECONDS="$HEALTH_TIMEOUT_SECONDS" \
 LEADERSHIP_TRANSFER_TIMEOUT_SECONDS="$LEADERSHIP_TRANSFER_TIMEOUT_SECONDS" \
+LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="$LEADERSHIP_TRANSFER_RETRY_ATTEMPTS" \
+LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="$LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS" \
 LEADER_DISCOVERY_TIMEOUT_SECONDS="$LEADER_DISCOVERY_TIMEOUT_SECONDS" \
 RAFTADMIN_RPC_TIMEOUT_SECONDS="$RAFTADMIN_RPC_TIMEOUT_SECONDS" \
 RAFTADMIN_ALLOW_INSECURE="$RAFTADMIN_ALLOW_INSECURE" \
@@ -800,15 +819,34 @@ ensure_not_leader_before_restart() {
   candidate_addr="${candidate_host}:${RAFT_PORT}"

   echo "node is leader; transferring leadership to ${candidate_id}@${candidate_addr}"
-  rpc_output="$(raftadmin_text "${NODE_HOST}:${RAFT_PORT}" leadership_transfer_to_server "${candidate_id}" "${candidate_addr}")" || {
-    echo "targeted leadership transfer RPC failed: $rpc_output" >&2
+  # Retry the targeted transfer up to LEADERSHIP_TRANSFER_RETRY_ATTEMPTS
+  # times. A common failure shape under rolling restarts is etcd raft
+  # rejecting the transfer with "etcd raft leadership transfer aborted"
+  # when the candidate's log has not yet caught up to the leader. The
+  # candidate typically becomes ready within a few seconds, so a brief
+  # backoff between attempts is usually enough. Only when ALL targeted
+  # retries are exhausted do we fall back to the generic transfer.
+  local attempt=1 transfer_succeeded=false
+  while (( attempt <= LEADERSHIP_TRANSFER_RETRY_ATTEMPTS )); do
+    if rpc_output="$(raftadmin_text "${NODE_HOST}:${RAFT_PORT}" leadership_transfer_to_server "${candidate_id}" "${candidate_addr}")"; then
+      transfer_succeeded=true
+      break
+    fi
+    echo "targeted leadership transfer attempt ${attempt}/${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS} failed: $rpc_output" >&2
+    if (( attempt < LEADERSHIP_TRANSFER_RETRY_ATTEMPTS )); then
+      echo "retrying in ${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS}s..." >&2
+      sleep "${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS}"
+    fi
+    attempt=$(( attempt + 1 ))
+  done
+  if [[ "$transfer_succeeded" != "true" ]]; then
     echo "falling back to generic leadership transfer"
     rpc_output="$(raftadmin_text "${NODE_HOST}:${RAFT_PORT}" leadership_transfer)" || {
       echo "generic leadership transfer RPC failed: $rpc_output" >&2
       return 1
     }
     candidate_addr=""
-  }
+  fi

   if ! wait_for_leader_change "${NODE_HOST}:${RAFT_PORT}" "$candidate_addr"; then
     echo "leadership did not move away from ${NODE_HOST}:${RAFT_PORT} within ${LEADERSHIP_TRANSFER_TIMEOUT_SECONDS}s" >&2
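Since the test plan leaves the retry path unexercised until the next production run, a local harness along the following lines might help confirm the loop's attempt accounting without a cluster. It is a sketch only: `raftadmin_text` is replaced by a stub that fails a configurable number of times, and `FAILS_BEFORE_SUCCESS`, `state_file`, and `try_targeted_transfer` are hypothetical names. The backoff defaults to 0 here purely to keep the run fast.

```shell
# Local harness for the retry loop; everything except the loop shape is a stand-in.
FAILS_BEFORE_SUCCESS="${FAILS_BEFORE_SUCCESS:-2}"
LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS:-3}"
LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS:-0}"

# The stub runs inside $(...), i.e. a subshell, so the call count lives in a file.
state_file="$(mktemp)"
echo 0 > "$state_file"

# Stub for the script's raftadmin_text: fails the first FAILS_BEFORE_SUCCESS calls.
raftadmin_text() {
  local calls
  calls=$(( $(cat "$state_file") + 1 ))
  echo "$calls" > "$state_file"
  if (( calls <= FAILS_BEFORE_SUCCESS )); then
    echo "rpc error: code = FailedPrecondition desc = etcd raft leadership transfer aborted"
    return 1
  fi
  echo "ok"
}

# Same loop shape as the diff: the first attempt counts toward the budget,
# and there is no sleep after the final failed attempt.
try_targeted_transfer() {
  local attempt=1 rpc_output
  while (( attempt <= LEADERSHIP_TRANSFER_RETRY_ATTEMPTS )); do
    if rpc_output="$(raftadmin_text stub-leader leadership_transfer_to_server n1 stub-candidate)"; then
      return 0
    fi
    echo "targeted leadership transfer attempt ${attempt}/${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS} failed: $rpc_output" >&2
    if (( attempt < LEADERSHIP_TRANSFER_RETRY_ATTEMPTS )); then
      sleep "${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS}"
    fi
    attempt=$(( attempt + 1 ))
  done
  return 1
}

if try_targeted_transfer; then
  echo "targeted transfer succeeded"
else
  echo "would fall back to generic leadership transfer"
fi
```

Setting `FAILS_BEFORE_SUCCESS` at or above the attempt budget exercises the fallback branch; setting `LEADERSHIP_TRANSFER_RETRY_ATTEMPTS=1` shows the documented "no retry" behavior.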
