Skip to content

Commit df5ea5e

Browse files
authored
scripts: forward ADMIN_* env vars to remote SSH heredoc (follow-up to #669) (#678)
## Summary Follow-up to #669 (now merged). The admin-flag plumbing I added in that PR has a real bug I caught while re-reading the script: every rollout would crash on the first remote node because the `ADMIN_*` variables were never forwarded across the SSH heredoc. ## What was broken `build_admin_flags` lives inside the remote SSH heredoc in `update_one_node` (`bash -s <<'REMOTE'`), but the `env` block that seeds the remote shell only forwarded the existing `IMAGE` / `RAFT_PORT` / `EXTRA_ENV` / etc. variables — no `ADMIN_*`. With `set -u` active on the remote, the first access of `${ADMIN_ENABLED}` inside `build_admin_flags` crashes the rollout **regardless** of whether admin is enabled (the helper is invoked unconditionally from `run_container`). So an operator running this script after #669 with the default `ADMIN_ENABLED=false` would have seen `ADMIN_ENABLED: unbound variable` on the first node touched, leaving at most one node restarted but otherwise the cluster intact (the per-node health check would exit non-zero before moving on). ## Fix 1. **Forward all 9 `ADMIN_*` variables through `env`**, alongside the existing forwarding pattern. Path-like values (`*_FILE`, `*_KEYS`, `ADDRESS`) get `printf %q` quoting at the bottom of the local script (matches the existing `RAFT_TO_REDIS_MAP_Q` etc. pattern). The three boolean flags (`ADMIN_ENABLED`, `ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK`, `ADMIN_ALLOW_INSECURE_DEV_COOKIE`) are forwarded unquoted for readability — but only after a local validation pass that rejects anything other than the literal `true` / `false` so the unquoted forwarding stays metacharacter-safe. 2. **Defense-in-depth `:-` defaults inside `build_admin_flags`.** Every `ADMIN_*` reference inside the helper now reads through `${VAR:-}` once at the top into a local. A future refactor that ever drops one of the forwarded variables will produce the targeted "ADMIN_* required" error instead of an opaque `unbound variable` crash with no hint at which variable. ## Test plan - [x] `bash -n scripts/rolling-update.sh` — passes - [x] `ADMIN_ENABLED=invalid bash scripts/rolling-update.sh` → "must be 'true' or 'false', got 'invalid'" - [x] `ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK=yes` → same validator catches it - [x] `ADMIN_ENABLED=true` (no signing key) → reaches the remote branch where build_admin_flags would 'aborting' on the missing key - [ ] End-to-end rollout against a 3-node staging cluster with `ADMIN_ENABLED=true` (operator to verify before merging into a production deploy.env) - [ ] End-to-end rollout with `ADMIN_ENABLED=false` (the previously broken default path) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Improved validation and error reporting for deployment configuration parameters. * Enhanced environment variable forwarding and path handling in deployment scripts to increase reliability. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
2 parents 98d38b5 + 01e977b commit df5ea5e

1 file changed

Lines changed: 96 additions & 29 deletions

File tree

scripts/rolling-update.sh

Lines changed: 96 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -180,6 +180,24 @@ ADMIN_TLS_KEY_FILE="${ADMIN_TLS_KEY_FILE:-}"
180180
ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK="${ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK:-false}"
181181
ADMIN_ALLOW_INSECURE_DEV_COOKIE="${ADMIN_ALLOW_INSECURE_DEV_COOKIE:-false}"
182182

183+
# Validate the three boolean ADMIN_* flags must be the literal "true"
184+
# or "false" — they are forwarded to the remote shell unquoted (no
185+
# printf %q) for readability, which is only safe when the value is
186+
# already metacharacter-free. Reject anything else here so an operator
187+
# who typed "True", "1", or a stray quote sees a script-level error
188+
# pointing at the variable name instead of an inscrutable failure
189+
# inside the SSH heredoc.
190+
for _bool_var in ADMIN_ENABLED ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK ADMIN_ALLOW_INSECURE_DEV_COOKIE; do
191+
case "${!_bool_var}" in
192+
true|false) ;;
193+
*)
194+
echo "rolling-update: ${_bool_var} must be 'true' or 'false', got '${!_bool_var}'" >&2
195+
exit 1
196+
;;
197+
esac
198+
done
199+
unset _bool_var
200+
183201
# Container OOM defenses. See usage() for rationale. Empty string disables.
184202
DEFAULT_EXTRA_ENV="${DEFAULT_EXTRA_ENV-GOMEMLIMIT=1800MiB}"
185203
CONTAINER_MEMORY_LIMIT="${CONTAINER_MEMORY_LIMIT-2500m}"
@@ -498,6 +516,16 @@ update_one_node() {
498516
RAFT_TO_S3_MAP="$RAFT_TO_S3_MAP_Q" \
499517
EXTRA_ENV="$EXTRA_ENV_Q" \
500518
CONTAINER_MEMORY_LIMIT="$CONTAINER_MEMORY_LIMIT_Q" \
519+
ADMIN_ENABLED="$ADMIN_ENABLED" \
520+
ADMIN_ADDRESS="$ADMIN_ADDRESS_Q" \
521+
ADMIN_FULL_ACCESS_KEYS="$ADMIN_FULL_ACCESS_KEYS_Q" \
522+
ADMIN_READ_ONLY_ACCESS_KEYS="$ADMIN_READ_ONLY_ACCESS_KEYS_Q" \
523+
ADMIN_SESSION_SIGNING_KEY_FILE="$ADMIN_SESSION_SIGNING_KEY_FILE_Q" \
524+
ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE="$ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE_Q" \
525+
ADMIN_TLS_CERT_FILE="$ADMIN_TLS_CERT_FILE_Q" \
526+
ADMIN_TLS_KEY_FILE="$ADMIN_TLS_KEY_FILE_Q" \
527+
ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK="$ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK" \
528+
ADMIN_ALLOW_INSECURE_DEV_COOKIE="$ADMIN_ALLOW_INSECURE_DEV_COOKIE" \
501529
bash -s <<'REMOTE'
502530
set -euo pipefail
503531
@@ -838,73 +866,94 @@ build_admin_flags() {
838866
local -n _flags="$1"
839867
local -n _volumes="$2"
840868
841-
if [[ "${ADMIN_ENABLED}" != "true" ]]; then
869+
# `:-` defaults are defense-in-depth: build_admin_flags runs inside
870+
# the remote SSH heredoc with `set -u` active, and every ADMIN_*
871+
# variable is forwarded explicitly via the env block in
872+
# update_one_node. If a future refactor ever drops one of the
873+
# forwarded variables, the operator gets the targeted "ADMIN_*
874+
# required" error below instead of an opaque "unbound variable"
875+
# crash with no hint at which variable. All nine ADMIN_* values
876+
# are read into locals once at the top so the rest of the helper
877+
# cannot accidentally re-fetch a global and bypass the safety net
878+
# (gemini medium on PR #678 caught the original three-boolean gap).
879+
local enabled="${ADMIN_ENABLED:-false}"
880+
if [[ "$enabled" != "true" ]]; then
842881
return 0
843882
fi
844883
845-
if [[ -z "${ADMIN_SESSION_SIGNING_KEY_FILE}" ]]; then
884+
local signing_key="${ADMIN_SESSION_SIGNING_KEY_FILE:-}"
885+
local full_keys="${ADMIN_FULL_ACCESS_KEYS:-}"
886+
local read_only_keys="${ADMIN_READ_ONLY_ACCESS_KEYS:-}"
887+
local previous_key="${ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE:-}"
888+
local admin_listen="${ADMIN_ADDRESS:-}"
889+
local tls_cert="${ADMIN_TLS_CERT_FILE:-}"
890+
local tls_key="${ADMIN_TLS_KEY_FILE:-}"
891+
local allow_plaintext="${ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK:-false}"
892+
local insecure_cookie="${ADMIN_ALLOW_INSECURE_DEV_COOKIE:-false}"
893+
894+
if [[ -z "$signing_key" ]]; then
846895
echo "ADMIN_ENABLED=true requires ADMIN_SESSION_SIGNING_KEY_FILE; aborting" >&2
847896
exit 1
848897
fi
849-
if [[ -z "${ADMIN_FULL_ACCESS_KEYS}" && -z "${ADMIN_READ_ONLY_ACCESS_KEYS}" ]]; then
898+
if [[ -z "$full_keys" && -z "$read_only_keys" ]]; then
850899
echo "ADMIN_ENABLED=true requires at least one of ADMIN_FULL_ACCESS_KEYS / ADMIN_READ_ONLY_ACCESS_KEYS; aborting" >&2
851900
exit 1
852901
fi
853-
if [[ ! -f "$ADMIN_SESSION_SIGNING_KEY_FILE" || ! -r "$ADMIN_SESSION_SIGNING_KEY_FILE" ]]; then
854-
echo "ADMIN_SESSION_SIGNING_KEY_FILE='$ADMIN_SESSION_SIGNING_KEY_FILE' is missing or unreadable on this host; aborting" >&2
902+
if [[ ! -f "$signing_key" || ! -r "$signing_key" ]]; then
903+
echo "ADMIN_SESSION_SIGNING_KEY_FILE='$signing_key' is missing or unreadable on this host; aborting" >&2
855904
exit 1
856905
fi
857906
858907
_flags+=(--adminEnabled)
859-
_flags+=(--adminSessionSigningKeyFile "$ADMIN_SESSION_SIGNING_KEY_FILE")
860-
_volumes+=(-v "${ADMIN_SESSION_SIGNING_KEY_FILE}:${ADMIN_SESSION_SIGNING_KEY_FILE}:ro")
908+
_flags+=(--adminSessionSigningKeyFile "$signing_key")
909+
_volumes+=(-v "${signing_key}:${signing_key}:ro")
861910
862-
if [[ -n "${ADMIN_ADDRESS}" ]]; then
863-
_flags+=(--adminListen "$ADMIN_ADDRESS")
911+
if [[ -n "$admin_listen" ]]; then
912+
_flags+=(--adminListen "$admin_listen")
864913
fi
865-
if [[ -n "${ADMIN_FULL_ACCESS_KEYS}" ]]; then
866-
_flags+=(--adminFullAccessKeys "$ADMIN_FULL_ACCESS_KEYS")
914+
if [[ -n "$full_keys" ]]; then
915+
_flags+=(--adminFullAccessKeys "$full_keys")
867916
fi
868-
if [[ -n "${ADMIN_READ_ONLY_ACCESS_KEYS}" ]]; then
869-
_flags+=(--adminReadOnlyAccessKeys "$ADMIN_READ_ONLY_ACCESS_KEYS")
917+
if [[ -n "$read_only_keys" ]]; then
918+
_flags+=(--adminReadOnlyAccessKeys "$read_only_keys")
870919
fi
871920
872-
if [[ -n "${ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE}" ]]; then
873-
if [[ ! -f "$ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE" || ! -r "$ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE" ]]; then
874-
echo "ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE='$ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE' is missing or unreadable; aborting" >&2
921+
if [[ -n "$previous_key" ]]; then
922+
if [[ ! -f "$previous_key" || ! -r "$previous_key" ]]; then
923+
echo "ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE='$previous_key' is missing or unreadable; aborting" >&2
875924
exit 1
876925
fi
877-
_flags+=(--adminSessionSigningKeyPreviousFile "$ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE")
878-
_volumes+=(-v "${ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE}:${ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE}:ro")
926+
_flags+=(--adminSessionSigningKeyPreviousFile "$previous_key")
927+
_volumes+=(-v "${previous_key}:${previous_key}:ro")
879928
fi
880929
881930
# TLS pair must be set together. The daemon already rejects partial
882931
# configs at startup, but failing earlier here gives the operator a
883932
# script-level error pointing at the variable name, instead of the
884933
# daemon's "exactly one of cert/key" message after the container is
885934
# already running.
886-
if [[ -n "${ADMIN_TLS_CERT_FILE}" || -n "${ADMIN_TLS_KEY_FILE}" ]]; then
887-
if [[ -z "${ADMIN_TLS_CERT_FILE}" || -z "${ADMIN_TLS_KEY_FILE}" ]]; then
935+
if [[ -n "$tls_cert" || -n "$tls_key" ]]; then
936+
if [[ -z "$tls_cert" || -z "$tls_key" ]]; then
888937
echo "ADMIN_TLS_CERT_FILE and ADMIN_TLS_KEY_FILE must be set together; aborting" >&2
889938
exit 1
890939
fi
891-
if [[ ! -f "$ADMIN_TLS_CERT_FILE" || ! -r "$ADMIN_TLS_CERT_FILE" ]]; then
892-
echo "ADMIN_TLS_CERT_FILE='$ADMIN_TLS_CERT_FILE' is missing or unreadable; aborting" >&2
940+
if [[ ! -f "$tls_cert" || ! -r "$tls_cert" ]]; then
941+
echo "ADMIN_TLS_CERT_FILE='$tls_cert' is missing or unreadable; aborting" >&2
893942
exit 1
894943
fi
895-
if [[ ! -f "$ADMIN_TLS_KEY_FILE" || ! -r "$ADMIN_TLS_KEY_FILE" ]]; then
896-
echo "ADMIN_TLS_KEY_FILE='$ADMIN_TLS_KEY_FILE' is missing or unreadable; aborting" >&2
944+
if [[ ! -f "$tls_key" || ! -r "$tls_key" ]]; then
945+
echo "ADMIN_TLS_KEY_FILE='$tls_key' is missing or unreadable; aborting" >&2
897946
exit 1
898947
fi
899-
_flags+=(--adminTLSCertFile "$ADMIN_TLS_CERT_FILE" --adminTLSKeyFile "$ADMIN_TLS_KEY_FILE")
900-
_volumes+=(-v "${ADMIN_TLS_CERT_FILE}:${ADMIN_TLS_CERT_FILE}:ro")
901-
_volumes+=(-v "${ADMIN_TLS_KEY_FILE}:${ADMIN_TLS_KEY_FILE}:ro")
948+
_flags+=(--adminTLSCertFile "$tls_cert" --adminTLSKeyFile "$tls_key")
949+
_volumes+=(-v "${tls_cert}:${tls_cert}:ro")
950+
_volumes+=(-v "${tls_key}:${tls_key}:ro")
902951
fi
903952
904-
if [[ "${ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK}" == "true" ]]; then
953+
if [[ "$allow_plaintext" == "true" ]]; then
905954
_flags+=(--adminAllowPlaintextNonLoopback)
906955
fi
907-
if [[ "${ADMIN_ALLOW_INSECURE_DEV_COOKIE}" == "true" ]]; then
956+
if [[ "$insecure_cookie" == "true" ]]; then
908957
_flags+=(--adminAllowInsecureDevCookie)
909958
fi
910959
}
@@ -1140,6 +1189,24 @@ CONTAINER_NAME_Q="$(printf '%q' "$CONTAINER_NAME")"
11401189
RAFT_TO_REDIS_MAP_Q="$(printf '%q' "$RAFT_TO_REDIS_MAP")"
11411190
RAFT_TO_S3_MAP_Q="$(printf '%q' "$RAFT_TO_S3_MAP")"
11421191

1192+
# ADMIN_* values may contain commas (allow-lists), spaces (paths with
1193+
# spaces, though discouraged), or other shell metacharacters. The
1194+
# remote bash -s reparses the whole `env KEY=value … bash` line through
1195+
# the login shell once, so every value the operator might set has to
1196+
# survive that pass intact. printf %q is the same hardening every
1197+
# other forwarded path-like variable above gets.
1198+
# The two boolean flags (ADMIN_ENABLED, ADMIN_ALLOW_*) are validated
1199+
# at the top of the local script to be the literal "true" or "false",
1200+
# so they need no extra escaping — kept unquoted at the env site for
1201+
# readability.
1202+
ADMIN_ADDRESS_Q="$(printf '%q' "$ADMIN_ADDRESS")"
1203+
ADMIN_FULL_ACCESS_KEYS_Q="$(printf '%q' "$ADMIN_FULL_ACCESS_KEYS")"
1204+
ADMIN_READ_ONLY_ACCESS_KEYS_Q="$(printf '%q' "$ADMIN_READ_ONLY_ACCESS_KEYS")"
1205+
ADMIN_SESSION_SIGNING_KEY_FILE_Q="$(printf '%q' "$ADMIN_SESSION_SIGNING_KEY_FILE")"
1206+
ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE_Q="$(printf '%q' "$ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE")"
1207+
ADMIN_TLS_CERT_FILE_Q="$(printf '%q' "$ADMIN_TLS_CERT_FILE")"
1208+
ADMIN_TLS_KEY_FILE_Q="$(printf '%q' "$ADMIN_TLS_KEY_FILE")"
1209+
11431210
echo "[rolling-update] target image: $IMAGE"
11441211
for node_id in "${ROLLING_NODE_IDS[@]}"; do
11451212
update_one_node "$node_id" "$(node_host_by_id "$node_id")" "$(ssh_target_by_id "$node_id")"

0 commit comments

Comments
 (0)