Commit 894bce9

fix(workflow,docs): address round-3 deploy-via-tailscale review
- workflow nodes filter (Codex P2): reject any raft ID in the `nodes` input
  that does not appear in NODES_RAFT_MAP. Previously a typo like `n1,n9`
  silently rolled n1 only; now the workflow fails fast with a list of known
  IDs so the operator sees the typo before touching prod.
- runbook section 4 (Gemini Medium x2): GitHub's native environment
  protection rules cannot be made conditional on workflow inputs, so the
  previous "auto-approve dry-run" guidance was wrong. Documented the three
  workable options: accept the prompt for dry-runs too (v1 default), split
  into a second unprotected environment, or install a
  deployment-protection-rule GitHub App.
- runbook section 4 NODES_RAFT_MAP example (Gemini Medium): use full
  MagicDNS FQDNs instead of short hostnames so every node can resolve its
  peers regardless of local DNS search domains.
- runbook section 6 (Gemini Medium): added "If a running workflow is
  cancelled mid-rollout" recovery steps — find the in-flight node from logs,
  finish the recreate by hand, confirm leader, rerun scoped. Filed as a
  tracked gap to teach the workflow per-node start-markers in a follow-up.

Not addressed: Gemini HIGH line 187 claiming the workflow file is missing —
the file IS present at .github/workflows/rolling-update.yml and has been
since the first push of this PR. Third time the bot has flagged this (same
finding in rounds 1 and 2); leaving as-is since responding further would
just be repeating the same correction.
1 parent ad00bdc commit 894bce9

2 files changed

Lines changed: 71 additions & 10 deletions

.github/workflows/rolling-update.yml

Lines changed: 26 additions & 4 deletions
@@ -111,15 +111,37 @@ jobs:
           fi
           if [[ -n "$NODES_FILTER" ]]; then
             # Filter NODES_RAFT_MAP and SSH_TARGETS_MAP to the requested subset.
+            # Reject any filter ID that does not appear in the map: silently
+            # dropping unknown IDs would let a typo like "n1,n9" proceed as
+            # a one-node rollout of n1 alone, which is a staged-deploy
+            # footgun.
+            IFS=',' read -r -a wanted <<< "$NODES_FILTER"
+            IFS=',' read -r -a entries <<< "$NODES_RAFT_MAP"
+            declare -a known_ids=()
+            for e in "${entries[@]}"; do
+              known_ids+=("${e%%=*}")
+            done
+            unknown=""
+            for w in "${wanted[@]}"; do
+              found=0
+              for k in "${known_ids[@]}"; do
+                if [[ "$k" == "$w" ]]; then found=1; break; fi
+              done
+              if [[ $found -eq 0 ]]; then unknown+="${unknown:+, }$w"; fi
+            done
+            if [[ -n "$unknown" ]]; then
+              echo "::error::nodes filter '$NODES_FILTER' references unknown raft IDs: $unknown. Known IDs: ${known_ids[*]}"
+              exit 1
+            fi
             filter_csv() {
               local all="$1"
               local filter="$2"
               local out=""
-              IFS=',' read -r -a entries <<< "$all"
-              IFS=',' read -r -a wanted <<< "$filter"
-              for e in "${entries[@]}"; do
+              IFS=',' read -r -a list_entries <<< "$all"
+              IFS=',' read -r -a list_wanted <<< "$filter"
+              for e in "${list_entries[@]}"; do
                 key="${e%%=*}"
-                for w in "${wanted[@]}"; do
+                for w in "${list_wanted[@]}"; do
                   if [[ "$key" == "$w" ]]; then
                     out+="${e},"
                     break
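The rejection logic added in this hunk can be exercised outside the workflow as a standalone script. This is a sketch for local testing only: the map value and filter below are made-up examples, and the `::error::` annotation plus `exit 1` from the real workflow are replaced with a plain echo so the script runs to completion.

```shell
# Standalone sketch of the reject-unknown-IDs check from the hunk above.
# NODES_RAFT_MAP and NODES_FILTER are illustrative values, not prod config.
set -euo pipefail

NODES_RAFT_MAP='n1=kv01,n2=kv02,n3=kv03'
NODES_FILTER='n1,n9'   # deliberate typo: n9 is not a known raft ID

IFS=',' read -r -a wanted <<< "$NODES_FILTER"
IFS=',' read -r -a entries <<< "$NODES_RAFT_MAP"

declare -a known_ids=()
for e in "${entries[@]}"; do
  known_ids+=("${e%%=*}")    # keep only the raftId before '='
done

unknown=""
for w in "${wanted[@]}"; do
  found=0
  for k in "${known_ids[@]}"; do
    if [[ "$k" == "$w" ]]; then found=1; break; fi
  done
  if [[ $found -eq 0 ]]; then unknown+="${unknown:+, }$w"; fi
done

if [[ -n "$unknown" ]]; then
  # The workflow emits a ::error:: annotation and exits 1 here.
  echo "unknown raft IDs: $unknown (known: ${known_ids[*]})"
fi
# prints: unknown raft IDs: n9 (known: n1 n2 n3)
```

Running it shows the typo'd `n9` being caught instead of silently dropped, which is exactly the failure mode the commit message describes.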

docs/deploy_via_tailscale_runbook.md

Lines changed: 45 additions & 6 deletions
@@ -70,10 +70,23 @@ Copy the client ID and secret; they go into GitHub in the next step.
 Repo → Settings → Environments → New environment: `production`.
 
 ### Required reviewers
-Configure "Required reviewers" on the environment. Non-dry-run deploys will
-pause until one of the reviewers approves. Configure "Deployment protection
-rules" to auto-approve if the workflow input `dry_run == true` (optional; cuts
-friction for previews).
+Configure "Required reviewers" on the environment. **Every run that targets
+this environment pauses for approval** — including dry-runs, because
+GitHub's native environment-protection rules cannot be made conditional on
+workflow inputs. Three ways to handle the dry-run-approval friction:
+
+1. **Accept the prompt for dry-runs too.** A dry-run requires one approver
+   click before it proceeds; still cheap and keeps the policy simple.
+2. **Add a second environment `production-dry-run` without required
+   reviewers** and change the workflow to pick the environment via
+   `environment: ${{ inputs.dry_run && 'production-dry-run' || 'production' }}`.
+   Cleanest but doubles the secrets/vars you must keep in sync.
+3. **Install a deployment-protection-rule GitHub App** (custom or
+   marketplace) that approves runs whose inputs show `dry_run == true`.
+   Most flexible; most setup.
+
+v1 ships with approach 1 (single environment, prompt on every run).
+Approach 2 is the recommended upgrade once the friction becomes annoying.
 
 ### Environment secrets
 
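The expression in option 2 is a GitHub Actions ternary idiom. As a sanity check of its branching behaviour, here is a shell analogue; `pick_environment` is a hypothetical helper written for this illustration, and the two environment names are the ones option 2 proposes.

```shell
# Shell analogue of option 2's expression:
#   environment: ${{ inputs.dry_run && 'production-dry-run' || 'production' }}
# pick_environment is a hypothetical name used only for this sketch.
pick_environment() {
  local dry_run="$1"
  if [[ "$dry_run" == "true" ]]; then
    echo "production-dry-run"   # unprotected env: no reviewer prompt
  else
    echo "production"           # protected env: pauses for approval
  fi
}

pick_environment true    # prints: production-dry-run
pick_environment false   # prints: production
```

Either way, only the non-dry-run path lands on the environment carrying the required-reviewers rule.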

@@ -93,8 +106,8 @@ Regenerate on operator rotation.
 |------|-------|---------|
 | `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` |
 | `SSH_USER` | SSH login on every node | `bootjp` |
-| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). The workflow renders this into the script's `NODES` env var. | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` |
-| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host`. The workflow renders this into the script's `SSH_TARGETS` env var. | `n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,...` |
+| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). Use full MagicDNS FQDNs so every node can resolve the advertised address regardless of local DNS search domains. The workflow renders this into the script's `NODES` env var. | `n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,n3=kv03.<tailnet>.ts.net,n4=kv04.<tailnet>.ts.net,n5=kv05.<tailnet>.ts.net` |
+| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host`. The workflow renders this into the script's `SSH_TARGETS` env var. Usually identical to `NODES_RAFT_MAP` unless SSH access uses a different hostname. | `n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,...` |
 
 **Why two names?** The workflow uses `NODES_RAFT_MAP` / `SSH_TARGETS_MAP`
 in the `production` environment to keep the GitHub-side names
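To make the `raftId=host` format concrete, here is a parsing sketch in the style of the runbook's table: split the map into pairs and append `RAFT_PORT`, as the table says the script does. The tailnet name and port number below are placeholders, not real settings.

```shell
# Sketch: parse NODES_RAFT_MAP into raftId -> host:RAFT_PORT addresses,
# mirroring the table's description of how the script consumes NODES.
# Hostnames and port are placeholder values for illustration.
NODES_RAFT_MAP='n1=kv01.example.ts.net,n2=kv02.example.ts.net'
RAFT_PORT=7000   # placeholder port

declare -A raft_addr=()
IFS=',' read -r -a entries <<< "$NODES_RAFT_MAP"
for e in "${entries[@]}"; do
  id="${e%%=*}"     # raftId before '='
  host="${e#*=}"    # MagicDNS FQDN after '='
  raft_addr["$id"]="${host}:${RAFT_PORT}"
done

echo "n1 advertises ${raft_addr[n1]}"   # prints: n1 advertises kv01.example.ts.net:7000
```

Because each value is a full `.ts.net` FQDN, the advertised address resolves identically on every node, which is the point of the round-3 change.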
@@ -129,6 +142,32 @@ Recommended first-run sequence:
 Re-run the workflow with `image_tag` set to the previous-known-good sha. The
 `nodes` input can target specific nodes if only some carry the bad image.
 
+### If a running workflow is cancelled mid-rollout
+
+GitHub cancelling the job between node steps is the one operational
+hazard that needs manual cleanup.
+
+1. **Look at the last log line from the `Roll cluster` step.** The script
+   logs `[rolling-update] rolling n<id>: docker stop/rm/run ...` before
+   each node recreate. Whatever `n<id>` appears last is the one in
+   flight when the cancel signal landed.
+2. **SSH into that node** over Tailscale and run `docker ps`. If the
+   container is absent or `Exited`, finish the recreate by hand with the
+   docker run arguments the script emitted (which you can see in the
+   workflow log, step `Roll cluster`).
+3. **Confirm the new leader via `raftadmin` or metrics** before re-running
+   the workflow with `nodes:` scoped to the remaining untouched IDs. Do
+   NOT re-run the full rollout if the partial one is still in flight —
+   it will stop the same node you are trying to recover.
+4. **File a ticket** with the log excerpt so we can eventually teach the
+   workflow to set a start-marker on each node and fast-skip completed
+   nodes on re-run.
+
+The script is idempotent for the "container exists and is up" case, so
+re-running the workflow with the same `ref` after confirming the
+interrupted node is healthy is safe — the script will stop+recreate
+each node in turn regardless of whether it was touched before.
+
 ## 7. What the workflow does NOT do (yet)
 
 - **No post-deploy health verification beyond tailnet reachability.** The
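Step 1 of the new recovery section can be scripted against a saved copy of the `Roll cluster` log: the last `rolling n<id>` marker identifies the in-flight node. The log lines below are fabricated samples in the `[rolling-update] rolling n<id>: ...` format the runbook quotes.

```shell
# Sketch: find the node in flight when the cancel landed, from a saved log.
# The log content is a made-up sample matching the runbook's quoted format.
log='[rolling-update] rolling n1: docker stop/rm/run ...
[rolling-update] rolling n2: docker stop/rm/run ...
[rolling-update] rolling n3: docker stop/rm/run ...'

# Extract every "rolling n<id>" marker, keep the last one.
in_flight=$(printf '%s\n' "$log" \
  | sed -n 's/.*rolling \(n[0-9][0-9]*\):.*/\1/p' \
  | tail -n 1)

echo "node in flight at cancel: $in_flight"   # prints: node in flight at cancel: n3
```

Once the workflow grows per-node start-markers (the tracked follow-up), this manual grep should become unnecessary.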
