
Commit 24383b2

docs(deploy): round-5 deploy-via-tailscale review fixes
- Gemini Medium (design §2.6 line 146): the design doc contradicted the runbook by claiming dry-runs do NOT need approval. GitHub's environment-protection rules cannot be made conditional on workflow inputs, so `environment: production` pauses BOTH dry-run and non-dry-run executions in v1. Aligned the design-doc wording with the runbook and cross-referenced §4 for the second-environment upgrade path.
- Gemini Medium (runbook §6 line 164 env list): the list of reconstruction vars was incomplete. Listed every env var the workflow actually exports (IMAGE, DATA_DIR, RAFT_PORT, REDIS_PORT, S3_PORT, ENABLE_S3, NODES, SSH_TARGETS, EXTRA_ENV) and called out the script-level defaults for anything not overridden, plus noted GOMEMLIMIT / CONTAINER_MEMORY_LIMIT are propagated via EXTRA_ENV once PR #617 lands.
- Gemini Medium (runbook §6 line 178 idempotency): corrected the "stop+recreate every node regardless" claim. The script (scripts/rolling-update.sh:794-798) skips nodes whose running image id matches the target AND whose gRPC endpoint is healthy, so re-running after a partial roll is safe because already-rolled nodes are no-ops, not stops.
- Declining again Gemini HIGH "workflow file missing": the file IS in this PR at .github/workflows/rolling-update.yml; this is the fourth round the bot has flagged its own misread. See prior rounds for rationale; no change.
1 parent 165116b commit 24383b2

2 files changed

Lines changed: 32 additions & 16 deletions


docs/deploy_via_tailscale_runbook.md

Lines changed: 20 additions & 10 deletions
```diff
@@ -158,12 +158,19 @@ hazard that needs manual cleanup.
 2. **SSH into that node** over Tailscale and run `docker ps`. If the
    container is absent or `Exited`, finish the recreate by hand. The
    `docker run` invocation itself is redirected to `/dev/null` by the
-   script, so the workflow log does NOT contain the full argv. Use
-   the resolved env instead: the step logs `NODES_RAFT_MAP`,
-   `EXTRA_ENV`, `GOMEMLIMIT`, `CONTAINER_MEMORY_LIMIT`, `IMAGE`, and
-   `DATA_DIR` before invoking the script — those are sufficient to
-   reconstruct the same `docker run` you would see if you re-ran with
-   the same inputs.
+   script, so the workflow log does NOT contain the full argv. To
+   reconstruct it, read the `Roll cluster` step's rendered
+   environment — the workflow exports `IMAGE`, `DATA_DIR`,
+   `RAFT_PORT`, `REDIS_PORT`, `S3_PORT`, `ENABLE_S3`, `NODES`,
+   `SSH_TARGETS`, and the merged `EXTRA_ENV` before invoking the
+   script. Anything not explicitly set (e.g., `RAFT_PORT` in a
+   minimally-overridden deploy) falls back to the script's default
+   (`RAFT_PORT=50051`, `REDIS_PORT=6379`, `S3_PORT=9000`,
+   `ENABLE_S3=true`). GOMEMLIMIT / CONTAINER_MEMORY_LIMIT (PR #617)
+   are propagated via `EXTRA_ENV` once that PR lands. Together the
+   rendered env + the node's `deploy.env` is enough to reconstruct
+   the same `docker run` you would see if you re-ran with the same
+   inputs.
 3. **Confirm the new leader via `raftadmin` or metrics** before re-running
    the workflow with `nodes:` scoped to the remaining untouched IDs. Do
    NOT re-run the full rollout if the partial one is still in flight —
```
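The reconstruction the new hunk describes can be sketched as follows. The values would be copied from the `Roll cluster` step's rendered environment; anything the workflow did not override falls back to the script defaults the runbook names. The `IMAGE`/`DATA_DIR` values and the exact flag layout here are placeholders, not `rolling-update.sh`'s real argv.

```shell
# Hedged sketch: combine the step's rendered env with the documented script
# defaults to approximate the docker run argv. IMAGE/DATA_DIR values and the
# flag layout are illustrative placeholders.
IMAGE="${IMAGE:-registry.example.com/app:deadbeef}"   # placeholder
DATA_DIR="${DATA_DIR:-/var/lib/app}"                  # placeholder
RAFT_PORT="${RAFT_PORT:-50051}"    # script default per the runbook
REDIS_PORT="${REDIS_PORT:-6379}"   # script default
S3_PORT="${S3_PORT:-9000}"         # script default
ENABLE_S3="${ENABLE_S3:-true}"     # script default

ARGS="-d -p ${RAFT_PORT}:${RAFT_PORT} -p ${REDIS_PORT}:${REDIS_PORT} -v ${DATA_DIR}:/data"
if [ "${ENABLE_S3}" = "true" ]; then
  ARGS="${ARGS} -p ${S3_PORT}:${S3_PORT}"
fi
echo "docker run ${ARGS} ${IMAGE}"
```

Printing the command rather than executing it lets the on-call diff it against the node's `deploy.env` before running anything by hand.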
```diff
@@ -172,10 +179,13 @@ hazard that needs manual cleanup.
    workflow to set a start-marker on each node and fast-skip completed
    nodes on re-run.
 
-The script is idempotent for the "container exists and is up" case, so
-re-running the workflow with the same `ref` after confirming the
-interrupted node is healthy is safe — the script will stop+recreate
-each node in turn regardless of whether it was touched before.
+The script is idempotent. `scripts/rolling-update.sh:794-798` skips a
+node when its running image id equals the target image and its gRPC
+endpoint is healthy — an already-rolled node is a no-op, not a
+redundant stop/recreate. Re-running the workflow with the same
+`ref` after confirming the interrupted node is healthy is therefore
+safe: nodes that already match the target image are passed over,
+and only the still-stale one gets recreated.
 
 ## 7. What the workflow does NOT do (yet)
 
```
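The skip condition the corrected text cites can be reduced to a small predicate. This is a hedged sketch, not the actual code at scripts/rolling-update.sh:794-798; the helper name and the way the inputs are gathered are illustrative.

```shell
# Hedged sketch of the skip check: a node is a no-op only when its running
# image id matches the target AND its gRPC endpoint is healthy.
should_skip() {
  # $1 = running image id, $2 = target image id, $3 = health ("ok"/"fail")
  [ -n "$1" ] && [ "$1" = "$2" ] && [ "$3" = "ok" ]
}

# In the real script the inputs would come from something like:
#   running=$(ssh "$node" docker inspect --format '{{.Image}}' "$container")
#   target=$(docker image inspect --format '{{.Id}}' "$IMAGE")
#   health=$(grpc_health_probe -addr "$node:$RAFT_PORT" >/dev/null 2>&1 && echo ok || echo fail)
if should_skip "sha256:abc" "sha256:abc" "ok"; then
  echo "no-op: node already on target image and healthy"
else
  echo "stop + recreate node"
fi
```

Because both conditions must hold, a node running the target image whose gRPC endpoint is down still gets recreated, which is the behavior you want after a partial roll.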
docs/design/2026_04_24_proposed_deploy_via_tailscale.md

Lines changed: 12 additions & 6 deletions
```diff
@@ -141,15 +141,21 @@ unreachable over the tailnet) before touching any live container.
 
 ### 2.6 Production environment approval
 
-Mark the `production` GitHub environment as requiring approval from a list of
-reviewers. A non-dry-run deploy will pause until approved; the dry-run run
-itself does not need approval (it only needs the tailnet join).
+Mark the `production` GitHub environment as requiring approval from a list
+of reviewers. GitHub's native environment-protection rules do NOT support
+conditioning approval on workflow inputs, so **both** dry-run and non-
+dry-run runs will pause for approval when `environment: production` is
+declared unconditionally on the job. That is the v1 policy — simpler,
+one environment, one approver list; see runbook §4 for the dry-run-
+approval alternatives (a second `production-dry-run` environment without
+required reviewers, or a deployment-protection-rule GitHub App).
 
 Alternative: require approval unconditionally and treat the dry-run as a
-"preview" that an approver must ack. Simpler policy, slightly more friction.
+"preview" that an approver must ack. This is the v1 shape by default.
 
-**Recommendation:** approval required for non-dry-run only. Dry-runs are
-cheap and useful.
+**Recommendation:** approval required for every run in v1 (one
+environment). Add the second environment only when the dry-run friction
+becomes annoying.
 
 ### 2.7 Rollback
```
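For reference, required reviewers on the `production` environment can be configured through the REST endpoint `PUT /repos/{owner}/{repo}/environments/{environment_name}`. A hedged sketch: OWNER, REPO, and the reviewer id are placeholders, and the `gh` invocation is only printed rather than executed so nothing is mutated.

```shell
# Hedged sketch: reviewer payload for the production environment.
# OWNER, REPO, and the id are placeholders; drop the echo quoting and pipe
# the payload into an authenticated `gh` to actually apply it.
payload='{"reviewers":[{"type":"User","id":123456}]}'
echo "gh api -X PUT repos/OWNER/REPO/environments/production --input -"
echo "$payload"
```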
