Commit 165116b

docs(deploy): round-4 deploy-via-tailscale review fixes
- Gemini HIGH (design line 82): switch NODES_RAFT_MAP example to full MagicDNS FQDNs so it matches the runbook; bare hostnames resolve differently per node.
- Gemini Medium (design line 45): fix YAML — on/workflow_dispatch/inputs must be nested, not on a single line, and the fence is labelled yaml.
- Gemini Medium (runbook §3, design §2.3): retract devices:core — not a valid Tailscale OAuth scope; note devices:write as the standard one.
- Gemini Medium (runbook §6, line 153-156): correct the cancelled-job log pattern to what the script actually emits (`==> [<id>@<host>] start`, scripts/rolling-update.sh:398), not the fictitious `[rolling-update] rolling n<id>: ...`.
- Gemini Medium (runbook §6, line 156-160): clarify that docker run stdout/stderr is redirected to /dev/null, so operators reconstruct the invocation from the step-level env log, not from the docker-run argv.
- Codex P2 (runbook §8 approval troubleshooting): clarify that both dry-run and non-dry-run runs pause for approval in v1 because `environment: production` is unconditional; reference §4 for the second-environment upgrade path.
1 parent 894bce9 commit 165116b

2 files changed

Lines changed: 46 additions & 25 deletions

docs/deploy_via_tailscale_runbook.md

Lines changed: 26 additions & 12 deletions
@@ -58,9 +58,12 @@ Admin console → Settings → OAuth clients → New client:
 
 - Description: `elastickv GitHub Actions deploy`
 - Scopes: `auth_keys` (write). Recent `tailscale/github-action` versions
-may additionally require `devices:core` (write); enable that if the
-join step fails with an authorization error. The action's README is
-the definitive source for current scope requirements.
+may additionally require `devices:write` (to register and clean up
+the ephemeral node); enable that if the join step fails with an
+authorization error. The action's README is the definitive source
+for current scope requirements. `devices:core` is NOT a valid
+Tailscale OAuth scope — earlier drafts of this runbook named it and
+would have produced an auth failure.
 - Tags: `tag:ci-deploy`
 
 Copy the client ID and secret; they go into GitHub in the next step.
@@ -147,14 +150,20 @@ Re-run the workflow with `image_tag` set to the previous-known-good sha. The
 GitHub cancelling the job between node steps is the one operational
 hazard that needs manual cleanup.
 
-1. **Look at the last log line from the `Roll cluster` step.** The script
-logs `[rolling-update] rolling n<id>: docker stop/rm/run ...` before
-each node recreate. Whatever `n<id>` appears last is the one in
+1. **Look at the last log line from the `Roll cluster` step.** The
+script emits `==> [<raft-id>@<host>] start` at the beginning of
+each per-node recreate (see `scripts/rolling-update.sh:398`).
+Whichever `<raft-id>` appears in the last such line is the one in
 flight when the cancel signal landed.
 2. **SSH into that node** over Tailscale and run `docker ps`. If the
-container is absent or `Exited`, finish the recreate by hand with the
-docker run arguments the script emitted (which you can see in the
-workflow log, step `Roll cluster`).
+container is absent or `Exited`, finish the recreate by hand. The
+`docker run` invocation itself is redirected to `/dev/null` by the
+script, so the workflow log does NOT contain the full argv. Use
+the resolved env instead: the step logs `NODES_RAFT_MAP`,
+`EXTRA_ENV`, `GOMEMLIMIT`, `CONTAINER_MEMORY_LIMIT`, `IMAGE`, and
+`DATA_DIR` before invoking the script — those are sufficient to
+reconstruct the same `docker run` you would see if you re-ran with
+the same inputs.
 3. **Confirm the new leader via `raftadmin` or metrics** before re-running
 the workflow with `nodes:` scoped to the remaining untouched IDs. Do
 NOT re-run the full rollout if the partial one is still in flight —
@@ -185,9 +194,14 @@ each node in turn regardless of whether it was touched before.
 ## 8. Troubleshooting
 
 ### Job pauses indefinitely at "Waiting for approval"
-Expected for non-dry-run deploys — a reviewer from the `production` environment
-must click Approve. Check the "Required reviewers" list in the environment
-settings.
+Expected for **every** run in v1 — `.github/workflows/rolling-update.yml`
+sets `environment: production` unconditionally, so both dry-run and
+non-dry-run executions pause for approval. A reviewer from the
+`production` environment must click Approve. Check the "Required
+reviewers" list in the environment settings. See §4 "GitHub
+environment" for the dry-run-approval alternatives (approach 2: add a
+second `production-dry-run` environment without required reviewers)
+if the friction becomes intolerable.
 
 ### `tailscale ping` fails for a node
 The node may not be running `tailscaled`, not tagged `tag:elastickv-node`, or
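
Note on the runbook §6 change above: the "last start marker" lookup can also be done mechanically. The following is a minimal Python sketch and is not part of this commit. It assumes the marker has exactly the form `==> [<raft-id>@<host>] start` quoted in the runbook (not checked against `scripts/rolling-update.sh`), and that the `Roll cluster` step log has been saved as plain text lines. The names `START_RE` and `last_started_node` are hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple

# Per-node start marker as quoted in the runbook:
#   ==> [<raft-id>@<host>] start
# Assumption: the marker ends the line; any log prefix (for example a
# timestamp added by the CI system) may precede it.
START_RE = re.compile(
    r"==> \[(?P<raft_id>[^@\]\s]+)@(?P<host>[^\]\s]+)\] start\s*$"
)


def last_started_node(log_lines: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (raft_id, host) from the last start marker, or None if absent."""
    last = None
    for line in log_lines:
        match = START_RE.search(line)
        if match:
            last = (match.group("raft_id"), match.group("host"))
    return last


if __name__ == "__main__":
    sample = [
        "2026-04-24T10:00:00Z ==> [n1@kv01.tailnet.ts.net] start",
        "2026-04-24T10:01:10Z ==> [n2@kv02.tailnet.ts.net] start",
        "Error: The operation was canceled.",
    ]
    print(last_started_node(sample))
```

The node returned is the one to inspect with `docker ps` in step 2; a `None` result would suggest the job was cancelled before the first node was touched.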

docs/design/2026_04_24_proposed_deploy_via_tailscale.md

Lines changed: 20 additions & 13 deletions
@@ -40,14 +40,15 @@ logged in on every node, with SSH access enabled over the tailnet.
 
 ### 2.1 Workflow shape
 
-```
+```yaml
 name: Rolling update
-on: workflow_dispatch:
-inputs:
-ref: # git sha/tag of the image to deploy
-image_tag: # defaults to $ref; override only for rollbacks
-nodes: # subset of raft IDs; empty = full roll
-dry_run: # bool, default TRUE — renders plan but doesn't roll
+on:
+  workflow_dispatch:
+    inputs:
+      ref: # git sha/tag of the image to deploy
+      image_tag: # defaults to $ref; override only for rollbacks
+      nodes: # subset of raft IDs; empty = full roll
+      dry_run: # bool, default TRUE — renders plan but doesn't roll
 
 jobs:
   deploy:
@@ -79,10 +80,13 @@ Stored in a GitHub `production` environment (not repo-wide):
 entries. Prevents the first-connect TOFU prompt.
 
 **Variables (non-secret):**
-- `NODES_RAFT_MAP``n1=kv01,n2=kv02,...` (advertised hostnames as seen
-from inside the tailnet; the script appends `RAFT_PORT` automatically,
-so do NOT include a port here).
-- `SSH_TARGETS_MAP``n1=kv01.tailnet.ts.net,...` (MagicDNS).
+- `NODES_RAFT_MAP` — `n1=kv01.tailnet.ts.net,n2=kv02.tailnet.ts.net,...`
+(full MagicDNS FQDNs; bare short names can resolve differently
+depending on each node's search-domain configuration). The script
+appends `RAFT_PORT` automatically, so do NOT include a port here.
+The runbook (`docs/deploy_via_tailscale_runbook.md`) carries the
+same FQDN convention; keep the two in sync if either changes.
+- `SSH_TARGETS_MAP` — `n1=kv01.tailnet.ts.net,...` (MagicDNS FQDN).
 - `IMAGE_BASE` — `ghcr.io/bootjp/elastickv` (tag is appended from the input).
 - `SSH_USER` — e.g., `bootjp`.
 
@@ -93,8 +97,11 @@ Use OAuth ephemeral nodes (not a long-lived auth key):
 - Create an OAuth client in Tailscale admin console with scope
 `auth_keys` (write) on tag `tag:ci-deploy`. (`tailscale/github-action`
 uses the OAuth client to mint a short-lived auth key on each run;
-recent action versions may also require `devices:core` — consult the
-action's README for the current scope list.)
+recent action versions may also require `devices:write` so the
+ephemeral node can register and be cleaned up — consult the action's
+README for the current scope list. Earlier drafts of this doc named
+`devices:core`, which is not a supported Tailscale OAuth scope and
+would fail authentication.)
 - Store client ID + secret in GitHub env secrets.
 - `tailscale/github-action@v3` joins the tailnet for the duration of the job
 as an ephemeral tagged node; disconnects automatically on job exit.
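
Note on the `NODES_RAFT_MAP` change in this file: the two rules in the new text (FQDNs only, no port) are easy to check before a roll. The following is a minimal Python sketch and is not part of this commit or of `scripts/rolling-update.sh`. It assumes the `id=host,id=host` shape shown in the design doc and treats "contains a dot" as a rough stand-in for "is an FQDN". The name `parse_nodes_raft_map` is hypothetical.

```python
from typing import Dict


def parse_nodes_raft_map(value: str) -> Dict[str, str]:
    """Parse 'n1=host1,n2=host2' into {raft_id: host}, rejecting bad entries."""
    nodes: Dict[str, str] = {}
    for raw in value.split(","):
        entry = raw.strip()
        if not entry:
            continue  # tolerate a trailing comma
        raft_id, sep, host = entry.partition("=")
        if not sep or not raft_id or not host:
            raise ValueError(f"malformed entry: {entry!r}")
        if ":" in host:
            # The design doc says the script appends RAFT_PORT itself.
            raise ValueError(f"{raft_id}: host must not include a port: {host!r}")
        if "." not in host:
            # Heuristic only: a bare short name has no domain part.
            raise ValueError(f"{raft_id}: expected a full FQDN, got {host!r}")
        if raft_id in nodes:
            raise ValueError(f"duplicate raft id: {raft_id!r}")
        nodes[raft_id] = host
    return nodes


if __name__ == "__main__":
    print(parse_nodes_raft_map("n1=kv01.tailnet.ts.net,n2=kv02.tailnet.ts.net"))
```

Under these assumptions, the pre-fix example `n1=kv01,n2=kv02` would be rejected as bare short names, which is the mismatch with the runbook that this commit corrects.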
