
Commit ad00bdc

fix(docs,workflow): address round-2 deploy-via-tailscale review
- workflow: add `packages: read` to the job permissions so the `Verify image exists on ghcr.io` step's `docker manifest inspect` call works against private ghcr.io images (Codex P1).
- runbook §1: explain that `--ssh=false` disables Tailscale SSH and the workflow relies on the system sshd; operators who use Tailscale SSH elsewhere need to keep that in mind (Gemini Medium).
- runbook §4: change the `ssh-keyscan` example and troubleshooting entry to `ssh-keyscan -H` so known_hosts entries are hashed and the secret does not leak tailnet topology in plaintext (Gemini Security Medium).
- runbook §4 variables: document that `NODES_RAFT_MAP` / `SSH_TARGETS_MAP` are workflow-side names the render step maps to the script's `NODES` / `SSH_TARGETS`; manual invocation from a workstation must use the script-side names (Gemini Medium).

Not addressed: Gemini HIGH claim that the workflow file is missing (line 187). It IS included at `.github/workflows/rolling-update.yml` in this PR; the reviewer misread the file list.

Not addressed: Gemini HIGH re a native `--dry-run` flag and a zero-downtime strategy (line 128). Dry-run is deliberately a workflow-level input, not a script-level flag, so the script stays invokable from a workstation without CI-specific options; zero-downtime cutover is outside the scope of a CI wrapper and is tracked in the resilience-roadmap follow-ups.
1 parent 6322748 commit ad00bdc

2 files changed

Lines changed: 19 additions & 4 deletions


.github/workflows/rolling-update.yml

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ on:
 permissions:
   contents: read
   id-token: write # required by tailscale/github-action OIDC flow
+  packages: read # required by `docker manifest inspect` on ghcr.io private images
 
 concurrency:
   group: rolling-update
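
In context, the permission slots into the job roughly as below. This is a sketch only: the job name, step wiring, and input names (`inputs.tag`, `IMAGE_BASE` as an env var) are assumptions, not the actual workflow contents; only the three permission entries and the step name come from this commit.

```yaml
# Hypothetical excerpt; job/step layout is assumed, not copied from the repo.
permissions:
  contents: read
  id-token: write   # tailscale/github-action OIDC flow
  packages: read    # lets GITHUB_TOKEN read private ghcr.io package manifests

jobs:
  rolling-update:
    runs-on: ubuntu-latest
    steps:
      - name: Verify image exists on ghcr.io
        run: |
          # Without packages: read, ghcr.io answers 401/404 for private images,
          # so the check cannot distinguish "missing tag" from "forbidden".
          echo "${{ github.token }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
          docker manifest inspect "${IMAGE_BASE}:${{ inputs.tag }}" > /dev/null
```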

docs/deploy_via_tailscale_runbook.md

Lines changed: 18 additions & 4 deletions
@@ -17,6 +17,13 @@ sudo tailscale up \
   --accept-routes=false
 ```
 
+`--ssh=false` disables Tailscale SSH, so the node's regular system
+sshd must be running and authorised to accept connections on the
+tailnet interface. The workflow uses plain SSH over the tailnet
+(Tailscale is only the network layer); if you rely on Tailscale SSH
+for operator access elsewhere, drop this flag but keep in mind the
+workflow still connects to the system sshd.
+
 Verify the node is reachable by MagicDNS from another tailnet peer:
 
 ```
@@ -75,7 +82,7 @@ friction for previews).
 | `TS_OAUTH_CLIENT_ID` | Tailscale OAuth client ID from step 3 |
 | `TS_OAUTH_SECRET` | Tailscale OAuth secret from step 3 |
 | `DEPLOY_SSH_PRIVATE_KEY` | OpenSSH private key, authorized on every node under the deploy user |
-| `DEPLOY_KNOWN_HOSTS` | `ssh-keyscan kv01.<tailnet>.ts.net kv02.<tailnet>.ts.net …` output (one host per line) |
+| `DEPLOY_KNOWN_HOSTS` | `ssh-keyscan -H kv01.<tailnet>.ts.net kv02.<tailnet>.ts.net …` output. Use `-H` to hash hostnames so the secret's contents don't leak the tailnet topology if the runner environment is compromised. |
 
 The SSH key should be ed25519, dedicated to CI (not a reused developer key).
 Regenerate on operator rotation.
@@ -86,8 +93,15 @@ Regenerate on operator rotation.
 |------|-------|---------|
 | `IMAGE_BASE` | Container image path (no tag) | `ghcr.io/bootjp/elastickv` |
 | `SSH_USER` | SSH login on every node | `bootjp` |
-| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`) | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` |
-| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host` | `n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,...` |
+| `NODES_RAFT_MAP` | Comma-separated `raftId=host` (no port — the script appends `RAFT_PORT`). The workflow renders this into the script's `NODES` env var. | `n1=kv01,n2=kv02,n3=kv03,n4=kv04,n5=kv05` |
+| `SSH_TARGETS_MAP` | Comma-separated `raftId=ssh-host`. The workflow renders this into the script's `SSH_TARGETS` env var. | `n1=kv01.<tailnet>.ts.net,n2=kv02.<tailnet>.ts.net,...` |
+
+**Why two names?** The workflow uses `NODES_RAFT_MAP` / `SSH_TARGETS_MAP`
+in the `production` environment to keep the GitHub-side names
+distinct from the script-side env var names it hands to
+`rolling-update.sh`. If you run the script by hand from a workstation
+you must export `NODES` and `SSH_TARGETS` directly — the workflow-side
+names are only understood by the workflow's render step.
 
 ## 5. Running a deploy
 
@@ -149,5 +163,5 @@ the tag is a moving tag (`latest`) that the verification step can't
 distinguish from stale. Specify an immutable tag.
 
 ### SSH `Host key verification failed`
-`DEPLOY_KNOWN_HOSTS` is stale. Re-run `ssh-keyscan` against every node and
+`DEPLOY_KNOWN_HOSTS` is stale. Re-run `ssh-keyscan -H` against every node and
 update the secret.
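
For a manual run from a workstation, the mapping the render step performs can be reproduced in plain shell. A minimal sketch, assuming only the `raftId=host` format documented in the runbook table; the loop merely illustrates how the pairs split, and `NODES` / `SSH_TARGETS` are the script-side names the runbook says to export.

```shell
# Sketch: hand-roll the workflow's render step. Only the NODES /
# SSH_TARGETS names and the raftId=host format come from the runbook;
# the sample hosts are placeholders.
NODES_RAFT_MAP='n1=kv01,n2=kv02,n3=kv03'

# Split the comma-separated raftId=host pairs and inspect each one.
IFS=',' read -ra pairs <<< "$NODES_RAFT_MAP"
for pair in "${pairs[@]}"; do
  id="${pair%%=*}"    # text before the first '='
  host="${pair#*=}"   # text after the first '='
  echo "$id -> $host"
done

# The workflow hands the same string to the script under its script-side name:
export NODES="$NODES_RAFT_MAP"
```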
