| 1 | +# Data-Tier Apply Runbook — `instant-data` stateful hardening |
| 2 | + |
| 3 | +> Companion to `k8s/APPLY-CHECKLIST.md` (which covers the api/worker/ |
| 4 | +> provisioner **Deployment** manifests). This runbook covers the **stateful |
| 5 | +> data-tier** manifests in `k8s/data/` — the ones that hold real customer data |
| 6 | +> and therefore must be applied deliberately, in order, in a maintenance |
| 7 | +> window. **This repo has no auto-apply (CLAUDE.md rule 15).** |
| 8 | +> |
| 9 | +> **CRITICAL: never `kubectl apply -f k8s/app.yaml`** (stale vs prod — strips |
| 10 | +> `imagePullSecrets`, resets images). The files below are individually |
| 11 | +> applyable; apply them one at a time and read each `--dry-run=server` diff. |
| 12 | + |
| 13 | +This runbook is the operator apply checklist for four changes that are |
| 14 | +**committed but NOT yet applied to prod** (infra has no auto-apply): |
| 15 | + |
| 16 | +| Tag | File | What it does | Customer-visible risk if mis-applied | |
| 17 | +|---|---|---|---| |
| 18 | +| **S1** | `k8s/data/postgres-customers-lockdown.yaml` + the patched `postgres-customers.yaml` | pg_hba that REJECTS the admin/superuser roles (`instanode_admin`, `instant_cust`) from the public path; preserves `usr_*` customer roles | LOW — admin-only reject; customers unaffected. Detailed runbook: `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md` | |
| 19 | +| **S2** | `k8s/data/networkpolicy.yaml` | ingress NetworkPolicy: only provisioner/migrator/worker (+ nats-proxy for 4222) may reach the data pods | **HIGH — can break ALL customers** if the pg-proxy allow-rule is missing. See §S2 below. | |
| 20 | +| **R6** | `k8s/data/nats.yaml` | JetStream `emptyDir{}` → PVC (`nats-jetstream-pvc`, 5Gi) so queue data survives restarts | LOW, but the cutover roll (§R6) discards the existing non-durable `emptyDir{}` JetStream state; nothing is migrated. | |
| 21 | +| **R7** | `k8s/data/stateful-priority.yaml` + resource requests in `{postgres-customers,mongodb,redis-provision}.yaml` | PriorityClass `instant-data-critical` + one PDB per stateful pod + right-sized requests (BestEffort → Burstable) | LOW — eviction-ordering + drain-gating only; no data-path change. | |
| 22 | + |
| 23 | +--- |
| 24 | + |
| 25 | +## Pre-flight (every apply below) |
| 26 | + |
| 27 | +```bash |
| 28 | +# 1. Confirm context — NEVER run against the wrong cluster. |
| 29 | +kubectl config current-context |
| 30 | +# Expected for prod: do-nyc3-instant-prod |
| 31 | + |
| 32 | +# 2. Snapshot current data-tier state for rollback reference. |
| 33 | +kubectl get pods,pvc,netpol,pdb,priorityclass -n instant-data -o wide |
| 34 | +kubectl get priorityclass instant-data-critical 2>/dev/null || echo "no priorityclass yet" |
| 35 | + |
| 36 | +# 3. Server-side dry-run EACH file and read the diff line by line. |
| 37 | +kubectl apply --dry-run=server -f <file> |
| 38 | +``` |
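The context check in step 1 can be made fail-closed instead of relying on a visual read. A minimal sketch, assuming a hypothetical helper named `require_context` (not part of this repo) and the prod context name given above:

```bash
# Hypothetical helper (not in this repo): abort unless the active kubectl
# context is the one you intend to change.
require_context() {
  local expected="$1" actual="$2"
  if [ "$actual" != "$expected" ]; then
    echo "ABORT: context is '$actual', expected '$expected'" >&2
    return 1
  fi
  echo "context OK: $actual"
}

# Usage (commented out so sourcing this file changes nothing):
# require_context do-nyc3-instant-prod "$(kubectl config current-context)" || exit 1
```

Passing the context in as an argument keeps the comparison separate from the `kubectl` call, so the guard can be exercised without cluster access.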
| 39 | + |
| 40 | +Apply in a **maintenance window**. The recommended order is **R7 → R6 → S1 → |
| 41 | +S2** — least-risky and reversible first, the customer-breaking NetworkPolicy |
| 42 | +LAST so it is the freshest thing in your head if customers report errors. |
| 43 | + |
| 44 | +--- |
| 45 | + |
| 46 | +## R7 — PriorityClass + PDBs + resource requests (apply FIRST) |
| 47 | + |
| 48 | +Pure eviction-protection; no data-path change. Two parts. |
| 49 | + |
| 50 | +**Part A — the PriorityClass + PDBs:** |
| 51 | + |
| 52 | +```bash |
| 53 | +kubectl apply --dry-run=server -f k8s/data/stateful-priority.yaml # read diff |
| 54 | +kubectl apply -f k8s/data/stateful-priority.yaml |
| 55 | + |
| 56 | +# Verify |
| 57 | +kubectl get priorityclass instant-data-critical |
| 58 | +kubectl get pdb -n instant-data |
| 59 | +# Expect 4 PDBs (postgres-customers / mongodb / redis-provision / nats), |
| 60 | +# each ALLOWED DISRUPTIONS reading 0 (single replica, minAvailable 1 → the one |
| 61 | +# pod is "not disruptable" by voluntary eviction, which is the point). |
| 62 | +``` |
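The "4 PDBs, each with 0 allowed disruptions" expectation can also be checked mechanically. A sketch, assuming a hypothetical helper `check_pdbs` that reads `<name> <disruptionsAllowed>` pairs on stdin; the jsonpath in the usage comment reads the standard PDB status field `.status.disruptionsAllowed`:

```bash
# Hypothetical helper: expects exactly 4 lines of "<pdb-name> <allowed>" on
# stdin and fails if the count differs or any PDB allows a disruption.
check_pdbs() {
  local count=0 bad=0 name allowed
  while read -r name allowed; do
    if [ -z "$name" ]; then continue; fi
    count=$((count + 1))
    if [ "$allowed" != "0" ]; then
      echo "unexpected: $name allows $allowed disruptions" >&2
      bad=1
    fi
  done
  if [ "$count" -ne 4 ]; then
    echo "expected 4 PDBs, found $count" >&2
    return 1
  fi
  return "$bad"
}

# Usage:
# kubectl get pdb -n instant-data \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.disruptionsAllowed}{"\n"}{end}' \
#   | check_pdbs
```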
| 63 | + |
| 64 | +**Part B — the resource requests + the priorityClassName patch.** The requests |
| 65 | +ship INSIDE each workload manifest (`mongodb.yaml`, `redis-provision.yaml`, |
| 66 | +`postgres-customers.yaml`). Re-applying those manifests rolls the pod (Recreate |
| 67 | +strategy → brief downtime per workload — do this in the window). Because the |
| 68 | +PriorityClass is deliberately NOT inlined in the Deployments (so the priority |
| 69 | +rollout is one auditable step), patch `priorityClassName` in the same roll: |
| 70 | + |
| 71 | +```bash |
| 72 | +# postgres-customers carries the S1 pg_hba mount already — apply it as part of |
| 73 | +# S1 below (§S1) to avoid two rolls. For mongodb + redis-provision, roll now: |
| 74 | +for w in mongodb redis-provision; do |
| 75 | + kubectl apply --dry-run=server -f k8s/data/$w.yaml # read diff: only resources{} added |
| 76 | + kubectl apply -f k8s/data/$w.yaml |
| 77 | + kubectl patch deploy/$w -n instant-data --type=merge \ |
| 78 | + -p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}' |
| 79 | + kubectl rollout status deploy/$w -n instant-data --timeout=180s |
| 80 | +done |
| 81 | + |
| 82 | +# Verify QoS flipped from BestEffort → Burstable and priority is set: |
| 83 | +kubectl get pod -n instant-data -l app=mongodb \ |
| 84 | + -o jsonpath='{.items[0].status.qosClass}{" "}{.items[0].spec.priorityClassName}{"\n"}' |
| 85 | +# Expect: Burstable instant-data-critical |
| 86 | +``` |
| 87 | + |
| 88 | +> nats already declared requests; just patch its `priorityClassName` (do it in |
| 89 | +> the R6 roll below so nats only restarts once). |
| 90 | + |
| 91 | +**Rollback R7:** `kubectl delete -f k8s/data/stateful-priority.yaml` removes the |
| 92 | +PDBs + PriorityClass. Running pods keep running, but a replacement pod that still |
| 93 | +names the deleted class is rejected at admission, so re-patch each Deployment to drop `priorityClassName` before the next roll. |
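For the re-patch mentioned in the R7 rollback, one option is a merge patch that sets the key to `null`, which deletes it from the pod template. A sketch, assuming a hypothetical helper `remove_priority_class`; the `KUBECTL` override is an illustration device, not something this repo defines:

```bash
# Hypothetical helper: clear priorityClassName from one Deployment's pod
# template. A merge patch with null deletes the key; this triggers a roll.
# KUBECTL is overridable (for example KUBECTL=echo) to preview the command.
remove_priority_class() {
  local workload="$1"
  "${KUBECTL:-kubectl}" patch "deploy/$workload" -n instant-data --type=merge \
    -p '{"spec":{"template":{"spec":{"priorityClassName":null}}}}'
}

# Usage:
# for w in mongodb redis-provision nats postgres-customers; do
#   remove_priority_class "$w"
# done
```

Because each patch changes the pod template, each workload restarts once (Recreate strategy), so this belongs in the maintenance window as well.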
| 94 | + |
| 95 | +--- |
| 96 | + |
| 97 | +## R6 — NATS JetStream emptyDir → PVC |
| 98 | + |
| 99 | +`k8s/data/nats.yaml` now declares `nats-jetstream-pvc` (5Gi, default |
| 100 | +StorageClass = `do-block-storage` on DOKS) and mounts it at `/data/jetstream`. |
| 101 | + |
| 102 | +> **Data note:** pre-cutover JetStream state lived in `emptyDir{}` and is |
| 103 | +> **already non-durable** (every prior restart wiped it). Switching to the PVC |
| 104 | +> does NOT migrate old in-memory state — there is nothing durable to migrate. |
| 105 | +> Existing `legacy_open` queue resources reconnect + re-establish streams on |
| 106 | +> reconnect (same as any nats restart today). Schedule during low queue |
| 107 | +> traffic; clients reconnect automatically. |
| 108 | + |
| 109 | +```bash |
| 110 | +kubectl apply --dry-run=server -f k8s/data/nats.yaml # read diff: PVC added, volume swapped |
| 111 | + |
| 112 | +# The Deployment uses strategy.type: Recreate (RWO volume — required). Applying |
| 113 | +# rolls the pod: old pod terminates, PVC binds, new pod starts on /data/jetstream. |
| 114 | +kubectl apply -f k8s/data/nats.yaml |
| 115 | + |
| 116 | +# Patch the PriorityClass in the SAME context so nats restarts once (R7 part B): |
| 117 | +kubectl patch deploy/nats -n instant-data --type=merge \ |
| 118 | + -p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}' |
| 119 | + |
| 120 | +kubectl rollout status deploy/nats -n instant-data --timeout=180s |
| 121 | + |
| 122 | +# Verify the PVC bound and JetStream is on it: |
| 123 | +kubectl get pvc nats-jetstream-pvc -n instant-data # STATUS Bound |
| 124 | +kubectl exec -n instant-data deploy/nats -- ls -la /data/jetstream |
| 125 | +kubectl get pod -n instant-data -l app=nats \ |
| 126 | + -o jsonpath='{.items[0].status.qosClass}{" "}{.items[0].spec.priorityClassName}{"\n"}' |
| 127 | +# Expect a jetstream dir on the mounted PVC + Burstable instant-data-critical. |
| 128 | + |
| 129 | +# Durability proof (the whole point): publish to a stream, delete the pod, |
| 130 | +# confirm the message survives the restart. |
| 131 | +# kubectl exec ... nats pub test.durability hello ; kubectl delete pod -n instant-data -l app=nats ; |
| 132 | +# (after Ready) nats stream info / consumer next: message must still be there. |
| 133 | +``` |
| 134 | + |
| 135 | +**Rollback R6:** revert the `nats.yaml` change and re-apply (volume goes back to |
| 136 | +`emptyDir{}`). The PVC can be left bound (it costs a few cents) or deleted with |
| 137 | +`kubectl delete pvc nats-jetstream-pvc -n instant-data` once nats is off it. |
| 138 | + |
| 139 | +--- |
| 140 | + |
| 141 | +## S1 — postgres-customers admin lockdown |
| 142 | + |
| 143 | +Full procedure (root-cause, role analysis, proxy-IP SNAT caveat, the live |
| 144 | +pg_hba stopgap) is in **`POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md`** — follow THAT |
| 145 | +for S1; this is the short pointer + the verification gate. |
| 146 | + |
| 147 | +Apply order: the `postgres-customers-hba` ConfigMap (in |
| 148 | +`postgres-customers-lockdown.yaml`) FIRST, then the patched |
| 149 | +`postgres-customers.yaml` (which mounts the hba file via subPath + sets |
| 150 | +`hba_file=/etc/postgresql/pg_hba.conf` and now also carries the R7 |
| 151 | +resource requests). Roll postgres-customers ONCE for both: |
| 152 | + |
| 153 | +```bash |
| 154 | +kubectl apply -f k8s/data/postgres-customers-lockdown.yaml # ConfigMap (+ any docs) |
| 155 | +kubectl apply -f k8s/data/postgres-customers.yaml # mounts hba + R7 requests |
| 156 | +kubectl patch deploy/postgres-customers -n instant-data --type=merge \ |
| 157 | + -p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}' |
| 158 | +kubectl rollout status deploy/postgres-customers -n instant-data --timeout=300s |
| 159 | +``` |
| 160 | + |
| 161 | +> ⚠️ Read `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §3a` BEFORE applying — the |
| 162 | +> pg-proxy SNATs customer traffic to a pod IP, so the lockdown rejects the |
| 163 | +> admin role BY ROLE NAME (`instanode_admin` AND `instant_cust`), not by source |
| 164 | +> IP. If the runbook's proxy-pod-IP reject lines are stale, fix them first. |
| 165 | + |
| 166 | +### S1 verification gate (the load-bearing check) |
| 167 | + |
| 168 | +After the roll, the **external admin path MUST FAIL** while in-cluster admin and |
| 169 | +customer paths keep working: |
| 170 | + |
| 171 | +```bash |
| 172 | +# (a) EXTERNAL admin connect MUST be REJECTED by pg_hba (NOT a password prompt |
| 173 | +# that proceeds). SAFE: connection-rejection probe only — no SQL/DDL. |
| 174 | +PGCONNECT_TIMEOUT=5 psql \ |
| 175 | + "host=pg.instanode.dev port=5432 user=instant_cust dbname=instant_customers sslmode=require" \ |
| 176 | + -c '\q' 2>&1 | head |
| 177 | +# EXPECT: 'pg_hba.conf rejects connection for host ...' (explicit reject rule) or 'no pg_hba.conf entry for host ...' (or FATAL 28000 from the proxy). |
| 178 | +# FAILURE TO REJECT = lockdown not in effect. STOP, investigate. |
| 179 | + |
| 180 | +# Repeat for the OTHER admin role (the confirmed truehomie role): |
| 181 | +PGCONNECT_TIMEOUT=5 psql \ |
| 182 | + "host=pg.instanode.dev port=5432 user=instanode_admin dbname=instant_customers sslmode=require" \ |
| 183 | + -c '\q' 2>&1 | head |
| 184 | +# EXPECT: rejected. |
| 185 | + |
| 186 | +# (b) In-cluster admin still works (provisioner path is intact): |
| 187 | +kubectl exec -n instant-data deploy/postgres-customers -- \ |
| 188 | + psql -U instant_cust -d instant_customers -tAc 'select 1;' # expect: 1 |
| 189 | + |
| 190 | +# (c) A real customer usr_<token> still connects through the public path |
| 191 | +# (regression check — the lockdown must NOT catch customer roles). |
| 192 | +# Use a known test-tenant DSN from the dashboard / a /db/new claim. |
| 193 | +``` |
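Check (a) can be turned into a hard gate so that a probe which is not refused fails loudly. A sketch, assuming a hypothetical helper `is_hba_rejected` that matches PostgreSQL's two standard pg_hba refusal messages in captured `psql` output:

```bash
# Hypothetical helper: succeeds only if the captured psql output (stdin)
# contains one of PostgreSQL's pg_hba refusal messages.
is_hba_rejected() {
  grep -Eq 'pg_hba\.conf rejects connection|no pg_hba\.conf entry'
}

# Usage, reusing the probe from check (a):
# PGCONNECT_TIMEOUT=5 psql \
#   "host=pg.instanode.dev port=5432 user=instant_cust dbname=instant_customers sslmode=require" \
#   -c '\q' 2>&1 | is_hba_rejected || { echo "NOT REJECTED: STOP" >&2; exit 1; }
```

The gate is deliberately fail-closed: a timeout, a DNS error, or a proxy-side refusal with different wording also counts as "not rejected", so read the raw output before concluding anything.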
| 194 | + |
| 195 | +**Rollback S1:** see `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §Rollback` (revert |
| 196 | +the manifest, the pod falls back to the stock catch-all pg_hba; the live file |
| 197 | +backup is at `$PGDATA/pg_hba.conf.bak.2026-06-03`). |
| 198 | + |
| 199 | +--- |
| 200 | + |
| 201 | +## S2 — data-tier ingress NetworkPolicy (apply LAST — highest risk) |
| 202 | + |
| 203 | +`k8s/data/networkpolicy.yaml` adds a default-deny ingress policy per data pod, |
| 204 | +allowing ONLY provisioner / migrator / worker (+ nats-proxy for 4222/8222). |
| 205 | + |
| 206 | +### ⚠️ The pg-proxy allow-rule — this is the customer-breaking trap |
| 207 | + |
| 208 | +The `postgres-customers-ingress` policy as committed **does NOT list |
| 209 | +`instant-pg-proxy`** — the allow-rule for it is **DORMANT** (commented out at |
| 210 | +`networkpolicy.yaml` lines ~88–103). If the public customer connect path is |
| 211 | +`pg.instanode.dev → ingress-nginx tcp-services → instant-pg-proxy |
| 212 | +(instant ns) → postgres-customers`, then applying this policy AS-IS |
| 213 | +**default-denies the proxy and BREAKS EVERY CUSTOMER POSTGRES CONNECTION.** |
| 214 | + |
| 215 | +`POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §L4` records that as of 2026-06-06 the |
| 216 | +NetworkPolicy is **NOT applied in prod** and that applying it as-is would |
| 217 | +default-deny + break the proxy path. **Do not apply S2 until you have:** |
| 218 | + |
| 219 | +1. **Confirmed the live proxy deployment's namespace + pod labels:** |
| 220 | + ```bash |
| 221 | + kubectl get pods -A -l app=instant-pg-proxy -o wide |
| 222 | + # (the proxy manifest lives in the separate InstaNode-dev/instant-pg-proxy |
| 223 | + # repo, NOT here — read the real ns + labels off the live cluster.) |
| 224 | + ``` |
| 225 | +2. **Uncommented + edited the dormant pg-proxy block** in |
| 226 | + `networkpolicy.yaml` (lines ~88–103) to match those real ns/labels. |
| 227 | +3. **Confirmed Cilium (the CNI) actually enforces NetworkPolicy** in this |
| 228 | + cluster (`kubectl get ds -n kube-system | grep cilium`). |
| 229 | + |
| 230 | +Only then: |
| 231 | + |
| 232 | +```bash |
| 233 | +kubectl apply --dry-run=server -f k8s/data/networkpolicy.yaml # read diff |
| 234 | +kubectl apply -f k8s/data/networkpolicy.yaml |
| 235 | +``` |
| 236 | + |
| 237 | +### S2 verification gate (must run IMMEDIATELY after apply) |
| 238 | + |
| 239 | +```bash |
| 240 | +# (a) Legit in-cluster caller (provisioner) still reaches postgres-customers: |
| 241 | +kubectl exec -n instant-infra deploy/instant-provisioner -- \ |
| 242 | + sh -c 'nc -z -w5 postgres-customers.instant-data.svc.cluster.local 5432 && echo OK' |
| 243 | +# Expect: OK |
| 244 | + |
| 245 | +# (b) THE CUSTOMER PATH still works — connect a real customer usr_<token> |
| 246 | +# through pg.instanode.dev (same DSN as S1 check (c)). If this now FAILS |
| 247 | +# where it worked pre-apply, the pg-proxy allow-rule is missing/wrong: |
| 248 | +# kubectl delete -f k8s/data/networkpolicy.yaml # IMMEDIATE rollback |
| 249 | +# then fix the dormant pg-proxy block and re-apply. |
| 250 | + |
| 251 | +# (c) The 4 NetworkPolicies are present: |
| 252 | +kubectl get networkpolicy -n instant-data |
| 253 | +# Expect: postgres-customers-ingress, redis-provision-ingress, mongodb-ingress, |
| 254 | +# nats-ingress. |
| 255 | +``` |
| 256 | + |
| 257 | +**Rollback S2 (do this fast if customers report connection errors):** |
| 258 | + |
| 259 | +```bash |
| 260 | +kubectl delete -f k8s/data/networkpolicy.yaml |
| 261 | +# Removing the policies returns the pods to allow-all ingress (the pre-apply |
| 262 | +# state). No data loss; instant effect. |
| 263 | +``` |
| 264 | + |
| 265 | +--- |
| 266 | + |
| 267 | +## Apply-order summary |
| 268 | + |
| 269 | +| # | Tag | Command | Verify | Reversible | |
| 270 | +|---|---|---|---|---| |
| 271 | +| 1 | R7-A | `kubectl apply -f k8s/data/stateful-priority.yaml` | `kubectl get pdb,priorityclass -n instant-data` | `kubectl delete -f …` | |
| 272 | +| 2 | R7-B | apply `mongodb.yaml`,`redis-provision.yaml` + patch `priorityClassName` | QoS = Burstable | re-apply prior manifest | |
| 273 | +| 3 | R6 | `kubectl apply -f k8s/data/nats.yaml` (+ priorityClassName patch) | PVC Bound + durability publish/restart test | revert manifest | |
| 274 | +| 4 | S1 | apply lockdown ConfigMap + `postgres-customers.yaml` (+ patch) | **external admin psql REJECTED** | per lockdown runbook | |
| 275 | +| 5 | S2 | **edit dormant pg-proxy rule FIRST**, then apply `networkpolicy.yaml` | provisioner reaches pg AND customer path works | `kubectl delete -f …` | |
| 276 | + |
| 277 | +After every step, sanity-check the platform hot path: |
| 278 | + |
| 279 | +```bash |
| 280 | +curl -sS https://api.instanode.dev/healthz | jq . |
| 281 | +curl -sS https://api.instanode.dev/readyz | jq . # data-tier deep readiness |
| 282 | +``` |
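Piping `curl -sS` into `jq` prints the body but does not fail on an HTTP error status. A sketch of a stricter variant, assuming a hypothetical helper `check_endpoints`; `curl -f` exits non-zero on HTTP 4xx/5xx, and the `CURL` override exists only for illustration:

```bash
# Hypothetical helper: probe /healthz and /readyz under a base URL and stop
# at the first endpoint that fails or returns an HTTP error status.
check_endpoints() {
  local base="$1" path
  for path in /healthz /readyz; do
    if ! "${CURL:-curl}" -fsS --max-time 10 "$base$path" >/dev/null; then
      echo "FAIL: $base$path" >&2
      return 1
    fi
    echo "ok: $base$path"
  done
}

# Usage after each apply step:
# check_endpoints https://api.instanode.dev || echo "hot path degraded: investigate before continuing"
```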
| 283 | + |
| 284 | +--- |
| 285 | + |
| 286 | +## Related |
| 287 | + |
| 288 | +- `k8s/APPLY-CHECKLIST.md` — the api/worker/provisioner Deployment apply rules. |
| 289 | +- `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md` — the full S1 procedure + root cause. |
| 290 | +- `NATS-AUTH-RUNBOOK.md` — NATS operator-mode key generation (separate from R6). |
| 291 | +- `k8s/data/networkpolicy.yaml` — the S2 policy with the dormant pg-proxy block. |
| 292 | +- CLAUDE.md rule 15 — why this repo has no auto-apply. |