Commit 33131aa

docs(lockdown): mark durable pg-proxy role-gate DONE (shipped+deployed+verified) (#64)
The churn-proof PG_PROXY_DENIED_ROLES role-gate is now live in prod (InstaNode-dev/instant-pg-proxy PR #1, image v0.2.0, deployed with PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin).

Live-verified pod-IP-independent: external admin rejected at the PROXY layer (28000) even though the proxy now runs on new IPs the pg_hba reject lines don't name; customer usr_* still forwarded; in-cluster provisioning via ClusterIP svc unaffected. The pg_hba proxy-IP reject lines are now redundant belt-and-suspenders.

Updates §3a (churn warning → mitigated), §7 (scope), §9 Drill Log (new row + follow-up closed).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 73ff5c6 commit 33131aa

1 file changed

Lines changed: 30 additions & 13 deletions

POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md

@@ -144,15 +144,28 @@ kubectl get pods -n instant -l app=instant-pg-proxy -o jsonpath='{range .items[*
144144
# lines at the TOP of the admin block in postgres-customers-lockdown.yaml.
145145
```
146146

147-
**⚠️ CHURN: these IPs change when the proxy reschedules** — that is exactly how
148-
the 2026-06-03 hand-stopgap silently broke (it listed `10.109.3.201`, now dead).
149-
After ANY proxy reschedule, re-run the command above, update the two reject lines,
150-
re-apply the ConfigMap, and `SELECT pg_reload_conf()`. The **durable** churn-proof
151-
closer is the pg-proxy's own privileged-role deny (`PG_PROXY_DENIED_ROLES`, staged
152-
in repo `InstaNode-dev/instant-pg-proxy` per memory) — once that ships and is
153-
deployed, the proxy rejects admin roles before forwarding and these IP lines
154-
become redundant belt-and-suspenders. **Operator follow-up: ship the proxy
155-
role-gate; add an alert on proxy pod restarts so the pg_hba IPs can be refreshed.**
147+
**⚠️ CHURN (historical — now mitigated by the durable role-gate, see below):**
148+
these IPs change when the proxy reschedules — that is exactly how the 2026-06-03
149+
hand-stopgap silently broke (it listed `10.109.3.201`, now dead). After ANY proxy
150+
reschedule, re-run the command above, update the two reject lines, re-apply the
151+
ConfigMap, and `SELECT pg_reload_conf()`.
152+
153+
**✅ DURABLE FIX SHIPPED + DEPLOYED (2026-06-06) — the churn dependency is closed.**
154+
The pg-proxy's own privileged-role deny (`PG_PROXY_DENIED_ROLES`) is LIVE: repo
155+
`InstaNode-dev/instant-pg-proxy` (created 2026-06-06; it did not exist before),
156+
PR #1 (merge `5a86c93`), image `ghcr.io/mastermanas805/instant-pg-proxy:v0.2.0`,
157+
env `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`. The
158+
proxy now rejects admin roles at the StartupMessage with a FATAL `28000`
159+
ErrorResponse BEFORE resolving/dialing — **independent of pod IPs / pg_hba.**
160+
Live-verified: after the rollout the proxy runs on NEW IPs (`10.109.6.132`,
161+
`10.109.4.98`) that the pg_hba reject lines do NOT name, yet external admin is
162+
still rejected (at the proxy, not pg_hba). The pg_hba proxy-IP reject lines are
163+
therefore now **redundant belt-and-suspenders** — left in place (harmless), no
164+
longer the sole boundary, and no longer in need of a churn refresh on reschedule.
165+
**Remaining operator follow-up:** add an alert on `instant-pg-proxy` pod restarts
166+
(defense-in-depth visibility), and ensure any future redeploy preserves the
167+
`PG_PROXY_DENIED_ROLES` env (it lives only on the live Deployment patch — fold it
168+
into a committed manifest when one is created for the proxy).
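
The role-gate described above can be illustrated with a minimal Python sketch. It is not the proxy's actual source (which is not shown here); the function names are hypothetical, and the choice to forward a connection whose `user` cannot be parsed is an assumption of this sketch. It relies only on the PostgreSQL v3 wire format: a StartupMessage is a length, a protocol version, and NUL-terminated key/value pairs, and an ErrorResponse is an `E` byte, a length, and typed NUL-terminated fields.

```python
import struct

PROTOCOL_V3 = 196608  # protocol 3.0, i.e. (3 << 16) | 0


def parse_denied_roles(env_value: str) -> frozenset:
    """Split a PG_PROXY_DENIED_ROLES-style value; an empty value means no gate."""
    return frozenset(r.strip() for r in env_value.split(",") if r.strip())


def parse_startup_user(msg: bytes):
    """Return the `user` parameter of a v3 StartupMessage, or None if absent."""
    if len(msg) < 8:
        return None
    length, version = struct.unpack("!II", msg[:8])
    if length != len(msg) or version != PROTOCOL_V3:
        return None
    # Body: NUL-terminated key/value strings followed by one final NUL.
    fields = msg[8:].split(b"\x00")
    params = dict(zip(fields[0::2], fields[1::2]))
    user = params.get(b"user")
    return user.decode("utf-8", "replace") if user else None


def fatal_error_response(sqlstate: str, message: str) -> bytes:
    """Build an ErrorResponse with severity FATAL, a SQLSTATE, and a message."""
    fields = b"".join(
        code + text.encode() + b"\x00"
        for code, text in ((b"S", "FATAL"), (b"C", sqlstate), (b"M", message))
    )
    payload = fields + b"\x00"
    # The length field counts itself (4 bytes) but not the leading type byte.
    return b"E" + struct.pack("!I", len(payload) + 4) + payload


def gate(startup_msg: bytes, denied: frozenset):
    """Return an ErrorResponse for a denied role, else None (caller may dial)."""
    user = parse_startup_user(startup_msg)
    if user is not None and user in denied:
        return fatal_error_response(
            "28000", "role is not permitted over the public endpoint"
        )
    return None


def build_startup(user: str, database: str = "postgres") -> bytes:
    """Helper for the example only: assemble a StartupMessage for `user`."""
    body = (b"user\x00" + user.encode() + b"\x00"
            + b"database\x00" + database.encode() + b"\x00\x00")
    return struct.pack("!II", len(body) + 8, PROTOCOL_V3) + body


denied = parse_denied_roles("instanode_admin,instant_cust,postgres,doadmin")
print(gate(build_startup("instanode_admin"), denied) is not None)
print(gate(build_startup("usr_example"), denied) is None)
```

Because the decision uses only the role name in the StartupMessage, it does not depend on which pod IP the proxy happens to run on, which is the property the live verification checked.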
156169

157170
- If the proxy-pod-IP reject lines in the ConfigMap do NOT match the live proxy
158171
IPs at apply time → FIX them first, else the lockdown is a no-op for the live
@@ -280,8 +293,11 @@ consumer the analysis missed.
280293

281294
- It does **not** close the public TCP port on 5432 (customers connect there).
282295
The admin boundary is the pg_hba **role reject**, not the port.
283-
- It does **not** touch the `instant-pg-proxy` repo (the proxy's own role-gate /
284-
pg_hba is the durable fix tracked separately, per memory + the audit doc).
296+
- ~~It does **not** touch the `instant-pg-proxy` repo~~ **superseded 2026-06-06:**
297+
the durable fix (the proxy's own `PG_PROXY_DENIED_ROLES` role-gate) IS now shipped
298+
+ deployed (repo `InstaNode-dev/instant-pg-proxy` PR #1, image v0.2.0). This
299+
runbook's pg_hba lockdown is now the belt-and-suspenders layer behind that gate.
300+
See §3a + the §9 Drill Log row 2.
285301
- It does **not** prove the truehomie dropper used this path (H1 remains
286302
hypothesis) — it removes the *capability*, which is the right action regardless.
287303
- It does **not** by itself add an audit trail for in-cluster admin DROPs — that
@@ -309,13 +325,14 @@ the chokepoint ensures every *sanctioned* drop is recorded; the CI guard ensures
309325
| Date | Operator | Action | Result |
310326
|---|---|---|---|
311327
| 2026-06-06 | Claude (operator-authorized apply, "no customers, low blast radius") | **APPLIED to do-nyc3-instant-prod.** Merged PR #61 (squash, merge commit `78cb6677`) after fixing the manifest for two live findings (see below). Applied ConfigMap `postgres-customers-hba`; patched `deploy/postgres-customers` to mount it + `-c hba_file=/etc/postgresql/pg_hba.conf -c password_encryption=scram-sha-256`; changed strategy `RollingUpdate→Recreate` (RWO PVC Multi-Attach). Did NOT apply `networkpolicy.yaml` (verified NOT enforced in prod; applying as-is would default-deny the proxy path). | **SUCCESS.** External admin REJECTED at pg_hba (both `instanode_admin` + `instant_cust`, error names the SNAT'd proxy pod IP) — baseline beforehand reached scram (vector was OPEN). In-cluster admin preserved: provisioner `instant_cust` CREATE/DROP smoke OK, api/worker `instanode_admin` connect + `pg_database_size` OK, customer `usr_*` path still reaches scram. No rollback. |
328+
| 2026-06-06 | Claude (operator-authorized, "no customers, low blast radius") | **DURABLE FIX SHIPPED + DEPLOYED — the churn-proof pg-proxy role-gate.** Created the `InstaNode-dev/instant-pg-proxy` repo (did not exist before — the proxy source was a loose, un-versioned local dir; live image was `ghcr.io/mastermanas805/instant-pg-proxy:v0.1.0` applied by hand, no committed manifest). Merged PR #1 (squash, merge commit `5a86c93`): the proxy parses the StartupMessage `user` and, if in `PG_PROXY_DENIED_ROLES`, returns a FATAL `28000` ErrorResponse (`role is not permitted over the public endpoint`) BEFORE resolving/dialing — default empty = inert. Built+pushed `ghcr.io/mastermanas805/instant-pg-proxy:v0.2.0`; `kubectl patch deploy/instant-pg-proxy -n instant` → image v0.2.0 + `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`. | **SUCCESS — durable closure verified, pod-IP-independent.** Rollout landed new pods at `10.109.6.132`/`10.109.4.98` (NOT the `10.109.4.113`/`10.109.0.101` the pg_hba reject lines name — those lines now point at DEAD pods, yet admin is STILL rejected, proving independence). External `instanode_admin`/`instant_cust`/`postgres` over `pg.instanode.dev` → **proxy 28000** (`role is not permitted over the public endpoint`), NOT a pg_hba reject naming a pod IP. Proxy logged `user_denied_public` for all three. Customer `usr_*` → FORWARDED (reached postgres scram → `password authentication failed`, not 28000). In-cluster admin via ClusterIP svc UNAFFECTED: `instant_cust` CREATE+DROP OK (`INCLUSTER_PROVISION_PATH_OK`), `pg_database_size` quota read OK. Provisioner DSN confirmed → `postgres-customers.instant-data.svc.cluster.local:5432` (svc, NOT the public proxy). The pg_hba proxy-IP reject lines are now redundant belt-and-suspenders (left in place, harmless). |
312329

313330
**Manifest fixes made before apply (live pre-apply verification):**
314331
1. **`instanode_admin` was missing.** Prod has TWO superusers — `instanode_admin` (api/worker `CUSTOMER_DATABASE_URL`, the CONFIRMED truehomie vector) and `instant_cust` (provisioner `POSTGRES_CUSTOMERS_URL`). The original PR rejected only `instant_cust`; `instanode_admin` would have matched the catch-all customer allow → vector still open. Both now rejected.
315332
2. **pg-proxy SNAT defeats source-CIDR.** instant-pg-proxy (in-cluster, no hostNetwork) re-originates TCP, so external admin arrives SNAT'd to a proxy pod IP inside `10.0.0.0/8` — a plain `10.0.0.0/8 allow` matches it. Added proxy-pod-IP `reject` lines (`10.109.4.113`, `10.109.0.101`) ordered BEFORE the in-cluster allow. **Verified in the reject error message** (`rejects connection for host "10.109.0.101"`). ⚠️ Churn dependency, see §3a.
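
The ordering issue in fix 2 comes from pg_hba.conf's first-match rule: the first line that matches the connection decides the outcome. The Python sketch below simulates that rule with a simplified, illustrative rule list (user, CIDR, method only; the real file's connection-type and database columns are omitted, and `scram-sha-256` as the allow method is an assumption). It also shows why the reject lines depended on the proxy pod IPs staying the same.

```python
import ipaddress

# Illustrative rules only, modelled on the fixes above; not the real ConfigMap.
RULES_WITH_REJECT = [
    ("instanode_admin", "10.109.4.113/32", "reject"),
    ("instanode_admin", "10.109.0.101/32", "reject"),
    ("instant_cust", "10.109.4.113/32", "reject"),
    ("instant_cust", "10.109.0.101/32", "reject"),
    ("all", "10.0.0.0/8", "scram-sha-256"),  # in-cluster allow
]
ALLOW_ONLY = [("all", "10.0.0.0/8", "scram-sha-256")]


def first_match(rules, user, client_ip):
    """Return the auth method of the first rule matching user and source IP."""
    addr = ipaddress.ip_address(client_ip)
    for rule_user, cidr, method in rules:
        if rule_user in ("all", user) and addr in ipaddress.ip_network(cidr):
            return method
    return "reject"  # PostgreSQL denies access when no line matches


# External admin arrives SNAT'd to a proxy pod IP inside 10.0.0.0/8.
print(first_match(ALLOW_ONLY, "instanode_admin", "10.109.0.101"))
print(first_match(RULES_WITH_REJECT, "instanode_admin", "10.109.0.101"))
# After a reschedule the proxy has a new IP that no reject line names.
print(first_match(RULES_WITH_REJECT, "instanode_admin", "10.109.6.132"))
```

The third call is the churn case: with pg_hba alone, an admin connection from an unlisted proxy IP falls through to the in-cluster allow.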
316333

317334
**Operator follow-ups created by this apply:**
318-
- **Ship the durable pg-proxy role-gate** (`PG_PROXY_DENIED_ROLES` in `InstaNode-dev/instant-pg-proxy`, staged per memory) so the closure no longer depends on the churning proxy-pod-IP reject lines in the ConfigMap.
319-
- **On any `instant-pg-proxy` reschedule:** refresh the two `host all instanode_admin/instant_cust <proxy-ip>/32 reject` lines in `postgres-customers-lockdown.yaml`, re-apply, `SELECT pg_reload_conf()`. Add a proxy-pod-restart alert.
335+
- ~~**Ship the durable pg-proxy role-gate**~~ **DONE 2026-06-06.** `PG_PROXY_DENIED_ROLES` shipped (repo `InstaNode-dev/instant-pg-proxy` created + PR #1, merge `5a86c93`), image `v0.2.0` built+pushed, deployed to `deploy/instant-pg-proxy` with `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`. Live-verified that the closure is now pod-IP-independent (see §3a + Drill Log row 2). The closure no longer depends on the churning proxy-pod-IP reject lines.
336+
- ~~**On any `instant-pg-proxy` reschedule:** refresh the proxy-IP reject lines~~ **no longer required for the security boundary** (the role-gate is now the durable boundary). The pg_hba IP reject lines are redundant belt-and-suspenders; leave them. Still recommended: add a proxy-pod-restart alert for visibility, and persist `PG_PROXY_DENIED_ROLES` into a committed proxy Deployment manifest (currently the env exists only on the live Deployment, set via `kubectl patch`, so a manual re-create of the Deployment would drop it).
320337
- **`k8s/data/postgres-customers.yaml` updated** to carry the mount/args/Recreate-strategy so a future repo apply does not silently revert the lockdown (shipped in the same follow-up PR).
321338
- The repo `apply.yml` workflow now includes `postgres-customers-lockdown.yaml` (safe — ConfigMap) but ALSO `networkpolicy.yaml`; running that workflow WOULD create the unenforced-today NetPol and default-deny the proxy path. Add it to the apply EXCLUDE list or add the pg-proxy ingress rule before anyone runs the workflow.
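
Until the env var is folded into a committed manifest, a small check can confirm that a redeploy has not dropped it. The Python sketch below is one possible approach, not an existing tool: it reads Deployment JSON (for example the output of `kubectl get deploy/instant-pg-proxy -n instant -o json`) and reports which expected roles are missing. The function names are hypothetical, and the sketch assumes the value is set literally with `value` rather than through `valueFrom`.

```python
import json

EXPECTED_ROLES = {"instanode_admin", "instant_cust", "postgres", "doadmin"}


def denied_roles_from_deployment(deployment: dict,
                                 var: str = "PG_PROXY_DENIED_ROLES"):
    """Return the roles listed in `var` on any container, or None if unset."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        for env in container.get("env") or []:
            if env.get("name") == var:
                value = env.get("value", "")
                return [r.strip() for r in value.split(",") if r.strip()]
    return None


def missing_roles(deployment: dict) -> set:
    """Expected roles absent from the Deployment's deny list."""
    found = denied_roles_from_deployment(deployment)
    return EXPECTED_ROLES - set(found or [])


# Example input shaped like `kubectl get deploy -o json` output (trimmed).
sample = json.loads("""
{"spec": {"template": {"spec": {"containers": [
  {"name": "proxy",
   "env": [{"name": "PG_PROXY_DENIED_ROLES",
            "value": "instanode_admin,instant_cust,postgres,doadmin"}]}
]}}}}
""")
print(sorted(missing_roles(sample)))
```

A non-empty result from `missing_roles` would indicate that the gate is weaker than the configuration recorded in this runbook.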
