feat(alerts): pg-proxy role-gate disabled / proxy down (truehomie residual) (#65)
Adds the alert for the last residual of the 2026-06-03 truehomie-db DROP
durable fix: the instant-pg-proxy role-gate (PG_PROXY_DENIED_ROLES) is now
committed to a manifest (InstaNode-dev/instant-pg-proxy k8s/), but nothing
alerted if the gate were ever disabled or the proxy went down.
The proxy is a thin TCP proxy with slog-to-stdout only — it exposes NO
/metrics endpoint, so a Prometheus-metric rule is not possible today. The
lowest-effort reliable signal is the proxy's startup log line
`pgproxy.role_gate{denied_role_count}` (count>0 = gate ON, 0 = exposure),
shipped to NR via the newrelic-logging Fluent Bit DaemonSet (verified
running on all nodes). Two log-based NR alerts (operator-apply):
- pg-proxy-role-gate-disabled.json (P0) — fires on denied_role_count==0
- pg-proxy-down.json (P1) — fires on 10m proxy log silence
Plus an admin-defense dashboard page ("pg-proxy public-path gate", 4 tiles)
and a METRICS-CATALOG row (rule 25). Runbook §3a + §9 updated: residual
closed, manifest no-op-verified vs live, alert documented. Proper durable
upgrade documented: add a pgproxy_role_gate_denied_roles gauge + /metrics +
a worker synthetic-reject prober leg.
Operator-apply only (no auto-apply on infra). No live behavior changed.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
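
The gate-ON/gate-OFF reading of the startup log line can be checked offline. This is a minimal sketch, assuming the proxy's slog JSON output carries the event name under `msg` and the count under a top-level integer `denied_role_count`; that key layout is an assumption, since the commit message only names the event and the field. `gate_state` is a hypothetical helper and is not part of the proxy or of the NR alert definitions.

```python
import json


def gate_state(log_line: str) -> str:
    """Classify one proxy stdout line as 'on', 'off', or 'other'.

    Assumed line shape (hypothetical):
    {"msg": "pgproxy.role_gate", "denied_role_count": 4}
    """
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        return "other"  # not JSON, so not the startup event
    if not isinstance(record, dict) or record.get("msg") != "pgproxy.role_gate":
        return "other"  # some other proxy event
    count = record.get("denied_role_count")
    # Fail closed: anything that is not provably a positive integer counts as OFF.
    if isinstance(count, int) and not isinstance(count, bool) and count > 0:
        return "on"
    return "off"
```

Treating a missing or malformed count as `off` mirrors the intent of the P0 alert: an unreadable gate state is handled as exposure rather than ignored.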
POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md: 31 additions & 5 deletions
@@ -162,10 +162,35 @@ Live-verified: after the rollout the proxy runs on NEW IPs (`10.109.6.132`,
 still rejected (at the proxy, not pg_hba). The pg_hba proxy-IP reject lines are
 therefore now **redundant belt-and-suspenders** — left in place (harmless), no
 longer the sole boundary, no longer require churn-refresh on reschedule.
-**Remaining operator follow-up:** add an alert on `instant-pg-proxy` pod restarts
-(defense-in-depth visibility), and ensure any future redeploy preserves the
-`PG_PROXY_DENIED_ROLES` env (it lives only on the live Deployment patch — fold it
-into a committed manifest when one is created for the proxy).
+**✅ RESIDUAL CLOSED (2026-06-06).** Both halves of the remaining follow-up are done:
+1. **Manifest committed (durable).** `PG_PROXY_DENIED_ROLES` no longer lives only on
+   a live `kubectl patch`. The proxy Deployment + Service are now committed to
+   `InstaNode-dev/instant-pg-proxy` under `k8s/` (`deployment.yaml`, `service.yaml`,
+   `README.md`) as the source of truth. The manifest was captured faithfully from the
+   live spec and is a **verified no-op** — `kubectl diff -f k8s/deployment.yaml -n instant`
+   returned empty (exit 0), confirming it matches the running proxy (image `v0.2.0` +
+   the role-gate env) and re-applying it changes nothing. A future `kubectl delete` +
+   `kubectl apply -f k8s/` therefore restores the gate instead of silently dropping it.
+   Re-apply is safe at any time; no apply is needed today (already matches live).
+2. **Alert shipped (operator-apply).** Two log-based NR alerts watch the gate:
+   - `infra/newrelic/alerts/pg-proxy-role-gate-disabled.json` (P0/CRITICAL) — fires when
+     a proxy pod logs `pgproxy.role_gate` with `"denied_role_count":0` (gate disabled,
+     e.g. a re-create that dropped the env).
+   - `infra/newrelic/alerts/pg-proxy-down.json` (P1/CRITICAL) — fires when the proxy
+     emits zero logs for 10m (proxy down / public path broken).
+   Both query the proxy's stdout JSON in NR (`k8s_namespace_name='instant' AND
+   k8s_label_app='instant-pg-proxy'`), shipped via the `newrelic-logging` Fluent Bit
+   DaemonSet (verified running, all nodes). Dashboard: `admin-defense.json` →
+   "pg-proxy public-path gate" page (4 tiles). Catalog row in
+   `infra/observability/METRICS-CATALOG.md`.
+   **Why log-based (interim):** the proxy is a thin TCP proxy with slog-to-stdout only —
+   it exposes **no `/metrics` endpoint**, so a Prometheus-metric rule is not possible
+   today. The log signal (`pgproxy.role_gate denied_role_count`) is the lowest-effort
+   reliable alarm and needs zero code change. **Proper durable upgrade (follow-up):**
+   add a `pgproxy_role_gate_denied_roles` gauge + an HTTP `/metrics` listener to the
+   proxy, scrape it, alert on `gauge == 0`, AND add a worker synthetic-reject prober leg
+   (open a raw StartupMessage to `pg.instanode.dev` as `instanode_admin`, assert FATAL
+   `28000`). Until then the log alerts are the alarm.
 
 - If the proxy-pod-IP reject lines in the ConfigMap do NOT match the live proxy
 IPs at apply time → FIX them first, else the lockdown is a no-op for the live
@@ -326,13 +351,14 @@ the chokepoint ensures every *sanctioned* drop is recorded; the CI guard ensures
|---|---|---|---|
| 2026-06-06 | Claude (operator-authorized apply, "no customers, low blast radius") |**APPLIED to do-nyc3-instant-prod.** Merged PR #61 (squash, merge commit `78cb6677`) after fixing the manifest for two live findings (see below). Applied ConfigMap `postgres-customers-hba`; patched `deploy/postgres-customers` to mount it + `-c hba_file=/etc/postgresql/pg_hba.conf -c password_encryption=scram-sha-256`; changed strategy `RollingUpdate→Recreate` (RWO PVC Multi-Attach). Did NOT apply `networkpolicy.yaml` (verified NOT enforced in prod; applying as-is would default-deny the proxy path). |**SUCCESS.** External admin REJECTED at pg_hba (both `instanode_admin` + `instant_cust`, error names the SNAT'd proxy pod IP) — baseline beforehand reached scram (vector was OPEN). In-cluster admin preserved: provisioner `instant_cust` CREATE/DROP smoke OK, api/worker `instanode_admin` connect + `pg_database_size` OK, customer `usr_*` path still reaches scram. No rollback. |
| 2026-06-06 | Claude (operator-authorized, "no customers, low blast radius") | **DURABLE FIX SHIPPED + DEPLOYED — the churn-proof pg-proxy role-gate.** Created the `InstaNode-dev/instant-pg-proxy` repo (did not exist before — the proxy source was a loose, un-versioned local dir; live image was `ghcr.io/mastermanas805/instant-pg-proxy:v0.1.0` applied by hand, no committed manifest). Merged PR #1 (squash, merge commit `5a86c93`): the proxy parses the StartupMessage `user` and, if in `PG_PROXY_DENIED_ROLES`, returns a FATAL `28000` ErrorResponse (`role is not permitted over the public endpoint`) BEFORE resolving/dialing — default empty = inert. Built+pushed `ghcr.io/mastermanas805/instant-pg-proxy:v0.2.0`; `kubectl patch deploy/instant-pg-proxy -n instant` → image v0.2.0 + `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`. | **SUCCESS — durable closure verified, pod-IP-independent.** Rollout landed new pods at `10.109.6.132`/`10.109.4.98` (NOT the `10.109.4.113`/`10.109.0.101` the pg_hba reject lines name — those lines now point at DEAD pods, yet admin is STILL rejected, proving independence). External `instanode_admin`/`instant_cust`/`postgres` over `pg.instanode.dev` → **proxy 28000** (`role is not permitted over the public endpoint`), NOT a pg_hba reject naming a pod IP. Proxy logged `user_denied_public` for all three. Customer `usr_*` → FORWARDED (reached postgres scram → `password authentication failed`, not 28000). In-cluster admin via ClusterIP svc UNAFFECTED: `instant_cust` CREATE+DROP OK (`INCLUSTER_PROVISION_PATH_OK`), `pg_database_size` quota read OK. Provisioner DSN confirmed → `postgres-customers.instant-data.svc.cluster.local:5432` (svc, NOT the public proxy). The pg_hba proxy-IP reject lines are now redundant belt-and-suspenders (left in place, harmless). |
+| 2026-06-06 | Claude (operator-authorized, "no customers, low blast radius") | **RESIDUAL CLOSED — role-gate persisted to a committed manifest + alerted.** The `PG_PROXY_DENIED_ROLES` env previously lived ONLY on the live `kubectl patch` (a manual Deployment re-create would have silently dropped it → reopened the admin vector). (1) Captured the LIVE spec faithfully (`kubectl get deploy/svc instant-pg-proxy -n instant -o yaml`), stripped live-only noise, committed `k8s/deployment.yaml` + `k8s/service.yaml` + `k8s/README.md` to `InstaNode-dev/instant-pg-proxy` (default branch master) as the source of truth (PR, squash auto-merge). (2) Added two log-based NR alerts (operator-apply) — `pg-proxy-role-gate-disabled.json` (P0; fires on `pgproxy.role_gate denied_role_count==0`) + `pg-proxy-down.json` (P1; fires on 10m proxy log silence) — plus an admin-defense dashboard page + METRICS-CATALOG row (infra PR, squash auto-merge). The proxy exposes no `/metrics`, so the log signal is the lowest-effort reliable alarm; a `pgproxy_role_gate_denied_roles` gauge + synthetic-reject prober leg are the documented durable upgrade. | **SUCCESS — manifest is a verified no-op vs live; live behavior unchanged.** `kubectl diff -f k8s/deployment.yaml -n instant` → empty output, exit 0 (tooling sanity-checked: a deliberate `replicas: 2→3` edit DID surface drift, so the empty diff is genuine). `kubectl diff -f k8s/service.yaml` → also empty/exit 0. Live state at capture: image `v0.2.0`, `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`, pods Running 2/2, `pgproxy.role_gate denied_role_count:4` in logs, and live `pgproxy.user_denied_public` events observed for `instanode_admin`/`instant_cust`/`postgres` (gate actively rejecting). `newrelic-logging` Fluent Bit DaemonSet confirmed running on all nodes (proxy stdout reaches NR `Log`). NO `kubectl apply` performed (not needed — manifest already matches live); operator may apply anytime safely. The infra alerts are operator-apply. |

 **Manifest fixes made before apply (live pre-apply verification):**
 1. **`instanode_admin` was missing.** Prod has TWO superusers — `instanode_admin` (api/worker `CUSTOMER_DATABASE_URL`, the CONFIRMED truehomie vector) and `instant_cust` (provisioner `POSTGRES_CUSTOMERS_URL`). The original PR rejected only `instant_cust`; `instanode_admin` would have matched the catch-all customer allow → vector still open. Both now rejected.
 2. **pg-proxy SNAT defeats source-CIDR.** instant-pg-proxy (in-cluster, no hostNetwork) re-originates TCP, so external admin arrives SNAT'd to a proxy pod IP inside `10.0.0.0/8` — a plain `10.0.0.0/8 allow` matches it. Added proxy-pod-IP `reject` lines (`10.109.4.113`, `10.109.0.101`) ordered BEFORE the in-cluster allow. **Verified in the reject error message** (`rejects connection for host "10.109.0.101"`). ⚠️ Churn dependency, see §3a.

 **Operator follow-ups created by this apply:**
 - ~~**Ship the durable pg-proxy role-gate**~~ ✅ **DONE 2026-06-06.** `PG_PROXY_DENIED_ROLES` shipped (repo `InstaNode-dev/instant-pg-proxy` created + PR #1, merge `5a86c93`), image `v0.2.0` built+pushed, deployed to `deploy/instant-pg-proxy` with `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`. Live-verified the closure is now pod-IP-independent (see §3a + Drill Log row 2). The closure no longer depends on the churning proxy-pod-IP reject lines.
-- ~~**On any `instant-pg-proxy` reschedule:** refresh the proxy-IP reject lines~~ — **no longer required for the security boundary** (the role-gate is now the durable boundary). The pg_hba IP reject lines are redundant belt-and-suspenders; leave them. Still recommended: add a proxy-pod-restart alert for visibility, and persist `PG_PROXY_DENIED_ROLES` into a committed proxy Deployment manifest (currently the env lives only on the live `kubectl patch` — a manual re-create of the Deployment would drop it).
+- ~~**On any `instant-pg-proxy` reschedule:** refresh the proxy-IP reject lines~~ — **no longer required for the security boundary** (the role-gate is now the durable boundary). The pg_hba IP reject lines are redundant belt-and-suspenders; leave them. ~~Still recommended: add a proxy-pod-restart alert for visibility, and persist `PG_PROXY_DENIED_ROLES` into a committed proxy Deployment manifest~~ — **✅ DONE 2026-06-06** (see the "RESIDUAL CLOSED" block in §3a and Drill Log row 3): the proxy Deployment+Service are committed to `InstaNode-dev/instant-pg-proxy` `k8s/` (verified no-op vs live via `kubectl diff`), and two log-based NR alerts (`pg-proxy-role-gate-disabled.json` + `pg-proxy-down.json`) + the admin-defense "pg-proxy public-path gate" dashboard page watch the gate. The proxy has no `/metrics`, so a `pgproxy_role_gate_denied_roles` gauge + synthetic-reject prober leg are the proper durable upgrade (follow-up).
 - **`k8s/data/postgres-customers.yaml` updated** to carry the mount/args/Recreate-strategy so a future repo apply does not silently revert the lockdown (shipped in the same follow-up PR).
 - The repo `apply.yml` workflow now includes `postgres-customers-lockdown.yaml` (safe — ConfigMap) but ALSO `networkpolicy.yaml`; running that workflow WOULD create the unenforced-today NetPol and default-deny the proxy path. Add it to the apply EXCLUDE list or add the pg-proxy ingress rule before anyone runs the workflow.

+  "name": "instant-pg-proxy — emitted zero logs in 10m (proxy down; customer-DB public path)",
+  "type": "NRQL",
+  "description": "P1. The instant-pg-proxy pods emitted ZERO log records for 10 minutes — a liveness check for the Postgres-aware TCP proxy that fronts postgres-customers on the public pg.instanode.dev:5432 path. The proxy logs continuously under normal operation (each pod logs `pgproxy.role_gate` + `pgproxy.starting` on boot; every public privileged-role attempt logs `pgproxy.user_denied_public`; connections log at debug), so a 10-minute silence on the whole Deployment = crashed/evicted/scaled-to-zero proxy. Customer connections to pg.instanode.dev then fail (fail-safe), and an operator must not 'fix' it by re-pointing ingress-nginx tcp-services 5432 straight at postgres-customers — that would bypass the role-gate and REOPEN the 2026-06-03 admin DROP vector. Restore the proxy from source of truth instead: `kubectl apply -f k8s/` (repo InstaNode-dev/instant-pg-proxy).\n\nComplements pg-proxy-role-gate-disabled.json (which catches the gate being DISABLED while the proxy is up). Together they bound the boundary: gate-off (exposure) and proxy-off (path broken). The proxy has no /metrics endpoint, so this is a log-liveness check, not a metric scrape. fillValue STATIC 0 ensures a fully silent stream is treated as zero, not no-data.\n\nSource: stdout of instant-pg-proxy pods (k8s_namespace_name='instant', k8s_label_app='instant-pg-proxy'), via the newrelic-logging Fluent Bit DaemonSet. Runbook: infra/POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §3a + §9.\n\nWhen this fires:\n 1. `kubectl get pods -n instant -l app=instant-pg-proxy -o wide` — are the pods Running?\n 2. `kubectl describe pod -n instant -l app=instant-pg-proxy` + `kubectl logs --previous` — crash cause (OOM, image pull, Redis URL).\n 3. `kubectl rollout status deploy/instant-pg-proxy -n instant`.\n 4. If the Deployment is gone, re-create from source of truth: `kubectl apply -f k8s/` (InstaNode-dev/instant-pg-proxy) — this restores the role-gate env too.\n 5. Confirm `ingress-nginx/ingress-nginx-tcp` 5432 still maps to instant/instant-pg-proxy:5432 (NOT instant-data/postgres-customers).",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE k8s_namespace_name = 'instant' AND k8s_label_app = 'instant-pg-proxy'"

+  "name": "instant-pg-proxy — role-gate DISABLED or proxy down (customer-DB admin exposure)",
+  "type": "NRQL",
+  "description": "P0 SECURITY. The instant-pg-proxy role-gate (PG_PROXY_DENIED_ROLES) is the durable boundary that bars privileged/superuser roles (instanode_admin, instant_cust, postgres, doadmin) from authenticating over the public pg.instanode.dev:5432 path — it closes the 2026-06-03 truehomie-db DROP DATABASE vector. If the gate is disabled (env dropped on a Deployment re-create) OR the proxy goes down, the customer-DB admin exposure reopens.\n\nThis alert has TWO terms:\n (A) CRITICAL — gate explicitly DISABLED: the proxy logs `pgproxy.role_gate` with `\"denied_role_count\":0` at startup. A non-zero count = gate ON; 0 = inert passthrough (no roles barred). This fires the moment a pod starts with PG_PROXY_DENIED_ROLES empty/missing (e.g. a Deployment re-create that dropped the env). Text-matched on the JSON log so it is robust regardless of whether the numeric field is lifted as an NR attribute.\n (B) CRITICAL — proxy SILENT (down): zero `pgproxy.*` log records from the instant-pg-proxy pods in 10m. The proxy logs continuously (every connection logs at debug, every reject logs `pgproxy.user_denied_public` WARN, and each pod logs `pgproxy.role_gate` + `pgproxy.starting` on boot), so a 10-minute silence = crashed/evicted/scaled-to-zero proxy and the public path is either broken (fail-safe) or being served by something else.\n\nINTERIM SIGNAL — the proxy exposes NO /metrics endpoint today (it is a thin TCP proxy with slog JSON to stdout only). This log-based alert is the lowest-effort reliable signal and requires zero code change. PROPER FOLLOW-UP (durable upgrade): add a `pgproxy_role_gate_denied_roles` gauge + an HTTP /metrics listener to the proxy (repo InstaNode-dev/instant-pg-proxy), scrape it from Prometheus, and add a metric-based rule (gate gauge == 0) PLUS a synthetic-reject prober leg in the worker (open a raw StartupMessage to pg.instanode.dev as instanode_admin and assert a FATAL 28000 'role is not permitted over the public endpoint'). Until then, this log alert is the alarm.\n\nSource: stdout of instant-pg-proxy pods (k8s_namespace_name='instant', k8s_label_app='instant-pg-proxy'), shipped to NR via the newrelic-logging Fluent Bit DaemonSet. Manifest source of truth: InstaNode-dev/instant-pg-proxy k8s/deployment.yaml. Runbook: infra/POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §3a + §9.\n\nWhen this fires:\n 1. `kubectl get deploy/instant-pg-proxy -n instant -o jsonpath='{.spec.template.spec.containers[0].env}'` — confirm PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin is present.\n 2. If missing: `kubectl apply -f k8s/` from InstaNode-dev/instant-pg-proxy (restores the gate from source of truth).\n 3. `kubectl logs -n instant -l app=instant-pg-proxy | grep role_gate` — verify denied_role_count is back to 4.\n 4. Externally verify the boundary: `psql 'postgresql://instanode_admin:<pw>@pg.instanode.dev:5432/postgres'` must be REJECTED with FATAL 28000 (not reach scram).",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE k8s_namespace_name = 'instant' AND k8s_label_app = 'instant-pg-proxy' AND message LIKE '%pgproxy.role_gate%' AND message LIKE '%\"denied_role_count\":0%'"
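
The synthetic-reject prober leg named as the durable upgrade could be built around two wire-format helpers. What follows is a rough sketch only, not the worker's implementation: it builds a PostgreSQL protocol 3.0 StartupMessage and decodes an ErrorResponse, with no network I/O. All function names are hypothetical. Whether the public endpoint expects an SSLRequest before the StartupMessage is not stated in the commit, so transport is left out. Because a pg_hba reject may carry the same SQLSTATE, the sketch also matches the message text quoted in the runbook.

```python
import struct

PROTOCOL_3_0 = 196608  # (3 << 16) | 0, the protocol version sent in a StartupMessage
GATE_MESSAGE = "role is not permitted over the public endpoint"  # text quoted in the runbook


def build_startup_message(user: str, database: str) -> bytes:
    """Encode a StartupMessage: int32 length, int32 version, key/value C-strings, final NUL."""
    params = b""
    for key, value in (("user", user), ("database", database)):
        params += key.encode() + b"\x00" + value.encode() + b"\x00"
    params += b"\x00"  # terminator after the last pair
    body = struct.pack("!I", PROTOCOL_3_0) + params
    return struct.pack("!I", len(body) + 4) + body  # length counts itself


def parse_error_response(message: bytes) -> dict:
    """Decode an ErrorResponse ('E') into {field_code: value}."""
    if message[:1] != b"E":
        raise ValueError("not an ErrorResponse")
    (length,) = struct.unpack("!I", message[1:5])
    payload = message[5:1 + length]  # length excludes the type byte
    fields = {}
    for chunk in payload.split(b"\x00"):
        if chunk:  # skip the empty pieces left by the terminators
            fields[chr(chunk[0])] = chunk[1:].decode()
    return fields


def rejected_by_gate(message: bytes) -> bool:
    """True only for a FATAL 28000 ErrorResponse carrying the gate's message text."""
    if message[:1] != b"E":
        return False  # e.g. an authentication request: the role reached Postgres
    fields = parse_error_response(message)
    return (
        fields.get("S") == "FATAL"
        and fields.get("C") == "28000"
        and GATE_MESSAGE in fields.get("M", "")
    )
```

A prober built on these helpers would send `build_startup_message("instanode_admin", "postgres")` to the public endpoint and raise an alarm whenever `rejected_by_gate` returns `False` for the first reply.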