Commit 4f26343

feat(alerts): pg-proxy role-gate disabled / proxy down (truehomie residual) (#65)
Adds the alert for the last residual of the 2026-06-03 truehomie-db DROP durable fix: the instant-pg-proxy role-gate (PG_PROXY_DENIED_ROLES) is now committed to a manifest (InstaNode-dev/instant-pg-proxy k8s/), but nothing alerted if the gate were ever disabled or the proxy went down.

The proxy is a thin TCP proxy with slog-to-stdout only — it exposes NO /metrics endpoint, so a Prometheus-metric rule is not possible today. The lowest-effort reliable signal is the proxy's startup log line `pgproxy.role_gate{denied_role_count}` (count>0 = gate ON, 0 = exposure), shipped to NR via the newrelic-logging Fluent Bit DaemonSet (verified running on all nodes).

Two log-based NR alerts (operator-apply):
- pg-proxy-role-gate-disabled.json (P0) — fires on denied_role_count==0
- pg-proxy-down.json (P1) — fires on 10m proxy log silence

Plus an admin-defense dashboard page ("pg-proxy public-path gate", 4 tiles) and a METRICS-CATALOG row (rule 25). Runbook §3a + §9 updated: residual closed, manifest no-op-verified vs live, alert documented.

Proper durable upgrade documented: add a pgproxy_role_gate_denied_roles gauge + /metrics + a worker synthetic-reject prober leg.

Operator-apply only (no auto-apply on infra). No live behavior changed.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 33131aa commit 4f26343

5 files changed

Lines changed: 160 additions & 6 deletions

POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md

Lines changed: 31 additions & 5 deletions
@@ -162,10 +162,35 @@ Live-verified: after the rollout the proxy runs on NEW IPs (`10.109.6.132`,
 still rejected (at the proxy, not pg_hba). The pg_hba proxy-IP reject lines are
 therefore now **redundant belt-and-suspenders** — left in place (harmless), no
 longer the sole boundary, no longer require churn-refresh on reschedule.
-**Remaining operator follow-up:** add an alert on `instant-pg-proxy` pod restarts
-(defense-in-depth visibility), and ensure any future redeploy preserves the
-`PG_PROXY_DENIED_ROLES` env (it lives only on the live Deployment patch — fold it
-into a committed manifest when one is created for the proxy).
+**✅ RESIDUAL CLOSED (2026-06-06).** Both halves of the remaining follow-up are done:
+1. **Manifest committed (durable).** `PG_PROXY_DENIED_ROLES` no longer lives only on
+a live `kubectl patch`. The proxy Deployment + Service are now committed to
+`InstaNode-dev/instant-pg-proxy` under `k8s/` (`deployment.yaml`, `service.yaml`,
+`README.md`) as the source of truth. The manifest was captured faithfully from the
+live spec and is a **verified no-op**: `kubectl diff -f k8s/deployment.yaml -n instant`
+returned empty (exit 0), confirming it matches the running proxy (image `v0.2.0` +
+the role-gate env) and re-applying it changes nothing. A future `kubectl delete` +
+`kubectl apply -f k8s/` therefore restores the gate instead of silently dropping it.
+Re-apply is safe at any time; no apply is needed today (already matches live).
+2. **Alert shipped (operator-apply).** Two log-based NR alerts watch the gate:
+- `infra/newrelic/alerts/pg-proxy-role-gate-disabled.json` (P0/CRITICAL) — fires when
+a proxy pod logs `pgproxy.role_gate` with `"denied_role_count":0` (gate disabled,
+e.g. a re-create that dropped the env).
+- `infra/newrelic/alerts/pg-proxy-down.json` (P1/CRITICAL) — fires when the proxy
+emits zero logs for 10m (proxy down / public path broken).
+Both query the proxy's stdout JSON in NR (`k8s_namespace_name='instant' AND
+k8s_label_app='instant-pg-proxy'`), shipped via the `newrelic-logging` Fluent Bit
+DaemonSet (verified running, all nodes). Dashboard: `admin-defense.json`
+"pg-proxy public-path gate" page (4 tiles). Catalog row in
+`infra/observability/METRICS-CATALOG.md`.
+**Why log-based (interim):** the proxy is a thin TCP proxy with slog-to-stdout only —
+it exposes **no `/metrics` endpoint**, so a Prometheus-metric rule is not possible
+today. The log signal (`pgproxy.role_gate denied_role_count`) is the lowest-effort
+reliable alarm and needs zero code change. **Proper durable upgrade (follow-up):**
+add a `pgproxy_role_gate_denied_roles` gauge + an HTTP `/metrics` listener to the
+proxy, scrape it, alert on `gauge == 0`, AND add a worker synthetic-reject prober leg
+(open a raw StartupMessage to `pg.instanode.dev` as `instanode_admin`, assert FATAL
+`28000`). Until then the log alerts are the alarm.
 
 - If the proxy-pod-IP reject lines in the ConfigMap do NOT match the live proxy
 IPs at apply time → FIX them first, else the lockdown is a no-op for the live
@@ -326,13 +351,14 @@ the chokepoint ensures every *sanctioned* drop is recorded; the CI guard ensures
 |---|---|---|---|
 | 2026-06-06 | Claude (operator-authorized apply, "no customers, low blast radius") | **APPLIED to do-nyc3-instant-prod.** Merged PR #61 (squash, merge commit `78cb6677`) after fixing the manifest for two live findings (see below). Applied ConfigMap `postgres-customers-hba`; patched `deploy/postgres-customers` to mount it + `-c hba_file=/etc/postgresql/pg_hba.conf -c password_encryption=scram-sha-256`; changed strategy `RollingUpdate→Recreate` (RWO PVC Multi-Attach). Did NOT apply `networkpolicy.yaml` (verified NOT enforced in prod; applying as-is would default-deny the proxy path). | **SUCCESS.** External admin REJECTED at pg_hba (both `instanode_admin` + `instant_cust`, error names the SNAT'd proxy pod IP) — baseline beforehand reached scram (vector was OPEN). In-cluster admin preserved: provisioner `instant_cust` CREATE/DROP smoke OK, api/worker `instanode_admin` connect + `pg_database_size` OK, customer `usr_*` path still reaches scram. No rollback. |
 | 2026-06-06 | Claude (operator-authorized, "no customers, low blast radius") | **DURABLE FIX SHIPPED + DEPLOYED — the churn-proof pg-proxy role-gate.** Created the `InstaNode-dev/instant-pg-proxy` repo (did not exist before — the proxy source was a loose, un-versioned local dir; live image was `ghcr.io/mastermanas805/instant-pg-proxy:v0.1.0` applied by hand, no committed manifest). Merged PR #1 (squash, merge commit `5a86c93`): the proxy parses the StartupMessage `user` and, if in `PG_PROXY_DENIED_ROLES`, returns a FATAL `28000` ErrorResponse (`role is not permitted over the public endpoint`) BEFORE resolving/dialing — default empty = inert. Built+pushed `ghcr.io/mastermanas805/instant-pg-proxy:v0.2.0`; `kubectl patch deploy/instant-pg-proxy -n instant` → image v0.2.0 + `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`. | **SUCCESS — durable closure verified, pod-IP-independent.** Rollout landed new pods at `10.109.6.132`/`10.109.4.98` (NOT the `10.109.4.113`/`10.109.0.101` the pg_hba reject lines name — those lines now point at DEAD pods, yet admin is STILL rejected, proving independence). External `instanode_admin`/`instant_cust`/`postgres` over `pg.instanode.dev` → **proxy 28000** (`role is not permitted over the public endpoint`), NOT a pg_hba reject naming a pod IP. Proxy logged `user_denied_public` for all three. Customer `usr_*` → FORWARDED (reached postgres scram → `password authentication failed`, not 28000). In-cluster admin via ClusterIP svc UNAFFECTED: `instant_cust` CREATE+DROP OK (`INCLUSTER_PROVISION_PATH_OK`), `pg_database_size` quota read OK. Provisioner DSN confirmed → `postgres-customers.instant-data.svc.cluster.local:5432` (svc, NOT the public proxy). The pg_hba proxy-IP reject lines are now redundant belt-and-suspenders (left in place, harmless). |
+| 2026-06-06 | Claude (operator-authorized, "no customers, low blast radius") | **RESIDUAL CLOSED — role-gate persisted to a committed manifest + alerted.** The `PG_PROXY_DENIED_ROLES` env previously lived ONLY on the live `kubectl patch` (a manual Deployment re-create would have silently dropped it → reopened the admin vector). (1) Captured the LIVE spec faithfully (`kubectl get deploy/svc instant-pg-proxy -n instant -o yaml`), stripped live-only noise, committed `k8s/deployment.yaml` + `k8s/service.yaml` + `k8s/README.md` to `InstaNode-dev/instant-pg-proxy` (default branch master) as the source of truth (PR, squash auto-merge). (2) Added two log-based NR alerts (operator-apply) — `pg-proxy-role-gate-disabled.json` (P0; fires on `pgproxy.role_gate denied_role_count==0`) + `pg-proxy-down.json` (P1; fires on 10m proxy log silence) — plus an admin-defense dashboard page + METRICS-CATALOG row (infra PR, squash auto-merge). The proxy exposes no `/metrics`, so the log signal is the lowest-effort reliable alarm; a `pgproxy_role_gate_denied_roles` gauge + synthetic-reject prober leg are the documented durable upgrade. | **SUCCESS — manifest is a verified no-op vs live; live behavior unchanged.** `kubectl diff -f k8s/deployment.yaml -n instant` → empty output, exit 0 (tooling sanity-checked: a deliberate `replicas: 2→3` edit DID surface drift, so the empty diff is genuine). `kubectl diff -f k8s/service.yaml` → also empty/exit 0. Live state at capture: image `v0.2.0`, `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`, pods Running 2/2, `pgproxy.role_gate denied_role_count:4` in logs, and live `pgproxy.user_denied_public` events observed for `instanode_admin`/`instant_cust`/`postgres` (gate actively rejecting). `newrelic-logging` Fluent Bit DaemonSet confirmed running on all nodes (proxy stdout reaches NR `Log`). NO `kubectl apply` performed (not needed — manifest already matches live); operator may apply anytime safely. The infra alerts are operator-apply. |
 
 **Manifest fixes made before apply (live pre-apply verification):**
 1. **`instanode_admin` was missing.** Prod has TWO superusers — `instanode_admin` (api/worker `CUSTOMER_DATABASE_URL`, the CONFIRMED truehomie vector) and `instant_cust` (provisioner `POSTGRES_CUSTOMERS_URL`). The original PR rejected only `instant_cust`; `instanode_admin` would have matched the catch-all customer allow → vector still open. Both now rejected.
 2. **pg-proxy SNAT defeats source-CIDR.** instant-pg-proxy (in-cluster, no hostNetwork) re-originates TCP, so external admin arrives SNAT'd to a proxy pod IP inside `10.0.0.0/8` — a plain `10.0.0.0/8 allow` matches it. Added proxy-pod-IP `reject` lines (`10.109.4.113`, `10.109.0.101`) ordered BEFORE the in-cluster allow. **Verified in the reject error message** (`rejects connection for host "10.109.0.101"`). ⚠️ Churn dependency, see §3a.
 
 **Operator follow-ups created by this apply:**
 - ~~**Ship the durable pg-proxy role-gate**~~ **DONE 2026-06-06.** `PG_PROXY_DENIED_ROLES` shipped (repo `InstaNode-dev/instant-pg-proxy` created + PR #1, merge `5a86c93`), image `v0.2.0` built+pushed, deployed to `deploy/instant-pg-proxy` with `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`. Live-verified the closure is now pod-IP-independent (see §3a + Drill Log row 2). The closure no longer depends on the churning proxy-pod-IP reject lines.
-- ~~**On any `instant-pg-proxy` reschedule:** refresh the proxy-IP reject lines~~ **no longer required for the security boundary** (the role-gate is now the durable boundary). The pg_hba IP reject lines are redundant belt-and-suspenders; leave them. Still recommended: add a proxy-pod-restart alert for visibility, and persist `PG_PROXY_DENIED_ROLES` into a committed proxy Deployment manifest (currently the env lives only on the live `kubectl patch` — a manual re-create of the Deployment would drop it).
+- ~~**On any `instant-pg-proxy` reschedule:** refresh the proxy-IP reject lines~~ **no longer required for the security boundary** (the role-gate is now the durable boundary). The pg_hba IP reject lines are redundant belt-and-suspenders; leave them. ~~Still recommended: add a proxy-pod-restart alert for visibility, and persist `PG_PROXY_DENIED_ROLES` into a committed proxy Deployment manifest~~ **✅ DONE 2026-06-06** (see the "RESIDUAL CLOSED" block in §3a and Drill Log row 3): the proxy Deployment+Service are committed to `InstaNode-dev/instant-pg-proxy` `k8s/` (verified no-op vs live via `kubectl diff`), and two log-based NR alerts (`pg-proxy-role-gate-disabled.json` + `pg-proxy-down.json`) + the admin-defense "pg-proxy public-path gate" dashboard page watch the gate. The proxy has no `/metrics`, so a `pgproxy_role_gate_denied_roles` gauge + synthetic-reject prober leg are the proper durable upgrade (follow-up).
 - **`k8s/data/postgres-customers.yaml` updated** to carry the mount/args/Recreate-strategy so a future repo apply does not silently revert the lockdown (shipped in the same follow-up PR).
 - The repo `apply.yml` workflow now includes `postgres-customers-lockdown.yaml` (safe — ConfigMap) but ALSO `networkpolicy.yaml`; running that workflow WOULD create the unenforced-today NetPol and default-deny the proxy path. Add it to the apply EXCLUDE list or add the pg-proxy ingress rule before anyone runs the workflow.
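
The runbook's proposed synthetic-reject prober leg (open a raw StartupMessage as `instanode_admin`, assert FATAL `28000`) could look roughly like the Python sketch below. It is not the worker's implementation: the function names are hypothetical, the probe assumes a plaintext connection with no SSLRequest negotiation, and it reads the reply with a single `recv`, which a real prober would replace with a length-aware read loop. The message layouts follow the PostgreSQL v3 wire protocol.

```python
import socket
import struct

PROTOCOL_V3 = 196608  # protocol version 3.0, i.e. (3 << 16) | 0


def build_startup_message(user: str, database: str = "postgres") -> bytes:
    """Encode a PostgreSQL v3 StartupMessage for the given user/database."""
    params = b""
    for key, value in (("user", user), ("database", database)):
        params += key.encode() + b"\x00" + value.encode() + b"\x00"
    body = struct.pack("!I", PROTOCOL_V3) + params + b"\x00"
    # The length prefix counts its own 4 bytes plus the body.
    return struct.pack("!I", len(body) + 4) + body


def parse_error_response(packet: bytes) -> dict:
    """Decode an ErrorResponse ('E') packet into {field_code: text}."""
    if packet[:1] != b"E":
        raise ValueError("not an ErrorResponse packet")
    (length,) = struct.unpack("!I", packet[1:5])
    payload = packet[5 : 1 + length]  # length excludes the type byte
    fields = {}
    for chunk in payload.split(b"\x00"):
        if chunk:
            fields[chr(chunk[0])] = chunk[1:].decode()
    return fields


def gate_rejected(packet: bytes) -> bool:
    """True when the reply is a FATAL error carrying SQLSTATE 28000."""
    fields = parse_error_response(packet)
    return fields.get("S") == "FATAL" and fields.get("C") == "28000"


def probe(host: str, port: int = 5432, user: str = "instanode_admin") -> bool:
    """Hypothetical prober leg: send the startup packet, check the reply.

    Assumes plaintext and that the whole reply arrives in one recv().
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(build_startup_message(user))
        return gate_rejected(sock.recv(4096))
```

A prober built this way would treat `probe("pg.instanode.dev")` returning `False` (or raising because the reply is not an ErrorResponse) as the alarm condition, since either means a privileged role got past the proxy gate or the path is not answering as expected.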

newrelic/alerts/pg-proxy-down.json

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+{
+  "name": "instant-pg-proxy — emitted zero logs in 10m (proxy down; customer-DB public path)",
+  "type": "NRQL",
+  "description": "P1. The instant-pg-proxy pods emitted ZERO log records for 10 minutes — a liveness check for the Postgres-aware TCP proxy that fronts postgres-customers on the public pg.instanode.dev:5432 path. The proxy logs continuously under normal operation (each pod logs `pgproxy.role_gate` + `pgproxy.starting` on boot; every public privileged-role attempt logs `pgproxy.user_denied_public`; connections log at debug), so a 10-minute silence on the whole Deployment = crashed/evicted/scaled-to-zero proxy. Customer connections to pg.instanode.dev then fail (fail-safe), and an operator must not 'fix' it by re-pointing ingress-nginx tcp-services 5432 straight at postgres-customers — that would bypass the role-gate and REOPEN the 2026-06-03 admin DROP vector. Restore the proxy from source of truth instead: `kubectl apply -f k8s/` (repo InstaNode-dev/instant-pg-proxy).\n\nComplements pg-proxy-role-gate-disabled.json (which catches the gate being DISABLED while the proxy is up). Together they bound the boundary: gate-off (exposure) and proxy-off (path broken). The proxy has no /metrics endpoint, so this is a log-liveness check, not a metric scrape. fillValue STATIC 0 ensures a fully silent stream is treated as zero, not no-data.\n\nSource: stdout of instant-pg-proxy pods (k8s_namespace_name='instant', k8s_label_app='instant-pg-proxy'), via the newrelic-logging Fluent Bit DaemonSet. Runbook: infra/POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §3a + §9.\n\nWhen this fires:\n 1. `kubectl get pods -n instant -l app=instant-pg-proxy -o wide` — are the pods Running?\n 2. `kubectl describe pod -n instant -l app=instant-pg-proxy` + `kubectl logs --previous` — crash cause (OOM, image pull, Redis URL).\n 3. `kubectl rollout status deploy/instant-pg-proxy -n instant`.\n 4. If the Deployment is gone, re-create from source of truth: `kubectl apply -f k8s/` (InstaNode-dev/instant-pg-proxy) — this restores the role-gate env too.\n 5. Confirm `ingress-nginx/ingress-nginx-tcp` 5432 still maps to instant/instant-pg-proxy:5432 (NOT instant-data/postgres-customers).",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE k8s_namespace_name = 'instant' AND k8s_label_app = 'instant-pg-proxy'"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "BELOW_OR_EQUALS",
+      "threshold": 0,
+      "thresholdDuration": 600,
+      "thresholdOccurrences": "ALL"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 60,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 120,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 1200,
+    "openViolationOnExpiration": true,
+    "closeViolationsOnExpiration": false
+  },
+  "violationTimeLimitSeconds": 86400
+}
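
To make the interaction between `fillValue`, `aggregationWindow` and `thresholdDuration` concrete, here is a deliberately simplified Python model of the condition above. It is not New Relic's evaluator; the function names are invented for illustration, and it ignores `aggregationDelay` and the expiration settings. It only shows why a static fill of 0 lets ten consecutive empty 60-second windows satisfy `BELOW_OR_EQUALS 0` for `ALL` of the 600 seconds.

```python
def fill_gaps(counts: dict, start: int, end: int,
              window: int = 60, fill_value: int = 0) -> list:
    """Return one count per aggregation window in [start, end).

    `counts` maps a window start time (seconds) to the number of log
    records seen in that window. Windows with no data get `fill_value`,
    mimicking a STATIC gap fill.
    """
    return [counts.get(t, fill_value) for t in range(start, end, window)]


def silent_violation(series: list, threshold_duration: int = 600,
                     window: int = 60, threshold: int = 0) -> bool:
    """True when every window in the trailing duration is <= threshold."""
    needed = threshold_duration // window
    if len(series) < needed:
        return False  # not enough history to cover the full duration
    return all(value <= threshold for value in series[-needed:])


# Proxy logged during the first two minutes, then went quiet for ten.
quiet = fill_gaps({0: 5, 60: 3}, start=0, end=720)
# Same period, but a single record arrived in the final window.
alive = fill_gaps({0: 5, 660: 1}, start=0, end=720)
```

In this model `silent_violation(quiet)` is true and `silent_violation(alive)` is false: one log record anywhere in the trailing ten windows is enough to keep the condition from opening.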
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+{
+  "name": "instant-pg-proxy — role-gate DISABLED or proxy down (customer-DB admin exposure)",
+  "type": "NRQL",
+  "description": "P0 SECURITY. The instant-pg-proxy role-gate (PG_PROXY_DENIED_ROLES) is the durable boundary that bars privileged/superuser roles (instanode_admin, instant_cust, postgres, doadmin) from authenticating over the public pg.instanode.dev:5432 path — it closes the 2026-06-03 truehomie-db DROP DATABASE vector. If the gate is disabled (env dropped on a Deployment re-create) OR the proxy goes down, the customer-DB admin exposure reopens.\n\nThis alert has TWO terms:\n (A) CRITICAL — gate explicitly DISABLED: the proxy logs `pgproxy.role_gate` with `\"denied_role_count\":0` at startup. A non-zero count = gate ON; 0 = inert passthrough (no roles barred). This fires the moment a pod starts with PG_PROXY_DENIED_ROLES empty/missing (e.g. a Deployment re-create that dropped the env). Text-matched on the JSON log so it is robust regardless of whether the numeric field is lifted as an NR attribute.\n (B) CRITICAL — proxy SILENT (down): zero `pgproxy.*` log records from the instant-pg-proxy pods in 10m. The proxy logs continuously (every connection logs at debug, every reject logs `pgproxy.user_denied_public` WARN, and each pod logs `pgproxy.role_gate` + `pgproxy.starting` on boot), so a 10-minute silence = crashed/evicted/scaled-to-zero proxy and the public path is either broken (fail-safe) or being served by something else.\n\nINTERIM SIGNAL — the proxy exposes NO /metrics endpoint today (it is a thin TCP proxy with slog JSON to stdout only). This log-based alert is the lowest-effort reliable signal and requires zero code change. PROPER FOLLOW-UP (durable upgrade): add a `pgproxy_role_gate_denied_roles` gauge + an HTTP /metrics listener to the proxy (repo InstaNode-dev/instant-pg-proxy), scrape it from Prometheus, and add a metric-based rule (gate gauge == 0) PLUS a synthetic-reject prober leg in the worker (open a raw StartupMessage to pg.instanode.dev as instanode_admin and assert a FATAL 28000 'role is not permitted over the public endpoint'). Until then, this log alert is the alarm.\n\nSource: stdout of instant-pg-proxy pods (k8s_namespace_name='instant', k8s_label_app='instant-pg-proxy'), shipped to NR via the newrelic-logging Fluent Bit DaemonSet. Manifest source of truth: InstaNode-dev/instant-pg-proxy k8s/deployment.yaml. Runbook: infra/POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §3a + §9.\n\nWhen this fires:\n 1. `kubectl get deploy/instant-pg-proxy -n instant -o jsonpath='{.spec.template.spec.containers[0].env}'` — confirm PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin is present.\n 2. If missing: `kubectl apply -f k8s/` from InstaNode-dev/instant-pg-proxy (restores the gate from source of truth).\n 3. `kubectl logs -n instant -l app=instant-pg-proxy | grep role_gate` — verify denied_role_count is back to 4.\n 4. Externally verify the boundary: `psql 'postgresql://instanode_admin:<pw>@pg.instanode.dev:5432/postgres'` must be REJECTED with FATAL 28000 (not reach scram).",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE k8s_namespace_name = 'instant' AND k8s_label_app = 'instant-pg-proxy' AND message LIKE '%pgproxy.role_gate%' AND message LIKE '%\"denied_role_count\":0%'"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "ABOVE",
+      "threshold": 0,
+      "thresholdDuration": 60,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 60,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 120,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 600,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
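
The query above relies on two substring matches against the raw log text, so it fires only on the startup line with a zero count; proxy silence is covered by `pg-proxy-down.json`. The Python sketch below mirrors that text match next to a structured parse so the two can be compared. The sample log shape (a `msg` key and a top-level `denied_role_count` field) is an assumption about what the proxy's slog JSON looks like, and the function names are hypothetical. The substring test singles out a count of 0 because JSON integers carry no leading zeros, but it depends on the logger emitting no whitespace after the colon.

```python
import json
from typing import Optional

NEEDLE = '"denied_role_count":0'


def text_match(line: str) -> bool:
    """Mirror of the alert's two LIKE clauses: plain substring tests."""
    return "pgproxy.role_gate" in line and NEEDLE in line


def gate_enabled(line: str) -> Optional[bool]:
    """Structured check of one log line.

    Returns True/False for a role_gate line (gate on/off) and None for
    anything else, including lines that are not valid JSON objects.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or record.get("msg") != "pgproxy.role_gate":
        return None
    count = record.get("denied_role_count")
    return isinstance(count, int) and count > 0


# Assumed log shapes, for illustration only.
gate_on = '{"level":"INFO","msg":"pgproxy.role_gate","denied_role_count":4}'
gate_off = '{"level":"INFO","msg":"pgproxy.role_gate","denied_role_count":0}'
spaced = '{"msg": "pgproxy.role_gate", "denied_role_count": 0}'
```

With these samples the text match catches `gate_off` and ignores `gate_on`, but it misses `spaced` even though the structured parse reports the gate as disabled. That gap is one reason the description calls the log alert an interim signal ahead of a real gauge.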
