InstaNode-dev
diff --git a/POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md b/POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md
Lines changed: 31 additions & 5 deletions
diff --git a/newrelic/alerts/pg-proxy-down.json b/newrelic/alerts/pg-proxy-down.json
Lines changed: 31 additions & 0 deletions
diff --git a/newrelic/alerts/pg-proxy-role-gate-disabled.json b/newrelic/alerts/pg-proxy-role-gate-disabled.json
Lines changed: 31 additions & 0 deletions
@@ -162,10 +162,35 @@ Live-verified: after the rollout the proxy runs on NEW IPs (`10.109.6.132`,
 still rejected (at the proxy, not pg_hba). The pg_hba proxy-IP reject lines are
 therefore now **redundant belt-and-suspenders** — left in place (harmless), no
 longer the sole boundary, no longer require churn-refresh on reschedule.
-**Remaining operator follow-up:** add an alert on `instant-pg-proxy` pod restarts
-(defense-in-depth visibility), and ensure any future redeploy preserves the
-`PG_PROXY_DENIED_ROLES` env (it lives only on the live Deployment patch — fold it
-into a committed manifest when one is created for the proxy).
+**✅ RESIDUAL CLOSED (2026-06-06).** Both halves of the remaining follow-up are done:
+1. **Manifest committed (durable).** `PG_PROXY_DENIED_ROLES` no longer lives only on
+   a live `kubectl patch`. The proxy Deployment + Service are now committed to
+   `InstaNode-dev/instant-pg-proxy` under `k8s/` (`deployment.yaml`, `service.yaml`,
+   `README.md`) as the source of truth. The manifest was captured faithfully from the
+   live spec and is a **verified no-op** — `kubectl diff -f k8s/deployment.yaml -n instant`
+   returned empty (exit 0), confirming it matches the running proxy (image `v0.2.0` +
+   the role-gate env) and re-applying it changes nothing. A future `kubectl delete` +
+   `kubectl apply -f k8s/` therefore restores the gate instead of silently dropping it.
+   Re-apply is safe at any time; no apply is needed today (already matches live).
+2. **Alerts shipped (operator-apply).** Two log-based New Relic (NR) alerts watch the gate:
+   - `infra/newrelic/alerts/pg-proxy-role-gate-disabled.json` (P0/CRITICAL) — fires when
+     a proxy pod logs `pgproxy.role_gate` with `"denied_role_count":0` (gate disabled,
+     e.g. a re-create that dropped the env).
+   - `infra/newrelic/alerts/pg-proxy-down.json` (P1/CRITICAL) — fires when the proxy
+     emits zero logs for 10m (proxy down / public path broken).
+   Both query the proxy's stdout JSON in NR (`k8s_namespace_name='instant' AND
+   k8s_label_app='instant-pg-proxy'`), shipped via the `newrelic-logging` Fluent Bit
+   DaemonSet (verified running, all nodes). Dashboard: `admin-defense.json` →
+   "pg-proxy public-path gate" page (4 tiles). Catalog row in
+   `infra/observability/METRICS-CATALOG.md`.
+   **Why log-based (interim):** the proxy is a thin TCP proxy with slog-to-stdout only —
+   it exposes **no `/metrics` endpoint**, so a Prometheus-metric rule is not possible
+   today. The log signal (`pgproxy.role_gate denied_role_count`) is the lowest-effort
+   reliable alarm and needs zero code change. **Proper durable upgrade (follow-up):**
+   add a `pgproxy_role_gate_denied_roles` gauge + an HTTP `/metrics` listener to the
+   proxy, scrape it, alert on `gauge == 0`, AND add a worker synthetic-reject prober leg
+   (open a raw StartupMessage to `pg.instanode.dev` as `instanode_admin`, assert FATAL
+   `28000`). Until then the log alerts are the alarm.
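
The proposed gauge upgrade could look roughly like the sketch below. It is illustrative only: the function names and the comma-splitting rule are assumptions, and the diff does not show how the proxy itself parses `PG_PROXY_DENIED_ROLES`. Only the env var name, the gauge name, and the `gauge == 0` alert condition come from the runbook text.

```python
def parse_denied_roles(env_value: str) -> list[str]:
    """Split a comma-separated role list, dropping blanks (assumed parsing rule)."""
    return [role.strip() for role in env_value.split(",") if role.strip()]


def render_role_gate_metric(env_value: str) -> str:
    """Render the proposed gauge in Prometheus text exposition format."""
    count = len(parse_denied_roles(env_value))
    return (
        "# HELP pgproxy_role_gate_denied_roles Roles barred on the public path.\n"
        "# TYPE pgproxy_role_gate_denied_roles gauge\n"
        f"pgproxy_role_gate_denied_roles {count}\n"
    )
```

With the live value (`instanode_admin,instant_cust,postgres,doadmin`) the sample line reports 4; with an empty or missing env it reports 0, which is the state a metric-based rule would alert on.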
 
 - If the proxy-pod-IP reject lines in the ConfigMap do NOT match the live proxy
   IPs at apply time → FIX them first, else the lockdown is a no-op for the live
@@ -326,13 +351,14 @@ the chokepoint ensures every *sanctioned* drop is recorded; the CI guard ensures
 |---|---|---|---|
 | 2026-06-06 | Claude (operator-authorized apply, "no customers, low blast radius") | **APPLIED to do-nyc3-instant-prod.** Merged PR #61 (squash, merge commit `78cb6677`) after fixing the manifest for two live findings (see below). Applied ConfigMap `postgres-customers-hba`; patched `deploy/postgres-customers` to mount it + `-c hba_file=/etc/postgresql/pg_hba.conf -c password_encryption=scram-sha-256`; changed strategy `RollingUpdate→Recreate` (RWO PVC Multi-Attach). Did NOT apply `networkpolicy.yaml` (verified NOT enforced in prod; applying as-is would default-deny the proxy path). | **SUCCESS.** External admin REJECTED at pg_hba (both `instanode_admin` + `instant_cust`, error names the SNAT'd proxy pod IP) — baseline beforehand reached scram (vector was OPEN). In-cluster admin preserved: provisioner `instant_cust` CREATE/DROP smoke OK, api/worker `instanode_admin` connect + `pg_database_size` OK, customer `usr_*` path still reaches scram. No rollback. |
 | 2026-06-06 | Claude (operator-authorized, "no customers, low blast radius") | **DURABLE FIX SHIPPED + DEPLOYED — the churn-proof pg-proxy role-gate.** Created the `InstaNode-dev/instant-pg-proxy` repo (did not exist before — the proxy source was a loose, un-versioned local dir; live image was `ghcr.io/mastermanas805/instant-pg-proxy:v0.1.0` applied by hand, no committed manifest). Merged PR #1 (squash, merge commit `5a86c93`): the proxy parses the StartupMessage `user` and, if in `PG_PROXY_DENIED_ROLES`, returns a FATAL `28000` ErrorResponse (`role is not permitted over the public endpoint`) BEFORE resolving/dialing — default empty = inert. Built+pushed `ghcr.io/mastermanas805/instant-pg-proxy:v0.2.0`; `kubectl patch deploy/instant-pg-proxy -n instant` → image v0.2.0 + `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`. | **SUCCESS — durable closure verified, pod-IP-independent.** Rollout landed new pods at `10.109.6.132`/`10.109.4.98` (NOT the `10.109.4.113`/`10.109.0.101` the pg_hba reject lines name — those lines now point at DEAD pods, yet admin is STILL rejected, proving independence). External `instanode_admin`/`instant_cust`/`postgres` over `pg.instanode.dev` → **proxy 28000** (`role is not permitted over the public endpoint`), NOT a pg_hba reject naming a pod IP. Proxy logged `user_denied_public` for all three. Customer `usr_*` → FORWARDED (reached postgres scram → `password authentication failed`, not 28000). In-cluster admin via ClusterIP svc UNAFFECTED: `instant_cust` CREATE+DROP OK (`INCLUSTER_PROVISION_PATH_OK`), `pg_database_size` quota read OK. Provisioner DSN confirmed → `postgres-customers.instant-data.svc.cluster.local:5432` (svc, NOT the public proxy). The pg_hba proxy-IP reject lines are now redundant belt-and-suspenders (left in place, harmless). |
+| 2026-06-06 | Claude (operator-authorized, "no customers, low blast radius") | **RESIDUAL CLOSED: role-gate persisted to a committed manifest + alerted.** The `PG_PROXY_DENIED_ROLES` env previously lived ONLY on the live `kubectl patch` (a manual Deployment re-create would have silently dropped it → reopened the admin vector). (1) Captured the LIVE spec faithfully (`kubectl get deploy/svc instant-pg-proxy -n instant -o yaml`), stripped live-only noise, committed `k8s/deployment.yaml` + `k8s/service.yaml` + `k8s/README.md` to `InstaNode-dev/instant-pg-proxy` (default branch master) as the source of truth (PR, squash auto-merge). (2) Added two log-based NR alerts (operator-apply): `pg-proxy-role-gate-disabled.json` (P0; fires on `pgproxy.role_gate denied_role_count==0`) + `pg-proxy-down.json` (P1; fires on 10m proxy log silence), plus an admin-defense dashboard page + METRICS-CATALOG row (infra PR, squash auto-merge). The proxy exposes no `/metrics`, so the log signal is the lowest-effort reliable alarm; a `pgproxy_role_gate_denied_roles` gauge + synthetic-reject prober leg are the documented durable upgrade. | **SUCCESS: manifest is a verified no-op vs live; live behavior unchanged.** `kubectl diff -f k8s/deployment.yaml -n instant` → empty output, exit 0 (tooling sanity-checked: a deliberate `replicas: 2→3` edit DID surface drift, so the empty diff is genuine). `kubectl diff -f k8s/service.yaml` → also empty/exit 0. Live state at capture: image `v0.2.0`, `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`, pods Running 2/2, `pgproxy.role_gate denied_role_count:4` in logs, and live `pgproxy.user_denied_public` events observed for `instanode_admin`/`instant_cust`/`postgres` (gate actively rejecting). `newrelic-logging` Fluent Bit DaemonSet confirmed running on all nodes (proxy stdout reaches NR `Log`). NO `kubectl apply` performed (not needed: manifest already matches live); operator may apply anytime safely. The infra alerts are operator-apply. |
 
 **Manifest fixes made before apply (live pre-apply verification):**
 1. **`instanode_admin` was missing.** Prod has TWO superusers — `instanode_admin` (api/worker `CUSTOMER_DATABASE_URL`, the CONFIRMED truehomie vector) and `instant_cust` (provisioner `POSTGRES_CUSTOMERS_URL`). The original PR rejected only `instant_cust`; `instanode_admin` would have matched the catch-all customer allow → vector still open. Both now rejected.
 2. **pg-proxy SNAT defeats source-CIDR.** instant-pg-proxy (in-cluster, no hostNetwork) re-originates TCP, so external admin arrives SNAT'd to a proxy pod IP inside `10.0.0.0/8` — a plain `10.0.0.0/8 allow` matches it. Added proxy-pod-IP `reject` lines (`10.109.4.113`, `10.109.0.101`) ordered BEFORE the in-cluster allow. **Verified in the reject error message** (`rejects connection for host "10.109.0.101"`). ⚠️ Churn dependency, see §3a.
 
 **Operator follow-ups created by this apply:**
 - ~~**Ship the durable pg-proxy role-gate**~~ ✅ **DONE 2026-06-06.** `PG_PROXY_DENIED_ROLES` shipped (repo `InstaNode-dev/instant-pg-proxy` created + PR #1, merge `5a86c93`), image `v0.2.0` built+pushed, deployed to `deploy/instant-pg-proxy` with `PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin`. Live-verified the closure is now pod-IP-independent (see §3a + Drill Log row 2). The closure no longer depends on the churning proxy-pod-IP reject lines.
-- ~~**On any `instant-pg-proxy` reschedule:** refresh the proxy-IP reject lines~~ — **no longer required for the security boundary** (the role-gate is now the durable boundary). The pg_hba IP reject lines are redundant belt-and-suspenders; leave them. Still recommended: add a proxy-pod-restart alert for visibility, and persist `PG_PROXY_DENIED_ROLES` into a committed proxy Deployment manifest (currently the env lives only on the live `kubectl patch` — a manual re-create of the Deployment would drop it).
+- ~~**On any `instant-pg-proxy` reschedule:** refresh the proxy-IP reject lines~~ — **no longer required for the security boundary** (the role-gate is now the durable boundary). The pg_hba IP reject lines are redundant belt-and-suspenders; leave them. ~~Still recommended: add a proxy-pod-restart alert for visibility, and persist `PG_PROXY_DENIED_ROLES` into a committed proxy Deployment manifest~~ — **✅ DONE 2026-06-06** (see the "RESIDUAL CLOSED" block in §3a and Drill Log row 3): the proxy Deployment+Service are committed to `InstaNode-dev/instant-pg-proxy` `k8s/` (verified no-op vs live via `kubectl diff`), and two log-based NR alerts (`pg-proxy-role-gate-disabled.json` + `pg-proxy-down.json`) + the admin-defense "pg-proxy public-path gate" dashboard page watch the gate. The proxy has no `/metrics`, so a `pgproxy_role_gate_denied_roles` gauge + synthetic-reject prober leg are the proper durable upgrade (follow-up).
 - **`k8s/data/postgres-customers.yaml` updated** to carry the mount/args/Recreate-strategy so a future repo apply does not silently revert the lockdown (shipped in the same follow-up PR).
 - The repo `apply.yml` workflow now includes `postgres-customers-lockdown.yaml` (safe — ConfigMap) but ALSO `networkpolicy.yaml`; running that workflow WOULD create the unenforced-today NetPol and default-deny the proxy path. Add it to the apply EXCLUDE list or add the pg-proxy ingress rule before anyone runs the workflow.
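
The synthetic-reject prober leg named in the follow-ups could be sketched as follows. This is a hedged illustration, not the worker's code: every function name is hypothetical, and it assumes the public path accepts a plaintext v3 StartupMessage with no preceding SSLRequest. The message layouts follow the PostgreSQL frontend/backend protocol; the `FATAL` / `28000` expectation is the one stated in the runbook.

```python
import socket
import struct

PROTOCOL_V3 = 196608  # protocol 3.0 encoded as (3 << 16) | 0


def build_startup_message(user: str, database: str = "postgres") -> bytes:
    """Int32 length (self-inclusive), Int32 protocol, then key\\0value\\0 pairs, then \\0."""
    params = b""
    for key, value in (("user", user), ("database", database)):
        params += key.encode() + b"\x00" + value.encode() + b"\x00"
    body = struct.pack("!I", PROTOCOL_V3) + params + b"\x00"
    return struct.pack("!I", len(body) + 4) + body


def parse_error_response(data: bytes):
    """Return ErrorResponse fields keyed by their one-byte code, or None if not an 'E' message."""
    if len(data) < 5 or data[0:1] != b"E":
        return None
    (length,) = struct.unpack("!I", data[1:5])
    fields = {}
    for item in data[5:1 + length].split(b"\x00"):
        if not item:  # the empty chunk marks the terminating zero byte
            break
        fields[chr(item[0])] = item[1:].decode("utf-8", "replace")
    return fields


def is_gate_reject(data: bytes) -> bool:
    """True when the reply is a FATAL ErrorResponse carrying SQLSTATE 28000."""
    fields = parse_error_response(data)
    return bool(fields) and fields.get("S") == "FATAL" and fields.get("C") == "28000"


def probe(host: str, port: int = 5432, user: str = "instanode_admin",
          timeout: float = 5.0) -> bool:
    """Open a raw connection, send the StartupMessage, and check the first reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_startup_message(user))
        return is_gate_reject(conn.recv(4096))  # a single recv is enough for a sketch
```

A prober built this way would treat anything other than a 28000 reject (for example an authentication request, which means the connection was forwarded to Postgres) as a failed assertion.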
@@ -0,0 +1,31 @@
+{
+  "name": "instant-pg-proxy — emitted zero logs in 10m (proxy down; customer-DB public path)",
+  "type": "NRQL",
+  "description": "P1. The instant-pg-proxy pods emitted ZERO log records for 10 minutes. This is a liveness check for the Postgres-aware TCP proxy that fronts postgres-customers on the public pg.instanode.dev:5432 path. The proxy logs continuously under normal operation (each pod logs `pgproxy.role_gate` + `pgproxy.starting` on boot; every public privileged-role attempt logs `pgproxy.user_denied_public`; connections log at debug), so a 10-minute silence on the whole Deployment = crashed/evicted/scaled-to-zero proxy. Customer connections to pg.instanode.dev then fail (fail-safe), and an operator must not 'fix' it by re-pointing ingress-nginx tcp-services 5432 straight at postgres-customers: that would bypass the role-gate and REOPEN the 2026-06-03 admin DROP vector. Restore the proxy from source of truth instead: `kubectl apply -f k8s/` (repo InstaNode-dev/instant-pg-proxy).\n\nComplements pg-proxy-role-gate-disabled.json (which catches the gate being DISABLED while the proxy is up). Together they bound the boundary: gate-off (exposure) and proxy-off (path broken). The proxy has no /metrics endpoint, so this is a log-liveness check, not a metric scrape. fillValue STATIC 0 ensures a fully silent stream is treated as zero, not no-data.\n\nSource: stdout of instant-pg-proxy pods (k8s_namespace_name='instant', k8s_label_app='instant-pg-proxy'), via the newrelic-logging Fluent Bit DaemonSet. Runbook: infra/POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §3a + §9.\n\nWhen this fires:\n  1. `kubectl get pods -n instant -l app=instant-pg-proxy -o wide`: are the pods Running?\n  2. `kubectl describe pod -n instant -l app=instant-pg-proxy` + `kubectl logs --previous`: find the crash cause (OOM, image pull, Redis URL).\n  3. `kubectl rollout status deploy/instant-pg-proxy -n instant`.\n  4. If the Deployment is gone, re-create from source of truth: `kubectl apply -f k8s/` (InstaNode-dev/instant-pg-proxy); this restores the role-gate env too.\n  5. Confirm `ingress-nginx/ingress-nginx-tcp` 5432 still maps to instant/instant-pg-proxy:5432 (NOT instant-data/postgres-customers).",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE k8s_namespace_name = 'instant' AND k8s_label_app = 'instant-pg-proxy'"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "BELOW_OR_EQUALS",
+      "threshold": 0,
+      "thresholdDuration": 600,
+      "thresholdOccurrences": "ALL"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 60,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 120,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 1200,
+    "openViolationOnExpiration": true,
+    "closeViolationsOnExpiration": false
+  },
+  "violationTimeLimitSeconds": 86400
+}
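
To make the `terms` and `signal` settings above concrete, here is a simplified model of how the condition is meant to behave. It is not New Relic's evaluation engine; the function names are hypothetical and the model only mirrors the values in this file (60-second windows, static fill of 0, `BELOW_OR_EQUALS 0` for `ALL` of 600 seconds).

```python
def fill_static(counts: dict[int, int], first: int, last: int,
                fill_value: int = 0) -> list[int]:
    """Build a per-window series, substituting fill_value for windows with no data."""
    return [counts.get(window, fill_value) for window in range(first, last + 1)]


def violates(window_counts: list[int], threshold: int = 0,
             duration_s: int = 600, window_s: int = 60) -> bool:
    """True when every window across the threshold duration is <= threshold."""
    needed = duration_s // window_s  # 10 one-minute windows
    recent = window_counts[-needed:]
    return len(recent) == needed and all(count <= threshold for count in recent)
```

Under this model a single log record anywhere in the last ten windows keeps the condition quiet, which is why the alert depends on the proxy logging at least once every 10 minutes during normal operation.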
@@ -0,0 +1,31 @@
+{
+  "name": "instant-pg-proxy — role-gate DISABLED (customer-DB admin exposure)",
+  "type": "NRQL",
+  "description": "P0 SECURITY. The instant-pg-proxy role-gate (PG_PROXY_DENIED_ROLES) is the durable boundary that bars privileged/superuser roles (instanode_admin, instant_cust, postgres, doadmin) from authenticating over the public pg.instanode.dev:5432 path; it closes the 2026-06-03 truehomie-db DROP DATABASE vector. If the gate is disabled (env dropped on a Deployment re-create), the customer-DB admin exposure reopens.\n\nThis alert has ONE term (CRITICAL), gate explicitly DISABLED: the proxy logs `pgproxy.role_gate` with `\"denied_role_count\":0` at startup. A non-zero count = gate ON; 0 = inert passthrough (no roles barred). This fires the moment a pod starts with PG_PROXY_DENIED_ROLES empty/missing (e.g. a Deployment re-create that dropped the env). Text-matched on the JSON log so it is robust regardless of whether the numeric field is lifted as an NR attribute.\n\nProxy silence (crashed/evicted/scaled-to-zero proxy) is NOT detected by this condition: its query only counts `pgproxy.role_gate` records with a zero count. That case is covered by the companion pg-proxy-down.json (P1), which fires on 10 minutes of zero proxy logs.\n\nINTERIM SIGNAL: the proxy exposes NO /metrics endpoint today (it is a thin TCP proxy with slog JSON to stdout only). This log-based alert is the lowest-effort reliable signal and requires zero code change. PROPER FOLLOW-UP (durable upgrade): add a `pgproxy_role_gate_denied_roles` gauge + an HTTP /metrics listener to the proxy (repo InstaNode-dev/instant-pg-proxy), scrape it from Prometheus, and add a metric-based rule (gate gauge == 0) PLUS a synthetic-reject prober leg in the worker (open a raw StartupMessage to pg.instanode.dev as instanode_admin and assert a FATAL 28000 'role is not permitted over the public endpoint'). Until then, this log alert is the alarm.\n\nSource: stdout of instant-pg-proxy pods (k8s_namespace_name='instant', k8s_label_app='instant-pg-proxy'), shipped to NR via the newrelic-logging Fluent Bit DaemonSet. Manifest source of truth: InstaNode-dev/instant-pg-proxy k8s/deployment.yaml. Runbook: infra/POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §3a + §9.\n\nWhen this fires:\n  1. `kubectl get deploy/instant-pg-proxy -n instant -o jsonpath='{.spec.template.spec.containers[0].env}'`: confirm PG_PROXY_DENIED_ROLES=instanode_admin,instant_cust,postgres,doadmin is present.\n  2. If missing: `kubectl apply -f k8s/` from InstaNode-dev/instant-pg-proxy (restores the gate from source of truth).\n  3. `kubectl logs -n instant -l app=instant-pg-proxy | grep role_gate`: verify denied_role_count is back to 4.\n  4. Externally verify the boundary: `psql 'postgresql://instanode_admin:<pw>@pg.instanode.dev:5432/postgres'` must be REJECTED with FATAL 28000 (not reach scram).",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE k8s_namespace_name = 'instant' AND k8s_label_app = 'instant-pg-proxy' AND message LIKE '%pgproxy.role_gate%' AND message LIKE '%\"denied_role_count\":0%'"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "ABOVE",
+      "threshold": 0,
+      "thresholdDuration": 60,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 60,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 120,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 600,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
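
The text match in the `nrql.query` above can be sanity-checked offline with a sketch like the one below. It models the two `LIKE '%...%'` clauses as plain substring checks, which is a simplification of NRQL matching. The sample records are hypothetical: the diff names the `pgproxy.role_gate` event and the `denied_role_count` field, but the `msg` key and the compact serialization are assumptions about the proxy's log output.

```python
import json

GATE_EVENT = "pgproxy.role_gate"
DISABLED_MARKER = '"denied_role_count":0'


def matches_gate_disabled(message: str) -> bool:
    """Mirror the two LIKE clauses: both substrings must appear in the raw log line."""
    return GATE_EVENT in message and DISABLED_MARKER in message


def compact(record: dict) -> str:
    """Serialize without whitespace after separators (assumed log line shape)."""
    return json.dumps(record, separators=(",", ":"))
```

A compact line with a count of 0 matches and a count of 4 does not. A line serialized with a space after the colon (`"denied_role_count": 0`) also does not match, so the alert silently depends on the exact serialization; that fragility is one reason the description treats the log signal as interim.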