feat(alerts): pg-proxy role-gate disabled / proxy down (truehomie residual) #65
Merged
Adds the alert for the last residual of the 2026-06-03 truehomie-db DROP
durable fix: the instant-pg-proxy role-gate (PG_PROXY_DENIED_ROLES) is now
committed to a manifest (InstaNode-dev/instant-pg-proxy k8s/), but nothing
alerted if the gate were ever disabled or the proxy went down.
The proxy is a thin TCP proxy with slog-to-stdout only — it exposes NO
/metrics endpoint, so a Prometheus-metric rule is not possible today. The
lowest-effort reliable signal is the proxy's startup log line
`pgproxy.role_gate{denied_role_count}` (count>0 = gate ON, 0 = exposure),
shipped to NR via the newrelic-logging Fluent Bit DaemonSet (verified
running on all nodes). Two log-based NR alerts (operator-apply):
- pg-proxy-role-gate-disabled.json (P0) — fires on denied_role_count==0
- pg-proxy-down.json (P1) — fires on 10m proxy log silence
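The two log conditions above can be sketched in a few lines. This is only an illustration of the logic, not the NR alert definitions themselves: it assumes the proxy emits one JSON object per line with a `msg` field and a numeric `denied_role_count` field, which the PR text suggests but does not spell out. The function names are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumption: slog writes JSON lines shaped like
# {"msg": "pgproxy.role_gate", "denied_role_count": 4, ...}
ROLE_GATE_MSG = "pgproxy.role_gate"
SILENCE_WINDOW = timedelta(minutes=10)


def gate_disabled(line: str) -> bool:
    """True when a pgproxy.role_gate line reports denied_role_count == 0."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False  # non-JSON lines are ignored, not treated as exposure
    if not isinstance(record, dict) or record.get("msg") != ROLE_GATE_MSG:
        return False
    return record.get("denied_role_count") == 0


def proxy_silent(last_log_at: datetime, now: datetime) -> bool:
    """True when no proxy log has arrived within the 10-minute window."""
    return now - last_log_at >= SILENCE_WINDOW


if __name__ == "__main__":
    gate_on = '{"msg": "pgproxy.role_gate", "denied_role_count": 4}'
    gate_off = '{"msg": "pgproxy.role_gate", "denied_role_count": 0}'
    print(gate_disabled(gate_on), gate_disabled(gate_off))  # False True

    now = datetime(2026, 6, 4, 12, 0, tzinfo=timezone.utc)
    print(proxy_silent(now - timedelta(minutes=11), now))  # True
```

Note that the first check only sees the gate state at pod boot, since the line is a startup log; that limitation is part of why the gauge-based upgrade is documented as the durable path.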
Plus an admin-defense dashboard page ("pg-proxy public-path gate", 4 tiles)
and a METRICS-CATALOG row (rule 25). Runbook §3a + §9 updated: residual
closed, manifest no-op-verified vs live, alert documented. Proper durable
upgrade documented: add a pgproxy_role_gate_denied_roles gauge + /metrics +
a worker synthetic-reject prober leg.
Operator-apply only (no auto-apply on infra). No live behavior changed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Closes the last residual of the 2026-06-03 truehomie-db DROP durable fix. The `instant-pg-proxy` role-gate (`PG_PROXY_DENIED_ROLES`) bars privileged Postgres roles from the public `pg.instanode.dev:5432` path. The env is now committed to a manifest (instant-pg-proxy#2, merged), but nothing alerted if the gate were ever disabled or the proxy went down.
Why log-based (interim)
The proxy is a thin TCP proxy with slog-to-stdout only. It exposes no `/metrics` endpoint, so a Prometheus-metric rule is impossible today. The lowest-effort reliable signal is the proxy's own log: each pod logs `pgproxy.role_gate{denied_role_count}` on boot (count>0 = gate ON, 0 = exposure) and `pgproxy.user_denied_public` on every rejected privileged role. Verified live: `denied_role_count:4` + active `user_denied_public` events for `instanode_admin`/`instant_cust`/`postgres`. The `newrelic-logging` Fluent Bit DaemonSet (confirmed running on all nodes) ships proxy stdout to NR `Log`.
Files
- `newrelic/alerts/pg-proxy-role-gate-disabled.json`: P0/CRITICAL, fires on a `pgproxy.role_gate` line with `"denied_role_count":0` (gate disabled).
- `newrelic/alerts/pg-proxy-down.json`: P1/CRITICAL, fires on 10m of zero proxy logs (proxy down / path broken).
- `newrelic/dashboards/admin-defense.json`: new "pg-proxy public-path gate" page (4 tiles).
- `observability/METRICS-CATALOG.md`: catalog row (rule 25).
- `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md`: §3a + §9 + Drill Log row: residual closed, manifest no-op-verified vs live, alert documented.
Durable upgrade (documented follow-up)
Add a `pgproxy_role_gate_denied_roles` gauge + `/metrics` listener to the proxy, scrape it, alert on `gauge == 0`, AND add a worker synthetic-reject prober leg (raw StartupMessage to `pg.instanode.dev` as `instanode_admin`, assert `FATAL` `28000`). Until then the log alerts are the alarm.
Scope
Operator-apply only (infra has no auto-apply). No live behavior changed — JSON/MD only.
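For the synthetic-reject prober leg described under Durable upgrade, a minimal sketch of the wire-level check might look like the following. It is not part of this PR and not the worker's actual code: the function names, the `postgres` database name, and the plaintext (no TLS negotiation) connection are all assumptions for illustration. The message framing follows the PostgreSQL v3 frontend/backend protocol (StartupMessage and ErrorResponse).

```python
import socket
import struct

PROTOCOL_V3 = 196608  # protocol version 3.0, i.e. 3 << 16


def build_startup_message(user: str, database: str) -> bytes:
    """Frame a v3 StartupMessage: int32 length, int32 version, key/value pairs."""
    params = b""
    for key, value in (("user", user), ("database", database)):
        params += key.encode() + b"\x00" + value.encode() + b"\x00"
    body = struct.pack("!I", PROTOCOL_V3) + params + b"\x00"
    return struct.pack("!I", len(body) + 4) + body  # length counts itself


def parse_error_response(message: bytes) -> dict[str, str]:
    """Split an ErrorResponse ('E') into its one-letter field codes."""
    if message[:1] != b"E":
        raise ValueError("not an ErrorResponse")
    (length,) = struct.unpack("!I", message[1:5])
    payload = message[5 : 1 + length]
    fields = {}
    for chunk in payload.split(b"\x00"):
        if chunk:
            fields[chr(chunk[0])] = chunk[1:].decode("utf-8", "replace")
    return fields


def is_expected_reject(fields: dict[str, str]) -> bool:
    """The gate is working if the proxy answers FATAL with SQLSTATE 28000."""
    severity = fields.get("V", fields.get("S"))
    return severity == "FATAL" and fields.get("C") == "28000"


def read_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly count bytes or raise if the peer closes early."""
    data = b""
    while len(data) < count:
        part = sock.recv(count - len(data))
        if not part:
            raise ConnectionError("connection closed before full message")
        data += part
    return data


def probe(host: str, port: int, user: str, database: str = "postgres") -> bool:
    """Hypothetical prober: True when the privileged role is rejected as expected."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(build_startup_message(user, database))
        header = read_exact(sock, 5)
        (length,) = struct.unpack("!I", header[1:5])
        message = header + read_exact(sock, length - 4)
    if message[:1] != b"E":
        return False  # anything other than an error means the role got through
    return is_expected_reject(parse_error_response(message))
```

A worker would call something like `probe("pg.instanode.dev", 5432, "instanode_admin")` and page when it returns `False`. Treating any non-error first reply (for example an authentication request) as a failure is a deliberate choice here: the probe exists to prove the rejection, so anything else counts as exposure.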
🤖 Generated with Claude Code