Commit 78cb667

sec(data): DORMANT postgres-customers admin lockdown (truehomie root-cause vector) (#61)
* sec(data): DORMANT postgres-customers admin lockdown (truehomie root-cause vector)

  Closes the OPEN root cause of the truehomie-db DROP incident (2026-06-03): a direct public-internet admin connection to the shared customer Postgres that can DROP DATABASE/ROLE with no audit_log row.

  CONFIRMED via config + SAFE checks (verify-don't-assert, 2026-06-06):
  - pg.instanode.dev -> 152.42.154.144 (same DO LB IP as api/redis/mongo); pg is publicly DNS-routed.
  - `nc -z pg.instanode.dev 5432` succeeds from the public internet (TCP handshake only -- NO auth, NO SQL attempted).
  - postgres-customers pod runs stock pgvector/pgvector:pg16 with NO custom pg_hba/postgresql.conf/POSTGRES_HOST_AUTH_METHOD and no config mount -> image default `host all all all scram-sha-256` (catch-all): the admin role (instant_cust) can authenticate from anywhere that reaches the listener.

  STILL HYPOTHESIS (not asserted): that an external actor actually used this to drop truehomie (we did not attempt auth); the instant-pg-proxy's own role-gate (its config lives in the separate InstaNode-dev/instant-pg-proxy repo).

  Fix (staged DORMANT -- operator-applies in a maintenance window; rule 15: infra has no auto-apply, high blast radius):
  - k8s/data/postgres-customers-lockdown.yaml: custom pg_hba.conf ConfigMap that REJECTS the admin role (instant_cust/postgres) from the public internet while still allowing in-cluster admin (provisioner/migrator/worker via pod CIDR) AND customer usr_* roles from anywhere (customer connect path preserved).
  - k8s/data/networkpolicy.yaml: truehomie context annotations + a DORMANT pg-proxy ingress rule (network-layer second line; admin boundary does not depend on it -- pg_hba role-reject holds regardless of SNAT source).
  - POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md: confirmed-vs-hypothesis, legitimate-consumer map, pre-apply verification (live admin role name, pg-proxy SNAT/NetPol interaction, hunt for public-admin automation), apply, legit-access-still-works verify, external-admin-closed verify (SAFE rejection test), and rollback.

  NOT applied to prod. Requires USER/OPERATOR review + maintenance window.

  Validated locally: yamllint (repo config) + kubeconform -strict (5/5 valid).

  Defense-in-depth: this is the infra half. App half already shipped (provisioner guardedDrop chokepoint + DDL-audit + CI guard, PR #50; drop-metric alert/tile/catalog, infra PR #60). Audit: docs/ci/DATA-INTEGRITY-DROP-PATH-AUDIT.md.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* sec(data): fix admin lockdown for live findings -- instanode_admin + proxy-IP reject

  Live pre-apply verification (do-nyc3-instant-prod, 2026-06-06) found the PR's pg_hba was incomplete and would NOT have closed the truehomie vector:

  1. TWO superusers on the prod customer pod, not one: `instanode_admin` (the role api/worker connect with via CUSTOMER_DATABASE_URL -- the CONFIRMED truehomie vector role) AND `instant_cust` (provisioner's POSTGRES_CUSTOMERS_URL POSTGRES_USER). The PR rejected only `instant_cust`, leaving the actual vector (`instanode_admin`) matching the catch-all customer allow. Both now rejected.
  2. The pg-proxy SNAT problem: instant-pg-proxy is a normal in-cluster pod that re-originates TCP, so an external `-U instanode_admin` arrives SNAT'd to a proxy pod IP inside 10.0.0.0/8 -- a plain `10.0.0.0/8 allow` would match it. Baseline probe confirmed the live vector is OPEN (external admin reaches scram). Added proxy-pod-IP `reject` lines (10.109.4.113, 10.109.0.101) ordered BEFORE the in-cluster allow so the SNAT'd external admin is rejected while legit consumer pods (direct svc DSN, different IPs) still authenticate. Documented the churn dependency + the durable pg-proxy role-gate follow-up.
  3. Preserved live local/loopback/replication `trust` lines (readiness probe + backup pg_dumpall) instead of switching to peer. Added doadmin belt-suspenders.

  Runbook: added §3a (pg-proxy SNAT + proxy-IP churn operator note) and the L1-L5 live-confirmed findings table (supersedes H1-H3).

  NetworkPolicy NOT applied (verified not enforced in prod; applying as-is would default-deny the proxy path).

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent af1f2e2 commit 78cb667

3 files changed

Lines changed: 539 additions & 0 deletions


Lines changed: 303 additions & 0 deletions
@@ -0,0 +1,303 @@
# postgres-customers Admin Lockdown Runbook

> **Status: DORMANT. Operator-applied in a maintenance window. This repo has no
> auto-apply (rule 15). Nothing here runs automatically.**
>
> **HIGH BLAST RADIUS: touches the shared customer-Postgres data tier. Requires
> USER/OPERATOR review and a maintenance window before apply.**
>
> Closes the OPEN root cause of the **truehomie-db DROP incident (2026-06-03)**:
> a direct, public-internet admin connection to `postgres-customers` that could
> `DROP DATABASE` with no `audit_log` row. Memory:
> `project_truehomie_db_drop_incident_2026_06_03`. Audit:
> `docs/ci/DATA-INTEGRITY-DROP-PATH-AUDIT.md` (§truehomie root-cause hypotheses,
> H1 = confirmed vector).

---

## 1. What was confirmed vs hypothesis (verify-don't-assert)

### CONFIRMED via config + SAFE checks (2026-06-06)

| # | Finding | How confirmed |
|---|---|---|
| C1 | `pg.instanode.dev` → `152.42.154.144`, the SAME IP as `api`/`redis`/`mongo.instanode.dev` (the shared DO LoadBalancer fronting ingress-nginx). pg is **publicly DNS-routed**. | `dig +short pg.instanode.dev` (and the three siblings) |
| C2 | `pg.instanode.dev:5432` **answers a TCP handshake from the public internet.** | `nc -z -w5 pg.instanode.dev 5432` → "succeeded" (**TCP only: no auth, no SQL, no DDL attempted**) |
| C3 | The `postgres-customers` pod runs the **stock `pgvector/pgvector:pg16` image** with **NO** custom `pg_hba.conf` / `postgresql.conf` / `POSTGRES_HOST_AUTH_METHOD` and **no config volume mount** → the image default `host all all all scram-sha-256` (a **catch-all**) is in effect. The admin/superuser role (`instant_cust`, the `POSTGRES_USER`) can authenticate from any source that reaches the listener. | `k8s/data/postgres-customers.yaml` (only the data PVC is mounted; no `command`/`args`/`POSTGRES_HOST_AUTH_METHOD`); no `pg_hba.conf`/`postgresql.conf` anywhere in the infra repo |
| C4 | The `postgres-customers` **Service is ClusterIP** (no `type:` field). The public exposure is via the external `instant-pg-proxy` + ingress-nginx `tcp-services`, **NOT** this Service. | `k8s/data/postgres-customers.yaml` Service spec |
| C5 | A `postgres-customers-ingress` NetworkPolicy already exists allowing ingress on 5432 only from provisioner/migrator/worker (all `instant-infra`). It does **not** list pg-proxy. **Its prod-apply state is unverified** (infra has no auto-apply). | `k8s/data/networkpolicy.yaml` |

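The C1 check (four hostnames resolving to one load-balancer IP) can be made mechanical. The following is a minimal sketch, not part of the PR: `resolves_to_one_ip` is a hypothetical helper name, and it assumes each `dig +short` answer is a bare A record (a CNAME line in the output would make the check fail).

```shell
# Hypothetical helper (not in the repo): reads resolved IPs, one per line, on
# stdin and succeeds only when every non-empty line is the same address.
resolves_to_one_ip() {
  [ "$(grep -v '^$' | sort -u | wc -l | tr -d ' ')" = "1" ]
}

# Usage (not run here; hostnames taken from finding C1):
#   for h in pg api redis mongo; do dig +short "$h.instanode.dev"; done \
#     | resolves_to_one_ip && echo "shared LB IP confirmed"
```

Empty input counts as a failure, so a DNS outage is not mistaken for a match.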
### CONFIRMED LIVE during the 2026-06-06 apply session (supersedes H1–H3 below)

| # | Finding | How confirmed |
|---|---|---|
| L1 | **TWO superuser roles exist on the prod customer pod:** `instanode_admin` (rolsuper=t; the role api/worker connect with via `CUSTOMER_DATABASE_URL`) AND `instant_cust` (rolsuper=t + createdb + createrole; the `POSTGRES_USER` the provisioner connects with via `POSTGRES_CUSTOMERS_URL`). The PR's pg_hba listed only `instant_cust`. **Manifest FIXED to reject BOTH** before apply; otherwise the catch-all customer rule re-opens the vector for `instanode_admin` (the confirmed truehomie role). | `psql -tAc "select rolname,rolsuper from pg_roles where rolsuper"`; `kubectl get secret instant-secrets -o jsonpath CUSTOMER_DATABASE_URL` → `instanode_admin`; provisioner deploy env `POSTGRES_CUSTOMERS_URL` → `instant_cust` |
| L2 | **A LIVE pg_hba stopgap was already on the pod** (from the 2026-06-03 incident): `host all instanode_admin <pod-ip>/32 reject` for the THEN proxy pod IPs (`10.109.3.201`, `10.109.0.101`), plus catch-all `host all all all scram`. One rejected IP (`10.109.3.201`) is now **STALE** (the proxy rescheduled to `10.109.4.113`), so the stopgap is partially broken. This ConfigMap's **role-keyed** reject is the churn-proof replacement. Live file backed up at `$PGDATA/pg_hba.conf.bak.2026-06-03`. | `kubectl exec … cat $PGDATA/pg_hba.conf`; `kubectl get pods -l app=instant-pg-proxy -o wide` |
| L3 | **pg-proxy is a custom TCP proxy that SNATs.** `instant-pg-proxy:v0.1.0` (in `instant` ns) routes by Redis prefix `pg_route:` with `PG_PROXY_FALLBACK_BACKEND=postgres-customers.instant-data.svc:5432`. Being a TCP proxy it terminates inbound + re-originates, so customer traffic arrives at postgres-customers as the **proxy pod IP (10.x)**. This confirms the role-based (not source-IP) reject is the correct boundary, and confirms the **fallback** would forward an admin connection straight through (the live vector). | `kubectl get deploy/instant-pg-proxy -o jsonpath env` |
| L4 | **The `postgres-customers-ingress` NetworkPolicy is NOT applied in prod** (`kubectl get netpol -n instant-data` → "No resources found"). Cilium IS the CNI (policies would enforce if applied). So the network layer provides **zero** protection today; the pg_hba role-reject is the **sole** boundary. The NetworkPolicy was therefore **NOT applied** in this session (applying it as-is would default-deny + break the proxy path, which is not in its allow-list). | `kubectl get networkpolicy -n instant-data`; `kubectl get ds -n kube-system \| grep cilium` |
| L5 | **No committed public-admin automation.** `grep -rI pg.instanode.dev` across all repos finds it only as the customer-facing `POSTGRES_PUBLIC_HOST` (the `usr_*` path); nothing pairs it with an admin DSN. `tcp-services` cm currently maps `5432 → instant/instant-pg-proxy` (its `last-applied` annotation shows it was ORIGINALLY `5432 → instant-data/postgres-customers`, i.e. a former direct-to-pod route, which is historical corroboration of the vector). | `grep`; `kubectl get cm -n ingress-nginx tcp-services -o yaml` |

> **Net:** H1's *vector* is now fully corroborated end-to-end (public DNS → LB →
> ingress tcp-services → SNATting pg-proxy with a fallback → catch-all pg_hba),
> and the proxy behaviour (H2) and NetPol non-enforcement (H3) are RESOLVED above.
> We still did NOT attempt auth as the admin role (no destructive pentest); the
> apply-time external test (§5b) uses a connection-rejection probe only.

### ORIGINAL HYPOTHESES (pre-apply; superseded by L1–L5 above)

| # | Open item | Why it could not be confirmed at PR time |
|---|---|---|
| H1 | That an external actor **actually authenticated** as the admin role over the public path. | We deliberately did **not** attempt auth (out-of-scope noisy/destructive pentest). C1–C3 prove the path is *open*, not that it was *used*. |
| H2 | The `instant-pg-proxy`'s own role-gate / `pg_hba` behaviour and whether it already blocks the admin role. | The proxy config lives in the **separate repo `InstaNode-dev/instant-pg-proxy`**. **RESOLVED L3:** it SNATs + has an open fallback (no role gate in v0.1.0). |
| H3 | Whether the existing `postgres-customers-ingress` NetworkPolicy is enforced in prod. | infra has no auto-apply; requires a live `kubectl get netpol -n instant-data` (operator). **RESOLVED L4:** NOT applied. |

> **Net:** the exposure (public-reachable customer-Postgres listener + a
> catch-all default pg_hba that lets the admin role auth from anywhere) is
> **CONFIRMED at the config + reachability level**. Whether it was the actual
> truehomie dropper, and the proxy's own gate, remain hypothesis. The hardening
> agent's #1 hypothesis is therefore **corroborated, not refuted**.

---

## 2. Legitimate consumers of postgres-customers (must NOT break)

| Consumer | How it connects | Role | Preserved by this lockdown? |
|---|---|---|---|
| **instant-provisioner** (`instant-infra`) | `POSTGRES_CUSTOMERS_URL` admin DSN, in-cluster to `postgres-customers.instant-data.svc:5432` | **admin** (`instant_cust`): CREATE/DROP `db_<token>` + `usr_<token>` | **Yes.** pg_hba allows `instant_cust` from `10.0.0.0/8` (pod CIDR). NetPol already allows provisioner. |
| **instant-migrator** (`instant-infra`) | in-cluster, resource migrations (CopyData/Verify) | admin or per-tenant | **Yes.** Same pod-CIDR admin allow + NetPol already allows migrator. |
| **instant-worker** (`instant-infra`) | in-cluster, read-only `pg_database_size` (quota tick) | admin (read) | **Yes.** Pod-CIDR admin allow + NetPol already allows worker. |
| **Customers** | public `pg.instanode.dev:5432` → pg-proxy → `db_<token>` | per-tenant **`usr_<token>`** (non-superuser) | **Yes.** The LAST pg_hba rule, `host all all 0.0.0.0/0 scram-sha-256`, still allows customer roles from anywhere. The admin reject does NOT catch them (role name != `instant_cust`/`instanode_admin`/`postgres`). |
| **backup CronJob** (`postgres-customers-backup`) | in-cluster `pg_dumpall` (BACKUP-RESTORE-RUNBOOK) | admin | **Yes.** Runs in-cluster (pod CIDR) as the admin role. *Operator: verify its pod lands in 10.0.0.0/8; it does, all DOKS pods are in pod CIDR.* |
| **restore-drill sidecar** | throwaway namespace, never touches the live pod | n/a | Unaffected. |

**The one thing this CLOSES:** a direct `psql -h pg.instanode.dev -U instant_cust`
(or `-U instanode_admin` per finding L1, or `-U postgres`) from **outside** the
cluster. That is the truehomie vector.

> **Unverifiable-consumer caution:** if any **ad-hoc operator/CI workflow**
> currently connects to the admin role over the **public** `pg.instanode.dev`
> (e.g. a migration run from a laptop or a GitHub Action), the lockdown will
> **break it by design**, because that path IS the vulnerability. Before apply, the
> operator MUST confirm no legitimate automation depends on public admin access
> (search CI secrets / workflows for `pg.instanode.dev` + an admin DSN). If one
> exists, migrate it to an in-cluster runner / `kubectl exec` first.

---

## 3. Pre-apply verification (do this FIRST, in the window)

```bash
kubectl config current-context   # MUST be do-nyc3-instant-prod

# (a) Is the existing ingress NetworkPolicy enforced? (H3)
kubectl get netpol -n instant-data postgres-customers-ingress -o yaml | sed -n '1,60p'

# (b) Where is pg-proxy, and does customer traffic SNAT through it? (H2)
#     The proxy manifest is in the SEPARATE instant-pg-proxy repo; find it live:
kubectl get pods -A | grep -i pg-proxy
kubectl get svc,cm -A | grep -iE 'tcp-services|pg-proxy'
# Inspect ingress-nginx tcp-services to see what 5432 maps to:
kubectl get cm -n ingress-nginx tcp-services -o yaml 2>/dev/null

# (c) Does a real customer connection currently work end-to-end? (baseline to
#     compare AFTER lockdown: use a KNOWN test tenant's usr_/db_, NOT admin)
#     (operator runs from a real customer connection string they own)

# (d) Confirm which superuser roles actually exist on the live pod (don't trust
#     the manifest blindly; POSTGRES_USER is `instant_cust`):
kubectl exec -n instant-data deploy/postgres-customers -- \
  psql -U instant_cust -d instant_customers -tAc "select rolname,rolsuper from pg_roles where rolsuper;"
# Expect `instant_cust` plus, per finding L1, `instanode_admin` (and possibly
# `postgres`). If any superuser is missing from the pg_hba ConfigMap, EDIT the
# ConfigMap to match BEFORE apply.

# (e) Confirm no legitimate automation uses PUBLIC admin access:
#     (search your CI secrets/workflows + local shell history for the DSN)
#     grep your repos for: pg.instanode.dev .* instant_cust (or :@pg.instanode.dev)
```

**Decision gate:**
- If (d) shows a different admin role → fix the ConfigMap, re-run pre-apply.
- If (e) finds a public-admin automation → migrate it in-cluster FIRST.
- If (b) shows pg-proxy SNATs and (a) shows the NetPol enforced and customers
  currently work → the NetPol must already allow the proxy somehow; do NOT touch
  the NetPol, rely on the pg_hba role-reject alone. **If customers do NOT
  currently work, that is a pre-existing issue; do not conflate it with this
  lockdown.**

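Step (d) matters because finding L1 showed a superuser (`instanode_admin`) that the first pg_hba draft did not reject. As a minimal sketch of that comparison, the helper below prints every superuser that has no `reject` rule in a given pg_hba file. `uncovered_superusers` is a hypothetical name, and it assumes the standard column order (type, database, user, ...) with the auth method as the last field, i.e. no trailing auth options on `reject` lines.

```shell
# Hypothetical helper (not in the repo).
# Usage: <superuser names, one per line> | uncovered_superusers HBA_FILE
# Prints each name that no `reject` rule in HBA_FILE lists in its user column.
uncovered_superusers() {
  hba_file=$1
  while IFS= read -r role; do
    [ -n "$role" ] || continue                     # skip blank lines
    if ! awk -v r="$role" '
        $1 !~ /^#/ && $NF == "reject" {
          n = split($3, users, ",")                # user column may be a comma list
          for (i = 1; i <= n; i++) if (users[i] == r) found = 1
        }
        END { exit (found ? 0 : 1) }' "$hba_file"; then
      printf '%s\n' "$role"
    fi
  done
}

# Usage (not run here): feed it the live superuser list from step (d), e.g.
#   kubectl exec -n instant-data deploy/postgres-customers -- \
#     psql -U instant_cust -d instant_customers -tAc \
#     "select rolname from pg_roles where rolsuper;" | uncovered_superusers ./pg_hba.conf
```

Any output means the ConfigMap needs another reject line before apply; empty output only shows that a reject rule exists per role, not that it is ordered correctly.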
### 3a. ⚠️ The pg-proxy SNAT problem: proxy-pod-IP reject is REQUIRED (and churns)

**LIVE-VERIFIED 2026-06-06, and it changes the design:** `instant-pg-proxy` is a
normal in-cluster pod (not hostNetwork) that terminates the inbound TCP and
re-originates to `postgres-customers`. So EVERY public connection, including an
external `psql -U instanode_admin`, arrives SNAT'd to a **proxy pod IP inside
10.109.x (i.e. inside 10.0.0.0/8)**. A plain `instanode_admin 10.0.0.0/8 allow`
would therefore MATCH a SNAT'd external admin and NOT close the vector. Baseline
probe before apply confirmed the live vector is OPEN: `psql -U instanode_admin`
over `pg.instanode.dev` returns `password authentication failed` (it REACHED
scram). The proxy v0.1.0 has no role gate and an open fallback.

**Consequence for the ConfigMap:** the admin reject MUST list the CURRENT proxy
pod IPs and be ordered BEFORE the `10.0.0.0/8` allow (first-match wins). Get them:

```bash
kubectl get pods -n instant -l app=instant-pg-proxy -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}'
# Put these into the `host all instanode_admin <ip>/32 reject` (and instant_cust)
# lines at the TOP of the admin block in postgres-customers-lockdown.yaml.
```

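Turning the pod IP list into reject lines by hand is error-prone under time pressure. As a minimal sketch (the helper name `render_proxy_rejects` is hypothetical, and the two role names are taken from finding L1), the function below emits one `reject` line per admin role per proxy IP, in the `host all <role> <ip>/32 reject` shape used by the existing stopgap:

```shell
# Hypothetical helper (not in the repo): reads proxy pod IPs, one per line, on
# stdin and prints a pg_hba reject line for each admin role and each IP.
# Assumption: the admin roles are the two superusers named in finding L1.
render_proxy_rejects() {
  while IFS= read -r ip; do
    [ -n "$ip" ] || continue                      # skip blank lines
    for role in instanode_admin instant_cust; do
      printf 'host all %s %s/32 reject\n' "$role" "$ip"
    done
  done
}

# Usage (not run here):
#   kubectl get pods -n instant -l app=instant-pg-proxy \
#     -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' | render_proxy_rejects
```

The output still has to be pasted ABOVE the `10.0.0.0/8` allow by the operator; the sketch does not edit the ConfigMap.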
**⚠️ CHURN: these IPs change when the proxy reschedules.** That is exactly how
the 2026-06-03 hand-stopgap silently broke (it listed `10.109.3.201`, now dead).
After ANY proxy reschedule, re-run the command above, update the reject lines,
re-apply the ConfigMap, and get the new file loaded. Because §4 mounts the file
via `subPath`, Kubernetes does not refresh the mounted copy in place, so this
needs a pod restart rather than only `SELECT pg_reload_conf()`. The **durable**
churn-proof closer is the pg-proxy's own privileged-role deny
(`PG_PROXY_DENIED_ROLES`, staged in repo `InstaNode-dev/instant-pg-proxy` per
memory). Once that ships and is deployed, the proxy rejects admin roles before
forwarding and these IP lines become redundant belt-and-suspenders. **Operator
follow-up: ship the proxy role-gate; add an alert on proxy pod restarts so the
pg_hba IPs can be refreshed.**

- If the proxy-pod-IP reject lines in the ConfigMap do NOT match the live proxy
  IPs at apply time → FIX them first, else the lockdown is a no-op for the live
  public path.

---

## 4. Apply (online pg_hba reload first; pod patch is the durable step)

The pg_hba change is **online-reloadable**: no customer downtime for the
config itself. The pod patch (to mount the ConfigMap + start with the custom
`hba_file`) is a **pod restart** (single-replica → brief connect blip; provisioner
retries; customers reconnect).

```bash
kubectl config current-context   # do-nyc3-instant-prod (re-confirm)

# 1. Apply the ConfigMap (inert until mounted, so safe to apply anytime).
kubectl apply -f k8s/data/postgres-customers-lockdown.yaml

# 2. Patch the Deployment to mount the ConfigMap and start postgres with the
#    custom hba_file. (Single, reviewable strategic patch: read the diff first.)
kubectl patch deploy/postgres-customers -n instant-data --type=strategic -p '
spec:
  template:
    spec:
      containers:
      - name: postgres
        args: ["-c", "hba_file=/etc/postgresql/pg_hba.conf", "-c", "password_encryption=scram-sha-256"]
        volumeMounts:
        - name: hba
          mountPath: /etc/postgresql/pg_hba.conf
          subPath: pg_hba.conf
          readOnly: true
      volumes:
      - name: hba
        configMap:
          name: postgres-customers-hba
          items:
          - key: pg_hba.conf
            path: pg_hba.conf
'
# NOTE: the container name on the live pod is `postgres` (per the manifest).
# Confirm with: kubectl get deploy/postgres-customers -n instant-data \
#   -o jsonpath='{.spec.template.spec.containers[0].name}'

# 3. Wait for the new pod to be Ready.
kubectl rollout status deploy/postgres-customers -n instant-data --timeout=180s

# (Alternative to a restart, once the file is already mounted from a prior
# apply: the mounted pg_hba path is read-only, so the reload path is to update
# the ConfigMap and reload the config. CAVEAT: Kubernetes does not propagate
# ConfigMap updates into `subPath` mounts like the one above, so with this
# mount the reload only sees a new file after a pod restart.)
# kubectl exec -n instant-data deploy/postgres-customers -- \
#   psql -U instant_cust -d instant_customers -c "SELECT pg_reload_conf();"
```

---

## 5. Verify AFTER apply

### 5a. Legitimate access STILL works (do these FIRST)

```bash
# Provisioner admin path (in-cluster): must still authenticate.
# (This psql has no -h, so it connects over the pod-local socket; the
# provisioning smoke test below is what exercises the in-cluster host rule.)
kubectl exec -n instant-data deploy/postgres-customers -- \
  psql -U instant_cust -d instant_customers -tAc "select 1;"   # expect: 1

# A provisioning smoke test through the real API (creates + lists a db):
curl -sS -X POST https://api.instanode.dev/db/new | jq '.connection_string!=null'   # expect true
# then connect to the returned connection string as the customer (usr_ role)
# from OUTSIDE the cluster and run `select 1;`. Expect SUCCESS (customer path
# preserved).

# Backup CronJob smoke (or wait for the nightly): trigger a manual run and
# confirm it still dumps (BACKUP-RESTORE-RUNBOOK §verify).
kubectl create job -n instant-data --from=cronjob/postgres-customers-backup pg-lockdown-verify
kubectl logs -n instant-data job/pg-lockdown-verify --follow   # expect a clean dumpall
```

### 5b. External ADMIN access is CLOSED (the whole point)

```bash
# From a machine OUTSIDE the cluster, attempt the ADMIN role over the public host.
# EXPECT: rejected by pg_hba (FATAL), NOT a password prompt that proceeds. This
# is a SAFE connection-rejection test: it does NOT require valid credentials
# and runs NO SQL.
PGCONNECT_TIMEOUT=5 psql "host=pg.instanode.dev port=5432 user=instant_cust dbname=instant_customers sslmode=require" -c '\q' 2>&1 | head
# Repeat with user=instanode_admin: finding L1 requires BOTH superusers rejected.
# PASS = an explicit pg_hba rejection (FATAL). For a matching `reject` rule
#        PostgreSQL reports: pg_hba.conf rejects connection for host ... user
#        "instant_cust" ... (the "no pg_hba.conf entry for host ..." wording
#        appears when no rule matches at all; either way auth never starts).
# FAIL = a password prompt / "password authentication failed" (means the hba
#        rule did NOT reject and admin is still reachable; ROLL BACK + investigate).
# (The TCP handshake will still succeed, which is expected; the boundary is the
# pg_hba role reject, not the port. The customer usr_* path is unaffected.)
```

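The PASS/FAIL reading above can be scripted so the verdict does not depend on someone eyeballing psql output mid-window. This is a minimal sketch: `classify_admin_probe` is a hypothetical helper, and the matched substrings are the PostgreSQL/libpq messages described above plus libpq's `no password supplied` error, which is what a non-interactive probe typically shows when the server asks for a password.

```shell
# Hypothetical helper (not in the repo): reads the captured probe output on
# stdin and prints PASS, FAIL or UNKNOWN.
classify_admin_probe() {
  out=$(cat)
  case $out in
    # Server refused before authentication started: the hba boundary held.
    *"pg_hba.conf rejects connection"*|*"no pg_hba.conf entry"*)
      echo PASS ;;
    # Server asked for (or checked) a password: the admin role is still reachable.
    *"password authentication failed"*|*"no password supplied"*|*"Password for user"*)
      echo FAIL ;;
    # Timeouts, DNS errors, TLS failures etc. prove nothing either way.
    *)
      echo UNKNOWN ;;
  esac
}

# Usage (not run here; -w stops psql from prompting for a password):
#   PGCONNECT_TIMEOUT=5 psql -w "host=pg.instanode.dev port=5432 user=instanode_admin \
#     dbname=instant_customers sslmode=require" -c '\q' 2>&1 | classify_admin_probe
```

Treat UNKNOWN as a failed verification and investigate, since a timeout does not show the role is rejected.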
> The TCP port stays open (customers need it). The boundary is the **role-level
> reject** at pg_hba. If you want the port itself closed to the public, that is a
> separate, larger change in the `instant-pg-proxy` repo + ingress-nginx
> tcp-services (do not attempt as part of this lockdown).

---

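## 6. Rollback

```bash
# Revert the pod patch (drops the custom hba_file + mount → back to image default).
# CAUTION: the index-0 paths below assume the hba mount and volume sit at
# position 0. The Deployment also mounts the data PVC, so list the names first
# and adjust the indexes (or use `kubectl rollout undo` if the patch was the
# most recent rollout):
kubectl get deploy/postgres-customers -n instant-data \
  -o jsonpath='{.spec.template.spec.containers[0].volumeMounts[*].name}{"\n"}{.spec.template.spec.volumes[*].name}{"\n"}'
kubectl patch deploy/postgres-customers -n instant-data --type=json -p '[
  {"op":"remove","path":"/spec/template/spec/containers/0/args"},
  {"op":"remove","path":"/spec/template/spec/containers/0/volumeMounts/0"},
  {"op":"remove","path":"/spec/template/spec/volumes/0"}
]'
kubectl rollout status deploy/postgres-customers -n instant-data --timeout=180s

# Optionally delete the ConfigMap (inert either way):
kubectl delete cm -n instant-data postgres-customers-hba --ignore-not-found
```
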
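A JSON-patch `remove` addresses list entries by position, so removing the wrong index would detach the data volume instead of the hba mount. As a minimal sketch of looking the position up rather than assuming it, the hypothetical helper `index_of` below prints the zero-based index of a name in a newline-separated list; the `data` name in the example is an assumption, only `hba` comes from the patch in §4.

```shell
# Hypothetical helper (not in the repo): reads names, one per line, on stdin and
# prints the zero-based position of NAME; exits non-zero when NAME is absent.
index_of() {
  awk -v want="$1" '
    $0 == want { print NR - 1; found = 1; exit }
    END { exit (found ? 0 : 1) }'
}

# Usage (not run here):
#   idx=$(kubectl get deploy/postgres-customers -n instant-data \
#     -o jsonpath='{range .spec.template.spec.volumes[*]}{.name}{"\n"}{end}' | index_of hba) \
#     && echo "remove path: /spec/template/spec/volumes/$idx"
```

The same lookup applies to `containers[0].volumeMounts`; if either lookup fails, the hba mount is not present and there is nothing to remove.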
Rollback restores the (vulnerable) catch-all default. Only roll back if a
**legitimate** consumer breaks, and capture which one, because that maps to a
consumer the analysis missed.

---

## 7. What this does NOT do (scope honesty)

- It does **not** close the public TCP port on 5432 (customers connect there).
  The admin boundary is the pg_hba **role reject**, not the port.
- It does **not** touch the `instant-pg-proxy` repo (the proxy's own role-gate /
  pg_hba is the durable fix tracked separately, per memory + the audit doc).
- It does **not** prove the truehomie dropper used this path (H1 remains
  hypothesis). It removes the *capability*, which is the right action regardless.
- It does **not** by itself add an audit trail for in-cluster admin DROPs. That
  is the provisioner `guardedDrop` chokepoint (already shipped, audit doc §Layer 1)
  + the DDL-logging trap set on the cluster (memory).

---

## 8. Defense-in-depth context (already shipped elsewhere)

This lockdown is the **infra** half of the truehomie fix. The **application** half
is already shipped (audit doc):
- provisioner `guardedDrop` chokepoint + DDL-audit log + `instant_provisioner_drop_total` (PR #50)
- CI guard test: no raw DROP outside the chokepoint (PR #50)
- NR alert + dashboard tile + catalog row for the drop metric (infra PR #60, merged)

Together: this runbook removes the *unaudited external admin DROP capability*;
the chokepoint ensures every *sanctioned* drop is recorded; the CI guard ensures a
*new* unaudited drop call site cannot be merged.
