
Commit 11bc470

infra(data-tier): S4 razorpay sig-fail monitoring + R6 nats PVC + R7 eviction protection + apply runbook (#69)
S4 (monitoring half): pair the api agent's new instant_razorpay_webhook_sig_fail_total counter with full rule-25 coverage, mirroring the GitHub-webhook bad-signature alert:
- Prom rule RazorpayWebhookSigFailSpike (new instant-billing group, P2/warning)
- NR alert newrelic/alerts/razorpay-webhook-sig-fail.json (P2)
- billboard tile on instanode-reliability.json (row 84, threshold 1)
- METRICS-CATALOG.md row

R6: convert NATS JetStream /data/jetstream from emptyDir{} to PVC (nats-jetstream-pvc, 5Gi, cluster-default StorageClass = do-block-storage on DOKS) so stream/consumer state + persisted messages survive pod restarts. Recreate strategy retained (RWO single-attach).

R7: eviction protection for the stateful instant-data pods under the cluster's 200%+ memory overcommit:
- k8s/data/stateful-priority.yaml: PriorityClass instant-data-critical (value 1000000) + one PDB (minAvailable:1) each for postgres-customers, mongodb, redis-provision, nats. (minio excluded; retired 2026-05-20.)
- right-sized resource requests/limits on postgres-customers, mongodb, redis-provision (were BestEffort QoS → first-evicted; now Burstable).

Apply runbook: k8s/DATA-TIER-APPLY-RUNBOOK.md documents operator apply-order + verification gates for R6/R7 AND the already-committed-but-unapplied S1 (postgres-customers-lockdown) and S2 (networkpolicy), including the explicit warning that the NetworkPolicy needs the dormant instant-pg-proxy allow-rule or it breaks every customer Postgres connection, and the load-bearing S1 verify (external psql -U instant_cust/instanode_admin -h pg.instanode.dev MUST be rejected after lockdown). Cross-linked from APPLY-CHECKLIST.md.

NO kubectl apply performed: infra has no auto-apply (CLAUDE.md rule 15); manifests ship as PR + runbook for the operator to apply in a window. yamllint + kubeconform (CI parity) green on all changed manifests.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 6a6064c commit 11bc470

11 files changed

Lines changed: 644 additions & 6 deletions

k8s/APPLY-CHECKLIST.md

Lines changed: 6 additions & 0 deletions
@@ -14,6 +14,12 @@ This checklist applies to:
1414
Per CLAUDE.md rule 15: **this repo has no auto-apply by design.** Manifest
1515
apply is a deliberate, human-driven step.
1616

17+
> **Stateful data-tier manifests** (`k8s/data/*` — postgres-customers,
18+
> mongodb, redis-provision, nats: PVCs, NetworkPolicy, pg_hba lockdown,
19+
> PriorityClass/PDBs) have their own apply order + verification gates in
20+
> **`k8s/DATA-TIER-APPLY-RUNBOOK.md`**. Use that for S1/S2/R6/R7. This file
21+
> is for the api/worker/provisioner Deployment manifests only.
22+
1723
---
1824

1925
## Hard rules

k8s/DATA-TIER-APPLY-RUNBOOK.md

Lines changed: 292 additions & 0 deletions
@@ -0,0 +1,292 @@
1+
# Data-Tier Apply Runbook — `instant-data` stateful hardening
2+
3+
> Companion to `k8s/APPLY-CHECKLIST.md` (which covers the api/worker/
4+
> provisioner **Deployment** manifests). This runbook covers the **stateful
5+
> data-tier** manifests in `k8s/data/` — the ones that hold real customer data
6+
> and therefore must be applied deliberately, in order, in a maintenance
7+
> window. **This repo has no auto-apply (CLAUDE.md rule 15).**
8+
>
9+
> **CRITICAL: never `kubectl apply -f k8s/app.yaml`** (stale vs prod — strips
10+
> `imagePullSecrets`, resets images). The files below are individually
11+
> applyable; apply them one at a time and read each `--dry-run=server` diff.
12+
13+
This runbook is the operator apply checklist for four changes that are
14+
**committed but NOT yet applied to prod** (infra has no auto-apply):
15+
16+
| Tag | File | What it does | Customer-visible risk if mis-applied |
17+
|---|---|---|---|
18+
| **S1** | `k8s/data/postgres-customers-lockdown.yaml` + the patched `postgres-customers.yaml` | pg_hba that REJECTS the admin/superuser roles (`instanode_admin`, `instant_cust`) from the public path; preserves `usr_*` customer roles | LOW — admin-only reject; customers unaffected. Detailed runbook: `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md` |
19+
| **S2** | `k8s/data/networkpolicy.yaml` | ingress NetworkPolicy: only provisioner/migrator/worker (+ nats-proxy for 4222) may reach the data pods | **HIGH — can break ALL customers** if the pg-proxy allow-rule is missing. See §S2 below. |
20+
| **R6** | `k8s/data/nats.yaml` | JetStream `emptyDir{}` → PVC (`nats-jetstream-pvc`, 5Gi) so queue data survives restarts | LOW, but the cutover (§R6) discards any existing `emptyDir`-backed JetStream state; nothing is migrated. |
21+
| **R7** | `k8s/data/stateful-priority.yaml` + resource requests in `{postgres-customers,mongodb,redis-provision}.yaml` | PriorityClass `instant-data-critical` + one PDB per stateful pod + right-sized requests (BestEffort → Burstable) | LOW — eviction-ordering + drain-gating only; no data-path change. |
22+
23+
---
24+
25+
## Pre-flight (every apply below)
26+
27+
```bash
28+
# 1. Confirm context — NEVER run against the wrong cluster.
29+
kubectl config current-context
30+
# Expected for prod: do-nyc3-instant-prod
31+
32+
# 2. Snapshot current data-tier state for rollback reference.
33+
kubectl get pods,pvc,netpol,pdb,priorityclass -n instant-data -o wide
34+
kubectl get priorityclass instant-data-critical 2>/dev/null || echo "no priorityclass yet"
35+
36+
# 3. Server-side dry-run EACH file and read the diff line by line.
37+
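kubectl apply --dry-run=server -f <file>   # server-side validation only; prints no diff
kubectl diff -f <file>                     # prints the live-vs-manifest diff to read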
38+
```
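The context check in step 1 relies on the operator reading the output. A minimal sketch of a guard that fails closed instead is shown below; the helper name `require_context` is hypothetical and not part of this repo, and the expected context string is the one quoted in the pre-flight block.

```bash
# Hypothetical guard (not in the repo): refuse to continue unless the
# active kubectl context matches the expected one.
#   $1 = expected context name
#   $2 = actual context name (normally from `kubectl config current-context`)
require_context() {
  expected="$1"
  actual="$2"
  if [ "$actual" != "$expected" ]; then
    echo "REFUSING: context '$actual' is not '$expected'" >&2
    return 1
  fi
  echo "context OK: $actual"
}

# Usage (run by hand at the start of the window):
#   require_context do-nyc3-instant-prod "$(kubectl config current-context)" || exit 1
```

Passing the actual context in as an argument keeps the comparison separate from the `kubectl` call, so the guard can be exercised without cluster access.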
39+
40+
Apply in a **maintenance window**. The recommended order is **R7 → R6 → S1 →
41+
S2** — least-risky and reversible first, the customer-breaking NetworkPolicy
42+
LAST so it is the freshest thing in your head if customers report errors.
43+
44+
---
45+
46+
## R7 — PriorityClass + PDBs + resource requests (apply FIRST)
47+
48+
Pure eviction-protection; no data-path change. Two parts.
49+
50+
**Part A — the PriorityClass + PDBs:**
51+
52+
```bash
53+
kubectl apply --dry-run=server -f k8s/data/stateful-priority.yaml # read diff
54+
kubectl apply -f k8s/data/stateful-priority.yaml
55+
56+
# Verify
57+
kubectl get priorityclass instant-data-critical
58+
kubectl get pdb -n instant-data
59+
# Expect 4 PDBs (postgres-customers / mongodb / redis-provision / nats),
60+
# each ALLOWED DISRUPTIONS reading 0 (single replica, minAvailable 1 → the one
61+
# pod is "not disruptable" by voluntary eviction, which is the point).
62+
```
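The "4 PDBs, each at 0 allowed disruptions" expectation can also be checked mechanically. The sketch below assumes the PDB list is fed in as `name disruptionsAllowed` pairs; `check_pdbs` is a hypothetical helper, while `.status.disruptionsAllowed` is the standard `policy/v1` status field.

```bash
# Hypothetical helper: reads "name disruptionsAllowed" lines on stdin and
# succeeds only if exactly $1 PDBs are listed and every one allows 0
# voluntary disruptions.
check_pdbs() {
  want="$1"
  awk -v want="$want" '
    NF == 2 {
      n++
      if ($2 != 0) { bad++; print "PDB " $1 " allows " $2 " disruptions" }
    }
    END {
      if (n != want) { print "expected " want " PDBs, found " n + 0; exit 1 }
      if (bad > 0) exit 1
      print "all " n " PDBs at 0 allowed disruptions"
    }'
}

# Usage:
#   kubectl get pdb -n instant-data \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.disruptionsAllowed}{"\n"}{end}' \
#     | check_pdbs 4
```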
63+
64+
**Part B — the resource requests + the priorityClassName patch.** The requests
65+
ship INSIDE each workload manifest (`mongodb.yaml`, `redis-provision.yaml`,
66+
`postgres-customers.yaml`). Re-applying those manifests rolls the pod (Recreate
67+
strategy → brief downtime per workload — do this in the window). Because the
68+
PriorityClass is deliberately NOT inlined in the Deployments (so the priority
69+
rollout is one auditable step), patch `priorityClassName` in the same roll:
70+
71+
```bash
72+
# postgres-customers carries the S1 pg_hba mount already — apply it as part of
73+
# S1 below (§S1) to avoid two rolls. For mongodb + redis-provision, roll now:
74+
for w in mongodb redis-provision; do
75+
kubectl apply --dry-run=server -f k8s/data/$w.yaml # read diff: only resources{} added
76+
kubectl apply -f k8s/data/$w.yaml
77+
kubectl patch deploy/$w -n instant-data --type=merge \
78+
-p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}'
79+
kubectl rollout status deploy/$w -n instant-data --timeout=180s
80+
done
81+
82+
# Verify QoS flipped from BestEffort → Burstable and priority is set:
83+
kubectl get pod -n instant-data -l app=mongodb \
84+
-o jsonpath='{.items[0].status.qosClass}{" "}{.items[0].spec.priorityClassName}{"\n"}'
85+
# Expect: Burstable instant-data-critical
86+
```
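The verify step above only inspects mongodb. A sketch of the same check applied to both workloads rolled in this step follows; `check_qos_priority` is a hypothetical helper, and the `app=redis-provision` label is an assumption modelled on the `app=mongodb` selector used above, so confirm it against the live pod labels first.

```bash
# Hypothetical helper: compare the "<qosClass> <priorityClassName>" string
# printed by the jsonpath query against the expected post-R7 value.
#   $1 = workload name (for the message only)
#   $2 = observed "<qosClass> <priorityClassName>"
check_qos_priority() {
  if [ "$2" = "Burstable instant-data-critical" ]; then
    echo "$1: OK"
  else
    echo "$1: UNEXPECTED '$2'" >&2
    return 1
  fi
}

# Usage (label selectors are assumptions; verify with --show-labels):
#   for w in mongodb redis-provision; do
#     got="$(kubectl get pod -n instant-data -l app="$w" \
#       -o jsonpath='{.items[0].status.qosClass}{" "}{.items[0].spec.priorityClassName}')"
#     check_qos_priority "$w" "$got" || break
#   done
```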
87+
88+
> nats already declared requests; just patch its `priorityClassName` (do it in
89+
> the R6 roll below so nats only restarts once).
90+
91+
**Rollback R7:** `kubectl delete -f k8s/data/stateful-priority.yaml` removes the
PDBs + PriorityClass. Running pods are unaffected, but a pod template that still
names the deleted class is rejected at admission the next time the pod is
recreated, so also patch `priorityClassName` out of each Deployment (this rolls the pod).
94+
95+
---
96+
97+
## R6 — NATS JetStream emptyDir → PVC
98+
99+
`k8s/data/nats.yaml` now declares `nats-jetstream-pvc` (5Gi, default
100+
StorageClass = `do-block-storage` on DOKS) and mounts it at `/data/jetstream`.
101+
102+
> **Data note:** pre-cutover JetStream state lived in `emptyDir{}` and is
103+
> **already non-durable** (every prior restart wiped it). Switching to the PVC
104+
> does NOT migrate old `emptyDir` state; there is nothing durable to migrate.
105+
> Existing `legacy_open` queue resources reconnect + re-establish streams on
106+
> reconnect (same as any nats restart today). Schedule during low queue
107+
> traffic; clients reconnect automatically.
108+
109+
```bash
110+
kubectl apply --dry-run=server -f k8s/data/nats.yaml # read diff: PVC added, volume swapped
111+
112+
# The Deployment uses strategy.type: Recreate (RWO volume — required). Applying
113+
# rolls the pod: old pod terminates, PVC binds, new pod starts on /data/jetstream.
114+
kubectl apply -f k8s/data/nats.yaml
115+
116+
# Patch the PriorityClass in the SAME context so nats restarts once (R7 part B):
117+
kubectl patch deploy/nats -n instant-data --type=merge \
118+
-p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}'
119+
120+
kubectl rollout status deploy/nats -n instant-data --timeout=180s
121+
122+
# Verify the PVC bound and JetStream is on it:
123+
kubectl get pvc nats-jetstream-pvc -n instant-data # STATUS Bound
124+
kubectl exec -n instant-data deploy/nats -- ls -la /data/jetstream
125+
kubectl get pod -n instant-data -l app=nats \
126+
-o jsonpath='{.items[0].status.qosClass}{" "}{.items[0].spec.priorityClassName}{"\n"}'
127+
# Expect a jetstream dir on the mounted PVC + Burstable instant-data-critical.
128+
129+
# Durability proof (the whole point): publish to a stream, delete the pod,
130+
# confirm the message survives the restart.
131+
#   kubectl exec ... nats pub test.durability hello ; kubectl delete pod -n instant-data -l app=nats ;
132+
# (after Ready) nats stream info / consumer next — message must still be there.
133+
```
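The durability proof sketched in the comments above comes down to comparing a stream's message count before and after the pod is deleted. A minimal version of that comparison is below. Everything outside the comparison is an assumption: the stream name `DURABILITY_TEST` is hypothetical, it must already capture the `test.durability` subject, and the `nats` CLI must run from somewhere that can reach the server.

```bash
# Hypothetical helper: succeeds only if at least one message existed before
# the restart and the count afterwards is not lower.
#   $1 = message count before the pod was deleted
#   $2 = message count after the new pod is Ready
messages_survived() {
  before="$1"
  after="$2"
  if [ "$before" -gt 0 ] && [ "$after" -ge "$before" ]; then
    echo "durable: $after message(s) after restart (was $before)"
  else
    echo "NOT durable: $before before, $after after" >&2
    return 1
  fi
}

# Usage sketch (stream name and CLI location are assumptions):
#   nats pub test.durability hello
#   before="$(nats stream info DURABILITY_TEST --json | jq '.state.messages')"
#   kubectl delete pod -n instant-data -l app=nats
#   kubectl rollout status deploy/nats -n instant-data --timeout=180s
#   after="$(nats stream info DURABILITY_TEST --json | jq '.state.messages')"
#   messages_survived "$before" "$after"
```

Requiring a non-zero count before the restart guards against a false pass when the publish never landed in a stream.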
134+
135+
**Rollback R6:** revert the `nats.yaml` change and re-apply (volume goes back to
136+
`emptyDir{}`). The PVC can be left bound (it costs a few cents) or deleted with
137+
`kubectl delete pvc nats-jetstream-pvc -n instant-data` once nats is off it.
138+
139+
---
140+
141+
## S1 — postgres-customers admin lockdown
142+
143+
Full procedure (root-cause, role analysis, proxy-IP SNAT caveat, the live
144+
pg_hba stopgap) is in **`POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md`** — follow THAT
145+
for S1; this is the short pointer + the verification gate.
146+
147+
Apply order: the `postgres-customers-hba` ConfigMap (in
148+
`postgres-customers-lockdown.yaml`) FIRST, then the patched
149+
`postgres-customers.yaml` (which mounts the hba file via subPath + sets
150+
`hba_file=/etc/postgresql/pg_hba.conf` and now also carries the R7
151+
resource requests). Roll postgres-customers ONCE for both:
152+
153+
```bash
154+
kubectl apply -f k8s/data/postgres-customers-lockdown.yaml # ConfigMap (+ any docs)
155+
kubectl apply -f k8s/data/postgres-customers.yaml # mounts hba + R7 requests
156+
kubectl patch deploy/postgres-customers -n instant-data --type=merge \
157+
-p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}'
158+
kubectl rollout status deploy/postgres-customers -n instant-data --timeout=300s
159+
```
160+
161+
> ⚠️ Read `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §3a` BEFORE applying — the
162+
> pg-proxy SNATs customer traffic to a pod IP, so the lockdown rejects the
163+
> admin role BY ROLE NAME (`instanode_admin` AND `instant_cust`), not by source
164+
> IP. If the runbook's proxy-pod-IP reject lines are stale, fix them first.
165+
166+
### S1 verification gate (the load-bearing check)
167+
168+
After the roll, the **external admin path MUST FAIL** while in-cluster admin and
169+
customer paths keep working:
170+
171+
```bash
172+
# (a) EXTERNAL admin connect MUST be REJECTED by pg_hba (NOT a password prompt
173+
# that proceeds). SAFE: connection-rejection probe only — no SQL/DDL.
174+
PGCONNECT_TIMEOUT=5 psql \
175+
"host=pg.instanode.dev port=5432 user=instant_cust dbname=instant_customers sslmode=require" \
176+
-c '\q' 2>&1 | head
177+
# EXPECT: 'pg_hba.conf rejects connection for host ...' or 'no pg_hba.conf entry for host ...' (or FATAL 28000 from the
178+
# proxy). FAILURE TO REJECT = lockdown not in effect — STOP, investigate.
179+
180+
# Repeat for the OTHER admin role (the confirmed truehomie role):
181+
PGCONNECT_TIMEOUT=5 psql \
182+
"host=pg.instanode.dev port=5432 user=instanode_admin dbname=instant_customers sslmode=require" \
183+
-c '\q' 2>&1 | head
184+
# EXPECT: rejected.
185+
186+
# (b) In-cluster admin still works (provisioner path is intact):
187+
kubectl exec -n instant-data deploy/postgres-customers -- \
188+
psql -U instant_cust -d instant_customers -tAc 'select 1;' # expect: 1
189+
190+
# (c) A real customer usr_<token> still connects through the public path
191+
# (regression check — the lockdown must NOT catch customer roles).
192+
# Use a known test-tenant DSN from the dashboard / a /db/new claim.
193+
```
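Because check (a) hinges on telling a pg_hba rejection apart from a normal password failure, it can help to classify the probe output rather than eyeball it. The sketch below is one hedged way to do that: `classify_probe` is a hypothetical helper, and the patterns are the standard PostgreSQL/libpq error strings, which a proxy in the path may rewrite.

```bash
# Hypothetical helper: classify the combined stdout/stderr of the psql probe.
#   returns 0 and prints REJECTED      when pg_hba refused the connection
#   returns 1 and prints NOT-REJECTED  when the server got as far as password auth
#   returns 2 and prints UNKNOWN       for anything else (timeout, DNS, proxy error)
classify_probe() {
  case "$1" in
    *"pg_hba.conf rejects connection"*|*"no pg_hba.conf entry"*)
      echo "REJECTED" ;;
    *"password authentication failed"*|*"no password supplied"*|*"Password for user"*)
      echo "NOT-REJECTED"
      return 1 ;;
    *)
      echo "UNKNOWN"
      return 2 ;;
  esac
}

# Usage (-w stops psql from prompting, so the probe never blocks):
#   out="$(PGCONNECT_TIMEOUT=5 psql -w \
#     "host=pg.instanode.dev port=5432 user=instant_cust dbname=instant_customers sslmode=require" \
#     -c '\q' 2>&1)"
#   classify_probe "$out" || echo "STOP: lockdown not confirmed"
```

Treating anything unrecognised as UNKNOWN rather than a pass keeps the gate conservative: only an explicit pg_hba rejection counts.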
194+
195+
**Rollback S1:** see `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §Rollback` (revert
196+
the manifest, the pod falls back to the stock catch-all pg_hba; the live file
197+
backup is at `$PGDATA/pg_hba.conf.bak.2026-06-03`).
198+
199+
---
200+
201+
## S2 — data-tier ingress NetworkPolicy (apply LAST — highest risk)
202+
203+
`k8s/data/networkpolicy.yaml` adds a default-deny ingress policy per data pod,
204+
allowing ONLY provisioner / migrator / worker (+ nats-proxy for 4222/8222).
205+
206+
### ⚠️ The pg-proxy allow-rule — this is the customer-breaking trap
207+
208+
The `postgres-customers-ingress` policy as committed **does NOT list
209+
`instant-pg-proxy`** — the allow-rule for it is **DORMANT** (commented out at
210+
`networkpolicy.yaml` lines ~88–103). If the public customer connect path is
211+
`pg.instanode.dev → ingress-nginx tcp-services → instant-pg-proxy
212+
(instant ns) → postgres-customers`, then applying this policy AS-IS
213+
**default-denies the proxy and BREAKS EVERY CUSTOMER POSTGRES CONNECTION.**
214+
215+
`POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §L4` records that as of 2026-06-06 the
216+
NetworkPolicy is **NOT applied in prod** and that applying it as-is would
217+
default-deny + break the proxy path. **Do not apply S2 until you have:**
218+
219+
1. **Confirmed the live proxy deployment's namespace + pod labels:**
220+
```bash
221+
kubectl get pods -A -l app=instant-pg-proxy -o wide
222+
# (the proxy manifest lives in the separate InstaNode-dev/instant-pg-proxy
223+
# repo, NOT here — read the real ns + labels off the live cluster.)
224+
```
225+
2. **Uncommented + edited the dormant pg-proxy block** in
226+
`networkpolicy.yaml` (lines ~88–103) to match those real ns/labels.
227+
3. **Confirmed Cilium (the CNI) actually enforces NetworkPolicy** in this
228+
cluster (`kubectl get ds -n kube-system | grep cilium`).
229+
230+
Only then:
231+
232+
```bash
233+
kubectl apply --dry-run=server -f k8s/data/networkpolicy.yaml # read diff
234+
kubectl apply -f k8s/data/networkpolicy.yaml
235+
```
236+
237+
### S2 verification gate (must run IMMEDIATELY after apply)
238+
239+
```bash
240+
# (a) Legit in-cluster caller (provisioner) still reaches postgres-customers:
241+
kubectl exec -n instant-infra deploy/instant-provisioner -- \
242+
sh -c 'nc -z -w5 postgres-customers.instant-data.svc.cluster.local 5432 && echo OK'
243+
# Expect: OK
244+
245+
# (b) THE CUSTOMER PATH still works — connect a real customer usr_<token>
246+
# through pg.instanode.dev (same DSN as S1 check (c)). If this now FAILS
247+
# where it worked pre-apply, the pg-proxy allow-rule is missing/wrong:
248+
# kubectl delete -f k8s/data/networkpolicy.yaml # IMMEDIATE rollback
249+
# then fix the dormant pg-proxy block and re-apply.
250+
251+
# (c) The 4 NetworkPolicies are present:
252+
kubectl get networkpolicy -n instant-data
253+
# Expect: postgres-customers-ingress, redis-provision-ingress, mongodb-ingress,
254+
# nats-ingress.
255+
```
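Check (c) can be made explicit by comparing the live policy names against the four expected ones. This is a sketch only: `missing_policies` is a hypothetical helper, and the policy names are the ones listed in the verification block above.

```bash
# Hypothetical helper: reads `kubectl get networkpolicy -o name` output on
# stdin and reports any of the four expected policies that is absent.
# Returns 0 when all four are present, 1 otherwise.
missing_policies() {
  # Strip the "networkpolicy.networking.k8s.io/" resource prefix.
  present="$(sed 's|.*/||')"
  rc=0
  for p in postgres-customers-ingress redis-provision-ingress mongodb-ingress nats-ingress; do
    if ! printf '%s\n' "$present" | grep -qx "$p"; then
      echo "MISSING: $p"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   kubectl get networkpolicy -n instant-data -o name | missing_policies
```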
256+
257+
**Rollback S2 (do this fast if customers report connection errors):**
258+
259+
```bash
260+
kubectl delete -f k8s/data/networkpolicy.yaml
261+
# Removing the policies returns the pods to allow-all ingress (the pre-apply
262+
# state). No data loss; instant effect.
263+
```
264+
265+
---
266+
267+
## Apply-order summary
268+
269+
| # | Tag | Command | Verify | Reversible |
270+
|---|---|---|---|---|
271+
| 1 | R7-A | `kubectl apply -f k8s/data/stateful-priority.yaml` | `kubectl get pdb,priorityclass -n instant-data` | `kubectl delete -f …` |
272+
| 2 | R7-B | apply `mongodb.yaml`,`redis-provision.yaml` + patch `priorityClassName` | QoS = Burstable | re-apply prior manifest |
273+
| 3 | R6 | `kubectl apply -f k8s/data/nats.yaml` (+ priorityClassName patch) | PVC Bound + durability publish/restart test | revert manifest |
274+
| 4 | S1 | apply lockdown ConfigMap + `postgres-customers.yaml` (+ patch) | **external admin psql REJECTED** | per lockdown runbook |
275+
| 5 | S2 | **edit dormant pg-proxy rule FIRST**, then apply `networkpolicy.yaml` | provisioner reaches pg AND customer path works | `kubectl delete -f …` |
276+
277+
After every step, sanity-check the platform hot path:
278+
279+
```bash
280+
curl -sS https://api.instanode.dev/healthz | jq .
281+
curl -sS https://api.instanode.dev/readyz | jq . # data-tier deep readiness
282+
```
283+
284+
---
285+
286+
## Related
287+
288+
- `k8s/APPLY-CHECKLIST.md` — the api/worker/provisioner Deployment apply rules.
289+
- `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md` — the full S1 procedure + root cause.
290+
- `NATS-AUTH-RUNBOOK.md` — NATS operator-mode key generation (separate from R6).
291+
- `k8s/data/networkpolicy.yaml` — the S2 policy with the dormant pg-proxy block.
292+
- CLAUDE.md rule 15 — why this repo has no auto-apply.

k8s/data/mongodb.yaml

Lines changed: 11 additions & 0 deletions
@@ -32,6 +32,17 @@ spec:
3232
image: mongo:7
3333
ports:
3434
- containerPort: 27017
35+
# R7 (2026-06-10): requests added so this pod is Burstable, not
36+
# BestEffort (BestEffort = first evicted under the cluster's memory
37+
# overcommit). WiredTiger sizes its cache to 50% of (RAM - 1GB) by
38+
# default; the 1Gi limit keeps that bounded for the free-tier nosql
39+
# footprint. Bump both if dedicated/Team mongodb lands here.
40+
resources:
41+
requests:
42+
memory: "256Mi"
43+
cpu: "100m"
44+
limits:
45+
memory: "1Gi"
3546
env:
3647
- name: MONGO_INITDB_ROOT_USERNAME
3748
value: root

k8s/data/nats.yaml

Lines changed: 45 additions & 2 deletions
@@ -72,6 +72,43 @@ data:
7272
resolver: MEMORY
7373
7474
---
75+
# JetStream durability (R6, 2026-06-10). Before this PVC, the JetStream
76+
# store_dir (/data/jetstream) was an emptyDir{} — every pod restart (the
77+
# Recreate rollout, a node-pressure eviction, or a node drain) WIPED all stream + consumer
78+
# state and every persisted message. For a queue product that promises
79+
# "queue data survives pod restarts" that is a durability lie. This PVC backs
80+
# /data/jetstream with real block storage so stream/consumer state + messages
81+
# persist across restarts.
82+
#
83+
# 5Gi is conservative: sized to today's tiny queue footprint, not to the
# JetStream config's max_file_store: 50GB CEILING. Grow with
85+
# `kubectl edit pvc nats-jetstream-pvc` (do-block-storage / EBS support online
86+
# expansion when allowVolumeExpansion=true on the StorageClass) if file_store
87+
# usage approaches the request. Do NOT pre-allocate 50Gi — that is the hard
88+
# ceiling, not the working set.
89+
#
90+
# storageClassName is OMITTED → falls back to the cluster default. On DOKS prod
91+
# that default is `do-block-storage` (confirmed in k8s/self-hosted-runner.yaml
92+
# :152 + k8s/data/postgres-customers.yaml, which use the same omit-for-default
93+
# convention). Local dev (Rancher Desktop / k3s) gets `local-path` via the
94+
# cluster default there, or layer a kustomize overlay setting
95+
# storageClassName: local-path. Block storage is RWO single-attach, which is
96+
# why the Deployment below MUST stay strategy.type: Recreate (a RollingUpdate
97+
# would Multi-Attach-deadlock the new pod against the old holder — same
98+
# constraint postgres-customers.yaml documents).
99+
apiVersion: v1
100+
kind: PersistentVolumeClaim
101+
metadata:
102+
name: nats-jetstream-pvc
103+
namespace: instant-data
104+
labels:
105+
app: nats
106+
spec:
107+
accessModes: [ReadWriteOnce]
108+
resources:
109+
requests:
110+
storage: 5Gi
111+
---
75112
apiVersion: apps/v1
76113
kind: Deployment
77114
metadata:
@@ -172,9 +209,15 @@ spec:
172209
# Secret + restart. Once the Secret exists the pod converges.
173210
optional: false
174211
- name: rendered-conf
175-
emptyDir: {}
212+
emptyDir: {} # render scratch only — operator.conf is re-rendered
213+
# from the nats-operator Secret by the initContainer on
214+
# every pod start, so this one stays ephemeral by design.
176215
- name: jetstream-data
177-
emptyDir: {} # TODO: convert to PVC for prod durability
216+
# R6 (2026-06-10): was emptyDir{} — now PVC-backed so JetStream
217+
# stream/consumer state + persisted messages survive pod restarts.
218+
# See the nats-jetstream-pvc PersistentVolumeClaim above.
219+
persistentVolumeClaim:
220+
claimName: nats-jetstream-pvc
178221
---
179222
apiVersion: v1
180223
kind: Service
