diff --git a/k8s/APPLY-CHECKLIST.md b/k8s/APPLY-CHECKLIST.md
index ea3c463..d449782 100644
--- a/k8s/APPLY-CHECKLIST.md
+++ b/k8s/APPLY-CHECKLIST.md
@@ -14,6 +14,12 @@ This checklist applies to:
 Per CLAUDE.md rule 15: **this repo has no auto-apply by design.** Manifest
 apply is a deliberate, human-driven step.
 
+> **Stateful data-tier manifests** (`k8s/data/*` — postgres-customers,
+> mongodb, redis-provision, nats: PVCs, NetworkPolicy, pg_hba lockdown,
+> PriorityClass/PDBs) have their own apply order + verification gates in
+> **`k8s/DATA-TIER-APPLY-RUNBOOK.md`**. Use that for S1/S2/R6/R7. This file
+> is for the api/worker/provisioner Deployment manifests only.
+
 ---
 
 ## Hard rules
diff --git a/k8s/DATA-TIER-APPLY-RUNBOOK.md b/k8s/DATA-TIER-APPLY-RUNBOOK.md
new file mode 100644
index 0000000..0fd4a7e
--- /dev/null
+++ b/k8s/DATA-TIER-APPLY-RUNBOOK.md
@@ -0,0 +1,292 @@
+# Data-Tier Apply Runbook — `instant-data` stateful hardening
+
+> Companion to `k8s/APPLY-CHECKLIST.md` (which covers the api/worker/
+> provisioner **Deployment** manifests). This runbook covers the **stateful
+> data-tier** manifests in `k8s/data/` — the ones that hold real customer data
+> and therefore must be applied deliberately, in order, in a maintenance
+> window. **This repo has no auto-apply (CLAUDE.md rule 15).**
+>
+> **CRITICAL: never `kubectl apply -f k8s/app.yaml`** (stale vs prod — strips
+> `imagePullSecrets`, resets images). The files below are individually
+> applyable; apply them one at a time and read each `--dry-run=server` diff.
+
+This runbook is the operator apply checklist for four changes that are
+**committed but NOT yet applied to prod** (infra has no auto-apply):
+
+| Tag | File | What it does | Customer-visible risk if mis-applied |
+|---|---|---|---|
+| **S1** | `k8s/data/postgres-customers-lockdown.yaml` + the patched `postgres-customers.yaml` | pg_hba that REJECTS the admin/superuser roles (`instanode_admin`, `instant_cust`) from the public path; preserves `usr_*` customer roles | LOW — admin-only reject; customers unaffected. Detailed runbook: `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md` |
+| **S2** | `k8s/data/networkpolicy.yaml` | ingress NetworkPolicy: only provisioner/migrator/worker (+ nats-proxy for 4222) may reach the data pods | **HIGH — can break ALL customers** if the pg-proxy allow-rule is missing. See §S2 below. |
+| **R6** | `k8s/data/nats.yaml` | JetStream `emptyDir{}` → PVC (`nats-jetstream-pvc`, 5Gi) so queue data survives restarts | LOW — but the cutover (§R6) starts JetStream on an empty volume; existing `emptyDir{}` state is not carried over. |
+| **R7** | `k8s/data/stateful-priority.yaml` + resource requests in `{postgres-customers,mongodb,redis-provision}.yaml` | PriorityClass `instant-data-critical` + one PDB per stateful pod + right-sized requests (BestEffort → Burstable) | LOW — eviction-ordering + drain-gating only; no data-path change. |
+
+---
+
+## Pre-flight (every apply below)
+
+```bash
+# 1. Confirm context — NEVER run against the wrong cluster.
+kubectl config current-context
+# Expected for prod: do-nyc3-instant-prod
+
+# 2. Snapshot current data-tier state for rollback reference.
+kubectl get pods,pvc,netpol,pdb,priorityclass -n instant-data -o wide
+kubectl get priorityclass instant-data-critical 2>/dev/null || echo "no priorityclass yet"
+
+# 3. Server-side dry-run EACH file and read the diff line by line.
+kubectl apply --dry-run=server -f <file>
+```
+
+Apply in a **maintenance window**. The recommended order is **R7 → R6 → S1 →
+S2** — least-risky and reversible first, the customer-breaking NetworkPolicy
+LAST so it is the freshest thing in your head if customers report errors.
+
+---
+
+## R7 — PriorityClass + PDBs + resource requests (apply FIRST)
+
+Pure eviction-protection; no data-path change. Two parts.
+
+**Part A — the PriorityClass + PDBs:**
+
+```bash
+kubectl apply --dry-run=server -f k8s/data/stateful-priority.yaml   # read diff
+kubectl apply              -f k8s/data/stateful-priority.yaml
+
+# Verify
+kubectl get priorityclass instant-data-critical
+kubectl get pdb -n instant-data
+# Expect 4 PDBs (postgres-customers / mongodb / redis-provision / nats),
+# each ALLOWED DISRUPTIONS reading 0 (single replica, minAvailable 1 → the one
+# pod is "not disruptable" by voluntary eviction, which is the point).
+```
+
+**Part B — the resource requests + the priorityClassName patch.** The requests
+ship INSIDE each workload manifest (`mongodb.yaml`, `redis-provision.yaml`,
+`postgres-customers.yaml`). Re-applying those manifests rolls the pod (Recreate
+strategy → brief downtime per workload — do this in the window). Because the
+PriorityClass is deliberately NOT inlined in the Deployments (so the priority
+rollout is one auditable step), patch `priorityClassName` in the same roll:
+
+```bash
+# postgres-customers carries the S1 pg_hba mount already — apply it as part of
+# S1 below (§S1) to avoid two rolls. For mongodb + redis-provision, roll now:
+for w in mongodb redis-provision; do
+  kubectl apply --dry-run=server -f k8s/data/$w.yaml   # read diff: only resources{} added
+  kubectl apply              -f k8s/data/$w.yaml
+  kubectl patch deploy/$w -n instant-data --type=merge \
+    -p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}'
+  kubectl rollout status deploy/$w -n instant-data --timeout=180s
+done
+
+# Verify QoS flipped from BestEffort → Burstable and priority is set:
+kubectl get pod -n instant-data -l app=mongodb \
+  -o jsonpath='{.items[0].status.qosClass}{" "}{.items[0].spec.priorityClassName}{"\n"}'
+# Expect: Burstable instant-data-critical
+```
+
+> nats already declared requests; just patch its `priorityClassName` (do it in
+> the R6 roll below so nats only restarts once).
+
+**Rollback R7:** `kubectl delete -f k8s/data/stateful-priority.yaml` removes the
+PDBs + PriorityClass. Running pods are unaffected, but any NEW pod that still
+names the deleted class is rejected at admission, so un-patch `priorityClassName` first.
+
+---
+
+## R6 — NATS JetStream emptyDir → PVC
+
+`k8s/data/nats.yaml` now declares `nats-jetstream-pvc` (5Gi, default
+StorageClass = `do-block-storage` on DOKS) and mounts it at `/data/jetstream`.
+
+> **Data note:** pre-cutover JetStream state lived in `emptyDir{}` and is
+> **already non-durable** (every prior restart wiped it). Switching to the PVC
+> does NOT migrate the old `emptyDir{}` state; there is nothing durable to migrate.
+> Existing `legacy_open` queue resources re-establish their streams when
+> clients reconnect (same as any nats restart today). Schedule during low queue
+> traffic; clients reconnect automatically.
+
+```bash
+kubectl apply --dry-run=server -f k8s/data/nats.yaml   # read diff: PVC added, volume swapped
+
+# The Deployment uses strategy.type: Recreate (RWO volume — required). Applying
+# rolls the pod: old pod terminates, PVC binds, new pod starts on /data/jetstream.
+kubectl apply -f k8s/data/nats.yaml
+
+# Patch the PriorityClass in the SAME context so nats restarts once (R7 part B):
+kubectl patch deploy/nats -n instant-data --type=merge \
+  -p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}'
+
+kubectl rollout status deploy/nats -n instant-data --timeout=180s
+
+# Verify the PVC bound and JetStream is on it:
+kubectl get pvc nats-jetstream-pvc -n instant-data        # STATUS Bound
+kubectl exec -n instant-data deploy/nats -- ls -la /data/jetstream
+kubectl get pod -n instant-data -l app=nats \
+  -o jsonpath='{.items[0].status.qosClass}{" "}{.items[0].spec.priorityClassName}{"\n"}'
+# Expect a jetstream dir on the mounted PVC + Burstable instant-data-critical.
+
+# Durability proof (the whole point): publish to a stream, delete the pod,
+# confirm the message survives the restart.
+# kubectl exec ... nats pub test.durability hello ; kubectl delete pod -n instant-data -l app=nats ;
+# (after Ready) nats stream info / consumer next — message must still be there.
+```
+
+**Rollback R6:** revert the `nats.yaml` change and re-apply (volume goes back to
+`emptyDir{}`). The PVC can be left bound (it costs a few cents) or deleted with
+`kubectl delete pvc nats-jetstream-pvc -n instant-data` once nats is off it.
+
+---
+
+## S1 — postgres-customers admin lockdown
+
+Full procedure (root-cause, role analysis, proxy-IP SNAT caveat, the live
+pg_hba stopgap) is in **`POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md`** — follow THAT
+for S1; this is the short pointer + the verification gate.
+
+Apply order: the `postgres-customers-hba` ConfigMap (in
+`postgres-customers-lockdown.yaml`) FIRST, then the patched
+`postgres-customers.yaml` (which mounts the hba file via subPath + sets
+`hba_file=/etc/postgresql/pg_hba.conf` and now also carries the R7
+resource requests). Roll postgres-customers ONCE for both:
+
+```bash
+kubectl apply -f k8s/data/postgres-customers-lockdown.yaml   # ConfigMap (+ any docs)
+kubectl apply -f k8s/data/postgres-customers.yaml            # mounts hba + R7 requests
+kubectl patch deploy/postgres-customers -n instant-data --type=merge \
+  -p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}'
+kubectl rollout status deploy/postgres-customers -n instant-data --timeout=300s
+```
+
+> ⚠️ Read `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §3a` BEFORE applying — the
+> pg-proxy SNATs customer traffic to a pod IP, so the lockdown rejects the
+> admin role BY ROLE NAME (`instanode_admin` AND `instant_cust`), not by source
+> IP. If the runbook's proxy-pod-IP reject lines are stale, fix them first.
+
+### S1 verification gate (the load-bearing check)
+
+After the roll, the **external admin path MUST FAIL** while in-cluster admin and
+customer paths keep working:
+
+```bash
+# (a) EXTERNAL admin connect MUST be REJECTED by pg_hba (NOT a password prompt
+#     that proceeds). SAFE: connection-rejection probe only — no SQL/DDL.
+PGCONNECT_TIMEOUT=5 psql \
+  "host=pg.instanode.dev port=5432 user=instant_cust dbname=instant_customers sslmode=require" \
+  -c '\q' 2>&1 | head
+# EXPECT: 'pg_hba.conf rejects connection for host ...' when a reject line
+#         matches (or FATAL 28000 from the proxy). NO REJECTION = lockdown not in effect. STOP, investigate.
+
+# Repeat for the OTHER admin role (the confirmed truehomie role):
+PGCONNECT_TIMEOUT=5 psql \
+  "host=pg.instanode.dev port=5432 user=instanode_admin dbname=instant_customers sslmode=require" \
+  -c '\q' 2>&1 | head
+# EXPECT: rejected.
+
+# (b) In-cluster admin still works (provisioner path is intact):
+kubectl exec -n instant-data deploy/postgres-customers -- \
+  psql -U instant_cust -d instant_customers -tAc 'select 1;'   # expect: 1
+
+# (c) A real customer usr_<token> still connects through the public path
+#     (regression check — the lockdown must NOT catch customer roles).
+#     Use a known test-tenant DSN from the dashboard / a /db/new claim.
+```
+
+**Rollback S1:** see `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §Rollback` (revert
+the manifest, the pod falls back to the stock catch-all pg_hba; the live file
+backup is at `$PGDATA/pg_hba.conf.bak.2026-06-03`).
+
+---
+
+## S2 — data-tier ingress NetworkPolicy (apply LAST — highest risk)
+
+`k8s/data/networkpolicy.yaml` adds a default-deny ingress policy per data pod,
+allowing ONLY provisioner / migrator / worker (+ nats-proxy for 4222/8222).
+
+### ⚠️ The pg-proxy allow-rule — this is the customer-breaking trap
+
+The `postgres-customers-ingress` policy as committed **does NOT list
+`instant-pg-proxy`** — the allow-rule for it is **DORMANT** (commented out at
+`networkpolicy.yaml` lines ~88–103). If the public customer connect path is
+`pg.instanode.dev → ingress-nginx tcp-services → instant-pg-proxy
+(instant ns) → postgres-customers`, then applying this policy AS-IS
+**default-denies the proxy and BREAKS EVERY CUSTOMER POSTGRES CONNECTION.**
+
+`POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §L4` records that as of 2026-06-06 the
+NetworkPolicy is **NOT applied in prod** and that applying it as-is would
+default-deny + break the proxy path. **Do not apply S2 until you have:**
+
+1. **Confirmed the live proxy deployment's namespace + pod labels:**
+   ```bash
+   kubectl get pods -A -l app=instant-pg-proxy -o wide
+   # (the proxy manifest lives in the separate InstaNode-dev/instant-pg-proxy
+   #  repo, NOT here — read the real ns + labels off the live cluster.)
+   ```
+2. **Uncommented + edited the dormant pg-proxy block** in
+   `networkpolicy.yaml` (lines ~88–103) to match those real ns/labels.
+3. **Confirmed Cilium (the CNI) actually enforces NetworkPolicy** in this
+   cluster (`kubectl get ds -n kube-system | grep cilium`).
+
+Only then:
+
+```bash
+kubectl apply --dry-run=server -f k8s/data/networkpolicy.yaml   # read diff
+kubectl apply              -f k8s/data/networkpolicy.yaml
+```
+
+### S2 verification gate (must run IMMEDIATELY after apply)
+
+```bash
+# (a) Legit in-cluster caller (provisioner) still reaches postgres-customers:
+kubectl exec -n instant-infra deploy/instant-provisioner -- \
+  sh -c 'nc -z -w5 postgres-customers.instant-data.svc.cluster.local 5432 && echo OK'
+# Expect: OK
+
+# (b) THE CUSTOMER PATH still works — connect a real customer usr_<token>
+#     through pg.instanode.dev (same DSN as S1 check (c)). If this now FAILS
+#     where it worked pre-apply, the pg-proxy allow-rule is missing/wrong:
+#       kubectl delete -f k8s/data/networkpolicy.yaml   # IMMEDIATE rollback
+#     then fix the dormant pg-proxy block and re-apply.
+
+# (c) The 4 NetworkPolicies are present:
+kubectl get networkpolicy -n instant-data
+# Expect: postgres-customers-ingress, redis-provision-ingress, mongodb-ingress,
+#         nats-ingress.
+```
+
+**Rollback S2 (do this fast if customers report connection errors):**
+
+```bash
+kubectl delete -f k8s/data/networkpolicy.yaml
+# Removing the policies returns the pods to allow-all ingress (the pre-apply
+# state). No data loss; instant effect.
+```
+
+---
+
+## Apply-order summary
+
+| # | Tag | Command | Verify | Reversible |
+|---|---|---|---|---|
+| 1 | R7-A | `kubectl apply -f k8s/data/stateful-priority.yaml` | `kubectl get pdb,priorityclass -n instant-data` | `kubectl delete -f …` |
+| 2 | R7-B | apply `mongodb.yaml`,`redis-provision.yaml` + patch `priorityClassName` | QoS = Burstable | re-apply prior manifest |
+| 3 | R6 | `kubectl apply -f k8s/data/nats.yaml` (+ priorityClassName patch) | PVC Bound + durability publish/restart test | revert manifest |
+| 4 | S1 | apply lockdown ConfigMap + `postgres-customers.yaml` (+ patch) | **external admin psql REJECTED** | per lockdown runbook |
+| 5 | S2 | **edit dormant pg-proxy rule FIRST**, then apply `networkpolicy.yaml` | provisioner reaches pg AND customer path works | `kubectl delete -f …` |
+
+After every step, sanity-check the platform hot path:
+
+```bash
+curl -sS https://api.instanode.dev/healthz | jq .
+curl -sS https://api.instanode.dev/readyz  | jq .   # data-tier deep readiness
+```
+
+---
+
+## Related
+
+- `k8s/APPLY-CHECKLIST.md` — the api/worker/provisioner Deployment apply rules.
+- `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md` — the full S1 procedure + root cause.
+- `NATS-AUTH-RUNBOOK.md` — NATS operator-mode key generation (separate from R6).
+- `k8s/data/networkpolicy.yaml` — the S2 policy with the dormant pg-proxy block.
+- CLAUDE.md rule 15 — why this repo has no auto-apply.
diff --git a/k8s/data/mongodb.yaml b/k8s/data/mongodb.yaml
index 64b23f4..eaad8e0 100644
--- a/k8s/data/mongodb.yaml
+++ b/k8s/data/mongodb.yaml
@@ -32,6 +32,17 @@ spec:
           image: mongo:7
           ports:
             - containerPort: 27017
+          # R7 (2026-06-10): requests added so this pod is Burstable, not
+          # BestEffort (BestEffort = first evicted under the cluster's memory
+          # overcommit). WiredTiger sizes its cache to 50% of (RAM - 1GB) by
+          # default; the 1Gi limit keeps that bounded for the free-tier nosql
+          # footprint. Bump both if dedicated/Team mongodb lands here.
+          resources:
+            requests:
+              memory: "256Mi"
+              cpu: "100m"
+            limits:
+              memory: "1Gi"
           env:
             - name: MONGO_INITDB_ROOT_USERNAME
               value: root
diff --git a/k8s/data/nats.yaml b/k8s/data/nats.yaml
index a95e994..d9ee96b 100644
--- a/k8s/data/nats.yaml
+++ b/k8s/data/nats.yaml
@@ -72,6 +72,43 @@ data:
     resolver: MEMORY
 
 ---
+# JetStream durability (R6, 2026-06-10). Before this PVC, the JetStream
+# store_dir (/data/jetstream) was an emptyDir{} — every pod restart (the
+# Recreate rollout, an OOMKill, or a node drain) WIPED all stream + consumer
+# state and every persisted message. For a queue product that promises
+# "queue data survives pod restarts" that is a durability lie. This PVC backs
+# /data/jetstream with real block storage so stream/consumer state + messages
+# persist across restarts.
+#
+# 5Gi is conservative: sized for today's tiny queue volume, NOT for the
+# JetStream config's max_file_store: 50GB ceiling. Grow with
+# `kubectl edit pvc nats-jetstream-pvc` (do-block-storage / EBS support online
+# expansion when allowVolumeExpansion=true on the StorageClass) if file_store
+# usage approaches the request. Do NOT pre-allocate 50Gi — that is the hard
+# ceiling, not the working set.
+#
+# storageClassName is OMITTED → falls back to the cluster default. On DOKS prod
+# that default is `do-block-storage` (confirmed in k8s/self-hosted-runner.yaml
+# :152 + k8s/data/postgres-customers.yaml, which use the same omit-for-default
+# convention). Local dev (Rancher Desktop / k3s) gets `local-path` via the
+# cluster default there, or layer a kustomize overlay setting
+# storageClassName: local-path. Block storage is RWO single-attach, which is
+# why the Deployment below MUST stay strategy.type: Recreate (a RollingUpdate
+# would Multi-Attach-deadlock the new pod against the old holder — same
+# constraint postgres-customers.yaml documents).
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: nats-jetstream-pvc
+  namespace: instant-data
+  labels:
+    app: nats
+spec:
+  accessModes: [ReadWriteOnce]
+  resources:
+    requests:
+      storage: 5Gi
+---
 apiVersion: apps/v1
 kind: Deployment
 metadata:
@@ -172,9 +209,15 @@ spec:
           # Secret + restart. Once the Secret exists the pod converges.
           optional: false
       - name: rendered-conf
-        emptyDir: {}
+        emptyDir: {}   # render scratch only — operator.conf is re-rendered
+                       # from the nats-operator Secret by the initContainer on
+                       # every pod start, so this one stays ephemeral by design.
       - name: jetstream-data
-        emptyDir: {}   # TODO: convert to PVC for prod durability
+        # R6 (2026-06-10): was emptyDir{} — now PVC-backed so JetStream
+        # stream/consumer state + persisted messages survive pod restarts.
+        # See the nats-jetstream-pvc PersistentVolumeClaim above.
+        persistentVolumeClaim:
+          claimName: nats-jetstream-pvc
 ---
 apiVersion: v1
 kind: Service
diff --git a/k8s/data/postgres-customers.yaml b/k8s/data/postgres-customers.yaml
index 413b631..421b68c 100644
--- a/k8s/data/postgres-customers.yaml
+++ b/k8s/data/postgres-customers.yaml
@@ -50,6 +50,21 @@ spec:
             - "password_encryption=scram-sha-256"
           ports:
             - containerPort: 5432
+          # R7 (2026-06-10): requests added so this pod is Burstable, not
+          # BestEffort. postgres-customers holds EVERY customer db_<token> —
+          # evicting it first under memory pressure (BestEffort QoS) is the
+          # worst possible eviction order. request 256Mi covers shared_buffers
+          # + idle backends for the free-tier footprint; limit 1Gi bounds it
+          # while leaving room for connection-heavy load. Pair with the
+          # instant-data-critical PriorityClass (applied via
+          # DATA-TIER-APPLY-RUNBOOK.md) so it is also scheduled ahead of and
+          # preempted after stateless app pods.
+          resources:
+            requests:
+              memory: "256Mi"
+              cpu: "100m"
+            limits:
+              memory: "1Gi"
           env:
             - name: POSTGRES_DB
               value: instant_customers
diff --git a/k8s/data/redis-provision.yaml b/k8s/data/redis-provision.yaml
index bfbb47b..1a6f8d2 100644
--- a/k8s/data/redis-provision.yaml
+++ b/k8s/data/redis-provision.yaml
@@ -44,6 +44,19 @@ spec:
             - "yes"
             - --dir
             - /data
+          # R7 (2026-06-10): requests added so this pod is Burstable, not
+          # BestEffort — BestEffort is the FIRST thing evicted under the
+          # cluster's memory overcommit. request 128Mi covers idle + small
+          # working set; limit 384Mi gives headroom over the --maxmemory 256mb
+          # cap (Redis RSS runs above the dataset due to fragmentation + COW on
+          # AOF rewrite). cpu request is a floor only (Redis is single-threaded;
+          # no cpu limit to avoid throttling latency-sensitive ops).
+          resources:
+            requests:
+              memory: "128Mi"
+              cpu: "50m"
+            limits:
+              memory: "384Mi"
           volumeMounts:
             - name: redis-data
               mountPath: /data
diff --git a/k8s/data/stateful-priority.yaml b/k8s/data/stateful-priority.yaml
new file mode 100644
index 0000000..f550f6c
--- /dev/null
+++ b/k8s/data/stateful-priority.yaml
@@ -0,0 +1,142 @@
+---
+# Stateful data-tier eviction protection (R7, 2026-06-10).
+#
+# ─────────────────────────────────────────────────────────────────────────────
+# WHY THIS EXISTS
+# ─────────────────────────────────────────────────────────────────────────────
+# The instant-data namespace holds the platform's stateful workloads —
+# postgres-customers (every customer db_<token>), mongodb (nosql resources),
+# redis-provision (cache resources), and nats (JetStream queues). These are
+# single-replica, PVC-backed, and CANNOT be casually rescheduled: an eviction
+# is a customer-visible outage (postgres-customers Recreate downtime) or, for
+# emptyDir-era nats, data loss.
+#
+# The cluster runs at 200%+ memory OVERCOMMIT (sum of limits >> node
+# allocatable). Under node memory pressure the kubelet evicts pods in this
+# order: (1) BestEffort (no requests/limits) first, then (2) Burstable using
+# MORE than their request, then by Priority ASCENDING. Two failure modes today:
+#   1. mongodb / redis-provision / postgres-customers declare NO resource
+#      requests → BestEffort QoS → FIRST evicted under pressure, ahead of any
+#      stateless app pod. The data tier should be the LAST thing evicted, not
+#      the first. (Right-sized requests land in their own manifests — see
+#      k8s/data/{mongodb,redis-provision,postgres-customers}.yaml.)
+#   2. No PriorityClass → these pods sit at the default priority (0), tied with
+#      every throwaway build/preview pod. A high PriorityClass makes the
+#      scheduler evict lower-priority stateless pods to make room for the data
+#      tier, and protects the data tier from preemption.
+#
+# This file ships the eviction-protection PRIMITIVES (PriorityClass + one PDB
+# per stateful workload). The matching resource requests live in each
+# workload's own manifest so the QoS change is reviewable alongside the
+# workload it sizes.
+#
+# minio is intentionally NOT covered: the self-hosted MinIO Deployment was
+# retired 2026-05-20 (DO Spaces is the canonical object store — see
+# APPLY-CHECKLIST.md §"MinIO retirement"). There is no minio workload in
+# instant-data to protect. If a local-dev minio is ever re-introduced, add its
+# PDB here.
+#
+# ─────────────────────────────────────────────────────────────────────────────
+# PriorityClass
+# ─────────────────────────────────────────────────────────────────────────────
+# Cluster-scoped (PriorityClass is not namespaced). value 1_000_000 sits ABOVE
+# the default (0) and above typical app workloads, but BELOW the reserved
+# system-node-critical (2000001000) / system-cluster-critical (2000000000)
+# bands so we never starve kube-system. preemptionPolicy PreemptLowerPriority
+# (the default) lets the scheduler evict lower-priority stateless pods to
+# schedule a data-tier pod that would otherwise go Pending — exactly what we
+# want when a node is tight. globalDefault MUST stay false (true would silently
+# give EVERY unclassified pod this priority and defeat the point).
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: instant-data-critical
+value: 1000000
+globalDefault: false
+preemptionPolicy: PreemptLowerPriority
+description: >-
+  Stateful data-tier workloads in instant-data (postgres-customers, mongodb,
+  redis-provision, nats). High priority so they are scheduled ahead of and
+  evicted after stateless app pods under the cluster's memory overcommit.
+  Apply the priorityClassName to each workload's pod template via the operator
+  patch in DATA-TIER-APPLY-RUNBOOK.md (kept out of the Deployment manifests
+  there so the priority rollout is a single auditable step).
+---
+# ─────────────────────────────────────────────────────────────────────────────
+# PodDisruptionBudgets — minAvailable: 1 per stateful workload.
+# ─────────────────────────────────────────────────────────────────────────────
+# A PDB with minAvailable: 1 blocks VOLUNTARY disruptions (node drain /
+# `kubectl drain`, cluster autoscaler scale-down, a rolling node upgrade) from
+# evicting the single replica through the Eviction API. It does NOT stop
+# INVOLUNTARY eviction (OOM kill, node hardware failure, kubelet
+# memory-pressure eviction) — that is what the PriorityClass + resource
+# requests above defend against. The two mechanisms are complementary:
+# PriorityClass/requests = involuntary eviction ordering; PDB = voluntary
+# disruption gate.
+#
+# CAVEAT for single-replica workloads: minAvailable: 1 on a 1-replica
+# Deployment leaves ZERO allowed disruptions, so the Eviction API refuses the
+# pod and `kubectl drain` BLOCKS on it. Nothing creates a replacement on its
+# own: the Deployment already runs its one replica, and for PVC-backed RWO
+# workloads (all four here) a second pod could not attach the volume anyway.
+# The drain therefore waits until an operator acts. This is the intended
+# behaviour: a drain that would take customer Postgres offline now stops for
+# a deliberate decision instead of yanking it. To finish a maintenance drain,
+# delete the pod in the window (`kubectl delete pod` is not gated by the PDB;
+# the ReplicaSet recreates it on a schedulable node and the volume
+# re-attaches), or use --disable-eviction as a deliberate, logged override.
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: postgres-customers-pdb
+  namespace: instant-data
+  labels:
+    app: postgres-customers
+    app.kubernetes.io/component: data
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app: postgres-customers
+---
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: mongodb-pdb
+  namespace: instant-data
+  labels:
+    app: mongodb
+    app.kubernetes.io/component: data
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app: mongodb
+---
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: redis-provision-pdb
+  namespace: instant-data
+  labels:
+    app: redis-provision
+    app.kubernetes.io/component: data
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app: redis-provision
+---
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: nats-pdb
+  namespace: instant-data
+  labels:
+    app: nats
+    app.kubernetes.io/component: data
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app: nats
diff --git a/k8s/prometheus-rules.yaml b/k8s/prometheus-rules.yaml
index 4021752..27a6af4 100644
--- a/k8s/prometheus-rules.yaml
+++ b/k8s/prometheus-rules.yaml
@@ -935,6 +935,59 @@ spec:
               handlers/github_webhook.go push handler → deploy enqueue path.
               NR mirror: newrelic/alerts/github-pushdeploy-error.json.
 
+    # Razorpay inbound billing webhook signature gate (S4, 2026-06-10).
+    # Metric emitted by api/internal/handlers/billing.go on a failed
+    # HMAC-SHA256 verify of POST /razorpay/webhook (the 400 invalid_signature
+    # path):
+    #   instant_razorpay_webhook_sig_fail_total
+    # Lazy Counter — the series only materialises on the first failed verify.
+    # Mirror of newrelic/alerts/razorpay-webhook-sig-fail.json (P2). Pairs with the
+    # GitHub-webhook bad-signature alert above; same abuse class, higher blast
+    # radius (the signature gate is the only thing between a forged
+    # subscription.charged and a free plan upgrade).
+    - name: instant-billing
+      rules:
+        # Razorpay webhook signature-failure spike — forged billing payload or
+        # RAZORPAY_WEBHOOK_SECRET mismatch.
+        # P2/abuse: WARNING only. Non-zero is common on secret rotations that
+        # update one side (Razorpay dashboard vs instant-secrets) before the
+        # other — every legit retry then fails the gate. A sustained rate from
+        # a single IP carrying a well-formed-but-unsigned subscription.charged
+        # is an active upgrade-forgery attempt (the highest-value forgery on
+        # the platform). The 400 short-circuits BEFORE dispatch, so no plan_tier
+        # can flip — this alert is about visibility + WAF/source-block response,
+        # not damage already done.
+        - alert: RazorpayWebhookSigFailSpike
+          expr: |
+            increase(instant_razorpay_webhook_sig_fail_total[10m]) > 0
+          for: 10m
+          labels:
+            severity: warning
+            service: api
+          annotations:
+            summary: "Razorpay webhook signature failures > 0 in 10m — forged billing webhook or RAZORPAY_WEBHOOK_SECRET mismatch"
+            description: |
+              increase(instant_razorpay_webhook_sig_fail_total[10m]) > 0 for >10m. A POST to
+              /razorpay/webhook failed HMAC-SHA256 signature verification. Either
+              (a) an attacker is probing the billing webhook with a forged payload
+              (a correctly-shaped subscription.charged with a bad signature is the
+              highest-value forgery here — the gate is the only thing between it and
+              a free Pro/Team upgrade), or (b) RAZORPAY_WEBHOOK_SECRET in
+              instant-secrets drifted from the Razorpay dashboard (modal cause:
+              secret rotated on one side only → every legit retry fails), or (c) a
+              live/test secret split (handler tries RAZORPAY_WEBHOOK_SECRET then
+              RAZORPAY_TEST_WEBHOOK_SECRET; a TEST payload against a prod with no
+              test secret fails both). The 400 short-circuits BEFORE dispatch, so
+              no plan_tier flips. Investigate: grep NR Logs api
+              message='billing.webhook.signature_failed' for event_id + client IP.
+              If single-IP with a well-formed subscription.charged: ACTIVE forgery —
+              capture the IP for a WAF/source block; do NOT rotate blindly (rotation
+              fixes drift, not forgers). If legitimate Razorpay delivery retries with
+              no IP anomaly: secret mismatch — re-sync RAZORPAY_WEBHOOK_SECRET in
+              instant-secrets AND the Razorpay dashboard. Source: api/internal/
+              handlers/billing.go (verifyRazorpaySignature → invalid_signature 400).
+              NR mirror: newrelic/alerts/razorpay-webhook-sig-fail.json.
+
     # instant-worker — entitlement drift outpacing regrade (Rule 25 sweep 2026-06-04).
     # The entitlement_reconciler DETECTS Postgres resources whose connection cap no
     # longer matches their team's plan tier (instant_entitlement_drift_detected_total)
diff --git a/newrelic/alerts/razorpay-webhook-sig-fail.json b/newrelic/alerts/razorpay-webhook-sig-fail.json
new file mode 100644
index 0000000..b4fb738
--- /dev/null
+++ b/newrelic/alerts/razorpay-webhook-sig-fail.json
@@ -0,0 +1,31 @@
+{
+  "name": "instant-api — razorpay_webhook_sig_fail elevated [forged billing webhook or RAZORPAY_WEBHOOK_SECRET mismatch]",
+  "type": "NRQL",
+  "description": "P2/abuse. Fires when instant_razorpay_webhook_sig_fail_total rate is elevated over ~10m. A non-zero signature-failure count on POST /razorpay/webhook means one of: (a) an attacker is probing the billing webhook endpoint with a forged payload trying to drive a free plan upgrade (a correctly-shaped subscription.charged with a bad signature is the highest-value forgery on the platform — the signature gate is the ONLY thing standing between a forged success and a free Pro/Team upgrade), (b) RAZORPAY_WEBHOOK_SECRET in instant-secrets has drifted from the secret configured in the Razorpay dashboard (the modal cause after a secret rotation on one side only — Razorpay will then retry every real webhook and each retry fails the signature check), or (c) a misconfigured test/live secret split (the handler tries the live RAZORPAY_WEBHOOK_SECRET first, then RAZORPAY_TEST_WEBHOOK_SECRET — a TEST-mode payload against a prod with no test secret set fails both). Razorpay signatures are hex(HMAC-SHA256(key=webhookSecret, msg=rawBody)) with NO timestamp prefix (unlike Stripe); the handler constant-time-compares in verifyRazorpaySignature. Cross-correlate against NR Logs for api service message='billing.webhook.signature_failed' (carries event_id + client IP) and the audit_log row. If the rate tracks with a single IP and a well-formed-but-unsigned subscription.charged payload, treat as an ACTIVE upgrade-forgery attempt — do NOT rotate blindly (rotation won't stop a forger, it only fixes a legit-secret drift); confirm the customer's plan_tier did NOT flip (it cannot — the 400 short-circuits before the handler dispatches) and capture the source IP for a WAF block. If the rate tracks with legitimate Razorpay delivery retries and no IP anomaly, treat as secret mismatch: re-sync RAZORPAY_WEBHOOK_SECRET in instant-secrets AND the Razorpay dashboard. Source: api/internal/handlers/billing.go (verifyRazorpaySignature → invalid_signature 400); counter registered as instant_razorpay_webhook_sig_fail_total. Lazy counter — the series only appears at /metrics after the first failed verification. Mirrors the GitHub-webhook bad-signature alert (newrelic/alerts/github-webhook-bad-signature.json) and the Prom rule RazorpayWebhookSigFailSpike in k8s/prometheus-rules.yaml group instant-billing.",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT sum(instant_razorpay_webhook_sig_fail_total) FROM Metric WHERE metricName = 'instant_razorpay_webhook_sig_fail_total'"
+  },
+  "terms": [
+    {
+      "priority": "WARNING",
+      "operator": "ABOVE",
+      "threshold": 0,
+      "thresholdDuration": 600,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 60,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 120,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 3600,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
diff --git a/newrelic/dashboards/instanode-reliability.json b/newrelic/dashboards/instanode-reliability.json
index 8752ac0..551563f 100644
--- a/newrelic/dashboards/instanode-reliability.json
+++ b/newrelic/dashboards/instanode-reliability.json
@@ -982,7 +982,7 @@
           }
         },
         {
-          "title": "Orphan-DB sweep — current candidate backlog by kind (0 until enabled)",
+          "title": "Orphan-DB sweep \u2014 current candidate backlog by kind (0 until enabled)",
           "layout": {
             "column": 1,
             "row": 81,
@@ -1519,7 +1519,7 @@
           }
         },
         {
-          "title": "Layer-3 payment prober — outcomes per leg (6h) [money heartbeat]",
+          "title": "Layer-3 payment prober \u2014 outcomes per leg (6h) [money heartbeat]",
           "layout": {
             "column": 1,
             "row": 75,
@@ -1544,7 +1544,7 @@
           }
         },
         {
-          "title": "Layer-3 payment prober — fails (last 6h, must be 0; degraded excluded)",
+          "title": "Layer-3 payment prober \u2014 fails (last 6h, must be 0; degraded excluded)",
           "layout": {
             "column": 7,
             "row": 75,
@@ -1575,7 +1575,7 @@
           }
         },
         {
-          "title": "Layer-3 payment prober — P95 latency per leg (6h)",
+          "title": "Layer-3 payment prober \u2014 P95 latency per leg (6h)",
           "layout": {
             "column": 10,
             "row": 75,
@@ -1598,6 +1598,37 @@
               "ignoreTimeRange": false
             }
           }
+        },
+        {
+          "title": "Razorpay webhook \u2014 signature failures (1h, must be 0 in steady state) [S4]",
+          "layout": {
+            "column": 1,
+            "row": 84,
+            "width": 3,
+            "height": 3
+          },
+          "visualization": {
+            "id": "viz.billboard"
+          },
+          "rawConfiguration": {
+            "nrqlQueries": [
+              {
+                "accountIds": [
+                  0
+                ],
+                "query": "SELECT sum(instant_razorpay_webhook_sig_fail_total) AS 'sig_fail' FROM Metric WHERE metricName = 'instant_razorpay_webhook_sig_fail_total' SINCE 1 hour ago"
+              }
+            ],
+            "platformOptions": {
+              "ignoreTimeRange": false
+            },
+            "thresholds": [
+              {
+                "alertSeverity": "WARNING",
+                "value": 1
+              }
+            ]
+          }
         }
       ]
     }
diff --git a/observability/METRICS-CATALOG.md b/observability/METRICS-CATALOG.md
index 39b6326..63957b2 100644
--- a/observability/METRICS-CATALOG.md
+++ b/observability/METRICS-CATALOG.md
@@ -54,6 +54,7 @@ fires. Operators need this so they don't panic when a fresh deploy looks
 | `instant_github_webhook_received_total` | api | `event,result` | lazy (CounterVec — label series only materialise on first delivery of each `{event,result}` combination; `bad_signature` only appears after the first malformed/spoofed delivery; `ok` appears after the first valid push. event ∈ {push,installation,...}; result ∈ {ok,bad_signature,replay,no_match,error}. P4 GitHub App push-to-deploy, pre-staged 2026-06-03) | `github-webhook-bad-signature.json` | `GitHubWebhookBadSignatureSpike` | "GitHub webhook — received by event+result (6h)", "GitHub webhook — bad_signature count (1h, must be 0 in steady state)" |
 | `instant_github_pushdeploy_total` | api | `result` | lazy (CounterVec — label series materialise on first push matching an installation+connection; `error` only appears after the first enqueue failure. result ∈ {enqueued,rate_limited,no_connection,error}. Enqueued = happy path; rate_limited = expected; no_connection = repo not linked to a stack; error = broken pipeline. P4 GitHub App push-to-deploy, pre-staged 2026-06-03) | `github-pushdeploy-error.json` | `GitHubPushDeployError` | "GitHub push-to-deploy — result breakdown (6h)", "GitHub push-to-deploy — enqueued vs errors (6h)" |
 | `instant_github_app_token_mint_total` | api | `result` | lazy (CounterVec — label series materialise on first installation auth attempt; `cache_hit` only appears after the first token cache hit. result ∈ {minted,cache_hit,error}. minted=fresh JWT from GitHub API; cache_hit=reused unexpired token (reduces GitHub API calls); error=private key missing/malformed or GitHub API down. P4 GitHub App push-to-deploy, pre-staged 2026-06-03) | (no standalone alert; error visible in `github-pushdeploy-error.json` cascade) | (no standalone rule; covered by `GitHubPushDeployError` cascade) | "GitHub App token mint — result breakdown (6h)" |
+| `instant_razorpay_webhook_sig_fail_total` | api | (none) | lazy (Counter — the series only materialises at `/metrics` after the first failed HMAC-SHA256 verify of POST /razorpay/webhook (the 400 invalid_signature path in billing.go). Must stay 0 in steady state: non-zero = forged billing webhook (highest-value forgery — the gate is the only thing between a forged subscription.charged and a free upgrade) OR RAZORPAY_WEBHOOK_SECRET drift after a one-sided rotation. The 400 short-circuits before dispatch, so no plan_tier can flip. S4, 2026-06-10) | `razorpay-webhook-sig-fail.json` | `RazorpayWebhookSigFailSpike` (instant-billing group) | "Razorpay webhook — signature failures (1h, must be 0 in steady state) [S4]" |
 | `instant_entitlement_drift_detected_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts Postgres resources found drifted below their team's plan tier per sweep) | `entitlement-drift-outpacing-regrade.json` (paired with `_regraded_total`) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" |
 | `instant_entitlement_regraded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts resources successfully re-graded to the entitled cap, provisioner applied=true) | `entitlement-drift-outpacing-regrade.json` (denominator: detected - regraded) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" |
 | `instant_deploy_job_failed_detected_total` | worker | `reason` | lazy (CounterVec — first observation is a real Kaniko build-Job Failed detection; reason ∈ {DeadlineExceeded, BackoffLimitExceeded, ...}. metrics_test forces a label so the metric registers at boot. Silent-deploy-failure fix, CLAUDE.md rule 27 / 2026-05-30 incident) | `deploy-job-failed-detected.json` | `DeployJobFailedDetected` (instant-worker-deploy-job-failed group) | "Deploy build-Job failures by reason (6h)", "Deploy build-Job failures (1h, detected; must be 0 in steady state)" |