diff --git a/k8s/APPLY-CHECKLIST.md b/k8s/APPLY-CHECKLIST.md index ea3c463..d449782 100644 --- a/k8s/APPLY-CHECKLIST.md +++ b/k8s/APPLY-CHECKLIST.md @@ -14,6 +14,12 @@ This checklist applies to: Per CLAUDE.md rule 15: **this repo has no auto-apply by design.** Manifest apply is a deliberate, human-driven step. +> **Stateful data-tier manifests** (`k8s/data/*` — postgres-customers, +> mongodb, redis-provision, nats: PVCs, NetworkPolicy, pg_hba lockdown, +> PriorityClass/PDBs) have their own apply order + verification gates in +> **`k8s/DATA-TIER-APPLY-RUNBOOK.md`**. Use that for S1/S2/R6/R7. This file +> is for the api/worker/provisioner Deployment manifests only. + --- ## Hard rules diff --git a/k8s/DATA-TIER-APPLY-RUNBOOK.md b/k8s/DATA-TIER-APPLY-RUNBOOK.md new file mode 100644 index 0000000..0fd4a7e --- /dev/null +++ b/k8s/DATA-TIER-APPLY-RUNBOOK.md @@ -0,0 +1,292 @@ +# Data-Tier Apply Runbook — `instant-data` stateful hardening + +> Companion to `k8s/APPLY-CHECKLIST.md` (which covers the api/worker/ +> provisioner **Deployment** manifests). This runbook covers the **stateful +> data-tier** manifests in `k8s/data/` — the ones that hold real customer data +> and therefore must be applied deliberately, in order, in a maintenance +> window. **This repo has no auto-apply (CLAUDE.md rule 15).** +> +> **CRITICAL: never `kubectl apply -f k8s/app.yaml`** (stale vs prod — strips +> `imagePullSecrets`, resets images). The files below are individually +> applyable; apply them one at a time and read each `--dry-run=server` diff. 
+ +This runbook is the operator apply checklist for four changes that are +**committed but NOT yet applied to prod** (infra has no auto-apply): + +| Tag | File | What it does | Customer-visible risk if mis-applied | +|---|---|---|---| +| **S1** | `k8s/data/postgres-customers-lockdown.yaml` + the patched `postgres-customers.yaml` | pg_hba that REJECTS the admin/superuser roles (`instanode_admin`, `instant_cust`) from the public path; preserves `usr_*` customer roles | LOW — admin-only reject; customers unaffected. Detailed runbook: `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md` | +| **S2** | `k8s/data/networkpolicy.yaml` | ingress NetworkPolicy: only provisioner/migrator/worker (+ nats-proxy for 4222) may reach the data pods | **HIGH — can break ALL customers** if the pg-proxy allow-rule is missing. See §S2 below. | +| **R6** | `k8s/data/nats.yaml` | JetStream `emptyDir{}` → PVC (`nats-jetstream-pvc`, 5Gi) so queue data survives restarts | LOW — but the migration step (§R6) drains existing in-memory JetStream state. | +| **R7** | `k8s/data/stateful-priority.yaml` + resource requests in `{postgres-customers,mongodb,redis-provision}.yaml` | PriorityClass `instant-data-critical` + one PDB per stateful pod + right-sized requests (BestEffort → Burstable) | LOW — eviction-ordering + drain-gating only; no data-path change. | + +--- + +## Pre-flight (every apply below) + +```bash +# 1. Confirm context — NEVER run against the wrong cluster. +kubectl config current-context +# Expected for prod: do-nyc3-instant-prod + +# 2. Snapshot current data-tier state for rollback reference. +kubectl get pods,pvc,netpol,pdb,priorityclass -n instant-data -o wide +kubectl get priorityclass instant-data-critical 2>/dev/null || echo "no priorityclass yet" + +# 3. Server-side dry-run EACH file and read the diff line by line. +kubectl apply --dry-run=server -f +``` + +Apply in a **maintenance window**. 
The recommended order is **R7 → R6 → S1 → +S2** — least-risky and reversible first, the customer-breaking NetworkPolicy +LAST so it is the freshest thing in your head if customers report errors. + +--- + +## R7 — PriorityClass + PDBs + resource requests (apply FIRST) + +Pure eviction-protection; no data-path change. Two parts. + +**Part A — the PriorityClass + PDBs:** + +```bash +kubectl apply --dry-run=server -f k8s/data/stateful-priority.yaml # read diff +kubectl apply -f k8s/data/stateful-priority.yaml + +# Verify +kubectl get priorityclass instant-data-critical +kubectl get pdb -n instant-data +# Expect 4 PDBs (postgres-customers / mongodb / redis-provision / nats), +# each ALLOWED DISRUPTIONS reading 0 (single replica, minAvailable 1 → the one +# pod is "not disruptable" by voluntary eviction, which is the point). +``` + +**Part B — the resource requests + the priorityClassName patch.** The requests +ship INSIDE each workload manifest (`mongodb.yaml`, `redis-provision.yaml`, +`postgres-customers.yaml`). Re-applying those manifests rolls the pod (Recreate +strategy → brief downtime per workload — do this in the window). Because the +PriorityClass is deliberately NOT inlined in the Deployments (so the priority +rollout is one auditable step), patch `priorityClassName` in the same roll: + +```bash +# postgres-customers carries the S1 pg_hba mount already — apply it as part of +# S1 below (§S1) to avoid two rolls. 
For mongodb + redis-provision, roll now: +for w in mongodb redis-provision; do + kubectl apply --dry-run=server -f k8s/data/$w.yaml # read diff: only resources{} added + kubectl apply -f k8s/data/$w.yaml + kubectl patch deploy/$w -n instant-data --type=merge \ + -p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}' + kubectl rollout status deploy/$w -n instant-data --timeout=180s +done + +# Verify QoS flipped from BestEffort → Burstable and priority is set: +kubectl get pod -n instant-data -l app=mongodb \ + -o jsonpath='{.items[0].status.qosClass}{" "}{.items[0].spec.priorityClassName}{"\n"}' +# Expect: Burstable instant-data-critical +``` + +> nats already declared requests; just patch its `priorityClassName` (do it in +> the R6 roll below so nats only restarts once). + +**Rollback R7:** `kubectl delete -f k8s/data/stateful-priority.yaml` removes the +PDBs + PriorityClass (pods keep running; priorityClassName on a pod referencing +a deleted class is harmless until the next reschedule — re-patch to remove). + +--- + +## R6 — NATS JetStream emptyDir → PVC + +`k8s/data/nats.yaml` now declares `nats-jetstream-pvc` (5Gi, default +StorageClass = `do-block-storage` on DOKS) and mounts it at `/data/jetstream`. + +> **Data note:** pre-cutover JetStream state lived in `emptyDir{}` and is +> **already non-durable** (every prior restart wiped it). Switching to the PVC +> does NOT migrate old in-memory state — there is nothing durable to migrate. +> Existing `legacy_open` queue resources reconnect + re-establish streams on +> reconnect (same as any nats restart today). Schedule during low queue +> traffic; clients reconnect automatically. + +```bash +kubectl apply --dry-run=server -f k8s/data/nats.yaml # read diff: PVC added, volume swapped + +# The Deployment uses strategy.type: Recreate (RWO volume — required). Applying +# rolls the pod: old pod terminates, PVC binds, new pod starts on /data/jetstream. 
+kubectl apply -f k8s/data/nats.yaml + +# Patch the PriorityClass in the SAME context so nats restarts once (R7 part B): +kubectl patch deploy/nats -n instant-data --type=merge \ + -p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}' + +kubectl rollout status deploy/nats -n instant-data --timeout=180s + +# Verify the PVC bound and JetStream is on it: +kubectl get pvc nats-jetstream-pvc -n instant-data # STATUS Bound +kubectl exec -n instant-data deploy/nats -- ls -la /data/jetstream +kubectl get pod -n instant-data -l app=nats \ + -o jsonpath='{.items[0].status.qosClass}{" "}{.items[0].spec.priorityClassName}{"\n"}' +# Expect a jetstream dir on the mounted PVC + Burstable instant-data-critical. + +# Durability proof (the whole point): publish to a stream, delete the pod, +# confirm the message survives the restart. +# kubectl exec ... nats pub test.durability hello ; kubectl delete pod -l app=nats ; +# (after Ready) nats stream info / consumer next — message must still be there. +``` + +**Rollback R6:** revert the `nats.yaml` change and re-apply (volume goes back to +`emptyDir{}`). The PVC can be left bound (it costs a few cents) or deleted with +`kubectl delete pvc nats-jetstream-pvc -n instant-data` once nats is off it. + +--- + +## S1 — postgres-customers admin lockdown + +Full procedure (root-cause, role analysis, proxy-IP SNAT caveat, the live +pg_hba stopgap) is in **`POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md`** — follow THAT +for S1; this is the short pointer + the verification gate. + +Apply order: the `postgres-customers-hba` ConfigMap (in +`postgres-customers-lockdown.yaml`) FIRST, then the patched +`postgres-customers.yaml` (which mounts the hba file via subPath + sets +`hba_file=/etc/postgresql/pg_hba.conf` and now also carries the R7 +resource requests). 
Roll postgres-customers ONCE for both: + +```bash +kubectl apply -f k8s/data/postgres-customers-lockdown.yaml # ConfigMap (+ any docs) +kubectl apply -f k8s/data/postgres-customers.yaml # mounts hba + R7 requests +kubectl patch deploy/postgres-customers -n instant-data --type=merge \ + -p '{"spec":{"template":{"spec":{"priorityClassName":"instant-data-critical"}}}}' +kubectl rollout status deploy/postgres-customers -n instant-data --timeout=300s +``` + +> ⚠️ Read `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §3a` BEFORE applying — the +> pg-proxy SNATs customer traffic to a pod IP, so the lockdown rejects the +> admin role BY ROLE NAME (`instanode_admin` AND `instant_cust`), not by source +> IP. If the runbook's proxy-pod-IP reject lines are stale, fix them first. + +### S1 verification gate (the load-bearing check) + +After the roll, the **external admin path MUST FAIL** while in-cluster admin and +customer paths keep working: + +```bash +# (a) EXTERNAL admin connect MUST be REJECTED by pg_hba (NOT a password prompt +# that proceeds). SAFE: connection-rejection probe only — no SQL/DDL. +PGCONNECT_TIMEOUT=5 psql \ + "host=pg.instanode.dev port=5432 user=instant_cust dbname=instant_customers sslmode=require" \ + -c '\q' 2>&1 | head +# EXPECT: 'no pg_hba.conf entry for host ... rejected' (or FATAL 28000 from the +# proxy). FAILURE TO REJECT = lockdown not in effect — STOP, investigate. + +# Repeat for the OTHER admin role (the confirmed truehomie role): +PGCONNECT_TIMEOUT=5 psql \ + "host=pg.instanode.dev port=5432 user=instanode_admin dbname=instant_customers sslmode=require" \ + -c '\q' 2>&1 | head +# EXPECT: rejected. + +# (b) In-cluster admin still works (provisioner path is intact): +kubectl exec -n instant-data deploy/postgres-customers -- \ + psql -U instant_cust -d instant_customers -tAc 'select 1;' # expect: 1 + +# (c) A real customer usr_ still connects through the public path +# (regression check — the lockdown must NOT catch customer roles). 
+# Use a known test-tenant DSN from the dashboard / a /db/new claim. +``` + +**Rollback S1:** see `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §Rollback` (revert +the manifest, the pod falls back to the stock catch-all pg_hba; the live file +backup is at `$PGDATA/pg_hba.conf.bak.2026-06-03`). + +--- + +## S2 — data-tier ingress NetworkPolicy (apply LAST — highest risk) + +`k8s/data/networkpolicy.yaml` adds a default-deny ingress policy per data pod, +allowing ONLY provisioner / migrator / worker (+ nats-proxy for 4222/8222). + +### ⚠️ The pg-proxy allow-rule — this is the customer-breaking trap + +The `postgres-customers-ingress` policy as committed **does NOT list +`instant-pg-proxy`** — the allow-rule for it is **DORMANT** (commented out at +`networkpolicy.yaml` lines ~88–103). If the public customer connect path is +`pg.instanode.dev → ingress-nginx tcp-services → instant-pg-proxy +(instant ns) → postgres-customers`, then applying this policy AS-IS +**default-denies the proxy and BREAKS EVERY CUSTOMER POSTGRES CONNECTION.** + +`POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md §L4` records that as of 2026-06-06 the +NetworkPolicy is **NOT applied in prod** and that applying it as-is would +default-deny + break the proxy path. **Do not apply S2 until you have:** + +1. **Confirmed the live proxy deployment's namespace + pod labels:** + ```bash + kubectl get pods -A -l app=instant-pg-proxy -o wide + # (the proxy manifest lives in the separate InstaNode-dev/instant-pg-proxy + # repo, NOT here — read the real ns + labels off the live cluster.) + ``` +2. **Uncommented + edited the dormant pg-proxy block** in + `networkpolicy.yaml` (lines ~88–103) to match those real ns/labels. +3. **Confirmed Cilium (the CNI) actually enforces NetworkPolicy** in this + cluster (`kubectl get ds -n kube-system | grep cilium`). 
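The allow-rule that step 2 asks for could look roughly like the fragment below. It is a sketch, not the committed block: the namespace (`instant`) and the pod label (`app: instant-pg-proxy`) are assumptions taken from the connect path and the `kubectl get pods` selector above, and the commented-out block in `networkpolicy.yaml` remains the source of truth. Replace both values with whatever step 1 reports from the live cluster.

```yaml
# Fragment of the postgres-customers-ingress policy's spec.ingress list.
# HYPOTHETICAL values: confirm namespace + labels on the live cluster first.
# Do not apply this fragment on its own; it belongs next to the existing
# provisioner / migrator / worker rules in k8s/data/networkpolicy.yaml.
- from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: instant   # assumed proxy namespace
      podSelector:
        matchLabels:
          app: instant-pg-proxy                  # assumed proxy pod label
  ports:
    - protocol: TCP
      port: 5432
```

Because `namespaceSelector` and `podSelector` sit in the same `from` entry, both must match, so only proxy pods in that one namespace are admitted. The `kubernetes.io/metadata.name` label is set automatically on namespaces by current Kubernetes versions; if the cluster predates that, select on a label the namespace actually carries.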
+ +Only then: + +```bash +kubectl apply --dry-run=server -f k8s/data/networkpolicy.yaml # read diff +kubectl apply -f k8s/data/networkpolicy.yaml +``` + +### S2 verification gate (must run IMMEDIATELY after apply) + +```bash +# (a) Legit in-cluster caller (provisioner) still reaches postgres-customers: +kubectl exec -n instant-infra deploy/instant-provisioner -- \ + sh -c 'nc -z -w5 postgres-customers.instant-data.svc.cluster.local 5432 && echo OK' +# Expect: OK + +# (b) THE CUSTOMER PATH still works — connect a real customer usr_ +# through pg.instanode.dev (same DSN as S1 check (c)). If this now FAILS +# where it worked pre-apply, the pg-proxy allow-rule is missing/wrong: +# kubectl delete -f k8s/data/networkpolicy.yaml # IMMEDIATE rollback +# then fix the dormant pg-proxy block and re-apply. + +# (c) The 4 NetworkPolicies are present: +kubectl get networkpolicy -n instant-data +# Expect: postgres-customers-ingress, redis-provision-ingress, mongodb-ingress, +# nats-ingress. +``` + +**Rollback S2 (do this fast if customers report connection errors):** + +```bash +kubectl delete -f k8s/data/networkpolicy.yaml +# Removing the policies returns the pods to allow-all ingress (the pre-apply +# state). No data loss; instant effect. 
+``` + +--- + +## Apply-order summary + +| # | Tag | Command | Verify | Reversible | +|---|---|---|---|---| +| 1 | R7-A | `kubectl apply -f k8s/data/stateful-priority.yaml` | `kubectl get pdb,priorityclass -n instant-data` | `kubectl delete -f …` | +| 2 | R7-B | apply `mongodb.yaml`,`redis-provision.yaml` + patch `priorityClassName` | QoS = Burstable | re-apply prior manifest | +| 3 | R6 | `kubectl apply -f k8s/data/nats.yaml` (+ priorityClassName patch) | PVC Bound + durability publish/restart test | revert manifest | +| 4 | S1 | apply lockdown ConfigMap + `postgres-customers.yaml` (+ patch) | **external admin psql REJECTED** | per lockdown runbook | +| 5 | S2 | **edit dormant pg-proxy rule FIRST**, then apply `networkpolicy.yaml` | provisioner reaches pg AND customer path works | `kubectl delete -f …` | + +After every step, sanity-check the platform hot path: + +```bash +curl -sS https://api.instanode.dev/healthz | jq . +curl -sS https://api.instanode.dev/readyz | jq . # data-tier deep readiness +``` + +--- + +## Related + +- `k8s/APPLY-CHECKLIST.md` — the api/worker/provisioner Deployment apply rules. +- `POSTGRES-CUSTOMERS-LOCKDOWN-RUNBOOK.md` — the full S1 procedure + root cause. +- `NATS-AUTH-RUNBOOK.md` — NATS operator-mode key generation (separate from R6). +- `k8s/data/networkpolicy.yaml` — the S2 policy with the dormant pg-proxy block. +- CLAUDE.md rule 15 — why this repo has no auto-apply. diff --git a/k8s/data/mongodb.yaml b/k8s/data/mongodb.yaml index 64b23f4..eaad8e0 100644 --- a/k8s/data/mongodb.yaml +++ b/k8s/data/mongodb.yaml @@ -32,6 +32,17 @@ spec: image: mongo:7 ports: - containerPort: 27017 + # R7 (2026-06-10): requests added so this pod is Burstable, not + # BestEffort (BestEffort = first evicted under the cluster's memory + # overcommit). WiredTiger sizes its cache to 50% of (RAM - 1GB) by + # default; the 1Gi limit keeps that bounded for the free-tier nosql + # footprint. Bump both if dedicated/Team mongodb lands here. 
+ resources: + requests: + memory: "256Mi" + cpu: "100m" + limits: + memory: "1Gi" env: - name: MONGO_INITDB_ROOT_USERNAME value: root diff --git a/k8s/data/nats.yaml b/k8s/data/nats.yaml index a95e994..d9ee96b 100644 --- a/k8s/data/nats.yaml +++ b/k8s/data/nats.yaml @@ -72,6 +72,43 @@ data: resolver: MEMORY --- +# JetStream durability (R6, 2026-06-10). Before this PVC, the JetStream +# store_dir (/data/jetstream) was an emptyDir{} — every pod replacement (the +# Recreate rollout, an eviction, or a node drain) WIPED all stream + consumer +# state and every persisted message. For a queue product that promises +# "queue data survives pod restarts" that is a durability lie. This PVC backs +# /data/jetstream with real block storage so stream/consumer state + messages +# persist across restarts. +# +# 5Gi is conservative — sized to today's tiny queue footprint, not to the +# JetStream config's max_file_store: 50GB CEILING. Grow with +# `kubectl edit pvc nats-jetstream-pvc -n instant-data` (do-block-storage / EBS +# support online expansion when allowVolumeExpansion=true on the StorageClass) +# if file_store usage approaches the request. Do NOT pre-allocate 50Gi — that +# is the hard ceiling, not the working set. +# +# storageClassName is OMITTED → falls back to the cluster default. On DOKS prod +# that default is `do-block-storage` (confirmed in k8s/self-hosted-runner.yaml +# :152 + k8s/data/postgres-customers.yaml, which use the same omit-for-default +# convention). Local dev (Rancher Desktop / k3s) gets `local-path` via the +# cluster default there, or layer a kustomize overlay setting +# storageClassName: local-path. Block storage is RWO single-attach, which is +# why the Deployment below MUST stay strategy.type: Recreate (a RollingUpdate +# would Multi-Attach-deadlock the new pod against the old holder — same +# constraint postgres-customers.yaml documents).
+apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: nats-jetstream-pvc + namespace: instant-data + labels: + app: nats +spec: + accessModes: [ReadWriteOnce] + resources: + requests: + storage: 5Gi +--- apiVersion: apps/v1 kind: Deployment metadata: @@ -172,9 +209,15 @@ spec: # Secret + restart. Once the Secret exists the pod converges. optional: false - name: rendered-conf - emptyDir: {} + emptyDir: {} # render scratch only — operator.conf is re-rendered + # from the nats-operator Secret by the initContainer on + # every pod start, so this one stays ephemeral by design. - name: jetstream-data - emptyDir: {} # TODO: convert to PVC for prod durability + # R6 (2026-06-10): was emptyDir{} — now PVC-backed so JetStream + # stream/consumer state + persisted messages survive pod restarts. + # See the nats-jetstream-pvc PersistentVolumeClaim above. + persistentVolumeClaim: + claimName: nats-jetstream-pvc --- apiVersion: v1 kind: Service diff --git a/k8s/data/postgres-customers.yaml b/k8s/data/postgres-customers.yaml index 413b631..421b68c 100644 --- a/k8s/data/postgres-customers.yaml +++ b/k8s/data/postgres-customers.yaml @@ -50,6 +50,21 @@ spec: - "password_encryption=scram-sha-256" ports: - containerPort: 5432 + # R7 (2026-06-10): requests added so this pod is Burstable, not + # BestEffort. postgres-customers holds EVERY customer db_ — + # evicting it first under memory pressure (BestEffort QoS) is the + # worst possible eviction order. request 256Mi covers shared_buffers + # + idle backends for the free-tier footprint; limit 1Gi bounds it + # while leaving room for connection-heavy load. Pair with the + # instant-data-critical PriorityClass (applied via + # DATA-TIER-APPLY-RUNBOOK.md) so it is also scheduled ahead of and + # preempted after stateless app pods. 
+ resources: + requests: + memory: "256Mi" + cpu: "100m" + limits: + memory: "1Gi" env: - name: POSTGRES_DB value: instant_customers diff --git a/k8s/data/redis-provision.yaml b/k8s/data/redis-provision.yaml index bfbb47b..1a6f8d2 100644 --- a/k8s/data/redis-provision.yaml +++ b/k8s/data/redis-provision.yaml @@ -44,6 +44,19 @@ spec: - "yes" - --dir - /data + # R7 (2026-06-10): requests added so this pod is Burstable, not + # BestEffort — BestEffort is the FIRST thing evicted under the + # cluster's memory overcommit. request 128Mi covers idle + small + # working set; limit 384Mi gives headroom over the --maxmemory 256mb + # cap (Redis RSS runs above the dataset due to fragmentation + COW on + # AOF rewrite). cpu request is a floor only (Redis is single-threaded; + # no cpu limit to avoid throttling latency-sensitive ops). + resources: + requests: + memory: "128Mi" + cpu: "50m" + limits: + memory: "384Mi" volumeMounts: - name: redis-data mountPath: /data diff --git a/k8s/data/stateful-priority.yaml b/k8s/data/stateful-priority.yaml new file mode 100644 index 0000000..f550f6c --- /dev/null +++ b/k8s/data/stateful-priority.yaml @@ -0,0 +1,142 @@ +--- +# Stateful data-tier eviction protection (R7, 2026-06-10). +# +# ───────────────────────────────────────────────────────────────────────────── +# WHY THIS EXISTS +# ───────────────────────────────────────────────────────────────────────────── +# The instant-data namespace holds the platform's stateful workloads — +# postgres-customers (every customer db_), mongodb (nosql resources), +# redis-provision (cache resources), and nats (JetStream queues). These are +# single-replica, PVC-backed, and CANNOT be casually rescheduled: an eviction +# is a customer-visible outage (postgres-customers Recreate downtime) or, for +# emptyDir-era nats, data loss. +# +# The cluster runs at 200%+ memory OVERCOMMIT (sum of limits >> node +# allocatable). 
Under node memory pressure the kubelet evicts pods in this +# order: (1) BestEffort (no requests/limits) first, then (2) Burstable using +# MORE than their request, then by Priority ASCENDING. Two failure modes today: +# 1. mongodb / redis-provision / postgres-customers declare NO resource +# requests → BestEffort QoS → FIRST evicted under pressure, ahead of any +# stateless app pod. The data tier should be the LAST thing evicted, not +# the first. (Right-sized requests land in their own manifests — see +# k8s/data/{mongodb,redis-provision,postgres-customers}.yaml.) +# 2. No PriorityClass → these pods sit at the default priority (0), tied with +# every throwaway build/preview pod. A high PriorityClass makes the +# scheduler evict lower-priority stateless pods to make room for the data +# tier, and protects the data tier from preemption. +# +# This file ships the eviction-protection PRIMITIVES (PriorityClass + one PDB +# per stateful workload). The matching resource requests live in each +# workload's own manifest so the QoS change is reviewable alongside the +# workload it sizes. +# +# minio is intentionally NOT covered: the self-hosted MinIO Deployment was +# retired 2026-05-20 (DO Spaces is the canonical object store — see +# APPLY-CHECKLIST.md §"MinIO retirement"). There is no minio workload in +# instant-data to protect. If a local-dev minio is ever re-introduced, add its +# PDB here. +# +# ───────────────────────────────────────────────────────────────────────────── +# PriorityClass +# ───────────────────────────────────────────────────────────────────────────── +# Cluster-scoped (PriorityClass is not namespaced). value 1_000_000 sits ABOVE +# the default (0) and above typical app workloads, but BELOW the reserved +# system-node-critical (2000001000) / system-cluster-critical (2000000000) +# bands so we never starve kube-system.
preemptionPolicy PreemptLowerPriority +# (the default) lets the scheduler evict lower-priority stateless pods to +# schedule a data-tier pod that would otherwise go Pending — exactly what we +# want when a node is tight. globalDefault MUST stay false (true would silently +# give EVERY unclassified pod this priority and defeat the point). +apiVersion: scheduling.k8s.io/v1 +kind: PriorityClass +metadata: + name: instant-data-critical +value: 1000000 +globalDefault: false +preemptionPolicy: PreemptLowerPriority +description: >- + Stateful data-tier workloads in instant-data (postgres-customers, mongodb, + redis-provision, nats). High priority so they are scheduled ahead of and + evicted after stateless app pods under the cluster's memory overcommit. + Apply the priorityClassName to each workload's pod template via the operator + patch in DATA-TIER-APPLY-RUNBOOK.md (kept out of the Deployment manifests + there so the priority rollout is a single auditable step). +--- +# ───────────────────────────────────────────────────────────────────────────── +# PodDisruptionBudgets — minAvailable: 1 per stateful workload. +# ───────────────────────────────────────────────────────────────────────────── +# A PDB with minAvailable: 1 blocks VOLUNTARY disruptions (node drain / +# `kubectl drain`, cluster autoscaler scale-down, a rolling node upgrade) from +# evicting the single replica through the eviction API. It does NOT stop +# INVOLUNTARY eviction (OOM kill, node hardware failure, kubelet +# memory-pressure eviction) — that is what the PriorityClass + resource +# requests above defend against. The two mechanisms are complementary: +# PriorityClass/requests = involuntary eviction ordering; PDB = voluntary +# disruption gate. +# +# CAVEAT for single-replica workloads: minAvailable: 1 on a 1-replica +# Deployment leaves ZERO allowed disruptions, so a node drain BLOCKS on this +# pod. Kubernetes does not start a replacement elsewhere first: the eviction +# is refused and the drain retries until an operator acts.
For PVC-backed RWO +# workloads (all four here) a second pod could not attach the volume anyway. +# This is the intended behaviour: a drain that would take customer Postgres +# offline now stops and waits for a human instead of yanking it. Operators +# draining a node for maintenance should expect the drain to pause here, then +# move the pod deliberately in a window (with the node cordoned, deleting the +# pod lets it reschedule onto another node once the volume re-attaches), or +# use --disable-eviction only as a deliberate, logged override. +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: postgres-customers-pdb + namespace: instant-data + labels: + app: postgres-customers + app.kubernetes.io/component: data +spec: + minAvailable: 1 + selector: + matchLabels: + app: postgres-customers +--- +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: mongodb-pdb + namespace: instant-data + labels: + app: mongodb + app.kubernetes.io/component: data +spec: + minAvailable: 1 + selector: + matchLabels: + app: mongodb +--- +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: redis-provision-pdb + namespace: instant-data + labels: + app: redis-provision + app.kubernetes.io/component: data +spec: + minAvailable: 1 + selector: + matchLabels: + app: redis-provision +--- +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: nats-pdb + namespace: instant-data + labels: + app: nats + app.kubernetes.io/component: data +spec: + minAvailable: 1 + selector: + matchLabels: + app: nats diff --git a/k8s/prometheus-rules.yaml b/k8s/prometheus-rules.yaml index 4021752..27a6af4 100644 --- a/k8s/prometheus-rules.yaml +++ b/k8s/prometheus-rules.yaml @@ -935,6 +935,59 @@ spec: handlers/github_webhook.go push handler → deploy enqueue path. NR mirror: newrelic/alerts/github-pushdeploy-error.json.
+ # Razorpay inbound billing webhook signature gate (S4, 2026-06-10). + # Metric emitted by api/internal/handlers/billing.go on a failed + # HMAC-SHA256 verify of POST /razorpay/webhook (the 400 invalid_signature + # path): + # instant_razorpay_webhook_sig_fail_total + # Lazy Counter — the series only materialises on the first failed verify. + # Mirror of newrelic/alerts/razorpay-webhook-sig-fail.json (P2). Pairs with the + # GitHub-webhook bad-signature alert above; same abuse class, higher blast + # radius (the signature gate is the only thing between a forged + # subscription.charged and a free plan upgrade). + - name: instant-billing + rules: + # Razorpay webhook signature-failure spike — forged billing payload or + # RAZORPAY_WEBHOOK_SECRET mismatch. + # P2/abuse: WARNING only. Non-zero is common on secret rotations that + # update one side (Razorpay dashboard vs instant-secrets) before the + # other — every legit retry then fails the gate. A sustained rate from + # a single IP carrying a well-formed-but-unsigned subscription.charged + # is an active upgrade-forgery attempt (the highest-value forgery on + # the platform). The 400 short-circuits BEFORE dispatch, so no plan_tier + # can flip — this alert is about visibility + WAF/source-block response, + # not damage already done. + - alert: RazorpayWebhookSigFailSpike + expr: | + increase(instant_razorpay_webhook_sig_fail_total[10m]) > 0 + for: 10m + labels: + severity: warning + service: api + annotations: + summary: "Razorpay webhook signature failures > 0 in 10m — forged billing webhook or RAZORPAY_WEBHOOK_SECRET mismatch" + description: | + instant_razorpay_webhook_sig_fail_total > 0 for >10m. A POST to + /razorpay/webhook failed HMAC-SHA256 signature verification.
Either + (a) an attacker is probing the billing webhook with a forged payload + (a correctly-shaped subscription.charged with a bad signature is the + highest-value forgery here — the gate is the only thing between it and + a free Pro/Team upgrade), or (b) RAZORPAY_WEBHOOK_SECRET in + instant-secrets drifted from the Razorpay dashboard (modal cause: + secret rotated on one side only → every legit retry fails), or (c) a + live/test secret split (handler tries RAZORPAY_WEBHOOK_SECRET then + RAZORPAY_TEST_WEBHOOK_SECRET; a TEST payload against a prod with no + test secret fails both). The 400 short-circuits BEFORE dispatch, so + no plan_tier flips. Investigate: grep NR Logs api + message='billing.webhook.signature_failed' for event_id + client IP. + If single-IP with a well-formed subscription.charged: ACTIVE forgery — + capture the IP for a WAF/source block; do NOT rotate blindly (rotation + fixes drift, not forgers). If legitimate Razorpay delivery retries with + no IP anomaly: secret mismatch — re-sync RAZORPAY_WEBHOOK_SECRET in + instant-secrets AND the Razorpay dashboard. Source: api/internal/ + handlers/billing.go (verifyRazorpaySignature → invalid_signature 400). + NR mirror: newrelic/alerts/razorpay-webhook-sig-fail.json. + # instant-worker — entitlement drift outpacing regrade (Rule 25 sweep 2026-06-04). # The entitlement_reconciler DETECTS Postgres resources whose connection cap no # longer matches their team's plan tier (instant_entitlement_drift_detected_total) diff --git a/newrelic/alerts/razorpay-webhook-sig-fail.json b/newrelic/alerts/razorpay-webhook-sig-fail.json new file mode 100644 index 0000000..b4fb738 --- /dev/null +++ b/newrelic/alerts/razorpay-webhook-sig-fail.json @@ -0,0 +1,31 @@ +{ + "name": "instant-api — razorpay_webhook_sig_fail elevated [forged billing webhook or RAZORPAY_WEBHOOK_SECRET mismatch]", + "type": "NRQL", + "description": "P2/abuse. Fires when instant_razorpay_webhook_sig_fail_total rate is elevated over ~10m. 
A non-zero signature-failure count on POST /razorpay/webhook means one of: (a) an attacker is probing the billing webhook endpoint with a forged payload trying to drive a free plan upgrade (a correctly-shaped subscription.charged with a bad signature is the highest-value forgery on the platform — the signature gate is the ONLY thing standing between a forged success and a free Pro/Team upgrade), (b) RAZORPAY_WEBHOOK_SECRET in instant-secrets has drifted from the secret configured in the Razorpay dashboard (the modal cause after a secret rotation on one side only — Razorpay will then retry every real webhook and each retry fails the signature check), or (c) a misconfigured test/live secret split (the handler tries the live RAZORPAY_WEBHOOK_SECRET first, then RAZORPAY_TEST_WEBHOOK_SECRET — a TEST-mode payload against a prod with no test secret set fails both). Razorpay signatures are hex(HMAC-SHA256(key=webhookSecret, msg=rawBody)) with NO timestamp prefix (unlike Stripe); the handler constant-time-compares in verifyRazorpaySignature. Cross-correlate against NR Logs for api service message='billing.webhook.signature_failed' (carries event_id + client IP) and the audit_log row. If the rate tracks with a single IP and a well-formed-but-unsigned subscription.charged payload, treat as an ACTIVE upgrade-forgery attempt — do NOT rotate blindly (rotation won't stop a forger, it only fixes a legit-secret drift); confirm the customer's plan_tier did NOT flip (it cannot — the 400 short-circuits before the handler dispatches) and capture the source IP for a WAF block. If the rate tracks with legitimate Razorpay delivery retries and no IP anomaly, treat as secret mismatch: re-sync RAZORPAY_WEBHOOK_SECRET in instant-secrets AND the Razorpay dashboard. Source: api/internal/handlers/billing.go (verifyRazorpaySignature → invalid_signature 400); counter registered as instant_razorpay_webhook_sig_fail_total. 
Lazy counter — the series only appears at /metrics after the first failed verification. Mirrors the GitHub-webhook bad-signature alert (newrelic/alerts/github-webhook-bad-signature.json) and the Prom rule RazorpayWebhookSigFailSpike in k8s/prometheus-rules.yaml group instant-billing.", + "enabled": true, + "nrql": { + "query": "SELECT sum(instant_razorpay_webhook_sig_fail_total) FROM Metric WHERE metricName = 'instant_razorpay_webhook_sig_fail_total'" + }, + "terms": [ + { + "priority": "WARNING", + "operator": "ABOVE", + "threshold": 0, + "thresholdDuration": 600, + "thresholdOccurrences": "AT_LEAST_ONCE" + } + ], + "signal": { + "aggregationWindow": 60, + "aggregationMethod": "EVENT_FLOW", + "aggregationDelay": 120, + "fillOption": "STATIC", + "fillValue": 0 + }, + "expiration": { + "expirationDuration": 3600, + "openViolationOnExpiration": false, + "closeViolationsOnExpiration": true + }, + "violationTimeLimitSeconds": 86400 +} diff --git a/newrelic/dashboards/instanode-reliability.json b/newrelic/dashboards/instanode-reliability.json index 8752ac0..551563f 100644 --- a/newrelic/dashboards/instanode-reliability.json +++ b/newrelic/dashboards/instanode-reliability.json @@ -982,7 +982,7 @@ } }, { - "title": "Orphan-DB sweep — current candidate backlog by kind (0 until enabled)", + "title": "Orphan-DB sweep \u2014 current candidate backlog by kind (0 until enabled)", "layout": { "column": 1, "row": 81, @@ -1519,7 +1519,7 @@ } }, { - "title": "Layer-3 payment prober — outcomes per leg (6h) [money heartbeat]", + "title": "Layer-3 payment prober \u2014 outcomes per leg (6h) [money heartbeat]", "layout": { "column": 1, "row": 75, @@ -1544,7 +1544,7 @@ } }, { - "title": "Layer-3 payment prober — fails (last 6h, must be 0; degraded excluded)", + "title": "Layer-3 payment prober \u2014 fails (last 6h, must be 0; degraded excluded)", "layout": { "column": 7, "row": 75, @@ -1575,7 +1575,7 @@ } }, { - "title": "Layer-3 payment prober — P95 latency per leg (6h)", + "title": 
"Layer-3 payment prober \u2014 P95 latency per leg (6h)", "layout": { "column": 10, "row": 75, @@ -1598,6 +1598,37 @@ "ignoreTimeRange": false } } + }, + { + "title": "Razorpay webhook \u2014 signature failures (1h, must be 0 in steady state) [S4]", + "layout": { + "column": 1, + "row": 84, + "width": 3, + "height": 3 + }, + "visualization": { + "id": "viz.billboard" + }, + "rawConfiguration": { + "nrqlQueries": [ + { + "accountIds": [ + 0 + ], + "query": "SELECT sum(instant_razorpay_webhook_sig_fail_total) AS 'sig_fail' FROM Metric WHERE metricName = 'instant_razorpay_webhook_sig_fail_total' SINCE 1 hour ago" + } + ], + "platformOptions": { + "ignoreTimeRange": false + }, + "thresholds": [ + { + "alertSeverity": "WARNING", + "value": 1 + } + ] + } } ] } diff --git a/observability/METRICS-CATALOG.md b/observability/METRICS-CATALOG.md index 39b6326..63957b2 100644 --- a/observability/METRICS-CATALOG.md +++ b/observability/METRICS-CATALOG.md @@ -54,6 +54,7 @@ fires. Operators need this so they don't panic when a fresh deploy looks | `instant_github_webhook_received_total` | api | `event,result` | lazy (CounterVec — label series only materialise on first delivery of each `{event,result}` combination; `bad_signature` only appears after the first malformed/spoofed delivery; `ok` appears after the first valid push. event ∈ {push,installation,...}; result ∈ {ok,bad_signature,replay,no_match,error}. P4 GitHub App push-to-deploy, pre-staged 2026-06-03) | `github-webhook-bad-signature.json` | `GitHubWebhookBadSignatureSpike` | "GitHub webhook — received by event+result (6h)", "GitHub webhook — bad_signature count (1h, must be 0 in steady state)" | | `instant_github_pushdeploy_total` | api | `result` | lazy (CounterVec — label series materialise on first push matching an installation+connection; `error` only appears after the first enqueue failure. result ∈ {enqueued,rate_limited,no_connection,error}. 
Enqueued = happy path; rate_limited = expected; no_connection = repo not linked to a stack; error = broken pipeline. P4 GitHub App push-to-deploy, pre-staged 2026-06-03) | `github-pushdeploy-error.json` | `GitHubPushDeployError` | "GitHub push-to-deploy — result breakdown (6h)", "GitHub push-to-deploy — enqueued vs errors (6h)" | | `instant_github_app_token_mint_total` | api | `result` | lazy (CounterVec — label series materialise on first installation auth attempt; `cache_hit` only appears after the first token cache hit. result ∈ {minted,cache_hit,error}. minted=fresh JWT from GitHub API; cache_hit=reused unexpired token (reduces GitHub API calls); error=private key missing/malformed or GitHub API down. P4 GitHub App push-to-deploy, pre-staged 2026-06-03) | (no standalone alert; error visible in `github-pushdeploy-error.json` cascade) | (no standalone rule; covered by `GitHubPushDeployError` cascade) | "GitHub App token mint — result breakdown (6h)" | +| `instant_razorpay_webhook_sig_fail_total` | api | (none) | lazy (Counter — the series only materialises at `/metrics` after the first failed HMAC-SHA256 verify of POST /razorpay/webhook (the 400 invalid_signature path in billing.go). Must stay 0 in steady state: non-zero = forged billing webhook (highest-value forgery — the gate is the only thing between a forged subscription.charged and a free upgrade) OR RAZORPAY_WEBHOOK_SECRET drift after a one-sided rotation. The 400 short-circuits before dispatch, so no plan_tier can flip. 
S4, 2026-06-10) | `razorpay-webhook-sig-fail.json` | `RazorpayWebhookSigFailSpike` (instant-billing group) | "Razorpay webhook — signature failures (1h, must be 0 in steady state) [S4]" | | `instant_entitlement_drift_detected_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts Postgres resources found drifted below their team's plan tier per sweep) | `entitlement-drift-outpacing-regrade.json` (paired with `_regraded_total`) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" | | `instant_entitlement_regraded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; counts resources successfully re-graded to the entitled cap, provisioner applied=true) | `entitlement-drift-outpacing-regrade.json` (denominator: detected - regraded) | `EntitlementDriftOutpacingRegrade` (instant-worker-entitlement-drift group) | "Entitlement drift detected vs regraded (6h)", "Entitlement drift backlog (1h, detected - regraded; must be 0)" | | `instant_deploy_job_failed_detected_total` | worker | `reason` | lazy (CounterVec — first observation is a real Kaniko build-Job Failed detection; reason ∈ {DeadlineExceeded, BackoffLimitExceeded, ...}. metrics_test forces a label so the metric registers at boot. Silent-deploy-failure fix, CLAUDE.md rule 27 / 2026-05-30 incident) | `deploy-job-failed-detected.json` | `DeployJobFailedDetected` (instant-worker-deploy-job-failed group) | "Deploy build-Job failures by reason (6h)", "Deploy build-Job failures (1h, detected; must be 0 in steady state)" |
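
The signature gate behind `instant_razorpay_webhook_sig_fail_total` can be sketched as follows. This is a minimal illustration, not the code in `api/internal/handlers/billing.go`. It assumes only what the alert description states: the signature is `hex(HMAC-SHA256(key=webhookSecret, msg=rawBody))` with no timestamp prefix, the comparison is constant-time, and the live secret is tried before the test secret. The names `signHex` and `verifySketch` are hypothetical, and skipping unset secrets is an assumption of this sketch.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signHex returns hex(HMAC-SHA256(key=secret, msg=rawBody)), the scheme the
// alert description attributes to Razorpay (no timestamp prefix).
func signHex(secret string, rawBody []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(rawBody)
	return hex.EncodeToString(mac.Sum(nil))
}

// verifySketch is a hypothetical stand-in for verifyRazorpaySignature.
// It tries each configured secret in order (live first, then test) and
// reports whether any of them produces the presented signature.
func verifySketch(rawBody []byte, gotSig string, secrets ...string) bool {
	got, err := hex.DecodeString(gotSig)
	if err != nil {
		return false // a malformed signature header can never match
	}
	for _, s := range secrets {
		if s == "" {
			continue // assumption: an unset secret must never validate anything
		}
		mac := hmac.New(sha256.New, []byte(s))
		mac.Write(rawBody)
		// hmac.Equal compares in constant time.
		if hmac.Equal(mac.Sum(nil), got) {
			return true
		}
	}
	return false
}

func main() {
	body := []byte(`{"event":"subscription.charged"}`)
	live, test := "live-secret-example", "" // test secret unset, as in cause (c)

	// Correctly signed with the live secret: accepted.
	fmt.Println(verifySketch(body, signHex(live, body), live, test))
	// Signed with a drifted or forged secret: rejected. This is the path
	// that would return the 400 and increment the counter.
	fmt.Println(verifySketch(body, signHex("rotated-elsewhere", body), live, test))
}
```

In this sketch a payload signed with a test-mode secret fails both attempts when the test secret is unset, which matches cause (c) in the alert text. The HMAC must be computed over the raw request bytes, since re-serialising the JSON before verification would change the digest.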