infra(data-tier): S4 razorpay sig-fail monitoring + R6 nats PVC + R7 eviction protection + apply runbook by mastermanas805 · Pull Request #69 · InstaNode-dev/infra

mastermanas805 · 2026-06-10T03:53:43Z

What

Stateful instant-data hardening: durability (R6), eviction protection (R7), Razorpay inbound-webhook signature-failure monitoring (S4), and a consolidated operator apply runbook covering these PLUS the already-committed-but-unapplied S1/S2.

NO kubectl apply was performed. Per CLAUDE.md rule 15 this repo has no auto-apply — manifests ship as PR + runbook for the operator to apply in a maintenance window (real customer data is on these pods).

Changes

S4 — Razorpay webhook signature-failure monitoring (rule 25, mirrors the GitHub bad-signature alert)

Pairs with the new instant_razorpay_webhook_sig_fail_total counter that the api agent is adding. Severity is P2/abuse: the signature gate is the only thing between a forged subscription.charged event and a free plan upgrade. A failed check returns 400 and short-circuits before dispatch, so no plan_tier flips; this alert adds visibility and drives the WAF response.
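For reviewers unfamiliar with the gate being monitored, here is a minimal sketch of a signature check that short-circuits with a 400 and bumps a failure counter. It assumes the usual Razorpay scheme (hex HMAC-SHA256 of the raw request body, keyed with the webhook secret). The `handle_webhook` function and the `SIG_FAIL` dict are hypothetical stand-ins, not the api agent's actual handler or metric client.

```python
import hashlib
import hmac

# Hypothetical in-process stand-in for instant_razorpay_webhook_sig_fail_total.
SIG_FAIL = {"total": 0}


def handle_webhook(raw_body: bytes, signature_header: str, secret: bytes) -> int:
    """Return the HTTP status a handler would send for this delivery."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), signature_header.encode()):
        SIG_FAIL["total"] += 1
        return 400  # short-circuit: nothing is dispatched, no plan_tier change
    # Dispatch (for example, handling subscription.charged) would happen here.
    return 200
```

In this sketch every rejected delivery increments the counter exactly once, which is what makes an "any increase in the window" alert meaningful.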

k8s/prometheus-rules.yaml — new instant-billing group, alert RazorpayWebhookSigFailSpike (increase(...[10m]) > 0, warning).
newrelic/alerts/razorpay-webhook-sig-fail.json — P2 NRQL alert (same shape as github-webhook-bad-signature.json).
newrelic/dashboards/instanode-reliability.json — billboard tile (row 84, threshold 1).
observability/METRICS-CATALOG.md — catalog row.
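
The "increase over 10 minutes is greater than zero" condition can be illustrated with a simplified model of how a counter's increase is computed across a window. This is only a sketch: real Prometheus `increase()` also extrapolates to the window boundaries, and the function names below are hypothetical.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the delta.
        total += cur - prev if cur >= prev else cur
    return total


def should_fire(samples_in_window: Sequence[float]) -> bool:
    """True when the counter grew at all inside the window."""
    return counter_increase(samples_in_window) > 0
```

A flat series such as `[3, 3, 3]` stays quiet, while a single new failure (`[3, 4]`) is enough to fire, matching the threshold of 1 used on the dashboard tile.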

R6 — NATS JetStream `emptyDir{}` → PVC

k8s/data/nats.yaml — adds nats-jetstream-pvc (5Gi, cluster-default StorageClass = do-block-storage on DOKS), mounts it at /data/jetstream. Closes the # TODO: convert to PVC marker so queue stream/consumer state + messages survive pod restarts. Recreate strategy retained (RWO single-attach).

R7 — eviction protection under the 200%+ memory overcommit

k8s/data/stateful-priority.yaml (new) — PriorityClass instant-data-critical (value 1000000) + one PodDisruptionBudget(minAvailable:1) each for postgres-customers / mongodb / redis-provision / nats. (minio excluded — retired 2026-05-20.)
Right-sized resources.requests/limits on postgres-customers, mongodb, redis-provision — these were BestEffort QoS (no requests → first-evicted under pressure), now Burstable. nats already had requests.
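
To make the QoS and PDB reasoning concrete, the sketch below models the Kubernetes QoS classification rules and the PDB disruption budget arithmetic. It is a simplification: quantities are compared as plain strings (so "1Gi" and "1024Mi" would not match), and the function names are hypothetical.

```python
from typing import Dict, List

Resources = Dict[str, Dict[str, str]]  # {"requests": {...}, "limits": {...}}


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod from its containers' requests and limits."""
    specs = [(c.get("requests", {}), c.get("limits", {})) for c in containers]
    # No requests or limits anywhere: first in line for eviction under pressure.
    if all(not req and not lim for req, lim in specs):
        return "BestEffort"
    # Guaranteed needs cpu and memory limits on every container, with
    # requests equal to limits (unset requests default to the limits).
    guaranteed = all(
        all(k in lim and req.get(k, lim[k]) == lim[k] for k in ("cpu", "memory"))
        for req, lim in specs
    )
    return "Guaranteed" if guaranteed else "Burstable"


def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a PDB with minAvailable permits right now."""
    return max(0, healthy_pods - min_available)
```

Note that `disruptions_allowed(1, 1)` is 0: if a workload runs a single replica (an assumption here, not stated for every pod above), minAvailable:1 blocks voluntary evictions such as node drains until the operator intervenes.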

Apply runbook

k8s/DATA-TIER-APPLY-RUNBOOK.md (new) — operator apply-order (R7 → R6 → S1 → S2, least-risky-first) + per-step verification gates for R6/R7 AND the unapplied S1 (postgres-customers-lockdown.yaml) and S2 (networkpolicy.yaml). Includes:
- the explicit S2 trap: the NetworkPolicy default-denies the in-cluster instant-pg-proxy unless the dormant allow-rule (lines ~88–103) is uncommented + matched to the live proxy ns/labels — applying as-is breaks every customer Postgres connection.
- the load-bearing S1 verify: external psql -U instant_cust / -U instanode_admin -h pg.instanode.dev MUST be rejected after lockdown (safe connection-rejection probe, no SQL/DDL), while in-cluster admin + customer usr_* paths keep working.
k8s/APPLY-CHECKLIST.md — cross-reference pointer to the new runbook.

Validation

yamllint (CI relaxed config) — clean on all changed manifests.
kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.31.0 — full k8s/ sweep: 106 valid, 0 invalid, 0 errors (PrometheusRule CRD skipped, expected).
JSON parse-checked: alert + dashboard.
YAML structure-checked: instant-billing group nests under spec.groups; nats jetstream-data volume is persistentVolumeClaim, not emptyDir.

Operator apply order (in a maintenance window)

R7-A: kubectl apply -f k8s/data/stateful-priority.yaml
R7-B: apply mongodb.yaml + redis-provision.yaml, patch priorityClassName
R6: kubectl apply -f k8s/data/nats.yaml (+ priorityClassName patch; durability publish/restart test)
S1: apply lockdown ConfigMap + postgres-customers.yaml — verify external admin psql REJECTED
S2: edit the dormant pg-proxy rule first, then apply networkpolicy.yaml — verify customer path still works
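
The apply-then-verify discipline in the order above can be sketched as a small runner that stops at the first failed gate, so later and riskier steps never run on top of an unverified one. This is illustrative only: `run_in_order` and the step tuples are hypothetical, and the runbook itself is a manual operator procedure, not a script.

```python
from typing import Callable, List, Tuple

# (step name, apply action, verification gate)
Step = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_in_order(steps: List[Step]) -> List[str]:
    """Apply steps in order; stop at the first failed verification gate."""
    completed: List[str] = []
    for name, apply, verify in steps:
        apply()
        if not verify():
            raise RuntimeError(f"verification gate failed after {name}; stopping")
        completed.append(name)
    return completed
```

In practice each `apply` would wrap the corresponding kubectl command and each `verify` would wrap the check named in the runbook, for example confirming that an external admin psql connection is rejected after S1.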

🤖 Generated with Claude Code

…eviction protection + apply runbook

S4 (monitoring half): pair the api agent's new instant_razorpay_webhook_sig_fail_total counter with full rule-25 coverage, mirroring the GitHub-webhook bad-signature alert:
- Prom rule RazorpayWebhookSigFailSpike (new instant-billing group, P2/warning)
- NR alert newrelic/alerts/razorpay-webhook-sig-fail.json (P2)
- billboard tile on instanode-reliability.json (row 84, threshold 1)
- METRICS-CATALOG.md row

R6: convert NATS JetStream /data/jetstream from emptyDir{} to PVC (nats-jetstream-pvc, 5Gi, cluster-default StorageClass = do-block-storage on DOKS) so stream/consumer state + persisted messages survive pod restarts. Recreate strategy retained (RWO single-attach).

R7: eviction protection for the stateful instant-data pods under the cluster's 200%+ memory overcommit:
- k8s/data/stateful-priority.yaml: PriorityClass instant-data-critical (value 1000000) + one PDB (minAvailable:1) each for postgres-customers, mongodb, redis-provision, nats. (minio excluded; retired 2026-05-20.)
- right-sized resource requests/limits on postgres-customers, mongodb, redis-provision (were BestEffort QoS → first-evicted; now Burstable).

Apply runbook: k8s/DATA-TIER-APPLY-RUNBOOK.md documents operator apply-order + verification gates for R6/R7 AND the already-committed-but-unapplied S1 (postgres-customers-lockdown) and S2 (networkpolicy), including the explicit warning that the NetworkPolicy needs the dormant instant-pg-proxy allow-rule or it breaks every customer Postgres connection, and the load-bearing S1 verify (external psql -U instant_cust/instanode_admin -h pg.instanode.dev MUST be rejected after lockdown). Cross-linked from APPLY-CHECKLIST.md.

NO kubectl apply performed: infra has no auto-apply (CLAUDE.md rule 15); manifests ship as PR + runbook for the operator to apply in a window.

yamllint + kubeconform (CI parity) green on all changed manifests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ploy/propagation/data-tier OOMKill) (#73)

A customer-facing failure (a backup failed + a mongodb pod OOMKill that lost a provisioned DB) went UNDETECTED for hours until a customer emailed a screenshot. The metric-based alerts that should have caught these are INERT: prod has NO Prometheus pipeline (newrelic-prometheus-agent / #72 is operator-apply-pending), so every FROM Metric alert queries an empty stream. Only FROM Log (newrelic-logging Fluent Bit DaemonSet) + Synthetics are live.

Add 5 LOG-based alert backstops keyed on the REAL emitted worker log lines (verified against worker code, file:line cited in each description):
- customer-backup-failed-nonauth-log.json: jobs.customer_backup_runner.failed reason!='auth', WARNING ABOVE 3/15m (sustained; transient dump self-heals). Complements the pre-existing auth-only CRITICAL alert.
- backup-stuck-row-recovery-failed.json: stuck_row_recovery_failed, CRITICAL ABOVE 0/10m. Regression guard for the NULL-started_at flood fixed in worker #106 (which previously erred on EVERY tick, unalerted, for hours).
- deploy-failed-autopsy-log.json: deploy_failure_autopsy.captured, CRITICAL ABOVE 0/5m. LOG twin of the inert deploy-job/runtime-failed metric alerts (rule 27 silent-deploy-failure class).
- propagation-dead-lettered-log.json: propagation_runner.dead_lettered + unknown_kind_dead_lettered, CRITICAL ABOVE 0/5m. LOG twin of the inert propagation-dead-lettered metric alert (paid customer regrade fell through).
- data-tier-pod-oomkill-restart.json: image-native startup banner of each instant-data stateful pod reappearing = restart, CRITICAL ABOVE 0/5m FACET k8s_label_app. This is the exact failure that OOMKilled mongodb and lost the customer DB this session. Flagged blind spot: a banner detector cannot read exitCode 137 or distinguish OOMKill from a planned rollout; the authoritative reason='OOMKilled' event needs kube-state-metrics / the #72 pipeline.

Document all six in a new LOG-ALERTS section of observability/METRICS-CATALOG.md with the verified source log line + severity + NRQL key per alert, and the two acknowledged blind spots.

FIX 2 (data-tier OOMKill PROTECTION: PriorityClass instant-data-critical, PDBs, per-pod memory/cpu requests+limits, maintenance-window apply runbook) already landed in #69 (k8s/data/stateful-priority.yaml + k8s/data/*.yaml + k8s/DATA-TIER-APPLY-RUNBOOK.md), operator-apply-pending; not duplicated here.

NR alert test suite green (49/49, 98->103 JSONs parse). typos clean. kubectl --dry-run=client clean on the FIX-2 manifests. No code change; YAML/JSON/docs only; operator-apply (apply.sh), no auto-apply.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

mastermanas805 merged commit 11bc470 into master Jun 10, 2026
3 checks passed

mastermanas805 mentioned this pull request Jun 11, 2026

obs(newrelic): log-based backstops for silent-failure gaps (backup/deploy/propagation/data-tier OOMKill) #73

Merged

mastermanas805 merged 1 commit into master from infra/s4-r6-r7-stateful-hardening
