infra(data-tier): S4 razorpay sig-fail monitoring + R6 nats PVC + R7 eviction protection + apply runbook #69
Merged
…eviction protection + apply runbook
S4 (monitoring half) — pair the api agent's new instant_razorpay_webhook_sig_fail_total
counter with full rule-25 coverage, mirroring the GitHub-webhook bad-signature alert:
- Prom rule RazorpayWebhookSigFailSpike (new instant-billing group, P2/warning)
- NR alert newrelic/alerts/razorpay-webhook-sig-fail.json (P2)
- billboard tile on instanode-reliability.json (row 84, threshold 1)
- METRICS-CATALOG.md row
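
As a rough illustration of the Prometheus half, the rule might look like the sketch below. The group name, alert name, 10-minute window, and warning severity come from this PR. The metadata name and annotation text are hypothetical, and the selector inside increase() is assumed to be the bare counter, since the PR elides it.

```yaml
# Sketch only, not the committed k8s/prometheus-rules.yaml.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: instant-rules  # hypothetical name
spec:
  groups:
    - name: instant-billing
      rules:
        - alert: RazorpayWebhookSigFailSpike
          # Assumed selector: the PR writes this as increase(...[10m]) > 0.
          expr: increase(instant_razorpay_webhook_sig_fail_total[10m]) > 0
          labels:
            severity: warning
          annotations:
            # Hypothetical wording.
            summary: Razorpay webhook signature verification failed
```

With no `for:` clause, a single failed signature inside the window is enough to fire, which matches the "> 0" threshold described above.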
R6 — convert NATS JetStream /data/jetstream from emptyDir{} to PVC
(nats-jetstream-pvc, 5Gi, cluster-default StorageClass = do-block-storage on DOKS)
so stream/consumer state + persisted messages survive pod restarts. Recreate
strategy retained (RWO single-attach).
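
A minimal sketch of the emptyDir-to-PVC change, assuming nats runs as a Deployment (implied by the Recreate strategy). The claim name, size, mount path, and the jetstream-data volume name come from this PR. The namespace and container name are assumptions.

```yaml
# Sketch only, not the committed k8s/data/nats.yaml.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nats-jetstream-pvc
  namespace: instant-data  # assumed namespace
spec:
  accessModes:
    - ReadWriteOnce  # RWO: one node attaches the volume at a time
  resources:
    requests:
      storage: 5Gi
  # storageClassName is left unset so the cluster-default class applies.
---
# Fragment of the nats Deployment (other fields omitted).
spec:
  strategy:
    type: Recreate  # old pod releases the RWO volume before the new one starts
  template:
    spec:
      containers:
        - name: nats  # assumed container name
          volumeMounts:
            - name: jetstream-data
              mountPath: /data/jetstream
      volumes:
        - name: jetstream-data
          persistentVolumeClaim:
            claimName: nats-jetstream-pvc
```

A RollingUpdate strategy would start the replacement pod while the old one still holds the RWO volume, which is why Recreate is the safer pairing here.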
R7 — eviction protection for the stateful instant-data pods under the cluster's
200%+ memory overcommit:
- k8s/data/stateful-priority.yaml: PriorityClass instant-data-critical
(value 1000000) + one PDB (minAvailable:1) each for postgres-customers,
mongodb, redis-provision, nats. (minio excluded — retired 2026-05-20.)
- right-sized resource requests/limits on postgres-customers, mongodb,
redis-provision (were BestEffort QoS → first-evicted; now Burstable).
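
The R7 pieces could be sketched roughly as follows. The PriorityClass name and value and the PDB minAvailable come from this PR. The PDB name, namespace, label selector, and every resource figure are hypothetical placeholders, not the right-sized values that were committed.

```yaml
# Sketch only, not the committed k8s/data/stateful-priority.yaml.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: instant-data-critical
value: 1000000
globalDefault: false
description: Stateful instant-data pods  # hypothetical wording
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mongodb-pdb  # hypothetical name
  namespace: instant-data  # assumed namespace
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: mongodb  # assumed label
---
# Fragment of a stateful pod template (other fields omitted).
spec:
  template:
    spec:
      priorityClassName: instant-data-critical
      containers:
        - name: mongodb  # assumed container name
          resources:
            requests:
              memory: 512Mi  # placeholder value
              cpu: 100m  # placeholder value
            limits:
              memory: 1Gi  # placeholder; limits above requests gives Burstable QoS
```

Two caveats if this shape is close to the real one. A PDB only guards voluntary disruptions through the Eviction API; the kubelet does not consult it during node-pressure eviction, where requests and pod priority are what affect ordering. And if these workloads run a single replica (replica counts are not stated here), minAvailable: 1 will make a node drain wait until the budget is relaxed.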
Apply runbook — k8s/DATA-TIER-APPLY-RUNBOOK.md documents operator apply-order +
verification gates for R6/R7 AND the already-committed-but-unapplied S1
(postgres-customers-lockdown) and S2 (networkpolicy), including the explicit
warning that the NetworkPolicy needs the dormant instant-pg-proxy allow-rule
or it breaks every customer Postgres connection, and the load-bearing S1 verify
(external psql -U instant_cust/instanode_admin -h pg.instanode.dev MUST be
rejected after lockdown). Cross-linked from APPLY-CHECKLIST.md.
NO kubectl apply performed — infra has no auto-apply (CLAUDE.md rule 15);
manifests ship as PR + runbook for the operator to apply in a window.
yamllint + kubeconform (CI parity) green on all changed manifests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request Jun 11, 2026
…ploy/propagation/data-tier OOMKill) (#73)

A customer-facing failure (a backup failed + a mongodb pod OOMKill that lost a provisioned DB) went UNDETECTED for hours until a customer emailed a screenshot. The metric-based alerts that should have caught these are INERT: prod has NO Prometheus pipeline (newrelic-prometheus-agent / #72 is operator-apply-pending), so every FROM Metric alert queries an empty stream. Only FROM Log (newrelic-logging Fluent Bit DaemonSet) + Synthetics are live.

Add 5 LOG-based alert backstops keyed on the REAL emitted worker log lines (verified against worker code, file:line cited in each description):
- customer-backup-failed-nonauth-log.json: jobs.customer_backup_runner.failed reason!='auth', WARNING ABOVE 3/15m (sustained; transient dump self-heals). Complements the pre-existing auth-only CRITICAL alert.
- backup-stuck-row-recovery-failed.json: stuck_row_recovery_failed, CRITICAL ABOVE 0/10m. Regression guard for the NULL-started_at flood fixed in worker #106 (which previously erred on EVERY tick, unalerted, for hours).
- deploy-failed-autopsy-log.json: deploy_failure_autopsy.captured, CRITICAL ABOVE 0/5m. LOG twin of the inert deploy-job/runtime-failed metric alerts (rule 27 silent-deploy-failure class).
- propagation-dead-lettered-log.json: propagation_runner.dead_lettered + unknown_kind_dead_lettered, CRITICAL ABOVE 0/5m. LOG twin of the inert propagation-dead-lettered metric alert (paid customer regrade fell through).
- data-tier-pod-oomkill-restart.json: image-native startup banner of each instant-data stateful pod reappearing = restart, CRITICAL ABOVE 0/5m FACET k8s_label_app. This is the exact failure that OOMKilled mongodb and lost the customer DB this session. Flagged blind spot: a banner detector cannot read exitCode 137 or distinguish OOMKill from a planned rollout; the authoritative reason='OOMKilled' event needs kube-state-metrics / the #72 pipeline.

Document all six in a new LOG-ALERTS section of observability/METRICS-CATALOG.md with the verified source log line + severity + NRQL key per alert, and the two acknowledged blind spots.

FIX 2 (data-tier OOMKill PROTECTION: PriorityClass instant-data-critical, PDBs, per-pod memory/cpu requests+limits, maintenance-window apply runbook) already landed in #69 (k8s/data/stateful-priority.yaml + k8s/data/*.yaml + k8s/DATA-TIER-APPLY-RUNBOOK.md), operator-apply-pending; not duplicated here.

NR alert test suite green (49/49, 98->103 JSONs parse). typos clean. kubectl --dry-run=client clean on the FIX-2 manifests. No code change; YAML/JSON/docs only; operator-apply (apply.sh), no auto-apply.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
What

Stateful instant-data hardening: durability (R6), eviction protection (R7), Razorpay inbound-webhook signature-failure monitoring (S4), and a consolidated operator apply runbook covering these PLUS the already-committed-but-unapplied S1/S2.

NO kubectl apply was performed. Per CLAUDE.md rule 15 this repo has no auto-apply: manifests ship as PR + runbook for the operator to apply in a maintenance window (real customer data is on these pods).

Changes

S4: Razorpay webhook signature-failure monitoring (rule 25, mirrors the GitHub bad-signature alert)

Pairs the new instant_razorpay_webhook_sig_fail_total counter the api agent is adding. P2/abuse: the signature gate is the only thing between a forged subscription.charged and a free plan upgrade. The 400 short-circuits before dispatch so no plan_tier flips; this is visibility + WAF response.
- k8s/prometheus-rules.yaml: new instant-billing group, alert RazorpayWebhookSigFailSpike (increase(...[10m]) > 0, warning).
- newrelic/alerts/razorpay-webhook-sig-fail.json: P2 NRQL alert (same shape as github-webhook-bad-signature.json).
- newrelic/dashboards/instanode-reliability.json: billboard tile (row 84, threshold 1).
- observability/METRICS-CATALOG.md: catalog row.

R6: NATS JetStream emptyDir{} → PVC

- k8s/data/nats.yaml: adds nats-jetstream-pvc (5Gi, cluster-default StorageClass = do-block-storage on DOKS), mounts it at /data/jetstream. Closes the "# TODO: convert to PVC" marker so queue stream/consumer state + messages survive pod restarts. Recreate strategy retained (RWO single-attach).

R7: eviction protection under the 200%+ memory overcommit

- k8s/data/stateful-priority.yaml (new): PriorityClass instant-data-critical (value 1000000) + one PodDisruptionBudget (minAvailable:1) each for postgres-customers / mongodb / redis-provision / nats. (minio excluded; retired 2026-05-20.)
- resources.requests/limits on postgres-customers, mongodb, redis-provision: these were BestEffort QoS (no requests → first-evicted under pressure), now Burstable. nats already had requests.

Apply runbook

- k8s/DATA-TIER-APPLY-RUNBOOK.md (new): operator apply-order (R7 → R6 → S1 → S2, least-risky-first) + per-step verification gates for R6/R7 AND the unapplied S1 (postgres-customers-lockdown.yaml) and S2 (networkpolicy.yaml). Includes:
  - S2 warning: the NetworkPolicy does not admit instant-pg-proxy unless the dormant allow-rule (lines ~88–103) is uncommented + matched to the live proxy ns/labels; applying as-is breaks every customer Postgres connection.
  - S1 verify (load-bearing): external psql -U instant_cust / -U instanode_admin -h pg.instanode.dev MUST be rejected after lockdown (safe connection-rejection probe, no SQL/DDL), while in-cluster admin + customer usr_* paths keep working.
- k8s/APPLY-CHECKLIST.md: cross-reference pointer to the new runbook.

Validation

- yamllint (CI relaxed config): clean on all changed manifests.
- kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.31.0, full k8s/ sweep: 106 valid, 0 invalid, 0 errors (PrometheusRule CRD skipped, expected).
- instant-billing group nests under spec.groups; nats jetstream-data volume is persistentVolumeClaim, not emptyDir.

Operator apply order (in a maintenance window)

1. kubectl apply -f k8s/data/stateful-priority.yaml
2. mongodb.yaml + redis-provision.yaml, patch priorityClassName
3. kubectl apply -f k8s/data/nats.yaml (+ priorityClassName patch; durability publish/restart test)
4. postgres-customers.yaml: verify external admin psql REJECTED
5. networkpolicy.yaml: verify customer path still works

🤖 Generated with Claude Code