infra(data-tier): S4 razorpay sig-fail monitoring + R6 nats PVC + R7 eviction protection + apply runbook #69

Merged
mastermanas805 merged 1 commit into master from infra/s4-r6-r7-stateful-hardening
Jun 10, 2026

@mastermanas805

What

Stateful instant-data hardening: durability (R6), eviction protection (R7), Razorpay inbound-webhook signature-failure monitoring (S4), and a consolidated operator apply runbook covering these PLUS the already-committed-but-unapplied S1/S2.

NO kubectl apply was performed. Per CLAUDE.md rule 15 this repo has no auto-apply — manifests ship as PR + runbook for the operator to apply in a maintenance window (real customer data is on these pods).

Changes

S4 — Razorpay webhook signature-failure monitoring (rule 25, mirrors the GitHub bad-signature alert)

Pairs with the new instant_razorpay_webhook_sig_fail_total counter that the api agent is adding. Severity is P2/abuse: the signature gate is the only thing between a forged subscription.charged and a free plan upgrade. The 400 short-circuits before dispatch, so no plan_tier flips; this alert exists for visibility and to drive a WAF response.

  • k8s/prometheus-rules.yaml — new instant-billing group, alert RazorpayWebhookSigFailSpike (increase(...[10m]) > 0, warning).
  • newrelic/alerts/razorpay-webhook-sig-fail.json — P2 NRQL alert (same shape as github-webhook-bad-signature.json).
  • newrelic/dashboards/instanode-reliability.json — billboard tile (row 84, threshold 1).
  • observability/METRICS-CATALOG.md — catalog row.
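
For orientation, the Prometheus half of S4 could look roughly like the sketch below. This is not the committed contents of k8s/prometheus-rules.yaml. The group name, alert name, 10-minute window, and warning severity come from the description above; the metadata name, namespace, annotation text, and the exact inner expression (the description elides it as `increase(...[10m])`) are assumptions.

```yaml
# Sketch only: a PrometheusRule carrying the S4 alert.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: instant-alerts        # hypothetical name
  namespace: monitoring       # hypothetical namespace
spec:
  groups:
    - name: instant-billing   # nests under spec.groups, per the Validation section
      rules:
        - alert: RazorpayWebhookSigFailSpike
          # Assumption: the elided expression wraps the new counter directly,
          # with no label matchers or aggregation.
          expr: increase(instant_razorpay_webhook_sig_fail_total[10m]) > 0
          labels:
            severity: warning
          annotations:
            # Illustrative wording, not taken from the repo.
            summary: Razorpay webhook signature verification failed within the last 10 minutes
```

With a threshold of `> 0`, a single rejected webhook within the window fires the alert, which fits the stated intent of treating any forged-signature attempt as worth a look.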

R6 — NATS JetStream emptyDir{} → PVC

  • k8s/data/nats.yaml — adds nats-jetstream-pvc (5Gi, cluster-default StorageClass = do-block-storage on DOKS), mounts it at /data/jetstream. Closes the # TODO: convert to PVC marker so queue stream/consumer state + messages survive pod restarts. Recreate strategy retained (RWO single-attach).
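
A minimal sketch of the R6 shape, assuming nats runs as a single-replica Deployment. The claim name, size, mount path, volume name (jetstream-data), and Recreate strategy come from this PR; the namespace, labels, image tag, and server arguments are illustrative assumptions, not the contents of k8s/data/nats.yaml.

```yaml
# Sketch only: PVC plus the volume wiring that replaces emptyDir{}.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nats-jetstream-pvc
  namespace: instant-data          # hypothetical namespace
spec:
  accessModes:
    - ReadWriteOnce                # RWO: one node may attach the volume at a time
  # storageClassName is omitted so the cluster default is used
  # (do-block-storage on DOKS, per the description above).
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nats
  namespace: instant-data          # hypothetical namespace
spec:
  replicas: 1
  strategy:
    type: Recreate                 # old pod releases the RWO volume before the new pod starts
  selector:
    matchLabels:
      app: nats                    # assumed label
  template:
    metadata:
      labels:
        app: nats
    spec:
      containers:
        - name: nats
          image: nats:2.10-alpine  # assumed image tag
          # Assumed arguments: enable JetStream and point its store at the mount.
          args: ["--jetstream", "--store_dir", "/data/jetstream"]
          volumeMounts:
            - name: jetstream-data
              mountPath: /data/jetstream
      volumes:
        - name: jetstream-data
          persistentVolumeClaim:   # previously emptyDir: {}
            claimName: nats-jetstream-pvc
```

Recreate matters here because a rolling update would start the replacement pod while the old one still holds the single-attach volume, which can leave the new pod stuck waiting to mount it.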

R7 — eviction protection under the 200%+ memory overcommit

  • k8s/data/stateful-priority.yaml (new) — PriorityClass instant-data-critical (value 1000000) + one PodDisruptionBudget(minAvailable:1) each for postgres-customers / mongodb / redis-provision / nats. (minio excluded — retired 2026-05-20.)
  • Right-sized resources.requests/limits on postgres-customers, mongodb, redis-provision — these were BestEffort QoS (no requests → first-evicted under pressure), now Burstable. nats already had requests.
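
The R7 objects can be sketched as follows. The PriorityClass name and value and the `minAvailable: 1` budget come from this PR; the PDB name, namespace, description text, and `app: mongodb` selector are assumptions, and only one of the four PDBs is shown.

```yaml
# Sketch only: priority class plus one of the four per-workload PDBs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: instant-data-critical
value: 1000000
globalDefault: false               # assumed: only pods that opt in get this priority
description: Stateful instant-data pods holding customer data   # illustrative text
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mongodb-pdb                # hypothetical name
  namespace: instant-data          # hypothetical namespace
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: mongodb                 # assumed label; must match the live pod labels
```

Two caveats are worth keeping in mind when reading the runbook. A PodDisruptionBudget only governs voluntary disruptions such as node drains, and with a single replica `minAvailable: 1` blocks those entirely until the budget is relaxed. Node-pressure eviction by the kubelet does not consult PDBs, so protection under memory overcommit comes from the priority class (which takes effect only once each pod spec sets `priorityClassName`) and from the move out of BestEffort QoS.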

Apply runbook

  • k8s/DATA-TIER-APPLY-RUNBOOK.md (new) — operator apply-order (R7 → R6 → S1 → S2, least-risky-first) + per-step verification gates for R6/R7 AND the unapplied S1 (postgres-customers-lockdown.yaml) and S2 (networkpolicy.yaml). Includes:
    • the explicit S2 trap: the NetworkPolicy default-denies the in-cluster instant-pg-proxy unless the dormant allow-rule (lines ~88–103) is uncommented + matched to the live proxy ns/labels — applying as-is breaks every customer Postgres connection.
    • the load-bearing S1 verify: external psql -U instant_cust / -U instanode_admin -h pg.instanode.dev MUST be rejected after lockdown (safe connection-rejection probe, no SQL/DDL), while in-cluster admin + customer usr_* paths keep working.
  • k8s/APPLY-CHECKLIST.md — cross-reference pointer to the new runbook.
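
To make the S2 trap concrete, an allow-rule of the kind the runbook asks the operator to uncomment might look like the sketch below. It is not the dormant rule from networkpolicy.yaml: every name, namespace, label, and the port are hypothetical and have to be matched against the live instant-pg-proxy pods before applying.

```yaml
# Sketch only: allow the in-cluster proxy to reach customer Postgres.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-pg-proxy-to-postgres-customers   # hypothetical name
  namespace: instant-data                      # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: postgres-customers                  # assumed label on the database pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        # namespaceSelector and podSelector in the same list item are ANDed:
        # the peer must be a proxy pod AND live in the proxy namespace.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: instant   # hypothetical proxy namespace
          podSelector:
            matchLabels:
              app: instant-pg-proxy            # assumed label on the proxy pods
      ports:
        - protocol: TCP
          port: 5432                           # assumed Postgres port
```

If the selectors do not match the running proxy's namespace and labels exactly, the default-deny still applies and customer connections through the proxy fail, which is why the runbook orders the edit before the apply.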

Validation

  • yamllint (CI relaxed config) — clean on all changed manifests.
  • kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.31.0 — full k8s/ sweep: 106 valid, 0 invalid, 0 errors (PrometheusRule CRD skipped, expected).
  • JSON parse-checked: alert + dashboard.
  • YAML structure-checked: instant-billing group nests under spec.groups; nats jetstream-data volume is persistentVolumeClaim, not emptyDir.

Operator apply order (in a maintenance window)

  1. R7-A: kubectl apply -f k8s/data/stateful-priority.yaml
  2. R7-B: apply mongodb.yaml + redis-provision.yaml, patch priorityClassName
  3. R6: kubectl apply -f k8s/data/nats.yaml (+ priorityClassName patch; durability publish/restart test)
  4. S1: apply lockdown ConfigMap + postgres-customers.yaml, then verify external admin psql REJECTED
  5. S2: edit the dormant pg-proxy rule first, then apply networkpolicy.yaml — verify customer path still works

🤖 Generated with Claude Code

…eviction protection + apply runbook

S4 (monitoring half) — pair the api agent's new instant_razorpay_webhook_sig_fail_total
counter with full rule-25 coverage, mirroring the GitHub-webhook bad-signature alert:
  - Prom rule RazorpayWebhookSigFailSpike (new instant-billing group, P2/warning)
  - NR alert newrelic/alerts/razorpay-webhook-sig-fail.json (P2)
  - billboard tile on instanode-reliability.json (row 84, threshold 1)
  - METRICS-CATALOG.md row

R6 — convert NATS JetStream /data/jetstream from emptyDir{} to PVC
(nats-jetstream-pvc, 5Gi, cluster-default StorageClass = do-block-storage on DOKS)
so stream/consumer state + persisted messages survive pod restarts. Recreate
strategy retained (RWO single-attach).

R7 — eviction protection for the stateful instant-data pods under the cluster's
200%+ memory overcommit:
  - k8s/data/stateful-priority.yaml: PriorityClass instant-data-critical
    (value 1000000) + one PDB (minAvailable:1) each for postgres-customers,
    mongodb, redis-provision, nats. (minio excluded — retired 2026-05-20.)
  - right-sized resource requests/limits on postgres-customers, mongodb,
    redis-provision (were BestEffort QoS → first-evicted; now Burstable).

Apply runbook — k8s/DATA-TIER-APPLY-RUNBOOK.md documents operator apply-order +
verification gates for R6/R7 AND the already-committed-but-unapplied S1
(postgres-customers-lockdown) and S2 (networkpolicy), including the explicit
warning that the NetworkPolicy needs the dormant instant-pg-proxy allow-rule
or it breaks every customer Postgres connection, and the load-bearing S1 verify
(external psql -U instant_cust/instanode_admin -h pg.instanode.dev MUST be
rejected after lockdown). Cross-linked from APPLY-CHECKLIST.md.

NO kubectl apply performed — infra has no auto-apply (CLAUDE.md rule 15);
manifests ship as PR + runbook for the operator to apply in a window.
yamllint + kubeconform (CI parity) green on all changed manifests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mastermanas805 merged commit 11bc470 into master on Jun 10, 2026
3 checks passed
mastermanas805 added a commit that referenced this pull request Jun 11, 2026
…ploy/propagation/data-tier OOMKill) (#73)

A customer-facing failure (a backup failed + a mongodb pod OOMKill that lost a
provisioned DB) went UNDETECTED for hours until a customer emailed a screenshot.
The metric-based alerts that should have caught these are INERT: prod has NO
Prometheus pipeline (newrelic-prometheus-agent / #72 is operator-apply-pending),
so every FROM Metric alert queries an empty stream. Only FROM Log (newrelic-
logging Fluent Bit DaemonSet) + Synthetics are live.

Add 5 LOG-based alert backstops keyed on the REAL emitted worker log lines
(verified against worker code, file:line cited in each description):

- customer-backup-failed-nonauth-log.json — jobs.customer_backup_runner.failed
  reason!='auth', WARNING ABOVE 3/15m (sustained; transient dump self-heals).
  Complements the pre-existing auth-only CRITICAL alert.
- backup-stuck-row-recovery-failed.json — stuck_row_recovery_failed, CRITICAL
  ABOVE 0/10m. Regression guard for the NULL-started_at flood fixed in worker
  #106 (which previously erred on EVERY tick, unalerted, for hours).
- deploy-failed-autopsy-log.json — deploy_failure_autopsy.captured, CRITICAL
  ABOVE 0/5m. LOG twin of the inert deploy-job/runtime-failed metric alerts
  (rule 27 silent-deploy-failure class).
- propagation-dead-lettered-log.json — propagation_runner.dead_lettered +
  unknown_kind_dead_lettered, CRITICAL ABOVE 0/5m. LOG twin of the inert
  propagation-dead-lettered metric alert (paid customer regrade fell through).
- data-tier-pod-oomkill-restart.json — image-native startup banner of each
  instant-data stateful pod reappearing = restart, CRITICAL ABOVE 0/5m FACET
  k8s_label_app. This is the exact failure that OOMKilled mongodb and lost the
  customer DB this session. Flagged blind spot: a banner detector cannot read
  exitCode 137 or distinguish OOMKill from a planned rollout — the authoritative
  reason='OOMKilled' event needs kube-state-metrics / the #72 pipeline.

Document all six in a new LOG-ALERTS section of observability/METRICS-CATALOG.md
with the verified source log line + severity + NRQL key per alert, and the two
acknowledged blind spots.

FIX 2 (data-tier OOMKill PROTECTION — PriorityClass instant-data-critical, PDBs,
per-pod memory/cpu requests+limits, maintenance-window apply runbook) already
landed in #69 (k8s/data/stateful-priority.yaml + k8s/data/*.yaml +
k8s/DATA-TIER-APPLY-RUNBOOK.md), operator-apply-pending; not duplicated here.

NR alert test suite green (49/49, 98->103 JSONs parse). typos clean. kubectl
--dry-run=client clean on the FIX-2 manifests. No code change; YAML/JSON/docs
only; operator-apply (apply.sh) — no auto-apply.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>