From 618e44b14df1e4ac2082bfcc388808d865dda657 Mon Sep 17 00:00:00 2001 From: Manas Srivastava Date: Sat, 6 Jun 2026 20:16:43 +0530 Subject: [PATCH] observability: alert + prom rule + tile + catalog for instant_payment_probe_outcome_total (rule 25) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rule-25 observability for the Layer-3 payment prober (the money heartbeat, worker/internal/jobs/payment_probe.go — forum verdict §4). Ships in lockstep with the worker PR that adds the metric. - newrelic/alerts/payment-probe-fail.json — P1 page on instant_payment_probe_outcome_total{result="fail"} > 0 in 10m (paid revenue path down). result="degraded" EXCLUDED so the prober never false-pages before the operator lights PAYMENT_PROBE_ENABLED + the test webhook secret. - k8s/prometheus-rules.yaml — instant-worker-payment-probe group / PaymentProbeFail (mirror of the NR alert). - newrelic/dashboards/instanode-reliability.json — three tiles: outcomes per leg, fails billboard (must be 0), P95 latency per leg. - observability/METRICS-CATALOG.md — rows for the outcome counter + latency histogram (both lazy *Vec, INERT until PAYMENT_PROBE_ENABLED=true). Operator-apply (infra has no auto-apply). Awaiting operator PAYMENT_PROBE_ENABLED=true (+ RAZORPAY_TEST_WEBHOOK_SECRET for the upgrade leg) before any series materialises. Co-Authored-By: Claude Opus 4.8 --- k8s/prometheus-rules.yaml | 46 +++++++++++ newrelic/alerts/payment-probe-fail.json | 31 +++++++ .../dashboards/instanode-reliability.json | 81 +++++++++++++++++++ observability/METRICS-CATALOG.md | 2 + 4 files changed, 160 insertions(+) create mode 100644 newrelic/alerts/payment-probe-fail.json diff --git a/k8s/prometheus-rules.yaml b/k8s/prometheus-rules.yaml index eb86a46..52ea0b6 100644 --- a/k8s/prometheus-rules.yaml +++ b/k8s/prometheus-rules.yaml @@ -790,6 +790,52 @@ spec: the worker structured slog ERROR `deploy_probe_failed leg=... reason=...`. Source: worker/internal/jobs/deploy_probe.go. + # Layer-3 payment prober (the money heartbeat). Every 5 minutes the + # worker drives the iframe-free payment-funnel contract path against + # prod: checkout reachability (POST /api/v1/billing/checkout, non-5xx), + # billing + invoices read surfaces (non-5xx), the webhook signature + # SECURITY contract (POST /razorpay/webhook with a garbage unsigned + # payload MUST be rejected 400 invalid_signature), and an OPTIONAL + # test-mode upgrade proof (inject a correctly-signed TEST webhook → + # assert teams.plan_tier flips → reap the cohort team). Any fail + # outcome over a 10-minute window pages — at the 5-minute cadence two + # consecutive fail ticks is an unambiguous paid-funnel regression, not + # a flake. result="degraded" is the config-unset / slow-but-correct + # state (no test webhook secret, no JWT secret, or over-budget) and is + # EXCLUDED, so the prober never false-pages before the operator lights + # PAYMENT_PROBE_ENABLED. Mirrors payment-probe-fail.json in + # newrelic/alerts/. Forum verdict: docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4. + - name: instant-worker-payment-probe + rules: + - alert: PaymentProbeFail + expr: | + sum(increase(instant_payment_probe_outcome_total{result="fail"}[10m])) > 0 + for: 10m + labels: + severity: critical + service: worker + annotations: + summary: "Layer-3 payment prober fail (paid revenue path broken in prod)" + description: | + instant_payment_probe_outcome_total{result="fail"} > 0 in 10m. + The Layer-3 payment prober (the money heartbeat) drives the + iframe-free payment-funnel contract path against prod every 5 + minutes. A fail outcome means one of: checkout_reachable — + POST /api/v1/billing/checkout returned a 5xx (handler crash; + a 402/409/502 blocked-but-alive shape is a PASS while Razorpay + live-recurring is operator-blocked); billing_state / + invoices_reachable — GET /api/v1/billing[/invoices] 5xx (a + paid-tier read-surface regression); webhook_security — the + /razorpay/webhook signature gate ACCEPTED an unsigned payload + (a CRITICAL security regression — a forged success could drive + a free upgrade) OR returned a non-canonical error_code; or + upgrade_webhook_e2e — a signed TEST-mode subscription.charged + did NOT flip teams.plan_tier (the rule-12 downstream truth + surface, NOT a webhook 200). Cross-correlate against audit_log + kind=payment_probe_failed + the worker structured slog ERROR + `payment_probe_failed leg=... reason=...`. Source: + worker/internal/jobs/payment_probe.go. + # GitHub App push-to-deploy pipeline (P4, pre-staged 2026-06-03). # Metrics emitted by api/internal/handlers/github_webhook.go: # instant_github_webhook_received_total{event,result} diff --git a/newrelic/alerts/payment-probe-fail.json b/newrelic/alerts/payment-probe-fail.json new file mode 100644 index 0000000..5bf1219 --- /dev/null +++ b/newrelic/alerts/payment-probe-fail.json @@ -0,0 +1,31 @@ +{ + "name": "instant-worker — payment_probe_failed [Layer-3 money heartbeat: paid revenue path broken in prod]", + "type": "NRQL", + "description": "P1 page on ANY occurrence of instant_payment_probe_outcome_total{result=\"fail\"}. The Layer-3 payment prober (the money heartbeat — forum verdict docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4 Layer 3) drives the iframe-free payment-funnel contract path against prod every 5 minutes. A fail outcome means one of: (a) checkout_reachable — POST /api/v1/billing/checkout returned a 5xx (handler crash/regression — NOT a 402/409/502 blocked-but-alive shape, which the prober treats as PASS while Razorpay live-recurring is operator-blocked); (b) billing_state / invoices_reachable — GET /api/v1/billing or /api/v1/billing/invoices returned a 5xx (a paid-tier read-surface regression); (c) webhook_security — POST /razorpay/webhook ACCEPTED an unsigned/garbage payload (a CRITICAL security regression: the signature gate would let a forged 'success' drive a free upgrade) OR rejected it with a non-canonical error_code; (d) upgrade_webhook_e2e — a correctly-signed TEST-mode subscription.charged was injected against a fresh cohort team but teams.plan_tier did NOT advance (the upgrade pipeline is broken — this is the rule-12 downstream truth surface, NOT a webhook 200). result='degraded' is EXCLUDED from this alert (it is the config-unset / slow-but-correct state — no test webhook secret, no JWT secret, or over-budget latency — so the prober never false-pages before the operator wires PAYMENT_PROBE_ENABLED + the test secret). Cross-correlate against the audit_log row written by the worker (kind=payment_probe_failed, actor='system:payment_probe') AND the structured slog ERROR line payment_probe_failed leg=... reason=... — same content on both surfaces. cohort='synthetic' so this never pollutes billing/revenue dashboards. Source: worker/internal/jobs/payment_probe.go (PaymentProbeWorker), metric registered in worker/internal/metrics/metrics.go (PaymentProbeOutcomeTotal). Threshold ABOVE 0 with a 10m window: a single fail tick over 10 minutes (two ticks at the 5-minute cadence) is an unambiguous paid-funnel regression, not a flake. AWAITING operator PAYMENT_PROBE_ENABLED=true (and, for the upgrade leg, RAZORPAY_TEST_WEBHOOK_SECRET) before any series materialises.", + "enabled": true, + "nrql": { + "query": "SELECT sum(instant_payment_probe_outcome_total) FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' AND result = 'fail'" + }, + "terms": [ + { + "priority": "CRITICAL", + "operator": "ABOVE", + "threshold": 0, + "thresholdDuration": 600, + "thresholdOccurrences": "AT_LEAST_ONCE" + } + ], + "signal": { + "aggregationWindow": 60, + "aggregationMethod": "EVENT_FLOW", + "aggregationDelay": 120, + "fillOption": "STATIC", + "fillValue": 0 + }, + "expiration": { + "expirationDuration": 7200, + "openViolationOnExpiration": false, + "closeViolationsOnExpiration": true + }, + "violationTimeLimitSeconds": 86400 +} diff --git a/newrelic/dashboards/instanode-reliability.json b/newrelic/dashboards/instanode-reliability.json index abc7934..f50a654 100644 --- a/newrelic/dashboards/instanode-reliability.json +++ b/newrelic/dashboards/instanode-reliability.json @@ -1436,6 +1436,87 @@ "ignoreTimeRange": false } } + }, + { + "title": "Layer-3 payment prober — outcomes per leg (6h) [money heartbeat]", + "layout": { + "column": 1, + "row": 75, + "width": 6, + "height": 3 + }, + "visualization": { + "id": "viz.line" + }, + "rawConfiguration": { + "nrqlQueries": [ + { + "accountIds": [ + 0 + ], + "query": "SELECT sum(instant_payment_probe_outcome_total) FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' FACET leg, result TIMESERIES SINCE 6 hours ago" + } + ], + "platformOptions": { + "ignoreTimeRange": false + } + } + }, + { + "title": "Layer-3 payment prober — fails (last 6h, must be 0; degraded excluded)", + "layout": { + "column": 7, + "row": 75, + "width": 3, + "height": 3 + }, + "visualization": { + "id": "viz.billboard" + }, + "rawConfiguration": { + "nrqlQueries": [ + { + "accountIds": [ + 0 + ], + "query": "SELECT sum(instant_payment_probe_outcome_total) AS 'fails' FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' AND result = 'fail' SINCE 6 hours ago" + } + ], + "platformOptions": { + "ignoreTimeRange": false + }, + "thresholds": [ + { + "alertSeverity": "CRITICAL", + "value": 1 + } + ] + } + }, + { + "title": "Layer-3 payment prober — P95 latency per leg (6h)", + "layout": { + "column": 10, + "row": 75, + "width": 3, + "height": 3 + }, + "visualization": { + "id": "viz.line" + }, + "rawConfiguration": { + "nrqlQueries": [ + { + "accountIds": [ + 0 + ], + "query": "SELECT percentile(instant_payment_probe_latency_seconds, 95) AS 'p95' FROM Metric WHERE metricName = 'instant_payment_probe_latency_seconds' FACET leg TIMESERIES SINCE 6 hours ago" + } + ], + "platformOptions": { + "ignoreTimeRange": false + } + } } ] } diff --git a/observability/METRICS-CATALOG.md b/observability/METRICS-CATALOG.md index 9757cd9..98e73d3 100644 --- a/observability/METRICS-CATALOG.md +++ b/observability/METRICS-CATALOG.md @@ -44,6 +44,8 @@ fires. Operators need this so they don't panic when a fresh deploy looks | `instant_auth_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only on a real HTTP response; DNS/TCP errors omit the observation so the histogram isn't polluted with 0s timeouts) | (covered by `auth-probe-fail.json`) | (covered by `AuthProbeFail`) | "AUTH-004 synthetic prober — P95 latency per leg (1h)" | | `instant_deploy_probe_outcome_total` | worker | `leg,result` | lazy (CounterVec — `pass`/`degraded` materialise on the first happy tick; `fail` only appears after a real regression. Hourly deploy prober: every 60 min the worker drives /deploy/new + status-poll until healthy + public-host GET against prod. Closes the 2026-05-30 stuck-build gap that hid a broken deploy pipeline for ~30 min) | `deploy-probe-fail.json` | `DeployProbeFail` | "Hourly deploy prober — outcomes per leg (6h)", "Hourly deploy prober — fails (last 6h, must be 0)" | | `instant_deploy_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only on a real HTTP response or successful status flip; DNS/TCP errors omit the observation. Buckets span the per-leg budgets up to the 120s cold-cluster Kaniko ceiling) | (covered by `deploy-probe-fail.json`) | (covered by `DeployProbeFail`) | "Hourly deploy prober — P95 latency per leg (6h)" | +| `instant_payment_probe_outcome_total` | worker | `leg,result` | lazy (CounterVec — INERT until `PAYMENT_PROBE_ENABLED=true`; once on, `pass`/`degraded` materialise on the first tick and `fail` only on a real regression. Layer-3 payment prober (the money heartbeat, forum verdict docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4): every 5 min drives the iframe-free payment-funnel contract path against prod — `leg ∈ checkout_reachable / billing_state / invoices_reachable / webhook_security / upgrade_webhook_e2e`, each reading a rule-12 truth surface, NO real money. The upgrade leg additionally needs `RAZORPAY_TEST_WEBHOOK_SECRET` (degraded otherwise). label families primed in `metrics_test.go`) | `payment-probe-fail.json` | `PaymentProbeFail` (instant-worker-payment-probe group) | "Layer-3 payment prober — outcomes per leg (6h)", "Layer-3 payment prober — fails (last 6h, must be 0)" | +| `instant_payment_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only when a real request was performed; a config-skipped leg omits the observation. Buckets span the per-leg budgets up to the 8s upgrade-leg ceiling. INERT until `PAYMENT_PROBE_ENABLED=true`) | (covered by `payment-probe-fail.json`) | (covered by `PaymentProbeFail`) | "Layer-3 payment prober — P95 latency per leg (6h)" | | `instant_tier_upgrade_ttl_promote_total` | api | `outcome` | lazy (CounterVec — outcome label series materialise on first paid-tier upgrade after deploy; `error` should stay absent in a healthy deploy. P1 fix 2026-05-31 — emits from billing.handleSubscriptionCharged → PromoteDeploymentTTLsForTeam) | `tier-upgrade-ttl-promote-failed.json` | `TierUpgradeTTLPromoteFailed` | "Tier-upgrade TTL promote outcomes (24h) — error must be 0" | | `instant_customer_backup_failed_total` | worker | `reason` | lazy (CounterVec — `reason` series materialise on first failure: auth/decrypt/config/dump/upload. `auth`=credential drift, SLA breach, won't self-heal → CRITICAL; others WARNING. Added 2026-06-03 after a failed backup paged no one — stale=36h, no-followup=stuck-only) | `customer-backup-failed.json` | `CustomerBackupCredentialFailure`, `CustomerBackupFailures` | "Customer backup failures by reason (24h)" | | `instant_customer_backup_succeeded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; paired with `_failed_total` for the success-ratio billboard) | (ratio feeds the dashboard; no standalone alert) | (none — denominator only) | "Backup success rate (last 24h, all teams)" |