Skip to content

Commit 618e44b

Browse files
observability: alert + prom rule + tile + catalog for instant_payment_probe_outcome_total (rule 25)
Rule-25 observability for the Layer-3 payment prober (the money heartbeat, worker/internal/jobs/payment_probe.go — forum verdict §4). Ships in lockstep with the worker PR that adds the metric. - newrelic/alerts/payment-probe-fail.json — P1 page on instant_payment_probe_outcome_total{result="fail"} > 0 in 10m (paid revenue path down). result="degraded" EXCLUDED so the prober never false-pages before the operator lights PAYMENT_PROBE_ENABLED + the test webhook secret. - k8s/prometheus-rules.yaml — instant-worker-payment-probe group / PaymentProbeFail (mirror of the NR alert). - newrelic/dashboards/instanode-reliability.json — three tiles: outcomes per leg, fails billboard (must be 0), P95 latency per leg. - observability/METRICS-CATALOG.md — rows for the outcome counter + latency histogram (both lazy *Vec, INERT until PAYMENT_PROBE_ENABLED=true). Operator-apply (infra has no auto-apply). Awaiting operator PAYMENT_PROBE_ENABLED=true (+ RAZORPAY_TEST_WEBHOOK_SECRET for the upgrade leg) before any series materialises. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 78cb667 commit 618e44b

4 files changed

Lines changed: 160 additions & 0 deletions

File tree

k8s/prometheus-rules.yaml

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -790,6 +790,52 @@ spec:
790790
the worker structured slog ERROR `deploy_probe_failed leg=...
791791
reason=...`. Source: worker/internal/jobs/deploy_probe.go.
792792
793+
# Layer-3 payment prober (the money heartbeat). Every 5 minutes the
794+
# worker drives the iframe-free payment-funnel contract path against
795+
# prod: checkout reachability (POST /api/v1/billing/checkout, non-5xx),
796+
# billing + invoices read surfaces (non-5xx), the webhook signature
797+
# SECURITY contract (POST /razorpay/webhook with a garbage unsigned
798+
# payload MUST be rejected 400 invalid_signature), and an OPTIONAL
799+
# test-mode upgrade proof (inject a correctly-signed TEST webhook →
800+
# assert teams.plan_tier flips → reap the cohort team). Any fail
801+
# outcome over a 10-minute window pages — at the 5-minute cadence two
802+
# consecutive fail ticks is an unambiguous paid-funnel regression, not
803+
# a flake. result="degraded" is the config-unset / slow-but-correct
804+
# state (no test webhook secret, no JWT secret, or over-budget) and is
805+
# EXCLUDED, so the prober never false-pages before the operator lights
806+
# PAYMENT_PROBE_ENABLED. Mirrors payment-probe-fail.json in
807+
# newrelic/alerts/. Forum verdict: docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4.
808+
- name: instant-worker-payment-probe
809+
rules:
810+
- alert: PaymentProbeFail
811+
expr: |
812+
sum(increase(instant_payment_probe_outcome_total{result="fail"}[10m])) > 0
813+
for: 10m
814+
labels:
815+
severity: critical
816+
service: worker
817+
annotations:
818+
summary: "Layer-3 payment prober fail (paid revenue path broken in prod)"
819+
description: |
820+
instant_payment_probe_outcome_total{result="fail"} > 0 in 10m.
821+
The Layer-3 payment prober (the money heartbeat) drives the
822+
iframe-free payment-funnel contract path against prod every 5
823+
minutes. A fail outcome means one of: checkout_reachable —
824+
POST /api/v1/billing/checkout returned a 5xx (handler crash;
825+
a 402/409/502 blocked-but-alive shape is a PASS while Razorpay
826+
live-recurring is operator-blocked); billing_state /
827+
invoices_reachable — GET /api/v1/billing[/invoices] 5xx (a
828+
paid-tier read-surface regression); webhook_security — the
829+
/razorpay/webhook signature gate ACCEPTED an unsigned payload
830+
(a CRITICAL security regression — a forged success could drive
831+
a free upgrade) OR returned a non-canonical error_code; or
832+
upgrade_webhook_e2e — a signed TEST-mode subscription.charged
833+
did NOT flip teams.plan_tier (the rule-12 downstream truth
834+
surface, NOT a webhook 200). Cross-correlate against audit_log
835+
kind=payment_probe_failed + the worker structured slog ERROR
836+
`payment_probe_failed leg=... reason=...`. Source:
837+
worker/internal/jobs/payment_probe.go.
838+
793839
# GitHub App push-to-deploy pipeline (P4, pre-staged 2026-06-03).
794840
# Metrics emitted by api/internal/handlers/github_webhook.go:
795841
# instant_github_webhook_received_total{event,result}
Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
{
2+
"name": "instant-worker — payment_probe_failed [Layer-3 money heartbeat: paid revenue path broken in prod]",
3+
"type": "NRQL",
4+
"description": "P1 page on ANY occurrence of instant_payment_probe_outcome_total{result=\"fail\"}. The Layer-3 payment prober (the money heartbeat — forum verdict docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4 Layer 3) drives the iframe-free payment-funnel contract path against prod every 5 minutes. A fail outcome means one of: (a) checkout_reachable — POST /api/v1/billing/checkout returned a 5xx (handler crash/regression — NOT a 402/409/502 blocked-but-alive shape, which the prober treats as PASS while Razorpay live-recurring is operator-blocked); (b) billing_state / invoices_reachable — GET /api/v1/billing or /api/v1/billing/invoices returned a 5xx (a paid-tier read-surface regression); (c) webhook_security — POST /razorpay/webhook ACCEPTED an unsigned/garbage payload (a CRITICAL security regression: the signature gate would let a forged 'success' drive a free upgrade) OR rejected it with a non-canonical error_code; (d) upgrade_webhook_e2e — a correctly-signed TEST-mode subscription.charged was injected against a fresh cohort team but teams.plan_tier did NOT advance (the upgrade pipeline is broken — this is the rule-12 downstream truth surface, NOT a webhook 200). result='degraded' is EXCLUDED from this alert (it is the config-unset / slow-but-correct state — no test webhook secret, no JWT secret, or over-budget latency — so the prober never false-pages before the operator wires PAYMENT_PROBE_ENABLED + the test secret). Cross-correlate against the audit_log row written by the worker (kind=payment_probe_failed, actor='system:payment_probe') AND the structured slog ERROR line payment_probe_failed leg=... reason=... — same content on both surfaces. cohort='synthetic' so this never pollutes billing/revenue dashboards. Source: worker/internal/jobs/payment_probe.go (PaymentProbeWorker), metric registered in worker/internal/metrics/metrics.go (PaymentProbeOutcomeTotal). Threshold ABOVE 0 with a 10m window: a single fail tick over 10 minutes (two ticks at the 5-minute cadence) is an unambiguous paid-funnel regression, not a flake. AWAITING operator PAYMENT_PROBE_ENABLED=true (and, for the upgrade leg, RAZORPAY_TEST_WEBHOOK_SECRET) before any series materialises.",
5+
"enabled": true,
6+
"nrql": {
7+
"query": "SELECT sum(instant_payment_probe_outcome_total) FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' AND result = 'fail'"
8+
},
9+
"terms": [
10+
{
11+
"priority": "CRITICAL",
12+
"operator": "ABOVE",
13+
"threshold": 0,
14+
"thresholdDuration": 600,
15+
"thresholdOccurrences": "AT_LEAST_ONCE"
16+
}
17+
],
18+
"signal": {
19+
"aggregationWindow": 60,
20+
"aggregationMethod": "EVENT_FLOW",
21+
"aggregationDelay": 120,
22+
"fillOption": "STATIC",
23+
"fillValue": 0
24+
},
25+
"expiration": {
26+
"expirationDuration": 7200,
27+
"openViolationOnExpiration": false,
28+
"closeViolationsOnExpiration": true
29+
},
30+
"violationTimeLimitSeconds": 86400
31+
}

newrelic/dashboards/instanode-reliability.json

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1436,6 +1436,87 @@
14361436
"ignoreTimeRange": false
14371437
}
14381438
}
1439+
},
1440+
{
1441+
"title": "Layer-3 payment prober — outcomes per leg (6h) [money heartbeat]",
1442+
"layout": {
1443+
"column": 1,
1444+
"row": 75,
1445+
"width": 6,
1446+
"height": 3
1447+
},
1448+
"visualization": {
1449+
"id": "viz.line"
1450+
},
1451+
"rawConfiguration": {
1452+
"nrqlQueries": [
1453+
{
1454+
"accountIds": [
1455+
0
1456+
],
1457+
"query": "SELECT sum(instant_payment_probe_outcome_total) FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' FACET leg, result TIMESERIES SINCE 6 hours ago"
1458+
}
1459+
],
1460+
"platformOptions": {
1461+
"ignoreTimeRange": false
1462+
}
1463+
}
1464+
},
1465+
{
1466+
"title": "Layer-3 payment prober — fails (last 6h, must be 0; degraded excluded)",
1467+
"layout": {
1468+
"column": 7,
1469+
"row": 75,
1470+
"width": 3,
1471+
"height": 3
1472+
},
1473+
"visualization": {
1474+
"id": "viz.billboard"
1475+
},
1476+
"rawConfiguration": {
1477+
"nrqlQueries": [
1478+
{
1479+
"accountIds": [
1480+
0
1481+
],
1482+
"query": "SELECT sum(instant_payment_probe_outcome_total) AS 'fails' FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' AND result = 'fail' SINCE 6 hours ago"
1483+
}
1484+
],
1485+
"platformOptions": {
1486+
"ignoreTimeRange": false
1487+
},
1488+
"thresholds": [
1489+
{
1490+
"alertSeverity": "CRITICAL",
1491+
"value": 1
1492+
}
1493+
]
1494+
}
1495+
},
1496+
{
1497+
"title": "Layer-3 payment prober — P95 latency per leg (6h)",
1498+
"layout": {
1499+
"column": 10,
1500+
"row": 75,
1501+
"width": 3,
1502+
"height": 3
1503+
},
1504+
"visualization": {
1505+
"id": "viz.line"
1506+
},
1507+
"rawConfiguration": {
1508+
"nrqlQueries": [
1509+
{
1510+
"accountIds": [
1511+
0
1512+
],
1513+
"query": "SELECT percentile(instant_payment_probe_latency_seconds, 95) AS 'p95' FROM Metric WHERE metricName = 'instant_payment_probe_latency_seconds' FACET leg TIMESERIES SINCE 6 hours ago"
1514+
}
1515+
],
1516+
"platformOptions": {
1517+
"ignoreTimeRange": false
1518+
}
1519+
}
14391520
}
14401521
]
14411522
}

observability/METRICS-CATALOG.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,8 @@ fires. Operators need this so they don't panic when a fresh deploy looks
4444
| `instant_auth_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only on a real HTTP response; DNS/TCP errors omit the observation so the histogram isn't polluted with 0s timeouts) | (covered by `auth-probe-fail.json`) | (covered by `AuthProbeFail`) | "AUTH-004 synthetic prober — P95 latency per leg (1h)" |
4545
| `instant_deploy_probe_outcome_total` | worker | `leg,result` | lazy (CounterVec — `pass`/`degraded` materialise on the first happy tick; `fail` only appears after a real regression. Hourly deploy prober: every 60 min the worker drives /deploy/new + status-poll until healthy + public-host GET against prod. Closes the 2026-05-30 stuck-build gap that hid a broken deploy pipeline for ~30 min) | `deploy-probe-fail.json` | `DeployProbeFail` | "Hourly deploy prober — outcomes per leg (6h)", "Hourly deploy prober — fails (last 6h, must be 0)" |
4646
| `instant_deploy_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only on a real HTTP response or successful status flip; DNS/TCP errors omit the observation. Buckets span the per-leg budgets up to the 120s cold-cluster Kaniko ceiling) | (covered by `deploy-probe-fail.json`) | (covered by `DeployProbeFail`) | "Hourly deploy prober — P95 latency per leg (6h)" |
47+
| `instant_payment_probe_outcome_total` | worker | `leg,result` | lazy (CounterVec — INERT until `PAYMENT_PROBE_ENABLED=true`; once on, `pass`/`degraded` materialise on the first tick and `fail` only on a real regression. Layer-3 payment prober (the money heartbeat, forum verdict docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4): every 5 min drives the iframe-free payment-funnel contract path against prod — `leg ∈ checkout_reachable / billing_state / invoices_reachable / webhook_security / upgrade_webhook_e2e`, each reading a rule-12 truth surface, NO real money. The upgrade leg additionally needs `RAZORPAY_TEST_WEBHOOK_SECRET` (degraded otherwise). label families primed in `metrics_test.go`) | `payment-probe-fail.json` | `PaymentProbeFail` (instant-worker-payment-probe group) | "Layer-3 payment prober — outcomes per leg (6h)", "Layer-3 payment prober — fails (last 6h, must be 0)" |
48+
| `instant_payment_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only when a real request was performed; a config-skipped leg omits the observation. Buckets span the per-leg budgets up to the 8s upgrade-leg ceiling. INERT until `PAYMENT_PROBE_ENABLED=true`) | (covered by `payment-probe-fail.json`) | (covered by `PaymentProbeFail`) | "Layer-3 payment prober — P95 latency per leg (6h)" |
4749
| `instant_tier_upgrade_ttl_promote_total` | api | `outcome` | lazy (CounterVec — outcome label series materialise on first paid-tier upgrade after deploy; `error` should stay absent in a healthy deploy. P1 fix 2026-05-31 — emits from billing.handleSubscriptionCharged → PromoteDeploymentTTLsForTeam) | `tier-upgrade-ttl-promote-failed.json` | `TierUpgradeTTLPromoteFailed` | "Tier-upgrade TTL promote outcomes (24h) — error must be 0" |
4850
| `instant_customer_backup_failed_total` | worker | `reason` | lazy (CounterVec — `reason` series materialise on first failure: auth/decrypt/config/dump/upload. `auth`=credential drift, SLA breach, won't self-heal → CRITICAL; others WARNING. Added 2026-06-03 after a failed backup paged no one — stale=36h, no-followup=stuck-only) | `customer-backup-failed.json` | `CustomerBackupCredentialFailure`, `CustomerBackupFailures` | "Customer backup failures by reason (24h)" |
4951
| `instant_customer_backup_succeeded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; paired with `_failed_total` for the success-ratio billboard) | (ratio feeds the dashboard; no standalone alert) | (none — denominator only) | "Backup success rate (last 24h, all teams)" |

0 commit comments

Comments
 (0)