Rule-25 observability for the Layer-3 payment prober (the money heartbeat,
worker/internal/jobs/payment_probe.go — forum verdict §4). Ships in lockstep
with the worker PR that adds the metric.
- newrelic/alerts/payment-probe-fail.json — P1 page on
instant_payment_probe_outcome_total{result="fail"} > 0 in 10m (paid revenue
path down). result="degraded" EXCLUDED so the prober never false-pages
before the operator lights PAYMENT_PROBE_ENABLED + the test webhook secret.
- k8s/prometheus-rules.yaml — instant-worker-payment-probe group / PaymentProbeFail
(mirror of the NR alert).
- newrelic/dashboards/instanode-reliability.json — three tiles: outcomes per
leg, fails billboard (must be 0), P95 latency per leg.
- observability/METRICS-CATALOG.md — rows for the outcome counter + latency
histogram (both lazy *Vec, INERT until PAYMENT_PROBE_ENABLED=true).
Operator-apply (infra has no auto-apply). Awaiting operator
PAYMENT_PROBE_ENABLED=true (+ RAZORPAY_TEST_WEBHOOK_SECRET for the upgrade leg)
before any series materialises.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
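The "lazy *Vec, INERT until PAYMENT_PROBE_ENABLED=true" behaviour described in the commit message can be sketched as follows. This is a minimal stdlib-only stand-in, not the worker's code: `lazyCounterVec` and `tick` are hypothetical names, and the real metric is presumably a Prometheus CounterVec registered in worker/internal/metrics/metrics.go. The point it illustrates is that a labelled series does not exist until its first increment, so a disabled prober exports nothing.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// lazyCounterVec is a hypothetical stand-in for a labelled counter.
// A (leg, result) series is created only on its first increment.
type lazyCounterVec struct {
	mu     sync.Mutex
	series map[string]float64
}

func newLazyCounterVec() *lazyCounterVec {
	return &lazyCounterVec{series: make(map[string]float64)}
}

// Inc materialises the series on first use and increments it by one.
func (c *lazyCounterVec) Inc(leg, result string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.series[leg+"/"+result]++
}

// Series lists the series that currently exist, sorted for stable output.
func (c *lazyCounterVec) Series() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]string, 0, len(c.series))
	for k := range c.series {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// tick records one probe outcome only when the prober is enabled
// (assumed to mirror PAYMENT_PROBE_ENABLED=true).
func tick(enabled bool, c *lazyCounterVec, leg, result string) {
	if !enabled {
		return
	}
	c.Inc(leg, result)
}

func main() {
	c := newLazyCounterVec()
	tick(false, c, "checkout_reachable", "pass")
	fmt.Println(len(c.Series())) // no series while disabled
	tick(true, c, "checkout_reachable", "pass")
	fmt.Println(c.Series())
}
```

Because no series exists while the prober is off, an alert on `result="fail"` has nothing to evaluate until the operator enables it.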
"description": "P1 page on ANY occurrence of instant_payment_probe_outcome_total{result=\"fail\"}. The Layer-3 payment prober (the money heartbeat — forum verdict docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4 Layer 3) drives the iframe-free payment-funnel contract path against prod every 5 minutes. A fail outcome means one of: (a) checkout_reachable — POST /api/v1/billing/checkout returned a 5xx (handler crash/regression — NOT a 402/409/502 blocked-but-alive shape, which the prober treats as PASS while Razorpay live-recurring is operator-blocked); (b) billing_state / invoices_reachable — GET /api/v1/billing or /api/v1/billing/invoices returned a 5xx (a paid-tier read-surface regression); (c) webhook_security — POST /razorpay/webhook ACCEPTED an unsigned/garbage payload (a CRITICAL security regression: the signature gate would let a forged 'success' drive a free upgrade) OR rejected it with a non-canonical error_code; (d) upgrade_webhook_e2e — a correctly-signed TEST-mode subscription.charged was injected against a fresh cohort team but teams.plan_tier did NOT advance (the upgrade pipeline is broken — this is the rule-12 downstream truth surface, NOT a webhook 200). result='degraded' is EXCLUDED from this alert (it is the config-unset / slow-but-correct state — no test webhook secret, no JWT secret, or over-budget latency — so the prober never false-pages before the operator wires PAYMENT_PROBE_ENABLED + the test secret). Cross-correlate against the audit_log row written by the worker (kind=payment_probe_failed, actor='system:payment_probe') AND the structured slog ERROR line payment_probe_failed leg=... reason=... — same content on both surfaces. cohort='synthetic' so this never pollutes billing/revenue dashboards. Source: worker/internal/jobs/payment_probe.go (PaymentProbeWorker), metric registered in worker/internal/metrics/metrics.go (PaymentProbeOutcomeTotal). 
Threshold ABOVE 0 with a 10m window: a single fail tick anywhere in the 10-minute window (which spans two ticks at the 5-minute cadence) is an unambiguous paid-funnel regression, not a flake. AWAITING operator PAYMENT_PROBE_ENABLED=true (and, for the upgrade leg, RAZORPAY_TEST_WEBHOOK_SECRET) before any series materialises.",
"enabled": true,
"nrql": {
  "query": "SELECT sum(instant_payment_probe_outcome_total) FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' AND result = 'fail'"
newrelic/dashboards/instanode-reliability.json: 81 additions, 0 deletions
@@ -1436,6 +1436,87 @@
       "ignoreTimeRange": false
     }
   }
+},
+{
+  "title": "Layer-3 payment prober — outcomes per leg (6h) [money heartbeat]",
+  "layout": {
+    "column": 1,
+    "row": 75,
+    "width": 6,
+    "height": 3
+  },
+  "visualization": {
+    "id": "viz.line"
+  },
+  "rawConfiguration": {
+    "nrqlQueries": [
+      {
+        "accountIds": [
+          0
+        ],
+        "query": "SELECT sum(instant_payment_probe_outcome_total) FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' FACET leg, result TIMESERIES SINCE 6 hours ago"
+      }
+    ],
+    "platformOptions": {
+      "ignoreTimeRange": false
+    }
+  }
+},
+{
+  "title": "Layer-3 payment prober — fails (last 6h, must be 0; degraded excluded)",
+  "layout": {
+    "column": 7,
+    "row": 75,
+    "width": 3,
+    "height": 3
+  },
+  "visualization": {
+    "id": "viz.billboard"
+  },
+  "rawConfiguration": {
+    "nrqlQueries": [
+      {
+        "accountIds": [
+          0
+        ],
+        "query": "SELECT sum(instant_payment_probe_outcome_total) AS 'fails' FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' AND result = 'fail' SINCE 6 hours ago"
+      }
+    ],
+    "platformOptions": {
+      "ignoreTimeRange": false
+    },
+    "thresholds": [
+      {
+        "alertSeverity": "CRITICAL",
+        "value": 1
+      }
+    ]
+  }
+},
+{
+  "title": "Layer-3 payment prober — P95 latency per leg (6h)",
+  "layout": {
+    "column": 10,
+    "row": 75,
+    "width": 3,
+    "height": 3
+  },
+  "visualization": {
+    "id": "viz.line"
+  },
+  "rawConfiguration": {
+    "nrqlQueries": [
+      {
+        "accountIds": [
+          0
+        ],
+        "query": "SELECT percentile(instant_payment_probe_latency_seconds, 95) AS 'p95' FROM Metric WHERE metricName = 'instant_payment_probe_latency_seconds' FACET leg TIMESERIES SINCE 6 hours ago"
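For readers unfamiliar with what the P95 tile reports, the following sketch computes a nearest-rank 95th percentile over raw latency samples for one leg. It is illustrative only: `p95` is a hypothetical helper, and New Relic's `percentile()` over a histogram metric is an approximation from bucketed data, so its value can differ from this exact calculation.

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the nearest-rank 95th percentile of the samples (in seconds).
// The second return value is false when there are no samples, which matches
// a leg that was skipped and therefore recorded no observation.
func p95(samples []float64) (float64, bool) {
	n := len(samples)
	if n == 0 {
		return 0, false
	}
	sorted := append([]float64(nil), samples...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)
	rank := (95*n + 99) / 100 // ceil(0.95 * n) using integer arithmetic
	return sorted[rank-1], true
}

func main() {
	// Hypothetical latencies for one leg, in seconds.
	latencies := []float64{0.21, 0.18, 0.35, 0.27, 1.90, 0.22, 0.25, 0.30, 0.19, 0.24}
	if v, ok := p95(latencies); ok {
		fmt.Printf("p95=%.2fs\n", v)
	}
}
```

A single slow tick dominates P95 at this sample rate (12 ticks per hour per leg), which is one reason the page is driven by the outcome counter and not by latency.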
observability/METRICS-CATALOG.md: 2 additions, 0 deletions
@@ -44,6 +44,8 @@ fires. Operators need this so they don't panic when a fresh deploy looks
 | `instant_auth_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only on a real HTTP response; DNS/TCP errors omit the observation so the histogram isn't polluted with 0s timeouts) | (covered by `auth-probe-fail.json`) | (covered by `AuthProbeFail`) | "AUTH-004 synthetic prober — P95 latency per leg (1h)" |
 | `instant_deploy_probe_outcome_total` | worker | `leg,result` | lazy (CounterVec — `pass`/`degraded` materialise on the first happy tick; `fail` only appears after a real regression. Hourly deploy prober: every 60 min the worker drives /deploy/new + status-poll until healthy + public-host GET against prod. Closes the 2026-05-30 stuck-build gap that hid a broken deploy pipeline for ~30 min) | `deploy-probe-fail.json` | `DeployProbeFail` | "Hourly deploy prober — outcomes per leg (6h)", "Hourly deploy prober — fails (last 6h, must be 0)" |
 | `instant_deploy_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only on a real HTTP response or successful status flip; DNS/TCP errors omit the observation. Buckets span the per-leg budgets up to the 120s cold-cluster Kaniko ceiling) | (covered by `deploy-probe-fail.json`) | (covered by `DeployProbeFail`) | "Hourly deploy prober — P95 latency per leg (6h)" |
+| `instant_payment_probe_outcome_total` | worker | `leg,result` | lazy (CounterVec — INERT until `PAYMENT_PROBE_ENABLED=true`; once on, `pass`/`degraded` materialise on the first tick and `fail` only on a real regression. Layer-3 payment prober (the money heartbeat, forum verdict docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4): every 5 min drives the iframe-free payment-funnel contract path against prod — `leg ∈ checkout_reachable / billing_state / invoices_reachable / webhook_security / upgrade_webhook_e2e`, each reading a rule-12 truth surface, NO real money. The upgrade leg additionally needs `RAZORPAY_TEST_WEBHOOK_SECRET` (degraded otherwise). label families primed in `metrics_test.go`) | `payment-probe-fail.json` | `PaymentProbeFail` (instant-worker-payment-probe group) | "Layer-3 payment prober — outcomes per leg (6h)", "Layer-3 payment prober — fails (last 6h, must be 0)" |
+| `instant_payment_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only when a real request was performed; a config-skipped leg omits the observation. Buckets span the per-leg budgets up to the 8s upgrade-leg ceiling. INERT until `PAYMENT_PROBE_ENABLED=true`) | (covered by `payment-probe-fail.json`) | (covered by `PaymentProbeFail`) | "Layer-3 payment prober — P95 latency per leg (6h)" |
 | `instant_tier_upgrade_ttl_promote_total` | api | `outcome` | lazy (CounterVec — outcome label series materialise on first paid-tier upgrade after deploy; `error` should stay absent in a healthy deploy. P1 fix 2026-05-31 — emits from billing.handleSubscriptionCharged → PromoteDeploymentTTLsForTeam) | `tier-upgrade-ttl-promote-failed.json` | `TierUpgradeTTLPromoteFailed` | "Tier-upgrade TTL promote outcomes (24h) — error must be 0" |
 | `instant_customer_backup_failed_total` | worker | `reason` | lazy (CounterVec — `reason` series materialise on first failure: auth/decrypt/config/dump/upload. `auth`=credential drift, SLA breach, won't self-heal → CRITICAL; others WARNING. Added 2026-06-03 after a failed backup paged no one — stale=36h, no-followup=stuck-only) | `customer-backup-failed.json` | `CustomerBackupCredentialFailure`, `CustomerBackupFailures` | "Customer backup failures by reason (24h)" |
 | `instant_customer_backup_succeeded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; paired with `_failed_total` for the success-ratio billboard) | (ratio feeds the dashboard; no standalone alert) | (none — denominator only) | "Backup success rate (last 24h, all teams)" |
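The catalog row for `instant_payment_probe_latency_seconds` says an observation is recorded only when a real request was performed and that buckets run up to the 8s upgrade-leg ceiling. A minimal sketch of that recording rule follows. The bucket bounds below are hypothetical apart from the 8s ceiling, and `histogram` is a stdlib-only stand-in for the actual HistogramVec.

```go
package main

import "fmt"

// bounds are hypothetical bucket upper limits in seconds.
// Only the final 8s ceiling is taken from the catalog row.
var bounds = []float64{0.25, 0.5, 1, 2, 4, 8}

// histogram keeps cumulative bucket counts; the last slot is the +Inf bucket.
type histogram struct {
	counts []uint64
	total  uint64
}

func newHistogram() *histogram {
	return &histogram{counts: make([]uint64, len(bounds)+1)}
}

// observe records a latency only when a real request was performed.
// A config-skipped leg passes performed=false and leaves the histogram untouched.
func (h *histogram) observe(seconds float64, performed bool) {
	if !performed {
		return
	}
	h.total++
	for i, b := range bounds {
		if seconds <= b {
			h.counts[i]++ // cumulative: every bucket at or above the value counts it
		}
	}
	h.counts[len(bounds)]++ // +Inf always counts the observation
}

func main() {
	h := newHistogram()
	h.observe(0.3, true)  // a normal tick
	h.observe(9.0, true)  // over the 8s ceiling, lands only in +Inf
	h.observe(1.0, false) // skipped leg, not recorded
	fmt.Println(h.total, h.counts)
}
```

Skipping the observation for unperformed requests keeps zero-length samples out of the P95 tile, the same reasoning the catalog gives for the auth and deploy probers.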