Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 46 additions & 0 deletions k8s/prometheus-rules.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -790,6 +790,52 @@ spec:
the worker structured slog ERROR `deploy_probe_failed leg=...
reason=...`. Source: worker/internal/jobs/deploy_probe.go.

# Layer-3 payment prober (the money heartbeat). Every 5 minutes the
# worker drives the iframe-free payment-funnel contract path against
# prod: checkout reachability (POST /api/v1/billing/checkout, non-5xx),
# billing + invoices read surfaces (non-5xx), the webhook signature
# SECURITY contract (POST /razorpay/webhook with a garbage unsigned
# payload MUST be rejected 400 invalid_signature), and an OPTIONAL
# test-mode upgrade proof (inject a correctly-signed TEST webhook →
# assert teams.plan_tier flips → reap the cohort team). Any fail
# outcome over a 10-minute window pages — at the 5-minute cadence two
# consecutive fail ticks is an unambiguous paid-funnel regression, not
# a flake. result="degraded" is the config-unset / slow-but-correct
# state (no test webhook secret, no JWT secret, or over-budget) and is
# EXCLUDED, so the prober never false-pages before the operator lights
# PAYMENT_PROBE_ENABLED. Mirrors payment-probe-fail.json in
# newrelic/alerts/. Forum verdict: docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4.
- name: instant-worker-payment-probe
rules:
- alert: PaymentProbeFail
expr: |
sum(increase(instant_payment_probe_outcome_total{result="fail"}[10m])) > 0
for: 10m
labels:
severity: critical
service: worker
annotations:
summary: "Layer-3 payment prober fail (paid revenue path broken in prod)"
description: |
instant_payment_probe_outcome_total{result="fail"} > 0 in 10m.
The Layer-3 payment prober (the money heartbeat) drives the
iframe-free payment-funnel contract path against prod every 5
minutes. A fail outcome means one of: checkout_reachable —
POST /api/v1/billing/checkout returned a 5xx (handler crash;
a 402/409/502 blocked-but-alive shape is a PASS while Razorpay
live-recurring is operator-blocked); billing_state /
invoices_reachable — GET /api/v1/billing[/invoices] 5xx (a
paid-tier read-surface regression); webhook_security — the
/razorpay/webhook signature gate ACCEPTED an unsigned payload
(a CRITICAL security regression — a forged success could drive
a free upgrade) OR returned a non-canonical error_code; or
upgrade_webhook_e2e — a signed TEST-mode subscription.charged
did NOT flip teams.plan_tier (the rule-12 downstream truth
surface, NOT a webhook 200). Cross-correlate against audit_log
kind=payment_probe_failed + the worker structured slog ERROR
`payment_probe_failed leg=... reason=...`. Source:
worker/internal/jobs/payment_probe.go.

# GitHub App push-to-deploy pipeline (P4, pre-staged 2026-06-03).
# Metrics emitted by api/internal/handlers/github_webhook.go:
# instant_github_webhook_received_total{event,result}
Expand Down
31 changes: 31 additions & 0 deletions newrelic/alerts/payment-probe-fail.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
{
"name": "instant-worker — payment_probe_failed [Layer-3 money heartbeat: paid revenue path broken in prod]",
"type": "NRQL",
"description": "P1 page on ANY occurrence of instant_payment_probe_outcome_total{result=\"fail\"}. The Layer-3 payment prober (the money heartbeat — forum verdict docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4 Layer 3) drives the iframe-free payment-funnel contract path against prod every 5 minutes. A fail outcome means one of: (a) checkout_reachable — POST /api/v1/billing/checkout returned a 5xx (handler crash/regression — NOT a 402/409/502 blocked-but-alive shape, which the prober treats as PASS while Razorpay live-recurring is operator-blocked); (b) billing_state / invoices_reachable — GET /api/v1/billing or /api/v1/billing/invoices returned a 5xx (a paid-tier read-surface regression); (c) webhook_security — POST /razorpay/webhook ACCEPTED an unsigned/garbage payload (a CRITICAL security regression: the signature gate would let a forged 'success' drive a free upgrade) OR rejected it with a non-canonical error_code; (d) upgrade_webhook_e2e — a correctly-signed TEST-mode subscription.charged was injected against a fresh cohort team but teams.plan_tier did NOT advance (the upgrade pipeline is broken — this is the rule-12 downstream truth surface, NOT a webhook 200). result='degraded' is EXCLUDED from this alert (it is the config-unset / slow-but-correct state — no test webhook secret, no JWT secret, or over-budget latency — so the prober never false-pages before the operator wires PAYMENT_PROBE_ENABLED + the test secret). Cross-correlate against the audit_log row written by the worker (kind=payment_probe_failed, actor='system:payment_probe') AND the structured slog ERROR line payment_probe_failed leg=... reason=... — same content on both surfaces. cohort='synthetic' so this never pollutes billing/revenue dashboards. Source: worker/internal/jobs/payment_probe.go (PaymentProbeWorker), metric registered in worker/internal/metrics/metrics.go (PaymentProbeOutcomeTotal). Threshold ABOVE 0 with a 10m window: a single fail tick over 10 minutes (two ticks at the 5-minute cadence) is an unambiguous paid-funnel regression, not a flake. AWAITING operator PAYMENT_PROBE_ENABLED=true (and, for the upgrade leg, RAZORPAY_TEST_WEBHOOK_SECRET) before any series materialises.",
"enabled": true,
"nrql": {
"query": "SELECT sum(instant_payment_probe_outcome_total) FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' AND result = 'fail'"
},
"terms": [
{
"priority": "CRITICAL",
"operator": "ABOVE",
"threshold": 0,
"thresholdDuration": 600,
"thresholdOccurrences": "AT_LEAST_ONCE"
}
],
"signal": {
"aggregationWindow": 60,
"aggregationMethod": "EVENT_FLOW",
"aggregationDelay": 120,
"fillOption": "STATIC",
"fillValue": 0
},
"expiration": {
"expirationDuration": 7200,
"openViolationOnExpiration": false,
"closeViolationsOnExpiration": true
},
"violationTimeLimitSeconds": 86400
}
81 changes: 81 additions & 0 deletions newrelic/dashboards/instanode-reliability.json
Original file line number Diff line number Diff line change
Expand Up @@ -1436,6 +1436,87 @@
"ignoreTimeRange": false
}
}
},
{
"title": "Layer-3 payment prober — outcomes per leg (6h) [money heartbeat]",
"layout": {
"column": 1,
"row": 75,
"width": 6,
"height": 3
},
"visualization": {
"id": "viz.line"
},
"rawConfiguration": {
"nrqlQueries": [
{
"accountIds": [
0
],
"query": "SELECT sum(instant_payment_probe_outcome_total) FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' FACET leg, result TIMESERIES SINCE 6 hours ago"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
},
{
"title": "Layer-3 payment prober — fails (last 6h, must be 0; degraded excluded)",
"layout": {
"column": 7,
"row": 75,
"width": 3,
"height": 3
},
"visualization": {
"id": "viz.billboard"
},
"rawConfiguration": {
"nrqlQueries": [
{
"accountIds": [
0
],
"query": "SELECT sum(instant_payment_probe_outcome_total) AS 'fails' FROM Metric WHERE metricName = 'instant_payment_probe_outcome_total' AND result = 'fail' SINCE 6 hours ago"
}
],
"platformOptions": {
"ignoreTimeRange": false
},
"thresholds": [
{
"alertSeverity": "CRITICAL",
"value": 1
}
]
}
},
{
"title": "Layer-3 payment prober — P95 latency per leg (6h)",
"layout": {
"column": 10,
"row": 75,
"width": 3,
"height": 3
},
"visualization": {
"id": "viz.line"
},
"rawConfiguration": {
"nrqlQueries": [
{
"accountIds": [
0
],
"query": "SELECT percentile(instant_payment_probe_latency_seconds, 95) AS 'p95' FROM Metric WHERE metricName = 'instant_payment_probe_latency_seconds' FACET leg TIMESERIES SINCE 6 hours ago"
}
],
"platformOptions": {
"ignoreTimeRange": false
}
}
}
]
}
Expand Down
2 changes: 2 additions & 0 deletions observability/METRICS-CATALOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,8 @@ fires. Operators need this so they don't panic when a fresh deploy looks
| `instant_auth_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only on a real HTTP response; DNS/TCP errors omit the observation so the histogram isn't polluted with 0s timeouts) | (covered by `auth-probe-fail.json`) | (covered by `AuthProbeFail`) | "AUTH-004 synthetic prober — P95 latency per leg (1h)" |
| `instant_deploy_probe_outcome_total` | worker | `leg,result` | lazy (CounterVec — `pass`/`degraded` materialise on the first happy tick; `fail` only appears after a real regression. Hourly deploy prober: every 60 min the worker drives /deploy/new + status-poll until healthy + public-host GET against prod. Closes the 2026-05-30 stuck-build gap that hid a broken deploy pipeline for ~30 min) | `deploy-probe-fail.json` | `DeployProbeFail` | "Hourly deploy prober — outcomes per leg (6h)", "Hourly deploy prober — fails (last 6h, must be 0)" |
| `instant_deploy_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only on a real HTTP response or successful status flip; DNS/TCP errors omit the observation. Buckets span the per-leg budgets up to the 120s cold-cluster Kaniko ceiling) | (covered by `deploy-probe-fail.json`) | (covered by `DeployProbeFail`) | "Hourly deploy prober — P95 latency per leg (6h)" |
| `instant_payment_probe_outcome_total` | worker | `leg,result` | lazy (CounterVec — INERT until `PAYMENT_PROBE_ENABLED=true`; once on, `pass`/`degraded` materialise on the first tick and `fail` only on a real regression. Layer-3 payment prober (the money heartbeat, forum verdict docs/ci/FORUM-PAYMENT-E2E-TOOLING.md §4): every 5 min drives the iframe-free payment-funnel contract path against prod — `leg ∈ checkout_reachable / billing_state / invoices_reachable / webhook_security / upgrade_webhook_e2e`, each reading a rule-12 truth surface, NO real money. The upgrade leg additionally needs `RAZORPAY_TEST_WEBHOOK_SECRET` (degraded otherwise). label families primed in `metrics_test.go`) | `payment-probe-fail.json` | `PaymentProbeFail` (instant-worker-payment-probe group) | "Layer-3 payment prober — outcomes per leg (6h)", "Layer-3 payment prober — fails (last 6h, must be 0)" |
| `instant_payment_probe_latency_seconds` | worker | `leg` | lazy (HistogramVec — observation only when a real request was performed; a config-skipped leg omits the observation. Buckets span the per-leg budgets up to the 8s upgrade-leg ceiling. INERT until `PAYMENT_PROBE_ENABLED=true`) | (covered by `payment-probe-fail.json`) | (covered by `PaymentProbeFail`) | "Layer-3 payment prober — P95 latency per leg (6h)" |
| `instant_tier_upgrade_ttl_promote_total` | api | `outcome` | lazy (CounterVec — outcome label series materialise on first paid-tier upgrade after deploy; `error` should stay absent in a healthy deploy. P1 fix 2026-05-31 — emits from billing.handleSubscriptionCharged → PromoteDeploymentTTLsForTeam) | `tier-upgrade-ttl-promote-failed.json` | `TierUpgradeTTLPromoteFailed` | "Tier-upgrade TTL promote outcomes (24h) — error must be 0" |
| `instant_customer_backup_failed_total` | worker | `reason` | lazy (CounterVec — `reason` series materialise on first failure: auth/decrypt/config/dump/upload. `auth`=credential drift, SLA breach, won't self-heal → CRITICAL; others WARNING. Added 2026-06-03 after a failed backup paged no one — stale=36h, no-followup=stuck-only) | `customer-backup-failed.json` | `CustomerBackupCredentialFailure`, `CustomerBackupFailures` | "Customer backup failures by reason (24h)" |
| `instant_customer_backup_succeeded_total` | worker | (none) | **eager** (Counter — visible as 0 at boot; paired with `_failed_total` for the success-ratio billboard) | (ratio feeds the dashboard; no standalone alert) | (none — denominator only) | "Backup success rate (last 24h, all teams)" |
Expand Down
Loading