obs(newrelic): log-based backstops for silent-failure gaps (backup/deploy/propagation/data-tier OOMKill) (#73)

mastermanas805 · claude · web-flow · commit 430667baf4e9 · 2026-06-11T05:10:08.000Z
A customer-facing failure (a backup failed + a mongodb pod OOMKill that
lost a provisioned DB) went UNDETECTED for hours until a customer emailed
a screenshot. The metric-based alerts that should have caught these are
INERT: prod has NO Prometheus pipeline (newrelic-prometheus-agent / #72 is
operator-apply-pending), so every FROM Metric alert queries an empty
stream. Only FROM Log (newrelic-logging Fluent Bit DaemonSet) + Synthetics
are live.

Add 5 LOG-based alert backstops keyed on the REAL emitted worker log lines
(verified against worker code, file:line cited in each description):

- customer-backup-failed-nonauth-log.json: jobs.customer_backup_runner.failed
  reason!='auth', WARNING ABOVE 3/15m (sustained; transient dump
  self-heals). Complements the pre-existing auth-only CRITICAL alert.
- backup-stuck-row-recovery-failed.json: stuck_row_recovery_failed,
  CRITICAL ABOVE 0/10m. Regression guard for the NULL-started_at flood
  fixed in worker #106 (which previously erred on EVERY tick, unalerted,
  for hours).
- deploy-failed-autopsy-log.json: deploy_failure_autopsy.captured,
  CRITICAL ABOVE 0/5m. LOG twin of the inert deploy-job/runtime-failed
  metric alerts (rule 27 silent-deploy-failure class).
- propagation-dead-lettered-log.json: propagation_runner.dead_lettered +
  unknown_kind_dead_lettered, CRITICAL ABOVE 0/5m. LOG twin of the inert
  propagation-dead-lettered metric alert (paid customer regrade fell
  through).
- data-tier-pod-oomkill-restart.json: image-native startup banner of each
  instant-data stateful pod reappearing = restart, CRITICAL ABOVE 0/5m
  FACET k8s_label_app. This is the exact failure that OOMKilled mongodb
  and lost the customer DB this session. Flagged blind spot: a banner
  detector cannot read exitCode 137 or distinguish OOMKill from a planned
  rollout; the authoritative reason='OOMKilled' event needs
  kube-state-metrics / the #72 pipeline.
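
As an illustration only (this is not the repo's NR alert test suite), the
shape shared by the five conditions listed above can be sanity-checked with
a short Python sketch. The function name check_log_alert, the REQUIRED_KEYS
set, and the chosen invariants are hypothetical; only the field names and
the sample query are taken from the JSON files in this commit.

```python
import json

# Hypothetical invariants for a log-based backstop condition.
REQUIRED_KEYS = {"name", "type", "description", "enabled", "nrql", "terms", "signal"}


def check_log_alert(raw):
    """Return a list of problems found in one alert-condition JSON document."""
    try:
        cond = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"does not parse: {exc}"]
    if not isinstance(cond, dict):
        return ["top level is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - set(cond)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # A FROM Metric query would be inert while the metrics pipeline is absent.
    nrql = cond.get("nrql")
    query = nrql.get("query", "") if isinstance(nrql, dict) else ""
    if "FROM Log" not in query:
        problems.append("query is not FROM Log")

    for term in cond.get("terms", []):
        if term.get("priority") not in {"CRITICAL", "WARNING"}:
            problems.append(f"unexpected priority: {term.get('priority')!r}")
        if term.get("thresholdDuration", 0) <= 0:
            problems.append("thresholdDuration must be positive")
    return problems


# Trimmed copy of backup-stuck-row-recovery-failed.json (description shortened).
SAMPLE = json.dumps({
    "name": "instant-worker backup stuck-row recovery FAILED",
    "type": "NRQL",
    "description": "CRITICAL on ANY occurrence.",
    "enabled": True,
    "nrql": {
        "query": "SELECT count(*) FROM Log WHERE service = 'worker' AND "
                 "message LIKE '%customer_backup_runner.stuck_row_recovery_failed%'"
    },
    "terms": [{
        "priority": "CRITICAL",
        "operator": "ABOVE",
        "threshold": 0,
        "thresholdDuration": 600,
        "thresholdOccurrences": "AT_LEAST_ONCE",
    }],
    "signal": {"aggregationWindow": 300, "aggregationMethod": "EVENT_FLOW"},
})

if __name__ == "__main__":
    print(check_log_alert(SAMPLE))
```

A check like this only confirms structure; it cannot tell whether the
matched log line is actually emitted by the worker, which is why each
description cites the source file:line.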

Document all six in a new LOG-ALERTS section of
observability/METRICS-CATALOG.md with the verified source log line +
severity + NRQL key per alert, and the two acknowledged blind spots.

FIX 2 (data-tier OOMKill PROTECTION: PriorityClass instant-data-critical,
PDBs, per-pod memory/cpu requests+limits, maintenance-window apply runbook)
already landed in #69 (k8s/data/stateful-priority.yaml + k8s/data/*.yaml +
k8s/DATA-TIER-APPLY-RUNBOOK.md), operator-apply-pending; not duplicated
here.

NR alert test suite green (49/49, 98->103 JSONs parse). typos clean.
kubectl --dry-run=client clean on the FIX-2 manifests. No code change;
YAML/JSON/docs only; operator-apply (apply.sh), no auto-apply.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
diff --git a/newrelic/alerts/backup-stuck-row-recovery-failed.json b/newrelic/alerts/backup-stuck-row-recovery-failed.json
@@ -0,0 +1,31 @@
+{
+  "name": "instant-worker — backup stuck-row recovery FAILED (recoverStuckRows UPDATE erroring) [regression guard]",
+  "type": "NRQL",
+  "description": "CRITICAL on ANY occurrence. The customer_backup_runner's recoverStuckRows() sweep resets backup rows orphaned at status='running' (a casualty of a worker pod kill mid-backup) back to 'pending' so a future tick re-claims them. If that recovery UPDATE itself errors, the orphaned rows stay stuck at 'running' FOREVER — every backup for those resources silently stops, no SUCCESS line is ever emitted again, and (because the row is neither pending nor failed) the existing backup alerts stay quiet. This is exactly the silent-failure class behind the 2026-06 incident: a backup stopped and nobody knew for hours.\n\nThis is also a direct regression guard for the just-fixed NULL-started_at bug (worker commit 17a18ca / #106): the recovery UPDATE previously bound started_at=NULL into the TIMESTAMPTZ NOT NULL column (api migration 031_backups.sql), so the UPDATE failed on EVERY tick — recovery never worked and the log FLOODED with stuck_row_recovery_failed lines, unalerted, for hours. The fix stopped touching started_at; this alert ensures that if the recovery UPDATE ever errors again (schema drift, a new NOT-NULL constraint, a DB brownout that the best-effort sweep swallows), an operator is paged on the FIRST occurrence instead of discovering it via a customer screenshot.\n\nSource log line: worker recoverStuckRows emits slog.Warn('jobs.customer_backup_runner.stuck_row_recovery_failed', error=<...>) at worker/internal/jobs/customer_backup_runner.go:370. The sweep is best-effort (it logs and returns so the normal pending-row drain still proceeds), so the LOG line is the ONLY signal — there is no metric and no row-state change to alert on. Shipped to NR via the newrelic-logging Fluent Bit DaemonSet. Any occurrence = a bug; threshold ABOVE 0.\n\nWhen this fires:\n  1. Pull the error: NR Logs `service='worker' message LIKE '%stuck_row_recovery_failed%'` — the `error` field carries the raw DB error (e.g. a constraint violation = schema drift on resource_backups).\n  2. Check for orphaned rows: `SELECT id, resource_id, status, started_at FROM resource_backups WHERE status='running' AND started_at < now() - interval '15 minutes'` — these are the rows recovery should have reset.\n  3. If it's a constraint/schema error: a migration changed resource_backups in a way that breaks the recovery UPDATE — fix the UPDATE (worker customer_backup_runner.go recoverStuckRows) and redeploy; manually reset the orphaned rows to 'pending' to unstick them in the meantime.\n  4. If it's a transient DB error: confirm the next sweep succeeds (the line should stop) and the orphaned rows drain. See infra/BACKUP-RESTORE-RUNBOOK.md.",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE service = 'worker' AND message LIKE '%customer_backup_runner.stuck_row_recovery_failed%'"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "ABOVE",
+      "threshold": 0,
+      "thresholdDuration": 600,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 300,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 120,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 1800,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
diff --git a/newrelic/alerts/customer-backup-failed-nonauth-log.json b/newrelic/alerts/customer-backup-failed-nonauth-log.json
@@ -0,0 +1,31 @@
+{
+  "name": "instant-worker — customer backup FAILED (non-auth, sustained) [LOG backstop]",
+  "type": "NRQL",
+  "description": "WARNING. Log-based backstop that fires when per-tenant (customer) backups fail for a NON-auth reason (reason != 'auth' — i.e. 'dump' / 'upload' / 'config' / 'decrypt': the DB was briefly unreachable, a timeout, an object-store write error, or a credential-decrypt/config problem) at a sustained rate. A single transient 'dump' failure self-heals on the next scheduled run, so this is intentionally a sustained-rate WARNING (ABOVE 3 in 15m), NOT a per-occurrence page — that distinguishes it from the credential-drift case which never self-heals (see customer-backup-failed.json, reason='auth', CRITICAL ABOVE 0).\n\nWhy this exists: this is the 'never blind again' backstop for the 2026-06 incident where a backup failed and went undetected for hours until a customer emailed a screenshot. The pre-existing backup alerts MISS a sustained-but-non-auth failure cluster: backup-stale-36h.json only fires after 36h of no SUCCESS line (far too slow), backup-requested-no-followup.json only fires on STUCK rows (a clean failure emits backup.failed so it counts as 'followed up'), and customer-backup-failed.json only covers reason='auth'. A node going memory-tight and OOMKilling the customer Postgres/Mongo pod, or DO Spaces rejecting uploads, produces a burst of reason='dump'/'upload' failures that previously paged no one. This alert is a LOG alert (FROM Log) so it works TODAY: prod has NO Prometheus pipeline (newrelic-prometheus-agent is operator-apply-pending), so the metric instant_customer_backup_failed_total{reason} is INERT. Once that pipeline lands the metric alert becomes the primary and this remains a useful belt-and-suspenders.\n\nSource log line: worker markFailed emits slog.Error('jobs.customer_backup_runner.failed', reason=<auth|dump|upload|config|decrypt>, internal_detail=..., duration_ms=...) at worker/internal/jobs/customer_backup_runner.go:729. Classified by backupFailReason() (line 617). The line carries backup_id + internal_detail (raw dump-tool stderr) for triage. Shipped to NR via the newrelic-logging Fluent Bit DaemonSet.\n\nWhen this fires:\n  1. Identify scope + reason: NR Logs `service='worker' message LIKE '%customer_backup_runner.failed%' reason != 'auth'` FACET reason — is it one resource flapping or many at once?\n  2. reason='dump' across MANY resources at once = the shared customer DB pod (postgres-customers / mongodb / redis-provision) is unreachable or was OOMKilled/evicted — check `kubectl get pods -n instant-data` for restarts and cross-reference data-tier-pod-oomkill-restart.json.\n  3. reason='upload' = DO Spaces / object-store write path is failing — check the bucket + credentials.\n  4. reason='config'/'decrypt' = an internal AES_KEY / connection_url problem — page, this won't self-heal.\n  5. Confirm recovery: the next scheduled backup tick should re-run; watch for the success line `jobs.customer_backup_runner.succeeded`. See infra/BACKUP-RESTORE-RUNBOOK.md.",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE service = 'worker' AND message LIKE '%customer_backup_runner.failed%' AND reason != 'auth'"
+  },
+  "terms": [
+    {
+      "priority": "WARNING",
+      "operator": "ABOVE",
+      "threshold": 3,
+      "thresholdDuration": 900,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 300,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 120,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 1800,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
diff --git a/newrelic/alerts/data-tier-pod-oomkill-restart.json b/newrelic/alerts/data-tier-pod-oomkill-restart.json
@@ -0,0 +1,31 @@
+{
+  "name": "instant-data — stateful pod RESTARTED (OOMKill/eviction signal: postgres-customers/mongodb/redis-provision/nats)",
+  "type": "NRQL",
+  "description": "CRITICAL on ANY occurrence. Detects a restart of a single-replica STATEFUL data-tier pod in the instant-data namespace by watching for its container's fresh startup banner reappearing in the log stream. On the memory-overcommitted cluster these pods were BestEffort QoS (resources:{}) and were the FIRST thing OOMKilled/evicted under node memory pressure — which is EXACTLY the failure that OOMKilled the mongodb pod (exit 137) and lost a provisioned customer DB this session, undetected for hours until a customer emailed a screenshot.\n\nWHY A LOG (startup-banner) DETECTOR: prod ships only pod stdout (via the newrelic-logging Fluent Bit DaemonSet), APM, and OTLP — there is NO Prometheus pipeline and NO kube-state-metrics / kube-events integration. So the AUTHORITATIVE OOMKill signal (a Kubernetes pod event with reason='OOMKilled', or kube_pod_container_status_restarts_total / container last-terminated exitCode=137) is NOT available in NR today. What IS available is each pod's own stdout. A single-replica stateful pod prints its startup banner ONLY on (re)start, so a banner appearing = the pod just (re)started — and outside a deliberate operator rollout, a stateful-pod restart is an OOMKill/eviction/crash. The startup strings below are the stable, image-native banners of the official upstream images (verified against the pinned images: pgvector/pgvector:pg16, mongo:7, redis:7-alpine, nats:2.10-alpine — NOT platform-emitted strings):\n  - postgres-customers (pgvector:pg16):  'database system is ready to accept connections'\n  - mongodb (mongo:7):                   'Waiting for connections'   (mongo logs structured JSON: \"msg\":\"Waiting for connections\")\n  - redis-provision (redis:7-alpine):    'Ready to accept connections'\n  - nats (nats:2.10-alpine):             'Server is ready'\nPod selector: k8s_namespace_name='instant-data', faceted by k8s_label_app so the violation names which data pod bounced. Shipped to NR via the newrelic-logging Fluent Bit DaemonSet.\n\nKNOWN LIMITATION / FALSE POSITIVES (accepted to have a signal TODAY): (a) this CANNOT distinguish an OOMKill from a deliberate operator rollout/restart — a planned `kubectl rollout restart` or the DATA-TIER-APPLY-RUNBOOK maintenance-window apply WILL fire this once per pod (expected; ack it during the window). (b) It cannot read the exit code, so it can't confirm exitCode=137 specifically. PROPER FOLLOW-UP (durable upgrade): once kube-state-metrics + the newrelic-prometheus-agent pipeline (#72) OR the NR Kubernetes/kube-events integration is applied in prod, replace/augment this with the authoritative event-based detector — alert on K8sContainerSample reason='OOMKilled' or kube_pod_container_status_last_terminated_reason{reason='OOMKilled'} / restarts_total derivative > 0, which fires ONLY on a real involuntary kill and never on a planned rollout. Until then, this banner detector is the alarm and is paired with the eviction-PROTECTION manifest (k8s/data/stateful-priority.yaml PriorityClass+PDBs + per-pod resource requests/limits in k8s/data/*.yaml; R7 #69, operator-apply-pending) that PREVENTS the OOMKill in the first place.\n\nWhen this fires:\n  1. Confirm + cause: `kubectl get pods -n instant-data -o wide` (RESTARTS column) then `kubectl describe pod -n instant-data <pod>` — look at Last State: Terminated, Reason: OOMKilled, Exit Code: 137. `kubectl logs --previous -n instant-data <pod>` shows the pre-crash tail.\n  2. If OOMKilled: the pod hit its memory limit (or had none → BestEffort eviction). Verify the R7 eviction-protection manifest is APPLIED (`kubectl get pod -n instant-data <pod> -o jsonpath='{.status.qosClass} {.spec.priorityClassName}'` should be Burstable/Guaranteed + instant-data-critical, NOT BestEffort/<empty>). If not applied, follow k8s/DATA-TIER-APPLY-RUNBOOK.md in a maintenance window.\n  3. VERIFY DATA INTACT (this is why the incident hurt): for mongodb confirm the lost DB/collection is present (`mongosh ... show dbs`); for postgres-customers confirm the customer db_/usr_ rows still exist; restore from the backup ladder if data was lost (infra/BACKUP-RESTORE-RUNBOOK.md).\n  4. If it was a planned operator rollout/maintenance-window apply: ack and close — this is the expected single fire per pod.",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE k8s_namespace_name = 'instant-data' AND ((k8s_label_app = 'postgres-customers' AND message LIKE '%database system is ready to accept connections%') OR (k8s_label_app = 'mongodb' AND message LIKE '%Waiting for connections%') OR (k8s_label_app = 'redis-provision' AND message LIKE '%Ready to accept connections%') OR (k8s_label_app = 'nats' AND message LIKE '%Server is ready%')) FACET k8s_label_app"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "ABOVE",
+      "threshold": 0,
+      "thresholdDuration": 300,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 300,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 120,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 1800,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
diff --git a/newrelic/alerts/deploy-failed-autopsy-log.json b/newrelic/alerts/deploy-failed-autopsy-log.json
@@ -0,0 +1,31 @@
+{
+  "name": "instant-worker — deploy build/runtime FAILURE detected (autopsy captured) [LOG backstop, rule 27]",
+  "type": "NRQL",
+  "description": "CRITICAL on ANY occurrence. Log-based backstop for the silent-deploy-failure class (CLAUDE.md rule 27, the 2026-05-30 incident). When a deploy's build Job lands in a terminal Failed state (BackoffLimitExceeded / DeadlineExceeded) OR the runtime rollout cannot start its container (StartFailed / ImagePullBackOff / CrashLoopBackOff / OOMKilled / ProgressDeadlineExceeded), the worker's deploy_failure_autopsy job captures the cause, stamps deployments.error_message, and emits an audit_log kind='deploy.failed' so the email forwarder dispatches the user-visible failure email — even when the api goroutine died mid-build and never wrote the terminal row. Every autopsy capture means a real customer deploy silently failed and a backstop fired.\n\nWhy a LOG alert: prod has NO Prometheus pipeline (newrelic-prometheus-agent is operator-apply-pending), so the metric alerts for this class — deploy-job-failed-detected.json and deploy-runtime-failed-detected.json (both FROM Metric on instant_deploy_job_failed_detected_total / instant_deploy_runtime_failed_detected_total) — are INERT today. This alert keys on the worker's autopsy LOG line so the failure surface is monitored RIGHT NOW; once the metrics pipeline lands the metric alerts become primary and this stays as a belt-and-suspenders.\n\nSource log line: worker emits slog.Info('jobs.deploy_failure_autopsy.captured', deployment_id=..., reason=<DeadlineExceeded|StartFailed|BuildFailed|BackoffLimitExceeded|ProgressDeadlineExceeded|OOMKilled|CrashLoopBackOff|ImagePullBackOff|Error|Unknown>, outcome=..., lines_captured=N) at worker/internal/jobs/deploy_failure_autopsy.go:402, paired with the audit_log kind='deploy.failed' row inserted by emitDeployFailedAudit (line 672). The `reason` field is the bounded failure-cause label. Shipped to NR via the newrelic-logging Fluent Bit DaemonSet. (The api also emits a synchronous deploy.failed on its own runDeploy path; this alert deliberately keys on the worker AUTOPSY line because the autopsy is the BACKSTOP that fires precisely when the api path died — the case the incident exposed.)\n\nWhen this fires:\n  1. Identify the deploy + cause: NR Logs `service='worker' message LIKE '%deploy_failure_autopsy.captured%'` FACET reason — which deployment_id, and is one reason dominating (a platform problem) vs scattered per-customer Dockerfile errors (expected at a low rate)?\n  2. reason=DeadlineExceeded across MANY deploys = the Kaniko build slot is timing out platform-wide (image bloat / degraded GHCR push path) — investigate the build pipeline, not the customer.\n  3. reason=OOMKilled/Evicted on the runtime pod = the build node is memory-tight — cross-reference data-tier-pod-oomkill-restart.json (same memory-overcommit failure mode).\n  4. Confirm the customer was told: check the deployment's error_message + that a deploy.failed failure email was dispatched (forwarder_sent ledger). Read surface for users: GET /api/v1/deployments/:id/events. See CLAUDE.md rule 27.",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE service = 'worker' AND message LIKE '%deploy_failure_autopsy.captured%'"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "ABOVE",
+      "threshold": 0,
+      "thresholdDuration": 300,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 300,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 180,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 3600,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
diff --git a/newrelic/alerts/propagation-dead-lettered-log.json b/newrelic/alerts/propagation-dead-lettered-log.json
@@ -0,0 +1,31 @@
+{
+  "name": "instant-worker — propagation dead-lettered (paid customer regrade fell through) [LOG backstop]",
+  "type": "NRQL",
+  "description": "CRITICAL on ANY occurrence. Log-based backstop for the propagation dead-letter — the last line of defence between a Razorpay subscription.charged webhook landing and the customer's infra actually being re-graded. When a row in pending_propagations exhausts its maxAttempts retries (provisioner gRPC down, markApplied DB blip, an unexpected_skip-as-failure) OR is rejected because no handler is registered for its kind (api/worker image skew), the row is dead-lettered: the customer PAID and the platform did NOT deliver the tier they bought. This must page immediately.\n\nWhy a LOG alert: prod has NO Prometheus pipeline (newrelic-prometheus-agent is operator-apply-pending), so the existing propagation-dead-lettered.json (FROM Metric on instant_propagation_dead_lettered_total) is INERT today. This alert keys on the worker's dead-letter LOG lines so the signal works RIGHT NOW; once the metrics pipeline lands the metric alert becomes primary and this stays as a backstop.\n\nSource log lines (both matched): worker emits slog.Error('jobs.propagation_runner.dead_lettered', propagation_id=..., team_id=..., kind=..., attempts=..., last_error=...) at worker/internal/jobs/propagation_runner.go:892 for the max-attempts path, AND slog.Error('jobs.propagation_runner.unknown_kind_dead_lettered', ...) at line 985 for the unknown-kind path. Both also write an audit_log row (audit_kind='propagation.dead_lettered' / 'propagation.unknown_kind_dead_lettered'). Shipped to NR via the newrelic-logging Fluent Bit DaemonSet.\n\nWhen this fires:\n  1. Identify the team + kind: NR Logs `service='worker' message LIKE '%propagation_runner%dead_lettered%'` — the line carries team_id, kind, and last_error (the underlying failure).\n  2. unknown_kind = api/worker image skew (a kind the worker doesn't have a handler for) — confirm both services are on the same SHA; redeploy the lagging one.\n  3. max_attempts = a persistent downstream failure (provisioner gRPC down, DB error) — fix the downstream, then re-arm: either DELETE the pending_propagations row to let entitlement_reconciler converge, OR reset failed_at=NULL + attempts=0 to re-run the runner.\n  4. Verify the customer's resources match their paid tier afterwards (resources.tier vs team.plan_tier). See worker propagation_runner.go + CHAOS-DRILL-2026-05-20.md F1/F2/F3.",
+  "enabled": true,
+  "nrql": {
+    "query": "SELECT count(*) FROM Log WHERE service = 'worker' AND (message LIKE '%propagation_runner.dead_lettered%' OR message LIKE '%propagation_runner.unknown_kind_dead_lettered%')"
+  },
+  "terms": [
+    {
+      "priority": "CRITICAL",
+      "operator": "ABOVE",
+      "threshold": 0,
+      "thresholdDuration": 300,
+      "thresholdOccurrences": "AT_LEAST_ONCE"
+    }
+  ],
+  "signal": {
+    "aggregationWindow": 300,
+    "aggregationMethod": "EVENT_FLOW",
+    "aggregationDelay": 120,
+    "fillOption": "STATIC",
+    "fillValue": 0
+  },
+  "expiration": {
+    "expirationDuration": 3600,
+    "openViolationOnExpiration": false,
+    "closeViolationsOnExpiration": true
+  },
+  "violationTimeLimitSeconds": 86400
+}
diff --git a/observability/METRICS-CATALOG.md b/observability/METRICS-CATALOG.md