| 1 | +# Billing Runbooks |
| 2 | + |
| 3 | +Operational runbooks for the billing module. Each runbook references real endpoints — see `modules/billing/routes/billing.admin.routes.js` for auth requirements (JWT admin token required for all `/api/admin/*` routes). |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## 1 — Stripe Dispute |
| 8 | + |
| 9 | +**Context**: Stripe gives 7 calendar days from dispute creation to submit evidence. Missing the window results in an automatic loss. `billing.dispute.opened` fires an ntfy alert on day 1. The dispute funds are held by Stripe immediately on `charge.dispute.funds_withdrawn`. |
| 10 | + |
| 11 | +**Steps**: |
| 12 | + |
| 13 | +1. Confirm dispute in Stripe Dashboard → Radar → Disputes. Note the `charge_id` and `dispute_id`. |
| 14 | +2. Retrieve customer state to verify DB-side ledger matches Stripe: |
| 15 | + ``` |
| 16 | + GET /api/admin/billing/customer/:orgId |
| 17 | + ``` |
| 18 | + Confirm `stripeStatus` matches `subscription.status` in DB. |
| 19 | +3. If the dispute is fraudulent (stolen card), cancel the subscription immediately: |
| 20 | + ``` |
| 21 | + POST /api/admin/billing/cancel/:orgId |
| 22 | + ``` |
| 23 | +4. Gather evidence in Stripe Dashboard: usage records, signup email, ToS acceptance timestamp, IP logs. |
| 24 | +5. Submit evidence before day 7 via Stripe Dashboard → Dispute → Submit evidence. |
| 25 | +6. If dispute won (`charge.dispute.funds_reinstated` received): verify the extras credit was re-applied by re-checking the ledger via `GET /api/admin/billing/customer/:orgId`. |
| 26 | +7. If dispute lost: extras balance debited by `charge.dispute.funds_withdrawn` is not refunded — log in incident tracker. |
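To keep the evidence window visible while working through the steps above, the deadline can be computed from the dispute creation time. This is a minimal sketch: the 7-day figure is taken from the context paragraph, and `evidenceDeadline` and `daysRemaining` are hypothetical helper names, not part of the billing module. Stripe also reports the due date on the dispute itself (`evidence_details.due_by`), which should take precedence if the two disagree.

```javascript
// Assumption: 7 calendar days, as stated in this runbook's context.
const EVIDENCE_WINDOW_DAYS = 7;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the Date by which evidence must be submitted.
function evidenceDeadline(disputeCreatedAt) {
  return new Date(disputeCreatedAt.getTime() + EVIDENCE_WINDOW_DAYS * MS_PER_DAY);
}

// Returns the number of whole days left before the deadline (never negative).
function daysRemaining(disputeCreatedAt, now = new Date()) {
  const msLeft = evidenceDeadline(disputeCreatedAt).getTime() - now.getTime();
  return Math.max(0, Math.floor(msLeft / MS_PER_DAY));
}
```

For example, `daysRemaining(new Date(dispute.created * 1000))` would work on a Stripe dispute object, whose `created` field is a Unix timestamp in seconds.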
| 27 | + |
| 28 | +--- |
| 29 | + |
| 30 | +## 2 — Dead-Letter Investigation |
| 31 | + |
| 32 | +**Context**: Stripe webhook events that fail processing 3+ times (or where the idempotency guard fires on a poisoned payload) are marked `deadLetter: true` in `processedStripeEvents`. They accumulate and must be reviewed manually, because a partial TTL index excludes them from auto-expiry. |
| 33 | + |
| 34 | +**Steps**: |
| 35 | + |
| 36 | +1. List all dead-letter events: |
| 37 | + ``` |
| 38 | + GET /api/admin/billing/dead-letters |
| 39 | + ``` |
| 40 | + Response includes `eventId`, `type`, `createdAt`, `lastError` for each. |
| 41 | + |
| 42 | +2. For each suspicious event, attempt replay (re-fetches event from Stripe API, re-dispatches through the webhook pipeline): |
| 43 | + ``` |
| 44 | + POST /api/admin/billing/webhook/replay |
| 45 | + Body: { "eventId": "evt_xxx" } |
| 46 | + ``` |
| 47 | + On success: the event is re-processed and the `deadLetter` flag cleared automatically. |
| 48 | + |
| 49 | +3. If replay succeeds but state is still inconsistent (e.g. subscription not updated), force a DB sync from Stripe: |
| 50 | + ``` |
| 51 | + POST /api/admin/billing/sync/:orgId |
| 52 | + ``` |
| 53 | + |
| 54 | +4. If the event is stale/unrecoverable (e.g. the subscription no longer exists in Stripe), purge it: |
| 55 | + ``` |
| 56 | + DELETE /api/admin/billing/dead-letters/:eventId |
| 57 | + ``` |
| 58 | + |
| 59 | +5. If the same event type keeps dead-lettering: check `lastError` for the root cause, open a fix issue, and monitor the next occurrence before purging. |
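Step 5 is easier when the dead-letter list is grouped by event type. The sketch below assumes the `GET /api/admin/billing/dead-letters` response is an array of objects carrying the `eventId`, `type`, and `lastError` fields listed in step 1; the array shape and the `summarizeDeadLetters` name are assumptions for illustration.

```javascript
// Groups dead-letter entries by event type so repeat offenders stand out.
function summarizeDeadLetters(entries) {
  const byType = new Map();
  for (const { eventId, type, lastError } of entries) {
    const bucket = byType.get(type) ?? { type, count: 0, eventIds: [], errors: new Set() };
    bucket.count += 1;
    bucket.eventIds.push(eventId);
    bucket.errors.add(lastError); // distinct error messages per type
    byType.set(type, bucket);
  }
  // Most frequent event types first; errors converted to a plain array.
  return [...byType.values()]
    .map((bucket) => ({ ...bucket, errors: [...bucket.errors] }))
    .sort((a, b) => b.count - a.count);
}
```

A type with a high count and a single distinct error usually points at one root cause, which is the case where opening a fix issue before purging makes sense.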
| 60 | + |
| 61 | +--- |
| 62 | + |
| 63 | +## 3 — Meter Mismatch |
| 64 | + |
| 65 | +**Context**: `billing.reconcile` cron (Sundays 03:00 UTC) logs `billing.reconciliation.divergence` when Stripe subscription status or plan differs from the DB. Operations must investigate and resolve manually — no auto-fix to avoid masking bugs. |
| 66 | + |
| 67 | +**Steps**: |
| 68 | + |
| 69 | +1. Identify the divergence from the weekly reconciliation log: |
| 70 | + ``` |
| 71 | + kubectl logs -n pierreb-projects job/billing-reconcile-<timestamp> |
| 72 | + ``` |
| 73 | + Look for lines containing `divergence detected` — they include `orgId`, `stripeStatus`, `dbStatus`, `stripePlan`, `dbPlan`. |
| 74 | + |
| 75 | +2. Get the full customer state for the affected org: |
| 76 | + ``` |
| 77 | + GET /api/admin/billing/customer/:orgId |
| 78 | + ``` |
| 79 | + Compare `stripeSnapshot` (live from Stripe API) vs `dbSnapshot` (local DB) fields. |
| 80 | + |
| 81 | +3. If Stripe is authoritative (e.g. subscription renewed but DB missed the webhook), sync Stripe → DB: |
| 82 | + ``` |
| 83 | + POST /api/admin/billing/sync/:orgId |
| 84 | + ``` |
| 85 | + |
| 86 | +4. If the plan needs manual correction (e.g. plan bump after payment confirmation): |
| 87 | + ``` |
| 88 | + PATCH /api/admin/billing/plans/bump |
| 89 | + Body: { "orgId": "...", "planId": "pro", "reason": "manual reconciliation post-mismatch" } |
| 90 | + ``` |
| 91 | + |
| 92 | +5. Re-run `GET /api/admin/billing/customer/:orgId` to confirm `stripeSnapshot` and `dbSnapshot` now match. |
| 93 | + |
| 94 | +6. If mismatch persists after sync: open an incident, check dead-letter queue (Runbook #2), replay missing events. |
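The comparison in steps 2 and 5 can be done mechanically. The following sketch assumes `stripeSnapshot` and `dbSnapshot` are flat objects with comparable primitive fields (for instance `status` and `plan`); the actual field names in the response are not specified here, and `diffSnapshots` is a hypothetical helper.

```javascript
// Returns one entry per top-level field whose value differs between snapshots.
function diffSnapshots(stripeSnapshot, dbSnapshot) {
  const keys = new Set([...Object.keys(stripeSnapshot), ...Object.keys(dbSnapshot)]);
  const diffs = [];
  for (const key of keys) {
    // Shallow strict comparison: nested objects would need a deep compare.
    if (stripeSnapshot[key] !== dbSnapshot[key]) {
      diffs.push({ field: key, stripe: stripeSnapshot[key], db: dbSnapshot[key] });
    }
  }
  return diffs;
}
```

An empty result after `POST /api/admin/billing/sync/:orgId` is the signal that step 5 has passed.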
| 95 | + |
| 96 | +--- |
| 97 | + |
| 98 | +## 4 — Stripe LIVE Rollout |
| 99 | + |
| 100 | +**Context**: Pre-live checklist before switching `STRIPE_SECRET_KEY` from `sk_test_*` to `sk_live_*` in production. |
| 101 | + |
| 102 | +**Pre-live checklist** (complete all before toggling): |
| 103 | + |
| 104 | +- [ ] Stripe Dashboard (LIVE mode): 10 webhook events enabled (see `STRIPE_SETUP.md`) |
| 105 | +- [ ] Stripe Dashboard (LIVE mode): Smart Retries enabled (Billing settings → Smart Retries) |
| 106 | +- [ ] Stripe Dashboard (LIVE mode): `tax_id` collection enabled in Checkout (B2B EU) |
| 107 | +- [ ] `STRIPE_SECRET_KEY` = `sk_live_*` set in K8s secret `trawl-node-env` |
| 108 | +- [ ] `STRIPE_WEBHOOK_SECRET` = `whsec_*` (LIVE mode endpoint secret) updated in K8s secret |
| 109 | +- [ ] `STRIPE_PRICE_*` env vars point to LIVE price IDs (not test price IDs) |
| 110 | +- [ ] All 4 CronJob manifests deployed: `trawl-billing-dunning-sweep`, `trawl-billing-weekly-reset`, `trawl-billing-extras-expiration`, `trawl-billing-reconcile` |
| 111 | +- [ ] Dead-letter queue empty: `GET /api/admin/billing/dead-letters` → 0 entries |
| 112 | +- [ ] Test mode webhooks drained: Stripe Dashboard → Webhooks → no pending test deliveries |
| 113 | +- [ ] Smoke test: in staging pointed at **TEST** Stripe keys (not LIVE), create a checkout session using Stripe test card `4242 4242 4242 4242` — confirm `checkout.session.completed` webhook received + subscription created in DB. Do **not** use test cards against LIVE keys (they are rejected; use this step to validate the integration flow, then cut over to LIVE keys for production) |
| 114 | +- [ ] Rollback plan documented: toggling `STRIPE_SECRET_KEY` back to test key is sufficient for rollback (no DB migration required) |
| 115 | + |
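| 116 | +**Go/no-go gate**: all checkboxes ticked + at least 1 successful end-to-end checkout in staging with TEST keys (the smoke test in the checklist). Test cards are rejected against LIVE keys, so the staging run cannot use them. |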
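Part of the checklist can be verified before the toggle with a small pre-flight script. This is a sketch under stated assumptions: only the variable names and key prefixes come from the checklist, and `checkLiveEnv` is a hypothetical helper, not something that exists in the repo.

```javascript
// Returns a list of human-readable problems; an empty list means the
// environment passes these (partial) checks.
function checkLiveEnv(env) {
  const problems = [];

  const secretKey = env.STRIPE_SECRET_KEY ?? '';
  if (!secretKey.startsWith('sk_live_')) {
    problems.push('STRIPE_SECRET_KEY is not a sk_live_* key');
  }

  const webhookSecret = env.STRIPE_WEBHOOK_SECRET ?? '';
  if (!webhookSecret.startsWith('whsec_')) {
    problems.push('STRIPE_WEBHOOK_SECRET is not a whsec_* secret');
  }

  // Live and test price IDs share the same shape, so this only catches
  // missing or empty values; confirming LIVE mode needs a Dashboard check.
  const priceVars = Object.keys(env).filter((name) => name.startsWith('STRIPE_PRICE_'));
  if (priceVars.length === 0) {
    problems.push('no STRIPE_PRICE_* variables are set');
  }
  for (const name of priceVars) {
    if (!env[name]) problems.push(`${name} is empty`);
  }

  return problems;
}
```

Running `checkLiveEnv(process.env)` inside the production pod would cover the three environment items; the Dashboard and CronJob items still need a manual check.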
| 117 | + |
| 118 | +--- |
| 119 | + |
| 120 | +## 5 — Stripe API Down |
| 121 | + |
| 122 | +**Context**: When Stripe's API is unavailable, the billing module degrades gracefully rather than erroring. No revenue-blocking occurs for existing subscribers. |
| 123 | + |
| 124 | +**Behavior during outage**: |
| 125 | + |
| 126 | +- `GET /api/billing/plans` (`getPlans`): returns stale cache (up to 24h old) + emits `billing.plans.stale` event. Once the cache is older than the 24h stale TTL, the call throws and clients see a 503; they could not subscribe anyway, since checkout would fail at Stripe. |
| 127 | +- Incoming webhooks: Stripe's retry queue accumulates events and delivers them when connectivity is restored. No action required — events replay automatically via Stripe's retry schedule. |
| 128 | +- `POST /api/admin/billing/sync/:orgId`: fails with Stripe error — do not retry in a loop; wait for Stripe status page to confirm recovery. |
| 129 | +- `POST /api/admin/billing/webhook/replay`: fails if Stripe API unreachable (event re-fetch fails). Note the pending event IDs and replay them after recovery. |
| 130 | +- Meter usage (`incrementMeter`): continues working — no Stripe call on the hot path. Extras debit is DB-only. |
| 131 | +- Admin operations (`adminBumpPlan`, `adminCancelSubscription`): `adminBumpPlan` is DB-only and continues working. `adminCancelSubscription` calls `stripe.subscriptions.cancel` — will fail; retry after recovery. |
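The stale-cache behavior described for `getPlans` can be illustrated as a small decision function. This is not the module's actual implementation: the cache entry shape (`plans`, `fetchedAt` in epoch milliseconds), the error class, and the function name are all assumptions made for the sketch; only the 24h TTL and the 503 outcome come from the list above.

```javascript
const STALE_TTL_MS = 24 * 60 * 60 * 1000; // 24h stale TTL from the runbook

// Hypothetical error type carrying the HTTP status the client would see.
class PlansUnavailableError extends Error {
  constructor(message) {
    super(message);
    this.name = 'PlansUnavailableError';
    this.statusCode = 503;
  }
}

// Called when the live Stripe fetch has failed: serve the cache if it is
// young enough, otherwise throw so the route answers 503.
function plansFromStaleCache(cacheEntry, now = Date.now()) {
  if (!cacheEntry) {
    throw new PlansUnavailableError('no cached plans available');
  }
  const ageMs = now - cacheEntry.fetchedAt;
  if (ageMs > STALE_TTL_MS) {
    throw new PlansUnavailableError('cached plans are older than the 24h stale TTL');
  }
  return { plans: cacheEntry.plans, stale: true, ageMs };
}
```

In a real handler, the `stale: true` branch would be the place to emit `billing.plans.stale`.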
| 132 | + |
| 133 | +**Steps during outage**: |
| 134 | + |
| 135 | +1. Confirm Stripe outage via https://status.stripe.com — check if it is API-wide or specific to Webhooks/Dashboard. |
| 136 | +2. No action required for existing subscribers — their access is unaffected. |
| 137 | +3. Disable any scheduled marketing emails that reference plan upgrade CTAs to avoid confusing users who cannot checkout. |
| 138 | +4. Monitor `billing.plans.stale` event frequency — if the stale cache is 24h+, alert the on-call to decide whether to take the plans endpoint down entirely or serve a static fallback. |
| 139 | +5. Once Stripe recovers: `POST /api/admin/billing/sync/:orgId` on any org that attempted a subscription change during the outage. |
| 140 | +6. Check dead-letter queue for events that exhausted retries during the outage window: `GET /api/admin/billing/dead-letters`. |
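For step 5, the post-recovery syncs can be prepared as a batch of requests. The sketch below only builds the request descriptions and leaves sending to the operator; `buildSyncRequests` is a hypothetical helper, and passing the admin JWT as a `Bearer` token in the `Authorization` header is an assumption about how the admin routes expect it.

```javascript
// Builds one POST request description per org that needs a Stripe -> DB sync.
function buildSyncRequests(baseUrl, adminToken, orgIds) {
  return orgIds.map((orgId) => ({
    url: `${baseUrl}/api/admin/billing/sync/${encodeURIComponent(orgId)}`,
    init: {
      method: 'POST',
      // Assumption: the admin JWT is sent as a Bearer token.
      headers: { Authorization: `Bearer ${adminToken}` },
    },
  }));
}
```

Each entry can be passed to `fetch(url, init)`. Sending them one at a time and stopping on the first Stripe error fits the guidance above not to retry in a loop.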