Skip to content

Commit f81aa78

Browse files
feat(billing): operational maturity — runbooks, stale cache, structured logs (Étape 4) (#3613)
* feat(billing): operational maturity — runbooks, stale cache, structured logs (Étape 4) A — RUNBOOKS.md: 5 operational runbooks (dispute, dead-letter, meter mismatch, LIVE rollout checklist, Stripe API down fallback) with real endpoint refs. B — STRIPE_SETUP.md: 10 webhook events, Smart Retries, extras forfeit policy, 5-year FR retention, tax_id B2B EU setup. C.1 — getPlans stale-while-revalidate: fresh TTL 5min, stale TTL 24h, emits billing.plans.stale on Stripe error fallback. C.2 — Crons: replace console.* with structured logger in all 4 crons (dunningSweep, weeklyReset, extrasExpiration, reconcile) with entry/exit timing and contextual fields. C.3 — billing.usage.service: replace console.warn/error with logger on hot path (extras debit failure, threshold mark failure, unsupported threshold) including organizationId + idempotencyKey for traceability. Tests: +3 stale-while-revalidate unit tests; fix logger mock in 4 test files that transitively import billing.plans.service (billing.plans, billing.usage.service, billing.service.extras, billing.usageHeader); update console.warn spy → logger.warn in 2 existing tests. * test(billing): cover extras debit unexpectedly-not-applied logger.error branch Add test for the `debitResult.applied === false && reason !== 'duplicate_step'` path in incrementMeter to bring patch coverage to 100% on new code. * fix(billing): critical-review findings — TTL precision + mock isolation + runbook clarity - billing.plans.service: use Date.now() inline in catch (not captured now) to account for Stripe round-trip latency at the 24h stale boundary - billing.plans.unit.tests: wrap all Date.now manual mocks in try/finally so restore always runs even when an assertion fails (mock-leak fix) - RUNBOOKS.md #4: smoke test must use TEST keys + test card; add warning that test cards are rejected against LIVE keys (false-pass prevention) * fix(billing/crons): structured error serialization + applyJitter inside try Per Copilot review: - Replace `{ err }` (serializes to {}) with `{ err: err?.message, stack: err?.stack }` in all 4 cron fatal/per-item error log calls. - Move `await applyJitter(...)` inside `try` block in weeklyReset + reconcile so jitter failures are caught by the structured `[cron.X] failed` handler.
1 parent 5ff7deb commit f81aa78

12 files changed

Lines changed: 484 additions & 69 deletions

modules/billing/RUNBOOKS.md

Lines changed: 140 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,140 @@
1+
# Billing Runbooks
2+
3+
Operational runbooks for the billing module. Each runbook references real endpoints — see `modules/billing/routes/billing.admin.routes.js` for auth requirements (JWT admin token required for all `/api/admin/*` routes).
4+
5+
---
6+
7+
## 1 — Stripe Dispute
8+
9+
**Context**: Stripe gives 7 calendar days from dispute creation to submit evidence. Missing the window results in an automatic loss. `billing.dispute.opened` fires an ntfy alert on day 1. The dispute funds are held by Stripe immediately on `charge.dispute.funds_withdrawn`.
10+
11+
**Steps**:
12+
13+
1. Confirm dispute in Stripe Dashboard → Radar → Disputes. Note the `charge_id` and `dispute_id`.
14+
2. Retrieve customer state to verify DB-side ledger matches Stripe:
15+
```
16+
GET /api/admin/billing/customer/:orgId
17+
```
18+
Confirm `stripeStatus` matches `subscription.status` in DB.
19+
3. If the dispute is fraudulent (stolen card), cancel the subscription immediately:
20+
```
21+
POST /api/admin/billing/cancel/:orgId
22+
```
23+
4. Gather evidence in Stripe Dashboard: usage records, signup email, ToS acceptance timestamp, IP logs.
24+
5. Submit evidence before day 7 via Stripe Dashboard → Dispute → Submit evidence.
25+
6. If dispute won (`charge.dispute.funds_reinstated` received): verify the extras credit was re-applied by re-checking the ledger via `GET /api/admin/billing/customer/:orgId`.
26+
7. If dispute lost: extras balance debited by `charge.dispute.funds_withdrawn` is not refunded — log in incident tracker.
27+
28+
---
29+
30+
## 2 — Dead-Letter Investigation
31+
32+
**Context**: Stripe webhook events that fail processing 3+ times (or where the idempotency guard fires on a poisoned payload) are marked `deadLetter: true` in `processedStripeEvents`. They accumulate and must be reviewed manually — partial TTL index excludes them from auto-expiry.
33+
34+
**Steps**:
35+
36+
1. List all dead-letter events:
37+
```
38+
GET /api/admin/billing/dead-letters
39+
```
40+
Response includes `eventId`, `type`, `createdAt`, `lastError` for each.
41+
42+
2. For each suspicious event, attempt replay (re-fetches event from Stripe API, re-dispatches through the webhook pipeline):
43+
```
44+
POST /api/admin/billing/webhook/replay
45+
Body: { "eventId": "evt_xxx" }
46+
```
47+
On success: the event is re-processed and the `deadLetter` flag cleared automatically.
48+
49+
3. If replay succeeds but state is still inconsistent (e.g. subscription not updated), force a DB sync from Stripe:
50+
```
51+
POST /api/admin/billing/sync/:orgId
52+
```
53+
54+
4. If the event is stale/unrecoverable (e.g. the subscription no longer exists in Stripe), purge it:
55+
```
56+
DELETE /api/admin/billing/dead-letters/:eventId
57+
```
58+
59+
5. If the same event type keeps dead-lettering: check `lastError` for the root cause, open a fix issue, and monitor the next occurrence before purging.
60+
61+
---
62+
63+
## 3 — Meter Mismatch
64+
65+
**Context**: `billing.reconcile` cron (Sundays 03:00 UTC) logs `billing.reconciliation.divergence` when Stripe subscription status or plan differs from the DB. Operations must investigate and resolve manually — no auto-fix to avoid masking bugs.
66+
67+
**Steps**:
68+
69+
1. Identify the divergence from the weekly reconciliation log:
70+
```
71+
kubectl logs -n pierreb-projects job/billing-reconcile-<timestamp>
72+
```
73+
Look for lines containing `divergence detected` — they include `orgId`, `stripeStatus`, `dbStatus`, `stripePlan`, `dbPlan`.
74+
75+
2. Get the full customer state for the affected org:
76+
```
77+
GET /api/admin/billing/customer/:orgId
78+
```
79+
Compare `stripeSnapshot` (live from Stripe API) vs `dbSnapshot` (local DB) fields.
80+
81+
3. If Stripe is authoritative (e.g. subscription renewed but DB missed the webhook), sync Stripe → DB:
82+
```
83+
POST /api/admin/billing/sync/:orgId
84+
```
85+
86+
4. If the plan needs manual correction (e.g. plan bump after payment confirmation):
87+
```
88+
PATCH /api/admin/billing/plans/bump
89+
Body: { "orgId": "...", "planId": "pro", "reason": "manual reconciliation post-mismatch" }
90+
```
91+
92+
5. Re-run `GET /api/admin/billing/customer/:orgId` to confirm `stripeSnapshot` and `dbSnapshot` now match.
93+
94+
6. If mismatch persists after sync: open an incident, check dead-letter queue (Runbook #2), replay missing events.
95+
96+
---
97+
98+
## 4 — Stripe LIVE Rollout
99+
100+
**Context**: Pre-live checklist before switching `STRIPE_SECRET_KEY` from `sk_test_*` to `sk_live_*` in production.
101+
102+
**Pre-live checklist** (complete all before toggling):
103+
104+
- [ ] Stripe Dashboard (LIVE mode): 10 webhook events enabled (see `STRIPE_SETUP.md`)
105+
- [ ] Stripe Dashboard (LIVE mode): Smart Retries enabled (Billing settings → Smart Retries)
106+
- [ ] Stripe Dashboard (LIVE mode): `tax_id` collection enabled in Checkout (B2B EU)
107+
- [ ] `STRIPE_SECRET_KEY` = `sk_live_*` set in K8s secret `trawl-node-env`
108+
- [ ] `STRIPE_WEBHOOK_SECRET` = `whsec_*` (LIVE mode endpoint secret) updated in K8s secret
109+
- [ ] `STRIPE_PRICE_*` env vars point to LIVE price IDs (not test price IDs)
110+
- [ ] All 4 CronJob manifests deployed: `trawl-billing-dunning-sweep`, `trawl-billing-weekly-reset`, `trawl-billing-extras-expiration`, `trawl-billing-reconcile`
111+
- [ ] Dead-letter queue empty: `GET /api/admin/billing/dead-letters` → 0 entries
112+
- [ ] Test mode webhooks drained: Stripe Dashboard → Webhooks → no pending test deliveries
113+
- [ ] Smoke test: in staging pointed at **TEST** Stripe keys (not LIVE), create a checkout session using Stripe test card `4242 4242 4242 4242` — confirm `checkout.session.completed` webhook received + subscription created in DB. Do **not** use test cards against LIVE keys (they are rejected; use this step to validate the integration flow, then cut over to LIVE keys for production)
114+
- [ ] Rollback plan documented: toggling `STRIPE_SECRET_KEY` back to test key is sufficient for rollback (no DB migration required)
115+
116+
**Go/no-go gate**: all checkboxes ticked + at least 1 successful end-to-end checkout in staging with LIVE keys.
117+
118+
---
119+
120+
## 5 — Stripe API Down
121+
122+
**Context**: When Stripe's API is unavailable, the billing module degrades gracefully rather than erroring. No revenue-blocking occurs for existing subscribers.
123+
124+
**Behavior during outage**:
125+
126+
- `GET /api/billing/plans` (`getPlans`): returns stale cache (up to 24h old) + emits `billing.plans.stale` event. After 24h stale TTL, throws — clients see a 503 but cannot subscribe anyway (checkout would fail at Stripe).
127+
- Incoming webhooks: Stripe's retry queue accumulates events and delivers them when connectivity is restored. No action required — events replay automatically via Stripe's retry schedule.
128+
- `POST /api/admin/billing/sync/:orgId`: fails with Stripe error — do not retry in a loop; wait for Stripe status page to confirm recovery.
129+
- `POST /api/admin/billing/webhook/replay`: fails if Stripe API unreachable (event re-fetch fails). Queue replays until recovery.
130+
- Meter usage (`incrementMeter`): continues working — no Stripe call on the hot path. Extras debit is DB-only.
131+
- Admin operations (`adminBumpPlan`, `adminCancelSubscription`): `adminBumpPlan` is DB-only and continues working. `adminCancelSubscription` calls `stripe.subscriptions.cancel` — will fail; retry after recovery.
132+
133+
**Steps during outage**:
134+
135+
1. Confirm Stripe outage via https://status.stripe.com — check if it is API-wide or specific to Webhooks/Dashboard.
136+
2. No action required for existing subscribers — their access is unaffected.
137+
3. Disable any scheduled marketing emails that reference plan upgrade CTAs to avoid confusing users who cannot checkout.
138+
4. Monitor `billing.plans.stale` event frequency — if the stale cache is 24h+, alert the on-call to decide whether to take the plans endpoint down entirely or serve a static fallback.
139+
5. Once Stripe recovers: `POST /api/admin/billing/sync/:orgId` on any org that attempted a subscription change during the outage.
140+
6. Check dead-letter queue for events that exhausted retries during the outage window: `GET /api/admin/billing/dead-letters`.

modules/billing/STRIPE_SETUP.md

Lines changed: 72 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,72 @@
1+
# Stripe Dashboard Setup
2+
3+
Configuration steps required in the Stripe Dashboard before going LIVE. All steps apply to both test mode (staging validation) and LIVE mode (production).
4+
5+
---
6+
7+
## 1 — Webhook Events to Enable
8+
9+
Navigate to: **Stripe Dashboard → Developers → Webhooks → [your endpoint] → Update details**
10+
11+
Enable exactly these 10 events:
12+
13+
| Event | Trigger |
14+
|---|---|
15+
| `checkout.session.completed` | User completes a checkout → subscription created |
16+
| `customer.subscription.created` | Stripe creates a subscription (also fires on checkout) |
17+
| `customer.subscription.updated` | Plan change, renewal, status change |
18+
| `customer.subscription.deleted` | Subscription cancelled or expired |
19+
| `invoice.payment_succeeded` | Successful invoice payment → renews subscription |
20+
| `invoice.payment_failed` | Failed invoice → triggers dunning flow |
21+
| `charge.refunded` | Refund issued → credits extras ledger |
22+
| `charge.dispute.created` | Dispute opened → 7-day evidence window starts |
23+
| `charge.dispute.funds_withdrawn` | Dispute funds held → debits extras balance |
24+
| `charge.dispute.funds_reinstated` | Dispute won → re-credits extras balance |
25+
26+
No other events are required. Adding extra events without handlers results in dead-letter accumulation.
27+
28+
---
29+
30+
## 2 — Smart Retries
31+
32+
Navigate to: **Stripe Dashboard → Settings → Billing → Smart Retries**
33+
34+
Enable **Smart Retries**. Stripe uses ML to schedule retry attempts at the optimal time for each card. This reduces involuntary churn without requiring a custom retry schedule in the dunning cron.
35+
36+
The dunning cron (`billing.dunningSweep`) activates 14 days after `pastDueSince` — coordinated with the default Smart Retries window of 7 days. If you extend Smart Retries beyond 7 days, adjust `billing.dunningThresholdDays` accordingly.
37+
38+
---
39+
40+
## 3 — Extras Cancel Policy
41+
42+
**Policy**: topup extras packs are forfeited without refund on subscription cancellation.
43+
44+
This is consistent with the ToS (Section CGU — Extras packs). No proration or partial refund is owed.
45+
46+
**Stripe configuration**: no dashboard setting needed — the `adminCancelSubscription` endpoint handles cancel at period end and the extras balance is NOT refunded. The DB ledger retains the history for audit.
47+
48+
---
49+
50+
## 4 — Billing Data Retention
51+
52+
**Obligation légale FR (LCB-FT + RGPD)**: billing data must be retained for **5 years** from the last transaction date.
53+
54+
**Implementation**:
55+
- `BillingSubscription`, `BillingExtraBalance`, `BillingUsage`, `ProcessedStripeEvent` collections have no TTL on production data.
56+
- The `processedStripeEvents` TTL index (`expireAfterSeconds: 2592000` = 30 days) applies only to non-dead-letter events — processed events are safe to expire (the authoritative record is in Stripe). The ledger (`BillingExtraBalance`) is permanent.
57+
- On account deletion: **anonymize, do not delete** billing records. Replace `organizationId` reference with a tombstone ID, remove PII from the org record, retain the financial ledger for 5 years.
58+
- Stripe customer data: use `stripe.customers.del()` to delete the Stripe customer on account deletion. This satisfies RGPD right-to-erasure for Stripe-side PII. The local ledger rows (amount, date, plan) are not PII and must be retained.
59+
60+
---
61+
62+
## 5 — Tax ID Collection (B2B EU / FR)
63+
64+
Navigate to: **Stripe Dashboard → Settings → Checkout → Tax ID collection**
65+
66+
Enable **Tax ID collection** in Checkout.
67+
68+
This allows EU B2B customers to enter their VAT number (`FR12345678901`) during checkout. Stripe validates the VAT ID via VIES and attaches it to the customer record.
69+
70+
**Effect**: invoices generated by Stripe include the customer's VAT ID, making them eligible for VAT reversal (autoliquidation) under EU B2B rules — required for French B2B invoice compliance.
71+
72+
No code change required — `billing.service.js` passes `allow_promotion_codes: true` to Checkout; `tax_id_collection` is a Dashboard-level toggle that Stripe applies automatically to all Checkout sessions.

modules/billing/crons/billing.dunningSweep.js

Lines changed: 21 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -22,20 +22,25 @@ process.env.NODE_ENV = process.env.NODE_ENV || 'development';
2222
const [
2323
{ default: config },
2424
{ default: mongooseService },
25+
{ default: logger },
2526
{ applyJitter },
2627
{ getCronJitterMaxMs, getDunningThresholdDays },
2728
] = await Promise.all([
2829
import('../../../config/index.js'),
2930
import('../../../lib/services/mongoose.js'),
31+
import('../../../lib/services/logger.js'),
3032
import('../lib/billing.cron-utils.js'),
3133
import('../lib/billing.constants.js'),
3234
]);
3335

3436
if (!config?.billing?.meterMode) {
35-
console.log('[billing.dunningSweep] meterMode disabled — skipping.');
37+
logger.info('[cron.dunningSweep] meterMode disabled — skipping.');
3638
process.exit(0);
3739
}
3840

41+
const startMs = Date.now();
42+
logger.info('[cron.dunningSweep] start');
43+
3944
try {
4045
await applyJitter(getCronJitterMaxMs());
4146
await mongooseService.connect();
@@ -49,7 +54,7 @@ try {
4954
const threshold = new Date(now.getTime() - getDunningThresholdDays() * 24 * 60 * 60 * 1000);
5055

5156
const staleSubs = await BillingSubscriptionRepository.findStaleDunning(threshold);
52-
console.log(`[billing.dunningSweep] ${staleSubs.length} stale past_due subscription(s) found`);
57+
logger.info('[cron.dunningSweep] stale past_due subscriptions found', { count: staleSubs.length });
5358

5459
let processed = 0;
5560
let errors = 0;
@@ -59,7 +64,7 @@ try {
5964
try {
6065
const subscription = await BillingSubscriptionRepository.markUnpaid(String(sub._id), threshold);
6166
if (!subscription) {
62-
console.log(`[billing.dunningSweep] sub ${sub._id} skipped — already recovered`);
67+
logger.info('[cron.dunningSweep] sub skipped — already recovered', { subId: String(sub._id) });
6368
continue;
6469
}
6570

@@ -68,22 +73,30 @@ try {
6873
} catch (orgErr) {
6974
// Compensation: Subscription is now unpaid but Org.plan update failed.
7075
// Log for manual reconciliation — do not revert Subscription status.
71-
console.error('[billing.dunningSweep] Org plan sync failed (manual reconciliation required):', orgErr);
76+
logger.error('[cron.dunningSweep] Org plan sync failed (manual reconciliation required)', {
77+
subId: String(sub._id),
78+
orgId: String(sub.organization),
79+
err: orgErr?.message,
80+
stack: orgErr?.stack,
81+
});
7282
desyncErrors += 1;
7383
}
7484

75-
console.log(`[billing.dunningSweep] sub ${sub._id} → unpaid, org ${sub.organization} → free`);
85+
logger.info('[cron.dunningSweep] sub transitioned to unpaid', {
86+
subId: String(sub._id),
87+
orgId: String(sub.organization),
88+
});
7689
processed += 1;
7790
} catch (err) {
7891
errors += 1;
79-
console.error(`[billing.dunningSweep] failed for sub ${sub._id}:`, err);
92+
logger.error('[cron.dunningSweep] failed for sub', { subId: String(sub._id), err: err?.message, stack: err?.stack });
8093
}
8194
}
8295

83-
console.log(`[billing.dunningSweep] done — processed: ${processed}, errors: ${errors}, desyncErrors: ${desyncErrors}`);
96+
logger.info('[cron.dunningSweep] complete', { processed, errors, desyncErrors, durationMs: Date.now() - startMs });
8497
process.exitCode = errors > 0 ? 1 : 0;
8598
} catch (err) {
86-
console.error('[billing.dunningSweep] fatal:', err);
99+
logger.error('[cron.dunningSweep] failed', { err: err?.message, stack: err?.stack });
87100
process.exitCode = 1;
88101
} finally {
89102
await mongooseService.disconnect?.();

modules/billing/crons/billing.extrasExpiration.js

Lines changed: 10 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -17,20 +17,25 @@ process.env.NODE_ENV = process.env.NODE_ENV || 'development';
1717
const [
1818
{ default: config },
1919
{ default: mongooseService },
20+
{ default: logger },
2021
{ applyJitter },
2122
{ getCronJitterMaxMs },
2223
] = await Promise.all([
2324
import('../../../config/index.js'),
2425
import('../../../lib/services/mongoose.js'),
26+
import('../../../lib/services/logger.js'),
2527
import('../lib/billing.cron-utils.js'),
2628
import('../lib/billing.constants.js'),
2729
]);
2830

2931
if (!config?.billing?.meterMode) {
30-
console.log('[billing.extrasExpiration] meterMode disabled — skipping.');
32+
logger.info('[cron.extrasExpiration] meterMode disabled — skipping.');
3133
process.exit(0);
3234
}
3335

36+
const startMs = Date.now();
37+
logger.info('[cron.extrasExpiration] start');
38+
3439
try {
3540
await applyJitter(getCronJitterMaxMs());
3641
await mongooseService.connect();
@@ -50,18 +55,18 @@ try {
5055
for (const orgId of orgIds) {
5156
try {
5257
const added = await BillingExtraService.expireOldEntries(orgId);
53-
console.log(`[billing.extrasExpiration] org ${orgId}: ${added} expiration entries added`);
58+
logger.info('[cron.extrasExpiration] org processed', { orgId: String(orgId), entriesAdded: added });
5459
processed += 1;
5560
} catch (err) {
5661
errors += 1;
57-
console.error(`[billing.extrasExpiration] expireOldEntries failed for org ${orgId}:`, err);
62+
logger.error('[cron.extrasExpiration] expireOldEntries failed', { orgId: String(orgId), err: err?.message, stack: err?.stack });
5863
}
5964
}
6065

61-
console.log(`[billing.extrasExpiration] done — processed: ${processed}, errors: ${errors}`);
66+
logger.info('[cron.extrasExpiration] complete', { processed, errors, durationMs: Date.now() - startMs });
6267
process.exitCode = errors > 0 ? 1 : 0;
6368
} catch (err) {
64-
console.error('[billing.extrasExpiration] fatal:', err);
69+
logger.error('[cron.extrasExpiration] failed', { err: err?.message, stack: err?.stack });
6570
process.exitCode = 1;
6671
} finally {
6772
await mongooseService.disconnect?.();

modules/billing/crons/billing.reconcile.js

Lines changed: 8 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -20,23 +20,27 @@ process.env.NODE_ENV = process.env.NODE_ENV || 'development';
2020
const [
2121
{ default: config },
2222
{ default: mongooseService },
23+
{ default: logger },
2324
{ applyJitter },
2425
{ getCronJitterMaxMs },
2526
] = await Promise.all([
2627
import('../../../config/index.js'),
2728
import('../../../lib/services/mongoose.js'),
29+
import('../../../lib/services/logger.js'),
2830
import('../lib/billing.cron-utils.js'),
2931
import('../lib/billing.constants.js'),
3032
]);
3133

3234
if (!config?.billing?.meterMode) {
33-
console.log('[billing.reconcile] meterMode disabled — skipping.');
35+
logger.info('[cron.reconcile] meterMode disabled — skipping.');
3436
process.exit(0);
3537
}
3638

37-
await applyJitter(getCronJitterMaxMs());
39+
const startMs = Date.now();
40+
logger.info('[cron.reconcile] start');
3841

3942
try {
43+
await applyJitter(getCronJitterMaxMs());
4044
await mongooseService.loadModels();
4145
await mongooseService.connect();
4246

@@ -47,12 +51,10 @@ try {
4751
]);
4852

4953
const result = await BillingReconcileService.runReconciliation();
50-
console.log(
51-
`[billing.reconcile] done — checked: ${result.checked}, divergences: ${result.divergences}, errors: ${result.errors}`,
52-
);
54+
logger.info('[cron.reconcile] complete', { checked: result.checked, divergences: result.divergences, errors: result.errors, durationMs: Date.now() - startMs });
5355
process.exitCode = result.errors > 0 ? 1 : 0;
5456
} catch (err) {
55-
console.error('[billing.reconcile] fatal:', err);
57+
logger.error('[cron.reconcile] failed', { err: err?.message, stack: err?.stack });
5658
process.exitCode = 1;
5759
} finally {
5860
await mongooseService.disconnect?.();

0 commit comments

Comments
 (0)