Commit 1b9f7a6

fix(billing): retry refund correlation backfill 3x + dead-letter (#3690)
* feat(billing): retryWithBackoff helper + BillingFailedBackfill dead-letter model

* fix(billing): retry refund correlation backfill 3x + dead-letter

  PaymentIntent metadata backfill on checkout.session.completed is now attempted up to 3 times with exponential backoff (200ms, then 400ms between attempts) before escalating. On exhaustion, a persistent BillingFailedBackfill dead-letter record is written so operators can manually patch missing PI metadata and preserve refund correlation.

  Strategy chosen: Option B — new BillingFailedBackfill model (no prior mechanism for this specific failure path). The existing ProcessedStripeEvent dead-letter covers Stripe event delivery failures; this addresses a different failure class (Stripe API write failure on a side-effect within a successfully-delivered event handler).

  BillingFailedBackfill model auto-registers via assets.js glob: modules/*/models/*.mongoose.js — no billing.init.js change required.

  Closes audit P1 (2026-05-21) — silent warn-log path on refund correlation.

* docs(billing): runbook for refund correlation backfill failure

* fix(billing): address code-quality findings on refund correlation retry

  - IMPORTANT: JSDoc delay sequence corrected (200 → 400, not 200 → 400 → 800 for default attempts=3, baseMs=200 — only 2 delays fire)
  - IMPORTANT: BillingFailedBackfill access moved to billing.failedBackfill.repository.js (Option B) — lazy mongoose.model() removed from service layer; test mocks updated to stub the repository boundary instead of mongoose.model()
  - IMPORTANT: sparse index → partial index on resolvedAt (sparse no-op because field has default: null, all docs include it — partial targets only unresolved docs)
  - MINOR: RUNBOOK section 6 symptom includes [billing.webhook] prefix

* fix(billing): validate retryWithBackoff options + document makeSession

  - guard attempts (positive integer) + baseMs (non-negative finite) with TypeError — prevents entering loop with attempts<=0 (would throw undefined)
  - JSDoc header on makeSession test helper

  Addresses CodeRabbit re-review on PR #3690.

* fix(billing): add err.stack + consistent stripeSessionId key to backfill logs

  Fixes two logger.error calls (lines 332 and 344 of billing.webhook.service.js):
  - rename sessionId key to stripeSessionId for consistency with the rest of the file
  - add stack: err?.stack to both meta objects, matching surrounding logger.error style

  Also updates billing.refund-correlation unit test header comment to reference BillingFailedBackfillRepository.record instead of the stale BillingFailedBackfill.create.
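The delay correction described in the commit message (two waits for three attempts) can be checked with a few lines of arithmetic. This is a minimal sketch: `backoffDelays` is a hypothetical helper that is not part of the commit, and it only mirrors the `baseMs * 2 ** i` expression used by `retryWithBackoff`.

```javascript
// Hypothetical helper (not in the commit): lists the waits that separate
// `attempts` tries, using the expression baseMs * 2 ** i from retryWithBackoff.
function backoffDelays(attempts = 3, baseMs = 200) {
  const delays = [];
  // A wait follows every failed attempt except the final one.
  for (let i = 0; i < attempts - 1; i++) {
    delays.push(baseMs * 2 ** i);
  }
  return delays;
}

console.log(backoffDelays());       // [ 200, 400 ]
console.log(backoffDelays(4, 200)); // [ 200, 400, 800 ]
```

With the defaults, an 800ms wait would only appear if a fourth attempt were configured, so the waiting time added by the defaults is at most 600ms.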
1 parent d397764 commit 1b9f7a6

6 files changed

Lines changed: 441 additions & 14 deletions

File tree

modules/billing/RUNBOOKS.md

Lines changed: 44 additions & 0 deletions
@@ -218,3 +218,47 @@ kubectl exec -n pierreb-projects mongo-0 -- mongosh \
 ```
 
 **Prevention:** Lock TTLs are intentionally conservative. If you see frequent stuck-lock incidents, investigate cron duration (slow query? tenant scale?) rather than lower the TTL — a TTL too short defeats the mutual-exclusion guarantee.
+
+---
+
+## 7 — Refund Correlation Backfill Failure
+
+**Symptom:** ERROR log `[billing.webhook] PI metadata backfill failed after retries — refund correlation at risk`.
+A later refund webhook may additionally log `refund unresolved — no stripeSessionId on charge metadata`.
+
+**Cause:** `stripe.paymentIntents.update` failed on all 3 attempts during
+`checkout.session.completed` handling. Likely cause: transient Stripe API outage.
+
+**Triage:**
+
+1. Query unresolved entries:
+
+   ```javascript
+   db.billing_failed_backfills.find({ resolvedAt: null })
+   ```
+
+   Each document contains `paymentIntentId`, `stripeSessionId`, `error`, and `failedAt`.
+
+2. For each entry, manually patch the PI via Stripe CLI or Dashboard:
+
+   ```bash
+   stripe payment_intents update pi_xxx \
+     -d "metadata[stripeSessionId]=cs_xxx"
+   ```
+
+   Verify in Stripe Dashboard → Payments → PaymentIntent → Metadata.
+
+3. Mark the record resolved to close the loop:
+
+   ```javascript
+   db.billing_failed_backfills.updateOne(
+     { _id: ObjectId('...') },
+     { $set: { resolvedAt: new Date(), resolvedBy: 'admin' } }
+   )
+   ```
+
+4. Confirm refund correlation: if a refund was already processed while the PI metadata was missing,
+   check for `billing.refund.unresolved` alerts and follow **Runbook #2** (Dead-Letter Investigation)
+   to replay the refund event after the PI metadata is patched.
+
+**Escalate if:** frequency exceeds 1 per week → promote to cron-based auto-retry.
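
The escalation note points toward automating triage steps 1 to 3. A minimal sketch of what such a job might look like follows, assuming an injected Stripe client and an object that behaves like a MongoDB Node driver collection. `reconcileBackfills`, `makeFakeCollection`, and `makeFakeStripe` are hypothetical names that do not exist in this commit; only `stripe.paymentIntents.update` and the document fields are taken from the diff.

```javascript
// Hypothetical reconciliation job mirroring triage steps 1-3 of the runbook.
// Assumptions: `stripe` exposes paymentIntents.update(id, params) and
// `collection` behaves like a MongoDB Node driver collection.
async function reconcileBackfills({ stripe, collection, resolvedBy = 'cron' }) {
  // Step 1: load unresolved dead-letter entries.
  const pending = await collection.find({ resolvedAt: null }).toArray();
  const outcome = { resolved: [], stillFailing: [] };

  for (const entry of pending) {
    try {
      // Step 2: patch the missing PaymentIntent metadata.
      await stripe.paymentIntents.update(entry.paymentIntentId, {
        metadata: { stripeSessionId: entry.stripeSessionId },
      });
      // Step 3: mark the record resolved only after the patch succeeded.
      await collection.updateOne(
        { _id: entry._id },
        { $set: { resolvedAt: new Date(), resolvedBy } },
      );
      outcome.resolved.push(entry.paymentIntentId);
    } catch (err) {
      // Leave the entry unresolved so a later run or an operator can retry it.
      outcome.stillFailing.push(entry.paymentIntentId);
    }
  }
  return outcome;
}

// In-memory stand-ins used only to exercise the sketch.
function makeFakeCollection(docs) {
  return {
    find: (filter) => ({
      toArray: async () => docs.filter((d) => d.resolvedAt === filter.resolvedAt),
    }),
    updateOne: async (filter, update) => {
      const doc = docs.find((d) => d._id === filter._id);
      Object.assign(doc, update.$set);
    },
  };
}

function makeFakeStripe(failingIds = []) {
  return {
    paymentIntents: {
      update: async (id, params) => {
        if (failingIds.includes(id)) throw new Error('Stripe unavailable');
        return { id, metadata: params.metadata };
      },
    },
  };
}
```

Marking the record resolved only after the Stripe call succeeds keeps a failed patch visible in the unresolved query, which is the same guarantee the manual procedure relies on.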
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+/**
+ * Retry an async operation with exponential backoff.
+ *
+ * For default opts (attempts=3, baseMs=200), delays are 200ms then 400ms
+ * (no delay after the final attempt). General formula: baseMs * 2^i for
+ * each non-final attempt i.
+ *
+ * Returns the result of the first successful call, or throws the last
+ * error after all attempts are exhausted.
+ *
+ * @param {() => Promise<T>} fn - Async function to attempt.
+ * @param {object} [opts]
+ * @param {number} [opts.attempts=3] - Maximum number of attempts (including the first call).
+ * @param {number} [opts.baseMs=200] - Base delay in ms for the first retry.
+ * @returns {Promise<T>}
+ */
+export async function retryWithBackoff(fn, { attempts = 3, baseMs = 200 } = {}) {
+  if (!Number.isInteger(attempts) || attempts < 1) {
+    throw new TypeError(`retryWithBackoff: attempts must be a positive integer, received ${attempts}`);
+  }
+  if (!Number.isFinite(baseMs) || baseMs < 0) {
+    throw new TypeError(`retryWithBackoff: baseMs must be a non-negative finite number, received ${baseMs}`);
+  }
+  let lastErr;
+  for (let i = 0; i < attempts; i++) {
+    try {
+      return await fn();
+    } catch (err) {
+      lastErr = err;
+      if (i < attempts - 1) {
+        await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** i));
+      }
+    }
+  }
+  throw lastErr;
+}
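
A usage sketch may help show the helper's contract: success after transient failures, rethrow of the last error on exhaustion, and rejection of invalid options. The block repeats the function from the diff without `export` so it runs standalone; `makeFlaky` and `demo` are hypothetical names used only for illustration, and `baseMs: 1` is chosen to keep the run short.

```javascript
// Copy of retryWithBackoff from the diff above, without `export`, so the sketch runs standalone.
async function retryWithBackoff(fn, { attempts = 3, baseMs = 200 } = {}) {
  if (!Number.isInteger(attempts) || attempts < 1) {
    throw new TypeError(`retryWithBackoff: attempts must be a positive integer, received ${attempts}`);
  }
  if (!Number.isFinite(baseMs) || baseMs < 0) {
    throw new TypeError(`retryWithBackoff: baseMs must be a non-negative finite number, received ${baseMs}`);
  }
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}

// Hypothetical test double: rejects `failures` times, then resolves with 'ok'.
function makeFlaky(failures) {
  const state = { calls: 0 };
  const fn = async () => {
    state.calls += 1;
    if (state.calls <= failures) throw new Error(`transient failure #${state.calls}`);
    return 'ok';
  };
  return { fn, state };
}

async function demo() {
  // Two failures followed by a success: resolves on the third attempt.
  const recovering = makeFlaky(2);
  const result = await retryWithBackoff(recovering.fn, { attempts: 3, baseMs: 1 });

  // Every attempt fails: the error from the last attempt is rethrown.
  const broken = makeFlaky(Infinity);
  const lastError = await retryWithBackoff(broken.fn, { attempts: 3, baseMs: 1 }).catch((err) => err);

  // Invalid options reject before fn is ever called.
  const untouched = makeFlaky(0);
  const optionError = await retryWithBackoff(untouched.fn, { attempts: 0 }).catch((err) => err);

  return {
    result,
    callsUntilSuccess: recovering.state.calls,
    lastErrorMessage: lastError.message,
    optionErrorIsTypeError: optionError instanceof TypeError,
    callsWithBadOptions: untouched.state.calls,
  };
}

demo().then((summary) => console.log(summary));
```

Because the function is `async`, the option guards surface as a rejected promise rather than a synchronous throw, so callers can handle them with the same `catch` that handles operation failures.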
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+/**
+ * Module dependencies
+ */
+import mongoose from 'mongoose';
+
+const Schema = mongoose.Schema;
+
+/**
+ * BillingFailedBackfill Data Model Mongoose
+ *
+ * Dead-letter store for PaymentIntent metadata backfill failures.
+ * Records are written when the refund-correlation backfill (stripe.paymentIntents.update
+ * in handleCheckoutPaymentCompleted) fails after all retry attempts.
+ *
+ * Kept permanently so operators can manually reconcile unresolved entries.
+ * Never auto-expired — resolvedAt is set by the operator after manual fix.
+ */
+const BillingFailedBackfillMongoose = new Schema(
+  {
+    paymentIntentId: {
+      type: String,
+      required: true,
+      index: true,
+    },
+    stripeSessionId: {
+      type: String,
+      required: true,
+    },
+    /**
+     * Serialised error message from the last failed attempt.
+     */
+    error: {
+      type: String,
+      default: null,
+    },
+    /**
+     * Timestamp of the first failure (when the record was created).
+     */
+    failedAt: {
+      type: Date,
+      required: true,
+      default: () => new Date(),
+    },
+    /**
+     * Timestamp set by the operator after the PI metadata has been manually patched
+     * and the refund correlation risk resolved.
+     */
+    resolvedAt: {
+      type: Date,
+      default: null,
+    },
+    /**
+     * Operator tag explaining how the record was resolved.
+     * E.g. 'admin', 'cron'.
+     */
+    resolvedBy: {
+      type: String,
+      default: null,
+    },
+  },
+  {
+    collection: 'billing_failed_backfills',
+    timestamps: false,
+  },
+);
+
+// Partial index — only unresolved documents are indexed, so this stays small
+// even after the collection accumulates many resolved entries.
+// (Sparse would be a no-op here: resolvedAt has default: null, so every document
+// has the field present — sparse skips only docs where the field is absent.)
+BillingFailedBackfillMongoose.index(
+  { resolvedAt: 1 },
+  { partialFilterExpression: { resolvedAt: null } },
+);
+
+/**
+ * Returns the hex string representation of the document ObjectId.
+ * @returns {string} Hex string of the ObjectId.
+ */
+function addID() {
+  return this._id.toHexString();
+}
+
+/**
+ * Model configuration
+ */
+BillingFailedBackfillMongoose.virtual('id').get(addID);
+BillingFailedBackfillMongoose.set('toJSON', {
+  virtuals: true,
+});
+
+export const BillingFailedBackfill =
+  mongoose.models.BillingFailedBackfill ??
+  mongoose.model('BillingFailedBackfill', BillingFailedBackfillMongoose);
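
The schema comment about sparse versus partial indexes can be illustrated with a small in-memory simulation. This is only a model of the selection rules as I understand them from the MongoDB documentation, not a call into MongoDB; `inSparseIndex` and `inPartialIndex` are hypothetical names, and the sample documents are made up.

```javascript
// Sparse rule: a document is indexed whenever the field is present,
// even if its value is null.
const inSparseIndex = (doc) => 'resolvedAt' in doc;

// Partial rule with filter { resolvedAt: null }: a document is indexed only
// when it matches the filter. Equality with null matches a null value or a
// missing field.
const inPartialIndex = (doc) => doc.resolvedAt === null || doc.resolvedAt === undefined;

// Sample documents as the schema would store them (default: null on resolvedAt).
const docs = [
  { _id: 1, resolvedAt: null },                   // unresolved
  { _id: 2, resolvedAt: new Date('2026-01-01') }, // resolved by an operator
  { _id: 3, resolvedAt: null },                   // unresolved
];

const sparseIds = docs.filter(inSparseIndex).map((d) => d._id);
const partialIds = docs.filter(inPartialIndex).map((d) => d._id);

console.log(sparseIds);  // [ 1, 2, 3 ]
console.log(partialIds); // [ 1, 3 ]
```

Under these rules the sparse variant would hold every document, while the partial variant shrinks as entries are resolved, which is the behaviour the schema comment describes.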
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+/**
+ * Module dependencies
+ */
+import mongoose from 'mongoose';
+
+/**
+ * @function BillingFailedBackfill
+ * @description Lazily resolves the BillingFailedBackfill Mongoose model.
+ * Deferred to keep unit tests importable before model registration.
+ * @returns {import('mongoose').Model} The registered BillingFailedBackfill model.
+ */
+// biome-ignore lint/correctness/useQwikValidLexicalScope: false positive — Node.js repository, not Qwik
+const BillingFailedBackfill = () => mongoose.model('BillingFailedBackfill');
+
+/**
+ * @function record
+ * @description Write a dead-letter entry for a PaymentIntent metadata backfill failure.
+ * Called by billing.webhook.service after all retry attempts are exhausted.
+ * @param {object} opts
+ * @param {string} opts.paymentIntentId - Stripe PaymentIntent id (pi_*).
+ * @param {string} opts.stripeSessionId - Stripe checkout session id (cs_*).
+ * @param {string|null} [opts.error] - Serialised error message from the last failed attempt.
+ * @param {Date} [opts.failedAt] - Timestamp of the failure (defaults to now).
+ * @returns {Promise<import('mongoose').Document>}
+ */
+// biome-ignore lint/correctness/useQwikValidLexicalScope: false positive — Node.js repository, not Qwik
+const record = ({ paymentIntentId, stripeSessionId, error = null, failedAt = new Date() }) =>
+  BillingFailedBackfill().create({ paymentIntentId, stripeSessionId, error, failedAt });
+
+export default { record };

modules/billing/services/billing.webhook.service.js

Lines changed: 31 additions & 14 deletions
@@ -9,10 +9,12 @@ import logger from '../../../lib/services/logger.js';
 import SubscriptionRepository from '../repositories/billing.subscription.repository.js';
 import ProcessedStripeEventRepository from '../repositories/billing.processedStripeEvent.repository.js';
 import OrganizationRepository from '../../organizations/repositories/organizations.repository.js';
+import BillingFailedBackfillRepository from '../repositories/billing.failedBackfill.repository.js';
 import BillingExtraService from './billing.extra.service.js';
 import BillingResetService from './billing.reset.service.js';
 import billingEvents from '../lib/events.js';
 import { SENTINEL_PENDING } from '../lib/billing.constants.js';
+import { retryWithBackoff } from '../lib/billing.retry.js';
 
 /**
  * Treats a stripeSessionId as "unresolved" when absent, empty, or still the
@@ -311,21 +313,36 @@ const handleCheckoutPaymentCompleted = async (session) => {
     const stripe = getStripe();
     if (stripe) {
       try {
-        await stripe.paymentIntents.update(paymentIntentId, {
-          metadata: {
-            organizationId,
-            packId,
-            kind: 'extras',
-            stripeSessionId, // real cs_* ID (replaces SENTINEL_PENDING)
-          },
-        });
+        await retryWithBackoff(
+          () =>
+            stripe.paymentIntents.update(paymentIntentId, {
+              metadata: {
+                organizationId,
+                packId,
+                kind: 'extras',
+                stripeSessionId, // real cs_* ID (replaces SENTINEL_PENDING)
+              },
+            }),
+          { attempts: 3, baseMs: 200 },
+        );
       } catch (err) {
-        // Log but don't fail — refund correlation may use the backfill resolver path
-        logger.warn('[billing.webhook] PaymentIntent metadata update failed', {
-          paymentIntentId,
-          error: err?.message ?? String(err),
-          stack: err?.stack,
-        });
+        logger.error(
+          '[billing.webhook] PI metadata backfill failed after retries — refund correlation at risk',
+          { paymentIntentId, stripeSessionId, error: err?.message ?? String(err), stack: err?.stack },
+        );
+        try {
+          await BillingFailedBackfillRepository.record({
+            paymentIntentId,
+            stripeSessionId,
+            error: err?.message ?? String(err),
+            failedAt: new Date(),
+          });
+        } catch (dlqErr) {
+          logger.error(
+            '[billing.webhook] dead-letter write failed — manual reconciliation required',
+            { paymentIntentId, stripeSessionId, error: dlqErr?.message ?? String(dlqErr), stack: dlqErr?.stack },
+          );
+        }
       }
     }
   }
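
The control flow of the new catch block can be exercised in isolation by injecting its collaborators. The sketch below is an assumption about how one might do that, not code from the commit: `backfillWithDeadLetter`, its return values, `makeRecordingLogger`, and `demo` are hypothetical, `update` stands for the already-retried `stripe.paymentIntents.update` call, `record` stands for `BillingFailedBackfillRepository.record`, and the log messages are shortened.

```javascript
// Hypothetical extraction of the new try/catch so it can run without Stripe or MongoDB.
// Return values are illustrative only; the real handler returns nothing here.
async function backfillWithDeadLetter({ update, record, logger, paymentIntentId, stripeSessionId }) {
  try {
    await update();
    return 'updated';
  } catch (err) {
    const error = err?.message ?? String(err);
    logger.error('[billing.webhook] PI metadata backfill failed after retries', {
      paymentIntentId, stripeSessionId, error, stack: err?.stack,
    });
    try {
      // Persist a dead-letter entry so an operator can reconcile later.
      await record({ paymentIntentId, stripeSessionId, error, failedAt: new Date() });
      return 'dead-lettered';
    } catch (dlqErr) {
      // Last resort: the log line is the only trace of the failure.
      logger.error('[billing.webhook] dead-letter write failed', {
        paymentIntentId, stripeSessionId, error: dlqErr?.message ?? String(dlqErr), stack: dlqErr?.stack,
      });
      return 'unrecorded';
    }
  }
}

// Minimal logger double that remembers every error call.
function makeRecordingLogger() {
  const errors = [];
  return { errors, error: (message, meta) => errors.push({ message, meta }) };
}

async function demo() {
  const ids = { paymentIntentId: 'pi_test', stripeSessionId: 'cs_test' };
  const fail = async () => { throw new Error('Stripe unavailable'); };
  const written = [];
  const logger = makeRecordingLogger();
  const save = async (entry) => { written.push(entry); };

  const ok = await backfillWithDeadLetter({ ...ids, logger, update: async () => {}, record: save });
  const dlq = await backfillWithDeadLetter({ ...ids, logger, update: fail, record: save });
  const lost = await backfillWithDeadLetter({
    ...ids, logger, update: fail, record: async () => { throw new Error('Mongo down'); },
  });
  return { ok, dlq, lost, written, errorLogs: logger.errors.length };
}

demo().then((summary) => console.log(summary.ok, summary.dlq, summary.lost));
```

In all three outcomes the function resolves instead of rejecting, which matches the diff: neither catch block rethrows, so a backfill failure does not fail the webhook handler itself.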
