197 changes: 197 additions & 0 deletions docs/designs/refund-modeling.md
@@ -0,0 +1,197 @@
# Refund modeling — net revenue for finance

**Author:** Data engineering (draft)
**Status:** Proposed
**Last updated:** 2026-06-08
**Related:** DATA-456, DATA-145, `order_fact`, `order_line_fact`

## Context

Finance needs **net revenue** while `order_fact.revenue` remains **gross**
(sum of `quantity × unit_price` across line items, no tax, shipping, discounts,
or refunds — see `orders_dw.md`).

Three refund sources exist in `raw/` but are not yet loaded into the DW:


| Source | Grain | Line detail | Key fields |
| ---------------------- | ----------------------------- | ----------- | ------------------------------------------------------------------------- |
| `refunds_shopify` | Refund event × line | Yes | `line_item_id`, `qty_refunded`, `amount_in_cents`, `refunded_at` |
| `refunds_stripe` | Refund event × order × tender | No | `tender_type` (`card`, `store_credit`), `amount_in_cents`, `processed_at` |
| `refunds_internal_pos` | Refund event × order | No | `amount_in_cents`, `refunded_at` |


The Q3 2024 orders redesign (`2024-Q3-orders-redesign.md`) intentionally scoped
refunds out of `order_fact` and called for a separate `refund_fact` so refund
logic does not get tangled with order status and gross revenue reasoning again.

---

## Observations

### Merchant payment and refund sources overlap

A single merchant can operate across more than one payment/refund channel.
In practice, a merchant may use **internal POS** alongside **Shopify** or
**Stripe**, so a merchant cannot be assumed to map to exactly one source.

Refund events therefore arrive at different grains and from different systems
for the same underlying business activity. Naively summing all three sources
per order will over-count refunds where the same refund is recorded in both an
operational system (Shopify) and a payment rail (Stripe).

Example from sample data — order `O005064`:

- Shopify records a line-level refund (`SHF000004`, line `L0009590`, 178,905 cents).
- Stripe records the same refund split across tenders (`STR000005` card 89,452 +
`STR000006` store_credit 89,453 = 178,905 cents).

These are one refund, not two.
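
To make the over-count concrete, here is a minimal Python sketch using the amounts quoted above for order `O005064`. The tuple layout and the function name are illustrative assumptions, not part of the pipeline.

```python
# Refund rows for order O005064, amounts in cents (values from the sample data).
# The (source, refund_id, amount_in_cents) layout is an illustrative assumption.
rows = [
    ("shopify", "SHF000004", 178_905),
    ("stripe", "STR000005", 89_452),   # card tender
    ("stripe", "STR000006", 89_453),   # store_credit tender
]


def total_by_source(rows):
    """Sum refund cents per source system."""
    totals = {}
    for source, _refund_id, cents in rows:
        totals[source] = totals.get(source, 0) + cents
    return totals


totals = total_by_source(rows)
naive_total = sum(totals.values())    # adds Shopify and Stripe together
reconciled_total = totals["shopify"]  # one refund, counted once

print(totals)            # {'shopify': 178905, 'stripe': 178905}
print(naive_total)       # 357810, double the real refund
print(reconciled_total)  # 178905
```

Both sources agree on 178,905 cents, so adding them reports twice the actual refund.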

### Stripe tender splits are one refund, not two

Stripe refunds can appear as multiple rows per order when the refund is paid
out through more than one tender (e.g. `card` + `store_credit`). When rolling
up to order-level refund totals, **sum Stripe rows per order** — but interpret
card + store_credit on the same refund episode as components of a **single**
refund, not independent refunds to add on top of each other.
Comment on lines +53 to +57

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Define deterministic cross-source refund episode matching.

The spec references the “same refund episode” but doesn’t define how Shopify and Stripe/POS events are correlated. Without an explicit matching key/window, reconciliation can drift and either double-count or mis-flag mismatches.

Proposed spec patch
+#### Refund episode correlation contract
+
+To reconcile cross-source refunds deterministically, define `refund_episode_key` using:
+1. Preferred key: shared external reference if present (gateway charge/payment intent + refund reference).
+2. Fallback key: (`order_id`, normalized refund amount, refunded timestamp bucket).
+3. Timestamp tolerance: ±N minutes (set and document N).
+4. Tie-breaker: nearest timestamp, then stable lexical order of source refund IDs.
+
+Store:
+- `refund_episode_key`
+- `episode_match_method` (`direct_id`, `fallback_amount_time`)
+- `episode_match_confidence` (`high`, `medium`, `low`)

Also applies to: 117-118

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/designs/refund-modeling.md` around lines 53 - 57, Add a deterministic
cross-source refund-episode matching rule to the "Stripe refunds" / "refund
episode" section: explicitly state the matching key and window (e.g., prefer a
canonical refund_id when available, otherwise match by order_id + normalized
refund_reference from source + event_timestamp within a configurable window such
as ±2 hours) and describe tie-breakers (prefer source refund_id > payment_intent
> combined tender rows) so Shopify and Stripe/POS rows are reconciled
deterministically when rolling up order-level refund totals; call out the fields
to use from each source (e.g., stripe.refund_id, stripe.payment_intent,
shopify.refund_id/refund_event_id) and add a short note on handling multi-tender
Stripe rows (treat same refund_id/payment_intent as one episode and sum
tenders).


The same care applies to POS: an order may carry a single POS refund row for
the full order-level amount, and that row must not be added on top of a Shopify
or Stripe record of the same refund.
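
One possible way to treat Stripe tender rows as a single refund is to group rows of the same order that were processed close together in time. The design does not define the episode key, so the sketch below is only an assumption: the five-minute window, the row layout, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical tolerance; the design does not fix a value.
EPISODE_WINDOW = timedelta(minutes=5)


def group_into_episodes(rows, window=EPISODE_WINDOW):
    """Group the tender rows of one order into refund episodes.

    rows: list of (refund_id, tender_type, amount_in_cents, processed_at).
    A row joins the current episode when it falls within `window` of that
    episode's first row; otherwise it starts a new episode.
    """
    episodes = []
    for row in sorted(rows, key=lambda r: (r[3], r[0])):
        if episodes and row[3] - episodes[-1][0][3] <= window:
            episodes[-1].append(row)
        else:
            episodes.append([row])
    return episodes


# Amounts are from order O005064; the timestamps are invented for illustration.
stripe_rows = [
    ("STR000005", "card", 89_452, datetime(2024, 1, 1, 12, 0, 0)),
    ("STR000006", "store_credit", 89_453, datetime(2024, 1, 1, 12, 0, 2)),
]

episodes = group_into_episodes(stripe_rows)
episode_totals = [sum(r[2] for r in ep) for ep in episodes]
print(len(episodes), episode_totals)  # 1 [178905]
```

A shared refund reference, where one exists, would be a more reliable key than time proximity; the window is only a fallback.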

### Source grains differ

- **Shopify** is the only source with native **line-level** detail
(`line_item_id`, `qty_refunded`). Partial line refunds are possible
(`qty_refunded` < line `quantity`).
- **Stripe** and **internal POS** are **order-level** only — no `line_item_id`.
Per-line refund amounts on `order_line_fact` require an allocation rule for
these sources.

### Gross revenue contract is already fixed

`order_fact.revenue` and `order_line_fact.line_revenue` should stay gross.
Net figures belong in new columns (`refund_amount`, `net_revenue`, etc.) so
existing consumers (e.g. `daily_revenue`) do not change behavior silently.

---

## Proposed modeling approach

### 1. Staging and unified refund events (Recommended)

Ingest each raw source into its own staging model, then union into a single
refund-events layer at **one row per `refund_id` + `source_system`**.

Normalize to a common shape:

- `refund_id`, `source_system`, `order_id`
- `line_item_id` (nullable)
- `qty_refunded` (nullable)
- `refund_amount` (dollars; convert from cents)
- `refunded_at` (map Stripe `processed_at` here)
- `tender_type` (Stripe only)
- `refund_grain` (`line` vs `order`)

Do not deduplicate in staging; keep every source event for audit and
reconciliation. Materialize as **`refund_fact`** at refund-event grain.
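
The normalization could look roughly like the Python sketch below. It assumes each raw row arrives as a dict and that the raw sources expose `refund_id` and `order_id` under those names (the source table lists only the other key fields); `normalize` and `cents_to_dollars` are hypothetical helper names.

```python
from decimal import Decimal


def cents_to_dollars(cents):
    """Convert integer cents to a Decimal dollar amount without float error."""
    return Decimal(cents) / Decimal(100)


def normalize(source_system, raw):
    """Map one raw refund row (a dict) to the common refund-event shape."""
    line_item_id = raw.get("line_item_id")  # present for Shopify only
    return {
        "refund_id": raw["refund_id"],      # assumed raw column name
        "source_system": source_system,
        "order_id": raw["order_id"],        # assumed raw column name
        "line_item_id": line_item_id,
        "qty_refunded": raw.get("qty_refunded"),
        "refund_amount": cents_to_dollars(raw["amount_in_cents"]),
        # Stripe names its timestamp processed_at; the others use refunded_at.
        "refunded_at": raw.get("refunded_at") or raw.get("processed_at"),
        "tender_type": raw.get("tender_type"),  # Stripe only
        "refund_grain": "line" if line_item_id is not None else "order",
    }


event = normalize("stripe", {
    "refund_id": "STR000005",
    "order_id": "O005064",
    "tender_type": "card",
    "amount_in_cents": 89452,
    "processed_at": "2024-01-01T12:00:00",  # invented timestamp
})
print(event["refund_amount"], event["refund_grain"])  # 894.52 order
```

Using `Decimal` rather than `float` keeps the cents-to-dollars conversion exact, which matters once amounts are summed and compared across sources.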



### 2. Order-level reconciliation

Before rolling up to `order_fact`, derive an **`order_refund_summary`**
(one row per `order_id`) with a reconciled `gross_refund_amount`.

Suggested source-of-truth rules by merchant/refund pattern:


| Pattern | Authoritative source for refund total | Notes |
| ----------------- | --------------------------------------------------------------- | ------------------------------------------ |
| Shopify-only | Sum Shopify events per order | Line detail available natively |
| Stripe-only | Sum Stripe tenders per order | Card + store_credit = one refund |
| Internal POS-only | Sum POS events per order | Order-level only |
| Shopify + Stripe | Shopify for allocation total; Stripe for payment reconciliation | Do not add Stripe totals on top of Shopify |


Flag orders where cross-source totals do not reconcile (e.g. Shopify amount ≠
sum of Stripe tenders for the same refund episode).
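
As a rough illustration of the rules in the table, the Python sketch below picks the authoritative total for one order and raises the mismatch flag. The function name, the source labels (`shopify`, `stripe`, `internal_pos`), and the choice to raise on uncovered patterns are assumptions made for the example.

```python
def reconcile_order(totals_by_source):
    """Pick the authoritative refund total for one order.

    totals_by_source: dict of source_system -> summed refund cents.
    Returns (refund_cents, authoritative_source, mismatch_flag).
    """
    sources = set(totals_by_source)
    if sources == {"shopify", "stripe"}:
        # Shopify drives the total; Stripe serves only as a payment check.
        mismatch = totals_by_source["shopify"] != totals_by_source["stripe"]
        return totals_by_source["shopify"], "shopify", mismatch
    if len(sources) == 1:
        (source,) = sources
        return totals_by_source[source], source, False
    # Patterns the table does not cover (for example POS plus Shopify).
    raise ValueError(f"no reconciliation rule for sources: {sorted(sources)}")


print(reconcile_order({"shopify": 178905, "stripe": 178905}))
# (178905, 'shopify', False)
```

Raising on uncovered combinations keeps an unplanned pattern from silently falling back to a sum across sources.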

### 3. Line-level allocation for `order_line_fact`


| Refund origin | Allocation |
| -------------------------------- | -------------------------------------------------------------------------------------- |
| Shopify line refunds | Direct — attach `refund_amount` and `qty_refunded` to matching `line_item_id` |
| Stripe / POS order-level refunds | Allocate order refund to lines via a documented rule (e.g. pro-rata by `line_revenue`) |


For Shopify + Stripe orders, allocate from **Shopify line events only**; do not
re-allocate from Stripe payment totals.

Each line ends up with:

- `line_refund_amount`
- `net_line_revenue` = `line_revenue − line_refund_amount`
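
A pro-rata split needs a rounding rule so that the line amounts still sum to the order refund. The Python sketch below works in integer cents and gives leftover cents to the lines with the largest remainders. This is one possible rule, not one the design has chosen, and the line IDs and amounts in the example are invented.

```python
def allocate_pro_rata(refund_cents, line_revenue_cents):
    """Split an order-level refund across lines in proportion to line revenue.

    line_revenue_cents: dict of line_item_id -> gross line revenue in cents.
    Each line first gets the floor of its share; leftover cents then go to the
    lines with the largest remainders, so the parts sum to the refund exactly.
    """
    total = sum(line_revenue_cents.values())
    if total <= 0:
        raise ValueError("order has no revenue to allocate against")
    shares = {}
    remainders = []
    for line_id, revenue in line_revenue_cents.items():
        shares[line_id], rem = divmod(refund_cents * revenue, total)
        remainders.append((-rem, line_id))
    leftover = refund_cents - sum(shares.values())
    for _neg_rem, line_id in sorted(remainders)[:leftover]:
        shares[line_id] += 1
    return shares


# Hypothetical order: three lines of equal revenue and a 10.00 refund.
lines = {"LINE_A": 3000, "LINE_B": 3000, "LINE_C": 3000}
allocated = allocate_pro_rata(1000, lines)
net = {line_id: lines[line_id] - allocated[line_id] for line_id in lines}
print(allocated)  # {'LINE_A': 334, 'LINE_B': 333, 'LINE_C': 333}
print(net)        # {'LINE_A': 2666, 'LINE_B': 2667, 'LINE_C': 2667}
```

Ties on the remainder are broken by line ID, which keeps the allocation deterministic between runs.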

### 4. Surface on existing facts

**`order_fact`** (grain unchanged: one row per `order_id`):


| Column | Definition |
| ----------------------------------------------- | -------------------------------------- |
| `refund_amount` | Reconciled total refunds for the order |
| `net_revenue` | `revenue − refund_amount` |
| optionally `first_refunded_at`, `refund_source` | Audit / reporting |


**`order_line_fact`** (grain unchanged: one row per `line_item_id`):


| Column | Definition |
| -------------------- | -------------------------------------------- |
| `line_refund_amount` | Direct (Shopify) or allocated (Stripe / POS) |
| `net_line_revenue` | `line_revenue − line_refund_amount` |


### 5. Data flow

```
raw refunds (shopify, stripe, internal_pos)
→ staging per source
→ refund_fact (event grain, all sources preserved)
→ order_refund_summary (reconciled order totals)
→ line refund allocations
→ order_fact (+ refund_amount, net_revenue)
→ order_line_fact (+ line_refund_amount, net_line_revenue)
```
Comment on lines +160 to +168

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add fenced-code language for markdown lint compliance.

The block starting at Line [160] is missing a language identifier (MD040). Use text (or another appropriate language) on the fence.

Suggested edit
-```
+```text
 raw refunds (shopify, stripe, internal_pos)
   → staging per source
   → refund_fact (event grain, all sources preserved)
   → order_refund_summary (reconciled order totals)
   → line refund allocations
   → order_fact (+ refund_amount, net_revenue)
   → order_line_fact (+ line_refund_amount, net_line_revenue)

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 160-160: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/designs/refund-modeling.md` around lines 160 - 168, The fenced code
block in refund-modeling.md is missing a language identifier (MD040); update the
triple-backtick fence to include a language such as "text" (e.g., change ``` to
```text) for the block that lists the refund pipeline (the block containing "raw
refunds (shopify, stripe, internal_pos)" through "order_line_fact (+
line_refund_amount, net_line_revenue)"), ensuring Markdown lint compliance.

Source: Linters/SAST tools


---

## Guardrails

1. **No double-counting**: apply the merchant-pattern reconciliation rules from
   section 2 and flag orders whose cross-source totals disagree.
2. **Test orders**: exclude them consistently, using the same rule as `daily_revenue`.
3. **Partial refunds**: honor Shopify `qty_refunded`, and validate each refund amount
   against qty × unit price.
4. **Reconciliation tests**: `sum(line_refund_amount)` per order ≈ `order_fact.refund_amount`,
   and `sum(net_line_revenue)` per order ≈ `order_fact.net_revenue`.
5. **Incremental refresh**: key refund models on `refunded_at` / `processed_at`, and
   update the facts when refunds arrive after the original order load.
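
The reconciliation test in guardrail 4 might be expressed as in the Python sketch below. The one-cent tolerance and the function name are assumptions; the design only says the two sides should be approximately equal.

```python
from decimal import Decimal


def check_order_reconciles(order_refund_amount, line_refund_amounts,
                           tolerance=Decimal("0.01")):
    """Return True when line refunds sum to the order refund within tolerance."""
    line_total = sum(line_refund_amounts, Decimal("0"))
    return abs(line_total - order_refund_amount) <= tolerance


# Hypothetical amounts in dollars.
ok = check_order_reconciles(
    Decimal("10.00"), [Decimal("3.34"), Decimal("3.33"), Decimal("3.33")]
)
missing_line = check_order_reconciles(
    Decimal("10.00"), [Decimal("3.34"), Decimal("3.33")]
)
print(ok, missing_line)  # True False
```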

---

## Out of scope (for now)

- Tender-level reporting (card vs store_credit) — keep on `refund_fact`, not `order_fact`
- Tax, shipping, discounts in net revenue — follow the same scope as gross `revenue` unless finance expands the definition
- Changing `order_status = 'refunded'` semantics — keep status and refund/revenue logic separate

---

## Open questions

- Confirm merchant-to-source mapping (which merchants are Shopify-only vs Stripe-only vs POS vs multi-source) — may live on `lkp_merchants` or be inferred from which sources emit events per order.
- Choose default line allocation method for order-level refunds (pro-rata by `line_revenue` is the suggested default).
- Whether `daily_revenue` should move to `net_revenue` or expose both gross and net.

13 changes: 11 additions & 2 deletions models/orders/dw/order_fact.sql
@@ -49,11 +49,18 @@ WITH shipment_lines AS (
 , shipped_at
 , count(DISTINCT line_item_id) AS line_count
 , sum(quantity_shipped) AS total_quantity
-, sum(quantity_shipped * unit_price) AS shipment_revenue
 FROM joined
 GROUP BY order_id, merchant_id, customer_id, order_status, is_test, ordered_at, paid_at, shipment_id, shipped_at
 )

+, order_revenue AS (
+SELECT
+li.order_id
+, sum(li.quantity * li.unit_price) AS revenue
+FROM {{ ref('stg_line_items') }} AS li
+GROUP BY li.order_id
+)
+
 , shipment_counts AS (
 SELECT
 order_id
@@ -77,8 +84,10 @@ WITH shipment_lines AS (
 , sc.shipment_count
 , st.line_count
 , st.total_quantity
-, st.shipment_revenue AS revenue
+, ol.revenue
 FROM shipment_totals AS st
+LEFT JOIN order_revenue AS ol
+ON st.order_id = ol.order_id
 LEFT JOIN {{ ref('lkp_merchants') }} AS m
 ON st.merchant_id = m.merchant_id
 LEFT JOIN shipment_counts AS sc