Commit 72e1087

docs: exhaustively document the multi-stage grain directive
Add a comprehensive `grain` reference section (keep_only / exclude / include) to the measures reference, with semantics and worked result tables verified against the multi-stage-grain integration test. Document grain as the canonical form of group_by / reduce_by / add_group_by (which map to keep_only / exclude / include), and that a measure uses either a `grain` block or the legacy parameters, never both — per the Tesseract planner's build_grain_from_legacy / from_measure_ definition, a `grain` block causes the legacy directives to be ignored. Add a conceptual "Controlling grain" subsection and a new "Semi-additive (end-of-period) measures" recipe demonstrating grain (include + keep_only) and rank for end-of-period balances, and register it in the docs navigation.
1 parent 64db8b6 commit 72e1087

4 files changed

Lines changed: 539 additions & 0 deletions

File tree

docs-mintlify/docs.json

Lines changed: 1 addition & 0 deletions
@@ -616,6 +616,7 @@
616616
"recipes/data-modeling/nested-aggregates",
617617
"recipes/data-modeling/filtered-aggregates",
618618
"recipes/data-modeling/share-of-total",
619+
"recipes/data-modeling/semi-additive-measures",
619620
"recipes/data-modeling/period-over-period",
620621
"recipes/data-modeling/passing-dynamic-parameters-in-a-query",
621622
"recipes/data-modeling/using-dynamic-measures",

docs-mintlify/docs/data-modeling/measures.mdx

Lines changed: 33 additions & 0 deletions
@@ -408,6 +408,37 @@ measures:
408408
type: rank
409409
```
410410

411+
### Controlling grain
412+
413+
[`group_by`][ref-group-by], [`reduce_by`][ref-reduce-by], and
414+
[`add_group_by`][ref-add-group-by] each adjust the grain of a multi-stage
415+
measure's inner aggregation in one direction. The [`grain`][ref-grain] parameter
416+
is a unified alternative that expresses all of them — and their combinations —
417+
through three composable keys: `keep_only`, `exclude`, and `include`.
418+
419+
```yaml
measures:
  - name: total_amount
    sql: amount
    type: sum

  # Per-status total, repeated across every other query dimension:
  # the denominator for a "share of status" calculation.
  - name: amount_by_status
    multi_stage: true
    sql: "{total_amount}"
    type: sum
    grain:
      keep_only:
        - status
```
435+
436+
`keep_only` restricts the grain to only the listed dimensions, `exclude` removes
437+
them, and `include` adds dimensions to the inner grain that the outer stage then
438+
re-aggregates away. See the [`grain` reference][ref-grain] for the full semantics
439+
and the [semi-additive measures recipe][ref-semi-additive-recipe] for a worked
440+
end-of-period example.
441+
411442
### Conditional measures
412443

413444
Conditional measures depend on the value of a dimension, using the
@@ -466,6 +497,8 @@ measures:
466497
[ref-group-by]: /reference/data-modeling/measures#group_by
467498
[ref-reduce-by]: /reference/data-modeling/measures#reduce_by
468499
[ref-add-group-by]: /reference/data-modeling/measures#add_group_by
500+
[ref-grain]: /reference/data-modeling/measures#grain
501+
[ref-semi-additive-recipe]: /recipes/data-modeling/semi-additive-measures
469502
[ref-filter]: /reference/data-modeling/measures#filter
470503
[ref-case]: /reference/data-modeling/measures#case
471504
[ref-switch-dim]: /reference/data-modeling/dimensions#type
Lines changed: 265 additions & 0 deletions
@@ -0,0 +1,265 @@
1+
---
2+
title: Semi-additive (end-of-period) measures
3+
description: Model balances and other snapshot metrics that sum across entities but not across time, using a multi-stage rank measure and the grain directive to pick each period's last snapshot at any date grain.
4+
---
5+
6+
## Use case
7+
8+
A **semi-additive** measure can be summed across some dimensions but not others. The canonical
9+
example is an **account balance**: summing every daily balance in a month is meaningless — what
10+
you want is the **balance on the last day of the period** (end-of-month, end-of-quarter, and so
11+
on). Balances are additive across entities (accounts, products, stores) but *not* across time.
12+
13+
This recipe uses a multi-stage [`rank`][ref-type] measure and the [`grain`][ref-grain] directive
14+
to pick each period's final snapshot declaratively, so it works for *any* date grain the user
15+
groups by — no per-grain SQL and no query rewriting.
16+
17+
<Warning>
18+
19+
This pattern requires the Tesseract SQL planner. Set
20+
[`CUBEJS_TESSERACT_SQL_PLANNER=true`][ref-tesseract-env]. The multi-stage `grain` directive is
21+
not supported on the legacy planner.
22+
23+
</Warning>
24+
25+
## The data
26+
27+
Consider a fact table of **daily balance snapshots**, one row per account per day:
28+
29+
| `snapshot_date` | `account_id` | `balance` | `snapshot_frequency_key` |
|---|---|---:|---|
| 2024-01-30 | A1 | 1,000 | 1 *(daily)* |
| 2024-01-31 | A1 | 1,200 | 1 *(daily)* |
| 2024-01-31 | A2 | 500 | 1 *(daily)* |
| 2024-02-29 | A1 | 900 | 1 *(daily)* |
| 2024-01-31 | A1 | 1,150 | 2 *(month-end)* |
| 2024-02-29 | A1 | 880 | 2 *(month-end)* |
37+
38+
- **`snapshot_frequency_key`** discriminates independent snapshot streams — here `1` = daily and
39+
`2` = a separate official month-end feed (note the daily and month-end balances for A1 on
40+
2024-01-31 differ: 1,200 vs 1,150). Keeping it in the partition prevents the two streams from
41+
contaminating each other's "latest" calculation; consumers then pick a stream with a filter.
42+
Keep the key in the model even if you start with a single stream — it future-proofs the partition.
43+
- Note `A2` has **no row on 2024-01-30** and **none on 2024-02-29**. End-of-January for `A2` is
  its 01-31 balance (500). For any period whose last snapshot date is 2024-02-29 (February, or
  the quarter and year so far), `A2` is *missing* at period end and should contribute **0**, not
  its last-seen value of 500. That edge case is exactly what the partition design below gets right.
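
To make the problem concrete, the sample rows can be loaded into a throwaway CTE and summed naively. This is a sketch in plain SQL, not Cube output. It should run on engines that accept `VALUES` inside a CTE, such as PostgreSQL or SQLite. The `calendar_month` column is an assumption that stands in for the `date_dim` join.

```sql
-- Sample rows from the table above. calendar_month is a hypothetical
-- precomputed column standing in for the date_dim join.
WITH balance_snapshots
  (snapshot_date, calendar_month, account_id, balance, snapshot_frequency_key) AS (
  VALUES
    ('2024-01-30', '2024-01', 'A1', 1000, 1),
    ('2024-01-31', '2024-01', 'A1', 1200, 1),
    ('2024-01-31', '2024-01', 'A2',  500, 1),
    ('2024-02-29', '2024-02', 'A1',  900, 1),
    ('2024-01-31', '2024-01', 'A1', 1150, 2),
    ('2024-02-29', '2024-02', 'A1',  880, 2)
)
-- Naive approach: add up every snapshot in the month.
SELECT calendar_month, snapshot_frequency_key, SUM(balance) AS naive_sum
FROM balance_snapshots
GROUP BY calendar_month, snapshot_frequency_key
ORDER BY calendar_month, snapshot_frequency_key;

-- calendar_month | snapshot_frequency_key | naive_sum
-- 2024-01        | 1                      | 2700
-- 2024-01        | 2                      | 1150
-- 2024-02        | 1                      |  900
-- 2024-02        | 2                      |  880
```

For the daily stream, January comes out as 2,700 because A1 is counted on both 01-30 and 01-31. The end-of-January figure this recipe aims for is 1,700 (1,200 for A1 plus 500 for A2 on 01-31).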
46+
47+
A standard date-dimension cube (`date_dim`) joins on `snapshot_date` and exposes the usual grains
48+
`calendar_year`, `calendar_quarter`, `calendar_month`, `calendar_week`, and so on.
49+
50+
## Data modeling
51+
52+
The pattern is two multi-stage measures: a `rank` that finds each period's last snapshot date,
53+
and a `sum` that totals only the rows on that date.
54+
55+
```yaml
cubes:
  - name: balance_snapshots
    sql_table: analytics.balance_snapshots

    joins:
      - name: date_dim
        relationship: many_to_one
        sql: "{CUBE}.snapshot_date = {date_dim}.date_val"

    dimensions:
      - name: snapshot_date_key
        sql: snapshot_date
        type: time
      - name: snapshot_frequency_key
        sql: snapshot_frequency_key
        type: number
      - name: account_id
        sql: account_id
        type: string

    measures:
      # Plain additive base measure. On its own it sums across accounts AND days,
      # so query balance_eop (below) when you need a period balance.
      - name: balance
        sql: balance
        type: sum

      # ── Step 1: rank snapshot dates within each period (latest = 1) ──────────────
      - name: eop_rank
        public: false
        multi_stage: true
        type: rank
        order_by:
          - sql: "{snapshot_date_key}"
            dir: desc
        # A single grain block shapes both stages of this measure.
        #
        # When a measure declares `grain`, the legacy `group_by` / `reduce_by` /
        # `add_group_by` directives are ignored; `grain` is their canonical form
        # (group_by → keep_only, reduce_by → exclude, add_group_by → include). So
        # everything goes inside `grain`; do NOT also set `add_group_by` here.
        grain:
          # include = the leaf GROUP BY. Pushes snapshot_date_key into the leaf so the
          # rank has a physical column to order by (without it: "missing FROM-clause
          # entry"), and snapshot_frequency_key so it survives into the partition below.
          include:
            - snapshot_date_key
            - snapshot_frequency_key
          # keep_only = the window PARTITION BY. List ONLY the date grains a query might
          # group by, plus snapshot_frequency_key. Everything else (account_id) is
          # dropped from the partition, so "rank = 1" means the period's GLOBAL last
          # snapshot date, identical for every account.
          keep_only:
            - snapshot_frequency_key
            - date_dim.calendar_year
            - date_dim.calendar_quarter
            - date_dim.calendar_month
            - date_dim.calendar_week
            # …add every date-dim grain a query might group by. A MISSING entry silently
            # over-counts (period computed too coarse); extra entries are harmless.

      # ── Step 2: sum the base measure, keeping only the period's last snapshot ─────
      - name: balance_eop
        title: End-of-Period Balance
        multi_stage: true
        type: sum
        sql: "{balance}"
        # Repeat the leaf members (grain.include) so the rank filter is evaluated per
        # snapshot date; same canonical form as eop_rank, no add_group_by.
        grain:
          include:
            - snapshot_date_key
            - snapshot_frequency_key
        filters:
          - sql: "{eop_rank} = 1"
```
131+
132+
That is the whole pattern. `balance_eop` now returns the correct end-of-period balance at **any**
133+
grain the consumer groups by — no rewrite logic and no per-grain measures.
134+
135+
## What SQL this generates
136+
137+
<Note>
138+
139+
The SQL below is **illustrative** — simplified to show how `grain.include` and `grain.keep_only`
140+
map onto the `GROUP BY` and `PARTITION BY` clauses. The SQL Cube actually emits will differ in
141+
detail (CTE naming, column aliasing, extra wrapping subqueries, dialect-specific syntax). Inspect
142+
the real output for your setup via the [`/v1/sql` endpoint][ref-sql-api].
143+
144+
</Note>
145+
146+
For the query:
147+
148+
```sql
SELECT calendar_year, calendar_month, MEASURE(balance_snapshots.balance_eop)
FROM balance_snapshots
GROUP BY 1, 2
```
153+
154+
Cube compiles the multi-stage measure into two stacked stages:
155+
156+
```sql
-- STAGE 1 (leaf): the GROUP BY. Grain = queried dims + grain.include members.
WITH leaf AS (
  SELECT
    calendar_year,
    calendar_month,
    snapshot_frequency_key,  --\__ injected by grain.include
    snapshot_date_key,       --/   (not in the user's SELECT)
    SUM(balance) AS balance
  FROM balance_snapshots
  JOIN date_dim ON balance_snapshots.snapshot_date = date_dim.date_val
  GROUP BY 1, 2, 3, 4  -- ← grain.include forced cols 3 & 4 in here
),

-- STAGE 2 (window): the rank. PARTITION BY = grain AFTER keep_only.
ranked AS (
  SELECT
    leaf.*,
    RANK() OVER (
      PARTITION BY calendar_year, calendar_month, snapshot_frequency_key
      -- ↑ only date grains + frequency survived keep_only;
      --   account_id was stripped out here.
      ORDER BY snapshot_date_key DESC  -- ← order_by
    ) AS eop_rank
  FROM leaf
)

-- FINAL: sum the surviving rows.
SELECT calendar_year, calendar_month, SUM(balance) AS balance_eop
FROM ranked
WHERE eop_rank = 1  -- ← filters: {eop_rank} = 1
GROUP BY 1, 2
```
189+
190+
## Why both `include` and `keep_only` are needed
191+
192+
Both keys live in the **same `grain` block** but act on **different clauses of different stages**
193+
and pull in opposite directions:
194+
195+
| | `grain.include` | `grain.keep_only` |
|---|---|---|
| Acts on | Stage 1 `GROUP BY` (leaf grain) | Stage 2 `PARTITION BY` (window) |
| Direction | **adds** members | **restricts** members |
| Purpose | make `snapshot_date_key` exist so `order_by` can reference it | scope the rank to the date grain only, dropping entity dims |
| Omit it and… | `missing FROM-clause entry for snapshot_date_key` | rank computed per-account, not per period-end |
201+
202+
<Note>
203+
204+
`grain.include` is the canonical form of the legacy [`add_group_by`][ref-add-group-by] directive
205+
(and `keep_only` / `exclude` are the canonical forms of `group_by` / `reduce_by`). When a measure
206+
sets a `grain` block, those legacy directives are ignored — so keep everything inside `grain`
207+
rather than mixing the two styles.
208+
209+
</Note>
210+
211+
### The partition is the whole point
212+
213+
```sql
-- WITHOUT keep_only (account_id stays in the partition):
PARTITION BY calendar_year, calendar_month, account_id, snapshot_frequency_key
-- → rank = 1 is "latest date THIS account appears". An account absent on Jan 31 but
--   present Jan 30 ranks its Jan-30 row = 1 → counted. WRONG.

-- WITH keep_only (account_id dropped):
PARTITION BY calendar_year, calendar_month, snapshot_frequency_key
-- → rank = 1 is the month's GLOBAL last snapshot date (Jan 31), same for every account.
--   An account with no Jan-31 row has no rank-1 row → contributes 0. CORRECT.
```
224+
225+
This "missing at period end ⇒ 0" behavior is usually what end-of-period reporting wants, and it
226+
falls out automatically once entity dimensions are excluded from the partition.
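
The difference between the two partitions can be checked on the sample rows. The sketch below is plain SQL, not the SQL Cube emits. It uses only the daily stream and rolls up to the quarter, because that is the grain at which `A2` is absent on the period's last snapshot date (2024-02-29). The `calendar_quarter` column is an assumption standing in for the `date_dim` join, and the two rank aliases are illustrative names.

```sql
-- Daily-stream sample rows; calendar_quarter is a hypothetical precomputed column.
WITH balance_snapshots (snapshot_date, calendar_quarter, account_id, balance) AS (
  VALUES
    ('2024-01-30', '2024-Q1', 'A1', 1000),
    ('2024-01-31', '2024-Q1', 'A1', 1200),
    ('2024-01-31', '2024-Q1', 'A2',  500),
    ('2024-02-29', '2024-Q1', 'A1',  900)
),
ranked AS (
  SELECT
    calendar_quarter,
    balance,
    -- account_id left in the partition: latest date per account
    RANK() OVER (
      PARTITION BY calendar_quarter, account_id
      ORDER BY snapshot_date DESC
    ) AS rank_per_account,
    -- account_id dropped from the partition: latest date of the period
    RANK() OVER (
      PARTITION BY calendar_quarter
      ORDER BY snapshot_date DESC
    ) AS rank_global
  FROM balance_snapshots
)
SELECT
  calendar_quarter,
  SUM(CASE WHEN rank_per_account = 1 THEN balance ELSE 0 END) AS eop_per_account,
  SUM(CASE WHEN rank_global = 1 THEN balance ELSE 0 END) AS eop_global
FROM ranked
GROUP BY calendar_quarter;

-- calendar_quarter | eop_per_account | eop_global
-- 2024-Q1          | 1400            | 900
```

The per-account partition carries `A2`'s stale 500 from 01-31 forward and reports 1,400. The global partition keeps only the 02-29 rows and reports 900, which is the "missing at period end ⇒ 0" behavior described above.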
227+
228+
## Gotchas
229+
230+
- **Tesseract only.** The `grain` directive requires
231+
[`CUBEJS_TESSERACT_SQL_PLANNER=true`][ref-tesseract-env].
232+
- **Don't mix `grain` with the legacy directives.** When a measure sets a `grain` block, the
233+
legacy [`group_by`][ref-group-by] / [`reduce_by`][ref-reduce-by] / [`add_group_by`][ref-add-group-by]
234+
directives on that measure are ignored. Put leaf members under `grain.include`, not `add_group_by`.
235+
- **Members only.** Multi-stage measures reference **members**, never `{CUBE}.raw_column`. Wrap
236+
raw columns in a base measure or dimension first (here, `balance` and `snapshot_date_key`).
237+
- **`order_by` must be in the leaf grain.** A `rank` can only order by a column present in the
238+
leaf — that is why `snapshot_date_key` must be listed under `grain.include` on the rank measure
239+
(and on the consuming sum).
240+
- **`keep_only` takes explicit member paths only** — no cube-level or wildcard references.
241+
Enumerate every date grain a query might group by. A missing grain over-counts silently; extras
242+
are harmless.
243+
- **Default the frequency.** Consumers should constrain `snapshot_frequency_key` (for example via
244+
a view's [`default_filters`][ref-default-filters]) so they don't mix daily and month-end streams
245+
unintentionally.
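
The last gotcha can also be seen on the sample data. The sketch below mirrors the shape of the illustrative leaf and rank stages in plain SQL, at month grain, with both snapshot streams present. It is not Cube output. The `calendar_month` column and the two output aliases are assumptions made for the example.

```sql
-- All six sample rows; calendar_month is a hypothetical precomputed column.
WITH balance_snapshots
  (snapshot_date, calendar_month, account_id, balance, snapshot_frequency_key) AS (
  VALUES
    ('2024-01-30', '2024-01', 'A1', 1000, 1),
    ('2024-01-31', '2024-01', 'A1', 1200, 1),
    ('2024-01-31', '2024-01', 'A2',  500, 1),
    ('2024-02-29', '2024-02', 'A1',  900, 1),
    ('2024-01-31', '2024-01', 'A1', 1150, 2),
    ('2024-02-29', '2024-02', 'A1',  880, 2)
),
-- Leaf: queried grain plus the two included members.
leaf AS (
  SELECT calendar_month, snapshot_frequency_key, snapshot_date,
         SUM(balance) AS balance
  FROM balance_snapshots
  GROUP BY calendar_month, snapshot_frequency_key, snapshot_date
),
-- Rank: each stream gets its own "latest date" per month.
ranked AS (
  SELECT
    leaf.*,
    RANK() OVER (
      PARTITION BY calendar_month, snapshot_frequency_key
      ORDER BY snapshot_date DESC
    ) AS eop_rank
  FROM leaf
)
SELECT
  calendar_month,
  SUM(balance) AS eop_all_streams,
  SUM(CASE WHEN snapshot_frequency_key = 1 THEN balance ELSE 0 END) AS eop_daily_only
FROM ranked
WHERE eop_rank = 1
GROUP BY calendar_month
ORDER BY calendar_month;

-- calendar_month | eop_all_streams | eop_daily_only
-- 2024-01        | 2850            | 1700
-- 2024-02        | 1780            |  900
```

Keeping `snapshot_frequency_key` in the partition ranks each stream correctly, but the final sum still adds the daily and month-end streams together unless the consumer filters to one of them.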
246+
247+
## Related
248+
249+
<CardGroup cols={2}>
250+
<Card title="grain reference" icon="layer-group" href="/reference/data-modeling/measures#grain">
251+
Full semantics of the `keep_only`, `exclude`, and `include` keys.
252+
</Card>
253+
<Card title="Calculating share of total" icon="percent" href="/recipes/data-modeling/share-of-total">
254+
Use `grain` to compute each row's contribution to a group or grand total.
255+
</Card>
256+
</CardGroup>
257+
258+
[ref-type]: /reference/data-modeling/measures#type
259+
[ref-grain]: /reference/data-modeling/measures#grain
260+
[ref-add-group-by]: /reference/data-modeling/measures#add_group_by
261+
[ref-group-by]: /reference/data-modeling/measures#group_by
262+
[ref-reduce-by]: /reference/data-modeling/measures#reduce_by
263+
[ref-tesseract-env]: /reference/configuration/environment-variables#cubejs_tesseract_sql_planner
264+
[ref-sql-api]: /reference/core-data-apis/rest-api/reference
265+
[ref-default-filters]: /reference/data-modeling/view#default_filters
