Commit 05405aa

docs: exhaustively document the multi-stage grain directive
Add a comprehensive `grain` reference section (keep_only / exclude / include) to the measures reference, with semantics and worked result tables verified against the multi-stage-grain integration test. Document grain as the canonical form of group_by / reduce_by / add_group_by (which map to keep_only / exclude / include), and that a measure uses either a `grain` block or the legacy parameters, never both — per the Tesseract planner's build_grain_from_legacy / from_measure_ definition, a `grain` block causes the legacy directives to be ignored. Add a conceptual "Controlling grain" subsection and a new "Semi-additive (end-of-period) measures" recipe demonstrating grain (include + keep_only) and rank for end-of-period balances, and register it in the docs navigation.
1 parent 0e89fb0 commit 05405aa

4 files changed

Lines changed: 561 additions & 43 deletions

docs-mintlify/docs.json

Lines changed: 1 addition & 0 deletions
@@ -620,6 +620,7 @@
620620
"recipes/data-modeling/nested-aggregates",
621621
"recipes/data-modeling/filtered-aggregates",
622622
"recipes/data-modeling/share-of-total",
623+
"recipes/data-modeling/semi-additive-measures",
623624
"recipes/data-modeling/period-over-period",
624625
"recipes/data-modeling/passing-dynamic-parameters-in-a-query",
625626
"recipes/data-modeling/using-dynamic-measures",

docs-mintlify/docs/data-modeling/measures.mdx

Lines changed: 32 additions & 5 deletions
@@ -412,13 +412,36 @@ measures:
412412
type: rank
413413
```
414414

415-
<Note>
415+
### Controlling grain
416416

417-
`grain` replaces the standalone `group_by`, `reduce_by`, and `add_group_by`
418-
parameters, which remain supported. See the [`grain`][ref-grain] reference for
419-
the migration mapping.
417+
[`group_by`][ref-group-by], [`reduce_by`][ref-reduce-by], and
418+
[`add_group_by`][ref-add-group-by] each adjust the grain of a multi-stage
419+
measure's inner aggregation in one direction. The [`grain`][ref-grain] parameter
420+
is a unified alternative that expresses all of them — and their combinations —
421+
through three composable keys: `keep_only`, `exclude`, and `include`.
420422

421-
</Note>
423+
```yaml
424+
measures:
425+
  - name: total_amount
426+
    sql: amount
427+
    type: sum
428+
429+
  # Per-status total, repeated across every other query dimension (the
430+
  # denominator for a "share of status" calculation).
431+
  - name: amount_by_status
432+
    multi_stage: true
433+
    sql: "{total_amount}"
434+
    type: sum
435+
    grain:
436+
      keep_only:
437+
        - status
438+
```
439+
440+
`keep_only` restricts the grain to only the listed dimensions, `exclude` removes
441+
the listed dimensions from it, and `include` adds dimensions to the inner grain that the outer stage then
442+
re-aggregates away. See the [`grain` reference][ref-grain] for the full semantics
443+
and the [semi-additive measures recipe][ref-semi-additive-recipe] for a worked
444+
end-of-period example.
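
As a companion to the `keep_only` example above, here is a minimal sketch of `exclude`. It assumes `exclude` takes the same list-of-members shape as `keep_only`. The measure name `amount_across_statuses` is hypothetical; `total_amount` and `status` are the members from the example above.

```yaml
measures:
  # Hypothetical measure: the total at the query's grain with `status`
  # removed from it, so each row carries the total across all statuses.
  - name: amount_across_statuses
    multi_stage: true
    sql: "{total_amount}"
    type: sum
    grain:
      exclude:
        - status
```

In a query grouped by `status` and one other dimension, this would be expected to repeat the same value for every status within each value of the other dimension. Confirm the exact behavior against the `grain` reference before relying on it.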
422445

423446
### Conditional measures
424447

@@ -475,7 +498,11 @@ measures:
475498
[ref-format]: /reference/data-modeling/measures#format
476499
[ref-rolling-window]: /reference/data-modeling/measures#rolling_window
477500
[ref-time-shift]: /reference/data-modeling/measures#time_shift
501+
[ref-group-by]: /reference/data-modeling/measures#group_by-reduce_by-and-add_group_by-legacy
502+
[ref-reduce-by]: /reference/data-modeling/measures#group_by-reduce_by-and-add_group_by-legacy
503+
[ref-add-group-by]: /reference/data-modeling/measures#group_by-reduce_by-and-add_group_by-legacy
478504
[ref-grain]: /reference/data-modeling/measures#grain
505+
[ref-semi-additive-recipe]: /recipes/data-modeling/semi-additive-measures
479506
[ref-filter]: /reference/data-modeling/measures#filter
480507
[ref-case]: /reference/data-modeling/measures#case
481508
[ref-switch-dim]: /reference/data-modeling/dimensions#type
Lines changed: 311 additions & 0 deletions
@@ -0,0 +1,311 @@
1+
---
2+
title: Semi-additive (end-of-period) measures
3+
description: Model balances and other snapshot metrics that sum across entities but not across time, using a multi-stage rank measure and the grain directive to pick each period's last snapshot at any date grain.
4+
---
5+
6+
## Use case
7+
8+
A **semi-additive** measure can be summed across some dimensions but not others. The canonical
9+
example is an **account balance**: summing every daily balance in a month is meaningless — what
10+
you want is the **balance on the last day of the period** (end-of-month, end-of-quarter, and so
11+
on). Balances are additive across entities (accounts, products, stores) but *not* across time.
12+
13+
This recipe uses a multi-stage [`rank`][ref-type] measure and the [`grain`][ref-grain] directive
14+
to pick each period's final snapshot declaratively, so it works for *any* date grain the user
15+
groups by — no per-grain SQL and no query rewriting.
16+
17+
<Warning>
18+
19+
This pattern requires the Tesseract SQL planner. Set
20+
[`CUBEJS_TESSERACT_SQL_PLANNER=true`][ref-tesseract-env]. The multi-stage `grain` directive is
21+
not supported on the legacy planner.
22+
23+
</Warning>
24+
25+
## The data
26+
27+
Consider a fact table of **daily balance snapshots**, one row per account per day:
28+
29+
| `snapshot_date` | `account_id` | `balance` | `snapshot_frequency_key` |
30+
|---|---|---:|---|
31+
| 2024-01-30 | A1 | 1,000 | 1 *(daily)* |
32+
| 2024-01-31 | A1 | 1,200 | 1 *(daily)* |
33+
| 2024-01-31 | A2 | 500 | 1 *(daily)* |
34+
| 2024-02-29 | A1 | 900 | 1 *(daily)* |
35+
| 2024-01-31 | A1 | 1,150 | 2 *(month-end)* |
36+
| 2024-02-29 | A1 | 880 | 2 *(month-end)* |
37+
38+
- **`snapshot_frequency_key`** discriminates independent snapshot streams — here `1` = daily and
39+
`2` = a separate official month-end feed (note the daily and month-end balances for A1 on
40+
2024-01-31 differ: 1,200 vs 1,150). Keeping it in the partition prevents the two streams from
41+
contaminating each other's "latest" calculation; consumers then pick a stream with a filter.
42+
Keep the key in the model even if you start with a single stream — it future-proofs the partition.
43+
- Note `A2` appears only on 2024-01-31. End-of-January for `A2` is therefore its 01-31 balance
44+
(500), but in February `A2` has no row on the period's last snapshot date (2024-02-29). A *missing* account at period end should contribute **0**, not its last-seen value. That
45+
edge case is exactly what the partition design below gets right.
46+
47+
A standard date-dimension cube (`date_dim`) joins on `snapshot_date` and exposes the usual grains
48+
`calendar_year`, `calendar_quarter`, `calendar_month`, `calendar_week`, and so on.
49+
50+
## Data modeling
51+
52+
The pattern is two multi-stage measures: a `rank` that finds each period's last snapshot date,
53+
and a `sum` that totals only the rows on that date.
54+
55+
```yaml
56+
cubes:
57+
  - name: balance_snapshots
58+
    sql_table: analytics.balance_snapshots
59+
60+
    joins:
61+
      - name: date_dim
62+
        relationship: many_to_one
63+
        sql: "{CUBE}.snapshot_date = {date_dim}.date_val"
64+
65+
    dimensions:
66+
      - name: snapshot_date_key
67+
        sql: snapshot_date
68+
        type: time
69+
      - name: snapshot_frequency_key
70+
        sql: snapshot_frequency_key
71+
        type: number
72+
      - name: account_id
73+
        sql: account_id
74+
        type: string
75+
76+
    measures:
77+
      # Plain sum base measure: adds across accounts AND days; balance_eop below restricts it to period end.
78+
      - name: balance
79+
        sql: balance
80+
        type: sum
81+
82+
      # ── Step 1: rank snapshot dates within each period (latest = 1) ──────────────
83+
      - name: eop_rank
84+
        public: false
85+
        multi_stage: true
86+
        type: rank
87+
        order_by:
88+
          - sql: "{snapshot_date_key}"
89+
            dir: desc
90+
        # A single grain block shapes both stages of this measure.
91+
        grain:
92+
          # include = the leaf GROUP BY. Pushes snapshot_date_key into the leaf so the
93+
          # rank has a physical column to order by (without it: "missing FROM-clause
94+
          # entry"), and snapshot_frequency_key so it survives into the partition below.
95+
          include:
96+
            - snapshot_date_key
97+
            - snapshot_frequency_key
98+
          # keep_only = the window PARTITION BY. List ONLY the date grains a query might
99+
          # group by, plus snapshot_frequency_key. Everything else (account_id) is
100+
          # dropped from the partition, so "rank = 1" means the period's GLOBAL last
101+
          # snapshot date, identical for every account.
102+
          keep_only:
103+
            - snapshot_frequency_key
104+
            - date_dim.calendar_year
105+
            - date_dim.calendar_quarter
106+
            - date_dim.calendar_month
107+
            - date_dim.calendar_week
108+
            # …add every date-dim grain a query might group by. A MISSING entry silently
109+
            # over-counts (period computed too coarse); extra entries are harmless.
110+
111+
      # ── Step 2: sum the base measure, keeping only the period's last snapshot ─────
112+
      - name: balance_eop
113+
        title: End-of-Period Balance
114+
        multi_stage: true
115+
        type: sum
116+
        sql: "{balance}"
117+
        # Repeat the leaf members (grain.include) so the rank filter is evaluated per
118+
        # snapshot date. Same canonical form as eop_rank, no add_group_by.
119+
        grain:
120+
          include:
121+
            - snapshot_date_key
122+
            - snapshot_frequency_key
123+
        filters:
124+
          - sql: "{eop_rank} = 1"
125+
```
126+
127+
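That is the whole pattern. `balance_eop` now returns the correct end-of-period balance at **any**
128+
grain the consumer groups by, with no rewrite logic and no per-grain measures. On the sample data, filtered to the daily stream and grouped by month, January returns 1,700 (A1's 1,200 plus A2's 500 on 01-31) and February returns 900 (A1 only; A2 has no 02-29 row and contributes 0).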
129+
130+
## What SQL this generates
131+
132+
<Note>
133+
134+
The SQL below is **illustrative** — simplified to show how `grain.include` and `grain.keep_only`
135+
map onto the `GROUP BY` and `PARTITION BY` clauses. The SQL Cube actually emits will differ in
136+
detail (CTE naming, column aliasing, extra wrapping subqueries, dialect-specific syntax). Inspect
137+
the real output for your setup via the [`/v1/sql` endpoint][ref-sql-api].
138+
139+
</Note>
140+
141+
For the query:
142+
143+
```sql
144+
SELECT calendar_year, calendar_month, MEASURE(balance_snapshots.balance_eop)
145+
FROM balance_snapshots
146+
GROUP BY 1, 2
147+
```
148+
149+
Cube compiles the multi-stage measure into two stacked stages:
150+
151+
```sql
152+
-- STAGE 1 (leaf): the GROUP BY. Grain = queried dims + grain.include members.
153+
WITH leaf AS (
154+
  SELECT
155+
    calendar_year,
156+
    calendar_month,
157+
    snapshot_frequency_key,  --\__ injected by grain.include
158+
    snapshot_date_key,       --/   (not in the user's SELECT)
159+
    SUM(balance) AS balance
160+
  FROM balance_snapshots
161+
  JOIN date_dim ON balance_snapshots.snapshot_date = date_dim.date_val
162+
  GROUP BY 1, 2, 3, 4  -- ← grain.include forced cols 3 & 4 in here
163+
),
164+
165+
-- STAGE 2 (window): the rank. PARTITION BY = grain AFTER keep_only.
166+
ranked AS (
167+
  SELECT
168+
    leaf.*,
169+
    RANK() OVER (
170+
      PARTITION BY calendar_year, calendar_month, snapshot_frequency_key
171+
      -- ↑ only date grains + frequency survived keep_only;
172+
      --   account_id was stripped out here.
173+
      ORDER BY snapshot_date_key DESC  -- ← order_by
174+
    ) AS eop_rank
175+
  FROM leaf
176+
)
177+
178+
-- FINAL: sum the surviving rows.
179+
SELECT calendar_year, calendar_month, SUM(balance) AS balance_eop
180+
FROM ranked
181+
WHERE eop_rank = 1 -- ← filters: {eop_rank} = 1
182+
GROUP BY 1, 2
183+
```
184+
185+
## Why both `include` and `keep_only` are needed
186+
187+
Both keys live in the **same `grain` block** but act on **different clauses of different stages**
188+
and pull in opposite directions:
189+
190+
| | `grain.include` | `grain.keep_only` |
191+
|---|---|---|
192+
| Acts on | Stage 1 `GROUP BY` (leaf grain) | Stage 2 `PARTITION BY` (window) |
193+
| Direction | **adds** members | **restricts** members |
194+
| Purpose | make `snapshot_date_key` exist so `order_by` can reference it | scope the rank to the date grain only, dropping entity dims |
195+
| Omit it and… | `missing FROM-clause entry for snapshot_date_key` | rank computed per-account, not per period-end |
196+
197+
<Note>
198+
199+
`grain.include` is the canonical form of the legacy [`add_group_by`][ref-add-group-by] directive
200+
(and `keep_only` / `exclude` are the canonical forms of `group_by` / `reduce_by`). When a measure
201+
sets a `grain` block, those legacy directives are ignored — so keep everything inside `grain`
202+
rather than mixing the two styles.
203+
204+
</Note>
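
To make the mapping concrete, here is a sketch of the consuming measure written both ways. It assumes the legacy `add_group_by` parameter takes a plain list of members. The name `balance_eop_legacy` is hypothetical and exists only to show the two forms side by side; a real model should define one form, not both.

```yaml
measures:
  # Legacy form (hypothetical name, for comparison only).
  - name: balance_eop_legacy
    multi_stage: true
    type: sum
    sql: "{balance}"
    add_group_by:
      - snapshot_date_key
      - snapshot_frequency_key
    filters:
      - sql: "{eop_rank} = 1"

  # Canonical form, as used in this recipe.
  - name: balance_eop
    multi_stage: true
    type: sum
    sql: "{balance}"
    grain:
      include:
        - snapshot_date_key
        - snapshot_frequency_key
    filters:
      - sql: "{eop_rank} = 1"
```

Putting `add_group_by` and a `grain` block on the same measure would not combine them: per the note above, the legacy parameter is ignored once `grain` is present.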
205+
206+
### The partition is the whole point
207+
208+
```sql
209+
-- WITHOUT keep_only — account_id stays in the partition:
210+
PARTITION BY calendar_year, calendar_month, account_id, snapshot_frequency_key
211+
-- → rank = 1 is "latest date THIS account appears". An account absent on Jan 31 but
212+
-- present Jan 30 ranks its Jan-30 row = 1 → counted. WRONG.
213+
214+
-- WITH keep_only — account_id dropped:
215+
PARTITION BY calendar_year, calendar_month, snapshot_frequency_key
216+
-- → rank = 1 is the month's GLOBAL last snapshot date (Jan 31), same for every account.
217+
-- An account with no Jan-31 row has no rank-1 row → contributes 0. CORRECT.
218+
```
219+
220+
This "missing at period end ⇒ 0" behavior is usually what end-of-period reporting wants, and it
221+
falls out automatically once entity dimensions are excluded from the partition.
222+
223+
## Defaulting the snapshot stream
224+
225+
`snapshot_frequency_key` is in the partition, so `balance_eop` is computed *per stream*. If a
226+
query touches more than one stream at once — for example both the daily feed (`1`) and the
227+
official month-end feed (`2`) — every period ends up with one rank-1 row *per stream*, and the
228+
outer sum adds them together, double-counting the balance.
229+
230+
That is exactly why the column exists: consumers should always be scoped to a single stream.
231+
Rather than rely on every consumer remembering to filter, expose the cube through a view with a
232+
[`default_filters`][ref-default-filters] entry that pins the stream by default but releases as
233+
soon as the consumer filters it themselves (via [`unless`][ref-default-filters]):
234+
235+
```yaml
236+
views:
237+
  - name: balances
238+
    cubes:
239+
      - join_path: balance_snapshots
240+
        includes: "*"
241+
      # The calendar grains the measures partition by live on date_dim, so the
242+
      # view must include them via the join path or consumers can't group by them.
243+
      - join_path: balance_snapshots.date_dim
244+
        includes:
245+
          - calendar_year
246+
          - calendar_quarter
247+
          - calendar_month
248+
          - calendar_week
249+
250+
    default_filters:
251+
      # Default every query to the daily snapshot stream...
252+
      - member: snapshot_frequency_key
253+
        operator: equals
254+
        values:
255+
          - "1"
256+
        # ...but step aside the moment the consumer filters the stream
257+
        # themselves (e.g. to the official month-end feed, 2).
258+
        unless:
259+
          - snapshot_frequency_key
260+
```
261+
262+
With this in place, a consumer querying `balances` by `calendar_month` gets the daily-stream
263+
end-of-period balance with no extra effort, and a consumer who adds
264+
`snapshot_frequency_key = 2` transparently switches to the official month-end feed instead.
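
If some consumers should default to the official month-end feed instead, one option is a second view that pins stream `2`. This is a sketch only: the view name `balances_month_end` is hypothetical, and the nesting of `default_filters` and `unless` follows the `balances` view above rather than a verified reference.

```yaml
views:
  # Hypothetical second view that defaults to the official month-end feed.
  - name: balances_month_end
    cubes:
      - join_path: balance_snapshots
        includes: "*"
      - join_path: balance_snapshots.date_dim
        includes:
          - calendar_year
          - calendar_quarter
          - calendar_month
          - calendar_week

    default_filters:
      # Pin stream 2 unless the consumer filters the stream explicitly.
      - member: snapshot_frequency_key
        operator: equals
        values:
          - "2"
        unless:
          - snapshot_frequency_key
```

Both views read the same cube, so the `eop_rank` partition still keeps the two streams separate; only the default stream differs.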
265+
266+
<Tip>
267+
268+
If you want the default to be editable or removable in a workbook (rather than released only by
269+
an explicit filter), use [`meta.default_ui_filters`][ref-default-ui-filters] instead of
270+
`default_filters`.
271+
272+
</Tip>
273+
274+
## Gotchas
275+
276+
- **Tesseract only.** The `grain` directive requires
277+
[`CUBEJS_TESSERACT_SQL_PLANNER=true`][ref-tesseract-env].
278+
- **Don't mix `grain` with the legacy directives.** When a measure sets a `grain` block, the
279+
legacy [`group_by`][ref-group-by] / [`reduce_by`][ref-reduce-by] / [`add_group_by`][ref-add-group-by]
280+
directives on that measure are ignored. Put leaf members under `grain.include`, not `add_group_by`.
281+
- **Members only.** Multi-stage measures reference **members**, never `{CUBE}.raw_column`. Wrap
282+
raw columns in a base measure or dimension first (here, `balance` and `snapshot_date_key`).
283+
- **`order_by` must be in the leaf grain.** A `rank` can only order by a column present in the
284+
leaf — that is why `snapshot_date_key` must be listed under `grain.include` on the rank measure
285+
(and on the consuming sum).
286+
- **`keep_only` takes explicit member paths only** — no cube-level or wildcard references.
287+
Enumerate every date grain a query might group by. A missing grain over-counts silently; extras
288+
are harmless.
289+
- **Default the frequency.** Always scope queries to a single `snapshot_frequency_key` so streams
290+
aren't mixed and double-counted — see [Defaulting the snapshot stream](#defaulting-the-snapshot-stream).
291+
292+
## Related
293+
294+
<CardGroup cols={2}>
295+
<Card title="grain reference" icon="layer-group" href="/reference/data-modeling/measures#grain">
296+
Full semantics of the `keep_only`, `exclude`, and `include` keys.
297+
</Card>
298+
<Card title="Calculating share of total" icon="percent" href="/recipes/data-modeling/share-of-total">
299+
Use `grain` to compute each row's contribution to a group or grand total.
300+
</Card>
301+
</CardGroup>
302+
303+
[ref-type]: /reference/data-modeling/measures#type
304+
[ref-grain]: /reference/data-modeling/measures#grain
305+
[ref-add-group-by]: /reference/data-modeling/measures#group_by-reduce_by-and-add_group_by-legacy
306+
[ref-group-by]: /reference/data-modeling/measures#group_by-reduce_by-and-add_group_by-legacy
307+
[ref-reduce-by]: /reference/data-modeling/measures#group_by-reduce_by-and-add_group_by-legacy
308+
[ref-tesseract-env]: /reference/configuration/environment-variables#cubejs_tesseract_sql_planner
309+
[ref-sql-api]: /reference/core-data-apis/rest-api/reference
310+
[ref-default-filters]: /reference/data-modeling/view#default_filters
311+
[ref-default-ui-filters]: /reference/data-modeling/view#default_ui_filters
