Commit 4a1efcb

timsaucer and claude committed

Add grouping sets documentation and note grouping() alias limitation

Add user documentation for GroupingSet.rollup, .cube, and .grouping_sets with Pokemon dataset examples. Document the upstream alias limitation (apache/datafusion#21411) in both the grouping() docstring and the aggregation user guide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent c9183dd

File tree: 2 files changed, +178 -1 lines changed

docs/source/user-guide/common-operations/aggregations.rst

Lines changed: 167 additions & 0 deletions
@@ -163,6 +163,168 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v
         f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])

Grouping Sets
-------------

The default style of aggregation produces one row per group. Sometimes you want a single query to
produce rows at multiple levels of detail: for example, totals per type *and* an overall grand
total, or subtotals for every combination of two columns plus the individual column totals. Writing
separate queries and concatenating them is tedious and scans the data multiple times. Grouping sets
solve this by letting you specify several grouping levels in one pass.

DataFusion supports three grouping set styles through the
:py:class:`~datafusion.expr.GroupingSet` class:

- :py:meth:`~datafusion.expr.GroupingSet.rollup`: hierarchical subtotals, like a drill-down report
- :py:meth:`~datafusion.expr.GroupingSet.cube`: every possible subtotal combination, like a pivot table
- :py:meth:`~datafusion.expr.GroupingSet.grouping_sets`: explicitly list exactly which grouping levels you want
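The grouping levels each style produces can be enumerated in a few lines of plain Python (a sketch of the semantics only, not the DataFusion API): ``rollup`` keeps the column prefixes, while ``cube`` keeps every subset.

```python
from itertools import combinations

def rollup_sets(cols):
    """rollup(c1, ..., cn): the n + 1 prefixes (c1..cn), ..., (c1,), ()."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    """cube(c1, ..., cn): all 2**n subsets of the columns."""
    return [tuple(c)
            for r in range(len(cols), -1, -1)
            for c in combinations(cols, r)]

print(rollup_sets(["Type 1", "Type 2"]))
# [('Type 1', 'Type 2'), ('Type 1',), ()]
print(cube_sets(["Type 1", "Type 2"]))
# [('Type 1', 'Type 2'), ('Type 1',), ('Type 2',), ()]
```

Note that ``cube`` of n columns always yields 2**n levels, so it grows quickly; ``rollup`` stays linear at n + 1.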
181+
182+
Because result rows come from different grouping levels, a column that is *not* part of a
183+
particular level will be ``null`` in that row. Use :py:func:`~datafusion.functions.grouping` to
184+
distinguish a real ``null`` in the data from one that means "this column was aggregated across."
185+
It returns ``0`` when the column is a grouping key for that row, and ``1`` when it is not.
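That indicator behaviour can be modelled directly (a plain-Python sketch of the semantics, not the DataFusion implementation):

```python
def grouping_indicator(column, grouping_set):
    """Mimic grouping(): 0 when the column is a grouping key at this
    row's level, 1 when the row aggregates across it."""
    return 0 if column in grouping_set else 1

# For rollup(Type 1) the two levels are ("Type 1",) and ():
levels = [("Type 1",), ()]
flags = [grouping_indicator("Type 1", level) for level in levels]
print(flags)  # [0, 1]: per-type rows report 0, the grand-total row reports 1
```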
Rollup
^^^^^^

:py:meth:`~datafusion.expr.GroupingSet.rollup` creates a hierarchy. ``rollup(a, b)`` produces
grouping sets ``(a, b)``, ``(a)``, and ``()``, like nested subtotals in a report. This is useful
when your columns have a natural hierarchy, such as region → city or type → subtype.

Suppose we want to summarize Pokemon stats by ``Type 1`` with subtotals and a grand total. With
the default aggregation style we would need two separate queries. With ``rollup`` we get it all at
once:

.. ipython:: python

    from datafusion.expr import GroupingSet

    df.aggregate(
        [GroupingSet.rollup(col_type_1)],
        [f.count(col_speed).alias("Count"),
         f.avg(col_speed).alias("Avg Speed"),
         f.max(col_speed).alias("Max Speed")]
    ).sort(col_type_1.sort(ascending=True, nulls_first=True))

The first row, where ``Type 1`` is ``null``, is the grand total across all types. But how do you
tell a grand-total ``null`` apart from a Pokemon that genuinely has no type? The
:py:func:`~datafusion.functions.grouping` function returns ``0`` when the column is a grouping key
for that row and ``1`` when it is aggregated across.
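To make the mechanics concrete without running DataFusion, here is a plain-Python simulation of a one-column rollup over a hypothetical miniature stand-in for the Pokemon table (the data values are invented for illustration):

```python
from collections import defaultdict

# Hypothetical miniature table: (Type 1, Speed) pairs.
rows = [("Grass", 45), ("Grass", 60), ("Fire", 65), ("Fire", 80), ("Water", 43)]

def rollup_one_column(rows):
    """Aggregate at both levels rollup(type) produces: per type and overall.
    The None key plays the role of the null grouping column in the output."""
    per_type = defaultdict(list)
    for type_1, speed in rows:
        per_type[type_1].append(speed)
    out = {t: (len(v), sum(v) / len(v)) for t, v in per_type.items()}
    speeds = [s for _, s in rows]
    out[None] = (len(speeds), sum(speeds) / len(speeds))  # the () grouping set
    return out

result = rollup_one_column(rows)
print(result["Grass"])  # (2, 52.5): one per-type row
print(result[None])     # (5, 58.6): the grand-total row
```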
.. note::

    Due to an upstream DataFusion limitation
    (`apache/datafusion#21411 <https://github.com/apache/datafusion/issues/21411>`_),
    ``.alias()`` cannot be applied directly to a ``grouping()`` expression; it will raise an
    error at execution time. Instead, use
    :py:meth:`~datafusion.dataframe.DataFrame.with_column_renamed` on the result DataFrame to
    give the column a readable name. Once the upstream issue is resolved, you will be able to
    use ``.alias()`` directly and the workaround below will no longer be necessary.

The raw column name generated by ``grouping()`` contains internal identifiers, so we use
:py:meth:`~datafusion.dataframe.DataFrame.with_column_renamed` to clean it up:
.. ipython:: python

    result = df.aggregate(
        [GroupingSet.rollup(col_type_1)],
        [f.count(col_speed).alias("Count"),
         f.avg(col_speed).alias("Avg Speed"),
         f.grouping(col_type_1)]
    )
    for field in result.schema():
        if field.name.startswith("grouping("):
            result = result.with_column_renamed(field.name, "Is Total")
    result.sort(col_type_1.sort(ascending=True, nulls_first=True))

With two columns the hierarchy becomes more apparent. ``rollup(Type 1, Type 2)`` produces:

- one row per ``(Type 1, Type 2)`` pair: the most detailed level
- one row per ``Type 1``: subtotals
- one grand total row
.. ipython:: python

    df.aggregate(
        [GroupingSet.rollup(col_type_1, col_type_2)],
        [f.count(col_speed).alias("Count"),
         f.avg(col_speed).alias("Avg Speed")]
    ).sort(
        col_type_1.sort(ascending=True, nulls_first=True),
        col_type_2.sort(ascending=True, nulls_first=True)
    )

Cube
^^^^

:py:meth:`~datafusion.expr.GroupingSet.cube` produces every possible subset. ``cube(a, b)``
produces grouping sets ``(a, b)``, ``(a)``, ``(b)``, and ``()``: one more than ``rollup`` because
it also includes ``(b)`` alone. This is useful when neither column is "above" the other in a
hierarchy and you want all cross-tabulations.

For our Pokemon data, ``cube(Type 1, Type 2)`` gives us stats broken down by the type pair,
by ``Type 1`` alone, by ``Type 2`` alone, and a grand total, all in one query:

.. ipython:: python

    df.aggregate(
        [GroupingSet.cube(col_type_1, col_type_2)],
        [f.count(col_speed).alias("Count"),
         f.avg(col_speed).alias("Avg Speed")]
    ).sort(
        col_type_1.sort(ascending=True, nulls_first=True),
        col_type_2.sort(ascending=True, nulls_first=True)
    )

Compared to the ``rollup`` example above, notice the extra rows where ``Type 1`` is ``null`` but
``Type 2`` has a value; those are the per-``Type 2`` subtotals that ``rollup`` does not include.
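For two columns, the difference between the two styles is exactly one grouping level, which a quick set computation (plain Python, not a DataFusion call) confirms:

```python
from itertools import combinations

cols = ("Type 1", "Type 2")
# rollup keeps the prefixes of the column list; cube keeps every subset.
rollup_levels = {cols[:i] for i in range(len(cols) + 1)}
cube_levels = {tuple(c) for r in range(len(cols) + 1)
               for c in combinations(cols, r)}

extra = cube_levels - rollup_levels
print(extra)  # {('Type 2',)}: the per-Type 2 subtotals only cube produces
```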
Explicit Grouping Sets
^^^^^^^^^^^^^^^^^^^^^^

:py:meth:`~datafusion.expr.GroupingSet.grouping_sets` lets you list exactly which grouping levels
you need when ``rollup`` or ``cube`` would produce too many or too few. Each argument is a list of
columns forming one grouping set.

For example, if we want only the per-``Type 1`` totals and per-``Type 2`` totals, but *not* the
full ``(Type 1, Type 2)`` detail rows or the grand total, we can ask for exactly that:

.. ipython:: python

    df.aggregate(
        [GroupingSet.grouping_sets([col_type_1], [col_type_2])],
        [f.count(col_speed).alias("Count"),
         f.avg(col_speed).alias("Avg Speed")]
    ).sort(
        col_type_1.sort(ascending=True, nulls_first=True),
        col_type_2.sort(ascending=True, nulls_first=True)
    )

Each row belongs to exactly one grouping level. The :py:func:`~datafusion.functions.grouping`
function tells you which level each row comes from:
.. ipython:: python

    result = df.aggregate(
        [GroupingSet.grouping_sets([col_type_1], [col_type_2])],
        [f.count(col_speed).alias("Count"),
         f.avg(col_speed).alias("Avg Speed"),
         f.grouping(col_type_1),
         f.grouping(col_type_2)]
    )
    for field in result.schema():
        if field.name.startswith("grouping("):
            clean = field.name.split(".")[-1].rstrip(")")
            result = result.with_column_renamed(field.name, f"grouping({clean})")
    result.sort(
        col_type_1.sort(ascending=True, nulls_first=True),
        col_type_2.sort(ascending=True, nulls_first=True)
    )

Where ``grouping(Type 1)`` is ``0`` the row is a per-``Type 1`` total (and ``Type 2`` is ``null``).
Where ``grouping(Type 2)`` is ``0`` the row is a per-``Type 2`` total (and ``Type 1`` is ``null``).
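A plain-Python simulation over a few hypothetical rows (not the real dataset) shows both indicator columns at work. Note the ``(None, None, 1, 0)`` output row: ``Type 2`` is a grouping key there (flag ``0``), so its ``None`` is a genuine null from the data, not a level that was aggregated across.

```python
rows = [("Grass", "Poison"), ("Grass", None), ("Fire", None), ("Fire", "Flying")]

def grouping_sets_rows(rows, keep_sets):
    """Emit one output row per distinct key per grouping set, tagged with
    grouping() flags (0 = column is a key, 1 = aggregated across)."""
    out = []
    for keep in keep_sets:  # keep = set of column indices used as keys
        seen = set()
        for row in rows:
            key = tuple(v if i in keep else None for i, v in enumerate(row))
            if key not in seen:
                seen.add(key)
                flags = tuple(0 if i in keep else 1 for i in range(len(row)))
                out.append(key + flags)
    return out

# grouping_sets([Type 1], [Type 2]): group by column 0, then by column 1.
levels = grouping_sets_rows(rows, [{0}, {1}])
for r in levels:
    print(r)
```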
Aggregate Functions
-------------------

@@ -213,4 +375,9 @@ The available aggregate functions are:
- :py:func:`datafusion.functions.approx_median`
- :py:func:`datafusion.functions.approx_percentile_cont`
- :py:func:`datafusion.functions.approx_percentile_cont_with_weight`

10. Grouping Set Functions
    - :py:func:`datafusion.functions.grouping`
    - :py:meth:`datafusion.expr.GroupingSet.rollup`
    - :py:meth:`datafusion.expr.GroupingSet.cube`
    - :py:meth:`datafusion.expr.GroupingSet.grouping_sets`

python/datafusion/functions.py

Lines changed: 11 additions & 1 deletion
@@ -4425,9 +4425,19 @@ def grouping(
     :py:meth:`GroupingSet.cube <datafusion.expr.GroupingSet.cube>`, or
     :py:meth:`GroupingSet.grouping_sets <datafusion.expr.GroupingSet.grouping_sets>`,
     where different rows are grouped by different subsets of columns. In a
-    regular ``GROUP BY`` without grouping sets every column is always part
+    default aggregation without grouping sets every column is always part
     of the key, so ``grouping()`` always returns 0.

+    .. warning::
+
+        Due to an upstream DataFusion limitation
+        (`#21411 <https://github.com/apache/datafusion/issues/21411>`_),
+        ``.alias()`` cannot be applied directly to a ``grouping()``
+        expression. Doing so will raise an error at execution time. To
+        rename the column, use
+        :py:meth:`~datafusion.dataframe.DataFrame.with_column_renamed`
+        on the result DataFrame instead.
+
     Args:
         expression: The column to check grouping status for
         distinct: If True, compute on distinct values only