@@ -163,6 +163,168 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v
163163 f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])
164164
165165
166+ Grouping Sets
167+ -------------
168+
169+ The default style of aggregation produces one row per group. Sometimes you want a single query to
170+ produce rows at multiple levels of detail, for example totals per type *and* an overall grand
171+ total, or subtotals for every combination of two columns plus the individual column totals. Writing
172+ separate queries and concatenating them is tedious and scans the data multiple times. Grouping sets
173+ solve this by letting you specify several grouping levels in a single query.
174+
175+ DataFusion supports three grouping set styles through the
176+ :py:class:`~datafusion.expr.GroupingSet` class:
177+
178+ - :py:meth:`~datafusion.expr.GroupingSet.rollup`: hierarchical subtotals, like a drill-down report
179+ - :py:meth:`~datafusion.expr.GroupingSet.cube`: every possible subtotal combination, like a pivot table
180+ - :py:meth:`~datafusion.expr.GroupingSet.grouping_sets`: an explicit list of exactly the grouping levels you want
181+
182+ Because result rows come from different grouping levels, a column that is *not* part of a
183+ particular level will be ``null`` in that row. Use :py:func:`~datafusion.functions.grouping` to
184+ distinguish a real ``null`` in the data from one that means "this column was aggregated across."
185+ It returns ``0`` when the column is a grouping key for that row, and ``1`` when it is not.
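
The behaviour described above can be sketched in plain Python. This is only an illustration of the semantics, not how DataFusion implements them; the ``aggregate_grouping_sets`` helper and the sample rows are hypothetical.

```python
from collections import defaultdict


def aggregate_grouping_sets(rows, columns, grouping_sets):
    """Count rows once per grouping set and attach a grouping flag per column."""
    out = []
    for active in grouping_sets:
        counts = defaultdict(int)
        for row in rows:
            # Columns outside the active grouping set are replaced with None.
            key = tuple(row[c] if c in active else None for c in columns)
            counts[key] += 1
        # 0 = the column is a grouping key at this level, 1 = aggregated across.
        flags = tuple(0 if c in active else 1 for c in columns)
        for key, count in counts.items():
            out.append({"key": key, "grouping": flags, "count": count})
    return out


rows = [
    {"type": "Fire"},
    {"type": "Fire"},
    {"type": None},  # a genuine null in the data
]

# Two levels: per-type rows and a grand total.
result = aggregate_grouping_sets(rows, ["type"], [("type",), ()])
for r in result:
    print(r)
```

Two of the result rows have ``None`` as their key. Only the grouping flag tells them apart: the genuine null carries flag ``0``, the grand total carries flag ``1``.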
186+
187+ Rollup
188+ ^^^^^^
189+
190+ :py:meth:`~datafusion.expr.GroupingSet.rollup` creates a hierarchy. ``rollup(a, b)`` produces
191+ grouping sets ``(a, b)``, ``(a)``, and ``()``, like nested subtotals in a report. This is useful
192+ when your columns have a natural hierarchy, such as region → city or type → subtype.
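
One way to picture the hierarchy is that ``rollup`` keeps every prefix of the column list, from longest to shortest. The plain-Python sketch below only enumerates those grouping sets; ``rollup_sets`` is a hypothetical helper for illustration, not part of the DataFusion API.

```python
def rollup_sets(*columns):
    """Return the grouping sets implied by a rollup: every prefix, longest first."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


print(rollup_sets("a", "b"))
```

For ``n`` columns this yields ``n + 1`` grouping sets, which is why a rollup grows slowly as columns are added.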
193+
194+ Suppose we want to summarize Pokemon stats by ``Type 1``, with one row per type plus a grand total. With
195+ the default aggregation style we would need two separate queries. With ``rollup`` we get both at
196+ once:
197+
198+ .. ipython:: python
199+
200+     from datafusion.expr import GroupingSet
201+
202+     df.aggregate(
203+         [GroupingSet.rollup(col_type_1)],
204+         [f.count(col_speed).alias("Count"),
205+          f.avg(col_speed).alias("Avg Speed"),
206+          f.max(col_speed).alias("Max Speed")]
207+     ).sort(col_type_1.sort(ascending=True, nulls_first=True))
208+
209+ The first row, where ``Type 1`` is ``null``, is the grand total across all types. But how do you
210+ tell a grand-total ``null`` apart from a Pokemon that genuinely has no type? The
211+ :py:func:`~datafusion.functions.grouping` function returns ``0`` when the column is a grouping key
212+ for that row and ``1`` when it is aggregated across.
213+
214+ .. note::
215+
216+     Due to an upstream DataFusion limitation
217+     (`apache/datafusion#21411 <https://github.com/apache/datafusion/issues/21411>`_),
218+     ``.alias()`` cannot be applied directly to a ``grouping()`` expression; doing so raises an
219+     error at execution time. Instead, use
220+     :py:meth:`~datafusion.dataframe.DataFrame.with_column_renamed` on the result DataFrame to
221+     give the column a readable name. Once the upstream issue is resolved, you will be able to
222+     use ``.alias()`` directly and the workaround below will no longer be necessary.
223+
224+ The raw column name generated by ``grouping()`` contains internal identifiers, so we use
225+ :py:meth:`~datafusion.dataframe.DataFrame.with_column_renamed` to clean it up:
226+
227+ .. ipython:: python
228+
229+     result = df.aggregate(
230+         [GroupingSet.rollup(col_type_1)],
231+         [f.count(col_speed).alias("Count"),
232+          f.avg(col_speed).alias("Avg Speed"),
233+          f.grouping(col_type_1)]
234+     )
235+     for field in result.schema():
236+         if field.name.startswith("grouping("):
237+             result = result.with_column_renamed(field.name, "Is Total")
238+     result.sort(col_type_1.sort(ascending=True, nulls_first=True))
239+
240+ With two columns the hierarchy becomes more apparent. ``rollup(Type 1, Type 2)`` produces:
241+
242+ - one row per ``(Type 1, Type 2)`` pair, the most detailed level
243+ - one row per ``Type 1``, the subtotals
244+ - one grand total row
245+
246+ .. ipython:: python
247+
248+     df.aggregate(
249+         [GroupingSet.rollup(col_type_1, col_type_2)],
250+         [f.count(col_speed).alias("Count"),
251+          f.avg(col_speed).alias("Avg Speed")]
252+     ).sort(
253+         col_type_1.sort(ascending=True, nulls_first=True),
254+         col_type_2.sort(ascending=True, nulls_first=True)
255+     )
256+
257+ Cube
258+ ^^^^
259+
260+ :py:meth:`~datafusion.expr.GroupingSet.cube` produces a grouping set for every possible subset of
261+ its columns. ``cube(a, b)`` produces grouping sets ``(a, b)``, ``(a)``, ``(b)``, and ``()``, one more
262+ than ``rollup`` because it also includes ``(b)`` alone. This is useful when neither column is "above"
263+ the other in a hierarchy and you want all cross-tabulations.
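
The "every possible subset" rule can be enumerated in plain Python with ``itertools.combinations``. As before, ``cube_sets`` is a hypothetical helper used only to list the grouping sets; it is not a DataFusion function.

```python
from itertools import combinations


def cube_sets(*columns):
    """Return the grouping sets implied by a cube: every subset of the columns."""
    sets = []
    # Walk from the most detailed level (all columns) down to the grand total.
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


print(cube_sets("a", "b"))
```

For ``n`` columns this yields ``2 ** n`` grouping sets, so a cube over many columns can produce far more rows than a rollup over the same columns.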
264+
265+ For our Pokemon data, ``cube(Type 1, Type 2)`` gives us stats broken down by the type pair,
266+ by ``Type 1`` alone, by ``Type 2`` alone, and a grand total, all in one query:
267+
268+ .. ipython:: python
269+
270+     df.aggregate(
271+         [GroupingSet.cube(col_type_1, col_type_2)],
272+         [f.count(col_speed).alias("Count"),
273+          f.avg(col_speed).alias("Avg Speed")]
274+     ).sort(
275+         col_type_1.sort(ascending=True, nulls_first=True),
276+         col_type_2.sort(ascending=True, nulls_first=True)
277+     )
278+
279+ Compared to the ``rollup`` example above, notice the extra rows where ``Type 1`` is ``null`` but
280+ ``Type 2`` has a value. Those are the per-``Type 2`` subtotals that ``rollup`` does not include.
281+
282+ Explicit Grouping Sets
283+ ^^^^^^^^^^^^^^^^^^^^^^
284+
285+ :py:meth:`~datafusion.expr.GroupingSet.grouping_sets` lets you list exactly which grouping levels
286+ you need when ``rollup`` or ``cube`` would produce too many or too few. Each argument is a list of
287+ columns forming one grouping set.
288+
289+ For example, if we want only the per-``Type 1`` totals and per-``Type 2`` totals, but *not* the
290+ full ``(Type 1, Type 2)`` detail rows or the grand total, we can ask for exactly that:
291+
292+ .. ipython:: python
293+
294+     df.aggregate(
295+         [GroupingSet.grouping_sets([col_type_1], [col_type_2])],
296+         [f.count(col_speed).alias("Count"),
297+          f.avg(col_speed).alias("Avg Speed")]
298+     ).sort(
299+         col_type_1.sort(ascending=True, nulls_first=True),
300+         col_type_2.sort(ascending=True, nulls_first=True)
301+     )
302+
303+ Each row belongs to exactly one grouping level. The :py:func:`~datafusion.functions.grouping`
304+ function tells you which level each row comes from:
305+
306+ .. ipython:: python
307+
308+     result = df.aggregate(
309+         [GroupingSet.grouping_sets([col_type_1], [col_type_2])],
310+         [f.count(col_speed).alias("Count"),
311+          f.avg(col_speed).alias("Avg Speed"),
312+          f.grouping(col_type_1),
313+          f.grouping(col_type_2)]
314+     )
315+     for field in result.schema():
316+         if field.name.startswith("grouping("):
317+             clean = field.name.split(".")[-1].rstrip(")")
318+             result = result.with_column_renamed(field.name, f"grouping({clean})")
319+     result.sort(
320+         col_type_1.sort(ascending=True, nulls_first=True),
321+         col_type_2.sort(ascending=True, nulls_first=True)
322+     )
323+
324+ Where ``grouping(Type 1)`` is ``0`` the row is a per-``Type 1`` total (and ``Type 2`` is ``null``).
325+ Where ``grouping(Type 2)`` is ``0`` the row is a per-``Type 2`` total (and ``Type 1`` is ``null``).
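
Reading the flags this way generalizes to any number of grouping columns: the columns whose flag is ``0`` name the level a row belongs to. A minimal plain-Python sketch, assuming the flags for one row have already been collected into a tuple (``grouping_level`` is a hypothetical helper, not a DataFusion function):

```python
def grouping_level(columns, flags):
    """Name the grouping level of one result row from its grouping() flags."""
    # A flag of 0 means the column is a grouping key for this row.
    keys = [c for c, flag in zip(columns, flags) if flag == 0]
    return "grand total" if not keys else "per " + ", ".join(keys)


columns = ("Type 1", "Type 2")
print(grouping_level(columns, (0, 1)))
print(grouping_level(columns, (1, 0)))
```

A label like this could be attached after collecting results, for instance to filter a report down to one level.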
326+
327+
166328 Aggregate Functions
167329 -------------------
168330
@@ -213,4 +375,9 @@ The available aggregate functions are:
213375 - :py:func:`datafusion.functions.approx_median`
214376 - :py:func:`datafusion.functions.approx_percentile_cont`
215377 - :py:func:`datafusion.functions.approx_percentile_cont_with_weight`
378+ 10. Grouping Set Functions
379+ - :py:func:`datafusion.functions.grouping`
380+ - :py:meth:`datafusion.expr.GroupingSet.rollup`
381+ - :py:meth:`datafusion.expr.GroupingSet.cube`
382+ - :py:meth:`datafusion.expr.GroupingSet.grouping_sets`
216383