Commit 898d73d

timsaucer and claude authored
Add missing aggregate functions (#1471)

* Add missing aggregate functions: grouping, percentile_cont, var_population

  Expose upstream DataFusion aggregate functions that were not yet available
  in the Python API. Closes #1454.

  - grouping: returns grouping set membership indicator (rewritten by the
    ResolveGroupingFunction analyzer rule before physical planning)
  - percentile_cont: computes exact percentile using continuous interpolation
    (unlike approx_percentile_cont which uses t-digest)
  - var_population: alias for var_pop

* Fix grouping() distinct parameter type for API consistency

* Improve aggregate function tests and docstrings per review feedback

  Add docstring example to grouping(), parametrize percentile_cont tests,
  and add a multi-column grouping test case.

* Add GroupingSet.rollup, .cube, and .grouping_sets factory methods

  Expose ROLLUP, CUBE, and GROUPING SETS via the DataFrame API by adding
  static methods on GroupingSet that construct the corresponding Expr
  variants. Update grouping() docstring and tests to use the new API.

* Remove _GroupingSetInternal alias, use expr_internal.GroupingSet directly

* Parametrize grouping set tests for rollup and cube

* Add grouping sets documentation and note grouping() alias limitation

  Add user documentation for GroupingSet.rollup, .cube, and .grouping_sets
  with Pokemon dataset examples. Document the upstream alias limitation
  (apache/datafusion#21411) in both the grouping() docstring and the
  aggregation user guide.

* Add grouping sets note to DataFrame.aggregate() docstring

* Address PR review feedback: add quantile_cont alias and simplify examples

  - Add quantile_cont as alias for percentile_cont (matches upstream)
  - Replace pa.concat_arrays batch pattern with collect_column() in docstrings
  - Add percentile_cont, quantile_cont, var_population to docs function list

* Accept string column names in GroupingSet factory methods

  GroupingSet.rollup(), .cube(), and .grouping_sets() now accept both Expr
  objects and string column names, consistent with DataFrame.aggregate().

* Add agent instructions to keep aggregation/window docs in sync

* dfn is already available globally

* Remove unnecessary import on doctest

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d07fdb3 commit 898d73d
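The commit message contrasts `percentile_cont` (exact, continuous interpolation) with `approx_percentile_cont` (t-digest based). A minimal pure-Python sketch of the continuous-interpolation rule, i.e. the standard SQL `PERCENTILE_CONT` definition; this is an illustration, not the DataFusion implementation:

```python
def percentile_cont(values, percentile):
    """Exact percentile with linear interpolation between closest ranks,
    following the SQL PERCENTILE_CONT definition."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be between 0 and 1")
    ordered = sorted(values)
    # Fractional row number: 0-indexed position within the sorted list.
    row = percentile * (len(ordered) - 1)
    lower = int(row)    # index of the value at or below the target
    frac = row - lower  # distance toward the next value
    if frac == 0.0:
        return float(ordered[lower])
    # Interpolate linearly between the two neighboring values.
    return ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])
```

For an even-sized list such as `[1, 2, 3, 4]`, the median falls between two values, so the 0.5 percentile interpolates to 2.5 rather than picking one of them, which is exactly the behavior a t-digest approximation can only estimate.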

File tree

8 files changed: +614 −12 lines changed

AGENTS.md

Lines changed: 12 additions & 0 deletions
@@ -42,3 +42,15 @@ Every Python function must include a docstring with usage examples.
 - **Alias functions**: Functions that are simple aliases (e.g., `list_sort` aliasing
   `array_sort`) only need a one-line description and a `See Also` reference to the
   primary function. They do not need their own examples.
+
+## Aggregate and Window Function Documentation
+
+When adding or updating an aggregate or window function, ensure the corresponding
+site documentation is kept in sync:
+
+- **Aggregations**: `docs/source/user-guide/common-operations/aggregations.rst`
+  add new aggregate functions to the "Aggregate Functions" list and include usage
+  examples if appropriate.
+- **Window functions**: `docs/source/user-guide/common-operations/windows.rst`
+  add new window functions to the "Available Functions" list and include usage
+  examples if appropriate.

crates/core/src/expr/grouping_set.rs

Lines changed: 36 additions & 1 deletion
@@ -15,9 +15,11 @@
 // specific language governing permissions and limitations
 // under the License.

-use datafusion::logical_expr::GroupingSet;
+use datafusion::logical_expr::{Expr, GroupingSet};
 use pyo3::prelude::*;

+use crate::expr::PyExpr;
+
 #[pyclass(
     from_py_object,
     frozen,
@@ -30,6 +32,39 @@ pub struct PyGroupingSet {
     grouping_set: GroupingSet,
 }

+#[pymethods]
+impl PyGroupingSet {
+    #[staticmethod]
+    #[pyo3(signature = (*exprs))]
+    fn rollup(exprs: Vec<PyExpr>) -> PyExpr {
+        Expr::GroupingSet(GroupingSet::Rollup(
+            exprs.into_iter().map(|e| e.expr).collect(),
+        ))
+        .into()
+    }
+
+    #[staticmethod]
+    #[pyo3(signature = (*exprs))]
+    fn cube(exprs: Vec<PyExpr>) -> PyExpr {
+        Expr::GroupingSet(GroupingSet::Cube(
+            exprs.into_iter().map(|e| e.expr).collect(),
+        ))
+        .into()
+    }
+
+    #[staticmethod]
+    #[pyo3(signature = (*expr_lists))]
+    fn grouping_sets(expr_lists: Vec<Vec<PyExpr>>) -> PyExpr {
+        Expr::GroupingSet(GroupingSet::GroupingSets(
+            expr_lists
+                .into_iter()
+                .map(|list| list.into_iter().map(|e| e.expr).collect())
+                .collect(),
+        ))
+        .into()
+    }
+}
+
 impl From<PyGroupingSet> for GroupingSet {
     fn from(grouping_set: PyGroupingSet) -> Self {
         grouping_set.grouping_set
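The three factory methods above construct the `Expr::GroupingSet` variants; what distinguishes them is which column subsets each variant ultimately groups by. That expansion can be sketched in pure Python (illustrative only; DataFusion performs this expansion internally during planning):

```python
from itertools import combinations

def rollup_sets(cols):
    # rollup(a, b) -> (a, b), (a), (): each prefix of the column list,
    # from the full detail level down to the grand total.
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    # cube(a, b) -> (a, b), (a), (b), (): every subset of the columns.
    return [s for r in range(len(cols), -1, -1)
            for s in combinations(cols, r)]
```

So `rollup` over n columns yields n + 1 grouping levels, while `cube` yields 2**n, which is why `cube` output grows quickly as columns are added.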

crates/core/src/functions.rs

Lines changed: 19 additions & 4 deletions
@@ -791,9 +791,10 @@ aggregate_function!(var_pop);
 aggregate_function!(approx_distinct);
 aggregate_function!(approx_median);

-// Code is commented out since grouping is not yet implemented
-// https://github.com/apache/datafusion-python/issues/861
-// aggregate_function!(grouping);
+// The grouping function's physical plan is not implemented, but the
+// ResolveGroupingFunction analyzer rule rewrites it before the physical
+// planner sees it, so it works correctly at runtime.
+aggregate_function!(grouping);

 #[pyfunction]
 #[pyo3(signature = (sort_expression, percentile, num_centroids=None, filter=None))]
@@ -831,6 +832,19 @@ pub fn approx_percentile_cont_with_weight(
     add_builder_fns_to_aggregate(agg_fn, None, filter, None, None)
 }

+#[pyfunction]
+#[pyo3(signature = (sort_expression, percentile, filter=None))]
+pub fn percentile_cont(
+    sort_expression: PySortExpr,
+    percentile: f64,
+    filter: Option<PyExpr>,
+) -> PyDataFusionResult<PyExpr> {
+    let agg_fn =
+        functions_aggregate::expr_fn::percentile_cont(sort_expression.sort, lit(percentile));
+
+    add_builder_fns_to_aggregate(agg_fn, None, filter, None, None)
+}
+
 // We handle last_value explicitly because the signature expects an order_by
 // https://github.com/apache/datafusion/issues/12376
 #[pyfunction]
@@ -1031,6 +1045,7 @@ pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
     m.add_wrapped(wrap_pyfunction!(approx_median))?;
     m.add_wrapped(wrap_pyfunction!(approx_percentile_cont))?;
     m.add_wrapped(wrap_pyfunction!(approx_percentile_cont_with_weight))?;
+    m.add_wrapped(wrap_pyfunction!(percentile_cont))?;
     m.add_wrapped(wrap_pyfunction!(range))?;
     m.add_wrapped(wrap_pyfunction!(array_agg))?;
     m.add_wrapped(wrap_pyfunction!(arrow_typeof))?;
@@ -1080,7 +1095,7 @@ pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
     m.add_wrapped(wrap_pyfunction!(from_unixtime))?;
     m.add_wrapped(wrap_pyfunction!(gcd))?;
     m.add_wrapped(wrap_pyfunction!(greatest))?;
-    // m.add_wrapped(wrap_pyfunction!(grouping))?;
+    m.add_wrapped(wrap_pyfunction!(grouping))?;
     m.add_wrapped(wrap_pyfunction!(in_list))?;
     m.add_wrapped(wrap_pyfunction!(initcap))?;
     m.add_wrapped(wrap_pyfunction!(isnan))?;
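The `grouping` function enabled here is rewritten by the ResolveGroupingFunction analyzer rule before physical planning. Its per-row semantics are simple and can be sketched in plain Python (a conceptual model, not library code):

```python
def grouping(column, grouping_set):
    """SQL GROUPING() semantics: 0 when `column` is a grouping key for the
    row's grouping set, 1 when the column was aggregated across."""
    return 0 if column in grouping_set else 1

# For rollup("Type 1", "Type 2") the three grouping levels are:
levels = [("Type 1", "Type 2"), ("Type 1",), ()]
indicators = [(grouping("Type 1", s), grouping("Type 2", s)) for s in levels]
# detail rows -> (0, 0); per-"Type 1" subtotals -> (0, 1); grand total -> (1, 1)
```

This is what lets a query tell a subtotal row's `null` apart from a genuine `null` value in the data.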

docs/source/user-guide/common-operations/aggregations.rst

Lines changed: 171 additions & 1 deletion
@@ -163,6 +163,168 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v
         f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])

+Grouping Sets
+-------------
+
+The default style of aggregation produces one row per group. Sometimes you want a single query to
+produce rows at multiple levels of detail — for example, totals per type *and* an overall grand
+total, or subtotals for every combination of two columns plus the individual column totals. Writing
+separate queries and concatenating them is tedious and runs the data multiple times. Grouping sets
+solve this by letting you specify several grouping levels in one pass.
+
+DataFusion supports three grouping set styles through the
+:py:class:`~datafusion.expr.GroupingSet` class:
+
+- :py:meth:`~datafusion.expr.GroupingSet.rollup` — hierarchical subtotals, like a drill-down report
+- :py:meth:`~datafusion.expr.GroupingSet.cube` — every possible subtotal combination, like a pivot table
+- :py:meth:`~datafusion.expr.GroupingSet.grouping_sets` — explicitly list exactly which grouping levels you want
+
+Because result rows come from different grouping levels, a column that is *not* part of a
+particular level will be ``null`` in that row. Use :py:func:`~datafusion.functions.grouping` to
+distinguish a real ``null`` in the data from one that means "this column was aggregated across."
+It returns ``0`` when the column is a grouping key for that row, and ``1`` when it is not.
+
+Rollup
+^^^^^^
+
+:py:meth:`~datafusion.expr.GroupingSet.rollup` creates a hierarchy. ``rollup(a, b)`` produces
+grouping sets ``(a, b)``, ``(a)``, and ``()`` — like nested subtotals in a report. This is useful
+when your columns have a natural hierarchy, such as region → city or type → subtype.
+
+Suppose we want to summarize Pokemon stats by ``Type 1`` with subtotals and a grand total. With
+the default aggregation style we would need two separate queries. With ``rollup`` we get it all at
+once:
+
+.. ipython:: python
+
+    from datafusion.expr import GroupingSet
+
+    df.aggregate(
+        [GroupingSet.rollup(col_type_1)],
+        [f.count(col_speed).alias("Count"),
+         f.avg(col_speed).alias("Avg Speed"),
+         f.max(col_speed).alias("Max Speed")]
+    ).sort(col_type_1.sort(ascending=True, nulls_first=True))
+
+The first row — where ``Type 1`` is ``null`` — is the grand total across all types. But how do you
+tell a grand-total ``null`` apart from a Pokemon that genuinely has no type? The
+:py:func:`~datafusion.functions.grouping` function returns ``0`` when the column is a grouping key
+for that row and ``1`` when it is aggregated across.
+
+.. note::
+
+    Due to an upstream DataFusion limitation
+    (`apache/datafusion#21411 <https://github.com/apache/datafusion/issues/21411>`_),
+    ``.alias()`` cannot be applied directly to a ``grouping()`` expression — it will raise an
+    error at execution time. Instead, use
+    :py:meth:`~datafusion.dataframe.DataFrame.with_column_renamed` on the result DataFrame to
+    give the column a readable name. Once the upstream issue is resolved, you will be able to
+    use ``.alias()`` directly and the workaround below will no longer be necessary.
+
+The raw column name generated by ``grouping()`` contains internal identifiers, so we use
+:py:meth:`~datafusion.dataframe.DataFrame.with_column_renamed` to clean it up:
+
+.. ipython:: python
+
+    result = df.aggregate(
+        [GroupingSet.rollup(col_type_1)],
+        [f.count(col_speed).alias("Count"),
+         f.avg(col_speed).alias("Avg Speed"),
+         f.grouping(col_type_1)]
+    )
+    for field in result.schema():
+        if field.name.startswith("grouping("):
+            result = result.with_column_renamed(field.name, "Is Total")
+    result.sort(col_type_1.sort(ascending=True, nulls_first=True))
+
+With two columns the hierarchy becomes more apparent. ``rollup(Type 1, Type 2)`` produces:
+
+- one row per ``(Type 1, Type 2)`` pair — the most detailed level
+- one row per ``Type 1`` — subtotals
+- one grand total row
+
+.. ipython:: python
+
+    df.aggregate(
+        [GroupingSet.rollup(col_type_1, col_type_2)],
+        [f.count(col_speed).alias("Count"),
+         f.avg(col_speed).alias("Avg Speed")]
+    ).sort(
+        col_type_1.sort(ascending=True, nulls_first=True),
+        col_type_2.sort(ascending=True, nulls_first=True)
+    )
+
+Cube
+^^^^
+
+:py:meth:`~datafusion.expr.GroupingSet.cube` produces every possible subset. ``cube(a, b)``
+produces grouping sets ``(a, b)``, ``(a)``, ``(b)``, and ``()`` — one more than ``rollup`` because
+it also includes ``(b)`` alone. This is useful when neither column is "above" the other in a
+hierarchy and you want all cross-tabulations.
+
+For our Pokemon data, ``cube(Type 1, Type 2)`` gives us stats broken down by the type pair,
+by ``Type 1`` alone, by ``Type 2`` alone, and a grand total — all in one query:
+
+.. ipython:: python
+
+    df.aggregate(
+        [GroupingSet.cube(col_type_1, col_type_2)],
+        [f.count(col_speed).alias("Count"),
+         f.avg(col_speed).alias("Avg Speed")]
+    ).sort(
+        col_type_1.sort(ascending=True, nulls_first=True),
+        col_type_2.sort(ascending=True, nulls_first=True)
+    )
+
+Compared to the ``rollup`` example above, notice the extra rows where ``Type 1`` is ``null`` but
+``Type 2`` has a value — those are the per-``Type 2`` subtotals that ``rollup`` does not include.
+
+Explicit Grouping Sets
+^^^^^^^^^^^^^^^^^^^^^^
+
+:py:meth:`~datafusion.expr.GroupingSet.grouping_sets` lets you list exactly which grouping levels
+you need when ``rollup`` or ``cube`` would produce too many or too few. Each argument is a list of
+columns forming one grouping set.
+
+For example, if we want only the per-``Type 1`` totals and per-``Type 2`` totals — but *not* the
+full ``(Type 1, Type 2)`` detail rows or the grand total — we can ask for exactly that:
+
+.. ipython:: python
+
+    df.aggregate(
+        [GroupingSet.grouping_sets([col_type_1], [col_type_2])],
+        [f.count(col_speed).alias("Count"),
+         f.avg(col_speed).alias("Avg Speed")]
+    ).sort(
+        col_type_1.sort(ascending=True, nulls_first=True),
+        col_type_2.sort(ascending=True, nulls_first=True)
+    )
+
+Each row belongs to exactly one grouping level. The :py:func:`~datafusion.functions.grouping`
+function tells you which level each row comes from:
+
+.. ipython:: python
+
+    result = df.aggregate(
+        [GroupingSet.grouping_sets([col_type_1], [col_type_2])],
+        [f.count(col_speed).alias("Count"),
+         f.avg(col_speed).alias("Avg Speed"),
+         f.grouping(col_type_1),
+         f.grouping(col_type_2)]
+    )
+    for field in result.schema():
+        if field.name.startswith("grouping("):
+            clean = field.name.split(".")[-1].rstrip(")")
+            result = result.with_column_renamed(field.name, f"grouping({clean})")
+    result.sort(
+        col_type_1.sort(ascending=True, nulls_first=True),
+        col_type_2.sort(ascending=True, nulls_first=True)
+    )
+
+Where ``grouping(Type 1)`` is ``0`` the row is a per-``Type 1`` total (and ``Type 2`` is ``null``).
+Where ``grouping(Type 2)`` is ``0`` the row is a per-``Type 2`` total (and ``Type 1`` is ``null``).
+
 Aggregate Functions
 -------------------

@@ -192,6 +354,7 @@ The available aggregate functions are:
    - :py:func:`datafusion.functions.stddev_pop`
    - :py:func:`datafusion.functions.var_samp`
    - :py:func:`datafusion.functions.var_pop`
+   - :py:func:`datafusion.functions.var_population`
 6. Linear Regression Functions
    - :py:func:`datafusion.functions.regr_count`
    - :py:func:`datafusion.functions.regr_slope`
@@ -208,9 +371,16 @@ The available aggregate functions are:
    - :py:func:`datafusion.functions.nth_value`
 8. String Functions
    - :py:func:`datafusion.functions.string_agg`
-9. Approximation Functions
+9. Percentile Functions
+   - :py:func:`datafusion.functions.percentile_cont`
+   - :py:func:`datafusion.functions.quantile_cont`
    - :py:func:`datafusion.functions.approx_distinct`
    - :py:func:`datafusion.functions.approx_median`
    - :py:func:`datafusion.functions.approx_percentile_cont`
    - :py:func:`datafusion.functions.approx_percentile_cont_with_weight`
+10. Grouping Set Functions
+    - :py:func:`datafusion.functions.grouping`
+    - :py:meth:`datafusion.expr.GroupingSet.rollup`
+    - :py:meth:`datafusion.expr.GroupingSet.cube`
+    - :py:meth:`datafusion.expr.GroupingSet.grouping_sets`

python/datafusion/dataframe.py

Lines changed: 15 additions & 1 deletion
@@ -633,8 +633,22 @@ def aggregate(
     ) -> DataFrame:
         """Aggregates the rows of the current DataFrame.

+        By default each unique combination of the ``group_by`` columns
+        produces one row. To get multiple levels of subtotals in a
+        single pass, pass a
+        :py:class:`~datafusion.expr.GroupingSet` expression
+        (created via
+        :py:meth:`~datafusion.expr.GroupingSet.rollup`,
+        :py:meth:`~datafusion.expr.GroupingSet.cube`, or
+        :py:meth:`~datafusion.expr.GroupingSet.grouping_sets`)
+        as the ``group_by`` argument. See the
+        :ref:`aggregation` user guide for detailed examples.
+
         Args:
-            group_by: Sequence of expressions or column names to group by.
+            group_by: Sequence of expressions or column names to group
+                by. A :py:class:`~datafusion.expr.GroupingSet`
+                expression may be included to produce multiple grouping
+                levels (rollup, cube, or explicit grouping sets).
             aggs: Sequence of expressions to aggregate.

         Returns:
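To illustrate what passing an explicit grouping-sets expression as `group_by` yields, here is a pure-Python sketch that aggregates once per requested grouping level. It is a conceptual model over assumed toy data, not the DataFusion engine:

```python
from collections import defaultdict

def aggregate_grouping_sets(rows, grouping_sets, value):
    """Aggregate `rows` once per requested grouping set, like
    GroupingSet.grouping_sets([a], [b]): columns outside a row's
    grouping set come back as None."""
    all_cols = {c for gs in grouping_sets for c in gs}
    out = []
    for gs in grouping_sets:
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[c] for c in gs)].append(row[value])
        for key, vals in sorted(groups.items()):
            result = {c: None for c in all_cols}  # aggregated-across columns
            result.update(dict(zip(gs, key)))
            result["Count"] = len(vals)
            out.append(result)
    return out

# Assumed toy rows; only per-"Type 1" and per-"Type 2" totals are requested,
# so no (Type 1, Type 2) detail rows and no grand total appear.
rows = [
    {"Type 1": "Fire", "Type 2": "Flying", "Speed": 100},
    {"Type 1": "Fire", "Type 2": "Dragon", "Speed": 65},
    {"Type 1": "Water", "Type 2": "Flying", "Speed": 70},
]
levels = aggregate_grouping_sets(rows, [("Type 1",), ("Type 2",)], "Speed")
```

Each output row belongs to exactly one requested level, matching the "Explicit Grouping Sets" section of the aggregation guide.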
