feat: Reduce allocations for aggregating Statistics (#20768)

alamb merged 12 commits into apache:main
I verified this has a 5x speed-up for numeric primitive values using a small benchmark. It felt unnecessary to add the benchmark since it is just a regular vectorization optimization.
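The idea behind the optimization can be sketched roughly like this (a hedged simplification over plain `i64`s; the real code operates on `ScalarValue` and Arrow kernels):

```rust
// Hedged sketch, not the actual DataFusion code: instead of one fallible
// scalar addition per pair of statistics, gather the inner primitive
// values once and sum them in a single loop the compiler can vectorize.

/// Per-pair path: one Option check (and, in the real code, one enum
/// match plus allocations) per element.
fn sum_pairwise(values: &[Option<i64>]) -> Option<i64> {
    values
        .iter()
        .copied()
        .try_fold(0i64, |acc, v| v.map(|v| acc.wrapping_add(v)))
}

/// Gathered path: unwrap once up front, then run a tight loop over
/// plain i64s.
fn sum_gathered(values: &[Option<i64>]) -> Option<i64> {
    let gathered: Option<Vec<i64>> = values.iter().copied().collect();
    gathered.map(|v| v.into_iter().fold(0i64, |a, b| a.wrapping_add(b)))
}
```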
@Dandandan cleaned up the implementation. The performance hit came from using
Dandandan left a comment
I think it looks good, it could benefit though from a benchmark somewhere.
Main benchmark · Speed up benchmark

Added bench. Looking at around a 10-12x improvement.
asolimando left a comment
LGTM, left a couple of minor comments and questions
```rust
($lhs:expr, $rhs:expr, $VARIANT:ident) => {
    match ($lhs, $rhs) {
        (ScalarValue::$VARIANT(Some(a)), ScalarValue::$VARIANT(Some(b))) => {
            Ok(ScalarValue::$VARIANT(Some(a.wrapping_add(*b))))
```
Correct me if I am wrong, but it seems that `wrapping_add` would wrap the sum upon overflow, producing a negative number. I am not sure this is what we want for statistics; I propose we either mark it `Absent` or at least mark it as `Inexact` (which doesn't seem to be happening), as done in `Precision::add`.
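The wrap-around behavior being discussed is easy to demonstrate on a plain integer type:

```rust
fn main() {
    // wrapping_add wraps on overflow: the sum of two large positive
    // values comes back negative, which is misleading for a statistic.
    assert_eq!(i64::MAX.wrapping_add(1), i64::MIN);
    // saturating_add clamps at the type bound instead; combined with
    // marking the value Inexact, that is closer to what Precision::add
    // aims for.
    assert_eq!(i64::MAX.saturating_add(1), i64::MAX);
}
```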
Not in scope for this PR, but on the subject I also think `pub sum_value: Precision<ScalarValue>`, unlike `min` and `max`, should not have matched the data type of the column, but a wide type like `Int64` to minimize the chances of overflow.
> Not in scope for this PR, but on the subject I also think `pub sum_value: Precision<ScalarValue>`, unlike `min` and `max`, should not have matched the data type of the column, but a wide type like `Int64` to minimize the chances of overflow.
Good point, filed in #20826
@asolimando Added the fix! 256 needs to be dealt with differently, so set up a small function for that
Checked the new commit, thanks for addressing the comment and filing the issue, marking as resolved!
EDIT: it looks like I lack privileges; I will let you do that if you want.
asolimando left a comment
LGTM (modulo fixing the cargo doc failure)
```rust
ScalarValue::Float64(_) => add_float!(lhs, rhs, Float64),
ScalarValue::Decimal32(_, _, _) => add_decimal!(lhs, rhs, Decimal32),
ScalarValue::Decimal64(_, _, _) => add_decimal!(lhs, rhs, Decimal64),
ScalarValue::Decimal128(_, _, _) => add_decimal!(lhs, rhs, Decimal128),
```
For `Decimal128(Some(a), p, s)`, saturating to `i128::MAX` doesn't respect the precision constraint. A `Decimal128` with precision 5 can't hold `i128::MAX`.

For example:

```rust
let a = Decimal128(Some(99999), 5, 2); // 999.99
let b = Decimal128(Some(99999), 5, 2); // 999.99
// a + b = 1999.98, which needs precision 6 and no longer fits in (5, 2)
```
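The constraint on the unscaled value can be sketched like this (`fits_precision` is a hypothetical helper for illustration, not a DataFusion API):

```rust
// A Decimal128 with precision p can only hold unscaled values with at
// most p decimal digits, regardless of what fits in the i128 itself.
fn fits_precision(unscaled: i128, precision: u32) -> bool {
    unscaled.unsigned_abs() < 10u128.pow(precision)
}

fn main() {
    let a: i128 = 99_999; // 999.99 at (precision 5, scale 2)
    let sum = a + a; // 199_998, i.e. 1999.98
    assert!(fits_precision(a, 5));
    assert!(!fits_precision(sum, 5)); // the sum needs precision 6
    assert!(!fits_precision(i128::MAX, 5)); // saturating violates p = 5 too
}
```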
```rust
// Licensed to the Apache Software Foundation (ASF) under one
```
Not sure if we can put this at a higher level; it is possible that other nodes may use this code in the future, not only aggregation.
run benchmark sql_planner
```rust
/// converts the result back — 3 heap allocations per call.
///
/// For non-primitive types, falls back to `ScalarValue::add`.
pub(crate) fn scalar_add(lhs: &ScalarValue, rhs: &ScalarValue) -> Result<ScalarValue> {
```
I bet you can make this even faster by making it mutate `lhs` rather than make a new one:

```rust
pub(crate) fn scalar_add(lhs: &mut ScalarValue, rhs: &ScalarValue) -> Result<()> {
```
Filed a tracking ticket
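The suggested mutating variant could look roughly like this (a hedged sketch with a hypothetical `Scalar` stand-in for `ScalarValue`, which lives in `datafusion-common`):

```rust
#[derive(Debug, PartialEq)]
enum Scalar {
    Int64(Option<i64>),
}

// Writing the sum back into lhs avoids constructing a fresh enum value
// (and any allocation it carries) on every merge step.
fn scalar_add_in_place(lhs: &mut Scalar, rhs: &Scalar) {
    match (lhs, rhs) {
        (Scalar::Int64(a), Scalar::Int64(b)) => {
            *a = match (*a, *b) {
                (Some(x), Some(y)) => Some(x.wrapping_add(y)),
                _ => None,
            };
        }
    }
}
```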
```rust
pub(crate) fn precision_add(
    lhs: &Precision<ScalarValue>,
    rhs: &Precision<ScalarValue>,
) -> Precision<ScalarValue> {
```
Ditto here -- reusing lhs is probably even faster
```rust
//!
//! Provides a cheap pairwise [`ScalarValue`] addition that directly
//! extracts inner primitive values, avoiding the expensive
//! `ScalarValue::add` path (which round-trips through Arrow arrays).
```
Why not just add this special case to ScalarValue::add ?
🤖: Benchmark completed
Thanks again @jonathanc-n @Dandandan and @asolimando
## Which issue does this PR close?

- Closes apache#15809.

## Rationale for this change

## What changes are included in this PR?

Vectorize aggregations for combining statistics by gathering all values then calling kernels once.

## Are these changes tested?

Unit tests + existing tests

## Are there any user-facing changes?

Removed `merge_iter`
(#20865)

## Which issue does this PR close?

- Closes #20826.

## Rationale for this change

As discussed in the review thread on #20768 and tracked by #20826, `sum_value` should not keep narrow integer column types during stats aggregation, because merge/multiply paths can overflow before values are widened.

## What changes are included in this PR?

This PR updates statistics `sum_value` arithmetic to match SUM-style widening for small integer types, and applies that behavior consistently across merge and multiplication paths.

## Are these changes tested?

Yes

## Are there any user-facing changes?
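The widening itself amounts to this (a hedged sketch on plain integers; the real change routes through `ScalarValue` arithmetic):

```rust
// Accumulate an i8 column in i64, the way SUM widens small integer
// types, so the running statistics sum cannot overflow early.
fn sum_widened(values: &[i8]) -> i64 {
    values.iter().map(|&v| i64::from(v)).sum()
}
```

For example, summing four `i8::MAX` values yields 508, which would already have overflowed `i8`.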
## Which issue does this PR close?

Follow up of #20768.

## Rationale for this change

`Precision::min/max` allocates a lot of new `ScalarValue`s, and it can be done in place. While running the `sql_planner` benchmark, it seems like for clickbench `Statistics::try_merge_iter` is a significant part of the runtime, and this PR improves that part by about 20-25% locally.

## What changes are included in this PR?

Introduces a couple of new internal functions to calculate the min/max of a `Precision` in-place.

## Are these changes tested?

Existing general tests, and a few new unit tests.

## Are there any user-facing changes?

None

---------

Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
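The in-place min idea can be sketched like this (a hypothetical simplified `Precision` over `i64`; the real type is generic and lives in `datafusion-common`):

```rust
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

// Fold the minimum into the accumulator instead of allocating a new
// Precision per merged pair; exactness downgrades if either side is
// inexact, and is lost entirely if either side is absent.
fn min_in_place(acc: &mut Precision<i64>, other: &Precision<i64>) {
    use Precision::*;
    *acc = match (&*acc, other) {
        (Exact(a), Exact(b)) => Exact(*a.min(b)),
        (Exact(a) | Inexact(a), Exact(b) | Inexact(b)) => Inexact(*a.min(b)),
        _ => Absent,
    };
}
```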