Commit c74ed91

Estimate aggregate output rows using existing NDV statistics (#20926)
## Which issue does this PR close?

Part of #20766

## Rationale for this change

Grouped aggregations currently estimate output rows as input_rows, ignoring available NDV statistics. Spark's AggregateEstimation and Trino's AggregationStatsRule both use NDV products to tighten this estimate. This PR draws heavily on both.

- [Spark reference](https://github.com/apache/spark/blob/e8d8e6a8d040d26aae9571e968e0c64bda0875dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AggregateEstimation.scala#L38-L61)
- [Trino reference](https://github.com/trinodb/trino/blob/43c8c3ba8bff814697c5926149ce13b9532f030b/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java#L92-L101)

## What changes are included in this PR?

- Estimate aggregate output rows as min(input_rows, product(NDV_i + null_adj_i) * grouping_sets)
- Cap the estimate at the Top K limit when one is active, since the output row count cannot exceed K
- Propagate distinct_count from child stats to group-by output columns

## Are these changes tested?

Yes, by existing and new tests that cover different scenarios and edge cases.

## Are there any user-facing changes?

No
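
The estimate described above can be illustrated with a minimal, standalone sketch. It is not the code from this commit: `GroupColumn`, `estimate_group_rows`, and the plain `u64`/`Option` types are hypothetical stand-ins for DataFusion's statistics types. The sketch assumes that `null_adj_i` is 1 when a group-by column contains nulls (nulls form one extra group) and 0 otherwise, and that the estimate falls back to `input_rows` when any NDV is unknown.

```rust
/// Hypothetical per-column statistics for one group-by expression.
#[derive(Debug, Clone, Copy)]
struct GroupColumn {
    /// Number of distinct values; `None` when the statistic is unavailable.
    ndv: Option<u64>,
    /// Whether the column is known to contain nulls (assumed to add one group).
    has_nulls: bool,
}

/// Sketch of min(input_rows, product(NDV_i + null_adj_i) * grouping_sets),
/// optionally capped by a Top K limit.
fn estimate_group_rows(
    input_rows: u64,
    columns: &[GroupColumn],
    grouping_sets: u64,
    top_k: Option<u64>,
) -> u64 {
    // Multiply (NDV_i + null_adj_i) across columns; yields None if any NDV is missing.
    let product = columns.iter().try_fold(1u64, |acc, col| {
        let ndv = col.ndv?;
        let null_adj = u64::from(col.has_nulls);
        // Saturating arithmetic avoids overflow for large NDV products.
        Some(acc.saturating_mul(ndv.saturating_add(null_adj)))
    });

    // Without full NDV information, keep the old estimate of input_rows.
    let mut estimate = match product {
        Some(p) => input_rows.min(p.saturating_mul(grouping_sets)),
        None => input_rows,
    };

    // An active Top K limit bounds the output row count by K.
    if let Some(k) = top_k {
        estimate = estimate.min(k);
    }
    estimate
}
```

For example, with 1000 input rows and two group-by columns with NDVs of 10 (no nulls) and 5 (with nulls), the sketch yields min(1000, 10 * 6) = 60 rows, or 7 rows under a Top K limit of 7.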
1 parent 8a48a87 commit c74ed91

2 files changed

+652
-73
lines changed

datafusion/core/tests/physical_optimizer/partition_statistics.rs

Lines changed: 4 additions & 1 deletion
@@ -935,7 +935,10 @@ mod test {
             num_rows: Precision::Exact(0),
             total_byte_size: Precision::Absent,
             column_statistics: vec![
-                ColumnStatistics::new_unknown(),
+                ColumnStatistics {
+                    distinct_count: Precision::Exact(0),
+                    ..ColumnStatistics::new_unknown()
+                },
                 ColumnStatistics::new_unknown(),
                 ColumnStatistics::new_unknown(),
             ],
