
Commit fdd0c92

Nathan Bezualem committed
Add direct intern() benchmark for vectorized vs row-based GROUP BY
Adds a fair apples-to-apples benchmark that directly calls GroupValues::intern() with identical Int32 data for both GroupValuesColumn (vectorized) and GroupValuesRows (row-based). This eliminates the previous confounding factors (different data types, SQL/planning overhead) and confirms the regression reported in #17850: row-based is 16-19% faster at low cardinality (64 groups), with a crossover at ~200K-500K groups where vectorized becomes faster.

Experiments:
- Issue #17850 reproduction (3 cols, 64 groups, 1M-50M rows)
- Low cardinality sweep (8-4096 groups)
- Batch size sensitivity (1K-32K)
- Column count scaling (2-10 cols)
- High cardinality scaling (1M groups)
- Group count sweep (16 to 1M groups)
- Random vs sequential data patterns
1 parent 135a39b commit fdd0c92

3 files changed

Lines changed: 455 additions & 1 deletion


datafusion/physical-plan/Cargo.toml

Lines changed: 4 additions & 0 deletions
@@ -116,3 +116,7 @@ required-features = ["test_utils"]
 [[bench]]
 harness = false
 name = "dictionary_group_values"
+
+[[bench]]
+harness = false
+name = "multi_group_by"
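
For readers unfamiliar with `harness = false`: it tells Cargo not to link the default libtest harness, so the bench file must supply its own `main`. The following is a minimal, std-only sketch of what such a file might look like. It is not the benchmark added in this commit and does not use DataFusion's `GroupValues` API. `intern_rows` and `make_columns` are hypothetical helpers that only mimic the shape of the #17850 reproduction (3 Int32 columns, 64 groups) with a plain `HashMap`.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Toy stand-in for a group-by "intern" step: maps each row of three
/// Int32 columns to a dense group index, assigning new indices in
/// first-seen order. This is NOT DataFusion's GroupValues implementation.
fn intern_rows(cols: &[Vec<i32>; 3], seen: &mut HashMap<[i32; 3], usize>) -> Vec<usize> {
    let n = cols[0].len();
    let mut groups = Vec::with_capacity(n);
    for i in 0..n {
        let key = [cols[0][i], cols[1][i], cols[2][i]];
        // Index a new key would receive; ignored if the key already exists.
        let next = seen.len();
        let idx = *seen.entry(key).or_insert(next);
        groups.push(idx);
    }
    groups
}

/// Builds three columns whose combined values cycle through `num_groups`
/// distinct keys (hypothetical data layout, chosen for illustration).
fn make_columns(num_rows: usize, num_groups: usize) -> [Vec<i32>; 3] {
    let g = |i: usize| (i % num_groups) as i32;
    [
        (0..num_rows).map(|i| g(i)).collect(),
        (0..num_rows).map(|i| g(i) * 2).collect(),
        (0..num_rows).map(|i| g(i) * 3).collect(),
    ]
}

// With `harness = false`, Cargo runs this `main` directly for `cargo bench`.
fn main() {
    let cols = make_columns(1_000_000, 64);
    let mut seen = HashMap::new();
    let start = Instant::now();
    let groups = intern_rows(&cols, &mut seen);
    let elapsed = start.elapsed();
    println!("{} rows -> {} groups in {:?}", groups.len(), seen.len(), elapsed);
}
```

A single timed pass like this is only a rough illustration. A real comparison would repeat the measurement and guard against the optimizer discarding results, which is what a dedicated benchmarking framework handles.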
