Commit fdd0c92
Nathan Bezualem
Add direct intern() benchmark for vectorized vs row-based GROUP BY
Adds a fair apples-to-apples benchmark that directly calls
GroupValues::intern() with identical Int32 data for both
GroupValuesColumn (vectorized) and GroupValuesRows (row-based).
This eliminates the previous confounding factors (different data types,
SQL/planning overhead) and confirms the regression reported in #17850:
row-based is 16-19% faster at low cardinality (64 groups), with a
crossover at ~200K-500K groups where vectorized becomes faster.
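As a rough illustration of what "directly calling intern()" means, here is a minimal std-only sketch of such a harness. It does not use DataFusion: `ToyGroupValues`, `make_batch`, and `run` are hypothetical names, and the `HashMap`-backed store merely stands in for a real `GroupValues` implementation. The shape (3 Int32 columns, 64 groups, batches fed one at a time) mirrors the setup described above, but the batch count and batch size are arbitrary.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy stand-in for a group-values store (hypothetical, not DataFusion's
/// `GroupValues` trait): maps each distinct key tuple to a dense group id.
struct ToyGroupValues {
    map: HashMap<Vec<i32>, usize>,
}

impl ToyGroupValues {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Assign a group id to every row of the batch, writing ids into `groups`.
    fn intern(&mut self, cols: &[Vec<i32>], groups: &mut Vec<usize>) {
        groups.clear();
        let n_rows = cols.first().map_or(0, |c| c.len());
        for row in 0..n_rows {
            // Row-wise key: one value from each column.
            let key: Vec<i32> = cols.iter().map(|c| c[row]).collect();
            let next_id = self.map.len();
            let id = *self.map.entry(key).or_insert(next_id);
            groups.push(id);
        }
    }

    /// Number of distinct groups seen so far.
    fn len(&self) -> usize {
        self.map.len()
    }
}

/// Build one batch of `n_cols` Int32 columns whose rows cycle through
/// `n_groups` distinct keys (a sequential data pattern).
fn make_batch(n_rows: usize, n_cols: usize, n_groups: usize, offset: usize) -> Vec<Vec<i32>> {
    (0..n_cols)
        .map(|c| {
            (0..n_rows)
                .map(|r| (((offset + r) % n_groups) + c) as i32)
                .collect()
        })
        .collect()
}

/// Feed `n_batches` pre-built batches through `intern`, timing only the
/// intern calls. Returns (distinct groups, elapsed time).
fn run(n_batches: usize, batch_size: usize, n_cols: usize, n_groups: usize) -> (usize, Duration) {
    let batches: Vec<_> = (0..n_batches)
        .map(|b| make_batch(batch_size, n_cols, n_groups, b * batch_size))
        .collect();
    let mut gv = ToyGroupValues::new();
    let mut groups = Vec::with_capacity(batch_size);
    let start = Instant::now();
    for batch in &batches {
        gv.intern(batch, &mut groups);
    }
    (gv.len(), start.elapsed())
}

fn main() {
    // 3 columns, 64 groups; batch count and size are arbitrary here.
    let (n, elapsed) = run(16, 8192, 3, 64);
    println!("{n} groups in {elapsed:?}");
}
```

Building the batches before starting the timer keeps data generation out of the measurement, which is the same idea as removing SQL and planning overhead from the comparison.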
Experiments:
- Issue #17850 reproduction (3 cols, 64 groups, 1M-50M rows)
- Low cardinality sweep (8-4096 groups)
- Batch size sensitivity (1K-32K)
- Column count scaling (2-10 cols)
- High cardinality scaling (1M groups)
- Group count sweep (16 to 1M groups)
- Random vs sequential data patterns
1 parent 135a39b commit fdd0c92
3 files changed
Lines changed: 455 additions & 1 deletion
File tree
- datafusion/physical-plan
- benches
- src/aggregates/group_values