Commit 135a39b

Nathan Bezualem

and

committed

Address review comments: compare vectorized vs row-based, fix planning overhead, exact NDV

Addresses all three review comments from @kosiew:

1. **Implementation comparison**: Benchmarks both GroupValuesColumn (vectorized, via Int32 columns) and GroupValuesRows (row-based, via FixedSizeBinary(4) columns that trigger the fallback path) side by side.
2. **Execution-only timing**: Pre-optimizes the logical plan once via `df.into_parts()`. Each benchmark iteration performs only physical planning and execution, excluding SQL parsing and logical optimization.
3. **Exact cardinality**: Replaces random sampling with sequential enumeration (`global_row % num_distinct_groups`, decomposed per column), guaranteeing precise distinct group counts with no birthday-paradox error.

Additionally, motivated by #17850, adds comprehensive experiments:

- Issue #17850 regression reproduction (3 cols, 64 groups, 1M-50M rows)
- Low cardinality sweep (8-4096 groups)
- Batch size sensitivity (1K-32K)
- Column count scaling (2-10 cols, low and high cardinality)
- Group count sweep (16 to 1M groups)
- Random vs sequential data patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
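The exact-cardinality approach in item 3 can be illustrated with a small standalone sketch. This is not the code in `multi_group_by.rs`; the helper names (`per_column_cardinality`, `decompose`, `count_distinct_groups`) and the choice of a uniform per-column radix are assumptions made for illustration only. The idea is that each row's group id is `global_row % num_distinct_groups`, and that id is then split into one value per column, like the digits of a mixed-radix number, so distinct ids always map to distinct column tuples.

```rust
use std::collections::HashSet;

/// Hypothetical helper: smallest per-column cardinality `c` such that
/// `c^num_cols >= num_groups`, so every group id has a unique tuple.
fn per_column_cardinality(num_groups: u64, num_cols: u32) -> u64 {
    let mut card = 1u64;
    while card.saturating_pow(num_cols) < num_groups {
        card += 1;
    }
    card
}

/// Hypothetical helper: split a group id into `num_cols` base-`card` digits,
/// least significant digit first. Distinct ids below `card^num_cols`
/// yield distinct tuples.
fn decompose(group_id: u64, card: u64, num_cols: usize) -> Vec<i32> {
    let mut rem = group_id;
    (0..num_cols)
        .map(|_| {
            let digit = (rem % card) as i32;
            rem /= card;
            digit
        })
        .collect()
}

/// Enumerate rows sequentially and count the distinct column tuples produced.
fn count_distinct_groups(num_rows: u64, num_groups: u64, num_cols: usize) -> usize {
    let card = per_column_cardinality(num_groups, num_cols as u32);
    let mut seen = HashSet::new();
    for global_row in 0..num_rows {
        // Sequential enumeration: the group id cycles through 0..num_groups.
        seen.insert(decompose(global_row % num_groups, card, num_cols));
    }
    seen.len()
}

fn main() {
    // 3 columns and 64 groups, mirroring the shape of the #17850 reproduction.
    let distinct = count_distinct_groups(10_000, 64, 3);
    println!("distinct groups: {distinct}");
}
```

Because the group id cycles deterministically, any row count of at least `num_distinct_groups` produces exactly that many groups, whereas random sampling can miss some groups by chance.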

1 parent d25b053 commit 135a39b

1 file changed

datafusion/core/benches/multi_group_by.rs
