Commit 135a39b
Address review comments: compare vectorized vs row-based, fix planning overhead, exact NDV
Addresses all three review comments from @kosiew:
1. **Implementation comparison**: Benchmarks both GroupValuesColumn (vectorized,
via Int32 columns) and GroupValuesRows (row-based, via FixedSizeBinary(4)
columns that trigger the fallback path) side-by-side.
2. **Execution-only timing**: Pre-optimizes the logical plan once via
`df.into_parts()`. Each benchmark iteration only does physical planning +
execution, excluding SQL parsing and logical optimization.
3. **Exact cardinality**: Replaces random sampling with sequential enumeration
(`global_row % num_distinct_groups` decomposed per-column), guaranteeing
precise distinct group counts with no birthday-paradox error.
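The sequential enumeration in item 3 can be sketched as a mixed-radix decomposition. This is a minimal illustration, not the benchmark's actual code: the helper names `group_key` and `count_distinct_groups` and the `cardinalities` slice (distinct values per column, all assumed nonzero) are hypothetical, and `num_distinct_groups` is assumed to be the product of the per-column cardinalities.

```rust
use std::collections::HashSet;

/// Map a row index to one value per group-by column.
/// `cardinalities[i]` is the (hypothetical) number of distinct values in column i;
/// every entry is assumed to be nonzero.
fn group_key(global_row: u64, cardinalities: &[u64]) -> Vec<u64> {
    // Assumption: the total group count is the product of per-column cardinalities.
    let num_distinct_groups: u64 = cardinalities.iter().product();
    // Wrap the row index into the group space, then peel off one
    // mixed-radix digit per column.
    let mut rem = global_row % num_distinct_groups;
    cardinalities
        .iter()
        .map(|&c| {
            let digit = rem % c;
            rem /= c;
            digit
        })
        .collect()
}

/// Count the distinct keys actually produced by the first `num_rows` rows.
fn count_distinct_groups(num_rows: u64, cardinalities: &[u64]) -> usize {
    (0..num_rows)
        .map(|row| group_key(row, cardinalities))
        .collect::<HashSet<_>>()
        .len()
}
```

Under these assumptions, every key is visited once per cycle of `num_distinct_groups` rows, so `count_distinct_groups(num_rows, ..)` equals `min(num_rows, num_distinct_groups)` exactly, for example 64 for three columns of cardinality 4 and any row count of at least 64.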
Additionally, motivated by #17850, adds comprehensive experiments:
- Issue #17850 regression reproduction (3 cols, 64 groups, 1M-50M rows)
- Low cardinality sweep (8-4096 groups)
- Batch size sensitivity (1K-32K)
- Column count scaling (2-10 cols, low and high cardinality)
- Group count sweep (16 to 1M groups)
- Random vs sequential data patterns
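To illustrate why random and sequential data can differ in realized group count, the sketch below compares the two. It assumes the random pattern draws each row's group uniformly and independently, which the commit message does not state; the function names are hypothetical. Under that assumption the expected number of distinct groups after N rows over G possible groups is G * (1 - (1 - 1/G)^N).

```rust
/// Expected number of distinct groups when each of `num_rows` rows picks one of
/// `num_groups` groups uniformly at random (assumed sampling model).
fn expected_distinct_random(num_rows: u64, num_groups: u64) -> f64 {
    let g = num_groups as f64;
    // Probability that a given group is never drawn is (1 - 1/G)^N.
    let p_missed = (1.0 - 1.0 / g).powf(num_rows as f64);
    g * (1.0 - p_missed)
}

/// Exact number of distinct groups under sequential enumeration
/// (`global_row % num_groups`): every group is hit once per cycle.
fn distinct_sequential(num_rows: u64, num_groups: u64) -> u64 {
    num_rows.min(num_groups)
}
```

With as many rows as groups, the random model is expected to reach only about 63% (1 - 1/e) of the target group count, while sequential enumeration reaches all of it; the gap shrinks as the row count grows relative to the group count.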
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d25b053
1 file changed
Lines changed: 419 additions & 126 deletions