Commit 135a39b

Nathan Bezualem and claude committed
Address review comments: compare vectorized vs row-based, fix planning overhead, exact NDV
Addresses all three review comments from @kosiew:

1. **Implementation comparison**: Benchmarks both GroupValuesColumn (vectorized, via Int32 columns) and GroupValuesRows (row-based, via FixedSizeBinary(4) columns that trigger the fallback path) side-by-side.
2. **Execution-only timing**: Pre-optimizes the logical plan once via `df.into_parts()`. Each benchmark iteration only does physical planning + execution, excluding SQL parsing and logical optimization.
3. **Exact cardinality**: Replaces random sampling with sequential enumeration (`global_row % num_distinct_groups` decomposed per-column), guaranteeing precise distinct group counts with no birthday-paradox error.

Additionally motivated by #17850, adds comprehensive experiments:

- Issue #17850 regression reproduction (3 cols, 64 groups, 1M-50M rows)
- Low cardinality sweep (8-4096 groups)
- Batch size sensitivity (1K-32K)
- Column count scaling (2-10 cols, low and high cardinality)
- Group count sweep (16 to 1M groups)
- Random vs sequential data patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
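
The sequential enumeration in item 3 can be pictured as a mixed-radix decomposition of `global_row % num_distinct_groups`. The following is a minimal sketch, not the benchmark's actual code: the helper names `group_key` and `count_distinct` are hypothetical, and it assumes each column is given its own cardinality such that the product of the per-column cardinalities equals `num_distinct_groups` (how the commit splits the cardinality across columns is not shown here).

```rust
use std::collections::HashSet;

/// Hypothetical helper: map a global row index to one Int32-style key per column.
/// `radices[i]` is the assumed number of distinct values in column `i`; their
/// product plays the role of `num_distinct_groups`.
fn group_key(global_row: u64, radices: &[u32]) -> Vec<i32> {
    let num_distinct_groups: u64 = radices.iter().map(|&r| r as u64).product();
    // Sequential group id in 0..num_distinct_groups; no randomness involved.
    let mut id = global_row % num_distinct_groups;
    radices
        .iter()
        .map(|&r| {
            // Peel off one mixed-radix digit per column.
            let digit = (id % r as u64) as i32;
            id /= r as u64;
            digit
        })
        .collect()
}

/// Hypothetical helper: count the distinct composite keys in the first `num_rows` rows.
fn count_distinct(num_rows: u64, radices: &[u32]) -> usize {
    (0..num_rows)
        .map(|row| group_key(row, radices))
        .collect::<HashSet<_>>()
        .len()
}
```

With three columns of cardinality 4 each (64 groups, as in the #17850 reproduction), every group id in `0..64` maps to a unique column tuple, so any input with at least 64 rows contains exactly 64 distinct groups. Random sampling, by contrast, can miss some groups when the row count is small relative to the group count.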
1 parent d25b053 commit 135a39b

1 file changed

Lines changed: 419 additions & 126 deletions
