[WIP] feat: Use put_batch_with_indices to track partitions by index for sort-based shuffle#1451
[WIP] feat: Use put_batch_with_indices to track partitions by index for sort-based shuffle#1451mattcuento wants to merge 2 commits intoapache:mainfrom
Conversation
b475fdc to
2967b31
Compare
|
interleave_record_batch Benchmarking sort_shuffle_no_spill/10_batches_200_partitions: Warming up for 3.0000 s Benchmarking sort_shuffle_with_spill/50_batches_200_partitions_8mb_limit: Warming up for 3.0000 s |
|
main: Benchmarking sort_shuffle_no_spill/10_batches_200_partitions: Warming up for 3.0000 s Benchmarking sort_shuffle_with_spill/50_batches_200_partitions_8mb_limit: Warming up for 3.0000 s |
|
push_batch_with_indices: Benchmarking sort_shuffle_no_spill/10_batches_200_partitions: Warming up for 3.0000 s Benchmarking sort_shuffle_with_spill/50_batches_200_partitions_8mb_limit: Warming up for 3.0000 s |
|
@sqlbenchmark tpch --iterations 3 --scale-factor 10 |
Ballista TPC-H Benchmark ResultsPR: #1451 - interleave_record_batch Query Comparison
Total: Main=45581.80ms, PR=63595.70ms (+39.5%) Automated benchmark run by dfbench |
Ah, branch needs rebase |
…t-based shuffle lint first round of otpimizations second round lint lint
…
Which issue does this PR close?
Closes #1432 .
Rationale for this change
The default
BatchPartitionerused in sort-based shuffle creates separateRecordBatchinstances for each output partition. In situations where the number of output partitions is high, there's a lot of overhead in the metadata with eachRecordBatch.This PR aims to defer materializing the shuffled output batches until spilling/writing results. In the interim, the same input batches are stored, and rows for each partition are tracked via a separate index data structure.
What changes are included in this PR?
PartitionBufferstructInputBatchStoreandScratchSpacefor storing input batches and tracking output partition ownershipI found that existing test coverage via the client sort shuffle tests felt sufficient for the changes to the writer.
Are there any user-facing changes?
No