Resampling performance improvement and sparse aggregation columns support #3062
Open
IvoDD wants to merge 8 commits into
419c30a to 0de92a2
Base automatically changed from arrow-use-in-memory-storage-for-unit-tests to master, May 7, 2026 11:30
a5ac868 to a9e8ee4
36122bc to 4231a4f
5679aa0 to 4b7e881
210a17b to 086284c
086284c to 5e4edb7
Compare
Contributor
ArcticDB Code Review Summary
Delta since last review is a single commit (89d9fd8 "Address Alex comments") touching `cpp/arcticdb/processing/sorted_aggregation.cpp` and `cpp/arcticdb/processing/test/test_resample.cpp`. The accumulate refactor, multiplication-based threshold (avoiding division by zero), and added comments are all correct improvements over the previous code. No new issues introduced.
PR Title and Description
Documentation
Notes (no action required)
alexowens90
reviewed
May 18, 2026
0c2d98c to 6120021
IvoDD added a commit that referenced this pull request, May 21, 2026
#### Reference Issues/PRs
Optimizations on top of #3091
Used in #3062

#### What does this implement or fix?
Some micro optimizations on binary search methods:
- Don't keep `TypedBlockData` in `ColumnDataIterator`. Instead only keep `block_data_` and `block_size_`
- Don't recalculate block pointer and size when we already know them during gallop

#### Any other comments?
Benchmarks for all search and iteration methods:

| Benchmark | Before (ns) | After (ns) | Delta |
|---|---:|---:|---:|
| iterate_irregular_blocks_1 (one row per block) | 478,496 | 311,163 | −35.0% |
| iterate_with_iterator (100 rows) | 798 | 719 | −9.9% |
| exponential_lb_single_block (in first 100) | 356 | 323 | −9.2% |
| exponential_lb_single_block (full gallop) | 458 | 424 | −7.4% |
| exponential_lb_regular (in first 100) | 364 | 339 | −6.7% |
| exponential_lb_irregular_1000 (in first 100) | 360 | 335 | −6.7% |
| exponential_lb_irregular_1000 (full gallop) | 496 | 476 | −3.9% |
| exponential_lb_regular (full gallop) | 504 | 489 | −2.9% |
| exponential_lb_irregular_1 (in first 100) | 464 | 455 | −2.0% |
| exponential_lb_irregular_1 (full gallop) | 687 | 679 | −1.3% |
| lower_bound_single_block | 411 | 394 | −4.1% |
| lower_bound_irregular_1000 | 444 | 431 | −3.0% |
| lower_bound_irregular_1 | 595 | 579 | −2.8% |
| lower_bound_regular_blocks | 443 | 436 | −1.4% |
| iterate_single_block | 27,305 | 27,247 | −0.2% |
| iterate_regular_blocks | 29,051 | 28,734 | −1.1% |
| iterate_irregular_blocks_1000 | 28,136 | 27,893 | −0.9% |
| iterate_with_scalar_at (100 rows) | 182,183,122 | 182,088,026 | −0.1% |

#### Checklist
<details>
<summary> Checklist for code changes... </summary>

- [ ] Have you updated the relevant docstrings, documentation and copyright notice?
- [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)?
- [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)?
- [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?
</details>

<!-- Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at: - ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md - ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing -->

Co-authored-by: Ivo <ivo.dilov@man.com>
added 8 commits
May 21, 2026 15:19
Previously, `generate_output_index_column`, `generate_resample_output_column` and `aggregate` each had complicated logic to identify which input rows correspond to which output row. This is simplified by creating a `ResampleMapping` while building the output index column, which stores the input values that each output row corresponds to. The other methods then use `ResampleMapping` directly.
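A minimal sketch of what such a mapping could look like, based only on the description in this PR (a mapping from `output_row` to a start and end position, each given as a column index plus an offset). The names `RowPosition`, `BucketRange`, `ResampleMappingSketch` and `sum_bucket` are hypothetical, and treating the end position as exclusive is an assumption; the real `ResampleMapping` may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical: a row position expressed as (input column, offset inside it).
struct RowPosition {
    std::size_t column_index;
    std::size_t column_offset;
};

// Hypothetical: the input rows feeding one output row.
// Assumption: `end` is exclusive (one past the last row of the bucket).
struct BucketRange {
    RowPosition start;
    RowPosition end;
};

// Hypothetical stand-in for ResampleMapping: ranges[output_row].
struct ResampleMappingSketch {
    std::vector<BucketRange> ranges;
};

// Example consumer: sum all input values belonging to one output row.
double sum_bucket(const std::vector<std::vector<double>>& input_columns, const BucketRange& range) {
    assert(range.start.column_index <= range.end.column_index);
    double total = 0.0;
    for (std::size_t col = range.start.column_index;
         col <= range.end.column_index && col < input_columns.size();
         ++col) {
        // Only the first and last column of the range are partially covered.
        const std::size_t first = (col == range.start.column_index) ? range.start.column_offset : 0;
        const std::size_t last =
                (col == range.end.column_index) ? range.end.column_offset : input_columns[col].size();
        for (std::size_t i = first; i < last; ++i) {
            total += input_columns[col][i];
        }
    }
    return total;
}
```

With the ranges computed once, an aggregator only needs to walk `[start, end)` for each output row instead of re-deriving bucket boundaries from the index.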
A lot of resampling runtime was spent generating the output index column. This can be sped up significantly in the common case where the number of buckets is much smaller than the number of input rows by using exponential binary search.
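For illustration, exponential (galloping) search over a sorted index can be sketched as below. This is not the ArcticDB implementation: it works on a plain `std::vector` rather than column blocks, the function names are hypothetical, and it assumes buckets are closed on the left and open on the right.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the first position >= start whose value is >= boundary.
// Gallops with doubling steps, then binary-searches the last window.
std::size_t exponential_lower_bound(
        const std::vector<std::int64_t>& sorted_index, std::size_t start, std::int64_t boundary) {
    assert(start <= sorted_index.size());
    const std::size_t n = sorted_index.size();
    std::size_t step = 1;
    std::size_t lo = start;  // everything before lo is known to be < boundary
    std::size_t hi = start;  // current probe position
    while (hi < n && sorted_index[hi] < boundary) {
        lo = hi + 1;
        hi = (step > n - hi) ? n : hi + step;  // clamp the probe to the end
        step *= 2;
    }
    // The answer lies in [lo, hi]; lower_bound returns hi if nothing in [lo, hi) qualifies.
    const auto first = sorted_index.begin() + static_cast<std::ptrdiff_t>(lo);
    const auto last = sorted_index.begin() + static_cast<std::ptrdiff_t>(hi);
    return static_cast<std::size_t>(std::lower_bound(first, last, boundary) - sorted_index.begin());
}

// For each bucket end (exclusive), the offset one past the bucket's last row.
std::vector<std::size_t> bucket_end_offsets(
        const std::vector<std::int64_t>& sorted_index, const std::vector<std::int64_t>& bucket_ends) {
    std::vector<std::size_t> offsets;
    offsets.reserve(bucket_ends.size());
    std::size_t pos = 0;
    for (const std::int64_t end : bucket_ends) {
        pos = exponential_lower_bound(sorted_index, pos, end);  // resume where the last bucket stopped
        offsets.push_back(pos);
    }
    return offsets;
}
```

Because each search starts where the previous bucket ended, the work per bucket is logarithmic in the number of rows in that bucket rather than linear, which matches the `O(num_buckets × log(rows_per_bucket))` bound quoted in the PR description.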
Speeds up processing and reduces memory usage in the very rare case where num_buckets >> num_input_rows.
Benchmarking across various rows_per_bucket values confirmed that exponential_search becomes faster than a linear scan at around 32 rows per bucket. Below 32 rows per bucket the linear pass is faster; above 32 the exponential search is faster.
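A sketch of how such a crossover could be expressed, using the multiplication-based comparison mentioned in the review summary so that `num_buckets == 0` never causes a division by zero. The constant and function names are hypothetical, and using a strict `>` at exactly 32 rows per bucket is an assumption.

```cpp
#include <cstddef>

// Crossover point reported in the commit message (rows per bucket).
constexpr std::size_t kRowsPerBucketThreshold = 32;

// Logically "num_input_rows / num_buckets > 32", rewritten as a multiplication
// so that num_buckets == 0 is well defined.
// Note: the product could overflow for astronomically large num_buckets;
// this sketch does not guard against that.
constexpr bool should_use_exponential_search(std::size_t num_input_rows, std::size_t num_buckets) {
    return num_input_rows > kRowsPerBucketThreshold * num_buckets;
}
```

For example, 1M rows over 1k buckets (1000 rows per bucket) would select exponential search, while 1000 rows over 100 buckets (10 rows per bucket) would fall back to the linear pass.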
Construct the output aggregation column based on the rs_index of the input sparse columns, then use sparse iterators to populate the values.
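The idea can be sketched roughly as follows. This is an illustration only: the sparse column is modelled as a sorted list of populated row numbers plus their values, which is an assumption and not ArcticDB's actual sparse representation, and `SparseColumnSketch` and `sparse_bucket_sums` are hypothetical names. An output row is left empty when no input row in its bucket has a value.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical sparse column: values[i] belongs to logical row set_rows[i].
struct SparseColumnSketch {
    std::vector<std::size_t> set_rows;  // sorted logical row numbers that hold a value
    std::vector<double> values;         // one value per entry in set_rows
};

// Sum per bucket, visiting only populated rows. bucket_end_offsets[i] is the
// exclusive end row of bucket i. Buckets without any value stay std::nullopt.
std::vector<std::optional<double>> sparse_bucket_sums(
        const SparseColumnSketch& column, const std::vector<std::size_t>& bucket_end_offsets) {
    assert(column.set_rows.size() == column.values.size());
    std::vector<std::optional<double>> out;
    out.reserve(bucket_end_offsets.size());
    std::size_t pos = 0;  // position in the populated rows, shared across buckets
    for (const std::size_t end : bucket_end_offsets) {
        std::optional<double> sum;
        while (pos < column.set_rows.size() && column.set_rows[pos] < end) {
            sum = sum.value_or(0.0) + column.values[pos];
            ++pos;
        }
        out.push_back(sum);
    }
    return out;
}
```

The loop never touches missing rows, so the cost scales with the number of populated values plus the number of buckets.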
5e4edb7 to 89d9fd8
alexowens90 approved these changes, May 21, 2026
Reference Issues/PRs
Monday ref: 11679866800
Depends on PRs #3091 and #3110
Issues
`generate_output_index_column`, `generate_resampling_output_column`, `SortedAggregator::aggregate`
Changes (split per commit for easier review)
`generate_output_index_column` to `sorted_aggregation.cpp`.
`ResampleMapping` in `generate_output_index_column` and use it directly in other methods. `ResampleMapping` just has a mapping from `output_row` to `(start_column_index, start_column_offset), (end_column_index, end_column_offset)`.
`generate_output_index_column` to skip past all rows in a single bucket at once.
`O(num_input_rows + num_buckets)` to `O(num_buckets × log(rows_per_bucket))`.
`O(num_input_rows + num_buckets)` even when `num_buckets ≥ num_input_rows`.
`min(num_buckets, num_input_rows)` instead of `num_buckets`.
`ResampleMapping` from commit 2.
Resample benchmark timings
`BM_resample/<rows_per_seg>/<num_segs>/<num_buckets>/<num_cols>`. Total rows ~1M.
Source: `cpp/arcticdb/processing/test/benchmark_resample.cpp`. Times in ms, `--benchmark_min_time=2s`.
[Benchmark table; columns: 100k × 10, 1k buckets | 100k × 10, 10k buckets | 100k × 10, 100k buckets | 2k × 500, 100 buckets | 100k × 10, 10M buckets. Row groups: 1 aggregation column; 100 aggregation columns.]
Deltas vs baseline (row 0).
Notes on benchmark results
`ARCTICDB_LIKELY` and `ARCTICDB_UNLIKELY`).