Resampling performance improvement and sparse aggregation columns support by IvoDD · Pull Request #3062 · man-group/ArcticDB

IvoDD · 2026-04-30T09:39:24Z

Reference Issues/PRs

Monday ref: 11679866800

Depends on PRs #3091 and #3110

Issues

There is complicated bucket hopping logic in three places: generate_output_index_column, generate_resampling_output_column, SortedAggregator::aggregate
The bucket hopping logic involves many branches with loads of checks

Changes (split per commit for easier review)

Adds C++ benchmarks that measure the CPU-intensive part of resampling.
Pure move of the generate_output_index_column to sorted_aggregation.cpp.
- This way all bucket hopping logic is in one place.
Construct a ResampleMapping in generate_output_index_column and use it directly in other methods.
- ResampleMapping just has a mapping from output_row to (start_column_index, start_column_offset), (end_column_index, end_column_offset).
- Resolves the 3 places with similar logic.
- Makes the implementation of sparse aggregation easier.
Use galloping search in generate_output_index_column to skip past all rows in a single bucket at once.
- Index column construction was the bottleneck: aggregation vectorises well but index iteration does not.
- Changes complexity from O(num_input_rows + num_buckets) to O(num_buckets × log(rows_per_bucket)).
- Always ≤ O(num_input_rows + num_buckets) even when num_buckets ≥ num_input_rows.
Preallocate the output index column to min(num_buckets, num_input_rows) instead of num_buckets.
- Galloping search has a higher constant than linear scan and regresses at low rows per bucket.
- Slightly improves the case where most buckets are empty due to smaller allocation.
Use a runtime heuristic to choose between linear scan and galloping search.
- Linear scan is faster below ~32 rows/bucket (because of smaller constant and better branch prediction); galloping search is faster above.
- Threshold determined empirically from benchmarks at intermediate bucket counts. Extra benchmarking was done with more parametrization of the existing benchmark. Not kept in PR to avoid a huge amount of benchmarking code.
- Recovers the Dense-10 (100k buckets) and Empty regressions from commit 3 while retaining all gains elsewhere.
Implement sparse resampling.
- Small change made straightforward by the ResampleMapping from commit 2.
- Minimal overhead for the dense case.

Resample benchmark timings

BM_resample/<rows_per_seg>/<num_segs>/<num_buckets>/<num_cols>. Total rows ~1M.
Source: cpp/arcticdb/processing/test/benchmark_resample.cpp. Times in ms, --benchmark_min_time=2s.

Regime	Args	rows/bucket	Description
Dense-1k	`100k × 10, 1k buckets`	~1000	Many rows/bucket, single row-slice
Dense-100	`100k × 10, 10k buckets`	~100	Medium rows/bucket, single row-slice
Dense-10	`100k × 10, 100k buckets`	~10	Few rows/bucket, single row-slice
Spanning	`2k × 500, 100 buckets`	~10k	Buckets span multiple row-slices
Empty	`100k × 10, 10M buckets`	<1	Bucket smaller than row spacing; most empty

1 aggregation column

#	Change	D-1k	D-100	D-10	Spanning	Empty
0	Baseline	1.27	1.34	1.47	1.65	11.1
1	Code move	1.02 (−20%)	1.12 (−16%)	1.27 (−14%)	1.40 (−15%)	11.1 (0%)
2	ResampleMapping	1.02 (−20%)	1.12 (−16%)	1.32 (−10%)	1.40 (−15%)	11.8 (+6%)
3	Galloping search	0.059 (−95%)	0.385 (−71%)	2.94 (+100%)	0.285 (−83%)	21.9 (+97%)
4	Bounded allocation	0.058 (−95%)	0.396 (−70%)	2.91 (+98%)	0.291 (−82%)	21.5 (+94%)
5	Heuristic (lin/EUB)	0.059 (−95%)	0.383 (−71%)	1.27 (−14%)	0.293 (−82%)	11.5 (+4%)
6	Sparse-input support	0.068 (−95%)	0.449 (−66%)	1.28 (−13%)	0.296 (−82%)	11.5 (+4%)

100 aggregation columns

#	Change	D-1k	D-100	D-10	Spanning	Empty
0	Baseline	1.37	1.43	1.56	6.22	48.0
1	Code move	1.11 (−19%)	1.18 (−17%)	1.34 (−14%)	5.92 (−5%)	46.2 (−4%)
2	ResampleMapping	1.11 (−19%)	1.19 (−17%)	1.39 (−11%)	5.87 (−6%)	50.4 (+5%)
3	Galloping search	0.148 (−89%)	0.471 (−67%)	2.96 (+90%)	4.65 (−25%)	63.1 (+31%)
4	Bounded allocation	0.148 (−89%)	0.480 (−66%)	2.95 (+89%)	4.67 (−25%)	44.1 (−8%)
5	Heuristic (lin/EUB)	0.149 (−89%)	0.477 (−67%)	1.33 (−15%)	4.70 (−24%)	35.9 (−25%)
6	Sparse-input support	0.158 (−88%)	0.537 (−62%)	1.35 (−13%)	4.94 (−21%)	36.0 (−25%)

Deltas vs baseline (row 0).

Notes on benchmark results

Load average varied across runs, so some results contain artifacts, such as the apparent improvement from the pure "Code move" commit.
Galloping search significantly improves speed when there are many rows per bucket. Thorough benchmarking showed that exponential upper bound (EUB) search becomes faster than linear search at ~32 rows per bucket, hence the performance regressions in the 10-rows-per-bucket and mostly-empty-bucket cases.
Bounded allocation mostly helps the empty case, as expected.
Using the heuristic to choose between EUB and linear search helps when rows_per_bucket < 32. It is even more efficient than the baseline due to slightly better branch prediction (improved use of ARCTICDB_LIKELY and ARCTICDB_UNLIKELY).
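Final state: every regime is at or faster than baseline. Dense with 1000 rows per bucket is the biggest winner, with a roughly 19x improvement for 1 aggregation column (1.27 ms → 0.068 ms). The mostly-empty-bucket case is the only one with no improvement for 1 aggregation column and remains around baseline (+4%); with 100 aggregation columns it is 25% faster.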

claude · 2026-05-14T13:54:39Z

ArcticDB Code Review Summary

Delta since last review is a single commit (89d9fd8 "Address Alex comments") touching cpp/arcticdb/processing/sorted_aggregation.cpp and cpp/arcticdb/processing/test/test_resample.cpp. The accumulate refactor, multiplication-based threshold (avoiding division by zero), and added comments are all correct improvements over the previous code. No new issues introduced.

PR Title and Description

Typo "aggragation" -> "aggregation" fixed in title.

Documentation

Sparse resampling support still not reflected in technical docs: docs/claude/cpp/PROCESSING.md and docs/claude/python/QUERY_PROCESSING.md describe resampling but do not mention that sparse aggregation columns are now supported. The previously-thrown "Cannot aggregate sparse column" schema error has been removed (and the corresponding test_resampling_sparse_data deleted), so a brief note that sparse aggregation columns are now supported would keep the docs aligned with behaviour.
No user-facing doc or tutorial mention of the new sparse resampling capability. Optional but worth considering before release.

Notes (no action required)

Multiplication form total_input_rows < linear_scan_threshold * num_buckets is mathematically equivalent to the old total_input_rows / num_buckets < linear_scan_threshold for realistic inputs; overflow would require more than 5e17 buckets, so not a practical concern.
Removing the post-advance_boundary_past_value re-check of current_bucket.contains(it->value()) in generate_output_index_column is safe because advance_boundary_past_value always lands bucket_end_it on a bucket whose half-open interval contains the probed value for both LEFT- and RIGHT-closed boundaries. The previous defensive re-check was redundant.
This PR is stacked on binary-search-utils-optimization and depends on PRs Binary search utils #3091 and Optimize binary search methods #3110 - merge order matters.
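
The first note above can be checked with a small sketch. It assumes 64-bit unsigned counters and a threshold of exactly 32; both are assumptions for illustration, and `division_form`, `multiplication_form` and `max_safe_buckets` are hypothetical names rather than ArcticDB identifiers.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumed threshold value; the PR only says the switch happens at ~32 rows per bucket.
constexpr std::uint64_t linear_scan_threshold = 32;

// Old form: integer division, which is undefined when num_buckets == 0.
inline bool division_form(std::uint64_t total_input_rows, std::uint64_t num_buckets) {
    assert(num_buckets != 0);
    return total_input_rows / num_buckets < linear_scan_threshold;
}

// New form: no division, so num_buckets == 0 simply yields false.
constexpr bool multiplication_form(std::uint64_t total_input_rows, std::uint64_t num_buckets) {
    return total_input_rows < linear_scan_threshold * num_buckets;
}

// Largest bucket count whose product with the threshold still fits in 64 bits.
constexpr std::uint64_t max_safe_buckets =
    std::numeric_limits<std::uint64_t>::max() / linear_scan_threshold;
static_assert(max_safe_buckets == 576460752303423487ULL, "2^59 - 1, about 5.76e17");
```

For integer inputs with `num_buckets > 0`, `floor(rows / buckets) < T` holds exactly when `rows < T * buckets`, so the two predicates agree until the product wraps, which under these assumptions needs more than about 5.76e17 buckets.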

#### Reference Issues/PRs

Optimizations on top of #3091
Used in #3062

#### What does this implement or fix?

Some micro optimizations on binary search methods:
- Don't keep `TypedBlockData` in `ColumnDataIterator`. Instead only keep `block_data_` and `block_size_`
- Don't recalculate block pointer and size when we already know them during gallop

#### Any other comments?

Benchmarks for all search and iteration methods:

| Benchmark | Before (ns) | After (ns) | Delta |
|---|---:|---:|---:|
| iterate_irregular_blocks_1 (one row per block) | 478,496 | 311,163 | −35.0% |
| iterate_with_iterator (100 rows) | 798 | 719 | −9.9% |
| exponential_lb_single_block (in first 100) | 356 | 323 | −9.2% |
| exponential_lb_single_block (full gallop) | 458 | 424 | −7.4% |
| exponential_lb_regular (in first 100) | 364 | 339 | −6.7% |
| exponential_lb_irregular_1000 (in first 100) | 360 | 335 | −6.7% |
| exponential_lb_irregular_1000 (full gallop) | 496 | 476 | −3.9% |
| exponential_lb_regular (full gallop) | 504 | 489 | −2.9% |
| exponential_lb_irregular_1 (in first 100) | 464 | 455 | −2.0% |
| exponential_lb_irregular_1 (full gallop) | 687 | 679 | −1.3% |
| lower_bound_single_block | 411 | 394 | −4.1% |
| lower_bound_irregular_1000 | 444 | 431 | −3.0% |
| lower_bound_irregular_1 | 595 | 579 | −2.8% |
| lower_bound_regular_blocks | 443 | 436 | −1.4% |
| iterate_single_block | 27,305 | 27,247 | −0.2% |
| iterate_regular_blocks | 29,051 | 28,734 | −1.1% |
| iterate_irregular_blocks_1000 | 28,136 | 27,893 | −0.9% |
| iterate_with_scalar_at (100 rows) | 182,183,122 | 182,088,026 | −0.1% |

#### Checklist

<details>
<summary> Checklist for code changes... </summary>

- [ ] Have you updated the relevant docstrings, documentation and copyright notice?
- [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)?
- [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)?
- [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?
</details>

Co-authored-by: Ivo <ivo.dilov@man.com>

Previously each of `generate_output_index_column`, `generate_resample_output_column` and `aggregate` had complicated logic to identify which input rows correspond to which output row. This is simplified by creating a `ResampleMapping` when building the output index column, which stores which input values each output row corresponds to. `ResampleMapping` is then used in the other methods.
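
A minimal sketch of what such a mapping could look like, based only on the description "output_row to (start_column_index, start_column_offset), (end_column_index, end_column_offset)". All type and function names are hypothetical, and reading `column_index` as "which input chunk (row-slice)" is an assumption; the real `ResampleMapping` may be laid out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; not the actual ArcticDB types.
struct InputPosition {
    std::size_t column_index;   // assumed: which input chunk (row-slice) the position is in
    std::size_t column_offset;  // row offset inside that chunk
};

struct OutputRowRange {
    InputPosition start;  // first input row of the bucket (inclusive)
    InputPosition end;    // one past the last input row of the bucket (exclusive)
};

// One entry per output row, built once while generating the output index.
using ResampleMappingSketch = std::vector<OutputRowRange>;

// Sum aggregation driven purely by the mapping: no bucket-boundary logic here.
std::vector<double> sum_with_mapping(const std::vector<std::vector<double>>& input_chunks,
                                     const ResampleMappingSketch& mapping) {
    std::vector<double> out;
    out.reserve(mapping.size());
    for (const auto& range : mapping) {
        double sum = 0.0;
        std::size_t chunk = range.start.column_index;
        std::size_t offset = range.start.column_offset;
        while (chunk < range.end.column_index ||
               (chunk == range.end.column_index && offset < range.end.column_offset)) {
            assert(chunk < input_chunks.size());
            if (offset == input_chunks[chunk].size()) {
                // Reached the end of this chunk: continue in the next one.
                ++chunk;
                offset = 0;
                continue;
            }
            sum += input_chunks[chunk][offset++];
        }
        out.push_back(sum);
    }
    return out;
}
```

The point of the design is that the aggregation loop only walks ranges; the decision of where each bucket starts and ends is made once, when the mapping is built.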

A lot of resampling runtime was spent generating the output index column. This can be sped up significantly in the common case where the number of buckets is much smaller than the number of input rows by using exponential binary search.
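
The idea can be sketched as follows, assuming a single contiguous sorted index and left-closed buckets given by their sorted start boundaries. `gallop_lower_bound` and `bucket_row_offsets` are hypothetical helpers written for this illustration; they are not the utilities from #3091 or #3110.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the first element in sorted[from, n) that is >= bound.
// Probes at exponentially growing distances, then binary-searches the last gap.
std::size_t gallop_lower_bound(const std::vector<std::int64_t>& sorted,
                               std::size_t from, std::int64_t bound) {
    assert(from <= sorted.size());
    const std::size_t n = sorted.size();
    std::size_t prev = from;   // invariant: every element in [from, prev) is < bound
    std::size_t probe = from;
    std::size_t step = 1;
    while (probe < n && sorted[probe] < bound) {
        prev = probe + 1;
        probe += step;
        step *= 2;
    }
    const auto first = sorted.begin() + static_cast<std::ptrdiff_t>(prev);
    const auto last = sorted.begin() + static_cast<std::ptrdiff_t>(std::min(probe, n));
    return static_cast<std::size_t>(std::lower_bound(first, last, bound) - sorted.begin());
}

// For each bucket start boundary, the row offset of the first input row at or after it.
std::vector<std::size_t> bucket_row_offsets(const std::vector<std::int64_t>& index,
                                            const std::vector<std::int64_t>& boundaries) {
    std::vector<std::size_t> offsets;
    offsets.reserve(boundaries.size());
    std::size_t row = 0;
    for (const std::int64_t boundary : boundaries) {
        row = gallop_lower_bound(index, row, boundary);  // skips a whole bucket at once
        offsets.push_back(row);
    }
    return offsets;
}
```

Each boundary costs a number of probes logarithmic in the rows skipped, which is where the per-bucket `log(rows_per_bucket)` factor comes from.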

Bounding the output index preallocation speeds up and reduces memory usage in the very rare case where num_buckets >> num_input_rows.
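
The bound follows from a simple counting argument, sketched here under the assumption that empty buckets produce no output row (which the `min(num_buckets, num_input_rows)` preallocation in the PR description implies). `output_index_capacity` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Every output row needs at least one input row and exactly one bucket,
// so the output can never have more rows than either count.
std::size_t output_index_capacity(std::size_t num_buckets, std::size_t num_input_rows) {
    const std::size_t capacity = std::min(num_buckets, num_input_rows);
    assert(capacity <= num_buckets && capacity <= num_input_rows);
    return capacity;
}
```

For the Empty benchmark regime (about 1M input rows, 10M buckets) this would reserve 1M index entries rather than 10M.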

Benchmarking across various rows_per_bucket values confirmed that exponential_search becomes faster than linear scan at around 32 elements. For <32 rows per bucket the linear pass is faster. For >32 the exponential search is faster.
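
A minimal sketch of such a runtime choice, using the multiplication form of the comparison mentioned in the review. The enum, the function name and the exact threshold value of 32 are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

enum class SearchStrategy { LinearScan, GallopingSearch };

// Assumed value, taken from the "around 32 elements" benchmark observation.
constexpr std::uint64_t linear_scan_threshold = 32;

// Picks linear scan when the average rows per bucket is below the threshold.
// Written without a division so that num_buckets == 0 needs no special case.
constexpr SearchStrategy choose_strategy(std::uint64_t total_input_rows,
                                         std::uint64_t num_buckets) {
    return total_input_rows < linear_scan_threshold * num_buckets
               ? SearchStrategy::LinearScan
               : SearchStrategy::GallopingSearch;
}
```

With about 1M input rows, this sketch would pick galloping search for the 1k-bucket and 10k-bucket regimes and linear scan for the 100k-bucket and 10M-bucket regimes, matching where the benchmark tables show the heuristic recovering the regressions.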

Construct output agg column based on rs_index of input sparse columns. Then use sparse iterators to populate the values.
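
A rough sketch of the general idea, not of the actual implementation: it models a sparse column as the sorted list of rows that hold a value, and assumes an output row is left missing when its bucket contains no present values. The struct, the function and the row-range representation of buckets are all hypothetical; how ArcticDB's `rs_index` and sparse iterators work internally is not described in the source.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model of a sparse column: only present rows are stored.
struct SparseColumnSketch {
    std::vector<std::size_t> rows;  // sorted logical row numbers that hold a value
    std::vector<double> values;     // values[i] belongs to rows[i]
};

// bucket_ranges[b] = [first, second) logical input rows feeding output row b.
// Ranges are assumed sorted and non-overlapping.
SparseColumnSketch sparse_sum(const SparseColumnSketch& input,
                              const std::vector<std::pair<std::size_t, std::size_t>>& bucket_ranges) {
    assert(input.rows.size() == input.values.size());
    SparseColumnSketch out;
    std::size_t i = 0;  // cursor over present values only; missing rows are never visited
    for (std::size_t bucket = 0; bucket < bucket_ranges.size(); ++bucket) {
        const std::size_t begin = bucket_ranges[bucket].first;
        const std::size_t end = bucket_ranges[bucket].second;
        while (i < input.rows.size() && input.rows[i] < begin) {
            ++i;
        }
        bool any_value = false;
        double sum = 0.0;
        for (; i < input.rows.size() && input.rows[i] < end; ++i) {
            sum += input.values[i];
            any_value = true;
        }
        if (any_value) {
            // Assumption: buckets with no present values stay missing in the output.
            out.rows.push_back(bucket);
            out.values.push_back(sum);
        }
    }
    return out;
}
```

In this model the dense case is just the special case where every row is present, which is consistent with the claim of minimal overhead for dense input.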

IvoDD changed the base branch from master to arrow-use-in-memory-storage-for-unit-tests April 30, 2026 10:17

maxim-morozov self-requested a review April 30, 2026 16:42

IvoDD force-pushed the arrow-use-in-memory-storage-for-unit-tests branch from 419c30a to 0de92a2 Compare May 5, 2026 14:11

Base automatically changed from arrow-use-in-memory-storage-for-unit-tests to master May 7, 2026 11:30

IvoDD force-pushed the sparse-resampling-support branch from a5ac868 to a9e8ee4 Compare May 11, 2026 09:18

IvoDD changed the base branch from master to binary-search-utils May 11, 2026 09:18

IvoDD force-pushed the sparse-resampling-support branch 2 times, most recently from 36122bc to 4231a4f Compare May 12, 2026 15:18

IvoDD force-pushed the binary-search-utils branch 2 times, most recently from 5679aa0 to 4b7e881 Compare May 13, 2026 08:20

IvoDD force-pushed the sparse-resampling-support branch 3 times, most recently from 210a17b to 086284c Compare May 13, 2026 14:53

IvoDD mentioned this pull request May 14, 2026

Optimize binary search methods #3110

Merged

5 tasks

IvoDD force-pushed the sparse-resampling-support branch from 086284c to 5e4edb7 Compare May 14, 2026 12:10

IvoDD added the patch label (Small change, should increase patch version) May 14, 2026

IvoDD changed the title ~~[Draft] Sparse resampling support~~ Resampling performance improvement and sparse aggragation columns support May 14, 2026

IvoDD changed the base branch from binary-search-utils to binary-search-utils-optimization May 14, 2026 13:12

IvoDD marked this pull request as ready for review May 14, 2026 13:47

IvoDD requested review from alexowens90 and poodlewars as code owners May 14, 2026 13:47

IvoDD changed the title ~~Resampling performance improvement and sparse aggragation columns support~~ Resampling performance improvement and sparse aggregation columns support May 14, 2026