perf: Optimize remaining non-join range operators #37
Open
mwiewior wants to merge 2 commits into
…rators

- subtract: normalize (merge) right-side intervals after collection to shrink candidate set; replace linear cursor advance with binary search (partition_point)
- interval_tree: resolve position arrays once per batch instead of per-row dispatch in both build and probe paths
- complement: replace Vec<String> seen_contigs with AHashSet for O(1) lookup; add merge_cursor tracking to avoid re-scanning merged intervals
- grouped_stream: eliminate double to_string() allocation for new contigs
- Add 21 new unit tests and 2 scaling benchmarks (count_overlaps, coverage)

Closes #36

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
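For readers of the subtract change, the merge step can be sketched roughly as follows. This is only an illustration: the signature, the `(u32, u32)` tuple representation, and the half-open `[start, end)` convention are assumptions, not taken from the crate.

```rust
/// Merge overlapping or touching intervals in place.
///
/// Hypothetical signature: the real `normalize_intervals` may use a
/// different interval type and coordinate convention. Here intervals are
/// assumed to be half-open `(start, end)` pairs with `start <= end`.
pub fn normalize_intervals(intervals: &mut Vec<(u32, u32)>) {
    if intervals.len() < 2 {
        return;
    }
    // Sort by start (then end) so overlapping intervals become adjacent.
    intervals.sort_unstable();

    // `write` is the index of the merged interval currently being extended.
    let mut write = 0;
    for read in 1..intervals.len() {
        let (start, end) = intervals[read];
        if start <= intervals[write].1 {
            // Overlaps or touches the current merged interval: extend it.
            intervals[write].1 = intervals[write].1.max(end);
        } else {
            // Gap found: start a new merged interval.
            write += 1;
            intervals[write] = (start, end);
        }
    }
    // Drop the leftover tail that was folded into earlier entries.
    intervals.truncate(write + 1);
}
```

With heavily overlapping right-side input, the merged vector can be much shorter than the original, which is what shrinks the candidate set scanned per left interval.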
The binary search (partition_point) replacement caused a regression on cross-dataset subtract (chainXenTro3Link vs fBrain-DS14718): ~1.9x slower at p=1.

Root cause: the original persistent cursor is O(n+m) amortized across sorted left intervals, while partition_point is O(n * log m) per contig, which is worse when right-side intervals per contig are small.

Fix: restore the persistent linear cursor while keeping normalize_intervals (which merges fragmented right-side intervals in-place). This preserves the massive 26-74x speedup on self-subtract with heavy overlap, while matching master performance on cross-dataset workloads.

Also adds bench_scaling_subtract_cross benchmark using chainXenTro3Link vs fBrain-DS14718 for cross-dataset regression testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
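The trade-off described in this commit can be sketched as below. Both function names are hypothetical, and the sketch assumes half-open `(start, end)` intervals on a single contig, left intervals sorted by start, and right intervals already normalized (sorted and non-overlapping). The first function skips past finished right intervals with a cursor that only moves forward; the second finds the same starting position per left interval with `slice::partition_point`.

```rust
/// Subtract `right` from each interval in `left` using a persistent cursor.
/// Hypothetical helper; assumes `left` is sorted by start and `right` is
/// sorted and non-overlapping.
pub fn subtract_with_cursor(left: &[(u32, u32)], right: &[(u32, u32)]) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    // Persistent cursor: shared across all left intervals, never rewinds,
    // so the skipping work is bounded by right.len() in total.
    let mut cursor = 0;
    for &(l_start, l_end) in left {
        // Skip right intervals that end at or before this left start.
        while cursor < right.len() && right[cursor].1 <= l_start {
            cursor += 1;
        }
        // Walk the right intervals overlapping this left interval and
        // emit the uncovered gaps.
        let mut pos = l_start;
        let mut i = cursor;
        while i < right.len() && right[i].0 < l_end {
            if right[i].0 > pos {
                out.push((pos, right[i].0));
            }
            pos = pos.max(right[i].1);
            i += 1;
        }
        if pos < l_end {
            out.push((pos, l_end));
        }
    }
    out
}

/// Binary-search alternative for the skip step: O(log m) per left interval.
/// Valid because ends are increasing in a normalized right-side slice.
pub fn first_candidate_binary(right: &[(u32, u32)], l_start: u32) -> usize {
    right.partition_point(|&(_, end)| end <= l_start)
}
```

When each contig has only a few right-side intervals, the cursor's skip loop usually advances zero or one step per left interval, so a per-interval binary search adds overhead without saving any comparisons.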
Summary
Addresses #36 by optimizing the subtract, count_overlaps/coverage, complement, and grouped_stream operators:
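- subtract: Normalize (merge) right-side intervals after collection; binary search (`partition_point`) for O(log n) right-interval lookup per left interval.
- count_overlaps/coverage (interval_tree): Resolve position arrays once per batch via `resolve()` instead of per-row `value(i)?` dispatch with Result overhead.
- complement: Replace `Vec<String>` seen_contigs with `AHashSet<String>` for O(1) membership testing. Add merge_cursor tracking in `emit_contig_complement` to avoid re-scanning merged intervals from the start for each view interval.
- grouped_stream: Eliminate double `to_string()` allocation for new contigs in both `StreamCollector` and `FullBatchCollector`.

Test plan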
- `cargo test -p datafusion-bio-function-ranges`
- `cargo fmt -- --check`
- `cargo clippy -- -D warnings`
- `cargo test -p datafusion-bio-function-ranges --release bench_scaling_subtract -- --ignored --nocapture`
- `cargo test -p datafusion-bio-function-ranges --release bench_scaling_complement -- --ignored --nocapture`
- `cargo test -p datafusion-bio-function-ranges --release bench_scaling_count_overlaps -- --ignored --nocapture`
- `cargo test -p datafusion-bio-function-ranges --release bench_scaling_coverage -- --ignored --nocapture`

🤖 Generated with Claude Code
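As a rough illustration of the complement and grouped_stream items in the summary, the sketch below tracks seen contigs in a hash set and allocates the owned `String` only when a contig is new. The `SeenContigs` type is hypothetical, and std's `HashSet` stands in for `AHashSet` so the example needs no external crate; the PR's actual collectors may be structured differently.

```rust
use std::collections::HashSet;

/// Hypothetical tracker for contig names seen so far.
/// Assumption: std `HashSet` is used here in place of `AHashSet`.
#[derive(Default)]
pub struct SeenContigs {
    seen: HashSet<String>,
}

impl SeenContigs {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns true if `contig` was not seen before.
    /// The lookup borrows the `&str`, so the only allocation is the single
    /// `to_string()` performed when a new contig is recorded.
    pub fn insert(&mut self, contig: &str) -> bool {
        if self.seen.contains(contig) {
            return false;
        }
        self.seen.insert(contig.to_string());
        true
    }

    /// Number of distinct contigs recorded.
    pub fn len(&self) -> usize {
        self.seen.len()
    }

    /// True when no contig has been recorded yet.
    pub fn is_empty(&self) -> bool {
        self.seen.is_empty()
    }
}
```

Compared with scanning a `Vec<String>`, membership testing here is expected O(1) per lookup instead of linear in the number of contigs.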