perf: Optimize remaining non-join range operators #37

Open
mwiewior wants to merge 2 commits into master from optimize-range-operators
@mwiewior
Contributor

Summary

Addresses #36 — optimizes subtract, count_overlaps/coverage, complement, and grouped_stream operators:

  • subtract: Normalize (merge) right-side intervals after collection, dramatically shrinking the candidate set for subtraction. Replace linear cursor advance with binary search (partition_point) for O(log n) right-interval lookup per left interval.
  • interval_tree (count_overlaps/coverage): Resolve position arrays once per batch via resolve() instead of per-row value(i)? dispatch with Result overhead.
  • complement: Replace Vec<String> seen_contigs with AHashSet<String> for O(1) membership testing. Add merge_cursor tracking in emit_contig_complement to avoid re-scanning merged intervals from the start for each view interval.
  • grouped_stream: Eliminate double to_string() allocation for new contigs in both StreamCollector and FullBatchCollector.
  • Add 21 new unit tests covering subtract, count_overlaps, coverage, complement, merge, and cluster edge cases.
  • Add 2 new scaling benchmarks for count_overlaps and coverage.
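
For readers unfamiliar with the subtract change described above, here is a minimal sketch of the two ideas: merging the right-side intervals, then using `partition_point` to find the first candidate. It is an illustration only, not the code in this PR. The `Interval` alias, the half-open `[start, end)` convention, and the helpers `first_candidate` and `subtract_one` are assumptions made for the example; only the names `normalize_intervals` and `partition_point` come from the description.

```rust
/// Hypothetical half-open interval `[start, end)`; the crate's real types may differ.
type Interval = (u64, u64);

/// Sort the intervals and merge any that overlap or touch.
fn normalize_intervals(mut v: Vec<Interval>) -> Vec<Interval> {
    v.sort_unstable();
    let mut out: Vec<Interval> = Vec::with_capacity(v.len());
    for (s, e) in v {
        match out.last_mut() {
            // Overlaps or touches the previous interval: extend it.
            Some(last) if s <= last.1 => last.1 = last.1.max(e),
            _ => out.push((s, e)),
        }
    }
    out
}

/// Index of the first merged interval that ends after `start` (binary search).
fn first_candidate(merged: &[Interval], start: u64) -> usize {
    merged.partition_point(|&(_, e)| e <= start)
}

/// Subtract all merged right-side intervals from a single left interval.
fn subtract_one(left: Interval, merged: &[Interval]) -> Vec<Interval> {
    let (mut cur, end) = left;
    let mut out = Vec::new();
    for &(rs, re) in &merged[first_candidate(merged, cur)..] {
        if rs >= end {
            break; // remaining right intervals start past the left interval
        }
        if rs > cur {
            out.push((cur, rs)); // uncovered gap before this right interval
        }
        cur = cur.max(re);
        if cur >= end {
            break; // left interval fully consumed
        }
    }
    if cur < end {
        out.push((cur, end)); // uncovered tail
    }
    out
}
```

Because the merged list is sorted and non-overlapping, its end coordinates are also sorted, which is what makes the `partition_point` predicate valid.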

Test plan

  • All 110 tests pass: cargo test -p datafusion-bio-function-ranges
  • Formatting clean: cargo fmt -- --check
  • Linting clean: cargo clippy -- -D warnings
  • Run scaling benchmarks on chainXenTro3Link data to verify performance improvements:
    • cargo test -p datafusion-bio-function-ranges --release bench_scaling_subtract -- --ignored --nocapture
    • cargo test -p datafusion-bio-function-ranges --release bench_scaling_complement -- --ignored --nocapture
    • cargo test -p datafusion-bio-function-ranges --release bench_scaling_count_overlaps -- --ignored --nocapture
    • cargo test -p datafusion-bio-function-ranges --release bench_scaling_coverage -- --ignored --nocapture

🤖 Generated with Claude Code

mwiewior and others added 2 commits March 13, 2026 14:12
…rators

- subtract: normalize (merge) right-side intervals after collection to shrink
  candidate set; replace linear cursor advance with binary search
  (partition_point)
- interval_tree: resolve position arrays once per batch instead of per-row
  dispatch in both build and probe paths
- complement: replace Vec<String> seen_contigs with AHashSet for O(1) lookup;
  add merge_cursor tracking to avoid re-scanning merged intervals
- grouped_stream: eliminate double to_string() allocation for new contigs
- Add 21 new unit tests and 2 scaling benchmarks (count_overlaps, coverage)

Closes #36

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The binary search (partition_point) replacement caused a regression on
cross-dataset subtract (chainXenTro3Link vs fBrain-DS14718): ~1.9x slower
at p=1. Root cause: the original persistent cursor is O(n+m) amortized
across sorted left intervals, while partition_point is O(n * log m) per
contig — worse when right-side intervals per contig are small.

Fix: restore the persistent linear cursor while keeping normalize_intervals
(which merges fragmented right-side intervals in-place). This preserves
the massive 26-74x speedup on self-subtract with heavy overlap, while
matching master performance on cross-dataset workloads.

Also adds bench_scaling_subtract_cross benchmark using chainXenTro3Link
vs fBrain-DS14718 for cross-dataset regression testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
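
The persistent-cursor approach restored by the second commit can be sketched as follows. This is a hedged illustration under stated assumptions, not the crate's implementation: the `Interval` alias, the half-open `[start, end)` convention, and the function name `subtract_with_cursor` are hypothetical. It assumes `left` is sorted by start and `right` has already been sorted and merged.

```rust
/// Hypothetical half-open interval `[start, end)`; the crate's real types may differ.
type Interval = (u64, u64);

/// Subtract `right` (sorted, merged) from every interval in `left` (sorted by start).
fn subtract_with_cursor(left: &[Interval], right: &[Interval]) -> Vec<Interval> {
    let mut out = Vec::new();
    // The cursor persists across left intervals and only moves forward.
    let mut cursor = 0usize;
    for &(ls, le) in left {
        // Right intervals ending at or before `ls` cannot affect this or any
        // later left interval, because left starts are non-decreasing.
        while cursor < right.len() && right[cursor].1 <= ls {
            cursor += 1;
        }
        let mut cur = ls;
        let mut i = cursor;
        // Scan only the right intervals that start before this left interval ends.
        while i < right.len() && right[i].0 < le {
            if right[i].0 > cur {
                out.push((cur, right[i].0)); // uncovered gap
            }
            cur = cur.max(right[i].1);
            i += 1;
        }
        if cur < le {
            out.push((cur, le)); // uncovered tail
        }
    }
    out
}
```

The forward-only cursor is what gives the amortized O(n+m) behaviour when left intervals overlap little. With heavily overlapping left intervals the inner scan revisits some right intervals, so the bound in this sketch is looser.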