
ranges: trim interval join build payload #35

Open: mwiewior wants to merge 2 commits into perf/issue-33-interval-join from perf/issue-33-interval-join-build-probe

@mwiewior
Contributor

Partially addresses #33

Summary

  • build interval indexes directly in algorithm-specific buffers instead of first materializing BioInterval
  • replace the callback-based probe API with collect_matches(..., &mut Vec<u32>)
  • fill probe-side row indices directly while probing instead of RLE -> expand_probe_indices
  • project the concatenated build-side batch down to only the left columns needed by the join output or residual filter
  • add examples/interval_join_bench.rs to benchmark hash join vs interval join algorithms on count-heavy and wide-output overlap queries
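For readers unfamiliar with the out-buffer probe shape, here is a minimal sketch of how a `collect_matches(..., &mut Vec<u32>)` style API lets the caller fill probe-side row indices while probing, with no intermediate run-length step. Everything here is an assumption for illustration: `ToyIntervalIndex`, `probe_all`, the closed-interval semantics, and the exact `collect_matches` signature are hypothetical, and the sorted-vector scan stands in for the real index structures (such as Coitrees) used by the operator.

```rust
/// Hypothetical, simplified interval index: entries are stored in a flat
/// buffer of (start, end, build_row), sorted by start. This is NOT the
/// crate's implementation; it only illustrates the API shape.
pub struct ToyIntervalIndex {
    entries: Vec<(i32, i32, u32)>,
}

impl ToyIntervalIndex {
    /// Build directly from start/end slices without materializing
    /// an intermediate per-row interval struct.
    pub fn build(starts: &[i32], ends: &[i32]) -> Self {
        let mut entries: Vec<(i32, i32, u32)> = starts
            .iter()
            .zip(ends)
            .enumerate()
            .map(|(row, (&s, &e))| (s, e, row as u32))
            .collect();
        entries.sort_unstable_by_key(|&(s, _, _)| s);
        Self { entries }
    }

    /// Append the build-side row index of every interval that overlaps the
    /// closed query interval [start, end] to `out`. The caller owns `out`,
    /// so no callback and no per-probe allocation is needed.
    pub fn collect_matches(&self, start: i32, end: i32, out: &mut Vec<u32>) {
        // Only entries whose start is <= the query end can overlap.
        let upper = self.entries.partition_point(|&(s, _, _)| s <= end);
        for &(_, e, row) in &self.entries[..upper] {
            if e >= start {
                out.push(row);
            }
        }
    }
}

/// Probe every right-side row and return parallel (build_idx, probe_idx)
/// vectors suitable for a take-based output path.
pub fn probe_all(
    index: &ToyIntervalIndex,
    starts: &[i32],
    ends: &[i32],
) -> (Vec<u32>, Vec<u32>) {
    let mut build_idx: Vec<u32> = Vec::new();
    let mut probe_idx: Vec<u32> = Vec::new();
    for (probe_row, (&s, &e)) in starts.iter().zip(ends).enumerate() {
        let before = build_idx.len();
        index.collect_matches(s, e, &mut build_idx);
        // Fill probe indices immediately: one entry per match just appended.
        let matched = build_idx.len() - before;
        probe_idx.extend(std::iter::repeat(probe_row as u32).take(matched));
    }
    (build_idx, probe_idx)
}
```

In this sketch the two index vectors grow in lockstep, so they can be handed straight to a take kernel; the trade-off is that the match order follows the toy index's sort order rather than any order the real algorithms guarantee.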

Why

IntervalJoinExec still spent a disproportionate amount of time in build-side materialization on count-style overlap queries. The first PR branch concatenated the full left payload even when the join output was empty (COUNT(*)) and no residual filter needed those columns.

This branch keeps the fast take-based output path, but trims the concatenated build batch to just the columns that are actually read later. It also removes two layers of probe-side overhead that were still scalar and allocation-heavy.
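The column-trimming idea can be sketched with plain vectors. This is only an illustration under stated assumptions: `required_left_columns` and `project_columns` are hypothetical helper names, and a batch is modelled as one `Vec<i64>` per column rather than an Arrow record batch.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: the left-side column indices that must survive
/// build-side concatenation, i.e. those read by the join output or by the
/// residual filter. Returned sorted and deduplicated.
pub fn required_left_columns(
    output_left_cols: &[usize],
    filter_left_cols: &[usize],
) -> Vec<usize> {
    let needed: BTreeSet<usize> = output_left_cols
        .iter()
        .chain(filter_left_cols)
        .copied()
        .collect();
    needed.into_iter().collect()
}

/// Hypothetical helper: keep only the listed columns of a toy columnar
/// batch. Columns not listed are never copied.
pub fn project_columns(batch: &[Vec<i64>], keep: &[usize]) -> Vec<Vec<i64>> {
    keep.iter().map(|&i| batch[i].clone()).collect()
}
```

For a `COUNT(*)` query with no residual filter, both input slices are empty, so the projected build batch carries no payload columns at all; a wide-output query lists most columns and is left essentially unchanged.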

Benchmark

Release benchmark:

cargo run --release -p datafusion-bio-function-ranges --example interval_join_bench

bucketed_sparse_count EXPLAIN ANALYZE, IntervalJoinExec with Algorithm::Coitrees:

  • base branch perf/issue-33-interval-join: build_time=4.124417ms, join_time=2.035374ms
  • this branch: build_time=3.268291ms, join_time=1.883876ms
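To put the two metric pairs above on a common scale, the relative reduction can be computed as below. `relative_reduction` is a hypothetical helper written for this note, not part of the benchmark example.

```rust
/// Fraction by which `after` is lower than `before` (0.25 means 25% lower).
/// Hypothetical helper for reading the EXPLAIN ANALYZE metrics.
pub fn relative_reduction(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms - after_ms) / before_ms
}

/// Metric pairs (base branch, this branch) copied from the list above.
pub const BUILD_TIME_MS: (f64, f64) = (4.124417, 3.268291);
pub const JOIN_TIME_MS: (f64, f64) = (2.035374, 1.883876);
```

Applied to the numbers reported here, this gives roughly a 21% lower build_time and a 7% lower join_time for a single run; the figures say nothing about run-to-run variance.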

Representative wall-clock results on the same synthetic dataset (49,152 x 49,152 rows):

  • bucketed_sparse_count: 6ms -> 6ms (within noise, but lower operator metrics)
  • bucketed_sparse_wide: 8ms -> 8ms (no regression on wide output)
  • contig_only_sparse_count: 5ms -> 5ms

The main win is inside the operator metrics for count-style overlap joins, where the build side no longer concatenates unused left payload columns.

Validation

  • cargo test -p datafusion-bio-function-ranges
  • cargo run --release -p datafusion-bio-function-ranges --example interval_join_bench
