ranges: trim interval join build payload by mwiewior · Pull Request #35 · biodatageeks/datafusion-bio-functions

mwiewior · 2026-03-12T09:40:57Z

Partially addresses #33

Summary

build interval indexes directly in algorithm-specific buffers instead of first materializing BioInterval
replace the callback-based probe API with collect_matches(..., &mut Vec<u32>)
fill probe-side row indices directly while probing instead of RLE -> expand_probe_indices
project the concatenated build-side batch down to only the left columns needed by the join output or residual filter
add examples/interval_join_bench.rs to benchmark hash join vs interval join algorithms on count-heavy and wide-output overlap queries
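
To make the probe-side changes above concrete, here is a minimal, std-only sketch of a `collect_matches`-style API that appends build-side row ids to a caller-owned `Vec<u32>`, with probe-side row indices filled in the same loop. Only the name `collect_matches` and the `&mut Vec<u32>` parameter come from this PR. The `SortedIntervals` type, its fields, the `probe` helper, and the closed-interval semantics are assumptions for illustration and do not reflect the crate's actual index structures.

```rust
/// Hypothetical index: plain sorted buffers built straight from the
/// start/end columns, with no intermediate per-row interval structs.
struct SortedIntervals {
    starts: Vec<i32>, // interval starts, ascending
    ends: Vec<i32>,   // inclusive end for each entry in `starts`
    rows: Vec<u32>,   // original build-side row index for each entry
}

impl SortedIntervals {
    fn build(starts: &[i32], ends: &[i32]) -> Self {
        assert_eq!(starts.len(), ends.len());
        let mut order: Vec<u32> = (0..starts.len() as u32).collect();
        order.sort_by_key(|&i| starts[i as usize]);
        Self {
            starts: order.iter().map(|&i| starts[i as usize]).collect(),
            ends: order.iter().map(|&i| ends[i as usize]).collect(),
            rows: order,
        }
    }

    /// Appends the build-side row index of every interval overlapping
    /// the closed range [qs, qe] to `out`. No callback, no per-call allocation.
    fn collect_matches(&self, qs: i32, qe: i32, out: &mut Vec<u32>) {
        // Only intervals whose start is <= qe can overlap the query.
        let hi = self.starts.partition_point(|&s| s <= qe);
        // Linear scan over candidates; a real index would prune this further.
        for i in 0..hi {
            if self.ends[i] >= qs {
                out.push(self.rows[i]);
            }
        }
    }
}

/// Probes every query row and writes matching (build, probe) index pairs
/// directly, without a run-length encoding and expansion step.
fn probe(index: &SortedIntervals, starts: &[i32], ends: &[i32]) -> (Vec<u32>, Vec<u32>) {
    let mut build_idx = Vec::new();
    let mut probe_idx = Vec::new();
    for (row, (&s, &e)) in starts.iter().zip(ends).enumerate() {
        let before = build_idx.len();
        index.collect_matches(s, e, &mut build_idx);
        let added = build_idx.len() - before;
        // One probe-side index per match found for this probe row.
        probe_idx.extend(std::iter::repeat(row as u32).take(added));
    }
    (build_idx, probe_idx)
}

fn main() {
    let index = SortedIntervals::build(&[10, 1, 5], &[20, 3, 8]);
    let (build_idx, probe_idx) = probe(&index, &[2, 9, 15], &[6, 9, 30]);
    println!("build rows: {:?}, probe rows: {:?}", build_idx, probe_idx);
}
```

The two index vectors are the shape a take-based output path would consume: one `take` per side, driven by `build_idx` and `probe_idx`.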

Why

IntervalJoinExec still spent a disproportionate amount of time in build-side materialization on count-style overlap queries. The base branch concatenated the full left payload even when the join output needed no left columns (as with COUNT(*)) and no residual filter read them.

This branch keeps the fast take-based output path, but trims the concatenated build batch to just the columns that are actually read later. It also removes two layers of probe-side overhead that were still scalar and allocation-heavy.
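
The column trimming described above can be sketched as a small planning step: take the union of the left columns read by the join output and by the residual filter, keep only those, and remap old column positions to new ones. The function name `plan_build_projection` and its signature are hypothetical and are not taken from this PR. In an Arrow-based operator, the resulting `kept` list could plausibly be fed to something like `RecordBatch::project`, but that wiring is an assumption here.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper. Returns:
/// - `kept`: sorted original left-column indices that must be retained
/// - `remap`: for each original index, its position in the trimmed batch
fn plan_build_projection(
    num_left_cols: usize,
    output_left_cols: &[usize],
    filter_left_cols: &[usize],
) -> (Vec<usize>, Vec<Option<usize>>) {
    // Union of columns read by the output and by the residual filter.
    let needed: BTreeSet<usize> = output_left_cols
        .iter()
        .chain(filter_left_cols)
        .copied()
        .filter(|&c| c < num_left_cols)
        .collect();
    let kept: Vec<usize> = needed.into_iter().collect();

    // Old index -> new index, so downstream column references can be rewritten.
    let mut remap = vec![None; num_left_cols];
    for (new, &old) in kept.iter().enumerate() {
        remap[old] = Some(new);
    }
    (kept, remap)
}

fn main() {
    // COUNT(*)-style query: no left columns in the output, no residual filter.
    let (kept, _) = plan_build_projection(5, &[], &[]);
    println!("count-style query keeps {} left columns", kept.len());

    // Wide output plus a filter that reads columns 3 and 4.
    let (kept, remap) = plan_build_projection(5, &[3, 0], &[3, 4]);
    println!("kept = {:?}, remap = {:?}", kept, remap);
}
```

For a count-style query the kept set is empty, which is the case where skipping payload concatenation saves the most work.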

Benchmark

Release benchmark:

cargo run --release -p datafusion-bio-function-ranges --example interval_join_bench

bucketed_sparse_count EXPLAIN ANALYZE, IntervalJoinExec with Algorithm::Coitrees:

base branch perf/issue-33-interval-join: build_time=4.124417ms, join_time=2.035374ms
this branch: build_time=3.268291ms, join_time=1.883876ms
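
As a quick check on the two metric lines above, the relative reductions work out to roughly 21% for build_time and roughly 7% for join_time. The snippet below only redoes that arithmetic on the numbers reported here; `pct_reduction` is a hypothetical helper, not part of the benchmark example.

```rust
/// Percentage by which `new_ms` is lower than `base_ms`.
fn pct_reduction(base_ms: f64, new_ms: f64) -> f64 {
    (base_ms - new_ms) / base_ms * 100.0
}

fn main() {
    // Operator metrics reported for bucketed_sparse_count with Algorithm::Coitrees.
    let build = pct_reduction(4.124417, 3.268291);
    let join = pct_reduction(2.035374, 1.883876);
    println!("build_time reduction: {:.1}%", build);
    println!("join_time reduction: {:.1}%", join);
}
```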

Representative wall-clock results on the same synthetic dataset (49,152 x 49,152 rows):

bucketed_sparse_count: 6ms -> 6ms (within noise, but lower operator metrics)
bucketed_sparse_wide: 8ms -> 8ms (no regression on wide output)
contig_only_sparse_count: 5ms -> 5ms

The main win is inside the operator metrics for count-style overlap joins, where the build side no longer concatenates unused left payload columns.

Validation

cargo test -p datafusion-bio-function-ranges
cargo run --release -p datafusion-bio-function-ranges --example interval_join_bench

mwiewior added 2 commits March 12, 2026 10:39

Optimize interval join build projection

2c3f5e5

Fix clippy format arg lint

3166c99

mwiewior wants to merge 2 commits into perf/issue-33-interval-join from perf/issue-33-interval-join-build-probe
