Commit 2cf7aa1

mwiewior and claude authored
feat: Add bio-function-ranges crate (#10)
* feat: Add bio-function-ranges crate with interval join, coverage, count-overlaps, and nearest

  Add the datafusion-bio-function-ranges crate (ported from sequila-native) with:
  - Interval join optimization via custom query planner and physical optimizer rule
  - coverage() and count_overlaps() SQL table functions registered via register_ranges_functions()
  - create_bio_session() convenience function for fully configured SessionContext
  - 6 interval tree algorithm backends (Coitrees, IntervalTree, ArrayIntervalTree, Lapper, SuperIntervals, CoitreesNearest)
  - Integration tests for count_overlaps, coverage (CSV + 438K-row parquet), and nearest join
  - Unit tests for merge_intervals and get_coverage
  - Migration guide from sequila-native in README.md
  - Test data: CSV interval files and parquet exons/fBrain datasets

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve clippy warnings for CI

  - Add #[cfg] gate on aligned_vec import (unused without avx2/neon)
  - Add Default impl for IntervalMap<T>
  - Replace needless return with expression
  - Use saturating_sub instead of manual bounds check
  - Allow unnecessary_cast in aarch64 SIMD block
  - Inline format args across integration tests and interval_join

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review findings

  - Pin arrow/parquet to 56.1.0 (was 56.2.0) to match ecosystem requirements
  - Document MIT license compatibility for vendored superintervals in README
  - Replace all unwrap(), panic!(), and todo!() in array_utils.rs with proper Result-based error handling (DataFusionError::Plan for missing columns, NotImplemented for unsupported types, Internal for downcast failures)
  - Propagate Result from get_join_col_arrays through build_coitree_from_batches and get_stream callers
  - Replace unwrap() on RecordBatch::try_new with proper error mapping
  - Use match on Option instead of is_none()/unwrap() pattern for tree lookup in get_stream

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a05af74 commit 2cf7aa1

46 files changed

Lines changed: 11008 additions & 49 deletions


Cargo.lock

Lines changed: 569 additions & 48 deletions

Cargo.toml

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 [workspace]
 resolver = "2"
-members = ["datafusion/bio-function-pileup"]
+members = ["datafusion/bio-function-pileup", "datafusion/bio-function-ranges"]
+exclude = ["datafusion/bio-function-ranges/superintervals"]
 
 [workspace.package]
 license = "Apache-2.0"
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
[package]
name = "datafusion-bio-function-ranges"
version = "0.1.0"
description = "Interval join, coverage, and count-overlaps for Apache DataFusion"
license.workspace = true
authors.workspace = true
repository.workspace = true
homepage.workspace = true
edition.workspace = true

[dependencies]
datafusion = { workspace = true }
tokio = { workspace = true }
futures = { workspace = true }
log = { workspace = true }
async-trait = "0.1.88"
ahash = "0.8.11"
coitrees = "0.4.0"
fnv = "1.0.7"
bio = "2.0.1"
rust-lapper = "1.1.0"
superintervals = { path = "superintervals" }
parking_lot = "0.12.3"
hashbrown = "0.14.5"

[dev-dependencies]
tokio = { workspace = true, features = ["rt-multi-thread", "macros"] }
rstest = "0.22.0"
Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
# datafusion-bio-function-ranges

Interval join, coverage, and count-overlaps for Apache DataFusion.

This crate provides optimized genomic interval operations as DataFusion extensions:

- **Interval join**: SQL joins with range overlap conditions, optimized via multiple interval tree algorithms
- **Coverage**: base-pair overlap depth between two interval sets
- **Count overlaps**: number of overlapping intervals per region
- **Nearest**: nearest-neighbor interval matching

## Quick Start

```rust
use datafusion_bio_function_ranges::{create_bio_session, register_ranges_functions};

// Option 1: Create a fully configured session (recommended)
let ctx = create_bio_session();

// Option 2: Register functions on an existing bio-configured session
use datafusion::config::ConfigOptions;
use datafusion::prelude::{SessionConfig, SessionContext};
use datafusion_bio_function_ranges::{BioConfig, BioSessionExt};

let config = SessionConfig::from(ConfigOptions::new())
    .with_option_extension(BioConfig::default())
    .with_information_schema(true)
    .with_repartition_joins(false);
let ctx = SessionContext::new_with_bio(config);
register_ranges_functions(&ctx);
```

## Registering Functions

### `register_ranges_functions(ctx)`

Registers the `coverage` and `count_overlaps` SQL table functions on an existing `SessionContext`. This is analogous to `register_pileup_functions` in the pileup crate.

```rust
use datafusion_bio_function_ranges::register_ranges_functions;

register_ranges_functions(&ctx);
```

### `create_bio_session()`

Convenience function that creates a `SessionContext` with:
- Custom query planner for automatic interval join detection
- Physical optimizer rule that converts hash/nested-loop joins to interval joins
- `BioConfig` extension for algorithm selection via `SET bio.*` statements
- `coverage()` and `count_overlaps()` SQL table functions

```rust
use datafusion_bio_function_ranges::create_bio_session;

let ctx = create_bio_session();
```

## SQL Table Functions

### `coverage(left_table, right_table [, columns...] [, filter_op])`

Computes base-pair coverage depth. Builds an interval tree from `left_table`, then for each row in `right_table`, computes the total overlap in base pairs with the merged intervals.

```sql
-- Default column names: contig, pos_start, pos_end
SELECT * FROM coverage('reads', 'targets')

-- Custom shared column names
SELECT * FROM coverage('reads', 'targets', 'chrom', 'start', 'end')

-- Separate column names for left and right tables
SELECT * FROM coverage('reads', 'targets', 'chrom', 'start', 'end', 'contig', 'pos_start', 'pos_end')

-- For 0-based half-open coordinates (adjusts boundaries with +1/-1)
SELECT * FROM coverage('reads', 'targets', 'contig', 'pos_start', 'pos_end', 'strict')
```
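
The computation described above (merge the left intervals, then sum each right row's overlap with the merged set) can be sketched in plain Rust. This is a minimal illustration, not the crate's implementation: it uses a sorted vector and a linear scan instead of an interval tree, it assumes 1-based inclusive coordinates on a single contig, and the function names `merge_intervals` and `coverage_bp` are chosen for the example (the commit mentions `merge_intervals` and `get_coverage`, whose signatures are not shown here).

```rust
/// Merge overlapping inclusive intervals into a sorted, disjoint list.
/// Assumption: each interval is (start, end) with start <= end.
pub fn merge_intervals(mut intervals: Vec<(i64, i64)>) -> Vec<(i64, i64)> {
    intervals.sort_unstable();
    let mut merged: Vec<(i64, i64)> = Vec::with_capacity(intervals.len());
    for (start, end) in intervals {
        if let Some(last) = merged.last_mut() {
            if start <= last.1 {
                // Shares at least one position with the previous interval: extend it.
                last.1 = last.1.max(end);
                continue;
            }
        }
        merged.push((start, end));
    }
    merged
}

/// Number of positions in the inclusive query [q_start, q_end] covered by `merged`.
/// Linear scan for clarity; an interval tree would avoid visiting every interval.
pub fn coverage_bp(merged: &[(i64, i64)], q_start: i64, q_end: i64) -> i64 {
    merged
        .iter()
        .map(|&(s, e)| (e.min(q_end) - s.max(q_start) + 1).max(0))
        .sum()
}
```

Merging first ensures that a position covered by several left intervals is counted once per right row, which is what distinguishes coverage from an overlap count.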

### `count_overlaps(left_table, right_table [, columns...] [, filter_op])`

Counts overlapping intervals. Same interface as `coverage`, but returns the count of overlapping (non-merged) intervals instead of base-pair overlap.

```sql
SELECT * FROM count_overlaps('reads', 'targets')
```
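
To make the "count of non-merged intervals" concrete, here is a small std-only sketch. It is not how the crate counts (the crate uses interval tree backends); it swaps in a simpler technique based on two sorted arrays: the number of intervals overlapping a query equals the number that start at or before the query end minus the number that end before the query start. The type `OverlapCounter` is hypothetical, and inclusive coordinates on a single contig are assumed.

```rust
/// Hypothetical helper: counts overlaps using sorted start and end arrays.
pub struct OverlapCounter {
    starts: Vec<i64>,
    ends: Vec<i64>,
}

impl OverlapCounter {
    /// Assumption: every interval is (start, end) with start <= end.
    pub fn new(intervals: &[(i64, i64)]) -> Self {
        let mut starts: Vec<i64> = intervals.iter().map(|iv| iv.0).collect();
        let mut ends: Vec<i64> = intervals.iter().map(|iv| iv.1).collect();
        starts.sort_unstable();
        ends.sort_unstable();
        Self { starts, ends }
    }

    /// Count intervals sharing at least one position with [q_start, q_end].
    pub fn count(&self, q_start: i64, q_end: i64) -> usize {
        // Intervals that begin no later than the query ends.
        let started = self.starts.partition_point(|&s| s <= q_end);
        // Intervals that finish before the query begins cannot overlap.
        let finished = self.ends.partition_point(|&e| e < q_start);
        started.saturating_sub(finished)
    }
}
```

Unlike coverage, each left interval contributes separately here, so two reads stacked on the same positions count as two.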

### Filter Operations

| Value | Description | Use when |
|-------|-------------|----------|
| `'weak'` (default) | Standard overlap: `a.start <= b.end AND a.end >= b.start` | 1-based inclusive coordinates |
| `'strict'` | Adjusted boundaries: queries with `start+1, end-1` | 0-based half-open coordinates |
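
The difference between the two modes shows up for intervals that only touch at a boundary. The sketch below restates the table as a predicate; `OverlapFilter` and `overlaps` are local stand-ins written for this example, not the crate's `FilterOp` type or its internal logic, and the assumption is that only the query side is adjusted in strict mode, as the table describes.

```rust
/// Local stand-in for the two filter modes described in the table above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OverlapFilter {
    Weak,
    Strict,
}

/// Test whether `stored` and `query` (both given as (start, end)) overlap.
pub fn overlaps(filter: OverlapFilter, stored: (i64, i64), query: (i64, i64)) -> bool {
    let (q_start, q_end) = match filter {
        // Inclusive coordinates: compare the boundaries as given.
        OverlapFilter::Weak => query,
        // Half-open coordinates: shrink the query by one on each side, which
        // turns the inclusive test into `stored.start < end && stored.end > start`.
        OverlapFilter::Strict => (query.0 + 1, query.1 - 1),
    };
    stored.0 <= q_end && stored.1 >= q_start
}
```

For example, the half-open intervals `[0, 10)` and `[10, 20)` share no position; the weak test would report an overlap because `10 >= 10`, while the strict test does not.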

## Interval Join (SQL)

When using a bio-configured session (`create_bio_session()` or `BioSessionExt::new_with_bio()`), SQL joins with range overlap conditions are automatically optimized:

```sql
-- Automatically detected and optimized as interval join
SELECT *
FROM reads
JOIN targets
  ON reads.contig = targets.contig
  AND reads.pos_start <= targets.pos_end
  AND reads.pos_end >= targets.pos_start
```
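
As a reference for what this join condition selects, the following std-only sketch computes the same pairs outside of DataFusion. It is an illustration under stated assumptions, not the crate's planner or optimizer rule: `Region` and `overlap_join` are hypothetical names, and a per-contig sorted list with binary search stands in for the interval tree backends.

```rust
use std::collections::HashMap;

/// Hypothetical row type mirroring the default column names.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Region {
    pub contig: String,
    pub pos_start: i64,
    pub pos_end: i64,
}

/// Return (read index, target index) pairs satisfying the join condition:
/// same contig, read.pos_start <= target.pos_end, read.pos_end >= target.pos_start.
pub fn overlap_join(reads: &[Region], targets: &[Region]) -> Vec<(usize, usize)> {
    // Build side: group read indices by contig, sorted by start position.
    let mut by_contig: HashMap<&str, Vec<usize>> = HashMap::new();
    for (i, r) in reads.iter().enumerate() {
        by_contig.entry(r.contig.as_str()).or_default().push(i);
    }
    for idxs in by_contig.values_mut() {
        idxs.sort_by_key(|&i| reads[i].pos_start);
    }

    // Probe side: for each target, only reads starting at or before its end can match.
    let mut pairs = Vec::new();
    for (j, t) in targets.iter().enumerate() {
        let Some(idxs) = by_contig.get(t.contig.as_str()) else {
            continue;
        };
        let upper = idxs.partition_point(|&i| reads[i].pos_start <= t.pos_end);
        for &i in &idxs[..upper] {
            if reads[i].pos_end >= t.pos_start {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```

The equality on `contig` partitions the problem, and the two inequalities are the range overlap test; an interval tree replaces the inner scan with a logarithmic lookup per contig.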

### Algorithm Selection

```sql
-- Select interval join algorithm
SET bio.interval_join_algorithm = Coitrees;  -- default, best general performance

-- Available algorithms:
--   Coitrees, IntervalTree, ArrayIntervalTree, Lapper, SuperIntervals
--   CoitreesNearest (1 nearest match per right-side row)
```

### Nearest Join

```sql
SET bio.interval_join_algorithm = CoitreesNearest;

SELECT *
FROM targets
JOIN reads
  ON targets.contig = reads.contig
  AND targets.pos_start <= reads.pos_end
  AND targets.pos_end >= reads.pos_start
```

Returns exactly one match per right-side row: the overlapping interval if one exists, otherwise the nearest interval by distance.
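
The selection rule can be sketched as a distance function where any overlapping interval has distance zero. This is a minimal linear-scan illustration, not the `CoitreesNearest` backend: the function name `nearest` is hypothetical, coordinates are assumed inclusive on a single contig, and ties are broken by taking the first candidate in input order, which is an assumption rather than documented behavior of the crate.

```rust
/// Pick one interval for `query`: an overlapping one if any exists,
/// otherwise the one with the smallest gap to the query.
/// Returns None when `intervals` is empty.
pub fn nearest(intervals: &[(i64, i64)], query: (i64, i64)) -> Option<(i64, i64)> {
    intervals.iter().copied().min_by_key(|&(s, e)| {
        if s <= query.1 && e >= query.0 {
            0 // overlaps the query
        } else if e < query.0 {
            query.0 - e // candidate lies entirely to the left
        } else {
            s - query.1 // candidate lies entirely to the right
        }
    })
}
```

Because overlap is treated as distance zero, a single minimum search covers both cases in the description above.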

## Programmatic API

For direct Rust usage without SQL:

```rust
use std::sync::Arc;
use datafusion_bio_function_ranges::{CountOverlapsProvider, FilterOp};

// Assumes `ctx` (a SessionContext) and `targets_schema` are already in scope,
// and that this runs inside an async function returning a Result.
let provider = CountOverlapsProvider::new(
    Arc::new(ctx.clone()),
    "reads".to_string(),    // left table (built into interval tree)
    "targets".to_string(),  // right table (gets count/coverage column)
    targets_schema,         // Schema of the right table
    vec!["contig".into(), "pos_start".into(), "pos_end".into()],  // left columns
    vec!["contig".into(), "pos_start".into(), "pos_end".into()],  // right columns
    FilterOp::Weak,         // or FilterOp::Strict for 0-based half-open
    true,                   // true = coverage, false = count_overlaps
);
ctx.register_table("result", Arc::new(provider))?;
let df = ctx.sql("SELECT * FROM result").await?;
```

## Migration from sequila-native

This crate replaces the `sequila-core` crate from the [sequila-native](https://github.com/biodatageeks/sequila-native) repository. Existing functionality is unchanged; names and the module structure have changed, and `coverage` and `count_overlaps` are newly exposed as SQL table functions.

### Type Renames

| sequila-native | datafusion-bio-function-ranges |
|----------------|-------------------------------|
| `sequila_core::session_context::SeQuiLaSessionExt` | `BioSessionExt` |
| `sequila_core::session_context::SequilaConfig` | `BioConfig` |
| `sequila_core::session_context::Algorithm` | `Algorithm` |
| `SessionContext::new_with_sequila(config)` | `SessionContext::new_with_bio(config)` |

### Configuration Namespace

| sequila-native | datafusion-bio-function-ranges |
|----------------|-------------------------------|
| `SET sequila.prefer_interval_join = true` | `SET bio.prefer_interval_join = true` |
| `SET sequila.interval_join_algorithm = Coitrees` | `SET bio.interval_join_algorithm = Coitrees` |
| `SET sequila.interval_join_low_memory = true` | `SET bio.interval_join_low_memory = true` |

### Registration Pattern

**Before (sequila-native):**
```rust
use sequila_core::session_context::{SeQuiLaSessionExt, SequilaConfig};

let mut sequila_config = SequilaConfig::default();
sequila_config.prefer_interval_join = true;

let config = SessionConfig::from(options)
    .with_option_extension(sequila_config);

let ctx = SessionContext::new_with_sequila(config);
```

**After (datafusion-bio-function-ranges):**
```rust
use datafusion_bio_function_ranges::{create_bio_session, register_ranges_functions};

// Simple: creates context with everything configured
let ctx = create_bio_session();

// Or manually:
use datafusion_bio_function_ranges::{BioConfig, BioSessionExt};

let config = SessionConfig::from(options)
    .with_option_extension(BioConfig::default());
let ctx = SessionContext::new_with_bio(config);
register_ranges_functions(&ctx);  // registers coverage() and count_overlaps() UDTFs
```

### New SQL Table Functions

The `coverage` and `count_overlaps` operations are now available as SQL table functions (previously only accessible via the Rust `CountOverlapsProvider` API):

```sql
SELECT * FROM coverage('reads', 'targets')
SELECT * FROM count_overlaps('reads', 'targets')
```

### Dependency Update

**Before:**
```toml
sequila-core = { git = "https://github.com/biodatageeks/sequila-native.git", rev = "..." }
```

**After:**
```toml
datafusion-bio-function-ranges = { git = "https://github.com/biodatageeks/datafusion-bio-functions.git", rev = "..." }
```

## Version Compatibility

| Dependency | Version |
|-----------|---------|
| DataFusion | 50.3.0 |
| Arrow | 56.1.0 |
| Rust edition | 2024 |

These versions must stay in sync with `datafusion-bio-formats` and `polars-bio`.

## License

This crate is licensed under the **Apache License 2.0**, consistent with the rest of the `datafusion-bio-functions` workspace.

The vendored `superintervals` sub-crate (in `superintervals/`) is licensed under the **MIT License** by Kez Cleal. MIT is a permissive license compatible with Apache 2.0: MIT-licensed code can be included in Apache 2.0 projects as long as its copyright notice and license text are retained.
