| 1 | +# datafusion-bio-function-ranges |
| 2 | + |
| 3 | +Interval join, coverage, count-overlaps, and nearest-interval matching for Apache DataFusion. |
| 4 | + |
| 5 | +This crate provides optimized genomic interval operations as DataFusion extensions: |
| 6 | + |
| 7 | +- **Interval join** — SQL joins with range overlap conditions, optimized via multiple interval tree algorithms |
| 8 | +- **Coverage** — base-pair overlap depth between two interval sets |
| 9 | +- **Count overlaps** — number of overlapping intervals per region |
| 10 | +- **Nearest** — nearest-neighbor interval matching |
| 11 | + |
| 12 | +## Quick Start |
| 13 | + |
| 14 | +```rust |
| 15 | +use datafusion_bio_function_ranges::{create_bio_session, register_ranges_functions}; |
| 16 | + |
| 17 | +// Option 1: Create a fully configured session (recommended) |
| 18 | +let ctx = create_bio_session(); |
| 19 | + |
| 20 | +// Option 2: Register functions on an existing bio-configured session |
| 21 | +use datafusion::config::ConfigOptions; |
| 22 | +use datafusion::prelude::{SessionConfig, SessionContext}; |
| 23 | +use datafusion_bio_function_ranges::{BioConfig, BioSessionExt}; |
| 24 | + |
| 25 | +let config = SessionConfig::from(ConfigOptions::new()) |
| 26 | + .with_option_extension(BioConfig::default()) |
| 27 | + .with_information_schema(true) |
| 28 | + .with_repartition_joins(false); |
| 29 | +let ctx = SessionContext::new_with_bio(config); |
| 30 | +register_ranges_functions(&ctx); |
| 31 | +``` |
| 32 | + |
| 33 | +## Registering Functions |
| 34 | + |
| 35 | +### `register_ranges_functions(ctx)` |
| 36 | + |
| 37 | +Registers the `coverage` and `count_overlaps` SQL table functions on an existing `SessionContext`. This is analogous to `register_pileup_functions` in the pileup crate. |
| 38 | + |
| 39 | +```rust |
| 40 | +use datafusion_bio_function_ranges::register_ranges_functions; |
| 41 | + |
| 42 | +register_ranges_functions(&ctx); |
| 43 | +``` |
| 44 | + |
| 45 | +### `create_bio_session()` |
| 46 | + |
| 47 | +Convenience function that creates a `SessionContext` with: |
| 48 | +- Custom query planner for automatic interval join detection |
| 49 | +- Physical optimizer rule that converts hash/nested-loop joins to interval joins |
| 50 | +- `BioConfig` extension for algorithm selection via `SET bio.*` statements |
| 51 | +- `coverage()` and `count_overlaps()` SQL table functions |
| 52 | + |
| 53 | +```rust |
| 54 | +use datafusion_bio_function_ranges::create_bio_session; |
| 55 | + |
| 56 | +let ctx = create_bio_session(); |
| 57 | +``` |
| 58 | + |
| 59 | +## SQL Table Functions |
| 60 | + |
| 61 | +### `coverage(left_table, right_table [, columns...] [, filter_op])` |
| 62 | + |
| 63 | +Computes base-pair coverage depth. Builds an interval tree from `left_table`, then for each row in `right_table`, computes the total overlap in base pairs with the merged intervals. |
| 64 | + |
| 65 | +```sql |
| 66 | +-- Default column names: contig, pos_start, pos_end |
| 67 | +SELECT * FROM coverage('reads', 'targets'); |
| 68 | + |
| 69 | +-- Custom column names shared by both tables |
| 70 | +SELECT * FROM coverage('reads', 'targets', 'chrom', 'start', 'end'); |
| 71 | + |
| 72 | +-- Separate column names for left and right tables |
| 73 | +SELECT * FROM coverage('reads', 'targets', 'chrom', 'start', 'end', 'contig', 'pos_start', 'pos_end'); |
| 74 | + |
| 75 | +-- For 0-based half-open coordinates (adjusts boundaries with +1/-1) |
| 76 | +SELECT * FROM coverage('reads', 'targets', 'contig', 'pos_start', 'pos_end', 'strict'); |
| 77 | +``` |
| 78 | + |
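To make the "merge, then sum the overlap" description of `coverage` concrete, here is a minimal stdlib-only sketch. It is not the crate's implementation: it uses a linear scan in place of an interval tree, handles a single contig, and assumes 1-based inclusive coordinates, so an overlap spans `min(end) - max(start) + 1` base pairs. The names `merge` and `coverage_bp` are hypothetical.

```rust
/// Merge overlapping closed intervals (input need not be sorted).
fn merge(mut ivs: Vec<(i64, i64)>) -> Vec<(i64, i64)> {
    ivs.sort();
    let mut out: Vec<(i64, i64)> = Vec::new();
    for (s, e) in ivs {
        if let Some(last) = out.last_mut() {
            if s <= last.1 {
                // Overlaps the previous merged interval: extend it.
                last.1 = last.1.max(e);
                continue;
            }
        }
        out.push((s, e));
    }
    out
}

/// Total base pairs of `target` covered by the merged intervals
/// (assumes 1-based inclusive coordinates on one contig).
fn coverage_bp(merged: &[(i64, i64)], target: (i64, i64)) -> i64 {
    merged
        .iter()
        .map(|&(s, e)| (e.min(target.1) - s.max(target.0) + 1).max(0))
        .sum()
}

fn main() {
    let reads = vec![(100, 150), (120, 180), (300, 350)];
    let merged = merge(reads);
    // The first two reads collapse into one interval.
    println!("{:?}", merged); // [(100, 180), (300, 350)]
    // 140..=180 contributes 41 bp, 300..=320 contributes 21 bp.
    println!("{}", coverage_bp(&merged, (140, 320))); // 62
}
```

Merging first is what keeps positions covered by several reads from being counted more than once.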
| 79 | +### `count_overlaps(left_table, right_table [, columns...] [, filter_op])` |
| 80 | + |
| 81 | +Counts overlapping intervals. Same interface as `coverage`, but returns the count of overlapping (non-merged) intervals instead of base-pair overlap. |
| 82 | + |
| 83 | +```sql |
| 84 | +SELECT * FROM count_overlaps('reads', 'targets') |
| 85 | +``` |
| 86 | + |
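The difference from `coverage` is that intervals are counted individually rather than merged. The sketch below illustrates that semantics with a plain linear scan and the default (`'weak'`) overlap test. The `Interval` struct and `count_overlaps` function are hypothetical helpers for illustration, not types exported by this crate.

```rust
/// Hypothetical illustration type: one interval on a named contig.
struct Interval<'a> {
    contig: &'a str,
    start: i64,
    end: i64,
}

/// Count the intervals in `reads` that overlap `target` on the same contig.
/// Overlapping reads are counted separately, not merged.
fn count_overlaps(reads: &[Interval<'_>], target: &Interval<'_>) -> usize {
    reads
        .iter()
        .filter(|r| {
            r.contig == target.contig && r.start <= target.end && r.end >= target.start
        })
        .count()
}

fn main() {
    let reads = [
        Interval { contig: "chr1", start: 100, end: 150 },
        Interval { contig: "chr1", start: 120, end: 180 },
        Interval { contig: "chr2", start: 130, end: 160 }, // different contig
        Interval { contig: "chr1", start: 300, end: 350 }, // no overlap
    ];
    let target = Interval { contig: "chr1", start: 140, end: 170 };
    println!("{}", count_overlaps(&reads, &target)); // 2
}
```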
| 87 | +### Filter Operations |
| 88 | + |
| 89 | +| Value | Description | Use when | |
| 90 | +|-------|-------------|----------| |
| 91 | +| `'weak'` (default) | Standard overlap: `a.start <= b.end AND a.end >= b.start` | 1-based inclusive coordinates | |
| 92 | +| `'strict'` | Adjusted boundaries: the query interval is searched as `start+1, end-1` | 0-based half-open coordinates | |
| 93 | + |
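The two filter operations differ only at interval boundaries. The following sketch shows one reading of the table above: `'strict'` is modeled as shrinking the query by one position on each side before applying the same closed-interval test. That reading is an assumption based on the table wording, and `overlaps_weak` and `overlaps_strict` are hypothetical names, not crate functions.

```rust
/// Closed-interval overlap test, as in the `'weak'` row:
/// each interval starts no later than the other one ends.
fn overlaps_weak(a_start: i64, a_end: i64, b_start: i64, b_end: i64) -> bool {
    a_start <= b_end && a_end >= b_start
}

/// Assumed reading of `'strict'`: query with `start + 1, end - 1`,
/// then apply the same closed-interval test.
fn overlaps_strict(q_start: i64, q_end: i64, s_start: i64, s_end: i64) -> bool {
    overlaps_weak(q_start + 1, q_end - 1, s_start, s_end)
}

fn main() {
    // Query [100, 200] and subject [200, 300] touch only at position 200.
    println!("weak:   {}", overlaps_weak(100, 200, 200, 300)); // true
    println!("strict: {}", overlaps_strict(100, 200, 200, 300)); // false
    // A query that genuinely extends into the subject matches in both modes.
    println!("strict: {}", overlaps_strict(100, 250, 200, 300)); // true
}
```

Under this reading, intervals that merely touch at a shared boundary match with `'weak'` but not with `'strict'`.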
| 94 | +## Interval Join (SQL) |
| 95 | + |
| 96 | +When using a bio-configured session (`create_bio_session()` or `BioSessionExt::new_with_bio()`), SQL joins with range overlap conditions are automatically optimized: |
| 97 | + |
| 98 | +```sql |
| 99 | +-- Automatically detected and optimized as interval join |
| 100 | +SELECT * |
| 101 | +FROM reads |
| 102 | +JOIN targets |
| 103 | + ON reads.contig = targets.contig |
| 104 | + AND reads.pos_start <= targets.pos_end |
| 105 | + AND reads.pos_end >= targets.pos_start |
| 106 | +``` |
| 107 | + |
| 108 | +### Algorithm Selection |
| 109 | + |
| 110 | +```sql |
| 111 | +-- Select interval join algorithm |
| 112 | +SET bio.interval_join_algorithm = Coitrees; -- default, best general performance |
| 113 | + |
| 114 | +-- Available algorithms: |
| 115 | +-- Coitrees, IntervalTree, ArrayIntervalTree, Lapper, SuperIntervals |
| 116 | +-- CoitreesNearest (1 nearest match per right-side row) |
| 117 | +``` |
| 118 | + |
| 119 | +### Nearest Join |
| 120 | + |
| 121 | +```sql |
| 122 | +SET bio.interval_join_algorithm = CoitreesNearest; |
| 123 | + |
| 124 | +SELECT * |
| 125 | +FROM targets |
| 126 | +JOIN reads |
| 127 | + ON targets.contig = reads.contig |
| 128 | + AND targets.pos_start <= reads.pos_end |
| 129 | + AND targets.pos_end >= reads.pos_start |
| 130 | +``` |
| 131 | + |
| 132 | +Returns exactly one match per right-side row: the overlapping interval if one exists, otherwise the nearest interval by distance. |
| 133 | + |
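The "overlap first, otherwise nearest by distance" rule can be sketched in plain Rust as follows. This is an illustration only: it scans all candidates linearly instead of using an interval tree, works on a single contig, and breaks ties by taking the first candidate, which is an assumption rather than documented behavior. The `distance` and `nearest` helpers are hypothetical.

```rust
/// Gap between two closed intervals; 0 when they overlap.
fn distance(q: (i64, i64), c: (i64, i64)) -> i64 {
    if c.1 < q.0 {
        q.0 - c.1 // candidate lies entirely to the left of the query
    } else if c.0 > q.1 {
        c.0 - q.1 // candidate lies entirely to the right of the query
    } else {
        0
    }
}

/// Pick one candidate for the query: an overlapping interval (distance 0)
/// if any exists, otherwise the one with the smallest gap.
fn nearest(q: (i64, i64), candidates: &[(i64, i64)]) -> Option<(i64, i64)> {
    candidates.iter().copied().min_by_key(|&c| distance(q, c))
}

fn main() {
    let candidates = [(10, 20), (50, 60), (100, 110)];
    println!("{:?}", nearest((55, 58), &candidates)); // Some((50, 60)), overlapping
    println!("{:?}", nearest((25, 30), &candidates)); // Some((10, 20)), gap of 5
    println!("{:?}", nearest((70, 95), &candidates)); // Some((100, 110)), gap of 5
}
```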
| 134 | +## Programmatic API |
| 135 | + |
| 136 | +For direct Rust usage without SQL: |
| 137 | + |
| 138 | +```rust |
| 139 | +use std::sync::Arc; |
| 140 | +use datafusion_bio_function_ranges::{CountOverlapsProvider, FilterOp}; |
| 141 | + |
| 142 | +let provider = CountOverlapsProvider::new( |
| 143 | + Arc::new(ctx.clone()), |
| 144 | + "reads".to_string(), // left table (built into interval tree) |
| 145 | + "targets".to_string(), // right table (gets count/coverage column) |
| 146 | + targets_schema, // Schema of the right table |
| 147 | + vec!["contig".into(), "pos_start".into(), "pos_end".into()], // left columns |
| 148 | + vec!["contig".into(), "pos_start".into(), "pos_end".into()], // right columns |
| 149 | + FilterOp::Weak, // or FilterOp::Strict for 0-based half-open |
| 150 | + true, // true = coverage, false = count_overlaps |
| 151 | +); |
| 152 | +ctx.register_table("result", Arc::new(provider))?; |
| 153 | +let df = ctx.sql("SELECT * FROM result").await?; |
| 154 | +``` |
| 155 | + |
| 156 | +## Migration from sequila-native |
| 157 | + |
| 158 | +This crate replaces the `sequila-core` crate from the [sequila-native](https://github.com/biodatageeks/sequila-native) repository. Existing functionality is unchanged; names and the module structure have changed, and `coverage` and `count_overlaps` are additionally exposed as SQL table functions. |
| 159 | + |
| 160 | +### Type Renames |
| 161 | + |
| 162 | +| sequila-native | datafusion-bio-function-ranges | |
| 163 | +|----------------|-------------------------------| |
| 164 | +| `sequila_core::session_context::SeQuiLaSessionExt` | `BioSessionExt` | |
| 165 | +| `sequila_core::session_context::SequilaConfig` | `BioConfig` | |
| 166 | +| `sequila_core::session_context::Algorithm` | `Algorithm` | |
| 167 | +| `SessionContext::new_with_sequila(config)` | `SessionContext::new_with_bio(config)` | |
| 168 | + |
| 169 | +### Configuration Namespace |
| 170 | + |
| 171 | +| sequila-native | datafusion-bio-function-ranges | |
| 172 | +|----------------|-------------------------------| |
| 173 | +| `SET sequila.prefer_interval_join = true` | `SET bio.prefer_interval_join = true` | |
| 174 | +| `SET sequila.interval_join_algorithm = Coitrees` | `SET bio.interval_join_algorithm = Coitrees` | |
| 175 | +| `SET sequila.interval_join_low_memory = true` | `SET bio.interval_join_low_memory = true` | |
| 176 | + |
| 177 | +### Registration Pattern |
| 178 | + |
| 179 | +**Before (sequila-native):** |
| 180 | +```rust |
| 181 | +use sequila_core::session_context::{SeQuiLaSessionExt, SequilaConfig}; |
| 182 | + |
| 183 | +let mut sequila_config = SequilaConfig::default(); |
| 184 | +sequila_config.prefer_interval_join = true; |
| 185 | + |
| 186 | +let config = SessionConfig::from(options) |
| 187 | + .with_option_extension(sequila_config); |
| 188 | + |
| 189 | +let ctx = SessionContext::new_with_sequila(config); |
| 190 | +``` |
| 191 | + |
| 192 | +**After (datafusion-bio-function-ranges):** |
| 193 | +```rust |
| 194 | +use datafusion_bio_function_ranges::{create_bio_session, register_ranges_functions}; |
| 195 | + |
| 196 | +// Simple: creates context with everything configured |
| 197 | +let ctx = create_bio_session(); |
| 198 | + |
| 199 | +// Or manually: |
| 200 | +use datafusion_bio_function_ranges::{BioConfig, BioSessionExt}; |
| 201 | + |
| 202 | +let config = SessionConfig::from(options) |
| 203 | + .with_option_extension(BioConfig::default()); |
| 204 | +let ctx = SessionContext::new_with_bio(config); |
| 205 | +register_ranges_functions(&ctx); // registers coverage() and count_overlaps() UDTFs |
| 206 | +``` |
| 207 | + |
| 208 | +### New SQL Table Functions |
| 209 | + |
| 210 | +The `coverage` and `count_overlaps` operations are now available as SQL table functions (previously only accessible via the Rust `CountOverlapsProvider` API): |
| 211 | + |
| 212 | +```sql |
| 213 | +SELECT * FROM coverage('reads', 'targets'); |
| 214 | +SELECT * FROM count_overlaps('reads', 'targets'); |
| 215 | +``` |
| 216 | + |
| 217 | +### Dependency Update |
| 218 | + |
| 219 | +**Before:** |
| 220 | +```toml |
| 221 | +sequila-core = { git = "https://github.com/biodatageeks/sequila-native.git", rev = "..." } |
| 222 | +``` |
| 223 | + |
| 224 | +**After:** |
| 225 | +```toml |
| 226 | +datafusion-bio-function-ranges = { git = "https://github.com/biodatageeks/datafusion-bio-functions.git", rev = "..." } |
| 227 | +``` |
| 228 | + |
| 229 | +## Version Compatibility |
| 230 | + |
| 231 | +| Dependency | Version | |
| 232 | +|-----------|---------| |
| 233 | +| DataFusion | 50.3.0 | |
| 234 | +| Arrow | 56.1.0 | |
| 235 | +| Rust edition | 2024 | |
| 236 | + |
| 237 | +These versions must stay in sync with `datafusion-bio-formats` and `polars-bio`. |
| 238 | + |
| 239 | +## License |
| 240 | + |
| 241 | +This crate is licensed under the **Apache License 2.0**, consistent with the rest of the `datafusion-bio-functions` workspace. |
| 242 | + |
| 243 | +The vendored `superintervals` sub-crate (in `superintervals/`) is licensed under the **MIT License** by Kez Cleal. MIT is a permissive license fully compatible with Apache 2.0 — MIT-licensed code can be included in Apache 2.0 projects without restriction. |