Commit f22c24e
feat: optimize annotation context loading (#44)
* feat: optimize annotation context loading with column projection and miss worklist
- Fix profiling bug: replace .any() short-circuit with full iteration
over all batches for accurate cache hit/miss counts
- Add MissWorklist: collects cache-miss variant positions, coalesces
nearby intervals, generates interval-aware SQL predicates
- Column projection: replace SELECT * with explicit column lists in all
context loaders. The load_translations change alone saves ~13s by
skipping 132 MB of sift/polyphen data the main loader discards.
- Interval predicates: regulatory/motif/mirna/structural loaders use
position-overlap predicates instead of chrom-only IN clauses
- Rust-side transcript_id filter: after loading transcripts, filter
exons and translations by HashSet in Rust (microseconds)
- Support split translation layout: translations_sift_table option
directs sift window loading to a dedicated sift parquet file
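A minimal sketch of the coalesce-then-predicate idea behind MissWorklist, not the actual implementation. The `Interval` type, the `max_gap` parameter, the quoted `"start"`/`"end"` column names, and the choice to return `None` past the clause cap are assumptions made for illustration; only the `MAX_INTERVAL_CLAUSES` value of 50 comes from this commit.

```rust
/// Hypothetical closed interval [start, end] on a single chromosome.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Interval {
    pub start: i64,
    pub end: i64,
}

/// Cap on OR-ed interval clauses (value taken from the commit message).
pub const MAX_INTERVAL_CLAUSES: usize = 50;

/// Sort the cache-miss positions and merge neighbours that lie within
/// `max_gap` of the interval currently being built.
pub fn coalesce(mut positions: Vec<i64>, max_gap: i64) -> Vec<Interval> {
    positions.sort_unstable();
    let mut out: Vec<Interval> = Vec::new();
    for pos in positions {
        if let Some(last) = out.last_mut() {
            if pos - last.end <= max_gap {
                last.end = pos;
                continue;
            }
        }
        out.push(Interval { start: pos, end: pos });
    }
    out
}

/// Build a position-overlap predicate for one chromosome.
/// Assumption: returning `None` signals the caller to fall back to a
/// chrom-only filter when there are no intervals or too many of them.
pub fn overlap_predicate(chrom: &str, intervals: &[Interval]) -> Option<String> {
    if intervals.is_empty() || intervals.len() > MAX_INTERVAL_CLAUSES {
        return None;
    }
    let clauses: Vec<String> = intervals
        .iter()
        // A feature overlaps [iv.start, iv.end] when it starts before the
        // interval ends and ends after the interval starts.
        .map(|iv| format!("(\"start\" <= {} AND \"end\" >= {})", iv.end, iv.start))
        .collect();
    Some(format!(
        "chrom = '{}' AND ({})",
        chrom.replace('\'', "''"),
        clauses.join(" OR ")
    ))
}
```

Coalescing first keeps the number of OR-ed clauses small, which is what makes a fixed clause cap practical.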
Measured impact (chr1, 319K variants, no --everything):
context_tables_total: 13.5s → 0.47s (-97%)
total pipeline: 60.0s → 49.6s (-17%)
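The profiling fix listed above concerns a general iterator pitfall, sketched here under assumptions: `BatchStats` and both functions are hypothetical stand-ins, not code from this repository. Counters updated as a side effect inside `.any()` stop growing at the first batch that satisfies the predicate, so hit/miss totals come out too low.

```rust
/// Hypothetical per-batch cache probe result.
pub struct BatchStats {
    pub hits: usize,
    pub misses: usize,
}

/// Buggy pattern: `.any()` short-circuits on the first batch that has a
/// miss, so later batches never contribute to the counters.
pub fn count_with_any(batches: &[BatchStats]) -> (usize, usize) {
    let (mut hits, mut misses) = (0, 0);
    let _has_miss = batches.iter().any(|b| {
        hits += b.hits;
        misses += b.misses;
        b.misses > 0
    });
    (hits, misses)
}

/// Fixed pattern: visit every batch, then derive any boolean from the totals.
pub fn count_full(batches: &[BatchStats]) -> (usize, usize) {
    batches
        .iter()
        .fold((0, 0), |(h, m), b| (h + b.hits, m + b.misses))
}
```

With three batches of (3 hits, 0 misses), (1, 2) and (4, 1), the short-circuiting version reports (4, 2) while the full iteration reports (8, 3).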
Refs: #43
Refs: biodatageeks/datafusion-bio-formats#131
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: support split translation layout in golden bench and profile examples
- annotate_vep_golden_bench: discover translation_core and
translation_sift parquet files from context_dir, register
translations_sift_table for sift window loading
- profile_annotation: same split layout support with fallback
to unsplit translation file
- Add benchmark.md with golden benchmark invocation and results:
80/80 CSQ fields at 100% accuracy, context_tables 13.5s → 0.5s
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: filter sift loading by translation transcript_ids and add split timing
- Add restrict_to parameter to load_sift_window to skip transcripts
not in the loaded translations set
- Add separate VEP_PROFILE timing for sift loading vs annotation
(7a. sift_lazy_load_only, 7b. annotate_batches_only)
Golden benchmark: 80/80 fields at 100% accuracy, no regression.
For the distributed chr1 workload (all translations loaded), the filter
has minimal effect. The two-pass approach (filter by actual missense
transcript_ids) is needed for significant sift savings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Revert "perf: filter sift loading by translation transcript_ids and add split timing"
This reverts commit 5340e15.
* perf: add split profiling for sift loading vs annotation
Adds separate VEP_PROFILE timing lines:
- 7a. sift_lazy_load_only: time spent in sift window SQL queries
- 7b. annotate_batches_only: time spent in transcript consequence engine
This makes it clear that sift I/O dominates (48s) vs annotation (8s)
in --everything mode.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: compact sift/polyphen predictions — eliminate String allocations
Replace HashMap<(i32, String), (String, f32)> with sorted
Vec<CompactPrediction> where amino acid and prediction type are
encoded as u8 indices instead of heap-allocated Strings.
- Amino acid: single char → u8 (A=0..Y=24)
- Prediction: ~5 unique strings → u8 enum
- Lookup: binary search on sorted Vec vs HashMap get
- Eliminates ~256M String allocations across 51 sift windows
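A rough sketch of the compact layout described above, assuming a field set and helper names that are not confirmed by the source (`position`, `score`, `build`, `lookup`, `encode_amino_acid` are illustrative). It shows the two ideas from the list: single-byte encoding of the amino acid and prediction label, and lookup by binary search over a Vec sorted on (position, amino acid).

```rust
/// Hypothetical compact record: no heap allocation per prediction.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CompactPrediction {
    pub position: i32,
    pub amino_acid: u8, // 'A' -> 0 ... 'Y' -> 24
    pub prediction: u8, // index into a small table of prediction labels
    pub score: f32,
}

/// Map a one-letter amino acid code in 'A'..='Y' to its index.
pub fn encode_amino_acid(c: char) -> Option<u8> {
    if ('A'..='Y').contains(&c) {
        Some(c as u8 - b'A')
    } else {
        None
    }
}

/// Sort once so that every later lookup can use binary search.
pub fn build(mut preds: Vec<CompactPrediction>) -> Vec<CompactPrediction> {
    preds.sort_unstable_by_key(|p| (p.position, p.amino_acid));
    preds
}

/// Find the (prediction index, score) pair for a position and amino acid.
pub fn lookup(sorted: &[CompactPrediction], position: i32, aa: char) -> Option<(u8, f32)> {
    let aa = encode_amino_acid(aa)?;
    sorted
        .binary_search_by_key(&(position, aa), |p| (p.position, p.amino_acid))
        .ok()
        .map(|i| (sorted[i].prediction, sorted[i].score))
}
```

The trade-off is an O(log n) lookup instead of a hash probe, in exchange for a contiguous array with no per-entry String keys or values.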
Measured impact (chr1 --everything, 319K variants, golden bench):
sift_lazy_load: 48.0s → 35.8s (-25%)
total pipeline: 104.6s → 91.7s (-12%)
correctness: 80/80 CSQ fields at 100% accuracy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: update benchmark.md with compact prediction timings
- Add 7a/7b split profiling (sift_lazy_load: 35.8s, annotate: 8.8s)
- Update --everything before/after table: 107.7s → 91.7s (-15%)
- List all optimizations applied
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: zero-copy Arrow parsing for sift predictions
Add read_compact_predictions() that reads directly from Arrow arrays
into CompactPrediction without intermediate ProteinPrediction structs
or String allocations. Amino acid and prediction &str are read from
Arrow buffers (zero-copy) and encoded to u8 in-place.
Eliminates ~133M String allocations (2.6M per window × 51 windows)
that were created in read_protein_predictions() and immediately
discarded after encoding to u8.
Measured impact (chr1 --everything, 319K variants, golden bench):
sift_lazy_load: 35.8s → 26.5s (-26%)
total pipeline: 91.7s → 83.2s (-9%)
correctness: 80/80 CSQ fields at 100% accuracy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add --output=<path> VCF sink to profile_annotation example
Writes annotated output as minimal VCF (CHROM, POS, ID, REF, ALT,
QUAL, FILTER, INFO with CSQ field) with timing. Useful for comparing
annotation output across code changes.
Usage:
    cargo run --release --example profile_annotation -- \
        input.vcf.gz cache_dir 1000 --output=/tmp/output.vcf
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: update benchmark.md with latest profile_annotation timings
- Full chr1 --everything with VCF output: 82.3s (was 107.7s baseline)
- sift_lazy_load: 26.1s (was 48.0s, -46%)
- context_tables: 0.5s (was 13.6s, -96%)
- Total savings: 25.4s (-24%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: direct parquet-rs sift reader with cached ArrowReaderMetadata
Bypass DataFusion session.sql() for sift/polyphen window loading.
Read parquet footer once, pre-compute RG position ranges, then reuse
cached metadata for all 51 window queries. Falls back to session.sql()
when the sift parquet file path cannot be resolved.
- Move parquet crate from dev-dependencies to dependencies
- Add SiftDirectReader struct with cached metadata + projection + RG ranges
- Resolve sift file path from cache_source parent directory
- Each window: open file (fd only), reuse metadata, select matching RGs
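The row-group selection step can be illustrated without the parquet crate. This is a sketch under assumptions: `RowGroupRange` and `select_row_groups` are hypothetical names, and the min/max positions are taken as already extracted from the footer once, as the message describes. Each window then only needs a cheap overlap test against the cached ranges.

```rust
/// Hypothetical cached summary of one parquet row group (RG):
/// its index in the file and the position range it covers.
pub struct RowGroupRange {
    pub index: usize,
    pub min_pos: i64,
    pub max_pos: i64,
}

/// Return the indices of row groups whose position range overlaps the
/// closed window [win_start, win_end]. Only those would be read.
pub fn select_row_groups(ranges: &[RowGroupRange], win_start: i64, win_end: i64) -> Vec<usize> {
    ranges
        .iter()
        .filter(|rg| rg.min_pos <= win_end && rg.max_pos >= win_start)
        .map(|rg| rg.index)
        .collect()
}
```

Because the ranges are computed once, the per-window cost is a scan over a small in-memory slice and the footer is not parsed again.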
Measured impact (chr1 --everything, 319K variants, profile_annotation):
sift_lazy_load: 26.1s → 23.7s (-9%)
total pipeline: 82.3s → 78.5s (-5%)
correctness: 80/80 CSQ fields at 100% accuracy (golden bench)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: update benchmark.md with direct parquet-rs sift reader results
- sift_lazy_load: 48.0s → 23.7s (-51%)
- total pipeline: 107.7s → 78.5s (-27%)
- Add cumulative optimization impact table
- Add remaining bottlenecks breakdown
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: reuse CSQ string buffer across rows instead of Vec<String> + join
Replace per-row `Vec<String>` + `format!()` + `join(",")` pattern with
a single reusable `csq_buf: String` and `write!()`. Also reuse
`terms_buf` for the per-CSQ-entry terms.join("&").
Eliminates ~3M intermediate String allocations (one per CSQ entry) and
~321K Vec<String> allocations (one per row). No measurable wall-clock
impact (~90ms theoretical), but it reduces allocator pressure.
Golden benchmark: 80/80 CSQ fields at 100% accuracy.
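A small sketch of the buffer-reuse pattern, assuming a simplified entry layout: the `CsqEntry` struct and the `allele|terms|gene` field order are hypothetical and far shorter than a real CSQ entry. Only the `csq_buf` and `terms_buf` names and the `,` and `&` separators come from the message.

```rust
use std::fmt::Write;

/// Hypothetical, heavily simplified CSQ entry.
pub struct CsqEntry<'a> {
    pub allele: &'a str,
    pub terms: &'a [&'a str],
    pub gene: &'a str,
}

/// Serialize all entries for one row into `csq_buf`, reusing both buffers.
/// `clear()` keeps the allocated capacity, so after the first few rows
/// the loop normally performs no further heap allocation.
pub fn write_csq(csq_buf: &mut String, terms_buf: &mut String, entries: &[CsqEntry<'_>]) {
    csq_buf.clear();
    for (i, e) in entries.iter().enumerate() {
        if i > 0 {
            csq_buf.push(',');
        }
        // Replaces terms.join("&") without building a temporary String.
        terms_buf.clear();
        for (j, t) in e.terms.iter().enumerate() {
            if j > 0 {
                terms_buf.push('&');
            }
            terms_buf.push_str(t);
        }
        write!(csq_buf, "{}|{}|{}", e.allele, terms_buf, e.gene)
            .expect("writing to a String cannot fail");
    }
}
```

The caller owns both buffers and passes them in for every row, which is what removes the per-row Vec<String> and per-entry String.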
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: correct VCF output positions and add id/qual/filter columns
- Fix int64_at to handle UInt32 (VCF provider's start column type)
- Write correct 1-based VCF POS (start column is already 1-based)
- Include id, qual, filter columns from the annotation output
- ID field is empty string when not available (VCF provider behavior)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: pass through original VCF INFO fields in annotation output
- Register VCF with None for info_fields/format_fields (include all)
instead of Some(vec![]) (include none)
- VCF writer merges original INFO fields with CSQ annotation
- Fix int64_at to handle UInt32/UInt64 column types
The annotation UDTF output Arrow table now contains all original VCF
columns (INFO fields, FORMAT/sample fields) alongside annotation
columns (csq, most_severe_consequence, cache columns).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: proper VCF output with INFO, FORMAT, and sample columns
Use VCF field metadata (bio.vcf.field.field_type) from the original
VCF input schema to properly classify columns as INFO vs FORMAT.
The annotation pipeline output loses field metadata, so we capture
the input schema before registration and use it during VCF writing.
- INFO columns: written as key=value pairs in INFO field
- FORMAT columns: written as colon-separated values per sample
- Sample names: read from bio.vcf.samples schema metadata
- CSQ annotation: appended to INFO field
Output format: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add VCF output round-trip tests with noodles-vcf
Add noodles-vcf as dev-dependency and write 4 round-trip tests that:
1. Generate a VCF from annotation output Arrow batches
2. Read it back with noodles-vcf parser
3. Verify the file is valid VCF (header, records parse without error)
4. Verify positions, CSQ in INFO, FORMAT/sample columns, empty batches
Tests:
- test_vcf_roundtrip_noodles_can_parse: header + 3 records parse OK
- test_vcf_roundtrip_positions_correct: CHROM, POS, REF, ALT values
- test_vcf_roundtrip_csq_preserved: CSQ annotation in INFO field
- test_vcf_roundtrip_empty_batches: empty input produces valid VCF
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: update Cargo.lock for noodles-vcf dev-dependency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: apply cargo fmt formatting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: resolve clippy warnings for CI (--all-targets --all-features)
- Remove unused mut on records Vec in roundtrip test
- Replace redundant closure with PathBuf::from
- Collapse nested if-let statements
- Use rsplit().next() instead of split().last() on DoubleEndedIterator
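For reference, the last clippy fix in this list looks roughly like the following; `last_segment` and the `/` separator are hypothetical. On a double-ended iterator, `split(..).last()` walks every segment from the front, while `rsplit(..).next()` takes the final segment directly.

```rust
/// Return the text after the final '/' (or the whole input if none).
pub fn last_segment(path: &str) -> &str {
    // rsplit always yields at least one item, so the fallback is only a
    // formality that avoids an unwrap().
    path.rsplit('/').next().unwrap_or(path)
}
```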
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address PR review feedback from Claude and Codex
- Remove dead code: resolve_parquet_path (always returned None),
ProteinPrediction struct, read_protein_predictions (replaced by
read_compact_predictions)
- Fix {sift_load_ms:.1} format specifier: .1 has no effect on u128
- Extract MAX_INTERVAL_CLAUSES constant (50) replacing magic number
- Add doc comment on MissWorklist::chroms explaining bare-name invariant
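The format-specifier item can be reproduced with a short sketch; the two function names are hypothetical. Rust ignores a precision such as `.1` for integer types, so a millisecond count obtained from `Duration::as_millis()` (a `u128`) prints without a fractional digit. One option, shown here as an assumption about intent, is to format a floating-point value instead.

```rust
use std::time::Duration;

/// Precision is ignored for integers: the `.1` changes nothing here.
pub fn millis_integer(d: Duration) -> String {
    format!("{:.1}", d.as_millis())
}

/// Formatting an f64 keeps one fractional digit as requested.
pub fn millis_fractional(d: Duration) -> String {
    format!("{:.1}", d.as_secs_f64() * 1000.0)
}
```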
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address second round of PR review feedback
- Remove dead AnnotateProvider::chrom_filter_clause (callers migrated
to worklist.chrom_filter_clause())
- Remove dead CompactPrediction::decode_amino_acid (never called)
- Write minimal VCF header for empty batch results instead of empty file
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b11017d commit f22c24e
10 files changed
Lines changed: 1488 additions & 240 deletions
File tree
- datafusion/bio-function-vep
- examples
- src
- tests