Commit f812944
feat: contig-by-contig VEP annotation with partitioned parquet cache (#47)
* feat: contig-by-contig VEP annotation with partitioned parquet cache
Refactor the parquet annotation path so everything is contig-scoped:
VCF reading (filter pushdown → tabix seek), variation lookup (per-contig
COITree), context loading (per-contig parquet files), and annotation
(per-contig PreparedContext). Memory is freed after each contig.
The partitioned cache layout (variation/chrN.parquet, transcript/chrN.parquet,
etc.) is auto-detected by PartitionedParquetCache::detect() and can be
controlled via "partitioned": true/false in options_json.
Contig discovery uses zero-cost VCF schema metadata (bio.vcf.contigs)
with SQL fallback. The existing monolithic path is completely untouched.
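
As a rough illustration of the layout described above, the sketch below builds per-contig file paths and decides whether a cache is partitioned. It is not the actual PartitionedParquetCache::detect() logic: the helper names, the TABLE_KINDS list (the message only names variation and transcript, then "etc."), and the directory probe are all assumptions made for the example.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical table kinds; the commit message names these two and elides the rest.
const TABLE_KINDS: [&str; 2] = ["variation", "transcript"];

/// Path of one per-contig parquet file, e.g. `<root>/variation/chr1.parquet`.
fn contig_file(root: &Path, kind: &str, contig: &str) -> PathBuf {
    root.join(kind).join(format!("{contig}.parquet"))
}

/// All per-contig files that would be loaded for a single contig.
fn contig_files(root: &Path, contig: &str) -> Vec<PathBuf> {
    TABLE_KINDS
        .iter()
        .map(|kind| contig_file(root, kind, contig))
        .collect()
}

/// An explicit `"partitioned": true/false` option wins; otherwise probe the layout.
/// Probing for a `variation/` directory is an assumption, not the real detection rule.
fn is_partitioned(explicit: Option<bool>, root: &Path) -> bool {
    explicit.unwrap_or_else(|| root.join("variation").is_dir())
}
```

Keeping the explicit option ahead of the probe means a caller can force either behavior without touching the cache directory.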
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: use TBI-indexed contigs for zero-cost data-bearing contig discovery
Prefer bio.vcf.contigs.indexed metadata (TBI-derived, only contigs with
actual data) over bio.vcf.contigs (all header contigs). Fall back to
SELECT DISTINCT chrom when indexed metadata is unavailable.
This eliminates empty contig overhead: for a chr1-only VCF, processes 1
contig instead of 24 (saving ~3s / 11% on 1K variant benchmark).
Bumps datafusion-bio-format-vcf to rev 47e7ad3 (PR #136).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: prefer TBI-indexed contigs, fall back to SELECT DISTINCT chrom
Skip bio.vcf.contigs (all header contigs), which includes ~195 GRCh38
sequences even for single-chrom VCFs. Prefer bio.vcf.contigs.indexed
(data-bearing only), fall back to SELECT DISTINCT chrom.
Bumps datafusion-bio-format-vcf to rev e92ff6f.
Eliminates ~10s empty-contig overhead on chr1 benchmark (24→1 contig).
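
The discovery order described above (indexed metadata first, SQL otherwise) can be sketched as follows. This is a simplified stand-in, not the crate's code: the comma-separated metadata encoding, the ContigSource type, and the table name `vcf` in the fallback query are assumptions; only the key bio.vcf.contigs.indexed and the SELECT DISTINCT chrom fallback come from the message.

```rust
use std::collections::HashMap;

/// Where the contig list came from (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum ContigSource {
    /// Data-bearing contigs read from TBI-derived schema metadata.
    Indexed(Vec<String>),
    /// No usable metadata: the caller should run this query instead.
    SqlFallback(&'static str),
}

/// Prefer `bio.vcf.contigs.indexed`; deliberately ignore `bio.vcf.contigs`,
/// which lists every header contig whether or not it carries data.
fn discover_contigs(metadata: &HashMap<String, String>) -> ContigSource {
    match metadata.get("bio.vcf.contigs.indexed") {
        // Assumption: the value is a comma-separated list of contig names.
        Some(value) if !value.trim().is_empty() => ContigSource::Indexed(
            value
                .split(',')
                .map(|c| c.trim().to_string())
                .filter(|c| !c.is_empty())
                .collect(),
        ),
        // Assumption: the VCF is registered under the table name `vcf`.
        _ => ContigSource::SqlFallback("SELECT DISTINCT chrom FROM vcf"),
    }
}
```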
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: streaming pipelined contig annotation with per-contig memory reclamation
Replace MemTable-based batch accumulation with a pull-based
ContigAnnotationExec / ContigAnnotationStream state machine that
processes one contig at a time and reclaims memory after each.
Key changes:
- Add ContigAnnotationExec (leaf ExecutionPlan) and ContigAnnotationStream
(StartContig → PreparingContig → Draining → Done state machine)
- Extract per-contig logic into prepare_and_annotate_contig() async fn
- Add MissWorklist::for_chrom() for single-contig worklist without
scanning base batches
- Add Clone derive to PartitionedParquetCache
Verified: 323K chr1 variants, 80 fields --everything, 100% accuracy
against VEP 115 golden truth (0 mismatches in 2,997,504 CSQ entries).
Timing: 72s (no regression vs previous MemTable baseline).
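
To make the StartContig → PreparingContig → Draining → Done flow concrete, here is a synchronous toy version of such a pull-based state machine. It only mirrors the state names from the message; the Machine type, the string "batches", and the two-batches-per-contig stand-in for annotation are invented for illustration, and the real stream is async and yields Arrow record batches.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum State {
    StartContig,
    PreparingContig(String),
    Draining(VecDeque<String>),
    Done,
}

/// Toy pull-based driver: each `step` call advances the machine by one transition.
struct Machine {
    contigs: VecDeque<String>,
    state: State,
}

impl Machine {
    fn new(contigs: &[&str]) -> Self {
        Machine {
            contigs: contigs.iter().map(|c| c.to_string()).collect(),
            state: State::StartContig,
        }
    }

    /// Returns `Some(batch)` when a batch is yielded, `None` for internal transitions.
    fn step(&mut self) -> Option<String> {
        // Take ownership of the current state so its buffers can be moved or dropped.
        match std::mem::replace(&mut self.state, State::Done) {
            State::StartContig => {
                self.state = match self.contigs.pop_front() {
                    Some(contig) => State::PreparingContig(contig),
                    None => State::Done,
                };
                None
            }
            State::PreparingContig(contig) => {
                // Stand-in for context loading + annotation: two fake batches per contig.
                let pending = (0..2).map(|i| format!("{contig}:batch{i}")).collect();
                self.state = State::Draining(pending);
                None
            }
            State::Draining(mut pending) => match pending.pop_front() {
                Some(batch) => {
                    self.state = State::Draining(pending);
                    Some(batch)
                }
                None => {
                    // `pending` is dropped here, releasing this contig's buffers
                    // before the next contig starts.
                    self.state = State::StartContig;
                    None
                }
            },
            State::Done => None,
        }
    }

    fn is_done(&self) -> bool {
        self.state == State::Done
    }
}
```

Because each contig's buffers live inside the state value, moving on to the next state is what releases them, which is the property the message relies on for per-contig memory reclamation.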
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: true e2e streaming annotation with window-based HGVS hydration
Two key changes enable fully streaming contig annotation:
1. VariantLookupExec: buffer matched rows during probe phase, emit only
after probe completes. This ensures the colocated sink is fully
populated before any downstream consumer sees the first batch.
New EmitMatched state yields buffered matches, then EmitUnmatched.
2. ContigAnnotationStream: rich state machine with window-based processing.
- PreparingContig: parallel context loading + lookup stream setup
- AnnotatingContig: pull lookup batches into windows of 1000, then
per-window: hydrate HGVS (cumulative, skip already-hydrated
transcripts — same sliding-window pattern as SIFT), rebuild
PreparedContext, annotate, yield
- DrainingWindow: yield annotated batches one at a time
- CleaningUp: deregister ephemeral tables
Context loaded via MissWorklist::for_chrom() (no dependency on lookup
results for the partitioned path).
Verified: 323K chr1 variants, 80 fields --everything, 100% accuracy
(0 mismatches in 2,997,504 CSQ entries). 72.3s (no regression).
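
The cumulative "skip already-hydrated transcripts" idea can be sketched as a plan of which transcript IDs each window still needs to hydrate. This is a minimal illustration under assumptions: transcripts are reduced to u32 IDs, the function name is hypothetical, and the real code hydrates HGVS data and rebuilds PreparedContext rather than returning ID lists. The window size of 1000 is taken from the message.

```rust
use std::collections::HashSet;

/// Window size quoted in the commit message.
const WINDOW_SIZE: usize = 1000;

/// For each window of input, list the transcript IDs not hydrated by any
/// earlier window (or earlier in the same window). `window` must be non-zero,
/// because `chunks` panics on a zero chunk size.
fn hydration_plan(transcript_ids: &[u32], window: usize) -> Vec<Vec<u32>> {
    let mut hydrated: HashSet<u32> = HashSet::new();
    transcript_ids
        .chunks(window)
        .map(|ids| {
            ids.iter()
                .copied()
                // `insert` returns true only the first time an ID is seen,
                // so already-hydrated transcripts are skipped.
                .filter(|id| hydrated.insert(*id))
                .collect()
        })
        .collect()
}
```

With a window of 3, the input `[1, 2, 2, 3, 1, 4]` hydrates `[1, 2]` in the first window and only `[3, 4]` in the second, since transcript 1 was already covered; production use would pass `WINDOW_SIZE`.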
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: fix rustfmt formatting for CI consistency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address PR review findings — 7 fixes
1. Chr-prefix normalization in contig intersection (Critical)
VCF "chr1" now matches cache "1" and vice versa, matching
MissWorklist::expanded_chroms() behavior.
2. Ephemeral table cleanup on error paths (Critical)
Three error paths (lookup stream, hydrate_window, annotate_window)
now transition to ErrorCleaningUp which deregisters tables before
propagating the error. Added make_cleanup_future() helper.
3. Corrected misleading "parallel" docstring (Moderate)
Removed false claim about tokio::try_join! parallelism.
4. Pass reference_fasta_path to LookupProvider (Moderate)
Was hardcoded None, disabling reference-based allele shifting
for colocated variant matching in partitioned path.
5. Named constant ANNOTATION_COLUMN_COUNT replaces magic 2 (Moderate)
Documents that output schema appends csq + most_severe_consequence
+ CACHE_OUTPUT_COLUMNS after VCF fields.
6. Documented miRNA/structural gap in partitioned path (Minor)
7. Removed unnecessary filter() just to read VCF schema (Minor)
Skipped review finding #7 (auto-detection opt-out semantics): existing
behavior, low risk, not worth changing.
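
Fix 1 above (chr-prefix normalization in the contig intersection) can be illustrated with a small sketch. The function names are hypothetical and this is not MissWorklist::expanded_chroms(); it only shows comparing contig names with any leading "chr" stripped. Aliases such as chrM versus MT are not handled here.

```rust
use std::collections::HashSet;

/// Strip a leading "chr" so that "chr1" and "1" compare equal.
fn normalize_contig(name: &str) -> &str {
    name.strip_prefix("chr").unwrap_or(name)
}

/// Keep the VCF contigs (in VCF order and VCF spelling) that also exist in
/// the cache, ignoring the chr prefix on either side.
fn intersect_contigs<'a>(vcf: &[&'a str], cache: &[&str]) -> Vec<&'a str> {
    let cache_keys: HashSet<&str> = cache.iter().copied().map(normalize_contig).collect();
    vcf.iter()
        .copied()
        .filter(|c| cache_keys.contains(normalize_contig(c)))
        .collect()
}
```

Returning the VCF's own spelling keeps downstream filters on the VCF side unchanged, while the cache lookup can apply the same normalization when resolving file names.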
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: O(n²) Vec::remove(0) → VecDeque::pop_front() in EmitMatched
Change matched_batches from Vec to VecDeque so each emit is O(1)
instead of O(n) shift. For chr1 WGS with ~10K batches this avoids
~50M element moves.
Also documented that matched_batches peaks at full chromosome size
(inherent — colocated sink must be complete before annotation).
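
The cost difference can be shown with a small stand-alone comparison. The functions below are illustrative, with u32 values standing in for record batches. Draining a Vec from the front shifts every remaining element on each removal, n(n-1)/2 moves in total, which for n = 10,000 is 49,995,000 and matches the ~50M figure above.

```rust
use std::collections::VecDeque;

/// Front-draining a Vec: each `remove(0)` shifts all remaining elements left.
fn drain_front_vec(mut items: Vec<u32>) -> Vec<u32> {
    let mut out = Vec::with_capacity(items.len());
    while !items.is_empty() {
        out.push(items.remove(0)); // O(n) per call
    }
    out
}

/// Front-draining a VecDeque: `pop_front` only advances the ring-buffer head.
fn drain_front_deque(items: Vec<u32>) -> Vec<u32> {
    let mut queue: VecDeque<u32> = items.into();
    let mut out = Vec::with_capacity(queue.len());
    while let Some(item) = queue.pop_front() {
        out.push(item); // O(1) per call
    }
    out
}

/// Total element moves caused by draining `n` items with `Vec::remove(0)`.
fn shifts_for_remove_front(n: u64) -> u64 {
    n * n.saturating_sub(1) / 2
}
```

Both drains yield the same order; only the per-emit cost changes.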
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: eager per-contig memory reclamation — drop BuildSide, sink, context
After the lookup stream is exhausted:
- Drop the lookup stream (reclaims BuildSide: COITrees, hash indices,
concatenated VCF batch — several hundred MB for chr1)
- Clear the colocated sink (data already copied to colocated_map)
After the last annotation window:
- Clear colocated_map, transcripts, exons, translations, regulatory,
motifs before entering the async cleanup phase
Previously these stayed alive inside ContigAnnotationState until the
cleanup future completed, preventing per-contig memory reclamation.
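
A simplified picture of the eager-drop pattern, under assumptions: the struct below is a hypothetical stand-in for ContigAnnotationState with placeholder field types, and the method names are invented. The point it illustrates is releasing large fields as soon as they are no longer needed instead of waiting for the owning state to be dropped.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the per-contig state; field types are placeholders.
struct ContigState {
    /// Stand-in for the lookup stream that owns the build side.
    lookup_stream: Option<Vec<u8>>,
    colocated_sink: Vec<String>,
    colocated_map: HashMap<String, String>,
}

impl ContigState {
    /// Called once the lookup stream is exhausted.
    fn on_lookup_exhausted(&mut self) {
        // `take` moves the stream out of the state so it is freed right now.
        drop(self.lookup_stream.take());
        // Assigning a fresh Vec frees the allocation; `clear()` would keep capacity.
        self.colocated_sink = Vec::new();
    }

    /// Called after the last annotation window, before any async cleanup.
    fn on_last_window(&mut self) {
        self.colocated_map = HashMap::new();
    }
}
```

Replacing a collection with an empty one, rather than calling `clear()`, is what actually returns the memory to the allocator.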
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix project lifecycle
* refactor: remove monolithic annotation path, partitioned-only
Remove scan_with_transcript_engine (monolithic single-parquet path) and
all supporting helpers (resolve_cache_table_name, generated_cache_table_name,
resolve_transcript_context_tables, resolve_optional_context_table).
All annotation now goes through the partitioned streaming path
(ContigAnnotationExec → ContigAnnotationStream). When no partitioned
cache directory is detected, scan() returns a clear error message.
Refactored 17 tests to use partitioned cache layout:
- Added write_partitioned_cache/write_batch_to_cache/write_batch_to_chrom
helpers that write per-chrom parquet files to TempDir
- Updated cache_batch() to include both chrom "1" and "2" variation data
- Changed tests from register_table("var_cache") pattern to writing
partitioned parquet files and passing directory path with
{"partitioned":true} in options_json
- Updated assertions for partitioned behavior (intergenic_variant
when no context tables, vs old sequence_variant placeholder)
- Exon/translation batches (no chrom column) use write_batch_to_chrom
Net: 1186 lines removed (monolithic path + old test patterns).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1c68844 commit f812944
12 files changed
Lines changed: 2545 additions & 2092 deletions
File tree
- benchmark-infra/terraform
- datafusion/bio-function-vep
- examples
- src
Lines changed: 59 additions & 0 deletions