Skip to content

Commit b11017d

Browse files
mwiewiorclaude
andauthored
feat: Ensembl VEP annotations phase 1 (#30)
* Add bio-function-vep crate with lookup_variants table function Scaffold the VEP annotation crate (Phase 1) with: - allele.rs: VCF↔VEP allele conversion (vcf_to_vep_allele, allele_matches) and match_allele/vep_allele scalar UDFs - coordinate.rs: CoordinateNormalizer for 0-based/1-based coordinate system detection via schema metadata, with FilterOp selection - schema_contract.rs: variation cache schema validation, default column definitions, and column list parsing - lookup_provider.rs: LookupProvider that generates interval join SQL against a variation cache table with allele matching post-filter - table_function.rs: lookup_variants() UDTF with argument parsing (vcf_table, cache_table, optional columns, optional prune_nulls) - lib.rs: register_vep_functions() and create_vep_session() building on bio-function-ranges' BioQueryPlanner with IntervalJoinExec Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: Interval join parser now handles extra non-range predicates in join filter The IntervalJoinPhysicalOptimizationRule failed to recognize interval joins when the join filter contained additional predicates beyond the two range conditions (e.g. match_allele UDF in lookup_variants). The parser expected exactly two BinaryExpr children under a top-level AND, but filters like `a.end >= b.start AND a.start <= b.end AND match_allele(...)` produce a nested AND tree with three leaves. Changes: - Add flatten_and() to recursively collect leaf predicates from AND trees - Rewrite try_parse() to iterate leaves and skip non-range predicates - Change IntervalBuilder methods from panic to Result for graceful handling of duplicate bounds and missing fields - Add integration test confirming lookup_variants produces IntervalJoinExec - Add unit tests for extra-predicate and no-range-predicate cases Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: LEFT JOIN support in IntervalJoinExec + swap build/probe sides for VEP lookup VCF (small) is now the build side and cache (large) is the probe side. LEFT JOIN ensures all VCF variants appear in output with NULLs for unmatched cache annotations. Cache rows with all-NULL annotation columns are pre-filtered via subquery before probing. Changes: - IntervalJoinExec: matched_build_rows bitmap, EmitUnmatchedBuild state, single-partition probe for LEFT joins - lookup_provider: FROM vcf LEFT JOIN cache ON ..., NULL pre-filter subquery - Tests for LEFT JOIN semantics and NOT NULL pushdown verification Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: Downgrade DataFusion back to 50.3.0 Reverts the DF 52.1.0 upgrade (Cargo.toml, Cargo.lock, pileup dep pin, CLAUDE.md). Fixes create_hashes signature (&mut Vec<u64> in DF 50). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: Accept Utf8View columns in variation cache schema validation DataFusion 50+ reads parquet strings as Utf8View by default. Treat Utf8, Utf8View, and LargeUtf8 as interchangeable in schema validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: Normalize chromosome "chr" prefix for VEP cache join Ensembl VEP cache uses bare chromosome names (1, 22) while VCF files may use chr-prefixed names (chr1, chr22). Detect the mismatch at scan time and wrap the VCF side in a subquery that strips the prefix, keeping the equi-join as plain column equality so IntervalJoinExec is preserved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: Default to all annotation columns in lookup_variants When no columns argument is passed, return all cache columns except coordinate columns (chrom, start, end) and source_* bookkeeping columns, instead of a hardcoded 3-column default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: Apply JoinFilter in IntervalJoinExec for non-range predicates IntervalJoinExec stored the JoinFilter but never evaluated it, causing row explosion when joining VCF against VEP cache — overlapping cache entries at the same position passed the interval check but should have been rejected by allele matching. Add apply_join_filter() and filter_result_batch() helpers that evaluate the filter expression on matched row pairs and update the LEFT JOIN bitmap only for surviving matches. The filter is applied at all four batch-emission sites (streaming, low-memory continuation, low-memory complete, and full-batch paths). For CoitreesNearest and CoitreesCountOverlaps the filter is skipped via effective_filter(), since these algorithms have semantics that conflict with the range overlap predicates typically present in the JoinFilter. When no filter is present (the common case), the overhead is negligible: one function call plus two None-branch checks, with the batch returned as-is without copying or allocation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: Correct VEP lookup false positives from coordinate mismatch and missing REF check Two issues caused ~7% extra rows (4,345,025 vs VEP's 4,048,342): 1. Coordinate system mismatch: VCF uses half-open intervals [start, end) while VEP cache uses 1-based closed [start, end]. The symmetric weak overlap (>=, <=) caused VCF [100, 101) to match cache entries at both position 100 AND 101. Fixed by using asymmetric overlap (>, <=) which correctly handles half-open vs closed intervals. 2. Missing REF allele verification: match_allele() only checked the ALT allele, allowing false positives where a cache entry had the same ALT but different REF (e.g. "C/G" matching a VCF A->G variant). Added REF allele check supporting both VEP-format and VCF-format allele strings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * perf: Skip redundant JoinFilter evaluation for pure range interval joins The JoinFilter application added in a609a2c evaluated ALL filter predicates on every output batch, including the range overlap conditions (a.end >= b.start AND a.start <= b.end) that the interval tree already enforces. This caused ~50% performance regression for range table providers (overlap, merge, etc.) where the filter contains only range predicates. Now effective_filter() walks the filter's AND-tree and returns None when every leaf is a BinaryExpr (comparison already handled by the tree). The filter is only applied when non-range predicates are present — e.g. UDF calls like match_allele() in VEP lookups. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: Deduplicate VEP cache entries with STRING_AGG for variation_name VEP caches contain duplicate rows on (chrom, start, end, allele_string) with different variation_name values (e.g. dbSNP + COSMIC IDs at the same position). This caused ~1,133 extra output rows vs VEP's output. The cache subquery now GROUP BYs on the join key columns: - variation_name: STRING_AGG(DISTINCT ..., ',') to comma-concatenate all co-located variant IDs, matching VEP's Existing_variation format - Other annotation columns: MAX() (values are identical across dupes) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: Move cache dedup GROUP BY after the join, not before The GROUP BY was inside the cache subquery (before the join), which only collapsed duplicates with the same allele_string. A VCF variant matching multiple cache entries with different allele_strings would still produce multiple output rows. Restructured to GROUP BY all VCF columns AFTER the LEFT JOIN, matching VEP's one-row-per-variant semantics. Added plan assertion verifying AggregateExec sits above IntervalJoinExec. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * perf: Remove blocking GROUP BY, restore streaming join pipeline The post-join GROUP BY turned the streaming IntervalJoinExec pipeline into a blocking operation with high memory usage (had to accumulate all join results before aggregating). Removed the GROUP BY entirely. Cache duplicates (~0.03% extra rows from entries with identical position+allele but different variation_name) pass through as separate rows. This is a conscious trade-off: streaming performance over exact VEP row-count parity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Optimize lookup_variants scan pruning and add partitioned LEFT join opt-in * Use coordinate metadata for lookup overlap * fix: support pipe-ALT matching and insertion-style cache joins * feat: add colocated-id lookup mode and robust multi-alt matching * feat: add non-consequence vep-existing lookup fallback mode * vep lookup: propagate somatic in fallback match modes * feat: add fjall KV cache backend for VEP variant lookup Add new `bio-function-vep-cache` crate providing a window-based KV cache backend using fjall LSM-tree storage. This replaces full Parquet scans (~1B entries) with targeted window lookups, reducing cache loads from ~6M per-variant to ~3K per-window per sample. Key components: - key_encoding: canonical chromosome ordering with (chrom, window_id) keys - kv_store: fjall wrapper with Arrow IPC serialization (LZ4 compressed) - loader: parallel Parquet-to-fjall ingestion (per-chromosome parallelism) - cache_exec: KvLookupExec streaming ExecutionPlan with LEFT JOIN semantics - cache_provider: KvCacheTableProvider with auto-detection via downcast - allele_index: per-window position index with injected matcher functions Integration with bio-function-vep via optional `kv-cache` feature flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: columnar v1 cache format with position index and fjall tuning APIs Split monolithic Arrow IPC windows into compact binary position index (for probing) + per-column Arrow IPC entries (loaded on-demand), reducing deserialization waste from 256x amplification to near-zero. New APIs: - VepKvStore::open_with_cache_size(path, bytes) — custom fjall block cache - VepKvStore::create_with_options(path, schema, ...) — full fjall tuning - KvCacheTableProvider::open_with_cache_size(path, bytes) - PositionIndex: compact binary position index (from_batch/to_bytes/from_bytes) - Format v1: EntryType-based key encoding (position index + per-column entries) Benchmark (chr1, 319K variants, 88M cache rows): - v0 baseline: 29.7K variants/s (12T, 32MB cache) - v1 default: 149K variants/s (4T, 32MB cache) — 5x faster - v1 tuned: 227K variants/s (6T, 4GB cache) — 7.6x faster Default fjall block cache raised from 32MB to 256MB. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: parallel cache loading via DataFusion partitions Replace per-chromosome SQL queries with partition-level parallelism. Instead of discovering chromosomes and issuing 25+ separate queries, create a single physical plan and execute each DataFusion partition concurrently with per-window locking for safe concurrent writes. - Add WindowLocks for per-window tokio::sync::Mutex coordination - Add split_batch_into_windows() using arrow::compute::take() - Add flush_with_lock() for lock-guarded window writes - Replace load_chromosome() with load_partition() - Change parallelism from fixed 25 to optional cap (None = no limit) - Add target_partitions CLI arg to load_cache example - Add tests for multi-partition loading and concurrent cross-window indels Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix kv-cache lookup type mismatch and v1 column index overflow * Fix parallel cache loading hang: two-phase read/write approach Replace per-partition flush (read-modify-write amplification causing indefinite hangs on real data) with two-phase approach: - Phase 1: parallel partition reads into SharedWindowBuffers (in-memory) - Phase 2: sequential single-pass flush of each window to fjall Removes WindowLocks, flush_with_lock, load_partition, FLUSH_THRESHOLD. Adds SharedWindowBuffers, read_partition. Net -60 lines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Optimize cache loading: parallel serialize + fjall batch writes Move IPC serialization to parallel spawn_blocking tasks and use fjall's Batch API to reduce L0 compaction pressure. Three-phase approach: 1. Parallel read: partition streams → shared window buffers 2. Parallel serialize: spawn_blocking per window (IPC + LZ4) 3. Batched write: 100-window batch commits to fjall chr22 benchmark (15.1M variants, 78 cols): Before: ~31s (21s flush with L0 stalls from 309K individual inserts) After: ~19s (6s serialize+write phase, no L0 stalls) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add RepartitionExec for even partition distribution after filtering Sorted Parquet files concentrate filtered data (e.g. WHERE chrom='22') into a single partition via predicate pushdown, making other partitions empty. Wrap the physical plan with RepartitionExec(RoundRobinBatch) to redistribute filtered data evenly across all partitions. chr22 benchmark (15.1M variants, 8 partitions): Before repartition: 1 partition gets all data → 19s total After repartition: 8 partitions balanced → 17s total Original sequential per-chrom: → 33s total Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Remove RepartitionExec from cache loader Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Stream cache loading batch-by-batch with position-order flushing Replace three-phase load (accumulate all → serialize all → write all) with per-partition streaming that flushes windows incrementally as the stream advances past them. Memory drops from O(total_variants) to O(active_windows_per_partition). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Improve loader flush concurrency without enforcing sort * vep-cache: remove v0 paths and enforce v1 output typing * Optimize KV lookup hot path and add profiling * V5 zstd dictionary compression + hot-path optimizations Add zstd dictionary-trained compression for V5 position entries: - Train 112KB dictionary from first batch samples during cache creation - Store dictionary in fjall meta partition, load at open time - Compress entries on write, decompress on read with reusable decompressor - Disk usage: 3.3GB → 1.16GB (65% reduction) for chr22 Hot-path optimizations in the match loop (144K rows/s, up from 76K): - Precompute column offsets in PositionEntryReader::new() (O(1) vs O(N^2)) - Inline null bitmap reads replacing Vec<bool> allocations per column - Reuse decompression buffer across entries (amortized zero-alloc) - Reuse allele match buffer across rows Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add bio.annotation config namespace and V5 zstd compression tuning Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Consolidate vep-cache to single V0 format and add window_size config Remove all V1–V4 format code paths, mmap_block_store, window-based indexing, and dead modules (window.rs, position_index.rs). Rename V5 to V0 as the sole format. Add bio.annotation.window_size session parameter (default 1M bp). Simplify examples and config naming (v5_zstd_level → zstd_level, v5_dict_size_kb → dict_size_kb). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix parallel dict training race and remove serial contig example Train zstd dictionary once from a LIMIT 10000 sample before spawning partition tasks. Share dict via Arc across all partitions so each creates its own compressor from the same dictionary. Removes the race where parallel partitions would overwrite each other's dictionaries. Delete load_cache_by_contig_serial.rs — unnecessary with row-oriented V0 format (no contig-level ordering requirement). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: update vep cache config and benchmark tooling * Fix Fjall/parquet lookup parity and document cache API * Document lookup_variants match_mode behaviors * Handle zstd decompression growth for large cache entries * Add Vortex cache integration and cache export tooling * Revert "Add Vortex cache integration and cache export tooling" This reverts commit 9e751c3682923976b1c0be74055d89a426fb6ec2. * Rewrite VEP lookup from interval join to equi-join with extended_probes opt-in Replace interval-overlap SQL with equi-join on exact normalized (chrom, start, end) coordinates. DataFusion now plans lookup_variants() as HashJoinExec (equi on 3 keys) + JoinFilter (match_allele), which is simpler and avoids the custom IntervalJoinExec LEFT join extensions. The default equi-join requires VCF and cache coordinates to match exactly after normalization. Variants with VEP-style shifted coordinates (insertion start>end, prefix-trimmed deletions, tandem-repeat shifts) need the new extended_probes=true parameter, which restores interval-overlap SQL and multi-probe KV lookups. Changes: - Revert IntervalJoinExec LEFT join extensions back to v0.3.1 (INNER only) - Remove interval_join_partitioned_left_join config - Rewrite lookup SQL to use equality conditions (HashJoinExec path) - Add extended_probes boolean parameter to lookup_variants() signature - Gate KvLookupExec multi-probe logic behind extended_probes flag - Update tests: fixtures for equi-join coordinates, shifted-coord tests use extended_probes=true - Document equi-join limitations for insertions/deletions in module docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Merge vep cache into bio-function-vep and fix workspace tests * Add annotate_vep golden benchmark and crate docs * feat(vep): wire transcript/exon consequence baseline and update porting plan * feat(vep): extend consequence coverage with context-aware SO handlers * vep: add translation-aware codon CSQ path and backend-consistent context tests * vep: extend codon-aware consequence classification for coding edits * vep: add term-level golden parity fixtures and edge-case coverage * vep: filter context loading by input chroms for lookup mode Restrict transcript, exon, and translation loading to only the chromosomes present in the input variants. This avoids reading the full genome-wide annotation tables when annotating a small region, significantly reducing memory and latency in lookup mode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: strip parent SO terms and filter non-VEP transcripts for golden parity Remove parent SO terms (coding_sequence_variant, protein_altering_variant) when more specific children are present, matching VEP behavior. Filter out CCDS, IG locus, and other non-standard transcript IDs that VEP does not annotate against. Adds VCF indel normalization constructor (from_vcf) for future use. Updates test fixtures to use ENST-prefixed transcript IDs. Golden benchmark at 5000 chr22 variants: most_severe 99.9%, term_set 99.7%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix frameshift/stop_gained ranking and indel splice-site normalization Suppress stop_gained/stop_lost from codon classification when frameshift is already determined — VEP treats downstream protein changes as implicit in frameshifts. Normalize VCF indel coordinates (strip anchor base) for splice-site distance checks only, preserving original range for exon overlap detection. Golden benchmark at 5000 chr22 variants: most_severe 99.92% (5041/5045), term_set 99.68%. All 4 remaining most_severe mismatches are upstream CDS data quality issues (issue #119). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: remove false missense heuristic for transcripts without translation data When no CDS sequence is available, do not guess missense_variant for SNVs in CDS — emit coding_sequence_variant instead. Retain start/stop codon heuristics (position + allele pattern based) and premature stop gain detection. This eliminates false missense classifications on transcripts lacking translation data (e.g. ENST00000577821). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: use coitrees for transcript overlap, fix CDS N-padding and cross-transcript parent stripping Replace O(V×T) linear transcript scan with per-chromosome COITree interval queries (O(V×(log T + k))), reducing annotation phase from ~40s to ~1.8s on chr22. Fix cross-transcript collapse to only strip coding parent terms (not intron_variant). Handle leading N-padded CDS sequences and N-containing codon translation for VEP parity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: strip splice_donor_region with 5th_base and suppress stop_gained on inframe indels VEP omits splice_donor_region_variant when splice_donor_5th_base_variant is present (the 5th base is within the donor region). VEP also does not emit stop_gained/stop_lost alongside inframe_deletion or inframe_insertion. Reuse existing golden VCF output in benchmark to skip docker reruns. Reduces chr22 golden discrepancies from 30 to 19. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix insertion exon boundary overlap and trim VCF anchor in annotation loop Use VariantInput::from_vcf() in the annotation loop so VCF anchor bases are trimmed before consequence evaluation. For pure insertions, require both flanking positions to be within the exon (VEP convention: insertion at exon start is intronic). Reuse existing golden VCF in benchmark. Reduces chr22 golden discrepancies from 19 to 13. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: detect stop_gained in CDS with pre-existing internal stops For pseudogene/LoF transcripts with internal premature stop codons, the global first-stop comparison missed new stops introduced after an existing one. Add per-codon stop analysis matching VEP Perl behavior. Reduces chr22 golden discrepancies from 13 to 11. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix insertion upstream/downstream detection at transcript boundaries For insertions at a transcript's start position, use the left flanking position (start-1) for upstream/downstream window overlap to match VEP behavior. Also apply insertion-aware overlap for the transcript-level overlap check, consistent with the exon boundary fix. Reduces chr22 golden discrepancies from 11 to 9. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: intron-based splice detection matching VEP Perl algorithm Replace exon-based splice site iteration with intron-derived boundaries, matching VEP's _intron_effects algorithm from TranscriptVariationAllele.pm. Key changes: - Derive introns from sorted exon pairs and check splice sites using intron-relative offsets (matching VEP's intron_start/intron_end coords) - Run splice checks for ALL variants (not just fully-intronic), using extended boundary regions that reach into exons — this naturally handles exon-spanning deletions without synthetic variant workarounds - Add splice_region_variant suppression when more specific terms present (donor_region, 5th_base, donor, acceptor) per VEP's splice_region guards - Remove broken add_exon_spanning_splice_terms() synthetic variant approach Golden bench: 8 discrepancies (was 9), 50,855/50,861 most_severe match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: cargo fmt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: frameshift intron skip, VEP insertion coords, cross-transcript coding_sequence_variant Three fixes matching VEP Perl behavior: 1. Skip splice checks for "frameshift introns" (≤13bp), matching VEP's _frameshift flag in BaseTranscriptVariation.pm. Fixes Cases 1,2 where 1bp deletions got false splice_acceptor/donor on tiny introns. 2. Correct insertion coordinate model: VEP's overlap(P, P-1, X, Y) effectively checks P in [X+1, Y], which is narrower than our point checks. Single-position overlap is impossible for insertions (5th_base never fires). Fixes Cases 4,7,8 insertion splice misclassifications. 3. Don't strip coding_sequence_variant at cross-transcript collapse level — it may legitimately come from a different transcript than the specific coding term (frameshift, missense, etc.). Golden bench: 7 discrepancies (was 8), 50,860/50,861 most_severe match (only mature_miRNA remaining). All 7 remaining are term_set only. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix insertion polypyrimidine range using VEP dual-loop coord model VEP checks polypyrimidine in two loops with different coordinate models: - _overlapped_introns loop uses normalized (P-1, P) giving wider range - _overlapped_introns_boundary loop uses raw (P, P-1) giving narrower range The union range for positive strand is [intron_end-16, intron_end-1] and for negative strand [intron_start+2, intron_start+17], wider than our previous ranges by 1 position on each boundary. Golden bench: 5 discrepancies (was 7), 50,856/50,861 term_set match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: frameshift intron boundary splice_region, coding_sequence_variant, PPT suppression - Frameshift introns (≤13bp): only skip splice checks when variant overlaps the intron body; variants in exonic boundary region still get splice_region_variant via _intron_overlap checks - Expand intron boundary quick-check to cover VEP's dual interval trees (intron_end-8 can extend past intron_start-4 for tiny introns) - Emit coding_sequence_variant for variants in frameshift introns within CDS bounds (VEP treats these as part of the coding context) - Add intron_variant when variant overlaps any intron body, even if it also overlaps an exon (exon-spanning deletions) - Suppress splice_polypyrimidine_tract_variant when splice_donor or splice_acceptor is present on the same transcript Reduces golden discrepancies from 9 to 2 (50859/50861 = 99.996%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: 100% golden parity — mature_miRNA_variant, complex indel, intron_variant range fix Fix final 2 VEP discrepancies on chr22 (50,861 variants), achieving 100% term_set and most_severe accuracy vs Ensembl VEP 115. - Add mature_mirna_regions to TranscriptFeature, parsed from raw_object_json cDNA attributes during transcript loading. Per-transcript mature_miRNA_variant check in evaluate_transcript_overlap. - Narrow variant_overlaps_intron range to intron_start+2..intron_end-2, excluding splice site positions. Skip frameshift introns. Use VEP's both-flanking-positions semantics for insertions. Remove post-hoc intron_variant stripping from strip_parent_terms. - Add is_complex_indel detection for deletions spanning exon→intron boundaries (excluding frameshift introns). Short-circuit add_coding_terms to emit only coding_sequence_variant for such variants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: filter Ensembl-only transcripts by default, add merged flag VEP without --merged only annotates against ENST transcripts. Our is_vep_transcript() was unconditionally accepting RefSeq (NM_, NR_, XM_, XR_), producing ~6k extra consequence terms on chr22. Add a `merged` bool parameter defaulting to false (Ensembl-only). Parse "merged" from options_json in the provider and pass through from the benchmark CLI. Result: most_severe 100%, term_set 100% on 50,861 chr22 variants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: suppress non_coding_transcript_{exon_,}variant inside mature miRNA regions VEP's non_coding_exon_variant predicate returns 0 when within_mature_miRNA is true, and within_non_coding_gene also guards against it. Mirror both checks: skip NonCodingTranscriptExonVariant and NonCodingTranscriptVariant when MatureMirnaVariant is assigned. Eliminates the last remaining discrepancy (22:31160117 A/G) — 0/50,861 mismatches on chr22. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: populate all 29 CSQ fields — EXON, INTRON, cDNA/CDS/protein positions, amino acids, codons, distance, flags Extend TranscriptConsequence with 9 new metadata fields and populate them during transcript evaluation. 27 of 29 fields now match VEP 115 golden output at the 83.5% entry-matching ceiling (up from 0-23% on many fields). Key changes: - Add exon/intron numbering (which_exon_str, which_intron_str helpers) - Compute cDNA, CDS, and protein positions via genomic_to_cdna_index - Extract amino acids and codons from classify_coding_change with VEP case convention (format_codon_display) - Propagate distance from upstream_downstream_term - Parse cds_start_NF/cds_end_NF flags from transcript raw_object_json - Align CSQ format to 29 fields matching VEP --fields default order - Update golden benchmark VEP command to request all 29 fields Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: VEP-style allele minimization, per-feature regulatory entries, fix SOURCE - Apply VEP allele minimization in CSQ Allele field: strip shared REF/ALT prefix (e.g., REF=T ALT=TCAC → "CAC", REF=TCAC ALT=T → "-") - Emit one CSQ entry per overlapping regulatory feature with its stable_id as Feature (matching VEP behavior), instead of collapsing all into one - Stop emitting parquet transcript source as VEP SOURCE field (different semantics: annotation source vs cache source type) - Add unmatched CSQ entry diagnostic to golden benchmark - Fix cache_table mock to include all CACHE_OUTPUT_COLUMNS Result: 99.4-100.0% accuracy across all 29 CSQ fields (up from 83.5%), only 46 unmatched entries remain (down from 118,221). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: add unit tests for CSQ field helpers and regulatory per-feature entries Tests for format_codon_display, which_exon_str, which_intron_str, genomic_to_cdna_index, compute_cdna_position, compute_flags, per-feature regulatory emission, and diagnose_unmatched_csq. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix BIOTYPE, FLAGS parsing, allele suffix trimming, indel codons - Add feature_type to RegulatoryFeature and emit as BIOTYPE in CSQ (fixes 3,960 mismatches for regulatory entries) - Replace naive contains() in parse_transcript_flags() with proper JSON parsing that validates value=="1" (fixes 188 false positives) - Add suffix trimming to vcf_to_vep_allele() and unify inline vep_allele computation (fixes 11+ unmatched entries) - Compute codons for in-frame indels, not just SNV substitutions (fixes 129 missing codon annotations and 74 amino acid mismatches) All 29 CSQ fields now at 100.0% accuracy on chr22 golden benchmark. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix FLAGS nesting, MNV suffix trimming, indel codons/positions/amino acids - Fix parse_transcript_flags() to unwrap nested __value in VEP cache attribute JSON, preserving encounter order (27,180 FLAGS fixes) - Skip suffix trimming for same-length substitutions (MNVs) in vcf_to_vep_allele() — VEP only suffix-trims indels (35 unmatched → 0) - Rework codon formatting for frameshift and inframe indels to match VEP case conventions (changed bases uppercase, context lowercase) - Use X instead of * for frameshift amino acids - Add classify_insertion() for strand-aware insertion CDS/protein positions and codon formatting - Fix position range formatting for multi-base indels (CDS, protein) - Add ? notation for boundary-spanning deletion cDNA positions - Add field-level mismatch sample collection to golden benchmark Reduces chr22 golden benchmark mismatches from ~27,353 to 11 across 715,133 CSQ entries (28/29 fields at 100%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix last 11 CSQ mismatches — amino acids preserved-AA + cDNA exon boundary Amino_acids: restrict preserved-AA check in classify_coding_change() to insertion frameshifts only (deletion frameshifts always use ref/X). Add the same preserved-AA check to classify_insertion() for consistency. cDNA_position: relax insertion in_exon check to include either flanking position at exon boundaries, and infer the intronic flanking cDNA index from the exonic one (adjacent positions are always consecutive). All 29 CSQ fields now at 715,133/715,133 (100.0%), 0 mismatches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: add unit tests for data fixes — codons, amino acids, cDNA, FLAGS, allele trimming Cover all recent CSQ field fixes with 37 new unit tests: - classify_coding_change: SNV codon case, frameshift/inframe deletion codons & amino acids - classify_insertion: frameshift/inframe codons, preserved-AA logic, CDS/protein positions - compute_cdna_position: insertion at exon boundaries, boundary-spanning deletions with ? - VariantInput::from_vcf: MNV no-suffix-trim vs indel suffix-trim - parse_transcript_flags: mixed wrapped/unwrapped, non-1 rejection, malformed JSON - compute_flags: flags_str encounter order, regulatory feature biotype variants Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: populate all 29 CSQ fields, add HGVS module, enrich cache structs - Wire variation_name into cache-hit CSQ (Existing_variation field 18) - Extract and conditionally emit SOURCE (field 29, only with --merged) - Add version field to TranscriptFeature, stable_id/version to TranslationFeature - Load version/stable_id from transcript/translation parquet caches - New hgvs.rs module: format_hgvsc (exonic + intronic), format_hgvsp (missense, synonymous, stop gained, frameshift, inframe del/ins) - Compute HGVSc/HGVSp per transcript (stored but not emitted by default, matching VEP which requires --hgvs + FASTA to populate these fields) - All 29 CSQ fields now 100% match on full chr22 (715,133 entries) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: replace raw_object_json parsing with promoted parquet columns Read cds_start_nf, cds_end_nf (bool) and mature_mirna_regions (List<Struct<start,end>>) directly from top-level transcript cache columns instead of parsing raw_object_json at runtime. Remove parse_mirna_regions_from_json, parse_transcript_flags, parse_cdna_range and the serde_json dependency. Golden benchmark confirms 100% match on all 29 CSQ fields (2421 entries, 0 discrepancies). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: add 12 new CSQ fields (41 total) — VARIANT_CLASS, CANONICAL, TSL, protein IDs Extend CSQ output from 29 to 41 pipe-delimited fields per transcript: VARIANT_CLASS (computed from VEP-minimized alleles), CANONICAL, TSL, MANE_SELECT, MANE_PLUS_CLINICAL, ENSP, GENE_PHENO, CCDS, SWISSPROT, TREMBL, UNIPARC, UNIPROT_ISOFORM. All fields sourced from promoted parquet columns (no JSON parsing). Golden benchmark: 41/41 fields at 100% match against VEP 115. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: add Batch 3 — allele frequencies, co-located variants, MAX_AF, PUBMED (74 CSQ fields) Add VepFlags with VEP-compatible flag names (check_existing, af, af_1kg, af_gnomade, af_gnomadg, max_af, pubmed). Implement AF extraction from cache allele:freq strings, MAX_AF/MAX_AF_POPS computation across 1000G and gnomAD populations, and co-located variant fields (somatic, phenotype_or_disease, Existing_variation). Extend CSQ from 41 to 74 fields. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix AF precision, CSQ comma escaping, allele-aware co-located fallback - Apply VEP's sprintf("%.4f") formatting to global AF field (94.6% → 99.5%) - Escape commas in TREMBL, CLIN_SIG, PUBMED values before emitting to CSQ (commas split CSQ entries, causing ~200K phantom entries and field misalignment where UniParc IDs leaked into consequence terms) - Add match_allele_relaxed to co-located variant fallback join (was position-only, could pick wrong variant at multi-variant loci) - Use MIN(CASE WHEN ... 'rs%') aggregation for co-located variation_name (prefer rsIDs over other namespaces, matching VEP existing mode) Full chr22 benchmark (715,133 entries): 57/74 fields at 100%, term_set 100%, most_severe 100%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: add position-based co-located variant aggregation for Existing_variation, SOMATIC, PHENO Collect ALL variants at the same chromosomal position (not just allele-matched) for co-located variant fields, matching VEP's behavior with COSMIC_MUTATION and HGMD_MUTATION entries. Sorts rs* IDs before others, alphabetical within groups. Suppresses all-zero somatic/pheno flags to empty string. Benchmark: SOMATIC 93.7%→99.3%, PHENO 93.1%→98.5%, Existing_variation 78.3%→81.6% Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: add VEP coordinate normalization UDFs for indel cache lookup Add vep_norm_start/vep_norm_end UDFs that compute VEP-style coordinates by stripping common REF/ALT prefix/suffix. Use these in the equi-join ON clause so VCF indel coordinates match the VEP cache convention. Also update co-located variant map to use VEP-normalized coordinates. Benchmark: AF 89.9%→90.9%, gnomADg_AF 84.8%→86.1%, MAX_AF 84.7%→86.0%, SOMATIC 99.3%→99.6%, PHENO 98.5%→98.9%, gnomADe_AF 97.8%→98.0% Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix cache-hit CSQ field alignment, allele-filtered co-located data, add 108+ unit tests Fix Existing_variation at position 19→17 in cache-hit/fallback CSQ format strings (2 extra pipes before, 2 fewer after). Add allele-filtered SOMATIC, PHENO, CLIN_SIG, PUBMED via enriched ColocatedData with per-entry metadata. Add comprehensive unit tests for CSQ format validation, type coercion helpers, co-located allele filtering, and SQL construction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: add unit tests for CDS/protein position ?-N format, frameshift codons/amino acids fixes Tests cover: cds_start_nf "?-N" formatting for CDS_position and Protein_position (single and range), frameshift insertion at codon boundary protein position range, deletion frameshift preserving last ref AA before X, and frameshift insertion at codon boundary "-/X" codon format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix star allele filtering, regulatory insertion overlap, incomplete_terminal_codon stripping, UTR insertion boundary, start_retained heuristic - Skip star alleles (*) entirely in both engine and annotate_provider - Use VEP's stricter insertion overlap for regulatory/motif/miRNA features (insertion must be strictly inside feature, not at boundary) - Strip incomplete_terminal_codon_variant when stop_lost/stop_gained present - Fix UTR detection for insertions at CDS boundary - Emit start_retained for indels that don't touch start codon bases - Add 10 unit tests covering all fixes Reduces chr1 benchmark discrepancies from 10 to 5 (remaining are frameshift/inframe edge cases and PPT multi-base deletion edge case). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix CDS ?-N over-application, boundary insertion amino acids, intron_variant stripping - CDS/Protein position: only use "?-N" format when BOTH cds_start_nf AND CDS sequence starts with N padding (non-zero phase), fixing 144 mismatches - Amino acids: use "-/X" for frameshift insertions at codon boundaries (no existing codon disrupted), fixing 12 mismatches - Strip intron_variant when splice_acceptor/donor_variant present (SO hierarchy parent removal), fixing 12 Consequence mismatches - Update existing test for intron_variant + splice_donor coexistence - Add 4 new unit tests for boundary insertion, intron stripping Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix 5 remaining term-set discrepancies, add 8 unit tests Fix all 5 chr1 term-set mismatches (now 0/323430): - Remove PPT stripping alongside splice_acceptor/donor (tier-3 independent) - Add UTR boundary insertion detection for exon-edge insertions - Stop_retained overrides frameshift→inframe for insertions near stop codon - Start_retained co-emission with start_lost for boundary deletions - Suppress frameshift for deletions spanning CDS+UTR boundary - CDS boundary clipping in mutated_cds_first3 for start codon heuristic Add 8 new tests: UTR boundary insertion (5), stop_retained override (2), boundary deletion start_lost+start_retained co-emission (1). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: add VariantLookupExec hash+COITree dual-index join, 53/74 CSQ fields perfect Replaces SQL-based variant lookup with custom physical operator that builds VCF rows into per-chromosome COITrees with hash index for O(1) exact match and COITree fallback for shifted-coordinate indels. Includes fixes for CDS ?-N prefix, boundary insertion amino acids, intron_variant stripping, star allele filtering, regulatory insertion overlap, splice term hierarchy, and 108+ unit tests. Removes fjall cache artifacts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: piggybacked co-located collection via COITree sink, fix Int8/Int16 accessor, remove SQL co-located path Replace SQL-based build_colocated_variant_map() with piggybacked collection during VariantLookupExec probe phase. Co-located data (variation_name, somatic, pheno, clin_sig, pubmed) is collected into Arc<Mutex<HashMap>> sink during the existing 88M-row cache scan — zero additional parquet I/O. Key fixes: - Add Int8/Int16 support to I64Accessor (somatic/pheno are Int8 in parquet) - Fix CDS_position ?- prefix: only apply when variant is within N-padded region - Remove dead build_colocated_map/build_colocated_variant_map SQL functions and tests 53/74 CSQ fields perfect, most_severe=100%, term_set=100% on chr1 (323k variants). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: strip PPT with splice_acceptor, store orig_start in BuildRow - Strip splice_polypyrimidine_tract_variant when splice_acceptor_variant is present (VEP Perl's PPT check returns false at splice acceptor sites) - Store original 1-based start coordinate in BuildRow for future unshifted co-located matching (VEP Perl checks both shifted and unshifted positions via get_matched_variant_alleles) - Update tests to reflect PPT stripping behavior Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vep: fix insertion PPT detection range, VEP trim_sequences co-located normalization, AF collection - Fix PPT detection for insertions: VEP Perl uses VCF-level coords (POS, POS) for _intron_effects, not minimized insertion coords. Widen insertion PPT range to [intron_end-16, intron_end-1] (positive) and [intron_start+2, intron_start+17] (negative) to match VEP behavior. - Remove PPT stripping with splice_acceptor (VEP tier system keeps both). - Keep intron_variant stripping with splice_donor/acceptor as safety net. - Add VEP trim_sequences normalization for co-located variant matching with tree_had_overlaps guard (zero overhead for 99% of cache rows). - Add AF column collection in ColocatedCacheEntry for co-located AF fallback. - Result: 0 term_set mismatches, 0 most_severe mismatches on chr1 (323k variants). 53/74 CSQ fields perfect, IMPACT now 100%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: update VEP README with chr1 benchmark results and usage guide Document exact commands for running the golden benchmark with precomputed Ensembl VEP 115 output, positional arguments, output files, and latest chr1 accuracy results (100% most_severe/term_set, 53/74 perfect CSQ fields). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: derive benchmark output filenames from source VCF stem Output filenames are now derived from the source VCF filename (e.g., HG002_chr1_0_norm.vcf from HG002_chr1.vcf.gz) instead of being hardcoded to HG002_chr22. The local_copy_vcf_gz arg can now consistently match the source VCF. TBI paths are derived from their respective VCF paths instead of hardcoded defaults. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Port VEP matched-alleles flow for existing variant parity * Refine VEP colocation parity diagnostics * Advance strict VEP parity slice * Improve VEP parity coverage * Add VEP shift-state plumbing for strict parity work * Match VEP variation-tabix candidate window * Align VEP amino acid formatting with codon output * Add coverage for prepared feature indexes * Align VEP intron and colocated output parity * Add VEP traceability refs and helper coverage * Finish strict chr1 VEP parity * Add missing regulatory traceability refs * Add VEP distance and regulatory integration coverage * Add HGVS harness and transcript numbering parity * Port HGVS transcript coordinate semantics * Port transcript HGVS notation normalization * Align HGVSp output-layer semantics * Wire HGVSp prediction-format output * Add HGVS shift flag and flag parity audit * Wire HGVS coding mapper bounds * Improve HGVS parity: 860→187 HGVSc, 114→125 HGVSp mismatches Major improvements to HGVSc generation: - Fix 3' shifting with Ensembl-aligned perform_shift_ensembl() - Add edited transcript shifted allele support for RefSeq RNA edits - Wire CDS position shortcut for coding SNVs - Add display_start/display_end for reverse-strand shift offsets - Fix notation_to_hgvsc_coords insertion coordinate mapping 72/74 CSQ fields at 100% parity on chr1 golden benchmark. Remaining: 187 HGVSc (rotation/dup detection) and 125 HGVSp (extTer distance, protein dup, insertion flanking). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix HGVSp stop-loss/frameshift extension distance with 3' UTR translation Extend the mutated CDS with 3' UTR sequence before translating for HGVS protein data, so that stop_loss_extra_aa() can find the new stop codon even when it falls in the UTR region. Uses separate translation for HGVS vs consequence classification to avoid regression. This fixes extTer? and fsTer? mismatches where VEP correctly computes the extension distance but we output "?" due to truncated translation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix HGVSp protein dup detection order and stop-loss extension Two fixes for HGVSp mismatches: 1. Reorder peptide dup detection to run BEFORE 3' shift, matching Ensembl VEP's hgvs_protein() order. VEP checks _check_for_peptide_ duplication() before _shift_3prime(), not after. This fixes cases like Gln424_Gln429dup and Lys361dup that were incorrectly reported as insertions. 2. Extend mutated CDS with 3' UTR for HGVS translation only (not for consequence classification) so stop_loss_extra_aa() can find the new stop codon when it falls in the UTR region. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix genomic shift seq_strand to match VEP _genomic_shift semantics Ensembl VEP's _genomic_shift() always passes seq_strand=1 to perform_shift() because the genomic shift operates on forward-strand coordinates. Our code was passing the transcript strand, causing hgvs_reverse=true for reverse-strand transcripts, which rotated the HGVS output allele in the wrong direction. Also replace simple rotate_left/rotate_right in the transcript-level shift with a proper call to perform_shift_ensembl() matching VEP's _return_3prime() → perform_shift() call chain. HGVSc mismatches: 187 → 106 (-43%) Traceability: - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L431 - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L237-L243 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update STRICT_VEP_PARITY_PLAN with Phase 4 HGVS status and root cause analysis Document current benchmark state (106 HGVSc + 125 HGVSp mismatches), fixes applied with Ensembl VEP GitHub references, and remaining root causes with exact traceability links. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix HGVSp codon-boundary insertion position to match VEP genomic2pep VEP's hgvs_protein() gets protein position from translation_start() which uses the genomic2pep() mapper. For codon-boundary insertions (ref = "-"), the mapper returns the codon AFTER the boundary. Our classify_insertion() was using codon_at + 1 (the codon before). Fix: when ref_peptide is "-" and start != end (boundary insertion), use end as the HGVS start position. This aligns check_for_peptide_ duplication() upstream lookup with VEP's index. HGVSp mismatches: 125 → 60 (-52%) Traceability: - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L1680-L1682 - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/BaseTranscriptVariation.pm#L467-L499 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix HGVSp protein dup detection with boundary insertion position alignment Two complementary fixes for protein dup detection: 1. For codon-boundary insertions (ref_peptide = "-", start != end), use protein_position_end as the HGVS start to match VEP's genomic2pep() mapper which returns the codon after the boundary. 2. Add fallback dup check at start+1 for boundary insertions (empty preseq) to handle cases where the mapper position differs. This uses try_peptide_dup_at() which tests the dup match at a specific position without mutating notation unless a match is found. HGVSp mismatches: 125 → 60 (-52%) No regressions on Protein_position, Amino_acids, or any other field. Traceability: - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L1680-L1682 - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L2371-L2410 - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/BaseTranscriptVariation.pm#L467-L499 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add cdna_seq fallback for 3' UTR extraction in stop-loss/frameshift HGVS Load the cdna_seq column from transcript parquet as a fallback for three_prime_utr_seq() when spliced_seq is absent. This provides UTR data for transcripts that have cdna_seq but not spliced_seq (most Ensembl and RefSeq transcripts in the merged cache). Traceability: - VEP _three_prime_utr() derives UTR from the transcript object which always has the full cDNA sequence available https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L2412-L2418 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Hydrate full spliced cDNA from FASTA for 3' UTR in stop-loss/frameshift VEP's _three_prime_utr() derives UTR from the transcript's exon-based seq() which is always available. Our cache only stores spliced_seq for edited RefSeq transcripts. Add hydrate_transcript_cdna_from_reference() which reconstructs the full spliced cDNA from exon coordinates + FASTA for all coding transcripts overlapping input variants. This provides three_prime_utr_seq() with the UTR data needed to compute fsTer and extTer distances for Ensembl and non-edited RefSeq transcripts. HGVSp mismatches: 60 → 5 (-92%) Traceability: - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L2335-L2372 - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L2412-L2418 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update plan with non-merged HGVS benchmark: 6 mismatches out of 2,997,504 Non-merged (Ensembl-only) chr1 benchmark with Docker VEP 115.2: - 72/74 fields at zero mismatches - HGVSc: 2 mismatches (spurious HGVSc on 2 transcripts) - HGVSp: 4 mismatches (3 extTer, 1 insertion flanking) - Total: 6 out of 2,997,504 CSQ entries (99.9998% accuracy) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Revert HGVSc ? suppression that caused regressions The cdna_position "?" check was too aggressive — it suppressed HGVSc for deletions spanning UTR-CDS boundaries where VEP correctly computes notation like c.-7_1del. Removed the check. Non-merged benchmark stable at 6 mismatches (2 HGVSc + 4 HGVSp) out of 2,997,504 CSQ entries (99.9998% accuracy). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix spurious HGVSc for boundary-spanning deletions and LoF extTer 1. Suppress HGVSc for deletions that extend beyond transcript boundaries, matching VEP's _get_cDNA_position() which returns undef for out-of-bounds positions. 2. Skip 3' UTR extension for protein_coding_LoF biotype transcripts, matching VEP's _three_prime_utr() which returns undef for LoF transcripts that lack functional UTR annotations. Non-merged benchmark: HGVSc 0 mismatches, HGVSp 4 mismatches. 73/74 fields perfect. Merged benchmark: HGVSc 102, HGVSp 5 (107 total, down from 111). Traceability: - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L1416 - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/BaseTranscriptVariation.pm#L1106-L1116 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix stop_loss_extra_aa to use ref_translation.len() matching VEP length() VEP's _stop_loss_extra_AA() uses length(_peptide()) as $ref_len. BioPerl's translate()->seq includes ALL codons (including internal stops and terminal stop), so length() = full translation length. Our code was using find('*') which returns the position of the FIRST stop codon. This is the same as length() for normal transcripts (one terminal stop), but differs for LoF transcripts with internal stops. Also skip 3' UTR extension for protein_coding_LoF biotype, matching VEP's _three_prime_utr() which returns undef for LoF transcripts. Non-merged: HGVSc 0 mismatches, HGVSp 7 (down from 4 but structurally more correct — remaining 7 are UTR content differences). Traceability: - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L2406-L2461 - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/BaseTranscriptVariation.pm#L1282-L1291 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix stop_loss_extra_aa ref_len to match VEP cached peptide (no terminal *) VEP's _peptide() returns the cached peptide which excludes the terminal stop codon (*). BioPerl's translate() includes it, but VEP's cache converter strips it. So length(_peptide()) = translation length WITHOUT *. Our ref_translation includes *, so we use trim_end_matches('*').len() to match VEP's ref_len exactly. Confirmed by inspecting VEP's cache via Docker: cached peptide len: 101 (no *) fresh translate len: 102 (with *) Non-merged: HGVSc 0, HGVSp 1 (only insertion flanking mapper diff). Traceability: - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L2430-L2432 - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/BaseTranscriptVariation.pm#L1282-L1291 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Port VEP genomic2pep for insertion protein position — 74/74 zero mismatches Replace arithmetic protein position calculation (cds_idx / 3) with VEP's exact genomic2pep logic: map BOTH flanking bases of the insertion through genomic_to_cds_index independently, then apply VEP's formula int((cds_1based + 2) / 3) to each. When the two sides map to different protein positions, the insertion is at a codon boundary. For the HGVS notation, VEP's translation_start() returns the protein position from seq_region_start (POS+1), which is the HIGHER position for boundary insertions. Set hgvs start=end (higher), end=start (lower) to match VEP's _get_hgvs_peptides which uses min(start,end) for flanking. Non-merged benchmark: 74/74 fields at zero mismatches. 2,997,504 CSQ entries, 100.000% accuracy. Traceability: - https://github.com/Ensembl/ensembl/blob/release/115/modules/Bio/EnsEMBL/TranscriptMapper.pm#L451-L487 - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/TranscriptVariationAllele.pm#L1680-L1682 - https://github.com/Ensembl/ensembl-variation/blob/release/115/modules/Bio/EnsEMBL/Variation/BaseTranscriptVariation.pm#L467-L499 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add comprehensive HGVS unit tests for edge cases and VEP parity 44 new tests covering critical gaps in test coverage: hgvs.rs (30 new tests): - stop_loss_extra_aa: internal stops (LoF), no new stop, frameshift, trim_end_matches('*') for VEP cached peptide semantics - perform_shift_ensembl: forward/reverse, hgvs_reverse flag, no-match, genomic shift seq_strand=1 constraint - clip_alleles: negative strand prefix/suffix coordinate adjustments - check_for_peptide_duplication: current position, fallback offsets, multi-residue dup, no-match - resolve_frameshift_hgvs: first changed residue, synonymous frameshift - surrounding_peptides: flanking, stop extension - Protein helpers: normalize_peptide_allele, append_terminal_stop, reverse_complement, split_hgvs_coord, protein_event_type, peptide_char - format_hgvsp: deletion, multi-residue deletion, missense, delins, frameshift immediate stop transcript_consequence.rs (14 new tests): - three_prime_utr_seq: LoF biotype suppression, spliced_seq source, cdna_seq fallback, coding_end at seq end, missing coding_end - genomic_to_cds_index: positive strand, negative strand, outside CDS - coding_segments: positive strand ordering, negative strand reversal, no CDS - translate_protein_from_cds: stop codon inclusion, incomplete codons, N bases Total: 525 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Optimize cDNA hydration: batch FASTA reads per transcript span Reduce FASTA I/O for cDNA hydration by reading the entire transcript genomic span in one query and extracting exon subsequences from it, instead of issuing separate FASTA reads per exon (~8.7 reads per transcript average). Also skip hydration for LoF biotype and transcripts with no 3' UTR. For transcripts >500KB, fall back to per-exon reads to avoid large memory allocations. Performance: 120s → 101s (-16%, -19s) with HGVS enabled. Correctness: 74/74 fields at zero mismatches maintained. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Optimize cDNA hydration: filter by indel CDS overlap + stop codon vicinity Only hydrate transcripts whose CDS overlaps indel variants (potential frameshifts) or whose stop codon overlaps any variant (potential stop_lost). This skips ~80% of transcripts that only overlap SNVs far from the stop codon. Combined with the batched FASTA reads (one read per transcript span instead of per exon), hydration drops from 23s to ~6s. Performance: 120s → 100s (-17%) with HGVS enabled. Correctness: 74/74 fields at zero mismatches maintained. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Skip genomic shift computation for SNVs/MNVs Only compute build_hgvs_genomic_shift for indels (ref_len != alt_len). SNVs and MNVs never shift, so this avoids allele normalization and function call overhead for ~84% of variants. Correctness: 74/74 fields at zero mismatches maintained. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Read promoted transcript columns instead of parsing raw_object_json Use top-level parquet columns for bam_edit_status, has_non_polya_rna_edit, spliced_seq, translateable_seq, flags_str, and cdna_mapper_segments when available, falling back to JSON parsing for older caches without these columns (e.g., merged cache). This eliminates 7 redundant serde_json::from_str calls per transcript (~25KB JSON × 47K transcripts = 1.2GB parsed 7 times → 0 times with promoted columns). Performance: 72s → 64s without HGVS (-11%), 100s → 94s with HGVS (-6%). Correctness: 74/74 fields at zero mismatches. Depends on biodatageeks/datafusion-bio-formats#125, #126. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Skip raw_object_json entirely when promoted columns are available When the parquet cache has promoted top-level columns (translateable_seq, cdna_mapper_segments, bam_edit_status, etc.), skip reading the 25KB raw_object_json blob entirely. This eliminates all JSON parsing overhead. Performance (non-merged VEP-only cache with promoted columns): No HGVS: 72s → 51s (-29%, matches pre-HGVS baseline) With HGVS: 100s → 79s (-21%) Correctness: 74/74 fields at zero mismatches. 525 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Remove dead JSON extraction functions and tests With all transcript fields promoted to top-level parquet columns (biodatageeks/datafusion-bio-formats#125, #126), the raw_object_json parsing functions are no longer used. Remove: - resolve_flags_str, flags_str_from_raw_object_json - normalize_source_cache, source_cache_from_raw_object_json - normalized_source_from_transcript_id - translateable_seq_from_raw_object_json - bam_edit_status_from_raw_object_json - has_non_polya_rna_edit_from_raw_object_json - rna_edit_attribute_is_polya, is_polya_tail_edit - spliced_seq_from_raw_object_json - cdna_mapper_segments_from_raw_object_json - mapper_inner_value, cdna_mapper_segment_from_json And their 13 associated tests. File reduced from 6148 → 3691 lines (-40%). 365 tests passing. 74/74 fields at zero mismatches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update parity plan: Phase 4 HGVS achieved — 74/74 zero mismatches Document final status: - Non-merged benchmark: 74/74 fields, 0 mismatches, 2,997,504 CSQ entries - Performance: 51s without HGVS, 79s with HGVS (baseline recovered) - 9 VEP-exact fixes with Ensembl GitHub traceability - 4 performance optimizations (promoted columns, batched FASTA, targeted hydration) - 365 unit tests - Dead JSON extraction code removed (-2,499 lines) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Implement --everything flag with 80-field CSQ schema, wire APPRIS/SIFT/PolyPhen Phase 5 of VEP parity: add --everything flag support that enables all VEP features and produces an 80-field CSQ string matching VEP's --everything output order (Constants.pm#L66-L138). Key changes: - Add `everything` flag to VepFlags/HgvsFlags — when true, all sub-flags (hgvs, af, af_1kg, af_gnomade, af_gnomadg, max_af, pubmed) are enabled - Add CSQ_FIELD_NAMES_EVERYTHING (80 fields) with reordered fields: VARIANT_CLASS after FLAGS, MOTIF_* at end, SOURCE removed, gnomAD sub-pops with _AF suffix, 6 new fields (MANE, APPRIS, SIFT, PolyPhen, DOMAINS, miRNA, HGVS_OFFSET) - Wire APPRIS from transcript parquet column (principal1→P1, alternative2→A2) - Wire SIFT/PolyPhen from translation parquet columns with lookup by (protein_position, alt_amino_acid) for single AA substitutions - Wire HGVS_OFFSET from HgvsGenomicShift.shift_length (strand-aware) - Add --everything to benchmark harness with parameterized CSQ comparison - DOMAINS blocked on protein_features.analysis being NULL (#128) - miRNA blocked on missing structure type in cache (#128) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Wire DOMAINS, fix MANE/HGVS_OFFSET/gnomAD sub-pops — 80/80 zero mismatches Complete Phase 5 --everything parity: all 80 CSQ fields now match Ensembl VEP 115 output with zero mismatches across 34,741 CSQ entries (chr1, 1000 variants, non-merged). Fixes: - DOMAINS: load protein_features from translation parquet, overlap check vs variant protein_position, format as analysis:hseqname joined with & - MANE: emit MANE_Select/MANE_Plus_Clinical based on transcript attributes - HGVS_OFFSET: gate on hgvsc && tc.hgvsc.is_some() to only emit when HGVSc was actually computed for the transcript variant allele - gnomAD sub-pops: when flags.everything, override emit_in_csq to emit all AF columns in CSQ (VEP --everything emits all sub-pop frequencies) Known issue: SIFT/PolyPhen eager loading causes ~24GB peak memory on chr1 (22K translations × 11K prediction entries each). Tracked in #38. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add unit tests for Phase 5 --everything helpers and update parity plan Tests cover format_appris, format_prediction, lookup_sift_polyphen, lookup_domains, and HGVS_OFFSET sign arithmetic. Update parity plan to reflect 80/80 zero mismatches on 5,021-variant chr22 benchmark. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Sliding window SIFT/PolyPhen loading, wire miRNA structure field Replace eager SIFT/PolyPhen loading (~20GB on chr1) with a sliding genomic window approach that queries translations per-region: SELECT transcript_id, sift_predictions, polyphen_predictions FROM translations WHERE chrom = '1' AND start <= win_end AND "end" >= win_start Each 5MB window returns ~20-50 translations. LRU cache (capacity 2000) bounds peak memory. Results on full chr1 (319K variants): - Memory: 16GB (down from 24GB eager, was 60GB+ before optimization) - Time: 109s (down from killed/OOM) - Correctness: 77/80 fields perfect (2,997,504 CSQ entries) Wire miRNA CSQ field fr…
1 parent 9e5863b commit b11017d

59 files changed

Lines changed: 36699 additions & 121 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ jobs:
2020
- name: Setup Rust
2121
uses: actions-rust-lang/setup-rust-toolchain@v1
2222
with:
23-
toolchain: '1.88.0'
23+
toolchain: '1.91.0'
2424
components: 'clippy, rustfmt'
2525

2626
- name: Cache Cargo registry and build

.gitignore

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,3 +19,6 @@ target
1919
# and can be added to the global gitignore or merged into this file. For a more nuclear
2020
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
2121
.idea/
22+
/vep-benchmark/data/output/
23+
/vep-benchmark/data/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
24+
/vep-benchmark/data/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi

0 commit comments

Comments
 (0)