Skip to content

Add VEP flag_pick_allele_gene and pick_order support#139

Merged
mwiewior merged 17 commits into
masterfrom
feature/vep-flag-pick-allele-gene
May 18, 2026
Merged

Add VEP flag_pick_allele_gene and pick_order support#139
mwiewior merged 17 commits into
masterfrom
feature/vep-flag-pick-allele-gene

Conversation

@mwiewior
Copy link
Copy Markdown
Contributor

@mwiewior mwiewior commented Apr 10, 2026

Summary

  • add annotate_vep support for flag_pick_allele_gene and configurable pick_order
  • bypass the cache-hit synthetic CSQ fast path when pick mode needs transcript-level evaluation
  • align the picker with Ensembl VEP release 115 ranking semantics and add traceability comments
  • preserve raw _source_cache for exact source ranking while keeping normalized SOURCE output
  • keep the public csq_field_names_for_mode(everything, refseq, merged) API and add a pick-aware helper
  • add integration and unit coverage for custom ordering, cache-hit behavior, exact source matching, APPRIS grading, CDS-length ranking, and missing-gene handling

Testing

  • cargo fmt --all
  • cargo test -p datafusion-bio-function-vep --lib pick
  • cargo test -p datafusion-bio-function-vep --lib parse_transcript_raw_metadata
  • cargo test -p datafusion-bio-function-vep --lib csq_field_names
  • cargo test -p datafusion-bio-function-vep --lib
  • cargo test -p datafusion-bio-function-vep --examples

Note

  • flag_pick_allele_gene emits a standalone PICK CSQ sub-field and typed PICK column. Matching VEP, retained non-transcript regulatory/motif/intergenic rows are also marked with PICK=1.

@mwiewior mwiewior force-pushed the feature/vep-flag-pick-allele-gene branch from fbcddd3 to 927a442 Compare April 10, 2026 07:45
@mwiewior mwiewior force-pushed the feature/vep-flag-pick-allele-gene branch from 927a442 to 2ec8971 Compare April 25, 2026 09:12
@mwiewior
Copy link
Copy Markdown
Contributor Author

@claude pls review

@mwiewior
Copy link
Copy Markdown
Contributor Author

@codex pls review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 25, 2026

Claude finished @mwiewior's task in 2m 39s —— View job


PR Review: Add VEP flag_pick_allele_gene and pick_order support

Tasks:

  • Post initial comment
  • Read and analyze annotate_provider.rs changes
  • Read and analyze annotate_table_function.rs changes
  • Read and analyze vcf_sink.rs changes
  • Read and analyze remaining files (golden_benchmark.rs, transcript_consequence.rs, example)
  • Formulate and post review feedback

Overall this is a solid, well-structured implementation. The traceability comments linking back to specific VEP release-115 source lines are excellent. Test coverage is comprehensive and well-targeted. A few issues worth addressing before merge:


Issues

Potential panic in retain_best_pick_candidates

annotate_provider.rs ~line 450:

fn retain_best_pick_candidates<T, F>(candidates: &mut Vec<PickCandidateInfo>, mut key: F) -> Option<usize>
{
    candidates.sort_by_key(|candidate| key(candidate));
    let best = key(&candidates[0]);  // ← panics if candidates is empty

pick_worst_assignment guards with candidates.is_empty() before the loop, but after the first successful truncation, a subsequent call with candidates.len() == 0 could theoretically panic. This is unreachable today since truncate(keep) where keep >= 1, but a defensive guard (or a debug_assert!(!candidates.is_empty())) would make the invariant explicit. Fix this →


include_pick_output is derived twice

annotate_provider.rs ~line 4211:

let include_pick_output = pick_flags.flag_pick_allele_gene;

But self.include_pick_output is also independently derived in AnnotateProvider::new. The two derivations always agree today, but having two sources of truth is fragile. The out_cols.push at the end of the function correctly uses self.include_pick_output, while the rest of the function uses the local. Consider removing the local and using self.include_pick_output (already accessible via the receiver) or passing pick_flags.flag_pick_allele_gene consistently. Fix this →


PickFlags::from_options_json called per window (not per provider)

annotate_provider.rs ~line 4089:

let pick_flags = PickFlags::from_options_json(self.options_json.as_deref())?;

This is inside annotate_batch, which is called once per window-batch. For large VCFs this means repeated JSON parse + validation on every batch. PickFlags could be stored on AnnotateProvider (alongside include_pick_output) and passed down, eliminating the per-batch parse. Fix this →


Breaking API signature change on a public function

golden_benchmark.rs:

// Before
pub fn csq_field_names_for_mode(everything: bool, refseq: bool, merged: bool) -> Vec<&'static str>

// After
pub fn csq_field_names_for_mode(everything: bool, refseq: bool, merged: bool, include_pick: bool) -> Vec<&'static str>

This is a semver-breaking change on a pub function. Any downstream crate (polars-bio, comet-bio) that calls csq_field_names_for_mode directly will fail to compile. The newly added csq_field_names wrapper is a good convenience but doesn't help existing callers. Consider whether a version bump or a release note is needed when coordinating with polars-bio.


VCF header description string changed

vcf_sink.rs ~line 199:

Before: "Consequence annotations from annotate_vep. Format: ..."
After:  "Consequence annotations from Ensembl VEP. Format: ..."

This silently alters the VCF INFO header for all users, not just those using pick mode. Any downstream tooling that parses the INFO header description string will see a different value. If intentional, worth calling out in the PR description.


PR description says "FLAGS field" but code adds a standalone PICK column

The PR note says "marks picked consequences through the existing FLAGS field instead of introducing a standalone PICK output column", but the implementation clearly adds a separate PICK typed column and a |PICK| CSQ sub-field. This makes the PR description misleading. Worth updating the note to accurately reflect the design.


Behavior note: all non-transcript consequences receive PICK=1

annotate_provider.rs mark_flag_pick_allele_gene:

if assignments[idx].feature_type != FeatureType::Transcript {
    assignments[idx].picked = true;  // intergenic, regulatory, motif all PICK=1
    continue;
}

This matches VEP's --flag_pick_allele_gene contract (non-transcript overlaps pass through unchanged), but it may surprise users who filter on PICK=1 expecting only one consequence per allele+gene — they will also get all intergenic/regulatory/motif rows. This is worth a sentence in the public-facing docs or SQL function signature comment.


Nits

  • pick_worst_assignment mirrors VEP's historical pick_worst_* naming, which is counter-intuitive from a user's perspective (it actually picks the "best"). The traceability comment explains the origin, which is sufficient.
  • test_mark_flag_pick_allele_gene_uses_spliced_length_for_noncoding_transcripts — the "shorter spliced" transcript has a larger genomic end coordinate (1000), making the test data confusing to read. The exon-level lengths correctly drive the assertion, but a comment on the variable names would help.

What's working well

  • Iterative narrowing in pick_worst_assignment faithfully mirrors the VEP algorithm
  • Per-criterion traceability comments with exact VEP source URLs are excellent
  • use_cached_fast_path guard correctly bypasses the CSQ fast path when pick mode is active
  • gene_stable_id_from_raw_object_json fallback correctly prefers _gene_stable_id over _gene.stable_id
  • parse_appris_pick_rank faithfully reimplements the VEP regex logic for principal/alternative grading
  • Test for cache-hit bypass (test_annotate_vep_flag_pick_allele_gene_bypasses_cache_hit_fast_path) covers an important edge case
  • APPRIS alternative correctly gets rank + 10 to penalise it relative to principal

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6dd7438d8f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread datafusion/bio-function-vep/src/annotate_provider.rs Outdated
@mwiewior
Copy link
Copy Markdown
Contributor Author

@claude pls review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 25, 2026

Claude finished @mwiewior's task in 3m 11s —— View job


PR Review: Add VEP flag_pick_allele_gene and pick_order support

Tasks:

  • Post initial comment
  • Read annotate_provider.rs changes
  • Read annotate_table_function.rs changes
  • Read vcf_sink.rs, golden_benchmark.rs, transcript_consequence.rs changes
  • Formulate and post review feedback

This is a well-executed, thorough implementation. All of the issues raised in the prior review have been addressed. The traceability comments linking to VEP release 115 source lines remain excellent. A few remaining points below.


Resolved from prior review

  • retain_best_pick_candidates now guards candidates.is_empty() — no panic risk
  • PickFlags is stored on AnnotateProvider and parsed once in new() — not per batch; self.pick_flags.clone() per batch is a cheap copy of a bool + small Vec
  • csq_field_names_for_mode(everything, refseq, merged) signature is unchanged — not a breaking change; old callers compile as-is
  • ✅ VCF header description still reads "from annotate_vep" — not changed globally
  • source_cache correctly preserved for exact source ranking; pick_assignment_source_rank prefers it over the normalized source
  • include_pick_output in annotate_batch is a simple rebind of self.include_pick_output, not a re-derivation

Remaining issues

b_pick builder allocated and maintained unconditionally

annotate_provider.rs — builder initialization:

let mut b_pick = ListBuilder::new(StringBuilder::new());

The builder is always created and append-ed to on every row, even when include_pick_output is false. When pick is disabled the column is never included in the output, but the allocations still happen. For large VCFs this adds per-row overhead on a branch that most users never touch. Gating the inner-loop b_pick.values().append_* calls with if include_pick_output { ... } (matching the pattern used for include_source_field) would keep the hot path clean. Fix this →


AnnotateProvider holds pick_flags and include_pick_output as separate fields

Both are stored on the struct; include_pick_output is always equal to pick_flags.flag_pick_allele_gene. That invariant is correct today, but having two fields means a future edit could silently break it. Since include_pick_output is only ever checked as a bool, consider dropping the field and using self.pick_flags.flag_pick_allele_gene directly everywhere. The only cost is one field access instead of one field read — negligible. Fix this →


PR description note is inaccurate

The PR note reads:

"marks picked consequences through the existing FLAGS field instead of introducing a standalone PICK output column"

The implementation does the opposite — it adds a standalone PICK typed column and a |PICK| CSQ sub-field. Non-transcript rows also receive PICK=1. This contradicts the note. Worth correcting so the PR description accurately describes what shipped.


Integration tests can't exercise length as a tiebreaker

pick_transcripts_batch() and pick_exons_batch() in annotate_table_function.rs give all three transcripts identical exon coordinates (100–250). This means the length criterion never resolves a tie in the integration tests — picks are always determined by mane_select or tsl. The unit tests in annotate_provider.rs do cover CDS and spliced length ranking (test_mark_flag_pick_allele_gene_prefers_cds_length_over_protein_length, test_mark_flag_pick_allele_gene_uses_spliced_length_for_noncoding_transcripts), so the logic is tested, but end-to-end coverage of pick_order=length is missing. Low-priority, but worth a follow-up test.


pick_transcripts_batch / pick_transcripts_batch_with_gene_id_only_in_raw_json duplication

These two test helpers differ only in gene_stable_id (Some vs None) and raw_object_json. All the schema and array construction is duplicated (~80 lines). A single builder function taking gene_stable_id_col and raw_json_col as parameters would eliminate the repetition. Minor, but the duplication will diverge as the schema grows.


Nit

option_value in annotate_vep_golden_bench.rs skips next-arg values that start with -- (.filter(|value| !value.starts_with("--"))). A FASTA path starting with -- would be silently ignored. Edge case for a benchmark tool, but worth a comment so the next reader doesn't wonder about it.


What's working well

  • Iterative narrowing in pick_worst_assignment faithfully mirrors VEP release 115's pick_worst_VariationFeatureOverlapAllele()
  • use_cached_fast_path correctly set to false when pick mode is active — cache-hit bypass is properly gated
  • parse_appris_pick_rank faithfully reimplements the VEP regex (first alpha char + trailing digits + 10 for alternative*)
  • gene_stable_id_from_raw_object_json correctly prefers _gene_stable_id over _gene.stable_id
  • csq_field_names_for_mode backward compat preserved via delegation to csq_field_names_for_mode_with_pick
  • Unit test coverage is comprehensive: exact source matching, APPRIS grading, CDS vs protein length, spliced length for non-coding, non-transcript PICK pass-through, missing gene ID skip
  • test_annotate_vep_flag_pick_allele_gene_bypasses_cache_hit_fast_path is a critical integration test that covers an important edge case
  • Per-criterion traceability comments with exact VEP source URLs are excellent

@mwiewior
Copy link
Copy Markdown
Contributor Author

@codex pls review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. 🚀

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mwiewior
Copy link
Copy Markdown
Contributor Author

@claude pls review

@mwiewior
Copy link
Copy Markdown
Contributor Author

@codex pls review

@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Claude finished @mwiewior's task in 12m 57s —— View job


PR Review: Add VEP flag_pick_allele_gene and pick_order support (Round 3)

Tasks:

  • Post initial comment
  • Read annotate_provider.rs changes
  • Read annotate_table_function.rs changes
  • Read vcf_sink.rs and golden_benchmark.rs
  • Check status of issues from prior reviews
  • Formulate and post final review

This is a solid, well-tested implementation. The two new commits since round 2 each address a real behavioral deviation from VEP and both include updated or new tests. No new correctness issues found.


Status of prior review issues

All Round-1 issues: ✅ resolved.

Round-2 issues:

  • ✅ PR description note now accurately describes the design (standalone PICK column with non-transcript pass-through)
  • retain_best_pick_candidates empty-guard in place
  • PickFlags stored on struct, not re-parsed per batch
  • csq_field_names_for_mode signature backward-compatible (delegates to _with_pick)

Still outstanding from Round 2 (minor):

  • b_pick builder allocated unconditionally even when include_pick_output = false
  • AnnotateProvider has both pick_flags: PickFlags and include_pick_output: bool (always equals pick_flags.flag_pick_allele_gene)
  • Integration tests don't exercise Length as tiebreaker
  • pick_transcripts_batch / pick_transcripts_batch_with_gene_id_only_in_raw_json duplicated

New issues in the two new commits

Unused import: std::cmp::Ordering (commit 32863f3)

annotate_provider.rs:10:

use std::cmp::Ordering;

Ordering is never named directly in the file — .cmp().then_with() uses it implicitly, but the name is never written. This generates an unused_imports lint warning. If CI runs cargo clippy -- -D warnings (as noted in CLAUDE.md), this would fail the build. Fix this →


PickCandidateInfo.length = 0 for non-transcript features silently beats all transcripts on Length criterion

annotate_provider.rs:2254:

} else {
    (1, 1, 1, 1, 0, 1, 100, 100)  // defaults for non-transcript features
}

length = 0 for non-transcripts, while transcripts use length = -pick_transcript_length() (always negative). Since retain_best_pick_candidates sorts ascending and picks the minimum, all transcripts (with negative length) rank better than non-transcripts (0) in the Length criterion. This is probably the desired behavior — transcripts are always preferred over regulatory/motif features when tied on all biological criteria — but it is not documented. If a user specifies pick_order=length and has only regulatory features, the Length criterion effectively ignores them and they fall through to the index tiebreaker. Worth a brief comment documenting the invariant.


What's working well in the new commits

1dcead4 Match VEP noncoding pick length

  • Correctly aligns pick_transcript_length for non-coding transcripts with Bio::EnsEMBL::Transcript::length() (sum of exon lengths), replacing the previous spliced_seq → cdna_seq → exons fallback chain
  • Test test_mark_flag_pick_allele_gene_uses_exon_length_for_noncoding_transcripts now explicitly sets spliced_seq and cdna_seq on the shorter-exon transcript, proving they are no longer consulted
  • Traceability comment linking to the VEP Ensembl core source is excellent
  • unwrap_or_else fallback to genomic span when no exons are available is correct

32863f3 Fix VEP pick allele tie ordering

  • sort_pick_candidates_by_vep_order now called for ALL pick modes (Pick, PickAllele, PickAlleleGene, FlagPick, FlagPickAllele, FlagPickAlleleGene, PerGene) — previously missing for the first three
  • Sort is deterministic (feature_type rank → feature_id → assignment_index), matches the CSQ concat order, and is independent of COITree traversal order
  • pick_assignment_feature_id correctly handles non-Transcript features by falling back to tc.transcript_id.as_deref() — correctly populated for regulatory features (r.feature_id.clone()) and empty for aggregated motif consequences
  • test_apply_pick_allele_tie_breaks_by_vep_feature_order directly exercises the case where insertion order (ENST00000052 first) differs from lexicographic order (ENST00000051 wins)
  • Stable-sort + candidates.first() fallback correctly propagates the pre-sort VEP order through all biological criterion sorts

Summary

One actionable issue: the unused use std::cmp::Ordering import should be removed before merge to avoid a clippy lint failure. Everything else is sound. The implementation faithfully mirrors VEP release 115 pick semantics.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Another round soon, please!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mwiewior mwiewior merged commit f542b76 into master May 18, 2026
2 checks passed
@mwiewior mwiewior deleted the feature/vep-flag-pick-allele-gene branch May 18, 2026 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant