Add VEP flag_pick_allele_gene and pick_order support by mwiewior · Pull Request #139 · biodatageeks/datafusion-bio-functions

mwiewior · 2026-04-10T03:08:14Z

Summary

add annotate_vep support for flag_pick_allele_gene and configurable pick_order
bypass the cache-hit synthetic CSQ fast path when pick mode needs transcript-level evaluation
align the picker with Ensembl VEP release 115 ranking semantics and add traceability comments
preserve raw _source_cache for exact source ranking while keeping normalized SOURCE output
keep the public csq_field_names_for_mode(everything, refseq, merged) API and add a pick-aware helper
add integration and unit coverage for custom ordering, cache-hit behavior, exact source matching, APPRIS grading, CDS-length ranking, and missing-gene handling

Testing

cargo fmt --all
cargo test -p datafusion-bio-function-vep --lib pick
cargo test -p datafusion-bio-function-vep --lib parse_transcript_raw_metadata
cargo test -p datafusion-bio-function-vep --lib csq_field_names
cargo test -p datafusion-bio-function-vep --lib
cargo test -p datafusion-bio-function-vep --examples

Note

flag_pick_allele_gene emits a standalone PICK CSQ sub-field and typed PICK column. Matching VEP, retained non-transcript regulatory/motif/intergenic rows are also marked with PICK=1.

mwiewior · 2026-04-25T12:33:19Z

@claude pls review

mwiewior · 2026-04-25T12:33:25Z

@codex pls review

claude · 2026-04-25T12:33:33Z

Claude finished @mwiewior's task in 2m 39s —— View job

PR Review: Add VEP flag_pick_allele_gene and pick_order support

Tasks:

Post initial comment
Read and analyze annotate_provider.rs changes
Read and analyze annotate_table_function.rs changes
Read and analyze vcf_sink.rs changes
Read and analyze remaining files (golden_benchmark.rs, transcript_consequence.rs, example)
Formulate and post review feedback

Overall this is a solid, well-structured implementation. The traceability comments linking back to specific VEP release-115 source lines are excellent. Test coverage is comprehensive and well-targeted. A few issues worth addressing before merge:

Issues

Potential panic in `retain_best_pick_candidates`

annotate_provider.rs ~line 450:

fn retain_best_pick_candidates<T, F>(candidates: &mut Vec<PickCandidateInfo>, mut key: F) -> Option<usize>
{
    candidates.sort_by_key(|candidate| key(candidate));
    let best = key(&candidates[0]);  // ← panics if candidates is empty

pick_worst_assignment guards with candidates.is_empty() before the loop, but after the first successful truncation, a subsequent call with candidates.len() == 0 could theoretically panic. This is unreachable today since truncate(keep) where keep >= 1, but a defensive guard (or a debug_assert!(!candidates.is_empty())) would make the invariant explicit. Fix this →

`include_pick_output` is derived twice

annotate_provider.rs ~line 4211:

let include_pick_output = pick_flags.flag_pick_allele_gene;

But self.include_pick_output is also independently derived in AnnotateProvider::new. The two derivations always agree today, but having two sources of truth is fragile. The out_cols.push at the end of the function correctly uses self.include_pick_output, while the rest of the function uses the local. Consider removing the local and using self.include_pick_output (already accessible via the receiver) or passing pick_flags.flag_pick_allele_gene consistently. Fix this →

`PickFlags::from_options_json` called per window (not per provider)

annotate_provider.rs ~line 4089:

let pick_flags = PickFlags::from_options_json(self.options_json.as_deref())?;

This is inside annotate_batch, which is called once per window-batch. For large VCFs this means repeated JSON parse + validation on every batch. PickFlags could be stored on AnnotateProvider (alongside include_pick_output) and passed down, eliminating the per-batch parse. Fix this →

Breaking API signature change on a public function

golden_benchmark.rs:

// Before
pub fn csq_field_names_for_mode(everything: bool, refseq: bool, merged: bool) -> Vec<&'static str>

// After
pub fn csq_field_names_for_mode(everything: bool, refseq: bool, merged: bool, include_pick: bool) -> Vec<&'static str>

This is a semver-breaking change on a pub function. Any downstream crate (polars-bio, comet-bio) that calls csq_field_names_for_mode directly will fail to compile. The newly added csq_field_names wrapper is a good convenience but doesn't help existing callers. Consider whether a version bump or a release note is needed when coordinating with polars-bio.

VCF header description string changed

vcf_sink.rs ~line 199:

Before: "Consequence annotations from annotate_vep. Format: ..."
After:  "Consequence annotations from Ensembl VEP. Format: ..."

This silently alters the VCF INFO header for all users, not just those using pick mode. Any downstream tooling that parses the INFO header description string will see a different value. If intentional, worth calling out in the PR description.

PR description says "FLAGS field" but code adds a standalone `PICK` column

The PR note says "marks picked consequences through the existing FLAGS field instead of introducing a standalone PICK output column", but the implementation clearly adds a separate PICK typed column and a |PICK| CSQ sub-field. This makes the PR description misleading. Worth updating the note to accurately reflect the design.

Behavior note: all non-transcript consequences receive `PICK=1`

annotate_provider.rs mark_flag_pick_allele_gene:

if assignments[idx].feature_type != FeatureType::Transcript {
    assignments[idx].picked = true;  // intergenic, regulatory, motif all PICK=1
    continue;
}

This matches VEP's --flag_pick_allele_gene contract (non-transcript overlaps pass through unchanged), but it may surprise users who filter on PICK=1 expecting only one consequence per allele+gene — they will also get all intergenic/regulatory/motif rows. This is worth a sentence in the public-facing docs or SQL function signature comment.

Nits

pick_worst_assignment mirrors VEP's historical pick_worst_* naming, which is counter-intuitive from a user's perspective (it actually picks the "best"). The traceability comment explains the origin, which is sufficient.
test_mark_flag_pick_allele_gene_uses_spliced_length_for_noncoding_transcripts — the "shorter spliced" transcript has a larger genomic end coordinate (1000), making the test data confusing to read. The exon-level lengths correctly drive the assertion, but a comment on the variable names would help.

What's working well

Iterative narrowing in pick_worst_assignment faithfully mirrors the VEP algorithm
Per-criterion traceability comments with exact VEP source URLs are excellent
use_cached_fast_path guard correctly bypasses the CSQ fast path when pick mode is active
gene_stable_id_from_raw_object_json fallback correctly prefers _gene_stable_id over _gene.stable_id
parse_appris_pick_rank faithfully reimplements the VEP regex logic for principal/alternative grading
Test for cache-hit bypass (test_annotate_vep_flag_pick_allele_gene_bypasses_cache_hit_fast_path) covers an important edge case
APPRIS alternative correctly gets rank + 10 to penalise it relative to principal

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6dd7438d8f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

mwiewior · 2026-04-25T14:09:28Z

@claude pls review

claude · 2026-04-25T14:09:41Z

Claude finished @mwiewior's task in 3m 11s —— View job

PR Review: Add VEP flag_pick_allele_gene and pick_order support

Tasks:

Post initial comment
Read annotate_provider.rs changes
Read annotate_table_function.rs changes
Read vcf_sink.rs, golden_benchmark.rs, transcript_consequence.rs changes
Formulate and post review feedback

This is a well-executed, thorough implementation. All of the issues raised in the prior review have been addressed. The traceability comments linking to VEP release 115 source lines remain excellent. A few remaining points below.

Resolved from prior review

✅ retain_best_pick_candidates now guards candidates.is_empty() — no panic risk
✅ PickFlags is stored on AnnotateProvider and parsed once in new() — not per batch; self.pick_flags.clone() per batch is a cheap copy of a bool + small Vec
✅ csq_field_names_for_mode(everything, refseq, merged) signature is unchanged — not a breaking change; old callers compile as-is
✅ VCF header description still reads "from annotate_vep" — not changed globally
✅ source_cache correctly preserved for exact source ranking; pick_assignment_source_rank prefers it over the normalized source
✅ include_pick_output in annotate_batch is a simple rebind of self.include_pick_output, not a re-derivation

Remaining issues

`b_pick` builder allocated and maintained unconditionally

annotate_provider.rs — builder initialization:

let mut b_pick = ListBuilder::new(StringBuilder::new());

The builder is always created and append-ed to on every row, even when include_pick_output is false. When pick is disabled the column is never included in the output, but the allocations still happen. For large VCFs this adds per-row overhead on a branch that most users never touch. Gating the inner-loop b_pick.values().append_* calls with if include_pick_output { ... } (matching the pattern used for include_source_field) would keep the hot path clean. Fix this →

`AnnotateProvider` holds `pick_flags` and `include_pick_output` as separate fields

Both are stored on the struct; include_pick_output is always equal to pick_flags.flag_pick_allele_gene. That invariant is correct today, but having two fields means a future edit could silently break it. Since include_pick_output is only ever checked as a bool, consider dropping the field and using self.pick_flags.flag_pick_allele_gene directly everywhere. The only cost is one field access instead of one field read — negligible. Fix this →

PR description note is inaccurate

The PR note reads:

"marks picked consequences through the existing FLAGS field instead of introducing a standalone PICK output column"

The implementation does the opposite — it adds a standalone PICK typed column and a |PICK| CSQ sub-field. Non-transcript rows also receive PICK=1. This contradicts the note. Worth correcting so the PR description accurately describes what shipped.

Integration tests can't exercise `length` as a tiebreaker

pick_transcripts_batch() and pick_exons_batch() in annotate_table_function.rs give all three transcripts identical exon coordinates (100–250). This means the length criterion never resolves a tie in the integration tests — picks are always determined by mane_select or tsl. The unit tests in annotate_provider.rs do cover CDS and spliced length ranking (test_mark_flag_pick_allele_gene_prefers_cds_length_over_protein_length, test_mark_flag_pick_allele_gene_uses_spliced_length_for_noncoding_transcripts), so the logic is tested, but end-to-end coverage of pick_order=length is missing. Low-priority, but worth a follow-up test.

`pick_transcripts_batch` / `pick_transcripts_batch_with_gene_id_only_in_raw_json` duplication

These two test helpers differ only in gene_stable_id (Some vs None) and raw_object_json. All the schema and array construction is duplicated (~80 lines). A single builder function taking gene_stable_id_col and raw_json_col as parameters would eliminate the repetition. Minor, but the duplication will diverge as the schema grows.

Nit

option_value in annotate_vep_golden_bench.rs skips next-arg values that start with -- (.filter(|value| !value.starts_with("--"))). A FASTA path starting with -- would be silently ignored. Edge case for a benchmark tool, but worth a comment so the next reader doesn't wonder about it.

What's working well

Iterative narrowing in pick_worst_assignment faithfully mirrors VEP release 115's pick_worst_VariationFeatureOverlapAllele()
use_cached_fast_path correctly set to false when pick mode is active — cache-hit bypass is properly gated
parse_appris_pick_rank faithfully reimplements the VEP regex (first alpha char + trailing digits + 10 for alternative*)
gene_stable_id_from_raw_object_json correctly prefers _gene_stable_id over _gene.stable_id
csq_field_names_for_mode backward compat preserved via delegation to csq_field_names_for_mode_with_pick
Unit test coverage is comprehensive: exact source matching, APPRIS grading, CDS vs protein length, spliced length for non-coding, non-transcript PICK pass-through, missing gene ID skip
test_annotate_vep_flag_pick_allele_gene_bypasses_cache_hit_fast_path is a critical integration test that covers an important edge case
Per-criterion traceability comments with exact VEP source URLs are excellent

mwiewior · 2026-04-25T14:09:49Z

@codex pls review

chatgpt-codex-connector · 2026-04-25T14:17:49Z

Codex Review: Didn't find any major issues. 🚀

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

mwiewior · 2026-05-18T09:52:35Z

@claude pls review

mwiewior · 2026-05-18T09:52:42Z

@codex pls review

claude · 2026-05-18T09:52:50Z

Claude finished @mwiewior's task in 12m 57s —— View job

PR Review: Add VEP flag_pick_allele_gene and pick_order support (Round 3)

Tasks:

Post initial comment
Read annotate_provider.rs changes
Read annotate_table_function.rs changes
Read vcf_sink.rs and golden_benchmark.rs
Check status of issues from prior reviews
Formulate and post final review

This is a solid, well-tested implementation. The two new commits since round 2 each address a real behavioral deviation from VEP and both include updated or new tests. No new correctness issues found.

Status of prior review issues

All Round-1 issues: ✅ resolved.

Round-2 issues:

✅ PR description note now accurately describes the design (standalone PICK column with non-transcript pass-through)
✅ retain_best_pick_candidates empty-guard in place
✅ PickFlags stored on struct, not re-parsed per batch
✅ csq_field_names_for_mode signature backward-compatible (delegates to _with_pick)

Still outstanding from Round 2 (minor):

b_pick builder allocated unconditionally even when include_pick_output = false
AnnotateProvider has both pick_flags: PickFlags and include_pick_output: bool (always equals pick_flags.flag_pick_allele_gene)
Integration tests don't exercise Length as tiebreaker
pick_transcripts_batch / pick_transcripts_batch_with_gene_id_only_in_raw_json duplicated

New issues in the two new commits

Unused import: `std::cmp::Ordering` (commit `32863f3`)

annotate_provider.rs:10:

use std::cmp::Ordering;

Ordering is never named directly in the file — .cmp().then_with() uses it implicitly, but the name is never written. This generates an unused_imports lint warning. If CI runs cargo clippy -- -D warnings (as noted in CLAUDE.md), this would fail the build. Fix this →

`PickCandidateInfo.length = 0` for non-transcript features silently beats all transcripts on `Length` criterion

annotate_provider.rs:2254:

} else {
    (1, 1, 1, 1, 0, 1, 100, 100)  // defaults for non-transcript features
}

length = 0 for non-transcripts, while transcripts use length = -pick_transcript_length() (always negative). Since retain_best_pick_candidates sorts ascending and picks the minimum, all transcripts (with negative length) rank better than non-transcripts (0) in the Length criterion. This is probably the desired behavior — transcripts are always preferred over regulatory/motif features when tied on all biological criteria — but it is not documented. If a user specifies pick_order=length and has only regulatory features, the Length criterion effectively ignores them and they fall through to the index tiebreaker. Worth a brief comment documenting the invariant.

What's working well in the new commits

`1dcead4 Match VEP noncoding pick length`

Correctly aligns pick_transcript_length for non-coding transcripts with Bio::EnsEMBL::Transcript::length() (sum of exon lengths), replacing the previous spliced_seq → cdna_seq → exons fallback chain
Test test_mark_flag_pick_allele_gene_uses_exon_length_for_noncoding_transcripts now explicitly sets spliced_seq and cdna_seq on the shorter-exon transcript, proving they are no longer consulted
Traceability comment linking to the VEP Ensembl core source is excellent
unwrap_or_else fallback to genomic span when no exons are available is correct

`32863f3 Fix VEP pick allele tie ordering`

sort_pick_candidates_by_vep_order now called for ALL pick modes (Pick, PickAllele, PickAlleleGene, FlagPick, FlagPickAllele, FlagPickAlleleGene, PerGene) — previously missing for the first three
Sort is deterministic (feature_type rank → feature_id → assignment_index), matches the CSQ concat order, and is independent of COITree traversal order
pick_assignment_feature_id correctly handles non-Transcript features by falling back to tc.transcript_id.as_deref() — correctly populated for regulatory features (r.feature_id.clone()) and empty for aggregated motif consequences
test_apply_pick_allele_tie_breaks_by_vep_feature_order directly exercises the case where insertion order (ENST00000052 first) differs from lexicographic order (ENST00000051 wins)
Stable-sort + candidates.first() fallback correctly propagates the pre-sort VEP order through all biological criterion sorts

Summary

One actionable issue: the unused use std::cmp::Ordering import should be removed before merge to avoid a clippy lint failure. Everything else is sound. The implementation faithfully mirrors VEP release 115 pick semantics.

chatgpt-codex-connector · 2026-05-18T10:04:09Z

Codex Review: Didn't find any major issues. Another round soon, please!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

mwiewior force-pushed the feature/vep-flag-pick-allele-gene branch from fbcddd3 to 927a442 Compare April 10, 2026 07:45

mwiewior added 3 commits April 25, 2026 11:01

Add VEP pick allele gene support

270f3c5

Align pick allele gene logic with VEP 115

2e852c9

Match VEP PICK output and gene grouping

2ec8971

mwiewior force-pushed the feature/vep-flag-pick-allele-gene branch from 927a442 to 2ec8971 Compare April 25, 2026 09:12

mwiewior added 4 commits April 25, 2026 11:41

Wire pick flags into VEP golden benchmark

a82b1ac

Align flag_pick_allele_gene with VEP

48d890b

Use transcript length for VEP pick ranking

a0b510b

Fix VEP CSQ default field padding

6dd7438

chatgpt-codex-connector Bot reviewed Apr 25, 2026

View reviewed changes

Comment thread datafusion/bio-function-vep/src/annotate_provider.rs Outdated

Address VEP pick review feedback

b8647b1

mwiewior added 8 commits April 25, 2026 16:44

Implement VEP pick modes

7478e35

Update VEP pick OpenSpec progress

b534d20

Mark VEP pick OpenSpec complete

6b482ed

Fix parquet HGNC propagation in VEP buffers

92c079d

Preserve VCF order in parquet lookup output

2b7ff60

Align VEP input buffer feature scoping

497fba3

Fix VEP pick allele tie ordering

32863f3

Match VEP noncoding pick length

1dcead4

mwiewior mentioned this pull request May 18, 2026

Follow up on VEP pick-mode review nits #161

Open

Remove unused pick ordering import

a0e2bc1

mwiewior merged commit f542b76 into master May 18, 2026
2 checks passed

mwiewior deleted the feature/vep-flag-pick-allele-gene branch May 18, 2026 12:28

Conversation

mwiewior commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Note

Uh oh!

mwiewior commented Apr 25, 2026

Uh oh!

mwiewior commented Apr 25, 2026

Uh oh!

claude Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Add VEP flag_pick_allele_gene and pick_order support

Issues

Potential panic in retain_best_pick_candidates

include_pick_output is derived twice

PickFlags::from_options_json called per window (not per provider)

Breaking API signature change on a public function

VCF header description string changed

PR description says "FLAGS field" but code adds a standalone PICK column

Behavior note: all non-transcript consequences receive PICK=1

Nits

What's working well

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

mwiewior commented Apr 25, 2026

Uh oh!

claude Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Add VEP flag_pick_allele_gene and pick_order support

Resolved from prior review

Remaining issues

b_pick builder allocated and maintained unconditionally

AnnotateProvider holds pick_flags and include_pick_output as separate fields

PR description note is inaccurate

Integration tests can't exercise length as a tiebreaker

pick_transcripts_batch / pick_transcripts_batch_with_gene_id_only_in_raw_json duplication

Nit

What's working well

Uh oh!

mwiewior commented Apr 25, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 25, 2026

Uh oh!

mwiewior commented May 18, 2026

Uh oh!

mwiewior commented May 18, 2026

Uh oh!

claude Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Add VEP flag_pick_allele_gene and pick_order support (Round 3)

Status of prior review issues

New issues in the two new commits

Unused import: std::cmp::Ordering (commit 32863f3)

PickCandidateInfo.length = 0 for non-transcript features silently beats all transcripts on Length criterion

What's working well in the new commits

1dcead4 Match VEP noncoding pick length

32863f3 Fix VEP pick allele tie ordering

Summary

Uh oh!

chatgpt-codex-connector Bot commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mwiewior commented Apr 10, 2026 •

edited

Loading

claude Bot commented Apr 25, 2026 •

edited

Loading

Potential panic in `retain_best_pick_candidates`

`include_pick_output` is derived twice

`PickFlags::from_options_json` called per window (not per provider)

PR description says "FLAGS field" but code adds a standalone `PICK` column

Behavior note: all non-transcript consequences receive `PICK=1`

claude Bot commented Apr 25, 2026 •

edited

Loading

`b_pick` builder allocated and maintained unconditionally

`AnnotateProvider` holds `pick_flags` and `include_pick_output` as separate fields

Integration tests can't exercise `length` as a tiebreaker

`pick_transcripts_batch` / `pick_transcripts_batch_with_gene_id_only_in_raw_json` duplication

claude Bot commented May 18, 2026 •

edited

Loading

Unused import: `std::cmp::Ordering` (commit `32863f3`)

`PickCandidateInfo.length = 0` for non-transcript features silently beats all transcripts on `Length` criterion

`1dcead4 Match VEP noncoding pick length`

`32863f3 Fix VEP pick allele tie ordering`