Summary
annotate_vep(..., merged=true) appears to be missing the merged-only output columns on the table-function / DataFrame path:
- REFSEQ_MATCH
- SOURCE
- REFSEQ_OFFSET
- GIVEN_REF
- USED_REF
- BAM_EDIT
The VCF sink path is a different story: annotate_to_vcf is merged-aware and emits the merged CSQ schema correctly.
So the issue looks isolated to the annotate_vep() provider/schema path, not to the merged consequence engine itself.
What I verified
I first saw this through vepyr, which pins datafusion-bio-function-vep to:
97375ac981bd075d77f7b1da3d04364894283be4
vepyr does not build the DataFrame schema itself. It calls upstream annotate_vep(...) and forwards the returned Arrow schema/batches, so this looks upstream rather than a downstream wrapper problem.
1. Merged VCF output is correct
With the same merged cache and input VCF:
- merged=True VCF output contains merged CSQ fields in the header
- comparing generated VCF against real Ensembl VEP merged output is clean after fixing a local trimmed-cache buffering mistake on my side
In other words: the merged annotation content itself looks fine on the VCF path.
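The header check above can be scripted. The following is a minimal sketch, assuming the sink writes the CSQ header in the usual Ensembl VEP style, with the field list after "Format: " inside the Description. The helper names csq_fields and missing_merged_fields are hypothetical and exist only for this illustration.

```python
import re

# The six merged-only fields named in this issue.
MERGED_FIELDS = [
    "REFSEQ_MATCH",
    "SOURCE",
    "REFSEQ_OFFSET",
    "GIVEN_REF",
    "USED_REF",
    "BAM_EDIT",
]

# Assumed header shape (VEP convention):
# ##INFO=<ID=CSQ,...,Description="... Format: A|B|C">
_CSQ_FORMAT = re.compile(r'^##INFO=<ID=CSQ,.*Format: ([^">]+)')


def csq_fields(header_lines):
    """Return the CSQ field names declared in a VCF header, or [] if none."""
    for line in header_lines:
        match = _CSQ_FORMAT.match(line)
        if match:
            return [name.strip() for name in match.group(1).split("|")]
    return []


def missing_merged_fields(header_lines):
    """Return the merged-only fields that the CSQ header does not declare."""
    declared = set(csq_fields(header_lines))
    return [field for field in MERGED_FIELDS if field not in declared]
```

To use it on a real file, read the lines starting with "##" from the output VCF (for a .vcf.gz, via gzip.open(path, "rt")) and pass them in. An empty result from missing_merged_fields means the header advertises all six merged fields.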
2. Merged DataFrame / table-function output is missing columns
With merged=True, the DataFrame output still has the default non-merged width and is missing:
- REFSEQ_MATCH
- SOURCE
- REFSEQ_OFFSET
- GIVEN_REF
- USED_REF
- BAM_EDIT
I reproduced that even against the full merged cache, so this is not a trimmed-cache artifact.
Minimal reproduction
This is the shortest repro I have at hand. I used vepyr only because it is a thin wrapper over annotate_vep(...).
from pathlib import Path

import vepyr

base = Path("tests/data/golden")
# Local path to the full merged VEP cache; adjust for your machine.
full_cache = "/Users/mwiewior/workspace/data_vepyr/115_GRCh38_merged"

df = vepyr.annotate(
    str(base / "input.vcf.gz"),
    full_cache,
    everything=True,
    merged=True,
    reference_fasta=str(base / "reference.fa"),
).collect()

# Merged-only columns that should be present when merged=True.
missing = [
    c for c in [
        "REFSEQ_MATCH",
        "SOURCE",
        "REFSEQ_OFFSET",
        "GIVEN_REF",
        "USED_REF",
        "BAM_EDIT",
    ]
    if c not in df.columns
]
print("missing", missing)
print("width", df.width)
Observed:
missing ['REFSEQ_MATCH', 'SOURCE', 'REFSEQ_OFFSET', 'GIVEN_REF', 'USED_REF', 'BAM_EDIT']
width 96
Expected:
- merged table-function / DataFrame output should expose those 6 fields as typed columns
- output width should increase accordingly
- values should be consistent with the merged VCF / CSQ output
Why this looks upstream
vepyr DataFrame path only forwards upstream schema
vepyr gets the schema from _create_annotator(...) / annotate_vep(...) and registers it directly:
- src/annotate.rs in vepyr: builds SQL SELECT * FROM annotate_vep(...)
- src/vepyr/__init__.py: probe annotator -> Arrow schema -> Polars schema
There is no downstream schema rewrite that could explain the missing fields.
Upstream AnnotateProvider schema is static and does not appear merged-aware
In datafusion-bio-function-vep at 97375ac:
- datafusion/bio-function-vep/src/annotate_provider.rs:130-133: annotation_column_defs() is defined as a static list
- datafusion/bio-function-vep/src/annotate_provider.rs:2064-2081: AnnotateProvider::new() builds the output schema from annotation_column_defs()
That static annotation column list appears to contain only the non-merged/default fields. I do not see any of the following in annotation_column_defs():
- REFSEQ_MATCH
- SOURCE
- REFSEQ_OFFSET
- GIVEN_REF
- USED_REF
- BAM_EDIT
Upstream VCF sink is merged-aware
In contrast, the VCF sink explicitly uses merged mode when building CSQ metadata:
- datafusion/bio-function-vep/src/vcf_sink.rs:339-345
- datafusion/bio-function-vep/src/vcf_sink.rs also calls csq_field_names_for_mode(config.everything, config.refseq, config.merged) when constructing the CSQ header metadata
So the split seems to be:
- annotate_to_vcf / VCF sink: merged-aware
- annotate_vep / provider schema: not merged-aware
Suggested fix direction
Make the annotate_vep() output schema conditional on mode, so that:
- default mode keeps the current schema
- refseq=true adds the RefSeq-specific columns
- merged=true adds the merged-specific columns
and ensure the execution path actually populates those arrays/columns, not just the CSQ string.
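To make the intended shape concrete, here is a rough sketch of a mode-conditional column list. It is written in Python for consistency with the repro; the real change would live in the Rust provider, and none of these names exist upstream. columns_for_mode and its refseq_columns parameter are hypothetical, and the issue does not enumerate the RefSeq-specific columns, so that set is left for the caller to supply.

```python
# The six columns this issue reports as missing in merged mode.
MERGED_COLUMNS = (
    "REFSEQ_MATCH",
    "SOURCE",
    "REFSEQ_OFFSET",
    "GIVEN_REF",
    "USED_REF",
    "BAM_EDIT",
)


def columns_for_mode(base, *, refseq=False, merged=False, refseq_columns=()):
    """Return the output column names for the requested mode.

    base           -- columns of the current default schema (unchanged)
    refseq_columns -- RefSeq-specific columns; not listed in the issue,
                      so the caller provides them (assumption)
    """
    columns = list(base)
    extras = []
    if refseq:
        extras.extend(refseq_columns)
    if merged:
        extras.extend(MERGED_COLUMNS)
    # Append each extra column once, preserving order, so that overlapping
    # refseq/merged sets do not produce duplicate column names.
    for name in extras:
        if name not in columns:
            columns.append(name)
    return columns
```

The point of the sketch is only that the default schema stays a strict prefix of the refseq and merged schemas, so existing consumers of the default mode see no change.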
Acceptance criteria
- SELECT * FROM annotate_vep(..., merged=true, ...) exposes:
  - REFSEQ_MATCH
  - SOURCE
  - REFSEQ_OFFSET
  - GIVEN_REF
  - USED_REF
  - BAM_EDIT
- those columns are populated on the table-function / DataFrame path
- schema matches what the merged VCF/CSQ path already advertises
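These criteria could be checked with something like the sketch below. It assumes each CSQ field maps to a table column of the same name up to letter case, which I have not confirmed for every field. schema_gaps and all_null_columns are hypothetical helper names.

```python
def schema_gaps(csq_field_names, table_columns):
    """Return CSQ fields with no same-named table column (case-insensitive)."""
    have = {column.casefold() for column in table_columns}
    return [field for field in csq_field_names if field.casefold() not in have]


def all_null_columns(data, names):
    """Return the requested columns that exist in data but hold only None.

    data  -- mapping of column name to a list of values
    names -- columns to inspect; names absent from data are skipped here
             because schema_gaps already reports them
    """
    return [
        name
        for name in names
        if name in data and all(value is None for value in data[name])
    ]
```

With the Polars frame from the repro, df.columns and df.to_dict(as_series=False) provide the two inputs. An all-null column is a hint rather than proof of a bug: a field such as BAM_EDIT can legitimately be empty for a small input, so this check is most useful on a fixture known to carry RefSeq transcripts.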
Additional context
I initially saw a few merged VCF mismatches in a trimmed test fixture, but that turned out to be my local cache-slice bug: I trimmed the merged cache without the same downstream/upstream buffer used by the default golden fixture. After adding the buffer, VCF-vs-VEP merged comparison passed. The remaining problem is only the table-function/DataFrame schema.