annotate_vep(..., merged=true) omits merged-only columns from table-function/DataFrame schema #149

@mwiewior

Summary

annotate_vep(..., merged=true) appears to be missing the merged-only output columns on the table-function / DataFrame path:

  • REFSEQ_MATCH
  • SOURCE
  • REFSEQ_OFFSET
  • GIVEN_REF
  • USED_REF
  • BAM_EDIT

The VCF sink path is a different story: annotate_to_vcf is merged-aware and emits the merged CSQ schema correctly.

So the issue looks isolated to the annotate_vep() provider/schema path, not to the merged consequence engine itself.

What I verified

I first saw this through vepyr, which pins datafusion-bio-function-vep to:

  • 97375ac981bd075d77f7b1da3d04364894283be4

vepyr does not build the DataFrame schema itself. It calls upstream annotate_vep(...) and forwards the returned Arrow schema/batches, so this looks upstream rather than a downstream wrapper problem.

1. Merged VCF output is correct

With the same merged cache and input VCF:

  • merged=True VCF output contains merged CSQ fields in the header
  • the generated VCF matches real Ensembl VEP merged output once I fixed a local trimmed-cache buffering mistake on my side

In other words: the merged annotation content itself looks fine on the VCF path.

2. Merged DataFrame / table-function output is missing columns

With merged=True, the DataFrame output still has the default non-merged width and is missing:

  • REFSEQ_MATCH
  • SOURCE
  • REFSEQ_OFFSET
  • GIVEN_REF
  • USED_REF
  • BAM_EDIT

I reproduced that even against the full merged cache, so this is not a trimmed-cache artifact.

Minimal reproduction

This is the shortest repro I have at hand. I used vepyr only because it is a thin wrapper over annotate_vep(...).

import vepyr
from pathlib import Path

base = Path("tests/data/golden")
full_cache = "/Users/mwiewior/workspace/data_vepyr/115_GRCh38_merged"

df = vepyr.annotate(
    str(base / "input.vcf.gz"),
    full_cache,
    everything=True,
    merged=True,
    reference_fasta=str(base / "reference.fa"),
).collect()

missing = [
    c for c in [
        "REFSEQ_MATCH",
        "SOURCE",
        "REFSEQ_OFFSET",
        "GIVEN_REF",
        "USED_REF",
        "BAM_EDIT",
    ]
    if c not in df.columns
]

print("missing", missing)
print("width", df.width)

Observed:

missing ['REFSEQ_MATCH', 'SOURCE', 'REFSEQ_OFFSET', 'GIVEN_REF', 'USED_REF', 'BAM_EDIT']
width 96

Expected:

  • merged table-function / DataFrame output should expose those 6 fields as typed columns
  • output width should increase accordingly
  • values should be consistent with the merged VCF / CSQ output

Why this looks upstream

vepyr's DataFrame path only forwards the upstream schema

vepyr gets the schema from _create_annotator(...) / annotate_vep(...) and registers it directly:

  • src/annotate.rs in vepyr: builds SQL SELECT * FROM annotate_vep(...)
  • src/vepyr/__init__.py: probe annotator -> Arrow schema -> Polars schema

There is no downstream schema rewrite that could explain the missing fields.

Upstream AnnotateProvider schema is static and does not appear merged-aware

In datafusion-bio-function-vep at 97375ac:

  • datafusion/bio-function-vep/src/annotate_provider.rs:130-133
    • annotation_column_defs() is defined as a static list
  • datafusion/bio-function-vep/src/annotate_provider.rs:2064-2081
    • AnnotateProvider::new() builds the output schema from annotation_column_defs()

That static annotation column list appears to contain only the non-merged/default fields. I do not see:

  • REFSEQ_MATCH
  • SOURCE
  • REFSEQ_OFFSET
  • GIVEN_REF
  • USED_REF
  • BAM_EDIT

in annotation_column_defs().

Upstream VCF sink is merged-aware

In contrast, the VCF sink explicitly uses merged mode when building CSQ metadata:

  • datafusion/bio-function-vep/src/vcf_sink.rs:339-345
  • datafusion/bio-function-vep/src/vcf_sink.rs also calls csq_field_names_for_mode(config.everything, config.refseq, config.merged) when constructing the CSQ header metadata

So the split seems to be:

  • annotate_to_vcf / VCF sink: merged-aware
  • annotate_vep / provider schema: not merged-aware

Suggested fix direction

Make the annotate_vep() output schema conditional on mode, so that:

  • default mode keeps the current schema
  • refseq=true adds the RefSeq-specific columns
  • merged=true adds the merged-specific columns

and ensure the execution path actually populates those arrays/columns, not just the CSQ string.
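
A rough sketch of what a mode-conditional column list could look like. It is written in Python for brevity even though the provider is Rust, and it is not the upstream implementation: `build_annotation_columns`, `MERGED_ONLY_COLUMNS`, and the `refseq_columns` parameter are hypothetical names. The RefSeq-specific column set is left as a caller-supplied input because this issue does not enumerate it.

```python
from typing import Sequence

# The six merged-only columns named in this issue.
MERGED_ONLY_COLUMNS = (
    "REFSEQ_MATCH",
    "SOURCE",
    "REFSEQ_OFFSET",
    "GIVEN_REF",
    "USED_REF",
    "BAM_EDIT",
)


def build_annotation_columns(
    base_columns: Sequence[str],
    refseq_columns: Sequence[str] = (),
    *,
    refseq: bool = False,
    merged: bool = False,
) -> list[str]:
    """Return output column names for the requested mode (hypothetical helper)."""
    columns = list(base_columns)  # default mode keeps the current schema
    extras: list[str] = []
    if refseq:
        extras.extend(refseq_columns)  # assumption: supplied by the caller
    if merged:
        extras.extend(MERGED_ONLY_COLUMNS)
    # Append each extra column once, preserving order.
    for name in extras:
        if name not in columns:
            columns.append(name)
    return columns
```

The point of the sketch is only that the schema is derived from the mode flags at construction time rather than from a single static list; the populated arrays would need to follow the same column order.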

Acceptance criteria

  • SELECT * FROM annotate_vep(..., merged=true, ...) exposes:
    • REFSEQ_MATCH
    • SOURCE
    • REFSEQ_OFFSET
    • GIVEN_REF
    • USED_REF
    • BAM_EDIT
  • those columns are populated on the table-function / DataFrame path
  • schema matches what the merged VCF/CSQ path already advertises
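
The last criterion could be checked mechanically by comparing the CSQ fields advertised in the merged VCF header against the DataFrame columns. The sketch below assumes the usual VEP-style `Format: A|B|C` text in the CSQ INFO description, and assumes the merged columns carry the same names as the CSQ fields (as the lists in this issue suggest). The helper names are hypothetical.

```python
import re

# The six merged-only fields named in this issue.
MERGED_FIELDS = (
    "REFSEQ_MATCH",
    "SOURCE",
    "REFSEQ_OFFSET",
    "GIVEN_REF",
    "USED_REF",
    "BAM_EDIT",
)


def csq_fields_from_header(header_lines):
    """Extract CSQ sub-field names from VCF header lines.

    Assumes a VEP-style 'Format: A|B|C' inside the CSQ INFO description.
    """
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ,"):
            match = re.search(r'Format: ([^">]+)', line)
            if match:
                return match.group(1).strip().split("|")
    return []


def merged_fields_missing(header_lines, frame_columns):
    """Merged fields advertised in the CSQ header but absent from the frame."""
    advertised = csq_fields_from_header(header_lines)
    present = set(frame_columns)
    return [
        name
        for name in advertised
        if name in MERGED_FIELDS and name not in present
    ]
```

Against the repro above, `merged_fields_missing(vcf_header_lines, df.columns)` should return an empty list once the schema is fixed; `vcf_header_lines` here stands for the `##` lines read from the `annotate_to_vcf` output.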

Additional context

I initially saw a few merged VCF mismatches in a trimmed test fixture, but that turned out to be my local cache-slice bug: I trimmed the merged cache without the same downstream/upstream buffer used by the default golden fixture. After adding the buffer, VCF-vs-VEP merged comparison passed. The remaining problem is only the table-function/DataFrame schema.
