annotate_vep(..., merged=true) omits merged-only columns from table-function/DataFrame schema

## Summary

`annotate_vep(..., merged=true)` appears to be missing the merged-only output columns on the table-function / DataFrame path:

- `REFSEQ_MATCH`
- `SOURCE`
- `REFSEQ_OFFSET`
- `GIVEN_REF`
- `USED_REF`
- `BAM_EDIT`

The VCF sink path is not affected: `annotate_to_vcf` is merged-aware and emits the merged CSQ schema correctly.

So the issue looks isolated to the `annotate_vep()` provider/schema path, not to the merged consequence engine itself.

## What I verified

I first saw this through `vepyr`, which pins `datafusion-bio-function-vep` to:

- `97375ac981bd075d77f7b1da3d04364894283be4`

`vepyr` does **not** build the DataFrame schema itself. It calls upstream `annotate_vep(...)` and forwards the returned Arrow schema/batches, so this looks upstream rather than a downstream wrapper problem.

### 1. Merged VCF output is correct

With the same merged cache and input VCF:

- with `merged=True`, the VCF output contains the merged CSQ fields in the header
- the generated VCF matches real Ensembl VEP merged output once I fixed a trimmed-cache buffering mistake on my side (see Additional context)

In other words: the merged annotation content itself looks fine on the VCF path.
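The header check can be done mechanically by pulling the field names out of the CSQ `##INFO` line. This is a minimal sketch, assuming the usual VEP convention that the CSQ description ends with `Format: A|B|...`. The helper name `csq_fields_from_header` and the shortened header line are made up for illustration and are not taken from either repository.

```python
import re

MERGED_ONLY_FIELDS = [
    "REFSEQ_MATCH", "SOURCE", "REFSEQ_OFFSET",
    "GIVEN_REF", "USED_REF", "BAM_EDIT",
]

def csq_fields_from_header(header_line: str) -> list[str]:
    """Return the pipe-delimited field names from a CSQ INFO header line."""
    # Capture everything after "Format:" up to the closing quote or bracket.
    match = re.search(r'Format:\s*([^">]+)', header_line)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split("|")]

# Shortened, made-up header line (a real merged header has many more fields).
example = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
    'annotations from Ensembl VEP. Format: Allele|Consequence|SOURCE|USED_REF">'
)
fields = csq_fields_from_header(example)
absent = [f for f in MERGED_ONLY_FIELDS if f not in fields]
print("fields", fields)
print("absent", absent)
```

Running the same helper over the header of a real `merged=True` VCF should give an empty `absent` list if the sink behaves as described above.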

### 2. Merged DataFrame / table-function output is missing columns

With `merged=True`, the DataFrame output still has the default non-merged width and is missing:

- `REFSEQ_MATCH`
- `SOURCE`
- `REFSEQ_OFFSET`
- `GIVEN_REF`
- `USED_REF`
- `BAM_EDIT`

I reproduced that even against the **full merged cache**, so this is not a trimmed-cache artifact.

## Minimal reproduction

This is the shortest repro I have at hand. I used `vepyr` only because it is a thin wrapper over `annotate_vep(...)`.

```python
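from pathlib import Path

import vepyr

# Golden test fixtures shipped with vepyr.
base = Path("tests/data/golden")
# Local path to the full (untrimmed) merged cache; adjust for your machine.
full_cache = "/Users/mwiewior/workspace/data_vepyr/115_GRCh38_merged"

# Annotate in merged mode and materialize the result as a DataFrame.
df = vepyr.annotate(
    str(base / "input.vcf.gz"),
    full_cache,
    everything=True,
    merged=True,
    reference_fasta=str(base / "reference.fa"),
).collect()

# Merged-only columns that should be present when merged=True.
missing = [
    c for c in [
        "REFSEQ_MATCH",
        "SOURCE",
        "REFSEQ_OFFSET",
        "GIVEN_REF",
        "USED_REF",
        "BAM_EDIT",
    ]
    if c not in df.columns
]

print("missing", missing)
print("width", df.width)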
```

Observed:

```text
missing ['REFSEQ_MATCH', 'SOURCE', 'REFSEQ_OFFSET', 'GIVEN_REF', 'USED_REF', 'BAM_EDIT']
width 96
```

Expected:

- merged table-function / DataFrame output should expose those six fields as typed columns
- output width should increase accordingly
- values should be consistent with the merged VCF / CSQ output

## Why this looks upstream

### `vepyr` DataFrame path only forwards upstream schema

`vepyr` gets the schema from `_create_annotator(...)` / `annotate_vep(...)` and registers it directly:

- `src/annotate.rs` in `vepyr`: builds SQL `SELECT * FROM annotate_vep(...)`
- `src/vepyr/__init__.py`: probe annotator -> Arrow schema -> Polars schema

There is no downstream schema rewrite that could explain the missing fields.

### Upstream `AnnotateProvider` schema is static and does not appear merged-aware

In `datafusion-bio-function-vep` at `97375ac`:

- `datafusion/bio-function-vep/src/annotate_provider.rs:130-133`
  - `annotation_column_defs()` is defined as a static list
- `datafusion/bio-function-vep/src/annotate_provider.rs:2064-2081`
  - `AnnotateProvider::new()` builds the output schema from `annotation_column_defs()`

That static annotation column list appears to contain only the non-merged/default fields. I do not see:

- `REFSEQ_MATCH`
- `SOURCE`
- `REFSEQ_OFFSET`
- `GIVEN_REF`
- `USED_REF`
- `BAM_EDIT`

in `annotation_column_defs()`.

### Upstream VCF sink *is* merged-aware

In contrast, the VCF sink explicitly uses merged mode when building CSQ metadata:

- `datafusion/bio-function-vep/src/vcf_sink.rs:339-345`
- `datafusion/bio-function-vep/src/vcf_sink.rs` also calls `csq_field_names_for_mode(config.everything, config.refseq, config.merged)` when constructing the CSQ header metadata

So the split seems to be:

- `annotate_to_vcf` / VCF sink: merged-aware
- `annotate_vep` / provider schema: not merged-aware

## Suggested fix direction

Make the `annotate_vep()` output schema conditional on mode, so that:

- default mode keeps the current schema
- `refseq=true` adds the RefSeq-specific columns
- `merged=true` adds the merged-specific columns

and ensure the execution path actually populates those arrays/columns, not just the CSQ string.
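To make the intended shape concrete, here is a rough sketch of a mode-conditional column list. It is written in Python only to match the repro above; the real schema is built in Rust in `annotate_provider.rs`. The function `annotation_columns`, its parameters, and the placeholder base and RefSeq column names are all hypothetical. Only the six merged column names come from this issue.

```python
from typing import Iterable

MERGED_ONLY_COLUMNS = [
    "REFSEQ_MATCH", "SOURCE", "REFSEQ_OFFSET",
    "GIVEN_REF", "USED_REF", "BAM_EDIT",
]

def annotation_columns(
    base: Iterable[str],
    refseq_extra: Iterable[str],
    merged_extra: Iterable[str],
    *,
    refseq: bool = False,
    merged: bool = False,
) -> list[str]:
    """Build the output column list for the requested mode (hypothetical)."""
    columns = list(base)
    if refseq:
        columns.extend(refseq_extra)
    if merged:
        columns.extend(merged_extra)
    # Drop duplicates while keeping first-seen order, in case the
    # RefSeq and merged extras overlap.
    return list(dict.fromkeys(columns))

# Placeholder names; the real default and RefSeq column sets are not listed here.
base_cols = ["base_col_a", "base_col_b"]
refseq_cols = ["refseq_col_placeholder"]

default_schema = annotation_columns(base_cols, refseq_cols, MERGED_ONLY_COLUMNS)
merged_schema = annotation_columns(
    base_cols, refseq_cols, MERGED_ONLY_COLUMNS, merged=True
)
print(len(default_schema), len(merged_schema))
```

The point is only that the default call returns the current list unchanged, so existing consumers of the non-merged schema would not see a difference.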

## Acceptance criteria

- `SELECT * FROM annotate_vep(..., merged=true, ...)` exposes:
  - `REFSEQ_MATCH`
  - `SOURCE`
  - `REFSEQ_OFFSET`
  - `GIVEN_REF`
  - `USED_REF`
  - `BAM_EDIT`
- those columns are populated on the table-function / DataFrame path
- schema matches what the merged VCF/CSQ path already advertises
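The first two criteria could be turned into a small regression check along these lines. This is a sketch, not an existing test in either repository: `check_merged_columns` is a hypothetical helper, and it takes a plain `dict` of column name to list of values as a stand-in for the DataFrame (with Polars, `df.to_dict(as_series=False)` should produce that shape).

```python
MERGED_ONLY_COLUMNS = [
    "REFSEQ_MATCH", "SOURCE", "REFSEQ_OFFSET",
    "GIVEN_REF", "USED_REF", "BAM_EDIT",
]

def check_merged_columns(table: dict[str, list]) -> dict[str, list[str]]:
    """Report merged-only columns that are absent or contain only nulls."""
    missing = [c for c in MERGED_ONLY_COLUMNS if c not in table]
    all_null = [
        c for c in MERGED_ONLY_COLUMNS
        if c in table and all(v is None for v in table[c])
    ]
    return {"missing": missing, "all_null": all_null}

# Made-up rows for illustration; values are not real annotation output.
table = {
    "SOURCE": ["Ensembl", "RefSeq"],
    "REFSEQ_MATCH": [None, None],
    "USED_REF": ["A", "G"],
}
report = check_merged_columns(table)
print(report)
```

An `all_null` entry is a prompt to look closer rather than a hard failure, since some of these fields can legitimately be empty for a given input.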

## Additional context

I initially saw a few merged VCF mismatches in a trimmed test fixture, but that turned out to be my local cache-slice bug: I trimmed the merged cache without the same downstream/upstream buffer used by the default golden fixture. After adding the buffer, VCF-vs-VEP merged comparison passed. The remaining problem is only the table-function/DataFrame schema.

