HGVSc/HGVS_OFFSET emitted for non-coding indels where VEP produces empty #112

@mwiewior

Summary

vepyr computes HGVSc and HGVS_OFFSET for certain non-coding transcript indels (n. notation) and 3'UTR indels (c.* notation) where Ensembl VEP leaves both fields empty. This causes 35 HGVSc + 35 HGVS_OFFSET = 70 field-level mismatches on chr6 of the GIAB HG002 benchmark (271,966 variants, 2,371,184 CSQ entries).

Pattern

All 35 mismatches share the same structure:

  • vepyr produces a valid HGVSc (e.g., ENST00000663428.2:n.1676_1677del) and a non-zero HGVS_OFFSET
  • VEP produces an empty string for both HGVSc and HGVS_OFFSET

The affected transcripts are either:

  1. Non-coding transcripts — HGVSc uses n. prefix (e.g., lncRNAs, processed pseudogenes)
  2. 3'UTR indels — HGVSc uses c.* prefix (downstream of stop codon)

All are indels in repeat contexts that require 3' HGVS shifting (hence the non-zero HGVS_OFFSET).
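To illustrate why a repeat context yields a non-zero HGVS_OFFSET, here is a minimal sketch of 3' shifting for a deletion. It is not vepyr's or VEP's implementation; the function name `shift_deletion_3prime` and the toy sequence are assumptions made for illustration only.

```python
def shift_deletion_3prime(seq: str, start: int, length: int) -> tuple[int, int]:
    """Slide a deletion of `length` bases at 0-based `start` as far 3' as possible.

    Returns (new_start, offset), where offset is the number of bases moved.
    Hypothetical helper, not part of vepyr or VEP.
    """
    pos = start
    # The window can move one base to the right whenever the base leaving it
    # on the left equals the base entering it on the right: the resulting
    # sequence after deletion is then unchanged.
    while pos + length < len(seq) and seq[pos] == seq[pos + length]:
        pos += 1
    return pos, pos - start


# Toy sequence: a TA repeat flanked by unique bases (not real chr6 sequence).
seq = "GC" + "TA" * 5 + "GG"
new_start, offset = shift_deletion_3prime(seq, start=2, length=2)
print(new_start, offset)
```

In this toy case the deletion of the first TA unit slides to the last TA unit, so the reported offset equals the distance travelled through the repeat. Outside a repeat the loop exits immediately and the offset is 0.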

Hypothesis

VEP may skip HGVSc computation when the 3' shift lands outside the transcript's spliced sequence, or when _return_3prime() fails for non-coding transcripts that lack a spliced_seq. In these cases VEP would silently return undef and emit empty fields, while vepyr's genomic shift path succeeds and emits a value.

It's also possible this is a vepyr improvement — the values appear correct, and VEP's omission may be a bug. Needs investigation of VEP's _calc_hgvs() for non-coding transcripts to determine whether the empty output is intentional.

Examples

Non-coding transcripts (n. notation)

| Variant | Transcript | vepyr HGVSc | vepyr HGVS_OFFSET | VEP HGVSc / HGVS_OFFSET |
|---|---|---|---|---|
| chr6:2307503 CTA>C | ENST00000663428.2 | n.1676_1677del | 25 | empty |
| chr6:2307503 CTA>CTATA | ENST00000663428.2 | n.1677_1678insAT | 25 | empty |
| chr6:4490858 AAAG>A | ENST00000827273.1 | n.1058_1060del | 4 | empty |
| chr6:6719895 TAAA>T | ENST00000776254.1 | n.1051_1053del | 14 | empty |
| chr6:8441367 CAAA>C | ENST00000790778.1 | n.783_785del | 15 | empty |
| chr6:10456777 C>CAA | ENST00000366312.2 | n.666_667insAA | 14 | empty |
| chr6:22194385 CAA>C | ENST00000606851.5 | n.1901_1902del | 13 | empty |
| chr6:22194385 CAA>C | ENST00000822831.1 | n.664_665del | 13 | empty |
| chr6:22194385 CAA>C | ENST00000822928.1 | n.720_721del | 13 | empty |

3'UTR (c.* notation)

| Variant | Transcript | vepyr HGVSc | vepyr HGVS_OFFSET | VEP HGVSc / HGVS_OFFSET |
|---|---|---|---|---|
| chr6:25781174 CAGAA>C | ENST00000377905.9 | c.*2012_*2015del | 25 | empty |

Impact

  • 35 / 2,371,184 CSQ entries (about 0.0015%)
  • vepyr produces more data than VEP, not wrong data
  • All emitted values appear structurally valid (correct transcript ID, correct notation prefix, plausible positions)
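The structural checks listed above (transcript ID, notation prefix, position shape) can be sketched as a small regex test. This is a shape check only and says nothing about whether the positions are numerically correct. The pattern covers just the del/dup/ins edits seen in this issue, and `looks_structurally_valid` is a hypothetical helper, not a vepyr function.

```python
import re

# Position token: optional "*" (3'UTR) or "-" (5'UTR) prefix, digits,
# and an optional intronic offset such as +5 or -12.
_POS = r"[*-]?\d+(?:[+-]\d+)?"
_HGVSC = re.compile(
    rf"^(?P<tx>ENST\d+\.\d+):(?P<kind>[cn])\.(?P<start>{_POS})(?:_(?P<end>{_POS}))?"
    r"(?P<edit>del|dup|ins[ACGT]+)$"
)


def looks_structurally_valid(hgvsc: str, expected_tx: str) -> bool:
    """Check transcript ID, c./n. prefix, and position shape of an HGVSc string."""
    m = _HGVSC.match(hgvsc)
    if m is None or m["tx"] != expected_tx:
        return False
    # "*" positions are relative to the stop codon, so they only make sense
    # in coding (c.) notation, never in non-coding (n.) notation.
    positions = [m["start"], m["end"] or ""]
    if m["kind"] == "n" and any(p.startswith("*") for p in positions):
        return False
    return True


print(looks_structurally_valid("ENST00000663428.2:n.1676_1677del", "ENST00000663428.2"))
print(looks_structurally_valid("ENST00000377905.9:c.*2012_*2015del", "ENST00000377905.9"))
```

Both example strings from this issue pass the check; a string with a mismatched transcript ID or a `*` position under `n.` would be rejected.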

Investigation needed

  1. Check VEP Perl _calc_hgvs() / _return_3prime() for non-coding transcripts — does it intentionally skip, or fail silently?
  2. Verify the emitted positions are numerically correct by manual cDNA coordinate lookup
  3. Decide: match VEP (suppress) or accept as improvement
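To support the investigation, the affected entries could be pulled out by pairing CSQ entries from the two annotators. The sketch below assumes both outputs use the same pipe-delimited CSQ layout and that entries for one variant can be matched on (Allele, Feature). The four-field layout and the function names are assumptions for illustration; the real field order comes from the CSQ Format line in the VCF header.

```python
def parse_csq(entry: str, fields: list[str]) -> dict[str, str]:
    """Split one pipe-delimited CSQ entry into a field-name -> value mapping."""
    return dict(zip(fields, entry.split("|")))


def vepyr_only_hgvsc(
    vepyr_entries: list[str], vep_entries: list[str], fields: list[str]
) -> list[dict[str, str]]:
    """Return vepyr entries with HGVSc set where the matching VEP entry has none.

    Both lists hold the CSQ entries of a single variant record.
    """
    vep_by_key = {}
    for entry in vep_entries:
        rec = parse_csq(entry, fields)
        vep_by_key[(rec["Allele"], rec["Feature"])] = rec

    hits = []
    for entry in vepyr_entries:
        rec = parse_csq(entry, fields)
        other = vep_by_key.get((rec["Allele"], rec["Feature"]))
        if other is not None and rec["HGVSc"] and not other["HGVSc"]:
            hits.append(rec)
    return hits


# Hypothetical reduced layout; real CSQ strings carry many more fields.
fields = ["Allele", "Feature", "HGVSc", "HGVS_OFFSET"]
vepyr = ["-|ENST00000663428.2|ENST00000663428.2:n.1676_1677del|25"]
vep = ["-|ENST00000663428.2||"]
for rec in vepyr_only_hgvsc(vepyr, vep, fields):
    print(rec["Feature"], rec["HGVSc"], rec["HGVS_OFFSET"])
```

Collecting the hits per variant gives a worklist for step 2, the manual cDNA coordinate lookup.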

Context

Identified in vepyr chr6 fast annotation benchmark after fixing #99 (HGVS_OFFSET flank buffer). These mismatches are unrelated to #99 — they exist because vepyr computes HGVS for transcripts where VEP doesn't, not because of incorrect shift values.
