Summary
vepyr computes HGVSc and HGVS_OFFSET for certain non-coding transcript indels (n. notation) and 3'UTR indels (c.* notation) where Ensembl VEP leaves both fields empty. This causes 35 HGVSc + 35 HGVS_OFFSET = 70 field-level mismatches on chr6 of the GIAB HG002 benchmark (271,966 variants, 2,371,184 CSQ entries).
Pattern
All 35 mismatches share the same structure:
- vepyr produces a valid HGVSc (e.g., ENST00000663428.2:n.1676_1677del) and a non-zero HGVS_OFFSET
- VEP produces an empty string for both HGVSc and HGVS_OFFSET
The affected transcripts are either:
- Non-coding transcripts: HGVSc uses the n. prefix (e.g., lncRNAs, processed pseudogenes)
- 3'UTR indels: HGVSc uses the c.* prefix (downstream of the stop codon)
All are indels in repeat contexts that require 3' HGVS shifting (hence the non-zero HGVS_OFFSET).
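
To make the 3' shifting concrete, here is a minimal sketch of how a deletion inside a repeat can slide towards the 3' end and how the shift distance would be counted. It is not vepyr's or VEP's implementation. The function name, the 0-based coordinates, and the assumption that the sequence is already in transcript orientation are all illustrative.

```python
def shift_deletion_3prime(seq: str, start: int, length: int) -> tuple[int, int]:
    """Slide a deletion as far 3' as the sequence allows.

    seq    -- reference sequence in transcript orientation (assumption)
    start  -- 0-based index of the first deleted base (assumption)
    length -- number of deleted bases
    Returns (new_start, offset), where offset is the number of bases shifted.
    """
    pos = start
    # Moving the deleted window one base to the right leaves the resulting
    # sequence unchanged as long as the first deleted base equals the first
    # base after the window.
    while pos + length < len(seq) and seq[pos] == seq[pos + length]:
        pos += 1
    return pos, pos - start


# Hypothetical TA repeat: deleting the first "TA" is equivalent to deleting
# the last one, so the deletion shifts 6 bases to the right.
new_start, offset = shift_deletion_3prime("GCTATATATAG", start=2, length=2)
```

In this toy case the deletion moves from index 2 to index 8, so the offset is 6. A non-zero HGVS_OFFSET in the tables below reflects the same kind of shift.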
Hypothesis
VEP may skip HGVSc computation when the 3' shift lands outside the transcript's spliced sequence, or when _return_3prime() fails for non-coding transcripts that lack a spliced_seq. In these cases VEP would silently return undef and emit empty fields, while vepyr's genomic shift path succeeds and emits a value.
It's also possible this is a vepyr improvement: the values appear correct, and VEP's omission may be a bug. VEP's _calc_hgvs() needs to be investigated for non-coding transcripts to determine whether the empty output is intentional.
Examples
Non-coding transcripts (n. notation)
| Variant | Transcript | vepyr HGVSc | vepyr HGVS_OFFSET | VEP |
|---|---|---|---|---|
| chr6:2307503 CTA>C | ENST00000663428.2 | n.1676_1677del | 25 | empty |
| chr6:2307503 CTA>CTATA | ENST00000663428.2 | n.1677_1678insAT | 25 | empty |
| chr6:4490858 AAAG>A | ENST00000827273.1 | n.1058_1060del | 4 | empty |
| chr6:6719895 TAAA>T | ENST00000776254.1 | n.1051_1053del | 14 | empty |
| chr6:8441367 CAAA>C | ENST00000790778.1 | n.783_785del | 15 | empty |
| chr6:10456777 C>CAA | ENST00000366312.2 | n.666_667insAA | 14 | empty |
| chr6:22194385 CAA>C | ENST00000606851.5 | n.1901_1902del | 13 | empty |
| chr6:22194385 CAA>C | ENST00000822831.1 | n.664_665del | 13 | empty |
| chr6:22194385 CAA>C | ENST00000822928.1 | n.720_721del | 13 | empty |
3'UTR (c.* notation)
| Variant | Transcript | vepyr HGVSc | vepyr HGVS_OFFSET | VEP |
|---|---|---|---|---|
| chr6:25781174 CAGAA>C | ENST00000377905.9 | c.*2012_*2015del | 25 | empty |
Impact
- 35 / 2,371,184 CSQ entries (about 0.0015%)
- vepyr produces more data than VEP, not wrong data
- All emitted values appear structurally valid (correct transcript ID, correct notation prefix, plausible positions)
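
The structural checks mentioned above could be automated roughly as follows. This is a hedged sketch, not part of vepyr. The regex and the function name are assumptions, and it covers only simple del/dup/ins forms with plain or `*` positions. Intronic offsets, 5'UTR positions, and substitutions are out of scope, and it says nothing about whether the positions are numerically correct.

```python
import re

# Simplified pattern for the HGVSc shapes seen in this issue (assumption):
# <ENST id>.<version>:<n|c>.<start>[_<end>]<del|dup|insBASES>
HGVSC_RE = re.compile(
    r"^(?P<tx>ENST\d{11}\.\d+):(?P<kind>[nc])\."
    r"(?P<start>\*?\d+)(?:_(?P<end>\*?\d+))?"
    r"(?P<edit>del|dup|ins[ACGT]+)$"
)


def is_structurally_valid(hgvsc: str) -> bool:
    """Return True if the string has a plausible HGVSc shape (illustrative only)."""
    m = HGVSC_RE.match(hgvsc)
    if m is None:
        return False
    start = m["start"]
    end = m["end"] or start
    start_utr, end_utr = start.startswith("*"), end.startswith("*")
    # '*' positions (downstream of the stop codon) only exist in c. notation.
    if m["kind"] == "n" and (start_utr or end_utr):
        return False
    # A range cannot begin in the 3'UTR and end in the coding sequence.
    if start_utr and not end_utr:
        return False
    # A range crossing the stop codon is accepted without further checks.
    if end_utr and not start_utr:
        return True
    s, e = int(start.lstrip("*")), int(end.lstrip("*"))
    if s > e:
        return False
    # Insertions must name the two flanking, adjacent positions.
    if m["edit"].startswith("ins") and (m["end"] is None or e != s + 1):
        return False
    return True
```

Under these assumptions, the values in the example tables pass once they are prefixed with their transcript ID, while a string such as `ENST00000663428.2:n.*5del` would be rejected.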
Investigation needed
- Check VEP's Perl _calc_hgvs() / _return_3prime() for non-coding transcripts: does it intentionally skip, or fail silently?
- Verify the emitted positions are numerically correct by manual cDNA coordinate lookup
- Decide: match VEP (suppress) or accept as improvement
Context
Identified in vepyr chr6 fast annotation benchmark after fixing #99 (HGVS_OFFSET flank buffer). These mismatches are unrelated to #99 — they exist because vepyr computes HGVS for transcripts where VEP doesn't, not because of incorrect shift values.
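
The distinction drawn here, between fields present only in vepyr and fields whose values differ, can be made explicit when tallying mismatches. The sketch below assumes field values are available as plain strings, with an empty string meaning "not emitted". The function names and category labels are hypothetical and not taken from the benchmark code.

```python
from collections import Counter
from typing import Iterable


def classify(vepyr_value: str, vep_value: str) -> str:
    """Label one field comparison (empty string = field not emitted)."""
    if vepyr_value == vep_value:
        return "match"
    if vep_value == "":
        return "vepyr_only"      # the class of mismatch described in this issue
    if vepyr_value == "":
        return "vep_only"
    return "value_differs"       # e.g. an incorrect shift, as in #99


def tally(pairs: Iterable[tuple[str, str]]) -> Counter:
    """Count comparison outcomes over (vepyr, VEP) field pairs."""
    return Counter(classify(a, b) for a, b in pairs)


# Illustrative input only: one HGVSc and one HGVS_OFFSET field that VEP left
# empty, plus one matching field.
counts = tally([("n.1676_1677del", ""), ("25", ""), ("n.5del", "n.5del")])
```

With a breakdown like this, the 70 field-level mismatches reported above would all be expected to fall into the `vepyr_only` bucket.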