Phasing: cis/trans-aware effect prediction for nearby variants #269

@iskandr

Background

When two variants land close to each other (same codon, same exon, same read span), their combined effect on the protein depends on whether they are in cis (same haplotype/allele) or trans (different haplotypes). Varcode currently predicts each variant's effect independently, against the reference — which silently assumes no interaction.

Example: two SNVs in the same codon

  • Reference codon: `GCA` (Ala)
  • Variant A: position 1, G→T
  • Variant B: position 3, A→T

In cis (both on same allele): `GCA` → `TCT` (Ala → Ser). One substitution effect.
In trans (each on different allele): two separate substitutions. Variant A alone: `GCA` → `TCA` (Ala → Ser). Variant B alone: `GCA` → `GCT` (Ala → Ala, silent).

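Varcode today reports variant A as `A→S` and variant B as `A→A` (silent), which is correct only if the two variants are in trans. If they're actually in cis, the allele carries a single `A→S` change at that codon (via `TCT`), not two independent events. In this example the joint amino acid happens to match variant A alone; for other codons the joint residue can differ from both individual predictions.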
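The codon example above can be checked with a few lines of Python. This is a minimal sketch, not varcode code: `CODON_TABLE` only covers the four codons in the example, and `apply_snvs` is a hypothetical helper name.

```python
# Partial codon table: only the codons that appear in this example.
CODON_TABLE = {"GCA": "Ala", "GCT": "Ala", "TCA": "Ser", "TCT": "Ser"}


def apply_snvs(codon, snvs):
    """Apply SNVs given as (1-based codon position, ref base, alt base)."""
    bases = list(codon)
    for pos, ref, alt in snvs:
        if bases[pos - 1] != ref:
            raise ValueError(f"expected {ref} at codon position {pos}")
        bases[pos - 1] = alt
    return "".join(bases)


REF_CODON = "GCA"
VARIANT_A = (1, "G", "T")
VARIANT_B = (3, "A", "T")

# cis: both variants are applied to the same copy of the codon.
cis_codon = apply_snvs(REF_CODON, [VARIANT_A, VARIANT_B])

# trans: each variant is applied to its own copy of the reference codon.
trans_codons = [
    apply_snvs(REF_CODON, [VARIANT_A]),
    apply_snvs(REF_CODON, [VARIANT_B]),
]

print("cis:  ", cis_codon, CODON_TABLE[cis_codon])
for codon in trans_codons:
    print("trans:", codon, CODON_TABLE[codon])
```

The cis case yields one codon (`TCT`), while the trans case yields two alleles (`TCA` and `GCT`), matching the walk-through above.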

Other scenarios

  • Compound heterozygotes: two damaging LoF variants in the same gene. In trans → full knockout. In cis → one functional copy remains. Clinical interpretation differs radically.
  • Frameshift rescue: a frameshift insertion and a nearby frameshift deletion on the same allele can restore the reading frame when their net length change is a multiple of 3. Predicted independently, each shifts the frame.
  • Phased germline + somatic: a somatic variant on the same haplotype as a germline SNP (see Germline-aware effect prediction (umbrella) #268) affects one allele; in trans, it affects the other. Peptide context differs.
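For the frameshift-rescue scenario, the core check is simple arithmetic on indel lengths. The sketch below assumes the indels are already known to be in cis and represents each one only by its net length change (positive for insertions, negative for deletions); `restores_frame` is a hypothetical name. It deliberately ignores the altered residues between the two indels and any stop codon introduced in the shifted interval.

```python
def restores_frame(length_changes):
    """Return True if every indel shifts the frame on its own but the
    combined length change of all indels on the allele is a multiple of 3."""
    each_shifts_frame = all(delta % 3 != 0 for delta in length_changes)
    net_in_frame = sum(length_changes) % 3 == 0
    return each_shifts_frame and net_in_frame


# +1 insertion followed by a -1 deletion on the same allele: frame restored.
print(restores_frame([+1, -1]))
# Two +1 insertions: net +2, still out of frame.
print(restores_frame([+1, +1]))
# +1 and +2 insertions: net +3, frame restored with one extra codon.
print(restores_frame([+1, +2]))
```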

Scope

  1. Preserve phase information from VCFs: VCF `GT` fields distinguish phased (`0|1`) from unphased (`0/1`) genotypes. Currently varcode discards this in the metadata dict. First-class access would let effect prediction take advantage of it when available.

  2. Phased effect prediction: when two or more variants overlap the same codon / same exon / a short window and are known to be in cis, predict their joint effect rather than independent effects. The result could be a `HaplotypeEffect` carrying multiple source variants.

  3. Phase block awareness: modern variant callers and tools like WhatsHap produce phase-set blocks (`PS` tag). Variants within the same block on the same haplotype are phased relative to each other. Varcode should respect these blocks.

  4. Unphased-with-evidence fallback: when no phasing information is available but two variants are close enough that long reads or paired-end reads could resolve phase, emit both the cis and trans predictions as candidates (ties in to the possibility-set model in Incorporate RNA-level evidence for variant effects #259).
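As a rough illustration of items 1 and 3, here is a sketch of pulling `GT` phasing and the `PS` tag out of a VCF sample column and deciding whether two calls are in cis. It parses the raw FORMAT and sample strings directly rather than going through varcode or any VCF library; `PhasedGenotype`, `parse_sample`, and `phase_relation` are hypothetical names, and only diploid-style genotypes with a single separator type are handled.

```python
from typing import NamedTuple, Optional, Tuple


class PhasedGenotype(NamedTuple):
    alleles: Tuple[Optional[int], ...]  # None for a missing allele (".")
    phased: bool
    phase_set: Optional[str]


def parse_sample(format_field: str, sample_field: str) -> PhasedGenotype:
    """Parse e.g. FORMAT 'GT:PS' and sample '0|1:12345'."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    gt = values.get("GT", ".")
    phased = "|" in gt
    alleles = tuple(
        None if a == "." else int(a) for a in gt.replace("|", "/").split("/")
    )
    ps = values.get("PS")
    if ps == "." or not phased:
        ps = None
    return PhasedGenotype(alleles, phased, ps)


def phase_relation(a: PhasedGenotype, b: PhasedGenotype) -> str:
    """Return 'cis', 'trans', or 'unknown' for two calls from one sample."""
    if not (a.phased and b.phased) or a.phase_set != b.phase_set:
        return "unknown"
    # Haplotype indices (position around '|') that carry a non-reference allele.
    alt_haps_a = {i for i, allele in enumerate(a.alleles) if allele}
    alt_haps_b = {i for i, allele in enumerate(b.alleles) if allele}
    if not alt_haps_a or not alt_haps_b:
        return "unknown"
    return "cis" if alt_haps_a & alt_haps_b else "trans"


first = parse_sample("GT:PS", "0|1:12345")
second = parse_sample("GT:PS", "0|1:12345")
third = parse_sample("GT:PS", "1|0:12345")
unphased = parse_sample("GT:DP", "0/1:30")

print(phase_relation(first, second))
print(phase_relation(first, third))
print(phase_relation(first, unphased))
```

An `unknown` result is where the item 4 fallback would apply: emit both the cis and the trans prediction as candidates.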

Design questions

  • Data model: where does phase live? Attached to the variant (with the phase set ID), or on a new `Haplotype` object that groups variants from the same phase block?
  • Priority: should phased joint effects replace individual effects, or coexist with them? (I lean toward coexist — different consumers want different granularities.)
  • Multi-sample: phasing is per-sample. In a multi-sample VCF, the same variant may be phased differently across samples.
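To make the data-model question concrete, here is one possible shape for the `Haplotype` option, keyed per sample so that the multi-sample case falls out naturally. Everything here is hypothetical: `PhasedCall` and `group_into_haplotypes` are illustrative names, and variants are plain strings standing in for varcode `Variant` objects.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class PhasedCall:
    sample: str
    variant: str          # stand-in for a varcode Variant
    phase_set: str        # PS value for this sample
    haplotype_index: int  # 0 = left of "|", 1 = right of "|"


@dataclass
class Haplotype:
    sample: str
    phase_set: str
    haplotype_index: int
    variants: List[str] = field(default_factory=list)


def group_into_haplotypes(calls: Iterable[PhasedCall]) -> List[Haplotype]:
    """Group calls that share sample, phase set, and haplotype index."""
    groups: Dict[Tuple[str, str, int], Haplotype] = {}
    for call in calls:
        key = (call.sample, call.phase_set, call.haplotype_index)
        if key not in groups:
            groups[key] = Haplotype(*key)
        groups[key].variants.append(call.variant)
    return list(groups.values())


calls = [
    # In sample "tumor" the two variants sit on the same haplotype (cis).
    PhasedCall("tumor", "chr1:100 G>T", "100", 0),
    PhasedCall("tumor", "chr1:102 A>T", "100", 0),
    # In sample "normal" the same two variants are in trans.
    PhasedCall("normal", "chr1:100 G>T", "100", 0),
    PhasedCall("normal", "chr1:102 A>T", "100", 1),
]

for hap in group_into_haplotypes(calls):
    print(hap.sample, hap.phase_set, hap.haplotype_index, hap.variants)
```

With this shape, individual effects can still be computed per variant, and joint effects per `Haplotype` with more than one variant, which fits the "coexist" option above.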

Dependencies

Related prior art

  • WhatsHap (https://whatshap.readthedocs.io/) — the de facto tool for producing phased VCFs from read data.
  • HGVS supports cis/trans notation: `c.[1A>T];[5G>C]` (trans) vs `c.[1A>T;5G>C]` (cis).

Part of the #270 umbrella.
