Skip to content

Germline-aware effect prediction (umbrella) #268

@iskandr

Description

@iskandr

Summary

Today varcode predicts variant effects against the reference genome. For somatic-variant analysis in a real patient, the biologically correct "before" state is the patient's germline — not the reference. This umbrella tracks the work to make varcode's effect prediction germline-aware.

The work is large enough and gnarly enough that it lives as an umbrella with focused sub-issues. This issue captures the master plan, design decisions, complications, and acceptance criteria. Sub-issues are listed at the bottom.

Status: planning. Do not start implementation until #271 (foundational MutantTranscript data model) lands; this work composes naturally on top of it and would otherwise duplicate logic that #271 will rewrite.


Why "reference-relative" is wrong

Reference codon GCC (Ala). Patient is heterozygous for a germline V_g changing position 2 G→T:

  • Haplotype A: GCC (Ala)
  • Haplotype B: GTC (Val)

Somatic V_s changes position 1 G→A. The somatic effect depends on which haplotype it landed on:

  • In cis with V_g (haplotype B): codon was GTC, becomes ATC (Ile) → Val→Ile
  • In trans with V_g (haplotype A): codon was GCC, becomes ACC (Thr) → Ala→Thr

Today's varcode reports Ala→Thr (reference-relative). That's wrong on haplotype B. Without phase information, the honest output is a set {Val→Ile, Ala→Thr}; with phase, the set collapses to one. This pattern repeats wherever germline and somatic share genomic context: codons, splice signals, reading frame.


Complications and how each affects the design

1. Phasing changes the answer

When germline and somatic share a codon (or a splice region), phase determines the outcome. Three phase states matter:

State Source Treatment
Phased VCF | GT separator + PS tag (WhatsHap, HapCUT2, longshot, assembly tools) Use the phase set; one outcome per haplotype
Implicit phase Variants on hemizygous chromosomes (chrX/Y in males, chrM) Single haplotype; phase is trivial
Unphased / GT separator, no PS Emit possibility set per phase hypothesis

Read-length implications (docs only — same code path):

  • Short-read data (~150bp Illumina): read-backed phasing only resolves variants within a paired-end insert (~500bp). For most genes, somatic vs germline phase across longer distances is not available without statistical phasing. Most tumor/normal pairs in the wild are short-read; expect frequent unphased + same-codon situations → possibility sets are the common case.
  • Long-read data (PacBio HiFi, ONT): phasing typically resolved across whole genes. If the user ran WhatsHap/HapCUT2 on long reads, expect PS tags for almost all variants in coding regions; possibility sets become rare.
  • Assembly-based phasing (hifiasm, Verkko, Exacto): produces full haplotype sequences. Strictly richer than per-position phase. Belongs with Add loader for Exacto output formats #260 (Exacto loader) — assembled haplotypes flow into varcode as MutantTranscript objects directly, bypassing the per-variant phasing logic.

For varcode's API, the read-length distinction is invisible — what matters is whether PS tags are present. Documentation should explain this so users understand why their short-read pipeline produces possibility sets while a long-read pipeline doesn't.

2. Germline can break splice signals

Patient has germline at -3 of an exon (the M of MAG donor). Cases:

  • Germline-canonical: G→A makes MAG → AAG (still M=A). Somatic at -1 (G→T) makes AAG → AAT (broken). Real splice disruption.
  • Germline-already-non-canonical: germline at the same -3 makes MAG → TAG (T is not M=A/C). Site was already non-canonical before the somatic. Somatic disruption is irrelevant.
  • Germline-fully-disruptive: germline at -1 already broke the donor. Patient's transcript already uses a cryptic splice or skips the exon. Somatic in the same area is operating on a different effective transcript than reference.

Implications:

  • Germline-aware splice classification needs to rebuild the patient's splice signal by applying germline first, then run the sequence-aware classifier from 2.0.0
  • When germline disrupts splice signals, the patient's transcript may not even include the exon containing the somatic variant. Without RNA evidence (Incorporate RNA-level evidence for variant effects #259), we can't know what the patient's actual transcript looks like
  • For v1: use canonical splicing as the assumption; flag when germline disrupts splice signals so the user knows the prediction may be unreliable

3. Some germline variants are invisible in the protein

A germline variant in an intron that is far from any splice site has no protein impact and shouldn't enter germline-aware prediction at all. A germline variant in an exon that gets skipped (due to another germline variant) has no protein impact either — the exon isn't in the patient's transcript.

For v1, we apply all germline variants in the window regardless of expressed status. This is conservative — we may mention germline context that the patient doesn't actually express. Caller should be aware. RNA evidence (#259) resolves this properly.

4. The somatic caller didn't see the germline (the most important real-world case)

Standard somatic callers (MuTect, Strelka, VarScan, Mutect2) take tumor + normal BAMs and emit a VCF of variants present in tumor but absent in normal. Their ref field is the reference genome ref, not the patient's germline. The somatic VCF carries no information about what was germline.

This means germline-aware annotation needs the germline VCF as a separate input, with sub-cases:

Sub-case Available data Plan
a. Tumor + matched normal both available Two VCFs Standard input route 1; primary v1 target
b. Tumor + multi-sample VCF with normal column One VCF, multiple samples Input route 2; uses #267 sample machinery
c. Tumor only, no germline One VCF Fall back to reference-relative with a warning. v2: option to use population-frequency databases (gnomAD/ExAC) as a probabilistic germline; deferred
d. Pre-merged with tags One VCF with INFO=GERMLINE or similar Lower priority; easy to support via a classifier callback

5. Variants that overlap between germline and somatic VCFs

If the same locus appears in both:

  • Caller artifact: somatic caller failed to filter germline. Treat as germline; drop from somatic. Emit a warning so the user can investigate.
  • Loss of heterozygosity (LOH): germline het, tumor lost the ref allele → tumor is hom alt for the germline variant. The "somatic" is really a zygosity change. v1: treat as germline-applied with a note; emit special LOH flag on the effect.
  • Same alt by coincidence (Li-Fraumeni-style hotspot): extraordinarily rare except in cancer-predisposition genes. v1: warn and treat as germline.

6. Caller-level normalization differences

Two real failure modes that look similar to ref-mismatch but aren't:

  • MNV vs split SNVs: somatic VCF reports a 2bp substitution as a single MNV; germline VCF splits the same biological event into two SNVs. Variants don't match by position equality.
  • Indel left-alignment: germline VCF normalized with bcftools norm, somatic VCF didn't (or vice versa). Same indel reported at different positions.

Mitigation: pre-normalize both VCFs with the same tool before loading; document this. v1 doesn't do automatic normalization — users get errors or wrong results, and our docs say "normalize first."

7. Reference-genome mismatch between somatic and germline VCFs

Hard error by default. Sample case: germline called against GRCh37 (hg19), somatic against GRCh38. Coordinates don't match; results are silently garbage. We should refuse to proceed.

Opt-out: validate_reference=False for users who know what they're doing (e.g., they've lifted over one VCF into the other build).

8. Coverage gaps in germline

Positions where germline wasn't sequenced (low/no coverage in normal) → unknown germline. We can't tell if the position is ref/ref or unsequenced. v1: treat unsequenced as ref/ref (the same default the somatic caller used). Document the limitation. Future work could ingest coverage tracks (BedGraph) and emit "unknown" effects in low-coverage regions.

9. Compound interactions across haplotypes

Two germline variants in the same codon, plus a somatic variant — phase between all three matters. With three diploid variants there are up to 8 phase configurations. v1 emits the possibility set; if it's intractable (>8 hypotheses, configurable), we emit a TooManyHypotheses warning effect and require explicit phase data.

10. Hemizygous chromosomes / chrM

Single-allele calls have no phase ambiguity. v1: detect from Genotype.is_haploid (already implemented as part of #267) and skip the hypothesis enumeration. ChrM has its own genetic code (mitochondrial codon table) — out of scope for v1; flagged as a known limitation.

11. Tumor heterogeneity / subclonal somatic

Some somatic variants are present in only a subset of tumor cells. The "patient germline" model assumes a uniform background, which is true for the germline but not for low-frequency somatic. v1: ignores this; treats every somatic variant as if it occurs in 100% of cells. Documented limitation; future work could use AF (allele frequency) from the somatic VCF to weight effects.

12. CNV / copy number changes

Not modeled. Affects effective dosage of variants but not the per-haplotype protein sequence. Documented as out of scope.


Input routes

Five concrete shapes a user might come in with:

# Input v1? Notes
1 Two VCFs: somatic + germline ✅ primary Most common pipeline output
2 One multi-sample VCF (tumor + normal columns) User specifies which sample is germline; uses #267 sample machinery
3 Tumor VCF only, no germline ⚠️ fallback Falls through to reference-relative with a warning. Population-frequency germline is v2
4 Tumor VCF + Isovar/Exacto RNA evidence Different shape entirely — assembled contigs ARE the answer. Belongs with #260 / #271
5 Pre-tagged single VCF (INFO=GERMLINE) ⚠️ low priority Easy via a germline_classifier callback; not v1 critical

Algorithms (v1)

Top-level loop

For each somatic variant V_s in collection:
    For each transcript T overlapping V_s:
        Identify germline variants V_g_set in window(V_s, T)
        If V_g_set is empty:
            Emit reference-relative effect; germline_aware = NoGermlineInWindow
        Else:
            Validate phase data, hemizygosity, etc.
            phase_hypotheses = enumerate_phase(V_s, V_g_set, phase_data)
            For each hypothesis H:
                germline_transcript_H = MutantTranscript.apply(T, V_g_set, H)
                somatic_effect_H = predict_against(V_s, germline_transcript_H)
                reference_effect = predict_against(V_s, T)  # cached
                Build GermlineAwareEffect(reference_effect, somatic_effect_H, V_g_set, H)
            Emit canonical (highest-confidence hypothesis) + alternatives

Window selection

Effect site Window
Coding (in-exon) Same codon (3bp around variant)
Splice-adjacent (within 6bp of exon-intron boundary) Splice signal region (12bp covering MAG|GURAGU and YAG|R)
Frameshift candidate Same codon AND any germline frameshifts upstream in the same exon (because they shift the reading frame for the somatic)
StopLoss / extension into UTR Same codon AND germline variants in the proximal 3' UTR

All windows configurable; defaults documented and justified.

Phase enumeration

def enumerate_phase(somatic_variant, germline_variants, phase_data):
    if all variants are hemizygous: return [single_haplotype]
    if all variants share a phase set: return [resolved_haplotypes]
    n = len(germline_variants) + 1
    if 2^n > MAX_HYPOTHESES (default 8):
        emit TooManyHypothesesWarning
        return [reference_relative_only]
    # Generate all 2^n cis/trans assignments
    return enumerate_assignments(...)

Splice signal recomputation

def germline_aware_splice_classify(variant, transcript, germline_in_splice_window):
    patient_signal = build_patient_splice_signal(transcript, germline_in_splice_window)
    if patient_signal is non_canonical:
        # Germline already disrupted; somatic disruption is downgraded
        return downgrade(reference_classification, reason=GERMLINE_DISRUPTED)
    return reference_classification  # patient's signal matches canonical, treat normally

LOH detection

def detect_loh(somatic_variant, germline_variants):
    for V_g in germline_variants:
        if V_g.position == V_s.position and V_g.alt == V_s.alt:
            if germline_zygosity(V_g) == HETEROZYGOUS:
                return LOH(germline_variant=V_g)
    return None

Design decisions

Q1. Phase-unknown default

Decision: emit one canonical reference-relative effect (today's behavior) plus a germline_aware field carrying the most-likely germline-relative effect AND a list of alternatives.

effect = somatic.effects(germline=germline)[0]

effect.short_description           # "p.A100T" — reference-relative, today's behavior, unchanged
effect.germline_aware              # GermlineAwareEffect | None
effect.germline_aware.short_description  # "p.V100I" — patient-specific (one most-likely hypothesis)
effect.germline_aware.alternatives # [GermlineAwareEffect, ...] — other phase hypotheses
effect.germline_aware.phase        # cis / trans / unknown / irrelevant

Rationale: back-compat (existing code reading short_description keeps working) and possibility-set support (new code can iterate alternatives).

Q2. Where does germline= live?

Decision: layered.

  • Lowest level (annotator interface from Foundational refactor: MutantTranscript abstraction #271): EffectAnnotator.annotate_on_transcript(variant, transcript, context) where context: AnnotationContext carries germline_variants_in_window, phase data, etc. The annotator decides what to do with it.
  • Mid level: Variant.effects(germline=...) and VariantCollection.effects(germline=...) — convenience kwargs that build the context.
  • High level: a context manager with varcode.use_germline(germline_vc): ... for batch / scripting.

This composes with #271's annotator pluggability. Custom annotators (Isovar's, Exacto's) can ignore germline= because they have their own context (assembled contigs already incorporate germline).

Q3. Germline window

Decision: per-effect-site defaults (table above), all configurable. Default values are concrete and documented; users who need different windows pass germline_window kwargs.

Q4. Output classes

Decision: new GermlineAwareEffect wrapper attached as .germline_aware on existing Effect objects. Wrapper, not extension. Reasoning: keeps existing Effect classes unchanged (no breaking API surface), keeps the germline complexity contained, lets .germline_aware be None for the common no-germline case.

Q5. Reference-genome mismatch

Decision: hard error. validate_reference=False opt-out for users who explicitly know they've lifted over.


Possible failure modes

Failure Detection Response
Reference build mismatch Compare genome.reference_name of both VCFs Hard error (validate_reference=True default)
Empty germline VCF Check len > 0 Warn loudly (likely wrong file)
Sparse germline (no overlap with somatic) Check intersection Warn (coverage issue or wrong file)
Same variant in both VCFs Position/alt equality scan Warn; treat as germline; check for LOH
Conflicting alts at same position Different alts at same position Warn; emit effect with ConflictingGermline flag
Multi-allelic split mismatch Per-position alt list comparison Warn; require manual reconciliation
Sample name not in germline VCF Lookup against vc.samples SampleNotFoundError (already exists from #267)
Phase hypothesis explosion (>N variants in window) Count enumeration size Emit TooManyHypotheses warning effect; fall back to reference-relative
Germline variant disrupts splice in the variant's transcript Check splice signal post-germline Emit SplicePotentiallyDisrupted flag on the GermlineAwareEffect; mark prediction as low confidence
Germline frameshift upstream of somatic Reading-frame check Emit possibility set: somatic on the frameshifted haplotype produces different effects than on the unshifted one
Mitochondrial variant transcript.contig in ('MT', 'M', 'chrM') Skip germline-aware (mitochondrial codon table differs); reference-relative only with note

Acceptance criteria for v1

  1. Two-VCF input route works (input route 1) — tumor + matched normal.
  2. Multi-sample VCF route works (input route 2) — germline=("vcf_path", sample_name).
  3. SNV in same codon as germline SNV produces correct phase-aware possibility set when unphased; correct single answer when phased.
  4. SNV in a splice signal where germline modified the same signal produces a downgraded prediction with the right severity.
  5. LOH detection: when somatic and germline both have the same alt and germline was het, emit LOH flag.
  6. Reference build mismatch produces a hard error by default, opt-out works.
  7. All existing reference-relative tests still pass (back-compat).
  8. Documentation page (docs/germline.md) explaining setup, input routes, possibility sets, phasing implications for short vs long read.

Sub-issues

The work decomposes as follows. Each sub-issue gets its own ticket; they may land in different PRs but share the v1 acceptance criteria above.

Dependencies

Out of scope (for v1)

  • Population-frequency-based germline inference (input route 3)
  • Trio joint analysis (use one germline at a time)
  • Tumor heterogeneity / subclonal somatic
  • CNV-aware dosage modeling
  • Mitochondrial codon table
  • Automatic VCF normalization
  • Compound-het / clinical interpretation (separate analysis layer on top)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions