You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Today varcode predicts variant effects against the reference genome. For somatic-variant analysis in a real patient, the biologically correct "before" state is the patient's germline — not the reference. This umbrella tracks the work to make varcode's effect prediction germline-aware.
The work is large enough and gnarly enough that it lives as an umbrella with focused sub-issues. This issue captures the master plan, design decisions, complications, and acceptance criteria. Sub-issues are listed at the bottom.
Status: planning. Do not start implementation until #271 (foundational MutantTranscript data model) lands; this work composes naturally on top of it and would otherwise duplicate logic that #271 will rewrite.
Why "reference-relative" is wrong
Reference codon GCC (Ala). Patient is heterozygous for a germline V_g changing position 2 G→T:
Haplotype A: GCC (Ala)
Haplotype B: GTC (Val)
Somatic V_s changes position 1 G→A. The somatic effect depends on which haplotype it landed on:
In cis with V_g (haplotype B): codon was GTC, becomes ATC (Ile) → Val→Ile
In trans with V_g (haplotype A): codon was GCC, becomes ACC (Thr) → Ala→Thr
Today's varcode reports Ala→Thr (reference-relative). That's wrong on haplotype B. Without phase information, the honest output is a set {Val→Ile, Ala→Thr}; with phase, the set collapses to one. This pattern repeats wherever germline and somatic share genomic context: codons, splice signals, reading frame.
Complications and how each affects the design
1. Phasing changes the answer
When germline and somatic share a codon (or a splice region), phase determines the outcome. Three phase states matter:
Variants on hemizygous chromosomes (chrX/Y in males, chrM)
Single haplotype; phase is trivial
Unphased
/ GT separator, no PS
Emit possibility set per phase hypothesis
Read-length implications (docs only — same code path):
Short-read data (~150bp Illumina): read-backed phasing only resolves variants within a paired-end insert (~500bp). For most genes, somatic vs germline phase across longer distances is not available without statistical phasing. Most tumor/normal pairs in the wild are short-read; expect frequent unphased + same-codon situations → possibility sets are the common case.
Long-read data (PacBio HiFi, ONT): phasing typically resolved across whole genes. If the user ran WhatsHap/HapCUT2 on long reads, expect PS tags for almost all variants in coding regions; possibility sets become rare.
Assembly-based phasing (hifiasm, Verkko, Exacto): produces full haplotype sequences. Strictly richer than per-position phase. Belongs with Add loader for Exacto output formats #260 (Exacto loader) — assembled haplotypes flow into varcode as MutantTranscript objects directly, bypassing the per-variant phasing logic.
For varcode's API, the read-length distinction is invisible — what matters is whether PS tags are present. Documentation should explain this so users understand why their short-read pipeline produces possibility sets while a long-read pipeline doesn't.
2. Germline can break splice signals
Patient has germline at -3 of an exon (the M of MAG donor). Cases:
Germline-canonical: G→A makes MAG → AAG (still M=A). Somatic at -1 (G→T) makes AAG → AAT (broken). Real splice disruption.
Germline-already-non-canonical: germline at the same -3 makes MAG → TAG (T is not M=A/C). Site was already non-canonical before the somatic. Somatic disruption is irrelevant.
Germline-fully-disruptive: germline at -1 already broke the donor. Patient's transcript already uses a cryptic splice or skips the exon. Somatic in the same area is operating on a different effective transcript than reference.
Implications:
Germline-aware splice classification needs to rebuild the patient's splice signal by applying germline first, then run the sequence-aware classifier from 2.0.0
When germline disrupts splice signals, the patient's transcript may not even include the exon containing the somatic variant. Without RNA evidence (Incorporate RNA-level evidence for variant effects #259), we can't know what the patient's actual transcript looks like
For v1: use canonical splicing as the assumption; flag when germline disrupts splice signals so the user knows the prediction may be unreliable
3. Some germline variants are invisible in the protein
A germline variant in an intron that is far from any splice site has no protein impact and shouldn't enter germline-aware prediction at all. A germline variant in an exon that gets skipped (due to another germline variant) has no protein impact either — the exon isn't in the patient's transcript.
For v1, we apply all germline variants in the window regardless of expressed status. This is conservative — we may mention germline context that the patient doesn't actually express. Caller should be aware. RNA evidence (#259) resolves this properly.
4. The somatic caller didn't see the germline (the most important real-world case)
Standard somatic callers (MuTect, Strelka, VarScan, Mutect2) take tumor + normal BAMs and emit a VCF of variants present in tumor but absent in normal. Their ref field is the reference genome ref, not the patient's germline. The somatic VCF carries no information about what was germline.
This means germline-aware annotation needs the germline VCF as a separate input, with sub-cases:
Fall back to reference-relative with a warning. v2: option to use population-frequency databases (gnomAD/ExAC) as a probabilistic germline; deferred
d. Pre-merged with tags
One VCF with INFO=GERMLINE or similar
Lower priority; easy to support via a classifier callback
5. Variants that overlap between germline and somatic VCFs
If the same locus appears in both:
Caller artifact: somatic caller failed to filter germline. Treat as germline; drop from somatic. Emit a warning so the user can investigate.
Loss of heterozygosity (LOH): germline het, tumor lost the ref allele → tumor is hom alt for the germline variant. The "somatic" is really a zygosity change. v1: treat as germline-applied with a note; emit special LOH flag on the effect.
Same alt by coincidence (Li-Fraumeni-style hotspot): extraordinarily rare except in cancer-predisposition genes. v1: warn and treat as germline.
6. Caller-level normalization differences
Two real failure modes that look similar to ref-mismatch but aren't:
MNV vs split SNVs: somatic VCF reports a 2bp substitution as a single MNV; germline VCF splits the same biological event into two SNVs. Variants don't match by position equality.
Indel left-alignment: germline VCF normalized with bcftools norm, somatic VCF didn't (or vice versa). Same indel reported at different positions.
Mitigation: pre-normalize both VCFs with the same tool before loading; document this. v1 doesn't do automatic normalization — users get errors or wrong results, and our docs say "normalize first."
7. Reference-genome mismatch between somatic and germline VCFs
Hard error by default. Sample case: germline called against GRCh37 (hg19), somatic against GRCh38. Coordinates don't match; results are silently garbage. We should refuse to proceed.
Opt-out: validate_reference=False for users who know what they're doing (e.g., they've lifted over one VCF into the other build).
8. Coverage gaps in germline
Positions where germline wasn't sequenced (low/no coverage in normal) → unknown germline. We can't tell if the position is ref/ref or unsequenced. v1: treat unsequenced as ref/ref (the same default the somatic caller used). Document the limitation. Future work could ingest coverage tracks (BedGraph) and emit "unknown" effects in low-coverage regions.
9. Compound interactions across haplotypes
Two germline variants in the same codon, plus a somatic variant — phase between all three matters. With three diploid variants there are up to 8 phase configurations. v1 emits the possibility set; if it's intractable (>8 hypotheses, configurable), we emit a TooManyHypotheses warning effect and require explicit phase data.
10. Hemizygous chromosomes / chrM
Single-allele calls have no phase ambiguity. v1: detect from Genotype.is_haploid (already implemented as part of #267) and skip the hypothesis enumeration. ChrM has its own genetic code (mitochondrial codon table) — out of scope for v1; flagged as a known limitation.
11. Tumor heterogeneity / subclonal somatic
Some somatic variants are present in only a subset of tumor cells. The "patient germline" model assumes a uniform background, which is true for the germline but not for low-frequency somatic. v1: ignores this; treats every somatic variant as if it occurs in 100% of cells. Documented limitation; future work could use AF (allele frequency) from the somatic VCF to weight effects.
12. CNV / copy number changes
Not modeled. Affects effective dosage of variants but not the per-haplotype protein sequence. Documented as out of scope.
Input routes
Five concrete shapes a user might come in with:
#
Input
v1?
Notes
1
Two VCFs: somatic + germline
✅ primary
Most common pipeline output
2
One multi-sample VCF (tumor + normal columns)
✅
User specifies which sample is germline; uses #267 sample machinery
3
Tumor VCF only, no germline
⚠️ fallback
Falls through to reference-relative with a warning. Population-frequency germline is v2
4
Tumor VCF + Isovar/Exacto RNA evidence
❌
Different shape entirely — assembled contigs ARE the answer. Belongs with #260 / #271
5
Pre-tagged single VCF (INFO=GERMLINE)
⚠️ low priority
Easy via a germline_classifier callback; not v1 critical
Algorithms (v1)
Top-level loop
For each somatic variant V_s in collection:
For each transcript T overlapping V_s:
Identify germline variants V_g_set in window(V_s, T)
If V_g_set is empty:
Emit reference-relative effect; germline_aware = NoGermlineInWindow
Else:
Validate phase data, hemizygosity, etc.
phase_hypotheses = enumerate_phase(V_s, V_g_set, phase_data)
For each hypothesis H:
germline_transcript_H = MutantTranscript.apply(T, V_g_set, H)
somatic_effect_H = predict_against(V_s, germline_transcript_H)
reference_effect = predict_against(V_s, T) # cached
Build GermlineAwareEffect(reference_effect, somatic_effect_H, V_g_set, H)
Emit canonical (highest-confidence hypothesis) + alternatives
Window selection
Effect site
Window
Coding (in-exon)
Same codon (3bp around variant)
Splice-adjacent (within 6bp of exon-intron boundary)
Splice signal region (12bp covering MAG|GURAGU and YAG|R)
Frameshift candidate
Same codon AND any germline frameshifts upstream in the same exon (because they shift the reading frame for the somatic)
StopLoss / extension into UTR
Same codon AND germline variants in the proximal 3' UTR
All windows configurable; defaults documented and justified.
Phase enumeration
def enumerate_phase(somatic_variant, germline_variants, phase_data):
if all variants are hemizygous: return [single_haplotype]
if all variants share a phase set: return [resolved_haplotypes]
n = len(germline_variants) + 1
if 2^n > MAX_HYPOTHESES (default 8):
emit TooManyHypothesesWarning
return [reference_relative_only]
# Generate all 2^n cis/trans assignments
return enumerate_assignments(...)
Splice signal recomputation
def germline_aware_splice_classify(variant, transcript, germline_in_splice_window):
patient_signal = build_patient_splice_signal(transcript, germline_in_splice_window)
if patient_signal is non_canonical:
# Germline already disrupted; somatic disruption is downgraded
return downgrade(reference_classification, reason=GERMLINE_DISRUPTED)
return reference_classification # patient's signal matches canonical, treat normally
LOH detection
def detect_loh(somatic_variant, germline_variants):
for V_g in germline_variants:
if V_g.position == V_s.position and V_g.alt == V_s.alt:
if germline_zygosity(V_g) == HETEROZYGOUS:
return LOH(germline_variant=V_g)
return None
Design decisions
Q1. Phase-unknown default
Decision: emit one canonical reference-relative effect (today's behavior) plus a germline_aware field carrying the most-likely germline-relative effect AND a list of alternatives.
Rationale: back-compat (existing code reading short_description keeps working) and possibility-set support (new code can iterate alternatives).
Q2. Where does germline= live?
Decision: layered.
Lowest level (annotator interface from Foundational refactor: MutantTranscript abstraction #271): EffectAnnotator.annotate_on_transcript(variant, transcript, context) where context: AnnotationContext carries germline_variants_in_window, phase data, etc. The annotator decides what to do with it.
Mid level: Variant.effects(germline=...) and VariantCollection.effects(germline=...) — convenience kwargs that build the context.
High level: a context manager with varcode.use_germline(germline_vc): ... for batch / scripting.
This composes with #271's annotator pluggability. Custom annotators (Isovar's, Exacto's) can ignore germline= because they have their own context (assembled contigs already incorporate germline).
Q3. Germline window
Decision: per-effect-site defaults (table above), all configurable. Default values are concrete and documented; users who need different windows pass germline_window kwargs.
Q4. Output classes
Decision: new GermlineAwareEffect wrapper attached as .germline_aware on existing Effect objects. Wrapper, not extension. Reasoning: keeps existing Effect classes unchanged (no breaking API surface), keeps the germline complexity contained, lets .germline_aware be None for the common no-germline case.
Q5. Reference-genome mismatch
Decision: hard error. validate_reference=False opt-out for users who explicitly know they've lifted over.
Summary
Today varcode predicts variant effects against the reference genome. For somatic-variant analysis in a real patient, the biologically correct "before" state is the patient's germline — not the reference. This umbrella tracks the work to make varcode's effect prediction germline-aware.
The work is large enough and gnarly enough that it lives as an umbrella with focused sub-issues. This issue captures the master plan, design decisions, complications, and acceptance criteria. Sub-issues are listed at the bottom.
Why "reference-relative" is wrong
Reference codon
GCC(Ala). Patient is heterozygous for a germline V_g changing position 2 G→T:GCC(Ala)GTC(Val)Somatic V_s changes position 1 G→A. The somatic effect depends on which haplotype it landed on:
Today's varcode reports Ala→Thr (reference-relative). That's wrong on haplotype B. Without phase information, the honest output is a set {Val→Ile, Ala→Thr}; with phase, the set collapses to one. This pattern repeats wherever germline and somatic share genomic context: codons, splice signals, reading frame.
Complications and how each affects the design
1. Phasing changes the answer
When germline and somatic share a codon (or a splice region), phase determines the outcome. Three phase states matter:
|GT separator +PStag (WhatsHap, HapCUT2, longshot, assembly tools)/GT separator, noPSRead-length implications (docs only — same code path):
PStags for almost all variants in coding regions; possibility sets become rare.MutantTranscriptobjects directly, bypassing the per-variant phasing logic.For varcode's API, the read-length distinction is invisible — what matters is whether
PStags are present. Documentation should explain this so users understand why their short-read pipeline produces possibility sets while a long-read pipeline doesn't.2. Germline can break splice signals
Patient has germline at -3 of an exon (the M of MAG donor). Cases:
Implications:
3. Some germline variants are invisible in the protein
A germline variant in an intron that is far from any splice site has no protein impact and shouldn't enter germline-aware prediction at all. A germline variant in an exon that gets skipped (due to another germline variant) has no protein impact either — the exon isn't in the patient's transcript.
For v1, we apply all germline variants in the window regardless of expressed status. This is conservative — we may mention germline context that the patient doesn't actually express. Caller should be aware. RNA evidence (#259) resolves this properly.
4. The somatic caller didn't see the germline (the most important real-world case)
Standard somatic callers (MuTect, Strelka, VarScan, Mutect2) take tumor + normal BAMs and emit a VCF of variants present in tumor but absent in normal. Their
reffield is the reference genome ref, not the patient's germline. The somatic VCF carries no information about what was germline.This means germline-aware annotation needs the germline VCF as a separate input, with sub-cases:
INFO=GERMLINEor similar5. Variants that overlap between germline and somatic VCFs
If the same locus appears in both:
LOHflag on the effect.6. Caller-level normalization differences
Two real failure modes that look similar to ref-mismatch but aren't:
bcftools norm, somatic VCF didn't (or vice versa). Same indel reported at different positions.Mitigation: pre-normalize both VCFs with the same tool before loading; document this. v1 doesn't do automatic normalization — users get errors or wrong results, and our docs say "normalize first."
7. Reference-genome mismatch between somatic and germline VCFs
Hard error by default. Sample case: germline called against GRCh37 (hg19), somatic against GRCh38. Coordinates don't match; results are silently garbage. We should refuse to proceed.
Opt-out:
validate_reference=Falsefor users who know what they're doing (e.g., they've lifted over one VCF into the other build).8. Coverage gaps in germline
Positions where germline wasn't sequenced (low/no coverage in normal) → unknown germline. We can't tell if the position is ref/ref or unsequenced. v1: treat unsequenced as ref/ref (the same default the somatic caller used). Document the limitation. Future work could ingest coverage tracks (BedGraph) and emit "unknown" effects in low-coverage regions.
9. Compound interactions across haplotypes
Two germline variants in the same codon, plus a somatic variant — phase between all three matters. With three diploid variants there are up to 8 phase configurations. v1 emits the possibility set; if it's intractable (>8 hypotheses, configurable), we emit a
TooManyHypotheseswarning effect and require explicit phase data.10. Hemizygous chromosomes / chrM
Single-allele calls have no phase ambiguity. v1: detect from
Genotype.is_haploid(already implemented as part of #267) and skip the hypothesis enumeration. ChrM has its own genetic code (mitochondrial codon table) — out of scope for v1; flagged as a known limitation.11. Tumor heterogeneity / subclonal somatic
Some somatic variants are present in only a subset of tumor cells. The "patient germline" model assumes a uniform background, which is true for the germline but not for low-frequency somatic. v1: ignores this; treats every somatic variant as if it occurs in 100% of cells. Documented limitation; future work could use
AF(allele frequency) from the somatic VCF to weight effects.12. CNV / copy number changes
Not modeled. Affects effective dosage of variants but not the per-haplotype protein sequence. Documented as out of scope.
Input routes
Five concrete shapes a user might come in with:
INFO=GERMLINE)germline_classifiercallback; not v1 criticalAlgorithms (v1)
Top-level loop
Window selection
All windows configurable; defaults documented and justified.
Phase enumeration
Splice signal recomputation
LOH detection
Design decisions
Q1. Phase-unknown default
Decision: emit one canonical reference-relative effect (today's behavior) plus a
germline_awarefield carrying the most-likely germline-relative effect AND a list of alternatives.Rationale: back-compat (existing code reading
short_descriptionkeeps working) and possibility-set support (new code can iterate alternatives).Q2. Where does
germline=live?Decision: layered.
EffectAnnotator.annotate_on_transcript(variant, transcript, context)wherecontext: AnnotationContextcarriesgermline_variants_in_window, phase data, etc. The annotator decides what to do with it.Variant.effects(germline=...)andVariantCollection.effects(germline=...)— convenience kwargs that build the context.with varcode.use_germline(germline_vc): ...for batch / scripting.This composes with #271's annotator pluggability. Custom annotators (Isovar's, Exacto's) can ignore
germline=because they have their own context (assembled contigs already incorporate germline).Q3. Germline window
Decision: per-effect-site defaults (table above), all configurable. Default values are concrete and documented; users who need different windows pass
germline_windowkwargs.Q4. Output classes
Decision: new
GermlineAwareEffectwrapper attached as.germline_awareon existing Effect objects. Wrapper, not extension. Reasoning: keeps existing Effect classes unchanged (no breaking API surface), keeps the germline complexity contained, lets.germline_awarebeNonefor the common no-germline case.Q5. Reference-genome mismatch
Decision: hard error.
validate_reference=Falseopt-out for users who explicitly know they've lifted over.Possible failure modes
genome.reference_nameof both VCFsConflictingGermlineflagvc.samplesSampleNotFoundError(already exists from #267)TooManyHypotheseswarning effect; fall back to reference-relativeSplicePotentiallyDisruptedflag on the GermlineAwareEffect; mark prediction as low confidencetranscript.contig in ('MT', 'M', 'chrM')Acceptance criteria for v1
germline=("vcf_path", sample_name).LOHflag.docs/germline.md) explaining setup, input routes, possibility sets, phasing implications for short vs long read.Sub-issues
The work decomposes as follows. Each sub-issue gets its own ticket; they may land in different PRs but share the v1 acceptance criteria above.
GermlineAwareEffectdata model +Phase/SpliceSignalStatusenums + integration withEffectCollectionDependencies
MutantTranscriptdata model) — strongly recommended as a prerequisite. Without it, this work duplicates logic that Foundational refactor: MutantTranscript abstraction #271 will rewrite.Out of scope (for v1)