You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Planning issue for the next stages of #271, drafted after stages 1 (data model + Protocol + LegacyEffectAnnotator) and 2 (apply_variant_to_transcript) landed. Locks in naming, outlines the classifier algorithm, and proposes a staging reorder.
Naming
Proposed: rename the Protocol and classes to drop the redundant "Effect" prefix that I added in stage 1.
Current (2.5.0)
Proposed
varcode.annotators.EffectAnnotator (Protocol)
varcode.annotators.Annotator
varcode.annotators.LegacyEffectAnnotator
varcode.annotators.LegacyAnnotator
(not yet)
varcode.annotators.SequenceDiffAnnotator
Rationale:
The module is already varcode.annotators; EffectAnnotator re-states the concept.
The issue spec uses path-style identifiers varcode.annotators.legacy / varcode.annotators.sequence_diff — the class names should read cleanly next to them.
Cheaper to rename now (one public class + a few tests) than after a second class lands and doubles the migration.
Back-compat: keep EffectAnnotator and LegacyEffectAnnotator as aliases for one minor version, deprecate via docstring, remove in the version after.
Alternative: leave names as-is. Slightly verbose but zero migration cost. Open to either — flagging so the choice is deliberate.
Will legacy stay around?
Yes. The LegacyAnnotator is the existing predict_variant_effect_on_transcript path already wrapped as an object (stage 1). It stays registered and available after sequence_diff becomes the default, because:
A/B comparison between annotators requires both to coexist.
Removing legacy is a separate decision with its own deprecation cycle.
So the question isn't "replace legacy with sequence_diff" — it's "add a second annotator behind the same interface, let users opt into either."
SequenceDiffAnnotator algorithm
Skeleton:
classSequenceDiffAnnotator:
name= \"sequence_diff\"
supports = frozenset({\"snv\", \"indel\", \"mnv\"})
defannotate_on_transcript(self, variant, transcript):
# 1. Gate: non-coding / incomplete transcripts go straight to legacy.ifnottranscript.is_protein_coding:
returnNoncodingTranscript(variant, transcript)
ifnottranscript.complete:
returnIncompleteTranscript(variant, transcript)
# 2. Fast path (shared with legacy): trivial single-codon SNV.if_is_trivial_point_variant(variant, transcript):
return_fast_path_annotate(variant, transcript)
# 3. Slow path: materialize and diff.mt=apply_variant_to_transcript(variant, transcript)
ifmtisNone:
# Intronic / splice-junction / UTR / ref-mismatch.# Delegate to legacy for these cases at this stage; later# PRs extend apply_variant_to_transcript to cover them.returnLegacyAnnotator().annotate_on_transcript(variant, transcript)
return_classify_from_protein_diff(mt, variant, transcript)
_classify_from_protein_diff is the new algorithmic work. It takes the reference protein and the mutant protein and classifies the change. Pseudocode:
def_classify_from_protein_diff(mt, variant, transcript):
ref_protein=str(transcript.protein_sequence)
mut_protein=mt.mutant_protein_sequence# may include stop-not-reached case# Trivial casesifref_protein==mut_protein:
returnSilent(...)
# Trim shared prefix/suffix to find the minimal edit windowref_delta, alt_delta, prefix, suffix=trim_shared_flanking_strings(
ref_protein, mut_protein)
aa_offset=len(prefix)
is_insertion=len(ref_delta) ==0is_deletion=len(alt_delta) ==0is_simple_sub=len(ref_delta) ==len(alt_delta) ==1# Start loss: mutation changes aa_offset 0, or the mutant protein# does not start with M.ifaa_offset==0andnotmut_protein.startswith(\"M\"):
returnStartLoss(...)
# Stop loss: ref ends at a stop but the mutant keeps translating# through what used to be 3' UTR.ifaa_offset+len(ref_delta) ==len(ref_protein) andlen(mut_protein) >len(ref_protein):
returnStopLoss(...)
# Premature stop: mutant is shorter because we hit a stop codon# before the reference's stop.iflen(mut_protein) <len(ref_protein) andaa_offset+len(alt_delta) ==len(mut_protein):
returnPrematureStop(...)
# Frameshift detection from total_length_deltaifmt.total_length_delta%3!=0:
# Frameshift (indel not divisible by 3).# If alt_delta is empty, FrameShiftTruncation; else FrameShift.
...
# In-frameifis_simple_sub:
returnSubstitution(...)
ifis_insertion:
returnInsertion(...)
ifis_deletion:
returnDeletion(...)
returnComplexSubstitution(...)
Known edge cases / open questions:
AlternateStartCodon. Today's legacy code has a special case when the first codon changes M→M (synonymous between ATG and TTG etc.). Under diff semantics the proteins are identical (both start with M), which would classify as Silent. Do we keep AlternateStartCodon or collapse into Silent+warning? Leaning keep — it's informative for researchers.
3' UTR readthrough on StopLoss. The legacy code translates into the 3' UTR when the stop is disrupted. apply_variant_to_transcript doesn't currently do that — mutant_protein_sequence stops at the first in-CDS stop. Need to extend the builder to optionally include 3' UTR translation when the stop codon is disrupted.
5' UTR start creation. If a mutation creates a new upstream ATG in the 5' UTR, legacy doesn't catch it (returns FivePrimeUTR). Sequence-diff could in principle detect the shifted start, but that's an enhancement, not a regression. Keep out of scope for stage 6.
Splice effects. Sequence-diff doesn't emit splice classes (SpliceDonor, ExonicSpliceSite, etc.) — those still come from the offset-based classifier since they're about DNA position, not protein diff. Legacy stays responsible for splice classification. SequenceDiffAnnotator should call the splice-check step explicitly and only run protein-diff when the variant is purely coding.
Proposed staging reorder
My original plan was 3 (algorithm) → 4 (per-call kwarg + fast path) → 5 (provenance) → 6 (splice_outcomes rewrite). Problem: stage 3 is the biggest chunk of new code and also the only one with user-visible behaviour change. Better to build the plumbing first so the algorithm drops into a well-tested socket.
EffectCollection provenance fields (annotator, annotator_version, varcode_version, reference, annotated_at). to_csv / to_json header metadata + from_csv round-trip. Works with just legacy.
Medium
3b
3d
Extract shared fast-path SNV helper from predict_in_frame_coding_effect. Legacy routes single-codon SNVs through it; hook is ready for sequence_diff.
Medium
3c (not strictly — can parallelize)
3e
SequenceDiffAnnotator implementation: the _classify_from_protein_diff algorithm, the splice-check gate, extended apply_variant_to_transcript covering 3'UTR readthrough. Byte-for-byte match with legacy on a curated test corpus before defaulting.
High
3a–3d
3f
Flip default to sequence_diff, keep legacy available. Separate PR so the default flip is reviewable on its own.
Medium
3e + validation corpus
3g
Rewrite splice_outcomes.py to use MutantTranscript instead of the ad-hoc exon-skipping and boundary-codon helpers. Closes the #262 prototype loop.
Medium
Can parallelize with 3d onward; only depends on stage 2 (apply_variant_to_transcript) which already landed
This puts four low/medium-risk PRs (3a–3d) before the one high-risk PR (3e). By the time the algorithm lands, the plumbing has been exercised in production on the default path for multiple versions.
Validation corpus for 3e
Before flipping the default (3f), we need evidence that sequence_diff matches legacy on real-world data. Proposal:
Run both annotators. Diff Effect class + short_description.
Any mismatch is a gate: either a legacy bug sequence_diff fixed (document and accept), or a sequence_diff bug (fix before merging).
Open questions for you
Rename yes or no? (Annotator vs EffectAnnotator.)
AlternateStartCodon: keep as a class under sequence-diff, or collapse into Silent?
Should 3d (fast-path extraction) happen before or after 3e (algorithm)? My bias is before — easier to exercise on legacy first — but arguably the algorithm spec reveals what the fast path needs to do.
Planning issue for the next stages of #271, drafted after stages 1 (data model + Protocol + LegacyEffectAnnotator) and 2 (apply_variant_to_transcript) landed. Locks in naming, outlines the classifier algorithm, and proposes a staging reorder.
Naming
Proposed: rename the Protocol and classes to drop the redundant "Effect" prefix that I added in stage 1.
varcode.annotators.EffectAnnotator(Protocol)varcode.annotators.Annotatorvarcode.annotators.LegacyEffectAnnotatorvarcode.annotators.LegacyAnnotatorvarcode.annotators.SequenceDiffAnnotatorRationale:
varcode.annotators;EffectAnnotatorre-states the concept.varcode.annotators.legacy/varcode.annotators.sequence_diff— the class names should read cleanly next to them.Back-compat: keep
EffectAnnotatorandLegacyEffectAnnotatoras aliases for one minor version, deprecate via docstring, remove in the version after.Alternative: leave names as-is. Slightly verbose but zero migration cost. Open to either — flagging so the choice is deliberate.
Will legacy stay around?
Yes. The
LegacyAnnotatoris the existingpredict_variant_effect_on_transcriptpath already wrapped as an object (stage 1). It stays registered and available aftersequence_diffbecomes the default, because:So the question isn't "replace legacy with sequence_diff" — it's "add a second annotator behind the same interface, let users opt into either."
SequenceDiffAnnotator algorithm
Skeleton:
_classify_from_protein_diffis the new algorithmic work. It takes the reference protein and the mutant protein and classifies the change. Pseudocode:Known edge cases / open questions:
apply_variant_to_transcriptdoesn't currently do that —mutant_protein_sequencestops at the first in-CDS stop. Need to extend the builder to optionally include 3' UTR translation when the stop codon is disrupted.SequenceDiffAnnotatorshould call the splice-check step explicitly and only run protein-diff when the variant is purely coding.Proposed staging reorder
My original plan was 3 (algorithm) → 4 (per-call kwarg + fast path) → 5 (provenance) → 6 (splice_outcomes rewrite). Problem: stage 3 is the biggest chunk of new code and also the only one with user-visible behaviour change. Better to build the plumbing first so the algorithm drops into a well-tested socket.
Revised order:
EffectAnnotator/LegacyEffectAnnotator->Annotator/LegacyAnnotator; keep aliases + deprecateVariant.effects(annotator=...)kwarg +VariantCollection.effects(annotator=...)+with varcode.use_annotator(...)context manager. Default stays\"legacy\".EffectCollectionprovenance fields (annotator,annotator_version,varcode_version,reference,annotated_at).to_csv/to_jsonheader metadata +from_csvround-trip. Works with just legacy.predict_in_frame_coding_effect. Legacy routes single-codon SNVs through it; hook is ready for sequence_diff.SequenceDiffAnnotatorimplementation: the_classify_from_protein_diffalgorithm, the splice-check gate, extendedapply_variant_to_transcriptcovering 3'UTR readthrough. Byte-for-byte match with legacy on a curated test corpus before defaulting.sequence_diff, keep legacy available. Separate PR so the default flip is reviewable on its own.splice_outcomes.pyto useMutantTranscriptinstead of the ad-hoc exon-skipping and boundary-codon helpers. Closes the #262 prototype loop.apply_variant_to_transcript) which already landedThis puts four low/medium-risk PRs (3a–3d) before the one high-risk PR (3e). By the time the algorithm lands, the plumbing has been exercised in production on the default path for multiple versions.
Validation corpus for 3e
Before flipping the default (3f), we need evidence that
sequence_diffmatcheslegacyon real-world data. Proposal:short_description.Open questions for you
AnnotatorvsEffectAnnotator.)Once these are settled I'll file per-stage PRs against this plan.