Skip to content

[#271 stage 3 planning] SequenceDiffAnnotator design, naming, and reordered staging #304

@iskandr

Description

@iskandr

Planning issue for the next stages of #271, drafted after stages 1 (data model + Protocol + LegacyEffectAnnotator) and 2 (apply_variant_to_transcript) landed. Locks in naming, outlines the classifier algorithm, and proposes a staging reorder.

Naming

Proposed: rename the Protocol and classes to drop the redundant "Effect" prefix that I added in stage 1.

Current (2.5.0) Proposed
varcode.annotators.EffectAnnotator (Protocol) varcode.annotators.Annotator
varcode.annotators.LegacyEffectAnnotator varcode.annotators.LegacyAnnotator
(not yet) varcode.annotators.SequenceDiffAnnotator

Rationale:

  • The module is already varcode.annotators; EffectAnnotator re-states the concept.
  • The issue spec uses path-style identifiers varcode.annotators.legacy / varcode.annotators.sequence_diff — the class names should read cleanly next to them.
  • Cheaper to rename now (one public class + a few tests) than after a second class lands and doubles the migration.

Back-compat: keep EffectAnnotator and LegacyEffectAnnotator as aliases for one minor version, deprecate via docstring, remove in the version after.

Alternative: leave names as-is. Slightly verbose but zero migration cost. Open to either — flagging so the choice is deliberate.

Will legacy stay around?

Yes. The LegacyAnnotator is the existing predict_variant_effect_on_transcript path already wrapped as an object (stage 1). It stays registered and available after sequence_diff becomes the default, because:

  1. Third-party pipelines pin varcode versions expecting byte-for-byte stable output.
  2. A/B comparison between annotators requires both to coexist.
  3. Removing legacy is a separate decision with its own deprecation cycle.

So the question isn't "replace legacy with sequence_diff" — it's "add a second annotator behind the same interface, let users opt into either."

SequenceDiffAnnotator algorithm

Skeleton:

class SequenceDiffAnnotator:
    name = \"sequence_diff\"
    supports = frozenset({\"snv\", \"indel\", \"mnv\"})

    def annotate_on_transcript(self, variant, transcript):
        # 1. Gate: non-coding / incomplete transcripts go straight to legacy.
        if not transcript.is_protein_coding:
            return NoncodingTranscript(variant, transcript)
        if not transcript.complete:
            return IncompleteTranscript(variant, transcript)

        # 2. Fast path (shared with legacy): trivial single-codon SNV.
        if _is_trivial_point_variant(variant, transcript):
            return _fast_path_annotate(variant, transcript)

        # 3. Slow path: materialize and diff.
        mt = apply_variant_to_transcript(variant, transcript)
        if mt is None:
            # Intronic / splice-junction / UTR / ref-mismatch.
            # Delegate to legacy for these cases at this stage; later
            # PRs extend apply_variant_to_transcript to cover them.
            return LegacyAnnotator().annotate_on_transcript(variant, transcript)

        return _classify_from_protein_diff(mt, variant, transcript)

_classify_from_protein_diff is the new algorithmic work. It takes the reference protein and the mutant protein and classifies the change. Pseudocode:

def _classify_from_protein_diff(mt, variant, transcript):
    ref_protein = str(transcript.protein_sequence)
    mut_protein = mt.mutant_protein_sequence  # may include stop-not-reached case

    # Trivial cases
    if ref_protein == mut_protein:
        return Silent(...)

    # Trim shared prefix/suffix to find the minimal edit window
    ref_delta, alt_delta, prefix, suffix = trim_shared_flanking_strings(
        ref_protein, mut_protein)

    aa_offset = len(prefix)
    is_insertion = len(ref_delta) == 0
    is_deletion = len(alt_delta) == 0
    is_simple_sub = len(ref_delta) == len(alt_delta) == 1

    # Start loss: mutation changes aa_offset 0, or the mutant protein
    # does not start with M.
    if aa_offset == 0 and not mut_protein.startswith(\"M\"):
        return StartLoss(...)

    # Stop loss: ref ends at a stop but the mutant keeps translating
    # through what used to be 3' UTR.
    if aa_offset + len(ref_delta) == len(ref_protein) and len(mut_protein) > len(ref_protein):
        return StopLoss(...)

    # Premature stop: mutant is shorter because we hit a stop codon
    # before the reference's stop.
    if len(mut_protein) < len(ref_protein) and aa_offset + len(alt_delta) == len(mut_protein):
        return PrematureStop(...)

    # Frameshift detection from total_length_delta
    if mt.total_length_delta % 3 != 0:
        # Frameshift (indel not divisible by 3).
        # If alt_delta is empty, FrameShiftTruncation; else FrameShift.
        ...

    # In-frame
    if is_simple_sub:
        return Substitution(...)
    if is_insertion:
        return Insertion(...)
    if is_deletion:
        return Deletion(...)
    return ComplexSubstitution(...)

Known edge cases / open questions:

  1. AlternateStartCodon. Today's legacy code has a special case when the first codon changes M→M (synonymous between ATG and TTG etc.). Under diff semantics the proteins are identical (both start with M), which would classify as Silent. Do we keep AlternateStartCodon or collapse into Silent+warning? Leaning keep — it's informative for researchers.
  2. 3' UTR readthrough on StopLoss. The legacy code translates into the 3' UTR when the stop is disrupted. apply_variant_to_transcript doesn't currently do that — mutant_protein_sequence stops at the first in-CDS stop. Need to extend the builder to optionally include 3' UTR translation when the stop codon is disrupted.
  3. 5' UTR start creation. If a mutation creates a new upstream ATG in the 5' UTR, legacy doesn't catch it (returns FivePrimeUTR). Sequence-diff could in principle detect the shifted start, but that's an enhancement, not a regression. Keep out of scope for stage 6.
  4. Splice effects. Sequence-diff doesn't emit splice classes (SpliceDonor, ExonicSpliceSite, etc.) — those still come from the offset-based classifier since they're about DNA position, not protein diff. Legacy stays responsible for splice classification. SequenceDiffAnnotator should call the splice-check step explicitly and only run protein-diff when the variant is purely coding.

Proposed staging reorder

My original plan was 3 (algorithm) → 4 (per-call kwarg + fast path) → 5 (provenance) → 6 (splice_outcomes rewrite). Problem: stage 3 is the biggest chunk of new code and also the only one with user-visible behaviour change. Better to build the plumbing first so the algorithm drops into a well-tested socket.

Revised order:

Stage Scope Risk Depends on
3a Rename EffectAnnotator/LegacyEffectAnnotator -> Annotator/LegacyAnnotator; keep aliases + deprecate Low
3b Per-call Variant.effects(annotator=...) kwarg + VariantCollection.effects(annotator=...) + with varcode.use_annotator(...) context manager. Default stays \"legacy\". Low 3a
3c EffectCollection provenance fields (annotator, annotator_version, varcode_version, reference, annotated_at). to_csv / to_json header metadata + from_csv round-trip. Works with just legacy. Medium 3b
3d Extract shared fast-path SNV helper from predict_in_frame_coding_effect. Legacy routes single-codon SNVs through it; hook is ready for sequence_diff. Medium 3c (not strictly — can parallelize)
3e SequenceDiffAnnotator implementation: the _classify_from_protein_diff algorithm, the splice-check gate, extended apply_variant_to_transcript covering 3'UTR readthrough. Byte-for-byte match with legacy on a curated test corpus before defaulting. High 3a–3d
3f Flip default to sequence_diff, keep legacy available. Separate PR so the default flip is reviewable on its own. Medium 3e + validation corpus
3g Rewrite splice_outcomes.py to use MutantTranscript instead of the ad-hoc exon-skipping and boundary-codon helpers. Closes the #262 prototype loop. Medium Can parallelize with 3d onward; only depends on stage 2 (apply_variant_to_transcript) which already landed

This puts four low/medium-risk PRs (3a–3d) before the one high-risk PR (3e). By the time the algorithm lands, the plumbing has been exercised in production on the default path for multiple versions.

Validation corpus for 3e

Before flipping the default (3f), we need evidence that sequence_diff matches legacy on real-world data. Proposal:

  • Pull a curated set of ~500 variants from existing test files (somatic_hg19_14muts.vcf, the BRCA1/CFTR fixtures, the splice-site corpus, the mitochondrial cases from Mitochondrial variants are translated with the wrong codon table #294).
  • Run both annotators. Diff Effect class + short_description.
  • Any mismatch is a gate: either a legacy bug sequence_diff fixed (document and accept), or a sequence_diff bug (fix before merging).

Open questions for you

  1. Rename yes or no? (Annotator vs EffectAnnotator.)
  2. AlternateStartCodon: keep as a class under sequence-diff, or collapse into Silent?
  3. Should 3d (fast-path extraction) happen before or after 3e (algorithm)? My bias is before — easier to exercise on legacy first — but arguably the algorithm spec reveals what the fast path needs to do.
  4. 3g (splice_outcomes rewrite) — defer to its own tracking issue, or keep in the Foundational refactor: MutantTranscript abstraction #271 scope?

Once these are settled I'll file per-stage PRs against this plan.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions