[#271 stage 3 planning] SequenceDiffAnnotator design, naming, and reordered staging

Planning issue for the next stages of #271, drafted after stages 1 (data model + Protocol + LegacyEffectAnnotator) and 2 (apply_variant_to_transcript) landed. Locks in naming, outlines the classifier algorithm, and proposes a staging reorder.

## Naming

**Proposed**: rename the Protocol and classes to drop the redundant \"Effect\" prefix that I added in stage 1.

| Current (2.5.0) | Proposed |
|---|---|
| `varcode.annotators.EffectAnnotator` (Protocol) | `varcode.annotators.Annotator` |
| `varcode.annotators.LegacyEffectAnnotator` | `varcode.annotators.LegacyAnnotator` |
| *(not yet)* | `varcode.annotators.SequenceDiffAnnotator` |

Rationale:
- The module is already `varcode.annotators`; `EffectAnnotator` re-states the concept.
- The issue spec uses path-style identifiers `varcode.annotators.legacy` / `varcode.annotators.sequence_diff` — the class names should read cleanly next to them.
- Cheaper to rename now (one public class + a few tests) than after a second class lands and doubles the migration.

**Back-compat**: keep `EffectAnnotator` and `LegacyEffectAnnotator` as aliases for one minor version, deprecate via docstring, remove in the version after.

**Alternative**: leave names as-is. Slightly verbose but zero migration cost. Open to either — flagging so the choice is deliberate.

## Will legacy stay around?

Yes. The `LegacyAnnotator` is the existing `predict_variant_effect_on_transcript` path already wrapped as an object (stage 1). It stays registered and available after `sequence_diff` becomes the default, because:

1. Third-party pipelines pin varcode versions expecting byte-for-byte stable output.
2. A/B comparison between annotators requires both to coexist.
3. Removing legacy is a separate decision with its own deprecation cycle.

So the question isn't \"replace legacy with sequence_diff\" — it's \"add a second annotator behind the same interface, let users opt into either.\"

## SequenceDiffAnnotator algorithm

Skeleton:

```python
class SequenceDiffAnnotator:
    name = \"sequence_diff\"
    supports = frozenset({\"snv\", \"indel\", \"mnv\"})

    def annotate_on_transcript(self, variant, transcript):
        # 1. Gate: non-coding / incomplete transcripts go straight to legacy.
        if not transcript.is_protein_coding:
            return NoncodingTranscript(variant, transcript)
        if not transcript.complete:
            return IncompleteTranscript(variant, transcript)

        # 2. Fast path (shared with legacy): trivial single-codon SNV.
        if _is_trivial_point_variant(variant, transcript):
            return _fast_path_annotate(variant, transcript)

        # 3. Slow path: materialize and diff.
        mt = apply_variant_to_transcript(variant, transcript)
        if mt is None:
            # Intronic / splice-junction / UTR / ref-mismatch.
            # Delegate to legacy for these cases at this stage; later
            # PRs extend apply_variant_to_transcript to cover them.
            return LegacyAnnotator().annotate_on_transcript(variant, transcript)

        return _classify_from_protein_diff(mt, variant, transcript)
```

**`_classify_from_protein_diff`** is the new algorithmic work. It takes the reference protein and the mutant protein and classifies the change. Pseudocode:

```python
def _classify_from_protein_diff(mt, variant, transcript):
    ref_protein = str(transcript.protein_sequence)
    mut_protein = mt.mutant_protein_sequence  # may include stop-not-reached case

    # Trivial cases
    if ref_protein == mut_protein:
        return Silent(...)

    # Trim shared prefix/suffix to find the minimal edit window
    ref_delta, alt_delta, prefix, suffix = trim_shared_flanking_strings(
        ref_protein, mut_protein)

    aa_offset = len(prefix)
    is_insertion = len(ref_delta) == 0
    is_deletion = len(alt_delta) == 0
    is_simple_sub = len(ref_delta) == len(alt_delta) == 1

    # Start loss: mutation changes aa_offset 0, or the mutant protein
    # does not start with M.
    if aa_offset == 0 and not mut_protein.startswith(\"M\"):
        return StartLoss(...)

    # Stop loss: ref ends at a stop but the mutant keeps translating
    # through what used to be 3' UTR.
    if aa_offset + len(ref_delta) == len(ref_protein) and len(mut_protein) > len(ref_protein):
        return StopLoss(...)

    # Premature stop: mutant is shorter because we hit a stop codon
    # before the reference's stop.
    if len(mut_protein) < len(ref_protein) and aa_offset + len(alt_delta) == len(mut_protein):
        return PrematureStop(...)

    # Frameshift detection from total_length_delta
    if mt.total_length_delta % 3 != 0:
        # Frameshift (indel not divisible by 3).
        # If alt_delta is empty, FrameShiftTruncation; else FrameShift.
        ...

    # In-frame
    if is_simple_sub:
        return Substitution(...)
    if is_insertion:
        return Insertion(...)
    if is_deletion:
        return Deletion(...)
    return ComplexSubstitution(...)
```

**Known edge cases / open questions**:

1. **AlternateStartCodon**. Today's legacy code has a special case when the first codon changes M→M (synonymous between ATG and TTG etc.). Under diff semantics the proteins are identical (both start with M), which would classify as Silent. Do we keep AlternateStartCodon or collapse into Silent+warning? Leaning keep — it's informative for researchers.
2. **3' UTR readthrough on StopLoss**. The legacy code translates into the 3' UTR when the stop is disrupted. `apply_variant_to_transcript` doesn't currently do that — `mutant_protein_sequence` stops at the first in-CDS stop. Need to extend the builder to optionally include 3' UTR translation when the stop codon is disrupted.
3. **5' UTR start creation**. If a mutation creates a new upstream ATG in the 5' UTR, legacy doesn't catch it (returns FivePrimeUTR). Sequence-diff could in principle detect the shifted start, but that's an enhancement, not a regression. Keep out of scope for stage 6.
4. **Splice effects**. Sequence-diff doesn't emit splice classes (SpliceDonor, ExonicSpliceSite, etc.) — those still come from the offset-based classifier since they're about DNA position, not protein diff. Legacy stays responsible for splice classification. `SequenceDiffAnnotator` should call the splice-check step explicitly and only run protein-diff when the variant is purely coding.

## Proposed staging reorder

My original plan was 3 (algorithm) → 4 (per-call kwarg + fast path) → 5 (provenance) → 6 (splice_outcomes rewrite). Problem: stage 3 is the biggest chunk of new code and also the only one with user-visible behaviour change. Better to build the plumbing first so the algorithm drops into a well-tested socket.

**Revised order**:

| Stage | Scope | Risk | Depends on |
|---|---|---|---|
| **3a** | Rename `EffectAnnotator`/`LegacyEffectAnnotator` -> `Annotator`/`LegacyAnnotator`; keep aliases + deprecate | Low | — |
| **3b** | Per-call `Variant.effects(annotator=...)` kwarg + `VariantCollection.effects(annotator=...)` + `with varcode.use_annotator(...)` context manager. Default stays `\"legacy\"`. | Low | 3a |
| **3c** | `EffectCollection` provenance fields (`annotator`, `annotator_version`, `varcode_version`, `reference`, `annotated_at`). `to_csv` / `to_json` header metadata + `from_csv` round-trip. Works with just legacy. | Medium | 3b |
| **3d** | Extract shared fast-path SNV helper from `predict_in_frame_coding_effect`. Legacy routes single-codon SNVs through it; hook is ready for sequence_diff. | Medium | 3c (not strictly — can parallelize) |
| **3e** | `SequenceDiffAnnotator` implementation: the `_classify_from_protein_diff` algorithm, the splice-check gate, extended `apply_variant_to_transcript` covering 3'UTR readthrough. Byte-for-byte match with legacy on a curated test corpus before defaulting. | High | 3a–3d |
| **3f** | Flip default to `sequence_diff`, keep legacy available. Separate PR so the default flip is reviewable on its own. | Medium | 3e + validation corpus |
| **3g** | Rewrite `splice_outcomes.py` to use `MutantTranscript` instead of the ad-hoc exon-skipping and boundary-codon helpers. Closes the #262 prototype loop. | Medium | Can parallelize with 3d onward; only depends on stage 2 (`apply_variant_to_transcript`) which already landed |

This puts four low/medium-risk PRs (3a–3d) before the one high-risk PR (3e). By the time the algorithm lands, the plumbing has been exercised in production on the default path for multiple versions.

## Validation corpus for 3e

Before flipping the default (3f), we need evidence that `sequence_diff` matches `legacy` on real-world data. Proposal:

- Pull a curated set of ~500 variants from existing test files (somatic_hg19_14muts.vcf, the BRCA1/CFTR fixtures, the splice-site corpus, the mitochondrial cases from #294).
- Run both annotators. Diff Effect class + `short_description`.
- Any mismatch is a gate: either a legacy bug sequence_diff fixed (document and accept), or a sequence_diff bug (fix before merging).

## Open questions for you

1. Rename yes or no? (`Annotator` vs `EffectAnnotator`.)
2. AlternateStartCodon: keep as a class under sequence-diff, or collapse into Silent?
3. Should 3d (fast-path extraction) happen before or after 3e (algorithm)? My bias is before — easier to exercise on legacy first — but arguably the algorithm spec reveals what the fast path needs to do.
4. 3g (splice_outcomes rewrite) — defer to its own tracking issue, or keep in the #271 scope?

Once these are settled I'll file per-stage PRs against this plan.

Current (2.5.0)	Proposed
`varcode.annotators.EffectAnnotator` (Protocol)	`varcode.annotators.Annotator`
`varcode.annotators.LegacyEffectAnnotator`	`varcode.annotators.LegacyAnnotator`
(not yet)	`varcode.annotators.SequenceDiffAnnotator`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[#271 stage 3 planning] SequenceDiffAnnotator design, naming, and reordered staging #304

Naming

Will legacy stay around?

SequenceDiffAnnotator algorithm

Proposed staging reorder

Validation corpus for 3e

Open questions for you

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Stage	Scope	Risk	Depends on
3a	Rename `EffectAnnotator`/`LegacyEffectAnnotator` -> `Annotator`/`LegacyAnnotator`; keep aliases + deprecate	Low	—
3b	Per-call `Variant.effects(annotator=...)` kwarg + `VariantCollection.effects(annotator=...)` + `with varcode.use_annotator(...)` context manager. Default stays `\"legacy\"`.	Low	3a
3c	`EffectCollection` provenance fields (`annotator`, `annotator_version`, `varcode_version`, `reference`, `annotated_at`). `to_csv` / `to_json` header metadata + `from_csv` round-trip. Works with just legacy.	Medium	3b
3d	Extract shared fast-path SNV helper from `predict_in_frame_coding_effect`. Legacy routes single-codon SNVs through it; hook is ready for sequence_diff.	Medium	3c (not strictly — can parallelize)
3e	`SequenceDiffAnnotator` implementation: the `_classify_from_protein_diff` algorithm, the splice-check gate, extended `apply_variant_to_transcript` covering 3'UTR readthrough. Byte-for-byte match with legacy on a curated test corpus before defaulting.	High	3a–3d
3f	Flip default to `sequence_diff`, keep legacy available. Separate PR so the default flip is reviewable on its own.	Medium	3e + validation corpus
3g	Rewrite `splice_outcomes.py` to use `MutantTranscript` instead of the ad-hoc exon-skipping and boundary-codon helpers. Closes the #262 prototype loop.	Medium	Can parallelize with 3d onward; only depends on stage 2 (`apply_variant_to_transcript`) which already landed

[#271 stage 3 planning] SequenceDiffAnnotator design, naming, and reordered staging #304

Description

Naming

Will legacy stay around?

SequenceDiffAnnotator algorithm

Proposed staging reorder

Validation corpus for 3e

Open questions for you

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions