Problem
`_build_in_frame_exon_skip_effect` in `varcode/splice_outcomes.py` (PR #292) assumes the skipped exon begins and ends on codon boundaries when computing the resulting amino-acid deletion:
```python
aa_start = (exon_start_in_tx - cds_start_offset) // 3
n_aa_removed = exon_length // 3
aa_end = aa_start + n_aa_removed
aa_ref = str(transcript.protein_sequence[aa_start:aa_end])
```
When an exon actually starts mid-codon (i.e. the previous exon contributes 1 or 2 bases to the boundary codon), the arithmetic still removes the right number of codons (integer division truncates), but the boundary codon itself is not reconstructed. After the skip, the joined transcript's boundary codon is assembled from the last bases of the exon before the skip and the first bases of the exon after the skip, and it may translate to a different amino acid than either of the flanking reference codons.
Today the candidate reports the wrong `aa_ref` at the seam and doesn't express the boundary-codon substitution at all; the in-frame case collapses to a pure `Deletion`, when biologically it's a `Deletion` with a seam substitution.
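A minimal, self-contained sketch of the failure mode on a toy transcript. The exon sequences, the `translate` helper, and the hand-built codon table are all hypothetical illustrations, not varcode code; only the four lines of arithmetic are copied from the snippet above.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence up to (not including) the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy transcript (hypothetical): the CDS starts at transcript offset 0.
exon1 = "ATGGCCA"    # ATG GCC, plus 1 base of the boundary codon
exon2 = "AGGTTCCTG"  # 9 bases (in-frame), but it starts and ends mid-codon
exon3 = "ATGAATAA"   # completes the split codon, then GAA TAA

ref_protein = translate(exon1 + exon2 + exon3)  # "MAKVPDE"

# Arithmetic as currently written in the issue's snippet.
cds_start_offset = 0
exon_start_in_tx = len(exon1)
exon_length = len(exon2)
aa_start = (exon_start_in_tx - cds_start_offset) // 3
n_aa_removed = exon_length // 3
aa_end = aa_start + n_aa_removed
aa_ref = ref_protein[aa_start:aa_end]                           # "KVP"
naive_protein = ref_protein[:aa_start] + ref_protein[aa_end:]   # "MADE"

# What the spliced transcript actually encodes.
actual_protein = translate(exon1 + exon3)                       # "MANE"
seam_codon = exon1[-1:] + exon3[:2]                             # "AAT" -> N
```

In this toy case the pure-deletion model predicts `MADE`, while the joined transcript encodes `MANE`: the seam codon `AAT` (N) matches neither flanking reference codon (`AAG` = K, `GAT` = D).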
The out-of-frame skip path (`_build_out_of_frame_exon_skip_effect`) is less affected because it retranslates the full post-skip cDNA to first stop, which naturally picks up the new boundary codon — but it uses an ad-hoc `_ExonSkipFrameshiftEffect` shim rather than varcode's standard `FrameShift` class.
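To illustrate why retranslating to the first stop picks up the boundary codon for free, here is a small sketch on a hypothetical toy transcript. The sequences and the `translate_to_first_stop` helper are assumptions made for illustration and do not mirror the actual implementation in `splice_outcomes.py`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_to_first_stop(cdna):
    """Translate from offset 0 until the first stop codon or the end of sequence."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE[cdna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy transcript (hypothetical): skipped exon length is 10, not a multiple of 3.
exon1 = "ATGGCCA"
exon2 = "AGGTTCCTGA"
exon3 = "TGAATGGTAACTGAGG"  # includes a few bases past the reference stop

ref_protein = translate_to_first_stop(exon1 + exon2 + exon3)  # "MAKVPDEW"
mutant_protein = translate_to_first_stop(exon1 + exon3)       # "MAMNGN"

# First residue that differs: the new boundary codon ATG (M) replaces AAG (K).
first_changed = next(
    i for i, (r, m) in enumerate(zip(ref_protein, mutant_protein)) if r != m
)
```

Because the whole post-skip sequence is translated, the seam codon needs no special handling; what remains open is only which effect class carries the result.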
Scope
- In-frame exon skip: reconstruct the boundary codon from the cDNA of the flanking exons, emit a `ComplexSubstitution` (or equivalent) at the seam in addition to the `Deletion` of the interior AAs — or represent the whole thing as a single `ComplexSubstitution` whose `aa_alt` replaces the skipped AAs plus the reshaped boundary codon.
- Out-of-frame exon skip: consider reusing `FrameShift` directly rather than the `_ExonSkipFrameshiftEffect` shim. The existing shim computes `mutant_protein_sequence` correctly but doesn't integrate with type-based downstream consumers.
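One possible way to derive the single-substitution representation for the in-frame case is to translate both the reference and the post-skip coding sequence, then trim the shared prefix and suffix of the two proteins. The helper below is a hypothetical sketch (the name `minimal_protein_change` and its return shape are assumptions, not part of varcode); how the result maps onto `Deletion` or `ComplexSubstitution` is left to the implementation.

```python
def minimal_protein_change(ref_protein, mut_protein):
    """Return (aa_offset, aa_ref, aa_alt) after trimming shared prefix and suffix.

    An empty aa_alt means the change is a pure deletion; a non-empty aa_alt
    means the seam codon introduced at least one new residue.
    """
    shortest = min(len(ref_protein), len(mut_protein))

    # Length of the common prefix.
    n_prefix = 0
    while n_prefix < shortest and ref_protein[n_prefix] == mut_protein[n_prefix]:
        n_prefix += 1

    # Length of the common suffix, never overlapping the prefix.
    n_suffix = 0
    max_suffix = shortest - n_prefix
    while (
        n_suffix < max_suffix
        and ref_protein[-1 - n_suffix] == mut_protein[-1 - n_suffix]
    ):
        n_suffix += 1

    aa_ref = ref_protein[n_prefix:len(ref_protein) - n_suffix]
    aa_alt = mut_protein[n_prefix:len(mut_protein) - n_suffix]
    return n_prefix, aa_ref, aa_alt


# Seam codon encodes a new residue: K V P D are replaced by N.
seam_change = minimal_protein_change("MAKVPDE", "MANE")   # (2, "KVPD", "N")

# Seam codon happens to encode a flanking residue: collapses to a pure deletion.
pure_deletion = minimal_protein_change("MAKVPDE", "MADE")  # (2, "KVP", "")
```

Working from the two translated proteins keeps the seam logic independent of exon phase: the same function covers exons that start on a codon boundary and exons that start mid-codon.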
Related