Status
Non-point-mutation antigens are currently dropped silently. ~30% of antigens in a real LENS report (Hugo IPRES Pt02: 613/2153 rows) have NaN `variant_coords` because their source isn't a single point mutation:
- SPLICE (~6% of Pt02): aberrant splice junctions — coords live in `splice_coords` / `splice_description`.
- FUSION (~0.1%): gene fusions — coords in `fusion_left_breakpoint` / `fusion_right_breakpoint` plus `fusion_left_gene` / `fusion_right_gene` / `fusion_type`.
- CTA/SELF (~13%): cancer-testis antigens — no point mutation; tumor-specificity comes from expression pattern.
- ERV (~10%): endogenous retroviruses — `erv_orf_id` and `erv_*` columns.
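To make the size of the silent drop visible, the breakdown above could be reproduced with a small tally like the one below. This is only a sketch: it assumes the report has already been loaded as a list of dicts, that each row carries an `antigen_source` column (the name used in the acceptance criteria), and that a missing `variant_coords` may show up as `None`, an empty string, the text `nan`, or a float NaN. The helper names and the sample coordinate string are hypothetical.

```python
import math
from collections import Counter


def is_missing(value):
    """Return True for None, float NaN, an empty string, or the text 'nan'."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    return str(value).strip().lower() in ("", "nan")


def count_dropped_by_source(rows):
    """Count rows that lack variant_coords, grouped by antigen_source."""
    return Counter(
        row.get("antigen_source", "UNKNOWN")
        for row in rows
        if is_missing(row.get("variant_coords"))
    )


# Hypothetical rows; the coordinate format is illustrative only.
rows = [
    {"antigen_source": "SNV", "variant_coords": "chr1:100"},
    {"antigen_source": "SPLICE", "variant_coords": float("nan")},
    {"antigen_source": "ERV", "variant_coords": ""},
    {"antigen_source": "ERV", "variant_coords": None},
]
dropped = count_dropped_by_source(rows)
for source, n in sorted(dropped.items()):
    print(f"{source}: {n} rows without variant_coords")
```

Logging a tally like this at import time would at least turn the silent drop into a visible one while the data model is being decided.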
Why these matter
These are real neoantigens with known therapeutic value:
- Fusion-derived neoantigens are the canonical example for fusion-driven cancers (e.g. EWSR1-FLI1 in Ewing sarcoma).
- CTA-driven vaccines (NY-ESO-1, MAGE-A) are the basis for several clinical trials.
- ERV-derived antigens are an active area for melanoma and other tumors.
Vaxrank data-model gap
`MutantProteinFragment` is shaped around point mutations (single `varcode.Variant`). To represent fusion / splice / CTA / ERV, we need one of:
1. Extend the data model: add an `antigen_source` field plus per-source provenance (e.g. `fusion_breakpoints: tuple`, `splice_junction: dict`, `cta_gene_id: str`, `erv_orf_id: str`). Each source type gets its own optional dataclass attached to the fragment.
2. Polymorphic dispatch: a single `SourceProvenance` union type with subclasses (SNVProvenance, SpliceProvenance, FusionProvenance, CTAProvenance, ERVProvenance).
3. Free-form metadata dict: `MutantProteinFragment.source_metadata: dict` (least invasive, least typed, most flexible).
Option 1 or 2 is the right long-term move; option 3 unlocks the data faster while we figure out the schema.
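As a rough illustration of what the polymorphic option might look like, here is a minimal sketch using frozen dataclasses and a `typing.Union` alias in place of a class hierarchy. The provenance class names come from the list above; the field names mirror the LENS columns mentioned under Status, but their types are guesses. `FragmentSketch` and `describe` are hypothetical stand-ins, not the real `MutantProteinFragment` API.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class SNVProvenance:
    variant_coords: str


@dataclass(frozen=True)
class SpliceProvenance:
    splice_coords: str
    splice_description: Optional[str] = None


@dataclass(frozen=True)
class FusionProvenance:
    left_breakpoint: str
    right_breakpoint: str
    left_gene: str
    right_gene: str
    fusion_type: Optional[str] = None


@dataclass(frozen=True)
class CTAProvenance:
    cta_gene_id: str


@dataclass(frozen=True)
class ERVProvenance:
    erv_orf_id: str


# Union alias standing in for the "SourceProvenance" type named above.
SourceProvenance = Union[
    SNVProvenance, SpliceProvenance, FusionProvenance, CTAProvenance, ERVProvenance
]


@dataclass(frozen=True)
class FragmentSketch:
    """Hypothetical stand-in for MutantProteinFragment (assumption)."""
    amino_acids: str
    provenance: SourceProvenance


def describe(provenance: SourceProvenance) -> str:
    """Dispatch on the provenance type to build a short report label."""
    if isinstance(provenance, SNVProvenance):
        return f"SNV {provenance.variant_coords}"
    if isinstance(provenance, SpliceProvenance):
        return f"SPLICE {provenance.splice_coords}"
    if isinstance(provenance, FusionProvenance):
        return f"FUSION {provenance.left_gene}-{provenance.right_gene}"
    if isinstance(provenance, CTAProvenance):
        return f"CTA {provenance.cta_gene_id}"
    if isinstance(provenance, ERVProvenance):
        return f"ERV {provenance.erv_orf_id}"
    raise TypeError(f"unknown provenance type: {type(provenance).__name__}")


# Placeholder breakpoints and peptide; only the gene pair comes from the text.
fusion = FusionProvenance("left_bp", "right_bp", "EWSR1", "FLI1")
fragment = FragmentSketch(amino_acids="PLACEHOLDER", provenance=fusion)
label = describe(fragment.provenance)
```

The same shapes would also serve option 1 if each provenance class were attached as an optional field instead of through a union.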
Acceptance
- LENS rows with `antigen_source` ∈ {SPLICE, FUSION, CTA/SELF, ERV} produce `MutantProteinFragment`s with their source-specific provenance preserved.
- Reports surface the antigen source so reviewers know what kind of neoantigen they're looking at.
- Per-source filtering: vaccine designs can opt in / out of CTA, ERV, etc. independently.
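The per-source opt in / opt out criterion could be met with something as small as the filter below. It is a sketch under assumptions: rows are dicts with an `antigen_source` key, the label `SNV` is assumed for ordinary point-mutation rows (the list above only names the other four), and `filter_by_source` with its `include` / `exclude` parameters is a hypothetical helper, not an existing Vaxrank option.

```python
# Source labels taken from the acceptance criteria; "SNV" is an assumed label.
ALL_SOURCES = frozenset({"SNV", "SPLICE", "FUSION", "CTA/SELF", "ERV"})


def filter_by_source(rows, include=ALL_SOURCES, exclude=frozenset()):
    """Keep rows whose antigen_source is included and not excluded."""
    unknown = (set(include) | set(exclude)) - ALL_SOURCES
    if unknown:
        raise ValueError(f"unknown antigen sources: {sorted(unknown)}")
    allowed = set(include) - set(exclude)
    return [row for row in rows if row["antigen_source"] in allowed]


rows = [
    {"id": 1, "antigen_source": "SNV"},
    {"id": 2, "antigen_source": "CTA/SELF"},
    {"id": 3, "antigen_source": "ERV"},
    {"id": 4, "antigen_source": "FUSION"},
]

# A design that opts out of CTA and ERV antigens independently.
kept = filter_by_source(rows, exclude={"CTA/SELF", "ERV"})
kept_ids = [row["id"] for row in kept]
```

Rejecting unknown labels up front keeps a typo in a source name from silently excluding nothing.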
Related