Annotate neoepitopes with cancer cell fraction (CCF) / clonality #95

@riasc

Motivation

Clonality is one of the most decision-relevant orthogonal signals for neoantigen prioritization. Clonal neoantigens (present in essentially all tumor cells) predict immune-checkpoint-blockade response and elicit reactive T cells; subclonal neoantigens are far weaker targets — and for vaccine design a subclonal target only covers a fraction of the tumor (McGranahan et al., Science 2016).

ScanNeo2 already reports per-variant VAF, but raw VAF is not clonality — it is confounded by tumor purity and local copy number (a variant in a copy-gained or low-purity region is mis-scored). This issue adds a proper cancer cell fraction (CCF) estimate and a clonal/subclonal classification.

Scope — light

ScanNeo2 will not call copy number or purity itself (no ASCAT/Sequenza/FACETS integration — that would be a much larger build). Instead, the user supplies per-sample tumor purity and a copy-number segments file (produced by whatever CNV caller they already use). ScanNeo2 computes CCF from VAF + purity + copy number. Computing purity/CNV in-pipeline is a possible full-scope follow-up.

CCF formula

For a variant at a locus with tumor total copy number CN_t:

CCF = VAF * ( p * CN_t + (1 - p) * CN_n ) / ( p * m )
  • VAF — observed variant allele frequency (already available per variant)
  • p — tumor purity (user-supplied, per sample)
  • CN_t — tumor total copy number at the locus (from the user's CNV file)
  • CN_n — normal copy number (2 for autosomes; 1 for hemizygous loci such as chrX/chrY in male samples)
  • m — mutation multiplicity (copies carrying the mutation)

clonal if CCF >= threshold (default ~0.9, configurable), else subclonal.
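
A minimal sketch of the formula and the threshold rule above. The function names `compute_ccf` and `classify_clonality` are hypothetical and not existing ScanNeo2 code; returning `None` for missing or out-of-range inputs is an assumption that mirrors the "leave the columns empty" behaviour described in the implementation plan.

```python
def compute_ccf(vaf, purity, cn_tumor, cn_normal=2, multiplicity=1):
    """Cancer cell fraction from VAF, purity and copy number.

    Implements CCF = VAF * (p * CN_t + (1 - p) * CN_n) / (p * m).
    Returns None when an input is missing or out of range (assumption).
    """
    if vaf is None or purity is None or cn_tumor is None:
        return None
    if not (0.0 < purity <= 1.0) or multiplicity < 1:
        return None
    total_copies = purity * cn_tumor + (1.0 - purity) * cn_normal
    return vaf * total_copies / (purity * multiplicity)


def classify_clonality(ccf, threshold=0.9):
    """Label a variant 'clonal' if CCF >= threshold, else 'subclonal'."""
    if ccf is None:
        return None
    return "clonal" if ccf >= threshold else "subclonal"


# Same VAF and purity, different copy-number context:
diploid = compute_ccf(vaf=0.30, purity=0.6, cn_tumor=2)                  # m = 1
gained = compute_ccf(vaf=0.30, purity=0.6, cn_tumor=4, multiplicity=2)   # m = 2
print(diploid, classify_clonality(diploid))
print(gained, classify_clonality(gained))
```

With a VAF of 0.30 and purity 0.6, the diploid locus works out to a CCF of about 1.0 (clonal), while the copy-gained locus with m = 2 works out to about 0.8 (subclonal at the default threshold). This is the confounding that raw VAF hides.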

Implementation plan

  1. Per-sample inputs: purity (scalar 0-1) and cnv (segments file). These belong in the #93 sample sheet (Multi-sample support: process multiple samples in one run via a sample sheet) as purity and cnv columns; until #93 lands they can be config keys. If either is absent for a sample, the CCF/clonality columns are left empty — graceful, no failure (optional feature).

  2. CNV lookup helper — given (chrom, pos), return total copy number from the segments file. Accept a generic TSV (chrom, start, end, total_cn); document how to derive it from common callers (ASCAT/Sequenza/FACETS). A position with no covering segment ⇒ assume CN_n (diploid) or leave empty.

  3. CCF computation — slot into the prioritization stage where VAF is already in hand (variants.py populates vaf; compute alongside or in compile.py/filtering.py). Apply the formula per variant.

  4. Multiplicity (m) — the fuzzy part. v1 simplification: pick m in 1..CN_t giving CCF closest to but not exceeding 1; default m = 1 when ambiguous. Document the assumption; a probabilistic estimate is a follow-up.

  5. Classification — add a clonality value (clonal / subclonal) from a configurable CCF threshold.

  6. Output — two new columns, ccf and clonality, in {vartype}_{mhc_class}_neoepitopes.txt.

  7. Ranking integration — v1: report-only columns; do not change the existing ranking_score. Whether to fold clonality into the score is a separate decision (it would shift all existing scores).

  8. Tests — unit-test the CCF formula with synthetic (VAF, purity, CN_t, m) cases of known CCF, including hemizygous loci and missing-input handling.
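
Steps 2 and 4 of the plan could be sketched as follows. All names (`load_segments`, `lookup_total_cn`, `choose_multiplicity`) are hypothetical, not existing ScanNeo2 functions. The sketch assumes a headered TSV with the columns chrom, start, end, total_cn, integer copy numbers, and 1-based inclusive, non-overlapping segments. Returning `None` for an uncovered position is one of the two options named in step 2; the caller can substitute CN_n if the diploid fallback is preferred.

```python
import bisect
import csv
import io
from collections import defaultdict


def load_segments(handle):
    """Read chrom/start/end/total_cn rows into per-chromosome sorted lists."""
    segments = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        segments[row["chrom"]].append(
            (int(row["start"]), int(row["end"]), int(row["total_cn"]))
        )
    for chrom in segments:
        segments[chrom].sort()
    return dict(segments)


def lookup_total_cn(segments, chrom, pos):
    """Return total copy number at (chrom, pos), or None if uncovered."""
    chrom_segments = segments.get(chrom, [])
    # Index of the last segment whose start is <= pos.
    key = (pos, float("inf"), float("inf"))
    idx = bisect.bisect_right(chrom_segments, key) - 1
    if idx >= 0:
        start, end, total_cn = chrom_segments[idx]
        if start <= pos <= end:
            return total_cn
    return None


def choose_multiplicity(vaf, purity, cn_tumor, cn_normal=2, tolerance=1e-9):
    """v1 heuristic: m in 1..CN_t giving CCF closest to, but not above, 1.

    CCF shrinks as m grows, so the first m that brings CCF to <= 1 is the
    closest one. Falls back to m = 1 when no m in range fits (ambiguous).
    """
    ccf_at_m1 = vaf * (purity * cn_tumor + (1.0 - purity) * cn_normal) / purity
    for m in range(1, cn_tumor + 1):
        if ccf_at_m1 / m <= 1.0 + tolerance:
            return m
    return 1


# Usage with an in-memory segments file (synthetic data):
tsv = "chrom\tstart\tend\ttotal_cn\nchr1\t1\t1000\t2\nchr1\t1001\t5000\t4\n"
segments = load_segments(io.StringIO(tsv))
cn_t = lookup_total_cn(segments, "chr1", 2500)
m = choose_multiplicity(vaf=0.30, purity=0.6, cn_tumor=cn_t)
print(cn_t, m)
```

The synthetic segments and the (VAF, purity, CN_t) triples double as seeds for the unit tests in step 8. A binary search over sorted segments keeps the lookup cheap per variant; an interval-tree library would also work but adds a dependency.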

Open design decisions

Dependencies

Scope note

Standalone feature, not part of the 2026-03-26 audit cluster. Full-scope in-pipeline purity/CNV calling (ASCAT/Sequenza-style) is explicitly out of scope here and a possible separate follow-up.

Labels: enhancement (New feature or request)
