This document describes the scoring model used in this repository to suggest candidate clade breakpoints on a phylogenetic tree, and how those suggestions translate into new-clade labels in Auspice/Nextstrain JSON output.
If you only want to run the workflow, start with the main README.md.
If you are tuning parameters or interpreting score outputs, read this file.
The algorithm is designed to be run repeatedly as datasets grow. It is not meant to find a globally optimal partition of an unlabeled tree, but to identify new groups that have become large, diverged, and/or mutation-defined enough to justify a clade designation.
At a high level, for every node/branch we compute three components:
- Bushiness / growth signal (
bushiness_raw, normalizedbushiness) - Branch mutation score (
branch_score) based on weighted AA substitutions - Divergence contribution (
div_score) based on divergence since the last clade breakpoint
These are combined into a total score (score) and compared to a cutoff. If the cutoff is exceeded (and a minimum size is met), the branch is marked as a new clade breakpoint and the subtree is assigned a new hierarchical clade ID which is then converted to a human-friendly name.
The algorithm operates on a Nextstrain Auspice JSON export (typically via augur export) and expects:
treewithchildren,node_attrs,branch_attrsbranch_attrs.mutations:- must include
nucfor branch-length proxy used in bushiness - should include one or more translated CDS keys (e.g.
VP1) to score amino-acid mutations
- must include
node_attrs.num_date.valuefor time-filtering ("alive" tips) if available- an existing clade label present in
node_attrs[clade_key].value(or the workflow can insert a dummy clade)
- Terminal node: no
children - Non-terminal node: has one or more
children
The implementation uses these distinctions for recursive traversal and for defining bushiness base cases.
A node is considered "alive" if it falls within a date window:
min_date < num_date < max_date
This is used to focus the growth/bushiness score on recent lineages (configurable), so very old clusters do not dominate clade designation just because they have many historical tips.
If num_date is missing, the code currently treats the node as alive.
During preparation, mutation-less internal nodes may be collapsed upward (implementation detail) to reduce chains of internal nodes with no mutations and simplify the tree for scoring.
This does not change topology in a way that affects clade membership, but can change where exactly breakpoints land if a long chain of internal nodes has no mutations.
Before suggesting new clades, the tree is annotated with a boolean attribute:
backbone = Truefor nodes on paths from root to existing clade breakpointsbackbone = Falseotherwise
This allows the algorithm to avoid creating new clades along the "spine" of already-defined clades when ignore_backbone=True is used in bushiness/scoring.
What counts as an "existing breakpoint"?
- A branch with
branch_attrs.labels[clade_key](dataset-dependent)
Each node gets a divergence counter:
div: cumulative count of mutations along the path since the root (then effectively "reset" at breakpoints during traversal)
In this repo’s implementation the divergence is computed as counts of mutations in a configurable set of genes/proteins (typically VP1 for EV clades).
For amino acids, a mutation string looks like A123T meaning A→T at position 123.
The algorithm ignores:
- deletions (
-) - unknown amino acids (
X) - for nuc, ambiguous states like
Nand gaps-where relevant
We want to prioritize clades that are:
- not just deep splits, but show recent downstream growth
- robust to uneven branch lengths
- comparable across different parts of the tree
We compute a local branching index (LBI)-like measure on the tree.
For a node
Where:
-
$c$ are child nodes ofn -
$l_c$ is the branch length leading to child nodec -
$d$ is a distance scale (in code: you pass adistance(c)function; the scale is controlled bybushiness_branch_scale) -
I(c alive)is an indicator function (1 if$c$ is "alive", else 0)
Terminal nodes have:
-
$\phi = 1$ if alive, else$0$
Enterovirus trees often don’t have consistent continuous branch lengths in JSON. Instead, this repo uses a mutation-count proxy:
- branch length ≈ number of nucleotide mutations on the branch excluding
Nand- - then divided by
bushiness_branch_scale
Conceptually:
- smaller
bushiness_branch_scale→ longer effective branch lengths → bushiness decays faster - larger
bushiness_branch_scale→ shorter effective branch lengths → bushiness persists further
bushiness_raw can vary widely with sampling density and tree size. We normalize it using saturation:
Where
This yields bushiness in [0, 1).
We want breakpoints to land on branches with informative AA changes, e.g. known antigenic loops in EV-D68 VP1.
For each non-nucleotide CDS key in branch_attrs.mutations:
- look up per-position weights in
weights.json - sum the weights across AA substitutions on that branch
Let raw AA weight on the branch be
Where:
-
$s_w$ corresponds tobranch_length_scalein config/code - interpretation: how many "weight units" should constitute a "strong" branch
This yields branch_score in [0, 1).
Keep it simple:
- encode a small set of known important positions with higher weights (2–3)
- use a conservative
"default"(often 0 or 1) - start with VP1 only unless you have a strong reason to include other genes given recombination
Divergence since the last breakpoint is used to make new clades easier to trigger as a lineage accumulates changes.
Let:
-
$\Delta = div - div_{breakpoint}$ (divergence since last breakpoint) -
$s_d$ =divergence_scale
The divergence contribution is:
Where:
-
$a_d$ =divergence_addition(max contribution to total score)
So div_score ranges from 0 to divergence_addition.
In tree_clade_assignment.py the divergence contribution is added into the stored score:
node_attrs["score"]["value"] += div_score
So the exported score is a total score already including divergence.
Before divergence is added, the combined "base score" is effectively:
After adding divergence, the total is:
A node triggers a new clade breakpoint if:
score_total > cutoff- and
ntips > min_size
There is an additional "one-daughter" safeguard:
- if exactly one child would also trigger under the same thresholding logic, the parent node is not used as a breakpoint
- this reduces redundant breakpoints along a single path
When a breakpoint is designated, divergence is conceptually "reset" for downstream evaluation by updating the base divergence reference used for
Internally, clades are represented hierarchically as tuples, e.g.:
('A', 1, 2)or similar (exact structure depends on your alias map)
An aliases.json file provides mapping between:
- a short top-level label (e.g.
"A") - and the corresponding full hierarchical tuple prefix
When a new breakpoint is triggered:
- the algorithm assigns the next available child index under the parent clade in the hierarchy
- then converts the full hierarchical representation into a short name via
full_clade_to_short_name(...)
The result is written to:
branch_attrs.labels["new-clade"]at the breakpoint branchnode_attrs["new-clade"].valuefor all downstream nodes
The suggested tree JSON includes several helpful continuous attributes:
score(total score used for triggering; includes divergence)div_score(divergence contribution only)div_impact(implementation-defined helper; roughly "how much divergence is needed to cross cutoff")bushiness_rawandbushinessbranch_scorenew-clade(assigned clade label)
When inspecting in Auspice:
-
Color by
new-clade- do proposed clades correspond to visually coherent clusters?
- are they too nested / too granular?
-
Color by
score- do breakpoint branches sit near the upper end of score distribution?
-
Color by
branch_score- do breakpoints align with branches containing expected informative VP1 changes?
-
Color by
div_score- is divergence driving clade calls everywhere (too high
divergence_additionor too lowdivergence_scale)? - or is it never contributing (too low
divergence_additionor too highdivergence_scale)?
- is divergence driving clade calls everywhere (too high
-
Check sizes
- if many tiny clades appear: increase
min_sizeand/orcutoff
- if many tiny clades appear: increase
- Higher cutoff → fewer clades
- Lower cutoff → more clades
- If you want "only when at least two signals agree", target cutoff around ~1.0 (because components are saturating in ~[0,1] and divergence adds up to
divergence_addition)
- Most effective lever to prevent "tip noise"
- If you get many small "microclades", raise
min_size
divergence_additionsets the maximum divergence contribution (0 disables divergence)divergence_scalesets how quickly divergence saturates:- smaller → divergence rises quickly with few changes
- larger → divergence needs many changes to matter
- Larger → branch_score saturates more slowly (less sensitive to a few high-weight mutations)
- Smaller → a few weighted AA changes push branch_score close to 1 quickly
Controls the effective decay length used in bushiness:
- smaller → bushiness is "short-memory" (focuses strongly on very local tip growth)
- larger → bushiness is "longer-memory" (signals propagate further up)
- Recombination: genome-wide phylogeny and VP1 phylogeny can disagree. For nomenclature, VP1-only is often most interpretable.
- Sampling changes: bushiness depends on sampling density; expect to re-tune when surveillance intensity changes.
- Time filtering:
min_dateandmax_datestrongly affect which parts of the tree contribute to bushiness. Use them deliberately (e.g. focus on recent years when defining new names).
This approach is adapted from the influenza clade suggestion algorithm:
Related tools:
- Nextstrain: https://nextstrain.org
- Nextclade: https://clades.nextstrain.org