Suggest candidate clade breakpoints and clade labels for non‑polio enteroviruses (EV‑D68, EV‑A71, CV‑A16, CV‑A10, …) from Nextstrain Auspice JSON trees.
This repo adapts the influenza clade-suggestion approach to enteroviruses using a combined score from:
- tree growth / structure (“bushiness”, LBI-like),
- amino-acid change patterns (weighted substitutions),
- divergence since last clade breakpoint (saturating contribution).
This is a clade suggestion tool: it proposes breakpoints and labels for review. It does not claim a globally optimal partition.
- How to run the workflow + outputs: this README
- Algorithm details (formulas, scoring, interpretation):
docs/algorithm.md
The repo is organized with one directory per virus:
EV-clade-algorithm/
├── Snakefile
├── config.yaml
├── auspice/
│ ├── original_{virus}.json
│ └── suggested_{virus}.json
├── scripts/
│ ├── add_new_clades.py # main algorithm (called by Snakefile)
│ ├── calculate_optimal_scales.py # scale recommendations
│ ├── extract_aliases_from_tree.py # optional: derive aliases from an annotated tree
│ ├── extract_yml_from_json.py # export clade definitions to YAML
│ ├── construct_tsv.py # build TSVs from exported YAMLs
│ └── visualize_clades.R # optional plots
├── ev-d68/
│ ├── resources/
│ │ ├── weights.json
│ │ ├── aliases_*.json
│ │ └── genome_annotation.gff3 # optional (if not fetched/generated)
│ └── results/
│ ├── original_ev-d68.clades.json
│ ├── optimal_scales.json
│ ├── new-clades.tsv
│ └── (plots/)
├── ev-a71/
├── cva16/
└── cva10/
(Directory names must match the keys under viruses: in config.yaml.)
This repo uses a renv.yaml conda/ micromamba environment with Python + R dependencies and required CLI tools.
micromamba env create -f renv.yaml -n clade
micromamba activate cladeAt minimum:
- list viruses under
viruses: - provide
tree_url(if not using local trees) - set defaults (
defaults:) for date range, proteins, and thresholds
snakemake -c 1 allOutputs:
auspice/suggested_{virus}.json(viewable in Auspice){virus}/results/optimal_scales.json{virus}/results/new-clades.tsv(+ optional sweep output)
auspice view --datasetDir auspiceThen open something like:
http://localhost:4000/suggested/ev-d68?branchLabel=new-clade&c=new-clade&d=tree&p=full&showBranchLabels=allFor each {virus}:
-
Fetch input tree + annotation (unless local tree provided)
- Tries Nextclade dataset:
enpen/enterovirus/{virus} - Falls back to
curl {tree_url} - If only a tree is fetched, a GFF3 is generated from
meta.genome_annotationsusingjq
- Tries Nextclade dataset:
-
Ensure existing clade annotations exist
- If the input tree has no
node_attrs[clade_key], assigns a dummy clade to all nodes. - This keeps the algorithm happy because it expects an “old clade key” for hierarchy/backbone logic.
- If the input tree has no
-
Create default inputs
resources/weights.jsonif missing (simple defaults)- optional
resources/aliases_default.jsonextracted from the tree (if you want a starting point)
-
Compute recommended scales (
scripts/calculate_optimal_scales.py)- writes
{virus}/results/optimal_scales.json - used as defaults for sweep ranges / per-virus settings
- writes
-
Suggest new clades (
scripts/add_new_clades.py)- writes
auspice/suggested_{virus}.json - writes
{virus}/results/new-clades.tsv - optionally runs a parameter sweep (
--plots True) and writes a TSV of sweep results
- writes
Implemented in scripts/add_new_clades.py.
- alive: tip in
[min_date, max_date]based onnode_attrs.num_date.value - ntips: number of descendant tips
- bushiness_raw: LBI-like score using exponential decay along branches
(branch length is computed from nucleotide mutations, scaled bybushiness_branch_scale) - bushiness: normalized bushiness
raw/(raw + bushiness_scale) - branch_score: normalized weighted AA score
aa_weight/(branch_length_scale + aa_weight) - div_score: divergence contribution since last breakpoint:
divergence_addition * delta_div/(delta_div + divergence_scale)
A branch triggers a new clade if:
score_total > cutoffandntips > min_size- plus a “one-daughter” check to avoid creating a clade when exactly one child would also trigger
Important implementation detail: the script adds div_score directly into node_attrs["score"]["value"] before testing the cutoff.
This is the node attribute key containing existing clade membership, used to:
- build the existing hierarchy (
full_{clade_key}) - label the backbone (avoid placing new clades on branches leading to existing clade roots)
For EV datasets this is usually something like:
clade_membership- or dataset-specific
The workflow’s ensure_clades rule can assign a dummy clade if missing.
The algorithm uses:
- nucleotide mutations (
nuc) for bushiness branch length - amino-acid mutations for branch weighting (
weights.json) - divergence computed over
defaults.proteins(if empty, it falls back tonuc)
If a genome_annotation.gff3 is present, the script can infer CDS names (useful if you don’t want to hardcode proteins).
The Snakemake rule runs add_new_clades.py with sweep ranges from config.yaml:
--cutoff-sweep--div_add-sweep--div_scale-sweep--min_size-sweep--bush_scale-sweep--bls_range-sweep
Output: a TSV (same path as --clades) with one row per parameter set including:
- counts of old/new clades
- mean node stats (
score,bushiness,div,branch_score)
Tip: start with a small grid (e.g. 5×5 over cutoff and divergence_addition) before adding more axes.
In auspice/suggested_{virus}.json:
new-clade: suggested clade name- continuous colorings:
score(already includesdiv_scorein current implementation)div_scorediv_impactbushiness_raw,bushinessbranch_score
In {virus}/results/new-clades.tsv:
- a simple table listing old clades vs newly suggested clades (mainly for quick counts)
- Recombination can break the “single tree = single history” assumption. Consider restricting mutation scoring/divergence to VP1 unless you have a clear reason to include additional proteins.
- Sampling density strongly affects bushiness. Re-tune thresholds when sampling changes substantially.
Clean derived outputs:
snakemake clean -c 1Export clade definitions to YAML (for downstream Nextclade / documentation):
snakemake extract_clade_ymls -c 1- Upstream inspiration: https://github.com/influenza-clade-nomenclature/clade-suggestion-algorithm
- Nextstrain: https://nextstrain.org
- Nextclade: https://clades.nextstrain.org