
Commit 6e19a7c

include multiple references in the CVA16 dataset for testing multi-ref minimizer
1 parent abc65e8 commit 6e19a7c

16 files changed

Lines changed: 5320 additions & 1 deletion

data/enpen/collection.json

Lines changed: 2 additions & 1 deletion
@@ -23,6 +23,7 @@
23 23
]
24 24
},
25 25
"dataset_order": [
26-
"enpen/enterovirus/ev-d68"
26+
"enpen/enterovirus/ev-d68",
27+
"enpen/enterovirus/cva16"
27 28
]
28 29
}
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
## Unreleased
2+
3+
Initial release of a Coxsackievirus A16 dataset for lineage classification!
4+
5+
Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
# Coxsackievirus A16 dataset based on reference "G-10"
2+
3+
| Key | Value |
4+
|----------------------|-----------------------------------------------------------------------|
5+
| authors | [Nadia Neuner-Jehle](https://eve-lab.org/people/nadia-neuner-jehle/), [Alejandra González-Sánchez](https://www.vallhebron.com/en/professionals/alejandra-gonzalez-sanchez), [Emma B. Hodcroft](https://eve-lab.org/people/emma-hodcroft/), [ENPEN](https://escv.eu/european-non-polio-enterovirus-network-enpen/) |
6+
| name | Coxsackievirus A16 |
7+
| reference | [U05876.1](https://www.ncbi.nlm.nih.gov/nuccore/U05876) |
8+
| workflow | https://github.com/enterovirus-phylo/nextclade_a16 |
9+
| path | `enpen/enterovirus/cva16` |
10+
| clade definitions | A–F |
11+
12+
## Scope of this dataset
13+
14+
This dataset uses the historical G-10 prototype sequence ([U05876.1](https://www.ncbi.nlm.nih.gov/nuccore/U05876)), which may differ from contemporary global CVA16 strains. It is intended for broad subgenogroup classification, mutation quality control, and phylogenetic analysis of CVA16 diversity.
15+
16+
*Note: The G-10 reference differs substantially from currently circulating strains.* This is common for enterovirus datasets, in contrast to some other virus datasets (e.g., seasonal influenza), where the reference is updated more frequently to reflect recent lineages.
17+
18+
To address this, the dataset is *rooted* on a Static Inferred Ancestor, a phylogenetically reconstructed ancestral sequence near the tree root. This provides a stable reference point that can optionally be used as an alternative reference for mutation calling.
19+
20+
## Features
21+
22+
This dataset supports:
23+
24+
- Assignment of subgenogroups
25+
- Phylogenetic placement
26+
- Sequence quality control (QC)
27+
28+
## Subgenogroups of Coxsackievirus A16
29+
30+
Subgenogroups B1a, B1b and B1c are the major phylogenetic divisions of CVA16 and are commonly used in virological surveillance and literature. They are defined by phylogenetic clustering and do not necessarily indicate antigenic differences. In recent years, recombinant forms have been identified and labeled C–F (also known as B2, B3, and D). These recombinant forms cluster with the prototype strain, clade A.
31+
32+
These designations are based on the phylogenetic structure and mutations, and are widely used in molecular epidemiology, similar to subgenotype systems for other enteroviruses. Unlike influenza (H1N1, H3N2) or SARS-CoV-2, there is no universal, standardized global lineage nomenclature for enteroviruses. Naming follows conventions from published studies and surveillance practices.
33+
34+
## Reference types
35+
36+
This dataset includes several reference points used in analyses:
37+
- *Reference:* A RefSeq or similarly established reference sequence; here, G-10 (U05876.1).
38+
39+
- *Parent:* The nearest ancestral node of a sample in the tree, used to infer branch-specific mutations.
40+
41+
- *Clade founder:* The inferred ancestral node defining a clade (e.g., B1a, B2). Mutations "since clade founder" describe changes that define that clade.
42+
43+
- *Static Inferred Ancestor:* Reconstructed ancestral sequence inferred with an outgroup, representing the likely founder of CVA16. Serves as a stable reference.
44+
45+
- *Tree root:* The root of the tree. It may change in future updates as more data become available.
46+
47+
All references use the coordinate system of the G-10 sequence.
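
Because every reference point shares the G-10 coordinate system, the same sample can be compared against different references position by position. The following is a minimal sketch of that idea, not Nextclade's implementation: the function name `call_substitutions` and the 10-nucleotide toy sequences are invented for illustration, and the inputs are assumed to be already aligned.

```python
from typing import List


def call_substitutions(reference: str, query: str) -> List[str]:
    """List nucleotide substitutions of `query` relative to `reference`.

    Assumption: both sequences are already aligned to the same coordinate
    system (equal length, '-' for gaps, 'N' for unknown bases).
    Positions are reported 1-based.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must share one coordinate system")
    skipped = {"-", "N"}
    substitutions = []
    for pos, (ref_nt, qry_nt) in enumerate(zip(reference, query), start=1):
        if ref_nt in skipped or qry_nt in skipped:
            continue  # gaps and unknown bases are not counted as substitutions
        if ref_nt != qry_nt:
            substitutions.append(f"{ref_nt}{pos}{qry_nt}")
    return substitutions


# Toy sequences, invented for illustration only (not real CVA16 data).
g10_reference = "ACGTACGTAC"
static_ancestor = "ACGAACGTTC"
sample = "ACGAACGTTG"

vs_reference = call_substitutions(g10_reference, sample)
vs_ancestor = call_substitutions(static_ancestor, sample)
print("relative to reference:", vs_reference)
print("relative to static ancestor:", vs_ancestor)
```

In this toy case the sample shows three substitutions against the distant reference but only one against the closer ancestor, while the reported positions stay comparable because both comparisons use the same coordinates.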
48+
49+
## Issues & Contact
50+
- For questions or suggestions, please [open an issue](https://github.com/enterovirus-phylo/nextclade_a16/issues) or email: eve-group[at]swisstph.ch
51+
52+
## What is a Nextclade dataset?
53+
54+
A Nextclade dataset includes the reference sequence, genome annotations, tree, clade definitions, and QC rules. Learn more in the [Nextclade documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html).
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
##gff-version 3
2+
#!gff-spec-version 1.21
3+
#!processor NCBI annotwriter
4+
##sequence-region U05876.1 1 7413
5+
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=31704
6+
U05876.1 Genbank region 1 7413 . + . ID=U05876.1:1..7413;Dbxref=taxon:31704;gb-acronym=CV-A16;gbkey=Src;mol_type=genomic RNA;nat-host=Homo sapiens;strain=G-10
7+
U05876.1 Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69
8+
U05876.1 Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323
9+
U05876.1 Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565
10+
U05876.1 Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862
11+
U05876.1 Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012
12+
U05876.1 Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111
13+
U05876.1 Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440
14+
U05876.1 Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526
15+
U05876.1 Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548
16+
U05876.1 Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731
17+
U05876.1 Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193
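
The CDS rows above can be sanity-checked with a little arithmetic: each mature peptide should span exactly three nucleotides per amino acid, and consecutive peptides should be adjacent on the genome. The sketch below is a hypothetical check, not part of the dataset workflow; the `CDS` table is transcribed by hand from the annotation rows and the helper `check_cds` is an invented name.

```python
# (name, nt_start, nt_end, aa_start, aa_end), transcribed from the CDS rows
# of the annotation; aa ranges come from the ID=...:start..end attribute.
CDS = [
    ("VP4", 751, 957, 1, 69),
    ("VP2", 958, 1719, 70, 323),
    ("VP3", 1720, 2445, 324, 565),
    ("VP1", 2446, 3336, 566, 862),
    ("2A", 3337, 3786, 863, 1012),
    ("2B", 3787, 4083, 1013, 1111),
    ("2C", 4084, 5070, 1112, 1440),
    ("3A", 5071, 5328, 1441, 1526),
    ("3B", 5329, 5394, 1527, 1548),
    ("3C", 5395, 5943, 1549, 1731),
    ("3D", 5944, 7329, 1732, 2193),
]


def check_cds(records):
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    for i, (name, start, end, aa_start, aa_end) in enumerate(records):
        nt_len = end - start + 1          # GFF coordinates are 1-based, inclusive
        aa_len = aa_end - aa_start + 1
        if nt_len != 3 * aa_len:
            problems.append(f"{name}: {nt_len} nt does not encode {aa_len} aa")
        if i > 0 and start != records[i - 1][2] + 1:
            problems.append(f"{name}: not adjacent to {records[i - 1][0]}")
    return problems


problems = check_cds(CDS)
total_nt = CDS[-1][2] - CDS[0][1] + 1  # span from VP4 start to 3D end
print(problems or "all CDS rows are consistent")
print("polyprotein span:", total_nt, "nt")
```

With the coordinates as listed, the peptides tile positions 751 to 7329 without gaps, a span of 6579 nt that matches 2193 amino acids.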
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
1+
>MK989714
2+
TTAAAACAGCCTGTGGGTTGTACCCACCCACAGGGCCCACTGGGCGTTAGCACACTGATT
3+
TCACGGAATCTTTGTGCGCCTGTTTTATATCCCTCCCCCAATACAGTAACTTAGAAGTTA
4+
GAAACCTACACGACCAATAGCAGGCGTGACGCACCAGTCATGTCTTGGTCAAACACTTCT
5+
GTTTCCCCGGACTGAGTATCAATAAGCCGCTCACGCGGCTGAAGGAGAAAACGTTCGTTA
6+
TCCGGCTAACTACTTCGAGAAACCTAGTAGCACCGTGAAAGTTGCAGAGTGTTTCGCTCA
7+
GCACTTTCCCCGTGTAGATCAGGTCGATGAGTCACTGCATCCCCCACGGGCGACCGTGGC
8+
AGTGGCTGCGTTGGCGGCCTGCCTATGGGGCAACCCATGGGACGCTCTAATACAGACATG
9+
GTGTGAAGAGTCTATTGAGCTAGTTAGTAGTCCTCCGGCCCCTGAATGCGGCTAATCCTA
10+
ACTGCGGAGCACGCACCCTCAATCCAGGGGGCGGCGTGTCGTAACGGGTAACTCTGCAGC
11+
GGAACCGACTACTTTGGGTGTCCGTGTTTCCTTTTATCACTTACTGGCTGCTTATGGTGA
12+
CAATCAAAGAATTGTTACCATATAGCTTTTGGATTGGCCATCCGGTGTCTAACAGAGCTA
13+
TTGTTTACCTGTTTGTTGGATACATTCCTCTCAACTACAAAGTTCTTCAAACTCTCAACT
14+
TTATATTGCTCCTTAATCATAAGAAATGGGGTCACAGGTCTCTACCCAGCGATCTGGATC
15+
GCATGAGAACTCCAATTCTGCATCGGAGGGTTCAACCATAAATTACACAACCATAAACTA
16+
CTATAAGGATGCATATGCTGCAAGTGCGGGGCGCCAGGACATGTCCCAAGATCCAAAGAA
17+
GTTTACCGATCCTGTCATGGATGTCATACATGAAATGGCCCCACCGCTCAAATCCCCGAG
18+
TGCTGAGGCGTGTGGCTACAGTGACCGCGTGGCTCAGCTCACCATCGGGAATTCTACCAT
19+
TACTACACAAGAAGCAGCTAACATAGTTATAGCGTATGGGGAATGGCCTGAGTACTGCCC
20+
AGATACAGATGCAACGGCTGTCGACAAGCCAACACGGCCTGATGTGTCGGTTAACAGGTT
21+
CTTCACGCTCGACACTAAATCCTGGGCTAAAGACTCAAAGGGGTGGTACTGGAAGTTTCC
22+
TGATGTCCTGACAGAGGTAGGTGTTTTTGGCCAGAATGCTCAATTTCACTATCTGTATCG
23+
ATCAGGATTCTGTGTGCACGTGCAATGCAATGCAAGCAAGTTTCACCAAGGTGCTCTGTT
24+
AGTGGCAGTCCTCCCCGAATACGTGCTCGGTACCATCGCGGGAGGGACCGGGAATGAGAA
25+
CTCCCACCCCCCTTATGCCACCACGCAGCCTGGTCAAGTTGGTGCAGTCTTGACGCACCC
26+
CTATGTACTAGATGCGGGGATTCCCTTGAGTCAATTGACCGTGTGCCCACATCAGTGGAT
27+
CAACCTGAGAACTAACAATTGTGCAACCATCATAGTTCCATACATGAATACAGTTCCTTT
28+
TGATTCAGCTCTTAACCACTGCAATTTCGGTCTGCTGGTCGTCCCAGTGGTGCCGTTGGA
29+
TTTCAATACAGGCGCCACGTCCGAAATTCCTATTACAGTCACCATAGCCCCTATGTGTGC
30+
AGAGTTCGCAGGCCTCCGCCAGGCAGTGAAACAGGGCATTCCCACAGAGCTTAAACCTGG
31+
TACCAACCAGTTTCTCACTACCGACGATGGTGTGTCTGCGCCAATTTTACCAGGTTTTCA
32+
TCCAACCCCACCTATACACATACCAGGGGAAGTGCACAATCTATTAGAAATATGTAGAGT
33+
GGAAACTATCCTGGAAGTTAATAATCTAAAGACTAATGAGACTACCCCTATGCAACGCTT
34+
GTGTTTTCCAGTCTCGGTACAGAGCAAGACAGGTGAACTGTGTGCCGCCTTCAGGGCAGA
35+
TCCTGGAAGAGATGGCCCGTGGCAGTCCACAATATTGGGCCAACTTTGTCGATACTATAC
36+
ACAGTGGTCAGGCTCATTGGAGGTGACATTCATGTTCGCAGGCTCGTTTATGGCCACAGG
37+
CAAGATGCTTATTGCCTACACCCCGCCTGGAGGGAACGTACCTGCAGACAGAATCACGGC
38+
AATGCTAGGAACACATGTGATCTGGGACTTTGGATTGCAGTCCTCTGTGACGTTGGTCGT
39+
GCCATGGATTAGCAACACACATTACAGAGCACACGCCCGTGCTGGGTACTTTGACTATTA
40+
TACTACTGGTATCATAACCATATGGTATCAAACTAATTATGTAGTGCCCATTGGAGCCCC
41+
CACCACAGCTTATATTGTAGCCTTGGCAGCAGCCCAAGATAACTTCACCATGAAACTATG
42+
CAAGGACACTGAAGATATTGAGCAAACAGCTAACATACAAGGGGACCCTATTGCAGACAT
43+
GATCGACCAAACTGTAAACAGTCAAGTGAATCGCTCCTTAACTGCACTGCAGGTACTACC
44+
CACAGCTGCAGATACTGAAGCAAGCAGTCATAGATTAGGCACCGGTGTGGTACCAGCGCT
45+
GCAGGCCGCAGAGACTGGGGCGTCATCAAACGCCAGTGACAAAAACCTCATTGAAACTAG
46+
ATGTGTGTTGAACCATCATTCCACACAAGAGACAGCCATTGGGAACTTTTTCAGCCGCGC
47+
TGGTCTGGTCAGCATTATCACAATGCCCACCACGGGCACACAGAACACAGAGGGCTATGT
48+
CAACTGGGACATTGATTTAATGGGGTATGCTCAATTACGGCGGAAGTGTGAGTTGTTTAC
49+
CTACATGCGTTTCGACGCTGAATTCACGTTCGTCGTGGCTAAGCCCAACGGTGAGTTAGT
50+
CCCTCAACTGTTACAGTACATGTATGTCCCGCCAGGGGCGCCGAAGCCCACTTCCAGAGA
51+
TTCATTCGCCTGGCAAACAGCCACTAACCCGTCTGTGTTTGTGAAGATGACAGACCCGCC
52+
AGCTCAAGTGTCAGTACCTTTTATGTCACCAGCCAGCGCGTATCAGTGGTTCTATGATGG
53+
TTATCCCACTTTTGGAGAACATCTCCAAGCAAACGACCTAGACTATGGTCAATGCCCAAA
54+
CAATATGATGGGCACCTTCAGCATCAGAACAGTAGGGACTGAGAAATCACCACACTCTAT
55+
TACCCTGAGAGTATACATGCGGATTAAGCATGTTAGGGCATGGATCCCGAGGCCTTTGAG
56+
AAACCAACCCTATCTATTTAAAACCAACCCTAACTACAAAGGGAACGATATCAAGTGTAC
57+
CAGCACTAGTAGAGATAAAATAACAACCTTAGGAAAATTTGGACAGCAATCAGGAGCCAT
58+
ATATGTAGGTAATTACAGGGTGGTGAACCGGCACCTTGCCACGCACAATGATTGGGCAAA
59+
TCTTGTGTGGGAAGATAGCTCAAGGGATCTACTAGTCTCTTCCACCACTGCACAAGGGTG
60+
CGACACTATAGCTAGATGTGATTGCCAAACCGGGGTGTACTATTGCAACTCCAGGAGGAA
61+
ACACTACGCAGTTAGTTTCACTAAACCTAGCCTAATCTTTGTAGAGGCTAGCGAGTACTA
62+
TCCAGCTAGATATCAGTCACATCTCATGCTTGCTGCAGGCCATTCTGAACCAGGAGACTG
63+
TGGGGGGATCCTCAGGTGCCAGCATGGTGTTGTGGGCATTGTCTCCACTGGGGGCAATGG
64+
CCTAGTTGGATTTGCTGATGTCAGGGATCTTTTGTGGCTAGATGAGGAGGCCATGGAGCA
65+
AGGTGTCTCCGACTATATCAAGGGACTTGGCGATGCCTTCGGCACGGGCTTTACTGATGC
66+
AGTGTCCAGAGAGGTGGAAGCTTTGAAAACCTACCTAATTGGTTCTGAAGGGGCAGTTGA
67+
GAAAATTTTAAAAAATTTGGTTAAGCTGATTTCAGCACTAGTTATAGTAATCAGAAGTGA
68+
CTATGATATGGTCACCCTCACAGCTACTCTGGCTCTCATAGGCTGTCATGGTAGTCCCTG
69+
GGCATGGATCAAAGCAAAAGCAGCGTCCATCTTAGGCATCCCCATTGCCCAGAAACAGAG
70+
TGCATCATGGCTGAAGAAATTCAACGATATGGCCAATGCTGCTAAGGGTCTAGAGTGGAT
71+
ATCCAACAAGATTAGCAAATTCATTGATTGGCTCAAAGAGAAGATCATACCAGCGGCCAA
72+
AGAGAAAGTGGAATTCCTGAACAATTTGAAGCAGTTACCATTGTTGGAAAATCAGATATC
73+
AAACTTGGAACAATCAGCAGCTTCACAGGAAGATCTTGAGGCAATGTTTGGGAACGTATC
74+
GTACCTTGCGCACTTTTGTCGCAAGTTCCAACCACTCTACGCTACAGAGGCAAAGAGAGT
75+
TTATGCACTAGAGAAAAGAATGAATAACTACATGCAGTTCAAGAGCAAACACCGTATTGA
76+
ACCTGTATGTCTTATCATCAGAGGCTCGCCAGGCACTGGGAAATCCCTAGCAACCGGAAT
77+
CATTGCCCGAGCAATAGCCGATAAATACCACTCCAGTGTATACTCACTCCCACCAGACCC
78+
AGACCACTTTGATGGTTATAAGCAACAAGTGGTCACAGTTATGGATGACTTGTGCCAAAA
79+
TCCAGATGGCAAGGACATGTCGTTGTTTTGCCAGATGGTATCCACCGTGGATTTTATCCC
80+
GCCAATGGCATCTCTGGAGGAAAAAGGAGTCTCTTTCACATCCAAATTTGTGATTGCGTC
81+
CACCAATGCTAGCAATATCATAGTGCCAACAGTATCTGACTCAGATGCCATCCGTCGCAG
82+
ATTTTATATGGACTGTGACATCGAAGTGACGGATTCGTATAAAACAGACTTAGGCAGATT
83+
GGACGCTGGGCGGGCTGCTAAGTTGTGCTCTGATAACAACACAGCAAATTTCAAGCGTTG
84+
CAGCCCGCTAGTATGTGGGAAAGCTATCCAGTTGAGAGACAGAAAATCCAAGGTTAGGTA
85+
CAGCGTGGATACAGTGGTTTCTGAGCTTATAAGGGAGTACAACAACAGGTCTGCTATCGG
86+
AAACACAATTGAGGCATTATTCCAGGGACCACCTAAATTTAGGCCTATTAGGATTAGTCT
87+
AGAAGAGAAACCAGCTCCAGATGCTATCAGTGATCTTCTTGCCAGCGTAGATAGTGAAGA
88+
GGTGCGCCAATACTGTAGAGATCAAGGTTGGATCATCCCAGAAACTCCCACCAATGTTGA
89+
GCGGCATCTCAACAGGGCTGTATTAATTATGCAATCCATTGCTACAGTGGTGGCAGTTGT
90+
CTCGCTTGTGTACGTTATCTACAAACTCTTTGCTGGCTTCCAGGGCGCATACTCTGGTGC
91+
TCCTAAGCAGGTTCTTAAGAAACCCATCCTTCGCACGGCAACAGTGCAAGGTCCAAGTCT
92+
TGATTTTGCTCTGTCCTTACTGAGGAGAAACATCAGACAGGTTCAGACAGATCAGGGGCA
93+
TTTTACCATGTTAGGAGTCAGGGACCGCTTGGCTGTCCTTCCGCGACACTCGCAGCCTGG
94+
AAAAACAATTTGGGTGGAGCACAAGCTCGTGAACATTTTGGATGCCGTTGAGTTGGTGGA
95+
TGAGCAAGGAGTCAATTTGGAACTCACCCTAGTCACCCTTGACACTAACGAGAAATTTAG
96+
AGACATCACTAAGTTCATCCCGGAGAACATCAGCGCCGCCAGTGATGCCACCCTGGTGAT
97+
CAATACAGAGCATATGCCTTCAATGTTTGTTCCAGTAGGTGATGTTGTGCAGTATGGTTT
98+
TCTAAACCTTAGTGGAAAGCCCACTCACCGCACCATGATGTATAACTTCCCTACCAAAGC
99+
AGGGCAGTGTGGAGGGGTGGTGACGTCAGTTGGAAAGATCATTGGCATCCACATAGGGGG
100+
CAATGGTAGGCAAGGCTTCTGTGCGGGACTCAAGAGAAGTTATTTTGCCAGTGAGCAAGG
101+
AGAGATCCAATGGGTGAAGCCAAATAAGGAGACTGGAAGACTCAACATCAATGGTCCAAC
102+
TCGCACCAAGCTTGAACCCAGTGTATTCCATGATGTGTTTGAGGGTGACAAAGAGCCAGC
103+
GGTCTTGCACAGTAAAGATCCTCGCCTTGAAGTGGACTTTGAGCAGGCACTGTTCTCCAA
104+
GTATGTGGGGAATACGCTACATGAGCCTGATGAGTATGTCAGAGAGGCAGCCCTACATTA
105+
TGCAAATCAGTTGAAACAACTAGACATAGACACCTCTCAAATGAGCATGGAGGAAGCTTG
106+
TTATGGCACAGATAACCTTGAGGCCATTGATCTCCACACCAGCGCAGGTTACCCTTACAG
107+
TGCTTTGGGAATCAAAAAGAGGGACATTTTGGATCCTACCACTAGGGATGTGGGTAAGAT
108+
GAAATTTTACATGGACAAGTATGGTCTTGACCTCCCTTACTCCACCTATGTTAAGGATGA
109+
GCTACGCTCAATAGATAAGATCAAGAAAGGAAAATCCCGCTTGATTGAAGCCAGCAGCTT
110+
GAATGACTCAGTCTACCTCAGAATGGCTTTCGGGCATCTCTATGAAGCTTTCCATGCAAA
111+
CCCAGGGACTGTGACTGGTTCAGCTGTAGGGTGCAACCCAGATGTGTTTTGGAGTAAACT
112+
ACCAATTCTGCTTCCTGGTTCCCTATTCGCCTTTGACTACTCGGGCTATGATGCCAGTCT
113+
CAGTCCAGTTTGGTTTAGGGCGTTGGAGCTAGTTCTCAGAGAAATAGGCTATAGTGAGGA
114+
GGCAGTTTCACTCATTGAGGGAATCAACCATACGCACCATGTGTACCGCAATAAAACCTA
115+
TTGTGTACTTGGTGGGATGCCCTCAGGCTGCTCAGGCACATCCATTTTCAACTCGATGAT
116+
TAACAACATCATCATTAGAGCATTGCTTATTAAGACATTTAAGGGTATTGACTTGGATGA
117+
ACTCAATATGGTTGCTTACGGGGACGACGTGCTTGCCAGTTACCCATTTCCGATTGACTG
118+
CCTAGAATTAGCAAAAACAGGTAAAGAGTATGGCTTAACCATGACTCCTGCAGATAAGTC
119+
TCCTTGCTTTAATGAAGTTAATTGGGAGAATGCAACCTTCCTTAAGAGAGGCTTCTTGCC
120+
TGATGAACAATTCCCATTCTTGATTCACCCAACCATGCCAATGAAGGAGATCCACGAGTC
121+
TATTCGATGGACCAAGGACGCACGCAACACACAAGATCATGTGAGATCCTTATGCTTATT
122+
GGCGTGGCACAACGGCAAGCAAGAATATGAAAAATTTGTGAGCACAATTAGATCTGTCCC
123+
GGTGGGAAAGGCATTGGCAATTCCAAACTATGAAAATCTGAGACGCAATTGGCTCGAACT
124+
ATTTTAGAGGTTAAACACACCTCAACCCCACCAGAAATCTGGTCGTGAACATGACTGGTG
125+
GGGGTAAATTTGTTATAACCAGAATAGCAAA
