nextstrain
diff --git a/docs/README.md b/docs/README.md
Lines changed: 0 additions & 1 deletion
diff --git a/docs/dataset-creation-guide.md b/docs/dataset-creation-guide.md
Lines changed: 117 additions & 10 deletions
diff --git a/docs/dataset-curation-guide.md b/docs/dataset-curation-guide.md
Lines changed: 4 additions & 4 deletions
@@ -13,4 +13,3 @@ Additional links:
 - [Nextclade user documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/index.html) - for detailed instructions on how to use Nextclade Web and Nextclade CLI
 - [Nextclade source code repository](https://github.com/nextstrain/nextclade) - for contributors to Nextclade software (code, bug reports, feature requests etc.)
 - [Nextstrain discussion forum](https://discussion.nextstrain.org) - for general discussion and questions
-
@@ -1,8 +1,15 @@
 # Nextclade dataset creation guide
 
-This guide describes how to create a new Nextclade dataset from scratch. Datasets are how Nextclade is customized to work with a particular virus.
+This guide describes how to create a new Nextclade dataset from scratch.
+Datasets are how Nextclade is customized to work with a particular virus.
+Once a dataset is created, it can optionally be shared with other users via the Nextclade data repository or a separate GitHub repository.
+This page provides a detailed walk-through on how to put together a Nextclade dataset. Answers to [Frequently Asked Questions](#faqs) are provided below.
 
-Once a dataset is created, it can be shared with other users via the Nextclade data repository or a separate Github repository.
+> ⚠️ This guide is intended for advanced Nextclade users and enthusiasts who want to create and maintain their own Nextclade datasets, e.g. to add a pathogen or strain that is not yet supported. It assumes basic familiarity with Nextclade CLI and Nextclade Web and some experience with different datasets as a user. If you are not yet comfortable using Nextclade and want to learn more about Nextclade datasets, please refer to the [Nextclade user documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/) first.
+
+> ⚠️ If you are looking for Nextclade software developer documentation, see [Nextclade developer guide](https://github.com/nextstrain/nextclade/blob/master/docs/dev/developer-guide.md) instead.
+
+> ⚠️ If you are looking for how to share your dataset via the Nextclade data repository, please consult the [Nextclade dataset curation guide](dataset-curation-guide.md).
 
 ## Recommended prerequisites
 
@@ -15,7 +22,7 @@ Once a dataset is created, it can be shared with other users via the Nextclade d
 
 A Nextclade dataset is a collection of files. While Nextclade CLI can be used with as little as a reference sequence, a full dataset to be used with Nextclade web requires a general configuration file as well. The full set of files is as follows:
 
-1. A reference sequence against which to align user-provided sequences (required)
+1. A reference sequence in FASTA format against which query sequences are aligned (required)
 1. A genome annotation for the reference sequence in GFF3 format (recommended)
 1. A phylogenetic tree in Auspice JSON format (optional)
 1. A general configuration file (`pathogen.json`) in JSON format (required for Nextclade Web)
@@ -38,7 +45,8 @@ See the Nextclade docs for the [full specification of the reference sequence fil
 
 ## Preparing the genome annotation
 
-Nextclade uses the genome annotation to determine how to extract amino acid sequences from aligned nucleotide sequences. Nextclade requires the genome annotation to be in [GFF3 format](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). Nextclade only uses the `CDS` feature type. Each `CDS` feature is treated separately for translation, alignment and amino acid mutation reporting.
+Nextclade uses the genome annotation to determine how to extract amino acid sequences from aligned nucleotide sequences. Nextclade requires the genome annotation to be in [GFF3 format](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). Nextclade only uses the `CDS` feature type, but parent features such as genes can be included in the annotation.
+Each `CDS` feature is treated separately for translation, alignment and amino acid mutation reporting. A CDS consisting of multiple segments is represented as multiple `CDS` entries in the GFF3 file, linked via a common `ID`.
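+
+As a hedged illustration of a multi-segment CDS, an annotation could look like the sketch below. The sequence ID, source, coordinates and feature names are hypothetical. Columns are tab-separated, and both `CDS` rows share `ID=cds-example`, which marks them as segments of the same translated sequence.
+
+```text
+##gff-version 3
+ref1	example	gene	101	1000	.	+	.	ID=gene-example;Name=example
+ref1	example	CDS	101	400	.	+	0	ID=cds-example;Parent=gene-example;Name=example
+ref1	example	CDS	701	1000	.	+	0	ID=cds-example;Parent=gene-example;Name=example
+```
+
+Each segment in this sketch is 300 nucleotides long, so the joined CDS length (600) is a multiple of 3.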
 
 While Genbank provides annotations in GFF3 format, there is often a discrepancy between what the dataset creator wants to be translated and what is annotated as a CDS in the Genbank provided GFF3 file. In addition, users may want to change names of CDS features to make them more readable in Nextclade output.
 
@@ -102,6 +110,12 @@ Alternatively, one can use a custom script to perform these tasks. We provide an
 
 Further details on the genome annotation file format are available in the [Nextclade docs](https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/03-genome-annotation.html).
 
+If Nextclade reports an error when reading your annotation, please read the error message carefully. Common problems include (i) references to non-existing parents, (ii) non-unique names or IDs, and (iii) CDS whose length is not a multiple of 3. To check your annotation, you can run
+
+```bash
+nextclade read-annotation my_annotation.gff3
+```
+
 ## Preparing a minimal `pathogen.json` config file and accompanying files
 
 A `pathogen.json` config file is required to use a dataset with Nextclade web. Luckily, it is very simple to create by hand. All that is required is the schema version, a manifest of files and their paths. In addition, it is recommended to include basic attributes for display in the UI, such as name of the virus and the reference sequence.
@@ -125,7 +139,6 @@ Lastly, one can enable basic quality control for frame shifts, stop codons, miss
     "reference accession": "NC_012532.1"
   },
   "qc": {
-    "schemaVersion": "1.2.0",
     "missingData": {
       "enabled": true,
       "missingDataThreshold": 2000,
@@ -156,8 +169,9 @@ Lastly, one can enable basic quality control for frame shifts, stop codons, miss
 ```
 
 Further details on the `pathogen.json` file format are available in the [Nextclade docs](https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/05-pathogen-config.html).
+The `files` field of the `pathogen.json` is mandatory and has to list all input files Nextclade should expect to find in the dataset.
 
-Adding example sequences is useful to allow users (and dataset creators) to quickly test the dataset. A README is useful to provide additional information about the dataset. Lastly, a CHANGELOG helps users understand what has changed between different versions of the dataset. As a starting point, they can be minimal:
+While only the `reference.fasta` and the `pathogen.json` are required input files, we recommend adding example sequences. These allow users to quickly explore available datasets and help during dataset creation. A README is useful to provide additional information about the dataset. Lastly, a CHANGELOG helps users understand what has changed between different versions of the dataset. As a starting point, they can be minimal:
 
 `README.md`:
 
@@ -176,6 +190,7 @@ Initial release.
 ```
 
 (The heading `## Unreleased` is required for the CHANGELOG to be accepted by the Nextclade data repo).
+Ultimately, both README and CHANGELOG are intended for users of the dataset and should provide information on what the dataset is intended for and what has changed in the dataset at each update. Notes for dataset creators and curators should instead be documented in the workflow that creates the dataset, not in the dataset itself.
 
 ## Testing the minimal dataset
 
@@ -234,14 +249,14 @@ We provide a simple Snakemake workflow in this repository at [`example-workflow`
 
 You also need to provide paths to the reference sequence (in fasta and genbank formats) and the genome annotation.
 
-The example workflow is a good starting point, but if you want to customize the it's recommended to consult Nextstrain documentation, for example the [Creating a pathogen workflow tutorial](https://docs.nextstrain.org/en/latest/tutorials/creating-a-workflow.html).
+The example workflow is a good starting point, but if you want to customize the workflow, it's recommended to consult the Nextstrain documentation, for example the [Creating a pathogen workflow tutorial](https://docs.nextstrain.org/en/latest/tutorials/creating-a-workflow.html).
 
 ## Next steps
 
 Once you have your dataset created and committed in a Github repository, you can share it with others through a custom URL, e.g.
 
 ```url
-https://master.clades.nextstrain.org/?dataset-url=gh:nextstrain/nextclade_data@docs-v3@/docs/example-workflow/dataset
+https://master.clades.nextstrain.org/?dataset-url=gh:nextstrain/nextclade_data@master@/docs/example-workflow/dataset
 ```
 
 To make your dataset easily discoverable by all Nextclade users, you can submit it to the Nextclade data repository via a pull request. See the [dataset curation guide](dataset-curation-guide.md) for details. If you have any questions, please post them in the [Nextstrain discussion forum](https://discussion.nextstrain.org/).
@@ -258,12 +273,104 @@ Nextclade supports circular genomes. Genes that cross the origin are translated
 
 ### Does Nextclade support spliced genes?
 
-Yes, Nextclade supports spliced genes. Each exon should be annotated as a separate CDS feature according to the GFF3 specifications.
+Yes, Nextclade supports spliced genes. Each exon should be annotated as a separate CDS feature according to the GFF3 specifications. Features belonging to the same translated sequence have to be linked by a common `ID`.
 
 ### Does Nextclade support ribosomal slippage?
 
 Yes, Nextclade supports genes with programmed ribosomal frameshifting as long as the CDS is annotated according to the GFF3 specifications.
 
 ### Must the reference sequence be at the root of the reference tree?
 
-No, generally this is not required, however, the mutations from the reference sequence to the root of the tree must be added to the root branch of the tree. This can be achieved automatically by passing the reference sequence to the `--root-sequence` argument of [`augur ancestral`](https://docs.nextstrain.org/projects/augur/en/stable/usage/cli/ancestral.html).
+No, this is not required. However, the mutations from the reference sequence to the root of the tree must be added to the root branch of the tree. This can be achieved automatically by passing the reference sequence to the `--root-sequence` argument of [`augur ancestral`](https://docs.nextstrain.org/projects/augur/en/stable/usage/cli/ancestral.html).
+
+### My virus is very diverse; many sequences don't align!
+
+Nextclade was initially built for the analysis of SARS-CoV-2 genomes that were all very similar to the reference. This is reflected in the default alignment parameters. For more diverse viruses, you can tune the alignment and seed-matching parameters by setting them in the `pathogen.json`.
+Suggested parameters for high-diversity viruses are given below (note that comments are not allowed in JSON; please remove `//...` when copying this into your `pathogen.json`):
+
+```json5
+    "alignmentParams": {
+        "penaltyGapExtend": 0,             // alignment: allow long gaps (same as default)
+        "penaltyGapOpen": 10,              // alignment: make gaps more expensive relative to mismatches (default 6)
+        "penaltyGapOpenInFrame": 15,       // alignment: increase the in-frame gap opening penalty in coding regions (default 7)
+        "penaltyGapOpenOutOfFrame": 17,    // alignment: increase the out-of-frame gap opening penalty in coding regions (default 8)
+        "excessBandwidth": 20,             // alignment: increase to allow for compensated indels (default 9)
+        "minSeedCover": 0.1,               // seedmatching: attempt alignment even if less of the sequence is covered by extended seed matches (default 0.33)
+        "kmerLength": 6,                   // seedmatching: reduce seed length to find more matches (default 10)
+        "kmerDistance": 25,                // seedmatching: reduce to try more seeds (default 50)
+        "minMatchLength": 30,              // seedmatching: reduce minimal extended match length to keep more seeds (default 40)
+        "allowedMismatches": 15,           // seedmatching: increase allowed mismatches within windowSize during seed extension to keep more seeds (default 8)
+        "windowSize": 30                   // seedmatching: window over which allowedMismatches is counted during seed extension
+    }
+```
+
+You can tweak these parameters further if you think gap penalties should be even higher.
+
+Nextclade CLI uses multiple sources to decide the final alignment parameter values. In order of precedence:
+
+- CLI arguments of the `nextclade run` command, if provided (e.g. `--min-seed-cover`)
+- the fields in the `alignmentParams` section of the dataset's `pathogen.json` file, if a dataset is used and `alignmentParams` are provided (e.g. `alignmentParams.minSeedCover`)
+- built-in hardcoded parameters (in the Nextclade software code), when neither of the above is provided
+
+Note that CLI argument names use kebab-case (e.g. `--min-seed-cover`) while the corresponding JSON field names use camelCase (e.g. `minSeedCover`).
+
+You can see the resulting parameter values for a given dataset and command line by running the `nextclade run` command in verbose mode, i.e. by adding the `--verbose` (`-v`) argument. This displays the final resolved values of all alignment parameters before processing begins.
+
+### My virus is very long with lots of indels; sequences align very poorly!
+
+Long viral genomes, such as those of mpox or herpes viruses, often show significant length variation in the form of large insertions or deletions. The bandwidth that Nextclade provisions for the banded alignment can then be insufficient, and the sequences are not aligned properly. In such cases, the bandwidth parameters should be adjusted in the `pathogen.json`.
+The `mpox/all-clades` dataset uses the following parameters:
+
+```json
+  "alignmentParams": {
+      "excessBandwidth": 100,
+      "terminalBandwidth": 300
+  }
+```
+
+`excessBandwidth` is added on top of the bandwidth estimated from the sketched band. `terminalBandwidth` is used for the parts of the sequence before the first and after the last seed, where indel variation is common and can't be reliably sketched.
+
+> ⚠️ Increasing these parameters increases the memory requirements and run-times of Nextclade.
+
+### How can I align short fragments?
+
+By default, Nextclade requires sequences to be at least 100 bases long. You can reduce this by setting
+
+```json
+  "alignmentParams": {
+      "minLength": 30,
+      "minMatchLength": 20,
+      "kmerDistance": 3,
+      "kmerLength": 6
+  }
+```
+
+### Can I preview how Nextclade reads my annotation file?
+
+Yes, run
+
+```bash
+nextclade read-annotation my_annotation.gff3
+```
+
+Nextclade will print a formatted summary of how your GFF3 file is interpreted. If there are problems reading your annotation, Nextclade tries to identify the problem and prints an informative error message.
+
+### How many example sequences should I include?
+
+Example sequences are meant for users to test the dataset and to help curators develop the dataset. A few dozen sequences that cover viral diversity and different use cases (partial/complete, high quality/low quality) are usually sufficient. Example datasets that are too large make testing cumbersome and increase storage and compute requirements unnecessarily. Ideally, the example sequences are different from the sequences used for the reference tree.
+
+### Should I make multiple datasets for the same virus with different reference sequences?
+
+In earlier versions, Nextclade only reported mutations relative to the alignment reference and private mutations. At that time, it was useful to have datasets that use different reference sequences (e.g. ancestral SARS-CoV-2 Wuhan-Hu-1 and Omicron). Now, Nextclade also reports mutations relative to clade/lineage founders as well as specific strains that can be specified in the `auspice.json`.
+Unless the virus is very diverse and there are clearly recognized 'types' (e.g. Dengue 1-4), it is preferable to provide only a single dataset. This reduces ambiguity about which dataset sequences should be analyzed with and simplifies the automated dataset suggestion.
+
+The documentation on how to specify reference strains relative to which mutations should be reported can be found [here](https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/04-reference-tree.html#relative-mutations).
+
+### Are whole-genome datasets preferable to partial genome datasets?
+
+Yes, at least for viruses with limited recombination. A full genome dataset will cover most use cases and will provide all the functionality that a partial genome dataset does.
+Exceptions to this recommendation include cases where relevant diversity is not represented in the available complete sequences.
+
+### How do I add multiple nomenclature systems to my dataset?
+
+Nextclade supports adding multiple 'clade-like' attributes to the reference tree. These need to be assigned to the nodes of the reference tree (for example using `augur clades`) and specified in the `tree.json` file. Please consult the documentation on [clade-like attributes](https://docs.nextstrain.org/projects/nextclade/en/latest/user/input-files/04-reference-tree.html#clade-like-attributes).
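+
+As a rough sketch, the fragment below declares one additional attribute in the `meta` section of the tree file. The attribute name `my_lineage` and its texts are hypothetical, and the key layout (`extensions.nextclade.clade_node_attrs`) is an assumption that should be checked against the documentation linked above.
+
+```json
+  "meta": {
+    "extensions": {
+      "nextclade": {
+        "clade_node_attrs": [
+          {
+            "name": "my_lineage",
+            "displayName": "My lineage",
+            "description": "Lineage according to a hypothetical alternative nomenclature"
+          }
+        ]
+      }
+    }
+  }
+```
+
+Under this assumption, each node of the tree would carry a matching `my_lineage` entry in its `node_attrs`.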
@@ -1,12 +1,12 @@
 # Nextclade dataset curation guide
 
-This guide explains how to create, update and test Nextclade datasets as well as how to contribute them into the official Nextclade dataset collection.
+This guide explains how Nextclade datasets are structured and how you can contribute new datasets to, or update existing datasets in, the official Nextclade dataset collection.
 
-> ⚠️ If you are a user of Nextclade Web or Nextclade CLI and looking for documentation on how to use Nextclade, see [Nextclade user documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/index.html) instead.
+> ⚠️ This guide is intended for advanced Nextclade users and enthusiasts who want to create and maintain their own Nextclade datasets, e.g. to add a pathogen or strain that is not yet supported. It assumes basic familiarity with Nextclade CLI and Nextclade Web and some experience with different datasets as a user. If you are not yet comfortable using Nextclade and want to learn more about Nextclade datasets, please refer to the [Nextclade user documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/) first.
 
 > ⚠️ If you are looking for Nextclade software developer documentation, see [Nextclade developer guide](https://github.com/nextstrain/nextclade/blob/master/docs/dev/developer-guide.md) instead.
 
-> ⚠️ This guide serves for advanced Nextclade users and enthusiasts who want to create and maintain their own Nextclade datasets, e.g. to add a yet unsupported pathogen or strain. It assumes basic familiarity with Nextclade CLI and Nextclade Web and some experience with different datasets as a user. If you are not yet comfortable using Nextclade and want to learn more about Nextclade datasets, please refer to the [Nextclade user documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/) first.
+> ⚠️ If you are looking for guidance on how to assemble a dataset from scratch and tweak its parameters for a particular virus, see [Nextclade dataset creation guide](dataset-creation-guide.md) instead.
 
 ## Basic principles
 
@@ -48,7 +48,7 @@ Add changes to your forked repository, commit and submit a pull request to the u
 
 Refer to [GitHub documentation "Contributing to projects"](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) for more details.
 
-> 💡 Make sure you [keep your local code up to date](https://github.com/git-guides/git-pull) with the origin repo,  [especially if it's forked](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork).
+> 💡 Make sure you [keep your local code up to date](https://github.com/git-guides/git-pull) with the origin repo, [especially if it's forked](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork).
 
 > 💡 If you are a member of Nextstrain team, then you don't need a fork, and you can contribute directly to the origin repository `nextstrain/nextclade_data`. Nonetheless, please still submit a pull requests for review, rather than pushing changes to branches directly.