You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+53-34Lines changed: 53 additions & 34 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -13,7 +13,7 @@ If you use this program and wish to cite it, please cite both this repository (s
13
13
Given one or more viral genome sequences in a fasta file, the script:
14
14
1. Aligns the sequences (optional; when an alignment is given as input, this step can be skipped with `--no-align`)
15
15
2. Splits the alignment into overlapping windows of a chosen size.
16
-
3. Calculates the pairwise similarity between a query and other sequences in each window.
16
+
3. Calculates the pairwise distance between a query and other sequences in each window.
17
17
4. Produces:
18
18
a) a plot showing how similarity changes along the genome, and
19
19
b) a CSV table with similarity values (optional).
@@ -23,12 +23,24 @@ Given one or more viral genome sequences in a fasta file, the script:
23
23
24
24
## How similarity is calculated
25
25
26
-
For each sliding window:
27
-
- Count how many positions differ between the query and a reference sequence (Hamming distance).
28
-
- Divide that by the number of valid positions (ignoring gaps and Ns depending on your settings).
29
-
- That gives the p-distance: $p = \text{differences} / \text{valid positions}$.
30
-
- Then $\text{similarity} = 1 - p$.
31
-
That’s what’s plotted along the genome.
26
+
For each sliding window, the distance between the query and each reference sequence is computed from the valid (unambiguous nucleotide) positions in that window. Positions containing gaps or ambiguous characters are excluded. The similarity plotted is `1 − distance`.
27
+
28
+
Five distance models are currently available (see `--distance-model`):
29
+
30
+
| Model | Description |
31
+
|-------|-------------|
32
+
|`pdist`| p-distance: raw proportion of differing sites (default) |
33
+
|`jc69`| Jukes-Cantor 1969: single-rate correction for multiple hits |
34
+
|`k80`| Kimura 1980: separate rates for transitions and transversions |
35
+
|`hky`| Hasegawa-Kishino-Yano 1984/85: empirical base frequencies, single transition rate |
36
+
|`tn93`| Tamura-Nei 1993: empirical base frequencies, separate purine/pyrimidine transition rates |
37
+
38
+
For `hky` and `tn93`, base frequencies are estimated from the full alignment by default. Per-window estimation can be enabled with `--local-freqs`.
39
+
40
+
If a distance formula is undefined for a window (e.g. due to sequence saturation), that window is shown as a gap in the plot and recorded as `NaN` in the CSV.
41
+
42
+
Windows are centered such that every plotted point represents exactly `--windowsize` sites. The first center is at position `windowsize // 2` and the last is the final position where a full window still fits within the alignment; edge positions are not covered by truncated windows.
43
+
32
44
33
45
## Requirements and Installation
34
46
Requires Python ≥ 3.9 and the following libraries:
@@ -72,45 +84,48 @@ SimPlots will be generated for Query1 and Query2 in sequences.fasta, using all o
72
84
73
85
SimPlots will be generated for all sequences in sequences.fasta, using all sequences in references.fasta as references.
74
86
75
-
Window size, step size, output directories, metadata, colors, etc. can be customized using the arguments listed below.
87
+
Window size, step size, distance model, output directories, metadata, colors, etc. can be customized using the arguments listed below.
|`-s`, `--sequences`| Path to the main sequence file (.fasta) |
82
-
|`-q`, `--query-id`| ID of query sequence(s) within the alignment (mutually exclusive with `--reference-alignment`). |
83
-
|`-i`, `--include-queries-as-refs`| If set, treat other --query-id sequences as references for each query (default: excluded). |
84
-
|`-r`, `--reference-sequences`| Path to a separate reference alignment (must be the same nucleotide length; mutually exclusive with `--query-id`). |
85
-
|`-n`, `--no-align`| If set, skip MAFFT alignment before similarity plotting. Else, align sequences before plotting using `mafft --auto`. |
86
-
|`-t`, `--threads`| Number of threads to use for MAFFT alignment (default: 1). |
87
-
|`-m`, `--metadata`| Optional CSV/TSV file with sequence info (mapping accessions to genotypes). If provided, genotype information will be added to the output plots. |
88
-
|`-mi`, `--metadata-id-col`| Column name in metadata for sequence IDs (default: `Accession`). |
89
-
|`-mg`, `--metadata-genotype-col`| Column name in metadata for genotype info (default: `Genotype`). |
93
+
|`-s`, `--sequences`| Path to the main sequence file (.fasta) |
94
+
|`-q`, `--query-id`| ID of query sequence(s) within the alignment (mutually exclusive with `-r`). |
95
+
|`-i`, `--include-queries-as-refs`| If set, treat other `--query-id` sequences as references for each query (default: excluded). |
96
+
|`-r`, `--reference-sequences`| Path to a separate reference fasta (mutually exclusive with `-q`). |
97
+
|`-n`, `--no-align`| If set, skip MAFFT alignment. Input sequences must already be aligned. |
98
+
|`-t`, `--threads`| Number of threads for MAFFT alignment (default: 1). |
|`-lf`, `--local-freqs`| Estimate base frequencies per window rather than from the full alignment (only affects `hky` and `tn93`). |
101
+
|`-mgf`, `--max-gap-frequency`| Maximum allowed proportion of gap/ambiguous positions per window (default: 0.1). Windows exceeding this threshold are skipped and shown as gaps in the plot. |
102
+
|`-ws`, `--windowsize`| Window size in nucleotides (default: 100). |
103
+
|`-ss`, `--stepsize`| Step size between window centers (default: 50). |
|`-ss`, `--stepsize`| Step size between windows (default: 50). |
94
-
|`-g`, `--gaps`| How to treat gaps: 0 = skip position if one or both sequences have a gap, 1 = mismatch if one has a gap, match if both have a gap, 2 = mismatch if one has a gap, skip position if both have a gap. |
95
-
|`-f`, `--outformat`| Output file format for the plots: `png`, `pdf`, `svg`, or `jpg` (default: `png`). |
96
-
|`-ht`, `--height`| Height of the entire figure in inches (default: 5.0). |
97
-
|`-wd`, `--width`| Width of the plotting axes area in inches (default: 14.0). |
98
-
|`-p`, `--outplots`| Directory for plot outputs (default: `simplots/`). |
99
-
|`-o`, `--outcsv`| Directory for CSV outputs (optional; if not provided, tables will not be saved). |
100
-
|`-oa`, `--outaln`| Output file path for alignment in fasta format (optional). If not provided, the alignment will not be saved. |
108
+
|`-c`, `--colors`| Optional file mapping genotypes to colors (`tsv` or `csv`). |
|`-ht`, `--height`| Figure height in inches (default: 5.0). |
111
+
|`-wd`, `--width`| Axes width in inches (default: 14.0). |
112
+
|`-p`, `--outplots`| Directory for plot outputs (default: `simplots/`). |
113
+
|`-o`, `--outcsv`| Directory for CSV outputs (optional). |
114
+
|`-oa`, `--outaln`| Output path for the alignment in fasta format (optional). |
101
115
102
116
103
117
## Output
104
118
105
119
Each run creates:
106
-
- One or multiple similarity plots (`simplots/<query>_simplot.png`)
107
-
- A similarity table (if `--outcsv` is set)
120
+
- One or multiple similarity plots (`simplots/<query>_<model>_simplot.png`)
121
+
- A similarity table (`<query>_<model>_similarity_results.csv`, if `--outcsv` is set)
108
122
- An alignment fasta (if `--outaln` is set)
109
123
110
124
Plots show:
111
125
- Genome position on the x-axis
112
-
- Similarity (1 − p-distance) on the y-axis
113
-
- One line per reference sequence (colored by genotype if available)
126
+
- Similarity (1 − distance) on the y-axis
127
+
- One line per reference sequence (colored by genotype if metadata is provided)
128
+
- Window size, step size, and distance model shown in the lower left corner
114
129
115
130
116
131
## Examples
@@ -130,7 +145,7 @@ python simplot.py \
130
145
<br>
131
146
132
147
**With a separate reference alignment** <br>
133
-
Compare all query sequences in query_alignment.fasta to all references in references_alignment.fasta (here again, the sequences in the two fasta files are already aligned, so `--no-align` is used):
148
+
Compare all query sequences in query_alignment.fasta to all references in references_alignment.fasta:
134
149
135
150
```
136
151
python simplot.py \
@@ -143,17 +158,21 @@ python simplot.py \
143
158
```
144
159
<br>
145
160
146
-
**With metadata and custom colors** <br>
147
-
Providing a metadata.csv/tsv file which maps sequence IDs to genotypes enables annotation of genotypes in the output plots as well as coloring the lines by genotype. Default expected metadata column names are "Accession" and "Genotype", but other column names can be specified using `--metadata-id-col` (`-mi`) and `--metadata-genotype-col` (`-mg`). Custom genotype colors can be used by providing a colors.csv/tsv file which maps genotype names to color codes.
161
+
**With a model-based distance, metadata and custom colors** <br>
162
+
Use the TN93 distance model and annotate sequences by genotype:
148
163
149
164
```
150
165
python simplot.py \
151
166
-s demo_data/query_alignment.fasta \
152
167
-r demo_data/reference_alignment.fasta \
168
+
-dm tn93 \
153
169
-m demo_data/metadata.csv \
154
170
-c demo_data/colors.tsv \
155
171
--no-align
156
172
```
173
+
<br>
174
+
175
+
Providing a metadata.csv/tsv file (`-m`) which maps sequence IDs to genotypes enables annotation of genotypes in the output plots as well as coloring the lines by genotype. Default expected metadata column names are "Accession" and "Genotype", but other column names can be specified using `--metadata-id-col` (`-mi`) and `--metadata-genotype-col` (`-mg`). Custom genotype colors can be used by providing a colors.csv/tsv (`-c`) file which maps genotype names to color codes.
0 commit comments