Skip to content

Commit 85e9784

Browse files
committed
Implement distance models (jc69, k80, hky, tn93) & change window behaviour
1 parent b4a9aea commit 85e9784

2 files changed

Lines changed: 470 additions & 114 deletions

File tree

README.md

Lines changed: 53 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@ If you use this program and wish to cite it, please cite both this repository (s
1313
Given one or more viral genome sequences in a fasta file, the script:
1414
1. Aligns the sequences (optional; when an alignment is given as input, this step can be skipped with `--no-align`)
1515
2. Splits the alignment into overlapping windows of a chosen size.
16-
3. Calculates the pairwise similarity between a query and other sequences in each window.
16+
3. Calculates the pairwise distance between a query and other sequences in each window.
1717
4. Produces:
1818
a) a plot showing how similarity changes along the genome, and
1919
b) a CSV table with similarity values (optional).
@@ -23,12 +23,24 @@ Given one or more viral genome sequences in a fasta file, the script:
2323

2424
## How similarity is calculated
2525

26-
For each sliding window:
27-
- Count how many positions differ between the query and a reference sequence (Hamming distance).
28-
- Divide that by the number of valid positions (ignoring gaps and Ns depending on your settings).
29-
- That gives the p-distance: $p = \text{differences} / \text{valid positions}$.
30-
- Then $\text{similarity} = 1 - p$.
31-
That’s what’s plotted along the genome.
26+
For each sliding window, the distance between the query and each reference sequence is computed from the valid (unambiguous nucleotide) positions in that window. Positions containing gaps or ambiguous characters are excluded. The similarity plotted is `1 − distance`.
27+
28+
Five distance models are currently available (see `--distance-model`):
29+
30+
| Model | Description |
31+
|-------|-------------|
32+
| `pdist` | p-distance: raw proportion of differing sites (default) |
33+
| `jc69` | Jukes-Cantor 1969: single-rate correction for multiple hits |
34+
| `k80` | Kimura 1980: separate rates for transitions and transversions |
35+
| `hky` | Hasegawa-Kishino-Yano 1984/85: empirical base frequencies, single transition rate |
36+
| `tn93` | Tamura-Nei 1993: empirical base frequencies, separate purine/pyrimidine transition rates |
37+
38+
For `hky` and `tn93`, base frequencies are estimated from the full alignment by default. Per-window estimation can be enabled with `--local-freqs`.
39+
40+
If a distance formula is undefined for a window (e.g. due to sequence saturation), that window is shown as a gap in the plot and recorded as `NaN` in the CSV.
41+
42+
Windows are centered such that every plotted point represents exactly `--windowsize` sites. The first center is at position `windowsize // 2` and the last is the final position where a full window still fits within the alignment; edge positions are not covered by truncated windows.
43+
3244

3345
## Requirements and Installation
3446
Requires Python ≥ 3.9 and the following libraries:
@@ -72,45 +84,48 @@ SimPlots will be generated for Query1 and Query2 in sequences.fasta, using all o
7284

7385
SimPlots will be generated for all sequences in sequences.fasta, using all sequences in references.fasta as references.
7486

75-
Window size, step size, output directories, metadata, colors, etc. can be customized using the arguments listed below.
87+
Window size, step size, distance model, output directories, metadata, colors, etc. can be customized using the arguments listed below.
7688

7789
## Arguments
7890

7991
| Flag | Description |
8092
| -------------------------------- | ------------------------------------------------------------------------------ |
81-
| `-s`, `--sequences` | Path to the main sequence file (.fasta) |
82-
| `-q`, `--query-id` | ID of query sequence(s) within the alignment (mutually exclusive with `--reference-alignment`). |
83-
| `-i`, `--include-queries-as-refs` | If set, treat other --query-id sequences as references for each query (default: excluded). |
84-
| `-r`, `--reference-sequences` | Path to a separate reference alignment (must be the same nucleotide length; mutually exclusive with `--query-id`). |
85-
| `-n`, `--no-align` | If set, skip MAFFT alignment before similarity plotting. Else, align sequences before plotting using `mafft --auto`. |
86-
| `-t`, `--threads` | Number of threads to use for MAFFT alignment (default: 1). |
87-
| `-m`, `--metadata` | Optional CSV/TSV file with sequence info (mapping accessions to genotypes). If provided, genotype information will be added to the output plots. |
88-
| `-mi`, `--metadata-id-col` | Column name in metadata for sequence IDs (default: `Accession`). |
89-
| `-mg`, `--metadata-genotype-col` | Column name in metadata for genotype info (default: `Genotype`). |
93+
| `-s`, `--sequences` | Path to the main sequence file (.fasta) |
94+
| `-q`, `--query-id` | ID of query sequence(s) within the alignment (mutually exclusive with `-r`). |
95+
| `-i`, `--include-queries-as-refs` | If set, treat other `--query-id` sequences as references for each query (default: excluded). |
96+
| `-r`, `--reference-sequences` | Path to a separate reference fasta (mutually exclusive with `-q`). |
97+
| `-n`, `--no-align` | If set, skip MAFFT alignment. Input sequences must already be aligned. |
98+
| `-t`, `--threads` | Number of threads for MAFFT alignment (default: 1). |
99+
| `-dm`, `--distance-model` | Distance model: `pdist` (default), `jc69`, `k80`, `hky`, `tn93`. See above. |
100+
| `-lf`, `--local-freqs` | Estimate base frequencies per window rather than from the full alignment (only affects `hky` and `tn93`). |
101+
| `-mgf`, `--max-gap-frequency` | Maximum allowed proportion of gap/ambiguous positions per window (default: 0.1). Windows exceeding this threshold are skipped and shown as gaps in the plot. |
102+
| `-ws`, `--windowsize` | Window size in nucleotides (default: 100). |
103+
| `-ss`, `--stepsize` | Step size between window centers (default: 50). |
104+
| `-m`, `--metadata` | Optional CSV/TSV file mapping sequence IDs to genotypes. |
105+
| `-mi`, `--metadata-id-col` | Column name in metadata for sequence IDs (default: `Accession`). |
106+
| `-mg`, `--metadata-genotype-col` | Column name in metadata for genotype info (default: `Genotype`). |
90107
| `-mm`, `--metadata-mode` | Whether metadata applies to `query`, `reference`, or `both` (default: `both`). |
91-
| `-c`, `--colors` | Optional file mapping genotypes to colors (`tsv` or `csv`). |
92-
| `-ws`, `--windowsize` | Window size (default: 100). |
93-
| `-ss`, `--stepsize` | Step size between windows (default: 50). |
94-
| `-g`, `--gaps` | How to treat gaps: 0 = skip position if one or both sequences have a gap, 1 = mismatch if one has a gap, match if both have a gap, 2 = mismatch if one has a gap, skip position if both have a gap. |
95-
| `-f`, `--outformat` | Output file format for the plots: `png`, `pdf`, `svg`, or `jpg` (default: `png`). |
96-
| `-ht`, `--height` | Height of the entire figure in inches (default: 5.0). |
97-
| `-wd`, `--width` | Width of the plotting axes area in inches (default: 14.0). |
98-
| `-p`, `--outplots` | Directory for plot outputs (default: `simplots/`). |
99-
| `-o`, `--outcsv` | Directory for CSV outputs (optional; if not provided, tables will not be saved). |
100-
| `-oa`, `--outaln` | Output file path for alignment in fasta format (optional). If not provided, the alignment will not be saved. |
108+
| `-c`, `--colors` | Optional file mapping genotypes to colors (`tsv` or `csv`). |
109+
| `-f`, `--outformat` | Output plot format: `png` (default), `pdf`, `svg`, or `jpg`. |
110+
| `-ht`, `--height` | Figure height in inches (default: 5.0). |
111+
| `-wd`, `--width` | Axes width in inches (default: 14.0). |
112+
| `-p`, `--outplots` | Directory for plot outputs (default: `simplots/`). |
113+
| `-o`, `--outcsv` | Directory for CSV outputs (optional). |
114+
| `-oa`, `--outaln` | Output path for the alignment in fasta format (optional). |
101115

102116

103117
## Output
104118

105119
Each run creates:
106-
- One or multiple similarity plots (`simplots/<query>_simplot.png`)
107-
- A similarity table (if `--outcsv` is set)
120+
- One or multiple similarity plots (`simplots/<query>_<model>_simplot.png`)
121+
- A similarity table (`<query>_<model>_similarity_results.csv`, if `--outcsv` is set)
108122
- An alignment fasta (if `--outaln` is set)
109123

110124
Plots show:
111125
- Genome position on the x-axis
112-
- Similarity (1 − p-distance) on the y-axis
113-
- One line per reference sequence (colored by genotype if available)
126+
- Similarity (1 − distance) on the y-axis
127+
- One line per reference sequence (colored by genotype if metadata is provided)
128+
- Window size, step size, and distance model shown in the lower left corner
114129

115130

116131
## Examples
@@ -130,7 +145,7 @@ python simplot.py \
130145
<br>
131146

132147
**With a separate reference alignment** <br>
133-
Compare all query sequences in query_alignment.fasta to all references in references_alignment.fasta (here again, the sequences in the two fasta files are already aligned, so `--no-align` is used):
148+
Compare all query sequences in query_alignment.fasta to all references in references_alignment.fasta:
134149

135150
```
136151
python simplot.py \
@@ -143,17 +158,21 @@ python simplot.py \
143158
```
144159
<br>
145160

146-
**With metadata and custom colors** <br>
147-
Providing a metadata.csv/tsv file which maps sequence IDs to genotypes enables annotation of genotypes in the output plots as well as coloring the lines by genotype. Default expected metadata column names are "Accession" and "Genotype", but other column names can be specified using `--metadata-id-col` (`-mi`) and `--metadata-genotype-col` (`-mg`). Custom genotype colors can be used by providing a colors.csv/tsv file which maps genotype names to color codes.
161+
**With a model-based distance, metadata and custom colors** <br>
162+
Use the TN93 distance model and annotate sequences by genotype:
148163

149164
```
150165
python simplot.py \
151166
-s demo_data/query_alignment.fasta \
152167
-r demo_data/reference_alignment.fasta \
168+
-dm tn93 \
153169
-m demo_data/metadata.csv \
154170
-c demo_data/colors.tsv \
155171
--no-align
156172
```
173+
<br>
174+
175+
Providing a metadata.csv/tsv file (`-m`) which maps sequence IDs to genotypes enables annotation of genotypes in the output plots as well as coloring the lines by genotype. Default expected metadata column names are "Accession" and "Genotype", but other column names can be specified using `--metadata-id-col` (`-mi`) and `--metadata-genotype-col` (`-mg`). Custom genotype colors can be used by providing a colors.csv/tsv (`-c`) file which maps genotype names to color codes.
157176

158177
Example `metadata.csv`:
159178
```

0 commit comments

Comments
 (0)