 - ⚡ **split-fastq**: Perform ultra-fast, alignment-free genotyping of SNPs, MNVs, and both small and large structural variants (Indels/SVs) directly from raw FASTQ reads.
+- 🔎 **match**: Quickly find the best-matching reference genome for your raw sequencing reads from a collection of references in a multi-FASTA file.
+
 ## Key Features
 - **Dynamic Variant Detection**: The `classify` and `split-fastq` modules can detect SNPs, MNVs, and Indels (including large SVs) using a flexible marker format.

-- **High-Speed & Alignment-Free**: The `split-fastq` engine genotypes variants directly from raw reads, bypassing the need for computationally expensive alignment.
+- **High-Speed & Alignment-Free**: The `split-fastq` and `match` engines operate directly on raw reads, bypassing the need for computationally expensive alignment.

-- **Flexible Input**: Process samples individually, in batches from the command line, or using a convenient sample list file.
+- **Flexible Input**: Process samples individually, in batches from the command line, or using a convenient sample list file (TSV format).

 - **Efficient & Parallel**: Optimized for performance using Rayon to leverage all available CPU cores by default.
+# 5. Find the best matching reference for a set of reads
+pathotypr match --input sample_R1.fq.gz sample_R2.fq.gz --references all_references.fasta --output best_match_report.tsv
 ```

 ## Documentation
@@ -83,13 +86,11 @@ Builds and trains a Random Forest model from a multifasta file where headers are
 | :--- | :--- | :--- | :--- |
 | --input | -i | Path to the input multifasta file. Headers must be in Lineage_sequenceID format. | Required |
 | --output | -o | Path for the unified output model file (e.g., my_model.pathotypr.gz). | Required |
-| --kmer-size | -k | The size of the k-mers to generate from sequences. | 6 |
+| --kmer-size | -k | The size of the k-mers to generate from sequences. | 21 |
 | --test-split | -s | Proportion of the data to use for the test set. | 0.2 (20%) |
 | --threads | -t | Number of CPU threads to use. | All available |
 | --verbose | -v | Set the verbosity level. Use -v for debug, -vv for trace. | Off |

-_The tool will warn you if it detects a strong class imbalance in your training data._
-
 **Usage**:
 ```bash
 pathotypr train \
@@ -119,6 +120,29 @@ pathotypr predict \
 [OPTIONS]
 ```

+### 🔎 `match`
+Finds the best matching reference genome for a set of FASTQ reads by comparing them against a collection of references in a multi-FASTA file. It calculates a weighted k-mer containment score to determine similarity.
+
+**Arguments**:
+| Option | Flag | Description | Default |
+| :--- | :--- | :--- | :--- |
+| --input | -i | One or more FASTQ files to analyze. | One input required |
+| --input-list | -l | Path to a TSV file listing FASTQ files (name\tpath1[\tpath2...]). | One input required |
+| --references | -r | Path to a single multi-FASTA file containing all reference genomes. | Required |
+| --output | -o | Path for the output TSV report. Prints to console if not provided. | Optional |
+| --kmer-size | -k | The size of the k-mers to use for comparison. | 31 |
+| --threads | -t | Number of CPU threads to use. | All available |
+| --verbose | -v | Set the verbosity level. Use -v for debug, -vv for trace. | Off |
+
+**Usage**:
+```bash
+pathotypr match \
+--input <READS_1.FQ> <READS_2.FQ> \
+--references <MULTI_FASTA_REFS> \
+--output <BEST_MATCH_REPORT.tsv> \
+[OPTIONS]
+```
+
 ### 🧬 `classify` & ⚡ `split-fastq`
 These two modules share a powerful, dynamic variant detection engine but operate on different inputs.

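The `match` section added in the hunk above describes scoring references by k-mer containment. The Python sketch below illustrates the general idea only: it computes a plain, unweighted containment (the fraction of a reference's distinct k-mers that also occur in the reads) and picks the highest-scoring reference. The function names, the toy sequences, and `k = 4` are assumptions made for illustration; the diff does not specify how pathotypr weights its score, and this sketch ignores reverse complements and base qualities.

```python
def kmers(seq, k):
    """Yield every k-length substring of seq (uppercased), skipping k-mers containing N."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            yield kmer


def containment(reads, reference, k):
    """Fraction of the reference's distinct k-mers that occur in at least one read."""
    ref_kmers = set(kmers(reference, k))
    if not ref_kmers:
        return 0.0
    read_kmers = set()
    for read in reads:
        read_kmers.update(kmers(read, k))
    return len(ref_kmers & read_kmers) / len(ref_kmers)


def best_match(reads, references, k):
    """Score every reference and return (name of the best one, all scores)."""
    scores = {name: containment(reads, seq, k) for name, seq in references.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical): two short reads and two candidate references.
reads = ["ACGTAC", "GTACGG"]
references = {"refA": "ACGTACGG", "refB": "TTTTTTTT"}

best, scores = best_match(reads, references, k=4)
print(best, scores)
```

A weighted variant would replace the set intersection with a sum of per-k-mer weights, for example read multiplicities. Which weights pathotypr actually uses is not stated in this diff.
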
@@ -137,7 +161,7 @@ Both commands use the same flexible TSV format for defining variants:

 **Example** `markers.tsv`:
 ```text
-#pos ref alt marker annotation
+#pos ref alt marker annotation
 761109 G T rpoB_p.Asp435Tyr DrugResistance
 761109 GAC TAT rpoB_p.Asp435Tyr DrugResistance
 2155162 C CAT katG_p.Ser315Thr Compensatory
@@ -165,18 +189,18 @@ Both commands use the same flexible TSV format for defining variants:
 pathotypr classify \
 --markers <MARKERS_TSV> \
 --reference <REF_FASTA> \
---output <OUTPUT_TSV> \
---genome-fasta <GENOMES_FASTA> \
+--output-prefix <PREFIX> \
+--input <GENOME_FASTA> \
 [OPTIONS]
 ```
 #### Functional Annotation with GFF
 The classify command can translate SNPs into amino acid changes if provided with a GFF3 annotation file.

 How to provide GFF files:
-- For a single FASTA input (--input): Use the --gff flag to specify a single GFF file that corresponds to the sequences in the FASTA file.
-- For multiple genomes via a list (--input-list): Add a third, optional column to your TSV file containing the path to the corresponding GFF file for each genome.
+- For a single FASTA input (`--input`): Use the `--gff` flag to specify a single GFF file that corresponds to the sequences in the FASTA file.
+- For multiple genomes via a list (`--input-list`): Add a third, optional column to your TSV file containing the path to the corresponding GFF file for each genome.
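
For readers preparing a marker file, the `markers.tsv` layout shown earlier in this diff (`pos`, `ref`, `alt`, `marker`, `annotation`) can be read with a few lines of Python. This is a minimal sketch, not pathotypr's own parser: the `Marker` type, the function names, and the rule that classifies a variant from the lengths of `ref` and `alt` are assumptions made for illustration, and fields are assumed to contain no internal whitespace.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    pos: int
    ref: str
    alt: str
    name: str
    annotation: str


def variant_type(ref, alt):
    """Classify by allele length (assumed convention, not taken from pathotypr)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNV"
    return "Indel"


def parse_markers(text):
    """Parse marker rows, skipping blank lines and '#' header or comment lines."""
    markers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Raises ValueError if a row has fewer than five fields.
        pos, ref, alt, name, annotation = line.split()[:5]
        markers.append(Marker(int(pos), ref, alt, name, annotation))
    return markers


# The example rows from the README's markers.tsv block.
example = """#pos ref alt marker annotation
761109 G T rpoB_p.Asp435Tyr DrugResistance
761109 GAC TAT rpoB_p.Asp435Tyr DrugResistance
2155162 C CAT katG_p.Ser315Thr Compensatory
"""

markers = parse_markers(example)
types = [variant_type(m.ref, m.alt) for m in markers]
print(types)
```
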