Commit 8ef4c9b

authored
Update readme.md
1 parent f3c9dc7 commit 8ef4c9b

1 file changed

Lines changed: 55 additions & 30 deletions

readme.md

@@ -3,14 +3,13 @@
33
</p>
44

55
<div align="center">
6-
76

87
</div>
98

10-
__Paula Ruiz-Rodriguez<sup>1</sup>__
9+
__Paula Ruiz-Rodriguez<sup>1</sup>__
1110
__and Mireia Coscolla<sup>1</sup>__
1211
<br>
13-
<sub> 1. Institute for Integrative Systems Biology, I<sup>2</sup>SysBio, University of Valencia-CSIC, Valencia, Spain </sub>
12+
<sub> 1. Institute for Integrative Systems Biology, I<sup>2</sup>SysBio, University of Valencia-CSIC, Valencia, Spain </sub>
1413

1514
# pathotypr
1615

@@ -24,12 +23,14 @@ __and Mireia Coscolla<sup>1</sup>__
2423

2524
- ⚡ **split-fastq**: Perform ultra-fast, alignment-free genotyping of SNPs, MNVs, and both small and large structural variants (Indels/SVs) directly from raw FASTQ reads.
2625

26+
- 🔎 **match**: Quickly find the best-matching reference genome for your raw sequencing reads from a collection of references in a multi-FASTA file.
27+
2728
## Key Features
2829
- **Dynamic Variant Detection**: The `classify` and `split-fastq` modules can detect SNPs, MNVs, and Indels (including large SVs) using a flexible marker format.
2930

30-
- **High-Speed & Alignment-Free**: The `split-fastq` engine genotypes variants directly from raw reads, bypassing the need for computationally expensive alignment.
31+
- **High-Speed & Alignment-Free**: The `split-fastq` and `match` engines operate directly on raw reads, bypassing the need for computationally expensive alignment.
3132

32-
- **Flexible Input**: Process samples individually, in batches from the command line, or using a convenient sample list file.
33+
- **Flexible Input**: Process samples individually, in batches from the command line, or using a convenient sample list file (TSV format).
3334

3435
- **Efficient & Parallel**: Optimized for performance using Rayon to leverage all available CPU cores by default.
3536

@@ -51,7 +52,7 @@ mamba activate pathotypr
5152
mamba install -c bioconda pathotypr
5253

5354
# Or build from source
54-
git clone https://github.com/PathoGenOmics-Lab/pathotypr.git
55+
git clone https://github.com/PathoGenOmics-Lab/pathotypr.git
5556
cd pathotypr
5657
cargo build --release
5758
```
@@ -70,6 +71,8 @@ pathotypr classify --markers variants.tsv --reference ref.fasta --input my_genom
7071
# 4. Genotype variants directly from raw reads
7172
pathotypr split-fastq --markers variants.tsv --reference ref.fasta -i sample_R1.fq.gz -i sample_R2.fq.gz --paired --output-prefix sample_genotyping
7273

74+
# 5. Find the best matching reference for a set of reads
75+
pathotypr match --input sample_R1.fq.gz sample_R2.fq.gz --references all_references.fasta --output best_match_report.tsv
7376
```
7477

7578
## Documentation
@@ -83,13 +86,11 @@ Builds and trains a Random Forest model from a multifasta file where headers are
8386
| :--- | :--- | :--- | :--- |
8487
| --input | -i | Path to the input multifasta file. Headers must be in Lineage_sequenceID format. | Required |
8588
| --output | -o | Path for the unified output model file (e.g., my_model.pathotypr.gz). | Required |
86-
| --kmer-size | -k | The size of the k-mers to generate from sequences. | 6 |
89+
| --kmer-size | -k | The size of the k-mers to generate from sequences. | 21 |
8790
| --test-split | -s | Proportion of the data to use for the test set. | 0.2 (20%) |
8891
| --threads | -t | Number of CPU threads to use. | All available |
8992
| --verbose | -v | Set the verbosity level. Use -v for debug, -vv for trace. | Off |
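The header convention and the hold-out split described in the table can be illustrated with a short Python sketch. This is not pathotypr's code: `lineage_from_header` and `train_test_split` are hypothetical helper names, splitting on the first underscore is an assumption about how `Lineage_sequenceID` is parsed, and the seeded shuffle is only one plausible way to take a 20% test set.

```python
import random


def lineage_from_header(header):
    """Split a FASTA header of the form Lineage_sequenceID into its two parts.

    Assumption: the lineage is everything before the FIRST underscore.
    """
    header = header.lstrip(">").strip()
    lineage, sep, seq_id = header.partition("_")
    if not sep or not lineage or not seq_id:
        raise ValueError(f"header not in Lineage_sequenceID format: {header!r}")
    return lineage, seq_id


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle records reproducibly and hold out test_fraction of them."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

With ten records and the default `--test-split` of 0.2, this sketch would hold out two records and keep eight for training.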
9093

91-
_The tool will warn you if it detects a strong class imbalance in your training data._
92-
9394
**Usage**:
9495
```bash
9596
pathotypr train \
@@ -119,6 +120,29 @@ pathotypr predict \
119120
[OPTIONS]
120121
```
121122

123+
### 🔎 `match`
124+
Finds the best-matching reference genome for a set of FASTQ reads by comparing them against a collection of references in a multi-FASTA file. It calculates a weighted k-mer containment score to determine similarity.
125+
126+
**Arguments**:
127+
| Option | Flag | Description | Default |
128+
| :--- | :--- | :--- | :--- |
129+
| --input | -i | One or more FASTQ files to analyze. | Required unless `--input-list` is given |
130+
| --input-list | -l | Path to a TSV file listing FASTQ files (name\tpath1[\tpath2...]). | Required unless `--input` is given |
131+
| --references | -r | Path to a single multi-FASTA file containing all reference genomes. | Required |
132+
| --output | -o | Path for the output TSV report. Prints to console if not provided. | Optional |
133+
| --kmer-size | -k | The size of the k-mers to use for comparison. | 31 |
134+
| --threads | -t | Number of CPU threads to use. | All available |
135+
| --verbose | -v | Set the verbosity level. Use -v for debug, -vv for trace. | Off |
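The `--input-list` layout (`name\tpath1[\tpath2...]`) can be read with a few lines of Python, which may help when preparing or validating a sample sheet before a run. This is a sketch only: `parse_input_list` is a hypothetical helper, and skipping blank lines and `#` comment lines is an assumption rather than documented pathotypr behavior.

```python
import csv
import io


def parse_input_list(text):
    """Map each sample name to its list of FASTQ paths.

    Each row is: name <TAB> path1 [<TAB> path2 ...]
    """
    samples = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        fields = [f.strip() for f in row]
        # Assumption: blank lines and '#' comment lines are ignored.
        if not fields or not fields[0] or fields[0].startswith("#"):
            continue
        name = fields[0]
        paths = [p for p in fields[1:] if p]
        if not paths:
            raise ValueError(f"sample {name!r} lists no FASTQ file")
        samples[name] = paths
    return samples
```

A row with one path describes single-end reads; a row with two paths describes a read pair.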
136+
137+
**Usage**:
138+
```bash
139+
pathotypr match \
140+
--input <READS_1.FQ> <READS_2.FQ> \
141+
--references <MULTI_FASTA_REFS> \
142+
--output <BEST_MATCH_REPORT.tsv> \
143+
[OPTIONS]
144+
```
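To make the idea of a k-mer containment score concrete, here is a minimal Python sketch. It is not the `match` implementation: the README does not specify the weighting, so this sketch assumes each read k-mer is weighted by how often it occurs in the reads. The function names are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Count every k-mer in seq, skipping k-mers that contain N."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            counts[kmer] += 1
    return counts


def weighted_containment(read_kmers, ref_kmers):
    """Fraction of read k-mer occurrences that are present in the reference."""
    total = sum(read_kmers.values())
    if total == 0:
        return 0.0
    shared = sum(n for kmer, n in read_kmers.items() if kmer in ref_kmers)
    return shared / total


def best_match(reads, references, k=31):
    """Score every reference and return (best_name, all_scores)."""
    read_kmers = Counter()
    for read in reads:
        read_kmers.update(kmers(read, k))
    scores = {
        name: weighted_containment(read_kmers, set(kmers(seq, k)))
        for name, seq in references.items()
    }
    return max(scores, key=scores.get), scores
```

Containment is asymmetric by design: it asks how much of the read set is explained by a reference, so a small read set can still score 1.0 against a much larger genome.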
145+
122146
### 🧬 `classify` & ⚡ `split-fastq`
123147
These two modules share a powerful, dynamic variant detection engine but operate on different inputs.
124148

@@ -137,7 +161,7 @@ Both commands use the same flexible TSV format for defining variants:
137161

138162
**Example** `markers.tsv`:
139163
```text
140-
#pos ref alt marker annotation
164+
#pos ref alt marker annotation
141165
761109 G T rpoB_p.Asp435Tyr DrugResistance
142166
761109 GAC TAT rpoB_p.Asp435Tyr DrugResistance
143167
2155162 C CAT katG_p.Ser315Thr Compensatory
@@ -165,18 +189,18 @@ Both commands use the same flexible TSV format for defining variants:
165189
pathotypr classify \
166190
--markers <MARKERS_TSV> \
167191
--reference <REF_FASTA> \
168-
--output <OUTPUT_TSV> \
169-
--genome-fasta <GENOMES_FASTA> \
192+
--output-prefix <PREFIX> \
193+
--input <GENOME_FASTA> \
170194
[OPTIONS]
171195
```
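The marker TSV shown earlier (columns `pos`, `ref`, `alt`, `marker`, `annotation`) encodes the variant type implicitly in the lengths of `ref` and `alt`. The Python sketch below shows one way to read such a file and label each row; `parse_markers` and `variant_type` are hypothetical names, and the length-based rule is the usual convention rather than a confirmed description of pathotypr's engine.

```python
def variant_type(ref, alt):
    """Classify a variant from the lengths of its ref and alt alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def parse_markers(text):
    """Parse tab-separated marker rows into dicts, skipping '#' header lines."""
    markers = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        pos, ref, alt, marker, annotation = fields[:5]
        markers.append({
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            "marker": marker,
            "annotation": annotation,
            "type": variant_type(ref, alt),
        })
    return markers
```

Applied to the example rows, `G`→`T` is labeled a SNP, `GAC`→`TAT` an MNV, and `C`→`CAT` an insertion.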
172196
#### Functional Annotation with GFF
173197
The classify command can translate SNPs into amino acid changes if provided with a GFF3 annotation file.
174198

175199
How to provide GFF files:
176-
- For a single FASTA input (--input): Use the --gff flag to specify a single GFF file that corresponds to the sequences in the FASTA file.
177-
- For multiple genomes via a list (--input-list): Add a third, optional column to your TSV file containing the path to the corresponding GFF file for each genome.
200+
- For a single FASTA input (`--input`): Use the `--gff` flag to specify a single GFF file that corresponds to the sequences in the FASTA file.
201+
- For multiple genomes via a list (`--input-list`): Add a third, optional column to your TSV file containing the path to the corresponding GFF file for each genome.
178202

179-
Example input-list.tsv:
203+
Example `input-list.tsv`:
180204
```bash
181205
SampleA path/to/sampleA.fasta path/to/sampleA.gff3
182206
SampleB path/to/sampleB.fasta path/to/sampleB.gff3
@@ -185,15 +209,15 @@ SampleC path/to/sampleC.fasta # No GFF for this sample
185209
Output Columns:
186210
When annotation is performed, the output file will contain three additional columns:
187211

188-
- Gene: The ID of the gene where the SNP is located.
189-
- AA_Pos: The position of the amino acid within the gene.
190-
- AA_Change: The resulting amino acid (using 3-letter code).
212+
- `Gene`: The ID of the gene where the SNP is located.
213+
- `AA_Pos`: The position of the amino acid within the gene.
214+
- `AA_Change`: The resulting amino acid (using 3-letter code).
191215

192216
Example Output:
193217
```bash
194-
genome k-mer k-merPOS SNPgenome SNPreference lineage Gene AA_Pos AA_Change
195-
G0000_contig_1 GGCGGCGCCGCCTGGGTGGAG 1854184 1854194 1859559 L4 Rv1649 276 Gly
196-
G0000_contig_1 GACCCCGAGGCCCGGGCCGGC 4296504 4296514 4313128 L4 gyrA 95 Ser
218+
genome k-mer k-merPOS SNPgenome SNPreference lineage Gene AA_Pos AA_Change
219+
G0000_contig_1 GGCGGCGCCGCCTGGGTGGAG 1854184 1854194 1859559 L4 Rv1649 276 Gly
220+
G0000_contig_1 GACCCCGAGGCCCGGGCCGGC 4296504 4296514 4313128 L4 gyrA 95 Ser
197221
```
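The `AA_Pos` and `AA_Change` columns follow from simple codon arithmetic. The Python sketch below shows the calculation for a gene on the forward strand, using the standard genetic code; `annotate_snp` is a hypothetical helper, coordinates are assumed to be 1-based, and reverse-strand genes (which need reverse complementing) are deliberately left out.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO))
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Ter",
}


def annotate_snp(gene_seq, gene_start, snp_pos, alt_base):
    """Return (AA_Pos, AA_Change) for a SNP in a forward-strand coding sequence.

    gene_start and snp_pos are 1-based genome coordinates (assumption).
    """
    offset = snp_pos - gene_start  # 0-based offset inside the gene
    if not 0 <= offset < len(gene_seq):
        raise ValueError("SNP lies outside the gene")
    aa_pos = offset // 3 + 1       # 1-based codon index
    codon_start = (aa_pos - 1) * 3
    codon = list(gene_seq[codon_start:codon_start + 3].upper())
    if len(codon) != 3:
        raise ValueError("SNP falls in an incomplete trailing codon")
    codon[offset % 3] = alt_base.upper()
    return aa_pos, THREE_LETTER[CODON_TO_AA["".join(codon)]]
```

For example, a gene `ATGGACTAA` starting at position 100 with a `T` at position 103 turns codon 2 from `GAC` (Asp) into `TAC`, so the sketch reports position 2 and `Tyr`.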
198222

199223
#### `split-fastq`
@@ -230,9 +254,10 @@ pathotypr/
230254
├── src/
231255
│ ├── main.rs # CLI handling and dispatch
232256
│ ├── errors.rs # Custom error types
233-
│ ├── common.rs # Shared code (model bundle, kmerize)
257+
│ ├── common.rs # Shared code (model bundle, etc.)
234258
│ ├── train.rs # `train` subcommand logic
235259
│ ├── predict.rs # `predict` subcommand logic
260+
│ ├── match.rs # `match` subcommand logic
236261
│ ├── classify.rs # `classify` subcommand logic
237262
│ ├── classify_split_fastq.rs # `split-fastq` subcommand logic
238263
│ └── split_kmer.rs # Core dynamic k-mer engine
@@ -242,11 +267,11 @@ pathotypr/
242267

243268
## Key Dependencies
244269

245-
- 🎯 clap: CLI parsing
246-
- 🤖 smartcore: Machine learning
247-
- ⚡ rayon: Parallel processing
248-
- 🧬 bio: Bioinformatics tools
249-
- 📊 serde: Serialization
270+
- 🎯 `clap`: CLI parsing
271+
- 🤖 `smartcore`: Machine learning
272+
- ⚡ `rayon`: Parallel processing
273+
- 🧬 `needletail`: High-speed FASTA/Q parsing
274+
- 🗃️ `serde`: Serialization
250275

251276
## Contributing
252277

@@ -286,7 +311,7 @@ pathotypr is developed with ❤️ by:
286311
<a href="" title="Data">🔣</a>
287312
<a href="" title="Design">🎨</a>
288313
<a href="" title="Tool">🔧</a>
289-
</td>
314+
</td>
290315
<td align="center">
291316
<a href="https://github.com/mireiacoscolla">
292317
<img src="https://avatars.githubusercontent.com/u/29301737?v=4&s=100" width="100px;" alt=""/>
@@ -299,7 +324,7 @@ pathotypr is developed with ❤️ by:
299324
<a href="" title="Mentoring">🧑‍🏫</a>
300325
<a href="" title="Research">🔬</a>
301326
<a href="" title="User Testing">📓</a>
302-
</td>
327+
</td>
303328
</tr>
304329
</table>
305330

@@ -309,4 +334,4 @@ This project follows the [all-contributors](https://github.com/all-contributors/
309334
<!-- prettier-ignore-end -->
310335

311336
<!-- ALL-CONTRIBUTORS-LIST:END -->
312-
---
337+
---
