Commit 6aee574

Update readme.md
1 parent 5860832 commit 6aee574

1 file changed

Lines changed: 141 additions & 39 deletions

readme.md

@@ -1,4 +1,3 @@
1-
21
<p align="center">
32
<img src="logo/pathotypr.png" title="pathotypr.png logo" style="width:750px; height: auto;">
43
</p>
@@ -15,11 +14,28 @@ __and Mireia Coscolla<sup>1</sup>__
1514

1615
# pathotypr
1716

18-
**pathotypr** is a powerful command-line tool for genome classification using machine learning and SNP markers. It provides three main functionalities:
17+
**pathotypr** is a command-line tool for fast genome classification and variant genotyping. It combines Random Forest models with a k-mer-based engine and provides four subcommands:
18+
19+
- 🎓 **train**: Build, train, and validate Random Forest models from your own FASTA sequences.
20+
21+
- 🔮 **predict**: Classify new genomes using a pre-trained pathotypr model.
22+
23+
- 🧬 **classify**: Genotype known variants (SNPs, MNVs, Indels) in assembled genomes (FASTA) against a reference.
24+
25+
- ⚡ **split-fastq**: Perform ultra-fast, alignment-free genotyping of SNPs, MNVs, and both small and large structural variants (Indels/SVs) directly from raw FASTQ reads.
26+
27+
## Key Features
28+
- **Dynamic Variant Detection**: The `classify` and `split-fastq` modules can detect SNPs, MNVs, and Indels (including large SVs) using a flexible marker format.
1929

20-
- 🎓 **Train**: Build ML models from FASTA sequences
21-
- 🔮 **Predict**: Classify new genomes using trained models
22-
- 🧬 **Classify**: Process genomic markers (SNPs) against a reference for lineage/drug resistance classification
30+
- **High-Speed & Alignment-Free**: The `split-fastq` engine genotypes variants directly from raw reads, bypassing the need for computationally expensive alignment.
31+
32+
- **Flexible Input**: Process samples individually, in batches from the command line, or using a convenient sample list file.
33+
34+
- **Efficient & Parallel**: Optimized for performance using Rayon to leverage all available CPU cores by default.
35+
36+
- **Unified Model Format**: The `train` command produces a single, portable file containing the model, configuration, and all necessary components for reproducible predictions.
37+
38+
- **Robust Logging**: Clear, informative, and standardized logging across all modules.
2339

2440
## Installation
2541

@@ -39,67 +55,153 @@ git clone https://github.com/PathoGenOmics-Lab/pathotypr.git
3955
cd pathotypr
4056
cargo build --release
4157
```
42-
4358
## Quick Start
4459

4560
```bash
46-
# Train a model
47-
pathotypr train --fasta input.fasta --output my_model --kmer_size 6
61+
# 1. Train a Random Forest model
62+
pathotypr train --input training_genomes.fasta --output my_species.model.gz
4863

49-
# Predict classifications
50-
pathotypr predict --fasta input.fasta --model_base my_model --output predictions.txt
64+
# 2. Predict the class of a new genome
65+
pathotypr predict --input new_genome.fasta --model my_species.model.gz --output prediction.tsv
5166

52-
# Classify using SNP markers
53-
pathotypr classify --tsv_pos markers.tsv --ref_fasta ref.fasta --fasta_genomes genomes.fasta --output results.txt
67+
# 3. Genotype variants in an assembled genome
68+
pathotypr classify --markers variants.tsv --reference ref.fasta --genome-fasta my_genome.fasta --output classified_variants.tsv
69+
70+
# 4. Genotype variants directly from raw reads
71+
pathotypr split-fastq --markers variants.tsv --reference ref.fasta -i sample_R1.fq.gz -i sample_R2.fq.gz --paired --output-prefix sample_genotyping
5472
```
5573

56-
## Features
74+
## Documentation
5775

58-
- 🚀 Fast parallel processing using Rayon
59-
- 📊 Random Forest classification for ML-based prediction
60-
- 🧪 K-mer based sequence analysis
61-
- 💾 Compressed model storage
62-
- 🔍 Reference-based SNP marker detection
63-
- 🧬 Flexible marker positions for closed/mapped genomes
64-
- 📈 Progress tracking and logging
76+
### 🎓 `train`
6577

66-
## Documentation
78+
Builds and trains a Random Forest model from a multifasta file where headers are in the format `Lineage_sequenceID`. The command produces a single, self-contained model file.
6779

68-
### Train Mode
69-
Trains a model from FASTA files with headers in `Lineage_sequenceID` format:
80+
**Arguments**:
81+
| Option | Flag | Description | Default |
82+
| :--- | :--- | :--- | :--- |
83+
| --input | -i | Path to the input multifasta file. Headers must be in Lineage_sequenceID format. | Required |
84+
| --output | -o | Path for the unified output model file (e.g., my_model.pathotypr.gz). | Required |
85+
| --kmer-size | -k | The size of the k-mers to generate from sequences. | 6 |
86+
| --test-split | -s | Proportion of the data to use for the test set. | 0.2 (20%) |
87+
| --threads | -t | Number of CPU threads to use. | All available |
88+
89+
_The tool will warn you if it detects a strong class imbalance in your training data._
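
As a rough illustration of the header convention and the imbalance warning, the sketch below extracts the lineage from `Lineage_sequenceID` headers and compares class sizes. It is not taken from the pathotypr source: the function names, the rule of splitting at the first underscore, and the 2.0 ratio cut-off are assumptions made for this example.

```python
from collections import Counter


def lineage_from_header(header: str) -> str:
    """Return the lineage label from a 'Lineage_sequenceID' FASTA header.

    Assumption: the lineage is everything before the FIRST underscore.
    """
    fields = header.lstrip(">").split()  # drop '>' and any trailing description
    name = fields[0] if fields else ""
    lineage, sep, _sequence_id = name.partition("_")
    if not sep or not lineage:
        raise ValueError(f"header not in Lineage_sequenceID format: {header!r}")
    return lineage


def imbalance_ratio(headers) -> float:
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(lineage_from_header(h) for h in headers)
    return max(counts.values()) / min(counts.values())


headers = [">L1_sampleA", ">L1_sampleB", ">L1_sampleC", ">L2_sampleD"]
ratio = imbalance_ratio(headers)
if ratio > 2.0:  # hypothetical cut-off, not pathotypr's actual criterion
    print(f"warning: class imbalance detected (ratio {ratio:.1f})")
```

Lineage names that themselves contain an underscore would be ambiguous under this splitting rule, so a quick check like this before training can catch mislabelled headers.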
90+
91+
**Usage**:
7092
```bash
71-
pathotypr train --fasta input.fasta --output my_model --kmer_size 21
93+
pathotypr train \
94+
--input <FASTA> \
95+
--output <MODEL_FILE> \
96+
[OPTIONS]
7297
```
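
To make the `--kmer-size` option more concrete, here is a minimal sketch of how a sequence can be turned into a fixed-length k-mer count vector of the kind a Random Forest could consume. Whether pathotypr uses raw counts, frequencies, or canonical k-mers is not stated in this README, so `kmer_counts` and `feature_vector` are illustrative names and the encoding is an assumption.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int = 6) -> Counter:
    """Count overlapping k-mers, skipping any window with a non-ACGT base."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_vector(sequence: str, k: int = 6) -> list:
    """Return counts for every possible k-mer in lexicographic order (4**k values)."""
    counts = kmer_counts(sequence, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


vector = feature_vector("ACGTACGTTAGC", k=2)
print(len(vector))  # one entry per possible 2-mer
```

With the default `k = 6` such a vector has 4096 entries, and it grows by a factor of four with each increase in `k`.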
98+
### 🔮 `predict`
99+
100+
Classifies new genomes using a model file generated by `train`.
73101

74-
### Predict Mode
75-
Classifies genomes using a trained model:
102+
**Arguments**:
103+
| Option | Flag | Description | Default |
104+
| :--- | :--- | :--- | :--- |
105+
| --input | -i | Path to the input FASTA file containing sequences to classify. | Required |
106+
| --model | -m | Path to the unified model file created by the train command. | Required |
107+
| --output | -o | Path for the output file where predictions will be written in TSV format. | Required |
108+
| --threads | -t | Number of CPU threads to use. | All available |
109+
110+
**Usage**:
76111
```bash
77-
pathotypr predict --fasta input.fasta --model_base my_model --output predictions.txt
112+
pathotypr predict \
113+
--input <FASTA> \
114+
--model <MODEL_FILE> \
115+
--output <PREDICTIONS_TSV> \
116+
[OPTIONS]
78117
```
79118

80-
### Classify Mode
81-
Process genomes using SNP markers against a reference sequence. Supports both lineage-defining SNPs and drug resistance markers. Works with:
82-
- Mapped genomes (SNP positions relative to reference)
83-
- Closed genomes (SNP positions may vary)
119+
### 🧬 `classify` & ⚡ `split-fastq`
120+
These two modules share a powerful, dynamic variant detection engine but operate on different inputs.
121+
122+
- `classify`: Works on assembled genomes (FASTA).
123+
- `split-fastq`: Works on raw sequencing reads (FASTQ).
124+
125+
#### Marker File Format
126+
Both commands use the same flexible TSV format for defining variants:
127+
`position <tab> REF <tab> ALT <tab> marker_name <tab> [optional_annotations...]`
128+
129+
- `position`: 1-based chromosomal position.
130+
- `REF`: The reference allele sequence.
131+
- `ALT`: The alternate allele sequence.
132+
- `marker_name`: The name of the marker/lineage.
133+
- `[optional_annotations...]`: Any additional columns, which will be appended to the output file.
134+
135+
**Example** `markers.tsv`:
136+
```text
137+
#pos ref alt marker annotation
138+
761109 G T rpoB_p.Asp435Tyr DrugResistance
139+
761109 GAC TAT rpoB_p.Asp435Tyr DrugResistance
140+
2155162 C CAT katG_p.Ser315Thr Compensatory
141+
987654 TGC.. T LargeDeletion_1 StructuralVariant
142+
```
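
The column layout described above can be read with a few lines of Python. This is a sketch for checking a marker file before a run, not pathotypr's own parser: the `Marker` type, the function names, and the rule that classifies variants by comparing allele lengths are assumptions.

```python
from typing import NamedTuple, Optional


class Marker(NamedTuple):
    position: int       # 1-based position
    ref: str            # reference allele
    alt: str            # alternate allele
    name: str           # marker or lineage name
    annotations: tuple  # any extra columns, kept as-is


def parse_marker_line(line: str) -> Optional[Marker]:
    """Parse one TSV line; return None for blank lines and '#' comments."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 tab-separated columns: {line!r}")
    position, ref, alt, name = fields[:4]
    return Marker(int(position), ref.upper(), alt.upper(), name, tuple(fields[4:]))


def variant_type(marker: Marker) -> str:
    """Classify by allele length (assumed rule, for illustration only)."""
    if len(marker.ref) == len(marker.alt):
        return "SNP" if len(marker.ref) == 1 else "MNV"
    return "insertion" if len(marker.alt) > len(marker.ref) else "deletion"


rows = [
    "#pos\tref\talt\tmarker\tannotation",
    "761109\tG\tT\trpoB_p.Asp435Tyr\tDrugResistance",
    "761109\tGAC\tTAT\trpoB_p.Asp435Tyr\tDrugResistance",
]
for row in rows:
    marker = parse_marker_line(row)
    if marker is not None:
        print(marker.name, marker.position, variant_type(marker))
```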
84143

144+
#### 🧬 `classify`
145+
Genotype known variants (SNPs, MNVs, Indels) in assembled genomes (FASTA) against a reference.
146+
**Arguments**:
147+
| Option | Flag | Description | Default |
148+
| :--- | :--- | :--- | :--- |
149+
| --markers | -m | Path to the marker definition file (TSV format). | Required |
150+
| --reference | -r | Path to the reference genome FASTA file. | Required |
151+
| --output | -o | Path for the detailed output report. | Required |
152+
| --genome-fasta| | Path to a multifasta file containing all genomes to analyze. | One input required |
153+
| --genome-list | | Path to a TSV file listing genomes, one per line, as `name<TAB>path/to/fasta`. | One input required |
154+
| --kmer-size | -k | The size of the diagnostic k-mers to use. | 31 |
155+
| --threads | -t | Number of CPU threads to use. | All available |
156+
157+
**Usage**:
85158
```bash
86-
# Using FASTA input
87-
pathotypr classify --tsv_pos markers.tsv --ref_fasta ref.fasta --fasta_genomes genomes.fasta --output results.txt
159+
pathotypr classify \
160+
--markers <MARKERS_TSV> \
161+
--reference <REF_FASTA> \
162+
--output <OUTPUT_TSV> \
163+
--genome-fasta <GENOMES_FASTA> \
164+
[OPTIONS]
165+
```
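
The README does not describe how the diagnostic k-mers are built, so the following is only one plausible reading of the idea: take flanking sequence from the reference around a marker, form one sequence carrying the REF allele and one carrying the ALT allele, and look for each in the assembly on either strand. All names here are hypothetical, and the fixed `flank` parameter is an assumption (a flank of 15 on each side of a SNP would give a 31-mer).

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA string (non-ACGT characters are kept)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def diagnostic_sequences(reference: str, position: int, ref: str, alt: str, flank: int = 15):
    """Return (ref_seq, alt_seq): the two alleles embedded in reference flanks."""
    start = position - 1  # marker positions are 1-based
    end = start + len(ref)
    if reference[start:end].upper() != ref.upper():
        raise ValueError("REF allele does not match the reference at this position")
    left = reference[max(0, start - flank):start]
    right = reference[end:end + flank]
    return (left + ref + right).upper(), (left + alt + right).upper()


def genotype(genome: str, ref_seq: str, alt_seq: str) -> str:
    """Report which allele sequence occurs in the genome, on either strand."""
    genome = genome.upper()

    def present(seq: str) -> bool:
        return seq in genome or reverse_complement(seq) in genome

    has_ref, has_alt = present(ref_seq), present(alt_seq)
    if has_alt and not has_ref:
        return "ALT"
    if has_ref and not has_alt:
        return "REF"
    return "ambiguous" if has_ref else "missing"


reference = "ACGTTGCAAGGCTTAACCGGTAGC"
sample = "ACGTTGCAAGGTTTAACCGGTAGC"  # same sequence with C>T at position 12
ref_seq, alt_seq = diagnostic_sequences(reference, 12, "C", "T", flank=4)
print(genotype(sample, ref_seq, alt_seq))
```

With fixed flanks, the REF and ALT sequences differ in length for indels; a real engine would need to decide how to handle that, which this sketch does not attempt.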
88166

89-
# Using TSV input
90-
pathotypr classify --tsv_pos markers.tsv --ref_fasta ref.fasta --tsv_genomes genomes.tsv --output results.txt
167+
#### ⚡ `split-fastq`
168+
169+
**Arguments**:
170+
| Option | Flag | Description | Default |
171+
| :--- | :--- | :--- | :--- |
172+
| --markers | -m | Path to the marker definition file (TSV format). | Required |
173+
| --reference | -r | Path to the reference genome FASTA file. | Required |
174+
| --output-prefix| -o | Prefix for all output files. | split |
175+
| --input | -i | One or more FASTQ files to process. | One input required |
176+
| --input-list | -l | Path to a TSV file listing samples and their FASTQ files. | One input required |
177+
| --paired | | Flag to treat input files as paired-end, grouped in pairs. | false |
178+
| --min-depth | | Minimum read depth required to call a variant. | 10 |
179+
| --min-alt-percent| | Minimum frequency of the alternate allele to call a variant (%). | 95 |
180+
| --threads | -t | Number of CPU threads to use. | All available |
181+
182+
**Usage**:
183+
```bash
184+
pathotypr split-fastq \
185+
--markers <MARKERS_TSV> \
186+
--reference <REF_FASTA> \
187+
--output-prefix <PREFIX> \
188+
-i <READS_1.FQ> -i <READS_2.FQ> --paired \
189+
[OPTIONS]
91190
```
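
The interaction between `--min-depth` and `--min-alt-percent` can be pictured as a two-step filter. The sketch below assumes that depth is the number of reads supporting either allele and that both thresholds are inclusive; neither detail is confirmed by this README, and `call_variant` is a hypothetical helper, not part of pathotypr.

```python
def call_variant(ref_reads: int, alt_reads: int,
                 min_depth: int = 10, min_alt_percent: float = 95.0) -> bool:
    """Return True when the ALT allele passes both thresholds."""
    depth = ref_reads + alt_reads  # assumed definition of depth
    if depth < min_depth:
        return False  # too few reads to make a call
    alt_percent = 100.0 * alt_reads / depth
    return alt_percent >= min_alt_percent


for ref_reads, alt_reads in [(0, 10), (1, 19), (1, 8), (2, 18)]:
    called = call_variant(ref_reads, alt_reads)
    print(f"ref={ref_reads} alt={alt_reads} called={called}")
```

Under these assumptions, lowering `--min-alt-percent` would admit mixed or heterogeneous samples, while raising `--min-depth` trades sensitivity for fewer calls on thinly covered sites.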
92191

93192
## Project Structure
94193

95194
```
96195
pathotypr/
97196
├── src/
98-
│ ├── main.rs # CLI handling
99-
│ ├── train.rs # Model training
100-
│ ├── predict.rs # Classification
101-
│ └── classify.rs # Marker processing
197+
│ ├── main.rs # CLI handling
198+
│ ├── train.rs # Model training logic
199+
│ ├── predict.rs # Model prediction logic
200+
│ ├── classify.rs # Variant detection in assemblies
201+
│ ├── classify_split_fastq.rs # Variant detection in reads
202+
│ └── split_kmer.rs # Core dynamic k-mer engine
102203
└── Cargo.toml
204+
103205
```
104206

105207
## Key Dependencies
