- **pathotypr** is a powerful command-line tool for genome classification using machine learning and SNP markers. It provides three main functionalities:
+ **pathotypr** is a powerful and versatile command-line tool for high-speed genome classification and variant genotyping. It combines machine learning models with an advanced k-mer based engine to provide a comprehensive analysis toolkit.
+
+ - 🎓 **train**: Build, train, and validate Random Forest models from your own FASTA sequences.
+
+ - 🔮 **predict**: Classify new genomes using a pre-trained pathotypr model.
+
+ - 🧬 **classify**: Genotype known variants (SNPs, MNVs, Indels) in assembled genomes (FASTA) against a reference.
+
+ - ⚡ **split-fastq**: Perform ultra-fast, alignment-free genotyping of SNPs, MNVs, and both small and large structural variants (Indels/SVs) directly from raw FASTQ reads.
+
+ ## Key Features
+ - **Dynamic Variant Detection**: The `classify` and `split-fastq` modules can detect SNPs, MNVs, and Indels (including large SVs) using a flexible marker format.

- - 🎓 **Train**: Build ML models from FASTA sequences
- - 🔮 **Predict**: Classify new genomes using trained models
- - 🧬 **Classify**: Process genomic markers (SNPs) against a reference for lineage/drug resistance classification
+ - **High-Speed & Alignment-Free**: The `split-fastq` engine genotypes variants directly from raw reads, bypassing the need for computationally expensive alignment.
+
+ - **Flexible Input**: Process samples individually, in batches from the command line, or using a convenient sample list file.
+
+ - **Efficient & Parallel**: Optimized for performance using Rayon to leverage all available CPU cores by default.
+
+ - **Unified Model Format**: The `train` command produces a single, portable file containing the model, configuration, and all necessary components for reproducible predictions.
+
+ - **Robust Logging**: Clear, informative, and standardized logging across all modules.
- - 📊 Random Forest classification for ML-based prediction
- - 🧪 K-mer based sequence analysis
- - 💾 Compressed model storage
- - 🔍 Reference-based SNP marker detection
- - 🧬 Flexible marker positions for closed/mapped genomes
- - 📈 Progress tracking and logging
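
The alignment-free genotyping mentioned for `split-fastq` can be pictured with a small k-mer sketch. This is only an illustration of the general idea, not pathotypr's actual algorithm or marker format, neither of which is described here. The function names (`allele_kmers`, `genotype`), the flank/allele representation of a marker, and the choice of `k` are all assumptions made for the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement; characters other than A/C/G/T are left unchanged."""
    return seq.translate(_COMPLEMENT)[::-1]


def allele_kmers(left_flank, allele, right_flank, k):
    """Return every k-mer that overlaps the allele in flank + allele + flank.

    An empty allele (a deletion) yields the k-mers that span the junction.
    """
    seq = left_flank + allele + right_flank
    start, end = len(left_flank), len(left_flank) + len(allele)
    kmers = set()
    for i in range(len(seq) - k + 1):
        if i < end and i + k > start:
            kmers.add(seq[i:i + k])
    return kmers


def genotype(reads, left_flank, ref, alt, right_flank, k):
    """Count read k-mers that are diagnostic for ref or alt and make a call."""
    ref_kmers = allele_kmers(left_flank, ref, right_flank, k)
    alt_kmers = allele_kmers(left_flank, alt, right_flank, k)
    # Keep only k-mers unique to one allele (hypothetical filtering step).
    ref_only, alt_only = ref_kmers - alt_kmers, alt_kmers - ref_kmers
    counts = Counter(ref=0, alt=0)
    for read in reads:
        for strand in (read, revcomp(read)):  # reads may come from either strand
            for i in range(len(strand) - k + 1):
                kmer = strand[i:i + k]
                if kmer in ref_only:
                    counts["ref"] += 1
                elif kmer in alt_only:
                    counts["alt"] += 1
    if counts["ref"] == counts["alt"]:
        return "undetermined", counts
    return ("alt" if counts["alt"] > counts["ref"] else "ref"), counts
```

No alignment step is involved: each read is only scanned for exact k-mer matches, which is what makes this style of genotyping cheap. A real implementation would also need to handle sequencing errors, coverage thresholds, and k-mers that occur elsewhere in the genome.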
+ ### 🎓 `train`

- ## Documentation
+ Builds and trains a Random Forest model from a multifasta file where headers are in the format `Lineage_sequenceID`. The command produces a single, self-contained model file.
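
To make the `Lineage_sequenceID` header convention concrete, here is a minimal parsing sketch. It assumes the lineage label is everything before the first underscore and that anything after the first whitespace in a header is a free-text description; the text above does not confirm either rule, and `parse_labeled_fasta` is a hypothetical helper, not part of pathotypr.

```python
def _make_record(header, chunks):
    # Assumption: split on the FIRST underscore, so "L4.2_ERR123_1"
    # gives lineage "L4.2" and sequence ID "ERR123_1".
    lineage, sep, seq_id = header.partition("_")
    if not sep or not lineage or not seq_id:
        raise ValueError(f"header {header!r} is not in Lineage_sequenceID format")
    return lineage, seq_id, "".join(chunks)


def parse_labeled_fasta(lines):
    """Parse multifasta lines into (lineage, sequence_id, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append(_make_record(header, chunks))
    return records
```

Passing an open file handle works as well as a list of strings, for example `parse_labeled_fasta(open("training.fasta"))`, where the filename is a placeholder.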

- ### Train Mode
- Trains a model from FASTA files with headers in `Lineage_sequenceID` format:
+ **Arguments**:
+ | Option | Flag | Description | Default |
+ | :--- | :--- | :--- | :--- |
+ | --input | -i | Path to the input multifasta file. Headers must be in Lineage_sequenceID format. | Required |
+ | --output | -o | Path for the unified output model file (e.g., my_model.pathotypr.gz). | Required |
+ | --kmer-size | -k | The size of the k-mers to generate from sequences. | 6 |
+ | --test-split | -s | Proportion of the data to use for the test set. | 0.2 (20%) |
+ | --threads | -t | Number of CPU threads to use. | All available |
+
+ _The tool will warn you if it detects a strong class imbalance in your training data._
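
The roles of `--kmer-size` and `--test-split`, and the class-imbalance warning, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not pathotypr's implementation: whether the tool skips k-mers containing ambiguous bases, whether its split is stratified by lineage, and how it measures imbalance are not documented here. All three function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def kmer_counts(seq, k=6):
    """Count overlapping k-mers; windows with non-ACGT characters are skipped
    (an assumption made for this sketch)."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def stratified_split(labels, test_split=0.2, seed=0):
    """Return (train_indices, test_indices), holding out roughly `test_split`
    of each class. Stratification is assumed, not confirmed by the docs."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_split)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```

With the default `k=6` there are at most 4^6 = 4096 distinct k-mers, so each genome maps to a fixed-length count vector that a Random Forest can consume. A large `imbalance_ratio` is the kind of signal that could trigger a warning like the one mentioned above; the threshold pathotypr uses is not stated.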