Update readme.md

Paururo · web-flow · commit 6aee57470fc0 · 2025-07-06T11:35:54.000+02:00
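
The marker file format documented in this change (`position`, `REF`, `ALT`, `marker_name`, then optional annotation columns, tab-separated) can be illustrated with a small parser. The sketch below is not pathotypr's own code: the `Marker` class, the `parse_markers` function, the skipping of `#` lines, and the rule that infers the variant type from the `REF`/`ALT` lengths are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Marker:
    """One row of a marker TSV (hypothetical helper, not part of pathotypr)."""
    position: int                      # 1-based position on the reference
    ref: str                           # reference allele
    alt: str                           # alternate allele
    name: str                          # marker/lineage name
    annotations: List[str] = field(default_factory=list)

    @property
    def kind(self) -> str:
        # Assumed convention: equal-length alleles are substitutions,
        # unequal lengths are insertions or deletions.
        if len(self.ref) == len(self.alt):
            return "SNP" if len(self.ref) == 1 else "MNV"
        return "insertion" if len(self.alt) > len(self.ref) else "deletion"


def parse_markers(text: str) -> List[Marker]:
    """Parse tab-separated marker rows, skipping blank and '#' lines (assumed)."""
    markers = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            raise ValueError(f"expected at least 4 tab-separated columns: {line!r}")
        pos, ref, alt, name, *extra = fields
        markers.append(Marker(int(pos), ref.upper(), alt.upper(), name, extra))
    return markers


# Rows taken from the example markers.tsv in the README, joined with real tabs.
example = (
    "#pos\tref\talt\tmarker\tannotation\n"
    "761109\tG\tT\trpoB_p.Asp435Tyr\tDrugResistance\n"
    "761109\tGAC\tTAT\trpoB_p.Asp435Tyr\tDrugResistance\n"
    "2155162\tC\tCAT\tkatG_p.Ser315Thr\tCompensatory\n"
)
markers = parse_markers(example)
for m in markers:
    print(m.position, m.name, m.kind)
```

A check like this can catch malformed rows (for example, space-separated columns) before a marker file is passed to `classify` or `split-fastq`.
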
diff --git a/readme.md b/readme.md
@@ -1,4 +1,3 @@
-
 <p align="center">
   <img src="logo/pathotypr.png" title="pathotypr.png logo" style="width:750px; height: auto;">
 </p>
@@ -15,11 +14,28 @@ __and Mireia Coscolla<sup>1</sup>__
 
 # pathotypr
 
-**pathotypr** is a powerful command-line tool for genome classification using machine learning and SNP markers. It provides three main functionalities:
+**pathotypr** is a command-line tool for genome classification and variant genotyping. It combines Random Forest models with a k-mer-based engine and provides four subcommands:
+
+- 🎓 **train**: Build, train, and validate Random Forest models from your own FASTA sequences.
+
+- 🔮 **predict**: Classify new genomes using a pre-trained pathotypr model.
+
+- 🧬 **classify**: Genotype known variants (SNPs, MNVs, Indels) in assembled genomes (FASTA) against a reference.
+
+- ⚡ **split-fastq**: Genotype SNPs, MNVs, Indels, and large structural variants (SVs) directly from raw FASTQ reads, without alignment.
+
+## Key Features
+- **Dynamic Variant Detection**: The `classify` and `split-fastq` modules can detect SNPs, MNVs, and Indels (including large SVs) using a flexible marker format.
 
-- 🎓 **Train**: Build ML models from FASTA sequences
-- 🔮 **Predict**: Classify new genomes using trained models
-- 🧬 **Classify**: Process genomic markers (SNPs) against a reference for lineage/drug resistance classification
+- **High-Speed & Alignment-Free**: The `split-fastq` engine genotypes variants directly from raw reads, bypassing the need for computationally expensive alignment.
+
+- **Flexible Input**: Process samples individually, in batches from the command line, or using a convenient sample list file.
+
+- **Efficient & Parallel**: Parallelized with Rayon; uses all available CPU cores by default.
+
+- **Unified Model Format**: The `train` command produces a single, portable file containing the model, configuration, and all necessary components for reproducible predictions.
+
+- **Robust Logging**: Clear, informative, and standardized logging across all modules.
 
 ## Installation
 
@@ -39,67 +55,153 @@ git clone https://github.com/PathoGenOmics-Lab/pathotypr.git
 cd pathotypr
 cargo build --release
 ```
-
 ## Quick Start
 
 ```bash
-# Train a model
-pathotypr train --fasta input.fasta --output my_model --kmer_size 6
+# 1. Train a Random Forest model
+pathotypr train --input training_genomes.fasta --output my_species.model.gz
 
-# Predict classifications
-pathotypr predict --fasta input.fasta --model_base my_model --output predictions.txt
+# 2. Predict the class of a new genome
+pathotypr predict --input new_genome.fasta --model my_species.model.gz --output prediction.tsv
 
-# Classify using SNP markers
-pathotypr classify --tsv_pos markers.tsv --ref_fasta ref.fasta --fasta_genomes genomes.fasta --output results.txt
+# 3. Genotype variants in an assembled genome
+pathotypr classify --markers variants.tsv --reference ref.fasta --genome-fasta my_genome.fasta --output classified_variants.tsv
+
+# 4. Genotype variants directly from raw reads
+pathotypr split-fastq --markers variants.tsv --reference ref.fasta -i sample_R1.fq.gz -i sample_R2.fq.gz --paired --output-prefix sample_genotyping
 ```
 
-## Features
+## Documentation
 
-- 🚀 Fast parallel processing using Rayon
-- 📊 Random Forest classification for ML-based prediction
-- 🧪 K-mer based sequence analysis
-- 💾 Compressed model storage
-- 🔍 Reference-based SNP marker detection
-- 🧬 Flexible marker positions for closed/mapped genomes
-- 📈 Progress tracking and logging
+### 🎓 `train`
 
-## Documentation
+Builds and trains a Random Forest model from a multifasta file where headers are in the format `Lineage_sequenceID`. The command produces a single, self-contained model file.
 
-### Train Mode
-Trains a model from FASTA files with headers in `Lineage_sequenceID` format:
+**Arguments**:
+| Option | Flag | Description | Default |
+| :--- | :--- | :--- | :--- |
+| --input | -i | Path to the input multifasta file. Headers must be in Lineage_sequenceID format. | Required |
+| --output | -o | Path for the unified output model file (e.g., my_model.pathotypr.gz). | Required |
+| --kmer-size | -k | The size of the k-mers to generate from sequences. | 6 |
+| --test-split | -s | Proportion of the data to use for the test set. | 0.2 (20%) |
+| --threads | -t | Number of CPU threads to use. | All available |
+
+_The tool will warn you if it detects a strong class imbalance in your training data._
+
+**Usage**:
 ```bash
-pathotypr train --fasta input.fasta --output my_model --kmer_size 21
+pathotypr train \
+  --input <FASTA> \
+  --output <MODEL_FILE> \
+  [OPTIONS]
 ```
+### 🔮 `predict`
+
+Classifies new genomes using a model file generated by `train`.
 
-### Predict Mode
-Classifies genomes using a trained model:
+**Arguments**:
+| Option | Flag | Description | Default |
+| :--- | :--- | :--- | :--- |
+| --input | -i | Path to the input FASTA file containing sequences to classify. | Required |
+| --model | -m | Path to the unified model file created by the train command. | Required |
+| --output | -o | Path for the output file where predictions will be written in TSV format. | Required |
+| --threads | -t | Number of CPU threads to use. | All available |
+
+**Usage**:
 ```bash
-pathotypr predict --fasta input.fasta --model_base my_model --output predictions.txt
+pathotypr predict \
+  --input <FASTA> \
+  --model <MODEL_FILE> \
+  --output <PREDICTIONS_TSV> \
+  [OPTIONS]
 ```
 
-### Classify Mode
-Process genomes using SNP markers against a reference sequence. Supports both lineage-defining SNPs and drug resistance markers. Works with:
-- Mapped genomes (SNP positions relative to reference)
-- Closed genomes (SNP positions may vary)
+### 🧬 `classify` & ⚡ `split-fastq`
+These two modules share the same variant detection engine but operate on different inputs.
+
+- `classify`: Works on assembled genomes (FASTA).
+- `split-fastq`: Works on raw sequencing reads (FASTQ).
+
+#### Marker File Format
+Both commands use the same flexible TSV format for defining variants:
+`position <tab> REF <tab> ALT <tab> marker_name <tab> [optional_annotations...]`
+
+- `position`: 1-based chromosomal position.
+- `REF`: The reference allele sequence.
+- `ALT`: The alternate allele sequence.
+- `marker_name`: The name of the marker/lineage.
+- `[optional_annotations...]`: Any additional columns, which will be appended to the output file.
+
+**Example** `markers.tsv`:
+```text
+#pos    ref   alt   marker              annotation
+761109  G     T     rpoB_p.Asp435Tyr    DrugResistance
+761109  GAC   TAT   rpoB_p.Asp435Tyr    DrugResistance
+2155162 C     CAT   katG_p.Ser315Thr    Compensatory
+987654  TGC.. T     LargeDeletion_1     StructuralVariant
+```
 
+#### 🧬 `classify`
+Genotypes known variants (SNPs, MNVs, Indels) in assembled genomes (FASTA) against a reference.
+**Arguments**:
+| Option | Flag | Description | Default |
+| :--- | :--- | :--- | :--- |
+| --markers | -m | Path to the marker definition file (TSV format). | Required |
+| --reference | -r | Path to the reference genome FASTA file. | Required |
+| --output | -o | Path for the detailed output report. | Required |
+| --genome-fasta | | Path to a multifasta file containing all genomes to analyze. | One input required |
+| --genome-list | | Path to a TSV file listing genomes (name\tpath/to/fasta). | One input required |
+| --kmer-size | -k | The size of the diagnostic k-mers to use. | 31 |
+| --threads | -t | Number of CPU threads to use. | All available |
+
+**Usage**:
 ```bash
-# Using FASTA input
-pathotypr classify --tsv_pos markers.tsv --ref_fasta ref.fasta --fasta_genomes genomes.fasta --output results.txt
+pathotypr classify \
+  --markers <MARKERS_TSV> \
+  --reference <REF_FASTA> \
+  --output <OUTPUT_TSV> \
+  --genome-fasta <GENOMES_FASTA> \
+  [OPTIONS]
+```
 
-# Using TSV input
-pathotypr classify --tsv_pos markers.tsv --ref_fasta ref.fasta --tsv_genomes genomes.tsv --output results.txt
+#### ⚡ `split-fastq`
+
+**Arguments**:
+| Option | Flag | Description | Default |
+| :--- | :--- | :--- | :--- |
+| --markers | -m | Path to the marker definition file (TSV format). | Required |
+| --reference | -r | Path to the reference genome FASTA file. | Required |
+| --output-prefix | -o | Prefix for all output files. | split |
+| --input | -i | One or more FASTQ files to process. | One input required |
+| --input-list | -l | Path to a TSV file listing samples and their FASTQ files. | One input required |
+| --paired | | Flag to treat input files as paired-end, grouped in pairs. | false |
+| --min-depth | | Minimum read depth required to call a variant. | 10 |
+| --min-alt-percent | | Minimum frequency of the alternate allele to call a variant (%). | 95 |
+| --threads | -t | Number of CPU threads to use. | All available |
+
+**Usage**:
+```bash
+pathotypr split-fastq \
+  --markers <MARKERS_TSV> \
+  --reference <REF_FASTA> \
+  --output-prefix <PREFIX> \
+  -i <READS_1.FQ> -i <READS_2.FQ> --paired \
+  [OPTIONS]
 ```
 
 ## Project Structure
 
 ```
 pathotypr/
 ├── src/
-│   ├── main.rs     # CLI handling
-│   ├── train.rs    # Model training
-│   ├── predict.rs  # Classification
-│   └── classify.rs # Marker processing
+│   ├── main.rs                 # CLI handling
+│   ├── train.rs                # Model training logic
+│   ├── predict.rs              # Model prediction logic
+│   ├── classify.rs             # Variant detection in assemblies
+│   ├── classify_split_fastq.rs # Variant detection in reads
+│   └── split_kmer.rs           # Core dynamic k-mer engine
 └── Cargo.toml
+
 ```
 
 ## Key Dependencies