Commit 286e86f

Merge pull request #18 from PopovIILab/dev
Refactor: modernize CLI routing, optimize MPA pipeline and taxonomic sorting
2 parents 844f024 + 766ce52 commit 286e86f

7 files changed

Lines changed: 413 additions & 229 deletions

README.md

Lines changed: 80 additions & 67 deletions
Original file line number | Diff line number | Diff line change
@@ -107,15 +107,9 @@ X9,0.7232472324723247,0.7352941176470589,...,0.8066914498141264,0.0
107107
|![combined_white](https://github.com/user-attachments/assets/48b3f6e3-6dd5-4298-a793-23dcd549e90c)|![kpclust](https://github.com/user-attachments/assets/98a4d540-7c43-4802-8f77-277a5637a7a1)|
108108

109109
## Quick Start (Full Pipeline)
110-
To run the full pipeline, use the following command:
111-
```bash
112-
KrakenParser --complete -i data/kreports -o results/
113-
#Having troubles? Run KrakenParser --complete -h
114-
```
115110

116-
For **reproducible** β-diversity (rarefaction is stochastic by default):
117111
```bash
118-
KrakenParser -i data/kreports -o results/ -s 42
112+
KrakenParser -i data/kreports -o results/
119113
```
120114

121115
This will:
@@ -127,147 +121,165 @@ This will:
127121
6. Calculate relative abundance
128122
7. Calculate α & β-diversities
129123

130-
## Installation
124+
> [!TIP]
125+
> After the pipeline finishes, the output window will remind you about calibrating
126+
> rarefaction depth for β-diversity and re-running relative abundance normalization
127+
> before visualization — with ready-to-paste example commands tailored to your output paths.
128+
129+
### Full help output
131130

132131
```
133-
pip install krakenparser
132+
usage: KrakenParser [-h] [-i INPUT] [-o OUTPUT] [--viruses] [--keep-human]
133+
[-V] [-d DEPTH] [-s SEED] [--overwrite]
134+
[--step {mpa,combine,split,process,csv,relabund,diversity}]
135+
136+
KrakenParser: Convert Kraken2 Reports to CSV.
137+
138+
options:
139+
-h, --help show this help message and exit
140+
141+
Core Arguments:
142+
-i, --input INPUT Directory containing Kraken2 report files
143+
-o, --output OUTPUT Output directory (default: parent of input)
144+
--viruses Extract only VIRUSES domain taxa in the pipeline
145+
--keep-human Do not filter human-related taxa
146+
-V, --version show program's version number and exit
147+
148+
Pipeline Options (Full Run):
149+
-d, --depth DEPTH Rarefaction depth for β-diversity (default: 1000)
150+
-s, --seed SEED Random seed for reproducible rarefaction (default: random)
151+
--overwrite Overwrite the output directory if it already exists
152+
153+
Advanced (Step-by-step control):
154+
--step {mpa,combine,split,process,csv,relabund,diversity}
155+
Run only a specific part of the pipeline.
156+
Type 'krakenparser --step <name> -h' for more.
134157
```
135158

136-
## Before Visualization: Grouping Low-Abundance Taxa
137-
138-
The full pipeline automatically calculates relative abundance. Before passing data to visualization, it is strongly recommended to re-run `--relabund` with the `-O` flag — this collapses all taxa below the chosen threshold into a single **"Other"** group, producing much cleaner and more readable plots.
159+
## Installation
139160

140-
```bash
141-
KrakenParser --relabund -i data/counts/counts_species.csv -o data/rel_abund/ra_species.csv -O 4
142161
```
143-
144-
This groups every taxon with relative abundance **< 4 %** into `Other (<4.0%)`. Adjust the threshold to your data.
145-
146-
> **Note:** The pipeline-generated `rel_abund/ra_*.csv` files (no `-O`) preserve the full unfiltered data — use them for statistical analysis. Use the `-O` variant specifically for visualization.
162+
pip install krakenparser
163+
```
147164

148165
---
149166

150167
<details>
151168
<summary><b>Using Individual Modules (Advanced)</b></summary>
152169
<br>
153170

154-
Each step of the pipeline can also be run individually. This is useful for re-running a single step, debugging, or integrating KrakenParser into a custom workflow.
171+
Each step of the pipeline can be run individually via `--step`. This is useful for re-running a single step, debugging, or integrating KrakenParser into a custom workflow. Run `krakenparser --step <name> -h` to see the full argument list for any step.
155172

156173
### **Step 1: Convert Kraken2 Reports to MPA Format**
157174
```bash
158175
# Batch mode (directory)
159-
KrakenParser --kreport2mpa -i data/kreports -o data/intermediate/mpa
176+
KrakenParser --step mpa -i data/kreports -o data/intermediate/mpa
160177
# Single file
161-
KrakenParser --kreport2mpa -r data/kreports/sample.kreport -o data/intermediate/mpa/sample.MPA.TXT
162-
#Having troubles? Run KrakenParser --kreport2mpa -h
178+
KrakenParser --step mpa -r data/kreports/sample.kreport -o data/intermediate/mpa/sample.MPA.TXT
163179
```
164180
Converts Kraken2 `.kreport` files into **MPA format**.
165181

166182
### **Step 2: Combine MPA Files**
167183
```bash
168-
KrakenParser --combine_mpa -i data/intermediate/mpa/* -o data/intermediate/COMBINED.txt
169-
#Having troubles? Run KrakenParser --combine_mpa -h
184+
KrakenParser --step combine -i data/intermediate/mpa/* -o data/intermediate/COMBINED.txt
170185
```
171186
Merges multiple MPA files into a single combined table.
172187

173188
### **Step 3: Extract Taxonomic Levels**
174189
```bash
175-
KrakenParser --deconstruct -i data/intermediate/COMBINED.txt -o data/intermediate
176-
#Having troubles? Run KrakenParser --deconstruct -h
190+
KrakenParser --step split -i data/intermediate/COMBINED.txt -o data/intermediate
177191
```
178192

179193
By default, human-related taxa (Homo sapiens, Hominidae, Primates, Mammalia, Chordata) are removed. To keep them:
180194
```bash
181-
KrakenParser --deconstruct -i data/intermediate/COMBINED.txt -o data/intermediate --keep-human
195+
KrakenParser --step split -i data/intermediate/COMBINED.txt -o data/intermediate --keep-human
182196
```
183197

184-
To inspect the **Viruses** domain separately:
198+
To inspect the **Viruses** domain only:
185199
```bash
186-
KrakenParser --deconstruct_viruses -i data/intermediate/COMBINED.txt -o data/counts_viruses
187-
#Having troubles? Run KrakenParser --deconstruct_viruses -h
200+
KrakenParser --step split -i data/intermediate/COMBINED.txt -o data/counts_viruses --viruses-only
188201
```
189202

190203
### **Step 4: Process Extracted Taxonomic Data**
191204
```bash
192-
KrakenParser --process -i data/intermediate/COMBINED.txt -o data/intermediate/txt/counts_phylum.txt
193-
#Having troubles? Run KrakenParser --process -h
205+
KrakenParser --step process -i data/intermediate/COMBINED.txt -o data/intermediate/txt/counts_phylum.txt
194206
```
195207

196-
Repeat on other 5 taxonomical levels (class, order, family, genus, species) or wrap up `KrakenParser --process` in a loop.
208+
Repeat for the other five taxonomic levels (class, order, family, genus, species) or wrap `--step process` in a loop.
197209

198210
Cleans up taxonomic names: removes prefixes (`s__`, `g__`, etc.) and replaces underscores with spaces.
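
The loop mentioned above could look like the following minimal sketch. It assumes the six per-level files follow the `counts_<level>.txt` naming shown in the output tree. The `KP` variable and the `process_all_levels` function are hypothetical helpers introduced for illustration: by default the script only prints each command, so it can be checked before anything runs.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical dry-run switch: by default each command is only printed.
# Run with KP="KrakenParser" in the environment to execute for real.
KP="${KP:-echo KrakenParser}"

# Apply `--step process` to each of the six taxonomic levels.
process_all_levels() {
  for level in phylum class order family genus species; do
    $KP --step process \
      -i data/intermediate/COMBINED.txt \
      -o "data/intermediate/txt/counts_${level}.txt"
  done
}

process_all_levels
```

Once the printed commands look right, `KP="KrakenParser" bash script.sh` would execute them (the script name is a placeholder).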
199211

200212
### **Step 5: Convert TXT to CSV**
201213
```bash
202-
KrakenParser --txt2csv -i data/intermediate/txt/counts_phylum.txt -o data/counts/counts_phylum.csv
203-
#Having troubles? Run KrakenParser --txt2csv -h
214+
KrakenParser --step csv -i data/intermediate/txt/counts_phylum.txt -o data/counts/counts_phylum.csv
204215
```
205216
Repeat for the other five taxonomic levels or wrap in a loop. Transposes data so that sample names become rows.
206217

207218
### **Step 6: Calculate Relative Abundance**
208219
```bash
209-
KrakenParser --relabund -i data/counts/counts_phylum.csv -o data/rel_abund/ra_phylum.csv
210-
#Having troubles? Run KrakenParser --relabund -h
220+
KrakenParser --step relabund -i data/counts/counts_phylum.csv -o data/rel_abund/ra_phylum.csv
211221
```
212222
Repeat for the other five taxonomic levels or wrap in a loop.
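
Steps 5 and 6 can be chained per level. The sketch below assumes the `counts_<level>` and `ra_<level>` file naming used in the examples above. `KP` and `convert_and_normalize` are hypothetical names introduced for illustration, and the default is a dry run that only prints the commands.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical dry-run switch: prints commands unless KP="KrakenParser" is set.
KP="${KP:-echo KrakenParser}"

# For each level: transpose TXT to CSV (Step 5), then compute
# relative abundance from the resulting counts CSV (Step 6).
convert_and_normalize() {
  for level in phylum class order family genus species; do
    $KP --step csv \
      -i "data/intermediate/txt/counts_${level}.txt" \
      -o "data/counts/counts_${level}.csv"
    $KP --step relabund \
      -i "data/counts/counts_${level}.csv" \
      -o "data/rel_abund/ra_${level}.csv"
  done
}

convert_and_normalize
```

Running the CSV conversion before `relabund` within the same iteration keeps each level's input ready before it is normalized.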
213223

214224
With "Other" grouping:
215225
```bash
216-
KrakenParser --relabund -i data/counts/counts_phylum.csv -o data/rel_abund/ra_phylum.csv -O 3.5
226+
KrakenParser --step relabund -i data/counts/counts_phylum.csv -o data/rel_abund/ra_phylum.csv -O 3.5
217227
```
218228
Groups all taxa with abundance < 3.5 % into `Other (<3.5%)`.
219229

220230
### **Step 7: Calculate α & β-Diversities**
221231
```bash
222-
KrakenParser --diversity -i data/counts/counts_species.csv -o data/diversity
223-
#Having troubles? Run KrakenParser --diversity -h
232+
KrakenParser --step diversity -i data/counts/counts_species.csv -o data/diversity
224233
```
225234

226235
With a custom rarefaction depth:
227236
```bash
228-
KrakenParser --diversity -i data/counts/counts_species.csv -o data/diversity -d 750
237+
KrakenParser --step diversity -i data/counts/counts_species.csv -o data/diversity -d 750
229238
```
230239

231-
For reproducible results (rarefaction uses random subsampling — fix the seed to get the same matrix every run):
240+
For reproducible results (fix the seed to get the same matrix every run):
232241
```bash
233-
KrakenParser --diversity -i data/counts/counts_species.csv -o data/diversity -s 42
242+
KrakenParser --step diversity -i data/counts/counts_species.csv -o data/diversity -s 42
234243
```
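
To compare how the rarefaction depth affects the β-diversity matrices, `-d` and `-s` can be combined and swept over a few candidate depths. This is only a sketch: the depths (500, 750, 1000), the per-depth output directories `data/diversity_d<depth>`, the `KP` variable and the `compare_depths` function are all illustrative assumptions, not part of KrakenParser. The default is a dry run that prints the commands.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical dry-run switch: prints commands unless KP="KrakenParser" is set.
KP="${KP:-echo KrakenParser}"

# Run the diversity step once per candidate depth with a fixed seed,
# writing each result to its own (illustrative) output directory.
compare_depths() {
  for depth in 500 750 1000; do
    $KP --step diversity \
      -i data/counts/counts_species.csv \
      -o "data/diversity_d${depth}" \
      -d "$depth" \
      -s 42
  done
}

compare_depths
```

Fixing the seed across all runs means that differences between the resulting matrices come from the depth alone rather than from the random subsampling.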
235244

236245
---
237246

238247
## Arguments Breakdown
239248

240-
### **--complete** (Full Pipeline)
241-
- Requires `-i`: path to the Kraken2 reports directory (e.g., `data/kreports`).
242-
- Optional `-o`: output directory (default: parent of `-i`).
243-
- Optional `--keep-human`: retain human-related taxa (default: filtered out).
244-
- Optional `-s INT`: random seed for reproducible β-diversity rarefaction (default: random).
249+
### **Full Pipeline** (`-i`)
250+
- `-i / --input`: path to the Kraken2 reports directory (e.g., `data/kreports`). Triggers the full pipeline.
251+
- `-o / --output`: output directory (default: parent of `-i`).
252+
- `--viruses`: extract only Viruses domain taxa throughout the pipeline.
253+
- `--keep-human`: retain human-related taxa (default: filtered out).
254+
- `-d INT / --depth`: rarefaction depth for β-diversity (default: 1000).
255+
- `-s INT / --seed`: random seed for reproducible β-diversity rarefaction (default: random).
256+
- `--overwrite`: overwrite the output directory if it already exists.
245257

246-
### **--kreport2mpa** (Step 1)
258+
### **--step mpa** (Step 1)
247259
- Batch mode: `-i DIR -o DIR` — converts all files in a directory.
248260
- Single-file mode: `-r FILE -o FILE`.
249261

250-
### **--combine_mpa** (Step 2)
262+
### **--step combine** (Step 2)
251263
- `-i FILE [FILE ...]`: one or more MPA files.
252264
- `-o FILE`: output merged table.
253265

254-
### **--deconstruct** & **--deconstruct_viruses** (Step 3)
266+
### **--step split** (Step 3)
255267
- Extracts **phylum, class, order, family, genus, species** into separate text files.
256-
- `--deconstruct` removes human-related reads by default; use `--keep-human` to retain them.
257-
- `--deconstruct_viruses` extracts only the Viruses domain.
268+
- Removes human-related reads by default; use `--keep-human` to retain them.
269+
- Use `--viruses-only` to extract only the Viruses domain.
258270

259-
### **--process** (Step 4)
271+
### **--step process** (Step 4)
260272
- Removes prefixes (`s__`, `g__`, etc.), replaces underscores with spaces.
261273
- `-i`: COMBINED.txt (source for sample-name header); `-o`: target txt file.
262274

263-
### **--txt2csv** (Step 5)
275+
### **--step csv** (Step 5)
264276
- Transposes a processed txt file into a CSV with sample names as rows.
265277

266-
### **--relabund** (Step 6)
278+
### **--step relabund** (Step 6)
267279
- Calculates relative abundance from a total-counts CSV.
268280
- `-O FLOAT`: group taxa below FLOAT % into `Other (<FLOAT%)`.
269281

270-
### **--diversity** (Step 7)
282+
### **--step diversity** (Step 7)
271283
- Shannon, Pielou & Chao1 for α-diversity.
272284
- Bray-Curtis & Jaccard for β-diversity.
273285
- `-d INT`: rarefaction depth for β-diversity (default: 1000).
@@ -293,16 +305,17 @@ results/
293305
│ ├─ alpha_div.csv
294306
│ ├─ beta_div_bray.csv
295307
│ └─ beta_div_jaccard.csv
296-
└─ intermediate/ # Intermediate files
297-
├─ mpa/ # Converted MPA files
298-
│ ├─ {sample}.txt
299-
│ ├─ ...
300-
├─ COMBINED.txt # Merged MPA table
301-
└─ txt/ # Extracted taxonomic levels in TXT
302-
├─ counts_species.txt
303-
├─ counts_genus.txt
304-
├─ ...
305-
└─ counts_phylum.txt
308+
├─ intermediate/ # Intermediate files
309+
│ ├─ mpa/ # Converted MPA files
310+
│ │ ├─ {sample}.txt
311+
│ │ ├─ ...
312+
│ ├─ COMBINED.txt # Merged MPA table
313+
│ └─ txt/ # Extracted taxonomic levels in TXT
314+
│ ├─ counts_species.txt
315+
│ ├─ counts_genus.txt
316+
│ ├─ ...
317+
│ └─ counts_phylum.txt
318+
└─ krakenparser.log # Pipeline execution logs
306319
```
307320

308321
## Conclusion
