paper/paper.md (4 additions, 1 deletion)
@@ -32,18 +32,21 @@ Comparative metagenomics and microbiome studies depend fundamentally on cross-sa
`KrakenParser` is implemented in Python 3 (distributed via PyPI as `krakenparser`) and follows a modular architecture split into three distinct operational layers: Data Processing, Statistical Analysis, and Visualization. The pipeline can be executed in an end-to-end automated mode by providing global input and output paths directly to the main command, or controlled step-by-step through granular subcommands.
## Data Processing and Filtering
+
Individual taxonomic reports are parsed, converted into MetaPhlAn (MPA) tables, and merged into a unified cross-sample master count matrix. This matrix is then split into a separate table for each major taxonomic rank. During the split, `KrakenParser` strips rank prefixes (e.g., `s__` from species names) and replaces underscores with spaces in taxonomic strings, so that names are human-readable and compatible with downstream software.
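The name clean-up described above can be sketched as follows. This is a minimal illustration, not `KrakenParser`'s actual code: the helper name `clean_taxon_name` and the assumption that rank prefixes are a single lowercase letter followed by two underscores are hypothetical.

```python
import re

# Hypothetical helper, not part of KrakenParser's API.
# Assumption: rank prefixes look like "s__", "g__", "d__" (one lowercase letter + "__").
_RANK_PREFIX = re.compile(r"^[a-z]__")


def clean_taxon_name(raw: str) -> str:
    """Take the last field of an MPA-style lineage, drop its rank prefix,
    and replace underscores with spaces."""
    leaf = raw.split("|")[-1]
    return _RANK_PREFIX.sub("", leaf).replace("_", " ")


print(clean_taxon_name("s__Escherichia_coli"))
print(clean_taxon_name("d__Bacteria|g__Escherichia"))
```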
The core data engine provides flexible filtering. Users can isolate or exclude specific biological domains or kingdoms (Bacteria, Viruses, Archaea, Fungi) during extraction. Non-target host reads (e.g., human contamination) are filtered out by default to focus on microbial signatures, while the `--keep-human` flag preserves host read counts in the output matrices. Crucially, `--keep-human` can be combined with domain-specific filters, so host-to-microbe or host-to-pathogen abundance ratios can be evaluated within a single run.
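To illustrate how a domain filter and a host-retention switch might interact, here is a minimal sketch. It does not reproduce `KrakenParser`'s implementation: the function `filter_lineages`, its parameters, and the MPA-style lineage strings in the example are all assumptions made for illustration; only the behaviour (host reads dropped by default, kept when requested, combinable with a domain filter) comes from the text above.

```python
# Hypothetical sketch; names and lineage format are assumptions, not KrakenParser internals.
HUMAN_SPECIES = "s__Homo_sapiens"


def filter_lineages(counts, domains=None, keep_human=False):
    """Return the subset of {lineage: count} that passes the domain and host filters.

    counts     : dict mapping MPA-style lineage strings to read counts
    domains    : optional set of domain/kingdom names to keep; None keeps all
    keep_human : when True, human rows are retained even if a domain filter is active
    """
    kept = {}
    for lineage, n in counts.items():
        fields = lineage.split("|")
        if HUMAN_SPECIES in fields:
            # Host rows bypass the domain filter and depend only on keep_human.
            if keep_human:
                kept[lineage] = n
            continue
        # "d__Bacteria" -> "Bacteria"; compare every rank name against the filter.
        names = {field.split("__", 1)[-1] for field in fields}
        if domains is None or names & domains:
            kept[lineage] = n
    return kept


counts = {
    "d__Bacteria|g__Escherichia|s__Escherichia_coli": 120,
    "d__Viruses|s__Escherichia_virus_T4": 5,
    "d__Eukaryota|k__Metazoa|s__Homo_sapiens": 900,
}

# Bacteria plus host reads in one pass, then a host-to-microbe ratio.
kept = filter_lineages(counts, domains={"Bacteria"}, keep_human=True)
host = sum(n for k, n in kept.items() if HUMAN_SPECIES in k)
microbe = sum(n for k, n in kept.items() if HUMAN_SPECIES not in k)
print(host / microbe)
```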
## Statistical Analysis
+
Following matrix generation, the statistical module computes normalization metrics and ecological indices directly:
* **Relative Abundance:** Converts absolute counts into percentages using the formula: $\text{Relative Abundance} = \left( \frac{\text{Number of individuals of a taxon}}{\text{Total number of individuals of all taxa}} \right) \times 100$. A user-defined abundance threshold aggregates rare background taxa into a consolidated `Other` category to simplify downstream parsing and plotting.
* **Alpha Diversity:** Calculates the *Shannon* [@shannon1948mathematical], *Pielou’s evenness* [@pielou1966measurement], and *Chao1* [@chao2002estimating] indices. To mitigate artifacts caused by uneven sequencing depth across sequencing runs, a built-in rarefaction procedure subsamples reads to a uniform, user-specified depth before the indices are calculated.
* **Beta Diversity:** Computes compositional dissimilarity between samples using the *Bray-Curtis* [@bray10jt] and *Jaccard* [@jaccard1901etude] distance metrics and exports standard distance matrices ready for ordination.
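The quantities listed above can be written out in a few lines of standard-library Python. This is a sketch under stated assumptions rather than `KrakenParser`'s own code: all function names are hypothetical, the Shannon index uses the natural logarithm, Chao1 is shown in its bias-corrected form, Jaccard is computed on presence/absence, taxa strictly below the threshold are pooled into `Other`, and rarefaction is done by sampling reads without replacement.

```python
import math
import random
from collections import Counter


def relative_abundance(counts, threshold=1.0):
    """Convert {taxon: count} to percentages; pool taxa below `threshold` percent into 'Other'."""
    total = sum(counts.values())
    result, other = {}, 0.0
    for taxon, n in counts.items():
        pct = 100.0 * n / total
        if pct < threshold:
            other += pct
        else:
            result[taxon] = pct
    if other > 0:
        result["Other"] = other
    return result


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement (requires depth <= total reads)."""
    reads = [i for i, n in enumerate(counts) for _ in range(n)]
    tally = Counter(random.Random(seed).sample(reads, depth))
    return [tally.get(i, 0) for i in range(len(counts))]


def shannon(counts):
    """H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def pielou(counts):
    """J' = H' / ln(S), where S is the number of observed taxa."""
    s = sum(1 for n in counts if n > 0)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    s_obs = sum(1 for n in counts if n > 0)
    f1 = sum(1 for n in counts if n == 1)  # singletons
    f2 = sum(1 for n in counts if n == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def bray_curtis(a, b):
    """Sum of absolute count differences divided by the total count of both samples."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


def jaccard(a, b):
    """Jaccard distance on presence/absence: 1 - shared taxa / taxa present in either sample."""
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    union = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    return 1.0 - shared / union


# Two toy samples over the same six taxa (made-up counts for illustration only).
sample_a = [10, 5, 1, 1, 2, 0]
sample_b = [0, 5, 3, 0, 2, 4]
print(relative_abundance({"A": 90, "B": 9, "C": 1}, threshold=5))
print(chao1(sample_a), bray_curtis(sample_a, sample_b), jaccard(sample_a, sample_b))
```

Rarefying each sample to a common depth with `rarefy` before calling `shannon`, `pielou`, or `chao1` mirrors the order of operations described in the alpha-diversity item.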
## Visualization
+
The `kpplot` module uses an object-oriented design in which plot classes inherit from a shared base configuration class (`KpPlotBase`) that enforces consistent rendering settings such as DPI, bounding-box scaling, and layout. Built on `matplotlib` [@Hunter2007], `pandas` [@reback2020pandas], and `seaborn` [@Waskom2021], the visualization engine exposes four primary programmatic layouts:
* **Stacked Bar Plots:** For comparing relative taxonomic proportions across multi-sample cohorts.
@@ -59,4 +62,4 @@ The functional reliability and execution integrity of `KrakenParser` are validat
Generative AI tools were used during the development of this work to assist with code refactoring, documentation drafting, and manuscript text editing. All software design decisions, implementation, validation, and scientific interpretation were performed and reviewed by the authors. No generative AI tools were used to generate or analyze research data, and all results reported are reproducible from the publicly available source code and documentation.