feat: v1.0.0 — full Python rewrite, modular architecture, test suite by ilypopv · Pull Request #13 · PopovIILab/KrakenParser

ilypopv · 2026-05-12T13:52:16Z

Summary

Complete modernisation of KrakenParser from a loosely coupled set of
bash + Python scripts into a proper Python package.

Architecture

No more bash. All shell scripts replaced with pure Python: transform2mpa.py, split_mpa.py, pipeline.py
Subpackage structure: krakenparser/mpa/, counts/, stats/, kpplot/, each with __init__.py
pipeline.py: orchestrates the full pipeline as a callable run_pipeline() function (importable, not just CLI)
Removed scikit-bio in favour of scipy: eliminates the biom-format dependency and unblocks Python 3.10–3.13 support

New features

KrakenParser -i data/kreports — no --complete required (default mode)
-o / --output — custom output directory for the full pipeline
--overwrite — explicit protection against accidental reruns
--keep-human — opt-out of human taxa filtering
-s / --seed — reproducible β-diversity rarefaction
--kreport2mpa — batch (-i DIR) and single-file (-r FILE) modes

Output structure

Intermediate files (mpa/, COMBINED.txt, txt/) are now grouped
under intermediate/; final results (counts/, rel_abund/,
diversity/) surface at the top level.

Quality

70 tests, 83% coverage: unit, integration, kpplot, full pipeline
logging instead of print() in library functions
raise FileNotFoundError instead of sys.exit() in library code
importlib.metadata for version (no version.py)
python -m krakenparser.* dispatch preserves package import context
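
The split between library code and CLI code described in the Quality list can be sketched as follows. This is a minimal illustration, not KrakenParser's actual code: the logger name, `read_report()` and `main()` are hypothetical names chosen for the example.

```python
import logging
from pathlib import Path

# Hypothetical logger name; a real package would typically use __name__.
logger = logging.getLogger("krakenparser.example")


def read_report(path):
    """Library function: log progress and raise on a missing file."""
    report = Path(path)
    if not report.is_file():
        # Raising keeps the function usable from other Python code;
        # sys.exit() here would terminate the caller's interpreter.
        raise FileNotFoundError(f"Report not found: {report}")
    logger.info("Reading %s", report)
    return report.read_text().splitlines()


def main(argv):
    """Hypothetical CLI wrapper: converts the exception into an exit code."""
    logging.basicConfig(level=logging.INFO)
    try:
        lines = read_report(argv[0])
    except FileNotFoundError as exc:
        logger.error("%s", exc)
        return 1
    logger.info("Read %d lines", len(lines))
    return 0
```

With this layout only the outermost entry point decides whether to exit, so the same function can be called from tests or notebooks without killing the process.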

CI/CD

python-package.yml — matrix 3.10–3.13, flake8, codecov
publish.yml — Trusted Publishing on v* tags

…dation
- Add aggregate_by_metadata() to base for metadata grouping
- Fix streamgraph grid axis (x→y)
- Raise ValueError when cmap list is shorter than taxa count
- Add missing-sample validation on sample_order
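
The two input checks mentioned above (colour list length and `sample_order` contents) might look roughly like this. The function name `validate_plot_inputs()` and its parameters are hypothetical, and using `ValueError` for the missing-sample case is an assumption; the commit message only confirms `ValueError` for the cmap check.

```python
def validate_plot_inputs(taxa, cmap, samples, sample_order=None):
    """Return the sample order to plot, raising on inconsistent inputs.

    Hypothetical helper: names and signature are illustrative only.
    """
    # A list of colours must provide at least one colour per taxon.
    if isinstance(cmap, (list, tuple)) and len(cmap) < len(taxa):
        raise ValueError(
            f"cmap has {len(cmap)} colours but {len(taxa)} taxa are plotted"
        )
    if sample_order is None:
        return list(samples)
    # Every requested sample must exist in the data (exception type assumed).
    known = set(samples)
    missing = [s for s in sample_order if s not in known]
    if missing:
        raise ValueError(f"sample_order contains unknown samples: {missing}")
    return list(sample_order)
```

Failing early with a descriptive message avoids a plot that silently reuses colours or drops samples.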

scikit-bio pulls in biom-format which has no cp314 wheel and brittle build-time deps. Reimplement subsample_counts in numpy and use scipy.spatial.distance for Bray-Curtis and Jaccard. Restore Python 3.14 support and bump requires-python to <=3.16.
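
As a rough illustration of the scipy-based approach, the sketch below computes pairwise Bray-Curtis and Jaccard distances with `scipy.spatial.distance.pdist`. The function name `beta_distances()` and the toy count matrix are assumptions for the example; how KrakenParser arranges its own matrices is not shown in this PR text.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def beta_distances(counts):
    """Return (bray_curtis, jaccard) square matrices for a samples x taxa array.

    Hypothetical helper, not the project's API.
    """
    counts = np.asarray(counts, dtype=float)
    # Bray-Curtis works on abundances directly.
    bray = squareform(pdist(counts, metric="braycurtis"))
    # Jaccard is computed here on presence/absence, hence the boolean cast.
    jaccard = squareform(pdist(counts > 0, metric="jaccard"))
    return bray, jaccard


# Toy data: three samples (rows) by three taxa (columns).
example = [[10, 0, 5], [5, 5, 5], [0, 20, 0]]
bray, jaccard = beta_distances(example)
```

Samples 1 and 3 share no taxa, so both distances between them are 1.0; the other pairs fall strictly between 0 and 1.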

ilypopv added 18 commits March 12, 2026 09:14

build: add pyproject.toml and migrate packaging from wheel metadata

8367c2e

refactor(mpa): rewrite kreport2mpa as transform2mpa with batch mode

3ac8555

Add -i/--input dir mode alongside the existing single-file -r mode. Remove run_kreport2mpa.sh wrapper — batch logic now lives in Python.
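
A minimal argparse sketch of the two modes follows. It is not the project's parser: the long option name `--report`, the mutually exclusive grouping, and the helper names are assumptions made for illustration; only `-i/--input` and `-r` come from the commit message.

```python
import argparse
from pathlib import Path


def build_parser():
    """Hypothetical parser offering batch (-i DIR) and single-file (-r FILE) modes."""
    parser = argparse.ArgumentParser(prog="transform2mpa-sketch")
    # Assumption: exactly one of the two modes must be chosen.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-i", "--input", type=Path,
                       help="directory of reports (batch mode)")
    group.add_argument("-r", "--report", type=Path,
                       help="single report file")
    return parser


def collect_reports(args):
    """Return the list of report paths implied by the parsed arguments."""
    if args.input is not None:
        # Batch mode: every regular file in the directory, in stable order.
        return sorted(p for p in args.input.iterdir() if p.is_file())
    return [args.report]
```

Keeping the directory walk in Python is what makes a separate shell wrapper unnecessary.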

refactor(counts): replace bash decombine scripts with Python split_mpa

3ccdab8

Merge decombine.sh + decombine_viruses.sh into a single split_mpa.py. Add --viruses-only and --keep-human flags. Move convert2csv and processing_script into counts/ subpackage.

refactor(stats): move relabund and diversity into stats/ subpackage

eb92428

feat(pipeline): port full pipeline orchestration from bash to Python

58c1d8a

Replace kraken2csv.sh with pipeline.py. Add run_pipeline() as a callable library function. Add -o/--output and --overwrite flags. Skip binary and hidden files automatically (_is_processable check).
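
The PR does not show how `_is_processable` decides what to skip. One common heuristic, given here purely as an assumption, is to treat dot-files as hidden and to treat any file with a NUL byte near its start as binary:

```python
from pathlib import Path


def is_processable(path, probe_bytes=1024):
    """Guess whether a file is a visible text file worth processing.

    Illustrative heuristic only; the real _is_processable check may differ.
    """
    p = Path(path)
    # Skip directories, missing paths and hidden files such as .DS_Store.
    if not p.is_file() or p.name.startswith("."):
        return False
    with p.open("rb") as fh:
        chunk = fh.read(probe_bytes)
    # Text reports should not contain NUL bytes; binaries usually do.
    return b"\x00" not in chunk
```

Filtering this way lets an input directory contain stray archives or editor files without breaking a batch run.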

refactor(cli): use importlib.metadata for version, run scripts as -m modules

ee330e8

Remove version.py; version is now sourced from package metadata. Dispatcher uses python -m package.module to preserve import context. Default to full pipeline when -i is given without a subcommand.
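
Both ideas in this commit rely on standard-library features, sketched below. The helper names and the `"unknown"` fallback are assumptions; the distribution and module names passed in are up to the caller.

```python
import subprocess
import sys
from importlib.metadata import PackageNotFoundError, version


def get_version(dist_name):
    """Look up an installed distribution's version from its metadata."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Assumed fallback for source checkouts that are not installed.
        return "unknown"


def build_command(module, args):
    """Build a `python -m package.module ...` command for the current interpreter."""
    return [sys.executable, "-m", module, *args]


def run_module(module, args):
    """Run a module via -m so its package-relative imports keep working."""
    return subprocess.run(build_command(module, args), check=False)
```

Running a file by path makes it a top-level script with no parent package, so relative imports fail; `-m` imports it as part of its package instead, which is why the dispatcher benefits from this form.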

ci: rename workflow to python-package.yml, expand matrix to 3.10–3.14

7dd6914

Add fail-fast: false, flake8 lint step, --no-build-isolation, branch-scoped triggers (dev, main), codecov-action@v5.

ci: add publish.yml with Trusted Publishing for PyPI

c37f5cb

Triggers on v* tags. Strips GitHub-specific markup from README before build to ensure clean PyPI rendering.

test: add unit and integration test suite (62 tests)

6a5c80f

Unit tests: _parse_line, shannon/pielou/chao1 indices, modify_taxa_names. Integration: kreport_to_mpa, convert_to_csv, relabund, alpha/beta div, split_mpa — all with reproducibility (SHA-256) checks.
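
For reference, the indices named above and a SHA-256 reproducibility check can be sketched with the standard library alone. These are textbook definitions, not the project's code: natural-log Shannon and the bias-corrected Chao1 form are assumptions, since the PR does not say which variants KrakenParser uses.

```python
import hashlib
import math


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou(counts):
    """Pielou evenness J = H / ln(S), with S the number of observed taxa."""
    observed = sum(1 for c in counts if c > 0)
    if observed < 2:
        return 0.0  # evenness is undefined for fewer than two taxa
    return shannon(counts) / math.log(observed)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def sha256_of(path):
    """Hex digest of a file, read in blocks so large outputs are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()
```

A reproducibility test then runs the same step twice and asserts that the two output files have identical digests.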

test: add kpplot tests and raise coverage to 83%

00efa56

Add smoke tests for stacked_barplot, streamgraph, clustermap. Switch test_full_pipeline to direct run_pipeline() call so pipeline.py is included in coverage. Add overwrite protection tests.

docs: update README for v1.0.0

0aa7b8d

Add codecov badge. Document new -o/--output and --overwrite flags. Add Before Visualization section highlighting --relabund -O. Move step-by-step modules to collapsible <details> block. Update Example Output Structure with intermediate/ directory.

ci: pre-install numpy to fix biom-format build on Python 3.14

9664a56

biom-format uses a legacy setup.py that imports numpy at metadata generation time. Pre-installing numpy ensures it is available before pip resolves scikit-bio's transitive dependencies.

ci: drop Python 3.14 from matrix; cap requires-python to <=3.13

8d5d457

biom-format (transitive dep of scikit-bio) has no cp314 wheel and its legacy setup.py has undeclared build-time deps (numpy, Cython). Pin support to 3.10–3.13 until scikit-bio ecosystem catches up.

feat(diversity): add -s/--seed for reproducible rarefaction

4ce8750

Rarefaction is stochastic — without a fixed seed results differ between runs. Replace np.random.choice with np.random.default_rng and propagate the seed through calc_beta_div and run_pipeline.
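
A seeded rarefaction step with `np.random.default_rng` might look like the sketch below. The body of `subsample_counts()` here is an illustrative implementation (expand to individual reads, draw without replacement, tally back), not necessarily the one in this PR.

```python
import numpy as np


def subsample_counts(counts, depth, seed=None):
    """Rarefy a vector of per-taxon counts to `depth` reads without replacement.

    Sketch only: passing the same seed yields the same subsample.
    """
    counts = np.asarray(counts, dtype=np.int64)
    if depth > counts.sum():
        raise ValueError("depth exceeds the total number of reads")
    rng = np.random.default_rng(seed)
    # One entry per read, labelled with the index of its taxon.
    reads = np.repeat(np.arange(counts.size), counts)
    picked = rng.choice(reads, size=depth, replace=False)
    # Tally the drawn reads back into a per-taxon vector.
    return np.bincount(picked, minlength=counts.size)
```

Creating the generator from the seed inside the function, rather than relying on NumPy's global random state, keeps results independent of whatever other code has drawn random numbers earlier in the run.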

docs: document -s/--seed flag in Quick Start and diversity step

185e8d4

Update README.md

14a21e2

ilypopv merged commit 7315fbc into main May 12, 2026
10 checks passed

ilypopv merged 18 commits into main from dev
