Add -i/--input dir mode alongside the existing single-file -r mode. Remove run_kreport2mpa.sh wrapper — batch logic now lives in Python.
Merge decombine.sh + decombine_viruses.sh into a single split_mpa.py. Add --viruses-only and --keep-human flags. Move convert2csv and processing_script into counts/ subpackage.
Replace kraken2csv.sh with pipeline.py. Add run_pipeline() as a callable library function. Add -o/--output and --overwrite flags. Skip binary and hidden files automatically (_is_processable check).
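As a rough illustration of the kind of filter `_is_processable` describes, here is a minimal sketch. It assumes "hidden" means a leading dot in the file name and "binary" means a NUL byte in the first block of the file; the function name `is_processable` and the `probe_bytes` parameter are hypothetical, and the actual check in pipeline.py may work differently.

```python
from pathlib import Path


def is_processable(path: Path, probe_bytes: int = 1024) -> bool:
    """Hypothetical stand-in for _is_processable: True for visible text files."""
    # Skip directories, missing paths, and hidden files such as .DS_Store.
    if not path.is_file() or path.name.startswith("."):
        return False
    # Assumption: a NUL byte in the first block marks the file as binary.
    with path.open("rb") as handle:
        chunk = handle.read(probe_bytes)
    return b"\x00" not in chunk
```

A batch loop could then iterate over `sorted(input_dir.iterdir())` and keep only the paths for which this returns True.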
…dation - Add aggregate_by_metadata() to base for metadata grouping - Fix streamgraph grid axis (x→y) - Raise ValueError when cmap list is shorter than taxa count - Add missing-sample validation on sample_order
…modules Remove version.py — version now sourced from package metadata. Dispatcher uses python -m package.module to preserve import context. Default to full pipeline when -i is given without a subcommand.
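The two mechanisms named in this commit can be sketched with the standard library alone. `importlib.metadata.version` reads the version of an installed distribution, and running a module through `python -m` keeps package-relative imports working. The helper names `get_version` and `build_dispatch_command`, and the `"0+unknown"` fallback string, are assumptions for illustration rather than the project's own code.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def get_version(dist_name: str) -> str:
    """Return the installed version of dist_name, or a placeholder if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Happens when running from a source tree that was never installed.
        return "0+unknown"


def build_dispatch_command(module: str, args: list[str]) -> list[str]:
    """Build an argv that runs `module` via `python -m` with the current interpreter."""
    return [sys.executable, "-m", module, *args]
```

The resulting list can be handed to `subprocess.run(...)`; because the module is addressed by its dotted name, imports inside the package resolve the same way as when it is imported normally.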
Add fail-fast: false, flake8 lint step, --no-build-isolation, branch-scoped triggers (dev, main), codecov-action@v5.
Triggers on v* tags. Strips GitHub-specific markup from README before build to ensure clean PyPI rendering.
Unit tests: _parse_line, shannon/pielou/chao1 indices, modify_taxa_names. Integration: kreport_to_mpa, convert_to_csv, relabund, alpha/beta div, split_mpa — all with reproducibility (SHA-256) checks.
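For reference, the three alpha-diversity indices named above can be written in a few lines of plain Python. This is a sketch using the textbook definitions, not the package's implementation. It assumes the natural logarithm for Shannon, returns 0.0 for Pielou when fewer than two taxa are observed, and uses the bias-corrected form of Chao1. The function names are illustrative.

```python
import math
from collections.abc import Sequence


def shannon(counts: Sequence[int]) -> float:
    """Shannon entropy H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou(counts: Sequence[int]) -> float:
    """Pielou evenness J = H / ln(S), where S is the number of observed taxa."""
    observed = sum(1 for c in counts if c > 0)
    if observed < 2:
        return 0.0  # assumption: evenness is reported as 0 when undefined
    return shannon(counts) / math.log(observed)


def chao1(counts: Sequence[int]) -> float:
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))
```

A perfectly even community is a convenient unit-test fixture: for `[10, 10, 10, 10]` the Shannon index equals ln 4 and the Pielou index equals 1.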
Add smoke tests for stacked_barplot, streamgraph, clustermap. Switch test_full_pipeline to direct run_pipeline() call so pipeline.py is included in coverage. Add overwrite protection tests.
Add codecov badge. Document new -o/--output and --overwrite flags. Add Before Visualization section highlighting --relabund -O. Move step-by-step modules to collapsible <details> block. Update Example Output Structure with intermediate/ directory.
biom-format uses a legacy setup.py that imports numpy at metadata generation time. Pre-installing numpy ensures it is available before pip resolves scikit-bio's transitive dependencies.
biom-format (transitive dep of scikit-bio) has no cp314 wheel and its legacy setup.py has undeclared build-time deps (numpy, Cython). Pin support to 3.10–3.13 until scikit-bio ecosystem catches up.
scikit-bio pulls in biom-format which has no cp314 wheel and brittle build-time deps. Reimplement subsample_counts in numpy and use scipy.spatial.distance for Bray-Curtis and Jaccard. Restore Python 3.14 support and bump requires-python to <=3.16.
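A minimal sketch of how the scikit-bio pieces could be replaced, assuming a samples-by-taxa integer count table. The `subsample_counts` here is an illustrative re-implementation (draw reads without replacement) and `beta_diversity` is a hypothetical wrapper; only `numpy.repeat`, `numpy.bincount`, `Generator.choice`, and `scipy.spatial.distance.pdist`/`squareform` are real library calls. The PR's actual code may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def subsample_counts(counts, depth, rng):
    """Rarefy one sample to `depth` reads, drawing without replacement."""
    counts = np.asarray(counts, dtype=np.int64)
    if depth > counts.sum():
        raise ValueError("depth exceeds the number of reads in the sample")
    # One entry per read, labelled with the index of its taxon.
    reads = np.repeat(np.arange(counts.size), counts)
    picked = rng.choice(reads, size=depth, replace=False)
    return np.bincount(picked, minlength=counts.size)


def beta_diversity(table, depth, seed=None):
    """Return (Bray-Curtis, Jaccard) square distance matrices after rarefaction."""
    rng = np.random.default_rng(seed)
    rarefied = np.array([subsample_counts(row, depth, rng) for row in table])
    bray_curtis = squareform(pdist(rarefied, metric="braycurtis"))
    # Jaccard is computed on presence/absence, so convert counts to booleans.
    jaccard = squareform(pdist(rarefied > 0, metric="jaccard"))
    return bray_curtis, jaccard
```

Converting to booleans before calling the Jaccard metric keeps the result a presence/absence dissimilarity regardless of how a given SciPy version treats non-boolean input.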
Rarefaction is stochastic — without a fixed seed results differ between runs. Replace np.random.choice with np.random.default_rng and propagate the seed through calc_beta_div and run_pipeline.
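The seed propagation described here can be sketched as follows. The `_sketch` function names and their signatures are hypothetical stand-ins for `calc_beta_div` and `run_pipeline`, whose real signatures are not shown in this PR page. The rarefaction step uses `Generator.multivariate_hypergeometric`, which draws a fixed number of reads without replacement; that is a choice made for this example, not necessarily what the project uses.

```python
import numpy as np


def calc_beta_div_sketch(table, depth, seed=None):
    """Rarefy every sample to `depth` reads using a generator built from `seed`."""
    rng = np.random.default_rng(seed)  # seed=None gives a fresh, unseeded generator
    table = np.asarray(table, dtype=np.int64)
    # One multivariate hypergeometric draw per sample: reads without replacement.
    return np.array([rng.multivariate_hypergeometric(row, depth) for row in table])


def run_pipeline_sketch(table, depth, seed=None):
    """Top-level entry point that passes the seed down unchanged."""
    return calc_beta_div_sketch(table, depth, seed=seed)
```

Calling `run_pipeline_sketch(table, depth=10, seed=42)` twice returns identical rarefied tables, whereas omitting the seed generally gives different draws on each run.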
Summary
Complete modernisation of KrakenParser from a loosely-coupled set of
bash + Python scripts into a proper Python package.
Architecture
- transform2mpa.py, split_mpa.py, and pipeline.py replace the former bash scripts
- Subpackages krakenparser/mpa/, counts/, stats/, kpplot/, each with __init__.py
- pipeline.py orchestrates the full pipeline as a callable run_pipeline() function (importable, not just CLI)
- Drop scikit-bio in favour of scipy: eliminates the biom-format dependency and unblocks Python 3.10–3.13 support
New features
- KrakenParser -i data/kreports: no --complete required (default mode)
- -o / --output: custom output directory for the full pipeline
- --overwrite: explicit protection against accidental reruns
- --keep-human: opt out of human taxa filtering
- -s / --seed: reproducible β-diversity rarefaction
- --kreport2mpa: batch (-i DIR) and single-file (-r FILE) modes
Output structure
Intermediate files (mpa/, COMBINED.txt, txt/) are now grouped under intermediate/; final results (counts/, rel_abund/, diversity/) surface at the top level.
Quality
- logging instead of print() in library functions
- raise FileNotFoundError instead of sys.exit() in library code
- importlib.metadata for version (no version.py)
- python -m krakenparser.* dispatch preserves package import context
CI/CD
- python-package.yml: matrix 3.10–3.13, flake8, codecov
- publish.yml: Trusted Publishing on v* tags
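To illustrate the first two Quality items above (logging instead of print(), and raising FileNotFoundError instead of calling sys.exit() in library code), here is a minimal sketch. `load_report_lines` and `main` are hypothetical names, not functions from the package. The point is that the library function raises and logs, while only the CLI layer turns the exception into an exit code.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_report_lines(path) -> list[str]:
    """Hypothetical library helper: raise on a missing file, never exit."""
    report = Path(path)
    if not report.is_file():
        raise FileNotFoundError(f"Report not found: {report}")
    lines = report.read_text().splitlines()
    logger.info("Read %d lines from %s", len(lines), report)
    return lines


def main(argv: list[str]) -> int:
    """Hypothetical CLI wrapper: configure logging and map errors to exit codes."""
    logging.basicConfig(level=logging.INFO)
    try:
        load_report_lines(argv[0])
    except FileNotFoundError as exc:
        logger.error("%s", exc)
        return 1
    return 0
```

With this split, callers that import the library can catch the exception themselves, and tests can assert on it without the interpreter being terminated.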