
feat: v1.0.0 — full Python rewrite, modular architecture, test suite#13

Merged
ilypopv merged 18 commits into main from dev on May 12, 2026

@ilypopv ilypopv commented May 12, 2026

Member

Summary

Complete modernisation of KrakenParser from a loosely coupled set of
bash + Python scripts into a proper Python package.

Architecture

  • No more bash. All shell scripts replaced with pure Python:
    transform2mpa.py, split_mpa.py, pipeline.py
  • Subpackage structure: krakenparser/mpa/, counts/, stats/,
    kpplot/ — each with __init__.py
  • pipeline.py — orchestrates the full pipeline as a callable
    run_pipeline() function (importable, not just CLI)
  • Removed scikit-bio in favour of scipy — eliminates
    biom-format dependency and unblocks Python 3.10–3.13 support
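
As a rough illustration of the scikit-bio to scipy swap, the two β-diversity distances named in this PR can be computed with `scipy.spatial.distance` alone. This is a minimal sketch on a made-up count table, not the package's actual code; the variable names are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy count table (made-up numbers): rows are samples, columns are taxa.
counts = np.array([
    [10, 0, 5],
    [8, 2, 0],
    [0, 7, 3],
])

# Bray-Curtis is computed directly on abundances.
bray_curtis = squareform(pdist(counts, metric="braycurtis"))

# Jaccard is a presence/absence metric, so binarise the table first.
presence = counts > 0
jaccard = squareform(pdist(presence, metric="jaccard"))

print(bray_curtis.round(3))
print(jaccard.round(3))
```

`pdist` returns a condensed vector of pairwise distances; `squareform` expands it into the symmetric sample-by-sample matrix.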

New features

  • KrakenParser -i data/kreports — no --complete required (default mode)
  • -o / --output — custom output directory for the full pipeline
  • --overwrite — explicit protection against accidental reruns
  • --keep-human — opt-out of human taxa filtering
  • -s / --seed — reproducible β-diversity rarefaction
  • --kreport2mpa — batch (-i DIR) and single-file (-r FILE) modes
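
One plausible reading of the `--overwrite` protection is sketched below: refuse to reuse a non-empty output directory unless the caller opts in. The helper name `prepare_output_dir` and the choice to delete the old directory are assumptions for illustration; the PR does not show how KrakenParser implements this.

```python
import shutil
from pathlib import Path


def prepare_output_dir(path, overwrite=False):
    """Hypothetical helper: create `path`, refusing to reuse a non-empty one."""
    out = Path(path)
    if out.is_dir() and any(out.iterdir()):
        if not overwrite:
            raise FileExistsError(
                f"{out} already exists and is not empty; "
                "pass overwrite=True to replace it"
            )
        # Assumption: overwriting means starting from a clean directory.
        shutil.rmtree(out)
    out.mkdir(parents=True, exist_ok=True)
    return out
```

Raising an exception here, rather than exiting, keeps the check usable from both a CLI wrapper and library callers.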

Output structure

Intermediate files (mpa/, COMBINED.txt, txt/) are now grouped
under intermediate/; final results (counts/, rel_abund/,
diversity/) surface at the top level.

Quality

  • 70 tests, 83 % coverage — unit, integration, kpplot, full pipeline
  • logging instead of print() in library functions
  • raise FileNotFoundError instead of sys.exit() in library code
  • importlib.metadata for version (no version.py)
  • python -m krakenparser.* dispatch preserves package import context
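
The library-code conventions listed above could look roughly like this. It is a sketch under assumptions: the distribution name `krakenparser`, the fallback string, and the function `read_report` are illustrative and not taken from the package.

```python
import logging
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path

# Library modules log through a module-level logger instead of print().
logger = logging.getLogger(__name__)


def get_version(dist_name="krakenparser"):
    """Read the version from installed package metadata (no version.py)."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Assumption: placeholder used when the package is not installed.
        return "0+unknown"


def read_report(path):
    """Hypothetical library function: raise instead of calling sys.exit()."""
    report = Path(path)
    if not report.is_file():
        raise FileNotFoundError(f"Report not found: {report}")
    logger.info("Reading %s", report)
    return report.read_text().splitlines()
```

With this split, only the CLI entry point needs to catch `FileNotFoundError` and turn it into an exit code, and callers decide how log messages are displayed.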

CI/CD

  • python-package.yml — matrix 3.10–3.13, flake8, codecov
  • publish.yml — Trusted Publishing on v* tags

ilypopv added 18 commits March 12, 2026 09:14
Add -i/--input dir mode alongside the existing single-file -r mode.
Remove run_kreport2mpa.sh wrapper — batch logic now lives in Python.
Merge decombine.sh + decombine_viruses.sh into a single split_mpa.py.
Add --viruses-only and --keep-human flags.
Move convert2csv and processing_script into counts/ subpackage.
Replace kraken2csv.sh with pipeline.py. Add run_pipeline() as a
callable library function. Add -o/--output and --overwrite flags.
Skip binary and hidden files automatically (_is_processable check).
…dation

- Add aggregate_by_metadata() to base for metadata grouping
- Fix streamgraph grid axis (x→y)
- Raise ValueError when cmap list is shorter than taxa count
- Add missing-sample validation on sample_order
…modules

Remove version.py — version now sourced from package metadata.
Dispatcher uses python -m package.module to preserve import context.
Default to full pipeline when -i is given without a subcommand.
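
A dispatcher of the kind described in this commit might be sketched as follows. Running a submodule with `python -m package.module` imports it as part of its package, so relative and package-level imports keep working, which running the file by path would not guarantee. The function names and the example module path are assumptions.

```python
import subprocess
import sys


def build_module_command(module, args=()):
    """Build a `python -m package.module ...` command for the current interpreter."""
    return [sys.executable, "-m", module, *args]


def dispatch(module, args=()):
    """Run the module in a child process and return its exit code."""
    completed = subprocess.run(build_module_command(module, args), check=False)
    return completed.returncode


# Illustrative only: the exact module path inside krakenparser is assumed.
cmd = build_module_command("krakenparser.mpa.transform2mpa", ["-i", "data/kreports"])
print(cmd[1:])
```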
Add fail-fast: false, flake8 lint step, --no-build-isolation,
branch-scoped triggers (dev, main), codecov-action@v5.
Triggers on v* tags. Strips GitHub-specific markup from README
before build to ensure clean PyPI rendering.
Unit tests: _parse_line, shannon/pielou/chao1 indices, modify_taxa_names.
Integration: kreport_to_mpa, convert_to_csv, relabund, alpha/beta div,
split_mpa — all with reproducibility (SHA-256) checks.
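
A SHA-256 reproducibility check of the kind mentioned here can be sketched as below: produce the same output twice and compare digests. `write_table` is a stand-in for a real pipeline step and is purely hypothetical, as are the taxon names and counts.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_table(path, rows):
    """Stand-in for a pipeline step: write rows in a deterministic order."""
    lines = [f"{name}\t{count}" for name, count in sorted(rows.items())]
    Path(path).write_text("\n".join(lines) + "\n")


def test_output_is_reproducible():
    rows = {"s__Escherichia_coli": 12, "s__Bacillus_subtilis": 7}
    with tempfile.TemporaryDirectory() as tmp:
        first, second = Path(tmp, "a.tsv"), Path(tmp, "b.tsv")
        write_table(first, rows)
        # Same data in a different insertion order must give identical bytes.
        write_table(second, dict(reversed(list(rows.items()))))
        assert sha256_of(first) == sha256_of(second)
```

Comparing digests catches byte-level drift such as unstable row ordering, which a looser content comparison might miss.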
Add smoke tests for stacked_barplot, streamgraph, clustermap.
Switch test_full_pipeline to direct run_pipeline() call so pipeline.py
is included in coverage. Add overwrite protection tests.
Add codecov badge. Document new -o/--output and --overwrite flags.
Add Before Visualization section highlighting --relabund -O.
Move step-by-step modules to collapsible <details> block.
Update Example Output Structure with intermediate/ directory.
biom-format uses a legacy setup.py that imports numpy at metadata
generation time. Pre-installing numpy ensures it is available before
pip resolves scikit-bio's transitive dependencies.
biom-format (transitive dep of scikit-bio) has no cp314 wheel and
its legacy setup.py has undeclared build-time deps (numpy, Cython).
Pin support to 3.10–3.13 until scikit-bio ecosystem catches up.
scikit-bio pulls in biom-format which has no cp314 wheel and
brittle build-time deps. Reimplement subsample_counts in numpy
and use scipy.spatial.distance for Bray-Curtis and Jaccard.
Restore Python 3.14 support and bump requires-python to <=3.16.
Rarefaction is stochastic — without a fixed seed results differ
between runs. Replace np.random.choice with np.random.default_rng
and propagate the seed through calc_beta_div and run_pipeline.
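
To make the seeding change concrete, here is a minimal numpy-only sketch of seeded rarefaction. The function name `subsample_counts` follows the commit message above, but the signature and body are assumptions, not the package's implementation.

```python
import numpy as np


def subsample_counts(counts, depth, seed=None):
    """Draw `depth` reads without replacement from a vector of per-taxon counts."""
    counts = np.asarray(counts, dtype=np.int64)
    if depth > counts.sum():
        raise ValueError("depth exceeds the total number of reads")
    # A local Generator avoids touching numpy's global random state.
    rng = np.random.default_rng(seed)
    # Expand to one entry per read, labelled by taxon index.
    reads = np.repeat(np.arange(counts.size), counts)
    picked = rng.choice(reads, size=depth, replace=False)
    return np.bincount(picked, minlength=counts.size)


first = subsample_counts([50, 30, 20], depth=40, seed=42)
second = subsample_counts([50, 30, 20], depth=40, seed=42)
print(np.array_equal(first, second))
```

Passing the same seed yields the same subsample, so any distance matrix computed downstream is repeatable between runs; with `seed=None` each call draws fresh entropy.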
@ilypopv ilypopv merged commit 7315fbc into main May 12, 2026
10 checks passed