Skip to content

feat: add kraken2 wrappers (classify, build, add-to-library)#5086

Open
warrok97ao wants to merge 39 commits into
snakemake:masterfrom
warrok97ao:kraken2
Open

feat: add kraken2 wrappers (classify, build, add-to-library)#5086
warrok97ao wants to merge 39 commits into
snakemake:masterfrom
warrok97ao:kraken2

Conversation

@warrok97ao
Copy link
Copy Markdown

@warrok97ao warrok97ao commented Mar 16, 2026

Adds three new wrappers for Kraken2:

  • bio/kraken2/classify: classify metagenomic reads (SE and PE) against a prebuilt database
  • bio/kraken2/build: build a Kraken2 database from a sequence library and taxonomy
  • bio/kraken2/add-to-library: add FASTA sequences to an existing Kraken2 database library

All wrappers use the new k2 CLI (kraken2=2.17.1) and include
wrapper.py, environment.yaml, meta.yaml, and test Snakefiles.

QC

While the contributions guidelines are more extensive, please particularly ensure that:

  • test.py was updated to call any added or updated example rules in a Snakefile
  • input: and output: file paths in the rules can be chosen arbitrarily
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:)
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to
  • the meta.yaml contains a link to the documentation of the respective tool or command under url:
  • conda environments use a minimal amount of channels and packages, in recommended ordering

Summary by CodeRabbit

  • New Features

    • Kraken2 workflow: classification (single- and paired-end, compressed inputs), database build, and add-to-library with flexible reporting and output controls.
  • Tests

    • End-to-end tests exercising classify, build, and add-to-library using sample FASTQ/FASTA fixtures, test databases, taxonomy entries, and mapping/manifests.
  • Documentation

    • Reproducible Conda environment manifests and step metadata to support deterministic testing and deployment.

Copilot AI review requested due to automatic review settings March 16, 2026 17:08
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 16, 2026

Warning

Rate limit exceeded

@warrok97ao has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 15 minutes and 5 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 15 minutes and 5 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 99431025-0bbf-4c68-b4e2-d49aa07a0dc4

📥 Commits

Reviewing files that changed from the base of the PR and between d75ce45 and 0e99965.

📒 Files selected for processing (1)
  • bio/kraken2/classify/wrapper.py
📝 Walkthrough

Walkthrough

Adds three Kraken2 Snakemake wrappers (classify, build, add_to_library) with metadata, conda environment specifications, Python wrapper implementations, tests, and test fixtures to support classification, database build, and library additions.

Changes

Cohort / File(s) Summary
Classify wrapper & tests
bio/kraken2/classify/meta.yaml, bio/kraken2/classify/wrapper.py, bio/kraken2/classify/environment.yaml, bio/kraken2/classify/environment.linux-64.pin.txt, bio/kraken2/classify/test/Snakefile, bio/kraken2/classify/test/reads/*, bio/kraken2/classify/test/test_db/**
New classify meta and wrapper implementing SE/PE handling, DB validation, output/report and paired-output formatting; adds conda env specs, Snakemake test rules, sample reads, library fixtures, mappings, md5, and taxonomy test data.
Build wrapper & tests
bio/kraken2/build/meta.yaml, bio/kraken2/build/wrapper.py, bio/kraken2/build/environment.yaml, bio/kraken2/build/environment.linux-64.pin.txt, bio/kraken2/build/test/Snakefile, bio/kraken2/build/test/test_db/**
New build meta and wrapper that validates inputs/outputs and directories, composes k2 build command with threads/log/extra; adds conda env specs, Snakemake test rule, and test DB fixtures (library files, mappings, md5, taxonomy, estimated_capacity).
Add-to-library wrapper & tests
bio/kraken2/add_to_library/meta.yaml, bio/kraken2/add_to_library/wrapper.py, bio/kraken2/add_to_library/environment.yaml, bio/kraken2/add_to_library/environment.linux-64.pin.txt, bio/kraken2/add_to_library/test/Snakefile, bio/kraken2/add_to_library/test/test_db/**
New add-to-library meta and wrapper that normalizes multiple FASTA inputs, asserts presence, formats k2 add-to-library CLI with threads/log/extra; adds Snakemake test rule and library/test fixtures (added FASTA, prelim maps, md5, taxonomy).
Conda explicit pins
**/environment.linux-64.pin.txt (classify, build, add_to_library)
Adds explicit Conda @EXPLICIT pin files (linux-64) listing exact package URLs to reproduce deterministic environments.
Top-level tests
test_wrappers.py
Adds test_kraken2(run) that invokes classify (PE & SE), build, and add_to_library via Snakemake with --use-conda to exercise the wrappers and fixtures.

Sequence Diagram(s)

sequenceDiagram
  participant SM as Snakemake
  participant W as Wrapper.py
  participant K as k2 (Kraken2)
  participant FS as Filesystem
  SM->>W: invoke rule (inputs, params, threads, log)
  W->>W: validate inputs/outputs, normalize lists, format flags
  W->>K: execute "k2 classify|build|add-to-library" with assembled args
  K->>FS: read DB, library files, input reads/FASTA
  K->>FS: write outputs (reports, .k2d files, added FASTA)
  K-->>W: exit status
  W-->>SM: return success/failure
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding three Kraken2 wrappers (classify, build, add-to-library) following conventional commit style.
Description check ✅ Passed The description comprehensively addresses all template requirements: it explains the three new wrappers, notes the use of kraken2=2.17.1, and explicitly confirms all QC checklist items were completed with checkmarks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds three new Snakemake wrappers under bio/kraken2/ to expose Kraken2 functionality via the k2 CLI (classify reads, add FASTA to a DB library, and build a DB), along with conda environments and wrapper-level tests that exercise SE/PE classification and DB operations.

Changes:

  • Added bio/kraken2/classify wrapper + test Snakefile and test fixtures for SE/PE classification.
  • Added bio/kraken2/build wrapper + test Snakefile and test fixtures for building a Kraken2 database.
  • Added bio/kraken2/add_to_library wrapper + test Snakefile and test fixtures for adding FASTA sequences into an existing DB library.

Reviewed changes

Copilot reviewed 54 out of 63 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
test_wrappers.py Registers tests for the three new Kraken2 wrappers.
bio/kraken2/classify/wrapper.py Implements k2 classify wrapper with SE/PE handling and optional outputs.
bio/kraken2/classify/test/Snakefile Test rules for SE and PE classification against a tiny test DB.
bio/kraken2/classify/meta.yaml Wrapper metadata for inputs/outputs/params.
bio/kraken2/classify/environment.yaml Conda env definition for Kraken2.
bio/kraken2/classify/environment.linux-64.pin.txt Linux-64 explicit conda lock for reproducible tests.
bio/kraken2/classify/test/test_db/taxonomy/nodes.dmp Minimal taxonomy fixture for test DB.
bio/kraken2/classify/test/test_db/taxonomy/names.dmp Minimal taxonomy names fixture for test DB.
bio/kraken2/classify/test/test_db/taxo.k2d Prebuilt taxonomy binary fixture for test DB.
bio/kraken2/classify/test/test_db/hash.k2d Prebuilt hash binary fixture for test DB.
bio/kraken2/classify/test/test_db/opts.k2d Prebuilt opts binary fixture for test DB.
bio/kraken2/classify/test/test_db/seqid2taxid.map Sequence-to-taxid mapping fixture for test DB.
bio/kraken2/classify/test/test_db/prelim_map.txt Kraken2 prelim map fixture for test DB.
bio/kraken2/classify/test/test_db/estimated_capacity Capacity fixture for test DB.
bio/kraken2/classify/test/test_db/library/tiny.fa Tiny library FASTA fixture.
bio/kraken2/classify/test/test_db/library/tiny2.fa Second tiny library FASTA fixture.
bio/kraken2/classify/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fna “Added” library artifact fixture.
bio/kraken2/classify/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna “Added” library artifact fixture.
bio/kraken2/classify/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt “Added” prelim map fixture.
bio/kraken2/classify/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt “Added” prelim map fixture.
bio/kraken2/classify/test/test_db/library/added/added.md5 “Added” md5 manifest fixture.
bio/kraken2/classify/test/reads/a_R1.fastq PE/SE test read fixture.
bio/kraken2/classify/test/reads/a_R2.fastq PE test read fixture.
bio/kraken2/classify/test/reads/a_R1.1.fastq Additional read fixture variant.
bio/kraken2/classify/test/reads/a_R1.2.fastq Additional read fixture variant.
bio/kraken2/classify/test/reads/a_R2.1.fastq Additional read fixture variant.
bio/kraken2/classify/test/reads/a_R2.2.fastq Additional read fixture variant.
bio/kraken2/classify/test/reads/b_R1.fastq SE test read fixture.
bio/kraken2/build/wrapper.py Implements k2 build wrapper.
bio/kraken2/build/test/Snakefile Test rule for DB build.
bio/kraken2/build/meta.yaml Wrapper metadata for build step.
bio/kraken2/build/environment.yaml Conda env definition for Kraken2.
bio/kraken2/build/environment.linux-64.pin.txt Linux-64 explicit conda lock for reproducible tests.
bio/kraken2/build/test/test_db/taxonomy/nodes.dmp Minimal taxonomy fixture for build tests.
bio/kraken2/build/test/test_db/taxonomy/names.dmp Minimal taxonomy names fixture for build tests.
bio/kraken2/build/test/test_db/taxo.k2d Taxonomy binary fixture (build tests).
bio/kraken2/build/test/test_db/hash.k2d Hash binary fixture (build tests).
bio/kraken2/build/test/test_db/opts.k2d Opts binary fixture (build tests).
bio/kraken2/build/test/test_db/seqid2taxid.map Sequence-to-taxid mapping fixture (build tests).
bio/kraken2/build/test/test_db/prelim_map.txt Kraken2 prelim map fixture (build tests).
bio/kraken2/build/test/test_db/estimated_capacity Capacity fixture (build tests).
bio/kraken2/build/test/test_db/library/tiny.fa Tiny library FASTA fixture (build tests).
bio/kraken2/build/test/test_db/library/tiny2.fa Second tiny library FASTA fixture (build tests).
bio/kraken2/build/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fna “Added” library artifact fixture (build tests).
bio/kraken2/build/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna “Added” library artifact fixture (build tests).
bio/kraken2/build/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt “Added” prelim map fixture (build tests).
bio/kraken2/build/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt “Added” prelim map fixture (build tests).
bio/kraken2/build/test/test_db/library/added/added.md5 “Added” md5 manifest fixture (build tests).
bio/kraken2/add_to_library/wrapper.py Implements k2 add-to-library wrapper.
bio/kraken2/add_to_library/test/Snakefile Test rule for add-to-library step.
bio/kraken2/add_to_library/meta.yaml Wrapper metadata for add-to-library step.
bio/kraken2/add_to_library/environment.yaml Conda env definition for Kraken2.
bio/kraken2/add_to_library/environment.linux-64.pin.txt Linux-64 explicit conda lock for reproducible tests.
bio/kraken2/add_to_library/test/test_db/taxonomy/nodes.dmp Minimal taxonomy fixture for add-to-library tests.
bio/kraken2/add_to_library/test/test_db/taxonomy/names.dmp Minimal taxonomy names fixture for add-to-library tests.
bio/kraken2/add_to_library/test/test_db/library/tiny.fa Tiny library FASTA fixture (add-to-library tests).
bio/kraken2/add_to_library/test/test_db/library/tiny2.fa Second tiny library FASTA fixture (add-to-library tests).
bio/kraken2/add_to_library/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fna “Added” library artifact fixture (add-to-library tests).
bio/kraken2/add_to_library/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna “Added” library artifact fixture (add-to-library tests).
bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt “Added” prelim map fixture (add-to-library tests).
bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt “Added” prelim map fixture (add-to-library tests).
bio/kraken2/add_to_library/test/test_db/library/added/added.md5 “Added” md5 manifest fixture (add-to-library tests).
bio/kraken2/add_to_library/test/test_db/library/added/.snakemake_timestamp Timestamp artifact present in added fixtures.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread bio/kraken2/build/test/test_db/library/added/added.md5 Outdated
Comment on lines +47 to +52
stem, ext = os.path.splitext(out_file)
if not stem.endswith("_1"):
raise ValueError(
f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension."
)
kraken_template = stem[:-2] + "#" + ext
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@warrok97ao is this valid? Can kraken2 output .gz files?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The manual does not specify anything about the output format, with no mention of gz files accepted as valid output. It only accounts for input extensions (bz2, gzip, and xz).
Users might still name their files with extension for their output (e.g. classified/unclassified out) even if kraken2 does not compress them, so it might be safer to use the suggested logic

Comment thread bio/kraken2/classify/test/test_db/prelim_map.txt Outdated
Comment thread bio/kraken2/build/wrapper.py Outdated
Comment on lines +13 to +19
hash_file = snakemake.output.get("hash")
opts_file = snakemake.output.get("opts")
taxo_file = snakemake.output.get("taxo")

assert hash_file is not None, "Output -> hash is required"
assert opts_file is not None, "Output -> opts is required"
assert taxo_file is not None, "Output -> taxo is required"
Comment thread bio/kraken2/add_to_library/wrapper.py Outdated
Comment on lines +8 to +28
db = snakemake.input.get("db")
assert db is not None, "Input -> db is a required input parameter"

fasta = snakemake.input.get("fasta")
assert fasta is not None, "Input -> fasta is a required input parameter"

fasta = [fasta] if isinstance(fasta, str) else list(fasta)
fasta_str = " ".join(fasta)

extra = snakemake.params.get("extra", "")

log_file = snakemake.log[0] if snakemake.log else None
log_flag = f"--log {log_file}" if log_file else ""

shell(
"k2 add-to-library "
"--db {db} "
"--file {fasta_str} "
"--threads {snakemake.threads} "
"{log_flag} "
"{extra}"
Comment thread bio/kraken2/build/test/test_db/prelim_map.txt Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
bio/kraken2/classify/environment.yaml (1)

6-6: Add whitespace before the version specifier.

Per snakemake-wrappers conventions, version specifications should have whitespace before the equal sign.

Proposed fix
 channels:
   - conda-forge
   - bioconda
   - nodefaults
 dependencies:
-  - kraken2=2.17.1
+  - kraken2 =2.17.1

Based on learnings: In snakemake-wrappers repository, environment.yaml files should use whitespace before the equal sign in version specifications (e.g., "datavzrd =2.53.1").

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/classify/environment.yaml` at line 6, The version specifier for
the package entry "kraken2=2.17.1" in the environment.yaml lacks the required
whitespace before the equal sign; update that entry to use the
snakemake-wrappers convention by changing it to "kraken2 =2.17.1" so the package
name and the equals sign are separated by a space.
bio/kraken2/build/environment.yaml (1)

6-6: Align dependency pin formatting with repo convention.

Line 6 uses kraken2=2.17.1; use a space before = in environment.yaml entries for consistency.

♻️ Proposed update
-  - kraken2=2.17.1
+  - kraken2 =2.17.1

Based on learnings: In snakemake-wrappers repository, environment.yaml files should use whitespace before the equal sign in version specifications.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/build/environment.yaml` at line 6, Fix the formatting of the
package pin in the environment.yaml by changing the entry 'kraken2=2.17.1' to
include a space before the equals sign (i.e., follow the repo convention used
across environment.yaml files), ensuring consistent formatting for package
version specs in this file.
bio/kraken2/add_to_library/environment.yaml (1)

6-6: Use repository-standard pin formatting.

Line 6 should include whitespace before = to match wrapper environment style conventions.

♻️ Proposed update
-  - kraken2=2.17.1
+  - kraken2 =2.17.1

Based on learnings: In snakemake-wrappers repository, environment.yaml files should use whitespace before the equal sign in version specifications.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/add_to_library/environment.yaml` at line 6, Update the package
pin in environment.yaml to use the repository-standard spacing around the equals
sign: locate the kraken2 entry in environment.yaml (the line with
"kraken2=2.17.1") and change it to include a space before and after the equals
sign ("kraken2 = 2.17.1") so it matches wrapper environment style conventions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bio/kraken2/add_to_library/test/test_db/library/added/added.md5`:
- Around line 1-2: The committed test fixture added.md5 contains developer-local
absolute paths (/home/aleone/...) which break portability; edit added.md5 (the
two lines referencing tiny.fa and tiny2.fa) to use relative paths or just the
filenames (e.g., "library/tiny.fa" or "tiny.fa" and the corresponding tiny_...
.fna names) so the MD5 file is host-independent, then update any test helpers if
they expect absolute paths.

In `@bio/kraken2/build/test/test_db/library/tiny2.fa`:
- Around line 1-6: The test FASTA contains sequences seq2 and seq3 tagged with
taxids 444 and 453 which are missing from the taxonomy files, so either add
entries for taxid 444 and 453 to the taxonomy (both names.dmp and nodes.dmp)
with valid parent, rank and name records matching Kraken2 format, or change
those sequence headers in tiny2.fa to use an existing taxid (e.g., replace 444
and 453 with 562) so all taxids referenced by seq2 and seq3 are present in the
taxonomy and will be included during database build.

In `@bio/kraken2/classify/test/test_db/library/added/added.md5`:
- Around line 1-2: The committed fixture manifest
bio/kraken2/classify/test/test_db/library/added/added.md5 contains
machine-specific absolute paths (/home/aleone/...) and an incorrect reference to
the add_to_library test tree; update this file to use portable relative
filenames and basenames instead of absolute paths (e.g., replace the full paths
before the checksums so lines start with tiny.fa and tiny2.fa or the expected
relative path within the classify test DB), keep the checksum and
output-filename suffixes intact (tiny_6861723... .fna, tiny2_92a2... .fna), and
ensure the manifest points only to files within the classify test fixture rather
than referencing add_to_library.

In
`@bio/kraken2/classify/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna`:
- Around line 1-6: The FASTA contains sequences with taxids 444 and 453 (seq2
and seq3 in tiny2_92a2a273338dd5873628711dfbc7180a.fna) which are not present in
the test taxonomy; either add entries for taxids 444 and 453 to the test
taxonomy (nodes.dmp/names.dmp) matching the expected format, or change seq2 and
seq3 headers to use an existing valid taxid (e.g., replace "|kraken:taxid|444"
and "|kraken:taxid|453" with "|kraken:taxid|562") so all taxids in the FASTA
match the classify test taxonomy.

In `@bio/kraken2/classify/wrapper.py`:
- Around line 47-53: The paired-end suffix check fails for compressed files
because os.path.splitext only removes the last extension; change the logic to
treat multi-part extensions correctly by using pathlib.Path.suffixes (or strip
known compression suffixes) to build full_ext = ''.join(Path(out_file).suffixes)
and core_name = Path(out_file).name[:-len(full_ext)], then verify
core_name.endswith("_1"), compute core_name[:-2] + "#" + full_ext for
kraken_template, and return f"{flag_name} {kraken_template}" (references:
variables out_file, flag_name, kraken_template).

---

Nitpick comments:
In `@bio/kraken2/add_to_library/environment.yaml`:
- Line 6: Update the package pin in environment.yaml to use the
repository-standard spacing around the equals sign: locate the kraken2 entry in
environment.yaml (the line with "kraken2=2.17.1") and change it to include a
space before and after the equals sign ("kraken2 = 2.17.1") so it matches
wrapper environment style conventions.

In `@bio/kraken2/build/environment.yaml`:
- Line 6: Fix the formatting of the package pin in the environment.yaml by
changing the entry 'kraken2=2.17.1' to include a space before the equals sign
(i.e., follow the repo convention used across environment.yaml files), ensuring
consistent formatting for package version specs in this file.

In `@bio/kraken2/classify/environment.yaml`:
- Line 6: The version specifier for the package entry "kraken2=2.17.1" in the
environment.yaml lacks the required whitespace before the equal sign; update
that entry to use the snakemake-wrappers convention by changing it to "kraken2
=2.17.1" so the package name and the equals sign are separated by a space.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3735cb70-63f2-4e13-92ed-78e5352f3b81

📥 Commits

Reviewing files that changed from the base of the PR and between e726bc9 and f0df7f5.

⛔ Files ignored due to path filters (2)
  • bio/kraken2/build/test/test_db/seqid2taxid.map is excluded by !**/*.map
  • bio/kraken2/classify/test/test_db/seqid2taxid.map is excluded by !**/*.map
📒 Files selected for processing (61)
  • bio/kraken2/add_to_library/environment.linux-64.pin.txt
  • bio/kraken2/add_to_library/environment.yaml
  • bio/kraken2/add_to_library/meta.yaml
  • bio/kraken2/add_to_library/test/Snakefile
  • bio/kraken2/add_to_library/test/test_db/library/added/.snakemake_timestamp
  • bio/kraken2/add_to_library/test/test_db/library/added/added.md5
  • bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt
  • bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt
  • bio/kraken2/add_to_library/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna
  • bio/kraken2/add_to_library/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fna
  • bio/kraken2/add_to_library/test/test_db/library/tiny.fa
  • bio/kraken2/add_to_library/test/test_db/library/tiny2.fa
  • bio/kraken2/add_to_library/test/test_db/taxonomy/names.dmp
  • bio/kraken2/add_to_library/test/test_db/taxonomy/nodes.dmp
  • bio/kraken2/add_to_library/wrapper.py
  • bio/kraken2/build/environment.linux-64.pin.txt
  • bio/kraken2/build/environment.yaml
  • bio/kraken2/build/meta.yaml
  • bio/kraken2/build/test/Snakefile
  • bio/kraken2/build/test/test_db/estimated_capacity
  • bio/kraken2/build/test/test_db/hash.k2d
  • bio/kraken2/build/test/test_db/library/added/added.md5
  • bio/kraken2/build/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt
  • bio/kraken2/build/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt
  • bio/kraken2/build/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna
  • bio/kraken2/build/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fna
  • bio/kraken2/build/test/test_db/library/tiny.fa
  • bio/kraken2/build/test/test_db/library/tiny2.fa
  • bio/kraken2/build/test/test_db/opts.k2d
  • bio/kraken2/build/test/test_db/prelim_map.txt
  • bio/kraken2/build/test/test_db/taxo.k2d
  • bio/kraken2/build/test/test_db/taxonomy/names.dmp
  • bio/kraken2/build/test/test_db/taxonomy/nodes.dmp
  • bio/kraken2/build/wrapper.py
  • bio/kraken2/classify/environment.linux-64.pin.txt
  • bio/kraken2/classify/environment.yaml
  • bio/kraken2/classify/meta.yaml
  • bio/kraken2/classify/test/Snakefile
  • bio/kraken2/classify/test/reads/a_R1.1.fastq
  • bio/kraken2/classify/test/reads/a_R1.2.fastq
  • bio/kraken2/classify/test/reads/a_R1.fastq
  • bio/kraken2/classify/test/reads/a_R2.1.fastq
  • bio/kraken2/classify/test/reads/a_R2.2.fastq
  • bio/kraken2/classify/test/reads/a_R2.fastq
  • bio/kraken2/classify/test/reads/b_R1.fastq
  • bio/kraken2/classify/test/test_db/estimated_capacity
  • bio/kraken2/classify/test/test_db/hash.k2d
  • bio/kraken2/classify/test/test_db/library/added/added.md5
  • bio/kraken2/classify/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt
  • bio/kraken2/classify/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt
  • bio/kraken2/classify/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna
  • bio/kraken2/classify/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fna
  • bio/kraken2/classify/test/test_db/library/tiny.fa
  • bio/kraken2/classify/test/test_db/library/tiny2.fa
  • bio/kraken2/classify/test/test_db/opts.k2d
  • bio/kraken2/classify/test/test_db/prelim_map.txt
  • bio/kraken2/classify/test/test_db/taxo.k2d
  • bio/kraken2/classify/test/test_db/taxonomy/names.dmp
  • bio/kraken2/classify/test/test_db/taxonomy/nodes.dmp
  • bio/kraken2/classify/wrapper.py
  • test_wrappers.py

Comment thread bio/kraken2/add_to_library/test/test_db/library/added/added.md5 Outdated
Comment on lines +1 to +6
>seq1|kraken:taxid|562
ACGTACGTACGTAAAAACGT
>seq2|kraken:taxid|444
TTGCAATGCAACGGCATGCA
>seq3|kraken:taxid|453
TTCAGTTGCAATGCAATGCA No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if taxids 444 and 453 exist in the taxonomy files for build tests
echo "=== Checking names.dmp for taxids 444 and 453 ==="
fd -t f 'names.dmp' bio/kraken2/build/test --exec cat {} \; 2>/dev/null | grep -E '^(444|453)\s'

echo "=== Checking nodes.dmp for taxids 444 and 453 ==="
fd -t f 'nodes.dmp' bio/kraken2/build/test --exec cat {} \; 2>/dev/null | grep -E '^(444|453)\s'

echo "=== Full content of taxonomy files ==="
fd -t f -e dmp . bio/kraken2/build/test --exec sh -c 'echo "--- {} ---" && cat {}'

Repository: snakemake/snakemake-wrappers

Length of output: 501


🏁 Script executed:

#!/bin/bash
# Search for test setup files and documentation
echo "=== Looking for test setup/documentation files ==="
fd -t f '\.(md|txt|yaml|yml|py|sh)$' bio/kraken2/build/test --exec sh -c 'echo "--- {} ---" && head -20 {}'

echo "=== Checking if there are any other taxonomy sources ==="
fd -t f -e dmp . bio/kraken2/build/test

echo "=== Looking for Kraken2 wrapper or build script ==="
fd -t f -e py . bio/kraken2/build --exec grep -l "kraken2-build" {} \;

Repository: snakemake/snakemake-wrappers

Length of output: 1803


🌐 Web query:

Kraken2 build missing taxids behavior handling

💡 Result:

In Kraken2’s DB build, “missing taxids” can mean two different things, and they’re handled differently:

  1. Sequence/accession can’t be mapped to any taxid (missing mapping)
  • During the “Creating sequence ID to taxonomy ID map” step, Kraken2 may write an unmapped.txt containing sequence IDs/accessions it couldn’t map to a taxid (i.e., they never made it into seqid2taxid.map). Downstream, those sequences won’t be included in the final DB unless you supply/fix mappings. KrakenTools explicitly documents this workflow (unmapped.txt → generate extra mappings → append into seqid2taxid.map). [1]
  1. A taxid exists in seqid2taxid.map, but that taxid is not present in the taxonomy (nodes.dmp/names.dmptaxo.k2d)
  • In the Kraken2 build code, taxonomy lookup is effectively: external taxid → internal node ID, and if the taxid isn’t in the taxonomy, the internal ID returned is 0. [2]
  • During minimizer assignment, Kraken2 combines candidate taxids via LCA; because LCA(x, 0) = x and if all candidates are missing you end up with taxid 0, Kraken2 will skip indexing that sequence entirely (if (taxid) { ... ProcessSequence ... }). Net effect: taxids missing from taxonomy cause those sequences to contribute nothing to the DB (they’re silently dropped at indexing time rather than being forced to root). [3]

How to handle it

  • If you’re adding custom genomes with “invented” taxids (not in NCBI), you must also add corresponding nodes/names (i.e., extend taxonomy) so those taxids exist in the taxonomy used for the build; otherwise those sequences won’t be indexed. (This is the common situation discussed in Kraken2 custom-db threads/issues.) [4]

Sources

  1. KrakenTools fix_unmapped.py docs (unmapped.txt handling). (github.com)
  2. Kraken2 Taxonomy::GetInternalID() returns 0 when an external taxid is absent. (raw.githubusercontent.com)
  3. Kraken2 build logic: missing/0 taxid ⇒ sequence not processed/indexed (if (taxid) { ... }). (raw.githubusercontent.com)
  4. Kraken2 issue discussing adding custom genomes whose taxids aren’t in NCBI taxonomy (need nodes/names updates). (github.com)

Citations:


🏁 Script executed:

#!/bin/bash
# Look for test configuration and documentation
echo "=== Checking for test README or docs ==="
fd -t f -e md -e txt -e rst . bio/kraken2/build/test

echo "=== Checking wrapper YAML/Snakefile ==="
fd -t f '\(Snakefile\|\.yaml\|\.yml\)' bio/kraken2/build --exec sh -c 'echo "--- {} ---" && cat {}'

echo "=== Looking for any test comments/setup in Python files ==="
fd -t f -e py . bio/kraken2/build --exec sh -c 'echo "--- {} ---" && head -30 {}'

Repository: snakemake/snakemake-wrappers

Length of output: 1322


Add missing taxids 444 and 453 to the taxonomy files, or replace with existing taxids.

This file references taxids 444 and 453 (lines 3, 5) but the companion names.dmp and nodes.dmp in test_db/taxonomy/ only contain taxids 1, 2, and 562. According to Kraken2's build logic, sequences with taxids missing from the taxonomy are silently dropped during database indexing and will not contribute to the final database. Either add the missing taxids to the taxonomy files or replace these sequences with taxid 562 to ensure they are properly indexed in the test database.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/build/test/test_db/library/tiny2.fa` around lines 1 - 6, The test
FASTA contains sequences seq2 and seq3 tagged with taxids 444 and 453 which are
missing from the taxonomy files, so either add entries for taxid 444 and 453 to
the taxonomy (both names.dmp and nodes.dmp) with valid parent, rank and name
records matching Kraken2 format, or change those sequence headers in tiny2.fa to
use an existing taxid (e.g., replace 444 and 453 with 562) so all taxids
referenced by seq2 and seq3 are present in the taxonomy and will be included
during database build.

Comment thread bio/kraken2/classify/test/test_db/library/added/added.md5 Outdated
Comment on lines +1 to +6
>seq1|kraken:taxid|562
ACGTACGTACGTAAAAACGT
>seq2|kraken:taxid|444
TTGCAATGCAACGGCATGCA
>seq3|kraken:taxid|453
TTCAGTTGCAATGCAATGCA
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check taxonomy files in classify test directory
echo "=== Checking classify test taxonomy for taxids 444, 453 ==="
fd -t f -e dmp . bio/kraken2/classify/test --exec sh -c 'echo "--- {} ---" && cat {}'

Repository: snakemake/snakemake-wrappers

Length of output: 428


Add taxids 444 and 453 to classify test taxonomy or replace with valid taxid (562).

This FASTA file references taxids 444 and 453, but the classify test taxonomy (nodes.dmp, names.dmp) only contains taxids 1, 2, and 562. The invalid taxids will cause test failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@bio/kraken2/classify/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna`
around lines 1 - 6, The FASTA contains sequences with taxids 444 and 453 (seq2
and seq3 in tiny2_92a2a273338dd5873628711dfbc7180a.fna) which are not present in
the test taxonomy; either add entries for taxids 444 and 453 to the test
taxonomy (nodes.dmp/names.dmp) matching the expected format, or change seq2 and
seq3 headers to use an existing valid taxid (e.g., replace "|kraken:taxid|444"
and "|kraken:taxid|453" with "|kraken:taxid|562") so all taxids in the FASTA
match the classify test taxonomy.

Comment on lines +47 to +53
stem, ext = os.path.splitext(out_file)
if not stem.endswith("_1"):
raise ValueError(
f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension."
)
kraken_template = stem[:-2] + "#" + ext
return f"{flag_name} {kraken_template}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import os, re

samples = [
    "a.classified_1.fastq",
    "a.classified_1.fastq.gz",
    "a.classified_1.fq.gz",
]

print("Current splitext-based check:")
for s in samples:
    stem, ext = os.path.splitext(s)
    ok = stem.endswith("_1")
    print(f"{s:30} -> stem='{stem}', ext='{ext}', passes={ok}")

print("\nRegex-based check:")
for s in samples:
    ok = bool(re.search(r"_1(?=(?:\.[^.]+)+$|$)", s))
    out = re.sub(r"_1(?=(?:\.[^.]+)+$|$)", "#", s, count=1) if ok else None
    print(f"{s:30} -> passes={ok}, template={out}")
PY

Repository: snakemake/snakemake-wrappers

Length of output: 606


🏁 Script executed:

cat -n bio/kraken2/classify/wrapper.py | head -80

Repository: snakemake/snakemake-wrappers

Length of output: 2886


🏁 Script executed:

find . -path "*kraken2/classify*" -type f | head -20

Repository: snakemake/snakemake-wrappers

Length of output: 1203


🏁 Script executed:

cat -n bio/kraken2/classify/test/Snakefile

Repository: snakemake/snakemake-wrappers

Length of output: 1459


Paired-output suffix check fails with compressed file extensions.

Lines 47-53 incorrectly reject valid paired outputs like sample_1.fastq.gz because os.path.splitext() only strips the final extension (.gz), leaving the stem as sample_1.fastq, which no longer ends with _1. This breaks paired-end classification with compressed files.

Proposed fix
 import os
+import re
 from snakemake.shell import shell
@@
 def format_out_flag(out_param, flag_name):
@@
     if paired:
-        stem, ext = os.path.splitext(out_file)
-        if not stem.endswith("_1"):
+        if not re.search(r"_1(?=(?:\.[^.]+)+$|$)", out_file):
             raise ValueError(
                 f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension."
             )
-        kraken_template = stem[:-2] + "#" + ext
+        kraken_template = re.sub(
+            r"_1(?=(?:\.[^.]+)+$|$)", "#", out_file, count=1
+        )
         return f"{flag_name} {kraken_template}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/classify/wrapper.py` around lines 47 - 53, The paired-end suffix
check fails for compressed files because os.path.splitext only removes the last
extension; change the logic to treat multi-part extensions correctly by using
pathlib.Path.suffixes (or strip known compression suffixes) to build full_ext =
''.join(Path(out_file).suffixes) and core_name =
Path(out_file).name[:-len(full_ext)], then verify core_name.endswith("_1"),
compute core_name[:-2] + "#" + full_ext for kraken_template, and return
f"{flag_name} {kraken_template}" (references: variables out_file, flag_name,
kraken_template).

"{db}/library/tiny2.fa",
],
output:
added_dir=directory("{db}/library/added"),
Copy link
Copy Markdown
Collaborator

@fgvieira fgvieira Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this just be called db?

Suggested change
added_dir=directory("{db}/library/added"),
db=directory("{db}/library/added"),

And are the new files apended to the input db, or is a new db created with the new files added?

Copy link
Copy Markdown
Collaborator

@fgvieira fgvieira Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is output.db a new DB or are files added to the input DB?
I mean, can db=directory("{db}/other_library")?

Comment thread bio/kraken2/classify/wrapper.py Outdated
Comment thread test_wrappers.py Outdated
@fgvieira
Copy link
Copy Markdown
Collaborator

Nice work!

Can you address/check mine and AI comments?

warrok97ao and others added 3 commits March 24, 2026 09:59
Co-authored-by: Filipe G. Vieira <1151762+fgvieira@users.noreply.github.com>
Co-authored-by: Filipe G. Vieira <1151762+fgvieira@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
bio/kraken2/classify/wrapper.py (1)

47-53: ⚠️ Potential issue | 🟠 Major

Fix paired-output template handling for compressed FASTQ names.

Line 47-Line 53 reject valid paired outputs such as sample_1.fastq.gz because os.path.splitext() strips only .gz, so Line 48 fails the _1 check.

Proposed fix
 import os
+import re
 from snakemake.shell import shell
@@
     if paired:
-        stem, ext = os.path.splitext(out_file)
-        if not stem.endswith("_1"):
+        if not re.search(r"_1(?=(?:\.[^.]+)+$|$)", out_file):
             raise ValueError(
                 f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension."
             )
-        kraken_template = stem[:-2] + "#" + ext
+        kraken_template = re.sub(
+            r"_1(?=(?:\.[^.]+)+$|$)", "#", out_file, count=1
+        )
         return f"{flag_name} {kraken_template}"
#!/bin/bash
python - <<'PY'
import os
samples = ["sample_1.fastq", "sample_1.fastq.gz", "sample_1.fq.gz"]
for s in samples:
    stem, ext = os.path.splitext(s)
    print(f"{s:20} stem={stem!r} ext={ext!r} endswith__1={stem.endswith('_1')}")
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/classify/wrapper.py` around lines 47 - 53, The code incorrectly
uses os.path.splitext which only removes one suffix so names like
"sample_1.fastq.gz" fail the "_1" check; update the logic in the paired-output
handling (the block that reads out_file, stem, ext and builds kraken_template)
to respect multi-part extensions by using pathlib.Path(out_file).suffixes to
detect compression suffixes (e.g., .gz, .bz2, .xz) and treat the full extension
(e.g., ".fastq.gz") as ext and the remaining base name as stem before performing
stem.endswith("_1"), then build kraken_template from that base (stem[:-2] + "#"
+ ext) and return f"{flag_name} {kraken_template}".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 47-53: The code incorrectly uses os.path.splitext which only
removes one suffix so names like "sample_1.fastq.gz" fail the "_1" check; update
the logic in the paired-output handling (the block that reads out_file, stem,
ext and builds kraken_template) to respect multi-part extensions by using
pathlib.Path(out_file).suffixes to detect compression suffixes (e.g., .gz, .bz2,
.xz) and treat the full extension (e.g., ".fastq.gz") as ext and the remaining
base name as stem before performing stem.endswith("_1"), then build
kraken_template from that base (stem[:-2] + "#" + ext) and return f"{flag_name}
{kraken_template}".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 45d668b3-f218-4a20-83f9-a0a7b9411c14

📥 Commits

Reviewing files that changed from the base of the PR and between 1a0ea1f and 09cb034.

📒 Files selected for processing (3)
  • bio/kraken2/add_to_library/test/Snakefile
  • bio/kraken2/classify/wrapper.py
  • test_wrappers.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • bio/kraken2/add_to_library/test/Snakefile

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
bio/kraken2/classify/wrapper.py (1)

47-53: ⚠️ Potential issue | 🟠 Major

Handle multi-part extensions when deriving paired # output templates.

Line 47-53 reject valid paired outputs like sample_1.fastq.gz. os.path.splitext() strips only the last suffix (.gz), so the stem becomes sample_1.fastq and fails the _1 check.

Proposed fix
 import os
+import re
 from snakemake.shell import shell
@@
     if paired:
-        stem, ext = os.path.splitext(out_file)
-        if not stem.endswith("_1"):
+        if not re.search(r"_1(?=(?:\.[^.]+)+$|$)", out_file):
             raise ValueError(
                 f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension."
             )
-        kraken_template = stem[:-2] + "#" + ext
+        kraken_template = re.sub(
+            r"_1(?=(?:\.[^.]+)+$|$)", "#", out_file, count=1
+        )
         return f"{flag_name} {kraken_template}"
#!/bin/bash
python - <<'PY'
import os, re
samples = ["sample_1.fastq", "sample_1.fastq.gz", "sample_1.fq.gz"]
for s in samples:
    stem, ext = os.path.splitext(s)
    old_ok = stem.endswith("_1")
    new_ok = bool(re.search(r"_1(?=(?:\.[^.]+)+$|$)", s))
    print(f"{s:24} old_ok={old_ok} new_ok={new_ok}")
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/classify/wrapper.py` around lines 47 - 53, The current logic
using os.path.splitext() produces incorrect stems for multi-part extensions
(e.g., sample_1.fastq.gz) and therefore rejects valid paired filenames; update
the code that inspects out_file (the stem/ext extraction and the _1 check) to
detect a trailing "_1" that occurs immediately before the full extension
sequence (or end of string) instead of just before the last suffix—e.g., use a
regex like r"_1(?=(?:\.[^.]+)+$|$)" to find/replace the "_1" in out_file or
repeatedly call os.path.splitext to peel all suffixes, then build
kraken_template by replacing that matched "_1" with "#" while preserving the
entire extension string, and keep using the same variables flag_name and
kraken_template and the same return format.
🧹 Nitpick comments (1)
bio/kraken2/build/wrapper.py (1)

13-19: Required outputs are not enforced by the executed command

Lines 13–19 require hash, opts, and taxo, but Line 24 never uses or validates those paths. If outputs are configured outside the DB directory, Snakemake can wait for files k2 build will never create.

🔧 Suggested fix (enforce output contract with current command behavior)
+from pathlib import Path
 from snakemake.shell import shell

 db = snakemake.input.get("db")
 assert db is not None, "Input -> db is a required input parameter"
+db_dir = Path(db)

 extra = snakemake.params.get("extra", "")

 hash_file = snakemake.output.get("hash")
 opts_file = snakemake.output.get("opts")
 taxo_file = snakemake.output.get("taxo")

 assert hash_file is not None, "Output -> hash is required"
 assert opts_file is not None, "Output -> opts is required"
 assert taxo_file is not None, "Output -> taxo is required"
+for name, out in {"hash": hash_file, "opts": opts_file, "taxo": taxo_file}.items():
+    out_path = Path(out)
+    assert out_path.parent == db_dir, (
+        f"Output -> {name} must be inside db ({db_dir}) because k2 build writes DB files there"
+    )

 log_file = snakemake.log[0] if snakemake.log else None
 log_flag = f"--log {log_file}" if log_file else ""

Also applies to: 24-24

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/build/wrapper.py` around lines 13 - 19, The rule currently
requires snakemake.output.get("hash"), "opts", and "taxo" (hash_file, opts_file,
taxo_file) but never ensures the executed k2 build command actually writes them,
so Snakemake can hang if outputs live outside the DB dir; update the wrapper so
that after running the existing k2 build invocation it either (A) passes
appropriate CLI flags to k2 build to write those specific paths (so
hash_file/opts_file/taxo_file are produced by the command) or (B) explicitly
creates/writes those files (e.g., touch/write metadata) and then verifies their
existence, raising an error if any of hash_file, opts_file, or taxo_file are
missing; make these checks next to the current command invocation rather than
only asserting before it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 47-53: The current logic using os.path.splitext() produces
incorrect stems for multi-part extensions (e.g., sample_1.fastq.gz) and
therefore rejects valid paired filenames; update the code that inspects out_file
(the stem/ext extraction and the _1 check) to detect a trailing "_1" that occurs
immediately before the full extension sequence (or end of string) instead of
just before the last suffix—e.g., use a regex like r"_1(?=(?:\.[^.]+)+$|$)" to
find/replace the "_1" in out_file or repeatedly call os.path.splitext to peel
all suffixes, then build kraken_template by replacing that matched "_1" with "#"
while preserving the entire extension string, and keep using the same variables
flag_name and kraken_template and the same return format.

---

Nitpick comments:
In `@bio/kraken2/build/wrapper.py`:
- Around line 13-19: The rule currently requires snakemake.output.get("hash"),
"opts", and "taxo" (hash_file, opts_file, taxo_file) but never ensures the
executed k2 build command actually writes them, so Snakemake can hang if outputs
live outside the DB dir; update the wrapper so that after running the existing
k2 build invocation it either (A) passes appropriate CLI flags to k2 build to
write those specific paths (so hash_file/opts_file/taxo_file are produced by the
command) or (B) explicitly creates/writes those files (e.g., touch/write
metadata) and then verifies their existence, raising an error if any of
hash_file, opts_file, or taxo_file are missing; make these checks next to the
current command invocation rather than only asserting before it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: febccbc6-2bea-46be-8f95-2005e46e82b8

📥 Commits

Reviewing files that changed from the base of the PR and between 09cb034 and 3abbe96.

📒 Files selected for processing (3)
  • bio/kraken2/add_to_library/wrapper.py
  • bio/kraken2/build/wrapper.py
  • bio/kraken2/classify/wrapper.py

@warrok97ao
Copy link
Copy Markdown
Author

Thank your for the great suggestions!
The Code quality / formatting check appears to be failing due to repo-wide Black formatting, not changes in this PR.
When running:
black --check //*/wrapper.py //wrapper.py

In CI it reports 120 existing wrapper.py files would be reformatted
My diff against upstream/master only includes bio/kraken2/** and test_wrappers.py.
How can I fix the issue?
Thanks

Comment thread bio/kraken2/build/test/test_db/opts.k2d Outdated
Comment thread bio/kraken2/build/test/test_db/hash.k2d Outdated
Comment thread bio/kraken2/build/test/test_db/taxo.k2d Outdated
"{db}/library/tiny2.fa",
],
output:
db=directory("{db}/library/added"),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this path determined by kraken or can the user change it?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the path library/added is defined by k2, not user-defined

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dlaehnemann same question as in comment below.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would solve this analogously to the kraken2_build wrapper. So explicitly state here, that the path needs to be exactly this, and that the {db} entry needs to match the db= entry in the input:. And then it's probably best to also assert that in the wrapper itself.

Comment thread bio/kraken2/build/test/Snakefile Outdated
Comment thread bio/kraken2/classify/test/Snakefile Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
bio/kraken2/classify/wrapper.py (1)

29-30: Unnecessary list conversion for db.

At line 23, db is assigned as db_dirs[0], which is already a string (the directory path). Line 29 wraps it in a list, but db can only ever be a single directory string at this point, making the isinstance check always true and the list wrapping unnecessary.

♻️ Simplify db handling
-db = [db] if isinstance(db, str) else db
+# db is already a single directory path string

Then update line 39 to use db directly:

-input_str_db = ",".join(db)
+input_str_db = db
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/classify/wrapper.py` around lines 29 - 30, db is already set from
db_dirs[0] (a single string), so the conversion db = [db] if isinstance(db, str)
else db is unnecessary; remove that list-wrapping and keep db as a string, leave
the fq1 conversion as-is, and update any later usage that assumed a list (e.g.,
where db was indexed or iterated) to use db directly (references: db_dirs, db,
fq1 in wrapper.py).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bio/kraken2/add_to_library/wrapper.py`:
- Around line 16-18: The current assignment to log makes it None which
serializes to the string "None" when formatted into the shell command; change
the logic around snakemake.log handling so that the variable used in the command
is an empty string when no log is provided (or else conditionally include the
"--log <path>" token only when snakemake.log is truthy). Specifically, update
the snippet that sets log (referencing snakemake.log and the log variable) to
produce either an empty string or omit the flag entirely when snakemake.log is
empty so that "{log}" in the command does not expand to "None".

In `@bio/kraken2/build/wrapper.py`:
- Around line 13-19: The comparison uses os.path.abspath which can mismatch when
symlinks are involved; change the computation of db_dir and out_dir to use
os.path.realpath instead of os.path.abspath (i.e., compute db_dir =
os.path.realpath(snakemake.input.db) and out_dir =
os.path.dirname(os.path.realpath(snakemake.output[key]))) so both sides are
canonicalized before the assert in the loop that checks equality for "hash",
"opts", "taxo".

In `@bio/kraken2/classify/wrapper.py`:
- Around line 43-48: The block that builds db_files, db_dirs and db_dir
(variables db_files, db_dirs, db_dir) is a duplicate of the earlier validation
and should be removed; delete the entire repeated validation block and any
assignment to db_dir, and ensure the code continues to use the previously
computed db variable from the first validation (lines that set db from
snakemake.input should remain unchanged).

---

Nitpick comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 29-30: db is already set from db_dirs[0] (a single string), so the
conversion db = [db] if isinstance(db, str) else db is unnecessary; remove that
list-wrapping and keep db as a string, leave the fq1 conversion as-is, and
update any later usage that assumed a list (e.g., where db was indexed or
iterated) to use db directly (references: db_dirs, db, fq1 in wrapper.py).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 900d67a6-a2c2-4c2f-acbc-51ec06d49d13

📥 Commits

Reviewing files that changed from the base of the PR and between 3abbe96 and cf45a71.

📒 Files selected for processing (19)
  • bio/kraken2/add_to_library/meta.yaml
  • bio/kraken2/add_to_library/test/Snakefile
  • bio/kraken2/add_to_library/test/test_db/library/added/added.md5
  • bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt
  • bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt
  • bio/kraken2/add_to_library/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna
  • bio/kraken2/add_to_library/wrapper.py
  • bio/kraken2/build/test/Snakefile
  • bio/kraken2/build/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt
  • bio/kraken2/build/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt
  • bio/kraken2/build/test/test_db/prelim_map.txt
  • bio/kraken2/build/wrapper.py
  • bio/kraken2/classify/test/Snakefile
  • bio/kraken2/classify/test/test_db/library/added/added.md5
  • bio/kraken2/classify/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt
  • bio/kraken2/classify/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt
  • bio/kraken2/classify/test/test_db/prelim_map.txt
  • bio/kraken2/classify/wrapper.py
  • test_wrappers.py
✅ Files skipped from review due to trivial changes (7)
  • bio/kraken2/add_to_library/test/test_db/library/added/added.md5
  • bio/kraken2/add_to_library/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna
  • bio/kraken2/classify/test/test_db/library/added/added.md5
  • bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt
  • bio/kraken2/add_to_library/meta.yaml
  • bio/kraken2/build/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt
  • bio/kraken2/add_to_library/test/Snakefile
🚧 Files skipped from review as they are similar to previous changes (3)
  • bio/kraken2/build/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt
  • bio/kraken2/build/test/Snakefile
  • bio/kraken2/classify/test/Snakefile

Comment thread bio/kraken2/add_to_library/wrapper.py
Comment thread bio/kraken2/build/wrapper.py Outdated
Comment thread bio/kraken2/classify/wrapper.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
bio/kraken2/classify/wrapper.py (1)

55-62: ⚠️ Potential issue | 🔴 Critical

Paired-end suffix check fails for compressed files (e.g., .fastq.gz).

os.path.splitext() only removes the final extension, so sample_1.fastq.gz yields stem sample_1.fastq which doesn't end with _1. This incorrectly rejects valid paired-end compressed outputs.

🐛 Proposed fix using pathlib to handle multi-part extensions
     if paired:
-        stem, ext = os.path.splitext(out_file)
-        if not stem.endswith("_1"):
+        p = Path(out_file)
+        suffixes = "".join(p.suffixes)
+        core = p.name[: -len(suffixes)] if suffixes else p.name
+        if not core.endswith("_1"):
             raise ValueError(
                 f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension."
             )
-        kraken_template = stem[:-2] + "#" + ext
+        kraken_template = str(p.parent / (core[:-2] + "#" + suffixes))
         return f"{flag_name} {kraken_template}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/classify/wrapper.py` around lines 55 - 62, The current paired-end
check uses os.path.splitext so compressed filenames like "sample_1.fastq.gz"
leave stem "sample_1.fastq" and fail the "_1" check; update the logic around
out_file to use pathlib.Path to peel off all compression suffixes (e.g.,
Path(out_file).suffixes) or iteratively strip known compression extensions
('.gz', '.bz2', '.xz') so you obtain the true base stem and the full extension
string (ext = concatenated suffixes); then perform the stem.endswith("_1") check
on that base stem and build kraken_template as base_stem[:-2] + "#" + ext,
returning f"{flag_name} {kraken_template}" as before (update references to
out_file, stem, ext, flag_name, and kraken_template accordingly).
🧹 Nitpick comments (2)
bio/kraken2/classify/wrapper.py (2)

7-10: Unused imports: TemporaryDirectory and move_files.

These imports are not used in the current implementation. Consider removing them to keep the code clean.

♻️ Suggested cleanup
 import os
-from pathlib import Path
-from tempfile import TemporaryDirectory
 from snakemake.shell import shell
-from snakemake_wrapper_utils.snakemake import move_files

Note: If you apply the fix for the paired-end suffix check, keep from pathlib import Path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/classify/wrapper.py` around lines 7 - 10, The file imports
TemporaryDirectory and move_files which are not used; remove the unused imports
to clean up the module: delete TemporaryDirectory from the import line "from
tempfile import TemporaryDirectory" and remove the "from
snakemake_wrapper_utils.snakemake import move_files" import; keep "from pathlib
import Path" as noted for paired-end suffix checks and ensure no other code
references TemporaryDirectory or move_files before committing.

33-34: Redundant list wrapping for db.

db is always a single directory string derived from db_dirs[0], so wrapping it in a list and then joining with commas (line 43) is unnecessary. This doesn't break anything but adds confusion.

♻️ Optional simplification
-db = [db] if isinstance(db, str) else db
 fq1 = [fq1] if isinstance(fq1, str) else fq1
 ...
-input_str_db = ",".join(db)
+input_str_db = db

Also applies to: 43-43

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/classify/wrapper.py` around lines 33 - 34, The code redundantly
wraps db into a list then later joins it (db = [db] if isinstance(db, str) else
db), but db is always a single directory string derived from db_dirs[0]; remove
that unnecessary list-wrapping and treat db as a string (keep fq1 wrapping if
needed) so downstream joining or formatting uses the original string. Update
usages that expect db to be a list (e.g., wherever db is joined with commas) to
accept a string or directly use db without list conversion; reference the db
variable assignment in the wrapper and the later join/format that combines db
values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 55-62: The current paired-end check uses os.path.splitext so
compressed filenames like "sample_1.fastq.gz" leave stem "sample_1.fastq" and
fail the "_1" check; update the logic around out_file to use pathlib.Path to
peel off all compression suffixes (e.g., Path(out_file).suffixes) or iteratively
strip known compression extensions ('.gz', '.bz2', '.xz') so you obtain the true
base stem and the full extension string (ext = concatenated suffixes); then
perform the stem.endswith("_1") check on that base stem and build
kraken_template as base_stem[:-2] + "#" + ext, returning f"{flag_name}
{kraken_template}" as before (update references to out_file, stem, ext,
flag_name, and kraken_template accordingly).

---

Nitpick comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 7-10: The file imports TemporaryDirectory and move_files which are
not used; remove the unused imports to clean up the module: delete
TemporaryDirectory from the import line "from tempfile import
TemporaryDirectory" and remove the "from snakemake_wrapper_utils.snakemake
import move_files" import; keep "from pathlib import Path" as noted for
paired-end suffix checks and ensure no other code references TemporaryDirectory
or move_files before committing.
- Around line 33-34: The code redundantly wraps db into a list then later joins
it (db = [db] if isinstance(db, str) else db), but db is always a single
directory string derived from db_dirs[0]; remove that unnecessary list-wrapping
and treat db as a string (keep fq1 wrapping if needed) so downstream joining or
formatting uses the original string. Update usages that expect db to be a list
(e.g., wherever db is joined with commas) to accept a string or directly use db
without list conversion; reference the db variable assignment in the wrapper and
the later join/format that combines db values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 19975bc5-9f7b-48a9-9199-b51dc6911a04

📥 Commits

Reviewing files that changed from the base of the PR and between cf45a71 and f5d8744.

📒 Files selected for processing (3)
  • bio/kraken2/add_to_library/wrapper.py
  • bio/kraken2/build/wrapper.py
  • bio/kraken2/classify/wrapper.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
bio/kraken2/classify/wrapper.py (1)

55-60: ⚠️ Potential issue | 🟠 Major

Handle multi-part extensions when validating paired *_1 outputs.

Line 55 uses os.path.splitext(), which only strips the last suffix. For names like sample.classified_1.fastq.gz, Line 56 evaluates sample.classified_1.fastq.endswith("_1") and incorrectly rejects valid paired output patterns.

Proposed fix
-        stem, ext = os.path.splitext(out_file)
-        if not stem.endswith("_1"):
+        p = Path(out_file)
+        full_ext = "".join(p.suffixes)
+        core_name = p.name[: -len(full_ext)] if full_ext else p.name
+        if not core_name.endswith("_1"):
             raise ValueError(
                 f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension."
             )
-        kraken_template = stem[:-2] + "#" + ext
-        return f"{flag_name} {kraken_template}"
+        kraken_template = str(
+            p.with_name(f"{core_name[:-2]}#{full_ext}")
+        )
+        return f"{flag_name} {kraken_template}"
#!/bin/bash
python - <<'PY'
import os
samples = [
    "sample_1.fastq",
    "sample_1.fastq.gz",
    "sample_1.fq.gz",
]
for s in samples:
    stem, ext = os.path.splitext(s)
    print(f"{s:20} splitext -> stem='{stem}', ext='{ext}', endswith('_1')={stem.endswith('_1')}")
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/classify/wrapper.py` around lines 55 - 60, The current logic uses
os.path.splitext on out_file (variables stem, ext) which only removes the final
suffix and mis-detects names like "sample.classified_1.fastq.gz"; update the
logic to compute the full extension using pathlib.Path(out_file).suffixes (or
repeatedly rsplit to join all suffixes into ext_str), derive stem_full by
stripping that full extension from out_file, then validate
stem_full.endswith("_1") (use flag_name in the error message), and construct
kraken_template as stem_full[:-2] + "#" + "".join(suffixes) (or ext_str) so
multi-part extensions like ".fastq.gz" are handled correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bio/kraken2/add_to_library/wrapper.py`:
- Around line 11-12: The code currently builds fasta_str by joining multiple
FASTA paths (fasta = [fasta] if isinstance(fasta, str) else list(fasta);
fasta_str = " ".join(fasta)) and then passes that into the kraken2-build call as
--file {fasta_str}; instead, change the logic to iterate over the fasta list and
call the kraken2-build invocation once per file: keep the conversion to list
(fasta = ...), then for each entry in fasta call the existing command
construction/invocation (the place that uses --file {fasta_str}) with the single
filename (e.g., --file {f}) and run the subprocess/runner for each file
sequentially, preserving existing options and error handling.

---

Duplicate comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 55-60: The current logic uses os.path.splitext on out_file
(variables stem, ext) which only removes the final suffix and mis-detects names
like "sample.classified_1.fastq.gz"; update the logic to compute the full
extension using pathlib.Path(out_file).suffixes (or repeatedly rsplit to join
all suffixes into ext_str), derive stem_full by stripping that full extension
from out_file, then validate stem_full.endswith("_1") (use flag_name in the
error message), and construct kraken_template as stem_full[:-2] + "#" +
"".join(suffixes) (or ext_str) so multi-part extensions like ".fastq.gz" are
handled correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4b9aeecd-b8f6-484a-ae8c-3f0f26daa536

📥 Commits

Reviewing files that changed from the base of the PR and between f5d8744 and d75ce45.

📒 Files selected for processing (2)
  • bio/kraken2/add_to_library/wrapper.py
  • bio/kraken2/classify/wrapper.py

Comment on lines +11 to +12
fasta = [fasta] if isinstance(fasta, str) else list(fasta)
fasta_str = " ".join(fasta)
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify k2 add-to-library --file accepts multiple files or requires separate calls.

# Check k2 help output for add-to-library usage
k2 add-to-library --help 2>&1 | grep -A 10 -i "file"

Repository: snakemake/snakemake-wrappers

Length of output: 54


🏁 Script executed:

# First, let's explore the wrapper file structure and context
cat -n bio/kraken2/add_to_library/wrapper.py

Repository: snakemake/snakemake-wrappers

Length of output: 955


🏁 Script executed:

# Check if there are test files for this wrapper
fd -t f "test.*kraken2.*add_to_library|add_to_library.*test" .

Repository: snakemake/snakemake-wrappers

Length of output: 54


🏁 Script executed:

# Look for similar wrappers handling multiple input files
rg -l "isinstance.*str.*else.*list" --type py | head -10

Repository: snakemake/snakemake-wrappers

Length of output: 141


🌐 Web query:

k2 add-to-library CLI documentation multiple files

💡 Result:

The k2 add-to-library command is part of Kraken2 (k2 is shorthand for Kraken2), a taxonomic sequence classification tool. It is used via kraken2-build --add-to-library FILE --db DB_NAME to add a single FASTA/FASTQ file to the library of a Kraken2 database before building it. The command does not natively support multiple files in a single invocation; FILE specifies one file. To add multiple files: 1. Loop over files sequentially: for file in .fna; do kraken2-build --add-to-library "$file" --db $DB_NAME done 2. For parallel addition (faster for many files, post-2018 fix for race conditions): find . -name ".fna" -print0 | xargs -0 -I{} -P8 kraken2-build --add-to-library {} --db $DB_NAME (Adjust -P for number of threads/CPUs.) Requirements: - FASTA headers must include |kraken:taxid|XXXXX (taxid from NCBI taxonomy). - Run --download-taxonomy --db $DB_NAME first. - After adding files, run kraken2-build --build --db $DB_NAME --threads N. - Cannot add to a fully built database; must rebuild. Official documentation: Kraken2 GitHub wiki (manual) and manpages confirm single-file usage with loop/xargs for multiples. Examples from Biostars, GitHub issues, and tutorials.

Citations:


Loop over FASTA files: kraken2-build --add-to-library accepts only a single file per call.

The current implementation joins multiple files with spaces (line 12) and passes them as --file {fasta_str} (line 25). According to Kraken2 documentation, the --add-to-library command accepts only one file per invocation. Multiple files must be added by calling the command once per file.

The wrapper must loop over the fasta list and invoke the command for each file:

🔧 Proposed fix
 fasta = [fasta] if isinstance(fasta, str) else list(fasta)
-fasta_str = " ".join(fasta)
 
 extra = snakemake.params.get("extra", "")
 
 log = snakemake.log[0] if snakemake.log else None
 if log:
     log = f"--log {log}"
 else:
     log = ""
 
-shell(
-    "k2 add-to-library"
-    " --db {snakemake.input.db}"
-    " --file {fasta_str}"
-    " --threads {snakemake.threads}"
-    " {extra}"
-    " {log}"
-)
+for fasta_file in fasta:
+    shell(
+        "k2 add-to-library"
+        " --db {snakemake.input.db}"
+        " --file {fasta_file}"
+        " --threads {snakemake.threads}"
+        " {extra}"
+        " {log}"
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bio/kraken2/add_to_library/wrapper.py` around lines 11 - 12, The code
currently builds fasta_str by joining multiple FASTA paths (fasta = [fasta] if
isinstance(fasta, str) else list(fasta); fasta_str = " ".join(fasta)) and then
passes that into the kraken2-build call as --file {fasta_str}; instead, change
the logic to iterate over the fasta list and call the kraken2-build invocation
once per file: keep the conversion to list (fasta = ...), then for each entry in
fasta call the existing command construction/invocation (the place that uses
--file {fasta_str}) with the single filename (e.g., --file {f}) and run the
subprocess/runner for each file sequentially, preserving existing options and
error handling.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@warrok97ao it seems this is a real issue, no?

Comment on lines +11 to +12
fasta = [fasta] if isinstance(fasta, str) else list(fasta)
fasta_str = " ".join(fasta)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@warrok97ao it seems this is a real issue, no?

f"Output -> {key} must be in the same directory as the input database"
f"({db_dir}), but it is in {out_dir}"
)
log = snakemake.log[0] if snakemake.log else ""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log = snakemake.log[0] if snakemake.log else ""
log = snakemake.get("log", "")

Comment on lines +16 to +21
log = snakemake.log[0] if snakemake.log else None
if log:
log = f"--log {log}"
else:
log = ""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log = snakemake.log[0] if snakemake.log else None
if log:
log = f"--log {log}"
else:
log = ""
log = snakemake.get("log", "")
if log:
log = f"--log {log}"

Comment on lines +13 to +19
db_dir = os.path.realpath(snakemake.input.db)
for key in ("hash", "opts", "taxo"):
out_dir = os.path.dirname(os.path.realpath(snakemake.output[key]))
assert out_dir == db_dir, (
f"Output -> {key} must be in the same directory as the input database"
f"({db_dir}), but it is in {out_dir}"
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
db_dir = os.path.realpath(snakemake.input.db)
for key in ("hash", "opts", "taxo"):
out_dir = os.path.dirname(os.path.realpath(snakemake.output[key]))
assert out_dir == db_dir, (
f"Output -> {key} must be in the same directory as the input database"
f"({db_dir}), but it is in {out_dir}"
)
db_dir = os.path.commonprefix(snakemake.input.db)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test where you classify against two DBs?

Comment on lines +21 to +26
db_files = [hash_file, opts_file, taxo_file]
db_dirs = [os.path.dirname(os.path.abspath(f)) for f in db_files]
assert all(
d == db_dirs[0] for d in db_dirs
), f"All db files (hash, opts, taxo) must be in the same directory, got: {db_dirs}"
db = db_dirs[0]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would this work with more than one DB?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants