feat: add kraken2 wrappers (classify, build, add-to-library)#5086
feat: add kraken2 wrappers (classify, build, add-to-library)#5086warrok97ao wants to merge 39 commits into
Conversation
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 15 minutes and 5 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds three Kraken2 Snakemake wrappers (classify, build, add_to_library) with metadata, conda environment specifications, Python wrapper implementations, tests, and test fixtures to support classification, database build, and library additions. Changes
Sequence Diagram(s)sequenceDiagram
participant SM as Snakemake
participant W as Wrapper.py
participant K as k2 (Kraken2)
participant FS as Filesystem
SM->>W: invoke rule (inputs, params, threads, log)
W->>W: validate inputs/outputs, normalize lists, format flags
W->>K: execute "k2 classify|build|add-to-library" with assembled args
K->>FS: read DB, library files, input reads/FASTA
K->>FS: write outputs (reports, .k2d files, added FASTA)
K-->>W: exit status
W-->>SM: return success/failure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR adds three new Snakemake wrappers under bio/kraken2/ to expose Kraken2 functionality via the k2 CLI (classify reads, add FASTA to a DB library, and build a DB), along with conda environments and wrapper-level tests that exercise SE/PE classification and DB operations.
Changes:
- Added
bio/kraken2/classifywrapper + test Snakefile and test fixtures for SE/PE classification. - Added
bio/kraken2/buildwrapper + test Snakefile and test fixtures for building a Kraken2 database. - Added
bio/kraken2/add_to_librarywrapper + test Snakefile and test fixtures for adding FASTA sequences into an existing DB library.
Reviewed changes
Copilot reviewed 54 out of 63 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| test_wrappers.py | Registers tests for the three new Kraken2 wrappers. |
| bio/kraken2/classify/wrapper.py | Implements k2 classify wrapper with SE/PE handling and optional outputs. |
| bio/kraken2/classify/test/Snakefile | Test rules for SE and PE classification against a tiny test DB. |
| bio/kraken2/classify/meta.yaml | Wrapper metadata for inputs/outputs/params. |
| bio/kraken2/classify/environment.yaml | Conda env definition for Kraken2. |
| bio/kraken2/classify/environment.linux-64.pin.txt | Linux-64 explicit conda lock for reproducible tests. |
| bio/kraken2/classify/test/test_db/taxonomy/nodes.dmp | Minimal taxonomy fixture for test DB. |
| bio/kraken2/classify/test/test_db/taxonomy/names.dmp | Minimal taxonomy names fixture for test DB. |
| bio/kraken2/classify/test/test_db/taxo.k2d | Prebuilt taxonomy binary fixture for test DB. |
| bio/kraken2/classify/test/test_db/hash.k2d | Prebuilt hash binary fixture for test DB. |
| bio/kraken2/classify/test/test_db/opts.k2d | Prebuilt opts binary fixture for test DB. |
| bio/kraken2/classify/test/test_db/seqid2taxid.map | Sequence-to-taxid mapping fixture for test DB. |
| bio/kraken2/classify/test/test_db/prelim_map.txt | Kraken2 prelim map fixture for test DB. |
| bio/kraken2/classify/test/test_db/estimated_capacity | Capacity fixture for test DB. |
| bio/kraken2/classify/test/test_db/library/tiny.fa | Tiny library FASTA fixture. |
| bio/kraken2/classify/test/test_db/library/tiny2.fa | Second tiny library FASTA fixture. |
| bio/kraken2/classify/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fna | “Added” library artifact fixture. |
| bio/kraken2/classify/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna | “Added” library artifact fixture. |
| bio/kraken2/classify/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt | “Added” prelim map fixture. |
| bio/kraken2/classify/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt | “Added” prelim map fixture. |
| bio/kraken2/classify/test/test_db/library/added/added.md5 | “Added” md5 manifest fixture. |
| bio/kraken2/classify/test/reads/a_R1.fastq | PE/SE test read fixture. |
| bio/kraken2/classify/test/reads/a_R2.fastq | PE test read fixture. |
| bio/kraken2/classify/test/reads/a_R1.1.fastq | Additional read fixture variant. |
| bio/kraken2/classify/test/reads/a_R1.2.fastq | Additional read fixture variant. |
| bio/kraken2/classify/test/reads/a_R2.1.fastq | Additional read fixture variant. |
| bio/kraken2/classify/test/reads/a_R2.2.fastq | Additional read fixture variant. |
| bio/kraken2/classify/test/reads/b_R1.fastq | SE test read fixture. |
| bio/kraken2/build/wrapper.py | Implements k2 build wrapper. |
| bio/kraken2/build/test/Snakefile | Test rule for DB build. |
| bio/kraken2/build/meta.yaml | Wrapper metadata for build step. |
| bio/kraken2/build/environment.yaml | Conda env definition for Kraken2. |
| bio/kraken2/build/environment.linux-64.pin.txt | Linux-64 explicit conda lock for reproducible tests. |
| bio/kraken2/build/test/test_db/taxonomy/nodes.dmp | Minimal taxonomy fixture for build tests. |
| bio/kraken2/build/test/test_db/taxonomy/names.dmp | Minimal taxonomy names fixture for build tests. |
| bio/kraken2/build/test/test_db/taxo.k2d | Taxonomy binary fixture (build tests). |
| bio/kraken2/build/test/test_db/hash.k2d | Hash binary fixture (build tests). |
| bio/kraken2/build/test/test_db/opts.k2d | Opts binary fixture (build tests). |
| bio/kraken2/build/test/test_db/seqid2taxid.map | Sequence-to-taxid mapping fixture (build tests). |
| bio/kraken2/build/test/test_db/prelim_map.txt | Kraken2 prelim map fixture (build tests). |
| bio/kraken2/build/test/test_db/estimated_capacity | Capacity fixture (build tests). |
| bio/kraken2/build/test/test_db/library/tiny.fa | Tiny library FASTA fixture (build tests). |
| bio/kraken2/build/test/test_db/library/tiny2.fa | Second tiny library FASTA fixture (build tests). |
| bio/kraken2/build/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fna | “Added” library artifact fixture (build tests). |
| bio/kraken2/build/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna | “Added” library artifact fixture (build tests). |
| bio/kraken2/build/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt | “Added” prelim map fixture (build tests). |
| bio/kraken2/build/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt | “Added” prelim map fixture (build tests). |
| bio/kraken2/build/test/test_db/library/added/added.md5 | “Added” md5 manifest fixture (build tests). |
| bio/kraken2/add_to_library/wrapper.py | Implements k2 add-to-library wrapper. |
| bio/kraken2/add_to_library/test/Snakefile | Test rule for add-to-library step. |
| bio/kraken2/add_to_library/meta.yaml | Wrapper metadata for add-to-library step. |
| bio/kraken2/add_to_library/environment.yaml | Conda env definition for Kraken2. |
| bio/kraken2/add_to_library/environment.linux-64.pin.txt | Linux-64 explicit conda lock for reproducible tests. |
| bio/kraken2/add_to_library/test/test_db/taxonomy/nodes.dmp | Minimal taxonomy fixture for add-to-library tests. |
| bio/kraken2/add_to_library/test/test_db/taxonomy/names.dmp | Minimal taxonomy names fixture for add-to-library tests. |
| bio/kraken2/add_to_library/test/test_db/library/tiny.fa | Tiny library FASTA fixture (add-to-library tests). |
| bio/kraken2/add_to_library/test/test_db/library/tiny2.fa | Second tiny library FASTA fixture (add-to-library tests). |
| bio/kraken2/add_to_library/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fna | “Added” library artifact fixture (add-to-library tests). |
| bio/kraken2/add_to_library/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna | “Added” library artifact fixture (add-to-library tests). |
| bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt | “Added” prelim map fixture (add-to-library tests). |
| bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt | “Added” prelim map fixture (add-to-library tests). |
| bio/kraken2/add_to_library/test/test_db/library/added/added.md5 | “Added” md5 manifest fixture (add-to-library tests). |
| bio/kraken2/add_to_library/test/test_db/library/added/.snakemake_timestamp | Timestamp artifact present in added fixtures. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| stem, ext = os.path.splitext(out_file) | ||
| if not stem.endswith("_1"): | ||
| raise ValueError( | ||
| f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension." | ||
| ) | ||
| kraken_template = stem[:-2] + "#" + ext |
There was a problem hiding this comment.
@warrok97ao is this valid? Can kraken2 output .gz files?
There was a problem hiding this comment.
The manual does not specify anything about the output format, with no mention of gz files accepted as valid output. It only accounts for input extensions (bz2, gzip, and xz).
Users might still name their files with extension for their output (e.g. classified/unclassified out) even if kraken2 does not compress them, so it might be safer to use the suggested logic
| hash_file = snakemake.output.get("hash") | ||
| opts_file = snakemake.output.get("opts") | ||
| taxo_file = snakemake.output.get("taxo") | ||
|
|
||
| assert hash_file is not None, "Output -> hash is required" | ||
| assert opts_file is not None, "Output -> opts is required" | ||
| assert taxo_file is not None, "Output -> taxo is required" |
| db = snakemake.input.get("db") | ||
| assert db is not None, "Input -> db is a required input parameter" | ||
|
|
||
| fasta = snakemake.input.get("fasta") | ||
| assert fasta is not None, "Input -> fasta is a required input parameter" | ||
|
|
||
| fasta = [fasta] if isinstance(fasta, str) else list(fasta) | ||
| fasta_str = " ".join(fasta) | ||
|
|
||
| extra = snakemake.params.get("extra", "") | ||
|
|
||
| log_file = snakemake.log[0] if snakemake.log else None | ||
| log_flag = f"--log {log_file}" if log_file else "" | ||
|
|
||
| shell( | ||
| "k2 add-to-library " | ||
| "--db {db} " | ||
| "--file {fasta_str} " | ||
| "--threads {snakemake.threads} " | ||
| "{log_flag} " | ||
| "{extra}" |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (3)
bio/kraken2/classify/environment.yaml (1)
6-6: Add whitespace before the version specifier.Per snakemake-wrappers conventions, version specifications should have whitespace before the equal sign.
Proposed fix
channels: - conda-forge - bioconda - nodefaults dependencies: - - kraken2=2.17.1 + - kraken2 =2.17.1Based on learnings: In snakemake-wrappers repository, environment.yaml files should use whitespace before the equal sign in version specifications (e.g., "datavzrd =2.53.1").
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/classify/environment.yaml` at line 6, The version specifier for the package entry "kraken2=2.17.1" in the environment.yaml lacks the required whitespace before the equal sign; update that entry to use the snakemake-wrappers convention by changing it to "kraken2 =2.17.1" so the package name and the equals sign are separated by a space.bio/kraken2/build/environment.yaml (1)
6-6: Align dependency pin formatting with repo convention.Line 6 uses
kraken2=2.17.1; use a space before=inenvironment.yamlentries for consistency.♻️ Proposed update
- - kraken2=2.17.1 + - kraken2 =2.17.1Based on learnings: In snakemake-wrappers repository, environment.yaml files should use whitespace before the equal sign in version specifications.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/build/environment.yaml` at line 6, Fix the formatting of the package pin in the environment.yaml by changing the entry 'kraken2=2.17.1' to include a space before the equals sign (i.e., follow the repo convention used across environment.yaml files), ensuring consistent formatting for package version specs in this file.bio/kraken2/add_to_library/environment.yaml (1)
6-6: Use repository-standard pin formatting.Line 6 should include whitespace before
=to match wrapper environment style conventions.♻️ Proposed update
- - kraken2=2.17.1 + - kraken2 =2.17.1Based on learnings: In snakemake-wrappers repository, environment.yaml files should use whitespace before the equal sign in version specifications.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/add_to_library/environment.yaml` at line 6, Update the package pin in environment.yaml to use the repository-standard spacing around the equals sign: locate the kraken2 entry in environment.yaml (the line with "kraken2=2.17.1") and change it to include a space before and after the equals sign ("kraken2 = 2.17.1") so it matches wrapper environment style conventions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bio/kraken2/add_to_library/test/test_db/library/added/added.md5`:
- Around line 1-2: The committed test fixture added.md5 contains developer-local
absolute paths (/home/aleone/...) which break portability; edit added.md5 (the
two lines referencing tiny.fa and tiny2.fa) to use relative paths or just the
filenames (e.g., "library/tiny.fa" or "tiny.fa" and the corresponding tiny_...
.fna names) so the MD5 file is host-independent, then update any test helpers if
they expect absolute paths.
In `@bio/kraken2/build/test/test_db/library/tiny2.fa`:
- Around line 1-6: The test FASTA contains sequences seq2 and seq3 tagged with
taxids 444 and 453 which are missing from the taxonomy files, so either add
entries for taxid 444 and 453 to the taxonomy (both names.dmp and nodes.dmp)
with valid parent, rank and name records matching Kraken2 format, or change
those sequence headers in tiny2.fa to use an existing taxid (e.g., replace 444
and 453 with 562) so all taxids referenced by seq2 and seq3 are present in the
taxonomy and will be included during database build.
In `@bio/kraken2/classify/test/test_db/library/added/added.md5`:
- Around line 1-2: The committed fixture manifest
bio/kraken2/classify/test/test_db/library/added/added.md5 contains
machine-specific absolute paths (/home/aleone/...) and an incorrect reference to
the add_to_library test tree; update this file to use portable relative
filenames and basenames instead of absolute paths (e.g., replace the full paths
before the checksums so lines start with tiny.fa and tiny2.fa or the expected
relative path within the classify test DB), keep the checksum and
output-filename suffixes intact (tiny_6861723... .fna, tiny2_92a2... .fna), and
ensure the manifest points only to files within the classify test fixture rather
than referencing add_to_library.
In
`@bio/kraken2/classify/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna`:
- Around line 1-6: The FASTA contains sequences with taxids 444 and 453 (seq2
and seq3 in tiny2_92a2a273338dd5873628711dfbc7180a.fna) which are not present in
the test taxonomy; either add entries for taxids 444 and 453 to the test
taxonomy (nodes.dmp/names.dmp) matching the expected format, or change seq2 and
seq3 headers to use an existing valid taxid (e.g., replace "|kraken:taxid|444"
and "|kraken:taxid|453" with "|kraken:taxid|562") so all taxids in the FASTA
match the classify test taxonomy.
In `@bio/kraken2/classify/wrapper.py`:
- Around line 47-53: The paired-end suffix check fails for compressed files
because os.path.splitext only removes the last extension; change the logic to
treat multi-part extensions correctly by using pathlib.Path.suffixes (or strip
known compression suffixes) to build full_ext = ''.join(Path(out_file).suffixes)
and core_name = Path(out_file).name[:-len(full_ext)], then verify
core_name.endswith("_1"), compute core_name[:-2] + "#" + full_ext for
kraken_template, and return f"{flag_name} {kraken_template}" (references:
variables out_file, flag_name, kraken_template).
---
Nitpick comments:
In `@bio/kraken2/add_to_library/environment.yaml`:
- Line 6: Update the package pin in environment.yaml to use the
repository-standard spacing around the equals sign: locate the kraken2 entry in
environment.yaml (the line with "kraken2=2.17.1") and change it to include a
space before and after the equals sign ("kraken2 = 2.17.1") so it matches
wrapper environment style conventions.
In `@bio/kraken2/build/environment.yaml`:
- Line 6: Fix the formatting of the package pin in the environment.yaml by
changing the entry 'kraken2=2.17.1' to include a space before the equals sign
(i.e., follow the repo convention used across environment.yaml files), ensuring
consistent formatting for package version specs in this file.
In `@bio/kraken2/classify/environment.yaml`:
- Line 6: The version specifier for the package entry "kraken2=2.17.1" in the
environment.yaml lacks the required whitespace before the equal sign; update
that entry to use the snakemake-wrappers convention by changing it to "kraken2
=2.17.1" so the package name and the equals sign are separated by a space.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3735cb70-63f2-4e13-92ed-78e5352f3b81
⛔ Files ignored due to path filters (2)
bio/kraken2/build/test/test_db/seqid2taxid.mapis excluded by!**/*.mapbio/kraken2/classify/test/test_db/seqid2taxid.mapis excluded by!**/*.map
📒 Files selected for processing (61)
bio/kraken2/add_to_library/environment.linux-64.pin.txtbio/kraken2/add_to_library/environment.yamlbio/kraken2/add_to_library/meta.yamlbio/kraken2/add_to_library/test/Snakefilebio/kraken2/add_to_library/test/test_db/library/added/.snakemake_timestampbio/kraken2/add_to_library/test/test_db/library/added/added.md5bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txtbio/kraken2/add_to_library/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txtbio/kraken2/add_to_library/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fnabio/kraken2/add_to_library/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fnabio/kraken2/add_to_library/test/test_db/library/tiny.fabio/kraken2/add_to_library/test/test_db/library/tiny2.fabio/kraken2/add_to_library/test/test_db/taxonomy/names.dmpbio/kraken2/add_to_library/test/test_db/taxonomy/nodes.dmpbio/kraken2/add_to_library/wrapper.pybio/kraken2/build/environment.linux-64.pin.txtbio/kraken2/build/environment.yamlbio/kraken2/build/meta.yamlbio/kraken2/build/test/Snakefilebio/kraken2/build/test/test_db/estimated_capacitybio/kraken2/build/test/test_db/hash.k2dbio/kraken2/build/test/test_db/library/added/added.md5bio/kraken2/build/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txtbio/kraken2/build/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txtbio/kraken2/build/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fnabio/kraken2/build/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fnabio/kraken2/build/test/test_db/library/tiny.fabio/kraken2/build/test/test_db/library/tiny2.fabio/kraken2/build/test/test_db/opts.k2dbio/kraken2/build/test/test_db/prelim_map.txtbio/kraken2/build/test/test_db/taxo.k2dbio/kraken2/build/test/test_db/taxonomy/names.dmpbio/kraken2/build/test/test_db/taxonomy/nodes.dmpbio/kraken2/build/wrapper.pybio/kraken2/classify/environment.linux-64.pin.txtbio/kraken2/classify/environment.yamlbio/kraken2/classify/meta.yamlbio/kraken2/classify/test/Snakefilebio/kraken2/classify/test/reads/a_R1.1.fastqbio/kraken2/classify/test/reads/a_R1.2.fastqbio/kraken2/classify/test/reads/a_R1.fastqbio/kraken2/classify/test/reads/a_R2.1.fastqbio/kraken2/classify/test/reads/a_R2.2.fastqbio/kraken2/classify/test/reads/a_R2.fastqbio/kraken2/classify/test/reads/b_R1.fastqbio/kraken2/classify/test/test_db/estimated_capacitybio/kraken2/classify/test/test_db/hash.k2dbio/kraken2/classify/test/test_db/library/added/added.md5bio/kraken2/classify/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txtbio/kraken2/classify/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txtbio/kraken2/classify/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fnabio/kraken2/classify/test/test_db/library/added/tiny_6861723b4fe267de9fea4bf63e327b14.fnabio/kraken2/classify/test/test_db/library/tiny.fabio/kraken2/classify/test/test_db/library/tiny2.fabio/kraken2/classify/test/test_db/opts.k2dbio/kraken2/classify/test/test_db/prelim_map.txtbio/kraken2/classify/test/test_db/taxo.k2dbio/kraken2/classify/test/test_db/taxonomy/names.dmpbio/kraken2/classify/test/test_db/taxonomy/nodes.dmpbio/kraken2/classify/wrapper.pytest_wrappers.py
| >seq1|kraken:taxid|562 | ||
| ACGTACGTACGTAAAAACGT | ||
| >seq2|kraken:taxid|444 | ||
| TTGCAATGCAACGGCATGCA | ||
| >seq3|kraken:taxid|453 | ||
| TTCAGTTGCAATGCAATGCA No newline at end of file |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if taxids 444 and 453 exist in the taxonomy files for build tests
echo "=== Checking names.dmp for taxids 444 and 453 ==="
fd -t f 'names.dmp' bio/kraken2/build/test --exec cat {} \; 2>/dev/null | grep -E '^(444|453)\s'
echo "=== Checking nodes.dmp for taxids 444 and 453 ==="
fd -t f 'nodes.dmp' bio/kraken2/build/test --exec cat {} \; 2>/dev/null | grep -E '^(444|453)\s'
echo "=== Full content of taxonomy files ==="
fd -t f -e dmp . bio/kraken2/build/test --exec sh -c 'echo "--- {} ---" && cat {}'Repository: snakemake/snakemake-wrappers
Length of output: 501
🏁 Script executed:
#!/bin/bash
# Search for test setup files and documentation
echo "=== Looking for test setup/documentation files ==="
fd -t f '\.(md|txt|yaml|yml|py|sh)$' bio/kraken2/build/test --exec sh -c 'echo "--- {} ---" && head -20 {}'
echo "=== Checking if there are any other taxonomy sources ==="
fd -t f -e dmp . bio/kraken2/build/test
echo "=== Looking for Kraken2 wrapper or build script ==="
fd -t f -e py . bio/kraken2/build --exec grep -l "kraken2-build" {} \;Repository: snakemake/snakemake-wrappers
Length of output: 1803
🌐 Web query:
Kraken2 build missing taxids behavior handling
💡 Result:
In Kraken2’s DB build, “missing taxids” can mean two different things, and they’re handled differently:
- Sequence/accession can’t be mapped to any taxid (missing mapping)
- During the “Creating sequence ID to taxonomy ID map” step, Kraken2 may write an
unmapped.txtcontaining sequence IDs/accessions it couldn’t map to a taxid (i.e., they never made it intoseqid2taxid.map). Downstream, those sequences won’t be included in the final DB unless you supply/fix mappings. KrakenTools explicitly documents this workflow (unmapped.txt→ generate extra mappings → append intoseqid2taxid.map). [1]
- A taxid exists in
seqid2taxid.map, but that taxid is not present in the taxonomy (nodes.dmp/names.dmp→taxo.k2d)
- In the Kraken2 build code, taxonomy lookup is effectively: external taxid → internal node ID, and if the taxid isn’t in the taxonomy, the internal ID returned is
0. [2] - During minimizer assignment, Kraken2 combines candidate taxids via LCA; because LCA(x, 0) = x and if all candidates are missing you end up with taxid
0, Kraken2 will skip indexing that sequence entirely (if (taxid) { ... ProcessSequence ... }). Net effect: taxids missing from taxonomy cause those sequences to contribute nothing to the DB (they’re silently dropped at indexing time rather than being forced to root). [3]
How to handle it
- If you’re adding custom genomes with “invented” taxids (not in NCBI), you must also add corresponding nodes/names (i.e., extend taxonomy) so those taxids exist in the taxonomy used for the build; otherwise those sequences won’t be indexed. (This is the common situation discussed in Kraken2 custom-db threads/issues.) [4]
Sources
- KrakenTools
fix_unmapped.pydocs (unmapped.txt handling). (github.com) - Kraken2
Taxonomy::GetInternalID()returns 0 when an external taxid is absent. (raw.githubusercontent.com) - Kraken2 build logic: missing/0 taxid ⇒ sequence not processed/indexed (
if (taxid) { ... }). (raw.githubusercontent.com) - Kraken2 issue discussing adding custom genomes whose taxids aren’t in NCBI taxonomy (need nodes/names updates). (github.com)
Citations:
- 1: https://github.com/jenniferlu717/KrakenTools?utm_source=openai
- 2: https://raw.githubusercontent.com/DerrickWood/kraken2/master/src/taxonomy.h
- 3: https://raw.githubusercontent.com/DerrickWood/kraken2/master/src/build_db.cc
- 4: create nodes.dmp and names.dmp files for custom database DerrickWood/kraken2#436
🏁 Script executed:
#!/bin/bash
# Look for test configuration and documentation
echo "=== Checking for test README or docs ==="
fd -t f -e md -e txt -e rst . bio/kraken2/build/test
echo "=== Checking wrapper YAML/Snakefile ==="
fd -t f '\(Snakefile\|\.yaml\|\.yml\)' bio/kraken2/build --exec sh -c 'echo "--- {} ---" && cat {}'
echo "=== Looking for any test comments/setup in Python files ==="
fd -t f -e py . bio/kraken2/build --exec sh -c 'echo "--- {} ---" && head -30 {}'Repository: snakemake/snakemake-wrappers
Length of output: 1322
Add missing taxids 444 and 453 to the taxonomy files, or replace with existing taxids.
This file references taxids 444 and 453 (lines 3, 5) but the companion names.dmp and nodes.dmp in test_db/taxonomy/ only contain taxids 1, 2, and 562. According to Kraken2's build logic, sequences with taxids missing from the taxonomy are silently dropped during database indexing and will not contribute to the final database. Either add the missing taxids to the taxonomy files or replace these sequences with taxid 562 to ensure they are properly indexed in the test database.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bio/kraken2/build/test/test_db/library/tiny2.fa` around lines 1 - 6, The test
FASTA contains sequences seq2 and seq3 tagged with taxids 444 and 453 which are
missing from the taxonomy files, so either add entries for taxid 444 and 453 to
the taxonomy (both names.dmp and nodes.dmp) with valid parent, rank and name
records matching Kraken2 format, or change those sequence headers in tiny2.fa to
use an existing taxid (e.g., replace 444 and 453 with 562) so all taxids
referenced by seq2 and seq3 are present in the taxonomy and will be included
during database build.
| >seq1|kraken:taxid|562 | ||
| ACGTACGTACGTAAAAACGT | ||
| >seq2|kraken:taxid|444 | ||
| TTGCAATGCAACGGCATGCA | ||
| >seq3|kraken:taxid|453 | ||
| TTCAGTTGCAATGCAATGCA |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check taxonomy files in classify test directory
echo "=== Checking classify test taxonomy for taxids 444, 453 ==="
fd -t f -e dmp . bio/kraken2/classify/test --exec sh -c 'echo "--- {} ---" && cat {}'Repository: snakemake/snakemake-wrappers
Length of output: 428
Add taxids 444 and 453 to classify test taxonomy or replace with valid taxid (562).
This FASTA file references taxids 444 and 453, but the classify test taxonomy (nodes.dmp, names.dmp) only contains taxids 1, 2, and 562. The invalid taxids will cause test failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@bio/kraken2/classify/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna`
around lines 1 - 6, The FASTA contains sequences with taxids 444 and 453 (seq2
and seq3 in tiny2_92a2a273338dd5873628711dfbc7180a.fna) which are not present in
the test taxonomy; either add entries for taxids 444 and 453 to the test
taxonomy (nodes.dmp/names.dmp) matching the expected format, or change seq2 and
seq3 headers to use an existing valid taxid (e.g., replace "|kraken:taxid|444"
and "|kraken:taxid|453" with "|kraken:taxid|562") so all taxids in the FASTA
match the classify test taxonomy.
| stem, ext = os.path.splitext(out_file) | ||
| if not stem.endswith("_1"): | ||
| raise ValueError( | ||
| f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension." | ||
| ) | ||
| kraken_template = stem[:-2] + "#" + ext | ||
| return f"{flag_name} {kraken_template}" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
import os, re
samples = [
"a.classified_1.fastq",
"a.classified_1.fastq.gz",
"a.classified_1.fq.gz",
]
print("Current splitext-based check:")
for s in samples:
stem, ext = os.path.splitext(s)
ok = stem.endswith("_1")
print(f"{s:30} -> stem='{stem}', ext='{ext}', passes={ok}")
print("\nRegex-based check:")
for s in samples:
ok = bool(re.search(r"_1(?=(?:\.[^.]+)+$|$)", s))
out = re.sub(r"_1(?=(?:\.[^.]+)+$|$)", "#", s, count=1) if ok else None
print(f"{s:30} -> passes={ok}, template={out}")
PYRepository: snakemake/snakemake-wrappers
Length of output: 606
🏁 Script executed:
cat -n bio/kraken2/classify/wrapper.py | head -80Repository: snakemake/snakemake-wrappers
Length of output: 2886
🏁 Script executed:
find . -path "*kraken2/classify*" -type f | head -20Repository: snakemake/snakemake-wrappers
Length of output: 1203
🏁 Script executed:
cat -n bio/kraken2/classify/test/SnakefileRepository: snakemake/snakemake-wrappers
Length of output: 1459
Paired-output suffix check fails with compressed file extensions.
Lines 47-53 incorrectly reject valid paired outputs like sample_1.fastq.gz because os.path.splitext() only strips the final extension (.gz), leaving the stem as sample_1.fastq, which no longer ends with _1. This breaks paired-end classification with compressed files.
Proposed fix
import os
+import re
from snakemake.shell import shell
@@
def format_out_flag(out_param, flag_name):
@@
if paired:
- stem, ext = os.path.splitext(out_file)
- if not stem.endswith("_1"):
+ if not re.search(r"_1(?=(?:\.[^.]+)+$|$)", out_file):
raise ValueError(
f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension."
)
- kraken_template = stem[:-2] + "#" + ext
+ kraken_template = re.sub(
+ r"_1(?=(?:\.[^.]+)+$|$)", "#", out_file, count=1
+ )
return f"{flag_name} {kraken_template}"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bio/kraken2/classify/wrapper.py` around lines 47 - 53, The paired-end suffix
check fails for compressed files because os.path.splitext only removes the last
extension; change the logic to treat multi-part extensions correctly by using
pathlib.Path.suffixes (or strip known compression suffixes) to build full_ext =
''.join(Path(out_file).suffixes) and core_name =
Path(out_file).name[:-len(full_ext)], then verify core_name.endswith("_1"),
compute core_name[:-2] + "#" + full_ext for kraken_template, and return
f"{flag_name} {kraken_template}" (references: variables out_file, flag_name,
kraken_template).
| "{db}/library/tiny2.fa", | ||
| ], | ||
| output: | ||
| added_dir=directory("{db}/library/added"), |
There was a problem hiding this comment.
Shouldn't this just be called db?
| added_dir=directory("{db}/library/added"), | |
| db=directory("{db}/library/added"), |
And are the new files apended to the input db, or is a new db created with the new files added?
There was a problem hiding this comment.
Is output.db a new DB or are files added to the input DB?
I mean, can db=directory("{db}/other_library")?
|
Nice work! Can you address/check mine and AI comments? |
Co-authored-by: Filipe G. Vieira <1151762+fgvieira@users.noreply.github.com>
Co-authored-by: Filipe G. Vieira <1151762+fgvieira@users.noreply.github.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
bio/kraken2/classify/wrapper.py (1)
47-53:⚠️ Potential issue | 🟠 MajorFix paired-output template handling for compressed FASTQ names.
Line 47-Line 53 reject valid paired outputs such as
sample_1.fastq.gzbecauseos.path.splitext()strips only.gz, so Line 48 fails the_1check.Proposed fix
import os +import re from snakemake.shell import shell @@ if paired: - stem, ext = os.path.splitext(out_file) - if not stem.endswith("_1"): + if not re.search(r"_1(?=(?:\.[^.]+)+$|$)", out_file): raise ValueError( f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension." ) - kraken_template = stem[:-2] + "#" + ext + kraken_template = re.sub( + r"_1(?=(?:\.[^.]+)+$|$)", "#", out_file, count=1 + ) return f"{flag_name} {kraken_template}"#!/bin/bash python - <<'PY' import os samples = ["sample_1.fastq", "sample_1.fastq.gz", "sample_1.fq.gz"] for s in samples: stem, ext = os.path.splitext(s) print(f"{s:20} stem={stem!r} ext={ext!r} endswith__1={stem.endswith('_1')}") PY🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/classify/wrapper.py` around lines 47 - 53, The code incorrectly uses os.path.splitext which only removes one suffix so names like "sample_1.fastq.gz" fail the "_1" check; update the logic in the paired-output handling (the block that reads out_file, stem, ext and builds kraken_template) to respect multi-part extensions by using pathlib.Path(out_file).suffixes to detect compression suffixes (e.g., .gz, .bz2, .xz) and treat the full extension (e.g., ".fastq.gz") as ext and the remaining base name as stem before performing stem.endswith("_1"), then build kraken_template from that base (stem[:-2] + "#" + ext) and return f"{flag_name} {kraken_template}".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 47-53: The code incorrectly uses os.path.splitext which only
removes one suffix so names like "sample_1.fastq.gz" fail the "_1" check; update
the logic in the paired-output handling (the block that reads out_file, stem,
ext and builds kraken_template) to respect multi-part extensions by using
pathlib.Path(out_file).suffixes to detect compression suffixes (e.g., .gz, .bz2,
.xz) and treat the full extension (e.g., ".fastq.gz") as ext and the remaining
base name as stem before performing stem.endswith("_1"), then build
kraken_template from that base (stem[:-2] + "#" + ext) and return f"{flag_name}
{kraken_template}".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 45d668b3-f218-4a20-83f9-a0a7b9411c14
📒 Files selected for processing (3)
bio/kraken2/add_to_library/test/Snakefilebio/kraken2/classify/wrapper.pytest_wrappers.py
🚧 Files skipped from review as they are similar to previous changes (1)
- bio/kraken2/add_to_library/test/Snakefile
There was a problem hiding this comment.
♻️ Duplicate comments (1)
bio/kraken2/classify/wrapper.py (1)
47-53:⚠️ Potential issue | 🟠 MajorHandle multi-part extensions when deriving paired
#output templates.Line 47-53 reject valid paired outputs like
sample_1.fastq.gz.os.path.splitext()strips only the last suffix (.gz), so the stem becomessample_1.fastqand fails the_1check.Proposed fix
import os +import re from snakemake.shell import shell @@ if paired: - stem, ext = os.path.splitext(out_file) - if not stem.endswith("_1"): + if not re.search(r"_1(?=(?:\.[^.]+)+$|$)", out_file): raise ValueError( f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension." ) - kraken_template = stem[:-2] + "#" + ext + kraken_template = re.sub( + r"_1(?=(?:\.[^.]+)+$|$)", "#", out_file, count=1 + ) return f"{flag_name} {kraken_template}"#!/bin/bash python - <<'PY' import os, re samples = ["sample_1.fastq", "sample_1.fastq.gz", "sample_1.fq.gz"] for s in samples: stem, ext = os.path.splitext(s) old_ok = stem.endswith("_1") new_ok = bool(re.search(r"_1(?=(?:\.[^.]+)+$|$)", s)) print(f"{s:24} old_ok={old_ok} new_ok={new_ok}") PY🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/classify/wrapper.py` around lines 47 - 53, The current logic using os.path.splitext() produces incorrect stems for multi-part extensions (e.g., sample_1.fastq.gz) and therefore rejects valid paired filenames; update the code that inspects out_file (the stem/ext extraction and the _1 check) to detect a trailing "_1" that occurs immediately before the full extension sequence (or end of string) instead of just before the last suffix—e.g., use a regex like r"_1(?=(?:\.[^.]+)+$|$)" to find/replace the "_1" in out_file or repeatedly call os.path.splitext to peel all suffixes, then build kraken_template by replacing that matched "_1" with "#" while preserving the entire extension string, and keep using the same variables flag_name and kraken_template and the same return format.
🧹 Nitpick comments (1)
bio/kraken2/build/wrapper.py (1)
13-19: Required outputs are not enforced by the executed commandLines 13–19 require
hash,opts, andtaxo, but Line 24 never uses or validates those paths. If outputs are configured outside the DB directory, Snakemake can wait for filesk2 buildwill never create.🔧 Suggested fix (enforce output contract with current command behavior)
+from pathlib import Path from snakemake.shell import shell db = snakemake.input.get("db") assert db is not None, "Input -> db is a required input parameter" +db_dir = Path(db) extra = snakemake.params.get("extra", "") hash_file = snakemake.output.get("hash") opts_file = snakemake.output.get("opts") taxo_file = snakemake.output.get("taxo") assert hash_file is not None, "Output -> hash is required" assert opts_file is not None, "Output -> opts is required" assert taxo_file is not None, "Output -> taxo is required" +for name, out in {"hash": hash_file, "opts": opts_file, "taxo": taxo_file}.items(): + out_path = Path(out) + assert out_path.parent == db_dir, ( + f"Output -> {name} must be inside db ({db_dir}) because k2 build writes DB files there" + ) log_file = snakemake.log[0] if snakemake.log else None log_flag = f"--log {log_file}" if log_file else ""Also applies to: 24-24
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/build/wrapper.py` around lines 13 - 19, The rule currently requires snakemake.output.get("hash"), "opts", and "taxo" (hash_file, opts_file, taxo_file) but never ensures the executed k2 build command actually writes them, so Snakemake can hang if outputs live outside the DB dir; update the wrapper so that after running the existing k2 build invocation it either (A) passes appropriate CLI flags to k2 build to write those specific paths (so hash_file/opts_file/taxo_file are produced by the command) or (B) explicitly creates/writes those files (e.g., touch/write metadata) and then verifies their existence, raising an error if any of hash_file, opts_file, or taxo_file are missing; make these checks next to the current command invocation rather than only asserting before it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 47-53: The current logic using os.path.splitext() produces
incorrect stems for multi-part extensions (e.g., sample_1.fastq.gz) and
therefore rejects valid paired filenames; update the code that inspects out_file
(the stem/ext extraction and the _1 check) to detect a trailing "_1" that occurs
immediately before the full extension sequence (or end of string) instead of
just before the last suffix—e.g., use a regex like r"_1(?=(?:\.[^.]+)+$|$)" to
find/replace the "_1" in out_file or repeatedly call os.path.splitext to peel
all suffixes, then build kraken_template by replacing that matched "_1" with "#"
while preserving the entire extension string, and keep using the same variables
flag_name and kraken_template and the same return format.
---
Nitpick comments:
In `@bio/kraken2/build/wrapper.py`:
- Around line 13-19: The rule currently requires snakemake.output.get("hash"),
"opts", and "taxo" (hash_file, opts_file, taxo_file) but never ensures the
executed k2 build command actually writes them, so Snakemake can hang if outputs
live outside the DB dir; update the wrapper so that after running the existing
k2 build invocation it either (A) passes appropriate CLI flags to k2 build to
write those specific paths (so hash_file/opts_file/taxo_file are produced by the
command) or (B) explicitly creates/writes those files (e.g., touch/write
metadata) and then verifies their existence, raising an error if any of
hash_file, opts_file, or taxo_file are missing; make these checks next to the
current command invocation rather than only asserting before it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: febccbc6-2bea-46be-8f95-2005e46e82b8
📒 Files selected for processing (3)
bio/kraken2/add_to_library/wrapper.pybio/kraken2/build/wrapper.pybio/kraken2/classify/wrapper.py
|
Thank your for the great suggestions! In CI it reports 120 existing wrapper.py files would be reformatted |
| "{db}/library/tiny2.fa", | ||
| ], | ||
| output: | ||
| db=directory("{db}/library/added"), |
There was a problem hiding this comment.
Is this path determined by kraken or can the user change it?
There was a problem hiding this comment.
Here the path library/added is defined by k2, not user-defined
There was a problem hiding this comment.
@dlaehnemann same question as in comment below.
There was a problem hiding this comment.
I would solve this analogously to the kraken2_build wrapper. So explicitly state here, that the path needs to be exactly this, and that the {db} entry needs to match the db= entry in the input:. And then it's probably best to also assert that in the wrapper itself.
… and located in the same directory as the input database
…re required files are validated and located in the same directory
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
bio/kraken2/classify/wrapper.py (1)
29-30: Unnecessary list conversion fordb.At line 23,
dbis assigned asdb_dirs[0], which is already a string (the directory path). Line 29 wraps it in a list, butdbcan only ever be a single directory string at this point, making theisinstancecheck always true and the list wrapping unnecessary.♻️ Simplify db handling
-db = [db] if isinstance(db, str) else db +# db is already a single directory path stringThen update line 39 to use
dbdirectly:-input_str_db = ",".join(db) +input_str_db = db🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/classify/wrapper.py` around lines 29 - 30, db is already set from db_dirs[0] (a single string), so the conversion db = [db] if isinstance(db, str) else db is unnecessary; remove that list-wrapping and keep db as a string, leave the fq1 conversion as-is, and update any later usage that assumed a list (e.g., where db was indexed or iterated) to use db directly (references: db_dirs, db, fq1 in wrapper.py).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bio/kraken2/add_to_library/wrapper.py`:
- Around line 16-18: The current assignment to log makes it None which
serializes to the string "None" when formatted into the shell command; change
the logic around snakemake.log handling so that the variable used in the command
is an empty string when no log is provided (or else conditionally include the
"--log <path>" token only when snakemake.log is truthy). Specifically, update
the snippet that sets log (referencing snakemake.log and the log variable) to
produce either an empty string or omit the flag entirely when snakemake.log is
empty so that "{log}" in the command does not expand to "None".
In `@bio/kraken2/build/wrapper.py`:
- Around line 13-19: The comparison uses os.path.abspath which can mismatch when
symlinks are involved; change the computation of db_dir and out_dir to use
os.path.realpath instead of os.path.abspath (i.e., compute db_dir =
os.path.realpath(snakemake.input.db) and out_dir =
os.path.dirname(os.path.realpath(snakemake.output[key]))) so both sides are
canonicalized before the assert in the loop that checks equality for "hash",
"opts", "taxo".
In `@bio/kraken2/classify/wrapper.py`:
- Around line 43-48: The block that builds db_files, db_dirs and db_dir
(variables db_files, db_dirs, db_dir) is a duplicate of the earlier validation
and should be removed; delete the entire repeated validation block and any
assignment to db_dir, and ensure the code continues to use the previously
computed db variable from the first validation (lines that set db from
snakemake.input should remain unchanged).
---
Nitpick comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 29-30: db is already set from db_dirs[0] (a single string), so the
conversion db = [db] if isinstance(db, str) else db is unnecessary; remove that
list-wrapping and keep db as a string, leave the fq1 conversion as-is, and
update any later usage that assumed a list (e.g., where db was indexed or
iterated) to use db directly (references: db_dirs, db, fq1 in wrapper.py).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 900d67a6-a2c2-4c2f-acbc-51ec06d49d13
📒 Files selected for processing (19)
bio/kraken2/add_to_library/meta.yamlbio/kraken2/add_to_library/test/Snakefilebio/kraken2/add_to_library/test/test_db/library/added/added.md5bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txtbio/kraken2/add_to_library/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txtbio/kraken2/add_to_library/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fnabio/kraken2/add_to_library/wrapper.pybio/kraken2/build/test/Snakefilebio/kraken2/build/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txtbio/kraken2/build/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txtbio/kraken2/build/test/test_db/prelim_map.txtbio/kraken2/build/wrapper.pybio/kraken2/classify/test/Snakefilebio/kraken2/classify/test/test_db/library/added/added.md5bio/kraken2/classify/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txtbio/kraken2/classify/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txtbio/kraken2/classify/test/test_db/prelim_map.txtbio/kraken2/classify/wrapper.pytest_wrappers.py
✅ Files skipped from review due to trivial changes (7)
- bio/kraken2/add_to_library/test/test_db/library/added/added.md5
- bio/kraken2/add_to_library/test/test_db/library/added/tiny2_92a2a273338dd5873628711dfbc7180a.fna
- bio/kraken2/classify/test/test_db/library/added/added.md5
- bio/kraken2/add_to_library/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt
- bio/kraken2/add_to_library/meta.yaml
- bio/kraken2/build/test/test_db/library/added/prelim_map_92a2a273338dd5873628711dfbc7180a.txt
- bio/kraken2/add_to_library/test/Snakefile
🚧 Files skipped from review as they are similar to previous changes (3)
- bio/kraken2/build/test/test_db/library/added/prelim_map_6861723b4fe267de9fea4bf63e327b14.txt
- bio/kraken2/build/test/Snakefile
- bio/kraken2/classify/test/Snakefile
There was a problem hiding this comment.
♻️ Duplicate comments (1)
bio/kraken2/classify/wrapper.py (1)
55-62:⚠️ Potential issue | 🔴 CriticalPaired-end suffix check fails for compressed files (e.g.,
.fastq.gz).
os.path.splitext()only removes the final extension, sosample_1.fastq.gzyields stemsample_1.fastqwhich doesn't end with_1. This incorrectly rejects valid paired-end compressed outputs.🐛 Proposed fix using pathlib to handle multi-part extensions
if paired: - stem, ext = os.path.splitext(out_file) - if not stem.endswith("_1"): + p = Path(out_file) + suffixes = "".join(p.suffixes) + core = p.name[: -len(suffixes)] if suffixes else p.name + if not core.endswith("_1"): raise ValueError( f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension." ) - kraken_template = stem[:-2] + "#" + ext + kraken_template = str(p.parent / (core[:-2] + "#" + suffixes)) return f"{flag_name} {kraken_template}"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/classify/wrapper.py` around lines 55 - 62, The current paired-end check uses os.path.splitext so compressed filenames like "sample_1.fastq.gz" leave stem "sample_1.fastq" and fail the "_1" check; update the logic around out_file to use pathlib.Path to peel off all compression suffixes (e.g., Path(out_file).suffixes) or iteratively strip known compression extensions ('.gz', '.bz2', '.xz') so you obtain the true base stem and the full extension string (ext = concatenated suffixes); then perform the stem.endswith("_1") check on that base stem and build kraken_template as base_stem[:-2] + "#" + ext, returning f"{flag_name} {kraken_template}" as before (update references to out_file, stem, ext, flag_name, and kraken_template accordingly).
🧹 Nitpick comments (2)
bio/kraken2/classify/wrapper.py (2)
7-10: Unused imports:TemporaryDirectoryandmove_files.These imports are not used in the current implementation. Consider removing them to keep the code clean.
♻️ Suggested cleanup
import os -from pathlib import Path -from tempfile import TemporaryDirectory from snakemake.shell import shell -from snakemake_wrapper_utils.snakemake import move_filesNote: If you apply the fix for the paired-end suffix check, keep
from pathlib import Path.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/classify/wrapper.py` around lines 7 - 10, The file imports TemporaryDirectory and move_files which are not used; remove the unused imports to clean up the module: delete TemporaryDirectory from the import line "from tempfile import TemporaryDirectory" and remove the "from snakemake_wrapper_utils.snakemake import move_files" import; keep "from pathlib import Path" as noted for paired-end suffix checks and ensure no other code references TemporaryDirectory or move_files before committing.
33-34: Redundant list wrapping fordb.
dbis always a single directory string derived fromdb_dirs[0], so wrapping it in a list and then joining with commas (line 43) is unnecessary. This doesn't break anything but adds confusion.♻️ Optional simplification
-db = [db] if isinstance(db, str) else db fq1 = [fq1] if isinstance(fq1, str) else fq1 ... -input_str_db = ",".join(db) +input_str_db = dbAlso applies to: 43-43
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/classify/wrapper.py` around lines 33 - 34, The code redundantly wraps db into a list then later joins it (db = [db] if isinstance(db, str) else db), but db is always a single directory string derived from db_dirs[0]; remove that unnecessary list-wrapping and treat db as a string (keep fq1 wrapping if needed) so downstream joining or formatting uses the original string. Update usages that expect db to be a list (e.g., wherever db is joined with commas) to accept a string or directly use db without list conversion; reference the db variable assignment in the wrapper and the later join/format that combines db values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 55-62: The current paired-end check uses os.path.splitext so
compressed filenames like "sample_1.fastq.gz" leave stem "sample_1.fastq" and
fail the "_1" check; update the logic around out_file to use pathlib.Path to
peel off all compression suffixes (e.g., Path(out_file).suffixes) or iteratively
strip known compression extensions ('.gz', '.bz2', '.xz') so you obtain the true
base stem and the full extension string (ext = concatenated suffixes); then
perform the stem.endswith("_1") check on that base stem and build
kraken_template as base_stem[:-2] + "#" + ext, returning f"{flag_name}
{kraken_template}" as before (update references to out_file, stem, ext,
flag_name, and kraken_template accordingly).
---
Nitpick comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 7-10: The file imports TemporaryDirectory and move_files which are
not used; remove the unused imports to clean up the module: delete
TemporaryDirectory from the import line "from tempfile import
TemporaryDirectory" and remove the "from snakemake_wrapper_utils.snakemake
import move_files" import; keep "from pathlib import Path" as noted for
paired-end suffix checks and ensure no other code references TemporaryDirectory
or move_files before committing.
- Around line 33-34: The code redundantly wraps db into a list then later joins
it (db = [db] if isinstance(db, str) else db), but db is always a single
directory string derived from db_dirs[0]; remove that unnecessary list-wrapping
and treat db as a string (keep fq1 wrapping if needed) so downstream joining or
formatting uses the original string. Update usages that expect db to be a list
(e.g., wherever db is joined with commas) to accept a string or directly use db
without list conversion; reference the db variable assignment in the wrapper and
the later join/format that combines db values.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 19975bc5-9f7b-48a9-9199-b51dc6911a04
📒 Files selected for processing (3)
bio/kraken2/add_to_library/wrapper.pybio/kraken2/build/wrapper.pybio/kraken2/classify/wrapper.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
bio/kraken2/classify/wrapper.py (1)
55-60:⚠️ Potential issue | 🟠 MajorHandle multi-part extensions when validating paired
*_1outputs.Line 55 uses
os.path.splitext(), which only strips the last suffix. For names likesample.classified_1.fastq.gz, Line 56 evaluatessample.classified_1.fastq.endswith("_1")and incorrectly rejects valid paired output patterns.Proposed fix
- stem, ext = os.path.splitext(out_file) - if not stem.endswith("_1"): + p = Path(out_file) + full_ext = "".join(p.suffixes) + core_name = p.name[: -len(full_ext)] if full_ext else p.name + if not core_name.endswith("_1"): raise ValueError( f"For paired-end data, '{flag_name}' first file must end with '_1' before the extension." ) - kraken_template = stem[:-2] + "#" + ext - return f"{flag_name} {kraken_template}" + kraken_template = str( + p.with_name(f"{core_name[:-2]}#{full_ext}") + ) + return f"{flag_name} {kraken_template}"#!/bin/bash python - <<'PY' import os samples = [ "sample_1.fastq", "sample_1.fastq.gz", "sample_1.fq.gz", ] for s in samples: stem, ext = os.path.splitext(s) print(f"{s:20} splitext -> stem='{stem}', ext='{ext}', endswith('_1')={stem.endswith('_1')}") PY🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bio/kraken2/classify/wrapper.py` around lines 55 - 60, The current logic uses os.path.splitext on out_file (variables stem, ext) which only removes the final suffix and mis-detects names like "sample.classified_1.fastq.gz"; update the logic to compute the full extension using pathlib.Path(out_file).suffixes (or repeatedly rsplit to join all suffixes into ext_str), derive stem_full by stripping that full extension from out_file, then validate stem_full.endswith("_1") (use flag_name in the error message), and construct kraken_template as stem_full[:-2] + "#" + "".join(suffixes) (or ext_str) so multi-part extensions like ".fastq.gz" are handled correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bio/kraken2/add_to_library/wrapper.py`:
- Around line 11-12: The code currently builds fasta_str by joining multiple
FASTA paths (fasta = [fasta] if isinstance(fasta, str) else list(fasta);
fasta_str = " ".join(fasta)) and then passes that into the kraken2-build call as
--file {fasta_str}; instead, change the logic to iterate over the fasta list and
call the kraken2-build invocation once per file: keep the conversion to list
(fasta = ...), then for each entry in fasta call the existing command
construction/invocation (the place that uses --file {fasta_str}) with the single
filename (e.g., --file {f}) and run the subprocess/runner for each file
sequentially, preserving existing options and error handling.
---
Duplicate comments:
In `@bio/kraken2/classify/wrapper.py`:
- Around line 55-60: The current logic uses os.path.splitext on out_file
(variables stem, ext) which only removes the final suffix and mis-detects names
like "sample.classified_1.fastq.gz"; update the logic to compute the full
extension using pathlib.Path(out_file).suffixes (or repeatedly rsplit to join
all suffixes into ext_str), derive stem_full by stripping that full extension
from out_file, then validate stem_full.endswith("_1") (use flag_name in the
error message), and construct kraken_template as stem_full[:-2] + "#" +
"".join(suffixes) (or ext_str) so multi-part extensions like ".fastq.gz" are
handled correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4b9aeecd-b8f6-484a-ae8c-3f0f26daa536
📒 Files selected for processing (2)
bio/kraken2/add_to_library/wrapper.pybio/kraken2/classify/wrapper.py
| fasta = [fasta] if isinstance(fasta, str) else list(fasta) | ||
| fasta_str = " ".join(fasta) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify k2 add-to-library --file accepts multiple files or requires separate calls.
# Check k2 help output for add-to-library usage
k2 add-to-library --help 2>&1 | grep -A 10 -i "file"Repository: snakemake/snakemake-wrappers
Length of output: 54
🏁 Script executed:
# First, let's explore the wrapper file structure and context
cat -n bio/kraken2/add_to_library/wrapper.pyRepository: snakemake/snakemake-wrappers
Length of output: 955
🏁 Script executed:
# Check if there are test files for this wrapper
fd -t f "test.*kraken2.*add_to_library|add_to_library.*test" .Repository: snakemake/snakemake-wrappers
Length of output: 54
🏁 Script executed:
# Look for similar wrappers handling multiple input files
rg -l "isinstance.*str.*else.*list" --type py | head -10Repository: snakemake/snakemake-wrappers
Length of output: 141
🌐 Web query:
k2 add-to-library CLI documentation multiple files
💡 Result:
The k2 add-to-library command is part of Kraken2 (k2 is shorthand for Kraken2), a taxonomic sequence classification tool. It is used via kraken2-build --add-to-library FILE --db DB_NAME to add a single FASTA/FASTQ file to the library of a Kraken2 database before building it. The command does not natively support multiple files in a single invocation; FILE specifies one file. To add multiple files: 1. Loop over files sequentially: for file in .fna; do kraken2-build --add-to-library "$file" --db $DB_NAME done 2. For parallel addition (faster for many files, post-2018 fix for race conditions): find . -name ".fna" -print0 | xargs -0 -I{} -P8 kraken2-build --add-to-library {} --db $DB_NAME (Adjust -P for number of threads/CPUs.) Requirements: - FASTA headers must include |kraken:taxid|XXXXX (taxid from NCBI taxonomy). - Run --download-taxonomy --db $DB_NAME first. - After adding files, run kraken2-build --build --db $DB_NAME --threads N. - Cannot add to a fully built database; must rebuild. Official documentation: Kraken2 GitHub wiki (manual) and manpages confirm single-file usage with loop/xargs for multiples. Examples from Biostars, GitHub issues, and tutorials.
Citations:
- 1: https://manpages.ubuntu.com/manpages/lunar/en/man1/kraken2-build.1.html
- 2: https://manpages.debian.org/testing/kraken2/kraken2-build.1.en.html
- 3: https://www.biostars.org/p/9500776/
- 4: Problem with 'kraken2-build --add-to-library' in parallel DerrickWood/kraken2#28
- 5: https://www.biostars.org/p/9564039/
- 6: http://ccb.jhu.edu/software/kraken2/index.shtml
- 7: http://fanhuan.github.io/en/2022/07/25/build-kraken2-database/
Loop over FASTA files: kraken2-build --add-to-library accepts only a single file per call.
The current implementation joins multiple files with spaces (line 12) and passes them as --file {fasta_str} (line 25). According to Kraken2 documentation, the --add-to-library command accepts only one file per invocation. Multiple files must be added by calling the command once per file.
The wrapper must loop over the fasta list and invoke the command for each file:
🔧 Proposed fix
fasta = [fasta] if isinstance(fasta, str) else list(fasta)
-fasta_str = " ".join(fasta)
extra = snakemake.params.get("extra", "")
log = snakemake.log[0] if snakemake.log else None
if log:
log = f"--log {log}"
else:
log = ""
-shell(
- "k2 add-to-library"
- " --db {snakemake.input.db}"
- " --file {fasta_str}"
- " --threads {snakemake.threads}"
- " {extra}"
- " {log}"
-)
+for fasta_file in fasta:
+ shell(
+ "k2 add-to-library"
+ " --db {snakemake.input.db}"
+ " --file {fasta_file}"
+ " --threads {snakemake.threads}"
+ " {extra}"
+ " {log}"
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bio/kraken2/add_to_library/wrapper.py` around lines 11 - 12, The code
currently builds fasta_str by joining multiple FASTA paths (fasta = [fasta] if
isinstance(fasta, str) else list(fasta); fasta_str = " ".join(fasta)) and then
passes that into the kraken2-build call as --file {fasta_str}; instead, change
the logic to iterate over the fasta list and call the kraken2-build invocation
once per file: keep the conversion to list (fasta = ...), then for each entry in
fasta call the existing command construction/invocation (the place that uses
--file {fasta_str}) with the single filename (e.g., --file {f}) and run the
subprocess/runner for each file sequentially, preserving existing options and
error handling.
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
There was a problem hiding this comment.
@warrok97ao it seems this is a real issue, no?
| fasta = [fasta] if isinstance(fasta, str) else list(fasta) | ||
| fasta_str = " ".join(fasta) |
There was a problem hiding this comment.
@warrok97ao it seems this is a real issue, no?
| f"Output -> {key} must be in the same directory as the input database" | ||
| f"({db_dir}), but it is in {out_dir}" | ||
| ) | ||
| log = snakemake.log[0] if snakemake.log else "" |
There was a problem hiding this comment.
| log = snakemake.log[0] if snakemake.log else "" | |
| log = snakemake.get("log", "") |
| log = snakemake.log[0] if snakemake.log else None | ||
| if log: | ||
| log = f"--log {log}" | ||
| else: | ||
| log = "" | ||
|
|
There was a problem hiding this comment.
| log = snakemake.log[0] if snakemake.log else None | |
| if log: | |
| log = f"--log {log}" | |
| else: | |
| log = "" | |
| log = snakemake.get("log", "") | |
| if log: | |
| log = f"--log {log}" | |
| db_dir = os.path.realpath(snakemake.input.db) | ||
| for key in ("hash", "opts", "taxo"): | ||
| out_dir = os.path.dirname(os.path.realpath(snakemake.output[key])) | ||
| assert out_dir == db_dir, ( | ||
| f"Output -> {key} must be in the same directory as the input database" | ||
| f"({db_dir}), but it is in {out_dir}" | ||
| ) |
There was a problem hiding this comment.
| db_dir = os.path.realpath(snakemake.input.db) | |
| for key in ("hash", "opts", "taxo"): | |
| out_dir = os.path.dirname(os.path.realpath(snakemake.output[key])) | |
| assert out_dir == db_dir, ( | |
| f"Output -> {key} must be in the same directory as the input database" | |
| f"({db_dir}), but it is in {out_dir}" | |
| ) | |
| db_dir = os.path.commonprefix(snakemake.input.db) | |
There was a problem hiding this comment.
Can you add a test where you classify against two DBs?
| db_files = [hash_file, opts_file, taxo_file] | ||
| db_dirs = [os.path.dirname(os.path.abspath(f)) for f in db_files] | ||
| assert all( | ||
| d == db_dirs[0] for d in db_dirs | ||
| ), f"All db files (hash, opts, taxo) must be in the same directory, got: {db_dirs}" | ||
| db = db_dirs[0] |
There was a problem hiding this comment.
How would this work with more than one DB?
Adds three new wrappers for Kraken2:
bio/kraken2/classify: classify metagenomic reads (SE and PE) against a prebuilt databasebio/kraken2/build: build a Kraken2 database from a sequence library and taxonomybio/kraken2/add-to-library: add FASTA sequences to an existing Kraken2 database libraryAll wrappers use the new
k2CLI (kraken2=2.17.1) and includewrapper.py,environment.yaml,meta.yaml, and test Snakefiles.QC
snakemake-wrappers.While the contributions guidelines are more extensive, please particularly ensure that:
test.pywas updated to call any added or updated example rules in aSnakefileinput:andoutput:file paths in the rules can be chosen arbitrarilyinput:oroutput:)tempfile.gettempdir()points tometa.yamlcontains a link to the documentation of the respective tool or command underurl:Summary by CodeRabbit
New Features
Tests
Documentation