PRISM_python

PRecise Identification of Species of the Microbiome -- Python/Snakemake Implementation

_{Figure from [Ghaddar et al., bioRxiv 2025](https://github.com/sjdlabgroup/PRISM)}

A Snakemake pipeline that identifies truly present microbial taxa in genomic sequencing data and eliminates false positives and contaminants. PRISM uses multi-stage host depletion, competitive full-length BLAST alignment, and a pre-trained XGBoost model to score microbes vs. contaminants. Compatible with RNA-seq, WGS, 16S-seq, and scRNA-seq.

This is a Python/Snakemake reimplementation of the original R-based PRISM pipeline by Ghaddar B, Blaser M, De S (Rutgers University). The algorithm and pre-trained model are identical; only the implementation language and workflow engine differ.

Pipeline Overview

Step	Tool	Purpose
1	Kraken2	k-mer classification, extract putative microbial reads
2	Minimap2	Host depletion (keep unmapped reads)
3	STAR	Second host depletion (splice-aware, for RNA-seq)
4	Subsample	Representative sampling by species taxid
4a	BLAST+	Conditionally build custom subsetted BLAST database
5	BLAST+	Competitive alignment of subsampled reads
6	--	Identify uniquely identifiable taxa (UIT)
7	BLAST+	Full dataset alignment against UIT + human + model orgs
8	--	Filter non-microbial, low-quality, host reads
9	--	Annotate reads with GenBank gene/product
10	XGBoost	Feature engineering (40 features) + contaminant scoring
11	--	Write results CSV, counts CSV, annotated FASTA

Requirements

External Tools

Tool	Version	Link
Kraken2	>= 2.1	https://github.com/DerrickWood/kraken2
Minimap2	>= 2.24	https://github.com/lh3/minimap2
STAR	>= 2.7	https://github.com/alexdobin/STAR
BLAST+	>= 2.14	https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
SeqKit	>= 2.0	https://bioinf.shenwei.me/seqkit/
KrakenTools	any	https://github.com/jenniferlu717/KrakenTools (`kreport2mpa.py`)

Python Environment

conda env create -f environment.yaml
conda activate prism

Or install manually:

python >= 3.9
snakemake >= 7.0
pandas
pyarrow
xgboost
biopython

Setup

1. Install External Tools

Install all external tools listed above and ensure they are on your $PATH. Using conda/mamba is recommended:

mamba install -c bioconda kraken2 minimap2 star blast seqkit krakentools

2. Prepare Reference Indices (one-time)

Build merged human reference genome indices for host depletion:

bash scripts/prepare_reference_indices.sh [model_org_taxids.txt] [threads]

This downloads hg38 + T2T-CHM13v2 + model organism genomes and builds Minimap2 (.mmi) and STAR indices under dataset/indices/.

3. Prepare Databases (one-time)

Kraken2 database:

kraken2-build --standard --db /path/to/kraken2_db

Or download a pre-built database from https://benlangmead.github.io/aws-indexes/k2.

BLAST database:

Download or build an NCBI nt (or core_nt) database:

# Example: download core_nt
update_blastdb.pl --decompress core_nt

Accession-to-taxid map (required for custom BLAST DB feature):

blastdbcmd -db /path/to/core_nt -entry all -outfmt "%T %a" > taxid_accession_map.tsv
sort -k1,1 taxid_accession_map.tsv > sorted_accession_map.txt

This produces a ~2.5 GB file. Re-run only if the BLAST database is updated.

4. Prepare GenBank Annotations

Download the pre-processed GenBank annotation files from Zenodo and extract into the project root:

# Download (~6.5 GB)
wget https://zenodo.org/records/19125440/files/genbank_parquet.tar

# Extract into project root
tar -xf genbank_parquet.tar -C /path/to/PRISM_python/

This creates the following structure:

genbank/
  genbank_index.parquet
  parquet/
    *.parquet          # sharded by GenBank division

5. Configure

Copy the example config and edit paths:

cp config/config.yaml.example config/config.yaml

Edit config/config.yaml:

samples:
  YOUR_SAMPLE:
    fq1: "/path/to/YOUR_SAMPLE_R1.fq.gz"
    fq2: "/path/to/YOUR_SAMPLE_R2.fq.gz"

output_dir: "results"

kraken_db: "/path/to/kraken2/database"
minimap2_index: "/path/to/human_hg38_t2t_merged.mmi"
star_genome_dir: "/path/to/STAR/human_hg38_t2t_merged"
blast_db: "/path/to/blastdb/core_nt"

Usage

# Full run
snakemake --configfile config/config.yaml -j 16

# Dry run (preview what will execute)
snakemake -n --configfile config/config.yaml

# Generate DAG visualization
snakemake --configfile config/config.yaml --rulegraph | dot -Tpdf > rulegraph.pdf

Multiple Samples

Add multiple entries under samples: in the config:

samples:
  SRR12345678:
    fq1: "/data/SRR12345678_R1.fq.gz"
    fq2: "/data/SRR12345678_R2.fq.gz"
  SRR87654321:
    fq1: "/data/SRR87654321_R1.fq.gz"
    fq2: "/data/SRR87654321_R2.fq.gz"

Snakemake will process all samples, parallelizing where possible.

HPC / Cluster

# SLURM example
snakemake --configfile config/config.yaml \
  --jobs 100 --cluster "sbatch -p compute -n {threads} --mem=32G"

Configuration Reference

Parameter	Default	Description
`paired`	`true`	Paired-end sequencing mode
`threads`	`28`	Threads for external tools
`blast_threads`	`28`	BLAST threads (1 = deterministic results)
`analyze_mates`	`"both"`	Which mates to BLAST in Step 7: `"R1"` or `"both"`
`min_read_per`	`10000`	Minimum reads ratio threshold for species filtering
`min_uniq_frac`	`5`	Minimum unique k-mer fraction
`max_sample`	`100`	Maximum reads per taxon in subsample
`min_qcovs`	`80`	Minimum BLAST query coverage percent
`use_custom_db`	`true`	Build custom subsetted BLAST DB when FASTA > threshold
`fasta_size_threshold_mb`	`5`	FASTA size (MB) threshold to trigger custom DB build

Output

For each sample X, PRISM creates {output_dir}/X_prism/:

X_prism/
  X-results.csv          # Per-read classification with PRISM scores
  X-counts.csv           # Per-species summary
  X_1.fa                 # Annotated FASTA of retained microbial reads
  data/                  # Intermediate files

`X-results.csv`

Each row is a sequencing read that passed all filters.

Column	Description
`id`	Sequence identifier
`read`	Read number (1 or 2)
`staxids`	NCBI taxon ID
`tax_name`	Scientific name
`rank`	Phylogenetic rank (k, p, c, o, f, g, s)
`sacc`	BLAST subject accession
`pos`	Subject start position
`qcovs`	Query coverage percent (0-100)
`pident`	Percent identity
`bitscore`	BLAST bit score
`gene`	GenBank gene annotation
`product`	GenBank product annotation
`pred`	PRISM score (0 = likely contaminant, 1 = likely present)

`X-counts.csv`

Per-species summary:

Column	Description
`tax_name`	Scientific name
`staxids`	NCBI taxon ID
`n`	Number of reads assigned
`pred`	Mean PRISM score for the taxon

Example Output

tax_name,staxids,n,pred
Enterococcus faecium,1352,5521,0.973
Escherichia coli,562,1531,0.105
Bacteroides fragilis,817,228,0.617
Staphylococcus aureus,1280,262,0.005

Species with high pred scores (e.g., > 0.5) are predicted as truly present; low scores indicate likely contaminants.

Project Structure

PRISM_python/
├── Snakefile                       # Top-level DAG
├── config/
│   └── config.yaml.example         # Configuration template
├── workflow/
│   ├── rules/                      # 12 Snakemake rule files (step01-step11)
│   └── scripts/                    # 10 Python scripts
├── scripts/
│   └── prepare_reference_indices.sh
├── resources/
│   ├── model_org_taxids.txt        # Model organism taxid list
│   ├── prismxg_exported.bin        # Pre-trained XGBoost model
│   └── prismxg_feature_names.txt   # Feature names for the model
├── environment.yaml
├── LICENSE
├── NOTICE.md
└── README.md

Disclaimer

This Python/Snakemake reimplementation has only been tested on one colorectal cancer WGS sample (SRR34072063 without custom DB, SRR34072064 with custom DB). On these samples, the output matches the original R pipeline (pred correlation r=0.986+ at read level, r=0.991+ at species level; differences may be attributable to BLAST non-determinism in multi-threaded mode).

The purpose of this reimplementation is to understand and improve the PRISM workflow using a more maintainable Python/Snakemake framework. While every effort has been made to ensure behavioral consistency with the original R version, we cannot guarantee the analysis conclusions presented in the original PRISM paper. Users should independently validate results for their specific data types and research questions.

Notes

BLAST determinism: Multi-threaded BLAST with -max_target_seqs produces non-deterministic results. Set blast_threads: 1 for reproducible output (at the cost of speed).
Custom BLAST DB: When _1.fa exceeds the size threshold (default 5 MB), a subsetted BLAST database is built containing only human + model organism + target species sequences. This significantly speeds up the full BLAST step for large samples.
Disk usage: The pipeline produces intermediate files. For a typical WGS sample, expect 5-50 GB per sample in the data/ directory.

License

This project contains components under two separate licenses:

PRISM algorithm, pre-trained model, and feature definitions: PRISM Non-commercial Research License (RU-NCRL) from Rutgers, The State University of New Jersey. For commercial use, contact Rutgers at innovate@research.rutgers.edu (Docket #2025-029).
Python/Snakemake implementation code (workflow/, Snakefile, config/, scripts/, environment.yaml): MIT License.

Because the implementation code depends on the PRISM algorithm and pre-trained model, any use of the complete pipeline is subject to the RU-NCRL non-commercial restriction. See LICENSE for full details and NOTICE.md for third-party attributions.

Citation

If you use PRISM in your research, please cite the original paper:

Ghaddar B, Blaser M, De S. Revisiting the cancer microbiome using PRISM. bioRxiv 2025. https://www.biorxiv.org/content/10.1101/2025.01.21.634087v1

Acknowledgments

Original PRISM algorithm and R implementation: Ghaddar B, Blaser M, De S (Rutgers University)
kreport2mpa.py: KrakenTools by Jennifer Lu (GPL v3, not bundled -- install separately)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PRISM_python

Pipeline Overview

Requirements

External Tools

Python Environment

Setup

1. Install External Tools

2. Prepare Reference Indices (one-time)

3. Prepare Databases (one-time)

4. Prepare GenBank Annotations

5. Configure

Usage

Multiple Samples

HPC / Cluster

Configuration Reference

Output

`X-results.csv`

`X-counts.csv`

Example Output

Project Structure

Disclaimer

Notes

License

Citation

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
config		config
resources		resources
scripts		scripts
workflow		workflow
.gitignore		.gitignore
Figure 1.png		Figure 1.png
LICENSE		LICENSE
NOTICE.md		NOTICE.md
README.md		README.md
Snakefile		Snakefile
environment.yaml		environment.yaml

Folders and files

Latest commit

History

Repository files navigation

PRISM_python

Pipeline Overview

Requirements

External Tools

Python Environment

Setup

1. Install External Tools

2. Prepare Reference Indices (one-time)

3. Prepare Databases (one-time)

4. Prepare GenBank Annotations

5. Configure

Usage

Multiple Samples

HPC / Cluster

Configuration Reference

Output

X-results.csv

X-counts.csv

Example Output

Project Structure

Disclaimer

Notes

License

Citation

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`X-results.csv`

`X-counts.csv`

Packages