PRecise Identification of Species of the Microbiome -- Python/Snakemake Implementation
Figure from [Ghaddar et al., bioRxiv 2025](https://github.com/sjdlabgroup/PRISM)A Snakemake pipeline that identifies truly present microbial taxa in genomic sequencing data and eliminates false positives and contaminants. PRISM uses multi-stage host depletion, competitive full-length BLAST alignment, and a pre-trained XGBoost model to score microbes vs. contaminants. Compatible with RNA-seq, WGS, 16S-seq, and scRNA-seq.
This is a Python/Snakemake reimplementation of the original R-based PRISM pipeline by Ghaddar B, Blaser M, De S (Rutgers University). The algorithm and pre-trained model are identical; only the implementation language and workflow engine differ.
| Step | Tool | Purpose |
|---|---|---|
| 1 | Kraken2 | k-mer classification, extract putative microbial reads |
| 2 | Minimap2 | Host depletion (keep unmapped reads) |
| 3 | STAR | Second host depletion (splice-aware, for RNA-seq) |
| 4 | Subsample | Representative sampling by species taxid |
| 4a | BLAST+ | Conditionally build custom subsetted BLAST database |
| 5 | BLAST+ | Competitive alignment of subsampled reads |
| 6 | -- | Identify uniquely identifiable taxa (UIT) |
| 7 | BLAST+ | Full dataset alignment against UIT + human + model orgs |
| 8 | -- | Filter non-microbial, low-quality, host reads |
| 9 | -- | Annotate reads with GenBank gene/product |
| 10 | XGBoost | Feature engineering (40 features) + contaminant scoring |
| 11 | -- | Write results CSV, counts CSV, annotated FASTA |
| Tool | Version | Link |
|---|---|---|
| Kraken2 | >= 2.1 | https://github.com/DerrickWood/kraken2 |
| Minimap2 | >= 2.24 | https://github.com/lh3/minimap2 |
| STAR | >= 2.7 | https://github.com/alexdobin/STAR |
| BLAST+ | >= 2.14 | https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ |
| SeqKit | >= 2.0 | https://bioinf.shenwei.me/seqkit/ |
| KrakenTools | any | https://github.com/jenniferlu717/KrakenTools (kreport2mpa.py) |
conda env create -f environment.yaml
conda activate prismOr install manually:
python >= 3.9
snakemake >= 7.0
pandas
pyarrow
xgboost
biopython
Install all external tools listed above and ensure they are on your $PATH. Using conda/mamba is recommended:
mamba install -c bioconda kraken2 minimap2 star blast seqkit krakentoolsBuild merged human reference genome indices for host depletion:
bash scripts/prepare_reference_indices.sh [model_org_taxids.txt] [threads]This downloads hg38 + T2T-CHM13v2 + model organism genomes and builds Minimap2 (.mmi) and STAR indices under dataset/indices/.
Kraken2 database:
kraken2-build --standard --db /path/to/kraken2_dbOr download a pre-built database from https://benlangmead.github.io/aws-indexes/k2.
BLAST database:
Download or build an NCBI nt (or core_nt) database:
# Example: download core_nt
update_blastdb.pl --decompress core_ntAccession-to-taxid map (required for custom BLAST DB feature):
blastdbcmd -db /path/to/core_nt -entry all -outfmt "%T %a" > taxid_accession_map.tsv
sort -k1,1 taxid_accession_map.tsv > sorted_accession_map.txtThis produces a ~2.5 GB file. Re-run only if the BLAST database is updated.
Download the pre-processed GenBank annotation files from Zenodo and extract into the project root:
# Download (~6.5 GB)
wget https://zenodo.org/records/19125440/files/genbank_parquet.tar
# Extract into project root
tar -xf genbank_parquet.tar -C /path/to/PRISM_python/This creates the following structure:
genbank/
genbank_index.parquet
parquet/
*.parquet # sharded by GenBank division
Copy the example config and edit paths:
cp config/config.yaml.example config/config.yamlEdit config/config.yaml:
samples:
YOUR_SAMPLE:
fq1: "/path/to/YOUR_SAMPLE_R1.fq.gz"
fq2: "/path/to/YOUR_SAMPLE_R2.fq.gz"
output_dir: "results"
kraken_db: "/path/to/kraken2/database"
minimap2_index: "/path/to/human_hg38_t2t_merged.mmi"
star_genome_dir: "/path/to/STAR/human_hg38_t2t_merged"
blast_db: "/path/to/blastdb/core_nt"# Full run
snakemake --configfile config/config.yaml -j 16
# Dry run (preview what will execute)
snakemake -n --configfile config/config.yaml
# Generate DAG visualization
snakemake --configfile config/config.yaml --rulegraph | dot -Tpdf > rulegraph.pdfAdd multiple entries under samples: in the config:
samples:
SRR12345678:
fq1: "/data/SRR12345678_R1.fq.gz"
fq2: "/data/SRR12345678_R2.fq.gz"
SRR87654321:
fq1: "/data/SRR87654321_R1.fq.gz"
fq2: "/data/SRR87654321_R2.fq.gz"Snakemake will process all samples, parallelizing where possible.
# SLURM example
snakemake --configfile config/config.yaml \
--jobs 100 --cluster "sbatch -p compute -n {threads} --mem=32G"| Parameter | Default | Description |
|---|---|---|
paired |
true |
Paired-end sequencing mode |
threads |
28 |
Threads for external tools |
blast_threads |
28 |
BLAST threads (1 = deterministic results) |
analyze_mates |
"both" |
Which mates to BLAST in Step 7: "R1" or "both" |
min_read_per |
10000 |
Minimum reads ratio threshold for species filtering |
min_uniq_frac |
5 |
Minimum unique k-mer fraction |
max_sample |
100 |
Maximum reads per taxon in subsample |
min_qcovs |
80 |
Minimum BLAST query coverage percent |
use_custom_db |
true |
Build custom subsetted BLAST DB when FASTA > threshold |
fasta_size_threshold_mb |
5 |
FASTA size (MB) threshold to trigger custom DB build |
For each sample X, PRISM creates {output_dir}/X_prism/:
X_prism/
X-results.csv # Per-read classification with PRISM scores
X-counts.csv # Per-species summary
X_1.fa # Annotated FASTA of retained microbial reads
data/ # Intermediate files
Each row is a sequencing read that passed all filters.
| Column | Description |
|---|---|
id |
Sequence identifier |
read |
Read number (1 or 2) |
staxids |
NCBI taxon ID |
tax_name |
Scientific name |
rank |
Phylogenetic rank (k, p, c, o, f, g, s) |
sacc |
BLAST subject accession |
pos |
Subject start position |
qcovs |
Query coverage percent (0-100) |
pident |
Percent identity |
bitscore |
BLAST bit score |
gene |
GenBank gene annotation |
product |
GenBank product annotation |
pred |
PRISM score (0 = likely contaminant, 1 = likely present) |
Per-species summary:
| Column | Description |
|---|---|
tax_name |
Scientific name |
staxids |
NCBI taxon ID |
n |
Number of reads assigned |
pred |
Mean PRISM score for the taxon |
tax_name,staxids,n,pred
Enterococcus faecium,1352,5521,0.973
Escherichia coli,562,1531,0.105
Bacteroides fragilis,817,228,0.617
Staphylococcus aureus,1280,262,0.005
Species with high pred scores (e.g., > 0.5) are predicted as truly present; low scores indicate likely contaminants.
PRISM_python/
├── Snakefile # Top-level DAG
├── config/
│ └── config.yaml.example # Configuration template
├── workflow/
│ ├── rules/ # 12 Snakemake rule files (step01-step11)
│ └── scripts/ # 10 Python scripts
├── scripts/
│ └── prepare_reference_indices.sh
├── resources/
│ ├── model_org_taxids.txt # Model organism taxid list
│ ├── prismxg_exported.bin # Pre-trained XGBoost model
│ └── prismxg_feature_names.txt # Feature names for the model
├── environment.yaml
├── LICENSE
├── NOTICE.md
└── README.md
This Python/Snakemake reimplementation has only been tested on one colorectal cancer WGS sample (SRR34072063 without custom DB, SRR34072064 with custom DB). On these samples, the output matches the original R pipeline (pred correlation r=0.986+ at read level, r=0.991+ at species level; differences may be attributable to BLAST non-determinism in multi-threaded mode).
The purpose of this reimplementation is to understand and improve the PRISM workflow using a more maintainable Python/Snakemake framework. While every effort has been made to ensure behavioral consistency with the original R version, we cannot guarantee the analysis conclusions presented in the original PRISM paper. Users should independently validate results for their specific data types and research questions.
- BLAST determinism: Multi-threaded BLAST with
-max_target_seqsproduces non-deterministic results. Setblast_threads: 1for reproducible output (at the cost of speed). - Custom BLAST DB: When
_1.faexceeds the size threshold (default 5 MB), a subsetted BLAST database is built containing only human + model organism + target species sequences. This significantly speeds up the full BLAST step for large samples. - Disk usage: The pipeline produces intermediate files. For a typical WGS sample, expect 5-50 GB per sample in the
data/directory.
This project contains components under two separate licenses:
- PRISM algorithm, pre-trained model, and feature definitions: PRISM Non-commercial Research License (RU-NCRL) from Rutgers, The State University of New Jersey. For commercial use, contact Rutgers at innovate@research.rutgers.edu (Docket #2025-029).
- Python/Snakemake implementation code (
workflow/,Snakefile,config/,scripts/,environment.yaml): MIT License.
Because the implementation code depends on the PRISM algorithm and pre-trained model, any use of the complete pipeline is subject to the RU-NCRL non-commercial restriction. See LICENSE for full details and NOTICE.md for third-party attributions.
If you use PRISM in your research, please cite the original paper:
Ghaddar B, Blaser M, De S. Revisiting the cancer microbiome using PRISM. bioRxiv 2025. https://www.biorxiv.org/content/10.1101/2025.01.21.634087v1
- Original PRISM algorithm and R implementation: Ghaddar B, Blaser M, De S (Rutgers University)
kreport2mpa.py: KrakenTools by Jennifer Lu (GPL v3, not bundled -- install separately)
