Skip to content

inspirewind/PRISM_python

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PRISM_python

PRecise Identification of Species of the Microbiome -- Python/Snakemake Implementation

Figure from [Ghaddar et al., bioRxiv 2025](https://github.com/sjdlabgroup/PRISM)

A Snakemake pipeline that identifies truly present microbial taxa in genomic sequencing data and eliminates false positives and contaminants. PRISM uses multi-stage host depletion, competitive full-length BLAST alignment, and a pre-trained XGBoost model to score microbes vs. contaminants. Compatible with RNA-seq, WGS, 16S-seq, and scRNA-seq.

This is a Python/Snakemake reimplementation of the original R-based PRISM pipeline by Ghaddar B, Blaser M, De S (Rutgers University). The algorithm and pre-trained model are identical; only the implementation language and workflow engine differ.

Pipeline Overview

Step Tool Purpose
1 Kraken2 k-mer classification, extract putative microbial reads
2 Minimap2 Host depletion (keep unmapped reads)
3 STAR Second host depletion (splice-aware, for RNA-seq)
4 Subsample Representative sampling by species taxid
4a BLAST+ Conditionally build custom subsetted BLAST database
5 BLAST+ Competitive alignment of subsampled reads
6 -- Identify uniquely identifiable taxa (UIT)
7 BLAST+ Full dataset alignment against UIT + human + model orgs
8 -- Filter non-microbial, low-quality, host reads
9 -- Annotate reads with GenBank gene/product
10 XGBoost Feature engineering (40 features) + contaminant scoring
11 -- Write results CSV, counts CSV, annotated FASTA

Requirements

External Tools

Tool Version Link
Kraken2 >= 2.1 https://github.com/DerrickWood/kraken2
Minimap2 >= 2.24 https://github.com/lh3/minimap2
STAR >= 2.7 https://github.com/alexdobin/STAR
BLAST+ >= 2.14 https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
SeqKit >= 2.0 https://bioinf.shenwei.me/seqkit/
KrakenTools any https://github.com/jenniferlu717/KrakenTools (kreport2mpa.py)

Python Environment

conda env create -f environment.yaml
conda activate prism

Or install manually:

python >= 3.9
snakemake >= 7.0
pandas
pyarrow
xgboost
biopython

Setup

1. Install External Tools

Install all external tools listed above and ensure they are on your $PATH. Using conda/mamba is recommended:

mamba install -c bioconda kraken2 minimap2 star blast seqkit krakentools

2. Prepare Reference Indices (one-time)

Build merged human reference genome indices for host depletion:

bash scripts/prepare_reference_indices.sh [model_org_taxids.txt] [threads]

This downloads hg38 + T2T-CHM13v2 + model organism genomes and builds Minimap2 (.mmi) and STAR indices under dataset/indices/.

3. Prepare Databases (one-time)

Kraken2 database:

kraken2-build --standard --db /path/to/kraken2_db

Or download a pre-built database from https://benlangmead.github.io/aws-indexes/k2.

BLAST database:

Download or build an NCBI nt (or core_nt) database:

# Example: download core_nt
update_blastdb.pl --decompress core_nt

Accession-to-taxid map (required for custom BLAST DB feature):

blastdbcmd -db /path/to/core_nt -entry all -outfmt "%T %a" > taxid_accession_map.tsv
sort -k1,1 taxid_accession_map.tsv > sorted_accession_map.txt

This produces a ~2.5 GB file. Re-run only if the BLAST database is updated.

4. Prepare GenBank Annotations

Download the pre-processed GenBank annotation files from Zenodo and extract into the project root:

# Download (~6.5 GB)
wget https://zenodo.org/records/19125440/files/genbank_parquet.tar

# Extract into project root
tar -xf genbank_parquet.tar -C /path/to/PRISM_python/

This creates the following structure:

genbank/
  genbank_index.parquet
  parquet/
    *.parquet          # sharded by GenBank division

5. Configure

Copy the example config and edit paths:

cp config/config.yaml.example config/config.yaml

Edit config/config.yaml:

samples:
  YOUR_SAMPLE:
    fq1: "/path/to/YOUR_SAMPLE_R1.fq.gz"
    fq2: "/path/to/YOUR_SAMPLE_R2.fq.gz"

output_dir: "results"

kraken_db: "/path/to/kraken2/database"
minimap2_index: "/path/to/human_hg38_t2t_merged.mmi"
star_genome_dir: "/path/to/STAR/human_hg38_t2t_merged"
blast_db: "/path/to/blastdb/core_nt"

Usage

# Full run
snakemake --configfile config/config.yaml -j 16

# Dry run (preview what will execute)
snakemake -n --configfile config/config.yaml

# Generate DAG visualization
snakemake --configfile config/config.yaml --rulegraph | dot -Tpdf > rulegraph.pdf

Multiple Samples

Add multiple entries under samples: in the config:

samples:
  SRR12345678:
    fq1: "/data/SRR12345678_R1.fq.gz"
    fq2: "/data/SRR12345678_R2.fq.gz"
  SRR87654321:
    fq1: "/data/SRR87654321_R1.fq.gz"
    fq2: "/data/SRR87654321_R2.fq.gz"

Snakemake will process all samples, parallelizing where possible.

HPC / Cluster

# SLURM example
snakemake --configfile config/config.yaml \
  --jobs 100 --cluster "sbatch -p compute -n {threads} --mem=32G"

Configuration Reference

Parameter Default Description
paired true Paired-end sequencing mode
threads 28 Threads for external tools
blast_threads 28 BLAST threads (1 = deterministic results)
analyze_mates "both" Which mates to BLAST in Step 7: "R1" or "both"
min_read_per 10000 Minimum reads ratio threshold for species filtering
min_uniq_frac 5 Minimum unique k-mer fraction
max_sample 100 Maximum reads per taxon in subsample
min_qcovs 80 Minimum BLAST query coverage percent
use_custom_db true Build custom subsetted BLAST DB when FASTA > threshold
fasta_size_threshold_mb 5 FASTA size (MB) threshold to trigger custom DB build

Output

For each sample X, PRISM creates {output_dir}/X_prism/:

X_prism/
  X-results.csv          # Per-read classification with PRISM scores
  X-counts.csv           # Per-species summary
  X_1.fa                 # Annotated FASTA of retained microbial reads
  data/                  # Intermediate files

X-results.csv

Each row is a sequencing read that passed all filters.

Column Description
id Sequence identifier
read Read number (1 or 2)
staxids NCBI taxon ID
tax_name Scientific name
rank Phylogenetic rank (k, p, c, o, f, g, s)
sacc BLAST subject accession
pos Subject start position
qcovs Query coverage percent (0-100)
pident Percent identity
bitscore BLAST bit score
gene GenBank gene annotation
product GenBank product annotation
pred PRISM score (0 = likely contaminant, 1 = likely present)

X-counts.csv

Per-species summary:

Column Description
tax_name Scientific name
staxids NCBI taxon ID
n Number of reads assigned
pred Mean PRISM score for the taxon

Example Output

tax_name,staxids,n,pred
Enterococcus faecium,1352,5521,0.973
Escherichia coli,562,1531,0.105
Bacteroides fragilis,817,228,0.617
Staphylococcus aureus,1280,262,0.005

Species with high pred scores (e.g., > 0.5) are predicted as truly present; low scores indicate likely contaminants.

Project Structure

PRISM_python/
├── Snakefile                       # Top-level DAG
├── config/
│   └── config.yaml.example         # Configuration template
├── workflow/
│   ├── rules/                      # 12 Snakemake rule files (step01-step11)
│   └── scripts/                    # 10 Python scripts
├── scripts/
│   └── prepare_reference_indices.sh
├── resources/
│   ├── model_org_taxids.txt        # Model organism taxid list
│   ├── prismxg_exported.bin        # Pre-trained XGBoost model
│   └── prismxg_feature_names.txt   # Feature names for the model
├── environment.yaml
├── LICENSE
├── NOTICE.md
└── README.md

Disclaimer

This Python/Snakemake reimplementation has only been tested on one colorectal cancer WGS sample (SRR34072063 without custom DB, SRR34072064 with custom DB). On these samples, the output matches the original R pipeline (pred correlation r=0.986+ at read level, r=0.991+ at species level; differences may be attributable to BLAST non-determinism in multi-threaded mode).

The purpose of this reimplementation is to understand and improve the PRISM workflow using a more maintainable Python/Snakemake framework. While every effort has been made to ensure behavioral consistency with the original R version, we cannot guarantee the analysis conclusions presented in the original PRISM paper. Users should independently validate results for their specific data types and research questions.

Notes

  • BLAST determinism: Multi-threaded BLAST with -max_target_seqs produces non-deterministic results. Set blast_threads: 1 for reproducible output (at the cost of speed).
  • Custom BLAST DB: When _1.fa exceeds the size threshold (default 5 MB), a subsetted BLAST database is built containing only human + model organism + target species sequences. This significantly speeds up the full BLAST step for large samples.
  • Disk usage: The pipeline produces intermediate files. For a typical WGS sample, expect 5-50 GB per sample in the data/ directory.

License

This project contains components under two separate licenses:

Because the implementation code depends on the PRISM algorithm and pre-trained model, any use of the complete pipeline is subject to the RU-NCRL non-commercial restriction. See LICENSE for full details and NOTICE.md for third-party attributions.

Citation

If you use PRISM in your research, please cite the original paper:

Ghaddar B, Blaser M, De S. Revisiting the cancer microbiome using PRISM. bioRxiv 2025. https://www.biorxiv.org/content/10.1101/2025.01.21.634087v1

Acknowledgments

About

A Python + Snakemake implementation of PRecise Identification of Species of the Microbiome

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors