A Snakemake workflow for building taxonomically annotated protein sequence databases from public resources and custom genome collections.
RepDBmaker assembles protein sequence databases from multiple sources, annotates them with taxonomy, and builds searchable indices for:
- Diamond
- MMseqs2
- BLAST
It supports:
- prokaryotes from GTDB
- eukaryotes from EukProt, P10K, and UniProt
- viruses from NCBI Virus
Optional features include taxonomic clustering and contamination filtering.
- Snakemake
- Conda or Miniconda
- Internet access for external downloads
- Sufficient disk space for genome and database files
If using --sdm conda, Snakemake will automatically create the following environments from workflow/envs/:
- workflow/envs/python.yaml: Python, pandas, matplotlib, polars
- workflow/envs/homology.yaml: diamond, mmseqs2, blast
- workflow/envs/utils.yaml: taxonkit, csvtk, ncbi-datasets-cli, jq, seqkit
- workflow/envs/R.yaml: R and visualization/taxonomy packages
Create all environments without executing the pipeline:

    snakemake --conda-create-envs-only

A Docker image is available at Docker Hub.
If Docker is installed, run the pipeline from the repository root with:

    docker run --rm -v $(pwd):/app/data gmuttiirb/repdbmaker:v1.0 snakemake --cores 2 --directory /app/data -n

To inspect the available proteomes before running the full workflow:

    snakemake -j 1 --until available_proteomes

This generates results/meta/available_proteomes.tsv, which can be used to select a proteome subset.
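As a rough illustration of selecting a subset from that table, the sketch below filters a tab-separated table by one column and returns the matching identifiers. The column names `genome_id` and `source`, and the example rows, are hypothetical; check the actual header of available_proteomes.tsv before adapting it.

```python
import csv
import io


def select_ids(tsv_text, column, wanted, id_column):
    """Return identifiers of rows whose `column` value is in `wanted`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[id_column] for row in reader if row[column] in wanted]


# Hypothetical table content; the real header of available_proteomes.tsv may differ.
example = "genome_id\tsource\nG1\tEukProt\nG2\tUniProt\nG3\tEukProt\n"

ids = select_ids(example, column="source", wanted={"EukProt"}, id_column="genome_id")
print("\n".join(ids))
```

The resulting identifiers could then be written one per line to an ids file, assuming that is the format the workflow expects.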
To run the full workflow:

    snakemake -j 14

Downloads occasionally fail or produce truncated files. For this reason, the db_stats rule fails if any gzipped FASTA file is malformed, which blocks creation of the database FASTA.
To check for broken files:

    cut -f1 results/dbs/<db>/genome_table.tsv | xargs -I {} sh -c 'gzip -t "{}" || echo "Failed: {}"'

Then delete the problematic files and re-run the pipeline. If the problem persists, the cause may lie elsewhere (the source files may themselves be broken, or the download script may be failing for them); in that case, I recommend excluding those genomes and finding the most suitable alternative.
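The same check can be sketched in Python for cases where a list of broken paths is more convenient than shell output. This is a minimal sketch, not part of the workflow: the function names are made up for illustration, and it assumes, like the command above, that the first tab-separated column of genome_table.tsv holds a path to a gzipped file.

```python
import csv
import gzip
import zlib


def is_valid_gzip(path, chunk_size=1 << 20):
    """Return True if the file decompresses to the end without error."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers missing files and bad gzip headers,
        # EOFError covers truncated streams.
        return False


def broken_files(genome_table):
    """List first-column paths in a tab-separated table that fail the gzip check."""
    with open(genome_table, newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        return [row[0] for row in rows if row and not is_valid_gzip(row[0])]
```

If the table has a header row, its first field is reported as broken too, just as it would be by the shell one-liner.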
Configure the workflow in config/repdb.yaml or pass a custom config file with:

    snakemake --configfile path/to/custom.yaml

Example configuration:

    # Database configuration
    dbs:
      gtdb_version: "latest" # use release220, release226, etc.
      type: ["diamond", "mmseqs"] # Available types: diamond, mmseqs, blastp
      build:
        repdb:
          decontamination:
            # optimized settings for ContScout benchmarking
            identity: 0.9
            coverage: 0.5
            cov_mode: 3
            prop_euka: 0.5
        custom:
          clusteredrepdb:
            ids: resources/repdb.ids
            cluster:
              level: class
              identity: 0.9
              coverage: 0.9
    files:
      clades_to_keep: "resources/clades_tokeep.txt"
      genomes_to_exclude: "resources/exclude.txt"
      new_genomes: "resources/custom_genomes_repdb.csv"

Using this config file creates RepDB and its clustered version.
To add a custom database:
- Define an entry under dbs.build.custom
- Provide an ids file with genome identifiers or metadata
- Optionally configure cluster and decontaminate
Example:

    dbs:
      type: ["diamond", "mmseqs", "blastp"]
      build:
        smalleuks:
          ids: resources/eukas_50.ids
          cluster:
            level: order
            identity: 0.8
            coverage: 0.8
          decontaminate:
            identity: 0.9
            coverage: 0.5
            cov_mode: 3
            prop_euka: 0.5

The workflow will create results/dbs/<custom_db>/ and its associated outputs.
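Before launching a long run, it can help to sanity-check a custom entry once the YAML has been loaded into a dictionary. The sketch below is illustrative only: `check_db_entry` is a hypothetical helper, and it assumes that `identity`, `coverage`, and `prop_euka` are meant to be fractions between 0 and 1, as they are in the examples above. The workflow itself may apply different or additional rules.

```python
def check_db_entry(name, entry):
    """Return a list of human-readable problems found in one database entry."""
    problems = []
    if "ids" not in entry:
        problems.append(f"{name}: missing 'ids'")
    for section in ("cluster", "decontaminate"):
        params = entry.get(section, {})
        for key in ("identity", "coverage", "prop_euka"):
            value = params.get(key)
            # Assumption: these thresholds are fractions in [0, 1].
            if value is not None and not 0 <= value <= 1:
                problems.append(f"{name}.{section}.{key}={value} is outside [0, 1]")
    return problems


# The smalleuks entry from the example, as it would look after YAML parsing.
smalleuks = {
    "ids": "resources/eukas_50.ids",
    "cluster": {"level": "order", "identity": 0.8, "coverage": 0.8},
    "decontaminate": {"identity": 0.9, "coverage": 0.5, "cov_mode": 3, "prop_euka": 0.5},
}

print(check_db_entry("smalleuks", smalleuks))
```

An empty list means no problems were found under these assumptions; a value such as `identity: 90` would be reported.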
Key output locations:
- results/dbs/<db>/<db>.fa.gz: compressed protein FASTA
- <db>_accession_map.txt: accession-to-taxid mapping
- <db>_map, <db>_nohead.map: BLAST/MMseqs maps
- <db>_diamond: Diamond database index
- <db>_mmseqs: MMseqs2 database index
- <db>_blastp: BLASTP database files
- results/dbs/<db>/cluster/cluster_params.yaml: clustering settings
- results/dbs/<db>/cluster/<db>_clustered.fa.gz: clustered FASTA output
- results/dbs/<db>/decontaminate/decontaminate_params.yaml: decontamination settings
- results/dbs/<db>/decontaminate/contaminants.txt: decontamination candidates
- results/dbs/<db>/decontaminate/pair_counts.tsv: cluster pair counts
- results/stats/: database and clustering statistics
- results/meta/check_resources.txt: validation of downloaded assets
- results/taxonomies/: taxonomy annotations and selection outputs
- results/taxdump/repdb_taxdump/: taxonkit taxdump for combined genomes
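To see quickly which of the per-database files exist after a run, the listed paths can be checked from Python. This is a minimal sketch with hypothetical function names; it covers only the files whose full paths are given above, and the cluster and decontaminate files will legitimately be absent when those optional steps are not configured.

```python
from pathlib import Path


def expected_outputs(db, root="results"):
    """Build the per-database output paths listed above for one database name."""
    base = Path(root) / "dbs" / db
    return [
        base / f"{db}.fa.gz",
        base / "cluster" / "cluster_params.yaml",
        base / "cluster" / f"{db}_clustered.fa.gz",
        base / "decontaminate" / "decontaminate_params.yaml",
        base / "decontaminate" / "contaminants.txt",
        base / "decontaminate" / "pair_counts.tsv",
    ]


def missing_outputs(db, root="results"):
    """Return the expected paths that do not exist on disk."""
    return [path for path in expected_outputs(db, root) if not path.exists()]
```

For example, `missing_outputs("repdb")` run from the repository root returns the subset of these paths that has not been produced yet.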
The repository includes helper scripts for working with generated databases.
workflow/scripts/get_fasta.py: extract a subset of FASTA sequences from an MMseqs2 database
Example usage:

    python workflow/scripts/get_fasta.py <id_file> <output_fasta> <db_mmseqs>

A benchmark workflow is available in workflow/benchmark.smk and uses config/benchmark.yaml to compare RepDB against reference databases such as NR and clustered NR.
A results notebook is available at workflow/notebooks/comparison.Rmd.
Please add the preferred citation here.
See the LICENSE file for license details.
