Finds CRISPR arrays in raw, un-assembled metagenomic reads. Builds a succinct de Bruijn graph and detects multicycles - the structural signature of CRISPR repeat-spacer arrays - without any prior assembly step.
Outperforms assembly-based workflows and other assembly-free CRISPR detectors on synthetic and real metagenomes.
- CMake ≥ 3.12, C++17, zlib, OpenMP, BZip2
- Docker (recommended for production use)
git clone --recurse-submodules https://github.com/RNABioInfo/mcaat.git
cd mcaat
chmod +x ./install.sh
./install.sh

The mcaat binary will be at build/mcaat.
Optional flags:
./install.sh --install # also installs to system
./install.sh --clean # clean build artifacts

Note: A pre-built image is available on Docker Hub; no manual dependency setup is required.
docker pull feeka94/mcaat:1.0.0

To build the image locally instead:
docker build -t mcaat .
docker run --rm -v $(pwd):/data mcaat \
--input-files /data/reads_R1.fastq /data/reads_R2.fastq \
--output-folder /data/results

The image is based on debian:bookworm-slim and ships only the mcaat binary and runtime libs (libomp5, zlib1g).
Detailed usage is documented at rnabioinfo.github.io/mcaat
Exactly one input source is required — either raw reads or a pre-built graph:
# From reads (builds the graph internally)
mcaat --input-files <file1> [file2] [options]
# From a pre-built graph (skips graph construction)
mcaat --graph <path> [options]

Required (one of):

| Flag | Description |
|---|---|
| `--input-files <file1> [file2]` | One or two FASTA/FASTQ files, plain or gzipped. One file = single-end, two = paired-end |
| `--graph <path>` | Pre-built SDBG graph directory (or file prefix) from a previous run (skips graph construction) |
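The "exactly one input source" rule can be checked in a wrapper script before a long run is launched. The sketch below is a hypothetical pre-flight helper; `check_input_source` is not part of mcaat, and it only assumes that the flags appear as separate arguments, as in the synopsis above.

```sh
# Hypothetical pre-flight check (not part of mcaat): succeeds only when exactly
# one of --input-files / --graph is given, and --input-files has 1 or 2 paths.
check_input_source() {
    sources=0; used_files=0; nfiles=0; collecting=0
    for arg in "$@"; do
        case "$arg" in
            --input-files) sources=$((sources + 1)); used_files=1; collecting=1 ;;
            --graph)       sources=$((sources + 1)); collecting=0 ;;
            -*)            collecting=0 ;;   # any other flag ends the file list
            *)             if [ "$collecting" -eq 1 ]; then nfiles=$((nfiles + 1)); fi ;;
        esac
    done
    if [ "$sources" -ne 1 ]; then
        echo "error: pass exactly one of --input-files or --graph" >&2
        return 1
    fi
    if [ "$used_files" -eq 1 ] && { [ "$nfiles" -lt 1 ] || [ "$nfiles" -gt 2 ]; }; then
        echo "error: --input-files takes one or two files" >&2
        return 1
    fi
    return 0
}
```

In a wrapper script this could be used as `check_input_source "$@" && mcaat "$@"`.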
Optional:

| Flag | Default | Description |
|---|---|---|
| `--output-folder <path>` | `mcaat_run_YYYY-MM-DD_HH-MM-SS/` | Output directory |
| `--ram <amount>` | 95% of system RAM | Memory cap. Units: B, K, M, G (e.g. `--ram 8G`) |
| `--threads <num>` | CPU cores − 2 | Thread count |
| `--cycle-max-length <int>` | 77 | Maximum cycle length to search |
| `--cycle-min-length <int>` | 27 | Minimum cycle length to search |
| `--threshold-multiplicity <int>` | 20 | Min edge multiplicity for cycle start nodes |
| `--low-abundance <true\|false>` | true | Enable low-abundance mode |
| `--autoclean <true\|false>` | true | Remove intermediate graph/cycle files after run. Set to false to keep them |
| `--settings <path>` | none | Key=value settings file (CLI flags override it) |
| `--help`, `-h` | none | Show usage and exit |
<output-folder>/
├── CRISPR_Arrays_1.txt # detected arrays (split into numbered files if large)
├── graph/ # succinct de Bruijn graph files
└── cycles/ # raw cycle data
Each CRISPR_Arrays_N.txt file has a short header followed by one block per array:
# MCAAT — CRISPR Array Output
# Generated : 2026-05-12 10:30:21
# Arrays : 42
# Spacers : 312
>Array_1 spacers=8
ATCGATCGATCGATCGATCGATCG
-------------------- AACCCGGTTAATCGATCGTTTCGAGC
-------------------- TTGGCCAATCGATCGATCAAAACGGG
ATCGATCGATCGATCTATCG GGAATTCCAATCGATCGAATACCCAC ← repeat variant
The consensus repeat sequence is on its own line. Each spacer entry shows the repeat variant (or dashes when it matches the consensus exactly) followed by the spacer sequence.
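For downstream tools that expect FASTA, the spacers can be pulled out of an output file with a few lines of awk. This is a minimal sketch, not an official converter: `spacers_to_fasta` and the `Array_N_spacer_M` record names are made up for illustration, and the parsing assumes the layout shown above (`#` header lines, a `>Array_N` line, one consensus repeat line, then whitespace-separated "repeat-or-dashes, spacer" lines).

```sh
# Sketch: print every spacer from CRISPR_Arrays_N.txt files as FASTA records.
# Reads the files given as arguments, or standard input if none are given.
spacers_to_fasta() {
    awk '
        /^#/ || NF == 0 { next }                      # file header, blank lines
        /^>/ { name = substr($1, 2); n = 0; want_consensus = 1; next }
        want_consensus { want_consensus = 0; next }   # skip the consensus repeat
        NF >= 2 { n++; printf(">%s_spacer_%d\n%s\n", name, n, $2) }
    ' "$@"
}
```

For example, `spacers_to_fasta results/CRISPR_Arrays_1.txt > spacers.fasta` would write one record per spacer; the first column (repeat variant or dashes) is ignored.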
Pass a key=value file with --settings. CLI flags override any value from the file.
input-files=/data/R1.fastq /data/R2.fastq
ram=128G
threads=26
output-folder=results/run_1
cycle-max-length=77
cycle-min-length=27
threshold-multiplicity=20
low-abundance=true
autoclean=true
input-files accepts one or two paths separated by spaces, commas, or semicolons.
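To see what a settings file amounts to on the command line, the key=value pairs can be translated into flags. The sketch below is illustrative only: `settings_to_flags` is a hypothetical helper, it assumes every key maps to the flag of the same name (as the sample suggests) with no whitespace around `=`, and it does not reproduce mcaat's own parser.

```sh
# Sketch: translate a key=value settings file into "--flag value" lines.
# Assumption: each key corresponds to the CLI flag of the same name.
settings_to_flags() {
    awk -F'=' '
        NF < 2 { next }                            # skip lines without key=value
        {
            key = $1
            value = substr($0, length(key) + 2)    # everything after the first "="
            if (key == "input-files")
                gsub(/[ ,;]+/, " ", value)         # spaces, commas, semicolons all separate paths
            printf("--%s %s\n", key, value)
        }
    ' "$1"
}
```

Given a file containing `input-files=/data/R1.fastq;/data/R2.fastq`, the helper prints `--input-files /data/R1.fastq /data/R2.fastq`, which shows that the three separators are treated as equivalent.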
- CAS detection: identify and annotate CAS genes flanking detected CRISPR arrays
- Protospacer detection: map spacers back to reads/contigs to find protospacer sequences and PAM sites
If you use MCAAT, please cite: https://academic.oup.com/microlife/article/doi/10.1093/femsml/uqaf016/8205558
Contact: Please open an issue on our GitHub page if you run into any problems.