- 90GB or more hard disk space to perform a normal installation.
- 8GB or more memory to allocate to the JVM, needed for the Pilon error correction step (https://github.com/broadinstitute/pilon/wiki/Requirements-&-Usage). The amount of memory required depends on the genome, the read data, and how many fixes Pilon needs to make. Generally, bacterial genomes with ~200x of Illumina coverage will require at least 8GB, though 16GB is recommended.
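One way to check a machine against these minimums before installing is sketched below. It assumes Linux (it reads /proc/meminfo) and a POSIX `df`; the function names are illustrative and not part of MetaCompass. Total system memory is only an upper bound on what can actually be allocated to the JVM.

```shell
# Sketch: compare available resources with the minimums listed above.
# Assumptions: Linux (/proc/meminfo), POSIX df; helper names are hypothetical.

REQUIRED_DISK_GB=90
REQUIRED_MEM_GB=8

# meets_minimum HAVE NEED: succeeds when HAVE >= NEED (whole numbers).
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# Free space, in whole GB, on the filesystem that holds directory $1.
free_disk_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Total physical memory, rounded to the nearest GB (Linux only).
total_mem_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024 + 0.5) }' /proc/meminfo
}

# check_requirements [DIR]: warns on stderr and returns 1 if a minimum is not met.
check_requirements() {
    local dir="${1:-.}" disk mem status=0
    disk="$(free_disk_gb "$dir")"
    mem="$(total_mem_gb)"
    if ! meets_minimum "$disk" "$REQUIRED_DISK_GB"; then
        echo "warning: only ${disk}GB free in $dir (need ${REQUIRED_DISK_GB}GB)" >&2
        status=1
    fi
    if ! meets_minimum "$mem" "$REQUIRED_MEM_GB"; then
        echo "warning: only ${mem}GB of memory (need ${REQUIRED_MEM_GB}GB for Pilon)" >&2
        status=1
    fi
    return "$status"
}
```

Calling `check_requirements /path/to/install/dir` prints a warning for each minimum that is not met.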
The file metacompass_environment.yml contains the list of packages needed to run MetaCompass. Currently all these packages are available through conda/bioconda.
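To see which packages the environment file would install before creating anything, the entries under `dependencies:` can be listed with a short helper. This is a rough sketch that assumes the usual conda layout (a top-level `dependencies:` key followed by `- package` lines); the function name is hypothetical, and nested sections such as `pip:` are printed as-is rather than parsed.

```shell
# list_conda_dependencies FILE
# Prints each entry found under the top-level "dependencies:" key.
list_conda_dependencies() {
    awk '
        /^dependencies:/ { in_deps = 1; next }
        # Any other top-level key ends the dependencies section.
        in_deps && /^[^ \t#-]/ { in_deps = 0 }
        in_deps && /^[ \t]*-[ \t]*/ {
            sub(/^[ \t]*-[ \t]*/, "")
            print
        }
    ' "$1"
}

# Usage (from the cloned repository):
#   list_conda_dependencies metacompass_environment.yml
```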
Follow these steps to get started with running the MetaCompass software:
- Clone the MetaCompass repository:
  git clone https://github.com/marbl/MetaCompass.git
- Install Conda, a package and environment manager, if you haven't already. Follow the official Conda installation instructions.
- Create a Conda environment from the metacompass_environment.yml file provided in the cloned repository. Use the following command in your terminal:
  conda env create -f metacompass_environment.yml
- MetaCompass uses Nextflow, an open-source workflow framework for scientific and data-intensive computing. To ensure Nextflow is properly installed and configured on your system, follow these steps:
  - Activate the metacompass environment.
  - Type nextflow -v in the command line. If everything is set up correctly, you should see output similar to this:
    $ conda activate metacompass
    $ nextflow -v
    nextflow version 21.10.6.5660
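The version check can also be scripted. The sketch below parses the line printed by `nextflow -v` and compares it with a chosen version. Both helper names are hypothetical, and the threshold in the usage comment is simply the version shown in the sample output, not a documented minimum.

```shell
# parse_nextflow_version LINE
# Extracts the dotted version from a line like "nextflow version 21.10.6.5660".
parse_nextflow_version() {
    printf '%s\n' "$1" | sed -n 's/^nextflow version \([0-9][0-9.]*\).*$/\1/p'
}

# version_at_least HAVE WANT
# Succeeds when HAVE >= WANT, comparing dot-separated fields numerically.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, h, ".")
        m = split(want, w, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            a = (i <= n) ? h[i] + 0 : 0   # missing fields count as 0
            b = (i <= m) ? w[i] + 0 : 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}

# Usage (inside the activated metacompass environment):
#   installed="$(parse_nextflow_version "$(nextflow -v)")"
#   version_at_least "$installed" 21.10.6 || echo "Nextflow $installed is older than 21.10.6" >&2
```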
A reference database is required to run MetaCompass. You can either generate a reference database on your own or use the one we provide.
- Download the reference database (approximately 16GB):
  wget https://obj.umiacs.umd.edu/metacompass-db/RefSeq_V2_db.tar.gz
- Extract the database:
  tar -xzf RefSeq_V2_db.tar.gz
- Use in pipeline. The --forward, --reverse, and --output options are required; --threads is optional and defaults to 16; --trace_file_name and -with-timeline are optional:
  nextflow run metacompass.nf \
      --reference_db /path/to/RefSeq_V2_db/RefSeq_V2_db \
      --forward "$forward_read" \
      --reverse "$reverse_read" \
      --output "$output_folder" \
      --threads 8 \
      --trace_file_name "$output_folder/trace.txt" \
      -with-timeline "$output_folder/timeline.html"
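For repeated runs, the command above can be wrapped in a small function that checks the inputs first. This is a sketch, not part of MetaCompass: the function name and the DRY_RUN variable are hypothetical, and it only uses the options shown above.

```shell
# run_metacompass DB FORWARD REVERSE OUTPUT [THREADS]
# Checks that the inputs exist, then runs the pipeline from the repository root.
# Set DRY_RUN=1 to print the command instead of executing it.
run_metacompass() {
    if [ "$#" -lt 4 ]; then
        echo "usage: run_metacompass DB FORWARD REVERSE OUTPUT [THREADS]" >&2
        return 2
    fi
    local db="$1" fwd="$2" rev="$3" out="$4" threads="${5:-16}"
    local f
    for f in "$fwd" "$rev"; do
        if [ ! -f "$f" ]; then
            echo "error: read file not found: $f" >&2
            return 1
        fi
    done
    if [ ! -e "$db" ]; then
        echo "error: reference database not found: $db" >&2
        return 1
    fi
    # Build the argument list in the positional parameters so quoting is preserved.
    set -- nextflow run metacompass.nf \
        --reference_db "$db" \
        --forward "$fwd" \
        --reverse "$rev" \
        --output "$out" \
        --threads "$threads" \
        --trace_file_name "$out/trace.txt" \
        -with-timeline "$out/timeline.html"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Running `DRY_RUN=1 run_metacompass /path/to/RefSeq_V2_db/RefSeq_V2_db "$forward_read" "$reverse_read" "$output_folder" 8` prints the full command so it can be inspected before launching the pipeline.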
The reference database can also be set up manually with an input file formatted according to the NCBI RefSeq assembly_summary.txt file (documented at link). An example is included in the repository: ref_db/RefSeq_V2_db/data/filtered.txt. Only the columns labeled "accession", "taxid", and "ftp" are currently used and the rest of the columns can be left blank.
The manual reference database setup has two options, depending on the size of the input file:
- If the input file is small (fewer than ~10,000 lines; this is an estimate and depends on computational resources):
  - Place your input text file in the following location:
    ./metacompass/ref_db/RefSeq_V2_db/data/
  - Activate the metacompass conda environment.
  - Identify the full local path of the directory "ref_db" within the cloned metacompass repository, e.g.:
    ./metacompass/ref_db
  - Run the script "setup_ref_db_small.sh" located in this directory with the ref_db path as the first argument and the input file as the second argument:
    ./setup_ref_db_small.sh ./metacompass/ref_db filtered.txt
    Note: This script will take a long time to complete.
- If the input file is large (more than ~10,000 lines; this is an estimate and depends on computational resources):
  - Place your input text file in the following location:
    ./metacompass/ref_db/RefSeq_V2_db/data/
  - Identify the full local path of the directory "ref_db" within the cloned metacompass repository and set the repository path. Use an absolute path, because the following steps change directories (e.g., if the repository is in the current directory):
    repository_path="$(pwd)/metacompass/ref_db"
  - Navigate to the data directory:
    cd ${repository_path}/RefSeq_V2_db/data
  - Split the accession text file:
    split -l 5000 ${repository_path}/RefSeq_V2_db/data/prokaryotes.txt ${repository_path}/RefSeq_V2_db/data/prokaryotes_
  - Activate the metacompass conda environment.
  - Run submit_build_ref.sh with the ref_db path as the first argument. This submits several jobs that process the split input file in parallel:
    cd $repository_path
    ./submit_build_ref.sh ${repository_path}
  - Once ALL jobs complete successfully, combine the outputs:
    cd $repository_path
    ./combineOutputs.sh ${repository_path}
  - (Optional) Remove the split directories now that everything is combined:
    rm -r marker_index_[a-z][a-z]
  - Finally, run the script "setup_ref_db.sh" located in this directory with the ref_db path as the first argument and the filename of the input text file as the second argument:
    ./setup_ref_db.sh ${repository_path} prokaryotes.txt
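The choice between the two options can be sketched as a line count against the ~10,000-line cutoff. The helper below is hypothetical; the cutoff is the approximate figure given above, and the chunk size matches the `split -l 5000` step, so the second number it prints is how many split files (and parallel jobs) to expect.

```shell
SMALL_INPUT_MAX_LINES=10000   # approximate cutoff from the text above
CHUNK_LINES=5000              # matches `split -l 5000`

# plan_ref_db_setup FILE
# Prints "small" or "large <number-of-chunks>" for the given input file.
plan_ref_db_setup() {
    local file="$1" lines chunks
    if [ ! -f "$file" ]; then
        echo "error: input file not found: $file" >&2
        return 1
    fi
    # Some wc implementations pad the count with spaces; strip them.
    lines="$(wc -l < "$file" | tr -d ' ')"
    if [ "$lines" -lt "$SMALL_INPUT_MAX_LINES" ]; then
        echo "small"
    else
        # Ceiling division: a partial final chunk still becomes its own file.
        chunks=$(( (lines + CHUNK_LINES - 1) / CHUNK_LINES ))
        echo "large $chunks"
    fi
}

# Usage:
#   plan_ref_db_setup ./metacompass/ref_db/RefSeq_V2_db/data/prokaryotes.txt
```

Because the cutoff depends on available computational resources, treat the result as a starting point rather than a rule.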