Commit be1723a

Merge pull request #2 from Funatiq/v1.1
V1.1
2 parents 07ba587 + 6f659e0 commit be1723a

37 files changed

Lines changed: 1671 additions & 1236 deletions

CHANGELOG.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+# Change Log
+
+## v1.1 - 2017-01-02
+### Updates
+- Ported some features from CLARK v1.2.3:
+  - Feature to pass multiple datasets of paired-end reads.
+  - Scripts to generate the target definition using the accession number instead of the GI number have been updated. Additional scripts have been added to facilitate the creation and changes of the customized databases.
+  - Include updated README_CLARK.txt
+- New download scripts `download_data_newest.sh` and `download_data_release.sh`
+- Updated README
+
+### Changes
+- Moved all source files to src/ folder
+- Added DEBUG flags for additional runtime output
+- Added Makefile
+
+## v1.0 - 2016-09-01
+Initial release.

Makefile

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+TPROGS = getTargetsDef getAccssnTaxID getfilesToTaxNodes #getGInTaxID
+PROGS = cuCLARK cuCLARK-l $(TPROGS)
+
+.PHONY: all clean target_definition
+
+# install all programs in folder ./exe/
+all:
+	$(MAKE) -C src
+	@mkdir -p exe
+	@cp $(addprefix src/,$(PROGS)) exe/
+
+clean:
+	rm -rf exe
+	$(MAKE) -C src clean
+
+target_definition:
+	$(MAKE) -C src target_definition
+	@mkdir -p exe
+	@cp $(addprefix src/,$(TPROGS)) exe/

README.md

Lines changed: 45 additions & 12 deletions
@@ -1,6 +1,10 @@
+# CuCLARK
+
 ABOUT
 -----
-CuCLARK is a metagenomic classifier for CUDA-enabled GPUs, based on CLARK (http://clark.cs.ucr.edu/).
+CuCLARK is a metagenomic classifier for CUDA-enabled GPUs, based on CLARK (http://clark.cs.ucr.edu/).
+For implementation details and speed comparison see the corresponding paper [Accelerating metagenomic read classification on CUDA-enabled GPUs](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1434-6). CuCLARK [v1.0](https://github.com/Funatiq/cuclark/releases/tag/v1.0) was used in the paper and has since been updated (see `CHANGELOG.md` for details).
+
 
 The program comes in two variants: CuCLARK and CuCLARK-l.
 CuCLARK is designed for workstations which can provide enough RAM to fit large databases
@@ -78,7 +82,7 @@ details on these scripts.
 
 SOFTWARE & SYSTEM REQUIREMENTS
 -----
-1) C++ COMPILER VERSION
+1) C++ COMPILER VERSION
 The main requirement is a 64-bit operating system (Linux or Mac) and the GNU GCC
 compiler, version 4.4 or higher. Multi-threading is provided by the OpenMP
 libraries. If these libraries are not installed, CuCLARK will run in single-threaded
@@ -113,13 +117,30 @@ INSTALLATION
 Copy the whole "CuCLARK" folder to hard disk and execute the installation script (`./install.sh`).
 The installer builds binaries (CuCLARK and CuCLARK-l, in the subfolder "exe").
 
+SCRIPTS
+-----
 The main folder also contains several scripts.
 In particular:
 - `set_targets.sh` and `classify_metagenome.sh`: They allow you to classify your metagenomes
 against one or several databases (downloaded from NCBI or available locally on your disk).
 See section "CLASSIFICATION OF METAGENOMIC SAMPLES" for details.
+
 - `download_data.sh`, `download_taxondata.sh` and `make_metadata.sh` are called by `set_targets.sh` to download a specific database and taxonomy tree data from NCBI, and to associate the genomes of the database with the corresponding taxons, respectively. Although it is possible to use these scripts on their own, we recommend simply using `set_targets.sh` to carry out all necessary steps.
 
+- `download_data.sh` downloads bacteria, viruses or human genomes from NCBI like the original CLARK.
+
+- `download_data.sh` can be replaced with `download_data_newest.sh` or `download_data_release.sh`
+to download the newest NCBI RefSeq genomes or the genomes of the latest NCBI RefSeq release. These scripts can download any database included in RefSeq, such as archaea, bacteria, fungi, etc.
+
+- `clean.sh`: This script permanently deletes all data (generated and
+downloaded) in the database directory defined in `set_targets.sh`.
+
+- `resetCustomDB.sh`: It resets the targets definition with the (newly
+added/modified) sequences of the customized database. Any call of this script must be
+followed by a run of `set_targets.sh`.
+
+- `updateTaxonomy.sh`: Downloads the latest taxonomy data (taxonomy id, accession numbers, etc.) from the NCBI website.
+
 
 
 Following is a version of CLARK's usage guide adjusted to CuCLARK's needs.
@@ -137,7 +158,7 @@ Definitions of parameters:
 `-k <kmerSize>`, k-mer length: integer, >= 2 and <= 32.
 The default value for this parameter is 31, except for CuCLARK-l (it is 27).
 
-`-T <fileTargets>`, The targets definition is written in fileTargets: filename.
+`-T <fileTargets>`, The targets definition is written in fileTargets: filename.
 This is a two-column file (separated by space, comma or tab), such that, for each line:
 column 1: the filename of a reference sequence
 column 2: the target ID (taxon name, or taxonomy ID, ...) of the reference sequence
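
The two-column layout described above can be checked before a run. The following is a minimal sketch, not part of CuCLARK: `check_targets` is a hypothetical helper name, and it assumes that filenames and target IDs contain no spaces, commas or tabs themselves. It reports malformed lines on standard error and prints how many reference files map to each target ID.

```shell
# Hypothetical helper (not shipped with CuCLARK): sanity-check a targets
# definition file with two columns separated by space, comma or tab.
#   column 1 = filename of a reference sequence
#   column 2 = target ID of that reference sequence
check_targets() {
    awk -F'[ ,\t]+' '
        /^[ \t]*$/ { next }                    # skip blank lines
        NF != 2 {                              # report malformed lines on stderr
            printf "line %d: expected 2 columns, got %d\n", NR, NF > "/dev/stderr"
            bad = 1
            next
        }
        { count[$2]++ }                        # tally reference files per target ID
        END {
            for (id in count) printf "%s\t%d\n", id, count[id]
            exit bad                           # non-zero if any line was malformed
        }
    ' "$1"
}
```

For example, `check_targets targets.txt | sort` lists each target ID with its number of reference files; the name `targets.txt` is only a placeholder for the file passed to `-T`.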
@@ -148,15 +169,15 @@ Definitions of parameters:
 The default value is 0. For example, for 1 (or, 2), the program will discard any
 discriminative k-mer that appears only once (or, less than twice).
 
-`-D <directoryDB/>`, Directory of the database : pathname.
+`-D <directoryDB/>`, Directory of the database : pathname.
 This parameter is mandatory.
 
-`-O <fileObjects>`, file containing objects: filename.
+`-O <fileObjects>`, file containing objects: filename.
 This parameter is mandatory.
 
 `-P <file1> <file2>`, Paired-end fastq files: filenames.
 
-`-R <fileResults>`, file to store results: filename.
+`-R <fileResults>`, file to store results: filename.
 Results are stored in CSV format in the file <fileResults>.csv (the extension
 ".csv" is automatically added to the filename).
 This parameter is mandatory.
@@ -223,8 +244,9 @@ To work with bacteria, viruses and human:
 `$ ./set_targets.sh <DIR_DB/> bacteria viruses human`
 
 To classify against a custom database:
-The user will need to paste its sequences (fasta files with GI number in header, and
-one fasta file per reference sequence) in the directory "Custom", inside `<DIR_DB/>`.
+The user will need to paste its sequences (fasta files with accession numbers in the
+header, i.e., ">accession.number ..." or ">gi|number|ref|accession.number| ...",
+and one fasta file per reference sequence) in the directory "Custom", inside `<DIR_DB/>`.
 To do so, the user must (1) create the directory "Custom" inside `<DIR_DB/>` (if it
 does not exist yet) (2) copy or move sequences of interest in Custom and (3) run:
 `$ ./set_targets.sh <DIR_DB/> custom`
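
Before running `set_targets.sh` on a custom database, it can help to confirm that every file in "Custom" starts with one of the two header forms named above. The sketch below is an illustration only and does not reproduce what `set_targets.sh` does internally: `first_accession` and `check_custom_dir` are hypothetical helper names, and only the first header line of each file is inspected (the text above asks for one fasta file per reference sequence).

```shell
# Hypothetical helper: print the accession found in the first FASTA header
# of a file. Returns non-zero (and prints nothing) if none is found.
first_accession() {
    awk '
        /^>/ {
            header = substr($0, 2)
            if (header ~ /^gi\|/) {
                # ">gi|number|ref|accession.number| ...": 4th |-separated field
                split(header, parts, "|")
                acc = parts[4]
            } else {
                # ">accession.number ...": first whitespace-delimited token
                split(header, parts, " ")
                acc = parts[1]
            }
            if (acc != "") { print acc; found = 1 }
            exit                               # only the first header is checked
        }
        END { exit !found }
    ' "$1"
}

# Hypothetical helper: report files in <DIR_DB>/Custom without an accession.
check_custom_dir() {
    cc_status=0
    for cc_file in "$1"/Custom/*; do
        [ -f "$cc_file" ] || continue
        first_accession "$cc_file" >/dev/null ||
            { echo "no accession found: $cc_file"; cc_status=1; }
    done
    return "$cc_status"
}
```

Calling `check_custom_dir <DIR_DB>` prints one line per file that lacks a recognizable header and returns non-zero if any such file exists.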
@@ -278,18 +300,29 @@ IMPORTANT NOTES:
 computed by `set_targets.sh`.
 
 - The script `set_targets.sh` assumes that each reference file from bacteria, viruses or custom
-database contains a GI number (in the RefSeq records format: ">gi|<number>|ref|<accession>|<text>").
+database contains an accession number (in the RefSeq record format,
+i.e., ">accession.number ..." or ">gi|number|ref|accession.number| ...").
 If an accession number is missing in a file, then this file will not be used for the classification.
 
-- set_targets.sh also maps the GI number found in each reference sequence to its taxonomy ID
-based on the latest NCBI taxonomy data. If a mapping cannot be made for a given sequence,
-then it will be counted and excluded from the targets definition.
+- `set_targets.sh` also maps the accession number found in each reference sequence to its
+taxonomy ID based on the latest NCBI taxonomy data. If a mapping cannot be made for a given sequence,
+then it will NOT be counted and will be excluded from the targets definition.
 The total number of excluded files is printed to standard output, and all files that have
 been excluded are reported in the file "files_excluded.txt" (located in the specified
 database directory, i.e., "./DBD/").
 If some files are excluded, it probably means that they have been removed from RefSeq,
 for example for curation (visit the RefSeq FAQ webpage).
 
+- You can update your local taxonomy database with the script `updateTaxonomy.sh`.
+You can use this script before running `set_targets.sh`.
+
+- If the user wants to work with a different customized database (for example, by removing
+or adding more sequences of interest in the Custom folder) then the targets definition
+must be reset. We made it simple with the script `resetCustomDB.sh`:
+After the sequences in the Custom folder have been updated, just run:
+`$ ./resetCustomDB.sh`
+Then, run `set_targets.sh` with the desired settings.
+
 - The database files (*.ky, *.lb and *.sz) will be created inside some subdirectory of the
 specified database directory in step I (i.e., "./DBD/") by `classify_metagenome.sh`.
 