
Commit 68a5e00

Merge pull request #62 from nf-core/hmmsearch-funfams
Download and hmmsearch FunFam
2 parents 3da49a9 + 5f8726b commit 68a5e00

17 files changed

Lines changed: 423 additions & 35 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ Initial release of nf-core/proteinannotator, created with the [nf-core](https://

 ### `Added`

+- [#62](https://github.com/nf-core/proteinannotator/pull/62) - Added the option to download and use the latest FunFam HMM library (or use path to an existing one) for domain annotation. (by @vagkaratzas)
 - [#61](https://github.com/nf-core/proteinannotator/pull/61) - Added nf-core modules `ARIA2` and `HMMER_HMMSEARCH` to download latest Pfam HMM library (or use path to existing one) and match domains to input sequences. (by @vagkaratzas)
 - [#60](https://github.com/nf-core/proteinannotator/pull/60) - Added nf-core module `S4PRED_RUNMODEL` for secondary structure prediction (i.e., α-helix, a β-strand or a coil). (by @vagkaratzas)
 - [#59](https://github.com/nf-core/proteinannotator/pull/59) - Added nf-core qc and pre-processing subworkflow for amino acid sequences `FAA_SEQFU_SEQKIT`. (by @vagkaratzas)

conf/modules.config

Lines changed: 20 additions & 2 deletions
@@ -74,15 +74,23 @@ process {
         ]
     }

-    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:DOMAIN_ANNOTATION:ARIA2' {
+    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:DOMAIN_ANNOTATION:ARIA2_PFAM' {
         publishDir = [
             path: { "${params.outdir}/downloaded_dbs/" },
             mode: params.publish_dir_mode,
             saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
         ]
     }

-    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:DOMAIN_ANNOTATION:HMMER_HMMSEARCH' {
+    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:DOMAIN_ANNOTATION:ARIA2_FUNFAM' {
+        publishDir = [
+            path: { "${params.outdir}/downloaded_dbs/" },
+            mode: params.publish_dir_mode,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+        ]
+    }
+
+    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:DOMAIN_ANNOTATION:HMMSEARCH_PFAM' {
         ext.args = { "-E ${params.hmmsearch_evalue_cutoff}" }
         publishDir = [
             path: { "${params.outdir}/domain_annotation/pfam/" },
@@ -92,6 +100,16 @@ process {
         ]
     }

+    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:DOMAIN_ANNOTATION:HMMSEARCH_FUNFAM' {
+        ext.args = { "-E ${params.hmmsearch_evalue_cutoff}" }
+        publishDir = [
+            path: { "${params.outdir}/domain_annotation/funfam/" },
+            mode: params.publish_dir_mode,
+            pattern: "*.domtbl.gz",
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+        ]
+    }
+
     withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:S4PRED_RUNMODEL' {
         ext.prefix = { params.s4pred_outfmt }
         ext.args = { "--outfmt ${params.s4pred_outfmt}" }
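
The `publishDir` blocks in this file combine a glob `pattern` with a `saveAs` closure that returns `null` for `versions.yml`. As a rough illustration only, the Python sketch below mimics that filtering. The function name `published_name` is hypothetical, and `fnmatchcase` is used as an approximation of Nextflow's glob matching, not as its actual implementation.

```python
from fnmatch import fnmatchcase


def published_name(filename, pattern=None):
    """Return the name a file would be published under, or None if it is skipped."""
    # `pattern` restricts publishing to matching files (e.g. "*.domtbl.gz").
    if pattern is not None and not fnmatchcase(filename, pattern):
        return None
    # Mirrors `filename.equals('versions.yml') ? null : filename`.
    return None if filename == "versions.yml" else filename
```

Under these assumptions, `published_name("sample1.domtbl.gz", "*.domtbl.gz")` keeps the file name, while `versions.yml` is dropped whether or not a pattern is given.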

conf/test.config

Lines changed: 2 additions & 1 deletion
@@ -25,7 +25,8 @@ params {
     // Input data
     input = params.pipelines_testdata_base_path + 'proteinfold/testdata/samplesheet/v1.2/samplesheet.csv'
     // Domain annotation
-    pfam_latest_link = params.pipelines_testdata_base_path + 'proteinannotator/testdata/pfam/Pfam-A_test.hmm.gz'
+    pfam_latest_link   = params.pipelines_testdata_base_path + 'proteinannotator/testdata/pfam/Pfam-A_test.hmm.gz'
+    funfam_latest_link = params.pipelines_testdata_base_path + 'proteinannotator/testdata/funfam/funfam-hmm3-v4_3_0_test.lib.gz'
     // Functional annotation
     skip_interproscan = true
 }

conf/test_full.config

Lines changed: 2 additions & 1 deletion
@@ -17,5 +17,6 @@ params {
     // Input data for full size test
     input = params.pipelines_testdata_base_path + 'proteinannotator/samplesheet/snap25-isoforms.csv'
     // Domain annotation
-    pfam_latest_link = params.pipelines_testdata_base_path + 'proteinannotator/testdata/pfam/Pfam-A_test.hmm.gz'
+    pfam_latest_link   = params.pipelines_testdata_base_path + 'proteinannotator/testdata/pfam/Pfam-A_test.hmm.gz'
+    funfam_latest_link = params.pipelines_testdata_base_path + 'proteinannotator/testdata/funfam/funfam-hmm3-v4_3_0_test.lib.gz'
 }

docs/output.md

Lines changed: 8 additions & 5 deletions
@@ -6,8 +6,6 @@ This document describes the output produced by the pipeline. Most of the plots a

 The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

-<!-- TODO nf-core: Write this documentation describing your workflow's output -->
-
 ## Pipeline overview

 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
@@ -17,8 +15,8 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
   - [SeqKit](#seqkit) for preprocessing input amino acid sequences (i.e., gap removal, convert to upper case, validate, filter by length, replace special characters such as `/`, and remove duplicate sequences)

 - [Domain annotation](#domain-annotation) Annotate proteins with domains from established repositories.
-  - [aria2](#aria2) - To optionally download the latest Pfam database through the pipeline.
-  - [hmmer](#hmmer) - To optionally match the input sequence to known Pfam domains through `hmmer/hmmsearch`
+  - [aria2](#aria2) - To optionally download the latest Pfam and/or FunFam databases through the pipeline.
+  - [hmmer](#hmmer) - To optionally match the input sequence to known Pfam and/or FunFam domains through `hmmer/hmmsearch`

 - [Functional annotation](#functional-annotation) Annotate proteins with functional domains
   - [InterProScan](#Interproscan) - Search the InterPro database for functional domains
@@ -73,9 +71,12 @@ The `seqkit` module is used for initial preprocessing (i.e., gap removal, conver

 - `downloaded_dbs/`
   - `Pfam-A*.hmm.gz`: (optional) The latest full, or a minimal test, Pfam-A HMM database that can be downloaded through the pipeline.
+  - `funfam-hmm3-v4_3_0*.lib.gz`: (optional) The latest (v4_3_0) full, or a minimal test, FunFam HMM database that can be downloaded through the pipeline.

 </details>

+If the `skip_*` flag (e.g., `skip_pfam`, `skip_funfam`) for a domain annotation database is set to `true`, or its `*_db` parameter path (e.g., `pfam_db`, `funfam_db`) is set (i.e., not `null`), or the run is resumed after a successful database download, then the respective database will not be (re)downloaded. The full database links can be found in the main `nextflow.config` file, while minimal test versions can be found in the `test` and `test_full` profiles (i.e., `conf/test.config`, `conf/test_full.config`).
+
 [aria2](https://github.com/aria2/aria2/) is a lightweight multi-protocol & multi-source, cross platform download utility operated in command-line. It supports HTTP/HTTPS, FTP, SFTP, BitTorrent and Metalink.

 #### hmmer
@@ -86,10 +87,12 @@ The `seqkit` module is used for initial preprocessing (i.e., gap removal, conver
 - `domain_annotation/`
   - `pfam/`
     - `<samplename>.domtbl.gz`: `hmmer/hmmsearch` results along with parameter info.
+  - `funfam/`
+    - `<samplename>.domtbl.gz`: `hmmer/hmmsearch` results along with parameter info.

 </details>

-The `domain_annotation/pfam` folder contains a `.domtbl.gz` annotation file per input sample.
+Each of the `domain_annotation/` subfolders (e.g., `pfam`, `funfam`) contains a `.domtbl.gz` annotation file per input sample, depending on which domain annotation databases were used in the pipeline execution.

 [hmmer](https://github.com/EddyRivasLab/hmmer) searches sequence databases for homologs and builds sequence alignments using profile hidden Markov models (profile HMMs).
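
The `.domtbl.gz` files described in this section are gzip-compressed `hmmsearch --domtblout` tables. A minimal Python sketch for reading one is shown below. The function names `parse_domtbl` and `read_domtbl_gz` are hypothetical and not part of the pipeline, and the column positions assume the standard HMMER domain table layout (22 whitespace-separated columns followed by a free-text description).

```python
import gzip


def parse_domtbl(lines):
    """Yield one dict per domain hit from `hmmsearch --domtblout` text lines."""
    for line in lines:
        # Comment lines start with '#'; blank lines carry no hit.
        if not line.strip() or line.startswith("#"):
            continue
        # 22 whitespace-separated columns, then a free-text description.
        fields = line.split(maxsplit=22)
        yield {
            "target": fields[0],            # input sequence that was hit
            "query": fields[3],             # HMM (domain model) name
            "query_accession": fields[4],
            "i_evalue": float(fields[12]),  # independent E-value of this domain
            "domain_score": float(fields[13]),
            "ali_from": int(fields[17]),    # alignment start on the sequence
            "ali_to": int(fields[18]),      # alignment end on the sequence
            "description": fields[22].strip() if len(fields) > 22 else "",
        }


def read_domtbl_gz(path):
    """Read a gzip-compressed domtbl file into a list of hit dicts."""
    with gzip.open(path, "rt") as handle:
        return list(parse_domtbl(handle))
```

A call such as `read_domtbl_gz("domain_annotation/funfam/<samplename>.domtbl.gz")` would then return one dictionary per reported domain, which can be filtered further on `i_evalue` if a stricter cutoff than `hmmsearch_evalue_cutoff` is wanted.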

main.nf

Lines changed: 4 additions & 1 deletion
@@ -41,8 +41,11 @@ workflow NFCORE_PROTEINANNOTATOR {
         samplesheet,
         params.skip_preprocessing,
         params.skip_pfam,
-        params.pfam_latest_link,
         params.pfam_db,
+        params.pfam_latest_link,
+        params.skip_funfam,
+        params.funfam_db,
+        params.funfam_latest_link,
         params.skip_s4pred
     )
     emit:

nextflow.config

Lines changed: 3 additions & 0 deletions
@@ -22,6 +22,9 @@ params {
     skip_pfam = false
     pfam_db = null
     pfam_latest_link = "https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz"
+    skip_funfam = false
+    funfam_db = null
+    funfam_latest_link = "https://download.cathdb.info/cath/releases/all-releases/v4_3_0/sequence-data/funfam-hmm3-v4_3_0.lib.gz"
     hmmsearch_evalue_cutoff = 0.001

     // InterProScan Options
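
The three new parameters interact in a simple way: `skip_funfam` disables the step, a non-null `funfam_db` is used as-is, and only otherwise is `funfam_latest_link` downloaded. The Python sketch below restates that decision for either database. The function name `database_action` and its return labels are hypothetical, and the resume case (where Nextflow's cache avoids a repeated download) is not modelled.

```python
def database_action(skip, db_path):
    """Decide what happens for one HMM database, given its skip flag and path."""
    if skip:
        # e.g. skip_funfam = true: no download and no hmmsearch for this database
        return "skip"
    if db_path:
        # e.g. funfam_db = '/path/to/library': use the existing file
        return "use_existing"
    # Neither skipped nor provided: fetch the file behind *_latest_link
    return "download"
```

With the defaults from this hunk (`skip_funfam = false`, `funfam_db = null`), the sketch returns `"download"`.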

nextflow_schema.json

Lines changed: 17 additions & 0 deletions
@@ -286,6 +286,23 @@
                     "description": "InterPro hosted link to the latest Pfam HMM database file.",
                     "help_text": "Latest version should be a bit more than 350MB."
                 },
+                "skip_funfam": {
+                    "type": "boolean",
+                    "fa_icon": "fas fa-ban",
+                    "description": "Skip the domain annotation with the FunFam database.",
+                    "help_text": "Skips the domain annotation of input sequences against a FunFam database."
+                },
+                "funfam_db": {
+                    "type": "string",
+                    "format": "file-path",
+                    "description": "Path to an already installed FunFam HMM database (.lib.gz).",
+                    "help_text": "If left null and skip_funfam is false, the pipeline will start downloading the latest FunFam HMM library."
+                },
+                "funfam_latest_link": {
+                    "type": "string",
+                    "default": "https://download.cathdb.info/cath/releases/all-releases/v4_3_0/sequence-data/funfam-hmm3-v4_3_0.lib.gz",
+                    "description": "CATH hosted link to the latest available (v4_3_0) FunFam HMM database file."
+                },
                 "hmmsearch_evalue_cutoff": {
                     "type": "number",
                     "default": 0.001,
Lines changed: 43 additions & 14 deletions
@@ -1,24 +1,31 @@
-include { ARIA2 } from '../../../modules/nf-core/aria2/main'
-include { HMMER_HMMSEARCH } from '../../../modules/nf-core/hmmer/hmmsearch/main'
+include { ARIA2 as ARIA2_PFAM } from '../../../modules/nf-core/aria2/main'
+include { ARIA2 as ARIA2_FUNFAM } from '../../../modules/nf-core/aria2/main'
+include { HMMER_HMMSEARCH as HMMSEARCH_PFAM } from '../../../modules/nf-core/hmmer/hmmsearch/main'
+include { HMMER_HMMSEARCH as HMMSEARCH_FUNFAM } from '../../../modules/nf-core/hmmer/hmmsearch/main'

 workflow DOMAIN_ANNOTATION {
     take:
-    ch_fasta         // channel: [ val(meta), [ fasta ] ]
-    skip_pfam        // boolean
-    pfam_latest_link // string, path to the latest pfam HMM database, to download
-    pfam_db          // string, path to the pfam HMM database, if already exists
+    ch_fasta           // channel: [ val(meta), [ fasta ] ]
+    skip_pfam          // boolean
+    pfam_db            // string, path to the pfam HMM database, if already exists
+    pfam_latest_link   // string, path to the latest pfam HMM database, to download
+    skip_funfam        // boolean
+    funfam_db          // string, path to the funfam HMM database, if already exists
+    funfam_latest_link // string, path to the latest funfam HMM database, to download

     main:

-    ch_versions = channel.empty()
+    ch_versions       = channel.empty()
+    ch_pfam_domains   = channel.empty()
+    ch_funfam_domains = channel.empty()

     if (!skip_pfam) {
         if (!pfam_db) {
             ch_pfam_link = channel.of([ [ id: 'pfam' ], pfam_latest_link ])

-            ARIA2( ch_pfam_link )
-            ch_versions = ch_versions.mix( ARIA2.out.versions )
-            ch_pfam_db = ARIA2.out.downloaded_file
+            ARIA2_PFAM( ch_pfam_link )
+            ch_versions = ch_versions.mix( ARIA2_PFAM.out.versions )
+            ch_pfam_db = ARIA2_PFAM.out.downloaded_file
         } else {
             ch_pfam_db = channel.of([ [ id: 'pfam' ], pfam_db ])
         }
@@ -27,11 +34,33 @@ workflow DOMAIN_ANNOTATION {
             .combine(ch_pfam_db)
             .map{ meta, seqs, _meta2, models -> [meta, models, seqs, false, false, true] }

-        HMMER_HMMSEARCH( ch_input_for_hmmsearch )
-        ch_versions = ch_versions.mix( HMMER_HMMSEARCH.out.versions.first() )
+        HMMSEARCH_PFAM( ch_input_for_hmmsearch )
+        ch_versions = ch_versions.mix( HMMSEARCH_PFAM.out.versions.first() )
+        ch_pfam_domains = HMMSEARCH_PFAM.out.domain_summary
+    }
+
+    if (!skip_funfam) {
+        if (!funfam_db) {
+            ch_funfam_link = channel.of([ [ id: 'funfam' ], funfam_latest_link ])
+
+            ARIA2_FUNFAM( ch_funfam_link )
+            ch_versions = ch_versions.mix( ARIA2_FUNFAM.out.versions )
+            ch_funfam_db = ARIA2_FUNFAM.out.downloaded_file
+        } else {
+            ch_funfam_db = channel.of([ [ id: 'funfam' ], funfam_db ])
+        }
+
+        ch_input_for_hmmsearch = ch_fasta
+            .combine(ch_funfam_db)
+            .map{ meta, seqs, _meta2, models -> [meta, models, seqs, false, false, true] }
+
+        HMMSEARCH_FUNFAM( ch_input_for_hmmsearch )
+        ch_versions = ch_versions.mix( HMMSEARCH_FUNFAM.out.versions.first() )
+        ch_funfam_domains = HMMSEARCH_FUNFAM.out.domain_summary
     }

     emit:
-    pfam_domains = HMMER_HMMSEARCH.out.domain_summary
-    versions     = ch_versions
+    pfam_domains   = ch_pfam_domains
+    funfam_domains = ch_funfam_domains
+    versions       = ch_versions
 }
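
The `.combine(...)` followed by `.map{ ... }` in this subworkflow pairs every FASTA item with the single database item and reorders the fields for `hmmsearch`. The Python sketch below imitates that reshaping with plain lists. It is only an illustration: `hmmsearch_inputs` is a hypothetical name, and reading the three trailing booleans as "write alignment", "write target summary" and "write domain summary" switches is an assumption about the nf-core `hmmer/hmmsearch` module, not something stated in this commit.

```python
from itertools import product


def hmmsearch_inputs(fasta_items, db_items):
    """Pair each (meta, seqs) item with each (meta2, models) item, as combine + map do."""
    return [
        # Same field order as the Nextflow closure: [meta, models, seqs, false, false, true]
        [meta, models, seqs, False, False, True]
        for (meta, seqs), (_meta2, models) in product(fasta_items, db_items)
    ]
```

Because the database channel holds one item, the result has exactly one entry per input sample, and the database's own `meta` map (`[ id: 'funfam' ]`) is discarded in favour of the sample's.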
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/yaml-schema.json
+name: "domain_annotation"
+description: Annotate amino acid fasta files with selected HMM libraries such as Pfam and FunFam
+keywords:
+  - fasta
+  - sequences
+  - domain
+  - annotation
+  - database
+  - download
+  - HMM
+components:
+  - aria2
+  - hmmer/hmmsearch
+input:
+  - ch_fasta:
+      type: file
+      description: |
+        Amino acid fasta file containing amino acid sequences for annotation
+  - skip_pfam:
+      type: boolean
+      description: |
+        Skip domain annotation with Pfam
+  - pfam_db:
+      type: string
+      description: |
+        Path to an existing HMM Pfam library on the system. If provided, the ARIA2_PFAM db download will be skipped.
+  - pfam_latest_link:
+      type: string
+      description: |
+        Path to the latest Pfam HMM database, to download
+  - skip_funfam:
+      type: boolean
+      description: |
+        Skip domain annotation with FunFam
+  - funfam_db:
+      type: string
+      description: |
+        Path to an existing HMM FunFam library on the system. If provided, the ARIA2_FUNFAM db download will be skipped.
+  - funfam_latest_link:
+      type: string
+      description: |
+        Path to the latest FunFam HMM database, to download
+output:
+  - pfam_domains:
+      type: file
+      description: |
+        domtbl.gz files with pfam domain annotation for input amino acid sequences
+  - funfam_domains:
+      type: file
+      description: |
+        domtbl.gz files with funfam domain annotation for input amino acid sequences
+  - versions:
+      type: file
+      description: |
+        Versions file containing the software versions used in the workflow
+authors:
+  - "@vagkaratzas"
+maintainers:
+  - "@vagkaratzas"
