Commit 7d17d6c

Merge pull request #59 from nf-core/qc-subworkflow
qc and preprocessing nf-core subworkflow added
2 parents 850b37b + 6386634 commit 7d17d6c

40 files changed

Lines changed: 2330 additions & 61 deletions

CHANGELOG.md

Lines changed: 9 additions & 7 deletions
@@ -9,13 +9,15 @@ Initial release of nf-core/proteinannotator, created with the [nf-core](https://

 ### `Added`

-- [[PR #52](https://github.com/nf-core/proteinannotator/pull/52)] Add option to turn off InterProScan for testing
-- [[PR #51](https://github.com/nf-core/proteinannotator/pull/51)] Update to nf-core/tools v3.3.1
-- [[PR #47](https://github.com/nf-core/proteinannotator/pull/47)] Update metromap with more tools added from [May 2025 Hackathon](https://nf-co.re/events/2025/hackathon-boston)
-- [[PR #43](https://github.com/nf-core/proteinannotator/pull/44)] Add [mTM-Align](https://nf-co.re/modules/mtmalign_align/) and [MMseqs2 Search](https://nf-co.re/modules/mmseqs_search/) modules
-- [[PR #42](https://github.com/nf-core/proteinannotator/pull/42)] Updated to `nf-test` on GitHub Actions and in the `PULL_REQUEST_TEMPLATE.md`
-- [[PR #13](https://github.com/nf-core/proteinannotator/pull/13)] Add nf-core seqkit/stats module
-- [[PR #9](https://github.com/nf-core/proteinannotator/pull/9)] Add [InterProScan](https://interproscan-docs.readthedocs.io/) module
+- [#59](https://github.com/nf-core/proteinannotator/pull/59) - Added nf-core qc and pre-processing subworkflow for amino acid sequences `FAA_SEQFU_SEQKIT`. (by @vagkaratzas)
+- [#57](https://github.com/nf-core/proteinannotator/pull/57) - nf-core tools template update to 3.5.1. (by @vagkaratzas)
+- [#52](https://github.com/nf-core/proteinannotator/pull/52) - Add option to turn off InterProScan for testing
+- [#51](https://github.com/nf-core/proteinannotator/pull/51) - Update to nf-core/tools v3.3.1
+- [#47](https://github.com/nf-core/proteinannotator/pull/47) - Update metromap with more tools added from [May 2025 Hackathon](https://nf-co.re/events/2025/hackathon-boston)
+<!-- - [#43](https://github.com/nf-core/proteinannotator/pull/44) - Add [mTM-Align](https://nf-co.re/modules/mtmalign_align/) and [MMseqs2 Search](https://nf-co.re/modules/mmseqs_search/) modules -->
+- [#42](https://github.com/nf-core/proteinannotator/pull/42) - Updated to `nf-test` on GitHub Actions and in the `PULL_REQUEST_TEMPLATE.md`
+- [#13](https://github.com/nf-core/proteinannotator/pull/13) - Add nf-core seqkit/stats module
+- [#9](https://github.com/nf-core/proteinannotator/pull/9) - Add [InterProScan](https://interproscan-docs.readthedocs.io/) module

 ### `Fixed`

CITATIONS.md

Lines changed: 12 additions & 4 deletions
@@ -10,21 +10,29 @@

 ## Pipeline tools

-- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
+- [SeqFu](https://pubmed.ncbi.nlm.nih.gov/34066939/)

-> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
+> Telatin A, Fariselli P, Birolo G. SeqFu: a suite of utilities for the robust and reproducible manipulation of sequence files. Bioengineering. 2021 May 7;8(5):59. doi: 10.3390/bioengineering8050059. PubMed PMID: 34066939; PubMed Central PMCID: PMC8148589.
+
+- [SeqKit](https://pubmed.ncbi.nlm.nih.gov/38898985/)
+
+> Shen W, Sipos B, Zhao L. SeqKit2: A Swiss army knife for sequence and alignment processing. iMeta. 2024 Apr 5:e191. doi: 10.1002/imt2.191. PubMed PMID: 38898985; PubMed Central PMCID: PMC11183193.

 - [InterProScan](https://academic.oup.com/bioinformatics/article/17/9/847/206564)

 > Zdobnov, Evgeni M., and Rolf Apweiler. “InterProScan – an Integration Platform for the Signature-Recognition Methods in InterPro.” Bioinformatics 17, no. 9 (September 1, 2001): 847–48. https://doi.org/10.1093/bioinformatics/17.9.847.

-- [MMseqs2](https://www.nature.com/articles/nbt.3988)
+<!-- - [MMseqs2](https://www.nature.com/articles/nbt.3988)

 > Zdobnov, Evgeni M., and Rolf Apweiler. “InterProScan – an Integration Platform for the Signature-Recognition Methods in InterPro.” Bioinformatics 17, no. 9 (September 1, 2001): 847–48. https://doi.org/10.1093/bioinformatics/17.9.847.

 - [mTM-align](https://academic.oup.com/bioinformatics/article/34/10/1719/4769500)

-> Dong, Runze, Zhenling Peng, Yang Zhang, and Jianyi Yang. “mTM-Align: An Algorithm for Fast and Accurate Multiple Protein Structure Alignment.” Bioinformatics 34, no. 10 (May 15, 2018): 1719–25. https://doi.org/10.1093/bioinformatics/btx828.
+> Dong, Runze, Zhenling Peng, Yang Zhang, and Jianyi Yang. “mTM-Align: An Algorithm for Fast and Accurate Multiple Protein Structure Alignment.” Bioinformatics 34, no. 10 (May 15, 2018): 1719–25. https://doi.org/10.1093/bioinformatics/btx828. -->
+
+- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
+
+> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

 ## Software packaging/containerisation tools

assets/methods_description_template.yml

Lines changed: 0 additions & 1 deletion
@@ -3,7 +3,6 @@ description: "Suggested text and references to use when describing pipeline usag
 section_name: "nf-core/proteinannotator Methods Description"
 section_href: "https://github.com/nf-core/proteinannotator"
 plot_type: "html"
-## TODO nf-core: Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
 ## You inject any metadata in the Nextflow '${workflow}' object
 data: |
   <h4>Methods</h4>

conf/modules.config

Lines changed: 56 additions & 3 deletions
@@ -18,6 +18,62 @@ process {
         saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
     ]

+    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:FAA_SEQFU_SEQKIT:SEQFU_STATS_BEFORE' {
+        ext.prefix = { "${meta.id}_before" }
+        publishDir = [
+            path: { "${params.outdir}/qc/${meta.id}/" },
+            mode: params.publish_dir_mode,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+        ]
+    }
+
+    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:FAA_SEQFU_SEQKIT:SEQFU_STATS_AFTER' {
+        ext.prefix = { "${meta.id}_after" }
+        publishDir = [
+            path: { "${params.outdir}/qc/${meta.id}/" },
+            mode: params.publish_dir_mode,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+        ]
+    }
+
+    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:FAA_SEQFU_SEQKIT:SEQKIT_SEQ' {
+        ext.args = [
+            "--remove-gaps",
+            "--upper-case",
+            "--validate-seq",
+            "--min-len ${params.min_seq_length}",
+            "--max-len ${params.max_seq_length}"
+        ].join(' ').trim()
+        ext.prefix = "intermediate_seqkit_seq"
+        publishDir = [
+            path: { "${params.outdir}/qc/${meta.id}/" },
+            mode: params.publish_dir_mode,
+            enabled: false,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+        ]
+    }
+
+    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:FAA_SEQFU_SEQKIT:SEQKIT_REPLACE' {
+        ext.args = '-p "/" -r "_"'
+        ext.suffix = "fasta"
+        ext.prefix = "intermediate_seqkit_replace"
+        publishDir = [
+            path: { "${params.outdir}/qc/${meta.id}/" },
+            mode: params.publish_dir_mode,
+            enabled: false,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+        ]
+    }
+
+    withName: 'NFCORE_PROTEINANNOTATOR:PROTEINANNOTATOR:FAA_SEQFU_SEQKIT:SEQKIT_RMDUP' {
+        ext.args = { params.remove_duplicates_on_sequence ? "--by-seq" : '' }
+        publishDir = [
+            path: { "${params.outdir}/qc/${meta.id}/" },
+            mode: params.publish_dir_mode,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+        ]
+    }
+
     withName: 'MULTIQC' {
         ext.args = { params.multiqc_title ? "--title \"$params.multiqc_title\"" : '' }
         publishDir = [
@@ -27,7 +83,4 @@ process {
         ]
     }

-    withName: SEQKIT_STATS {
-        ext.args = ' ' // turn off --all default argument
-    }
 }
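
The `SEQKIT_SEQ` and `SEQKIT_RMDUP` blocks in the hunk above build their command-line arguments from pipeline parameters. The same string assembly can be sketched in Python to show what ends up on the command line. The function names are illustrative only, and the values 30 and 5000 are hypothetical, since the parameter defaults are not part of this diff.

```python
def seqkit_seq_args(min_seq_length: int, max_seq_length: int) -> str:
    """Mirror the list-join pattern used for SEQKIT_SEQ's ext.args."""
    parts = [
        "--remove-gaps",
        "--upper-case",
        "--validate-seq",
        f"--min-len {min_seq_length}",
        f"--max-len {max_seq_length}",
    ]
    # Same effect as Groovy's .join(' ').trim()
    return " ".join(parts).strip()


def seqkit_rmdup_args(remove_duplicates_on_sequence: bool) -> str:
    """Mirror the ternary used for SEQKIT_RMDUP's ext.args."""
    return "--by-seq" if remove_duplicates_on_sequence else ""


if __name__ == "__main__":
    # 30 and 5000 are hypothetical example values, not pipeline defaults.
    print(seqkit_seq_args(30, 5000))
    print(repr(seqkit_rmdup_args(False)))
```

When the flag is off, the argument string is empty and `seqkit rmdup` falls back to its default behaviour of deduplicating by sequence ID.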

docs/output.md

Lines changed: 40 additions & 1 deletion
@@ -12,12 +12,51 @@ The directories listed below will be created in the results directory after the

 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

+- [Quality check and preprocessing](#quality-check-and-preprocessing)
+  - [SeqFu](#seqfu) for input amino acid sequences quality check (QC)
+  - [SeqKit](#seqkit) for preprocessing input amino acid sequences (i.e., gap removal, convert to upper case, validate, filter by length, replace special characters such as `/`, and remove duplicate sequences)
+
 - [Functional Annotation](#functional-annotation) Annotate proteins with functional domains
   - [InterProScan](#Interproscan) - Search the InterPro database for functional domains
 - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
-- [SeqKit stats](#seqkit_stats) - Simple statistics for protein FASTA files
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

+### Quality check and preprocessing
+
+#### SeqFu
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `qc/`
+  - `<samplename>/`
+    - `<samplename>_before.tsv`: Statistics for the input amino acid sequences before preprocessing
+    - `<samplename>_before_mqc.txt`: Statistics for the input amino acid sequences in MultiQC-ready format before preprocessing
+    - `<samplename>_after.tsv`: (optional) Statistics for the input amino acid sequences after preprocessing
+    - `<samplename>_after_mqc.txt`: (optional) Statistics for the input amino acid sequences in MultiQC-ready format after preprocessing
+    - `<samplename>.log`: (optional) Output file with count of duplicate sequences that were found and removed
+
+</details>
+
+The `seqfu` module is used for statistics generation of input amino acid sequences, both before and after preprocessing.
+
+[SeqFu](https://github.com/telatin/seqfu2) is a cross-platform compiled suite of tools to manipulate and inspect `FASTA` and `FASTQ` files.
+
+#### SeqKit
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `qc/`
+  - `<samplename>/`
+    - `<samplename>.<suffix>`: Updated preprocessed input fasta file
+
+</details>
+
+The `seqkit` module is used for initial preprocessing (i.e., gap removal, convert to upper case, validate, filter by length, replace special characters such as `/`, and remove duplicate sequences) of the input amino acid sequences.
+
+[SeqKit](https://github.com/shenwei356/seqkit) is a cross-platform and ultrafast toolkit for FASTA/Q file manipulation.
+
 ### Functional Annotation

 #### InterProScan
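
The SeqKit preprocessing steps listed in the `docs/output.md` hunk above (gap removal, upper-casing, validation, length filtering, replacing `/`, duplicate removal) can be illustrated with a small pure-Python sketch. This is not how SeqKit is implemented. The residue alphabet, the gap characters, the choice to raise an error on invalid residues, and the function name `preprocess` are all assumptions made for illustration, and the step order is inferred from the intermediate prefixes in `conf/modules.config`.

```python
# Assumed amino acid alphabet; SeqKit's own protein alphabet may differ.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZJUO*")
# Assumed gap characters.
GAP_CHARS = "-."


def preprocess(records, min_len, max_len, dedup_by_sequence=False):
    """Clean (name, sequence) pairs; names are assumed to be non-empty."""
    seen = set()
    cleaned = []
    for name, seq in records:
        # Gap removal and conversion to upper case.
        seq = "".join(ch for ch in seq if ch not in GAP_CHARS).upper()
        # Validation: reject residues outside the assumed alphabet.
        invalid = set(seq) - VALID_RESIDUES
        if invalid:
            raise ValueError(f"invalid residues in {name!r}: {sorted(invalid)}")
        # Length filter (inclusive bounds).
        if not (min_len <= len(seq) <= max_len):
            continue
        # Replace "/" in the record name, as in `-p "/" -r "_"`.
        name = name.replace("/", "_")
        # Duplicate removal: by sequence if requested, else by ID
        # (the first whitespace-delimited token of the name).
        key = seq if dedup_by_sequence else name.split()[0]
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((name, seq))
    return cleaned


if __name__ == "__main__":
    demo = [("sp/P1 desc", "mk-tay"), ("sp/P1 dup", "MKTAY"), ("p2", "MK"), ("p3", "MKTAY")]
    print(preprocess(demo, min_len=3, max_len=10))
    print(preprocess(demo, min_len=3, max_len=10, dedup_by_sequence=True))
```

Deduplicating by ID keeps `p3` even though its sequence matches the first record, whereas deduplicating by sequence drops it. That is the difference the `remove_duplicates_on_sequence` parameter controls.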

main.nf

Lines changed: 2 additions & 1 deletion
@@ -38,7 +38,8 @@ workflow NFCORE_PROTEINANNOTATOR {
     // WORKFLOW: Run pipeline
     //
     PROTEINANNOTATOR (
-        samplesheet
+        samplesheet,
+        params.skip_preprocessing
     )
     emit:
     multiqc_report = PROTEINANNOTATOR.out.multiqc_report // channel: /path/to/multiqc_report.html
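
The `skip_preprocessing` flag passed to `PROTEINANNOTATOR` presumably decides whether the SeqKit steps and the second SeqFu run take place. The workflow body is not shown in this diff, so the sketch below is only a guess at the resulting SeqFu files per sample, built from the `publishDir` path and `ext.prefix` values in `conf/modules.config`. The function name and the mapping of the "(optional)" outputs to this flag are assumptions.

```python
from pathlib import PurePosixPath


def expected_seqfu_outputs(outdir: str, sample_id: str, skip_preprocessing: bool) -> list:
    """List the SeqFu stats files expected under <outdir>/qc/<sample_id>/."""
    qc_dir = PurePosixPath(outdir) / "qc" / sample_id
    # Assumption: the "after" statistics only exist when preprocessing runs.
    stages = ["before"] if skip_preprocessing else ["before", "after"]
    files = []
    for stage in stages:
        files.append(str(qc_dir / f"{sample_id}_{stage}.tsv"))
        files.append(str(qc_dir / f"{sample_id}_{stage}_mqc.txt"))
    return files


if __name__ == "__main__":
    # "results" and "sample1" are hypothetical example values.
    for path in expected_seqfu_outputs("results", "sample1", skip_preprocessing=False):
        print(path)
```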

modules.json

Lines changed: 25 additions & 0 deletions
@@ -20,6 +20,26 @@
             "git_sha": "af27af1be706e6a2bb8fe454175b0cdf77f47b49",
             "installed_by": ["modules"]
           },
+          "seqfu/stats": {
+            "branch": "master",
+            "git_sha": "e753770db613ce014b3c4bc94f6cba443427b726",
+            "installed_by": ["faa_seqfu_seqkit"]
+          },
+          "seqkit/replace": {
+            "branch": "master",
+            "git_sha": "41dfa3f7c0ffabb96a6a813fe321c6d1cc5b6e46",
+            "installed_by": ["faa_seqfu_seqkit"]
+          },
+          "seqkit/rmdup": {
+            "branch": "master",
+            "git_sha": "41dfa3f7c0ffabb96a6a813fe321c6d1cc5b6e46",
+            "installed_by": ["faa_seqfu_seqkit"]
+          },
+          "seqkit/seq": {
+            "branch": "master",
+            "git_sha": "41dfa3f7c0ffabb96a6a813fe321c6d1cc5b6e46",
+            "installed_by": ["faa_seqfu_seqkit"]
+          },
           "seqkit/stats": {
             "branch": "master",
             "git_sha": "81880787133db07d9b4c1febd152c090eb8325dc",
@@ -34,6 +54,11 @@
       },
       "subworkflows": {
         "nf-core": {
+          "faa_seqfu_seqkit": {
+            "branch": "master",
+            "git_sha": "15c0a7968179d3b717a9973a1c4f25beb8a9aa2b",
+            "installed_by": ["subworkflows"]
+          },
           "utils_nextflow_pipeline": {
             "branch": "master",
             "git_sha": "05954dab2ff481bcb999f24455da29a5828af08d",

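The `installed_by` field in the `modules.json` hunk above records which component pulled each module in. A short Python sketch shows how those entries could be queried. The excerpt reproduces the four module entries added in this commit, plus one hypothetical entry (`example/other`) for contrast; the helper name is illustrative.

```python
# Module entries copied from the diff, plus one hypothetical entry for contrast.
MODULES_EXCERPT = {
    "seqfu/stats": {
        "branch": "master",
        "git_sha": "e753770db613ce014b3c4bc94f6cba443427b726",
        "installed_by": ["faa_seqfu_seqkit"],
    },
    "seqkit/replace": {
        "branch": "master",
        "git_sha": "41dfa3f7c0ffabb96a6a813fe321c6d1cc5b6e46",
        "installed_by": ["faa_seqfu_seqkit"],
    },
    "seqkit/rmdup": {
        "branch": "master",
        "git_sha": "41dfa3f7c0ffabb96a6a813fe321c6d1cc5b6e46",
        "installed_by": ["faa_seqfu_seqkit"],
    },
    "seqkit/seq": {
        "branch": "master",
        "git_sha": "41dfa3f7c0ffabb96a6a813fe321c6d1cc5b6e46",
        "installed_by": ["faa_seqfu_seqkit"],
    },
    # Hypothetical entry, not taken from the diff.
    "example/other": {
        "branch": "master",
        "git_sha": "0000000000000000000000000000000000000000",
        "installed_by": ["modules"],
    },
}


def modules_installed_by(modules: dict, owner: str) -> list:
    """Return the sorted names of modules whose installed_by list contains owner."""
    return sorted(
        name for name, info in modules.items()
        if owner in info.get("installed_by", [])
    )


if __name__ == "__main__":
    print(modules_installed_by(MODULES_EXCERPT, "faa_seqfu_seqkit"))
```

All four new modules are attributed to the `faa_seqfu_seqkit` subworkflow rather than to `modules`, which suggests they arrived as dependencies of the subworkflow install.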
modules/nf-core/seqfu/stats/environment.yml

Lines changed: 7 additions & 0 deletions
Diff not rendered (generated file).

modules/nf-core/seqfu/stats/main.nf

Lines changed: 52 additions & 0 deletions
Diff not rendered (generated file).

modules/nf-core/seqfu/stats/meta.yml

Lines changed: 67 additions & 0 deletions
Diff not rendered (generated file).
