diff --git a/.nf-core.yml b/.nf-core.yml
index 08bb8b9..50507a8 100644
--- a/.nf-core.yml
+++ b/.nf-core.yml
@@ -11,7 +11,7 @@ lint:
nf_core_version: 3.5.1
repository_type: pipeline
template:
- author: Olga Botvinnik
+ author: Olga Botvinnik, Evangelos Karatzas
description: Generation of sequence-level annotations for amino acid sequences
version: 1.0.0
force: true
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1eaf602..6c034ab 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -3,7 +3,7 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
-## v1.0.0 - Yellow Saiga - [2026/02/04]
+## v1.0.0 - Yellow Saiga - [2026/02/09]
Initial release of nf-core/proteinannotator, created with the [nf-core](https://nf-co.re/) template.
diff --git a/README.md b/README.md
index cbc7baa..fb552f8 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@
## Introduction
-**nf-core/proteinannotator** is a bioinformatics pipeline that computes statistics on input protein FASTA files and identifies protein annotations such as conserved domains, predicted functions, and secondary structure features based on sequence data.
+**nf-core/proteinannotator** is a bioinformatics pipeline that computes statistics for protein FASTA inputs and produces protein annotations based on predicted sequence features, including conserved domains, functions, and secondary structure.
@@ -82,11 +82,13 @@ For more details about the output files and reports, please refer to the
## Credits
-nf-core/proteinannotator was originally written by Olga Botvinnik.
+nf-core/proteinannotator was originally written by Olga Botvinnik and Evangelos Karatzas.
We thank the following people for their extensive assistance in the development of this pipeline:
-- [Evangelos Karatzas](https://github.com/vagkaratzas)
+- [Michael L Heuer](https://github.com/heuermh)
+- [Edmund Miller](https://github.com/edmundmiller)
+- [Eric Wei](https://github.com/eweizy)
- [Martin Beracochea](https://github.com/mberacochea)
## Contributions and Support
diff --git a/docs/output.md b/docs/output.md
index dca8295..fcd3159 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -13,18 +13,13 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Quality control and preprocessing](#quality-control-and-preprocessing)
- [SeqFu](#seqfu) for input amino acid sequences quality control (QC)
- [SeqKit](#seqkit) for preprocessing input amino acid sequences (i.e., gap removal, convert to upper case, validate, filter by length, replace special characters such as `/`, and remove duplicate sequences)
-
- [Database download](#database-download) Optionally download selected databases for annotation.
- [aria2](#aria2) - To optionally download the Pfam, FunFam, and/or InterProScan databases through the pipeline.
-
- [Domain annotation](#domain-annotation) Annotate proteins with domains from established repositories.
- [hmmer](#hmmer) - To optionally match the input sequence to known Pfam and/or FunFam domains through `hmmer/hmmsearch`
-
- [Functional annotation](#functional-annotation) Annotate proteins with functional domains
- [InterProScan](#Interproscan) - Search the InterProScan database for functional domains
-
- [s4pred](#s4pred) - Predict secondary structures of sequences, producing amino acid level probabilities of forming an α-helix, a β-strand or a coil.
-
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
diff --git a/docs/usage.md b/docs/usage.md
index afe2933..3ac3aa4 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -80,10 +80,10 @@ You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-c
### InterProScan
-[InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow without `--skip_interproscan` will download and unzip the InterPro database. The database will then be saved in the output directory `/downloaded_dbs/interproscan_db/`.
+[InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow without `--skip_interproscan` will download and unzip the InterPro database. The database will then be saved in the output directory `/downloaded_dbs/interproscan_db/`. We recommend keeping a copy of this directory for future reuse in case the results folder is deleted.
:::note
-The huge database download (5.5GB) can take up to 4 hours depending on the bandwidth.
+The large database download (5.5GB) can take up to 4 hours depending on the bandwidth.
:::
A local version of the database can be supplied to the pipeline by passing the InterProScan database directory to `--interproscan_db `. The directory can be created by running (e.g. for database version 5.72-103.0):
diff --git a/nextflow_schema.json b/nextflow_schema.json
index 2e10e83..b7ad6d8 100644
--- a/nextflow_schema.json
+++ b/nextflow_schema.json
@@ -220,14 +220,14 @@
"default": 30,
"fa_icon": "fas fa-ruler-horizontal",
"description": "The minimum allowed sequence length",
- "help_text": "Specify the minimum length of amino acid sequences that go into clustering."
+ "help_text": "Specify the minimum length of amino acid sequences that go into clustering. Modifies the --min-len parameter of seqkit seq."
},
"max_seq_length": {
"type": "integer",
"default": 5000,
"fa_icon": "fas fa-ruler-horizontal",
"description": "The maximum allowed sequence length",
- "help_text": "Specify the maximum length of amino acid sequences that go into clustering."
+ "help_text": "Specify the maximum length of amino acid sequences that go into clustering. Modifies the --max-len parameter of seqkit seq."
},
"remove_duplicates_on_sequence": {
"type": "boolean",
@@ -279,7 +279,7 @@
"hmmsearch_evalue_cutoff": {
"type": "number",
"default": 0.001,
- "description": "hmmsearch e-value cutoff threshold for reported results"
+ "description": "hmmsearch e-value cutoff threshold for reported results. Modifies the -E parameter of hmmsearch."
}
}
},
@@ -339,7 +339,7 @@
"s4pred_outfmt": {
"type": "string",
"default": "ss2",
- "description": "Choose the output format (i.e., 'ss2', 'fas', 'horiz') for the s4pred per amino acid probability predictions (i.e., α-helix, β-strand, coil).",
+ "description": "Choose the output format (i.e., 'ss2', 'fas', 'horiz') for the s4pred per amino acid probability predictions (i.e., α-helix, β-strand, coil). Modifies the --outfmt parameter of s4pred run_model.",
"help_text": "ss2 is the default and it corresponds to the PSIPRED vertical format (PSIPRED VFORMAT). The fas output returns the sequence FASTA file with the predicted secondary structure concatenated on a second line. The horiz option outputs the results in the PSIPRED horizontal format (PSIPRED HFORMAT).",
"enum": ["ss2", "fas", "horiz"]
}
diff --git a/ro-crate-metadata.json b/ro-crate-metadata.json
index 6a10c1e..51bb9a3 100644
--- a/ro-crate-metadata.json
+++ b/ro-crate-metadata.json
@@ -22,8 +22,8 @@
"@id": "./",
"@type": "Dataset",
"creativeWorkStatus": "Stable",
- "datePublished": "2026-02-04T13:01:04+00:00",
- "description": "
\n \n \n \n \n
\n\n[](https://github.com/codespaces/new/nf-core/proteinannotator)\n[](https://github.com/nf-core/proteinannotator/actions/workflows/nf-test.yml)\n[](https://github.com/nf-core/proteinannotator/actions/workflows/linting.yml)[](https://nf-co.re/proteinannotator/results)[](https://doi.org/10.5281/zenodo.XXXXXXX)\n[](https://www.nf-test.com)\n\n[](https://www.nextflow.io/)\n[](https://github.com/nf-core/tools/releases/tag/3.5.1)\n[](https://docs.conda.io/en/latest/)\n[](https://www.docker.com/)\n[](https://sylabs.io/docs/)\n[](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/proteinannotator)\n\n[](https://nfcore.slack.com/channels/proteinannotator)[](https://bsky.app/profile/nf-co.re)[](https://mstdn.science/@nf_core)[](https://www.youtube.com/c/nf-core)\n\n## Introduction\n\n**nf-core/proteinannotator** is a bioinformatics pipeline that computes statistics on input protein FASTA files and identifies protein annotations such as conserved domains, predicted functions, and secondary structure features based on sequence data.\n\n
\n \n \n \n \n
\n\n### Check quality and pre-process\n\nGenerate input amino acid sequence statistics with ([`SeqFu`](https://github.com/telatin/seqfu2/)) and pre-process them (i.e., gap removal, convert to upper case, validate, filter by length, replace special characters such as `/`, and remove duplicate sequences) with ([`SeqKit`](https://github.com/shenwei356/seqkit/))\n\n### Annotate sequences\n\n1. Conserved domain annotation with ([`hmmer`](https://github.com/EddyRivasLab/hmmer/)) against databases\n such as [Pfam](https://ftp.ebi.ac.uk/pub/databases/Pfam/) and [FunFam](https://download.cathdb.info/cath/releases/all-releases/)\n2. Functional annotation:\n - ([`InterProScan`](https://interproscan-docs.readthedocs.io/en/v5/)) a software tool used to analyze protein sequences by scanning them against the signatures of protein families, domains, and sites in the [InterPro](https://www.ebi.ac.uk/interpro/) database, helping to identify their functional characteristics.\n3. Predict secondary structure compositional features such as \u03b1-helices, \u03b2-strands and coils with ([`s4pred`](https://github.com/psipred/s4pred))\n4. Present QC stats for input sequences before and after initial pre-processing with ([`MultiQC`](http://multiqc.info/))\n\n## Usage\n\n> [!NOTE]\n> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. 
Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.\n\nFirst, prepare a samplesheet with your input data that looks as follows:\n\n`samplesheet.csv`:\n\n```csv\nid,fasta\nspecies1,species1_proteins.fasta\nspecies2,species2_proteins.fasta\n```\n\nEach row represents a FASTA file of proteins from a single species.\n\nNow, you can run the pipeline using:\n\n```bash\nnextflow run nf-core/proteinannotator \\\n -profile \\\n --input samplesheet.csv \\\n --outdir \n```\n\n> [!WARNING]\n> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).\n\nFor more details and further functionality, please refer to the [usage documentation](https://nf-co.re/proteinannotator/usage) and the [parameter documentation](https://nf-co.re/proteinannotator/parameters).\n\n## Pipeline output\n\nTo see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/proteinannotator/results) tab on the nf-core website pipeline page.\nFor more details about the output files and reports, please refer to the\n[output documentation](https://nf-co.re/proteinannotator/output).\n\n## Credits\n\nnf-core/proteinannotator was originally written by Olga Botvinnik.\n\nWe thank the following people for their extensive assistance in the development of this pipeline:\n\n- [Evangelos Karatzas](https://github.com/vagkaratzas)\n- [Martin Beracochea](https://github.com/mberacochea)\n\n## Contributions and Support\n\nIf you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).\n\nFor further information or help, don't hesitate to get in touch on the [Slack 
`#proteinannotator` channel](https://nfcore.slack.com/channels/proteinannotator) (you can join with [this invite](https://nf-co.re/join/slack)).\n\n## Citations\n\n\n\n\nAn extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.\n\nYou can cite the `nf-core` publication as follows:\n\n> **The nf-core framework for community-curated bioinformatics pipelines.**\n>\n> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.\n>\n> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).\n",
+ "datePublished": "2026-02-09T10:42:29+00:00",
+ "description": "
\n \n \n \n \n
\n\n[](https://github.com/codespaces/new/nf-core/proteinannotator)\n[](https://github.com/nf-core/proteinannotator/actions/workflows/nf-test.yml)\n[](https://github.com/nf-core/proteinannotator/actions/workflows/linting.yml)[](https://nf-co.re/proteinannotator/results)[](https://doi.org/10.5281/zenodo.XXXXXXX)\n[](https://www.nf-test.com)\n\n[](https://www.nextflow.io/)\n[](https://github.com/nf-core/tools/releases/tag/3.5.1)\n[](https://docs.conda.io/en/latest/)\n[](https://www.docker.com/)\n[](https://sylabs.io/docs/)\n[](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/proteinannotator)\n\n[](https://nfcore.slack.com/channels/proteinannotator)[](https://bsky.app/profile/nf-co.re)[](https://mstdn.science/@nf_core)[](https://www.youtube.com/c/nf-core)\n\n## Introduction\n\n**nf-core/proteinannotator** is a bioinformatics pipeline that computes statistics for protein FASTA inputs and produces protein annotations based on predicted sequence features, including conserved domains, functions, and secondary structure.\n\n
\n \n \n \n \n
\n\n### Check quality and pre-process\n\nGenerate input amino acid sequence statistics with ([`SeqFu`](https://github.com/telatin/seqfu2/)) and pre-process them (i.e., gap removal, convert to upper case, validate, filter by length, replace special characters such as `/`, and remove duplicate sequences) with ([`SeqKit`](https://github.com/shenwei356/seqkit/))\n\n### Annotate sequences\n\n1. Conserved domain annotation with ([`hmmer`](https://github.com/EddyRivasLab/hmmer/)) against databases\n such as [Pfam](https://ftp.ebi.ac.uk/pub/databases/Pfam/) and [FunFam](https://download.cathdb.info/cath/releases/all-releases/)\n2. Functional annotation:\n - ([`InterProScan`](https://interproscan-docs.readthedocs.io/en/v5/)) a software tool used to analyze protein sequences by scanning them against the signatures of protein families, domains, and sites in the [InterPro](https://www.ebi.ac.uk/interpro/) database, helping to identify their functional characteristics.\n3. Predict secondary structure compositional features such as \u03b1-helices, \u03b2-strands and coils with ([`s4pred`](https://github.com/psipred/s4pred))\n4. Present QC stats for input sequences before and after initial pre-processing with ([`MultiQC`](http://multiqc.info/))\n\n## Usage\n\n> [!NOTE]\n> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. 
Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.\n\nFirst, prepare a samplesheet with your input data that looks as follows:\n\n`samplesheet.csv`:\n\n```csv\nid,fasta\nspecies1,species1_proteins.fasta\nspecies2,species2_proteins.fasta\n```\n\nEach row represents a FASTA file of proteins from a single species.\n\nNow, you can run the pipeline using:\n\n```bash\nnextflow run nf-core/proteinannotator \\\n -profile \\\n --input samplesheet.csv \\\n --outdir \n```\n\n> [!WARNING]\n> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).\n\nFor more details and further functionality, please refer to the [usage documentation](https://nf-co.re/proteinannotator/usage) and the [parameter documentation](https://nf-co.re/proteinannotator/parameters).\n\n## Pipeline output\n\nTo see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/proteinannotator/results) tab on the nf-core website pipeline page.\nFor more details about the output files and reports, please refer to the\n[output documentation](https://nf-co.re/proteinannotator/output).\n\n## Credits\n\nnf-core/proteinannotator was originally written by Olga Botvinnik and Evangelos Karatzas.\n\nWe thank the following people for their extensive assistance in the development of this pipeline:\n\n- [Michael L Heuer](https://github.com/heuermh)\n- [Edmund Miller](https://github.com/edmundmiller)\n- [Eric Wei](https://github.com/eweizy)\n- [Martin Beracochea](https://github.com/mberacochea)\n\n## Contributions and Support\n\nIf you would like to contribute to this pipeline, please see the [contributing 
guidelines](.github/CONTRIBUTING.md).\n\nFor further information or help, don't hesitate to get in touch on the [Slack `#proteinannotator` channel](https://nfcore.slack.com/channels/proteinannotator) (you can join with [this invite](https://nf-co.re/join/slack)).\n\n## Citations\n\n\n\n\nAn extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.\n\nYou can cite the `nf-core` publication as follows:\n\n> **The nf-core framework for community-curated bioinformatics pipelines.**\n>\n> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.\n>\n> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).\n",
"hasPart": [
{
"@id": "main.nf"
@@ -99,7 +99,7 @@
},
"mentions": [
{
- "@id": "#fd91eb9d-1f99-4b3e-93f2-56d5b754e7a2"
+ "@id": "#aff5d966-2a2a-4cbf-bf15-44cdd5058ceb"
}
],
"name": "nf-core/proteinannotator"
@@ -132,11 +132,13 @@
}
],
"dateCreated": "",
- "dateModified": "2026-02-04T13:01:04Z",
+ "dateModified": "2026-02-09T10:42:29Z",
"dct:conformsTo": "https://bioschemas.org/profiles/ComputationalWorkflow/1.0-RELEASE/",
"keywords": [
"nf-core",
- "nextflow"
+ "nextflow",
+ "annotation",
+ "proteomics"
],
"license": [
"MIT"
@@ -176,11 +178,11 @@
"version": "!>=25.10.0"
},
{
- "@id": "#fd91eb9d-1f99-4b3e-93f2-56d5b754e7a2",
+ "@id": "#aff5d966-2a2a-4cbf-bf15-44cdd5058ceb",
"@type": "TestSuite",
"instance": [
{
- "@id": "#fa5732e5-538b-4606-9ac9-3caf643d317b"
+ "@id": "#5d20d507-f40f-4fcc-854d-5d27a47f2941"
}
],
"mainEntity": {
@@ -189,7 +191,7 @@
"name": "Test suite for nf-core/proteinannotator"
},
{
- "@id": "#fa5732e5-538b-4606-9ac9-3caf643d317b",
+ "@id": "#5d20d507-f40f-4fcc-854d-5d27a47f2941",
"@type": "TestInstance",
"name": "GitHub Actions workflow for testing nf-core/proteinannotator",
"resource": "repos/nf-core/proteinannotator/actions/workflows/nf-test.yml",
diff --git a/subworkflows/local/domain_annotation/meta.yml b/subworkflows/local/domain_annotation/meta.yml
index c4d9e5a..e04e241 100644
--- a/subworkflows/local/domain_annotation/meta.yml
+++ b/subworkflows/local/domain_annotation/meta.yml
@@ -17,6 +17,7 @@ input:
type: file
description: |
Amino acid fasta file containing amino acid sequences for annotation
+ Structure: [ val(meta), [ path(fasta) ] ]
- skip_pfam:
type: boolean
description: |