Merge pull request #74 from vagkaratzas/changes-second-review

vagkaratzas · web-flow · commit bf1bee706aa1 · 2026-02-09T12:25:39.000Z
second reviewer comments resolve
diff --git a/.nf-core.yml b/.nf-core.yml
@@ -11,7 +11,7 @@ lint:
 nf_core_version: 3.5.1
 repository_type: pipeline
 template:
-  author: Olga Botvinnik
+  author: Olga Botvinnik, Evangelos Karatzas
   description: Generation of sequence-level annotations for amino acid sequences
   version: 1.0.0
   force: true
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,7 +3,7 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## v1.0.0 - Yellow Saiga - [2026/02/04]
+## v1.0.0 - Yellow Saiga - [2026/02/09]
 
 Initial release of nf-core/proteinannotator, created with the [nf-core](https://nf-co.re/) template.
 
diff --git a/README.md b/README.md
@@ -21,7 +21,7 @@
 
 ## Introduction
 
-**nf-core/proteinannotator** is a bioinformatics pipeline that computes statistics on input protein FASTA files and identifies protein annotations such as conserved domains, predicted functions, and secondary structure features based on sequence data.
+**nf-core/proteinannotator** is a bioinformatics pipeline that computes statistics for protein FASTA inputs and produces protein annotations based on predicted sequence features, including conserved domains, functions, and secondary structure.
 
 <p>
   <picture>
@@ -82,11 +82,13 @@ For more details about the output files and reports, please refer to the
 
 ## Credits
 
-nf-core/proteinannotator was originally written by Olga Botvinnik.
+nf-core/proteinannotator was originally written by Olga Botvinnik and Evangelos Karatzas.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 
-- [Evangelos Karatzas](https://github.com/vagkaratzas)
+- [Michael L Heuer](https://github.com/heuermh)
+- [Edmund Miller](https://github.com/edmundmiller)
+- [Eric Wei](https://github.com/eweizy)
 - [Martin Beracochea](https://github.com/mberacochea)
 
 ## Contributions and Support
diff --git a/docs/output.md b/docs/output.md
@@ -13,18 +13,13 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [Quality control and preprocessing](#quality-control-and-preprocessing)
   - [SeqFu](#seqfu) for input amino acid sequences quality control (QC)
   - [SeqKit](#seqkit) for preprocessing input amino acid sequences (i.e., gap removal, convert to upper case, validate, filter by length, replace special characters such as `/`, and remove duplicate sequences)
-
 - [Database download](#database-download) Optionally download selected databases for annotation.
   - [aria2](#aria2) - To optionally download the Pfam, FunFam, and/or InterProScan databases through the pipeline.
-
 - [Domain annotation](#domain-annotation) Annotate proteins with domains from established repositories.
   - [hmmer](#hmmer) - To optionally match the input sequence to known Pfam and/or FunFam domains through `hmmer/hmmsearch`
-
 - [Functional annotation](#functional-annotation) Annotate proteins with functional domains
   - [InterProScan](#Interproscan) - Search the InterProScan database for functional domains
-
 - [s4pred](#s4pred) - Predict secondary structures of sequences, producing amino acid level probabilities of forming an α-helix, a β-strand or a coil.
-
 - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
diff --git a/docs/usage.md b/docs/usage.md
@@ -80,10 +80,10 @@ You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-c
 
 ### InterProScan
 
-[InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow without `--skip_interproscan` will download and unzip the InterPro database. The database will then be saved in the output directory `<output_directory>/downloaded_dbs/interproscan_db/`.
+[InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow without `--skip_interproscan` will download and unzip the InterPro database. The database will then be saved in the output directory `<output_directory>/downloaded_dbs/interproscan_db/`. We recommend keeping a copy of this directory for future reuse in case the results folder is deleted.
 
 :::note
-The huge database download (5.5GB) can take up to 4 hours depending on the bandwidth.
+The large database download (5.5GB) can take up to 4 hours depending on the bandwidth.
 :::
 
 A local version of the database can be supplied to the pipeline by passing the InterProScan database directory to `--interproscan_db <path/to/downloaded-untarred-interproscan_db-dir/>`. The directory can be created by running (e.g. for database version 5.72-103.0):
diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -220,14 +220,14 @@
                     "default": 30,
                     "fa_icon": "fas fa-ruler-horizontal",
                     "description": "The minimum allowed sequence length",
-                    "help_text": "Specify the minimum length of amino acid sequences that go into clustering."
+                    "help_text": "Specify the minimum length of amino acid sequences that go into clustering. Modifies the --min-len parameter of seqkit seq."
                 },
                 "max_seq_length": {
                     "type": "integer",
                     "default": 5000,
                     "fa_icon": "fas fa-ruler-horizontal",
                     "description": "The maximum allowed sequence length",
-                    "help_text": "Specify the maximum length of amino acid sequences that go into clustering."
+                    "help_text": "Specify the maximum length of amino acid sequences that go into clustering. Modifies the --max-len parameter of seqkit seq."
                 },
                 "remove_duplicates_on_sequence": {
                     "type": "boolean",
@@ -279,7 +279,7 @@
                 "hmmsearch_evalue_cutoff": {
                     "type": "number",
                     "default": 0.001,
-                    "description": "hmmsearch e-value cutoff threshold for reported results"
+                    "description": "hmmsearch e-value cutoff threshold for reported results. Modifies the -E parameter of hmmsearch."
                 }
             }
         },
@@ -339,7 +339,7 @@
                 "s4pred_outfmt": {
                     "type": "string",
                     "default": "ss2",
-                    "description": "Choose the output format (i.e., 'ss2', 'fas', 'horiz') for the s4pred per amino acid probability predictions (i.e., α-helix, β-strand, coil).",
+                    "description": "Choose the output format (i.e., 'ss2', 'fas', 'horiz') for the s4pred per amino acid probability predictions (i.e., α-helix, β-strand, coil). Modifies the --outfmt parameter of s4pred run_model.",
                     "help_text": "ss2 is the default and it corresponds to the PSIPRED vertical format (PSIPRED VFORMAT). The fas output returns the sequence FASTA file with the predicted secondary structure concatenated on a second line. The horiz option outputs the results in the PSIPRED horizontal format (PSIPRED HFORMAT).",
                     "enum": ["ss2", "fas", "horiz"]
                 }
diff --git a/ro-crate-metadata.json b/ro-crate-metadata.json
diff --git a/subworkflows/local/domain_annotation/meta.yml b/subworkflows/local/domain_annotation/meta.yml
@@ -17,6 +17,7 @@ input:
       type: file
       description: |
         Amino acid fasta file containing amino acid sequences for annotation
+        Structure: [ val(meta), [ path(fasta) ] ]
   - skip_pfam:
       type: boolean
       description: |