Merged
3 changes: 0 additions & 3 deletions .github/workflows/awsfulltest.yml
@@ -24,9 +24,6 @@ jobs:

- name: Launch workflow via Seqera Platform
uses: seqeralabs/action-tower-launch@v2
-# TODO nf-core: You can customise AWS full pipeline tests as required
-# Add full size test data (but still relatively small datasets for few samples)
-# on the `test_full.config` test runs with only one set of parameters
with:
workspace_id: ${{ vars.TOWER_WORKSPACE_ID }}
access_token: ${{ secrets.TOWER_ACCESS_TOKEN }}
4 changes: 1 addition & 3 deletions CHANGELOG.md
@@ -3,12 +3,10 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

-## v1.0.0 - [2026/02/04]
+## v1.0.0 - Yellow Saiga - [2026/02/04]

Initial release of nf-core/proteinannotator, created with the [nf-core](https://nf-co.re/) template.

-### `Added`
-
- [#68](https://github.com/nf-core/proteinannotator/pull/68) - Using the `ARIA2` and `UNTAR` nf-core modules to download and decompress the InterProScan database. (by @vagkaratzas)
- [#67](https://github.com/nf-core/proteinannotator/pull/67) - Swapped to the updated, non-buggy, nf-core version of `INTERPROSCAN`. (by @vagkaratzas)
- [#65](https://github.com/nf-core/proteinannotator/pull/65) - Converted the pipeline schematic to nf-core metromap. (by @vagkaratzas)
7 changes: 3 additions & 4 deletions README.md
@@ -21,13 +21,12 @@

## Introduction

-**nf-core/proteinannotator** is a bioinformatics pipeline that runs statistics of input protein fasta files and identifies
-protein annotations such as conserved domains, functions and secondary structure features, based on their sequence data.
+**nf-core/proteinannotator** is a bioinformatics pipeline that computes statistics on input protein FASTA files and identifies protein annotations such as conserved domains, predicted functions, and secondary structure features based on sequence data.

<p>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="docs/images/proteinannotator_metromap_dark.png">
-<img alt="Protein annotator metromap. Protein fasta files are summarized with `seqkit stats`, then functionally annotated with InterProScan, DIAMOND-blastp, UniFire, and Kmerseek" src="docs/images/proteinannotator_metromap_light.png">
+<img alt="nf-core/proteinannotator" src="docs/images/proteinannotator_metromap_light.png">
</picture>
</p>

@@ -59,7 +58,7 @@ species1,species1_proteins.fasta
species2,species2_proteins.fasta
```

-Each row represents a fasta file of proteins from a single species.
+Each row represents a FASTA file of proteins from a single species.
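
As a reading aid for the samplesheet rows in this hunk, the following Python sketch parses them into one record per species. The header row is not visible in the diff, so the field names `id` and `fasta` are hypothetical labels chosen for illustration, and `read_rows` is not part of the pipeline.

```python
import csv
import io

# The two rows visible in the README hunk; the header line is outside the hunk.
SAMPLESHEET_ROWS = """species1,species1_proteins.fasta
species2,species2_proteins.fasta
"""


def read_rows(text, fieldnames=("id", "fasta")):
    """Return one dict per samplesheet row.

    The field names are assumptions made for this sketch, not taken from the pipeline.
    """
    reader = csv.DictReader(io.StringIO(text), fieldnames=list(fieldnames))
    return [dict(row) for row in reader]


rows = read_rows(SAMPLESHEET_ROWS)
for row in rows:
    # One FASTA file of proteins per species, as the README sentence states.
    print(row["id"], "->", row["fasta"])
```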

Now, you can run the pipeline using:

10 changes: 5 additions & 5 deletions docs/output.md
@@ -10,8 +10,8 @@ The directories listed below will be created in the results directory after the

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

-- [Quality check and preprocessing](#quality-check-and-preprocessing)
-- [SeqFu](#seqfu) for input amino acid sequences quality check (QC)
+- [Quality control and preprocessing](#quality-control-and-preprocessing)
+- [SeqFu](#seqfu) for input amino acid sequences quality control (QC)
- [SeqKit](#seqkit) for preprocessing input amino acid sequences (i.e., gap removal, convert to upper case, validate, filter by length, replace special characters such as `/`, and remove duplicate sequences)

- [Database download](#database-download) Optionally download selected databases for annotation.
@@ -23,12 +23,12 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Functional annotation](#functional-annotation) Annotate proteins with functional domains
- [InterProScan](#Interproscan) - Search the InterProScan database for functional domains

-- [s4pred](#s4pred) - Predict secondary structures of sequences, producing per amino acid probabilities of being an α-helix, a β-strand or a coil.
+- [s4pred](#s4pred) - Predict secondary structures of sequences, producing amino acid level probabilities of forming an α-helix, a β-strand or a coil.

- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
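
The preprocessing steps listed for SeqKit above (gap removal, upper-casing, length filtering, replacing special characters such as `/`, and duplicate removal) can be illustrated with a small Python sketch. This is not SeqKit's implementation and does not call SeqKit. The gap character `-`, the `_` replacement, the length thresholds, and de-duplication by sequence rather than by name are all assumptions made for the example.

```python
def preprocess(records, min_len=1, max_len=None):
    """Clean (name, sequence) pairs; a hypothetical stand-in for the listed steps."""
    seen = set()
    cleaned = []
    for name, seq in records:
        seq = seq.replace("-", "").upper()  # gap removal (assumes '-' gaps), upper case
        name = name.replace("/", "_")       # replace the special character '/'
        if len(seq) < min_len:
            continue                        # filter by minimum length
        if max_len is not None and len(seq) > max_len:
            continue                        # filter by maximum length
        if seq in seen:
            continue                        # drop duplicate sequences
        seen.add(seq)
        cleaned.append((name, seq))
    return cleaned


records = [("sp/1", "mk-tA"), ("sp2", "MKTA"), ("sp3", "MK")]
print(preprocess(records, min_len=3))
```

In this toy input the second record is dropped as a duplicate of the first after cleaning, and the third is dropped by the length filter.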

-### Quality check and preprocessing
+### Quality control and preprocessing

#### SeqFu

@@ -127,7 +127,7 @@ See also [InterProScan output documentation](https://interproscan-docs.readthedo

##### Generic Feature Format Version 3 (GFF3) Output

-The GFF3 format is a flat tab-delimited file, which is much richer then the TSV output format. It allows you to trace back from matches to predicted proteins and to nucleic acid sequences. It also contains a FASTA format representation of the predicted protein sequences and their matches. You will find a documentation of all the columns and attributes used on http://www.sequenceontology.org/gff3.shtml.
+The GFF3 format is a flat tab-delimited file, which is much richer then the TSV output. It allows you to trace back from matches to predicted proteins and to nucleic acid sequences. It also contains a FASTA format representation of the predicted protein sequences and their matches. You will find a documentation of all the columns and attributes used on http://www.sequenceontology.org/gff3.shtml.

<details markdown="1">
<summary>Example InterProScan GFF output</summary>
2 changes: 1 addition & 1 deletion docs/usage.md
@@ -80,7 +80,7 @@ You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-c

### InterProScan

-[InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow without `--skip_interproscan` will download and unzip the [InterPro database](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.72-103.0/) version 5.72-103.0. The database will then be saved in the output directory `<output_directory>/databases/interproscan/`.
+[InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow without `--skip_interproscan` will download and unzip the InterPro database. The database will then be saved in the output directory `<output_directory>/downloaded_dbs/interproscan_db/`.

:::note
The huge database download (5.5GB) can take up to 4 hours depending on the bandwidth.
2 changes: 1 addition & 1 deletion ro-crate-metadata.json

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions subworkflows/local/domain_annotation/main.nf
@@ -30,11 +30,11 @@ workflow DOMAIN_ANNOTATION {
ch_pfam_db = channel.of([ [ id: 'pfam' ], pfam_db ])
}

-ch_input_for_hmmsearch = ch_fasta
+ch_input_for_hmmsearch_pfam = ch_fasta
.combine(ch_pfam_db)
.map{ meta, seqs, _meta2, models -> [meta, models, seqs, false, false, true] }

-HMMSEARCH_PFAM( ch_input_for_hmmsearch )
+HMMSEARCH_PFAM( ch_input_for_hmmsearch_pfam )
ch_versions = ch_versions.mix( HMMSEARCH_PFAM.out.versions.first() )
ch_pfam_domains = HMMSEARCH_PFAM.out.domain_summary
}
@@ -50,11 +50,11 @@ workflow DOMAIN_ANNOTATION {
ch_funfam_db = channel.of([ [ id: 'funfam' ], funfam_db ])
}

-ch_input_for_hmmsearch = ch_fasta
+ch_input_for_hmmsearch_funfam = ch_fasta
.combine(ch_funfam_db)
.map{ meta, seqs, _meta2, models -> [meta, models, seqs, false, false, true] }

-HMMSEARCH_FUNFAM( ch_input_for_hmmsearch )
+HMMSEARCH_FUNFAM( ch_input_for_hmmsearch_funfam )
ch_versions = ch_versions.mix( HMMSEARCH_FUNFAM.out.versions.first() )
ch_funfam_domains = HMMSEARCH_FUNFAM.out.domain_summary
}
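
For readers less familiar with Nextflow channel operators, the tuple reshaping done by `.combine(...)` followed by `.map{ ... }` in these hunks can be modelled in plain Python. This is a sketch rather than Nextflow code: lists stand in for channels, `combine` is a hypothetical helper that mimics the keyless cartesian product with flattened tuples, and the file names are made up.

```python
from itertools import product


def combine(left, right):
    """Pair every left tuple with every right tuple and flatten each pair."""
    return [tuple(l) + tuple(r) for l, r in product(left, right)]


# Stand-ins for the channels in the hunk; all values are illustrative.
ch_fasta = [
    ({"id": "species1"}, "species1.fasta"),
    ({"id": "species2"}, "species2.fasta"),
]
ch_pfam_db = [({"id": "pfam"}, "Pfam-A.hmm")]

# Same reordering as the closure in the diff:
# meta, seqs, _meta2, models -> [meta, models, seqs, false, false, true]
ch_input_for_hmmsearch_pfam = [
    [meta, models, seqs, False, False, True]
    for meta, seqs, _meta2, models in combine(ch_fasta, ch_pfam_db)
]

for item in ch_input_for_hmmsearch_pfam:
    print(item)
```

Each FASTA tuple is paired with the single database tuple, the database's own meta map is discarded, and the model file is moved ahead of the sequences.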
2 changes: 1 addition & 1 deletion subworkflows/local/functional_annotation/main.nf
@@ -14,7 +14,7 @@ workflow FUNCTIONAL_ANNOTATION {
ch_versions = channel.empty()

if (!skip_interproscan) {
-if (interproscan_db != null) {
+if (interproscan_db) {
ch_interproscan_db = channel.fromPath(interproscan_db).first()
}
else {
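
The switch from `interproscan_db != null` to `if (interproscan_db)` changes behaviour for values that are set but empty, because Groovy's truthiness rules treat both `null` and the empty string as false. The Python sketch below is an analogy only: Python's `None` and `bool("")` behave the same way for these inputs, and the function names and the example path are hypothetical.

```python
def old_check(interproscan_db):
    # Analog of `interproscan_db != null`: only an unset value fails.
    return interproscan_db is not None


def new_check(interproscan_db):
    # Analog of `if (interproscan_db)`: unset and empty values both fail.
    return bool(interproscan_db)


for value in (None, "", "/path/to/interproscan_db"):
    print(repr(value), old_check(value), new_check(value))
```

Under this reading, an empty-string parameter counted as a provided database with the old check, while the new check sends it to the `else` branch.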
14 changes: 3 additions & 11 deletions workflows/proteinannotator.nf
@@ -42,16 +42,8 @@ workflow PROTEINANNOTATOR {
FAA_SEQFU_SEQKIT( ch_samplesheet, skip_preprocessing )
ch_versions = ch_versions.mix( FAA_SEQFU_SEQKIT.out.versions )

-// Replace input fasta and join back in samplesheet to ensure in sync in case of multiple sequence files
-ch_samplesheet_updated = ch_samplesheet
-.combine(FAA_SEQFU_SEQKIT.out.fasta, by: 0)
-.map {
-meta, _fasta, updated_fasta ->
-[ meta, updated_fasta ]
-}
-
DOMAIN_ANNOTATION (
-ch_samplesheet_updated,
+FAA_SEQFU_SEQKIT.out.fasta,
skip_pfam,
pfam_db,
pfam_latest_link,
@@ -62,15 +54,15 @@ workflow PROTEINANNOTATOR {
ch_versions = ch_versions.mix( DOMAIN_ANNOTATION.out.versions )

FUNCTIONAL_ANNOTATION (
-ch_samplesheet_updated,
+FAA_SEQFU_SEQKIT.out.fasta,
skip_interproscan,
interproscan_db_url,
interproscan_db
)
ch_versions = ch_versions.mix( FUNCTIONAL_ANNOTATION.out.versions )

if (!skip_s4pred) {
-S4PRED_RUNMODEL( ch_samplesheet_updated )
+S4PRED_RUNMODEL( FAA_SEQFU_SEQKIT.out.fasta )
ch_versions = ch_versions.mix( S4PRED_RUNMODEL.out.versions.first() )
}
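
One way to read the removal of `ch_samplesheet_updated` is that the keyed `combine` plus `map` reproduced what `FAA_SEQFU_SEQKIT.out.fasta` already emits. The Python sketch below models that with lists in place of channels. It assumes each `meta` appears exactly once on both sides and that the subworkflow passes `meta` through unchanged. `combine_by_first` is a hypothetical helper standing in for `combine(..., by: 0)`, and the file names are invented.

```python
def combine_by_first(left, right):
    """Pair items whose first element matches, keeping the shared key once."""
    out = []
    for l_key, *l_rest in left:
        for r_key, *r_rest in right:
            if l_key == r_key:
                out.append((l_key, *l_rest, *r_rest))
    return out


# Illustrative stand-ins for the two channels.
ch_samplesheet = [
    ({"id": "species1"}, "species1_proteins.fasta"),
    ({"id": "species2"}, "species2_proteins.fasta"),
]
seqkit_out_fasta = [
    ({"id": "species1"}, "species1.clean.fasta"),
    ({"id": "species2"}, "species2.clean.fasta"),
]

# Same reshaping as the removed closure: meta, _fasta, updated_fasta -> [meta, updated_fasta]
ch_samplesheet_updated = [
    (meta, updated_fasta)
    for meta, _fasta, updated_fasta in combine_by_first(ch_samplesheet, seqkit_out_fasta)
]

print(ch_samplesheet_updated == seqkit_out_fasta)
```

If either assumption fails, for example a `meta` that is modified during preprocessing, the two channels would no longer match and the keyed join would drop the unmatched items.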
