Lab 04: Structuring Projects and Inputs

Inspect project organization

Go to exercise directory 04_structure_and_input and change to the pipeline directory.

This project is organized as follows:

.
├── fasta_seqs
│   ├── seqs_1.fna
│   ├── seqs_2.fna
│   ├── seqs_3.fna
│   ├── seqs_4.fna
│   └── seqs_5.fna
├── pipeline
│   ├── main.nf
│   ├── modules
│   │   └── fasta_utils.nf
│   └── nextflow.config
└── run_01.sh

Notice that we now have a modules subdirectory containing 'fasta_utils.nf'.

We also have a directory of 5 fasta sequence files to practice with.

Importing processes

Previously, we included our processes within the "main.nf" file. Although this is possible, it can quickly become very crowded as processes are added, so we will now practice writing our Nextflow processes in separate files and importing them into main.nf:

include { GET_HEADERS } from "./modules/fasta_utils.nf"

Here we import the process "GET_HEADERS" from the file "fasta_utils.nf" and it is available within the scope of the main.nf workflow.

Workflow conditionals

This "main.nf" workflow contains a new type of syntax: an if statement. If statements follow the Groovy language convention of if (condition) { code to run when the condition is true }. For example:

if (params.fasta_seqs) {
    ch_fastas = Channel.fromPath("${params.fasta_seqs}/*fna", checkIfExists: true)
    GET_HEADERS(ch_fastas)
}

This "if block" first tests whether the parameter "fasta_seqs" (from the "params" scope) is set to a true value. If it is, a channel called ch_fastas is created using the fromPath channel factory, and this channel is passed into the GET_HEADERS process we included earlier.
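As a rough shell analogy for the block above (not how Nextflow actually evaluates it), the sketch below shows which files the "*fna" pattern would pick up and how an unset parameter skips the step. The temporary directory, the sample file contents, and the variable names demo_dir and matched are made up for illustration.

```shell
#!/bin/sh
# Shell analogy only: Nextflow evaluates the real condition in Groovy.
set -eu

# Build a throwaway input directory with two FASTA files (hypothetical data).
demo_dir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$demo_dir/seqs_1.fna"
printf '>seq2\nTTGA\n' > "$demo_dir/seqs_2.fna"
printf 'not a fasta\n' > "$demo_dir/notes.txt"

# Stands in for params.fasta_seqs; an empty string plays the role of "false".
fasta_seqs="$demo_dir"

matched=0
if [ -n "$fasta_seqs" ]; then
    # Same pattern as in main.nf: every path in the directory ending in "fna".
    for f in "$fasta_seqs"/*fna; do
        # If nothing matches, the pattern stays literal, so fail loudly.
        # This is roughly the situation checkIfExists: true guards against.
        [ -e "$f" ] || { echo "no FASTA files found in $fasta_seqs" >&2; exit 1; }
        echo "would run GET_HEADERS on $(basename "$f")"
        matched=$((matched + 1))
    done
fi
echo "$matched file(s) matched"
```

Setting fasta_seqs to an empty string makes the whole block a no-op, which mirrors leaving the parameter at its default of false.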

Publish directories

As we noticed from the "work" output earlier, we didn't actually have access to our results in a useful way. If we inspect "modules/fasta_utils.nf", we'll see that the process begins with publishDir. The lines preceding the Nextflow input line are called directives. Many directives exist, and they can further customize your processes, including which container/package to use, the number of cpus, the executor type (local, slurm, etc.), memory, and many other critical runtime behaviors.

process GET_HEADERS {
    publishDir(path: "${publish_dir}/headers", mode: "symlink")

    input:
        path fasta_file

    output:
        path "*headers.txt", emit: ch_headers

    script:
        """
        grep "^>" ${fasta_file} > ${fasta_file.baseName}_headers.txt
        """
}

In the above example, the publishDir directive takes the files declared in the output block (files ending in "headers.txt") and publishes them, here as symbolic links, to the specified $publish_dir/headers directory. Notice the dollar sign before "publish_dir": it signifies a variable whose value has been defined somewhere. We didn't pass "publish_dir" as an input variable, so it must have been defined elsewhere. As covered in the lecture on Nextflow scope, variables in the process scope (and many others) can be defined in a configuration file.
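To make the effect of one GET_HEADERS task more concrete, here is a minimal shell sketch of roughly what happens for a single input file: the grep command from the script block runs in a work directory, and the matching output is then linked into the publish directory. The temporary directories, the sample sequences, and the variable names work_dir and base_name are assumptions for illustration; Nextflow manages its own work directories and linking internally.

```shell
#!/bin/sh
set -eu

work_dir=$(mktemp -d)       # stands in for a Nextflow task work directory
publish_dir=$(mktemp -d)    # stands in for the value of publish_dir

# Hypothetical two-record FASTA file.
fasta_file="seqs_1.fna"
printf '>seq1 sample\nACGT\n>seq2 sample\nGGCC\n' > "$work_dir/$fasta_file"

cd "$work_dir"
# Like fasta_file.baseName in the process: drop the last extension.
base_name="${fasta_file%.*}"
# The same grep as in the script block: keep only the header lines.
grep "^>" "$fasta_file" > "${base_name}_headers.txt"

# Approximation of publishDir with mode "symlink":
# link every output matching *headers.txt into <publish_dir>/headers.
mkdir -p "$publish_dir/headers"
for out in *headers.txt; do
    ln -s "$work_dir/$out" "$publish_dir/headers/$out"
done

cat "$publish_dir/headers/seqs_1_headers.txt"
```

Because the published file is only a link, deleting the work directory would leave it dangling, which is worth keeping in mind when choosing a publish mode.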

Configuration files

Let's inspect the included configuration file for this project, nextflow.config.

⚠️ Nextflow automatically looks for a configuration file named "nextflow.config", so the main configuration file must use that name. This configuration file can in turn include additional configuration files.

params {
    fasta_seqs = false
}

process {
    publish_dir = "${params.publish_dir}"
}

This is a very simple configuration file. The params scope sets a default value of false for "fasta_seqs", which is used in the workflow scope. The process scope defines the mysterious publish_dir variable we saw in "fasta_utils.nf". But where is the params.publish_dir variable coming from?

Input variables

So far we've seen ways to define parameters in our main.nf workflow and within our configuration file. It may also be useful to define variables from the command line when we call our Nextflow pipeline. To do this, we use the convention of double-hyphens before the parameter name. For example, if we want to define our fasta_seqs and publish_dir variables, we could use the following syntax when calling our pipeline:

nextflow run ./pipeline/main.nf --fasta_seqs ./fasta_seqs --publish_dir results

⚠️ Single hyphens before option names are reserved for Nextflow-specific runtime options (e.g., "-resume", "-version", "-with-report").
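A launcher script is a convenient place to keep such a command. The contents of run_01.sh are not shown in this lab, so the following is only a hypothetical sketch of what a wrapper like it might contain, run from the project root. The variable names fasta_dir, publish_dir, and cmd are illustrative.

```shell
#!/bin/sh
# Hypothetical launcher; the real run_01.sh is not shown in this lab.
set -eu

fasta_dir="./fasta_seqs"     # directory holding the .fna inputs
publish_dir="results"        # becomes params.publish_dir inside the pipeline

# Double hyphens set pipeline parameters; single hyphens (e.g. -resume)
# are Nextflow's own options.
cmd="nextflow run ./pipeline/main.nf --fasta_seqs $fasta_dir --publish_dir $publish_dir"
echo "$cmd"

# Only launch when Nextflow and the pipeline are actually present.
# $cmd is left unquoted on purpose so it splits into words; this is
# safe here because none of the paths contain spaces.
if command -v nextflow >/dev/null 2>&1 && [ -f ./pipeline/main.nf ]; then
    $cmd
fi
```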

Configuration variable priority

which would make our main.nf workflow skip the
