Lab 04: Structuring Projects and Inputs
Go to exercise directory 04_structure_and_input and change to the pipeline directory.
This project is organized as follows:
.
├── fasta_seqs
│   ├── seqs_1.fna
│   ├── seqs_2.fna
│   ├── seqs_3.fna
│   ├── seqs_4.fna
│   └── seqs_5.fna
├── pipeline
│   ├── main.nf
│   ├── modules
│   │   └── fasta_utils.nf
│   └── nextflow.config
└── run_01.sh
Notice that we now have a modules subdirectory containing 'fasta_utils.nf'.
We also have a directory of 5 fasta sequence files to practice with.
Previously, we included our processes within the "main.nf" file. Although this is possible, it can quickly become very crowded as processes are added, so we will now practice writing our Nextflow processes in separate files and importing them into main.nf:
include { GET_HEADERS } from "./modules/fasta_utils.nf"
Here we import the process "GET_HEADERS" from the file "fasta_utils.nf" and it is available within the scope of the main.nf workflow.
This "main.nf" workflow contains a new type of syntax in the form of an if statement. If statements follow the Groovy language convention of if (condition) { code to perform if condition is true }. For example:
if (params.fasta_seqs) {
    ch_fastas = Channel.fromPath("${params.fasta_seqs}/*fna", checkIfExists: true)
    GET_HEADERS(ch_fastas)
}
This "if block" first tests whether the "fasta_seqs" parameter (from the "params" scope) holds a true value. If it does, a channel called ch_fastas is created using the fromPath channel factory, which emits every file in that directory whose name ends in "fna" (checkIfExists: true makes Nextflow raise an error if no such file exists). This channel is then passed into the GET_HEADERS process we included earlier.
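As a rough analogue only, the same "check the parameter, then collect matching files" logic can be sketched in plain shell. The scratch directory and file names below are hypothetical stand-ins for the lab's fasta_seqs folder, and Nextflow's own glob handling may differ in detail.

```shell
# Hypothetical scratch directory standing in for the lab's fasta_seqs folder.
fasta_seqs=$(mktemp -d)
touch "$fasta_seqs/seqs_1.fna" "$fasta_seqs/seqs_2.fna" "$fasta_seqs/notes.txt"

# Rough analogue of the if block: only proceed when a value was supplied.
matched=0
if [ -n "$fasta_seqs" ]; then
    # The pattern *fna matches any file name ending in "fna".
    for f in "$fasta_seqs"/*fna; do
        # If nothing matches, the shell leaves the pattern unexpanded; skip it.
        [ -e "$f" ] || continue
        echo "would send to GET_HEADERS: $(basename "$f")"
        matched=$((matched + 1))
    done
fi
echo "matched $matched file(s)"
```

Each file printed by the loop corresponds to one item the channel would emit, and therefore to one GET_HEADERS task.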
As we noticed from the "work" output earlier, we didn't actually have access to our results in a useful way. If we inspect "modules/fasta_utils.nf", we'll see that the process begins with publishDir. The lines preceding the Nextflow input line are called Directives. Many directives exist, and they can add further customization to your processes, including which container/package to use, the number of cpus, the executor type (local, slurm, etc.), memory, and many other critical runtime behaviors.
process GET_HEADERS {
    publishDir(path: "${publish_dir}/headers", mode: "symlink")

    input:
    path fasta_file

    output:
    path "*headers.txt", emit: ch_headers

    script:
    """
    grep "^>" ${fasta_file} > ${fasta_file.baseName}_headers.txt
    """
}
In the above example, the publishDir directive takes every file from the output channel (files ending in "headers.txt") and publishes it to the specified ${publish_dir}/headers directory. Because the mode is "symlink", the published entries are symbolic links to the files in the "work" directory. Notice the dollar sign before "publish_dir", signifying that it is a variable whose value has been defined somewhere. We didn't pass "publish_dir" as an input, so it must have been defined elsewhere. As covered in the lecture on Nextflow scope, variables in the process scope (and many others) can be defined in a configuration file.
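Before turning to the configuration file, it may help to see what this process amounts to outside of Nextflow. The following is a minimal shell sketch, not what Nextflow runs internally: the scratch directories, the work-directory hash, and the seqs_demo.fna file are all made up for illustration.

```shell
# Hypothetical scratch layout: a task work directory and a publish directory.
demo=$(mktemp -d)
task_dir="$demo/work/ab/123456"
publish_dir="$demo/results"
mkdir -p "$task_dir" "$publish_dir/headers"

# A tiny made-up FASTA file standing in for one of the lab inputs.
printf '>seq_a demo\nACGT\n>seq_b demo\nTTGA\n' > "$task_dir/seqs_demo.fna"

# Same grep as the script block. ${name%.*} strips the last extension,
# mirroring what fasta_file.baseName gives for a file named seqs_demo.fna.
fasta_file="$task_dir/seqs_demo.fna"
name=$(basename "$fasta_file")
headers="$task_dir/${name%.*}_headers.txt"
grep "^>" "$fasta_file" > "$headers"

# mode: "symlink" leaves the real file in the work directory and links to it.
ln -s "$headers" "$publish_dir/headers/"

ls -l "$publish_dir/headers"
cat "$publish_dir/headers/seqs_demo_headers.txt"
```

The listing shows that the published entry is only a link, so deleting the "work" directory would leave it dangling. That is one reason to consider a different publishDir mode, such as "copy", for results you want to keep.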
Let's inspect the included configuration file for this project, nextflow.config.
params {
    fasta_seqs = false
}
process {
    publish_dir = "${params.publish_dir}"
}
This is a very simple configuration file. The params scope sets a default value of false for "fasta_seqs", which is used in the workflow scope. The process scope defines the mysterious publish_dir variable we saw in "fasta_utils.nf". But where is the params.publish_dir variable coming from?
So far we've seen ways to define parameters in our main.nf workflow and within our configuration file. It may also be useful to define variables from the command line when we call our Nextflow pipeline. To do this, we use the convention of double-hyphens before the parameter name. For example, if we want to define our fasta_seqs and publish_dir variables, we could use the following syntax when calling our pipeline:
nextflow run ./pipeline/main.nf --fasta_seqs ./fasta_seqs --publish_dir results
which would make our main.nf workflow skip the