Sydney-Informatics-Hub
diff --git a/docs/assets/1.4_rnaseq_fail.png b/docs/assets/1.4_rnaseq_fail.png
212 KB
diff --git a/docs/assets/1.4_rnaseq_processes_start.png b/docs/assets/1.4_rnaseq_processes_start.png
276 KB
diff --git a/docs/assets/1.4_rnaseq_start.png b/docs/assets/1.4_rnaseq_start.png
221 KB
diff --git a/docs/session_1/1.4_rnaseq.md b/docs/session_1/1.4_rnaseq.md
Lines changed: 122 additions & 98 deletions
@@ -232,7 +232,47 @@ Our input FASTQ files (`fastqs/`), reference data (`mm10_reference/`), and full
     ```
 
     ```console title="Output"
-    
+    data
+    |-- fastq
+    |   |-- SRR3473988_selected.fastq.gz
+    |   `-- SRR3473989_selected.fastq.gz
+    |-- mm10_reference
+    |   |-- STAR
+    |   |   |-- Genome
+    |   |   |-- Log.out
+    |   |   |-- SA
+    |   |   |-- SAindex
+    |   |   |-- chrLength.txt
+    |   |   |-- chrName.txt
+    |   |   |-- chrNameLength.txt
+    |   |   |-- chrStart.txt
+    |   |   |-- exonGeTrInfo.tab
+    |   |   |-- exonInfo.tab
+    |   |   |-- geneInfo.tab
+    |   |   |-- genomeParameters.txt
+    |   |   |-- sjdbInfo.txt
+    |   |   |-- sjdbList.fromGTF.out.tab
+    |   |   |-- sjdbList.out.tab
+    |   |   `-- transcriptInfo.tab
+    |   |-- mm10_chr18.fa
+    |   |-- mm10_chr18.gtf
+    |   `-- salmon-index
+    |       |-- complete_ref_lens.bin
+    |       |-- ctable.bin
+    |       |-- ctg_offsets.bin
+    |       |-- duplicate_clusters.tsv
+    |       |-- info.json
+    |       |-- mphf.bin
+    |       |-- pos.bin
+    |       |-- pre_indexing.log
+    |       |-- rank.bin
+    |       |-- refAccumLengths.bin
+    |       |-- ref_indexing.log
+    |       |-- reflengths.bin
+    |       |-- refseq.bin
+    |       |-- seq.bin
+    |       `-- versionInfo.json
+    `-- samplesheet.csv
     ```
 
     Finally, take a look at the `samplesheet.csv` file to see what information the `nf-core/rnaseq` pipeline requires for each sample:
@@ -247,58 +287,7 @@ Our input FASTQ files (`fastqs/`), reference data (`mm10_reference/`), and full
     SRR3473989,/home/training/data/fastq/SRR3473989_selected.fastq.gz,,forward
     ```
 
-### Required input: `--input` and `--outdir`
-
-The pipeline requires us to define both an input samplesheet and an output directory to place our results. We supply these with the `--input` and `--outdir` parameters, respectively. We've already looked at our input samplesheet: `~/data/samplesheet.csv`. Our output directory can be named anything we want, and will be automatically created by Nextflow if it doesn't already exists.
-
-!!! example "Exercise 1.4.2.1"
-
-    Create a new file called `run.sh` and start writing a run command for the rnaseq pipeline. Start by providing the samplesheet as input. Also define an output directory called `lesson-1.4`.
-
-    ??? success "Solution"
-
-        First, create the new run script:
-
-        ```bash
-        touch run.sh
-        ```
-
-        Additionally, make sure it is executable:
-
-        ```bash
-        chmod +x run.sh
-        ```
-
-        Open the file within VSCode so you can easily edit it. Remember you can do this via the graphical interface or with the `code` command in the terminal:
-
-        ```bash
-        code run.sh
-        ```
-
-        Next, start by writing out the basic `nextflow run` command:
-
-        ```bash title="run.sh"
-        nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
-        ```
-
-        **Note** that we have added a space and a backslash (` \`) to the end of the line so we may continue writing the full command over multiple lines for legibility.
-
-        Next, add the `--input` parameter and pass it the path to the samplesheet. Be sure to replace `<USERNAME>` with your provided user name:
-
-        ```bash title="run.sh" hl_lines="2"
-        nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
-            --input /home/<USERNAME>/data/samplesheet.csv \
-        ```
-
-        Finally, add the `--outdir` parameter and give it the name `lesson-1.4`:
-
-        ```bash title="run.sh" hl_lines="3"
-        nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
-            --input /home/<USERNAME>/data/samplesheet.csv \
-            --outdir lesson-1.4 \
-        ```
-
-### Required input: reference data
+### Reference data
 
 Many nf-core pipelines have a minimum requirement for reference data inputs. The input reference data requirements for this pipeline are provided in the [usage documentation](https://nf-co.re/rnaseq/3.11.1/usage#reference-genome-files). To see what reference files we can specify using parameters, rerun the pipeline's help command to view all the available parameters.
 
@@ -355,15 +344,52 @@ For each of these parameters, we have the following files that we can use:
 
 **Note** that we are just using chr18 as it is a relatively small chromosome, so this should help to keep the run time for our exercises nice and short.
 
+### Writing the run command: required `--input` and `--outdir` parameters
+
+The pipeline requires us to define both an input samplesheet and an output directory in which to place our results. We supply these with the `--input` and `--outdir` parameters, respectively. We've already looked at our input samplesheet: `~/data/samplesheet.csv`. Our output directory can be named anything we want, and will be automatically created by Nextflow if it doesn't already exist.
+
+!!! example "Exercise 1.4.2.1"
+
+    Start writing a run command for the rnaseq pipeline. Begin by providing the samplesheet as input, then define an output directory called `lesson-1.4`.
+
+    ??? success "Solution"
+
+        Start by writing out the basic `nextflow run` command:
+
+        ```bash
+        nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
+        ```
+
+        **Note** that we have added a space and a backslash (` \`) to the end of the line so we may continue writing the full command over multiple lines for legibility. If you hit `Enter` now, the command won't run yet, but you will be provided a new line to continue writing.
+
+        Next, add the `--input` parameter and pass it the path to the samplesheet. Be sure to replace `<USERNAME>` with your provided user name:
+
+        ```bash hl_lines="2"
+        nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
+            --input /home/<USERNAME>/data/samplesheet.csv \
+        ```
+
+        Finally, add the `--outdir` parameter and give it the name `lesson-1.4`:
+
+        ```bash hl_lines="3"
+        nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
+            --input /home/<USERNAME>/data/samplesheet.csv \
+            --outdir lesson-1.4 \
+        ```
+
+### Writing the run command: reference data
+
+With the inputs and outputs defined, we next need to tell the pipeline where to find the necessary reference data. We have already determined the parameters and files we need to pass to the pipeline, so let's add them to the command now.
+
 !!! example "Exercise 1.4.2.2"
 
-    Add the reference file parameters and their respective file paths to the `run.sh` script.
+    Continue writing your run command by passing the reference files to their respective parameters.
 
     ??? success "Solution"
 
-        Add the following lines to the end of `run.sh`:
+        Following on from the last line from Exercise 1.4.2.1, add the `--fasta`, `--gtf`, `--star_index`, and `--salmon_index` parameters, and pass them the files we determined above in [Reference data](#reference-data):
 
-        ```bash title="run.sh" hl_lines="4-7"
+        ```bash hl_lines="4-7"
         nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
             --input /home/<USERNAME>/data/samplesheet.csv \
             --outdir lesson-1.4 \
@@ -375,43 +401,36 @@ For each of these parameters, we have the following files that we can use:
 
 ### Optional parameters
 
-Now that we have prepared our input and reference data, we will customise the typical run command by:
-
-1. Using Nextflow's `-profile` parameter to specify that we will be running the Singularity profile instead of the Docker profile
-2. Adding additional process-specific flags to [skip duplicate read marking](https://nf-co.re/rnaseq/3.23.0/parameters#skip_markduplicates), [save trimmed reads](https://nf-co.re/rnaseq/3.23.0/parameters#save_trimmed) and [save unaligned reads](https://nf-co.re/rnaseq/3.23.0/parameters#save_unaligned)
-
-The parameters we will use are:
+Now that we have prepared our input and reference data, we have defined all the required parameters for the pipeline. However, Nextflow still needs to be configured to use Singularity, and we will add an additional workflow parameter to help speed up the pipeline run for the sake of this workshop. The parameters we will use are:
 
 - `-profile singularity`
+    - Recall that this is a **Nextflow** parameter. It tells Nextflow to use nf-core's Singularity profile, rather than the default Docker profile, and to run each process using Singularity containers.
 - `--skip_markduplicates true`
-- `--save_trimmed true`
-- `--save_unaligned true`
+    - This is a pipeline parameter that tells the `rnaseq` pipeline to [skip duplicate read marking](https://nf-co.re/rnaseq/3.23.0/parameters#skip_markduplicates). Ordinarily we would want to run this step, but in the interest of time we will skip it for this workshop.
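+
+As a rough illustration of what selecting this profile does, a Singularity profile in a Nextflow configuration file generally enables the Singularity container engine along the following lines. This is a simplified sketch under that assumption, not a copy of the pipeline's actual configuration, which sets additional options:
+
+```groovy
+// Simplified sketch of a Singularity profile (not copied from nf-core/rnaseq)
+profiles {
+    singularity {
+        singularity.enabled    = true   // run each process inside a Singularity container
+        singularity.autoMounts = true   // automatically mount host paths into the container
+        docker.enabled         = false  // make sure Docker is not used at the same time
+    }
+}
+```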
 
 !!! example "Exercise 1.4.2.3"
 
     Add the optional parameters and the singularity profile to the run command.
 
     ??? success "Solution"
 
-        Add the following lines to the end of `run.sh`:
+        Finish writing the run command by adding the `-profile` and `--skip_markduplicates` parameters:
 
-        ```bash title="run.sh" hl_lines="8-11"
+        ```bash hl_lines="8-9"
         nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
-            --input /home/<USERNAME>/samplesheet.csv \
+            --input /home/<USERNAME>/data/samplesheet.csv \
             --outdir lesson-1.4 \
             --fasta /home/<USERNAME>/data/mm10_reference/mm10_chr18.fa \
             --gtf /home/<USERNAME>/data/mm10_reference/mm10_chr18.gtf \
             --star_index /home/<USERNAME>/data/mm10_reference/STAR \
             --salmon_index /home/<USERNAME>/data/mm10_reference/salmon-index \
             -profile singularity \
-            --skip_markduplicates true \
-            --save_trimmed true \
-            --save_unaligned true
+            --skip_markduplicates true
         ```
 
         **Remember** that `-profile` is a *Nextflow parameter* and therefore only uses a **single hyphen**. The remaining parameters are *workflow parameters* and use a **double hyphen**.
 
-        **Note** also that we have left off the trailing space and bashslash from the final line (`--save_unaligned true`) since this line concludes our initial run command.
+        **Note** also that we have left off the trailing space and backslash from the final line (`--skip_markduplicates true`) since this line concludes our initial run command.
 
 [hi](./1.3_configure.md#configuring-processes)
 
@@ -421,9 +440,23 @@ The parameters we will use are:
 
     The inclusion of `ext.args` is currently best practice for all DSL2 nf-core modules where additional parameters may be required to run a process. However, this may not be implemented for all modules in all nf-core pipelines. Depending on the pipeline, some process modules may not define the `ext.args` variable in their script blocks, in which case it is not available for customisation. If that is the case, consider submitting a feature request or making a pull request on the pipeline's GitHub repository to implement this!
 
-### Setting resource limits
+## 1.4.3 Run the pipeline
+
+You should now have a multi-line command in your terminal waiting to run. Now if you hit `Enter`, Nextflow should launch and the pipeline will start to run. It will take a few seconds to start up, and then you should start seeing processes spawning and running.
+
+![](../assets/1.4_rnaseq_start.png)
+
+![](../assets/1.4_rnaseq_processes_start.png)
+
+However, very quickly, we run into an error!
+
+![](../assets/1.4_rnaseq_fail.png)
 
-There is one thing left to do with our basic run command, and that is to set some resource limits. The `nf-core/rnaseq` pipeline is designed to run on large datasets and therefore expects to require lots of CPU and memory resources to run. However, we're using a small test dataset that doesn't need a lot of computing power, and as such we're also using low-resource VMs. Running the workflow with its default settings will cause it to crash due to insufficient CPU and memory requirements.
+What happened?
+
+## 1.4.4 Setting resource limits
+
+It turns out that there is one thing left to do in order to run the pipeline: set some **resource limits**. The `nf-core/rnaseq` pipeline is designed to run on large datasets, so by default its processes request large amounts of CPU and memory. However, we're using a small test dataset that doesn't need a lot of computing power, and as such we're also using low-resource VMs. Running the workflow with its default settings causes some of the processes to fail because they request more CPU and memory than our VMs can provide.
 
 We can fix this by telling Nextflow that we want to limit the resource requests from each process to an upper bound of 2 CPUs and 6GB of memory. We do this within a custom configuration file using the `process.resourceLimits` directive. This takes a list of upper resource limits like so:
 
@@ -435,22 +468,22 @@ process.resourceLimits = [
 ]
 ```
 
-!!! example "Exercise 1.4.2.4"
+!!! example "Exercise 1.4.4"
 
-    Create a file called `nextflow.config` within your current working directory (`~/session2`) and add the `resourceLimits` directive, giving our workflow a limit of 2 CPUs and 6GB of memory.
+    Create a configuration file called `nectar_vm.config` within your current working directory (`~/session2`) and add the `resourceLimits` directive, giving our workflow a limit of 2 CPUs and 6GB of memory.
 
     ??? success "Solution"
 
-        First, create the `nextflow.config` file:
+        First, create the `nectar_vm.config` file:
 
         ```bash
-        touch nextflow.config
-        code nextflow.config
+        touch nectar_vm.config
+        code nectar_vm.config
         ```
 
         Next, add the `resourceLimits` directive. You can do this in one of two ways. You can use the `process.resourceLimits` form as shown above:
 
-        ```groovy title="nextflow.config"
+        ```groovy title="nectar_vm.config"
         process.resourceLimits = [
             cpus: 2,
             memory: 6.GB
@@ -459,7 +492,7 @@ process.resourceLimits = [
 
         Alternatively, you can use the expanded version by nesting `resourceLimits` within a `process` scope:
 
-        ```groovy title="nextflow.config"
+        ```groovy title="nectar_vm.config"
         process {
             resourceLimits = [
                 cpus: 2,
@@ -470,11 +503,11 @@ process.resourceLimits = [
 
         The second form is preferable since we will need the `process` scope for configuring processes further in the second session.
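+
+        As a hedged illustration of how the limits take effect, consider a process that requests more than the VM can offer. `EXAMPLE_PROCESS` is a hypothetical name used only for this sketch; it is not a process in the pipeline:
+
+        ```groovy
+        process {
+            // Upper bounds applied to every process
+            resourceLimits = [
+                cpus: 2,
+                memory: 6.GB
+            ]
+
+            // Hypothetical process: its request of 8 CPUs and 36 GB exceeds the limits,
+            // so Nextflow reduces it to 2 CPUs and 6 GB before running the task
+            withName: 'EXAMPLE_PROCESS' {
+                cpus   = 8
+                memory = 36.GB
+            }
+        }
+        ```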
 
-We now have a finished initial run command. Note how we didn't update `run.sh` after creating the new `nextflow.config` file. Recall from the [previous session](./1.3_configure.md#131-introduction-to-nextflow-configuration) that Nextflow will automatically include a `nextflow.config` file in the launch directory in its configuration. So, if we have configuration options we want to include for every run, we can add them to `nextflow.config` and they will be automatically loaded without having to specify the file in our run command.
+We now have a configuration file that sets our resource limits. All that is left is to update our run command to include the new configuration file, and to tell Nextflow to resume from where it left off. There's no sense re-running jobs that already succeeded!
 
 Our final run command and custom config file look like:
 
-```bash title="run.sh"
+```bash hl_lines="11-13"
 nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
     --input /home/<USERNAME>/data/samplesheet.csv \
     --outdir lesson-1.4 \
@@ -485,10 +518,12 @@ nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
     -profile singularity \
     --skip_markduplicates true \
     --save_trimmed true \
-    --save_unaligned true
+    --save_unaligned true \
+    -c nectar_vm.config \
+    -resume
 ```
 
-```groovy title="nextflow.config"
+```groovy title="nectar_vm.config"
 process {
     resourceLimits = [
         cpus: 2,
@@ -497,17 +532,9 @@ process {
 }
 ```
 
-## 1.4.3 Run the pipeline
-
-Now all that is left to do is to run the pipeline!
+Go ahead and re-run the workflow. It should now run successfully to completion!
 
-!!! example "Run the pipeline"
-
-    Simply run the `run.sh` script to execute the pipeline:
-
-    ```bash
-    ./run.sh
-    ```
+## 1.4.5 Examine the outputs
 
 :eyes: Take a look at the stdout printed to the screen. Your workflow configuration and parameter customisations are all documented here. You can use this to confirm if your parameters have been correctly passed to the run command:
 
@@ -527,8 +554,6 @@ To understand how this is coordinated, consider the STAR_ALIGN process that is b
 - Once a TRIMGALORE task is completed for a sample, the STAR_ALIGN task for that sample begins 
 - When the STAR_ALIGN process starts, it spawns 2 tasks.
 
-## 1.4.4 Examine the outputs
-
 Once your pipeline has completed, you should see this message printed to your terminal:
 
 ```console title="Output"
@@ -556,10 +581,9 @@ In the meantime, list the contents of your directory. You will see a few new dir
     drwxr-x--- 16 tdev01 tdev01 4.0K Apr 20 01:26 ..
     drwxrwxr-x  8 tdev01 tdev01 4.0K Apr 19 23:51 lesson-1.4
     drwxrwxr-x  4 tdev01 tdev01 4.0K Apr 20 01:34 .nextflow
-    -rw-rw-r--  1 tdev01 tdev01   79 Apr 17 03:43 nextflow.config
+    -rw-rw-r--  1 tdev01 tdev01   79 Apr 17 03:43 nectar_vm.config
     -rw-rw-r--  1 tdev01 tdev01 150K Apr 20 01:34 .nextflow.log
     drwxrwxr-x  4 tdev01 tdev01 4.0K Apr 17 06:55 nf-core-rnaseq-3.23.0
-    -rwxrwxr-x  1 tdev01 tdev01  475 Apr 20 01:04 run.sh
     drwxrwxr-x 66 tdev01 tdev01 4.0K Apr 20 01:32 work
     ```