
Commit ae999a0

WIP: addressing feedback, updating 1.4 flow
1 parent c2f3e5a commit ae999a0

4 files changed

Lines changed: 122 additions & 98 deletions

File tree

docs/assets/1.4_rnaseq_fail.png

212 KB
276 KB

docs/assets/1.4_rnaseq_start.png

221 KB

docs/session_1/1.4_rnaseq.md

Lines changed: 122 additions & 98 deletions
@@ -232,7 +232,47 @@ Our input FASTQ files (`fastqs/`), reference data (`mm10_reference/`), and full
232232
```
233233

234234
```console title="Output"
235-
235+
data
236+
|-- fastq
237+
| |-- SRR3473988_selected.fastq.gz
238+
| `-- SRR3473989_selected.fastq.gz
239+
|-- mm10_reference
240+
| |-- STAR
241+
| | |-- Genome
242+
| | |-- Log.out
243+
| | |-- SA
244+
| | |-- SAindex
245+
| | |-- chrLength.txt
246+
| | |-- chrName.txt
247+
| | |-- chrNameLength.txt
248+
| | |-- chrStart.txt
249+
| | |-- exonGeTrInfo.tab
250+
| | |-- exonInfo.tab
251+
| | |-- geneInfo.tab
252+
| | |-- genomeParameters.txt
253+
| | |-- sjdbInfo.txt
254+
| | |-- sjdbList.fromGTF.out.tab
255+
| | |-- sjdbList.out.tab
256+
| | `-- transcriptInfo.tab
257+
| |-- mm10_chr18.fa
258+
| |-- mm10_chr18.gtf
259+
| `-- salmon-index
260+
| |-- complete_ref_lens.bin
261+
| |-- ctable.bin
262+
| |-- ctg_offsets.bin
263+
| |-- duplicate_clusters.tsv
264+
| |-- info.json
265+
| |-- mphf.bin
266+
| |-- pos.bin
267+
| |-- pre_indexing.log
268+
| |-- rank.bin
269+
| |-- refAccumLengths.bin
270+
| |-- ref_indexing.log
271+
| |-- reflengths.bin
272+
| |-- refseq.bin
273+
| |-- seq.bin
274+
| `-- versionInfo.json
275+
`-- samplesheet.csv
236276
```
237277

238278
Finally, take a look at the `samplesheet.csv` file to see what information the `nf-core/rnaseq` pipeline requires for each sample:
@@ -247,58 +287,7 @@ Our input FASTQ files (`fastqs/`), reference data (`mm10_reference/`), and full
247287
SRR3473989,/home/training/data/fastq/SRR3473989_selected.fastq.gz,,forward
248288
```
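A quick way to sanity-check a samplesheet row from the terminal is to split it on commas. The sketch below is only illustrative: it assumes the four-column layout `sample,fastq_1,fastq_2,strandedness` (the header is not shown in this excerpt, so the column names are an assumption) and uses the row from the output above. An empty second FASTQ column is treated as single-end data.

```bash
# Row copied from the samplesheet output above.
row='SRR3473989,/home/training/data/fastq/SRR3473989_selected.fastq.gz,,forward'

# Split the row on commas. The variable names mirror the assumed column
# names; they are not read from the file itself.
IFS=',' read -r sample fastq_1 fastq_2 strandedness <<EOF
$row
EOF

# An empty second FASTQ column indicates single-end reads.
if [ -z "$fastq_2" ]; then
    layout="single-end"
else
    layout="paired-end"
fi

echo "$sample: $layout, strandedness=$strandedness"
```

The same `read` call could be placed in a loop over the real file to check every sample, skipping the header line.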
249289

250-
### Required input: `--input` and `--outdir`
251-
252-
The pipeline requires us to define both an input samplesheet and an output directory to place our results. We supply these with the `--input` and `--outdir` parameters, respectively. We've already looked at our input samplesheet: `~/data/samplesheet.csv`. Our output directory can be named anything we want, and will be automatically created by Nextflow if it doesn't already exists.
253-
254-
!!! example "Exercise 1.4.2.1"
255-
256-
Create a new file called `run.sh` and start writing a run command for the rnaseq pipeline. Start by providing the samplesheet as input. Also define an output directory called `lesson-1.4`.
257-
258-
??? success "Solution"
259-
260-
First, create the new run script:
261-
262-
```bash
263-
touch run.sh
264-
```
265-
266-
Additionally, make sure it is executable:
267-
268-
```bash
269-
chmod +x run.sh
270-
```
271-
272-
Open the file within VSCode so you can easily edit it. Remember you can do this via the graphical interface or with the `code` command in the terminal:
273-
274-
```bash
275-
code run.sh
276-
```
277-
278-
Next, start by writing out the basic `nextflow run` command:
279-
280-
```bash title="run.sh"
281-
nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
282-
```
283-
284-
**Note** that we have added a space and a backslash (` \`) to the end of the line so we may continue writing the full command over multiple lines for legibility.
285-
286-
Next, add the `--input` parameter and pass it the path to the samplesheet. Be sure to replace `<USERNAME>` with your provided user name:
287-
288-
```bash title="run.sh" hl_lines="2"
289-
nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
290-
--input /home/<USERNAME>/data/samplesheet.csv \
291-
```
292-
293-
Finally, add the `--outdir` parameter and give it the name `lesson-1.4`:
294-
295-
```bash title="run.sh" hl_lines="3"
296-
nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
297-
--input /home/<USERNAME>/data/samplesheet.csv \
298-
--outdir lesson-1.4 \
299-
```
300-
301-
### Required input: reference data
290+
### Reference data
302291

303292
Many nf-core pipelines have a minimum requirement for reference data inputs. The input reference data requirements for this pipeline are provided in the [usage documentation](https://nf-co.re/rnaseq/3.11.1/usage#reference-genome-files). To see what reference files we can specify using parameters, rerun the pipeline's help command to view all the available parameters.
304293

@@ -355,15 +344,52 @@ For each of these parameters, we have the following files that we can use:
355344

356345
**Note** that we are just using chr18 as it is a relatively small chromosome, so this should help to keep the run time for our exercises nice and short.
357346

347+
### Writing the run command: required `--input` and `--outdir` parameters
348+
349+
The pipeline requires us to define both an input samplesheet and an output directory in which to place our results. We supply these with the `--input` and `--outdir` parameters, respectively. We've already looked at our input samplesheet: `~/data/samplesheet.csv`. Our output directory can be named anything we want, and will be automatically created by Nextflow if it doesn't already exist.
350+
351+
!!! example "Exercise 1.4.2.1"
352+
353+
Start writing a run command for the rnaseq pipeline. Start by providing the samplesheet as input. Also define an output directory called `lesson-1.4`.
354+
355+
??? success "Solution"
356+
357+
Start by writing out the basic `nextflow run` command:
358+
359+
```bash
360+
nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
361+
```
362+
363+
**Note** that we have added a space and a backslash (` \`) to the end of the line so we may continue writing the full command over multiple lines for legibility. If you hit `Enter` now, the command won't run yet, but you will be provided a new line to continue writing.
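To see the line-continuation behaviour in isolation, here is a minimal sketch that uses `echo` in place of `nextflow run`; the words `one`, `two`, and `three` are placeholders and not part of the workshop command.

```bash
# A trailing backslash joins the next line onto the current command,
# so the shell reads these three lines as: echo one two three
# The backslash must be the very last character on the line; a space
# after it would escape the space instead of the line break.
joined=$(echo one \
    two \
    three)

echo "$joined"
```

The same rule is what lets the `nextflow run` command be built up one parameter per line.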
364+
365+
Next, add the `--input` parameter and pass it the path to the samplesheet. Be sure to replace `<USERNAME>` with your provided user name:
366+
367+
```bash hl_lines="2"
368+
nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
369+
--input /home/<USERNAME>/data/samplesheet.csv \
370+
```
371+
372+
Finally, add the `--outdir` parameter and give it the name `lesson-1.4`:
373+
374+
```bash hl_lines="3"
375+
nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
376+
--input /home/<USERNAME>/data/samplesheet.csv \
377+
--outdir lesson-1.4 \
378+
```
379+
380+
### Writing the run command: reference data
381+
382+
With the inputs and outputs defined, we next need to tell the pipeline where to find the necessary reference data. We have already determined the parameters and files we need to pass to the pipeline, so let's add them to the command now.
383+
358384
!!! example "Exercise 1.4.2.2"
359385

360-
Add the reference file parameters and their respective file paths to the `run.sh` script.
386+
Continue writing your run command by passing the reference files to their respective parameters.
361387

362388
??? success "Solution"
363389

364-
Add the following lines to the end of `run.sh`:
390+
Following on from the last line from Exercise 1.4.2.1, add the `--fasta`, `--gtf`, `--star_index`, and `--salmon_index` parameters, and pass them the files we determined above in [Reference data](#reference-data):
365391

366-
```bash title="run.sh" hl_lines="4-7"
392+
```bash hl_lines="4-7"
367393
nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
368394
--input /home/<USERNAME>/data/samplesheet.csv \
369395
--outdir lesson-1.4 \
@@ -375,43 +401,36 @@ For each of these parameters, we have the following files that we can use:
375401

376402
### Optional parameters
377403

378-
Now that we have prepared our input and reference data, we will customise the typical run command by:
379-
380-
1. Using Nextflow's `-profile` parameter to specify that we will be running the Singularity profile instead of the Docker profile
381-
2. Adding additional process-specific flags to [skip duplicate read marking](https://nf-co.re/rnaseq/3.23.0/parameters#skip_markduplicates), [save trimmed reads](https://nf-co.re/rnaseq/3.23.0/parameters#save_trimmed) and [save unaligned reads](https://nf-co.re/rnaseq/3.23.0/parameters#save_unaligned)
382-
383-
The parameters we will use are:
404+
Now that we have prepared our input and reference data, we have defined all the required parameters for the pipeline. However, Nextflow still needs to be configured to use Singularity, and we will add an additional workflow parameter to help speed up the pipeline run for the sake of this workshop. The parameters we will use are:
384405

385406
- `-profile singularity`
407+
- Recall that this is a **Nextflow** parameter. It tells Nextflow to use nf-core's Singularity profile, rather than the default Docker profile, and to run each process in a Singularity container.
386408
- `--skip_markduplicates true`
387-
- `--save_trimmed true`
388-
- `--save_unaligned true`
409+
- This is a pipeline parameter that tells the `rnaseq` pipeline to [skip duplicate read marking](https://nf-co.re/rnaseq/3.23.0/parameters#skip_markduplicates). Ordinarily we would want to keep duplicate marking enabled, but for the sake of the workshop and in the interest of time we will skip it.
389410

390411
!!! example "Exercise 1.4.2.3"
391412

392413
Add the optional parameters and the singularity profile to the run command.
393414

394415
??? success "Solution"
395416

396-
Add the following lines to the end of `run.sh`:
417+
Finish writing the run command by adding the `-profile` and `--skip_markduplicates` parameters:
397418

398-
```bash title="run.sh" hl_lines="8-11"
419+
```bash hl_lines="8-9"
399420
nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
400-
--input /home/<USERNAME>/samplesheet.csv \
421+
--input /home/<USERNAME>/data/samplesheet.csv \
401422
--outdir lesson-1.4 \
402423
--fasta /home/<USERNAME>/data/mm10_reference/mm10_chr18.fa \
403424
--gtf /home/<USERNAME>/data/mm10_reference/mm10_chr18.gtf \
404425
--star_index /home/<USERNAME>/data/mm10_reference/STAR \
405426
--salmon_index /home/<USERNAME>/data/mm10_reference/salmon-index \
406427
-profile singularity \
407-
--skip_markduplicates true \
408-
--save_trimmed true \
409-
--save_unaligned true
428+
--skip_markduplicates true
410429
```
411430

412431
**Remember** that `-profile` is a *Nextflow parameter* and therefore only uses a **single hyphen**. The remaining parameters are *workflow parameters* and use a **double hyphen**.
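The hyphen rule can be sketched as a small shell function. `classify_option` is a hypothetical helper written for illustration only; it is not part of Nextflow or nf-core, and it looks at nothing but the prefix of the argument it is given.

```bash
# Hypothetical helper: classify an argument by its leading hyphens.
classify_option() {
    case "$1" in
        --*) echo "workflow parameter" ;;   # double hyphen: passed to the pipeline
        -*)  echo "Nextflow parameter" ;;   # single hyphen: handled by Nextflow itself
        *)   echo "value" ;;                # no hyphen: a value for the preceding option
    esac
}

classify_option -profile
classify_option --skip_markduplicates
classify_option singularity
```

The `--*` pattern has to come first, because an argument that starts with two hyphens also matches `-*`.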
413432

414-
**Note** also that we have left off the trailing space and bashslash from the final line (`--save_unaligned true`) since this line concludes our initial run command.
433+
**Note** also that we have left off the trailing space and backslash from the final line (`--skip_markduplicates true`) since this line concludes our initial run command.
415434

416435
[hi](./1.3_configure.md#configuring-processes)
417436

@@ -421,9 +440,23 @@ The parameters we will use are:
421440

422441
The inclusion of `ext.args` is currently best practice for all DSL2 nf-core modules where additional parameters may be required to run a process. However, this may not be implemented for all modules in all nf-core pipelines. Depending on the pipeline, a process module may not define the `ext.args` variable in its script block, in which case it is not available for customisation. If that is the case, consider submitting a feature request or making a pull request on the pipeline's GitHub repository to implement this!
423442

424-
### Setting resource limits
443+
## 1.4.3 Run the pipeline
444+
445+
You should now have a multi-line command in your terminal waiting to run. Now if you hit `Enter`, Nextflow should launch and the pipeline will start to run. It will take a few seconds to start up, and then you should start seeing processes spawning and running.
446+
447+
![](../assets/1.4_rnaseq_start.png)
448+
449+
![](../assets/1.4_rnaseq_processes_start.png)
450+
451+
However, very quickly, we run into an error!
452+
453+
![](../assets/1.4_rnaseq_fail.png)
425454

426-
There is one thing left to do with our basic run command, and that is to set some resource limits. The `nf-core/rnaseq` pipeline is designed to run on large datasets and therefore expects to require lots of CPU and memory resources to run. However, we're using a small test dataset that doesn't need a lot of computing power, and as such we're also using low-resource VMs. Running the workflow with its default settings will cause it to crash due to insufficient CPU and memory requirements.
455+
What happened?
456+
457+
## 1.4.4 Setting resource limits
458+
459+
It turns out that there is one thing left to do in order to run the pipeline: set some **resource limits**. The `nf-core/rnaseq` pipeline is designed to run on large datasets, so its processes request a lot of CPU and memory by default. However, we're using a small test dataset that doesn't need much computing power, and we're running it on low-resource VMs. With the default settings, some processes request more CPU and memory than the VM can provide, and so they fail.
427460

428461
We can fix this by telling Nextflow that we want to limit the resource requests from each process to an upper bound of 2 CPUs and 6GB of memory. We do this within a custom configuration file using the `process.resourceLimits` directive. This takes a list of upper resource limits like so:
429462

@@ -435,22 +468,22 @@ process.resourceLimits = [
435468
]
436469
```
437470

438-
!!! example "Exercise 1.4.2.4"
471+
!!! example "Exercise 1.4.4"
439472

440-
Create a file called `nextflow.config` within your current working directory (`~/session2`) and add the `resourceLimits` directive, giving our workflow a limit of 2 CPUs and 6GB of memory.
473+
Create a configuration file called `nectar_vm.config` within your current working directory (`~/session2`) and add the `resourceLimits` directive, giving our workflow a limit of 2 CPUs and 6GB of memory.
441474

442475
??? success "Solution"
443476

444-
First, create the `nextflow.config` file:
477+
First, create the `nectar_vm.config` file:
445478

446479
```bash
447-
touch nextflow.config
448-
code nextflow.config
480+
touch nectar_vm.config
481+
code nectar_vm.config
449482
```
450483

451484
Next, add the `resourceLimits` directive. You can do this in one of two ways. You can use the `process.resourceLimits` form as shown above:
452485

453-
```groovy title="nextflow.config"
486+
```groovy title="nectar_vm.config"
454487
process.resourceLimits = [
455488
cpus: 2,
456489
memory: 6.GB
@@ -459,7 +492,7 @@ process.resourceLimits = [
459492

460493
Alternatively, you can use the expanded version by nesting `resourceLimits` within a `process` scope:
461494

462-
```groovy title="nextflow.config"
495+
```groovy title="nectar_vm.config"
463496
process {
464497
resourceLimits = [
465498
cpus: 2,
@@ -470,11 +503,11 @@ process.resourceLimits = [
470503

471504
The second form is preferable since we will need the `process` scope for configuring processes further in the second session.
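If you would rather stay in the terminal than open an editor, the nested form can also be written with a heredoc. This is only a sketch of an alternative to the `touch` and `code` steps; it assumes you are in the directory where you will launch the pipeline, and it overwrites any existing `nectar_vm.config` there.

```bash
# Write the config file in one step. Quoting 'EOF' stops the shell
# from expanding anything inside the heredoc body.
# Note: this replaces nectar_vm.config if it already exists.
cat > nectar_vm.config <<'EOF'
process {
    resourceLimits = [
        cpus: 2,
        memory: 6.GB
    ]
}
EOF

# Print the file back to confirm its contents.
cat nectar_vm.config
```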
472505

473-
We now have a finished initial run command. Note how we didn't update `run.sh` after creating the new `nextflow.config` file. Recall from the [previous session](./1.3_configure.md#131-introduction-to-nextflow-configuration) that Nextflow will automatically include a `nextflow.config` file in the launch directory in its configuration. So, if we have configuration options we want to include for every run, we can add them to `nextflow.config` and they will be automatically loaded without having to specify the file in our run command.
506+
We now have our resource limits defined in a custom configuration file. All that is left is to update our run command to include the new configuration file with `-c`, and to tell Nextflow to resume from where it left off with `-resume`: there's no sense re-running jobs that have already succeeded!
474507

475508
Our final run command and custom config file look like:
476509

477-
```bash title="run.sh"
510+
```bash hl_lines="11-13"
478511
nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
479512
--input /home/<USERNAME>/data/samplesheet.csv \
480513
--outdir lesson-1.4 \
@@ -485,10 +518,12 @@ nextflow run nf-core-rnaseq-3.23.0/3_23_0 \
485518
-profile singularity \
486519
--skip_markduplicates true \
487520
--save_trimmed true \
488-
--save_unaligned true
521+
--save_unaligned true \
522+
-c nectar_vm.config \
523+
-resume
489524
```
490525

491-
```groovy title="nextflow.config"
526+
```groovy title="nectar_vm.config"
492527
process {
493528
resourceLimits = [
494529
cpus: 2,
@@ -497,17 +532,9 @@ process {
497532
}
498533
```
499534

500-
## 1.4.3 Run the pipeline
501-
502-
Now all that is left to do is to run the pipeline!
535+
Go ahead and re-run the workflow. It should now run successfully to completion!
503536

504-
!!! example "Run the pipeline"
505-
506-
Simply run the `run.sh` script to execute the pipeline:
507-
508-
```bash
509-
./run.sh
510-
```
537+
## 1.4.5 Examine the outputs
511538

512539
:eyes: Take a look at the stdout printed to the screen. Your workflow configuration and parameter customisations are all documented here. You can use this to confirm if your parameters have been correctly passed to the run command:
513540

@@ -527,8 +554,6 @@ To understand how this is coordinated, consider the STAR_ALIGN process that is b
527554
- Once a TRIMGALORE task is completed for a sample, the STAR_ALIGN task for that sample begins
528555
- When the STAR_ALIGN process starts, it spawns 2 tasks.
529556

530-
## 1.4.4 Examine the outputs
531-
532557
Once your pipeline has completed, you should see this message printed to your terminal:
533558

534559
```console title="Output"
@@ -556,10 +581,9 @@ In the meantime, list the contents of your directory. You will see a few new dir
556581
drwxr-x--- 16 tdev01 tdev01 4.0K Apr 20 01:26 ..
557582
drwxrwxr-x 8 tdev01 tdev01 4.0K Apr 19 23:51 lesson-1.4
558583
drwxrwxr-x 4 tdev01 tdev01 4.0K Apr 20 01:34 .nextflow
559-
-rw-rw-r-- 1 tdev01 tdev01 79 Apr 17 03:43 nextflow.config
584+
-rw-rw-r-- 1 tdev01 tdev01 79 Apr 17 03:43 nectar_vm.config
560585
-rw-rw-r-- 1 tdev01 tdev01 150K Apr 20 01:34 .nextflow.log
561586
drwxrwxr-x 4 tdev01 tdev01 4.0K Apr 17 06:55 nf-core-rnaseq-3.23.0
562-
-rwxrwxr-x 1 tdev01 tdev01 475 Apr 20 01:04 run.sh
563587
drwxrwxr-x 66 tdev01 tdev01 4.0K Apr 20 01:32 work
564588
```
565589
