This repository was archived by the owner on Aug 15, 2025. It is now read-only.

Commit f0c55f5

Various typos and formatting fixed in pipeline section

Added pipeline image for demultiplexing section
Changed one manual feature counting example
1 parent 2e05283 commit f0c55f5

7 files changed

Lines changed: 426 additions & 42 deletions

04-01-PipelineOverview.Rmd

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -2,7 +2,7 @@
22

33
## High-level
44

5-
An RNA-seq pipeline needs to identify which RNA transcript a (short) read has originated from to quantify transcription of genes.
5+
An RNA-seq pipeline needs to identify which RNA transcript a sequencing read has originated from to quantify transcription of genes.
66

77
Once we know the origin of each read, we can use this to estimate the abundance of each transcript.
88

@@ -18,14 +18,14 @@ At the high-level, a pipeline aims to:
1818
- count reads associated with features to quantify (differential) abundance
1919

2020
Counts of the number of reads associated with each feature (_gene_) are used to
21-
estimate the relative abundance of transcripts, and find differences in expression of a gene between groups (== differential expression analysis).
21+
estimate the relative abundance of transcripts, and find differences in gene expression between groups (== differential expression analysis).
2222

2323
Usually we quantify expression per gene (the sum of all transcripts arising from that gene).
2424

2525
## A simple pipeline
2626

2727
<a href="images/pipeline/simple_pipeline.svg" target="_blank">
28-
![Simple RNA-seq pipeline overview](images/pipeline/simple_pipeline.svg){width="100%"}
28+
![](images/pipeline/simple_pipeline.svg){width="100%" fig-alt="Simple RNA-seq pipeline overview"}
2929
</a>
3030

3131
## A more complex pipeline

04-03-Pipeline-demultiplexing.Rmd

Lines changed: 4 additions & 0 deletions
@@ -1,5 +1,9 @@
11
# Detour: demultiplexing {-}
22

3+
![](images/pipeline/simple_pipeline_demultiplexing.svg){width="100%" fig-alt="Simple RNA-seq pipeline overview with demultiplexing step"}
4+
5+
----
6+
37
***Before*** you get your FASTQs, the sequencing provider will have 'demultiplexed' the data output by the instrument.
48

59
Sequencing providers often pool many samples onto a single flowcell.

04-04-Pipeline-fastqc.Rmd

Lines changed: 8 additions & 6 deletions
@@ -1,4 +1,6 @@
1-
# An example dataset {-}
1+
# Raw read QC & trimming {-}
2+
3+
## An example dataset
24

35
For the following examples, we will use the results of an `nf-core/rnaseq` run on a published dataset from [Shanle et al, 2013](https://doi.org/10.1210/me.2013-1164). ([SRP062287](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP062287)).
46
This is a breast cancer cell line, ERα/PR/HER2 negative, and engineered to have doxycycline inducible ERβ expression.
@@ -8,14 +10,14 @@ The cells were treated with various combinations of estrogen, doxycycline, and c
810
![Shanle et al, 2013 headline](images/Shanle_et_al_2013.jpg){width="100%"}
911
</a>
1012

11-
We've already run this dataset through the `nf-core/rnaseq` pipeline - here is the output on <a href="https://laxy.io/#/job/3pLfQoLEuWeAnWh4H3Vvbv/?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a" target="_blank">laxy.io</a>, which includes a the FastQC reports.
13+
We've already run this dataset through the `nf-core/rnaseq` pipeline - here is the output on <a href="https://laxy.io/#/job/3pLfQoLEuWeAnWh4H3Vvbv/?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a" target="_blank">laxy.io</a>, which includes the FastQC reports.
1214

13-
## Raw read QC
15+
## FastQC
1416

1517
Let's begin by looking at some metrics about the raw FASTQ reads. The most popular tool for this is [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/),
1618
which generates reports on read quality, length, and content (among other things).
1719

18-
![Simple RNA-seq pipeline overview: FastQC step](images/pipeline/simple_pipeline_fastqc.svg){width="100%"}
20+
![](images/pipeline/simple_pipeline_fastqc.svg){width="100%" fig-alt="Simple RNA-seq pipeline overview: FastQC step"}
1921

2022
> FastQC provides a set of pass
2123
> ![FastQC PASS](images/multiqc/fastqc_pass.png) / warn ![FastQC PASS](images/multiqc/fastqc_warn.png) / fail ![FastQC PASS](images/multiqc/fastqc_fail.png)
@@ -58,7 +60,7 @@ source of bias to the resulting expression level estimates.
5860

5961
_However ...._
6062

61-
The literature suggests trimming probably isn't nessecary for differential abundance studies,
63+
The literature suggests trimming probably isn't necessary for differential abundance studies,
6264
and quantification at the gene level is very similar between trimmed and untrimmed reads [(Liao & Shi, 2020)](https://doi.org/10.1093%2Fnargab%2Flqaa068).
6365
This is because aligners used for RNA-seq 'soft-clip' reads - discarding parts at the ends of reads that don't match the reference genome.
6466

@@ -112,7 +114,7 @@ FastQC shows plots summarizing exact duplicate reads, and overrepresented sequen
112114
<a href="https://api.laxy.io/api/v1/job/3pLfQoLEuWeAnWh4H3Vvbv/files/output/results/fastqc/SRR2155413_fastqc.html?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a#M8" target="_blank">SRR2155413 Sequence Duplication Levels</a>
113115

114116
> Look at the Sequence Duplication Level plot - what might the peak at >10 duplication level represent ?
115-
> The red line represents the proportion that the same sequences would contribute after deduplication.
117+
> The red line represents the proportion that those same sequences would contribute after deduplication.
116118
117119
Our example dataset doesn't have any overrepresented sequences, but it's not uncommon to see ribosomal rRNAs,
118120
poly-A or poly-G, adapter sequences, or sequences prone to strong PCR amplification in this list.

04-05-Pipeline-mapping-and-qc.Rmd

Lines changed: 25 additions & 21 deletions
@@ -1,6 +1,8 @@
11
# Alignment / mapping {-}
22

3-
![Simple RNA-seq pipeline overview: FastQC step](images/pipeline/simple_pipeline_alignment.svg){width="100%"}
3+
![](images/pipeline/simple_pipeline_alignment.svg){width="100%" fig-alt="Simple RNA-seq pipeline overview: Alignment step"}
4+
5+
----
46

57
Short read aligners aim to find a matching region of sequence in a reference genome.
68

@@ -13,7 +15,7 @@ Short read aligners aim to find a matching region of sequence in a reference gen
1315

1416
- A BAM file with aligned reads
1517

16-
We typically align to the **whole genome sequence** rather than just the **predicted transcriptome sequences** - this allows use to assess useful
18+
We typically align to the **whole genome sequence** rather than just the **predicted transcriptome sequences** - this allows us to assess useful
1719
quality metrics, like the proportion of reads that map to exons vs other regions of the genome.
1820

1921
Popular alignment tools for RNA-seq in eurkaryotes such as STAR, HISAT2 or Subread are splice-aware - they will split
@@ -47,7 +49,7 @@ quality metrics from many tools into a single report. We will examine the report
4749
the `nf-core/rnaseq` pipeline using MultiQC, focusing on the parts that are most useful for understanding
4850
your dataset.
4951

50-
> Here's a link to the report [on laxy.io](https://api.laxy.io/api/v1/job/3pLfQoLEuWeAnWh4H3Vvbv/files/output/results/multiqc/star_salmon/multiqc_report.html?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a){target="_blank"} (and an alternative backup version [here](files/multiqc/SRP062287/multiqc_report.html){target="_blank"}).
52+
> Here's a link to [the MultiQC report](files/multiqc/SRP062287/multiqc_report.html){target="_blank"} (or the same [on laxy.io](https://api.laxy.io/api/v1/job/3pLfQoLEuWeAnWh4H3Vvbv/files/output/results/multiqc/star_salmon/multiqc_report.html?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a){target="_blank"}).
5153
5254
We can use these QC metrics to identify:
5355

@@ -71,18 +73,18 @@ This is handy for seeing some key metrics all in one place.
7173

7274
There's a lot of columns here - you can use the "configure columns" button to show just a subset.
7375

74-
![MultiQC configure columns](images/multiqc/mutliqc_configure_columns.jpg){width="60%"}
76+
![](images/multiqc/mutliqc_configure_columns.jpg){width="60%" fig-alt="MultiQC configure columns"}
7577

7678
Let's skip interpreting this table to begin with, since most of the information is repeated in plots below.
7779

7880
## Percent (%) aligned reads
7981

80-
Use the sidebar to jump to the ["STAR" section of the report](files/multiqc/SRP062287/multiqc_report.html#star){target="_blank"}) - this shows the number and proportion of reads, and if the mapped uniquely or otherwise.
82+
Use the sidebar to jump to the ["STAR" section of the report](files/multiqc/SRP062287/multiqc_report.html#star){target="_blank"}) - this shows the number and proportion of reads, and if they mapped uniquely or otherwise.
8183

82-
The overall number of reads per sample here will closely follow the total raw read numbers (eg (files/multiqc/SRP062287/multiqc_report.html#fastqc_sequence_counts){target="_blank"})[from FastQC]).
84+
The overall number of reads per sample here will closely follow the total raw read numbers (eg [from FastQC](files/multiqc/SRP062287/multiqc_report.html#fastqc_sequence_counts){target="_blank"}).
8385
Usually an attempt is made to load similar amounts of material for each sample onto the flowcell, but it's not possible to perfectly balance this loading, so library sizes will vary between samples.
8486

85-
Change plot to show percentages rather than raw numbers of mapped reads.
87+
Change the plot to show percentages rather than raw numbers of mapped reads.
8688

8789
In this case we can see every sample has ~92 % **uniquely mapped** reads and is very consistent (this isn't always the case !).
8890

@@ -94,7 +96,7 @@ There are several classes of read here:
9496
- Unmapped: too short (low quality reads, trimmed to oblivion)
9597
- Unmapped: other (no match to the genome sequence)
9698

97-
For traditional feature counting to estimate gene expression, only the the unambiguous, uniquely mapped reads are used.
99+
For traditional feature counting to estimate gene expression, only the the **unambiguous, uniquely mapped** reads are used.
98100

99101
**What should you expect from your data ?**
100102

@@ -108,7 +110,7 @@ the mouse genome with our human cell line example dataset.
108110

109111
### MultiQC: Duplication statistics
110112

111-
[See this section](files/multiqc/SRP062287/multiqc_report.html#picard)
113+
[See this section](files/multiqc/SRP062287/multiqc_report.html#picard)
112114

113115
`picard MarkDuplicates` identifies and quantifies 'duplicate' mapped reads - if two reads have the same 5' start position and the same orientation (strand) (prior to soft-clipping), one of these reads is considered a duplicate.
114116
_(For paired end reads, both R1 and R2 5' start positions must match another pair)._
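
The duplicate rule described in the context lines above (same 5' start position and same strand) can be illustrated with a small Python sketch. This is not part of the commit and is not how `picard MarkDuplicates` is implemented: the `Read` record, the `mark_duplicates` function and the example coordinates are hypothetical, reads are single-end, and the first read seen at each position is kept, whereas the real tool uses its own criteria to choose which read to keep.

```python
from typing import Iterable, NamedTuple, Set


class Read(NamedTuple):
    """Hypothetical, simplified single-end read record (illustration only)."""
    name: str
    chrom: str
    five_prime_start: int  # 5' position before soft-clipping
    strand: str            # "+" or "-"


def mark_duplicates(reads: Iterable[Read]) -> Set[str]:
    """Return the names of reads flagged as duplicates.

    Simplification: the first read seen at each (chrom, 5' start, strand)
    key is kept as the representative; any later read with the same key
    is flagged as a duplicate.
    """
    seen = set()
    duplicates = set()
    for read in reads:
        key = (read.chrom, read.five_prime_start, read.strand)
        if key in seen:
            duplicates.add(read.name)
        else:
            seen.add(key)
    return duplicates


reads = [
    Read("r1", "chr1", 100, "+"),
    Read("r2", "chr1", 100, "+"),  # same start and strand as r1
    Read("r3", "chr1", 100, "-"),  # same start, opposite strand
    Read("r4", "chr1", 250, "+"),
]
print(sorted(mark_duplicates(reads)))
```

Only `r2` is flagged here. Note that this position-based rule cannot tell whether `r2` is a biological, PCR or optical duplicate, which is the limitation the following sections discuss.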
@@ -119,11 +121,11 @@ These duplicates can be of several origins:
119121
2. Reads generated by **PCR** of a single fragment, which are derived from only one transcript molecule
120122
3. 'Optical duplicates' that originate from image analysis artefacts - where a single cluster on the flowcell is incorrectly interpreted as two (or more) clusters
121123

122-
Duplicate reads in category (1, biological) are generally 'good', in that they represent the real transcript abundance. The deeper you sequence (the larger the library), the more duplicates you'll get since the chance of two reads happening to have the same start position by chance increases.
124+
Duplicate reads in category 1 (**biological**) are generally 'good', in that they represent the real transcript abundance. The deeper you sequence (the larger the library), the more duplicates you'll get since the chance of two reads happening to have the same start position increases.
123125

124-
Duplicate reads in category (2, PCR) usually aren't desired, since they may bias the real transcript abundance, inflating counts for fragments that happen to amplify more efficiently.
126+
Duplicate reads in category 2 (**PCR**) usually aren't desired, since they may bias the real transcript abundance, inflating counts for fragments that happen to amplify more efficiently.
125127

126-
Duplicate reads in category (3, optical) seem to be rare in data on modern instruments, probably probably due to better image analysis and patterned flow cells.
128+
Duplicate reads in category 3 (**optical**) seem to be rare in data on modern instruments, probably probably due to better image analysis and patterned flow cells.
127129

128130
We can't _always_ know if a duplicate is biological (1), PCR-derived (2) or optical (3).
129131

@@ -145,12 +147,14 @@ There are two main solutions to avoid excessive PCR duplicates are:
145147
- Get more starting material so you don't need to artificially increase the library size via many PCR cycles
146148
- Use Universal Molecular Identifiers (UMIs) in the library construction to allow PCR duplicates to be identified.
147149

148-
> For 3'-focused sequencing you are likely to see more duplicates that are not PCR or optical
150+
> For 3'-focused sequencing you are likely to see more duplicates that are not PCR or optical.
149151
150152
#### UMIs and dupRadar
151153

152-
This is the purpose of Universal Molecular Identifiers (UMIs) - semi-random sequences incorporated into fragments as early as possible in library prep, prior to many rounds of PCR.
153-
They allow duplicated generated by PCR to be identified and accounted for.
154+
The purpose of Universal Molecular Identifiers (UMIs) is to help account for PCR duplicates.
155+
156+
These are semi-random sequences incorporated into fragments as early as possible in library prep, prior to many rounds of PCR.
157+
They allow duplicates of the same fragment generated by PCR to be identified by the unique random sequence, and hence accounted for.
154158

155159
_[dupRadar](https://doi.org/10.1186/s12859-016-1276-2) can help assess if an experiment is suffering from excessive PCR duplication, vs. 'over-sequencing'_
156160

@@ -164,18 +168,18 @@ _[dupRadar](https://doi.org/10.1186/s12859-016-1276-2) can help assess if an exp
164168

165169
#### Qualimap - read origin and gene coverage
166170

167-
[See this section](files/multiqc/SRP062287/multiqc_report.html#qualimap)
171+
[See this section](files/multiqc/SRP062287/multiqc_report.html#qualimap)
168172

169173
#### Biotype counts & RSeqQC read distribution
170174

171-
[RSeqQC read distribution](files/multiqc/SRP062287/multiqc_report.html#rseqc-read_distribution)
175+
[RSeqQC read distribution](files/multiqc/SRP062287/multiqc_report.html#rseqc-read_distribution)
172176

173-
[Biotype Counts](files/multiqc/SRP062287/multiqc_report.html#biotype_counts)
177+
[Biotype Counts](files/multiqc/SRP062287/multiqc_report.html#biotype_counts)
174178

175-
> The biotype counts in nf-core/rnaseq are calculated using `featureCounts` but aggregated at the feature level, not the gene level.
179+
> The biotype counts in nf-core/rnaseq are calculated using `featureCounts` but aggregated at the biotype feature level, not the gene level.
176180
177-
[Samtools mapping by XY and chromosome](files/multiqc/SRP062287/multiqc_report.html#samtools-idxstats-xy-counts)
181+
[Samtools mapping by XY and chromosome](files/multiqc/SRP062287/multiqc_report.html#samtools-idxstats-xy-counts)
178182

179183
#### Strandedness
180184

181-
[RSeQC: Infer experiment](files/multiqc/SRP062287/multiqc_report.html##rseqc-infer_experiment)
185+
[RSeQC: Infer experiment](files/multiqc/SRP062287/multiqc_report.html#rseqc-infer_experiment)
