04-04-Pipeline-fastqc.Rmd
+8 -6 (8 additions & 6 deletions)
@@ -1,4 +1,6 @@
-# An example dataset {-}
+# Raw read QC & trimming {-}
+
+## An example dataset
 
 For the following examples, we will use the results of an `nf-core/rnaseq` run on a published dataset from [Shanle et al, 2013](https://doi.org/10.1210/me.2013-1164). ([SRP062287](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP062287)).
 This is a breast cancer cell line, ERα/PR/HER2 negative, and engineered to have doxycycline inducible ERβ expression.
@@ -8,14 +10,14 @@ The cells were treated with various combinations of estrogen, doxycycline, and c
 {width="100%"}
 </a>
 
-We've already run this dataset through the `nf-core/rnaseq` pipeline - here is the output on <a href="https://laxy.io/#/job/3pLfQoLEuWeAnWh4H3Vvbv/?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a" target="_blank">laxy.io</a>, which includes a the FastQC reports.
+We've already run this dataset through the `nf-core/rnaseq` pipeline - here is the output on <a href="https://laxy.io/#/job/3pLfQoLEuWeAnWh4H3Vvbv/?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a" target="_blank">laxy.io</a>, which includes the FastQC reports.
 
-## Raw read QC
+## FastQC
 
 Let's begin by looking at some metrics about the raw FASTQ reads. The most popular tool for this is [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/),
 which generates reports on read quality, length, and content (among other things).
@@ -58,7 +60,7 @@ source of bias to the resulting expression level estimates.
 
 _However ...._
 
-The literature suggests trimming probably isn't nessecary for differential abundance studies,
+The literature suggests trimming probably isn't necessary for differential abundance studies,
 and quantification at the gene level is very similar between trimmed and untrimmed reads [(Liao & Shi, 2020)](https://doi.org/10.1093%2Fnargab%2Flqaa068).
 This is because aligners used for RNA-seq 'soft-clip' reads - discarding parts at the ends of reads that don't match the reference genome.
Short read aligners aim to find a matching region of sequence in a reference genome.
6
8
@@ -13,7 +15,7 @@ Short read aligners aim to find a matching region of sequence in a reference gen
 
 - A BAM file with aligned reads
 
-We typically align to the **whole genome sequence** rather than just the **predicted transcriptome sequences** - this allows use to assess useful
+We typically align to the **whole genome sequence** rather than just the **predicted transcriptome sequences** - this allows us to assess useful
 quality metrics, like the proportion of reads that map to exons vs other regions of the genome.
 
 Popular alignment tools for RNA-seq in eurkaryotes such as STAR, HISAT2 or Subread are splice-aware - they will split
@@ -47,7 +49,7 @@ quality metrics from many tools into a single report. We will examine the report
 the `nf-core/rnaseq` pipeline using MultiQC, focusing on the parts that are most useful for understanding
 your dataset.
 
-> Here's a link to the report[on laxy.io](https://api.laxy.io/api/v1/job/3pLfQoLEuWeAnWh4H3Vvbv/files/output/results/multiqc/star_salmon/multiqc_report.html?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a){target="_blank"} (and an alternative backup version [here](files/multiqc/SRP062287/multiqc_report.html){target="_blank"}).
+> Here's a link to [the MultiQC report](files/multiqc/SRP062287/multiqc_report.html){target="_blank"} (or the same [on laxy.io](https://api.laxy.io/api/v1/job/3pLfQoLEuWeAnWh4H3Vvbv/files/output/results/multiqc/star_salmon/multiqc_report.html?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a){target="_blank"}).
 
 We can use these QC metrics to identify:
 
@@ -71,18 +73,18 @@ This is handy for seeing some key metrics all in one place.
 
 There's a lot of columns here - you can use the "configure columns" button to show just a subset.
 Let's skip interpreting this table to begin with, since most of the information is repeated in plots below.
 
 ## Percent (%) aligned reads
 
-Use the sidebar to jump to the ["STAR" section of the report](files/multiqc/SRP062287/multiqc_report.html#star){target="_blank"}) - this shows the number and proportion of reads, and if the mapped uniquely or otherwise.
+▷ Use the sidebar to jump to the ["STAR" section of the report](files/multiqc/SRP062287/multiqc_report.html#star){target="_blank"}) - this shows the number and proportion of reads, and if they mapped uniquely or otherwise.
 
-The overall number of reads per sample here will closely follow the total raw read numbers (eg (files/multiqc/SRP062287/multiqc_report.html#fastqc_sequence_counts){target="_blank"})[from FastQC]).
+The overall number of reads per sample here will closely follow the total raw read numbers (eg [from FastQC](files/multiqc/SRP062287/multiqc_report.html#fastqc_sequence_counts){target="_blank"}).
 Usually an attempt is made to load similar amounts of material for each sample onto the flowcell, but it's not possible to perfectly balance this loading, so library sizes will vary between samples.
 
-Change plot to show percentages rather than raw numbers of mapped reads.
+▷ Change the plot to show percentages rather than raw numbers of mapped reads.
 
 In this case we can see every sample has ~92 % **uniquely mapped** reads and is very consistent (this isn't always the case !).
 
@@ -94,7 +96,7 @@ There are several classes of read here:
 - Unmapped: too short (low quality reads, trimmed to oblivion)
 - Unmapped: other (no match to the genome sequence)
 
-For traditional feature counting to estimate gene expression, only the the unambiguous, uniquely mapped reads are used.
+For traditional feature counting to estimate gene expression, only the the **unambiguous, uniquely mapped** reads are used.
 
 **What should you expect from your data ?**
 
@@ -108,7 +110,7 @@ the mouse genome with our human cell line example dataset.
 
 ### MultiQC: Duplication statistics
 
-[See this section](files/multiqc/SRP062287/multiqc_report.html#picard)
+▷ [See this section](files/multiqc/SRP062287/multiqc_report.html#picard)
 
 `picard MarkDuplicates` identifies and quantifies 'duplicate' mapped reads - if two reads have the same 5' start position and the same orientation (strand) (prior to soft-clipping), one of these reads is considered a duplicate.
 _(For paired end reads, both R1 and R2 5' start positions must match another pair)._
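Note on the `picard MarkDuplicates` context lines above: the rule they describe (same 5' start position, same strand) can be illustrated with a minimal Python sketch. This is not Picard's implementation. The `Read` record and its field names are hypothetical, only single-end reads are handled, and the first read seen at each position is simply kept, which is a simplification.

```python
from typing import Iterable, NamedTuple, Set


class Read(NamedTuple):
    """Hypothetical minimal record for one mapped single-end read."""
    name: str
    chrom: str
    five_prime: int  # 5' start position, taken before soft-clipping
    strand: str      # "+" or "-"


def mark_duplicates(reads: Iterable[Read]) -> Set[str]:
    """Return names of reads sharing chrom, 5' start and strand with an earlier read."""
    seen = set()
    duplicates = set()
    for read in reads:
        key = (read.chrom, read.five_prime, read.strand)
        if key in seen:
            # A read with this start position and orientation was already kept.
            duplicates.add(read.name)
        else:
            seen.add(key)
    return duplicates


reads = [
    Read("r1", "chr1", 100, "+"),
    Read("r2", "chr1", 100, "+"),  # same position and strand as r1
    Read("r3", "chr1", 100, "-"),  # same position, opposite strand
    Read("r4", "chr1", 250, "+"),
]
print(sorted(mark_duplicates(reads)))
```

Here only `r2` is flagged: `r3` shares the start position with `r1` but is on the other strand, so it is not counted as a duplicate under this rule.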
@@ -119,11 +121,11 @@ These duplicates can be of several origins:
 2. Reads generated by **PCR** of a single fragment, which are derived from only one transcript molecule
 3. 'Optical duplicates' that originate from image analysis artefacts - where a single cluster on the flowcell is incorrectly interpreted as two (or more) clusters
 
-Duplicate reads in category (1, biological) are generally 'good', in that they represent the real transcript abundance. The deeper you sequence (the larger the library), the more duplicates you'll get since the chance of two reads happening to have the same start position by chance increases.
+Duplicate reads in category 1 (**biological**) are generally 'good', in that they represent the real transcript abundance. The deeper you sequence (the larger the library), the more duplicates you'll get since the chance of two reads happening to have the same start position increases.
 
-Duplicate reads in category (2, PCR) usually aren't desired, since they may bias the real transcript abundance, inflating counts for fragments that happen to amplify more efficiently.
+Duplicate reads in category 2 (**PCR**) usually aren't desired, since they may bias the real transcript abundance, inflating counts for fragments that happen to amplify more efficiently.
 
-Duplicate reads in category (3, optical) seem to be rare in data on modern instruments, probably probably due to better image analysis and patterned flow cells.
+Duplicate reads in category 3 (**optical**) seem to be rare in data on modern instruments, probably probably due to better image analysis and patterned flow cells.
 
 We can't _always_ know if a duplicate is biological (1), PCR-derived (2) or optical (3).
 
@@ -145,12 +147,14 @@ There are two main solutions to avoid excessive PCR duplicates are:
 - Get more starting material so you don't need to artificially increase the library size via many PCR cycles
 - Use Universal Molecular Identifiers (UMIs) in the library construction to allow PCR duplicates to be identified.
 
-> For 3'-focused sequencing you are likely to see more duplicates that are not PCR or optical
+> For 3'-focused sequencing you are likely to see more duplicates that are not PCR or optical.
 
 #### UMIs and dupRadar
 
-This is the purpose of Universal Molecular Identifiers (UMIs) - semi-random sequences incorporated into fragments as early as possible in library prep, prior to many rounds of PCR.
-They allow duplicated generated by PCR to be identified and accounted for.
+The purpose of Universal Molecular Identifiers (UMIs) is to help account for PCR duplicates.
+
+These are semi-random sequences incorporated into fragments as early as possible in library prep, prior to many rounds of PCR.
+They allow duplicates of the same fragment generated by PCR to be identified by the unique random sequence, and hence accounted for.
 
 _[dupRadar](https://doi.org/10.1186/s12859-016-1276-2) can help assess if an experiment is suffering from excessive PCR duplication, vs. 'over-sequencing'_
 
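Note on the UMI lines added in this hunk: a minimal Python sketch of how a UMI can separate PCR copies from distinct molecules that happen to share a start position. The tuple layout and the function name `count_unique_molecules` are assumptions made for illustration. Exact UMI matching is assumed, whereas real deduplication tools generally also tolerate sequencing errors within the UMI.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical read layout: (chrom, 5' start position, strand, UMI sequence)
UmiRead = Tuple[str, int, str, str]
Position = Tuple[str, int, str]


def count_unique_molecules(reads: Iterable[UmiRead]) -> Dict[Position, int]:
    """Count distinct UMIs observed at each (chrom, 5' start, strand) position."""
    umis_by_position = defaultdict(set)
    for chrom, five_prime, strand, umi in reads:
        umis_by_position[(chrom, five_prime, strand)].add(umi)
    # Reads sharing both position and UMI are treated as PCR copies of one molecule.
    return {pos: len(umis) for pos, umis in umis_by_position.items()}


reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),  # same position and UMI: counted once
    ("chr1", 100, "+", "TTAG"),  # same position, different UMI: a second molecule
    ("chr1", 250, "+", "ACGT"),
]
molecules = count_unique_molecules(reads)
print(molecules[("chr1", 100, "+")])
```

Without UMIs, the three reads at `chr1:100` on the `+` strand would collapse to one under position-based duplicate marking. With UMIs, two molecules are retained at that position.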
@@ -164,18 +168,18 @@ _[dupRadar](https://doi.org/10.1186/s12859-016-1276-2) can help assess if an exp
 
 #### Qualimap - read origin and gene coverage
 
-[See this section](files/multiqc/SRP062287/multiqc_report.html#qualimap)
+▷ [See this section](files/multiqc/SRP062287/multiqc_report.html#qualimap)