MonashBioinformaticsPlatform
diff --git a/‎04-01-PipelineOverview.Rmd‎
Lines changed: 3 additions & 3 deletions b/‎04-01-PipelineOverview.Rmd‎
diff --git a/‎04-03-Pipeline-demultiplexing.Rmd‎
Lines changed: 4 additions & 0 deletions b/‎04-03-Pipeline-demultiplexing.Rmd‎
diff --git a/‎04-04-Pipeline-fastqc.Rmd‎
Lines changed: 8 additions & 6 deletions b/‎04-04-Pipeline-fastqc.Rmd‎
diff --git a/‎04-05-Pipeline-mapping-and-qc.Rmd‎
Lines changed: 25 additions & 21 deletions b/‎04-05-Pipeline-mapping-and-qc.Rmd‎
@@ -2,7 +2,7 @@
 
 ## High-level
 
-An RNA-seq pipeline needs to identify which RNA transcript a (short) read has originated from to quantify transcription of genes.
+An RNA-seq pipeline needs to identify which RNA transcript a sequencing read has originated from to quantify transcription of genes.
 
 Once we know the origin of each read, we can use this to estimate the abundance of each transcript.
 
@@ -18,14 +18,14 @@ At the high-level, a pipeline aims to:
 - count reads associated with features to quantify (differential) abundance
 
 Counts of the number of reads associated with each feature (_gene_) are used to 
-estimate the relative abundance of transcripts, and find differences in expression of a gene between groups (== differential expression analysis).
+estimate the relative abundance of transcripts, and find differences in gene expression between groups (== differential expression analysis).
 
 Usually we quantify expression per gene (the sum of all transcripts arising from that gene).
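
As a rough illustration of the gene-level summary described above, here is a minimal Python sketch that sums per-transcript read counts into per-gene counts. The transcript names, gene names and counts are made up for the example, and real pipelines use dedicated tools for this step.

```python
from collections import defaultdict

# Hypothetical transcript -> gene mapping and per-transcript read counts
# (illustrative values only, not taken from the example dataset).
transcript_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
transcript_counts = {"tx1": 120, "tx2": 30, "tx3": 75}


def gene_level_counts(tx_counts, tx2gene):
    """Sum the counts of all transcripts that arise from the same gene."""
    gene_counts = defaultdict(int)
    for transcript, count in tx_counts.items():
        gene_counts[tx2gene[transcript]] += count
    return dict(gene_counts)


gene_counts = gene_level_counts(transcript_counts, transcript_to_gene)
print(gene_counts)
```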
 
 ## A simple pipeline
 
 <a href="images/pipeline/simple_pipeline.svg" target="_blank">
-![Simple RNA-seq pipeline overview](images/pipeline/simple_pipeline.svg){width="100%"}
+![](images/pipeline/simple_pipeline.svg){width="100%" fig-alt="Simple RNA-seq pipeline overview"}
 </a>
 
 ## A more complex pipeline
 
@@ -1,5 +1,9 @@
 # Detour: demultiplexing {-}
 
+![](images/pipeline/simple_pipeline_demultiplexing.svg){width="100%" fig-alt="Simple RNA-seq pipeline overview with demultiplexing step"}
+
+----
+
 ***Before***  you get your FASTQs, the sequencing provider will have 'demultiplexed' the data output by the instrument.
 
 Sequencing providers often pool many samples onto a single flowcell.
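
To make the idea concrete, here is a minimal Python sketch of demultiplexing: each read carries an index (barcode) sequence, and reads are grouped by the sample that barcode was assigned to. The barcodes, sample names and the `demultiplex` function are hypothetical, and the sketch assumes exact barcode matches only, whereas real demultiplexing software usually tolerates some sequencing errors in the index.

```python
from collections import defaultdict

# Hypothetical sample sheet: index (barcode) sequence -> sample name.
barcode_to_sample = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}

# Hypothetical pooled reads as (index_sequence, read_sequence) pairs.
pooled_reads = [
    ("ACGTAC", "GATTACA"),
    ("TGCATG", "CCGGTTA"),
    ("ACGTAC", "TTAGGCA"),
    ("NNNNNN", "AAAAAAA"),  # index that matches no sample
]


def demultiplex(reads, sample_sheet):
    """Group reads by sample using an exact match on the index sequence."""
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = sample_sheet.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)


demultiplexed = demultiplex(pooled_reads, barcode_to_sample)
for sample, reads in demultiplexed.items():
    print(sample, len(reads))
```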
 
@@ -1,4 +1,6 @@
-# An example dataset {-}
+# Raw read QC & trimming {-}
+
+## An example dataset
 
 For the following examples, we will use the results of an `nf-core/rnaseq` run on a published dataset from [Shanle et al, 2013](https://doi.org/10.1210/me.2013-1164). ([SRP062287](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP062287)).
 This is a breast cancer cell line, ERα/PR/HER2 negative, and engineered to have doxycycline inducible ERβ expression. 
@@ -8,14 +10,14 @@ The cells were treated with various combinations of estrogen, doxycycline, and c
 ![Shanle et al, 2013 headline](images/Shanle_et_al_2013.jpg){width="100%"}
 </a>
 
-We've already run this dataset through the `nf-core/rnaseq` pipeline - here is the output on <a href="https://laxy.io/#/job/3pLfQoLEuWeAnWh4H3Vvbv/?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a" target="_blank">laxy.io</a>, which includes a the FastQC reports. 
+We've already run this dataset through the `nf-core/rnaseq` pipeline - here is the output on <a href="https://laxy.io/#/job/3pLfQoLEuWeAnWh4H3Vvbv/?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a" target="_blank">laxy.io</a>, which includes the FastQC reports. 
 
-## Raw read QC
+## FastQC
 
 Let's begin by looking at some metrics about the raw FASTQ reads. The most popular tool for this is [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), 
 which generates reports on read quality, length, and content (among other things).
 
-![Simple RNA-seq pipeline overview: FastQC step](images/pipeline/simple_pipeline_fastqc.svg){width="100%"}
+![](images/pipeline/simple_pipeline_fastqc.svg){width="100%" fig-alt="Simple RNA-seq pipeline overview: FastQC step"}
 
 > FastQC provides a set of pass 
 > ![FastQC PASS](images/multiqc/fastqc_pass.png) / warn ![FastQC WARN](images/multiqc/fastqc_warn.png)  / fail ![FastQC FAIL](images/multiqc/fastqc_fail.png) 
@@ -58,7 +60,7 @@ source of bias to the resulting expression level estimates.
 
 _However ...._
 
-The literature suggests trimming probably isn't nessecary for differential abundance studies, 
+The literature suggests trimming probably isn't necessary for differential abundance studies, 
 and quantification at the gene level is very similar between trimmed and untrimmed reads [(Liao & Shi, 2020)](https://doi.org/10.1093%2Fnargab%2Flqaa068). 
 This is because aligners used for RNA-seq 'soft-clip' reads - discarding parts at the ends of reads that don't match the reference genome.
 
@@ -112,7 +114,7 @@ FastQC shows plots summarizing exact duplicate reads, and overrepresented sequen
 <a href="https://api.laxy.io/api/v1/job/3pLfQoLEuWeAnWh4H3Vvbv/files/output/results/fastqc/SRR2155413_fastqc.html?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a#M8" target="_blank">SRR2155413 Sequence Duplication Levels</a>
 
 > Look at the Sequence Duplication Level plot - what might the peak at >10 duplication level represent ?
-> The red line represents the proportion that the same sequences would contribute after deduplication.
+> The red line represents the proportion that those same sequences would contribute after deduplication.
 
 Our example dataset doesn't have any overrepresented sequences, but it's not uncommon to see ribosomal rRNAs, 
 poly-A or poly-G, adapter sequences, or sequences prone to strong PCR amplification in this list.
 
@@ -1,6 +1,8 @@
 # Alignment / mapping  {-}
 
-![Simple RNA-seq pipeline overview: FastQC step](images/pipeline/simple_pipeline_alignment.svg){width="100%"}
+![](images/pipeline/simple_pipeline_alignment.svg){width="100%" fig-alt="Simple RNA-seq pipeline overview: Alignment step"}
+
+----
 
 Short read aligners aim to find a matching region of sequence in a reference genome.
 
@@ -13,7 +15,7 @@ Short read aligners aim to find a matching region of sequence in a reference gen
 
 - A BAM file with aligned reads
 
-We typically align to the **whole genome sequence** rather than just the **predicted transcriptome sequences** - this allows use to assess useful 
+We typically align to the **whole genome sequence** rather than just the **predicted transcriptome sequences** - this allows us to assess useful 
 quality metrics, like the proportion of reads that map to exons vs other regions of the genome.
 
 Popular alignment tools for RNA-seq in eukaryotes such as STAR, HISAT2 or Subread are splice-aware - they will split 
@@ -47,7 +49,7 @@ quality metrics from many tools into a single report. We will examine the report
 the `nf-core/rnaseq` pipeline using MultiQC, focusing on the parts that are most useful for understanding
 your dataset.
 
-> Here's a link to the report [on laxy.io](https://api.laxy.io/api/v1/job/3pLfQoLEuWeAnWh4H3Vvbv/files/output/results/multiqc/star_salmon/multiqc_report.html?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a){target="_blank"} (and an alternative backup version [here](files/multiqc/SRP062287/multiqc_report.html){target="_blank"}).
+> Here's a link to [the MultiQC report](files/multiqc/SRP062287/multiqc_report.html){target="_blank"} (or the same [on laxy.io](https://api.laxy.io/api/v1/job/3pLfQoLEuWeAnWh4H3Vvbv/files/output/results/multiqc/star_salmon/multiqc_report.html?access_token=e0124ee9-c8ad-4164-b59f-ca2ae0ce4d2a){target="_blank"}).
 
 We can use these QC metrics to identify:
 
@@ -71,18 +73,18 @@ This is handy for seeing some key metrics all in one place.
 
 There's a lot of columns here - you can use the "configure columns" button to show just a subset.
 
-![MultiQC configure columns](images/multiqc/mutliqc_configure_columns.jpg){width="60%"}
+![](images/multiqc/mutliqc_configure_columns.jpg){width="60%" fig-alt="MultiQC configure columns"}
 
 Let's skip interpreting this table to begin with, since most of the information is repeated in plots below. 
 
 ## Percent (%) aligned reads
 
-Use the sidebar to jump to the ["STAR" section of the report](files/multiqc/SRP062287/multiqc_report.html#star){target="_blank"}) - this shows the number and proportion of reads, and if the mapped uniquely or otherwise.
+▷ Use the sidebar to jump to the ["STAR" section of the report](files/multiqc/SRP062287/multiqc_report.html#star){target="_blank"} - this shows the number and proportion of reads, and whether they mapped uniquely or otherwise.
 
-The overall number of reads per sample here will closely follow the total raw read numbers (eg (files/multiqc/SRP062287/multiqc_report.html#fastqc_sequence_counts){target="_blank"})[from FastQC]).
+The overall number of reads per sample here will closely follow the total raw read numbers (eg [from FastQC](files/multiqc/SRP062287/multiqc_report.html#fastqc_sequence_counts){target="_blank"}).
 Usually an attempt is made to load similar amounts of material for each sample onto the flowcell, but it's not possible to perfectly balance this loading, so library sizes will vary between samples.
 
-Change plot to show percentages rather than raw numbers of mapped reads.
+▷ Change the plot to show percentages rather than raw numbers of mapped reads.
 
 In this case we can see every sample has ~92 % **uniquely mapped** reads and is very consistent (this isn't always the case !).
 
@@ -94,7 +96,7 @@ There are several classes of read here:
 - Unmapped: too short (low quality reads, trimmed to oblivion)
 - Unmapped: other (no match to the genome sequence)
 
-For traditional feature counting to estimate gene expression, only the the unambiguous, uniquely mapped reads are used.
+For traditional feature counting to estimate gene expression, only the **unambiguous, uniquely mapped** reads are used.
 
 **What should you expect from your data ?**
 
@@ -108,7 +110,7 @@ the mouse genome with our human cell line example dataset.
 
 ### MultiQC: Duplication statistics
 
-[See this section](files/multiqc/SRP062287/multiqc_report.html#picard)
+▷ [See this section](files/multiqc/SRP062287/multiqc_report.html#picard)
 
 `picard MarkDuplicates` identifies and quantifies 'duplicate' mapped reads - if two reads have the same 5' start position and the same orientation (strand) (prior to soft-clipping), one of these reads is considered a duplicate.
 _(For paired end reads, both R1 and R2 5' start positions must match another pair)._
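
The duplicate rule described above can be sketched in a few lines of Python. This is a simplified illustration for single-end reads, not how `picard MarkDuplicates` is implemented: it treats the first read seen at a given (chromosome, 5' start, strand) as the original and flags every later read with the same key. The read records are hypothetical.

```python
# Hypothetical single-end alignments: (read_name, chromosome, five_prime_start, strand).
alignments = [
    ("r1", "chr1", 100, "+"),
    ("r2", "chr1", 100, "+"),  # same 5' start and strand as r1
    ("r3", "chr1", 100, "-"),  # same position, opposite strand
    ("r4", "chr2", 100, "+"),  # different chromosome
]


def mark_duplicates(reads):
    """Return (read_name, is_duplicate) pairs, keeping the first read per key."""
    seen = set()
    flags = []
    for name, chrom, five_prime_start, strand in reads:
        key = (chrom, five_prime_start, strand)
        flags.append((name, key in seen))
        seen.add(key)
    return flags


duplicate_flags = mark_duplicates(alignments)
duplication_rate = sum(is_dup for _, is_dup in duplicate_flags) / len(duplicate_flags)
print(duplicate_flags, duplication_rate)
```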
@@ -119,11 +121,11 @@ These duplicates can be of several origins:
 2. Reads generated by **PCR** of a single fragment, which are derived from only one transcript molecule
 3. 'Optical duplicates' that originate from image analysis artefacts - where a single cluster on the flowcell is incorrectly interpreted as two (or more) clusters
 
-Duplicate reads in category (1, biological) are generally 'good', in that they represent the real transcript abundance. The deeper you sequence (the larger the library), the more duplicates you'll get since the chance of two reads happening to have the same start position by chance increases.
+Duplicate reads in category 1 (**biological**) are generally 'good', in that they represent the real transcript abundance. The deeper you sequence (the larger the library), the more duplicates you'll get since the chance of two reads happening to have the same start position increases.
 
-Duplicate reads in category (2, PCR) usually aren't desired, since they may bias the real transcript abundance, inflating counts for fragments that happen to amplify more efficiently.
+Duplicate reads in category 2 (**PCR**) usually aren't desired, since they may bias the real transcript abundance, inflating counts for fragments that happen to amplify more efficiently.
 
-Duplicate reads in category (3, optical) seem to be rare in data on modern instruments, probably probably due to better image analysis and patterned flow cells. 
+Duplicate reads in category 3 (**optical**) seem to be rare in data on modern instruments, probably due to better image analysis and patterned flow cells. 
 
 We can't _always_ know if a duplicate is biological (1), PCR-derived (2) or optical (3).
 
@@ -145,12 +147,14 @@ There are two main solutions to avoid excessive PCR duplicates are:
 - Get more starting material so you don't need to artificially increase the library size via many PCR cycles
 - Use Unique Molecular Identifiers (UMIs) in the library construction to allow PCR duplicates to be identified.
 
-> For 3'-focused sequencing you are likely to see more duplicates that are not PCR or optical
+> For 3'-focused sequencing you are likely to see more duplicates that are not PCR or optical.
 
 #### UMIs and dupRadar
 
-This is the purpose of Universal Molecular Identifiers (UMIs) - semi-random sequences incorporated into fragments as early as possible in library prep, prior to many rounds of PCR.
-They allow duplicated generated by PCR to be identified and accounted for.
+The purpose of Unique Molecular Identifiers (UMIs) is to help account for PCR duplicates.
+
+These are semi-random sequences incorporated into fragments as early as possible in library prep, prior to many rounds of PCR.
+They allow duplicates of the same fragment generated by PCR to be identified by the unique random sequence, and hence accounted for.
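
A minimal Python sketch of how UMIs help, assuming each aligned read has already had its UMI extracted: reads that share position, strand and UMI are counted as one original molecule, while reads at the same position with different UMIs are kept as separate molecules. The records are hypothetical, and the sketch ignores sequencing errors in the UMI, which dedicated UMI tools typically handle.

```python
# Hypothetical aligned reads: (chromosome, five_prime_start, strand, umi).
umi_reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),  # same position and UMI: likely a PCR duplicate
    ("chr1", 100, "+", "TTAG"),  # same position, different UMI: a distinct molecule
]


def count_unique_molecules(reads):
    """Count distinct (position, strand, UMI) combinations."""
    return len({(chrom, start, strand, umi) for chrom, start, strand, umi in reads})


unique_molecules = count_unique_molecules(umi_reads)
pcr_duplicates = len(umi_reads) - unique_molecules
print(unique_molecules, pcr_duplicates)
```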
 
 _[dupRadar](https://doi.org/10.1186/s12859-016-1276-2) can help assess if an experiment is suffering from excessive PCR duplication, vs. 'over-sequencing'_
 
@@ -164,18 +168,18 @@ _[dupRadar](https://doi.org/10.1186/s12859-016-1276-2) can help assess if an exp
 
 #### Qualimap - read origin and gene coverage
 
-[See this section](files/multiqc/SRP062287/multiqc_report.html#qualimap)
+▷ [See this section](files/multiqc/SRP062287/multiqc_report.html#qualimap)
 
 #### Biotype counts & RSeQC read distribution
 
-[RSeqQC read distribution](files/multiqc/SRP062287/multiqc_report.html#rseqc-read_distribution)
+▷ [RSeQC read distribution](files/multiqc/SRP062287/multiqc_report.html#rseqc-read_distribution)
 
-[Biotype Counts](files/multiqc/SRP062287/multiqc_report.html#biotype_counts)
+▷ [Biotype Counts](files/multiqc/SRP062287/multiqc_report.html#biotype_counts)
 
-> The biotype counts in nf-core/rnaseq are calculated using `featureCounts` but aggregated at the feature level, not the gene level.
+> The biotype counts in nf-core/rnaseq are calculated using `featureCounts` but aggregated at the biotype feature level, not the gene level.
 
-[Samtools mapping by XY and chromosome](files/multiqc/SRP062287/multiqc_report.html#samtools-idxstats-xy-counts)
+▷ [Samtools mapping by XY and chromosome](files/multiqc/SRP062287/multiqc_report.html#samtools-idxstats-xy-counts)
 
 #### Strandedness
 
-[RSeQC: Infer experiment](files/multiqc/SRP062287/multiqc_report.html##rseqc-infer_experiment)
+▷ [RSeQC: Infer experiment](files/multiqc/SRP062287/multiqc_report.html#rseqc-infer_experiment)