This repository was archived by the owner on Jun 21, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 8
Expand file tree
/
Copy pathbulkref.Rmd
More file actions
192 lines (154 loc) · 9.58 KB
/
bulkref.Rmd
File metadata and controls
192 lines (154 loc) · 9.58 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
---
bibliography: ref.bib
---
# Using the classic mode
```{r, echo=FALSE, results='asis'}
library(rebook)
chapterPreamble(cache=TRUE)
```
## Overview
`r Biocpkg("SingleR")` detects markers in a pairwise manner between labels in the reference dataset.
Specifically, for each label of interest, it performs pairwise comparisons to every other label in the reference
and identifies the genes that are upregulated in the label of interest for each comparison.
The initial score calculation is then performed on the union of marker genes across all comparisons for all label.
This approach ensures that the selected subset of features will contain genes that distinguish each label from any other label.
(In contrast, other approaches that treat the "other" labels as a single group do not offer this guarantee;
see [here](https://osca.bioconductor.org/marker-detection.html#standard-application) for a discussion.)
It also allows the fine-tuning step to aggressively improve resolution by only using marker genes
from comparisons where both labels have scores close to the maximum.
The original ("classic") marker detection algorithm used in @aran2019reference identified marker genes
based on their log-fold changes in each pairwise comparison.
Specifically, it used the genes with the largest positive differences in the per-label median log-expression values between labels.
The number of genes taken from each pairwise comparison was defined as $500 (\frac{2}{3})^{\log_{2}(n)}$,
where $n$ is the number of unique labels in the reference;
this scheme aimed to reduce the number of genes (and thus the computational time) as the number of labels and pairwise comparisons increased.
Classic mode is primarily intended for reference datasets that have little or no replication,
a description that covers many of the bulk-derived references
and precludes more complicated marker detection procedures (Chapter \@ref(more-markers)).
## Annotating the test dataset
For demonstration purposes, we will use the @grun2016denovo haematopoietic stem cell (HSC)
dataset from the `r Biocpkg("scRNAseq")` package.
The `GrunHSCData()` function conveniently returns a `SingleCellExperiment`
object containing the count matrix for this dataset.
```{r}
library(scRNAseq)
sce <- GrunHSCData(ensembl=TRUE)
sce
```
Our aim is to annotate each cell with the ImmGen reference dataset [@ImmGenRef] from the `r Biocpkg("celldex")` package.
(Some further comments on the choice of reference are provided below in Section \@ref(reference-choice).)
Calling the `ImmGenData()` function returns a `SummarizedExperiment` object
containing a matrix of log-expression values with sample-level labels.
We also set `ensembl=TRUE` to match the reference's gene annotation with that in the `sce` object -
the default behavior is to use the gene symbol.
```{r}
library(celldex)
immgen <- ImmGenData(ensembl=TRUE)
immgen
```
Each `r Biocpkg("celldex")` dataset actually has three sets of labels that primarily differ in their resolution.
For the purposes of this demonstration, we will use the "fine" labels in the `label.fine` metadata field,
which represents the highest resolution of annotation available for this dataset.
```{r}
head(immgen$label.fine)
```
We perform annotation by calling `SingleR()` on our test (Grun) dataset and the reference (ImmGen) dataset,
leaving the default of `de.method="classic"` to use the original marker detection scheme.
This applies the algorithm described in Section \@ref(method-description),
returning a `DataFrame` where each row contains prediction results for a single cell in the `sce` object.
Labels are provided in the `labels` column; some of the other fields are discussed in more detail in Chapter \@ref(annotation-diagnostics).
```{r}
library(SingleR)
# See 'Choices of assay data' for 'assay.type.test=' explanation.
pred <- SingleR(test = sce, ref = immgen,
labels = immgen$label.fine, assay.type.test=1)
colnames(pred)
```
## Interaction with quality control
Upon examining the distribution of assigned labels, we see that many of them are related to stem cells.
However, there are quite a large number of more differentiated labels mixed in,
which is not what we expect from a sorted population of HSCs.
```{r}
head(sort(table(pred$labels), decreasing=TRUE))
```
```{r, echo=FALSE}
# Sanity check that we got less stem here.
stuff <- head(sort(table(pred$labels), decreasing=TRUE))
stopifnot(any(!grepl("Stem", names(stuff))))
```
This is probably because - despite what its name might suggest -
the dataset obtained by `GrunHSCData()` actually contains more than HSCs.
If we restrict our analysis to the sorted HSCs (obviously) and remove one low-quality batch
(see [the analysis here](https://osca.bioconductor.org/merged-hcsc.html#quality-control-12) for the rationale)
we can see that the distribution of cell type labels is more similar to what we might expect.
Low-quality cells lack information for accurate label assignment and need to be removed to enable interpretation of the results.
```{r}
actual.hsc <- pred$labels[sce$protocol=="sorted hematopoietic stem cells" & sce$sample!="JC4"]
head(sort(table(actual.hsc), decreasing=TRUE))
```
```{r, echo=FALSE}
# Sanity check that we got some stem.
is.stem <- grepl("Stem", actual.hsc)
stopifnot(mean(is.stem) > 0.95)
```
Filtering the annotation results in the above manner is valid because `SingleR()` operates independently on each test cell.
The annotation is orthogonal to any decisions about the relative quality of the cells in the test dataset;
the same results will be obtained regardless of whether `r Biocpkg("SingleR")` is run before or after quality control.
This is logistically convenient as it means that the annotation does not have to be repeated
if the quality control scheme (or any other downstream step, like clustering) changes throughout the lifetime of the analysis.
## Choices of assay data
For the reference dataset, the assay matrix _must_ contain log-transformed normalized expression values.
This is because the default marker detection scheme computes log-fold changes by subtracting the medians,
which makes little sense unless the input expression values are already log-transformed.
For alternative schemes, this requirement may be relaxed (e.g., Wilcoxon rank sum tests do not require transformation);
similarly, if pre-defined markers are supplied, no transformation or normalization is necessary.
For the test data, the assay data need not be log-transformed or even (scale) normalized.
This is because `SingleR()` computes Spearman correlations within each cell,
which is unaffected by monotonic transformations like cell-specific scaling or log-transformation.
It is perfectly satisfactory to provide the raw counts for the test dataset to `SingleR()`,
which is the reason for setting `assay.type.test=1` in our previous `SingleR()` call for the Grun dataset.
The exception to this rule occurs when comparing data from full-length technologies to the `r Biocpkg("celldex")` references.
These references are intended to be comparable to data from unique molecular identifier (UMI) protocols
where the expression values are less sensitive to differences in gene length.
Thus, when annotating Smart-seq2 test datasets against the `r Biocpkg("celldex")` references,
better performance can often be achieved by processing the test counts to transcripts-per-million values.
We demonstrate below using another HSC dataset that was generated using the Smart-seq2 protocol [@nestorowa2016singlecell].
Again, we see that most of the predicted labels are related to stem cells, which is comforting.
```{r}
sce.nest <- NestorowaHSCData()
# Getting the exonic gene lengths.
library(AnnotationHub)
mm.db <- AnnotationHub()[["AH73905"]]
mm.exons <- exonsBy(mm.db, by="gene")
mm.exons <- reduce(mm.exons)
mm.len <- sum(width(mm.exons))
# Computing the TPMs with a simple scaling by gene length.
library(scater)
keep <- intersect(names(mm.len), rownames(sce.nest))
tpm.nest <- calculateTPM(sce.nest[keep,], lengths=mm.len[keep])
# Performing the assignment.
pred <- SingleR(test = tpm.nest, ref = immgen, labels = immgen$label.fine)
head(sort(table(pred$labels), decreasing=TRUE), 10)
```
```{r, echo=FALSE}
# Sanity check that we got some stem.
is.stem <- grepl("Stem", pred$labels)
stopifnot(mean(is.stem) > 0.95)
```
## Comments on choice of references {#reference-choice}
Unsurprisingly, the choice of reference has a major impact on the annotation results.
We need to pick a reference that contains a superset of the labels that we expect to be present in our test dataset.
Whether the original authors assigned appropriate labels to the reference samples is largely taken as a matter of faith;
it is not entirely unexpected that some references are "better" than others depending on the quality of sample preparation.
We would also prefer a reference that is generated from a similar technology or protocol as our test dataset,
though this is usually not an issue when using `SingleR()` to annotate well-defined cell types.
Users are advised to read the `r Biocpkg("celldex", vignette="userguide.html", "relevant vignette")`
for more details about the available references as well as some recommendations on which to use.
(As an aside, the ImmGen dataset and other references were originally supplied along with `r Biocpkg("SingleR")` itself
but have since been migrated to the separate `r Biocpkg("celldex")` package for more general use throughout Bioconductor.)
Of course, as we shall see in the next Chapter, it is entirely possible to supply your own reference datasets instead;
all we need are log-expression values and a set of labels for the cells or samples.
## Session information {-}
```{r, echo=FALSE, results='asis'}
prettySessionInfo()
```