biomedical-knowledge-mining-book/FAQ.qmd at master · YuLab-SMU/biomedical-knowledge-mining-book · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136


# Frequently asked questions {#faq}

```{r}
#| echo: false
#| message: false
#| cache: false

source("_common.R")
```

## How to prepare your own geneList {#sec-genelist}

GSEA analysis requires a ranked gene list, which contains three features:


+ numeric vector: fold change or other type of numerical variable
+ named vector: every number has a name, the corresponding gene ID
+ sorted vector: number should be sorted in decreasing order


If you import your data from a `csv` file, the file should contains two columns, one for gene ID (no duplicated ID allowed) and another one for fold change. You can prepare your own `geneList` via the following command:


```r
d = read.csv(your_csv_file)
## assume 1st column is ID
## 2nd column is FC

## feature 1: numeric vector
geneList = d[,2]

## feature 2: named vector
names(geneList) = as.character(d[,1])

## feature 3: decreasing order
geneList = sort(geneList, decreasing = TRUE)
```


## No gene can be mapped

+ <https://www.biostars.org/p/431270/>
+ <https://github.com/YuLab-SMU/clusterProfiler/issues/280>


## Showing specific pathways {#showing-specific-pathways}

By default, all the visualization methods provided by `r Biocpkg("enrichplot")` display most significant pathways. If users are interested to show some specific pathways (e.g. excluding some unimportant pathways among the top categories), users can pass a vector of selected pathways to the `showCategory` parameter in `dotplot()`, `barplot()`, `treeplot()`, `cnetplot()` and `emapplot()` etc.


See @fig-selectedPaths for a side-by-side comparison of default top categories versus user-selected pathways.

```{r}
#| label: fig-selectedPaths
#| fig-height: 6
#| fig-width: 15
#| fig-cap: "**Showing specific pathways.** Top ten most significant pathways (A), selected ten pathways (B)."
#| fig-scap: "Showing specific pathways."
library(clusterProfiler)
library(enrichplot)
data(geneList, package='DOSE')
de <- names(geneList)[abs(geneList)>1]

x <- enrichKEGG(de)

## show top 10 most significant pathways and want to exclude the second one
## dotplot(x, showCategory = x$Description[1:10][-2])

set.seed(2020-10-27)
selected_pathways <- sample(x$Description, 5)
selected_pathways

p1 <- dotplot(x, showCategory = 10, font.size=14)
p2 <- dotplot(x, showCategory = selected_pathways, font.size=14)


aplot::plot_list(p1, p2, tag_levels = "A")
```

Note: Another solution is using the `filter` verb to extract a subset of the result as described in [Chapter 16](#clusterProfiler-dplyr).


## How to extract genes of a specific term/pathway

```{r}
id <- x$ID[1:3]
id
x[[id[1]]]
geneInCategory(x)[id]
```


## Wrap long axis labels {#label-format}

Most of the functions in `r Biocpkg("enrichplot")` can automatically split long labels across multiple lines. Users can passed a line width to the `label_format` parameter (default is 30). It also supports user defined function to format label strings.

See @fig-formatLabel for examples of wrapping long axis labels using numeric width and a custom labeller.

```{r}
#| label: fig-formatLabel
#| fig-height: 6
#| fig-width: 10
#| fig-cap: "**Wrap long axis labels.** Passing a numeric value to specify string width (A), a user specifiable labeller function (B)."
#| fig-scap: "Wrap long axis labels."
library(ReactomePA)
y <- enrichPathway(de)

p1 <- dotplot(y, label_format = 20)
p2 <- dotplot(y, label_format = function(x) stringr::str_wrap(x, width=20))
cowplot::plot_grid(p1, p2, ncol=2, labels=c("A", "B"))
```


The `label_format` option works with `barplot()`, `dotplot()`, `heatplot()`, `treeplot` and `ridgeplot()`.

## Why many genes are not annotated in KEGG?

Users often find that a significant portion of their gene list is dropped in the KEGG enrichment analysis. For example, a user reported that only 281 genes were retained from a list of over 700 genes.

This is not a software issue but rather a reflection of the database coverage. KEGG pathways mainly focus on metabolic and signaling pathways, and many genes are not annotated in any KEGG pathway.

We can verify the annotation coverage using `bitr_kegg()`:

```{r eval=FALSE}
# gene is a vector of gene IDs (e.g., Entrez IDs)
kk <- bitr_kegg(geneID = gene, fromType='ncbi-geneid', toType='Path', organism='hsa')
```

The warning message will indicate the percentage of genes that failed to map.

To manually verify a specific gene, you can visit the KEGG website using a URL constructed with the organism code and gene ID, for example: `http://www.genome.jp/dbget-bin/www_bget?hsa:100506775`. If the gene page does not list any pathway, it means the gene has no KEGG pathway annotation.