diff --git a/.github/ISSUE_TEMPLATE/other-issue.md b/.github/ISSUE_TEMPLATE/other-issue.md index c253dbbf..6ec2f944 100644 --- a/.github/ISSUE_TEMPLATE/other-issue.md +++ b/.github/ISSUE_TEMPLATE/other-issue.md @@ -8,13 +8,13 @@ assignees: '' ### Background - + ### Problem -#### What potential "gotchas" do we know of? +### What potential "gotchas" do we know of? @@ -23,4 +23,11 @@ assignees: '' + + +### Is there a particular timeframe for this issue? + + + + diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index e14de305..a95941d8 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,28 +1,33 @@ -## Purpose: -What issue(s) does your PR address? -## Strategy -What was your strategy for this new or edited analysis? +## Use the 'Preview' view to click on a link below to choose an appropriate PR template: -## Concerns/Questions for reviewers: -What things should reviewers look out for? +For any stage of adding a new analysis example: + New analysis PR -## Analysis Pull Request Check List (roughly in order): +For publishing changes to Github pages: + Publishing PR -### Content checks -* [ ] All `{{BLANKS}}` have been replaced with the correct content. -* [ ] [Sources are cited](https://github.com/AlexsLemonade/refinebio-examples/blob/master/CONTRIBUTING.md#citing-sources-in-text) -* [ ] Seed is set (if applicable) +For either stage of a hotfix PR -- something that is straightforward and needs to be user-facing quickly: + Hotfix PR -### Formatting Checks -* [ ] Removed any manual numbering of sections. -* [ ] Removed any instances of chunk naming. -* [ ] Comments and documentation are up to date. -* [ ] All links have been checked and are properly formatted. +## For any other types of PRs that don't fit any of the above categories, delete the previous section (including this line) and use the template below. -### Add datasets to S3 -* [ ] Added [data and metadata files to S3](https://github.com/AlexsLemonade/refinebio-examples/blob/master/CONTRIBUTING.md#adding-datasets-to-the-S3-bucket). +### Purpose -### Docker/Snakemake rendering components -* [ ] Added the `.html` [link to the navigation bar](https://github.com/AlexsLemonade/refinebio-examples/blob/master/CONTRIBUTING.md#add-new-analyses-to-the-navbar). -* [ ] Any not yet added packages needed for this analysis have been added to the Dockerfile and it successfully builds. + + + + +### Issue addressed + + + +#### Gotchas the reviewer should know about + + + +## Remaining concerns and questions + + + + diff --git a/.github/PULL_REQUEST_TEMPLATE/hotfix-pr.md b/.github/PULL_REQUEST_TEMPLATE/hotfix-pr.md new file mode 100644 index 00000000..42345361 --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE/hotfix-pr.md @@ -0,0 +1,23 @@ +## Hotfix Purpose + + + +## Pull Request Stage + + + +This is the **first PR**, to `master` -- it has not been reviewed at all. + +This is the **second PR**, to `staging` -- it was reviewed for the `master` branch on and now should be reviewed mostly for correct merge conflict resolutions. + +## Explain the Fix + + + + + +## Comments/Questions for the Reviewer + + + + diff --git a/.github/PULL_REQUEST_TEMPLATE/new-analysis-pr.md b/.github/PULL_REQUEST_TEMPLATE/new-analysis-pr.md new file mode 100644 index 00000000..736a9e0f --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE/new-analysis-pr.md @@ -0,0 +1,47 @@ +## Analysis Purpose + + + +## Pull Request Stage + + + +This is a **Draft PR** - needs review of big concepts and outline + +This is a **Refined PR** - needs review of details and polishing + + + +## Strategy + + + + + +## Concerns/Questions for reviewers: + + + + + + + +## Analysis Pull Request Check List (roughly in order): + +### Content checks +* [ ] All `{{BLANKS}}` have been replaced with the correct content. +* [ ] [Sources are cited](https://github.com/AlexsLemonade/refinebio-examples/blob/master/CONTRIBUTING.md#citing-sources-in-text) +* [ ] Seed is set (if applicable) + +### Formatting Checks +* [ ] Removed any manual numbering of sections. +* [ ] Removed any instances of chunk naming. +* [ ] Comments and documentation are up to date. +* [ ] All links have been checked and are properly formatted. + +### Add datasets to S3 +* [ ] Added [data and metadata files to S3](https://github.com/AlexsLemonade/refinebio-examples/blob/master/CONTRIBUTING.md#adding-datasets-to-the-S3-bucket). + +### Docker/Snakemake rendering components +* [ ] Added the `.html` [link to the navigation bar](https://github.com/AlexsLemonade/refinebio-examples/blob/master/CONTRIBUTING.md#add-new-analyses-to-the-navbar). +* [ ] Any not yet added packages needed for this analysis have been added to the Dockerfile and it successfully builds. diff --git a/.github/PULL_REQUEST_TEMPLATE/publish-pr.md b/.github/PULL_REQUEST_TEMPLATE/publish-pr.md new file mode 100644 index 00000000..7abea311 --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE/publish-pr.md @@ -0,0 +1,19 @@ +## Changes Being Published + + + +## List of Commits/PR's Included + + + +- PR# Added such and such + +## Link to html Preview + +For reviewing purposes, you can take a browse through htmlpreview: http://htmlpreview.github.io/?https://github.com/AlexsLemonade/refinebio-examples/gh-pages-stages/01-getting-started/getting-started.html + +## Publishing Checklist: + +- [ ] Does everything look good using html preview -- having taken a particularly close look at any new html files? + +- [ ] Have any new examples been added to the [refinebio-examples feedback survey](https://app.hubspot.com/forms/5187852/a50f293c-1ef4-4ee1-b7ee-c563afe2ad5c/performance)? Ideally, don't click publish until this PR is merged (changes are autosaved). diff --git a/.gitignore b/.gitignore index a9e154ea..c9721416 100644 --- a/.gitignore +++ b/.gitignore @@ -21,3 +21,6 @@ _site # If spell check is run, we don't really have a need for tracking this spell_check_results.tsv + +# If git commit history is saved to file, ignore it +git.log diff --git a/02-microarray/pathway-analysis_microarray_00_intro.Rmd b/02-microarray/pathway-analysis_microarray_00_intro.Rmd deleted file mode 100644 index 696e6724..00000000 --- a/02-microarray/pathway-analysis_microarray_00_intro.Rmd +++ /dev/null @@ -1,31 +0,0 @@ ---- -title: "Pathway Analysis Introduction" -output: - html_notebook: - toc: true - toc_float: true - number_sections: true ---- - -## Background - -Over-representation analysis (ORA) is a method of pathway or gene set analysis -where one can ask if a set of genes (e.g., those differentially expressed -using some cutoff) shares more or less genes with gene sets/pathways than -we would expect at random. -The other methodologies introduced throughout this module such as QuSAGE and -GSEA can require more samples than a different expression analysis. -For instance, the sample label permutation step of GSEA is reported to -perform poorly with 7 samples or less in each group -([](https://doi.org/10.1093/nar/gkt660)). -It is not uncommon to have n ~ 3 for each group in a treatment-control -transcriptomic study, at which point identifying differentially expressed genes -is possible. -If you are interested in performing pathway analysis on a small study, ORA may -be your best bet. -There are some limitations to ORA methods to be aware such as ignoring -gene-gene correlation. -See [](https://doi.org/10.1371/journal.pcbi.1002375) -to learn more about the different types of pathway analysis and their -limitations. - diff --git a/02-microarray/pathway-analysis_microarray_02_ora.Rmd b/02-microarray/pathway-analysis_microarray_01_ora.Rmd similarity index 72% rename from 02-microarray/pathway-analysis_microarray_02_ora.Rmd rename to 02-microarray/pathway-analysis_microarray_01_ora.Rmd index 74e8b73a..7c071bd1 100644 --- a/02-microarray/pathway-analysis_microarray_02_ora.Rmd +++ b/02-microarray/pathway-analysis_microarray_01_ora.Rmd @@ -11,34 +11,57 @@ output: # Purpose of this analysis -This example is one of pathway analysis module set, we recommend looking at the [pathway analysis introduction](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_00_intro.html) to help you determine which pathway analysis method is best suited for your purposes. +This example is one of pathway analysis module set, we recommend looking at the [pathway analysis table below](#how-to-choose-a-pathway-analysis) to help you determine which pathway analysis method is best suited for your purposes. -This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect at random. -This pathway analysis method does not require any particular sample size, since the only input from your dataset is a set of genes of interest [@Yaari2013]. +This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect by chance. + +ORA is a broadly applicable technique that may be good in scenarios where your dataset or scientific questions don't fit the requirements of other pathway analyses methods. +It also does not require any particular sample size, since the only input from your dataset is a set of genes of interest [@Yaari2013]. + +If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering [GSEA](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_02_gsea.html) for your pathway analysis questions. ⬇️ [**Jump to the analysis code**](#analysis) ⬇️ +### What is pathway analysis? + +Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. +In the context of [refine.bio](https://www.refine.bio/), we use these techniques to analyze and interpret genome-wide gene expression experiments. +The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. +In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis. + +We highly recommend taking a look at [Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002375) from @Khatri2012 for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the [`Resources for further learning` section](#resources-for-further-learning). + +### How to choose a pathway analysis? + +This table summarizes the pathway analyses examples in this module. + +|Analysis|What is required for input|What output looks like |✅ Pros| ⚠️ Cons| +|--------|--------------------------|-----------------------|-------|-------| +|[**ORA (Over-representation Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_01_ora.html)|A list of gene IDs (no stats needed)|A per-pathway hypergeometric test result|- Simple

- Inexpensive computationally to calculate p-values| - Requires arbitrary thresholds and ignores any statistics associated with a gene

- Assumes independence of genes and pathways| +|[**GSEA (Gene Set Enrichment Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_02_gsea.html)|A list of genes IDs with gene-level summary statistics|A per-pathway enrichment score|- Includes all genes (no arbitrary threshold!)

- Attempts to measure coordination of genes|- Permutations can be expensive

- Does not account for pathway overlap

- Two-group comparisons not always appropriate/feasible| +|[**GSVA (Gene Set Variation Analysis)**](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_03_gsva.html)|A gene expression matrix (like what you get from refine.bio directly)|Pathway-level scores on a per-sample basis|- Does not require two groups to compare upfront

- Normally distributed scores|- Scores are not a good fit for gene sets that contain genes that go up AND down

- Method doesn’t assign statistical significance itself

- Recommended sample size n > 10| + # How to run this example For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). -We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. ## Obtain the `.Rmd` file -To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_02_ora_with_webgestaltr.Rmd). +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_01_ora.Rmd). -Clicking this link will most likely send this to your downloads folder on your computer. +Clicking this link will most likely send this to your downloads folder on your computer. Move this `.Rmd` file to where you would like this example and its files to be stored. You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) -## Set up your analysis folders +## Set up your analysis folders Good file organization is helpful for keeping your data analysis project on track! -We have set up some code that will automatically set up a folder structure for you. -Run this next chunk to set up your folders! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! -If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. ```{r} # Create the data folder if it doesn't exist @@ -79,22 +102,24 @@ For this example analysis, we will use this [CREB overexpression zebrafish exper ## Check out our file structure! -Your new analysis folder should contain: +Your new analysis folder should contain: - The example analysis `.Rmd` you downloaded -- A folder called `data` (currently empty) +- A folder called `data` (currently empty) - A folder for `plots` (currently empty) - A folder for `results` (currently empty) - + Your example analysis folder should contain your `.Rmd` and three empty folders (which won't be empty for long!). -If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). # Using a different refine.bio dataset with this analysis? -If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend replacing the `dge_url` with a different file path to a gene list that you are interested in. +The file we use here has several columns of differential expression summary statistics, so if your gene list does not have the exact same information, many steps will need to be changed or deleted entirely depending on what your gene list file looks like (particularly in the [`Determine our genes of interest list` section](#determined-our-genes-of-interest-list)). + We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. -From here you can customize this analysis example to fit your own scientific questions and preferences. +From here you can customize this analysis example to fit your own scientific questions and preferences. *** @@ -149,22 +174,25 @@ library(org.Dr.eg.db) library(magrittr) ``` -## Import data +## Import data -We will read in the differential expression results we will download from online. +For ORA, we only need a list of gene IDs as our input, so this example can work for any situations where you have gene list and want to know more about what biological pathways it shares genes with. + +For this example, we will read in results from a differential expression analysis that we have already performed. +Rather than reading from a local file, we will download the results table directly from a URL. These results are from a zebrafish microarray experiment we used for [differential expression analysis for two groups](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_02_2-groups.html) using [`limma`](https://bioconductor.org/packages/release/bioc/html/limma.html) [@Ritchie2015]. The table contains Ensembl gene IDs, log fold-changes for each group, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to ORA. Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. -First we will assign the URL to its own variable called, `dge_url`. +First we will assign the URL to its own variable called, `dge_url`. ```{r} # Define the url to your differential expression results file dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv" ``` -Read in the file that has differential expression results. +Read in the file that has differential expression results. Here we are using the URL we set up above, but this can be a local file path instead _i.e._ you can replace `dge_url` in the code below with a path to file you have on your computer like: `file.path("results", "GSE71270_limma_results.tsv")`. ```{r} @@ -174,8 +202,8 @@ Here we are using the URL we set up above, but this can be a local file path ins dge_df <- readr::read_tsv(dge_url) ``` -`read_tsv()` can read TSV files online and doesn't necessarily require you download the file first. -Let's take a look at what these contrast results from the differential expression analysis look like. +`read_tsv()` can read TSV files online and doesn't necessarily require you download the file first. +Let's take a look at what these contrast results from the differential expression analysis look like. ```{r} dge_df @@ -208,15 +236,15 @@ MSigDB contains [8 different gene set collections](https://www.gsea-msigdb.org/g In this example, we will use pathways that are gene sets considered to be "canonical representations of a biological process compiled by domain experts" and are a subset of `C2: curated gene sets` [@Subramanian2005]. -Specifically, we will use the [KEGG (Kyoto Encyclopedia of Genes and Genomes)](https://www.genome.jp/kegg/) pathways [@Kanehisa2000]. +Specifically, we will use the [KEGG (Kyoto Encyclopedia of Genes and Genomes)](https://www.genome.jp/kegg/) pathways [@Kanehisa2000]. -First, let's take a look at what information is included in this data frame. +First, let's take a look at what information is included in this data frame. ```{r} head(dr_msigdb_df) ``` -We will need to use `gs_cat` and `gs_subcat` columns to construct a filter step that will only keep curated gene sets and KEGG pathways. +We will need to use `gs_cat` and `gs_subcat` columns to construct a filter step that will only keep curated gene sets and KEGG pathways. ```{r} # Filter the zebrafish data frame to the KEGG pathways that are included in the @@ -228,13 +256,11 @@ dr_kegg_df <- dr_msigdb_df %>% ) ``` -Note: We could have specified that we wanted the KEGG gene sets using the `category` and `subcategory` arguments of `msigdbr()`, but we're going for general steps! -- use `?msigdbr` to see more information. - The `clusterProfiler()` function we will use requires a data frame with two columns, where one column contains the term identifier or name and one column contains gene identifiers that match our gene lists we want to check for enrichment. Our data frame with KEGG terms contains Entrez IDs and gene symbols. -In our differential expression results data frame, `dge_df` we have Ensembl gene identifiers. +In our differential expression results data frame, `dge_df` we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs. ## Gene identifier conversion @@ -291,13 +317,13 @@ gene_key_df <- data.frame( dplyr::filter(!is.na(gene_symbol)) ``` -Let's see a preview of `gene_key_df`. +Let's see a preview of `gene_key_df`. ```{r} head(gene_key_df) ``` -Now we are ready to add the `gene_key_df` to our data frame with the differential expression stats, `dge_df`. +Now we are ready to add the `gene_key_df` to our data frame with the differential expression stats, `dge_df`. Here we're using a `dplyr::left_join()` because we only want to retain the genes that have gene symbols and this will filter out anything in our `dge_df` that does not have gene symbols when we join using the Ensembl gene identifiers. ```{r} @@ -311,7 +337,7 @@ dge_annot_df <- gene_key_df %>% ) ``` -Let's take a look at what this data frame looks like. +Let's take a look at what this data frame looks like. ```{r} # Print out a preview @@ -320,21 +346,21 @@ head(dge_annot_df) ## Over-representation Analysis (ORA) -Over-representation testing using `clusterProfiler` is based on a hypergeometric test [@clusterProfiler-book]. +Over-representation testing using `clusterProfiler` is based on a hypergeometric test (often referred to as Fisher's exact test) [@clusterProfiler-book]. +For more background on hypergeometric tests, this [handy tutorial](https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html) explains more about how hypergeometric tests work [@Puthier2015]. -\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{ {M \choose i}{ {N-M} \choose {n-i} } } { {N \choose n} }\) +We will need to provide to `clusterProfiler` two genes lists: -Where `N` is the number of genes in the background distribution, `M` is the number of genes in a pathway, `n` is the number of genes we are interested in (our differentially expressed genes), and `k` is the number of genes that overlap between the pathway and our genes of interest. - -So, we will need to provide to `clusterProfiler` two genes lists: - -1) Our genes of interest (`n`) -2) What genes were in our total background set (`N`). (All genes that originally had an opportunity to be measured). +1) Our genes of interest +2) What genes were in our total background set. (All genes that originally had an opportunity to be measured). ## Determine our genes of interest list -We will use our differential expression results to get a genes of interest list. -Let's use our adjusted p values as a cutoff. +We will use our differential expression results to get a genes of interest list. +Let's use our adjusted p values as a cutoff. + +This step is highly variable depending on what your gene list is, what information it contains and what your goals are. +You may want to delete this next chunk entirely if you supply an already determined list of genes OR you may need to adjust the cutoffs and column names. ```{r} # Select genes that are below a cutoff @@ -349,7 +375,7 @@ There are a lot of ways we could make a genes of interest list, and using a p-va ORA generally requires you make some sort of arbitrary decision to obtain your genes of interest list and this is one of the approach's weaknesses -- to get to a gene list we've removed all other context. -Because one `gene_symbol` may map to multiple Ensembl IDs, we need to make sure we have no repeated gene symbols in this list. +Because one `gene_symbol` may map to multiple Ensembl IDs, we need to make sure we have no repeated gene symbols in this list. ```{r} # Reduce to only unique gene symbols @@ -361,7 +387,7 @@ head(genes_of_interest) ## Determine our background set gene list -Sometimes folks consider genes from the entire genome to comprise the background, but for our microarray data, we should consider all genes that were measured as our background set. +Sometimes folks consider genes from the entire genome to comprise the background, but for our microarray data, we should consider all genes that were measured as our background set. In other words, if we are unable to detect a gene, it should not be in our background set. We can obtain our detected genes list from our data frame, `dge_annot_df` (which we haven't done filtering on). @@ -392,9 +418,6 @@ kegg_ora_results <- enricher( *Note: using `enrichKEGG()` is a shortcut for doing ORA using KEGG, but the approach we covered here can be used with any gene sets you'd like!* -What is returned by `enricher()`? -You can run `View(kegg_ora_results)` or click on the object in your Environment panel. - The information we're most likely interested in is in the `results` slot. Let's convert this into a data frame that we can write to file. @@ -402,19 +425,19 @@ Let's convert this into a data frame that we can write to file. kegg_result_df <- data.frame(kegg_ora_results@result) ``` -Let's print out a sneak peek of it here and take a look at how many sets do we have that fit our cutoff of `0.1` FDR? +Let's print out a sneak peek of it here and take a look at how many sets do we have that fit our cutoff of `0.1` FDR? ```{r} kegg_result_df %>% dplyr::filter(p.adjust < 0.1) ``` -Looks like there are four KEGG sets returned as significant at FDR of `0.1`. +Looks like there are four KEGG sets returned as significant at FDR of `0.1`. ## Visualizing results We can use a dot plot to visualize our significant enrichment results. -The `enrichplot::dotplot()` function will only plot gene sets that are significant according to the multiple testing corrected p values (in the `p.adjust` column) and the `pvalueCutoff` you provided in the [`enricher()` step](#run-ora-using-the-enricher-function). +The `enrichplot::dotplot()` function will only plot gene sets that are significant according to the multiple testing corrected p values (in the `p.adjust` column) and the `pvalueCutoff` you provided in the [`enricher()` step](#run-ora-using-the-enricher-function). ```{r} enrich_plot <- enrichplot::dotplot(kegg_ora_results) @@ -427,7 +450,7 @@ Use `?enrichplot::dotplot` to see the help page for more about how to use this f This plot is arguably more useful when we have a large number of significant pathways. -Let's save it to a PNG. +Let's save it to a PNG. ```{r} ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_enrich_plot.png"), @@ -441,10 +464,10 @@ We can use an [UpSet plot](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720993/ enrichplot::upsetplot(kegg_ora_results) ``` -See that `KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION` and `KEGG_LYSOSOME` have all their genes in common. +See that `KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION` and `KEGG_LYSOSOME` have all their genes in common. Gene sets or pathways aren't independent! -Let's also save this to a PNG. +Let's also save this to a PNG. ```{r} ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_upset_plot.png"), @@ -466,18 +489,20 @@ readr::write_tsv( # Resources for further learning +- [Hypergeometric test exercises](https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html)[@Puthier2015]. +- [clusterProfiler ORA tutorial](https://learn.gencore.bio.nyu.edu/rna-seq-analysis/over-representation-analysis/#:~:text=Over%2Drepresentation%20(or%20enrichment),a%20subset%20of%20your%20data.) - [clusterProfiler paper](https://doi.org/10.1089/omi.2011.0118) [@Yu2012]. - [clusterProfiler book](https://yulab-smu.github.io/clusterProfiler-book/index.html) [@clusterProfiler-book]. - [This handy review](https://doi.org/10.1371/journal.pcbi.1002375) which summarizes the different types of pathway analysis and their limitations [@Khatri2012]. # Session info -At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of software and packages you used to run this. +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. ```{r} # Print session info -sessionInfo() +sessioninfo::session_info() ``` # References diff --git a/02-microarray/pathway-analysis_microarray_02_ora.html b/02-microarray/pathway-analysis_microarray_01_ora.html similarity index 89% rename from 02-microarray/pathway-analysis_microarray_02_ora.html rename to 02-microarray/pathway-analysis_microarray_01_ora.html index 4fd4efeb..cd2753ef 100644 --- a/02-microarray/pathway-analysis_microarray_02_ora.html +++ b/02-microarray/pathway-analysis_microarray_01_ora.html @@ -1263,25 +1263,22 @@ }; - - + + code.sourceCode > span { display: inline-block; line-height: 1.25; } + code.sourceCode > span { color: inherit; text-decoration: inherit; } + code.sourceCode > span:empty { height: 1.2em; } + .sourceCode { overflow: visible; } + code.sourceCode { white-space: pre; position: relative; } + div.sourceCode { margin: 1em 0; } + pre.sourceCode { margin: 0; } + @media screen { + div.sourceCode { overflow: auto; } + } + @media print { + code.sourceCode { white-space: pre-wrap; } + code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + } + pre.numberSource code + { counter-reset: source-line 0; } + pre.numberSource code > span + { position: relative; left: -4em; counter-increment: source-line; } + pre.numberSource code > span > a:first-child::before + { content: counter(source-line); + position: relative; left: -1em; text-align: right; vertical-align: baseline; + border: none; display: inline-block; + -webkit-touch-callout: none; -webkit-user-select: none; + -khtml-user-select: none; -moz-user-select: none; + -ms-user-select: none; user-select: none; + padding: 0 4px; width: 4em; + color: #aaaaaa; + } + pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } + div.sourceCode + { } + @media screen { + code.sourceCode > span > a:first-child::before { text-decoration: underline; } + } + code span.al { color: #ff0000; } /* Alert */ + code span.an { color: #008000; } /* Annotation */ + code span.at { } /* Attribute */ + code span.bu { } /* BuiltIn */ + code span.cf { color: #0000ff; } /* ControlFlow */ + code span.ch { color: #008080; } /* Char */ + code span.cn { } /* Constant */ + code span.co { color: #008000; } /* Comment */ + code span.cv { color: #008000; } /* CommentVar */ + code span.do { color: #008000; } /* Documentation */ + code span.er { color: #ff0000; font-weight: bold; } /* Error */ + code span.ex { } /* Extension */ + code span.im { } /* Import */ + code span.in { color: #008000; } /* Information */ + code span.kw { color: #0000ff; } /* Keyword */ + code span.op { } /* Operator */ + code span.ot { color: #ff4000; } /* Other */ + code span.pp { color: #ff4000; } /* Preprocessor */ + code span.sc { color: #008080; } /* SpecialChar */ + code span.ss { color: #008080; } /* SpecialString */ + code span.st { color: #008080; } /* String */ + code span.va { } /* Variable */ + code span.vs { color: #008080; } /* VerbatimString */ + code span.wa { color: #008000; font-weight: bold; } /* Warning */ + + + + - - + @@ -2874,15 +3680,20 @@ @@ -2949,23 +3765,75 @@

Over-representation analysis - Microarray

CCDL for ALSF

-

October 2020

+

November 2020

1 Purpose of this analysis

-

This example is one of pathway analysis module set, we recommend looking at the pathway analysis introduction to help you determine which pathway analysis method is best suited for your purposes.

-

This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect at random. This pathway analysis method does not require any particular sample size, since the only input from your dataset is a set of genes of interest (Yaari et al. 2013).

+

This example is one of pathway analysis module set, we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

+

This particular example analysis shows how you can use over-representation analysis (ORA) to determine if a set of genes (e.g., those differentially expressed using some cutoff) shares more or fewer genes with gene sets/pathways than we would expect at random.

+

ORA is a broadly applicable technique that may be good in scenarios where your dataset or scientific questions don’t fit the requirements of other pathway analyses methods. It also does not require any particular sample size, since the only input from your dataset is a set of genes of interest (Yaari et al. 2013).

+

If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering GSEA for your pathway analysis questions.

⬇️ Jump to the analysis code ⬇️

+
+

1.0.1 What is pathway analysis?

+

Pathway analysis refers to any one of many techniques that uses predetermined sets of genes that are related or coordinated in their expression in some way (e.g., participate in the same molecular process, are regulated by the same transcription factor) to interpret a high-throughput experiment. In the context of refine.bio, we use these techniques to analyze and interpret genome-wide gene expression experiments. The rationale for performing pathway analysis is that looking at the pathway-level may be more biologically meaningful than considering individual genes, especially if a large number of genes are differentially expressed between conditions of interest. In addition, many relatively small changes in the expression values of genes in the same pathway could lead to a phenotypic outcome and these small changes may go undetected in differential gene expression analysis.

+

We highly recommend taking a look at Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges from Khatri et al. (2012) for a more comprehensive overview. We have provided primary publications and documentation of the methods we will introduce below as well as some recommended reading in the Resources for further learning section.

+
+
+

1.0.2 How to choose a pathway analysis?

+

This table summarizes the pathway analyses examples in this module.

+ +++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
AnalysisWhat is required for inputWhat output looks like✅ Pros⚠️ Cons
ORA (Over-representation Analysis)A list of gene IDs (no stats needed)A per-pathway hypergeometric test result- Simple

- Inexpensive computationally to calculate p-values
- Requires arbitrary thresholds and ignores any statistics associated with a gene

- Assumes independence of genes and pathways
GSEA (Gene Set Enrichment Analysis)A list of genes IDs with gene-level summary statisticsA per-pathway enrichment score- Includes all genes (no arbitrary threshold!)

- Attempts to measure coordination of genes
- Permutations can be expensive

- Does not account for pathway overlap

- Two-group comparisons not always appropriate/feasible
GSVA (Gene Set Variation Analysis)A gene expression matrix (like what you get from refine.bio directly)Pathway-level scores on a per-sample basis- Does not require two groups to compare upfront

- Normally distributed scores
- Scores are not a good fit for gene sets that contain genes that go up AND down

- Method doesn’t assign statistical significance itself

- Recommended sample size n > 10
+

2 How to run this example

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

2.1 Obtain the .Rmd file

-

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

@@ -2973,26 +3841,26 @@

2.1 Obtain the .Rmd

2.2 Set up your analysis folders

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots" # Can replace with path to desired output plots directory
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results" # Can replace with path to desired output results directory
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

@@ -3021,7 +3889,9 @@

2.5 Check out our file structure!

3 Using a different refine.bio dataset with this analysis?

-

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend replacing the dge_url with a different file path to a gene list that you are interested in. The file we use here has several columns of differential expression summary statistics, so if your gene list does not have the exact same information, many steps will need to be changed or deleted entirely depending on what your gene list file looks like (particularly in the Determine our genes of interest list section).

+

If you have differential expression results or something with a gene-level ranking and a two-group comparison, we recommend considering GSEA for your pathway analysis questions.

+

We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.


 

@@ -3033,31 +3903,31 @@

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

In this analysis, we will be using clusterProfiler package to perform ORA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Dolgalev 2020; Subramanian et al. 2005).

We will also need the org.Dr.eg.db package to perform gene identifier conversion and ggupset to make an UpSet plot (Carlson 2019; Ahlmann-Eltze 2020).

-
if (!("clusterProfiler" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("clusterProfiler", update = FALSE)
-}
-
-# This is required to make one of the plots that clusterProfiler will make
-if (!("ggupset" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("ggupset", update = FALSE)
-}
-
-if (!("msigdbr" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("msigdbr", update = FALSE)
-}
-
-if (!("org.Dr.eg.db" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("org.Dr.eg.db", update = FALSE)
-}
+
if (!("clusterProfiler" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("clusterProfiler", update = FALSE)
+}
+
+# This is required to make one of the plots that clusterProfiler will make
+if (!("ggupset" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("ggupset", update = FALSE)
+}
+
+if (!("msigdbr" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("msigdbr", update = FALSE)
+}
+
+if (!("org.Dr.eg.db" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("org.Dr.eg.db", update = FALSE)
+}

Attach the packages we need for this analysis.

-
# Attach the library
-library(clusterProfiler)
+
# Attach the library
+library(clusterProfiler)
## 
-
## clusterProfiler v3.16.1  For help: https://guangchuangyu.github.io/software/clusterProfiler
+
## clusterProfiler v3.18.0  For help: https://guangchuangyu.github.io/software/clusterProfiler
 ## 
 ## If you use clusterProfiler in published research, please cite:
 ## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
@@ -3066,11 +3936,11 @@

4.1 Install libraries

## The following object is masked from 'package:stats':
 ## 
 ##     filter
-
# Package that contains MSigDB gene sets in tidy format
-library(msigdbr)
-
-# Danio rerio annotation package we'll use for gene identifier conversion
-library(org.Dr.eg.db)
+
# Package that contains MSigDB gene sets in tidy format
+library(msigdbr)
+
+# Danio rerio annotation package we'll use for gene identifier conversion
+library(org.Dr.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
@@ -3092,7 +3962,7 @@

4.1 Install libraries

## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, ## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, ## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, -## union, unique, unsplit, which, which.max, which.min
+## union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
 ## 
@@ -3120,21 +3990,23 @@ 

4.1 Install libraries

## ## select
## 
-
# We will need this so we can use the pipe: %>%
-library(magrittr)
+
# We will need this so we can use the pipe: %>%
+library(magrittr)

4.2 Import data

-

We will read in the differential expression results we will download from online. These results are from a zebrafish microarray experiment we used for differential expression analysis for two groups using limma (Ritchie et al. 2015). The table contains Ensembl gene IDs, log fold-changes for each group, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to ORA.

+

For ORA, we only need a list of gene IDs as our input, so this example can work for any situations where you have gene list and want to know more about what biological pathways it shares genes with.

+

For this example, we will read in results from a differential expression analysis that we have already performed. Rather than reading from a local file, we will download the results table directly from a URL. These results are from a zebrafish microarray experiment we used for differential expression analysis for two groups using limma (Ritchie et al. 2015). The table contains Ensembl gene IDs, log fold-changes for each group, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to ORA.

Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called, dge_url.

-
# Define the url to your differential expression results file
-dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"
+
# Define the url to your differential expression results file
+dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"

Read in the file that has differential expression results. Here we are using the URL we set up above, but this can be a local file path instead i.e. you can replace dge_url in the code below with a path to file you have on your computer like: file.path("results", "GSE71270_limma_results.tsv").

-
# Read in the contents of your differential expression results file
-# `dge_url` can be replaced with a file path to a TSV file with your
-# desired gene list results
-dge_df <- readr::read_tsv(dge_url)
-
## Parsed with column specification:
+
# Read in the contents of your differential expression results file
+# `dge_url` can be replaced with a file path to a TSV file with your
+# desired gene list results
+dge_df <- readr::read_tsv(dge_url)
+
## 
+## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────
 ## cols(
 ##   Gene = col_character(),
 ##   logFC = col_double(),
@@ -3145,7 +4017,7 @@ 

4.2 Import data

## B = col_double() ## )

read_tsv() can read TSV files online and doesn’t necessarily require you download the file first. Let’s take a look at what these contrast results from the differential expression analysis look like.

-
dge_df
+
dge_df

The data we’re interested in here comes from zebrafish samples, so we can obtain just the gene sets relevant to D. rerio with the species argument to msigdbr().

-
dr_msigdb_df <- msigdbr(species = "Danio rerio")
+
dr_msigdb_df <- msigdbr(species = "Danio rerio")

MSigDB contains 8 different gene set collections (Subramanian et al. 2005).

H: hallmark gene sets
 C1: positional gene sets
@@ -3175,21 +4047,20 @@ 

4.3 Getting familiar with c

In this example, we will use pathways that are gene sets considered to be “canonical representations of a biological process compiled by domain experts” and are a subset of C2: curated gene sets (Subramanian et al. 2005).

Specifically, we will use the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways (Kanehisa and Goto 2000).

First, let’s take a look at what information is included in this data frame.

-
head(dr_msigdb_df)
+
head(dr_msigdb_df)

We will need to use gs_cat and gs_subcat columns to construct a filter step that will only keep curated gene sets and KEGG pathways.

-
# Filter the zebrafish data frame to the KEGG pathways that are included in the
-# curated gene sets
-dr_kegg_df <- dr_msigdb_df %>%
-  dplyr::filter(
-    gs_cat == "C2", # This is to filter only to the C2 curated gene sets
-    gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
-  )
-

Note: We could have specified that we wanted the KEGG gene sets using the category and subcategory arguments of msigdbr(), but we’re going for general steps! – use ?msigdbr to see more information.

+
# Filter the zebrafish data frame to the KEGG pathways that are included in the
+# curated gene sets
+dr_kegg_df <- dr_msigdb_df %>%
+  dplyr::filter(
+    gs_cat == "C2", # This is to filter only to the C2 curated gene sets
+    gs_subcat == "CP:KEGG" # This is because we only want KEGG pathways
+  )

The clusterProfiler() function we will use requires a data frame with two columns, where one column contains the term identifier or name and one column contains gene identifiers that match our gene lists we want to check for enrichment.

Our data frame with KEGG terms contains Entrez IDs and gene symbols.

In our differential expression results data frame, dge_df we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs.

@@ -3200,7 +4071,7 @@

4.4 Gene identifier conversionThe annotation package org.Dr.eg.db contains information for different identifiers (Carlson 2019). org.Dr.eg.db is specific to Danio rerio – this is what the Dr in the package name is referencing.

Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

We can see what types of IDs are available to us in an annotation package with keytypes().

-
keytypes(org.Dr.eg.db)
+
keytypes(org.Dr.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 ##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
 ## [11] "GO"           "GOALL"        "IPI"          "ONTOLOGY"     "ONTOLOGYALL" 
@@ -3208,50 +4079,50 @@ 

4.4 Gene identifier conversion

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to gene symbols (SYMBOL), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to Entrez IDs (ENTREZID).

The function we will use to map from Ensembl gene IDs to gene symbols is called mapIds().

-
# This returns a named vector which we can convert to a data frame, where
-# the keys (Ensembl IDs) are the names
-symbols_vector <- mapIds(org.Dr.eg.db, # Specify the annotation package
-  # The vector of gene identifiers we want to map
-  keys = dge_df$Gene,
-  # The type of gene identifier we want returned
-  column = "SYMBOL",
-  # What type of gene identifiers we're starting with
-  keytype = "ENSEMBL",
-  # In the case of 1:many mappings, return the
-  # first one. This is default behavior!
-  multiVals = "first"
-)
+
# This returns a named vector which we can convert to a data frame, where
+# the keys (Ensembl IDs) are the names
+symbols_vector <- mapIds(org.Dr.eg.db, # Specify the annotation package
+  # The vector of gene identifiers we want to map
+  keys = dge_df$Gene,
+  # The type of gene identifier we want returned
+  column = "SYMBOL",
+  # What type of gene identifiers we're starting with
+  keytype = "ENSEMBL",
+  # In the case of 1:many mappings, return the
+  # first one. This is default behavior!
+  multiVals = "first"
+)
## 'select()' returned 1:many mapping between keys and columns

This message is letting us know that sometimes Ensembl gene identifiers will map to multiple gene symbols. In this case, it’s also possible that a gene symbol will map to multiple Ensembl IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.

Let’s create a two column data frame that shows the gene symbols and their Ensembl IDs side-by-side.

-
# We would like a data frame we can join to the differential expression stats
-gene_key_df <- data.frame(
-  ensembl_id = names(symbols_vector),
-  gene_symbol = symbols_vector,
-  stringsAsFactors = FALSE
-) %>%
-  # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
-  # from the data frame
-  dplyr::filter(!is.na(gene_symbol))
+
# We would like a data frame we can join to the differential expression stats
+gene_key_df <- data.frame(
+  ensembl_id = names(symbols_vector),
+  gene_symbol = symbols_vector,
+  stringsAsFactors = FALSE
+) %>%
+  # If an Ensembl gene identifier doesn't map to a gene symbol, drop that
+  # from the data frame
+  dplyr::filter(!is.na(gene_symbol))

Let’s see a preview of gene_key_df.

-
head(gene_key_df)
+
head(gene_key_df)

Now we are ready to add the gene_key_df to our data frame with the differential expression stats, dge_df. Here we’re using a dplyr::left_join() because we only want to retain the genes that have gene symbols and this will filter out anything in our dge_df that does not have gene symbols when we join using the Ensembl gene identifiers.

-
dge_annot_df <- gene_key_df %>%
-  # Using a left join removes the rows without gene symbols because those rows
-  # have already been removed in `gene_symbols_df`
-  dplyr::left_join(dge_df,
-    # The name of the column that contains the Ensembl gene IDs
-    # in the left data frame and right data frame
-    by = c("ensembl_id" = "Gene")
-  )
+
dge_annot_df <- gene_key_df %>%
+  # Using a left join removes the rows without gene symbols because those rows
+  # have already been removed in `gene_symbols_df`
+  dplyr::left_join(dge_df,
+    # The name of the column that contains the Ensembl gene IDs
+    # in the left data frame and right data frame
+    by = c("ensembl_id" = "Gene")
+  )

Let’s take a look at what this data frame looks like.

-
# Print out a preview
-head(dge_annot_df)
+
# Print out a preview
+head(dge_annot_df)

Looks like there are four KEGG sets returned as significant at FDR of 0.1.

@@ -3328,52 +4197,54 @@

4.8 Run ORA using the enric

4.9 Visualizing results

We can use a dot plot to visualize our significant enrichment results. The enrichplot::dotplot() function will only plot gene sets that are significant according to the multiple testing corrected p values (in the p.adjust column) and the pvalueCutoff you provided in the enricher() step.

-
enrich_plot <- enrichplot::dotplot(kegg_ora_results)
-
-# Print out the plot here
-enrich_plot
-

+
enrich_plot <- enrichplot::dotplot(kegg_ora_results)
+
+# Print out the plot here
+enrich_plot
+

Use ?enrichplot::dotplot to see the help page for more about how to use this function.

This plot is arguably more useful when we have a large number of significant pathways.

Let’s save it to a PNG.

-
ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_enrich_plot.png"),
-  plot = enrich_plot
-)
+
ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_enrich_plot.png"),
+  plot = enrich_plot
+)
## Saving 7 x 5 in image

We can use an UpSet plot to visualize the overlap between the gene sets that were returned as significant.

-
enrichplot::upsetplot(kegg_ora_results)
+
enrichplot::upsetplot(kegg_ora_results)

See that KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION and KEGG_LYSOSOME have all their genes in common. Gene sets or pathways aren’t independent!

Let’s also save this to a PNG.

-
ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_upset_plot.png"),
-  plot = enrich_plot
-)
+
ggplot2::ggsave(file.path(plots_dir, "GSE71270_ora_upset_plot.png"),
+  plot = enrich_plot
+)
## Saving 7 x 5 in image

4.10 Write results to file

-
readr::write_tsv(
-  kegg_result_df,
-  file.path(
-    results_dir,
-    "GSE71270_pathway_analysis_results.tsv"
-  )
-)
+
readr::write_tsv(
+  kegg_result_df,
+  file.path(
+    results_dir,
+    "GSE71270_pathway_analysis_results.tsv"
+  )
+)

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

-
# Print session info
-sessionInfo()
+
# Print session info
+sessionInfo()
## R version 4.0.2 (2020-06-22)
 ## Platform: x86_64-pc-linux-gnu (64-bit)
 ## Running under: Ubuntu 20.04 LTS
@@ -3394,86 +4265,91 @@ 

6 Session info

## [8] methods base ## ## other attached packages: -## [1] magrittr_1.5 org.Dr.eg.db_3.11.4 AnnotationDbi_1.50.3 -## [4] IRanges_2.22.2 S4Vectors_0.26.1 Biobase_2.48.0 -## [7] BiocGenerics_0.34.0 msigdbr_7.2.1 clusterProfiler_3.16.1 +## [1] magrittr_1.5 org.Dr.eg.db_3.12.0 AnnotationDbi_1.52.0 +## [4] IRanges_2.24.0 S4Vectors_0.28.0 Biobase_2.50.0 +## [7] BiocGenerics_0.36.0 msigdbr_7.2.1 clusterProfiler_3.18.0 ## [10] optparse_1.6.6 ## ## loaded via a namespace (and not attached): -## [1] enrichplot_1.8.1 bit64_4.0.5 progress_1.2.2 -## [4] httr_1.4.2 RColorBrewer_1.1-2 R.cache_0.14.0 -## [7] tools_4.0.2 backports_1.1.10 R6_2.4.1 -## [10] DBI_1.1.0 colorspace_1.4-1 prettyunits_1.1.1 -## [13] tidyselect_1.1.0 gridExtra_2.3 curl_4.3 -## [16] bit_4.0.4 compiler_4.0.2 cli_2.0.2 -## [19] scatterpie_0.1.5 xml2_1.3.2 labeling_0.3 -## [22] triebeard_0.3.0 scales_1.1.1 readr_1.3.1 -## [25] ggridges_0.5.2 stringr_1.4.0 digest_0.6.25 -## [28] ggupset_0.3.0 rmarkdown_2.4 DOSE_3.14.0 -## [31] R.utils_2.10.1 pkgconfig_2.0.3 htmltools_0.5.0 -## [34] styler_1.3.2 rlang_0.4.7 rstudioapi_0.11 -## [37] RSQLite_2.2.0 gridGraphics_0.5-0 generics_0.0.2 -## [40] farver_2.0.3 jsonlite_1.7.1 BiocParallel_1.22.0 -## [43] GOSemSim_2.14.2 dplyr_1.0.2 R.oo_1.24.0 -## [46] ggplotify_0.0.5 GO.db_3.11.4 Matrix_1.2-18 -## [49] Rcpp_1.0.5 munsell_0.5.0 fansi_0.4.1 -## [52] viridis_0.5.1 lifecycle_0.2.0 R.methodsS3_1.8.1 -## [55] stringi_1.5.3 yaml_2.2.1 ggraph_2.0.3 -## [58] MASS_7.3-51.6 plyr_1.8.6 qvalue_2.20.0 -## [61] grid_4.0.2 blob_1.2.1 ggrepel_0.8.2 -## [64] DO.db_2.9 crayon_1.3.4 lattice_0.20-41 -## [67] cowplot_1.1.0 graphlayouts_0.7.0 splines_4.0.2 -## [70] hms_0.5.3 knitr_1.30 pillar_1.4.6 -## [73] fgsea_1.14.0 igraph_1.2.5 reshape2_1.4.4 -## [76] fastmatch_1.1-0 glue_1.4.2 evaluate_0.14 -## [79] downloader_0.4 BiocManager_1.30.10 data.table_1.13.0 -## [82] urltools_1.7.3 vctrs_0.3.4 tweenr_1.0.1 -## [85] gtable_0.3.0 getopt_1.20.3 purrr_0.3.4 -## [88] polyclip_1.10-0 tidyr_1.1.2 rematch2_2.1.2 -## [91] assertthat_0.2.1 ggplot2_3.3.2 xfun_0.18 -## [94] ggforce_0.3.2 europepmc_0.4 tidygraph_1.2.0 -## [97] viridisLite_0.3.0 tibble_3.0.3 rvcheck_0.1.8 -## [100] memoise_1.1.0 ellipsis_0.3.1
+## [1] enrichplot_1.10.0 bit64_4.0.5 RColorBrewer_1.1-2 +## [4] R.cache_0.14.0 tools_4.0.2 backports_1.1.10 +## [7] R6_2.4.1 DBI_1.1.0 colorspace_1.4-1 +## [10] tidyselect_1.1.0 gridExtra_2.3 curl_4.3 +## [13] bit_4.0.4 compiler_4.0.2 cli_2.1.0 +## [16] scatterpie_0.1.5 labeling_0.3 shadowtext_0.0.7 +## [19] scales_1.1.1 readr_1.4.0 stringr_1.4.0 +## [22] digest_0.6.25 ggupset_0.3.0 rmarkdown_2.4 +## [25] DOSE_3.16.0 R.utils_2.10.1 pkgconfig_2.0.3 +## [28] htmltools_0.5.0 styler_1.3.2 rlang_0.4.8 +## [31] rstudioapi_0.11 RSQLite_2.2.1 generics_0.0.2 +## [34] farver_2.0.3 jsonlite_1.7.1 BiocParallel_1.24.1 +## [37] GOSemSim_2.16.1 dplyr_1.0.2 R.oo_1.24.0 +## [40] GO.db_3.12.1 Matrix_1.2-18 Rcpp_1.0.5 +## [43] munsell_0.5.0 fansi_0.4.1 viridis_0.5.1 +## [46] lifecycle_0.2.0 R.methodsS3_1.8.1 stringi_1.5.3 +## [49] yaml_2.2.1 ggraph_2.0.3 MASS_7.3-51.6 +## [52] plyr_1.8.6 qvalue_2.22.0 grid_4.0.2 +## [55] blob_1.2.1 ggrepel_0.8.2 DO.db_2.9 +## [58] crayon_1.3.4 lattice_0.20-41 cowplot_1.1.0 +## [61] graphlayouts_0.7.0 splines_4.0.2 hms_0.5.3 +## [64] ps_1.4.0 knitr_1.30 pillar_1.4.6 +## [67] fgsea_1.16.0 igraph_1.2.6 reshape2_1.4.4 +## [70] fastmatch_1.1-0 glue_1.4.2 evaluate_0.14 +## [73] downloader_0.4 BiocManager_1.30.10 data.table_1.13.0 +## [76] vctrs_0.3.4 tweenr_1.0.1 gtable_0.3.0 +## [79] getopt_1.20.3 purrr_0.3.4 polyclip_1.10-0 +## [82] tidyr_1.1.2 rematch2_2.1.2 assertthat_0.2.1 +## [85] ggplot2_3.3.2 xfun_0.18 ggforce_0.3.2 +## [88] tidygraph_1.2.0 viridisLite_0.3.0 tibble_3.0.4 +## [91] rvcheck_0.1.8 memoise_1.1.0 ellipsis_0.3.1

References

-

Ahlmann-Eltze C., 2020 Ggupset: Combination matrix axis for ’ggplot2’ to create ’upset’ plots.

+

Ahlmann-Eltze C., 2020 ggupset: Combination matrix axis for ’ggplot2’ to create ’upset’ plots. https://github.com/const-ae/ggupset

-

Carlson M., 2019 Genome wide annotation for zebrafish

+

Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html

-

Dolgalev I., 2020 Msigdbr: MSigDB gene sets for multiple organisms in a tidy data format.

-
-
-

Guangchuang Yu, ClusterProfiler: Universal enrichment tool for functional and comparative study

+

Dolgalev I., 2020 Msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

-

Kanehisa M., and S. Goto, 2000 KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28: 27–30.

+

Kanehisa M., and S. Goto, 2000 KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28: 27–30. https://doi.org/10.1093/nar/28.1.27

-

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol. 8: e1002375.

+

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

+
+
+

Puthier D., and J. van Helden, 2015 Statistics for Bioinformatics - Practicals - Gene enrichment statistics. https://dputhier.github.io/ASG/practicals/go_statistics_td/go_statistics_td_2015.html

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

-

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102: 15545–15550.

+

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

-

Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896.

+

Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98

-

Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: a method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Res. 41: e170.

+

Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660

-

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 ClusterProfiler: An r package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

+

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

+
+
+

Yu G., clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

+ diff --git a/02-microarray/pathway-analysis_microarray_01_ortholog_mapping_kegg.Rmd b/02-microarray/pathway-analysis_microarray_01_ortholog_mapping_kegg.Rmd deleted file mode 100644 index aba23317..00000000 --- a/02-microarray/pathway-analysis_microarray_01_ortholog_mapping_kegg.Rmd +++ /dev/null @@ -1,272 +0,0 @@ ---- -title: "KEGG pathways: mapping to mouse orthologs with `hcop`" -output: - html_notebook: - toc: TRUE - toc_float: TRUE -author: J. Taroni for ALSF CCDL -date: 2019 ---- - -## Background - -In this module, we use QuSAGE ([](https://doi.org/10.1093/nar/gkt660)) -for pathway analysis (implemented in the [`qusage` bioconductor package](https://bioconductor.org/packages/release/bioc/html/qusage.html)). - -`qusage` allows you to read in gene sets that are in the [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). - -[MSigDB](http://software.broadinstitute.org/gsea/msigdb) offers gene sets in this format. -[Curated gene sets](http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C2) -such as [KEGG](https://www.genome.jp/kegg/) are a good starting point for any pathway analysis. - -However, MSigDB only distributes human pathways. -If we want to use KEGG Pathways with another species without going through -[KEGG Orthology](https://www.genome.jp/kegg/ko.html), we need to map to -orthologs ourselves. - -We'll use the [`hcop` package](https://github.com/stephenturner/hcop) to do -this. -If you're looking for a little bit more background information (like if you -run into trouble installing `hcop`), check out the notebook in our -[`ortholog-mapping`](https://github.com/AlexsLemonade/refinebio-examples/tree/master/ortholog-mapping) module. - -## Setup - -Package installation - -```{r} -# need read.gmt functionality from qusage -if (!("qusage" %in% installed.packages())) { - BiocManager::install("qusage", update = FALSE) -} - -# we need devtools in order to install the hcop package we will use to -# do the ortholog mapping -if (!("devtools" %in% installed.packages())) { - install.packages("devtools") -} - -# this installs a specific version of hcop -# we pass the commit hash to the ref argument -devtools::install_github("stephenturner/hcop", - ref = "0985fddc91a6ef2308f4800958dfd11c25fe6a98" -) -``` - -```{r} -`%>%` <- dplyr::`%>%` -``` - -```{r} -library(hcop) -``` - -## KEGG human pathways - -We need to download the the MSigDB v6.2 KEGG gene sets that use Entrez gene IDs -and place them at the following path if we have not done so already: - -``` -gene-sets/c2.cp.kegg.v6.2.entrez.gmt -``` - -```{r} -# the kegg gmt file should be located in the spot we mention above -kegg_file <- file.path("gene-sets", "c2.cp.kegg.v6.2.entrez.gmt") -# since we do not track this file in our repository, let's check to make sure -# it exists where we expect it and download it if we don't find it -if (!file.exists(kegg_file)) { - message(paste( - "KEGG GMT file is not found at", kegg_file, - ", downloading now..." - )) - # need gene-sets directory - if (!dir.exists("gene-sets")) { - dir.create("gene-sets") - } - download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/6.2/c2.cp.kegg.v6.2.entrez.gmt", - destfile = kegg_file - ) -} -``` - -Read in the pathway file - -```{r} -kegg_human_list <- qusage::read.gmt(kegg_file) -``` - -## Conversion from human Entrez ID to mouse symbol - -### Human to mouse mapping with `hcop` - -`hcop` is designed to work well with `dplyr`, so we'll get all the possible -human Entrez IDs into a `data.frame`. - -```{r} -entrez_in_pathways_df <- data.frame( - human_entrez = as.integer(unique(unlist(kegg_human_list))) -) -``` - -Join to mouse orthologs by the human Entrez IDs. - -```{r} -# Join to mouse orthologs -mouse_ortholog_df <- entrez_in_pathways_df %>% - dplyr::inner_join(mouse, by = "human_entrez") -``` - -### 1:many mapping - -For 1:many mappings, we'll pick the one with the most support. -Here, we'll only consider the _number_ of sources without any regard for _what_ -those resources are. -Essentially, we weigh all sources equally though they likely have more or less -permissive criteria. - -```{r} -# add a column that counts resources -mouse_ortholog_df <- mouse_ortholog_df %>% - dplyr::rowwise() %>% - dplyr::mutate( - num_resources_support = - length(stringr::str_split(support, - pattern = ",", - simplify = TRUE - )) - ) -``` - -To demonstrate how this works, we'll pick a gene that had 1:many mappings -and follow it along. - -```{r} -mouse_ortholog_df %>% - dplyr::filter(human_entrez == 5631) %>% - dplyr::select( - human_entrez, mouse_entrez, human_symbol, mouse_symbol, - num_resources_support - ) -``` - -We can see that the human gene _PRPS1_/`5631` maps to 4 mouse genes, one of -which has 10 resources supporting that mapping. - -```{r} -# for each unique human entrez id, pick the mapping with the highest number of -# resources supporting it -most_support_df <- mouse_ortholog_df %>% - dplyr::group_by(human_entrez) %>% - dplyr::top_n(1, num_resources_support) -``` - -What happened to `5631`? - -```{r} -most_support_df %>% - dplyr::filter(human_entrez == 5631) %>% - dplyr::select( - human_entrez, mouse_entrez, human_symbol, mouse_symbol, - num_resources_support - ) -``` - -We successfully selected the mapping with the highest number of resources -supporting it. - -### Conversion of KEGG pathways - -For each KEGG pathway we have (currently populated with human Entrez IDs), -we need a new gene set that is comprised of mouse gene symbols. - -```{r} -kegg_mouse_list <- - lapply( - kegg_human_list, - function(pathway) { - dplyr::filter(most_support_df, human_entrez %in% pathway) %>% - dplyr::pull(mouse_symbol) - } - ) -``` - -Do the results this seem reasonable? -Let's pick the polymerase pathway, where our success should be pretty obvious -from the mouse gene symbols. - -```{r} -kegg_mouse_list[[grep( - "POLYMERASE", # Replace with a word (case sensitive) or phrase that would filter in your desired pathway(s) - names(kegg_mouse_list) -)]] -``` - -### Write mouse pathway list in `GMT` format - -We need to write this new mouse pathway list to file in [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29), as this will facilitate use with downstream pathway analysis. - -Briefly, the GMT format has one pathway per line and it follows this pattern: - -``` -\t\t... -``` - -We've lost the description information because it's removed by -`qusage::read.gmt`. -The description in `r kegg_file` follows this pattern: - -``` -http://www.broadinstitute.org/gsea/msigdb/cards/ -``` - -We can pretty easily stick this back in when we write to file with `write()`. - -```{r} -# filename we're going to write to -mouse_file <- file.path( - "gene-sets", # Replace with path to desired output directory - "c2.cp.kegg.v6.2.entrez_mouse_symbol_hcop.gmt" # Replace with name for output file -) - -# if you run this again after making changes above, you'd just end up appending -# this to the end of your old file, so we should take steps to get rid of -# the existing file -if (file.exists(mouse_file)) { - message(paste("Removing old", mouse_file)) - file.remove(mouse_file) -} - -# for each pathway, write it to line following GMT format -for (pathway_iter in seq_along(kegg_mouse_list)) { - # extract the current pathway name - pathway_name <- names(kegg_mouse_list)[pathway_iter] - text_to_write <- - paste( - # the name of the pathway - pathway_name, - # the description -- this is stripped out by qusage::read.gmt - paste0( - "http://www.broadinstitute.org/gsea/msigdb/cards/", - pathway_name - ), - # the gene symbols - paste(kegg_mouse_list[[pathway_iter]], collapse = "\t"), - sep = "\t" - ) - write(text_to_write, mouse_file, append = TRUE) -} -``` - -We can double check how this went by reading it back in with `qusage::read.gmt`. - -```{r} -mouse_read_list <- qusage::read.gmt(mouse_file) -all.equal(mouse_read_list, kegg_mouse_list) -``` - -## Session Info - -```{r} -sessioninfo::session_info() -``` diff --git a/02-microarray/pathway-analysis_microarray_02_gsea.Rmd b/02-microarray/pathway-analysis_microarray_02_gsea.Rmd new file mode 100644 index 00000000..051b639d --- /dev/null +++ b/02-microarray/pathway-analysis_microarray_02_gsea.Rmd @@ -0,0 +1,522 @@ +--- +title: "Gene set enrichment analysis - Microarray" +author: "CCDL for ALSF" +date: "October 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +This example is one of pathway analysis module set, we recommend looking at the [pathway analysis introduction](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_00_intro.html) to help you determine which pathway analysis method is best suited for your purposes. + +This particular example analysis shows how you can use gene set enrichment analysis (GSEA) to detect situations where all genes in a predefined gene set change in a coordinated way, detecting even small statistical but coordinated changes between two biological states. +Genes are ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic, for each gene set. +More specifically, an ES is calculated by starting with the most highly ranked genes (based on the gene-level statistic selected for ranking) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. +Normalized enrichment scores (NES) are enrichment scores that are scaled to account for gene sets of different sizes [@Subramanian2005; @Korotkevich2019]. + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_02_gsea.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. + +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" # Can replace with path to desired output plots directory + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" # Can replace with path to desired output results directory + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## Obtain the gene set for this example + +In this example, we are using differential expression results table we obtained from an [example analysis of zebrafish samples overexpressing human CREB experiment](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_01_2-groups.html) using [`limma`](https://bioconductor.org/packages/release/bioc/html/limma.html) [@Ritchie2015]. +The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case). + +We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you'd like to follow the steps for obtaining this results file yourself, we suggest going through [that differential expression analysis example](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_01_2-groups.html). + +## About the dataset we are using for this example + +For this example analysis, we will use this [CREB overexpression zebrafish experiment](https://www.refine.bio/experiments/GSE71270/creb-overexpression-induces-leukemia-in-zebrafish-by-blocking-myeloid-differentiation-process) [@Tregnago2016]. +@Tregnago2016 measured microarray gene expression of zebrafish samples overexpressing human CREB, as well as control samples. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called `data` (currently empty) +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should contain your `.Rmd` and three empty folders (which won't be empty for long!). + +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. + +*** + +   + +# Gene set enrichment analysis - Microarray + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. + +In this analysis, we will be using [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform GSEA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Yu2012; @Igor2020; @Subramanian2005]. + +We will also need the [`org.Dr.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html) package to perform gene identifier conversion [@Carlson2019-zebrafish]. + +```{r} +if (!("clusterProfiler" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("clusterProfiler", update = FALSE) +} + +if (!("msigdbr" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("msigdbr", update = FALSE) +} + +if (!("org.Dr.eg.db" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("org.Dr.eg.db", update = FALSE) +} +``` + +Attach the packages we need for this analysis. + +```{r} +# Attach the library +library(clusterProfiler) + +# Package that contains MSigDB gene sets in tidy format +library(msigdbr) + +# Zebrafish annotation package we'll use for gene identifier conversion +library(org.Dr.eg.db) + +# We will need this so we can use the pipe: %>% +library(magrittr) +``` + +## Import data + +We will read in the differential expression results we will download from online. +These results are from a zebrafish microarray experiment we used for [differential expression analysis for two groups](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_02_2-groups.html) using [`limma`](https://bioconductor.org/packages/release/bioc/html/limma.html) [@Ritchie2015]. +The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case). +We can identify differentially regulated genes by filtering these results and use this list as input to GSEA. + +Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. +First we will assign the URL to its own variable called, `dge_url`. + +```{r} +# Define the url to your differential expression results file +dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv" +``` + +Read in the file that has differential expression results. +Here we are using the URL we set up above, but this can be a local file path instead _i.e._ you can replace `dge_url` in the code below with a path to file you have on your computer like: `file.path("results", "GSE71270_limma_results.tsv")`. + +```{r} +# Read in the contents of your differential expression results file +# `dge_url` can be replaced with a file path to a TSV file with your +# desired gene list results +dge_df <- readr::read_tsv(dge_url) +``` + +`read_tsv()` can read TSV files online and doesn't necessarily require you download the file first. +Let's take a look at what these contrast results from the differential expression analysis look like. + +```{r} +dge_df +``` + +## Getting familiar with `clusterProfiler`'s options + +Let's take a look at what organisms the package supports. + +```{r} +msigdbr_species() +``` + +MSigDB contains 8 different gene set collections [@Subramanian2005]. + + H: hallmark gene sets + C1: positional gene sets + C2: curated gene sets + C3: motif gene sets + C4: computational gene sets + C5: GO gene sets + C6: oncogenic signatures + C7: immunologic signatures + +MSigDB includes a collection called Hallmark gene sets. +Here's an excerpt of the collection description [@Liberzon2015]: + +Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. +These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. +The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA. +We'll use the Hallmark collection for GSEA +Notably, there are only 50 gene sets included in this collection. +The fewer gene sets we test, the lower our multiple hypothesis testing burden. + +The data we're interested in here comes from zebrafish samples, so we can obtain just the gene sets relevant to _D. rerio_ by specifying `species = "Danio rerio"` and only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function. + +```{r} +dr_hallmark_df <- msigdbr( + species = "Danio rerio", # Replace with species name relevant to your data + category = "H" +) +``` + +If you run the chunk above without specifying a `category` to the `msigdbr()` function, it will return all of the MSigDB gene sets for zebrafish. + +Let's preview what's in `dr_hallmark_df`. + +```{r} +head(dr_hallmark_df) +``` + +Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs. + +In our differential expression results data frame, `dge_df` we have Ensembl gene identifiers. +So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA. + +## Gene identifier conversion + +We're going to convert our identifiers in `dge_df` to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! + +The annotation package `org.Dr.eg.db` contains information for different identifiers [@Carlson2019-zebrafish]. +`org.Dr.eg.db` is specific to _Danio rerio_ -- this is what the `Dr` in the package name is referencing. + +We can see what types of IDs are available to us in an annotation package with `keytypes()`. + +```{r} +keytypes(org.Dr.eg.db) +``` + +Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to Entrez IDs (`ENTREZID`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to gene symbols (`SYMBOL`). + +Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: +[the microarray example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html) and [the RNA-seq example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html). + +The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called `mapIds()`. + +Let's create a data frame that shows the mapped Entrez IDs along with the differential expression stats for the respective Ensembl IDs. + +```{r} +# First let's create a mapped data frame we can join to the differential expression stats +dge_mapped_df <- data.frame( + "entrez_id" = mapIds( + org.Dr.eg.db, # Replace with annotation package for the organism relevant to your data + keys = dge_df$Gene, + column = "ENTREZID", # Replace with the type of gene identifiers you would like to map to + keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data + multiVals = "first" # This will keep only the first mapped value for each Ensembl ID + ) +) %>% + # If an Ensembl gene identifier doesn't map to a Entrez gene identifier, drop that + # from the data frame + dplyr::filter(!is.na(entrez_id)) %>% + # Make an `Ensembl` column to store the rownames + tibble::rownames_to_column("Ensembl") %>% + # Now let's join the rest of the expression data + dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene")) +``` + +This `1:many mapping between keys and columns` message means that some Ensembl gene identifiers map to multiple Entrez IDs. +In this case, it's also possible that a Entrez ID will map to multiple Ensembl IDs. +For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. +For more about how to explore this, take a look at our [microarray gene ID conversion example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html). + +Let's see a preview of `dge_mapped_df`. + +```{r} +head(dge_mapped_df) +``` +We will want to keep in mind that GSEA requires that genes are ranked from most highly positive to most highly negative and weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic. +GSEA also requires unique gene identifiers to produce the most accurate results. +This means if any duplicate gene identifiers are found in our dataset, we will want to retain the instance with the higher absolute value as this will be the instance most likely to be ranked most highly negative or most highly positive. +Otherwise the enrichment score results may be skewed and the `GSEA()` function will throw a warning. + +Let's check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs. + +```{r} +any(duplicated(dge_mapped_df$entrez_id)) +``` + +Looks like we do have duplicated Entrez IDs. +Let's find out which Entrez IDs have been duplicated. + +```{r} +dup_entrez_ids <- dge_mapped_df %>% + dplyr::filter(duplicated(entrez_id)) %>% + dplyr::pull(entrez_id) + +dup_entrez_ids +``` + +Now let's take a look at the rows associated with the duplicated Entrez IDs. + +```{r} +dge_mapped_df %>% + dplyr::filter(entrez_id %in% dup_entrez_ids) +``` + +We can see that the associated values vary for each row. + +As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA steps later, so let's keep the Entrez IDs associated with the higher absolute value of the t-statistic. +GSEA relies on genes' rankings on the basis of this statistic. +Retaining the instance of the Entrez ID with the higher absolute value means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. +We should keep this decision in mind when interpreting our results. +For example, if the duplicate identifiers tended to be enriched in a particular gene set, we may get an optimistic view of how perturbed that gene set is by preferentially selecting values that have a higher absolute value. +We are removing values for two genes here, so it is unlikely to have a considerable impact on our results. + +```{r} +filtered_dge_mapped_df <- dge_mapped_df %>% + # Sort so that the highest absolute values of the t-statistic are at the top + dplyr::arrange(dplyr::desc(abs(t))) %>% + # Filter out the duplicated rows using `dplyr::distinct()`-- this will keep + # the first row with the duplicated value thus keeping the row with the + # highest absolute value of the t-statistic + dplyr::distinct(entrez_id, .keep_all = TRUE) +``` + +Let's check to see that we removed the duplicate Entrez IDs and kept the rows with the higher absolute value of the t-statistic. + +```{r} +filtered_dge_mapped_df %>% + dplyr::filter(entrez_id %in% dup_entrez_ids) +``` + +Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute value of the t-statistic! + +## Perform gene set enrichment analysis (GSEA) + +The goal of GSEA is to detect situations where all genes in a gene set change in a coordinated way, detecting even small statistical but coordinated changes. +This is done by ranking genes within a gene set from most highly positive to most highly negative, weighting them according to their gene-level statistic, to calculate the enrichment score (ES), which is a pathway-level statistic [@clusterProfiler-book]. + +### Determine our pre-ranked genes list + +The `GSEA()` function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. +We will therefore need to create a gene list with statistics that GSEA will rank by. + +In the next chunk, we will create a named vector ranked based on the gene-level t-statistic values. + +```{r} +# Let's create a named vector ranked based on the t-statistic values +t_vector <- filtered_dge_mapped_df$t +names(t_vector) <- filtered_dge_mapped_df$entrez_id + +# We need to sort the t-statistic values in descending order here +t_vector <- sort(t_vector, decreasing = TRUE) +``` + +Let's preview our pre-ranked named vector. + +```{r} +# Look at first entries of the ranked t-statistic vector +head(t_vector) +``` + +### Run GSEA using the `GSEA()` function + +Genes were ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, in the previous section. +In this section, we will implement GSEA to calculate the enrichment score (ES) for each gene set using our pre-ranked gene list. + +We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`). + +```{r} +gsea_results <- GSEA( + geneList = t_vector, # ordered ranked gene list + minGSSize = 25, # minimum gene set size + maxGSSize = 500, # maximum gene set set + pvalueCutoff = 0.05, # p value cutoff + eps = 0, # boundary for calculating the p value + seed = TRUE, # set seed to make results reproducible + pAdjustMethod = "BH", # Benjamini-Hochberg correction + TERM2GENE = dplyr::select( + dr_hallmark_df, + gs_name, + entrez_gene + ) +) +``` + +Let's take a look at the results. + +```{r} +# We can access the results from our gseaResult object using `@result` +head(gsea_results@result) +``` + +Significance is assessed by permutating the gene labels of the pre-ranked gene list and recomputing the ES of the gene set for the permutated data, which generates a null distribution for the ES. +The ES for each gene set is then normalized to account for the size of the set, resulting in a normalized enrichment score (NES), and an FDR (false discovery rate) value is calculated to account for multiple hypothesis testing. +Looks like we have gene sets returned as significant at FDR of `0.05`. +If we didn't have any, our visualizations below would show up blank as nothing would have met our `pvalueCutoff` above. + +The information we're most likely interested in is in the `results` slot. +Let's convert this into a data frame that we can use for further analysis and write to file. + +```{r} +gsea_result_df <- data.frame(gsea_results@result) +``` + +## Visualizing results + +We can visualize GSEA results for individual pathways or gene sets using `enrichplot::gseaplot()`. +Let's take a look at 2 different pathways -- one with a highly positive NES and one with a highly negative NES -- to get more insight into how ES are calculated. + +### Most Positive NES + +Let's look for the gene set with the most positive NES. + +```{r} +gsea_result_df %>% + # Use `top_n()` to return the top n observation using `NES` as the ordering variable + dplyr::top_n(n = 3, wt = NES) +``` + +The gene set `HALLMARK_TNFA_SIGNALING_VIA_NFKB` has the most positive NES score. + +```{r} +most_positive_nes_plot <- enrichplot::gseaplot( + gsea_results, + geneSetID = "HALLMARK_TNFA_SIGNALING_VIA_NFKB", + title = "HALLMARK_TNFA_SIGNALING_VIA_NFKB", + color.line = "#0d76ff" +) + +most_positive_nes_plot +``` + +Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores. +The red dashed line indicates the peak ES score (the ES is the maximum deviation from zero). +As mentioned earlier, an ES is calculated by starting with the most highly ranked genes (according to the gene-level t-statistic values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. +In this case, the most highly positive enrichment score's data are being displayed. + +The plots returned by `enrichplot::gseaplot` are ggplots, so we can use `ggplot2::ggsave()` to save them to file. + +Let's save to PNG. + +```{r} +ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_positive_plot.png"), + plot = most_positive_nes_plot +) +``` + +### Most Negative NES + +Let's look for the gene set with the most negative NES. + +```{r} +gsea_result_df %>% + # Use `top_n()` to return the top n observation using `NES` as the ordering + # variable -- note the use of `-n` to display the top n values when sorted + # in descending order + dplyr::top_n(n = -3, wt = NES) +``` + +The gene set `HALLMARK_E2F_TARGETS` has the most negative NES. + +```{r} +most_negative_nes_plot <- enrichplot::gseaplot( + gsea_results, + geneSetID = "HALLMARK_E2F_TARGETS", + title = "HALLMARK_E2F_TARGETS", + color.line = "#0d76ff" +) + +most_negative_nes_plot +``` + +This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph. +Again, the red dashed line here indicates the maximum deviation from zero, in other words, the most negative ES score. +As we know, the ES is calculated by starting with the most highly ranked genes (according to the gene-level t-statistic values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. +A negative enrichment score will be returned when not many of the genes are found at the top of the list, as this would mean decreasing the score a great deal thus producing a negative value. +In this case, the most negative enrichment score's data are being displayed. + +Let's save this plot to PNG. + +```{r} +ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_negative_plot.png"), + plot = most_negative_nes_plot +) +``` + +## Write results to file + +```{r} +readr::write_tsv( + gsea_result_df, + file.path( + results_dir, + "GSE71270_gsea_results.tsv" + ) +) +``` + +# Resources for further learning + +- [clusterProfiler paper](https://doi.org/10.1089/omi.2011.0118) [@Yu2012]. +- [clusterProfiler book](https://yulab-smu.github.io/clusterProfiler-book/index.html) [@clusterProfiler-book]. +- [This handy review](https://doi.org/10.1371/journal.pcbi.1002375) which summarizes the different types of pathway analysis and their limitations [@Khatri2012]. +- See this [Broad Institute page](https://www.gsea-msigdb.org/gsea/index.jsp) for more on GSEA and MSigDB [@GSEA-broad-institute]. + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessioninfo::session_info() +``` + +# References diff --git a/02-microarray/pathway-analysis_microarray_02_gsea.html b/02-microarray/pathway-analysis_microarray_02_gsea.html new file mode 100644 index 00000000..d3fa10fb --- /dev/null +++ b/02-microarray/pathway-analysis_microarray_02_gsea.html @@ -0,0 +1,4467 @@ + + + + + + + + + + + + + + +Gene set enrichment analysis - Microarray + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + + +
+

1 Purpose of this analysis

+

This example is one of pathway analysis module set, we recommend looking at the pathway analysis introduction to help you determine which pathway analysis method is best suited for your purposes.

+

This particular example analysis shows how you can use gene set enrichment analysis (GSEA) to detect situations where all genes in a predefined gene set change in a coordinated way, detecting even small statistical but coordinated changes between two biological states. Genes are ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic, for each gene set. More specifically, an ES is calculated by starting with the most highly ranked genes (based on the gene-level statistic selected for ranking) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. Normalized enrichment scores (NES) are enrichment scores that are scaled to account for gene sets of different sizes (Korotkevich et al. 2019; Subramanian et al. 2005).

+

⬇️ Jump to the analysis code ⬇️

+
+
+

2 How to run this example

+

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

+
+

2.1 Obtain the .Rmd file

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

+

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

+
+
+

2.2 Set up your analysis folders

+

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

+

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots" # Can replace with path to desired output plots directory
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results" # Can replace with path to desired output results directory
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

+
+
+

2.3 Obtain the gene set for this example

+

In this example, we are using differential expression results table we obtained from an example analysis of zebrafish samples overexpressing human CREB experiment using limma (Ritchie et al. 2015). The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case).

+

We have provided this file for you and the code in this notebook will read in the results that are stored online, but if you’d like to follow the steps for obtaining this results file yourself, we suggest going through that differential expression analysis example.

+
+
+

2.4 About the dataset we are using for this example

+

For this example analysis, we will use this CREB overexpression zebrafish experiment (Tregnago et al. 2016). Tregnago et al. (2016) measured microarray gene expression of zebrafish samples overexpressing human CREB, as well as control samples.

+
+
+

2.5 Check out our file structure!

+

Your new analysis folder should contain:

+
    +
  • The example analysis .Rmd you downloaded
    +
  • +
  • A folder called data (currently empty)
  • +
  • A folder for plots (currently empty)
    +
  • +
  • A folder for results (currently empty)
  • +
+

Your example analysis folder should contain your .Rmd and three empty folders (which won’t be empty for long!).

+

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

+
+
+
+

3 Using a different refine.bio dataset with this analysis?

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+
+ +

 

+
+
+

4 Gene set enrichment analysis - Microarray

+
+

4.1 Install libraries

+

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

+

In this analysis, we will be using clusterProfiler package to perform GSEA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005).

+

We will also need the org.Dr.eg.db package to perform gene identifier conversion (Carlson 2019).

+
if (!("clusterProfiler" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("clusterProfiler", update = FALSE)
+}
+
+if (!("msigdbr" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("msigdbr", update = FALSE)
+}
+
+if (!("org.Dr.eg.db" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("org.Dr.eg.db", update = FALSE)
+}
+

Attach the packages we need for this analysis.

+
# Attach the library
+library(clusterProfiler)
+
## 
+
## clusterProfiler v3.16.1  For help: https://guangchuangyu.github.io/software/clusterProfiler
+## 
+## If you use clusterProfiler in published research, please cite:
+## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
+
## 
+## Attaching package: 'clusterProfiler'
+
## The following object is masked from 'package:stats':
+## 
+##     filter
+
# Package that contains MSigDB gene sets in tidy format
+library(msigdbr)
+
+# Zebrafish annotation package we'll use for gene identifier conversion
+library(org.Dr.eg.db)
+
## Loading required package: AnnotationDbi
+
## Loading required package: stats4
+
## Loading required package: BiocGenerics
+
## Loading required package: parallel
+
## 
+## Attaching package: 'BiocGenerics'
+
## The following objects are masked from 'package:parallel':
+## 
+##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+##     clusterExport, clusterMap, parApply, parCapply, parLapply,
+##     parLapplyLB, parRapply, parSapply, parSapplyLB
+
## The following objects are masked from 'package:stats':
+## 
+##     IQR, mad, sd, var, xtabs
+
## The following objects are masked from 'package:base':
+## 
+##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
+##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
+##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
+##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
+##     union, unique, unsplit, which, which.max, which.min
+
## Loading required package: Biobase
+
## Welcome to Bioconductor
+## 
+##     Vignettes contain introductory material; view with
+##     'browseVignettes()'. To cite Bioconductor, see
+##     'citation("Biobase")', and for packages 'citation("pkgname")'.
+
## Loading required package: IRanges
+
## Loading required package: S4Vectors
+
## 
+## Attaching package: 'S4Vectors'
+
## The following object is masked from 'package:clusterProfiler':
+## 
+##     rename
+
## The following object is masked from 'package:base':
+## 
+##     expand.grid
+
## 
+## Attaching package: 'IRanges'
+
## The following object is masked from 'package:clusterProfiler':
+## 
+##     slice
+
## 
+## Attaching package: 'AnnotationDbi'
+
## The following object is masked from 'package:clusterProfiler':
+## 
+##     select
+
## 
+
# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+
+

4.2 Import data

+

We will read in the differential expression results we will download from online. These results are from a zebrafish microarray experiment we used for differential expression analysis for two groups using limma (Ritchie et al. 2015). The table contains summary statistics including Ensembl gene IDs, t-statistic values, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to GSEA.

+

Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called, dge_url.

+
# Define the url to your differential expression results file
+dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/02-microarray/results/GSE71270/GSE71270_limma_results.tsv"
+

Read in the file that has differential expression results. Here we are using the URL we set up above, but this can be a local file path instead i.e. you can replace dge_url in the code below with a path to file you have on your computer like: file.path("results", "GSE71270_limma_results.tsv").

+
# Read in the contents of your differential expression results file
+# `dge_url` can be replaced with a file path to a TSV file with your
+# desired gene list results
+dge_df <- readr::read_tsv(dge_url)
+
## 
+## ── Column specification ──────────────────────────────────────────────────────────────
+## cols(
+##   Gene = col_character(),
+##   logFC = col_double(),
+##   AveExpr = col_double(),
+##   t = col_double(),
+##   P.Value = col_double(),
+##   adj.P.Val = col_double(),
+##   B = col_double()
+## )
+

read_tsv() can read TSV files online and doesn’t necessarily require you download the file first. Let’s take a look at what these contrast results from the differential expression analysis look like.

+
dge_df
+
+ +
+
+
+

4.3 Getting familiar with clusterProfiler’s options

+

Let’s take a look at what organisms the package supports.

+
msigdbr_species()
+
+ +
+

MSigDB contains 8 different gene set collections (Subramanian et al. 2005).

+
H: hallmark gene sets
+C1: positional gene sets
+C2: curated gene sets
+C3: motif gene sets
+C4: computational gene sets
+C5: GO gene sets
+C6: oncogenic signatures
+C7: immunologic signatures
+

MSigDB includes a collection called Hallmark gene sets. Here’s an excerpt of the collection description (Liberzon et al. 2015):

+

Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA. We’ll use the Hallmark collection for GSEA Notably, there are only 50 gene sets included in this collection. The fewer gene sets we test, the lower our multiple hypothesis testing burden.

+

The data we’re interested in here comes from zebrafish samples, so we can obtain just the gene sets relevant to D. rerio by specifying species = "Danio rerio" and only the Hallmark gene sets by specifying category = "H" to the msigdbr() function.

+
dr_hallmark_df <- msigdbr(
+  species = "Danio rerio", # Replace with species name relevant to your data
+  category = "H"
+)
+

If you run the chunk above without specifying a category to the msigdbr() function, it will return all of the MSigDB gene sets for zebrafish.

+

Let’s preview what’s in dr_hallmark_df.

+
head(dr_hallmark_df)
+
+ +
+

Looks like we have a data frame of gene sets with associated gene symbols and Entrez IDs.

+

In our differential expression results data frame, dge_df we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into either gene symbols or Entrez IDs for GSEA.

+
+
+

4.4 Gene identifier conversion

+

We’re going to convert our identifiers in dge_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

+

The annotation package org.Dr.eg.db contains information for different identifiers (Carlson 2019). org.Dr.eg.db is specific to Danio rerio – this is what the Dr in the package name is referencing.

+

We can see what types of IDs are available to us in an annotation package with keytypes().

+
keytypes(org.Dr.eg.db)
+
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
+##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
+## [11] "GO"           "GOALL"        "IPI"          "ONTOLOGY"     "ONTOLOGYALL" 
+## [16] "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"      
+## [21] "SYMBOL"       "UNIGENE"      "UNIPROT"      "ZFIN"
+

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to gene symbols (SYMBOL).

+

Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

+

The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds().

+

Let’s create a data frame that shows the mapped Entrez IDs along with the differential expression stats for the respective Ensembl IDs.

+
# First let's create a mapped data frame we can join to the differential expression stats
+dge_mapped_df <- data.frame(
+  "entrez_id" = mapIds(
+    org.Dr.eg.db, # Replace with annotation package for the organism relevant to your data
+    keys = dge_df$Gene,
+    column = "ENTREZID", # Replace with the type of gene identifiers you would like to map to
+    keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
+    multiVals = "first" # This will keep only the first mapped value for each Ensembl ID
+  )
+) %>%
+  # If an Ensembl gene identifier doesn't map to a Entrez gene identifier, drop that
+  # from the data frame
+  dplyr::filter(!is.na(entrez_id)) %>%
+  # Make an `Ensembl` column to store the rownames
+  tibble::rownames_to_column("Ensembl") %>%
+  # Now let's join the rest of the expression data
+  dplyr::inner_join(dge_df, by = c("Ensembl" = "Gene"))
+
## 'select()' returned 1:many mapping between keys and columns
+

This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that a Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSEA later in this notebook, we keep only the first mapped IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.

+

Let’s see a preview of dge_mapped_df.

+
head(dge_mapped_df)
+
+ +
+

We will want to keep in mind that GSEA requires that genes are ranked from most highly positive to most highly negative and weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic. GSEA also requires unique gene identifiers to produce the most accurate results. This means if any duplicate gene identifiers are found in our dataset, we will want to retain the instance with the higher absolute value as this will be the instance most likely to be ranked most highly negative or most highly positive. Otherwise the enrichment score results may be skewed and the GSEA() function will throw a warning.

+

Let’s check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.

+
any(duplicated(dge_mapped_df$entrez_id))
+
## [1] TRUE
+

Looks like we do have duplicated Entrez IDs. Let’s find out which Entrez IDs have been duplicated.

+
dup_entrez_ids <- dge_mapped_df %>%
+  dplyr::filter(duplicated(entrez_id)) %>%
+  dplyr::pull(entrez_id)
+
+dup_entrez_ids
+
## [1] "336702" "57924"
+

Now let’s take a look at the rows associated with the duplicated Entrez IDs.

+
dge_mapped_df %>%
+  dplyr::filter(entrez_id %in% dup_entrez_ids)
+
+ +
+

We can see that the associated values vary for each row.

+

As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA steps later, so let’s keep the Entrez IDs associated with the higher absolute value of the t-statistic. GSEA relies on genes’ rankings on the basis of this statistic. Retaining the instance of the Entrez ID with the higher absolute value means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. For example, if the duplicate identifiers tended to be enriched in a particular gene set, we may get an optimistic view of how perturbed that gene set is by preferentially selecting values that have a higher absolute value. We are removing values for two genes here, so it is unlikely to have a considerable impact on our results.

+
filtered_dge_mapped_df <- dge_mapped_df %>%
+  # Sort so that the highest absolute values of the t-statistic are at the top
+  dplyr::arrange(dplyr::desc(abs(t))) %>%
+  # Filter out the duplicated rows using `dplyr::distinct()`-- this will keep
+  # the first row with the duplicated value thus keeping the row with the
+  # highest absolute value of the t-statistic
+  dplyr::distinct(entrez_id, .keep_all = TRUE)
+

Let’s check to see that we removed the duplicate Entrez IDs and kept the rows with the higher absolute value of the t-statistic.

+
filtered_dge_mapped_df %>%
+  dplyr::filter(entrez_id %in% dup_entrez_ids)
+
+ +
+

Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute value of the t-statistic!

+
+
+

4.5 Perform gene set enrichment analysis (GSEA)

+

The goal of GSEA is to detect situations where all genes in a gene set change in a coordinated way, detecting even small statistical but coordinated changes. This is done by ranking genes within a gene set from most highly positive to most highly negative, weighting them according to their gene-level statistic, to calculate the enrichment score (ES), which is a pathway-level statistic (Yu).

+
+

4.5.1 Determine our pre-ranked genes list

+

The GSEA() function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. We will therefore need to create a gene list with statistics that GSEA will rank by.

+

In the next chunk, we will create a named vector ranked based on the gene-level t-statistic values.

+
# Let's create a named vector ranked based on the t-statistic values
+t_vector <- filtered_dge_mapped_df$t
+names(t_vector) <- filtered_dge_mapped_df$entrez_id
+
+# We need to sort the t-statistic values in descending order here
+t_vector <- sort(t_vector, decreasing = TRUE)
+

Let’s preview our pre-ranked named vector.

+
# Look at first entries of the ranked t-statistic vector
+head(t_vector)
+
##   555053   140633   407728   368924   335916   323329 
+## 20.22172 14.48634 13.88657 12.45258 11.24450 10.92140
+
+
+

4.5.2 Run GSEA using the GSEA() function

+

Genes were ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, in the previous section. In this section, we will implement GSEA to calculate the enrichment score (ES) for each gene set using our pre-ranked gene list.

+

We can use the GSEA() function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., gseKEGG()).

+
gsea_results <- GSEA(
+  geneList = t_vector, # ordered ranked gene list
+  minGSSize = 25, # minimum gene set size
+  maxGSSize = 500, # maximum gene set set
+  pvalueCutoff = 0.05, # p value cutoff
+  eps = 0, # boundary for calculating the p value
+  seed = TRUE, # set seed to make results reproducible
+  pAdjustMethod = "BH", # Benjamini-Hochberg correction
+  TERM2GENE = dplyr::select(
+    dr_hallmark_df,
+    gs_name,
+    entrez_gene
+  )
+)
+
## preparing geneSet collections...
+
## GSEA analysis...
+
## leading edge analysis...
+
## done...
+

Let’s take a look at the results.

+
# We can access the results from our gseaResult object using `@result`
+head(gsea_results@result)
+
+ +
+

Significance is assessed by permutating the gene labels of the pre-ranked gene list and recomputing the ES of the gene set for the permutated data, which generates a null distribution for the ES. The ES for each gene set is then normalized to account for the size of the set, resulting in a normalized enrichment score (NES), and an FDR (false discovery rate) value is calculated to account for multiple hypothesis testing. Looks like we have gene sets returned as significant at FDR of 0.05. If we didn’t have any, our visualizations below would show up blank as nothing would have met our pvalueCutoff above.

+

The information we’re most likely interested in is in the results slot. Let’s convert this into a data frame that we can use for further analysis and write to file.

+
gsea_result_df <- data.frame(gsea_results@result)
+
+
+
+

4.6 Visualizing results

+

We can visualize GSEA results for individual pathways or gene sets using enrichplot::gseaplot(). Let’s take a look at 2 different pathways – one with a highly positive NES and one with a highly negative NES – to get more insight into how ES are calculated.

+
+

4.6.1 Most Positive NES

+

Let’s look for the gene set with the most positive NES.

+
gsea_result_df %>%
+  # Use `top_n()` to return the top n observation using `NES` as the ordering variable
+  dplyr::top_n(n = 3, wt = NES)
+
+ +
+

The gene set HALLMARK_TNFA_SIGNALING_VIA_NFKB has the most positive NES score.

+
most_positive_nes_plot <- enrichplot::gseaplot(
+  gsea_results,
+  geneSetID = "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
+  title = "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
+  color.line = "#0d76ff"
+)
+
+most_positive_nes_plot
+

+

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores. The red dashed line indicates the peak ES score (the ES is the maximum deviation from zero). As mentioned earlier, an ES is calculated by starting with the most highly ranked genes (according to the gene-level t-statistic values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. In this case, the most highly positive enrichment score’s data are being displayed.

+

The plots returned by enrichplot::gseaplot are ggplots, so we can use ggplot2::ggsave() to save them to file.

+

Let’s save to PNG.

+
ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_positive_plot.png"),
+  plot = most_positive_nes_plot
+)
+
## Saving 7 x 5 in image
+
+
+

4.6.2 Most Negative NES

+

Let’s look for the gene set with the most negative NES.

+
gsea_result_df %>%
+  # Use `top_n()` to return the top n observation using `NES` as the ordering
+  # variable -- note the use of `-n` to display the top n values when sorted
+  # in descending order
+  dplyr::top_n(n = -3, wt = NES)
+
+ +
+

The gene set HALLMARK_E2F_TARGETS has the most negative NES.

+
most_negative_nes_plot <- enrichplot::gseaplot(
+  gsea_results,
+  geneSetID = "HALLMARK_E2F_TARGETS",
+  title = "HALLMARK_E2F_TARGETS",
+  color.line = "#0d76ff"
+)
+
+most_negative_nes_plot
+

+

This gene set shows the opposite pattern – genes in the pathway tend to be on the right side of the graph. Again, the red dashed line here indicates the maximum deviation from zero, in other words, the most negative ES score. As we know, the ES is calculated by starting with the most highly ranked genes (according to the gene-level t-statistic values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. A negative enrichment score will be returned when not many of the genes are found at the top of the list, as this would mean decreasing the score a great deal thus producing a negative value. In this case, the most negative enrichment score’s data are being displayed.

+

Let’s save this plot to PNG.

+
ggplot2::ggsave(file.path(plots_dir, "GSE71270_gsea_enrich_negative_plot.png"),
+  plot = most_negative_nes_plot
+)
+
## Saving 7 x 5 in image
+
+
+
+

4.7 Write results to file

+
readr::write_tsv(
+  gsea_result_df,
+  file.path(
+    results_dir,
+    "GSE71270_gsea_results.tsv"
+  )
+)
+
+
+
+

5 Resources for further learning

+ +
+
+

6 Session info

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ───────────────────────────────────────────────────────────────
+##  setting  value                       
+##  version  R version 4.0.2 (2020-06-22)
+##  os       Ubuntu 20.04 LTS            
+##  system   x86_64, linux-gnu           
+##  ui       X11                         
+##  language (EN)                        
+##  collate  en_US.UTF-8                 
+##  ctype    en_US.UTF-8                 
+##  tz       Etc/UTC                     
+##  date     2020-11-14                  
+## 
+## ─ Packages ───────────────────────────────────────────────────────────────────
+##  package         * version  date       lib source        
+##  AnnotationDbi   * 1.50.3   2020-07-25 [1] Bioconductor  
+##  assertthat        0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
+##  backports         1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
+##  Biobase         * 2.48.0   2020-04-27 [1] Bioconductor  
+##  BiocGenerics    * 0.34.0   2020-04-27 [1] Bioconductor  
+##  BiocManager       1.30.10  2019-11-16 [1] RSPM (R 4.0.0)
+##  BiocParallel      1.22.0   2020-04-27 [1] Bioconductor  
+##  bit               4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
+##  bit64             4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
+##  blob              1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
+##  cli               2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
+##  clusterProfiler * 3.16.1   2020-08-18 [1] Bioconductor  
+##  colorspace        1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
+##  cowplot           1.1.0    2020-09-08 [1] RSPM (R 4.0.2)
+##  crayon            1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
+##  curl              4.3      2019-12-02 [1] RSPM (R 4.0.0)
+##  data.table        1.13.0   2020-07-24 [1] RSPM (R 4.0.2)
+##  DBI               1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
+##  digest            0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
+##  DO.db             2.9      2020-10-27 [1] Bioconductor  
+##  DOSE              3.14.0   2020-04-27 [1] Bioconductor  
+##  downloader        0.4      2015-07-09 [1] RSPM (R 4.0.0)
+##  dplyr             1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
+##  ellipsis          0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
+##  enrichplot        1.8.1    2020-04-29 [1] Bioconductor  
+##  europepmc         0.4      2020-05-31 [1] RSPM (R 4.0.0)
+##  evaluate          0.14     2019-05-28 [1] RSPM (R 4.0.0)
+##  fansi             0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
+##  farver            2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
+##  fastmatch         1.1-0    2017-01-28 [1] RSPM (R 4.0.0)
+##  fgsea             1.14.0   2020-04-27 [1] Bioconductor  
+##  generics          0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
+##  getopt            1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
+##  ggforce           0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
+##  ggplot2           3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
+##  ggplotify         0.0.5    2020-03-12 [1] RSPM (R 4.0.0)
+##  ggraph            2.0.3    2020-05-20 [1] RSPM (R 4.0.2)
+##  ggrepel           0.8.2    2020-03-08 [1] RSPM (R 4.0.2)
+##  ggridges          0.5.2    2020-01-12 [1] RSPM (R 4.0.0)
+##  glue              1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  GO.db             3.11.4   2020-10-27 [1] Bioconductor  
+##  GOSemSim          2.14.2   2020-09-04 [1] Bioconductor  
+##  graphlayouts      0.7.0    2020-04-25 [1] RSPM (R 4.0.2)
+##  gridExtra         2.3      2017-09-09 [1] RSPM (R 4.0.0)
+##  gridGraphics      0.5-0    2020-02-25 [1] RSPM (R 4.0.0)
+##  gtable            0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
+##  hms               0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
+##  htmltools         0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
+##  httr              1.4.2    2020-07-20 [1] RSPM (R 4.0.2)
+##  igraph            1.2.6    2020-10-06 [1] RSPM (R 4.0.2)
+##  IRanges         * 2.22.2   2020-05-21 [1] Bioconductor  
+##  jsonlite          1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
+##  knitr             1.30     2020-09-22 [1] RSPM (R 4.0.2)
+##  labeling          0.3      2014-08-23 [1] RSPM (R 4.0.0)
+##  lattice           0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
+##  lifecycle         0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
+##  magrittr        * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
+##  MASS              7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+##  Matrix            1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
+##  memoise           1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
+##  msigdbr         * 7.2.1    2020-10-02 [1] RSPM (R 4.0.2)
+##  munsell           0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
+##  optparse        * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
+##  org.Dr.eg.db    * 3.11.4   2020-10-27 [1] Bioconductor  
+##  pillar            1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
+##  pkgconfig         2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
+##  plyr              1.8.6    2020-03-03 [1] RSPM (R 4.0.2)
+##  polyclip          1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
+##  prettyunits       1.1.1    2020-01-24 [1] RSPM (R 4.0.0)
+##  progress          1.2.2    2019-05-16 [1] RSPM (R 4.0.0)
+##  ps                1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
+##  purrr             0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
+##  qvalue            2.20.0   2020-04-27 [1] Bioconductor  
+##  R.cache           0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
+##  R.methodsS3       1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
+##  R.oo              1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
+##  R.utils           2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
+##  R6                2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
+##  RColorBrewer      1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
+##  Rcpp              1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
+##  readr             1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
+##  rematch2          2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
+##  reshape2          1.4.4    2020-04-09 [1] RSPM (R 4.0.2)
+##  rlang             0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
+##  rmarkdown         2.4      2020-09-30 [1] RSPM (R 4.0.2)
+##  RSQLite           2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
+##  rstudioapi        0.11     2020-02-07 [1] RSPM (R 4.0.0)
+##  rvcheck           0.1.8    2020-03-01 [1] RSPM (R 4.0.0)
+##  S4Vectors       * 0.26.1   2020-05-16 [1] Bioconductor  
+##  scales            1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
+##  scatterpie        0.1.5    2020-09-09 [1] RSPM (R 4.0.2)
+##  sessioninfo       1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
+##  stringi           1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
+##  stringr           1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
+##  styler            1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
+##  tibble            3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
+##  tidygraph         1.2.0    2020-05-12 [1] RSPM (R 4.0.2)
+##  tidyr             1.1.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  tidyselect        1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
+##  triebeard         0.3.0    2016-08-04 [1] RSPM (R 4.0.2)
+##  tweenr            1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
+##  urltools          1.7.3    2019-04-14 [1] RSPM (R 4.0.2)
+##  vctrs             0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
+##  viridis           0.5.1    2018-03-29 [1] RSPM (R 4.0.0)
+##  viridisLite       0.3.0    2018-02-01 [1] RSPM (R 4.0.0)
+##  withr             2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
+##  xfun              0.18     2020-09-29 [1] RSPM (R 4.0.2)
+##  xml2              1.3.2    2020-04-23 [1] RSPM (R 4.0.0)
+##  yaml              2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
+## 
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
+
+
+

References

+
+
+

Carlson M., 2019 Genome wide annotation for zebrafish. https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html

+
+
+

Diego U. S., and B. I. Team, GSEA: Gene set enrichment analysis. https://www.gsea-msigdb.org/gsea/index.jsp

+
+
+

Dolgalev I., 2020 Msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

+
+
+

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

+
+
+

Korotkevich G., V. Sukhov, and A. Sergushichev, 2019 Fast gene set enrichment analysis. bioRxiv. https://doi.org/10.1101/060012

+
+
+

Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, and J. P. Mesirov et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004

+
+
+

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

+
+
+

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

+
+
+

Tregnago C., E. Manara, M. Zampini, V. Bisio, and C. Borga et al., 2016 CREB engages C/EBPδ to initiate leukemogenesis. Leukemia 30: 1887–1896. https://doi.org/10.1038/leu.2016.98

+
+
+

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

+
+
+

Yu G., clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

+
+
+
+ + + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/02-microarray/pathway-analysis_microarray_03_gsva.Rmd b/02-microarray/pathway-analysis_microarray_03_gsva.Rmd new file mode 100644 index 00000000..8f61c2aa --- /dev/null +++ b/02-microarray/pathway-analysis_microarray_03_gsva.Rmd @@ -0,0 +1,678 @@ +--- +title: "Gene set variation analysis - Microarray" +author: "CCDL for ALSF" +date: "November 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +This example is one of pathway analysis module set, we recommend looking at the [pathway analysis introduction](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_00_intro.html) to help you determine which pathway analysis method is best suited for your purposes. + +In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis. +Gene set variation analysis (GSVA) constitutes a starting point to build pathway-centric models of biology. +Rather than contextualizing some results you have received from another analysis like DGE, GSVA is designed to provide an estimate of pathway variation for each of the samples in an experiment. +This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change [@Hanzelmann2013]. + +GSVA determines the relative enrichment of gene sets across samples using a non-parametric approach. +GSVA transforms a gene by sample gene expression matrix into a gene set by sample pathway enrichment matrix [@Hanzelmann-github]. +You can use these scores for other downstream analyses; in this analysis we will perform differential expression to determine what pathway is most different across our variable of interest. + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/pathway-analysis_microarray_03_gsva.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. + +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" # Can replace with path to desired output plots directory + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" # Can replace with path to desired output results directory + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## Obtain the dataset from refine.bio + +For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). + +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/GSE37382). + +Click the "Download Now" button on the right side of this screen. + + + +Fill out the pop up window with your email and our Terms and Conditions: + + + +It may take a few minutes for the dataset to process. +You will get an email when it is ready. + +## About the dataset we are using for this example + +For this example analysis, we will use this [medulloblastoma dataset](https://www.refine.bio/experiments/GSE37382) [@Northcott2012]. +The data that we downloaded from refine.bio for this analysis has 285 microarray samples obtained from patients with medulloblastoma. +Medulloblastoma is the most common childhood brain cancer and is often categorized by subgroups. +We will use these `subgroup` labels from our metadata to perform differential expression with our GSVA scores. + +## Place the dataset in your new `data/` folder + +refine.bio will send you a download button in the email when it is ready. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. +Double clicking should unzip this for you and create a folder of the same name. + + + +For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). + +The `GSE37382` folder has the data and metadata TSV files you will need for this example analysis. +Experiment accession ids usually look something like `GSE1235` or `SRP12345`. + +Copy and paste the `GSE37382` folder into your newly created `data/` folder. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called "data" which contains: + - The `GSE37382` folder which contains: + - The gene expression + - The metadata TSV +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): + + + +In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. +These chunks will declare your file paths and double check that your files are in the right place. + +First we will declare our file paths to our data and metadata files, which should be in our data directory. +This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started. + +```{r} +# Define the file path to the data directory +data_dir <- file.path("data", "GSE37382") # Replace with accession number which will be the name of the folder the files will be in + +# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` +data_file <- file.path(data_dir, "GSE37382.tsv") # Replace with file path to your dataset + +# Declare the file path to the metadata file using the data directory saved as `data_dir` +metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv") # Replace with file path to your metadata +``` + +Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. + +```{r} +# Check if the gene expression matrix file is at the file path stored in `data_file` +file.exists(data_file) + +# Check if the metadata file is at the file path stored in `metadata_file` +file.exists(metadata_file) +``` + +If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. + +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. + +*** + +   + +# Gene set variation analysis - Microarray + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. + +In this analysis, we will be using the [`GSVA`](https://www.bioconductor.org/packages/release/bioc/html/GSVA.html) package to perform GSVA and the [`qusage`](https://www.bioconductor.org/packages/release/bioc/html/qusage.html) package to read in the GMT file containing the gene set data [@Hanzelmann2013; @Yaari2013]. + +We will also need the [`org.Hs.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) package to perform gene identifier conversion [@Carlson2020-human]. + +We'll also be performing differential expression test on our GSVA scores, and for that we will use `limma` [@Ritchie2015] and we'll make a sina plot of the scores of our most significant pathway using a ggplot2 companion package, `ggforce`. + +```{r} +if (!("GSVA" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("GSVA", update = FALSE) +} + +if (!("qusage" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("qusage", update = FALSE) +} + +if (!("org.Hs.eg.db" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("org.Hs.eg.db", update = FALSE) +} + +if (!("limma" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("limma", update = FALSE) +} + +if (!("ggforce" %in% installed.packages())) { + # Install this package if it isn't installed yet + install.packages("ggforce") +} +``` + +Attach the packages we need for this analysis. + +```{r} +# Attach the `qusage` library +library(qusage) + +# Attach the `GSVA` library +library(GSVA) + +# Human annotation package we'll use for gene identifier conversion +library(org.Hs.eg.db) + +# Attach the ggplot2 package for plotting +library(ggplot2) + +# We will need this so we can use the pipe: %>% +library(magrittr) +``` + +## Import and set up data + +Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. +This chunk of code will read both TSV files and add them as data frames to your environment. + +We stored our file paths as objects named `metadata_file` and `data_file` in [this previous step](#check-out-our-file-structure). + +```{r} +# Read in metadata TSV file +metadata <- readr::read_tsv(metadata_file) + +# Read in data TSV file +expression_df <- readr::read_tsv(data_file) +``` + +Let’s ensure that the metadata and data are in the same sample order. + +```{r} +# Make the data in the order of the metadata +expression_df <- expression_df %>% + dplyr::select(c(Gene, metadata$geo_accession)) + +# Check if this is in the same order +all.equal(colnames(expression_df)[-1], metadata$geo_accession) +``` + +### Import Gene Sets + +The function that we will use to run GSVA wants the gene sets to be in a list, rather than a tidy data frame. + +We are therefore going to read in the Hallmark collection of gene sets file directly from [MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp), rather than using the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package like we did in earlier notebooks [@Igor2020]. + +Note that when reading in the Hallmark collection file in this manner, the data is only compatible with _Homo sapiens_ datasets. + +```{r} +# R can often read in data from a URL +hallmarks_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/h.all.v7.2.entrez.gmt" + +# QuSAGE is another pathway analysis method, the `qusage` package has a function +# for reading GMT files and turning them into a list +hallmarks_list <- qusage::read.gmt(hallmarks_url) +``` + +What does this `hallmarks_list` look like? + +```{r} +head(hallmarks_list) +``` + +Looks like we have a list of gene sets with associated Entrez IDs. + +In our gene expression data frame, `expression_df` we have Ensembl gene identifiers. +So we will need to convert our Ensembl IDs into Entrez IDs for GSVA. + +### Gene identifier conversion + +We're going to convert our identifiers in `expression_df` to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers! + +The annotation package `org.Hs.eg.db` contains information for different identifiers [@Carlson2020-human]. +`org.Hs.eg.db` is specific to _Homo sapiens_ -- this is what the `Hs` in the package name is referencing. + +We can see what types of IDs are available to us in an annotation package with `keytypes()`. + +```{r} +keytypes(org.Hs.eg.db) +``` + +Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to Entrez IDs (`ENTREZID`), we could just as easily use it to convert from an Ensembl transcript ID (`ENSEMBLTRANS`) to gene symbols (`SYMBOL`). + +Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: +[the microarray example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html) and [the RNA-seq example](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html). + +The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called `mapIds()`. + +Let's create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs. + +```{r} +# First let's create a mapped data frame we can join to the gene expression values +mapped_df <- data.frame( + "entrez_id" = mapIds( + org.Hs.eg.db, # Replace with annotation package for the organism relevant to your data + keys = expression_df$Gene, + column = "ENTREZID", # Replace with the type of gene identifiers you would like to map to + keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data + multiVals = "first" # This will keep only the first mapped value for each Ensembl ID + ) +) %>% + # If an Ensembl gene identifier doesn't map to a Entrez gene identifier, drop that + # from the data frame + dplyr::filter(!is.na(entrez_id)) %>% + # Make an `Ensembl` column to store the row names + tibble::rownames_to_column("Ensembl") %>% + # Now let's join the rest of the expression data + dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene")) +``` + +This `1:many mapping between keys and columns` message means that some Ensembl gene identifiers map to multiple Entrez IDs. +In this case, it's also possible that a Entrez ID will map to multiple Ensembl IDs. +For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs. +For more about how to explore this, take a look at our [microarray gene ID conversion example](https://alexslemonade.github.io/refinebio-examples/02-microarray/gene-id-annotation_microarray_01_ensembl.html). + +Let's see a preview of `mapped_df`. + +```{r} +head(mapped_df) +``` + +We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. +In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers. + +Let's check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs. + +```{r} +any(duplicated(mapped_df$entrez_id)) +``` + +Looks like we do have duplicated Entrez IDs. +Let's find out which Entrez IDs have been duplicated. + +```{r} +dup_entrez_ids <- mapped_df %>% + dplyr::filter(duplicated(entrez_id)) %>% + dplyr::pull(entrez_id) + +dup_entrez_ids +``` + +#### Handling duplicate gene identifiers + +As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for the GSVA steps later. +GSVA is executed on a per sample basis so let's keep the maximum expression value per sample associated with the duplicate Entrez gene identifiers. +In other words, we will keep only the maximum expression value found across the duplicate Entrez gene identifier instances for each sample or column in this case. + +Let's take a look at the rows associated with the duplicated Entrez IDs and see how this will play out. + +```{r} +mapped_df %>% + dplyr::filter(entrez_id %in% dup_entrez_ids) +``` + +As an example using the strategy we described, for `GSM917111`'s data in the first column, `0.2294387` is larger than `0.1104345` so moving forward, Entrez gene `6013` will have `0.2294387` value and the `0.1104345` would be dropped from the dataset. +However, this is just one method of handling duplicate gene identifiers. +See the [Gene Set Enrichment Analysis (GSEA) User guide](https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html) for more information on other expression values that can be used, like the median expression value for example [@GSEA-user-guide]. +Now, let's implement our choose the max value method for all samples and Entrez IDs using tidyverse functions. + +```{r} +max_dup_df <- mapped_df %>% + # We won't be using Ensembl IDs moving forward, so we will drop this column + dplyr::select(-Ensembl) %>% + # Filter to include only the rows associated with the duplicate Entrez gene identifiers + dplyr::filter(entrez_id %in% dup_entrez_ids) %>% + # Group by Entrez IDs + dplyr::group_by(entrez_id) %>% + # Get the maximum expression value using all values associated with each duplicate + # Entrez ID for each column or sample in this case + dplyr::summarize_all(max) + +max_dup_df +``` + +We can see `GSM917111` now has the `0.2294387` value for Entrez ID `6013` like expected. +Looks like we were able to successfully get rid of the duplicate Entrez gene identifiers! + +Now let's combine our de-duplicated data with the rest of the mapped data! + +```{r} +filtered_mapped_df <- mapped_df %>% + # We won't be using Ensembl IDs moving forward, so we will drop this column + dplyr::select(-Ensembl) %>% + # First let's get the data associated with the Entrez gene identifiers that aren't duplicated + dplyr::filter(!entrez_id %in% dup_entrez_ids) %>% + # Now let's bind the rows of the maximum expression data we stored in `max_dup_df` + dplyr::bind_rows(max_dup_df) +``` + +As mentioned earlier, we need a matrix for GSVA. +Let's now convert our data frame into a matrix and prepare our object for GSVA. + +```{r} +filtered_mapped_matrix <- filtered_mapped_df %>% + # We need to store our gene identifiers as row names + tibble::column_to_rownames("entrez_id") %>% + # Now we can convert our object into a matrix + as.matrix() +``` + +Note that if we had duplicate gene identifiers here, we would not be able to set them as row names. + +## GSVA - Microarray + +Rather than ranking genes based on some gene-level statistic selected by the user, GSVA fits a model and ranks genes based on their expression level relative to the sample distribution. +The pathway-level score calculated is a way of asking how genes _within_ a gene set vary as compared to genes that are _outside_ of that gene set [@Malhotra2018]. + +The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) [@Hanzelmann2013]. + +The output is a gene set by sample matrix of GSVA scores. + +### Perform GSVA + +Let's perform GSVA using the `gsva()` function. +See `?gsva` for more options. + +```{r} +gsva_results <- gsva( + filtered_mapped_matrix, + hallmarks_list, + method = "gsva", + # Appropriate for our log2-transformed microarray data + kcdf = "Gaussian", + # Minimum gene set size + min.sz = 15, + # Maximum gene set size + max.sz = 500, + # Compute Gaussian-distributed scores + mx.diff = TRUE, + # Don't print out the progress bar + verbose = FALSE +) +``` + +Let's explore what the output of `gsva()` looks like. + +```{r} +head(gsva_results[, 1:10]) +``` + +## Find differentially expressed pathways + +If we want to identify most differentially expressed pathways across subgroups, we can use functionality in the `limma` package to test the GSVA scores. +This is one approach for working with GSVA scores. +The `mx.diff = TRUE` argument that we supplied to the `gsva()` function in the previous section means the GSVA output scores should be normally distributed, which allows downstream analyses that make this assumption about the distribution [@Hanzelmann-gsva-vignette]. + +### Create the design matrix + +`limma` needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. +Now we will create a model matrix based on our `subgroup` variable. +We are using a `+ 0` in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. +If you have a control group, you might want that to be the intercept. + +```{r} +# Create the design matrix +des_mat <- model.matrix(~ metadata$subgroup + 0) +``` + +Let's take a look at the design matrix we created. + +```{r} +# Print out the design matrix +head(des_mat) +``` + +The design matrix column names are a bit messy, so we will neaten them up by dropping the `metadata$subgroup` designation they all have and for being able to reference them easily later, we want to drop spaces as well. + +```{r} +# Make the column names less messy +colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "metadata\\$subgroup") + +# Do a similar thing but remove spaces in names +colnames(des_mat) <- stringr::str_remove(colnames(des_mat), " ") +``` + +Side note: If you are wondering why there are two `\` above in `"filtered_metadata\\$subgroup"`, that's called an [escape character](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html#escaping). +There's a whole universe of things called [regular expressions (regex)](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html) that can be super handy for string manipulations. + +## Perform differential expression on pathway scores + +Run the linear model on each pathway (each row of `gsva_results`). + +```{r} +# Apply linear model to data +fit <- limma::lmFit(gsva_results, design = des_mat) + +# Apply empirical Bayes to smooth standard errors +fit <- limma::eBayes(fit) +``` + +Now that we have our basic model fitting, we will want to make the contrasts among all our groups. +Depending on your scientific questions, you will need to customize the next steps. +Consulting the [limma users guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) for how to set up your model based on your hypothesis is a good idea. + +In this contrasts matrix, we are comparing each subgroup to the average of the other subgroups. +We're dividing by two in this expression so that each group is compared to the average of the other two groups (`makeContrasts()` doesn't allow you to use functions like `mean()`; it wants a formula). + +```{r} +contrast_matrix <- makeContrasts( + "G3vsOther" = Group3 - (Group4 + SHH) / 2, + "G4vsOther" = Group4 - (Group3 + SHH) / 2, + "SHHvsOther" = SHH - (Group3 + Group4) / 2, + levels = des_mat +) +``` + +Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulate would look like `Group3 = Group3 - Control` for each one. +We highly recommend consulting the [limma users guide](https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) for figuring out what your `makeContrasts()` and `model.matrix()` setups should look like [@Ritchie2015]. + +Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with `eBayes()`. + +```{r} +# Fit the model according to the contrasts matrix +contrasts_fit <- contrasts.fit(fit, contrast_matrix) + +# Re-smooth the Bayes +contrasts_fit <- eBayes(contrasts_fit) +``` + +Here's a [nifty article and example](http://varianceexplained.org/r/empirical_bayes_baseball/) about what the empirical Bayes smoothing is for [@bayes-estimates]. + +Now let's create the results table based on the contrasts fitted model. + +This step will provide the Benjamini-Hochberg multiple testing correction. +The `topTable()` function default is to use Benjamini-Hochberg but this can be changed to a different method using the `adjust.method` argument (see the `?topTable` help page for more about the options). + +```{r} +# Apply multiple testing correction and obtain stats +stats_df <- topTable(contrasts_fit, number = nrow(expression_df)) %>% + tibble::rownames_to_column("Gene") +``` + +Let's take a peek at our results table. + +```{r} +head(stats_df) +``` + +For each gene, each group's fold change in expression, compared to the average of the other groups is reported. + +By default, results are ordered from largest `F` value to the smallest, which means your most differentially expressed genes across all groups should be toward the top. + +This means `HALLMARK_UNFOLDED_PROTEIN_RESPONSE` appears to be the pathway that contains the most significant distribution of scores across subgroups. + +## Visualizing Results + +Let's make a plot for our most significant pathway, `HALLMARK_UNFOLDED_PROTEIN_RESPONSE`. + +### Sina plot + +First we need to get our GSVA scores for this pathway into a long data frame, an appropriate format for `ggplot2`. + +Let's look at the current format of `gsva_results`. + +```{r} +head(gsva_results[, 1:10]) +``` + +We can see that they are in a wide format with the gsva scores for each sample spread across a row associated with each pathway. + +Now let's convert these results into a data frame and into a long format, using the `tidyr::pivot_longer()` function. + +```{r} +annotated_results_df <- gsva_results %>% + # Make this into a data frame + data.frame() %>% + # Gene set names are row names + tibble::rownames_to_column("pathway") %>% + # Get into long format using the `tidyr::pivot_longer()` function + tidyr::pivot_longer( + cols = -pathway, + names_to = "sample", + values_to = "gsva_score" + ) + +# Preview the annotated results object +head(annotated_results_df) +``` + +Now let's filter to include only the scores associated with our most significant pathway, `HALLMARK_UNFOLDED_PROTEIN_RESPONSE`, and join the relevant group labels from the metadata for plotting. + +```{r} +top_pathway_annotated_results_df <- annotated_results_df %>% + # Filter for only scores associated with our most significant pathway + dplyr::filter(pathway == "HALLMARK_UNFOLDED_PROTEIN_RESPONSE") %>% + # Join the column with the group labels that we would like to plot and compare -- + # Select the variables relevant to your data + dplyr::left_join(metadata %>% dplyr::select( + refinebio_accession_code, + subgroup + ), + # Tell the join what columns are equivalent and should be used as a key + by = c("sample" = "refinebio_accession_code") + ) + +# Preview the filtered annotated results object +head(top_pathway_annotated_results_df) +``` + +Now let's make a sina plot so we can look at the differences between the `subgroup` groups using our GSVA scores. + +```{r} +# Sina plot comparing GSVA scores for `HALLMARK_UNFOLDED_PROTEIN_RESPONSE` the `subgroup` groups in our dataset +sina_plot <- + ggplot( + top_pathway_annotated_results_df, # Supply our annotated data frame + aes( + x = subgroup, # Replace with a grouping variable relevant to your data + y = gsva_score, # Column we created to store the GSVA scores in the last chunk + color = subgroup # Let's make the groups different colors too + ) + ) + + # Add a boxplot that will have summary stats + geom_boxplot(outlier.shape = NA) + + # Make a sina plot that shows individual values + ggforce::geom_sina() + + # Rename the y-axis label + labs(y = "HALLMARK_UNFOLDED_PROTEIN_RESPONSE_score") + + # Adjust the plot background for better visualization + theme_bw() + +# Display plot +sina_plot +``` + +Looks like the `Group 4` samples have lower GSVA scores for `HALLMARK_UNFOLDED_PROTEIN_RESPONSE` as compared to the `SHH` and `Group 3` scores. + +Let's save this plot to PNG. + +```{r} +ggsave(file.path(plots_dir, "GSE37382_gsva_HALLMARK_UNFOLDED_PROTEIN_RESPONSE_sina_plot.png"), + plot = sina_plot +) +``` + +## Write results to file + +Now let's write all of our GSVA results to file. + +```{r} +gsva_results %>% + as.data.frame() %>% + tibble::rownames_to_column("pathway") %>% + readr::write_tsv(file.path( + results_dir, + "GSE37382_gsva_results.tsv" + )) +``` + +# Resources for further learning + +- [GSVA Paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-7) [@Hanzelmann2013] +- See this article on [Decoding Gene Set Variation Analysis](https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3) [@Malhotra2018] + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessioninfo::session_info() +``` + +# References diff --git a/02-microarray/pathway-analysis_microarray_03_gsva.html b/02-microarray/pathway-analysis_microarray_03_gsva.html new file mode 100644 index 00000000..1ae4f1b1 --- /dev/null +++ b/02-microarray/pathway-analysis_microarray_03_gsva.html @@ -0,0 +1,4733 @@ + + + + + + + + + + + + + + +Gene set variation analysis - Microarray + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + + +
+

1 Purpose of this analysis

+

This example is one of pathway analysis module set, we recommend looking at the pathway analysis introduction to help you determine which pathway analysis method is best suited for your purposes.

+

In this example we will cover a method called Gene Set Variation Analysis (GSVA) to calculate gene set or pathway scores on a per-sample basis. Gene set variation analysis (GSVA) constitutes a starting point to build pathway-centric models of biology. Rather than contextualizing some results you have received from another analysis like DGE, GSVA is designed to provide an estimate of pathway variation for each of the samples in an experiment. This means that GSVA scores will depend on the samples included in the dataset when you run GSVA; if you added more samples and ran GSVA again, you would expect the scores to change (Hänzelmann et al. 2013a).

+

GSVA determines the relative enrichment of gene sets across samples using a non-parametric approach. GSVA transforms a gene by sample gene expression matrix into a gene set by sample pathway enrichment matrix (Hänzelmann et al. 2013b). You can use these scores for other downstream analyses; in this analysis we will perform differential expression to determine what pathway is most different across our variable of interest.

+

⬇️ Jump to the analysis code ⬇️

+
+
+

2 How to run this example

+

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

+
+

2.1 Obtain the .Rmd file

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

+

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

+
+
+

2.2 Set up your analysis folders

+

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

+

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots" # Can replace with path to desired output plots directory
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results" # Can replace with path to desired output results directory
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

+
+
+

2.3 Obtain the dataset from refine.bio

+

For general information about downloading data for these examples, see our ‘Getting Started’ section.

+

Go to this dataset’s page on refine.bio.

+

Click the “Download Now” button on the right side of this screen.

+

+

Fill out the pop up window with your email and our Terms and Conditions:

+

+

It may take a few minutes for the dataset to process. You will get an email when it is ready.

+
+
+

2.4 About the dataset we are using for this example

+

For this example analysis, we will use this medulloblastoma dataset (Northcott et al. 2012). The data that we downloaded from refine.bio for this analysis has 285 microarray samples obtained from patients with medulloblastoma. Medulloblastoma is the most common childhood brain cancer and is often categorized by subgroups. We will use these subgroup labels from our metadata to perform differential expression with our GSVA scores.

+
+
+

2.5 Place the dataset in your new data/ folder

+

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

+

+

For more details on the contents of this folder see these docs on refine.bio.

+

The GSE37382 folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

+

Copy and paste the GSE37382 folder into your newly created data/ folder.

+
+
+

2.6 Check out our file structure!

+

Your new analysis folder should contain:

+
    +
  • The example analysis .Rmd you downloaded
    +
  • +
  • A folder called “data” which contains: +
      +
    • The GSE37382 folder which contains: +
        +
      • The gene expression
        +
      • +
      • The metadata TSV
        +
      • +
    • +
  • +
  • A folder for plots (currently empty)
    +
  • +
  • A folder for results (currently empty)
  • +
+

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

+

+

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

+

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

+
# Define the file path to the data directory
+data_dir <- file.path("data", "GSE37382") # Replace with accession number which will be the name of the folder the files will be in
+
+# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
+data_file <- file.path(data_dir, "GSE37382.tsv") # Replace with file path to your dataset
+
+# Declare the file path to the metadata file using the data directory saved as `data_dir`
+metadata_file <- file.path(data_dir, "metadata_GSE37382.tsv") # Replace with file path to your metadata
+

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

+
# Check if the gene expression matrix file is at the file path stored in `data_file`
+file.exists(data_file)
+
## [1] TRUE
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
+
## [1] TRUE
+

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

+

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

+
+
+
+

3 Using a different refine.bio dataset with this analysis?

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+
+ +

 

+
+
+

4 Gene set variation analysis - Microarray

+
+

4.1 Install libraries

+

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

+

In this analysis, we will be using the GSVA package to perform GSVA and the qusage package to read in the GMT file containing the gene set data (Hänzelmann et al. 2013a; Yaari et al. 2013).

+

We will also need the org.Hs.eg.db package to perform gene identifier conversion (Carlson 2020).

+

We’ll also be performing differential expression test on our GSVA scores, and for that we will use limma (Ritchie et al. 2015) and we’ll make a sina plot of the scores of our most significant pathway using a ggplot2 companion package, ggforce.

+
if (!("GSVA" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("GSVA", update = FALSE)
+}
+
+if (!("qusage" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("qusage", update = FALSE)
+}
+
+if (!("org.Hs.eg.db" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("org.Hs.eg.db", update = FALSE)
+}
+
+if (!("limma" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("limma", update = FALSE)
+}
+
+if (!("ggforce" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  install.packages("ggforce")
+}
+

Attach the packages we need for this analysis.

+
# Attach the `qusage` library
+library(qusage)
+
## Loading required package: limma
+
# Attach the `GSVA` library
+library(GSVA)
+
+# Human annotation package we'll use for gene identifier conversion
+library(org.Hs.eg.db)
+
## Loading required package: AnnotationDbi
+
## Loading required package: stats4
+
## Loading required package: BiocGenerics
+
## Loading required package: parallel
+
## 
+## Attaching package: 'BiocGenerics'
+
## The following objects are masked from 'package:parallel':
+## 
+##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+##     clusterExport, clusterMap, parApply, parCapply, parLapply,
+##     parLapplyLB, parRapply, parSapply, parSapplyLB
+
## The following object is masked from 'package:limma':
+## 
+##     plotMA
+
## The following objects are masked from 'package:stats':
+## 
+##     IQR, mad, sd, var, xtabs
+
## The following objects are masked from 'package:base':
+## 
+##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
+##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
+##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
+##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
+##     union, unique, unsplit, which.max, which.min
+
## Loading required package: Biobase
+
## Welcome to Bioconductor
+## 
+##     Vignettes contain introductory material; view with
+##     'browseVignettes()'. To cite Bioconductor, see
+##     'citation("Biobase")', and for packages 'citation("pkgname")'.
+
## Loading required package: IRanges
+
## Loading required package: S4Vectors
+
## 
+## Attaching package: 'S4Vectors'
+
## The following object is masked from 'package:base':
+## 
+##     expand.grid
+
## 
+
# Attach the ggplot2 package for plotting
+library(ggplot2)
+
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+
+

4.2 Import and set up data

+

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read both TSV files and add them as data frames to your environment.

+

We stored our file paths as objects named metadata_file and data_file in this previous step.

+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_character(),
+##   refinebio_age = col_double(),
+##   refinebio_cell_line = col_logical(),
+##   refinebio_compound = col_logical(),
+##   refinebio_disease = col_logical(),
+##   refinebio_disease_stage = col_logical(),
+##   refinebio_genetic_information = col_logical(),
+##   refinebio_processed = col_logical(),
+##   refinebio_race = col_logical(),
+##   refinebio_source_archive_url = col_logical(),
+##   refinebio_specimen_part = col_logical(),
+##   refinebio_subject = col_logical(),
+##   refinebio_time = col_logical(),
+##   refinebio_treatment = col_logical(),
+##   channel_count = col_double(),
+##   `contact_zip/postal_code` = col_double(),
+##   data_row_count = col_double(),
+##   taxid_ch1 = col_double()
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+expression_df <- readr::read_tsv(data_file)
+
## 
+## ── Column specification ──────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_double(),
+##   Gene = col_character()
+## )
+## ℹ Use `spec()` for the full column specifications.
+

Let’s ensure that the metadata and data are in the same sample order.

+
# Make the data in the order of the metadata
+expression_df <- expression_df %>%
+  dplyr::select(c(Gene, metadata$geo_accession))
+
+# Check if this is in the same order
+all.equal(colnames(expression_df)[-1], metadata$geo_accession)
+
## [1] TRUE
+
+

4.2.1 Import Gene Sets

+

The function that we will use to run GSVA wants the gene sets to be in a list, rather than a tidy data frame.

+

We are therefore going to read in the Hallmark collection of gene sets file directly from MSigDB, rather than using the msigdbr package like we did in earlier notebooks (Dolgalev 2020).

+

Note that when reading in the Hallmark collection file in this manner, the data is only compatible with Homo sapiens datasets.

+
# R can often read in data from a URL
+hallmarks_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/h.all.v7.2.entrez.gmt"
+
+# QuSAGE is another pathway analysis method, the `qusage` package has a function
+# for reading GMT files and turning them into a list
+hallmarks_list <- qusage::read.gmt(hallmarks_url)
+

What does this hallmarks_list look like?

+
head(hallmarks_list)
+
## $HALLMARK_TNFA_SIGNALING_VIA_NFKB
+##   [1] "3726"   "2920"   "467"    "4792"   "7128"   "5743"   "2919"   "8870"  
+##   [9] "9308"   "6364"   "2921"   "23764"  "4791"   "7127"   "1839"   "1316"  
+##  [17] "330"    "5329"   "7538"   "3383"   "3725"   "1960"   "3553"   "597"   
+##  [25] "23645"  "80149"  "6648"   "4929"   "3552"   "5971"   "7185"   "7832"  
+##  [33] "1843"   "1326"   "2114"   "2152"   "6385"   "1958"   "3569"   "7124"  
+##  [41] "23135"  "4790"   "3976"   "5806"   "8061"   "3164"   "182"    "6351"  
+##  [49] "2643"   "6347"   "1827"   "1844"   "10938"  "9592"   "5966"   "8837"  
+##  [57] "8767"   "4794"   "8013"   "22822"  "51278"  "8744"   "2669"   "1647"  
+##  [65] "3627"   "10769"  "8553"   "1959"   "9021"   "11182"  "5734"   "1847"  
+##  [73] "5055"   "4783"   "5054"   "10221"  "25976"  "5970"   "329"    "6372"  
+##  [81] "9516"   "7130"   "960"    "3624"   "5328"   "4609"   "3604"   "6446"  
+##  [89] "10318"  "10135"  "2355"   "10957"  "3398"   "969"    "3575"   "1942"  
+##  [97] "7262"   "5209"   "6352"   "79693"  "3460"   "8878"   "10950"  "4616"  
+## [105] "8942"   "50486"  "694"    "4170"   "7422"   "5606"   "1026"   "3491"  
+## [113] "10010"  "3433"   "3606"   "7280"   "3659"   "2353"   "4973"   "388"   
+## [121] "374"    "4814"   "65986"  "8613"   "9314"   "6373"   "6303"   "1435"  
+## [129] "1880"   "56937"  "5791"   "7097"   "57007"  "7071"   "4082"   "3914"  
+## [137] "1051"   "9322"   "2150"   "687"    "3949"   "7050"   "127544" "55332" 
+## [145] "2683"   "11080"  "1437"   "5142"   "8303"   "5341"   "6776"   "23258" 
+## [153] "595"    "23586"  "8877"   "941"    "25816"  "57018"  "2526"   "9034"  
+## [161] "80176"  "8848"   "9334"   "150094" "23529"  "4780"   "2354"   "5187"  
+## [169] "10725"  "490"    "3593"   "3572"   "9120"   "19"     "3280"   "604"   
+## [177] "8660"   "6515"   "1052"   "51561"  "4088"   "6890"   "9242"   "64135" 
+## [185] "3601"   "79155"  "602"    "24145"  "24147"  "1906"   "10209"  "650"   
+## [193] "1846"   "10611"  "23308"  "9945"   "10365"  "3371"   "5271"   "4084"  
+## 
+## $HALLMARK_HYPOXIA
+##   [1] "5230"   "5163"   "2632"   "5211"   "226"    "2026"   "5236"   "10397" 
+##   [9] "3099"   "230"    "2821"   "4601"   "6513"   "5033"   "133"    "8974"  
+##  [17] "2023"   "5214"   "205"    "26355"  "5209"   "7422"   "665"    "7167"  
+##  [25] "30001"  "55818"  "901"    "3939"   "2997"   "2597"   "8553"   "51129" 
+##  [33] "3725"   "5054"   "4015"   "2645"   "8497"   "23764"  "54541"  "6515"  
+##  [41] "3486"   "4783"   "2353"   "3516"   "3098"   "10370"  "3669"   "2584"  
+##  [49] "26118"  "5837"   "6781"   "23036"  "694"    "123"    "1466"   "7436"  
+##  [57] "23210"  "2131"   "2152"   "5165"   "55139"  "7360"   "229"    "8614"  
+##  [65] "54206"  "2027"   "10957"  "3162"   "5228"   "26330"  "9435"   "55076" 
+##  [73] "63827"  "467"    "857"    "272"    "2719"   "3340"   "8660"   "8819"  
+##  [81] "2548"   "6385"   "8987"   "8870"   "5313"   "3484"   "5329"   "112464"
+##  [89] "8839"   "9215"   "25819"  "6275"   "58528"  "7538"   "1956"   "1907"  
+##  [97] "3423"   "1026"   "6095"   "1843"   "4282"   "5507"   "10570"  "11015" 
+## [105] "1837"   "136"    "9957"   "284119" "2908"   "1316"   "2239"   "3491"  
+## [113] "7128"   "771"    "3073"   "633"    "23645"  "55276"  "5292"   "25824" 
+## [121] "55577"  "1027"   "680"    "8277"   "4493"   "538"    "4502"   "9672"  
+## [129] "25976"  "5317"   "302"    "5224"   "1649"   "5578"   "2542"   "7852"  
+## [137] "1944"   "1356"   "8609"   "1490"   "9469"   "7163"   "56925"  "124872"
+## [145] "10891"  "596"    "2651"   "3036"   "54800"  "949"    "6576"   "6383"  
+## [153] "839"    "7428"   "2309"   "5155"   "126792" "6518"   "8406"   "1942"  
+## [161] "2745"   "57007"  "5066"   "7045"   "1634"   "6478"   "51316"  "2203"  
+## [169] "8459"   "5260"   "4627"   "1028"   "9380"   "5105"   "3623"   "3309"  
+## [177] "8509"   "23327"  "7162"   "7511"   "3569"   "6533"   "4214"   "3948"  
+## [185] "9590"   "26136"  "3798"   "3906"   "1289"   "2817"   "3069"   "10994" 
+## [193] "1463"   "7052"   "2113"   "3219"   "8991"   "2355"   "6820"   "7043"  
+## 
+## $HALLMARK_CHOLESTEROL_HOMEOSTASIS
+##  [1] "2224"   "1595"   "3422"   "2222"   "1717"   "6713"   "3157"   "50814" 
+##  [9] "4047"   "4597"   "3949"   "7108"   "230"    "10682"  "6319"   "10654" 
+## [17] "4598"   "4023"   "6309"   "9415"   "3156"   "51478"  "312"    "6721"  
+## [25] "5833"   "55902"  "467"    "127"    "23474"  "1891"   "875"    "2990"  
+## [33] "2194"   "3958"   "22809"  "308"    "94241"  "1119"   "2946"   "39"    
+## [41] "552"    "5359"   "1191"   "54206"  "57761"  "58191"  "51330"  "71"    
+## [49] "182"    "5641"   "26270"  "493869" "10957"  "118429" "114569" "928"   
+## [57] "5468"   "2731"   "6811"   "134429" "1499"   "27346"  "116496" "5165"  
+## [65] "5329"   "7869"   "2770"   "20"     "6311"   "4783"   "214"    "2171"  
+## [73] "6282"   "132864"
+## 
+## $HALLMARK_MITOTIC_SPINDLE
+##   [1] "9181"   "23332"  "3832"   "9493"   "57679"  "382"    "4650"   "4627"  
+##   [9] "10426"  "9793"   "29127"  "57580"  "50650"  "4926"   "6711"   "11004" 
+##  [17] "3799"   "7272"   "324"    "11190"  "5048"   "10435"  "9371"   "55704" 
+##  [25] "56992"  "332"    "116840" "4763"   "7248"   "996"    "11064"  "114791"
+##  [33] "24137"  "22919"  "55785"  "675"    "5347"   "5921"   "4751"   "8936"  
+##  [41] "7153"   "7204"   "9826"   "10300"  "9055"   "54443"  "55755"  "9126"  
+##  [49] "10844"  "9700"   "55201"  "201176" "9732"   "29901"  "3619"   "394"   
+##  [57] "2934"   "10276"  "10128"  "23637"  "2317"   "64411"  "121512" "29"    
+##  [65] "55835"  "4690"   "1063"   "9585"   "10163"  "4628"   "1062"   "9266"  
+##  [73] "4281"   "3831"   "57787"  "127829" "9702"   "8409"   "393"    "23580" 
+##  [81] "163786" "9113"   "4983"   "8976"   "4296"   "6654"   "25"     "7074"  
+##  [89] "23095"  "6453"   "134549" "8440"   "9787"   "613"    "10048"  "2037"  
+##  [97] "10801"  "11104"  "51174"  "22974"  "3797"   "357"    "85378"  "6709"  
+## [105] "23022"  "23647"  "9735"   "84376"  "25777"  "58526"  "1739"   "2316"  
+## [113] "79658"  "8476"   "23365"  "4082"   "51199"  "5108"   "10928"  "7430"  
+## [121] "85464"  "983"    "22930"  "10160"  "11346"  "54509"  "1894"   "2035"  
+## [129] "51735"  "3835"   "84333"  "6780"   "396"    "6790"   "26271"  "51203" 
+## [137] "5829"   "9564"   "23607"  "11214"  "10013"  "22994"  "3996"   "23192" 
+## [145] "5116"   "7840"   "11133"  "667"    "22920"  "151987" "9411"   "9462"  
+## [153] "9133"   "80119"  "5922"   "4739"   "8243"   "81"     "5311"   "7461"  
+## [161] "998"    "10403"  "9874"   "9344"   "6904"   "832"    "1794"   "2017"  
+## [169] "10051"  "10565"  "7277"   "4001"   "10006"  "6093"   "55125"  "699"   
+## [177] "50628"  "64857"  "253260" "10018"  "1778"   "6624"   "8874"   "140735"
+## [185] "4643"   "274"    "4853"   "5981"   "10611"  "89941"  "8470"   "11135" 
+## [193] "7414"   "6249"   "23012"  "7531"   "9771"   "55722"  "1453"  
+## 
+## $HALLMARK_WNT_BETA_CATENIN_SIGNALING
+##  [1] "4609"  "1499"  "3714"  "4851"  "28514" "8313"  "5664"  "8321"  "4855" 
+## [10] "51176" "8312"  "85407" "81029" "8454"  "182"   "9794"  "2648"  "2770" 
+## [19] "7475"  "5727"  "9612"  "27121" "3066"  "22943" "6932"  "7471"  "8650" 
+## [28] "6868"  "1856"  "5467"  "23385" "10014" "894"   "10023" "1454"  "3516" 
+## [37] "8325"  "7157"  "6502"  "23493" "23462" "79885"
+## 
+## $HALLMARK_TGF_BETA_SIGNALING
+##  [1] "7046"  "4092"  "7040"  "64750" "57154" "659"   "6498"  "6497"  "90"   
+## [10] "56937" "9612"  "5054"  "3726"  "4086"  "4091"  "23645" "7050"  "5045" 
+## [19] "4088"  "2280"  "6885"  "657"   "1499"  "28996" "7071"  "650"   "2022" 
+## [28] "324"   "5494"  "331"   "999"   "3397"  "7044"  "1028"  "51592" "11031"
+## [37] "7082"  "6574"  "1025"  "3399"  "9241"  "51742" "3460"  "3398"  "5499" 
+## [46] "6711"  "25937" "8412"  "7057"  "2339"  "3065"  "7323"  "4053"  "387"
+

Looks like we have a list of gene sets with associated Entrez IDs.

+

In our gene expression data frame, expression_df we have Ensembl gene identifiers. So we will need to convert our Ensembl IDs into Entrez IDs for GSVA.

+
+
+

4.2.2 Gene identifier conversion

+

We’re going to convert our identifiers in expression_df to Entrez IDs, but you can, with the change of a single argument, use the same code to convert to many other types of identifiers!

+

The annotation package org.Hs.eg.db contains information for different identifiers (Carlson 2020). org.Hs.eg.db is specific to Homo sapiens – this is what the Hs in the package name is referencing.

+

We can see what types of IDs are available to us in an annotation package with keytypes().

+
keytypes(org.Hs.eg.db)
+
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
+##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
+## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
+## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
+## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
+## [26] "UNIPROT"
+

Even though we’ll use this package to convert from Ensembl gene IDs (ENSEMBL) to Entrez IDs (ENTREZID), we could just as easily use it to convert from an Ensembl transcript ID (ENSEMBLTRANS) to gene symbols (SYMBOL).

+

Take a look at our other gene identifier conversion examples for examples with different species and gene ID types: the microarray example and the RNA-seq example.

+

The function we will use to map from Ensembl gene IDs to Entrez gene IDs is called mapIds().

+

Let’s create a data frame that shows the mapped Entrez IDs along with the gene expression values for the respective Ensembl IDs.

+
# First let's create a mapped data frame we can join to the gene expression values
+mapped_df <- data.frame(
+  "entrez_id" = mapIds(
+    org.Hs.eg.db, # Replace with annotation package for the organism relevant to your data
+    keys = expression_df$Gene,
+    column = "ENTREZID", # Replace with the type of gene identifiers you would like to map to
+    keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data
+    multiVals = "first" # This will keep only the first mapped value for each Ensembl ID
+  )
+) %>%
+  # If an Ensembl gene identifier doesn't map to a Entrez gene identifier, drop that
+  # from the data frame
+  dplyr::filter(!is.na(entrez_id)) %>%
+  # Make an `Ensembl` column to store the row names
+  tibble::rownames_to_column("Ensembl") %>%
+  # Now let's join the rest of the expression data
+  dplyr::inner_join(expression_df, by = c("Ensembl" = "Gene"))
+
## 'select()' returned 1:many mapping between keys and columns
+

This 1:many mapping between keys and columns message means that some Ensembl gene identifiers map to multiple Entrez IDs. In this case, it’s also possible that a Entrez ID will map to multiple Ensembl IDs. For the purpose of performing GSVA later in this notebook, we keep only the first mapped IDs. For more about how to explore this, take a look at our microarray gene ID conversion example.

+

Let’s see a preview of mapped_df.

+
head(mapped_df)
+
+ +
+

We will want to keep in mind that GSVA requires that data is in a matrix with the gene identifiers as row names. In order to successfully turn our data frame into a matrix, we will need to ensure that we do not have any duplicate gene identifiers.

+

Let’s check to see if we have any Entrez IDs that mapped to multiple Ensembl IDs.

+
any(duplicated(mapped_df$entrez_id))
+
## [1] TRUE
+

Looks like we do have duplicated Entrez IDs. Let’s find out which Entrez IDs have been duplicated.

+
dup_entrez_ids <- mapped_df %>%
+  dplyr::filter(duplicated(entrez_id)) %>%
+  dplyr::pull(entrez_id)
+
+dup_entrez_ids
+
## [1] "6013" "3117"
+
+

4.2.2.1 Handling duplicate gene identifiers

+

As we mentioned earlier, we will not want any duplicate gene identifiers in our data frame when we convert it into a matrix in preparation for the GSVA steps later. GSVA is executed on a per sample basis so let’s keep the maximum expression value per sample associated with the duplicate Entrez gene identifiers. In other words, we will keep only the maximum expression value found across the duplicate Entrez gene identifier instances for each sample or column in this case.

+

Let’s take a look at the rows associated with the duplicated Entrez IDs and see how this will play out.

+
mapped_df %>%
+  dplyr::filter(entrez_id %in% dup_entrez_ids)
+
+ +
+

As an example using the strategy we described, for GSM917111’s data in the first column, 0.2294387 is larger than 0.1104345 so moving forward, Entrez gene 6013 will have 0.2294387 value and the 0.1104345 would be dropped from the dataset. However, this is just one method of handling duplicate gene identifiers. See the Gene Set Enrichment Analysis (GSEA) User guide for more information on other expression values that can be used, like the median expression value for example (Broad Institute Team 2019). Now, let’s implement our choose the max value method for all samples and Entrez IDs using tidyverse functions.

+
max_dup_df <- mapped_df %>%
+  # We won't be using Ensembl IDs moving forward, so we will drop this column
+  dplyr::select(-Ensembl) %>%
+  # Filter to include only the rows associated with the duplicate Entrez gene identifiers
+  dplyr::filter(entrez_id %in% dup_entrez_ids) %>%
+  # Group by Entrez IDs
+  dplyr::group_by(entrez_id) %>%
+  # Get the maximum expression value using all values associated with each duplicate
+  # Entrez ID for each column or sample in this case
+  dplyr::summarize_all(max)
+
+max_dup_df
+
+ +
+

We can see GSM917111 now has the 0.2294387 value for Entrez ID 6013 like expected. Looks like we were able to successfully get rid of the duplicate Entrez gene identifiers!

+

Now let’s combine our de-duplicated data with the rest of the mapped data!

+
filtered_mapped_df <- mapped_df %>%
+  # We won't be using Ensembl IDs moving forward, so we will drop this column
+  dplyr::select(-Ensembl) %>%
+  # First let's get the data associated with the Entrez gene identifiers that aren't duplicated
+  dplyr::filter(!entrez_id %in% dup_entrez_ids) %>%
+  # Now let's bind the rows of the maximum expression data we stored in `max_dup_df`
+  dplyr::bind_rows(max_dup_df)
+

As mentioned earlier, we need a matrix for GSVA. Let’s now convert our data frame into a matrix and prepare our object for GSVA.

+
filtered_mapped_matrix <- filtered_mapped_df %>%
+  # We need to store our gene identifiers as row names
+  tibble::column_to_rownames("entrez_id") %>%
+  # Now we can convert our object into a matrix
+  as.matrix()
+

Note that if we had duplicate gene identifiers here, we would not be able to set them as row names.

+
+
+
+
+

4.3 GSVA - Microarray

+

Rather than ranking genes based on some gene-level statistic selected by the user, GSVA fits a model and ranks genes based on their expression level relative to the sample distribution. The pathway-level score calculated is a way of asking how genes within a gene set vary as compared to genes that are outside of that gene set (Malhotra 2018).

+

The idea here is that we will get pathway-level scores for each sample that indicate if genes in a pathway vary concordantly in one direction (over-expressed or under-expressed relative to the overall population) (Hänzelmann et al. 2013a).

+

The output is a gene set by sample matrix of GSVA scores.

+
+

4.3.1 Perform GSVA

+

Let’s perform GSVA using the gsva() function. See ?gsva for more options.

+
gsva_results <- gsva(
+  filtered_mapped_matrix,
+  hallmarks_list,
+  method = "gsva",
+  # Appropriate for our log2-transformed microarray data
+  kcdf = "Gaussian",
+  # Minimum gene set size
+  min.sz = 15,
+  # Maximum gene set size
+  max.sz = 500,
+  # Compute Gaussian-distributed scores
+  mx.diff = TRUE,
+  # Don't print out the progress bar
+  verbose = FALSE
+)
+

Let’s explore what the output of gsva() looks like.

+
head(gsva_results[, 1:10])
+
##                                      GSM917111   GSM917250   GSM917281
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.2784726 -0.29221444 -0.30693127
+## HALLMARK_HYPOXIA                    -0.1907117 -0.13033725 -0.24058274
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2307863 -0.22997233 -0.25341066
+## HALLMARK_MITOTIC_SPINDLE            -0.2134439  0.09773602 -0.13886393
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3061668  0.27041084 -0.06319446
+## HALLMARK_TGF_BETA_SIGNALING         -0.2285640  0.08510027 -0.14161796
+##                                      GSM917062   GSM917288   GSM917230
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.2953894 -0.22966329 -0.20914620
+## HALLMARK_HYPOXIA                    -0.2658532  0.06741065 -0.02691280
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2214914 -0.08702648 -0.03084332
+## HALLMARK_MITOTIC_SPINDLE            -0.2020978 -0.17902098  0.05763884
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.2363895  0.21274606  0.08273239
+## HALLMARK_TGF_BETA_SIGNALING         -0.2284998  0.01208862 -0.13097578
+##                                      GSM917152    GSM917242  GSM917226
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    0.33276903  0.001857506 -0.1329156
+## HALLMARK_HYPOXIA                    0.18446386 -0.118269791 -0.2641157
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    0.05273271  0.104042284 -0.2136088
+## HALLMARK_MITOTIC_SPINDLE            0.14226250 -0.052122165 -0.3753805
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.37981263 -0.037661623 -0.3570903
+## HALLMARK_TGF_BETA_SIGNALING         0.15915374  0.300603909 -0.1973818
+##                                        GSM917290
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.385841741
+## HALLMARK_HYPOXIA                    -0.145480093
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.267519873
+## HALLMARK_MITOTIC_SPINDLE            -0.001471942
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.006265662
+## HALLMARK_TGF_BETA_SIGNALING         -0.130123427
+
+
+
+

4.4 Find differentially expressed pathways

+

If we want to identify most differentially expressed pathways across subgroups, we can use functionality in the limma package to test the GSVA scores. This is one approach for working with GSVA scores. The mx.diff = TRUE argument that we supplied to the gsva() function in the previous section means the GSVA output scores should be normally distributed, which allows downstream analyses that make this assumption about the distribution (Hanzelmann et al. 2020).

+
+

4.4.1 Create the design matrix

+

limma needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. Now we will create a model matrix based on our subgroup variable. We are using a + 0 in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. If you have a control group, you might want that to be the intercept.

+
# Create the design matrix
+des_mat <- model.matrix(~ metadata$subgroup + 0)
+

Let’s take a look at the design matrix we created.

+
# Print out the design matrix
+head(des_mat)
+
##   metadata$subgroupGroup 3 metadata$subgroupGroup 4 metadata$subgroupSHH
+## 1                        0                        1                    0
+## 2                        0                        1                    0
+## 3                        0                        1                    0
+## 4                        1                        0                    0
+## 5                        0                        1                    0
+## 6                        0                        1                    0
+

The design matrix column names are a bit messy, so we will neaten them up by dropping the metadata$subgroup designation they all have and for being able to reference them easily later, we want to drop spaces as well.

+
# Make the column names less messy
+colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "metadata\\$subgroup")
+
+# Do a similar thing but remove spaces in names
+colnames(des_mat) <- stringr::str_remove(colnames(des_mat), " ")
+

Side note: If you are wondering why there are two \ above in "filtered_metadata\\$subgroup", that’s called an escape character. There’s a whole universe of things called regular expressions (regex) that can be super handy for string manipulations.

+
+
+
+

4.5 Perform differential expression on pathway scores

+

Run the linear model on each pathway (each row of gsva_results).

+
# Apply linear model to data
+fit <- limma::lmFit(gsva_results, design = des_mat)
+
+# Apply empirical Bayes to smooth standard errors
+fit <- limma::eBayes(fit)
+

Now that we have our basic model fitting, we will want to make the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. Consulting the limma users guide for how to set up your model based on your hypothesis is a good idea.

+

In this contrasts matrix, we are comparing each subgroup to the average of the other subgroups.
+We’re dividing by two in this expression so that each group is compared to the average of the other two groups (makeContrasts() doesn’t allow you to use functions like mean(); it wants a formula).

+
contrast_matrix <- makeContrasts(
+  "G3vsOther" = Group3 - (Group4 + SHH) / 2,
+  "G4vsOther" = Group4 - (Group3 + SHH) / 2,
+  "SHHvsOther" = SHH - (Group3 + Group4) / 2,
+  levels = des_mat
+)
+

Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulate would look like Group3 = Group3 - Control for each one. We highly recommend consulting the limma users guide for figuring out what your makeContrasts() and model.matrix() setups should look like (Ritchie et al. 2015).

+

Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with eBayes().

+
# Fit the model according to the contrasts matrix
+contrasts_fit <- contrasts.fit(fit, contrast_matrix)
+
+# Re-smooth the Bayes
+contrasts_fit <- eBayes(contrasts_fit)
+

Here’s a nifty article and example about what the empirical Bayes smoothing is for (Robinson).

+

Now let’s create the results table based on the contrasts fitted model.

+

This step will provide the Benjamini-Hochberg multiple testing correction. The topTable() function default is to use Benjamini-Hochberg but this can be changed to a different method using the adjust.method argument (see the ?topTable help page for more about the options).

+
# Apply multiple testing correction and obtain stats
+stats_df <- topTable(contrasts_fit, number = nrow(expression_df)) %>%
+  tibble::rownames_to_column("Gene")
+

Let’s take a peek at our results table.

+
head(stats_df)
+
+ +
+

For each gene, each group’s fold change in expression, compared to the average of the other groups is reported.

+

By default, results are ordered from largest F value to the smallest, which means your most differentially expressed genes across all groups should be toward the top.

+

This means HALLMARK_UNFOLDED_PROTEIN_RESPONSE appears to be the pathway that contains the most significant distribution of scores across subgroups.

+
+
+

4.6 Visualizing Results

+

Let’s make a plot for our most significant pathway, HALLMARK_UNFOLDED_PROTEIN_RESPONSE.

+
+

4.6.1 Sina plot

+

First we need to get our GSVA scores for this pathway into a long data frame, an appropriate format for ggplot2.

+

Let’s look at the current format of gsva_results.

+
head(gsva_results[, 1:10])
+
##                                      GSM917111   GSM917250   GSM917281
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.2784726 -0.29221444 -0.30693127
+## HALLMARK_HYPOXIA                    -0.1907117 -0.13033725 -0.24058274
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2307863 -0.22997233 -0.25341066
+## HALLMARK_MITOTIC_SPINDLE            -0.2134439  0.09773602 -0.13886393
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.3061668  0.27041084 -0.06319446
+## HALLMARK_TGF_BETA_SIGNALING         -0.2285640  0.08510027 -0.14161796
+##                                      GSM917062   GSM917288   GSM917230
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.2953894 -0.22966329 -0.20914620
+## HALLMARK_HYPOXIA                    -0.2658532  0.06741065 -0.02691280
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.2214914 -0.08702648 -0.03084332
+## HALLMARK_MITOTIC_SPINDLE            -0.2020978 -0.17902098  0.05763884
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.2363895  0.21274606  0.08273239
+## HALLMARK_TGF_BETA_SIGNALING         -0.2284998  0.01208862 -0.13097578
+##                                      GSM917152    GSM917242  GSM917226
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    0.33276903  0.001857506 -0.1329156
+## HALLMARK_HYPOXIA                    0.18446386 -0.118269791 -0.2641157
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    0.05273271  0.104042284 -0.2136088
+## HALLMARK_MITOTIC_SPINDLE            0.14226250 -0.052122165 -0.3753805
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.37981263 -0.037661623 -0.3570903
+## HALLMARK_TGF_BETA_SIGNALING         0.15915374  0.300603909 -0.1973818
+##                                        GSM917290
+## HALLMARK_TNFA_SIGNALING_VIA_NFKB    -0.385841741
+## HALLMARK_HYPOXIA                    -0.145480093
+## HALLMARK_CHOLESTEROL_HOMEOSTASIS    -0.267519873
+## HALLMARK_MITOTIC_SPINDLE            -0.001471942
+## HALLMARK_WNT_BETA_CATENIN_SIGNALING -0.006265662
+## HALLMARK_TGF_BETA_SIGNALING         -0.130123427
+

We can see that they are in a wide format with the gsva scores for each sample spread across a row associated with each pathway.

+

Now let’s convert these results into a data frame and into a long format, using the tidyr::pivot_longer() function.

+
annotated_results_df <- gsva_results %>%
+  # Make this into a data frame
+  data.frame() %>%
+  # Gene set names are row names
+  tibble::rownames_to_column("pathway") %>%
+  # Get into long format using the `tidyr::pivot_longer()` function
+  tidyr::pivot_longer(
+    cols = -pathway,
+    names_to = "sample",
+    values_to = "gsva_score"
+  )
+
+# Preview the annotated results object
+head(annotated_results_df)
+
+ +
+

Now let’s filter to include only the scores associated with our most significant pathway, HALLMARK_UNFOLDED_PROTEIN_RESPONSE, and join the relevant group labels from the metadata for plotting.

+
top_pathway_annotated_results_df <- annotated_results_df %>%
+  # Filter for only scores associated with our most significant pathway
+  dplyr::filter(pathway == "HALLMARK_UNFOLDED_PROTEIN_RESPONSE") %>%
+  # Join the column with the group labels that we would like to plot and compare --
+  # Select the variables relevant to your data
+  dplyr::left_join(metadata %>% dplyr::select(
+    refinebio_accession_code,
+    subgroup
+  ),
+  # Tell the join what columns are equivalent and should be used as a key
+  by = c("sample" = "refinebio_accession_code")
+  )
+
+# Preview the filtered annotated results object
+head(top_pathway_annotated_results_df)
+
+ +
+

Now let’s make a sina plot so we can look at the differences between the subgroup groups using our GSVA scores.

+
# Sina plot comparing GSVA scores for `HALLMARK_UNFOLDED_PROTEIN_RESPONSE` the `subgroup` groups in our dataset
+sina_plot <-
+  ggplot(
+    top_pathway_annotated_results_df, # Supply our annotated data frame
+    aes(
+      x = subgroup, # Replace with a grouping variable relevant to your data
+      y = gsva_score, # Column we created to store the GSVA scores in the last chunk
+      color = subgroup # Let's make the groups different colors too
+    )
+  ) +
+  # Add a boxplot that will have summary stats
+  geom_boxplot(outlier.shape = NA) +
+  # Make a sina plot that shows individual values
+  ggforce::geom_sina() +
+  # Rename the y-axis label
+  labs(y = "HALLMARK_UNFOLDED_PROTEIN_RESPONSE_score") +
+  # Adjust the plot background for better visualization
+  theme_bw()
+
+# Display plot
+sina_plot
+

+

Looks like the Group 4 samples have lower GSVA scores for HALLMARK_UNFOLDED_PROTEIN_RESPONSE as compared to the SHH and Group 3 scores.

+

Let’s save this plot to PNG.

+
ggsave(file.path(plots_dir, "GSE37382_gsva_HALLMARK_UNFOLDED_PROTEIN_RESPONSE_sina_plot.png"),
+  plot = sina_plot
+)
+
## Saving 7 x 5 in image
+
+
+
+

4.7 Write results to file

+

Now let’s write all of our GSVA results to file.

+
gsva_results %>%
+  as.data.frame() %>%
+  tibble::rownames_to_column("pathway") %>%
+  readr::write_tsv(file.path(
+    results_dir,
+    "GSE37382_gsva_results.tsv"
+  ))
+
+
+
+

5 Resources for further learning

+ +
+
+

6 Session info

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ───────────────────────────────────────────────────────────────
+##  setting  value                       
+##  version  R version 4.0.2 (2020-06-22)
+##  os       Ubuntu 20.04 LTS            
+##  system   x86_64, linux-gnu           
+##  ui       X11                         
+##  language (EN)                        
+##  collate  en_US.UTF-8                 
+##  ctype    en_US.UTF-8                 
+##  tz       Etc/UTC                     
+##  date     2020-11-30                  
+## 
+## ─ Packages ───────────────────────────────────────────────────────────────────
+##  package              * version  date       lib source        
+##  annotate               1.68.0   2020-10-27 [1] Bioconductor  
+##  AnnotationDbi        * 1.52.0   2020-10-27 [1] Bioconductor  
+##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
+##  backports              1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
+##  Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
+##  BiocGenerics         * 0.36.0   2020-10-27 [1] Bioconductor  
+##  BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
+##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
+##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
+##  bitops                 1.0-6    2013-08-17 [1] RSPM (R 4.0.0)
+##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
+##  cli                    2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
+##  coda                   0.19-4   2020-09-30 [1] RSPM (R 4.0.2)
+##  colorspace             1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
+##  crayon                 1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
+##  DBI                    1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
+##  DelayedArray           0.16.0   2020-10-27 [1] Bioconductor  
+##  digest                 0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
+##  dplyr                  1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
+##  ellipsis               0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
+##  emmeans                1.5.1    2020-09-18 [1] RSPM (R 4.0.2)
+##  estimability           1.3      2018-02-11 [1] RSPM (R 4.0.0)
+##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.0)
+##  fansi                  0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
+##  farver                 2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
+##  fastmatch              1.1-0    2017-01-28 [1] RSPM (R 4.0.0)
+##  fftw                   1.0-6    2020-02-24 [1] RSPM (R 4.0.2)
+##  generics               0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
+##  GenomeInfoDb           1.26.0   2020-10-27 [1] Bioconductor  
+##  GenomeInfoDbData       1.2.4    2020-11-18 [1] Bioconductor  
+##  GenomicRanges          1.42.0   2020-10-27 [1] Bioconductor  
+##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
+##  ggforce                0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
+##  ggplot2              * 3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
+##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  graph                  1.68.0   2020-10-27 [1] Bioconductor  
+##  GSEABase               1.52.0   2020-10-27 [1] Bioconductor  
+##  GSVA                 * 1.38.0   2020-10-27 [1] Bioconductor  
+##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
+##  hms                    0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
+##  htmltools              0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
+##  httr                   1.4.2    2020-07-20 [1] RSPM (R 4.0.2)
+##  IRanges              * 2.24.0   2020-10-27 [1] Bioconductor  
+##  jsonlite               1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
+##  knitr                  1.30     2020-09-22 [1] RSPM (R 4.0.2)
+##  labeling               0.3      2014-08-23 [1] RSPM (R 4.0.0)
+##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
+##  lifecycle              0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
+##  limma                * 3.46.0   2020-10-27 [1] Bioconductor  
+##  magrittr             * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
+##  MASS                   7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+##  Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
+##  MatrixGenerics         1.2.0    2020-10-27 [1] Bioconductor  
+##  matrixStats            0.57.0   2020-09-25 [1] RSPM (R 4.0.2)
+##  memoise                1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
+##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
+##  mvtnorm                1.1-1    2020-06-09 [1] RSPM (R 4.0.0)
+##  nlme                   3.1-148  2020-05-24 [2] CRAN (R 4.0.2)
+##  optparse             * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
+##  org.Hs.eg.db         * 3.12.0   2020-11-18 [1] Bioconductor  
+##  pillar                 1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
+##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
+##  polyclip               1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
+##  ps                     1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
+##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
+##  qusage               * 2.24.0   2020-10-27 [1] Bioconductor  
+##  R.cache                0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
+##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
+##  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
+##  R.utils                2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
+##  R6                     2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
+##  Rcpp                   1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
+##  RCurl                  1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
+##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
+##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
+##  rlang                  0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
+##  rmarkdown              2.4      2020-09-30 [1] RSPM (R 4.0.2)
+##  RSQLite                2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
+##  rstudioapi             0.11     2020-02-07 [1] RSPM (R 4.0.0)
+##  S4Vectors            * 0.28.0   2020-10-27 [1] Bioconductor  
+##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
+##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
+##  stringi                1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
+##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
+##  styler                 1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
+##  SummarizedExperiment   1.20.0   2020-10-27 [1] Bioconductor  
+##  tibble                 3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
+##  tidyr                  1.1.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  tidyselect             1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
+##  tweenr                 1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
+##  vctrs                  0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
+##  withr                  2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
+##  xfun                   0.18     2020-09-29 [1] RSPM (R 4.0.2)
+##  XML                    3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
+##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.0)
+##  XVector                0.30.0   2020-10-27 [1] Bioconductor  
+##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
+##  zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  
+## 
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
+
+
+

References

+
+
+

Broad Institute Team, 2019 Gene set enrichment analysis (gsea) user guide. https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html

+
+
+

Carlson M., 2020 org.Hs.eg.db: Genome wide annotation for human. http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

+
+
+

Dolgalev I., 2020 Msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

+
+
+

Hanzelmann S., R. Castelo, and J. Guinney, 2020 GSVA: The gene set variation analysis package for microarray and rna-seq data. https://www.bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf

+
+
+

Hänzelmann S., R. Castelo, and J. Guinney, 2013a Biases in Illumina transcriptome sequencing caused by random hexamer priming. BMC Bioinformatics 14. https://doi.org/10.1186/1471-2105-14-7

+
+
+

Hänzelmann S., R. Castelo, and J. Guinney, 2013b GSVA. https://github.com/rcastelo/GSVA/blob/master/man/gsva.Rd

+
+
+

Malhotra S., 2018 Decoding gene set variation analysis. https://towardsdatascience.com/decoding-gene-set-variation-analysis-8193a0cfda3

+
+
+

Northcott P., D. Shih, J. Peacock, L. Garzia, and S. Morrissy et al., 2012 Subgroup specific structural variation across 1,000 medulloblastoma genomes. Nature 488. https://doi.org/10.1038/nature11327

+
+
+

Ritchie M. E., B. Phipson, D. Wu, Y. Hu, and C. W. Law et al., 2015 limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007

+
+
+

Robinson D., Understanding empirical Bayes estimation (using baseball statistics). http://varianceexplained.org/r/empirical_bayes_baseball/

+
+
+

Yaari G., C. R. Bolen, J. Thakar, and S. H. Kleinstein, 2013 Quantitative set analysis for gene expression: A method to quantify gene set differential expression including gene-gene correlations. Nucleic Acids Research 41: e170. https://doi.org/10.1093/nar/gkt660

+
+
+
+ + + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/02-microarray/pathway-analysis_microarray_03_qusage_meta_analysis.Rmd b/02-microarray/pathway-analysis_microarray_03_qusage_meta_analysis.Rmd deleted file mode 100644 index a3192ca9..00000000 --- a/02-microarray/pathway-analysis_microarray_03_qusage_meta_analysis.Rmd +++ /dev/null @@ -1,512 +0,0 @@ ---- -title: "Pathway analysis with QuSAGE: Meta-analysis of medulloblastoma" -output: - html_notebook: - toc: TRUE - toc_float: TRUE -author: J. Taroni for ALSF CCDL -date: 2019 ---- - -## Background - -The Quantitative Set Analysis of Gene Expression (QuSAGE) -([](https://doi.org/10.1093/nar/gkt660)) framework -has advantages that we outline in -[`qusage_single_dataset`](./qusage_single_dataset.nb.html), including the -fact that it returns more than just a p-value. -Specifically, QuSAGE quantifies gene set activity with a full probability -density function (PDF). -If we're interested in pathway analysis of multiple datasets, QuSAGE allows -us to perform a _meta-analysis_ by combining distributions from the QuSAGE -results from each dataset. -Meta-analysis with QuSAGE is described in -[ et al. _PLOS Comp Bio._ 2019.](https://doi.org/10.1371/journal.pcbi.1006899) -and implemented in the [`qusage` bioconductor package](https://bioconductor.org/packages/release/bioc/html/qusage.html). -The [`qusage` vignette](https://bioconductor.org/packages/release/bioc/vignettes/qusage/inst/doc/qusage.pdf) -contains a section on meta-analysis. - -## Datasets - -We will use two medulloblastoma datasets— -[Northcott et al.](https://doi.org/10.1038/nature11327) and -[Robinson et al.](https://doi.org/10.1038/nature11213)—to demonstrate how to -perform meta-analysis with `qusage`. -Specifically, we'll identify pathways that are differentially active between -the SHH subgroup vs. the Group 3 and 4 subgroups in both datasets. -These datasets are -[`GSE37382`](https://www.refine.bio/experiments/GSE37382/subgroup-specific-somatic-copy-number-aberrations-in-the-medulloblastoma-genome-mrna) and -[`GSE37418`](https://www.refine.bio/experiments/GSE37418/novel-mutations-target-distinct-subgroups-of-medulloblastoma), respectively. - -## Set up - -### Package installation and loading - -```{r} -if (!("qusage" %in% installed.packages())) { - BiocManager::install("qusage", update = FALSE) -} - -if (!("org.Hs.eg.db" %in% installed.packages())) { - BiocManager::install("org.Hs.eg.db", update = FALSE) -} -``` - -```{r} -`%>%` <- dplyr::`%>%` -``` - -```{r} -library(org.Hs.eg.db) -library(qusage) -``` - -### Directories - -Make directories to hold the plots and results if they do not yet exist. - -```{r} -# Define the file path to the plots directory -plots_dir <- "plots" # Replace with path to desired output plots directory - -# Make a plots directory if it isn't created yet -if (!dir.exists(plots_dir)) { - dir.create(plots_dir) -} - -# Define the file path to the results directory -results_dir <- "results" # Replace with path to desired output results directory - -# Make a results directory if it isn't created yet -if (!dir.exists(results_dir)) { - dir.create(results_dir) -} - -# Define the file path to the data directory -data_dir <- "data" -``` - -### Function - -Function to perform gene identifier conversion -- we'll do this once for -each dataset. -We 'functionalize' it to keep from repeating ourselves. -For an example where we do the conversion without using a custom function, see -[`qusage_single_dataset`](./qusage_single_dataset.nb.html) or -[`qusage_replicate_vignette`](./qusage_replicate_vignette.nb.html). - -```{r} -convert_ensembl_to_entrez_mat <- function(exprs_df) { - # Given a data.frame that contains human gene expression values and a column - # (named ENSEMBL) that contains Ensembl gene IDs, return an expression matrix - # where the rownames are Entrez IDs. In the case of duplicate Entrez - # identifiers, we summarize to the mean value for an Entrez ID. Mapping - # is performed with AnnotationDbi::mapIDs. - # - # Args: - # exprs_df: A data.frame of gene expression data. The first column should be - # named ENSEMBL and contain Ensembl gene IDs. Rows are genes; - # columns are samples. - # Returns: - # An expression matrix where the rownames are Entrez gene IDs - - `%>%` <- dplyr::`%>%` - require(org.Hs.eg.db) - - # error-handling: we need an ENSEMBL column - if (!("ENSEMBL" %in% colnames(exprs_df))) { - stop("'ENSEMBL' column expected in exprs_df") - } - - # using the default behavior for 1:many mappings, where only the first one is - # selected - entrez_mappings <- mapIds(org.Hs.eg.db, - keys = exprs_df$ENSEMBL, - column = "ENTREZID", keytype = "ENSEMBL" - ) - - # if this is not returned in the same order as the keys for some reason, stop - if (!all.equal(names(entrez_mappings), exprs_df$ENSEMBL)) { - stop("Something happened to the gene order!") - } - - # annotation with Entrez IDs - entrez_exprs_df <- exprs_df %>% - # add a new column that contains the Entrez IDs - # this gets added as the last column - dplyr::mutate(ENTREZID = entrez_mappings) %>% - # drop the Ensembl gene IDs - dplyr::select(-ENSEMBL) %>% - # reorder such that the Entrez IDs are in the first column - dplyr::select(ENTREZID, dplyr::everything()) %>% - # drop any genes without an Entrez ID - dplyr::filter(!is.na(ENTREZID)) - - # if there are any duplicate Entrez gene IDs, collapse to the mean value - if (any(duplicated(entrez_exprs_df$ENTREZID))) { - message("Collapsing to mean value...") - entrez_exprs_df <- entrez_exprs_df %>% - dplyr::group_by(ENTREZID) %>% - dplyr::summarise_all(mean) - } - - # expression matrix where the rownames are the gene identifiers - exprs_mat <- entrez_exprs_df %>% - tibble::column_to_rownames("ENTREZID") %>% - as.matrix() -} -``` - -### Gene sets - -`qusage` allows you to read in gene sets that are in the [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). -[MSigDB](http://software.broadinstitute.org/gsea/msigdb) offers gene sets in this format. - -For more information or recommendations about gene sets, [see the **Gene Sets** -section of the module README](https://github.com/AlexsLemonade/refinebio-examples/tree/master/pathway-analysis#choosing-gene-sets). - -We need to download the the MSigDB v6.2 KEGG gene sets that use Entrez gene IDs -and place them at the following relative path: - -``` -gene-sets/c2.cp.kegg.v6.2.entrez.gmt -``` - -We can download the file if it's not available locally yet. - -```{r} -# the kegg gmt file should be located in the spot we mention above -kegg_file <- file.path("gene-sets", "c2.cp.kegg.v6.2.entrez.gmt") -# since we do not track this file in our repository, let's check to make sure -# it exists where we expect it and download it if we don't find it -if (!file.exists(kegg_file)) { - message(paste( - "KEGG GMT file is not found at", kegg_file, - ", downloading now..." - )) - # need gene-sets directory - if (!dir.exists("gene-sets")) { - dir.create("gene-sets") - } - download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/6.2/c2.cp.kegg.v6.2.entrez.gmt", - destfile = kegg_file - ) -} -``` - -## Read in and prep refine.bio data - -### Northcott et al. - -This dataset is too large to be tracked with git without compression. -We need to unzip it here if we have not already. -Note that if your file is not zipped you can skip this chunk. - -```{r} -# if the unzipped folder does not exist -- skip this if you do not have a zipped file -if (!dir.exists(file.path( - data_dir, - "GSE37382" # Replace with the name of your file without the .zip extension -))) { - compressed_file <- file.path( - data_dir, - "GSE37382.zip" # Replace with the name of your zipped file - ) - unzip(compressed_file, exdir = data_dir) -} -``` - -#### Expression data - -```{r} -northcott_dir <- file.path( - data_dir, - "GSE37382" # Replace with name of the folder in which your northcott expression file is stored -) -northcott_expression_file <- file.path( - northcott_dir, - "GSE37382.tsv" # Replace with name of your northcott expression file -) -northcott_exprs_df <- readr::read_tsv(northcott_expression_file, - progress = FALSE -) -# first column is currently named 'Gene' and contains Ensembl gene IDs -colnames(northcott_exprs_df)[1] <- "ENSEMBL" -``` - -Convert to expression matrix that uses Entrez IDs as rownames using a -[custom function](#function). - -```{r} -northcott_mat <- convert_ensembl_to_entrez_mat(northcott_exprs_df) -``` - -We no longer need the `data.frame` that contains the expression values. - -```{r} -rm(northcott_exprs_df) -``` - -#### Metadata - -```{r} -northcott_metadata_file <- file.path( - northcott_dir, - "metadata_GSE37382.tsv" # Replace with name of your northcott metadata file -) -northcott_metadata_df <- readr::read_tsv(northcott_metadata_file) %>% - # drop columns that are all NA - dplyr::select(-which(apply(is.na(.), 2, all))) -``` - -We're going to compare the SHH subgroup to all others, so we need to encode -this information in a new column (`shh_v_other`). - -```{r} -northcott_metadata_df <- northcott_metadata_df %>% - # retain only pertinent columns - dplyr::select( - refinebio_accession_code, refinebio_title, refinebio_age, - refinebio_sex, subgroup - ) %>% - # new column that sets up pathway analysis - dplyr::mutate(shh_v_other = dplyr::case_when( - subgroup != "SHH" ~ "Other", - TRUE ~ "SHH" - )) -``` - -Reorder expression matrix to match the metadata. - -```{r} -northcott_mat <- northcott_mat[, northcott_metadata_df$refinebio_accession_code] -``` - -### Robinson et al. - -#### Expression data - -```{r} -robinson_dir <- file.path( - data_dir, - "GSE37418" # Replace with name of the folder in which your robinson expression file is stored -) -robinson_expression_file <- file.path( - robinson_dir, - "GSE37418.tsv" # Replace with name of your robinson expression file -) -robinson_exprs_df <- readr::read_tsv(robinson_expression_file, - progress = FALSE -) -colnames(robinson_exprs_df)[1] <- "ENSEMBL" -``` - -Convert to expression matrix that uses Entrez IDs as rownames using a -[custom function](#function). - -```{r} -robinson_mat <- convert_ensembl_to_entrez_mat(robinson_exprs_df) -``` - -```{r} -rm(robinson_exprs_df) -``` - -#### Metadata - -```{r} -robinson_metadata_file <- file.path( - robinson_dir, - "metadata_GSE37418.tsv" # Replace with name of your robinson metadata file -) -robinson_metadata_df <- readr::read_tsv(robinson_metadata_file) %>% - # drop columns that are all NA - dplyr::select(-which(apply(is.na(.), 2, all))) -``` - -To make this more comparable to the Northcott et al. dataset, we're going to -remove the WNT subgroup and outliers such that we are comparing the SHH group -to Group 3 and Group 4. - -```{r} -robinson_metadata_df <- robinson_metadata_df %>% - # retain only pertinent columns - dplyr::select( - refinebio_accession_code, refinebio_title, age, `m stage`, - subgroup - ) %>% - # make more comparable to Northcott dataset - dplyr::filter(!(subgroup %in% c("WNT", "SHH OUTLIER", "U"))) %>% - # we'll use SHH vs. Other for our pathway analysis - dplyr::mutate(shh_v_other = dplyr::case_when( - subgroup != "SHH" ~ "Other", - TRUE ~ "SHH" - )) -``` - -Reorder expression data to match metadata. - -```{r} -robinson_mat <- robinson_mat[, robinson_metadata_df$refinebio_accession_code] -``` - -## Pathway analysis - -### Read in KEGG pathways - -```{r} -kegg_pathways <- read.gmt(kegg_file) -``` - -### Northcott et al. - -```{r} -northcott_results <- qusage( - eset = northcott_mat, - labels = northcott_metadata_df$shh_v_other, - contrast = "SHH-Other", - geneSets = kegg_pathways -) -``` - -Save the Northcott et al. results to file. - -```{r} -northcott_results_file <- file.path( - results_dir, - "Northcott_SHH-Other_QSarray.RDS" # Replace with a relevant output name for the first results RDS file -) -readr::write_rds(northcott_results, northcott_results_file) -``` - -### Robinson et al. - -```{r} -robinson_results <- qusage( - eset = robinson_mat, - labels = robinson_metadata_df$shh_v_other, - contrast = "SHH-Other", - geneSets = kegg_pathways -) -``` - -Save the Robinson et al. results to file. - -```{r} -robinson_results_file <- file.path( - results_dir, - "Robinson_SHH-Other_QSarray.RDS" # Replace with a relevant output name for the second results RDS file -) -readr::write_rds(robinson_results, robinson_results_file) -``` - -## Meta-analysis - -### Combine probability density function - -`combinePDFs` is a `qusage` function we can use for meta-analysis. -This function accepts a list of `QSArray` results. - -```{r} -results_list <- list( - Northcott = northcott_results, - Robinson = robinson_results -) -``` - -Because there are more samples in the Northcott et al. dataset, it will be -weighted more highly when combining the distribution. - -```{r} -combined_results <- combinePDFs(results_list) -``` - -Just as in a single dataset, we can extract relevant information from -the combined `QSArray` with built-in `qusage` functions. - -Let's look at the top 20 pathways with `qsTable`. -The row numbers in the output of `qsTable` can serve as input to the -`path.index` argument of other `qusage` functions as we'll see below. - -```{r} -qsTable(combined_results) -``` - -### Plotting - -We can plot the distributions with `plotCombinedPDF`. -First let's plot KEGG Pathways in Cancer, which has elevated expression in -the SHH group as it has a _positive_ fold change. -In a two group comparison in QuSAGE, **pathway activity** is the mean difference -of the log expression of all genes in a pathway. - -```{r} -plotCombinedPDF(combined_results, path.index = 162) -legend("topleft", - legend = c("Northcott", "Robinson", "Meta-analysis"), - lty = 1, col = c("#E41A1C", "#377EB8", "black") -) -``` - -The positive pathway activity curves indicate that the genes in this -pathway have higher expression values in the SHH group for both datasets. -The pathway activity (mean difference) in the Northcott dataset which has -a larger sample size. - -Let's plot KEGG Dorsoventral Axis Formation, which has higher expression in -Group 3 and 4 samples. - -```{r} -plotCombinedPDF(combined_results, path.index = 107) -legend("topleft", - legend = c("Northcott", "Robinson", "Meta-analysis"), - lty = 1, col = c("#E41A1C", "#377EB8", "black") -) -``` - -#### Plot individual gene mean and 95% confidence intervals (CI) - -We know that the directionality of the KEGG Dorsoventral Axis Formation pathway -agrees between datasets. -We can look into what genes are driving the pathway activity with the -`plotCIsGenes` function. -Gene activity, which will be plotted on the y-axis, is difference between the -two groups. -The _pathway_ CI will also be displayed on the plot as a gray band by default. - -```{r} -plotCIsGenes(northcott_results, - path.index = 107, addGrid = FALSE, - cex.xaxis = 1.25 -) -``` - -We can see that, in the Northcott dataset, there are about 5 genes really -driving the negative pathway activity. -Let's take a look at the same pathway in Robinson. - -```{r} -plotCIsGenes(robinson_results, - path.index = 107, addGrid = FALSE, - cex.xaxis = 1.25 -) -``` - -Looks like some of the top genes are the same between datasets: `56907`, `56776` -Because Entrez IDs are not particularly human-readable, we can use the same -annotation package to convert these to gene symbols and gene names. - -```{r} -AnnotationDbi::select(org.Hs.eg.db, - keys = c("56907", "56776"), - keytype = "ENTREZID", columns = c("SYMBOL", "GENENAME") -) -``` - -It looks like these two genes promote the polymerization of actin filaments. - -## Session Info - -```{r} -sessioninfo::session_info() -``` diff --git a/02-microarray/pathway-analysis_microarray_04_qusage_replicate_vignette.Rmd b/02-microarray/pathway-analysis_microarray_04_qusage_replicate_vignette.Rmd deleted file mode 100644 index 397a6b3d..00000000 --- a/02-microarray/pathway-analysis_microarray_04_qusage_replicate_vignette.Rmd +++ /dev/null @@ -1,465 +0,0 @@ ---- -title: "Pathway analysis with QuSAGE: Replicate vignette" -output: - html_notebook: - toc: TRUE - toc_float: TRUE -author: J. Taroni for ALSF CCDL -date: 2019 ---- - -## Background - -Here, we will replicate the [`qusage` package vignette](https://bioconductor.org/packages/release/bioc/vignettes/qusage/inst/doc/qusage.pdf) ( C.). -Specifically, we'll use the same dataset and analysis as the vignette, but the -expression data and sample metadata we will use is processed with refine.bio. -This allows us to explore formatting refine.bio datasets for use with -`qusage` and to compare the results using refine.bio data to the -results in the package vignette. -We will briefly cover how to obtain gene sets for pathway analysis from the -[Molecular Signatures Database (MSigDB)](http://software.broadinstitute.org/gsea/msigdb) -as well. - -## Pathway analysis - -### Set up - -```{r} -# Set seed -set.seed(12345) - -`%>%` <- dplyr::`%>%` -``` - -We need to install `qusage` if we have not already done so. -We'll need `org.Hs.eg.db` as well. - -```{r} -if (!("qusage" %in% installed.packages())) { - BiocManager::install("qusage", update = FALSE) -} - -if (!("org.Hs.eg.db" %in% installed.packages())) { - BiocManager::install("org.Hs.eg.db", update = FALSE) -} -``` - -```{r} -library(org.Hs.eg.db) -library(qusage) -``` - -Make directories to hold the plots and results. - -```{r} -# Define the file path to the plots directory -plots_dir <- "plots" # Replace with path to desired output plots directory - -# Make a plots directory if it isn't created yet -if (!dir.exists(plots_dir)) { - dir.create(plots_dir) -} - -# Define the file path to the results directory -results_dir <- "results" # Replace with path to desired output results directory - -# Make a results directory if it isn't created yet -if (!dir.exists(results_dir)) { - dir.create(results_dir) -} - -# Define the file path to the data directory -data_dir <- "data" -``` - -### Gene sets - -`qusage` allows you to read in gene sets that are in the [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). -[MSigDB](http://software.broadinstitute.org/gsea/msigdb) offers gene sets in this format. - -We need to download the the MSigDB v6.2 KEGG gene sets that use Entrez gene IDs and placed them at the following relative path: - -``` -gene-sets/c2.cp.kegg.v6.2.entrez.gmt -``` - -We can download the file if we don't find it where we expect! - -```{r} -# the kegg gmt file should be located in the spot we mention above -kegg_file <- file.path("gene-sets", "c2.cp.kegg.v6.2.entrez.gmt") -# since we do not track this file in our repository, let's check to make sure -# it exists where we expect it and download it if we don't find it -if (!file.exists(kegg_file)) { - message(paste( - "KEGG GMT file is not found at", kegg_file, - ", downloading now..." - )) - # need gene-sets directory - if (!dir.exists("gene-sets")) { - dir.create("gene-sets") - } - download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/6.2/c2.cp.kegg.v6.2.entrez.gmt", - destfile = kegg_file - ) -} -``` - -### Conversion to Entrez gene IDs - -All refine.bio data uses Ensembl gene IDs. -As noted above, the gene sets we'll be using for our pathway analysis use -Entrez gene identifiers. -As a result, we'll need to convert the expression data identifiers to Entrez -IDs. - -```{r} -# read in gene expression data -expression_file <- file.path( - data_dir, - "GSE30550", # Replace with name of the folder in which your expression file is stored - "GSE30550.tsv" # Replace with name of your expression file -) -exprs_df <- readr::read_tsv(expression_file, progress = FALSE) -# first column is currently named 'Gene' and contains Ensembl gene IDs -colnames(exprs_df)[1] <- "ENSEMBL" -``` - -We're using the default behavior for 1:many mappings, where only the first one -is selected ([docs](https://www.rdocumentation.org/packages/AnnotationDbi/versions/1.30.1/topics/AnnotationDb-objects)). - -```{r} -entrez_mappings <- mapIds(org.Hs.eg.db, - keys = exprs_df$ENSEMBL, - column = "ENTREZID", keytype = "ENSEMBL" -) -``` - -```{r} -all.equal(names(entrez_mappings), exprs_df$ENSEMBL) -``` - -We'll add this to the `data.frame` of expression data and save to file. - -```{r} -exprs_df <- exprs_df %>% - dplyr::mutate(ENTREZID = entrez_mappings) %>% - dplyr::select(ENTREZID, ENSEMBL, dplyr::everything()) -readr::write_tsv(exprs_df, - path = file.path( - data_dir, - "GSE30550", # Replace with name of the folder in which tour expression data is stored - "GSE30550_entrez.tsv" # Replace with name of your expression file containing Entrez gene IDs - ) -) -``` - -### Read in metadata and filter - -```{r} -metadata_file <- file.path( - data_dir, - "GSE30550", # Replace with name of the folder in which your relevant metadata is stored - "metadata_GSE30550.tsv" # Replace with name of the metadata file relevant to your expression file -) -metadata_df <- readr::read_tsv(metadata_file) %>% - # drop columns that are all NA - dplyr::select(-which(apply(is.na(.), 2, all))) -``` - -In the `qusage` vignette, the authors compared samples 0 hours after exposure to -influenza and sample 77 hours after exposure to influenza. -We'll subset to just these samples. - -```{r} -filtered_metadata_df <- metadata_df %>% - dplyr::filter(time_hpi %in% c("Hour 00", "Hour 077")) %>% - # the subject & timepoint info are combined in the 'title' column - dplyr::mutate( - subject = stringr::word(title, 1, sep = ","), - # relabel the timepoints, the lack of spaces will help - # with the contrast call later - timepoint = dplyr::case_when( - time_hpi == "Hour 00" ~ "t0", - time_hpi == "Hour 077" ~ "t1" - ) - ) %>% - dplyr::arrange(time_hpi, subject) %>% - dplyr::select( - refinebio_accession_code, subject, refinebio_title, class_blu, - clinic_pheno, time_hpi, timepoint - ) -``` - -```{r} -time_filtered_exprs <- exprs_df %>% - dplyr::select(-ENSEMBL) %>% - # we're reordering samples here to match the metadata ordering - dplyr::select(ENTREZID, filtered_metadata_df$refinebio_accession_code) -``` - -Are there duplicated identifiers? - -```{r} -anyDuplicated(time_filtered_exprs$ENTREZID) -``` - -Collapse to duplicated gene identifiers to their mean values. - -```{r} -collapsed_exprs <- time_filtered_exprs %>% - dplyr::group_by(ENTREZID) %>% - dplyr::summarise_all(mean) %>% - # drop unannotated gene - dplyr::filter(!is.na(ENTREZID)) -``` - -We need to pass `qusage` a matrix. -Note that we're departing a bit from the advice of the authors by -refraining from filtering the expression matrix. - -```{r} -exprs_mat <- collapsed_exprs %>% - tibble::column_to_rownames("ENTREZID") %>% - as.matrix() -``` - -Let's get rid of some of the objects in the workspace we don't need. - -```{r} -rm(collapsed_exprs, time_filtered_exprs, exprs_df, metadata_df) -``` - -### Read in gene sets - -```{r} -kegg_genesets <- read.gmt(kegg_file) -``` - -### Perform pathway analysis - -```{r} -timepoint_qusage_results <- qusage( - eset = exprs_mat, - labels = filtered_metadata_df$timepoint, - contrast = "t1-t0", - geneSets = kegg_genesets -) -``` - -We can look at the results with the `qsTable` function. -This will give us the top 20 pathways by default. -(This can be changed via the `number` argument to `qsTable`.) - -```{r} -qsTable(timepoint_qusage_results) -``` - -Let's save the results to file. -First, let's save a nicely formatted table. - -```{r} -timepoint_table <- qsTable(timepoint_qusage_results, - # numPathways let's us snag all tested gene sets - number = numPathways(timepoint_qusage_results) -) -readr::write_tsv( - timepoint_table, - file.path(results_dir, "qusage_t1-t0_kegg_results.tsv") -) -``` - -Note that gene sets with negative fold changes are higher in `t0`. - -We're also going to save the `QSarray` output of `qusage` because, as we'll see -shortly, there's lots you can do with qusage results! - -```{r} -readr::write_rds( - timepoint_qusage_results, - file.path(results_dir, "qusage_t1-t0_kegg_QSarray.RDS") -) -``` - -#### Plotting - -We can look at the overall results with the `plotCIs` function. - -```{r} -plotCIs(timepoint_qusage_results, cex.xaxis = 0.25) -``` - -It's a bit difficult to see in this notebook, but we'll save the plot as a PDF. - -```{r} -pdf(file.path(plots_dir, "qusage_t1-t0_plotCIs.pdf"), width = 14, height = 8) -# adjusting the figure margins to better fit pathway names -par(mar = c(8, 4, 1, 2)) -plotCIs(timepoint_qusage_results, cex.xaxis = 0.5) -dev.off() -``` - -Let's look more closely at the T Cell Receptor Signaling Pathway. -We can see above that this pathway has a negative fold change and therefore -should be higher at `t0`. - -```{r} -pathway_index <- - which(names(kegg_genesets) == "KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY") # Replace with exact name of desired pathway -- names of pathways can be seen using the `names(kegg_genesets)` function -``` - -Plot the overall distribution of the pathway (thick black line) and the -distribution of the individual genes (thin lines colored by their standard -deviations). - -```{r} -plotGeneSetDistributions(timepoint_qusage_results, path.index = pathway_index) -``` - -We can see the majority of genes in this pathway don't change much, but a -handful have lower values in `t1`. - -Now, plot the mean and confidence intervals for each gene in the pathway. -This gives us insight into _which_ genes changed the most. - -```{r} -plotCIsGenes(timepoint_qusage_results, - path.index = pathway_index, - cex.xaxis = 0.75 -) -``` - -We can easily check to see if the directionality is as we expect by making -a boxplot of the gene `387` which has a negative value and is on right side of -the graph. - -```{r} -graphics::boxplot(exprs_mat["387", ] ~ filtered_metadata_df$timepoint) -``` - -### Two-way comparison - -Patients in this influenza data set were either _symptomatic_ or _asymptomatic_. -We can check if there are differences in pathway _responses_ to influenza virus -(e.g., between time points) between the two groups. -The ability to do this more complex comparison is a great feature of `qusage`. -Again, we'll mirror the analysis performed in the vignette. - -The symptomatic or asymptomatic information is in the `clinic_pheno` column -of the metadata. - -```{r} -head(filtered_metadata_df$clinic_pheno) -``` - -Generate labels for the two-way comparison. - -```{r} -two_way_labels <- paste(filtered_metadata_df$clinic_pheno, - filtered_metadata_df$timepoint, - sep = "." -) -two_way_labels -``` - -Symptomatic comparison - -```{r} -sx_results <- qusage( - eset = exprs_mat, - labels = two_way_labels, - contrast = "Symptomatic.t1-Symptomatic.t0", - geneSets = kegg_genesets, - # timepoints are paired samples, so we can specify that - # information here - pairVector = filtered_metadata_df$subject -) -``` - -Asymptomatic comparison - -```{r} -asx_results <- qusage( - eset = exprs_mat, - labels = two_way_labels, - contrast = "Asymptomatic.t1-Asymptomatic.t0", - geneSets = kegg_genesets, - pairVector = filtered_metadata_df$subject -) -``` - -Calculate the difference between the two comparisons - -```{r} -sx_vs_asx <- qusage( - eset = exprs_mat, - labels = two_way_labels, - contrast = "(Symptomatic.t1-Symptomatic.t0) - (Asymptomatic.t1-Asymptomatic.t0)", - geneSets = kegg_genesets, - pairVector = filtered_metadata_df$subject -) -``` - -Let's look at the cytosolic DNA sensing pathway like the authors' do in the -vignette and see if we get similar results. - -```{r} -names(kegg_genesets)[125] -``` - -```{r} -plotDensityCurves(asx_results, - path.index = 125, - col = "#D55E00", - main = "CYTOSOLIC DNA SENSING", - xlim = c(-1.0, 2.5) -) -plotDensityCurves(sx_results, path.index = 125, col = "#56B4E9", add = TRUE) -plotDensityCurves(sx_vs_asx, - path.index = 125, col = "#000000", add = TRUE, - lwd = 3 -) -legend("topleft", - legend = c("Asymptomatic", "Symptomatic", "Symptomatic-Asymptomatic"), - lty = 1, col = c("#D55E00", "#56B4E9", "#000000", lwd = c(1, 1, 3)), - cex = 0.5 -) -``` - -This tells us that this pathway is active in the Symptomatic patients, but -not much is happening in the Asymptomatic patients. -This is the result in the vignette, although the magnitude of the values (`x`) -are different. - -```{r} -plotGeneSetDistributions(sx_results, asx_results, path.index = 125) -``` - -A few genes appear to be driving the elevation of the cytosolic DNA sensing -patients in symptomatic patients after exposure (`t1`). -We can take a closer look with `plotCIsGenes`. - -```{r} -plotCIsGenes(sx_results, path.index = 125) -``` - -Let's look at the Asymptomatic results, where we'd expect no difference between -timepoints for most of the genes based on the plots above. - -```{r} -plotCIsGenes(asx_results, path.index = 125) -``` - -We expect the gene `3665` ([_IRF7_](https://www.genecards.org/cgi-bin/carddisp.pl?gene=IRF7&keywords=3665)) -to be elevated in `t1` compared to `t0` but _only_ in Symptomatic patients. -Let's make a boxplot to be sure. - -```{r} -graphics::boxplot(exprs_mat["3665", ] ~ interaction(filtered_metadata_df$timepoint, filtered_metadata_df$clinic_pheno)) -``` - -## Session info - -```{r} -sessioninfo::session_info() -``` diff --git a/02-microarray/pathway-analysis_microarray_05_qusage_single_dataset.Rmd b/02-microarray/pathway-analysis_microarray_05_qusage_single_dataset.Rmd deleted file mode 100644 index 864f1dc3..00000000 --- a/02-microarray/pathway-analysis_microarray_05_qusage_single_dataset.Rmd +++ /dev/null @@ -1,481 +0,0 @@ ---- -title: "Pathway analysis with QuSAGE: Single dataset" -output: - html_notebook: - toc: TRUE - toc_float: TRUE -author: J. Taroni for ALSF CCDL -date: 2019 ---- - -## Background - -In this module, we'll demonstrate how to perform pathway analysis using -Quantitative Set Analysis of Gene Expression (QuSAGE) -([](https://doi.org/10.1093/nar/gkt660)). -QuSAGE, implemented in the [`qusage` bioconductor package](https://bioconductor.org/packages/release/bioc/html/qusage.html), -has some nice features: - -* It takes into account inter-gene correlation (a source of type I error). -* It returns more information than just a p-value. -That's useful for analyses you might want to perform downstream. -* Built-in visualization functionality. - -We recommend taking a look at the original publication (Yaari et al.) and -the R package documentation to learn more. - -## Gene sets - -`qusage` allows you to read in gene sets that are in the [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). -[Curated gene sets from MSigDB](http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C2) -like [KEGG](https://www.genome.jp/kegg/) are popular for pathway analysis, but -MSigDB only distributes human pathway data. -Here, we'll work with a mouse dataset. - -In the [`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook in -this module, we mapped human Entrez IDs to mouse symbols using the -[`hcop` package](https://github.com/stephenturner/hcop). -When there was a 1:many mapping between human Entrez IDs and mouse gene symbols, -we selected the mouse gene symbol with the highest number of resources -supporting the mapping. -This decision might not be suitable for every experiment and may result in -some loss of information. - -## Dataset - -We're using [`GSE75574`](https://www.refine.bio/experiments/GSE75574/gene-expression-in-mouse-tissues-in-response-to-short-term-calorie-restriction) in this notebook. -This dataset assays the gene expression response to short-term calorie -restriction in multiple tissues from multiple mouse strains. -We'll test for pathways that change in response to short-term calorie -restriction. - -## Set up - -Package installation and loading - -```{r} -if (!("qusage" %in% installed.packages())) { - BiocManager::install("qusage", update = FALSE) -} - -if (!("org.Mm.eg.db" %in% installed.packages())) { - BiocManager::install("org.Mm.eg.db", update = FALSE) -} - -if (!("pheatmap" %in% installed.packages())) { - install.packages("pheatmap") -} -``` - -```{r} -`%>%` <- dplyr::`%>%` -``` - -```{r} -library(qusage) -library(org.Mm.eg.db) -library(pheatmap) -``` - -Create directories to hold plots and results if they do not yet exist. - -```{r} -# Define the file path to the plots directory -plots_dir <- "plots" # Replace with path to desired output plots directory - -# Make a plots directory if it isn't created yet -if (!dir.exists(plots_dir)) { - dir.create(plots_dir) -} - -# Define the file path to the results directory -results_dir <- "results" # Replace with path to desired output results directory - -# Make a results directory if it isn't created yet -if (!dir.exists(results_dir)) { - dir.create(results_dir) -} - -# Define the file path to the data directory -data_dir <- "data" -``` - -## Read in refine.bio data - -The gene expression matrix of the dataset we'll be working with is too large -to be tracked with git without compression, so we need to unzip it if we have -not already. - -Note that if your file is not zipped you can skip this chunk. - -```{r} -# if the unzipped folder does not exist -- skip this if you do not have a zipped file -if (!dir.exists(file.path( - data_dir, - "GSE75574" # Replace with the name of your file without the .zip extension -))) { - compressed_file <- file.path( - data_dir, - "GSE75574.zip" # Replace with the name of your zipped file - ) - unzip(compressed_file, exdir = data_dir) -} -``` - -### Expression data - -```{r} -expression_file <- file.path( - data_dir, - "GSE75574", # Replace with name of the folder in which your expression file is stored - "GSE75574.tsv" # Replace with name of your expression file -) -exprs_df <- readr::read_tsv(expression_file, progress = FALSE) -# first column is currently named 'Gene' and contains Ensembl gene IDs -colnames(exprs_df)[1] <- "ENSEMBL" -``` - -Because our gene sets use gene symbols and expression data from refine.bio uses -Ensembl IDs, we need to do a conversion. -We're using the default behavior for 1:many mappings here, where only the first -one is selected -([docs](https://www.rdocumentation.org/packages/AnnotationDbi/versions/1.30.1/topics/AnnotationDb-objects)). - -```{r} -symbol_mappings <- mapIds(org.Mm.eg.db, - keys = exprs_df$ENSEMBL, - column = "SYMBOL", keytype = "ENSEMBL" -) -``` - -```{r} -head(symbol_mappings) -``` - -```{r} -# mapIds returns this in the same order as the keys we passed it to, let's -# demonstrate that -all.equal(names(symbol_mappings), exprs_df$ENSEMBL) -``` - -We'll annotate these expression data with the gene symbols. - -```{r} -symbol_exprs_df <- exprs_df %>% - # add a new column that contains the gene symbols - # this gets added as the last column - dplyr::mutate(SYMBOL = symbol_mappings) %>% - # drop the Ensembl gene IDs - dplyr::select(-ENSEMBL) %>% - # reorder such that the gene symbols are in the first column - dplyr::select(SYMBOL, dplyr::everything()) %>% - # drop any genes without a gene symbol - dplyr::filter(!is.na(SYMBOL)) -``` - -How many gene symbols have duplicate values? - -```{r} -sum(duplicated(symbol_exprs_df$SYMBOL)) -``` - -Collapse to the mean value for that gene symbol. - -```{r} -symbol_exprs_df <- symbol_exprs_df %>% - dplyr::group_by(SYMBOL) %>% - dplyr::summarise_all(mean) -``` - -Write to file. - -```{r} -readr::write_tsv(symbol_exprs_df, file.path( - data_dir, - "GSE75574", # Replace with name of the folder in which you would like to store your output file - "GSE75574_symbols.tsv" # Replace with relevant output name -)) -``` - -### Metadata - -```{r} -metadata_file <- file.path( - data_dir, - "GSE75574", # Replace with name of the folder in which your relevant metadata file is stored - "metadata_GSE75574.tsv" # Replace with name of relevant metadata file -) -metadata_df <- readr::read_tsv(metadata_file) %>% - # drop any metadata columns that are all NAs - dplyr::select(-which(apply(is.na(.), 2, all))) -``` - -Retain only the most pertinent columns. - -```{r} -metadata_df <- metadata_df %>% - dplyr::select( - refinebio_accession_code, refinebio_title, - refinebio_specimen_part, refinebio_sex, strain - ) -``` - -This particular accession ([`GSE75574`](https://www.refine.bio/experiments/GSE75574/gene-expression-in-mouse-tissues-in-response-to-short-term-calorie-restriction)) is a SuperSeries comprised of experiments -in multiple tissues from multiple mouse strains. -We'll use white adipose tissue for all strains in this example. - -```{r} -metadata_df <- metadata_df %>% - dplyr::filter(refinebio_specimen_part == "white adipose") -``` - -The two groups we are interested in for comparison are calorie restricted vs. -controls. -This information is in the title. -We can extract this and put it in its own column called `condition`. - -```{r} -metadata_df <- metadata_df %>% - dplyr::mutate(condition = dplyr::case_when( - grepl("control", refinebio_title) ~ "control", - grepl("calorierestricted", refinebio_title) ~ "calorierestricted" - )) -``` - -## Pathway Analysis - -### Read in KEGG pathways - -First, we need the sets of genes that represent pathways. -Again, these were prepared in the -[`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook (see -[Gene Sets](#gene-sets) above). - -```{r} -kegg_file <- file.path( - "gene-sets", - "c2.cp.kegg.v6.2.entrez_mouse_symbol_hcop.gmt" -) -kegg_pathways <- read.gmt(kegg_file) -``` - -### Prep expression matrix - -```{r} -exprs_mat <- symbol_exprs_df %>% - tibble::column_to_rownames("SYMBOL") %>% - # same order as the metdata - dplyr::select(metadata_df$refinebio_accession_code) %>% - # qusage takes a matrix - as.matrix() -``` - -We want to compare calorie restricted mice to controls. - -```{r} -qusage_results <- qusage( - eset = exprs_mat, - labels = metadata_df$condition, - contrast = "calorierestricted-control", - geneSets = kegg_pathways -) -``` - -Save the `QSArray` output. -We can use this object for visualization or other downstream analyses. - -```{r} -readr::write_rds( - qusage_results, - file.path( - results_dir, - "GSE75574_adipose_calorierest-ctrl_QSarray.RDS" # Replace with file name relevant to output results - ) -) -``` - -### Overall results - -We can get a look at the general trend of the results with the `plotCIs` -function. -This plots the means and 95% confidence intervals of each pathway we tested, -sorted such that the gene sets with the highest mean will be on the left of the -plot. -These are gene sets that are elevated in calorie restricted mice. -Error bars are colored by the directionality and corrected p-value (FDR by -default). -Unfortunately the p-value color scheme is red-green, which does not work well -for people with green or red color-vision deficiency. - -```{r} -plotCIs(qusage_results, cex.xaxis = 0.25, main = "Calorie Restricted - Control") -``` - -Save to plot. - -```{r} -pdf(file.path( - plots_dir, - "GSE75574_adipose_calorierestricted-control_plotCIs.pdf" -), -width = 14, height = 8.5 -) -plotCIs(qusage_results, cex.xaxis = 0.25, main = "Calorie Restricted - Control") -dev.off() -``` - -We can also look at the log fold-change and FDR values for pathways with the -`qsTable` function. -By default, this function shows you the top 20 pathways sorted by FDR. -We can change the `number` argument to `qsTable` to decrease or increase the -number of pathways returned. -We can use the `numPathways` function to get _all_ the pathways we tested. - -```{r} -qsTable(qusage_results, number = numPathways(qusage_results)) -``` - -Write to results. - -```{r} -qsTable(qusage_results, number = numPathways(qusage_results)) %>% - readr::write_tsv(file.path( - results_dir, - "GSE75574_adipose_calorierestricted-control.tsv" # Replace with file name relevant to results output - )) -``` - -`qusage` has functionality that lets us dig into our results a bit more to -get an idea of which genes are contributing to the differences between our two -groups. - -### KEGG ECM Receptor Interaction - -The KEGG ECM Receptor Interaction pathway expression is reduced in response -to calorie restriction. - -We can look at the distribution of genes in this pathway with the -`plotGeneSetDistributions` function. -We tell this function which pathway to plot with the `path.index` argument. - -```{r} -grep( - "ECM_RECEPTOR", # Replace with an exact word or phrase that would filter in your desired pathway(s) as the `grep` function looks for all instances of the pattern given -- names of pathways can be seen using the `names(kegg_pathways)` function - names(kegg_pathways) -) -``` - -```{r} -plotGeneSetDistributions(qusage_results, path.index = 114) -``` - -_Most_ genes are around zero (no activity), but we see some genes with -negative activity values. -We can dig into the _what_ the genes are with the `plotCIsGenes` function. - -```{r} -plotCIsGenes(qusage_results, path.index = 114) -``` - -Let's look at another example with the opposite directionality. - -### KEGG Steroid Biosynthesis - -The KEGG Steroid Biosynthesis pathway expression is increased in calorie -restricted adipose tissue. - -```{r} -plotGeneSetDistributions(qusage_results, path.index = 9) -``` - -This looks about half of the genes have positive activity values. - -```{r} -plotCIsGenes(qusage_results, path.index = 9) -``` - -We can make a heatmap with only genes from this pathway. -The heatmap will have similar information as `plotCIsGenes`, but we'll also -get a sense of how samples relate to one another. - -#### Heatmap - -Subset the expression matrix to only genes in the KEGG Steroid Biosynthesis -pathways. - -```{r} -steroid_genes <- kegg_pathways$KEGG_STEROID_BIOSYNTHESIS -# rows (genes) in the pathway -steroid_mat <- exprs_mat[which(rownames(exprs_mat) %in% steroid_genes), ] -``` - -We'll use annotation bars for the columns (samples). - -```{r} -# need a data.frame that contains the sample metadata that we're interested in -annotation_col_df <- metadata_df %>% - dplyr::select(refinebio_accession_code, condition, strain) %>% - tibble::column_to_rownames("refinebio_accession_code") - -# colors to be used in the annotation bars -- we need to assign one for -# each factor level in the two columns -annot_colors <- list( - condition = c( - control = "#FFFFFF", - calorierestricted = "#CD2626" - ), - # palette from - # https://github.com/clauswilke/colorblindr/blob/1ac3d4d62dad047b68bb66c06cee927a4517d678/R/palettes.R#L7 - strain = c( - `129S1/SvImJ` = "#E69F00", - `B6C3F1/J` = "#56B4E9", - `Balbc/J` = "#009E73", - `C3H/HeJ` = "#F0E442", - `C57BL6/J` = "#0072B2", - `CBA/J` = "#D55E00", - `DBA/2J` = "#CC79A7" - ) -) -``` - -Let's make the heatmap with annotation bars. - -```{r} -hm_plot <- pheatmap::pheatmap(steroid_mat, - # using blue, white, red color scheme - color = colorRampPalette(c( - "#0000FF", "#FFFFFF", - "#FF0000" - ))(25), - clustering_distance_cols = "correlation", - clustering_distance_rows = "correlation", - clustering_method = "average", - # scale the expression values of genes for - # visualization - scale = "row", - fontsize_col = 2, - angle_col = "45", - annotation_col = annotation_col_df, - annotation_colors = annot_colors, - main = "KEGG Steroid Biosynthesis" -) -``` - -Save the plot. - -```{r} -pdf(file.path( - plots_dir, - "GSE75574_steroid_biosynthesis_heatmap.pdf" # Replace with relevant output plot name -), -width = 7, height = 5 -) -print(hm_plot) -dev.off() -``` - -## Session info - -```{r} -sessioninfo::session_info() -``` diff --git a/02-microarray/pathway-analysis_microarray_06_ssgsea.Rmd b/02-microarray/pathway-analysis_microarray_06_ssgsea.Rmd deleted file mode 100644 index 18d4da57..00000000 --- a/02-microarray/pathway-analysis_microarray_06_ssgsea.Rmd +++ /dev/null @@ -1,445 +0,0 @@ ---- -title: "Pathway analysis: ssGSEA" -output: - html_notebook: - toc: TRUE - toc_float: TRUE -author: J. Taroni for ALSF CCDL -date: 2019 ---- - -## Background - -Pathway or gene set analysis methods like Quantitative Set Analysis of Gene -Expression (QuSAGE) -([](https://doi.org/10.1093/nar/gkt660)) or Gene Set -Enrichment Analysis (GSEA) -([ et al. _PNAS_. 2005.](https://doi.org/10.1073/pnas.0506580102)) -require us to specify group labels. -We may want a better idea of what pathways are up- or down-regulated in -_individual samples_ if we, for example, suspect that there are subgroups of -patients during exploratory data analysis. -We can use single-sample GSEA (ssGSEA) -([Barbie et al. _Nature_. 2009.](https://dx.doi.org/10.1038/nature08460)), -which is implemented in the -[`GSVA` bioconductor package](https://bioconductor.org/packages/release/bioc/html/GSVA.html). -Note that `GSVA` contains _multiple_ gene set enrichment methods and has an -excellent [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf). - -## Gene sets - -We will use KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathways in this -analysis. -We acquired the KEGG `v6.2` pathway set that used human Entrez IDs from the -[Molecular Signatures Database (MSigDB)](http://software.broadinstitute.org/gsea/msigdb). - -In the [`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook in -this module, we mapped human Entrez IDs to mouse symbols using the -[`hcop` package](https://github.com/stephenturner/hcop). -When there was a 1:many mapping between human Entrez IDs and mouse gene symbols, -we selected the mouse gene symbol with the highest number of resources -supporting the mapping. -This decision might not be suitable for every experiment and may result in -some loss of information. - -## Dataset - -We're using [`GSE75574`](https://www.refine.bio/experiments/GSE75574/gene-expression-in-mouse-tissues-in-response-to-short-term-calorie-restriction) -in this notebook. -This dataset assays the gene expression response to short-term calorie -restriction in multiple tissues from multiple mouse strains. -This is the same dataset we used in one of our QuSAGE examples -([`qusage_single_dataset`](./qusage_single_dataset.nb.html)), where we only -looked at white adipose tissue. - -## Set up - -Package installation and loading. - -```{r} -if (!("qusage" %in% installed.packages())) { - BiocManager::install("qusage", update = FALSE) -} - -if (!("org.Mm.eg.db" %in% installed.packages())) { - BiocManager::install("org.Mm.eg.db", update = FALSE) -} - -if (!("GSVA" %in% installed.packages())) { - BiocManager::install("GSVA", update = FALSE) -} - -if (!("matrixStats" %in% installed.packages())) { - install.packages("matrixStats") -} -``` - -```{r} -`%>%` <- dplyr::`%>%` -``` - -```{r} -library(org.Mm.eg.db) -library(GSVA) -``` - -Create directories to hold plots and results if they do not yet exist. - -```{r} -# Define the file path to the plots directory -plots_dir <- "plots" # Replace with path to desired output plots directory - -# Make a plots directory if it isn't created yet -if (!dir.exists(plots_dir)) { - dir.create(plots_dir) -} - -# Define the file path to the results directory -results_dir <- "results" # Replace with path to desired output results directory - -# Make a results directory if it isn't created yet -if (!dir.exists(results_dir)) { - dir.create(results_dir) -} - -# Define the file path to the data directory -data_dir <- "data" -``` - -## Read in refine.bio data - -The gene expression matrix of the dataset we'll be working with is too large -to be tracked with git without compression, so we need to unzip it if we have -not already. -This compressed folder contains both the expression matrix and the metadata. - -```{r} -# if the unzipped folder does not exist -if (!dir.exists(file.path( - data_dir, - "GSE75574" # Replace with the name of your file without the .zip extension -))) { - compressed_file <- file.path( - data_dir, - "GSE75574.zip" # Replace with the name of your zipped file - ) - unzip(compressed_file, exdir = data_dir) -} -``` - -### Expression data - -Because we used this in another notebook, it's possible that you already have -a local copy of the prepared expression matrix. -If a local copy exists, we don't want to go through the steps to reprocess it, -so we can check for this first. - -If not, we'll process the data in the following ways: - -* Convert the Ensembl gene IDs that are used in refine.bio data to gene symbols. -* If duplicate gene symbols exist, we'll collapse to the mean value for that -gene symbol. - -```{r} -# check for the existence of the file that would have been generated in the -# other notebook -symbol_file <- file.path( - data_dir, - "GSE75574", # Replace with name of your unzipped folder - "GSE75574_symbols.tsv" # Replace with relevant output name -) -if (file.exists(symbol_file)) { - symbol_exprs_df <- readr::read_tsv(symbol_file, progress = FALSE) -} else { - - # read in the refine.bio expression data - expression_file <- file.path( - data_dir, - "GSE75574", # Replace with name of your unzipped folder - "GSE75574.tsv" # Replace with name of your expression file - ) - exprs_df <- readr::read_tsv(expression_file, progress = FALSE) - # first column is currently named 'Gene' and contains Ensembl gene IDs - colnames(exprs_df)[1] <- "ENSEMBL" - - # using the default behavior for 1:many mappings here -- select the first one - symbol_mappings <- mapIds(org.Mm.eg.db, - keys = exprs_df$ENSEMBL, - column = "SYMBOL", keytype = "ENSEMBL" - ) - - # mapIds should give us back IDs in the same order as the keys, but we can - # check to be sure - if (!(all.equal(names(symbol_mappings), exprs_df$ENSEMBL))) { - stop("Gene order is not as expected!") - } - - symbol_exprs_df <- exprs_df %>% - # add a new column that contains the gene symbols - # this gets added as the last column - dplyr::mutate(SYMBOL = symbol_mappings) %>% - # drop the Ensembl gene IDs - dplyr::select(-ENSEMBL) %>% - # reorder such that the gene symbols are in the first column - dplyr::select(SYMBOL, dplyr::everything()) %>% - # drop any genes without a gene symbol - dplyr::filter(!is.na(SYMBOL)) - - # if there are any duplicate gene symbols, collapse to the mean value - if (any(duplicated(symbol_exprs_df$SYMBOL))) { - symbol_exprs_df <- symbol_exprs_df %>% - dplyr::group_by(SYMBOL) %>% - dplyr::summarise_all(mean) - } - - # write to file - readr::write_tsv(symbol_exprs_df, symbol_file) -} -``` - -### Metadata - -```{r} -metadata_file <- file.path( - data_dir, - "GSE75574", # Replace with name of your unzipped folder - "metadata_GSE75574.tsv" # Replace with name of relevant metadata file -) -metadata_df <- readr::read_tsv(metadata_file) %>% - # drop any metadata columns that are all NAs - dplyr::select(-which(apply(is.na(.), 2, all))) -``` - -Retain only the most pertinent columns. - -```{r} -metadata_df <- metadata_df %>% - dplyr::select( - refinebio_accession_code, refinebio_title, - refinebio_specimen_part, refinebio_sex, strain - ) -``` - -We're interested in the are calorie-restricted vs. controls information, which -is in the sample title. -We can extract this and put it in its own column called `condition` as is good -practice. - -```{r} -metadata_df <- metadata_df %>% - dplyr::mutate(condition = dplyr::case_when( - grepl("control", refinebio_title) ~ "control", - grepl("calorierestricted", refinebio_title) ~ "calorie-restricted" - )) -``` - -## Pathway Analysis - -### Read in KEGG pathways - -First, we need the sets of genes that represent pathways. -Again, these were prepared in the -[`kegg_ortholog_mapping`](./kegg_ortholog_mapping.nb.html) notebook (see -[Gene Sets](#gene-sets) above). - -```{r} -kegg_file <- file.path( - "gene-sets", - "c2.cp.kegg.v6.2.entrez_mouse_symbol_hcop.gmt" -) -kegg_pathways <- qusage::read.gmt(kegg_file) -``` - -### Prep expression matrix - -```{r} -exprs_mat <- symbol_exprs_df %>% - tibble::column_to_rownames("SYMBOL") %>% - # same order as the metdata -- this may be helpful in downstream analyses - dplyr::select(metadata_df$refinebio_accession_code) %>% - # qusage takes a matrix - as.matrix() -``` - -### ssGSEA - -We can specify that we'd like to perform ssGSEA with the `method` argument to -`gsva`. - -```{r} -ssgsea_results <- gsva( - expr = exprs_mat, - gset.idx.list = kegg_pathways, - method = "ssgsea", - verbose = FALSE -) -``` - -This returns a matrix of ssGSEA enrichment scores where the columns are samples -and the rows are the input gene sets. - -```{r} -str(ssgsea_results) -``` - -#### Identifying "interesting" features - -One way to figure out what pathways may be interesting is to find which pathways -have the highest _variance_ in their ssGSEA scores. -We can calculate the row variances with the `matrixStats` package. - -```{r} -score_row_var <- matrixStats::rowVars(ssgsea_results) -# for convenience, let's use the pathway names with this vector -names(score_row_var) <- rownames(ssgsea_results) -# look at the 'top' pathways by this metric -head(sort(score_row_var, decreasing = TRUE)) -``` - -Note that many of these pathways seem to have something to do with the immune -system and may very well share a large number of genes. -Let's look at an example pair of these pathways. - -```{r} -intersect( - kegg_pathways$KEGG_ALLOGRAFT_REJECTION, - kegg_pathways$KEGG_GRAFT_VERSUS_HOST_DISEASE -) -``` - -These ssGSEA scores should not be treated as if they are independent. - -#### Plotting features - -What do the patterns across strains and tissues look like? -Let's look at the KEGG Complement and Coagulation Cascades pathway as an -example. - -We'll be using `ggplot2` for plotting, so we'll want our ssGSEA values as a -`data.frame`. - -```{r} -comp_coag_df <- data.frame( - sample_accession = colnames(ssgsea_results), - ssgsea_score = ssgsea_results["KEGG_COMPLEMENT_AND_COAGULATION_CASCADES", ] # Replace with desired pathway -- note that this is distinct from `grep` in that you'd have to write the name exactly as it is in the matrix -) -``` - -We need to add in the strain, tissue, and condition metadata to make an -informative plot. - -```{r} -comp_coag_df <- metadata_df %>% - dplyr::inner_join(comp_coag_df, - by = c("refinebio_accession_code" = "sample_accession") - ) -``` - -```{r fig.height=11, fig.width=8.5} -comp_coag_df %>% - ggplot(aes( - x = condition, - y = ssgsea_score - )) + - geom_boxplot() + - facet_grid(strain ~ refinebio_specimen_part) + - labs(y = "KEGG Complement and Coagulation Cascades") + - theme_bw() + - # x-axis text at a 45 degree angle to increase readability - theme( - axis.text.x = element_text(angle = 45, hjust = 1), - text = element_text(size = 15) - ) -``` - -```{r} -# saving to file -ggsave(file.path( - plots_dir, - "GSE75574_ssgsea_comp_coag_facet.pdf" # Replace with relevant output plot name -), -plot = last_plot() + - theme(text = element_text(size = 10)) -) -``` - -Looks like there may be some differences between tissues in this pathway, with -some strains showing more pronounced differences in white adipose tissue in the -calorie-restricted vs. control comparison. -We should always follow up with more formal analysis (e.g., QuSAGE), but -this series of steps is a good way to explore one's data. - -#### Aside: Gene set size influences ssGSEA score values - -ssGSEA score values are not necessarily comparable between gene sets of -different sizes (number of genes). -If you are comparing ssGSEA scores between samples _within a gene set_, this is -not a concern. - -To demonstrate the effect of gene set size on scores, we'll perform a short -experiment with random gene sets. - -```{r} -# use all the gene symbols in the dataset as the pool of possible genes -all_genes <- symbol_exprs_df$SYMBOL - -# set a seed for reproducibility -set.seed(123) - -# GSVA::gsva takes gene sets as a list, so that's how we'll store the random -# gene sets -random_gene_sets <- list() - -# testing 5 different gene set sizes -for (pathway_size in c(25, 50, 100, 250, 500)) { - # generate 10 random sets of pathway_size - for (path_iter in 1:10) { - random_pathway_name <- paste("size", pathway_size, path_iter, sep = "_") - current_gene_set <- base::sample(x = all_genes, size = pathway_size) - random_gene_sets[[random_pathway_name]] <- current_gene_set - } -} - -# calculate ssGSEA scores for the random gene sets -random_ssgsea_results <- gsva( - expr = exprs_mat, - gset.idx.list = random_gene_sets, - method = "ssgsea", - verbose = FALSE -) - -# plot the results with ggplot2 -# first we need to get this in a form that is amenable for plotting, as gsva -# returns a matrix where columns are samples and rows are gene sets -random_long_df <- as.data.frame(random_ssgsea_results) %>% - # gene set names are rownames - tibble::rownames_to_column("gene_set") %>% - # long format - reshape2::melt(variable.name = "sample", value.name = "ssgsea_score") %>% - # extract the gene set size from the gene set name - dplyr::mutate(gene_set_size = stringr::word(gene_set, 2, sep = "_")) %>% - # we want to plot smallest no. genes -> largest no. genes - dplyr::mutate(gene_set_size = factor(gene_set_size, - levels = c(25, 50, 100, 250, 500) - )) - -# violin plot comparing no. genes / gene set size -random_long_df %>% - ggplot(aes(x = gene_set_size, y = ssgsea_score)) + - geom_violin() + - coord_flip() + - labs( - title = "Random gene set scores", - x = "gene set size", - y = "ssGSEA score" - ) + - theme_bw() -``` - -Note how a smaller gene set results in a larger range of scores. - -## Session info - -```{r} -sessioninfo::session_info() -``` diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd new file mode 100644 index 00000000..53e0198d --- /dev/null +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -0,0 +1,746 @@ +--- +title: "WGCNA - RNA-seq" +author: "CCDL for ALSF" +date: "November 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules [@Langfelder2008]. +WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. +This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNA-seq data. + +As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) [@Langfelder2007]. +This eigengene x sample data can, in many instances, be used as you would the original gene expression values. +In this example, we use eigengene x sample data to identify differentially expressed modules between our treatment and control group + +This method does require some computing power, but can still be run locally (on your own computer) for most refine.bio datasets. +As with many clustering and network methods, there are some parameters that may need tweaking. + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential_expression_rnaseq_01_rnaseq.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. + +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" # Can replace with path to desired output plots directory + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" # Can replace with path to desired output results directory + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## Obtain the dataset from refine.bio + +For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). + +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP140558). + +Click the "Download Now" button on the right side of this screen. + + + +Fill out the pop up window with your email and our Terms and Conditions: + + + +We are going to use non-quantile normalized data for this analysis. +To get this data, you will need to check the box that says "Skip quantile normalization for RNA-seq samples". +Note that this option will only be available for RNA-seq datasets. + + + +It may take a few minutes for the dataset to process. +You will get an email when it is ready. + +## About the dataset we are using for this example + +For this example analysis, we will use this [acute viral bronchiolitis dataset](https://www.refine.bio/experiments/SRP140558). +The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. +Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated "AV") and their recovery, their post-convalescence visit (abbreviated "CV"). + +## Place the dataset in your new `data/` folder + +refine.bio will send you a download button in the email when it is ready. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. +Double clicking should unzip this for you and create a folder of the same name. + + + +For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). + +The `` folder has the data and metadata TSV files you will need for this example analysis. +Experiment accession ids usually look something like `GSE1235` or `SRP12345`. + +Copy and paste the `SRP140558` folder into your newly created `data/` folder. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called "data" which contains: + - The `SRP140558` folder which contains: + - The gene expression + - The metadata TSV +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): + + + +In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. +These chunks will declare your file paths and double check that your files are in the right place. + +First we will declare our file paths to our data and metadata files, which should be in our data directory. +This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started. + +```{r} +# Define the file path to the data directory +data_dir <- file.path("data", "SRP140558") # Replace with accession number which will be the name of the folder the files will be in + +# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` +data_file <- file.path(data_dir, "SRP140558.tsv") # Replace with file path to your dataset + +# Declare the file path to the metadata file using the data directory saved as `data_dir` +metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv") # Replace with file path to your metadata +``` + +Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. + +```{r} +# Check if the gene expression matrix file is at the file path stored in `data_file` +file.exists(data_file) + +# Check if the metadata file is at the file path stored in `metadata_file` +file.exists(metadata_file) +``` + +If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. + +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. + +### Sample size + +Keep in mind when using a different refine.bio dataset with this example, that WGCNA requires at least 15 samples to produce a meaningful result [according to its authors](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html). +So you will need to make sure the dataset you use is sufficiently large. +However, note that very large datasets will be difficult to run locally (on a personal laptop) due to the required computing power. +While you can adjust some parameters to make this more doable on a laptop, it may decrease the reliability of your result if taken to an extreme (more on this parameter, called `maxBlockSize`, in the [`Run WGCNA!` section](#run-wgcna)). + +### Microarray vs RNA-seq + +WGCNA can be used with both RNA-seq and microarray datasets so long as they are well normalized and filtered. +In this example we use RNA-seq and [normalize and transform the data with DESeq2's `vst()`](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which not only is a method and package we recommend in general, but is also the [authors' specific recommendations for using WGCNA with RNA-seq data](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html#:~:text=Can%20WGCNA%20be%20used%20to,Yes.&text=Whether%20one%20uses%20RPKM%2C%20FPKM,were%20processed%20the%20same%20way.). + +If you end up wanting to run WGCNA with a microarray dataset, the normalization done by refine.bio _should_ be sufficient, but you will likely want to [apply a minimum expression filter](#define-a-minimum-counts-cutoff) as we do in this example. +If you have troubles finding a `power` parameter that yields a sufficient R^2 even after applying a stringent cutoff, you may want to look into using a different dataset. + +*** + +   + +# Identifying co-expression gene modules with WGCNA - RNA-seq + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. + +We will be using `DESeq2` to normalize and transform our RNA-seq data before running WGCNA, so we will need to install that [@Love2014]. + +Of course, we will need the `WGCNA` package [@Langfelder2008]. +But `WGCNA` also requires a package called `impute` that it sometimes has trouble installing so we recommend installing that first [@Hastie2020]. + +For plotting purposes will be creating a `sina` plot and heatmaps which we will need a `ggplot2` companion package for, called `ggforce` as well as the `ComplexHeatmap` package [@Gu2020]. + +```{r} +if (!("DESeq2" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("DESeq2", update = FALSE) +} + +if (!("impute" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("impute") +} + +if (!("WGCNA" %in% installed.packages())) { + # Install this package if it isn't installed yet + install.packages("WGCNA") +} + +if (!("ggforce" %in% installed.packages())) { + # Install this package if it isn't installed yet + install.packages("ggforce") +} + +if (!("ComplexHeatmap" %in% installed.packages())) { + # Install this package if it isn't installed yet + install.packages("ComplexHeatmap") +} +``` + +Attach some of the packages we need for this analysis. + +```{r} +# Attach the DESeq2 library +library(DESeq2) + +# We will need this so we can use the pipe: %>% +library(magrittr) + +# We'll need this for finding gene modules +library(WGCNA) + +# We'll be making some plots +library(ggplot2) +``` + +## Import and set up data + +Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. +This chunk of code will read the both TSV files and add them as data frames to your environment. + +We stored our file paths as objects named `metadata_file` and `data_file` in [this previous step](#check-out-our-file-structure). + +```{r} +# Read in metadata TSV file +metadata <- readr::read_tsv(metadata_file) + +# Read in data TSV file +df <- readr::read_tsv(data_file) %>% + # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later + tibble::column_to_rownames("Gene") +``` + +Let's ensure that the metadata and data are in the same sample order. + +```{r} +# Make the data in the order of the metadata +df <- df %>% + dplyr::select(metadata$refinebio_accession_code) + +# Check if this is in the same order +all.equal(colnames(df), metadata$refinebio_accession_code) +``` + +### Prepare data for `DESeq2` + +There are two things we need to do to prep our expression data for DESeq2. + +First, we need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later. + +Then, we need to filter out the genes that have not been expressed or that have low expression counts. +This is recommended by [WGCNA docs for RNA-seq data](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html#:~:text=Can%20WGCNA%20be%20used%20to,Yes.&text=Whether%20one%20uses%20RPKM%2C%20FPKM,were%20processed%20the%20same%20way.). +Removing low count genes can also help improve your WGCNA results. +We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples. + +```{r} +# The next DESeq2 functions need the values to be converted to integers +df <- round(df) %>% + # The next steps require a data frame and round() returns a matrix + as.data.frame() %>% + # Only keep rows that have total counts above the cutoff + dplyr::filter(rowSums(.) >= 50) +``` + +Another thing we need to do is set up our main experimental group variable. +Unfortunately the metadata for this dataset are not set up into separate, neat columns, but we can accomplish that ourselves. + +For this study, PBMCs were collected at two time points: during the patients' first, acute bronchiolitis visit (abbreviated "AV") and their recovery visit, (called post-convalescence and abbreviated "CV"). + +For handier use of this information, we can create a new variable, `time_point`, that states this info more clearly. +This new `time_point` variable will have two labels: `acute illness` and `recovering` based on the `AV` or `CV` coding located in the `refinebio_title` string variable. + +```{r} +metadata <- metadata %>% + dplyr::mutate( + time_point = dplyr::case_when( + # Create our new variable based on refinebio_title containing AV/CV + stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness", + stringr::str_detect(refinebio_title, "_CV_") ~ "recovering" + ), + # It's easier for future items if this is already set up as a factor + time_point = as.factor(time_point) + ) +``` + +Let's double check that our factor set up is right. +We want `acute illness` to be the first level since it was the first time point collected. + +```{r} +levels(metadata$time_point) +``` + +Great! We're all set. + +## Create a DESeqDataset + +We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object. +We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014]. +In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis. + +```{r} +# Create a `DESeqDataSet` object +dds <- DESeqDataSetFromMatrix( + countData = df, # Our prepped data frame with counts + colData = metadata, # Data frame with annotation for our samples + design = ~1 # Here we are not specifying a model +) +``` + +## Perform DESeq2 normalization and transformation + +We often suggest normalizing and transforming your data for various applications and in this instance WGCNA's authors [suggest using variance stabilizing transformation before running WGCNA](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html#:~:text=Can%20WGCNA%20be%20used%20to,Yes.&text=Whether%20one%20uses%20RPKM%2C%20FPKM,were%20processed%20the%20same%20way.). +We are going to use the `vst()` function from the `DESeq2` package to normalize and transform the data. +For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods). + +```{r} +# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` +# function from the `DESEq2` R package +dds_norm <- vst(dds) +``` + +At this point, if your data set has any outlier samples, you should look into removing them as they can affect your WGCNA results. + +WGCNA's tutorial has [an example of exploring your data for outliers you can reference](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-01-dataInput.pdf). + +For this example data set, we will skip this step (there are no obvious outliers) and proceed. + +## Format normalized data for WGCNA + +Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA. + +```{r} +# Retrieve the normalized data from the `DESeqDataSet` +normalized_counts <- assay(dds_norm) %>% + t() # Transpose this data +``` + +## Determine parameters for WGCNA + +To identify which genes are in the same modules, WGCNA first creates a weighted network to define which genes are near each other. +The measure of "adjacency" it uses is based on the correlation matrix, but requires the definition of a threshold value, which in turn depends on a "power" parameter that defines the exponent used when transforming the correlation values. +The choice of power parameter will affect the number of modules identified, and the WGCNA modules provides the `pickSoftThreshold()` function to help identify good choices for this parameter. + +```{r} +sft <- pickSoftThreshold(normalized_counts, + dataIsExpr = TRUE, + corFnc = cor, + networkType = "signed" +) +``` + +This `sft` object has a lot of information, we will want to plot some of it to figure out what our `power` soft-threshold should be. +We have to first calculate a measure of the model fit, the signed $R^2$, and make that a new variable. + +```{r} +sft_df <- data.frame(sft$fitIndices) %>% + dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq) +``` + +Now, let's plot the model fitting by the `power` soft threshold so we can decide on a soft-threshold for power. + +```{r} +ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) + + # Plot the points + geom_point() + + # We'll put the Power labels slightly above the data points + geom_text(nudge_y = 0.1) + + # We will plot what WGCNA recommends as an R^2 cutoff + geom_hline(yintercept = 0.80, col = "red") + + # Just in case our values are low, we want to make sure we can still see the 0.80 level + ylim(c(min(sft_df$model_fit), 1.05)) + + # We can add more sensible labels for our axis + xlab("Soft Threshold (power)") + + ylab("Scale Free Topology Model Fit, signed R^2") + + ggtitle("Scale independence") + + # This adds some nicer aesthetics to our plot + theme_classic() +``` + +Using this plot we can decide on a power parameter. +WGCNA's authors recommend using a `power` that has an signed $R^2$ above `0.80`, otherwise they warn your results may be too noisy to be meaningful. + +If you have multiple power values with signed $R^2$ above `0.80`, then picking the one at an inflection point, in other words where the $R^2$ values seem to have reached their saturation [@Zhang2005]. +You want to a `power` that gives you a big enough $R^2$ but is not excessively large. + +So using the plot above, going with a power soft-threshold of `7`! + +If you find you have all very low $R^2$ values this may be because there are too many genes with low expression values that are cluttering up the calculations. +You can try returning to [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped). + +## Run WGCNA! + +We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `7` for the `power` argument like we determined above. + +This next step takes some time to run. +The `blockwise` part of the `blockwiseModules()` function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources. + +Here we are using the default `maxBlockSize`, 5000 but, you may want to adjust the `maxBlockSize` argument depending on your computer's memory. +The authors of WGCNA recommend running [the largest block your computer can handle](https://peterlangfelder.com/2018/11/25/blockwise-network-analysis-of-large-data/) and they provide some approximations as to GB of memory of a laptop and what `maxBlockSize` it should be able to handle: + +> • If the reader has access to a large workstation with more than 4 GB of memory, the parameter maxBlockSize +can be increased. A 16GB workstation should handle up to 20000 probes; a 32GB workstation should handle +perhaps 30000. A 4GB standard desktop or a laptop may handle up to 8000-10000 probes, depending on +operating system and other running programs. + +[@Langfelder2016] + +```{r} +bwnet <- blockwiseModules(normalized_counts, + maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in + TOMType = "signed", # topological overlap matrix + power = 7, # soft threshold for network construction + numericLabels = TRUE, # Let's use numbers instead of colors for module labels + randomSeed = 1234, # there's some randomness associated with this calculation + # so we should set a seed +) +``` + +The `TOMtype` argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. +You can safely assume for most situations a `signed` network represents what you want -- we want WGCNA to pay attention to directionality. +However if you suspect you may benefit from an `unsigned` network, where positive/negative is ignored see [this article](https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/) to help you figure that out [@Langfelder2018]. + +There are a lot of other settings you can tweak -- look at `?blockwiseModules` help page as well as the [WGCNA tutorial](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/) [@Langfelder2016]. + +## Write main WGCNA results object to file + +We will save our whole results object to an RDS file in case we want to return to our original WGCNA results. + +```{r} +readr::write_rds(bwnet, + file = file.path("results", "SRP140558_wgcna_results.RDS") +) +``` + +## Explore our WGCNA results + +The `bwnet` object has many parts, storing a lot of information. +We can pull out the parts we are most interested in and may want to use use for plotting. + +In `bwnet` we have a data frame of eigengene module data for each sample in the `MEs` slot. +These represent the collapsed, combined, and normalized expression of the genes that make up each module. + +```{r} +module_eigengenes <- bwnet$MEs + +# Print out a preview +head(module_eigengenes) +``` + +## Which modules have biggest differences across treatment groups? + +We can also see if our eigengenes relate to our metadata labels. +First we double check that our samples are still in order. + +```{r} +all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes)) +``` + +```{r} +# Create the design matrix from the `time_point` variable +des_mat <- model.matrix(~ metadata$time_point) +``` + +Run linear model on each module. +Limma wants our tests to be per row, so we also need to transpose so the eigengenes are rows + +```{r} +# lmFit() needs a transposed version of the matrix +fit <- limma::lmFit(t(module_eigengenes), design = des_mat) + +# Apply empirical Bayes to smooth standard errors +fit <- limma::eBayes(fit) +``` + +Apply multiple testing correction and obtain stats in a data frame. + +```{r} +# Apply multiple testing correction and obtain stats +stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>% + tibble::rownames_to_column("module") +``` + +Let's take a look at the results. +They are sorted with the most significant results at the top. + +```{r} +head(stats_df) +``` + +Module 19 seems to be the most differentially expressed across `time_point` groups. +Now we can do some investigation into this module. + +## Let's make plot of module 19 + +As a sanity check, let's use `ggplot` to see what module 18's eigengene looks like between treatment groups. + +First we need to set up the module eigengene for this module with the sample metadata labels we need. + +```{r} +module_19_df <- module_eigengenes %>% + tibble::rownames_to_column("accession_code") %>% + # Here we are performing an inner join with a subset of metadata + dplyr::inner_join(metadata %>% + dplyr::select(refinebio_accession_code, time_point), + by = c("accession_code" = "refinebio_accession_code") + ) +``` + +Now we are ready for plotting. + +```{r} +ggplot( + module_19_df, + aes( + x = time_point, + y = ME19, + color = time_point + ) +) + + # a boxplot with outlier points hidden (they will be in the sina plot) + geom_boxplot(width = 0.2, outlier.shape = NA) + + # A sina plot to show all of the individual data points + ggforce::geom_sina(maxwidth = 0.3) + + theme_classic() +``` + +This makes sense! +Looks like module 19 has elevated expression during the acute illness but not when recovering. + +## What genes are a part of module 19? + +If you want to know which of your genes make up a modules, you can look at the `$colors` slot. +This is a named list which associates the genes with the module they are a part of. +We can turn this into a data frame for handy use. + +```{r} +gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>% + # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere + dplyr::mutate(module = paste0("ME", module)) +``` + +Now we can find what genes are a part of module 19. + +```{r} +gene_module_key %>% + dplyr::filter(module == "ME19") +``` + +Let's save this gene to module key to a TSV file for future use. + +```{r} +readr::write_tsv(gene_module_key, + file = file.path("results", "SRP140558_wgcna_gene_to_module.tsv") +) +``` + +## Make a custom heatmap function + +We will make a heatmap that summarizes our differentially expressed module. +Because we will make a couple of these, it makes sense to make a custom function for making this heatmap. + +```{r} +make_module_heatmap <- function(module_name, + expression_mat = normalized_counts, + metadata_df = metadata, + gene_module_key_df = gene_module_key, + module_eigengenes_df = module_eigengenes) { + # Create a summary heatmap of a given module. + # + # Args: + # module_name: a character indicating what module should be plotted, e.g. "ME19" + # expression_mat: The full gene expression matrix. Default is `normalized_counts`. + # metadata_df: a data frame with refinebio_accession_code and time_point + # as columns. Default is `metadata`. + # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`. + # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`. + # + # Returns: + # A heatmap of expression matrix for a module's genes, with a barplot of the + # eigengene expression for that module. + + # Set up the module eigengene with its refinebio_accession_code + module_eigengene <- module_eigengenes_df %>% + dplyr::select(all_of(module_name)) %>% + tibble::rownames_to_column("refinebio_accession_code") + + # Set up column annotation from metadata + col_annot_df <- metadata_df %>% + # Only select the treatment and sample ID columns + dplyr::select(refinebio_accession_code, time_point, refinebio_subject) %>% + # Add on the eigengene expression by joining with sample IDs + dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>% + # Arrange by patient and time point + dplyr::arrange(time_point, refinebio_subject) %>% + # Store sample + tibble::column_to_rownames("refinebio_accession_code") + + # Create the ComplexHeatmap column annotation object + col_annot <- ComplexHeatmap::HeatmapAnnotation( + # Supply treatment labels + time_point = col_annot_df$time_point, + # Add annotation barplot + module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)), + # Pick colors for each experimental group in time_point + col = list(time_point = c("recovering" = "#f1a340", "acute illness" = "#998ec3")) + ) + + # Get a vector of the Ensembl gene IDs that correspond to this module + module_genes <- gene_module_key_df %>% + dplyr::filter(module == module_name) %>% + dplyr::pull(gene) + + # Set up the gene expression data frame + mod_mat <- expression_mat %>% + t() %>% + as.data.frame() %>% + # Only keep genes from this module + dplyr::filter(rownames(.) %in% module_genes) %>% + # Order the samples to match col_annot_df + dplyr::select(rownames(col_annot_df)) %>% + # Data needs to be a matrix + as.matrix() + + # Normalize the gene expression values + mod_mat <- mod_mat %>% + # Scale can work on matrices, but it does it by column so we will need to + # transpose first + t() %>% + scale() %>% + # And now we need to transpose back + t() + + # Create a color function based on standardized scale + color_func <- circlize::colorRamp2( + c(-2, 0, 2), + c("#67a9cf", "#f7f7f7", "#ef8a62") + ) + + # Plot on a heatmap + heatmap <- ComplexHeatmap::Heatmap(mod_mat, + name = module_name, + # Supply color function + col = color_func, + # Supply column annotation + bottom_annotation = col_annot, + # We don't want to cluster samples + cluster_columns = FALSE, + # We don't need to show sample or gene labels + show_row_names = FALSE, + show_column_names = FALSE + ) + + # Return heatmap + return(heatmap) +} +``` + +## Make module heatmaps + +Let's try out the custom heatmap function with module 19 (our most differentially expressed module). + +```{r} +mod_19_heatmap <- make_module_heatmap(module_name = "ME19") + +# Print out the plot +mod_19_heatmap +``` + +From the barplot portion of our plot, we can see `acute illness` samples tend to have higher expression values for the module 19 eigengene. +In the heatmap portion, we can see how the individual genes that make up module 19 are overall higher than in the `recovering` samples. + +We can save this plot to PNG. + +```{r} +png(file.path("results", "SRP140558_module_19_heatmap.png")) +mod_19_heatmap +dev.off() +``` + +For comparison, let's try out the custom heatmap function with a different, _not_ differentially expressed module. + +```{r} +mod_25_heatmap <- make_module_heatmap(module_name = "ME25") + +# Print out the plot +mod_25_heatmap +``` + +In this non-significant module's heatmap, there's not a particularly strong pattern between acute illness and recovery samples. +Though we can still see the genes in this module seem to be very correlated with each other (which is how we found them in the first place, so this makes sense!). + +Save this plot also. + +```{r} +png(file.path("results", "SRP140558_module_25_heatmap.png")) +mod_25_heatmap +dev.off() +``` + +# Resources for further learning + +- [WGCNA FAQ page](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html) [@Langfelder2016]. +- [WGCNA tutorial](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/) [@Langfelder2016]. +- [WGCNA paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) [@Langfelder2008]. +- [ComplexHeatmap's tutorial guide](https://jokergoo.github.io/ComplexHeatmap-reference/book/) for more info on how to tweak the heatmaps [@Gu2020]. + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessioninfo::session_info() +``` + +# References diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html new file mode 100644 index 00000000..f9bde0f5 --- /dev/null +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html @@ -0,0 +1,4683 @@ + + + + + + + + + + + + + + +WGCNA - RNA-seq + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + + +
+

1 Purpose of this analysis

+

In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules (Langfelder and Horvath 2008). WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set. This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNA-seq data.

+

As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) (Langfelder and Horvath 2007). This eigengene x sample data can, in many instances, be used as you would the original gene expression values. In this example, we use eigengene x sample data to identify differentially expressed modules between our treatment and control group

+

This method does require some computing power, but can still be run locally (on your own computer) for most refine.bio datasets. As with many clustering and network methods, there are some parameters that may need tweaking.

+

⬇️ Jump to the analysis code ⬇️

+
+
+

2 How to run this example

+

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

+
+

2.1 Obtain the .Rmd file

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

+

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

+
+
+

2.2 Set up your analysis folders

+

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

+

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots" # Can replace with path to desired output plots directory
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results" # Can replace with path to desired output results directory
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

+
+
+

2.3 Obtain the dataset from refine.bio

+

For general information about downloading data for these examples, see our ‘Getting Started’ section.

+

Go to this dataset’s page on refine.bio.

+

Click the “Download Now” button on the right side of this screen.

+

+

Fill out the pop up window with your email and our Terms and Conditions:

+

+

We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.

+

+

It may take a few minutes for the dataset to process. You will get an email when it is ready.

+
+
+

2.4 About the dataset we are using for this example

+

For this example analysis, we will use this acute viral bronchiolitis dataset. The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients. Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated “AV”) and their recovery, their post-convalescence visit (abbreviated “CV”).

+
+
+

2.5 Place the dataset in your new data/ folder

+

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

+

+

For more details on the contents of this folder see these docs on refine.bio.

+

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

+

Copy and paste the SRP140558 folder into your newly created data/ folder.

+
+
+

2.6 Check out our file structure!

+

Your new analysis folder should contain:

+
    +
  • The example analysis .Rmd you downloaded
    +
  • +
  • A folder called “data” which contains: +
      +
    • The SRP140558 folder which contains: +
        +
      • The gene expression
        +
      • +
      • The metadata TSV
        +
      • +
    • +
  • +
  • A folder for plots (currently empty)
  • +
  • A folder for results (currently empty)
  • +
+

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

+

+

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

+

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

+
# Define the file path to the data directory
+data_dir <- file.path("data", "SRP140558") # Replace with accession number which will be the name of the folder the files will be in
+
+# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
+data_file <- file.path(data_dir, "SRP140558.tsv") # Replace with file path to your dataset
+
+# Declare the file path to the metadata file using the data directory saved as `data_dir`
+metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv") # Replace with file path to your metadata
+

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

+
# Check if the gene expression matrix file is at the file path stored in `data_file`
+file.exists(data_file)
+
## [1] TRUE
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
+
## [1] TRUE
+

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

+

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

+
+
+
+

3 Using a different refine.bio dataset with this analysis?

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+
+

3.0.1 Sample size

+

Keep in mind when using a different refine.bio dataset with this example, that WGCNA requires at least 15 samples to produce a meaningful result according to its authors. So you will need to make sure the dataset you use is sufficiently large. However, note that very large datasets will be difficult to run locally (on a personal laptop) due to the required computing power. While you can adjust some parameters to make this more doable on a laptop, it may decrease the reliability of your result if taken to an extreme (more on this parameter, called maxBlockSize, in the Run WGCNA! section).

+
+
+

3.0.2 Microarray vs RNA-seq

+

WGCNA can be used with both RNA-seq and microarray datasets so long as they are well normalized and filtered. In this example we use RNA-seq and normalize and transform the data with DESeq2’s vst(), which not only is a method and package we recommend in general, but is also the authors’ specific recommendations for using WGCNA with RNA-seq data.

+

If you end up wanting to run WGCNA with a microarray dataset, the normalization done by refine.bio should be sufficient, but you will likely want to apply a minimum expression filter as we do in this example. If you have troubles finding a power parameter that yields a sufficient R^2 even after applying a stringent cutoff, you may want to look into using a different dataset.

+
+ +

 

+
+
+
+

4 Identifying co-expression gene modules with WGCNA - RNA-seq

+
+

4.1 Install libraries

+

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

+

We will be using DESeq2 to normalize and transform our RNA-seq data before running WGCNA, so we will need to install that (Love et al. 2014).

+

Of course, we will need the WGCNA package (Langfelder and Horvath 2008). But WGCNA also requires a package called impute that it sometimes has trouble installing so we recommend installing that first (Hastie et al. 2020).

+

For plotting purposes will be creating a sina plot and heatmaps which we will need a ggplot2 companion package for, called ggforce as well as the ComplexHeatmap package (Gu 2020).

+
if (!("DESeq2" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("DESeq2", update = FALSE)
+}
+
+if (!("impute" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("impute")
+}
+
+if (!("WGCNA" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  install.packages("WGCNA")
+}
+
+if (!("ggforce" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  install.packages("ggforce")
+}
+
+if (!("ComplexHeatmap" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  install.packages("ComplexHeatmap")
+}
+

Attach some of the packages we need for this analysis.

+
# Attach the DESeq2 library
+library(DESeq2)
+
## Loading required package: S4Vectors
+
## Loading required package: stats4
+
## Loading required package: BiocGenerics
+
## Loading required package: parallel
+
## 
+## Attaching package: 'BiocGenerics'
+
## The following objects are masked from 'package:parallel':
+## 
+##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+##     clusterExport, clusterMap, parApply, parCapply, parLapply,
+##     parLapplyLB, parRapply, parSapply, parSapplyLB
+
## The following objects are masked from 'package:stats':
+## 
+##     IQR, mad, sd, var, xtabs
+
## The following objects are masked from 'package:base':
+## 
+##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
+##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
+##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
+##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
+##     union, unique, unsplit, which.max, which.min
+
## 
+## Attaching package: 'S4Vectors'
+
## The following object is masked from 'package:base':
+## 
+##     expand.grid
+
## Loading required package: IRanges
+
## Loading required package: GenomicRanges
+
## Loading required package: GenomeInfoDb
+
## Loading required package: SummarizedExperiment
+
## Loading required package: MatrixGenerics
+
## Loading required package: matrixStats
+
## 
+## Attaching package: 'MatrixGenerics'
+
## The following objects are masked from 'package:matrixStats':
+## 
+##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+##     colWeightedMeans, colWeightedMedians, colWeightedSds,
+##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+##     rowWeightedSds, rowWeightedVars
+
## Loading required package: Biobase
+
## Welcome to Bioconductor
+## 
+##     Vignettes contain introductory material; view with
+##     'browseVignettes()'. To cite Bioconductor, see
+##     'citation("Biobase")', and for packages 'citation("pkgname")'.
+
## 
+## Attaching package: 'Biobase'
+
## The following object is masked from 'package:MatrixGenerics':
+## 
+##     rowMedians
+
## The following objects are masked from 'package:matrixStats':
+## 
+##     anyMissing, rowMedians
+
# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# We'll need this for finding gene modules
+library(WGCNA)
+
## Loading required package: dynamicTreeCut
+
## Loading required package: fastcluster
+
## 
+## Attaching package: 'fastcluster'
+
## The following object is masked from 'package:stats':
+## 
+##     hclust
+
## 
+
## 
+## Attaching package: 'WGCNA'
+
## The following object is masked from 'package:IRanges':
+## 
+##     cor
+
## The following object is masked from 'package:S4Vectors':
+## 
+##     cor
+
## The following object is masked from 'package:stats':
+## 
+##     cor
+
# We'll be making some plots
+library(ggplot2)
+
+
+

4.2 Import and set up data

+

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.

+

We stored our file paths as objects named metadata_file and data_file in this previous step.

+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_logical(),
+##   refinebio_accession_code = col_character(),
+##   experiment_accession = col_character(),
+##   refinebio_organism = col_character(),
+##   refinebio_platform = col_character(),
+##   refinebio_source_database = col_character(),
+##   refinebio_subject = col_character(),
+##   refinebio_title = col_character()
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_double(),
+##   Gene = col_character()
+## )
+## ℹ Use `spec()` for the full column specifications.
+

Let’s ensure that the metadata and data are in the same sample order.

+
# Make the data in the order of the metadata
+df <- df %>%
+  dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$refinebio_accession_code)
+
## [1] TRUE
+
+

4.2.1 Prepare data for DESeq2

+

There are two things we need to do to prep our expression data for DESeq2.

+

First, we need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

+

Then, we need to filter out the genes that have not been expressed or that have low expression counts. This is recommended by WGCNA docs for RNA-seq data. Removing low count genes can also help improve your WGCNA results. We are going to do some pre-filtering to keep only genes with 50 or more reads in total across the samples.

+
# The next DESeq2 functions need the values to be converted to integers
+df <- round(df) %>%
+  # The next steps require a data frame and round() returns a matrix
+  as.data.frame() %>%
+  # Only keep rows that have total counts above the cutoff
+  dplyr::filter(rowSums(.) >= 50)
+

Another thing we need to do is set up our main experimental group variable. Unfortunately the metadata for this dataset are not set up into separate, neat columns, but we can accomplish that ourselves.

+

For this study, PBMCs were collected at two time points: during the patients’ first, acute bronchiolitis visit (abbreviated “AV”) and their recovery visit, (called post-convalescence and abbreviated “CV”).

+

For handier use of this information, we can create a new variable, time_point, that states this info more clearly. This new time_point variable will have two labels: acute illness and recovering based on the AV or CV coding located in the refinebio_title string variable.

+
metadata <- metadata %>%
+  dplyr::mutate(
+    time_point = dplyr::case_when(
+      # Create our new variable based on refinebio_title containing AV/CV
+      stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness",
+      stringr::str_detect(refinebio_title, "_CV_") ~ "recovering"
+    ),
+    # It's easier for future items if this is already set up as a factor
+    time_point = as.factor(time_point)
+  )
+

Let’s double check that our factor set up is right. We want acute illness to be the first level since it was the first time point collected.

+
levels(metadata$time_point)
+
## [1] "acute illness" "recovering"
+

Great! We’re all set.

+
+
+
+

4.3 Create a DESeqDataset

+

We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

+
# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+  countData = df, # Our prepped data frame with counts
+  colData = metadata, # Data frame with annotation for our samples
+  design = ~1 # Here we are not specifying a model
+)
+
## converting counts to integer mode
+
+
+

4.4 Perform DESeq2 normalization and transformation

+

We often suggest normalizing and transforming your data for various applications and in this instance WGCNA’s authors suggest using variance stabilizing transformation before running WGCNA.
+We are going to use the vst() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

+
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()`
+# function from the `DESEq2` R package
+dds_norm <- vst(dds)
+

At this point, if your data set has any outlier samples, you should look into removing them as they can affect your WGCNA results.

+

WGCNA’s tutorial has an example of exploring your data for outliers you can reference.

+

For this example data set, we will skip this step (there are no obvious outliers) and proceed.

+
+
+

4.5 Format normalized data for WGCNA

+

Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA.

+
# Retrieve the normalized data from the `DESeqDataSet`
+normalized_counts <- assay(dds_norm) %>%
+  t() # Transpose this data
+
+
+

4.6 Determine parameters for WGCNA

+

To identify which genes are in the same modules, WGCNA first creates a weighted network to define which genes are near each other. The measure of “adjacency” it uses is based on the correlation matrix, but requires the definition of a threshold value, which in turn depends on a “power” parameter that defines the exponent used when transforming the correlation values. The choice of power parameter will affect the number of modules identified, and the WGCNA modules provides the pickSoftThreshold() function to help identify good choices for this parameter.

+
sft <- pickSoftThreshold(normalized_counts,
+  dataIsExpr = TRUE,
+  corFnc = cor,
+  networkType = "signed"
+)
+
## Warning: executing %dopar% sequentially: no parallel backend registered
+
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
+## 1      1   0.0491  42.50          0.947 13400.0  13400.00  13600
+## 2      2   0.8530 -12.60          0.871  7230.0   7080.00   8430
+## 3      3   0.8800  -5.41          0.856  4120.0   3900.00   5840
+## 4      4   0.8910  -3.28          0.864  2470.0   2230.00   4340
+## 5      5   0.9060  -2.39          0.882  1560.0   1310.00   3380
+## 6      6   0.9140  -1.96          0.895  1030.0    798.00   2740
+## 7      7   0.9220  -1.72          0.908   706.0    496.00   2280
+## 8      8   0.9190  -1.58          0.910   504.0    314.00   1940
+## 9      9   0.9180  -1.48          0.917   371.0    203.00   1680
+## 10    10   0.9080  -1.42          0.915   282.0    134.00   1470
+## 11    12   0.9050  -1.34          0.927   174.0     60.40   1170
+## 12    14   0.8870  -1.31          0.927   116.0     28.60    964
+## 13    16   0.8660  -1.32          0.918    81.7     14.00    810
+## 14    18   0.8560  -1.33          0.921    59.7      7.13    692
+## 15    20   0.8570  -1.33          0.929    45.0      3.71    599
+

This sft object has a lot of information, we will want to plot some of it to figure out what our power soft-threshold should be. We have to first calculate a measure of the model fit, the signed \(R^2\), and make that a new variable.

+
sft_df <- data.frame(sft$fitIndices) %>%
+  dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
+

Now, let’s plot the model fitting by the power soft threshold so we can decide on a soft-threshold for power.

+
ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
+  # Plot the points
+  geom_point() +
+  # We'll put the Power labels slightly above the data points
+  geom_text(nudge_y = 0.1) +
+  # We will plot what WGCNA recommends as an R^2 cutoff
+  geom_hline(yintercept = 0.80, col = "red") +
+  # Just in case our values are low, we want to make sure we can still see the 0.80 level
+  ylim(c(min(sft_df$model_fit), 1.05)) +
+  # We can add more sensible labels for our axis
+  xlab("Soft Threshold (power)") +
+  ylab("Scale Free Topology Model Fit, signed R^2") +
+  ggtitle("Scale independence") +
+  # This adds some nicer aesthetics to our plot
+  theme_classic()
+

+

Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power that has an signed \(R^2\) above 0.80, otherwise they warn your results may be too noisy to be meaningful.

+

If you have multiple power values with signed \(R^2\) above 0.80, then picking the one at an inflection point, in other words where the \(R^2\) values seem to have reached their saturation (Zhang and Horvath 2005). You want to a power that gives you a big enough \(R^2\) but is not excessively large.

+

So using the plot above, going with a power soft-threshold of 7!

+

If you find you have all very low \(R^2\) values this may be because there are too many genes with low expression values that are cluttering up the calculations. You can try returning to gene filtering step and choosing a more stringent cutoff (you’ll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).

+
+
+

4.7 Run WGCNA!

+

We will use the blockwiseModules() function to find gene co-expression modules in WGCNA, using 7 for the power argument like we determined above.

+

This next step takes some time to run. The blockwise part of the blockwiseModules() function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources.

+

Here we are using the default maxBlockSize, 5000 but, you may want to adjust the maxBlockSize argument depending on your computer’s memory. The authors of WGCNA recommend running the largest block your computer can handle and they provide some approximations as to GB of memory of a laptop and what maxBlockSize it should be able to handle:

+
+

• If the reader has access to a large workstation with more than 4 GB of memory, the parameter maxBlockSize can be increased. A 16GB workstation should handle up to 20000 probes; a 32GB workstation should handle perhaps 30000. A 4GB standard desktop or a laptop may handle up to 8000-10000 probes, depending on operating system and other running programs.

+
+

(Langfelder and Horvath 2016)

+
bwnet <- blockwiseModules(normalized_counts,
+  maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
+  TOMType = "signed", # topological overlap matrix
+  power = 7, # soft threshold for network construction
+  numericLabels = TRUE, # Let's use numbers instead of colors for module labels
+  randomSeed = 1234, # there's some randomness associated with this calculation
+  # so we should set a seed
+)
+

The TOMtype argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed network represents what you want – we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an unsigned network, where positive/negative is ignored see this article to help you figure that out (Langfelder 2018).

+

There are a lot of other settings you can tweak – look at ?blockwiseModules help page as well as the WGCNA tutorial (Langfelder and Horvath 2016).

+
+
+

4.8 Write main WGCNA results object to file

+

We will save our whole results object to an RDS file in case we want to return to our original WGCNA results.

+
readr::write_rds(bwnet,
+  file = file.path("results", "SRP140558_wgcna_results.RDS")
+)
+
+
+

4.9 Explore our WGCNA results

+

The bwnet object has many parts, storing a lot of information. We can pull out the parts we are most interested in and may want to use use for plotting.

+

In bwnet we have a data frame of eigengene module data for each sample in the MEs slot. These represent the collapsed, combined, and normalized expression of the genes that make up each module.

+
module_eigengenes <- bwnet$MEs
+
+# Print out a preview
+head(module_eigengenes)
+
+ +
+
+
+

4.10 Which modules have biggest differences across treatment groups?

+

We can also see if our eigengenes relate to our metadata labels. First we double check that our samples are still in order.

+
all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes))
+
## [1] TRUE
+
# Create the design matrix from the `time_point` variable
+des_mat <- model.matrix(~ metadata$time_point)
+

Run linear model on each module. Limma wants our tests to be per row, so we also need to transpose so the eigengenes are rows

+
# lmFit() needs a transposed version of the matrix
+fit <- limma::lmFit(t(module_eigengenes), design = des_mat)
+
+# Apply empirical Bayes to smooth standard errors
+fit <- limma::eBayes(fit)
+

Apply multiple testing correction and obtain stats in a data frame.

+
# Apply multiple testing correction and obtain stats
+stats_df <- limma::topTable(fit, number = ncol(module_eigengenes)) %>%
+  tibble::rownames_to_column("module")
+
## Removing intercept from test coefficients
+

Let’s take a look at the results. They are sorted with the most significant results at the top.

+
head(stats_df)
+
+ +
+

Module 19 seems to be the most differentially expressed across time_point groups. Now we can do some investigation into this module.

+
+
+

4.11 Let’s make plot of module 19

+

As a sanity check, let’s use ggplot to see what module 18’s eigengene looks like between treatment groups.

+

First we need to set up the module eigengene for this module with the sample metadata labels we need.

+
module_19_df <- module_eigengenes %>%
+  tibble::rownames_to_column("accession_code") %>%
+  # Here we are performing an inner join with a subset of metadata
+  dplyr::inner_join(metadata %>%
+    dplyr::select(refinebio_accession_code, time_point),
+  by = c("accession_code" = "refinebio_accession_code")
+  )
+

Now we are ready for plotting.

+
ggplot(
+  module_19_df,
+  aes(
+    x = time_point,
+    y = ME19,
+    color = time_point
+  )
+) +
+  # a boxplot with outlier points hidden (they will be in the sina plot)
+  geom_boxplot(width = 0.2, outlier.shape = NA) +
+  # A sina plot to show all of the individual data points
+  ggforce::geom_sina(maxwidth = 0.3) +
+  theme_classic()
+

+

This makes sense! Looks like module 19 has elevated expression during the acute illness but not when recovering.

+
+
+

4.12 What genes are a part of module 19?

+

If you want to know which of your genes make up a modules, you can look at the $colors slot. This is a named list which associates the genes with the module they are a part of. We can turn this into a data frame for handy use.

+
gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>%
+  # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere
+  dplyr::mutate(module = paste0("ME", module))
+

Now we can find what genes are a part of module 19.

+
gene_module_key %>%
+  dplyr::filter(module == "ME19")
+
+ +
+

Let’s save this gene to module key to a TSV file for future use.

+
readr::write_tsv(gene_module_key,
+  file = file.path("results", "SRP140558_wgcna_gene_to_module.tsv")
+)
+
+
+

4.13 Make a custom heatmap function

+

We will make a heatmap that summarizes our differentially expressed module. Because we will make a couple of these, it makes sense to make a custom function for making this heatmap.

+
make_module_heatmap <- function(module_name,
+                                expression_mat = normalized_counts,
+                                metadata_df = metadata,
+                                gene_module_key_df = gene_module_key,
+                                module_eigengenes_df = module_eigengenes) {
+  # Create a summary heatmap of a given module.
+  #
+  # Args:
+  # module_name: a character indicating what module should be plotted, e.g. "ME19"
+  # expression_mat: The full gene expression matrix. Default is `normalized_counts`.
+  # metadata_df: a data frame with refinebio_accession_code and time_point
+  #              as columns. Default is `metadata`.
+  # gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
+  # module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`.
+  #
+  # Returns:
+  # A heatmap of expression matrix for a module's genes, with a barplot of the
+  # eigengene expression for that module.
+
+  # Set up the module eigengene with its refinebio_accession_code
+  module_eigengene <- module_eigengenes_df %>%
+    dplyr::select(all_of(module_name)) %>%
+    tibble::rownames_to_column("refinebio_accession_code")
+
+  # Set up column annotation from metadata
+  col_annot_df <- metadata_df %>%
+    # Only select the treatment and sample ID columns
+    dplyr::select(refinebio_accession_code, time_point, refinebio_subject) %>%
+    # Add on the eigengene expression by joining with sample IDs
+    dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
+    # Arrange by patient and time point
+    dplyr::arrange(time_point, refinebio_subject) %>%
+    # Store sample
+    tibble::column_to_rownames("refinebio_accession_code")
+
+  # Create the ComplexHeatmap column annotation object
+  col_annot <- ComplexHeatmap::HeatmapAnnotation(
+    # Supply treatment labels
+    time_point = col_annot_df$time_point,
+    # Add annotation barplot
+    module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
+    # Pick colors for each experimental group in time_point
+    col = list(time_point = c("recovering" = "#f1a340", "acute illness" = "#998ec3"))
+  )
+
+  # Get a vector of the Ensembl gene IDs that correspond to this module
+  module_genes <- gene_module_key_df %>%
+    dplyr::filter(module == module_name) %>%
+    dplyr::pull(gene)
+
+  # Set up the gene expression data frame
+  mod_mat <- expression_mat %>%
+    t() %>%
+    as.data.frame() %>%
+    # Only keep genes from this module
+    dplyr::filter(rownames(.) %in% module_genes) %>%
+    # Order the samples to match col_annot_df
+    dplyr::select(rownames(col_annot_df)) %>%
+    # Data needs to be a matrix
+    as.matrix()
+
+  # Normalize the gene expression values
+  mod_mat <- mod_mat %>%
+    # Scale can work on matrices, but it does it by column so we will need to
+    # transpose first
+    t() %>%
+    scale() %>%
+    # And now we need to transpose back
+    t()
+
+  # Create a color function based on standardized scale
+  color_func <- circlize::colorRamp2(
+    c(-2, 0, 2),
+    c("#67a9cf", "#f7f7f7", "#ef8a62")
+  )
+
+  # Plot on a heatmap
+  heatmap <- ComplexHeatmap::Heatmap(mod_mat,
+    name = module_name,
+    # Supply color function
+    col = color_func,
+    # Supply column annotation
+    bottom_annotation = col_annot,
+    # We don't want to cluster samples
+    cluster_columns = FALSE,
+    # We don't need to show sample or gene labels
+    show_row_names = FALSE,
+    show_column_names = FALSE
+  )
+
+  # Return heatmap
+  return(heatmap)
+}
+
+
+

4.14 Make module heatmaps

+

Let’s try out the custom heatmap function with module 19 (our most differentially expressed module).

+
mod_19_heatmap <- make_module_heatmap(module_name = "ME19")
+
## Note: Using an external vector in selections is ambiguous.
+## ℹ Use `all_of(module_name)` instead of `module_name` to silence this message.
+## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
+## This message is displayed once per session.
+
# Print out the plot
+mod_19_heatmap
+

+

From the barplot portion of our plot, we can see acute illness samples tend to have higher expression values for the module 19 eigengene. In the heatmap portion, we can see how the individual genes that make up module 19 are overall higher than in the recovering samples.

+

We can save this plot to PNG.

+
png(file.path("results", "SRP140558_module_19_heatmap.png"))
+mod_19_heatmap
+dev.off()
+
## png 
+##   2
+

For comparison, let’s try out the custom heatmap function with a different, not differentially expressed module.

+
mod_25_heatmap <- make_module_heatmap(module_name = "ME25")
+
+# Print out the plot
+mod_25_heatmap
+

+

In this non-significant module’s heatmap, there’s not a particularly strong pattern between acute illness and recovery samples. Though we can still see the genes in this module seem to be very correlated with each other (which is how we found them in the first place, so this makes sense!).

+

Save this plot also.

+
png(file.path("results", "SRP140558_module_25_heatmap.png"))
+mod_25_heatmap
+dev.off()
+
## png 
+##   2
+
+
+
+

5 Resources for further learning

+ +
+
+

6 Session info

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

+
# Print session info
+sessioninfo::session_info()
+
## ─ Session info ───────────────────────────────────────────────────────────────
+##  setting  value                       
+##  version  R version 4.0.2 (2020-06-22)
+##  os       Ubuntu 20.04 LTS            
+##  system   x86_64, linux-gnu           
+##  ui       X11                         
+##  language (EN)                        
+##  collate  en_US.UTF-8                 
+##  ctype    en_US.UTF-8                 
+##  tz       Etc/UTC                     
+##  date     2020-11-30                  
+## 
+## ─ Packages ───────────────────────────────────────────────────────────────────
+##  package              * version  date       lib source        
+##  annotate               1.68.0   2020-10-27 [1] Bioconductor  
+##  AnnotationDbi          1.52.0   2020-10-27 [1] Bioconductor  
+##  assertthat             0.2.1    2019-03-21 [1] RSPM (R 4.0.0)
+##  backports              1.1.10   2020-09-15 [1] RSPM (R 4.0.2)
+##  base64enc              0.1-3    2015-07-28 [1] RSPM (R 4.0.0)
+##  Biobase              * 2.50.0   2020-10-27 [1] Bioconductor  
+##  BiocGenerics         * 0.36.0   2020-10-27 [1] Bioconductor  
+##  BiocParallel           1.24.1   2020-11-06 [1] Bioconductor  
+##  bit                    4.0.4    2020-08-04 [1] RSPM (R 4.0.2)
+##  bit64                  4.0.5    2020-08-30 [1] RSPM (R 4.0.2)
+##  bitops                 1.0-6    2013-08-17 [1] RSPM (R 4.0.0)
+##  blob                   1.2.1    2020-01-20 [1] RSPM (R 4.0.0)
+##  Cairo                  1.5-12.2 2020-07-07 [1] RSPM (R 4.0.2)
+##  checkmate              2.0.0    2020-02-06 [1] RSPM (R 4.0.0)
+##  circlize               0.4.10   2020-06-15 [1] RSPM (R 4.0.0)
+##  cli                    2.1.0    2020-10-12 [1] RSPM (R 4.0.2)
+##  clue                   0.3-57   2019-02-25 [1] RSPM (R 4.0.0)
+##  cluster                2.1.0    2019-06-19 [1] RSPM (R 4.0.0)
+##  codetools              0.2-16   2018-12-24 [2] CRAN (R 4.0.2)
+##  colorspace             1.4-1    2019-03-18 [1] RSPM (R 4.0.0)
+##  ComplexHeatmap         2.6.2    2020-11-12 [1] Bioconductor  
+##  crayon                 1.3.4    2017-09-16 [1] RSPM (R 4.0.0)
+##  data.table             1.13.0   2020-07-24 [1] RSPM (R 4.0.2)
+##  DBI                    1.1.0    2019-12-15 [1] RSPM (R 4.0.0)
+##  DelayedArray           0.16.0   2020-10-27 [1] Bioconductor  
+##  DESeq2               * 1.30.0   2020-10-27 [1] Bioconductor  
+##  digest                 0.6.25   2020-02-23 [1] RSPM (R 4.0.0)
+##  doParallel             1.0.15   2019-08-02 [1] RSPM (R 4.0.0)
+##  dplyr                  1.0.2    2020-08-18 [1] RSPM (R 4.0.2)
+##  dynamicTreeCut       * 1.63-1   2016-03-11 [1] RSPM (R 4.0.0)
+##  ellipsis               0.3.1    2020-05-15 [1] RSPM (R 4.0.0)
+##  evaluate               0.14     2019-05-28 [1] RSPM (R 4.0.0)
+##  fansi                  0.4.1    2020-01-08 [1] RSPM (R 4.0.0)
+##  farver                 2.0.3    2020-01-16 [1] RSPM (R 4.0.0)
+##  fastcluster          * 1.1.25   2018-06-07 [1] RSPM (R 4.0.0)
+##  foreach                1.5.0    2020-03-30 [1] RSPM (R 4.0.0)
+##  foreign                0.8-80   2020-05-24 [2] CRAN (R 4.0.2)
+##  Formula                1.2-3    2018-05-03 [1] RSPM (R 4.0.0)
+##  genefilter             1.72.0   2020-10-27 [1] Bioconductor  
+##  geneplotter            1.68.0   2020-10-27 [1] Bioconductor  
+##  generics               0.0.2    2018-11-29 [1] RSPM (R 4.0.0)
+##  GenomeInfoDb         * 1.26.1   2020-11-20 [1] Bioconductor  
+##  GenomeInfoDbData       1.2.4    2020-11-25 [1] Bioconductor  
+##  GenomicRanges        * 1.42.0   2020-10-27 [1] Bioconductor  
+##  getopt                 1.20.3   2019-03-22 [1] RSPM (R 4.0.0)
+##  GetoptLong             1.0.3    2020-10-01 [1] RSPM (R 4.0.2)
+##  ggforce                0.3.2    2020-06-23 [1] RSPM (R 4.0.2)
+##  ggplot2              * 3.3.2    2020-06-19 [1] RSPM (R 4.0.1)
+##  GlobalOptions          0.1.2    2020-06-10 [1] RSPM (R 4.0.0)
+##  glue                   1.4.2    2020-08-27 [1] RSPM (R 4.0.2)
+##  GO.db                  3.12.1   2020-11-25 [1] Bioconductor  
+##  gridExtra              2.3      2017-09-09 [1] RSPM (R 4.0.0)
+##  gtable                 0.3.0    2019-03-25 [1] RSPM (R 4.0.0)
+##  Hmisc                  4.4-1    2020-08-10 [1] RSPM (R 4.0.2)
+##  hms                    0.5.3    2020-01-08 [1] RSPM (R 4.0.0)
+##  htmlTable              2.1.0    2020-09-16 [1] RSPM (R 4.0.2)
+##  htmltools              0.5.0    2020-06-16 [1] RSPM (R 4.0.1)
+##  htmlwidgets            1.5.2    2020-10-03 [1] RSPM (R 4.0.2)
+##  httr                   1.4.2    2020-07-20 [1] RSPM (R 4.0.2)
+##  impute                 1.64.0   2020-10-27 [1] Bioconductor  
+##  IRanges              * 2.24.0   2020-10-27 [1] Bioconductor  
+##  iterators              1.0.12   2019-07-26 [1] RSPM (R 4.0.0)
+##  jpeg                   0.1-8.1  2019-10-24 [1] RSPM (R 4.0.0)
+##  jsonlite               1.7.1    2020-09-07 [1] RSPM (R 4.0.2)
+##  knitr                  1.30     2020-09-22 [1] RSPM (R 4.0.2)
+##  labeling               0.3      2014-08-23 [1] RSPM (R 4.0.0)
+##  lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
+##  latticeExtra           0.6-29   2019-12-19 [1] RSPM (R 4.0.0)
+##  lifecycle              0.2.0    2020-03-06 [1] RSPM (R 4.0.0)
+##  limma                  3.46.0   2020-10-27 [1] Bioconductor  
+##  locfit                 1.5-9.4  2020-03-25 [1] RSPM (R 4.0.0)
+##  magick                 2.4.0    2020-06-23 [1] RSPM (R 4.0.2)
+##  magrittr             * 1.5      2014-11-22 [1] RSPM (R 4.0.0)
+##  MASS                   7.3-51.6 2020-04-26 [2] CRAN (R 4.0.2)
+##  Matrix                 1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
+##  MatrixGenerics       * 1.2.0    2020-10-27 [1] Bioconductor  
+##  matrixStats          * 0.57.0   2020-09-25 [1] RSPM (R 4.0.2)
+##  memoise                1.1.0    2017-04-21 [1] RSPM (R 4.0.0)
+##  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.0.0)
+##  nnet                   7.3-14   2020-04-26 [2] CRAN (R 4.0.2)
+##  optparse             * 1.6.6    2020-04-16 [1] RSPM (R 4.0.0)
+##  pillar                 1.4.6    2020-07-10 [1] RSPM (R 4.0.2)
+##  pkgconfig              2.0.3    2019-09-22 [1] RSPM (R 4.0.0)
+##  png                    0.1-7    2013-12-03 [1] RSPM (R 4.0.0)
+##  polyclip               1.10-0   2019-03-14 [1] RSPM (R 4.0.0)
+##  preprocessCore         1.52.0   2020-10-27 [1] Bioconductor  
+##  ps                     1.4.0    2020-10-07 [1] RSPM (R 4.0.2)
+##  purrr                  0.3.4    2020-04-17 [1] RSPM (R 4.0.0)
+##  R.cache                0.14.0   2019-12-06 [1] RSPM (R 4.0.0)
+##  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.0.2)
+##  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.0.2)
+##  R.utils                2.10.1   2020-08-26 [1] RSPM (R 4.0.2)
+##  R6                     2.4.1    2019-11-12 [1] RSPM (R 4.0.0)
+##  RColorBrewer           1.1-2    2014-12-07 [1] RSPM (R 4.0.0)
+##  Rcpp                   1.0.5    2020-07-06 [1] RSPM (R 4.0.2)
+##  RCurl                  1.98-1.2 2020-04-18 [1] RSPM (R 4.0.0)
+##  readr                  1.4.0    2020-10-05 [1] RSPM (R 4.0.2)
+##  rematch2               2.1.2    2020-05-01 [1] RSPM (R 4.0.0)
+##  rjson                  0.2.20   2018-06-08 [1] RSPM (R 4.0.0)
+##  rlang                  0.4.8    2020-10-08 [1] RSPM (R 4.0.2)
+##  rmarkdown              2.4      2020-09-30 [1] RSPM (R 4.0.2)
+##  rpart                  4.1-15   2019-04-12 [2] CRAN (R 4.0.2)
+##  RSQLite                2.2.1    2020-09-30 [1] RSPM (R 4.0.2)
+##  rstudioapi             0.11     2020-02-07 [1] RSPM (R 4.0.0)
+##  S4Vectors            * 0.28.0   2020-10-27 [1] Bioconductor  
+##  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.0.0)
+##  sessioninfo            1.1.1    2018-11-05 [1] RSPM (R 4.0.0)
+##  shape                  1.4.5    2020-09-13 [1] RSPM (R 4.0.2)
+##  stringi                1.5.3    2020-09-09 [1] RSPM (R 4.0.2)
+##  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.0.0)
+##  styler                 1.3.2    2020-02-23 [1] RSPM (R 4.0.0)
+##  SummarizedExperiment * 1.20.0   2020-10-27 [1] Bioconductor  
+##  survival               3.1-12   2020-04-10 [2] CRAN (R 4.0.2)
+##  tibble                 3.0.4    2020-10-12 [1] RSPM (R 4.0.2)
+##  tidyselect             1.1.0    2020-05-11 [1] RSPM (R 4.0.0)
+##  tweenr                 1.0.1    2018-12-14 [1] RSPM (R 4.0.2)
+##  vctrs                  0.3.4    2020-08-29 [1] RSPM (R 4.0.2)
+##  WGCNA                * 1.69     2020-02-28 [1] RSPM (R 4.0.2)
+##  withr                  2.3.0    2020-09-22 [1] RSPM (R 4.0.2)
+##  xfun                   0.18     2020-09-29 [1] RSPM (R 4.0.2)
+##  XML                    3.99-0.5 2020-07-23 [1] RSPM (R 4.0.2)
+##  xtable                 1.8-4    2019-04-21 [1] RSPM (R 4.0.0)
+##  XVector                0.30.0   2020-10-27 [1] Bioconductor  
+##  yaml                   2.2.1    2020-02-01 [1] RSPM (R 4.0.0)
+##  zlibbioc               1.36.0   2020-10-27 [1] Bioconductor  
+## 
+## [1] /usr/local/lib/R/site-library
+## [2] /usr/local/lib/R/library
+
+
+

References

+
+
+

Gu Z., 2020 ComplexHeatmap complete reference. https://jokergoo.github.io/ComplexHeatmap-reference/book/

+
+
+

Hastie T., R. Tibshirani, B. Narasimhan, and G. Chu, 2020 Impute: Imputation for microarray data. https://www.bioconductor.org/packages/devel/bioc/manuals/impute/man/impute.pdf

+
+
+

Langfelder P., and S. Horvath, 2007 Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 1. https://doi.org/10.1186/1752-0509-1-54

+
+
+

Langfelder P., and S. Horvath, 2008 WGCNA: An r package for weighted correlation network analysis. BMC Bioinformatics 9. https://doi.org/10.1186/1471-2105-9-559

+
+
+

Langfelder P., and S. Horvath, 2016 Tutorials for the WGCNA package. https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/

+
+
+

Langfelder P., 2018 Signed or unsigned: Which network type is preferable? https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/

+
+
+

Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

+
+
+

Zhang B., and S. Horvath, 2005 A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 4. https://doi.org/10.2202/1544-6115.1128

+
+
+
+ + + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 71fc5447..33b031cb 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -8,7 +8,9 @@ - [Setting up the docker container](#setting-up-the-docker-container) - [Docker image updates](#docker-image-updates) - [Download the datasets](#download-the-datasets) -- [Add a new analysis](#add-a-new-analysis) +- [Adding a new analysis](#adding-a-new-analysis) + - [Draft PR: Big picture reviews](#draft-pr-big-picture-reviews) + - [Refined PRs: Detailed reviews](#refined-prs-detailed-reviews) - [Setting up a new analysis file](#setting-up-a-new-analysis-file) - [How to use the template.Rmd](#how-to-use-the-templatermd) - [Adding datasets to the S3 bucket](#adding-datasets-to-the-s3-bucket) @@ -31,11 +33,18 @@ - [Mechanics of the rendering](#mechanics-of-the-rendering) - [How to re-render the notebooks locally](#how-to-re-render-the-notebooks-locally) - [Run snakemake without queueing up a web browser for the Docker container](#run-snakemake-without-queueing-up-a-web-browser-for-the-docker-container) - - [Automatic rendering using GitHub actions](#automatic-rendering-using-github-actions) +- [Pull Requests](#pull-requests) + - [Add to the testing branch: merges to staging](#add-to-the-testing-branch-merges-to-staging) + - [Publish: staged changes merged to master](#publish-staged-changes-merged-to-master) + - [Publish a change, but quickly: direct merges to master, hotfixes](#publish-a-change-but-quickly-direct-merges-to-master-hotfixes) + - [A summary of types of PRs.](#a-summary-of-types-of-prs) + - [Github actions](#github-actions) + - [Automatic Spell checking and styling](#automatic-spell-checking-and-styling) + - [Automatic Docker image and rendering](#automatic-docker-image-and-rendering) + - [Automatic rendering](#automatic-rendering) - [About the render-notebooks.R script](#about-the-render-notebooksr-script) - [Add new analyses to the Snakefile](#add-new-analyses-to-the-snakefile) - [Add new analyses to the navbar](#add-new-analyses-to-the-navbar) -- [Pull request status checks](#pull-request-status-checks) @@ -82,18 +91,35 @@ For development purposes, you can download all the datasets for the example note scripts/download-data.sh ``` -## Add a new analysis +## Adding a new analysis -Here are the summarized steps for adding a new analysis. -Click on the links to go to the detailed instructions for each step. +Our PR process for adding a new analysis involves two stages which are 2-3 (or more) PRs. +This splitting up an analysis into multiple PRs helps make the review process more manageable. +This process ensures that discussions around the big picture: conceptual decisions, what steps are included, which packages are used, are generally concluded before review moves on to the details and further polishing. +Note that all the following steps describe PRs to `staging` branch only (see more [about the branch set up](#pull-requests)). + +### Draft PR: Big picture reviews - On your new git branch, [set up the analysis file from the template](#setting-up-the-analysis-file). -- Add a [link to the html file to `_navbar.html`](#add-new-analyses-to-the-navbar) +- Get the basic steps for the analysis set up and create a draft PR for a big picture review (Not all descriptions need to be 100% word-smithed, but the general steps/outline should be reflected). +- Try to highlight things that encapsulate main concepts as ready for review using a `**REVIEW**` tag and/or use a `**DRAFT**` tag to indicate a section that hasn't really been worked on much yet. +- After the general outline of the analysis has been agreed upon through a reviewing process, incorporate the major feedback from the draft PR process before you split off new branches to file your refined PRs. +- Keep the original draft PR open for easy reference. + +### Refined PRs: Detailed reviews + +Break up the steps of the analysis into manageable review chunks on their own branches for detailed review (you may want to discuss what the chunks should be on the Draft PR). +- Delete any `**REVIEW/DRAFT**` tags leftover from the draft PR. +- Make sure each steps' explanations are fully realized for these PRs. +- Ensure that the notebook adheres to [the guidelines](#guidelines-for-analysis-notebooks). - [Cite sources and add them to the reference.bib file](#citing-sources-in-text) +- If the file has been [added to snakemake](#add-new-analyses-to-the-snakefile), in the Docker container, run [snakemake for rendering](#how-to-re-render-the-notebooks-locally) to make sure it runs. + +These steps should be done in the first refined PR, but don't need to be done again: +- Add a [link to the html file to `_navbar.html`](#add-new-analyses-to-the-navbar) - Add [data and metadata files to S3](#adding-datasets-to-the-s3-bucket) - Add not yet added packages needed for this analysis to the Dockerfile (make sure it successfully builds). - Add the [expected output html file to snakemake](#add-new-analyses-to-the-snakefile) -- In the Docker container, run [snakemake for rendering](#how-to-re-render-the-notebooks-locally) ### Setting up a new analysis file @@ -172,6 +198,9 @@ These analyses follow the [Google R Style Guide](https://google.github.io/styleg Snakemake will automatically runs the [r-lib/styler package](https://github.com/r-lib/styler) on each `.Rmd` file called in the `Snakefile`. This will help fix some spacing and formatting issues automatically. +Github actions will also automatically run styler and commit back to your branch any changes. +See the [Github actions section](#github-actions) for more info on how the automation steps for this works. + #### Session Info `sessioninfo::session_info()` should always be printed out at the end. @@ -284,7 +313,7 @@ Had no year associated with it, so it has keywords for its tag `pca-visually-exp #### Spell checking -Spell checks are run automatically using GitHub actions upon opening a PR for master or prior to merging to master. +Spell checks are [run automatically using GitHub actions](#github-actions) upon opening a PR for `master` or `staging`. GitHub actions will abort if there are more than 2 spelling errors and you will need to fix those before continuing. You can obtain the list of spelling errors on GitHub by going to `Actions` and clicking the workflow of PR you are working on. Click on the `style-n-check` step and in the upper right hand corner, there is a button that says "Artifacts" which should list a file called `spell-check-results`. @@ -300,11 +329,11 @@ If you want to run a spell check of all `.Rmd` files locally, you can use run `R ### Mechanics of the rendering -The `Snakefile` calls the `scripts/render-notebooks.R` which renders the `.html` files but leaves these `.Rmd` files ready for download and use [without the `pandoc` error](https://github.com/AlexsLemonade/refinebio-examples/pull/148#issuecomment-669170681). +The [`Snakefile`](#add-new-analyses-to-the-snakefile) calls the [`scripts/render-notebooks.R`](#about-the-render-notebooksr-script) which renders the `.html` files but leaves these `.Rmd` files ready for download and use [without the `pandoc` error](https://github.com/AlexsLemonade/refinebio-examples/pull/148#issuecomment-669170681). However, the `snakemake` workflow should also be run locally during development so that the author and reviewers can see the rendered output of the new material during the `Pull Request` process. Ideally snakemake will not re-render the `.html` for `.Rmd` files you have not edited, but if it does, you should only commit and push the files you have intended to change. -All `.html` files will be re-rendered upon merging to master, but by not committing files that are only altered incidentally, the `Files changed` page of your PR on GitHub will be more focused, easing the burden on reviewers. +All `.html` files will be re-rendered upon merging to `master` or `staging`, but by not committing files that are only altered incidentally, the `Files changed` page of your PR on GitHub will be more focused, easing the burden on reviewers. ### How to re-render the notebooks locally @@ -329,11 +358,137 @@ If you already have the `refinebio-examples` docker image: docker run --mount type=bind,target=/home/rstudio,source=$PWD ccdl/refinebio-examples snakemake --cores 4 ``` -### Automatic rendering using GitHub actions +## Pull Requests + +There are two protected branches in this repo: + +`staging` - where changes are initially sent for testing and previewing. +`master` - where changes that were tested are made user-facing. + +This two branch set up allows us to make incremental changes that we can test before publishing to our user-facing github pages. +It also means that pull requests have some extra methodology. + +#### Add to the testing branch: merges to staging + +After you've prepared the new or altered material for this repository, you can file a pull request to the `staging` branch this is the default branch for this repository so you should not have to change anything. +The `staging` branch does not feed into the published, user-facing material so you can feel free to make your pull requests as iterative and incremental as it is useful. +Use `squash and merge` when merging these pull requests into `staging` after approval. +`squash and merge` will be helpful for us if we need to "cherry pick" commits if particular project or example is not yet ready for user-facing but other material is. + +After merging a pull request into `staging`, check that all the changes are correct by reviewing the newly generated pages using html preview. +Go here and navigate to the pertinent pages that you've changed. +`http://htmlpreview.github.io/?https://github.com/AlexsLemonade/refinebio-examples/gh-pages-stages/01-getting-started/getting-started.html` +You may also want to spot-check other pages to be sure that there have not been inadvertent or unexpected changes introduced. + +#### Publish: staged changes merged to master + +Once a set of changes that are merged to `staging` are ready to be published, you can file a merge to `master` pull request. + +These types of PRs should only involve well-polished and ready for the public material. + +**Scenario 1: All changes from staging should be published** + +- Create a new branch from the most up-to-date `master` branch, call it `publish-mychanges` where `mychanges` is a short tag relevant to the changes. +- Checkout your newly created `publish` branch. +- If you are certain that all changes in `staging` should be carried over to `master` (aka published to Github pages), then you can do a `git merge staging` into your new branch. +- Now you can use the new branch to create the pull request to master as you would normally do. +- Try to be as specific as possible about what PRs (and by relation, their commits) you are requesting to merge to `master`. + + + +In this example above, the blue project is ready to be published and there are no other incomplete projects that have been merged in, so all commits from `staging` are wanted to be brought over `master`. + +By checking out a separate `publish` branch, we can resolve any merge conflicts in this `publish` branch so we don't do a bad thing by committing conflict resolutions directly to `staging`. +This also provides us with a "snapshot" if merges are continuing to happen to `staging` on other PRs -- this can make review hard if it keeps changing. + +**Scenario 2: Only some changes from staging should be published** + +- Create a new branch from the most up-to-date `master` branch, call it `publish-mychanges` where `mychanges` is a short tag relevant to the changes. +- Checkout your newly created `publish` branch. +- For each commit that needs to be published, add it to your new branch by using `git cherry-pick `. +Or in GitKraken, you can right click on the commit and choose `Cherry pick commit`. +- Now you can use the new branch to [create the pull request where the `master` branch](https://docs.github.com/en/free-pro-team@latest/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request#changing-the-branch-range-and-destination-repository) is the base branch. +- Try to be as specific as possible about what PRs (and by relation, their commits) you are requesting to merge to `master`. + + + +In this example above, the blue project is ready to be published, but the red project is not. +So in this scenario, we would cherry pick both commits from the blue project but ignore the one commit from the incomplete red project. + +This allows us to move forward changes from `staging` that are ready to be public facing even if other changes aren't ready (or if there isn't great timing for when the other changes will be ready). + +_Tips for getting commits ids_ +To perform the cherry picks, you will need the commit ids for finished projects only. +GitKraken shows commits, but sometimes I find it hard to follow which commits belong to which branch. +If you checkout `staging` and use `git log` you can see all the most recent commits for the `staging` branch. +If you want to save it to a file for easy browsing and copy/pasting you can use this command `git log --pretty=format:'%h was %an, %ar, message: %s' > git.log` + +#### Publish a change, but quickly: direct merges to master, hotfixes + +In (hopefully rare) scenarios where something that has already been published is noted to be broken and should be addressed quickly, [hotfix branch](https://nvie.com/posts/a-successful-git-branching-model/#hotfix-branches) pull requests are allowed. +These PRs should only be fairly small PRs and not anything that would require intense review. +This more for situations where "this is broken and here's a fix". + +- Create a new branch from the most up-to-date `master` branch and call it `hotfix-mybug`, where where `mybug` is a short tag relevant to the changes. +- Checkout your newly created `hotfix` branch. +- Make the hotfix change. +- Create a [pull request with `master` branch as the base branch](https://docs.github.com/en/free-pro-team@latest/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request#changing-the-branch-range-and-destination-repository). +- After your PR to `master` is approved and merged, merge the most up-to-date `staging` branch into your `hotfix` branch. +- File a second PR for `hotfix` but this time with [`staging` as the base branch](https://docs.github.com/en/free-pro-team@latest/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request#changing-the-branch-range-and-destination-repository). + + + +In this scenario, we start by making a branch from `master` and make the hotfix change on this `hotfix` branch. +Once the change is approved and merged into `master` from the first PR, we start the second PR and merge in `staging` to our `hotfix` branch. +This makes sure we don't accidentally undo any changes in `staging` that have not made it to `master` at this time. + +#### A summary of types of PRs. + +- `some-branch` -> `staging` -> NOT published to user-facing content. +This will be how most material is prepared so we can be more incremental with our changes and ultimately make sure only the very polished material is made user-facing. + +- Group of changes in `staging` -> `master` -> published to user-facing. +This is for when the updates from the previous kinds of PRs are "ready for prime time". + +- Hotfix PR: `hotfix` -> `master` -> published to user-facing. +For "this-is-broken" type changes that should be hastened to the user-facing content. +This requires a follow up pull request and merge to `staging`. + +### Github actions + +This repository has a lot of little moving parts, so to help make sure changes are where they are supposed to be and that some items aren't missed, we use Github actions to automate things where we can. + +Github action workflows are initiated either when a pull request is started or when a merge to `master` or `staging` is initiated. +The following sections explain more details about which Github actions happen when and what they do. + +#### Automatic Spell checking and styling + +When pull requests are initiated, spell check and styler are run by Github actions. + + + +If the styling introduces changes to the files, these changes will be committed back to your branch. +However if there are no styling changes, no commit will be made. + +If spell check finds more than 2 errors, Github actions will fail. +See the [spell check section](#spell-checking) for instructions on how to see your spelling errors and otherwise use spell check. + +#### Automatic Docker image and rendering + +After a pull request to `staging` or `master` branch is approved and a merge to one of these branches has been initiated, a sequence of Github actions makes sure that the rendered html files are pushed to the correct branch and that the updated docker image is pushed to Dockerhub. + + + +See the [Docker](#docker-for-refinebio-examples) and the next section about automatic rendering for more on how these steps are conducted. + +#### Automatic rendering This repository uses [snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) to render all notebooks. -All notebooks are automatically re-rendered by GitHub actions upon merges to master. -The newly rendered html files are all pushed to the `gh-pages` branch which will publish the material to https://alexslemonade.github.io/refinebio-examples/. +All notebooks are automatically re-rendered by GitHub actions upon merges to `master` or `staging`. +The newly rendered html files are all pushed to the `gh-pages` branch (if merging to `master`) and `gh-pages-stages` branch (if merging to `staging`). + +The `gh-pages` branch is where material is publshed to https://alexslemonade.github.io/refinebio-examples/. +The `gh-pages-stages` staging branch can be viewed using html preview at `http://htmlpreview.github.io/?https://github.com/AlexsLemonade/refinebio-examples/gh-pages-stages/01-getting-started/getting-started.html` If this automatic rendering fails, you will see a failed check at at the bottom of your PR on GitHub (and probably an email). You can see the details of this error on by going to `Actions` and clicking the workflow of PR you are working on that also says `Build Docker` underneath. @@ -343,6 +498,7 @@ Hopefully the error message helps you track down the problem, but you can also c ### About the render-notebooks.R script The `render-notebooks.R` script adds a `bibliography:` specification in the `.Rmd` header so all citations are automatically rendered. +It also adds other components like CSS styling, a footer, and Google Analytics (these items are all hard-coded into the script). **Options:** - `--rmd`: provided by snakemake, the input `.Rmd` file to render. @@ -377,11 +533,3 @@ Follow these steps to add the `.html` link to the navigation bar upon rendering. 6) Replace `tech-section`, `analysis_file_name` with the corresponding file names. 7) Save the file! 8) After you [render the notebook with snakemake](#rendering-notebooks), test the link to make sure it works. - -## Pull request status checks - -To require that branches are up-to-date with `master` before merging, we need to require that a status check passes before merging to `master`. -Turning on this setting mitigates the risk that changes that have been merged will be undone by a pull request that was filed first and alters the same file. -The status check used is a GitHub Action that test builds the docker image. -Most of the time, this should pull cached docker layers, so this will complete in a matter of minutes. -As a bonus, this process also checks that any changes to the Dockerfile result in a buildable image. diff --git a/Snakefile b/Snakefile index 4909b910..e9c26d36 100644 --- a/Snakefile +++ b/Snakefile @@ -8,7 +8,9 @@ rule target: "02-microarray/dimension-reduction_microarray_01_pca.html", "02-microarray/dimension-reduction_microarray_02_umap.html", "02-microarray/gene-id-annotation_microarray_01_ensembl.html", - "02-microarray/pathway-analysis_microarray_02_ora.html", + "02-microarray/pathway-analysis_microarray_01_ora.html", + "02-microarray/pathway-analysis_microarray_02_gsea.html", + "02-microarray/pathway-analysis_microarray_03_gsva.html", "02-microarray/ortholog-mapping_microarray_01_ensembl.html", "03-rnaseq/00-intro-to-rnaseq.html", "03-rnaseq/clustering_rnaseq_01_heatmap.html", @@ -17,7 +19,9 @@ rule target: "03-rnaseq/dimension-reduction_rnaseq_02_umap.html", "03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html", "03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.html", - "04-advanced-topics/00-intro-to-advanced-topics.html" + "04-advanced-topics/00-intro-to-advanced-topics.html", + "04-advanced-topics/network-analysis_rnaseq_01_wgcna.html" + rule render_citations: input: diff --git a/components/_navbar.html b/components/_navbar.html index 1ed6c957..fe4358c1 100644 --- a/components/_navbar.html +++ b/components/_navbar.html @@ -7,7 +7,7 @@ - +