Skip to content

Commit 3a2f92e

Browse files
WGCNA Part 5: switch dataset (#379)
* switch wording and dataset in general * Few more wording edits * Update dictionary; fix spelling errors * Re-render! * Change to 7 and incorporate jashapiro review * Also switch the most sig module! * Two comments from jashapiro review * Put the comments too * Style Rmds * Use all_of() to get rid warning * Style Rmds * Re-render Co-authored-by: GitHub Actions <actions@github.com>
1 parent ae9aea8 commit 3a2f92e

3 files changed

Lines changed: 332 additions & 228 deletions

File tree

04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd

Lines changed: 79 additions & 64 deletions
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@ output:
1313

1414
In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules [@Langfelder2008].
1515
WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set.
16-
This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNAseq data.
16+
This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNA-seq data.
1717

1818
As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) [@Langfelder2007].
1919
This eigengene x sample data can, in many instances, be used as you would the original gene expression values.
@@ -75,7 +75,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty
7575

7676
For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data).
7777

78-
Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer).
78+
Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP140558).
7979

8080
Click the "Download Now" button on the right side of this screen.
8181

@@ -96,9 +96,9 @@ You will get an email when it is ready.
9696

9797
## About the dataset we are using for this example
9898

99-
For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573).
100-
The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer.
101-
Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.
99+
For this example analysis, we will use this [acute viral bronchiolitis dataset](https://www.refine.bio/experiments/SRP140558).
100+
The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients.
101+
Samples were collected at two time points: during their first, acute bronchiolitis visit (abbreviated "AV") and their recovery, their post-convalescence visit (abbreviated "CV").
102102

103103
## Place the dataset in your new `data/` folder
104104

@@ -113,15 +113,15 @@ For more details on the contents of this folder see [these docs on refine.bio](h
113113
The `<experiment_accession_id>` folder has the data and metadata TSV files you will need for this example analysis.
114114
Experiment accession ids usually look something like `GSE1235` or `SRP12345`.
115115

116-
Copy and paste the `SRP133573` folder into your newly created `data/` folder.
116+
Copy and paste the `SRP140558` folder into your newly created `data/` folder.
117117

118118
## Check out our file structure!
119119

120120
Your new analysis folder should contain:
121121

122122
- The example analysis `.Rmd` you downloaded
123123
- A folder called "data" which contains:
124-
- The `SRP133573` folder which contains:
124+
- The `SRP140558` folder which contains:
125125
- The gene expression
126126
- The metadata TSV
127127
- A folder for `plots` (currently empty)
@@ -139,13 +139,13 @@ This is handy to do because if we want to switch the dataset (see next section f
139139

140140
```{r}
141141
# Define the file path to the data directory
142-
data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
142+
data_dir <- file.path("data", "SRP140558") # Replace with accession number which will be the name of the folder the files will be in
143143
144144
# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
145-
data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
145+
data_file <- file.path(data_dir, "SRP140558.tsv") # Replace with file path to your dataset
146146
147147
# Declare the file path to the metadata file using the data directory saved as `data_dir`
148-
metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
148+
metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv") # Replace with file path to your metadata
149149
```
150150

151151
Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above.
@@ -273,7 +273,7 @@ all.equal(colnames(df), metadata$refinebio_accession_code)
273273

274274
### Prepare data for `DESeq2`
275275

276-
There are two things we neeed to do to prep our expression data for DESeq2.
276+
There are two things we need to do to prep our expression data for DESeq2.
277277

278278
First, we need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later.
279279

@@ -291,23 +291,36 @@ df <- round(df) %>%
291291
dplyr::filter(rowSums(.) >= 50)
292292
```
293293

294-
Another thing we need to do is make sure our main experimental group label is set up.
295-
In this case `refinebio_treatment` has two groups: `pre-adt` and `post-adt`.
296-
To keep these two treatments in logical (rather than alphabetical) order, we will convert this to a factor with `pre-adt` as the first level.
294+
Another thing we need to do is set up our main experimental group variable.
295+
Unfortunately the metadata for this dataset are not set up into separate, neat columns, but we can accomplish that ourselves.
296+
297+
For this study, PBMCs were collected at two time points: during the patients' first, acute bronchiolitis visit (abbreviated "AV") and their recovery visit, (called post-convalescence and abbreviated "CV").
298+
299+
For handier use of this information, we can create a new variable, `time_point`, that states this info more clearly.
300+
This new `time_point` variable will have two labels: `acute illness` and `recovering` based on the `AV` or `CV` coding located in the `refinebio_title` string variable.
297301

298302
```{r}
299303
metadata <- metadata %>%
300-
dplyr::mutate(refinebio_treatment = factor(refinebio_treatment,
301-
levels = c("pre-adt", "post-adt")
302-
))
304+
dplyr::mutate(
305+
time_point = dplyr::case_when(
306+
# Create our new variable based on refinebio_title containing AV/CV
307+
stringr::str_detect(refinebio_title, "_AV_") ~ "acute illness",
308+
stringr::str_detect(refinebio_title, "_CV_") ~ "recovering"
309+
),
310+
# It's easier for future items if this is already set up as a factor
311+
time_point = as.factor(time_point)
312+
)
303313
```
304314

305315
Let's double check that our factor set up is right.
316+
We want `acute illness` to be the first level since it was the first time point collected.
306317

307318
```{r}
308-
levels(metadata$refinebio_treatment)
319+
levels(metadata$time_point)
309320
```
310321

322+
Great! We're all set.
323+
311324
## Create a DESeqDataset
312325

313326
We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object.
@@ -384,7 +397,7 @@ ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
384397
# We will plot what WGCNA recommends as an R^2 cutoff
385398
geom_hline(yintercept = 0.80, col = "red") +
386399
# Just in case our values are low, we want to make sure we can still see the 0.80 level
387-
ylim(c(min(sft_df$model_fit), 1)) +
400+
ylim(c(min(sft_df$model_fit), 1.05)) +
388401
# We can add more sensible labels for our axis
389402
xlab("Soft Threshold (power)") +
390403
ylab("Scale Free Topology Model Fit, signed R^2") +
@@ -399,14 +412,14 @@ WGCNA's authors recommend using a `power` that has an signed $R^2$ above `0.80`,
399412
If you have multiple power values with signed $R^2$ above `0.80`, then picking the one at an inflection point, in other words where the $R^2$ values seem to have reached their saturation [@Zhang2005].
400413
You want to a `power` that gives you a big enough $R^2$ but is not excessively large.
401414

402-
So using the plot above, going with a power soft-threshold of `16`!
415+
So using the plot above, going with a power soft-threshold of `7`!
403416

404417
If you find you have all very low $R^2$ values this may be because there are too many genes with low expression values that are cluttering up the calculations.
405418
You can try returning to [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).
406419

407420
## Run WGCNA!
408421

409-
We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `16` for the `power` argument like we determined above.
422+
We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `7` for the `power` argument like we determined above.
410423

411424
This next step takes some time to run.
412425
The `blockwise` part of the `blockwiseModules()` function name refers to that these calculations will be done on chunks of your data at a time to help with conserving computing resources.
@@ -425,7 +438,7 @@ operating system and other running programs.
425438
bwnet <- blockwiseModules(normalized_counts,
426439
maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
427440
TOMType = "signed", # topological overlap matrix
428-
power = 16, # soft threshold for network construction
441+
power = 7, # soft threshold for network construction
429442
numericLabels = TRUE, # Let's use numbers instead of colors for module labels
430443
randomSeed = 1234, # there's some randomness associated with this calculation
431444
# so we should set a seed
@@ -444,7 +457,7 @@ We will save our whole results object to an RDS file in case we want to return t
444457

445458
```{r}
446459
readr::write_rds(bwnet,
447-
file = file.path("results", "SRP133573_wgcna_results.RDS")
460+
file = file.path("results", "SRP140558_wgcna_results.RDS")
448461
)
449462
```
450463

@@ -473,8 +486,8 @@ all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes))
473486
```
474487

475488
```{r}
476-
# Create the design matrix from the refinebio_treatment variable
477-
des_mat <- model.matrix(~ metadata$refinebio_treatment)
489+
# Create the design matrix from the `time_point` variable
490+
des_mat <- model.matrix(~ metadata$time_point)
478491
```
479492

480493
Run linear model on each module.
@@ -503,21 +516,21 @@ They are sorted with the most significant results at the top.
503516
head(stats_df)
504517
```
505518

506-
Module 52 seems to be the most differentially expressed across `refinebio_treatment` groups.
519+
Module 19 seems to be the most differentially expressed across `time_point` groups.
507520
Now we can do some investigation into this module.
508521

509-
## Let's make plot of module 52
522+
## Let's make plot of module 19
510523

511-
As a sanity check, let's use `ggplot` to see what module 52's eigengene looks like between treatment groups.
524+
As a sanity check, let's use `ggplot` to see what module 18's eigengene looks like between treatment groups.
512525

513526
First we need to set up the module eigengene for this module with the sample metadata labels we need.
514527

515528
```{r}
516-
module_52_df <- module_eigengenes %>%
529+
module_19_df <- module_eigengenes %>%
517530
tibble::rownames_to_column("accession_code") %>%
518531
# Here we are performing an inner join with a subset of metadata
519532
dplyr::inner_join(metadata %>%
520-
dplyr::select(refinebio_accession_code, refinebio_treatment),
533+
dplyr::select(refinebio_accession_code, time_point),
521534
by = c("accession_code" = "refinebio_accession_code")
522535
)
523536
```
@@ -526,21 +539,24 @@ Now we are ready for plotting.
526539

527540
```{r}
528541
ggplot(
529-
module_52_df,
542+
module_19_df,
530543
aes(
531-
x = refinebio_treatment,
532-
y = ME52,
533-
color = refinebio_treatment
544+
x = time_point,
545+
y = ME19,
546+
color = time_point
534547
)
535548
) +
536-
ggforce::geom_sina() +
549+
# a boxplot with outlier points hidden (they will be in the sina plot)
550+
geom_boxplot(width = 0.2, outlier.shape = NA) +
551+
# A sina plot to show all of the individual data points
552+
ggforce::geom_sina(maxwidth = 0.3) +
537553
theme_classic()
538554
```
539555

540556
This makes sense!
541-
Looks like module 52 has elevated expression post treatment.
557+
Looks like module 19 has elevated expression during the acute illness but not when recovering.
542558

543-
## What genes are a part of module 52?
559+
## What genes are a part of module 19?
544560

545561
If you want to know which of your genes make up a modules, you can look at the `$colors` slot.
546562
This is a named list which associates the genes with the module they are a part of.
@@ -552,18 +568,18 @@ gene_module_key <- tibble::enframe(bwnet$colors, name = "gene", value = "module"
552568
dplyr::mutate(module = paste0("ME", module))
553569
```
554570

555-
Now we can find what genes are a part of module 52.
571+
Now we can find what genes are a part of module 19.
556572

557573
```{r}
558574
gene_module_key %>%
559-
dplyr::filter(module == "ME52")
575+
dplyr::filter(module == "ME19")
560576
```
561577

562578
Let's save this gene to module key to a TSV file for future use.
563579

564580
```{r}
565581
readr::write_tsv(gene_module_key,
566-
file = file.path("results", "SRP133573_wgcna_gene_to_module.tsv")
582+
file = file.path("results", "SRP140558_wgcna_gene_to_module.tsv")
567583
)
568584
```
569585

@@ -581,9 +597,9 @@ make_module_heatmap <- function(module_name,
581597
# Create a summary heatmap of a given module.
582598
#
583599
# Args:
584-
# module_name: a character indicating what module should be plotted, e.g. "ME52"
600+
# module_name: a character indicating what module should be plotted, e.g. "ME19"
585601
# expression_mat: The full gene expression matrix. Default is `normalized_counts`.
586-
# metadata_df: a data frame with refinebio_accession_code and refinebio_treatment
602+
# metadata_df: a data frame with refinebio_accession_code and time_point
587603
# as columns. Default is `metadata`.
588604
# gene_module_key: a data.frame indicating what genes are a part of what modules. Default is `gene_module_key`.
589605
# module_eigengenes: a sample x eigengene data.frame with samples as rownames. Default is `module_eigengenes`.
@@ -594,28 +610,28 @@ make_module_heatmap <- function(module_name,
594610
595611
# Set up the module eigengene with its refinebio_accession_code
596612
module_eigengene <- module_eigengenes_df %>%
597-
dplyr::select(module_name) %>%
613+
dplyr::select(all_of(module_name)) %>%
598614
tibble::rownames_to_column("refinebio_accession_code")
599615
600616
# Set up column annotation from metadata
601617
col_annot_df <- metadata_df %>%
602618
# Only select the treatment and sample ID columns
603-
dplyr::select(refinebio_accession_code, refinebio_treatment) %>%
619+
dplyr::select(refinebio_accession_code, time_point, refinebio_subject) %>%
604620
# Add on the eigengene expression by joining with sample IDs
605621
dplyr::inner_join(module_eigengene, by = "refinebio_accession_code") %>%
606-
# Arrange by treatment
607-
dplyr::arrange(refinebio_treatment, refinebio_accession_code) %>%
622+
# Arrange by patient and time point
623+
dplyr::arrange(time_point, refinebio_subject) %>%
608624
# Store sample
609625
tibble::column_to_rownames("refinebio_accession_code")
610626
611627
# Create the ComplexHeatmap column annotation object
612628
col_annot <- ComplexHeatmap::HeatmapAnnotation(
613629
# Supply treatment labels
614-
refinebio_treatment = col_annot_df$refinebio_treatment,
630+
time_point = col_annot_df$time_point,
615631
# Add annotation barplot
616632
module_eigengene = ComplexHeatmap::anno_barplot(dplyr::select(col_annot_df, module_name)),
617-
# Pick colors for each experimental group in refinebio_treatment
618-
col = list(refinebio_treatment = c("post-adt" = "#f1a340", "pre-adt" = "#998ec3"))
633+
# Pick colors for each experimental group in time_point
634+
col = list(time_point = c("recovering" = "#f1a340", "acute illness" = "#998ec3"))
619635
)
620636
621637
# Get a vector of the Ensembl gene IDs that correspond to this module
@@ -670,44 +686,43 @@ make_module_heatmap <- function(module_name,
670686

671687
## Make module heatmaps
672688

673-
Let's try out the custom heatmap function with module 52 (our most differentially expressed module).
689+
Let's try out the custom heatmap function with module 19 (our most differentially expressed module).
674690

675691
```{r}
676-
mod_52_heatmap <- make_module_heatmap(module_name = "ME52")
692+
mod_19_heatmap <- make_module_heatmap(module_name = "ME19")
677693
678694
# Print out the plot
679-
mod_52_heatmap
695+
mod_19_heatmap
680696
```
681697

682-
From the barplot portion of our plot, we can see `post-adt` samples have higher values for this eigengene for module 52.
683-
In the heatmap portion, we can see how the individual genes that make up module 52 have more extreme values (very high or very low) in the `post-adt` samples.
698+
From the barplot portion of our plot, we can see `acute illness` samples tend to have higher expression values for the module 19 eigengene.
699+
In the heatmap portion, we can see how the individual genes that make up module 19 are overall higher than in the `recovering` samples.
684700

685701
We can save this plot to PDF.
686702

687703
```{r}
688-
pdf(file.path("results", "SRP133573_module_52_heatmap.pdf"))
689-
mod_52_heatmap
704+
pdf(file.path("results", "SRP140558_module_19_heatmap.pdf"))
705+
mod_19_heatmap
690706
dev.off()
691707
```
692708

693709
For comparison, let's try out the custom heatmap function with a different, _not_ differentially expressed module.
694710

695711
```{r}
696-
mod_10_heatmap <- make_module_heatmap(module_name = "ME10")
712+
mod_25_heatmap <- make_module_heatmap(module_name = "ME25")
697713
698714
# Print out the plot
699-
mod_10_heatmap
715+
mod_25_heatmap
700716
```
701717

702-
In this non-significant module's heatmap, there's not a particularly strong pattern between pre and post ADT samples.
703-
In general the expression of genes in module 10 does not vary much between groups, staying near the overall mean.
704-
There are a few samples and some genes that show higher expression, but it is not surprising this does not results in a significant overall difference between the groups.
718+
In this non-significant module's heatmap, there's not a particularly strong pattern between acute illness and recovery samples.
719+
Though we can still see the genes in this module seem to be very correlated with each other (which is how we found them in the first place, so this makes sense!).
705720

706721
Save this plot also.
707722

708723
```{r}
709-
pdf(file.path("results", "SRP133573_module_10_heatmap.pdf"))
710-
mod_10_heatmap
724+
pdf(file.path("results", "SRP140558_module_25_heatmap.pdf"))
725+
mod_25_heatmap
711726
dev.off()
712727
```
713728

@@ -725,7 +740,7 @@ This helps make your code more reproducible by recording what versions of softwa
725740

726741
```{r}
727742
# Print session info
728-
sessionInfo()
743+
sessioninfo::session_info()
729744
```
730745

731746
# References

0 commit comments

Comments
 (0)