* switch wording and dataset in general
* Few more wording edits
* Update dictionary; fix spelling errors
* Re-render!
* Change to 7 and incorporate jashapiro review
* Also switch the most sig module!
* Two comments from jashapiro review
* Put the comments too
* Style Rmds
* Use all_of() to get rid warning
* Style Rmds
* Re-render
Co-authored-by: GitHub Actions <actions@github.com>
04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd
79 additions & 64 deletions
@@ -13,7 +13,7 @@ output:
13
13
14
14
In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules [@Langfelder2008].
15
15
WGCNA uses a series of correlations to identify sets of genes that are expressed together in your data set.
16
-
This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNAseq data.
16
+
This is a fairly intuitive approach to gene network analysis which can aid in interpretation of microarray & RNA-seq data.
17
17
18
18
As output, WGCNA gives groups of co-expressed genes as well as an eigengene x sample matrix (where the values for each eigengene represent the summarized expression for a group of co-expressed genes) [@Langfelder2007].
19
19
This eigengene x sample data can, in many instances, be used as you would the original gene expression values.
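To make the idea of an eigengene concrete, here is a minimal sketch in base R, assuming a module eigengene can be approximated as the first principal component of a module's scaled expression values. The data are simulated and the object names (`module_expr`, `eigengene_by_sample`) are hypothetical; in a real analysis WGCNA computes the eigengenes for you.

```r
# Toy data: 10 samples (rows) x 5 genes (columns) that share one underlying signal.
# All values are simulated for illustration only.
set.seed(1234)
shared_signal <- rnorm(10)
module_expr <- sapply(1:5, function(i) shared_signal + rnorm(10, sd = 0.3))
colnames(module_expr) <- paste0("gene_", 1:5)
rownames(module_expr) <- paste0("sample_", 1:10)

# The first principal component of the scaled genes summarizes the module
pca <- prcomp(module_expr, center = TRUE, scale. = TRUE)
eigengene <- pca$x[, 1]

# One value per sample, so it can be treated much like a single gene downstream
eigengene_by_sample <- data.frame(
  sample = rownames(module_expr),
  eigengene = eigengene
)
```

Because the five toy genes move together, the single `eigengene` column tracks their shared pattern across samples (up to sign).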
@@ -75,7 +75,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty
75
75
76
76
For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data).
77
77
78
-
Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer).
78
+
Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP140558).
79
79
80
80
Click the "Download Now" button on the right side of this screen.
81
81
@@ -96,9 +96,9 @@ You will get an email when it is ready.
96
96
97
97
## About the dataset we are using for this example
98
98
99
-
For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573).
100
-
The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer.
101
-
Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.
99
+
For this example analysis, we will use this [acute viral bronchiolitis dataset](https://www.refine.bio/experiments/SRP140558).
100
+
The data that we downloaded from refine.bio for this analysis has 62 paired peripheral blood mononuclear cell RNA-seq samples obtained from 31 patients.
101
+
Samples were collected at two time points: during the patients' first, acute bronchiolitis visit (abbreviated "AV") and during their recovery, post-convalescence visit (abbreviated "CV").
102
102
103
103
## Place the dataset in your new `data/` folder
104
104
@@ -113,15 +113,15 @@ For more details on the contents of this folder see [these docs on refine.bio](h
113
113
The `<experiment_accession_id>` folder has the data and metadata TSV files you will need for this example analysis.
114
114
Experiment accession ids usually look something like `GSE1235` or `SRP12345`.
115
115
116
-
Copy and paste the `SRP133573` folder into your newly created `data/` folder.
116
+
Copy and paste the `SRP140558` folder into your newly created `data/` folder.
117
117
118
118
## Check out our file structure!
119
119
120
120
Your new analysis folder should contain:
121
121
122
122
- The example analysis `.Rmd` you downloaded
123
123
- A folder called "data" which contains:
124
-
- The `SRP133573` folder which contains:
124
+
- The `SRP140558` folder which contains:
125
125
- The gene expression
126
126
- The metadata TSV
127
127
- A folder for `plots` (currently empty)
@@ -139,13 +139,13 @@ This is handy to do because if we want to switch the dataset (see next section f
139
139
140
140
```{r}
141
141
# Define the file path to the data directory
142
-
data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
142
+
data_dir <- file.path("data", "SRP140558") # Replace with accession number which will be the name of the folder the files will be in
143
143
144
144
# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
145
-
data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
145
+
data_file <- file.path(data_dir, "SRP140558.tsv") # Replace with file path to your dataset
146
146
147
147
# Declare the file path to the metadata file using the data directory saved as `data_dir`
148
-
metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
148
+
metadata_file <- file.path(data_dir, "metadata_SRP140558.tsv") # Replace with file path to your metadata
149
149
```
150
150
151
151
Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above.
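The check described above might look something like the following sketch. So that it runs anywhere, it creates empty stand-in files in a temporary directory; the `demo_*` names are hypothetical, and in the analysis itself you would pass `data_file` and `metadata_file` to `file.exists()` directly.

```r
# Hypothetical stand-in files so this sketch is self-contained
demo_dir <- file.path(tempdir(), "SRP140558")
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)
demo_data_file <- file.path(demo_dir, "SRP140558.tsv")
demo_metadata_file <- file.path(demo_dir, "metadata_SRP140558.tsv")
file.create(demo_data_file, demo_metadata_file)

# file.exists() is vectorized, so both paths can be checked at once
files_found <- file.exists(c(demo_data_file, demo_metadata_file))

# Stop early with a readable message if anything is missing
if (!all(files_found)) {
  stop("One or more input files were not found; check the paths above.")
}
```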
There are two things we neeed to do to prep our expression data for DESeq2.
276
+
There are two things we need to do to prep our expression data for DESeq2.
277
277
278
278
First, we need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later.
279
279
@@ -291,23 +291,36 @@ df <- round(df) %>%
291
291
dplyr::filter(rowSums(.) >= 50)
292
292
```
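To see what the rounding and minimum-count filter do, here is a small base R sketch on made-up numbers. The `toy_counts` data frame is hypothetical; the cutoff of 50 total counts per gene mirrors the one used in the chunk above.

```r
# Made-up expression values: genes are rows, samples are columns
toy_counts <- data.frame(
  sample_1 = c(10.4, 0.2, 120.7),
  sample_2 = c(45.6, 1.1, 98.3),
  row.names = c("gene_a", "gene_b", "gene_c")
)

# Round to whole numbers, then keep genes with at least 50 counts in total
rounded <- round(toy_counts)
filtered <- rounded[rowSums(rounded) >= 50, , drop = FALSE]

# gene_b has only 1 count in total after rounding, so it is dropped
rownames(filtered)
```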
293
293
294
-
Another thing we need to do is make sure our main experimental group label is set up.
295
-
In this case `refinebio_treatment` has two groups: `pre-adt` and `post-adt`.
296
-
To keep these two treatments in logical (rather than alphabetical) order, we will convert this to a factor with `pre-adt` as the first level.
294
+
Another thing we need to do is set up our main experimental group variable.
295
+
Unfortunately the metadata for this dataset are not set up into separate, neat columns, but we can accomplish that ourselves.
296
+
297
+
For this study, PBMCs were collected at two time points: during the patients' first, acute bronchiolitis visit (abbreviated "AV") and their recovery visit (called post-convalescence and abbreviated "CV").
298
+
299
+
For handier use of this information, we can create a new variable, `time_point`, that states this info more clearly.
300
+
This new `time_point` variable will have two labels: `acute illness` and `recovering` based on the `AV` or `CV` coding located in the `refinebio_title` string variable.
# It's easier for future items if this is already set up as a factor
311
+
time_point = as.factor(time_point)
312
+
)
303
313
```
304
314
305
315
Let's double check that our factor set up is right.
316
+
We want `acute illness` to be the first level since it was the first time point collected.
306
317
307
318
```{r}
308
-
levels(metadata$refinebio_treatment)
319
+
levels(metadata$time_point)
309
320
```
310
321
322
+
Great! We're all set.
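As a recap of the `time_point` set up, here is a minimal base R sketch on toy metadata. The sample titles are invented and only mimic the `AV`/`CV` coding, so the exact pattern you match on is an assumption to adjust for your own `refinebio_title` values.

```r
# Toy metadata; these titles are made up for illustration
metadata_demo <- data.frame(
  refinebio_title = c(
    "patient_01_AV", "patient_01_CV",
    "patient_02_AV", "patient_02_CV"
  ),
  stringsAsFactors = FALSE
)

# Label each sample based on whether its title carries the AV coding
metadata_demo$time_point <- ifelse(
  grepl("_AV", metadata_demo$refinebio_title, fixed = TRUE),
  "acute illness",
  "recovering"
)

# Set the level order explicitly so the first time point comes first
metadata_demo$time_point <- factor(
  metadata_demo$time_point,
  levels = c("acute illness", "recovering")
)

levels(metadata_demo$time_point)
```

Stating the `levels` explicitly keeps the order independent of alphabetical sorting, which matters if you later rename the labels.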
323
+
311
324
## Create a DESeqDataset
312
325
313
326
We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object.
# We will plot what WGCNA recommends as an R^2 cutoff
385
398
geom_hline(yintercept = 0.80, col = "red") +
386
399
# Just in case our values are low, we want to make sure we can still see the 0.80 level
387
-
ylim(c(min(sft_df$model_fit), 1)) +
400
+
ylim(c(min(sft_df$model_fit), 1.05)) +
388
401
# We can add more sensible labels for our axis
389
402
xlab("Soft Threshold (power)") +
390
403
ylab("Scale Free Topology Model Fit, signed R^2") +
@@ -399,14 +412,14 @@ WGCNA's authors recommend using a `power` that has an signed $R^2$ above `0.80`,
399
412
If you have multiple power values with signed $R^2$ above `0.80`, then pick the one at an inflection point, in other words, where the $R^2$ values seem to have reached their saturation [@Zhang2005].
400
413
You want to use a `power` that gives you a big enough $R^2$ but is not excessively large.
401
414
402
-
So using the plot above, going with a power soft-threshold of `16`!
415
+
So, using the plot above, we will go with a soft-threshold power of `7`!
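If you would like a rough programmatic check to go alongside reading the plot, one simple option is to take the lowest power whose signed $R^2$ clears the `0.80` cutoff. The sketch below uses invented numbers in a hypothetical `sft_demo` data frame with the same `power` and `model_fit` columns; it is a stand-in for, not a replacement of, looking for the inflection point by eye.

```r
# Invented soft-threshold results for illustration only
sft_demo <- data.frame(
  power = 1:10,
  model_fit = c(0.10, 0.35, 0.55, 0.68, 0.74, 0.78, 0.83, 0.85, 0.86, 0.86)
)

# Keep the powers that meet the recommended signed R^2 cutoff
r2_cutoff <- 0.80
passing <- sft_demo[sft_demo$model_fit >= r2_cutoff, ]

# Take the smallest passing power so it is big enough but not excessive
chosen_power <- min(passing$power)
chosen_power
```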
403
416
404
417
If you find you have all very low $R^2$ values this may be because there are too many genes with low expression values that are cluttering up the calculations.
405
418
You can try returning to the [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).
406
419
407
420
## Run WGCNA!
408
421
409
-
We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `16` for the `power` argument like we determined above.
422
+
We will use the `blockwiseModules()` function to find gene co-expression modules in WGCNA, using `7` for the `power` argument like we determined above.
410
423
411
424
This next step takes some time to run.
412
425
The `blockwise` part of the `blockwiseModules()` function name refers to the fact that these calculations are done on chunks of your data at a time to help conserve computing resources.
@@ -425,7 +438,7 @@ operating system and other running programs.
425
438
bwnet <- blockwiseModules(normalized_counts,
426
439
maxBlockSize = 5000, # What size chunks (how many genes) the calculations should be run in
427
440
TOMType = "signed", # topological overlap matrix
428
-
power = 16, # soft threshold for network construction
441
+
power = 7, # soft threshold for network construction
429
442
numericLabels = TRUE, # Let's use numbers instead of colors for module labels
430
443
randomSeed = 1234, # there's some randomness associated with this calculation
431
444
# so we should set a seed
@@ -444,7 +457,7 @@ We will save our whole results object to an RDS file in case we want to return t
From the barplot portion of our plot, we can see `post-adt` samples have higher values for this eigengene for module 52.
683
-
In the heatmap portion, we can see how the individual genes that make up module 52 have more extreme values (very high or very low) in the `post-adt` samples.
698
+
From the barplot portion of our plot, we can see `acute illness` samples tend to have higher expression values for the module 19 eigengene.
699
+
In the heatmap portion, we can see that the individual genes that make up module 19 have overall higher expression in the `acute illness` samples than in the `recovering` samples.
In this non-significant module's heatmap, there's not a particularly strong pattern between pre and post ADT samples.
703
-
In general the expression of genes in module 10 does not vary much between groups, staying near the overall mean.
704
-
There are a few samples and some genes that show higher expression, but it is not surprising this does not results in a significant overall difference between the groups.
718
+
In this non-significant module's heatmap, there's not a particularly strong pattern between acute illness and recovery samples.
719
+
Though we can still see the genes in this module seem to be very correlated with each other (which is how we found them in the first place, so this makes sense!).