diff --git a/05-fastqc-galaxy-in-anvil.Rmd b/05-fastqc-galaxy-in-anvil.Rmd index 0609378b..442ac883 100644 --- a/05-fastqc-galaxy-in-anvil.Rmd +++ b/05-fastqc-galaxy-in-anvil.Rmd @@ -244,7 +244,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1zre8qeE9x56RSjus
Expand for FASTQ files explained -For more information on the [contents of a FASTQ file, consider this resource from Illumina](https://knowledge.illumina.com/software/general/software-general-reference_material-list/000002211). +For more information on the [contents of a FASTQ file, consider this resource from Illumina](https://support.illumina.com.cn/bulletins/2016/04/fastq-files-explained.html). ::: {.reflection} QUESTIONS: @@ -394,4 +394,4 @@ cow::borrow_chapter( # Next steps {#run-tool-with-galaxy-next-steps} -This tutorial is just a first step in working with this data, specifically considering the quality scores of one set of the sequencing reads. Next steps include running FastQC on the reverse reads and downstream applications such as alignment and variant detection. Walk-throughs for these next steps can be found in [this Genomic Data Science Community Network Lab Exercise Book](https://jhudatascience.org/GDSCN_Book_SARS_Galaxy_on_AnVIL/student-activity-guide.html#alignment) or [this AnVIL Demos Recording](https://www.youtube.com/watch?v=_uT0IWL1wso). \ No newline at end of file +This tutorial is just a first step in working with this data, specifically considering the quality scores of one set of the sequencing reads. Next steps include running FastQC on the reverse reads and downstream applications such as alignment and variant detection. Walk-throughs for these next steps can be found in [this Genomic Data Science Community Network Lab Exercise Book](https://jhudatascience.org/GDSCN_Book_SARS_Galaxy_on_AnVIL/student-activity-guide.html#alignment) or [this AnVIL Demos Recording](https://www.youtube.com/watch?v=_uT0IWL1wso). 
diff --git a/_bookdown.yml b/_bookdown.yml index 1e2e316c..a21fc0b2 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -5,7 +5,9 @@ rmd_files: ["index.Rmd", "02-scale-with-workflows.Rmd", "03-single-cell-with-bioconductor.Rmd", "05-fastqc-galaxy-in-anvil.Rmd", - "human-genetic-variation-in-mage.Rmd"] + "human-genetic-variation-in-mage.Rmd", + "mini-hack.Rmd" + ] new_session: yes bibliography: [book.bib] delete_merged_file: true diff --git a/mini-hack.Rmd b/mini-hack.Rmd new file mode 100644 index 00000000..b75c32b7 --- /dev/null +++ b/mini-hack.Rmd @@ -0,0 +1,147 @@ +```{r echo = FALSE} +knitr::opts_chunk$set(out.width = "100%") +``` + +# (PART\*) MAGE mini-hackathon {.unnumbered} + +# Overview + +Every dataset tells a story, yet deciphering exactly how a figure was constructed can feel like solving a puzzle with missing pieces. Reproducibility is a common challenge. This mini-hackathon challenges teams to recreate figures from the MAGE RNA sequencing study using real omics data from the 1000 Genomes Project. Teams can also create additional visualizations from the open data. Through hands-on experience with cloud-based tools on AnVIL, participants will discover that computational reproducibility requires creativity, problem-solving, and detective work that go beyond simply following published protocols. + +**References**: + +- Paper to reproduce here: + +- GitHub repository companion for the paper here: + +- Data from the paper can be found here: + +- Original Workspace: + +## Prepare the Workspace + +Be sure to use the startup script when creating an RStudio Cloud Environment on AnVIL. Paste the path below into the Startup script _Optional_ box to install `AnVILGCP`, `vcfR`, and `bcftools`. The startup script can be found in this workspace under the Data tab: in the other data option, look under Workspace data. 
+For example, the path to the startup script has the following format: +``` +gs://fc-7e522183-41b0-4a9b-8f48-699fe8d114e8/install-conda.sh +``` + +Here is a video tutorial on finding the path of the startup script: . + +To read more on preconfiguring a Cloud Environment using startup scripts, click [here](https://support.terra.bio/hc/en-us/articles/360058193872-Preconfigure-a-Cloud-Environment-with-a-startup-script#h_01J5R7WP1WFDKV5H5P96883TXZ). + +# Setting up your Workspace + +1. Clone the AnVIL workspace: + +- For steps on launching Terra and cloning workspaces, read [here](https://hutchdatascience.org/AnVIL_Demos/what-is-anvil-exercises.html#launch-terra). + +2. Then, follow [these steps](https://hutchdatascience.org/AnVIL_Demos/human-genetic-variation-in-mage-exercises.html#launch-rstudio) to launch RStudio within your AnVIL workspace. + +3. Open `Reproducibility_in_Action.Rmd` and work through the notebook. + - **Ensure that you replace _add your Google Project ID here_ with your Google Project ID in line 288 of your R Markdown file.** + - To locate this information: on the dashboard page of your workspace, you will see **Google Project ID** under the Cloud Information section. The ID follows the format 'terra-xxxxxxxx'. + +4. At the end of the exercise, remember to shut down compute! Refer to the steps [here](https://hutchdatascience.org/AnVIL_Demos/human-genetic-variation-in-mage-wrap-up.html#shut-down-compute). + +# Exercises + +This hands-on activity will help you explore [MAGE](https://www.internationalgenome.org/data-portal/data-collection/mage_rnaseq), an open-access RNA sequencing dataset of lymphoblastoid cell lines from 731 individuals from the [1000 Genomes Project](https://www.internationalgenome.org/). As part of this exploration, we will attempt to recreate various figures from this [paper](https://pmc.ncbi.nlm.nih.gov/articles/PMC11291278/). + +The GitHub repository can be found [here](https://github.com/mccoy-lab/MAGE). 
+ +Processed data can be found in Zenodo [here](https://zenodo.org/records/10535719) or in Dropbox [here](##0). + +By the end of this module, you will be able to: + +- Set up and manage an R analysis environment on AnVIL using cloud-native tools +- Import, reshape, and join multiple genomic data types (expression counts, sample metadata, variant calls, and genome annotations) +- Import data into an AnVIL workspace from multiple sources +- Apply core tidyverse operations including piping, filtering, joining, and mutation +- Extract and visualize expression quantitative trait loci (eQTL) data +- Reproduce figures from a peer-reviewed publication using open-access data + +## Structure of steps + +**Reminder**: The following steps are found in the detailed notebook `Reproducibility_in_Action.Rmd` in your workspace. To set up your workspace, read through [the steps here](setting-up-your-workspace.html#setting-up-your-workspace). + +### Environment Setup + +We will load the three required R packages (`tidyverse`, `vcfR`, `AnVILGCP`) and introduce how R packages work, including the distinction between installation and loading. + +### Recreating Figure 5C: GSTP1 Expression by Genotype + +1. **Importing Expression Counts** + +We will copy a pre-loaded expression counts CSV from the workspace bucket using AnVILGCP functions, then read it into R. + +2. **Reshaping Expression Data** + +The wide-format counts matrix is transposed into tidy format using tidyverse operations, placing samples as rows and genes as columns. + +3. **Importing Metadata** + +Sample metadata is fetched from the bucket and read into R, providing population labels and other per-sample information. + +4. **Joining Counts and Metadata** + +The reshaped counts table and metadata are merged into a single object using an inner join on shared sample identifiers. + +5. 
**Importing Reference Annotations** + +The GENCODE v48 GTF annotation file is downloaded and parsed to identify the Ensembl ID for the gene of interest (GSTP1). + +6. **Extracting Variant Data** + +We will retrieve chromosome 11 variant calls from the 1000 Genomes high-coverage dataset, index the VCF, and use bcftools to subset the file to the specific SNP of interest (rs115070172, position chr11:67,559,635). The VCF is then read into R using vcfR. + +7. **Combining Variants and Expression** + +Genotype calls are joined to GSTP1 expression data, phased alleles are collapsed into unphased genotype groups, and per-genotype sample counts are computed and appended as plot labels. + +8. **Visualization** + +A combined violin plot and boxplot is produced using ggplot2, displaying log2-normalized GSTP1 counts stratified by rs115070172 genotype, closely matching the published figure. + +### Recreating Figure 5D: GSTP1 Expression by Population + +Building on the joined dataset, we will create a new column distinguishing Peruvian (PEL) from non-PEL samples using a conditional mutation, then produce an analogous violin/boxplot stratified by population label. + +### Independent Extension: A Second eQTL + +You will independently apply the full workflow to a second eQTL (rs7927381 × GSTP1), locating the SNP, subsetting the VCF, joining with expression data, and generating a comparable plot. + +### Additional Exploration + +Several optional prompts are provided to you, including faceting plots by population or continental group, visualizing genotype frequency distributions, comparing GSTP1 expression by sex, and replicating Figure 1B using principal component data. 
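
The reshaping, joining, and genotype-collapsing steps outlined in this section can be sketched on toy data as follows. This is a minimal illustration, not the code from `Reproducibility_in_Action.Rmd`: every object name, sample ID, gene name, and value below is hypothetical, and the genotype strings assume the VCF-style phased encoding (`0|1`, `1|0`, and so on).

```r
library(dplyr)
library(tidyr)

# Hypothetical wide counts matrix: one row per gene, one column per sample.
counts_wide <- tibble(
  gene    = c("GENE_A", "GENE_B"),
  sample1 = c(10, 5),
  sample2 = c(20, 8),
  sample3 = c(40, 2)
)

# Hypothetical per-sample metadata and phased genotype calls.
metadata <- tibble(
  sample     = c("sample1", "sample2", "sample3"),
  population = c("PEL", "GBR", "PEL")
)
genotypes <- tibble(
  sample = c("sample1", "sample2", "sample3"),
  gt     = c("0|1", "1|0", "0|0")
)

# Reshape so that samples are rows and genes are columns.
counts_tidy <- counts_wide %>%
  pivot_longer(-gene, names_to = "sample", values_to = "count") %>%
  pivot_wider(names_from = gene, values_from = count)

# Join the three tables on the shared sample identifier, collapse phased
# calls into unphased genotype groups, flag PEL vs. non-PEL samples, and
# attach the number of samples in each genotype group.
combined <- counts_tidy %>%
  inner_join(metadata, by = "sample") %>%
  inner_join(genotypes, by = "sample") %>%
  mutate(
    genotype = case_when(
      gt %in% c("0|1", "1|0") ~ "0/1",
      gt == "0|0" ~ "0/0",
      gt == "1|1" ~ "1/1"
    ),
    pel_group = if_else(population == "PEL", "PEL", "non-PEL")
  ) %>%
  add_count(genotype, name = "n_samples")

print(combined)
```

The `n_samples` column could then be pasted into axis labels before plotting; how the notebook formats those labels is not shown here.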
+ + + + + + + + +## Learn More + +```{r, echo=FALSE} +ottrpal::include_slide("https://docs.google.com/presentation/d/16DHXD2KNyjjP2mKzPDHmNE9loYy6OgqGG0-Kn8CLFak/edit#slide=id.g369240085c8_0_136") +``` + +```{r, echo=FALSE} +ottrpal::include_slide("https://docs.google.com/presentation/d/16DHXD2KNyjjP2mKzPDHmNE9loYy6OgqGG0-Kn8CLFak/edit#slide=id.g369240085c8_0_126") +``` + +## Provide Feedback + +::: {.notice} +Fill out [this poll](https://docs.google.com/forms/d/e/1FAIpQLScrDVb_utm55pmb_SHx-RgELTEbCCWdLea0T3IzS0Oj00GE4w/viewform?usp=pp_url&entry.1565230805=GBCC2025) to share your feedback +::: + +```{r, echo=FALSE} +ottrpal::include_slide("https://docs.google.com/presentation/d/16DHXD2KNyjjP2mKzPDHmNE9loYy6OgqGG0-Kn8CLFak/edit#slide=id.g369240085c8_0_120") +``` + + + + diff --git a/resources/dictionary.txt b/resources/dictionary.txt index d10cf9dc..bce84847 100644 --- a/resources/dictionary.txt +++ b/resources/dictionary.txt @@ -107,3 +107,28 @@ Workspaces Workspace's www Zenodo +bcftools +chr +Ensembl +GENCODE +ggplot +GTF +hackathon +https +lymphoblastoid +mccoy +minihack +ncbi +nih +nlm +omics +PEL +pmc +PMC +preconfiguring +tidyverse +unphased +VCF +vcfR +xxxxxxxx +zenodo