Minor changes to intro notebook

alancleary · alancleary · commit e8d21d9aa093 · 2025-04-28T18:42:56.000Z
diff --git a/module_notebooks/01-intro-to-pangenomics.ipynb b/module_notebooks/01-intro-to-pangenomics.ipynb
@@ -22,7 +22,7 @@
     "\n",
     "## Overview\n",
     "\n",
-    "Pangenome graphs are representations of related genomes that enable exploration of the relationships of the genomes to one another, their commonalities and novelties, and their collective genetic variation. You will learn about different types of pangenomic graphs and their strengths and weaknesses.\n",
+    "Pangenome graphs are representations of related genomes that enable exploration of the relationships of the genomes to one another, their commonalities and novelties, and their collective genetic variation. In this submodule, you will learn about different types of pangenomic graphs and their strengths and weaknesses.\n",
     "\n",
     "A brief overview of pangenomes and this module is available in this video XXX."
    ]
@@ -42,19 +42,20 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Get Started\n",
+    "## Getting Started\n",
     "\n",
     "In this submodule you will learn about pangenomic data and different types of graphs, including their pros and cons. In addition, you will learn about the pangenomics pipeline that will be used in this module.\n",
     "\n",
     "#### Pangenomics pipeline\n",
     "- Build graphs\n",
+    "- Index graphs\n",
     "- Map reads\n",
     "- Call variants\n",
-    "- Visualize graphs and mapped reads3\n",
+    "- Visualize graphs and mapped reads\n",
     "\n",
     "----------------\n",
     "\n",
-    "## What is a \"pangenome\"?\n",
+    "## What is a \"Pangenome\"?\n",
     "\n",
     "The term “pangenome” was first coined by [Sigaux et al. (2000)](https://pubmed.ncbi.nlm.nih.gov/11261250/) and was used to describe a public database containing an assessment of genome and transcriptome alterations in major types of tumors, tissues, and experimental models. The figure below shows the use of the term \"pan genome\" in the abstract of the [Sigaux et al. (2000)](https://pubmed.ncbi.nlm.nih.gov/11261250/) paper.\n",
     "\n",
@@ -65,7 +66,7 @@
     "  <figcaption><a href=\"https://pubmed.ncbi.nlm.nih.gov/11261250/\">https://pubmed.ncbi.nlm.nih.gov/11261250/</a></figcaption>\n",
     "</figure>\n",
     "\n",
-    "The term was later revitalized by [Tettelin et al. (2005)](https://pubmed.ncbi.nlm.nih.gov/16172379/) to describe a microbial genome by which genes were in the core (present in all strains) and which genes were dispensable (missing from one or more of the strains). The figure below shows the use of the term \"pan-genome\" in the abstract of the [Tettelin et al. (2005)](https://pubmed.ncbi.nlm.nih.gov/16172379/) paper. This paper also introduces the concept of a \"core genome\", consisting of genomic regions or genes that are present in all the strains that were analyzed and a \"dispensible genome\" consisting of regions or genes present in only one or some of the strains. \"Core genome\" sequences are presumed to be critical to the species because they have been retained by all of the strains.\n",
+    "The term was later revitalized by [Tettelin et al. (2005)](https://pubmed.ncbi.nlm.nih.gov/16172379/) to describe a microbial genome in terms of which genes were in the core (present in all strains) and which genes were dispensable (missing from one or more of the strains). The figure below shows the use of the term \"pan-genome\" in the abstract of the [Tettelin et al. (2005)](https://pubmed.ncbi.nlm.nih.gov/16172379/) paper. This paper also introduces the concept of a \"core genome,\" consisting of genomic regions or genes that are present in all the strains that were analyzed, and a \"dispensable genome,\" consisting of regions or genes present in only one or some of the strains. \"Core genome\" sequences are presumed to be critical to the species because they have been retained by all of the strains.\n",
     "\n",
     "<figure>\n",
     "  <img\n",
@@ -97,7 +98,7 @@
     "\n",
     "### Then vs. Now\n",
     "\n",
-    "Pangenomes are becoming increasingly feasible because sequencing costs have dropped precipitously while throughput has increased rapidly. The figure below ([Wetterstrand, 2023](https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data)) shows the cost of sequencing a human genome tracking Moore's law until the Next Generation Sequencing Revolution, in which several new sequencing technologies were introduced, manifested around 2007/2008. Since then, the cost of sequencing a human genome has dropped approximately 10,000 fold.\n",
+    "Pangenomes are becoming increasingly feasible because sequencing costs have dropped precipitously while throughput has increased rapidly. The figure below ([Wetterstrand, 2023](https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data)) shows the cost of sequencing a human genome starting with the first draft of the Human Genome Reference in 2001. Initially, the cost decreased at the rate predicted by [Moore's law](https://en.wikipedia.org/wiki/Moore%27s_law), halving approximately every 2 years. However, between 2007 and 2008 the Next Generation Sequencing Revolution occurred, in which several new sequencing technologies were introduced, causing the cost of DNA sequencing to plummet. Since then, the cost of sequencing a human genome has dropped approximately 10,000-fold, eventually stabilizing around $1,000 per genome.\n",
     "\n",
     "<figure>\n",
     "  <img\n",
@@ -107,7 +108,7 @@
     "</figure>\n",
     "\n",
     "\n",
-    "These cost reductions, along with improvements in the generation of high quality long sequencing reads such as [PacBio Hifi](https://www.pacb.com/technology/hifi-sequencing/) reads, have enabled the sequencing of multiple, high quality genomes within many different species, faciliting the construction of pangenomes and resulting in a steady increase in the number of publications mentioning pangenomes [(Bayer et al., 2020)](https://www.nature.com/articles/s41477-020-0733-0).\n",
+    "These cost reductions, along with improvements in the generation of high-quality long sequencing reads (such as [PacBio HiFi](https://www.pacb.com/technology/hifi-sequencing/) reads), have enabled the sequencing of multiple high-quality genomes within many different species, facilitating the construction of pangenomes and resulting in a steady increase in the number of publications mentioning pangenomes [(Bayer et al., 2020)](https://www.nature.com/articles/s41477-020-0733-0).\n",
     "\n",
     "<figure>\n",
     "  <img\n",
@@ -146,6 +147,8 @@
     "\n",
     "##  Computational Pangenomics\n",
     "\n",
+    "The prevalence of pangenomic data sets has given rise to a new subfield of genomics known as <em>Computational Pangenomics</em>:\n",
+    "\n",
     "“Questions about efficient data structures, algorithms and statistical methods to perform bioinformatic analyses of pan-genomes give rise to the discipline of ‘computational pan-genomics’.” – [Computational Pangenomics Consortium](https://academic.oup.com/bib/article/19/1/118/2566735)\n",
     "\n",
     "<figure>\n",
@@ -170,7 +173,7 @@
     "\n",
     "### Variation Graphs\n",
     "\n",
-    "We will focus on variation graphs, which most pangenomic tools use. These graphs consist of nodes that contain sequence data, edges that connect the nodes, and paths that thread genomes, chromosomes, haplotypes, genes, or other sequence information through the graph, thereby connecting original sequence inputs directly with walks through the graph. Variation graphs retain all variation present in the original input sequences and balance construction efficiency with ease of visualization and interpretation.\n",
+    "In this module, we will focus on variation graphs, which most pangenomic tools use. These graphs consist of nodes that contain sequence data, edges that connect the nodes, and paths that thread genomes, chromosomes, haplotypes, genes, or other sequence information through the graph, thereby connecting original sequence inputs directly with walks through the graph. Variation graphs retain all variation present in the original input sequences and balance construction efficiency with ease of visualization and interpretation.\n",
     "\n",
     "+ Variation forms bubbles\n",
     "+ Nodes represent sequences\n",
@@ -190,7 +193,7 @@
     "\n",
     "1. Reference Graph (vg)\n",
     "      + A reference with variants\n",
-    "      + e.g., [Human reference now includes VCF with common variation](https://www.ncbi.nlm.nih.gov/genome/guide/human/)\n",
+    "      + e.g. [Human reference now includes VCF with common variation](https://www.ncbi.nlm.nih.gov/genome/guide/human/)\n",
     "2. Reference Backbone; “iterative” (minigraph)\n",
     "      + Graph starts as reference and other sequences are layered on, i.e. variants can be relative to sequences other than the reference\n",
     "3. Reference-Free (Cactus and pggb)\n",
@@ -200,7 +203,7 @@
     "\n",
     "### Mapping Reads to Variation Graphs\n",
     "\n",
-    "You can map sequencing reads from an individual to variation graphs, splitting the reads across matchin nodes as shown below [(Hickey et al, 2020)](https://link.springer.com/article/10.1186/s13059-020-1941-7). This allows you to identify which of the nodes (and, therefore, which variants) are supported by the reads. This read support can then be converted into genotypes for individual you sequenced and aligned to the variation graph.\n",
+    "You can map sequencing reads from an individual to variation graphs, splitting the reads across matching nodes as shown below [(Hickey et al., 2020)](https://link.springer.com/article/10.1186/s13059-020-1941-7). This allows you to identify which of the nodes (and, therefore, which variants) are supported by the reads. This read support can then be converted into genotypes for the individuals you sequenced and aligned to the variation graph.\n",
     "\n",
     "<figure>\n",
     "  <img\n",
@@ -213,7 +216,7 @@
     "\n",
     "## Pangenome Data Sets\n",
     "\n",
-    "Here are some links to some pangenome data sets you can explore.\n",
+    "Here are some pangenome data sets you can explore. In this module, we will use a subset of the YPRP data set.\n",
     "\n",
     "+ [Human Reference + Variation VCF](https://www.ncbi.nlm.nih.gov/genome/guide/human/)\n",
     "+ [Human Pangenome Reference Consortium](https://humanpangenome.org)\n",
@@ -235,9 +238,32 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "        <iframe\n",
+       "            width=\"800\"\n",
+       "            height=\"400\"\n",
+       "            src=\"../html/quiz_pangenomes.html\"\n",
+       "            frameborder=\"0\"\n",
+       "            allowfullscreen\n",
+       "            \n",
+       "        ></iframe>\n",
+       "        "
+      ],
+      "text/plain": [
+       "<IPython.lib.display.IFrame at 0x7f2dee2f42f0>"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "from IPython.display import IFrame\n",
     "IFrame('../html/quiz_pangenomes.html', width=800, height=400)"
@@ -250,7 +276,7 @@
     "----------------------\n",
     "\n",
     "## Conclusion\n",
-    "This module gave an overview of pangenomes and the different types of pangenome representations, focusing on variation graphs. In the next submodule, you will build some pangenome graphs.\n",
+    "This submodule gave an overview of pangenomes and the different types of pangenome representations, focusing on variation graphs. In the next submodule, you will build some pangenome graphs.\n",
     "\n",
     "----------------"
    ]
@@ -266,8 +292,14 @@
   }
  ],
  "metadata": {
+  "environment": {
+   "kernel": "conda-env-nigms-pangenomics-nigms-pangenomics",
+   "name": "workbench-notebooks.m129",
+   "type": "gcloud",
+   "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m129"
+  },
   "kernelspec": {
-   "display_name": "nigms-pangenomics",
+   "display_name": "nigms-pangenomics (Local)",
    "language": "python",
    "name": "conda-env-nigms-pangenomics-nigms-pangenomics"
   },
@@ -281,7 +313,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.9"
+   "version": "3.12.10"
   }
  },
  "nbformat": 4,