Merge branch 'main' of https://github.com/ncgr/NIGMS-Sandbox-Pangenomics-Module

joannmudge · joannmudge · commit 925e875ae2c8 · 2025-04-30T17:23:48.000Z
diff --git a/module_notebooks/07-variant-calling-with-vg.ipynb b/module_notebooks/07-variant-calling-with-vg.ipynb
@@ -12,7 +12,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Variant Calling with vg"
+    "# Variant Calling with VG"
    ]
   },
   {
@@ -38,7 +38,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Get Started\n",
+    "## Getting Started\n",
     "\n",
     "When calling variants, we use aligned reads to find support for variants contained in the graph. For the original pangenome graph, this finds variants from the assemblies used to make the graph. You can also augment the pangenome graph with novel variants in the reads, creating an augmented pangenome graph that can be used to call variants.\n",
     "\n",
@@ -57,9 +57,9 @@
     "\n",
     "## Call Variants\n",
     "\n",
-    "We will look for two variant types:</mark\n",
+    "We will look for two variant types:\n",
     "- Variants that are supported by the graph.\n",
-    "- Variants that are novel (i.e., not in the graph but supported by the reads aligned to the graph).\n",
+    "- Variants that are novel (i.e. not in the graph but supported by the reads aligned to the graph).\n",
     "\n",
     "We will call variants against the graph, though you could also call variants using the surjected BAM file and traditional variant calling methods."
    ]
@@ -94,10 +94,10 @@
     "The parameters:\n",
     "\n",
     "`-x`  the graph  \n",
-    "`-g`  aligments in gam format  \n",
+    "`-g`  alignments in GAM format  \n",
     "`-Q`  ignore mapping and base qualities < N  \n",
     "`-s`  ignore the first and last N nucleotides of each read  \n",
-    "`-o`  the output pack file  \n",
+    "`-o`  the output PACK file  \n",
     "`-t`  use N threads"
    ]
   },
@@ -119,11 +119,10 @@
     "The parameters:\n",
     "\n",
     "`-k`  The read support file to read in  \n",
-    "`-t`  The number of threads\n",
+    "`-t`  The number of threads  \n",
     "`-z`  Restrict the search to GBZ haplotypes (can improve speed and accuracy); we won't use this here\n",
     "\n",
-    "Also, feed in the graph as a positional argument:  \n",
-    "*graphs/yprp.chrVIII.pggb.giraffe.gbz*"
+    "Also, feed in the graph as a positional argument: *graphs/yprp.chrVIII.pggb.giraffe.gbz*"
    ]
   },
   {
@@ -224,6 +223,17 @@
     "# Statistics for variants supported by the full genome graph"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<details>\n",
+    "<summary>Click for help</summary>\n",
+    "<br>\n",
+    "!bcftools stats variants/yprp.fullgenome.pggb.graph_calls.vcf | grep \"^SN\"\n",
+    "</details>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -232,7 +242,7 @@
     "\n",
     "## Including Novel Variant Calls\n",
     "\n",
-    "1. To call novel variants (i.e., those variants supported by the aligned reads),  we need to embed the variation from the reads we aligned back into the graph. To do this we need to convert the graph into a form that we can change. We will use `vg convert` to convert the .gbz file to a .vg file."
+    "1. To call novel variants (i.e. those variants supported by the aligned reads), we need to embed the variation from the reads we aligned back into the graph. To do this, we need to convert the graph into a form that we can change. We will use `vg convert` to convert the .gbz file to a .vg file."
    ]
   },
   {
@@ -255,7 +265,7 @@
     "`-A`  new, augmented graph with aligned reads  \n",
     "`-t`  the number of threads to use  \n",
     "\n",
-    "Also, feed in the the graph and the input alignment (gam) file as positional arguments:  \n",
+    "Also, feed in the graph and the input alignment (GAM) file as positional arguments:  \n",
     "*graphs/yprp.chrVIII.pggb.giraffe.vg*  \n",
     "*alignments/SK1xyprp.chrVIII.pggb.mapped.gam*"
    ]
@@ -345,12 +355,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "<div class=\"alert alert-block alert-success\"> <b>Try this in the cells below:</b>  \n",
+    "<div class=\"alert alert-block alert-success\"> <b>Try this in the cells below:</b><br/>\n",
+    "Call novel variants for the full genome graph (yprp.fullgenome.pggb.giraffe.gbz) by performing the following steps:\n",
     "    <ul>\n",
-    "        Call novel variants for the full genome graph (*yprp.fullgenome.pggb.giraffe.gbz*) by performing the following steps:\n",
-    "        <li>Convert the graph to vg format.</li>\n",
+    "        <li>Convert the graph to .vg format.</li>\n",
     "        <li>Augment the graph to embed the read alignments.</li>\n",
-    "        <li>Create an index (xg).</li>\n",
+    "        <li>Create an index (.xg).</li>\n",
     "        <li>Compute read support.</li>\n",
     "        <li>Generate a VCF.</li>\n",
     "        <li>Generate statistics.</li>\n",
@@ -434,6 +444,25 @@
     "</details>"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!vg convert graphs/yprp.fullgenome.pggb.giraffe.gbz > graphs/yprp.fullgenome.pggb.giraffe.vg\n",
+    "\n",
+    "!vg augment graphs/yprp.fullgenome.pggb.giraffe.vg alignments/SK1xyprp.fullgenome.pggb.mapped.gam -A alignments/SK1xyprp.fullgenome.pggb.mapped.aug.gam -t 4 > graphs/SK1xyprp.fullgenome.pggb.aug.vg\n",
+    "\n",
+    "!vg index -t 4 -x graphs/SK1xyprp.fullgenome.pggb.aug.xg graphs/SK1xyprp.fullgenome.pggb.aug.vg\n",
+    "\n",
+    "!vg pack -x graphs/SK1xyprp.fullgenome.pggb.aug.xg -g alignments/SK1xyprp.fullgenome.pggb.mapped.aug.gam -Q 5 -s 5 -o alignments/SK1xyprp.fullgenome.pggb.mapped.aug.pack -t 4\n",
+    "\n",
+    "!vg call graphs/SK1xyprp.fullgenome.pggb.aug.xg -k alignments/SK1xyprp.fullgenome.pggb.mapped.aug.pack -t 4 > variants/SK1xyprp.fullgenome.pggb.aug_calls.vcf\n",
+    "\n",
+    "!bcftools stats variants/SK1xyprp.fullgenome.pggb.aug_calls.vcf | grep \"^SN\""
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -479,7 +508,7 @@
    ]
   },
   {
-   "cell_type": "raw",
+   "cell_type": "markdown",
    "metadata": {},
    "source": [
     "We will use the original chrVIII graph and export variants onto the S288C paths."
@@ -552,7 +581,7 @@
     "\n",
     "## Visualize the original and augmented graphs\n",
     "\n",
-    "1. Compare CUP1 region of the original graph (yprp.chrVIII.pggb.gfa) and augmented graph (SK1xyprp.chrVIII.pggb.aug.vg, which we will convert to SK1xyprp.chrVIII.pggb.aug.gfa) in bandage. First, convert the augmented graph to gfa format.\n",
+    "1. Compare the CUP1 region of the original graph (yprp.chrVIII.pggb.gfa) and augmented graph (SK1xyprp.chrVIII.pggb.aug.vg, which we will convert to SK1xyprp.chrVIII.pggb.aug.gfa) in Bandage. First, convert the augmented graph to GFA format.\n",
     "\n",
     "The parameters:\n",
     "\n",
@@ -577,7 +606,7 @@
    "source": [
     "2. Then visualize the CUP1 region of each .gfa file in Bandage.\n",
     "\n",
-    "<div class=\"alert alert-block alert-info\"> <b>NOTE:</b> The augmented graph is much bigger and it will have difficulty loading the entire chrVIII graph. So, before drawing the graph, blast the genes. Change the \"Scope\" to \"Around query hits\". Change \"Distance\" to 200 for the augmented graph. Then click \"Draw Graph\".  \n"
+    "<div class=\"alert alert-block alert-info\"> <b>NOTE:</b> The augmented graph is much bigger, and Bandage will have difficulty loading the entire chrVIII graph. So, before drawing the graph, BLAST the genes, then change the \"Scope\" to \"Around query hits\" and change \"Distance\" to 200 for the augmented graph. Finally, click \"Draw Graph\" to apply these changes.</div>\n"
    ]
   },
   {
@@ -647,7 +676,7 @@
     "\n",
     "Congratulations, you have completed the pangenomics module!\n",
     "\n",
-    "In this module, you learned about pangenomics and used a yeast dataset to build pangenomics graphs using PGGB. You learned how to search these graphs for regions that match DNA sequence queries using BLAST and how to interactively visualize these graphs using Bandage. In addition, you learned how to index the graphs, map reads to the graphs, and call variants. Well done!\n",
+    "In this module, you learned about pangenomics and used a yeast dataset to build pangenome graphs using PGGB. You learned how to search these graphs for regions that match DNA sequence queries using BLAST and how to interactively visualize these graphs using Bandage. In addition, you learned how to use VG to index the graphs, map reads to the graphs, and call variants. Well done!\n",
     "\n",
     "----------------------"
    ]
@@ -670,7 +699,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "nigms-pangenomics",
+   "display_name": "nigms-pangenomics (Local) (Local)",
    "language": "python",
    "name": "conda-env-nigms-pangenomics-nigms-pangenomics"
   },
@@ -684,7 +713,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.9"
+   "version": "3.12.10"
   }
  },
  "nbformat": 4,