|
21 | 21 | "source": [ |
22 | 22 | "\n", |
23 | 23 | "## Overview\n", |
24 | | - "Here you will learn how to search graphs with BLAST. In other words, you can use a DNA sequence, such as your favorite gene, to search the pangenomic graph, discover the structure of the graph, and explore homologous sequences." |
 | 24 | + "In this submodule, you will learn how to search graphs with BLAST. In other words, you can use a DNA sequence, such as your favorite gene, to search the pangenomic graph, discover the structure of the graph, and explore homologous sequences." |
25 | 25 | ] |
26 | 26 | }, |
27 | 27 | { |
|
77 | 77 | }, |
78 | 78 | { |
79 | 79 | "cell_type": "code", |
80 | | - "execution_count": null, |
| 80 | + "execution_count": 1, |
81 | 81 | "metadata": {}, |
82 | 82 | "outputs": [], |
83 | 83 | "source": [ |
|
95 | 95 | }, |
96 | 96 | { |
97 | 97 | "cell_type": "code", |
98 | | - "execution_count": null, |
| 98 | + "execution_count": 2, |
99 | 99 | "metadata": {}, |
100 | | - "outputs": [], |
| 100 | + "outputs": [ |
| 101 | + { |
| 102 | + "name": "stdout", |
| 103 | + "output_type": "stream", |
| 104 | + "text": [ |
| 105 | + ">S288C_chrVIII:213043-213228/rc\n", |
| 106 | + "ATGTTCAGCGAATTAATTAACTTCCAAAATGAAGGTCATGAGTGCCAATG\n", |
| 107 | + "CCAATGTGGTAGCTGCAAAAATAATGAACAATGCCAAAAATCATGTAGCT\n", |
| 108 | + "GCCCAACGGGGTGTAACAGCGACGACAAATGCCCCTGCGGTAACAAGTCT\n", |
| 109 | + "GAAGAAACCAAGAAGTCATGCTGCTCTGGGAAATGA\n", |
| 110 | + ">S288C_chrVIII:213693-214757/rc\n", |
| 111 | + "ATGGTACCCGCTGCTGAAAACCTATCTCCGATACCTGCCTCTATTGATAC\n", |
| 112 | + "GAACGACATTCCTTTAATTGCTAACGATTTAAAATTACTGGAAACGCAAG\n", |
| 113 | + "CAAAATTGATAAATATTCTGCAAGGTGTTCCTTTCTACTTGCCAGTAAAT\n", |
| 114 | + "TTAACCAAAATTGAAAGTCTGTTAGAAACCTTGACTATGGGCGTGAGTAA\n", |
| 115 | + "TACAGTAGACTTATATTTTCATGACAACGAAGTCAGAAAAGAATGGAAAG\n", |
| 116 | + "ACACTTTAAATTTTATCAATACCATTGTTTATACAAATTTTTTCCTTTTT\n", |
| 117 | + "GTTCAAAACGAATCCTCTTTGTCCATGGCAGTTCAACATTCTTCTAACAA\n", |
| 118 | + "CAATAAGACCTCGAACTCTGAAAGATGTGCAAAGGATCTGATGAAAATTA\n", |
| 119 | + "TTTCTAATATGCACATTTTTTACTCAATAACATTTAATTTTATCTTCCCC\n", |
| 120 | + "ATAAAGTCGATAAAGTCATTTTCAAGCGGCAATAATCGCTTTCATTCTAA\n", |
| 121 | + "TGGTAAAGAATTTTTATTCGCAAATCATTTTATTGAAATCTTACAGAATT\n", |
| 122 | + "TTATAGCAATCACATTTGCTATTTTCCAACGTTGTGAAGTAATATTATAT\n", |
| 123 | + "GACGAATTTTACAAAAATCTTTCAAATGAGGAGATTAATGTTCAATTGCT\n", |
| 124 | + "ATTGATTCATGACAAGATTTTGGAAATTTTAAAAAAAATAGAAATTATCG\n", |
| 125 | + "TATCCTTTTTACGAGATGAAATGAATAGCAACGGAAGTTTCAAATCTATT\n", |
| 126 | + "AAAGGTTTCAACAAGGTTTTGAATCTGATTAAATATATGCTGAGATTTAG\n", |
| 127 | + "CAAGAAAAAACAAAATTTTGCGAGAAACTCTGATAACAATAATGTTACAG\n", |
| 128 | + "ATTATAGTCAGTCGGCGAAGAACAAAAATGTTCTCTTGAAATTCCCCGTT\n", |
| 129 | + "AGTGAACTGAACAGAATCTATTTAAAATTTAAGGAGATTTCAGATTTTTT\n", |
| 130 | + "AATGGAAAGAGAAGTTGTCCAAAGGAGTATAATTATTGACAAGGATTTGG\n", |
| 131 | + "AATCTGATAATCTGGGTATTACTACGGCAAACTTCAACGATTTCTATGAT\n", |
| 132 | + "GCATTTTATAATTAG\n" |
| 133 | + ] |
| 134 | + } |
| 135 | + ], |
101 | 136 | "source": [ |
102 | 137 | "!cat genes/genes.fa" |
103 | 138 | ] |
|
115 | 150 | }, |
116 | 151 | { |
117 | 152 | "cell_type": "code", |
118 | | - "execution_count": null, |
| 153 | + "execution_count": 3, |
119 | 154 | "metadata": {}, |
120 | 155 | "outputs": [], |
121 | 156 | "source": [ |
|
133 | 168 | }, |
134 | 169 | { |
135 | 170 | "cell_type": "code", |
136 | | - "execution_count": null, |
| 171 | + "execution_count": 4, |
137 | 172 | "metadata": {}, |
138 | | - "outputs": [], |
| 173 | + "outputs": [ |
| 174 | + { |
| 175 | + "name": "stdout", |
| 176 | + "output_type": "stream", |
| 177 | + "text": [ |
| 178 | + ">CUP1\n", |
| 179 | + "ATGTTCAGCGAATTAATTAACTTCCAAAATGAAGGTCATGAGTGCCAATG\n", |
| 180 | + "CCAATGTGGTAGCTGCAAAAATAATGAACAATGCCAAAAATCATGTAGCT\n", |
| 181 | + "GCCCAACGGGGTGTAACAGCGACGACAAATGCCCCTGCGGTAACAAGTCT\n", |
| 182 | + "GAAGAAACCAAGAAGTCATGCTGCTCTGGGAAATGA\n", |
| 183 | + ">YHR054C\n", |
| 184 | + "ATGGTACCCGCTGCTGAAAACCTATCTCCGATACCTGCCTCTATTGATAC\n", |
| 185 | + "GAACGACATTCCTTTAATTGCTAACGATTTAAAATTACTGGAAACGCAAG\n", |
| 186 | + "CAAAATTGATAAATATTCTGCAAGGTGTTCCTTTCTACTTGCCAGTAAAT\n", |
| 187 | + "TTAACCAAAATTGAAAGTCTGTTAGAAACCTTGACTATGGGCGTGAGTAA\n", |
| 188 | + "TACAGTAGACTTATATTTTCATGACAACGAAGTCAGAAAAGAATGGAAAG\n", |
| 189 | + "ACACTTTAAATTTTATCAATACCATTGTTTATACAAATTTTTTCCTTTTT\n", |
| 190 | + "GTTCAAAACGAATCCTCTTTGTCCATGGCAGTTCAACATTCTTCTAACAA\n", |
| 191 | + "CAATAAGACCTCGAACTCTGAAAGATGTGCAAAGGATCTGATGAAAATTA\n", |
| 192 | + "TTTCTAATATGCACATTTTTTACTCAATAACATTTAATTTTATCTTCCCC\n", |
| 193 | + "ATAAAGTCGATAAAGTCATTTTCAAGCGGCAATAATCGCTTTCATTCTAA\n", |
| 194 | + "TGGTAAAGAATTTTTATTCGCAAATCATTTTATTGAAATCTTACAGAATT\n", |
| 195 | + "TTATAGCAATCACATTTGCTATTTTCCAACGTTGTGAAGTAATATTATAT\n", |
| 196 | + "GACGAATTTTACAAAAATCTTTCAAATGAGGAGATTAATGTTCAATTGCT\n", |
| 197 | + "ATTGATTCATGACAAGATTTTGGAAATTTTAAAAAAAATAGAAATTATCG\n", |
| 198 | + "TATCCTTTTTACGAGATGAAATGAATAGCAACGGAAGTTTCAAATCTATT\n", |
| 199 | + "AAAGGTTTCAACAAGGTTTTGAATCTGATTAAATATATGCTGAGATTTAG\n", |
| 200 | + "CAAGAAAAAACAAAATTTTGCGAGAAACTCTGATAACAATAATGTTACAG\n", |
| 201 | + "ATTATAGTCAGTCGGCGAAGAACAAAAATGTTCTCTTGAAATTCCCCGTT\n", |
| 202 | + "AGTGAACTGAACAGAATCTATTTAAAATTTAAGGAGATTTCAGATTTTTT\n", |
| 203 | + "AATGGAAAGAGAAGTTGTCCAAAGGAGTATAATTATTGACAAGGATTTGG\n", |
| 204 | + "AATCTGATAATCTGGGTATTACTACGGCAAACTTCAACGATTTCTATGAT\n", |
| 205 | + "GCATTTTATAATTAG\n" |
| 206 | + ] |
| 207 | + } |
| 208 | + ], |
139 | 209 | "source": [ |
140 | 210 | "!cat genes/genes.fa" |
141 | 211 | ] |
|
155 | 225 | "\n", |
156 | 226 | "## BLAST\n", |
157 | 227 | "\n", |
158 | | - "The Basic Local Alignment Search Tool (BLAST) tool allows you to compare DNA sequences in order to efficiently identify the best matches. Here we will use BLAST to search the DNA sequences in the pangenome for matches to two adjacent genes.\n", |
159 | | - "\n", |
160 | | - "Altschul, Stephen F., et al. \"Basic local alignment search tool.\" Journal of molecular biology 215.3 (1990): 403-410.\n", |
 | 228 | + "The Basic Local Alignment Search Tool (BLAST) allows you to compare DNA sequences in order to efficiently identify the best matches. Here we will use BLAST to search the DNA sequences in the pangenome for matches to two adjacent genes ([Altschul, Stephen F., et al. 1990](https://doi.org/10.1016/S0022-2836(05)80360-2)).\n", |
161 | 229 | "\n", |
162 | 230 | "\n", |
163 | 231 | "### BLAST the graph manually\n", |
|
170 | 238 | }, |
171 | 239 | { |
172 | 240 | "cell_type": "code", |
173 | | - "execution_count": null, |
| 241 | + "execution_count": 5, |
174 | 242 | "metadata": {}, |
175 | | - "outputs": [], |
| 243 | + "outputs": [ |
| 244 | + { |
| 245 | + "name": "stdout", |
| 246 | + "output_type": "stream", |
| 247 | + "text": [ |
| 248 | + "[M::main] Version: 0.4-r214-dirty\n", |
| 249 | + "[M::main] CMD: gfatools gfa2fa graphs/yprp.chrVIII.pggb.gfa\n", |
| 250 | + "[M::main] Real time: 0.020 sec; CPU: 0.021 sec\n" |
| 251 | + ] |
| 252 | + } |
| 253 | + ], |
176 | 254 | "source": [ |
177 | 255 | "!gfatools gfa2fa graphs/yprp.chrVIII.pggb.gfa > graphs/yprp.chrVIII.pggb.fa" |
178 | 256 | ] |
|
192 | 270 | }, |
193 | 271 | { |
194 | 272 | "cell_type": "code", |
195 | | - "execution_count": null, |
| 273 | + "execution_count": 6, |
196 | 274 | "metadata": {}, |
197 | | - "outputs": [], |
| 275 | + "outputs": [ |
| 276 | + { |
| 277 | + "name": "stdout", |
| 278 | + "output_type": "stream", |
| 279 | + "text": [ |
| 280 | + "\n", |
| 281 | + "\n", |
| 282 | + "Building a new DB, current time: 04/28/2025 21:33:30\n", |
| 283 | + "New DB name: /home/jupyter/NIGMS-Sandbox-Pangenomics-Module/module_notebooks/graphs/yprp.chrVIII.pggb.fa\n", |
| 284 | + "New DB title: graphs/yprp.chrVIII.pggb.fa\n", |
| 285 | + "Sequence type: Nucleotide\n", |
| 286 | + "Keep MBits: T\n", |
| 287 | + "Maximum file size: 3000000000B\n", |
| 288 | + "Adding sequences from FASTA; added 19252 sequences in 0.19645 seconds.\n", |
| 289 | + "\n", |
| 290 | + "\n" |
| 291 | + ] |
| 292 | + } |
| 293 | + ], |
198 | 294 | "source": [ |
199 | 295 | "!makeblastdb -in graphs/yprp.chrVIII.pggb.fa -input_type fasta -dbtype nucl" |
200 | 296 | ] |
|
214 | 310 | }, |
215 | 311 | { |
216 | 312 | "cell_type": "code", |
217 | | - "execution_count": null, |
| 313 | + "execution_count": 7, |
218 | 314 | "metadata": {}, |
219 | 315 | "outputs": [], |
220 | 316 | "source": [ |
|
249 | 345 | }, |
250 | 346 | { |
251 | 347 | "cell_type": "code", |
252 | | - "execution_count": null, |
| 348 | + "execution_count": 8, |
253 | 349 | "metadata": {}, |
254 | | - "outputs": [], |
| 350 | + "outputs": [ |
| 351 | + { |
| 352 | + "name": "stdout", |
| 353 | + "output_type": "stream", |
| 354 | + "text": [ |
| 355 | + "CUP1\t7899\t100.000\t186\t0\t0\t1\t186\t456\t271\t9.20e-97\t344\n", |
| 356 | + "CUP1\t7715\t100.000\t186\t0\t0\t1\t186\t1524\t1339\t9.20e-97\t344\n", |
| 357 | + "CUP1\t7715\t100.000\t186\t0\t0\t1\t186\t5523\t5338\t9.20e-97\t344\n", |
| 358 | + "CUP1\t7715\t100.000\t186\t0\t0\t1\t186\t7521\t7336\t9.20e-97\t344\n", |
| 359 | + "CUP1\t7715\t99.462\t186\t0\t1\t1\t186\t3521\t3337\t1.54e-94\t337\n", |
| 360 | + "CUP1\t7773\t99.457\t184\t1\t0\t1\t184\t1130\t947\t5.54e-94\t335\n", |
| 361 | + "CUP1\t7851\t100.000\t164\t0\t0\t1\t164\t164\t1\t1.56e-84\t303\n", |
| 362 | + "CUP1\t7790\t100.000\t164\t0\t0\t1\t164\t164\t1\t1.56e-84\t303\n", |
| 363 | + "CUP1\t7732\t100.000\t164\t0\t0\t1\t164\t164\t1\t1.56e-84\t303\n", |
| 364 | + "CUP1\t7698\t100.000\t164\t0\t0\t1\t164\t164\t1\t1.56e-84\t303\n", |
| 365 | + "CUP1\t7638\t100.000\t164\t0\t0\t1\t164\t164\t1\t1.56e-84\t303\n", |
| 366 | + "CUP1\t7602\t100.000\t134\t0\t0\t31\t164\t134\t1\t7.42e-68\t248\n", |
| 367 | + "CUP1\t7605\t100.000\t29\t0\t0\t1\t29\t29\t1\t1.74e-09\t54.7\n", |
| 368 | + "YHR054C\t7715\t100.000\t1065\t0\t0\t1\t1065\t3053\t1989\t0.0\t1967\n", |
| 369 | + "YHR054C\t7715\t100.000\t1065\t0\t0\t1\t1065\t7052\t5988\t0.0\t1967\n", |
| 370 | + "YHR054C\t7715\t100.000\t1065\t0\t0\t1\t1065\t9050\t7986\t0.0\t1967\n", |
| 371 | + "YHR054C\t7715\t99.626\t1069\t0\t4\t1\t1065\t5054\t3986\t0.0\t1949\n", |
| 372 | + "YHR054C\t7715\t100.000\t1054\t0\t0\t1\t1054\t1054\t1\t0.0\t1947\n", |
| 373 | + "YHR054C\t7899\t99.905\t1053\t1\t0\t13\t1065\t1973\t921\t0.0\t1940\n", |
| 374 | + "YHR054C\t7883\t100.000\t286\t0\t0\t428\t713\t286\t1\t1.27e-151\t529\n", |
| 375 | + "YHR054C\t7821\t100.000\t286\t0\t0\t428\t713\t286\t1\t1.27e-151\t529\n", |
| 376 | + "YHR054C\t7763\t100.000\t286\t0\t0\t428\t713\t286\t1\t1.27e-151\t529\n", |
| 377 | + "YHR054C\t7671\t100.000\t261\t0\t0\t428\t688\t261\t1\t1.01e-137\t483\n", |
| 378 | + "YHR054C\t7892\t100.000\t210\t0\t0\t126\t335\t210\t1\t2.26e-109\t388\n", |
| 379 | + "YHR054C\t7827\t100.000\t210\t0\t0\t126\t335\t210\t1\t2.26e-109\t388\n", |
| 380 | + "YHR054C\t7769\t100.000\t210\t0\t0\t126\t335\t210\t1\t2.26e-109\t388\n", |
| 381 | + "YHR054C\t7677\t100.000\t210\t0\t0\t126\t335\t210\t1\t2.26e-109\t388\n", |
| 382 | + "YHR054C\t7874\t100.000\t129\t0\t0\t838\t966\t129\t1\t2.40e-64\t239\n", |
| 383 | + "YHR054C\t7812\t100.000\t129\t0\t0\t838\t966\t129\t1\t2.40e-64\t239\n", |
| 384 | + "YHR054C\t7754\t100.000\t129\t0\t0\t838\t966\t129\t1\t2.40e-64\t239\n", |
| 385 | + "YHR054C\t7660\t100.000\t129\t0\t0\t838\t966\t129\t1\t2.40e-64\t239\n", |
| 386 | + "YHR054C\t7895\t100.000\t111\t0\t0\t14\t124\t111\t1\t2.44e-54\t206\n", |
| 387 | + "YHR054C\t7830\t100.000\t111\t0\t0\t14\t124\t111\t1\t2.44e-54\t206\n", |
| 388 | + "YHR054C\t7772\t100.000\t111\t0\t0\t14\t124\t111\t1\t2.44e-54\t206\n", |
| 389 | + "YHR054C\t7680\t100.000\t111\t0\t0\t14\t124\t111\t1\t2.44e-54\t206\n", |
| 390 | + "YHR054C\t7877\t100.000\t99\t0\t0\t738\t836\t99\t1\t1.14e-47\t183\n", |
| 391 | + "YHR054C\t7815\t100.000\t99\t0\t0\t738\t836\t99\t1\t1.14e-47\t183\n", |
| 392 | + "YHR054C\t7757\t100.000\t99\t0\t0\t738\t836\t99\t1\t1.14e-47\t183\n", |
| 393 | + "YHR054C\t7663\t100.000\t99\t0\t0\t738\t836\t99\t1\t1.14e-47\t183\n", |
| 394 | + "YHR054C\t7809\t100.000\t98\t0\t0\t968\t1065\t108\t11\t4.11e-47\t182\n", |
| 395 | + "YHR054C\t7751\t100.000\t98\t0\t0\t968\t1065\t108\t11\t4.11e-47\t182\n", |
| 396 | + "YHR054C\t7657\t100.000\t98\t0\t0\t968\t1065\t108\t11\t4.11e-47\t182\n", |
| 397 | + "YHR054C\t7824\t100.000\t90\t0\t0\t337\t426\t90\t1\t1.15e-42\t167\n", |
| 398 | + "YHR054C\t7766\t100.000\t90\t0\t0\t337\t426\t90\t1\t1.15e-42\t167\n", |
| 399 | + "YHR054C\t7674\t100.000\t90\t0\t0\t337\t426\t90\t1\t1.15e-42\t167\n", |
| 400 | + "YHR054C\t7871\t100.000\t87\t0\t0\t968\t1054\t87\t1\t5.35e-41\t161\n", |
| 401 | + "YHR054C\t7886\t100.000\t75\t0\t0\t352\t426\t75\t1\t2.51e-34\t139\n" |
| 402 | + ] |
| 403 | + } |
| 404 | + ], |
255 | 405 | "source": [ |
256 | 406 | "!cat genes/genesXyprp.chrVIII.pggb.fa.txt" |
257 | 407 | ] |
|
293 | 443 | }, |
294 | 444 | { |
295 | 445 | "cell_type": "code", |
296 | | - "execution_count": null, |
| 446 | + "execution_count": 12, |
297 | 447 | "metadata": {}, |
298 | | - "outputs": [], |
| 448 | + "outputs": [ |
| 449 | + { |
| 450 | + "data": { |
| 451 | + "text/html": [ |
| 452 | + "\n", |
| 453 | + " <iframe\n", |
| 454 | + " width=\"800\"\n", |
| 455 | + " height=\"400\"\n", |
| 456 | + " src=\"../html/flashcard_blastout.html\"\n", |
| 457 | + " frameborder=\"0\"\n", |
| 458 | + " allowfullscreen\n", |
| 459 | + " \n", |
| 460 | + " ></iframe>\n", |
| 461 | + " " |
| 462 | + ], |
| 463 | + "text/plain": [ |
| 464 | + "<IPython.lib.display.IFrame at 0x7f1ac5ddcf80>" |
| 465 | + ] |
| 466 | + }, |
| 467 | + "execution_count": 12, |
| 468 | + "metadata": {}, |
| 469 | + "output_type": "execute_result" |
| 470 | + } |
| 471 | + ], |
299 | 472 | "source": [ |
300 | 473 | "from IPython.display import IFrame\n", |
301 | | - "IFrame('../html/blastout.html', width=800, height=400)" |
| 474 | + "IFrame('../html/flashcard_blastout.html', width=800, height=400)" |
302 | 475 | ] |
303 | 476 | }, |
304 | 477 | { |
|
309 | 482 | "\n", |
310 | 483 | "## Conclusion\n", |
311 | 484 | "\n", |
312 | | - "You learned how to blast against a pangenomic graph. Specifically, you searched for the CUP1 and YHR054C genes in the graph.\n", |
| 485 | + "In this submodule, you learned how to BLAST against a pangenomic graph. Specifically, you searched for the CUP1 and YHR054C genes in the graph.\n", |
313 | 486 | "\n", |
314 | | - "BLASTing gene sequences allows you to find out where genes of interest are in the fasta file exported from the pangenomic graph. It also allows you to identify copy numbers of the genes.\n", |
| 487 | + "BLASTing gene sequences allows you to find out where genes of interest are in the FASTA file exported from the pangenomic graph. It also allows you to identify copy numbers of the genes.\n", |
315 | 488 | "\n", |
316 | | - "In the next chapter you will learn how visualize graphs and to blast directly against the graph and visualize the result.\n", |
 | 489 | + "In the next submodule, you will learn how to visualize graphs, BLAST directly against the graph, and visualize the result.\n", |
317 | 490 | "\n", |
318 | 491 | "----------------------" |
319 | 492 | ] |
|
329 | 502 | } |
330 | 503 | ], |
331 | 504 | "metadata": { |
| 505 | + "environment": { |
| 506 | + "kernel": "conda-env-nigms-pangenomics-nigms-pangenomics", |
| 507 | + "name": "workbench-notebooks.m129", |
| 508 | + "type": "gcloud", |
| 509 | + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m129" |
| 510 | + }, |
332 | 511 | "kernelspec": { |
333 | | - "display_name": "nigms-pangenomics", |
| 512 | + "display_name": "nigms-pangenomics (Local)", |
334 | 513 | "language": "python", |
335 | 514 | "name": "conda-env-nigms-pangenomics-nigms-pangenomics" |
336 | 515 | }, |
|
344 | 523 | "name": "python", |
345 | 524 | "nbconvert_exporter": "python", |
346 | 525 | "pygments_lexer": "ipython3", |
347 | | - "version": "3.12.9" |
| 526 | + "version": "3.12.10" |
348 | 527 | } |
349 | 528 | }, |
350 | 529 | "nbformat": 4, |
|