@@ -115,14 +115,14 @@ plt.show()
115115```
116116
117117Genomic data in tree sequence format can be generated via the widely-used
118- [ msprime] ( https://tskit.dev/software/msprime.html ) simulator. Here we simulate 20
119- kilobases of genome sequence at the start of human chromosome 1 under this model,
118+ [ msprime] ( https://tskit.dev/software/msprime.html ) simulator. Here we simulate 1
119+ megabase of genome sequence at the start of human chromosome 1 under this model,
120120together with its evolutionary history. We generate 16 diploid genomes: 4 from each of
121121the populations in the model. The DNA sequences and their ancestry are stored in a
122122succinct tree sequence named ` ts ` :
123123
124124``` {code-cell}
125- contig = species.get_contig("chr1", mutation_rate=model.mutation_rate, right=20_000 )
125+ contig = species.get_contig("chr1", mutation_rate=model.mutation_rate, right=1_000_000 )
126126samples = {"AFR": 4, "EUR": 4, "ASIA": 4, "ADMIX": 4} # 16 diploid samples
127127engine = stdpopsim.get_engine("msprime")
128128ts = engine.simulate(model, contig, samples, seed=9).trim() # trim to first 20kb simulated
@@ -135,24 +135,44 @@ along the genome:
135135
136136``` {code-cell}
137137for v in ts.variants():
138- display(v)
139- if v.site.id >= 2: # Only show site 0, 1, and 2, for brevity
138+ print(
139+ f"Variable site {v.site.id} at position {v.site.position} has allele frequencies",
140+ {state: f"{freq:.1%}" for state, freq in v.frequencies().items()}
141+ )
142+ if v.site.id > 4:
143+ print("...")
140144 break
141145```
142146
143- Or we can display the {meth}` ~TreeSequence.haplotypes ` (i.e. the variable sites) for
144- each sample
147+ Or we can efficiently grab the genotypes for each sampled genome
145148
146149``` {code-cell}
147- samples = ts.samples()
148- for sample_id, h in zip(samples, ts.haplotypes(samples=samples)):
149- pop = ts.node(sample_id).population
150- print(f"Sample {sample_id:<2} ({ts.population(pop).metadata['name']:^5}): {h}")
150+ print("Sample ---> ", " ".join([f"{u:>2}" for p in ts.populations() for u in ts.samples(population=p.id)]))
151+ print("Population |", "".join([f"{p.metadata['name']:^{3*(len(ts.samples(population=p.id)))-1}}|" for p in ts.populations()]))
152+ print("__________ |", "".join(["_" * (3 * len(ts.samples(population=p.id)) - 1) + "|" for p in ts.populations()]))
153+ print(" Position")
154+ for v in ts.variants():
155+ print(f"{int(v.site.position):>10} | ", " ".join(v.states()))
156+ if v.site.id >= 30: # Only show the first 30 sites, for brevity
157+ break
158+ ```
159+
160+ It is also possible to grab the [ haplotypes] ( https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.haplotypes )
161+ for specific samples (although this is slightly less efficient)
162+
163+ ``` {code-cell}
164+ pop_id = 0
165+ samples = ts.samples(population=pop_id)
166+
167+ for sample_id, haplotype in zip(samples, ts.haplotypes(samples=samples)):
168+ h = ".".join(list(haplotype)) # Add a dot between letters, to clarify they are not adjacent
169+ print(f"Sample {sample_id:<2} ({ts.population(pop_id).metadata['name']:^5}): {h}")
151170```
152171
153- From the tree sequence it is easy to obtain the
172+ You can easily obtain the
154173{meth}` TreeSequence.allele_frequency_spectrum ` for the entire region (or for
155- {ref}` windowed regions<sec_tskit_getting_started_compute_statistics_windowing> ` )
174+ {ref}` windowed regions<sec_tskit_getting_started_compute_statistics_windowing> ` )
175+ directly from the tree sequence (i.e. without needing to reconstruct genotypes)
156176
157177``` {code-cell}
158178afs = ts.allele_frequency_spectrum()
@@ -163,16 +183,16 @@ plt.show()
163183
164184Similarly ` tskit ` allows fast and easy
165185{ref}` calculation of statistics<sec_tutorial_stats> ` along the genome. Here is
166- a plot of windowed $F_ {st}$ between Africans and admixed Americans over this short
186+ a plot of windowed $F_ {st}$ between Africans and admixed Americans over this
167187region of chromosome:
168188
169189``` {code-cell}
170190# Define the samples between which Fst will be calculated
171191pop_id = {p.metadata["name"]: p.id for p in ts.populations()}
172192sample_sets=[ts.samples(pop_id["AFR"]), ts.samples(pop_id["ADMIX"])]
173193
174- # Do the windowed calculation, using windows of 2 kilobases
175- windows = list(range(0, int(ts.sequence_length + 1), 2_000 ))
194+ # Do the windowed calculation, using windows of 10 kilobases
195+ windows = list(range(0, int(ts.sequence_length + 1), 10_000 ))
176196F_st = ts.Fst(sample_sets, windows=windows)
177197
178198# Plot
@@ -185,7 +205,8 @@ plt.show()
185205Extracting the genetic tree at a specific genomic location is easy using ` tskit ` , which
186206also provides methods to {ref}` plot<sec_tskit_viz> ` these trees. Here we
187207grab the tree at position 10kb, and colour the different populations by
188- different colours, as described in the {ref}` viz tutorial<sec_tskit_viz_styling> ` :
208+ grab the tree at position 10kb, and colour the samples according to their population,
209+ as described in the {ref}` viz tutorial<sec_tskit_viz_styling> ` :
189210
190211``` {code-cell}
191212tree = ts.at(10_000)
@@ -214,29 +235,51 @@ tree.draw_svg(
214235 y_axis=True,
215236 y_ticks=range(0, 30_000, 10_000)
216237)
238+ ```
217239
240+ Or we can plot a principal components analysis of the genome, which should reflect
241+ geographical distinctiveness:
242+
243+ ``` {code-cell}
244+ from matplotlib.patches import Patch
245+
246+ # Run the Principal Components Analysis (PCA)
247+ pca_obj = ts.pca(num_components=2)
248+
249+ # Plot the PCA "factors"
250+ col_list = [colours[pop.metadata["name"]] for pop in ts.populations()]
251+ sample_pop_ids = ts.nodes_population[ts.samples()]
252+ plt.scatter(*pca_obj.factors.T, c=[col_list[p] for p in sample_pop_ids], edgecolors= "black")
253+ plt.xlabel("PCA 1")
254+ plt.ylabel("PCA 2")
255+ plt.legend(handles=[
256+ Patch(color=col_list[pop.id], label=pop.metadata["name"]) for pop in ts.populations()
257+ ]);
218258```
219259
220260## Population genetic inference
221261
222262If, instead of simulations, you want to analyse existing genomic data (for example
223- stored in a VCF file), you will need to infer a tree sequence from it, using e.g.
224- [ tsinfer] ( https://tskit.dev/tsinfer/docs/stable/ ) . Here we load an illustrative portion
225- of an [ inferred tree sequence] ( https://zenodo.org/record/5512994 )
263+ stored in a VCF file), you will need to infer a tree sequence from it, using [ ARG
264+ inference software] ( https://www.patrickfournier.ca/arg-samplers-review/graph/#/page/arg%20inference ) (such as [ SINGER] ( https://github.com/popgenmethods/SINGER ) ,
265+ [ Relate] ( https://myersgroup.github.io/relate/ ) ,
266+ [ Threads] ( https://github.com/PalamaraLab/threads ) , or
267+ [ tsinfer] ( https://github.com/tskit-dev/tsinfer ) ). Here we load an
268+ illustrative portion of a tree sequence
269+ [ inferred by tsinfer+tsdate] ( https://zenodo.org/record/5512994 ) which is
226270based on about 7500 public human genomes, including genomes from the
227271[ Thousand Genomes Project] ( https://www.internationalgenome.org/data-portal/data-collection/grch38 ) and
228272[ Human Genome Diversity Project] ( https://www.internationalgenome.org/data-portal/data-collection/hgdp ) .
229273The genomic region encoded in this tree sequence has been cut down to
230274span positions 108Mb-110Mb of human chromosome 2, which spans the
231275[ EDAR] ( https://en.wikipedia.org/wiki/Ectodysplasin_A_receptor ) gene.
232276
233- Note that tree sequence files are usually imported using {func}` load ` ,
234- but because this file has been additionally compressed, we load it via
235- {func}` tszip:tszip.decompress ` :
277+ Note that we are using {func}` tszip:tszip.load ` to load the file, as this
278+ utility can also read and write compressed tree sequences in ` .tsz ` format.
236279
237280``` {code-cell}
238281import tszip
239- ts = tszip.decompress ("data/unified_genealogy_2q_108Mb-110Mb.tsz")
282+ ts = tszip.load ("data/unified_genealogy_2q_108Mb-110Mb.tsz")
240283
241284# The ts encompasses a region on chr 2 with an interesting SNP (rs3827760) in the EDAR gene
242285edar_gene_bounds = [108_894_471, 108_989_220] # In Mb from the start of chromosome 2
@@ -256,10 +299,11 @@ import pandas as pd
256299print(ts.num_populations, "populations defined in the tree sequence:")
257300
258301pop_names_regions = [
259- [p.metadata.get("name"), p.metadata.get("region")]
302+ [p.metadata.get("name"), p.metadata.get("region"), len(ts.samples(population=p.id)) ]
260303 for p in ts.populations()
261304]
262- display(pd.DataFrame(pop_names_regions, columns=["population name", "region"]))
305+ with pd.option_context('display.max_rows', 100):
306+ display(pd.DataFrame(pop_names_regions, columns=["name", "region", "# genomes"]))
263307```
264308
265309You can see that there are multiple African and East asian populations, grouped by
@@ -321,16 +365,100 @@ using tree sequences is simply that they allow these sorts of analysis to
321365### Topological analysis
322366
323367As this inferred tree sequence stores (an estimate of) the underlying
324- genealogy, we can also derive statistics based on genealogical relationships. For
325- example, this tree sequence also contains a sample genome based on an ancient
326- genome, a [ Denisovan] ( https://en.wikipedia.org/wiki/Denisovan ) individual. We can
327- look at the closeness of relationship between samples from the different geographical
328- regions and the Denisovan:
329-
330- :::{todo}
331- Show an example of looking at topological relationships between the Denisovan and
332- various East Asian groups, using the {ref}` sec_counting_topologies ` functionality.
333- :::
368+ genealogy, we can also derive statistics based on genealogical relationships. You
369+ may have noticed that this tree sequence also contains a sample genome based on an ancient
370+ genome, a [ Denisovan] ( https://en.wikipedia.org/wiki/Denisovan ) individual. We'll first
371+ simplify the tree sequence to focus on only the Denisovan plus
372+ a common East Asian and a common African population:
373+
374+ ``` {code-cell}
375+ # Focus on Han, San, and Denisovan
376+ focal = {
377+ "Han": ts.samples(population=6),
378+ "San": ts.samples(population=17),
379+ "Denisovan": ts.samples(population=66),
380+ }
381+
382+ for name, nodes in focal.items(): # Sanity check that we got the right IDs
383+ assert ts.population(ts.node(nodes[0]).population).metadata["name"] == name
384+
385+ # Simplify to just those samples ...
386+ all_focal_samples = [u for samples in focal.values() for u in samples]
387+ simplified_ts = ts.simplify(all_focal_samples, filter_sites=False)
388+
389+ # ... and find the tree around the rs3827760 SNP
390+ focal_site = simplified_ts.site(focal_variant.site.id)
391+ tree = simplified_ts.at(focal_site.position)
392+ ```
393+
394+ With this smaller number of samples, we can easily plot the tree
395+ at the "rs3827760" SNP:
396+
397+ ``` {code-cell}
398+ :"tags": ["hide-input"]
399+ # Make some nice labels, colours, and legend
400+ mutation_labels = {m.id: focal_site.metadata.get("ID") for m in focal_site.mutations}
401+ colours = dict(San="yellow", Han="green", Denisovan="magenta")
402+ styles = [
403+ f".leaf.p{pop.id} > .sym {{fill: {colours[pop.metadata['name']]}; stroke: grey}}"
404+ for pop in simplified_ts.populations()
405+ ]
406+ legend = '<rect width="125" height="75" x="100" y="30" fill="transparent" stroke="grey" />'
407+ legend += '<text x="120" y="45" font-weight="bold">Populations</text>'
408+ # Create the legend lines, one for each population. Setting classes that match those
409+ # used for normal nodes means that styled colours are auto automatically picked-up.
410+ legend += "".join([
411+ f'<g transform="translate(105, {60 + 15*p.id})" class="leaf p{p.id}">' # an SVG group
412+ f'<rect width="6" height="6" class="sym" />' # Square symbol
413+ f'<text x="10" y="7">{p.metadata["name"]}' # Label
414+ f'{(" (" + p.metadata["region"].replace("_", " ").title() + ")") if "region" in p.metadata else ""}</text></g>'
415+ for p in simplified_ts.populations()
416+ ])
417+
418+ tree.draw_svg(
419+ size=(1000, 400),
420+ style="".join(styles),
421+ node_labels={},
422+ mutation_labels=mutation_labels,
423+ preamble=legend,
424+ title=f"Tree of human chromosome 2 at position {int(focal_variant.site.position)}",
425+ y_axis=True,
426+ y_ticks=range(0, 50_000, 10_000),
427+ )
428+ ```
429+
430+ You can see that the pair of magenta Denisovan genomes in this region tend to be
431+ more closely associated with the East Asian genomes. We can assess that by counting
432+ all the 3-tip topologies in the tree that contain one genome from each population:
433+
434+ ``` {code-cell}
435+ topology_counter = tree.count_topologies()
436+ embedded_topologies = topology_counter[range(simplified_ts.num_populations)]
437+ ```
438+
439+ ``` {code-cell}
440+ :"tags": ["hide-input"]
441+ # All the following code is simply to plot the embedded_topologies nicely
442+ all_trees = list(tskit.all_trees(simplified_ts.num_populations))
443+ last = len(all_trees) - 1
444+ svgs = ""
445+ style = "".join(styles) + ".sample text.lab {baseline-shift: super; font-size: 0.7em;}"
446+ style = style.replace(".leaf.p", ".leaf.n") # Hack to map node IDs to population colours
447+ params = {
448+ "size": (160, 150),
449+ "node_labels": {pop.id: pop.metadata["name"] for pop in simplified_ts.populations()}
450+ }
451+ for i, t in enumerate(all_trees):
452+ rank = t.rank()
453+ count = embedded_topologies[rank]
454+ params["title"] = f"{count} trees"
455+ if i != last:
456+ svgs += t.draw_svg(root_svg_attributes={'x': (last - i) * 150}, **params)
457+ else:
458+ # Plot the last svg and stack the previous ones to the right
459+ display(t.draw_svg(preamble=svgs, canvas_size=(1000, 150), style=style, **params))
460+ ```
461+
334462
335463See {ref}` sec_counting_topologies ` for an introduction to topological methods in
336464` tskit ` .
0 commit comments