Skip to content

Commit 4b7b9d9

Browse files
committed
Finish the intro to popgen tutorial
1 parent d360610 commit 4b7b9d9

1 file changed

Lines changed: 136 additions & 29 deletions

File tree

popgen.md

Lines changed: 136 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -115,14 +115,14 @@ plt.show()
115115
```
116116

117117
Genomic data in tree sequence format can be generated via the widely-used
118-
[msprime](https://tskit.dev/software/msprime.html) simulator. Here we simulate 20
119-
kilobases of genome sequence at the start of human chromosome 1 under this model,
118+
[msprime](https://tskit.dev/software/msprime.html) simulator. Here we simulate 1
119+
megabase of genome sequence at the start of human chromosome 1 under this model,
120120
together with its evolutionary history. We generate 16 diploid genomes: 4 from each of
121121
the populations in the model. The DNA sequences and their ancestry are stored in a
122122
succinct tree sequence named `ts`:
123123

124124
```{code-cell}
125-
contig = species.get_contig("chr1", mutation_rate=model.mutation_rate, right=20_000)
125+
contig = species.get_contig("chr1", mutation_rate=model.mutation_rate, right=1_000_000)
126126
samples = {"AFR": 4, "EUR": 4, "ASIA": 4, "ADMIX": 4} # 16 diploid samples
127127
engine = stdpopsim.get_engine("msprime")
128128
ts = engine.simulate(model, contig, samples, seed=9).trim() # trim to first 20kb simulated
@@ -134,20 +134,24 @@ We can now inspect alleles and their frequencies at the variable sites we have s
134134
along the genome:
135135

136136
```{code-cell}
137+
print("Sample --->", " ".join([f"{u:>2}" for p in ts.populations() for u in ts.samples(population=p.id)]))
138+
print("Population |", "".join([f"{p.metadata['name']:^{3*(len(ts.samples(population=p.id)))-1}}|" for p in ts.populations()]))
139+
print("_Position_ |", "".join(["_" * (3 * len(ts.samples(population=p.id)) - 1) + "|" for p in ts.populations()]))
137140
for v in ts.variants():
138-
display(v)
139-
if v.site.id >= 2: # Only show site 0, 1, and 2, for brevity
141+
print(f"{int(v.site.position):>10} | ", " ".join(v.states()))
142+
if v.site.id >= 30: # Only show the first 30 sites, for brevity
140143
break
141144
```
142145

143-
Or we can display the {meth}`~TreeSequence.haplotypes` (i.e. the variable sites) for
144-
each sample
146+
We can efficiently grab the genotypes for each sampled genome
145147

146-
```{code-cell}
148+
```
147149
samples = ts.samples()
148-
for sample_id, h in zip(samples, ts.haplotypes(samples=samples)):
150+
149151
pop = ts.node(sample_id).population
150-
print(f"Sample {sample_id:<2} ({ts.population(pop).metadata['name']:^5}): {h}")
152+
153+
for v in ts.variants():
154+
print(f"Position {v.site.position:<5} ({ts.population(pop).metadata['name']:^5}): {h}")
151155
```
152156

153157
From the tree sequence it is easy to obtain the
@@ -163,16 +167,16 @@ plt.show()
163167

164168
Similarly `tskit` allows fast and easy
165169
{ref}`calculation of statistics<sec_tutorial_stats>` along the genome. Here is
166-
a plot of windowed $F_{st}$ between Africans and admixed Americans over this short
170+
a plot of windowed $F_{st}$ between Africans and admixed Americans over this
167171
region of chromosome:
168172

169173
```{code-cell}
170174
# Define the samples between which Fst will be calculated
171175
pop_id = {p.metadata["name"]: p.id for p in ts.populations()}
172176
sample_sets=[ts.samples(pop_id["AFR"]), ts.samples(pop_id["ADMIX"])]
173177
174-
# Do the windowed calculation, using windows of 2 kilobases
175-
windows = list(range(0, int(ts.sequence_length + 1), 2_000))
178+
# Do the windowed calculation, using windows of 10 kilobases
179+
windows = list(range(0, int(ts.sequence_length + 1), 10_000))
176180
F_st = ts.Fst(sample_sets, windows=windows)
177181
178182
# Plot
@@ -214,7 +218,26 @@ tree.draw_svg(
214218
y_axis=True,
215219
y_ticks=range(0, 30_000, 10_000)
216220
)
221+
```
222+
223+
Or we can plot a principal components analysis of the genome, which should reflect
224+
geographical distinctiveness:
217225

226+
```{code-cell}
227+
from matplotlib.patches import Patch
228+
229+
# Run the Principal Components Analysis (PCA)
230+
pca_obj = ts.pca(num_components=2)
231+
232+
# Plot the PCA "factors"
233+
col_list = [colours[pop.metadata["name"]] for pop in ts.populations()]
234+
sample_pop_ids = ts.nodes_population[ts.samples()]
235+
plt.scatter(*pca_obj.factors.T, c=[col_list[p] for p in sample_pop_ids], edgecolors= "black")
236+
plt.xlabel("PCA 1")
237+
plt.ylabel("PCA 2")
238+
plt.legend(handles=[
239+
Patch(color=col_list[pop.id], label=pop.metadata["name"]) for pop in ts.populations()
240+
]);
218241
```
219242

220243
## Population genetic inference
@@ -230,13 +253,12 @@ The genomic region encoded in this tree sequence has been cut down to
230253
span positions 108Mb-110Mb of human chromosome 2, which spans the
231254
[EDAR](https://en.wikipedia.org/wiki/Ectodysplasin_A_receptor) gene.
232255

233-
Note that tree sequence files are usually imported using {func}`load`,
234-
but because this file has been additionally compressed, we load it via
235-
{func}`tszip:tszip.decompress`:
256+
Note that we are using {func}`tszip:tszip.load` to load the file, as this
257+
utility can also read and write compressed tree sequences in `.tsz` format.
236258

237259
```{code-cell}
238260
import tszip
239-
ts = tszip.decompress("data/unified_genealogy_2q_108Mb-110Mb.tsz")
261+
ts = tszip.load("data/unified_genealogy_2q_108Mb-110Mb.tsz")
240262
241263
# The ts encompasses a region on chr 2 with an interesting SNP (rs3827760) in the EDAR gene
242264
edar_gene_bounds = [108_894_471, 108_989_220] # In Mb from the start of chromosome 2
@@ -256,10 +278,11 @@ import pandas as pd
256278
print(ts.num_populations, "populations defined in the tree sequence:")
257279
258280
pop_names_regions = [
259-
[p.metadata.get("name"), p.metadata.get("region")]
281+
[p.metadata.get("name"), p.metadata.get("region"), len(ts.samples(population=p.id))]
260282
for p in ts.populations()
261283
]
262-
display(pd.DataFrame(pop_names_regions, columns=["population name", "region"]))
284+
with pd.option_context('display.max_rows', 100):
285+
display(pd.DataFrame(pop_names_regions, columns=["name", "region", "# genomes"]))
263286
```
264287

265288
You can see that there are multiple African and East asian populations, grouped by
@@ -321,16 +344,100 @@ using tree sequences is simply that they allow these sorts of analysis to
321344
### Topological analysis
322345

323346
As this inferred tree sequence stores (an estimate of) the underlying
324-
genealogy, we can also derive statistics based on genealogical relationships. For
325-
example, this tree sequence also contains a sample genome based on an ancient
326-
genome, a [Denisovan](https://en.wikipedia.org/wiki/Denisovan) individual. We can
327-
look at the closeness of relationship between samples from the different geographical
328-
regions and the Denisovan:
329-
330-
:::{todo}
331-
Show an example of looking at topological relationships between the Denisovan and
332-
various East Asian groups, using the {ref}`sec_counting_topologies` functionality.
333-
:::
347+
genealogy, we can also derive statistics based on genealogical relationships. You
348+
may have noticed that this tree sequence also contains a sample genome based on an ancient
349+
genome, a [Denisovan](https://en.wikipedia.org/wiki/Denisovan) individual. We'll first
350+
simplify the tree sequence to focus on only the Denisovan plus
351+
a common East Asian and a common African population:
352+
353+
```{code-cell}
354+
# Focus on Han, San, and Denisovan
355+
focal = {
356+
"Han": ts.samples(population=6),
357+
"San": ts.samples(population=17),
358+
"Denisovan": ts.samples(population=66),
359+
}
360+
361+
for name, nodes in focal.items(): # Sanity check that we got the right IDs
362+
assert ts.population(ts.node(nodes[0]).population).metadata["name"] == name
363+
364+
# Simplify to just those samples ...
365+
all_focal_samples = [u for samples in focal.values() for u in samples]
366+
simplified_ts = ts.simplify(all_focal_samples, filter_sites=False)
367+
368+
# ... and find the tree around the rs3827760 SNP
369+
focal_site = simplified_ts.site(focal_variant.site.id)
370+
tree = simplified_ts.at(focal_site.position)
371+
```
372+
373+
With this smaller number of samples, we can easily plot the tree
374+
at the "rs3827760" SNP:
375+
376+
```{code-cell}
377+
:"tags": ["hide-input"]
378+
# Make some nide labels, colours, and legend etc.
379+
mutation_labels = {m.id: focal_site.metadata.get("ID") for m in focal_site.mutations}
380+
colours = dict(San="yellow", Han="green", Denisovan="magenta")
381+
styles = [
382+
f".leaf.p{pop.id} > .sym {{fill: {colours[pop.metadata['name']]}; stroke: grey}}"
383+
for pop in simplified_ts.populations()
384+
]
385+
legend = '<rect width="125" height="75" x="100" y="30" fill="transparent" stroke="grey" />'
386+
legend += '<text x="120" y="45" font-weight="bold">Populations</text>'
387+
# Create the legend lines, one for each population. Setting classes that match those
388+
# used for normal nodes means that styled colours are auto automatically picked-up.
389+
legend += "".join([
390+
f'<g transform="translate(105, {60 + 15*p.id})" class="leaf p{p.id}">' # an SVG group
391+
f'<rect width="6" height="6" class="sym" />' # Square symbol
392+
f'<text x="10" y="7">{p.metadata["name"]}' # Label
393+
f'{(" (" + p.metadata["region"].replace("_", " ").title() + ")") if "region" in p.metadata else ""}</text></g>'
394+
for p in simplified_ts.populations()
395+
])
396+
397+
tree.draw_svg(
398+
size=(1000, 400),
399+
style="".join(styles),
400+
node_labels={},
401+
mutation_labels=mutation_labels,
402+
preamble=legend,
403+
title=f"Tree of human chromosome 2 at position {int(focal_variant.site.position)}",
404+
y_axis=True,
405+
y_ticks=range(0, 50_000, 10_000),
406+
)
407+
```
408+
409+
You can see that the pair of magenta Denisovan genomes in this region tend to be
410+
more closely associated with the East Asian genomes. We can assess that by counting
411+
all the 3-tip topologies in the tree that contain one genome from each population:
412+
413+
```{code-cell}
414+
topology_counter = tree.count_topologies()
415+
embedded_topologies = topology_counter[range(simplified_ts.num_populations)]
416+
```
417+
418+
```{code-cell}
419+
:"tags": ["hide-input"]
420+
# All the following code is simply to plot the embedded_topologies nicely
421+
all_trees = list(tskit.all_trees(simplified_ts.num_populations))
422+
last = len(all_trees) - 1
423+
svgs = ""
424+
style = "".join(styles) + ".sample text.lab {baseline-shift: super; font-size: 0.7em;}"
425+
style = style.replace(".leaf.p", ".leaf.n") # Hack to map node IDs to population colours
426+
params = {
427+
"size": (160, 150),
428+
"node_labels": {pop.id: pop.metadata["name"] for pop in simplified_ts.populations()}
429+
}
430+
for i, t in enumerate(all_trees):
431+
rank = t.rank()
432+
count = embedded_topologies[rank]
433+
params["title"] = f"{count} trees"
434+
if i != last:
435+
svgs += t.draw_svg(root_svg_attributes={'x': (last - i) * 150}, **params)
436+
else:
437+
# Plot the last svg and stack the previous ones to the right
438+
display(t.draw_svg(preamble=svgs, canvas_size=(1000, 150), style=style, **params))
439+
```
440+
334441

335442
See {ref}`sec_counting_topologies` for an introduction to topological methods in
336443
`tskit`.

0 commit comments

Comments
 (0)