Skip to content

Commit f787635

Browse files
Merge branch 'tskit-dev:main' into gg_tutorial_edits
2 parents f27b81d + c3bc44a commit f787635

8 files changed

Lines changed: 306 additions & 96 deletions

File tree

.github/workflows/build.yml

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -41,13 +41,17 @@ jobs:
4141
# We need to remove R to pull in a version that's compatible with CRAN, weirdly.
4242
sudo apt-get update
4343
sudo apt-get remove r-base-core
44-
sudo apt-get install r-cran-reticulate r-cran-pbdzmq r-cran-uuid r-cran-ape
45-
sudo R -e 'install.packages("IRkernel")'
44+
sudo apt-get install r-cran-pbdzmq r-cran-uuid r-cran-ape
45+
# Numpy 2 needs a newer reticulate version than ubuntu-24.04 + r-cran-reticulate
46+
sudo R -e 'install.packages(c("reticulate", "IRkernel"))'
4647
R -e 'IRkernel::installspec()'
4748
4849
# Build the book
4950
- name: Build the book
50-
run: ./build.sh
51+
run: |
52+
# Ensure R (reticulate) uses the python version with the new installed packages
53+
export RETICULATE_PYTHON=$(which python)
54+
./build.sh
5155
5256
- name: Copy files for users to run tutorials
5357
run: |

README.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -17,12 +17,12 @@ under construction.
1717

1818
# Requirements
1919

20-
Install the Python requirements from requirements.txt:
20+
Install the Python requirements from `requirements.txt`:
2121
```
2222
$ python -m pip install -r requirements.txt
2323
```
2424

25-
You will also need a working R installation with reticulate and irkernel installed.
25+
You will also need a working R installation with `reticulate` and `irkernel` installed.
2626
This command should do the trick:
2727
```
2828
$ R -e 'install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()'
@@ -35,8 +35,8 @@ $ R -e 'install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()'
3535
[jupytext](https://github.com/mwouts/jupytext) to convert the notebook into
3636
the right format.
3737
- To build locally, run ``make``. The output tells you where to find the
38-
built HTML.
39-
- Pages rendered at https://tskit.dev/tutorials
38+
built `HTML` file(s).
39+
- Pages rendered at https://tskit.dev/tutorials.
4040
- Pages might take a while to be updated after a new tutorial is merged.
4141

4242
If you have an idea for a tutorial, please open an issue to discuss.

intro.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -74,7 +74,9 @@ ts.draw_svg(size=size, x_scale="treewise", style=style, canvas_size=canvas_size)
7474
```
7575

7676
If you are new to the world of tree sequences, we suggest you start with the
77-
first tutorial: {ref}`sec_what_is`
77+
first tutorial: {ref}`sec_what_is`. If you wish to try `tskit` before
78+
installing it onto your computer, you might like to check out our
79+
[interactive online notebooks](https://tskit.dev/explore/lab/?path=tskit.ipynb).
7880

7981
:::{note}
8082
Tutorials are under constant development. Those that are still a work in progress and

popgen.md

Lines changed: 164 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -115,14 +115,14 @@ plt.show()
115115
```
116116

117117
Genomic data in tree sequence format can be generated via the widely-used
118-
[msprime](https://tskit.dev/software/msprime.html) simulator. Here we simulate 20
119-
kilobases of genome sequence at the start of human chromosome 1 under this model,
118+
[msprime](https://tskit.dev/software/msprime.html) simulator. Here we simulate 1
119+
megabase of genome sequence at the start of human chromosome 1 under this model,
120120
together with its evolutionary history. We generate 16 diploid genomes: 4 from each of
121121
the populations in the model. The DNA sequences and their ancestry are stored in a
122122
succinct tree sequence named `ts`:
123123

124124
```{code-cell}
125-
contig = species.get_contig("chr1", mutation_rate=model.mutation_rate, right=20_000)
125+
contig = species.get_contig("chr1", mutation_rate=model.mutation_rate, right=1_000_000)
126126
samples = {"AFR": 4, "EUR": 4, "ASIA": 4, "ADMIX": 4} # 16 diploid samples
127127
engine = stdpopsim.get_engine("msprime")
128128
ts = engine.simulate(model, contig, samples, seed=9).trim() # trim to first 20kb simulated
@@ -135,24 +135,44 @@ along the genome:
135135

136136
```{code-cell}
137137
for v in ts.variants():
138-
display(v)
139-
if v.site.id >= 2: # Only show site 0, 1, and 2, for brevity
138+
print(
139+
f"Variable site {v.site.id} at position {v.site.position} has allele frequencies",
140+
{state: f"{freq:.1%}" for state, freq in v.frequencies().items()}
141+
)
142+
if v.site.id > 4:
143+
print("...")
140144
break
141145
```
142146

143-
Or we can display the {meth}`~TreeSequence.haplotypes` (i.e. the variable sites) for
144-
each sample
147+
Or we can efficiently grab the genotypes for each sampled genome
145148

146149
```{code-cell}
147-
samples = ts.samples()
148-
for sample_id, h in zip(samples, ts.haplotypes(samples=samples)):
149-
pop = ts.node(sample_id).population
150-
print(f"Sample {sample_id:<2} ({ts.population(pop).metadata['name']:^5}): {h}")
150+
print("Sample ---> ", " ".join([f"{u:>2}" for p in ts.populations() for u in ts.samples(population=p.id)]))
151+
print("Population |", "".join([f"{p.metadata['name']:^{3*(len(ts.samples(population=p.id)))-1}}|" for p in ts.populations()]))
152+
print("__________ |", "".join(["_" * (3 * len(ts.samples(population=p.id)) - 1) + "|" for p in ts.populations()]))
153+
print(" Position")
154+
for v in ts.variants():
155+
print(f"{int(v.site.position):>10} | ", " ".join(v.states()))
156+
if v.site.id >= 30: # Only show the first 30 sites, for brevity
157+
break
158+
```
159+
160+
It is also possible to grab the [haplotypes](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.haplotypes)
161+
for specific samples (although this is slightly less efficient)
162+
163+
```{code-cell}
164+
pop_id = 0
165+
samples = ts.samples(population=pop_id)
166+
167+
for sample_id, haplotype in zip(samples, ts.haplotypes(samples=samples)):
168+
h = ".".join(list(haplotype)) # Add a dot between letters, to clarify they are not adjacent
169+
print(f"Sample {sample_id:<2} ({ts.population(pop_id).metadata['name']:^5}): {h}")
151170
```
152171

153-
From the tree sequence it is easy to obtain the
172+
You can easily obtain the
154173
{meth}`TreeSequence.allele_frequency_spectrum` for the entire region (or for
155-
{ref}`windowed regions<sec_tskit_getting_started_compute_statistics_windowing>`)
174+
{ref}`windowed regions<sec_tskit_getting_started_compute_statistics_windowing>`)
175+
directly from the tree sequence (i.e. without needing to reconstruct genotypes)
156176

157177
```{code-cell}
158178
afs = ts.allele_frequency_spectrum()
@@ -163,16 +183,16 @@ plt.show()
163183

164184
Similarly `tskit` allows fast and easy
165185
{ref}`calculation of statistics<sec_tutorial_stats>` along the genome. Here is
166-
a plot of windowed $F_{st}$ between Africans and admixed Americans over this short
186+
a plot of windowed $F_{st}$ between Africans and admixed Americans over this
167187
region of chromosome:
168188

169189
```{code-cell}
170190
# Define the samples between which Fst will be calculated
171191
pop_id = {p.metadata["name"]: p.id for p in ts.populations()}
172192
sample_sets=[ts.samples(pop_id["AFR"]), ts.samples(pop_id["ADMIX"])]
173193
174-
# Do the windowed calculation, using windows of 2 kilobases
175-
windows = list(range(0, int(ts.sequence_length + 1), 2_000))
194+
# Do the windowed calculation, using windows of 10 kilobases
195+
windows = list(range(0, int(ts.sequence_length + 1), 10_000))
176196
F_st = ts.Fst(sample_sets, windows=windows)
177197
178198
# Plot
@@ -185,7 +205,8 @@ plt.show()
185205
Extracting the genetic tree at a specific genomic location is easy using `tskit`, which
186206
also provides methods to {ref}`plot<sec_tskit_viz>` these trees. Here we
187207
grab the tree at position 10kb, and colour the different populations by
188-
different colours, as described in the {ref}`viz tutorial<sec_tskit_viz_styling>`:
208+
grab the tree at position 10kb, and colour the samples according to their population,
209+
as described in the {ref}`viz tutorial<sec_tskit_viz_styling>`:
189210

190211
```{code-cell}
191212
tree = ts.at(10_000)
@@ -214,29 +235,51 @@ tree.draw_svg(
214235
y_axis=True,
215236
y_ticks=range(0, 30_000, 10_000)
216237
)
238+
```
217239

240+
Or we can plot a principal components analysis of the genome, which should reflect
241+
geographical distinctiveness:
242+
243+
```{code-cell}
244+
from matplotlib.patches import Patch
245+
246+
# Run the Principal Components Analysis (PCA)
247+
pca_obj = ts.pca(num_components=2)
248+
249+
# Plot the PCA "factors"
250+
col_list = [colours[pop.metadata["name"]] for pop in ts.populations()]
251+
sample_pop_ids = ts.nodes_population[ts.samples()]
252+
plt.scatter(*pca_obj.factors.T, c=[col_list[p] for p in sample_pop_ids], edgecolors= "black")
253+
plt.xlabel("PCA 1")
254+
plt.ylabel("PCA 2")
255+
plt.legend(handles=[
256+
Patch(color=col_list[pop.id], label=pop.metadata["name"]) for pop in ts.populations()
257+
]);
218258
```
219259

220260
## Population genetic inference
221261

222262
If, instead of simulations, you want to analyse existing genomic data (for example
223-
stored in a VCF file), you will need to infer a tree sequence from it, using e.g.
224-
[tsinfer](https://tskit.dev/tsinfer/docs/stable/). Here we load an illustrative portion
225-
of an [inferred tree sequence](https://zenodo.org/record/5512994)
263+
stored in a VCF file), you will need to infer a tree sequence from it, using [ARG
264+
inference software](https://www.patrickfournier.ca/arg-samplers-review/graph/#/page/arg%20inference) (such as [SINGER](https://github.com/popgenmethods/SINGER),
265+
[Relate](https://myersgroup.github.io/relate/),
266+
[Threads](https://github.com/PalamaraLab/threads), or
267+
[tsinfer](https://github.com/tskit-dev/tsinfer)). Here we load an
268+
illustrative portion of a tree sequence
269+
[inferred by tsinfer+tsdate](https://zenodo.org/record/5512994) which is
226270
based on about 7500 public human genomes, including genomes from the
227271
[Thousand Genomes Project](https://www.internationalgenome.org/data-portal/data-collection/grch38) and
228272
[Human Genome Diversity Project](https://www.internationalgenome.org/data-portal/data-collection/hgdp).
229273
The genomic region encoded in this tree sequence has been cut down to
230274
span positions 108Mb-110Mb of human chromosome 2, which spans the
231275
[EDAR](https://en.wikipedia.org/wiki/Ectodysplasin_A_receptor) gene.
232276

233-
Note that tree sequence files are usually imported using {func}`load`,
234-
but because this file has been additionally compressed, we load it via
235-
{func}`tszip:tszip.decompress`:
277+
Note that we are using {func}`tszip:tszip.load` to load the file, as this
278+
utility can also read and write compressed tree sequences in `.tsz` format.
236279

237280
```{code-cell}
238281
import tszip
239-
ts = tszip.decompress("data/unified_genealogy_2q_108Mb-110Mb.tsz")
282+
ts = tszip.load("data/unified_genealogy_2q_108Mb-110Mb.tsz")
240283
241284
# The ts encompasses a region on chr 2 with an interesting SNP (rs3827760) in the EDAR gene
242285
edar_gene_bounds = [108_894_471, 108_989_220] # In Mb from the start of chromosome 2
@@ -256,10 +299,11 @@ import pandas as pd
256299
print(ts.num_populations, "populations defined in the tree sequence:")
257300
258301
pop_names_regions = [
259-
[p.metadata.get("name"), p.metadata.get("region")]
302+
[p.metadata.get("name"), p.metadata.get("region"), len(ts.samples(population=p.id))]
260303
for p in ts.populations()
261304
]
262-
display(pd.DataFrame(pop_names_regions, columns=["population name", "region"]))
305+
with pd.option_context('display.max_rows', 100):
306+
display(pd.DataFrame(pop_names_regions, columns=["name", "region", "# genomes"]))
263307
```
264308

265309
You can see that there are multiple African and East asian populations, grouped by
@@ -321,16 +365,100 @@ using tree sequences is simply that they allow these sorts of analysis to
321365
### Topological analysis
322366

323367
As this inferred tree sequence stores (an estimate of) the underlying
324-
genealogy, we can also derive statistics based on genealogical relationships. For
325-
example, this tree sequence also contains a sample genome based on an ancient
326-
genome, a [Denisovan](https://en.wikipedia.org/wiki/Denisovan) individual. We can
327-
look at the closeness of relationship between samples from the different geographical
328-
regions and the Denisovan:
329-
330-
:::{todo}
331-
Show an example of looking at topological relationships between the Denisovan and
332-
various East Asian groups, using the {ref}`sec_counting_topologies` functionality.
333-
:::
368+
genealogy, we can also derive statistics based on genealogical relationships. You
369+
may have noticed that this tree sequence also contains a sample genome based on an ancient
370+
genome, a [Denisovan](https://en.wikipedia.org/wiki/Denisovan) individual. We'll first
371+
simplify the tree sequence to focus on only the Denisovan plus
372+
a common East Asian and a common African population:
373+
374+
```{code-cell}
375+
# Focus on Han, San, and Denisovan
376+
focal = {
377+
"Han": ts.samples(population=6),
378+
"San": ts.samples(population=17),
379+
"Denisovan": ts.samples(population=66),
380+
}
381+
382+
for name, nodes in focal.items(): # Sanity check that we got the right IDs
383+
assert ts.population(ts.node(nodes[0]).population).metadata["name"] == name
384+
385+
# Simplify to just those samples ...
386+
all_focal_samples = [u for samples in focal.values() for u in samples]
387+
simplified_ts = ts.simplify(all_focal_samples, filter_sites=False)
388+
389+
# ... and find the tree around the rs3827760 SNP
390+
focal_site = simplified_ts.site(focal_variant.site.id)
391+
tree = simplified_ts.at(focal_site.position)
392+
```
393+
394+
With this smaller number of samples, we can easily plot the tree
395+
at the "rs3827760" SNP:
396+
397+
```{code-cell}
398+
:"tags": ["hide-input"]
399+
# Make some nice labels, colours, and legend
400+
mutation_labels = {m.id: focal_site.metadata.get("ID") for m in focal_site.mutations}
401+
colours = dict(San="yellow", Han="green", Denisovan="magenta")
402+
styles = [
403+
f".leaf.p{pop.id} > .sym {{fill: {colours[pop.metadata['name']]}; stroke: grey}}"
404+
for pop in simplified_ts.populations()
405+
]
406+
legend = '<rect width="125" height="75" x="100" y="30" fill="transparent" stroke="grey" />'
407+
legend += '<text x="120" y="45" font-weight="bold">Populations</text>'
408+
# Create the legend lines, one for each population. Setting classes that match those
409+
# used for normal nodes means that styled colours are auto automatically picked-up.
410+
legend += "".join([
411+
f'<g transform="translate(105, {60 + 15*p.id})" class="leaf p{p.id}">' # an SVG group
412+
f'<rect width="6" height="6" class="sym" />' # Square symbol
413+
f'<text x="10" y="7">{p.metadata["name"]}' # Label
414+
f'{(" (" + p.metadata["region"].replace("_", " ").title() + ")") if "region" in p.metadata else ""}</text></g>'
415+
for p in simplified_ts.populations()
416+
])
417+
418+
tree.draw_svg(
419+
size=(1000, 400),
420+
style="".join(styles),
421+
node_labels={},
422+
mutation_labels=mutation_labels,
423+
preamble=legend,
424+
title=f"Tree of human chromosome 2 at position {int(focal_variant.site.position)}",
425+
y_axis=True,
426+
y_ticks=range(0, 50_000, 10_000),
427+
)
428+
```
429+
430+
You can see that the pair of magenta Denisovan genomes in this region tend to be
431+
more closely associated with the East Asian genomes. We can assess that by counting
432+
all the 3-tip topologies in the tree that contain one genome from each population:
433+
434+
```{code-cell}
435+
topology_counter = tree.count_topologies()
436+
embedded_topologies = topology_counter[range(simplified_ts.num_populations)]
437+
```
438+
439+
```{code-cell}
440+
:"tags": ["hide-input"]
441+
# All the following code is simply to plot the embedded_topologies nicely
442+
all_trees = list(tskit.all_trees(simplified_ts.num_populations))
443+
last = len(all_trees) - 1
444+
svgs = ""
445+
style = "".join(styles) + ".sample text.lab {baseline-shift: super; font-size: 0.7em;}"
446+
style = style.replace(".leaf.p", ".leaf.n") # Hack to map node IDs to population colours
447+
params = {
448+
"size": (160, 150),
449+
"node_labels": {pop.id: pop.metadata["name"] for pop in simplified_ts.populations()}
450+
}
451+
for i, t in enumerate(all_trees):
452+
rank = t.rank()
453+
count = embedded_topologies[rank]
454+
params["title"] = f"{count} trees"
455+
if i != last:
456+
svgs += t.draw_svg(root_svg_attributes={'x': (last - i) * 150}, **params)
457+
else:
458+
# Plot the last svg and stack the previous ones to the right
459+
display(t.draw_svg(preamble=svgs, canvas_size=(1000, 150), style=style, **params))
460+
```
461+
334462

335463
See {ref}`sec_counting_topologies` for an introduction to topological methods in
336464
`tskit`.

requirements-CI.txt

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,17 @@
1-
git+https://github.com/tskit-dev/tsconvert@e99c837e4e26ccbf4f480a4c48626338eeff7dc3
1+
git+https://github.com/tskit-dev/tsconvert@4ed3c0f9bbbb79c7c394b352c58f5386cfef2b90
22
demes==0.2.3
3-
demesdraw==0.4.0
4-
jupyter-book==1.0.2
3+
demesdraw==0.4.1
4+
jupyter-book==1.0.3
55
jupyter-cache==0.6.1
6-
msprime==1.3.2
6+
msprime==1.4.0
77
networkx==3.3
8-
numpy==1.26.4
9-
pandas==2.2.2
8+
numpy==2.2.6
9+
pandas==2.3.3
1010
pygraphviz==1.13
11-
scikit-allel==1.3.8
11+
scikit-allel==1.3.13
1212
stdpopsim==0.3.0
1313
tqdm==4.66.3
14-
tskit==0.5.8
15-
tskit_arg_visualizer==0.0.1
16-
tszip==0.2.4
14+
tskit==1.0.0
15+
tskit_arg_visualizer==0.1.2
16+
tszip==0.3
1717
jsonschema==4.18.6 # Pinned due to 4.19 "AttributeError module jsonschema has no attribute _validators"

requirements.txt

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,15 +1,15 @@
1-
git+https://github.com/tskit-dev/tsconvert@e99c837e4e26ccbf4f480a4c48626338eeff7dc3
1+
git+https://github.com/tskit-dev/tsconvert@4ed3c0f9bbbb79c7c394b352c58f5386cfef2b90
22
demes
33
demesdraw
4-
jupyter-book>=0.12.0
4+
jupyter-book >= 0.12.0, < 2.0
55
jupyter-cache
66
msprime>=1.0
77
networkx
88
pandas
99
pygraphviz
10-
scikit-allel
10+
scikit-allel>=1.3.9
1111
stdpopsim>=0.3
1212
tqdm
13-
tskit>=0.5.4
14-
tskit_arg_visualizer
13+
tskit>=1.0.0
14+
tskit_arg_visualizer>=0.1.1
1515
tszip

0 commit comments

Comments
 (0)