Skip to content

Commit e2d14d4

Browse files
committed
Make a simplification tutorial
1 parent 7c8d073 commit e2d14d4

1 file changed

Lines changed: 285 additions & 6 deletions

File tree

simplification.md

Lines changed: 285 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -16,17 +16,296 @@ kernelspec:
1616

1717
```{code-cell} ipython3
1818
:tags: [remove-cell]
19-
def create_notebook_data():
20-
pass
19+
import msprime
2120
21+
def create_notebook_data():
22+
ts = msprime.sim_ancestry(
23+
samples=4,
24+
sequence_length=1_000,
25+
population_size=300,
26+
random_seed=42,
27+
)
28+
ts = msprime.sim_mutations(ts, rate=1e-6, random_seed=321)
29+
assert ts.num_mutations == 2
30+
site_pos = ts.simplify([0, 1, 2]).sites_position
31+
assert len(site_pos) == 1 and site_pos[0] == ts.sites_position[1]
32+
ts.dump("data/simplification_basic.trees")
2233
# create_notebook_data() # uncomment to recreate the tree seqs used in this notebook
2334
```
2435

2536
(sec_simplification)=
2637

27-
# _Simplification_
28-
% remove underscores in title when tutorial is complete or near-complete
38+
# Simplification
39+
40+
The {meth}`~TreeSequence.simplify` method provides one of the most powerful ways to modify a [tskit](https://tskit.dev) {class}`TreeSequence`. It reduces a tree sequence to the ancestry of a chosen set of focal nodes, and by default marks those as samples, removing everything not needed to represent their ancestry. It is commonly used:
41+
42+
* In forward simulations, to remove lineages that have gone extinct
43+
* To create a smaller tree sequence focussed on a subset of samples
44+
* To remove redundant nodes and other tskit objects (e.g. unreferenced populations)
45+
46+
Other less common uses, such as retaining unary regions of coalescent nodes, and
47+
simplification in parallel, are described in the {ref}`sec_advanced_simplification` tutorial.
48+
49+
50+
## A single tree example
51+
52+
We start with a very small example for ease of visualisation. This is
53+
a tree sequence consisting of a single tree with 8 haploid genomes
54+
(4 diploid individuals) and 2 variable sites.
55+
56+
```{code-cell} ipython3
57+
import tskit
58+
ts = tskit.load("data/simplification_basic.trees")
59+
plot_params = {"size": (500, 200), "time_scale": "log_time", "y_axis":True, "y_ticks": [0, 10, 100, 1000]}
60+
ts.draw_svg(**plot_params)
61+
```
62+
63+
Suppose we only want to retain the ancestry of sample nodes 0, 1, and 2. We can do this
64+
by passing those IDs as the new samples to the {meth}`~TreeSequence.simplify` method:
65+
66+
```{code-cell} ipython3
67+
focal_node_ids = [0, 1, 2]
68+
ts_simp1 = ts.simplify(samples=focal_node_ids)
69+
ts_simp1.draw_svg(**plot_params)
70+
```
71+
72+
Restricting the sample nodes to [0, 1, 2] makes the ancestry much simpler.
73+
Note that one of the mutations is not relevant to the new samples, so it has been
74+
filtered out, causing the ID of the remaining mutation to change from 1 to 0.
75+
Similarly, many nodes have been filtered out, resulting in changed node IDs
76+
(e.g. the root node at time 1000 is the same in both trees, but its ID has
77+
changed from 14 to 4).
78+
79+
To keep node IDs the same, you can specify `filter_nodes=False`. Although this makes
80+
the result easier to compare with the original, it is not generally recommended, as it
81+
leaves redundant nodes cluttering up the tree sequence.
82+
83+
```{code-cell} ipython3
84+
ts_simp2 = ts.simplify(focal_node_ids, filter_nodes=False, filter_sites=False)
85+
# Node IDs should now remain unchanged
86+
ts_simp2.draw_svg(**plot_params)
87+
```
88+
89+
Note that the example above also used another `filter_` argument, setting
90+
`filter_sites=False`, so that the first site, which has no mutations after
91+
simplification, is also retained (it is shown as a bare tick mark on the X axis,
92+
around position 250). However, mutations above unused nodes are still deleted
93+
so mutation IDs are not guaranteed to stay the same.
94+
95+
To further reduce the size of the simplified tree sequence, simplification normally
96+
removes nodes from the ancestry that no longer represent branch points (coalescences).
97+
We can leave those in using `keep_unary=True`.
98+
99+
```{code-cell} ipython3
100+
ts_simp4 = ts.simplify(focal_node_ids, filter_nodes=False, keep_unary=True)
101+
ts_simp4.draw_svg(**plot_params)
102+
```
103+
104+
:::{note}
105+
As modifying a tree sequence can change the IDs of nodes, sites, and other objects,
106+
it can be useful to use {ref}`metadata <sec_tutorial_metadata>`:
107+
information that stays associated with tskit objects even when their IDs change.
108+
When simplifying, it is also possible to keep track of node ID changes by using
109+
the `map_nodes` parameter, as demonstrated later in this tutorial.
110+
:::
111+
112+
## A larger simplification example
113+
114+
Here we examine the impact of simplification on the efficiency of tree sequence
115+
storage and processing. We'll start with a larger backward simulation that has
116+
a handful of admixed individuals:
117+
118+
```{code-cell} ipython3
119+
demography = msprime.Demography()
120+
demography.add_population(name="SMALL", initial_size=2_000)
121+
demography.add_population(name="BIG", initial_size=5_000)
122+
demography.add_population(name="ADMIX", initial_size=2_000)
123+
demography.add_population(name="ANC", initial_size=5_000)
124+
demography.add_admixture(
125+
time=100, derived="ADMIX", ancestral=["SMALL", "BIG"], proportions=[0.5, 0.5]
126+
)
127+
demography.add_population_split(time=1_000, derived=["SMALL", "BIG"], ancestral="ANC")
128+
129+
big_ts = msprime.sim_ancestry(
130+
samples={"SMALL": 400, "BIG": 400, "ADMIX": 6, "ANC": 400},
131+
demography=demography,
132+
sequence_length=5e7,
133+
recombination_rate=2e-8,
134+
random_seed=2432,
135+
)
136+
big_ts = msprime.sim_mutations(big_ts, rate=1e-8, random_seed=6151)
137+
print(
138+
f"`big_ts` represents a simulation with admixture of {big_ts.num_samples} samples",
139+
f"over {big_ts.sequence_length/1e6:g} Mb ({big_ts.num_trees} trees)",
140+
)
141+
```
142+
143+
## Use case 1: remove historical samples
144+
145+
Here, about a third of the sample nodes (those from the `ANC` population) exist
146+
at times prior to the current generation, i.e. they are *historical* sample nodes.
147+
In fact, in forward simulations most nodes will be of this sort, left over from
148+
previously simulated generations. These are often unwanted, and one of the main
149+
use cases for simplification is to reduce the ancestry to that of just the
150+
contemporary genomes: i.e. removing any edges, nodes, and mutations that track
151+
"extinct" lineages.
152+
153+
```{code-cell} ipython3
154+
modern_sample_ids = big_ts.samples(time=0)
155+
ts_modern = big_ts.simplify(modern_sample_ids)
156+
print(f"Tree sequence simplified to {ts_modern.nbytes/big_ts.nbytes:.1%} of original size")
157+
```
158+
159+
Simplifying has only produced a modest reduction in size, but you can imagine that
160+
in a forward simulation, where the majority of genomes are historical, repeated
161+
simplification can result in huge savings. In practice, simulators usually do this
162+
regular simplification of the tables used to store the paths of genetic inheritance
163+
automatically, so using
164+
{meth}`~TableCollection.simplify` in this way is mainly of interest if you are
165+
{ref}`building your own forward simulator <sec_tskit_forward_simulations>`.
166+
167+
## Use case 2: simplify to a subset of samples
168+
169+
Often you might want to focus on only one group of samples, for example the small
170+
group of `ADMIX` individuals (population ID 2 in this simulation):
171+
172+
```{code-cell} ipython3
173+
admix_pop_id = 2
174+
assert big_ts.population(admix_pop_id).metadata["name"] == "ADMIX"
175+
admix_sample_ids = big_ts.samples(population=admix_pop_id)
176+
177+
ts_admix, node_map = big_ts.simplify(admix_sample_ids, map_nodes=True)
178+
print(f"Tree sequence simplified to {ts_admix.nbytes/big_ts.nbytes:.2%} of original size")
179+
print()
180+
print(f"Previous admixed sample IDs were {admix_sample_ids}")
181+
print(f"Simplifying has changed these to {node_map[admix_sample_ids]}")
182+
183+
# Check that these are indeed the only sample IDs
184+
assert set(node_map[admix_sample_ids]) == set(ts_admix.samples())
185+
```
29186

30-
:::{todo}
31-
Create content. See https://github.com/tskit-dev/tutorials/issues/52
187+
:::{note}
188+
The `map_nodes=True` argument means that `simplify()` returns both a new
189+
tree sequence and an array mapping each old node ID to its new ID, or to
190+
`tskit.NULL` if that node is removed.
191+
Here you can see that (unlike in previous examples) the sample node IDs
192+
have changed: unless `filter_nodes=False`, the _N_ node IDs provided as the `samples`
193+
argument will be allocated new IDs from 0 to _N_ - 1 in the returned tree sequence (so simplify can be used to reorder sample IDs, although
194+
{meth}`~TreeSequence.subset` is a way to do this with fewer side effects).
32195
:::
196+
197+
### Efficiency
198+
199+
Edges take up the majority of the space in most tree sequences. In this case you can
200+
see that although simplify has reduced the sample nodes to 12 genomes from
201+
the 6 diploid `ADMIX` individuals (a reduction of 99.5%), the number of edges
202+
has not been reduced by such a large amount.
203+
That's because many of the ancestors of the SMALL and BIG populations are also shared
204+
by `ADMIX`. It also shows why tree sequence structures are so effective for encoding
205+
and analysing large datasets: storage and processing efficiency, in particular the
206+
number of edges, is sub-linear in the number of samples.
207+
208+
```{code-cell} ipython3
209+
print(
210+
f"The simplified tree sequence has only {ts_admix.num_samples / big_ts.num_samples:.2%} of the samples,",
211+
f"but retains {ts_admix.num_edges / big_ts.num_edges:.2%} of the edges."
212+
)
213+
```
214+
215+
If you want to analyse only the admixed individuals, using the simplified tree sequence
216+
is much more efficient than running equivalent operations on the original `big_ts`:
217+
218+
```{code-cell} ipython3
219+
%%timeit
220+
# Speed test for decoding all the genetic variants of the admixed individuals
221+
for v in ts_admix.variants():
222+
pass
223+
```
224+
225+
Identical results can be obtained using the full tree sequence and restricting calculations to the `admix_sample_ids`, but this approach is much slower:
226+
227+
```{code-cell} ipython3
228+
%%timeit
229+
# Equivalent processing of admixed individuals, using the full tree sequence
230+
for v in big_ts.variants(samples=admix_sample_ids):
231+
pass
232+
```
233+
234+
The same efficiencies apply to calculating statistics on subsets of genomic samples.
235+
As simplification has been highly optimised in `tskit`, if you perform repeated
236+
processing of the same subset of genomes, it can be worth simplifying before
237+
processing.
238+
239+
### Removing other unused objects
240+
241+
If we print out the original and admix-only (simplified) tree sequence, we can see
242+
that a number of other tables have also been reduced in size. For instance,
243+
simplification has reduced the number of individuals from 1206 to 6, and the
244+
number of sites by about three quarters.
245+
246+
```{code-cell} ipython3
247+
print("Original tree sequence")
248+
big_ts
249+
```
250+
251+
```{code-cell} ipython3
252+
print("Simplified tree sequence")
253+
ts_admix
254+
```
255+
256+
Note that the call to {meth}`TreeSequence.simplify` has been recorded in the
257+
{ref}`sec_provenance` information. Like most tree sequence methods, you can pass
258+
`record_provenance=False` if you want this to be omitted (which will save space, but not
259+
lead to other efficiency gains).
260+
261+
On closer inspection, you might be surprised to see that there are still 4 populations in
262+
the simplified tree sequence, although it contains only samples from the `ADMIX` population:
263+
264+
```{code-cell} ipython3
265+
print(
266+
"Sample nodes in `ts_admix` belong to the following populations",
267+
ts_admix.tables.nodes.population[ts_admix.samples()],
268+
)
269+
ts_admix.tables.populations
270+
```
271+
272+
The reason that the other populations (`BIG`, `SMALL`, and `ANC`) have been retained is
273+
that the simulation has assigned populations to both sample and nonsample nodes. If we
274+
{ref}`edit <sec_tables_editing>` the tree sequence tables such that ancestral
275+
(non-sample) genomes are not associated with defined populations, then simplification will
276+
remove all but the admixed population (and reassign the population IDs as
277+
necessary).
278+
279+
An example of this is given in the code below, which performs a further round of simplification,
280+
taking advantage of the fact that if a list of focal nodes is not given, `simplify`
281+
uses the existing sample nodes.
282+
283+
```{code-cell} ipython3
284+
import numpy as np
285+
import tskit
286+
287+
tables = ts_admix.dump_tables()
288+
samples = ts_admix.samples()
289+
# Make an array of NULL population values for each node
290+
nodes_population = np.full_like(tables.nodes.population, tskit.NULL)
291+
# Set the sample node populations back to their expected population
292+
nodes_population[samples] = ts_admix.nodes_population[samples]
293+
tables.nodes.population = nodes_population
294+
tables.simplify() # This is the tables version of simplify, often used in forward sims
295+
ts_admix_only = tables.tree_sequence()
296+
297+
print(
298+
"Sample nodes in `ts_admix_only` belong to the following populations",
299+
ts_admix_only.tables.nodes.population[ts_admix_only.samples()],
300+
)
301+
ts_admix_only.tables.populations
302+
```
303+
304+
Although reducing the number of populations saves space, it requires care.
305+
For instance `admix_pop_id` can no longer be used to refer to the correct ID
306+
in the `ts_admix_only` tree sequence.
307+
308+
## Extra uses for simplification
309+
310+
Simplify is somewhat of a "Swiss Army knife" for tree sequences, and can be used in
311+
several other ways. See the {ref}`sec_advanced_simplification` tutorial for more details.

0 commit comments

Comments
 (0)