assert len(site_pos) == 1 and site_pos[0] == ts.sites_position[1]
ts.dump("data/simplification_basic.trees")
# create_notebook_data() # uncomment to recreate the tree seqs used in this notebook
```

(sec_simplification)=

# Simplification

The {meth}`~TreeSequence.simplify` method provides one of the most powerful ways to modify a [tskit](https://tskit.dev) {class}`TreeSequence`. It reduces a tree sequence to the ancestry of a chosen set of focal nodes, and by default marks those as samples, removing everything not needed to represent their ancestry. It is commonly used:

* In forward simulations, to remove lineages that have gone extinct
* To create a smaller tree sequence focussed on a subset of samples
* To remove redundant nodes and other tskit objects (e.g. unreferenced populations)

Other less common uses, such as retaining unary regions of coalescent nodes, and
simplification in parallel, are described in the {ref}`sec_advanced_simplification` tutorial.


## A single tree example

We start with a very small example for ease of visualisation. This is
a tree sequence consisting of a single tree with 8 haploid genomes
Create content. See https://github.com/tskit-dev/tutorials/issues/52
:::{note}
The `map_nodes=True` argument means that `simplify()` returns both a new
tree sequence and an array mapping each old node ID to its new ID, or to
`tskit.NULL` if that node is removed.
Here you can see that (unlike in previous examples) the sample node IDs
have changed: unless `filter_nodes=False`, the _N_ node IDs provided as the `samples`
argument will be allocated new IDs from 0 to _N_ - 1 in the returned tree sequence.
This means `simplify()` can be used to reorder sample IDs, although
{meth}`~TreeSequence.subset` does this with fewer side effects.
:::

### Efficiency

Edges take up the majority of the space in most tree sequences. In this case you can
see that although simplify has reduced the sample nodes to just the 12 genomes of
the 6 diploid `ADMIX` individuals (a reduction of 99.5%), the number of edges
has not been reduced by such a large amount.
That's because many of the ancestors of the `SMALL` and `BIG` populations are also shared
by `ADMIX`. It also shows why tree sequence structures are so effective for encoding
and analysing large datasets: storage and processing costs, in particular the
number of edges, grow sub-linearly with the number of samples.

```{code-cell} ipython3
print(
    f"The simplified tree sequence has only {ts_admix.num_samples / big_ts.num_samples:.2%} of the samples,",
    f"but retains {ts_admix.num_edges / big_ts.num_edges:.2%} of the edges."
)
```

If you want to analyse only the admixed individuals, using the simplified tree sequence
is much more efficient than running equivalent operations on the original `big_ts`:

```{code-cell} ipython3
%%timeit
# Speed test for decoding all the genetic variants of the admixed individuals
for v in ts_admix.variants():
    pass
```

Identical results can be obtained using the full tree sequence and restricting calculations to the `admix_sample_ids`, but this approach is much slower:

```{code-cell} ipython3
%%timeit
# Equivalent processing of admixed individuals, using the full tree sequence
for v in big_ts.variants(samples=admix_sample_ids):
    pass
```

The same efficiencies apply to calculating statistics on subsets of genomic samples.
As simplification has been highly optimised in `tskit`, if you perform repeated
processing of the same subset of genomes, it can be worth simplifying before
processing.

### Removing other unused objects

If we print out the original and admix-only (simplified) tree sequences, we can see
that a number of other tables have also been reduced in size. For instance,
simplification has reduced the number of individuals from 1206 to 6, and the
number of sites by about three quarters.

```{code-cell} ipython3
print("Original tree sequence")
big_ts
```

```{code-cell} ipython3
print("Simplified tree sequence")
ts_admix
```

Note that the call to {meth}`TreeSequence.simplify` has been recorded in the
{ref}`sec_provenance` information. As with most tree sequence methods, you can pass
`record_provenance=False` if you want this to be omitted (which will save space, but
will not lead to other efficiency gains).

On closer inspection, you might be surprised to see that there are still 4 populations in
the simplified tree sequence, although it contains only samples from the `ADMIX` population:

```{code-cell} ipython3
print(
    "Sample nodes in `ts_admix` belong to the following populations",