Skip to content

Commit 0f5487d

Browse files
authored
Merge pull request #302 from hyanwong/simplification-tutes
Simplification tutes
2 parents da1f625 + edc2048 commit 0f5487d

6 files changed

Lines changed: 599 additions & 49 deletions

File tree

_toc.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,8 @@ parts:
1919
chapters:
2020
- file: tables_and_editing
2121
- file: simplification
22+
sections:
23+
- file: advanced_simplification
2224
- file: viz
2325
- file: metadata
2426
- file: args

advanced_simplification.md

Lines changed: 249 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,249 @@
1+
---
2+
jupytext:
3+
text_representation:
4+
extension: .md
5+
format_name: myst
6+
format_version: 0.12
7+
jupytext_version: 1.9.1
8+
kernelspec:
9+
display_name: Python 3
10+
language: python
11+
name: python3
12+
---
13+
14+
```{currentmodule} tskit
15+
```
16+
17+
(sec_advanced_simplification)=
18+
19+
# _Advanced simplification_
20+
% remove underscores in title when tutorial is complete or near-complete
21+
22+
23+
This is a companion to the basic {ref}`sec_simplification` tutorial.
24+
It focuses on details of `simplify` behavior that are useful when you need precise
25+
control over node IDs, retained ancestry structure, and sample flags.
26+
27+
If you are primarily subsetting samples or reducing tree sequence size, start with
28+
{ref}`sec_simplification`.
29+
30+
31+
```{code-cell} ipython3
32+
:"tags": ["hide-input"]
33+
# Build a full ARG for demonstrations
34+
import msprime
35+
import numpy as np
36+
import tskit
37+
import tskit_arg_visualizer as argviz
38+
39+
arg = msprime.sim_ancestry(
40+
samples=10,
41+
sequence_length=1e4,
42+
recombination_rate=1e-8,
43+
population_size=1e4,
44+
record_full_arg=True,
45+
random_seed=123,
46+
)
47+
arg = msprime.sim_mutations(arg, rate=1e-8, random_seed=124)
48+
```
49+
50+
:::{note}
51+
You can both simplify an immutable {class}`TreeSequence` (to return a new one),
52+
or run {meth}`~TableCollection.simplify` on a {class}`TableCollection` (which
53+
modifies in place, and requires {ref}`sec_table_indexes` to be built as well as the
54+
tables to be {meth}`sorted <TableCollection.sort>`). Simplifying tables in place
55+
is often useful for {ref}`forward-time simulations <sec_tskit_forward_simulations>`.
56+
:::
57+
58+
## 1) Tracking node ID changes
59+
60+
With default settings, simplification compacts tables and therefore reassigns node
61+
IDs. If downstream code makes use of stored node IDs, request a map:
62+
63+
```{code-cell} ipython3
64+
focal = arg.samples()[-6:] # pick last 6 samples (3 diploids)
65+
simp, node_map = arg.simplify(focal, map_nodes=True)
66+
67+
print(f"Original ARG: {arg.num_individuals} individuals, simplified to {simp.num_individuals} individuals")
68+
print(f"Original ARG: {arg.num_nodes} nodes, simplified to {simp.num_nodes} nodes")
69+
print("Dropped nodes:", int(np.sum(node_map == tskit.NULL)))
70+
print("Old sample ID", int(focal[0]), "maps to new ID", int(node_map[focal[0]]))
71+
```
72+
73+
Note that when simplifying tables in-place using {meth}`TableCollection.simplify` a map
74+
is always returned. To avoid compacting the node table, and leave node IDs unchanged, use
75+
`filter_nodes=False`.
76+
77+
### Obtaining the reverse map
78+
79+
Often you might want a reverse map, mapping the new node IDs to the old ones. Here's
80+
a simple way to do this:
81+
82+
```{code-cell} ipython3
83+
def invert_map(node_mapping):
84+
kept = node_mapping != tskit.NULL
85+
indexes = node_mapping[kept] # indexes are guaranteed 0..N-1
86+
rev_map = np.full_like(indexes, tskit.NULL)
87+
rev_map[indexes] = np.flatnonzero(kept)
88+
return rev_map
89+
90+
reverse_map = invert_map(node_map)
91+
print("New sample ID 0", "maps to old ID", int(reverse_map[0]))
92+
```
93+
94+
## 2) Keeping input roots
95+
96+
:::{todo}
97+
This is easy to illustrate, and useful for forward sims / census approaches
98+
:::
99+
100+
## 3) Setting sample flags
101+
102+
Normally the nodes that are provided to the `simplify()` function are marked as sample
103+
nodes in the output (by setting the `NODE_IS_SAMPLE` flag), and other nodes have that flag unset.
104+
If you provide the `update_sample_flags=False` option, all node flags are left unchanged.
105+
Here are some cases where that can be useful.
106+
107+
### Parallel simplification
108+
109+
One use for the `update_sample_flags=False` option combines it with `filter_nodes=False`,
110+
to ensure that the node table remains untouched during simplification.
111+
This is primarily a use-case targetted at developers of forward simulators, and allows
112+
logically disjunct parts of the edge table to be simplified in parallel, without
113+
risking two parallel processes trying to alter the same data.
114+
115+
:::{todo}
116+
Should we provide an example? However, this tends to be done at a lower level, e.g.
117+
using the C API, and is likely to be only useful for developers.
118+
:::
119+
120+
### Retaining non-coalescent nodes
121+
122+
The `keep_unary=True` option retains *all* ancestral nodes after simplification, even
123+
if they don't represent coalescences. Sometimes you might want to retain *some* but not
124+
all such nodes, which can be done by adding them to the focal node list but using
125+
`update_sample_flags=False` to ensure they are not marked as samples in the final output.
126+
This is usually followed by an additional simplification pass to remove any topology
127+
above the additional nodes which is not ancestral to the true samples. Here are some examples:
128+
129+
#### Keeping non-coalescent regions of coalescent nodes
130+
131+
By default, `simplify()` deletes not just non-coalescent nodes, but also
132+
removes from the ancestry those regions of coalescent nodes which do not represent
133+
a coalescence in the local tree (i.e. are "locally unary"). We can identify coalescent
134+
nodes using an initial round of simplification:
135+
136+
```{code-cell}
137+
simp, node_map = arg.simplify(focal, map_nodes=True)
138+
keep_nodes = np.where(node_map != tskit.NULL)[0] #includes sample and coalescent nodes
139+
140+
# Retain lots of nodes by treating them as focal, but don't change sample flags
141+
tmp_ts = arg.simplify(keep_nodes, update_sample_flags=False)
142+
# Now re-simplify, in case any coalescent nodes have unwanted ancestry above
143+
part_simp_ts = tmp_ts.simplify(keep_unary=True)
144+
```
145+
146+
Often, leaving in the non-coalescent regions of coalescent nodes can lead to a reduction
147+
in the number of edges. This is one reason that the {meth}`TreeSequence.extend_haplotypes` exists
148+
(see the documentation of that method for more detail).
149+
150+
```{code-cell}
151+
print(f"Full simplification to samples {focal} leads to {simp.num_edges} edges")
152+
print(
153+
f"Similar simplification leaving partially-coalescent nodes leads to {part_simp_ts.num_edges} edges"
154+
)
155+
part_simp_ts.draw_svg(title=(
156+
f"ARG simplified to samples {focal}, leaving part-coalescent nodes: "
157+
f"this has {part_simp_ts.num_edges} edges"
158+
))
159+
```
160+
161+
162+
#### Subsetting a full ARG
163+
164+
The {ref}`sec_args` tutorial discusses the idea of a "full ARG", containing recombination and
165+
non-coalescent common ancestor nodes, which do not correspond to coalescent events anywhere in
166+
the genealogy. We can use `simplify()` to reduce a full ARG to a subset of samples, while retaining those non-coalescent nodes. Again, we need to figure out
167+
which nodes to keep, then add them to the focal nodes, simplifying with `update_sample_flags=False`.
168+
169+
We can detect the common ancestor nodes by finding those which still have 2 unique children after
170+
simplifying with `keep_unary=True` and the recombination nodes
171+
by finding which still come in pairs after such simplification.
172+
173+
```{code-cell} ipython3
174+
def arg_num_children(ts): # see the ARG tutorial which explains this code
175+
same_parent = np.concatenate((ts.edges_parent[1:] == ts.edges_parent[:-1], [False]))
176+
same_child = np.concatenate((ts.edges_child[1:] == ts.edges_child[:-1], [False]))
177+
is_last_unique = ~same_parent | ~same_child # last occurrence of each unique child per parent
178+
return np.bincount(ts.edges_parent[is_last_unique], minlength=ts.num_nodes)
179+
180+
re_node_pairs = np.where(arg.nodes_flags & msprime.NODE_IS_RE_EVENT)[0].reshape(-1, 2)
181+
print("RE nodes before simplification\n", re_node_pairs)
182+
```
183+
184+
Identifying the _msprime_ recombination nodes that stay as pairs after simplification requires
185+
a little work:
186+
187+
:::{todo}
188+
Currently the code below doesn't quite work, because `keep_unary` forces the nodes above the local
189+
roots to be kept, see https://github.com/tskit-dev/tskit/issues/3450. This means that some RE
190+
(and possibly CA) nodes are kept when they should be discarded.
191+
:::
192+
193+
```{code-cell} ipython3
194+
simp, node_map = arg.simplify(focal, keep_unary=True, map_nodes=True)
195+
reverse_map = invert_map(node_map)
196+
simp_re_nodes = reverse_map[np.where(simp.nodes_flags & msprime.NODE_IS_RE_EVENT)[0]]
197+
valid_pairs = np.isin(re_node_pairs, simp_re_nodes).all(axis=1)
198+
keep_re_nodes = re_node_pairs[valid_pairs, :]
199+
print("paired RE nodes after simplification\n", keep_re_nodes)
200+
201+
keep_RE_nodes = keep_re_nodes.flatten()
202+
keep_CA_nodes = reverse_map[arg_num_children(simp) > 1]
203+
```
204+
205+
Now that we have defined which nodes to keep, we can use the same trick as before,
206+
passing these nodes as focal, but simplifying twice, once with `update_sample_flags=False`
207+
then again with `keep_unary=True`:
208+
209+
```{code-cell} ipython3
210+
keep = np.concatenate((focal, keep_CA_nodes, keep_RE_nodes))
211+
212+
tmp_arg = arg.simplify(keep, update_sample_flags=False)
213+
subset_arg = tmp_arg.simplify(keep_unary=True) # Defaults to focal nodes = existing samples
214+
```
215+
216+
Here's what it looks like in graph form:
217+
218+
```{code-cell} ipython3
219+
d3arg = argviz.D3ARG.from_ts(ts=subset_arg)
220+
d3arg.draw(title=f"A full ARG, subset to {subset_arg.num_samples} samples");
221+
```
222+
223+
## 4) Keeping individuals
224+
225+
In some cases, a tree sequence might contain historical individuals which are associated
226+
with nodes that are not samples, and you wish to retain information on individuals which are
227+
ancestral to the sample nodes. For example a forward-time simulation could
228+
define individuals for all nodes in the past, including the pedigree links between parents
229+
and children (see also discussion in the SLiM manual, and at
230+
https://github.com/MesserLab/SLiM/issues/139).
231+
232+
To keep all the individuals associated with genetic ancestry, you can use
233+
`keep_unary_in_individuals=True`.
234+
235+
:::{todo}
236+
Should we have a demonstration here? {ref}`sec_tskit_forward_simulations` could be used to
237+
create a simulator that saves pedigree information into each individual, and we could distill
238+
some of the discussion from https://github.com/MesserLab/SLiM/issues/139 into that.
239+
:::
240+
241+
## 5) reduce_to_site_topology
242+
243+
:::{todo}
244+
Explain why you might use `reduce_to_site_topology`, e.g. if there is a lot of information
245+
that is not inferable from sites. Note that this does not change the topology at each site,
246+
although some of the topology at a site might not actually be needed to represent the site
247+
variation (to see how to condense topology into "polytomies", see
248+
https://github.com/tskit-dev/tskit/discussions/2926).
249+
:::

0 commit comments

Comments
 (0)