Skip to content

Commit 15417a0

Browse files
committed
Fill out advanced simplification, item 2
1 parent f1ccda4 commit 15417a0

1 file changed

Lines changed: 86 additions & 15 deletions

File tree

advanced_simplification.md

Lines changed: 86 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ kernelspec:
2020
% remove underscores in title when tutorial is complete or near-complete
2121

2222
:::{todo}
23-
This tutorial is only partly complete: and there are a number of sections containing TODO items.
23+
This tutorial is only partly complete, and there are a number of sections containing TODO items.
2424
:::
2525

2626
This is a companion to the basic {ref}`sec_simplification` tutorial.
@@ -41,13 +41,13 @@ import tskit_arg_visualizer as argviz
4141
4242
arg = msprime.sim_ancestry(
4343
samples=10,
44-
sequence_length=1e4,
44+
sequence_length=1.8e4,
4545
recombination_rate=1e-8,
4646
population_size=1e4,
4747
record_full_arg=True,
4848
random_seed=123,
4949
)
50-
arg = msprime.sim_mutations(arg, rate=1e-8, random_seed=124)
50+
arg = msprime.sim_mutations(arg, rate=2e-8, random_seed=123)
5151
```
5252

5353
:::{note}
@@ -86,7 +86,7 @@ is always returned. To avoid compacting the node table, and leave node IDs uncha
8686
Often you might want a reverse map, mapping the new node IDs to the old ones. Here's
8787
a simple way to do this:
8888

89-
```{code-cell} ipython3
89+
```{code-cell}
9090
def invert_map(node_mapping):
9191
kept = node_mapping != tskit.NULL
9292
indexes = node_mapping[kept] # indexes are guaranteed 0..N-1
@@ -98,12 +98,74 @@ reverse_map = invert_map(node_map)
9898
print("New sample ID 0", "maps to old ID", int(reverse_map[0]))
9999
```
100100

101-
## 2) Keeping input roots
101+
Here's how the IDs in the first tree have changed:
102102

103-
:::{todo}
104-
The `keep_input_roots=True` argument is easy to illustrate, and useful for
105-
forward sims / census approaches.
106-
:::
103+
```{code-cell}
104+
simp.first().draw_svg(
105+
size=(800, 300),
106+
node_labels={nd.id: f"id:{nd.id} (old id:{reverse_map[nd.id]})" for nd in simp.nodes()}
107+
)
108+
```
109+
110+
## 2) Information above the local roots
111+
112+
Usually there are no nodes or mutations in a tree sequence above each local root.
113+
However, as simplification deletes topology, it can create new local roots, leading to this
114+
expectation being broken.
115+
116+
In some cases, you might want to retain nodes above the local roots, which is possible by
117+
setting `keep_input_roots=True`. The most common reason for this is to allow
118+
{ref}`recapitation <sec_completing_forward_simulations>` of forward-time simulations.
119+
See {ref}`this tutorial section <sec_completing_forward_simulations_input_roots>`
120+
for details.
121+
122+
### Removing mutations above the root
123+
124+
You can also end up with mutations above the root, for instance when all the chosen samples
125+
are monomorphic, sharing a single derived mutation. In long running forward-time
126+
simulations with mutation, this many mutations like this can gather above each local root.
127+
These can be removed by re-setting the `ancestral_state` of a site to the `derived_state`
128+
of the mutation immediately above the root node. There is currently no method
129+
provided to do this, but the following code should work
130+
(see [this tskit issue](https://github.com/tskit-dev/tskit/issues/260)):
131+
132+
```{code-cell}
133+
def remove_root_mutations(ts):
134+
tables = ts.dump_tables()
135+
tables.sites.clear()
136+
tables.mutations.clear()
137+
for tree in ts.trees():
138+
for s in tree.sites():
139+
anc_state = s.ancestral_state
140+
root_states = {u: anc_state for u in tree.roots if not tree.is_isolated(u)}
141+
for m in s.mutations:
142+
if m.node in root_states:
143+
anc_state = m.derived_state
144+
root_states[m.node] = anc_state
145+
else:
146+
tables.mutations.append(m.replace(parent=tskit.NULL))
147+
if all([anc == anc_state for anc in root_states.values()]):
148+
if anc_state != s.ancestral_state:
149+
print(
150+
f"Changed ancestral state from {s.ancestral_state} to {anc_state} "
151+
f"for site {s.id} at position {s.position}"
152+
)
153+
tables.sites.append(s.replace(ancestral_state=anc_state))
154+
else:
155+
raise ValueError(
156+
f"Multiple roots with different inherited states exist for the site at position {s.position}"
157+
)
158+
tables.compute_mutation_parents()
159+
return tables.tree_sequence()
160+
161+
simp_no_root_muts = remove_root_mutations(simp)
162+
# Check this encodes the same genetic variation
163+
for var1, var2 in zip(simp.variants(), simp_no_root_muts.variants()):
164+
assert (var1.states() == var2.states()).all()
165+
print(
166+
f"Original simplified tree sequence had {simp.num_mutations} mutations, "
167+
f"now has {simp_no_root_muts.num_mutations} mutations")
168+
```
107169

108170
## 3) Keeping ancestral individuals
109171

@@ -223,9 +285,9 @@ Identifying the _msprime_ recombination nodes that stay as pairs after simplific
223285
a little work:
224286

225287
:::{todo}
226-
Currently the code below doesn't quite work, because `keep_unary` forces the nodes above the local
227-
roots to be kept, see https://github.com/tskit-dev/tskit/issues/3450. This means that some RE
228-
(and possibly CA) nodes are kept when they should be discarded.
288+
Currently the code below wrongly includes a few extra RE and CA nodes, because nodes
289+
above local roots are retained when `keep_unary=True`, see
290+
https://github.com/tskit-dev/tskit/issues/3450.
229291
:::
230292

231293
```{code-cell} ipython3
@@ -241,7 +303,7 @@ keep_CA_nodes = reverse_map[arg_num_children(simp) > 1]
241303
```
242304

243305
Now that we have defined which nodes to keep, we can use the same trick as before,
244-
passing these nodes as focal, but simplifying twice, once with `update_sample_flags=False`
306+
passing these nodes as focal, but simplifying twice, once with `update_sample_flags=False`,
245307
then again with `keep_unary=True`:
246308

247309
```{code-cell} ipython3
@@ -251,11 +313,20 @@ tmp_arg = arg.simplify(keep, update_sample_flags=False)
251313
subset_arg = tmp_arg.simplify(keep_unary=True) # Defaults to focal nodes = existing samples
252314
```
253315

254-
Here's what it looks like in graph form:
316+
Here's what it looks like in graph form, with recombination nodes in red and common ancestor non-coalescent nodes in blue:
255317

256318
```{code-cell} ipython3
257319
d3arg = argviz.D3ARG.from_ts(ts=subset_arg)
258-
d3arg.draw(title=f"A full ARG, subset to {subset_arg.num_samples} samples");
320+
d3arg.set_all_node_styles(size=100, stroke_width=2)
321+
d3arg.set_node_styles({
322+
u: {"symbol": "d3.symbolSquare", "fill": "black"} for u in subset_arg.samples()
323+
})
324+
d3arg.set_node_styles({i : {"fill": "red"} for i in node_map[keep_RE_nodes]})
325+
d3arg.set_node_styles({i : {"fill": "blue"} for i in node_map[keep_CA_nodes]})
326+
d3arg.draw(
327+
edge_type="ortho",
328+
height=800,
329+
title=f"A full ARG, subset to {subset_arg.num_samples} samples");
259330
```
260331

261332
## 5) reduce_to_site_topology

0 commit comments

Comments
 (0)