@@ -20,7 +20,7 @@ kernelspec:
2020% remove underscores in title when tutorial is complete or near-complete
2121
2222:::{todo}
23- This tutorial is only partly complete: and there are a number of sections containing TODO items.
23+ This tutorial is only partly complete, and there are a number of sections containing TODO items.
2424:::
2525
2626This is a companion to the basic {ref}` sec_simplification ` tutorial.
@@ -41,13 +41,13 @@ import tskit_arg_visualizer as argviz
4141
4242arg = msprime.sim_ancestry(
4343 samples=10,
44- sequence_length=1e4 ,
44+ sequence_length=1.8e4 ,
4545 recombination_rate=1e-8,
4646 population_size=1e4,
4747 record_full_arg=True,
4848 random_seed=123,
4949)
50- arg = msprime.sim_mutations(arg, rate=1e -8, random_seed=124 )
50+ arg = msprime.sim_mutations(arg, rate=2e -8, random_seed=123 )
5151```
5252
5353:::{note}
@@ -86,7 +86,7 @@ is always returned. To avoid compacting the node table, and leave node IDs uncha
8686Often you might want a reverse map, mapping the new node IDs to the old ones. Here's
8787a simple way to do this:
8888
89- ``` {code-cell} ipython3
89+ ``` {code-cell}
9090def invert_map(node_mapping):
9191 kept = node_mapping != tskit.NULL
9292 indexes = node_mapping[kept] # indexes are guaranteed 0..N-1
@@ -98,12 +98,74 @@ reverse_map = invert_map(node_map)
9898print("New sample ID 0", "maps to old ID", int(reverse_map[0]))
9999```
100100
101- ## 2) Keeping input roots
101+ Here's how the IDs in the first tree have changed:
102102
103- :::{todo}
104- The ` keep_input_roots=True ` argument is easy to illustrate, and useful for
105- forward sims / census approaches.
106- :::
103+ ``` {code-cell}
104+ simp.first().draw_svg(
105+ size=(800, 300),
106+ node_labels={nd.id: f"id:{nd.id} (old id:{reverse_map[nd.id]})" for nd in simp.nodes()}
107+ )
108+ ```
109+
110+ ## 2) Information above the local roots
111+
112+ Usually there are no nodes or mutations in a tree sequence above each local root.
113+ However, as simplification deletes topology, it can create new local roots, leading to this
114+ expectation being broken.
115+
116+ In some cases, you might want to retain nodes above the local roots, which is possible by
117+ setting ` keep_input_roots=True ` . The most common reason for this is to allow
118+ {ref}` recapitation <sec_completing_forward_simulations> ` of forward-time simulations.
119+ See {ref}` this tutorial section <sec_completing_forward_simulations_input_roots> `
120+ for details.
121+
122+ ### Removing mutations above the root
123+
124+ You can also end up with mutations above the root, for instance when all the chosen samples
125+ are monomorphic, sharing a single derived mutation. In long running forward-time
126+ simulations with mutation, this many mutations like this can gather above each local root.
127+ These can be removed by re-setting the ` ancestral_state ` of a site to the ` derived_state `
128+ of the mutation immediately above the root node. There is currently no method
129+ provided to do this, but the following code should work
130+ (see [ this tskit issue] ( https://github.com/tskit-dev/tskit/issues/260 ) ):
131+
132+ ``` {code-cell}
133+ def remove_root_mutations(ts):
134+ tables = ts.dump_tables()
135+ tables.sites.clear()
136+ tables.mutations.clear()
137+ for tree in ts.trees():
138+ for s in tree.sites():
139+ anc_state = s.ancestral_state
140+ root_states = {u: anc_state for u in tree.roots if not tree.is_isolated(u)}
141+ for m in s.mutations:
142+ if m.node in root_states:
143+ anc_state = m.derived_state
144+ root_states[m.node] = anc_state
145+ else:
146+ tables.mutations.append(m.replace(parent=tskit.NULL))
147+ if all([anc == anc_state for anc in root_states.values()]):
148+ if anc_state != s.ancestral_state:
149+ print(
150+ f"Changed ancestral state from {s.ancestral_state} to {anc_state} "
151+ f"for site {s.id} at position {s.position}"
152+ )
153+ tables.sites.append(s.replace(ancestral_state=anc_state))
154+ else:
155+ raise ValueError(
156+ f"Multiple roots with different inherited states exist for the site at position {s.position}"
157+ )
158+ tables.compute_mutation_parents()
159+ return tables.tree_sequence()
160+
161+ simp_no_root_muts = remove_root_mutations(simp)
162+ # Check this encodes the same genetic variation
163+ for var1, var2 in zip(simp.variants(), simp_no_root_muts.variants()):
164+ assert (var1.states() == var2.states()).all()
165+ print(
166+ f"Original simplified tree sequence had {simp.num_mutations} mutations, "
167+ f"now has {simp_no_root_muts.num_mutations} mutations")
168+ ```
107169
108170## 3) Keeping ancestral individuals
109171
@@ -223,9 +285,9 @@ Identifying the _msprime_ recombination nodes that stay as pairs after simplific
223285a little work:
224286
225287:::{todo}
226- Currently the code below doesn't quite work, because ` keep_unary ` forces the nodes above the local
227- roots to be kept, see https://github.com/tskit-dev/tskit/issues/3450 . This means that some RE
228- (and possibly CA) nodes are kept when they should be discarded .
288+ Currently the code below wrongly includes a few extra RE and CA nodes, because nodes
289+ above local roots are retained when ` keep_unary=True ` , see
290+ https://github.com/tskit-dev/tskit/issues/3450 .
229291:::
230292
231293``` {code-cell} ipython3
@@ -241,7 +303,7 @@ keep_CA_nodes = reverse_map[arg_num_children(simp) > 1]
241303```
242304
243305Now that we have defined which nodes to keep, we can use the same trick as before,
244- passing these nodes as focal, but simplifying twice, once with ` update_sample_flags=False `
306+ passing these nodes as focal, but simplifying twice, once with ` update_sample_flags=False ` ,
245307then again with ` keep_unary=True ` :
246308
247309``` {code-cell} ipython3
@@ -251,11 +313,20 @@ tmp_arg = arg.simplify(keep, update_sample_flags=False)
251313subset_arg = tmp_arg.simplify(keep_unary=True) # Defaults to focal nodes = existing samples
252314```
253315
254- Here's what it looks like in graph form:
316+ Here's what it looks like in graph form, with recombination nodes in red and common ancestor non-coalescent nodes in blue :
255317
256318``` {code-cell} ipython3
257319d3arg = argviz.D3ARG.from_ts(ts=subset_arg)
258- d3arg.draw(title=f"A full ARG, subset to {subset_arg.num_samples} samples");
320+ d3arg.set_all_node_styles(size=100, stroke_width=2)
321+ d3arg.set_node_styles({
322+ u: {"symbol": "d3.symbolSquare", "fill": "black"} for u in subset_arg.samples()
323+ })
324+ d3arg.set_node_styles({i : {"fill": "red"} for i in node_map[keep_RE_nodes]})
325+ d3arg.set_node_styles({i : {"fill": "blue"} for i in node_map[keep_CA_nodes]})
326+ d3arg.draw(
327+ edge_type="ortho",
328+ height=800,
329+ title=f"A full ARG, subset to {subset_arg.num_samples} samples");
259330```
260331
261332## 5) reduce_to_site_topology
0 commit comments