Commit 09827e1

Merge pull request #304 from hyanwong/simplification-tutes
Adds information for above-the-root details
2 parents ff4575c + 890dbb0 commit 09827e1

6 files changed

Lines changed: 110 additions & 38 deletions

advanced_simplification.md

Lines changed: 86 additions & 15 deletions
@@ -20,7 +20,7 @@ kernelspec:
 % remove underscores in title when tutorial is complete or near-complete

 :::{todo}
-This tutorial is only partly complete: and there are a number of sections containing TODO items.
+This tutorial is only partly complete, and there are a number of sections containing TODO items.
 :::

 This is a companion to the basic {ref}`sec_simplification` tutorial.
@@ -41,13 +41,13 @@ import tskit_arg_visualizer as argviz

 arg = msprime.sim_ancestry(
     samples=10,
-    sequence_length=1e4,
+    sequence_length=1.8e4,
     recombination_rate=1e-8,
     population_size=1e4,
     record_full_arg=True,
     random_seed=123,
 )
-arg = msprime.sim_mutations(arg, rate=1e-8, random_seed=124)
+arg = msprime.sim_mutations(arg, rate=2e-8, random_seed=123)
 ```

 :::{note}
@@ -86,7 +86,7 @@ is always returned. To avoid compacting the node table, and leave node IDs uncha
 Often you might want a reverse map, mapping the new node IDs to the old ones. Here's
 a simple way to do this:

-```{code-cell} ipython3
+```{code-cell}
 def invert_map(node_mapping):
     kept = node_mapping != tskit.NULL
     indexes = node_mapping[kept]  # indexes are guaranteed 0..N-1
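The hunk above cuts off before the end of `invert_map`, so as a rough illustration of the idea only, here is a minimal pure-Python sketch of inverting a node map. It is not the tutorial's NumPy implementation: the name `invert_map_sketch`, the toy `node_map`, and the constant `NULL = -1` (standing in for `tskit.NULL`) are assumptions made for this example. It relies on the same guarantee noted in the code comment above, namely that the new IDs run from 0 to N-1.

```python
NULL = -1  # assumption: stands in for tskit.NULL, which marks a removed node


def invert_map_sketch(node_map):
    """Return a list mapping each new node ID back to its old node ID."""
    # New IDs are assumed to be exactly 0..N-1, where N is the number of kept nodes
    num_kept = sum(1 for new_id in node_map if new_id != NULL)
    reverse = [NULL] * num_kept
    for old_id, new_id in enumerate(node_map):
        if new_id != NULL:
            reverse[new_id] = old_id
    return reverse


# Toy forward map (hypothetical): old nodes 1 and 3 were removed
node_map = [0, NULL, 2, NULL, 1]
reverse_map = invert_map_sketch(node_map)
print(reverse_map)  # [0, 4, 2]
```

Round-tripping is a cheap sanity check: for every new ID `i`, `node_map[reverse_map[i]]` should equal `i`.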
@@ -98,12 +98,74 @@ reverse_map = invert_map(node_map)
 print("New sample ID 0", "maps to old ID", int(reverse_map[0]))
 ```

-## 2) Keeping input roots
+Here's how the IDs in the first tree have changed:

-:::{todo}
-The `keep_input_roots=True` argument is easy to illustrate, and useful for
-forward sims / census approaches.
-:::
+```{code-cell}
+simp.first().draw_svg(
+    size=(800, 300),
+    node_labels={nd.id: f"id:{nd.id} (old id:{reverse_map[nd.id]})" for nd in simp.nodes()}
+)
+```
+
+## 2) Information above the local roots
+
+Usually there are no nodes or mutations in a tree sequence above each local root.
+However, as simplification deletes topology, it can create new local roots, leading to this
+expectation being broken.
+
+In some cases, you might want to retain nodes above the local roots, which is possible by
+setting `keep_input_roots=True`. The most common reason for this is to allow
+{ref}`recapitation <sec_completing_forward_simulations>` of forward-time simulations.
+See {ref}`this tutorial section <sec_completing_forward_simulations_input_roots>`
+for details.
+
+### Removing mutations above the root
+
+You can also end up with mutations above the root, for instance when all the chosen samples
+are monomorphic, sharing a single derived mutation. In long-running forward-time
+simulations with mutation, many mutations like this can accumulate above each local root.
+These can be removed by re-setting the `ancestral_state` of a site to the `derived_state`
+of the mutation immediately above the root node. There is currently no method
+provided to do this, but the following code should work
+(see [this tskit issue](https://github.com/tskit-dev/tskit/issues/260)):
+
+```{code-cell}
+def remove_root_mutations(ts):
+    tables = ts.dump_tables()
+    tables.sites.clear()
+    tables.mutations.clear()
+    for tree in ts.trees():
+        for s in tree.sites():
+            anc_state = s.ancestral_state
+            root_states = {u: anc_state for u in tree.roots if not tree.is_isolated(u)}
+            for m in s.mutations:
+                if m.node in root_states:
+                    anc_state = m.derived_state
+                    root_states[m.node] = anc_state
+                else:
+                    tables.mutations.append(m.replace(parent=tskit.NULL))
+            if all([anc == anc_state for anc in root_states.values()]):
+                if anc_state != s.ancestral_state:
+                    print(
+                        f"Changed ancestral state from {s.ancestral_state} to {anc_state} "
+                        f"for site {s.id} at position {s.position}"
+                    )
+                tables.sites.append(s.replace(ancestral_state=anc_state))
+            else:
+                raise ValueError(
+                    f"Multiple roots with different inherited states exist for the site at position {s.position}"
+                )
+    tables.compute_mutation_parents()
+    return tables.tree_sequence()
+
+simp_no_root_muts = remove_root_mutations(simp)
+# Check this encodes the same genetic variation
+for var1, var2 in zip(simp.variants(), simp_no_root_muts.variants()):
+    assert (var1.states() == var2.states()).all()
+print(
+    f"Original simplified tree sequence had {simp.num_mutations} mutations, "
+    f"now has {simp_no_root_muts.num_mutations} mutations")
+```

 ## 3) Keeping ancestral individuals

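To make the reasoning behind `remove_root_mutations` concrete, here is a minimal pure-Python sketch on a toy tree, showing why folding a root mutation into the site's ancestral state leaves every sample's state unchanged. It deliberately avoids the tskit API: the `parent` dictionary, the helper names `sample_state` and `fold_root_mutation`, and the restriction to at most one mutation per node are all assumptions made for illustration, not part of the commit.

```python
def sample_state(sample, parent, ancestral_state, mutations):
    """State inherited by `sample`: the nearest mutation on the path to the root wins."""
    node = sample
    while node is not None:
        if node in mutations:
            return mutations[node]
        node = parent.get(node)  # None once we step above the root
    return ancestral_state


def fold_root_mutation(root, ancestral_state, mutations):
    """Drop any mutation recorded on the root, using its derived state as the new ancestral state."""
    remaining = {node: state for node, state in mutations.items() if node != root}
    new_ancestral = mutations.get(root, ancestral_state)
    return new_ancestral, remaining


# Hypothetical toy tree: root 4 has children 3 and 2; node 3 has children 0 and 1
parent = {0: 3, 1: 3, 2: 4, 3: 4}
samples = (0, 1, 2)
ancestral_state = "A"
mutations = {4: "T", 0: "G"}  # node -> derived state; one mutation sits on the root

before = [sample_state(u, parent, ancestral_state, mutations) for u in samples]
new_anc, new_muts = fold_root_mutation(4, ancestral_state, mutations)
after = [sample_state(u, parent, new_anc, new_muts) for u in samples]

print(before, after)            # ['G', 'T', 'T'] ['G', 'T', 'T']
print(new_anc, len(new_muts))   # T 1
```

The sample states are identical before and after, while the mutation count drops by one. That is the same invariant the `variants()` comparison in the added code cell checks on the real tree sequence.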
@@ -223,9 +285,9 @@ Identifying the _msprime_ recombination nodes that stay as pairs after simplific
 a little work:

 :::{todo}
-Currently the code below doesn't quite work, because `keep_unary` forces the nodes above the local
-roots to be kept, see https://github.com/tskit-dev/tskit/issues/3450. This means that some RE
-(and possibly CA) nodes are kept when they should be discarded.
+Currently the code below wrongly includes a few extra RE and CA nodes, because nodes
+above local roots are retained when `keep_unary=True`, see
+https://github.com/tskit-dev/tskit/issues/3450.
 :::

 ```{code-cell} ipython3
@@ -241,7 +303,7 @@ keep_CA_nodes = reverse_map[arg_num_children(simp) > 1]
 ```

 Now that we have defined which nodes to keep, we can use the same trick as before,
-passing these nodes as focal, but simplifying twice, once with `update_sample_flags=False`
+passing these nodes as focal, but simplifying twice, once with `update_sample_flags=False`,
 then again with `keep_unary=True`:

 ```{code-cell} ipython3
@@ -251,11 +313,20 @@ tmp_arg = arg.simplify(keep, update_sample_flags=False)
 subset_arg = tmp_arg.simplify(keep_unary=True)  # Defaults to focal nodes = existing samples
 ```

-Here's what it looks like in graph form:
+Here's what it looks like in graph form, with recombination nodes in red and common ancestor non-coalescent nodes in blue:

 ```{code-cell} ipython3
 d3arg = argviz.D3ARG.from_ts(ts=subset_arg)
-d3arg.draw(title=f"A full ARG, subset to {subset_arg.num_samples} samples");
+d3arg.set_all_node_styles(size=100, stroke_width=2)
+d3arg.set_node_styles({
+    u: {"symbol": "d3.symbolSquare", "fill": "black"} for u in subset_arg.samples()
+})
+d3arg.set_node_styles({i : {"fill": "red"} for i in node_map[keep_RE_nodes]})
+d3arg.set_node_styles({i : {"fill": "blue"} for i in node_map[keep_CA_nodes]})
+d3arg.draw(
+    edge_type="ortho",
+    height=800,
+    title=f"A full ARG, subset to {subset_arg.num_samples} samples");
 ```

 ## 5) reduce_to_site_topology

completing_forward_sims.md

Lines changed: 10 additions & 9 deletions
@@ -9,20 +9,20 @@ kernelspec:
 name: python3
 ---

-(sec_completing_forwards_simulations)=
+(sec_completing_forward_simulations)=

-# Completing forwards simulations
+# Recapitation: completing a forward simulation

-The ``msprime`` simulator generates tree sequences using the backwards in
-time coalescent model. But it is also possible to output tree sequences
-from [forwards-time](https://doi.org/10.1371/journal.pcbi.1006581)
+The ``msprime`` simulator generates tree sequences using the
+backward-in-time coalescent model. But it is also possible to output tree sequences
+from [forward-time](https://doi.org/10.1371/journal.pcbi.1006581)
 simulators such as [SLiM](https://messerlab.org/slim)
 and [fwdpy11](https://fwdpy11.readthedocs.io/) (see the
 {ref}`sec_tskit_forward_simulations` tutorial).
 There are many advantages to using forward-time simulators, but they
 are usually quite slow compared to similar coalescent simulations. In this
 section we show how to combine the best of both approaches by simulating
-the recent past using a forwards-time simulator and then complete the
+the recent past using a forward-time simulator and then complete the
 simulation of the ancient past using ``msprime``. (We sometimes refer to this
 as "recapitation", as we can think of it as adding a "head" onto a tree sequence.)

@@ -133,9 +133,10 @@ coalesced_ts = msprime.sim_ancestry(
 coalesced_ts.draw_svg()
 ```

-The trees have fully coalesced and we've successfully combined a forwards-time
+The trees have fully coalesced and we've successfully combined a forward-time
 Wright-Fisher simulation with a coalescent simulation: hooray!

+(sec_completing_forward_simulations_input_roots)=

 ## Why keep input roots (i.e., the initial generation)?

@@ -164,7 +165,7 @@ the method presented here.

 ## Topology gotchas

-The trees that we output from this combined forwards and backwards simulation
+The trees that we output from this combined forward and backward simulation
 process have some slightly odd properties that are important to be aware of.
 In the example above, we can see that the old roots are still present in both trees,
 even though they have only one child and are clearly redundant.
@@ -179,7 +180,7 @@ they may cause problems:
 2. If you are computing the overall tree "height" by taking the time of the
 root node, you may overestimate the height because there is a unary edge
 above the "real" root (this would happen if one of the trees had already
-coalesced in the forwards-time simulation).
+coalesced in the forward-time simulation).

 For these reasons it may be better to remove this redundancy from your
 computed tree sequence which is easily done using the

forward_sims.md

Lines changed: 3 additions & 3 deletions
@@ -25,7 +25,7 @@ along with their genomes, storing inherited genomic regions as well as the full
 The code in this tutorial is broken into separate functions for clarity and
 to make it easier to modify for your own purposes; a simpler and substantially
 condensed forward-simulator is coded as a single function at the top of the
-{ref}`sec_completing_forwards_simulations` tutorial.
+{ref}`sec_completing_forward_simulations` tutorial.

 :::{note}
 If you are simply trying to obtain a tree sequence which is
@@ -210,7 +210,7 @@ that the child inherits a mosaic of the two genomes present in each parent.

 The exact details of the mosaic will depend on the model of recombination you
 wish to implement. For instance, a simple model such as that in the
-{ref}`sec_completing_forwards_simulations` tutorial might assume exactly one
+{ref}`sec_completing_forward_simulations` tutorial might assume exactly one
 crossover per chromosome. A complex model might allow not just multiple
 crossovers with e.g. recombination "hotspots", but also non-crossover events
 such as {ref}`msprime:sec_ancestry_gene_conversion`.
@@ -475,7 +475,7 @@ in which an alternative technique, such as backward-in-time coalescent simulatio
 is used to fill in the "head" of the tree sequence. In other words,
 we can use a fast backward-time simulator such as `msprime` to simulate the
 prior genealogy of the oldest nodes in the simplified tree sequence.
-Details are described in the {ref}`sec_completing_forwards_simulations`
+Details are described in the {ref}`sec_completing_forward_simulations`
 tutorial.

 ## More complex forward-simulations

more_forward_sims.md

Lines changed: 1 addition & 1 deletion
@@ -254,5 +254,5 @@ This will primarily deal with sites and mutations (and mutational metadata).
 We could also include details on selection, if that seems sensible.

 The section in that workbook on "Starting with a prior history" should be put in
-the {ref}`sec_completing_forwards_simulations` tutorial.
+the {ref}`sec_completing_forward_simulations` tutorial.
 :::

simulation_overview.md

Lines changed: 9 additions & 9 deletions
@@ -44,29 +44,29 @@ Compare to expectations
 (e.g. for use as a null model). For instance, comparison to *neutral* simulations
 can be used to identify regions subject to selection.

-There are two major forms of population genetic simulation: **forwards-time**
-and **backwards-time**. In general, forwards-time simulation is detailed and more
-realistic, while backwards-time simulation is fast and efficient.
+There are two major forms of population genetic simulation: **forward-time**
+and **backward-time**. In general, forward-time simulation is detailed and more
+realistic, while backward-time simulation is fast and efficient.

 More specifically, apart from a
 {ref}`few exceptions <msprime:sec_ancestry_models_selective_sweeps>`,
-backwards-time simulations are primarily focused on neutral simulations, while
+backward-time simulations are primarily focused on neutral simulations, while
 forward simulation is better suited to complex simulations, including those involving
 selection and continuous space.

 ## Advantages of tree sequences

-Some forwards-time ([SLiM](http://messerlab.org/slim/),
-[fwdpy](http://molpopgen.github.io/fwdpy/)) and backwards-time
+Some forward-time ([SLiM](http://messerlab.org/slim/),
+[fwdpy](http://molpopgen.github.io/fwdpy/)) and backward-time
 ([msprime](https://tskit.dev/msprime)) simulators have a built-in capacity to output
 tree sequences. This can have several benefits:

 1. Neutral mutations, which often account for the majority of genetic variation, do not
 need to be tracked during the simulation, but can be added afterwards. See
 "{ref}`sec_tskit_no_mutations`".
-2. Tree sequences can be used as an interchange format to combine backwards and
-forwards simulations, allowing you to take advantage of the advantages of both
-approaches. This is detailed in {ref}`sec_completing_forwards_simulations`.
+2. Tree sequences can be used as an interchange format to combine backward and
+forward simulations, allowing you to take advantage of the advantages of both
+approaches. This is detailed in {ref}`sec_completing_forward_simulations`.

 ## Some tips on simulation

terminology_and_concepts.md

Lines changed: 1 addition & 1 deletion
@@ -145,7 +145,7 @@ objects. In tree sequences, however, all nodes, both internal and terminal,
 are represented by an **integer ID**, unique over the entire tree sequence, and which exists
 at a specific point in time. A branch point in any of the trees is associated with
 an *internal node*, representing an ancestor in which a single DNA
-sequence was duplicated (in forwards-time terminology) or in which multiple sequences
+sequence was duplicated (in forward-time terminology) or in which multiple sequences
 coalesced (in backwards-time terminology).
