Commit ff4575c

Merge pull request #303 from hyanwong/simplification-tutes
Minor simplification tutes corrections
2 parents 0f5487d + 1c13ed0 commit ff4575c

3 files changed

Lines changed: 65 additions & 38 deletions

advanced_simplification.md

Lines changed: 42 additions & 22 deletions
Original file line number | Diff line number | Diff line change
@@ -19,6 +19,9 @@ kernelspec:
1919
# _Advanced simplification_
2020
% remove underscores in title when tutorial is complete or near-complete
2121

22+
:::{todo}
23+
This tutorial is only partly complete: and there are a number of sections containing TODO items.
24+
:::
2225

2326
This is a companion to the basic {ref}`sec_simplification` tutorial.
2427
It focuses on details of `simplify` behavior that are useful when you need precise
@@ -55,6 +58,8 @@ tables to be {meth}`sorted <TableCollection.sort>`). Simplifying tables in place
5558
is often useful for {ref}`forward-time simulations <sec_tskit_forward_simulations>`.
5659
:::
5760

61+
(sec_advanced_simplification_map_nodes)=
62+
5863
## 1) Tracking node ID changes
5964

6065
With default settings, simplification compacts tables and therefore reassigns node
@@ -74,6 +79,8 @@ Note that when simplifying tables in-place using {meth}`TableCollection.simplify
7479
is always returned. To avoid compacting the node table, and leave node IDs unchanged, use
7580
`filter_nodes=False`.
7681

82+
(sec_advanced_simplification_map_nodes_reverse)=
83+
7784
### Obtaining the reverse map
7885

7986
Often you might want a reverse map, mapping the new node IDs to the old ones. Here's
@@ -94,21 +101,52 @@ print("New sample ID 0", "maps to old ID", int(reverse_map[0]))
94101
## 2) Keeping input roots
95102

96103
:::{todo}
97-
This is easy to illustrate, and useful for forward sims / census approaches
104+
The `keep_input_roots=True` argument is easy to illustrate, and useful for
105+
forward sims / census approaches.
106+
:::
107+
108+
## 3) Keeping ancestral individuals
109+
110+
In some cases, a tree sequence might contain historical individuals which are associated
111+
with nodes that are not samples, and you wish to retain information on individuals which
112+
remain ancestral after simplifying. For example a forward-time simulation could
113+
define individuals for all nodes in the past, including the
114+
{ref}`pedigree links <msprime:sec_pedigrees_encoding>` between parents and children,
115+
and you wish to retain the chain of individuals that define that portion of the pedigree
116+
which is relevant to the genetic ancestry (see also discussion in the SLiM manual, and in
117+
[SLiM issue #139](https://github.com/MesserLab/SLiM/issues/139)).
118+
119+
To keep all the individuals associated with genetic ancestry, you can use
120+
`keep_unary_in_individuals=True`. In particular, this means
121+
that ancestral nodes which are not coalescent anywhere along the genome,
122+
but which are associated with an individual, will be retained (and
123+
so the referenced individuals will be retained too).
124+
125+
:::{todo}
126+
Should we have a demonstration here? {ref}`sec_tskit_forward_simulations` could be used to
127+
create a simulator that saves pedigree information into each individual, and we could distill
128+
some of the discussion from https://github.com/MesserLab/SLiM/issues/139 into an example
129+
of storing a coherent pedigree.
98130
:::
99131

100-
## 3) Setting sample flags
132+
The `keep_unary_in_individuals` argument is a specific example of keeping some, but not all,
133+
non-coalescent ancestry in the tree sequence. If you need to retain a known set of
134+
non-coalescent nodes, it can be helpful to treat them as focal samples and use the
135+
`update_sample_flags=False` option, as described next.
136+
137+
138+
## 4) Setting sample flags
101139

102140
Normally the nodes that are provided to the `simplify()` function are marked as sample
103141
nodes in the output (by setting the `NODE_IS_SAMPLE` flag), and other nodes have that flag unset.
104-
If you provide the `update_sample_flags=False` option, all node flags are left unchanged.
142+
If you provide the `update_sample_flags=False` argument, all node flags are left unchanged.
105143
Here are some cases where that can be useful.
106144

107145
### Parallel simplification
108146

109147
One use for the `update_sample_flags=False` option combines it with `filter_nodes=False`,
110148
to ensure that the node table remains untouched during simplification.
111-
This is primarily a use-case targetted at developers of forward simulators, and allows
149+
This is primarily a use-case targeted at developers of forward simulators, and allows
112150
logically disjunct parts of the edge table to be simplified in parallel, without
113151
risking two parallel processes trying to alter the same data.
114152

@@ -220,24 +258,6 @@ d3arg = argviz.D3ARG.from_ts(ts=subset_arg)
220258
d3arg.draw(title=f"A full ARG, subset to {subset_arg.num_samples} samples");
221259
```
222260

223-
## 4) Keeping individuals
224-
225-
In some cases, a tree sequence might contain historical individuals which are associated
226-
with nodes that are not samples, and you wish to retain information on individuals which are
227-
ancestral to the sample nodes. For example a forward-time simulation could
228-
define individuals for all nodes in the past, including the pedigree links between parents
229-
and children (see also discussion in the SLiM manual, and at
230-
https://github.com/MesserLab/SLiM/issues/139).
231-
232-
To keep all the individuals associated with genetic ancestry, you can use
233-
`keep_unary_in_individuals=True`.
234-
235-
:::{todo}
236-
Should we have a demonstration here? {ref}`sec_tskit_forward_simulations` could be used to
237-
create a simulator that saves pedigree information into each individual, and we could distill
238-
some of the discussion from https://github.com/MesserLab/SLiM/issues/139 into that.
239-
:::
240-
241261
## 5) reduce_to_site_topology
242262

243263
:::{todo}

simplification.md

Lines changed: 19 additions & 12 deletions
@@ -38,17 +38,24 @@ def create_notebook_data():
3838
# Simplification
3939

4040
The {meth}`~TreeSequence.simplify` method provides one of the most powerful ways to modify a
41-
[tskit](https://tskit.dev) {class}`TreeSequence`. It removes and modifies edges to leave only the
42-
ancestry of a provided set of focal nodes. By default it ensuring these focal nodes are marked as
43-
samples and removes non-ancestral nodes and associated objects such as individuals and populations.
44-
It is commonly used:
41+
[tskit](https://tskit.dev) {class}`TreeSequence`.
42+
43+
At a high level, simplification works as follows: it starts from a chosen set of focal nodes
44+
and then traces their ancestry back through the tree sequence. Any nodes, edges, and mutations
45+
(as well as individuals, populations, and sites) that are not needed to represent that ancestry
46+
are discarded, and the remaining information is compacted into a new, equivalent tree sequence.
47+
During this process, IDs of nodes and other objects may change. In particular, non-coalescent
48+
nodes are usually removed, unless you ask to keep them.
49+
50+
Simplification is commonly used:
4551

4652
* In forward simulations, to remove lineages that have gone extinct
4753
* To create a smaller tree sequence focussed on a subset of samples
4854
* To remove redundant nodes and other tskit objects (e.g. unreferenced populations)
4955

50-
Other less common uses, such as retaining unary regions of coalescent nodes, and
51-
simplification in parallel, are described in the {ref}`sec_advanced_simplification` tutorial.
56+
Other less common uses, such as retaining all ancestral individuals, retaining unary
57+
regions of coalescent nodes, and simplifying without touching the node table,
58+
are described in the {ref}`sec_advanced_simplification` tutorial.
5259

5360

5461
## A single tree example
@@ -93,7 +100,7 @@ ts_simp2.draw_svg(**plot_params)
93100
Note that the example above also used another `filter_` argument, setting
94101
`filter_sites=False`, so that the first site, which has no mutations after
95102
simplification, is also retained (it is shown as a bare tick mark on the X axis,
96-
around position 250). However, mutations above unused nodes are still deleted
103+
around position 250). However, mutations above unused nodes are still deleted,
97104
so mutation IDs are not guaranteed to stay the same.
98105

99106
To further reduce the size of the simplified tree sequence, simplification normally
@@ -106,11 +113,11 @@ ts_simp3.draw_svg(**plot_params)
106113
```
107114

108115
:::{note}
109-
As modifying a tree sequence can change the IDs of nodes, sites, and other objects,
110-
it can be useful to use {ref}`metadata <sec_tutorial_metadata>`:
111-
information that stays associated with tskit objects even when their IDs change.
112-
When simplifying, it is also possible to keep track of node ID changes by using
113-
the `map_nodes` parameter, as demonstrated later in this tutorial.
116+
As modifying a tree sequence can change the IDs of nodes, sites, and other objects, it
117+
can be useful to use {ref}`metadata <sec_tutorial_metadata>`: information that stays
118+
associated with tskit objects even when their IDs change. When simplifying, it is
119+
also possible to keep track of node ID changes by using the `map_nodes` parameter,
120+
see the {ref}`advanced simplification <sec_advanced_simplification_map_nodes>` tutorial.
114121
:::
115122

116123
## A larger simplification example

viz.md

Lines changed: 4 additions & 4 deletions
@@ -746,7 +746,7 @@ css_string = (
746746
747747
# Override default node text position to be based at (0, 0) relative to the node pos
748748
# Note that the .tree specifier is needed to make this more specific than the default
749-
# positioning which is targetted at ".lab.lft" and ".lab.rgt"
749+
# positioning which is targeted at ".lab.lft" and ".lab.rgt"
750750
".tree .node > .lab {transform: translate(0, 0); text-anchor: middle; font-size: 7pt}"
751751
752752
# For leaf nodes, override the above positioning using a subsequent CSS style
@@ -941,7 +941,7 @@ itself (and not its descendants) a slightly different specification is required,
941941
involving, the "`>`" symbol, or
942942
[child combinator](https://www.w3.org/TR/selectors-3/#child-combinators) (we have,
943943
in fact, used it in several previous examples). The following plot shows the difference
944-
when all decendant symbols are targetted, versus just the immediate child symbol:
944+
when all decendant symbols are targeted, versus just the immediate child symbol:
945945

946946
```{code-cell} ipython3
947947
node_style1 = ".n13 .sym {fill: yellow}" # All symbols under node 13
@@ -953,7 +953,7 @@ ts_small.draw_svg(y_axis=True, y_ticks=y_tick_pos, x_lim=x_limits, style=css_str
953953
Another example of modifying the style target is *negation*. This is needed, for example,
954954
to target nodes that are *not* leaves (i.e. internal nodes). One way to do this is to
955955
target *all* the node symbols first, then replace the style with a more specific
956-
targetting of the leaf symbols only:
956+
targeting of the leaf symbols only:
957957

958958
```{code-cell} ipython3
959959
hide_internal_symlabs = ".node > .sym, .node > .lab {display: none}"
@@ -1770,7 +1770,7 @@ def tanglegram(
17701770
17711771
lft_node_map, lft = reorder_tree_nodes(lft, leaves)
17721772
lft_rev_map = make_reverse_map(lft_node_map)
1773-
# Have to change the node labels, because even provided ones will be targetting the wrong IDs
1773+
# Have to change the node labels, because even provided ones will be targeting the wrong IDs
17741774
lft_node_labels = {u: node_labels[v] for u, v in enumerate(lft_node_map) if v in node_labels}
17751775
if order[1] is None:
17761776
# We do not reorder the RH tree, so the node IDs should stay as-is
