|
| 1 | +--- |
| 2 | +jupytext: |
| 3 | + text_representation: |
| 4 | + extension: .md |
| 5 | + format_name: myst |
| 6 | + format_version: 0.12 |
| 7 | + jupytext_version: 1.9.1 |
| 8 | +kernelspec: |
| 9 | + display_name: Python 3 |
| 10 | + language: python |
| 11 | + name: python3 |
| 12 | +--- |
| 13 | + |
| 14 | +```{currentmodule} tskit |
| 15 | +``` |
| 16 | + |
| 17 | +(sec_advanced_simplification)= |
| 18 | + |
| 19 | +# _Advanced simplification_ |
| 20 | +% remove underscores in title when tutorial is complete or near-complete |
| 21 | + |
| 22 | + |
| 23 | +This is a companion to the basic {ref}`sec_simplification` tutorial. |
| 24 | +It focuses on details of `simplify` behavior that are useful when you need precise |
| 25 | +control over node IDs, retained ancestry structure, and sample flags. |
| 26 | + |
| 27 | +If you are primarily subsetting samples or reducing tree sequence size, start with |
| 28 | +{ref}`sec_simplification`. |
| 29 | + |
| 30 | + |
| 31 | +```{code-cell} ipython3 |
| 32 | +:"tags": ["hide-input"] |
| 33 | +# Build a full ARG for demonstrations |
| 34 | +import msprime |
| 35 | +import numpy as np |
| 36 | +import tskit |
| 37 | +import tskit_arg_visualizer as argviz |
| 38 | +
|
| 39 | +arg = msprime.sim_ancestry( |
| 40 | + samples=10, |
| 41 | + sequence_length=1e4, |
| 42 | + recombination_rate=1e-8, |
| 43 | + population_size=1e4, |
| 44 | + record_full_arg=True, |
| 45 | + random_seed=123, |
| 46 | +) |
| 47 | +arg = msprime.sim_mutations(arg, rate=1e-8, random_seed=124) |
| 48 | +``` |
| 49 | + |
| 50 | +:::{note} |
| 51 | +You can both simplify an immutable {class}`TreeSequence` (to return a new one), |
| 52 | +or run {meth}`~TableCollection.simplify` on a {class}`TableCollection` (which |
| 53 | +modifies in place, and requires {ref}`sec_table_indexes` to be built as well as the |
| 54 | +tables to be {meth}`sorted <TableCollection.sort>`). Simplifying tables in place |
| 55 | +is often useful for {ref}`forward-time simulations <sec_tskit_forward_simulations>`. |
| 56 | +::: |
| 57 | + |
| 58 | +## 1) Tracking node ID changes |
| 59 | + |
| 60 | +With default settings, simplification compacts tables and therefore reassigns node |
| 61 | +IDs. If downstream code makes use of stored node IDs, request a map: |
| 62 | + |
| 63 | +```{code-cell} ipython3 |
| 64 | +focal = arg.samples()[-6:] # pick last 6 samples (3 diploids) |
| 65 | +simp, node_map = arg.simplify(focal, map_nodes=True) |
| 66 | +
|
| 67 | +print(f"Original ARG: {arg.num_individuals} individuals, simplified to {simp.num_individuals} individuals") |
| 68 | +print(f"Original ARG: {arg.num_nodes} nodes, simplified to {simp.num_nodes} nodes") |
| 69 | +print("Dropped nodes:", int(np.sum(node_map == tskit.NULL))) |
| 70 | +print("Old sample ID", int(focal[0]), "maps to new ID", int(node_map[focal[0]])) |
| 71 | +``` |
| 72 | + |
| 73 | +Note that when simplifying tables in-place using {meth}`TableCollection.simplify` a map |
| 74 | +is always returned. To avoid compacting the node table, and leave node IDs unchanged, use |
| 75 | +`filter_nodes=False`. |
| 76 | + |
| 77 | +### Obtaining the reverse map |
| 78 | + |
| 79 | +Often you might want a reverse map, mapping the new node IDs to the old ones. Here's |
| 80 | +a simple way to do this: |
| 81 | + |
| 82 | +```{code-cell} ipython3 |
| 83 | +def invert_map(node_mapping): |
| 84 | + kept = node_mapping != tskit.NULL |
| 85 | + indexes = node_mapping[kept] # indexes are guaranteed 0..N-1 |
| 86 | + rev_map = np.full_like(indexes, tskit.NULL) |
| 87 | + rev_map[indexes] = np.flatnonzero(kept) |
| 88 | + return rev_map |
| 89 | +
|
| 90 | +reverse_map = invert_map(node_map) |
| 91 | +print("New sample ID 0", "maps to old ID", int(reverse_map[0])) |
| 92 | +``` |
| 93 | + |
| 94 | +## 2) Keeping input roots |
| 95 | + |
| 96 | +:::{todo} |
| 97 | +This is easy to illustrate, and useful for forward sims / census approaches |
| 98 | +::: |
| 99 | + |
| 100 | +## 3) Setting sample flags |
| 101 | + |
| 102 | +Normally the nodes that are provided to the `simplify()` function are marked as sample |
| 103 | +nodes in the output (by setting the `NODE_IS_SAMPLE` flag), and other nodes have that flag unset. |
| 104 | +If you provide the `update_sample_flags=False` option, all node flags are left unchanged. |
| 105 | +Here are some cases where that can be useful. |
| 106 | + |
| 107 | +### Parallel simplification |
| 108 | + |
| 109 | +One use for the `update_sample_flags=False` option combines it with `filter_nodes=False`, |
| 110 | +to ensure that the node table remains untouched during simplification. |
| 111 | +This is primarily a use-case targetted at developers of forward simulators, and allows |
| 112 | +logically disjunct parts of the edge table to be simplified in parallel, without |
| 113 | +risking two parallel processes trying to alter the same data. |
| 114 | + |
| 115 | +:::{todo} |
| 116 | +Should we provide an example? However, this tends to be done at a lower level, e.g. |
| 117 | +using the C API, and is likely to be only useful for developers. |
| 118 | +::: |
| 119 | + |
| 120 | +### Retaining non-coalescent nodes |
| 121 | + |
| 122 | +The `keep_unary=True` option retains *all* ancestral nodes after simplification, even |
| 123 | +if they don't represent coalescences. Sometimes you might want to retain *some* but not |
| 124 | +all such nodes, which can be done by adding them to the focal node list but using |
| 125 | +`update_sample_flags=False` to ensure they are not marked as samples in the final output. |
| 126 | +This is usually followed by an additional simplification pass to remove any topology |
| 127 | +above the additional nodes which is not ancestral to the true samples. Here are some examples: |
| 128 | + |
| 129 | +#### Keeping non-coalescent regions of coalescent nodes |
| 130 | + |
| 131 | +By default, `simplify()` deletes not just non-coalescent nodes, but also |
| 132 | +removes from the ancestry those regions of coalescent nodes which do not represent |
| 133 | +a coalescence in the local tree (i.e. are "locally unary"). We can identify coalescent |
| 134 | +nodes using an initial round of simplification: |
| 135 | + |
| 136 | +```{code-cell} |
| 137 | +simp, node_map = arg.simplify(focal, map_nodes=True) |
| 138 | +keep_nodes = np.where(node_map != tskit.NULL)[0] #includes sample and coalescent nodes |
| 139 | +
|
| 140 | +# Retain lots of nodes by treating them as focal, but don't change sample flags |
| 141 | +tmp_ts = arg.simplify(keep_nodes, update_sample_flags=False) |
| 142 | +# Now re-simplify, in case any coalescent nodes have unwanted ancestry above |
| 143 | +part_simp_ts = tmp_ts.simplify(keep_unary=True) |
| 144 | +``` |
| 145 | + |
| 146 | +Often, leaving in the non-coalescent regions of coalescent nodes can lead to a reduction |
| 147 | +in the number of edges. This is one reason that the {meth}`TreeSequence.extend_haplotypes` exists |
| 148 | +(see the documentation of that method for more detail). |
| 149 | + |
| 150 | +```{code-cell} |
| 151 | +print(f"Full simplification to samples {focal} leads to {simp.num_edges} edges") |
| 152 | +print( |
| 153 | + f"Similar simplification leaving partially-coalescent nodes leads to {part_simp_ts.num_edges} edges" |
| 154 | +) |
| 155 | +part_simp_ts.draw_svg(title=( |
| 156 | + f"ARG simplified to samples {focal}, leaving part-coalescent nodes: " |
| 157 | + f"this has {part_simp_ts.num_edges} edges" |
| 158 | +)) |
| 159 | +``` |
| 160 | + |
| 161 | + |
| 162 | +#### Subsetting a full ARG |
| 163 | + |
| 164 | +The {ref}`sec_args` tutorial discusses the idea of a "full ARG", containing recombination and |
| 165 | +non-coalescent common ancestor nodes, which do not correspond to coalescent events anywhere in |
| 166 | +the genealogy. We can use `simplify()` to reduce a full ARG to a subset of samples, while retaining those non-coalescent nodes. Again, we need to figure out |
| 167 | +which nodes to keep, then add them to the focal nodes, simplifying with `update_sample_flags=False`. |
| 168 | + |
| 169 | +We can detect the common ancestor nodes by finding those which still have 2 unique children after |
| 170 | +simplifying with `keep_unary=True` and the recombination nodes |
| 171 | +by finding which still come in pairs after such simplification. |
| 172 | + |
| 173 | +```{code-cell} ipython3 |
| 174 | +def arg_num_children(ts): # see the ARG tutorial which explains this code |
| 175 | + same_parent = np.concatenate((ts.edges_parent[1:] == ts.edges_parent[:-1], [False])) |
| 176 | + same_child = np.concatenate((ts.edges_child[1:] == ts.edges_child[:-1], [False])) |
| 177 | + is_last_unique = ~same_parent | ~same_child # last occurrence of each unique child per parent |
| 178 | + return np.bincount(ts.edges_parent[is_last_unique], minlength=ts.num_nodes) |
| 179 | +
|
| 180 | +re_node_pairs = np.where(arg.nodes_flags & msprime.NODE_IS_RE_EVENT)[0].reshape(-1, 2) |
| 181 | +print("RE nodes before simplification\n", re_node_pairs) |
| 182 | +``` |
| 183 | + |
| 184 | +Identifying the _msprime_ recombination nodes that stay as pairs after simplification requires |
| 185 | +a little work: |
| 186 | + |
| 187 | +:::{todo} |
| 188 | +Currently the code below doesn't quite work, because `keep_unary` forces the nodes above the local |
| 189 | +roots to be kept, see https://github.com/tskit-dev/tskit/issues/3450. This means that some RE |
| 190 | +(and possibly CA) nodes are kept when they should be discarded. |
| 191 | +::: |
| 192 | + |
| 193 | +```{code-cell} ipython3 |
| 194 | +simp, node_map = arg.simplify(focal, keep_unary=True, map_nodes=True) |
| 195 | +reverse_map = invert_map(node_map) |
| 196 | +simp_re_nodes = reverse_map[np.where(simp.nodes_flags & msprime.NODE_IS_RE_EVENT)[0]] |
| 197 | +valid_pairs = np.isin(re_node_pairs, simp_re_nodes).all(axis=1) |
| 198 | +keep_re_nodes = re_node_pairs[valid_pairs, :] |
| 199 | +print("paired RE nodes after simplification\n", keep_re_nodes) |
| 200 | +
|
| 201 | +keep_RE_nodes = keep_re_nodes.flatten() |
| 202 | +keep_CA_nodes = reverse_map[arg_num_children(simp) > 1] |
| 203 | +``` |
| 204 | + |
| 205 | +Now that we have defined which nodes to keep, we can use the same trick as before, |
| 206 | +passing these nodes as focal, but simplifying twice, once with `update_sample_flags=False` |
| 207 | +then again with `keep_unary=True`: |
| 208 | + |
| 209 | +```{code-cell} ipython3 |
| 210 | +keep = np.concatenate((focal, keep_CA_nodes, keep_RE_nodes)) |
| 211 | +
|
| 212 | +tmp_arg = arg.simplify(keep, update_sample_flags=False) |
| 213 | +subset_arg = tmp_arg.simplify(keep_unary=True) # Defaults to focal nodes = existing samples |
| 214 | +``` |
| 215 | + |
| 216 | +Here's what it looks like in graph form: |
| 217 | + |
| 218 | +```{code-cell} ipython3 |
| 219 | +d3arg = argviz.D3ARG.from_ts(ts=subset_arg) |
| 220 | +d3arg.draw(title=f"A full ARG, subset to {subset_arg.num_samples} samples"); |
| 221 | +``` |
| 222 | + |
| 223 | +## 4) Keeping individuals |
| 224 | + |
| 225 | +In some cases, a tree sequence might contain historical individuals which are associated |
| 226 | +with nodes that are not samples, and you wish to retain information on individuals which are |
| 227 | +ancestral to the sample nodes. For example a forward-time simulation could |
| 228 | +define individuals for all nodes in the past, including the pedigree links between parents |
| 229 | +and children (see also discussion in the SLiM manual, and at |
| 230 | +https://github.com/MesserLab/SLiM/issues/139). |
| 231 | + |
| 232 | +To keep all the individuals associated with genetic ancestry, you can use |
| 233 | +`keep_unary_in_individuals=True`. |
| 234 | + |
| 235 | +:::{todo} |
| 236 | +Should we have a demonstration here? {ref}`sec_tskit_forward_simulations` could be used to |
| 237 | +create a simulator that saves pedigree information into each individual, and we could distill |
| 238 | +some of the discussion from https://github.com/MesserLab/SLiM/issues/139 into that. |
| 239 | +::: |
| 240 | + |
| 241 | +## 5) reduce_to_site_topology |
| 242 | + |
| 243 | +:::{todo} |
| 244 | +Explain why you might use `reduce_to_site_topology`, e.g. if there is a lot of information |
| 245 | +that is not inferable from sites. Note that this does not change the topology at each site, |
| 246 | +although some of the topology at a site might not actually be needed to represent the site |
| 247 | +variation (to see how to condense topology into "polytomies", see |
| 248 | +https://github.com/tskit-dev/tskit/discussions/2926). |
| 249 | +::: |
0 commit comments