@@ -16,9 +16,10 @@ kernelspec:
1616
1717(sec_phylogen)=
1818
19- # ` Tskit ` for phylogenetics
19+ # {program} ` Tskit ` for phylogenetics
2020
21- ` Tskit ` , the tree sequence toolkit, can be used as an efficient library for very large evolutionary trees. ` Tskit ` makes it easy to deal with trees with millions of
21+ {program}` Tskit ` , the tree sequence toolkit, can be used as an efficient library for
22+ very large evolutionary trees. {program}` Tskit ` makes it easy to deal with trees with millions of
2223tips, as in the example below:
2324
2425``` {code-cell}
@@ -77,7 +78,7 @@ import tsconvert # used for reading tree sequences from different formats
7778ts = tsconvert.from_newick("(A:6,((B:1,C:1):2,(D:2,E:2):1):3);", span=1000)
7879```
7980
80- The "succinct tree sequence" format used by ` tskit ` can also store mutations
81+ The "succinct tree sequence" format used by {program} ` tskit ` can also store mutations
8182(and optionally a reference genome) along with the tree(s). This results in a
8283single unified representation of large genomic datasets, storing trees,
8384sequence data and metadata in a single efficient structure. Examples are given
@@ -92,11 +93,11 @@ sorting. An overview, and links to further details are given at the
9293
9394## Hints for phylogeneticists
9495
95- Unlike other phylogenetic libraries, ` tskit ` is designed to efficiently store not just
96+ Unlike other phylogenetic libraries, {program} ` tskit ` is designed to efficiently store not just
9697single trees, but sequences of correlated trees along a genome. This means that the
9798library has some features not found in more standard phylogenetic libraries.
9899Here we focus on the {ref}` sec_python_api ` ,
99- introducing seven ` tskit ` concepts that may be useful to those with a background in
100+ introducing seven {program} ` tskit ` concepts that may be useful to those with a background in
100101phylogenetics (each is linked to a separate section below):
101102
1021031 . An evolutionary tree is always contained within a "tree sequence".
@@ -149,7 +150,7 @@ tree.tree_sequence # When output in a notebook, prints a summary of the tree se
149150(sec_phylogen_ids)=
150151### Integer node and edge IDs
151152
152- The plot above labels nodes by their name, but internally the ` tskit ` library relies
153+ The plot above labels nodes by their name, but internally the {program} ` tskit ` library relies
153154heavily on integer IDs. Here's the same tree with node IDs plotted instead:
154155
155156``` {code-cell}
@@ -164,9 +165,9 @@ integer ID starting from 0 to `ts.num_nodes - 1` (IDs can be allocated in any or
164165often the tips are labelled starting from 0 but this is not necessarily so, and
165166is not the case in the example above).
166167
167- For efficiency reasons, tree traversal routines, as well as many other ` tskit ` methods,
168- tend to return integer IDs. You can use this ID to get specific information about the
169- node and its position in the tree, for example
168+ For efficiency reasons, tree traversal routines, as well as many other {program} ` tskit `
169+ methods, tend to return integer IDs. You can use this ID to get specific information
170+ about the node and its position in the tree, for example
170171
171172``` {code-cell}
172173node_id = 4
@@ -187,8 +188,8 @@ Other methods also exist to
187188Rather than refer to "branches" of a tree, tskit tends to refer to
188189{ref}` sec_terminology_edges ` (the term "edge" emphasises that these can span
189190{ref}` sec_phylogen_multiple_trees ` , although for tree sequences containing a single
190- tree, the terms are interchangeable). Like other entities in ` tskit ` , edges are referred
191- to by an integer ID. For instance, here is the edge above the internal node 4
191+ tree, the terms are interchangeable). Like other entities in {program} ` tskit ` , edges
192+ are referred to by an integer ID. For instance, here is the edge above the internal node 4
192193
193194``` {code-cell}
194195node_id = 4
@@ -233,7 +234,7 @@ Often we are only have detailed information about specific nodes that we have sa
233234such as genomes A, B, C, D, and E in the example above. These are designated as
234235* sample nodes* , and are plotted as square nodes. The concept of
235236{ref}` sample nodes<sec_data_model_definitions_sample> ` is integral
236- to the ` tskit ` format. They can be identified by using the
237+ to the {program} ` tskit ` format. They can be identified by using the
237238{meth}` Node.is_sample ` and {meth}` Tree.is_sample ` methods, or can be listed using
238239{meth}` TreeSequence.samples ` or {meth}` Tree.samples() ` (internally, the ` node.flags `
239240field is used to {ref}` flag up<sec_node_table_definition> ` which nodes are samples):
@@ -283,19 +284,19 @@ tree.tree_sequence.nodes_time
283284(sec_phylogen_node_time)=
284285### Nodes must have times
285286
286- Perhaps the most noticable different between a ` tskit ` tree and the encoding of trees
287- in other phylogenetic libraries is that ` tskit ` does not explicitly store branch lengths.
287+ Perhaps the most noticeable difference between a {program} ` tskit ` tree and the encoding of trees
288+ in other phylogenetic libraries is that {program} ` tskit ` does not explicitly store branch lengths.
288289Instead, each node has a * time* associated with it. Branch lengths can therefore be
289290found by calculating the difference between the time of a node and the time of its
290291parent node.
291292
292- Since nodes * must* have a time, ` tskit ` trees aways have these (implicit) branch
293+ Since nodes * must* have a time, {program} ` tskit ` trees always have these (implicit) branch
293294lengths. To represent a tree ("cladogram") in which the branch lengths are not
294295meaningful, the {attr}` TreeSequence.time_units ` of a tree sequence can be
295296specified as ` "uncalibrated" ` (see below)
296297
297- Another implication of storing node times rather than branch lengths is that ` tskit `
298- trees are always directional (i.e. they are "rooted"). The reason that ` tskit ` stores
298+ Another implication of storing node times rather than branch lengths is that {program} ` tskit `
299+ trees are always directional (i.e. they are "rooted"). The reason that {program} ` tskit ` stores
299300times of nodes (rather than e.g. genetic distances between them) is to ensure temporal
300301consistency. In particular it makes it impossible for a node to be an ancestor of a
301302node in one tree, and a descendant of the same node in another tree in the tree sequence.
@@ -310,7 +311,7 @@ print("Time units are", tree.tree_sequence.time_units)
310311tree.draw_svg(y_axis=True)
311312```
312313
313- Although branch lengths are not stored explicitly, for convenience ` tskit ` provides a
314+ Although branch lengths are not stored explicitly, for convenience {program} ` tskit ` provides a
314315{meth}` Tree.branch_length ` method:
315316
316317``` {code-cell}
@@ -342,7 +343,7 @@ print(
342343
343344It is worth noting that this distance is the basis for the "genetic divergence"
344345between two samples in a tree. For this reason, an equivalent way to carry out the
345- calculation is to use {meth}` TreeSequence.divergence ` , part of the the standard ` tskit `
346+ calculation is to use {meth}` TreeSequence.divergence ` , part of the standard {program} ` tskit `
346347{ref}` sec_stats ` framework, setting ` mode="branch" ` and
347348` windows="trees" ` . This is a more flexible approach, as it allows the distance between
348349multiple sets of samples in {ref}` sec_phylogen_multiple_trees ` to be calculated
@@ -364,7 +365,7 @@ print(
364365(sec_phylogen_multiroot)=
365366### Roots and multiroot trees
366367
367- In ` tskit ` , {ref}` sec_data_model_tree_roots ` of trees are defined with respect to the
368+ In {program} ` tskit ` , {ref}` sec_data_model_tree_roots ` of trees are defined with respect to the
368369sample nodes. In particular, if we move back in time along the tree branches from a
369370sample, the oldest node that we encounter is defined as a root. The ID of a root can be
370371obtained using {attr}` Tree.root ` :
@@ -374,7 +375,7 @@ print("The root node of the following tree has ID", tree.root)
374375tree.draw_svg()
375376```
376377
377- But in ` tskit ` , we can also create a single "tree" consisting of multiple unlinked
378+ But in {program} ` tskit ` , we can also create a single "tree" consisting of multiple unlinked
378379clades. In our example, we can create one of these phylogenetically unusual objects
379380if we remove the edge above node 4, by
380381{ref}` editing the underlying tables<sec_tables_editing> ` :
@@ -389,8 +390,8 @@ new_tree = new_ts.first()
389390new_tree.draw_svg()
390391```
391392
392- Although there are two separate topologies in this plot, in ` tskit ` terminology, it is
393- considered a single tree, but with two roots:
393+ Although there are two separate topologies in this plot, in {program} ` tskit ` terminology,
394+ it is considered a single tree, but with two roots:
394395
395396``` {code-cell}
396397print("The first tree has", len(new_tree.roots), "roots:", new_tree.roots)
@@ -408,7 +409,7 @@ empty_tree.draw_svg()
408409```
409410
410411The samples here are {ref}` sec_data_model_tree_isolated_nodes ` . This may seem like a
411- strange corner case, but in ` tskit ` , isolated sample nodes are used to represent
412+ strange corner case, but in {program} ` tskit ` , isolated sample nodes are used to represent
412413{ref}` sec_data_model_missing_data ` . This therefore represents a tree in which
413414relationships between the samples are not known. This could apply, for instance,
414415in regions of the genome where no genetic data exists, or where genetic ancestry
@@ -429,7 +430,7 @@ Demo some phylogenetic methods. e.g.
429430(sec_phylogen_unified_structure)=
430431## Storing and accessing genetic data
431432
432- ` Tskit ` has been designed to capture both evolutionary tree topologies and the genetic
433+ {program} ` Tskit ` has been designed to capture both evolutionary tree topologies and the genetic
433434sequences that evolve along the branches of these trees. This is achieved by defining
434435{ref}` sec_terminology_mutations_and_sites ` which are associated with specific positions
435436along the genome.
@@ -468,13 +469,13 @@ for node_id, alignment in zip(
468469(sec_phylogen_multiple_trees)=
469470## Multiple trees
470471
471- Where ` tskit ` really shines is when the ancestry of your dataset cannot be adequately
472+ Where {program} ` tskit ` really shines is when the ancestry of your dataset cannot be adequately
472473represented by a single tree. This is a pervasive issue in genomes (even from different
473474species) that have undergone recombination in the past. The resulting series of
474475{ref}` local trees<sec_what_is_local_trees> ` along a genome are highly correlated
475476(see {ref}` sec_concepts ` ).
476477
477- Instead of storing each tree along a genome separately, ` tskit ` records the genomic
478+ Instead of storing each tree along a genome separately, {program} ` tskit ` records the genomic
478479coordinates of each edge, which leads to enormous efficiencies in storage and
479480analysis. As a basic demonstration, we can repeat the edge removal example
480481{ref}` above <sec_phylogen_multiroot> ` , but only remove the ancestral link above node 4
@@ -498,7 +499,7 @@ generate 2 trees in the tree sequence, which differ only in the presence of abse
498499a single branch. We do not have to separately store the entire tree on the right: all
499500the edges that are shared between trees are stored only once.
500501
501- The rest of the ` tskit ` tutorials will lead you through the concepts involved with
502+ The rest of the {program} ` tskit ` tutorials will lead you through the concepts involved with
502503storing and analysing sequences of many correlated trees. For a simple introduction, you
503504might want to start with {ref}` sec_what_is ` .
504505