Commit 969bf16

Make tskit formatting consistent
1 parent 79ce6cf commit 969bf16

1 file changed

Lines changed: 29 additions & 28 deletions

phylogen.md

@@ -16,9 +16,10 @@ kernelspec:

 (sec_phylogen)=

-# `Tskit` for phylogenetics
+# {program}`Tskit` for phylogenetics

-`Tskit`, the tree sequence toolkit, can be used as an efficient library for very large evolutionary trees. `Tskit` makes it easy to deal with trees with millions of
+{program}`Tskit`, the tree sequence toolkit, can be used as an efficient library for
+very large evolutionary trees. {program}`Tskit` makes it easy to deal with trees with millions of
 tips, as in the example below:

 ```{code-cell}
@@ -77,7 +78,7 @@ import tsconvert  # used for reading tree sequences from different formats
 ts = tsconvert.from_newick("(A:6,((B:1,C:1):2,(D:2,E:2):1):3);", span=1000)
 ```

-The "succinct tree sequence" format used by `tskit` can also store mutations
+The "succinct tree sequence" format used by {program}`tskit` can also store mutations
 (and optionally a reference genome) along with the tree(s). This results in a
 single unified representation of large genomic datasets, storing trees,
 sequence data and metadata in a single efficient structure. Examples are given
@@ -92,11 +93,11 @@ sorting. An overview, and links to further details are given at the

 ## Hints for phylogeneticists

-Unlike other phylogenetic libraries, `tskit` is designed to efficiently store not just
+Unlike other phylogenetic libraries, {program}`tskit` is designed to efficiently store not just
 single trees, but sequences of correlated trees along a genome. This means that the
 library has some features not found in more standard phylogenetic libraries.
 Here we focus on the {ref}`sec_python_api`,
-introducing seven `tskit` concepts that may be useful to those with a background in
+introducing seven {program}`tskit` concepts that may be useful to those with a background in
 phylogenetics (each is linked to a separate section below):

 1. An evolutionary tree is always contained within a "tree sequence".
@@ -149,7 +150,7 @@ tree.tree_sequence  # When output in a notebook, prints a summary of the tree se
 (sec_phylogen_ids)=
 ### Integer node and edge IDs

-The plot above labels nodes by their name, but internally the `tskit` library relies
+The plot above labels nodes by their name, but internally the {program}`tskit` library relies
 heavily on integer IDs. Here's the same tree with node IDs plotted instead:

 ```{code-cell}
@@ -164,9 +165,9 @@ integer ID starting from 0 to `ts.num_nodes - 1` (IDs can be allocated in any or
 often the tips are labelled starting from 0 but this is not necessarily so, and
 is not the case in the example above).

-For efficiency reasons, tree traversal routines, as well as many other `tskit` methods,
-tend to return integer IDs. You can use this ID to get specific information about the
-node and its position in the tree, for example
+For efficiency reasons, tree traversal routines, as well as many other {program}`tskit`
+methods, tend to return integer IDs. You can use this ID to get specific information
+about the node and its position in the tree, for example

 ```{code-cell}
 node_id = 4
@@ -187,8 +188,8 @@ Other methods also exist to
 Rather than refer to "branches" of a tree, tskit tends to refer to
 {ref}`sec_terminology_edges` (the term "edge" emphasises that these can span
 {ref}`sec_phylogen_multiple_trees`, although for tree sequences containing a single
-tree, the terms are interchangeable). Like other entities in `tskit`, edges are referred
-to by an integer ID. For instance, here is the edge above the internal node 4
+tree, the terms are interchangeable). Like other entities in {program}`tskit`, edges
+are referred to by an integer ID. For instance, here is the edge above the internal node 4

 ```{code-cell}
 node_id = 4
@@ -233,7 +234,7 @@ Often we are only have detailed information about specific nodes that we have sa
 such as genomes A, B, C, D, and E in the example above. These are designated as
 *sample nodes*, and are plotted as square nodes. The concept of
 {ref}`sample nodes<sec_data_model_definitions_sample>` is integral
-to the `tskit` format. They can be identified by using the
+to the {program}`tskit` format. They can be identified by using the
 {meth}`Node.is_sample` and {meth}`Tree.is_sample` methods, or can be listed using
 {meth}`TreeSequence.samples` or {meth}`Tree.samples()` (internally, the `node.flags`
 field is used to {ref}`flag up<sec_node_table_definition>` which nodes are samples):
@@ -283,19 +284,19 @@ tree.tree_sequence.nodes_time
 (sec_phylogen_node_time)=
 ### Nodes must have times

-Perhaps the most noticeable difference between a `tskit` tree and the encoding of trees
-in other phylogenetic libraries is that `tskit` does not explicitly store branch lengths.
+Perhaps the most noticeable difference between a {program}`tskit` tree and the encoding of trees
+in other phylogenetic libraries is that {program}`tskit` does not explicitly store branch lengths.
 Instead, each node has a *time* associated with it. Branch lengths can therefore be
 found by calculating the difference between the time of a node and the time of its
 parent node.

-Since nodes *must* have a time, `tskit` trees always have these (implicit) branch
+Since nodes *must* have a time, {program}`tskit` trees always have these (implicit) branch
 lengths. To represent a tree ("cladogram") in which the branch lengths are not
 meaningful, the {attr}`TreeSequence.time_units` of a tree sequence can be
 specified as `"uncalibrated"` (see below)

-Another implication of storing node times rather than branch lengths is that `tskit`
-trees are always directional (i.e. they are "rooted"). The reason that `tskit` stores
+Another implication of storing node times rather than branch lengths is that {program}`tskit`
+trees are always directional (i.e. they are "rooted"). The reason that {program}`tskit` stores
 times of nodes (rather than e.g. genetic distances between them) is to ensure temporal
 consistency. In particular it makes it impossible for a node to be an ancestor of a
 node in one tree, and a descendant of the same node in another tree in the tree sequence.
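
Note on the hunk above: the relationship between node times and branch lengths can be sketched in plain Python, without the tskit API. This is a minimal illustration only. The internal node labels (`BC`, `DE`, `BCDE`, `root`) are hypothetical, the node times are derived from the Newick string used earlier in the file on the assumption that all tips sit at time 0, and returning 0.0 for a root is a choice made for this sketch.

```python
# Illustrative only: plain dictionaries, not tskit objects.
# Times derived from "(A:6,((B:1,C:1):2,(D:2,E:2):1):3);" with tips at time 0.
node_time = {
    "A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0, "E": 0.0,
    "BC": 1.0, "DE": 2.0, "BCDE": 3.0, "root": 6.0,
}

# Child -> parent links; the root has no entry (hypothetical labels).
parent = {
    "A": "root", "BCDE": "root",
    "BC": "BCDE", "DE": "BCDE",
    "B": "BC", "C": "BC", "D": "DE", "E": "DE",
}


def branch_length(node):
    """Time of the parent minus time of the node (0.0 for a root, by assumption)."""
    if node not in parent:
        return 0.0
    return node_time[parent[node]] - node_time[node]


def is_temporally_consistent():
    """Every parent must be strictly older than its child."""
    return all(node_time[p] > node_time[c] for c, p in parent.items())


for node in sorted(parent):
    print(node, branch_length(node))
print("temporally consistent:", is_temporally_consistent())
```

Because only times are stored, the branch lengths recovered this way match those in the Newick string, and the consistency check reduces to comparing two numbers per link.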
@@ -310,7 +311,7 @@ print("Time units are", tree.tree_sequence.time_units)
 tree.draw_svg(y_axis=True)
 ```

-Although branch lengths are not stored explicitly, for convenience `tskit` provides a
+Although branch lengths are not stored explicitly, for convenience {program}`tskit` provides a
 {meth}`Tree.branch_length` method:

 ```{code-cell}
@@ -342,7 +343,7 @@ print(

 It is worth noting that this distance is the basis for the "genetic divergence"
 between two samples in a tree. For this reason, an equivalent way to carry out the
-calculation is to use {meth}`TreeSequence.divergence`, part of the standard `tskit`
+calculation is to use {meth}`TreeSequence.divergence`, part of the standard {program}`tskit`
 {ref}`sec_stats` framework, setting `mode="branch"` and
 `windows="trees"`. This is a more flexible approach, as it allows the distance between
 multiple sets of samples in {ref}`sec_phylogen_multiple_trees` to be calculated
@@ -364,7 +365,7 @@ print(
 (sec_phylogen_multiroot)=
 ### Roots and multiroot trees

-In `tskit`, {ref}`sec_data_model_tree_roots` of trees are defined with respect to the
+In {program}`tskit`, {ref}`sec_data_model_tree_roots` of trees are defined with respect to the
 sample nodes. In particular, if we move back in time along the tree branches from a
 sample, the oldest node that we encounter is defined as a root. The ID of a root can be
 obtained using {attr}`Tree.root`:
@@ -374,7 +375,7 @@ print("The root node of the following tree has ID", tree.root)
 tree.draw_svg()
 ```

-But in `tskit`, we can also create a single "tree" consisting of multiple unlinked
+But in {program}`tskit`, we can also create a single "tree" consisting of multiple unlinked
 clades. In our example, we can create one of these phylogenetically unusual objects
 if we remove the edge above node 4, by
 {ref}`editing the underlying tables<sec_tables_editing>`:
@@ -389,8 +390,8 @@ new_tree = new_ts.first()
 new_tree.draw_svg()
 ```

-Although there are two separate topologies in this plot, in `tskit` terminology, it is
-considered a single tree, but with two roots:
+Although there are two separate topologies in this plot, in {program}`tskit` terminology,
+it is considered a single tree, but with two roots:

 ```{code-cell}
 print("The first tree has", len(new_tree.roots), "roots:", new_tree.roots)
@@ -408,7 +409,7 @@ empty_tree.draw_svg()
 ```

 The samples here are {ref}`sec_data_model_tree_isolated_nodes`. This may seem like a
-strange corner case, but in `tskit`, isolated sample nodes are used to represent
+strange corner case, but in {program}`tskit`, isolated sample nodes are used to represent
 {ref}`sec_data_model_missing_data`. This therefore represents a tree in which
 relationships between the samples are not known. This could apply, for instance,
 in regions of the genome where no genetic data exists, or where genetic ancestry
@@ -429,7 +430,7 @@ Demo some phylogenetic methods. e.g.
 (sec_phylogen_unified_structure)=
 ## Storing and accessing genetic data

-`Tskit` has been designed to capture both evolutionary tree topologies and the genetic
+{program}`Tskit` has been designed to capture both evolutionary tree topologies and the genetic
 sequences that evolve along the branches of these trees. This is achieved by defining
 {ref}`sec_terminology_mutations_and_sites` which are associated with specific positions
 along the genome.
@@ -468,13 +469,13 @@ for node_id, alignment in zip(
 (sec_phylogen_multiple_trees)=
 ## Multiple trees

-Where `tskit` really shines is when the ancestry of your dataset cannot be adequately
+Where {program}`tskit` really shines is when the ancestry of your dataset cannot be adequately
 represented by a single tree. This is a pervasive issue in genomes (even from different
 species) that have undergone recombination in the past. The resulting series of
 {ref}`local trees<sec_what_is_local_trees>` along a genome are highly correlated
 (see {ref}`sec_concepts`).

-Instead of storing each tree along a genome separately, `tskit` records the genomic
+Instead of storing each tree along a genome separately, {program}`tskit` records the genomic
 coordinates of each edge, which leads to enormous efficiencies in storage and
 analysis. As a basic demonstration, we can repeat the edge removal example
 {ref}`above <sec_phylogen_multiroot>`, but only remove the ancestral link above node 4
@@ -498,7 +499,7 @@ generate 2 trees in the tree sequence, which differ only in the presence of abse
 a single branch. We do not have to separately store the entire tree on the right: all
 the edges that are shared between trees are stored only once.

-The rest of the `tskit` tutorials will lead you through the concepts involved with
+The rest of the {program}`tskit` tutorials will lead you through the concepts involved with
 storing and analysing sequences of many correlated trees. For a simple introduction, you
 might want to start with {ref}`sec_what_is`.
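
Note on the "Multiple trees" hunks above: the idea that each edge carries genomic coordinates, so that edges shared between trees are stored only once, can be sketched in plain Python. This is an assumption-laden illustration, not the tskit data model or API. The `Edge` tuple, the node labels, and the choice of `[500, 1000)` as the region where the link above the `BCDE` clade exists are all hypothetical.

```python
from collections import namedtuple

# Hypothetical edge record: half-open genomic interval [left, right).
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

SEQUENCE_LENGTH = 1000

edges = [
    Edge(0, 1000, "root", "A"),
    Edge(0, 1000, "BC", "B"),
    Edge(0, 1000, "BC", "C"),
    Edge(0, 1000, "DE", "D"),
    Edge(0, 1000, "DE", "E"),
    Edge(0, 1000, "BCDE", "BC"),
    Edge(0, 1000, "BCDE", "DE"),
    # Assumed for illustration: this link is missing on [0, 500).
    Edge(500, 1000, "root", "BCDE"),
]


def local_parents(position):
    """Child -> parent map for the local tree covering `position`."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def breakpoints():
    """Sorted coordinates at which the local tree can change."""
    coords = {0, SEQUENCE_LENGTH}
    for e in edges:
        coords.update((e.left, e.right))
    return sorted(coords)


print("breakpoints:", breakpoints())
print("parent of BCDE at 100:", local_parents(100).get("BCDE"))
print("parent of BCDE at 700:", local_parents(700).get("BCDE"))
```

Two local trees are described here by eight edge records in total; the seven edges spanning the whole sequence are listed once and reused by both trees.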
