From d6dcad42550d66a122b24334d5bd566674919250 Mon Sep 17 00:00:00 2001 From: Jerome Kelleher Date: Mon, 28 Jul 2025 11:06:28 +0100 Subject: [PATCH 1/3] Add a note on units Closes #2375 --- docs/ancestry.md | 169 +++++++++++++++++++++++++---------------------- 1 file changed, 90 insertions(+), 79 deletions(-) diff --git a/docs/ancestry.md b/docs/ancestry.md index 2344ce658..a1e5e053d 100644 --- a/docs/ancestry.md +++ b/docs/ancestry.md @@ -565,6 +565,17 @@ Because the recombination rate in from position 10 to 20 is so much higher than the rate from 0 to 10, we can see that all the recombinations have fallen in the high recombination region. +:::{note} +The units of recombination and other rates in msprime are +**per unit sequence length, per unit time**. If you have +specified a {ref}`demographic model` for a species +and are measuring sequence lengths in base pairs, then the units +for recombination rate will be per base pair, per generation. +It is also possible to specify time in "coalescent units" and to +model recombination breakpoints as occurring on the +{ref}`continuous unit interval`. +::: + :::{seealso} - See the {ref}`sec_rate_maps_creating` section for more examples of creating {class}`.RateMap` instances. @@ -700,7 +711,7 @@ simulations may be preferable as they will be much more efficient. That being said, this section describes how to simulate data over multiple chromosomes, or more generally, over multiple regions -with free recombination between them. +with free recombination between them. #### Using a fixed pedigree @@ -737,7 +748,7 @@ sequence length. #### Simulations without a fixed pedigree Without a pedigree, we can use other simulation models in msprime such as -{ref}`DTWF ` and the +{ref}`DTWF ` and the {ref}`multiple merger coalescent `. These models do not directly support simulating multiple chromosomes simultaneously, but we can emulate it using a single linear genome split into @@ -956,26 +967,26 @@ In `msprime` we usually want to simulate the coalescent with recombination and represent the output as efficiently as possible. As a result, we don't store individual recombination events, but rather their effects on the output tree sequence. We also do not explicitly store common ancestor events that -do not result in marginal coalescences. For some purposes, however, -we want record information on other events of interest, not just the mimimal +do not result in marginal coalescences. For some purposes, however, +we want record information on other events of interest, not just the mimimal representation of its outcome. -The `additional_nodes` and -{ref}`coalescing_segments_only ` options serve -this exact purpose. These options allow us to record the nodes associated with -a custom subset of all events we might observe in the history of a sample. -Besides samples and coalescence events, nodes can now also represent -{ref}`common ancestor events `, -{ref}`recombination `, -{ref}`gene conversion `, -{ref}`migration `, -and {ref}`pass through ` events. The example below -and the next few paragraphs provide a guide on how +The `additional_nodes` and +{ref}`coalescing_segments_only ` options serve +this exact purpose. These options allow us to record the nodes associated with +a custom subset of all events we might observe in the history of a sample. +Besides samples and coalescence events, nodes can now also represent +{ref}`common ancestor events `, +{ref}`recombination `, +{ref}`gene conversion `, +{ref}`migration `, +and {ref}`pass through ` events. The example below +and the next few paragraphs provide a guide on how to record additional nodes and interpret the resulting node tables. We first set up a simple pedigree simulation. Pedigrees are an -interesting starting point as in contrast to the coalescent models a single -node might be associated with more than one type of event. Conversely, +interesting starting point as in contrast to the coalescent models a single +node might be associated with more than one type of event. Conversely, for the continuous-time coalescent model simulated by default by msprime, each event is registered as a new node. @@ -995,16 +1006,16 @@ pedigree_tables = pb.finalise(sequence_length=5) draw_pedigree(pedigree_tables.tree_sequence()) ``` -To specify the particular node types we want to store information on, pass a -{class}`.NodeType` object to the `additional_nodes` option of -{func}`.sim_ancestry`. This class extends {class}``python:enum.Flag``. -Each node type in the enumeration is associated with a numerical constant. -Different node types can be combined and compared using +To specify the particular node types we want to store information on, pass a +{class}`.NodeType` object to the `additional_nodes` option of +{func}`.sim_ancestry`. This class extends {class}``python:enum.Flag``. +Each node type in the enumeration is associated with a numerical constant. +Different node types can be combined and compared using [bitwise operations](https://docs.python.org/3/library/stdtypes.html?highlight=bitwise). -At the same time, these constant also function as the flags identifying -the type of each node in the nodes table. Here, we take the bitwise union -of three members of the enumeration: -{class}`msprime.NodeType.RECOMBINANT`, {class}`msprime.NodeType.PASS_THROUGH` +At the same time, these constant also function as the flags identifying +the type of each node in the nodes table. Here, we take the bitwise union +of three members of the enumeration: +{class}`msprime.NodeType.RECOMBINANT`, {class}`msprime.NodeType.PASS_THROUGH` and {class}`msprime.NodeType.COMMON_ANCESTOR`. ```{code-cell} @@ -1015,7 +1026,7 @@ ts = msprime.sim_ancestry( random_seed=45, additional_nodes=( msprime.NodeType.RECOMBINANT | - msprime.NodeType.PASS_THROUGH | + msprime.NodeType.PASS_THROUGH | msprime.NodeType.COMMON_ANCESTOR ), coalescing_segments_only=False @@ -1032,20 +1043,20 @@ def count_flags(ts): print(count_flags(ts)) ``` The `count_flags` function demonstrates the use of -bitwise operations to determine the type of a node. This function iterates -across all node types and checks for all rows in the nodes table whether -the intersection of the basic node type and the node within the nodes table -is different from 0. It returns a `dict` containing the counts of all possible +bitwise operations to determine the type of a node. This function iterates +across all node types and checks for all rows in the nodes table whether +the intersection of the basic node type and the node within the nodes table +is different from 0. It returns a `dict` containing the counts of all possible node types. -Note that {ref}`recombination events ` are -associated with two nodes, -one for both ends that are created by recombination backwards in time. The -number of recombination events is thus half the number of recombination nodes. -`PASS_THROUGH` events occur when the ancestral material of only a single -lineage passes through the genome of an ancestor, making coalescence +Note that {ref}`recombination events ` are +associated with two nodes, +one for both ends that are created by recombination backwards in time. The +number of recombination events is thus half the number of recombination nodes. +`PASS_THROUGH` events occur when the ancestral material of only a single +lineage passes through the genome of an ancestor, making coalescence impossible. This event can only be observed in models that track individuals: -{class}``msprime.DiscreteTimeWrightFisher`` and +{class}``msprime.DiscreteTimeWrightFisher`` and {class}``msprime.FixedPedigree`` models. ```{code-cell} @@ -1064,28 +1075,28 @@ for node in ts.nodes(): wide_fmt = (800, 250) ts.draw_svg(style=css_string, size=wide_fmt) ``` -Similarly, the bitwise operations logic can be used to color the different nodes -in the tree sequence according to their {class}`.NodeType`. Node 5 is both -{class}`msprime.NodeType.RECOMBINANT` and -{class}`msprime.NodeType.PASS_THROUGH` (green). Nodes 0, 1, 4 are -recombinant nodes (yellow). All other nodes are only associated with a -`PASS_THROUGH` event (blue). Note that we do not observe any `COMMON_ANCESTOR` -events. This means that all yellow nodes are also associated with a -coalesence event (see {ref}`common ancestor events `). -Note that in order to verify whether a node is `RECOMBINANT` **and** `PASS_THROUGH`, +Similarly, the bitwise operations logic can be used to color the different nodes +in the tree sequence according to their {class}`.NodeType`. Node 5 is both +{class}`msprime.NodeType.RECOMBINANT` and +{class}`msprime.NodeType.PASS_THROUGH` (green). Nodes 0, 1, 4 are +recombinant nodes (yellow). All other nodes are only associated with a +`PASS_THROUGH` event (blue). Note that we do not observe any `COMMON_ANCESTOR` +events. This means that all yellow nodes are also associated with a +coalesence event (see {ref}`common ancestor events `). +Note that in order to verify whether a node is `RECOMBINANT` **and** `PASS_THROUGH`, we use the bitwise **OR** to define its associated `NodeType` constant. -To summarize: `additional_nodes` are specified using bitwise flags. Multiple -basic node types can be combined by taking the bitwise `OR` (|) and then -passed to the `additional_nodes` option. This enables us to track the specified -nodes across the genealogical history of the sample. By means of bitwise `AND` -(&) we can then query the nodes table and check the type of each of the recorded +To summarize: `additional_nodes` are specified using bitwise flags. Multiple +basic node types can be combined by taking the bitwise `OR` (|) and then +passed to the `additional_nodes` option. This enables us to track the specified +nodes across the genealogical history of the sample. By means of bitwise `AND` +(&) we can then query the nodes table and check the type of each of the recorded nodes. -If we wish to reduce these trees down to the minimal representation, we can use -{meth}`tskit.TreeSequence.simplify`. The resulting tree sequence will have -all of these unary nodes removed and will be equivalent to (but not identical, -due to stochastic effects) calling {func}`.sim_ancestry` without the +If we wish to reduce these trees down to the minimal representation, we can use +{meth}`tskit.TreeSequence.simplify`. The resulting tree sequence will have +all of these unary nodes removed and will be equivalent to (but not identical, +due to stochastic effects) calling {func}`.sim_ancestry` without the `additional_nodes` or `coalescing_segment_only` argument(s). @@ -1093,19 +1104,19 @@ due to stochastic effects) calling {func}`.sim_ancestry` without the ### Coalescing segments only -Note that the `coalescing_segments_only` option should be set to `False` when -recording additional nodes. Setting `coalescing_segments_only` to `False` -allows us to switch off the default simulator behaviour of only recording the -relationships between overlapping ancestry segments. Instead, you can now -record edges along the full extent of any of the ancestral lineages -that was involved in any of the events as specified by the `additional_nodes` flag. -This option can also be set to `False` without specifying any additional nodes. +Note that the `coalescing_segments_only` option should be set to `False` when +recording additional nodes. Setting `coalescing_segments_only` to `False` +allows us to switch off the default simulator behaviour of only recording the +relationships between overlapping ancestry segments. Instead, you can now +record edges along the full extent of any of the ancestral lineages +that was involved in any of the events as specified by the `additional_nodes` flag. +This option can also be set to `False` without specifying any additional nodes. When `coalescing_segments_only` is set to `False`, edges that are a result of a coalescent event will record the full length of the tracked ancestral material, not just the overlapping segments of genome (as is the default behavior. -As a result, this option will produce -in nodes with only one child (unary nodes) along parts (unary regions) of +As a result, this option will produce +in nodes with only one child (unary nodes) along parts (unary regions) of the genome. For instance: suppose an ancestral segment ancestral to node `m` spans from `a` to `b` along the genome, @@ -1122,22 +1133,22 @@ The nodes `m` and `n` coalesce (in `p`) on only the overlapping segment `[c,b)`, ### Common ancestor events -We distinguish two types of common ancestor events: coalescence and common +We distinguish two types of common ancestor events: coalescence and common ancestor (in the strict sense) events. -Coalescence events are the default node type (associated with value 0) and are -therefore only implicitly part of the {class}`.NodeType` enumeration. +Coalescence events are the default node type (associated with value 0) and are +therefore only implicitly part of the {class}`.NodeType` enumeration. In contrast, whenever two nonoverlapping ancestral segments are found to have inherited from the same ancestral genome, -there is no coalescence event, and so we register this in the node table as +there is no coalescence event, and so we register this in the node table as {class}`msprime.NodeType.COMMON_ANCESTOR`. -Finally, when keeping track of individuals -({class}``msprime.DiscreteTimeWrightFisher``, -{class}``msprime.FixedPedigree``), we might observe genomes (ploids) through -which the ancestral material of only a single lineage passes. -Such genomes can be registered in the table as +Finally, when keeping track of individuals +({class}``msprime.DiscreteTimeWrightFisher``, +{class}``msprime.FixedPedigree``), we might observe genomes (ploids) through +which the ancestral material of only a single lineage passes. +Such genomes can be registered in the table as {class}`msprime.NodeType.PASS_THROUGH` nodes. -Although this is obviously not a common ancestor event, note that for both +Although this is obviously not a common ancestor event, note that for both of these models each node has to carry at least one of these three flags. @@ -1145,7 +1156,7 @@ of these models each node has to carry at least one of these three flags. ### Recombination events -Additional nodes can also be used to mark recombination events (cross over) as well +Additional nodes can also be used to mark recombination events (cross over) as well as gene conversion. A separate {class}`msprime.NodeType` exists for each: {class}`msprime.NodeType.RECOMBINANT` and {class}`msprime.NodeType.GENE_CONVERSION`. @@ -1190,7 +1201,7 @@ migrating segement. For more details on migration records, see the Here, we provide a simple example of the effect of setting `record_migrations=True` using the {meth}`stepping stone model` where migration is permitted between adjacent populations. Additionally, we -set `additional_nodes=msprime.NodeType.MIGRANT` +set `additional_nodes=msprime.NodeType.MIGRANT` (see {ref}`previous section `) to record nodes corresponding to migration events. This is not necessary, but will be helpful to visualise the result. @@ -1349,7 +1360,7 @@ two of these haplotypes (nodes 12 and 13) were in population 1 at this time. ```{code-cell} print(ts.tables.nodes) -census_nodes = np.bitwise_and(ts.nodes_flags, msprime.NodeType.CENSUS.value) > 0 +census_nodes = np.bitwise_and(ts.nodes_flags, msprime.NodeType.CENSUS.value) > 0 print(ts.tables.nodes[census_nodes]) ``` @@ -1417,7 +1428,7 @@ SVG(ts.draw_svg(y_axis=True, time_scale="log_time")) ### Ancestral recombination graph This is a legacy option and identical to setting the `additional_nodes` option to store -common_ancestor events, recombinants, (and migrants if applicable). This is illustrated +common_ancestor events, recombinants, (and migrants if applicable). This is illustrated in the following example: ```{code-cell} From 6b3f1474b8bb238e9a8ccf68c9a0f5bd85ff37be Mon Sep 17 00:00:00 2001 From: Jerome Kelleher Date: Tue, 29 Jul 2025 09:30:33 +0100 Subject: [PATCH 2/3] Update docs/ancestry.md Co-authored-by: Peter Ralph --- docs/ancestry.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/ancestry.md b/docs/ancestry.md index a1e5e053d..7befec7aa 100644 --- a/docs/ancestry.md +++ b/docs/ancestry.md @@ -567,13 +567,13 @@ have fallen in the high recombination region. :::{note} The units of recombination and other rates in msprime are -**per unit sequence length, per unit time**. If you have -specified a {ref}`demographic model` for a species -and are measuring sequence lengths in base pairs, then the units +**per unit sequence length, per unit time**. If you have set +effective population sizes in units of individuals and +aremeasuring sequence lengths in base pairs, then the units for recombination rate will be per base pair, per generation. -It is also possible to specify time in "coalescent units" and to -model recombination breakpoints as occurring on the -{ref}`continuous unit interval`. +Other choices are possible (e.g., time in "coalescent units" and +recombination breakpoints on a +{ref}`continuous unit interval`.) ::: :::{seealso} From 782a54139ec5f727c6a2e0ffb9fe9e15a88e5fbf Mon Sep 17 00:00:00 2001 From: Jerome Kelleher Date: Tue, 29 Jul 2025 09:31:22 +0100 Subject: [PATCH 3/3] Fix typo --- docs/ancestry.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/ancestry.md b/docs/ancestry.md index 7befec7aa..d41419d0b 100644 --- a/docs/ancestry.md +++ b/docs/ancestry.md @@ -568,8 +568,8 @@ have fallen in the high recombination region. :::{note} The units of recombination and other rates in msprime are **per unit sequence length, per unit time**. If you have set -effective population sizes in units of individuals and -aremeasuring sequence lengths in base pairs, then the units +effective population sizes in units of individuals and +are measuring sequence lengths in base pairs, then the units for recombination rate will be per base pair, per generation. Other choices are possible (e.g., time in "coalescent units" and recombination breakpoints on a