Skip to content

Commit 67697e3

Browse files
Apply suggestions from code review
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
1 parent 87de30f commit 67697e3

2 files changed

Lines changed: 21 additions & 17 deletions

File tree

docs/stats.md

Lines changed: 15 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -705,9 +705,9 @@ Two-locus statistics can be computed using two modes, either `site` or
705705
single-site statistics. Within this framework, statistics may be either
706706
polarised or unpolarised. For statistics that are polarised, we compute
707707
statistic values for pairs of derived alleles. Unpolarised statistics compute
708-
statistics over all possible alleles, derived and ancestral, and the result in
709-
averaged over statistics computed for all pairs of alleles (see weighting
710-
schemes below). The option for polarisation is not exposed to the user.
708+
statistics over all pairs of alleles, derived and ancestral. In either case,
709+
the result is averaged over these values, using a weighting
710+
scheme descrbed below. The option for polarisation is not exposed to the user.
711711
Instead, we implement statistics that are polarised where appropriate.
712712

713713
(sec_stats_two_locus_site)=
@@ -746,16 +746,17 @@ print(ld)
746746

747747
Because we allow for two-locus statistics to be computed for multi-allelic
748748
data, we need to be able to combine statistical results from each pair of
749-
alleles into one summary for a pair of sites. We use two implementations for
749+
alleles into one summary for a pair of sites. This does not affect biallelic
750+
data (and so this section can be skipped on first reading).
751+
We use two implementations for
750752
combining results from multiple alleles: `hap_weighted` and `total_weighted`.
751753
These are statistic-specific and not chosen by the user.
752754

753755
Briefly, consider a pair of sites with {math}`n` alleles at the first locus and
754-
{math}`m` alleles at the second. Write {math}`f_{i,j}` as the statistic
755-
computed for focal alleles {math}`A_i` and {math}`B_j`, with haplotype weights
756-
{math}`(A_i B_j, A_i b_j, a_i B_j)`, where {math}`a_i` and {math}`b_j` are the
757-
collection of alleles that are not the focal alleles {math}`A_i` or
758-
{math}`B_j`, respectively. Then the weighting schemes are defined as:
756+
{math}`m` alleles at the second. (Whether this includes the ancestral allele
757+
depends on whether the statistic is polarised.) Write {math}`f_{i,j}` as the statistic
758+
computed for focal alleles {math}`A_i` and {math}`B_j`.
759+
Then the weighting schemes are defined as:
759760

760761
- `hap_weighted`: {math}`\sum_{i=1}^{n}\sum_{j=1}^{m}p(A_{i}B_{j})f_{ij}`,
761762
where {math}`p(A_{i}B_{j})` is the frequency of haplotype {math}`A_{i}B_{j}`.
@@ -780,7 +781,7 @@ The `branch` mode computes expected two-locus statistics between pairs of
780781
trees, conditioned on the marginal topologies and branch lengths of those
781782
trees. The trees for which we compute statistics are specified by positions,
782783
and for a pair of positions we consider all possible haplotypes that could be
783-
generated by a single mutation occurring at the two trees.
784+
generated by a single mutation occurring on each of the two trees.
784785

785786
For two trees, one with {math}`n` branches and the other with {math}`m`
786787
branches, there are {math}`nm` possible pairs of branches that may carry the
@@ -827,13 +828,13 @@ in the same manner as the rest of the stats API (see
827828
the stats API in that we handle one-way and two-way statistics in the same
828829
function call.
829830

830-
To compute a two-way two-locus statistic, the `index` argument must be
831+
To compute a two-way two-locus statistic, the `indexes` argument must be
831832
provided. The statistics are selected in the same way (with the `stat`
832833
argument), but we provide a restricted set of two-way statistics (see
833834
{ref}`sec_stats_two_locus_summary_functions_two_way`). The dimension-dropping
834835
rules for the result follow the rest of the tskit stats API in that a single list
835836
or tuple will produce a single two-dimensional matrix, while list of these
836-
will produce a three-dimensional array, with the outer dimension of length
837+
will produce a three-dimensional array, with the first dimension of length
837838
equal to the length of the list.
838839

839840
For example, to compute the {math}`r^2` LD matrix over a subset of samples in
@@ -868,8 +869,8 @@ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 di
868869

869870
#### Why are there nan values in the LD matrix?
870871

871-
For some statistics, it is possible to observe nan entries in the LD matrix,
872-
which can be surprising or numerically impact downstream analyses. A nan entry
872+
For some statistics, it is possible to observe `nan` entries in the LD matrix,
873+
which can be surprising or numerically impact downstream analyses. A `nan` entry
873874
may occur when computing a ratio statistic (such as {math}`r` or {math}`r^2`)
874875
with a denominator of zero, indicating that one or both sites in the pair are
875876
monomorphic. This can happen for a number of reasons:

python/tskit/trees.py

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -10947,7 +10947,8 @@ def ld_matrix(
1094710947
between pairs of trees at all specified ``positions`` ("branch" mode,
1094810948
producing a num_positions-by-num_positions sized matrix).
1094910949
10950-
In the site mode, the sites under consideration can be restricted using
10950+
The sites considered for "site" mode defaults to all sites (which may
10951+
result in a very large matrix!), but can be restricted using
1095110952
the ``sites`` argument. Sites can be passed as a list of lists,
1095210953
specifying the ``[[row_sites], [col_sites]]``, resulting in a
1095310954
rectangular matrix, or by specifying a single list of ``[sites]``, in
@@ -10975,7 +10976,9 @@ def ld_matrix(
1097510976
section.
1097610977
1097710978
**Available Stats** (use ``Stat Name`` in the ``stat`` keyword
10978-
argument).
10979+
argument). Statistics marked as "multi sample set" allow
10980+
(but do not require) computation from two sample sets
10981+
via the ``indexes`` argument.
1097910982
1098010983
======================= ========== ================ ==============
1098110984
Stat Polarised Multi Sample Set Stat Name
@@ -10986,7 +10989,7 @@ def ld_matrix(
1098610989
:math:`D` y n "D"
1098710990
:math:`D'` y n "D_prime"
1098810991
:math:`D_z` n n "Dz"
10989-
:math:`\pi_2` n n "pi2"
10992+
:math:`\pi_2` n n "pi2"
1099010993
:math:`\widehat{D^2}` n y "D2_unbiased"
1099110994
:math:`\widehat{D_z}` n n "Dz_unbiased"
1099210995
:math:`\widehat{\pi_2}` n n "pi2_unbiased"

0 commit comments

Comments
 (0)