Skip to content

Commit 048b37f

Browse files
Apply suggestions from code review
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
1 parent a88e8ff commit 048b37f

2 files changed

Lines changed: 27 additions & 19 deletions

File tree

docs/stats.md

Lines changed: 24 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -749,8 +749,7 @@ data (and so this section can be skipped on first reading).
749749
We use two implementations for
750750
combining results from multiple alleles: `hap_weighted` and `total_weighted`.
751751
These are statistic-specific and not chosen by the user, with choices motivated
752-
by [Zhao (2007)](https://doi.org/10.1017/S0016672307008634) with the goal
753-
of return statistics as expected by the user.
752+
by [Zhao (2007)](https://doi.org/10.1017/S0016672307008634).
754753

755754
Briefly, consider a pair of sites with {math}`n` alleles at the first locus and
756755
{math}`m` alleles at the second. (Whether this includes the ancestral allele
@@ -775,11 +774,13 @@ Out of all of the available summary functions, only {math}`r^2` uses
775774

776775
Within this framework, statistics may be either
777776
polarised or unpolarised. For statistics that are polarised, we compute
778-
statistic values for pairs of derived alleles. Unpolarised statistics compute
777+
statistic values for pairs of derived alleles. (For this purpose, the "derived" alleles
778+
at a site are all alleles except that stored as the ``ancestral_state`` for the site.)
779+
Unpolarised statistics compute
779780
statistics over all pairs of alleles, derived and ancestral. In either case,
780781
the result is averaged over these values, using a weighting
781-
scheme described below. The option for polarisation is not exposed to the user.
782-
Instead, we implement statistics that are polarised where appropriate.
782+
scheme described below. The option for polarisation is not exposed to the user,
783+
and list which statistics are polarised below.
783784

784785
(sec_stats_two_locus_branch)=
785786

@@ -801,8 +802,13 @@ mutations fall on any pair of branches.
801802
In other words, imagine we place two mutations uniformly, one on each tree, and
802803
then compute the statistic; the branch mode computes the expected value of the
803804
statistic over this process, multiplied by the product of the total branch
804-
lengths of each tree. In some cases, this process will compute statistics over
805-
branches that do not result in a polymorphism in the sample, in which case
805+
lengths of each tree. This weighting accounts for mutational opportunity, so that
806+
the sum of the branch-mode statistic over all positions in a genomic region,
807+
multiplied by a mutation rate, is equal to the expected sum of the two-locus site
808+
statistic over all mutations falling in that region under an infinite-sites model.
809+
810+
If some branches are ancestral to none or all of the samples, then a mutation on
811+
these branches does not result in a polymorphism in the sample, and so the
806812
ratio statistics ({math}`r`, {math}`r^2`) will return `nan` values. See
807813
further discussion of `nan` values below.
808814

@@ -873,26 +879,26 @@ ld = ts.ld_matrix(mode=“site”, sample_sets=range(8))
873879
print(ld)
874880
```
875881

876-
For concreteness, we would expect the following dimensions with the specified
882+
We would get the following dimensions with the specified
877883
`sample_sets` and `indexes` arguments.
878884

879885
```
880886
# one-way
881-
ts.ld_matrix(sample_sets=None) -> 2 dimensions
882-
ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
883-
ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
887+
ts.ld_matrix(sample_sets=None) # -> 2 dimensions
888+
ts.ld_matrix(sample_sets=[0, 1, 2, 3]) # -> 2 dimensions
889+
ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) # -> 3 dimensions
884890
# two-way
885-
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
886-
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
891+
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) # -> 2 dimensions
892+
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) # -> 3 dimensions
887893
```
888894

889895
#### Why are there `nan` values in the LD matrix?
890896

891897
For some statistics, it is possible to observe `nan` entries in the LD matrix,
892898
which can be surprising or numerically impact downstream analyses. A `nan` entry
893-
may occur when computing a ratio statistic (such as {math}`r` or {math}`r^2`)
894-
with a denominator of zero, indicating that one or both sites in the pair are
895-
monomorphic. This can happen for a number of reasons:
899+
occurs if the denominator of a ratio statistic (such as {math}`r` or {math}`r^2`)
900+
is zero, indicating that one or both of the alleles in the pair is fixed or absent in the
901+
sample set under consideration. This can happen for a number of reasons:
896902

897903
- The mutation models allows for reversible mutations, so a back mutation at
898904
a site has resulted in a single allele despite multiple mutations in the
@@ -922,7 +928,7 @@ notation represent alternate alleles.
922928
Two-way statistics are summaries of haplotype counts between two sample sets,
923929
which operate on the three haplotype counts (as in one-way stats, above)
924930
computed from each sample set, indexed by `(i, j)`. These statistics take on
925-
a different meaning from their one-way counterparts. For instance {math}`D2`
931+
a different meaning from their one-way counterparts. For instance `stat="D2"`
926932
over a pair of sample sets computes {math}`D_i D_j`, which is the product of
927933
the covariance measure of LD within each sample set and is related to the
928934
covariance of {math}`D` between sample sets.
@@ -944,7 +950,7 @@ sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
944950
The two-locus summary functions all take haplotype counts and sample set size
945951
as input. Each of our summary functions has the signature
946952
{math}`f(n_{Ab}, n_{aB}, n_{AB}, n)`, converting to haplotype frequencies
947-
{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` where appropriate by dividing by {math}`n`.
953+
{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` by dividing by {math}`n`.
948954

949955
`D`
950956
: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = p_{ab} - p_{a}p_{b}`

python/tskit/trees.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10957,7 +10957,9 @@ def ld_matrix(
1095710957
1095810958
Similarly, in the branch mode, the ``positions`` argument specifies
1095910959
genomic coordinates at which the expectation for the two-locus statistic
10960-
is computed, given the local tree structure. This defaults to computing
10960+
is computed, given the local tree structure.
10961+
(See :ref:`sec_stats_two_locus_branch` for explanation of in what sense
10962+
this is an expectation.) This defaults to computing
1096110963
the LD for each pair of distinct trees, which is equivalent to passing in
1096210964
the leftmost coordinates of each tree's span (since intervals are closed on
1096310965
the left and open on the right). Similar to the site mode, a nested list

0 commit comments

Comments
 (0)