Apply suggestions from code review

apragsdale · petrelharp · web-flow · commit 048b37fb4a74 · 2026-03-22T16:20:26.000-05:00
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
diff --git a/docs/stats.md b/docs/stats.md
@@ -749,8 +749,7 @@ data (and so this section can be skipped on first reading).
 We use two implementations for
 combining results from multiple alleles: `hap_weighted` and `total_weighted`.
 These are statistic-specific and not chosen by the user, with choices motivated
-by [Zhao (2007)](https://doi.org/10.1017/S0016672307008634) with the goal
-of return statistics as expected by the user.
+by [Zhao (2007)](https://doi.org/10.1017/S0016672307008634).
 
 Briefly, consider a pair of sites with {math}`n` alleles at the first locus and
 {math}`m` alleles at the second. (Whether this includes the ancestral allele
@@ -775,11 +774,13 @@ Out of all of the available summary functions, only {math}`r^2` uses
 
 Within this framework, statistics may be either
 polarised or unpolarised. For statistics that are polarised, we compute
-statistic values for pairs of derived alleles. Unpolarised statistics compute
+statistic values for pairs of derived alleles. (For this purpose, the "derived" alleles
+at a site are all alleles except the one stored as the ``ancestral_state`` for the site.)
+Unpolarised statistics compute
 statistics over all pairs of alleles, derived and ancestral. In either case,
 the result is averaged over these values, using a weighting
-scheme described below. The option for polarisation is not exposed to the user.
-Instead, we implement statistics that are polarised where appropriate.
+scheme described below. The option for polarisation is not exposed to the user,
+and we list which statistics are polarised below.
 
 (sec_stats_two_locus_branch)=
 
@@ -801,8 +802,13 @@ mutations fall on any pair of branches.
 In other words, imagine we place two mutations uniformly, one on each tree, and
 then compute the statistic; the branch mode computes the expected value of the
 statistic over this process, multiplied by the product of the total branch
-lengths of each tree. In some cases, this process will compute statistics over
-branches that do not result in a polymorphism in the sample, in which case
+lengths of each tree. This weighting accounts for mutational opportunity, so that
+the sum of the branch-mode statistic over all positions in a genomic region,
+multiplied by a mutation rate, is equal to the expected sum of the two-locus site
+statistic over all mutations falling in that region under an infinite-sites model.
+
+If some branches are ancestral to none or all of the samples, then a mutation on
+these branches does not result in a polymorphism in the sample, and so the
 ratio statistics ({math}`r`, {math}`r^2`)  will return `nan` values. See
 further discussion of `nan` values below.
 
@@ -873,26 +879,26 @@ ld = ts.ld_matrix(mode="site", sample_sets=range(8))
 print(ld)
 ```
 
-For concreteness, we would expect the following dimensions with the specified
+We would get the following dimensions with the specified
 `sample_sets` and `indexes` arguments.
 
 ```
 # one-way
-ts.ld_matrix(sample_sets=None) -> 2 dimensions
-ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
-ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
+ts.ld_matrix(sample_sets=None) # -> 2 dimensions
+ts.ld_matrix(sample_sets=[0, 1, 2, 3]) # -> 2 dimensions
+ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) # -> 3 dimensions
 # two-way
-ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
-ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
+ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) # -> 2 dimensions
+ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) # -> 3 dimensions
 ```
 
 #### Why are there `nan` values in the LD matrix?
 
 For some statistics, it is possible to observe `nan` entries in the LD matrix,
 which can be surprising or numerically impact downstream analyses. A `nan` entry
-may occur when computing a ratio statistic (such as {math}`r` or {math}`r^2`)
-with a denominator of zero, indicating that one or both sites in the pair are
-monomorphic. This can happen for a number of reasons:
+occurs if the denominator of a ratio statistic (such as {math}`r` or {math}`r^2`)
+is zero, indicating that one or both of the alleles in the pair are fixed or absent in the
+sample set under consideration. This can happen for a number of reasons:
 
 - The mutation model allows for reversible mutations, so a back mutation at
   a site has resulted in a single allele despite multiple mutations in the
@@ -922,7 +928,7 @@ notation represent alternate alleles.
 Two-way statistics are summaries of haplotype counts between two sample sets,
 which operate on the three haplotype counts (as in one-way stats, above)
 computed from each sample set, indexed by `(i, j)`. These statistics take on
-a different meaning from their one-way counterparts. For instance {math}`D2`
+a different meaning from their one-way counterparts. For instance `stat="D2"`
 over a pair of sample sets computes {math}`D_i D_j`, which is the product of
 the covariance measure of LD within each sample set and is related to the
 covariance of {math}`D` between sample sets.
@@ -944,7 +950,7 @@ sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
 The two-locus summary functions all take haplotype counts and sample set size
 as input. Each of our summary functions has the signature
 {math}`f(n_{Ab}, n_{aB}, n_{AB}, n)`, converting to haplotype frequencies
-{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` where appropriate by dividing by {math}`n`.
+{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` by dividing by {math}`n`.
 
 `D`
 : {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = p_{ab} - p_{a}p_{b}`
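To make the summary-function signature and the `nan` behaviour described in the hunks above concrete, here is a minimal pure-Python sketch. It is not tskit's implementation: the helper names (`haplotype_freqs`, `stat_D`, `stat_r2`) are hypothetical, and it assumes that `n_Ab`, `n_aB` and `n_AB` count haplotypes carrying the focal allele (uppercase) at the first locus only, the second locus only, and both loci respectively.

```python
import math


def haplotype_freqs(n_Ab, n_aB, n_AB, n):
    """Convert haplotype counts to (p_AB, p_A, p_B) by dividing by n.

    Assumption: uppercase letters denote the focal allele at each locus.
    """
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n  # frequency of A at the first locus
    p_B = (n_AB + n_aB) / n  # frequency of B at the second locus
    return p_AB, p_A, p_B


def stat_D(n_Ab, n_aB, n_AB, n):
    """Covariance measure of LD: D = p_AB - p_A * p_B."""
    p_AB, p_A, p_B = haplotype_freqs(n_Ab, n_aB, n_AB, n)
    return p_AB - p_A * p_B


def stat_r2(n_Ab, n_aB, n_AB, n):
    """Ratio statistic r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)).

    Returns nan when the denominator is zero, that is, when either
    allele is fixed or absent in the sample set.
    """
    p_AB, p_A, p_B = haplotype_freqs(n_Ab, n_aB, n_AB, n)
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        return math.nan
    d = p_AB - p_A * p_B
    return d * d / denom


# Perfectly linked alleles, each at frequency 0.5 in 8 samples.
print(stat_D(0, 0, 4, 8), stat_r2(0, 0, 4, 8))
# Allele B is absent from the sample set, so r^2 is undefined.
print(stat_r2(4, 0, 0, 8))
```

The explicit zero-denominator check mirrors the documented behaviour for ratio statistics; `D` itself is always defined, so only `stat_r2` can return `nan` in this sketch.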
diff --git a/python/tskit/trees.py b/python/tskit/trees.py
@@ -10957,7 +10957,9 @@ def ld_matrix(
 
         Similarly, in the branch mode, the ``positions`` argument specifies
         genomic coordinates at which the expectation for the two-locus statistic
-        is computed, given the local tree structure. This defaults to computing
+        is computed, given the local tree structure.
+        (See :ref:`sec_stats_two_locus_branch` for an explanation of the sense
+        in which this is an expectation.) This defaults to computing
         the LD for each pair of distinct trees, which is equivalent to passing in
         the leftmost coordinates of each tree's span (since intervals are closed on
         the left and open on the right). Similar to the site mode, a nested list