More fixes, found errors in previous stat defs

apragsdale · apragsdale · commit 536f144a208e · 2026-03-22T16:45:57.000-05:00
diff --git a/docs/stats.md b/docs/stats.md
@@ -780,7 +780,7 @@ Unpolarised statistics compute
 statistics over all pairs of alleles, derived and ancestral. In either case,
 the result is averaged over these values, using a weighting
 scheme described below. The option for polarisation is not exposed to the user,
-and list which statistics are polarised below.
+and we list which statistics are polarised below.
 
 (sec_stats_two_locus_branch)=
 
@@ -807,11 +807,6 @@ the sum of the branch-mode statistic over all positions in a genomic region,
 multiplied by a mutation rate,  is equal to the expected sum of the two-locus site
 statistic over all mutations falling in that region under an infinite-sites model.
 
-If some branches are ancestral to none or all of the samples, then a mutation on
-these branches does not result in a polymorphism in the sample, and so the
-ratio statistics ({math}`r`, {math}`r^2`)  will return `nan` values. See
-further discussion of `nan` values below.
-
 The time complexity of this method is quadratic in the number of samples,
 due to the pairwise comparisons of branches from each pair of trees.
 By default, this method computes
@@ -826,15 +821,21 @@ between the first 4 trees in the tree sequence. The tree breakpoints are
 a convenient way to specify those first four trees.
 
 ```{code-cell} ipython3
-ld = ts.ld_matrix(mode="branch", positions=[ts.breakpoints(as_array=True)[0:4]])
+ld = ts.ld_matrix(
+    mode="branch",
+    positions=[ts.breakpoints(as_array=True)[0:4]]
+)
 print(ld)
 ```
 
 Again, we can specify the row and column trees separately.
 
 ```{code-cell} ipython3
 breakpoints = ts.breakpoints(as_array=True)
-ld = ts.ld_matrix(mode="branch", positions=[breakpoints[[0]], breakpoints[0:4]])
+ld = ts.ld_matrix(
+    mode="branch",
+    positions=[breakpoints[[0]], breakpoints[0:4]]
+)
 print(ld)
 ```
 
@@ -895,10 +896,11 @@ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) # -> 3
 #### Why are there `nan` values in the LD matrix?
 
 For some statistics, it is possible to observe `nan` entries in the LD matrix,
-which can be surprising or numerically impact downstream analyses. A `nan` entry
-occurs if the denominator of a ratio statistic (such as {math}`r` or {math}`r^2`)
-is zero, indicating that one or both of the alleles in the pair is fixed or absent in the
-sample set under consideration. This can happen for a number of reasons:
+which can be surprising and can affect downstream numerical analyses. A `nan`
+entry occurs if the denominator of a ratio statistic (including {math}`r` and
+{math}`r^2`) is zero, indicating that one or both of the alleles in the pair is
+fixed or absent in the sample set under consideration. This can happen for
+a number of reasons:
 
 - The mutation model allows for reversible mutations, so a back mutation at
   a site has resulted in a single allele despite multiple mutations in the
@@ -917,7 +919,7 @@ set.
 ##### One-way Statistics
 
 One-way statistics are summaries of two loci in a single sample set, using
-a triple of haplotype counts {math}`\{n_{Ab}, n_{aB}, n_{AB}\}` and the size of
+a triple of haplotype counts {math}`\{n_{AB}, n_{Ab}, n_{aB}\}` and the size of
 the sample set {math}`n`, where the capitalized and lowercase letters in our
 notation represent alternate alleles.
 
@@ -949,47 +951,50 @@ sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
 
 The two-locus summary functions all take haplotype counts and sample set size
 as input. Each of our summary functions has the signature
-{math}`f(n_{Ab}, n_{aB}, n_{AB}, n)`, converting to haplotype frequencies
-{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` by dividing by {math}`n`.
+{math}`f(n_{AB}, n_{Ab}, n_{aB}, n)`, converting to haplotype frequencies
+{math}`\{p_{AB}, p_{Ab}, p_{aB}\}` by dividing by {math}`n`. Below,
+{math}`n_{ab} = n - n_{AB} - n_{Ab} - n_{aB}`, {math}`n_A = n_{AB} + n_{Ab}`
+and {math}`n_B = n_{AB} + n_{aB}`, with frequencies {math}`p` found by dividing
+by {math}`n`.
 
 `D`
-: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = p_{ab} - p_{a}p_{b}`
+: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = p_{AB}p_{ab} - p_{Ab}p_{aB}`
 
   This statistic is polarised, as the unpolarised result, which averages over
   allele labelings, is zero. Uses the `total` weighting method.
 
 `D_prime`
-: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = \frac{D}{D_{\max}}`,
+: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = \frac{D}{D_{\max}}`,
 
-  where {math}
+  where {math}`D_{\max} = \begin{cases}
             \min\{p_A (1-p_B), (1-p_A) p_B\} & \textrm{if }D \geq 0 \\
-            \min\{p_A p_B, (1-p_B) (1-p_B)\} & \textrm{otherwise}
-        \end{cases}```
+            \min\{p_A p_B, (1-p_A) (1-p_B)\} & \textrm{if }D<0
+        \end{cases}`
 
   and {math}`D` is defined above. Polarised, `total` weighted.
 
 `D2`
-: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = (p_{ab} - p_{a} p_{b})^2`
+: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = D^2`,
 
-  Unpolarised, `total` weighted.
+  where {math}`D` is defined above. Unpolarised, `total` weighted.
 
 `Dz`
-:  {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = D (1 - 2 p_{a})(1 - 2p_{b})`,
+:  {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = D (1 - 2 p_A) (1 - 2 p_B)`,
 
   where {math}`D` is defined above. Unpolarised, `total` weighted.
 
 `pi2`
-: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = p_{a}p_{b}(1-p_{a})(1-p_{b})`
+: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = p_A (1-p_A) p_B (1-p_B)`
 
   Unpolarised, `total` weighted.
 
 `r`
-: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = \frac{D}{\sqrt{p_{a}p_{b}(1-p_{a})(1-p_{b})}}`,
+: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = \frac{D}{\sqrt{p_A (1-p_A) p_B (1-p_B)}}`,
 
   where {math}`D` is defined above. Polarised, `total` weighted.
 
 `r2`
-: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = \frac{D^{2}}{p_{a}p_{b}(1-p_{a})(1-p_{b})}`,
+: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = \frac{D^{2}}{p_A (1-p_A) p_B (1-p_B)}`,
 
   where {math}`D` is defined above. Unpolarised, `haplotype` weighted.
 
@@ -1010,12 +1015,16 @@ Two-way statistics are indexed by sample sets {math}`i, j` and compute values
 using haplotype counts within pairs of sample sets.
 
 `D2`
-: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = D_i * D_j`,
+: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = D_i D_j`,
 
-where {math}`D` is defined above.
+  where {math}`D_i` denotes {math}`D` computed within sample set {math}`i`,
+  and {math}`D` is defined above. Unpolarised, `total` weighted.
 
 `r2`
-: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = \frac{(p_{AB_i} - (p_{A_i}  p_{B_i})) (p_{AB_j} - (p_{A_j}  p_{B_j}))}{\sqrt{p_{A_i} (1 - p_{A_i}) p_{B_i} (1 - p_{B_i})}\sqrt{p_{A_j} (1 - p_{A_j}) p_{B_j} (1 - p_{B_j})}}`
+: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = r_i r_j`,
+
+  where {math}`r_i` denotes {math}`r` computed within sample set {math}`i`,
+  and {math}`r` is defined above. Unpolarised, `haplotype` weighted.
 
 And `D2_unbiased`, which can be found in [Ragsdale and Gravel
 (2020)](https://doi.org/10.1093/molbev/msz265).
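
Since this commit corrects the one-way definitions, a quick numerical check may help when reviewing them. The sketch below evaluates the revised formulas directly from haplotype counts. It is not tskit's implementation: `two_locus_stats` is a hypothetical helper, it ignores polarisation and the `total`/`haplotype` weighting, and returning `nan` whenever a denominator is zero (including for `D_prime`) is an assumption based on the `nan` discussion in the docs.

```python
import math


def two_locus_stats(n_AB, n_Ab, n_aB, n):
    """Hypothetical helper: one-way two-locus summaries from haplotype counts."""
    # The fourth haplotype count follows from the other three and n.
    n_ab = n - n_AB - n_Ab - n_aB
    p_AB, p_Ab, p_aB, p_ab = (c / n for c in (n_AB, n_Ab, n_aB, n_ab))
    # Marginal allele frequencies at the two loci.
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB

    D = p_AB * p_ab - p_Ab * p_aB
    pi2 = p_A * (1 - p_A) * p_B * (1 - p_B)
    if D >= 0:
        D_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        D_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))

    nan = float("nan")  # assumed result when a ratio's denominator is zero
    return {
        "D": D,
        "D_prime": D / D_max if D_max > 0 else nan,
        "D2": D**2,
        "Dz": D * (1 - 2 * p_A) * (1 - 2 * p_B),
        "pi2": pi2,
        "r": D / math.sqrt(pi2) if pi2 > 0 else nan,
        "r2": D**2 / pi2 if pi2 > 0 else nan,
    }


# Two segregating loci in a sample of size 8.
print(two_locus_stats(n_AB=2, n_Ab=1, n_aB=1, n=8))
# Allele B is fixed in the sample, so the ratio statistics are nan.
print(two_locus_stats(n_AB=3, n_Ab=0, n_aB=5, n=8))
```

The first call also confirms that the new form of `D`, {math}`p_{AB}p_{ab} - p_{Ab}p_{aB}`, agrees with the more familiar {math}`p_{AB} - p_A p_B`.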