Skip to content

Commit 536f144

Browse files
committed
More fixes, found errors in previous stat defs
1 parent 048b37f commit 536f144

1 file changed

Lines changed: 38 additions & 29 deletions

File tree

docs/stats.md

Lines changed: 38 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -780,7 +780,7 @@ Unpolarised statistics compute
780780
statistics over all pairs of alleles, derived and ancestral. In either case,
781781
the result is averaged over these values, using a weighting
782782
scheme described below. The option for polarisation is not exposed to the user,
783-
and list which statistics are polarised below.
783+
and we list which statistics are polarised below.
784784

785785
(sec_stats_two_locus_branch)=
786786

@@ -807,11 +807,6 @@ the sum of the branch-mode statistic over all positions in a genomic region,
807807
multiplied by a mutation rate, is equal to the expected sum of the two-locus site
808808
statistic over all mutations falling in that region under an infinite-sites model.
809809

810-
If some branches are ancestral to none or all of the samples, then a mutation on
811-
these branches does not result in a polymorphism in the sample, and so the
812-
ratio statistics ({math}`r`, {math}`r^2`) will return `nan` values. See
813-
further discussion of `nan` values below.
814-
815810
The time complexity of this method is quadratic in the number of samples,
816811
due to the pairwise comparisons of branches from each pair of trees.
817812
By default, this method computes
@@ -826,15 +821,21 @@ between the first 4 trees in the tree sequence. The tree breakpoints are
826821
a convenient way to specify those first four trees.
827822

828823
```{code-cell} ipython3
829-
ld = ts.ld_matrix(mode="branch", positions=[ts.breakpoints(as_array=True)[0:4]])
824+
ld = ts.ld_matrix(
825+
mode="branch",
826+
positions=[ts.breakpoints(as_array=True)[0:4]]
827+
)
830828
print(ld)
831829
```
832830

833831
Again, we can specify the row and column trees separately.
834832

835833
```{code-cell} ipython3
836834
breakpoints = ts.breakpoints(as_array=True)
837-
ld = ts.ld_matrix(mode="branch", positions=[breakpoints[[0]], breakpoints[0:4]])
835+
ld = ts.ld_matrix(
836+
mode="branch",
837+
positions=[breakpoints[[0]], breakpoints[0:4]]
838+
)
838839
print(ld)
839840
```
840841

@@ -895,10 +896,11 @@ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) # -> 3
895896
#### Why are there `nan` values in the LD matrix?
896897

897898
For some statistics, it is possible to observe `nan` entries in the LD matrix,
898-
which can be surprising or numerically impact downstream analyses. A `nan` entry
899-
occurs if the denominator of a ratio statistic (such as {math}`r` or {math}`r^2`)
900-
is zero, indicating that one or both of the alleles in the pair is fixed or absent in the
901-
sample set under consideration. This can happen for a number of reasons:
899+
which can be surprising or numerically impact downstream analyses. A `nan`
900+
entry occurs if the denominator of a ratio statistic (including {math}`r` and
901+
{math}`r^2`) is zero, indicating that one or both of the alleles in the pair is
902+
fixed or absent in the sample set under consideration. This can happen for
903+
a number of reasons:
902904

903905
- The mutation models allows for reversible mutations, so a back mutation at
904906
a site has resulted in a single allele despite multiple mutations in the
@@ -917,7 +919,7 @@ set.
917919
##### One-way Statistics
918920

919921
One-way statistics are summaries of two loci in a single sample set, using
920-
a triple of haplotype counts {math}`\{n_{Ab}, n_{aB}, n_{AB}\}` and the size of
922+
a triple of haplotype counts {math}`\{n_{AB}, n_{Ab}, n_{aB}\}` and the size of
921923
the sample set {math}`n`, where the capitalized and lowercase letters in our
922924
notation represent alternate alleles.
923925

@@ -949,47 +951,50 @@ sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
949951

950952
The two-locus summary functions all take haplotype counts and sample set size
951953
as input. Each of our summary functions has the signature
952-
{math}`f(n_{Ab}, n_{aB}, n_{AB}, n)`, converting to haplotype frequencies
953-
{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` by dividing by {math}`n`.
954+
{math}`f(n_{AB}, n_{Ab}, n_{aB}, n)`, converting to haplotype frequencies
955+
{math}`\{p_{AB}, p_{Ab}, p_{aB}\}` by dividing by {math}`n`. Below,
956+
{math}`n_{ab} = n - n_{AB} - n_{Ab} - n_{aB}`, {math}`n_A = n_{AB} + n_{Ab}`
957+
and {math}`n_B = n_{AB} + n_{aB}`, with frequencies {math}`p` found by dividing
958+
by {math}`n`.
954959

955960
`D`
956-
: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = p_{ab} - p_{a}p_{b}`
961+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = p_{AB}p_{ab} - p_{Ab}p_{aB}`
957962

958963
This statistic is polarised, as the unpolarised result, which averages over
959964
allele labelings, is zero. Uses the `total` weighting method.
960965

961966
`D_prime`
962-
: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = \frac{D}{D_{\max}}`,
967+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = \frac{D}{D_{\max}}`,
963968

964-
where {math}
969+
where {math}`D_{\max} = \begin{cases}
965970
\min\{p_A (1-p_B), p_B (1-p_B)\} & \textrm{if }D>=0 \\
966-
\min\{p_A p_B, (1-p_B) (1-p_B)\} & \textrm{otherwise}
967-
\end{cases}```
971+
\min\{p_A p_B, (1-p_B) (1-p_B)\} & \textrm{if }D<0
972+
\end{cases}`
968973

969974
and {math}`D` is defined above. Polarised, `total` weighted.
970975

971976
`D2`
972-
: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = (p_{ab} - p_{a} p_{b})^2`
977+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = D^2`
973978

974-
Unpolarised, `total` weighted.
979+
and {math}`D` is defined above. Unpolarised, `total` weighted.
975980

976981
`Dz`
977-
: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = D (1 - 2 p_{a})(1 - 2p_{b})`,
982+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = D (1 - 2 p_A) (1 - 2 p_B)`,
978983

979984
where {math}`D` is defined above. Unpolarised, `total` weighted.
980985

981986
`pi2`
982-
: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = p_{a}p_{b}(1-p_{a})(1-p_{b})`
987+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = p_A (1-p_A) p_B (1-p_B)`
983988

984989
Unpolarised, `total` weighted.
985990

986991
`r`
987-
: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = \frac{D}{\sqrt{p_{a}p_{b}(1-p_{a})(1-p_{b})}}`,
992+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = \frac{D}{\sqrt{p_A (1-p_A) p_B (1-p_B)}}`,
988993

989994
where {math}`D` is defined above. Polarised, `total` weighted.
990995

991996
`r2`
992-
: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = \frac{D^{2}}{p_{a}p_{b}(1-p_{a})(1-p_{b})}`,
997+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = \frac{D^{2}}{p_A (1-p_A) p_B (1-p_B))}`,
993998

994999
where {math}`D` is defined above. Unpolarised, `haplotype` weighted.
9951000

@@ -1010,12 +1015,16 @@ Two-way statistics are indexed by sample sets {math}`i, j` and compute values
10101015
using haplotype counts within pairs of sample sets.
10111016

10121017
`D2`
1013-
: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = D_i * D_j`,
1018+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = D_i * D_j`,
10141019

1015-
where {math}`D` is defined above.
1020+
where {math}`D_i` denotes {math}`D` computed within sample set {math}`i`,
1021+
and {math}`D` is defined above. Unpolarised, `total` weighted.
10161022

10171023
`r2`
1018-
: {math}`f(n_{Ab}, n_{aB}, n_{AB}, n) = \frac{(p_{AB_i} - (p_{A_i} p_{B_i})) (p_{AB_j} - (p_{A_j} p_{B_j}))}{\sqrt{p_{A_i} (1 - p_{A_i}) p_{B_i} (1 - p_{B_i})}\sqrt{p_{A_j} (1 - p_{A_j}) p_{B_j} (1 - p_{B_j})}}`
1024+
: {math}`f(n_{AB}, n_{Ab}, n_{aB}, n) = r_i r_j`,
1025+
1026+
where {math}`r_i` denotes {math}`r` computed within sample set {math}`i`,
1027+
and {math}`r` is defined above. Unpolarised, `haplotype` weighted.
10191028

10201029
And `D2_unbiased`, which can be found in [Ragsdale and Gravel
10211030
(2020)](https://doi.org/10.1093/molbev/msz265).

0 commit comments

Comments
 (0)