@@ -780,7 +780,7 @@ Unpolarised statistics compute
780780statistics over all pairs of alleles, derived and ancestral. In either case,
781781the result is averaged over these values, using a weighting
782782scheme described below. The option for polarisation is not exposed to the user,
783- and list which statistics are polarised below.
783+ and we list which statistics are polarised below.
784784
785785(sec_stats_two_locus_branch)=
786786
@@ -807,11 +807,6 @@ the sum of the branch-mode statistic over all positions in a genomic region,
807807multiplied by a mutation rate, is equal to the expected sum of the two-locus site
808808statistic over all mutations falling in that region under an infinite-sites model.
809809
810- If some branches are ancestral to none or all of the samples, then a mutation on
811- these branches does not result in a polymorphism in the sample, and so the
812- ratio statistics ({math}` r ` , {math}` r^2 ` ) will return ` nan ` values. See
813- further discussion of ` nan ` values below.
814-
815810The time complexity of this method is quadratic in the number of samples,
816811due to the pairwise comparisons of branches from each pair of trees.
817812By default, this method computes
@@ -826,15 +821,21 @@ between the first 4 trees in the tree sequence. The tree breakpoints are
826821a convenient way to specify those first four trees.
827822
828823``` {code-cell} ipython3
829- ld = ts.ld_matrix(mode="branch", positions=[ts.breakpoints(as_array=True)[0:4]])
824+ ld = ts.ld_matrix(
825+ mode="branch",
826+ positions=[ts.breakpoints(as_array=True)[0:4]]
827+ )
830828print(ld)
831829```
832830
833831Again, we can specify the row and column trees separately.
834832
835833``` {code-cell} ipython3
836834breakpoints = ts.breakpoints(as_array=True)
837- ld = ts.ld_matrix(mode="branch", positions=[breakpoints[[0]], breakpoints[0:4]])
835+ ld = ts.ld_matrix(
836+ mode="branch",
837+ positions=[breakpoints[[0]], breakpoints[0:4]]
838+ )
838839print(ld)
839840```
840841
@@ -895,10 +896,11 @@ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) # -> 3
895896#### Why are there ` nan ` values in the LD matrix?
896897
897898For some statistics, it is possible to observe ` nan ` entries in the LD matrix,
898- which can be surprising or numerically impact downstream analyses. A ` nan ` entry
899- occurs if the denominator of a ratio statistic (such as {math}` r ` or {math}` r^2 ` )
900- is zero, indicating that one or both of the alleles in the pair is fixed or absent in the
901- sample set under consideration. This can happen for a number of reasons:
899+ which can be surprising or numerically impact downstream analyses. A ` nan `
900+ entry occurs if the denominator of a ratio statistic (including {math}` r ` and
901+ {math}` r^2 ` ) is zero, indicating that one or both of the alleles in the pair is
902+ fixed or absent in the sample set under consideration. This can happen for
903+ a number of reasons:
902904
903905- The mutation models allows for reversible mutations, so a back mutation at
904906 a site has resulted in a single allele despite multiple mutations in the
917919##### One-way Statistics
918920
919921One-way statistics are summaries of two loci in a single sample set, using
920- a triple of haplotype counts {math}` \{n_{Ab }, n_{aB }, n_{AB }\} ` and the size of
922+ a triple of haplotype counts {math}` \{n_{AB }, n_{Ab }, n_{aB }\} ` and the size of
921923the sample set {math}` n ` , where the capitalized and lowercase letters in our
922924notation represent alternate alleles.
923925
@@ -949,47 +951,50 @@ sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
949951
950952The two-locus summary functions all take haplotype counts and sample set size
951953as input. Each of our summary functions has the signature
952- {math}` f(n_{Ab}, n_{aB}, n_{AB}, n) ` , converting to haplotype frequencies
953- {math}` \{p_{Ab}, p_{aB}, p_{AB}\} ` by dividing by {math}` n ` .
954+ {math}` f(n_{AB}, n_{Ab}, n_{aB}, n) ` , converting to haplotype frequencies
955+ {math}` \{p_{AB}, p_{Ab}, p_{aB}\} ` by dividing by {math}` n ` . Below,
956+ {math}` n_{ab} = n - n_{AB} - n_{Ab} - n_{aB} ` , {math}` n_A = n_{AB} + n_{Ab} `
957+ and {math}` n_B = n_{AB} + n_{aB} ` , with frequencies {math}` p ` found by dividing
958+ by {math}` n ` .
954959
955960` D `
956- : {math}` f(n_{Ab }, n_{aB }, n_{AB }, n) = p_{ab} - p_{a }p_{b } `
961+ : {math}` f(n_{AB }, n_{Ab }, n_{aB }, n) = p_{AB}p_{ ab} - p_{Ab }p_{aB } `
957962
958963 This statistic is polarised, as the unpolarised result, which averages over
959964 allele labelings, is zero. Uses the ` total ` weighting method.
960965
961966` D_prime `
962- : {math}` f(n_{Ab }, n_{aB }, n_{AB }, n) = \frac{D}{D_{\max}} ` ,
967+ : {math}` f(n_{AB }, n_{Ab }, n_{aB }, n) = \frac{D}{D_{\max}} ` ,
963968
964- where {math}
969+ where {math}`D _ {\max} = \begin{cases}
965970 \min\{ p_A (1-p_B), p_B (1-p_B)\} & \textrm{if }D>=0 \\
966- \min\{ p_A p_B, (1-p_B) (1-p_B)\} & \textrm{otherwise}
967- \end{cases}```
971+ \min\{ p_A p_B, (1-p_B) (1-p_B)\} & \textrm{if }D<0
972+ \end{cases}`
968973
969974 and {math}` D ` is defined above. Polarised, ` total ` weighted.
970975
971976` D2 `
972- : {math}` f(n_{Ab }, n_{aB }, n_{AB }, n) = (p_{ab} - p_{a} p_{b}) ^2 `
977+ : {math}` f(n_{AB }, n_{Ab }, n_{aB }, n) = D ^2 `
973978
974- Unpolarised, ` total ` weighted.
979+ and {math} ` D ` is defined above. Unpolarised, ` total ` weighted.
975980
976981` Dz `
977- : {math}` f(n_{Ab }, n_{aB }, n_{AB }, n) = D (1 - 2 p_{a}) (1 - 2p_{b} ) ` ,
982+ : {math}` f(n_{AB }, n_{Ab }, n_{aB }, n) = D (1 - 2 p_A) (1 - 2 p_B ) ` ,
978983
979984 where {math}` D ` is defined above. Unpolarised, ` total ` weighted.
980985
981986` pi2 `
982- : {math}` f(n_{Ab }, n_{aB }, n_{AB }, n) = p_{a}p_{b} (1-p_{a}) (1-p_{b} ) `
987+ : {math}` f(n_{AB }, n_{Ab }, n_{aB }, n) = p_A (1-p_A) p_B (1-p_B ) `
983988
984989 Unpolarised, ` total ` weighted.
985990
986991` r `
987- : {math}` f(n_{Ab }, n_{aB }, n_{AB }, n) = \frac{D}{\sqrt{p_{a}p_{b} (1-p_{a}) (1-p_{b} )}} ` ,
992+ : {math}` f(n_{AB }, n_{Ab }, n_{aB }, n) = \frac{D}{\sqrt{p_A (1-p_A) p_B (1-p_B )}} ` ,
988993
989994 where {math}` D ` is defined above. Polarised, ` total ` weighted.
990995
991996` r2 `
992- : {math}` f(n_{Ab }, n_{aB }, n_{AB }, n) = \frac{D^{2}}{p_{a}p_{b} (1-p_{a}) (1-p_{b} )} ` ,
997+ : {math}` f(n_{AB }, n_{Ab }, n_{aB }, n) = \frac{D^{2}}{p_A (1-p_A) p_B (1-p_B) )} ` ,
993998
994999 where {math}` D ` is defined above. Unpolarised, ` haplotype ` weighted.
9951000
@@ -1010,12 +1015,16 @@ Two-way statistics are indexed by sample sets {math}`i, j` and compute values
10101015using haplotype counts within pairs of sample sets.
10111016
10121017` D2 `
1013- : {math}` f(n_{Ab }, n_{aB }, n_{AB }, n) = D_i * D_j ` ,
1018+ : {math}` f(n_{AB }, n_{Ab }, n_{aB }, n) = D_i * D_j ` ,
10141019
1015- where {math}` D ` is defined above.
1020+ where {math}` D_i ` denotes {math}` D ` computed within sample set {math}` i ` ,
1021+ and {math}` D ` is defined above. Unpolarised, ` total ` weighted.
10161022
10171023` r2 `
1018- : {math}` f(n_{Ab}, n_{aB}, n_{AB}, n) = \frac{(p_{AB_i} - (p_{A_i} p_{B_i})) (p_{AB_j} - (p_{A_j} p_{B_j}))}{\sqrt{p_{A_i} (1 - p_{A_i}) p_{B_i} (1 - p_{B_i})}\sqrt{p_{A_j} (1 - p_{A_j}) p_{B_j} (1 - p_{B_j})}} `
1024+ : {math}` f(n_{AB}, n_{Ab}, n_{aB}, n) = r_i r_j ` ,
1025+
1026+ where {math}` r_i ` denotes {math}` r ` computed within sample set {math}` i ` ,
1027+ and {math}` r ` is defined above. Unpolarised, ` haplotype ` weighted.
10191028
10201029And ` D2_unbiased ` , which can be found in [ Ragsdale and Gravel
10211030(2020)] ( https://doi.org/10.1093/molbev/msz265 ) .
0 commit comments