@@ -749,8 +749,7 @@ data (and so this section can be skipped on first reading).
749749We use two implementations for
750750combining results from multiple alleles: ` hap_weighted ` and ` total_weighted ` .
751751These are statistic-specific and not chosen by the user, with choices motivated
752- by [ Zhao (2007)] ( https://doi.org/10.1017/S0016672307008634 ) with the goal
753- of return statistics as expected by the user.
752+ by [ Zhao (2007)] ( https://doi.org/10.1017/S0016672307008634 ) .
754753
755754Briefly, consider a pair of sites with {math}` n ` alleles at the first locus and
756755{math}` m ` alleles at the second. (Whether this includes the ancestral allele
@@ -775,11 +774,13 @@ Out of all of the available summary functions, only {math}`r^2` uses
775774
776775Within this framework, statistics may be either
777776polarised or unpolarised. For statistics that are polarised, we compute
778- statistic values for pairs of derived alleles. Unpolarised statistics compute
777+ statistic values for pairs of derived alleles. (For this purpose, the "derived" alleles
778+ at a site are all alleles except that stored as the `` ancestral_state `` for the site.)
779+ Unpolarised statistics compute
779780statistics over all pairs of alleles, derived and ancestral. In either case,
780781the result is averaged over these values, using a weighting
781- scheme described below. The option for polarisation is not exposed to the user.
782- Instead, we implement statistics that are polarised where appropriate .
782+ scheme described below. The option for polarisation is not exposed to the user,
783+ and list which statistics are polarised below .
783784
784785(sec_stats_two_locus_branch)=
785786
@@ -801,8 +802,13 @@ mutations fall on any pair of branches.
801802In other words, imagine we place two mutations uniformly, one on each tree, and
802803then compute the statistic; the branch mode computes the expected value of the
803804statistic over this process, multiplied by the product of the total branch
804- lengths of each tree. In some cases, this process will compute statistics over
805- branches that do not result in a polymorphism in the sample, in which case
805+ lengths of each tree. This weighting accounts for mutational opportunity, so that
806+ the sum of the branch-mode statistic over all positions in a genomic region,
807+ multiplied by a mutation rate, is equal to the expected sum of the two-locus site
808+ statistic over all mutations falling in that region under an infinite-sites model.
809+
810+ If some branches are ancestral to none or all of the samples, then a mutation on
811+ these branches does not result in a polymorphism in the sample, and so the
806812ratio statistics ({math}` r ` , {math}` r^2 ` ) will return ` nan ` values. See
807813further discussion of ` nan ` values below.
808814
@@ -873,26 +879,26 @@ ld = ts.ld_matrix(mode=“site”, sample_sets=range(8))
873879print(ld)
874880```
875881
876- For concreteness, we would expect the following dimensions with the specified
882+ We would get the following dimensions with the specified
877883` sample_sets ` and ` indexes ` arguments.
878884
879885```
880886# one-way
881- ts.ld_matrix(sample_sets=None) -> 2 dimensions
882- ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
883- ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
887+ ts.ld_matrix(sample_sets=None) # -> 2 dimensions
888+ ts.ld_matrix(sample_sets=[0, 1, 2, 3]) # -> 2 dimensions
889+ ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) # -> 3 dimensions
884890# two-way
885- ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
886- ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
891+ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) # -> 2 dimensions
892+ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) # -> 3 dimensions
887893```
888894
889895#### Why are there ` nan ` values in the LD matrix?
890896
891897For some statistics, it is possible to observe ` nan ` entries in the LD matrix,
892898which can be surprising or numerically impact downstream analyses. A ` nan ` entry
893- may occur when computing a ratio statistic (such as {math}` r ` or {math}` r^2 ` )
894- with a denominator of zero, indicating that one or both sites in the pair are
895- monomorphic . This can happen for a number of reasons:
899+ occurs if the denominator of a ratio statistic (such as {math}` r ` or {math}` r^2 ` )
900+ is zero, indicating that one or both of the alleles in the pair is fixed or absent in the
901+ sample set under consideration . This can happen for a number of reasons:
896902
897903- The mutation models allows for reversible mutations, so a back mutation at
898904 a site has resulted in a single allele despite multiple mutations in the
@@ -922,7 +928,7 @@ notation represent alternate alleles.
922928Two-way statistics are summaries of haplotype counts between two sample sets,
923929which operate on the three haplotype counts (as in one-way stats, above)
924930computed from each sample set, indexed by ` (i, j) ` . These statistics take on
925- a different meaning from their one-way counterparts. For instance {math} ` D2 `
931+ a different meaning from their one-way counterparts. For instance ` stat="D2" `
926932over a pair of sample sets computes {math}` D_i D_j ` , which is the product of
927933the covariance measure of LD within each sample set and is related to the
928934covariance of {math}` D ` between sample sets.
@@ -944,7 +950,7 @@ sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
944950The two-locus summary functions all take haplotype counts and sample set size
945951as input. Each of our summary functions has the signature
946952{math}` f(n_{Ab}, n_{aB}, n_{AB}, n) ` , converting to haplotype frequencies
947- {math}` \{p_{Ab}, p_{aB}, p_{AB}\} ` where appropriate by dividing by {math}` n ` .
953+ {math}` \{p_{Ab}, p_{aB}, p_{AB}\} ` by dividing by {math}` n ` .
948954
949955` D `
950956: {math}` f(n_{Ab}, n_{aB}, n_{AB}, n) = p_{ab} - p_{a}p_{b} `
0 commit comments