@@ -836,26 +836,11 @@ or tuple will produce a single two-dimensional matrix, while list of these
836836will produce a three-dimensional array, with the outer dimension of length
837837equal to the length of the list.
838838
839- For concreteness, we would expect the following dimensions with the specified
840- ` sample_sets ` and ` indexes ` arguments.
839+ For example, to compute the {math}` r^2 ` LD matrix over a subset of samples in
840+ the tree sequence (such as sample nodes 0 through 7), we would specify the
841+ samples as follows:
841842
842843```
843- # one-way
844- ts.ld_matrix(sample_sets=None) -> 2 dimensions
845- ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
846- ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
847- # two-way
848- ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
849- ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
850- ```
851-
852- By computing the LD statistic for a subset of samples in the tree sequence, it
853- is possible that statistics are computed over pairs of sites that are not
854- variable within the sample set. This can result in zero- or nan-valued statistics
855- as outputs. To avoid this, the tree sequence may first be simplified to the
856- sample subset, which will remove sites that are invariant within that subset:
857-
858- ``` {code-cell} ipython3
859844ts = msprime.sim_ancestry(
860845 20,
861846 population_size=10000,
@@ -864,32 +849,38 @@ ts = msprime.sim_ancestry(
864849 random_seed=12)
865850ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12)
866851
867- ld_all = ts.ld_matrix(mode=" site" )
868- print(ld_all )
852+ ld = ts.ld_matrix(mode=“ site”, sample_sets=range(8) )
853+ print(ld )
869854```
870855
871- Considering a subset of samples:
856+ For concreteness, we would expect the following dimensions with the specified
857+ ` sample_sets ` and ` indexes ` arguments.
872858
873- ``` {code-cell} ipython3
874- ld_sub = ts.ld_matrix(mode="site", sample_sets=range(8))
875- print(ld_sub)
876859```
877-
878- Simplifying and computing the LD matrix avoids computing {math}` r^2 ` for
879- invariant sites in the subsample:
880-
881- ``` {code-cell} ipython3
882- ts_simp = ts.simplify(samples=range(8), filter_sites=True)
883- ld_sub = ts_simp.ld_matrix(mode="site", sample_sets=range(8))
884- print(ld_sub)
860+ # one-way
861+ ts.ld_matrix(sample_sets=None) -> 2 dimensions
862+ ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
863+ ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
864+ # two-way
865+ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
866+ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
885867```
886868
887- Any remaining sites that produce {math}` r^2 ` with a nan values are those with
888- a mutation that occurs above the root of the tree, so that all samples in the
889- subset carry the derived allele (so that its allele frequency is 1). Also note
890- that under certain mutation models, back mutations may create sites that are
891- invariant in a sample, and this will also result in nan values observed in the
892- LD matrix.
869+ #### Why are there nan values in the LD matrix?
870+
871+ For some statistics, it is possible to observe nan entries in the LD matrix,
872+ which can be surprising or numerically impact downstream analyses. A nan entry
873+ may occur when computing a ratio statistic (such as {math}` r ` or {math}` r^2 ` )
874+ with a denominator of zero, indicating that one or both sites in the pair are
875+ monomorphic. This can happen for a number of reasons:
876+
877+ - The mutation models allows for reversible mutations, so a back mutation at
878+ a site has resulted in a single allele despite multiple mutations in the
879+ history of the sample.
880+ - LD is computed for a subsample of individuals, and some sites are not
881+ variable among the sample nodes in the subsample.
882+ - A mutation exists above the root of the local tree, so that all samples carry
883+ the mutation, and one or more sites are not variable.
893884
894885(sec_stats_two_locus_sample_one_way_stats)=
895886
0 commit comments