@@ -849,6 +849,45 @@ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dime
849849ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
850850```
851851
852+ By computing the LD statistic for a subset of samples in the tree sequence, it
853+ is possible that statistics are computed over pairs of sites that are not
854+ variable within the sample set. This can result in zero- or nan-valued statistics
855+ as outputs. To avoid this, the tree sequence may first be simplified to the
856+ sample subset, which will remove sites that are invariant within that subset:
857+
858+ ``` {code-cell} ipython3
859+ ts = msprime.sim_ancestry(
860+ 20,
861+ population_size=10000,
862+ sequence_length=1000,
863+ recombination_rate=2e-8,
864+ random_seed=12)
865+ ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12)
866+
867+ ld_all = ts.ld_matrix(mode="site")
868+ print(ld_all)
869+ ```
870+
871+ Considering a subset of samples:
872+
873+ ``` {code-cell} ipython3
874+ ld_sub = ts.ld_matrix(mode="site", sample_sets=range(8))
875+ print(ld_sub)
876+ ```
877+
878+ Simplifying and computing the LD matrix avoids computing {math}` r^2 ` for
879+ invariant sites in the subsample:
880+
881+ ``` {code-cell} ipython3
882+ ts_simp = ts.simplify(samples=range(8), filter_sites=True)
883+ ld_sub = ts_simp.ld_matrix(mode="site", sample_sets=range(8))
884+ print(ld_sub)
885+ ```
886+
887+ Any remaining sites that produce {math}` r^2 ` with a nan values are those with
888+ a mutation that occurs above the root of the tree, so that all samples in the
889+ subset carry the derived allele (so that its allele frequency is 1).
890+
852891(sec_stats_two_locus_sample_one_way_stats)=
853892
854893##### One-way Statistics
0 commit comments