example of subsetting causing nan values

apragsdale · apragsdale · commit fa7b3ab6bd15 · 2026-03-15T13:28:25.000-05:00
diff --git a/docs/stats.md b/docs/stats.md
@@ -849,6 +849,45 @@ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dime
 ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
 ```
 
+When an LD statistic is computed for a subset of the samples in the tree
+sequence, some pairs of sites may not be variable within that sample set.
+This can result in zero- or nan-valued statistics in the output. To avoid
+this, the tree sequence may first be simplified to the sample subset, which
+removes sites at which no sample in the subset inherits a mutation:
+
+```{code-cell} ipython3
+ts = msprime.sim_ancestry(
+    20,
+    population_size=10000,
+    sequence_length=1000,
+    recombination_rate=2e-8,
+    random_seed=12)
+ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12)
+
+ld_all = ts.ld_matrix(mode="site")
+print(ld_all)
+```
+
+Considering a subset of samples:
+
+```{code-cell} ipython3
+ld_sub = ts.ld_matrix(mode="site", sample_sets=range(8))
+print(ld_sub)
+```
+
+Simplifying before computing the LD matrix avoids computing {math}`r^2` for
+sites that are invariant in the subsample:
+
+```{code-cell} ipython3
+ts_simp = ts.simplify(samples=range(8), filter_sites=True)
+ld_sub = ts_simp.ld_matrix(mode="site", sample_sets=range(8))
+print(ld_sub)
+```
+
+Any remaining sites that produce nan values of {math}`r^2` are those with
+a mutation that occurs above the root of the tree, so all samples in the
+subset carry the derived allele (that is, its allele frequency is 1).
+
 (sec_stats_two_locus_sample_one_way_stats)=
 
 ##### One-way Statistics