Skip to content

Commit 87de30f

Browse files
committed
Revise docs around nan values
1 parent 94bcc34 commit 87de30f

1 file changed

Lines changed: 29 additions & 38 deletions

File tree

docs/stats.md

Lines changed: 29 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -836,26 +836,11 @@ or tuple will produce a single two-dimensional matrix, while list of these
836836
will produce a three-dimensional array, with the outer dimension of length
837837
equal to the length of the list.
838838

839-
For concreteness, we would expect the following dimensions with the specified
840-
`sample_sets` and `indexes` arguments.
839+
For example, to compute the {math}`r^2` LD matrix over a subset of samples in
840+
the tree sequence (such as sample nodes 0 through 7), we would specify the
841+
samples as follows:
841842

842843
```
843-
# one-way
844-
ts.ld_matrix(sample_sets=None) -> 2 dimensions
845-
ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
846-
ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
847-
# two-way
848-
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
849-
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
850-
```
851-
852-
By computing the LD statistic for a subset of samples in the tree sequence, it
853-
is possible that statistics are computed over pairs of sites that are not
854-
variable within the sample set. This can result in zero- or nan-valued statistics
855-
as outputs. To avoid this, the tree sequence may first be simplified to the
856-
sample subset, which will remove sites that are invariant within that subset:
857-
858-
```{code-cell} ipython3
859844
ts = msprime.sim_ancestry(
860845
20,
861846
population_size=10000,
@@ -864,32 +849,38 @@ ts = msprime.sim_ancestry(
864849
random_seed=12)
865850
ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12)
866851
867-
ld_all = ts.ld_matrix(mode="site")
868-
print(ld_all)
852+
ld = ts.ld_matrix(mode=site”, sample_sets=range(8))
853+
print(ld)
869854
```
870855

871-
Considering a subset of samples:
856+
For concreteness, we would expect the following dimensions with the specified
857+
`sample_sets` and `indexes` arguments.
872858

873-
```{code-cell} ipython3
874-
ld_sub = ts.ld_matrix(mode="site", sample_sets=range(8))
875-
print(ld_sub)
876859
```
877-
878-
Simplifying and computing the LD matrix avoids computing {math}`r^2` for
879-
invariant sites in the subsample:
880-
881-
```{code-cell} ipython3
882-
ts_simp = ts.simplify(samples=range(8), filter_sites=True)
883-
ld_sub = ts_simp.ld_matrix(mode="site", sample_sets=range(8))
884-
print(ld_sub)
860+
# one-way
861+
ts.ld_matrix(sample_sets=None) -> 2 dimensions
862+
ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
863+
ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
864+
# two-way
865+
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
866+
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
885867
```
886868

887-
Any remaining sites that produce {math}`r^2` with a nan values are those with
888-
a mutation that occurs above the root of the tree, so that all samples in the
889-
subset carry the derived allele (so that its allele frequency is 1). Also note
890-
that under certain mutation models, back mutations may create sites that are
891-
invariant in a sample, and this will also result in nan values observed in the
892-
LD matrix.
869+
#### Why are there nan values in the LD matrix?
870+
871+
For some statistics, it is possible to observe nan entries in the LD matrix,
872+
which can be surprising or numerically impact downstream analyses. A nan entry
873+
may occur when computing a ratio statistic (such as {math}`r` or {math}`r^2`)
874+
with a denominator of zero, indicating that one or both sites in the pair are
875+
monomorphic. This can happen for a number of reasons:
876+
877+
- The mutation models allows for reversible mutations, so a back mutation at
878+
a site has resulted in a single allele despite multiple mutations in the
879+
history of the sample.
880+
- LD is computed for a subsample of individuals, and some sites are not
881+
variable among the sample nodes in the subsample.
882+
- A mutation exists above the root of the local tree, so that all samples carry
883+
the mutation, and one or more sites are not variable.
893884

894885
(sec_stats_two_locus_sample_one_way_stats)=
895886

0 commit comments

Comments
 (0)