Skip to content

Commit 2fa87cd

Browse files
committed
Update docstring (feedback from Peter)
Provide more precise requirements for `f` and `norm_f` and give some basic understanding of what these functions are and when normalisation will be required. Attempt to fix syntax errors by adding a newline in code blocks. Clarify output dimensions in the code comments (though this might need to change since I think we'll remove the `np.moveaxis` call at the end.
1 parent 7378e42 commit 2fa87cd

1 file changed

Lines changed: 37 additions & 30 deletions

File tree

python/tskit/trees.py

Lines changed: 37 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -10945,42 +10945,47 @@ def two_locus_count_stat(
1094510945
"""
1094610946
Compute two-locus statistics with a user-defined python function that
1094710947
operates on haplotype counts. TODO: reference modes in two-locus docs.
10948-
On each pair of sites or trees, the summary function is provided with
10949-
``X``, a matrix with shape (3, k) and ``n``, a vector with shape (k,),
10950-
where k is the number of sample sets provided. ``X`` is a read-only
10951-
matrix whose rows contain haplotype counts per sample set (counts of AB,
10952-
Ab, aB) and ``n`` is a vector of sample set sizes.
10953-
10954-
.. note::
10955-
Because we are operating on very small matrices/vectors, vectorised
10956-
operations are often times slower than operations on scalars. Simply
10957-
returning ``[value]`` can be faster than returning
10958-
``value[np.newaxis,]`` or ``np.expand_dims(value, 0)``.
10959-
10960-
What follows is an example of computing ``D`` from a tree sequence. Many
10961-
more examples can be found in the test suite
10962-
``test_ld_matrix.py::GeneralStatsFuncs``. Let's begin with our summary
10963-
function, ``D``. We convert counts to proportions, then compute ``D``,
10964-
returning a numpy array with length equal to the number of sample sets.
10948+
On each pair of sites or trees, the summary function is called with
10949+
haplotype counts for all provided sample sets. The summary function
10950+
(``f``) must accept two parameters: ``X``, a matrix with shape (3, k)
10951+
and ``n``, a vector with shape (k,), where k is the number of sample
10952+
sets provided. ``X`` is a read-only matrix whose rows contain haplotype
10953+
counts (AB, Ab, aB) per sample set and ``n`` is a vector of sample set
10954+
sizes. ``f`` must return a list of results with length ``result_dim``.
10955+
10956+
What follows is an example of computing ``D`` from a tree sequence
10957+
(TODO: cite two-locus docs for more details). We convert counts to
10958+
proportions, then compute ``D``, returning a numpy array with length
10959+
equal to the number of ``result_dim``s.
1096510960
1096610961
.. code-block:: python
10962+
1096710963
def D(X, n):
1096810964
pAB, pAb, paB = X / n
1096910965
pA = pAb + pAB
1097010966
pB = paB + pAB
1097110967
return pAB - (pA * pB)
1097210968
10973-
``norm_f`` is a normalisation function used to combine all computed
10974-
statistics for multiallelic allele pairs (TODO: see two-locus
10975-
docs). Biallelic sites do not require any normalisation (in fact, the
10976-
normalisation function is never called for biallelic sites). If one of
10977-
either site A or site B is multiallelic, then the normalisation function
10978-
will be called. The default normalisation function is identical to
10979-
``total_norm`` shown in the example below. ``hap_norm`` is required for
10980-
normalising :math:`r^2`. Both of these examples return a numpy array
10981-
with length equal to the number of sample sets (for one-way stats).
10969+
The summary function is called for each pair of sites or trees,
10970+
producing results that must be combined when multiallelic sites are
10971+
present (``site`` mode only), summary function results must
10972+
need to be normalised in order to be aggragated for all pairs of alleles
10973+
between both sites. Branch statistics and biallelic sites do not require
10974+
any normalisation, ``norm_f`` is only called if one of the two sites
10975+
under consideration is multiallelic. TODO: reference two-locus docs for
10976+
further information about normalisation. ``norm_f`` is a normalisation
10977+
function that must accept four parameters: ``X`` and ``n`` are the same
10978+
inputs that ``f`` accepts, along with ``nA`` and ``nB``, which hold the
10979+
count of ``A`` alleles and ``B`` alleles. For example, if ``A`` is
10980+
biallelic and ``B`` is triallelic, ``nA=2`` and ``nB=3``. ``f`` must
10981+
return a list of results with length ``result_dim``. The default
10982+
normalisation function is identical to ``total_norm`` shown in the
10983+
example below. ``hap_norm`` is required for normalising
10984+
:math:`r^2`. Both of these examples return a numpy array with length
10985+
equal to the number of ``result_dim``s.
1098210986
1098310987
.. code-block:: python
10988+
1098410989
def total_norm(X, n, nA, nB):
1098510990
[1 / (nA * nB)] * result_dim
1098610991
@@ -10990,6 +10995,7 @@ def hap_norm(X, n, nA, nB):
1099010995
A simple call (without specifying normalisation) would look like this
1099110996
1099210997
.. code-block::python
10998+
1099310999
ts.two_locus_count_stat([ts.samples()], D, 1, polarised=True)
1099411000
1099511001
:param list sample_sets: A list of lists of Node IDs, specifying the
@@ -11000,22 +11006,23 @@ def hap_norm(X, n, nA, nB):
1100011006
:param int result_dim: The length of ``f`` and ``norm_f``'s return value.
1100111007
:param norm_f: A function that takes four arguments - the first two are
1100211008
the same as ``f``, the second two are scalars representing the
11003-
number of A and B alleles, respectively.
11009+
number of A and B alleles, respectively. If ``None``, then defaults
11010+
to the "total" normalization described above.
1100411011
:param bool polarised: Whether to leave the ancestral state out of
1100511012
computations: see :ref:`sec_stats` for more details.
1100611013
:param list sites: TODO: two-locus docs
1100711014
:param list positions: TODO: two-locus docs
1100811015
:param str mode: A string giving the "type" of the statistic to be
1100911016
computed (defaults to "site").
1101011017
:return: A ndarray with shape equal to (TODO: reference two-locus docs,
11011-
no dimension dropping shape=(k, m, m) where k=num_sample_sets,
11018+
no dimension dropping shape=(k, m, m) where k=result_dim,
1101211019
m=num_sites or num_trees).
1101311020
"""
1101411021
row_sites, col_sites = self.parse_sites(sites)
1101511022
row_positions, col_positions = self.parse_positions(positions)
1101611023
_, sample_sets, sample_set_sizes = self.__convert_sample_sets(sample_sets)
1101711024
if norm_f is None:
11018-
# produce the same number of dims as output dimensions with [val] * dim
11025+
# produce the same number of dims as result dimensions with [val] * dim
1101911026
def norm_f(X, n, nA, nB):
1102011027
return [1 / (nA * nB)] * result_dim
1102111028

@@ -11032,7 +11039,7 @@ def norm_f(X, n, nA, nB):
1103211039
col_positions,
1103311040
mode,
1103411041
)
11035-
# Orient the data so that the first dimension is the sample set so that
11042+
# Orient the data so that the first dimension is the result_dim so that
1103611043
# we get one LD matrix per result dimension
1103711044
return np.moveaxis(result, -1, 0)
1103811045

0 commit comments

Comments
 (0)