@@ -10945,42 +10945,47 @@ def two_locus_count_stat(
1094510945 """
1094610946 Compute two-locus statistics with a user-defined python function that
1094710947 operates on haplotype counts. TODO: reference modes in two-locus docs.
10948- On each pair of sites or trees, the summary function is provided with
10949- ``X``, a matrix with shape (3, k) and ``n``, a vector with shape (k,),
10950- where k is the number of sample sets provided. ``X`` is a read-only
10951- matrix whose rows contain haplotype counts per sample set (counts of AB,
10952- Ab, aB) and ``n`` is a vector of sample set sizes.
10953-
10954- .. note::
10955- Because we are operating on very small matrices/vectors, vectorised
10956- operations are often times slower than operations on scalars. Simply
10957- returning ``[value]`` can be faster than returning
10958- ``value[np.newaxis,]`` or ``np.expand_dims(value, 0)``.
10959-
10960- What follows is an example of computing ``D`` from a tree sequence. Many
10961- more examples can be found in the test suite
10962- ``test_ld_matrix.py::GeneralStatsFuncs``. Let's begin with our summary
10963- function, ``D``. We convert counts to proportions, then compute ``D``,
10964- returning a numpy array with length equal to the number of sample sets.
10948+ On each pair of sites or trees, the summary function is called with
10949+ haplotype counts for all provided sample sets. The summary function
10950+ (``f``) must accept two parameters: ``X``, a matrix with shape (3, k)
10951+ and ``n``, a vector with shape (k,), where k is the number of sample
10952+ sets provided. ``X`` is a read-only matrix whose rows contain haplotype
10953+ counts (AB, Ab, aB) per sample set and ``n`` is a vector of sample set
10954+ sizes. ``f`` must return a list of results with length ``result_dim``.
10955+
10956+ What follows is an example of computing ``D`` from a tree sequence
10957+ (TODO: cite two-locus docs for more details). We convert counts to
10958+ proportions, then compute ``D``, returning a numpy array with length
10959+ equal to the number of ``result_dim``s.
1096510960
1096610961 .. code-block:: python
10962+
1096710963 def D(X, n):
1096810964 pAB, pAb, paB = X / n
1096910965 pA = pAb + pAB
1097010966 pB = paB + pAB
1097110967 return pAB - (pA * pB)
1097210968
10973- ``norm_f`` is a normalisation function used to combine all computed
10974- statistics for multiallelic allele pairs (TODO: see two-locus
10975- docs). Biallelic sites do not require any normalisation (in fact, the
10976- normalisation function is never called for biallelic sites). If one of
10977- either site A or site B is multiallelic, then the normalisation function
10978- will be called. The default normalisation function is identical to
10979- ``total_norm`` shown in the example below. ``hap_norm`` is required for
10980- normalising :math:`r^2`. Both of these examples return a numpy array
10981- with length equal to the number of sample sets (for one-way stats).
10969+ The summary function is called for each pair of sites or trees,
10970+ producing results that must be combined when multiallelic sites are
10971+ present (``site`` mode only), summary function results must
10972+ need to be normalised in order to be aggragated for all pairs of alleles
10973+ between both sites. Branch statistics and biallelic sites do not require
10974+ any normalisation, ``norm_f`` is only called if one of the two sites
10975+ under consideration is multiallelic. TODO: reference two-locus docs for
10976+ further information about normalisation. ``norm_f`` is a normalisation
10977+ function that must accept four parameters: ``X`` and ``n`` are the same
10978+ inputs that ``f`` accepts, along with ``nA`` and ``nB``, which hold the
10979+ count of ``A`` alleles and ``B`` alleles. For example, if ``A`` is
10980+ biallelic and ``B`` is triallelic, ``nA=2`` and ``nB=3``. ``f`` must
10981+ return a list of results with length ``result_dim``. The default
10982+ normalisation function is identical to ``total_norm`` shown in the
10983+ example below. ``hap_norm`` is required for normalising
10984+ :math:`r^2`. Both of these examples return a numpy array with length
10985+ equal to the number of ``result_dim``s.
1098210986
1098310987 .. code-block:: python
10988+
1098410989 def total_norm(X, n, nA, nB):
1098510990 [1 / (nA * nB)] * result_dim
1098610991
@@ -10990,6 +10995,7 @@ def hap_norm(X, n, nA, nB):
1099010995 A simple call (without specifying normalisation) would look like this
1099110996
1099210997 .. code-block::python
10998+
1099310999 ts.two_locus_count_stat([ts.samples()], D, 1, polarised=True)
1099411000
1099511001 :param list sample_sets: A list of lists of Node IDs, specifying the
@@ -11000,22 +11006,23 @@ def hap_norm(X, n, nA, nB):
1100011006 :param int result_dim: The length of ``f`` and ``norm_f``'s return value.
1100111007 :param norm_f: A function that takes four arguments - the first two are
1100211008 the same as ``f``, the second two are scalars representing the
11003- number of A and B alleles, respectively.
11009+ number of A and B alleles, respectively. If ``None``, then defaults
11010+ to the "total" normalization described above.
1100411011 :param bool polarised: Whether to leave the ancestral state out of
1100511012 computations: see :ref:`sec_stats` for more details.
1100611013 :param list sites: TODO: two-locus docs
1100711014 :param list positions: TODO: two-locus docs
1100811015 :param str mode: A string giving the "type" of the statistic to be
1100911016 computed (defaults to "site").
1101011017 :return: A ndarray with shape equal to (TODO: reference two-locus docs,
11011- no dimension dropping shape=(k, m, m) where k=num_sample_sets ,
11018+ no dimension dropping shape=(k, m, m) where k=result_dim ,
1101211019 m=num_sites or num_trees).
1101311020 """
1101411021 row_sites , col_sites = self .parse_sites (sites )
1101511022 row_positions , col_positions = self .parse_positions (positions )
1101611023 _ , sample_sets , sample_set_sizes = self .__convert_sample_sets (sample_sets )
1101711024 if norm_f is None :
11018- # produce the same number of dims as output dimensions with [val] * dim
11025+ # produce the same number of dims as result dimensions with [val] * dim
1101911026 def norm_f (X , n , nA , nB ):
1102011027 return [1 / (nA * nB )] * result_dim
1102111028
@@ -11032,7 +11039,7 @@ def norm_f(X, n, nA, nB):
1103211039 col_positions ,
1103311040 mode ,
1103411041 )
11035- # Orient the data so that the first dimension is the sample set so that
11042+ # Orient the data so that the first dimension is the result_dim so that
1103611043 # we get one LD matrix per result dimension
1103711044 return np .moveaxis (result , - 1 , 0 )
1103811045
0 commit comments