fix #527, add calcLD_D() and calcLD_Rsquared()

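For reviewers who want a quick numerical cross-check of the documented behaviour, here is a minimal Python sketch of the two statistics. It is not the Eidos implementation shipped in this commit (view that with functionSource()). It assumes the standard Hill and Robertson definitions D = p12 - p1*p2 and r = D / sqrt(p1(1-p1)p2(1-p2)). The helper names `ld_d` and `ld_r`, the 0/1 presence vectors, and the NaN result for monomorphic sites are all assumptions made for illustration.

```python
from math import sqrt, isnan

def ld_d(has1, has2):
    """LD coefficient D from two equal-length 0/1 presence vectors (one entry per haplosome)."""
    n = len(has1)
    p1 = sum(1 for a in has1 if a) / n                         # frequency of mutation 1
    p2 = sum(1 for b in has2 if b) / n                         # frequency of mutation 2
    p12 = sum(1 for a, b in zip(has1, has2) if a and b) / n    # frequency of carrying both
    return p12 - p1 * p2

def ld_r(has1, has2, squared=True):
    """Correlation r (or r^2 when squared=True) between two mutations."""
    n = len(has1)
    p1 = sum(1 for a in has1 if a) / n
    p2 = sum(1 for b in has2 if b) / n
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    if denom == 0:
        # Assumption for this sketch: r is undefined when either site is monomorphic
        return float("nan")
    r = ld_d(has1, has2) / sqrt(denom)
    return r * r if squared else r

# Four haplosomes; mutation A is carried by the first two
mut_a = [1, 1, 0, 0]
together = [1, 1, 0, 0]      # always co-occurs with A
apart = [0, 0, 1, 1]         # never co-occurs with A
unlinked = [1, 0, 1, 0]      # co-occurs exactly as often as chance predicts

print(ld_d(mut_a, together), ld_r(mut_a, together, squared=False))
print(ld_d(mut_a, apart), ld_r(mut_a, apart, squared=False))
print(ld_d(mut_a, unlinked), ld_r(mut_a, unlinked))
```

The sign of D and of the unsquared r agree by construction, which matches the documented meaning of squared=F.
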
bhaller · bhaller · commit 043dd54551a2 · 2025-09-10T10:17:45.000-04:00
diff --git a/QtSLiM/help/SLiMHelpFunctions.html b/QtSLiM/help/SLiMHelpFunctions.html
@@ -5,7 +5,7 @@
   <meta http-equiv="Content-Style-Type" content="text/css">
   <title></title>
   <meta name="Generator" content="Cocoa HTML Writer">
-  <meta name="CocoaVersion" content="2299.77">
+  <meta name="CocoaVersion" content="2487.7">
   <style type="text/css">
     p.p1 {margin: 18.0px 0.0px 3.0px 0.0px; font: 11.0px Optima}
     p.p2 {margin: 9.0px 0.0px 3.0px 36.0px; text-indent: -22.3px; font: 9.0px Menlo}
@@ -221,6 +221,14 @@
 <p class="p11"><i>B</i> = sum(<i>qs</i>) − sum(<i>q</i><span class="s12"><sup>2</sup></span><i>s</i>) − 2sum(<i>q</i>(1−<i>q</i>)<i>sh</i>)</p>
 <p class="p3">where <i>q</i> is the frequency of a given deleterious allele, <i>s</i> is the absolute value of the selection coefficient, and <i>h</i> is its dominance coefficient.<span class="Apple-converted-space">  </span>Note that the implementation, viewable with <span class="s3">functionSource()</span>, sets a maximum |<i>s</i>| of <span class="s3">1.0</span> (i.e., a lethal allele); |<i>s</i>| can sometimes be greater than <span class="s3">1.0</span> when <i>s</i> is drawn from a distribution, but in practice an allele with <i>s</i> &lt; <span class="s3">-1.0</span> has the same lethal effect as when <i>s</i> = <span class="s3">-1.0</span>.<span class="Apple-converted-space">  </span>Also note that this implementation will not work when the model changes the dominance coefficients of mutations using <span class="s3">mutationEffect()</span> callbacks, since it relies on the <span class="s3">dominanceCoeff</span> property of <span class="s3">MutationType</span>. Finally, note that, to estimate the diploid number of lethal equivalents (2<i>B</i>), the result from this function can simply be multiplied by two.</p>
 <p class="p3">This function was contributed by Chris Kyriazis; thanks, Chris!</p>
+<p class="p4">(float)calcLD_D(object&lt;Mutation&gt;$ mut1, [No&lt;Mutation&gt; mut2 = NULL], [No&lt;Haplosome&gt; haplosomes = NULL])</p>
+<p class="p3">Calculates the linkage disequilibrium (LD) coefficient <i>D</i> between a focal mutation <span class="s3">mut1</span> and one or more mutations in <span class="s3">mut2</span>, evaluated across a set of haplosomes given by <span class="s3">haplosomes</span>.<span class="Apple-converted-space">  </span>The result is a <span class="s3">float</span> vector that matches the size and order of <span class="s3">mut2</span>.<span class="Apple-converted-space">  </span>The implementation of this function, viewable with <span class="s3">functionSource()</span>, calculates <i>D</i> as defined by Hill and Robertson (1968, p. 226).<span class="Apple-converted-space">  </span>The coefficient <i>D</i> is within [−<i>p</i>(1−<i>p</i>), <i>p</i>(1−<i>p</i>)], where <i>p</i> is the frequency of the more common mutation (that is, <i>p</i> = max(<i>f</i><span class="s4"><sub>1</sub></span>, <i>f</i><span class="s4"><sub>2</sub></span>) where <i>f</i><span class="s4"><sub>1</sub></span> and <i>f</i><span class="s4"><sub>2</sub></span> are the frequencies of the two mutations for which <i>D</i> is being calculated); for the normalized LD metric <i>r</i><span class="s4"><sup>2</sup></span>, which is within [0, 1], see <span class="s3">calcLD_Rsquared()</span>.<span class="Apple-converted-space">  </span>Departures of <i>D</i> from zero indicate LD; more specifically, <i>D</i> &gt; 0 indicates that the mutations occur together more often than expected by chance (positive linkage), whereas <i>D</i> &lt; 0 indicates they occur together less often than expected by chance (negative linkage).</p>
+<p class="p3">All mutations in <span class="s3">mut2</span> must be associated with the same chromosome as <span class="s3">mut1</span>; this function does not currently calculate LD between mutations associated with different chromosomes.<span class="Apple-converted-space">  </span>If <span class="s3">mut2</span> is <span class="s3">NULL</span> (the default), all such mutations in the population (including <span class="s3">mut1</span> itself) will be used.<span class="Apple-converted-space">  </span>Similarly, all haplosomes must be associated with the same chromosome as <span class="s3">mut1</span>.<span class="Apple-converted-space">  </span>If the <span class="s3">haplosomes</span> parameter is <span class="s3">NULL</span> (the default), all such haplosomes in the population will be used.</p>
+<p class="p3">This function was written by Vitor Sudbrack (currently affiliated with the University of Lausanne).</p>
+<p class="p4">(float)calcLD_Rsquared(object&lt;Mutation&gt;$ mut1, [No&lt;Mutation&gt; mut2 = NULL], [No&lt;Haplosome&gt; haplosomes = NULL], [logical$ squared = T])</p>
+<p class="p3">Calculates the linkage disequilibrium (LD) squared correlation coefficient <i>r</i><span class="s4"><sup>2</sup></span> between a focal mutation <span class="s3">mut1</span> and one or more mutations in <span class="s3">mut2</span>, evaluated across a set of haplosomes given by <span class="s3">haplosomes</span>.<span class="Apple-converted-space">  </span>The result is a <span class="s3">float</span> vector that matches the size and order of <span class="s3">mut2</span>.<span class="Apple-converted-space">  </span>The implementation of this function, viewable with <span class="s3">functionSource()</span>, calculates <i>r</i><span class="s4"><sup>2</sup></span> as defined by Hill and Robertson (1968, p. 227).<span class="Apple-converted-space">  </span>The squared correlation coefficient <i>r</i><span class="s4"><sup>2</sup></span> is a normalized measure of LD within [0, 1] (for the unnormalized LD coefficient <i>D</i>, see <span class="s3">calcLD_D()</span>).<span class="Apple-converted-space">  </span>When <i>r</i><span class="s4"><sup>2</sup></span> = 0, there is no statistical association between the mutations; they co-occur as expected by chance.<span class="Apple-converted-space">  </span>A value of <i>r</i><span class="s4"><sup>2</sup></span> = 1 indicates complete correlation: the mutations either always appear together or never appear together, depending on the sign of the underlying correlation coefficient <i>r</i>.<span class="Apple-converted-space">  </span>To obtain the raw (signed) <i>r</i> value instead of <i>r</i><span class="s4"><sup>2</sup></span>, you can pass <span class="s3">squared=F</span> instead of the default of <span class="s3">T</span>.</p>
+<p class="p3">All mutations in <span class="s3">mut2</span> must be associated with the same chromosome as <span class="s3">mut1</span>; this function does not currently calculate LD between mutations associated with different chromosomes.<span class="Apple-converted-space">  </span>If <span class="s3">mut2</span> is <span class="s3">NULL</span> (the default), all such mutations in the population (including <span class="s3">mut1</span> itself) will be used.<span class="Apple-converted-space">  </span>Similarly, all haplosomes must be associated with the same chromosome as <span class="s3">mut1</span>.<span class="Apple-converted-space">  </span>If the <span class="s3">haplosomes</span> parameter is <span class="s3">NULL</span> (the default), all such haplosomes in the population will be used.</p>
+<p class="p3">This function was written by Vitor Sudbrack (currently affiliated with the University of Lausanne).</p>
 <p class="p4">(float$)calcMeanFroh(object&lt;Individual&gt; individuals, [integer$ minimumLength = 1000000], [Niso&lt;Chromosome&gt;$ chromosome = NULL])</p>
 <p class="p3">Calculates the mean value of the <i>F</i><span class="s4"><sub>roh</sub></span> statistic across the individuals passed in <span class="s3">individuals</span>.<span class="Apple-converted-space">  </span>This statistic is a measure of individual autozygosity, likely resulting from inbreeding, and is calculated based upon “runs of homozygosity”, or ROH, in the genome of an individual.<span class="Apple-converted-space">  </span>Broadly speaking, <i>F</i><span class="s4"><sub>roh</sub></span> is the proportion of an individual’s genome that is spanned by ROH longer than a given threshold length.<span class="Apple-converted-space">  </span>However, it should be noted that there are many different ways of calculating <i>F</i><span class="s4"><sub>roh</sub></span>, producing different results.<span class="Apple-converted-space">  </span>For example, the threshold length might be a given constant, or might be determined statistically from the characteristics of the population.<span class="Apple-converted-space">  </span>Furthermore, some heterozygous sites might be discarded (to compensate for genotyping errors), a minimum SNP density might be required within a sliding window for an ROH to be diagnosed, and so forth – it can get quite complex, as seen in the software PLINK (Purcell et al., 2007) and GARLIC (Szpiech, Blant and Pemberton, 2017).<span class="Apple-converted-space">  </span>The method used by <span class="s3">calcMeanFroh()</span> is the simplest possible method, assessing ROH for each individual directly from the simulated mutations without filtering or modification, and applying a given constant threshold length.<span class="Apple-converted-space">  </span>If a more sophisticated <i>F</i><span class="s4"><sub>roh</sub></span> algorithm is desired, one could modify the implementation of <span class="s3">calcMeanFroh()</span>, which is viewable with <span class="s3">functionSource()</span>, or one could output VCF data from SLiM and analyze it with other tools, perhaps calling out from the running SLiM script with <span class="s3">system()</span>.</p>
 <p class="p3">The threshold ROH length used by <span class="s3">calcMeanFroh()</span> is supplied by the parameter <span class="s3">minimumLength</span>.<span class="Apple-converted-space">  </span>It defaults to <span class="s3">1e6</span>, or 1 Mbp, since that is a length commonly used in the literature, but can be adjusted as desired.</p>
diff --git a/SLiMgui/SLiMHelpFunctions.rtf b/SLiMgui/SLiMHelpFunctions.rtf
@@ -1,4 +1,4 @@
-{\rtf1\ansi\ansicpg1252\cocoartf2709
+{\rtf1\ansi\ansicpg1252\cocoartf2761
 \cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fswiss\fcharset0 Optima-Bold;\f1\fnil\fcharset0 Menlo-Regular;\f2\fswiss\fcharset0 Optima-Regular;
 \f3\fswiss\fcharset0 Optima-Italic;\f4\froman\fcharset0 TimesNewRomanPSMT;\f5\fnil\fcharset0 Menlo-Bold;
 \f6\fnil\fcharset0 AppleColorEmoji;\f7\froman\fcharset0 TimesNewRomanPS-ItalicMT;}
@@ -931,7 +931,8 @@ This function is written in Eidos, and its source code can be viewed with
 \f2\fs20 , in which case the ellipsis should supply a 
 \f1\fs18 string$
 \f2\fs20  Eidos script parameter.  The global symbol for the new mutation type is immediately available; the return value also provides the new object.\
-\expnd0\expndtw0\kerning0
+\pard\pardeftab543\li547\ri720\sb60\sa60\partightenfactor0
+\cf2 \expnd0\expndtw0\kerning0
 Note that by default in WF models, all mutations of a given mutation type will be converted into 
 \f1\fs18 Substitution
 \f2\fs20  objects when they reach fixation, for efficiency reasons.  If you need to disable this conversion, to keep mutations of a given type active in the simulation even after they have fixed, you can do so by setting the 
@@ -1155,9 +1156,8 @@ There is no way to disable sex once it has been enabled; if you don\'92t want to
 \f2\fs20  property of 
 \f1\fs18 Individual
 \f2\fs20 , for example), and manage the consequences of that in your script yourself, in terms of which individuals can mate with which, and exactly how the offspring is produced.\
-\pard\pardeftab397\li547\ri720\sb60\sa60\partightenfactor0
 
-\f0\b \cf2 The 
+\f0\b The 
 \f5\fs18 xDominanceCoeff
 \f0\fs20  parameter has been deprecated and removed.
 \f2\b0   In SLiM 5 and later, use the 
@@ -1327,6 +1327,7 @@ If
 \f1\fs18 initializeChromosome()
 \f2\fs20 , allowing a different mutation run count to be specified for each chromosome in multi-chromosome models.\expnd0\expndtw0\kerning0
 \
+\pard\pardeftab720\li547\ri720\sb60\sa60\partightenfactor0
 \cf0 \kerning1\expnd0\expndtw0 If 
 \f1\fs18 preventIncidentalSelfing
 \f2\fs20  is 
@@ -1406,6 +1407,7 @@ If
 \f2\fs20  for 
 \f1\fs18 checkInfiniteLoops
 \f2\fs20  to disable these checks.  There is no way to turn these checks on or off for individual loops; it is a global setting.\
+\pard\pardeftab720\li547\ri720\sb60\sa60\partightenfactor0
 \cf0 This function will likely be extended with further options in the future, added on to the end of the argument list.  Using named arguments with this call is recommended for readability.  Note that turning on optional features may increase the runtime and memory footprint of SLiM.\
 \pard\pardeftab720\li720\fi-446\ri720\sb180\sa60\partightenfactor0
 
@@ -2147,8 +2149,9 @@ The code for
 \f3\i F
 \f2\i0\fs13\fsmilli6667 \sub ST
 \fs20 \nosupersub  (but see below for further discussion and clarification):\
+\pard\pardeftab397\li547\ri720\sb60\sa60\partightenfactor0
 
-\f3\i F
+\f3\i \cf2 F
 \f2\i0\fs13\fsmilli6667 \sub ST
 \fs20 \expnd0\expndtw0\kerning0
 \nosupersub  = 1 - 
@@ -2160,7 +2163,8 @@ The code for
 \f2\i0\fs13\fsmilli6667 \kerning1\expnd0\expndtw0 \sub T
 \fs20 \expnd0\expndtw0\kerning0
 \nosupersub \
-\kerning1\expnd0\expndtw0 where 
+\pard\pardeftab397\li547\ri720\sb60\sa60\partightenfactor0
+\cf2 \kerning1\expnd0\expndtw0 where 
 \f3\i H
 \fs13\fsmilli6667 \sub S
 \f2\i0\fs20 \nosupersub  is the average heterozygosity in the two subpopulations, and 
@@ -2334,6 +2338,151 @@ The inbreeding load is a measure of the quantity of recessive deleterious variat
 This function was contributed by Chris Kyriazis; thanks, Chris!\
 \pard\pardeftab720\li720\fi-446\ri720\sb180\sa60\partightenfactor0
 
+\f1\fs18 \cf2 (float)calcLD_D(object<Mutation>$\'a0mut1, [No<Mutation>\'a0mut2\'a0=\'a0NULL], [No<Haplosome>\'a0haplosomes\'a0=\'a0NULL])\
+\pard\pardeftab397\li547\ri720\sb60\sa60\partightenfactor0
+
+\f2\fs20 \cf2 Calculates the linkage disequilibrium (LD) coefficient 
+\f3\i D
+\f2\i0  between a focal mutation 
+\f1\fs18 mut1
+\f2\fs20  and one or more mutations in 
+\f1\fs18 mut2
+\f2\fs20 , evaluated across a set of haplosomes given by 
+\f1\fs18 haplosomes
+\f2\fs20 .  The result is a 
+\f1\fs18 float
+\f2\fs20  vector that matches the size and order of 
+\f1\fs18 mut2
+\f2\fs20 .  The implementation of this function, viewable with 
+\f1\fs18 functionSource()
+\f2\fs20 , calculates 
+\f3\i D
+\f2\i0  as defined by Hill and Robertson (1968, p. 226).  The coefficient 
+\f3\i D
+\f2\i0  is within [\uc0\u8722 
+\f3\i p
+\f2\i0 (1\uc0\u8722 
+\f3\i p
+\f2\i0 ), 
+\f3\i p
+\f2\i0 (1\uc0\u8722 
+\f3\i p
+\f2\i0 )], where 
+\f3\i p
+\f2\i0  is the frequency of the more common mutation (that is, 
+\f3\i p
+\f2\i0 \'a0=\'a0max(
+\f3\i f
+\f2\i0\fs13\fsmilli6667 \sub 1
+\fs20 \nosupersub , 
+\f3\i f
+\f2\i0\fs13\fsmilli6667 \sub 2
+\fs20 \nosupersub ) where 
+\f3\i f
+\f2\i0\fs13\fsmilli6667 \sub 1
+\fs20 \nosupersub  and 
+\f3\i f
+\f2\i0\fs13\fsmilli6667 \sub 2
+\fs20 \nosupersub  are the frequencies of the two mutations for which 
+\f3\i D
+\f2\i0  is being calculated); for the normalized LD metric 
+\f3\i r
+\f2\i0\fs13\fsmilli6667 \super 2
+\fs20 \nosupersub , which is within [0, 1], see 
+\f1\fs18 calcLD_Rsquared()
+\f2\fs20 .  Departures of 
+\f3\i D
+\f2\i0  from zero indicate LD; more specifically, 
+\f3\i D
+\f2\i0 \'a0>\'a00 indicates that the mutations occur together more often than expected by chance (positive linkage), whereas 
+\f3\i D
+\f2\i0 \'a0<\'a00 indicates they occur together less often than expected by chance (negative linkage).\
+All mutations in 
+\f1\fs18 mut2
+\f2\fs20  must be associated with the same chromosome as 
+\f1\fs18 mut1
+\f2\fs20 ; this function does not currently calculate LD between mutations associated with different chromosomes.  If 
+\f1\fs18 mut2
+\f2\fs20  is 
+\f1\fs18 NULL
+\f2\fs20  (the default), all such mutations in the population (including 
+\f1\fs18 mut1
+\f2\fs20  itself) will be used.  Similarly, all haplosomes must be associated with the same chromosome as 
+\f1\fs18 mut1
+\f2\fs20 .  If the 
+\f1\fs18 haplosomes
+\f2\fs20  parameter is 
+\f1\fs18 NULL
+\f2\fs20  (the default), all such haplosomes in the population will be used.\
+This function was written by Vitor Sudbrack (currently affiliated with the University of Lausanne).\
+\pard\pardeftab720\li720\fi-446\ri720\sb180\sa60\partightenfactor0
+
+\f1\fs18 \cf2 (float)calcLD_Rsquared(object<Mutation>$\'a0mut1, [No<Mutation>\'a0mut2\'a0=\'a0NULL], [No<Haplosome>\'a0haplosomes\'a0=\'a0NULL], [logical$\'a0squared\'a0=\'a0T])\
+\pard\pardeftab397\li547\ri720\sb60\sa60\partightenfactor0
+
+\f2\fs20 \cf2 Calculates the linkage disequilibrium (LD) squared correlation coefficient 
+\f3\i r
+\f2\i0\fs13\fsmilli6667 \super 2
+\fs20 \nosupersub  between a focal mutation 
+\f1\fs18 mut1
+\f2\fs20  and one or more mutations in 
+\f1\fs18 mut2
+\f2\fs20 , evaluated across a set of haplosomes given by 
+\f1\fs18 haplosomes
+\f2\fs20 .  The result is a 
+\f1\fs18 float
+\f2\fs20  vector that matches the size and order of 
+\f1\fs18 mut2
+\f2\fs20 .  The implementation of this function, viewable with 
+\f1\fs18 functionSource()
+\f2\fs20 , calculates 
+\f3\i r
+\f2\i0\fs13\fsmilli6667 \super 2
+\fs20 \nosupersub  as defined by Hill and Robertson (1968, p. 227).  The squared correlation coefficient 
+\f3\i r
+\f2\i0\fs13\fsmilli6667 \super 2
+\fs20 \nosupersub  is a normalized measure of LD within [0, 1] (for the unnormalized LD coefficient 
+\f3\i D
+\f2\i0 , see 
+\f1\fs18 calcLD_D()
+\f2\fs20 ).  When 
+\f3\i r
+\f2\i0\fs13\fsmilli6667 \super 2
+\fs20 \nosupersub \'a0=\'a00, there is no statistical association between the mutations; they co-occur as expected by chance.  A value of 
+\f3\i r
+\f2\i0\fs13\fsmilli6667 \super 2
+\fs20 \nosupersub \'a0=\'a01 indicates complete correlation: the mutations either always appear together or never appear together, depending on the sign of the underlying correlation coefficient 
+\f3\i r
+\f2\i0 .  To obtain the raw (signed) 
+\f3\i r
+\f2\i0  value instead of 
+\f3\i r
+\f2\i0\fs13\fsmilli6667 \super 2
+\fs20 \nosupersub , you can pass 
+\f1\fs18 squared=F
+\f2\fs20  instead of the default of 
+\f1\fs18 T
+\f2\fs20 .\
+All mutations in 
+\f1\fs18 mut2
+\f2\fs20  must be associated with the same chromosome as 
+\f1\fs18 mut1
+\f2\fs20 ; this function does not currently calculate LD between mutations associated with different chromosomes.  If 
+\f1\fs18 mut2
+\f2\fs20  is 
+\f1\fs18 NULL
+\f2\fs20  (the default), all such mutations in the population (including 
+\f1\fs18 mut1
+\f2\fs20  itself) will be used.  Similarly, all haplosomes must be associated with the same chromosome as 
+\f1\fs18 mut1
+\f2\fs20 .  If the 
+\f1\fs18 haplosomes
+\f2\fs20  parameter is 
+\f1\fs18 NULL
+\f2\fs20  (the default), all such haplosomes in the population will be used.\
+This function was written by Vitor Sudbrack (currently affiliated with the University of Lausanne).\
+\pard\pardeftab720\li720\fi-446\ri720\sb180\sa60\partightenfactor0
+
 \f1\fs18 \cf2 (float$)calcMeanFroh(object<Individual>\'a0individuals, [integer$\'a0minimumLength\'a0=\'a01000000], [Niso<Chromosome>$\'a0chromosome\'a0=\'a0NULL])\
 \pard\pardeftab397\li547\ri720\sb60\sa60\partightenfactor0
 
diff --git a/VERSIONS b/VERSIONS
@@ -42,6 +42,7 @@ development head (in the master branch):
 	add matrix() method to Plot to add a matrix-based image to a plot: (void)matrix(numeric matrix, numeric$ x1, numeric$ x2, numeric$ y1, numeric$ y2, [logical$ flipped = F], [Nif valueRange = NULL], [Ns$ colors = NULL], [float$ alpha = 1.0])
 	fix #538, check for unique SLiM ids only for alive individuals when loading a tree sequence
 	fix #533, memory smasher in Haplosome method nucleotides(format="char"); besides causing random crashes, this also caused this method to return incorrect results
+	fix #527, add calcLD_D() and calcLD_Rsquared() popgen functions, contributed by Vitor Sudbrack
 
 
 version 5.0 (Eidos version 4.0):
diff --git a/core/slim_functions.cpp b/core/slim_functions.cpp