<h1>Bayesian vs Classical Inference</h1>
<li><strong>Bayesian.</strong> You set <code>posterior</code> to be a weighted average of <code>outcome</code> and <code>prior</code>. The prior represents your best estimate of the effect before the experiment ran. The position of the posterior between those two points depends on the relative tightness of the two distributions: if the confidence intervals from your experiment are tight relative to the uncertainty in your prior, the posterior will be closer to the outcome; if instead the confidence intervals are wide relative to your prior, the posterior will end up closer to your prior.</li>
</ol>
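<p>As a minimal sketch of the Bayesian case (function and variable names are my own, not from the text), here is the normal–normal version of that weighted average, where the weight on the outcome is the relative precision of the experiment:</p>

```python
import numpy as np

def normal_posterior(prior_mean, prior_sd, outcome, outcome_se):
    """Combine a normal prior with a normal experiment outcome.

    The posterior mean is a precision-weighted average of the prior mean
    and the observed outcome: tight experimental confidence intervals pull
    it toward the outcome, a tight prior pulls it toward the prior.
    """
    w = prior_sd**2 / (prior_sd**2 + outcome_se**2)   # weight on the outcome
    post_mean = w * outcome + (1 - w) * prior_mean
    post_sd = np.sqrt(w) * outcome_se                 # = 1/sqrt(1/prior_sd**2 + 1/outcome_se**2)
    return post_mean, post_sd
```

With a precise experiment (<code>outcome_se</code> small relative to <code>prior_sd</code>) the posterior sits essentially at the outcome; with a noisy experiment it barely moves from the prior.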
<p>Graphically we can show the three distributions:</p>
<p>Suppose our prior was a spike at zero and otherwise uniform. This prior will cause Bayesian inference to behave similarly to classical inference: when the outcome is small the posterior will be heavily influenced by the spike, and so will shrink to be very near zero. As the outcome becomes larger, at some point it will escape the gravity of the central spike, and we’ll have <code>posterior~=outcome</code>.</p>
<p><strong>The point:</strong> the two graphs at the bottom of the figure are similar: i.e., using a “stat-sig rule” is a not-too-bad approximation of Bayesian inference when you have a fat-tailed prior (and in most cases your prior should be fat-tailed).</p>
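<p>The spike-at-zero behaviour can be sketched numerically. As a tractable stand-in for the "spike plus uniform" prior I use a spike plus a wide normal slab (an assumption of mine, not the text's exact prior; all names are hypothetical):</p>

```python
import math

def spike_slab_posterior_mean(outcome, se, spike_prob=0.5, slab_sd=5.0):
    """Posterior mean under a spike-at-zero prior plus a wide normal slab.

    Small outcomes are dominated by the spike and shrink toward zero;
    large outcomes escape it and are barely shrunk at all.
    """
    def norm_pdf(x, sd):
        return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

    # Marginal likelihood of the observed outcome under each prior component.
    like_spike = norm_pdf(outcome, se)                      # true effect exactly 0
    like_slab = norm_pdf(outcome, math.hypot(slab_sd, se))  # effect ~ N(0, slab_sd^2)
    p_slab = (1 - spike_prob) * like_slab / (
        spike_prob * like_spike + (1 - spike_prob) * like_slab)
    shrink = slab_sd**2 / (slab_sd**2 + se**2)              # within-slab shrinkage
    return p_slab * shrink * outcome
```

A small outcome (say 0.5 standard errors) gets pulled almost entirely to zero, while an outcome of four standard errors ends up close to its face value — the "escape the gravity" behaviour described above.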
</div></div><p><strong>A single metric.</strong> We observe <spanclass="math inline">\(\hat{t}\)</span> which is the true treatment effect plus noise, <spanclass="math inline">\(\hat{t}=t+e\)</span>. Then for any given <spanclass="math inline">\(t\)</span> our posterior probability will be:</p>
<p><strong>A single metric.</strong> We observe <spanclass="math inline">\(\hat{t}\)</span> which is the true treatment effect plus noise, <spanclass="math inline">\(\hat{t}=t+e\)</span>. Then for any given <spanclass="math inline">\(t\)</span> our posterior probability will be:</p>
<div class="no-row-height column-margin column-container"><div id="fn1"><p><sup>1</sup> Throughout we assume that the treatments change only the mean, not the variance, of outcomes.</p></div></div><p>The variances and covariances of <span class="math inline">\(t_1\)</span> and <span class="math inline">\(t_2\)</span> represent the experimenter’s priors, and so are often difficult to quantify. If we are willing to identify priors with some set of previously-run experiments, i.e. an “empirical-Bayes” technique, we can recover them from the data using this relationship between covariance matrices:</p>
<p>where <span class="math inline">\(\Sigma_u\)</span> is the unit-level covariance matrix. The following graph illustrates a case with negatively correlated treatment effects, positively correlated noise, and uncorrelated outcomes.</p>
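<p>A sketch of the empirical-Bayes recovery step, assuming the relationship takes the form <span class="math inline">\(\mathrm{Cov}(\hat{t})=\Sigma_t+\Sigma_e\)</span> — the observed cross-experiment covariance of estimates equals the prior covariance of true effects plus the sampling-noise covariance (function names are hypothetical):</p>

```python
import numpy as np

def empirical_bayes_prior_cov(t_hats, noise_covs):
    """Estimate the prior covariance of true effects from past experiments.

    Assumes Cov(t_hat) = Sigma_t + Sigma_e, so the prior covariance is
    recoverable by subtracting the average noise covariance from the
    observed covariance of the estimates.
    """
    t_hats = np.asarray(t_hats)               # shape (n_experiments, n_metrics)
    observed = np.cov(t_hats, rowvar=False)   # cross-experiment covariance
    mean_noise = np.mean(noise_covs, axis=0)  # average Sigma_e across experiments
    return observed - mean_noise
```

In practice the subtraction can produce a matrix that is not positive semi-definite in small samples, in which case it should be projected back onto the PSD cone before use as a prior.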
<p><strong>Many metrics.</strong> If we assume that everything has a normal distribution, we have a crisp expression for how the posterior expectations depend on the observed outcomes. For an arbitrary number of outcomes we can write this as:</p>
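<p>Under that joint-normality assumption, the posterior expectation is the matrix analogue of scalar shrinkage, <span class="math inline">\(\mathbb{E}[t\mid\hat{t}]=\mu_t+\Sigma_t(\Sigma_t+\Sigma_e)^{-1}(\hat{t}-\mu_t)\)</span>. A minimal sketch (names are my own):</p>

```python
import numpy as np

def multivariate_posterior_mean(t_hat, prior_mean, prior_cov, noise_cov):
    """Posterior expectation of true effects given noisy estimates,
    assuming everything is jointly normal."""
    # Shrinkage matrix: identity when noise is negligible, zero when
    # noise swamps the prior.
    shrink = prior_cov @ np.linalg.inv(prior_cov + noise_cov)
    return prior_mean + shrink @ (np.asarray(t_hat) - prior_mean)
```

With diagonal covariances this reduces to independent per-metric shrinkage; off-diagonal prior terms let a clean result on one metric inform the posterior for a correlated one.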
<p>See survey of methods in <span class="citation" data-cites="montiel2019">Azevedo et al. (<a href="#ref-montiel2019" role="doc-biblioref">2019</a>)</span>. They also document the fat-tailed distribution of effect-sizes.</p>
<li><p><strong>James-Stein.</strong> With a Normal prior, <span class="math inline">\(t\sim N(\mu_t,\sigma_t^2)\)</span>, we get: <span class="math display">\[\mathbb{E}[t|y]=\mu_t+\underbrace{\frac{\sigma_t^2}{\sigma_t^2+\sigma_e^2}}_{\substack{\text{shrinkage}\\\text{factor}}}(y-\mu_t).\]</span> We can also use the share of experiments that are significant, <span class="math inline">\(p\)</span>: <span class="math display">\[\begin{aligned}
<div class="no-row-height column-margin column-container"><div id="fn7"><p><sup>7</sup> Precisely: if the distribution of effect-sizes is Normal with zero mean, then having a statistically-significant effect in 50% of your experiments implies a shrinkage rate of just 10%.</p></div></div><section id="comparing-launch-rules" class="level2 page-columns page-full">
<figcaption>Ship if either metric is stat-sig positive and neither is stat-sig negative.</figcaption>
</figure>
</div>
</div></div></div>
<p>Suppose we have two metrics, 1 and 2, and we care about them equally: <span class="math display">\[U(t_1,t_2)=t_1+t_2.\]</span></p>
<p>But we only observe noisy estimates <span class="math inline">\(\hat{t}_1,\hat{t}_2\)</span>.</p>
<p>A stat-sig shipping rule (ship if either metric is stat-sig positive and neither is stat-sig negative) has some strange consequences: it will recommend shipping changes even with <em>negative</em> face-value utility (<span class="math inline">\(U(\hat{t}_1,\hat{t}_2)<0\)</span>), whenever the negative outcome falls on the relatively noisier metric. This still holds if we evaluate utility with shrunk estimates and the two metrics are shrunk in equal proportion; it does not hold if the noisier metric is shrunk more.</p>
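<p>A concrete instance of that strange consequence (the numbers and function name are illustrative, not from the text): a precisely-measured positive metric plus a noisy, larger negative one passes the stat-sig rule despite negative face-value utility.</p>

```python
def statsig_ship(t1, se1, t2, se2, z=1.96):
    """Ship if either metric is stat-sig positive and neither is stat-sig negative."""
    z1, z2 = t1 / se1, t2 / se2
    some_positive = z1 > z or z2 > z
    none_negative = z1 > -z and z2 > -z
    return some_positive and none_negative

# Metric 1 is precisely measured and positive; metric 2 is noisy and negative.
t1, se1 = 2.0, 0.5    # z = 4.0  -> stat-sig positive
t2, se2 = -3.0, 2.0   # z = -1.5 -> not stat-sig negative
statsig_ship(t1, se1, t2, se2)   # rule says ship, yet t1 + t2 = -1 < 0
```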
<p>Kohavi, Tang &amp; Xu (2020) <em>Trustworthy Online Controlled Experiments</em> recommends a stat-sig shipping rule (p105): “(1) If no metrics are positive-significant then do not ship; (2) if some are positive-significant and none are negative-significant then ship; (3) if some are positive-significant and some are negative-significant then decide based on the tradeoffs.” I think this is bad advice: the statistical significance of an estimate is only loosely related to the informativeness of that estimate. The decision should be made based on your best estimates of the impact on the overall goal metrics.</p>
<figcaption>Ship if sum stat-sig positive. I have drawn it assuming that <span class="math inline">\(cov(\hat{t}_1,\hat{t}_2)=0\)</span>. With a positive covariance the threshold would be higher.</figcaption>