<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>How big tech really runs 40k A/B tests — the honest mechanics</title>
<link rel="stylesheet" href="framework.css">
<style>
  /* Page-accent — overrides framework.css fallback */
  :root{--page-accent:var(--blue);--page-accent-soft:var(--blue-soft)}
  .finding ul{margin:8px 0 0;padding-left:20px;color:#cfc9bb}
  .finding li{margin:5px 0}
  /* honest-counter box */
.breakdown{background:#fff;border:1px solid var(--line);border-radius:12px;padding:20px 24px;margin:14px 0;box-shadow:var(--shadow);border-left:5px solid var(--page-accent)}
  .breakdown h3{margin:0 0 10px;font-family:Georgia,serif;font-size:17px}
  .breakdown table{width:100%;border-collapse:collapse;font-size:13.5px}
  .breakdown td{padding:7px 10px;border-bottom:1px solid var(--line);vertical-align:top}
  .breakdown td:first-child{font-weight:700;width:45%}
  .breakdown td.n{font-family:Georgia,serif;font-size:15px;color:var(--page-accent);font-weight:700;text-align:right;width:80px}
  .breakdown tr:last-child td{border-bottom:none;font-weight:700;background:var(--page-accent-soft)}
  /* layers diagram */
.layers{background:#fff;border:1px solid var(--line);border-radius:12px;padding:22px;box-shadow:var(--shadow);margin:14px 0}
  .layers h3{margin:0 0 10px;font-family:Georgia,serif;font-size:17px}
  .layergrid{display:grid;grid-template-columns:130px repeat(3,1fr);gap:6px;margin:14px 0;font-size:12.5px}
  .lcell{padding:9px 8px;border-radius:5px;border:1px solid var(--line);background:#faf7f0;text-align:center;font-weight:600}
  .lcell.h{background:var(--ink);color:#f3efe6;border-color:var(--ink);font-size:11px;letter-spacing:.04em;text-transform:uppercase}
  .lcell.lh{background:#465065;color:#fff;border-color:#465065;text-align:left;padding:9px 10px;font-size:11.5px;letter-spacing:.04em;text-transform:uppercase}
  .lcell.t{background:#dde4f1;color:var(--page-accent)}
  .lcell.c{background:#f0eddf;color:#8a6d2e}
  .legline{font-size:12px;color:var(--ink-soft);margin-top:8px}
  .legline b{color:var(--ink)}
  /* user examples */
.userex{display:grid;grid-template-columns:1fr 1fr;gap:14px;margin:14px 0}
  .uex{background:#fff;border:1px solid var(--line);border-radius:10px;padding:14px 18px;border-top:5px solid var(--page-accent)}
  .uex h4{margin:0 0 6px;font-size:14px;font-family:Georgia,serif;color:var(--page-accent)}
  .uex ul{margin:6px 0;padding-left:18px;font-size:13px}
  .uex li{margin:3px 0}
  .uex code{font-family:"SF Mono",ui-monospace,Menlo,Consolas,monospace;font-size:12px;background:var(--page-accent-soft);padding:1px 6px;border-radius:4px}
  /* duration timeline */
.timeline{background:#fff;border:1px solid var(--line);border-radius:12px;padding:22px;box-shadow:var(--shadow);margin:14px 0}
  .timeline .row{display:grid;grid-template-columns:90px 70px 1fr;gap:12px;padding:10px 0;border-top:1px solid var(--line);align-items:center;font-size:13.5px}
  .timeline .row:first-of-type{border-top:none}
  .timeline .day{font-family:Georgia,serif;font-weight:700;color:var(--page-accent);font-size:14px}
  .timeline .pct{font-family:Georgia,serif;font-size:18px;color:var(--ink);font-weight:700;text-align:center;background:var(--page-accent-soft);border-radius:6px;padding:4px 0}
  .timeline .what{color:var(--ink-soft)}
  .timeline .what b{color:var(--ink)}
  /* significance diagram */
.sig{background:#fff;border:1px solid var(--line);border-radius:12px;padding:18px 22px;margin:14px 0;box-shadow:var(--shadow)}
  .sig table{width:100%;border-collapse:collapse;font-size:13.5px;margin-top:8px}
  .sig th,.sig td{padding:9px 11px;border-bottom:1px solid var(--line);text-align:left}
  .sig thead th{background:var(--ink);color:#f3efe6;font-size:11px;letter-spacing:.05em;text-transform:uppercase}
  .sig td.good{color:var(--teal);font-weight:700}
  .sig td.bad{color:var(--accent);font-weight:700}
  /* skeptic box */
.skeptic{background:#fbf6e7;border:1px dashed #d8c98f;border-left:5px solid var(--gold);border-radius:10px;padding:16px 20px;margin:18px 0;font-size:14px;color:#5b4708}
  .skeptic b{color:#7a5a08}
  .quote{background:#f6f4ee;border-left:3px solid var(--ink-soft);border-radius:6px;padding:10px 16px;margin:12px 0;font-style:italic;color:var(--ink-soft);font-size:14px}
  .quote .att{display:block;font-style:normal;font-size:12px;margin-top:6px}
  @media(max-width:680px){.userex{grid-template-columns:1fr}.layergrid{font-size:11px}.timeline .row{grid-template-columns:1fr;gap:4px}h1{font-size:28px}}
</style>
</head>
<body>
<nav class="sitenav">
<details>
<summary>📑 Jump to</summary>
<div class="navmenu">
<div class="navgrp"><h4>Start here</h4>
<a href="index.html"><b>← Home (goal &amp; map)</b></a>
<a href="impact-saas-companies.html">SaaS / B2B field study</a>
<a href="impact-consumer-companies.html">Consumer-tech field study</a>
<a href="methodologies-comparison.html"><b>All methods compared →</b></a>
<a class="cur" href="experiment-trustworthiness.html">How 40k tests actually work →</a>
<a href="jargon.html">Jargon (glossary)</a>
</div>
<div class="navgrp"><h4>Scoring &amp; Input modeling</h4>
<a href="rice-framework.html">RICE (Intercom)</a>
<a href="north-star-framework.html">North Star (Amplitude / Slack)</a>
</div>
<div class="navgrp"><h4>Goal-laddering / Define first</h4>
<a href="v2mom-framework.html">V2MOM (Salesforce)</a>
<a href="pyramid-of-clarity-framework.html">Pyramid of Clarity (Asana)</a>
<a href="pr-faq-framework.html">PR-FAQ / Working Backwards (Amazon)</a>
<a href="heart-framework.html">HEART (Google)</a>
<a href="dibb-framework.html">DIBB (Spotify)</a>
</div>
<div class="navgrp"><h4>Experimentation (SaaS)</h4>
<a href="microsoft-exp-framework.html">Microsoft ExP / CUPED</a>
<a href="linkedin-xlnt-framework.html">LinkedIn T-REX</a>
</div>
<div class="navgrp"><h4>Experimentation (Consumer)</h4>
<a href="netflix-experimentation.html">Netflix · ABlaze</a>
<a href="booking-experimentation.html">Booking.com</a>
<a href="airbnb-erf-framework.html">Airbnb ERF</a>
<a href="uber-xp-framework.html">Uber XP</a>
<a href="doordash-switchback-framework.html">DoorDash switchback</a>
<a href="lyft-experimentation.html">Lyft</a>
<a href="pinterest-ab-framework.html">Pinterest</a>
</div>
<div class="navgrp"><h4>AI labs</h4>
<a href="anthropic-pm-on-ai-exponential.html">Anthropic · PM on AI exponential</a>
<a href="google-customer-zero-2026.html">Google · "Customer zero" 2026</a>
</div>
<div class="navgrp"><h4>Written discipline</h4>
<a href="stripe-shaping-framework.html">Stripe shaping</a>
</div>
</div>
</details>
</nav>

<div class="wrap">
  <header class="masthead">
    <p class="kicker">Methods · Deep-dive · The skeptic's answer</p>
    <h1>How big tech actually runs 40,000 A/B tests per day — without drowning in noise <span class="srcyr">2010</span></h1>
    <p class="sub">A grounded answer to three reasonable challenges to the "experimentation works" story: <strong>what does 40k/day really count?</strong> <strong>How is interference between concurrent tests handled?</strong> <strong>How long does a real decision actually take?</strong></p>
    <p class="sub">No magic — the answers are public, well-cited, and older than most people realise. Foundational source: Google's <em>Overlapping Experiment Infrastructure</em> (Tang, Agarwal, O'Brien &amp; Meyer — <a class="j" href="jargon.html#kdd">KDD</a> 2010), with later work from Microsoft (Kohavi/Tang/Xu, 2013–2020) and LinkedIn (2020+).</p>
    <div class="goal"><span>Goal</span><br>Decide features by data-backed expected impact — choose by outcome, not by to-do list or opinion.</div>
  </header>

  <div class="eli">
    <div class="lbl">🎓 8th-grade version</div>
    When you hear that a company like LinkedIn "runs 40,000 experiments a day," three sensible doubts pop up. <b>(1) That can't be 40,000 totally separate tests, right?</b> Correct — most of them are different ramp stages or different country variants of the same idea. The honest number of distinct decisions in flight at any one time is in the hundreds to low thousands, which is still huge. <b>(2) Don't all those tests mess each other up?</b> They would, except Google solved this in 2007 (published 2010) by splitting the codebase into "layers" — different subsystems each get their own random coin-flip per user, so tests in different layers don't contaminate each other. <b>(3) Can you really decide overnight if a number went up?</b> No — the field requires at least 7 days to absorb weekday-vs-weekend differences. Most tests run 1–2 weeks. The whole discipline is built to prevent the "the number's up, ship it" mistake.
  </div>

  <nav class="toc">
    <a href="#headline">The honest answer</a>
    <a href="#count">What 40k counts</a>
    <a href="#interference">Interference / layers</a>
    <a href="#duration">Duration</a>
    <a href="#variance">Variance &amp; significance</a>
    <a href="#example">Worked 2-week example</a>
  </nav>

  <div class="finding" id="headline">
    <h2>The honest answer to all three concerns</h2>
    <p>Your scepticism is the right instinct — and the published mechanics happen to support it on each point:</p>
    <ul>
      <li><b>40k/day is real but it's a count of <em>variant evaluations</em></b>, not 40k independent A/B tests against the whole user base. It includes every ramp stage, every per-segment variant, and every metric computation. The underlying claim — "thousands of decisions per quarter resolved by experiment" — is real; the marketing-friendly headline number isn't a one-to-one count.</li>
      <li><b>Interference is real and the solution is older than most blogs</b>: Google's <em>Overlapping Experiment Infrastructure</em> (Tang et al., <b>KDD 2010</b>; deployed at Google since ~2007). Every user is independently randomised across <b>layers</b>. Tests in different layers can stack without contaminating each other; tests in the same layer are mutually exclusive. This is the only reason "thousands of concurrent tests" is possible.</li>
      <li><b>Nobody decides in a morning.</b> Microsoft (Kohavi/Tang/Xu, 2020) and the broader literature: <code>minimum 1 full week</code> to absorb day-of-week effects; <b>1–2 weeks typical</b>; three to four weeks or more if novelty/primacy effects are likely. A "tomorrow morning the number went up" decision is exactly what the field's discipline is built to prevent.</li>
    </ul>
  </div>

  <!-- SECTION 1: 40K BREAKDOWN -->
  <h2 class="sec" id="count">1. What "40,000 experiments per day" actually counts</h2>
  <p class="secsub">LinkedIn's T-REX engineering posts cite "more than 40,000 experiments per day on ~8,000 metrics." Real number, but it's a sum of several different things, not 40,000 independent A/B tests fighting for the same user base. The breakdown matters:</p>

  <div class="breakdown">
    <h3>What goes into LinkedIn's daily count (illustrative — the literal split isn't published, the categories are)</h3>
    <table>
      <tr><td>Each <b>ramp stage</b> of the same feature counted separately (1% → 5% → 25% → 50% → 100% = 5 entries)</td><td class="n">many</td></tr>
      <tr><td>Per-<b>segment</b> versions of the same change (per-country, per-device, per-locale) counted as distinct experiments</td><td class="n">many</td></tr>
      <tr><td><b>Multivariate</b> tests counted per arm (an A/B/C/D test = 4 entries)</td><td class="n">many</td></tr>
      <tr><td><b>Concurrent independent</b> A/B tests on different surfaces (genuine parallel decisions)</td><td class="n">many</td></tr>
      <tr><td><b>Long-running</b> always-on holdouts and quasi-experiments</td><td class="n">some</td></tr>
      <tr><td><b>Total daily experiment-evaluations</b> rolled up across all categories</td><td class="n">40k+</td></tr>
    </table>
  </div>
  <p style="font-size:13.5px;color:var(--ink-soft)">So the more honest reading: LinkedIn is running <em>hundreds-to-low-thousands of distinct product decisions</em> simultaneously, each producing many counted evaluations as it ramps, segments, and accumulates significance. That's still extraordinary — and the layered infrastructure below is why it works without chaos. It's not 40k unrelated tests treating users like a casino chip table.</p>
  <div class="src">Source: <a class="cite" href="https://engineering.linkedin.com/teams/data/analytics-platform-apps/data-applications/t-rex">LinkedIn — T-REX team page (40k/day, 8k metrics)</a> · <a class="cite" href="https://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin">XLNT platform paper</a>. The breakdown above is inferred from how published experimentation platforms compose their counts — LinkedIn doesn't publish the literal split.</div>

  <!-- SECTION 2: INTERFERENCE -->
  <h2 class="sec" id="interference">2. The interference problem — and the layered-randomization trick that solves it</h2>
  <p class="secsub">Your concern in your words: "if test A and test B both affect metric A, you can't say A caused the lift." Exactly right. This is a real, well-documented problem. The published solution is one of the most important ideas in modern experimentation: <strong>orthogonal <a class="j" href="jargon.html#layered-randomization">layered randomization</a></strong>.</p>

  <div class="quote">"The infrastructure had been used in Google since around 2007 … key concepts include <b>domains</b> (a segmentation of traffic), <b>layers</b> (corresponding to a subset of the system parameters), and <b>experiments</b> (a segmentation of traffic where zero or more system parameters can be given alternate values)."
    <span class="att">— Tang, Agarwal, O'Brien &amp; Meyer, <em>Overlapping Experiment Infrastructure: More, Better, Faster Experimentation</em>, KDD 2010 (pp. 17–26).</span>
  </div>

  <h3 style="font-family:Georgia,serif;font-size:18px;margin:18px 0 6px">The trick in one paragraph</h3>
  <p>Treat the codebase as a stack of <b>independent subsystems</b> (search-ranking, notifications, ads, homepage hero, onboarding, …). Each subsystem is a <b>layer</b>. Within a layer, only <em>one experiment</em> can touch a user at a time (mutually exclusive — true A/B integrity for that subsystem). <em>Across</em> layers, the user is randomised <b>independently</b> for each layer using a different hash, so the assignment in layer L1 carries no statistical relationship with the assignment in L2. Because the layers were defined to <em>not</em> share parameters, the tests can't contaminate each other.</p>

  <div class="layers">
    <h3>Concrete picture — three concurrent experiments, two real users</h3>
    <div class="layergrid">
      <div class="lcell h">Layer / subsystem</div>
      <div class="lcell h">Test running</div>
      <div class="lcell h">User X gets</div>
      <div class="lcell h">User Y gets</div>

      <div class="lcell lh">Ranking algorithm</div>
      <div class="lcell">Test A — new vs old ranker</div>
      <div class="lcell t">Treatment</div>
      <div class="lcell c">Control</div>

      <div class="lcell lh">Notifications</div>
      <div class="lcell">Test B — new send-time</div>
      <div class="lcell c">Control</div>
      <div class="lcell t">Treatment</div>

      <div class="lcell lh">Homepage hero</div>
      <div class="lcell">Test C — new image style</div>
      <div class="lcell t">Treatment</div>
      <div class="lcell t">Treatment</div>
    </div>
    <p class="legline">Each user's <b>layer assignments are statistically independent</b> — they're hashed with a per-layer salt, so being treatment in Layer 1 says nothing about being treatment in Layer 2. Test A's treatment vs control comparison is clean because <em>across</em> users, the other layers' assignments are balanced 50/50 in both A's treatment and A's control groups. Same for Test B and Test C. <b>This is why three tests can run on the same user without spoiling each other's reads.</b></p>
  </div>
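  <p>The per-layer salt idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Google's or LinkedIn's actual scheme: the layer names, the <code>layer:user_id</code> salt format, the 1,000 buckets and the 50/50 split are all hypothetical. The only thing taken from the text is "hash the user ID together with a per-layer salt so that assignments in different layers are unrelated."</p>

```python
import hashlib

# Hypothetical layer names; each layer's name doubles as its hash salt.
LAYERS = ["ranking", "notifications", "homepage_hero"]

def bucket(user_id: str, layer: str, n_buckets: int = 1000) -> int:
    """Map (user, layer) to a stable bucket in [0, n_buckets)."""
    digest = hashlib.sha256(f"{layer}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def assign(user_id: str, layer: str) -> str:
    """Assumed 50/50 split: the lower half of the buckets is treatment."""
    return "treatment" if bucket(user_id, layer) < 500 else "control"

def assignments(user_id: str) -> dict:
    """One independent coin-flip per layer for the same user."""
    return {layer: assign(user_id, layer) for layer in LAYERS}

if __name__ == "__main__":
    users = [f"user-{i}" for i in range(20000)]
    # Among users treated in the ranking layer, what share is also treated
    # in the notifications layer? Independence predicts roughly one half.
    treated = [u for u in users if assign(u, "ranking") == "treatment"]
    both = sum(assign(u, "notifications") == "treatment" for u in treated)
    print(f"ranking treatment share: {len(treated) / len(users):.3f}")
    print(f"of those, notifications treatment: {both / len(treated):.3f}")
```

  <p>Both printed shares should land close to 0.5, which is the balance property the legend above relies on. In a real platform several experiments inside one layer would each own a disjoint range of that layer's buckets, which is how mutual exclusivity within a layer would typically be enforced.</p>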

  <h3 style="font-family:Georgia,serif;font-size:18px;margin:18px 0 6px">"But what if two tests <em>do</em> touch the same metric?"</h3>
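  <p>A shared metric alone is not the problem: if the two effects simply add up, independent randomisation still gives each test a clean read, because the other test's lift lands equally in both arms. The trouble starts when the changes <em>interact</em>, so that one alters the size of the other's effect. Then they should not sit in separate layers, because the layer boundary was drawn wrong. Three real mitigations from published platforms:</p>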
  <ul style="margin:8px 0 14px;padding-left:22px">
    <li><b>Put them in the same layer</b> — mutually exclusive. One at a time. Cost: queueing; benefit: zero contamination. <em>(LaunchDarkly, Optimizely, AB Tasty all implement this as "mutually exclusive experiments" — standard third-party feature.)</em></li>
    <li><b>Run a full factorial</b> — A/B × C/D = four arms, measure interaction explicitly. Kohavi/Henne/Sommerfield (KDD 2007) put the equivalent point as: <em>"Strong interactions are rare in practice; awareness is usually enough. Pairwise statistical tests can auto-flag"</em> — i.e. if you suspect interaction, run multivariate, don't run parallel.</li>
    <li><b><a class="j" href="jargon.html#ghosting">Ghosting</a></b> (Booking.com's published practice) — log the counterfactual model score / UI state the user <em>would</em> have seen alongside the served version, so you can study interactions before committing to a live factorial.</li>
  </ul>
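  <p>For the full-factorial option, the quantity of interest is how far the combined lift departs from the sum of the two solo lifts. The sketch below shows that arithmetic only; the function name and the four conversion rates are made up for illustration, and a real readout would also need a confidence interval on the interaction term.</p>

```python
def interaction_effect(cell_means: dict) -> dict:
    """Effects from a 2x2 factorial; keys are (a_on, b_on) booleans."""
    y00 = cell_means[(False, False)]  # neither change
    y10 = cell_means[(True, False)]   # only change A
    y01 = cell_means[(False, True)]   # only change B
    y11 = cell_means[(True, True)]    # both changes
    effect_a_alone = y10 - y00
    effect_b_alone = y01 - y00
    # Interaction: how much the combined lift differs from the sum of the
    # two solo lifts. Zero means the changes are purely additive.
    interaction = (y11 - y00) - effect_a_alone - effect_b_alone
    return {
        "a_alone": effect_a_alone,
        "b_alone": effect_b_alone,
        "interaction": interaction,
    }

if __name__ == "__main__":
    # Made-up conversion rates (percent) for the four arms.
    means = {
        (False, False): 10.0,
        (True, False): 10.5,
        (False, True): 10.4,
        (True, True): 10.6,
    }
    print(interaction_effect(means))
```

  <p>In this invented example the solo lifts are about 0.5 and 0.4 points, but together they deliver only 0.6, so the interaction is about −0.3: the two changes partly cancel each other, which is exactly what a pair of parallel A/B tests would not reveal.</p>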

  <div class="skeptic">
    <b>The empirical-frequency answer to "but interactions must be everywhere":</b> Kohavi/Henne/Sommerfield (KDD 2007, verbatim): <em>"Strong interactions are rare in practice; awareness is usually enough."</em> Translated: most pairs of tests at scale don't interact because most subsystems are <em>actually</em> independent, and the discipline of layer assignment forces engineers to declare which subsystem they're touching before launching. The framework's discipline doesn't deny interactions — it makes them visible up front so the test is either gated or designed as multivariate.
  </div>
  <div class="src">Sources: <a class="cite" href="https://research.google.com/pubs/archive/36500.pdf">Tang et al. — Overlapping Experiment Infrastructure (PDF, KDD 2010)</a> · <a class="cite" href="https://research.google/pubs/overlapping-experiment-infrastructure-more-better-faster-experimentation/">Google Research listing</a> · Booking.com layered experimentation (<a class="cite" href="https://venue.cloud/news/insights/scaling-experiments-the-booking-com-way/">summary</a>) · <a class="cite" href="https://ai.stanford.edu/~ronnyk/2007GuideControlledExperiments.pdf">Kohavi, Henne &amp; Sommerfield, "Practical Guide to Controlled Experiments on the Web" (KDD 2007)</a> · Kohavi/Tang/Xu, <em>Trustworthy Online Controlled Experiments</em> (Cambridge, 2020) — Ch. 18 "Variance Estimation &amp; Improved Sensitivity" + Ch. 22 "Leakage and Interference between Variants."</div>

  <!-- SECTION 3: DURATION -->
  <h2 class="sec" id="duration">3. Duration — why no one decides "the next morning"</h2>
  <p class="secsub">Your point: an overnight read isn't a decision; you need a real measurement window to know whether the move is signal or noise. The literature agrees, with specific minimums.</p>

  <div class="breakdown">
    <h3>Published practice on test duration</h3>
    <table>
      <tr><td><b>Hard minimum</b> — full <code style="font-family:'SF Mono',ui-monospace,Menlo,Consolas,monospace;background:#f6f4ee;padding:1px 5px;border-radius:3px">7 days</code> so day-of-week effects are absorbed (weekend behaviour ≠ weekday)</td><td class="n">7d</td></tr>
      <tr><td><b>Typical at Microsoft / Bing</b> — most experiments run 1–2 weeks (Kohavi/Tang/Xu, 2020)</td><td class="n">7–14d</td></tr>
      <tr><td><b>If <a class="j" href="jargon.html#novelty-effect">novelty effect</a> suspected</b> — initial 1–2 week spike fades, so read at week 2–3 not week 1</td><td class="n">14–21d</td></tr>
      <tr><td><b>If <a class="j" href="jargon.html#primacy-effect">primacy effect</a> suspected</b> — habituated users need weeks to adapt; long-term holdout (Uber) or 4+ weeks</td><td class="n">28d+</td></tr>
      <tr><td><b>Power-determined</b> — duration must hit a pre-computed sample-size threshold; underpowered tests are not "shorter" — they're inconclusive</td><td class="n">N-driven</td></tr>
    </table>
  </div>
  <p style="font-size:13.5px;color:var(--ink-soft)">"The number is up after one day" is exactly the trap the field's training material is built to prevent. Recent academic work (Larsen et al., <em>Setting the duration of online A/B experiments</em>, arXiv 2024) shows the duration choice itself is a decision under uncertainty — pre-declared, not adjusted mid-flight.</p>
  <div class="src">Sources: Kohavi, Tang &amp; Xu, <em>Trustworthy Online Controlled Experiments</em> (Cambridge, 2020); <a class="cite" href="https://arxiv.org/html/2408.02830">Larsen et al., "Setting the duration of online A/B experiments" (arXiv:2408.02830, 2024)</a>; <a class="cite" href="https://ai.stanford.edu/~ronnyk/2007GuideControlledExperiments.pdf">Kohavi et al., "Practical Guide to Controlled Experiments on the Web" (2007)</a>.</div>

  <!-- SECTION 4: VARIANCE -->
  <h2 class="sec" id="variance">4. Variance, standard deviation &amp; "is it actually different?"</h2>
  <p class="secsub">"Up by 2%" might be inside the daily noise. The discipline is to ask: <em>is the observed lift larger than the metric's known variance at this sample size?</em></p>

  <p>For each metric the team computes a baseline <b>standard deviation</b> from pre-experiment data. The required sample size to detect a desired minimum effect (<a class="j" href="jargon.html#mde">MDE</a>) is roughly:</p>
  <p style="text-align:center;background:#f6f4ee;border:1px solid var(--line);border-radius:8px;padding:14px;font-family:'SF Mono',ui-monospace,Menlo,Consolas,monospace;font-size:14px;color:var(--ink)">N ≈ 16 × (σ ÷ MDE)<sup>2</sup>   <span style="color:var(--ink-soft);font-style:italic;font-size:12.5px;font-family:inherit">(rough rule of thumb: users <em>per variant</em> for a two-sample test at 80% power, α=0.05; σ and MDE in the same units)</span></p>
  <p>So smaller effects need quadratically more users: halving the MDE quadruples N. That is why "we'll just test it small" doesn't fly for subtle changes. The discipline of <strong>pre-declaring</strong> the MDE, the OEC (overall evaluation criterion, the primary success metric), and the duration before launching is what stops the "the number's up, let's ship" failure mode.</p>
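  <p>As a rough sketch of how the rule of thumb turns into a planned duration, the Python below computes users per variant and then rounds the run time up to whole weeks (the "multiples of a week" practice quoted from Kohavi et al. 2007). The function names, the standard deviation, the MDE values and the daily traffic figure are all invented for illustration.</p>

```python
import math

def users_per_variant(sigma: float, mde: float) -> int:
    """Rule of thumb N = 16 * (sigma / mde)^2, rounded up.

    sigma and mde must be in the same absolute units.
    """
    return math.ceil(16 * (sigma / mde) ** 2)

def duration_days(n_per_variant: int, daily_users_per_variant: int) -> int:
    """Days needed to reach the sample size, rounded up to whole weeks."""
    days = math.ceil(n_per_variant / daily_users_per_variant)
    return max(7, 7 * math.ceil(days / 7))

if __name__ == "__main__":
    sigma = 10.0          # assumed baseline standard deviation of the metric
    daily_users = 2000    # assumed new users entering each variant per day
    for mde in (0.5, 0.25):
        n = users_per_variant(sigma, mde)
        print(f"MDE={mde}: {n} users per variant, "
              f"{duration_days(n, daily_users)} days")
```

  <p>With these made-up inputs, halving the MDE from 0.5 to 0.25 raises the requirement from 6,400 to 25,600 users per variant, and the planned run goes from one week to two.</p>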

  <h3 style="font-family:Georgia,serif;font-size:18px;margin:18px 0 6px">Three published tricks that don't change duration much but make reads more honest</h3>
  <ul style="margin:8px 0 14px;padding-left:22px">
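    <li><b><a class="j" href="jargon.html#cuped">CUPED</a></b> (Microsoft, 2013) — use each user's pre-experiment behaviour as a covariate. In the published examples this cuts metric <em>variance</em> by roughly half (so confidence intervals narrow by about 30%), which gives the same precision from about half the sample size.</li>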
    <li><b><a class="j" href="jargon.html#sequential-testing">Sequential testing</a></b> (Netflix, Optimizely, others) — peek continuously but with adjusted significance thresholds (<a class="j" href="jargon.html#msprt">mSPRT</a>, group-sequential designs) so the false-positive rate stays controlled. Lets you stop early on clear winners or losers <em>without</em> the classic "<a class="j" href="jargon.html#p-hacking">p-hacking</a> by repeated peeking" problem.</li>
    <li><b>SRM check (<a class="j" href="jargon.html#srm">Sample Ratio Mismatch</a>)</b> — if a designed 50/50 split comes back as, say, 51% / 49% at a sample size where chance cannot explain the gap, the platform <b>refuses to read out</b>. A skewed split means the randomization or the logging was broken. (The "~70% of surprising results turn out to be bugs" figure widely repeated in this context comes from Kohavi's talks and commentary around the Trustworthy book, not from a single published paper; treat it as a cultural finding.)</li>
  </ul>
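  <p>To make the CUPED bullet concrete, here is a minimal sketch of the adjustment on simulated data. It follows the general form described by Deng et al. (subtract θ times the centred pre-experiment covariate, with θ = cov(x, y) / var(x)); the function name and the simulated numbers are assumptions, and a production implementation would estimate θ on pooled data from both arms and handle users with no pre-period history.</p>

```python
import random
from statistics import mean, pvariance

def cuped_adjust(y, x):
    """Return y adjusted by the pre-experiment covariate x (CUPED)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = mean((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    theta = cov_xy / pvariance(x)
    # Subtracting theta * (x - mean(x)) leaves the mean of y unchanged
    # but removes the part of its variance that x already explains.
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated users: in-experiment metric y correlates with pre-period x.
    x = [rng.gauss(10, 3) for _ in range(5000)]
    y = [0.8 * xi + rng.gauss(0, 1) for xi in x]
    y_adj = cuped_adjust(y, x)
    print(f"mean before/after: {mean(y):.3f} / {mean(y_adj):.3f}")
    print(f"variance before/after: {pvariance(y):.3f} / {pvariance(y_adj):.3f}")
```

  <p>The size of the variance reduction depends entirely on how strongly the pre-period covariate predicts the in-experiment metric; the simulated correlation here is deliberately high and should not be read as a typical result.</p>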

  <!-- SECTION 5: WORKED EXAMPLE -->
  <h2 class="sec" id="example">5. A concrete worked example — what a real 2-week experiment looks like</h2>
  <p class="secsub">Illustrative reconstruction of how a typical staged-rollout experiment plays out — built from Microsoft / Booking / Uber published patterns. Numbers are realistic but illustrative.</p>

  <div class="note" style="background:var(--teal-soft);border-left-color:var(--teal)"><b>Try it Monday morning (30 minutes).</b> Pick the most recent experiment your team called a "win." Audit it against the three skeptic questions on this page: <b>(1)</b> Did the ramp run for at least 7 days at the read window (not "we saw it go up on day 2")? <b>(2)</b> Was there any concurrent test in the same layer / subsystem that could have contaminated the result? <b>(3)</b> Did the confidence interval on the OEC actually exclude zero, or did the team eyeball the headline number? If the answer to any is "no," the win is less trustworthy than it looks — and that's a finding worth surfacing before the next decision rests on it.</div>

  <div class="timeline">
    <div class="row"><div class="day">Day 1</div><div class="pct">1%</div><div class="what"><b>Smoke test.</b> Does anything crash? Watch error rates, latency. Statistically inert — too few users for any meaningful read.</div></div>
    <div class="row"><div class="day">Days 2–3</div><div class="pct">5%</div><div class="what"><b>Guardrails-only.</b> Confirm support volume, latency, crash rate haven't degraded. Auto-rollback wired up.</div></div>
    <div class="row"><div class="day">Days 4–7</div><div class="pct">25%</div><div class="what"><b>Power building.</b> Sample accumulating; confidence interval still wide; no decision yet.</div></div>
    <div class="row"><div class="day">Days 8–14</div><div class="pct">50%</div><div class="what"><b>The real read window.</b> Full week at half traffic so all weekdays + a weekend are covered. SRM check passes; OEC and guardrails read.</div></div>
    <div class="row"><div class="day">Day 15</div><div class="pct" style="background:var(--accent-soft);color:var(--accent)">decision</div><div class="what"><b>Read results.</b> Ship to 100%, kill, or extend if inconclusive. <em>Not</em> a morning decision — two weeks of accumulated signal.</div></div>
  </div>

  <div class="sig">
    <h3 style="margin:0 0 6px;font-family:Georgia,serif;font-size:17px">What the Day-15 readout actually looks like</h3>
    <table>
      <thead><tr><th>Metric</th><th>Treatment vs Control</th><th>95% Confidence Interval</th><th>Verdict</th></tr></thead>
      <tbody>
        <tr><td>OEC — primary success metric</td><td>+1.8%</td><td>[+0.6%, +3.0%]</td><td class="good">Significant ✓</td></tr>
        <tr><td>Guardrail — latency (<a class="j" href="jargon.html#p95">p95</a>)</td><td>+2 ms</td><td>[0, +5 ms]</td><td class="good">Within tolerance ✓</td></tr>
        <tr><td>Guardrail — support tickets</td><td>−0.4%</td><td>[−1.1%, +0.3%]</td><td class="good">Not worse ✓</td></tr>
        <tr><td>Sanity — sample ratio</td><td>50.04% / 49.96%</td><td>—</td><td class="good">No SRM ✓</td></tr>
      </tbody>
    </table>
    <p style="font-size:13px;color:var(--ink-soft);margin:10px 0 0">Because the 95% CI on the OEC does not include 0 (the lower bound is +0.6%), we say the lift is statistically significant. Guardrails held. SRM clean. <b>→ Decision: ship to 100%.</b> If the OEC's CI had been <code style="font-family:'SF Mono',ui-monospace,Menlo,Consolas,monospace;background:#fbeae4;padding:1px 5px;border-radius:3px">[−0.2%, +1.4%]</code>, the lift would be <em>inconclusive</em> (CI includes 0) — you'd extend the test, not ship.</p>
  </div>
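  <p>The readout logic above can be sketched as a small decision function. This is an illustration under assumptions, not any platform's actual rule: the user counts behind the 50.04% / 49.96% split are invented (the table shows only percentages), the SRM threshold of 0.001 is an assumed setting, guardrail checks are left out, and the OEC is assumed to be a "higher is better" metric.</p>

```python
import math

def srm_p_value(n_treatment: int, n_control: int,
                expected_share: float = 0.5) -> float:
    """Chi-squared goodness-of-fit test (1 degree of freedom) on the split."""
    total = n_treatment + n_control
    exp_t = total * expected_share
    exp_c = total - exp_t
    chi2 = ((n_treatment - exp_t) ** 2 / exp_t
            + (n_control - exp_c) ** 2 / exp_c)
    # For 1 degree of freedom the chi-squared tail equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def verdict(ci_low: float, ci_high: float, srm_p: float,
            srm_alpha: float = 0.001) -> str:
    """Check the split first, then read the OEC confidence interval."""
    if srm_p < srm_alpha:
        return "invalid: sample ratio mismatch"
    if ci_low > 0:
        return "ship"
    if ci_high < 0:
        return "kill"
    return "inconclusive: extend"

if __name__ == "__main__":
    # Assumed counts that reproduce the 50.04% / 49.96% split in the table.
    p = srm_p_value(500_400, 499_600)
    print(f"SRM p-value: {p:.3f}")
    print(verdict(0.6, 3.0, p))     # CI from the readout table
    print(verdict(-0.2, 1.4, p))    # the alternative CI discussed above
    # The same total with a 51% / 49% split fails the SRM check outright.
    print(verdict(0.6, 3.0, srm_p_value(510_000, 490_000)))
```

  <p>Putting the SRM test first mirrors the order of the discipline: a broken split invalidates the read no matter how good the confidence interval looks.</p>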

  <div class="note" style="background:var(--accent-soft);border-left-color:var(--accent)"><b>The most important reading skill on this page.</b> Look at the Day-15 readout table. The OEC moved +1.8%. Most teams would call this a win and move on. But the load-bearing column is the <b>95% confidence interval</b>: <code>[+0.6%, +3.0%]</code>. The lower bound is +0.6% — barely above zero. If we'd hit "ship" the moment the headline showed positive, we'd be acting on a number whose <em>real</em> effect could plausibly be as small as 0.6%. The discipline isn't reading the central estimate; it's checking whether the <em>entire confidence interval</em> sits where you'd ship. The CI is the honest signal; the <a class="j" href="jargon.html#point-estimate">point estimate</a> is the marketing pitch. A team that ships off the point estimate alone is shipping noise as often as signal.</div>

  <div class="note"><b>Why one week is the floor and two weeks the safer default.</b> Day-of-week effects alone require a full 7 days. Adding a second week is the cheapest insurance against a reversal after Day 8 (novelty effect tailing off, marketing campaign ending, weekend traffic mix shifting). Kohavi et al. recommend 1–2 weeks as the default at Microsoft; longer for novelty-prone surfaces.</div>

  <footer>
    Companion to: <a href="microsoft-exp-framework.html">← Microsoft ExP</a> · <a href="linkedin-xlnt-framework.html">LinkedIn T-REX</a> · <a href="booking-experimentation.html">Booking</a> · <a href="netflix-experimentation.html">Netflix</a> · <a href="methodologies-comparison.html">All methods compared</a><br>
    <b>Grounded in</b> <a href="https://research.google.com/pubs/archive/36500.pdf">Tang, Agarwal, O'Brien &amp; Meyer — "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation" (KDD 2010)</a>; Kohavi, Tang &amp; Xu, <em>Trustworthy Online Controlled Experiments</em> (Cambridge, 2020); <a href="https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf">Deng, Xu, Kohavi &amp; Walker (2013) — CUPED</a>; <a href="https://ai.stanford.edu/~ronnyk/2007GuideControlledExperiments.pdf">Kohavi, Henne &amp; Sommerfield — "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO" (KDD 2007)</a>; <a href="https://hbr.org/2020/03/building-a-culture-of-experimentation">Thomke (HBR March–April 2020) — Booking culture</a>; <a href="https://engineering.linkedin.com/teams/data/analytics-platform-apps/data-applications/t-rex">LinkedIn T-REX team page</a>; <a href="https://arxiv.org/html/2408.02830">Larsen et al. (2024) — A/B duration</a>. <b>Verbatim from sources:</b> the three key concepts (Domain / Layer / Experiment) and "The overlapping experiment infrastructure was first deployed in March 2007" (Tang et al. KDD 2010); CUPED's 45/52/49% variance-reduction figures (Deng et al. 2013); "Strong interactions are rare in practice; awareness is usually enough" + "Run experiments at least a week or two, in multiples of a week, to capture day-of-week effects" (Kohavi/Henne/Sommerfield KDD 2007); 40k/day on ~8k metrics figure (LinkedIn T-REX). <b>Inferred (not literally published):</b> the specific category breakdown of LinkedIn's 40k count — the categories themselves are how published platforms compose their counts, but LinkedIn doesn't publish the literal split. <b>Talks-derived (not from a single published paper):</b> the "~70% of surprising results turn out to be bugs" figure — repeated as cultural finding rather than hard statistic. 
<b>Added by us, not in the sources:</b> the Day-15 worked-example readout numbers (illustrative — the format and discipline are what's verifiable), the in-page glossary framing, and the "Try it Monday" audit exercise.<br>
    <em>Note: a 2026-05-26 source-verification pass confirmed the Tang 2010 verbatim claims (Domain / Layer / Experiment definitions + March 2007 first-deployment date), the CUPED 2013 variance-reduction numbers, and the Kohavi 2007 quotes on interactions and day-of-week minimums. The interaction-quote attribution in §2 was tightened to the actual KDD 2007 wording.</em>
  </footer>
</div>
</body>
</html>