<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Booking.com — democratised experimentation at marketplace scale</title>
<link rel="stylesheet" href="framework.css">
<style>
  /* Page-accent — overrides framework.css fallback */
  :root{--page-accent:var(--blue);--page-accent-soft:var(--blue-soft)}
  /* three pillars */
.three{display:grid;grid-template-columns:repeat(3,1fr);gap:14px;margin:14px 0}
  .concept{background:#fff;border:1px solid var(--line);border-radius:12px;padding:18px 20px;box-shadow:var(--shadow);border-top:5px solid var(--page-accent)}
  .concept .ab{font-family:Georgia,serif;font-size:13px;color:var(--page-accent);font-weight:700;letter-spacing:.05em;text-transform:uppercase;margin-bottom:4px}
  .concept h3{margin:0 0 8px;font-size:17px;font-family:Georgia,serif}
  .concept p{margin:0;font-size:13.5px;color:var(--ink-soft);line-height:1.5}
  .concept p b{color:var(--ink)}
  /* the famous example */
.ex{background:#fff;border:1px dashed var(--page-accent);border-radius:12px;padding:18px 22px;margin:14px 0;font-size:14.5px}
  .ex h3{margin:0 0 8px;color:var(--page-accent);font-family:Georgia,serif;font-size:17px}
  /* stats */
.stats{display:flex;gap:10px;flex-wrap:wrap;margin:14px 0}
  .stat{flex:1 1 150px;background:#f6f4ee;border:1px solid var(--line);border-radius:8px;padding:12px 14px;text-align:center}
  .stat .n{font-family:Georgia,serif;font-size:22px;color:var(--page-accent);font-weight:700;line-height:1.1}
  .stat .l{font-size:11.5px;color:var(--ink-soft);margin-top:4px;line-height:1.35}
  .caveat{font-size:12.5px;color:#8a6d2e;background:#fbf6e7;border:1px dashed #d8c98f;border-radius:6px;padding:7px 11px;margin-top:10px}
  @media(max-width:680px){.three{grid-template-columns:1fr}}
</style>
</head>
<body>
<nav class="sitenav">
<details>
<summary>📑 Jump to</summary>
<div class="navmenu">
<div class="navgrp"><h4>Start here</h4>
<a href="index.html"><b>← Home (goal &amp; map)</b></a>
<a href="impact-saas-companies.html">SaaS / B2B field study</a>
<a href="impact-consumer-companies.html">Consumer-tech field study</a>
<a href="methodologies-comparison.html"><b>All methods compared →</b></a>
<a href="experiment-trustworthiness.html">How 40k tests actually work →</a>
<a href="jargon.html">Jargon (glossary)</a>
</div>
<div class="navgrp"><h4>Scoring &amp; Input modeling</h4>
<a href="rice-framework.html">RICE (Intercom)</a>
<a href="north-star-framework.html">North Star (Amplitude / Slack)</a>
</div>
<div class="navgrp"><h4>Goal-laddering / Define first</h4>
<a href="v2mom-framework.html">V2MOM (Salesforce)</a>
<a href="pyramid-of-clarity-framework.html">Pyramid of Clarity (Asana)</a>
<a href="pr-faq-framework.html">PR-FAQ / Working Backwards (Amazon)</a>
<a href="heart-framework.html">HEART (Google)</a>
<a href="dibb-framework.html">DIBB (Spotify)</a>
</div>
<div class="navgrp"><h4>Experimentation (SaaS)</h4>
<a href="microsoft-exp-framework.html">Microsoft ExP / CUPED</a>
<a href="linkedin-xlnt-framework.html">LinkedIn T-REX</a>
</div>
<div class="navgrp"><h4>Experimentation (Consumer)</h4>
<a href="netflix-experimentation.html">Netflix · ABlaze</a>
<a class="cur" href="booking-experimentation.html">Booking.com</a>
<a href="airbnb-erf-framework.html">Airbnb ERF</a>
<a href="uber-xp-framework.html">Uber XP</a>
<a href="doordash-switchback-framework.html">DoorDash switchback</a>
<a href="lyft-experimentation.html">Lyft</a>
<a href="pinterest-ab-framework.html">Pinterest</a>
</div>
<div class="navgrp"><h4>AI labs</h4>
<a href="anthropic-pm-on-ai-exponential.html">Anthropic · PM on AI exponential</a>
<a href="google-customer-zero-2026.html">Google · "Customer zero" 2026</a>
</div>
<div class="navgrp"><h4>Written discipline</h4>
<a href="stripe-shaping-framework.html">Stripe shaping</a>
</div>
</div>
</details>
</nav>

<div class="wrap">
  <header class="masthead">
    <p class="kicker">Methods · Deep-dive · Experimentation</p>
    <h1>Booking.com — democratised experimentation as company culture <span class="srcyr">2017</span></h1>
    <p class="sub">Booking's distinctive contribution isn't a clever technique — it's a <strong>culture</strong>: any employee can launch an experiment on millions of users without management sign-off.</p>
    <p class="sub">The platform's job is to enforce trust (guardrails, SRM, defined metrics), not to gatekeep ideas. Canonical sources: Kaufman, Pitchforth and Vermeer, <em>"Democratizing Online Controlled Experiments at Booking.com"</em> (2017 arXiv preprint), and Stefan Thomke's HBR article (March–April 2020).</p>
    <div class="goal"><span>Goal</span><br>Decide features by data-backed expected impact — choose by outcome, not by to-do list or opinion.</div>
  </header>

  <div class="eli">
    <div class="lbl">🎓 8th-grade version</div>
    Most companies have a manager decide which feature ideas get tested. Booking flipped this: <b>anyone</b> — designer, engineer, intern — can launch a real A/B test on millions of users, without asking permission. The trick that makes this safe: the platform won't let your test run unless you've written down what would count as a win <em>before</em> it starts, and the platform automatically catches when something's broken. So nobody can cheat, and bad ideas die fast on their own. The result: <b>25,000 experiments every year</b>. That's how they discovered weird wins no manager would have approved — like showing people hotels they couldn't book, which somehow made <em>more</em> people book others.
  </div>

  <nav class="toc">
    <a href="#headline">Honest headline</a>
    <a href="#anatomy">Three pillars</a>
    <a href="#mechanism">How it picks work</a>
    <a href="#example">The sold-out story</a>
    <a href="#apply">Apply to a sheet</a>
    <a href="methodologies-comparison.html" style="color:var(--blue);font-weight:700">Comparison table →</a>
  </nav>

  <div class="finding" id="headline">
    <h2>The honest headline: trust the platform, not the manager</h2>
    <p>Most experimentation cultures gate ideas at a person: a director, a council, a roadmap review. Booking gates at the <em>platform</em>: <b>if your experiment passes the automated checks, you can launch the test with no one's permission required</b>. That single inversion is what enables ~25,000 tests/year at the company level.</p>
    <p>The work is most associated with <b>Lukas Vermeer</b> (Booking's senior product owner who built much of the culture) and Stefan Thomke's HBR write-up <em>"Building a Culture of Experimentation"</em>. The platform is the enforcement layer; the culture is the choice to let employees act on what the platform clears.</p>
  </div>

  <!-- ANATOMY -->
  <h2 class="sec" id="anatomy">Three pillars that make the culture safe</h2>
  <p class="secsub">Anyone can launch a test, <em>but</em> the platform won't let an unsafe one through. The pillars are the auto-enforced trust checks.</p>

  <div class="three">
    <div class="concept">
      <div class="ab">Hypothesis</div>
      <h3>Written before the test runs</h3>
      <p>Every experiment must declare its <b>hypothesis, OEC (Overall Evaluation Criterion, the agreed success metric), and guardrails</b> in a templated form. The system rejects tests without them. This forces the "what would change my mind?" question up front.</p>
    </div>
    <div class="concept">
      <div class="ab">SRM checks</div>
      <h3>Sample-Ratio Mismatch, auto-flagged</h3>
      <p>If the observed treatment/control split deviates significantly from the designed allocation (a sign of broken randomization or instrumentation), the platform <b>refuses to read out the result</b>. This one check removes a large class of false wins that would otherwise look like real effects.</p>
    </div>
    <div class="concept">
      <div class="ab">Standard metrics</div>
      <h3>Defined once, used everywhere</h3>
      <p>The OEC, guardrails, and supporting metrics are <b>shared definitions</b> across teams. Removes the "my metric vs your metric" argument; makes cross-experiment comparison honest.</p>
    </div>
  </div>
  <div class="src">Sources: <a class="cite" href="https://arxiv.org/abs/1710.08217">Kaufman, Pitchforth and Vermeer, "Democratizing Online Controlled Experiments at Booking.com" (arXiv preprint, October 2017)</a> · <a class="cite" href="https://hbr.org/2020/03/building-a-culture-of-experimentation">Stefan Thomke, "Building a Culture of Experimentation" (HBR, March–April 2020 issue)</a>.</div>
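  <p>A minimal sketch of the SRM pillar, assuming a chi-square goodness-of-fit test on the two arm sizes. The function names and the <code>alpha=0.001</code> threshold are our assumptions for illustration, not Booking's published implementation. The designed split is a parameter because a ramp such as 5% treatment is unequal by design and should not be flagged.</p>

```python
import math


def srm_p_value(control_n, treatment_n, expected_treatment_share=0.5):
    """P-value of a 1-degree-of-freedom chi-square test that the observed
    arm sizes match the designed allocation."""
    if not 0.0 < expected_treatment_share < 1.0:
        raise ValueError("expected_treatment_share must be strictly between 0 and 1")
    total = control_n + treatment_n
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((treatment_n - expected_treatment) ** 2 / expected_treatment
            + (control_n - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_srm(control_n, treatment_n, expected_treatment_share=0.5, alpha=0.001):
    """True when the split is too far from the design to trust the readout.
    alpha=0.001 is an assumed, deliberately strict threshold."""
    return srm_p_value(control_n, treatment_n, expected_treatment_share) < alpha
```

  <p>Under a designed 50/50 split, 500,000 versus 505,000 users is flagged even though the gap is only 1%. An exact 95/5 split under a designed 5% ramp is not flagged.</p>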

  <!-- MECHANISM -->
  <h2 class="sec" id="mechanism">How Booking actually picks what ships</h2>
  <p class="secsub">There is no "prioritization meeting" in the classical sense. The mechanism is <b>cheap-to-launch + guardrails-enforced = let the test decide</b>.</p>

  <div class="step"><div class="num">1</div><div><h3>Anyone writes the hypothesis</h3><p>A designer, an engineer, a PM. Templated: <em>"If we change X, OEC will move by Y because Z."</em> The template is the gate.</p></div></div>
  <div class="step"><div class="num">2</div><div><h3>Platform validates and provisions</h3><p>Auto-checks the hypothesis is complete; assigns randomization; sets the OEC and guardrails; opens the experiment.</p></div></div>
  <div class="step"><div class="num">3</div><div><h3>Run, with SRM and guardrails watching</h3><p>If anything goes off the rails (SRM, guardrail breach), the test is paused. No human in the loop required.</p></div></div>
  <div class="step"><div class="num">4</div><div><h3>Read the OEC. Ship, kill, or iterate.</h3><p>The platform reports significance against the pre-declared OEC. <b>The author decides what to do</b> — no management sign-off step.</p></div></div>
  <div class="step"><div class="num">5</div><div><h3>Cumulative culture: ~25,000 tests/year</h3><p>Because the cost to launch is near zero, the number of experiments per employee per year is among the highest publicly reported. That volume is itself the prioritization mechanism: bad ideas die fast, good ones reveal themselves.</p></div></div>

  <div class="stats">
    <div class="stat"><div class="n">1,000+</div><div class="l">concurrent experiments</div></div>
    <div class="stat"><div class="n">~25,000</div><div class="l">experiments per year</div></div>
    <div class="stat"><div class="n">0</div><div class="l">required management sign-offs to launch</div></div>
  </div>
  <div class="caveat">Numbers come from Thomke (HBR 2020) and Kaufman, Pitchforth and Vermeer (Booking's own arXiv paper, 2017, with later updates). The ~25,000 tests/year figure is the most-quoted Thomke number; the concurrent-experiment count comes from Booking's own paper. Both are reputably published but self-reported, not externally audited.</div>
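  <p>Steps 1 and 2 above can be read as a pre-launch gate: the test opens only when the template is complete. The sketch below is one way to express that. <code>ExperimentSpec</code>, its field names, and <code>can_launch</code> are hypothetical names chosen for illustration, not Booking's actual schema.</p>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentSpec:
    """Hypothetical template: 'If we change X, OEC will move by Y because Z.'"""
    change: str            # X: what is being changed
    oec: str               # the pre-declared success metric
    expected_effect: str   # Y: the predicted movement of the OEC
    rationale: str         # Z: why the author expects that movement
    guardrails: List[str] = field(default_factory=list)  # metrics that must not regress


def missing_fields(spec: ExperimentSpec) -> List[str]:
    """Return the names of template fields that are empty or whitespace-only."""
    problems = []
    for name in ("change", "oec", "expected_effect", "rationale"):
        if not getattr(spec, name).strip():
            problems.append(name)
    if not spec.guardrails:
        problems.append("guardrails")
    return problems


def can_launch(spec: ExperimentSpec) -> bool:
    """The gate is the template itself: no person approves or rejects the idea."""
    return not missing_fields(spec)
```

  <p>The design choice worth copying is that the gate inspects completeness, never merit. A strange-sounding idea with a full template passes; a sensible-sounding idea with no declared guardrails does not.</p>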

  <!-- EXAMPLE -->
  <h2 class="sec" id="example">The "sold-out properties" example — why you can't guess</h2>
  <div class="ex">
    <h3>A finding no PM would have proposed</h3>
    <p>One of Booking's most-quoted experiments: showing <em>sold-out</em> properties alongside available ones <strong>increased bookings</strong>. The signal was social-proof / scarcity — these places are good, others have already booked them — pushing travellers to book the available alternatives faster.</p>
    <p style="color:var(--ink-soft);font-size:13px">No prioritization framework would have ranked "show users things they can't buy" highly. The experiment found it. That's the entire case for the culture: <em>the wins you'd never have predicted are why measurement beats estimation</em>.</p>
    <div class="src" style="margin-top:6px">Source: Thomke, HBR 2020.</div>
  </div>

  <!-- APPLY TO A SHEET -->
  <h2 class="sec" id="apply">Apply to a feature sheet</h2>
  <p class="secsub">Booking's discipline turns the backlog into an always-on experiment queue. With 1,000+ concurrent tests, the sheet has no "this quarter's bets" view; it shows <em>what's running right now</em>. The columns mirror those used for Microsoft ExP, with Booking-specific OECs (conversion to booking, revenue per visitor) and guardrails (cancellation rate, refund rate).</p>

  <div class="note" style="background:var(--teal-soft);border-left-color:var(--teal)"><b>Try it Monday morning (30 minutes).</b> Walk through your team's typical "we want to try X" path. Count the management approvals between idea and live test. Each approval is a tax on your experiment rate. Pick the lowest-stakes one and ask out loud: <em>"what does this approval actually catch that a guardrail wouldn't?"</em> If the honest answer is "nothing — it's a checkbox," you've identified a step to replace with an automated check. Booking's leverage isn't a smarter platform; it's having converted human checks into platform checks one at a time, over years.</div>

  <div class="note" style="background:var(--blue-soft);border-left-color:var(--blue);font-size:13.5px"><b>Quick glossary.</b> <b>OEC</b> = <a class="j" href="jargon.html#oec">Overall Evaluation Criterion</a> (the agreed success metric). <b>SRM</b> = <a class="j" href="jargon.html#srm">Sample-Ratio Mismatch</a> (auto-flags broken randomization). <b>Guardrails</b> = metrics that must not regress. <b>Cancellation rate</b> &amp; <b>refund rate</b> = Booking-specific marketplace guardrails — a feature that boosts <em>bookings</em> by tricking users into commitments they later cancel is a net loss; these two metrics make that visible.</div>

  <h3 style="font-family:Georgia,serif;font-size:18px;margin:18px 0 8px">Worked example — an experiment queue snapshot (Booking-style)</h3>
  <p style="font-size:13.5px;color:var(--ink-soft);margin:0 0 12px">Eight tests. All numbers are illustrative, including those in the "sold-out properties" row; only that row's finding (showing sold-out properties increased bookings) comes from the published source.</p>

  <div style="overflow-x:auto;margin:14px 0">
    <table style="border-collapse:collapse;width:100%;font-size:13px;background:#fff;border:1px solid var(--line);border-radius:10px;overflow:hidden">
      <thead><tr style="background:var(--ink);color:#f3efe6;font-size:11.5px;letter-spacing:.05em;text-transform:uppercase"><th style="padding:9px 10px;text-align:left">Feature</th><th style="padding:9px 10px;text-align:left">OEC</th><th style="padding:9px 10px;text-align:left">Guardrails</th><th style="padding:9px 10px;text-align:left">N/arm</th><th style="padding:9px 10px;text-align:left">Ramp</th><th style="padding:9px 10px;text-align:left">Result</th><th style="padding:9px 10px;text-align:left">Decision</th></tr></thead>
      <tbody>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Show sold-out properties (canonical)</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Conv to booking</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Cancellation, refund</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">1.2M</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">50/50</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+1.3% conv, cancel flat</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Property-page redesign</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Conv to booking</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Bounce, scroll depth</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">800k</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">25% → 50%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.8% conv, bounce flat</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">AI description translation</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Conv to booking</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Description-edit complaints</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">400k</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">50/50</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.5% conv (locale), OK</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Loyalty badge on profile</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Repeat-booking rate</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">New-user conv</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">600k</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">50/50</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.4% repeat, OK</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Scarcity countdown timer</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Conv to booking</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Cancellation, support tickets</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">300k</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">25% → 50%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.4% conv, tickets +6%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--gold);font-weight:700">Iterate: softer copy</td></tr>
        <tr><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">New filter UX</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Conv to booking</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Search-abandon</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">500k</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">25%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Flat, search-abandon +0.4%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--gold);font-weight:700">Iterate</td></tr>
        <tr><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Pre-emptive cancel-fee warning</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Conv to booking</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">—</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">200k</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">5%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">−0.5% conv</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--accent);font-weight:700">Kill</td></tr>
        <tr><td style="padding:9px 10px;font-weight:600">Aggressive cross-sell modal</td><td style="padding:9px 10px">Revenue/visitor</td><td style="padding:9px 10px">Cancellation</td><td style="padding:9px 10px">250k</td><td style="padding:9px 10px">5%</td><td style="padding:9px 10px">Rev +5% BUT cancellations +12%</td><td style="padding:9px 10px;font-family:Georgia,serif;color:var(--accent);font-weight:700">Kill</td></tr>
      </tbody>
    </table>
  </div>
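  <p>The Result column reports relative lift on the OEC. As a hedged sketch of how such a readout could be computed for a conversion metric, the function below runs a pooled two-proportion z-test. The function name and all counts used with it are hypothetical; the sources do not publish baseline conversion rates, and this is not a description of Booking's actual statistics engine.</p>

```python
import math


def conversion_readout(control_conversions, control_n,
                       treatment_conversions, treatment_n):
    """Return (relative_lift, z, two_sided_p) for a pooled two-proportion z-test."""
    p_control = control_conversions / control_n
    p_treatment = treatment_conversions / treatment_n
    pooled = (control_conversions + treatment_conversions) / (control_n + treatment_n)
    standard_error = math.sqrt(pooled * (1.0 - pooled)
                               * (1.0 / control_n + 1.0 / treatment_n))
    z = (p_treatment - p_control) / standard_error
    # Two-sided p-value for a standard normal: erfc(|z| / sqrt(2)).
    two_sided_p = math.erfc(abs(z) / math.sqrt(2.0))
    relative_lift = (p_treatment - p_control) / p_control
    return relative_lift, z, two_sided_p
```

  <p>With 1.2M users per arm and an assumed 10% baseline conversion, a 1.3% relative lift (120,000 versus 121,560 bookings) gives a z-score of about 3.3. At an assumed 2% baseline (24,000 versus 24,312) the same relative lift does not clear a 5% threshold. The baseline rate, not just the sample size, determines whether a lift of that size is readable.</p>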

  <div class="note" style="background:var(--accent-soft);border-left-color:var(--accent)"><b>The most important reading skill on this page.</b> The sold-out row is the table's most important entry, but not for the (illustrative) +1.3% conversion lift. It matters because <em>no human prioritization framework would have ranked it as a candidate worth testing</em>. "Show users things they can't buy" sounds like a UX mistake. The experiment only got to run because Booking's culture lets anyone launch a test without a manager's prior judgment about whether the idea sounds smart. That's the load-bearing claim of the entire page: <em>the wins you wouldn't have predicted are the ones democratised testing is for</em>. If your culture requires a manager to approve every test, you systematically filter out the surprising wins.</div>

  <div class="note"><b>Decision rule.</b> The standard rule applies: ship only when the pre-declared OEC improves and no guardrail regresses. Booking's two distinguishing guardrails are <em>cancellation rate</em> and <em>refund rate</em>. The cross-sell row shows why: a feature that lifts revenue by tricking users into bookings they cancel is worse than no feature. The sold-out row shows the other half: the OEC moved in a direction nobody would have <em>guessed</em>, and with guardrails flat the rule said ship anyway. That's the entire case for measure-don't-estimate at this scale.</div>
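  <p>One way to make the Ship / Iterate / Kill verdicts in the table mechanical is sketched below. The thresholds (<code>soft_limit_pct</code>, <code>hard_limit_pct</code>), the function name, and the assumption that each illustrative OEC result is statistically significant are ours; they were chosen so the sketch reproduces the table's verdicts, not taken from the sources.</p>

```python
def decide(oec_lift_pct, oec_significant, guardrail_regressions_pct,
           soft_limit_pct=1.0, hard_limit_pct=10.0):
    """Return 'Ship', 'Iterate', or 'Kill'.

    oec_lift_pct: relative change in the pre-declared OEC, in percent.
    oec_significant: whether that change is statistically significant.
    guardrail_regressions_pct: guardrail name -> percent worsening (0 if flat).
    Both limits are hypothetical tolerances, not published values.
    """
    worst = max(guardrail_regressions_pct.values(), default=0.0)
    if worst >= hard_limit_pct:
        return "Kill"      # a guardrail is badly breached, whatever the OEC says
    if oec_significant and oec_lift_pct < 0:
        return "Kill"      # the OEC itself got worse
    if worst > soft_limit_pct:
        return "Iterate"   # promising, but a guardrail moved the wrong way
    if oec_significant and oec_lift_pct > 0:
        return "Ship"
    return "Iterate"       # flat or inconclusive: rework and retest
```

  <p>The order of the checks carries the rule. Guardrails are evaluated before the OEC, so a revenue lift of 5% cannot outvote a 12% rise in cancellations.</p>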

  <div class="note"><b>The transferable insight: cheap-to-launch is the lever.</b> The reason Booking's culture works is that the cost to run a test is so low that <em>not</em> running one is the unusual choice. For a team our size, "cheap" means automated guardrails and a single shared metric definition — not 100 data scientists. The expensive part is the discipline of writing the hypothesis and OEC <em>every time</em>.</div>
  <footer>
    Companion to <a href="impact-consumer-companies.html#measure">← Consumer case studies · Measure don't estimate</a> · <a href="methodologies-comparison.html">All methods compared</a> · siblings: <a href="microsoft-exp-framework.html">Microsoft ExP</a> · <a href="linkedin-xlnt-framework.html">LinkedIn T-REX</a><br>
    <b>Grounded in</b> <a href="https://arxiv.org/abs/1710.08217">Kaufman, Pitchforth and Vermeer, "Democratizing Online Controlled Experiments at Booking.com" (arXiv, October 2017)</a> and <a href="https://hbr.org/2020/03/building-a-culture-of-experimentation">Stefan Thomke, "Building a Culture of Experimentation" (HBR, March–April 2020 issue)</a>. <b>Verbatim from sources:</b> the 25,000-tests-per-year figure, the "any employee can launch tests" framing, the platform/culture inversion, and Lukas Vermeer's role as a key voice for the program. <b>Added by us, not in the sources:</b> the 8-row experiment-queue worked example, the Ship/Iterate/Kill verdict logic, the in-page glossary, and the "Try it Monday" exercise.
  </footer>
</div>
</body>
</html>