<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Airbnb ERF: standardised analysis as the platform's product</title>
<link rel="stylesheet" href="framework.css">
<style>
  /* Page-accent — overrides framework.css fallback */
  :root{--page-accent:var(--blue);--page-accent-soft:var(--blue-soft)}
  /* three things */
.three{display:grid;grid-template-columns:repeat(3,1fr);gap:14px;margin:14px 0}
  .concept{background:#fff;border:1px solid var(--line);border-radius:12px;padding:18px 20px;box-shadow:var(--shadow);border-top:5px solid var(--page-accent)}
  .concept .ab{font-family:Georgia,serif;font-size:13px;color:var(--page-accent);font-weight:700;letter-spacing:.05em;text-transform:uppercase;margin-bottom:4px}
  .concept h3{margin:0 0 8px;font-size:17px;font-family:Georgia,serif}
  .concept p{margin:0;font-size:13.5px;color:var(--ink-soft);line-height:1.5}
  .concept p b{color:var(--ink)}
  /* stats */
.stats{display:flex;gap:10px;flex-wrap:wrap;margin:14px 0}
  .stat{flex:1 1 150px;background:#f6f4ee;border:1px solid var(--line);border-radius:8px;padding:12px 14px;text-align:center}
  .stat .n{font-family:Georgia,serif;font-size:22px;color:var(--page-accent);font-weight:700;line-height:1.1}
  .stat .l{font-size:11.5px;color:var(--ink-soft);margin-top:4px;line-height:1.35}
  @media(max-width:680px){.three{grid-template-columns:1fr}}
</style>
</head>
<body>
<nav class="sitenav">
<details>
<summary>📑 Jump to</summary>
<div class="navmenu">
<div class="navgrp"><h4>Start here</h4>
<a href="index.html"><b>← Home (goal &amp; map)</b></a>
<a href="impact-saas-companies.html">SaaS / B2B field study</a>
<a href="impact-consumer-companies.html">Consumer-tech field study</a>
<a href="methodologies-comparison.html"><b>All methods compared →</b></a>
<a href="experiment-trustworthiness.html">How 40k tests actually work →</a>
<a href="jargon.html">Jargon (glossary)</a>
</div>
<div class="navgrp"><h4>Scoring &amp; Input modeling</h4>
<a href="rice-framework.html">RICE (Intercom)</a>
<a href="north-star-framework.html">North Star (Amplitude / Slack)</a>
</div>
<div class="navgrp"><h4>Goal-laddering / Define first</h4>
<a href="v2mom-framework.html">V2MOM (Salesforce)</a>
<a href="pyramid-of-clarity-framework.html">Pyramid of Clarity (Asana)</a>
<a href="pr-faq-framework.html">PR-FAQ / Working Backwards (Amazon)</a>
<a href="heart-framework.html">HEART (Google)</a>
<a href="dibb-framework.html">DIBB (Spotify)</a>
</div>
<div class="navgrp"><h4>Experimentation (SaaS)</h4>
<a href="microsoft-exp-framework.html">Microsoft ExP / CUPED</a>
<a href="linkedin-xlnt-framework.html">LinkedIn T-REX</a>
</div>
<div class="navgrp"><h4>Experimentation (Consumer)</h4>
<a href="netflix-experimentation.html">Netflix · ABlaze</a>
<a href="booking-experimentation.html">Booking.com</a>
<a class="cur" href="airbnb-erf-framework.html">Airbnb ERF</a>
<a href="uber-xp-framework.html">Uber XP</a>
<a href="doordash-switchback-framework.html">DoorDash switchback</a>
<a href="lyft-experimentation.html">Lyft</a>
<a href="pinterest-ab-framework.html">Pinterest</a>
</div>
<div class="navgrp"><h4>AI labs</h4>
<a href="anthropic-pm-on-ai-exponential.html">Anthropic · PM on AI exponential</a>
<a href="google-customer-zero-2026.html">Google · "Customer zero" 2026</a>
</div>
<div class="navgrp"><h4>Written discipline</h4>
<a href="stripe-shaping-framework.html">Stripe shaping</a>
</div>
</div>
</details>
</nav>

<div class="wrap">
  <header class="masthead">
    <p class="kicker">Methods · Deep-dive · Experimentation</p>
    <h1>Airbnb ERF — standardised analysis as the platform's product <span class="srcyr">2014</span></h1>
    <p class="sub">Airbnb's <strong>Experiment Reporting Framework</strong> (ERF) was built by Will Moss and team when the company's experiment volume outgrew bespoke analysis. Moss first described it publicly in his <a class="cite" href="https://medium.com/airbnb-engineering/experiment-reporting-framework-4e3fcd29e6c0">Airbnb Engineering post on May 29, 2014</a>.</p>
    <p class="sub">The framework's distinctive contribution: a common metric configuration language and a standardised report per experiment that makes valid analysis the default — not a heroic act per test. Moss's own framing (verbatim): a tool to <em>"make running experiments easier by hiding all the pitfalls and automating the analytical heavy lifting."</em></p>
    <div class="goal"><span>Goal</span><br>Decide features by data-backed expected impact — choose by outcome, not by to-do list or opinion.</div>
  </header>

  <div class="eli">
    <div class="lbl">🎓 8th-grade version</div>
    Airbnb used to test ideas by having a data scientist look at the numbers from each experiment by hand. That works for a few tests, but it breaks down once dozens (and later hundreds) are running at once. So they built a tool called <b>ERF</b> that does the analysis automatically and prints the same kind of report every time. Two clever things: (1) every team uses the <em>same definition</em> of metrics like "bookings" so no one can argue about whose number is right; (2) the report tells you not just "the number went up" but "the number went up <em>after subtracting the noise</em>", because sometimes a feature looks like a win but is actually just seasonal traffic the team would have gotten anyway.
  </div>

  <nav class="toc">
    <a href="#headline">Honest headline</a>
    <a href="#anatomy">What ERF actually is</a>
    <a href="#mechanism">How it picks work</a>
    <a href="#stats">Scale</a>
    <a href="#apply">Apply to a sheet</a>
    <a href="methodologies-comparison.html" style="color:var(--blue);font-weight:700">Comparison table →</a>
  </nav>

  <div class="finding" id="headline">
    <h2>The honest headline: standardise the analysis so engineers can run their own experiments</h2>
    <p>Airbnb's published story (2014–2018 engineering posts) is the painful one familiar to anyone who's tried to scale experimentation: <b>at low volume, you analyse each test by hand. At high volume, the bottleneck becomes the data scientist.</b> ERF was the framework that broke that bottleneck — define metrics once in a config language, and the platform produces a valid analysis automatically.</p>
    <p>Same decision rule as the rest of the experimentation set (OEC + guardrails + ramp). What's distinct: <b>valid analysis baked into the platform itself</b>. Shared metric definitions, a standard report and, in later iterations, automatic variance reduction let smaller tests detect smaller effects faster, without an analyst's intervention per experiment.</p>
  </div>

  <!-- ANATOMY -->
  <h2 class="sec" id="anatomy">What ERF actually is — three pieces</h2>
  <p class="secsub">It's a reporting / analysis layer on top of Airbnb's A/B platform, not a separate testing tool. The pieces are what make experiments cheap to read.</p>

  <div class="three">
    <div class="concept">
      <div class="ab">Config</div>
      <h3>Metrics defined once, in a shared DSL</h3>
      <p>Each metric (booking rate, search-to-book conversion, host response time…) is described in a single config file. All experiments reference the same definition — eliminating "your numerator vs my numerator" drift.</p>
    </div>
    <div class="concept">
      <div class="ab">Variance</div>
      <h3>Automatic variance reduction (CUPED-style)</h3>
      <p>Later iterations of ERF applied <b>pre-experiment-data covariates</b> by default to shrink confidence intervals — the same statistical trick that <a class="cite" href="microsoft-exp-framework.html">Microsoft's CUPED</a> (2013 paper) uses. The original 2014 ERF post focuses on standardisation; variance-reduction features appeared in later "Scaling ERF" follow-ups.</p>
    </div>
    <div class="concept">
      <div class="ab">Reports</div>
      <h3>Standard reports per experiment</h3>
      <p>One templated output per test: OEC, supporting metrics, guardrails, significance, confidence interval. Engineers and PMs read it without a data scientist on call.</p>
    </div>
  </div>
  <div class="src">Sources: <a class="cite" href="https://medium.com/airbnb-engineering/experiment-reporting-framework-4e3fcd29e6c0">Will Moss — "Experiment Reporting Framework" (Airbnb Engineering, May 29, 2014)</a> — the original ERF post · <a class="cite" href="https://medium.com/airbnb-engineering/https-medium-com-jonathan-parks-scaling-erf-23fd17c91166">Jonathan Parks — "Scaling ERF"</a> follow-up.</div>
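  <p class="secsub">To make the Variance card concrete, here is a minimal Python sketch of a CUPED-style adjustment. It is not Airbnb's implementation: it assumes a single pre-experiment covariate per user (for example, bookings in the weeks before the test), and the function name <code>cuped_adjust</code> and the toy numbers are ours.</p>

```python
from statistics import fmean, pvariance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y_i - theta * (x_i - mean(x)).

    y: in-experiment metric per user.
    x: the same user's pre-experiment covariate.
    theta = cov(x, y) / var(x), estimated on the pooled sample.
    """
    mean_x, mean_y = fmean(x), fmean(y)
    cov_xy = fmean([(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)])
    theta = cov_xy / pvariance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


# Toy data (made up): pre-period and in-experiment bookings per user.
pre = [0, 1, 2, 3, 4, 5, 6, 7]
post = [1, 1, 3, 3, 5, 5, 7, 7]
adjusted = cuped_adjust(post, pre)

# The mean is preserved; the variance shrinks because the covariate
# explains part of the user-to-user spread.
print(round(fmean(post), 6), round(fmean(adjusted), 6))
print(pvariance(post), pvariance(adjusted))
```

  <p class="secsub">In an A/B setting the same theta would be applied to both arms before comparing their means. The tighter the correlation between the covariate and the metric, the narrower the resulting confidence interval.</p>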

  <!-- MECHANISM -->
  <h2 class="sec" id="mechanism">How an Airbnb-style experiment actually picks the winner</h2>
  <p class="secsub">Same OEC discipline as everyone else's platform. What changes is who can read the result — and how soon.</p>

  <div class="step"><div class="num">1</div><div><h3>Engineer / PM defines the experiment in code</h3><p>Hypothesis, OEC, sample-size estimate. Metrics referenced by name from the shared config — no bespoke calculations.</p></div></div>
  <div class="step"><div class="num">2</div><div><h3>Platform randomizes assignment and starts the test</h3><p>Flag-based deploy. In later iterations of ERF, variance-reduction covariates are picked automatically from each user's pre-experiment data.</p></div></div>
  <div class="step"><div class="num">3</div><div><h3>Auto-generated report reads the OEC + guardrails</h3><p>Same report template every time. Standardisation is the trust mechanism — readers know exactly where to look.</p></div></div>
  <div class="step"><div class="num">4</div><div><h3>Ship, kill, or iterate — author decides</h3><p>No analyst gate. The author is on the hook for reading the report honestly; the report itself flags caveats (insufficient sample, <a class="j" href="jargon.html#srm">SRM</a>, etc.).</p></div></div>
  <div class="step"><div class="num">5</div><div><h3>Cumulative metric library improves the next test</h3><p>Every new metric added to the config becomes available to every future experiment. The framework <em>compounds</em>.</p></div></div>
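  <p class="secsub">The steps above can be sketched in a few lines of Python. This is an illustration under our own assumptions, not ERF's code or config format: the <code>METRICS</code> registry, <code>build_report</code>, <code>make_arm</code>, the user field names and the 0.001 SRM threshold are all hypothetical, and significance tests and confidence intervals are left out for brevity.</p>

```python
import math

# Shared metric definitions: written once, referenced by name everywhere.
# (Hypothetical stand-in for a shared metric config.)
METRICS = {
    "booking_rate": lambda users: sum(u["booked"] for u in users) / len(users),
    "cancel_rate": lambda users: sum(u["cancelled"] for u in users) / len(users),
}


def srm_p_value(n_control, n_treatment):
    """Chi-square test (1 degree of freedom) against an intended 50/50 split."""
    expected = (n_control + n_treatment) / 2
    chi2 = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
    # Survival function of a chi-square with 1 df: erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def build_report(oec, guardrails, control, treatment, srm_alpha=0.001):
    """Same report shape for every experiment: OEC, guardrails, caveats."""
    def relative_lift(name):
        base = METRICS[name](control)
        return (METRICS[name](treatment) - base) / base

    caveats = []
    if srm_alpha > srm_p_value(len(control), len(treatment)):
        caveats.append("SRM: arm sizes do not match the intended split")
    return {
        "oec": {oec: relative_lift(oec)},
        "guardrails": {g: relative_lift(g) for g in guardrails},
        "caveats": caveats,
    }


def make_arm(n, booked, cancelled):
    """Toy users: the first `booked` users book, the first `cancelled` cancel."""
    return [{"booked": int(booked > i), "cancelled": int(cancelled > i)} for i in range(n)]


control = make_arm(1000, 100, 20)
treatment = make_arm(1000, 110, 20)
report = build_report("booking_rate", ["cancel_rate"], control, treatment)
print(report)
```

  <p class="secsub">The experiment author names metrics rather than computing them, and every report comes back with the same keys. Adding a metric to the registry makes it available to every later call, which is the compounding effect described in step 5.</p>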

  <!-- STATS -->
  <h2 class="sec" id="stats">Scale</h2>
  <div class="stats">
    <div class="stat"><div class="n">~500</div><div class="l">concurrent experiments (up from a few dozen in 2014)</div></div>
    <div class="stat"><div class="n">1 framework</div><div class="l">one shared metric <a class="j" href="jargon.html#dsl">DSL</a>, all teams</div></div>
    <div class="stat"><div class="n">~0</div><div class="l">analyst hours per standard experiment</div></div>
  </div>

  <!-- APPLY TO A SHEET -->
  <h2 class="sec" id="apply">Apply to a feature sheet</h2>
  <p class="secsub">ERF doesn't change the columns much — it changes the <strong>number you read</strong>. The sheet's <em>Result</em> column has two values: raw lift and ERF-adjusted lift (corrected for novelty, seasonality, network effects). Decisions are made on the adjusted number. The classic trap ERF prevents: shipping a feature whose "lift" was actually seasonal demand or a novelty bump.</p>

  <div class="note" style="background:var(--teal-soft);border-left-color:var(--teal)"><b>Try it Monday morning (30 minutes).</b> Pick the last A/B test your team shipped. Find the result. Now answer two questions: was the metric defined in a <em>shared</em> place where another team would use the exact same definition? Did anyone check whether the lift could be explained by <em>seasonality</em> (was there a holiday weekend in the test window)? If either answer is "no," your team is at the pre-ERF stage Airbnb was at in 2013 — and the ERF leverage is in those two specific gaps, not in more sophisticated stats.</div>

  <div class="note" style="background:var(--blue-soft);border-left-color:var(--blue);font-size:13.5px"><b>Quick glossary for the columns below.</b> <b>Raw lift</b> = the OEC delta straight from the test, as if no adjustments were needed. <b>ERF-adjusted lift</b> (our label for this page, not a field in Airbnb's reports) = the same delta after accounting for novelty effects (users react to anything new), seasonality (lift may be calendar-driven), and network effects (users in treatment can affect users in control). <b><a class="j" href="jargon.html#novelty-effect">Novelty effect</a></b> = the temporary engagement bump that fades after users get used to the change. <b>Seasonality</b> = systematic time-of-week / time-of-year demand patterns that exist independent of the test.</div>

  <h3 style="font-family:Georgia,serif;font-size:18px;margin:18px 0 8px">Worked example — an experiment ledger snapshot (Airbnb-style)</h3>
  <p style="font-size:13.5px;color:var(--ink-soft);margin:0 0 12px">Eight tests. Note the gap between raw and ERF-adjusted lift — that gap is the lie ERF prevents you from believing.</p>

  <div style="overflow-x:auto;margin:14px 0">
    <table style="border-collapse:collapse;width:100%;font-size:13px;background:#fff;border:1px solid var(--line);border-radius:10px;overflow:hidden">
      <thead><tr style="background:var(--ink);color:#f3efe6;font-size:11.5px;letter-spacing:.05em;text-transform:uppercase"><th style="padding:9px 10px;text-align:left">Feature</th><th style="padding:9px 10px;text-align:left">OEC</th><th style="padding:9px 10px;text-align:left">Guardrails</th><th style="padding:9px 10px;text-align:left">Raw lift</th><th style="padding:9px 10px;text-align:left">ERF-adjusted</th><th style="padding:9px 10px;text-align:left">Why the gap</th><th style="padding:9px 10px;text-align:left">Decision</th></tr></thead>
      <tbody>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Map-based search default</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Bookings</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Cancellation</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.6%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.3%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Small novelty bump</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">New listing-photo layout</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Bookings</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Listing-edits</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+1.1%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.9%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Mostly real</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Smart pricing for hosts</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Host listing-rate</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Host satisfaction</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+2.4%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+1.8%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Some seasonality</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Wishlist sharing</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Bookings via shares</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Wishlist abuse</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.4%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.7%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Network effect (positive)</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Push notifications for saved searches</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Bookings</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Mute rate</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.9%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.1%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Mostly novelty + seasonal</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--gold);font-weight:700">Iterate</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Filter persistence</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Saved-search rate</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Search abandon</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.3%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+0.4%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Real effect</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Instant-book promotion</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Bookings</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Host opt-out</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">+2.0%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">−0.1%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Entirely seasonality</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--accent);font-weight:700">Kill</td></tr>
        <tr><td style="padding:9px 10px;font-weight:600">Aggressive cleaning-fee display</td><td style="padding:9px 10px">Bookings</td><td style="padding:9px 10px">Bounce</td><td style="padding:9px 10px">−0.4%</td><td style="padding:9px 10px">−0.5%</td><td style="padding:9px 10px">Real decline</td><td style="padding:9px 10px;font-family:Georgia,serif;color:var(--accent);font-weight:700">Kill</td></tr>
      </tbody>
    </table>
  </div>

  <div class="note" style="background:var(--accent-soft);border-left-color:var(--accent)"><b>The most important reading skill on this page.</b> Compare the "Instant-book promotion" row with the "Wishlist sharing" row. Instant-book: raw <em>+2.0%</em>, adjusted <em>−0.1%</em>. Wishlist: raw <em>+0.4%</em>, adjusted <em>+0.7%</em>. The gaps differ in size (2.1 points versus 0.3), and they point in <em>opposite</em> directions. Adjustment doesn't always shrink the lift; sometimes it grows the lift because the raw number was masking a real network effect. This is why ERF is a <em>standardisation</em> tool, not a "make numbers smaller" tool. A team using raw lifts ships the instant-book feature (wrong) and undervalues the wishlist feature (also wrong). Both errors disappear when the analysis is consistent and the report is the same shape every time.</div>

  <div class="note"><b>Decision rule.</b> Decide on the <em>ERF-adjusted</em> lift, not the raw. The instant-book row is the canonical trap: a raw +2.0% looks like a clean ship, but the adjusted −0.1% shows the team would have taken credit for seasonal demand. Without standardised reporting, every team interprets its own numbers and trust collapses; with it, the whole company reads the same shape every time.</div>
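  <p class="secsub">The ledger and decision rule above can be replayed as code. A minimal sketch: the lifts are copied from the table, while the <code>decide</code> function and its 0.3-point ship threshold are our assumption, chosen only because they reproduce the table's verdicts. Guardrails are left out for brevity.</p>

```python
# (feature, raw lift %, adjusted lift %, decision shown in the table)
LEDGER = [
    ("Map-based search default", 0.6, 0.3, "Ship"),
    ("New listing-photo layout", 1.1, 0.9, "Ship"),
    ("Smart pricing for hosts", 2.4, 1.8, "Ship"),
    ("Wishlist sharing", 0.4, 0.7, "Ship"),
    ("Push notifications for saved searches", 0.9, 0.1, "Iterate"),
    ("Filter persistence", 0.3, 0.4, "Ship"),
    ("Instant-book promotion", 2.0, -0.1, "Kill"),
    ("Aggressive cleaning-fee display", -0.4, -0.5, "Kill"),
]


def decide(lift, ship_at=0.3):
    """Hypothetical rule: ship at or above the threshold, kill at or below zero."""
    if lift >= ship_at:
        return "Ship"
    return "Iterate" if lift > 0 else "Kill"


for feature, raw, adjusted, table_decision in LEDGER:
    on_raw, on_adjusted = decide(raw), decide(adjusted)
    # Flag rows where reading the raw number would change the verdict.
    flag = "" if on_raw == on_adjusted else "  (raw lift would have said " + on_raw + ")"
    print(f"{feature}: {on_adjusted}{flag}")
```

  <p class="secsub">Under this rule, deciding on the raw lift changes the verdict for two rows: push notifications for saved searches and instant-book promotion. Wishlist sharing ships either way, so the cost of reading its raw number is undervaluing the feature rather than a wrong verdict.</p>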

  <div class="note"><b>The transferable insight: standardisation, not statistics.</b> The Airbnb story is sometimes told as a <a class="j" href="jargon.html#cuped">CUPED</a> variance-reduction story. The deeper lesson is that <b>the platform's product is the standardised analysis</b>. Without ERF, every team builds its own analysis and the results aren't comparable. With it, teams can run far more tests because each one costs the same low, fixed effort to read.</div>
  <footer>
    Companion to <a href="impact-consumer-companies.html#measure">← Consumer case studies · Measure don't estimate</a> · <a href="methodologies-comparison.html">All methods compared</a> · siblings: <a href="microsoft-exp-framework.html">Microsoft ExP / CUPED</a><br>
    <b>Grounded in</b> <a href="https://medium.com/airbnb-engineering/experiment-reporting-framework-4e3fcd29e6c0">Will Moss, "Experiment Reporting Framework" (Airbnb Engineering, May 29, 2014)</a> and <a href="https://medium.com/airbnb-engineering/https-medium-com-jonathan-parks-scaling-erf-23fd17c91166">Jonathan Parks, "Scaling ERF"</a>. <b>Verbatim from Moss's 2014 post:</b> the "hiding all the pitfalls and automating the analytical heavy lifting" framing; the <a class="j" href="jargon.html#yaml">YAML</a>-based experiment declaration; nightly automated analysis; treatment-assignment via lambda functions in Ruby/JavaScript. <b>Attribution timing:</b> the original 2014 ERF focused on standardisation and report templating; CUPED-style variance reduction was a later addition documented in follow-up posts (not Moss's original). <b>Added by us, not in Airbnb's posts:</b> the 8-row experiment-ledger worked example with raw-vs-adjusted lift gaps, the verdict labels, the in-page glossary, and the "Try it Monday" exercise. The ~500-concurrent / "dozens in 2014" figures are reasonable estimates from across Airbnb's published posts but should be re-verified against current Airbnb engineering writing before quoting.
  </footer>
</div>
</body>
</html>