<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Lyft — marketplace experimentation with time/region splits + adaptive bandits</title>
<link rel="stylesheet" href="framework.css">
<style>
  /* Page-accent — overrides framework.css fallback */
  :root{--page-accent:var(--blue);--page-accent-soft:var(--blue-soft)}
  /* three pieces */
  .three{display:grid;grid-template-columns:repeat(3,1fr);gap:14px;margin:14px 0}
  .concept{background:#fff;border:1px solid var(--line);border-radius:12px;padding:18px 20px;box-shadow:var(--shadow);border-top:5px solid var(--page-accent)}
  .concept .ab{font-family:Georgia,serif;font-size:13px;color:var(--page-accent);font-weight:700;letter-spacing:.05em;text-transform:uppercase;margin-bottom:4px}
  .concept h3{margin:0 0 8px;font-size:17px;font-family:Georgia,serif}
  .concept p{margin:0;font-size:13.5px;color:var(--ink-soft);line-height:1.5}
  .concept p b{color:var(--ink)}
  @media(max-width:680px){.three{grid-template-columns:1fr}}
</style>
</head>
<body>
<nav class="sitenav">
<details>
<summary>📑 Jump to</summary>
<div class="navmenu">
<div class="navgrp"><h4>Start here</h4>
<a href="index.html"><b>← Home (goal &amp; map)</b></a>
<a href="impact-saas-companies.html">SaaS / B2B field study</a>
<a href="impact-consumer-companies.html">Consumer-tech field study</a>
<a href="methodologies-comparison.html"><b>All methods compared →</b></a>
<a href="experiment-trustworthiness.html">How 40k tests actually work →</a>
<a href="jargon.html">Jargon (glossary)</a>
</div>
<div class="navgrp"><h4>Scoring &amp; Input modeling</h4>
<a href="rice-framework.html">RICE (Intercom)</a>
<a href="north-star-framework.html">North Star (Amplitude / Slack)</a>
</div>
<div class="navgrp"><h4>Goal-laddering / Define first</h4>
<a href="v2mom-framework.html">V2MOM (Salesforce)</a>
<a href="pyramid-of-clarity-framework.html">Pyramid of Clarity (Asana)</a>
<a href="pr-faq-framework.html">PR-FAQ / Working Backwards (Amazon)</a>
<a href="heart-framework.html">HEART (Google)</a>
<a href="dibb-framework.html">DIBB (Spotify)</a>
</div>
<div class="navgrp"><h4>Experimentation (SaaS)</h4>
<a href="microsoft-exp-framework.html">Microsoft ExP / CUPED</a>
<a href="linkedin-xlnt-framework.html">LinkedIn T-REX</a>
</div>
<div class="navgrp"><h4>Experimentation (Consumer)</h4>
<a href="netflix-experimentation.html">Netflix · ABlaze</a>
<a href="booking-experimentation.html">Booking.com</a>
<a href="airbnb-erf-framework.html">Airbnb ERF</a>
<a href="uber-xp-framework.html">Uber XP</a>
<a href="doordash-switchback-framework.html">DoorDash switchback</a>
<a class="cur" href="lyft-experimentation.html">Lyft</a>
<a href="pinterest-ab-framework.html">Pinterest</a>
</div>
<div class="navgrp"><h4>AI labs</h4>
<a href="anthropic-pm-on-ai-exponential.html">Anthropic · PM on AI exponential</a>
<a href="google-customer-zero-2026.html">Google · "Customer zero" 2026</a>
</div>
<div class="navgrp"><h4>Written discipline</h4>
<a href="stripe-shaping-framework.html">Stripe shaping</a>
</div>
</div>
</details>
</nav>

<div class="wrap">
  <header class="masthead">
    <p class="kicker">Methods · Deep-dive · Experimentation</p>
    <h1>Lyft — marketplace experimentation with time/region splits + adaptive bandits <span class="srcyr">2016</span></h1>
    <p class="sub">Lyft's published approach is the close cousin of <a href="doordash-switchback-framework.html" style="color:#1f6f6b">DoorDash's switchback</a>: every product change runs as an experiment, but marketplace interference forces designs richer than ordinary A/B — <strong>time-split</strong> and <strong>region-split</strong> tests.</p>
    <p class="sub">Canonical source: Nicholas Chamandy's <a class="cite" href="https://eng.lyft.com/experimentation-in-a-ridesharing-marketplace-b39db027a66e">"Experimentation in a Ridesharing Marketplace" (Lyft Engineering, September 2, 2016)</a>. Later Lyft posts (~2022) document growing use of <strong><a class="j" href="jargon.html#contextual-bandits">contextual bandits</a></strong> for adaptive, always-on optimisation.</p>
    <div class="goal"><span>Goal</span><br>Decide features by data-backed expected impact — choose by outcome, not by to-do list or opinion.</div>
  </header>

  <div class="eli">
    <div class="lbl">🎓 8th-grade version</div>
    Lyft uses three kinds of experiments depending on <em>what</em> they're testing. (1) For simple things like a button color, they do a normal <b>A/B test</b> — half see new, half see old. (2) For anything that touches the pool of drivers (like a new pricing rule), they can't use A/B because the two groups affect each other through the shared driver pool — so they <b>switch the whole city</b> on for 30 minutes, off for 30 minutes, on again, and compare. (3) For decisions that happen non-stop (which notification to send, which homepage to show), they use a <b>bandit</b> — an algorithm that keeps trying different options and sends more traffic to the ones that work best. The rule: the design must match the kind of decision, or your number is lying to you.
  </div>

  <nav class="toc">
    <a href="#headline">Honest headline</a>
    <a href="#anatomy">Three designs Lyft uses</a>
    <a href="#mechanism">How it picks work</a>
    <a href="#apply">Apply to a sheet</a>
    <a href="methodologies-comparison.html" style="color:var(--blue);font-weight:700">Comparison table →</a>
  </nav>

  <div class="finding" id="headline">
    <h2>The honest headline: A/B for diffs, time/region splits for marketplace, bandits for "no off-switch" optimisation</h2>
    <p>Lyft's engineering blog frames experimentation as <b>three design choices that map to three kinds of decisions</b>. Most teams pick one design and apply it everywhere; Lyft argues the design has to match what's being changed. The decision rule (an overall evaluation criterion, or OEC, plus a measured causal effect) is constant; the <em>shape of the test</em> isn't.</p>
    <p>The least-discussed part of Lyft's story is <b>contextual bandits</b>: instead of a discrete A/B test that ends, the algorithm continuously learns the best variant <em>per context</em>. It's the design for decisions that never stop (Kirn 2022 names customer communications, such as which notification to send), where the goal isn't "ship the winner" but "always serve the right one".</p>
  </div>

  <!-- ANATOMY -->
  <h2 class="sec" id="anatomy">Three experimental designs Lyft uses, and when</h2>
  <p class="secsub">The pattern: <em>match the design to the interference structure of the feature</em>. Skipping this step is the most common cause of biased marketplace results.</p>

  <div class="three">
    <div class="concept">
      <div class="ab">A/B</div>
      <h3>User-level A/B — for non-marketplace changes</h3>
      <p>Standard split. Used for UI, copy, recommendations, anything that doesn't change a shared resource (driver supply / pricing). The default and the cheapest design.</p>
    </div>
    <div class="concept">
      <div class="ab">Time-split</div>
      <h3>Time-split / region-split — for marketplace changes</h3>
      <p>For features that touch the supply pool, randomise <b>time windows</b> (or whole regions) to treatment / control. This avoids the network interference described in Chamandy's 2016 post, where riders in both arms draw on the same drivers; same family as DoorDash switchback.</p>
    </div>
    <div class="concept">
      <div class="ab">Bandits</div>
      <h3>Contextual bandits — for always-on optimisation</h3>
      <p>Bandits route more traffic to better-performing variants <em>over time</em>, balancing exploration and exploitation. Per Kirn 2022 (<em>"Challenges in Experimentation"</em>), Lyft is investing in <strong>"always-on adaptive experimentation platforms"</strong> including parameter tuning and <a class="j" href="jargon.html#reinforcement-learning">reinforcement learning</a> approaches — with the explicit use-case named in the post being <em>"customer communications"</em>. Other applications are not detailed in the published posts.</p>
    </div>
  </div>
  <div class="src">Sources: <a class="cite" href="https://eng.lyft.com/experimentation-in-a-ridesharing-marketplace-b39db027a66e">Nicholas Chamandy — "Experimentation in a Ridesharing Marketplace" (Lyft Engineering, September 2, 2016)</a> — the canonical post · <a class="cite" href="https://eng.lyft.com/challenges-in-experimentation-be9ab98a7ef4">John Kirn — "Challenges in Experimentation"</a> follow-up · later bandit work covered in <a class="cite" href="https://www.infoq.com/news/2022/05/lyft-improving-experiments/">InfoQ's 2022 summary of Lyft's experiments-beyond-A/B</a>.</div>
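  <p>A minimal sketch of the time-split idea above, assuming one city, equal-length windows, and a metric already averaged per window. The names <code>assign_windows</code> and <code>switchback_effect</code> are hypothetical, not from Lyft's posts, and analysing at the window level is only a simple stand-in for cluster-robust standard errors.</p>

```python
import random
from math import sqrt
from statistics import mean, stdev


def assign_windows(n_windows, seed=0):
    """Randomly assign each time window to treatment (True) or control (False)."""
    rng = random.Random(seed)
    arms = [i % 2 == 0 for i in range(n_windows)]  # balanced split before shuffling
    rng.shuffle(arms)
    return arms


def switchback_effect(window_means, arms):
    """Difference in means, treating each window (not each ride) as one observation.

    window_means[i] is the metric averaged over every ride in window i.
    Needs at least two windows per arm so the standard deviation is defined.
    """
    treated = [m for m, is_treated in zip(window_means, arms) if is_treated]
    control = [m for m, is_treated in zip(window_means, arms) if not is_treated]
    effect = mean(treated) - mean(control)
    se = sqrt(stdev(treated) ** 2 / len(treated) + stdev(control) ** 2 / len(control))
    return effect, se


# Hypothetical data: mean wait (seconds) in four consecutive 30-minute windows.
arms = [False, True, False, True]      # control, treatment, control, treatment
wait_seconds = [100, 94, 102, 96]
effect, se = switchback_effect(wait_seconds, arms)
print(effect, round(se, 2))

# In a real test the schedule would be randomised rather than hand-written:
schedule = assign_windows(48, seed=3)  # e.g. one day of 30-minute windows
```

  <p>With these made-up numbers the estimate is a 6-second drop in wait time with a standard error of about 1.41. Four windows is far too few for a real decision; the point is only that the window, not the ride, is the unit being compared.</p>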

  <!-- MECHANISM -->
  <h2 class="sec" id="mechanism">How Lyft actually picks the right design</h2>

  <div class="step"><div class="num">1</div><div><h3>Classify the change first</h3><p>Does it touch a shared resource? Is it a one-time decision or continuous? The answer routes the design. <b>Skipping this is where most marketplace experiments go wrong.</b></p></div></div>
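  <div class="step"><div class="num">2</div><div><h3>If shared resource → time/region split</h3><p>Randomise whole time windows (or whole regions) to treatment or control, so everyone in the city shares one configuration at a time. Use <a class="j" href="jargon.html#cluster-robust-se">cluster-robust standard errors</a> when analysing, because rides in the same window are correlated.</p></div></div>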
  <div class="step"><div class="num">3</div><div><h3>If continuous-decision → bandit</h3><p>Define the reward signal and the contexts. The bandit allocates traffic over time; you read aggregate performance, not a one-shot lift.</p></div></div>
  <div class="step"><div class="num">4</div><div><h3>Otherwise → ordinary A/B</h3><p>The simple case. Same OEC + guardrails discipline as every other platform on this list.</p></div></div>
  <div class="step"><div class="num">5</div><div><h3>Read the OEC. Decide. Iterate.</h3><p>The decision rule is unchanged across designs — but the <em>estimate</em> is only trustworthy if the design was right for the change.</p></div></div>
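  <p>The routing in steps 1 to 4 can be sketched as a small function. The two boolean inputs and the name <code>choose_design</code> are illustrative assumptions; in practice the classification is a judgment call made per backlog item, not something Lyft's posts describe as code.</p>

```python
from enum import Enum


class Design(Enum):
    AB = "A/B"
    SWITCHBACK = "switchback"   # time-split or region-split
    BANDIT = "bandit"


def choose_design(touches_shared_resource: bool, continuous_decision: bool) -> Design:
    """Route a change to a design, checking interference first."""
    if touches_shared_resource:      # step 2: driver supply, pricing, dispatch
        return Design.SWITCHBACK
    if continuous_decision:          # step 3: the decision never naturally ends
        return Design.BANDIT
    return Design.AB                 # step 4: the simple case


# Hypothetical backlog rows: (touches shared resource, continuous decision)
backlog = {
    "Aggressive surge multiplier": (True, False),
    "Notification template selector": (False, True),
    "New cancellation flow": (False, False),
}
for feature, flags in backlog.items():
    print(feature, "->", choose_design(*flags).value)
```

  <p>The shared-resource check comes first on purpose: a change that is both continuous and touches supply still gets a split design in this sketch, because interference is what biases the estimate.</p>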

  <!-- APPLY TO A SHEET -->
  <h2 class="sec" id="apply">Apply to a feature sheet</h2>
  <p class="secsub">A Lyft-style ledger has a <em>design</em> column that's set before any other: <b>A/B</b> for ordinary user-level changes, <b>switchback</b> for shared-resource marketplace features, <b>bandit</b> for continuous always-on decisions. The wrong design produces a biased lift, so picking the right design matters more than picking the right metric.</p>

  <div class="note" style="background:var(--teal-soft);border-left-color:var(--teal)"><b>Try it Monday morning (30 minutes).</b> Open your team's backlog. Add a <em>design</em> column. For each item, write A/B / switchback / bandit. Three questions to route: <b>(1)</b> does it touch a shared resource? → switchback. <b>(2)</b> is the right answer context-dependent and the decision never naturally ends? → bandit. <b>(3)</b> otherwise → A/B. Tag every backlog row. Now look at how many items in your backlog have ever been tested with the right design — that's the gap Lyft's first-design-then-feature rule closes.</div>

  <div class="note" style="background:var(--blue-soft);border-left-color:var(--blue);font-size:13.5px"><b>Quick glossary for the columns below.</b> <b>Time-split / switchback</b> = the whole region runs treatment for some windows, control for others; comparing windows estimates the marketplace-corrected effect. <b>Contextual bandit</b> = an adaptive algorithm that allocates more traffic to better-performing variants while still exploring; "the winner" is allowed to be context-dependent. <b>Reward signal</b> = the OEC for a bandit. <b><a class="j" href="jargon.html#exploration-vs-exploitation">Exploration vs exploitation</a></b> = the tradeoff a bandit balances — keep trying variants we might be wrong about (exploration) vs send traffic to the current winner (exploitation).</div>

  <h3 style="font-family:Georgia,serif;font-size:18px;margin:18px 0 8px">Worked example — an experiment ledger snapshot (Lyft-style)</h3>
  <p style="font-size:13.5px;color:var(--ink-soft);margin:0 0 12px">Eight tests across all three designs. Numbers illustrative.</p>

  <div style="overflow-x:auto;margin:14px 0">
    <table style="border-collapse:collapse;width:100%;font-size:13px;background:#fff;border:1px solid var(--line);border-radius:10px;overflow:hidden">
      <thead><tr style="background:var(--ink);color:#f3efe6;font-size:11.5px;letter-spacing:.05em;text-transform:uppercase"><th style="padding:9px 10px;text-align:left">Feature</th><th style="padding:9px 10px;text-align:left">Design</th><th style="padding:9px 10px;text-align:left">OEC</th><th style="padding:9px 10px;text-align:left">Guardrails</th><th style="padding:9px 10px;text-align:left">Sample (units × window)</th><th style="padding:9px 10px;text-align:left">Result</th><th style="padding:9px 10px;text-align:left">Decision</th></tr></thead>
      <tbody>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">New ETA algorithm</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Switchback</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Wait time</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Cancellation, ETA accuracy</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">30 cities × 30-min</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Wait −6s sig</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Driver-incentive variant</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Switchback</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Driver utilization</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Earnings/hr</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">20 cities × 60-min</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Util +0.8pp</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">New cancellation flow</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">A/B</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Rider cancel rate</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Refund volume</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">500k users</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Cancels −3%</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">New rider onboarding tutorial</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">A/B</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)"><a class="j" href="jargon.html#dn-activation">D7</a> retention</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Time-to-first-ride</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">200k users</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">D7 +1.4pp</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Ship</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Homepage-hero selector</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Bandit (contextual)</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Tap-through rate</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Bounce</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">always-on</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Tap-through +1.1%, picks winner per context</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Adopt — leave on</td></tr>
        <tr style="background:#e6ecf6"><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Notification template selector</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Bandit</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Open rate</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Unsubscribes</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">always-on</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Opens +5% (bandit-adaptive)</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--blue);font-weight:700">Adopt</td></tr>
        <tr><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-weight:600">Pool reshape</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Switchback</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Trips/$</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Rider satisfaction</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">15 cities × 60-min</td><td style="padding:9px 10px;border-bottom:1px solid var(--line)">Flat</td><td style="padding:9px 10px;border-bottom:1px solid var(--line);font-family:Georgia,serif;color:var(--gold);font-weight:700">Iterate</td></tr>
        <tr><td style="padding:9px 10px;font-weight:600">Aggressive surge multiplier</td><td style="padding:9px 10px">Switchback</td><td style="padding:9px 10px">Trips completed</td><td style="padding:9px 10px">Rider satisfaction</td><td style="padding:9px 10px">10 cities × 30-min</td><td style="padding:9px 10px">Trips OK BUT satisfaction −4pts</td><td style="padding:9px 10px;font-family:Georgia,serif;color:var(--accent);font-weight:700">Kill</td></tr>
      </tbody>
    </table>
  </div>

  <div class="note" style="background:var(--accent-soft);border-left-color:var(--accent)"><b>The most important reading skill on this page.</b> Notice the <em>Adopt — leave on</em> verdict for the homepage-hero and notification-template bandit rows. Those aren't normal "ship the winner and turn the test off" decisions — there <em>is no single winner</em>, because the right homepage/template depends on user context (time of day, location, history). The bandit keeps running forever, learning per-context. A team's instinct is to "ship the winning variant and move on" — but for these problems, doing so throws away most of the lift. The bandit's value isn't a one-time number; it's the ongoing routing decision.</div>

  <div class="note"><b>Decision rule.</b> The design column is the first gate: anything touching shared resources (dispatch, surge, pool, allocation) <strong>cannot</strong> run as a user-level A/B without biasing the estimate. Bandits are the right shape when "the winner" depends on context (which hero, which template) and the test never naturally ends. Standard A/B is fine for everything else. The aggressive-surge row shows the marketplace kill: trips held up but rider satisfaction fell 4 points, and a breached guardrail overrides an acceptable OEC.</div>

  <div class="note"><b>The transferable insight: bandits aren't only for ML teams.</b> A small adaptive system that "tries variants and routes more traffic to the better one" is buildable in a sprint for problems like which homepage hero to show, which merchant to feature, or which notification template performs best. For decisions that don't have a single right answer over time, bandits beat fixed A/B because they keep learning.</div>
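  <p>As a rough illustration of how small such a system can be, here is an epsilon-greedy sketch that keeps separate statistics per context. Epsilon-greedy is chosen here for brevity; the Lyft posts do not say which bandit algorithm they use, and the class name, contexts, templates, and reward function below are all hypothetical.</p>

```python
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    """Per-context epsilon-greedy: mostly serve the best-known variant, sometimes explore."""

    def __init__(self, variants, epsilon=0.1, seed=0):
        self.variants = list(variants)
        self.epsilon = epsilon            # share of traffic reserved for exploration
        self.rng = random.Random(seed)
        self.pulls = defaultdict(int)     # (context, variant) -> times served
        self.total = defaultdict(float)   # (context, variant) -> summed reward

    def mean_reward(self, context, variant):
        n = self.pulls[(context, variant)]
        return self.total[(context, variant)] / n if n else 0.0

    def best(self, context):
        return max(self.variants, key=lambda v: self.mean_reward(context, v))

    def choose(self, context):
        untried = [v for v in self.variants if self.pulls[(context, v)] == 0]
        if untried:                             # try every variant once per context
            return self.rng.choice(untried)
        if self.epsilon > self.rng.random():    # explore
            return self.rng.choice(self.variants)
        return self.best(context)               # exploit

    def update(self, context, variant, reward):
        self.pulls[(context, variant)] += 1
        self.total[(context, variant)] += reward


# Hypothetical reward: a template "works" (reward 1.0) only in its matching context.
works_in = {"morning": "short", "evening": "detailed"}
bandit = EpsilonGreedyBandit(["short", "detailed"])
for step in range(400):
    context = "morning" if step % 2 == 0 else "evening"
    variant = bandit.choose(context)
    bandit.update(context, variant, 1.0 if variant == works_in[context] else 0.0)

print({c: bandit.best(c) for c in works_in})
```

  <p>In this toy run the bandit ends up preferring the short template in the morning and the detailed one in the evening, which is the "no single winner" behaviour described above. A production system would also need guardrail monitoring and a noisier, delayed reward signal, neither of which is modelled here.</p>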
  <footer>
    Companion to <a href="impact-consumer-companies.html#measure">← Consumer case studies · Measure don't estimate</a> · <a href="methodologies-comparison.html">All methods compared</a> · siblings: <a href="doordash-switchback-framework.html">DoorDash switchback</a> · <a href="uber-xp-framework.html">Uber XP</a><br>
    <b>Grounded in</b> <a href="https://eng.lyft.com/experimentation-in-a-ridesharing-marketplace-b39db027a66e">Nicholas Chamandy, "Experimentation in a Ridesharing Marketplace, Part 1: Interference Across a Network" (Lyft Engineering, September 2, 2016)</a> and <a href="https://eng.lyft.com/challenges-in-experimentation-be9ab98a7ef4">John Kirn, "Challenges in Experimentation" (Lyft Engineering, April 19, 2022)</a>. <b>Verbatim from Chamandy 2016:</b> the bias-variance framing of randomization granularity (sessions → blocks → cities → time intervals), and that "alternating time intervals between global control and global treatment configurations was a successful strategy for the Lyft Marketplace team in the early days." <b>Verbatim from Kirn 2022:</b> "the second most common type of experiment we run is a time split test… commonly used to test pricing, ETAs, routing, mapping"; "region split tests divide treatments by geography and use a synthetic control to conduct causal inference"; "always-on adaptive experimentation platforms… reinforcement learning approaches, including contextual bandits"; bandits "especially with customer communications"; Multiple Hypothesis Testing correction via Benjamini-Hochberg. <b>Added by us, not in Lyft's posts:</b> the 8-row worked example with mixed designs (clearly illustrative), the verdict labels including "Adopt — leave on" for bandits, the in-page glossary, and the "Try it Monday" exercise.<br>
    <em>Note: a 2026-05-26 rewrite removed an earlier reference to "an academic paper on real-time driver-supply reinforcement learning at SIGKDD 2022" — that specific paper claim was not in the two fetched Lyft Engineering posts, and Kirn 2022 names "customer communications" as the bandit use-case, not driver supply.</em>
  </footer>
</div>
</body>
</html>