<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Data and Learning Where it Matters for Contact-Rich Manipulation</title>
<meta name="description" content="We use expensive robot data only for the most challenging segment of a task and rely on non-robot data for free-space motion — for robust, generalizable, yet highly precise contact-rich manipulation.">
<meta property="og:title" content="Data and Learning Where it Matters for Contact-Rich Manipulation">
<meta property="og:description" content="Dense robot data only where it matters; planning everywhere else. 96% average success across four contact-rich tasks.">
<meta property="og:type" content="website">
<meta property="og:image" content="static/images/posters/long_horizon_task.jpg">

<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=Newsreader:ital,opsz,wght@0,6..72,400..600;1,6..72,400..500&display=swap" rel="stylesheet">
<link rel="stylesheet" href="static/css/index.css">
</head>
<body>

<!-- ============================ HERO ============================ -->
<header class="hero">
  <div class="wrap">
    <span class="venue-pill"><span class="dot"></span>Anonymous Submission &middot; Under Review</span>

    <h1 class="title">Data and Learning <span class="accent">Where&nbsp;it&nbsp;Matters</span> for Contact-Rich Manipulation</h1>

    <p class="authors">Anonymous Author(s)</p>
    <p class="author-note">Affiliation and author identities withheld for double-blind review</p>

    <p class="tagline">
      We use <span class="hl">expensive robot data only for the most challenging section</span> of the task,
      and rely on non-robot data wherever possible.
      The result is a policy that is <span class="hl2">robust and generalizable, yet very precise.</span>
    </p>

    <div class="btn-row">
      <a class="btn" href="paper.pdf" target="_blank" rel="noopener">
        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><path d="M14 2v6h6"/></svg>
        Paper (PDF)
      </a>
      <a class="btn alt" href="#money">
        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><polygon points="5 3 19 12 5 21 5 3"/></svg>
        Watch Video
      </a>
      <span class="btn soon">
        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="3" y="11" width="18" height="11" rx="2"/><path d="M7 11V7a5 5 0 0 1 10 0v4"/></svg>
        Code &amp; Data <span class="badge">Coming&nbsp;soon</span>
      </span>
    </div>

    <span class="sped">
      <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><polygon points="13 19 22 12 13 5 13 19"/><polygon points="2 19 11 12 2 5 2 19"/></svg>
      All videos are sped up.
    </span>
  </div>
</header>

<!-- ====================== MONEY VIDEO ====================== -->
<section class="money" id="money">
  <div class="wrap">
    <div class="money-frame reveal">
      <video muted loop playsinline controls preload="auto" poster="static/images/posters/long_horizon_task.jpg" data-autoplay>
        <source src="static/videos/long_horizon_task.mp4" type="video/mp4">
      </video>
    </div>
    <div class="money-cap reveal">
      <span class="tag">Long-horizon task</span>
      <p>Given natural-language prompts, the robot completes a long-horizon sequence of challenging high-precision tasks in a changing, cluttered scene, using robot data only where it is needed to achieve the required precision. For the rest of the motion, we rely on simple motion planning <span style="color:var(--accent-2);font-weight:700;">including collision avoidance</span>.</p>
      <p style="margin-top:18px;"><strong style="color:var(--ink);font-weight:700;">We never collected any data in this specific scene!</strong></p>
    </div>
  </div>
</section>

<!-- ====================== ABSTRACT + STATS ====================== -->
<section class="alt-bg" id="abstract">
  <div class="wrap narrow">
    <div class="section-head">
      <p class="eyebrow">Abstract</p>
    </div>
    <div class="abstract-card reveal">
      <p>
        Learned policies trained <span class="em">end-to-end on large datasets often remain brittle</span> in
        high-precision tasks and struggle with generalization. We find that these limitations largely stem from a lack
        of structure and focus in data collection. Our key insight is to leverage <span class="em">dense data
        collection only for the critical, contact-rich segment</span> of a task and to rely on traditional planning
        during simple free-space motion.
      </p>
      <p>
        We propose an <span class="em">automated data-collection scheme combined with offline deep reinforcement
        learning</span> for the critical segment, eliminating reliance on a teleoperator's skill and on online policy
        updates. Across four challenging real-world tasks, using only 2–2.5&nbsp;h of autonomous data collection per task, we
        achieve an average success rate of 96%, compared with 55% for the strongest baseline. Notably, performance remains
        high in <span class="em">out-of-distribution scenarios</span> where end-to-end approaches struggle.
      </p>
    </div>

    <div class="stats">
      <div class="stat reveal"><div class="num blue">1</div><div class="lab">Human demonstration to bootstrap</div></div>
      <div class="stat reveal"><div class="num">96%</div><div class="lab">Average success rate across four tasks</div></div>
      <div class="stat reveal"><div class="num blue">55%</div><div class="lab">Strongest baseline (end-to-end), same tasks</div></div>
      <div class="stat reveal"><div class="num">2–2.5h</div><div class="lab">Autonomous data collection per task</div></div>
    </div>
  </div>
</section>

<!-- ====================== OOD ====================== -->
<section id="robustness">
  <div class="wrap">
    <div class="section-head">
      <p class="eyebrow">Robustness &amp; Generalization</p>
      <p>Robot data and learning are used only where they are required to reach the precision our tasks demand. The rest of the task is solved without any robot data, so the system stays precise while generalizing to entirely novel scenes, settings where end-to-end policies typically fail.</p>
      <p style="margin-top:14px;"><strong style="color:var(--ink);">Our method generalizes across several dimensions</strong>: scene-level distractors (objects and backgrounds), target object location, pick-up object location, collision avoidance, dynamic grasping from a human hand, and sequential task compositions.</p>
    </div>

    <div class="vgrid cols-2">
      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/lego_ood_1.jpg" data-autoplay>
            <source src="static/videos/lego_ood_1.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k ood">Lego stacking · OOD</span><h4>Cluttered scene, distractors</h4><p>The workspace is filled with unseen objects and extra bricks — the policy still finds and stacks the target precisely.</p></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/lego_ood_2.jpg" data-autoplay>
            <source src="static/videos/lego_ood_2.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k ood">Lego stacking · OOD</span><h4>Picking from a human hand</h4><p>The policy works robustly when picking up the part from a moving human hand.</p></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/shelf_ood.jpg" data-autoplay>
            <source src="static/videos/shelf_ood.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k ood">Shelf stocking · OOD</span><h4>Scene changes</h4><p>A different table, lighting, and a mix of novel grocery items and distractors — the items are still stowed correctly.</p></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/siemens_ood_1.jpg" data-autoplay>
            <source src="static/videos/siemens_ood_1.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k ood">Fan-cover insertion · OOD</span><h4>Distractor objects &amp; clutter</h4><p>Surrounded by cables, boxes and stray parts, the policy still aligns and seats the cover with tight clearance.</p></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/siemens_ood_2.jpg" data-autoplay>
            <source src="static/videos/siemens_ood_2.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k ood">Fan-cover insertion · OOD</span><h4>Part handed over</h4><p>A human hands the part to the robot at a novel pose; pose estimation re-localizes the part and the insertion succeeds.</p></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/collision_avoidance.jpg" data-autoplay>
            <source src="static/videos/collision_avoidance.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k ood">Cluttered scene · Planning</span><h4>Collision avoidance</h4><p>In a densely cluttered workspace, motion planning routes the arm around obstacles to reach each target collision-free.</p></div>
      </div>
    </div>
  </div>
</section>

<!-- ====================== METHOD ====================== -->
<section class="alt-bg" id="method">
  <div class="wrap">
    <div class="section-head">
      <p class="eyebrow">Method</p>
      <h2 class="sec-title">Single Demo, Autonomous Data Collection, Deployment</h2>
    </div>

    <figure class="method-fig reveal">
      <img src="static/images/method.png" alt="Method overview: Initialization from a single demonstration, autonomous data collection at the critical segment, and sequential deployment combining planning and learning.">
      <figcaption>Dense data collection &amp; robot learning for the critical segment; the rest is solved without using any robot data.</figcaption>
    </figure>
  </div>
</section>

<!-- ====================== OURS (full rollouts) ====================== -->
<section id="results">
  <div class="wrap">
    <div class="section-head">
      <p class="eyebrow">Rollouts · Our Method</p>
      <h2 class="sec-title">High success rates</h2>
      <p>The key is to leverage dense data collection at the difficult segment, so that the policy never goes out of distribution for that part of the task. The success rates below are measured over 50 trials per task.</p>
      <p style="margin-top:18px;"><strong style="color:var(--ink);font-weight:700;">In the paper, we show that learning with offline DRL from densely collected data is required to achieve high success rates!</strong></p>

    </div>

    <div class="vgrid cols-3">
      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline controls preload="none" poster="static/images/posters/lego_ours.jpg" data-autoplay>
            <source src="static/videos/lego_ours.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k">Lego stacking</span><h4>94% success <span style="color:var(--ink-faint);font-weight:500;">(100% partial)</span></h4><p>Precise brick-on-brick insertion with tight clearance.</p></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline controls preload="none" poster="static/images/posters/shelf_ours.jpg" data-autoplay>
            <source src="static/videos/shelf_ours.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k">Shelf stocking</span><h4>98% success <span style="color:var(--ink-faint);font-weight:500;">(100% partial)</span></h4><p>Stowing a yellow cardboard salt box into a tightly packed box.</p></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline controls preload="none" poster="static/images/posters/siemens_ours.jpg" data-autoplay>
            <source src="static/videos/siemens_ours.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k">Fan-cover insertion</span><h4>96% success <span style="color:var(--ink-faint);font-weight:500;">(98% partial)</span></h4><p>Aligning and seating a PBT/PC-molded part under tight tolerances.</p></div>
      </div>
    </div>
  </div>
</section>

<!-- ====================== DATA COLLECTION ====================== -->
<section class="alt-bg" id="data-collection">
  <div class="wrap">
    <div class="section-head">
      <p class="eyebrow">(Almost) Autonomous Data Collection</p>
      <h2 class="sec-title">No teleoperator required</h2>
      <p>We show that dense, exploratory data collection combined with offline DRL is required to achieve high success rates on our tasks. Operator intervention is needed only to reset the scene in some tasks.</p>
      <p style="margin-top:18px;"><strong style="color:var(--ink);font-weight:700;">All policies shown above were trained on data collected in the setups shown below!</strong></p>
    </div>

    <div class="vgrid cols-3">
      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline controls preload="none" poster="static/images/posters/lego_data_collection.jpg" data-autoplay>
            <source src="static/videos/lego_data_collection.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k">Lego stacking</span><h4>Self-collected interaction data</h4></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline controls preload="none" poster="static/images/posters/shelf_data_collection.jpg" data-autoplay>
            <source src="static/videos/shelf_data_collection.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k">Shelf stocking</span><h4>Self-collected interaction data</h4></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline controls preload="none" poster="static/images/posters/siemens_datacollection.jpg" data-autoplay>
            <source src="static/videos/siemens_datacollection.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k">Fan-cover insertion</span><h4>Self-collected interaction data</h4></div>
      </div>
    </div>

    <div class="money-frame reveal" style="max-width:920px;margin-top:40px;">
      <video muted loop playsinline controls preload="none" poster="static/images/posters/single_demonstration_datacollection.jpg" data-autoplay>
        <source src="static/videos/single_demonstration_datacollection.mp4" type="video/mp4"></video>
    </div>
    <div class="money-cap reveal">
      <span class="tag">Single kinesthetic teaching demonstration</span>
      <p>The scene is calibrated using a single human demonstration from kinesthetic teaching. Operator intervention during data collection is only required for resetting the scene for some tasks.</p>
    </div>
  </div>
</section>

<!-- ====================== BASELINE FAILURES ====================== -->
<section id="baselines">
  <div class="wrap">
    <div class="section-head">
      <p class="eyebrow" style="color:#c0392b;">Baselines</p>
      <h2 class="sec-title">Comparison to end-to-end</h2>
      <p>The same high-precision and out-of-distribution tasks cause strong end-to-end baselines to fail — they lack
      precision at the contact segment and break down under distractors. Further, our method naturally provides a
      success classification through Q-function evaluation.</p>
    </div>

    <p class="reveal" style="text-align:center;max-width:1000px;margin:-22px auto 36px;"><strong style="color:var(--ink);font-weight:700;">It is not clear whether simply scaling end-to-end data collection is practical or would close the performance gap!</strong></p>

    <div class="vgrid cols-2">
      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/baselines_lack_precision.jpg" data-autoplay>
            <source src="static/videos/baselines_lack_precision.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k" style="color:#c0392b;">Lego stacking · Baseline</span><h4>Lacks precision</h4></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/baselines_ood_simple1.jpg" data-autoplay>
            <source src="static/videos/baselines_ood_simple1.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k" style="color:#c0392b;">Lego stacking · Baseline</span><h4>New object positions (OOD)</h4></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/baselines_ood_simple2.jpg" data-autoplay>
            <source src="static/videos/baselines_ood_simple2.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k" style="color:#c0392b;">Lego stacking · Baseline</span><h4>Added distractor bricks (OOD)</h4></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/baselines_ood_difficult1.jpg" data-autoplay>
            <source src="static/videos/baselines_ood_difficult1.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k" style="color:#c0392b;">Lego stacking · Baseline</span><h4>Clutter &amp; novel objects (OOD)</h4></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/baselines_ood_shelf.jpg" data-autoplay>
            <source src="static/videos/baselines_ood_shelf.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k" style="color:#c0392b;">Shelf stocking · Baseline</span><h4>Novel item in shelf (OOD)</h4></div>
      </div>

      <div class="vcard reveal">
        <div class="vid-wrap">
          <video muted loop playsinline preload="none" poster="static/images/posters/baselines_distractor_siemens.jpg" data-autoplay>
            <source src="static/videos/baselines_distractor_siemens.mp4" type="video/mp4"></video>
        </div>
        <div class="vmeta"><span class="k" style="color:#c0392b;">Fan-cover insertion · Baseline</span><h4>Similar-looking distractor object (OOD)</h4></div>
      </div>
    </div>

    <p style="max-width:880px;margin:36px auto 0;text-align:center;color:var(--ink-soft);">For instance, end-to-end policies fail to detect the target object correctly and instead grasp a similar-looking object (<strong style="color:var(--ink);">bottom right</strong>). End-to-end policies need large amounts of training data to avoid such failures, whereas models trained without any robot data, such as SAM3 for object detection, are far more robust by default.</p>
  </div>
</section>

<!-- ====================== FOOTER ====================== -->
<footer>
  <div class="wrap">
    <p class="ftitle">Data and Learning Where it Matters for Contact-Rich Manipulation</p>
    <p class="anon">Anonymous submission &middot; under double-blind review.</p>
    <p>All videos on this page are sped up. We will release all videos, training datasets, and evaluation datasets.</p>
    <p style="margin-top:18px;opacity:.7;">Page inspired by <a href="https://nerfies.github.io/" target="_blank" rel="noopener">Nerfies</a> and extended with Claude.</p>
  </div>
</footer>

<script src="static/js/index.js"></script>
</body>
</html>