
Commit 8ed98a7

abrichr and claude authored
docs: reframe positioning with multi-pillar strategy (#991)
* docs: reframe positioning with multi-pillar strategy and honest scoping
  - README: Replace "Demo-Conditioned Prompting" with "Trajectory-Conditioned Disambiguation" showing the 2x2 experimental matrix (prompting validated, fine-tuning in progress). Add OpenCUA industry validation.
  - Landing page strategy: Lead with capture-to-deployment pipeline, add specialization pillar, update competitor table for March 2026 landscape (Agent S3, OpenCUA, Browser Use, CUA/Bytebot). Add honesty notes for proof points.

* fix: correct OpenCUA attribution to macOS a11y code reuse
  OpenCUA reused OpenAdapt's macOS accessibility tree capture code (AX API traversal functions + oa_atomacos dependency), not the full capture-to-deployment pipeline. The recorder architecture came from DuckTrack. Updated README, landing page strategy, competitor table, and proof points to reflect this accurately.
  Evidence: arxiv.org/html/2508.09123v3 Section 2.2, OpenCUA README "Acknowledge" section.

* fix: review fixes — accuracy, claims, and add builders section
  - Use 46.7% consistently (not 33-47% range)
  - Change "core goal" to "planned" in 2x2 matrix
  - Drop "superhuman" for Agent S3 (barely above human baseline)
  - Fix possessive "our" to "OpenAdapt's" in competitor table
  - Add "Built for Builders" section for non-technical users
  - Renumber subsequent sections

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 60b081d commit 8ed98a7

2 files changed (+88, -44 lines)

README.md

Lines changed: 11 additions & 9 deletions
@@ -164,18 +164,20 @@ OpenAdapt follows a streamlined **Demonstrate → Learn → Execute** pipeline:
 - **Act**: Execute validated actions with safety gates
 - **Evaluate**: Measure success with `openadapt-evals` and feed results back for improvement
 
-### Core Approach: Demo-Conditioned Prompting
+### Core Approach: Trajectory-Conditioned Disambiguation
 
-OpenAdapt explores **demonstration-conditioned automation** - "show, don't tell":
+Zero-shot VLMs fail on GUI tasks not due to lack of capability, but due to **ambiguity in UI affordances**. OpenAdapt resolves this by conditioning agents on human demonstrations — "show, don't tell."
 
-| Traditional Agent | OpenAdapt Agent |
-|-------------------|-----------------|
-| User writes prompts | User records demonstration |
-| Ambiguous instructions | Grounded in actual UI |
-| Requires prompt engineering | Reduced prompt engineering |
-| Context-free | Context from similar demos |
+| | No Retrieval | With Retrieval |
+|---|---|---|
+| **No Fine-tuning** | 46.7% (zero-shot baseline) | **100%** (validated, n=45) |
+| **Fine-tuning** | Standard SFT (baseline) | **Demo-conditioned FT** (planned) |
 
-**Retrieval powers BOTH training AND evaluation**: Similar demonstrations are retrieved as context for the VLM. In early experiments on a controlled macOS benchmark, this improved first-action accuracy from 46.7% to 100% - though all 45 tasks in that benchmark share the same navigation entry point. See the [publication roadmap](docs/publication-roadmap.md) for methodology and limitations.
+The bottom-right cell is OpenAdapt's unique value: training models to **use** demonstrations they haven't seen before, combining retrieval with fine-tuning for maximum accuracy. Phase 2 (retrieval-only prompting) is validated; Phase 3 (demo-conditioned fine-tuning) is in progress.
+
+**Validated result**: On a controlled macOS benchmark (45 System Settings tasks sharing a common navigation entry point), demo-conditioned prompting improved first-action accuracy from 46.7% to 100%. A length-matched control (+11.1 pp only) confirms the benefit is semantic, not token-length. See the [research thesis](https://github.com/OpenAdaptAI/openadapt-ml/blob/main/docs/research_thesis.md) for methodology and the [publication roadmap](docs/publication-roadmap.md) for limitations.
+
+**Industry validation**: [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight, XLANG Lab) [reused OpenAdapt's macOS accessibility capture code](https://arxiv.org/html/2508.09123v3) in their AgentNetTool, but uses demos only for model training — not runtime conditioning. No open-source CUA framework currently does demo-conditioned inference, which remains OpenAdapt's architectural differentiator.
 
 ### Key Concepts

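The demo-conditioned prompting the README hunk describes (retrieve a similar demonstration, then condition the VLM prompt on it) can be sketched in a few lines. This is a hypothetical illustration, not OpenAdapt's actual API: `Demo`, `similarity`, `retrieve`, and `build_prompt` are stand-in names, and the token-overlap similarity is a placeholder for real embedding- or screenshot-based retrieval.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    task: str           # natural-language description of the recorded task
    actions: list[str]  # serialized action trace from the capture


def similarity(a: str, b: str) -> float:
    """Token-overlap (Jaccard) similarity; stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, library: list[Demo], k: int = 1) -> list[Demo]:
    """Return the k demos whose task description best matches the query."""
    return sorted(library, key=lambda d: similarity(query, d.task), reverse=True)[:k]


def build_prompt(task: str, demos: list[Demo]) -> str:
    """Prepend retrieved demonstrations so the VLM is demo-conditioned."""
    context = "\n\n".join(
        f"Demonstration: {d.task}\n" + "\n".join(d.actions) for d in demos
    )
    return f"{context}\n\nNow perform: {task}\nFirst action:"


library = [
    Demo("Enable Dark Mode in System Settings",
         ["click(System Settings)", "click(Appearance)", "click(Dark)"]),
    Demo("Change the wallpaper in System Settings",
         ["click(System Settings)", "click(Wallpaper)", "click(Choose)"]),
]
print(build_prompt("Turn on Dark Mode", retrieve("Turn on Dark Mode", library)))
```

A production system would match on screenshots as well as text, but the control flow (retrieve, then condition the prompt) is the same idea the 2x2 matrix labels "With Retrieval".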
docs/design/landing-page-strategy.md

Lines changed: 77 additions & 35 deletions
@@ -40,7 +40,9 @@ OpenAdapt has evolved from a monolithic application (v0.46.0) to a **modular met
 4. **Open Source (MIT License)**: Full transparency, no vendor lock-in
 
 **Key Innovation**:
-- **Trajectory-conditioned disambiguation of UI affordances** - validated experiment showing 33% -> 100% first-action accuracy with demo conditioning
+- **Trajectory-conditioned disambiguation of UI affordances** — the only open-source CUA framework that conditions agents on recorded demonstrations at runtime (validated: 46.7% → 100% first-action accuracy)
+- **Specialization over scale** — fine-tuned Qwen3-VL-2B outperforms Claude Sonnet 4.5 and GPT-5.1 on action accuracy (42.9% vs 11.2% vs 23.2%) on an internal benchmark
+- **Capture-to-deployment pipeline** — record → retrieve → train → deploy. [OpenCUA](https://github.com/xlang-ai/OpenCUA) (NeurIPS 2025 Spotlight) [reused OpenAdapt's macOS accessibility capture code](https://arxiv.org/html/2508.09123v3) in their AgentNetTool
 - **Set-of-Marks (SoM) mode**: 100% accuracy on synthetic benchmarks using element IDs instead of coordinates
 
 ### 1.2 Current Landing Page Assessment
@@ -218,25 +220,31 @@ Why: Clear 3-step process, action-oriented
 
 ### 3.3 Key Differentiators to Emphasize
 
-1. **Demonstration-Based Learning**
-   - Not: "Use natural language to describe tasks"
-   - But: "Just do the task and OpenAdapt learns from watching"
-   - Proof: 33% -> 100% first-action accuracy with demo conditioning
+1. **Capture-to-Deployment Pipeline**
+   - Not: "Prompt the AI to do your task"
+   - But: "Record the task once. OpenAdapt handles the rest — retrieval, training, deployment."
+   - Proof: 7 modular packages (capture, ML, evals, grounding, retrieval, privacy, viewer); OpenCUA (NeurIPS 2025) [reused OpenAdapt's macOS a11y capture code](https://arxiv.org/html/2508.09123v3)
 
-2. **Model Agnostic**
+2. **Demonstration-Conditioned Agents**
+   - Not: "Zero-shot reasoning about what to click"
+   - But: "Agents conditioned on relevant demos — at inference AND during training"
+   - Proof: 46.7% → 100% first-action accuracy with demo conditioning (validated, n=45). No other open-source CUA framework does runtime demo conditioning.
+   - Note: This is first-action accuracy on tasks sharing a navigation entry point. Multi-step and cross-domain evaluation is ongoing on Windows Agent Arena.
+
+3. **Specialization Over Scale**
+   - Not: "Use the biggest model available"
+   - But: "A 2B model fine-tuned on your workflows outperforms frontier models"
+   - Proof: Qwen3-VL-2B (42.9%) vs Claude Sonnet 4.5 (11.2%) on action accuracy (internal benchmark, synthetic login task)
+
+4. **Model Agnostic**
    - Not: "Works with [specific AI]"
   - But: "Your choice: Claude, GPT-4V, Gemini, Qwen, or custom models"
   - Proof: Adapters for multiple VLM backends
 
-3. **Runs Anywhere**
-   - Not: "Cloud-powered automation"
-   - But: "Run locally, in the cloud, or hybrid"
-   - Proof: CLI-based, works offline
-
-4. **Open Source**
-   - Not: "Try our free tier"
-   - But: "MIT licensed, fully transparent, community-driven"
-   - Proof: GitHub, PyPI, active Discord
+5. **Runs Anywhere & Open Source**
+   - Not: "Cloud-powered automation" / "Try our free tier"
+   - But: "Run locally, in the cloud, or hybrid. MIT licensed, fully transparent."
+   - Proof: CLI-based, works offline; GitHub, PyPI, active Discord
 
 ### 3.4 Messaging Framework
 
@@ -256,14 +264,17 @@ Why: Clear 3-step process, action-oriented
 
 ## 4. Competitive Positioning
 
-### 4.1 Primary Competitors
+### 4.1 Primary Competitors (Updated March 2026)
 
 | Competitor | Strengths | Weaknesses | Our Advantage |
 |------------|-----------|------------|---------------|
-| **Anthropic Computer Use** | First-mover, Claude integration, simple API | Proprietary, cloud-only, no customization | Open source, model-agnostic, trainable |
-| **UI-TARS (ByteDance)** | Strong benchmark scores, research backing | Closed source, not productized | Open source, deployable, extensible |
-| **Traditional RPA (UiPath, etc.)** | Enterprise-proven, large ecosystems | Brittle selectors, no AI reasoning, expensive | AI-first, learns from demos, affordable |
-| **GPT-4V + Custom Code** | Powerful model, flexibility | Requires building everything, no structure | Ready-made SDK, training pipeline, benchmarks |
+| **Anthropic Computer Use** | 72.5% OSWorld (near-human), simple API | Proprietary, cloud-only, no customization, per-action cost | Open source, model-agnostic, trainable, runs locally |
+| **Agent S3 (Simular)** | 72.6% OSWorld, open source | Zero-shot only, no demo conditioning, no fine-tuning pipeline | Demo-conditioned agents, capture-to-train pipeline |
+| **OpenCUA (XLANG Lab)** | NeurIPS Spotlight, 45% OSWorld, open models (7B-72B) | Zero-shot at inference — demos used only for training, not runtime | Runtime demo conditioning (unique); OpenCUA reused OpenAdapt's macOS a11y code |
+| **Browser Use** | 50k+ GitHub stars, 89% WebVoyager | Browser-only, no desktop, no training pipeline | Full desktop support, fine-tuning, demo library |
+| **UI-TARS (ByteDance)** | Local models (2B-72B), Apache 2.0 | No demo conditioning, no capture pipeline | End-to-end record→train→deploy, demo retrieval |
+| **CUA / Bytebot** | Container infra, YC-backed | Infrastructure-only, no ML training pipeline | Full pipeline: capture + train + eval + deploy |
+| **Traditional RPA (UiPath, etc.)** | Enterprise-proven, UiPath Screen Agent #1 on OSWorld | Brittle selectors, expensive ($10K+/yr), requires scripting | AI-first, learns from demos, open source |
 
 ### 4.2 Positioning Statement
 
@@ -352,24 +363,46 @@ Show it once. Let it handle the rest.
 ```
 ## Why OpenAdapt?
 
-### Demonstration-Based Learning
-No prompt engineering required. OpenAdapt learns from how you actually do tasks.
-[Stat: 33% -> 100% first-action accuracy with demo conditioning]
+### Record Once, Automate Forever
+Capture any workflow. OpenAdapt retrieves relevant demos to guide agents
+AND trains specialized models on your recordings.
+[Stat: 46.7% → 100% first-action accuracy with demo conditioning]
+
+### Small Models, Big Results
+A 2B model fine-tuned on your workflows outperforms frontier models.
+Specialization beats scale for GUI tasks.
+[Stat: 42.9% action accuracy (Qwen 2B FT) vs 11.2% (Claude Sonnet 4.5)]
 
-### Model Agnostic
+### Model Agnostic & Open Source
 Your choice of AI: Claude, GPT-4V, Gemini, Qwen-VL, or fine-tune your own.
-Not locked to any single provider.
+MIT licensed. Run locally, in the cloud, or hybrid.
+```
+
+### 5.5 For Builders Section
 
-### Run Anywhere
-CLI-based, works offline. Deploy locally, in the cloud, or hybrid.
-Your data stays where you want it.
+````
+## Built for Builders
 
-### Fully Open Source
-MIT licensed. Transparent, auditable, community-driven.
-No vendor lock-in, ever.
+### Show it once. Done.
+Record yourself doing a task. OpenAdapt handles the rest.
+No code, no prompts, no configuration.
+
+### Three commands
+```bash
+pip install openadapt
+openadapt capture start --name my-task  # Record
+openadapt run --capture my-task         # Replay with AI
 ```
 
-### 5.5 For Developers Section
+### Works with the AI you already use
+Claude, GPT-4V, Gemini, Qwen — pick your model.
+Or let OpenAdapt train a small one that runs on your laptop.
+
+### Your data stays yours
+Everything runs locally. Nothing leaves your machine unless you want it to.
+````
+
+### 5.6 For Developers Section
 
 ````
 ## Built for Developers
@@ -406,7 +439,7 @@ Compare your models against published baselines.
 [View Documentation] [GitHub Repository]
 ````
 
-### 5.6 For Enterprise Section
+### 5.7 For Enterprise Section
 
 ```
 ## Enterprise-Ready Automation
@@ -429,7 +462,7 @@ Custom development, training, and support packages available.
 [Contact Sales: sales@openadapt.ai]
 ```
 
-### 5.7 Use Cases Section (Refined)
+### 5.8 Use Cases Section (Refined)
 
 **Current**: Generic industry grid
 
@@ -486,13 +519,22 @@ Example: Onboarding guides for complex internal tools.
 
 ### 6.3 Proof Points to Include
 
-- "33% -> 100% first-action accuracy with demonstration conditioning"
+- "46.7% → 100% first-action accuracy with demo conditioning (n=45, same model, no training)"
+- "Fine-tuned 2B model outperforms Claude Sonnet 4.5 on action accuracy (42.9% vs 11.2%, internal benchmark)"
+- "OpenCUA (NeurIPS 2025 Spotlight) reused OpenAdapt's macOS accessibility capture code in AgentNetTool"
+- "Only open-source CUA framework with runtime demo-conditioned inference"
 - "[X,XXX] PyPI downloads this month" (dynamic)
 - "[XXX] GitHub stars" (dynamic)
 - "7 modular packages, 1 unified CLI"
 - "Integrated with Windows Agent Arena, WebArena, OSWorld benchmarks"
 - "MIT licensed, fully open source"
 
+**Honesty notes for proof points**:
+- The 46.7%→100% result is first-action only on 45 macOS tasks sharing the same navigation entry point
+- The 42.9% vs 11.2% result is on a controlled internal synthetic login benchmark (~3 UI elements)
+- Multi-step episode success on real-world benchmarks (WAA) is under active evaluation
+- Frame these as "validated signal" not "production-proven"
+
 ---
 
 ## 7. Wireframe Concepts

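For reviewers checking the proof points above, a first-action accuracy figure like the 46.7% baseline reduces to a simple per-episode comparison. This sketch is illustrative only: the function name is not the `openadapt-evals` API, and the prediction data is invented to reproduce the reported ratio (21 of 45 correct); only the 45-task count and the 46.7% figure come from the document.

```python
def first_action_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of episodes whose first predicted action matches the reference."""
    assert len(predictions) == len(references) and predictions
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(predictions)


# Illustrative data: 45 tasks sharing one navigation entry point, with a
# zero-shot agent getting 21 of 45 first actions right (~46.7%).
references = ["click(System Settings)"] * 45
zero_shot = ["click(System Settings)"] * 21 + ["click(Spotlight)"] * 24
print(round(first_action_accuracy(zero_shot, references), 3))  # 0.467
```

Note this metric says nothing about the remaining steps of an episode, which is exactly the caveat the honesty notes flag for multi-step evaluation.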