Commit 1aa4d46

abrichr and claude committed
docs: add experimental roadmap and evidence context to vision

- Add 2x2 experimental matrix (retrieval × fine-tuning) to Core Thesis
- Add evidence context to benchmark table: note it's an internal synthetic benchmark (~3 UI elements) that validates the pipeline, not real-world performance. Link to openadapt-evals for ongoing WAA/OSWorld evaluation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent aeac459 commit 1aa4d46

1 file changed

Lines changed: 13 additions & 0 deletions

docs/vision.md

@@ -19,6 +19,15 @@ because:
2. Specialization compounds: the model learns *your* apps, *your* patterns
3. The workflow structure is known — the model just navigates it

**The experimental roadmap** (see [research thesis](research_thesis.md)):

| | No Retrieval | With Retrieval |
|---|---|---|
| **No Fine-tuning** | 33–47% (baseline) | **100%** (validated) |
| **Fine-tuning** | Standard SFT (baseline) | **Demo-conditioned FT** (unique value) |

Phase 2 (retrieval-only) is validated. Phase 3 (demo-conditioned fine-tuning — training models to *use* demonstrations they haven't seen) is the core planned work.
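
The 2×2 matrix above can be read as two independent switches, retrieval and fine-tuning. The following is a minimal sketch of that reading, not code from OpenAdapt: the names `Condition`, `STATUS`, and `experimental_matrix` are hypothetical, and the cell descriptions are copied from the table without adding any new results.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Condition:
    """One cell of the retrieval x fine-tuning matrix (hypothetical type)."""

    fine_tuning: bool
    retrieval: bool

    @property
    def label(self) -> str:
        ft = "Fine-tuning" if self.fine_tuning else "No Fine-tuning"
        rt = "With Retrieval" if self.retrieval else "No Retrieval"
        return f"{ft} / {rt}"


# Cell descriptions copied verbatim from the table; nothing here is measured.
STATUS = {
    Condition(fine_tuning=False, retrieval=False): "33–47% (baseline)",
    Condition(fine_tuning=False, retrieval=True): "100% (validated)",
    Condition(fine_tuning=True, retrieval=False): "Standard SFT (baseline)",
    Condition(fine_tuning=True, retrieval=True): "Demo-conditioned FT (unique value)",
}


def experimental_matrix() -> list:
    """Enumerate all four cells, row by row (fine-tuning is the row axis)."""
    return [Condition(ft, rt) for ft, rt in product([False, True], repeat=2)]


if __name__ == "__main__":
    for cond in experimental_matrix():
        print(f"{cond.label}: {STATUS[cond]}")
```

Modelling the two axes as booleans keeps the four conditions exhaustive by construction, which makes it harder to compare cells that differ in more than one factor by accident.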

## Architecture

OpenAdapt treats workflows as **state machines**, not pixel sequences:
@@ -37,12 +46,16 @@ abstract states in visual reality.

## Why Specialization Wins

Results on an internal synthetic login benchmark (~3 UI elements, ~20–30 training examples):

| Model | Action Accuracy | Click Hit Rate |
|-------|-----------------|----------------|
| Qwen 2B Fine-tuned | 42.9% | 100% |
| Claude Sonnet 4.5 | 11.2% | 0% |
| GPT-5.1 | 23.2% | 66.7% |

> **Note**: These results validate that the training pipeline works and that specialization provides signal. They do not yet represent real-world performance — evaluation on standard benchmarks (WAA, OSWorld) is ongoing via [openadapt-evals](https://github.com/OpenAdaptAI/openadapt-evals).

General-purpose models must infer workflow structure from scratch on every query.
OpenAdapt agents *know* the structure — they just navigate it.
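
The document does not define the two metrics in the benchmark table. One plausible reading, stated here purely as an assumption, is that action accuracy counts steps whose predicted action type matches the expected one, and click hit rate counts predicted click points that land inside the target element's bounding box. The sketch below illustrates that reading only; `Step`, `action_accuracy`, and `click_hit_rate` are hypothetical names and are not taken from OpenAdapt or openadapt-evals.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


@dataclass
class Step:
    """One evaluated step (hypothetical record, not an OpenAdapt type)."""

    expected_action: str                              # e.g. "click", "type", "done"
    predicted_action: str
    target_box: Optional[Box] = None                  # set only for click targets
    predicted_xy: Optional[Tuple[float, float]] = None


def action_accuracy(steps: List[Step]) -> float:
    """Fraction of steps whose predicted action type equals the expected one."""
    if not steps:
        return 0.0
    correct = sum(s.expected_action == s.predicted_action for s in steps)
    return correct / len(steps)


def click_hit_rate(steps: List[Step]) -> float:
    """Fraction of expected clicks whose predicted point lies inside the target box."""
    clicks = [
        s for s in steps
        if s.expected_action == "click" and s.target_box is not None
    ]
    if not clicks:
        return 0.0
    hits = 0
    for s in clicks:
        if s.predicted_xy is None:
            continue  # no point predicted counts as a miss
        x, y = s.predicted_xy
        left, top, right, bottom = s.target_box
        if left <= x <= right and top <= y <= bottom:
            hits += 1
    return hits / len(clicks)
```

Under this reading the two numbers can diverge, as they do in the table: a model can hit every click target while still choosing the wrong action type on other steps.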