docs: add experimental roadmap and evidence context to vision
- Add 2x2 experimental matrix (retrieval × fine-tuning) to Core Thesis
- Add evidence context to benchmark table: note it's an internal synthetic
benchmark (~3 UI elements) that validates the pipeline, not real-world
performance. Link to openadapt-evals for ongoing WAA/OSWorld evaluation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|**Fine-tuning**| Standard SFT (baseline) |**Demo-conditioned FT** (unique value) |
Phase 2 (retrieval-only) is validated. Phase 3 (demo-conditioned fine-tuning — training models to *use* demonstrations they haven't seen) is the core planned work.
## Architecture
OpenAdapt treats workflows as **state machines**, not pixel sequences:
@@ -37,12 +46,16 @@ abstract states in visual reality.
## Why Specialization Wins
Results on an internal synthetic login benchmark (~3 UI elements, ~20-30 training examples):
| Model | Action Accuracy | Click Hit Rate |
|-------|-----------------|----------------|
| Qwen 2B Fine-tuned | 42.9% | 100% |
| Claude Sonnet 4.5 | 11.2% | 0% |
| GPT-5.1 | 23.2% | 66.7% |
> **Note**: These results validate that the training pipeline works and that specialization provides signal. They do not yet represent real-world performance — evaluation on standard benchmarks (WAA, OSWorld) is ongoing via [openadapt-evals](https://github.com/OpenAdaptAI/openadapt-evals).
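The two metrics in the table are not defined in this document. As a rough sketch only, one plausible reading is that action accuracy is the fraction of predicted action types that match the expected ones, and click hit rate is the fraction of predicted clicks that land inside the target element's bounding box. The names `Box`, `action_accuracy`, and `click_hit_rate` below are hypothetical and are not taken from OpenAdapt or openadapt-evals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a UI element (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def action_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of steps where the predicted action type equals the expected one.

    Assumed definition; the benchmark may score actions differently.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def click_hit_rate(clicks: list[tuple[float, float]], targets: list[Box]) -> float:
    """Fraction of predicted clicks that fall inside their target box.

    Assumed definition; the benchmark may use a different hit criterion.
    """
    if len(clicks) != len(targets):
        raise ValueError("clicks and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(clicks, targets))
    return hits / len(targets)


# Toy data for illustration only; not benchmark results.
acc = action_accuracy(["click", "type", "click", "done"],
                      ["click", "type", "type", "done"])
hit = click_hit_rate([(10, 10), (50, 50)],
                     [Box(0, 0, 20, 20), Box(0, 0, 20, 20)])
```

Under these assumed definitions a model can score a high click hit rate with lower action accuracy, since the hit rate only looks at click coordinates while action accuracy looks at every step.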
General-purpose models must infer workflow structure from scratch on every query.
OpenAdapt agents *know* the structure — they just navigate it.
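To make the state-machine framing concrete, here is a minimal sketch, assuming a workflow can be written as named abstract states joined by named actions. The `Workflow` class and the login states are hypothetical illustrations and do not reflect OpenAdapt's actual data model.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """A workflow as a graph of abstract states joined by named actions."""
    # transitions[state][action] -> next state
    transitions: dict[str, dict[str, str]] = field(default_factory=dict)

    def add(self, state: str, action: str, next_state: str) -> None:
        self.transitions.setdefault(state, {})[action] = next_state

    def plan(self, start: str, goal: str) -> list[str]:
        """Breadth-first search for the shortest action sequence to the goal."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            state, actions = queue.popleft()
            if state == goal:
                return actions
            for action, nxt in self.transitions.get(state, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, actions + [action]))
        raise ValueError(f"no path from {start!r} to {goal!r}")


# Hypothetical login workflow, loosely mirroring the synthetic benchmark.
login = Workflow()
login.add("login_page", "type_username", "username_entered")
login.add("username_entered", "type_password", "credentials_entered")
login.add("credentials_entered", "click_submit", "logged_in")

plan = login.plan("login_page", "logged_in")
```

With the structure known up front, the remaining work is grounding each abstract state and action in what is on screen, rather than rediscovering the sequence of steps on every query.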