* feat: implement unified baseline adapters for VLM comparison
Add comprehensive unified baseline adapters supporting Claude, GPT, and Gemini
models across multiple evaluation tracks:
Provider Abstraction (models/providers/):
- BaseAPIProvider ABC with common interface for all providers
- AnthropicProvider: Base64 PNG encoding, Messages API
- OpenAIProvider: Data URL format, Chat Completions API
- GoogleProvider: Native PIL Image support, GenerateContent API
- Factory functions: get_provider(), resolve_model_alias()
- Error hierarchy: ProviderError, AuthenticationError, RateLimitError
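The commit lists the provider pieces but not their code; below is a minimal sketch of how such an abstraction could fit together. Class, factory, and exception names come from the commit itself, while the `generate` signature, the alias table values, and the per-provider stubs are illustrative assumptions rather than the actual implementation.
```python
from abc import ABC, abstractmethod


class ProviderError(Exception):
    """Base class for provider failures."""


class AuthenticationError(ProviderError):
    """Missing or invalid API credentials."""


class RateLimitError(ProviderError):
    """Provider rejected the request due to rate limiting."""


class BaseAPIProvider(ABC):
    """Common interface shared by the Anthropic, OpenAI, and Google providers."""

    @abstractmethod
    def generate(self, prompt: str, image_png: bytes, model: str) -> str:
        """Send one prompt plus screenshot to the provider and return raw model text."""


class AnthropicProvider(BaseAPIProvider):
    def generate(self, prompt, image_png, model):
        raise NotImplementedError  # real class sends base64-encoded PNG via the Messages API


class OpenAIProvider(BaseAPIProvider):
    def generate(self, prompt, image_png, model):
        raise NotImplementedError  # real class embeds the image as a data URL for Chat Completions


class GoogleProvider(BaseAPIProvider):
    def generate(self, prompt, image_png, model):
        raise NotImplementedError  # real class passes a PIL Image to GenerateContent


# Hypothetical alias table for illustration; the real registry lives in the baselines config.
_MODEL_ALIASES = {"claude": "claude-sonnet-4.5", "gpt": "gpt-5.1", "gemini": "gemini-flash"}


def resolve_model_alias(alias: str) -> str:
    return _MODEL_ALIASES.get(alias, alias)


def get_provider(model: str) -> BaseAPIProvider:
    name = resolve_model_alias(model)
    if name.startswith("claude"):
        return AnthropicProvider()
    if name.startswith("gpt"):
        return OpenAIProvider()
    if name.startswith("gemini"):
        return GoogleProvider()
    raise ProviderError(f"no provider registered for {model!r}")
```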
Baseline Module (baselines/):
- TrackType enum: TRACK_A (coords), TRACK_B (ReAct), TRACK_C (SoM)
- TrackConfig dataclass with factory methods for each track
- BaselineConfig with model alias resolution and registry
- PromptBuilder for track-specific system prompts and user content
- UnifiedResponseParser supporting JSON, function-call, PyAutoGUI formats
- ElementRegistry for element_id to coordinate conversion
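As a rough sketch of the track abstraction (the enum values mirror the commit; the `TrackConfig` fields and factory method names are assumptions):
```python
from dataclasses import dataclass
from enum import Enum


class TrackType(Enum):
    TRACK_A = "coords"  # model predicts raw pixel coordinates
    TRACK_B = "react"   # ReAct-style reason-then-act prompting
    TRACK_C = "som"     # Set-of-Marks: model refers to numbered element IDs


@dataclass(frozen=True)
class TrackConfig:
    track: TrackType
    uses_element_ids: bool
    description: str

    @classmethod
    def for_track_a(cls) -> "TrackConfig":
        return cls(TrackType.TRACK_A, False, "Predict click coordinates directly from pixels.")

    @classmethod
    def for_track_b(cls) -> "TrackConfig":
        return cls(TrackType.TRACK_B, False, "Interleave reasoning and actions (ReAct).")

    @classmethod
    def for_track_c(cls) -> "TrackConfig":
        return cls(TrackType.TRACK_C, True, "Predict an element_id; ElementRegistry maps it to coordinates.")
```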
Benchmark Integration:
- UnifiedBaselineAgent wrapping UnifiedBaselineAdapter for benchmarks
- Converts BenchmarkObservation -> adapter format -> BenchmarkAction
- Support for all three tracks via --track flag
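A hedged sketch of the conversion flow described above; the `BenchmarkObservation`/`BenchmarkAction` field names and the adapter's `predict()` interface are assumed for illustration only:
```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BenchmarkAction:  # stand-in for the real benchmark action type
    kind: str
    x: Optional[float] = None
    y: Optional[float] = None
    text: Optional[str] = None


class UnifiedBaselineAgent:
    """Benchmark-facing wrapper around UnifiedBaselineAdapter (interface assumed)."""

    def __init__(self, adapter: Any, track: str) -> None:
        self.adapter = adapter
        self.track = track

    def act(self, observation: Any) -> BenchmarkAction:
        # BenchmarkObservation -> adapter input format
        request = {
            "screenshot": observation.screenshot,
            "instruction": observation.instruction,
            "track": self.track,
        }
        # The adapter runs the provider call and parses the raw response
        # (UnifiedResponseParser handles JSON / function-call / PyAutoGUI formats).
        parsed = self.adapter.predict(request)
        # parsed prediction -> BenchmarkAction
        return BenchmarkAction(
            kind=parsed["action"],
            x=parsed.get("x"),
            y=parsed.get("y"),
            text=parsed.get("text"),
        )
```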
CLI Commands (baselines/cli.py):
- run: Single model prediction with track selection
- compare: Multi-model comparison on same task
- list-models: Show available models and providers
All 92 tests pass. Ready for model comparison experiments.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: lower Python requirement from 3.12 to 3.10 for meta-package compatibility
All dependencies (torch, transformers, pillow, peft, etc.) support Python 3.10+.
The 3.12 requirement was unnecessarily restrictive and broke `pip install openadapt[all]`
on Python 3.10 and 3.11.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* ci: add GitHub Actions test workflow
Add CI workflow that runs on pull requests and main branch pushes:
- Tests on Python 3.10 and 3.11
- Runs on Ubuntu and macOS
- Uses uv for dependency management
- Runs ruff linter and formatter
- Runs pytest suite
Matches pattern used by openadapt-viewer and follows OpenAdapt ecosystem conventions.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: add defaults to CanonicalEpisode fields for test compatibility
- cluster_id: default=0
- cluster_centroid_distance: default=0.0
- internal_similarity: default=1.0
Fixes 1/14 test failures in test_segmentation.py
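For reference, the added defaults look roughly like this on the dataclass (field names and default values are from the commit; all other fields of the real class are omitted here):
```python
from dataclasses import dataclass


@dataclass
class CanonicalEpisode:
    # ...other (non-defaulted) fields omitted; only the defaults added by this commit are shown.
    cluster_id: int = 0
    cluster_centroid_distance: float = 0.0
    internal_similarity: float = 1.0
```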
* fix: resolve all linting and formatting errors
- Fix unused imports in baselines, benchmarks, and ingest modules
- Fix ambiguous variable names (renamed 'l' to 'loss'/'line')
- Add missing time import in benchmarks/cli.py
- Move warnings import to top of file in benchmarks/cli.py
- Add noqa comments for intentional code patterns
- Fix bare except clause in lambda_labs.py
- Add Episode to TYPE_CHECKING imports in grounding.py
- Rename conflicting local variable in config.py
- Fix undefined _build_nav_links in viewer.py
- Run ruff format to ensure consistent code style
All ruff checks now pass successfully.
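Two of the recurring patterns, illustrated generically rather than with the actual project code (the function and variable names below are made up):
```python
def emit_lines(lines: list[str]) -> None:
    # Before this commit: `for l in lines:` (ambiguous name) and a bare `except:` clause.
    # After: a descriptive loop variable and a narrowed exception type.
    for line in lines:
        try:
            print(line.strip())
        except Exception:
            continue
```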
* fix: update parquet export tests to match new schema
- Change 'goal' to 'instruction' in column assertions
- Change 'image_path' to 'screenshot_path' to match schema
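A minimal illustration of the updated assertions (the helper and the stand-in DataFrame are hypothetical; the real tests read an exported parquet file):
```python
import pandas as pd


def check_export_schema(df: pd.DataFrame) -> None:
    # The renamed columns the updated tests assert on.
    assert "instruction" in df.columns      # previously asserted "goal"
    assert "screenshot_path" in df.columns  # previously asserted "image_path"


# Stand-in frame for illustration only.
check_export_schema(pd.DataFrame({"instruction": ["log in"], "screenshot_path": ["step_0.png"]}))
```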
* docs: add intellectual honesty qualifiers and fix badge URL
- Update badge URL to use filename-based path (from PR #3)
- Add qualifiers to claims about accuracy and performance (from PR #4)
- Clarify that results are from synthetic benchmarks, not production UIs
- Add disclaimers about extrapolating synthetic results to real-world performance
- Update section titles to indicate synthetic nature of benchmarks
This consolidates the documentation improvements from PRs #3 and #4.
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
- **Demo-conditioned inference**: Retrieval-augmented prompting (in early experiments: 46.7% -> 100% first-action accuracy on a controlled macOS benchmark where all 45 tasks share the same navigation entry point - see [publication roadmap](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/docs/publication-roadmap.md) for methodology and limitations)
- **Training pipeline**: TRL + Unsloth integration for 2x faster training with 50% less VRAM
OpenAdapt-ML is **not** a training framework, optimizer, hardware orchestrator, or experiment manager. We use TRL/Unsloth, Lambda Labs/Azure, and W&B/MLflow for those.
@@ -204,7 +204,9 @@ simple login flow.
### 5.1 Synthetic scenarios

OpenAdapt-ML includes synthetic UI generators for structured GUI automation benchmarks.
-Currently two scenarios are supported:
+Currently two scenarios are supported.
+
+> **Note:** These are **synthetic, controlled benchmarks** designed for rapid iteration and debugging, not real-world evaluation. The 100% accuracy results below demonstrate that fine-tuning works on simple scenarios with known ground truth - they do not represent performance on production UIs or standard benchmarks like WAA. See section 14 (Limitations) for details.

#### Login Scenario (6 steps, 3 elements)
@@ -387,15 +389,18 @@ It exposes step-level performance metrics, which let us visually answer the ques
| Claude Sonnet 4.5 | API | 0.121 | 0.757 | 0.000 |
| GPT-5.1 | API | 0.183 | 0.057 | 0.600 |

-**Key findings:**
-1. **Fine-tuning delivers massive gains**: Both 2B and 8B models show 2-3x improvement in action accuracy after fine-tuning
-2. **Small fine-tuned models beat large APIs**: Qwen3-VL-2B FT (469% base) outperforms both Claude Sonnet 4.5 (121%) and GPT-5.1 (183%)
-3. **Precision matters**: Fine-tuned models have excellent click precision (85-100% hit rate, <0.05 coord error) while API models struggle with the action format
-4. **Size vs specialization**: The fine-tuned 2B model outperforms the general-purpose Claude Sonnet 4.5, showing that domain-specific fine-tuning trumps raw model size
+**Observations on synthetic login benchmark:**
+
+> **Important:** These findings are from a synthetic benchmark with ~3 UI elements and a fixed action sequence. They demonstrate the training pipeline works, but should not be extrapolated to real-world GUI automation performance. Evaluation on standard benchmarks (WAA, WebArena) is ongoing.
+
+1. **Fine-tuning improves synthetic task performance**: Both 2B and 8B models show 2-3x improvement in action accuracy after fine-tuning on this specific task
+2. **On this synthetic benchmark, fine-tuned models outperform zero-shot API calls**: This is expected since the task is simple and the models are trained on it directly
+3. **Coordinate precision is learnable**: Fine-tuned models achieve low coordinate error on training distribution
+4. **API models struggle with custom action format**: Without fine-tuning on the specific DSL (CLICK/TYPE/DONE), API models have high format-error rates

-### 6.4 Set-of-Marks (SoM) Mode: 100% Accuracy
+### 6.4 Set-of-Marks (SoM) Mode: 100% Accuracy on Synthetic Benchmarks

-With **Set-of-Marks** visual prompting, fine-tuned Qwen3-VL-2B achieves **100% accuracy** on both login (6-step) and registration (12-step) scenarios:
+With **Set-of-Marks** visual prompting, fine-tuned Qwen3-VL-2B achieves **100% accuracy** on both login (6-step) and registration (12-step) synthetic scenarios. Note that these are controlled, toy benchmarks with a small number of UI elements:

| Scenario | Steps | Elements | Action Acc | Element Acc | Episode Success |