
Commit aeed4bf

abrichr and claude authored
Add GitHub Actions CI workflow (#6)
* feat: implement unified baseline adapters for VLM comparison

  Add comprehensive unified baseline adapters supporting Claude, GPT, and Gemini models across multiple evaluation tracks:

  Provider Abstraction (models/providers/):
  - BaseAPIProvider ABC with common interface for all providers
  - AnthropicProvider: Base64 PNG encoding, Messages API
  - OpenAIProvider: Data URL format, Chat Completions API
  - GoogleProvider: Native PIL Image support, GenerateContent API
  - Factory functions: get_provider(), resolve_model_alias()
  - Error hierarchy: ProviderError, AuthenticationError, RateLimitError

  Baseline Module (baselines/):
  - TrackType enum: TRACK_A (coords), TRACK_B (ReAct), TRACK_C (SoM)
  - TrackConfig dataclass with factory methods for each track
  - BaselineConfig with model alias resolution and registry
  - PromptBuilder for track-specific system prompts and user content
  - UnifiedResponseParser supporting JSON, function-call, and PyAutoGUI formats
  - ElementRegistry for element_id to coordinate conversion

  Benchmark Integration:
  - UnifiedBaselineAgent wrapping UnifiedBaselineAdapter for benchmarks
  - Converts BenchmarkObservation -> adapter format -> BenchmarkAction
  - Support for all three tracks via --track flag

  CLI Commands (baselines/cli.py):
  - run: single-model prediction with track selection
  - compare: multi-model comparison on the same task
  - list-models: show available models and providers

  All 92 tests pass. Ready for model comparison experiments.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: lower Python requirement from 3.12 to 3.10 for meta-package compatibility

  All dependencies (torch, transformers, pillow, peft, etc.) support Python 3.10+. The 3.12 requirement was unnecessarily restrictive and broke `pip install openadapt[all]` on Python 3.10 and 3.11.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* ci: add GitHub Actions test workflow

  Add a CI workflow that runs on pull requests and main-branch pushes:
  - Tests on Python 3.10 and 3.11
  - Runs on Ubuntu and macOS
  - Uses uv for dependency management
  - Runs the ruff linter and formatter
  - Runs the pytest suite

  Matches the pattern used by openadapt-viewer and follows OpenAdapt ecosystem conventions.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: add defaults to CanonicalEpisode fields for test compatibility

  - cluster_id: default=0
  - cluster_centroid_distance: default=0.0
  - internal_similarity: default=1.0

  Fixes 1/14 test failures in test_segmentation.py

* fix: resolve all linting and formatting errors

  - Fix unused imports in baselines, benchmarks, and ingest modules
  - Fix ambiguous variable names (renamed 'l' to 'loss'/'line')
  - Add missing time import in benchmarks/cli.py
  - Move warnings import to top of file in benchmarks/cli.py
  - Add noqa comments for intentional code patterns
  - Fix bare except clause in lambda_labs.py
  - Add Episode to TYPE_CHECKING imports in grounding.py
  - Rename conflicting local variable in config.py
  - Fix undefined _build_nav_links in viewer.py
  - Run ruff format to ensure consistent code style

  All ruff checks now pass.

* fix: update parquet export tests to match new schema

  - Change 'goal' to 'instruction' in column assertions
  - Change 'image_path' to 'screenshot_path' to match schema

* docs: add intellectual honesty qualifiers and fix badge URL

  - Update badge URL to use filename-based path (from PR #3)
  - Add qualifiers to claims about accuracy and performance (from PR #4)
  - Clarify that results are from synthetic benchmarks, not production UIs
  - Add disclaimers about extrapolating synthetic results to real-world performance
  - Update section titles to indicate the synthetic nature of benchmarks

  This consolidates the documentation improvements from PRs #3 and #4.

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
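As a quick orientation to the API this commit adds, here is a minimal usage sketch assembled from the module docstring and CLI code in the diffs below. The model alias string and the screenshot-loading step are illustrative assumptions, not part of this commit:

# Minimal sketch; from_alias, predict, is_valid, to_dict, and parse_error
# all appear in the diffs below. The alias and input image are hypothetical.
from PIL import Image

from openadapt_ml.baselines import TrackConfig, UnifiedBaselineAdapter

screenshot = Image.open("screenshot.png")  # hypothetical screenshot file
track = TrackConfig.track_c()  # Track C: Set-of-Mark element selection
adapter = UnifiedBaselineAdapter.from_alias("claude-sonnet-4.5", track=track)

action = adapter.predict(screenshot, "Log in with the test account")
if action.is_valid:
    print(action.to_dict())  # e.g. {"action": "CLICK", "x": 0.5, "y": 0.3}
else:
    print(action.parse_error)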
1 parent 979b9f7 commit aeed4bf

75 files changed

Lines changed: 9371 additions & 3026 deletions


.github/workflows/test.yml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+name: Test
+
+on:
+  pull_request:
+    branches:
+      - '**'
+  push:
+    branches:
+      - main
+
+jobs:
+  test:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [ubuntu-latest, macos-latest]
+        python-version: ['3.10', '3.11']
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v4
+        with:
+          version: "latest"
+
+      - name: Set up Python ${{ matrix.python-version }}
+        run: uv python install ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: uv sync --all-extras
+
+      - name: Run ruff linter (check)
+        run: uv run ruff check openadapt_ml/
+
+      - name: Run ruff formatter (check)
+        run: uv run ruff format --check openadapt_ml/
+
+      - name: Run pytest
+        run: uv run pytest tests/ -v
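To reproduce these CI checks locally, run the same commands the workflow steps invoke (assuming uv is installed):

uv sync --all-extras
uv run ruff check openadapt_ml/
uv run ruff format --check openadapt_ml/
uv run pytest tests/ -v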

README.md

Lines changed: 16 additions & 11 deletions
@@ -1,6 +1,6 @@
 # OpenAdapt-ML
 
-[![Build Status](https://github.com/OpenAdaptAI/openadapt-ml/workflows/Publish%20to%20PyPI/badge.svg?branch=main)](https://github.com/OpenAdaptAI/openadapt-ml/actions)
+[![Build Status](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/publish.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/publish.yml)
 [![PyPI version](https://img.shields.io/pypi/v/openadapt-ml.svg)](https://pypi.org/project/openadapt-ml/)
 [![Downloads](https://img.shields.io/pypi/dm/openadapt-ml.svg)](https://pypi.org/project/openadapt-ml/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -10,9 +10,9 @@ OpenAdapt-ML is a **model-agnostic, domain-agnostic ML engine** for GUI
 automation agents. It sits above **TRL + Unsloth** (which we use directly for training performance) and provides the GUI-specific layer:
 
 - **Episode semantics**: Step/action/observation alignment, screenshot-action coupling, termination handling
-- **Demo-conditioned inference**: Retrieval-augmented prompting (validated: 33% → 100% first-action accuracy)
+- **Demo-conditioned inference**: Retrieval-augmented prompting (in early experiments: 46.7% -> 100% first-action accuracy on a controlled macOS benchmark where all 45 tasks share the same navigation entry point - see [publication roadmap](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/docs/publication-roadmap.md) for methodology and limitations)
 - **Benchmark adapters**: WAA today, OSWorld/WebArena planned
-- **VLM adapters**: Updated with leading GUI-agent SOTA open-source models
+- **VLM adapters**: Supports open-source GUI-agent models (Qwen3-VL, Qwen2.5-VL)
 - **Training pipeline**: TRL + Unsloth integration for 2x faster training with 50% less VRAM
 
 OpenAdapt-ML is **not** a training framework, optimizer, hardware orchestrator, or experiment manager. We use TRL/Unsloth, Lambda Labs/Azure, and W&B/MLflow for those.
@@ -204,7 +204,9 @@ simple login flow.
 ### 5.1 Synthetic scenarios
 
 OpenAdapt-ML includes synthetic UI generators for structured GUI automation benchmarks.
-Currently two scenarios are supported:
+Currently two scenarios are supported.
+
+> **Note:** These are **synthetic, controlled benchmarks** designed for rapid iteration and debugging, not real-world evaluation. The 100% accuracy results below demonstrate that fine-tuning works on simple scenarios with known ground truth - they do not represent performance on production UIs or standard benchmarks like WAA. See section 14 (Limitations) for details.
 
 #### Login Scenario (6 steps, 3 elements)
 
@@ -387,15 +389,18 @@ It exposes step-level performance metrics, which let us visually answer the ques
 | Claude Sonnet 4.5 | API | 0.121 | 0.757 | 0.000 |
 | GPT-5.1 | API | 0.183 | 0.057 | 0.600 |
 
-**Key findings:**
-1. **Fine-tuning delivers massive gains**: Both 2B and 8B models show 2-3x improvement in action accuracy after fine-tuning
-2. **Small fine-tuned models beat large APIs**: Qwen3-VL-2B FT (469% base) outperforms both Claude Sonnet 4.5 (121%) and GPT-5.1 (183%)
-3. **Precision matters**: Fine-tuned models have excellent click precision (85-100% hit rate, <0.05 coord error) while API models struggle with the action format
-4. **Size vs specialization**: The fine-tuned 2B model outperforms the general-purpose Claude Sonnet 4.5, showing that domain-specific fine-tuning trumps raw model size
+**Observations on synthetic login benchmark:**
+
+> **Important:** These findings are from a synthetic benchmark with ~3 UI elements and a fixed action sequence. They demonstrate the training pipeline works, but should not be extrapolated to real-world GUI automation performance. Evaluation on standard benchmarks (WAA, WebArena) is ongoing.
+
+1. **Fine-tuning improves synthetic task performance**: Both 2B and 8B models show 2-3x improvement in action accuracy after fine-tuning on this specific task
+2. **On this synthetic benchmark, fine-tuned models outperform zero-shot API calls**: This is expected since the task is simple and the models are trained on it directly
+3. **Coordinate precision is learnable**: Fine-tuned models achieve low coordinate error on training distribution
+4. **API models struggle with custom action format**: Without fine-tuning on the specific DSL (CLICK/TYPE/DONE), API models have high format-error rates
 
-### 6.4 Set-of-Marks (SoM) Mode: 100% Accuracy
+### 6.4 Set-of-Marks (SoM) Mode: 100% Accuracy on Synthetic Benchmarks
 
-With **Set-of-Marks** visual prompting, fine-tuned Qwen3-VL-2B achieves **100% accuracy** on both login (6-step) and registration (12-step) scenarios:
+With **Set-of-Marks** visual prompting, fine-tuned Qwen3-VL-2B achieves **100% accuracy** on both login (6-step) and registration (12-step) synthetic scenarios. Note that these are controlled, toy benchmarks with a small number of UI elements:
 
 | Scenario | Steps | Elements | Action Acc | Element Acc | Episode Success |
 |----------|-------|----------|------------|-------------|-----------------|

openadapt_ml/baselines/__init__.py

Lines changed: 74 additions & 8 deletions
@@ -7,6 +7,12 @@
 - Track B: ReAct-style reasoning with coordinates
 - Track C: Set-of-Mark element selection
 
+Based on SOTA patterns from:
+- Claude Computer Use (Anthropic)
+- Microsoft UFO/UFO2
+- OSWorld benchmark
+- Agent-S/Agent-S2 (Simular AI)
+
 Usage:
     from openadapt_ml.baselines import UnifiedBaselineAdapter, BaselineConfig, TrackConfig
 
@@ -21,35 +27,95 @@
         track=TrackConfig.track_c(),
     )
     adapter = UnifiedBaselineAdapter(config)
+
+    # OSWorld-compatible configuration
+    config = BaselineConfig(
+        provider="openai",
+        model="gpt-5.2",
+        track=TrackConfig.osworld_compatible(),
+    )
+
+    # Parse responses directly
+    from openadapt_ml.baselines import UnifiedResponseParser, ElementRegistry
+
+    parser = UnifiedResponseParser()
+    action = parser.parse('{"action": "CLICK", "x": 0.5, "y": 0.3}')
+
+    # With element ID to coordinate conversion
+    registry = ElementRegistry.from_a11y_tree(tree)
+    parser = UnifiedResponseParser(element_registry=registry)
+    action = parser.parse_and_resolve('{"action": "CLICK", "element_id": 17}')
 """
 
 from openadapt_ml.baselines.adapter import UnifiedBaselineAdapter
 from openadapt_ml.baselines.config import (
+    # Enums
+    ActionOutputFormat,
+    CoordinateSystem,
+    TrackType,
+    # Config dataclasses
     BaselineConfig,
     ModelSpec,
+    ReActConfig,
+    ScreenConfig,
+    SoMConfig,
     TrackConfig,
-    TrackType,
+    # Registry
     MODELS,
-    get_model_spec,
+    # Helper functions
     get_default_model,
+    get_model_spec,
+)
+from openadapt_ml.baselines.parser import (
+    ElementRegistry,
+    ParsedAction,
+    UIElement,
+    UnifiedResponseParser,
+)
+from openadapt_ml.baselines.prompts import (
+    # System prompts
+    FORMAT_PROMPTS,
+    SYSTEM_PROMPT_OSWORLD,
+    SYSTEM_PROMPT_TRACK_A,
+    SYSTEM_PROMPT_TRACK_B,
+    SYSTEM_PROMPT_TRACK_C,
+    SYSTEM_PROMPT_UFO,
+    SYSTEM_PROMPTS,
+    # Builder class
+    PromptBuilder,
 )
-from openadapt_ml.baselines.parser import ParsedAction, UnifiedResponseParser
-from openadapt_ml.baselines.prompts import PromptBuilder
 
 __all__ = [
     # Main adapter
     "UnifiedBaselineAdapter",
-    # Configuration
-    "BaselineConfig",
-    "TrackConfig",
+    # Configuration - Enums
+    "ActionOutputFormat",
+    "CoordinateSystem",
     "TrackType",
+    # Configuration - Dataclasses
+    "BaselineConfig",
     "ModelSpec",
+    "ReActConfig",
+    "ScreenConfig",
+    "SoMConfig",
+    "TrackConfig",
+    # Configuration - Registry
     "MODELS",
-    "get_model_spec",
+    # Configuration - Functions
     "get_default_model",
+    "get_model_spec",
     # Parsing
+    "ElementRegistry",
     "ParsedAction",
+    "UIElement",
     "UnifiedResponseParser",
     # Prompts
+    "FORMAT_PROMPTS",
     "PromptBuilder",
+    "SYSTEM_PROMPT_OSWORLD",
+    "SYSTEM_PROMPT_TRACK_A",
+    "SYSTEM_PROMPT_TRACK_B",
+    "SYSTEM_PROMPT_TRACK_C",
+    "SYSTEM_PROMPT_UFO",
+    "SYSTEM_PROMPTS",
 ]
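To make the Track C flow in the docstring above concrete, here is a minimal sketch of what element_id-to-coordinate resolution involves. The registry layout and helper name are assumptions for illustration, not the library's actual internals:

# Hypothetical sketch of Set-of-Mark resolution: a registry maps integer
# element IDs to normalized center coordinates, and a parsed CLICK carrying
# an element_id is resolved to (x, y). Names here are illustrative only.
import json


def resolve_element_click(
    raw: str, registry: dict[int, tuple[float, float]]
) -> tuple[float, float]:
    payload = json.loads(raw)
    assert payload["action"] == "CLICK"
    return registry[payload["element_id"]]  # normalized (x, y) in [0, 1]


# Example: element 17 is centered at (0.42, 0.31) in normalized coordinates.
print(resolve_element_click('{"action": "CLICK", "element_id": 17}', {17: (0.42, 0.31)}))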

openadapt_ml/baselines/adapter.py

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 import os
 from typing import TYPE_CHECKING, Any
 
-from openadapt_ml.baselines.config import BaselineConfig, TrackConfig, get_model_spec
+from openadapt_ml.baselines.config import BaselineConfig, TrackConfig
 from openadapt_ml.baselines.parser import ParsedAction, UnifiedResponseParser
 from openadapt_ml.baselines.prompts import PromptBuilder
 from openadapt_ml.config import settings

openadapt_ml/baselines/cli.py

Lines changed: 42 additions & 26 deletions
@@ -8,11 +8,10 @@
 import json
 import sys
 from pathlib import Path
-from typing import Any
 
 import click
 
-from openadapt_ml.baselines.config import MODELS, TrackConfig, TrackType
+from openadapt_ml.baselines.config import MODELS
 
 
 @click.group()
@@ -23,35 +22,41 @@ def baselines():
 
 @baselines.command()
 @click.option(
-    "--model", "-m",
+    "--model",
+    "-m",
     required=True,
     type=click.Choice(list(MODELS.keys())),
     help="Model alias to use",
 )
 @click.option(
-    "--track", "-t",
+    "--track",
+    "-t",
     type=click.Choice(["A", "B", "C"]),
     default="A",
     help="Evaluation track (A=coords, B=ReAct, C=SoM)",
 )
 @click.option(
-    "--image", "-i",
+    "--image",
+    "-i",
     type=click.Path(exists=True),
     required=True,
     help="Screenshot image path",
 )
 @click.option(
-    "--goal", "-g",
+    "--goal",
+    "-g",
    required=True,
    help="Task goal/instruction",
 )
 @click.option(
-    "--output", "-o",
+    "--output",
+    "-o",
     type=click.Path(),
     help="Output JSON file path",
 )
 @click.option(
-    "--verbose", "-v",
+    "--verbose",
+    "-v",
     is_flag=True,
     help="Enable verbose output",
 )
@@ -122,7 +127,9 @@ def run(
         click.echo(f"Thought: {action.thought}")
     else:
         click.echo(f"Parse Error: {action.parse_error}")
-        click.echo(f"Raw Response: {action.raw_response[:200] if action.raw_response else 'None'}...")
+        click.echo(
+            f"Raw Response: {action.raw_response[:200] if action.raw_response else 'None'}..."
+        )
 
     # Save output if requested
     if output:
@@ -140,29 +147,34 @@
 
 @baselines.command()
 @click.option(
-    "--models", "-m",
+    "--models",
+    "-m",
     required=True,
     help="Comma-separated model aliases",
 )
 @click.option(
-    "--track", "-t",
+    "--track",
+    "-t",
     type=click.Choice(["A", "B", "C"]),
     default="A",
     help="Evaluation track",
 )
 @click.option(
-    "--image", "-i",
+    "--image",
+    "-i",
     type=click.Path(exists=True),
     required=True,
     help="Screenshot image path",
 )
 @click.option(
-    "--goal", "-g",
+    "--goal",
+    "-g",
     required=True,
     help="Task goal/instruction",
 )
 @click.option(
-    "--output", "-o",
+    "--output",
+    "-o",
     type=click.Path(),
     help="Output JSON file path",
 )
@@ -221,23 +233,27 @@ def compare(
             adapter = UnifiedBaselineAdapter.from_alias(model, track=track_config)
             action = adapter.predict(screenshot, goal)
 
-            results.append({
-                "model": model,
-                "success": action.is_valid,
-                "action": action.to_dict(),
-                "error": action.parse_error,
-            })
+            results.append(
+                {
+                    "model": model,
+                    "success": action.is_valid,
+                    "action": action.to_dict(),
+                    "error": action.parse_error,
+                }
+            )
 
             status = "OK" if action.is_valid else "FAILED"
             click.echo(f" {status}: {action.action_type}")
 
         except Exception as e:
-            results.append({
-                "model": model,
-                "success": False,
-                "action": None,
-                "error": str(e),
-            })
+            results.append(
+                {
+                    "model": model,
+                    "success": False,
+                    "action": None,
+                    "error": str(e),
+                }
+            )
             click.echo(f" ERROR: {e}")
 
     # Summary table
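For reference, the per-model records that compare accumulates have the shape below (keys taken from the dicts in the diff above; the model aliases and JSON serialization for the --output file are illustrative assumptions):

# Illustrative result records from `compare`; aliases are hypothetical.
import json

results = [
    {"model": "claude-sonnet-4.5", "success": True,
     "action": {"action": "CLICK", "x": 0.5, "y": 0.3}, "error": None},
    {"model": "gpt-5.1", "success": False, "action": None,
     "error": "AuthenticationError: missing API key"},
]
print(json.dumps(results, indent=2))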
