* feat: implement unified baseline adapters for VLM comparison
Add comprehensive unified baseline adapters supporting Claude, GPT, and Gemini
models across multiple evaluation tracks:
Provider Abstraction (models/providers/):
- BaseAPIProvider ABC with common interface for all providers
- AnthropicProvider: Base64 PNG encoding, Messages API
- OpenAIProvider: Data URL format, Chat Completions API
- GoogleProvider: Native PIL Image support, GenerateContent API
- Factory functions: get_provider(), resolve_model_alias()
- Error hierarchy: ProviderError, AuthenticationError, RateLimitError
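The commit lists the provider pieces but not their code; below is a minimal sketch of how such an abstraction could fit together. Class, factory, and exception names come from the commit itself, while the `generate` signature, the alias table values, and the per-provider stubs are illustrative assumptions rather than the actual implementation.
```python
from abc import ABC, abstractmethod


class ProviderError(Exception):
    """Base class for provider failures."""


class AuthenticationError(ProviderError):
    """Missing or invalid API credentials."""


class RateLimitError(ProviderError):
    """Provider rejected the request due to rate limiting."""


class BaseAPIProvider(ABC):
    """Common interface shared by the Anthropic, OpenAI, and Google providers."""

    @abstractmethod
    def generate(self, prompt: str, image_png: bytes, model: str) -> str:
        """Send one prompt plus screenshot to the provider and return raw model text."""


class AnthropicProvider(BaseAPIProvider):
    def generate(self, prompt, image_png, model):
        raise NotImplementedError  # real class sends base64-encoded PNG via the Messages API


class OpenAIProvider(BaseAPIProvider):
    def generate(self, prompt, image_png, model):
        raise NotImplementedError  # real class embeds the image as a data URL for Chat Completions


class GoogleProvider(BaseAPIProvider):
    def generate(self, prompt, image_png, model):
        raise NotImplementedError  # real class passes a PIL Image to GenerateContent


# Hypothetical alias table for illustration; the real registry lives in the baselines config.
_MODEL_ALIASES = {"claude": "claude-sonnet-4.5", "gpt": "gpt-5.1", "gemini": "gemini-flash"}


def resolve_model_alias(alias: str) -> str:
    return _MODEL_ALIASES.get(alias, alias)


def get_provider(model: str) -> BaseAPIProvider:
    name = resolve_model_alias(model)
    if name.startswith("claude"):
        return AnthropicProvider()
    if name.startswith("gpt"):
        return OpenAIProvider()
    if name.startswith("gemini"):
        return GoogleProvider()
    raise ProviderError(f"no provider registered for {model!r}")
```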
Baseline Module (baselines/):
- TrackType enum: TRACK_A (coords), TRACK_B (ReAct), TRACK_C (SoM)
- TrackConfig dataclass with factory methods for each track
- BaselineConfig with model alias resolution and registry
- PromptBuilder for track-specific system prompts and user content
- UnifiedResponseParser supporting JSON, function-call, PyAutoGUI formats
- ElementRegistry for element_id to coordinate conversion
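As a rough sketch of the track abstraction (the enum values mirror the commit; the `TrackConfig` fields and factory method names are assumptions):
```python
from dataclasses import dataclass
from enum import Enum


class TrackType(Enum):
    TRACK_A = "coords"  # model predicts raw pixel coordinates
    TRACK_B = "react"   # ReAct-style reason-then-act prompting
    TRACK_C = "som"     # Set-of-Marks: model refers to numbered element IDs


@dataclass(frozen=True)
class TrackConfig:
    track: TrackType
    uses_element_ids: bool
    description: str

    @classmethod
    def for_track_a(cls) -> "TrackConfig":
        return cls(TrackType.TRACK_A, False, "Predict click coordinates directly from pixels.")

    @classmethod
    def for_track_b(cls) -> "TrackConfig":
        return cls(TrackType.TRACK_B, False, "Interleave reasoning and actions (ReAct).")

    @classmethod
    def for_track_c(cls) -> "TrackConfig":
        return cls(TrackType.TRACK_C, True, "Predict an element_id; ElementRegistry maps it to coordinates.")
```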
Benchmark Integration:
- UnifiedBaselineAgent wrapping UnifiedBaselineAdapter for benchmarks
- Converts BenchmarkObservation -> adapter format -> BenchmarkAction
- Support for all three tracks via --track flag
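A hedged sketch of the conversion flow described above; the `BenchmarkObservation`/`BenchmarkAction` field names and the adapter's `predict()` interface are assumed for illustration only:
```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BenchmarkAction:  # stand-in for the real benchmark action type
    kind: str
    x: Optional[float] = None
    y: Optional[float] = None
    text: Optional[str] = None


class UnifiedBaselineAgent:
    """Benchmark-facing wrapper around UnifiedBaselineAdapter (interface assumed)."""

    def __init__(self, adapter: Any, track: str) -> None:
        self.adapter = adapter
        self.track = track

    def act(self, observation: Any) -> BenchmarkAction:
        # BenchmarkObservation -> adapter input format
        request = {
            "screenshot": observation.screenshot,
            "instruction": observation.instruction,
            "track": self.track,
        }
        # The adapter runs the provider call and parses the raw response
        # (UnifiedResponseParser handles JSON / function-call / PyAutoGUI formats).
        parsed = self.adapter.predict(request)
        # parsed prediction -> BenchmarkAction
        return BenchmarkAction(
            kind=parsed["action"],
            x=parsed.get("x"),
            y=parsed.get("y"),
            text=parsed.get("text"),
        )
```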
CLI Commands (baselines/cli.py):
- run: Single model prediction with track selection
- compare: Multi-model comparison on same task
- list-models: Show available models and providers
All 92 tests pass. Ready for model comparison experiments.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: lower Python requirement from 3.12 to 3.10 for meta-package compatibility
All dependencies (torch, transformers, pillow, peft, etc.) support Python 3.10+.
The 3.12 requirement was unnecessarily restrictive and broke `pip install openadapt[all]`
on Python 3.10 and 3.11.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* ci: add GitHub Actions test workflow
Add CI workflow that runs on pull requests and main branch pushes:
- Tests on Python 3.10 and 3.11
- Runs on Ubuntu and macOS
- Uses uv for dependency management
- Runs ruff linter and formatter
- Runs pytest suite
Matches pattern used by openadapt-viewer and follows OpenAdapt ecosystem conventions.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: add defaults to CanonicalEpisode fields for test compatibility
- cluster_id: default=0
- cluster_centroid_distance: default=0.0
- internal_similarity: default=1.0
Fixes 1/14 test failures in test_segmentation.py
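For reference, the added defaults look roughly like this on the dataclass (field names and default values are from the commit; all other fields of the real class are omitted here):
```python
from dataclasses import dataclass


@dataclass
class CanonicalEpisode:
    # ...other (non-defaulted) fields omitted; only the defaults added by this commit are shown.
    cluster_id: int = 0
    cluster_centroid_distance: float = 0.0
    internal_similarity: float = 1.0
```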
* fix: resolve all linting and formatting errors
- Fix unused imports in baselines, benchmarks, and ingest modules
- Fix ambiguous variable names (renamed 'l' to 'loss'/'line')
- Add missing time import in benchmarks/cli.py
- Move warnings import to top of file in benchmarks/cli.py
- Add noqa comments for intentional code patterns
- Fix bare except clause in lambda_labs.py
- Add Episode to TYPE_CHECKING imports in grounding.py
- Rename conflicting local variable in config.py
- Fix undefined _build_nav_links in viewer.py
- Run ruff format to ensure consistent code style
All ruff checks now pass successfully.
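Two of the recurring patterns, illustrated generically rather than with the actual project code (the function and variable names below are made up):
```python
def emit_lines(lines: list[str]) -> None:
    # Before this commit: `for l in lines:` (ambiguous name) and a bare `except:` clause.
    # After: a descriptive loop variable and a narrowed exception type.
    for line in lines:
        try:
            print(line.strip())
        except Exception:
            continue
```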
* fix: update parquet export tests to match new schema
- Change 'goal' to 'instruction' in column assertions
- Change 'image_path' to 'screenshot_path' to match schema
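A minimal illustration of the updated assertions (the helper and the stand-in DataFrame are hypothetical; the real tests read an exported parquet file):
```python
import pandas as pd


def check_export_schema(df: pd.DataFrame) -> None:
    # The renamed columns the updated tests assert on.
    assert "instruction" in df.columns      # previously asserted "goal"
    assert "screenshot_path" in df.columns  # previously asserted "image_path"


# Stand-in frame for illustration only.
check_export_schema(pd.DataFrame({"instruction": ["log in"], "screenshot_path": ["step_0.png"]}))
```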
* docs: add intellectual honesty qualifiers and fix badge URL
- Update badge URL to use filename-based path (from PR #3)
- Add qualifiers to claims about accuracy and performance (from PR #4)
- Clarify that results are from synthetic benchmarks, not production UIs
- Add disclaimers about extrapolating synthetic results to real-world performance
- Update section titles to indicate synthetic nature of benchmarks
This consolidates the documentation improvements from PRs #3 and #4.
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
- **Demo-conditioned inference**: Retrieval-augmented prompting (in early experiments: 46.7% -> 100% first-action accuracy on a controlled macOS benchmark where all 45 tasks share the same navigation entry point - see [publication roadmap](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/docs/publication-roadmap.md) for methodology and limitations)
- **Training pipeline**: TRL + Unsloth integration for 2x faster training with 50% less VRAM
OpenAdapt-ML is **not** a training framework, optimizer, hardware orchestrator, or experiment manager. We use TRL/Unsloth, Lambda Labs/Azure, and W&B/MLflow for those.
@@ -204,7 +204,9 @@ simple login flow.
### 5.1 Synthetic scenarios

OpenAdapt-ML includes synthetic UI generators for structured GUI automation benchmarks.
-Currently two scenarios are supported:
+Currently two scenarios are supported.
+
+> **Note:** These are **synthetic, controlled benchmarks** designed for rapid iteration and debugging, not real-world evaluation. The 100% accuracy results below demonstrate that fine-tuning works on simple scenarios with known ground truth - they do not represent performance on production UIs or standard benchmarks like WAA. See section 14 (Limitations) for details.

#### Login Scenario (6 steps, 3 elements)
@@ -387,15 +389,18 @@ It exposes step-level performance metrics, which let us visually answer the ques
| Claude Sonnet 4.5 | API | 0.121 | 0.757 | 0.000 |
| GPT-5.1 | API | 0.183 | 0.057 | 0.600 |

-**Key findings:**
-1. **Fine-tuning delivers massive gains**: Both 2B and 8B models show 2-3x improvement in action accuracy after fine-tuning
-2. **Small fine-tuned models beat large APIs**: Qwen3-VL-2B FT (469% base) outperforms both Claude Sonnet 4.5 (121%) and GPT-5.1 (183%)
-3. **Precision matters**: Fine-tuned models have excellent click precision (85-100% hit rate, <0.05 coord error) while API models struggle with the action format
-4. **Size vs specialization**: The fine-tuned 2B model outperforms the general-purpose Claude Sonnet 4.5, showing that domain-specific fine-tuning trumps raw model size
+**Observations on synthetic login benchmark:**
+
+> **Important:** These findings are from a synthetic benchmark with ~3 UI elements and a fixed action sequence. They demonstrate the training pipeline works, but should not be extrapolated to real-world GUI automation performance. Evaluation on standard benchmarks (WAA, WebArena) is ongoing.
+
+1. **Fine-tuning improves synthetic task performance**: Both 2B and 8B models show 2-3x improvement in action accuracy after fine-tuning on this specific task
+2. **On this synthetic benchmark, fine-tuned models outperform zero-shot API calls**: This is expected since the task is simple and the models are trained on it directly
+3. **Coordinate precision is learnable**: Fine-tuned models achieve low coordinate error on training distribution
+4. **API models struggle with custom action format**: Without fine-tuning on the specific DSL (CLICK/TYPE/DONE), API models have high format-error rates

-### 6.4 Set-of-Marks (SoM) Mode: 100% Accuracy
+### 6.4 Set-of-Marks (SoM) Mode: 100% Accuracy on Synthetic Benchmarks

-With **Set-of-Marks** visual prompting, fine-tuned Qwen3-VL-2B achieves **100% accuracy** on both login (6-step) and registration (12-step) scenarios:
+With **Set-of-Marks** visual prompting, fine-tuned Qwen3-VL-2B achieves **100% accuracy** on both login (6-step) and registration (12-step) synthetic scenarios. Note that these are controlled, toy benchmarks with a small number of UI elements:

| Scenario | Steps | Elements | Action Acc | Element Acc | Episode Success |