diff --git a/docs/design/custom_task_evaluation.md b/docs/design/custom_task_evaluation.md new file mode 100644 index 0000000..bf3c9c2 --- /dev/null +++ b/docs/design/custom_task_evaluation.md @@ -0,0 +1,478 @@ +# Custom Task Evaluation: Easy Task Definition Without Forking WAA + +> Date: 2026-03-17 +> Status: PROPOSED +> Goal: Let users define custom tasks with real evaluation, no WAA fork required + +--- + +## 1. The Problem + +The enterprise customer evaluates their SFT model's task success as **product of per-step action accuracy** (e.g., 80.5%^N). This is a proxy metric — they need real end-to-end evaluation: did the task actually complete? + +WAA has a powerful evaluation system, but defining new tasks requires: +1. Writing a complex JSON config with setup commands, evaluator specs, getter types, and metric functions +2. Placing the config file **inside the Docker container** at `/client/evaluation_examples_windows/examples/{domain}/{task_id}.json` +3. Knowledge of WAA's internal evaluation architecture (getter modules, metric modules, postconfig) + +This is too high a barrier. Users should be able to define a task in a simple YAML file and get real evaluation without touching the Docker image. + +--- + +## 2. Key Insight: The Server Already Supports Client-Side Configs + +The WAA `/evaluate` endpoint accepts the full evaluator config in the POST body. The `/task/` lookup is just a convenience for finding configs stored on disk. **If we send the complete config from the client, the server doesn't need anything on disk.** + +Similarly, `reset()` sends setup commands from `task.raw_config["config"]` to the server. These commands come from the client-loaded task config. + +**The entire task lifecycle can be client-driven:** +``` +Client loads YAML → translates to WAA format → sends to server +No fork. No Docker modifications. No server-side task registry. +``` + +--- + +## 3. 
Options + +### Option A: Simple YAML Task Format (Recommended) + +Create a human-friendly YAML format. Our code translates it to WAA's JSON format at runtime and sends it to the server. + +```yaml +# tasks/change-calc-font.yaml +name: Change font to Arial 14pt in LibreOffice Calc +id: custom-calc-font-001 + +setup: + - launch: soffice --calc + - sleep: 3 + - execute: | + python -c "import pyautogui; pyautogui.hotkey('ctrl', 'a')" + +evaluate: + # Simple: run a command, check output matches expected + - check: command + run: | + python -c " + import subprocess + r = subprocess.run(['powershell', '-c', + 'Get-Content C:\\Users\\Docker\\Documents\\test.xlsx'], + capture_output=True, text=True) + print(r.stdout.strip()) + " + expect: "Arial" + match: contains + + # Or: check a file exists + - check: file_exists + path: C:\Users\Docker\Documents\output.xlsx + + # Or: VLM judge (no server-side logic needed) + - check: screenshot + description: "The spreadsheet shows Arial 14pt font in cell A1" + confidence: 0.8 +``` + +**Pros:** +- Human-readable, easy to write +- No server changes, no Docker modifications +- Supports simple checks (command output, file existence) and VLM-based checks +- YAML files live in the user's repo, version-controlled + +**Cons:** +- We maintain the translation layer +- VLM-based checks are less precise than programmatic checks +- Some WAA evaluator features (compare_table, cloud_file) would need explicit mapping + +### Option B: VLM-Only Evaluation + +Skip WAA's evaluator entirely. After the agent finishes, take a screenshot and ask a VLM "Did the task complete? [task description]". 
+ +```yaml +name: Change font to Arial 14pt +evaluate: + judge: "The spreadsheet shows text formatted in Arial 14pt font" +``` + +**Pros:** +- Zero server-side code needed +- Works for ANY task without writing evaluator configs +- One-line evaluation definition + +**Cons:** +- Less precise — VLM may hallucinate success +- Expensive (API call per evaluation) +- Not deterministic (same state → different judgment) +- Can't verify non-visual state (file contents, registry values) + +### Option C: Programmatic Python Evaluator + +Let users write a Python function that runs on the VM and returns True/False. + +```yaml +name: Change font to Arial 14pt +evaluate: + python: | + import subprocess + result = subprocess.run( + ['powershell', '-c', 'Get-Content ...'], + capture_output=True, text=True + ) + return 'Arial' in result.stdout +``` + +**Pros:** +- Maximally flexible +- Deterministic +- Users write in a language they know + +**Cons:** +- Security risk (arbitrary code execution on VM) +- Debugging is hard (code runs remotely) +- Still requires understanding VM file paths and tools + +### Option D: Hybrid (Recommended) + +Combine the approaches above. Users choose one of five evaluation methods per check: + +1. **`command`** — run a command on VM, check output (programmatic, precise) +2. **`file`** — check file exists / matches expected (simple, common) +3. **`screenshot`** — VLM judges screenshot (easy to write, less precise) +4. **`python`** — custom Python evaluator (maximum flexibility) +5. **`waa`** — full WAA evaluator config (for power users / existing tasks) + +--- + +## 4. Recommended Design: Simple YAML + Hybrid Evaluation + +### Task YAML Schema + +```yaml +# Required +name: "Human-readable task description" + +# Optional (auto-generated UUID if omitted) +id: "custom-task-001" + +# Optional: what domain (desktop, web, etc.) +domain: desktop + +# Setup: commands to prepare the VM before the agent runs +# Executed in order. Each is sent to the WAA server. 
+setup: + - launch: "notepad.exe" # Start an application + - open: "C:\\Users\\Docker\\test.docx" # Open a file + - execute: "powershell -c 'Set-ItemProperty ...'" # Run arbitrary command + - sleep: 2 # Wait N seconds + - download: # Download a file to VM + url: "https://example.com/test.xlsx" + dest: "C:\\Users\\Docker\\Downloads\\test.xlsx" + +# Evaluation: how to check if the task succeeded +# Multiple checks can be combined (all must pass by default) +evaluate: + # Method 1: Command output check + - check: command + run: "powershell -c 'Get-ItemProperty HKCU:\\... -Name FontName | Select -Expand FontName'" + expect: "Arial" + match: exact # exact | contains | regex | fuzzy + + # Method 2: File check + - check: file + path: "C:\\Users\\Docker\\Documents\\output.txt" + exists: true + contains: "expected content" # optional + + # Method 3: VLM screenshot judge + - check: screenshot + description: "Notepad shows 'Hello World' in the text area" + + # Method 4: Custom Python (runs on VM) + - check: python + code: | + import json + with open(r'C:\Users\Docker\AppData\...\settings.json') as f: + data = json.load(f) + return data.get('editor.fontSize') == 14 + +# Optional: combine checks with AND (default) or OR +combine: and + +# Optional: max steps for the agent +max_steps: 15 + +# Optional: partial credit milestones (for dense rewards) +milestones: + - name: "Application is open" + check: screenshot + description: "Notepad window is visible" + - name: "Text is entered" + check: command + run: "powershell -c 'Get-Process notepad -ErrorAction SilentlyContinue | Select -Expand MainWindowTitle'" + expect: "Untitled" + match: contains +``` + +### Implementation + +#### Core: `TaskConfig` class (~150 lines) + +```python +# openadapt_evals/task_config.py + +@dataclass +class TaskConfig: + name: str + id: str + domain: str + setup: list[dict] + evaluate: list[dict] + combine: str # "and" | "or" + max_steps: int + milestones: list[dict] + + @classmethod + def 
from_yaml(cls, path: str) -> "TaskConfig": + """Load from YAML file.""" + + @classmethod + def from_dir(cls, dir_path: str) -> list["TaskConfig"]: + """Load all .yaml/.yml files from a directory.""" + + def to_waa_config(self) -> dict: + """Translate to WAA's native JSON format for /evaluate.""" + + def to_benchmark_task(self) -> BenchmarkTask: + """Create a BenchmarkTask for use with adapters.""" +``` + +#### Translation: YAML → WAA format + +The `to_waa_config()` method translates each `check` into WAA evaluator format: + +```python +def _translate_check(self, check: dict) -> dict: + if check["check"] == "command": + return { + "func": {"exact": "exact_match", "contains": "contains", "regex": "regex_match", "fuzzy": "fuzzy_match"}[check.get("match", "exact")], + "result": { + "type": "vm_command_line", + "command": check["run"], + }, + "expected": { + "type": "literal", + "value": check["expect"], + }, + } + elif check["check"] == "file": + evaluator = {"func": "file_exists", ...} + if "contains" in check: + evaluator = {"func": "contains", ...} + return evaluator + elif check["check"] == "screenshot": + # VLM evaluation — handled client-side, not sent to WAA + return {"func": "_vlm_judge", "description": check["description"]} + elif check["check"] == "python": + return { + "func": "exact_match", + "result": { + "type": "vm_command_line", + "command": f'python -c "{check["code"]}"', + }, + "expected": {"type": "literal", "value": "True"}, + } +``` + +#### VLM Judge (~50 lines) + +For `check: screenshot` evaluations, we don't use the WAA evaluator. Instead: + +```python +def _vlm_evaluate(self, screenshot: bytes, description: str) -> tuple[bool, float]: + """Use a VLM to judge task completion from screenshot.""" + from openadapt_evals.vlm import vlm_call + + response = vlm_call( + prompt=f"""Look at this screenshot. Answer YES or NO: +Does this screenshot show that the following condition is met? 
+Condition: {description} +Answer YES or NO, then explain briefly.""", + images=[screenshot], + model="gpt-4.1-mini", + provider="openai", + ) + success = response.strip().upper().startswith("YES") + confidence = 0.9 if success else 0.1 # simplified + return success, confidence +``` + +#### CLI Integration + +```bash +# Run evaluation with custom tasks +openadapt-evals run --agent api-claude --tasks ./my_tasks/ + +# Run a single custom task +openadapt-evals run --agent api-claude --task-config ./my_tasks/change-font.yaml + +# Validate task configs without running +openadapt-evals validate-tasks ./my_tasks/ + +# List available checks for a task +openadapt-evals describe-task ./my_tasks/change-font.yaml +``` + +--- + +## 5. Example: Customer Defines a Task + +The customer wants to evaluate "Can my model change the font in LibreOffice Calc?" + +**Step 1: Write YAML** (2 minutes) + +```yaml +# tasks/calc-change-font.yaml +name: "Change font to Arial 14pt in cell A1 of LibreOffice Calc" + +setup: + - launch: "soffice --calc" + - sleep: 5 + - execute: | + python -c "import pyautogui; pyautogui.click(200, 200); pyautogui.typewrite('Hello', interval=0.05)" + +evaluate: + - check: command + run: | + python -c " + import subprocess + r = subprocess.run(['powershell', '-c', + 'Get-Content C:\\Users\\Docker\\Documents\\test.ods | Select-String Arial'], + capture_output=True, text=True) + print('Arial' in r.stdout) + " + expect: "True" + match: exact + + # Backup: VLM check + - check: screenshot + description: "Cell A1 in LibreOffice Calc shows text formatted in Arial font" + +combine: or # Pass if either check succeeds + +max_steps: 15 +``` + +**Step 2: Run evaluation** + +```bash +openadapt-evals run \ + --agent http --agent-endpoint http://customer-model:8000 \ + --task-config ./tasks/calc-change-font.yaml \ + --server http://localhost:5001 +``` + +**Step 3: Get results** + +``` +Task: Change font to Arial 14pt in cell A1 +Steps: 8 +Result: SUCCESS (score=1.0) + check[0] 
(command): PASS — output matched "True" + check[1] (screenshot): PASS — VLM confirmed Arial font visible +Time: 45.2s +``` + +--- + +## 6. Partial Credit via Milestones + +The YAML format supports milestones for dense rewards: + +```yaml +milestones: + - name: "Calc is open" + check: command + run: "powershell -c 'Get-Process soffice -ErrorAction SilentlyContinue | Measure | Select -Expand Count'" + expect: "1" + match: exact + + - name: "Cell A1 is selected" + check: screenshot + description: "Cell A1 appears highlighted/selected in the spreadsheet" + + - name: "Font dialog is open" + check: screenshot + description: "A font selection dialog or dropdown is visible" + + - name: "Arial is selected" + check: screenshot + description: "Arial font is selected or typed in the font selector" +``` + +During RL training: +```python +reward = milestones_passed / total_milestones # e.g., 3/4 = 0.75 +``` + +--- + +## 7. Where This Lives + +| Component | Location | Lines | +|-----------|----------|-------| +| `TaskConfig` class | `openadapt_evals/task_config.py` | ~150 | +| VLM judge | `openadapt_evals/vlm_evaluator.py` | ~50 | +| CLI integration | `openadapt_evals/benchmarks/cli.py` | ~30 (new flags) | +| Example tasks | `openadapt_evals/example_tasks/` | ~5 YAML files | +| Tests | `tests/test_task_config.py` | ~100 | + +**Total new code: ~330 lines + example YAML files.** + +--- + +## 8. Tradeoffs + +| Approach | Precision | Ease of Use | Server Changes | Cost | +|----------|-----------|-------------|----------------|------| +| `command` check | High | Medium (need to know PowerShell/Python) | None | Free | +| `file` check | High | Easy | None | Free | +| `screenshot` (VLM) | Medium | Very easy (one sentence) | None | ~$0.01/eval | +| `python` check | Highest | Medium | None | Free | +| `waa` (native) | Highest | Hard (complex JSON) | None | Free | + +**Recommendation**: Default to `command` + `screenshot` combo. 
`command` for precise state verification, `screenshot` as backup/sanity check. Users who need maximum precision can use `python` checks. + +--- + +## 9. What This Enables for the Customer + +**Before**: "My model gets 80.5% per-step accuracy" (a proxy: the product of per-step accuracies approaches 0% task completion as tasks get longer, but no real evaluation has confirmed this) + +**After**: +```bash +# Define 10 custom tasks in YAML (30 minutes) +# Run real evaluation +openadapt-evals run --agent http --agent-endpoint http://their-model:8000 \ + --tasks ./their_tasks/ --server http://waa-vm:5001 + +# Get actual task completion rate +# Results: 2/10 tasks completed (20%) +# Plus: milestone progress per task (dense signal) +``` + +This answers the fundamental question: **does the model actually work?** + +With dense milestone rewards, the same evaluation also feeds directly into GRPO training. + +--- + +## 10. Implementation Priority + +1. **`TaskConfig.from_yaml()` + `to_benchmark_task()`** — load YAML, create task objects +2. **`command` and `file` check translation** — most common, highest precision +3. **`screenshot` VLM judge** — easiest for users to write +4. **CLI `--task-config` flag** — wire into existing run command +5. **Example tasks** — 3-5 YAML files for Core4 WAA tasks +6. **Milestone support** — for dense rewards +7. 
**`python` check** — maximum flexibility, last priority diff --git a/example_tasks/calc-formula.yaml b/example_tasks/calc-formula.yaml new file mode 100644 index 0000000..3b077b4 --- /dev/null +++ b/example_tasks/calc-formula.yaml @@ -0,0 +1,37 @@ +# Medium task: enter values and a SUM formula in LibreOffice Calc +name: "Enter values in A1:A3 and a SUM formula in A4 in LibreOffice Calc" +id: custom-calc-formula + +setup: + - execute: "powershell -c 'Stop-Process -Name soffice* -Force -ErrorAction SilentlyContinue'" + - sleep: 2 + - launch: "soffice --calc" + - sleep: 5 + +evaluate: + - check: command + run: | + powershell -c "Get-Process soffice* -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count" + expect: "1" + match: exact + + - check: screenshot + description: "LibreOffice Calc is open with values in cells A1, A2, A3 and a SUM formula result in A4" + +combine: and +max_steps: 20 + +milestones: + - name: "Calc is open" + check: command + run: "powershell -c \"Get-Process soffice* -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count\"" + expect: "1" + match: exact + + - name: "Values entered" + check: screenshot + description: "Cells A1, A2, and A3 contain numeric values" + + - name: "Formula entered" + check: screenshot + description: "Cell A4 contains a formula result (a number)" diff --git a/example_tasks/clear-browsing-data-chrome.yaml b/example_tasks/clear-browsing-data-chrome.yaml new file mode 100644 index 0000000..9b35be4 --- /dev/null +++ b/example_tasks/clear-browsing-data-chrome.yaml @@ -0,0 +1,63 @@ +# Clear browsing data in Google Chrome +# This is a WAA benchmark staple — tests navigation through Settings UI +name: "Clear browsing data in Google Chrome" +id: custom-clear-chrome-data + +setup: + # Populate Chrome with browsing history so there's data to clear + - execute: "powershell -c \"Start-Process chrome 'https://example.com' -WindowStyle Normal\"" + - sleep: 3 + - execute: "powershell -c \"Start-Process 
chrome 'https://wikipedia.org'\"" + - sleep: 2 + # Close Chrome so the agent starts fresh + - execute: "powershell -c 'Stop-Process -Name chrome -Force -ErrorAction SilentlyContinue'" + - sleep: 2 + +evaluate: + # Check 1: Chrome history is empty after clearing + - check: command + run: | + powershell -c " + $histPath = \"$env:LOCALAPPDATA\\Google\\Chrome\\User Data\\Default\\History\" + if (Test-Path $histPath) { + $size = (Get-Item $histPath).Length + if ($size -lt 50000) { Write-Output 'cleared' } else { Write-Output 'not_cleared' } + } else { Write-Output 'no_history_file' } + " + expect: "cleared" + match: contains + + # Check 2: VLM confirms the "Clear browsing data" dialog was used + - check: screenshot + description: "Chrome shows a confirmation that browsing data has been cleared, or the Settings page for clearing data is visible with completed state" + +combine: or +max_steps: 20 + +milestones: + - name: "Chrome is open" + check: command + run: "powershell -c \"Get-Process chrome -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count\"" + expect: "1" + match: contains + + - name: "Settings page is open" + check: screenshot + description: "Chrome Settings page is visible, or chrome://settings is in the address bar" + + - name: "Clear browsing data dialog is open" + check: screenshot + description: "The 'Clear browsing data' dialog or panel is visible in Chrome" + + - name: "Data is cleared" + check: command + run: | + powershell -c " + $histPath = \"$env:LOCALAPPDATA\\Google\\Chrome\\User Data\\Default\\History\" + if (Test-Path $histPath) { + $size = (Get-Item $histPath).Length + if ($size -lt 50000) { Write-Output 'cleared' } else { Write-Output 'not_cleared' } + } else { Write-Output 'no_history_file' } + " + expect: "cleared" + match: contains diff --git a/example_tasks/clear-browsing-data-edge.yaml b/example_tasks/clear-browsing-data-edge.yaml new file mode 100644 index 0000000..3eaf398 --- /dev/null +++ 
b/example_tasks/clear-browsing-data-edge.yaml @@ -0,0 +1,63 @@ +# Clear browsing data in Microsoft Edge +# Edge is the default browser on Windows 11 WAA VMs +name: "Clear browsing data in Microsoft Edge" +id: custom-clear-edge-data + +setup: + # Populate Edge with browsing history + - execute: "powershell -c \"Start-Process msedge 'https://example.com' -WindowStyle Normal\"" + - sleep: 3 + - execute: "powershell -c \"Start-Process msedge 'https://wikipedia.org'\"" + - sleep: 2 + # Close Edge so the agent starts fresh + - execute: "powershell -c 'Stop-Process -Name msedge -Force -ErrorAction SilentlyContinue'" + - sleep: 2 + +evaluate: + # Check 1: Edge history is empty after clearing + - check: command + run: | + powershell -c " + $histPath = \"$env:LOCALAPPDATA\\Microsoft\\Edge\\User Data\\Default\\History\" + if (Test-Path $histPath) { + $size = (Get-Item $histPath).Length + if ($size -lt 50000) { Write-Output 'cleared' } else { Write-Output 'not_cleared' } + } else { Write-Output 'no_history_file' } + " + expect: "cleared" + match: contains + + # Check 2: VLM confirms clearing + - check: screenshot + description: "Microsoft Edge shows a confirmation that browsing data has been cleared, or the Settings page shows completed clearing state" + +combine: or +max_steps: 20 + +milestones: + - name: "Edge is open" + check: command + run: "powershell -c \"Get-Process msedge -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count\"" + expect: "1" + match: contains + + - name: "Settings page is open" + check: screenshot + description: "Edge Settings page is visible, or edge://settings is in the address bar" + + - name: "Clear browsing data dialog is open" + check: screenshot + description: "The 'Clear browsing data' dialog or panel is visible in Edge" + + - name: "Data is cleared" + check: command + run: | + powershell -c " + $histPath = \"$env:LOCALAPPDATA\\Microsoft\\Edge\\User Data\\Default\\History\" + if (Test-Path $histPath) { + $size = (Get-Item 
$histPath).Length + if ($size -lt 50000) { Write-Output 'cleared' } else { Write-Output 'not_cleared' } + } else { Write-Output 'no_history_file' } + " + expect: "cleared" + match: contains diff --git a/example_tasks/create-desktop-folder.yaml b/example_tasks/create-desktop-folder.yaml new file mode 100644 index 0000000..2645a3b --- /dev/null +++ b/example_tasks/create-desktop-folder.yaml @@ -0,0 +1,26 @@ +# Simple 2-step task: create a folder named "TestFolder" on the Desktop +name: "Create a folder named TestFolder on the Desktop" +id: custom-desktop-folder + +setup: + - execute: "powershell -c \"Remove-Item -Path $env:USERPROFILE\\Desktop\\TestFolder -Recurse -Force -ErrorAction SilentlyContinue\"" + - sleep: 1 + +evaluate: + - check: command + run: "powershell -c \"Test-Path $env:USERPROFILE\\Desktop\\TestFolder\"" + expect: "True" + match: exact + +max_steps: 10 + +milestones: + - name: "Desktop is visible" + check: screenshot + description: "The Windows desktop is visible with icons" + + - name: "Folder exists" + check: command + run: "powershell -c \"Test-Path $env:USERPROFILE\\Desktop\\TestFolder\"" + expect: "True" + match: exact diff --git a/example_tasks/notepad-hello.yaml b/example_tasks/notepad-hello.yaml new file mode 100644 index 0000000..429eec9 --- /dev/null +++ b/example_tasks/notepad-hello.yaml @@ -0,0 +1,30 @@ +# Simple 2-step task: open Notepad and type "Hello World" +name: "Open Notepad and type Hello World" +id: custom-notepad-hello + +setup: + - execute: "powershell -c 'Stop-Process -Name notepad -Force -ErrorAction SilentlyContinue'" + - sleep: 1 + +evaluate: + - check: command + run: "powershell -c \"Get-Process notepad -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count\"" + expect: "1" + match: exact + + - check: screenshot + description: "Notepad window is open with 'Hello World' typed in the text area" + +combine: and +max_steps: 10 + +milestones: + - name: "Notepad is open" + check: command + run: "powershell -c 
\"Get-Process notepad -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count\"" + expect: "1" + match: exact + + - name: "Text is typed" + check: screenshot + description: "Notepad shows 'Hello World' in the text area" diff --git a/openadapt_evals/adapters/rl_env.py b/openadapt_evals/adapters/rl_env.py index 1ed02b0..b5dc738 100644 --- a/openadapt_evals/adapters/rl_env.py +++ b/openadapt_evals/adapters/rl_env.py @@ -50,6 +50,11 @@ BenchmarkTask, ) +# Avoid circular import — TaskConfig imported lazily +TYPE_CHECKING = False +if TYPE_CHECKING: + from openadapt_evals.task_config import TaskConfig + logger = logging.getLogger(__name__) @@ -112,10 +117,12 @@ def __init__( adapter: BenchmarkAdapter, default_task_id: str | None = None, evaluate_every_step: bool = False, + task_config: TaskConfig | None = None, ): self._adapter = adapter self._default_task_id = default_task_id self._evaluate_every_step = evaluate_every_step + self._task_config = task_config self._current_task: BenchmarkTask | None = None self._step_count = 0 self._done = False @@ -160,6 +167,23 @@ def screen_size(self) -> tuple[int, int]: return (width, height) return (1920, 1200) + def load_task_config(self, task_config: TaskConfig) -> None: + """Set a TaskConfig for dense reward evaluation. + + When set, collect_rollout() and evaluate_dense() use milestone-based + partial credit instead of binary evaluation. + + Args: + task_config: A TaskConfig loaded from YAML. + """ + self._task_config = task_config + self._default_task_id = task_config.id + logger.info( + "Loaded TaskConfig: %s (%d milestones)", + task_config.name, + len(task_config.milestones), + ) + def reset(self, config: ResetConfig | None = None) -> BenchmarkObservation: """Reset environment to a task's initial state. @@ -186,8 +210,15 @@ def reset(self, config: ResetConfig | None = None) -> BenchmarkObservation: "Pass task_id in ResetConfig or set default_task_id in constructor." 
) - # Load and reset the task - self._current_task = self._adapter.load_task(task_id) + # Load the task — prefer TaskConfig if available (avoids server lookup) + if self._task_config and self._task_config.id == task_id: + self._current_task = self._task_config.to_benchmark_task() + elif hasattr(self._adapter, "load_task_from_json") and self._task_config: + self._current_task = self._adapter.load_task_from_json( + task_id, self._task_config.to_waa_config() + ) + else: + self._current_task = self._adapter.load_task(task_id) obs = self._adapter.reset(self._current_task) # Reset episode state @@ -429,6 +460,66 @@ def evaluate(self) -> float: ) return result.score + def evaluate_dense(self) -> float: + """Evaluate using dense partial rewards via milestones. + + If a TaskConfig with milestones is set, returns the fraction of + milestones passed (0.0 to 1.0). Falls back to binary evaluate() + if no TaskConfig or no milestones are defined. + + This gives GRPO gradient signal even when no task fully completes: + an agent that passes 3/5 milestones gets reward 0.6 vs 0.0 for + one that passes 0/5. + + Returns: + Dense reward score between 0.0 and 1.0. 
+ """ + if self._current_task is None: + raise RuntimeError("Call reset() before evaluate_dense().") + + # Try milestone evaluation first + if self._task_config and self._task_config.milestones: + screenshot = b"" + if self._last_obs and self._last_obs.screenshot: + screenshot = self._last_obs.screenshot + + server_url = getattr( + getattr(self._adapter, "config", None), "server_url", "" + ) or "" + + passed, total = self._task_config.evaluate_milestones( + screenshot, server_url + ) + if total > 0: + milestone_score = passed / total + + # Also try binary evaluation if available + try: + binary_score = self.evaluate() + except Exception: + binary_score = 0.0 + + # Use the higher of milestone score and binary score + # This way, full task completion (1.0) always beats partial (0.6) + score = max(milestone_score, binary_score) + + # Backfill reward on last trajectory step + if self._trajectory: + self._trajectory[-1].reward = score + self._trajectory[-1].info["milestone_score"] = milestone_score + self._trajectory[-1].info["binary_score"] = binary_score + self._trajectory[-1].info["milestones_passed"] = passed + self._trajectory[-1].info["milestones_total"] = total + + logger.info( + "Dense evaluation: milestones=%d/%d (%.2f), binary=%.2f, final=%.2f", + passed, total, milestone_score, binary_score, score, + ) + return score + + # Fallback to binary evaluation + return self.evaluate() + def collect_rollout( self, agent_fn: Callable[[BenchmarkObservation], BenchmarkAction], @@ -493,8 +584,11 @@ def collect_rollout( if rollout_step.done: break - # Evaluate and backfill reward - score = self.evaluate() + # Evaluate and backfill reward — use dense rewards if milestones exist + if self._task_config and self._task_config.milestones: + score = self.evaluate_dense() + else: + score = self.evaluate() logger.info( "Rollout complete: %d steps, score=%.2f", diff --git a/openadapt_evals/task_config.py b/openadapt_evals/task_config.py new file mode 100644 index 0000000..f080037 
--- /dev/null +++ b/openadapt_evals/task_config.py @@ -0,0 +1,339 @@ +"""YAML-based custom task configuration. + +Lets users define tasks with setup commands and evaluation checks in +simple YAML files, without forking WAA or modifying the Docker image. +The WAA server already accepts evaluator configs in POST /evaluate — +this module translates YAML into that format. + +Usage: + from openadapt_evals.task_config import TaskConfig + + # Load a single task + task = TaskConfig.from_yaml("tasks/change-font.yaml") + benchmark_task = task.to_benchmark_task() + + # Load all tasks from a directory + tasks = TaskConfig.from_dir("tasks/") +""" + +from __future__ import annotations + +import logging +import os +import uuid +from dataclasses import dataclass, field +from pathlib import Path +from typing import Any + +import yaml + +from openadapt_evals.adapters.base import BenchmarkTask + +logger = logging.getLogger(__name__) + + +@dataclass +class TaskCheck: + """A single evaluation check.""" + + check: str # "command", "file", "screenshot", "python" + # command check + run: str | None = None + expect: str | None = None + match: str = "exact" # "exact", "contains", "regex", "fuzzy" + # file check + path: str | None = None + exists: bool = True + contains: str | None = None + # screenshot check + description: str | None = None + # python check + code: str | None = None + + +@dataclass +class Milestone: + """An intermediate checkpoint for dense rewards.""" + + name: str + check: TaskCheck + + +@dataclass +class TaskConfig: + """A custom task definition loaded from YAML.""" + + name: str + id: str + domain: str + setup: list[dict[str, Any]] + checks: list[TaskCheck] + combine: str # "and" | "or" + max_steps: int + milestones: list[Milestone] + + @classmethod + def from_yaml(cls, path: str) -> TaskConfig: + """Load a task config from a YAML file.""" + with open(path) as f: + data = yaml.safe_load(f) + + task_id = data.get("id", f"custom-{uuid.uuid4().hex[:8]}") + name = 
data.get("name", Path(path).stem) + domain = data.get("domain", "desktop") + max_steps = data.get("max_steps", 15) + combine = data.get("combine", "and") + + # Parse setup commands + setup = [] + for item in data.get("setup", []): + if isinstance(item, dict): + setup.append(item) + else: + setup.append({"execute": str(item)}) + + # Parse evaluation checks + checks = [] + for item in data.get("evaluate", []): + checks.append(TaskCheck(**{k: v for k, v in item.items()})) + + # Parse milestones + milestones = [] + for item in data.get("milestones", []): + ms_name = item.pop("name", "milestone") + milestones.append(Milestone(name=ms_name, check=TaskCheck(**item))) + + return cls( + name=name, + id=task_id, + domain=domain, + setup=setup, + checks=checks, + combine=combine, + max_steps=max_steps, + milestones=milestones, + ) + + @classmethod + def from_dir(cls, dir_path: str) -> list[TaskConfig]: + """Load all YAML task configs from a directory.""" + tasks = [] + for fname in sorted(os.listdir(dir_path)): + if fname.endswith((".yaml", ".yml")): + try: + tasks.append(cls.from_yaml(os.path.join(dir_path, fname))) + except Exception as exc: + logger.warning("Skipping %s: %s", fname, exc) + return tasks + + def to_waa_config(self) -> dict[str, Any]: + """Translate to WAA's native JSON format for /evaluate.""" + config = { + "task_id": self.id, + "instruction": self.name, + "config": self._translate_setup(), + } + + evaluator = self._translate_evaluator() + if evaluator: + config["evaluator"] = evaluator + + return config + + def to_benchmark_task(self) -> BenchmarkTask: + """Create a BenchmarkTask for use with adapters.""" + waa_config = self.to_waa_config() + return BenchmarkTask( + task_id=self.id, + instruction=self.name, + domain=self.domain, + time_limit_steps=self.max_steps, + raw_config=waa_config, + evaluation_spec=waa_config.get("evaluator"), + ) + + def _translate_setup(self) -> list[dict[str, Any]]: + """Translate setup items to WAA config format.""" + result = 
[] + for item in self.setup: + if "launch" in item: + result.append({ + "type": "launch", + "parameters": {"command": item["launch"]}, + }) + elif "open" in item: + result.append({ + "type": "open", + "parameters": {"path": item["open"]}, + }) + elif "execute" in item: + cmd = item["execute"].strip() + # WAA execute handler runs the command via subprocess. + # Pass as a single string for shell execution. + result.append({ + "type": "execute", + "parameters": {"command": cmd}, + }) + elif "sleep" in item: + result.append({ + "type": "sleep", + "parameters": {"seconds": float(item["sleep"])}, + }) + elif "download" in item: + dl = item["download"] + result.append({ + "type": "download", + "parameters": {"url": dl["url"], "path": dl["dest"]}, + }) + else: + # Pass through raw WAA setup items + result.append(item) + return result + + def _translate_evaluator(self) -> dict[str, Any] | None: + """Translate checks to WAA evaluator format.""" + # Separate server-side checks from client-side (VLM) checks + server_checks = [c for c in self.checks if c.check != "screenshot"] + if not server_checks: + return None + + if len(server_checks) == 1: + return self._translate_check(server_checks[0]) + + # Multiple checks — use conjunction + metrics = [self._translate_check(c) for c in server_checks] + return { + "func": [m["func"] for m in metrics], + "result": [m["result"] for m in metrics], + "expected": [m["expected"] for m in metrics], + "conj": self.combine, + } + + def _translate_check(self, check: TaskCheck) -> dict[str, Any]: + """Translate a single check to WAA evaluator format.""" + if check.check == "command": + match_func = { + "exact": "exact_match", + "contains": "contains", + "regex": "regex_match", + "fuzzy": "fuzzy_match", + }.get(check.match, "exact_match") + return { + "func": match_func, + "result": { + "type": "vm_command_line", + "command": check.run, + }, + "expected": { + "type": "literal", + "value": check.expect or "", + }, + } + elif check.check == "file": 
+ if check.contains: + return { + "func": "contains", + "result": { + "type": "vm_file", + "path": check.path, + }, + "expected": { + "type": "literal", + "value": check.contains, + }, + } + return { + "func": "file_exists", + "result": { + "type": "vm_command_line", + "command": f'python -c "import os; print(os.path.exists(r\'{check.path}\'))"', + }, + "expected": { + "type": "literal", + "value": "True", + }, + } + elif check.check == "python": + # Wrap code so it prints True/False + escaped = check.code.replace('"', '\\"').replace("\n", "\\n") + return { + "func": "exact_match", + "result": { + "type": "vm_command_line", + "command": f'python -c "{escaped}"', + }, + "expected": { + "type": "literal", + "value": "True", + }, + } + else: + raise ValueError(f"Cannot translate check type '{check.check}' to WAA format") + + def get_vlm_checks(self) -> list[TaskCheck]: + """Return checks that need client-side VLM evaluation.""" + return [c for c in self.checks if c.check == "screenshot"] + + def evaluate_milestones( + self, + screenshot: bytes, + server_url: str, + ) -> tuple[int, int]: + """Evaluate milestones and return (passed, total). + + Server-side milestones are evaluated via /execute_windows. + Screenshot milestones are evaluated via VLM. 
+ """ + passed = 0 + total = len(self.milestones) + if total == 0: + return 0, 0 + + for ms in self.milestones: + try: + if ms.check.check == "screenshot": + from openadapt_evals.vlm_evaluator import vlm_judge + + success, _ = vlm_judge(screenshot, ms.check.description or "") + if success: + passed += 1 + elif ms.check.check == "command": + result = self._run_vm_command(ms.check.run or "", server_url) + if self._check_match(result, ms.check.expect or "", ms.check.match): + passed += 1 + else: + logger.warning("Milestone check type '%s' not yet supported", ms.check.check) + except Exception as exc: + logger.warning("Milestone '%s' evaluation failed: %s", ms.name, exc) + + logger.info("Milestones: %d/%d passed", passed, total) + return passed, total + + @staticmethod + def _run_vm_command(command: str, server_url: str) -> str: + """Execute a command on the VM and return stdout.""" + import requests + + resp = requests.post( + f"{server_url}/execute_windows", + json={"command": f'python -c "{command}"'}, + timeout=30, + ) + if resp.status_code == 200: + return resp.json().get("output", "").strip() + return "" + + @staticmethod + def _check_match(actual: str, expected: str, match_type: str) -> bool: + """Check if actual matches expected using the specified method.""" + if match_type == "exact": + return actual.strip() == expected.strip() + elif match_type == "contains": + return expected.strip() in actual + elif match_type == "regex": + import re + return bool(re.search(expected, actual)) + elif match_type == "fuzzy": + import difflib + return difflib.SequenceMatcher(None, actual, expected).ratio() >= 0.8 + return False diff --git a/openadapt_evals/vlm_evaluator.py b/openadapt_evals/vlm_evaluator.py new file mode 100644 index 0000000..6171f6f --- /dev/null +++ b/openadapt_evals/vlm_evaluator.py @@ -0,0 +1,90 @@ +"""VLM-based screenshot evaluation for custom task checks. + +Uses a vision-language model to judge whether a screenshot shows +that a condition is met. 
This is less precise than programmatic +checks but much easier to define — users write one sentence instead +of PowerShell commands. + +Usage: + from openadapt_evals.vlm_evaluator import vlm_judge + + success, confidence = vlm_judge( + screenshot_bytes, + "Cell A1 shows text formatted in Arial font", + ) +""" + +from __future__ import annotations + +import json +import logging + +logger = logging.getLogger(__name__) + +_JUDGE_PROMPT = """\ +Look at this screenshot carefully. + +Does this screenshot show that the following condition is met? + +Condition: {description} + +Respond in this exact JSON format: +{{"verdict": "YES" or "NO", "confidence": 0.0 to 1.0, "explanation": "brief reason"}} + +Respond with ONLY the JSON object.""" + + +def vlm_judge( + screenshot: bytes, + description: str, + model: str = "gpt-4.1-mini", + provider: str = "openai", +) -> tuple[bool, float]: + """Judge whether a screenshot satisfies a condition. + + Args: + screenshot: PNG screenshot bytes. + description: Natural language condition to check. + model: VLM model name. + provider: VLM provider ("openai" or "anthropic"). + + Returns: + Tuple of (success: bool, confidence: float). 
+ """ + from openadapt_evals.vlm import vlm_call + + prompt = _JUDGE_PROMPT.format(description=description) + response = vlm_call( + prompt, + images=[screenshot], + model=model, + provider=provider, + max_tokens=256, + temperature=0.1, + ) + + try: + import re + + match = re.search(r"\{[^}]+\}", response, re.DOTALL) + if match: + data = json.loads(match.group()) + else: + data = json.loads(response) + + verdict = str(data.get("verdict", "NO")).upper().startswith("Y") + confidence = float(data.get("confidence", 0.5)) + explanation = data.get("explanation", "") + logger.info( + "VLM judge: %s (confidence=%.2f) — %s", + "PASS" if verdict else "FAIL", + confidence, + explanation[:80], + ) + return verdict, confidence + + except (json.JSONDecodeError, ValueError, KeyError): + # Fallback: check if response starts with YES + verdict = response.strip().upper().startswith("YES") + logger.warning("VLM judge JSON parse failed, fallback: %s", verdict) + return verdict, 0.5 diff --git a/tests/test_dense_rewards.py b/tests/test_dense_rewards.py new file mode 100644 index 0000000..d9e51c8 --- /dev/null +++ b/tests/test_dense_rewards.py @@ -0,0 +1,224 @@ +"""Tests for dense partial rewards via milestones in RLEnvironment.""" + +from __future__ import annotations + +from unittest.mock import MagicMock, patch + +import pytest + +from openadapt_evals.adapters.base import ( + BenchmarkAction, + BenchmarkObservation, + BenchmarkResult, + BenchmarkTask, +) +from openadapt_evals.adapters.rl_env import RLEnvironment, ResetConfig +from openadapt_evals.task_config import Milestone, TaskCheck, TaskConfig + + +def _make_adapter(): + adapter = MagicMock() + adapter.observe.return_value = BenchmarkObservation( + screenshot=b"fake-screenshot", raw_observation={} + ) + adapter.step.return_value = ( + BenchmarkObservation(screenshot=b"fake-screenshot", raw_observation={}), + False, + {}, + ) + adapter.load_task.return_value = BenchmarkTask( + task_id="test-001", instruction="Test task", 
domain="desktop" + ) + adapter.load_task_from_json.return_value = BenchmarkTask( + task_id="test-001", instruction="Test task", domain="desktop" + ) + adapter.reset.return_value = BenchmarkObservation( + screenshot=b"fake-screenshot", raw_observation={} + ) + adapter.evaluate.return_value = BenchmarkResult( + task_id="test-001", success=False, score=0.0 + ) + adapter.config = MagicMock(server_url="http://localhost:5001") + return adapter + + +def _make_task_config(milestones=None): + return TaskConfig( + name="Test task", + id="test-001", + domain="desktop", + setup=[], + checks=[], + combine="and", + max_steps=15, + milestones=milestones or [], + ) + + +class TestDenseRewards: + def test_evaluate_dense_with_milestones(self): + adapter = _make_adapter() + task_config = _make_task_config( + milestones=[ + Milestone( + name="Step 1 done", + check=TaskCheck(check="command", run="echo 1", expect="1", match="exact"), + ), + Milestone( + name="Step 2 done", + check=TaskCheck(check="command", run="echo 0", expect="1", match="exact"), + ), + ] + ) + env = RLEnvironment(adapter, task_config=task_config) + env.reset(config=ResetConfig(task_id="test-001")) + + with patch.object(TaskConfig, "_run_vm_command") as mock_cmd: + mock_cmd.side_effect = ["1", "0"] + score = env.evaluate_dense() + + # 1/2 milestones passed = 0.5, binary = 0.0, max(0.5, 0.0) = 0.5 + assert score == 0.5 + + def test_evaluate_dense_all_pass(self): + adapter = _make_adapter() + adapter.evaluate.return_value = BenchmarkResult( + task_id="test-001", success=True, score=1.0 + ) + task_config = _make_task_config( + milestones=[ + Milestone( + name="Done", + check=TaskCheck(check="command", run="echo ok", expect="ok", match="exact"), + ), + ] + ) + env = RLEnvironment(adapter, task_config=task_config) + env.reset(config=ResetConfig(task_id="test-001")) + + with patch.object(TaskConfig, "_run_vm_command", return_value="ok"): + score = env.evaluate_dense() + + # milestones = 1/1 = 1.0, binary = 1.0, max = 1.0 + 
assert score == 1.0 + + def test_evaluate_dense_no_milestones_falls_back(self): + adapter = _make_adapter() + adapter.evaluate.return_value = BenchmarkResult( + task_id="test-001", success=False, score=0.0 + ) + task_config = _make_task_config(milestones=[]) + env = RLEnvironment(adapter, task_config=task_config) + env.reset(config=ResetConfig(task_id="test-001")) + + score = env.evaluate_dense() + assert score == 0.0 + + def test_evaluate_dense_no_task_config(self): + adapter = _make_adapter() + adapter.evaluate.return_value = BenchmarkResult( + task_id="test-001", success=False, score=0.0 + ) + env = RLEnvironment(adapter) + env.reset(config=ResetConfig(task_id="test-001")) + + score = env.evaluate_dense() + assert score == 0.0 + + def test_load_task_config(self): + adapter = _make_adapter() + env = RLEnvironment(adapter) + + task_config = _make_task_config() + env.load_task_config(task_config) + + assert env._task_config is task_config + assert env._default_task_id == "test-001" + + def test_collect_rollout_uses_dense_rewards(self): + adapter = _make_adapter() + # Make step return done after 2 steps + adapter.step.side_effect = [ + (BenchmarkObservation(screenshot=b"s1", raw_observation={}), False, {}), + (BenchmarkObservation(screenshot=b"s2", raw_observation={}), True, {}), + ] + + task_config = _make_task_config( + milestones=[ + Milestone( + name="Step done", + check=TaskCheck(check="command", run="echo 1", expect="1", match="exact"), + ), + Milestone( + name="Not done", + check=TaskCheck(check="command", run="echo 0", expect="1", match="exact"), + ), + ] + ) + env = RLEnvironment(adapter, task_config=task_config) + + def agent_fn(obs): + return BenchmarkAction(type="click", x=100, y=100) + + with patch.object(TaskConfig, "_run_vm_command") as mock_cmd: + mock_cmd.side_effect = ["1", "0"] + trajectory = env.collect_rollout(agent_fn, max_steps=5, task_id="test-001") + + # Last step should have dense reward + assert trajectory[-1].reward == 0.5 # 1/2 
milestones + assert trajectory[-1].info.get("milestones_passed") == 1 + assert trajectory[-1].info.get("milestones_total") == 2 + + def test_trajectory_info_contains_milestone_details(self): + adapter = _make_adapter() + task_config = _make_task_config( + milestones=[ + Milestone( + name="M1", + check=TaskCheck(check="command", run="echo 1", expect="1", match="exact"), + ), + ] + ) + env = RLEnvironment(adapter, task_config=task_config) + env.reset(config=ResetConfig(task_id="test-001")) + env.step(BenchmarkAction(type="click", x=100, y=100)) + + with patch.object(TaskConfig, "_run_vm_command", return_value="1"): + env.evaluate_dense() + + last_step = env.trajectory[-1] + assert "milestone_score" in last_step.info + assert "binary_score" in last_step.info + assert last_step.info["milestones_passed"] == 1 + assert last_step.info["milestones_total"] == 1 + + @patch("openadapt_evals.vlm_evaluator.vlm_judge") + def test_milestone_with_vlm_check(self, mock_vlm): + mock_vlm.return_value = (True, 0.9) + + adapter = _make_adapter() + task_config = _make_task_config( + milestones=[ + Milestone( + name="VLM check", + check=TaskCheck(check="screenshot", description="App is open"), + ), + ] + ) + env = RLEnvironment(adapter, task_config=task_config) + env.reset(config=ResetConfig(task_id="test-001")) + + score = env.evaluate_dense() + assert score == 1.0 # 1/1 milestone passed + mock_vlm.assert_called_once() + + def test_reset_uses_task_config_for_task_loading(self): + adapter = _make_adapter() + task_config = _make_task_config() + env = RLEnvironment(adapter, task_config=task_config) + + env.reset(config=ResetConfig(task_id="test-001")) + + # Should use load_task_from_json since task_config matches + assert env._current_task is not None + assert env._current_task.task_id == "test-001" diff --git a/tests/test_rl_pipeline_e2e.py b/tests/test_rl_pipeline_e2e.py new file mode 100644 index 0000000..dc19d3b --- /dev/null +++ b/tests/test_rl_pipeline_e2e.py @@ -0,0 +1,290 @@ 
+"""Synthetic end-to-end test of the RL training pipeline. + +Validates the full chain: TaskConfig YAML → RLEnvironment → collect_rollout +→ dense rewards → output dict matching TRL rollout_func signature. + +Uses WAAMockAdapter — no VM or GPU required. When the enterprise customer +is ready with real tasks, the only change is swapping MockAdapter for +WAALiveAdapter. +""" + +from __future__ import annotations + +import textwrap +from unittest.mock import MagicMock, patch + +import pytest + +from openadapt_evals.adapters.base import ( + BenchmarkAction, + BenchmarkObservation, + BenchmarkResult, + BenchmarkTask, +) +from openadapt_evals.adapters.rl_env import RLEnvironment, ResetConfig +from openadapt_evals.task_config import TaskConfig + + +def _make_mock_adapter(step_count_to_done: int = 3): + """Create a mock adapter that terminates after N steps.""" + adapter = MagicMock() + call_count = {"n": 0} + + def mock_step(action): + call_count["n"] += 1 + done = call_count["n"] >= step_count_to_done + return ( + BenchmarkObservation( + screenshot=b"\x89PNG\r\n\x1a\n" + b"\x00" * 100, # minimal PNG header + raw_observation={}, + ), + done, + {"step": call_count["n"]}, + ) + + adapter.step.side_effect = mock_step + adapter.reset.return_value = BenchmarkObservation( + screenshot=b"\x89PNG\r\n\x1a\n" + b"\x00" * 100, + raw_observation={}, + ) + adapter.load_task.return_value = BenchmarkTask( + task_id="test-task", instruction="Test", domain="desktop" + ) + adapter.load_task_from_json.return_value = BenchmarkTask( + task_id="test-task", instruction="Test", domain="desktop" + ) + adapter.evaluate.return_value = BenchmarkResult( + task_id="test-task", success=False, score=0.0 + ) + adapter.config = MagicMock(server_url="http://mock:5001") + return adapter + + +class TestRLPipelineE2E: + """Full pipeline test: YAML → Environment → Rollout → Dense Rewards.""" + + def test_full_pipeline_with_milestones(self, tmp_path): + """Simulate a 3-step task where agent passes 2/3 
milestones.""" + # 1. Write a task YAML + task_yaml = tmp_path / "test_task.yaml" + task_yaml.write_text( + textwrap.dedent("""\ + name: "Test task with milestones" + id: test-task + setup: + - sleep: 1 + evaluate: + - check: command + run: "echo done" + expect: "done" + milestones: + - name: "Step 1 done" + check: command + run: "echo pass" + expect: "pass" + match: exact + - name: "Step 2 done" + check: command + run: "echo pass" + expect: "pass" + match: exact + - name: "Step 3 NOT done" + check: command + run: "echo fail" + expect: "pass" + match: exact + max_steps: 5 + """) + ) + + # 2. Load TaskConfig + tc = TaskConfig.from_yaml(str(task_yaml)) + assert tc.name == "Test task with milestones" + assert len(tc.milestones) == 3 + + # 3. Create environment with mock adapter + adapter = _make_mock_adapter(step_count_to_done=3) + env = RLEnvironment(adapter, task_config=tc) + + # 4. Define a simple agent + def agent_fn(obs): + return BenchmarkAction(type="click", x=100, y=100) + + # 5. Collect rollout with dense rewards + with patch.object(TaskConfig, "_run_vm_command") as mock_cmd: + # First two milestones pass, third fails + mock_cmd.side_effect = ["pass", "pass", "fail"] + trajectory = env.collect_rollout(agent_fn, max_steps=5, task_id="test-task") + + # 6. Verify trajectory + assert len(trajectory) == 3 # 3 steps before done + assert trajectory[-1].done is True + + # 7. Verify dense reward + final_reward = trajectory[-1].reward + assert final_reward == pytest.approx(2 / 3, abs=0.01) # 2/3 milestones + + # 8. 
Verify info contains milestone details + info = trajectory[-1].info + assert info["milestones_passed"] == 2 + assert info["milestones_total"] == 3 + assert "milestone_score" in info + + def test_pipeline_binary_fallback_no_milestones(self, tmp_path): + """Without milestones, falls back to binary evaluation.""" + task_yaml = tmp_path / "binary_task.yaml" + task_yaml.write_text( + textwrap.dedent("""\ + name: "Binary task" + id: binary-task + evaluate: + - check: command + run: "echo ok" + expect: "ok" + """) + ) + + tc = TaskConfig.from_yaml(str(task_yaml)) + adapter = _make_mock_adapter(step_count_to_done=2) + adapter.evaluate.return_value = BenchmarkResult( + task_id="binary-task", success=True, score=1.0 + ) + env = RLEnvironment(adapter, task_config=tc) + + def agent_fn(obs): + return BenchmarkAction(type="click", x=50, y=50) + + trajectory = env.collect_rollout(agent_fn, max_steps=5, task_id="binary-task") + + # No milestones → binary evaluation + assert trajectory[-1].reward == 1.0 + + def test_rollout_func_output_shape(self, tmp_path): + """Verify output matches TRL rollout_func return signature.""" + task_yaml = tmp_path / "shape_task.yaml" + task_yaml.write_text( + textwrap.dedent("""\ + name: "Shape test" + id: shape-task + evaluate: + - check: screenshot + description: "Task is done" + milestones: + - name: "M1" + check: command + run: "echo 1" + expect: "1" + match: exact + max_steps: 3 + """) + ) + + tc = TaskConfig.from_yaml(str(task_yaml)) + adapter = _make_mock_adapter(step_count_to_done=2) + env = RLEnvironment(adapter, task_config=tc) + + def agent_fn(obs): + return BenchmarkAction(type="click", x=50, y=50) + + with patch.object(TaskConfig, "_run_vm_command", return_value="1"): + trajectory = env.collect_rollout(agent_fn, max_steps=3, task_id="shape-task") + + # Simulate what a rollout_func would return to TRL + rollout_result = { + "prompt_ids": [[1, 2, 3]], # would be real token IDs + "completion_ids": [[4, 5, 6]], + "logprobs": [[-0.5, -0.3, 
-0.1]], + "env_reward": [trajectory[-1].reward], + } + + # Verify shape matches TRL expectations + assert "prompt_ids" in rollout_result + assert "completion_ids" in rollout_result + assert "logprobs" in rollout_result + assert "env_reward" in rollout_result + assert isinstance(rollout_result["env_reward"][0], float) + assert 0.0 <= rollout_result["env_reward"][0] <= 1.0 + + def test_multiple_rollouts_produce_reward_variance(self, tmp_path): + """GRPO needs reward variance. Dense rewards provide it.""" + task_yaml = tmp_path / "variance_task.yaml" + task_yaml.write_text( + textwrap.dedent("""\ + name: "Variance test" + id: variance-task + evaluate: + - check: command + run: "echo ok" + expect: "ok" + milestones: + - name: "M1" + check: command + run: "echo x" + expect: "pass" + match: exact + - name: "M2" + check: command + run: "echo x" + expect: "pass" + match: exact + - name: "M3" + check: command + run: "echo x" + expect: "pass" + match: exact + max_steps: 3 + """) + ) + + tc = TaskConfig.from_yaml(str(task_yaml)) + rewards = [] + + # Simulate N=4 rollouts with varying milestone success + milestone_results = [ + ["pass", "pass", "pass"], # 3/3 = 1.0 + ["pass", "pass", "fail"], # 2/3 = 0.67 + ["pass", "fail", "fail"], # 1/3 = 0.33 + ["fail", "fail", "fail"], # 0/3 = 0.0 + ] + + for ms_results in milestone_results: + adapter = _make_mock_adapter(step_count_to_done=2) + env = RLEnvironment(adapter, task_config=tc) + + def agent_fn(obs): + return BenchmarkAction(type="click", x=50, y=50) + + with patch.object(TaskConfig, "_run_vm_command") as mock_cmd: + mock_cmd.side_effect = ms_results + trajectory = env.collect_rollout( + agent_fn, max_steps=3, task_id="variance-task" + ) + rewards.append(trajectory[-1].reward) + + # Verify reward variance exists (GRPO can compute advantages) + assert len(set(rewards)) > 1, f"No variance in rewards: {rewards}" + assert max(rewards) > min(rewards) + # Rewards should be approximately [1.0, 0.67, 0.33, 0.0] + assert rewards[0] > 
rewards[1] > rewards[2] > rewards[3] + + def test_load_example_tasks_and_create_environments(self): + """All example YAML files produce valid environments.""" + import os + + example_dir = os.path.join( + os.path.dirname(os.path.dirname(__file__)), "example_tasks" + ) + if not os.path.isdir(example_dir): + pytest.skip("example_tasks/ not found") + + tasks = TaskConfig.from_dir(example_dir) + assert len(tasks) >= 3 + + for tc in tasks: + adapter = _make_mock_adapter() + env = RLEnvironment(adapter, task_config=tc) + + # Verify environment can be created and reset + env.reset(config=ResetConfig(task_id=tc.id)) + assert env._current_task is not None + assert env._current_task.instruction == tc.name diff --git a/tests/test_task_config.py b/tests/test_task_config.py new file mode 100644 index 0000000..f0d2945 --- /dev/null +++ b/tests/test_task_config.py @@ -0,0 +1,387 @@ +"""Tests for custom YAML task configuration.""" + +from __future__ import annotations + +import json +import os +import textwrap +from unittest.mock import MagicMock, patch + +import pytest +import yaml + +from openadapt_evals.task_config import Milestone, TaskCheck, TaskConfig + + +# --------------------------------------------------------------------------- +# YAML loading tests +# --------------------------------------------------------------------------- + + +class TestTaskConfigFromYaml: + def test_minimal_yaml(self, tmp_path): + config = tmp_path / "task.yaml" + config.write_text( + textwrap.dedent("""\ + name: "Open Notepad" + evaluate: + - check: screenshot + description: "Notepad is open" + """) + ) + task = TaskConfig.from_yaml(str(config)) + assert task.name == "Open Notepad" + assert task.id.startswith("custom-") + assert task.domain == "desktop" + assert task.max_steps == 15 + assert task.combine == "and" + assert len(task.checks) == 1 + assert task.checks[0].check == "screenshot" + + def test_full_yaml(self, tmp_path): + config = tmp_path / "task.yaml" + config.write_text( + 
textwrap.dedent("""\ + name: "Change font to Arial" + id: custom-font-001 + domain: desktop + max_steps: 20 + combine: or + + setup: + - launch: "soffice --calc" + - sleep: 3 + - execute: "powershell -c 'echo hello'" + + evaluate: + - check: command + run: "powershell -c 'echo Arial'" + expect: "Arial" + match: contains + - check: file + path: "C:\\\\test.txt" + exists: true + - check: screenshot + description: "Font is Arial" + + milestones: + - name: "App open" + check: command + run: "powershell -c 'Get-Process soffice | Measure | Select -Expand Count'" + expect: "1" + match: exact + """) + ) + task = TaskConfig.from_yaml(str(config)) + assert task.name == "Change font to Arial" + assert task.id == "custom-font-001" + assert task.domain == "desktop" + assert task.max_steps == 20 + assert task.combine == "or" + assert len(task.setup) == 3 + assert len(task.checks) == 3 + assert len(task.milestones) == 1 + assert task.milestones[0].name == "App open" + + def test_from_dir(self, tmp_path): + for i in range(3): + (tmp_path / f"task{i}.yaml").write_text( + f"name: Task {i}\nevaluate:\n - check: screenshot\n description: ok" + ) + (tmp_path / "readme.txt").write_text("not a yaml") + + tasks = TaskConfig.from_dir(str(tmp_path)) + assert len(tasks) == 3 + + def test_from_dir_skips_invalid(self, tmp_path): + (tmp_path / "good.yaml").write_text( + "name: Good\nevaluate:\n - check: screenshot\n description: ok" + ) + (tmp_path / "bad.yaml").write_text("not: valid: yaml: [[[") + + tasks = TaskConfig.from_dir(str(tmp_path)) + assert len(tasks) == 1 + + +# --------------------------------------------------------------------------- +# WAA translation tests +# --------------------------------------------------------------------------- + + +class TestWaaTranslation: + def _make_task(self, **kwargs) -> TaskConfig: + defaults = { + "name": "Test task", + "id": "test-001", + "domain": "desktop", + "setup": [], + "checks": [], + "combine": "and", + "max_steps": 15, + "milestones": 
[], + } + defaults.update(kwargs) + return TaskConfig(**defaults) + + def test_command_check_to_waa(self): + task = self._make_task( + checks=[ + TaskCheck(check="command", run="echo hello", expect="hello", match="exact") + ] + ) + waa = task.to_waa_config() + evaluator = waa["evaluator"] + assert evaluator["func"] == "exact_match" + assert evaluator["result"]["type"] == "vm_command_line" + assert evaluator["result"]["command"] == "echo hello" + assert evaluator["expected"]["value"] == "hello" + + def test_contains_match(self): + task = self._make_task( + checks=[ + TaskCheck(check="command", run="echo hello world", expect="hello", match="contains") + ] + ) + evaluator = task.to_waa_config()["evaluator"] + assert evaluator["func"] == "contains" + + def test_file_exists_check(self): + task = self._make_task( + checks=[ + TaskCheck(check="file", path="C:\\test.txt", exists=True) + ] + ) + evaluator = task.to_waa_config()["evaluator"] + assert evaluator["func"] == "file_exists" + assert "vm_command_line" in evaluator["result"]["type"] + + def test_file_contains_check(self): + task = self._make_task( + checks=[ + TaskCheck(check="file", path="C:\\test.txt", contains="expected content") + ] + ) + evaluator = task.to_waa_config()["evaluator"] + assert evaluator["func"] == "contains" + assert evaluator["result"]["type"] == "vm_file" + + def test_python_check(self): + task = self._make_task( + checks=[ + TaskCheck(check="python", code="print(1 + 1 == 2)") + ] + ) + evaluator = task.to_waa_config()["evaluator"] + assert evaluator["func"] == "exact_match" + assert evaluator["expected"]["value"] == "True" + + def test_screenshot_only_no_server_evaluator(self): + task = self._make_task( + checks=[ + TaskCheck(check="screenshot", description="Notepad is open") + ] + ) + waa = task.to_waa_config() + assert "evaluator" not in waa + + def test_multiple_server_checks_use_conjunction(self): + task = self._make_task( + checks=[ + TaskCheck(check="command", run="echo a", expect="a"), 
+ TaskCheck(check="command", run="echo b", expect="b"), + ], + combine="and", + ) + evaluator = task.to_waa_config()["evaluator"] + assert isinstance(evaluator["func"], list) + assert len(evaluator["func"]) == 2 + assert evaluator["conj"] == "and" + + def test_vlm_checks_separated(self): + task = self._make_task( + checks=[ + TaskCheck(check="command", run="echo ok", expect="ok"), + TaskCheck(check="screenshot", description="Looks good"), + ] + ) + vlm_checks = task.get_vlm_checks() + assert len(vlm_checks) == 1 + assert vlm_checks[0].description == "Looks good" + + def test_setup_translation(self): + task = self._make_task( + setup=[ + {"launch": "notepad.exe"}, + {"sleep": 2}, + {"download": {"url": "http://example.com/f.txt", "dest": "C:\\f.txt"}}, + ] + ) + waa = task.to_waa_config() + config = waa["config"] + assert config[0]["type"] == "launch" + assert config[1]["type"] == "sleep" + assert config[2]["type"] == "download" + + +# --------------------------------------------------------------------------- +# BenchmarkTask conversion +# --------------------------------------------------------------------------- + + +class TestBenchmarkTaskConversion: + def test_to_benchmark_task(self): + task = TaskConfig( + name="Test", + id="test-001", + domain="desktop", + setup=[{"launch": "notepad.exe"}], + checks=[TaskCheck(check="command", run="echo ok", expect="ok")], + combine="and", + max_steps=10, + milestones=[], + ) + bt = task.to_benchmark_task() + assert bt.task_id == "test-001" + assert bt.instruction == "Test" + assert bt.domain == "desktop" + assert bt.time_limit_steps == 10 + assert bt.evaluation_spec is not None + + +# --------------------------------------------------------------------------- +# Milestone evaluation +# --------------------------------------------------------------------------- + + +class TestMilestones: + def test_milestone_command_check(self): + task = TaskConfig( + name="Test", + id="test-001", + domain="desktop", + setup=[], + checks=[], 
+ combine="and", + max_steps=15, + milestones=[ + Milestone( + name="App open", + check=TaskCheck(check="command", run="echo 1", expect="1", match="exact"), + ), + Milestone( + name="File saved", + check=TaskCheck(check="command", run="echo 0", expect="1", match="exact"), + ), + ], + ) + + with patch.object(TaskConfig, "_run_vm_command") as mock_cmd: + mock_cmd.side_effect = ["1", "0"] + passed, total = task.evaluate_milestones(b"fake-screenshot", "http://localhost:5001") + + assert total == 2 + assert passed == 1 # first matches, second doesn't + + @patch("openadapt_evals.vlm_evaluator.vlm_judge") + def test_milestone_screenshot_check(self, mock_vlm): + mock_vlm.return_value = (True, 0.95) + + task = TaskConfig( + name="Test", + id="test-001", + domain="desktop", + setup=[], + checks=[], + combine="and", + max_steps=15, + milestones=[ + Milestone( + name="App visible", + check=TaskCheck(check="screenshot", description="Notepad is open"), + ), + ], + ) + + passed, total = task.evaluate_milestones(b"fake-screenshot", "http://localhost:5001") + assert total == 1 + assert passed == 1 + mock_vlm.assert_called_once() + + def test_no_milestones(self): + task = TaskConfig( + name="Test", + id="test-001", + domain="desktop", + setup=[], + checks=[], + combine="and", + max_steps=15, + milestones=[], + ) + passed, total = task.evaluate_milestones(b"fake", "http://localhost:5001") + assert passed == 0 + assert total == 0 + + +# --------------------------------------------------------------------------- +# VLM evaluator tests +# --------------------------------------------------------------------------- + + +class TestVlmEvaluator: + @patch("openadapt_evals.vlm.vlm_call") + def test_vlm_judge_yes(self, mock_vlm): + mock_vlm.return_value = '{"verdict": "YES", "confidence": 0.95, "explanation": "Font is Arial"}' + from openadapt_evals.vlm_evaluator import vlm_judge + + success, confidence = vlm_judge(b"fake-png", "Font is Arial") + assert success is True + assert confidence == 
0.95 + + @patch("openadapt_evals.vlm.vlm_call") + def test_vlm_judge_no(self, mock_vlm): + mock_vlm.return_value = '{"verdict": "NO", "confidence": 0.8, "explanation": "Font is Times"}' + from openadapt_evals.vlm_evaluator import vlm_judge + + success, confidence = vlm_judge(b"fake-png", "Font is Arial") + assert success is False + assert confidence == 0.8 + + @patch("openadapt_evals.vlm.vlm_call") + def test_vlm_judge_bad_json_fallback(self, mock_vlm): + mock_vlm.return_value = "YES, the font looks like Arial to me." + from openadapt_evals.vlm_evaluator import vlm_judge + + success, confidence = vlm_judge(b"fake-png", "Font is Arial") + assert success is True + assert confidence == 0.5 # fallback confidence + + @patch("openadapt_evals.vlm.vlm_call") + def test_vlm_judge_no_fallback(self, mock_vlm): + mock_vlm.return_value = "NO, I don't see that." + from openadapt_evals.vlm_evaluator import vlm_judge + + success, confidence = vlm_judge(b"fake-png", "Font is Arial") + assert success is False + + +# --------------------------------------------------------------------------- +# Example YAML files load correctly +# --------------------------------------------------------------------------- + + +class TestExampleTasks: + def test_load_example_tasks(self): + example_dir = os.path.join( + os.path.dirname(os.path.dirname(__file__)), "example_tasks" + ) + if not os.path.isdir(example_dir): + pytest.skip("example_tasks/ not found") + + tasks = TaskConfig.from_dir(example_dir) + assert len(tasks) >= 1 + for task in tasks: + assert task.name + assert task.id + bt = task.to_benchmark_task() + assert bt.task_id == task.id
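Note on the reward arithmetic asserted in the tests above: the inline comments (`max(0.5, 0.0) = 0.5`, "No milestones → binary evaluation") suggest that the dense reward is the larger of the milestone pass fraction and the binary evaluator score, with a fallback to the binary score when no milestones are defined. The sketch below is only an illustration of that arithmetic under this assumption. `dense_reward` is a hypothetical helper name; the actual logic lives in `RLEnvironment.evaluate_dense`, which is not part of this excerpt and may differ.

```python
def dense_reward(passed: int, total: int, binary_score: float) -> float:
    """Combine milestone progress with the binary evaluator score.

    Hypothetical helper, inferred from the test comments; not the
    project's actual implementation.
    """
    if total == 0:
        # No milestones defined: fall back to the binary evaluation.
        return binary_score
    milestone_score = passed / total
    # A fully successful task should never score below its milestones,
    # and partial progress should never score below a failed binary check.
    return max(milestone_score, binary_score)


# Rewards for the four simulated rollouts in the variance test
# (3, 2, 1, and 0 of 3 milestones passed, binary score 0.0 each time).
rollout_rewards = [dense_reward(p, 3, 0.0) for p in (3, 2, 1, 0)]
```

Because the four rollouts yield strictly decreasing rewards, a group-relative method such as GRPO has non-zero variance to compute advantages from, which is the property `test_multiple_rollouts_produce_reward_variance` checks.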