Commit e62377b

abrichr and claude authored
feat: add YAML-based custom task evaluation without forking WAA (#125)
* feat: add YAML-based custom task evaluation without forking WAA

  Users can define tasks with setup commands and evaluation checks in simple YAML files. The WAA server already accepts evaluator configs in POST /evaluate; this module translates YAML into that format.

  Four check types:
  - command: run PowerShell/Python on VM, check output
  - file: check file exists or contains expected content
  - screenshot: VLM judges screenshot (one-sentence description)
  - python: run arbitrary Python on VM

  Includes milestone support for dense partial rewards, VLM-based screenshot evaluation, 5 example tasks (notepad, folder, calc, clear-browsing-data for Chrome and Edge), and 22 tests.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add dense partial rewards via milestones in RLEnvironment

  RLEnvironment.evaluate_dense() uses TaskConfig milestones to compute partial credit (milestones_passed / total). This gives GRPO gradient signal even when no task fully completes: an agent passing 3/5 milestones gets reward 0.6 vs 0.0 for binary evaluation.

  - evaluate_dense(): milestone-based evaluation, falls back to binary
  - load_task_config(): convenience method to set TaskConfig
  - collect_rollout() uses dense rewards when milestones are defined
  - reset() uses TaskConfig for task loading (bypasses server lookup)
  - Trajectory info includes milestone_score, binary_score, counts

  9 new tests, all passing. No changes to existing evaluate() behavior.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: simplify execute command translation in TaskConfig

  Execute setup commands were being double-wrapped in python -c. They are now passed through as-is to WAA's execute handler. Validated against a live WAA VM: milestones evaluate correctly (the VLM screenshot check and the command check both work).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add synthetic E2E pipeline tests for RL training

  Validates the full chain: TaskConfig YAML → RLEnvironment → collect_rollout → dense rewards → TRL rollout_func output shape.

  Key test: multiple_rollouts_produce_reward_variance proves that milestone-based rewards produce [1.0, 0.67, 0.33, 0.0] across 4 rollouts; GRPO can compute meaningful advantages from this, even when binary task completion is 0%.

  5 tests, no VM or GPU required (uses mock adapter).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
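The reward-variance point in the last commit can be illustrated with a small sketch. It assumes one common form of group-relative normalization, `(r - mean) / (std + eps)`, applied within a group of rollouts. The helper name `group_advantages` and the `eps` value are hypothetical and are not taken from openadapt_evals or TRL.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Milestone-based rewards from the commit message vs. all-zero binary rewards.
dense_adv = group_advantages([1.0, 0.67, 0.33, 0.0])
binary_adv = group_advantages([0.0, 0.0, 0.0, 0.0])

print(dense_adv)   # non-zero, ordered like the rewards
print(binary_adv)  # all zeros: no learning signal
```

With identical rewards every advantage is zero, so the policy gradient vanishes. Any spread in the milestone scores gives non-zero advantages.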
1 parent fe48718 commit e62377b

12 files changed

Lines changed: 2125 additions & 4 deletions

docs/design/custom_task_evaluation.md

Lines changed: 478 additions & 0 deletions
Large diffs are not rendered by default.

example_tasks/calc-formula.yaml

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# Medium task: enter values and a SUM formula in LibreOffice Calc
+name: "Enter values in A1:A3 and a SUM formula in A4 in LibreOffice Calc"
+id: custom-calc-formula
+
+setup:
+  - execute: "powershell -c 'Stop-Process -Name soffice* -Force -ErrorAction SilentlyContinue'"
+  - sleep: 2
+  - launch: "soffice --calc"
+  - sleep: 5
+
+evaluate:
+  - check: command
+    run: |
+      powershell -c "Get-Process soffice* -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count"
+    expect: "1"
+    match: exact
+
+  - check: screenshot
+    description: "LibreOffice Calc is open with values in cells A1, A2, A3 and a SUM formula result in A4"
+
+combine: and
+max_steps: 20
+
+milestones:
+  - name: "Calc is open"
+    check: command
+    run: "powershell -c \"Get-Process soffice* -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count\""
+    expect: "1"
+    match: exact
+
+  - name: "Values entered"
+    check: screenshot
+    description: "Cells A1, A2, and A3 contain numeric values"
+
+  - name: "Formula entered"
+    check: screenshot
+    description: "Cell A4 contains a formula result (a number)"
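The `expect` / `match` pair on command checks could be interpreted roughly as follows. This is a minimal sketch, not the module's actual implementation: the function name `output_matches` and the whitespace stripping are assumptions. The two modes shown are the ones that appear in the example tasks (`exact` and `contains`).

```python
def output_matches(output: str, expect: str, match: str = "exact") -> bool:
    """Compare command output against `expect` (hypothetical helper)."""
    actual = output.strip()  # assumption: trailing newlines are ignored
    if match == "exact":
        return actual == expect
    if match == "contains":
        return expect in actual
    raise ValueError(f"unknown match mode: {match!r}")


print(output_matches("1\r\n", "1", "exact"))    # process count of exactly 1
print(output_matches("12", "1", "exact"))       # 12 processes is not "1"
print(output_matches("12", "1", "contains"))    # substring match is looser
```

Under this reading, `contains` on a numeric count is a substring test, so a count of 12 would satisfy `expect: "1"` while a count of 8 would not. `exact` is the stricter choice for counts.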
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# Clear browsing data in Google Chrome
+# This is a WAA benchmark staple — tests navigation through Settings UI
+name: "Clear browsing data in Google Chrome"
+id: custom-clear-chrome-data
+
+setup:
+  # Populate Chrome with browsing history so there's data to clear
+  - execute: "powershell -c \"Start-Process chrome 'https://example.com' -WindowStyle Normal\""
+  - sleep: 3
+  - execute: "powershell -c \"Start-Process chrome 'https://wikipedia.org'\""
+  - sleep: 2
+  # Close Chrome so the agent starts fresh
+  - execute: "powershell -c 'Stop-Process -Name chrome -Force -ErrorAction SilentlyContinue'"
+  - sleep: 2
+
+evaluate:
+  # Check 1: Chrome history is empty after clearing
+  - check: command
+    run: |
+      powershell -c "
+        $histPath = \"$env:LOCALAPPDATA\\Google\\Chrome\\User Data\\Default\\History\"
+        if (Test-Path $histPath) {
+          $size = (Get-Item $histPath).Length
+          if ($size -lt 50000) { Write-Output 'cleared' } else { Write-Output 'not_cleared' }
+        } else { Write-Output 'no_history_file' }
+      "
+    expect: "cleared"
+    match: contains
+
+  # Check 2: VLM confirms the "Clear browsing data" dialog was used
+  - check: screenshot
+    description: "Chrome shows a confirmation that browsing data has been cleared, or the Settings page for clearing data is visible with completed state"
+
+combine: or
+max_steps: 20
+
+milestones:
+  - name: "Chrome is open"
+    check: command
+    run: "powershell -c \"Get-Process chrome -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count\""
+    expect: "1"
+    match: contains
+
+  - name: "Settings page is open"
+    check: screenshot
+    description: "Chrome Settings page is visible, or chrome://settings is in the address bar"
+
+  - name: "Clear browsing data dialog is open"
+    check: screenshot
+    description: "The 'Clear browsing data' dialog or panel is visible in Chrome"
+
+  - name: "Data is cleared"
+    check: command
+    run: |
+      powershell -c "
+        $histPath = \"$env:LOCALAPPDATA\\Google\\Chrome\\User Data\\Default\\History\"
+        if (Test-Path $histPath) {
+          $size = (Get-Item $histPath).Length
+          if ($size -lt 50000) { Write-Output 'cleared' } else { Write-Output 'not_cleared' }
+        } else { Write-Output 'no_history_file' }
+      "
+    expect: "cleared"
+    match: contains
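The `combine` key decides how the results of several `evaluate` checks are merged. A minimal sketch of that merge is below. It assumes `and` means every check must pass and `or` means any one check suffices. The helper name `combine_checks`, the default of `and` when the key is omitted, and the handling of an empty list are assumptions, not confirmed behavior of the module.

```python
def combine_checks(results: list[bool], combine: str = "and") -> bool:
    """Merge per-check pass/fail results into one verdict (hypothetical helper)."""
    if not results:
        return False  # assumption: a task with no checks never passes
    if combine == "and":
        return all(results)
    if combine == "or":
        return any(results)
    raise ValueError(f"unknown combine mode: {combine!r}")


# History file check passed, screenshot check failed.
checks = [True, False]
print(combine_checks(checks, "or"))   # enough for the Chrome task above
print(combine_checks(checks, "and"))  # would fail a task that uses `and`
```

Using `or` in the Chrome task means either the file-size heuristic or the VLM screenshot judgment is enough to count the task as complete.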
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# Clear browsing data in Microsoft Edge
+# Edge is the default browser on Windows 11 WAA VMs
+name: "Clear browsing data in Microsoft Edge"
+id: custom-clear-edge-data
+
+setup:
+  # Populate Edge with browsing history
+  - execute: "powershell -c \"Start-Process msedge 'https://example.com' -WindowStyle Normal\""
+  - sleep: 3
+  - execute: "powershell -c \"Start-Process msedge 'https://wikipedia.org'\""
+  - sleep: 2
+  # Close Edge so the agent starts fresh
+  - execute: "powershell -c 'Stop-Process -Name msedge -Force -ErrorAction SilentlyContinue'"
+  - sleep: 2
+
+evaluate:
+  # Check 1: Edge history is empty after clearing
+  - check: command
+    run: |
+      powershell -c "
+        $histPath = \"$env:LOCALAPPDATA\\Microsoft\\Edge\\User Data\\Default\\History\"
+        if (Test-Path $histPath) {
+          $size = (Get-Item $histPath).Length
+          if ($size -lt 50000) { Write-Output 'cleared' } else { Write-Output 'not_cleared' }
+        } else { Write-Output 'no_history_file' }
+      "
+    expect: "cleared"
+    match: contains
+
+  # Check 2: VLM confirms clearing
+  - check: screenshot
+    description: "Microsoft Edge shows a confirmation that browsing data has been cleared, or the Settings page shows completed clearing state"
+
+combine: or
+max_steps: 20
+
+milestones:
+  - name: "Edge is open"
+    check: command
+    run: "powershell -c \"Get-Process msedge -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count\""
+    expect: "1"
+    match: contains
+
+  - name: "Settings page is open"
+    check: screenshot
+    description: "Edge Settings page is visible, or edge://settings is in the address bar"
+
+  - name: "Clear browsing data dialog is open"
+    check: screenshot
+    description: "The 'Clear browsing data' dialog or panel is visible in Edge"
+
+  - name: "Data is cleared"
+    check: command
+    run: |
+      powershell -c "
+        $histPath = \"$env:LOCALAPPDATA\\Microsoft\\Edge\\User Data\\Default\\History\"
+        if (Test-Path $histPath) {
+          $size = (Get-Item $histPath).Length
+          if ($size -lt 50000) { Write-Output 'cleared' } else { Write-Output 'not_cleared' }
+        } else { Write-Output 'no_history_file' }
+      "
+    expect: "cleared"
+    match: contains
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# Simple 2-step task: create a folder named "TestFolder" on the Desktop
+name: "Create a folder named TestFolder on the Desktop"
+id: custom-desktop-folder
+
+setup:
+  - execute: "powershell -c \"Remove-Item -Path $env:USERPROFILE\\Desktop\\TestFolder -Recurse -Force -ErrorAction SilentlyContinue\""
+  - sleep: 1
+
+evaluate:
+  - check: command
+    run: "powershell -c \"Test-Path $env:USERPROFILE\\Desktop\\TestFolder\""
+    expect: "True"
+    match: exact
+
+max_steps: 10
+
+milestones:
+  - name: "Desktop is visible"
+    check: screenshot
+    description: "The Windows desktop is visible with icons"
+
+  - name: "Folder exists"
+    check: command
+    run: "powershell -c \"Test-Path $env:USERPROFILE\\Desktop\\TestFolder\""
+    expect: "True"
+    match: exact

example_tasks/notepad-hello.yaml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+# Simple 2-step task: open Notepad and type "Hello World"
+name: "Open Notepad and type Hello World"
+id: custom-notepad-hello
+
+setup:
+  - execute: "powershell -c 'Stop-Process -Name notepad -Force -ErrorAction SilentlyContinue'"
+  - sleep: 1
+
+evaluate:
+  - check: command
+    run: "powershell -c \"Get-Process notepad -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count\""
+    expect: "1"
+    match: exact
+
+  - check: screenshot
+    description: "Notepad window is open with 'Hello World' typed in the text area"
+
+combine: and
+max_steps: 10
+
+milestones:
+  - name: "Notepad is open"
+    check: command
+    run: "powershell -c \"Get-Process notepad -ErrorAction SilentlyContinue | Measure | Select -ExpandProperty Count\""
+    expect: "1"
+    match: exact
+
+  - name: "Text is typed"
+    check: screenshot
+    description: "Notepad shows 'Hello World' in the text area"
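Each entry under `evaluate` and `milestones` in the example tasks is a mapping with a `check` key plus type-specific fields. The sketch below shows one way such entries might be validated after the YAML has been loaded into plain dicts. The `Check` dataclass and `parse_check` function are hypothetical and do not describe the actual `TaskConfig` API. The four check type names come from the commit message.

```python
from dataclasses import dataclass, field

# Check types listed in the commit message.
CHECK_TYPES = {"command", "file", "screenshot", "python"}


@dataclass
class Check:
    """One evaluation check or milestone (hypothetical structure)."""
    check: str
    name: str = ""
    params: dict = field(default_factory=dict)


def parse_check(raw: dict) -> Check:
    """Validate a raw mapping and split it into type, name, and parameters."""
    kind = raw.get("check")
    if kind not in CHECK_TYPES:
        raise ValueError(f"unsupported check type: {kind!r}")
    params = {k: v for k, v in raw.items() if k not in ("check", "name")}
    return Check(check=kind, name=raw.get("name", ""), params=params)


# Mirrors the second milestone of the Notepad task above.
milestone = parse_check({
    "name": "Text is typed",
    "check": "screenshot",
    "description": "Notepad shows 'Hello World' in the text area",
})
print(milestone.check, sorted(milestone.params))
```

Rejecting unknown check types at load time surfaces a typo such as `check: screenshots` before any VM time is spent on a rollout.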

openadapt_evals/adapters/rl_env.py

Lines changed: 98 additions & 4 deletions
@@ -50,6 +50,11 @@
     BenchmarkTask,
 )
 
+# Avoid circular import — TaskConfig imported lazily
+TYPE_CHECKING = False
+if TYPE_CHECKING:
+    from openadapt_evals.task_config import TaskConfig
+
 logger = logging.getLogger(__name__)
 
 
@@ -112,10 +117,12 @@ def __init__(
         adapter: BenchmarkAdapter,
         default_task_id: str | None = None,
         evaluate_every_step: bool = False,
+        task_config: TaskConfig | None = None,
     ):
         self._adapter = adapter
         self._default_task_id = default_task_id
         self._evaluate_every_step = evaluate_every_step
+        self._task_config = task_config
         self._current_task: BenchmarkTask | None = None
         self._step_count = 0
         self._done = False
@@ -160,6 +167,23 @@ def screen_size(self) -> tuple[int, int]:
             return (width, height)
         return (1920, 1200)
 
+    def load_task_config(self, task_config: TaskConfig) -> None:
+        """Set a TaskConfig for dense reward evaluation.
+
+        When set, collect_rollout() and evaluate_dense() use milestone-based
+        partial credit instead of binary evaluation.
+
+        Args:
+            task_config: A TaskConfig loaded from YAML.
+        """
+        self._task_config = task_config
+        self._default_task_id = task_config.id
+        logger.info(
+            "Loaded TaskConfig: %s (%d milestones)",
+            task_config.name,
+            len(task_config.milestones),
+        )
+
     def reset(self, config: ResetConfig | None = None) -> BenchmarkObservation:
         """Reset environment to a task's initial state.
 
@@ -186,8 +210,15 @@ def reset(self, config: ResetConfig | None = None) -> BenchmarkObservation:
                 "Pass task_id in ResetConfig or set default_task_id in constructor."
             )
 
-        # Load and reset the task
-        self._current_task = self._adapter.load_task(task_id)
+        # Load the task — prefer TaskConfig if available (avoids server lookup)
+        if self._task_config and self._task_config.id == task_id:
+            self._current_task = self._task_config.to_benchmark_task()
+        elif hasattr(self._adapter, "load_task_from_json") and self._task_config:
+            self._current_task = self._adapter.load_task_from_json(
+                task_id, self._task_config.to_waa_config()
+            )
+        else:
+            self._current_task = self._adapter.load_task(task_id)
         obs = self._adapter.reset(self._current_task)
 
         # Reset episode state
@@ -429,6 +460,66 @@ def evaluate(self) -> float:
         )
         return result.score
 
+    def evaluate_dense(self) -> float:
+        """Evaluate using dense partial rewards via milestones.
+
+        If a TaskConfig with milestones is set, returns the fraction of
+        milestones passed (0.0 to 1.0). Falls back to binary evaluate()
+        if no TaskConfig or no milestones are defined.
+
+        This gives GRPO gradient signal even when no task fully completes:
+        an agent that passes 3/5 milestones gets reward 0.6 vs 0.0 for
+        one that passes 0/5.
+
+        Returns:
+            Dense reward score between 0.0 and 1.0.
+        """
+        if self._current_task is None:
+            raise RuntimeError("Call reset() before evaluate_dense().")
+
+        # Try milestone evaluation first
+        if self._task_config and self._task_config.milestones:
+            screenshot = b""
+            if self._last_obs and self._last_obs.screenshot:
+                screenshot = self._last_obs.screenshot
+
+            server_url = getattr(
+                getattr(self._adapter, "config", None), "server_url", ""
+            ) or ""
+
+            passed, total = self._task_config.evaluate_milestones(
+                screenshot, server_url
+            )
+            if total > 0:
+                milestone_score = passed / total
+
+                # Also try binary evaluation if available
+                try:
+                    binary_score = self.evaluate()
+                except Exception:
+                    binary_score = 0.0
+
+                # Use the higher of milestone score and binary score
+                # This way, full task completion (1.0) always beats partial (0.6)
+                score = max(milestone_score, binary_score)
+
+                # Backfill reward on last trajectory step
+                if self._trajectory:
+                    self._trajectory[-1].reward = score
+                    self._trajectory[-1].info["milestone_score"] = milestone_score
+                    self._trajectory[-1].info["binary_score"] = binary_score
+                    self._trajectory[-1].info["milestones_passed"] = passed
+                    self._trajectory[-1].info["milestones_total"] = total
+
+                logger.info(
+                    "Dense evaluation: milestones=%d/%d (%.2f), binary=%.2f, final=%.2f",
+                    passed, total, milestone_score, binary_score, score,
+                )
+                return score
+
+        # Fallback to binary evaluation
+        return self.evaluate()
+
     def collect_rollout(
         self,
         agent_fn: Callable[[BenchmarkObservation], BenchmarkAction],
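The scoring rule inside `evaluate_dense()` reduces to a few lines of arithmetic. The standalone sketch below restates it without the environment or adapter. The function name `dense_score` is hypothetical, but the logic follows the hunk above: take the milestone fraction, compare it with the binary score, and keep the higher of the two.

```python
def dense_score(passed: int, total: int, binary_score: float) -> float:
    """Restate the evaluate_dense() scoring rule (hypothetical helper)."""
    if total <= 0:
        # No milestones defined: fall back to the binary result.
        return binary_score
    milestone_score = passed / total
    # Full task completion (1.0) always beats partial credit.
    return max(milestone_score, binary_score)


print(dense_score(3, 5, 0.0))  # partial credit from milestones
print(dense_score(3, 5, 1.0))  # binary success overrides partial credit
print(dense_score(0, 0, 1.0))  # no milestones, binary only
```

Taking the maximum means a rollout that completes the task but trips a flaky milestone check is still rewarded as a success.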
@@ -493,8 +584,11 @@ def collect_rollout(
             if rollout_step.done:
                 break
 
-        # Evaluate and backfill reward
-        score = self.evaluate()
+        # Evaluate and backfill reward — use dense rewards if milestones exist
+        if self._task_config and self._task_config.milestones:
+            score = self.evaluate_dense()
+        else:
+            score = self.evaluate()
 
         logger.info(
             "Rollout complete: %d steps, score=%.2f",
