Commit 5c0aa52

abrichr and claude authored
feat: add observe_pil() convenience method for PIL image output (#93)
Add observe_pil() to WAALiveAdapter and RLEnvironment for VLM/RL pipelines that work with PIL images directly. Also clean up changelog formatting (remove leaked Co-authored-by trailer lines, fix collapsed bullet lists).

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d4176b6 commit 5c0aa52

4 files changed

Lines changed: 137 additions & 97 deletions

CHANGELOG.md

Lines changed: 33 additions & 97 deletions
@@ -9,24 +9,17 @@
 ([#90](https://github.com/OpenAdaptAI/openadapt-evals/pull/90),
 [`ca6a936`](https://github.com/OpenAdaptAI/openadapt-evals/commit/ca6a9362556852bd6ad040ba9ac7a5dfe3a7d880))
 
-Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
-
 ### Features
 
 - Add TaskVerifierRegistry for custom task verification
 ([#89](https://github.com/OpenAdaptAI/openadapt-evals/pull/89),
 [`639a6a2`](https://github.com/OpenAdaptAI/openadapt-evals/commit/639a6a2ba2a15e0c7a2a3bd65fa57a38f6966965))
-
-Add a registry pattern for custom task verifiers that can inspect VM state after task execution.
-This enables GoTo IT Autopilot (and other integrators) to register domain-specific verification
-functions without subclassing BenchmarkAdapter.
-
-- TaskVerifierRegistry with decorator and programmatic registration - VerificationResult dataclass
-with success/score/details - WAALiveAdapter.run_powershell() for executing PowerShell on the VM -
-Built-in clear_browsing_data reference verifier - 33 tests covering registry operations and
-built-in verifiers - Exports from evaluation package and main package __init__
-
-Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
+- TaskVerifierRegistry with decorator and programmatic registration
+- VerificationResult dataclass with success/score/details
+- WAALiveAdapter.run_powershell() for executing PowerShell on the VM
+- Built-in clear_browsing_data reference verifier
+- 33 tests covering registry operations and built-in verifiers
+- Exports from evaluation package and main package __init__
 
 
 ## v0.25.1 (2026-03-03)
@@ -36,17 +29,15 @@ Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
 - Address review findings in verl-agent adapter
 ([#88](https://github.com/OpenAdaptAI/openadapt-evals/pull/88),
 [`879c53c`](https://github.com/OpenAdaptAI/openadapt-evals/commit/879c53c182853104a4a8c5e179e810185180d37a))
-
-- Fix SCROLL direction not forwarded to BenchmarkAction.scroll_direction - Fix DRAG parsing to
-include end_x/end_y coordinates - Fix is_action_valid logic: use pattern match instead of inverted
-condition - Fix fractional coord conversion: trust _use_fractional flag instead of checking value
-ranges (0 and 1 are ambiguous between frac and pixel) - Convert drag end coordinates (end_x/end_y)
-from fractional to pixel - Add health_check() method returning
-ready/busy/needs_recovery/not_initialized - Add DRAG to system prompt DSL documentation - Fix
-vendored VAGEN source URL (mll-lab-nu -> RAGEN-AI) - Add 12 new tests: scroll direction, drag
-coords, health_check, is_action_valid
-
-Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
+- Fix SCROLL direction not forwarded to BenchmarkAction.scroll_direction
+- Fix DRAG parsing to include end_x/end_y coordinates
+- Fix is_action_valid logic: use pattern match instead of inverted condition
+- Fix fractional coord conversion: trust _use_fractional flag
+- Convert drag end coordinates from fractional to pixel
+- Add health_check() method returning ready/busy/needs_recovery/not_initialized
+- Add DRAG to system prompt DSL documentation
+- Fix vendored VAGEN source URL (mll-lab-nu -> RAGEN-AI)
+- Add 12 new tests
 
 
 ## v0.25.0 (2026-03-03)
@@ -56,86 +47,31 @@ Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
 - **agent**: Replace manual string escaping with repr() and fix CU agent bugs
 ([#83](https://github.com/OpenAdaptAI/openadapt-evals/pull/83),
 [`ffcb41d`](https://github.com/OpenAdaptAI/openadapt-evals/commit/ffcb41d9a2dd6cc53eae4bf478d2d3b139d22b84))
-
-* fix(agent): replace manual string escaping with repr() and fix CU agent bugs
-
-Five reliability fixes for eval runs:
-
-1. Replace _escape_for_pyautogui() with repr() in _build_type_commands() - eliminates entire class
-of string-embedding bugs (newlines, tabs, quotes, unicode) using Python's own escaping mechanism
-
-2. Fix drag coordinate field names: startCoordinate/endCoordinate (camelCase) →
-start_coordinate/coordinate (snake_case) per Claude computer_use API
-
-3. Add _clamp_coord() to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe, applied to
-click, drag, and mouse_move actions
-
-4. Re-inject demo text at every step in tool_result messages to prevent context drift in
-demo-conditioned evaluation
-
-5. Add command logging in WAALiveAdapter.step() for debugging
-
-Also adds docs/eval_analysis_2026_03_02.md documenting ZS vs DC eval results and literature review
-on demo-conditioning approaches.
-
-Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
-
-* feat: add multi-level demo format transform and fix tests
-
-- Add scripts/transform_demo_format.py: transforms rigid {Observation, Intent, Action, Result} demos
-into adaptive {Think, Action, Expect} format with PLAN section (Option D from eval analysis) -
-LLM-assisted mode (default): uses vlm_call() for semantic transform - Rule-based mode (--no-llm):
-free, no API calls needed - Supports --dry-run for preview
-
-- Fix tests for repr() escaping and coordinate clamping: - Remove TestEscapeForPyautogui (tests
-deleted function) - Update TestBuildTypeCommands for repr() output format - Add
-test_all_special_chars_produce_valid_python invariant test - Fix drag test to use snake_case field
-names - Fix coordinate edge test to expect clamped (0.005, 0.005)
-
-- Regenerate uv.lock for consilium package name resolution
-
-* docs: add DC-multilevel eval results to analysis
-
-DC-multilevel (new {Think, Action, Expect} + PLAN format) showed clear improvement over DC-rigid:
-agent followed the plan, entered all headers and years, typed correct formula, used drag-fill.
-Still scored 0.0 due to premature task completion (finished 1/3 columns), but qualitatively the
-best behavior across all three conditions.
-
----------
-
-Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
+1. Replace `_escape_for_pyautogui()` with `repr()` in `_build_type_commands()`
+2. Fix drag coordinate field names: camelCase to snake_case per Claude computer_use API
+3. Add `_clamp_coord()` to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe
+4. Re-inject demo text at every step to prevent context drift in demo-conditioned evaluation
+5. Add command logging in WAALiveAdapter.step() for debugging
+- Add `scripts/transform_demo_format.py`: transforms rigid demos into adaptive format
+- Fix tests for repr() escaping and coordinate clamping
+- Add eval analysis document with ZS vs DC results
 
 ### Features
 
 - Add VAGEN/verl-agent environment adapter for VLM RL training
 ([`0183321`](https://github.com/OpenAdaptAI/openadapt-evals/commit/018332168389b4be74660ecbc754eef5768f6b51))
 
-* feat: add VAGEN/verl-agent environment adapter for VLM RL training
-
-WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling desktop GUI automation
+WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling desktop GUI automation
 training with verl-agent's multi-turn VLM RL pipeline (GiGPO, GRPO, PPO).
-
-The adapter translates between openadapt-evals BenchmarkObservation (PNG bytes + a11y tree) and
-VAGEN's observation format (obs_str + multi_modal_input with PIL images).
-
-- Async interface (reset/step/close/system_prompt) - Action DSL parsing (CLICK, TYPE, KEY, SCROLL,
-WAIT, DONE) - Fractional coordinate support (0.0-1.0) - Lazy adapter initialization - 21 tests
-passing with mock adapter - Example VAGEN training config included
-
-Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
-
-* docs: add comprehensive verl-agent decision document
-
-Records the full reasoning chain for choosing verl-agent/VAGEN: - Framework comparison (TRL,
-standalone, verl-agent, VAGEN, OpenRLHF, Unsloth) - Key insight: per-step verification via GiGPO
-for long-horizon GUI tasks - TRL multi-turn VLM blocker (issues #5119, #5120) - "Environment is
-the moat" strategic framing - Architecture diagram and migration path
-
-* feat: add verl-agent as optional dependency
-
-* feat: vendor GymImageEnv base classes from VAGEN
-
-* docs: fact-check framework review in verl decision doc
+- Async interface (reset/step/close/system_prompt)
+- Action DSL parsing (CLICK, TYPE, KEY, SCROLL, WAIT, DONE)
+- Fractional coordinate support (0.0-1.0)
+- Lazy adapter initialization
+- 21 tests passing with mock adapter
+- Example VAGEN training config included
+- Comprehensive verl-agent decision document
+- Vendor GymImageEnv base classes from VAGEN
+- Fact-check framework review in verl decision doc
 
 Update Sections E (OpenRLHF), F (Unsloth), TRL, and comparison matrix with accurate details from
 thorough review:

openadapt_evals/adapters/rl_env.py

Lines changed: 28 additions & 0 deletions
@@ -348,6 +348,34 @@ def observe(self) -> BenchmarkObservation:
             return self._last_obs
         raise RuntimeError("Call reset() before observe().")
 
+    def observe_pil(self) -> "PIL.Image.Image":
+        """Get current screenshot as a PIL Image.
+
+        Convenience wrapper around observe() for VLM/RL training pipelines
+        that work with PIL images directly.
+
+        If the underlying adapter has an ``observe_pil()`` method (e.g.,
+        WAALiveAdapter), delegates to it. Otherwise calls observe() and
+        converts the screenshot bytes to a PIL Image.
+
+        Returns:
+            PIL.Image.Image of the current desktop state.
+
+        Raises:
+            RuntimeError: If no screenshot is available (call reset() first).
+        """
+        if hasattr(self._adapter, "observe_pil"):
+            return self._adapter.observe_pil()
+
+        import io
+
+        from PIL import Image
+
+        obs = self.observe()
+        if not obs.screenshot:
+            raise RuntimeError("No screenshot available from adapter")
+        return Image.open(io.BytesIO(obs.screenshot))
+
     def evaluate(self) -> float:
         """Run the WAA evaluator on the current VM state.
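
The delegate-or-fallback logic in the hunk above can be exercised in isolation. The following is a minimal sketch, not the project's code: `BytesOnlyAdapter`, `PilAwareAdapter`, `make_png`, and the free function `observe_pil` are hypothetical stand-ins, and the screenshot bytes are passed in directly rather than read from a `BenchmarkObservation`. It assumes Pillow is installed.

```python
from __future__ import annotations

import io

from PIL import Image


class BytesOnlyAdapter:
    """Hypothetical adapter with no observe_pil() of its own."""


class PilAwareAdapter:
    """Hypothetical adapter that supplies its own observe_pil()."""

    def __init__(self) -> None:
        self.calls = 0

    def observe_pil(self) -> Image.Image:
        self.calls += 1
        return Image.new("RGB", (8, 8))


def make_png(size: tuple[int, int]) -> bytes:
    """Encode a blank RGB image as PNG bytes (stand-in for a screenshot)."""
    buf = io.BytesIO()
    Image.new("RGB", size).save(buf, format="PNG")
    return buf.getvalue()


def observe_pil(adapter: object, screenshot: bytes | None) -> Image.Image:
    """Prefer the adapter's own observe_pil(); otherwise decode the bytes."""
    if hasattr(adapter, "observe_pil"):
        return adapter.observe_pil()
    if not screenshot:
        raise RuntimeError("No screenshot available from adapter")
    return Image.open(io.BytesIO(screenshot))
```

With a `BytesOnlyAdapter` the function decodes the supplied PNG bytes; with a `PilAwareAdapter` it never touches the bytes argument, so an empty screenshot only raises on the fallback path.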

openadapt_evals/adapters/waa/live.py

Lines changed: 21 additions & 0 deletions
@@ -905,6 +905,27 @@ def observe(self) -> BenchmarkObservation:
         """
         return self._get_observation()
 
+    def observe_pil(self) -> "PIL.Image.Image":
+        """Get current screenshot as a PIL Image.
+
+        Convenience wrapper around observe() for VLM/RL training pipelines
+        that work with PIL images directly.
+
+        Returns:
+            PIL.Image.Image of the current desktop state.
+
+        Raises:
+            RuntimeError: If no screenshot is available.
+        """
+        import io
+
+        from PIL import Image
+
+        obs = self.observe()
+        if not obs.screenshot:
+            raise RuntimeError("No screenshot available from WAA server")
+        return Image.open(io.BytesIO(obs.screenshot))
+
     def pixel_action(
         self,
         x: int | float | None = None,

tests/test_observe_pil.py

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+"""Tests for observe_pil() convenience method."""
+
+from __future__ import annotations
+
+import pytest
+
+from openadapt_evals.adapters.rl_env import RLEnvironment
+from openadapt_evals.adapters.waa.mock import WAAMockAdapter
+
+
+def _make_env() -> RLEnvironment:
+    adapter = WAAMockAdapter(num_tasks=3)
+    task_id = adapter.list_tasks()[0].task_id
+    return RLEnvironment(adapter, default_task_id=task_id)
+
+
+class TestObservePil:
+    def test_returns_pil_image(self):
+        from PIL import Image
+
+        env = _make_env()
+        env.reset()
+        img = env.observe_pil()
+        assert isinstance(img, Image.Image)
+        assert img.size == (1920, 1200)
+
+    def test_does_not_advance_step_count(self):
+        env = _make_env()
+        env.reset()
+        before = env._step_count
+        env.observe_pil()
+        assert env._step_count == before
+
+    def test_raises_without_reset(self):
+        env = _make_env()
+        with pytest.raises(RuntimeError):
+            env.observe_pil()
+
+    def test_image_mode_is_rgb(self):
+        env = _make_env()
+        env.reset()
+        img = env.observe_pil()
+        assert img.mode == "RGB"
+
+    def test_fallback_without_adapter_observe_pil(self):
+        """RLEnvironment falls back when adapter lacks observe_pil()."""
+        from PIL import Image
+
+        adapter = WAAMockAdapter(num_tasks=3)
+        assert not hasattr(adapter, "observe_pil")
+        task_id = adapter.list_tasks()[0].task_id
+        env = RLEnvironment(adapter, default_task_id=task_id)
+        env.reset()
+        img = env.observe_pil()
+        assert isinstance(img, Image.Image)
