Commit 212725c

abrichr and claude authored
feat: consolidate benchmark infrastructure (v0.1.1)
* feat: consolidate benchmark infrastructure

Phase 1 of repo consolidation:

Adapters restructuring:
- Move adapters/waa.py → adapters/waa/mock.py
- Move adapters/waa_live.py → adapters/waa/live.py
- Create adapters/waa/__init__.py for clean imports

New infrastructure/ directory:
- Copy vm_monitor.py from openadapt-ml
- Copy azure_ops_tracker.py from openadapt-ml
- Copy ssh_tunnel.py from openadapt-ml

New waa_deploy/ directory:
- Copy Dockerfile for WAA Docker image
- Copy api_agent.py for in-container agent
- Copy start_waa_server.bat

New namespaced CLI (oa evals):
- Create cli/main.py with 'oa' entry point
- Create cli/vm.py with VM management commands
- Commands: oa evals vm, oa evals run, oa evals mock, etc.

Delete dead code (verified unused):
- benchmarks/agent.py, base.py, waa.py, waa_live.py (deprecated shims)
- benchmarks/auto_screenshot.py, dashboard_server.py
- benchmarks/generate_synthetic_demos.py, live_api.py
- benchmarks/validate_demos.py, validate_screenshots.py

Dependencies:
- Add requests and httpx to core dependencies
- Register 'oa' CLI entry point in pyproject.toml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(tests): fix 9 pre-existing test failures

- Fix classify_task_complexity to check medium before simple
  - Added "multitasking" to complex indicators
  - Added "file_explorer" to simple indicators and domains
  - Reordered checks: complex > medium > simple
- Update test_cost_optimization.py to match simplified estimate_cost API
  - Remove tests for unimplemented optimization params
  - Add test_estimate_cost_basic and test_estimate_cost_single_worker
  - Update test_target_cost_with_optimizations to use calculate_potential_savings
- Update test_evaluate_endpoint.py to match current adapter behavior
  - Adapter returns 0 score when evaluation unavailable (no fallback scoring)
  - Update assertions to check for "unavailable" or "evaluator" in reason

All 188 tests now pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(readme): add WAA benchmark results section with placeholders

Add benchmark results section to track:
- Baseline reproduction (GPT-4o vs paper reported ~19.5%)
- Model comparison (GPT-4o, Claude Sonnet 4.5)
- Domain breakdown by Windows application

Placeholders will be replaced with actual results once full WAA evaluation completes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: revert incidental beads changes

Remove local beads state changes that don't belong in this PR. The issues.jsonl changes were just comment ID renumbering, not substantive changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: delete dead code files as documented in PR

Delete deprecated stubs and unused tools from benchmarks/:

Deprecated stubs (re-exported from canonical locations):
- agent.py - was re-exporting from openadapt_evals.agents
- base.py - was re-exporting from openadapt_evals.adapters.base
- waa.py - was re-exporting from openadapt_evals.adapters.waa
- waa_live.py - was re-exporting from openadapt_evals.adapters.waa_live

Unused standalone tools:
- auto_screenshot.py - Playwright screenshot tool, only self-referenced
- dashboard_server.py - Flask dashboard, only self-referenced
- generate_synthetic_demos.py - LLM demo generator, never imported
- live_api.py - Simple Flask API, never imported
- validate_demos.py - Demo validator, never imported
- validate_screenshots.py - Screenshot validator, never imported

Also fixes imports in:
- azure.py: WAAAdapter now imported from adapters.waa
- adapters/waa/live.py: docstring example updated

All 188 tests pass after deletion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: bump version to 0.1.1

Changes since 0.1.0:
- Task ID format: mock_{domain}_{number:03d} (e.g., mock_browser_001)
- Restructured adapters to waa/ subdirectory
- Added infrastructure/ directory
- Dead code cleanup

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 44697a1 commit 212725c
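
The test-fix commit above reorders the complexity checks to complex > medium > simple. Below is a minimal sketch of why the order matters, assuming a keyword-based classifier. The function name `classify_by_priority`, its string input, the `"unknown"` default, and every indicator other than "multitasking" and "file_explorer" are hypothetical and not taken from the repository.

```python
# Hypothetical indicator lists. Only "multitasking" (complex) and
# "file_explorer" (simple) are named in the commit message; the rest
# are placeholders for illustration.
COMPLEX_INDICATORS = ("multitasking", "workflow")
MEDIUM_INDICATORS = ("settings", "browser")
SIMPLE_INDICATORS = ("file_explorer", "notepad")


def classify_by_priority(task_text: str) -> str:
    """Return the first matching tier, checking the most demanding tier first."""
    text = task_text.lower()
    for label, indicators in (
        ("complex", COMPLEX_INDICATORS),
        ("medium", MEDIUM_INDICATORS),
        ("simple", SIMPLE_INDICATORS),
    ):
        if any(word in text for word in indicators):
            return label
    return "unknown"  # assumed default; the real function may differ


# A task that mentions both a simple and a complex indicator is
# classified by the higher tier because that tier is checked first.
print(classify_by_priority("notepad_multitasking_1"))
```

If the simple tier were checked first, the same input would be labeled "simple", which is the kind of misclassification the reordering is meant to avoid.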

37 files changed

Lines changed: 5027 additions & 2810 deletions

.beads/issues.jsonl

Lines changed: 5 additions & 4 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 40 additions & 0 deletions
@@ -422,6 +422,46 @@ See [LIVE_MONITORING.md](./LIVE_MONITORING.md) for full documentation.
 - [CLAUDE.md](./CLAUDE.md) - Development guide and best practices
 - [CHANGELOG.md](./CHANGELOG.md) - Version history and changes
 
+## WAA Benchmark Results
+
+> **⚠️ PLACEHOLDER**: The results below are placeholders. Actual benchmark results will be added once the full evaluation completes.
+
+### Baseline Reproduction
+
+We run the full WAA benchmark using the same methodology as the original paper to establish baseline performance.
+
+**WAA Baseline Results (GPT-4o):**
+
+| Metric | Paper Reported | Our Reproduction | Status |
+|--------|----------------|------------------|--------|
+| Success Rate | ~19.5% | `[PLACEHOLDER]` | `[PENDING]` |
+| Tasks Evaluated | 154 | `[PLACEHOLDER]` | `[PENDING]` |
+| Avg Steps/Task | N/A | `[PLACEHOLDER]` | `[PENDING]` |
+| Avg Time/Task | N/A | `[PLACEHOLDER]` | `[PENDING]` |
+
+### Model Comparison
+
+Performance of different agents on WAA:
+
+| Agent | Success Rate | Avg Steps | Notes |
+|-------|--------------|-----------|-------|
+| GPT-4o (baseline) | `[PLACEHOLDER]` | `[PLACEHOLDER]` | Zero-shot |
+| Claude Sonnet 4.5 | `[PLACEHOLDER]` | `[PLACEHOLDER]` | Zero-shot |
+
+### Domain Breakdown
+
+Success rates by Windows application domain:
+
+| Domain | Tasks | Success Rate |
+|--------|-------|--------------|
+| Notepad | `[PLACEHOLDER]` | `[PLACEHOLDER]` |
+| Chrome | `[PLACEHOLDER]` | `[PLACEHOLDER]` |
+| File Explorer | `[PLACEHOLDER]` | `[PLACEHOLDER]` |
+| Settings | `[PLACEHOLDER]` | `[PLACEHOLDER]` |
+| ... | ... | ... |
+
+> **Note**: Full domain breakdown will be added when benchmark completes.
+
 ## License
 
 MIT

openadapt_evals/adapters/__init__.py

Lines changed: 14 additions & 2 deletions
@@ -34,8 +34,16 @@
     StaticDatasetAdapter,
     UIElement,
 )
-from openadapt_evals.adapters.waa import WAAAdapter, WAAConfig, WAAMockAdapter
-from openadapt_evals.adapters.waa_live import WAALiveAdapter, WAALiveConfig
+from openadapt_evals.adapters.waa import (
+    WAAAdapter,
+    WAAConfig,
+    WAAMockAdapter,
+    WAALiveAdapter,
+    WAALiveConfig,
+    SyntheticTaskError,
+    is_real_waa_task_id,
+    is_synthetic_task_id,
+)
 
 __all__ = [
     # Base classes
@@ -52,4 +60,8 @@
     "WAAMockAdapter",
     "WAALiveAdapter",
     "WAALiveConfig",
+    # Task ID validation
+    "SyntheticTaskError",
+    "is_real_waa_task_id",
+    "is_synthetic_task_id",
 ]
openadapt_evals/adapters/waa/__init__.py

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+"""Windows Agent Arena (WAA) adapters.
+
+This module provides adapters for the Windows Agent Arena benchmark:
+- WAAAdapter: Full WAA integration (requires WAA repo)
+- WAAMockAdapter: Mock adapter for testing (no Windows required)
+- WAALiveAdapter: HTTP adapter for remote WAA server
+
+Example:
+    ```python
+    from openadapt_evals.adapters.waa import WAAMockAdapter, WAALiveAdapter
+
+    # For local testing (no Windows VM)
+    adapter = WAAMockAdapter(num_tasks=10)
+
+    # For remote evaluation
+    adapter = WAALiveAdapter(server_url="http://vm-ip:5000")
+    ```
+"""
+
+from openadapt_evals.adapters.waa.mock import (
+    WAAAdapter,
+    WAAConfig,
+    WAAMockAdapter,
+    WAA_DOMAINS,
+)
+from openadapt_evals.adapters.waa.live import (
+    WAALiveAdapter,
+    WAALiveConfig,
+    SyntheticTaskError,
+    is_real_waa_task_id,
+    is_synthetic_task_id,
+    WAA_TASK_ID_PATTERN,
+    SYNTHETIC_TASK_PATTERNS,
+)
+
+__all__ = [
+    # Mock/full adapters
+    "WAAAdapter",
+    "WAAConfig",
+    "WAAMockAdapter",
+    "WAA_DOMAINS",
+    # Live adapter
+    "WAALiveAdapter",
+    "WAALiveConfig",
+    "WAA_TASK_ID_PATTERN",
+    "SYNTHETIC_TASK_PATTERNS",
+    # Task ID validation
+    "SyntheticTaskError",
+    "is_real_waa_task_id",
+    "is_synthetic_task_id",
+]
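
The task ID patterns re-exported by this package can be tried without installing it. The sketch below copies the two regular expressions from `adapters/waa/live.py` in this commit and wraps them in `describe_task_id`, a hypothetical helper (not part of the package) that labels an ID as real, synthetic, or neither.

```python
import re

# Patterns copied from adapters/waa/live.py in this commit.
WAA_TASK_ID_PATTERN = re.compile(
    r'^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}(-[A-Za-z0-9]+)?$'
)
SYNTHETIC_TASK_PATTERNS = [
    re.compile(r'^(mock_)?[a-z_]+_\d+$'),  # notepad_1, mock_chrome_001
    re.compile(r'^mock_'),                 # any mock_ prefix
]


def describe_task_id(task_id: str) -> str:
    """Hypothetical helper: label an ID as 'real', 'synthetic', or 'unrecognized'."""
    if WAA_TASK_ID_PATTERN.match(task_id):
        return "real"
    if any(pattern.match(task_id) for pattern in SYNTHETIC_TASK_PATTERNS):
        return "synthetic"
    return "unrecognized"


for sample in (
    "a1b2c3d4-e5f6-7890-abcd-ef1234567890-WOS",
    "notepad_1",
    "mock_notepad_001",
    "browser_abc123",
):
    print(f"{sample}: {describe_task_id(sample)}")
```

With the patterns as written, an ID such as `browser_abc123` is neither real nor synthetic, so a guard based only on `is_synthetic_task_id` would let it through. The UUID pattern also accepts only lowercase hex digits. Whether either behavior matters depends on how task IDs are produced upstream.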
openadapt_evals/adapters/waa/live.py

Lines changed: 104 additions & 31 deletions
@@ -15,7 +15,7 @@
 not pixel coordinates. WAA's Computer class handles the grounding.
 
 Example:
-    from openadapt_evals.benchmarks.waa_live import WAALiveAdapter, WAALiveConfig
+    from openadapt_evals.adapters.waa import WAALiveAdapter, WAALiveConfig
 
     adapter = WAALiveAdapter(WAALiveConfig(server_url="http://vm-ip:5000"))
     agent = DemoConditionedAgent(base_agent, retriever)
@@ -26,6 +26,7 @@
 
 import base64
 import logging
+import re
 import time
 from dataclasses import dataclass
 from typing import Any
@@ -41,6 +42,70 @@
 logger = logging.getLogger(__name__)
 
 
+# WAA task IDs are UUIDs with a domain suffix, e.g., "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-WOS"
+# Common suffixes: WOS (Windows OS), CHR (Chrome), NTP (Notepad), etc.
+WAA_TASK_ID_PATTERN = re.compile(
+    r'^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}(-[A-Za-z0-9]+)?$'
+)
+
+# Synthetic task ID patterns (from mock adapter or testing)
+SYNTHETIC_TASK_PATTERNS = [
+    re.compile(r'^(mock_)?[a-z_]+_\d+$'),  # notepad_1, mock_chrome_001
+    re.compile(r'^mock_'),  # any mock_ prefix
+]
+
+
+def is_real_waa_task_id(task_id: str) -> bool:
+    """Check if a task ID matches the real WAA UUID format.
+
+    Real WAA task IDs are UUIDs from test_small.json or test_all.json, e.g.:
+    - "a1b2c3d4-e5f6-7890-abcd-ef1234567890-WOS"
+    - "12345678-1234-1234-1234-123456789012-CHR"
+
+    Synthetic task IDs are simple patterns like:
+    - "notepad_1", "chrome_2" (from mock adapter)
+    - "mock_notepad_001" (explicit mock prefix)
+
+    Args:
+        task_id: Task identifier to check.
+
+    Returns:
+        True if the task ID appears to be a real WAA UUID.
+    """
+    return bool(WAA_TASK_ID_PATTERN.match(task_id))
+
+
+def is_synthetic_task_id(task_id: str) -> bool:
+    """Check if a task ID appears to be synthetic (for testing).
+
+    Args:
+        task_id: Task identifier to check.
+
+    Returns:
+        True if the task ID matches synthetic patterns.
+    """
+    for pattern in SYNTHETIC_TASK_PATTERNS:
+        if pattern.match(task_id):
+            return True
+    return False
+
+
+class SyntheticTaskError(ValueError):
+    """Raised when a synthetic task ID is used with the live adapter."""
+
+    def __init__(self, task_id: str):
+        self.task_id = task_id
+        super().__init__(
+            f"Task ID '{task_id}' appears to be synthetic (for testing). "
+            f"The live adapter requires real WAA task IDs (UUIDs from test_small.json or test_all.json). "
+            f"\n\nTo fix this:"
+            f"\n 1. Use --mock flag for testing without a Windows VM"
+            f"\n 2. Or provide real WAA task IDs with --task-ids"
+            f"\n 3. Or use --tasks N to select N random real tasks"
+            f"\n\nExample real task ID: 'a1b2c3d4-e5f6-7890-abcd-ef1234567890-WOS'"
+        )
+
+
 @dataclass
 class WAALiveConfig:
     """Configuration for WAALiveAdapter.
@@ -139,11 +204,20 @@ def load_task(self, task_id: str) -> BenchmarkTask:
         3. Creates minimal task as fallback
 
         Args:
-            task_id: Task identifier (e.g., "notepad_1", "browser_abc123").
+            task_id: Task identifier. Must be a real WAA UUID
+                (e.g., "a1b2c3d4-e5f6-7890-abcd-ef1234567890-WOS").
 
         Returns:
             BenchmarkTask object with evaluator config if available.
+
+        Raises:
+            SyntheticTaskError: If task_id appears to be synthetic (e.g., "notepad_1").
+                Use WAAMockAdapter for synthetic/testing tasks.
         """
+        # Validate that this is a real WAA task ID, not a synthetic one
+        if is_synthetic_task_id(task_id):
+            raise SyntheticTaskError(task_id)
+
         import requests
 
         # Try to load from server first
@@ -447,46 +521,45 @@ def evaluate(self, task: BenchmarkTask) -> BenchmarkResult:
             return self._evaluate_fallback(task)
 
     def _evaluate_fallback(self, task: BenchmarkTask) -> BenchmarkResult:
-        """Fallback evaluation when /evaluate endpoint is unavailable.
+        """Fallback when proper evaluation unavailable - returns failure.
 
-        Uses a simple heuristic based on:
-        - Whether the agent took any actions
-        - Whether the agent called DONE
-        - Whether the task has success criteria we can check locally
+        This method explicitly fails instead of providing fake heuristic scores.
+        Proper evaluation requires either:
+        1. WAA server with /evaluate endpoint deployed
+        2. Task configs with evaluator specs (set waa_examples_path)
+        3. Real WAA task IDs (UUIDs from test_small.json/test_all.json)
 
         Args:
             task: Task to evaluate.
 
         Returns:
-            BenchmarkResult with heuristic-based score.
+            BenchmarkResult with success=False and score=0.0.
         """
-        has_actions = len(self._actions) > 0
-        called_done = any(a.type == "done" for a in self._actions)
-        typed_text = any(a.type == "type" and a.text for a in self._actions)
-
-        # Calculate heuristic score
-        score = 0.0
-        if has_actions:
-            score += 0.2
-        if called_done:
-            score += 0.2
-        if typed_text:
-            score += 0.1
-        if self._step_count >= 2:
-            score += 0.1
-
-        # Cap at 0.5 since we can't truly verify success
-        score = min(score, 0.5)
+        # Check if task has evaluator config
+        has_evaluator = bool(
+            task.raw_config and task.raw_config.get("evaluator")
+        )
+
+        if has_evaluator:
+            reason = (
+                "Evaluation unavailable: WAA /evaluate endpoint not deployed. "
+                "Task has evaluator config but server cannot run it."
+            )
+        else:
+            reason = (
+                "Evaluation unavailable: task config missing evaluator spec. "
+                "Set waa_examples_path in config or use real WAA task IDs "
+                "(UUIDs from test_small.json/test_all.json, not synthetic IDs like 'notepad_1')."
+            )
+
+        logger.error(reason)
 
         return BenchmarkResult(
             task_id=task.task_id,
-            success=False,  # Can't determine without proper evaluation
-            score=score,
+            success=False,
+            score=0.0,
             num_steps=self._step_count,
-            reason=(
-                "Fallback evaluation (WAA /evaluate endpoint unavailable). "
-                f"Heuristic: actions={len(self._actions)}, done={called_done}, typed={typed_text}"
-            ),
+            reason=reason,
         )
 
     def close(self) -> None:
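
The rewritten fallback always reports failure and varies only the reason string. A minimal, self-contained sketch of that behavior follows. `Task` and `Result` are hypothetical stand-ins for `BenchmarkTask` and `BenchmarkResult`, `evaluate_fallback` is a free function rather than the adapter method, and the reason strings are shortened from those in the diff.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Task:  # hypothetical stand-in for BenchmarkTask
    task_id: str
    raw_config: Optional[dict[str, Any]] = None


@dataclass
class Result:  # hypothetical stand-in for BenchmarkResult
    task_id: str
    success: bool
    score: float
    reason: str


def evaluate_fallback(task: Task) -> Result:
    """Always report failure; only the reason depends on the task config."""
    has_evaluator = bool(task.raw_config and task.raw_config.get("evaluator"))
    if has_evaluator:
        reason = "Evaluation unavailable: WAA /evaluate endpoint not deployed."
    else:
        reason = "Evaluation unavailable: task config missing evaluator spec."
    return Result(task.task_id, success=False, score=0.0, reason=reason)


with_spec = evaluate_fallback(Task("t1", {"evaluator": {"func": "check"}}))
without_spec = evaluate_fallback(Task("t2"))
print(with_spec.score, with_spec.reason)
print(without_spec.score, without_spec.reason)
```

Compared with the removed heuristic, which could award up to 0.5 for merely taking actions, a score of 0.0 with an explicit reason keeps unverified runs from inflating aggregate success rates.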
openadapt_evals/adapters/waa/mock.py

Lines changed: 19 additions & 3 deletions
@@ -544,14 +544,24 @@ def _to_waa_action(self, action: BenchmarkAction) -> dict:
 class WAAMockAdapter(BenchmarkAdapter):
     """Mock WAA adapter for testing without Windows VM.
 
+    This adapter generates synthetic tasks for testing the benchmark infrastructure
+    without requiring a Windows VM or WAA server. Task IDs are prefixed with "mock_"
+    to clearly distinguish them from real WAA task IDs.
+
     Useful for:
     - Testing the benchmark integration without actual WAA
     - Development on non-Windows platforms
     - Unit tests
+    - Verifying agent behavior before running real evaluations
 
     Args:
         num_tasks: Number of mock tasks to generate.
         domains: Domains to include in mock tasks.
+
+    Note:
+        Mock task IDs use the format "mock_{domain}_{number}" (e.g., "mock_notepad_001").
+        These IDs are explicitly rejected by WAALiveAdapter to prevent confusion
+        between testing and real evaluation runs.
     """
 
     def __init__(
@@ -578,21 +588,27 @@ def benchmark_type(self) -> str:
         return "interactive"
 
     def _generate_mock_tasks(self) -> None:
-        """Generate mock tasks for testing."""
+        """Generate mock tasks for testing.
+
+        Task IDs use the format "mock_{domain}_{number}" (e.g., "mock_notepad_001")
+        to clearly distinguish them from real WAA UUIDs. This prevents accidental
+        use of synthetic tasks with the live adapter.
+        """
         tasks_per_domain = self._num_tasks // len(self._domains)
         extra = self._num_tasks % len(self._domains)
 
         for i, domain in enumerate(self._domains):
             count = tasks_per_domain + (1 if i < extra else 0)
             for j in range(count):
-                task_id = f"{domain}_{j + 1}"
+                # Use mock_ prefix to clearly indicate synthetic task
+                task_id = f"mock_{domain}_{j + 1:03d}"
                 self._tasks.append(
                     BenchmarkTask(
                         task_id=task_id,
                         instruction=f"Mock task {j + 1} in {domain} domain",
                         domain=domain,
                         time_limit_steps=15,
-                        raw_config={"mock": True},
+                        raw_config={"mock": True, "synthetic": True},
                     )
                 )
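
The ID scheme and the per-domain distribution in `_generate_mock_tasks` can be reproduced in isolation. The sketch below uses a hypothetical free function, `mock_task_ids`, that mirrors the arithmetic in the diff: tasks are split evenly across domains and the remainder goes to the first domains in order.

```python
def mock_task_ids(num_tasks: int, domains: list[str]) -> list[str]:
    """Hypothetical helper mirroring the ID scheme in _generate_mock_tasks."""
    tasks_per_domain, extra = divmod(num_tasks, len(domains))
    ids: list[str] = []
    for i, domain in enumerate(domains):
        # The first `extra` domains each receive one additional task.
        count = tasks_per_domain + (1 if i < extra else 0)
        ids.extend(f"mock_{domain}_{j + 1:03d}" for j in range(count))
    return ids


print(mock_task_ids(5, ["notepad", "browser"]))
```

For five tasks over two domains this yields three `notepad` IDs and two `browser` IDs, each with a zero-padded three-digit counter such as `mock_notepad_001`.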
