 # OpenAdapt Evals

+[![Tests](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/test.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/test.yml)
 [![Build](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/release.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/release.yml)
 [![PyPI](https://img.shields.io/pypi/v/openadapt-evals.svg)](https://pypi.org/project/openadapt-evals/)
 [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
@@ -31,7 +32,7 @@ OpenAdapt Evals is a unified framework for evaluating GUI automation agents agai
 ## Key Features

 - **Benchmark adapters** for WAA (live, mock, and local modes), with an extensible base for OSWorld, WebArena, and others
-- **Agent interfaces** including `ApiAgent` (Claude / GPT), `RetrievalAugmentedAgent`, `RandomAgent`, and `PolicyAgent`
+- **Agent interfaces** including `ApiAgent` (Claude / GPT), `ClaudeComputerUseAgent`, `RetrievalAugmentedAgent`, `RandomAgent`, and `PolicyAgent`
 - **Azure VM infrastructure** with `AzureVMManager`, `PoolManager`, `SSHTunnelManager`, and `VMMonitor` for running evaluations at scale
 - **CLI tools** -- `oa-vm` for VM and pool management (50+ commands), benchmark CLI for running evals
 - **Cost optimization** -- tiered VM sizing, spot instance support, and real-time cost tracking
@@ -98,6 +99,29 @@ metrics = compute_metrics(results)
 print(f"Success rate: {metrics['success_rate']:.1%}")
 ```

+### Demo-conditioned evaluation
+
+Record demos on a remote VM via VNC, annotate with a VLM, then run demo-conditioned eval:
+
+```bash
+# 1. Record demos interactively (perform actions on VNC, press Enter after each step)
+python scripts/record_waa_demos.py record-waa \
+    --tasks 04d9aeaf,0a0faba3 \
+    --server http://localhost:5001 \
+    --output waa_recordings/
+
+# 2. Annotate recordings with VLM
+python scripts/record_waa_demos.py annotate \
+    --recordings waa_recordings/ \
+    --output annotated_demos/ \
+    --provider openai
+
+# 3. Run demo-conditioned eval
+python scripts/record_waa_demos.py eval \
+    --demo_dir annotated_demos/ \
+    --tasks 04d9aeaf,0a0faba3
+```
+
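The three steps above could also be chained from Python. The following is a minimal sketch and not part of the repository: it assumes only the script path, subcommands, and flags shown in the bash example, and the helper names `build_demo_pipeline` and `run_pipeline` (including the `dry_run` switch) are hypothetical. By default it prints the commands rather than running them.

```python
import shlex
import subprocess
import sys

SCRIPT = "scripts/record_waa_demos.py"  # path taken from the README example


def build_demo_pipeline(tasks, server="http://localhost:5001",
                        recordings="waa_recordings/",
                        demos="annotated_demos/", provider="openai"):
    """Return the record, annotate, and eval commands as argument lists.

    Hypothetical helper: only the subcommands and flags shown in the
    README example are used; nothing else about the script is assumed.
    """
    task_arg = ",".join(tasks)
    return [
        [sys.executable, SCRIPT, "record-waa", "--tasks", task_arg,
         "--server", server, "--output", recordings],
        [sys.executable, SCRIPT, "annotate", "--recordings", recordings,
         "--output", demos, "--provider", provider],
        [sys.executable, SCRIPT, "eval", "--demo_dir", demos,
         "--tasks", task_arg],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(build_demo_pipeline(["04d9aeaf", "0a0faba3"]))
```

Because the recording step is interactive, a real run (`dry_run=False`) would need to be started from a terminal so that the child process can read the Enter key presses.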
 ### Parallel evaluation on Azure

 ```bash
@@ -159,13 +183,14 @@ LOCAL MACHINE AZURE VM (Ubuntu)

 | Command | Description |
 |------------|-----------------------------------------------|
-| `run` | Run live evaluation (localhost:5001 default) |
-| `mock` | Run with mock adapter (no VM required) |
-| `live` | Run against a WAA server (full control) |
-| `azure` | Run parallel evaluation on Azure ML |
-| `probe` | Check if a WAA server is ready |
-| `view` | Generate HTML viewer for results |
-| `estimate` | Estimate Azure costs |
+| `run` | Run live evaluation (localhost:5001 default) |
+| `mock` | Run with mock adapter (no VM required) |
+| `live` | Run against a WAA server (full control) |
+| `eval-suite` | Automated full-cycle evaluation (ZS + DC) |
+| `azure` | Run parallel evaluation on Azure ML |
+| `probe` | Check if a WAA server is ready |
+| `view` | Generate HTML viewer for results |
+| `estimate` | Estimate Azure costs |

 ### VM/Pool CLI (`oa-vm`)

@@ -175,7 +200,11 @@ LOCAL MACHINE AZURE VM (Ubuntu)
 | `pool-wait` | Wait until WAA is ready on all workers |
 | `pool-run` | Distribute tasks across pool workers |
 | `pool-status` | Show status of all pool VMs |
+| `pool-pause` | Deallocate pool VMs (stop billing) |
+| `pool-resume` | Restart deallocated pool VMs |
 | `pool-cleanup` | Delete all pool VMs and resources |
+| `image-create` | Create golden image from a pool VM |
+| `image-list` | List available golden images |
 | `vm monitor` | Dashboard with SSH tunnels |
 | `vm setup-waa` | Deploy WAA container on a VM |

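The `pool-pause` and `pool-resume` rows suggest resuming a pool only for the duration of a run. Below is a hedged sketch of that idea. The `oa_vm` wrapper, the `resumed_pool` context manager, and the injectable `runner` are hypothetical; only the subcommand names come from the table. Any arguments the real commands require are not shown in this section, so they are left out.

```python
import subprocess
from contextlib import contextmanager


def oa_vm(subcommand, runner=subprocess.run):
    """Invoke `oa-vm <subcommand>`; `runner` is injectable for dry runs."""
    # Required arguments, if any, are unknown here and deliberately omitted.
    return runner(["oa-vm", subcommand], check=True)


@contextmanager
def resumed_pool(runner=subprocess.run):
    """Resume the pool, wait for WAA, and always pause again on exit."""
    oa_vm("pool-resume", runner)
    try:
        oa_vm("pool-wait", runner)
        yield
    finally:
        # Deallocate even if the body raised, so billing stops.
        oa_vm("pool-pause", runner)


if __name__ == "__main__":
    # Dry run: record the commands instead of executing them.
    calls = []
    with resumed_pool(runner=lambda cmd, check: calls.append(cmd)):
        calls.append(["<run evaluation here>"])
    print(calls)
```

Putting `pool-pause` in the `finally` clause means a failed evaluation still deallocates the VMs, which is the point of pausing in the first place.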
@@ -226,8 +255,8 @@ We welcome contributions. To get started:
 ```bash
 git clone https://github.com/OpenAdaptAI/openadapt-evals.git
 cd openadapt-evals
-pip install -e ".[dev]"
-pytest tests/ -v
+uv sync --extra dev
+uv run pytest tests/ -v
 ```

 See [CLAUDE.md](./CLAUDE.md) for development conventions and architecture details.