
Commit 444d014

abrichr and claude committed
docs: update README with eval-suite, demo pipeline, golden images, and CI badge
- Add Tests CI badge
- Add ClaudeComputerUseAgent to agents list
- Add demo-conditioned evaluation section (record-waa, annotate, eval)
- Add eval-suite to benchmark CLI table
- Add pool-pause, pool-resume, image-create, image-list to VM CLI table
- Update contributing section to use uv instead of pip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent e54ef1b commit 444d014

1 file changed

Lines changed: 39 additions & 10 deletions


README.md

@@ -1,5 +1,6 @@
 # OpenAdapt Evals

+[![Tests](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/test.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/test.yml)
 [![Build](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/release.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-evals/actions/workflows/release.yml)
 [![PyPI](https://img.shields.io/pypi/v/openadapt-evals.svg)](https://pypi.org/project/openadapt-evals/)
 [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
@@ -31,7 +32,7 @@ OpenAdapt Evals is a unified framework for evaluating GUI automation agents agai
 ## Key Features

 - **Benchmark adapters** for WAA (live, mock, and local modes), with an extensible base for OSWorld, WebArena, and others
-- **Agent interfaces** including `ApiAgent` (Claude / GPT), `RetrievalAugmentedAgent`, `RandomAgent`, and `PolicyAgent`
+- **Agent interfaces** including `ApiAgent` (Claude / GPT), `ClaudeComputerUseAgent`, `RetrievalAugmentedAgent`, `RandomAgent`, and `PolicyAgent`
 - **Azure VM infrastructure** with `AzureVMManager`, `PoolManager`, `SSHTunnelManager`, and `VMMonitor` for running evaluations at scale
 - **CLI tools** -- `oa-vm` for VM and pool management (50+ commands), benchmark CLI for running evals
 - **Cost optimization** -- tiered VM sizing, spot instance support, and real-time cost tracking
@@ -98,6 +99,29 @@ metrics = compute_metrics(results)
 print(f"Success rate: {metrics['success_rate']:.1%}")
 ```

+### Demo-conditioned evaluation
+
+Record demos on a remote VM via VNC, annotate with a VLM, then run demo-conditioned eval:
+
+```bash
+# 1. Record demos interactively (perform actions on VNC, press Enter after each step)
+python scripts/record_waa_demos.py record-waa \
+--tasks 04d9aeaf,0a0faba3 \
+--server http://localhost:5001 \
+--output waa_recordings/
+
+# 2. Annotate recordings with VLM
+python scripts/record_waa_demos.py annotate \
+--recordings waa_recordings/ \
+--output annotated_demos/ \
+--provider openai
+
+# 3. Run demo-conditioned eval
+python scripts/record_waa_demos.py eval \
+--demo_dir annotated_demos/ \
+--tasks 04d9aeaf,0a0faba3
+```
+
 ### Parallel evaluation on Azure

 ```bash
@@ -159,13 +183,14 @@ LOCAL MACHINE AZURE VM (Ubuntu)

 | Command | Description |
 |------------|-----------------------------------------------|
-| `run` | Run live evaluation (localhost:5001 default) |
-| `mock` | Run with mock adapter (no VM required) |
-| `live` | Run against a WAA server (full control) |
-| `azure` | Run parallel evaluation on Azure ML |
-| `probe` | Check if a WAA server is ready |
-| `view` | Generate HTML viewer for results |
-| `estimate` | Estimate Azure costs |
+| `run` | Run live evaluation (localhost:5001 default) |
+| `mock` | Run with mock adapter (no VM required) |
+| `live` | Run against a WAA server (full control) |
+| `eval-suite` | Automated full-cycle evaluation (ZS + DC) |
+| `azure` | Run parallel evaluation on Azure ML |
+| `probe` | Check if a WAA server is ready |
+| `view` | Generate HTML viewer for results |
+| `estimate` | Estimate Azure costs |

 ### VM/Pool CLI (`oa-vm`)


170195
### VM/Pool CLI (`oa-vm`)
171196

@@ -175,7 +200,11 @@ LOCAL MACHINE AZURE VM (Ubuntu)
 | `pool-wait` | Wait until WAA is ready on all workers |
 | `pool-run` | Distribute tasks across pool workers |
 | `pool-status` | Show status of all pool VMs |
+| `pool-pause` | Deallocate pool VMs (stop billing) |
+| `pool-resume` | Restart deallocated pool VMs |
 | `pool-cleanup` | Delete all pool VMs and resources |
+| `image-create` | Create golden image from a pool VM |
+| `image-list` | List available golden images |
 | `vm monitor` | Dashboard with SSH tunnels |
 | `vm setup-waa` | Deploy WAA container on a VM |

@@ -226,8 +255,8 @@ We welcome contributions. To get started:
 ```bash
 git clone https://github.com/OpenAdaptAI/openadapt-evals.git
 cd openadapt-evals
-pip install -e ".[dev]"
-pytest tests/ -v
+uv sync --extra dev
+uv run pytest tests/ -v
 ```

 See [CLAUDE.md](./CLAUDE.md) for development conventions and architecture details.
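
The demo-conditioned section added to the README in this commit describes a three-stage pipeline in which each stage's output directory is the next stage's input: `record-waa` writes to `--output`, `annotate` reads that directory through `--recordings`, and `eval` reads the annotated demos through `--demo_dir`. The following is a minimal Python sketch of that data flow, assuming the script path and flags are exactly as they appear in the diff. `build_pipeline` and `render` are hypothetical helper names and not part of openadapt-evals. The sketch only builds the command lines and does not run them, since the recording step is interactive.

```python
import shlex

# Script path as shown in the README diff; everything else here is illustrative.
SCRIPT = "scripts/record_waa_demos.py"


def build_pipeline(tasks, server="http://localhost:5001",
                   recordings="waa_recordings/", demos="annotated_demos/",
                   provider="openai"):
    """Return argv lists for the record, annotate, and eval steps, in order.

    Hypothetical helper: it threads the directories through so that each
    step consumes what the previous step produced.
    """
    task_arg = ",".join(tasks)  # the README passes task IDs comma-separated
    record = ["python", SCRIPT, "record-waa",
              "--tasks", task_arg, "--server", server, "--output", recordings]
    annotate = ["python", SCRIPT, "annotate",
                "--recordings", recordings, "--output", demos,
                "--provider", provider]
    evaluate = ["python", SCRIPT, "eval",
                "--demo_dir", demos, "--tasks", task_arg]
    return [record, annotate, evaluate]


def render(commands):
    """Format each argv list as a single shell-safe command string."""
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    for line in render(build_pipeline(["04d9aeaf", "0a0faba3"])):
        print(line)
```

Keeping the directories as parameters of one function means a renamed recordings or demo directory cannot drift out of sync between steps, which is the easiest mistake to make when copying the three shell commands by hand.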
