Commit 840f9ef

abrichr and claude authored
docs: update README with recent features from PRs #58-#75 (#82)
Add coverage for RL training environment, end-to-end eval pipeline, annotation pipeline, 4-layer probe diagnostics, demo recording persistence, review artifacts, coordinate clamping, and multi-cloud VMProvider protocol. Update architecture tree with new modules (rl_env.py, probe.py, annotation.py, vlm.py, vm_provider.py, evaluation/) and scripts directory. Add openadapt-consilium to related projects.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 8812e7c commit 840f9ef

1 file changed

Lines changed: 42 additions & 3 deletions

README.md

@@ -33,8 +33,13 @@ OpenAdapt Evals is a unified framework for evaluating GUI automation agents agai
 
 - **Benchmark adapters** for WAA (live, mock, and local modes), with an extensible base for OSWorld, WebArena, and others
 - **Task setup handlers** -- `verify_apps` and `install_apps` ensure required applications are present on the Windows VM before evaluation begins
-- **Agent interfaces** including `ApiAgent` (Claude / GPT), `ClaudeComputerUseAgent`, `RetrievalAugmentedAgent`, `RandomAgent`, and `PolicyAgent`
+- **Agent interfaces** including `ApiAgent` (Claude / GPT), `ClaudeComputerUseAgent` (with coordinate clamping and fail-safe recovery), `RetrievalAugmentedAgent`, `RandomAgent`, and `PolicyAgent`
 - **Multi-cloud VM infrastructure** with `AzureVMManager`, `AWSVMManager`, `PoolManager`, `SSHTunnelManager`, and `VMMonitor` for running evaluations at scale on Azure or AWS
+- **End-to-end eval pipeline** (`scripts/run_eval_pipeline.py`) -- orchestrates demo generation, VM lifecycle, SSH tunnels, and ZS/DC evaluation in a single command
+- **RL training environment** -- `RLEnvironment` wrapper provides a Gymnasium-style `reset`/`step`/`evaluate` interface for online RL (GRPO, PPO) with outcome-based rewards from WAA scores
+- **Annotation pipeline** -- VLM-based screenshot annotation (`annotation.py`, `vlm.py`) migrated from openadapt-ml so the full record-annotate-evaluate workflow runs within this repo
+- **4-layer WAA probe** -- `probe --detailed` checks screenshot capture, accessibility tree, action pipeline, and scoring independently; supports `--json` and `--layers` filtering
+- **Demo recording and review** -- VNC-based demo capture with auto-persistence (incremental `meta.json`, hardlinked PNGs), JPEG thumbnail deduplication, and markdown review artifact generation
 - **CLI tools** -- `oa-vm` for VM and pool management (50+ commands), benchmark CLI for running evals
 - **Cost optimization** -- tiered VM sizing, spot instance support, and real-time cost tracking
 - **Results visualization** -- HTML viewer with step-by-step screenshot replay, execution logs, and domain breakdowns
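Two of the added bullets describe mechanisms that are easier to picture in code: coordinate clamping and a Gymnasium-style `reset`/`step`/`evaluate` loop with outcome-based rewards. The sketch below is a toy illustration only. `ToyClickEnv`, `clamp_point`, the screen size, and the target rectangle are all hypothetical names and values, and nothing in this diff confirms that `RLEnvironment` or `ClaudeComputerUseAgent` use these signatures.

```python
from dataclasses import dataclass, field


def clamp_point(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    """Clamp a click to the valid pixel range [0, width-1] x [0, height-1]."""
    return max(0, min(x, width - 1)), max(0, min(y, height - 1))


@dataclass
class ToyClickEnv:
    """Hypothetical stand-in for a benchmark-backed environment.

    The toy task counts as solved once a click lands inside `target`,
    given as (x0, y0, x1, y1) in pixels.
    """
    width: int = 1280
    height: int = 720
    target: tuple[int, int, int, int] = (1200, 680, 1279, 719)
    max_steps: int = 5
    steps: int = 0
    clicks: list = field(default_factory=list)

    def reset(self) -> dict:
        """Start a new episode and return the initial observation."""
        self.steps = 0
        self.clicks = []
        return {"step": 0, "last_click": None}

    def evaluate(self) -> float:
        """Outcome score: 1.0 if any click hit the target, else 0.0."""
        x0, y0, x1, y1 = self.target
        hit = any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in self.clicks)
        return 1.0 if hit else 0.0

    def step(self, action: tuple[int, int]):
        """Gymnasium-style step: (obs, reward, terminated, truncated, info)."""
        point = clamp_point(*action, self.width, self.height)
        self.clicks.append(point)
        self.steps += 1
        terminated = self.evaluate() == 1.0
        truncated = not terminated and self.steps >= self.max_steps
        # Outcome-based reward: zero until the episode ends, then the score.
        reward = self.evaluate() if (terminated or truncated) else 0.0
        obs = {"step": self.steps, "last_click": point}
        info = {"clamped": point != tuple(action)}
        return obs, reward, terminated, truncated, info
```

In this toy, an off-screen click such as `env.step((5000, 9000))` is pulled back to the bottom-right pixel instead of being rejected. An RL loop would call `reset()` once per episode and `step()` until `terminated` or `truncated` is true.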
@@ -134,6 +139,24 @@ python scripts/record_waa_demos.py eval \
 --tasks 04d9aeaf,0a0faba3
 ```
 
+### End-to-end eval pipeline
+
+For a fully automated flow (demo generation, VM lifecycle, SSH tunnels, ZS and DC evaluation):
+
+```bash
+# Run for all recordings that have demos
+python scripts/run_eval_pipeline.py
+
+# Specific task(s)
+python scripts/run_eval_pipeline.py --tasks 04d9aeaf
+
+# Dry run
+python scripts/run_eval_pipeline.py --tasks 04d9aeaf --dry-run
+
+# AWS instead of Azure
+python scripts/run_eval_pipeline.py --cloud aws --vm-name waa-pool-00
+```
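When the pipeline is launched from another Python script, the same invocations can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository. It uses only the flags shown in the commands above, and it assumes that several task IDs are passed as one comma-separated value, which this diff shows for `record_waa_demos.py` but does not confirm for `run_eval_pipeline.py`.

```python
import sys


def build_pipeline_command(tasks=None, dry_run=False, cloud=None, vm_name=None):
    """Return an argv list for scripts/run_eval_pipeline.py (hypothetical helper)."""
    argv = [sys.executable, "scripts/run_eval_pipeline.py"]
    if tasks:
        # Assumption: multiple task IDs are joined with commas.
        argv += ["--tasks", ",".join(tasks)]
    if dry_run:
        argv.append("--dry-run")
    if cloud:
        argv += ["--cloud", cloud]
    if vm_name:
        argv += ["--vm-name", vm_name]
    return argv


# Example: the dry-run invocation for a single task.
# Pass the result to subprocess.run(cmd, check=True) to actually launch it.
cmd = build_pipeline_command(tasks=["04d9aeaf"], dry_run=True)
```

Building the argument list explicitly, rather than formatting a shell string, avoids quoting problems if a VM name or task ID ever contains unusual characters.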
+
 ### Parallel evaluation
 
 ```bash
@@ -158,20 +181,26 @@ openadapt_evals/
 ├── agents/ # Agent implementations
 │ ├── base.py # BenchmarkAgent ABC
 │ ├── api_agent.py # ApiAgent (Claude, GPT)
+│ ├── claude_computer_use_agent.py # ClaudeComputerUseAgent (coord clamping, fail-safe)
 │ ├── retrieval_agent.py# RetrievalAugmentedAgent
 │ └── policy_agent.py # PolicyAgent (trained models)
 ├── adapters/ # Benchmark adapters
 │ ├── base.py # BenchmarkAdapter ABC + data classes
+│ ├── rl_env.py # RLEnvironment (Gymnasium-style wrapper for GRPO/PPO)
 │ └── waa/ # WAA live, mock, and local adapters
 ├── infrastructure/ # Cloud VM and pool management
 │ ├── azure_vm.py # AzureVMManager
 │ ├── aws_vm.py # AWSVMManager
+│ ├── vm_provider.py # VMProvider protocol (multi-cloud abstraction)
 │ ├── pool.py # PoolManager
+│ ├── probe.py # 4-layer WAA probe (screenshot, a11y, action, score)
 │ ├── ssh_tunnel.py # SSHTunnelManager
 │ └── vm_monitor.py # VMMonitor dashboard
+├── evaluation/ # Shared evaluation utilities
+│ └── metrics.py # fuzzy_match and scoring functions
 ├── benchmarks/ # Evaluation runner, CLI, viewers
 │ ├── runner.py # evaluate_agent_on_benchmark()
-│ ├── cli.py # Benchmark CLI (run, mock, live, view)
+│ ├── cli.py # Benchmark CLI (run, mock, live, view, probe)
 │ ├── vm_cli.py # VM/Pool CLI (oa-vm, 50+ commands)
 │ ├── viewer.py # HTML results viewer
 │ ├── pool_viewer.py # Pool results viewer
@@ -180,9 +209,18 @@ openadapt_evals/
 │ ├── evaluate_server.py# Flask server (port 5050): /setup, /evaluate, /task
 │ ├── Dockerfile # QEMU + Windows 11 + pre-downloaded apps
 │ └── tools_config.json # App installer URLs and configs
+├── annotation.py # VLM-based demo annotation pipeline
+├── vlm.py # VLM provider abstraction (OpenAI, Anthropic)
 ├── server/ # WAA server extensions
 ├── config.py # Settings (pydantic-settings, .env)
 └── __init__.py
+scripts/
+├── run_eval_pipeline.py # End-to-end eval: demo gen + VM + ZS/DC eval
+├── record_waa_demos.py # Record demos via VNC
+├── generate_demo_review.py # Markdown review artifacts with thumbnails
+├── run_grpo_rollout.py # Example: collect RL rollouts from WAA
+├── refine_demo.py # Two-pass LLM demo refinement
+└── run_dc_eval.py # Demo-conditioned evaluation
 ```
 
 ### How it fits together
@@ -248,7 +286,7 @@ When a task config includes `related_apps`, the live adapter automatically prepe
 | `live` | Run against a WAA server (full control) |
 | `eval-suite` | Automated full-cycle evaluation (ZS + DC) |
 | `azure` | Run parallel evaluation on Azure ML |
-| `probe` | Check if a WAA server is ready |
+| `probe` | Check WAA readiness (`--detailed` for 4-layer diagnostics, `--json`, `--layers`) |
 | `view` | Generate HTML viewer for results |
 | `estimate` | Estimate Azure costs |
 
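The updated `probe` row describes layered diagnostics: each layer is checked independently, the set of layers can be filtered, and the result can be emitted as JSON. A minimal sketch of that pattern follows. The layer names are taken from the `probe.py` comment in the architecture tree. `run_probe`, the check callables, and the report shape are hypothetical and not taken from the actual implementation.

```python
import json
from typing import Callable, Dict, Iterable, Optional

# Layer names follow the README; whether the real --layers flag accepts
# exactly these strings is an assumption.
LAYERS = ("screenshot", "a11y", "action", "score")


def run_probe(checks: Dict[str, Callable[[], bool]],
              layers: Optional[Iterable[str]] = None) -> dict:
    """Run each selected layer on its own and collect per-layer results."""
    selected = list(layers) if layers is not None else list(LAYERS)
    report = {}
    for name in selected:
        try:
            report[name] = {"ok": bool(checks[name]()), "error": None}
        except Exception as exc:
            # A failing layer is recorded but does not stop later layers.
            report[name] = {"ok": False, "error": str(exc)}
    ready = all(entry["ok"] for entry in report.values())
    return {"ready": ready, "layers": report}


def _broken_a11y() -> bool:
    """Placeholder check that simulates an unavailable accessibility tree."""
    raise RuntimeError("tree unavailable")


demo_checks = {
    "screenshot": lambda: True,
    "a11y": _broken_a11y,
    "action": lambda: True,
    "score": lambda: True,
}

# Probe only two layers and print a JSON report.
result = run_probe(demo_checks, layers=["screenshot", "a11y"])
print(json.dumps(result, indent=2))
```

Catching exceptions per layer is the point of the design: a broken accessibility tree is reported as such, while the screenshot layer still shows as healthy, which narrows down where a WAA server is failing.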
@@ -370,6 +408,7 @@ See [CLAUDE.md](https://github.com/OpenAdaptAI/openadapt-evals/blob/main/CLAUDE.
 | [OpenAdapt](https://github.com/OpenAdaptAI/OpenAdapt) | Desktop automation with demo-conditioned AI agents |
 | [openadapt-ml](https://github.com/OpenAdaptAI/openadapt-ml) | Training and policy runtime |
 | [openadapt-capture](https://github.com/OpenAdaptAI/openadapt-capture) | Screen recording and demo sharing |
+| [openadapt-consilium](https://github.com/OpenAdaptAI/openadapt-consilium) | Multi-model consensus library |
 | [openadapt-grounding](https://github.com/OpenAdaptAI/openadapt-grounding) | UI element localization |
 
 ## License
