Commit 12b6189

abrichr and claude authored
docs: update CLAUDE.md for unified evaluation CLI (#30)
All VM/pool management now lives in openadapt-evals (migrated from openadapt-ml in PR #29). Update CLAUDE.md to reflect:

- Single repo for all evaluation infrastructure
- oa-vm CLI entry point for VM/pool commands
- Updated architecture tree with infrastructure/ and waa_deploy/
- Removed references to openadapt_ml.benchmarks.cli

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ca791bf commit 12b6189

1 file changed

Lines changed: 75 additions & 34 deletions

CLAUDE.md

@@ -57,23 +57,28 @@ uv run python -m openadapt_evals.benchmarks.cli up

 ---

-## 🎯 WAA BENCHMARK WORKFLOW (COMPLETE GUIDE)
+## WAA BENCHMARK WORKFLOW (COMPLETE GUIDE)

 ### Architecture Overview

-The WAA setup spans TWO repos with distinct responsibilities:
+All evaluation infrastructure lives in openadapt-evals. Three CLI entry points:
+
+- `oa` — unified CLI (`oa evals run`, `oa evals mock`)
+- `oa-vm` — VM/pool management (`oa-vm pool-create`, `oa-vm status`)
+- `openadapt-evals` — legacy entry point

 ```
-LOCAL MACHINE
-├── openadapt-ml CLI (VM management)
-│ - vm setup-waa # Create VM + Docker + WAA
-│ - vm monitor # Dashboard + SSH tunnels
-│ - vm deallocate # Stop billing
+LOCAL MACHINE (openadapt-evals)
+├── oa-vm CLI (VM + pool management)
+│ - create / delete # Single VM lifecycle
+│ - pool-create / pool-cleanup # Multi-VM pools
+│ - vm monitor # Dashboard + SSH tunnels
+│ - pool-run # Distributed benchmark execution

-├── openadapt-evals CLI (benchmark execution)
-│ - run # Simplified benchmark run
-│ - live # Full control live eval
-│ - mock # No VM needed
+├── oa CLI (benchmark execution)
+│ - evals run # Simplified benchmark run
+│ - evals live # Full control live eval
+│ - evals mock # No VM needed

 └── SSH Tunnels (auto-managed by vm monitor)
 - localhost:5001 → VM:5000 (WAA Flask API)
@@ -89,21 +94,24 @@ AZURE VM (Ubuntu)

 ### Step-by-Step Workflow

-**Step 1: Setup VM (from openadapt-ml, first time only)**
-```bash
-cd /Users/abrichr/oa/src/openadapt-ml
-uv run python -m openadapt_ml.benchmarks.cli vm setup-waa ```
+All commands run from openadapt-evals (`cd /Users/abrichr/oa/src/openadapt-evals`).

-**Step 2: Start Dashboard and Tunnels (from openadapt-ml)**
+**Step 1: Create VM Pool**
 ```bash
-uv run python -m openadapt_ml.benchmarks.cli vm monitor
+# Single VM for quick tests
+oa-vm pool-create --workers 1
+
+# Multiple VMs for parallel evaluation
+oa-vm pool-create --workers 3
 ```
-Keep this running! It manages SSH tunnels automatically.

-**Step 3: Run Benchmark (from openadapt-evals)**
+**Step 2: Wait for WAA Ready**
 ```bash
-cd /Users/abrichr/oa/src/openadapt-evals
+oa-vm pool-wait
+```

+**Step 3: Run Benchmark**
+```bash
 # Quick smoke test (no API key needed)
 uv run python -m openadapt_evals.benchmarks.cli run --agent noop --task notepad_1

@@ -113,24 +121,23 @@ uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task no
 # With Claude
 uv run python -m openadapt_evals.benchmarks.cli run --agent api-claude --task notepad_1

-# Multiple tasks
-uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --tasks notepad_1,notepad_2
+# Distributed across pool
+oa-vm pool-run --tasks 10
 ```

 **Step 4: View Results**
 ```bash
 uv run python -m openadapt_evals.benchmarks.cli view --run-name live_eval
 ```

-**Step 5: Stop VM (from openadapt-ml)**
+**Step 5: Cleanup (Stop Billing)**
 ```bash
-cd /Users/abrichr/oa/src/openadapt-ml
-uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y
+oa-vm pool-cleanup -y
 ```

 ### Key Points

-1. **Two CLIs** - openadapt-ml manages VM/Docker, openadapt-evals runs benchmarks
+1. **One repo** - all VM management AND benchmark execution in openadapt-evals
 2. **SSH tunnels required** - Azure NSG blocks direct port access
 3. **Default server is localhost:5001** - The `run` command uses this automatically
 4. **WAA runs INSIDE Windows** - Not on the Ubuntu host
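
The pool workflow in Steps 1, 2, 3 and 5 above can also be scripted. The following is a minimal Python sketch, not code from the repository. It assumes `oa-vm` is on `PATH` and uses only the subcommands and flags that appear in this diff (`pool-create --workers`, `pool-wait`, `pool-run --tasks`, `pool-cleanup -y`). The helper names `pool_workflow_commands` and `run_pool_workflow` are hypothetical. Cleanup sits in a `finally` block so that it is attempted even when an earlier step fails.

```python
import subprocess
from typing import Callable, List

# Final step of the workflow, exactly as shown in Step 5.
CLEANUP_COMMAND = ["oa-vm", "pool-cleanup", "-y"]


def pool_workflow_commands(workers: int, tasks: int) -> List[List[str]]:
    """Build the create -> wait -> run commands (hypothetical helper)."""
    return [
        ["oa-vm", "pool-create", "--workers", str(workers)],
        ["oa-vm", "pool-wait"],
        ["oa-vm", "pool-run", "--tasks", str(tasks)],
    ]


def run_pool_workflow(
    workers: int = 1,
    tasks: int = 10,
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run the pool workflow and always attempt cleanup at the end."""
    try:
        for cmd in pool_workflow_commands(workers, tasks):
            # With subprocess.run, check=True raises CalledProcessError
            # when the command exits with a non-zero status.
            runner(cmd, check=True)
    finally:
        # Attempted on success and on failure so pool VMs are not left running.
        runner(CLEANUP_COMMAND, check=False)
```

Calling `run_pool_workflow(workers=3, tasks=10)` would issue the same four commands as the steps above. The `runner` parameter is there so the sequence can be dry-run or tested without touching Azure.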
@@ -140,6 +147,8 @@ uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y

 ## CLI Commands

+### Benchmark CLI (`openadapt_evals.benchmarks.cli`)
+
 | Command | Description |
 |---------|-------------|
 | `run` | **Simplified live evaluation** (uses localhost:5001 by default) |
@@ -151,10 +160,24 @@ uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y
 | `estimate` | Estimate Azure costs |
 | `dashboard` | Generate VM usage dashboard |
 | `up` | All-in-one: Start VM + WAA server + wait until ready |
-| `vm-start` | Start an Azure VM |
-| `vm-stop` | Stop (deallocate) an Azure VM |
-| `vm-status` | Check Azure VM status and IP |
-| `vm-setup` | Full WAA container setup (automated) |
+
+### VM/Pool CLI (`oa-vm`)
+
+| Command | Description |
+|---------|-------------|
+| `pool-create --workers N` | Create N VMs with Docker + WAA |
+| `pool-wait` | Wait for WAA server ready on all workers |
+| `pool-run --tasks N` | Run N tasks distributed across workers |
+| `pool-status` | Show status of all pool VMs |
+| `pool-vnc` | Open VNC to pool workers |
+| `pool-logs` | Stream logs from all workers |
+| `pool-cleanup -y` | Delete all pool VMs and resources |
+| `create --fast` | Create single VM |
+| `delete` | Delete VM and all resources |
+| `status` | Show VM status and IP |
+| `vm monitor` | Dashboard + SSH tunnels |
+| `deallocate` | Stop VM (preserves disk, stops billing) |
+| `azure-ml-quota-wait` | Wait for Azure quota approval |

 ### `run` Command (Recommended for Live Evaluation)

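
Because `run` targets `localhost:5001` by default and Azure NSG blocks direct port access, a run cannot reach the WAA server when the SSH tunnel is down. A quick reachability probe before launching can catch that. This is a sketch using only the Python standard library. `port_is_open` is a hypothetical helper, not something the CLI provides, and an open port only shows that something is listening there, not that WAA itself is ready.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 5001,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and resolution failures.
        return False
```

A script could call `port_is_open()` first and print a hint to start the tunnels when it returns `False`.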
@@ -193,14 +216,28 @@ openadapt_evals/
 │ ├── base.py # BenchmarkAdapter ABC
 │ ├── waa.py # WAAAdapter, WAAMockAdapter
 │ └── waa_live.py # WAALiveAdapter (HTTP)
+├── infrastructure/ # Azure VM/pool management (migrated from openadapt-ml)
+│ ├── azure_vm.py # AzureVMManager (SDK + az CLI)
+│ ├── pool.py # PoolManager (multi-VM orchestration)
+│ ├── vm_monitor.py # VMMonitor dashboard
+│ ├── azure_ops_tracker.py # Azure operations tracking
+│ ├── resource_tracker.py # Cost tracking
+│ └── ssh_tunnel.py # SSH tunnel manager
+├── waa_deploy/ # Docker agent deployment (migrated from openadapt-ml)
+│ ├── api_agent.py # ApiAgent for WAA container
+│ └── Dockerfile # WAA Docker image
 ├── server/ # WAA server extensions
 │ ├── evaluate_endpoint.py # /evaluate endpoint
 │ └── waa_server_patch.py # Deploy script
 ├── benchmarks/ # Evaluation utilities
 │ ├── runner.py # evaluate_agent_on_benchmark()
 │ ├── azure.py # AzureWAAOrchestrator
-│ ├── cli.py # Unified CLI
+│ ├── cli.py # Benchmark CLI (run, mock, live, view)
+│ ├── vm_cli.py # VM/Pool CLI (oa-vm entry point, 50+ commands)
+│ ├── pool_viewer.py # Pool results HTML viewer
+│ ├── trace_export.py # Training data export
 │ └── viewer.py # HTML viewer
+├── config.py # Settings (pydantic-settings, .env loading)
 └── __init__.py
 ```

@@ -227,7 +264,11 @@ agent = ApiAgent(provider="anthropic", demo="Step 1: Click Start menu\n...")
 | `agents/retrieval_agent.py` | Auto demo selection |
 | `adapters/waa_live.py` | HTTP adapter for WAA server |
 | `benchmarks/azure.py` | Azure orchestrator with cost optimization |
-| `benchmarks/cli.py` | CLI entry point |
+| `benchmarks/cli.py` | Benchmark CLI entry point |
+| `benchmarks/vm_cli.py` | VM/Pool CLI (`oa-vm`, 50+ commands) |
+| `infrastructure/azure_vm.py` | AzureVMManager (SDK + az CLI fallback) |
+| `infrastructure/pool.py` | PoolManager for parallel evaluation |
+| `config.py` | Settings (pydantic-settings, .env loading) |

 ## Azure Dashboard

@@ -239,11 +280,11 @@ Shows: real-time costs, VM status, activity logs, start/stop controls.
 ## WAA Container Setup

 ```bash
-uv run python -m openadapt_evals.benchmarks.cli vm-setup --auto-verify
+oa-vm vm setup-waa
 ```

 Automated WAA deployment (95%+ reliability). Fresh VM: 15-20 min, existing: 2-5 min.
-Implementation: bash script in cli.py. Use `--help` for troubleshooting.
+Implementation: in `benchmarks/vm_cli.py`. Use `--help` for troubleshooting.

 ## Screenshot Requirements
