Commit 7f171e4

abrichr and claude authored
refactor(benchmarks): consolidate to re-export from openadapt-evals (#17)
* docs: add verified repo consolidation plan

  - Two-package architecture: openadapt-evals (foundation) + openadapt-ml (ML)
  - Verified audit findings: 10 dead files confirmed, 3 previously marked dead but used
  - CLI namespacing: oa evals <cmd>, oa ml <cmd>
  - Dependency direction: openadapt-ml depends on openadapt-evals (not circular)
  - Agents with ML deps (PolicyAgent, BaselineAgent) move to openadapt-ml
  - adapters/waa/ subdirectory pattern for benchmark organization

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: add openadapt-evals as optional dependency

  Add [benchmarks] optional dependency for benchmark evaluation:
  - pip install openadapt-ml[benchmarks]

  This is part of the repo consolidation to establish:
  - openadapt-evals: Foundation for benchmarks + infrastructure
  - openadapt-ml: ML training (depends on evals for benchmarks)

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(cli): clarify serve vs dashboard command naming

  - oa ml serve: serve trained models for inference
  - oa ml dashboard: training dashboard for monitoring

  This distinguishes the two use cases clearly:
  - serve = model inference endpoint
  - dashboard = training progress UI

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor(benchmarks): consolidate to re-export from openadapt-evals

  Migrate benchmark infrastructure to two-package architecture:
  - openadapt-evals: Foundation package with all adapters, agents, runner
  - openadapt-ml: ML-specific agents that wrap openadapt-ml internals

  Changes:
  - Convert base.py, waa.py, waa_live.py, runner.py, data_collection.py,
    live_tracker.py to deprecation stubs that re-export from openadapt-evals
  - Keep only ML-specific agents in agent.py: PolicyAgent, APIBenchmarkAgent,
    UnifiedBaselineAgent
  - Update __init__.py to import from openadapt-evals with deprecation warning
  - Update tests to import from correct locations
  - Remove test_waa_live.py (tests belong in openadapt-evals)

  Net: -3540 lines of duplicate code removed

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor(benchmarks): delete deprecation stubs, import from openadapt-evals

  Remove deprecation stubs since there are no external users.
  Tests now import directly from openadapt-evals (canonical location).

  Deleted:
  - base.py, waa.py, waa_live.py, runner.py, data_collection.py, live_tracker.py

  Kept:
  - agent.py (ML-specific agents: PolicyAgent, APIBenchmarkAgent, UnifiedBaselineAgent)
  - __init__.py (simplified to only export ML-specific agents)

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(readme): add WAA benchmark results section with placeholders

  Add section 15 for Windows Agent Arena benchmark results with clearly
  marked placeholders. Results will be filled in when full evaluation
  completes. Warning banner indicates PR should not merge until
  placeholders are replaced.

  Sections added:
  - 15.1 Benchmark Overview
  - 15.2 Baseline Reproduction (paper vs our run)
  - 15.3 Model Comparison (GPT-4o, Claude, Qwen variants)
  - 15.4 Domain Breakdown

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(readme): move WAA benchmark results to openadapt-evals

  WAA benchmark results belong in openadapt-evals (the benchmark
  infrastructure package) rather than openadapt-ml (the training package).

  See: OpenAdaptAI/openadapt-evals#22

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(cli): add VNC auto-launch and --fast VM option

  - Add setup_vnc_tunnel_and_browser() helper for automatic VNC access
  - Add VM_SIZE_FAST constants with D8 series sizes
  - Add VM_SIZE_FAST_FALLBACKS for automatic region/size retry
  - Add --fast flag to create command for faster installations
  - Add --fast flag to start command for more QEMU resources (6 cores, 16GB)
  - Opens browser automatically after container starts

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: add WAA speedup options documentation

  - Document --fast VM flag usage
  - Explain parallelization options
  - Detail golden image approach for future optimization

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(readme): add benchmark execution logs section

  - Add section 13.5 with log viewing commands
  - Add benchmark run commands with examples
  - Renumber screenshot capture tool section to 13.6

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(readme): clarify --run flag for benchmark execution logs

  - Add logs --run command for viewing task progress
  - Add logs --run -f for live streaming
  - Add logs --run --tail N for last N lines

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(readme): add example output for logs commands

  - Add example output for `logs` (container status)
  - Add example output for `logs --run -f` (benchmark execution)

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(cli): add --progress flag for benchmark ETA

  - Add _show_benchmark_progress() function
  - Parse run logs for completed task count
  - Calculate elapsed time and estimated remaining
  - Show progress percentage

  Example usage:
      uv run python -m openadapt_ml.benchmarks.cli logs --progress

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(research): add cua.ai vs openadapt-ml WAA comparison

  Comprehensive analysis of Cua (YC X25) computer-use agent platform:
  - Architecture comparison (composite agents, sandbox-first)
  - Benchmark framework differences (cua-bench vs openadapt-evals)
  - Training data generation (trajectory replotting)
  - Recommendations: adopt patterns, not full migration

  Key findings:
  - Cua's parallelization uses multiple sandboxes (like our multi-VM plan)
  - Composite agent pattern could reduce API costs
  - HTML capture enables training data diversity

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(cli): add parallelization support with --worker-id and --num-workers

  WAA natively supports parallel execution by distributing tasks across workers.

  Usage:
      # Run on single VM (default)
      run --num-tasks 154

      # Run in parallel on multiple VMs
      VM1: run --num-tasks 154 --worker-id 0 --num-workers 3
      VM2: run --num-tasks 154 --worker-id 1 --num-workers 3
      VM3: run --num-tasks 154 --worker-id 2 --num-workers 3

  Tasks auto-distribute: worker 0 gets tasks 0-51, worker 1 gets 52-103, etc.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(research): add market positioning and strategic differentiation

  Expand cua_waa_comparison.md with:
  - Success rate gap analysis (38.1% vs 19.5%)
  - Market positioning comparison (TAM, buyers, value props)
  - Where sandbox approach fails (Citrix, licensed SW, compliance)
  - Shell applications convergence opportunities
  - Bottom line: Windows enterprise automation is hard, validates OpenAdapt approach

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(waa): add parallelization and scalable benchmark design docs

  - Add WAA_PARALLELIZATION_DESIGN.md documenting:
    - Official WAA approach (Azure ML Compute)
    - Our dedicated VM approach (dev/debug)
    - When to use each approach
  - Add WAA_UNATTENDED_SCALABLE.md documenting:
    - Goal: unattended, scalable, programmatic WAA
    - Synthesized approach using official run_azure.py
    - Implementation plan and cost estimates
  - Update Dockerfile comments to clarify:
    - API agents (api-claude, api-openai) run externally
    - openadapt-evals CLI connects via SSH tunnel
    - No internal run.py patching needed

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: fix ruff formatting

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(imports): update internal code to import from openadapt-evals

  Replace imports from deleted benchmark files with direct imports
  from openadapt-evals:
  - azure.py: BenchmarkResult, BenchmarkTask, WAAAdapter
  - waa_demo/runner.py: BenchmarkAction, WAAMockAdapter, etc.

  This completes the migration to the two-package architecture where
  openadapt-evals is the canonical source for benchmark infrastructure.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(imports): add missing EvaluationConfig import

  - Update azure.py to import BenchmarkAgent from openadapt_evals
  - Add EvaluationConfig to runner.py imports

  Fixes CI failure: F821 Undefined name `EvaluationConfig`

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(deps): require openadapt-evals>=0.1.1

  v0.1.0 uses task ID format "browser_1" but tests expect "mock_browser_001",
  which was added in v0.1.1.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 6d808ea commit 7f171e4
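
The commit message says tasks auto-distribute so that worker 0 gets tasks 0-51 and worker 1 gets 52-103. The splitting code itself is not part of this page, so the following is only a minimal sketch of one scheme that reproduces those ranges: contiguous blocks sized by ceiling division. The function name `worker_task_range` is hypothetical, and the range for worker 2 is inferred from the "etc." rather than stated in the commit.

```python
import math


def worker_task_range(num_tasks: int, worker_id: int, num_workers: int) -> range:
    """Return the contiguous block of task indices for one worker.

    Hypothetical helper: assumes ceiling-division chunking, which matches
    the ranges quoted in the commit message but is not confirmed by it.
    """
    if num_workers < 1 or not 0 <= worker_id < num_workers:
        raise ValueError("need 0 <= worker_id < num_workers")
    # Ceiling division covers every task; the last worker may get fewer.
    chunk = math.ceil(num_tasks / num_workers)
    start = min(worker_id * chunk, num_tasks)
    end = min(start + chunk, num_tasks)
    return range(start, end)


if __name__ == "__main__":
    for wid in range(3):
        r = worker_task_range(154, wid, 3)
        print(f"worker {wid}: tasks {r.start}-{r.stop - 1} ({len(r)} tasks)")
```

With 154 tasks and 3 workers the chunk size is 52, so the blocks are 0-51, 52-103, and 104-153, and the last worker receives 50 tasks instead of 52.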

25 files changed

Lines changed: 2957 additions & 3961 deletions

README.md

Lines changed: 104 additions & 1 deletion
@@ -825,9 +825,112 @@ uv run python -m openadapt_ml.benchmarks.cli vm monitor --mock
 uv run python -m openadapt_ml.benchmarks.cli vm monitor --auto-shutdown-hours 2
 ```
 
+### 13.5 Benchmark Execution Logs
+
+View benchmark execution progress and logs:
+
+```bash
+# View WAA container status and Docker logs
+uv run python -m openadapt_ml.benchmarks.cli logs
+
+# View WAA benchmark execution logs (task progress, agent actions)
+uv run python -m openadapt_ml.benchmarks.cli logs --run
+
+# Stream execution logs live
+uv run python -m openadapt_ml.benchmarks.cli logs --run -f
+
+# Show last N lines of execution logs
+uv run python -m openadapt_ml.benchmarks.cli logs --run --tail 100
+
+# Show benchmark progress and ETA
+uv run python -m openadapt_ml.benchmarks.cli logs --progress
+```
+
+**Example: Container status (`logs`)**
+```
+WAA Status (20.12.180.208)
+============================================================
+
+[Docker Images]
+REPOSITORY TAG SIZE
+waa-auto latest 25.4GB
+windowsarena/winarena latest 25.8GB
+
+[Container]
+Status: Up 49 minutes
+
+[Storage]
+Total: 21G
+Disk image: 64G
+
+[QEMU VM]
+Status: Running (PID 1471)
+CPU: 176%, MEM: 51.6%, Uptime: 47:28
+
+[WAA Server]
+"status": "Probe successful"
+(READY)
+```
+
+**Example: Benchmark execution logs (`logs --run -f`)**
+```
+Run log: /home/azureuser/cli_logs/run_20260128_175507.log
+------------------------------------------------------------
+Streaming log (Ctrl+C to stop)...
+
+[2026-01-28 23:05:10,303 INFO agent/401-MainProcess] Thinking...
+[2026-01-28 23:05:17,318 INFO python/62-MainProcess] Updated computer successfully
+[2026-01-28 23:05:17,318 INFO lib_run_single/56-MainProcess] Step 9: computer.window_manager.switch_to_application("Summer Trip - File Explorer")
+```
+
+**Example: Benchmark progress (`logs --progress`)**
+```
+=== WAA Benchmark Progress ===
+
+Log: /home/azureuser/cli_logs/run_20260128_175507.log
+Started: 2026-01-28 22:55:14
+Latest: 2026-01-28 23:28:37
+
+Tasks completed: 1 / 154
+Elapsed: 33 minutes
+
+Avg time per task: ~33 min
+Remaining tasks: 153
+Estimated remaining: ~84h 9m
+
+Progress: 0% [1/154]
+```
+
+**Other useful commands:**
+```bash
+# Check WAA server status (probe endpoint)
+uv run python -m openadapt_ml.benchmarks.cli probe
+
+# Check VM/Azure status
+uv run python -m openadapt_ml.benchmarks.cli status
+
+# Download benchmark results from VM
+uv run python -m openadapt_ml.benchmarks.cli download
+
+# Analyze downloaded results
+uv run python -m openadapt_ml.benchmarks.cli analyze
+```
+
+**Running benchmarks:**
+```bash
+# Run full benchmark (154 tasks)
+uv run python -m openadapt_ml.benchmarks.cli run --num-tasks 154
+
+# Run specific domain
+uv run python -m openadapt_ml.benchmarks.cli run --domain notepad --num-tasks 5
+
+# Run single task
+uv run python -m openadapt_ml.benchmarks.cli run --task notepad_1
+```
+
 For complete VM management commands and Azure setup instructions, see [`CLAUDE.md`](CLAUDE.md) and [`docs/azure_waa_setup.md`](docs/azure_waa_setup.md).
 
-### 13.5 Screenshot Capture Tool
+### 13.6 Screenshot Capture Tool
 
 Capture screenshots of dashboards and VMs for documentation and PR purposes:
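
The figures in the `logs --progress` example in the diff above are consistent with a simple average-rate estimate: 33 elapsed minutes for 1 completed task, times 153 remaining tasks, gives 5049 minutes, or 84h 9m. The sketch below reproduces that arithmetic. It is not the code of `_show_benchmark_progress()`, which this page does not show. The names `estimate_progress` and `format_minutes` are hypothetical, and truncating elapsed time to whole minutes is an assumption inferred from the sample output.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches the Started/Latest lines


def estimate_progress(started: str, latest: str, completed: int, total: int) -> dict:
    """Estimate remaining time from the average time per completed task."""
    delta = datetime.strptime(latest, TIMESTAMP_FORMAT) - datetime.strptime(
        started, TIMESTAMP_FORMAT
    )
    # Assumption: elapsed time is truncated to whole minutes before averaging.
    elapsed_min = int(delta.total_seconds() // 60)
    remaining = total - completed
    # No estimate is possible until at least one task has finished.
    eta_min = round(elapsed_min / completed * remaining) if completed else None
    return {
        "elapsed_min": elapsed_min,
        "remaining_tasks": remaining,
        "eta_min": eta_min,
        "percent": int(100 * completed / total) if total else 0,
    }


def format_minutes(minutes: int) -> str:
    """Render a minute count as hours and minutes, e.g. '84h 9m'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins}m"


if __name__ == "__main__":
    p = estimate_progress("2026-01-28 22:55:14", "2026-01-28 23:28:37", 1, 154)
    print(f"Elapsed: {p['elapsed_min']} minutes")
    print(f"Estimated remaining: ~{format_minutes(p['eta_min'])}")
    print(f"Progress: {p['percent']}% [1/154]")
```

An estimate based on a single completed task is noisy, so the ETA should tighten as more tasks finish.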
