# CHANGELOG


## v0.30.0 (2026-03-04)

### Bug Fixes

- **controller**: Prevent plan step drift and reduce VLM false negatives
  ([#97](https://github.com/OpenAdaptAI/openadapt-evals/pull/97),
  [`f1f3870`](https://github.com/OpenAdaptAI/openadapt-evals/commit/f1f3870c3d0dd1740b2943b9d25b28b14583e4a4))

* fix(controller): prevent plan step drift and reduce VLM false negatives

Two improvements to the closed-loop demo-conditioned controller:

1. Plan step tracking drift prevention: `_advance_plan_steps()` now compares only the current step
   against the next step, advancing at most one step per call. Previously, bulk keyword matching
   could jump five or more steps on a single action.

2. VLM verification prompt tuning: added a "partially_verified" status for cases where the core
   outcome is achieved but with minor deviations (cursor position, formatting). Rewrote all
   verification prompts to be outcome-focused, reducing false negatives in live eval scenarios.

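The at-most-one-step rule can be sketched as follows. This is a hypothetical illustration, not the project's actual implementation: the function signature and the naive keyword matcher are assumptions; only the "compare current vs. next, advance at most one" behavior comes from the entry above.

```python
# Hypothetical sketch of the at-most-one-step advancement rule; names and the
# keyword matcher are illustrative, not the project's API.
def advance_plan_steps(current_index, plan_steps, action_text):
    """Advance at most one step: only the *next* step is checked.

    Bulk matching against all remaining steps could skip several steps on a
    single action; comparing only current vs. next prevents drift.
    """
    next_index = current_index + 1
    if next_index >= len(plan_steps):
        return current_index  # already at the final step
    next_step = plan_steps[next_index]
    # Naive keyword match: every word of the next step appears in the action.
    if all(word in action_text.lower() for word in next_step.lower().split()):
        return next_index
    return current_index

# A matching action advances exactly one step, even if later steps match too.
steps = ["open file", "type name", "save file"]
print(advance_plan_steps(0, steps, "type name then save file"))  # 1, not 2
```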
Adds 68 new tests (8 drift prevention + 21 VLM prompt + 9 false-negative regressions + 30 existing
test updates). All 147 controller tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs(cost): add LLM agent economics analysis

Analyzes the unit economics of the closed-loop controller architecture: Claude agent costs, VLM
verifier costs, scaling projections, and a three-phase strategy from loop-as-product to
trained-model-as-product.

* fix(agent): replace pyautogui.drag() with mouseDown/moveTo/mouseUp

pyautogui.drag() uses relative coordinates that compound with starting-position errors, making it
unreliable for small targets such as LibreOffice fill handles (~3x3 pixels). Replace it with an
explicit mouseDown/moveTo/mouseUp sequence with timing delays for reliable drag operations.

Also adds a drag case to _build_pixel_command() for the pixel_action() path.

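The explicit sequence might look like the sketch below. The helper name and delay values are assumptions; the backend is injected so the sketch stays testable without a display, and passing the real `pyautogui` module as `backend` maps the calls onto `pyautogui.mouseDown`/`moveTo`/`mouseUp`.

```python
import time

# Illustrative sketch of the explicit drag sequence; the function name and
# delay values are assumptions, not the project's actual implementation.
# `backend` is any object exposing mouseDown/moveTo/mouseUp (e.g. pyautogui).
def drag_absolute(backend, start, end, settle=0.2, move_duration=0.5):
    sx, sy = start
    ex, ey = end
    backend.moveTo(sx, sy)        # position over the drag source first
    backend.mouseDown(sx, sy)     # press at absolute start coordinates
    time.sleep(settle)            # let the app register the press
    backend.moveTo(ex, ey, duration=move_duration)  # smooth absolute move
    time.sleep(settle)
    backend.mouseUp(ex, ey)       # release at the absolute end point
```

Unlike pyautogui.drag(), every coordinate here is absolute, so a small error in the starting position does not shift the end point of the drag.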
* fix: prevent heuristic/verifier drift and surface partial steps in goal verification

Three issues addressed:

1. Heuristic/verifier step drift: The agent's keyword-based _advance_plan_steps() heuristic and
   the DemoController's VLM verifier operated on independent state, allowing them to disagree on
   which step was current. Fix: add an _external_step_control flag to the agent that the
   DemoController sets at init, making _advance_plan_steps() a no-op when the controller manages
   step progression via VLM verification.

2. partially_verified invisible to goal verification: When steps were marked partially_verified,
   the final goal verification pass had no visibility into which steps had partial completions.
   Fix: _verify_goal() now builds a step verification summary and augments the goal text with it
   when noteworthy statuses (partially_verified, failed) exist.

3. Missing integration tests: Added TestHeuristicVerifierSync (4 tests) and
   TestGoalVerificationContext (5 tests) verifying that the heuristic is properly disabled under
   controller management, that step advancement is driven by VLM verification, and that
   partial/failed step context reaches goal verification. Also added 2 agent-level tests for
   _external_step_control behavior.

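A minimal sketch of the no-op guard described in item 1. The flag and method names follow the changelog, but the class shape and the stand-in heuristic are hypothetical:

```python
# Hypothetical sketch of the _external_step_control guard; the class shape is
# illustrative, only the flag and method names come from the changelog entry.
class Agent:
    def __init__(self):
        self.current_step = 0
        self._external_step_control = False  # DemoController sets this at init

    def _advance_plan_steps(self, action_text):
        # No-op when the controller owns step progression via VLM verification;
        # otherwise the keyword heuristic could drift from the verifier's state.
        if self._external_step_control:
            return
        self.current_step += 1  # stand-in for the real keyword heuristic

agent = Agent()
agent._external_step_control = True
agent._advance_plan_steps("clicked save")
print(agent.current_step)  # 0: the controller manages the step, not the agent
```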
* fix: suppress stale agent plan progress under external step control

When DemoController sets _external_step_control=True, the agent's internal plan progress injection
and done-override logic now become no-ops. This prevents the agent from sending conflicting
step-tracking signals to the Claude model (agent says "step 1 in progress" while the controller
says "step 3 is current").

Three specific suppressions:

1. _build_initial_messages skips plan progress text injection
2. Follow-up messages skip plan progress / demo re-injection
3. Premature "done" override is left to the controller

Adds integration tests exercising agent+controller interaction:

- Agent suppresses progress under external control
- Agent injects progress normally without external control
- Controller's augmented task instruction reaches the agent
- Done override is handled by the controller, not the agent

* fix(adapter): ensure target app is focused after task setup

After WAA setup (close_all → verify_apps → download → open), the target application may be behind
other windows, still loading, or obscured by notifications. This wastes 6+ agent steps on recovery.

Add _ensure_app_focused() with a multi-strategy approach:

- Maps task related_apps to window title patterns
- Uses the WAA /setup/activate_window endpoint (same as WAA postconfig)
- Falls back to Alt+Tab
- Retries 3x with increasing delays (2s, 3s, 5s)
- Verifies the foreground window title via pygetwindow on the VM
- Runs during reset(), so it does NOT count against the agent step budget

Also adds the _APP_WINDOW_PATTERNS mapping and the _get_expected_window_patterns(),
_check_foreground_matches(), and _normalize_app_name() helpers.

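The retry/fallback loop might be structured like the following sketch. The callables are injected stand-ins for the WAA activate_window call, the pygetwindow title check, and the Alt+Tab fallback; only the delay schedule (2s, 3s, 5s) comes from the entry above.

```python
import time

# Hypothetical sketch of the focus-retry strategy; activate, foreground_matches,
# and fallback_alt_tab stand in for the WAA /setup/activate_window call, the
# pygetwindow foreground-title check, and the Alt+Tab fallback.
def ensure_app_focused(activate, foreground_matches, fallback_alt_tab,
                       delays=(2, 3, 5), sleep=time.sleep):
    for delay in delays:
        activate()                # primary strategy: WAA activate_window
        sleep(delay)              # increasing delays: 2s, 3s, 5s
        if foreground_matches():  # verify the foreground window title
            return True
        fallback_alt_tab()        # fallback strategy before the next retry
    return foreground_matches()   # final check after the last fallback
```

Injecting `sleep` keeps the sketch testable without real delays; the real helper presumably runs inside reset() so the retries never consume agent steps.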
* docs: add systematic failure mode analysis and training strategy

Comprehensive analysis of GUI agent failure modes with a taxonomy, recording system design,
training viability assessment, and prioritized action plan. Key findings:

- 4-category taxonomy: Environment, Agent Planning, Grounding, Verifier
- Existing ExecutionTraceCollector needs only minor extensions
- SFT on 50-100 corrected trajectories expected to yield a 10-30pp improvement
- Deterministic infrastructure fixes should come first (Tier 1)

* fix: address PR #97 review comments with clarifying comments and test dep

- Add a comment in reset() explaining why _external_step_control is not reset
- Add a comment on the hasattr guard explaining that the MagicMock behavior is acceptable
- Add a docstring note in TestFalseNegativeRegressions about the VLM response limitation
- Add flask to test optional-dependencies for CI coverage

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

### Features

- Add GPU training automation for verl-agent E2E workflow
  ([#87](https://github.com/OpenAdaptAI/openadapt-evals/pull/87),
  [`da17355`](https://github.com/OpenAdaptAI/openadapt-evals/commit/da173553c138ba6c818485ce377589e8d6241200))

* feat: add GPU training automation for verl-agent E2E workflow

- Add GPU_VM_SIZE_FALLBACKS to azure_vm.py (NC48ads_A100_v4, NC24ads, NC12s_v3)
- Add GPU_INSTANCE_TYPE_FALLBACKS to aws_vm.py (p3.8xlarge, g5.12xlarge, p3.2xlarge)
- Update find_available_size_and_region(gpu=True) on both providers + protocol
- Add scripts/setup_gpu_training.sh: installs conda, vLLM, flash-attn, verl-agent
- Add scripts/train_verl_e2e.py: provisions GPU VM, uploads setup, launches training
- Add oa-vm gpu-setup and gpu-train CLI commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

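The fallback selection could be sketched as a simple preference-ordered iteration. The availability probe is an injected assumption (the real per-provider capacity/quota check is not shown in the log); only the AWS fallback-list names come from the entry above.

```python
# Illustrative sketch of GPU fallback selection; `is_available` stands in for
# the provider-specific capacity/quota probe, which the changelog does not show.
GPU_INSTANCE_TYPE_FALLBACKS = ["p3.8xlarge", "g5.12xlarge", "p3.2xlarge"]

def find_available_instance_type(regions, is_available,
                                 fallbacks=GPU_INSTANCE_TYPE_FALLBACKS):
    """Return the first (instance_type, region) pair with capacity."""
    for instance_type in fallbacks:          # preference order
        for region in regions:
            if is_available(instance_type, region):
                return instance_type, region
    raise RuntimeError("no GPU capacity in any fallback/region combination")
```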
* fix: correct verl-agent Hydra config paths and document integration gap

Validated all 17 Hydra config paths against verl-agent's actual schema (ppo_trainer.yaml +
make_envs()). Key fixes:

- env.env_name: use the 'waa_desktop' short name, not a Python import path (verl-agent uses
  hardcoded dispatch, not dynamic imports)
- Remove env.env_kwargs (doesn't exist); use env.waa.* sub-keys instead
- Add data.train_files/val_files (required parquet, generated via data_preprocess.prepare --mode
  visual)
- Add missing overrides: algorithm.gamma, gpu_memory_utilization, ppo_mini_batch_size,
  filter_overlong_prompts, test_freq
- Add prepare_training_data() and patch_env_manager() steps
- Document the EnvironmentManagerBase integration gap in the decision doc

* fix: replace EnvironmentManagerBase with VAGEN registry-based env integration

The previous implementation incorrectly assumed verl-agent uses an EnvironmentManagerBase ABC with
a hardcoded make_envs() dispatch. Research reveals VAGEN actually uses:

- the GymImageEnv protocol (which WAADesktopEnv already implements)
- a YAML-based env registry (vagen/configs/env_registry.yaml)
- GymAgentLoop for training-time rollout orchestration

Changes:

- Replace patch_env_manager() with register_waa_env() (YAML registry)
- Add register_in_vagen() and generate_env_spec() helpers to verl_env.py
- Update launch_training() to generate a proper VAGEN training config
- Fix the Integration Gap section in the decision doc (no EnvironmentManagerBase)
- Update the training config YAML with an architecture diagram
- Add 5 new tests for the registration helpers (40 total, all passing)
- Export the new helpers from adapters/__init__.py

* fix: correct is_action_valid logic, scroll_direction, stale refs, and DRY violation

Review fixes for the GPU training automation branch:

- Fix is_action_valid: the logic was inverted (DONE() → invalid, garbage → valid); it now uses a
  regex match on the original action string
- Fix scroll_direction: SCROLL parsing now populates BenchmarkAction.scroll_direction
- Fix stale repo URLs: mll-lab-nu/VAGEN → RAGEN-AI/VAGEN across vendored files and docs
- Fix stale branch ref: setup_gpu_training.sh referenced a merged spike branch, now uses main
- Fix stale repo URL: langfengQ/verl-agent → RAGEN-AI/VAGEN in the setup script
- Add --recurse-submodules to git clone (verl is a VAGEN submodule)
- Remove dead params from register_waa_env() (waa_server, task_id, max_steps)
- Deduplicate the training command: vm_cli.py now delegates to launch_training()
- Update the test count in docs: 21 → 40+
- Add 3 new tests for is_action_valid behavior
- Add a scroll_direction assertion to the existing scroll test

All 43 tests pass.

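A regex-based validity check along the lines described might look like this. The action grammar below is guessed from the examples in the log (DONE()/CLICK/SCROLL-style actions); it is not the project's actual action set, only the "match the original string, DONE() valid, garbage invalid" behavior is from the entry.

```python
import re

# Hypothetical sketch of a regex-based action validator; the grammar below is
# assumed from examples in the changelog, not the project's real action set.
_ACTION_RE = re.compile(
    r"^\s*(DONE\(\)"                        # terminal action
    r"|CLICK\(\s*\d+\s*,\s*\d+\s*\)"        # CLICK(x, y) with pixel coords
    r"|TYPE\(.+\)"                          # TYPE(text)
    r"|SCROLL\((up|down|left|right)\))\s*$" # SCROLL(direction)
)

def is_action_valid(action: str) -> bool:
    # Match against the *original* action string, so a well-formed DONE() is
    # valid and free-form garbage is rejected (the pre-fix logic was inverted).
    return bool(_ACTION_RE.match(action))

print(is_action_valid("DONE()"))          # True
print(is_action_valid("sure, clicking"))  # False
```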
* fix: resolve lint errors (undefined use_fast, unused imports, f-strings)

- Remove the undefined `use_fast` guard; always log tried sizes on failure
- Remove the unused PoolManager import in vm_cli.py
- Remove extraneous f-string prefixes
- Remove the unused boto3 and SSH_OPTS imports in aws_vm.py

* fix: add evaluate_url support and E2E validation test

WAADesktopEnv now correctly separates:

- server_url (port 5000): the Windows VM Flask API (/screenshot, /execute_windows)
- evaluate_url (port 5001): evaluate_server.py (/setup, /evaluate, /probe)

Previously, the single server_url default pointed at 5001 (the evaluate server only), which caused
404s for screenshots and action execution.

Also adds scripts/test_verl_env_e2e.py, validated on AWS g5.xlarge (A10G) with a UNIX socket
bridge proxy chain to the Azure WAA VM.

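The two-endpoint split can be sketched as a small config object that routes each call to the right base URL. The class and method names are assumptions; only the default ports and the route names come from the entry above.

```python
# Illustrative sketch of the server_url / evaluate_url split; class and method
# names are assumptions, only the ports and routes come from the changelog.
class WaaEndpoints:
    def __init__(self,
                 server_url="http://localhost:5000",     # WAA Flask API
                 evaluate_url="http://localhost:5001"):  # evaluate_server.py
        self.server_url = server_url.rstrip("/")
        self.evaluate_url = evaluate_url.rstrip("/")

    def screenshot_url(self):
        # Observation/action traffic goes to the Flask API on port 5000.
        return f"{self.server_url}/screenshot"

    def evaluate_route(self, name):
        # /setup, /evaluate, /probe go to the evaluate server on port 5001.
        return f"{self.evaluate_url}/{name}"

eps = WaaEndpoints()
print(eps.screenshot_url())          # http://localhost:5000/screenshot
print(eps.evaluate_route("setup"))   # http://localhost:5001/setup
```

Conflating the two bases is exactly the 404 failure mode the entry describes: a screenshot request sent to the evaluate server has no matching route.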
* fix: use Deep Learning AMI for GPU instances and fix setup issues

- Add _find_latest_dl_ami() for GPU VMs (pre-installed NVIDIA drivers + CUDA)
- Add a gpu param to create_vm() to select the DL AMI vs standard Ubuntu
- Reorder GPU_INSTANCE_TYPE_FALLBACKS: prefer g5 (Ampere/A10G) over p3 (Volta/V100), since the
  OSS NVIDIA driver requires GSP (Turing+)
- Make OPENADAPT_EVALS_BRANCH configurable via env var in the setup script
- Add a conda TOS acceptance step (required since Miniconda 2025)

Validated on AWS g5.xlarge with an NVIDIA A10G 24GB GPU.

* docs: add GPU E2E validation report with artifacts

Documents the successful end-to-end validation of the verl-agent/VAGEN training pipeline on AWS
g5.xlarge (A10G 24GB) connecting to an Azure WAA VM. Includes architecture diagrams, proxy chain
details, raw test output, version listings, and issues discovered during validation.

* fix: resolve port inconsistencies and add missing context in validation docs

- Standardize the evaluate_url port to 5051 (socat bridge) across all docs
- Add an Artifact Stage column to the validation results table, mapping tests to raw output
- Add the docs commit (c2555ef) to the PR #87 commit list
- Clarify the 5050 vs 5051 port mapping in the architecture diagrams and data flow
- Expand e2e_test_output.txt Stage 7/8 with sub-steps matching the README table
- Add an SSH tunnel tip noting that the socat bridge is still required

* fix: clarify uvicorn version discrepancy and complete commit list

- Add a note to gpu_vm_stack_versions.txt explaining that the full pip list is from Stage 5 (vLLM
  install) and that uvicorn was later downgraded by VAGEN
- Add b7efb4f to the commit list in README.md

* fix: guard flash-attn install for Ampere+ GPUs and validate training data

- Check GPU compute capability before installing flash-attn; V100s (sm_70) don't support Flash
  Attention 2 (which requires sm_80+) and would fail at build or runtime
- Add post-preparation validation to prepare_training_data() ensuring the expected parquet files
  exist and are non-empty, rather than silently proceeding with missing data

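A sketch of the capability guard. The sm_80 threshold comes from the entry above; the parsing assumes an nvidia-smi-style capability string such as "7.0" or "8.6", and the function name is illustrative.

```python
# Illustrative guard for the flash-attn install decision; `capability` is a
# string like "7.0" or "8.6" (nvidia-smi-style compute capability output).
def should_install_flash_attn(capability: str) -> bool:
    major, _, minor = capability.partition(".")
    sm = int(major) * 10 + int(minor or 0)
    return sm >= 80  # Flash Attention 2 requires sm_80+ (Ampere or newer)

print(should_install_flash_attn("7.0"))  # False: V100 (sm_70) would fail
print(should_install_flash_attn("8.6"))  # True: A10G / Ampere-class
```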
* fix: update test to match server_url default port 5000

The generate_env_spec() default server_url is http://localhost:5000 (the WAA Flask API port), not
5001. The test expectation was stale.

* fix: split server_url/evaluate_url in training config and CLI args

The two-port WAA architecture uses separate endpoints:

- server_url (port 5000): WAA Flask API for screenshots and actions
- evaluate_url (port 5001): evaluate_server for setup and evaluate

Previously --waa-server defaulted to port 5001 and was assigned to server_url, conflating the two
endpoints. This fixes:

- train_verl_e2e.py: --waa-server defaults to 5000; add --evaluate-server
- vm_cli.py gpu-train: same CLI arg fixes; pass evaluate_url through
- train_waa_vagen.yaml: correct server_url to 5000; add evaluate_url
- Fix nested single quotes in register_waa_env (use a heredoc instead)
- Replace the fragile sys.path.insert with importlib.util

* fix: correct stale port in verl_env docstring and SSH tunnel comment

- verl_env.py docstring: server_url example 5001 -> 5000; add evaluate_url
- train_waa_vagen.yaml: SSH tunnel dest 5050 -> 5051 (the socat bridge, not the broken Docker
  port)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

## v0.29.0 (2026-03-03)

### Documentation