feat: add end-to-end eval pipeline script #68
Merged
Orchestrates the full evaluation flow in a single command:
- Phase 1 (parallel): generate VLM demos + start VM if deallocated
- Phase 2: establish SSH tunnels, socat proxy, wait for WAA readiness
- Phase 3: run ZS and DC evaluations with health checks
- Phase 4: print results summary
Composes existing scripts (run_dc_eval, convert_recording_to_demo) without modifying them.
Supports --dry-run, --tasks, --zs-only/--dc-only, --skip-vm.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
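The four-phase flow described in this commit can be sketched roughly as follows. This is only an illustration under assumptions: `run_pipeline` and its callable parameters are hypothetical names, not the functions in `scripts/run_eval_pipeline.py`, and the real script shells out to existing scripts rather than taking callables.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(generate_demos, start_vm, connect, evaluate, summarize, dry_run=False):
    """Run the four phases in order; the two Phase 1 steps run concurrently.

    All five callables are hypothetical stand-ins for the real phase logic.
    """
    plan = [
        "phase 1: generate demos + start VM (parallel)",
        "phase 2: tunnels, proxy, wait for readiness",
        "phase 3: run ZS and DC evaluations",
        "phase 4: print results summary",
    ]
    if dry_run:
        # A dry run only reports the plan and touches nothing.
        return plan

    # Phase 1: demo generation and VM start are independent, so overlap them.
    with ThreadPoolExecutor(max_workers=2) as pool:
        demos_future = pool.submit(generate_demos)
        vm_future = pool.submit(start_vm)
        demos = demos_future.result()
        vm = vm_future.result()

    session = connect(vm)               # Phase 2
    results = evaluate(session, demos)  # Phase 3
    return summarize(results)           # Phase 4
```

Calling `.result()` on both futures re-raises any exception from either Phase 1 step before Phase 2 starts, which is one simple way to fail fast.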
The previous approach imported functions from run_dc_eval.py, which imports openadapt_evals at module level. This fails when running as a standalone script outside the uv environment. Inlining the small subprocess-based functions avoids the dependency chain.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Check and restart stopped WAA container in Phase 2 (handles VM deallocate/start where container exits)
- Increase default WAA readiness timeout from 420s to 1200s (cold boot can take 15-35 min)
- Add --vnc/--no-vnc flags to open VNC in browser (default: on)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
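A readiness wait with a configurable timeout, as adjusted in this commit, might look like the sketch below. The function name, the `probe` callable, and the 10-second interval are assumptions for illustration; only the 1200s default comes from the commit message.

```python
import time


def wait_until_ready(probe, timeout_s=1200.0, interval_s=10.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` until it returns True or `timeout_s` elapses.

    `probe` is a hypothetical zero-argument health check (for example an
    HTTP request against the WAA server). `clock` and `sleep` are
    injectable so the loop can be tested without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Using a monotonic clock keeps the deadline stable even if the system wall clock is adjusted during a long cold boot.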
Three fixes for the PyAutoGUI fail-safe issue:
1. Failsafe detection now checks ALL response statuses, not just 200. WAA returns fail-safe errors as HTTP 500 with the exception in the response body; the previous code only checked stderr on 200 responses. It also detects the "fail-safe triggered" substring (WAA's error format).
2. Coordinate clamping: all pixel coordinates are clamped to a 5px margin from screen edges via _clamp_pixel_coords(), preventing accidental corner touches that trigger the fail-safe.
3. Drag coordinate validation: skip drag actions with missing or all-zero coordinates instead of defaulting to (0,0), which guarantees a fail-safe trigger.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
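The three fixes can be illustrated with a minimal sketch. The function names and signatures below are hypothetical (the real helper is `_clamp_pixel_coords()` in the adapter, whose exact signature is not shown here), and the check for the `FailSafeException` class name is an added assumption alongside the "fail-safe triggered" substring named in the commit.

```python
def clamp_pixel_coords(x, y, width, height, margin=5):
    """Keep (x, y) at least `margin` pixels away from every screen edge."""
    cx = min(max(x, margin), width - 1 - margin)
    cy = min(max(y, margin), height - 1 - margin)
    return cx, cy


def is_failsafe_error(body):
    """Detect a fail-safe error in a response body, regardless of HTTP status."""
    text = body.lower()
    # "fail-safe triggered" is the format named in the commit message;
    # the exception class name is an assumed extra signal.
    return "fail-safe triggered" in text or "failsafeexception" in text


def drag_is_valid(start, end):
    """Reject drags with missing endpoints or with every coordinate zero."""
    if start is None or end is None:
        return False
    return any(value != 0 for value in (*start, *end))
```

For example, on a 1920x1080 screen a click at the top-left corner would be moved to (5, 5) instead of (0, 0), the position that trips PyAutoGUI's fail-safe.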
Comprehensive analysis of ZS vs DC evaluation on a 21-step LibreOffice Calc task. Key findings:
- ZS: stuck after 1 step (wait loop)
- DC: 30 steps, wrote 4 correct cross-sheet formulas for 1 of 3 columns
- Binary scoring (0.00 both) masks significant DC behavioral advantage
- Documents 3 infrastructure bugs found and fixed during eval
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a --deallocate-after flag that deallocates the VM after eval completes to stop billing. Uses raw az CLI because oa-vm deallocate hardcodes VM_NAME="waa-eval-vm" and doesn't accept --name, so it won't work for pool-style VMs like waa-pool-00.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
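As a rough sketch of what a raw `az` CLI deallocation can look like at this point in the history: `az vm deallocate` takes `--resource-group` and `--name`. The helper names and the resource group value are hypothetical, and the actual script may build the command differently.

```python
import subprocess


def build_deallocate_cmd(resource_group, vm_name):
    """Build the az CLI argument list for deallocating one VM."""
    return [
        "az", "vm", "deallocate",
        "--resource-group", resource_group,
        "--name", vm_name,
    ]


def deallocate_vm(resource_group, vm_name, runner=subprocess.run):
    """Run the deallocate command; return True when az exits with code 0.

    `runner` is injectable so the function can be exercised without Azure.
    """
    result = runner(
        build_deallocate_cmd(resource_group, vm_name),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Passing the VM name explicitly is what allows pool-style names such as waa-pool-00 to work.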
…ion) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Test _build_conditions with zs-only, dc-only, default, both-flags, multiple tasks, JSON fallback, and warning output
- Test _find_recordings_needing_demos with mocked filesystem covering existing demos, missing demos, no recording dir, no meta.json, meta_refined.json, task filters, and sorted output
- Test _print_summary for success/failure/empty/skip scenarios
- Test CLI argument parsing defaults and flag behavior
- Test --dry-run integration (exit codes and output content)
- Test module-level constants
- All 54 tests run without VM access
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace Azure-only az CLI calls with VMProvider interface (supports AWS)
- Fix macOS-only VNC opener with cross-platform webbrowser.open()
- Replace duplicate SSH/tunnel functions with infrastructure module calls
- Capture eval subprocess output for cleaner pipeline logs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
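The idea of a provider interface replacing cloud-specific CLI calls can be sketched as below. Everything here is an assumption for illustration: the method names `start` and `deallocate`, the `RecordingProvider` stand-in, and `run_with_vm` are hypothetical and do not reflect the actual `VMProvider` API in the repository.

```python
from typing import Protocol


class VMProvider(Protocol):
    """Hypothetical minimal interface a cloud backend would satisfy."""

    def start(self, name: str) -> None: ...

    def deallocate(self, name: str) -> None: ...


class RecordingProvider:
    """Stand-in backend that records calls instead of contacting a cloud."""

    def __init__(self, cloud):
        self.cloud = cloud
        self.calls = []

    def start(self, name):
        self.calls.append(("start", name))

    def deallocate(self, name):
        self.calls.append(("deallocate", name))


def run_with_vm(provider, name, job, deallocate_after=False):
    """Start the VM, run `job`, and optionally deallocate even if `job` fails."""
    provider.start(name)
    try:
        return job()
    finally:
        if deallocate_after:
            provider.deallocate(name)
```

Because the pipeline only talks to the interface, switching between Azure and AWS becomes a matter of which provider object is constructed.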
- Remove references to removed DEFAULT_VM_USER constant
- Replace --vm-user with --cloud in test parser
- Add --deallocate-after to test parser
- 53/53 tests pass
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rrors
When the agent sends text containing newlines, pyautogui.write() was
called with a literal newline in the Python string, causing an
"unterminated string literal" syntax error on the WAA server.
Adds _build_type_commands() which splits text on newlines and
interleaves pyautogui.write() with pyautogui.press('enter'). Also
extracts _escape_for_pyautogui() for consistent string escaping.
Updates analysis doc: corrects partial-scoring recommendation to note
it requires WAA server-side changes (compare_table metric), not just
adapter-side changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
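A minimal sketch of the newline handling described in this commit follows. The function names mirror `_build_type_commands()` and `_escape_for_pyautogui()` but the bodies are assumptions, not the code in `live.py`; in particular, joining statements with `; ` and normalising `\r\n` are choices made for this example.

```python
def escape_for_pyautogui(text):
    """Escape backslashes and single quotes for a single-quoted Python literal."""
    return text.replace("\\", "\\\\").replace("'", "\\'")


def build_type_commands(text):
    """Turn text into one line of Python that types it via pyautogui.

    Newlines never appear inside a string literal: each line is typed with
    pyautogui.write() and each line break becomes pyautogui.press('enter').
    """
    lines = text.replace("\r\n", "\n").split("\n")
    commands = ["import pyautogui"]
    for index, line in enumerate(lines):
        if line:
            commands.append("pyautogui.write('%s')" % escape_for_pyautogui(line))
        if index < len(lines) - 1:
            commands.append("pyautogui.press('enter')")
    return "; ".join(commands)
```

The generated command is a single physical line, so it cannot produce an "unterminated string literal" error when the server executes it.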
1a6826f to 8034a78
- Guard _create_vm_manager() behind dry-run check so --dry-run works without Azure/AWS SDKs configured
- Remove unused as_completed import
- Add parentheses to clarify sorted() if/else expression
- Make _build_type_commands() self-contained (includes import pyautogui) so concatenation at call sites is no longer fragile
- Extract build_parser() from main() so tests use the real parser instead of a manually reconstructed copy
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
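Extracting the parser into its own function lets tests exercise the same parser that `main()` uses. The sketch below uses only flag names that appear in this PR; the defaults, help strings, and the type of `--tasks` are assumptions, and the real `build_parser()` likely defines more options.

```python
import argparse


def build_parser():
    """Build the CLI parser separately from main() so tests can reuse it."""
    parser = argparse.ArgumentParser(description="End-to-end eval pipeline (sketch)")
    parser.add_argument("--tasks", default=None, help="task ID filter (assumed format)")
    parser.add_argument("--dry-run", action="store_true", help="show plan without running")
    parser.add_argument("--zs-only", action="store_true")
    parser.add_argument("--dc-only", action="store_true")
    parser.add_argument("--skip-vm", action="store_true")
    parser.add_argument("--deallocate-after", action="store_true")
    # Assumed default of "azure"; the PR only states the two choices.
    parser.add_argument("--cloud", choices=["azure", "aws"], default="azure")
    # BooleanOptionalAction (Python 3.9+) provides both --vnc and --no-vnc.
    parser.add_argument("--vnc", action=argparse.BooleanOptionalAction, default=True)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return 0 if args.dry_run else 1  # placeholder exit code for the sketch
```

A test can then call `build_parser().parse_args([...])` directly instead of rebuilding a copy of the parser by hand.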
- Update description, key features, and architecture to mention both Azure and AWS
- Add aws extra to installation section
- Show --cloud aws examples in Quick Start and Parallel Evaluation
- Add aws_vm.py to architecture tree
- Add smoke-test-aws to CLI reference table
- Add AWS env vars to configuration section
- Add Windows 11 on AWS screenshot
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This was referenced Mar 3, 2026
Summary
- `scripts/run_eval_pipeline.py`: single command to run the full ZS/DC evaluation flow

Changes

Pipeline script (`scripts/run_eval_pipeline.py`)
- Uses the VMProvider interface (`--cloud azure|aws`) instead of hardcoded `az` CLI calls
- Uses `SSHTunnelManager` and `wait_for_ssh()` from infrastructure modules (no duplication)
- Opens VNC with `webbrowser.open()` (not macOS-only `open`)
- Captures eval subprocess output to `{output_dir}/logs/{run_name}.log`
- `--deallocate-after` flag to stop VM billing after eval completes

Type action fix (`openadapt_evals/adapters/waa/live.py`)
- Split text on `\n` and interleave `pyautogui.write()` with `pyautogui.press('enter')`

Tests (`tests/test_eval_pipeline.py`)
- Cover `_build_conditions`, `_find_recordings_needing_demos`, `_print_summary`, CLI arg parsing, dry-run mode, module constants

Related PRs
- fix: harden PyAutoGUI fail-safe detection and coordinate clamping (split out from this PR)

Test plan
- `python scripts/run_eval_pipeline.py --help`: shows usage
- `python scripts/run_eval_pipeline.py --tasks 04d9aeaf --dry-run`: shows plan without running
- `--zs-only` and `--dc-only` flags filter conditions correctly

🤖 Generated with Claude Code