feat: add end-to-end eval pipeline script #68

Merged
abrichr merged 13 commits into main from feat/eval-pipeline
Mar 3, 2026
Conversation

@abrichr (Member) commented Mar 2, 2026

Summary

  • Add scripts/run_eval_pipeline.py — single command to run the full ZS/DC evaluation flow
  • Phase 1 (parallel): generates VLM demos for new recordings + starts VM if deallocated
  • Phase 2: establishes SSH tunnels, socat proxy, waits for WAA readiness
  • Phase 3: runs ZS and DC evaluations with per-task health checks
  • Phase 4: prints results matrix
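
The parallel part of Phase 1 could look roughly like the sketch below. It is only an illustration of running two independent steps concurrently and joining on both; `generate_demos`, `ensure_vm_running`, and `run_phase_one` are hypothetical placeholders, not the functions in `scripts/run_eval_pipeline.py`.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_demos(recordings):
    # Placeholder (assumption): the real step would call a VLM per recording.
    return [f"demo for {name}" for name in recordings]


def ensure_vm_running(state):
    # Placeholder (assumption): the real step would ask the cloud provider
    # to start the VM when it is deallocated.
    return "running" if state == "deallocated" else state


def run_phase_one(recordings, vm_state):
    """Run demo generation and VM startup concurrently, then wait for both."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        demos_future = pool.submit(generate_demos, recordings)
        vm_future = pool.submit(ensure_vm_running, vm_state)
        # .result() blocks until the task finishes and re-raises its exceptions.
        return demos_future.result(), vm_future.result()


demos, vm_state = run_phase_one(["04d9aeaf"], "deallocated")
```

Threads are a reasonable fit here because both steps are dominated by waiting on network calls rather than CPU work.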

Changes

Pipeline script (scripts/run_eval_pipeline.py)

  • Uses VMProvider protocol (--cloud azure|aws) instead of hardcoded az CLI calls
  • SSH/tunnel logic uses existing SSHTunnelManager and wait_for_ssh() from infrastructure modules (no duplication)
  • Cross-platform VNC opener via webbrowser.open() (not macOS-only open)
  • Eval subprocess output captured to {output_dir}/logs/{run_name}.log
  • --deallocate-after flag to stop VM billing after eval completes
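
Capturing subprocess output to `{output_dir}/logs/{run_name}.log` can be done with the standard library alone. The following is a minimal sketch under that assumption; `run_with_log` is a hypothetical helper name, and the child command is a stand-in for the real eval invocation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_log(cmd, output_dir, run_name):
    """Run cmd, writing combined stdout and stderr to logs/<run_name>.log."""
    log_dir = Path(output_dir) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / f"{run_name}.log"
    with log_path.open("w") as log_file:
        # stderr is merged into stdout so the log keeps the original ordering.
        completed = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    return completed.returncode, log_path


# Usage: a trivial child process stands in for the eval run.
output_dir = tempfile.mkdtemp()
code, log_path = run_with_log(
    [sys.executable, "-c", "print('hello')"], output_dir, "zs_04d9aeaf"
)
```

Writing straight to a file keeps the pipeline's own console output short while preserving the full eval output for later inspection.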

Type action fix (openadapt_evals/adapters/waa/live.py)

  • Handle newlines in type actions to prevent unterminated string errors
  • Split text on \n and interleave pyautogui.write() with pyautogui.press('enter')
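
A rough sketch of the split-and-interleave idea follows. The function names echo the helpers named in this PR, but the bodies are assumptions written for illustration, including the line-ending normalization, which the PR does not mention.

```python
def escape_for_pyautogui(text: str) -> str:
    """Escape backslashes and single quotes for a single-quoted Python literal."""
    return text.replace("\\", "\\\\").replace("'", "\\'")


def build_type_commands(text: str) -> str:
    """Return Python source that types text, pressing Enter at each newline."""
    # Assumption: normalize CRLF and CR so no raw line break reaches a literal.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    segments = normalized.split("\n")
    lines = ["import pyautogui"]
    for index, segment in enumerate(segments):
        if segment:
            lines.append(f"pyautogui.write('{escape_for_pyautogui(segment)}')")
        if index < len(segments) - 1:
            # One Enter key press per newline in the original text.
            lines.append("pyautogui.press('enter')")
    return "\n".join(lines)


script = build_type_commands("=A1+B1\nit's done")
```

Because no raw newline ever appears inside a string literal, the generated source always parses, which is the property the fix is after.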

Tests (tests/test_eval_pipeline.py)

  • 53 unit tests across 6 test classes, all passing without VM
  • Covers _build_conditions, _find_recordings_needing_demos, _print_summary, CLI arg parsing, dry-run mode, module constants

Related PRs

Test plan

  • python scripts/run_eval_pipeline.py --help — shows usage
  • python scripts/run_eval_pipeline.py --tasks 04d9aeaf --dry-run — shows plan without running
  • --zs-only and --dc-only flags filter conditions correctly
  • 53/53 unit tests pass
  • Full live run (requires VM)
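
The flags exercised in the test plan could be wired up roughly as in the sketch below. The flag names come from this PR; the defaults, the comma-separated `--tasks` format, and the behaviour when both `--zs-only` and `--dc-only` are given are assumptions, and `build_conditions` here is an illustrative stand-in for `_build_conditions`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Eval pipeline (illustrative sketch)")
    # Assumption: task IDs are passed as one comma-separated string.
    parser.add_argument("--tasks", default="")
    # Assumption: azure is the default provider.
    parser.add_argument("--cloud", choices=["azure", "aws"], default="azure")
    parser.add_argument("--zs-only", action="store_true")
    parser.add_argument("--dc-only", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--deallocate-after", action="store_true")
    # Generates both --vnc and --no-vnc (Python 3.9+).
    parser.add_argument("--vnc", action=argparse.BooleanOptionalAction, default=True)
    return parser


def build_conditions(args):
    """Return (task, mode) pairs filtered by --zs-only / --dc-only."""
    tasks = [task for task in args.tasks.split(",") if task]
    if args.zs_only and not args.dc_only:
        modes = ["zs"]
    elif args.dc_only and not args.zs_only:
        modes = ["dc"]
    else:
        # Assumption: neither flag, or both flags, runs both conditions.
        modes = ["zs", "dc"]
    return [(task, mode) for task in tasks for mode in modes]


args = build_parser().parse_args(["--tasks", "04d9aeaf", "--zs-only", "--dry-run"])
conditions = build_conditions(args)
```

Keeping the parser in its own function lets tests call `build_parser()` directly instead of rebuilding a copy of it.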

🤖 Generated with Claude Code

@abrichr changed the title from "feat: add end-to-end eval pipeline script" to "feat: add eval pipeline script with failsafe fixes" Mar 2, 2026
@abrichr changed the title from "feat: add eval pipeline script with failsafe fixes" to "feat: add eval pipeline script with failsafe and type-action fixes" Mar 2, 2026
@abrichr changed the title from "feat: add eval pipeline script with failsafe and type-action fixes" to "feat: add end-to-end eval pipeline script" Mar 2, 2026
abrichr and others added 11 commits March 2, 2026 18:27
Orchestrates the full evaluation flow in a single command:
- Phase 1 (parallel): generate VLM demos + start VM if deallocated
- Phase 2: establish SSH tunnels, socat proxy, wait for WAA readiness
- Phase 3: run ZS and DC evaluations with health checks
- Phase 4: print results summary

Composes existing scripts (run_dc_eval, convert_recording_to_demo) without
modifying them. Supports --dry-run, --tasks, --zs-only/--dc-only, --skip-vm.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous approach imported functions from run_dc_eval.py which
imports openadapt_evals at module level. This fails when running as a
standalone script outside the uv environment. Inlining the small
subprocess-based functions avoids the dependency chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Check and restart stopped WAA container in Phase 2 (handles VM
  deallocate/start where container exits)
- Increase default WAA readiness timeout from 420s to 1200s (cold
  boot can take 15-35 min)
- Add --vnc/--no-vnc flags to open VNC in browser (default: on)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three fixes for the PyAutoGUI fail-safe issue:

1. Failsafe detection now checks ALL response statuses, not just 200.
   WAA returns fail-safe errors as HTTP 500 with the exception in the
   response body — the previous code only checked stderr on 200 responses.
   Also detect "fail-safe triggered" substring (WAA's error format).

2. Coordinate clamping: all pixel coordinates are clamped to a 5px
   margin from screen edges via _clamp_pixel_coords(), preventing
   accidental corner touches that trigger the fail-safe.

3. Drag coordinate validation: skip drag actions with missing or
   all-zero coordinates instead of defaulting to (0,0) which
   guarantees a fail-safe trigger.
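
The three fixes above might be sketched as follows. These helpers are illustrative assumptions, not the code in `live.py`: the screen size is passed in explicitly, "all-zero" is read as all four drag values being zero, and the status-independent check simply inspects the response body text.

```python
def clamp_pixel_coords(x, y, width, height, margin=5):
    """Keep (x, y) at least margin pixels away from every screen edge."""
    clamped_x = min(max(x, margin), width - 1 - margin)
    clamped_y = min(max(y, margin), height - 1 - margin)
    return clamped_x, clamped_y


def is_valid_drag(start, end):
    """Reject drags with missing endpoints or with every coordinate zero."""
    if start is None or end is None:
        return False
    return any(value != 0 for value in (*start, *end))


def is_failsafe_response(body: str) -> bool:
    """Detect a fail-safe error in a response body, whatever the HTTP status."""
    lowered = body.lower()
    return "failsafeexception" in lowered or "fail-safe triggered" in lowered


corner = clamp_pixel_coords(0, 0, 1920, 1080)
```

Clamping before dispatch and skipping invalid drags both avoid sending the pointer to a screen corner, which is what trips PyAutoGUI's fail-safe.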

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive analysis of ZS vs DC evaluation on a 21-step LibreOffice
Calc task. Key findings:
- ZS: stuck after 1 step (wait loop)
- DC: 30 steps, wrote 4 correct cross-sheet formulas for 1 of 3 columns
- Binary scoring (0.00 both) masks significant DC behavioral advantage
- Documents 3 infrastructure bugs found and fixed during eval

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a --deallocate-after flag that deallocates the VM after eval
completes to stop billing. Uses raw az CLI because oa-vm deallocate
hardcodes VM_NAME="waa-eval-vm" and doesn't accept --name, so it
won't work for pool-style VMs like waa-pool-00.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Test _build_conditions with zs-only, dc-only, default, both-flags,
  multiple tasks, JSON fallback, and warning output
- Test _find_recordings_needing_demos with mocked filesystem covering
  existing demos, missing demos, no recording dir, no meta.json,
  meta_refined.json, task filters, and sorted output
- Test _print_summary for success/failure/empty/skip scenarios
- Test CLI argument parsing defaults and flag behavior
- Test --dry-run integration (exit codes and output content)
- Test module-level constants
- All 54 tests run without VM access

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace Azure-only az CLI calls with VMProvider interface (supports AWS)
- Fix macOS-only VNC opener with cross-platform webbrowser.open()
- Replace duplicate SSH/tunnel functions with infrastructure module calls
- Capture eval subprocess output for cleaner pipeline logs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove references to removed DEFAULT_VM_USER constant
- Replace --vm-user with --cloud in test parser
- Add --deallocate-after to test parser
- 53/53 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rrors

When the agent sends text containing newlines, pyautogui.write() was
called with a literal newline in the Python string, causing an
"unterminated string literal" syntax error on the WAA server.

Adds _build_type_commands() which splits text on newlines and
interleaves pyautogui.write() with pyautogui.press('enter'). Also
extracts _escape_for_pyautogui() for consistent string escaping.

Updates analysis doc: corrects partial-scoring recommendation to note
it requires WAA server-side changes (compare_table metric), not just
adapter-side changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr force-pushed the feat/eval-pipeline branch from 1a6826f to 8034a78 March 2, 2026 23:27
abrichr and others added 2 commits March 2, 2026 18:31
- Guard _create_vm_manager() behind dry-run check so --dry-run works
  without Azure/AWS SDKs configured
- Remove unused as_completed import
- Add parentheses to clarify sorted() if/else expression
- Make _build_type_commands() self-contained (includes import pyautogui)
  so concatenation at call sites is no longer fragile
- Extract build_parser() from main() so tests use the real parser
  instead of a manually reconstructed copy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update description, key features, and architecture to mention both
  Azure and AWS
- Add aws extra to installation section
- Show --cloud aws examples in Quick Start and Parallel Evaluation
- Add aws_vm.py to architecture tree
- Add smoke-test-aws to CLI reference table
- Add AWS env vars to configuration section
- Add Windows 11 on AWS screenshot

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>