
Commit 9a37bb4

abrichr and claude authored
Vanilla WAA bootstrap automation (#10)
* feat: add openadapt-viewer dependency and adapter module (Phase 1)

Phase 1 of viewer consolidation plan: Foundation

Changes:
- Add openadapt-viewer as local file dependency in pyproject.toml
- Create openadapt_ml/training/viewer_components.py adapter module
  * screenshot_with_predictions() - Screenshot with human/AI overlays
  * training_metrics() - Training stats metrics grid
  * playback_controls() - Playback UI controls
  * correctness_badge() - Pass/fail badge component
  * generate_comparison_summary() - Model comparison summary
- Add tests/test_viewer_screenshots.py with component validation tests
- Add openadapt_ml/training/viewer_migration_example.py validation example

Design:
- Zero breaking changes to existing viewer.py code
- Adapter pattern wraps openadapt-viewer with ML-specific context
- Functions accept openadapt-ml data structures
- Can be incrementally adopted in future phases

Next steps (Phase 2):
- Gradually migrate viewer.py to use these adapters
- Replace inline HTML generation with component calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat: add workflow segmentation system with capture adapter

Restored and enhanced the workflow segmentation system from commit dd9a393 with new integration for openadapt-capture format.

## What's Added

### Core Segmentation Pipeline (4 stages):

1. **Stage 1 - Frame Description (VLM)**:
   - Converts screenshots + actions into semantic descriptions
   - Supports Gemini, Claude, GPT-4o backends
   - Automatic caching for efficiency
   - File: openadapt_ml/segmentation/frame_describer.py

2. **Stage 2 - Episode Extraction (LLM)**:
   - Identifies coherent workflow boundaries
   - Few-shot prompting for better quality
   - Confidence-based filtering
   - File: openadapt_ml/segmentation/segment_extractor.py

3. **Stage 3 - Deduplication (Embeddings)**:
   - Finds similar workflows across recordings
   - Agglomerative clustering with cosine similarity
   - Supports OpenAI or local HuggingFace embeddings
   - File: openadapt_ml/segmentation/deduplicator.py

4. **Stage 4 - Annotation (VLM Quality Control)**:
   - Auto-annotates episodes for training data quality
   - Detects failures, boundary issues, incompleteness
   - Human-in-the-loop review workflow
   - File: openadapt_ml/segmentation/annotator.py

### Integration Features:

- **CaptureAdapter**: Loads recordings from openadapt-capture SQLite format
  - File: openadapt_ml/segmentation/adapters/capture_adapter.py
  - Automatically used when capture.db is detected
  - Converts events to segmentation format
- **Unified Pipeline**: Run all stages with single API
  - File: openadapt_ml/segmentation/pipeline.py
  - Automatic intermediate result caching
  - Resume support for interrupted runs
- **CLI Interface**: Full command-line interface for all stages
  - File: openadapt_ml/segmentation/cli.py
  - Commands: describe, extract, deduplicate, annotate, review, export-gold
- **Comprehensive Documentation**:
  - File: openadapt_ml/segmentation/README.md
  - 20+ code examples
  - Complete API reference
  - Integration guide
  - Cost estimates and performance benchmarks

## Use Cases

1. **Training Data Curation**: Extract and filter high-quality demonstration episodes
2. **Demo Retrieval**: Build searchable libraries for demo-conditioned prompting
3. **Workflow Documentation**: Auto-generate step-by-step guides from recordings

## Data Schemas

All schemas use Pydantic for type safety (openadapt_ml/segmentation/schemas.py):
- ActionTranscript: Frame-by-frame semantic descriptions
- Episode: Coherent workflow segment with boundaries
- CanonicalEpisode: Deduplicated workflow definition
- EpisodeAnnotation: Quality assessment for training data

## Example Usage

```python
from openadapt_ml.segmentation import SegmentationPipeline, PipelineConfig

config = PipelineConfig(
    vlm_model="gemini-2.0-flash",
    llm_model="gpt-4o",
    similarity_threshold=0.85,
)
pipeline = SegmentationPipeline(config)
result = pipeline.run(
    recordings=["/path/to/recording1", "/path/to/recording2"],
    output_dir="workflow_library",
)
print(f"Found {result.unique_episodes} unique workflows")
```

## Next Steps

See openadapt_ml/segmentation/README.md for:
- P0: Integration tests with real openadapt-capture recordings
- P0: Visualization generator for segment boundaries
- P1: Improved prompt engineering and cost optimization
- P2: Active learning and multi-modal features

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Enhance vm monitor command with comprehensive VM usage visibility

Features added:
- Azure ML job tracking: Shows recent jobs from last 7 days with status
- Cost tracking: Real-time uptime, hourly rate, and cost estimation
- VM activity detection: Identifies what VM is currently doing
- Evaluation history: Past benchmark runs and success rates (--details flag)
- Enhanced UI: Structured dashboard with clear sections and icons

New utility functions in vm_monitor.py:
- fetch_azure_ml_jobs(): Fetch recent Azure ML jobs with filtering
- calculate_vm_costs(): Calculate VM costs with hourly/daily/weekly rates
- get_vm_uptime_hours(): Get VM uptime from Azure activity logs
- detect_vm_activity(): Detect current VM activity (idle, running, setup)
- get_evaluation_history(): Load past evaluation runs from results dir

CLI enhancements:
- Added --details flag for extended information
- Improved output formatting with sections and separators
- Better error handling and status icons
- Preserved existing SSH tunnel and dashboard functionality

Documentation:
- Updated CLAUDE.md with new features and usage examples
- Added detailed docstrings to all new functions

This consolidates VM monitoring into a single enhanced command rather than creating duplicate dashboards, following the viewer consolidation strategy.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor segmentation pipeline to use screen.frame events

Update CaptureAdapter to work with actual openadapt-capture database format.

Key changes:
- Use screen.frame events instead of generic event types
- Pair action events (mouse.down + mouse.up → single click)
- Map frame events to screenshots via timestamp matching
- Update event type filtering to match openadapt-capture schema
- Improve frame-to-action association logic

This enables the segmentation pipeline to process real capture recordings from openadapt-capture instead of requiring simulated data.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add VM monitoring dashboard with comprehensive usage visibility

Enhance vm monitor command to provide complete VM usage tracking:
- Real-time VM status (size, IP, power state)
- Activity detection (idle, benchmark running, setup)
- Cost tracking (uptime hours, hourly rate, total cost)
- Azure ML jobs list (last 7 days with status)
- Evaluation history (with --details flag)
- Mock mode for testing without VM (--mock flag)

Add new API endpoints to local.py dashboard server:
- /api/benchmark/status - current job status with ETA
- /api/benchmark/costs - cost breakdown (Azure VM, API, GPU)
- /api/benchmark/metrics - performance metrics by domain
- /api/benchmark/workers - worker status and utilization
- /api/benchmark/runs - list all benchmark runs
- /api/benchmark/tasks/{run}/{task} - task execution details

Update README with VM monitor section including screenshots and usage examples.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add segmentation testing documentation and test files

Add comprehensive test plan and results for workflow segmentation pipeline:
- Test plan with 8 stages from environment setup to documentation
- Test results documenting real capture processing outcomes
- Test files for CaptureAdapter and segmentation pipeline

Add VM monitor screenshot generation scripts and documentation:
- Scripts for automated dashboard screenshot generation
- Implementation plan for VM monitor screenshot feature
- Analysis of screenshot capture approaches

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Document archived OpenAdapter repository

- Archive OpenAdapter (incomplete pre-refactor cloud deployment POC)
- Document key takeaways and lessons learned
- Reference modern cloud infrastructure in openadapt-ml
- Add guidelines for when to archive repositories

OpenAdapter was an incomplete proof-of-concept from October 2024 with only 165 lines of code and no ecosystem usage. Cloud deployment is now production-ready in openadapt_ml/cloud/ and benchmarks/azure.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add search functionality to training viewer

- Add search bar to viewer controls with Ctrl+F / Cmd+F keyboard shortcut
- Implement advanced token-based search across step indices, action types, and text
- Search filters step list in real-time with result count display
- Clear button and Escape key support for resetting search
- Consistent UI styling with existing viewer components
- Integrates with existing step list filtering

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: resolve ruff linting and formatting issues

* fix: resolve test failures from missing dependencies

- Remove non-existent openadapt_ml.shared_ui import from viewer.py
- Skip anthropic test when anthropic package not installed (optional dependency)
- Skip viewer_components test when openadapt-viewer not installed (optional dependency)

All tests now pass (334 passed, 6 skipped).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat(dashboard): add Azure ops dashboard with live VNC embed

- Add azure_ops_tracker.py for real-time status tracking via SSE
- Add azure_ops_viewer.py with live VNC iframe embed
- Add /api/azure-ops-status and /api/azure-ops-sse endpoints
- Add progress bar, cost tracking, elapsed time display
- Add copy logs button and auto-scroll controls

feat(cli): add new VM management commands

- Add vm start-windows command
- Add vm restart-windows command
- Add vm check-build command
- Add vm screenshot command for capturing dashboards
- Fix container restart to always use --cap-add NET_ADMIN

feat(infra): add screenshot capture infrastructure

- Add capture_screenshots.py script
- Configure BuildKit GC with 30GB limit
- Fix Dockerfile OEM path and networking

docs: add Azure dashboard spec and update CLI documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: auto-cleanup before Docker builds, use VERSION=11 for unattended install

- Add automatic Docker build cache cleanup before waa-auto builds
- Fix all VERSION=11e → VERSION=11 for fully unattended Windows install
  (Enterprise Evaluation shows edition picker dialog; Pro does not)
- Update CLAUDE.md documentation with disk space management solution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(waa): use VERSION=11e consistently to prevent product key prompt

Root cause: CLI used VERSION=11 but Dockerfile uses VERSION=11e.
This caused XML patches (applied for 11e) to be ignored at runtime.

Enterprise Eval (11e) has built-in GVLK key - never prompts for product key.

Fixes: openadapt-evals-b3l

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: fix inverted VERSION documentation in CLAUDE.md

VERSION=11e (Enterprise Eval) has built-in GVLK - never prompts.
VERSION=11 (Pro) may prompt for product key.

Previous documentation was backwards, causing confusion.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: automate vanilla WAA bootstrap

* docs: clarify unattended WAA bootstrap

* fix(waa): don't replace dockurr/windows autounattend.xml

The previous approach copied windowsarena's autounattend.xml over dockurr/windows's version, which broke the OOBE flow.

Changes:
- Remove COPY commands that replaced the base image's XML files
- Add conditional sed patch that only adds InstallFrom element if needed
- Reorder Dockerfile to install Python deps before running python3 commands
- Add clear comments explaining the OEM mechanism

This fixes Windows installation failures where the OOBE would hang or show incorrect dialogs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor(cli): remove deprecated WAA handlers, add auto-cleanup

Major cleanup of benchmarks CLI:

Removed deprecated handlers (~1200 lines):
- setup-waa: Replaced by top-level 'waa' command
- run-waa: Replaced by top-level 'waa' command
- prepare-windows: Replaced by top-level 'waa' command
- waa-native: Replaced by scripts/waa_bootstrap_local.sh

Added features:
- cleanup_waa_resources(): Auto-cleanup leftover Azure resources (NICs, VNETs, NSGs, PublicIPs, disks) before VM creation
- Updated default VM size to Standard_D8ds_v5 (300GB temp storage)
- Updated help text with temp storage sizes for each VM option
- Added deprecation notice to legacy viewer command

The cleanup function prevents "resource already exists" errors when previous VM deletion was incomplete.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: add CLI-first and VM workflow guidelines to CLAUDE.md

Added comprehensive guidelines for Claude Code sessions:

CLI-First Rule:
- Never use raw az/ssh commands that require permission
- Always use or extend the CLI for VM operations
- Example pattern for adding new CLI functionality

Standard VM Configuration Workflow:
- Delete VM, update code, recreate (vs. trying to resize)
- Current VM defaults (D8ds_v5, eastus, Ubuntu 22.04)

This reduces friction by documenting the pre-approved command patterns and standard operating procedures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: fix markdown formatting in waa_vanilla_automation.md

- Close unclosed code block (lines 33-41)
- Remove hardcoded absolute path, use relative description

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(waa): unattended Windows installation now works

Key fixes to waa_deploy/Dockerfile:
- Don't replace dockurr/windows autounattend.xml, only patch with InstallFrom element to prevent "Select the operating system" dialog
- Use sed instead of python3 for run.py patching (Python installed later)
- Fix entrypoint: use /run/entry.sh instead of non-existent /copy-oem.sh

This enables fully automated Windows 11 Enterprise Eval installation with VERSION=11e, no manual intervention required. WAA server starts automatically via FirstLogonCommands.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(waa): fix unattended installation and benchmark execution

Key fixes:
- Dockerfile: Don't replace autounattend.xml, only patch it with InstallFrom element (preserves dockurr/windows OEM mechanism, prevents "Select OS" dialog)
- CLI: Run benchmark inside container with `docker exec -w /client`
- CLI: Use valid som_origin "oss" instead of invalid "omniparser"
- CLI: Fix VNC URLs to use localhost (SSH tunnel) instead of public IP
- CLI: Add SSH retry logic with exponential backoff
- CLI: Add ConnectTimeout to SSH options for faster failure detection

The WAA benchmark now runs successfully with the navi agent.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(dashboard): use Azure CLI for live VM IP, fix activity detection

- Fetch VM IP from Azure CLI at runtime instead of stale registry file
- Fix activity detection to use localhost:5000 (Docker port forward)
- Change SSH tunnel to forward localhost:5001 -> VM:5000
- Update CLAUDE.md with comprehensive WAA workflow documentation
- Add API key auto-loading note (loaded from .env via config.py)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
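
The CaptureAdapter refactor in this commit pairs `mouse.down` and `mouse.up` into a single click and maps `screen.frame` events to actions by timestamp. The sketch below shows one way to do both steps. It is not the adapter's code: the flat `Event`/`Action` records, their field names, and the rule "use the latest frame at or before the action" are assumptions made for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    # Hypothetical flattened event; the real capture.db schema may differ.
    type: str          # e.g. "screen.frame", "mouse.down", "mouse.up"
    timestamp: float   # seconds
    x: Optional[float] = None
    y: Optional[float] = None


@dataclass
class Action:
    type: str
    timestamp: float
    x: Optional[float]
    y: Optional[float]
    frame_timestamp: Optional[float] = None  # screenshot shown when the user acted


def pair_clicks(events: List[Event]) -> List[Action]:
    """Collapse each mouse.down followed by a mouse.up into one 'click' action."""
    clicks: List[Action] = []
    pending: Optional[Event] = None
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.type == "mouse.down":
            pending = ev  # a second down without an up replaces the first
        elif ev.type == "mouse.up" and pending is not None:
            # Keep the press position/time; the release only closes the pair.
            clicks.append(Action("click", pending.timestamp, pending.x, pending.y))
            pending = None
    return clicks


def attach_frames(actions: List[Action], events: List[Event]) -> List[Action]:
    """Set frame_timestamp to the latest screen.frame at or before each action."""
    frame_times = sorted(e.timestamp for e in events if e.type == "screen.frame")
    for action in actions:
        i = bisect_right(frame_times, action.timestamp)
        action.frame_timestamp = frame_times[i - 1] if i > 0 else None
    return actions


if __name__ == "__main__":
    events = [
        Event("screen.frame", 0.0),
        Event("screen.frame", 1.0),
        Event("mouse.down", 1.2, 10, 20),
        Event("mouse.up", 1.3, 10, 20),
        Event("screen.frame", 2.0),
    ]
    print(attach_frames(pair_clicks(events), events))
    # [Action(type='click', timestamp=1.2, x=10, y=20, frame_timestamp=1.0)]
```

Picking the preceding frame means the screenshot shows the screen the user saw before acting; an adapter could instead choose the nearest frame, or treat a down/up pair at different positions as a drag rather than a click.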
1 parent a6af99c commit 9a37bb4

35 files changed

Lines changed: 9051 additions & 1866 deletions

CLAUDE.md

Lines changed: 530 additions & 71 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 39 additions & 0 deletions

```diff
@@ -827,6 +827,45 @@ uv run python -m openadapt_ml.benchmarks.cli vm monitor --auto-shutdown-hours 2
 
 For complete VM management commands and Azure setup instructions, see [`CLAUDE.md`](CLAUDE.md) and [`docs/azure_waa_setup.md`](docs/azure_waa_setup.md).
 
+### 13.5 Screenshot Capture Tool
+
+Capture screenshots of dashboards and VMs for documentation and PR purposes:
+
+```bash
+# Capture all available targets
+uv run python -m openadapt_ml.benchmarks.cli screenshot
+
+# List available targets
+uv run python -m openadapt_ml.benchmarks.cli screenshot --list
+
+# Capture specific targets
+uv run python -m openadapt_ml.benchmarks.cli screenshot --target terminal
+uv run python -m openadapt_ml.benchmarks.cli screenshot --target azure-ops --target vnc
+
+# Custom output directory
+uv run python -m openadapt_ml.benchmarks.cli screenshot --output /path/to/screenshots
+
+# Without timestamp in filename
+uv run python -m openadapt_ml.benchmarks.cli screenshot --target terminal --no-timestamp
+```
+
+**Available targets:**
+
+| Target | Description |
+|--------|-------------|
+| `azure-ops` | Azure ops dashboard (localhost:8765) |
+| `vnc` | VNC viewer (localhost:8006) - Windows VM |
+| `terminal` | VM monitor terminal output (mock mode) |
+| `terminal-live` | VM monitor terminal output (live, requires running VM) |
+| `training` | Training dashboard (localhost:8080) |
+| `vm-screen` | Windows VM screen capture via QEMU |
+
+**Notes:**
+- Terminal screenshots use PIL to render terminal output as PNG images
+- Web page screenshots work best with playwright installed (`uv add playwright && playwright install chromium`)
+- On macOS, interactive capture using `screencapture` is available as a fallback
+- Screenshots are saved to `docs/screenshots/` by default with timestamps
+
 ---
 
 ## 14. Limitations & Notes
```
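
The README section added above says screenshots go to `docs/screenshots/` with timestamps by default, that `--target` can be repeated, that omitting it captures every target, and that `--no-timestamp` drops the timestamp. A minimal sketch of how target selection and output naming could fit together, assuming a `<target>_<YYYYmmdd_HHMMSS>.png` filename pattern; `resolve_targets()` and `output_path()` are hypothetical helpers, and the real CLI may name files differently.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterable, List, Optional

# Target names copied from the "Available targets" table; this registry is illustrative.
TARGETS = {
    "azure-ops": "Azure ops dashboard (localhost:8765)",
    "vnc": "VNC viewer (localhost:8006)",
    "terminal": "VM monitor terminal output (mock mode)",
    "terminal-live": "VM monitor terminal output (live)",
    "training": "Training dashboard (localhost:8080)",
    "vm-screen": "Windows VM screen capture via QEMU",
}


def resolve_targets(requested: Optional[Iterable[str]]) -> List[str]:
    """No --target flags means capture everything; unknown names fail fast."""
    requested = list(requested or [])
    if not requested:
        return list(TARGETS)
    unknown = [t for t in requested if t not in TARGETS]
    if unknown:
        raise ValueError(f"unknown target(s): {', '.join(unknown)}")
    # Preserve the user's order but drop repeats like `--target vnc --target vnc`.
    return list(dict.fromkeys(requested))


def output_path(target: str, output_dir: str = "docs/screenshots",
                timestamp: bool = True, now: Optional[datetime] = None) -> Path:
    """Build the PNG path; the `<target>_<YYYYmmdd_HHMMSS>.png` pattern is assumed."""
    name = target
    if timestamp:  # --no-timestamp would correspond to timestamp=False
        name = f"{target}_{(now or datetime.now()):%Y%m%d_%H%M%S}"
    return Path(output_dir) / f"{name}.png"


if __name__ == "__main__":
    fixed = datetime(2025, 1, 2, 3, 4, 5)
    for t in resolve_targets(["azure-ops", "vnc"]):
        print(output_path(t, now=fixed))
    # On POSIX this prints:
    # docs/screenshots/azure-ops_20250102_030405.png
    # docs/screenshots/vnc_20250102_030405.png
```

Passing `now` explicitly keeps filenames reproducible in tests; a single timestamp shared across all targets in one run also keeps a batch of screenshots grouped together.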

deprecated/Dockerfile.simple

Lines changed: 26 additions & 0 deletions

```diff
@@ -0,0 +1,26 @@
+FROM dockurr/windows:latest
+
+RUN apt-get update && apt-get install -y fuse dos2unix wget curl python3 python3-pip git && rm -rf /var/lib/apt/lists/*
+
+ENV YRES="900"
+ENV XRES="1440"
+ENV RAM_SIZE="8G"
+ENV CPU_CORES="8"
+ENV VERSION="11e"
+ENV DISK_SIZE="30G"
+ENV ARGUMENTS="-qmp tcp:0.0.0.0:7200,server,nowait"
+
+COPY src/win-arena-container/client /client
+COPY src/win-arena-container/vm/setup /oem
+COPY src/win-arena-container/entry.sh /entry.sh
+COPY src/win-arena-container/entry_setup.sh /entry_setup.sh
+COPY src/win-arena-container/start_client.sh /start_client.sh
+COPY src/win-arena-container/start_vm.sh /start_vm.sh
+
+RUN pip3 install --no-cache-dir -r /client/requirements.txt 2>/dev/null || true
+
+RUN find / -maxdepth 3 -type f -name "*.sh" -exec dos2unix {} \; 2>/dev/null; chmod +x /*.sh 2>/dev/null || true
+
+RUN sed -i "s|20\.20\.20\.21|172.30.0.2|g" /entry_setup.sh /entry.sh /start_client.sh 2>/dev/null || true
+
+ENTRYPOINT ["/bin/bash", "-c"]
```
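
Images built on `dockurr/windows`, like the one above, boot Windows inside QEMU before the WAA server becomes reachable, so callers typically wait for it. The commit message mentions SSH retry logic with exponential backoff, and the ACR design doc in this commit lists a response from the WAA `/probe` endpoint on port 5000 as a verification step after boot. A minimal sketch combining the two ideas, assuming the endpoint returns HTTP 2xx when ready and is reachable at whatever URL you pass in (for example through the localhost:5001 -> VM:5000 SSH tunnel described in the commit message); `wait_until_ready()` and `http_ok()` are hypothetical helpers, not the CLI's implementation.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 2xx, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx (HTTPError) all mean "not ready".
        return False


def wait_until_ready(
    url: str,
    check: Callable[[str], bool] = http_ok,
    attempts: int = 8,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url`, sleeping base_delay * 2**n (capped at max_delay) between tries."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:  # no pointless sleep after the final try
            sleep(min(base_delay * 2 ** attempt, max_delay))
    return False


if __name__ == "__main__":
    # Hypothetical URL; adjust host/port to match your tunnel or port forward.
    ready = wait_until_ready("http://localhost:5001/probe", attempts=3)
    print("WAA server ready" if ready else "WAA server not reachable yet")
```

Injecting `check` and `sleep` keeps the backoff schedule testable without a running VM; the cap stops late retries from waiting minutes apart while Windows is still installing.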

deprecated/README.md

Lines changed: 7 additions & 0 deletions

```diff
@@ -0,0 +1,7 @@
+# Deprecated WAA Legacy Materials
+
+This folder contains legacy WAA automation files and documents that are no longer
+part of the vanilla WAA workflow. They are retained for review only.
+
+Use `docs/waa_vanilla_automation.md` and the scripts in `scripts/` for the
+current vanilla automation flow.
```

deprecated/docs/WAA_ACR_DESIGN.md

Lines changed: 87 additions & 0 deletions

```diff
@@ -0,0 +1,87 @@
+# WAA ACR Design (Unattended + Vanilla)
+
+## Goals
+- Make WAA image pulls reliable (no Docker Hub throttling/timeouts).
+- Preserve unattended Windows install (no license prompts).
+- Use existing CLI/scripts wherever possible.
+
+## Constraints
+- Windows install must be fully unattended (VERSION=11e + OEM Azure mode).
+- Prefer vanilla WAA components; no dev-mode UNC paths.
+- Avoid custom tooling if existing commands cover the flow.
+
+## Proposed ACR Naming
+ACR names must be globally unique and <= 50 chars. Use a deterministic pattern tied to the subscription and region:
+
+```
+openadapt-evals-<region>-<suffix>
+```
+
+Suggested suffix: last 6 chars of the Azure subscription ID.
+
+Example (eastus + sub id ...1234ab):
+
+```
+openadapt-evals-eastus-1234ab
+```
+
+If name is taken, append `-01`, `-02`, etc.
+
+## Implementation Plan
+
+### 1) Create ACR + import WinArena image
+Use the existing helper script in `openadapt-evals`:
+
+```bash
+cd /Users/abrichr/oa/src/openadapt-evals
+./scripts/setup_acr.sh \
+--acr-name openadapt-evals-eastus-1234ab \
+--resource-group openadapt-agents \
+--workspace openadapt-ml \
+--location eastus
+```
+
+This script:
+- Creates the registry (Basic tier).
+- Imports `docker.io/windowsarena/winarena:latest`.
+- Grants `AcrPull` to the Azure ML workspace identity.
+
+### 2) Use ACR image for Azure ML runs
+No new code needed; use the existing config/env support:
+
+```bash
+export AZURE_DOCKER_IMAGE="openadapt-evals-eastus-1234ab.azurecr.io/winarena:latest"
+```
+
+Then run Azure evals as usual:
+
+```bash
+uv run python -m openadapt_evals.benchmarks.cli azure --workers 1 --task-ids notepad_1 --waa-path /path/to/WAA
+```
+
+### 3) Use ACR image for dedicated VM builds
+The VM flow already supports ACR via existing CLI commands:
+
+```bash
+uv run python -m openadapt_ml.benchmarks.cli vm pull-image --acr openadapt-evals-eastus-1234ab
+```
+
+When building the custom `waa-auto` image on the VM, set:
+
+```bash
+export WAA_SOURCE_IMAGE="openadapt-evals-eastus-1234ab.azurecr.io/winarena:latest"
+uv run python -m openadapt_ml.benchmarks.cli vm prepare-windows
+```
+
+This uses the simplified Dockerfile (OEM Azure mode) and keeps installs unattended.
+
+## Verification Checklist
+- ACR import succeeded (`az acr repository show --name <acr> --repository winarena`).
+- Azure ML run logs show pulls from the ACR login server.
+- VM `prepare-windows` completes without product key prompts.
+- WAA `/probe` endpoint responds on port 5000 after boot.
+
+## Notes
+- The simplified Dockerfile copies OEM assets from the source image and uses `VERSION=11e` for unattended installs.
+- If Windows prompts for a product key, treat it as a regression and follow `docs/RECURRING_ISSUES.md`.
+- Keep Azure ML and ACR in the same region to avoid throttling and reduce pull time.
```
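
One caveat on the naming scheme in this design doc: Azure Container Registry names may contain only alphanumeric characters (5 to 50 of them), so `openadapt-evals-eastus-1234ab` and the `-01`/`-02` collision suffixes would be rejected at creation time. A minimal sketch of deriving a compliant name from the same ingredients (prefix, region, last 6 characters of the subscription ID) by stripping hyphens; the `acr_name()` helper and its two-digit collision counter are hypothetical and not part of this commit.

```python
import re

# ACR registry names: 5-50 characters, letters and digits only (no hyphens).
_VALID = re.compile(r"[a-z0-9]{5,50}")
_STRIP = re.compile(r"[^a-z0-9]")


def acr_name(region: str, subscription_id: str,
             prefix: str = "openadaptevals", attempt: int = 0) -> str:
    """Deterministic ACR name: prefix + region + last 6 chars of the subscription ID.

    Non-alphanumeric characters are stripped. attempt > 0 appends a two-digit
    counter (01, 02, ...) in place of the design doc's `-01`, `-02` suffixes.
    """
    suffix = _STRIP.sub("", subscription_id.lower())[-6:]
    name = _STRIP.sub("", f"{prefix}{region}".lower()) + suffix
    if attempt:
        name += f"{attempt:02d}"
    if not _VALID.fullmatch(name):
        raise ValueError(f"invalid ACR name {name!r}: need 5-50 letters/digits")
    return name


if __name__ == "__main__":
    name = acr_name("eastus", "00000000-0000-0000-0000-0000001234ab")
    print(name)  # openadaptevalseastus1234ab
    # Candidate value for AZURE_DOCKER_IMAGE / WAA_SOURCE_IMAGE:
    print(f"{name}.azurecr.io/winarena:latest")
```

The resulting `<name>.azurecr.io` login server would then stand in for the hyphenated examples in the `AZURE_DOCKER_IMAGE`, `WAA_SOURCE_IMAGE`, and `--acr-name` snippets above.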
