Commit a6af99c

abrichr and claude authored
Viewer consolidation, workflow segmentation, and VM monitoring (#9)
* feat: add openadapt-viewer dependency and adapter module (Phase 1)

  Phase 1 of viewer consolidation plan: Foundation

  Changes:
  - Add openadapt-viewer as local file dependency in pyproject.toml
  - Create openadapt_ml/training/viewer_components.py adapter module:
    * screenshot_with_predictions() - Screenshot with human/AI overlays
    * training_metrics() - Training stats metrics grid
    * playback_controls() - Playback UI controls
    * correctness_badge() - Pass/fail badge component
    * generate_comparison_summary() - Model comparison summary
  - Add tests/test_viewer_screenshots.py with component validation tests
  - Add openadapt_ml/training/viewer_migration_example.py validation example

  Design:
  - Zero breaking changes to existing viewer.py code
  - Adapter pattern wraps openadapt-viewer with ML-specific context
  - Functions accept openadapt-ml data structures
  - Can be incrementally adopted in future phases

  Next steps (Phase 2):
  - Gradually migrate viewer.py to use these adapters
  - Replace inline HTML generation with component calls

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat: add workflow segmentation system with capture adapter

  Restores and enhances the workflow segmentation system from commit dd9a393, with new integration for the openadapt-capture format.

  ## What's Added

  ### Core Segmentation Pipeline (4 stages)

  1. **Stage 1 - Frame Description (VLM)**:
     - Converts screenshots + actions into semantic descriptions
     - Supports Gemini, Claude, and GPT-4o backends
     - Automatic caching for efficiency
     - File: openadapt_ml/segmentation/frame_describer.py
  2. **Stage 2 - Episode Extraction (LLM)**:
     - Identifies coherent workflow boundaries
     - Few-shot prompting for better quality
     - Confidence-based filtering
     - File: openadapt_ml/segmentation/segment_extractor.py
  3. **Stage 3 - Deduplication (Embeddings)**:
     - Finds similar workflows across recordings
     - Agglomerative clustering with cosine similarity
     - Supports OpenAI or local HuggingFace embeddings
     - File: openadapt_ml/segmentation/deduplicator.py
  4. **Stage 4 - Annotation (VLM Quality Control)**:
     - Auto-annotates episodes for training data quality
     - Detects failures, boundary issues, incompleteness
     - Human-in-the-loop review workflow
     - File: openadapt_ml/segmentation/annotator.py

  ### Integration Features

  - **CaptureAdapter**: Loads recordings from the openadapt-capture SQLite format
    - File: openadapt_ml/segmentation/adapters/capture_adapter.py
    - Automatically used when capture.db is detected
    - Converts events to segmentation format
  - **Unified Pipeline**: Run all stages with a single API
    - File: openadapt_ml/segmentation/pipeline.py
    - Automatic intermediate result caching
    - Resume support for interrupted runs
  - **CLI Interface**: Full command-line interface for all stages
    - File: openadapt_ml/segmentation/cli.py
    - Commands: describe, extract, deduplicate, annotate, review, export-gold
  - **Comprehensive Documentation**:
    - File: openadapt_ml/segmentation/README.md
    - 20+ code examples
    - Complete API reference
    - Integration guide
    - Cost estimates and performance benchmarks

  ## Use Cases

  1. **Training Data Curation**: Extract and filter high-quality demonstration episodes
  2. **Demo Retrieval**: Build searchable libraries for demo-conditioned prompting
  3. **Workflow Documentation**: Auto-generate step-by-step guides from recordings

  ## Data Schemas

  All schemas use Pydantic for type safety (openadapt_ml/segmentation/schemas.py):
  - ActionTranscript: Frame-by-frame semantic descriptions
  - Episode: Coherent workflow segment with boundaries
  - CanonicalEpisode: Deduplicated workflow definition
  - EpisodeAnnotation: Quality assessment for training data

  ## Example Usage

  ```python
  from openadapt_ml.segmentation import SegmentationPipeline, PipelineConfig

  config = PipelineConfig(
      vlm_model="gemini-2.0-flash",
      llm_model="gpt-4o",
      similarity_threshold=0.85,
  )

  pipeline = SegmentationPipeline(config)
  result = pipeline.run(
      recordings=["/path/to/recording1", "/path/to/recording2"],
      output_dir="workflow_library",
  )

  print(f"Found {result.unique_episodes} unique workflows")
  ```

  ## Next Steps

  See openadapt_ml/segmentation/README.md for:
  - P0: Integration tests with real openadapt-capture recordings
  - P0: Visualization generator for segment boundaries
  - P1: Improved prompt engineering and cost optimization
  - P2: Active learning and multi-modal features

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Enhance vm monitor command with comprehensive VM usage visibility

  Features added:
  - Azure ML job tracking: shows recent jobs from the last 7 days with status
  - Cost tracking: real-time uptime, hourly rate, and cost estimation
  - VM activity detection: identifies what the VM is currently doing
  - Evaluation history: past benchmark runs and success rates (--details flag)
  - Enhanced UI: structured dashboard with clear sections and icons

  New utility functions in vm_monitor.py:
  - fetch_azure_ml_jobs(): fetch recent Azure ML jobs with filtering
  - calculate_vm_costs(): calculate VM costs with hourly/daily/weekly rates
  - get_vm_uptime_hours(): get VM uptime from Azure activity logs
  - detect_vm_activity(): detect current VM activity (idle, running, setup)
  - get_evaluation_history(): load past evaluation runs from the results dir

  CLI enhancements:
  - Added --details flag for extended information
  - Improved output formatting with sections and separators
  - Better error handling and status icons
  - Preserved existing SSH tunnel and dashboard functionality

  Documentation:
  - Updated CLAUDE.md with new features and usage examples
  - Added detailed docstrings to all new functions

  This consolidates VM monitoring into a single enhanced command rather than creating duplicate dashboards, following the viewer consolidation strategy.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor segmentation pipeline to use screen.frame events

  Update CaptureAdapter to work with the actual openadapt-capture database format.

  Key changes:
  - Use screen.frame events instead of generic event types
  - Pair action events (mouse.down + mouse.up → single click)
  - Map frame events to screenshots via timestamp matching
  - Update event type filtering to match the openadapt-capture schema
  - Improve frame-to-action association logic

  This enables the segmentation pipeline to process real capture recordings from openadapt-capture instead of requiring simulated data.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add VM monitoring dashboard with comprehensive usage visibility

  Enhance the vm monitor command to provide complete VM usage tracking:
  - Real-time VM status (size, IP, power state)
  - Activity detection (idle, benchmark running, setup)
  - Cost tracking (uptime hours, hourly rate, total cost)
  - Azure ML jobs list (last 7 days with status)
  - Evaluation history (with --details flag)
  - Mock mode for testing without a VM (--mock flag)

  Add new API endpoints to the local.py dashboard server:
  - /api/benchmark/status - current job status with ETA
  - /api/benchmark/costs - cost breakdown (Azure VM, API, GPU)
  - /api/benchmark/metrics - performance metrics by domain
  - /api/benchmark/workers - worker status and utilization
  - /api/benchmark/runs - list all benchmark runs
  - /api/benchmark/tasks/{run}/{task} - task execution details

  Update README with a VM monitor section including screenshots and usage examples.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add segmentation testing documentation and test files

  Add a comprehensive test plan and results for the workflow segmentation pipeline:
  - Test plan with 8 stages, from environment setup to documentation
  - Test results documenting real capture processing outcomes
  - Test files for CaptureAdapter and the segmentation pipeline

  Add VM monitor screenshot generation scripts and documentation:
  - Scripts for automated dashboard screenshot generation
  - Implementation plan for the VM monitor screenshot feature
  - Analysis of screenshot capture approaches

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Document archived OpenAdapter repository

  - Archive OpenAdapter (an incomplete pre-refactor cloud deployment POC)
  - Document key takeaways and lessons learned
  - Reference the modern cloud infrastructure in openadapt-ml
  - Add guidelines for when to archive repositories

  OpenAdapter was an incomplete proof-of-concept from October 2024 with only 165 lines of code and no ecosystem usage. Cloud deployment is now production-ready in openadapt_ml/cloud/ and benchmarks/azure.py.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add search functionality to training viewer

  - Add a search bar to viewer controls with a Ctrl+F / Cmd+F keyboard shortcut
  - Implement advanced token-based search across step indices, action types, and text
  - Search filters the step list in real time with a result count display
  - Clear button and Escape key support for resetting search
  - Consistent UI styling with existing viewer components
  - Integrates with existing step list filtering

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: resolve ruff linting and formatting issues

* fix: resolve test failures from missing dependencies

  - Remove the non-existent openadapt_ml.shared_ui import from viewer.py
  - Skip the anthropic test when the anthropic package is not installed (optional dependency)
  - Skip the viewer_components test when openadapt-viewer is not installed (optional dependency)

  All tests now pass (334 passed, 6 skipped).

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
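The viewer's token-based search described above is implemented in the viewer's own HTML/JavaScript; as a rough Python sketch of the matching idea only, the logic might look like this (the field names `index`, `action_type`, and `text` are assumptions for illustration):

```python
def matches_search(query: str, step_index: int, action_type: str, text: str) -> bool:
    """Token-based match: every whitespace-separated query token must appear
    somewhere in the step's searchable fields (index, action type, or text)."""
    haystack = f"{step_index} {action_type} {text}".lower()
    return all(token in haystack for token in query.lower().split())


def filter_steps(query: str, steps: list[dict]) -> list[dict]:
    """Return only the steps matching the query; an empty query matches all."""
    return [
        s for s in steps
        if matches_search(query, s["index"], s["action_type"], s.get("text", ""))
    ]
```

Because `all(...)` over an empty token list is true, clearing the search box restores the full step list, matching the viewer behavior described in the commit.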
1 parent aeed4bf commit a6af99c

33 files changed

Lines changed: 10559 additions & 88 deletions

CLAUDE.md

Lines changed: 40 additions & 0 deletions
@@ -1,5 +1,16 @@
 # Claude Context for openadapt-ml

+## Project Status & Priorities
+
+**IMPORTANT**: Before starting work, always check the project-wide status document:
+- **Location**: `/Users/abrichr/oa/src/STATUS.md`
+- **Purpose**: Tracks P0 priorities, active background tasks, blockers, and strategic decisions
+- **Action**: Read this file at the start of every session to understand current priorities
+
+This ensures continuity between Claude Code sessions and context compactions.
+
+---
+
 This file helps maintain context across sessions.

 ---
@@ -18,9 +29,32 @@ This file helps maintain context across sessions.
 uv run python -m openadapt_ml.benchmarks.cli vm monitor
 ```

+**ENHANCED FEATURES (as of Jan 2026):**
+The `vm monitor` command now provides comprehensive VM usage visibility:
+- **VM Status**: Real-time VM state, size, and IP
+- **Activity Detection**: What the VM is currently doing (idle, benchmark running, setup)
+- **Cost Tracking**: Current uptime, hourly rate, and total cost for session
+- **Azure ML Jobs**: Recent jobs from last 7 days with status
+- **Evaluation History**: Past benchmark runs and success rates (with --details flag)
+- **Dashboard & Tunnels**: Auto-starts web dashboard and SSH/VNC tunnels
+
+**Usage:**
+```bash
+# Basic monitoring
+uv run python -m openadapt_ml.benchmarks.cli vm monitor
+
+# With detailed information (costs per day/week, evaluation history)
+uv run python -m openadapt_ml.benchmarks.cli vm monitor --details
+
+# With auto-shutdown after 2 hours
+uv run python -m openadapt_ml.benchmarks.cli vm monitor --auto-shutdown-hours 2
+```
+
 **WHY THIS MATTERS:**
 - VNC is ONLY accessible via SSH tunnel at `localhost:8006` (NOT the public IP)
 - The dashboard auto-manages SSH tunnels
+- Shows real-time costs to prevent budget overruns
+- Tracks all Azure ML jobs for visibility into what's running
 - Without it, you cannot see what Windows is doing
 - The user WILL be frustrated if you keep forgetting this

@@ -120,6 +154,12 @@ openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation a
 - With demo: 100% correct first actions
 - See `docs/experiments/demo_conditioned_prompting_results.md`

+**✅ VALIDATED (Jan 17, 2026)**: Demo persistence fix is working
+- The P0 fix in `openadapt-evals` ensures demo is included at EVERY step, not just step 1
+- Mock test confirms: agent behavior changes from 6.8 avg steps (random) to 3.0 avg steps (focused)
+- See `openadapt-evals/CLAUDE.md` for full validation details
+- **Next step**: Run full WAA evaluation (154 tasks) to measure episode success improvement
+
 **Next step**: Build demo retrieval to automatically select relevant demos from a library.

 **Key insight**: OpenAdapt's value is **trajectory-conditioned disambiguation of UI affordances**, not "better reasoning".

README.md

Lines changed: 46 additions & 0 deletions
@@ -781,6 +781,52 @@ uv run python -m openadapt_ml.cloud.local serve --port 8080 --open

 *View benchmark evaluation results with task-level filtering, success/failure status, and run comparison. Shows Claude achieving 30% on mock evaluation tasks (simulated environment for testing the pipeline - real WAA evaluation requires Windows VMs).*

+### 13.4 VM Monitoring Dashboard
+
+For managing Azure VMs used in benchmark evaluations, the `vm monitor` command provides a comprehensive dashboard:
+
+```bash
+# Start VM monitoring dashboard (auto-opens browser)
+uv run python -m openadapt_ml.benchmarks.cli vm monitor
+
+# Show detailed information (evaluation history, daily/weekly costs)
+uv run python -m openadapt_ml.benchmarks.cli vm monitor --details
+```
+
+**VM Monitor Dashboard (Full View):**
+
+![VM Monitor Dashboard](docs/screenshots/vm_monitor_dashboard_full.png)
+
+*The VM monitor dashboard shows: (1) VM status (name, IP, size, state), (2) Current activity (idle/benchmark running), (3) Cost tracking (uptime, hourly rate, total cost), (4) Recent Azure ML jobs from last 7 days, and (6) Dashboard & access URLs.*
+
+**VM Monitor Dashboard (With --details Flag):**
+
+![VM Monitor Dashboard Details](docs/screenshots/vm_monitor_details.png)
+
+*The --details flag adds: (5) Evaluation history with success rates and agent types, plus extended cost information (daily/weekly projections).*
+
+**Features:**
+- **Real-time VM status** - Shows VM size, power state, and IP address
+- **Activity detection** - Identifies if VM is idle, running benchmarks, or in setup
+- **Cost tracking** - Displays uptime hours, hourly rate, and total cost for current session
+- **Azure ML jobs** - Lists recent jobs from last 7 days with status indicators
+- **Evaluation history** - Shows past benchmark runs with success rates (with --details flag)
+- **Dashboard & tunnels** - Auto-starts web dashboard and SSH/VNC tunnels for accessing Windows VM
+
+**Mock mode for testing:**
+```bash
+# Generate screenshots or test dashboard without a VM running
+uv run python -m openadapt_ml.benchmarks.cli vm monitor --mock
+```
+
+**Auto-shutdown option:**
+```bash
+# Automatically deallocate VM after 2 hours to prevent runaway costs
+uv run python -m openadapt_ml.benchmarks.cli vm monitor --auto-shutdown-hours 2
+```
+
+For complete VM management commands and Azure setup instructions, see [`CLAUDE.md`](CLAUDE.md) and [`docs/azure_waa_setup.md`](docs/azure_waa_setup.md).
+
 ---

 ## 14. Limitations & Notes

docs/REPOSITORY_HISTORY.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
+# Repository History
+
+Documentation of deprecated and archived OpenAdapt ecosystem projects.
+
+## Deprecated/Archived Projects
+
+### OpenAdapter (Archived January 2026)
+
+**Repository**: https://github.com/OpenAdaptAI/OpenAdapter (ARCHIVED)
+**Status**: Incomplete proof-of-concept from before OpenAdapt refactor
+
+**Why Archived**:
+- Incomplete proof-of-concept code (only 165 lines, missing imports)
+- Created October 2024, minimal activity (14 commits, only 1 contributor)
+- Cloud infrastructure now handled by `openadapt_ml/cloud/` module
+- No active development, zero ecosystem usage
+- Last substantial commit was February 2025 (marked as WIP)
+
+**Original Purpose**:
+Attempted to provide cloud deployment infrastructure for screenshot parsing and action models, specifically targeting AWS ECS/ECR deployment for OmniParser using CDKTF (Terraform via Python).
+
+**Key Takeaways & Lessons Learned**:
+- Cloud training support is critical for productivity
+- Multiple backends (Lambda Labs, Azure) enable flexibility and cost optimization
+- Infrastructure as Code (Terraform/CDK) is appropriate for cloud setup
+- State management (tracking deployment IPs, configs) is important for multi-region deployments
+- Single-provider solutions are fragile - always support multiple cloud backends
+
+**What Replaced It**:
+- `openadapt_ml/cloud/lambda_labs.py` - Lambda Labs GPU rental and management
+- `openadapt_ml/cloud/azure_inference.py` - Azure ML integration for inference
+- `openadapt_ml/benchmarks/azure.py` - Azure ML for automated WAA evaluation
+- `scripts/setup_azure.py` - Full Azure setup automation with resource management
+- Documentation: `docs/cloud_gpu_training.md`, `docs/azure_waa_setup.md`
+
+**Modern Approach**:
+The current openadapt-ml cloud infrastructure is production-ready and supports:
+- Multiple cloud providers (Lambda Labs, Azure ML, local)
+- Multiple model types (not just OmniParser)
+- Automatic cleanup and quota management
+- Tested deployment patterns with comprehensive documentation
+- Cost estimation and monitoring tools
+
+**References**:
+- Original incomplete code: https://github.com/OpenAdaptAI/OpenAdapter/tree/feat/omniparser
+- Cloud architecture docs: `docs/cloud_gpu_training.md`
+- Azure setup guide: `docs/azure_waa_setup.md`
+
+---
+
+## Notes on Repository Management
+
+**When to Archive**:
+- No active development for 3+ months
+- Incomplete/experimental code that won't be finished
+- Functionality superseded by other ecosystem components
+- Zero usage in production or by other repos
+- Single contributor with no current interest
+
+**Before Archiving**:
+1. Review code for valuable patterns or ideas
+2. Document key takeaways in this file
+3. Update references in other repositories
+4. Remove from GitHub organization profile README
+5. Add archive notice to repository description
+
+**Alternative to Archiving**:
+- Move code to `legacy/` branch in main repository
+- Keep as example/reference in documentation
+- Convert to gist or snippet if very small
