# Grounding Cascade Design for DemoExecutor

## Problem

DemoExecutor replays mouse-based demonstrations. Click actions require grounding — finding where to click on the current screen given a description from the demo. The current grounder (gpt-4.1-mini) can't reliably click "Clear data" in Chrome's settings, capping the clear-browsing score at 0.25.

Users record mouse-based demos. That's the product. We must make mouse demos work, not work around them with keyboard shortcuts.

## Empirical evidence

| Grounder | Task | Score | Source |
|----------|------|-------|--------|
| gpt-4.1-mini | notepad-hello (1 click) | 1.00 | Our flywheel validation |
| gpt-4.1-mini | clear-browsing (1 critical click) | 0.25 | Our flywheel validation |
| UI-Venus-1.5-8B | client's tasks | 1.00 | Client engagement |
| GPT-5.4 + UI-Venus | client's tasks | 1.00 | Client engagement |

UI-Venus is proven on our use case. gpt-4.1-mini is not.

## Cascade architecture

```
Step receives action from demo
 │
 ├─ keyboard/type → Tier 1: Execute directly (deterministic)
 │                  Cost: $0, Latency: 0ms
 │
 └─ click/double_click → Tier 1.5 → Tier 2 → Tier 3
```
| 30 | + |
### Tier 1: Deterministic (keyboard, type)
- Execute directly via pyautogui
- No model, no cost, no latency
- Already implemented in DemoExecutor

### Tier 1.5: Visual template matching (NEW)
- **1.5a — OCR text match**: Run lightweight OCR on current screenshot. Match demo step's `target_description` against detected text labels. Return center of matched text bounding box.
- **1.5b — pHash/CLIP crop match**: Compare a crop of the target element from the demo screenshot against regions in the current screenshot. pHash for identical elements, CLIP for visually similar ones.
- **Confidence threshold**: Accept if confidence > 0.8, else escalate to Tier 2.
- **Cost**: $0 (CPU-only). **Latency**: <500ms.
- **When it works**: UI looks similar to demo (same resolution, same theme, same layout). Handles window position changes.
- **When it fails**: UI changed between demo and replay (different theme, different resolution, new layout).
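
A minimal sketch of the 1.5b crop matcher, using a difference hash (dHash) as a dependency-free stand-in for the pHash step. Images are modeled here as 2D lists of grayscale ints; real code would run `imagehash` or OpenCV on actual screenshots, and the function names and scan parameters are illustrative assumptions, not the existing implementation:

```python
def dhash(pixels, size=8):
    """64-bit difference hash of a grayscale image (2D list of 0-255 ints):
    sample a size x size grid and compare each pixel to its right neighbour."""
    h, w = len(pixels), len(pixels[0])
    bits = 0
    for row in range(size):
        y = row * h // size
        for col in range(size):
            x1 = col * w // (size + 1)
            x2 = (col + 1) * w // (size + 1)
            bits = (bits << 1) | (1 if pixels[y][x1] < pixels[y][x2] else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

def template_match(screen, crop, stride=8):
    """Slide the demo crop over the current screenshot and return
    ((cx, cy), confidence) for the lowest-Hamming-distance window."""
    ch, cw = len(crop), len(crop[0])
    sh, sw = len(screen), len(screen[0])
    target = dhash(crop)
    best_center, best_dist = None, 64  # 64 = max distance for a 64-bit hash
    for y in range(0, sh - ch + 1, stride):
        for x in range(0, sw - cw + 1, stride):
            window = [row[x:x + cw] for row in screen[y:y + ch]]
            d = hamming(target, dhash(window))
            if d < best_dist:
                best_center, best_dist = (x + cw // 2, y + ch // 2), d
    return best_center, 1.0 - best_dist / 64.0
```

A coarse stride with a fine re-scan around the best hit would keep latency under the 500ms budget; CLIP embedding similarity would slot in the same way for visually similar (rather than identical) elements.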

### Tier 2: Specialized UI grounder (UI-Venus, primary)
- **2a — UI-Venus** (default): Purpose-built for UI element grounding. Native bbox output `[x1, y1, x2, y2]`. Proven 1.0 on client tasks. Requires GPU serving (~$0.50/hr via sglang/vLLM).
- **2b — GPT-5.4** (fallback): Native computer use capabilities. 75% on OSWorld. Higher cost but no GPU needed. Use when UI-Venus endpoint is unavailable.
- **2c — Other specialized models** (future): UI-TARS-1.5, MAI-UI, OmniParser+VLM. Test empirically on our tasks before adopting.
- **Cost**: ~$0.002/call (UI-Venus local) or ~$0.01/call (GPT-5.4 API). **Latency**: 1-5s.
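
The grounder's bbox still arrives as model text, so the client has to parse `[x1, y1, x2, y2]` and derive a click point. A sketch (the exact output format of the served model is an assumption; some grounders emit coordinates normalized to the input resolution and would need rescaling before use):

```python
import json
import re

def parse_bbox(model_output):
    """Extract the first [x1, y1, x2, y2] integer list from grounder text.
    Returns None when no well-formed box is present."""
    m = re.search(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]", model_output)
    if not m:
        return None
    x1, y1, x2, y2 = json.loads(m.group(0))
    return (x1, y1, x2, y2)

def bbox_center(bbox):
    """Click point for a bbox: integer center."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```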

### Tier 3: Planner recovery
- Full VLM reasons about what went wrong and how to recover
- Only triggered when the screen state doesn't match expectations
- Already implemented in DemoExecutor (unused in practice — Tier 1+2 handle everything so far)
- **Cost**: ~$0.05/call. **Latency**: 5-15s.

## Confidence-based escalation

```python
def ground_click(step, current_screenshot, demo_screenshot):
    # Tier 1.5a: OCR text match
    if step.target_description:
        result = ocr_match(current_screenshot, step.target_description)
        if result.confidence > 0.8:
            return result.center

    # Tier 1.5b: Visual template match
    if step.target_crop:  # crop from demo screenshot
        result = template_match(current_screenshot, step.target_crop)
        if result.confidence > 0.8:
            return result.center

    # Tier 2a: UI-Venus (primary grounder)
    if ui_venus_available():
        result = ui_venus_ground(current_screenshot, step.target_description)
        if result.bbox:
            return result.bbox_center

    # Tier 2b: GPT-5.4 (fallback)
    result = gpt54_ground(current_screenshot, step.target_description)
    return result.coordinates
```
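
The cascade sketch assumes each tier returns an object carrying a click point and a confidence. One possible minimal shape (hypothetical, not the existing DemoExecutor type):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GroundingResult:
    """Uniform return type each tier can produce, so the cascade compares
    confidence without caring which tier answered."""
    center: Optional[Tuple[int, int]]  # click point, None if not found
    confidence: float                  # 0.0-1.0, tier-reported
    tier: str                          # e.g. "ocr", "template", "ui-venus"

    @property
    def accepted(self) -> bool:
        # mirrors the 0.8 threshold used in the escalation sketch
        return self.center is not None and self.confidence > 0.8
```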
| 82 | + |
## What exists vs what to build

| Component | Status | Location |
|-----------|--------|----------|
| Tier 1 (keyboard/type) | Done | `demo_executor.py` |
| Tier 1.5a (OCR match) | Partially done | `openadapt-grounding/locator.py` (ElementLocator) |
| Tier 1.5b (pHash/CLIP) | Partially done | `training/planner_cache.py` (pHash), CLIP in deps |
| Tier 2a (UI-Venus HTTP) | Done | `demo_executor.py` lines 254-313, `serve_ui_venus.sh` |
| Tier 2b (GPT-5.4 API) | Done | `demo_executor.py` (grounder_model param) |
| Tier 3 (planner recovery) | Done | `demo_executor.py` |
| Cascade routing logic | **Not done** | Need to implement |
| Demo screenshot crops | **Not done** | Need to store target element crops in demo |

## Implementation plan

### Phase 1: GPT-5.4 as grounder (1 hour, unblocks clear-browsing)
Change `grounder_model="gpt-4.1-mini"` to `grounder_model="gpt-5.4"` in DemoExecutor defaults or test scripts. Re-run clear-browsing. If it hits 1.0, the product demo works TODAY.

### Phase 2: UI-Venus as primary grounder (2 hours, cost optimization)
Boot GPU, serve UI-Venus, point DemoExecutor at `grounder_endpoint`. Re-run clear-browsing. Verify 1.0. This replaces $0.01/click (GPT-5.4) with $0.002/click (UI-Venus local).

### Phase 3: Tier 1.5 template matching (1-2 days)
- Store target element crops in demo JSON (crop from demo screenshot at the click location)
- Implement `template_match()` using pHash + CLIP
- Implement `ocr_match()` using the existing ElementLocator from openadapt-grounding
- Add confidence-based routing in DemoExecutor before the VLM grounder call
- This eliminates most VLM calls for demos where the UI matches
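
The `ocr_match()` step above can be sketched as a fuzzy match over OCR output, assuming the engine yields `(text, bbox)` pairs (the actual ElementLocator output shape may differ); `difflib` is one simple scoring choice:

```python
from difflib import SequenceMatcher

def ocr_match(ocr_boxes, target_description, threshold=0.8):
    """Match a demo step's target text against OCR results.
    ocr_boxes: list of (text, (x1, y1, x2, y2)) from any OCR engine.
    Returns ((cx, cy), score) for the best fuzzy match, or (None, score)
    when nothing clears the threshold."""
    target = target_description.lower().strip()
    best_center, best_score = None, 0.0
    for text, (x1, y1, x2, y2) in ocr_boxes:
        score = SequenceMatcher(None, target, text.lower().strip()).ratio()
        if score > best_score:
            best_center = ((x1 + x2) // 2, (y1 + y2) // 2)
            best_score = score
    if best_score < threshold:
        return None, best_score
    return best_center, best_score
```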

### Phase 4: Test-time zoom (RegionFocus) (1 day)
- If Tier 2 grounder confidence is low, crop to predicted region at 2x resolution
- Re-run grounder on the cropped image
- Literature shows +28% accuracy improvement on ScreenSpot-Pro
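
The zoom step is mostly coordinate bookkeeping: crop around the low-confidence prediction, upscale, re-ground, then map the result back to screen pixels. A sketch (the region size and zoom factor are illustrative assumptions):

```python
def zoom_region(screen_w, screen_h, pred_x, pred_y, zoom=2, region_frac=0.25):
    """Crop box around a low-confidence prediction, clamped to the screen.
    Returns (crop_box, to_screen) where to_screen maps coordinates from the
    zoomed crop back to full-screen pixels."""
    rw, rh = int(screen_w * region_frac), int(screen_h * region_frac)
    x1 = min(max(pred_x - rw // 2, 0), screen_w - rw)
    y1 = min(max(pred_y - rh // 2, 0), screen_h - rh)

    def to_screen(zx, zy):
        # the grounder saw the crop scaled up by `zoom`
        return (x1 + zx // zoom, y1 + zy // zoom)

    return (x1, y1, x1 + rw, y1 + rh), to_screen
```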

### Phase 5: Empirical evaluation of alternative grounders (1-2 days)
- Test UI-TARS-1.5-7B, MAI-UI-8B on our tasks
- Compare against UI-Venus empirically
- Adopt whichever performs best on OUR tasks, not benchmarks

## Cost analysis

For a 10-step demo with 4 click actions:

| Architecture | VLM calls | Cost | Latency |
|-------------|-----------|------|---------|
| Current (gpt-4.1-mini, every click) | 4 | $0.006 | 12s |
| Phase 1 (GPT-5.4, every click) | 4 | $0.04 | 12s |
| Phase 2 (UI-Venus, every click) | 4 | $0.008 | 8s |
| Phase 3 (template + UI-Venus fallback) | 0-2 | $0-0.004 | 1-5s |

Phase 3 is 10x cheaper than Phase 1 and faster, because template matching handles most clicks without a VLM call.
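
The Phase 3 row follows from a one-line expected-cost model in which only template misses pay for a VLM call (per-call prices taken from the tier descriptions above; the template hit rate is the unknown):

```python
def demo_cost(n_clicks, template_hit_rate, template_cost=0.0, vlm_cost=0.002):
    """Expected grounding cost per demo run: template matching is free,
    and only misses escalate to a VLM call (UI-Venus local pricing assumed)."""
    vlm_calls = n_clicks * (1.0 - template_hit_rate)
    return n_clicks * template_cost + vlm_calls * vlm_cost
```

With 4 clicks, a 50% hit rate gives $0.004 and a 100% hit rate gives $0, matching the $0-0.004 range in the table.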

## Design principles

1. **Empirical over benchmarks.** Use what works on OUR tasks. UI-Venus 1.0 > UI-TARS 94.2% on someone else's test.
2. **Cheapest first.** Template matching ($0) before VLM ($0.002) before API ($0.01).
3. **Confidence-driven.** Don't escalate when you're already right. Each tier reports confidence.
4. **Demo screenshots are gold.** The demo literally shows what the button looks like. Use that information (template matching) before asking a model.
5. **Fall back gracefully.** Every tier has a fallback. No single point of failure.