Commit e912b65

abrichr and claude authored
feat: GroundingTarget + GroundingCandidate data model for cascade architecture (#256)
Phase 3 of the grounding cascade design (v3):

- grounding.py: GroundingTarget (rich target per click step — description, crop, nearby text, window title, structured transition expectations) and GroundingCandidate (normalized output from each grounding tier)
- demo_library.py: DemoStep gains optional grounding_target field; serialization/deserialization handles GroundingTarget objects
- Fix demo description: "Clear data" → "Clear now" (actual button text)
- Design docs: v1, v2, v3 cascade architecture

GroundingTarget is the foundation for the entire cascade. Every downstream tier (OCR, CLIP, UI-Venus, GPT-5.4) operates on the same rich signal instead of a weak description string.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2393c53 commit e912b65
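The commit message names two new dataclasses in grounding.py. A minimal sketch of their shape, inferred only from the fields listed above — the names, types, and defaults here are assumptions, not the actual grounding.py:

```python
from dataclasses import dataclass, field

@dataclass
class GroundingTarget:
    """Rich grounding signal for one click step (fields assumed from the
    commit message, not copied from grounding.py)."""
    description: str                          # human-readable target description
    crop_path: str | None = None              # crop of the element from the demo screenshot
    nearby_text: list[str] = field(default_factory=list)  # OCR'd text near the target
    window_title: str | None = None           # window the demo click happened in
    expected_transition: dict | None = None   # structured expectation for the post-click screen

@dataclass
class GroundingCandidate:
    """Normalized output from one grounding tier, so the cascade can compare
    candidates from OCR, CLIP, UI-Venus, and GPT-5.4 uniformly."""
    tier: str          # e.g. "ocr", "clip", "ui-venus", "gpt-5.4"
    x: float           # proposed click point, normalized to [0, 1]
    y: float
    confidence: float  # drives the escalation decision
```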

6 files changed: 959 additions & 1 deletion

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
```json
{
  "task_id": "custom-clear-chrome-data",
  "demo_id": "manual",
  "description": "Clear browsing data in Google Chrome using Ctrl+Shift+Delete shortcut",
  "created_at": "2026-03-27T03:01:01.183920+00:00",
  "metadata": {
    "source": "manual",
    "created_by": "create_manual_demo.py"
  },
  "steps": [
    {
      "step_index": 0,
      "screenshot_path": "",
      "action_type": "click",
      "action_description": "CLICK(0.05, 0.20)",
      "target_description": "",
      "action_value": "",
      "x": 0.05,
      "y": 0.2,
      "description": "Double-click Google Chrome icon on the desktop to open Chrome",
      "metadata": {}
    },
    {
      "step_index": 1,
      "screenshot_path": "",
      "action_type": "key",
      "action_description": "KEY(ctrl+shift+delete)",
      "target_description": "",
      "action_value": "ctrl+shift+delete",
      "x": null,
      "y": null,
      "description": "Press Ctrl+Shift+Delete to open Clear Browsing Data dialog directly",
      "metadata": {}
    },
    {
      "step_index": 2,
      "screenshot_path": "",
      "action_type": "click",
      "action_description": "CLICK(0.50, 0.70)",
      "target_description": "",
      "action_value": "",
      "x": 0.5,
      "y": 0.7,
      "description": "Click the 'Clear now' button in the Clear Browsing Data dialog",
      "metadata": {}
    },
    {
      "step_index": 3,
      "screenshot_path": "",
      "action_type": "key",
      "action_description": "KEY(alt+f4)",
      "target_description": "",
      "action_value": "alt+f4",
      "x": null,
      "y": null,
      "description": "Close Chrome after clearing data",
      "metadata": {}
    }
  ]
}
```
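A minimal sketch of reading a demo file like this into step objects — field names follow the JSON above, but the class and loader are illustrative, not the actual demo_library.py (which this commit extends with grounding_target handling):

```python
import json
from dataclasses import dataclass

@dataclass
class DemoStep:
    """Subset of the demo step schema shown above (illustrative only)."""
    step_index: int
    action_type: str   # "click" or "key" in this demo
    action_value: str  # key chord for "key" steps, empty for clicks
    x: float | None    # normalized click coordinates, null for key steps
    y: float | None
    description: str

def load_demo(path: str) -> list[DemoStep]:
    with open(path) as f:
        demo = json.load(f)
    return [
        DemoStep(s["step_index"], s["action_type"], s["action_value"],
                 s["x"], s["y"], s["description"])
        for s in demo["steps"]
    ]
```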
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# Grounding Cascade Design for DemoExecutor

## Problem

DemoExecutor replays mouse-based demonstrations. Click actions require grounding — finding where to click on the current screen given a description from the demo. The current grounder (gpt-4.1-mini) can't reliably click "Clear data" in Chrome's settings, capping the clear-browsing task score at 0.25.

Users record mouse-based demos. That's the product. We must make mouse demos work, not work around them with keyboard shortcuts.
## Empirical evidence

| Grounder | Task | Score | Source |
|----------|------|-------|--------|
| gpt-4.1-mini | notepad-hello (1 click) | 1.00 | Our flywheel validation |
| gpt-4.1-mini | clear-browsing (1 critical click) | 0.25 | Our flywheel validation |
| UI-Venus-1.5-8B | client's tasks | 1.00 | Client engagement |
| GPT-5.4 + UI-Venus | client's tasks | 1.00 | Client engagement |

UI-Venus is proven on our use case. gpt-4.1-mini is not.

## Cascade architecture

```
Step receives action from demo
│
├─ keyboard/type → Tier 1: Execute directly (deterministic)
│                  Cost: $0, Latency: 0ms
│
└─ click/double_click → Tier 1.5 → Tier 2 → Tier 3
```

### Tier 1: Deterministic (keyboard, type)

- Execute directly via pyautogui (sketch below)
- No model, no cost, no latency
- Already implemented in DemoExecutor
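How Tier 1 dispatch might look — a sketch, since the actual DemoExecutor code is not part of this diff; the step fields follow the demo JSON above:

```python
import pyautogui

def execute_deterministic(step) -> None:
    """Tier 1: replay keyboard/type actions directly — no grounding needed."""
    if step.action_type == "key":
        # e.g. "ctrl+shift+delete" -> pyautogui.hotkey("ctrl", "shift", "delete")
        pyautogui.hotkey(*step.action_value.split("+"))
    elif step.action_type == "type":
        pyautogui.write(step.action_value, interval=0.02)
```
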
### Tier 1.5: Visual template matching (NEW)

- **1.5a — OCR text match**: Run lightweight OCR on current screenshot. Match demo step's `target_description` against detected text labels. Return center of matched text bounding box.
- **1.5b — pHash/CLIP crop match**: Compare a crop of the target element from the demo screenshot against regions in the current screenshot. pHash for identical elements, CLIP for visually similar ones. (A minimal stand-in sketch follows this list.)
- **Confidence threshold**: Accept if confidence > 0.8, else escalate to Tier 2.
- **Cost**: $0 (CPU-only). **Latency**: <500ms.
- **When it works**: UI looks similar to demo (same resolution, same theme, same layout). Handles window position changes.
- **When it fails**: UI changed between demo and replay (different theme, different resolution, new layout).
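As a concrete stand-in for 1.5b, a plain OpenCV template match — a sketch only: the design above calls for pHash/CLIP, which tolerate more visual drift than raw cross-correlation:

```python
import cv2
import numpy as np

def template_match(screenshot: np.ndarray, target_crop: np.ndarray):
    """Tier 1.5b stand-in: normalized cross-correlation over the screenshot.
    Works when demo and replay share resolution/theme; caller escalates
    to Tier 2 when confidence <= 0.8 (the threshold from the list above)."""
    scores = cv2.matchTemplate(screenshot, target_crop, cv2.TM_CCOEFF_NORMED)
    _, confidence, _, top_left = cv2.minMaxLoc(scores)
    h, w = target_crop.shape[:2]
    center = (top_left[0] + w // 2, top_left[1] + h // 2)
    return center, float(confidence)
```
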
### Tier 2: Specialized UI grounder (UI-Venus, primary)

- **2a — UI-Venus** (default): Purpose-built for UI element grounding. Native bbox output `[x1, y1, x2, y2]`. Proven 1.0 on client tasks. Requires GPU serving (~$0.50/hr via sglang/vLLM). (A hedged client sketch follows this list.)
- **2b — GPT-5.4** (fallback): Native computer use capabilities. 75% on OSWorld. Higher cost but no GPU needed. Use when the UI-Venus endpoint is unavailable.
- **2c — Other specialized models** (future): UI-TARS-1.5, MAI-UI, OmniParser+VLM. Test empirically on our tasks before adopting.
- **Cost**: ~$0.002/call (UI-Venus local) or ~$0.01/call (GPT-5.4 API). **Latency**: 1-5s.
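What a Tier 2a call could look like if UI-Venus is served behind an OpenAI-compatible endpoint (both sglang and vLLM expose one). The URL, model name, prompt, and response parsing are assumptions for illustration — the actual client lives in demo_executor.py:

```python
import base64
import requests

def ui_venus_ground(screenshot_png: bytes, target_description: str,
                    endpoint: str = "http://localhost:8000/v1/chat/completions") -> str:
    """Ask a locally served UI-Venus for a bbox; reply format is assumed."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    resp = requests.post(endpoint, json={
        "model": "ui-venus",  # placeholder model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": f"Ground this element; reply as [x1, y1, x2, y2]: {target_description}"},
            ],
        }],
    }, timeout=30)
    return resp.json()["choices"][0]["message"]["content"]  # e.g. "[412, 388, 530, 414]"
```
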
### Tier 3: Planner recovery

- Full VLM reasons about what went wrong and how to recover
- Only triggered when the screen state doesn't match expectations
- Already implemented in DemoExecutor (unused in practice — Tier 1+2 handle everything so far)
- **Cost**: ~$0.05/call. **Latency**: 5-15s.

## Confidence-based escalation

```python
def ground_click(step, current_screenshot, demo_screenshot):
    # Tier 1.5a: OCR text match
    if step.target_description:
        result = ocr_match(current_screenshot, step.target_description)
        if result.confidence > 0.8:
            return result.center

    # Tier 1.5b: Visual template match
    if step.target_crop:  # crop from demo screenshot
        result = template_match(current_screenshot, step.target_crop)
        if result.confidence > 0.8:
            return result.center

    # Tier 2a: UI-Venus (primary grounder)
    if ui_venus_available():
        result = ui_venus_ground(current_screenshot, step.target_description)
        if result.bbox:
            return result.bbox_center

    # Tier 2b: GPT-5.4 (fallback)
    result = gpt54_ground(current_screenshot, step.target_description)
    return result.coordinates
```

## What exists vs what to build

| Component | Status | Location |
|-----------|--------|----------|
| Tier 1 (keyboard/type) | Done | `demo_executor.py` |
| Tier 1.5a (OCR match) | Partially done | `openadapt-grounding/locator.py` (ElementLocator) |
| Tier 1.5b (pHash/CLIP) | Partially done | `training/planner_cache.py` (pHash), CLIP in deps |
| Tier 2a (UI-Venus HTTP) | Done | `demo_executor.py` lines 254-313, `serve_ui_venus.sh` |
| Tier 2b (GPT-5.4 API) | Done | `demo_executor.py` (grounder_model param) |
| Tier 3 (planner recovery) | Done | `demo_executor.py` |
| Cascade routing logic | **Not done** | Need to implement |
| Demo screenshot crops | **Not done** | Need to store target element crops in demo |

## Implementation plan

### Phase 1: GPT-5.4 as grounder (1 hour, unblocks clear-browsing)

Change `grounder_model="gpt-4.1-mini"` to `grounder_model="gpt-5.4"` in DemoExecutor defaults or test scripts. Re-run clear-browsing. If it hits 1.0, the product demo works TODAY.

### Phase 2: UI-Venus as primary grounder (2 hours, cost optimization)

Boot a GPU, serve UI-Venus, point DemoExecutor at `grounder_endpoint`. Re-run clear-browsing. Verify 1.0. This replaces $0.01/click (GPT-5.4) with $0.002/click (UI-Venus local).

### Phase 3: Tier 1.5 template matching (1-2 days)

- Store target element crops in demo JSON (crop from the demo screenshot at the click location — see the sketch after this list)
- Implement `template_match()` using pHash + CLIP
- Implement `ocr_match()` using the existing ElementLocator from openadapt-grounding
- Add confidence-based routing in DemoExecutor before the VLM grounder call
- This eliminates most VLM calls for demos where the UI matches
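Storing the crop at record time could be as simple as the following sketch — the 120px box size and function name are assumptions, not part of the design:

```python
from PIL import Image

def extract_target_crop(demo_screenshot_path: str, x: float, y: float,
                        box: int = 120) -> Image.Image:
    """Phase 3 sketch: crop the demo screenshot around the recorded click
    point (normalized x, y) so the element's appearance travels with the demo."""
    img = Image.open(demo_screenshot_path)
    cx, cy = int(x * img.width), int(y * img.height)
    half = box // 2
    return img.crop((max(0, cx - half), max(0, cy - half),
                     min(img.width, cx + half), min(img.height, cy + half)))
```
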
### Phase 4: Test-time zoom (RegionFocus) (1 day)

- If Tier 2 grounder confidence is low, crop to the predicted region at 2x resolution (see the sketch after this list)
- Re-run the grounder on the cropped image
- Literature shows a +28% accuracy improvement on ScreenSpot-Pro
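A sketch of the zoom-and-reground step — `ground_fn` stands in for any Tier 2 grounder returning pixel coordinates; the 100px pad is an assumption:

```python
from PIL import Image

def zoom_reground(screenshot: Image.Image, pred_box: tuple[int, int, int, int],
                  ground_fn, pad: int = 100) -> tuple[int, int]:
    """Phase 4 sketch: crop around a low-confidence prediction, upscale 2x,
    re-run the grounder, then map the result back to full-screen coordinates."""
    l, t, r, b = pred_box
    l, t = max(0, l - pad), max(0, t - pad)
    r, b = min(screenshot.width, r + pad), min(screenshot.height, b + pad)
    region = screenshot.crop((l, t, r, b)).resize(((r - l) * 2, (b - t) * 2))
    zx, zy = ground_fn(region)        # grounder sees a 2x zoomed region
    return l + zx // 2, t + zy // 2   # map back to original coordinates
```
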
### Phase 5: Empirical evaluation of alternative grounders (1-2 days)

- Test UI-TARS-1.5-7B, MAI-UI-8B on our tasks
- Compare against UI-Venus empirically
- Adopt whichever performs best on OUR tasks, not benchmarks

## Cost analysis

For a 10-step demo with 4 click actions:

| Architecture | VLM calls | Cost | Latency |
|-------------|-----------|------|---------|
| Current (gpt-4.1-mini, every click) | 4 | $0.006 | 12s |
| Phase 1 (GPT-5.4, every click) | 4 | $0.04 | 12s |
| Phase 2 (UI-Venus, every click) | 4 | $0.008 | 8s |
| Phase 3 (template + UI-Venus fallback) | 0-2 | $0-0.004 | 1-5s |

Phase 3 is 10x cheaper than Phase 1 and faster, because template matching handles most clicks without a VLM call.

## Design principles

1. **Empirical over benchmarks.** Use what works on OUR tasks. UI-Venus 1.0 > UI-TARS 94.2% on someone else's test.
2. **Cheapest first.** Template matching ($0) before VLM ($0.002) before API ($0.01).
3. **Confidence-driven.** Don't escalate when you're already right. Each tier reports confidence.
4. **Demo screenshots are gold.** The demo literally shows what the button looks like. Use that information (template matching) before asking a model.
5. **Fall back gracefully.** Every tier has a fallback. No single point of failure.
