Compare UI element grounding approaches to determine the optimal strategy for click-to-element detection:
- OmniParser (baseline) - Current approach
- OmniParser + ScreenSeekeR-style cropping - Progressive cropping enhancement
- UI-TARS 1.5 - SOTA model
- UI-TARS + ScreenSeekeR-style cropping - Combined approach
| ID | Hypothesis | Expected Result |
|---|---|---|
| H1 | UI-TARS outperforms OmniParser on both synthetic and real data | UI-TARS +20-30% accuracy |
| H2 | Progressive cropping improves both models | +50-100% relative improvement |
| H3 | Combined approach (UI-TARS + cropping) achieves best results | >70% accuracy |
| H4 | Latency increases with cropping but remains acceptable | <2s per detection |
| Variable | Levels |
|---|---|
| Base Model | OmniParser, UI-TARS 1.5 |
| Cropping Strategy | None, Fixed (our current), ScreenSeekeR-style |
| Dataset | Synthetic, ScreenSpot, Real (curated) |
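For illustration, the full 2 × 3 × 3 grid of conditions can be enumerated directly from this table; a minimal sketch (the labels and the `experiment_grid` helper are placeholders, not existing code):

```python
from itertools import product

# Hypothetical labels mirroring the variable table above
MODELS = ["omniparser", "uitars_1.5"]
CROP_STRATEGIES = ["none", "fixed", "screenseeker"]
DATASETS = ["synthetic", "screenspot", "real"]

def experiment_grid():
    """Yield every (model, cropping, dataset) condition: 2 x 3 x 3 = 18 runs."""
    for model, crop, dataset in product(MODELS, CROP_STRATEGIES, DATASETS):
        yield {"model": model, "cropping": crop, "dataset": dataset}
```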
| Metric | Description | Target |
|---|---|---|
| Detection Rate | % of elements correctly located | >80% |
| IoU | Bounding box accuracy | >0.7 |
| Latency | Time per detection (ms) | <2000 |
| Attempts | Number of crops/transforms tried | <5 |
- Hardware: Same GPU (e.g., A100 or 4090)
- Image resolution: Standardized to 1920x1080
- Confidence threshold: 0.5 for all models
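As a reference point for the Detection Rate and IoU metrics above, a minimal sketch of how they could be computed, assuming `(x1, y1, x2, y2)` boxes and a configurable match threshold:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_rate(predictions, ground_truths, iou_threshold=0.5):
    """Fraction of ground-truth elements whose prediction overlaps above the threshold."""
    hits = sum(
        1 for pred, gt in zip(predictions, ground_truths)
        if pred is not None and iou(pred, gt) >= iou_threshold
    )
    return hits / len(ground_truths)
```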
Auto-generated UIs with perfect ground truth.
synthetic/
├── simple/ # Large buttons, high contrast (N=100)
├── medium/ # Standard UI complexity (N=200)
├── hard/ # Small icons, dense layouts (N=100)
└── professional/ # Mimics ScreenSpot-Pro apps (N=100)
Generation parameters:
- Element sizes: 16px-200px
- Element types: buttons, icons, text fields, checkboxes
- Backgrounds: light, dark, gradients
- Density: 5-50 elements per screen
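A rough sketch of how these parameters might drive generation of one synthetic screen; the per-difficulty density ranges and the spec format are illustrative assumptions, not the actual generator:

```python
import random

ELEMENT_TYPES = ["button", "icon", "text_field", "checkbox"]
BACKGROUNDS = ["light", "dark", "gradient"]

def sample_screen_spec(width=1920, height=1080, tier="medium"):
    """Randomly sample a screen layout spec within the parameter ranges above."""
    density = {"simple": (5, 10), "medium": (10, 30),
               "hard": (30, 50), "professional": (20, 50)}[tier]
    n_elements = random.randint(*density)
    elements = []
    for _ in range(n_elements):
        size = random.randint(16, 200)
        x = random.randint(0, width - size)
        y = random.randint(0, height - size)
        elements.append({
            "type": random.choice(ELEMENT_TYPES),
            "bbox": (x, y, x + size, y + size),  # doubles as ground truth
        })
    return {"background": random.choice(BACKGROUNDS), "elements": elements}
```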
Sampled from the public ScreenSpot benchmark for comparability.
screenspot/
├── mobile/ # iOS, Android (N=60)
├── desktop/ # macOS, Windows (N=80)
└── web/ # Various websites (N=60)
Screenshots from actual usage, manually annotated.
real/
├── productivity/ # VSCode, Slack, Chrome (N=40)
├── creative/ # Figma, Photoshop (N=30)
└── system/ # OS dialogs, settings (N=30)
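The method sketches below call a few shared helpers (`find_element_at_point`, `gaussian_vote`, and coordinate mapping) without defining them; here is a rough illustration of two of them, assuming each detected element exposes an `(x1, y1, x2, y2)` bbox:

```python
import math

def find_element_at_point(elements, click_xy):
    """Return the smallest detected element whose bbox contains the click point."""
    x, y = click_xy
    hits = [e for e in elements
            if e.bbox[0] <= x <= e.bbox[2] and e.bbox[1] <= y <= e.bbox[3]]
    if not hits:
        return None
    return min(hits, key=lambda e: (e.bbox[2] - e.bbox[0]) * (e.bbox[3] - e.bbox[1]))

def gaussian_vote(candidates, click_xy, sigma=50.0):
    """Pick the candidate whose bbox center is closest to the click, Gaussian-weighted."""
    if not candidates:
        return None
    def weight(elem):
        cx = (elem.bbox[0] + elem.bbox[2]) / 2
        cy = (elem.bbox[1] + elem.bbox[3]) / 2
        dist_sq = (cx - click_xy[0]) ** 2 + (cy - click_xy[1]) ** 2
        return math.exp(-dist_sq / (2 * sigma ** 2))
    return max(candidates, key=weight)
```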
def baseline_omniparser(image, click_xy):
    """Single-pass OmniParser detection."""
    elements = omniparser.parse(image)
    return find_element_at_point(elements, click_xy)

def omniparser_fixed_crop(image, click_xy):
    """Our current transform-based approach."""
    transforms = [
        lambda img: img,                              # Original
        lambda img: crop_around(img, click_xy, 200),
        lambda img: crop_around(img, click_xy, 300),
        lambda img: adjust_brightness(img, 1.2),
        lambda img: to_grayscale(img),
    ]
    for transform in transforms:
        elements = omniparser.parse(transform(image))
        if element := find_element_at_point(elements, click_xy):
            return element
    return None

def omniparser_screenSeekeR(image, click_xy, target_description=None):
    """LLM-guided progressive cropping."""
    # Step 1: Get region predictions from LLM
    regions = llm.predict_ui_regions(image, click_xy, target_description)
    # e.g., ["toolbar", "menu bar", "left sidebar"]
    candidates = []
    for region in regions:
        # Step 2: Crop to predicted region
        cropped = crop_to_region(image, region)
        # Step 3: Run detection on cropped region
        elements = omniparser.parse(cropped)
        # Step 4: Map back to original coordinates
        for elem in elements:
            elem.bbox = map_to_original(elem.bbox, region)
            candidates.append(elem)
    # Step 5: Vote and select best candidate
    return gaussian_vote(candidates, click_xy)

def baseline_uitars(image, click_xy):
    """Single-pass UI-TARS detection."""
    # UI-TARS uses a different API: a grounding query
    prompt = f"Find the UI element at coordinates ({click_xy[0]}, {click_xy[1]})"
    result = uitars.ground(image, prompt)
    return result

def uitars_screenSeekeR(image, click_xy, target_description=None):
    """UI-TARS with progressive cropping."""
    # Same cropping strategy as Method 3
    regions = llm.predict_ui_regions(image, click_xy, target_description)
    candidates = []
    for region in regions:
        cropped = crop_to_region(image, region)
        # Express the click point in the crop's coordinate frame (inverse of map_to_original)
        local_xy = map_to_region(click_xy, region)
        prompt = f"Find the UI element at coordinates ({local_xy[0]}, {local_xy[1]})"
        result = uitars.ground(cropped, prompt)
        if result:
            result.bbox = map_to_original(result.bbox, region)
            candidates.append(result)
    return gaussian_vote(candidates, click_xy)

- Set up UI-TARS 1.5 deployment (GPU server)
- Implement unified evaluation harness (see the sketch after this list)
- Create synthetic dataset generator
- Download/prepare ScreenSpot subset
- Run OmniParser baseline on all datasets
- Run UI-TARS baseline on all datasets
- Compute metrics and generate initial plots
- Identify failure cases
- Implement fixed cropping for both models
- Implement ScreenSeekeR-style cropping
- Run evaluations on all datasets
- Compare cropping strategies
- Generate final comparison plots
- Analyze failure cases
- Write up findings
- Recommend production approach
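Roughly, the unified evaluation harness would run each method over each dataset and aggregate per-sample results into the metrics dataclasses defined below; a simplified sketch that reuses the `iou` helper from earlier and assumes samples are `(image, click_xy, gt_bbox)` tuples (attempt counting omitted):

```python
import time

METHODS = {
    "omniparser": baseline_omniparser,
    "omniparser+crop": omniparser_fixed_crop,
    "omniparser+seeker": omniparser_screenSeekeR,
    "uitars": baseline_uitars,
    "uitars+seeker": uitars_screenSeekeR,
}

def evaluate(method, dataset, iou_threshold=0.7):
    """Run one method over a dataset of (image, click_xy, gt_bbox) samples."""
    ious, latencies, hits = [], [], 0
    for image, click_xy, gt_bbox in dataset:
        start = time.perf_counter()
        element = method(image, click_xy)
        latencies.append((time.perf_counter() - start) * 1000)
        if element is not None:
            score = iou(element.bbox, gt_bbox)
            ious.append(score)
            hits += score >= iou_threshold
    return {
        "detection_rate": hits / len(dataset),
        "mean_iou": sum(ious) / len(ious) if ious else 0.0,
        "mean_latency_ms": sum(latencies) / len(latencies),
    }
```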
from dataclasses import dataclass

@dataclass
class EvalMetrics:
    detection_rate: float    # % found within IoU threshold
    mean_iou: float          # Average IoU for detected elements
    mean_latency_ms: float   # Average detection time
    mean_attempts: float     # Average crops/transforms tried

@dataclass
class DetailedMetrics:
    # Breakdown by element size
    detection_rate_small: float   # <32px
    detection_rate_medium: float  # 32-100px
    detection_rate_large: float   # >100px
    # Breakdown by element type
    detection_rate_text: float
    detection_rate_icon: float
    detection_rate_button: float
    # Failure analysis
    false_positive_rate: float
    wrong_element_rate: float     # Detected but wrong element

Detection Rate by Method
========================
Synthetic ScreenSpot Real
OmniParser [===] [==] [==]
OmniParser+Crop [====] [===] [===]
OmniParser+SeekeR [=====] [====] [====]
UI-TARS [======] [=====] [=====]
UI-TARS+Crop [======] [=====] [=====]
UI-TARS+SeekeR [=======] [======] [======]
Accuracy vs Latency Tradeoff
============================
| * UI-TARS+SeekeR
A | * UI-TARS
c | * OmniParser+SeekeR
c | * OmniParser+Crop
| * OmniParser
+------------------------
Latency (ms)
Detection Rate by Element Size
==============================
100%|----*----*----*----*
| / / / /
50%|--/----/----/----/
| / / / /
0%|/----/----/----/----
16 32 64 128 256
Element Size (px)
— OmniParser — UI-TARS — +SeekeR
Detection Rate vs Attempts
==========================
100%| *---*---*
| *--/
80%| *--/
| */
60%|*/
|
+---+---+---+---+---
1 2 3 4 5
Attempts
Shows how detection rate improves with more cropping attempts.
Element Type Confusion
======================
Predicted
Btn Icon Text None
Actual Btn [85] [ 5] [ 2] [ 8]
Icon [ 3] [72] [ 5] [20]
Text [ 1] [ 2] [90] [ 7]
Visual grid showing:
- Screenshot with ground truth bbox (green)
- Detected bbox (red/yellow)
- Click point (blue dot)
- Method used
- IoU score
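A minimal matplotlib sketch of one cell of that grid, following the color scheme listed above (the function name and output path are placeholders):

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def plot_example(image, gt_bbox, pred_bbox, click_xy, method, iou_score, out_path="example.png"):
    """Draw ground truth (green), prediction (red), and the click point (blue) on one screenshot."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.imshow(image)
    for bbox, color in [(gt_bbox, "green"), (pred_bbox, "red")]:
        if bbox is None:
            continue
        x1, y1, x2, y2 = bbox
        ax.add_patch(patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                       fill=False, edgecolor=color, linewidth=2))
    ax.plot(*click_xy, "o", color="blue", markersize=6)
    ax.set_title(f"{method}  IoU={iou_score:.2f}")
    ax.axis("off")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```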
| Criterion | Threshold | Rationale |
|---|---|---|
| Best method detection rate | >80% | Usable for production |
| Latency | <2000ms | Acceptable UX |
| Improvement over baseline | >30% relative | Worth the complexity |
| Consistent across datasets | <10% variance | Generalizable |
| Risk | Impact | Mitigation |
|---|---|---|
| UI-TARS deployment issues | High | Fall back to API/cloud |
| LLM costs for ScreenSeekeR | Medium | Cache predictions, batch |
| Synthetic data not representative | Medium | Validate on real data |
| GPU availability | Medium | Use cloud instances |
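On the LLM-cost risk, one hedge is to memoize region predictions per screenshot and query; a hypothetical sketch (assumes PIL images and the `llm.predict_ui_regions` call from the pseudocode above):

```python
import hashlib

_region_cache = {}

def cached_predict_ui_regions(image, click_xy, target_description=None):
    """Memoize LLM region predictions keyed by screenshot content and query."""
    key = (hashlib.sha256(image.tobytes()).hexdigest(), click_xy, target_description)
    if key not in _region_cache:
        _region_cache[key] = llm.predict_ui_regions(image, click_xy, target_description)
    return _region_cache[key]
```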
- Code
  - `evaluation/harness.py` - Unified evaluation framework
  - `evaluation/datasets/` - Dataset generators and loaders
  - `evaluation/methods/` - All detection methods
  - `evaluation/plots.py` - Visualization generation
- Data
  - Synthetic dataset (500 samples)
  - ScreenSpot subset (200 samples)
  - Real curated dataset (100 samples)
  - All evaluation results (JSON)
- Documentation
  - This experiment plan
  - Results summary with plots
  - Recommendations for production
- Artifacts
  - Trained/deployed models
  - Evaluation plots (PNG/PDF)
  - Interactive results viewer (optional)