January 2025
Our evaluation reveals surprising results that contradict the literature benchmarks:
| Model | Literature (ScreenSpot-Pro) | Our Evaluation (Synthetic) |
|---|---|---|
| UI-TARS | 61.6% | 36.1% |
| OmniParser | 39.6% | 97.4% |
| Winner | UI-TARS (+22.0 pp) | OmniParser (+61.3 pp) |
Key finding: The task matters more than the model. OmniParser's detection-based approach dominates on our evaluation, while UI-TARS excels at complex instruction-following in professional applications.
The literature review (see docs/literature_review.md) identified UI-TARS 1.5 as SOTA:
- ScreenSpot-Pro: 61.6% (vs OmniParser's 39.6%)
- OSWorld: 42.5% (vs Claude 3.7's 28.0%)
- AndroidWorld: 64.2%
These benchmarks led to the hypothesis that UI-TARS would outperform OmniParser.
| Aspect | ScreenSpot-Pro | Our Synthetic Evaluation |
|---|---|---|
| Task | Natural language instruction → click | Ground truth bbox → verify detection |
| Example input | "Click the Save button in the File menu" | Element at (0.15, 0.23) with text "Submit" |
| Required reasoning | Parse instruction, locate hierarchically | Simple matching/detection |
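The difference between the two tasks is easiest to see in the scoring rule. A minimal sketch of the synthetic evaluation's metric (helper names here are hypothetical, and coordinates are assumed to be normalized to [0, 1]): a detection counts as correct if its center falls inside the ground-truth bounding box.

```python
# Hypothetical scoring sketch for the synthetic detection task.
# Boxes are (x0, y0, x1, y1) in normalized [0, 1] coordinates.

def center(bbox):
    """Center point of a bounding box."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def is_hit(pred_bbox, gt_bbox):
    """True if the predicted box's center lies inside the ground truth."""
    cx, cy = center(pred_bbox)
    x0, y0, x1, y1 = gt_bbox
    return x0 <= cx <= x1 and y0 <= cy <= y1

def accuracy(predictions, ground_truths):
    """Fraction of ground-truth elements matched by some prediction."""
    hits = sum(
        any(is_hit(p, gt) for p in predictions) for gt in ground_truths
    )
    return hits / len(ground_truths)
```

Under a rule like this, no instruction parsing is needed at all; a model only has to detect elements, which is exactly OmniParser's strength.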
UI-TARS is optimized for parsing complex instructions ("Click the third item in the dropdown menu") and multi-step reasoning. This capability is wasted when the target is already precisely specified.
OmniParser simply detects all UI elements and matches them. For well-defined targets, this direct approach wins.
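A sketch of that detection-then-match pipeline (the element format is illustrative, not OmniParser's actual output schema): once all elements are detected with text labels, matching a well-defined target reduces to a lookup.

```python
# Illustrative detection-then-match pipeline. The detection list is a
# stand-in for a detector's output; OmniParser's real API differs.

def match_target(detections, target_text):
    """Return the first detected element whose text matches the target."""
    target = target_text.strip().lower()
    for elem in detections:
        if elem["text"].strip().lower() == target:
            return elem
    return None

detections = [
    {"text": "Cancel", "bbox": (0.70, 0.80, 0.78, 0.85)},
    {"text": "Submit", "bbox": (0.82, 0.80, 0.90, 0.85)},
]
hit = match_target(detections, "submit")
# hit["bbox"] is the click target: (0.82, 0.80, 0.90, 0.85)
```

No hierarchical reasoning is involved, which is why the direct approach wins when the target is unambiguous.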
| Characteristic | ScreenSpot-Pro | Our Synthetic |
|---|---|---|
| Avg element size | 0.07% of screen | ~1-5% of screen |
| Element density | High (professional apps) | Moderate |
| Ambiguity | High (many similar buttons) | Low (distinct elements) |
| Resolution | High-res professional software | Standard 1920x1080 |
ScreenSpot-Pro tests tiny elements in professional software (CAD, video editing, IDEs) where targets are often just 20x20 pixels. Our synthetic data has larger, clearer targets where detection is easier.
ScreenSpot-Pro instructions require reasoning:
- "Click the brush tool in the toolbar" (must identify toolbar region, then brush icon)
- "Select the layer named 'Background'" (must find Layers panel, scroll if needed)
Our evaluation uses direct descriptions:
- "Click the 'Submit' button" (single element lookup)
- "Click the search icon" (straightforward matching)
UI-TARS's "System-2 reasoning" capability provides no benefit for direct lookups.
OmniParser's strengths in this setting:
- Fast detection (724 ms vs. 2,724 ms)
- High recall on standard UI elements
- Consistent - detection-based approach has predictable behavior
- Good for automation - works well when element characteristics are known
UI-TARS's strengths:
- Complex instructions - can parse "the third blue button from the left"
- Hierarchical navigation - understands "in the File menu, under Export"
- Ambiguity resolution - better at choosing among similar elements
- Professional apps - trained on complex software interfaces
Our evaluation would favor UI-TARS if we:
- Used ambiguous instructions ("click the settings icon" with multiple gear icons)
- Required hierarchical reasoning ("the close button in the modal dialog")
- Tested on professional software screenshots with tiny elements
- Evaluated instruction-following accuracy rather than element detection
For the core use case of replaying recorded actions, OmniParser is the better fit:
- Click coordinates are known precisely
- Elements have been identified during recording
- Speed matters for responsive automation
- OmniParser's 97%+ detection rate is sufficient
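The replay scenario above can be sketched as follows (the recorded-action format and the `detect`/`click` callables are hypothetical stand-ins, not openadapt's actual interfaces): playback only needs to re-detect the recorded element on the current screen and click its center.

```python
# Hypothetical replay sketch. `recorded` carries the element text captured
# at recording time; `detect` returns the current screen's elements as
# [{"text": ..., "bbox": (x0, y0, x1, y1)}, ...]; `click` takes a
# normalized (x, y) point. All three are injected stand-ins.

def replay_click(recorded, detect, click):
    """Re-locate a recorded element on the current screen and click it."""
    for elem in detect():
        if elem["text"] == recorded["text"]:
            x0, y0, x1, y1 = elem["bbox"]
            click((x0 + x1) / 2, (y0 + y1) / 2)
            return True
    return False  # not found; a slower fallback could run here
```

Because the target's identity is already known from the recording, detection accuracy and latency are the only variables that matter, which is exactly where OmniParser leads.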
UI-TARS remains the better choice for:
- Natural language automation ("Click the submit button")
- Handling ambiguous targets
- Professional software with complex UIs
- Cases where OmniParser fails on tiny icons
Our error analysis found minimal complementarity:
- UI-TARS found only 1 unique element that OmniParser missed
- Ensemble potential: 99.6% (+0.3% over OmniParser alone)
- Not worth the 4x latency cost
The literature predicted cropping would help significantly (ScreenSeekeR: +254% improvement).
Our results:
| Method | Baseline | + Cropping | Improvement |
|---|---|---|---|
| UI-TARS | 36.1% | 70.6% | +95% |
| OmniParser | 97.4% | 99.3% | +2% |
Cropping helps UI-TARS dramatically (validates ScreenSeekeR findings) but provides marginal benefit for OmniParser (already at ceiling on our data).
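One way to implement the cropping strategy, sketched under stated assumptions (coordinates normalized to [0, 1]; the `ground` callable is a stand-in for a grounding model, not either tool's real API): run the model on a crop centered on a coarse guess, then map the local prediction back to full-screen coordinates.

```python
# Illustrative crop-then-reground sketch. `ground(region)` is a stand-in
# that returns an (x, y) point normalized to the region it was given;
# `region` is (x0, y0, x1, y1) in global normalized coordinates.

def crop_and_reground(ground, guess, crop_frac=0.25):
    """Ground inside a crop centered on `guess`; return global coords."""
    gx, gy = guess
    half = crop_frac / 2
    # Clamp the crop so it stays inside the screen.
    x0 = min(max(gx - half, 0.0), 1.0 - crop_frac)
    y0 = min(max(gy - half, 0.0), 1.0 - crop_frac)
    region = (x0, y0, x0 + crop_frac, y0 + crop_frac)
    lx, ly = ground(region)  # local coords within the crop
    # Map the local prediction back to global coordinates.
    return (x0 + lx * crop_frac, y0 + ly * crop_frac)
```

Cropping effectively enlarges the target relative to the model's input, which explains why it helps UI-TARS most: the model's weakness on our data is small, precisely specified elements, not reasoning.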
- Benchmark selection matters. ScreenSpot-Pro measures instruction-following on professional apps. Our synthetic benchmark measures element detection on standard UIs. Different tasks favor different approaches.
- Simpler is often better. For well-defined targets, detection (OmniParser) beats reasoning (UI-TARS).
- Know your use case. Recording playback = OmniParser. Natural language automation = consider UI-TARS.
- Cropping remains valuable. Both methods benefit from cropping, especially UI-TARS.
Next steps:
- Evaluate on real recordings from openadapt to measure production performance
- Test on ScreenSpot-Pro to validate literature benchmarks
- Hybrid approach - use OmniParser for detection, fall back to UI-TARS for failures
- Fine-tune for small elements - the gap is largest on small targets
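The proposed hybrid could be as simple as the following sketch (both callables are assumed stand-ins for the real model wrappers): try the fast detection path first, and only pay UI-TARS's roughly 4x latency when detection fails.

```python
# Hypothetical hybrid locator. `detect_with_omniparser` returns a click
# point or None; `ground_with_uitars` is the slower fallback. Both are
# injected stand-ins, not the tools' actual interfaces.

def hybrid_locate(target, detect_with_omniparser, ground_with_uitars):
    """Return a click point, preferring the cheap detection path."""
    hit = detect_with_omniparser(target)
    if hit is not None:
        return hit
    return ground_with_uitars(target)  # fallback for detection misses
```

Given the error analysis above (only one unique UI-TARS find), the fallback would fire rarely, so the hybrid's average latency stays close to OmniParser's.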