Commit ba44eaa

abrichr and claude committed
docs: add first scored trace (Notepad Hello World, score 0.5)
6 steps, 91s, GPT-5.4-mini planner+grounder, lightweight mode. VLM judge passed milestone 2 (Hello World typed, confidence 1.00). Milestone 1 (process check) timed out during /execute_windows eval.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d196dce commit ba44eaa

8 files changed

Lines changed: 49 additions & 0 deletions

@@ -0,0 +1,49 @@
# Trace: Notepad Hello World, score 0.5 (1/2 milestones)

**Date**: 2026-03-21
**Task**: Open Notepad and type "Hello World"
**Planner**: GPT-5.4-mini (OpenAI)
**Grounder**: GPT-5.4-mini (OpenAI)
**Adapter**: WAALiveAdapter (lightweight=True)
**Steps**: 6
**Time**: 91.0s
**Score**: 0.50 (1/2 milestones)

## Milestones

| # | Milestone | Check | Result |
|---|-----------|-------|--------|
| 1 | Notepad is open | `Get-Process notepad*` via /execute_windows | FAIL (timeout) |
| 2 | Hello World typed | VLM screenshot judge | **PASS** (confidence 1.00) |
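
The 0.50 score follows from the table as passed milestones over total milestones. A minimal sketch of that arithmetic, assuming a hypothetical `Milestone` record and a three-state status (`PASS`, `FAIL`, `TIMEOUT`); neither is taken from the project's code, and the real scorer may be structured differently:

```python
from dataclasses import dataclass


@dataclass
class Milestone:
    # Hypothetical record; the field names are illustrative only.
    name: str
    status: str  # "PASS", "FAIL", or "TIMEOUT"


def score(milestones: list[Milestone]) -> float:
    """Strict score: only PASS counts, so a timed-out check scores zero."""
    if not milestones:
        return 0.0
    passed = sum(1 for m in milestones if m.status == "PASS")
    return passed / len(milestones)


def inconclusive(milestones: list[Milestone]) -> list[str]:
    """Names of milestones whose check never returned a verdict."""
    return [m.name for m in milestones if m.status == "TIMEOUT"]


# The two milestones from the table above.
trace = [
    Milestone("Notepad is open", "TIMEOUT"),
    Milestone("Hello World typed", "PASS"),
]
print(score(trace))         # 0.5
print(inconclusive(trace))  # ['Notepad is open']
```

Listing inconclusive milestones next to the strict score makes it visible when a low score comes from a check that never ran to completion.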

## Step-by-step

| Step | Action | Screenshot |
|------|--------|------------|
| 0 | Reset (clean desktop) | ![step 0](step_00_reset.png) |
| 1 | Click Start button | ![step 1](step_01.png) |
| 2 | Start menu open, click Notepad | ![step 2](step_02.png) |
| 3 | Desktop (Notepad loading) | ![step 3](step_03.png) |
| 4 | Notepad open, type Hello World | ![step 4](step_04.png) |
| 5 | Hello World typed | ![step 5](step_05.png) |
| 6 | Done | ![step 6](step_06.png) |

## What worked
- Lightweight mode: no cleanup crashes, server stayed responsive
- GPT-5.4-mini correctly identified the Start → Notepad path
- VLM screenshot evaluation: PASS (confidence=1.00), with the judge's rationale "The Notepad window shows the text 'Hello World' clearly in the text area"
- Task instruction emphasis: planner followed "open Notepad" instead of clicking Chrome

## What didn't work
- Milestone 1 (process check): PowerShell `Get-Process notepad*` via /execute_windows timed out
- WAA /evaluate endpoint: unreachable (evaluate_server.py can't connect to Windows VM)
- OneDrive popup appeared but agent worked around it
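
One way to keep a slow evaluation endpoint from being recorded as an agent failure is to retry the check and report a timeout as its own outcome. The sketch below assumes a hypothetical `run_check` callable that wraps the `/execute_windows` request and raises `TimeoutError` when the server does not answer. The real request format is not shown in this trace, so it is left abstract here.

```python
import time
from typing import Callable


def check_with_retry(
    run_check: Callable[[], bool],
    attempts: int = 3,
    backoff_s: float = 0.0,
) -> str:
    """Return "PASS", "FAIL", or "TIMEOUT" for one milestone check.

    run_check is a hypothetical wrapper around the evaluation request:
    it returns True/False when the check completes and raises
    TimeoutError when no response arrives.
    """
    for attempt in range(attempts):
        try:
            return "PASS" if run_check() else "FAIL"
        except TimeoutError:
            # No verdict yet; wait a little longer before each retry.
            if attempt < attempts - 1:
                time.sleep(backoff_s * (attempt + 1))
    return "TIMEOUT"


def always_times_out() -> bool:
    # Stand-in for an unresponsive endpoint, used only for illustration.
    raise TimeoutError("no response from /execute_windows")


print(check_with_retry(always_times_out))  # TIMEOUT
```

Returning `TIMEOUT` instead of `FAIL` keeps "the check could not run" separate from "the check ran and the condition was false".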

## Gap vs customer results (score 1.0)
The customer scored 1.0 (2/2 milestones) on the same task. Differences:
- They use WAADirect (direct HTTP, no adapter overhead)
- They skip verify_apps/close_all entirely
- They use GPT-5.4 (full, not mini) as planner and UI-Venus as grounder (not GPT as grounder)
- Their milestone evaluation runs PowerShell successfully (their /execute_windows is responsive)

Our milestone 1 failed because /execute_windows timed out during evaluation, not because Notepad wasn't open (the screenshots show it was). The gap is in the evaluation plumbing, not the agent behavior.