| 1 | +# Trace: Notepad Hello World — Score 0.5 (1/2 milestones) |
| 2 | + |
| 3 | +**Date**: 2026-03-21 |
| 4 | +**Task**: Open Notepad and type "Hello World" |
| 5 | +**Planner**: GPT-5.4-mini (OpenAI) |
| 6 | +**Grounder**: GPT-5.4-mini (OpenAI) |
| 7 | +**Adapter**: WAALiveAdapter (lightweight=True) |
| 8 | +**Steps**: 6 |
| 9 | +**Time**: 91.0s |
| 10 | +**Score**: 0.50 (1/2 milestones) |
| 11 | + |
| 12 | +## Milestones |
| 13 | + |
| 14 | +| # | Milestone | Check | Result | |
| 15 | +|---|-----------|-------|--------| |
| 16 | +| 1 | Notepad is open | `Get-Process notepad*` via /execute_windows | FAIL (timeout) | |
| 17 | +| 2 | Hello World typed | VLM screenshot judge | **PASS** (confidence 1.00) | |
| 18 | + |
| 19 | +## Step-by-step |
| 20 | + |
| 21 | +| Step | Action | Screenshot | |
| 22 | +|------|--------|------------| |
| 23 | +| 0 | Reset (clean desktop) |  | |
| 24 | +| 1 | Click Start button |  | |
| 25 | +| 2 | Start menu open, click Notepad |  | |
| 26 | +| 3 | Desktop (Notepad loading) |  | |
| 27 | +| 4 | Notepad open, type Hello World |  | |
| 28 | +| 5 | Hello World typed |  | |
| 29 | +| 6 | Done |  | |
| 30 | + |
| 31 | +## What worked |
| 32 | +- Lightweight mode: no cleanup crashes, server stayed responsive |
| 33 | +- GPT-5.4-mini correctly identified the Start → Notepad path |
| 34 | +- VLM screenshot evaluation: "PASS (confidence=1.00) — The Notepad window shows the text 'Hello World' clearly in the text area" |
| 35 | +- Task instruction emphasis: planner followed "open Notepad" instead of clicking Chrome |
| 36 | + |
| 37 | +## What didn't work |
| 38 | +- Milestone 1 (process check): PowerShell `Get-Process notepad*` via /execute_windows timed out |
| 39 | +- WAA /evaluate endpoint: unreachable (evaluate_server.py can't connect to the Windows VM) |
| 40 | +- A OneDrive popup appeared, but the agent worked around it |
| 41 | + |
| 42 | +## Gap vs customer results (score 1.0) |
| 43 | +The customer scored 1.0 (2/2 milestones) on the same task. Differences: |
| 44 | +- They use WAADirect (direct HTTP, no adapter overhead) |
| 45 | +- They skip verify_apps/close_all entirely |
| 46 | +- They use GPT-5.4 (full, not mini) + UI-Venus grounder (not GPT as grounder) |
| 47 | +- Their milestone evaluation runs PowerShell successfully (their /execute_windows is responsive) |
| 48 | + |
| 49 | +Our milestone 1 failed because /execute_windows timed out during evaluation, not because Notepad wasn't open: the step 4 and step 5 screenshots show the Notepad window. The gap is in the evaluation plumbing, not in the agent's behavior. |
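
One way to keep this distinction visible in the score report is to record a check that could not run separately from a check that ran and failed. The following is a minimal sketch under that assumption; `Outcome`, `Milestone`, and `summarize` are hypothetical names for illustration and are not part of WAA or the existing evaluation harness.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"                # the check ran and the condition was not met
    CHECK_ERROR = "check_error"  # the check itself could not run (e.g. timeout)


@dataclass
class Milestone:
    name: str
    outcome: Outcome


def summarize(milestones):
    """Return (raw_score, passed_count, names_of_inconclusive_checks)."""
    total = len(milestones)
    passed = sum(m.outcome is Outcome.PASS for m in milestones)
    inconclusive = [m.name for m in milestones if m.outcome is Outcome.CHECK_ERROR]
    raw_score = passed / total if total else 0.0
    return raw_score, passed, inconclusive


# The two milestones from this trace, with the timeout modeled as CHECK_ERROR.
run = [
    Milestone("Notepad is open", Outcome.CHECK_ERROR),
    Milestone("Hello World typed", Outcome.PASS),
]
score, passed, inconclusive = summarize(run)
print(f"score={score:.2f} ({passed}/{len(run)} milestones)")
print("inconclusive checks:", inconclusive)
```

The raw score stays at 0.50, matching the reported result, while the list of inconclusive checks flags milestone 1 as an evaluation failure to investigate rather than an agent failure.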