|
| 1 | +# Eval Analysis: Demo-Conditioned vs Zero-Shot (2026-03-02) |
| 2 | + |
| 3 | +## Task |
| 4 | + |
| 5 | +**ID**: `04d9aeaf-7bed-4024-bedb-e10e6f00eb7f-WOS` (LibreOffice Calc) |
| 6 | +**Instruction**: "In a new sheet with 4 headers 'Year', 'CA changes', 'FA changes', and 'OA changes', calculate the annual changes for the Current Assets, Fixed Assets, and Other Assets columns. Set the results as percentage type." |
| 7 | +**Complexity**: 21 steps in human recording; requires sheet creation, header entry, formula computation, drag-fill, percentage formatting. |
| 8 | + |
| 9 | +## Bugs Fixed This Session |
| 10 | + |
| 11 | +| Bug | Root Cause | Fix | Status | |
| 12 | +|-----|-----------|-----|--------| |
| 13 | +| Multi-line type → "unterminated string literal" | Hand-rolled `_escape_for_pyautogui()` missed `\n` | Replaced with `repr()` — Python's own escaping mechanism. Eliminated entire class of string-embedding bugs. | **Verified working** (0 errors in both runs) | |
| 14 | +| Drag coordinates zeroed to (0,0) | `startCoordinate`/`endCoordinate` (camelCase) vs Claude API's `start_coordinate`/`coordinate` (snake_case) | Fixed field names in `_map_action()` | **Verified** (correct coords in trace) | |
| 15 | +| Demo not persisted across steps | Demo only injected at step 1 | Re-inject demo text in every `tool_result` message | **Re-applied** | |
| 16 | +| (0,0) coordinates trigger fail-safe | No validation at coordinate boundary | `_clamp_coord()` moves (0,0) → (eps, eps) | **Added** | |
| 17 | +| Fail-safe not detected on HTTP 500 | Only checked 200 response bodies | Check ALL response bodies for fail-safe strings | **Verified** (0 fail-safe crashes) | |
| 18 | + |
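The coordinate fix in the table can be pictured with a small sketch. The real `_clamp_coord()` is not shown in this document, so the function below is a hypothetical stand-in: the name `clamp_coord`, the default `eps=1`, and the choice to clamp each axis independently (rather than only the exact point (0,0)) are all assumptions.

```python
def clamp_coord(x: int, y: int, eps: int = 1) -> tuple[int, int]:
    """Push a point at least `eps` pixels away from the top-left origin.

    Hypothetical sketch: clamps each axis separately, so (0, 0) becomes
    (eps, eps) and negative values are also pulled back on screen.
    """
    return max(x, eps), max(y, eps)
```

With these assumptions, ordinary coordinates pass through unchanged and only values below `eps` are moved.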
| 19 | +### Meta-fix: `repr()` replaces manual escaping |
| 20 | + |
| 21 | +The multi-line type bug was a symptom of a deeper architectural problem: **generating Python source code via string concatenation to send data across a boundary**. This is the same class of vulnerability as SQL injection. |
| 22 | + |
| 23 | +```python |
| 24 | +# Before (fragile: misses \n, \r, \0, and other control characters): |
| 25 | +escaped = text.replace("\\", "\\\\").replace("'", "\\'").replace("\t", "\\t") |
| 26 | + |
| 27 | +# After (Python's own literal escaping; the result includes the surrounding quotes): |
| 28 | +escaped = repr(text) |
| 29 | +``` |
| 30 | + |
| 31 | +For any `str`, `repr()` returns a valid Python literal that evaluates back to the same string, so it covers newlines, tabs, quotes, backslashes, non-ASCII text, and null bytes. The manual `_escape_for_pyautogui` function was deleted entirely. |
| 32 | + |
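A minimal sketch of why `repr()` is safe for embedding text in generated Python source. The helper names `build_type_command` and `literal_round_trips` are illustrative, and the `pyautogui.write(...)` command string is an assumption about what the generated code looks like; the actual command template is not shown in this document.

```python
import ast


def build_type_command(text: str) -> str:
    """Return Python source for a type action (hypothetical command shape).

    repr() emits a quoted literal, so no manual escaping is needed.
    """
    return f"pyautogui.write({text!r})"


def literal_round_trips(text: str) -> bool:
    """Check that repr(text) parses back to exactly the same string."""
    return ast.literal_eval(repr(text)) == text


# Multi-line text with tabs, both quote styles, a backslash, and a null byte.
sample = "=A1+B1\n=A2+B2\tit's \"quoted\" \\ \x00 \u00e9"
command = build_type_command(sample)

# The generated source must parse as a single expression.
ast.parse(command, mode="eval")
```

The round-trip check is the property that matters here: whatever string goes in, the receiving side reconstructs exactly that string, so no character can terminate the literal early.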
| 33 | +## Results |
| 34 | + |
| 35 | +### Run Configuration |
| 36 | + |
| 37 | +- **Agent**: ClaudeComputerUseAgent (claude-sonnet-4-6, computer_use beta) |
| 38 | +- **Max steps**: 30 |
| 39 | +- **Demo file**: `demo_prompts_vlm/04d9aeaf-...txt` (8,697 bytes, 21 steps, VLM-enriched) |
| 40 | +- **Demo format**: Step N → {Observation, Intent, Action, Result} |
| 41 | +- **WAA server**: Azure VM (waa-pool-00), SSH tunnel localhost:5001 |
| 42 | + |
| 43 | +### Scores |
| 44 | + |
| 45 | +| Metric | ZS (no demo) | DC (with demo) | |
| 46 | +|--------|-------------|----------------| |
| 47 | +| **Score** | 0.0 | 0.0 | |
| 48 | +| **Steps used** | 30/30 | 16/30 (quit early) | |
| 49 | +| **Time** | 20 min | 8 min | |
| 50 | +| **Formulas entered** | 10 (cols C + D) | 0 | |
| 51 | +| **Multi-line type errors** | 0 | 0 | |
| 52 | +| **Fail-safe crashes** | 0 | 0 | |
| 53 | + |
| 54 | +### ZS Trace (30 steps) |
| 55 | + |
| 56 | +``` |
| 57 | + 0-2: Navigate spreadsheet (clicks) |
| 58 | + 3: Click sheet tab area |
| 59 | + 4-6: Attempt to add new sheet (triple click on tab) |
| 60 | + 7-8: Dialog interaction (double-clicks) |
| 61 | + 9-11: Navigate/dismiss dialog (clicks + wait actions) |
| 62 | +12-13: Escape + Enter (dismiss dialog) |
| 63 | +14: Wait (5 internal retries) then click |
| 64 | +15-16: Navigate to sheet tabs |
| 65 | +17: Click cell for formulas |
| 66 | +18: TYPE 5 formulas for col C (with \n between each) ← MULTI-LINE SUCCESS |
| 67 | +19: Click next column |
| 68 | +20: TYPE 5 formulas for col D (with \n between each) ← MULTI-LINE SUCCESS |
| 69 | +21-25: Navigate/select cells (formatting attempts) |
| 70 | +26: Click Name Box |
| 71 | +27: TYPE "B2:D6\n" (cell range selection) |
| 72 | +28: Click toolbar (formatting?) |
| 73 | +29: Ctrl+S (save) |
| 74 | +``` |
| 75 | + |
| 76 | +**Observations**: The ZS agent independently worked out the formula pattern `=(Sheet1.C[n]-Sheet1.C[n-1])/Sheet1.C[n-1]` (row n minus the previous row, divided by the previous row), entered all 10 formulas for two columns in just 2 steps (made possible by the multi-line type fix), then attempted formatting. It used all 30 steps productively but completed only two of the three formula columns and did not complete the percentage formatting. |
| 77 | + |
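The formula pattern and the multi-line typing trick can be sketched as follows. This is an illustration, not the agent's actual output: the function names, the `Sheet1` default, and the row range used in the example are assumptions, since the source sheet's layout is not given here.

```python
def change_formulas(column: str, first_row: int, last_row: int,
                    sheet: str = "Sheet1") -> list[str]:
    """Build one year-over-year change formula per pair of adjacent rows."""
    formulas = []
    for row in range(first_row + 1, last_row + 1):
        current = f"{sheet}.{column}{row}"
        previous = f"{sheet}.{column}{row - 1}"
        formulas.append(f"=({current}-{previous})/{previous}")
    return formulas


def as_typed_text(formulas: list[str]) -> str:
    """Join formulas with newlines so one type action fills a column."""
    return "".join(f"{formula}\n" for formula in formulas)


# Assumed layout: data for column B sits in rows 2 through 4 of Sheet1.
column_b = change_formulas("B", 2, 4)
typed = as_typed_text(column_b)
```

Each trailing `\n` commits the cell and moves down one row, which is why a whole column fits in a single step once multi-line typing works.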
| 78 | +### DC Trace (16 steps → quit) |
| 79 | + |
| 80 | +``` |
| 81 | + 0-2: Navigate spreadsheet (same clicks as ZS) |
| 82 | + 3: Double-click (different target than ZS — demo influence?) |
| 83 | + 4-5: Click dialog elements |
| 84 | + 6-8: Navigate/dismiss |
| 85 | + 9-10: Escape + Enter |
| 86 | +11-13: Click toolbar area (Open file?) |
| 87 | +14: TYPE "SmallBalanceSheet.xlsx" |
| 88 | +15: Enter |
| 89 | +16: DONE (no_tool_use — agent declared task complete) |
| 90 | +``` |
| 91 | + |
| 92 | +**Observations**: The DC agent never created headers, never typed formulas, never reached the actual task. It appeared to open a "Save As" or "Open" dialog, type the source file name, and declare itself done. The demo's specific UI state descriptions may have conflicted with what the agent actually saw, causing confusion. |
| 93 | + |
| 94 | +## Analysis: Why the Demo Hurt |
| 95 | + |
| 96 | +### The demo format problem |
| 97 | + |
| 98 | +Our demo uses a rigid step-by-step format: |
| 99 | +``` |
| 100 | +Step 1: |
| 101 | + Observation: The spreadsheet is open to "Sheet1," which contains financial data... |
| 102 | + Intent: To create a new sheet for calculating and displaying annual changes... |
| 103 | + Action: Right-click on the "Sheet1" tab at the bottom and select "Insert Sheet"... |
| 104 | + Result: A new, blank sheet named "Sheet2" is added to the workbook... |
| 105 | +``` |
| 106 | + |
| 107 | +When the actual UI doesn't match the described observation (e.g., a dialog appeared, or the tab area looks different from what was described), the agent faces a **reconciliation conflict**: should it follow the demo's specific actions, or respond to what it actually sees? |
| 108 | + |
| 109 | +In our case, the agent chose a third option: it abandoned the task structure entirely, performed an unrelated action (opening a file), and then declared the task done. |
| 110 | + |
| 111 | +### Literature context |
| 112 | + |
| 113 | +This matches findings from multiple papers: |
| 114 | + |
| 115 | +1. **LMAct** (ICML 2025, Google DeepMind): Found that demonstrations can *actively hurt* performance. On several tasks, performance *decreased* with >2 demos. "Frontier LMs struggle to leverage large demonstration datasets for interactive decision-making." |
| 116 | + |
| 117 | +2. **DigiRL** (NeurIPS 2024): "Training with static demonstrations falls short for controlling real GUIs due to their failure to deal with real-world stochasticity and non-stationarity not captured in static observational data." SFT on demos: 17.7%. RL: 67.2%. |
| 118 | + |
| 119 | +3. **ShowUI-Aloha** (Jan 2026): Demonstrated that a single demo CAN improve performance dramatically (+26.6pp) — but using a {Observation, **Think**, Action, Expectation} format that includes reasoning, and crucially, with a PlannerMemory module that adapts the plan when the environment diverges. |
| 120 | + |
| 121 | +4. **Plan-and-Act** (ICML 2025): Dynamic replanning alone added 10.31pp; without it, static plans degrade. The full pipeline improved from 9.85% (direct prediction) to 57.58%, roughly a 5.8x improvement. |
| 122 | + |
| 123 | +5. **Instruction Agent** (Sep 2025, Microsoft): Achieved 60% success on tasks where ALL other agents scored 0% — using a single expert trajectory with step-by-step natural language instructions PLUS a backtracker module for error recovery. |
| 124 | + |
| 125 | +## Implications and Options |
| 126 | + |
| 127 | +### The design space |
| 128 | + |
| 129 | +The literature reveals a clear spectrum from rigid to flexible demo conditioning: |
| 130 | + |
| 131 | +``` |
| 132 | +RIGID ←────────────────────────────────────────────────→ FLEXIBLE |
| 133 | +
|
| 134 | +Raw action Step-by-step Semantic steps Abstract plan Goal only |
| 135 | +replay with states with intent with subgoals |
| 136 | + (OUR CURRENT) (ShowUI-Aloha) (Plan-and-Act) |
| 137 | +``` |
| 138 | + |
| 139 | +Our current format sits near the rigid end. The evidence strongly suggests moving rightward. |
| 140 | + |
| 141 | +### Option A: Abstract the demo format (semantic steps with intent) |
| 142 | + |
| 143 | +Transform demos from specific state descriptions to goal-oriented step summaries: |
| 144 | + |
| 145 | +``` |
| 146 | +# Current (too rigid — describes specific UI states): |
| 147 | +Step 11: |
| 148 | + Observation: The new sheet contains headers... with all cells below empty. |
| 149 | + Intent: To calculate the annual percentage change... |
| 150 | + Action: Click cell B2 and type "=(Sheet1.B3-Sheet1.B2)/Sheet1.B2". |
| 151 | + Result: Cell B2 now contains a formula... |
| 152 | +
|
| 153 | +# Proposed (more abstract — describes what to do, not what you see): |
| 154 | +Step 4: Enter the annual change formula for Current Assets |
| 155 | + Goal: Populate the CA changes column with formulas that compute |
| 156 | + (current_year - prev_year) / prev_year for each year pair. |
| 157 | + Approach: In each row of column B, enter a formula referencing the |
| 158 | + corresponding rows in Sheet1's column B. |
| 159 | + Example: =(Sheet1.B3-Sheet1.B2)/Sheet1.B2 |
| 160 | +``` |
| 161 | + |
| 162 | +**Tradeoffs**: |
| 163 | +- (+) Robust to UI state mismatch — doesn't assume specific screen appearance |
| 164 | +- (+) Preserves intent and approach, which is what disambiguates |
| 165 | +- (-) Loses grounding — agent must figure out WHERE to click |
| 166 | +- (-) Harder to auto-generate from recordings |
| 167 | + |
| 168 | +### Option B: Plan-then-act (hierarchical, inspired by Plan-and-Act) |
| 169 | + |
| 170 | +Extract a high-level plan from the demo, let the agent execute it: |
| 171 | + |
| 172 | +``` |
| 173 | +PLAN (derived from demonstration): |
| 174 | +1. Create a new sheet in the workbook |
| 175 | +2. Set up headers: Year, CA changes, FA changes, OA changes |
| 176 | +3. Enter years 2016-2019 in column A |
| 177 | +4. For each asset column (CA=B, FA=C, OA=D): |
| 178 | + a. Enter formula =(Sheet1.Xn-Sheet1.Xn-1)/Sheet1.Xn-1 for each year |
| 179 | + b. Fill down for all years |
| 180 | +5. Select the data range and format as percentage |
| 181 | +
|
| 182 | +Execute each step using your best judgment about the current screen state. |
| 183 | +If a step doesn't apply to what you see, skip it and move to the next. |
| 184 | +``` |
| 185 | + |
| 186 | +**Tradeoffs**: |
| 187 | +- (+) Maximum flexibility — agent adapts to any UI state |
| 188 | +- (+) Natural mismatch recovery (skip/replan) |
| 189 | +- (+) Captures the "what" without prescribing the "how" |
| 190 | +- (-) Loses the fine-grained disambiguation that is OpenAdapt's core thesis |
| 191 | +- (-) Similar to what any planning agent could derive zero-shot |
| 192 | + |
| 193 | +### Option C: Adaptive conditioning with mismatch detection |
| 194 | + |
| 195 | +Keep step-by-step demos but add explicit mismatch handling: |
| 196 | + |
| 197 | +``` |
| 198 | +DEMONSTRATION (adapt as needed — your screen may look different): |
| 199 | +
|
| 200 | +Step 1: Create a new sheet |
| 201 | + If you see a sheet tab bar → right-click and insert new sheet |
| 202 | + If you see a dialog → dismiss it first, then insert sheet |
| 203 | + If a new sheet already exists → use it |
| 204 | +
|
| 205 | +Step 2: Enter headers in row 1 |
| 206 | + Type "Year" in A1, then Tab, type "CA changes", Tab, "FA changes", Tab, "OA changes" |
| 207 | + If headers already exist → verify them and move on |
| 208 | +``` |
| 209 | + |
| 210 | +**Tradeoffs**: |
| 211 | +- (+) Preserves fine-grained demo detail (the disambiguation signal) |
| 212 | +- (+) Handles the mismatch problem explicitly |
| 213 | +- (-) Verbose — context window cost is high |
| 214 | +- (-) Hard to auto-generate the "If..." branches |
| 215 | + |
| 216 | +### Option D: Multi-level conditioning (most aligned with literature) |
| 217 | + |
| 218 | +Combine a high-level plan WITH a reference trajectory, inspired by ShowUI-Aloha + Instruction Agent: |
| 219 | + |
| 220 | +``` |
| 221 | +GOAL: Calculate annual asset changes in a new spreadsheet sheet. |
| 222 | +
|
| 223 | +PLAN: |
| 224 | +1. Create new sheet → 2. Headers → 3. Years → 4. Formulas → 5. Format as % |
| 225 | +
|
| 226 | +REFERENCE TRAJECTORY (for disambiguation — adapt actions to your actual screen): |
| 227 | +Step 1: [Think] I need to create a new sheet. I'll right-click the sheet tab. |
| 228 | + [Action] Right-click "Sheet1" tab → select "Insert Sheet" |
| 229 | + [Expect] New blank sheet appears |
| 230 | +Step 2: [Think] Now I'll set up the four headers. |
| 231 | + [Action] Type "Year" → Tab → "CA changes" → Tab → "FA changes" → Tab → "OA changes" |
| 232 | + [Expect] Row 1 has all four headers |
| 233 | +... |
| 234 | +
|
| 235 | +If your screen doesn't match what's expected, re-evaluate based on the PLAN and decide the best next action. |
| 236 | +``` |
| 237 | + |
| 238 | +**Tradeoffs**: |
| 239 | +- (+) Combines the benefits of planning AND trajectory disambiguation |
| 240 | +- (+) [Think] field provides reasoning that helps the model understand WHY each action is taken |
| 241 | +- (+) Explicit "re-evaluate" instruction for mismatch recovery |
| 242 | +- (+) Aligns with ShowUI-Aloha format that showed +26.6pp improvement |
| 243 | +- (-) Most complex format to generate |
| 244 | +- (-) Longest context window usage |
| 245 | + |
| 246 | +### Option E: RL fine-tuning (long-term, highest ceiling) |
| 247 | + |
| 248 | +DigiRL showed 17.7% → 67.2% improvement by moving from SFT on demos to online RL. WebRL showed 4.8% → 42.4%. The trajectory data becomes training signal rather than inference-time context. |
| 249 | + |
| 250 | +**Tradeoffs**: |
| 251 | +- (+) Highest performance ceiling by far |
| 252 | +- (+) No context window cost at inference time |
| 253 | +- (+) Handles stochasticity naturally |
| 254 | +- (-) Requires fine-tuning infrastructure (partly mitigated: we already have this via Modal) |
| 255 | +- (-) Task-specific training needed |
| 256 | +- (-) This is OpenAdapt-ML's domain, not just eval infrastructure |
| 257 | + |
| 258 | +## Recommendation |
| 259 | + |
| 260 | +### Immediate (next eval): Option D (multi-level conditioning) |
| 261 | + |
| 262 | +The evidence from ShowUI-Aloha, Instruction Agent, and Plan-and-Act converges on this approach. Key changes: |
| 263 | + |
| 264 | +1. **Add a PLAN section** above the step-by-step trajectory — gives the agent a fallback when specific steps don't match |
| 265 | +2. **Add [Think] fields** to each step — captures reasoning that helps the model adapt |
| 266 | +3. **Add [Expect] fields** — lets the agent detect when reality diverges from the demo |
| 267 | +4. **Add explicit "adapt if needed" framing** — grants permission to deviate from the demo |
| 268 | + |
| 269 | +This can be implemented as a transformation of our existing VLM-enriched demos (add plan extraction + think field generation via a single LLM call). |
| 270 | + |
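The rendering half of that transformation might look like the sketch below. It assumes the plan items and the per-step think/action/expect text have already been produced (for example by the LLM call mentioned above, which is not shown). `DemoStep` and `render_multilevel_demo` are hypothetical names, not existing code.

```python
from dataclasses import dataclass


@dataclass
class DemoStep:
    """One reference step in the proposed Option D format (hypothetical)."""
    think: str
    action: str
    expect: str


def render_multilevel_demo(goal: str, plan: list[str],
                           steps: list[DemoStep]) -> str:
    """Render GOAL, PLAN, and REFERENCE TRAJECTORY as one prompt string."""
    lines = [f"GOAL: {goal}", "", "PLAN:"]
    lines += [f"{i}. {item}" for i, item in enumerate(plan, start=1)]
    lines += ["", "REFERENCE TRAJECTORY (for disambiguation; "
                  "adapt actions to your actual screen):"]
    for i, step in enumerate(steps, start=1):
        lines.append(f"Step {i}: [Think] {step.think}")
        lines.append(f"  [Action] {step.action}")
        lines.append(f"  [Expect] {step.expect}")
    lines += ["", "If your screen doesn't match what's expected, re-evaluate "
                  "based on the PLAN and decide the best next action."]
    return "\n".join(lines)
```

Keeping the renderer separate from plan extraction means the prompt layout can be changed and compared across evals without regenerating the enriched demos.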
| 271 | +### Medium-term: Option A + retrieval |
| 272 | + |
| 273 | +Abstract the demo format to goal-oriented steps. Build a retrieval system that finds the most relevant demo for the current task. This is the LearnAct approach. |
| 274 | + |
| 275 | +### Long-term: Option E (RL) |
| 276 | + |
| 277 | +Use trajectory data for training, not just inference-time conditioning. This has the highest performance ceiling but requires infrastructure investment. |
| 278 | + |
| 279 | +## Key Insight for OpenAdapt |
| 280 | + |
| 281 | +OpenAdapt's core thesis is trajectory-conditioned disambiguation: using demonstrations to help agents understand WHAT to do in ambiguous situations. The literature supports this thesis (ShowUI-Aloha: +26.6pp, Instruction Agent: 0% → 60%), BUT: |
| 282 | + |
| 283 | +1. **The demo must be abstracted**, not a literal replay |
| 284 | +2. **The agent needs permission and ability to deviate** when reality doesn't match |
| 285 | +3. **Reasoning (Think/Intent) is the disambiguation signal**, not specific observations |
| 286 | +4. **A high-level plan provides fallback** when step-level details don't apply |
| 287 | + |
| 288 | +The DC agent didn't fail because demo-conditioning is wrong. It failed because our demo format is too rigid and doesn't handle observation mismatch. This is a solvable formatting problem, not a fundamental limitation. |
| 289 | + |
| 290 | +## Sources |
| 291 | + |
| 292 | +- LMAct (ICML 2025): arxiv.org/abs/2412.01441 |
| 293 | +- ShowUI-Aloha (Jan 2026): arxiv.org/abs/2601.07181 |
| 294 | +- Instruction Agent (Sep 2025): arxiv.org/abs/2509.07098 |
| 295 | +- Plan-and-Act (ICML 2025): arxiv.org/abs/2503.09572 |
| 296 | +- DigiRL (NeurIPS 2024): arxiv.org/abs/2406.11896 |
| 297 | +- WebRL (ICLR 2025): arxiv.org/abs/2411.02337 |
| 298 | +- LearnAct (Apr 2025): arxiv.org/abs/2504.13805 |
| 299 | +- RAG-GUI (EMNLP 2025): arxiv.org/abs/2509.24183 |
| 300 | +- AdaptAgent (NeurIPS 2024 WS): arxiv.org/abs/2411.13451 |
| 301 | +- RT-Trajectory (ICLR 2024): arxiv.org/abs/2311.01977 |
| 302 | +- AgentTrek (ICLR 2025): arxiv.org/abs/2412.09605 |
| 303 | +- BacktrackAgent (EMNLP 2025): arxiv.org/abs/2505.20660 |