
Commit 48aa9f4

abrichr and claude committed
fix(agent): replace manual string escaping with repr() and fix CU agent bugs
Five reliability fixes for eval runs:

1. Replace _escape_for_pyautogui() with repr() in _build_type_commands() - eliminates entire class of string-embedding bugs (newlines, tabs, quotes, unicode) using Python's own escaping mechanism
2. Fix drag coordinate field names: startCoordinate/endCoordinate (camelCase) → start_coordinate/coordinate (snake_case) per Claude computer_use API
3. Add _clamp_coord() to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe, applied to click, drag, and mouse_move actions
4. Re-inject demo text at every step in tool_result messages to prevent context drift in demo-conditioned evaluation
5. Add command logging in WAALiveAdapter.step() for debugging

Also adds docs/eval_analysis_2026_03_02.md documenting ZS vs DC eval results and literature review on demo-conditioning approaches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 840f9ef commit 48aa9f4

3 files changed

Lines changed: 361 additions & 35 deletions


docs/eval_analysis_2026_03_02.md

Lines changed: 303 additions & 0 deletions
@@ -0,0 +1,303 @@
# Eval Analysis: Demo-Conditioned vs Zero-Shot (2026-03-02)

## Task

**ID**: `04d9aeaf-7bed-4024-bedb-e10e6f00eb7f-WOS` (LibreOffice Calc)
**Instruction**: "In a new sheet with 4 headers 'Year', 'CA changes', 'FA changes', and 'OA changes', calculate the annual changes for the Current Assets, Fixed Assets, and Other Assets columns. Set the results as percentage type."
**Complexity**: 21 steps in human recording; requires sheet creation, header entry, formula computation, drag-fill, percentage formatting.

## Bugs Fixed This Session

| Bug | Root Cause | Fix | Status |
|-----|-----------|-----|--------|
| Multi-line type → "unterminated string literal" | Hand-rolled `_escape_for_pyautogui()` missed `\n` | Replaced with `repr()`, Python's own escaping mechanism. Eliminated the entire class of string-embedding bugs. | **Verified working** (0 errors in both runs) |
| Drag coordinates zeroed to (0,0) | `startCoordinate`/`endCoordinate` (camelCase) vs Claude API's `start_coordinate`/`coordinate` (snake_case) | Fixed field names in `_map_action()` | **Verified** (correct coords in trace) |
| Demo not persisted across steps | Demo only injected at step 1 | Re-inject demo text in every `tool_result` message | **Re-applied** |
| (0,0) coordinates trigger fail-safe | No validation at coordinate boundary | `_clamp_coord()` moves (0,0) → (eps, eps) | **Added** |
| Fail-safe not detected on HTTP 500 | Only checked 200 response bodies | Check ALL response bodies for fail-safe strings | **Verified** (0 fail-safe crashes) |
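The two coordinate fixes in the table can be sketched as follows. This is a minimal illustration, not the code from `_map_action()` or `_clamp_coord()`, which this page does not show: the names `clamp_coord`, `drag_endpoints`, and the value `EPS = 1` are assumptions, and only the field names `start_coordinate` and `coordinate` come from the table.

```python
from typing import Any

# Assumed margin in pixels; the source only says (0,0) becomes (eps, eps).
EPS = 1


def clamp_coord(x: int, y: int, eps: int = EPS) -> tuple[int, int]:
    """Move the top-left corner point away from (0, 0); leave other points alone."""
    if x <= 0 and y <= 0:
        return eps, eps
    return x, y


def drag_endpoints(tool_input: dict[str, Any]) -> tuple[tuple[int, int], tuple[int, int]]:
    """Read drag endpoints using the snake_case field names named in the table.

    Indexing (rather than .get() with a default) raises KeyError on a wrong
    field name instead of silently producing (0, 0).
    """
    sx, sy = tool_input["start_coordinate"]
    ex, ey = tool_input["coordinate"]
    return clamp_coord(sx, sy), clamp_coord(ex, ey)
```

Failing loudly on a missing key is one way to keep a field-name mismatch from turning into a zeroed coordinate, which is how the original bug went unnoticed.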

### Meta-fix: `repr()` replaces manual escaping

The multi-line type bug was a symptom of a deeper architectural problem: **generating Python source code via string concatenation to send data across a boundary**. This is the same class of vulnerability as SQL injection.

```python
# Before (fragile: misses \n, \0, unicode, etc.):
text.replace("\\", "\\\\").replace("'", "\\'").replace("\t", "\\t")

# After (provably correct: Python's own escaping):
repr(text)
```

`repr()` handles ALL characters: newlines, tabs, quotes, backslashes, unicode, null bytes. The manual `_escape_for_pyautogui` function was deleted entirely.
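The difference can be checked directly. The sketch below is an illustration only: `manual_escape` mirrors the "Before" expression above, while `is_valid_literal` and the sample strings are assumptions introduced here. It wraps each sample in a string literal both ways and asks Python's parser whether the result is valid source.

```python
import ast


def manual_escape(text: str) -> str:
    """The pre-fix approach: escapes backslash, single quote, and tab only."""
    return text.replace("\\", "\\\\").replace("'", "\\'").replace("\t", "\\t")


def is_valid_literal(source: str) -> bool:
    """Return True if `source` parses as a Python literal."""
    try:
        ast.literal_eval(source)
        return True
    except (SyntaxError, ValueError):
        return False


samples = ["plain", "it's", "tab\there", "back\\slash", "line1\nline2", "nul\0byte"]

for sample in samples:
    manual_source = "'" + manual_escape(sample) + "'"
    # repr() output always parses back to the original string.
    assert ast.literal_eval(repr(sample)) == sample
    print(f"{sample!r}: manual escaping valid = {is_valid_literal(manual_source)}")
```

The samples containing a raw newline or a null byte fail under manual escaping, while every sample round-trips through `repr()`.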

## Results

### Run Configuration

- **Agent**: ClaudeComputerUseAgent (claude-sonnet-4-6, computer_use beta)
- **Max steps**: 30
- **Demo file**: `demo_prompts_vlm/04d9aeaf-...txt` (8,697 bytes, 21 steps, VLM-enriched)
- **Demo format**: Step N → {Observation, Intent, Action, Result}
- **WAA server**: Azure VM (waa-pool-00), SSH tunnel localhost:5001

### Scores

| Metric | ZS (no demo) | DC (with demo) |
|--------|-------------|----------------|
| **Score** | 0.0 | 0.0 |
| **Steps used** | 30/30 | 16/30 (quit early) |
| **Time** | 20 min | 8 min |
| **Formulas entered** | 10 (cols C + D) | 0 |
| **Multi-line type errors** | 0 | 0 |
| **Fail-safe crashes** | 0 | 0 |

### ZS Trace (30 steps)

```
0-2:   Navigate spreadsheet (clicks)
3:     Click sheet tab area
4-6:   Attempt to add new sheet (triple click on tab)
7-8:   Dialog interaction (double-clicks)
9-11:  Navigate/dismiss dialog (clicks + wait actions)
12-13: Escape + Enter (dismiss dialog)
14:    Wait (5 internal retries) then click
15-16: Navigate to sheet tabs
17:    Click cell for formulas
18:    TYPE 5 formulas for col C (with \n between each)  ← MULTI-LINE SUCCESS
19:    Click next column
20:    TYPE 5 formulas for col D (with \n between each)  ← MULTI-LINE SUCCESS
21-25: Navigate/select cells (formatting attempts)
26:    Click Name Box
27:    TYPE "B2:D6\n" (cell range selection)
28:    Click toolbar (formatting?)
29:    Ctrl+S (save)
```

**Observations**: The ZS agent independently worked out the formula pattern `=(Sheet1.Cn - Sheet1.C(n-1)) / Sheet1.C(n-1)`, where `n` is the row number, and entered all formulas for two columns in just 2 steps, which the multi-line type fix made possible. It then attempted formatting. It used all 30 steps productively but completed only two of the three formula columns and did not apply percentage formatting.

### DC Trace (16 steps → quit)

```
0-2:   Navigate spreadsheet (same clicks as ZS)
3:     Double-click (different target than ZS; demo influence?)
4-5:   Click dialog elements
6-8:   Navigate/dismiss
9-10:  Escape + Enter
11-13: Click toolbar area (Open file?)
14:    TYPE "SmallBalanceSheet.xlsx"
15:    Enter
16:    DONE (no_tool_use: agent declared task complete)
```

**Observations**: The DC agent never created headers, never typed formulas, never reached the actual task. It appeared to open a "Save As" or "Open" dialog, type the source file name, and declare itself done. The demo's specific UI state descriptions may have conflicted with what the agent actually saw, causing confusion.

## Analysis: Why the Demo Hurt

### The demo format problem

Our demo uses a rigid step-by-step format:
```
Step 1:
  Observation: The spreadsheet is open to "Sheet1," which contains financial data...
  Intent: To create a new sheet for calculating and displaying annual changes...
  Action: Right-click on the "Sheet1" tab at the bottom and select "Insert Sheet"...
  Result: A new, blank sheet named "Sheet2" is added to the workbook...
```

When the actual UI doesn't match the described observation (e.g., a dialog appeared, or the tab area looks different from what was described), the agent faces a **reconciliation conflict**: should it follow the demo's specific actions, or respond to what it actually sees?

In our case, the agent chose a third option: it abandoned the task structure entirely and performed an unrelated action (opening a file), then declared done.

### Literature context

This matches findings from multiple papers:

1. **LMAct** (ICML 2025, Google DeepMind): Found that demonstrations can *actively hurt* performance. On several tasks, performance *decreased* with >2 demos. "Frontier LMs struggle to leverage large demonstration datasets for interactive decision-making."

2. **DigiRL** (NeurIPS 2024): "Training with static demonstrations falls short for controlling real GUIs due to their failure to deal with real-world stochasticity and non-stationarity not captured in static observational data." SFT on demos: 17.7%. RL: 67.2%.

3. **ShowUI-Aloha** (Jan 2026): Demonstrated that a single demo CAN improve performance dramatically (+26.6pp), but with a {Observation, **Think**, Action, Expectation} format that includes reasoning and, crucially, a PlannerMemory module that adapts the plan when the environment diverges.

4. **Plan-and-Act** (ICML 2025): Dynamic replanning alone added 10.31pp. Without it, static plans degrade. Full pipeline improved from 9.85% (direct prediction) to 57.58%, a roughly 6x improvement.

5. **Instruction Agent** (Sep 2025, Microsoft): Achieved 60% success on tasks where ALL other agents scored 0%, using a single expert trajectory with step-by-step natural language instructions PLUS a backtracker module for error recovery.

## Implications and Options

### The design space

The literature reveals a clear spectrum from rigid to flexible demo conditioning:

```
RIGID ←────────────────────────────────────────────────────────────────→ FLEXIBLE

Raw action    Step-by-step     Semantic steps    Abstract plan     Goal only
replay        with states      with intent       with subgoals
              (OUR CURRENT)    (ShowUI-Aloha)    (Plan-and-Act)
```

Our current format sits near the rigid end. The evidence strongly suggests moving rightward.

### Option A: Abstract the demo format (semantic steps with intent)

Transform demos from specific state descriptions to goal-oriented step summaries:

```
# Current (too rigid: describes specific UI states):
Step 11:
  Observation: The new sheet contains headers... with all cells below empty.
  Intent: To calculate the annual percentage change...
  Action: Click cell B2 and type "=(Sheet1.B3-Sheet1.B2)/Sheet1.B2".
  Result: Cell B2 now contains a formula...

# Proposed (more abstract: describes what to do, not what you see):
Step 4: Enter the annual change formula for Current Assets
  Goal: Populate the CA changes column with formulas that compute
        (current_year - prev_year) / prev_year for each year pair.
  Approach: In each row of column B, enter a formula referencing the
        corresponding rows in Sheet1's column B.
  Example: =(Sheet1.B3-Sheet1.B2)/Sheet1.B2
```

**Tradeoffs**:
- (+) Robust to UI state mismatch: doesn't assume specific screen appearance
- (+) Preserves intent and approach, which is what disambiguates
- (-) Loses grounding: agent must figure out WHERE to click
- (-) Harder to auto-generate from recordings

### Option B: Plan-then-act (hierarchical, inspired by Plan-and-Act)

Extract a high-level plan from the demo, let the agent execute it:

```
PLAN (derived from demonstration):
1. Create a new sheet in the workbook
2. Set up headers: Year, CA changes, FA changes, OA changes
3. Enter years 2016-2019 in column A
4. For each asset column (CA=B, FA=C, OA=D):
   a. Enter formula =(Sheet1.Xn - Sheet1.X(n-1)) / Sheet1.X(n-1) for each year
      (X = column letter, n = row number)
   b. Fill down for all years
5. Select the data range and format as percentage

Execute each step using your best judgment about the current screen state.
If a step doesn't apply to what you see, skip it and move to the next.
```

**Tradeoffs**:
- (+) Maximum flexibility: agent adapts to any UI state
- (+) Natural mismatch recovery (skip/replan)
- (+) Captures the "what" without prescribing the "how"
- (-) Loses the fine-grained disambiguation that is OpenAdapt's core thesis
- (-) Similar to what any planning agent could derive zero-shot

### Option C: Adaptive conditioning with mismatch detection

Keep step-by-step demos but add explicit mismatch handling:

```
DEMONSTRATION (adapt as needed; your screen may look different):

Step 1: Create a new sheet
  If you see a sheet tab bar → right-click and insert new sheet
  If you see a dialog → dismiss it first, then insert sheet
  If a new sheet already exists → use it

Step 2: Enter headers in row 1
  Type "Year" in A1, then Tab, type "CA changes", Tab, "FA changes", Tab, "OA changes"
  If headers already exist → verify them and move on
```

**Tradeoffs**:
- (+) Preserves fine-grained demo detail (the disambiguation signal)
- (+) Handles the mismatch problem explicitly
- (-) Verbose: context window cost is high
- (-) Hard to auto-generate the "If..." branches

### Option D: Multi-level conditioning (most aligned with literature)

Combine a high-level plan WITH a reference trajectory, inspired by ShowUI-Aloha + Instruction Agent:

```
GOAL: Calculate annual asset changes in a new spreadsheet sheet.

PLAN:
1. Create new sheet → 2. Headers → 3. Years → 4. Formulas → 5. Format as %

REFERENCE TRAJECTORY (for disambiguation; adapt actions to your actual screen):
Step 1: [Think] I need to create a new sheet. I'll right-click the sheet tab.
        [Action] Right-click "Sheet1" tab → select "Insert Sheet"
        [Expect] New blank sheet appears
Step 2: [Think] Now I'll set up the four headers.
        [Action] Type "Year" → Tab → "CA changes" → Tab → "FA changes" → Tab → "OA changes"
        [Expect] Row 1 has all four headers
...

If your screen doesn't match what's expected, re-evaluate based on the PLAN and decide the best next action.
```

**Tradeoffs**:
- (+) Combines the benefits of planning AND trajectory disambiguation
- (+) [Think] field provides reasoning that helps the model understand WHY each action is taken
- (+) Explicit "re-evaluate" instruction for mismatch recovery
- (+) Aligns with ShowUI-Aloha format that showed +26.6pp improvement
- (-) Most complex format to generate
- (-) Longest context window usage

### Option E: RL fine-tuning (long-term, highest ceiling)

DigiRL improved from 17.7% to 67.2% by moving from SFT on demos to online RL. WebRL improved from 4.8% to 42.4%. The trajectory data becomes training signal rather than inference-time context.

**Tradeoffs**:
- (+) Highest performance ceiling by far
- (+) No context window cost at inference time
- (+) Handles stochasticity naturally
- (-) Requires fine-tuning infrastructure (already available via Modal)
- (-) Task-specific training needed
- (-) This is OpenAdapt-ML's domain, not just eval infrastructure

## Recommendation

### Immediate (next eval): Option D (multi-level conditioning)

The evidence from ShowUI-Aloha, Instruction Agent, and Plan-and-Act converges on this approach. Key changes:

1. **Add a PLAN section** above the step-by-step trajectory: gives the agent a fallback when specific steps don't match
2. **Add [Think] fields** to each step: captures reasoning that helps the model adapt
3. **Add [Expect] fields**: lets the agent detect when reality diverges from the demo
4. **Add explicit "adapt if needed" framing**: grants permission to deviate from the demo

This can be implemented as a transformation of our existing VLM-enriched demos (add plan extraction + think field generation via a single LLM call).
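The rendering half of that transformation might look like the sketch below. It assumes the plan items and the per-step think/action/expect fields have already been produced (for example by the LLM call mentioned above, which is not shown); `DemoStep` and `build_multilevel_demo` are hypothetical names, and the output layout follows the Option D example.

```python
from dataclasses import dataclass


@dataclass
class DemoStep:
    """One reference-trajectory step (hypothetical structure)."""
    think: str
    action: str
    expect: str


def build_multilevel_demo(goal: str, plan: list[str], steps: list[DemoStep]) -> str:
    """Render GOAL, PLAN, and REFERENCE TRAJECTORY sections as one prompt string."""
    lines = [f"GOAL: {goal}", "", "PLAN:"]
    lines += [f"{i}. {item}" for i, item in enumerate(plan, start=1)]
    lines += [
        "",
        "REFERENCE TRAJECTORY (for disambiguation; adapt actions to your actual screen):",
    ]
    for i, step in enumerate(steps, start=1):
        lines.append(f"Step {i}: [Think] {step.think}")
        lines.append(f"        [Action] {step.action}")
        lines.append(f"        [Expect] {step.expect}")
    lines += [
        "",
        "If your screen doesn't match what's expected, re-evaluate based on "
        "the PLAN and decide the best next action.",
    ]
    return "\n".join(lines)


prompt = build_multilevel_demo(
    goal="Calculate annual asset changes in a new spreadsheet sheet.",
    plan=["Create new sheet", "Headers", "Years", "Formulas", "Format as %"],
    steps=[
        DemoStep(
            think="I need to create a new sheet. I'll right-click the sheet tab.",
            action='Right-click "Sheet1" tab and select "Insert Sheet"',
            expect="New blank sheet appears",
        ),
    ],
)
print(prompt)
```

Keeping the steps as structured data rather than free text would also make it easier to try the other formats (Options A to C) from the same source demo.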

### Medium-term: Option A + retrieval

Abstract the demo format to goal-oriented steps. Build a retrieval system that finds the most relevant demo for the current task. This is the LearnAct approach.
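As a rough illustration of the retrieval step, the sketch below picks the stored demo whose instruction shares the most words with the current task. Jaccard token overlap is a stand-in chosen for brevity; it is not taken from LearnAct or from this codebase, and `retrieve_demo` and the demo keys are hypothetical.

```python
import re
from typing import Optional


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_demo(task: str, demos: dict[str, str]) -> Optional[str]:
    """Return the key of the demo instruction most similar to `task`, or None."""
    if not demos:
        return None
    task_tokens = tokens(task)
    return max(demos, key=lambda key: jaccard(task_tokens, tokens(demos[key])))


demos = {
    "calc_annual_changes": "Calculate annual changes in a new sheet and format as percentage",
    "writer_title_font": "Change the font of the document title",
}
best = retrieve_demo(
    "In a new sheet calculate the annual changes for each asset column as percentage",
    demos,
)
print(best)
```

A production system would more likely use embedding similarity, but the interface (task in, best demo key out) would stay the same.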

### Long-term: Option E (RL)

Use trajectory data for training, not just inference-time conditioning. This has the highest performance ceiling but requires infrastructure investment.

## Key Insight for OpenAdapt

OpenAdapt's core thesis is trajectory-conditioned disambiguation: using demonstrations to help agents understand WHAT to do in ambiguous situations. The evidence says this thesis is correct (ShowUI-Aloha: +26.6pp, Instruction Agent: 0% → 60%), BUT:

1. **The demo must be abstracted**, not a literal replay
2. **The agent needs permission and ability to deviate** when reality doesn't match
3. **Reasoning (Think/Intent) is the disambiguation signal**, not specific observations
4. **A high-level plan provides fallback** when step-level details don't apply

The DC agent didn't fail because demo-conditioning is wrong. It failed because our demo format is too rigid and doesn't handle observation mismatch. This is a solvable formatting problem, not a fundamental limitation.

## Sources

- LMAct (ICML 2025): arxiv.org/abs/2412.01441
- ShowUI-Aloha (Jan 2026): arxiv.org/abs/2601.07181
- Instruction Agent (Sep 2025): arxiv.org/abs/2509.07098
- Plan-and-Act (ICML 2025): arxiv.org/abs/2503.09572
- DigiRL (NeurIPS 2024): arxiv.org/abs/2406.11896
- WebRL (ICLR 2025): arxiv.org/abs/2411.02337
- LearnAct (Apr 2025): arxiv.org/abs/2504.13805
- RAG-GUI (EMNLP 2025): arxiv.org/abs/2509.24183
- AdaptAgent (NeurIPS 2024 WS): arxiv.org/abs/2411.13451
- RT-Trajectory (ICLR 2024): arxiv.org/abs/2311.01977
- AgentTrek (ICLR 2025): arxiv.org/abs/2412.09605
- BacktrackAgent (EMNLP 2025): arxiv.org/abs/2505.20660

openadapt_evals/adapters/waa/live.py

Lines changed: 15 additions & 24 deletions
@@ -215,43 +215,33 @@ def _is_failsafe_error(text: str) -> bool:
     return "failsafeexception" in lower or "fail-safe triggered" in lower


-def _escape_for_pyautogui(text: str) -> str:
-    """Escape text for embedding in a single-quoted Python string literal."""
-    return (
-        text
-        .replace("\\", "\\\\")
-        .replace("'", "\\'")
-        .replace("\t", "\\t")
-        .replace("\r", "")
-    )
-
-
 def _build_type_commands(text: str) -> str:
     """Build pyautogui command body to type text, handling embedded newlines.

-    ``pyautogui.write()`` cannot handle literal newline characters — the
-    generated Python command string becomes an unterminated string literal
-    when executed via ``exec()``. This function splits the text on newlines
-    and interleaves ``pyautogui.write()`` with ``pyautogui.press('enter')``.
+    Uses ``repr()`` for string escaping instead of manual character-by-character
+    replacement. This eliminates the entire class of escaping bugs (newlines,
+    tabs, quotes, unicode, null bytes, etc.) because ``repr()`` is Python's own
+    mechanism for producing valid string literals from any string content —
+    the same principle as parameterized SQL queries vs string concatenation.
+
+    Newlines are handled semantically: split into separate ``write()`` calls
+    with ``press('enter')`` between them, since the agent intends "press Enter."

     Returns:
         A pyautogui command body string (without ``import pyautogui;`` prefix).
         Callers must prepend the import themselves.
     """
+    text = text.replace("\r", "")
     segments = text.split("\n")
     if len(segments) == 1:
-        escaped = _escape_for_pyautogui(text)
-        return f"pyautogui.write('{escaped}', interval=0.02)"
+        return f"pyautogui.write({repr(text)}, interval=0.02)"

     commands: list[str] = []
     for i, seg in enumerate(segments):
-        # Skip empty trailing segment from a trailing newline
-        if seg or i < len(segments) - 1:
-            escaped = _escape_for_pyautogui(seg)
-            if escaped:
-                commands.append(f"pyautogui.write('{escaped}', interval=0.02)")
-            if i < len(segments) - 1:
-                commands.append("pyautogui.press('enter')")
+        if seg:
+            commands.append(f"pyautogui.write({repr(seg)}, interval=0.02)")
+        if i < len(segments) - 1:
+            commands.append("pyautogui.press('enter')")
     return "; ".join(commands) if commands else "pass"


@@ -575,6 +565,7 @@ def step(

         # Execute command via /execute_windows (has access to computer object)
         if command:
+            logger.info("Sending command to WAA: %r", command)
             try:
                 resp = requests.post(
                     f"{self.config.server_url}/execute_windows",
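To see what the post-commit `_build_type_commands()` produces, the sketch below copies its new body from the diff into a standalone function (renamed `build_type_commands` here) and runs it on a few inputs. The sample inputs are illustrative assumptions. `compile()` is used only to confirm each generated command parses; nothing is sent to pyautogui.

```python
def build_type_commands(text: str) -> str:
    """Post-commit logic of _build_type_commands, copied from the diff above."""
    text = text.replace("\r", "")
    segments = text.split("\n")
    if len(segments) == 1:
        return f"pyautogui.write({repr(text)}, interval=0.02)"

    commands: list[str] = []
    for i, seg in enumerate(segments):
        if seg:
            commands.append(f"pyautogui.write({repr(seg)}, interval=0.02)")
        if i < len(segments) - 1:
            commands.append("pyautogui.press('enter')")
    return "; ".join(commands) if commands else "pass"


examples = ["=A1\n=A2\n", "it's", "tab\there", "\n"]
for example in examples:
    command = build_type_commands(example)
    # compile() only parses the source; it does not import or call pyautogui.
    compile(command, "<command>", "exec")
    print(command)
```

For the first input the result is two `write()` calls, each followed by `press('enter')`, which matches the multi-line formula entry seen at steps 18 and 20 of the ZS trace.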
