Commit ffcb41d
fix(agent): replace manual string escaping with repr() and fix CU agent bugs (#83)
* fix(agent): replace manual string escaping with repr() and fix CU agent bugs
Five reliability fixes for eval runs:
1. Replace _escape_for_pyautogui() with repr() in _build_type_commands() -
eliminates an entire class of string-embedding bugs (newlines, tabs, quotes,
unicode) by using Python's own escaping mechanism
2. Fix drag coordinate field names: startCoordinate/endCoordinate (camelCase)
→ start_coordinate/coordinate (snake_case) per Claude computer_use API
3. Add _clamp_coord() to prevent (0,0) coordinates from triggering PyAutoGUI
fail-safe, applied to click, drag, and mouse_move actions
4. Re-inject demo text at every step in tool_result messages to prevent
context drift in demo-conditioned evaluation
5. Add command logging in WAALiveAdapter.step() for debugging
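
A minimal sketch of the idea behind fix 1, not the code from this commit: the function name `build_type_command` and the single-call `pyautogui.write(...)` output format are assumptions made for illustration. It shows that `repr()` yields a string literal which round-trips through the Python parser, so no hand-written escaping is needed.

```python
import ast


def build_type_command(text: str) -> str:
    """Hypothetical stand-in for _build_type_commands().

    Returns one line of Python source that types `text`. repr() produces
    a valid string literal, so newlines, tabs, quotes and backslashes are
    escaped by Python itself.
    """
    return f"pyautogui.write({text!r})"


tricky = "line1\nline2\t\"quoted\" 'single' \\ é"
command = build_type_command(tricky)

# Parse the generated source and recover the embedded literal.
call = ast.parse(command, mode="eval").body
recovered = ast.literal_eval(call.args[0])
assert recovered == tricky
```

The round-trip check (generate, parse, compare) is the property that matters here: whatever text goes in, the generated command must still be valid Python that carries the same text.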
Also adds docs/eval_analysis_2026_03_02.md documenting zero-shot (ZS) vs
demo-conditioned (DC) eval results and a literature review on
demo-conditioning approaches.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
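
Fixes 2 and 3 can be sketched together as follows. This is an illustration under stated assumptions, not the commit's implementation: `clamp_coord` and `parse_drag` are hypothetical names, the tool input is assumed to carry pixel coordinates, and the symmetric upper bound of 0.995 is an assumption (the commit message only confirms that (0, 0) clamps to (0.005, 0.005)).

```python
def clamp_coord(value: float, margin: float = 0.005) -> float:
    """Keep a normalized coordinate away from the screen edges.

    The lower bound matches the (0.005, 0.005) value named in the test
    fix; the upper bound of 1 - margin is an assumption.
    """
    return min(max(value, margin), 1.0 - margin)


def parse_drag(tool_input: dict, width: int, height: int):
    """Read a drag action using the snake_case field names.

    `start_coordinate` is where the drag begins and `coordinate` is
    where it ends. Pixel inputs and the width/height normalization are
    assumptions for this sketch.
    """
    sx, sy = tool_input["start_coordinate"]
    ex, ey = tool_input["coordinate"]
    start = (clamp_coord(sx / width), clamp_coord(sy / height))
    end = (clamp_coord(ex / width), clamp_coord(ey / height))
    return start, end


start, end = parse_drag(
    {"start_coordinate": [0, 0], "coordinate": [640, 360]}, 1280, 720
)
assert start == (0.005, 0.005)  # the (0, 0) corner is nudged inward
assert end == (0.5, 0.5)        # interior points are unchanged
```

Reading the old camelCase keys from a payload that only carries snake_case keys would raise `KeyError` (or silently fall back to defaults if `.get()` is used), which is why the field names matter as much as the clamping.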
* feat: add multi-level demo format transform and fix tests
- Add scripts/transform_demo_format.py: transforms rigid {Observation,
Intent, Action, Result} demos into adaptive {Think, Action, Expect}
format with PLAN section (Option D from eval analysis)
- LLM-assisted mode (default): uses vlm_call() for semantic transform
- Rule-based mode (--no-llm): free, no API calls needed
- Supports --dry-run for preview
- Fix tests for repr() escaping and coordinate clamping:
- Remove TestEscapeForPyautogui (tests deleted function)
- Update TestBuildTypeCommands for repr() output format
- Add test_all_special_chars_produce_valid_python invariant test
- Fix drag test to use snake_case field names
- Fix coordinate edge test to expect clamped (0.005, 0.005)
- Regenerate uv.lock for consilium package name resolution
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
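
To make the rule-based (--no-llm) transform concrete, here is a rough sketch of one possible mapping. It is not taken from scripts/transform_demo_format.py: the `Field: value` line layout, the function name `transform_step`, and the choice to merge Observation and Intent into Think and to rename Result to Expect are all assumptions, and the PLAN section is omitted.

```python
import re

# Assumed layout of a rigid demo step: one "Field: value" pair per line.
FIELD_RE = re.compile(r"^(Observation|Intent|Action|Result):\s*(.*)$")


def transform_step(step_text: str) -> str:
    """Hypothetical rule-based mapping of one rigid step to the new format."""
    fields = {}
    for line in step_text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    # Assumption: Think combines what was seen with what was intended.
    think = " ".join(
        part for part in (fields.get("Observation"), fields.get("Intent")) if part
    )
    return "\n".join(
        [
            f"Think: {think}",
            f"Action: {fields.get('Action', '')}",
            f"Expect: {fields.get('Result', '')}",
        ]
    )


rigid = (
    "Observation: A blank sheet is open.\n"
    "Intent: Add a header.\n"
    "Action: type 'Year' in A1\n"
    "Result: A1 shows Year."
)
adaptive = transform_step(rigid)
assert adaptive.splitlines()[0] == "Think: A blank sheet is open. Add a header."
```

A purely mechanical mapping like this costs nothing to run but cannot rephrase a past-tense result into a forward-looking expectation, which is presumably where the LLM-assisted default mode earns its cost.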
* docs: add DC-multilevel eval results to analysis
DC-multilevel (new {Think, Action, Expect} + PLAN format) showed clear
improvement over DC-rigid: agent followed the plan, entered all headers
and years, typed correct formula, used drag-fill. Still scored 0.0 due
to premature task completion (finished 1/3 columns), but qualitatively
the best behavior across all three conditions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 4896b65 commit ffcb41d
7 files changed
Lines changed: 2205 additions & 789 deletions
File tree
- docs
- openadapt_evals
- adapters/waa
- agents
- scripts
- tests