| 1 | +# QA: Observed Behavior Assertions |
| 2 | + |
| 3 | +## Problem |
| 4 | + |
| 5 | +AppClaw completes a task and declares success, but it leaves no structured record of _what it observed_: prices, confirmation messages, order numbers, screen states. Without that record, the next run has no baseline to tell whether the outcome was the same. |
| 6 | + |
| 7 | +## Concept |
| 8 | + |
| 9 | +After a successful run, an LLM call reads the agent's step history and extracts observable facts as assertions: |
| 10 | + |
| 11 | +``` |
| 12 | +Run: "complete checkout for 1 large oat milk latte" |
| 13 | +Observed assertions: |
| 14 | + ✓ Order confirmation screen appeared |
| 15 | + ✓ Item: "Oat Milk Latte, Large" shown |
| 16 | + ✓ Price shown: $6.95 |
| 17 | + ✓ Payment method: Apple Pay |
| 18 | + ✓ Estimated ready time shown |
| 19 | + ✓ Completed in 4 steps |
| 20 | +``` |
| 21 | + |
| 22 | +On subsequent runs these become **soft assertions**: the agent flags any that no longer hold, without failing the run. |
| 23 | + |
| 24 | +## Assertion Types |
| 25 | + |
| 26 | +| Type | Example | How detected | |
| 27 | +| --------------- | ------------------------------------ | ---------------------------------- | |
| 28 | +| Screen appeared | "Order confirmation screen appeared" | Screen fingerprint match | |
| 29 | +| Text present | "Price shown: $6.95" | LLM extraction from DOM/screenshot | |
| 30 | +| Step count | "Completed in 4 steps" | `stepsInRun` from trajectory | |
| 31 | +| Element state | "Apple Pay button was selected" | LLM extraction | |
| 32 | + |
| 33 | +## Proposed Design |
| 34 | + |
| 35 | +### Extraction (async, post-run) |
| 36 | + |
| 37 | +```typescript |
| 38 | +// After successful finalize() |
| 39 | +const assertions = await extractAssertions(stepHistory, goal, llmClient); |
| 40 | +saveAssertions(appId, goalHash, assertions); |
| 41 | +``` |
| 42 | + |
| 43 | +Prompt to LLM: |
| 44 | + |
| 45 | +``` |
| 46 | +Given this agent run transcript, extract 3-6 observable facts about the outcome |
| 47 | +as short assertion strings. Focus on: screens that appeared, values shown, |
| 48 | +actions completed. Be specific. Format: one assertion per line. |
| 49 | +``` |
| 50 | + |
| 51 | +### Storage |
| 52 | + |
| 53 | +`~/.appclaw/assertions/<appId>/<goalHash>.json` |
| 54 | + |
| 55 | +```json |
| 56 | +{ |
| 57 | + "goal": "complete checkout", |
| 58 | + "appId": "com.starbucks", |
| 59 | + "extractedAt": 1712345678, |
| 60 | + "assertions": ["Order confirmation screen appeared", "Price shown: $6.95", "Completed in 4 steps"] |
| 61 | +} |
| 62 | +``` |
| 63 | + |
| 64 | +### Soft assertion check on next run |
| 65 | + |
| 66 | +At run end, retrieve stored assertions and ask the LLM: |
| 67 | + |
| 68 | +``` |
| 69 | +Previous run observed: ["Order confirmation screen appeared", "Price shown: $6.95"] |
| 70 | +Based on the current run transcript, which of these still hold? Which do not? |
| 71 | +``` |
| 72 | + |
| 73 | +Emit the result in the terminal output and the HTML report. |
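A minimal sketch of the check step, assuming the LLM reply has already been reduced to a per-assertion verdict. The `SoftCheckResult` shape, `buildSoftCheckPrompt`, and `formatSoftCheck` are hypothetical names, and how the free-form reply is mapped to verdicts is not specified by this proposal.

```typescript
export type Verdict = "holds" | "changed";

export interface SoftCheckResult {
  assertion: string;
  verdict: Verdict;
}

// Builds the check prompt from the stored assertions and the current run transcript.
export function buildSoftCheckPrompt(previous: string[], transcript: string): string {
  return [
    `Previous run observed: ${JSON.stringify(previous)}`,
    "Based on the current run transcript, which of these still hold? Which do not?",
    "",
    transcript,
  ].join("\n");
}

// Formats results for the terminal. Soft assertions only warn; nothing is thrown here.
export function formatSoftCheck(results: SoftCheckResult[]): string {
  const lines = results.map((r) => `${r.verdict === "holds" ? "✓" : "⚠"} ${r.assertion}`);
  const changed = results.filter((r) => r.verdict === "changed").length;
  lines.push(`${results.length - changed}/${results.length} assertions still hold`);
  return lines.join("\n");
}
```

The same `SoftCheckResult[]` array could feed the HTML report writer, so both outputs stay in sync.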
| 74 | + |
| 75 | +### Hard assertions in YAML flows |
| 76 | + |
| 77 | +QA engineers can also write explicit assertions in flow files: |
| 78 | + |
| 79 | +```yaml |
| 80 | +steps: |
| 81 | + - tap checkout |
| 82 | + - ... |
| 83 | +assertions: |
| 84 | + - order confirmation screen is visible |
| 85 | + - price displayed is under $10 |
| 86 | + - no error messages present |
| 87 | +``` |
| 88 | + |
| 89 | +These run after all steps complete and fail the flow if any assertion fails. |
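The post-step behavior could look like the sketch below. It assumes the YAML has already been parsed into a plain object, and that each assertion is evaluated by some injected check function (LLM-backed or otherwise). `ParsedFlow`, `AssertionCheck`, and `runHardAssertions` are hypothetical names.

```typescript
export interface ParsedFlow {
  steps: string[];
  assertions?: string[]; // the optional `assertions:` block
}

// Resolves to true when the assertion holds for the current run.
export type AssertionCheck = (assertion: string) => Promise<boolean>;

// Evaluates every assertion, then fails the flow once with the full list of failures.
export async function runHardAssertions(flow: ParsedFlow, check: AssertionCheck): Promise<void> {
  const failed: string[] = [];
  for (const assertion of flow.assertions ?? []) {
    if (!(await check(assertion))) failed.push(assertion);
  }
  if (failed.length > 0) {
    throw new Error(`${failed.length} assertion(s) failed: ${failed.join("; ")}`);
  }
}
```

Collecting all failures before throwing, rather than stopping at the first, gives the QA engineer the complete picture in a single run. Stopping early would be a reasonable alternative if each check is expensive.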
| 90 | + |
| 91 | +## Files to Touch |
| 92 | + |
| 93 | +- New: `src/assertions/extractor.ts` — LLM-based assertion extraction |
| 94 | +- New: `src/assertions/checker.ts` — compare assertions against current run |
| 95 | +- New: `src/assertions/store.ts` — persist/load assertion sets |
| 96 | +- `src/flow/parse-yaml-flow.ts` — parse `assertions:` block from YAML |
| 97 | +- `src/flow/run-yaml-flow.ts` — run assertion checker after steps complete |
| 98 | +- `src/agent/loop.ts` — trigger async extraction on success |
| 99 | +- `src/report/writer.ts` — include assertion results in HTML report |