Merged
4 changes: 1 addition & 3 deletions .github/workflows/ci.yml
@@ -19,11 +19,9 @@ jobs:
       - uses: actions/setup-node@v4
         with:
           node-version: '20'
-          cache: npm
-          cache-dependency-path: package-lock.json

       - name: Install dependencies
-        run: npm ci
+        run: npm install --no-package-lock

       - name: Format check (Prettier)
         run: npm run format:check
99 changes: 99 additions & 0 deletions docs/qa-observed-assertions.md
@@ -0,0 +1,99 @@
# QA: Observed Behavior Assertions

## Problem

AppClaw completes a task and declares success, but gives no structured record of _what it observed_: prices, confirmation messages, order numbers, screen states. On the next run there is no way to tell whether the outcome was the same.

## Concept

After a successful run, an LLM call reads the agent's step history and extracts observable facts as assertions:

```
Run: "complete checkout for 1 large oat milk latte"
Observed assertions:
✓ Order confirmation screen appeared
✓ Item: "Oat Milk Latte, Large" shown
✓ Price shown: $6.95
✓ Payment method: Apple Pay
✓ Estimated ready time shown
✓ Completed in 4 steps
```

On subsequent runs these become **soft assertions** — the agent flags any that no longer hold.

## Assertion Types

| Type | Example | How detected |
| --------------- | ------------------------------------ | ---------------------------------- |
| Screen appeared | "Order confirmation screen appeared" | Screen fingerprint match |
| Text present | "Price shown: $6.95" | LLM extraction from DOM/screenshot |
| Step count | "Completed in 4 steps" | `stepsInRun` from trajectory |
| Element state | "Apple Pay button was selected" | LLM extraction |
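
The four rows above could be modeled as a discriminated union, which makes it explicit which assertions need an LLM call and which can be checked deterministically. This is only a sketch: the type names, field names, and the `detectionMethod` helper are assumptions made for illustration and are not part of the proposal.

```typescript
// Hypothetical model of the assertion types listed in the table.
type Assertion =
  | { kind: "screen-appeared"; screen: string }
  | { kind: "text-present"; text: string }
  | { kind: "step-count"; steps: number }
  | { kind: "element-state"; element: string; state: string };

type DetectionMethod = "screen-fingerprint" | "llm-extraction" | "trajectory";

// Maps each assertion kind to the "How detected" column of the table.
function detectionMethod(assertion: Assertion): DetectionMethod {
  switch (assertion.kind) {
    case "screen-appeared":
      return "screen-fingerprint";
    case "step-count":
      return "trajectory"; // read from `stepsInRun`
    case "text-present":
    case "element-state":
      return "llm-extraction";
  }
}
```

Only the two `llm-extraction` kinds would cost an LLM call at check time; the other two can be compared directly.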

## Proposed Design

### Extraction (async, post-run)

```typescript
// After successful finalize()
const assertions = await extractAssertions(stepHistory, goal, llmClient);
saveAssertions(appId, goalHash, assertions);
```

Prompt to LLM:

```
Given this agent run transcript, extract 3-6 observable facts about the outcome
as short assertion strings. Focus on: screens that appeared, values shown,
actions completed. Be specific. Format: one assertion per line.
```
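
A possible shape for `extractAssertions`, using the prompt above. The `LlmClient` interface, the `string[]` type for `stepHistory`, and the `parseAssertionLines` helper are assumptions; the real client and history types may differ. The parser tolerates bullets, numbering, and blank lines, since LLM replies do not always follow the requested format exactly.

```typescript
// Hypothetical LLM client shape; the real client may look different.
interface LlmClient {
  complete(prompt: string): Promise<string>;
}

const EXTRACTION_PROMPT =
  "Given this agent run transcript, extract 3-6 observable facts about the outcome " +
  "as short assertion strings. Focus on: screens that appeared, values shown, " +
  "actions completed. Be specific. Format: one assertion per line.";

// Turns a one-assertion-per-line reply into a clean list. Strips leading
// bullets ("-", "*", "•", "✓") and numbering ("1.", "2)"), drops blank lines,
// and caps the result at `max` entries.
function parseAssertionLines(reply: string, max = 6): string[] {
  return reply
    .split(/\r?\n/)
    .map((line) => line.replace(/^\s*(?:[-*•✓]\s*|\d+[.)]\s+)/, "").trim())
    .filter((line) => line.length > 0)
    .slice(0, max);
}

async function extractAssertions(
  stepHistory: string[],
  goal: string,
  llmClient: LlmClient,
): Promise<string[]> {
  const transcript = stepHistory.map((step, i) => `${i + 1}. ${step}`).join("\n");
  const reply = await llmClient.complete(
    `${EXTRACTION_PROMPT}\n\nGoal: ${goal}\n\nTranscript:\n${transcript}`,
  );
  return parseAssertionLines(reply);
}
```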

### Storage

`~/.appclaw/assertions/<appId>/<goalHash>.json`

```json
{
  "goal": "complete checkout",
  "appId": "com.starbucks",
  "extractedAt": 1712345678,
  "assertions": ["Order confirmation screen appeared", "Price shown: $6.95", "Completed in 4 steps"]
}
```
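
The record and its path could be typed as follows. The doc does not say how `goalHash` is derived, so the FNV-1a-style hash over the normalized goal text is purely an assumption, as are the helper names `goalHash`, `assertionPath`, and `toStoredAssertions`.

```typescript
// Shape of the stored file, mirroring the JSON example above.
interface StoredAssertions {
  goal: string;
  appId: string;
  extractedAt: number; // Unix time in seconds, as in the example
  assertions: string[];
}

// Assumption: a 32-bit FNV-1a-style hash over the trimmed, lower-cased goal.
// Any stable hash would do; this one needs no imports.
function goalHash(goal: string): string {
  const normalized = goal.trim().toLowerCase();
  let hash = 0x811c9dc5;
  for (let i = 0; i < normalized.length; i++) {
    hash ^= normalized.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Builds `<home>/.appclaw/assertions/<appId>/<goalHash>.json`.
function assertionPath(homeDir: string, appId: string, goal: string): string {
  return `${homeDir}/.appclaw/assertions/${appId}/${goalHash(goal)}.json`;
}

function toStoredAssertions(
  goal: string,
  appId: string,
  assertions: string[],
  now: Date,
): StoredAssertions {
  return { goal, appId, extractedAt: Math.floor(now.getTime() / 1000), assertions };
}
```

Normalizing the goal before hashing means that "Complete checkout" and "complete checkout" share one assertion file; whether that is desirable is a design choice left open here.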

### Soft assertion check on next run

At run end, retrieve stored assertions and ask the LLM:

```
Previous run observed: ["Order confirmation screen appeared", "Price shown: $6.95"]
Based on the current run transcript, which of these still hold? Which do not?
```

Emit the result in the terminal and in the HTML report.
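
A minimal sketch of the two ends of this check: building the prompt and formatting the verdicts. How the LLM reply is parsed into per-assertion verdicts is not specified in this doc, so `SoftCheckResult`, `buildSoftCheckPrompt`, and `formatSoftCheck` are hypothetical names, and the sketch assumes the verdicts are already available.

```typescript
// Hypothetical per-assertion verdict, produced by parsing the LLM reply.
interface SoftCheckResult {
  assertion: string;
  holds: boolean;
}

// Builds the check prompt from the stored assertions and the current transcript.
function buildSoftCheckPrompt(previous: string[], transcript: string): string {
  return [
    `Previous run observed: ${JSON.stringify(previous)}`,
    "Based on the current run transcript, which of these still hold? Which do not?",
    "",
    transcript,
  ].join("\n");
}

// Soft assertions only warn; they never change the run's pass/fail status.
function formatSoftCheck(results: SoftCheckResult[]): string[] {
  return results.map((result) =>
    result.holds ? `✓ ${result.assertion}` : `⚠ ${result.assertion} (no longer holds)`,
  );
}
```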

### Hard assertions in YAML flows

QA engineers can also write explicit assertions in flow files:

```yaml
steps:
- tap checkout
- ...
assertions:
- order confirmation screen is visible
- price displayed is under $10
- no error messages present
```

These run after all steps complete and fail the flow if any assertion fails.
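
The parsing and fail-the-flow rule might look like this. The sketch assumes the YAML has already been parsed into a plain object and that each assertion has already been evaluated. `AssertionOutcome`, `FlowResult`, `parseAssertionsBlock`, and `flowStatus` are illustrative names, not existing AppClaw APIs.

```typescript
interface AssertionOutcome {
  assertion: string;
  passed: boolean;
}

interface FlowResult {
  status: "passed" | "failed";
  failures: string[];
}

// Validates the optional `assertions:` block of an already-parsed flow object.
function parseAssertionsBlock(flow: Record<string, unknown>): string[] {
  const raw = flow["assertions"];
  if (raw === undefined) return []; // flows without assertions stay valid
  if (!Array.isArray(raw) || !raw.every((item) => typeof item === "string")) {
    throw new Error("`assertions` must be a list of strings");
  }
  return raw as string[];
}

// Hard assertions: a single failure fails the whole flow.
function flowStatus(outcomes: AssertionOutcome[]): FlowResult {
  const failures = outcomes.filter((o) => !o.passed).map((o) => o.assertion);
  return { status: failures.length === 0 ? "passed" : "failed", failures };
}
```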

## Files to Touch

- New: `src/assertions/extractor.ts` — LLM-based assertion extraction
- New: `src/assertions/checker.ts` — compare assertions against current run
- New: `src/assertions/store.ts` — persist/load assertion sets
- `src/flow/parse-yaml-flow.ts` — parse `assertions:` block from YAML
- `src/flow/run-yaml-flow.ts` — run assertion checker after steps complete
- `src/agent/loop.ts` — trigger async extraction on success
- `src/report/writer.ts` — include assertion results in HTML report
68 changes: 68 additions & 0 deletions docs/qa-personas.md
@@ -0,0 +1,68 @@
# QA: Test Persona Profiles

## Problem

Every QA test run requires a specific user context — free vs premium, new vs returning, admin vs regular. Currently this must be spelled out in the goal on every run, making flows verbose and hard to reuse.

## Proposed Design

### Persona files at `.appclaw/env/personas/<name>.yaml`

```yaml
# .appclaw/env/personas/premium-user.yaml
name: premium-user
credentials:
  email: qa+premium@company.com
  password: $SECRET_PREMIUM_PASS # interpolated from .appclaw/env/secrets
state:
  subscription: active
  cart: empty
  onboarding: completed
  notifications: denied
```

### CLI usage

```bash
appclaw --flow checkout.yaml --persona premium-user
appclaw --flow onboarding.yaml --persona new-user
```

### YAML flow usage

```yaml
persona: premium-user
steps:
- tap the checkout button
- ...
```

## How It Works

1. The persona file is loaded at run start.
2. Persona fields are injected into the LLM system prompt as context:
   ```
   CURRENT USER PERSONA: premium-user
   - Subscription: active
   - Cart: empty
   - Onboarding: completed
   ```
3. Credentials are available for interpolation in steps:
   ```yaml
   - type $persona.credentials.email into the email field
   ```
4. Secrets (persona values starting with `$`, such as `$SECRET_PREMIUM_PASS`) are resolved from `.appclaw/env/secrets` before injection.
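
Steps 2 to 4 above can be sketched as three small functions. Everything here is an assumption layered on the description: the `Persona` shape, the function names, and the idea that secrets arrive as an already-parsed key/value map (the format of `.appclaw/env/secrets` is not specified). The prompt block renders only `state`, on the assumption that credentials should not be shown to the LLM.

```typescript
interface Persona {
  name: string;
  credentials: Record<string, string>;
  state: Record<string, string>;
}

// Step 4: replace values that start with `$` using the secrets map.
function resolveValues(
  values: Record<string, string>,
  secrets: Record<string, string>,
): Record<string, string> {
  const resolved: Record<string, string> = {};
  for (const key of Object.keys(values)) {
    const value = values[key];
    if (value.charAt(0) !== "$") {
      resolved[key] = value;
      continue;
    }
    const secretName = value.slice(1);
    const secret = secrets[secretName];
    if (secret === undefined) throw new Error(`Missing secret: ${secretName}`);
    resolved[key] = secret;
  }
  return resolved;
}

function resolveSecrets(persona: Persona, secrets: Record<string, string>): Persona {
  return {
    name: persona.name,
    credentials: resolveValues(persona.credentials, secrets),
    state: resolveValues(persona.state, secrets),
  };
}

// Step 2: render persona state as a system-prompt block.
function personaPromptBlock(persona: Persona): string {
  const lines = [`CURRENT USER PERSONA: ${persona.name}`];
  for (const key of Object.keys(persona.state)) {
    const label = key.charAt(0).toUpperCase() + key.slice(1);
    lines.push(`- ${label}: ${persona.state[key]}`);
  }
  return lines.join("\n");
}

// Step 3: interpolate `$persona.credentials.<field>` references in a step.
function interpolateStep(step: string, persona: Persona): string {
  return step.replace(/\$persona\.credentials\.(\w+)/g, (_match, field: string) => {
    const value = persona.credentials[field];
    if (value === undefined) throw new Error(`Unknown persona credential: ${field}`);
    return value;
  });
}
```

Resolving secrets before interpolation means a step only ever sees final values, and a missing secret fails at load time rather than halfway through a run.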

## Personas to Ship With (Examples)

- `new-user` — no account, fresh install state
- `free-user` — logged in, free tier limits apply
- `premium-user` — logged in, all features unlocked
- `admin` — elevated permissions

## Files to Touch

- `src/flow/run-yaml-flow.ts` — load and inject persona at run start
- `src/config.ts` — add `--persona` CLI flag
- `src/llm/prompts.ts` — inject persona context into system prompt
- New: `src/persona/loader.ts` — load, validate, interpolate persona files
65 changes: 65 additions & 0 deletions docs/qa-regression-baseline.md
@@ -0,0 +1,65 @@
# QA: Trajectory as Regression Baseline

## Problem

AppClaw's trajectory store already records the exact path a successful run took — the sequence of actions, selectors, and step counts. This is a regression baseline sitting unused. There is no way today to compare a current run against a previous one and flag changes.

## Insight

A regression is detectable when:

- The same goal on the same app **took more steps** than before
- A screen that used to appear **no longer appears**
- An action that always worked **now fails**
- The completion path **diverged** from the recorded trajectory

## Proposed Design

### Regression report per run

After each run, compare against the stored trajectory for the same (goal, app, platform) and emit a diff:

```
Regression Check: "complete checkout" app: com.starbucks
─────────────────────────────────────────────────────────
✓ Step 1: find_and_click "Add to Cart" (same)
✓ Step 2: find_and_click "Proceed to Checkout" (same)
⚠ Step 3: NEW — dismiss_popup "Enable notifications" (not in baseline)
✓ Step 4: find_and_click "Apple Pay" (same)
✗ Step 5: MISSING — order confirmation screen (appeared in baseline, not now)

Steps: 4 (baseline: 4) ✓ | New steps: 1 | Missing steps: 1
```
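
One way to produce the same/new/missing classification shown above is a longest-common-subsequence diff over the two step sequences. This is a sketch under assumptions: steps are compared as plain strings, and `DiffEntry` and `diffTrajectories` are hypothetical names. A real implementation would probably compare action plus selector, or a screen fingerprint.

```typescript
type DiffEntry =
  | { status: "same"; step: string }
  | { status: "new"; step: string } // in the current run, not in the baseline
  | { status: "missing"; step: string }; // in the baseline, not in the current run

function diffTrajectories(baseline: string[], current: string[]): DiffEntry[] {
  const n = baseline.length;
  const m = current.length;

  // lcs[r][c] = length of the longest common subsequence of
  // baseline[r..] and current[c..].
  const lcs: number[][] = [];
  for (let r = 0; r <= n; r++) lcs.push(new Array<number>(m + 1).fill(0));
  for (let r = n - 1; r >= 0; r--) {
    for (let c = m - 1; c >= 0; c--) {
      lcs[r][c] =
        baseline[r] === current[c]
          ? lcs[r + 1][c + 1] + 1
          : Math.max(lcs[r + 1][c], lcs[r][c + 1]);
    }
  }

  // Walk both sequences, emitting entries in run order.
  const diff: DiffEntry[] = [];
  let b = 0;
  let k = 0;
  while (b < n && k < m) {
    if (baseline[b] === current[k]) {
      diff.push({ status: "same", step: current[k] });
      b++;
      k++;
    } else if (lcs[b][k + 1] >= lcs[b + 1][k]) {
      diff.push({ status: "new", step: current[k] });
      k++;
    } else {
      diff.push({ status: "missing", step: baseline[b] });
      b++;
    }
  }
  while (k < m) diff.push({ status: "new", step: current[k++] });
  while (b < n) diff.push({ status: "missing", step: baseline[b++] });
  return diff;
}
```

Fed the baseline and current run from the report above, this yields same, same, new, same, missing, which matches the five lines shown.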

### CLI flag

```bash
appclaw --flow checkout.yaml --check-regression
appclaw --flow checkout.yaml --update-baseline # overwrite stored baseline
```

### Baseline storage

Extend `TrajectoryEntry` in `src/memory/types.ts` with an ordered step sequence (not just the winning final action) so full path comparison is possible.

Alternatively, store baselines separately at `~/.appclaw/baselines/<appId>/<goalHash>.json`.

## Step Count Heuristic (Quick Win)

Without full path comparison, step count delta alone is a useful signal:

```
⚠ Regression risk: "complete checkout" took 7 steps (baseline: 4). App may have added screens.
```

This requires no schema changes — `stepsInRun` is already stored in `TrajectoryEntry`.

Because it needs no new data, this warning can be surfaced in the run summary right away.
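
A sketch of the heuristic as a pure function. The name `stepCountWarning` and the rule that any increase over the baseline triggers the warning are assumptions; a tolerance threshold may turn out to be more practical.

```typescript
// Returns a warning line when the run took more steps than the baseline,
// or null when there is nothing to report.
function stepCountWarning(
  goal: string,
  stepsInRun: number,
  baselineSteps: number,
): string | null {
  if (stepsInRun <= baselineSteps) return null;
  return (
    `⚠ Regression risk: "${goal}" took ${stepsInRun} steps ` +
    `(baseline: ${baselineSteps}). App may have added screens.`
  );
}
```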

## Files to Touch

- `src/memory/types.ts` — extend `TrajectoryEntry` with step sequence (optional, for full diff)
- `src/memory/retriever.ts` — add baseline comparison function
- `src/agent/loop.ts` — emit regression warning at run end
- `src/report/writer.ts` — include regression diff in HTML report
- `src/config.ts` — add `--check-regression` and `--update-baseline` flags
79 changes: 79 additions & 0 deletions docs/qa-step-libraries.md
@@ -0,0 +1,79 @@
# QA: Named Setup Steps / Step Libraries

## Problem

Common setup sequences (login, clear cart, reset permissions, onboard a new user) are written out in full in every flow file that needs them. When the login flow changes, every flow that embeds it must be updated. There is no reuse.

## Concept

Named step sequences stored as shared YAML fragments, referenced from any flow:

```yaml
# .appclaw/steps/login-as-admin.yaml
name: login-as-admin
description: Log in using admin credentials, handle 2FA if prompted
steps:
- tap the Sign In button
- type $persona.credentials.email into the email field
- type $persona.credentials.password into the password field
- tap Login
- if OTP screen appears, wait for human input
```

Referenced in any flow:

```yaml
setup:
- use: login-as-admin
- use: clear-cart

steps:
- tap checkout
- ...
```

## Step Library Locations

Resolution order (first match wins):

1. `.appclaw/steps/` — project-level, checked into repo
2. `~/.appclaw/steps/` — user-level, shared across projects
3. Built-in steps shipped with AppClaw (login helpers, permission handlers)
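
The first-match-wins lookup could be written as below. The `StepSource` type, the function names, and the injected `exists` callback are assumptions; injecting the file check keeps the sketch free of file-system imports and easy to test.

```typescript
interface StepSource {
  label: "project" | "user" | "built-in";
  dir: string;
}

// Search path in resolution order: project, then user, then built-in.
function stepSearchPath(projectRoot: string, homeDir: string, builtinDir: string): StepSource[] {
  return [
    { label: "project", dir: `${projectRoot}/.appclaw/steps` },
    { label: "user", dir: `${homeDir}/.appclaw/steps` },
    { label: "built-in", dir: builtinDir },
  ];
}

// Returns the first source that contains `<stepName>.yaml`, or null.
function resolveStepFile(
  stepName: string,
  sources: StepSource[],
  exists: (path: string) => boolean,
): { source: StepSource["label"]; path: string } | null {
  for (const source of sources) {
    const path = `${source.dir}/${stepName}.yaml`;
    if (exists(path)) return { source: source.label, path };
  }
  return null;
}
```

With this order, a project can override a built-in step such as `clear-cart` simply by adding a file of the same name under `.appclaw/steps/`.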

## Built-in Steps to Ship

| Name | Description |
| ----------------------- | ------------------------------------------------- |
| `dismiss-notifications` | Deny notification permission prompt if it appears |
| `dismiss-tracking` | Deny app tracking permission if it appears |
| `clear-cart` | Navigate to cart and remove all items |
| `logout` | Navigate to account settings and log out |
| `wait-for-network` | Wait until a loading spinner disappears |

## Composability

Steps can reference other steps:

```yaml
# .appclaw/steps/fresh-checkout-session.yaml
steps:
- use: logout
- use: login-as-free-user
- use: clear-cart
```
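
Because step files can reference each other, the resolver needs to expand `use:` entries recursively and should guard against cycles. A minimal sketch, assuming steps are already loaded into an in-memory map; `Step`, `StepLibrary`, and `expandSteps` are hypothetical names, and the cycle check is an addition that the text above does not specify.

```typescript
type Step = string | { use: string };
type StepLibrary = Record<string, Step[]>;

// Flattens a step list, expanding `use:` references depth-first.
// `stack` holds the chain of names being expanded, to detect cycles.
function expandSteps(steps: Step[], library: StepLibrary, stack: string[] = []): string[] {
  const expanded: string[] = [];
  for (const step of steps) {
    if (typeof step === "string") {
      expanded.push(step);
      continue;
    }
    const ref = step.use;
    if (stack.indexOf(ref) !== -1) {
      throw new Error(`Circular step reference: ${stack.concat(ref).join(" -> ")}`);
    }
    const referenced = library[ref];
    if (referenced === undefined) throw new Error(`Unknown step: ${ref}`);
    for (const inner of expandSteps(referenced, library, stack.concat(ref))) {
      expanded.push(inner);
    }
  }
  return expanded;
}
```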

## Discoverable via CLI

```bash
appclaw --list-steps # list all available named steps
appclaw --list-steps --filter login # filter by name
appclaw --run-step login-as-admin # run a single step in isolation
```

## Files to Touch

- `src/flow/parse-yaml-flow.ts` — resolve `use:` references, load step files
- `src/flow/run-yaml-flow.ts` — execute referenced steps inline
- New: `src/flow/step-library.ts` — resolve step files from project + user + built-in paths
- `src/config.ts` — add `--list-steps`, `--run-step` flags
- New: `src/flow/builtin-steps/` — built-in step YAML files