Commit f39d4d5

Feat/sdk and vision tracking (#15)
* feat: track vision token usage per step and surface Gemini cached token savings

- Add vision-token-tracker module to accumulate StarkVisionClient token usage across all call sites (stark-locate, vision-execute) and print per-step cost
- Extract Gemini implicit cached token counts from provider metadata and display them in agent loop / flow run summaries (~75% cost reduction notice)
- Cache system prompt in LLM provider so it is built once per run, enabling Gemini implicit prompt caching across steps
- Fix resolveNaturalStep to return ResolvedStep (step + usage) instead of bare FlowStep
- Update Gemini model pricing table with additional preview models
- Bump version to 0.1.6; drop package-lock.json and switch CI to npm install --no-package-lock

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Srinivasan Sekar <srinivasan.sekar1990@gmail.com>

* feat: add SDK layer, e2e tests, and QA docs

- Introduce `src/sdk/` public SDK (GoalRunner, FlowRunner, StepRunner, McpSession, ConfigBuilder)
- Expose SDK as package main entry point with proper exports map
- Add `loadConfig` overrides support for programmatic config injection
- Add SDK unit tests (`tests/sdk/`) and e2e test suite (`tests/e2e/`)
- Separate e2e tests from unit tests via dedicated `test:e2e` script
- Add QA documentation (personas, regression baseline, step libraries, observed assertions)
- Update landing/usage.html

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Srinivasan Sekar <srinivasan.sekar1990@gmail.com>

* fix: remove npm cache config that requires missing package-lock.json

Install step uses --no-package-lock so there is no lock file to use as a cache key.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Srinivasan Sekar <srinivasan.sekar1990@gmail.com>

* fix: exclude e2e tests from CI test run

- Replace --exclude glob (unreliable) with explicit dir args in test script
- Remove describe.only from e2e suite that caused Vitest to fail even when the file was skipped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Srinivasan Sekar <srinivasan.sekar1990@gmail.com>

* fix: mock createPlatformSession in SDK unit tests

Tests were hitting real device session code (startSpinner, capability building) because McpSession.connect() calls createPlatformSession which was not mocked. Add the mock to both appclaw.test.ts and mcp-session.test.ts so tests stay pure unit tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Srinivasan Sekar <srinivasan.sekar1990@gmail.com>

---------

Co-authored-by: Srinivasan Sekar <srinivasan.sekar1990@gmail.com>
1 parent fcb3a64 commit f39d4d5

42 files changed

Lines changed: 4831 additions & 6142 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 3 deletions
@@ -19,11 +19,9 @@ jobs:
       - uses: actions/setup-node@v4
         with:
           node-version: '20'
-          cache: npm
-          cache-dependency-path: package-lock.json

       - name: Install dependencies
-        run: npm ci
+        run: npm install --no-package-lock

       - name: Format check (Prettier)
         run: npm run format:check

docs/qa-observed-assertions.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# QA: Observed Behavior Assertions

## Problem

AppClaw completes a task and declares success, but gives no structured record of _what it observed_: prices, confirmation messages, order numbers, screen states. On the next run there is no way to know whether the outcome was the same.

## Concept

After a successful run, an LLM call reads the agent's step history and extracts observable facts as assertions:

```
Run: "complete checkout for 1 large oat milk latte"
Observed assertions:
  ✓ Order confirmation screen appeared
  ✓ Item: "Oat Milk Latte, Large" shown
  ✓ Price shown: $6.95
  ✓ Payment method: Apple Pay
  ✓ Estimated ready time shown
  ✓ Completed in 4 steps
```

On subsequent runs these become **soft assertions**: the agent flags any that no longer hold.

## Assertion Types

| Type            | Example                              | How detected                       |
| --------------- | ------------------------------------ | ---------------------------------- |
| Screen appeared | "Order confirmation screen appeared" | Screen fingerprint match           |
| Text present    | "Price shown: $6.95"                 | LLM extraction from DOM/screenshot |
| Step count      | "Completed in 4 steps"               | `stepsInRun` from trajectory       |
| Element state   | "Apple Pay button was selected"      | LLM extraction                     |

## Proposed Design

### Extraction (async, post-run)

```typescript
// After successful finalize()
const assertions = await extractAssertions(stepHistory, goal, llmClient);
saveAssertions(appId, goalHash, assertions);
```

Prompt to LLM:

```
Given this agent run transcript, extract 3-6 observable facts about the outcome
as short assertion strings. Focus on: screens that appeared, values shown,
actions completed. Be specific. Format: one assertion per line.
```
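
The prompt asks for one assertion per line, so the reply still has to be turned into a string array before it can be stored. A minimal sketch of that step follows; `parseAssertionLines` is a hypothetical helper (it is not part of this commit), and the list markers it strips and the cap of six entries are assumptions based on the "3-6 observable facts" wording.

```typescript
// Hypothetical helper: converts a "one assertion per line" LLM reply into a clean list.
function parseAssertionLines(reply: string, max: number = 6): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const rawLine of reply.split(/\r?\n/)) {
    // Strip list markers the model may add despite the prompt ("- ", "* ", "1. ", "✓ ").
    const line = rawLine.replace(/^\s*(?:[-*•✓]|\d+[.)])\s+/, "").trim();
    // Skip blank lines and exact duplicates.
    if (line.length === 0 || seen.has(line)) continue;
    seen.add(line);
    result.push(line);
    if (result.length === max) break; // assumed cap, mirrors "3-6" in the prompt
  }
  return result;
}
```

Deduplicating and capping here keeps the stored file stable even if the model repeats itself or ignores the requested count.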

### Storage

`~/.appclaw/assertions/<appId>/<goalHash>.json`

```json
{
  "goal": "complete checkout",
  "appId": "com.starbucks",
  "extractedAt": 1712345678,
  "assertions": ["Order confirmation screen appeared", "Price shown: $6.95", "Completed in 4 steps"]
}
```
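
The doc does not say how `<goalHash>` is derived, so the sketch below assumes a truncated SHA-256 of the whitespace-normalized goal text. `hashGoal`, `assertionFilePath`, and `buildRecord` are hypothetical names, and `extractedAt` is treated as Unix seconds because the example value appears to be in that range.

```typescript
import { createHash } from "node:crypto";
import { homedir } from "node:os";
import { join } from "node:path";

// Record shape taken from the JSON example above.
interface AssertionRecord {
  goal: string;
  appId: string;
  extractedAt: number; // assumed to be Unix seconds
  assertions: string[];
}

// Assumption: the goal hash is the first 16 hex chars of SHA-256 over the normalized goal.
function hashGoal(goal: string): string {
  const normalized = goal.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256").update(normalized).digest("hex").slice(0, 16);
}

// Builds <root>/assertions/<appId>/<goalHash>.json; root defaults to ~/.appclaw.
function assertionFilePath(
  appId: string,
  goal: string,
  root: string = join(homedir(), ".appclaw")
): string {
  return join(root, "assertions", appId, `${hashGoal(goal)}.json`);
}

// Assembles the record to serialize; the clock is injectable to keep tests deterministic.
function buildRecord(
  appId: string,
  goal: string,
  assertions: string[],
  now: Date = new Date()
): AssertionRecord {
  return { goal, appId, extractedAt: Math.floor(now.getTime() / 1000), assertions };
}
```

Normalizing before hashing means "Complete  Checkout" and "complete checkout" map to the same file, which may or may not be the behavior the authors want.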

### Soft assertion check on next run

At run end, retrieve stored assertions and ask the LLM:

```
Previous run observed: ["Order confirmation screen appeared", "Price shown: $6.95"]
Based on the current run transcript, which of these still hold? Which do not?
```

Emit result in terminal and HTML report.

### Hard assertions in YAML flows

QA engineers can also write explicit assertions in flow files:

```yaml
steps:
  - tap checkout
  - ...
assertions:
  - order confirmation screen is visible
  - price displayed is under $10
  - no error messages present
```

These run after all steps complete and fail the flow if any assertion fails.
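
The difference between the two modes (soft assertions only flag, hard assertions fail the flow) can be sketched as a single summarizing function. The types and the `summarizeAssertions` name are hypothetical, and the per-assertion `holds` verdict is assumed to come from the LLM check described above.

```typescript
type AssertionResult = { assertion: string; holds: boolean };
type AssertionMode = "soft" | "hard";

interface AssertionSummary {
  lines: string[]; // one printable line per assertion
  failedCount: number;
  flowPassed: boolean;
}

function summarizeAssertions(results: AssertionResult[], mode: AssertionMode): AssertionSummary {
  const failedCount = results.filter((r) => !r.holds).length;
  const failMark = mode === "hard" ? "✗" : "⚠";
  const lines = results.map((r) => `${r.holds ? "✓" : failMark} ${r.assertion}`);
  // Soft assertions only warn; hard assertions fail the flow when any one does not hold.
  const flowPassed = mode === "soft" || failedCount === 0;
  return { lines, failedCount, flowPassed };
}
```

Keeping the verdicts separate from the pass/fail decision lets the terminal output and the HTML report share one result list.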

## Files to Touch

- New: `src/assertions/extractor.ts` - LLM-based assertion extraction
- New: `src/assertions/checker.ts` - compare assertions against current run
- New: `src/assertions/store.ts` - persist/load assertion sets
- `src/flow/parse-yaml-flow.ts` - parse `assertions:` block from YAML
- `src/flow/run-yaml-flow.ts` - run assertion checker after steps complete
- `src/agent/loop.ts` - trigger async extraction on success
- `src/report/writer.ts` - include assertion results in HTML report

docs/qa-personas.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# QA: Test Persona Profiles

## Problem

Every QA test run requires a specific user context (free vs premium, new vs returning, admin vs regular). Currently this must be spelled out in the goal on every run, making flows verbose and hard to reuse.

## Proposed Design

### Persona files at `.appclaw/env/personas/<name>.yaml`

```yaml
# .appclaw/env/personas/premium-user.yaml
name: premium-user
credentials:
  email: qa+premium@company.com
  password: $SECRET_PREMIUM_PASS # interpolated from .appclaw/env/secrets
state:
  subscription: active
  cart: empty
  onboarding: completed
  notifications: denied
```

### CLI usage

```bash
appclaw --flow checkout.yaml --persona premium-user
appclaw --flow onboarding.yaml --persona new-user
```

### YAML flow usage

```yaml
persona: premium-user
steps:
  - tap the checkout button
  - ...
```

## How It Works

1. Persona file is loaded at run start
2. Persona fields are injected into the LLM system prompt as context:
   ```
   CURRENT USER PERSONA: premium-user
   - Subscription: active
   - Cart: empty
   - Onboarding: completed
   ```
3. Credentials are available for interpolation in steps:
   ```yaml
   - type $persona.credentials.email into the email field
   ```
4. Secrets (values starting with `$`) are resolved from `.appclaw/env/secrets` before injection
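
The four steps above can be sketched as three small functions. Everything here is illustrative: the `Persona` type mirrors the YAML example, the format of `.appclaw/env/secrets` is not specified in the doc, so secrets are passed in as an already-parsed map, and the function names are hypothetical.

```typescript
type Persona = {
  name: string;
  credentials: Record<string, string>;
  state: Record<string, string>;
};

// Step 4: replace credential values that start with "$" using the secrets map.
function resolveSecrets(persona: Persona, secrets: Record<string, string>): Persona {
  const credentials: Record<string, string> = {};
  for (const [key, value] of Object.entries(persona.credentials)) {
    if (value.startsWith("$")) {
      const secretName = value.slice(1);
      const secret = secrets[secretName];
      if (secret === undefined) throw new Error(`Unknown secret: ${secretName}`);
      credentials[key] = secret;
    } else {
      credentials[key] = value;
    }
  }
  return { ...persona, credentials };
}

// Step 2: render the persona state as a system-prompt block.
function personaPromptBlock(persona: Persona): string {
  const lines = Object.entries(persona.state).map(
    ([key, value]) => `- ${key.charAt(0).toUpperCase()}${key.slice(1)}: ${value}`
  );
  return [`CURRENT USER PERSONA: ${persona.name}`, ...lines].join("\n");
}

// Step 3: expand "$persona.credentials.<field>" placeholders inside a step.
function interpolateStep(step: string, persona: Persona): string {
  return step.replace(/\$persona\.credentials\.(\w+)/g, (match: string, field: string) => {
    const value = persona.credentials[field];
    return value === undefined ? match : value; // unknown fields are left untouched
  });
}
```

Only `state` goes into the prompt block in this sketch, so credentials reach the model solely through the steps that reference them.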

## Personas to Ship With (Examples)

- `new-user`: no account, fresh install state
- `free-user`: logged in, free tier limits apply
- `premium-user`: logged in, all features unlocked
- `admin`: elevated permissions

## Files to Touch

- `src/flow/run-yaml-flow.ts` - load and inject persona at run start
- `src/config.ts` - add `--persona` CLI flag
- `src/llm/prompts.ts` - inject persona context into system prompt
- New: `src/persona/loader.ts` - load, validate, interpolate persona files

docs/qa-regression-baseline.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# QA: Trajectory as Regression Baseline

## Problem

AppClaw's trajectory store already records the exact path a successful run took: the sequence of actions, selectors, and step counts. This is a regression baseline sitting unused. There is no way today to compare a current run against a previous one and flag changes.

## Insight

A regression is detectable when:

- The same goal on the same app **took more steps** than before
- A screen that used to appear **no longer appears**
- An action that always worked **now fails**
- The completion path **diverged** from the recorded trajectory

## Proposed Design

### Regression report per run

After each run, compare against the stored trajectory for the same (goal, app, platform) and emit a diff:

```
Regression Check: "complete checkout" app: com.starbucks
─────────────────────────────────────────────────────────
✓ Step 1: find_and_click "Add to Cart" (same)
✓ Step 2: find_and_click "Proceed to Checkout" (same)
⚠ Step 3: NEW - dismiss_popup "Enable notifications" (not in baseline)
✓ Step 4: find_and_click "Apple Pay" (same)
✗ Step 5: MISSING - order confirmation screen (appeared in baseline, not now)

Steps: 4 (baseline: 4) ✓ | New steps: 1 | Missing steps: 1
```
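
One way to produce a diff like the one above is a longest-common-subsequence alignment of the baseline and current step sequences. The doc does not name a diff algorithm, so LCS is an assumption here, as are the `Step` shape and the `diffTrajectories` name.

```typescript
type Step = { action: string; target: string };
type DiffEntry = { kind: "same" | "new" | "missing"; step: Step };

// Two steps are considered equal when both action and target match.
const stepKey = (s: Step): string => `${s.action}\u0000${s.target}`;

function diffTrajectories(baseline: Step[], current: Step[]): DiffEntry[] {
  const n = baseline.length;
  const m = current.length;
  // lcs[i][j] = length of the longest common subsequence of baseline[i..] and current[j..].
  const lcs: number[][] = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(0));
  for (let i = n - 1; i >= 0; i--) {
    for (let j = m - 1; j >= 0; j--) {
      lcs[i][j] =
        stepKey(baseline[i]) === stepKey(current[j])
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk both sequences, emitting shared steps, baseline-only steps, and run-only steps.
  const out: DiffEntry[] = [];
  let i = 0;
  let j = 0;
  while (i < n && j < m) {
    if (stepKey(baseline[i]) === stepKey(current[j])) {
      out.push({ kind: "same", step: current[j] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      out.push({ kind: "missing", step: baseline[i] });
      i++;
    } else {
      out.push({ kind: "new", step: current[j] });
      j++;
    }
  }
  while (i < n) out.push({ kind: "missing", step: baseline[i++] });
  while (j < m) out.push({ kind: "new", step: current[j++] });
  return out;
}
```

Exact-match keys are strict; if selectors or labels drift between app versions, a fuzzier equality (for example on action plus screen fingerprint) may be a better fit.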

### CLI flags

```bash
appclaw --flow checkout.yaml --check-regression
appclaw --flow checkout.yaml --update-baseline   # overwrite stored baseline
```

### Baseline storage

Extend `TrajectoryEntry` in `src/memory/types.ts` with an ordered step sequence (not just the winning final action) so full path comparison is possible.

Alternatively, store baselines separately at `~/.appclaw/baselines/<appId>/<goalHash>.json`.

## Step Count Heuristic (Quick Win)

Without full path comparison, step count delta alone is a useful signal:

```
⚠ Regression risk: "complete checkout" took 7 steps (baseline: 4). App may have added screens.
```

This requires no schema changes, because `stepsInRun` is already stored in `TrajectoryEntry`.

Surface this warning in the run summary today.
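
A minimal sketch of that heuristic, assuming the baseline step count has already been read from the stored `TrajectoryEntry`. The `stepCountWarning` name and the optional `tolerance` parameter are hypothetical additions for illustration.

```typescript
// Returns a warning line when the run took more steps than the baseline allows, else null.
function stepCountWarning(
  goal: string,
  stepsInRun: number,
  baselineSteps: number,
  tolerance: number = 0 // hypothetical: extra steps allowed before warning
): string | null {
  if (stepsInRun <= baselineSteps + tolerance) return null;
  return `⚠ Regression risk: "${goal}" took ${stepsInRun} steps (baseline: ${baselineSteps}). App may have added screens.`;
}
```

A small tolerance may help avoid noise from runs where the agent needs an extra retry for reasons unrelated to the app.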

## Files to Touch

- `src/memory/types.ts` - extend `TrajectoryEntry` with step sequence (optional, for full diff)
- `src/memory/retriever.ts` - add baseline comparison function
- `src/agent/loop.ts` - emit regression warning at run end
- `src/report/writer.ts` - include regression diff in HTML report
- `src/config.ts` - add `--check-regression` and `--update-baseline` flags

docs/qa-step-libraries.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# QA: Named Setup Steps / Step Libraries

## Problem

Common setup sequences (login, clear cart, reset permissions, onboard a new user) are written out in full in every flow file that needs them. When the login flow changes, every flow that embeds it must be updated. There is no reuse.

## Concept

Named step sequences stored as shared YAML fragments, referenced from any flow:

```yaml
# .appclaw/steps/login-as-admin.yaml
name: login-as-admin
description: Log in using admin credentials, handle 2FA if prompted
steps:
  - tap the Sign In button
  - type $persona.credentials.email into the email field
  - type $persona.credentials.password into the password field
  - tap Login
  - if OTP screen appears, wait for human input
```

Referenced in any flow:

```yaml
setup:
  - use: login-as-admin
  - use: clear-cart

steps:
  - tap checkout
  - ...
```

## Step Library Locations

Resolution order (first match wins):

1. `.appclaw/steps/`: project-level, checked into repo
2. `~/.appclaw/steps/`: user-level, shared across projects
3. Built-in steps shipped with AppClaw (login helpers, permission handlers)
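
The "first match wins" order above can be sketched as a lookup over an ordered list of directories. `resolveStepFile` is a hypothetical name, the `.yaml` extension is taken from the examples, and the existence check is injected so the sketch does not depend on the real file system.

```typescript
type Exists = (path: string) => boolean;

// Returns the first "<dir>/<name>.yaml" that exists, searching dirs in priority order.
function resolveStepFile(name: string, dirs: string[], exists: Exists): string | null {
  // Assumption: step names are simple slugs; this also blocks path traversal via "use:".
  if (!/^[\w-]+$/.test(name)) throw new Error(`Invalid step name: ${name}`);
  for (const dir of dirs) {
    const candidate = `${dir.replace(/\/+$/, "")}/${name}.yaml`;
    if (exists(candidate)) return candidate;
  }
  return null;
}
```

In real use the caller would presumably pass the project, user, and built-in directories in that order, with Node's `fs.existsSync` as the `exists` argument.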

## Built-in Steps to Ship

| Name                    | Description                                       |
| ----------------------- | ------------------------------------------------- |
| `dismiss-notifications` | Deny notification permission prompt if it appears |
| `dismiss-tracking`      | Deny app tracking permission if it appears        |
| `clear-cart`            | Navigate to cart and remove all items             |
| `logout`                | Navigate to account settings and log out          |
| `wait-for-network`      | Wait until a loading spinner disappears           |

## Composability

Steps can reference other steps:

```yaml
# .appclaw/steps/fresh-checkout-session.yaml
steps:
  - use: logout
  - use: login-as-free-user
  - use: clear-cart
```
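
Because step files can reference each other, expansion needs to be recursive and should guard against cycles. The sketch below assumes step files have already been parsed into an in-memory map; `FlowItem`, `Library`, and `expandSteps` are hypothetical names, and the cycle check is a design suggestion rather than something the doc specifies.

```typescript
type FlowItem = string | { use: string };
type Library = Record<string, FlowItem[]>;

// Flattens `use:` references into plain steps, depth-first, in declaration order.
function expandSteps(items: FlowItem[], library: Library, stack: string[] = []): string[] {
  const out: string[] = [];
  for (const item of items) {
    if (typeof item === "string") {
      out.push(item);
      continue;
    }
    const name = item.use;
    // A name already on the current expansion path means the files reference each other in a loop.
    if (stack.includes(name)) {
      throw new Error(`Circular step reference: ${[...stack, name].join(" -> ")}`);
    }
    const referenced = library[name];
    if (referenced === undefined) throw new Error(`Unknown step: ${name}`);
    out.push(...expandSteps(referenced, library, [...stack, name]));
  }
  return out;
}
```

Tracking the path rather than a global visited set allows the same step (for example `clear-cart`) to be used twice in one flow without being mistaken for a cycle.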

## Discoverable via CLI

```bash
appclaw --list-steps                  # list all available named steps
appclaw --list-steps --filter login   # filter by name
appclaw --run-step login-as-admin     # run a single step in isolation
```

## Files to Touch

- `src/flow/parse-yaml-flow.ts` - resolve `use:` references, load step files
- `src/flow/run-yaml-flow.ts` - execute referenced steps inline
- New: `src/flow/step-library.ts` - resolve step files from project + user + built-in paths
- `src/config.ts` - add `--list-steps`, `--run-step` flags
- New: `src/flow/builtin-steps/` - built-in step YAML files
