File: `.claude/skills/agent-device-evidence/SKILL.md`
17 additions & 60 deletions
@@ -1,7 +1,7 @@
1
1
---
2
2
name: agent-device-evidence
3
3
description: Records iOS/Android native MP4 evidence for test/repro flows extracted from an Expensify GitHub PR or issue. Use when the user asks to "record the flow for PR #X", "capture mobile evidence for issue #Y", or "produce screenshots/videos for <PR or issue URL>". Mobile-native only - declines mWeb and Desktop.
If no anchor matches, pass the **whole body** to the LLM and ask it to find the steps. The anchor list is a hint to reduce token cost on the 99% case, not a hard contract.
63
+
If no anchor matches, pass the **whole body** to the LLM and ask it to find the steps. The anchor list is a hint, not a hard contract.
66
64
67
65
Stop the section at the next equal-or-higher heading (e.g. for issues, `## Expected Result:` ends the steps section). Strip trailing GitHub-template footers (Upwork automation block, contributing-guide preamble, `## Workaround:`, `## Screenshots/Videos`).
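The "stop at the next equal-or-higher heading" rule can be sketched as follows. This is a minimal illustration, not the skill's actual parser: the function name `extract_section` and the anchor title passed to it are assumptions, and footer stripping is left out.

```python
import re
from typing import Optional

# ATX-style markdown heading: 1-6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def extract_section(body: str, anchor: str) -> Optional[str]:
    """Return the text under the first heading whose title starts with
    `anchor` (case-insensitive), ending at the next heading of equal or
    higher level. Returns None when no heading matches."""
    lines = body.splitlines()
    start = None
    level = 0
    for i, line in enumerate(lines):
        m = HEADING.match(line)
        if not m:
            continue
        if start is None:
            if m.group(2).strip().lower().startswith(anchor.lower()):
                start, level = i + 1, len(m.group(1))
        elif len(m.group(1)) <= level:
            # Fewer or equal '#' means an equal-or-higher heading: stop here.
            return "\n".join(lines[start:i]).strip()
    if start is None:
        return None  # caller falls back to passing the whole body to the LLM
    return "\n".join(lines[start:]).strip()
```

Deeper subheadings (more `#` characters than the anchor) stay inside the section, so nested notes under the steps are preserved. One known gap in this sketch: a line starting with `#` inside a fenced code block would be mistaken for a heading.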
68
66
@@ -104,55 +102,16 @@ Each step's text is passed verbatim to the agent-device driver, which decides pe
104
102
105
103
If the LLM returns an empty flow list (body was prose-only, "N/A", "We'll test it live", or empty after stripping), exit `3 NO_FLOWS`.
106
104
107
-
## Phase 1 cache (skip warm-up when flow text is unchanged)
105
+
## Phase 1 cache
108
106
109
-
Phase 1 is the expensive part - it runs the LLM-driven exploration loop to produce a deterministic `.ad` script. Phase 2 is just `agent-device replay` and is cheap. When a flow's text and target environment are identical to a prior run's, reuse the cached `.ad` and skip Phase 1 entirely.
107
+
Simple map: flow steps → `.ad` script. If the steps haven't changed, reuse the cached script and skip the warm-up.
Content-addressable, **shared across PRs**. Two PRs whose Tests sections both contain the same "Sign in and create an expense" flow share the same cache entry.
120
-
121
-
### Cache key
122
-
123
-
```
124
-
fingerprint = sha256(
125
-
flow.precondition || "" +
126
-
json(flow.steps) +
127
-
platform +
128
-
bundle_id +
129
-
agent_device_version
130
-
)
131
-
```
132
-
133
-
Fields included by design:
134
-
- **`flow.steps`** - if the steps change at all, the fingerprint changes. This is the primary correctness signal.
135
-
- **`flow.precondition`** - drives setup behavior; affects what the `.ad` script contains.
136
-
- **`platform`** - iOS and Android need separate scripts (different selectors).
137
-
- **`bundle_id`** - HybridApp vs standalone (and dev vs prod) render differently.
138
-
- **`agent_device_version`** - replay semantics can change between CLI versions; pinning to the running version protects against subtle drift.
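A fingerprint over these five fields might be computed as in the sketch below. It is an illustration under assumptions, not the skill's implementation: the function signature is hypothetical, and it deliberately serialises the fields as one JSON array instead of concatenating strings, so that field boundaries stay unambiguous.

```python
import hashlib
import json

def fingerprint(precondition, steps, platform, bundle_id, agent_device_version):
    """Hex SHA-256 over the five cache-key fields (hypothetical helper)."""
    # A JSON array keeps field boundaries explicit; plain concatenation
    # would let ["ab", "c"] and ["a", "bc"] hash to the same value.
    payload = json.dumps(
        [precondition or "", steps, platform, bundle_id, agent_device_version],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A missing precondition is treated as the empty string, mirroring the `|| ""` in the cache-key definition. Any change to a step's text, the platform, the bundle ID, or the CLI version yields a different digest and therefore a cache miss.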
- **Hit** (and `--no-cache` is not set): copy cached `.ad` to `$TEST_FLOW.ad`, mark `cache: "hit"` in the manifest, **skip Phase 1 entirely**, proceed to Phase 2.
152
-
- **Miss** (or `--no-cache`): mark `cache: "miss"`, run Phase 1 normally.
153
-
3. **On Phase 1 success** (cache miss path): write `$TEST_FLOW.ad` to the cache, write `<fingerprint>.meta.json` with `{created_ts, original_pr, hits: 1}`.
154
-
4. **On Phase 2 replay failure** (cache hit path only): the cached script is stale (UI changed under it). Delete the cache entry, mark the flow `cache: "invalidated"`, re-run Phase 1 fresh, retry Phase 2 once. If the retry still fails, mark `phase2_failed`.
155
-
5. **On Phase 2 success** (cache hit path): bump `last_used_ts` and increment `hits` in the meta file.
114
+
The skill does not delete, invalidate, or retry cache entries. If a cached `.ad` is stale, the flow is marked `phase2_failed`. To recover, edit the steps (which changes the fingerprint) or wipe `~/.cache/agent-device-evidence/.ad-cache/` externally.
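The append-only behaviour described here could look roughly like the sketch below. The helper names `reuse_cached_script` and `store_script` are hypothetical, the `<fingerprint>.ad` file naming is taken from the earlier revision of this file, and the sketch assumes Phase 1 still writes its result into the cache on success.

```python
import shutil
from pathlib import Path

def reuse_cached_script(cache_dir: Path, fingerprint: str, target: Path) -> bool:
    """Copy the cached .ad script to `target` if one exists.

    Returns True on a hit (Phase 1 can be skipped), False on a miss.
    Never deletes or rewrites the cache entry, even if it later
    turns out to be stale."""
    cached = cache_dir / f"{fingerprint}.ad"
    if not cached.is_file():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cached, target)
    return True

def store_script(cache_dir: Path, fingerprint: str, script: Path) -> None:
    """Add a freshly generated script to the cache; existing entries win."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    dest = cache_dir / f"{fingerprint}.ad"
    if not dest.exists():
        shutil.copyfile(script, dest)
```

Because nothing in this path removes an entry, a stale script keeps producing `phase2_failed` until the steps change (new fingerprint) or the cache directory is wiped by hand.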
156
115
157
116
## Capture loop (per flow per platform)
158
117
@@ -169,21 +128,20 @@ Two phases per flow. Lifecycle delegated to the parent skill's bring-up. Phase 1
169
128
agent-device close
170
129
```
171
130
172
-
3. **Set up run directory** - persistent cache, latest-run-wins:
131
+
3. **Set up run directory** - persistent, append-only:
# Optional: rm -rf prior runs for this PR before mkdir to keep disk lean
178
136
```
179
137
180
138
### Phase 1 - Warm-up (per flow, no camera)
181
139
182
140
Goal: produce a deterministic `.ad` script of the successful command sequence, plus per-step still candidates. Drives autonomously from cold start. No recording.
183
141
184
-
**Skip if cached.** Before any device work, consult the [Phase 1 cache](#phase-1-cache-skip-warm-up-when-flow-text-is-unchanged). On cache hit, copy the cached `.ad` to `$TEST_FLOW.ad`, log `cache: "hit"` to the manifest, and proceed straight to Phase 2 for this flow.
142
+
**Skip if cached.** Before any device work, consult the [Phase 1 cache](#phase-1-cache). On hit, copy the cached `.ad` to `$TEST_FLOW.ad`, mark `cached: true` in the manifest, and proceed straight to Phase 2.
185
143
186
-
On cache miss (or `--no-cache`):
144
+
On cache miss:
187
145
188
146
1. **Open the app** with the bring-up's resolved values:
189
147
```bash
@@ -245,7 +203,7 @@ Goal: clean MP4 of only the test-flow steps. No snapshots on camera, no retries,
245
203
|| { mark phase2_failed; continue; }
246
204
```
247
205
248
-
**On Phase 2 replay failure (cache hit path only):** the cached `.ad` is stale. Delete `<fingerprint>.ad` and `<fingerprint>.meta.json`, mark `cache: "invalidated"`, re-run Phase 1 fresh, and retry Phase 2 once. If the retry still fails, mark `phase2_failed`. (Cache miss path failures don't trigger a retry - the freshly generated `.ad` failed on its own first replay, so retrying will hit the same problem.)
206
+
**On Phase 2 replay failure:** mark the flow `phase2_failed` and continue to the next flow.
249
207
250
208
### Multi-flow chunking
251
209
@@ -278,7 +236,7 @@ For flows classified `kind: still`:
278
236
└── ...
279
237
```
280
238
281
-
Run output is persistent across reboots. The skill purges prior runs for the same source at the start of each new run (latest-run-wins; no concurrent locking; single-user assumption). The `.ad-cache/` directory is **not** purged on per-source runs - it's shared across sources and self-heals on Phase 2 replay failure.
239
+
Run output is persistent across reboots and append-only - the skill never deletes prior runs or cache entries.
282
240
283
241
### Manifest schema
284
242
@@ -304,7 +262,7 @@ Run output is persistent across reboots. The skill purges prior runs for the sam
304
262
"stills": ["ios/flow-1-step-2-tap-signin.png"],
305
263
"expected": "App will show error when creating new agent without name.",
@@ -315,7 +273,7 @@ Run output is persistent across reboots. The skill purges prior runs for the sam
315
273
}
316
274
```
317
275
318
-
`source.kind` is `"pr"` or `"issue"`. `expected` is populated for issues (from `## Expected Result:`); absent for PRs. `status` is one of: `ok`, `phase1_failed`, `phase2_failed`, `skipped_after_failure`. `cache` is one of: `"hit"` (cached `.ad` reused, Phase 1 skipped), `"miss"` (no cache, Phase 1 ran fresh), `"invalidated"` (cache hit but Phase 2 replay failed; entry deleted, Phase 1 re-ran fresh, Phase 2 retried), `"bypassed"` (`--no-cache` flag).
276
+
`source.kind` is `"pr"` or `"issue"`. `expected` is populated for issues (from `## Expected Result:`); absent for PRs. `status` is one of: `ok`, `phase1_failed`, `phase2_failed`, `skipped_after_failure`. `cached` is `true` when the `.ad` came from the cache (Phase 1 skipped), `false` when Phase 1 ran fresh.
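The field rules for `status`, `cached`, and `expected` can be checked mechanically. The sketch below is one possible validator, assuming each flow entry is a plain dict that carries all three fields; the function name `check_flow` and that per-flow layout are assumptions, not part of the manifest schema.

```python
# The four status values listed in the schema description.
VALID_STATUSES = frozenset(
    {"ok", "phase1_failed", "phase2_failed", "skipped_after_failure"}
)

def check_flow(flow: dict, source_kind: str) -> list:
    """Return a list of human-readable problems for one flow entry."""
    problems = []
    if flow.get("status") not in VALID_STATUSES:
        problems.append(f"unknown status: {flow.get('status')!r}")
    # `cached` is a boolean: true when the .ad came from the cache.
    if not isinstance(flow.get("cached"), bool):
        problems.append("cached must be true or false")
    # `expected` is only populated for issues.
    if source_kind == "pr" and "expected" in flow:
        problems.append("expected should be absent for PRs")
    return problems
```

An empty list means the entry is consistent with the rules stated above; a manifest still using the older string-valued `cache` field would be flagged by the `cached` check.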
319
277
320
278
### Handoff
321
279
@@ -355,8 +313,7 @@ Hitting any cap marks the flow `phase1_failed` / `phase2_failed` and proceeds to
355
313
| Phase 1 step uninterpretable by LLM | Mark flow `phase1_failed`, log the step that failed, continue to next flow |
356
314
| Phase 1 a11y empty (0 nodes) on a screen | Use coordinate fallback; log `warnings: ["a11y_fallback:<screen>"]` |
357
315
| Phase 1 `$TEST_FLOW.ad` empty after warm-up | Mark flow `phase1_failed`, continue |
358
-
| Phase 2 `replay` fails on a step (cache hit path) | Cached `.ad` is stale - delete cache entry, mark `cache: "invalidated"`, re-run Phase 1, retry Phase 2 once. If still failing, mark `phase2_failed`. |
359
-
| Phase 2 `replay` fails on a step (cache miss path) | Selector drift between Phase 1 and Phase 2; mark flow `phase2_failed`, continue. No retry - Phase 1 just ran fresh, retrying would hit the same problem. |
316
+
| Phase 2 `replay` fails on a step | Mark flow `phase2_failed`, continue. |
360
317
| `record stop` produces 0-byte file | Retry Phase 2 once for that flow; if still empty, mark `phase2_failed` |
361
318
| Android flow exceeds 3-min cap | Mark `phase2_failed`, continue (per-flow MP4s should rarely hit this; if they do, the Tests section is too coarse-grained) |