Commit b31d6b2

remove cache and redundant allowed tools
1 parent aec996f commit b31d6b2

1 file changed

Lines changed: 17 additions & 60 deletions

.claude/skills/agent-device-evidence/SKILL.md

@@ -1,7 +1,7 @@
 ---
 name: agent-device-evidence
 description: Records iOS/Android native MP4 evidence for test/repro flows extracted from an Expensify GitHub PR or issue. Use when the user asks to "record the flow for PR #X", "capture mobile evidence for issue #Y", or "produce screenshots/videos for <PR or issue URL>". Mobile-native only - declines mWeb and Desktop.
-allowed-tools: Bash(agent-device *) Bash(gh pr view *) Bash(gh issue view *) Bash(gh api *) Bash(mkdir -p *) Bash(rm -rf *) Bash(ls *) Bash(file *) Bash(test *) Bash(date *) Read Write
+allowed-tools: Bash(agent-device *) Bash(gh pr view *) Bash(gh issue view *) Bash(mkdir -p *) Bash(file *) Bash(test *) Bash(date *) Read Write
 ---

 # agent-device-evidence
@@ -27,8 +27,6 @@ HybridApp-only (the parent skill's pre-flight enforces this). Standalone (non-Hy
 | Source URL (PR or issue) | First positional arg, e.g. `https://github.com/Expensify/App/pull/89475` or `.../issues/89855` | Yes |
 | `--platforms ios,android` | Flag | No (default: derived) |
 | `-e KEY=VALUE` step-param overrides | Repeatable | No |
-| `--no-cache` | Flag | No (default: cache enabled) - forces fresh Phase 1, bypasses `.ad` cache |
-| `--cache-clear` | Flag | No - wipes the entire `.ad` cache before running |

 Bare numbers are rejected (PRs and issues share the GitHub number namespace; the URL path is the safe disambiguator). No interactive prompts.

@@ -62,7 +60,7 @@ Strip the body to the steps section using a list of known headings, in order:
 | PR | `### Tests`, `### Test`, `## Tests` |
 | Issue | `## Action Performed:`, `## Repro`, `## Steps to reproduce`, `## Reproduction Steps` |

-If no anchor matches, pass the **whole body** to the LLM and ask it to find the steps. The anchor list is a hint to reduce token cost on the 99% case, not a hard contract.
+If no anchor matches, pass the **whole body** to the LLM and ask it to find the steps. The anchor list is a hint, not a hard contract.

 Stop the section at the next equal-or-higher heading (e.g. for issues, `## Expected Result:` ends the steps section). Strip trailing GitHub-template footers (Upwork automation block, contributing-guide preamble, `## Workaround:`, `## Screenshots/Videos`).
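
The extraction rule in the hunk above (try the known headings in order, stop at the next equal-or-higher heading, fall back to the whole body) could be sketched in Python roughly as follows. This is an illustrative sketch, not the skill's implementation: `ANCHORS`, `extract_steps_section`, and the exact-line heading match are assumptions, and the template-footer stripping is left out.

```python
import re

# Known headings, tried in order (taken from the table in SKILL.md).
ANCHORS = {
    "pr": ["### Tests", "### Test", "## Tests"],
    "issue": [
        "## Action Performed:",
        "## Repro",
        "## Steps to reproduce",
        "## Reproduction Steps",
    ],
}

# Matches a Markdown ATX heading and captures its leading '#' run.
HEADING = re.compile(r"^(#{1,6})\s")


def extract_steps_section(body: str, kind: str) -> str:
    """Return the steps section, or the whole body if no anchor matches.

    Assumption: an anchor matches only when a line equals it exactly
    (after trimming whitespace).
    """
    lines = body.splitlines()
    for anchor in ANCHORS[kind]:
        for i, line in enumerate(lines):
            if line.strip() != anchor:
                continue
            level = len(anchor) - len(anchor.lstrip("#"))
            section = []
            for nxt in lines[i + 1:]:
                m = HEADING.match(nxt)
                # Stop at the next heading of equal or higher rank.
                if m and len(m.group(1)) <= level:
                    break
                section.append(nxt)
            return "\n".join(section).strip()
    # No anchor matched: hand the whole body to the LLM.
    return body
```

Deeper headings inside the section (for example a `###` under `## Action Performed:`) are kept, since only an equal-or-higher heading ends it.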

@@ -104,55 +102,16 @@ Each step's text is passed verbatim to the agent-device driver, which decides pe

 If the LLM returns an empty flow list (body was prose-only, "N/A", "We'll test it live", or empty after stripping), exit `3 NO_FLOWS`.

-## Phase 1 cache (skip warm-up when flow text is unchanged)
+## Phase 1 cache

-Phase 1 is the expensive part - it runs the LLM-driven exploration loop to produce a deterministic `.ad` script. Phase 2 is just `agent-device replay` and is cheap. When a flow's text and target environment are identical to a prior run's, reuse the cached `.ad` and skip Phase 1 entirely.
+Simple map: flow steps → `.ad` script. If the steps haven't changed, reuse the cached script and skip the warm-up.

-### Cache layout
+- Path: `~/.cache/agent-device-evidence/.ad-cache/<fingerprint>.ad`
+- Fingerprint: `sha256(precondition + json(steps) + platform)`. Platform is included so iOS and Android don't share an entry (different selectors).
+- Hit → copy to `$TEST_FLOW.ad`, mark `cached: true` in the manifest, skip Phase 1, proceed to Phase 2.
+- Miss → run Phase 1, write the script to the cache on success.

-```
-~/.cache/agent-device-evidence/.ad-cache/
-├── <fingerprint>.ad # the cached Phase 1 script
-└── <fingerprint>.meta.json # {created_ts, original_pr, last_used_ts, hits}
-```
-
-Content-addressable, **shared across PRs**. Two PRs whose Tests sections both contain the same "Sign in and create an expense" flow share the same cache entry.
-
-### Cache key
-
-```
-fingerprint = sha256(
-flow.precondition || "" +
-json(flow.steps) +
-platform +
-bundle_id +
-agent_device_version
-)
-```
-
-Fields included by design:
-- **`flow.steps`** - if the steps change at all, the fingerprint changes. This is the primary correctness signal.
-- **`flow.precondition`** - drives setup behavior; affects what the `.ad` script contains.
-- **`platform`** - iOS and Android need separate scripts (different selectors).
-- **`bundle_id`** - HybridApp vs standalone (and dev vs prod) render differently.
-- **`agent_device_version`** - replay semantics can change between CLI versions; pinning to the running version protects against subtle drift.
-
-Fields NOT included (intentionally):
-- **`flow.title`** - human-readable label only; doesn't affect actions.
-- **PR number / SHA** - we want sharing across PRs. Correctness is enforced at replay time, not at lookup time.
-- **App build SHA** - hard to extract reliably; relying on Phase 2 self-healing instead (see below).
-
-### Lookup, write, invalidate
-
-For each flow, in order:
-
-1. **Compute fingerprint** from flow + platform + bundle_id + CLI version.
-2. **Look up** `~/.cache/agent-device-evidence/.ad-cache/<fingerprint>.ad`:
-- **Hit** (and `--no-cache` is not set): copy cached `.ad` to `$TEST_FLOW.ad`, mark `cache: "hit"` in the manifest, **skip Phase 1 entirely**, proceed to Phase 2.
-- **Miss** (or `--no-cache`): mark `cache: "miss"`, run Phase 1 normally.
-3. **On Phase 1 success** (cache miss path): write `$TEST_FLOW.ad` to the cache, write `<fingerprint>.meta.json` with `{created_ts, original_pr, hits: 1}`.
-4. **On Phase 2 replay failure** (cache hit path only): the cached script is stale (UI changed under it). Delete the cache entry, mark the flow `cache: "invalidated"`, re-run Phase 1 fresh, retry Phase 2 once. If the retry still fails, mark `phase2_failed`.
-5. **On Phase 2 success** (cache hit path): bump `last_used_ts` and increment `hits` in the meta file.
+The skill does not delete, invalidate, or retry cache entries. If a cached `.ad` is stale, the flow is marked `phase2_failed`. To recover, edit the steps (which changes the fingerprint) or wipe `~/.cache/agent-device-evidence/.ad-cache/` externally.

 ## Capture loop (per flow per platform)
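
The simplified Phase 1 cache described in the hunk above (fingerprint over precondition, steps, and platform; copy on hit; write on Phase 1 success) might look roughly like the Python sketch below. The helper names `fingerprint`, `lookup`, and `store` are hypothetical, and the exact serialization (plain concatenation, compact JSON, UTF-8) is an assumption, since SKILL.md only gives the formula `sha256(precondition + json(steps) + platform)`.

```python
import hashlib
import json
import shutil
from pathlib import Path

# Cache location from SKILL.md.
CACHE_DIR = Path.home() / ".cache" / "agent-device-evidence" / ".ad-cache"


def fingerprint(precondition, steps, platform):
    """sha256(precondition + json(steps) + platform) as a hex digest.

    Assumption: fields are concatenated without separators and the steps
    are serialized as compact JSON.
    """
    payload = (
        (precondition or "")
        + json.dumps(steps, separators=(",", ":"), ensure_ascii=False)
        + platform
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def lookup(fp, dest, cache_dir=CACHE_DIR):
    """On a hit, copy the cached script to dest and return True."""
    cached = Path(cache_dir) / f"{fp}.ad"
    if cached.is_file():
        shutil.copyfile(cached, dest)
        return True  # caller records `cached: true` and skips Phase 1
    return False  # caller runs Phase 1, then calls store() on success


def store(fp, script, cache_dir=CACHE_DIR):
    """Write a freshly generated .ad script into the cache."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(script, cache_dir / f"{fp}.ad")
```

Nothing here deletes or invalidates entries, which matches the append-only behavior the commit introduces: a stale script simply leads to `phase2_failed` until the steps change or the cache directory is wiped externally.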

@@ -169,21 +128,20 @@ Two phases per flow. Lifecycle delegated to the parent skill's bring-up. Phase 1
 agent-device close
 ```

-3. **Set up run directory** - persistent cache, latest-run-wins:
+3. **Set up run directory** - persistent, append-only:
 ```bash
 PR_NUM=<num>; RUN_TS=$(date -u +%Y%m%dT%H%M%SZ)
 RUN_DIR="$HOME/.cache/agent-device-evidence/$PR_NUM/$RUN_TS"
 mkdir -p "$RUN_DIR/ios" "$RUN_DIR/android"
-# Optional: rm -rf prior runs for this PR before mkdir to keep disk lean
 ```

 ### Phase 1 - Warm-up (per flow, no camera)

 Goal: produce a deterministic `.ad` script of the successful command sequence, plus per-step still candidates. Drives autonomously from cold start. No recording.

-**Skip if cached.** Before any device work, consult the [Phase 1 cache](#phase-1-cache-skip-warm-up-when-flow-text-is-unchanged). On cache hit, copy the cached `.ad` to `$TEST_FLOW.ad`, log `cache: "hit"` to the manifest, and proceed straight to Phase 2 for this flow.
+**Skip if cached.** Before any device work, consult the [Phase 1 cache](#phase-1-cache). On hit, copy the cached `.ad` to `$TEST_FLOW.ad`, mark `cached: true` in the manifest, and proceed straight to Phase 2.

-On cache miss (or `--no-cache`):
+On cache miss:

 1. **Open the app** with the bring-up's resolved values:
 ```bash
@@ -245,7 +203,7 @@ Goal: clean MP4 of only the test-flow steps. No snapshots on camera, no retries,
 || { mark phase2_failed; continue }
 ```

-**On Phase 2 replay failure (cache hit path only):** the cached `.ad` is stale. Delete `<fingerprint>.ad` and `<fingerprint>.meta.json`, mark `cache: "invalidated"`, re-run Phase 1 fresh, and retry Phase 2 once. If the retry still fails, mark `phase2_failed`. (Cache miss path failures don't trigger a retry - the freshly generated `.ad` failed on its own first replay, so retrying will hit the same problem.)
+**On Phase 2 replay failure:** mark the flow `phase2_failed` and continue to the next flow.

 ### Multi-flow chunking

@@ -278,7 +236,7 @@ For flows classified `kind: still`:
 └── ...
 ```

-Run output is persistent across reboots. The skill purges prior runs for the same source at the start of each new run (latest-run-wins; no concurrent locking; single-user assumption). The `.ad-cache/` directory is **not** purged on per-source runs - it's shared across sources and self-heals on Phase 2 replay failure.
+Run output is persistent across reboots and append-only - the skill never deletes prior runs or cache entries.

 ### Manifest schema

@@ -304,7 +262,7 @@ Run output is persistent across reboots. The skill purges prior runs for the sam
 "stills": ["ios/flow-1-step-2-tap-signin.png"],
 "expected": "App will show error when creating new agent without name.",
 "status": "ok",
-"cache": "hit",
+"cached": true,
 "fingerprint": "a3f9b2c4...",
 "warnings": [],
 "params": {"email": "test+ci-89475-1@expensify.com"}
@@ -315,7 +273,7 @@ Run output is persistent across reboots. The skill purges prior runs for the sam
 }
 ```

-`source.kind` is `"pr"` or `"issue"`. `expected` is populated for issues (from `## Expected Result:`); absent for PRs. `status` is one of: `ok`, `phase1_failed`, `phase2_failed`, `skipped_after_failure`. `cache` is one of: `"hit"` (cached `.ad` reused, Phase 1 skipped), `"miss"` (no cache, Phase 1 ran fresh), `"invalidated"` (cache hit but Phase 2 replay failed; entry deleted, Phase 1 re-ran fresh, Phase 2 retried), `"bypassed"` (`--no-cache` flag).
+`source.kind` is `"pr"` or `"issue"`. `expected` is populated for issues (from `## Expected Result:`); absent for PRs. `status` is one of: `ok`, `phase1_failed`, `phase2_failed`, `skipped_after_failure`. `cached` is `true` when the `.ad` came from the cache (Phase 1 skipped), `false` when Phase 1 ran fresh.

 ### Handoff

@@ -355,8 +313,7 @@ Hitting any cap marks the flow `phase1_failed` / `phase2_failed` and proceeds to
 | Phase 1 step uninterpretable by LLM | Mark flow `phase1_failed`, log the step that failed, continue to next flow |
 | Phase 1 a11y empty (0 nodes) on a screen | Use coordinate fallback; log `warnings: ["a11y_fallback:<screen>"]` |
 | Phase 1 `$TEST_FLOW.ad` empty after warm-up | Mark flow `phase1_failed`, continue |
-| Phase 2 `replay` fails on a step (cache hit path) | Cached `.ad` is stale - delete cache entry, mark `cache: "invalidated"`, re-run Phase 1, retry Phase 2 once. If still failing, mark `phase2_failed`. |
-| Phase 2 `replay` fails on a step (cache miss path) | Selector drift between Phase 1 and Phase 2; mark flow `phase2_failed`, continue. No retry - Phase 1 just ran fresh, retrying would hit the same problem. |
+| Phase 2 `replay` fails on a step | Mark flow `phase2_failed`, continue. |
 | `record stop` produces 0-byte file | Retry Phase 2 once for that flow; if still empty, mark `phase2_failed` |
 | Android flow exceeds 3-min cap | Mark `phase2_failed`, continue (per-flow MP4s should rarely hit this; if they do, the Tests section is too coarse-grained) |
