Rename skill from agent-device-pr-media to agent-device-evidence

kacper-mikolajczak · kacper-mikolajczak · commit afd1b385f9b9 · 2026-05-07T06:37:19.000+02:00
The old name baked in "PR" and "media" - both became misleading once
issues entered scope as a valid input source and the skill's output
grew beyond MP4s (stills + JSON manifest).

`agent-device-evidence` keeps the parent-skill prefix, drops the
source-kind specifier, and uses "evidence" to cover all outputs.

Cache path moved from ~/.cache/agent-device-pr-media/ to
~/.cache/agent-device-evidence/. Prior caches under the old path are
orphaned; users can `rm -rf ~/.cache/agent-device-pr-media` to
reclaim disk - no migration script for v1.
diff --git a/.claude/skills/agent-device-evidence/ISSUES.md b/.claude/skills/agent-device-evidence/ISSUES.md
@@ -1,4 +1,4 @@
-# agent-device-pr-media issues
+# agent-device-evidence issues
 
 ## Resolved by v1 design (SKILL.md)
 
@@ -41,7 +41,7 @@
 - **`### Tests` is the only hard structural anchor.** Everything inside it is freeform - authors use single numbered lists, restarted numbered lists, h4 sub-headers, prose, "N/A", or leave it empty. The skill does **not** enforce a separator convention. Flow segmentation is LLM-driven; the LLM uses whatever signals are present (h4 headers, numbered restarts, prose markers like "Test case N:" / "Repeat with...", state-change cues). Sample (9 recent merged PRs, 2026-05-07): 0/9 used `#### Test case N:` headers - that convention exists (e.g. PR #89743) but is a minority pattern, not a default.
 - **Platform restriction is opt-in** via PR title or Tests prose ("iOS only", "Android only", "On iOS:"). Default is both. No file-path heuristics, no PR-checklist parsing.
 - **Per-flow artifact**: one MP4 per flow per platform (or one PNG for verify-only single-step flows). Not one big MP4.
-- **Persistent cache**: `~/.cache/agent-device-pr-media/<pr-num>/<run-ts>/`. Survives reboots; latest-run-wins.
+- **Persistent cache**: `~/.cache/agent-device-evidence/<pr-num>/<run-ts>/`. Survives reboots; latest-run-wins.
 
 ## Generalize input source: PR or issue (added 2026-05-07)
 
@@ -56,14 +56,14 @@
 - **`expected` field added to per-flow manifest** (issues only) - populated from `## Expected Result:`. The driver MAY use it as a final-state assertion target.
 - **New exit code `8 BAD_INPUT`** for malformed / non-PR-non-issue source URLs. Replaces the old exit `2 SKIP` behavior.
 - **Run-output dir generalized** from `<pr-num>/` to `<source-kind>-<source-num>/` (e.g. `pr-89475/`, `issue-89855/`).
-- **Skill name not yet aligned with broadened scope.** Directory + cache path + cross-links still say `agent-device-pr-media`. Rename to something like `agent-device-flow-evidence` or `agent-device-test-recorder` is a candidate follow-up.
+- **Skill renamed** from `agent-device-pr-media` to `agent-device-evidence` to match the broadened scope (PR + issue inputs, not just PRs). Cache path moved from `~/.cache/agent-device-pr-media/` to `~/.cache/agent-device-evidence/`. Existing prior caches under the old path are orphaned - users can `rm -rf ~/.cache/agent-device-pr-media` to reclaim disk; not worth a migration script for v1.
 
 Reference: https://github.com/Expensify/App/issues/89855 (and 6 other recent bug-tagged issues sampled 2026-05-07) - all follow the `## Action Performed:` + `## Expected Result:` + `## Platforms:` template consistently.
 
 ## Phase 1 cache (added 2026-05-07)
 
 - **Problem**: Phase 1 (LLM-driven exploration) is the expensive part; Phase 2 is just `agent-device replay`. Re-running on a PR whose Tests steps haven't changed wastes Phase 1's full cost on every invocation.
-- **Solution**: content-addressable `.ad` cache at `~/.cache/agent-device-pr-media/.ad-cache/<fingerprint>.ad`, shared across PRs.
+- **Solution**: content-addressable `.ad` cache at `~/.cache/agent-device-evidence/.ad-cache/<fingerprint>.ad`, shared across PRs.
 - **Fingerprint** = `sha256(precondition + json(steps) + platform + bundle_id + agent_device_version)`. Title is excluded (label only). PR number / app SHA excluded so the cache shares across PRs; correctness enforced at replay time, not lookup time.
 - **Self-healing**: on Phase 2 replay failure (cache hit path only), invalidate the entry, re-run Phase 1, retry Phase 2 once. Stale entries get evicted lazily as the underlying app changes.
 - **Manifest**: per-flow `cache: "hit" | "miss" | "invalidated" | "bypassed"` and `fingerprint` so reviewers can see what was reused.
diff --git a/.claude/skills/agent-device-evidence/SKILL.md b/.claude/skills/agent-device-evidence/SKILL.md
@@ -1,10 +1,10 @@
 ---
-name: agent-device-pr-media
+name: agent-device-evidence
 description: Records iOS/Android native MP4 evidence for test/repro flows extracted from an Expensify GitHub PR or issue. Use when the user asks to "record the flow for PR #X", "capture mobile evidence for issue #Y", or "produce screenshots/videos for <PR or issue URL>". Mobile-native only - declines mWeb and Desktop.
 allowed-tools: Bash(agent-device *) Bash(gh pr view *) Bash(gh issue view *) Bash(gh api *) Bash(mkdir -p *) Bash(rm -rf *) Bash(ls *) Bash(file *) Bash(test *) Bash(date *) Read Write
 ---
 
-# agent-device-pr-media
+# agent-device-evidence
 
 Records `iOS: Native` and `Android: Native` MP4 evidence for the test or repro steps declared in an Expensify GitHub **PR or issue**. The source of truth is the test/repro steps themselves, not the surrounding code or context - the skill works equally well on a PR's `### Tests` section, an issue's `## Action Performed:` block, or any future Markdown body where steps are clearly authored.
 
@@ -111,7 +111,7 @@ Phase 1 is the expensive part - it runs the LLM-driven exploration loop to produ
 ### Cache layout
 
 ```
-~/.cache/agent-device-pr-media/.ad-cache/
+~/.cache/agent-device-evidence/.ad-cache/
 ├── <fingerprint>.ad         # the cached Phase 1 script
 └── <fingerprint>.meta.json  # {created_ts, original_pr, last_used_ts, hits}
 ```
@@ -147,7 +147,7 @@ Fields NOT included (intentionally):
 For each flow, in order:
 
 1. **Compute fingerprint** from flow + platform + bundle_id + CLI version.
-2. **Look up** `~/.cache/agent-device-pr-media/.ad-cache/<fingerprint>.ad`:
+2. **Look up** `~/.cache/agent-device-evidence/.ad-cache/<fingerprint>.ad`:
    - **Hit** (and `--no-cache` is not set): copy cached `.ad` to `$TEST_FLOW.ad`, mark `cache: "hit"` in the manifest, **skip Phase 1 entirely**, proceed to Phase 2.
    - **Miss** (or `--no-cache`): mark `cache: "miss"`, run Phase 1 normally.
 3. **On Phase 1 success** (cache miss path): write `$TEST_FLOW.ad` to the cache, write `<fingerprint>.meta.json` with `{created_ts, original_pr, hits: 1}`.
@@ -172,7 +172,7 @@ Two phases per flow. Lifecycle delegated to the parent skill's bring-up. Phase 1
 3. **Set up run directory** - persistent cache, latest-run-wins:
    ```bash
    PR_NUM=<num>; RUN_TS=$(date -u +%Y%m%dT%H%M%SZ)
-   RUN_DIR="$HOME/.cache/agent-device-pr-media/$PR_NUM/$RUN_TS"
+   RUN_DIR="$HOME/.cache/agent-device-evidence/$PR_NUM/$RUN_TS"
    mkdir -p "$RUN_DIR/ios" "$RUN_DIR/android"
    # Optional: rm -rf prior runs for this PR before mkdir to keep disk lean
    ```
@@ -207,7 +207,7 @@ On cache miss (or `--no-cache`):
    test -s "$TEST_FLOW.ad" || { record per-flow status "phase1_failed: empty script"; continue }
    ```
 
-7. **Write to cache** - on success, copy `$TEST_FLOW.ad` to `~/.cache/agent-device-pr-media/.ad-cache/<fingerprint>.ad` and write the meta sidecar.
+7. **Write to cache** - on success, copy `$TEST_FLOW.ad` to `~/.cache/agent-device-evidence/.ad-cache/<fingerprint>.ad` and write the meta sidecar.
 
 ### Phase 2 - Recording (per flow, deterministic replay)
 
@@ -262,7 +262,7 @@ For flows classified `kind: still`:
 ### Run-output layout
 
 ```
-~/.cache/agent-device-pr-media/
+~/.cache/agent-device-evidence/
 ├── .ad-cache/                            # cross-source Phase 1 cache (see "Phase 1 cache")
 │   ├── <fingerprint>.ad
 │   └── <fingerprint>.meta.json