Extract reference docs from agent-device-evidence SKILL.md
Move verbose lookup material out of the always-loaded SKILL.md into
on-demand reference files. SKILL.md drops from 327 to 242 lines; the
load-bearing rules stay inline with short summaries pointing at:
- references/steps-parsing.md - anchors, boilerplate strip list, flow
segmentation signals, per-flow field semantics.
- references/manifest-schema.md - full manifest JSON and field
definitions.
- references/error-handling.md - per-situation failure matrix.
Addresses @Julesssss' review feedback on PR Expensify#89475 asking for shorter
SKILL.md and/or extraction into companion files.
.claude/skills/agent-device-evidence/SKILL.md
Lines changed: 8 additions & 98 deletions
@@ -47,60 +47,11 @@ Bare numbers are rejected (PRs and issues share the GitHub number namespace; the
 If the only platforms matched are out of scope (e.g. an issue checks only `MacOS: Chrome / Safari`), **exit `4 PLATFORM_UNSUPPORTED`**.
 4. **Steps parsing** - extract the steps section and produce a flow list (see below). If the flow list is empty, **exit `3 NO_FLOWS`**.

-## Steps parsing rules
+## Steps parsing

-The **only hard rule**: steps live in a Markdown body. Where they live within that body depends on the source kind, and what counts as "structure" inside the steps section varies wildly across authors.
+Strip the body to its steps section (PR: `### Tests`; issue: `## Action Performed:`; fall back to whole body if no anchor matches), drop boilerplate, then ask the LLM to segment the result into a flow list `[{title, precondition?, steps[], expected?}, ...]`. Issues are typically single-flow; PRs may declare multiple via `#### Test case N:`. Flows with one verify-only step are classified `kind: still`; everything else is `kind: video`. Empty flow list -> exit `3 NO_FLOWS`.

-### Section anchor (heuristic, with fallback)
-
-Strip the body to the steps section using a list of known headings, in order:
-Pass the stripped section to the LLM and ask it to return a list of flows: `[{title, precondition?, steps[]}, ...]`. Signals it may use (all optional - the LLM picks whichever apply):
-
-- Explicit separators: `#### Test case N:` / `## ...` headers, `---` rules.
-- Numbered-list restarts (a fresh `1.` after a `5.` typically signals a new flow).
-- State-change indicators: "Sign out, then ...", "On a fresh session, ...".
-
-**Issues are typically single-flow.** Bug reports describe one repro path. The LLM should return one flow for an issue body unless it sees explicit multi-scenario structure (rare).
-
-When the LLM finds a single coherent flow, the whole section is one flow. When it finds N, it produces N.
-
-### Per flow
-
-The LLM returns these fields:
-
-- `title` - short label (header text if present, or LLM-summarized intent).
-- `precondition` - free-form setup metadata if the author provided one (e.g. "Account has no workspace.", "Log in with Expensifail account.").
-- `steps[]` - the numbered/listed items belonging to this flow, with nested `a/b/c` sub-items flattened into the parent.
-- `expected` (issues only) - free-form expected outcome from the issue's `## Expected Result:` block. The driver MAY use this as a final-state assertion target after the flow drives.
-
-### Single-step verify-only classification
-
-If a flow has exactly one step whose intent is purely a `Verify|Confirm|Check` (no preceding action), set `kind: still`. Otherwise `kind: video`. LLM judgment, not regex.
-
-### Step interpretation
-
-Each step's text is passed verbatim to the agent-device driver, which decides per-step whether it's a tap, fill, navigation, or assertion. If the driver cannot interpret a step, that step (and the rest of the flow) hard-fails.
-
-If the LLM returns an empty flow list (body was prose-only, "N/A", "We'll test it live", or empty after stripping), exit `3 NO_FLOWS`.
+Full rules (anchors, boilerplate strip list, segmentation signals, per-flow field semantics): [`references/steps-parsing.md`](references/steps-parsing.md).

 ## Phase 1 cache

@@ -243,40 +194,9 @@ Run output is persistent across reboots and append-only - the skill never delete
 `manifest.json` at the run root captures: `source` (kind/number/url/title), `platforms_requested` vs `platforms_run`, and `flows.<platform>[]` with per-flow `id`, `title`, `kind`, `path`, `stills`, `expected` (issues only), `status` (`ok` / `phase1_failed` / `phase2_failed` / `skipped_after_failure`), `cached`, `fingerprint`, `warnings`, and any `params` the driver chose.

-`source.kind` is `"pr"` or `"issue"`. `expected` is populated for issues (from `## Expected Result:`); absent for PRs. `status` is one of: `ok`, `phase1_failed`, `phase2_failed`, `skipped_after_failure`. `cached` is `true` when the `.ad` came from the cache (Phase 1 skipped), `false` when Phase 1 ran fresh.
+Full schema and field semantics: [`references/manifest-schema.md`](references/manifest-schema.md).
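
The field constraints stated for `flows.<platform>[]` entries could be checked with something like the sketch below. It is illustrative only: `check_flow_entry` and the two constant sets are hypothetical names, not part of the skill, and the full schema in `references/manifest-schema.md` may impose rules not shown here.

```python
# Allowed values as listed in the manifest summary.
VALID_STATUS = {"ok", "phase1_failed", "phase2_failed", "skipped_after_failure"}
VALID_KIND = {"still", "video"}


def check_flow_entry(entry: dict, source_kind: str) -> list:
    """Return a list of problems found in one flows.<platform>[] entry.

    Hypothetical helper: it only covers the constraints spelled out in the
    summary (status values, kind values, boolean `cached`, issues-only `expected`).
    """
    problems = []
    if entry.get("status") not in VALID_STATUS:
        problems.append(f"unknown status: {entry.get('status')!r}")
    if entry.get("kind") not in VALID_KIND:
        problems.append(f"unknown kind: {entry.get('kind')!r}")
    if not isinstance(entry.get("cached"), bool):
        problems.append("cached must be true or false")
    if source_kind == "pr" and "expected" in entry:
        problems.append("expected is populated for issues only")
    return problems
```

An empty list means the entry satisfies these four checks; it says nothing about fields such as `fingerprint` or `params`, whose shapes are not described in this diff.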

 ### Handoff

@@ -306,19 +226,9 @@ Hitting any cap marks the flow `phase1_failed` / `phase2_failed` and proceeds to

 ## Error handling

-| Situation | Action |
-| --- | --- |
-| Source URL missing or not a recognised PR/issue URL | Exit `8 BAD_INPUT` |
-| Phase 1 step uninterpretable by LLM | Mark flow `phase1_failed`, log the step that failed, continue to next flow |
-| Phase 1 a11y empty (0 nodes) on a screen | Use coordinate fallback; log `warnings: ["a11y_fallback:<screen>"]` |
-| Phase 1 `$TEST_FLOW.ad` empty after warm-up | Mark flow `phase1_failed`, continue |
-| Phase 2 `replay` fails on a step | Mark flow `phase2_failed`, continue. |
-| `record stop` produces 0-byte file | Retry Phase 2 once for that flow; if still empty, mark `phase2_failed` |
-| Android flow exceeds 3-min cap | Mark `phase2_failed`, continue (per-flow MP4s should rarely hit this; if they do, the Tests section is too coarse-grained) |
+Gate failures exit hard (`3 NO_FLOWS`, `4 PLATFORM_UNSUPPORTED`, `7 BRING_UP_FAILED`, `8 BAD_INPUT`). Per-flow failures during Phase 1 or Phase 2 mark the flow `phase1_failed` / `phase2_failed` and continue to the next flow - the skill never aborts the run on a single flow.
+
+Full per-situation matrix: [`references/error-handling.md`](references/error-handling.md).
references/error-handling.md
+Lookup table for handling specific failure modes. Read this when a phase or gate fails and you need to choose between exit, retry, mark-and-continue, or warn-and-proceed.
+
+| Situation | Action |
+| --- | --- |
+| Source URL missing or not a recognised PR/issue URL | Exit `8 BAD_INPUT` |
+| Only out-of-scope platforms checked on issue (e.g. `MacOS: Chrome / Safari` only) | Exit `4 PLATFORM_UNSUPPORTED` |
+| mWeb / Desktop / Windows explicitly requested via `--platforms` | Exit `4 PLATFORM_UNSUPPORTED` |
+| Bring-up fails (HybridApp gate, missing dev build, Metro start, etc.) | Surface parent skill's error verbatim; exit `7 BRING_UP_FAILED` |
+| Phase 1 step uninterpretable by LLM | Mark flow `phase1_failed`, log the step that failed, continue to next flow |
+| Phase 1 a11y empty (0 nodes) on a screen | Use coordinate fallback; log `warnings: ["a11y_fallback:<screen>"]` |
+| Phase 1 `$TEST_FLOW.ad` empty after warm-up | Mark flow `phase1_failed`, continue |
+| Phase 2 `replay` fails on a step | Mark flow `phase2_failed`, continue. |
+| `record stop` produces 0-byte file | Retry Phase 2 once for that flow; if still empty, mark `phase2_failed` |
+| Android flow exceeds 3-min cap | Mark `phase2_failed`, continue (per-flow MP4s should rarely hit this; if they do, the Tests section is too coarse-grained) |
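
The split between gate failures (hard exit) and per-flow failures (mark and continue) might be sketched as follows. This is a minimal illustration under stated assumptions: `ExitCode` mirrors the exit codes named in the matrix, while `PhaseError`, `run_flows`, and the `drive` callback are hypothetical names invented for the example, not the skill's actual interface.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Gate-failure exit codes named in the matrix above."""
    NO_FLOWS = 3
    PLATFORM_UNSUPPORTED = 4
    BRING_UP_FAILED = 7
    BAD_INPUT = 8


class PhaseError(Exception):
    """Hypothetical error raised when Phase 1 or Phase 2 fails for one flow."""

    def __init__(self, phase: int, detail: str):
        super().__init__(detail)
        self.phase = phase


def run_flows(flows, drive):
    """Drive each flow; record a failed status instead of aborting the run."""
    results = []
    for flow in flows:
        try:
            drive(flow)  # assumed callback that runs both phases for one flow
            status = "ok"
        except PhaseError as err:
            # Per-flow failure: mark the flow and continue with the next one.
            status = f"phase{err.phase}_failed"
        results.append({"id": flow["id"], "status": status})
    return results
```

A gate failure would instead end the process with the matching code, for example `sys.exit(ExitCode.BAD_INPUT)`, before any flow is driven. The retry-once rule for a 0-byte recording and the `skipped_after_failure` status are deliberately left out of this sketch.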
references/steps-parsing.md
+Detailed rules for extracting a flow list from a PR or issue body. Read this when the triage gate reaches the "Steps parsing" step.
+
+The **only hard rule**: steps live in a Markdown body. Where they live within that body depends on the source kind, and what counts as "structure" inside the steps section varies wildly across authors.
+
+## Section anchor (heuristic, with fallback)
+
+Strip the body to the steps section using a list of known headings, in order:
+Pass the stripped section to the LLM and ask it to return a list of flows: `[{title, precondition?, steps[]}, ...]`. Signals it may use (all optional - the LLM picks whichever apply):
+
+- Explicit separators: `#### Test case N:` / `## ...` headers, `---` rules.
+- Numbered-list restarts (a fresh `1.` after a `5.` typically signals a new flow).
+- State-change indicators: "Sign out, then ...", "On a fresh session, ...".
+
+**Issues are typically single-flow.** Bug reports describe one repro path. The LLM should return one flow for an issue body unless it sees explicit multi-scenario structure (rare).
+
+When the LLM finds a single coherent flow, the whole section is one flow. When it finds N, it produces N.
+
+## Per flow
+
+The LLM returns these fields:
+
+- `title` - short label (header text if present, or LLM-summarized intent).
+- `precondition` - free-form setup metadata if the author provided one (e.g. "Account has no workspace.", "Log in with Expensifail account.").
+- `steps[]` - the numbered/listed items belonging to this flow, with nested `a/b/c` sub-items flattened into the parent.
+- `expected` (issues only) - free-form expected outcome from the issue's `## Expected Result:` block. The driver MAY use this as a final-state assertion target after the flow drives.
+
+## Single-step verify-only classification
+
+If a flow has exactly one step whose intent is purely a `Verify|Confirm|Check` (no preceding action), set `kind: still`. Otherwise `kind: video`. LLM judgment, not regex.
+
+## Step interpretation
+
+Each step's text is passed verbatim to the agent-device driver, which decides per-step whether it's a tap, fill, navigation, or assertion. If the driver cannot interpret a step, that step (and the rest of the flow) hard-fails.
+
+If the LLM returns an empty flow list (body was prose-only, "N/A", "We'll test it live", or empty after stripping), exit `3 NO_FLOWS`.
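
The section-anchor step with its whole-body fallback could look roughly like this. It is a sketch, not the skill's implementation: `strip_to_steps_section` and `ANCHORS` are hypothetical names, only the two anchors quoted in the SKILL.md summary are used (the full ordered heading list is not shown in this diff), and the rule that a heading of the same or higher level ends the section is an assumption. Segmentation into flows stays with the LLM and is not attempted here.

```python
import re

# Only the anchors quoted in the SKILL.md summary; the complete list is assumed
# to live in references/steps-parsing.md.
ANCHORS = {"pr": "### Tests", "issue": "## Action Performed:"}


def strip_to_steps_section(body: str, kind: str) -> str:
    """Return the text under the anchor heading, or the whole body if absent."""
    anchor = ANCHORS[kind]
    level = len(anchor) - len(anchor.lstrip("#"))
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == anchor.lower():
            section = []
            for nxt in lines[i + 1:]:
                heading = re.match(r"(#+)\s", nxt)
                # Assumption: a heading at the same or a higher level ends the section.
                if heading and len(heading.group(1)) <= level:
                    break
                section.append(nxt)
            return "\n".join(section).strip()
    return body.strip()  # fallback: no anchor matched, use the whole body
```

Deeper headings such as `#### Test case N:` remain inside the returned section, so the separators the LLM may use for segmentation are preserved.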