Commit f23b054

Extract reference docs from agent-device-evidence SKILL.md
Move verbose lookup material out of the always-loaded SKILL.md into on-demand reference files. SKILL.md drops from 327 to 242 lines; the load-bearing rules stay inline with short summaries pointing at:

- references/steps-parsing.md - anchors, boilerplate strip list, flow segmentation signals, per-flow field semantics.
- references/manifest-schema.md - full manifest JSON and field definitions.
- references/error-handling.md - per-situation failure matrix.

Addresses @Julesssss' review feedback on PR Expensify#89475 asking for shorter SKILL.md and/or extraction into companion files.
1 parent e14c073 commit f23b054

4 files changed

Lines changed: 137 additions & 98 deletions

File tree

.claude/skills/agent-device-evidence/SKILL.md

Lines changed: 8 additions & 98 deletions
Original file line number | Diff line number | Diff line change
@@ -47,60 +47,11 @@ Bare numbers are rejected (PRs and issues share the GitHub number namespace; the
4747
If the only platforms matched are out of scope (e.g. an issue checks only `MacOS: Chrome / Safari`), **exit `4 PLATFORM_UNSUPPORTED`**.
4848
4. **Steps parsing** - extract the steps section and produce a flow list (see below). If the flow list is empty, **exit `3 NO_FLOWS`**.
4949

50-
## Steps parsing rules
50+
## Steps parsing
5151

52-
The **only hard rule**: steps live in a Markdown body. Where they live within that body depends on the source kind, and what counts as "structure" inside the steps section varies wildly across authors.
52+
Strip the body to its steps section (PR: `### Tests`; issue: `## Action Performed:`; fall back to whole body if no anchor matches), drop boilerplate, then ask the LLM to segment the result into a flow list `[{title, precondition?, steps[], expected?}, ...]`. Issues are typically single-flow; PRs may declare multiple via `#### Test case N:`. Flows with one verify-only step are classified `kind: still`; everything else is `kind: video`. Empty flow list -> exit `3 NO_FLOWS`.
5353

54-
### Section anchor (heuristic, with fallback)
55-
56-
Strip the body to the steps section using a list of known headings, in order:
57-
58-
| Source | Anchor (in priority order) |
59-
| --- | --- |
60-
| PR | `### Tests`, `### Test`, `## Tests` |
61-
| Issue | `## Action Performed:`, `## Repro`, `## Steps to reproduce`, `## Reproduction Steps` |
62-
63-
If no anchor matches, pass the **whole body** to the LLM and ask it to find the steps. The anchor list is a hint, not a hard contract.
64-
65-
Stop the section at the next equal-or-higher heading (e.g. for issues, `## Expected Result:` ends the steps section). Strip trailing GitHub-template footers (Upwork automation block, contributing-guide preamble, `## Workaround:`, `## Screenshots/Videos`).
66-
67-
### Boilerplate stripping
68-
69-
- "Verify that no errors appear in the JS console" line - strip wherever it appears.
70-
- Trailing `- [x] ...` checklist blocks - strip.
71-
- Preamble metadata blocks (`**Version Number:** ...`, `**Device used:** ...`, etc.) - strip.
72-
73-
### Flow segmentation (LLM-driven)
74-
75-
Pass the stripped section to the LLM and ask it to return a list of flows: `[{title, precondition?, steps[]}, ...]`. Signals it may use (all optional - the LLM picks whichever apply):
76-
77-
- Explicit separators: `#### Test case N:` / `## ...` headers, `---` rules.
78-
- Numbered-list restarts (a fresh `1.` after a `5.` typically signals a new flow).
79-
- Prose markers: "Test case N:", "Repeat with...", "Then test...", "Now do...".
80-
- State-change indicators: "Sign out, then ...", "On a fresh session, ...".
81-
82-
**Issues are typically single-flow.** Bug reports describe one repro path. The LLM should return one flow for an issue body unless it sees explicit multi-scenario structure (rare).
83-
84-
When the LLM finds a single coherent flow, the whole section is one flow. When it finds N, it produces N.
85-
86-
### Per flow
87-
88-
The LLM returns these fields:
89-
90-
- `title` - short label (header text if present, or LLM-summarized intent).
91-
- `precondition` - free-form setup metadata if the author provided one (e.g. "Account has no workspace.", "Log in with Expensifail account.").
92-
- `steps[]` - the numbered/listed items belonging to this flow, with nested `a/b/c` sub-items flattened into the parent.
93-
- `expected` (issues only) - free-form expected outcome from the issue's `## Expected Result:` block. The driver MAY use this as a final-state assertion target after the flow drives.
94-
95-
### Single-step verify-only classification
96-
97-
If a flow has exactly one step whose intent is purely a `Verify|Confirm|Check` (no preceding action), set `kind: still`. Otherwise `kind: video`. LLM judgment, not regex.
98-
99-
### Step interpretation
100-
101-
Each step's text is passed verbatim to the agent-device driver, which decides per-step whether it's a tap, fill, navigation, or assertion. If the driver cannot interpret a step, that step (and the rest of the flow) hard-fails.
102-
103-
If the LLM returns an empty flow list (body was prose-only, "N/A", "We'll test it live", or empty after stripping), exit `3 NO_FLOWS`.
54+
Full rules (anchors, boilerplate strip list, segmentation signals, per-flow field semantics): [`references/steps-parsing.md`](references/steps-parsing.md).
10455

10556
## Phase 1 cache
10657

@@ -243,40 +194,9 @@ Run output is persistent across reboots and append-only - the skill never delete
243194
244195
### Manifest schema
245196
246-
`manifest.json` at the run root:
247-
248-
```json
249-
{
250-
"source": {
251-
"kind": "pr",
252-
"number": 89475,
253-
"url": "https://github.com/Expensify/App/pull/89475",
254-
"title": "<source title>"
255-
},
256-
"platforms_requested": ["ios", "android"],
257-
"platforms_run": ["ios", "android"],
258-
"flows": {
259-
"ios": [
260-
{
261-
"id": 1,
262-
"title": "Test case 1: ...",
263-
"kind": "video",
264-
"path": "ios/flow-1.mp4",
265-
"stills": ["ios/flow-1-step-2-tap-signin.png"],
266-
"expected": "App will show error when creating new agent without name.",
267-
"status": "ok",
268-
"cached": true,
269-
"fingerprint": "a3f9b2c4...",
270-
"warnings": [],
271-
"params": {"email": "test+ci-89475-1@expensify.com"}
272-
}
273-
],
274-
"android": [...]
275-
}
276-
}
277-
```
197+
`manifest.json` at the run root captures: `source` (kind/number/url/title), `platforms_requested` vs `platforms_run`, and `flows.<platform>[]` with per-flow `id`, `title`, `kind`, `path`, `stills`, `expected` (issues only), `status` (`ok` / `phase1_failed` / `phase2_failed` / `skipped_after_failure`), `cached`, `fingerprint`, `warnings`, and any `params` the driver chose.
278198
279-
`source.kind` is `"pr"` or `"issue"`. `expected` is populated for issues (from `## Expected Result:`); absent for PRs. `status` is one of: `ok`, `phase1_failed`, `phase2_failed`, `skipped_after_failure`. `cached` is `true` when the `.ad` came from the cache (Phase 1 skipped), `false` when Phase 1 ran fresh.
199+
Full schema and field semantics: [`references/manifest-schema.md`](references/manifest-schema.md).
280200
281201
### Handoff
282202
@@ -306,19 +226,9 @@ Hitting any cap marks the flow `phase1_failed` / `phase2_failed` and proceeds to
306226

307227
## Error handling
308228

309-
| Situation | Action |
310-
| --- | --- |
311-
| Source URL missing or not a recognised PR/issue URL | Exit `8 BAD_INPUT` |
312-
| Steps section missing or empty (PR `### Tests` / issue `## Action Performed:`) | Exit `3 NO_FLOWS` |
313-
| Only out-of-scope platforms checked on issue (e.g. `MacOS: Chrome / Safari` only) | Exit `4 PLATFORM_UNSUPPORTED` |
314-
| mWeb / Desktop / Windows explicitly requested via `--platforms` | Exit `4 PLATFORM_UNSUPPORTED` |
315-
| Bring-up fails (HybridApp gate, missing dev build, Metro start, etc.) | Surface parent skill's error verbatim; exit `7 BRING_UP_FAILED` |
316-
| Phase 1 step uninterpretable by LLM | Mark flow `phase1_failed`, log the step that failed, continue to next flow |
317-
| Phase 1 a11y empty (0 nodes) on a screen | Use coordinate fallback; log `warnings: ["a11y_fallback:<screen>"]` |
318-
| Phase 1 `$TEST_FLOW.ad` empty after warm-up | Mark flow `phase1_failed`, continue |
319-
| Phase 2 `replay` fails on a step | Mark flow `phase2_failed`, continue. |
320-
| `record stop` produces 0-byte file | Retry Phase 2 once for that flow; if still empty, mark `phase2_failed` |
321-
| Android flow exceeds 3-min cap | Mark `phase2_failed`, continue (per-flow MP4s should rarely hit this; if they do, the Tests section is too coarse-grained) |
229+
Gate failures exit hard (`3 NO_FLOWS`, `4 PLATFORM_UNSUPPORTED`, `7 BRING_UP_FAILED`, `8 BAD_INPUT`). Per-flow failures during Phase 1 or Phase 2 mark the flow `phase1_failed` / `phase2_failed` and continue to the next flow - the skill never aborts the run on a single flow.
230+
231+
Full per-situation matrix: [`references/error-handling.md`](references/error-handling.md).
322232

323233
## Out of scope (do not do these)
324234

.claude/skills/agent-device-evidence/references/error-handling.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
# Error handling matrix
2+
3+
Lookup table for handling specific failure modes. Read this when a phase or gate fails and you need to choose between exit, retry, mark-and-continue, or warn-and-proceed.
4+
5+
| Situation | Action |
6+
| --- | --- |
7+
| Source URL missing or not a recognised PR/issue URL | Exit `8 BAD_INPUT` |
8+
| Steps section missing or empty (PR `### Tests` / issue `## Action Performed:`) | Exit `3 NO_FLOWS` |
9+
| Only out-of-scope platforms checked on issue (e.g. `MacOS: Chrome / Safari` only) | Exit `4 PLATFORM_UNSUPPORTED` |
10+
| mWeb / Desktop / Windows explicitly requested via `--platforms` | Exit `4 PLATFORM_UNSUPPORTED` |
11+
| Bring-up fails (HybridApp gate, missing dev build, Metro start, etc.) | Surface parent skill's error verbatim; exit `7 BRING_UP_FAILED` |
12+
| Phase 1 step uninterpretable by LLM | Mark flow `phase1_failed`, log the step that failed, continue to next flow |
13+
| Phase 1 a11y empty (0 nodes) on a screen | Use coordinate fallback; log `warnings: ["a11y_fallback:<screen>"]` |
14+
| Phase 1 `$TEST_FLOW.ad` empty after warm-up | Mark flow `phase1_failed`, continue |
15+
| Phase 2 `replay` fails on a step | Mark flow `phase2_failed`, continue |
16+
| `record stop` produces 0-byte file | Retry Phase 2 once for that flow; if still empty, mark `phase2_failed` |
17+
| Android flow exceeds 3-min cap | Mark `phase2_failed`, continue (per-flow MP4s should rarely hit this; if they do, the Tests section is too coarse-grained) |
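
The split between hard exits and mark-and-continue in the matrix can be sketched roughly as below. This is an illustration under stated assumptions, not the skill's implementation: `ExitCode`, `run_flows`, and the `phase1` / `phase2` callables are hypothetical names, and only the exit-code numbers and status strings are taken from the table.

```python
from enum import IntEnum
from typing import Callable, Dict, List


class ExitCode(IntEnum):
    """Gate failures from the matrix; each one aborts the whole run."""
    NO_FLOWS = 3
    PLATFORM_UNSUPPORTED = 4
    BRING_UP_FAILED = 7
    BAD_INPUT = 8


def run_flows(
    flows: List[dict],
    phase1: Callable[[dict], None],  # hypothetical: raises on Phase 1 failure
    phase2: Callable[[dict], None],  # hypothetical: raises on Phase 2 failure
) -> Dict[int, str]:
    """Run every flow; a failing flow is marked and the loop moves on."""
    if not flows:
        # Gate failure: nothing to run, so exit hard.
        raise SystemExit(int(ExitCode.NO_FLOWS))
    statuses: Dict[int, str] = {}
    for flow in flows:
        try:
            phase1(flow)
        except Exception:
            statuses[flow["id"]] = "phase1_failed"
            continue  # never abort the run on a single flow
        try:
            phase2(flow)
        except Exception:
            statuses[flow["id"]] = "phase2_failed"
            continue
        statuses[flow["id"]] = "ok"
    return statuses
```

The sketch leaves out the single Phase 2 retry for a 0-byte recording and the `a11y_fallback` warning path; both would sit inside the per-flow branch rather than at the gate.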
.claude/skills/agent-device-evidence/references/manifest-schema.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
1+
# Manifest schema
2+
3+
`manifest.json` is written to the run root after all platforms complete. Read this when populating or consuming the manifest.
4+
5+
```json
6+
{
7+
"source": {
8+
"kind": "pr",
9+
"number": 89475,
10+
"url": "https://github.com/Expensify/App/pull/89475",
11+
"title": "<source title>"
12+
},
13+
"platforms_requested": ["ios", "android"],
14+
"platforms_run": ["ios", "android"],
15+
"flows": {
16+
"ios": [
17+
{
18+
"id": 1,
19+
"title": "Test case 1: ...",
20+
"kind": "video",
21+
"path": "ios/flow-1.mp4",
22+
"stills": ["ios/flow-1-step-2-tap-signin.png"],
23+
"expected": "App will show error when creating new agent without name.",
24+
"status": "ok",
25+
"cached": true,
26+
"fingerprint": "a3f9b2c4...",
27+
"warnings": [],
28+
"params": {"email": "test+ci-89475-1@expensify.com"}
29+
}
30+
],
31+
"android": [...]
32+
}
33+
}
34+
```
35+
36+
## Field semantics
37+
38+
- `source.kind` - `"pr"` or `"issue"`.
39+
- `source.number` / `source.url` / `source.title` - identifying metadata, captured at fetch time.
40+
- `platforms_requested` - what the invocation asked for (via `--platforms` or inferred from the body).
41+
- `platforms_run` - the subset of requested platforms that actually executed (e.g. iOS-only when Android bring-up failed).
42+
- `flows.<platform>[]` - one entry per declared flow on that platform.
43+
44+
## Per-flow fields
45+
46+
- `id` - 1-indexed flow number, stable within a run.
47+
- `title` - flow label from the source body or LLM summary.
48+
- `kind` - `"video"` (Phase 2 records MP4) or `"still"` (single screenshot, verify-only).
49+
- `path` - artifact path relative to the run directory.
50+
- `stills` - candidate per-step PNGs captured as a Phase 1 side effect.
51+
- `expected` - free-form expected outcome from an issue's `## Expected Result:` block. Populated for issues, absent for PRs.
52+
- `status` - one of: `ok`, `phase1_failed`, `phase2_failed`, `skipped_after_failure`.
53+
- `cached` - `true` when the `.ad` came from the Phase 1 cache (Phase 1 skipped); `false` when Phase 1 ran fresh.
54+
- `fingerprint` - the `sha256(precondition + json(steps) + platform)` digest used as the Phase 1 cache key.
55+
- `warnings[]` - non-fatal annotations (e.g. `a11y_fallback:<screen>` when coordinate fallback was used).
56+
- `params` - any context-derived values the driver chose (e.g. test email), captured for reproducibility.
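
The `fingerprint` formula can be illustrated with a short sketch. The function name `flow_fingerprint` is hypothetical, and several details are assumptions the schema does not specify: a missing precondition is treated as an empty string, `steps` is serialized with `json.dumps` defaults, and the concatenation is UTF-8 encoded before hashing.

```python
import hashlib
import json
from typing import List, Optional


def flow_fingerprint(
    precondition: Optional[str], steps: List[str], platform: str
) -> str:
    """Hex digest of sha256(precondition + json(steps) + platform).

    Assumed details: None precondition -> "", default json.dumps
    formatting, UTF-8 encoding of the concatenated string.
    """
    payload = (precondition or "") + json.dumps(steps) + platform
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the platform is part of the key, the same flow produces different fingerprints on iOS and Android, so a cached `.ad` from one platform would not be reused on the other.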
.claude/skills/agent-device-evidence/references/steps-parsing.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
1+
# Steps parsing rules
2+
3+
Detailed rules for extracting a flow list from a PR or issue body. Read this when the triage gate reaches the "Steps parsing" step.
4+
5+
The **only hard rule**: steps live in a Markdown body. Where they live within that body depends on the source kind, and what counts as "structure" inside the steps section varies wildly across authors.
6+
7+
## Section anchor (heuristic, with fallback)
8+
9+
Strip the body to the steps section using a list of known headings, in order:
10+
11+
| Source | Anchor (in priority order) |
12+
| --- | --- |
13+
| PR | `### Tests`, `### Test`, `## Tests` |
14+
| Issue | `## Action Performed:`, `## Repro`, `## Steps to reproduce`, `## Reproduction Steps` |
15+
16+
If no anchor matches, pass the **whole body** to the LLM and ask it to find the steps. The anchor list is a hint, not a hard contract.
17+
18+
Stop the section at the next equal-or-higher heading (e.g. for issues, `## Expected Result:` ends the steps section). Strip trailing GitHub-template footers (Upwork automation block, contributing-guide preamble, `## Workaround:`, `## Screenshots/Videos`).
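
The anchor lookup, whole-body fallback, and "stop at the next equal-or-higher heading" rule could look roughly like this. It is a sketch only: `ANCHORS` mirrors the table above, while `extract_steps_section` and the exact, case-sensitive heading match are assumptions made for illustration. Footer stripping is not covered here.

```python
import re

# Anchor headings per source kind, in priority order (from the table above).
ANCHORS = {
    "pr": ["### Tests", "### Test", "## Tests"],
    "issue": [
        "## Action Performed:",
        "## Repro",
        "## Steps to reproduce",
        "## Reproduction Steps",
    ],
}

HEADING = re.compile(r"^(#{1,6})\s")


def extract_steps_section(body: str, kind: str) -> str:
    """Return the text under the first matching anchor, else the whole body."""
    lines = body.splitlines()
    for anchor in ANCHORS[kind]:
        level = len(anchor) - len(anchor.lstrip("#"))
        for i, line in enumerate(lines):
            if line.strip() != anchor:  # assumption: exact match only
                continue
            section = []
            for nxt in lines[i + 1:]:
                match = HEADING.match(nxt)
                # An equal-or-higher heading ends the steps section.
                if match and len(match.group(1)) <= level:
                    break
                section.append(nxt)
            return "\n".join(section).strip()
    # No anchor matched: hand the whole body to the LLM.
    return body
```

Deeper headings such as `#### Test case N:` stay inside the section, which is what lets the later segmentation step see them. A `#` line inside a fenced code block would be misread as a heading by this sketch.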
19+
20+
## Boilerplate stripping
21+
22+
- "Verify that no errors appear in the JS console" line - strip wherever it appears.
23+
- Trailing `- [x] ...` checklist blocks - strip.
24+
- Preamble metadata blocks (`**Version Number:** ...`, `**Device used:** ...`, etc.) - strip.
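
A minimal sketch of the three strip rules, assuming line-based matching. The function name and regular expressions are hypothetical; treating unchecked `- [ ]` items as part of the trailing checklist is an additional assumption not stated in the list above.

```python
import re

# Assumption: the console line is matched case-insensitively anywhere in a line.
JS_CONSOLE = re.compile(
    r"verify that no errors appear in the js console", re.IGNORECASE
)
# Assumption: checked and unchecked task-list items both count as checklist.
CHECKLIST = re.compile(r"^\s*[-*]\s+\[[ xX]\]")
# Bold "**Label:** value" lines such as "**Version Number:** ...".
METADATA = re.compile(r"^\s*\*\*[^*]+:\*\*")


def strip_boilerplate(section: str) -> str:
    """Apply the three boilerplate rules to an extracted steps section."""
    # Rule 1: drop the JS-console line wherever it appears.
    lines = [ln for ln in section.splitlines() if not JS_CONSOLE.search(ln)]
    # Rule 3: drop leading metadata lines (and blank lines between them).
    while lines and (not lines[0].strip() or METADATA.match(lines[0])):
        lines.pop(0)
    # Rule 2: drop the trailing checklist block (and surrounding blank lines).
    while lines and (not lines[-1].strip() or CHECKLIST.match(lines[-1])):
        lines.pop()
    return "\n".join(lines)
```

Only leading metadata and trailing checklist lines are removed, so a bold note or a checkbox in the middle of the steps survives.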
25+
26+
## Flow segmentation (LLM-driven)
27+
28+
Pass the stripped section to the LLM and ask it to return a list of flows: `[{title, precondition?, steps[], expected?}, ...]`. Signals it may use (all optional - the LLM picks whichever apply):
29+
30+
- Explicit separators: `#### Test case N:` / `## ...` headers, `---` rules.
31+
- Numbered-list restarts (a fresh `1.` after a `5.` typically signals a new flow).
32+
- Prose markers: "Test case N:", "Repeat with...", "Then test...", "Now do...".
33+
- State-change indicators: "Sign out, then ...", "On a fresh session, ...".
34+
35+
**Issues are typically single-flow.** Bug reports describe one repro path. The LLM should return one flow for an issue body unless it sees explicit multi-scenario structure (rare).
36+
37+
When the LLM finds a single coherent flow, the whole section is one flow. When it finds N, it produces N flows.
38+
39+
## Per flow
40+
41+
The LLM returns these fields:
42+
43+
- `title` - short label (header text if present, or LLM-summarized intent).
44+
- `precondition` - free-form setup metadata if the author provided one (e.g. "Account has no workspace.", "Log in with Expensifail account.").
45+
- `steps[]` - the numbered/listed items belonging to this flow, with nested `a/b/c` sub-items flattened into the parent.
46+
- `expected` (issues only) - free-form expected outcome from the issue's `## Expected Result:` block. The driver MAY use this as a final-state assertion target once the flow has been driven.
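
The per-flow fields map naturally onto a small record type. The sketch below assumes the LLM reply has already been JSON-decoded into a list of dicts; `Flow` and `parse_flows` are hypothetical names, and dropping entries with no usable steps is an assumption rather than a documented rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Flow:
    title: str
    steps: List[str] = field(default_factory=list)
    precondition: Optional[str] = None  # only if the author provided one
    expected: Optional[str] = None      # issues only


def parse_flows(raw: List[dict]) -> List[Flow]:
    """Turn the decoded LLM reply into Flow records."""
    flows: List[Flow] = []
    for item in raw:
        steps = [s.strip() for s in item.get("steps", []) if s.strip()]
        if not steps:
            continue  # assumption: a flow without steps is not runnable
        flows.append(
            Flow(
                title=item["title"].strip(),
                steps=steps,
                precondition=item.get("precondition"),
                expected=item.get("expected"),
            )
        )
    return flows
```

An empty result from `parse_flows` corresponds to the empty-flow-list case that ends in exit `3 NO_FLOWS`.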
47+
48+
## Single-step verify-only classification
49+
50+
If a flow has exactly one step whose intent is purely a `Verify|Confirm|Check` (no preceding action), set `kind: still`. Otherwise `kind: video`. LLM judgment, not regex.
51+
52+
## Step interpretation
53+
54+
Each step's text is passed verbatim to the agent-device driver, which decides per-step whether it's a tap, fill, navigation, or assertion. If the driver cannot interpret a step, that step (and the rest of the flow) hard-fails.
55+
56+
If the LLM returns an empty flow list (body was prose-only, "N/A", "We'll test it live", or empty after stripping), exit `3 NO_FLOWS`.
