Commit df8cc2c

Merge branch 'main' of github.com:Expensify/App into borys3kk-fix-only-first-expense-is-highlighted
2 parents 88e9426 + 045ff9b commit df8cc2c

1,465 files changed

Lines changed: 53617 additions & 20729 deletions

Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
---
name: agent-device-evidence
description: Records iOS/Android native MP4 evidence for test/repro flows extracted from an Expensify GitHub PR or issue. Use when the user asks to "record the flow for PR #X", "capture mobile evidence for issue #Y", or "produce screenshots/videos for <PR or issue URL>". Mobile-native only - declines mWeb and Desktop.
allowed-tools: Bash(agent-device *) Bash(gh pr view *) Bash(gh issue view *) Bash(mkdir -p *) Bash(file *) Bash(test *) Bash(date *) Read Write
---

# agent-device-evidence

Records `iOS: Native` and `Android: Native` MP4 evidence for the test or repro steps declared in an Expensify GitHub **PR or issue**. The source of truth is the test/repro steps themselves, not the surrounding code or context - the skill works equally well on a PR's `### Tests` section, an issue's `## Action Performed:` block, or any future Markdown body where steps are clearly authored.

Specializes the [`agent-device`](../agent-device/SKILL.md) skill: delegates device lifecycle (bundle ID, Metro, device pick, session, open) to its [Bring-up](../agent-device/SKILL.md#bring-up), then captures one artifact per declared flow per platform, writes a JSON manifest, and surfaces local file paths.

The skill is **autonomous and non-interactive**. It never pauses for user input mid-run. All inputs are provided at invocation time; all failures surface as structured errors with exit codes.

HybridApp-only (the parent skill's pre-flight enforces this). Standalone (non-HybridApp) builds are out of scope - production mobile evidence runs against HybridApp.

## Scope

**In scope:** `iOS: Native` (iOS Simulator), `Android: Native` (Android Emulator), HybridApp dev build only. Inputs may come from PRs or issues - the skill does not gate on code changes.

**Out of scope:**

- `Android: mWeb Chrome`, `iOS: mWeb Safari`, `iOS: mWeb Chrome`, `Windows: Chrome`, `MacOS: Chrome / Safari`. Decline with exit `4 PLATFORM_UNSUPPORTED` and point to a browser-driver skill (`playwright-app-testing`).
- Standalone (non-HybridApp) builds. Decline with exit `7 BRING_UP_FAILED` per the parent skill's gate.

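The in-scope / out-of-scope split can be sketched as a small classifier. This is a minimal illustration, not part of the skill: the function name `classify_platform` is hypothetical, and the accepted label spellings are only the ones this document names (including the `iOS: App` / `Android: App` aliases from the triage gates).

```bash
# Hypothetical helper: map a platform label to ios, android, or unsupported.
# Only labels named in this document are recognised; everything else
# (mWeb, Windows, MacOS) falls through to "unsupported" (exit 4 territory).
classify_platform() {
  case "$1" in
    "iOS: Native" | "iOS: App" | ios)             echo ios ;;
    "Android: Native" | "Android: App" | android) echo android ;;
    *)                                            echo unsupported ;;
  esac
}

classify_platform "iOS: Native"            # prints: ios
classify_platform "Android: mWeb Chrome"   # prints: unsupported
```

A caller would presumably exit `4 PLATFORM_UNSUPPORTED` only when every resolved label classifies as `unsupported`.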
## Inputs

| Input | Source | Required |
| --- | --- | --- |
| Source URL (PR or issue) | First positional arg, e.g. `https://github.com/Expensify/App/pull/89475` or `.../issues/89855` | Yes |
| `--platforms ios,android` | Flag | No (default: derived) |
| `-e KEY=VALUE` step-param overrides | Repeatable | No |

Bare numbers are rejected (PRs and issues share the GitHub number namespace; the URL path is the safe disambiguator). No interactive prompts.

## Triage gates (run in order, before any device work)

1. **Detect source kind** from the URL: `/pull/N` → PR, `/issues/N` → issue. Anything else → exit `8 BAD_INPUT`.
2. **Fetch the source body**:
   - PR: `gh pr view <num> --json title,body`
   - Issue: `gh issue view <num> --json title,body,labels`
3. **Platform resolution** - in priority order:
   1. `--platforms` arg (CSV, wins all).
   2. **PR source**: explicit prose markers in title or `### Tests` body - `iOS only`, `Android only`, `On iOS:`, `On Android:`.
   3. **Issue source**: the `## Platforms:` checkbox list. Filled boxes denote where the bug reproduces; restrict to the matching native platforms.
   4. Default: both `ios` and `android`.

   Aliases: `iOS: Native` and `iOS: App` (both → `ios`); `Android: Native` and `Android: App` (both → `android`). All mWeb / Windows / MacOS variants are out of scope.

   If the only platforms matched are out of scope (e.g. an issue checks only `MacOS: Chrome / Safari`), **exit `4 PLATFORM_UNSUPPORTED`**.
4. **Steps parsing** - extract the steps section and produce a flow list (see below). If the flow list is empty, **exit `3 NO_FLOWS`**.

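Gate 1 can be sketched as a shell function. This is an illustrative sketch under stated assumptions, not the skill's actual implementation: `parse_source_url` is a hypothetical name, and the sketch is deliberately strict (it also rejects URLs with trailing path segments or query strings, which the document does not rule on).

```bash
# Hypothetical helper: print "<kind> <number>" for a PR or issue URL,
# or return 8 (BAD_INPUT) for anything else, including bare numbers.
parse_source_url() {
  url=$1
  case "$url" in
    https://github.com/*/*/pull/*)   kind=pr ;;
    https://github.com/*/*/issues/*) kind=issue ;;
    *) return 8 ;;
  esac
  num=${url##*/}                 # text after the last slash
  case "$num" in
    '' | *[!0-9]*) return 8 ;;   # empty or non-numeric tail
  esac
  echo "$kind $num"
}

parse_source_url "https://github.com/Expensify/App/pull/89475"   # prints: pr 89475
```

The resulting kind and number would then feed `gh pr view` or `gh issue view` in gate 2.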
## Steps parsing

See [`references/steps-parsing.md`](references/steps-parsing.md).

## Phase 1 cache

Simple map: flow steps → `.ad` script. If the steps haven't changed, reuse the cached script and skip the warm-up.

- Path: `~/.cache/agent-device-evidence/.ad-cache/<fingerprint>.ad`
- Fingerprint: `sha256(precondition + json(steps) + platform)`. Platform is included so iOS and Android don't share an entry (different selectors).
- Hit → copy to `$TEST_FLOW.ad`, mark `cached: true` in the manifest, skip Phase 1, proceed to Phase 2.
- Miss → run Phase 1, write the script to the cache on success.

The skill does not delete, invalidate, or retry cache entries. If a cached `.ad` is stale, the flow is marked `phase2_failed`. To recover, edit the steps (which changes the fingerprint) or wipe `~/.cache/agent-device-evidence/.ad-cache/` externally.

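The fingerprint and the hit/miss check might look like the sketch below. Several details are assumptions: the document does not say how the three fields are concatenated (a newline separator is used here), the names `flow_fingerprint`, `cache_lookup`, and the `AD_CACHE_DIR` override are hypothetical, and the hashing tools (`sha256sum` or `shasum`) are not in the skill's `allowed-tools` list.

```bash
# Hypothetical helper: sha256 over precondition, JSON steps, and platform.
# Assumption: fields are joined with newlines; the real serialization is unspecified.
flow_fingerprint() {   # args: precondition, steps as JSON text, platform
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s\n%s\n%s' "$1" "$2" "$3" | sha256sum | cut -d ' ' -f 1
  else
    printf '%s\n%s\n%s' "$1" "$2" "$3" | shasum -a 256 | cut -d ' ' -f 1
  fi
}

# Hypothetical lookup: print the cached script path on a hit, fail on a miss.
cache_lookup() {       # same args as flow_fingerprint
  cache_dir=${AD_CACHE_DIR:-$HOME/.cache/agent-device-evidence/.ad-cache}
  script="$cache_dir/$(flow_fingerprint "$1" "$2" "$3").ad"
  [ -s "$script" ] && echo "$script"
}
```

Because the platform is part of the hashed input, the same steps yield different cache entries for `ios` and `android`, which matches the selector concern described above.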
## Capture loop (per flow per platform)

Two phases per flow. Lifecycle delegated to the parent skill's bring-up. Phase 1 is skipped on cache hit (see above).

### Shared setup (run once per platform, before the first flow)

1. **Run the [agent-device bring-up](../agent-device/SKILL.md#bring-up)** for the target platform. The parent skill resolves bundle ID, starts Metro, picks/confirms the device, manages session, and opens the app for sanity verification. Capture the resolved `$APP_ID` (bundle ID) and `$DEVICE_NAME` for re-opens in Phases 1 and 2.
   - If bring-up fails for any reason (HybridApp gate, missing dev build, Metro start, simulator boot, etc.), **exit `7 BRING_UP_FAILED`** and surface the parent skill's error verbatim.
   - Selector discipline (id > role+label, no coordinate fallback unless 0 a11y nodes) follows the parent skill's [`flows/README.md`](../agent-device/flows/README.md).
   - **Non-interactive overrides for the parent bring-up** (this skill never prompts):
     - Device pick (parent step 5, "If multiple are booted, ask the user which"): pick the **first booted device** in `agent-device devices --json` order, deterministically. Log the choice in the manifest under `device_selected`.
     - Session reuse vs reset (parent step 6, line 73): **always `reset`** for sessions not created in the current invocation - run `agent-device close --shutdown --session <name>` without prompting. Phase 1 and Phase 2 both rely on cold starts, so reuse of stale sessions is never desired here.

2. **Close the bring-up session** so each phase starts cold:
   ```bash
   agent-device close
   ```

3. **Set up run directory** - persistent, append-only:
   ```bash
   # <pr|issue> and <num> are placeholders: substitute the detected source kind and number.
   SOURCE_KIND=<pr|issue>; SOURCE_NUM=<num>; RUN_TS=$(date -u +%Y%m%dT%H%M%SZ)
   RUN_DIR="$HOME/.cache/agent-device-evidence/$SOURCE_KIND-$SOURCE_NUM/$RUN_TS"
   mkdir -p "$RUN_DIR/ios" "$RUN_DIR/android"
   ```

### Phase 1 - Warm-up (per flow, no camera)

Goal: produce a deterministic `.ad` script of the successful command sequence, plus per-step still candidates. Drives autonomously from cold start. No recording.

**Skip if cached.** Before any device work, consult the [Phase 1 cache](#phase-1-cache). On hit, copy the cached `.ad` to `$TEST_FLOW.ad`, mark `cached: true` in the manifest, and proceed straight to Phase 2.

On cache miss:

1. **Open the app** with the bring-up's resolved values:
   ```bash
   agent-device open "$APP_ID" --device "$DEVICE_NAME"
   ```

2. **Drive setup actions** based on the flow's `Precondition:` block (if any) and what the steps imply. Setup actions go into the `.ad` script up to the marker; everything after the marker is what Phase 2 records.

3. **Drive the test flow** - one numbered step at a time. For each step:
   - Send the step text verbatim to the agent-device LLM driver.
   - On success, append the **final, successful** action to `$TEST_FLOW.ad`. Do not append actions that needed retries on different selectors.
   - **If a value is explicit in the step** (e.g. "Enter $42.50"), pass it through verbatim. **If not**, the LLM picks a context-appropriate value and the chosen value is recorded in `params:` in the manifest.
   - The post-action `agent-device snapshot` (taken for selector matching) is **saved as a candidate still** - `flow-<id>-step-<n>-<label>.png`. Free side-effect.

4. **Verify final state** - `agent-device is exists "<selector>"` on the post-condition implied by the last step.

5. **Close session** - `agent-device close`.

6. **Sanity-check** the script is non-empty:
   ```bash
   # On an empty script, record per-flow status "phase1_failed: empty script"
   # in the manifest, then skip to the next flow (this runs inside the flow loop).
   test -s "$TEST_FLOW.ad" || continue
   ```

7. **Write to cache** - on success, copy `$TEST_FLOW.ad` to `~/.cache/agent-device-evidence/.ad-cache/<fingerprint>.ad` and write the meta sidecar.

### Phase 2 - Recording (per flow, deterministic replay)

Goal: clean MP4 of only the test-flow steps. No snapshots on camera, no retries, no LLM thinking time.

1. **Open the app fresh** with the bring-up's resolved values:
   ```bash
   agent-device open "$APP_ID" --device "$DEVICE_NAME"
   ```

2. **Replay setup silently** - everything in the `.ad` script up to the marker. Off-camera. The app reaches the test starting state.

3. **Start recording**:
   ```bash
   agent-device record start "$RUN_DIR/$PLATFORM/flow-$ID.mp4" --fps 24
   ```

   > Android: `adb shell screenrecord` has a 3-min hard cap. Per-flow MP4s rarely hit this; if a flow exceeds it, mark `status: phase2_failed` and continue.

4. **Replay test-flow portion**:
   ```bash
   agent-device replay "$TEST_FLOW.ad" --from-marker
   ```

5. **Stop recording**:
   ```bash
   agent-device record stop
   ```

6. **Close session** - `agent-device close`.

7. **Verify artifact**:
   ```bash
   # If the MP4 is missing, empty, or unreadable, mark the flow phase2_failed
   # in the manifest, then skip to the next flow (this runs inside the flow loop).
   test -s "$RUN_DIR/$PLATFORM/flow-$ID.mp4" && file "$RUN_DIR/$PLATFORM/flow-$ID.mp4" \
     || continue
   ```

**On Phase 2 replay failure:** mark the flow `phase2_failed` and continue to the next flow.

### Multi-flow chunking

Multiple flows from one source (PR or issue) share a single Phase 2 session (one `agent-device open` + replay-to-marker), with `record start` / `record stop` per flow. State carries between flows unless Phase 1 flagged `requires_cold_start: true` for a flow, in which case Phase 2 closes and re-opens before that flow.

### Single-step verify-only flows

For flows classified `kind: still`:
- Phase 1 still drives autonomously to the verification screen.
- Phase 2 opens fresh, replays setup, takes one screenshot at the verification screen via `agent-device screenshot`, and writes `flow-<id>.png`. No `record start`/`stop`.

## Output

### Run-output layout

```
~/.cache/agent-device-evidence/
├── .ad-cache/                          # cross-source Phase 1 cache (see "Phase 1 cache")
│   ├── <fingerprint>.ad
│   └── <fingerprint>.meta.json
└── <source-kind>-<source-num>/         # per-source run output, e.g. pr-89475/ or issue-89855/
    └── <run-ts>/
        ├── manifest.json
        ├── ios/
        │   ├── flow-1.mp4
        │   ├── flow-1-step-2-tap-signin.png
        │   ├── flow-2.png              (still-only flow)
        │   └── ...
        └── android/
            └── ...
```

Run output is persistent across reboots and append-only - the skill never deletes prior runs or cache entries.

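The naming convention in the layout above can be expressed as two small path builders. This is a sketch only: the function names `flow_artifact_path` and `step_still_path` are hypothetical, and the paths are relative to the run directory as in the manifest's `path` field.

```bash
# Hypothetical helpers: build artifact paths relative to the run directory.
flow_artifact_path() {   # args: platform, flow id, kind (video|still)
  case "$3" in
    video) echo "$1/flow-$2.mp4" ;;
    still) echo "$1/flow-$2.png" ;;
    *)     return 1 ;;     # unknown kind
  esac
}

step_still_path() {      # args: platform, flow id, step number, label
  echo "$1/flow-$2-step-$3-$4.png"
}

flow_artifact_path ios 1 video       # prints: ios/flow-1.mp4
step_still_path ios 1 2 tap-signin   # prints: ios/flow-1-step-2-tap-signin.png
```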
### Manifest schema

See [`references/manifest-schema.md`](references/manifest-schema.md).

### Handoff

After all platforms, the skill prints the run directory and lists per-flow paths. The user drags each artifact into the PR's `### Screenshots/Videos` section (or attaches to the issue, depending on source). The skill never edits the source.

## Exit codes

| Code | Meaning |
| --- | --- |
| `0` | All applicable flows produced an artifact (or the run was best-effort with at least one usable artifact; per-flow status reflects reality). |
| `3` | `NO_FLOWS` - steps section unparseable or empty after stripping. |
| `4` | `PLATFORM_UNSUPPORTED` - mWeb / Desktop / Windows requested or only out-of-scope platforms checked on the source. |
| `5` | `PHASE1_TOTAL_FAILURE` - every flow failed Phase 1. |
| `6` | `PHASE2_TOTAL_FAILURE` - every flow failed Phase 2 despite Phase 1 success. |
| `7` | `BRING_UP_FAILED` - parent skill bring-up failed (missing dev build, HybridApp gate, Metro start, simulator boot, etc.). Parent error is surfaced verbatim. |
| `8` | `BAD_INPUT` - source URL is missing, malformed, or not a recognised PR/issue URL. |

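One way to read the `0` / `5` / `6` rows is as a fold over the per-flow statuses. The sketch below is an interpretation, not the skill's documented behaviour: `run_exit_code` is a hypothetical name, and the table does not say what to return when every flow failed but in different phases, so this sketch assumes `6` for that mixed case.

```bash
# Hypothetical helper: derive the run's exit code from per-flow statuses.
# Assumption: a run with no "ok" flow and failures in mixed phases returns 6;
# the exit-code table does not cover that case.
run_exit_code() {   # args: one status per flow (ok, phase1_failed, phase2_failed, ...)
  ok=0; p1=0; total=0
  for status in "$@"; do
    total=$((total + 1))
    case "$status" in
      ok)            ok=$((ok + 1)) ;;
      phase1_failed) p1=$((p1 + 1)) ;;
    esac
  done
  if [ "$total" -eq 0 ]; then
    echo 3   # NO_FLOWS
  elif [ "$ok" -gt 0 ]; then
    echo 0   # at least one usable artifact
  elif [ "$p1" -eq "$total" ]; then
    echo 5   # PHASE1_TOTAL_FAILURE
  else
    echo 6   # PHASE2_TOTAL_FAILURE (or the assumed mixed case)
  fi
}

run_exit_code ok phase2_failed   # prints: 0
```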
## Cost guards

| Cap | Value |
| --- | --- |
| Phase 1 timeout | 5 min per flow |
| Phase 2 timeout | 3 min per flow (Android cap) |
| Max driver actions | 50 per flow |

Hitting any cap marks the flow `phase1_failed` / `phase2_failed` and proceeds to the next flow.

## Error handling

See [`references/error-handling.md`](references/error-handling.md).

## Out of scope (do not do these)

The skill must not attempt any of the following. If a request implies one of these, refuse or delegate.

- **Mobile web and Desktop platforms** (`iOS: mWeb Safari`, `Android: mWeb Chrome`, `MacOS: Chrome / Safari`) - belong in `playwright-app-testing` or a future browser-driver skill. Exit `4 PLATFORM_UNSUPPORTED`.
- **Standalone (non-HybridApp) builds** - parent skill is HybridApp-only and this specialization inherits the gate. Production mobile evidence runs against HybridApp.
- **Device lifecycle** (Metro, simulator boot, bundle ID resolution, session reuse, app install verification) - fully delegated to the parent skill's [Bring-up](../agent-device/SKILL.md#bring-up). Do not call `agent-device metro prepare`, `xcrun simctl`, or `is-hybrid-app.sh` directly.
- **Editing the PR body or posting PR comments** - the skill only writes local files. The user handles upload.
- **Interactive prompts of any kind** - CI is the eventual host; the skill must run end-to-end without human input.
- **Test data cleanup** - accounts, expenses, and workspaces created during runs accumulate; rely on periodic test-account reset.
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
# Error handling matrix

Lookup table for handling specific failure modes. Read this when a phase or gate fails and you need to choose between exit, retry, mark-and-continue, or warn-and-proceed.

| Situation | Action |
| --- | --- |
| Source URL missing or not a recognised PR/issue URL | Exit `8 BAD_INPUT` |
| Steps section missing or empty (PR `### Tests` / issue `## Action Performed:`) | Exit `3 NO_FLOWS` |
| Only out-of-scope platforms checked on issue (e.g. `MacOS: Chrome / Safari` only) | Exit `4 PLATFORM_UNSUPPORTED` |
| mWeb / Desktop / Windows explicitly requested via `--platforms` | Exit `4 PLATFORM_UNSUPPORTED` |
| Bring-up fails (HybridApp gate, missing dev build, Metro start, etc.) | Surface parent skill's error verbatim; exit `7 BRING_UP_FAILED` |
| Phase 1 step uninterpretable by LLM | Mark flow `phase1_failed`, log the step that failed, continue to next flow |
| Phase 1 a11y empty (0 nodes) on a screen | Use coordinate fallback; log `warnings: ["a11y_fallback:<screen>"]` |
| Phase 1 `$TEST_FLOW.ad` empty after warm-up | Mark flow `phase1_failed`, continue |
| Phase 2 `replay` fails on a step | Mark flow `phase2_failed`, continue |
| `record stop` produces 0-byte file | Retry Phase 2 once for that flow; if still empty, mark `phase2_failed` |
| Android flow exceeds 3-min cap | Mark `phase2_failed`, continue (per-flow MP4s should rarely hit this; if they do, the Tests section is too coarse-grained) |

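
The 0-byte `record stop` row is the one entry in the matrix that retries. A minimal sketch of that retry-once rule follows; `record_with_one_retry` is a hypothetical wrapper, and the command it runs stands in for whatever performs a full Phase 2 pass for the flow.

```bash
# Hypothetical helper: run a recording command, and run it once more if the
# artifact is missing or empty. Prints the resulting per-flow status.
record_with_one_retry() {   # args: artifact path, then the command to run
  artifact=$1
  shift
  for attempt in 1 2; do
    "$@" >/dev/null || true   # a failed attempt is judged by the artifact, not the exit code
    if [ -s "$artifact" ]; then
      echo ok
      return 0
    fi
  done
  echo phase2_failed
  return 1
}
```

The wrapper checks only that the file is non-empty; validating the container with `file`, as the Phase 2 verify step does, would be a separate check.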
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# Manifest schema

`manifest.json` is written to the run root after all platforms complete. Read this when populating or consuming the manifest.

```json
{
  "source": {
    "kind": "pr",
    "number": 89475,
    "url": "https://github.com/Expensify/App/pull/89475",
    "title": "<source title>"
  },
  "platforms_requested": ["ios", "android"],
  "platforms_run": ["ios", "android"],
  "flows": {
    "ios": [
      {
        "id": 1,
        "title": "Test case 1: ...",
        "kind": "video",
        "path": "ios/flow-1.mp4",
        "stills": ["ios/flow-1-step-2-tap-signin.png"],
        "expected": "App will show error when creating new agent without name.",
        "status": "ok",
        "cached": true,
        "fingerprint": "a3f9b2c4...",
        "warnings": [],
        "params": {"email": "test+ci-89475-1@expensify.com"}
      }
    ],
    "android": [...]
  }
}
```

## Field semantics

- `source.kind` - `"pr"` or `"issue"`.
- `source.number` / `source.url` / `source.title` - identifying metadata, captured at fetch time.
- `platforms_requested` - what the invocation asked for (via `--platforms` or inferred from the body).
- `platforms_run` - subset of requested that actually executed (e.g. iOS-only when Android bring-up failed).
- `flows.<platform>[]` - one entry per declared flow on that platform.

## Per-flow fields

- `id` - 1-indexed flow number, stable within a run.
- `title` - flow label from the source body or LLM summary.
- `kind` - `"video"` (Phase 2 records MP4) or `"still"` (single screenshot, verify-only).
- `path` - artifact path relative to the run directory.
- `stills` - candidate per-step PNGs captured as a Phase 1 side effect.
- `expected` - free-form expected outcome from an issue's `## Expected Result:` block. Populated for issues, absent for PRs.
- `status` - one of: `ok`, `phase1_failed`, `phase2_failed`, `skipped_after_failure`.
- `cached` - `true` when the `.ad` came from the Phase 1 cache (Phase 1 skipped); `false` when Phase 1 ran fresh.
- `fingerprint` - the `sha256(precondition + json(steps) + platform)` used as the cache key.
- `warnings[]` - non-fatal annotations (e.g. `a11y_fallback:<screen>` when coordinate fallback was used).
- `params` - any context-derived values the driver chose (e.g. test email), captured for reproducibility.