💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 721373139d
  20,
  5,
);
const RUNNER_AUTH_TOKEN = process.env.AGENT_DEVICE_RUNNER_AUTH_TOKEN?.trim() || undefined;
Use a single env var for runner command auth token
waitForRunner only sends Authorization when AGENT_DEVICE_RUNNER_AUTH_TOKEN is set, but the runner enforces auth using AGENT_DEVICE_RUNNER_COMMAND_TOKEN (RunnerTests+Environment.swift). When only the command token is configured (the server-side knob), iOS /command requests are sent without a bearer token and are rejected with 401, which blocks all runner commands.
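One possible way to close the gap is to resolve the token from either variable in a single place. This is a minimal sketch, not the PR's code: `resolveRunnerAuthToken`, `buildAuthHeaders`, and the `Env` type are hypothetical names, and the precedence (command token first, then the client-side name) is an assumption.

```typescript
// Hypothetical helpers for illustration; none of these names come from the PR.
type Env = Record<string, string | undefined>;

// Return the first non-empty token, preferring the variable the runner enforces.
function resolveRunnerAuthToken(env: Env): string | undefined {
  const candidates = [
    env["AGENT_DEVICE_RUNNER_COMMAND_TOKEN"],
    env["AGENT_DEVICE_RUNNER_AUTH_TOKEN"],
  ];
  for (const value of candidates) {
    const trimmed = value?.trim();
    if (trimmed) return trimmed;
  }
  return undefined;
}

// Build request headers; Authorization is omitted when no token is configured.
function buildAuthHeaders(env: Env): Record<string, string> {
  const token = resolveRunnerAuthToken(env);
  return token ? { Authorization: `Bearer ${token}` } : {};
}
```

With this shape, setting only the server-side variable is enough for the client to attach a bearer token, and the existing client-side variable keeps working as a fallback.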
    'footer visible text: Seasonal footer target',
  ],
  task: 'Assume Agent Device Tester is on the Catalog tab. Plan the commands to scroll into view the Seasonal footer target card using scrollintoview.',
  outputs: [/scrollintoview/i, /catalog-footer/i],
Assert the real scroll command in smoke suite
This case requires scrollintoview, but the CLI command surface defines scroll (no scrollintoview command). As written, the benchmark rewards an invalid command and can fail agents that output the valid scroll ... plan, producing false negatives in the smoke suite.
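If the CLI does expose only scroll, the output patterns could be tightened along these lines. This is a sketch under assumptions: `matchesAll` is a hypothetical stand-in for however SkillGym evaluates `outputs`, and the sample plan strings are illustrative, not real agent output or confirmed CLI syntax.

```typescript
// Patterns that accept the standalone `scroll` command; the word boundaries
// keep `scrollintoview` from matching.
const outputs: RegExp[] = [/\bscroll\b/i, /catalog-footer/i];

// Hypothetical checker: a plan passes only if every pattern matches it.
function matchesAll(plan: string, patterns: RegExp[]): boolean {
  return patterns.every((pattern) => pattern.test(plan));
}

// Illustrative plan strings (assumed wording).
const validPlan = "scroll down until catalog-footer is visible";
const invalidPlan = "scrollintoview catalog-footer";

console.log(matchesAll(validPlan, outputs));   // true
console.log(matchesAll(invalidPlan, outputs)); // false
```

The task text would need the same change, since it currently instructs the agent to use scrollintoview.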
    "check": "pnpm check:tooling && pnpm check:fallow && pnpm check:unit",
    "prepack": "pnpm build:all",
    "typecheck": "tsc -p tsconfig.json",
    "test-app:install": "pnpm install --dir examples/test-app --ignore-workspace --lockfile=false",
Install fixture app with its lockfile enabled
The new test-app:install script passes --lockfile=false, which makes pnpm ignore the committed examples/test-app/pnpm-lock.yaml. That causes dependency drift across runs and can make SkillGym smoke results non-reproducible when upstream package versions change.
68fba33 to 0b6dd04
Summary
- Add the Expo example app under examples/test-app for agent-device smoke flows.
- Add SkillGym config, docs, and a 48-case suite split into fixture smoke coverage and MECE skill-guidance regressions, including perf, React DevTools, gesture, settings, and trace planning coverage.
- Update SkillGym to 0.5.0, include the Claude Haiku runner, and keep the fixture install lockfile-respecting.
- Ignore the nested fixture app in Fallow analysis and exclude generated SkillGym result artifacts from pnpm format traversal.

Touched files: 33. Scope is limited to the example app plus SkillGym docs/test support and related tooling config.
Validation
- pnpm format
- pnpm typecheck
- pnpm check:tooling
- pnpm check:unit
- pnpm check:fallow --base origin/main
- pnpm test-app:install
- pnpm test-app:typecheck
- git diff --check
- pnpm exec skillgym run ./test/skillgym/suites/agent-device-smoke-suite.ts --config ./test/skillgym/skillgym.config.ts --runner claude-haiku --case open-and-snapshot
- pnpm exec skillgym run ./test/skillgym/suites/agent-device-smoke-suite.ts --config ./test/skillgym/skillgym.config.ts --runner codex-main (benchmark run completed with 26 passed / 15 failed, surfacing intended command-planning regressions in Codex mini)
- pnpm exec skillgym run ./test/skillgym/suites/agent-device-smoke-suite.ts --config ./test/skillgym/skillgym.config.ts (full matrix completed: 51 passed / 45 failed runs; 20 passed / 28 failed cases)
- pnpm --dir examples/test-app exec expo start --ios --port 8082 --host lan, plus simulator screenshots for light and dark UI review