feat: add skillgym tests and a test-app #453

Merged
thymikee merged 1 commit into main from feat/skillgym
Apr 27, 2026

@thymikee thymikee commented Apr 26, 2026

Summary

Add the Expo example app under examples/test-app for agent-device smoke flows.
Add SkillGym config, docs, and a 48-case suite split into fixture smoke coverage and MECE skill-guidance regressions, including perf, React DevTools, gesture, settings, and trace planning coverage.
Update SkillGym to 0.5.0, include the Claude Haiku runner, and keep the fixture install lockfile-respecting.
Ignore the nested fixture app in Fallow analysis and exclude generated SkillGym result artifacts from pnpm format traversal.

Touched files: 33. Scope is limited to the example app plus SkillGym docs/test support and related tooling config.

Validation

  • pnpm format
  • pnpm typecheck
  • pnpm check:tooling
  • pnpm check:unit
  • pnpm check:fallow --base origin/main
  • pnpm test-app:install
  • pnpm test-app:typecheck
  • git diff --check
  • pnpm exec skillgym run ./test/skillgym/suites/agent-device-smoke-suite.ts --config ./test/skillgym/skillgym.config.ts --runner claude-haiku --case open-and-snapshot
  • pnpm exec skillgym run ./test/skillgym/suites/agent-device-smoke-suite.ts --config ./test/skillgym/skillgym.config.ts --runner codex-main (benchmark run completed with 26 passed / 15 failed, surfacing intended command-planning regressions in Codex mini)
  • pnpm exec skillgym run ./test/skillgym/suites/agent-device-smoke-suite.ts --config ./test/skillgym/skillgym.config.ts (full matrix completed: 51 passed / 45 failed runs; 20 passed / 28 failed cases)
  • pnpm --dir examples/test-app exec expo start --ios --port 8082 --host lan plus simulator screenshots for light and dark UI review

github-actions Bot commented Apr 26, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://callstackincubator.github.io/agent-device/pr-preview/pr-453/

Built to branch gh-pages at 2026-04-27 00:46 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 721373139d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/platforms/ios/runner-transport.ts Outdated
20,
5,
);
const RUNNER_AUTH_TOKEN = process.env.AGENT_DEVICE_RUNNER_AUTH_TOKEN?.trim() || undefined;
P1: Use a single env var for runner command auth token

waitForRunner only sends Authorization when AGENT_DEVICE_RUNNER_AUTH_TOKEN is set, but the runner enforces auth using AGENT_DEVICE_RUNNER_COMMAND_TOKEN (RunnerTests+Environment.swift). When only the command token is configured (the server-side knob), iOS /command requests are sent without a bearer token and are rejected with 401, which blocks all runner commands.
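One way to close this gap could look like the sketch below, which reads both variables on the client side and prefers the one the runner enforces. The helper names `resolveRunnerToken` and `authHeaders` are hypothetical and not taken from the repository; only the two environment variable names come from the review comment.

```typescript
// Hypothetical helpers; only the two env var names come from the review above.
type Env = Record<string, string | undefined>;

function resolveRunnerToken(env: Env): string | undefined {
  // Prefer the variable the runner enforces, then fall back to the client-side one.
  const candidates = [
    env.AGENT_DEVICE_RUNNER_COMMAND_TOKEN,
    env.AGENT_DEVICE_RUNNER_AUTH_TOKEN,
  ];
  for (const value of candidates) {
    const trimmed = value?.trim();
    if (trimmed) return trimmed;
  }
  return undefined;
}

function authHeaders(env: Env): Record<string, string> {
  // Send a bearer token whenever either variable is configured.
  const token = resolveRunnerToken(env);
  return token ? { Authorization: `Bearer ${token}` } : {};
}
```

With this shape, configuring only `AGENT_DEVICE_RUNNER_COMMAND_TOKEN` would still produce an `Authorization` header, so the client and the runner would no longer disagree about which knob enables auth.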

Useful? React with 👍 / 👎.

'footer visible text: Seasonal footer target',
],
task: 'Assume Agent Device Tester is on the Catalog tab. Plan the commands to scroll into view the Seasonal footer target card using scrollintoview.',
outputs: [/scrollintoview/i, /catalog-footer/i],
P2: Assert the real scroll command in smoke suite

This case requires scrollintoview, but the CLI command surface defines scroll (no scrollintoview command). As written, the benchmark rewards an invalid command and can fail agents that output the valid scroll ... plan, producing false negatives in the smoke suite.
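The false negative described here can be illustrated with a small sketch. It assumes a case passes only when every pattern in `outputs` matches the agent's plan; that matching rule, the `satisfiesAll` helper, and both plan strings are hypothetical and are not taken from SkillGym or the agent-device CLI.

```typescript
// The two patterns are copied from the case under review.
const outputs: RegExp[] = [/scrollintoview/i, /catalog-footer/i];

// Assumed rule (not confirmed by the source): every pattern must match the plan.
function satisfiesAll(plan: string, patterns: RegExp[]): boolean {
  return patterns.every((pattern) => pattern.test(plan));
}

// Hypothetical plans; the exact command syntax is illustrative only.
const planWithScrollIntoView = "agent-device scrollintoview catalog-footer";
const planWithScroll = "agent-device scroll down until catalog-footer is visible";

console.log(satisfiesAll(planWithScrollIntoView, outputs)); // true
console.log(satisfiesAll(planWithScroll, outputs)); // false
```

Under that assumption, a plan built on `scroll` is rejected even though it names the right target, which is the behaviour the review flags.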

Comment thread package.json Outdated
"check": "pnpm check:tooling && pnpm check:fallow && pnpm check:unit",
"prepack": "pnpm build:all",
"typecheck": "tsc -p tsconfig.json",
"test-app:install": "pnpm install --dir examples/test-app --ignore-workspace --lockfile=false",
P2: Install fixture app with its lockfile enabled

The new test-app:install script passes --lockfile=false, which makes pnpm ignore the committed examples/test-app/pnpm-lock.yaml. That causes dependency drift across runs and can make SkillGym smoke results non-reproducible when upstream package versions change.
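As a rough sketch of a lockfile-respecting install, the arguments could be assembled as below. The `fixtureInstallArgs` helper is hypothetical; `--dir`, `--ignore-workspace`, and `--frozen-lockfile` are standard pnpm flags, but whether the merged script uses `--frozen-lockfile` or simply drops `--lockfile=false` is not stated in this thread.

```typescript
// Hypothetical helper; only the flags and the fixture path come from pnpm and the PR.
function fixtureInstallArgs(options: { frozen: boolean }): string[] {
  const args = ["install", "--dir", "examples/test-app", "--ignore-workspace"];
  // No --lockfile=false here, so pnpm reads examples/test-app/pnpm-lock.yaml.
  if (options.frozen) {
    // Fail instead of rewriting the lockfile when it is out of date.
    args.push("--frozen-lockfile");
  }
  return args;
}

console.log(["pnpm", ...fixtureInstallArgs({ frozen: true })].join(" "));
// prints "pnpm install --dir examples/test-app --ignore-workspace --frozen-lockfile"
```

Pinning the install to the committed lockfile keeps the fixture's dependency versions the same across runs, which is what makes smoke results comparable over time.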

@thymikee thymikee changed the title feat: add skillgym tests feat: add skillgym tests and a test-app Apr 26, 2026
@thymikee thymikee force-pushed the feat/skillgym branch 8 times, most recently from 68fba33 to 0b6dd04 Compare April 27, 2026 00:27
@thymikee thymikee merged commit 7c5b767 into main Apr 27, 2026
16 checks passed
@thymikee thymikee deleted the feat/skillgym branch April 27, 2026 00:50