Productionize Agent Skills simulator journey and release gate #252

Description

@ucguy4u

Problem

Agent Skills are now far enough along to build and merge. The remaining production gap is validation quality: we need a repeatable simulator journey that proves the on-device LLM, the model download/setup path, skill routing, connector execution, and user-facing output stay aligned with the intended UX before release.

Recent merged context:

Goal

Make Agent Skills production-ready by converting the current simulator journey into a reliable release gate with artifacts, clear pass/fail criteria, and documented remediation steps.

Scope

In scope

  • Run the Agent Skills simulator journey on Android and iOS from a clean developer machine or CI runner.
  • Cover the model setup path: launch app, select/download recommended on-device model where required, and reach a usable chat state.
  • Validate representative prompts:
    • "Check my schedule for today"
    • "Remind me tomorrow at 9 AM to submit the report"
    • User confirms: "yes, add it to calendar"
  • Verify visible action traces for get_current_date_time, read_calendar_events, schedule_notification, and create_calendar_event where applicable.
  • Persist journey artifacts: logs, screenshots/video if supported, summary JSON, selected model/runtime, device/platform, app build variant, and final assistant output.
  • Decide whether this journey should run on every PR, nightly, or only release branches, then wire it into GitHub Actions accordingly.
  • Document local and CI execution in the repo.
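One way to picture the action-trace verification is a small post-run check over the tool names the journey observed for each prompt. The sketch below is language-neutral and written in Python only for brevity; the real assertions would presumably live in the Patrol test. The tool names come from this issue, but the prompt-to-tool mapping in `EXPECTED_TRACES` and the helper names are assumptions for illustration.

```python
from __future__ import annotations

from typing import Iterable

# ASSUMPTION: which prompt triggers which tool is a guess for illustration.
# Only the tool names themselves are taken from the issue text.
EXPECTED_TRACES = {
    "Check my schedule for today": [
        "get_current_date_time",
        "read_calendar_events",
    ],
    "Remind me tomorrow at 9 AM to submit the report": [
        "get_current_date_time",
        "schedule_notification",
    ],
    "yes, add it to calendar": ["create_calendar_event"],
}


def missing_tools(expected: Iterable[str], observed: Iterable[str]) -> list[str]:
    """Return the expected tool names that never appear in the observed traces."""
    seen = set(observed)
    return [name for name in expected if name not in seen]


def check_journey(observed_by_prompt: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each failing prompt to its missing tools; an empty dict means all passed."""
    failures: dict[str, list[str]] = {}
    for prompt, expected in EXPECTED_TRACES.items():
        missing = missing_tools(expected, observed_by_prompt.get(prompt, []))
        if missing:
            failures[prompt] = missing
    return failures
```

Checking for presence rather than exact sequence keeps the check tolerant of extra tool calls the model may make, which fits the "where applicable" wording above.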

Out of scope

  • Building a full community skills marketplace.
  • Replacing the current Agent Skills UI.
  • Removing AGP 9 compatibility workarounds, unless removal is required to make this journey stable.
  • Adding new third-party connectors beyond Calendar/Notifications.

Acceptance criteria

  • make agent-skills-journey-android succeeds on a clean Android simulator and writes artifacts under artifacts/agent-skills-journey/....
  • make agent-skills-journey-ios succeeds on a clean iOS simulator or has a documented, tracked blocker if CI cannot support it yet.
  • The journey verifies model setup/download state before sending prompts.
  • The journey asserts the calendar-read prompt invokes the expected skill/tool path and produces a user-facing answer.
  • The journey asserts reminder scheduling creates an app notification and asks whether to add the event to Calendar.
  • The journey asserts positive confirmation executes create_calendar_event and displays success/failure clearly.
  • Artifacts include a machine-readable summary with status, duration, platform, device, prompt, selected model/runtime, and log path.
  • CI runs the journey in the agreed cadence and fails on regressions, or a documented manual QA runbook exists until CI support is stable.
  • Documentation explains prerequisites, commands, expected outputs, common failures, and how to update baselines.
  • All impacted checks are green before merge: PR Checks, Plugin Size Gate, CI analyze/lint/tests, and Android full/streaming/TV builds if touched.
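As a starting point for the machine-readable summary, here is a minimal sketch of a writer and validator covering the fields listed in the criteria above. The file name `summary.json`, the field names, and the `"passed"`/`"failed"` status values are assumptions, not an agreed schema.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class JourneySummary:
    # Field names are hypothetical; they mirror the acceptance criteria.
    status: str        # assumed to be "passed" or "failed"
    duration_s: float  # wall-clock duration of the journey in seconds
    platform: str      # e.g. "android" or "ios"
    device: str
    prompt: str
    model: str
    runtime: str
    log_path: str


REQUIRED_KEYS = {f.name for f in fields(JourneySummary)}


def write_summary(summary: JourneySummary, out_dir: str | Path) -> Path:
    """Write the summary as JSON under out_dir and return the file path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "summary.json"
    path.write_text(json.dumps(asdict(summary), indent=2), encoding="utf-8")
    return path


def validate_summary(path: str | Path) -> list[str]:
    """Return a list of problems found in a summary file; empty means valid."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = [f"missing field: {k}" for k in sorted(REQUIRED_KEYS - data.keys())]
    if data.get("status") not in {"passed", "failed"}:
        problems.append("status must be 'passed' or 'failed'")
    return problems
```

A CI step could call `validate_summary` on the uploaded artifact and fail the job when the returned list is non-empty.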

Suggested implementation plan

  1. Audit the existing journey files and confirm what they currently assert versus only navigate.
  2. Add stable test IDs/semantics to the Agent Skills UI if the Patrol tests rely on fragile text matching.
  3. Extend the Patrol journey to capture model/runtime state and screenshots at each major step.
  4. Add explicit assertions for action traces and final assistant messages.
  5. Add GitHub Actions wiring with a conservative cadence first, preferably manual dispatch or nightly if simulator runtime is too slow for every PR.
  6. Promote to PR/release gate once flakiness is under control.
  7. Update docs and PR template/testing notes so developers know when to run it locally.
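For step 6, "flakiness under control" needs a concrete definition before the journey can be promoted. One possible sketch is a pass-rate check over recent nightly runs. The window of 20 runs and the 95% threshold are placeholder assumptions to be agreed on, and the status strings assume summaries that record `"passed"` or `"failed"`.

```python
from __future__ import annotations


def pass_rate(statuses: list[str]) -> float:
    """Fraction of runs whose status is 'passed'; 0.0 for an empty history."""
    if not statuses:
        return 0.0
    return sum(s == "passed" for s in statuses) / len(statuses)


def ready_for_pr_gate(
    statuses: list[str],
    window: int = 20,        # ASSUMPTION: number of most recent runs to consider
    min_rate: float = 0.95,  # ASSUMPTION: minimum pass rate to promote
) -> bool:
    """True when a full window of recent runs meets the minimum pass rate."""
    recent = statuses[-window:]
    return len(recent) >= window and pass_rate(recent) >= min_rate
```

Requiring a full window prevents promoting the gate on the strength of a handful of lucky runs.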

Files to start from

  • scripts/run_agent_skills_journey.sh
  • Makefile targets around agent-skills-journey
  • app/patrol_test/agent_skills_journey_test.dart
  • app/integration_test/agent_skills_journey_test.dart
  • app/lib/features/agent_chat/domain/services/agent_skill_orchestrator.dart
  • app/lib/features/agent_chat/presentation/screens/chat_screen.dart
  • .github/workflows/ci.yml

Risks

  • Simulator tests may be slow or flaky if model download is network-dependent.
  • Calendar/notification permissions differ between Android and iOS simulators.
  • Current AGP 9 workaround patches pub-cache plugin Gradle files in CI; plugin upgrades may be needed before this is stable long term.
  • On-device model output can vary; assertions should focus on required actions and UX contract, not exact wording unless deterministic.
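To illustrate the last risk, an assertion on the reminder reply can check the UX contract (the assistant asks whether to add the event to Calendar) without pinning exact wording. This is a rough heuristic sketch; the function name and the keyword rule are assumptions, and a real check may need localization or a structured signal from the app instead.

```python
import re


def asks_about_calendar(reply: str) -> bool:
    """True if any sentence in the reply is a question that mentions the calendar."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", reply.strip())
    return any(s.endswith("?") and "calendar" in s.lower() for s in sentences)
```

Both "Want me to add it to your Calendar?" and "Should I also put this on your calendar?" satisfy the check, while a plain statement such as "I added it to your calendar." does not.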

Owner / estimate

  • Main agent: agent/ai-llm
  • Supporting areas: QA, CI/CD, mobile UI
  • Estimate: 2-4 days for stable manual/nightly coverage; longer if PR-gating requires simulator infrastructure hardening.

Definition of done

A developer can pick up this ticket, run one documented command, reproduce the Agent Skills user journey on a simulator, inspect the artifacts, and trust the result as a release-readiness signal for the on-device LLM skill workflow.

Metadata

Labels

  • agent/ai-llm: AI/LLM Agent tasks
  • priority/P1: High - Important but not blocking
  • sprint-1: Sprint 1 - Foundation
