Automated evaluation suite for the Ignite UI for Angular agent skills. Uses the skill-eval framework to measure skill quality, detect regressions, and gate merges.
The suite tests three skills:
| Skill | Task ID | What it tests |
|---|---|---|
| `igniteui-angular-grids` | `grid-basic-setup` | Flat grid with sorting and pagination on flat employee data |
| `igniteui-angular-components` | `component-combo-reactive-form` | Multi-select combo bound to a reactive form control |
| `igniteui-angular-theming` | `theming-palette-generation` | Custom branded palette with `palette()` and `theme()` |
Each task includes:
- `instruction.md` – the prompt given to the agent
- `tests/test.sh` – deterministic grader (file checks, compilation, lint)
- `prompts/quality.md` – LLM rubric grader (intent routing, API usage)
- `solution/solve.sh` – reference solution for baseline validation
- `environment/Dockerfile` – isolated environment for agent execution
- `skills/` – symlinked or copied skill files under test
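The `skills/` directory can be populated with a symlink or a copy of the skill under test; for example (the paths below are illustrative, adjust to the actual repo layout):

```sh
# From inside a task directory (hypothetical paths)
cd evals/tasks/grid-basic-setup
mkdir -p skills

# Symlink the skill under test...
ln -s ../../../skills/igniteui-angular-grids skills/igniteui-angular-grids

# ...or copy it when symlinks won't survive (e.g. Docker build contexts)
cp -r ../../../skills/igniteui-angular-grids skills/
```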
- Node.js 20+
- Docker (for isolated agent execution)
- An API key for the agent provider (Gemini or Anthropic)
```sh
cd evals
npm install
```

```sh
# Gemini (default)
GEMINI_API_KEY=your-key npm run eval -- grid-basic-setup

# Claude
ANTHROPIC_API_KEY=your-key npm run eval -- grid-basic-setup --agent=claude

# Run every task
GEMINI_API_KEY=your-key npm run eval:all

# Adjust trials (default: 5)
npm run eval -- grid-basic-setup --trials=5

# Run locally without Docker
npm run eval -- grid-basic-setup --provider=local

# Validate graders against the reference solution
npm run eval -- grid-basic-setup --validate --provider=local

# Run multiple trials in parallel
npm run eval -- grid-basic-setup --parallel=3
```

```sh
# CLI report
npm run preview

# Web UI at http://localhost:3847
npm run preview:browser
```
To add a new task:

- Create a directory under `evals/tasks/<task-id>/` with the standard structure:

  ```
  tasks/<task-id>/
  ├── task.toml               # Config: graders, timeouts, resource limits
  ├── instruction.md          # Agent prompt
  ├── environment/Dockerfile  # Container setup
  ├── tests/test.sh           # Deterministic grader
  ├── prompts/quality.md      # LLM rubric grader
  ├── solution/solve.sh       # Reference solution
  └── skills/                 # Skill files under test
      └── <skill-name>/SKILL.md
  ```

- Write a clear, unambiguous `instruction.md` that tells the agent exactly what to build.
- Write `tests/test.sh` to check outcomes (files exist, the project compiles, the correct selectors are present) rather than specific steps.
- Write `prompts/quality.md` with rubric dimensions that sum to 1.0.
- Write `solution/solve.sh`, a shell script that proves the task is solvable and validates that the graders work correctly (see the sketch after this list).
- Validate graders before submitting:

  ```sh
  npm run eval -- <task-id> --validate --provider=local
  ```
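For reference, `solution/solve.sh` can be as simple as scripted file writes followed by the same checks the graders run (everything below is illustrative, not a prescribed template):

```sh
#!/usr/bin/env bash
set -euo pipefail

# Write the reference implementation the graders should accept (hypothetical file and content)
mkdir -p src/app
cat > src/app/app.component.html <<'EOF'
<igx-grid [data]="employees">
  <igx-column field="name" [sortable]="true"></igx-column>
</igx-grid>
EOF

# Prove the task is solvable end to end, and that the graders accept it
npm run build
bash tests/test.sh
```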
Following Anthropic's recommendations:

| Threshold | Effect | Notes |
|---|---|---|
| pass@5 ≥ 80% | Merge gate | At least 1 success in 5 trials required |
| pass^5 ≥ 60% | Tracked | Flags flaky skills for investigation |
| pass@5 < 60% | Blocks merge | On PRs touching the relevant skill |
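For intuition: assuming independent trials with a per-trial success rate p, pass@5 = 1 − (1 − p)^5 (at least one of five trials passes) and pass^5 = p^5 (all five pass). At p = 0.8, pass@5 ≈ 99.97% while pass^5 ≈ 32.8%, which is why pass^5 is the more sensitive flakiness signal.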
The GitHub Actions workflow at `.github/workflows/skill-eval.yml` runs
automatically on PRs that modify `skills/**` or `evals/**`. It:
- Checks out the repo
- Installs eval dependencies
- Runs all tasks with 5 trials
- Uploads results as an artifact
- Posts a summary comment on the PR
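In effect, a CI run boils down to something like the following (a sketch assuming default settings; the authoritative steps are in the workflow file):

```sh
# Rough equivalent of the workflow's steps (see skill-eval.yml for the real ones)
cd evals
npm install
GEMINI_API_KEY="$GEMINI_API_KEY" npm run eval:all   # defaults to 5 trials per task
npm run preview                                     # CLI report; assumed basis for the PR summary comment
```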
Deterministic grader (60% weight) — checks:
- Project builds without errors
- Correct Ignite UI selector is present in the generated template
- Required imports exist
- No use of forbidden alternatives
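A minimal `tests/test.sh` covering those checks might look like this (the selector, import, and forbidden-package names are illustrative, loosely modeled on the grid task):

```sh
#!/usr/bin/env bash
set -euo pipefail

# Project builds without errors
npm run build

# Correct Ignite UI selector is present in the generated template
grep -q '<igx-grid' src/app/app.component.html

# Required imports exist
grep -q 'igniteui-angular' src/app/app.component.ts

# No use of forbidden alternatives (example: a competing grid library)
if grep -rq 'ag-grid' src/; then
  echo 'FAIL: forbidden alternative detected' >&2
  exit 1
fi

echo PASS
```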
LLM rubric grader (40% weight) — evaluates:
- Correct intent routing
- Idiomatic API usage
- Absence of hallucinated APIs
- Following the skill's guidance
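For reference, rubric dimensions in `prompts/quality.md` could be laid out like this (names and weights are illustrative; the weights must sum to 1.0):

```
- intent_routing (0.3): the agent selected the skill that matches the request
- idiomatic_api_usage (0.3): Ignite UI APIs are used as the skill documents them
- no_hallucinated_apis (0.2): no invented components, inputs, or methods
- follows_skill_guidance (0.2): the output matches the skill's recommended patterns
```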
Baseline results are stored in `evals/results/baseline.json` and used for
regression comparison on PRs. The CI workflow uploads per-run results as
GitHub Actions artifacts.