ci(eval): migrate from Waza to Vally eval framework #1912
Migrate all 30 Waza eval tasks across 4 suites to Vally eval.yaml format:
- azure-hosted-copilot-sdk (6 stimuli)
- azure-deploy (2 stimuli)
- azure-enterprise-infra-planner (12 stimuli)
- azure-prepare (10 stimuli)

Add .vally.yaml project config with paths for skills and evals. Add evals/_base/common-graders.yaml as shared grader reference.

Grader mappings: regex->output-matches, file->file-exists/file-matches, code->completed, behavior->constraints. Global graders duplicated per stimulus as workaround for evaluate#125.

All prompts, regex patterns, and expected outputs preserved verbatim. Zero test case coverage loss.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
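The grader mappings in this commit message can be read as a small lookup table. The following is a minimal Python sketch for illustration only: the dictionary and the `vally_grader_types` helper are hypothetical names, not part of Waza or Vally, and only the grader kind names come from the commit message.

```python
# Waza grader kind -> Vally grader type(s), as listed in the commit message.
# GRADER_MAP and vally_grader_types are hypothetical names, not Vally's API.
GRADER_MAP = {
    "regex": ["output-matches"],
    "file": ["file-exists", "file-matches"],
    "code": ["completed"],
    "behavior": ["constraints"],
}


def vally_grader_types(waza_kind: str) -> list[str]:
    """Return the Vally grader types that a Waza grader kind migrates to."""
    try:
        return GRADER_MAP[waza_kind]
    except KeyError:
        raise ValueError(f"no documented mapping for Waza grader kind {waza_kind!r}")
```

A Waza `file` grader is the only kind that fans out to two Vally types, which is worth keeping in mind when counting graders before and after the migration.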
Replace azd waza run with npx @microsoft/vally-cli eval. Add setup-node with GitHub Packages registry for @microsoft/vally-cli. Add packages:read permission for GitHub Packages auth. Preserve trigger paths, artifact upload, and retention settings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…executor

Apply UX-designed tag taxonomy to all 30 stimuli:
- eval-level tags: type + skill
- stimulus-level tags: type, tier, cost, area
- Fix cost values: low -> free (mock executor, no LLM cost)

Add 5 named suites to .vally.yaml: smoke, pr, triggers, integration, full

Switch executor from mock to copilot-sdk for real agent evaluation.

Fix model names: claude-sonnet-4-20250514 -> claude-sonnet-4 (SDK rejects version-pinned model names)

Live eval run results: 8/30 pass, 4 flaky, 17 fail, 1 timeout. Failures are grader calibration issues (brittle output-contains substrings, file-exists for files agent doesn't write to disk), not migration bugs. Grader tuning tracked as follow-up work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill definitions into the session. Without this, no skills are available and skill-invocation graders always fail. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…to repo root) Skill paths in environment.skills are resolved relative to the eval.yaml file location, not the repo root. Added ../../ prefix to climb from evals/<skill>/ to the repo root. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
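To see why the ../../ prefix is needed, here is a minimal Python sketch of resolving a path relative to the eval.yaml location. The `resolve_skill_path` helper and the example skill path are assumptions made for illustration; this is not how Vally itself is implemented.

```python
import posixpath


def resolve_skill_path(eval_file: str, skill_path: str) -> str:
    """Resolve skill_path relative to the directory that holds eval_file.

    Hypothetical helper: it mirrors the behavior described in the commit
    message (paths resolve against the eval.yaml location, not the repo root).
    """
    eval_dir = posixpath.dirname(eval_file)
    return posixpath.normpath(posixpath.join(eval_dir, skill_path))


# Without the prefix, the path stays inside evals/<skill>/.
wrong = resolve_skill_path("evals/azure-prepare/eval.yaml",
                           "plugin/skills/azure-prepare")

# With ../../ the path climbs two levels to the repo root first.
right = resolve_skill_path("evals/azure-prepare/eval.yaml",
                           "../../plugin/skills/azure-prepare")
```

Here `wrong` ends up under evals/azure-prepare/, while `right` points at plugin/skills/azure-prepare from the repo root.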
The eval workflow was invoking npx @microsoft/vally-cli without any npm auth setup, so npm fell back to the public registry and the package (published to GitHub Packages) could not be resolved.
- Add .npmrc mapping @microsoft scope to npm.pkg.github.com
- Add scope: '@microsoft' to setup-node so NODE_AUTH_TOKEN is applied
- Add an npm install --no-save step (with NODE_AUTH_TOKEN) so the @microsoft/vally-cli devDependency is resolved via authenticated fetch
- Declare @microsoft/vally-cli in devDependencies (latest) so local dev and CI both resolve it through a single config path

This mirrors the working setup in wbreza/skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The default GITHUB_TOKEN lacks read:packages access to the microsoft org's private @microsoft/vally-cli package, yielding 403s. Switch to a dedicated VALLY_NPM_TOKEN repo secret (PAT with read:packages, SSO-authorized for the microsoft org). Note: fork-originated pull_request runs do not receive secrets, so fork PRs will still fail auth until the package is made public or the trigger is reworked. Internal branches / workflow_dispatch / merges will resolve correctly once the secret is provisioned. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Allow maintainers to manually invoke the eval workflow from the Actions UI. This is needed to bypass the fork-PR secrets restriction: pull_request workflows triggered from a fork cannot access repository secrets, so the @microsoft/vally-cli install fails. Manual workflow_dispatch runs execute in the base repo context where secrets are available. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor in @microsoft/vally-cli reads GITHUB_TOKEN to create a Copilot session. The default Actions GITHUB_TOKEN doesn't have Copilot API scope, causing "Session was not created with authentication info or custom provider" at eval execution time. Reuse the existing repo secret COPILOT_CLI_TOKEN (a Copilot-enabled PAT) and expose it as GITHUB_TOKEN to the eval run step. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Migrates the repository’s skill evaluation setup from the Waza workflow to the Vally CLI framework, adding initial Vally eval specs and wiring CI to install/run @microsoft/vally-cli via GitHub Packages.
Changes:
- Updated CI eval workflow to use Node + npx @microsoft/vally-cli eval and added workflow_dispatch.
- Added Vally project config (.vally.yaml) and new Vally eval specs under evals/ (plus a shared graders reference file).
- Added GitHub Packages npm registry configuration (.npmrc) and a new dev dependency on @microsoft/vally-cli.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| .github/workflows/eval.yml | Switches CI eval execution from Waza/azd to Vally CLI with GitHub Packages auth. |
| .npmrc | Routes @microsoft scope installs through GitHub Packages. |
| package.json | Adds @microsoft/vally-cli to devDependencies (and adds yaml dependency). |
| .vally.yaml | Introduces Vally project paths and suite filters (smoke/pr/triggers/integration/full). |
| evals/_base/common-graders.yaml | Documents shared “global graders” patterns to copy into stimuli (evaluate#125 workaround). |
| evals/azure-hosted-copilot-sdk/eval.yaml | Converts the hosted-copilot-sdk eval suite from Waza to Vally stimuli/graders. |
| evals/azure-prepare/eval.yaml | Adds a migrated Vally eval suite for azure-prepare. |
| evals/azure-enterprise-infra-planner/eval.yaml | Adds a migrated Vally eval suite for azure-enterprise-infra-planner. |
| evals/azure-deploy/eval.yaml | Adds a migrated Vally eval suite for azure-deploy. |
jongio left a comment
Migration looks solid overall - good preservation of the original grader intent and thorough stimulus coverage.
Two things need attention before merge:
- The COPILOT_CLI_TOKEN PAT is exposed as GITHUB_TOKEN to the eval agent process. Since eval prompts are modifiable via PR, this creates a credential exfiltration path. See inline comment for mitigation options.
- The output-not-contains "error" / "failed" graders match normal agent prose and are likely why CI scores 76.8% instead of 80%+. The regex-based fatal error grader already covers real failures.
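The second point can be illustrated with a short Python sketch. It assumes output-not-contains is a plain case-sensitive substring check and output-not-matches is a regex search, which is how the grader names read but is not confirmed here. The helper names and sample outputs are hypothetical; only the regex pattern is taken from the eval specs.

```python
import re

# Pattern copied from the "no_fatal_errors" grader in the eval specs.
FATAL = re.compile(r"(?i)fatal error|unhandled exception|stack trace")


def passes_substring_graders(output: str, banned=("error", "failed")) -> bool:
    """Assumed semantics of output-not-contains: fail if any substring appears."""
    return not any(word in output for word in banned)


def passes_fatal_regex(output: str) -> bool:
    """Assumed semantics of output-not-matches: fail if the pattern matches."""
    return FATAL.search(output) is None


# Ordinary agent prose that happens to mention errors and failures.
benign = "Added error handling so the deploy step retries if a request failed."
# Output that reports a real crash.
crash = "Unhandled exception: see stack trace below."
```

Under these assumptions the benign output is rejected by the broad substring graders but accepted by the regex, while the crash output is rejected by the regex, which is the behavior the review comment argues for.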
Also noting: the executor changed from mock (fast, deterministic, no LLM calls) to copilot-sdk (real LLM calls, ~7 min, non-deterministic, costs per run). The weighted Waza metrics (task_completion 0.4, trigger_accuracy 0.3 at 90%, behavior_quality 0.3) are now a flat 80% threshold. Both are reasonable migration trade-offs but worth documenting.
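To make the scoring trade-off concrete, here is a rough Python sketch comparing a weighted score against a flat pass-rate threshold. The weights and the 80% threshold come from the comment above; how Waza actually combined the metrics is an assumption (a plain weighted sum), the 90% trigger_accuracy bar is left out because its exact semantics are not spelled out, and the function names are hypothetical.

```python
# Weights quoted in the review comment; the weighted-sum formula is assumed.
WAZA_WEIGHTS = {
    "task_completion": 0.4,
    "trigger_accuracy": 0.3,
    "behavior_quality": 0.3,
}


def weighted_score(metrics: dict[str, float]) -> float:
    """Combine per-metric scores (0.0 to 1.0) using the Waza weights."""
    return sum(weight * metrics[name] for name, weight in WAZA_WEIGHTS.items())


def meets_flat_threshold(pass_rate: float, threshold: float = 0.80) -> bool:
    """Flat check: the overall pass rate must reach the threshold."""
    return pass_rate >= threshold


# Hypothetical run: strong completion and behavior, weak trigger accuracy.
example = {"task_completion": 1.0, "trigger_accuracy": 0.5, "behavior_quality": 1.0}
score = weighted_score(example)
```

With a flat threshold, the 76.8% CI score mentioned above fails outright, whereas a weighted scheme lets a strong metric offset a weak one.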
The test suite migration looks good. I left a few questions to better understand how Vally works. Also, could you please add some documentation on how to run these test suites locally with Vally? You can put it in evals/readme.md or maybe just tests/readme.md.
Workflow hardening:
- Drop pull_request trigger (keep workflow_dispatch only) to eliminate token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for defense-in-depth

Package hygiene:
- Remove @microsoft/vally-cli from devDependencies (CI installs it explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency

Eval spec cleanup:
- Remove 13 broad output-not-contains "error"/"failed" graders from azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics

Cleanup:
- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory

Follow-up issue #1920 tracks wiring CI to a curated medium suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Re: docs. Added evals/README.md with GitHub Packages + .npmrc setup, local vally-cli eval examples, and results file descriptions. Thanks for the suggestion.
🔍 Token Analysis Report
fatal: path 'evals/README.md' exists on disk, but not in 'origin/main'
📊 Token Change Report
Comparing
Summary
Changed Files
📊 Token Limit Check Report
Checked: 562 files
| File | Tokens | Limit | Over By |
|---|---|---|---|
| .github/skills/analyze-skill-issues/SKILL.md | 2109 | 500 | +1609 |
| .github/skills/analyze-test-run/SKILL.md | 2471 | 500 | +1971 |
| .github/skills/file-test-bug/SKILL.md | 628 | 500 | +128 |
| .github/skills/sensei/README.md | 3531 | 2000 | +1531 |
| .github/skills/sensei/SKILL.md | 3026 | 500 | +2526 |
| .github/skills/sensei/references/EXAMPLES.md | 3701 | 2000 | +1701 |
| .github/skills/sensei/references/LOOP.md | 4181 | 2000 | +2181 |
| .github/skills/sensei/references/SCORING.md | 4299 | 2000 | +2299 |
| .github/skills/skill-authoring/SKILL.md | 839 | 500 | +339 |
| plugin/skills/appinsights-instrumentation/SKILL.md | 908 | 500 | +408 |
| plugin/skills/azure-ai/SKILL.md | 817 | 500 | +317 |
| plugin/skills/azure-aigateway/SKILL.md | 1258 | 500 | +758 |
| plugin/skills/azure-aigateway/references/policies.md | 2342 | 2000 | +342 |
| plugin/skills/azure-cloud-migrate/SKILL.md | 559 | 500 | +59 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/cloudrun-deployment-guide.md | 2029 | 2000 | +29 |
| plugin/skills/azure-cloud-migrate/references/services/functions/lambda-to-functions.md | 2600 | 2000 | +600 |
| plugin/skills/azure-cloud-migrate/references/services/functions/runtimes/javascript.md | 2181 | 2000 | +181 |
| plugin/skills/azure-compliance/SKILL.md | 1185 | 500 | +685 |
| plugin/skills/azure-compute/SKILL.md | 765 | 500 | +265 |
| plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md | 2631 | 2000 | +631 |
| plugin/skills/azure-compute/workflows/vm-troubleshooter/vm-troubleshooter.md | 2509 | 2000 | +509 |
| plugin/skills/azure-cost/SKILL.md | 1977 | 500 | +1477 |
| plugin/skills/azure-deploy/SKILL.md | 1643 | 500 | +1143 |
| plugin/skills/azure-deploy/references/pre-deploy-checklist.md | 4074 | 2000 | +2074 |
| plugin/skills/azure-deploy/references/recipes/azd/errors.md | 4001 | 2000 | +2001 |
| plugin/skills/azure-deploy/references/troubleshooting.md | 2038 | 2000 | +38 |
| plugin/skills/azure-diagnostics/SKILL.md | 1132 | 500 | +632 |
| plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md | 2147 | 2000 | +147 |
| plugin/skills/azure-diagnostics/aks-troubleshooting/node-issues.md | 2003 | 2000 | +3 |
| plugin/skills/azure-enterprise-infra-planner/SKILL.md | 999 | 500 | +499 |
| plugin/skills/azure-enterprise-infra-planner/references/constraints/compute-apps.md | 2022 | 2000 | +22 |
| plugin/skills/azure-hosted-copilot-sdk/SKILL.md | 1260 | 500 | +760 |
| plugin/skills/azure-kubernetes/SKILL.md | 2582 | 500 | +2082 |
| plugin/skills/azure-kusto/SKILL.md | 2149 | 500 | +1649 |
| plugin/skills/azure-messaging/SKILL.md | 967 | 500 | +467 |
| plugin/skills/azure-prepare/SKILL.md | 3235 | 500 | +2735 |
| plugin/skills/azure-prepare/references/aspire.md | 4617 | 2000 | +2617 |
| plugin/skills/azure-prepare/references/plan-template.md | 2559 | 2000 | +559 |
| plugin/skills/azure-prepare/references/recipes/azd/terraform.md | 3525 | 2000 | +1525 |
| plugin/skills/azure-prepare/references/research.md | 2274 | 2000 | +274 |
| plugin/skills/azure-prepare/references/resources-limits-quotas.md | 3322 | 2000 | +1322 |
| plugin/skills/azure-prepare/references/security.md | 2147 | 2000 | +147 |
| plugin/skills/azure-prepare/references/services/functions/bicep.md | 3065 | 2000 | +1065 |
| plugin/skills/azure-prepare/references/services/functions/templates/SPEC-composable-templates.md | 6187 | 2000 | +4187 |
| plugin/skills/azure-prepare/references/services/functions/templates/recipes/composition.md | 4649 | 2000 | +2649 |
| plugin/skills/azure-prepare/references/services/functions/terraform.md | 3358 | 2000 | +1358 |
| plugin/skills/azure-quotas/SKILL.md | 2818 | 500 | +2318 |
| plugin/skills/azure-quotas/references/commands.md | 2644 | 2000 | +644 |
| plugin/skills/azure-resource-lookup/SKILL.md | 1288 | 500 | +788 |
| plugin/skills/azure-resource-visualizer/SKILL.md | 2054 | 500 | +1554 |
| plugin/skills/azure-storage/SKILL.md | 1180 | 500 | +680 |
| plugin/skills/azure-upgrade/SKILL.md | 1001 | 500 | +501 |
| plugin/skills/azure-upgrade/references/services/functions/automation.md | 3463 | 2000 | +1463 |
| plugin/skills/azure-upgrade/references/services/functions/consumption-to-flex.md | 2773 | 2000 | +773 |
| plugin/skills/azure-validate/SKILL.md | 916 | 500 | +416 |
| plugin/skills/entra-app-registration/SKILL.md | 2067 | 500 | +1567 |
| plugin/skills/entra-app-registration/references/api-permissions.md | 2545 | 2000 | +545 |
| plugin/skills/entra-app-registration/references/cli-commands.md | 2211 | 2000 | +211 |
| plugin/skills/entra-app-registration/references/console-app-example.md | 2752 | 2000 | +752 |
| plugin/skills/entra-app-registration/references/oauth-flows.md | 2375 | 2000 | +375 |
| plugin/skills/microsoft-foundry/SKILL.md | 2870 | 500 | +2370 |
| plugin/skills/microsoft-foundry/foundry-agent/create/create.md | 3016 | 2000 | +1016 |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | 5767 | 2000 | +3767 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | 2342 | 2000 | +342 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md | 4268 | 2000 | +2268 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | 2547 | 2000 | +547 |
| plugin/skills/microsoft-foundry/foundry-agent/trace/references/kql-templates.md | 2701 | 2000 | +701 |
| plugin/skills/microsoft-foundry/foundry-agent/troubleshoot/troubleshoot.md | 2164 | 2000 | +164 |
| plugin/skills/microsoft-foundry/models/deploy-model/SKILL.md | 1640 | 500 | +1140 |
| plugin/skills/microsoft-foundry/models/deploy-model/capacity/SKILL.md | 1739 | 500 | +1239 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/SKILL.md | 2235 | 500 | +1735 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/references/customize-workflow.md | 3335 | 2000 | +1335 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/SKILL.md | 1226 | 500 | +726 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/references/preset-workflow.md | 5534 | 2000 | +3534 |
| plugin/skills/microsoft-foundry/quota/quota.md | 2288 | 2000 | +288 |
| plugin/skills/microsoft-foundry/quota/references/capacity-planning.md | 2080 | 2000 | +80 |
| plugin/skills/microsoft-foundry/references/sdk/foundry-sdk-py.md | 2162 | 2000 | +162 |
Consider moving content to references/ subdirectories.
Automated token analysis. See skill authoring guidelines for best practices.
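The limits in the report above appear to follow a simple rule: 500 tokens for files named SKILL.md and 2000 for everything else. That rule is inferred from the table rows, not taken from the checker's source, so the Python sketch below is an assumption, and `token_limit` and `over_by` are hypothetical helper names.

```python
import posixpath


def token_limit(path: str) -> int:
    """Inferred rule: SKILL.md files get 500 tokens, other files get 2000."""
    return 500 if posixpath.basename(path) == "SKILL.md" else 2000


def over_by(path: str, tokens: int) -> int:
    """Return how many tokens a file exceeds its limit by (0 if within limit)."""
    return max(0, tokens - token_limit(path))
```

For example, `over_by("plugin/skills/azure-ai/SKILL.md", 817)` gives 317 and `over_by(".github/skills/sensei/README.md", 3531)` gives 1531, matching the "Over By" column for those rows.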
- Update ai-bench references in evals/README.md to microsoft/evaluate (the actual upstream Vally repo name)
- Add https://aka.ms/vally as the canonical docs link
- Clarify that contributors don't need source-repo access to run evals locally: the @microsoft/vally-cli package from GitHub Packages is sufficient

Addresses JasonYeMSFT's review question on evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| # Task: expected.output_not_contains | ||
| - type: output-not-contains | ||
| config: | ||
| substring: "wireApi" | ||
| - type: output-not-contains | ||
| config: | ||
| substring: "bearerToken" | ||
| - type: output-not-contains | ||
| config: | ||
| substring: "DefaultAzureCredential" | ||
| # Global: no_fatal_errors | ||
| - type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fatal error|unhandled exception|stack trace" |
In the Waza task source, this case also checked output_not_contains: ["error", "failed"]. The migrated Vally stimulus keeps the BYOM-negative assertions but drops the generic error/failure check, which changes the original test intent. Consider adding output-not-contains graders for "error"/"failed" (or expanding the no_fatal_errors regex) to keep parity with the previous task.
| # Task: expected.output_not_contains | ||
| - type: output-not-contains | ||
| config: | ||
| substring: "apiKey" | ||
| - type: output-not-contains | ||
| config: | ||
| substring: "AZURE_OPENAI_API_KEY" | ||
| - type: output-not-contains | ||
| config: | ||
| substring: "AZURE_OPENAI_KEY" | ||
| # Global: no_fatal_errors | ||
| - type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fatal error|unhandled exception|stack trace" |
In the Waza task source, this case also checked output_not_contains: ["error", "failed"]. The migrated Vally stimulus preserves the API-key negative assertions but drops the generic error/failure check, which changes the original test intent. Consider adding output-not-contains graders for "error"/"failed" (or expanding the no_fatal_errors regex) to keep parity with the previous task.
| tags: | ||
| type: integration | ||
| tier: full | ||
| area: [output, files] |
tags.area is a YAML list ([output, files]). If tags are expected to be scalar values, this can fail validation or prevent suite filters from matching. Prefer a scalar value or supported multi-tag pattern.
| area: [output, files] | |
| area: output-files |
| tags: | ||
| type: integration | ||
| tier: full | ||
| area: [output, files] |
tags.area is set as a YAML list ([output, files]), which may not be supported for tag values and can break suite filtering. Prefer a scalar tag value or an alternative representation that Vally supports.
| area: [output, files] | |
| area: output-files |
| tags: | ||
| type: integration | ||
| tier: full | ||
| area: [output, files] |
tags.area is set as a YAML list ([output, files]). If Vally expects scalar tag values, this can fail linting or make suite filters ineffective. Prefer a scalar value or a supported way to represent multiple areas.
| area: [output, files] | |
| area: output-files |
| tags: | ||
| type: integration | ||
| tier: full | ||
| area: [output, files] |
tags.area is set as a YAML list ([output, files]). If Vally treats tags as scalar key/value pairs (as implied by .vally.yaml suite filters), list-valued tags may fail schema validation and/or not match suite filtering. Prefer a single scalar value (e.g., output or files) or encode multiple areas in a different tag scheme that Vally supports.
| area: [output, files] | |
| area: files |
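The concern about list-valued tags can be sketched in Python. Vally's real filter semantics are not shown in this thread, so both matchers below are hypothetical: one models a strict scalar comparison (the behavior the review comments worry about), the other a filter that tolerates lists.

```python
def strict_match(tags: dict, key: str, wanted: str) -> bool:
    """Hypothetical scalar-only filter: the tag value must equal the wanted value."""
    return tags.get(key) == wanted


def tolerant_match(tags: dict, key: str, wanted: str) -> bool:
    """Hypothetical filter that also accepts list-valued tags."""
    value = tags.get(key)
    if isinstance(value, (list, tuple, set)):
        return wanted in value
    return value == wanted


# Tags as written in the migrated stimuli.
list_tags = {"type": "integration", "tier": "full", "area": ["output", "files"]}
# Tags after applying the suggested scalar value.
scalar_tags = {"type": "integration", "tier": "full", "area": "files"}
```

If the real filter behaves like `strict_match`, a suite filtering on area: files would silently skip every stimulus tagged with the list form, which is why a scalar value is the safer choice until list support is confirmed.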
| # Task: expected.output_not_contains | ||
| # Global: no_fatal_errors | ||
| - type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fatal error|unhandled exception|stack trace" |
The original Waza task for this stimulus asserted output_not_contains: ["error", "failed"], but the migrated Vally stimulus doesn’t include equivalent negative checks (only a fatal/exception regex). This reduces coverage and may allow error responses to pass; consider adding output-not-contains graders for "error"/"failed" (or broadening the no-error regex) to preserve the original assertions.
| # Task: expected.output_not_contains | ||
| # Global: no_fatal_errors | ||
| - type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fatal error|unhandled exception|stack trace" |
The original Waza task for this stimulus asserted output_not_contains: ["error", "failed"], but the migrated Vally stimulus doesn’t include equivalent negative checks (only a fatal/exception regex). Consider adding output-not-contains graders for "error"/"failed" (or broadening the no-error regex) to preserve the original assertions.
| # Task: expected.output_not_contains | ||
| # Global: no_fatal_errors | ||
| - type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fatal error|unhandled exception|stack trace" |
The original Waza task for this stimulus asserted output_not_contains: ["error", "failed"], but the migrated Vally stimulus doesn’t include equivalent negative checks (only a fatal/exception regex). Consider adding output-not-contains graders for "error"/"failed" (or broadening the no-error regex) to preserve the original assertions.
| tags: | ||
| type: integration | ||
| tier: full | ||
| area: [output, files] |
tags.area is a YAML list ([output, files]), which may not be supported for tag values and can break filtering/validation. Prefer a scalar value or supported multi-tag pattern.
| area: [output, files] | |
| area: output-files |
jongio left a comment
Earlier concerns are addressed - thanks. One small heads-up left as an inline.
| @@ -0,0 +1 @@ | |||
| @microsoft:registry=https://npm.pkg.github.com No newline at end of file | |||
This routes the entire @microsoft scope to GitHub Packages, not just vally-cli. Nothing in the tree hits that scope today (scripts/, tests/, dashboard/ have no @microsoft/* deps), but if a sub-package later pulls a public @microsoft/* from npmjs, local npm install will 404 for contributors who don't have VALLY_NPM_TOKEN set. Worth a one-liner in evals/README.md so the next person isn't surprised.
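As a rough model of the routing described above, the Python sketch below picks a registry by package scope. It simplifies npm's scope-to-registry lookup; `registry_for` is a hypothetical helper, and `@microsoft/some-public-package` is a made-up package name used only to show the failure mode.

```python
# npm's default public registry.
DEFAULT_REGISTRY = "https://registry.npmjs.org/"

# Equivalent of the single line in the .npmrc added by this PR.
SCOPED_REGISTRIES = {"@microsoft": "https://npm.pkg.github.com"}


def registry_for(package: str, scoped_registries: dict[str, str]) -> str:
    """Return the registry a package name would be fetched from."""
    if package.startswith("@") and "/" in package:
        scope = package.split("/", 1)[0]
        return scoped_registries.get(scope, DEFAULT_REGISTRY)
    return DEFAULT_REGISTRY


vally = registry_for("@microsoft/vally-cli", SCOPED_REGISTRIES)
# Hypothetical public package in the same scope: it is routed to GitHub
# Packages too, even if it is only published on npmjs.
other = registry_for("@microsoft/some-public-package", SCOPED_REGISTRIES)
unscoped = registry_for("yaml", SCOPED_REGISTRIES)
```

Both scoped names resolve to GitHub Packages, so a public @microsoft/* dependency added later would need the same token to install, while unscoped packages keep using the public registry.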
I find the current evals/README.md sufficient for explaining what a local developer needs to do to set up the npm token.
All 4 concerns from this review were addressed in follow-up commits. Dismissing to unblock merge.
JasonYeMSFT left a comment
Please resolve the merge conflicts.
Summary
Replaces the existing Waza-based evaluation workflow with the new Vally
framework (@microsoft/vally-cli), and wires up CI to install it from
GitHub Packages.
This is the baseline migration; Batch 1/2/3 skill migrations (PRs #1866,
#1867, #1868) and the trigger-test migration layer on top of it.
Relates to #1818
Changes
added workflow_dispatch trigger so maintainers can manually run evals.
https://npm.pkg.github.com so npm can resolve the package.
(latest) so local contributors can run the CLI.
with environment.skills and relative paths under plugin/skills/.
CI Authentication Notes
The Vally CLI package is hosted in GitHub Packages under the microsoft
org. The default GITHUB_TOKEN on a pull_request run cannot cross-org
read other orgs' package registries, and fork PRs cannot access any
repository secrets. Two things are needed:
read:packages,
SSO-authorized for the microsoft org. This PR references that secret
in the workflow.
the eval workflow from the base repo context (where secrets are
available) on fork branches.
Once the secret is provisioned, the eval workflow will work for both
upstream branch PRs (auto-triggered) and fork PRs (via manual dispatch).
Testing
graders for the migrated skill
auth path once VALLY_NPM_TOKEN is configured