ci(eval): migrate from Waza to Vally eval framework #1912
base: main
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -1,34 +1,36 @@ | ||||||
| name: Run Skill Evaluations | ||||||
|
|
||||||
| on: | ||||||
| pull_request: | ||||||
| branches: [main] | ||||||
| paths: | ||||||
| - 'evals/**' | ||||||
| - 'plugin/skills/**' | ||||||
| workflow_dispatch: | ||||||
|
|
||||||
| permissions: | ||||||
| contents: read | ||||||
| packages: read | ||||||
|
|
||||||
| jobs: | ||||||
| eval: | ||||||
| name: Run Evaluations | ||||||
| runs-on: ubuntu-latest | ||||||
| steps: | ||||||
| - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6 | ||||||
| - name: Install Azure Developer CLI | ||||||
| uses: Azure/setup-azd@c495e71ba59e44bfaaac10a32c8ee90d191ca4a3 # v2 | ||||||
| - name: Install waza extension | ||||||
| run: | | ||||||
| azd config set alpha.extensions on | ||||||
| azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json | ||||||
| azd ext install microsoft.azd.waza | ||||||
| - name: Setup Node.js | ||||||
| uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0 | ||||||
| with: | ||||||
| node-version: '22' | ||||||
| registry-url: https://npm.pkg.github.com | ||||||
| scope: '@microsoft' | ||||||
| - name: Install vally-cli | ||||||
| run: npm install --no-save @microsoft/vally-cli | ||||||
|
||||||
| run: npm install --no-save @microsoft/vally-cli | |
| run: npm install --no-save --ignore-scripts @microsoft/vally-cli |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -347,3 +347,6 @@ x86/ | |
| dashboard/.azure/ | ||
| dashboard/dist/ | ||
| dashboard/**/dist/ | ||
|
|
||
| # Local vally eval outputs | ||
| results/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| @microsoft:registry=https://npm.pkg.github.com | ||
|
Collaborator
This routes the entire
Member
I find the current evals/Readme.md sufficient for explaining what a local developer needs to set up the NPM token.
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| paths: | ||
| skills: | ||
| - plugin/skills | ||
| evals: | ||
| - evals | ||
| results: results | ||
|
|
||
| suites: | ||
| smoke: | ||
| description: "Static non-LLM checks only (e.g., trigger-pattern tests). Currently empty — all evals use the copilot-sdk executor." | ||
| filter: | ||
| tier: smoke | ||
| cost: free | ||
|
|
||
| pr: | ||
| description: "Non-LLM PR gate evals (cost: free reserved for static checks). Currently empty — populate as static evals are added." | ||
| filter: | ||
| cost: free | ||
|
|
||
|
|
||
| triggers: | ||
| description: "All routing/trigger evals" | ||
| filter: | ||
| type: trigger | ||
|
|
||
| integration: | ||
| description: "All behavior/integration evals (LLM-backed)" | ||
| filter: | ||
| type: integration | ||
|
|
||
| full: | ||
| description: "All evals including LLM graders — nightly" | ||
| filter: {} | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| # Evals | ||
|
|
||
| Skill evaluation suites run by [Vally](https://github.com/microsoft/evaluate) (`@microsoft/vally-cli`). Each subdirectory corresponds to a skill and contains an `eval.yaml` defining stimuli, graders, and configuration. | ||
|
|
||
| Full docs: <https://aka.ms/vally> | ||
|
|
||
| > **You don't need access to the Vally source repo to run evals locally.** You only need the `@microsoft/vally-cli` package from GitHub Packages (see [Prerequisites](#prerequisites) below). If you need source access (e.g., to debug vally internals), reach out via <https://aka.ms/vally>. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| `@microsoft/vally-cli` is published to GitHub Packages. You need a GitHub **Personal Access Token** with the `read:packages` scope. | ||
|
|
||
| 1. Create a PAT: <https://github.com/settings/tokens> (classic) → enable `read:packages`. | ||
| 2. Configure npm to use GitHub Packages for the `@microsoft` scope. Create or update `~/.npmrc`: | ||
|
|
||
| ```ini | ||
| @microsoft:registry=https://npm.pkg.github.com | ||
| //npm.pkg.github.com/:_authToken=${GITHUB_PACKAGES_TOKEN} | ||
| ``` | ||
|
|
||
| 3. Export your token: | ||
|
|
||
| ```bash | ||
| export GITHUB_PACKAGES_TOKEN=ghp_xxxxxxxxxxxx | ||
| ``` | ||
|
|
||
| 4. Install the CLI (either globally, or invoke with `npx`): | ||
|
|
||
| ```bash | ||
| npm install -g @microsoft/vally-cli | ||
| # or, no install: use `npx @microsoft/vally-cli ...` below | ||
| ``` | ||
|
|
||
| You will also need a `GITHUB_TOKEN` (Copilot-enabled) in your environment for the `copilot-sdk` executor used by most evals. | ||
|
|
||
| ## Running a single eval spec | ||
|
|
||
| From the repo root: | ||
|
|
||
| ```bash | ||
| npx @microsoft/vally-cli eval \ | ||
| --eval-spec evals/azure-hosted-copilot-sdk/eval.yaml \ | ||
| --output-dir ./results \ | ||
| --output jsonl | ||
| ``` | ||
|
|
||
| ## Running a suite | ||
|
|
||
| Suites are defined in [`.vally.yaml`](../.vally.yaml) at the repo root and filter across all `evals/**/eval.yaml` files. | ||
|
|
||
| ```bash | ||
| npx @microsoft/vally-cli eval --suite pr | ||
| npx @microsoft/vally-cli eval --suite full | ||
| ``` | ||
|
|
||
| ## Viewing results | ||
|
|
||
| After a run, check the output directory (default `./results`): | ||
|
|
||
| - `results.jsonl` — one JSON record per stimulus/run with grader outcomes. | ||
| - `eval-results.md` — human-readable summary. | ||
|
|
||
| ## More info | ||
|
|
||
| - Vally docs: <https://aka.ms/vally> | ||
| - Vally source: <https://github.com/microsoft/evaluate> | ||
| - Suite definitions: [`.vally.yaml`](../.vally.yaml) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,99 @@ | ||
| # Common Vally graders reference — shared patterns across eval suites | ||
| # ───────────────────────────────────────────────────────────────────── | ||
| # This file is NOT consumed by Vally directly. It documents the global | ||
| # graders that were defined at the Waza eval level and must now be | ||
| # duplicated into EVERY stimulus block (evaluate#125 workaround). | ||
| # | ||
| # Copy the relevant graders into each stimulus in your eval.yaml files. | ||
| # Migration: Waza → Vally (issue #1817, Phase 1) | ||
|
|
||
| # ── has_output ────────────────────────────────────────────────────── | ||
| # Ensures the agent produced meaningful output (not empty/trivial). | ||
| # Waza: type: code, config: { assertions: ["len(output) > 10"] } | ||
| # Vally: type: completed | ||
| # | ||
| # Used in: azure-enterprise-infra-planner, azure-prepare, azure-deploy | ||
| has_output: | ||
| type: completed | ||
|
|
||
| # ── no_fatal_errors ───────────────────────────────────────────────── | ||
| # Catches fatal errors, unhandled exceptions, stack traces in output. | ||
| # Waza: type: regex, config: { must_not_match: [...] } | ||
| # Vally: type: output-not-matches | ||
| # | ||
| # Used in: azure-hosted-copilot-sdk | ||
| no_fatal_errors: | ||
| type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fatal error|unhandled exception|stack trace" | ||
|
|
||
| # ── no_runtime_failure ────────────────────────────────────────────── | ||
| # Variant of no_fatal_errors with additional crash/panic patterns. | ||
| # Waza: type: text, config: { regex_not_match: [...] } | ||
| # Vally: type: output-not-matches | ||
| # | ||
| # Used in: azure-enterprise-infra-planner, azure-deploy | ||
| no_runtime_failure: | ||
| type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fatal error|crashed|unhandled exception|panic:" | ||
|
|
||
| # ── security_posture ──────────────────────────────────────────────── | ||
| # Ensures agent never suggests connection strings or shared keys. | ||
| # Waza: type: regex/text, config: { must_not_match / regex_not_match: [...] } | ||
| # Vally: one output-not-matches grader per pattern | ||
| # | ||
| # Used in: azure-enterprise-infra-planner, azure-prepare | ||
| security_posture_connection_string: | ||
| type: output-not-matches | ||
| config: | ||
| pattern: "(?i)connection.?string.*=.*Account(Key|Name)=" | ||
|
|
||
| security_posture_shared_access_key: | ||
| type: output-not-matches | ||
| config: | ||
| pattern: "(?i)SharedAccessKey=" | ||
|
|
||
| # Additional pattern — only in azure-enterprise-infra-planner | ||
| security_posture_master_key: | ||
| type: output-not-matches | ||
| config: | ||
| pattern: "(?i)masterKey=" | ||
|
|
||
| # ── plan_first ────────────────────────────────────────────────────── | ||
| # Verifies plan-first workflow — output contains planning indicators. | ||
| # Waza: type: regex, config: { must_match: [...], must_not_match: [...] } | ||
| # Vally: split into positive (output-matches) and negative (output-not-matches) | ||
| # | ||
| # Used in: azure-prepare | ||
| plan_first_positive: | ||
| type: output-matches | ||
| config: | ||
| pattern: "(?i)plan|planning|structure|step|phase|first|will create|creating" | ||
|
|
||
| plan_first_negative: | ||
| type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fatal error|crashed|exception occurred" | ||
|
|
||
| # ── nodejs_entry_point (informational) ────────────────────────────── | ||
| # Checks that Node.js entry points are mentioned. | ||
| # Waza: type: regex, config: { must_match: [...], skip_if_no_match: true } | ||
| # Vally: No skip_if_no_match equivalent — include ONLY in Node.js/TS stimuli. | ||
| # | ||
| # Used in: azure-prepare (selectively — TS/Node tasks only) | ||
| nodejs_entry_point: | ||
| type: output-matches | ||
| config: | ||
| pattern: "(?i)index\\.(js|ts)|app\\.setup|entry.?point|node|typescript|javascript" | ||
|
|
||
| # ── efficiency (constraints, not a grader) ────────────────────────── | ||
| # Limits tool call count. Becomes a stimulus-level constraint. | ||
| # Waza: type: behavior, config: { max_tool_calls: N } | ||
| # Vally: constraints: { max_turns: N } | ||
| # | ||
| # Per-suite defaults: | ||
| # azure-enterprise-infra-planner: max_turns: 50 | ||
| # azure-prepare: max_turns: 40 | ||
| # azure-hosted-copilot-sdk: per-task (10 or 15) | ||
| # azure-deploy: (none — no global behavior constraint) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,107 @@ | ||
| # Vally eval config — migrated from Waza | ||
| # Source: tests/azure-deploy/eval/eval.yaml + tasks/*.yaml | ||
| # Migration: Waza → Vally (issue #1817, Phase 1) | ||
| # | ||
| # Global graders (evaluate#125 workaround): has_output and no_runtime_failure | ||
| # are duplicated into every stimulus block below. | ||
|
|
||
| name: azure-deploy-eval | ||
| description: | | ||
| Evaluation suite for the azure-deploy skill. | ||
| Tests deployment guidance quality with emphasis on: | ||
| - AVM+AZD pattern-module preference | ||
| - AVM fallback when no pattern module exists | ||
| - deploy-only routing (not prepare/validate) | ||
|
|
||
| tags: | ||
| type: integration | ||
| skill: azure-deploy | ||
|
|
||
| environment: | ||
| skills: | ||
| - ../../plugin/skills/azure-deploy | ||
|
|
||
| config: | ||
| runs: 3 | ||
| timeout: 420 | ||
| executor: copilot-sdk | ||
| model: claude-sonnet-4 | ||
|
|
||
| scoring: | ||
| threshold: 0.8 | ||
|
|
||
| stimuli: | ||
| # ── avm-order-bicep-001 ── | ||
| # Waza source: tasks/avm-order-bicep.yaml | ||
| # Validates deploy guidance prefers AVM+AZD pattern modules first | ||
| - name: "AVM+AZD Priority - Bicep Deploy" | ||
| prompt: | | ||
| My app is already prepared and validated. | ||
| Give me deploy guidance and module preference order for Bicep. | ||
| Prefer AVM+AZD patterns where available, with fallback to AVM resource modules and AVM utility modules. | ||
| tags: | ||
| type: integration | ||
| tier: full | ||
| area: output | ||
| graders: | ||
| # Task: expected.output_contains | ||
| - type: output-contains | ||
| config: | ||
| substring: "AVM" | ||
| - type: output-contains | ||
| config: | ||
| substring: "deploy" | ||
| - type: output-contains | ||
| config: | ||
| substring: "pattern" | ||
| # Task grader: avm_pattern_first | ||
| - type: output-matches | ||
| config: | ||
| pattern: "(?i)AVM\\+AZD|AZD pattern|pattern modules" | ||
| # Task grader: includes_resource_and_utility_fallback | ||
| - type: output-matches | ||
| config: | ||
| pattern: "(?is)(AVM\\+AZD|AZD pattern|pattern modules).*resource modules.*utility modules" | ||
| # Global: has_output | ||
| - type: completed | ||
| # Global: no_runtime_failure | ||
| - type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fatal error|exception occurred|crashed" | ||
|
|
||
| # ── avm-fallback-no-pattern-001 ── | ||
| # Waza source: tasks/avm-fallback-no-pattern.yaml | ||
| # Validates fallback order when no AVM+AZD pattern module exists | ||
| - name: "AVM Fallback When No AZD Pattern" | ||
| prompt: | | ||
| I'm deploying with Bicep and there is no AVM+AZD pattern module for my scenario. | ||
| What module order should I follow if no pattern module exists and fallback must stay AVM resource modules then AVM utility modules? | ||
| tags: | ||
| type: integration | ||
| tier: full | ||
| area: output | ||
| graders: | ||
| # Task: expected.output_contains | ||
| - type: output-contains | ||
| config: | ||
| substring: "AVM" | ||
| - type: output-contains | ||
| config: | ||
| substring: "resource" | ||
| - type: output-contains | ||
| config: | ||
| substring: "utility" | ||
| # Task grader: explicit_no_pattern_fallback | ||
| - type: output-matches | ||
| config: | ||
| pattern: "(?is)(no .*pattern module|if no .*pattern).*AVM.*resource.*AVM.*utility" | ||
| # Task grader: avoids_non_avm_fallback | ||
| - type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fallback to non-AVM|use non-AVM modules" | ||
| # Global: has_output | ||
| - type: completed | ||
| # Global: no_runtime_failure | ||
| - type: output-not-matches | ||
| config: | ||
| pattern: "(?i)fatal error|exception occurred|crashed" |
`workflow_dispatch` alone doesn't let maintainers run this workflow against fork PR code (dispatch runs on refs that exist in the base repo). If the goal is to support fork PR evals with secrets, consider adding dispatch inputs (e.g., PR number/ref) and explicitly checking out `refs/pull/<n>/head` (or using another safe mechanism) so the workflow can run on the PR's HEAD SHA.
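
A rough sketch of what that suggestion could look like, covering only the dispatch path. The `pr_number` input is a hypothetical name that does not appear in this PR; the checkout pin is the one the workflow already uses. Treat it as an illustration of the idea, not as the change the reviewer or author has in mind.

```yaml
# Sketch only: `pr_number` is a hypothetical input name, not part of this PR.
on:
  workflow_dispatch:
    inputs:
      pr_number:
        description: "Pull request number to evaluate (hypothetical input)"
        required: true
        type: number

jobs:
  eval:
    name: Run Evaluations
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
        with:
          # GitHub mirrors every PR head, including fork PRs, into the base
          # repo as refs/pull/<n>/head, so a dispatch run can check it out.
          ref: refs/pull/${{ inputs.pr_number }}/head
```

Because a dispatch run has access to repository secrets, anything checked out this way should be reviewed by a maintainer before the workflow is triggered. If the `pull_request` trigger is kept alongside it, the `ref` expression would also need a fallback for runs where the input is empty.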