32 changes: 17 additions & 15 deletions .github/workflows/eval.yml
Original file line number Diff line number Diff line change
@@ -1,34 +1,36 @@
name: Run Skill Evaluations

on:
  pull_request:
    branches: [main]
    paths:
      - 'evals/**'
      - 'plugin/skills/**'
  workflow_dispatch:
Comment on lines 2 to +3
Copilot AI Apr 17, 2026

workflow_dispatch alone doesn’t let maintainers run this workflow against fork PR code (dispatch runs on refs that exist in the base repo). If the goal is to support fork PR evals with secrets, consider adding dispatch inputs (e.g., PR number/ref) and explicitly checking out refs/pull/<n>/head (or using another safe mechanism) so the workflow can run on the PR’s HEAD SHA.


permissions:
  contents: read
  packages: read

jobs:
  eval:
    name: Run Evaluations
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
      - name: Install Azure Developer CLI
        uses: Azure/setup-azd@c495e71ba59e44bfaaac10a32c8ee90d191ca4a3 # v2
      - name: Install waza extension
        run: |
          azd config set alpha.extensions on
          azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
          azd ext install microsoft.azd.waza
      - name: Setup Node.js
        uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
        with:
          node-version: '22'
          registry-url: https://npm.pkg.github.com
          scope: '@microsoft'
      - name: Install vally-cli
        run: npm install --no-save @microsoft/vally-cli
Copilot AI Apr 17, 2026

Most workflows in this repo install Node dependencies with npm ci --ignore-scripts (e.g., .github/workflows/pr.yml:26-33) to keep installs deterministic and avoid executing arbitrary package lifecycle scripts in CI. Here we’re doing npm install without --ignore-scripts; consider switching to an approach that avoids lifecycle scripts (if @microsoft/vally-cli supports it) or otherwise document why scripts must be enabled for this job.

Suggested change
- run: npm install --no-save @microsoft/vally-cli
+ run: npm install --no-save --ignore-scripts @microsoft/vally-cli

        env:
          NODE_AUTH_TOKEN: ${{ secrets.VALLY_NPM_TOKEN }}
      - name: Run evaluations
        run: azd waza run evals/azure-hosted-copilot-sdk/eval.yaml --output-dir ./results
        run: npx @microsoft/vally-cli eval --eval-spec evals/azure-hosted-copilot-sdk/eval.yaml --output-dir ./results --output jsonl
        env:
          GITHUB_TOKEN: ${{ secrets.COPILOT_CLI_TOKEN }}
Comment thread
wbreza marked this conversation as resolved.
Comment thread
wbreza marked this conversation as resolved.
      - name: Upload results
        if: always()
Comment thread
wbreza marked this conversation as resolved.
        uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
        with:
          name: eval-results
          path: ./results
          retention-days: 30

3 changes: 3 additions & 0 deletions .gitignore
@@ -347,3 +347,6 @@ x86/
dashboard/.azure/
dashboard/dist/
dashboard/**/dist/

# Local vally eval outputs
results/
1 change: 1 addition & 0 deletions .npmrc
@@ -0,0 +1 @@
@microsoft:registry=https://npm.pkg.github.com
Collaborator

This routes the entire @microsoft scope to GitHub Packages, not just vally-cli. Nothing in the tree hits that scope today (scripts/, tests/, dashboard/ have no @microsoft/* deps), but if a sub-package later pulls a public @microsoft/* from npmjs, local npm install will 404 for contributors who don't have VALLY_NPM_TOKEN set. Worth a one-liner in evals/README.md so the next person isn't surprised.

Member

I find the current evals/Readme.md sufficient for explaining what a local developer needs to set up the NPM token.

32 changes: 32 additions & 0 deletions .vally.yaml
@@ -0,0 +1,32 @@
paths:
  skills:
    - plugin/skills
  evals:
    - evals
  results: results

suites:
  smoke:
    description: "Static non-LLM checks only (e.g., trigger-pattern tests). Currently empty — all evals use the copilot-sdk executor."
    filter:
      tier: smoke
      cost: free

  pr:
    description: "Non-LLM PR gate evals (cost: free reserved for static checks). Currently empty — populate as static evals are added."
    filter:
      cost: free
Comment thread
wbreza marked this conversation as resolved.

  triggers:
    description: "All routing/trigger evals"
    filter:
      type: trigger

  integration:
    description: "All behavior/integration evals (LLM-backed)"
    filter:
      type: integration

  full:
    description: "All evals including LLM graders — nightly"
    filter: {}
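
Reviewer note: the suite filters above can be read as tag selectors. The sketch below is only an illustration of that reading. It assumes a filter selects an eval when every key in the filter equals the corresponding tag, so an empty filter selects everything. Vally's actual matching rules are not shown in this PR, and `matches_filter` and `example_tags` are hypothetical names.

```python
def matches_filter(tags, tag_filter):
    """True when every key in `tag_filter` appears in `tags` with the same value.

    Assumption: exact-equality matching on all keys; an empty filter matches all.
    """
    return all(tags.get(key) == value for key, value in tag_filter.items())


# Filters copied from the suites in .vally.yaml.
SUITES = {
    "smoke": {"tier": "smoke", "cost": "free"},
    "pr": {"cost": "free"},
    "triggers": {"type": "trigger"},
    "integration": {"type": "integration"},
    "full": {},
}

if __name__ == "__main__":
    # Hypothetical tag set, for illustration only.
    example_tags = {"type": "integration", "tier": "full"}
    selected = [name for name, f in SUITES.items() if matches_filter(example_tags, f)]
    print(selected)
```

Under this assumption, an eval with no `cost` tag is selected by neither `smoke` nor `pr`, which is consistent with both descriptions saying those suites are currently empty.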
67 changes: 67 additions & 0 deletions evals/README.md
@@ -0,0 +1,67 @@
# Evals

Skill evaluation suites run by [Vally](https://github.com/microsoft/evaluate) (`@microsoft/vally-cli`). Each subdirectory corresponds to a skill and contains an `eval.yaml` defining stimuli, graders, and configuration.

Full docs: <https://aka.ms/vally>

> **You don't need access to the Vally source repo to run evals locally.** You only need the `@microsoft/vally-cli` package from GitHub Packages (see [Prerequisites](#prerequisites) below). If you need source access (e.g., to debug vally internals), reach out via <https://aka.ms/vally>.

## Prerequisites

`@microsoft/vally-cli` is published to GitHub Packages. You need a GitHub **Personal Access Token** with the `read:packages` scope.

1. Create a PAT: <https://github.com/settings/tokens> (classic) → enable `read:packages`.
2. Configure npm to use GitHub Packages for the `@microsoft` scope. Create or update `~/.npmrc`:

   ```ini
   @microsoft:registry=https://npm.pkg.github.com
   //npm.pkg.github.com/:_authToken=${GITHUB_PACKAGES_TOKEN}
   ```

3. Export your token:

   ```bash
   export GITHUB_PACKAGES_TOKEN=ghp_xxxxxxxxxxxx
   ```

4. Install the CLI (either globally, or invoke with `npx`):

   ```bash
   npm install -g @microsoft/vally-cli
   # or, no install: use `npx @microsoft/vally-cli ...` below
   ```

You will also need a `GITHUB_TOKEN` (Copilot-enabled) in your environment for the `copilot-sdk` executor used by most evals.

## Running a single eval spec

From the repo root:

```bash
npx @microsoft/vally-cli eval \
  --eval-spec evals/azure-hosted-copilot-sdk/eval.yaml \
  --output-dir ./results \
  --output jsonl
```

## Running a suite

Suites are defined in [`.vally.yaml`](../.vally.yaml) at the repo root and filter across all `evals/**/eval.yaml` files.

```bash
npx @microsoft/vally-cli eval --suite pr
npx @microsoft/vally-cli eval --suite full
```

## Viewing results

After a run, check the output directory (default `./results`):

- `results.jsonl` — one JSON record per stimulus/run with grader outcomes.
- `eval-results.md` — human-readable summary.
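
For a quick look at `results.jsonl` without other tooling, a minimal Python sketch along these lines may help. It assumes only that the file is standard JSON Lines (one JSON value per line). The record schema is not documented here, so the script lists whatever top-level keys it finds instead of naming fields. `load_jsonl` is a hypothetical helper, not part of Vally.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return one parsed JSON value per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    results_file = Path("results") / "results.jsonl"
    if results_file.exists():
        records = load_jsonl(results_file)
        print(f"{len(records)} record(s)")
        # List top-level keys rather than assuming a schema.
        keys = sorted({key for rec in records if isinstance(rec, dict) for key in rec})
        for key in keys:
            print(" ", key)
    else:
        print(f"{results_file} not found; run an eval first")
```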

## More info

- Vally docs: <https://aka.ms/vally>
- Vally source: <https://github.com/microsoft/evaluate>
- Suite definitions: [`.vally.yaml`](../.vally.yaml)
99 changes: 99 additions & 0 deletions evals/_base/common-graders.yaml
@@ -0,0 +1,99 @@
# Common Vally graders reference — shared patterns across eval suites
# ─────────────────────────────────────────────────────────────────────
# This file is NOT consumed by Vally directly. It documents the global
# graders that were defined at the Waza eval level and must now be
# duplicated into EVERY stimulus block (evaluate#125 workaround).
#
# Copy the relevant graders into each stimulus in your eval.yaml files.
# Migration: Waza → Vally (issue #1817, Phase 1)

# ── has_output ──────────────────────────────────────────────────────
# Ensures the agent produced meaningful output (not empty/trivial).
# Waza: type: code, config: { assertions: ["len(output) > 10"] }
# Vally: type: completed
#
# Used in: azure-enterprise-infra-planner, azure-prepare, azure-deploy
has_output:
  type: completed

# ── no_fatal_errors ─────────────────────────────────────────────────
# Catches fatal errors, unhandled exceptions, stack traces in output.
# Waza: type: regex, config: { must_not_match: [...] }
# Vally: type: output-not-matches
#
# Used in: azure-hosted-copilot-sdk
no_fatal_errors:
  type: output-not-matches
  config:
    pattern: "(?i)fatal error|unhandled exception|stack trace"

# ── no_runtime_failure ──────────────────────────────────────────────
# Variant of no_fatal_errors with additional crash/panic patterns.
# Waza: type: text, config: { regex_not_match: [...] }
# Vally: type: output-not-matches
#
# Used in: azure-enterprise-infra-planner, azure-deploy
no_runtime_failure:
  type: output-not-matches
  config:
    pattern: "(?i)fatal error|crashed|unhandled exception|panic:"

# ── security_posture ────────────────────────────────────────────────
# Ensures agent never suggests connection strings or shared keys.
# Waza: type: regex/text, config: { must_not_match / regex_not_match: [...] }
# Vally: one output-not-matches grader per pattern
#
# Used in: azure-enterprise-infra-planner, azure-prepare
security_posture_connection_string:
  type: output-not-matches
  config:
    pattern: "(?i)connection.?string.*=.*Account(Key|Name)="

security_posture_shared_access_key:
  type: output-not-matches
  config:
    pattern: "(?i)SharedAccessKey="

# Additional pattern — only in azure-enterprise-infra-planner
security_posture_master_key:
  type: output-not-matches
  config:
    pattern: "(?i)masterKey="

# ── plan_first ──────────────────────────────────────────────────────
# Verifies plan-first workflow — output contains planning indicators.
# Waza: type: regex, config: { must_match: [...], must_not_match: [...] }
# Vally: split into positive (output-matches) and negative (output-not-matches)
#
# Used in: azure-prepare
plan_first_positive:
  type: output-matches
  config:
    pattern: "(?i)plan|planning|structure|step|phase|first|will create|creating"

plan_first_negative:
  type: output-not-matches
  config:
    pattern: "(?i)fatal error|crashed|exception occurred"

# ── nodejs_entry_point (informational) ──────────────────────────────
# Checks that Node.js entry points are mentioned.
# Waza: type: regex, config: { must_match: [...], skip_if_no_match: true }
# Vally: No skip_if_no_match equivalent — include ONLY in Node.js/TS stimuli.
#
# Used in: azure-prepare (selectively — TS/Node tasks only)
nodejs_entry_point:
  type: output-matches
  config:
    pattern: "(?i)index\\.(js|ts)|app\\.setup|entry.?point|node|typescript|javascript"

# ── efficiency (constraints, not a grader) ──────────────────────────
# Limits tool call count. Becomes a stimulus-level constraint.
# Waza: type: behavior, config: { max_tool_calls: N }
# Vally: constraints: { max_turns: N }
#
# Per-suite defaults:
# azure-enterprise-infra-planner: max_turns: 50
# azure-prepare: max_turns: 40
# azure-hosted-copilot-sdk: per-task (10 or 15)
# azure-deploy: (none — no global behavior constraint)
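
Reviewer note: to sanity-check the `security_posture_*` patterns outside of Vally, a small Python sketch like the one below can be run locally. It makes two assumptions: that `output-not-matches` fails when the pattern is found anywhere in the output (search semantics), and that Vally's regex engine treats the inline `(?i)` flag the way Python's `re` module does. Neither is confirmed by this PR. `violations` is a hypothetical helper.

```python
import re

# Patterns copied from the security_posture_* entries in common-graders.yaml.
FORBIDDEN_PATTERNS = {
    "connection_string": r"(?i)connection.?string.*=.*Account(Key|Name)=",
    "shared_access_key": r"(?i)SharedAccessKey=",
    "master_key": r"(?i)masterKey=",
}


def violations(output):
    """Return the names of forbidden patterns found anywhere in `output`."""
    return [
        name
        for name, pattern in FORBIDDEN_PATTERNS.items()
        if re.search(pattern, output)  # assumption: search, not full match
    ]


if __name__ == "__main__":
    leaky = "ConnectionString = DefaultEndpointsProtocol=https;AccountName=demo;AccountKey=abc"
    safe = "Authenticate with a managed identity via DefaultAzureCredential."
    print(violations(leaky))
    print(violations(safe))
```

Because each Vally grader takes a single pattern, the three checks stay separate here as well, mirroring the one-grader-per-pattern split described above.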
107 changes: 107 additions & 0 deletions evals/azure-deploy/eval.yaml
@@ -0,0 +1,107 @@
# Vally eval config — migrated from Waza
# Source: tests/azure-deploy/eval/eval.yaml + tasks/*.yaml
# Migration: Waza → Vally (issue #1817, Phase 1)
#
# Global graders (evaluate#125 workaround): has_output and no_runtime_failure
# are duplicated into every stimulus block below.

name: azure-deploy-eval
description: |
  Evaluation suite for the azure-deploy skill.
  Tests deployment guidance quality with emphasis on:
  - AVM+AZD pattern-module preference
  - AVM fallback when no pattern module exists
  - deploy-only routing (not prepare/validate)

tags:
  type: integration
  skill: azure-deploy

environment:
  skills:
    - ../../plugin/skills/azure-deploy

config:
  runs: 3
  timeout: 420
  executor: copilot-sdk
  model: claude-sonnet-4

scoring:
  threshold: 0.8
stimuli:
  # ── avm-order-bicep-001 ──
  # Waza source: tasks/avm-order-bicep.yaml
  # Validates deploy guidance prefers AVM+AZD pattern modules first
  - name: "AVM+AZD Priority - Bicep Deploy"
    prompt: |
      My app is already prepared and validated.
      Give me deploy guidance and module preference order for Bicep.
      Prefer AVM+AZD patterns where available, with fallback to AVM resource modules and AVM utility modules.
    tags:
      type: integration
      tier: full
      area: output
    graders:
      # Task: expected.output_contains
      - type: output-contains
        config:
          substring: "AVM"
      - type: output-contains
        config:
          substring: "deploy"
      - type: output-contains
        config:
          substring: "pattern"
      # Task grader: avm_pattern_first
      - type: output-matches
        config:
          pattern: "(?i)AVM\\+AZD|AZD pattern|pattern modules"
      # Task grader: includes_resource_and_utility_fallback
      - type: output-matches
        config:
          pattern: "(?is)(AVM\\+AZD|AZD pattern|pattern modules).*resource modules.*utility modules"
      # Global: has_output
      - type: completed
      # Global: no_runtime_failure
      - type: output-not-matches
        config:
          pattern: "(?i)fatal error|exception occurred|crashed"

  # ── avm-fallback-no-pattern-001 ──
  # Waza source: tasks/avm-fallback-no-pattern.yaml
  # Validates fallback order when no AVM+AZD pattern module exists
  - name: "AVM Fallback When No AZD Pattern"
    prompt: |
      I'm deploying with Bicep and there is no AVM+AZD pattern module for my scenario.
      What module order should I follow if no pattern module exists and fallback must stay AVM resource modules then AVM utility modules?
    tags:
      type: integration
      tier: full
      area: output
    graders:
      # Task: expected.output_contains
      - type: output-contains
        config:
          substring: "AVM"
      - type: output-contains
        config:
          substring: "resource"
      - type: output-contains
        config:
          substring: "utility"
      # Task grader: explicit_no_pattern_fallback
      - type: output-matches
        config:
          pattern: "(?is)(no .*pattern module|if no .*pattern).*AVM.*resource.*AVM.*utility"
      # Task grader: avoids_non_avm_fallback
      - type: output-not-matches
        config:
          pattern: "(?i)fallback to non-AVM|use non-AVM modules"
      # Global: has_output
      - type: completed
      # Global: no_runtime_failure
      - type: output-not-matches
        config:
          pattern: "(?i)fatal error|exception occurred|crashed"
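
Reviewer note: the ordering graders in this file rely on the `s` (dot-matches-newline) flag so that `.*` can span the line breaks of a multi-line answer. The Python sketch below illustrates that behavior for the `includes_resource_and_utility_fallback` pattern. It assumes Vally applies the pattern with search semantics and interprets `(?is)` as Python's `re` module does, which this PR does not confirm. The sample answers and the `matches` helper are hypothetical.

```python
import re

# Pattern from the includes_resource_and_utility_fallback grader, after YAML
# unescaping ("\\+" in the double-quoted YAML string becomes "\+").
ORDERED = r"(?is)(AVM\+AZD|AZD pattern|pattern modules).*resource modules.*utility modules"
# Same pattern without the "s" flag, for comparison only.
SINGLE_LINE = ORDERED.replace("(?is)", "(?i)", 1)

# Hypothetical agent outputs, for illustration only.
answer = (
    "1. Prefer AVM+AZD pattern modules.\n"
    "2. Otherwise use AVM resource modules.\n"
    "3. Finally, AVM utility modules.\n"
)
reversed_answer = "Start with utility modules, then resource modules, then AVM+AZD."


def matches(pattern, text):
    """True when `pattern` is found anywhere in `text`."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    print(matches(ORDERED, answer))           # multi-line answer, "s" flag on
    print(matches(SINGLE_LINE, answer))       # same answer, "s" flag off
    print(matches(ORDERED, reversed_answer))  # modules listed in the wrong order
```

Under these assumptions only the first call reports a match: dropping the `s` flag stops `.*` at the first newline, and the reversed answer never has "resource modules" after the pattern-module mention.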