microsoft · wbreza · Apr 11, 2026 · Apr 11, 2026 · Apr 13, 2026 · Apr 14, 2026
diff --git a/.github/workflows/eval.yml b/.github/workflows/eval.yml
@@ -1,34 +1,36 @@
 name: Run Skill Evaluations
-
 on:
-  pull_request:
-    branches: [main]
-    paths:
-      - 'evals/**'
-      - 'plugin/skills/**'
+  workflow_dispatch:
 
 permissions:
   contents: read
+  packages: read
 
 jobs:
   eval:
     name: Run Evaluations
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
-      - name: Install Azure Developer CLI
-        uses: Azure/setup-azd@c495e71ba59e44bfaaac10a32c8ee90d191ca4a3 # v2
-      - name: Install waza extension
-        run: |
-          azd config set alpha.extensions on
-          azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json
-          azd ext install microsoft.azd.waza
+      - name: Setup Node.js
+        uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
+        with:
+          node-version: '22'
+          registry-url: https://npm.pkg.github.com
+          scope: '@microsoft'
+      - name: Install vally-cli
+        run: npm install --no-save --ignore-scripts @microsoft/vally-cli
+        env:
+          NODE_AUTH_TOKEN: ${{ secrets.VALLY_NPM_TOKEN }}
       - name: Run evaluations
-        run: azd waza run evals/azure-hosted-copilot-sdk/eval.yaml --output-dir ./results
+        run: npx @microsoft/vally-cli eval --eval-spec evals/azure-hosted-copilot-sdk/eval.yaml --output-dir ./results --output jsonl
+        env:
+          GITHUB_TOKEN: ${{ secrets.COPILOT_CLI_TOKEN }}
       - name: Upload results
         if: always()
         uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
         with:
           name: eval-results
           path: ./results
-          retention-days: 30
+          retention-days: 30
+
diff --git a/.gitignore b/.gitignore
@@ -347,3 +347,6 @@ x86/
 dashboard/.azure/
 dashboard/dist/
 dashboard/**/dist/
+
+# Local vally eval outputs
+results/
diff --git a/.npmrc b/.npmrc
@@ -0,0 +1 @@
+@microsoft:registry=https://npm.pkg.github.com
diff --git a/.vally.yaml b/.vally.yaml
@@ -0,0 +1,32 @@
+paths:
+  skills:
+    - plugin/skills
+  evals:
+    - evals
+  results: results
+
+suites:
+  smoke:
+    description: "Static non-LLM checks only (e.g., trigger-pattern tests). Currently empty — all evals use the copilot-sdk executor."
+    filter:
+      tier: smoke
+      cost: free
+
+  pr:
+    description: "Non-LLM PR gate evals (cost: free reserved for static checks). Currently empty — populate as static evals are added."
+    filter:
+      cost: free
+
+  triggers:
+    description: "All routing/trigger evals"
+    filter:
+      type: trigger
+
+  integration:
+    description: "All behavior/integration evals (LLM-backed)"
+    filter:
+      type: integration
+
+  full:
+    description: "All evals including LLM graders — nightly"
+    filter: {}
diff --git a/evals/README.md b/evals/README.md
@@ -0,0 +1,67 @@
+# Evals
+
+Skill evaluation suites run by [Vally](https://github.com/microsoft/evaluate) (`@microsoft/vally-cli`). Each subdirectory corresponds to a skill and contains an `eval.yaml` defining stimuli, graders, and configuration.
+
+Full docs: <https://aka.ms/vally>
+
+> **You don't need access to the Vally source repo to run evals locally.** You only need the `@microsoft/vally-cli` package from GitHub Packages (see [Prerequisites](#prerequisites) below). If you need source access (e.g., to debug vally internals), reach out via <https://aka.ms/vally>.
+
+## Prerequisites
+
+`@microsoft/vally-cli` is published to GitHub Packages. You need a GitHub **Personal Access Token** with the `read:packages` scope.
+
+1. Create a PAT: <https://github.com/settings/tokens> (classic) → enable `read:packages`.
+2. Configure npm to use GitHub Packages for the `@microsoft` scope. Create or update `~/.npmrc`:
+
+   ```ini
+   @microsoft:registry=https://npm.pkg.github.com
+   //npm.pkg.github.com/:_authToken=${GITHUB_PACKAGES_TOKEN}
+   ```
+
+3. Export your token:
+
+   ```bash
+   export GITHUB_PACKAGES_TOKEN=ghp_xxxxxxxxxxxx
+   ```
+
+4. Install the CLI (either globally, or invoke with `npx`):
+
+   ```bash
+   npm install -g @microsoft/vally-cli
+   # or, no install: use `npx @microsoft/vally-cli ...` below
+   ```
+
+You will also need a `GITHUB_TOKEN` (Copilot-enabled) in your environment for the `copilot-sdk` executor used by most evals.
+
+## Running a single eval spec
+
+From the repo root:
+
+```bash
+npx @microsoft/vally-cli eval \
+  --eval-spec evals/azure-hosted-copilot-sdk/eval.yaml \
+  --output-dir ./results \
+  --output jsonl
+```
+
+## Running a suite
+
+Suites are defined in [`.vally.yaml`](../.vally.yaml) at the repo root and filter across all `evals/**/eval.yaml` files.
+
+```bash
+npx @microsoft/vally-cli eval --suite pr
+npx @microsoft/vally-cli eval --suite full
+```
+
+## Viewing results
+
+After a run, check the output directory (default `./results`):
+
+- `results.jsonl` — one JSON record per stimulus/run with grader outcomes.
+- `eval-results.md` — human-readable summary.
+
+## More info
+
+- Vally docs: <https://aka.ms/vally>
+- Vally source: <https://github.com/microsoft/evaluate>
+- Suite definitions: [`.vally.yaml`](../.vally.yaml)
diff --git a/evals/_base/common-graders.yaml b/evals/_base/common-graders.yaml
@@ -0,0 +1,99 @@
+# Common Vally graders reference — shared patterns across eval suites
+# ─────────────────────────────────────────────────────────────────────
+# This file is NOT consumed by Vally directly.  It documents the global
+# graders that were defined at the Waza eval level and must now be
+# duplicated into EVERY stimulus block (evaluate#125 workaround).
+#
+# Copy the relevant graders into each stimulus in your eval.yaml files.
+# Migration: Waza → Vally (issue #1817, Phase 1)
+
+# ── has_output ──────────────────────────────────────────────────────
+# Ensures the agent produced meaningful output (not empty/trivial).
+# Waza:  type: code, config: { assertions: ["len(output) > 10"] }
+# Vally: type: completed
+#
+# Used in: azure-enterprise-infra-planner, azure-prepare, azure-deploy
+has_output:
+  type: completed
+
+# ── no_fatal_errors ─────────────────────────────────────────────────
+# Catches fatal errors, unhandled exceptions, stack traces in output.
+# Waza:  type: regex, config: { must_not_match: [...] }
+# Vally: type: output-not-matches
+#
+# Used in: azure-hosted-copilot-sdk
+no_fatal_errors:
+  type: output-not-matches
+  config:
+    pattern: "(?i)fatal error|unhandled exception|stack trace"
+
+# ── no_runtime_failure ──────────────────────────────────────────────
+# Variant of no_fatal_errors with additional crash/panic patterns.
+# Waza:  type: text, config: { regex_not_match: [...] }
+# Vally: type: output-not-matches
+#
+# Used in: azure-enterprise-infra-planner, azure-deploy
+no_runtime_failure:
+  type: output-not-matches
+  config:
+    pattern: "(?i)fatal error|crashed|unhandled exception|panic:"
+
+# ── security_posture ────────────────────────────────────────────────
+# Ensures agent never suggests connection strings or shared keys.
+# Waza:  type: regex/text, config: { must_not_match / regex_not_match: [...] }
+# Vally: one output-not-matches grader per pattern
+#
+# Used in: azure-enterprise-infra-planner, azure-prepare
+security_posture_connection_string:
+  type: output-not-matches
+  config:
+    pattern: "(?i)connection.?string.*=.*Account(Key|Name)="
+
+security_posture_shared_access_key:
+  type: output-not-matches
+  config:
+    pattern: "(?i)SharedAccessKey="
+
+# Additional pattern — only in azure-enterprise-infra-planner
+security_posture_master_key:
+  type: output-not-matches
+  config:
+    pattern: "(?i)masterKey="
+
+# ── plan_first ──────────────────────────────────────────────────────
+# Verifies plan-first workflow — output contains planning indicators.
+# Waza:  type: regex, config: { must_match: [...], must_not_match: [...] }
+# Vally: split into positive (output-matches) and negative (output-not-matches)
+#
+# Used in: azure-prepare
+plan_first_positive:
+  type: output-matches
+  config:
+    pattern: "(?i)plan|planning|structure|step|phase|first|will create|creating"
+
+plan_first_negative:
+  type: output-not-matches
+  config:
+    pattern: "(?i)fatal error|crashed|exception occurred"
+
+# ── nodejs_entry_point (informational) ──────────────────────────────
+# Checks that Node.js entry points are mentioned.
+# Waza:  type: regex, config: { must_match: [...], skip_if_no_match: true }
+# Vally: No skip_if_no_match equivalent — include ONLY in Node.js/TS stimuli.
+#
+# Used in: azure-prepare (selectively — TS/Node tasks only)
+nodejs_entry_point:
+  type: output-matches
+  config:
+    pattern: "(?i)index\\.(js|ts)|app\\.setup|entry.?point|node|typescript|javascript"
+
+# ── efficiency (constraints, not a grader) ──────────────────────────
+# Limits tool call count.  Becomes a stimulus-level constraint.
+# Waza:  type: behavior, config: { max_tool_calls: N }
+# Vally: constraints: { max_turns: N }
+#
+# Per-suite defaults:
+#   azure-enterprise-infra-planner: max_turns: 50
+#   azure-prepare:                  max_turns: 40
+#   azure-hosted-copilot-sdk:       per-task (10 or 15)
+#   azure-deploy:                   (none — no global behavior constraint)
diff --git a/evals/azure-deploy/eval.yaml b/evals/azure-deploy/eval.yaml
@@ -0,0 +1,107 @@
+# Vally eval config — migrated from Waza
+# Source: tests/azure-deploy/eval/eval.yaml + tasks/*.yaml
+# Migration: Waza → Vally (issue #1817, Phase 1)
+#
+# Global graders (evaluate#125 workaround): has_output and no_runtime_failure
+# are duplicated into every stimulus block below.
+
+name: azure-deploy-eval
+description: |
+  Evaluation suite for the azure-deploy skill.
+  Tests deployment guidance quality with emphasis on:
+  - AVM+AZD pattern-module preference
+  - AVM fallback when no pattern module exists
+  - deploy-only routing (not prepare/validate)
+
+tags:
+  type: integration
+  skill: azure-deploy
+
+environment:
+  skills:
+    - ../../plugin/skills/azure-deploy
+
+config:
+  runs: 3
+  timeout: 420
+  executor: copilot-sdk
+  model: claude-sonnet-4
+
+scoring:
+  threshold: 0.8
+
+stimuli:
+  # ── avm-order-bicep-001 ──
+  # Waza source: tasks/avm-order-bicep.yaml
+  # Validates deploy guidance prefers AVM+AZD pattern modules first
+  - name: "AVM+AZD Priority - Bicep Deploy"
+    prompt: |
+      My app is already prepared and validated.
+      Give me deploy guidance and module preference order for Bicep.
+      Prefer AVM+AZD patterns where available, with fallback to AVM resource modules and AVM utility modules.
+    tags:
+      type: integration
+      tier: full
+      area: output
+    graders:
+      # Task: expected.output_contains
+      - type: output-contains
+        config:
+          substring: "AVM"
+      - type: output-contains
+        config:
+          substring: "deploy"
+      - type: output-contains
+        config:
+          substring: "pattern"
+      # Task grader: avm_pattern_first
+      - type: output-matches
+        config:
+          pattern: "(?i)AVM\\+AZD|AZD pattern|pattern modules"
+      # Task grader: includes_resource_and_utility_fallback
+      - type: output-matches
+        config:
+          pattern: "(?is)(AVM\\+AZD|AZD pattern|pattern modules).*resource modules.*utility modules"
+      # Global: has_output
+      - type: completed
+      # Global: no_runtime_failure
+      - type: output-not-matches
+        config:
+          pattern: "(?i)fatal error|exception occurred|crashed"
+
+  # ── avm-fallback-no-pattern-001 ──
+  # Waza source: tasks/avm-fallback-no-pattern.yaml
+  # Validates fallback order when no AVM+AZD pattern module exists
+  - name: "AVM Fallback When No AZD Pattern"
+    prompt: |
+      I'm deploying with Bicep and there is no AVM+AZD pattern module for my scenario.
+      What module order should I follow if no pattern module exists and fallback must stay AVM resource modules then AVM utility modules?
+    tags:
+      type: integration
+      tier: full
+      area: output
+    graders:
+      # Task: expected.output_contains
+      - type: output-contains
+        config:
+          substring: "AVM"
+      - type: output-contains
+        config:
+          substring: "resource"
+      - type: output-contains
+        config:
+          substring: "utility"
+      # Task grader: explicit_no_pattern_fallback
+      - type: output-matches
+        config:
+          pattern: "(?is)(no .*pattern module|if no .*pattern).*AVM.*resource.*AVM.*utility"
+      # Task grader: avoids_non_avm_fallback
+      - type: output-not-matches
+        config:
+          pattern: "(?i)fallback to non-AVM|use non-AVM modules"
+      # Global: has_output
+      - type: completed
+      # Global: no_runtime_failure
+      - type: output-not-matches
+        config:
+          pattern: "(?i)fatal error|exception occurred|crashed"
Review comments on .npmrc:
@@ -0,0 +1 @@
+@microsoft:registry=https://npm.pkg.github.com

jongio (Collaborator) · Apr 17, 2026
This routes the entire `@microsoft` scope to GitHub Packages, not just `vally-cli`. Nothing in the tree hits that scope today (scripts/, tests/, dashboard/ have no `@microsoft/` deps), but if a sub-package later pulls a public `@microsoft/` package from npmjs, local `npm install` will 404 for contributors who don't have `VALLY_NPM_TOKEN` set. Worth a one-liner in `evals/README.md` so the next person isn't surprised.

JasonYeMSFT (Member) · Apr 20, 2026
I find the current `evals/README.md` sufficient for explaining what a local developer needs to set up the npm token.
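
To make the scope-routing concern concrete, here is a minimal Python sketch of how a scoped registry entry in `.npmrc` applies to every package under that scope. It is a simplified model for illustration, not npm's actual resolution code: `parse_npmrc`, `resolve_registry`, and the package name `@microsoft/some-public-package` are hypothetical.

```python
# Simplified model of npm scoped-registry routing (illustrative only).
DEFAULT_REGISTRY = "https://registry.npmjs.org/"


def parse_npmrc(text: str) -> dict[str, str]:
    """Collect `@scope:registry=<url>` entries from .npmrc-style text."""
    scopes: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.startswith("@") and key.endswith(":registry"):
            scopes[key[: -len(":registry")]] = value.strip()
    return scopes


def resolve_registry(package: str, scopes: dict[str, str]) -> str:
    """Return the registry a package name maps to in this simplified model."""
    if package.startswith("@") and "/" in package:
        scope = package.split("/", 1)[0]
        return scopes.get(scope, DEFAULT_REGISTRY)
    return DEFAULT_REGISTRY


# The single line added to .npmrc in this PR.
scopes = parse_npmrc("@microsoft:registry=https://npm.pkg.github.com")

for name in ("@microsoft/vally-cli", "@microsoft/some-public-package", "typescript"):
    print(name, "->", resolve_registry(name, scopes))
```

In this model both `@microsoft/` names map to GitHub Packages while the unscoped package falls back to the default registry, which is the behavior the first comment describes.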