Commit 1f32453

wbreza and Copilot committed
Address PR #1912 review feedback
Workflow hardening:
- Drop pull_request trigger (keep workflow_dispatch only) to eliminate token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for defense-in-depth

Package hygiene:
- Remove @microsoft/vally-cli from devDependencies (CI installs it explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency

Eval spec cleanup:
- Remove 13 broad output-not-contains "error"/"failed" graders from azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics

Cleanup:
- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory

Follow-up issue #1920 tracks wiring CI to a curated medium suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 3f4acc4 commit 1f32453

16 files changed

Lines changed: 6577 additions & 6724 deletions

.github/workflows/eval.yml

Lines changed: 2 additions & 7 deletions
@@ -1,10 +1,5 @@
 name: Run Skill Evaluations
 on:
-  pull_request:
-    branches: [main]
-    paths:
-      - 'evals/**'
-      - 'plugin/skills/**'
   workflow_dispatch:

 permissions:
@@ -23,8 +18,8 @@ jobs:
           node-version: '22'
           registry-url: https://npm.pkg.github.com
           scope: '@microsoft'
-      - name: Install dependencies
-        run: npm install --no-save
+      - name: Install vally-cli
+        run: npm install --no-save @microsoft/vally-cli
         env:
           NODE_AUTH_TOKEN: ${{ secrets.VALLY_NPM_TOKEN }}
       - name: Run evaluations

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -347,3 +347,6 @@ x86/
 dashboard/.azure/
 dashboard/dist/
 dashboard/**/dist/
+
+# Local vally eval outputs
+results/

.vally.yaml

Lines changed: 3 additions & 3 deletions
@@ -7,13 +7,13 @@ paths:

 suites:
   smoke:
-    description: "Fast static checks — no LLM calls, <60s"
+    description: "Static non-LLM checks only (e.g., trigger-pattern tests). Currently empty — all evals use the copilot-sdk executor."
     filter:
       tier: smoke
       cost: free

   pr:
-    description: "All free-tier evals for PR gate, <2min"
+    description: "Non-LLM PR gate evals (cost: free reserved for static checks). Currently empty — populate as static evals are added."
     filter:
       cost: free

@@ -23,7 +23,7 @@ suites:
       type: trigger

   integration:
-    description: "All behavior/integration evals"
+    description: "All behavior/integration evals (LLM-backed)"
     filter:
       type: integration
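
The retagging in this commit interacts with the suite filters above: once `cost: free` is removed from LLM-backed stimuli, the `smoke` and `pr` suites select nothing, which is what the new descriptions state. A minimal Python sketch of that selection logic follows. It assumes a suite selects a stimulus when every filter key equals the stimulus tag (or is a member of a list-valued tag such as `area: [output, files]`). That matching rule and the `matches_filter` helper are illustrative assumptions, not Vally's documented behavior.

```python
from typing import Any, Mapping

# Suite filters copied from the .vally.yaml hunks in this commit.
SUITES: dict[str, dict[str, str]] = {
    "smoke": {"tier": "smoke", "cost": "free"},
    "pr": {"cost": "free"},
    "integration": {"type": "integration"},
}


def matches_filter(tags: Mapping[str, Any], suite_filter: Mapping[str, str]) -> bool:
    """Hypothetical rule: every filter key must match the stimulus tags."""
    for key, wanted in suite_filter.items():
        actual = tags.get(key)
        # Treat scalar tags as one-element lists so both shapes share one check.
        values = actual if isinstance(actual, list) else [actual]
        if wanted not in values:
            return False
    return True


# Tags of one azure-deploy stimulus before and after this commit.
before = {"type": "integration", "tier": "full", "cost": "free", "area": "output"}
after = {"type": "integration", "tier": "full", "area": "output"}

for name, suite_filter in SUITES.items():
    print(name, matches_filter(before, suite_filter), matches_filter(after, suite_filter))
```

Under this assumed rule the stimulus drops out of `pr` after the change but stays in `integration`, so LLM-backed evals no longer run in a gate meant for static checks.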

evals/README.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+# Evals
+
+Skill evaluation suites run by [Vally](https://github.com/microsoft/ai-bench) (`@microsoft/vally-cli`). Each subdirectory corresponds to a skill and contains an `eval.yaml` defining stimuli, graders, and configuration.
+
+## Prerequisites
+
+`@microsoft/vally-cli` is published to GitHub Packages. You need a GitHub **Personal Access Token** with the `read:packages` scope.
+
+1. Create a PAT: <https://github.com/settings/tokens> (classic) → enable `read:packages`.
+2. Configure npm to use GitHub Packages for the `@microsoft` scope. Create or update `~/.npmrc`:
+
+   ```ini
+   @microsoft:registry=https://npm.pkg.github.com
+   //npm.pkg.github.com/:_authToken=${GITHUB_PACKAGES_TOKEN}
+   ```
+
+3. Export your token:
+
+   ```bash
+   export GITHUB_PACKAGES_TOKEN=ghp_xxxxxxxxxxxx
+   ```
+
+4. Install the CLI (either globally, or invoke with `npx`):
+
+   ```bash
+   npm install -g @microsoft/vally-cli
+   # or, no install: use `npx @microsoft/vally-cli ...` below
+   ```
+
+You will also need a `GITHUB_TOKEN` (Copilot-enabled) in your environment for the `copilot-sdk` executor used by most evals.
+
+## Running a single eval spec
+
+From the repo root:
+
+```bash
+npx @microsoft/vally-cli eval \
+  --eval-spec evals/azure-hosted-copilot-sdk/eval.yaml \
+  --output-dir ./results \
+  --output jsonl
+```
+
+## Running a suite
+
+Suites are defined in [`.vally.yaml`](../.vally.yaml) at the repo root and filter across all `evals/**/eval.yaml` files.
+
+```bash
+npx @microsoft/vally-cli eval --suite pr
+npx @microsoft/vally-cli eval --suite full
+```
+
+## Viewing results
+
+After a run, check the output directory (default `./results`):
+
+- `results.jsonl` — one JSON record per stimulus/run with grader outcomes.
+- `eval-results.md` — human-readable summary.
+
+## More info
+
+- Vally docs & source: <https://github.com/microsoft/ai-bench>
+- Suite definitions: [`.vally.yaml`](../.vally.yaml)
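
The README describes `results.jsonl` as one JSON record per stimulus/run but does not show the record schema. The Python sketch below is one way to inspect such a file without assuming any field names: it parses each line and reports which top-level keys appear. The helpers `load_records` and `key_frequency` are hypothetical, and the path assumes the default `./results` output directory mentioned in the README.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable


def load_records(lines: Iterable[str]) -> list[dict]:
    """Parse JSON Lines input: one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


def key_frequency(records: list[dict]) -> Counter:
    """Count how often each top-level key appears across records."""
    return Counter(key for record in records for key in record)


if __name__ == "__main__":
    # Assumed location: default output directory plus the file name from the README.
    path = Path("results") / "results.jsonl"
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            records = load_records(handle)
        print(f"{len(records)} records")
        for key, count in key_frequency(records).most_common():
            print(f"  {key}: present in {count} records")
    else:
        print(f"{path} not found; run an eval first")
```

Listing the keys first avoids guessing at grader-outcome field names before writing any pass/fail aggregation.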

evals/azure-deploy/eval.yaml

Lines changed: 0 additions & 2 deletions
@@ -42,7 +42,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: output
     graders:
       # Task: expected.output_contains
@@ -80,7 +79,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: output
     graders:
       # Task: expected.output_contains

evals/azure-enterprise-infra-planner/eval.yaml

Lines changed: 0 additions & 12 deletions
@@ -42,7 +42,6 @@ stimuli:
     tags:
       type: integration
       tier: smoke
-      cost: free
       area: files
     constraints:
       max_turns: 50
@@ -78,7 +77,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: behavior
     constraints:
       max_turns: 50
@@ -110,7 +108,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: [output, files]
     constraints:
       max_turns: 50
@@ -175,7 +172,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: [output, files]
     constraints:
       max_turns: 50
@@ -239,7 +235,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: behavior
     constraints:
       max_turns: 50
@@ -270,7 +265,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: behavior
     constraints:
       max_turns: 50
@@ -302,7 +296,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: [output, files]
     constraints:
       max_turns: 50
@@ -367,7 +360,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: [output, files]
     constraints:
       max_turns: 50
@@ -432,7 +424,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: [output, files]
     constraints:
       max_turns: 50
@@ -497,7 +488,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: [output, files]
     constraints:
       max_turns: 50
@@ -561,7 +551,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: behavior
     constraints:
       max_turns: 50
@@ -593,7 +582,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: [output, files]
     constraints:
       max_turns: 50

evals/azure-hosted-copilot-sdk/eval.yaml

Lines changed: 3 additions & 36 deletions
@@ -19,6 +19,9 @@ tags:
 environment:
   skills:
     - ../../plugin/skills/azure-hosted-copilot-sdk
+    - ../../plugin/skills/azure-prepare
+    - ../../plugin/skills/azure-validate
+    - ../../plugin/skills/azure-deploy

 config:
   runs: 1
@@ -38,7 +41,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: output
     constraints:
       max_turns: 10
@@ -48,12 +50,6 @@ stimuli:
         config:
           substring: "azd init --template azure-samples/copilot-sdk-service"
       # Task: expected.output_not_contains
-      - type: output-not-contains
-        config:
-          substring: "error"
-      - type: output-not-contains
-        config:
-          substring: "failed"
       # Global: no_fatal_errors
       - type: output-not-matches
         config:
@@ -67,7 +63,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: output
     constraints:
       max_turns: 10
@@ -77,12 +72,6 @@ stimuli:
         config:
           substring: "azure.yaml"
       # Task: expected.output_not_contains
-      - type: output-not-contains
-        config:
-          substring: "error"
-      - type: output-not-contains
-        config:
-          substring: "failed"
       # Global: no_fatal_errors
       - type: output-not-matches
         config:
@@ -96,7 +85,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: output
     constraints:
       max_turns: 15
@@ -112,12 +100,6 @@ stimuli:
         config:
           substring: "azure-deploy"
       # Task: expected.output_not_contains
-      - type: output-not-contains
-        config:
-          substring: "error"
-      - type: output-not-contains
-        config:
-          substring: "failed"
       # Global: no_fatal_errors
       - type: output-not-matches
         config:
@@ -131,7 +113,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: output
     constraints:
       max_turns: 10
@@ -150,12 +131,6 @@ stimuli:
       - type: output-not-contains
         config:
           substring: "DefaultAzureCredential"
-      - type: output-not-contains
-        config:
-          substring: "error"
-      - type: output-not-contains
-        config:
-          substring: "failed"
       # Global: no_fatal_errors
       - type: output-not-matches
         config:
@@ -169,7 +144,6 @@ stimuli:
     tags:
       type: integration
       tier: full
-      cost: free
       area: output
     constraints:
       max_turns: 10
@@ -185,12 +159,6 @@ stimuli:
         config:
           substring: "DefaultAzureCredential"
       # Task: expected.output_not_contains
-      - type: output-not-contains
-        config:
-          substring: "error"
-      - type: output-not-contains
-        config:
-          substring: "failed"
       - type: output-not-contains
         config:
           substring: "apiKey"
@@ -213,7 +181,6 @@ stimuli:
     tags:
       type: trigger
       tier: smoke
-      cost: free
       area: routing
     constraints:
       max_turns: 10
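
The `output-not-contains` graders removed in this file fail on any occurrence of "error" or "failed", including benign advice in a model response. The Python sketch below illustrates why a targeted pattern is less brittle. The helper names, the case-sensitive substring semantics, and the `FATAL_ERROR` pattern are assumptions for illustration only; the spec's actual fatal-error regex is not visible in this diff.

```python
import re

# Hypothetical pattern; the real regex kept in eval.yaml is not shown in the diff.
FATAL_ERROR = re.compile(r"\b(fatal error|unhandled exception)\b", re.IGNORECASE)


def passes_not_contains(output: str, substring: str) -> bool:
    """Broad grader: pass only when the substring is absent from the output."""
    return substring not in output


def passes_not_matches(output: str, pattern: re.Pattern[str]) -> bool:
    """Targeted grader: pass only when the pattern finds no match."""
    return pattern.search(output) is None


# A helpful answer that merely mentions the words, and a genuinely broken one.
benign = "Add error handling and retry if a request failed."
broken = "Fatal error: deployment aborted."

for label, text in (("benign", benign), ("broken", broken)):
    print(
        label,
        passes_not_contains(text, "error"),
        passes_not_contains(text, "failed"),
        passes_not_matches(text, FATAL_ERROR),
    )
```

In this sketch the benign answer fails both substring graders yet passes the pattern grader, while the broken output is still caught by the pattern.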

evals/azure-hosted-copilot-sdk/tasks/byom-config.yaml

Lines changed: 0 additions & 29 deletions
This file was deleted.
