.github/skills/vally-eval/SKILL.md
8 additions & 2 deletions
Original file line number
Diff line number
Diff line change
@@ -1,6 +1,6 @@
1
1
---
2
2
name: vally-eval
3
-
description: "Author, validate, and run Vally eval.yaml evaluation suites for agent skills. TRIGGERS: create eval, write eval, add eval, run eval, validate eval, vally eval, eval.yaml, add stimulus, map test to eval, migrate test to eval, eval graders, eval scoring."
3
+
description: "Author, validate, and run Vally eval.yaml evaluation suites for agent skills. TRIGGERS: create eval, write eval, add eval, run eval, validate eval, vally eval, eval.yaml, add stimulus, map test to eval, migrate test to eval, eval graders, eval scoring, add eval to CI."
4
4
license: MIT
5
5
metadata:
6
6
author: Microsoft
@@ -19,6 +19,8 @@ Refer to the official documentation on the schema of the spec and the schema of
19
19
20
20
Vally eval suites for the azure-skills plugin use the following file layout. The shared eval spec is located at `<repo-root>/.vally.yaml`. Eval suites are grouped by skill: the suites for each skill are located at `<repo-root>/evals/<skill-name>/eval.yaml`, e.g. `<repo-root>/evals/azure-ai/eval.yaml`. If a skill needs fixture files for its eval suites, place them in a `fixture` directory under the skill's eval directory, e.g. `<repo-root>/evals/azure-ai/fixture/`.
21
21
22
+
A skill's eval suites can be split across multiple files under its eval directory, for example `<repo-root>/evals/<skill-name>/evalA.yaml` and `<repo-root>/evals/<skill-name>/evalB.yaml`. The `npm run test:vally` command runs the suites from all of these `.yaml` files.
23
+
22
24
## Migrate integration tests
23
25
24
26
The azure-skills plugin has JavaScript integration tests that use Jest as the underlying test runner. All such integration tests are in `tests/**/integration.test.ts` files.
@@ -48,7 +50,7 @@ npm run vally validate-stimulus
48
50
49
51
Extended features such as early termination are implemented using tags and many of them use serialized JSON objects as input. This validation script also validates the values of these special tags.
50
52
51
-
## Run vally eval suites
53
+
## Run vally eval suites locally
52
54
53
55
Use vally-cli to run vally eval suites. In most cases, a command like the following is what you want.
54
56
@@ -59,6 +61,10 @@ npm run test:vally -- --skill $SKILL
59
61
60
62
`--eval-spec ../evals/<skill-name>/eval.yaml` tells vally which eval spec to run. `--output-dir ./results` tells vally to write its output to a `results/` directory. Both paths are relative to the current working directory of the process running the command. `--executor-plugin ../../tests/vally/vally-executor.ts` tells vally to load and execute this module, which registers the custom executor used by the azure-skills vally eval suites. Unlike the other two, this path is relative to the parent directory of the eval spec being run. For example, if the eval spec is `<repo-root>/evals/azure-ai/eval.yaml`, the path resolves to `<repo-root>/tests/vally/vally-executor.ts`.
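The resolution rule for `--executor-plugin` can be checked with a few lines of Node.js path arithmetic. This is only a sketch of the rule as described here, not vally's own code: `resolveExecutorPlugin` is a hypothetical helper, and `/repo` stands in for `<repo-root>`.

```typescript
import * as path from "node:path";

// Hypothetical helper (not part of vally): resolves a plugin path against
// the parent directory of the eval spec, as described above.
function resolveExecutorPlugin(evalSpecPath: string, pluginPath: string): string {
  const specDir = path.posix.dirname(evalSpecPath);
  return path.posix.resolve(specDir, pluginPath);
}

// "/repo" is a placeholder for <repo-root>.
const resolvedPlugin = resolveExecutorPlugin(
  "/repo/evals/azure-ai/eval.yaml",
  "../../tests/vally/vally-executor.ts",
);
console.log(resolvedPlugin); // /repo/tests/vally/vally-executor.ts
```

The POSIX variant of `path` is used so the result is the same on every operating system.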
61
63
64
+
## Run vally eval suites in CI
65
+
66
+
Vally eval suites implemented in this repo can be added to the CI test workflow so that they run nightly and publish their results for review. Refer to [ci-test](./references/ci-test.md) for how to add Vally eval suites to the CI test workflow.
67
+
62
68
## Extend with custom grader
63
69
64
70
Custom graders can be added to grade trajectories in ways built-in graders don't support. To add a custom grader, follow the examples in the official vally documentation to create a `tests/vally/<custom-name>-grader.ts` module and register the new custom grader in `tests/vally/vally-graders.ts`. The `npm run test:vally` command internally loads all the custom graders when testing skills.
Follow these steps to add a skill's Vally suites to the CI test workflow so they run nightly and publish results. Because LLM behavior is statistical, accumulating test run results gives us better data to refine skills over time.
4
+
5
+
## Prerequisites
6
+
7
+
- The skill's Vally suites are implemented under `evals/<skill-name>/eval.yaml` (or split across multiple YAML files).
8
+
- The Vally suites use the `integration-test-agent-runner` custom executor.
9
+
- The test results can be made public.
10
+
11
+
## Required setup
12
+
13
+
The scheduled CI test workflow determines which skills to test by reading `tests/skills.json`. That file lists all skills and their test schedules. To include a new skill in scheduled runs, **add it** to the skill list and to one of the schedule slots. By convention, `microsoft-foundry` and `azure-deploy` run in their own slots, while new skills are added to another shared slot.
14
+
15
+
### Use shared job template
16
+
17
+
Most skills use a shared job template to run eval suites. This template is defined as the `test` job in `.github/workflows/test-all-integration.yml`.
18
+
19
+
If you use the shared job template, add the skill to the workflow’s `VALLY_SKILLS` list; otherwise the job runs the Jest-based integration tests instead of `npm run test:vally`. The CI workflow creates one job per skill from this template, and each job runs all of that skill’s eval suites with `npm run test:vally`.
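The runner choice described above amounts to a membership check. The sketch below only illustrates that decision; the real logic lives in `.github/workflows/test-all-integration.yml` and is not reproduced here. `selectRunner`, the `Runner` type, and the contents of the list are hypothetical.

```typescript
type Runner = "vally" | "jest";

// Hypothetical stand-in for the workflow's VALLY_SKILLS list.
const VALLY_SKILLS: readonly string[] = ["azure-ai"];

// A skill listed in VALLY_SKILLS runs its Vally eval suites;
// any other skill falls back to the Jest-based integration tests.
function selectRunner(
  skill: string,
  vallySkills: readonly string[] = VALLY_SKILLS,
): Runner {
  return vallySkills.includes(skill) ? "vally" : "jest";
}

console.log(selectRunner("azure-ai")); // vally
console.log(selectRunner("some-other-skill")); // jest
```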
20
+
21
+
Reuse this template whenever possible. It provisions a test environment, installs common tools (for example, Azure CLI and Azure Developer CLI), connects to a test Azure subscription, and includes utility steps that collect and publish test results to a well-known storage location for downstream processing.
22
+
23
+
### Create dedicated workflow
24
+
25
+
In some cases, you may need a dedicated workflow for a skill. Common reasons include:
26
+
27
+
- The skill requires uncommon environment configuration, such as additional environment variables or secrets, a special Azure subscription, or installation of uncommon tools.
28
+
- The test suite is too large for a single job. GitHub Actions limits each job on GitHub-hosted runners to 6 hours of execution time. For example, `azure-deploy` uses a dedicated workflow that splits tests across multiple jobs.
29
+
- Test results must be published to a custom destination for downstream processing and consumption. If your team owns a data pipeline, implement publishing steps in the dedicated workflow. You can still publish to the well-known location so our reporting tools continue to work.
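One way to split a large suite across jobs is to shard the eval files round-robin. How `azure-deploy` actually splits its tests is not shown here; this is a minimal sketch under that caveat, and `shardSuites` and the file names are hypothetical.

```typescript
// Hypothetical helper: distributes suite files round-robin across a fixed
// number of jobs so that each job can stay under the per-job time limit.
function shardSuites(suites: readonly string[], jobCount: number): string[][] {
  if (!Number.isInteger(jobCount) || jobCount < 1) {
    throw new RangeError("jobCount must be a positive integer");
  }
  const shards: string[][] = Array.from({ length: jobCount }, () => []);
  suites.forEach((suite, index) => {
    shards[index % jobCount].push(suite);
  });
  return shards;
}

const shards = shardSuites(["evalA.yaml", "evalB.yaml", "evalC.yaml"], 2);
console.log(JSON.stringify(shards)); // [["evalA.yaml","evalC.yaml"],["evalB.yaml"]]
```

Round-robin keeps the shards the same size by file count. If suites differ widely in runtime, weighting by measured duration would balance the jobs better.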
30
+
31
+
If you create a dedicated workflow, update `test-all-integration.yml` so it triggers the dedicated workflow when that skill is included in the input. Then implement the dedicated workflow to run tests and collect results. Work with GitHub-Copilot-for-Azure repo contributors to configure any required environment variables or secrets.