doc: document adding eval suites to CI (#2682)

JasonYeMSFT · web-flow · commit 15f397f04e71 · 2026-06-22T12:29:09.000-07:00
* doc: document adding eval suites to CI

* Resolve copilot comments
diff --git a/.github/skills/vally-eval/SKILL.md b/.github/skills/vally-eval/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: vally-eval
-description: "Author, validate, and run Vally eval.yaml evaluation suites for agent skills. TRIGGERS: create eval, write eval, add eval, run eval, validate eval, vally eval, eval.yaml, add stimulus, map test to eval, migrate test to eval, eval graders, eval scoring."
+description: "Author, validate, and run Vally eval.yaml evaluation suites for agent skills. TRIGGERS: create eval, write eval, add eval, run eval, validate eval, vally eval, eval.yaml, add stimulus, map test to eval, migrate test to eval, eval graders, eval scoring, add eval to CI."
 license: MIT
 metadata:
   author: Microsoft
@@ -19,6 +19,8 @@ Refer to the official documentation on the schema of the spec and the schema of
 
 Vally eval suites for azure-skills plugin have the following file layout. The shared eval spec is located at `<repo-root>/.vally.yaml`. The eval suites are categorized by skills. The eval suites for each skill are located at `<repo-root>/evals/<skill-name>/eval.yaml`, e.g. `<repo-root>/evals/azure-ai/eval.yaml`. If a skill needs fixture files for its eval suites, it should organize such fixture files in a `fixture` directory under its directory, e.g. `<repo-root>/evals/azure-ai/fixture/`.
 
+The eval suites can be organized into separate files under a skill's eval directory. For example, you can have `<repo-root>/evals/<skill-name>/evalA.yaml` and `<repo-root>/evals/<skill-name>/evalB.yaml`. When you run the Vally suites with the `npm run test:vally` command, the test script runs the suites from all of these `.yaml` files.
+
 ## Migrate integration tests
 
 azure-skills plugin have implemented JavaScript integration test using Jest as the underlying test runner. All such integration tests are under `tests/**/integration.test.ts` files.
@@ -48,7 +50,7 @@ npm run vally validate-stimulus
 
 Extended features such as early termination are implemented using tags and many of them use serialized JSON objects as input. This validation script also validates the values of these special tags.
 
-## Run vally eval suites
+## Run vally eval suites locally
 
 Use vally-cli to run vally eval suites. In most cases, you would like to use a command like this.
 
@@ -59,6 +61,10 @@ npm run test:vally -- --skill $SKILL
 
 `--eval-spec ../evals/<skill-name>/eval.yaml` tells vally which eval spec to run. The path is relative to the current working directory of the process running the command. `--output-dir ./results` tells vally to write its output to a `results/` directory relative to the current working directory of the process running the command. `--executor-plugin ../../tests/vally/vally-executor.ts` tells vally to load and execute the code in this module, which registers the custom executor used by azure-skill vally eval suites. Note that this path is relative to the parent directory of the eval spec to run. For example, if the eval spec to run is `<repo-root>/evals/azure-ai/eval.yaml`, resolving this relative path ends at `<repo-root>/tests/vally/vally-executor.ts`.
 
+## Run vally eval suites in CI
+
+Vally eval suites implemented in this repo can be added to the CI test workflow so that they run nightly and publish their results for review. Refer to [ci-test](./references/ci-test.md) for how to add the Vally eval suites to the CI test workflow.
+
 ## Extend with custom grader
 
 Custom graders can be added to grade trajectories in ways built-in graders don't support. To add a custom grader, follow the examples in the official vally documentation to create a `tests/vally/<custom-name>-grader.ts` module and register the new custom grader in `tests/vally/vally-graders.ts`. The `npm run test:vally` command internally loads all the custom graders when testing skills.
diff --git a/.github/skills/vally-eval/references/ci-test.md b/.github/skills/vally-eval/references/ci-test.md
@@ -0,0 +1,31 @@
+# Run tests in CI
+
+Follow these steps to add a skill's Vally suites to the CI test workflow so they run nightly and publish results. Because LLM behavior is statistical, accumulating test run results gives us better data to refine skills over time.
+
+## Prerequisites
+
+- The skill's Vally suites are implemented under `evals/<skill-name>/eval.yaml` (or split across multiple YAML files).
+- The Vally suites use the `integration-test-agent-runner` custom executor.
+- The test results can be made public.
+
+## Required setup
+
+The scheduled CI test workflow determines which skills to test by reading `tests/skills.json`. That file lists all skills and their test schedules. To include a new skill in scheduled runs, **add it** to the skill list and to one of the schedule slots. By convention, `microsoft-foundry` and `azure-deploy` run in their own slots, while new skills are added to another shared slot.
+
+### Use shared job template
+
+Most skills use a shared job template to run eval suites. This template is defined as the `test` job in `.github/workflows/test-all-integration.yml`.
+
+If you use the shared job template, add the skill to the workflow’s `VALLY_SKILLS` list; otherwise, the job runs Jest-based integration tests instead of `npm run test:vally`. The CI workflow creates one job per skill from this template and runs all of that skill's eval suites with `npm run test:vally`.
+
+Reuse this template whenever possible. It provisions a test environment, installs common tools (for example, Azure CLI and Azure Developer CLI), connects to a test Azure subscription, and includes utility steps that collect and publish test results to a well-known storage location for downstream processing.
+
+### Create dedicated workflow
+
+In some cases, you may need a dedicated workflow for a skill. Common reasons include:
+
+- The skill requires uncommon environment configuration, such as additional environment variables or secrets, a special Azure subscription, or installation of uncommon tools.
+- The test suite is too large for a single job. GitHub-hosted runners enforce a hard 6-hour limit on each job's execution time. For example, `azure-deploy` uses a dedicated workflow that splits tests across multiple jobs.
+- Test results must be published to a custom destination for downstream processing and consumption. If your team owns a data pipeline, implement publishing steps in the dedicated workflow. You can still publish to the well-known location so our reporting tools continue to work.
+
+If you create a dedicated workflow, update `test-all-integration.yml` so it triggers the dedicated workflow when that skill is included in the input. Then implement the dedicated workflow to run tests and collect results. Work with GitHub-Copilot-for-Azure repo contributors to configure any required environment variables or secrets.
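
Reviewer note: the `VALLY_SKILLS` rule described under "Use shared job template" can be read as a simple membership check. The sketch below is only an illustration of that rule, not the workflow's actual implementation; `Runner`, `pickRunner`, `describeJob`, and the example list contents are hypothetical names and values that do not appear in `test-all-integration.yml`.

```typescript
// Illustrative sketch only: `Runner`, `pickRunner`, and `describeJob` are
// hypothetical names, not part of test-all-integration.yml.
type Runner = "vally" | "jest";

// A skill listed in VALLY_SKILLS runs Vally eval suites; any other skill
// falls back to the Jest-based integration tests.
function pickRunner(skill: string, vallySkills: readonly string[]): Runner {
  return vallySkills.includes(skill) ? "vally" : "jest";
}

// Builds a short summary of what the per-skill job would run.
function describeJob(skill: string, vallySkills: readonly string[]): string {
  return pickRunner(skill, vallySkills) === "vally"
    ? `${skill}: npm run test:vally -- --skill ${skill}`
    : `${skill}: Jest-based integration tests`;
}

// Assumed example list; the real contents of VALLY_SKILLS are not shown in this commit.
const vallySkills = ["azure-ai"];
console.log(describeJob("azure-ai", vallySkills));
console.log(describeJob("example-skill", vallySkills));
```

The point of the sketch is the failure mode the document warns about: a skill that has Vally suites but is missing from the list silently takes the Jest branch.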