feat: migrate Batch 3 Jest integration tests to Vally (10 complex skills) #1868
wbreza wants to merge 5 commits into microsoft:main from …
Conversation
…lls)

Migrate 10 complex-complexity Jest integration test files to Vally eval configs:

Batch 3A (7 smaller complex skills):

- _template (4 stimuli, workspace setup)
- azure-hosted-copilot-sdk (8 stimuli, security regression checks, 20min timeout)
- azure-resource-visualizer (4 stimuli, file pattern assertions, 30min timeout)
- azure-enterprise-infra-planner (8 stimuli, file validation)
- appinsights-instrumentation (4 stimuli, resource copying)
- azure-cloud-migrate (2 stimuli, cloneRepo, 45min timeout)
- azure-resource-lookup (7 stimuli, keyword validation)

Batch 3B (3 large complex skills):

- azure-deploy (39 stimuli, custom deploy validators, 60min timeout)
- azure-validate (11 stimuli, sparse checkout, 45min timeout)
- azure-prepare (38 stimuli, multi-language scaffolding, 60min timeout)

Total: 125 stimuli across 10 eval configs. Extended grader types: file-exists, file-matches, output-not-matches (security). Custom helpers mapped: hasDeployLinks, expectFiles, matchesCommand, countSecretsInCode, matchesFileEdit, doesWorkspaceFileIncludePattern. All prompts preserved verbatim from Jest test files.

Part of microsoft#1818

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…re-validate evals

Fix lint errors: prompt was nested under input.prompt instead of being a top-level stimulus field. Corrected in all 88 stimuli across 3 files. No prompt text changed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill definitions into the session. Without this, no skills are available and skill-invocation graders always fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jongio
left a comment
Big migration, and the Batch 3A files + _template + .vally.yaml look close to what #1912's Vally CLI will consume. Seven issues worth addressing before this comes out of draft, flagged inline. The most important ones:
- `azure-deploy`, `azure-prepare`, and `azure-validate` still use the old Waza schema (top-level `skill`, `version`, `metrics`, `graders`; `config.trials_per_task`, `timeout_seconds`, `parallel`). Once #1912 switches CI to `@microsoft/vally-cli`, those three won't parse. The other seven files are on the new schema.
- Every stimulus is tagged `cost: free`, but `.vally.yaml` reserves `cost: free` for static non-LLM checks (see #1912). The `pr` suite filters on `cost: free`, so this pulls 125 LLM-backed stimuli (some with 60-min timeouts) into the PR gate. They should be tagged `cost: llm` (or whatever non-free tag `.vally.yaml` settles on).
- A few stimuli will fail by design because workspace seed data got commented out during migration: `azure-cloud-migrate` (needs `cloneRepo` of AWS samples), the `azure-enterprise-infra-planner` response-quality stimuli (needed multi-turn followUp confirmations), the 3 `azure-hosted-copilot-sdk` stimuli that needed `setupExpressApp` / `setupCopilotSdkApp`, and the `appinsights-instrumentation` Python/ASP.NET quality checks.
- Negative assertions from `expectFiles` got dropped in `azure-deploy`: the Bicep stimuli assert `infra/*.bicep` exists but don't assert `infra/*.tf` is absent (and vice versa for the Terraform section).
Happy to help triage once you've picked an approach for the seed-data gap.
```yaml
timeout_seconds: 3600  # 60 min — matches deployTestTimeoutMs / brownfieldTestTimeoutMs
parallel: false
executor: copilot-sdk
model: claude-sonnet-4-20250514
```
This file is on the old Waza schema: skill, version, config.trials_per_task, timeout_seconds, parallel, top-level metrics, top-level graders. The 7 Batch 3A files and _template use the new Vally schema (tags, scoring, config.runs, config.timeout).
Once #1912 lands and CI switches to @microsoft/vally-cli, this file, azure-prepare/eval.yaml, and azure-validate/eval.yaml won't parse under the Vally schema. They'll either get rejected as having unknown fields or fall through to defaults for runs/timeout (so the 60-min timeout claimed in the PR description silently won't apply).
Also line 1 still points at waza/main/schemas/eval.schema.json, which contradicts the stated migration target.
Convert the 3 Batch 3B files to match Batch 3A's shape:
- drop `skill`, `version`, `metrics`, top-level `graders`
- replace `config.trials_per_task` -> `config.runs`, `timeout_seconds` -> `timeout`, drop `parallel`
- add `tags:` + `scoring: { threshold: 0.8 }`
- remove the yaml-language-server directive or repoint it at the Vally schema once it's published
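Putting those four changes together, a minimal sketch of the converted azure-deploy header, assuming the Batch 3A shape described above (field names taken from this review; the published Vally schema may differ):

```yaml
# azure-deploy/eval.yaml — sketch of the new Vally-schema header (values illustrative)
tags:
  type: integration
  tier: full
  cost: llm            # assumption: non-free tag, per the cost-tag comment below
scoring:
  threshold: 0.8
config:
  runs: 1              # replaces config.trials_per_task (value illustrative)
  timeout: 3600        # seconds, replaces timeout_seconds; 60 min for deploy tests
executor: copilot-sdk  # executor/model placement should match the Batch 3A files
model: claude-sonnet-4-20250514
```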
```yaml
tags:
  type: integration
  tier: full
  cost: free
```
cost: free is repeated on every stimulus in this PR (125 total across all 10 files), but .vally.yaml reserves cost: free for static non-LLM checks:
```yaml
pr:
  description: "All free-tier evals for PR gate, <2min"
  filter:
    cost: free
```

All stimuli here run `executor: copilot-sdk` (real LLM) with 10-60 minute timeouts. Running `vally eval --suite pr` would pull all 125 into the PR gate, which isn't the <2min contract `.vally.yaml` promises.
Use cost: llm (or whatever non-free tag the framework settles on) for copilot-sdk stimuli, and reserve cost: free for static trigger/routing checks that don't call an LLM.
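A sketch of how that split could look across two stimuli, assuming the framework keeps `cost: llm` as the non-free tag (the `stimuli:` structure, ids, and other tag values are illustrative):

```yaml
stimuli:                        # structure illustrative
  - id: bicep-swa-deploy        # LLM-backed: runs the copilot-sdk executor
    tags:
      type: integration
      tier: full
      cost: llm                 # assumption: exact non-free tag name TBD per this comment
  - id: deploy-trigger-routing  # static trigger/routing check, no LLM call
    tags:
      type: integration
      tier: full
      cost: free                # stays inside the <2min pr suite
```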
```yaml
# Global: no_runtime_failure
- type: output-not-matches
  config:
    pattern: "(?i)fatal error|unhandled exception|stack trace"
```
This test's whole point in Jest was to migrate a real AWS Lambda repo that cloneRepo() dropped into the workspace. With environment.commands commented out, the skill runs on an empty workspace and the stimulus asks it to "migrate this Lambda to Azure" when there's no Lambda to migrate.
file-exists: **/migration-status.md and file-exists: **/migration-assessment-report.md will almost certainly never hit: the skill's migration flow expects source files to analyze, so it either bails out early or produces generic advisory text without creating those deliverables.
Same issue for lambda-webapp-migration below.
Options:
- Gate these stimuli on an `environment.commands` / `environment.files` primitive that Vally supports (or file an ask against microsoft/evaluate) — see the sketch after this list.
- Replace the clone with a fixture directory checked into the repo that the `copilot-sdk` executor seeds from.
- Mark both stimuli as skipped/pending until workspace setup lands, rather than leaving them as guaranteed failures that'll drag the skill's pass rate down to 0%.
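If Vally does grow that primitive, re-enabling the seed the Jest `cloneRepo()` performed might look roughly like this (hypothetical shape; the sample-repo URL is a placeholder, not the real one from the Jest test):

```yaml
environment:
  skills:
    - azure-cloud-migrate
  commands:                                               # hypothetical field, pending Vally support
    - git clone --depth 1 <aws-lambda-sample-repo-url> .  # stand-in for the Jest cloneRepo() seed
```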
```yaml
# Global: no_runtime_failure
- type: output-not-matches
  config:
    pattern: "(?i)fatal error|unhandled exception|stack trace"
```
This stimulus and bicep-file-generation below both expect files that the original Jest tests only produced after multi-turn confirmations:
- `plan-json-generation`: Jest `followUp` was `["The resource list looks good, proceed with generating the plan.", "Go with recommended options. Assume all defaults to make the plan."]`. Without those confirmations, the planner stays in resource-gathering mode and never writes `.azure/infrastructure-plan.json`.
- `bicep-file-generation`: needed a third turn ("Looks good! Let's make Bicep now.") to produce Bicep files.
With followUps commented out, both file-exists graders will fail ~100% of the time because the skill hasn't been told to emit the artifacts.
If Vally supports multi-turn via messages: or followUp:, wire it in here. If it doesn't yet, these two response-quality stimuli aren't portable and should be marked pending rather than left as guaranteed failures.
Same shape appears in azure-hosted-copilot-sdk/eval.yaml (3 stimuli that relied on setupExpressApp / setupCopilotSdkApp) and appinsights-instrumentation/eval.yaml (Python/ASP.NET quality checks relied on fs.cpSync resource copying).
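If multi-turn input does land, the `plan-json-generation` stimulus might carry its confirmations roughly like this (hypothetical shape; `followUp:` is one of the spellings floated above, and the prompt placeholder stands in for the verbatim Jest prompt):

```yaml
- id: plan-json-generation
  prompt: "<verbatim prompt from the Jest test>"
  followUp:                 # hypothetical field; depends on how Vally exposes multi-turn
    - "The resource list looks good, proceed with generating the plan."
    - "Go with recommended options. Assume all defaults to make the plan."
  graders:
    - type: file-exists
      config:
        path: ".azure/infrastructure-plan.json"
```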
```yaml
# expectFiles: bicep present, no .tf
- type: file-exists
  config:
    path: "infra/*.bicep"
```
The comment on line 158 says "bicep present, no .tf" and the description on line 166 says "no Terraform", but the only assertion here is file-exists: infra/*.bicep. A skill that produces both infra/*.bicep AND infra/*.tf passes this grader, which loses the intent of the original Jest expectFiles check.
Same issue in reverse for the Terraform stimuli starting at line 520 (tf-swa-whiteboard, etc.): they assert infra/*.tf exists but don't assert infra/*.bicep is absent.
Add a file-not-exists (or equivalent) grader for the opposite IaC flavor in each stimulus. If Vally doesn't ship that grader today, file-exists with a negated path or output-not-matches on a file listing can stand in. Worth fixing across the whole section, not just the first case.
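For the first Bicep stimulus, the paired assertion could look like this, assuming a `file-not-exists` grader (or whatever negated check Vally actually ships) is available:

```yaml
# expectFiles: bicep present, no .tf
- type: file-exists
  config:
    path: "infra/*.bicep"
- type: file-not-exists      # hypothetical grader name; see fallback options above
  config:
    path: "infra/*.tf"
```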
```yaml
graders:
  - type: output-contains
    config:
      substring: "resource"
```
output-contains: "resource" will match almost any Azure-related response, even an empty "I can't help with that" that happens to mention resources. The Jest doesAssistantMessageIncludeKeyword had the same weakness, but since you're rewriting the assertion anyway, this is a chance to make it meaningful.
Same pattern on the next stimulus (output-contains: "location"). Consider a phrase that actually requires the skill to do the lookup, e.g. "across subscriptions" or a regex covering "resources by (type|location|subscription)". Otherwise the grader won't catch a regression where the skill silently stops doing cross-subscription queries.
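One possible tightening, reusing the `output-matches` grader already mapped elsewhere in this PR (regex illustrative, built from the phrases suggested above):

```yaml
- type: output-matches
  config:
    pattern: "(?i)(resources by (type|location|subscription)|across subscriptions)"
```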
```yaml
- type: file-matches
  config:
    path: "**/*.py"
    pattern: "configure_azure_monitor"
```
file-matches on **/*.py looking for configure_azure_monitor was originally checking seed Python files copied via fs.cpSync("./appinsights-instrumentation/resources/python-app/", workspace). With environment.files commented out (line 141-144), there are no seed .py files in the workspace, so this grader now behaves one of two ways depending on Vally's semantics:
- If `file-matches` requires at least one matching path and returns false when none exist: it fails whenever the skill doesn't generate a `.py` file (which it often won't for an instrumentation-guidance prompt).
- If `file-matches` passes when no files match: it becomes a no-op that always passes.
Either way, the assertion no longer means "the seeded Python app has the instrumentation pattern added" (the original Jest intent). Same story for the ASP.NET Core stimulus above: "prompt is self-contained enough to trigger correct guidance" assumes the model doesn't regress, but removes the objective evidence the test was providing.
Skip these quality checks until the seed-data / environment.files story lands, or inline the resource content so the test is self-contained.
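If inlining wins, a rough shape assuming an `environment.files` map like the commented-out block implies (the uninstrumented Flask app below is a stand-in for the original `resources/python-app` seed, not its actual contents):

```yaml
environment:
  files:                         # hypothetical field, pending Vally support
    # seed an UNinstrumented app so file-matches proves the skill added configure_azure_monitor
    "python-app/app.py": |
      from flask import Flask

      app = Flask(__name__)

      @app.route("/")
      def index():
          return "hello"
```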
Summary
Migrate Batch 3 (complex) Jest integration tests to Vally eval configs — 10 skills, 125 stimuli. This is the final and largest batch, covering skills with custom utilities, workspace setup, security regression checks, and long-running deploy tests.
Part of #1818
Changes
Batch 3A — 7 Smaller Complex Skills (37 stimuli)
Seven skills: `_template` (4 stimuli), `azure-hosted-copilot-sdk` (8), `azure-resource-visualizer` (4), `azure-enterprise-infra-planner` (8), `appinsights-instrumentation` (4), `azure-cloud-migrate` (2), `azure-resource-lookup` (7). Custom helpers exercised in this batch include `countSecretsInCode`, `countApiKeyInByomConfig`, `listFilesRecursive`, and `cloneRepo` git setup.

Batch 3B — 3 Large Complex Skills (88 stimuli)
- `azure-deploy` — 39 stimuli: `hasDeployLinks`, `expectFiles`, `matchesCommand`, Bicep/Terraform variants, brownfield dotnet/js/python
- `azure-validate` — 11 stimuli: `matchesCommand`, `matchesFileEdit`, sparse checkout
- `azure-prepare` — 38 stimuli: `expectFiles`, multi-language scaffolding (SWA/AppService/Functions/ACA × Bicep/Terraform)

Extended Grader Mappings (new in Batch 3)
| Jest helper | Vally grader |
| --- | --- |
| `hasDeployLinks` | `output-matches` with Azure URL regex |
| `expectFiles` | `file-exists` with glob |
| `countSecretsInCode` | `output-not-matches` with secret patterns |
| `matchesCommand` | `output-matches` with regex |
| `matchesFileEdit` | `file-matches` with path + pattern |
| `doesWorkspaceFileIncludePattern` | `file-matches` with regex |

What's NOT in this PR
Validation
output-not-matches