Skip to content

feat: migrate Batch 3 Jest integration tests to Vally (10 complex skills)#1868

Draft
wbreza wants to merge 5 commits intomicrosoft:mainfrom
wbreza:feature/jest-batch3-migration
Draft

feat: migrate Batch 3 Jest integration tests to Vally (10 complex skills)#1868
wbreza wants to merge 5 commits intomicrosoft:mainfrom
wbreza:feature/jest-batch3-migration

Conversation

@wbreza
Copy link
Copy Markdown
Collaborator

@wbreza wbreza commented Apr 14, 2026

Summary

Migrate Batch 3 (complex) Jest integration tests to Vally eval configs — 10 skills, 125 stimuli. This is the final and largest batch, covering skills with custom utilities, workspace setup, security regression checks, and long-running deploy tests.

Part of #1818

Changes

Batch 3A — 7 Smaller Complex Skills (37 stimuli)

Skill Stimuli Runs Timeout Key Patterns
_template 4 5x 10min Workspace file setup
azure-hosted-copilot-sdk 8 3x 20min Security regression checks (countSecretsInCode, countApiKeyInByomConfig)
azure-resource-visualizer 4 5x 30min File pattern assertions, followUp prompts
azure-enterprise-infra-planner 8 1x 10min File validation (listFilesRecursive)
appinsights-instrumentation 4 5x 10min Resource file copying, regex file search
azure-cloud-migrate 2 1x 45min cloneRepo git setup
azure-resource-lookup 7 1x 10min Keyword validation

Batch 3B — 3 Large Complex Skills (88 stimuli)

Skill Stimuli Runs Timeout Key Patterns
azure-deploy 39 1x 60min hasDeployLinks, expectFiles, matchesCommand, Bicep/Terraform variants, brownfield dotnet/js/python
azure-validate 11 1x 45min matchesCommand, matchesFileEdit, sparse checkout
azure-prepare 38 1x 60min expectFiles, multi-language scaffolding (SWA/AppService/Functions/ACA × Bicep/Terraform)

Extended Grader Mappings (new in Batch 3)

Jest Helper Vally Grader Usage
hasDeployLinks output-matches with Azure URL regex 28 stimuli
expectFiles file-exists with glob 26 stimuli
countSecretsInCode output-not-matches with secret patterns 8 stimuli
matchesCommand output-matches with regex 2 stimuli
matchesFileEdit file-matches with path + pattern 1 stimulus
doesWorkspaceFileIncludePattern file-matches with regex 6 stimuli

What's NOT in this PR

Validation

  • ✅ All 10 eval configs valid YAML
  • ✅ 125/125 stimuli created (37 + 88)
  • ✅ All prompts preserved verbatim from Jest sources
  • ✅ Security regression patterns mapped to output-not-matches

…lls)

Migrate 10 complex-complexity Jest integration test files to Vally eval configs:

Batch 3A (7 smaller complex skills):
- _template (4 stimuli, workspace setup)
- azure-hosted-copilot-sdk (8 stimuli, security regression checks, 20min timeout)
- azure-resource-visualizer (4 stimuli, file pattern assertions, 30min timeout)
- azure-enterprise-infra-planner (8 stimuli, file validation)
- appinsights-instrumentation (4 stimuli, resource copying)
- azure-cloud-migrate (2 stimuli, cloneRepo, 45min timeout)
- azure-resource-lookup (7 stimuli, keyword validation)

Batch 3B (3 large complex skills):
- azure-deploy (39 stimuli, custom deploy validators, 60min timeout)
- azure-validate (11 stimuli, sparse checkout, 45min timeout)
- azure-prepare (38 stimuli, multi-language scaffolding, 60min timeout)

Total: 125 stimuli across 10 eval configs.
Extended grader types: file-exists, file-matches, output-not-matches (security).
Custom helpers mapped: hasDeployLinks, expectFiles, matchesCommand,
countSecretsInCode, matchesFileEdit, doesWorkspaceFileIncludePattern.
All prompts preserved verbatim from Jest test files.

Part of microsoft#1818

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wbreza wbreza self-assigned this Apr 14, 2026
wbreza and others added 4 commits April 14, 2026 13:27
…re-validate evals

Fix lint errors: prompt was nested under input.prompt instead of
being a top-level stimulus field. Corrected in all 88 stimuli
across 3 files. No prompt text changed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill
definitions into the session. Without this, no skills are available
and skill-invocation graders always fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Collaborator

@jongio jongio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big migration, and the Batch 3A files + _template + .vally.yaml look close to what #1912's Vally CLI will consume. Seven issues worth addressing before this comes out of draft, flagged inline. The most important ones:

  1. azure-deploy, azure-prepare, and azure-validate still use the old Waza schema (top-level skill, version, metrics, graders; config.trials_per_task, timeout_seconds, parallel). Once #1912 switches CI to @microsoft/vally-cli, those three won't parse. Other seven files are on the new schema.

  2. Every stimulus is tagged cost: free, but .vally.yaml reserves cost: free for static non-LLM checks (see #1912). The pr suite filters on cost: free, so this pulls 125 LLM-backed stimuli (some with 60-min timeouts) into the PR gate. They should be tagged cost: llm (or whatever non-free tag .vally.yaml settles on).

  3. A few stimuli will fail by design because workspace seed data got commented out during migration: azure-cloud-migrate (needs cloneRepo of AWS samples), azure-enterprise-infra-planner response-quality stimuli (needed multi-turn followUp confirmations), the 3 azure-hosted-copilot-sdk stimuli that needed setupExpressApp / setupCopilotSdkApp, and appinsights-instrumentation Python/ASP.NET quality checks.

  4. Negative assertions from expectFiles got dropped in azure-deploy: the Bicep stimuli assert infra/*.bicep exists but don't assert infra/*.tf is absent (and vice-versa for the Terraform section).

Happy to help triage once you've picked an approach for the seed-data gap.

timeout_seconds: 3600 # 60 min — matches deployTestTimeoutMs / brownfieldTestTimeoutMs
parallel: false
executor: copilot-sdk
model: claude-sonnet-4-20250514
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is on the old Waza schema: skill, version, config.trials_per_task, timeout_seconds, parallel, top-level metrics, top-level graders. The 7 Batch 3A files and _template use the new Vally schema (tags, scoring, config.runs, config.timeout).

Once #1912 lands and CI switches to @microsoft/vally-cli, this file, azure-prepare/eval.yaml, and azure-validate/eval.yaml won't parse under the Vally schema. They'll either get rejected as having unknown fields or fall through to defaults for runs/timeout (so the 60-min timeout claimed in the PR description silently won't apply).

Also line 1 still points at waza/main/schemas/eval.schema.json, which contradicts the stated migration target.

Convert the 3 Batch 3B files to match Batch 3A's shape:

  • drop skill, version, metrics, top-level graders
  • replace config.trials_per_task -> config.runs, timeout_seconds -> timeout, drop parallel
  • add tags: + scoring: { threshold: 0.8 }
  • remove the yaml-language-server directive or repoint it at the Vally schema once it's published

tags:
type: integration
tier: full
cost: free
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cost: free is repeated on every stimulus in this PR (125 total across all 10 files), but .vally.yaml reserves cost: free for static non-LLM checks:

pr:
  description: "All free-tier evals for PR gate, <2min"
  filter:
    cost: free

All stimuli here run executor: copilot-sdk (real LLM) with 10-60 minute timeouts. Running vally eval --suite pr would pull all 125 into the PR gate, which isn't the <2min contract .vally.yaml promises.

Use cost: llm (or whatever non-free tag the framework settles on) for copilot-sdk stimuli, and reserve cost: free for static trigger/routing checks that don't call an LLM.

# Global: no_runtime_failure
- type: output-not-matches
config:
pattern: "(?i)fatal error|unhandled exception|stack trace"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test's whole point in Jest was to migrate a real AWS Lambda repo that cloneRepo() dropped into the workspace. With environment.commands commented out, the skill runs on an empty workspace and the stimulus asks it to "migrate this Lambda to Azure" when there's no Lambda to migrate.

file-exists: **/migration-status.md and file-exists: **/migration-assessment-report.md will almost certainly never hit: the skill's migration flow expects source files to analyze, so it either bails out early or produces generic advisory text without creating those deliverables.

Same issue for lambda-webapp-migration below.

Options:

  • Gate these stimuli on an environment.commands / environment.files primitive that Vally supports (or file an ask against microsoft/evaluate).
  • Replace the clone with a fixture directory checked into the repo that the copilot-sdk executor seeds from.
  • Mark both stimuli as skipped/pending until workspace setup lands, rather than leaving them as guaranteed failures that'll drag the skill's pass rate down to 0%.

# Global: no_runtime_failure
- type: output-not-matches
config:
pattern: "(?i)fatal error|unhandled exception|stack trace"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stimulus and bicep-file-generation below both expect files that the original Jest tests only produced after multi-turn confirmations:

  • plan-json-generation: Jest followUp was ["The resource list looks good, proceed with generating the plan.", "Go with recommended options. Assume all defaults to make the plan."]. Without those confirmations, the planner stays in resource-gathering mode and never writes .azure/infrastructure-plan.json.
  • bicep-file-generation: needed a third turn "Looks good! Let's make Bicep now." to produce Bicep files.

With followUps commented out, both file-exists graders will fail ~100% of the time because the skill hasn't been told to emit the artifacts.

If Vally supports multi-turn via messages: or followUp:, wire it in here. If it doesn't yet, these two response-quality stimuli aren't portable and should be marked pending rather than left as guaranteed failures.

Same shape appears in azure-hosted-copilot-sdk/eval.yaml (3 stimuli that relied on setupExpressApp / setupCopilotSdkApp) and appinsights-instrumentation/eval.yaml (Python/ASP.NET quality checks relied on fs.cpSync resource copying).

# expectFiles: bicep present, no .tf
- type: file-exists
config:
path: "infra/*.bicep"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 158 says "bicep present, no .tf" and the description on line 166 says "no Terraform", but the only assertion here is file-exists: infra/*.bicep. A skill that produces both infra/*.bicep AND infra/*.tf passes this grader, which loses the intent of the original Jest expectFiles check.

Same issue in reverse for the Terraform stimuli starting at line 520 (tf-swa-whiteboard, etc.): they assert infra/*.tf exists but don't assert infra/*.bicep is absent.

Add a file-not-exists (or equivalent) grader for the opposite IaC flavor in each stimulus. If Vally doesn't ship that grader today, file-exists with a negated path or output-not-matches on a file listing can stand in. Worth fixing across the whole section, not just the first case.

graders:
- type: output-contains
config:
substring: "resource"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output-contains: "resource" will match almost any Azure-related response, even an empty "I can't help with that" that happens to mention resources. The Jest doesAssistantMessageIncludeKeyword had the same weakness, but since you're rewriting the assertion anyway, this is a chance to make it meaningful.

Same pattern on the next stimulus (output-contains: "location"). Consider a phrase that actually requires the skill to do the lookup, e.g. "across subscriptions" or a regex covering "resources by (type|location|subscription)". Otherwise the grader won't catch a regression where the skill silently stops doing cross-subscription queries.

- type: file-matches
config:
path: "**/*.py"
pattern: "configure_azure_monitor"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file-matches on **/*.py looking for configure_azure_monitor was originally checking seed Python files copied via fs.cpSync("./appinsights-instrumentation/resources/python-app/", workspace). With environment.files commented out (line 141-144), there are no seed .py files in the workspace, so this grader now behaves one of two ways depending on Vally's semantics:

  • If file-matches requires at least one matching path and returns false when none exist: it fails whenever the skill doesn't generate a .py file (which it often won't for an instrumentation-guidance prompt).
  • If file-matches passes when no files match: it becomes a no-op that always passes.

Either way, the assertion no longer means "the seeded Python app has the instrumentation pattern added" (the original Jest intent). Same story for the ASP.NET Core stimulus above: "prompt is self-contained enough to trigger correct guidance" assumes the model doesn't regress, but removes the objective evidence the test was providing.

Skip these quality checks until the seed-data / environment.files story lands, or inline the resource content so the test is self-contained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants