ci(eval): migrate from Waza to Vally eval framework#1912

Open
wbreza wants to merge 11 commits into main from feature/waza-eval-migration

Collaborator

@wbreza wbreza commented Apr 16, 2026

Summary

Replaces the existing Waza-based evaluation workflow with the new Vally
framework (@microsoft/vally-cli), and wires up CI to install it from
GitHub Packages.

This is the baseline migration; Batch 1/2/3 skill migrations (PRs #1866,
#1867, #1868) and the trigger-test migration layer on top of it.

Relates to #1818

Changes

  • .github/workflows/eval.yml: switched runner to @microsoft/vally-cli,
    added workflow_dispatch trigger so maintainers can manually run evals.
  • .npmrc: new file mapping the @microsoft scope to
    https://npm.pkg.github.com so npm can resolve the package.
  • package.json: added @microsoft/vally-cli to devDependencies
    (latest) so local contributors can run the CLI.
  • Eval specs (evals/<skill>/eval.yaml, .vally.yaml): Vally eval configs
    with environment.skills and relative paths under plugin/skills/.

CI Authentication Notes

The Vally CLI package is hosted in GitHub Packages under the microsoft
org. The default GITHUB_TOKEN on a pull_request run cannot read package
registries that belong to another org, and fork PRs cannot access any
repository secrets. Two things are needed:

  1. Repo secret VALLY_NPM_TOKEN: a PAT with read:packages,
    SSO-authorized for the microsoft org. This PR references that secret
    in the workflow.
  2. workflow_dispatch trigger: allows maintainers to manually run
    the eval workflow from the base repo context (where secrets are
    available) on fork branches.

Once the secret is provisioned, the eval workflow will work for both
upstream branch PRs (auto-triggered) and fork PRs (via manual dispatch).
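
The secret-availability rules in the notes above can be sketched in a few lines of Python. This is only an illustrative model, not GitHub's implementation: the `Run` class, its field names, and the repository names used in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Minimal description of a workflow run (hypothetical model, not a GitHub API type)."""
    event: str      # e.g. "pull_request" or "workflow_dispatch"
    head_repo: str  # repository the branch lives in
    base_repo: str  # repository that owns the workflow and its secrets


def secrets_available(run: Run) -> bool:
    """Return True when repository secrets such as VALLY_NPM_TOKEN reach the run."""
    if run.event == "pull_request" and run.head_repo != run.base_repo:
        # Fork-originated pull_request runs do not receive repository secrets.
        return False
    # Upstream-branch PRs and manual workflow_dispatch runs execute in the
    # base repo context, where secrets are available.
    return True
```

With placeholder names, `secrets_available(Run("pull_request", "contributor/repo", "upstream/repo"))` is `False`, while the same branch run via `workflow_dispatch` from the base repo returns `True`.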

Testing

wbreza and others added 8 commits April 10, 2026 17:19
Migrate all 30 Waza eval tasks across 4 suites to Vally eval.yaml format:
- azure-hosted-copilot-sdk (6 stimuli)
- azure-deploy (2 stimuli)
- azure-enterprise-infra-planner (12 stimuli)
- azure-prepare (10 stimuli)

Add .vally.yaml project config with paths for skills and evals.
Add evals/_base/common-graders.yaml as shared grader reference.

Grader mappings: regex->output-matches, file->file-exists/file-matches,
code->completed, behavior->constraints. Global graders duplicated per
stimulus as workaround for evaluate#125.

All prompts, regex patterns, and expected outputs preserved verbatim.
Zero test case coverage loss.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
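
The grader mapping and the per-stimulus duplication described in the commit message above could be expressed roughly as follows. The mapping values and the no_fatal_errors pattern come from this PR; the dictionary shapes and function names are assumptions made for illustration and are not the Waza or Vally schemas.

```python
# Waza grader type -> Vally grader type(s), as listed in the commit message.
GRADER_MAP = {
    "regex": ["output-matches"],
    "file": ["file-exists", "file-matches"],
    "code": ["completed"],
    "behavior": ["constraints"],
}

# Global grader quoted in the review comments of this PR.
NO_FATAL_ERRORS = {
    "type": "output-not-matches",
    "config": {"pattern": "(?i)fatal error|unhandled exception|stack trace"},
}


def vally_grader_types(waza_type: str) -> list[str]:
    """Look up the Vally grader type(s) for a Waza grader type."""
    try:
        return GRADER_MAP[waza_type]
    except KeyError:
        raise ValueError(f"no Vally mapping for Waza grader type {waza_type!r}") from None


def attach_global_graders(stimuli: list[dict], global_graders: list[dict]) -> list[dict]:
    """Copy every global grader into each stimulus (hypothetical dict shape)."""
    return [
        {**s, "graders": list(s.get("graders", [])) + [dict(g) for g in global_graders]}
        for s in stimuli
    ]
```

Copying the global graders into each stimulus keeps every stimulus self-contained, at the cost of repetition in the YAML.
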
Replace azd waza run with npx @microsoft/vally-cli eval.
Add setup-node with GitHub Packages registry for @microsoft/vally-cli.
Add packages:read permission for GitHub Packages auth.
Preserve trigger paths, artifact upload, and retention settings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…executor

Apply UX-designed tag taxonomy to all 30 stimuli:
- eval-level tags: type + skill
- stimulus-level tags: type, tier, cost, area
- Fix cost values: low -> free (mock executor, no LLM cost)

Add 5 named suites to .vally.yaml: smoke, pr, triggers, integration, full

Switch executor from mock to copilot-sdk for real agent evaluation.
Fix model names: claude-sonnet-4-20250514 -> claude-sonnet-4
(SDK rejects version-pinned model names)

Live eval run results: 8/30 pass, 4 flaky, 17 fail, 1 timeout.
Failures are grader calibration issues (brittle output-contains
substrings, file-exists for files agent doesn't write to disk),
not migration bugs. Grader tuning tracked as follow-up work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill
definitions into the session. Without this, no skills are available
and skill-invocation graders always fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…to repo root)

Skill paths in environment.skills are resolved relative to the
eval.yaml file location, not the repo root. Added ../../ prefix
to climb from evals/<skill>/ to the repo root.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
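
The path fix in the commit above can be checked with a short Python sketch. It assumes resolution is a plain join of the eval.yaml directory with the skill reference; the helper name is hypothetical and Vally's actual resolver may differ.

```python
import posixpath


def resolve_skill_path(eval_file: str, skill_ref: str) -> str:
    """Resolve a skill reference relative to the directory containing eval.yaml."""
    eval_dir = posixpath.dirname(eval_file)
    return posixpath.normpath(posixpath.join(eval_dir, skill_ref))


# Without the prefix, the path stays inside the eval directory.
print(resolve_skill_path("evals/azure-prepare/eval.yaml", "plugin/skills/azure-prepare"))
# evals/azure-prepare/plugin/skills/azure-prepare

# With ../../ the path climbs back to the repo root first.
print(resolve_skill_path("evals/azure-prepare/eval.yaml", "../../plugin/skills/azure-prepare"))
# plugin/skills/azure-prepare
```
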
The eval workflow was invoking npx @microsoft/vally-cli without any
npm auth setup, so npm fell back to the public registry and the package
(published to GitHub Packages) could not be resolved.

- Add .npmrc mapping @microsoft scope to npm.pkg.github.com
- Add scope: '@microsoft' to setup-node so NODE_AUTH_TOKEN is applied
- Add an npm install --no-save step (with NODE_AUTH_TOKEN) so the
  @microsoft/vally-cli devDependency is resolved via authenticated fetch
- Declare @microsoft/vally-cli in devDependencies (latest) so local dev
  and CI both resolve it through a single config path

This mirrors the working setup in wbreza/skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The default GITHUB_TOKEN lacks read:packages access to the microsoft
org's private @microsoft/vally-cli package, yielding 403s. Switch to
a dedicated VALLY_NPM_TOKEN repo secret (PAT with read:packages,
SSO-authorized for the microsoft org).

Note: fork-originated pull_request runs do not receive secrets, so
fork PRs will still fail auth until the package is made public or the
trigger is reworked. Internal branches / workflow_dispatch / merges
will resolve correctly once the secret is provisioned.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Allow maintainers to manually invoke the eval workflow from the Actions
UI. This is needed to bypass the fork-PR secrets restriction: pull_request
workflows triggered from a fork cannot access repository secrets, so the
@microsoft/vally-cli install fails. Manual workflow_dispatch runs execute
in the base repo context where secrets are available.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 16, 2026 18:29
The copilot-sdk executor in @microsoft/vally-cli reads GITHUB_TOKEN
to create a Copilot session. The default Actions GITHUB_TOKEN doesn't
have Copilot API scope, causing "Session was not created with
authentication info or custom provider" at eval execution time.

Reuse the existing repo secret COPILOT_CLI_TOKEN (a Copilot-enabled
PAT) and expose it as GITHUB_TOKEN to the eval run step.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Migrates the repository’s skill evaluation setup from the Waza workflow to the Vally CLI framework, adding initial Vally eval specs and wiring CI to install/run @microsoft/vally-cli via GitHub Packages.

Changes:

  • Updated CI eval workflow to use Node + npx @microsoft/vally-cli eval and added workflow_dispatch.
  • Added Vally project config (.vally.yaml) and new Vally eval specs under evals/ (plus a shared graders reference file).
  • Added GitHub Packages npm registry configuration (.npmrc) and a new dev dependency on @microsoft/vally-cli.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| .github/workflows/eval.yml | Switches CI eval execution from Waza/azd to Vally CLI with GitHub Packages auth. |
| .npmrc | Routes @microsoft scope installs through GitHub Packages. |
| package.json | Adds @microsoft/vally-cli to devDependencies (and adds yaml dependency). |
| .vally.yaml | Introduces Vally project paths and suite filters (smoke/pr/triggers/integration/full). |
| evals/_base/common-graders.yaml | Documents shared “global graders” patterns to copy into stimuli (evaluate#125 workaround). |
| evals/azure-hosted-copilot-sdk/eval.yaml | Converts the hosted-copilot-sdk eval suite from Waza to Vally stimuli/graders. |
| evals/azure-prepare/eval.yaml | Adds a migrated Vally eval suite for azure-prepare. |
| evals/azure-enterprise-infra-planner/eval.yaml | Adds a migrated Vally eval suite for azure-enterprise-infra-planner. |
| evals/azure-deploy/eval.yaml | Adds a migrated Vally eval suite for azure-deploy. |

Comment thread package.json
Comment thread package.json
Comment thread package.json Outdated
Comment thread .github/workflows/eval.yml
Comment thread .github/workflows/eval.yml Outdated
Comment thread .vally.yaml Outdated
jongio previously requested changes Apr 16, 2026
Collaborator

@jongio jongio left a comment

Migration looks solid overall - good preservation of the original grader intent and thorough stimulus coverage.

Two things need attention before merge:

  1. The COPILOT_CLI_TOKEN PAT is exposed as GITHUB_TOKEN to the eval agent process. Since eval prompts are modifiable via PR, this creates a credential exfiltration path. See inline comment for mitigation options.

  2. The output-not-contains "error" / "failed" graders match normal agent prose and are likely why CI scores 76.8% instead of 80%+. The regex-based fatal error grader already covers real failures.
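
The false-positive problem in item 2 is easy to reproduce. The sketch below assumes output-not-contains is a plain, case-sensitive substring check and output-not-matches is a regex search; both helper functions and the two sample outputs are hypothetical stand-ins, not Vally's grader code.

```python
import re

# Fatal-error pattern quoted from the migrated eval specs.
FATAL = re.compile(r"(?i)fatal error|unhandled exception|stack trace")


def output_not_contains(output: str, substring: str) -> bool:
    """Pass when the substring is absent from the output."""
    return substring not in output


def output_not_matches(output: str, pattern: re.Pattern) -> bool:
    """Pass when the pattern does not match anywhere in the output."""
    return pattern.search(output) is None


benign = "Added retry logic with error handling so a failed request is retried."
crash = "Unhandled exception: connection refused"

print(output_not_contains(benign, "error"))   # False: normal prose trips the grader
print(output_not_contains(benign, "failed"))  # False
print(output_not_matches(benign, FATAL))      # True: the regex grader passes it
print(output_not_matches(crash, FATAL))       # False: a real failure is still caught
```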

Also noting: the executor changed from mock (fast, deterministic, no LLM calls) to copilot-sdk (real LLM calls, ~7 min, non-deterministic, costs per run). The weighted Waza metrics (task_completion 0.4, trigger_accuracy 0.3 at 90%, behavior_quality 0.3) are now a flat 80% threshold. Both are reasonable migration trade-offs but worth documenting.
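
To make the scoring change concrete, here is a rough comparison of a weighted composite against a flat pass-rate gate. The weights and the 80% threshold are taken from the comment above; how Waza actually combined the metrics, and how the 90% figure on trigger_accuracy was applied, is not stated in this thread, so the weighted sum below is an assumption.

```python
# Weights quoted in the review comment; the combination rule is assumed.
WAZA_WEIGHTS = {
    "task_completion": 0.4,
    "trigger_accuracy": 0.3,
    "behavior_quality": 0.3,
}
FLAT_THRESHOLD = 0.80  # flat threshold mentioned in the review


def weighted_score(metrics: dict[str, float]) -> float:
    """Weighted sum of per-metric scores, each in the range 0..1."""
    return sum(WAZA_WEIGHTS[name] * metrics[name] for name in WAZA_WEIGHTS)


def flat_pass(pass_rate: float, threshold: float = FLAT_THRESHOLD) -> bool:
    """Single gate: the overall pass rate must reach the threshold."""
    return pass_rate >= threshold
```

Under this model, a run scoring 1.0, 0.9 and 0.5 on the three metrics gets a weighted score of about 0.82, whereas the flat gate looks only at the overall pass rate, so the 76.8% CI score mentioned above fails `flat_pass`.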

Comment thread .github/workflows/eval.yml
Comment thread .github/workflows/eval.yml
Comment thread evals/azure-hosted-copilot-sdk/eval.yaml
Comment thread evals/azure-hosted-copilot-sdk/eval.yaml
Comment thread evals/azure-enterprise-infra-planner/eval.yaml
Comment thread evals/azure-deploy/eval.yaml Outdated
Comment thread evals/azure-enterprise-infra-planner/eval.yaml
Comment thread evals/azure-hosted-copilot-sdk/eval.yaml
Comment thread .vally.yaml
@JasonYeMSFT
Member

The test suite migration looks good. I left a few questions to better understand how vally works. Also, could you please do us a favor and add some documentation on how to run these test suites locally using vally? You can put the documentation in evals/readme.md or maybe just tests/readme.md.

Comment thread .github/workflows/eval.yml Outdated
Workflow hardening:
- Drop pull_request trigger (keep workflow_dispatch only) to eliminate
  token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for
  defense-in-depth

Package hygiene:
- Remove @microsoft/vally-cli from devDependencies (CI installs it
  explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency

Eval spec cleanup:
- Remove 13 broad output-not-contains "error"/"failed" graders from
  azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files
  (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics

Cleanup:
- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory

Follow-up issue #1920 tracks wiring CI to a curated medium suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wbreza
Collaborator Author

wbreza commented Apr 16, 2026

Re: docs. Added evals/README.md with GitHub Packages + .npmrc setup, local vally-cli eval examples, and results file descriptions. Thanks for the suggestion.

@wbreza wbreza requested review from JasonYeMSFT and jongio April 16, 2026 22:08
@github-actions
Contributor

github-actions Bot commented Apr 16, 2026

🔍 Token Analysis Report

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts compare --base origin/main --head HEAD --markdown

fatal: path 'evals/README.md' exists on disk, but not in 'origin/main'

📊 Token Change Report

Comparing origin/main to HEAD

Summary

| Metric | Value |
| --- | --- |
| 📈 Total Change | +573 tokens (0%) |
| Before | 0 tokens |
| After | 573 tokens |
| Files Changed | 1 |

Changed Files

| File | Before | After | Change |
| --- | --- | --- | --- |
| evals/README.md | - | 573 | +573 |

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts check --markdown

📊 Token Limit Check Report

Checked: 562 files
Exceeded: 77 files

⚠️ Files Exceeding Token Limits

| File | Tokens | Limit | Over By |
| --- | --- | --- | --- |
| .github/skills/analyze-skill-issues/SKILL.md | 2109 | 500 | +1609 |
| .github/skills/analyze-test-run/SKILL.md | 2471 | 500 | +1971 |
| .github/skills/file-test-bug/SKILL.md | 628 | 500 | +128 |
| .github/skills/sensei/README.md | 3531 | 2000 | +1531 |
| .github/skills/sensei/SKILL.md | 3026 | 500 | +2526 |
| .github/skills/sensei/references/EXAMPLES.md | 3701 | 2000 | +1701 |
| .github/skills/sensei/references/LOOP.md | 4181 | 2000 | +2181 |
| .github/skills/sensei/references/SCORING.md | 4299 | 2000 | +2299 |
| .github/skills/skill-authoring/SKILL.md | 839 | 500 | +339 |
| plugin/skills/appinsights-instrumentation/SKILL.md | 908 | 500 | +408 |
| plugin/skills/azure-ai/SKILL.md | 817 | 500 | +317 |
| plugin/skills/azure-aigateway/SKILL.md | 1258 | 500 | +758 |
| plugin/skills/azure-aigateway/references/policies.md | 2342 | 2000 | +342 |
| plugin/skills/azure-cloud-migrate/SKILL.md | 559 | 500 | +59 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/cloudrun-deployment-guide.md | 2029 | 2000 | +29 |
| plugin/skills/azure-cloud-migrate/references/services/functions/lambda-to-functions.md | 2600 | 2000 | +600 |
| plugin/skills/azure-cloud-migrate/references/services/functions/runtimes/javascript.md | 2181 | 2000 | +181 |
| plugin/skills/azure-compliance/SKILL.md | 1185 | 500 | +685 |
| plugin/skills/azure-compute/SKILL.md | 765 | 500 | +265 |
| plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md | 2631 | 2000 | +631 |
| plugin/skills/azure-compute/workflows/vm-troubleshooter/vm-troubleshooter.md | 2509 | 2000 | +509 |
| plugin/skills/azure-cost/SKILL.md | 1977 | 500 | +1477 |
| plugin/skills/azure-deploy/SKILL.md | 1643 | 500 | +1143 |
| plugin/skills/azure-deploy/references/pre-deploy-checklist.md | 4074 | 2000 | +2074 |
| plugin/skills/azure-deploy/references/recipes/azd/errors.md | 4001 | 2000 | +2001 |
| plugin/skills/azure-deploy/references/troubleshooting.md | 2038 | 2000 | +38 |
| plugin/skills/azure-diagnostics/SKILL.md | 1132 | 500 | +632 |
| plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md | 2147 | 2000 | +147 |
| plugin/skills/azure-diagnostics/aks-troubleshooting/node-issues.md | 2003 | 2000 | +3 |
| plugin/skills/azure-enterprise-infra-planner/SKILL.md | 999 | 500 | +499 |
| plugin/skills/azure-enterprise-infra-planner/references/constraints/compute-apps.md | 2022 | 2000 | +22 |
| plugin/skills/azure-hosted-copilot-sdk/SKILL.md | 1260 | 500 | +760 |
| plugin/skills/azure-kubernetes/SKILL.md | 2582 | 500 | +2082 |
| plugin/skills/azure-kusto/SKILL.md | 2149 | 500 | +1649 |
| plugin/skills/azure-messaging/SKILL.md | 967 | 500 | +467 |
| plugin/skills/azure-prepare/SKILL.md | 3235 | 500 | +2735 |
| plugin/skills/azure-prepare/references/aspire.md | 4617 | 2000 | +2617 |
| plugin/skills/azure-prepare/references/plan-template.md | 2559 | 2000 | +559 |
| plugin/skills/azure-prepare/references/recipes/azd/terraform.md | 3525 | 2000 | +1525 |
| plugin/skills/azure-prepare/references/research.md | 2274 | 2000 | +274 |
| plugin/skills/azure-prepare/references/resources-limits-quotas.md | 3322 | 2000 | +1322 |
| plugin/skills/azure-prepare/references/security.md | 2147 | 2000 | +147 |
| plugin/skills/azure-prepare/references/services/functions/bicep.md | 3065 | 2000 | +1065 |
| plugin/skills/azure-prepare/references/services/functions/templates/SPEC-composable-templates.md | 6187 | 2000 | +4187 |
| plugin/skills/azure-prepare/references/services/functions/templates/recipes/composition.md | 4649 | 2000 | +2649 |
| plugin/skills/azure-prepare/references/services/functions/terraform.md | 3358 | 2000 | +1358 |
| plugin/skills/azure-quotas/SKILL.md | 2818 | 500 | +2318 |
| plugin/skills/azure-quotas/references/commands.md | 2644 | 2000 | +644 |
| plugin/skills/azure-resource-lookup/SKILL.md | 1288 | 500 | +788 |
| plugin/skills/azure-resource-visualizer/SKILL.md | 2054 | 500 | +1554 |
| plugin/skills/azure-storage/SKILL.md | 1180 | 500 | +680 |
| plugin/skills/azure-upgrade/SKILL.md | 1001 | 500 | +501 |
| plugin/skills/azure-upgrade/references/services/functions/automation.md | 3463 | 2000 | +1463 |
| plugin/skills/azure-upgrade/references/services/functions/consumption-to-flex.md | 2773 | 2000 | +773 |
| plugin/skills/azure-validate/SKILL.md | 916 | 500 | +416 |
| plugin/skills/entra-app-registration/SKILL.md | 2067 | 500 | +1567 |
| plugin/skills/entra-app-registration/references/api-permissions.md | 2545 | 2000 | +545 |
| plugin/skills/entra-app-registration/references/cli-commands.md | 2211 | 2000 | +211 |
| plugin/skills/entra-app-registration/references/console-app-example.md | 2752 | 2000 | +752 |
| plugin/skills/entra-app-registration/references/oauth-flows.md | 2375 | 2000 | +375 |
| plugin/skills/microsoft-foundry/SKILL.md | 2870 | 500 | +2370 |
| plugin/skills/microsoft-foundry/foundry-agent/create/create.md | 3016 | 2000 | +1016 |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | 5767 | 2000 | +3767 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | 2342 | 2000 | +342 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md | 4268 | 2000 | +2268 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | 2547 | 2000 | +547 |
| plugin/skills/microsoft-foundry/foundry-agent/trace/references/kql-templates.md | 2701 | 2000 | +701 |
| plugin/skills/microsoft-foundry/foundry-agent/troubleshoot/troubleshoot.md | 2164 | 2000 | +164 |
| plugin/skills/microsoft-foundry/models/deploy-model/SKILL.md | 1640 | 500 | +1140 |
| plugin/skills/microsoft-foundry/models/deploy-model/capacity/SKILL.md | 1739 | 500 | +1239 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/SKILL.md | 2235 | 500 | +1735 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/references/customize-workflow.md | 3335 | 2000 | +1335 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/SKILL.md | 1226 | 500 | +726 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/references/preset-workflow.md | 5534 | 2000 | +3534 |
| plugin/skills/microsoft-foundry/quota/quota.md | 2288 | 2000 | +288 |
| plugin/skills/microsoft-foundry/quota/references/capacity-planning.md | 2080 | 2000 | +80 |
| plugin/skills/microsoft-foundry/references/sdk/foundry-sdk-py.md | 2162 | 2000 | +162 |

Consider moving content to references/ subdirectories.
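
The "Over By" column in the report above is consistent with a simple rule: SKILL.md files get a 500-token limit and every other listed file gets 2000. The sketch below encodes that inferred rule. It is not the repository's src/tokens/cli.ts logic, and the function names are hypothetical.

```python
from pathlib import PurePosixPath

SKILL_LIMIT = 500     # limit shown for SKILL.md rows in the report
DEFAULT_LIMIT = 2000  # limit shown for every other row in the report


def token_limit(path: str) -> int:
    """Pick the limit from the file name (rule inferred from the report rows)."""
    return SKILL_LIMIT if PurePosixPath(path).name == "SKILL.md" else DEFAULT_LIMIT


def over_by(path: str, tokens: int) -> int:
    """Tokens above the limit, or 0 when the file is within it."""
    return max(0, tokens - token_limit(path))


print(over_by(".github/skills/file-test-bug/SKILL.md", 628))  # 128
print(over_by("plugin/skills/azure-diagnostics/aks-troubleshooting/node-issues.md", 2003))  # 3
print(over_by("evals/README.md", 573))  # 0
```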


Automated token analysis. See skill authoring guidelines for best practices.

Comment thread evals/README.md Outdated
JasonYeMSFT previously approved these changes Apr 16, 2026
- Update ai-bench references in evals/README.md to microsoft/evaluate
  (the actual upstream Vally repo name)
- Add https://aka.ms/vally as the canonical docs link
- Clarify that contributors don't need source-repo access to run evals
  locally — the @microsoft/vally-cli package from GitHub Packages is
  sufficient

Addresses JasonYeMSFT's review question on evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 17, 2026 00:13
@wbreza wbreza requested a review from JasonYeMSFT April 17, 2026 00:13
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 18 changed files in this pull request and generated 14 comments.

Comment on lines +124 to +137
# Task: expected.output_not_contains
- type: output-not-contains
  config:
    substring: "wireApi"
- type: output-not-contains
  config:
    substring: "bearerToken"
- type: output-not-contains
  config:
    substring: "DefaultAzureCredential"
# Global: no_fatal_errors
- type: output-not-matches
  config:
    pattern: "(?i)fatal error|unhandled exception|stack trace"

Copilot AI Apr 17, 2026

In the Waza task source, this case also checked output_not_contains: ["error", "failed"]. The migrated Vally stimulus keeps the BYOM-negative assertions but drops the generic error/failure check, which changes the original test intent. Consider adding output-not-contains graders for "error"/"failed" (or expanding the no_fatal_errors regex) to keep parity with the previous task.

Copilot uses AI. Check for mistakes.
Comment on lines +161 to +174
# Task: expected.output_not_contains
- type: output-not-contains
  config:
    substring: "apiKey"
- type: output-not-contains
  config:
    substring: "AZURE_OPENAI_API_KEY"
- type: output-not-contains
  config:
    substring: "AZURE_OPENAI_KEY"
# Global: no_fatal_errors
- type: output-not-matches
  config:
    pattern: "(?i)fatal error|unhandled exception|stack trace"

Copilot AI Apr 17, 2026

In the Waza task source, this case also checked output_not_contains: ["error", "failed"]. The migrated Vally stimulus preserves the API-key negative assertions but drops the generic error/failure check, which changes the original test intent. Consider adding output-not-contains graders for "error"/"failed" (or expanding the no_fatal_errors regex) to keep parity with the previous task.

tags:
  type: integration
  tier: full
  area: [output, files]

Copilot AI Apr 17, 2026

tags.area is a YAML list ([output, files]). If tags are expected to be scalar values, this can fail validation or prevent suite filters from matching. Prefer a scalar value or supported multi-tag pattern.

Suggested change
- area: [output, files]
+ area: output-files
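
The concern about list-valued tags can be illustrated with a toy filter. The sketch assumes suite filters compare tag values by strict equality; that is only a guess at the behaviour the comment is worried about, and `matches_filter` is a hypothetical helper, not Vally's implementation.

```python
def matches_filter(tags: dict, wanted: dict) -> bool:
    """Strict scalar equality: every wanted key must equal the tag value exactly."""
    return all(tags.get(key) == value for key, value in wanted.items())


list_tags = {"type": "integration", "tier": "full", "area": ["output", "files"]}
scalar_tags = {"type": "integration", "tier": "full", "area": "output-files"}

print(matches_filter(list_tags, {"area": "output"}))          # False: list != "output"
print(matches_filter(scalar_tags, {"area": "output-files"}))  # True
print(matches_filter(list_tags, {"tier": "full"}))            # True: scalar keys still match
```

If the real filter supports membership tests on list values, the list form would be fine; the sketch only shows why a strict-equality filter would skip it.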

tags:
  type: integration
  tier: full
  area: [output, files]

Copilot AI Apr 17, 2026

tags.area is set as a YAML list ([output, files]), which may not be supported for tag values and can break suite filtering. Prefer a scalar tag value or an alternative representation that Vally supports.

Suggested change
- area: [output, files]
+ area: output-files

tags:
  type: integration
  tier: full
  area: [output, files]

Copilot AI Apr 17, 2026

tags.area is set as a YAML list ([output, files]). If Vally expects scalar tag values, this can fail linting or make suite filters ineffective. Prefer a scalar value or a supported way to represent multiple areas.

Suggested change
- area: [output, files]
+ area: output-files

tags:
  type: integration
  tier: full
  area: [output, files]

Copilot AI Apr 17, 2026

tags.area is set as a YAML list ([output, files]). If Vally treats tags as scalar key/value pairs (as implied by .vally.yaml suite filters), list-valued tags may fail schema validation and/or not match suite filtering. Prefer a single scalar value (e.g., output or files) or encode multiple areas in a different tag scheme that Vally supports.

Suggested change
- area: [output, files]
+ area: files

Comment on lines +52 to +56
# Task: expected.output_not_contains
# Global: no_fatal_errors
- type: output-not-matches
  config:
    pattern: "(?i)fatal error|unhandled exception|stack trace"

Copilot AI Apr 17, 2026

The original Waza task for this stimulus asserted output_not_contains: ["error", "failed"], but the migrated Vally stimulus doesn’t include equivalent negative checks (only a fatal/exception regex). This reduces coverage and may allow error responses to pass; consider adding output-not-contains graders for "error"/"failed" (or broadening the no-error regex) to preserve the original assertions.

Comment on lines +74 to +78
# Task: expected.output_not_contains
# Global: no_fatal_errors
- type: output-not-matches
  config:
    pattern: "(?i)fatal error|unhandled exception|stack trace"

Copilot AI Apr 17, 2026

The original Waza task for this stimulus asserted output_not_contains: ["error", "failed"], but the migrated Vally stimulus doesn’t include equivalent negative checks (only a fatal/exception regex). Consider adding output-not-contains graders for "error"/"failed" (or broadening the no-error regex) to preserve the original assertions.

Comment on lines +102 to +106
# Task: expected.output_not_contains
# Global: no_fatal_errors
- type: output-not-matches
  config:
    pattern: "(?i)fatal error|unhandled exception|stack trace"

Copilot AI Apr 17, 2026

The original Waza task for this stimulus asserted output_not_contains: ["error", "failed"], but the migrated Vally stimulus doesn’t include equivalent negative checks (only a fatal/exception regex). Consider adding output-not-contains graders for "error"/"failed" (or broadening the no-error regex) to preserve the original assertions.

tags:
  type: integration
  tier: full
  area: [output, files]

Copilot AI Apr 17, 2026

tags.area is a YAML list ([output, files]), which may not be supported for tag values and can break filtering/validation. Prefer a scalar value or supported multi-tag pattern.

Suggested change
- area: [output, files]
+ area: output-files

Collaborator

@jongio jongio left a comment

Earlier concerns are addressed - thanks. One small heads-up left as an inline.

Comment thread .npmrc
@@ -0,0 +1 @@
+@microsoft:registry=https://npm.pkg.github.com
\ No newline at end of file
Collaborator

This routes the entire @microsoft scope to GitHub Packages, not just vally-cli. Nothing in the tree hits that scope today (scripts/, tests/, dashboard/ have no @microsoft/* deps), but if a sub-package later pulls a public @microsoft/* from npmjs, local npm install will 404 for contributors who don't have VALLY_NPM_TOKEN set. Worth a one-liner in evals/README.md so the next person isn't surprised.
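
The scope-wide effect described above can be modelled with a small resolver. This is a simplified sketch of how a scoped registry line in .npmrc applies to every package in that scope, not npm's actual config loader; `@microsoft/other-package` is a made-up package name used only for illustration.

```python
DEFAULT_REGISTRY = "https://registry.npmjs.org"


def parse_npmrc(text: str) -> dict[str, str]:
    """Collect '@scope:registry=URL' lines into {scope: url}."""
    scopes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comments
        key, sep, value = line.partition("=")
        if sep and key.startswith("@") and key.endswith(":registry"):
            scopes[key[: -len(":registry")]] = value.strip()
    return scopes


def registry_for(package: str, scopes: dict[str, str]) -> str:
    """Scoped packages use their scope's registry; everything else uses the default."""
    scope = package.split("/", 1)[0] if package.startswith("@") else None
    return scopes.get(scope, DEFAULT_REGISTRY)


scopes = parse_npmrc("@microsoft:registry=https://npm.pkg.github.com")
print(registry_for("@microsoft/vally-cli", scopes))      # https://npm.pkg.github.com
print(registry_for("@microsoft/other-package", scopes))  # https://npm.pkg.github.com
print(registry_for("yaml", scopes))                      # https://registry.npmjs.org
```

The second call is the case the comment warns about: any other package in the @microsoft scope is looked up on GitHub Packages as well.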

Member

I find the current evals/README.md sufficient for explaining what a local developer needs to do to set up the NPM token.

Member

@JasonYeMSFT JasonYeMSFT left a comment

Please resolve the merge conflicts.
