
Eval suite renaming and multiple env & agent folder support #2092

Open
XOEEst wants to merge 33 commits into microsoft:main from XOEEst:main

Conversation


@XOEEst XOEEst commented Apr 29, 2026

Description

This pull request updates the documentation for Microsoft Foundry agent workflows to support multiple environment-specific metadata files (such as agent-metadata.prod.yaml) in addition to the default agent-metadata.yaml. The changes clarify how to select, resolve, and persist agent metadata, enforce scoping to the selected agent root, and standardize the transition from legacy test case formats to the new evaluationSuites[] structure. The documentation now consistently refers to the "selected metadata file" and provides detailed rules for environment and metadata file selection, artifact persistence, and cache management.

Metadata file selection and environment resolution:

  • Added support for environment-specific metadata files (e.g., agent-metadata.prod.yaml) alongside agent-metadata.yaml, with clear rules for selecting which file to use based on workflow context and user input. [1] [2]
  • Updated all references and workflows to use the "selected metadata file" and clarified the order of precedence for environment resolution, including normalization of legacy testSuites[] and testCases[] to evaluationSuites[]. [1] [2]
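
As a concrete illustration of the selection rules above, an environment-specific sidecar might look like the following sketch. Only `evaluationSuites[]` and the `tier`/`purpose` tag keys are named in this PR; every other field name and value here is hypothetical.

```yaml
# Hypothetical .foundry/agent-metadata.prod.yaml — illustrative, not the actual contract.
agentName: support-agent            # assumed name field
environment: prod
evaluationSuites:                   # canonical field replacing legacy testSuites[]/testCases[]
  - name: smoke-suite
    tags:
      tier: smoke                   # suggested run order / breadth
      purpose: baseline             # why the suite exists
    dataset: datasets/smoke.v1.jsonl   # assumed path, resolved inside the agent root
```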

Agent root and cache scoping:

  • Enforced strict scoping to the selected agent root for all cache, dataset, and evaluator operations; workflows no longer scan or merge sibling agent folders unless explicitly requested. [1] [2] [3]

Artifact and metadata persistence:

  • Standardized all artifact persistence (datasets, evaluators, results, and test suites) to operate within the selected agent root and metadata file, with rules for updating only the relevant environment block and for migrating legacy test case formats to evaluationSuites[]. [1] [2] [3]

Terminology and documentation consistency:

  • Updated terminology throughout to refer to "evaluation suites" (evaluationSuites[]) instead of "test cases" or "test suites," and ensured dataset naming conventions reference the selected metadata file and agent root. [1] [2] [3] [4]
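
The migration-on-write from legacy `testSuites[]`/`testCases[]` to `evaluationSuites[]` described above can be sketched in Python. The exact field mapping — in particular how bare legacy test cases are wrapped into a suite — is an assumption for illustration, not the documented contract.

```python
def normalize_metadata(meta: dict) -> dict:
    """Fold legacy testSuites[]/testCases[] into evaluationSuites[] on write.

    A sketch of the migration-on-write rule; the wrapping of bare
    testCases[] into a single suite is assumed, not specified in the PR.
    """
    suites = list(meta.get("evaluationSuites", []))
    # Legacy testSuites[] entries are carried over as evaluation suites.
    suites.extend(meta.pop("testSuites", []))
    # Bare legacy testCases[] are wrapped into one migrated suite.
    cases = meta.pop("testCases", [])
    if cases:
        suites.append({"name": "migrated-legacy-cases", "cases": cases})
    if suites:
        meta["evaluationSuites"] = suites
    return meta
```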

User interaction and workflow prompts:

  • Clarified when to prompt the user for agent root, metadata file, and environment selection, and specified that workflow summaries should always display the current agent root, metadata file, and environment. [1] [2]

These changes ensure that all Foundry agent workflows are robust, environment-aware, and compatible with both legacy and new metadata formats, improving reliability and user experience.

Checklist

  • Tests pass locally (cd tests && npm test)
  • If modifying skill descriptions: verified routing correctness with integration tests (npm run test:skills:integration -- <skill>)
  • If modifying skill USE FOR / DO NOT USE FOR / PREFER OVER clauses: confirmed no routing regressions for competing skills
  • Version bumped in skill frontmatter (if skill files changed)

Related Issues

Copilot AI and others added 30 commits March 10, 2026 23:35
…OPILOT_GITHUB_TOKEN` (#6)

* Initial plan

* Fix issue triage workflow: add github.token fallback for COPILOT_GITHUB_TOKEN

The Issue Triage workflow was failing at the secret validation step because
COPILOT_GITHUB_TOKEN was not configured. This adds github.token as a fallback
in all 4 places where COPILOT_GITHUB_TOKEN is used for authentication:
- agent job: validate-secret step and Execute step
- detection job: validate-secret step and Execute step

This is consistent with the existing fallback patterns in the workflow
(e.g., secrets.GH_AW_GITHUB_MCP_SERVER_TOKEN || secrets.GH_AW_GITHUB_TOKEN || secrets.GITHUB_TOKEN)

Co-authored-by: XOEEst <18523445+XOEEst@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: XOEEst <18523445+XOEEst@users.noreply.github.com>
* Fix broken auto-create evaluators step in deploy/observe loop

The 'Auto-create evaluators & evaluation dataset' step was being skipped
when the monolithic agent-observability-loop skill was split into separate
deploy and observe skills. Neither skill owned the auto-create step,
causing post-deploy users to jump directly to evaluation.

Changes:
- deploy.md: Replace generic 'set up evaluation?' prompt with automatic
  6-step evaluator & dataset creation matching the reference behavior
- observe.md: Add Loop Overview, fix entry points to route post-deploy
  users through auto-setup, add evaluator existence check
- deploy-and-setup.md: Make auto-create primary content, demote deploy
  section to prerequisites

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add content tests for observe/deploy loop logic

Tests verify:
- observe.md has Loop Overview, post-deploy entry points, evaluator
  existence checks, behavioral rules, and all reference files
- deploy.md has auto-create evaluators section that is automatic (not
  optional), includes evaluator categories, LLM-judge, artifact
  persistence, and routes to observe skill Step 2
- deploy-and-setup.md has auto-create as primary content with proper
  evaluator selection, dataset generation, and user prompt

49 tests total (29 observe + 20 deploy), all passing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: trigger CI checks

* Fix

* add local dataset gen enforcement

* Merge

* feat: prefer monitor_resource_log_query and local datasets

- Replace azure-kusto delegation with monitor_resource_log_query for
  App Insights KQL queries in trace.md and troubleshoot.md
- Mark evaluation_dataset_create as not available (MCP upload not ready)
- Replace server-side dataset sections with local JSONL workflow
- Update mcp-gap-analysis.md to reflect practical tool availability

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: make dataset upload restriction more agent-proof

- Add Do NOT section at top of trace-to-dataset.md (before Overview)
- Add behavioral rule #7 to eval-datasets.md: never upload to cloud
- Remove Option A/B structure; Step 4 is now local JSONL only
- Eliminates subtle strikethrough formatting that agents miss

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix link

* fix: make auto-create evaluators an explicit numbered step

- Hosted workflow: add Step 10 after Step 9 with DO NOT stop gate
- Prompt workflow: add Step 5 after Step 4 with DO NOT stop gate
- Both link to existing After Deployment section as implementation
- Prevents agents from treating evaluator setup as optional appendix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add dataset update loop with optimization guardrails

- Add Dataset Update Loop (eval→compare→analyze→optimize→re-eval)
  to dataset-versioning.md after Creating a New Version
- Add guardrails: never remove dataset rows or weaken evaluators
  to recover scores after dataset expansion
- Add same guardrail to observe optimize-deploy.md Step 6
- Add behavioral rule #8 to eval-datasets.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: add subscription parameter warning to trace-related skills

Always pass subscription explicitly to Azure MCP tools like
monitor_resource_log_query — they don't extract it from resource IDs.
Added to trace.md, troubleshoot.md, and trace-to-dataset.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: make customEvents-to-traces eval correlation more obvious

- Add Key Concept section to trace-to-dataset.md explaining that
  eval results live in customEvents (not dependencies) and the
  join key is gen_ai.response.id
- Add table showing dependencies vs customEvents join pattern
- Cross-reference trace skill's eval-correlation.md from both
  trace-to-dataset.md and eval-datasets.md Related Skills

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: improve cross-references and add KQL parse_json warning

1. Add parse_json(customDimensions) warning to Do NOT section
2. Add Related References section with skill-root paths
3. Add skill-root path hints to all cross-skill links
4. Add observe + trace to SKILL.md sub-skill routing table

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: improve hosted agent KQL patterns and content extraction

- Add Hosted Agent Harvest template (requests→dependencies join)
- Fix Hosted Agent Attributes: appear on both requests and traces
- Add gen_ai.agent.name duality callout (Foundry name vs class name)
- Remove incorrect azure.ai.agentserver.agent_name fallback from dependencies queries
- Document gen_ai.input.messages/gen_ai.output.messages as content source
- Add operation_ParentId join example to Span Correlation section
- Update search-traces.md hosted agent query to use requests entry point

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: improve trace sub-skills for hosted agent KQL patterns

- search-traces: fix hosted agent query to group by operation_ParentId
- conversation-detail: add content extraction from invoke_agent spans
  (gen_ai.input.messages / gen_ai.output.messages)
- analyze-failures: add hosted agent gen_ai.agent.name duality warning
  and hosted agent variant query using requests→dependencies join
- analyze-latency: same hosted agent warning and variant query
- kql-templates: expand requests table description as preferred entry
  point; add gen_ai.input/output.messages to attributes table
- trace.md: reword rule 6 to clarify hosted vs prompt agent filtering

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: restore routing keywords and update trigger snapshots

- Add back critical routing keywords to SKILL.md description (578→779 chars):
  role assignment, permissions, capacity, region, deployment failure,
  AI Services, Cognitive Services, provision, knowledge index,
  monitoring, customize, onboard, availability
- Update trigger test snapshots for new keyword set (24 snapshots)
- Fix deploy trigger test: Docker IS our capability (remove false negative)
- Fix customize-deployment tests: ensure prompts have ≥2 keyword matches
- Fix deploy-model-optimal-region tests: use longer prompts for HA/PTU

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: add 'create AI Services' to description for resource/create test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: bump microsoft-foundry version to 1.0.2

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(eval-datasets): enable Foundry dataset sync via MCP tools

- Add Step 5 (Sync to Foundry) to trace-to-dataset pipeline using
  evaluation_dataset_create with connectionName and project_connection tools
- Add server-side version discovery via evaluation_dataset_versions_get
- Add dual experiment types to dataset-comparison (agent vs dataset comparison)
- Update mcp-gap-analysis: mark resolved tools, update workarounds
- Add AzureBlob to project connections reference
- Bump microsoft-foundry version to 1.0.3
- Fix upstream section heading changes in unit tests
- Update trigger snapshots for upstream keyword changes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(dataset-comparison): focus on dataset-version comparison only

Remove agent comparison experiment type from dataset-comparison flow.
Agent comparison belongs in the observe/eval loop, not the dataset skill.
Update all examples to use dataset versions as baseline/treatment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove Playwright MCP server until skills require it (microsoft#1200)

* Collapse token analysis comment (microsoft#1147)

* update region-availability in prepare/validate/deploy skill (microsoft#1083)

* update region-availability in prepare/deploy skill

* update

* update

* fix

* update date

* Update plugin/skills/azure-deploy/references/region-availability.md

* fix ci failure

* bump version

* build(deps): bump @github/copilot and @github/copilot-sdk in /tests (microsoft#1201)

Bumps [@github/copilot](https://github.com/github/copilot-cli) to 1.0.2 and updates ancestor dependency [@github/copilot-sdk](https://github.com/github/copilot-sdk). These dependencies need to be updated together.

Updates `@github/copilot` from 0.0.414 to 1.0.2
- [Release notes](https://github.com/github/copilot-cli/releases)
- [Changelog](https://github.com/github/copilot-cli/blob/main/changelog.md)
- [Commits](github/copilot-cli@v0.0.414...v1.0.2)

Updates `@github/copilot-sdk` from 0.1.26 to 0.1.32
- [Release notes](https://github.com/github/copilot-sdk/releases)
- [Changelog](https://github.com/github/copilot-sdk/blob/main/CHANGELOG.md)
- [Commits](https://github.com/github/copilot-sdk/commits/v0.1.32)

---
updated-dependencies:
- dependency-name: "@github/copilot"
  dependency-version: 1.0.2
  dependency-type: indirect
- dependency-name: "@github/copilot-sdk"
  dependency-version: 0.1.32
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* specify application path in prompt (microsoft#1204)

* Add AVM (Azure Verified Modules) integration tests (microsoft#1171)

* Add AVM (Azure Verified Modules) integration tests

Add 3 integration tests validating the AVM module selection hierarchy
for Bicep infrastructure generation:

- avm-module-priority: Verifies AVM modules prioritized over non-AVM
- avm-fallback-behavior: Verifies fallback stays within AVM ecosystem
- avm-azd-pattern-preference: Verifies AZD pattern modules preferred

Tests validate that the azure-deploy skill enforces the mandatory
AVM selection order: Pattern modules > Resource modules > Utility
modules, and never falls back to non-AVM alternatives.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add output assertions to AVM integration tests

Address Copilot review feedback: add keyword-based output
assertions using getAllAssistantMessages/getAllToolText to
verify agent responses contain AVM hierarchy terms, not just
skill invocation. Includes non-AVM fallback negative check.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Strengthen AVM test output assertions per Copilot review

- Split keyword checks into critical-term + context assertions
- Add resource-before-utility ordering assertion for fallback test
- Expand non-AVM negative check to use regex patterns
- Require core keywords (avm+pattern, azd+pattern) explicitly

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address Copilot round 3 — ordering assertions and context-aware non-AVM check

- Add hierarchy ordering assertion to test 1 (pattern before resource/utility)
- Make non-AVM detection context-aware: skip matches preceded by negation words
  (e.g., 'never fall back to non-AVM' is correct behavior, not a false positive)
- Add pattern-before-resource ordering assertion to test 3 (AZD pattern preference)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor: move AVM integration tests to avm/ subdirectory

Move tests/azure-deploy/avm-integration.test.ts to
tests/azure-deploy/avm/integration.test.ts so the file matches the
**/integration.test.ts glob used by the custom ESLint rule
(integration-test-name) and follows the subdirectory convention
established by tests/microsoft-foundry/ (e.g. foundry-agent/).

Import paths updated from ../utils/ to ../../utils/ to reflect the
new depth.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address round 4 Copilot review feedback

- Add 'fall back'/'fall-back' keyword variants for resilience
- Extend non-AVM negation check to also scan following context
- Use regex for AZD ordering assertion to match plural/prefixed variants

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update github workflows to use best practices (microsoft#1149)

* Early terminate azure-deploy, azure-validate tests (microsoft#1205)

Add comment for early termination to help AI grader

* Replace script inline parameters with env var (microsoft#1209)

* Early terminate azure-deploy tests on deploy link (microsoft#1208)

* Early terminate azure-deploy tests on deploy link

* Fix lint issue

* Reduce char count of existing skills (microsoft#1210)

* Reduce char count of existing skills

* Update ci tests and snapshots

* Enhance benchmark ci run script (microsoft#1176)

* Add msbench_benchmarks repo clone to get model definition

* Remove unused vars

* Use mcp-pr repo before MI has access to msbench-benchmarks repo

* Address copilot feedback

* Change back to msbench-benchmarks repo

* Get ADO token for repo clone

* Fix line continuation character

* Add run for all interested models

* Extract run IDs

* Fix yaml format issue

* Schedule it to run nightly

* Address copilot feedbacks

* formalize .foundry and multi-environment support

* fix

* Feature/azure quotas (microsoft#1137)

* update for using azure-quotas in skill

* test update

* unit test update

* path update

* add skill in skills.json

* skills.json update

* reduce the text

* version update

* skill version

* skill description update

* reduce text size

* 1.0.4 for next prepare version

* upload snap shot

* update version

* test update

---------

Co-authored-by: Yinghui Dong <yinghuidong@microsoft.com>

* build(deps-dev): bump simple-git from 3.30.0 to 3.32.3 in /tests (microsoft#1213)

Bumps [simple-git](https://github.com/steveukx/git-js/tree/HEAD/simple-git) from 3.30.0 to 3.32.3.
- [Release notes](https://github.com/steveukx/git-js/releases)
- [Changelog](https://github.com/steveukx/git-js/blob/main/simple-git/CHANGELOG.md)
- [Commits](https://github.com/steveukx/git-js/commits/simple-git@3.32.3/simple-git)

---
updated-dependencies:
- dependency-name: simple-git
  dependency-version: 3.32.3
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Improve azure-compliance invocation rate (microsoft#1214)

* Improve azure-compliance invocation rate

* Race condition free report writing

* Fix debug logging for report location

* Bump skill version

* Fix suffix base value

* fix

* llm judge model and eval group improvement

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Chris Harris <charris@microsoft.com>
Co-authored-by: JasonYeMSFT <chuye@microsoft.com>
Co-authored-by: xfz11 <81600993+xfz11@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Juan Ospina <70209456+jeo02@users.noreply.github.com>
Co-authored-by: Jon Gallant <2163001+jongio@users.noreply.github.com>
Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com>
Co-authored-by: Fan Yang <52458914+fanyang-mono@users.noreply.github.com>
Co-authored-by: rakal-dyh <33503911+rakal-dyh@users.noreply.github.com>
Co-authored-by: Yinghui Dong <yinghuidong@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

# Conflicts:
#	tests/microsoft-foundry/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/foundry-agent/create/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/foundry-agent/deploy/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/foundry-agent/invoke/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/foundry-agent/observe/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/foundry-agent/trace/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/foundry-agent/troubleshoot/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/models/deploy/capacity/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/models/deploy/customize-deployment/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/models/deploy/deploy-model-optimal-region/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/models/deploy/deploy-model/__snapshots__/triggers.test.ts.snap
#	tests/microsoft-foundry/resource/create/__snapshots__/triggers.test.ts.snap
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* dataset fix

* fix: restore foundry dataset guidance

Restore the explicit seed dataset registration guidance in the deploy skill and align dataset docs with the current evaluation_dataset_create MCP surface.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 29, 2026 07:49

Copilot AI left a comment


Pull request overview

Updates the microsoft-foundry skill documentation and unit tests to support (1) selecting environment-specific metadata sidecars (e.g., agent-metadata.prod.yaml) and (2) standardizing legacy testSuites[]/testCases[] into the canonical evaluationSuites[] schema, with stronger scoping to a single selected agent root.

Changes:

  • Documented metadata file sidecars (agent-metadata.<env>.yaml), selection precedence, and “selected metadata file” wording across workflows.
  • Standardized documentation around evaluationSuites[] + tags (including migration rules from legacy fields on write).
  • Updated unit tests to assert the new terminology, scoping rules, and schema references.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| `tests/microsoft-foundry/unit.test.ts` | Adds assertions for sidecar metadata and legacy-to-`evaluationSuites[]` migration wording. |
| `tests/microsoft-foundry/foundry-agent/observe/unit.test.ts` | Updates observe docs expectations for selected metadata file, root scoping, and `tier=smoke` terminology. |
| `tests/microsoft-foundry/foundry-agent/eval-datasets/unit.test.ts` | Updates eval-datasets docs expectations for selected metadata file/root scoping and the `evaluationSuites[]` schema. |
| `tests/microsoft-foundry/foundry-agent/deploy/unit.test.ts` | Updates deploy docs expectations for selected metadata file and the `evaluationSuites[]` schema. |
| `scripts/package-lock.json` | Lockfile updates (adds `peer: true` flags for some packages). |
| `plugin/skills/microsoft-foundry/references/agent-metadata-contract.md` | Adds sidecar file model, selection rules, and `evaluationSuites[]` schema plus migration guidance. |
| `plugin/skills/microsoft-foundry/foundry-agent/trace/trace.md` | Updates trace workflow to reference the selected metadata file (`agent-metadata*.yaml`). |
| `plugin/skills/microsoft-foundry/foundry-agent/trace/references/search-traces.md` | Updates prerequisites and persistence reminder to refer to the selected metadata file. |
| `plugin/skills/microsoft-foundry/foundry-agent/observe/references/evaluate-step.md` | Shifts "test case" language to "test suite" plus `tier=smoke` guidance. |
| `plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md` | Updates auto-setup instructions for selected root/file scoping and migration-on-write guidance. |
| `plugin/skills/microsoft-foundry/foundry-agent/observe/references/compare-iterate.md` | Updates wording to "test suite" and root-scoped dataset location. |
| `plugin/skills/microsoft-foundry/foundry-agent/observe/references/cicd-monitoring.md` | Documents CI/CD workflows selecting a metadata file (e.g., via `FOUNDRY_METADATA_FILE`). |
| `plugin/skills/microsoft-foundry/foundry-agent/observe/references/analyze-results.md` | Updates failure prioritization terminology to suite tags (but has an inconsistency noted in comments). |
| `plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md` | Enforces selected agent root/file scoping and standardizes on `evaluationSuites[]` plus legacy migration. |
| `plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md` | Updates prerequisites and explicitly scopes updates to the selected agent root only. |
| `plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md` | Updates metadata references to the selected file and `evaluationSuites[]`, including migration-on-write. |
| `plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/eval-trending.md` | Updates prerequisites to selected metadata file language. |
| `plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-versioning.md` | Updates `agentName` source to the selected metadata file. |
| `plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-organization.md` | Replaces `priority` with flexible `tags` in dataset row examples and filtering logic. |
| `plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md` | Updates overall eval-datasets guidance to selected root/file scoping and `evaluationSuites[]`. |
| `plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md` | Updates deploy workflow for selected root scoping, the selected metadata file, and `evaluationSuites[]`. |
| `plugin/skills/microsoft-foundry/SKILL.md` | Documents sidecar metadata support, selection precedence, scoping rules, and legacy migration behavior. |
Files not reviewed (1)
  • scripts/package-lock.json: Language not supported

4. Keep the selected agent root and environment visible in every deploy, eval, dataset, and trace summary.
5. Treat `datasets/` and `evaluators/` as cache folders. Reuse local files when present, but offer refresh when the user asks or when remote state is newer.
6. Never overwrite cache files or metadata silently.
3. Inside the selected agent root, select the metadata file in this order: explicit file/path from the user or workflow, then `.foundry/agent-metadata.<env>.yaml` when an explicit environment is already known and that file exists, then `.foundry/agent-metadata.yaml`.

Copilot AI Apr 29, 2026


The metadata-file selection rules don’t describe what to do when .foundry/agent-metadata.yaml is absent and multiple sidecar files exist (e.g., agent-metadata.prod.yaml, agent-metadata.staging.yaml) but no explicit file/path was provided. Adding an explicit “prompt the user to choose the metadata file” fallback here would make this contract consistent with the skill’s selection rules and avoid ambiguous behavior.

Suggested change
3. Inside the selected agent root, select the metadata file in this order: explicit file/path from the user or workflow, then `.foundry/agent-metadata.<env>.yaml` when an explicit environment is already known and that file exists, then `.foundry/agent-metadata.yaml`.
3. Inside the selected agent root, select the metadata file in this order: explicit file/path from the user or workflow, then `.foundry/agent-metadata.<env>.yaml` when an explicit environment is already known and that file exists, then `.foundry/agent-metadata.yaml`. If `.foundry/agent-metadata.yaml` is absent, use the only matching sidecar file when exactly one `.foundry/agent-metadata.<env>.yaml` file exists; if multiple sidecar files exist and no explicit file/path was provided, require the user to choose the metadata file.

Comment on lines +119 to +127
| Focus | Cluster | Suggested Action |
|-------|---------|------------------|
| Runtime blockers | Runtime errors or failing suites tagged `tier=smoke` | Check container logs or fix blockers first |
| Key regressions | Incorrect answers on suites tagged `purpose=regression` or `tier=smoke` | Optimize prompt or tool instructions |
| Broader quality gaps | Incomplete answers or coverage-oriented suites | Optimize prompt or expand context |
| Tooling issues | Tool call failures | Fix tool definitions or instructions |
| Safety issues | Safety violations | Add guardrails to instructions |

**Rule:** Prioritize runtime errors first, then sort by test-case priority (`P0` before `P1` before `P2`) and count × severity.
**Rule:** Prioritize runtime errors first, then suites tagged `tier=smoke`, then suites tagged `purpose=regression`, then broader coverage suites by count × severity.
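
The replacement rule quoted above can be sketched as a tuple sort key. Field names (`runtimeError`, `failures`, `severity`) are assumptions for illustration, and note the reviewers' unresolved point about `purpose=regression` versus `tier=regression`.

```python
def failure_priority(suite: dict) -> tuple:
    """Sort key for the new rule: runtime errors first, then suites
    tagged tier=smoke, then purpose=regression, then the rest ordered
    by failure count x severity, descending. Field names are assumed."""
    tags = suite.get("tags", {})
    return (
        0 if suite.get("runtimeError") else 1,            # runtime blockers first
        0 if tags.get("tier") == "smoke" else 1,          # then smoke suites
        0 if tags.get("purpose") == "regression" else 1,  # then regression suites
        -(suite.get("failures", 0) * suite.get("severity", 1)),  # count x severity
    )


suites = [
    {"name": "coverage", "tags": {"tier": "coverage"}, "failures": 9, "severity": 1},
    {"name": "smoke", "tags": {"tier": "smoke"}, "failures": 1, "severity": 1},
    {"name": "regress", "tags": {"purpose": "regression"}, "failures": 3, "severity": 2},
]
ordered = [s["name"] for s in sorted(suites, key=failure_priority)]  # smoke first
```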

Copilot AI Apr 29, 2026


This section prioritizes suites tagged purpose=regression, but elsewhere in the docs (and in the tag guidance) regressions are represented as tier=regression. To keep the schema and filtering terminology consistent, consider changing these references from purpose=regression to tier=regression (or, if purpose=regression is intended, update the tag guidance/examples to match).

| Tag Key | Example Values | Typical Use |
|---------|----------------|-------------|
| `tier` | `smoke`, `regression`, `coverage` | Suggested run order / breadth |
| `purpose` | `baseline`, `safety`, `tools`, `quality` | Why the suite exists |

Copilot AI Apr 29, 2026


In the tagging guidance, the example suite uses tags.purpose: regression (see the trace-regression-suite example), but the suggested purpose values table does not include regression. This is inconsistent and can confuse users about which key/value to use for regression-focused suites. Consider either (a) adding regression to the suggested purpose values, or (b) updating the example to keep regressions under tier: regression and use a purpose value from the table (e.g., quality).

Suggested change
| `purpose` | `baseline`, `safety`, `tools`, `quality` | Why the suite exists |
| `purpose` | `baseline`, `safety`, `tools`, `quality`, `regression` | Why the suite exists |


@jongio jongio left a comment


Two themes to address across this PR:

  1. Several places renamed "test case" to "test suite" instead of "evaluation suite", leaving a third inconsistent term alongside the legacy names and the new schema field. deploy.md section 6 heading, observe.md workflow steps 4 and 7, analyze-results.md step 5, and agent-metadata-contract.md's guidance paragraph all say "test suite" where "evaluation suite" would match the evaluationSuites[] field name. See inline comments for the most visible instances.

  2. The purpose values table in agent-metadata-contract.md lists baseline, safety, tools, quality but the trace-regression-suite example uses purpose: regression, and analyze-results.md filters on purpose=regression. The table should include regression as a valid purpose value so the docs are self-consistent.

Use [Generate Seed Evaluation Dataset](../eval-datasets/references/generate-seed-dataset.md) as the single source of truth for seed dataset registration. It covers `project_connection_list` with `AzureStorageAccount`, key-based versus AAD upload, `evaluation_dataset_create` with `connectionName`, and saving the returned `datasetUri`.

### 6. Persist Artifacts and Test Cases
### 6. Persist Artifacts and Test Suites


This heading says "Test Suites" but the schema field is evaluationSuites[]. For consistency with the rest of the rename, this should be "Evaluation Suites". Same applies to the prompt text further down ("test-suite metadata" at line 312 should be "evaluation-suite metadata").

2. Evaluate (batch eval run)
3. Download and cluster failures
4. Pick a category or test case to optimize
4. Pick a category or test suite to optimize


"test suite" here and in step 7 (line 52) should be "evaluation suite" to match the evaluationSuites[] schema name. The behavioral rule at line 63 ("Run test suites tagged...") has the same inconsistency.

| Tag Key | Example Values | Typical Use |
|---------|----------------|-------------|
| `tier` | `smoke`, `regression`, `coverage` | Suggested run order / breadth |
| `purpose` | `baseline`, `safety`, `tools`, `quality` | Why the suite exists |


The purpose example values don't include regression, but the trace-regression-suite example above (line 75) uses purpose: regression and analyze-results.md prioritizes suites on purpose=regression. Add regression to this table so the suggested values match actual usage.
