Eval suite renaming and multiple env & agent folder support. #2092
XOEEst wants to merge 33 commits into microsoft:main from
Conversation
…OPILOT_GITHUB_TOKEN` (#6) * Initial plan * Fix issue triage workflow: add github.token fallback for COPILOT_GITHUB_TOKEN The Issue Triage workflow was failing at the secret validation step because COPILOT_GITHUB_TOKEN was not configured. This adds github.token as a fallback in all 4 places where COPILOT_GITHUB_TOKEN is used for authentication: - agent job: validate-secret step and Execute step - detection job: validate-secret step and Execute step This is consistent with the existing fallback patterns in the workflow (e.g., secrets.GH_AW_GITHUB_MCP_SERVER_TOKEN || secrets.GH_AW_GITHUB_TOKEN || secrets.GITHUB_TOKEN) Co-authored-by: XOEEst <18523445+XOEEst@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: XOEEst <18523445+XOEEst@users.noreply.github.com>
* Fix broken auto-create evaluators step in deploy/observe loop The 'Auto-create evaluators & evaluation dataset' step was being skipped when the monolithic agent-observability-loop skill was split into separate deploy and observe skills. Neither skill owned the auto-create step, causing post-deploy users to jump directly to evaluation. Changes: - deploy.md: Replace generic 'set up evaluation?' prompt with automatic 6-step evaluator & dataset creation matching the reference behavior - observe.md: Add Loop Overview, fix entry points to route post-deploy users through auto-setup, add evaluator existence check - deploy-and-setup.md: Make auto-create primary content, demote deploy section to prerequisites Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add content tests for observe/deploy loop logic Tests verify: - observe.md has Loop Overview, post-deploy entry points, evaluator existence checks, behavioral rules, and all reference files - deploy.md has auto-create evaluators section that is automatic (not optional), includes evaluator categories, LLM-judge, artifact persistence, and routes to observe skill Step 2 - deploy-and-setup.md has auto-create as primary content with proper evaluator selection, dataset generation, and user prompt 49 tests total (29 observe + 20 deploy), all passing. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: trigger CI checks * Fix * add local dataset gen enforcement * Merge * feat: prefer monitor_resource_log_query and local datasets - Replace azure-kusto delegation with monitor_resource_log_query for App Insights KQL queries in trace.md and troubleshoot.md - Mark evaluation_dataset_create as not available (MCP upload not ready) - Replace server-side dataset sections with local JSONL workflow - Update mcp-gap-analysis.md to reflect practical tool availability Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: make dataset upload restriction more agent-proof - Add Do NOT section at top of trace-to-dataset.md (before Overview) - Add behavioral rule #7 to eval-datasets.md: never upload to cloud - Remove Option A/B structure; Step 4 is now local JSONL only - Eliminates subtle strikethrough formatting that agents miss Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix link * fix: make auto-create evaluators an explicit numbered step - Hosted workflow: add Step 10 after Step 9 with DO NOT stop gate - Prompt workflow: add Step 5 after Step 4 with DO NOT stop gate - Both link to existing After Deployment section as implementation - Prevents agents from treating evaluator setup as optional appendix Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add dataset update loop with optimization guardrails - Add Dataset Update Loop (eval→compare→analyze→optimize→re-eval) to dataset-versioning.md after Creating a New Version - Add guardrails: never remove dataset rows or weaken evaluators to recover scores after dataset expansion - Add same guardrail to observe optimize-deploy.md Step 6 - Add behavioral rule #8 to eval-datasets.md Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: add subscription parameter warning to trace-related skills Always pass subscription explicitly to Azure MCP tools like 
monitor_resource_log_query — they don't extract it from resource IDs. Added to trace.md, troubleshoot.md, and trace-to-dataset.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: make customEvents-to-traces eval correlation more obvious - Add Key Concept section to trace-to-dataset.md explaining that eval results live in customEvents (not dependencies) and the join key is gen_ai.response.id - Add table showing dependencies vs customEvents join pattern - Cross-reference trace skill's eval-correlation.md from both trace-to-dataset.md and eval-datasets.md Related Skills Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: improve cross-references and add KQL parse_json warning 1. Add parse_json(customDimensions) warning to Do NOT section 2. Add Related References section with skill-root paths 3. Add skill-root path hints to all cross-skill links 4. Add observe + trace to SKILL.md sub-skill routing table Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: improve hosted agent KQL patterns and content extraction - Add Hosted Agent Harvest template (requests→dependencies join) - Fix Hosted Agent Attributes: appear on both requests and traces - Add gen_ai.agent.name duality callout (Foundry name vs class name) - Remove incorrect azure.ai.agentserver.agent_name fallback from dependencies queries - Document gen_ai.input.messages/gen_ai.output.messages as content source - Add operation_ParentId join example to Span Correlation section - Update search-traces.md hosted agent query to use requests entry point Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: improve trace sub-skills for hosted agent KQL patterns - search-traces: fix hosted agent query to group by operation_ParentId - conversation-detail: add content extraction from invoke_agent spans (gen_ai.input.messages / gen_ai.output.messages) - analyze-failures: add hosted agent gen_ai.agent.name duality warning and hosted 
agent variant query using requests→dependencies join - analyze-latency: same hosted agent warning and variant query - kql-templates: expand requests table description as preferred entry point; add gen_ai.input/output.messages to attributes table - trace.md: reword rule 6 to clarify hosted vs prompt agent filtering Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: restore routing keywords and update trigger snapshots - Add back critical routing keywords to SKILL.md description (578→779 chars): role assignment, permissions, capacity, region, deployment failure, AI Services, Cognitive Services, provision, knowledge index, monitoring, customize, onboard, availability - Update trigger test snapshots for new keyword set (24 snapshots) - Fix deploy trigger test: Docker IS our capability (remove false negative) - Fix customize-deployment tests: ensure prompts have ≥2 keyword matches - Fix deploy-model-optimal-region tests: use longer prompts for HA/PTU Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: add 'create AI Services' to description for resource/create test Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: bump microsoft-foundry version to 1.0.2 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(eval-datasets): enable Foundry dataset sync via MCP tools - Add Step 5 (Sync to Foundry) to trace-to-dataset pipeline using evaluation_dataset_create with connectionName and project_connection tools - Add server-side version discovery via evaluation_dataset_versions_get - Add dual experiment types to dataset-comparison (agent vs dataset comparison) - Update mcp-gap-analysis: mark resolved tools, update workarounds - Add AzureBlob to project connections reference - Bump microsoft-foundry version to 1.0.3 - Fix upstream section heading changes in unit tests - Update trigger snapshots for upstream keyword changes Co-authored-by: Copilot 
<223556219+Copilot@users.noreply.github.com> * refactor(dataset-comparison): focus on dataset-version comparison only Remove agent comparison experiment type from dataset-comparison flow. Agent comparison belongs in the observe/eval loop, not the dataset skill. Update all examples to use dataset versions as baseline/treatment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Remove Playwright MCP server until skills require it (microsoft#1200) * Collapse token analysis comment (microsoft#1147) * update region-availability in prepare/validate/deploy skill (microsoft#1083) * update region-availability in prepare/deploy skill * update * update * fix * update date * Update plugin/skills/azure-deploy/references/region-availability.md * fix ci failure * bump version * build(deps): bump @github/copilot and @github/copilot-sdk in /tests (microsoft#1201) Bumps [@github/copilot](https://github.com/github/copilot-cli) to 1.0.2 and updates ancestor dependency [@github/copilot-sdk](https://github.com/github/copilot-sdk). These dependencies need to be updated together. Updates `@github/copilot` from 0.0.414 to 1.0.2 - [Release notes](https://github.com/github/copilot-cli/releases) - [Changelog](https://github.com/github/copilot-cli/blob/main/changelog.md) - [Commits](github/copilot-cli@v0.0.414...v1.0.2) Updates `@github/copilot-sdk` from 0.1.26 to 0.1.32 - [Release notes](https://github.com/github/copilot-sdk/releases) - [Changelog](https://github.com/github/copilot-sdk/blob/main/CHANGELOG.md) - [Commits](https://github.com/github/copilot-sdk/commits/v0.1.32) --- updated-dependencies: - dependency-name: "@github/copilot" dependency-version: 1.0.2 dependency-type: indirect - dependency-name: "@github/copilot-sdk" dependency-version: 0.1.32 dependency-type: direct:development ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * specify application path in prompt (microsoft#1204) * Add AVM (Azure Verified Modules) integration tests (microsoft#1171) * Add AVM (Azure Verified Modules) integration tests Add 3 integration tests validating the AVM module selection hierarchy for Bicep infrastructure generation: - avm-module-priority: Verifies AVM modules prioritized over non-AVM - avm-fallback-behavior: Verifies fallback stays within AVM ecosystem - avm-azd-pattern-preference: Verifies AZD pattern modules preferred Tests validate that the azure-deploy skill enforces the mandatory AVM selection order: Pattern modules > Resource modules > Utility modules, and never falls back to non-AVM alternatives. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add output assertions to AVM integration tests Address Copilot review feedback: add keyword-based output assertions using getAllAssistantMessages/getAllToolText to verify agent responses contain AVM hierarchy terms, not just skill invocation. Includes non-AVM fallback negative check. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Strengthen AVM test output assertions per Copilot review - Split keyword checks into critical-term + context assertions - Add resource-before-utility ordering assertion for fallback test - Expand non-AVM negative check to use regex patterns - Require core keywords (avm+pattern, azd+pattern) explicitly Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address Copilot round 3 — ordering assertions and context-aware non-AVM check - Add hierarchy ordering assertion to test 1 (pattern before resource/utility) - Make non-AVM detection context-aware: skip matches preceded by negation words (e.g., 'never fall back to non-AVM' is correct behavior, not a false positive) - Add pattern-before-resource ordering assertion to test 3 (AZD pattern preference) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * refactor: move AVM integration tests to avm/ subdirectory Move tests/azure-deploy/avm-integration.test.ts to tests/azure-deploy/avm/integration.test.ts so the file matches the **/integration.test.ts glob used by the custom ESLint rule (integration-test-name) and follows the subdirectory convention established by tests/microsoft-foundry/ (e.g. foundry-agent/). Import paths updated from ../utils/ to ../../utils/ to reflect the new depth. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address round 4 Copilot review feedback - Add 'fall back'/'fall-back' keyword variants for resilience - Extend non-AVM negation check to also scan following context - Use regex for AZD ordering assertion to match plural/prefixed variants Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Update github workflows to use best practices (microsoft#1149) * Early terminate azure-deploy, azure-validate tests (microsoft#1205) Add comment for early termination to help AI grader * Replace script inline parameters with env var (microsoft#1209) * Early terminate azure-deploy tests on deploy link (microsoft#1208) * Early terminate azure-deploy tests on deploy link * Fix lint issue * Reduce char count of existing skills (microsoft#1210) * Reduce char count of existing skills * Update ci tests and snapshots * Enhance benchmark ci run script (microsoft#1176) * Add msbench_benchmarks repo clone to get model definition * Remove unused vars * Use mcp-pr repo before MI has access to msbench-benchmarks repo * Address copilot feedback * Change back to msbench-benchmarks repo * Get ADO token for repo clone * Fix line continuation character * Add run for all interested models * Extract run IDs * Fix yaml format issue * Schedule it to run nightly * Address copilot feedbacks * formalize .foundry and multi-environment support * fix * Feature/azure quotas (microsoft#1137) * update for using azure-quotas in skill * test update * unit test update * path update * add skill in skills.json * skills.json update * reduce the text * version update * skill version * skill description update * reduce text size * 1.0.4 for next prepare version * upload snap shot * update version * test update --------- Co-authored-by: Yinghui Dong <yinghuidong@microsoft.com> * build(deps-dev): bump simple-git from 3.30.0 to 3.32.3 in /tests 
(microsoft#1213) Bumps [simple-git](https://github.com/steveukx/git-js/tree/HEAD/simple-git) from 3.30.0 to 3.32.3. - [Release notes](https://github.com/steveukx/git-js/releases) - [Changelog](https://github.com/steveukx/git-js/blob/main/simple-git/CHANGELOG.md) - [Commits](https://github.com/steveukx/git-js/commits/simple-git@3.32.3/simple-git) --- updated-dependencies: - dependency-name: simple-git dependency-version: 3.32.3 dependency-type: direct:development ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Improve azure-compliance invocation rate (microsoft#1214) * Improve azure-compliance invocation rate * Race condition free report writing * Fix debug logging for report location * Bump skill version * Fix suffix base value * fix * llm judge model and eval group improvement --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Chris Harris <charris@microsoft.com> Co-authored-by: JasonYeMSFT <chuye@microsoft.com> Co-authored-by: xfz11 <81600993+xfz11@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Juan Ospina <70209456+jeo02@users.noreply.github.com> Co-authored-by: Jon Gallant <2163001+jongio@users.noreply.github.com> Co-authored-by: Wes Haggard <weshaggard@users.noreply.github.com> Co-authored-by: Fan Yang <52458914+fanyang-mono@users.noreply.github.com> Co-authored-by: rakal-dyh <33503911+rakal-dyh@users.noreply.github.com> Co-authored-by: Yinghui Dong <yinghuidong@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> # Conflicts: # tests/microsoft-foundry/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/foundry-agent/create/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/foundry-agent/deploy/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/foundry-agent/invoke/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/foundry-agent/observe/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/foundry-agent/trace/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/foundry-agent/troubleshoot/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/models/deploy/capacity/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/models/deploy/customize-deployment/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/models/deploy/deploy-model-optimal-region/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/models/deploy/deploy-model/__snapshots__/triggers.test.ts.snap # tests/microsoft-foundry/resource/create/__snapshots__/triggers.test.ts.snap
# Conflicts: # plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md
* dataset fix * fix: restore foundry dataset guidance Restore the explicit seed dataset registration guidance in the deploy skill and align dataset docs with the current evaluation_dataset_create MCP surface. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Updates the microsoft-foundry skill documentation and unit tests to support (1) selecting environment-specific metadata sidecars (e.g., agent-metadata.prod.yaml) and (2) standardizing legacy testSuites[]/testCases[] into the canonical evaluationSuites[] schema, with stronger scoping to a single selected agent root.
Changes:
- Documented metadata file sidecars (`agent-metadata.<env>.yaml`), selection precedence, and "selected metadata file" wording across workflows.
- Standardized documentation around `evaluationSuites[]` + `tags` (including migration rules from legacy fields on write).
- Updated unit tests to assert the new terminology, scoping rules, and schema references.
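The legacy-to-canonical migration described above could be sketched roughly as follows. Only the field names `testSuites[]`, `testCases[]`, and `evaluationSuites[]` come from this PR; the object shapes, the `source: "legacy"` tag, and the wrapper suite name are illustrative assumptions, not the skill's actual contract.

```typescript
// Hypothetical migration-on-write: fold legacy fields into evaluationSuites[].
// Shapes and tag values are illustrative, not the real metadata contract.
interface Suite {
  name: string;
  tags?: Record<string, string>;
}
interface Metadata {
  evaluationSuites?: Suite[];
  testSuites?: Suite[]; // legacy
  testCases?: { name: string }[]; // legacy
}

function migrateOnWrite(meta: Metadata): Metadata {
  const suites: Suite[] = [...(meta.evaluationSuites ?? [])];
  for (const s of meta.testSuites ?? []) {
    // Carry legacy suites over, marking their origin.
    suites.push({ ...s, tags: { ...(s.tags ?? {}), source: "legacy" } });
  }
  if ((meta.testCases ?? []).length > 0) {
    // Wrap loose legacy test cases in a single migrated suite.
    suites.push({ name: "legacy-test-cases", tags: { source: "legacy" } });
  }
  // Drop the legacy fields on write so only the canonical schema persists.
  const { testSuites, testCases, ...rest } = meta;
  return { ...rest, evaluationSuites: suites };
}
```

The point of doing this on write (rather than on read) is that any touched metadata file converges to the canonical `evaluationSuites[]` schema without a separate migration pass.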
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/microsoft-foundry/unit.test.ts | Adds assertions for sidecar metadata + legacy-to-evaluationSuites[] migration wording. |
| tests/microsoft-foundry/foundry-agent/observe/unit.test.ts | Updates observe docs expectations for selected metadata file, root scoping, and tier=smoke terminology. |
| tests/microsoft-foundry/foundry-agent/eval-datasets/unit.test.ts | Updates eval-datasets docs expectations for selected metadata file/root scoping and evaluationSuites[] schema. |
| tests/microsoft-foundry/foundry-agent/deploy/unit.test.ts | Updates deploy docs expectations for selected metadata file and evaluationSuites[] schema. |
| scripts/package-lock.json | Lockfile updates (adds peer: true flags for some packages). |
| plugin/skills/microsoft-foundry/references/agent-metadata-contract.md | Adds sidecar file model + selection rules + evaluationSuites[] schema and migration guidance. |
| plugin/skills/microsoft-foundry/foundry-agent/trace/trace.md | Updates trace workflow to reference selected metadata file (agent-metadata*.yaml). |
| plugin/skills/microsoft-foundry/foundry-agent/trace/references/search-traces.md | Updates prerequisites and persistence reminder to refer to selected metadata file. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/evaluate-step.md | Shifts “test case” language to “test suite” + tier=smoke guidance. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md | Updates auto-setup instructions for selected root/file scoping and migration-on-write guidance. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/compare-iterate.md | Updates wording to “test suite” and root-scoped dataset location. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/cicd-monitoring.md | Documents CI/CD workflows selecting a metadata file (e.g., via FOUNDRY_METADATA_FILE). |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/analyze-results.md | Updates failure prioritization terminology to suite tags (but has an inconsistency noted in comments). |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | Enforces selected agent root/file scoping and standardizes on evaluationSuites[] + legacy migration. |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md | Updates prerequisites + explicitly scopes updates to the selected agent root only. |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md | Updates metadata references to selected file and evaluationSuites[], including migration-on-write. |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/eval-trending.md | Updates prerequisites to selected metadata file language. |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-versioning.md | Updates agentName source to selected metadata file. |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-organization.md | Replaces priority with flexible tags in dataset row examples and filtering logic. |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | Updates overall eval-datasets guidance to selected root/file scoping and evaluationSuites[]. |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | Updates deploy workflow for selected root scoping and selected metadata file + evaluationSuites[]. |
| plugin/skills/microsoft-foundry/SKILL.md | Documents sidecar metadata support, selection precedence, scoping rules, and legacy migration behavior. |
Files not reviewed (1)
- scripts/package-lock.json: Language not supported
> 4. Keep the selected agent root and environment visible in every deploy, eval, dataset, and trace summary.
> 5. Treat `datasets/` and `evaluators/` as cache folders. Reuse local files when present, but offer refresh when the user asks or when remote state is newer.
> 6. Never overwrite cache files or metadata silently.
> 3. Inside the selected agent root, select the metadata file in this order: explicit file/path from the user or workflow, then `.foundry/agent-metadata.<env>.yaml` when an explicit environment is already known and that file exists, then `.foundry/agent-metadata.yaml`.
The metadata-file selection rules don’t describe what to do when .foundry/agent-metadata.yaml is absent and multiple sidecar files exist (e.g., agent-metadata.prod.yaml, agent-metadata.staging.yaml) but no explicit file/path was provided. Adding an explicit “prompt the user to choose the metadata file” fallback here would make this contract consistent with the skill’s selection rules and avoid ambiguous behavior.
Suggested change:

Before: 3. Inside the selected agent root, select the metadata file in this order: explicit file/path from the user or workflow, then `.foundry/agent-metadata.<env>.yaml` when an explicit environment is already known and that file exists, then `.foundry/agent-metadata.yaml`.

After: 3. Inside the selected agent root, select the metadata file in this order: explicit file/path from the user or workflow, then `.foundry/agent-metadata.<env>.yaml` when an explicit environment is already known and that file exists, then `.foundry/agent-metadata.yaml`. If `.foundry/agent-metadata.yaml` is absent, use the only matching sidecar file when exactly one `.foundry/agent-metadata.<env>.yaml` file exists; if multiple sidecar files exist and no explicit file/path was provided, require the user to choose the metadata file.
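The precedence rules above, including the suggested multi-sidecar fallback, can be sketched as a small function. The signature and the `"ASK_USER"` sentinel are hypothetical conventions for illustration only.

```typescript
// Sketch of the metadata-file selection precedence, including the suggested
// fallback when agent-metadata.yaml is absent. Function shape is illustrative.
function selectMetadataFile(
  explicitPath: string | undefined, // from the user or workflow
  knownEnv: string | undefined, // e.g. "prod" when already resolved
  foundryFiles: string[], // file names found inside .foundry/
): string {
  // 1. An explicit file/path always wins.
  if (explicitPath) return explicitPath;
  // 2. Environment-specific sidecar, when the env is known and the file exists.
  if (knownEnv && foundryFiles.includes(`agent-metadata.${knownEnv}.yaml`)) {
    return `agent-metadata.${knownEnv}.yaml`;
  }
  // 3. The default metadata file.
  if (foundryFiles.includes("agent-metadata.yaml")) return "agent-metadata.yaml";
  // 4. Suggested fallback: one unambiguous sidecar is usable; otherwise ask.
  const sidecars = foundryFiles.filter((f) => /^agent-metadata\.[^.]+\.yaml$/.test(f));
  if (sidecars.length === 1) return sidecars[0];
  return "ASK_USER"; // multiple sidecars, no default: prompt the user to choose
}
```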
| Focus | Cluster | Suggested Action |
|-------|---------|------------------|
| Runtime blockers | Runtime errors or failing suites tagged `tier=smoke` | Check container logs or fix blockers first |
| Key regressions | Incorrect answers on suites tagged `purpose=regression` or `tier=smoke` | Optimize prompt or tool instructions |
| Broader quality gaps | Incomplete answers or coverage-oriented suites | Optimize prompt or expand context |
| Tooling issues | Tool call failures | Fix tool definitions or instructions |
| Safety issues | Safety violations | Add guardrails to instructions |

Before: **Rule:** Prioritize runtime errors first, then sort by test-case priority (`P0` before `P1` before `P2`) and count × severity.

After: **Rule:** Prioritize runtime errors first, then suites tagged `tier=smoke`, then suites tagged `purpose=regression`, then broader coverage suites by count × severity.
This section prioritizes suites tagged purpose=regression, but elsewhere in the docs (and in the tag guidance) regressions are represented as tier=regression. To keep the schema and filtering terminology consistent, consider changing these references from purpose=regression to tier=regression (or, if purpose=regression is intended, update the tag guidance/examples to match).
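The updated rule reads naturally as a two-level sort: a coarse bucket per cluster, tie-broken by count × severity. The cluster shape and numeric severity scale below are assumptions for illustration, not anything defined in the docs under review.

```typescript
// Illustrative ranking for the updated rule: runtime errors first, then
// tier=smoke suites, then purpose=regression suites, then broader coverage,
// tie-broken by count × severity (higher impact first). Shapes are assumed.
interface FailureCluster {
  label: string;
  runtimeError: boolean;
  tags: Record<string, string>;
  count: number;
  severity: number; // higher = worse (assumed scale)
}

function bucket(c: FailureCluster): number {
  if (c.runtimeError) return 0;
  if (c.tags.tier === "smoke") return 1;
  if (c.tags.purpose === "regression") return 2;
  return 3; // broader coverage suites
}

function prioritize(clusters: FailureCluster[]): FailureCluster[] {
  return [...clusters].sort(
    (a, b) => bucket(a) - bucket(b) || b.count * b.severity - a.count * a.severity,
  );
}
```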
| Tag Key | Example Values | Typical Use |
|---------|----------------|-------------|
| `tier` | `smoke`, `regression`, `coverage` | Suggested run order / breadth |
| `purpose` | `baseline`, `safety`, `tools`, `quality` | Why the suite exists |
In the tagging guidance, the example suite uses tags.purpose: regression (see the trace-regression-suite example), but the suggested purpose values table does not include regression. This is inconsistent and can confuse users about which key/value to use for regression-focused suites. Consider either (a) adding regression to the suggested purpose values, or (b) updating the example to keep regressions under tier: regression and use a purpose value from the table (e.g., quality).
Suggested change:

Before: | `purpose` | `baseline`, `safety`, `tools`, `quality` | Why the suite exists |

After: | `purpose` | `baseline`, `safety`, `tools`, `quality`, `regression` | Why the suite exists |
jongio left a comment
Two themes to address across this PR:

- Several places renamed "test case" to "test suite" instead of "evaluation suite", leaving a third inconsistent term alongside the legacy names and the new schema field. deploy.md section 6 heading, observe.md workflow steps 4 and 7, analyze-results.md step 5, and agent-metadata-contract.md's guidance paragraph all say "test suite" where "evaluation suite" would match the `evaluationSuites[]` field name. See inline comments for the most visible instances.
- The `purpose` values table in agent-metadata-contract.md lists `baseline`, `safety`, `tools`, `quality`, but the `trace-regression-suite` example uses `purpose: regression`, and analyze-results.md filters on `purpose=regression`. The table should include `regression` as a valid purpose value so the docs are self-consistent.
> Use [Generate Seed Evaluation Dataset](../eval-datasets/references/generate-seed-dataset.md) as the single source of truth for seed dataset registration. It covers `project_connection_list` with `AzureStorageAccount`, key-based versus AAD upload, `evaluation_dataset_create` with `connectionName`, and saving the returned `datasetUri`.

Before: ### 6. Persist Artifacts and Test Cases

After: ### 6. Persist Artifacts and Test Suites
This heading says "Test Suites" but the schema field is evaluationSuites[]. For consistency with the rest of the rename, this should be "Evaluation Suites". Same applies to the prompt text further down ("test-suite metadata" at line 312 should be "evaluation-suite metadata").
> 2. Evaluate (batch eval run)
> 3. Download and cluster failures

Before: 4. Pick a category or test case to optimize

After: 4. Pick a category or test suite to optimize
"test suite" here and in step 7 (line 52) should be "evaluation suite" to match the evaluationSuites[] schema name. The behavioral rule at line 63 ("Run test suites tagged...") has the same inconsistency.
| Tag Key | Example Values | Typical Use |
|---------|----------------|-------------|
| `tier` | `smoke`, `regression`, `coverage` | Suggested run order / breadth |
| `purpose` | `baseline`, `safety`, `tools`, `quality` | Why the suite exists |
The purpose example values don't include regression, but the trace-regression-suite example above (line 75) uses purpose: regression and analyze-results.md prioritizes suites on purpose=regression. Add regression to this table so the suggested values match actual usage.
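For illustration, tag-based suite selection over `evaluationSuites[]` might look like the sketch below. The suite entries and the AND-match helper are hypothetical; `regression` appears as a `purpose` value per the review suggestion above.

```typescript
// Hypothetical evaluationSuites[] entries and tag filtering; suite names and
// tag values are illustrative, not taken from the actual metadata contract.
interface EvalSuite {
  name: string;
  tags: Record<string, string>;
}

const evaluationSuites: EvalSuite[] = [
  { name: "smoke-suite", tags: { tier: "smoke", purpose: "baseline" } },
  { name: "trace-regression-suite", tags: { tier: "regression", purpose: "regression" } },
  { name: "tooling-suite", tags: { tier: "coverage", purpose: "tools" } },
];

// Select suites matching every requested tag (AND semantics).
function byTags(suites: EvalSuite[], wanted: Record<string, string>): EvalSuite[] {
  return suites.filter((s) =>
    Object.entries(wanted).every(([k, v]) => s.tags[k] === v),
  );
}
```

With free-form tags like these, consistency between the documented example values and the examples that use them (the point of this review comment) is what keeps filters such as `purpose=regression` meaningful.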
Description

This pull request updates the documentation for Microsoft Foundry agent workflows to support multiple environment-specific metadata files (such as `agent-metadata.prod.yaml`) in addition to the default `agent-metadata.yaml`. The changes clarify how to select, resolve, and persist agent metadata, enforce scoping to the selected agent root, and standardize the transition from legacy test case formats to the new `evaluationSuites[]` structure. The documentation now consistently refers to the "selected metadata file" and provides detailed rules for environment and metadata file selection, artifact persistence, and cache management.

Metadata file selection and environment resolution:
- Documented environment-specific sidecar files (e.g., `agent-metadata.prod.yaml`) alongside `agent-metadata.yaml`, with clear rules for selecting which file to use based on workflow context and user input.
- Standardized migration from the legacy `testSuites[]` and `testCases[]` fields to `evaluationSuites[]`.

Agent root and cache scoping:
- Enforced scoping of reads, writes, and cache folders to the single selected agent root.

Artifact and metadata persistence:
- Updated artifact and metadata persistence guidance to use `evaluationSuites[]`.

Terminology and documentation consistency:
- Standardized on "evaluation suites" (`evaluationSuites[]`) instead of "test cases" or "test suites," and ensured dataset naming conventions reference the selected metadata file and agent root.

User interaction and workflow prompts:
- Clarified when workflows should prompt the user, such as choosing among multiple sidecar metadata files.

These changes ensure that all Foundry agent workflows are robust, environment-aware, and compatible with both legacy and new metadata formats, improving reliability and user experience.
Checklist

- [ ] Unit tests pass (`cd tests && npm test`)
- [ ] Integration tests pass (`npm run test:skills:integration -- <skill>`)
- [ ] `USE FOR`/`DO NOT USE FOR`/`PREFER OVER` clauses: confirmed no routing regressions for competing skills

Related Issues