[Evaluation] Enable ToolCallAccuracy / ToolInputAccuracy on restricted-tool conversations + ToolCallSuccess [STATUS] pass-through #47369

Open
mmkawale wants to merge 7 commits into Azure:main from mmkawale:mk/enable-tool-evals-1

@mmkawale mmkawale commented Jun 5, 2026

Contributor

Summary

Phase 1 of the restricted-tool evaluation enablement, SDK side. Mirrors the asset-side changes in Azure/azureml-assets#5126.

ToolCallAccuracyEvaluator and _ToolInputAccuracyEvaluator are unblocked on conversations that include any of the five built-in restricted tools (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding). Both evaluators grade the agent's tool selection and arguments and never read the tool output body, so the previous unconditional rejection is safe to lift.

_ToolCallSuccessEvaluator receives an LLM-rubric-level [STATUS] pass-through improvement that benefits customer-function conversations only. The TCS validator flip for restricted tools is deferred to Phase 2, because TCS grades the tool result payload and the SDK converter does not yet emit a tool_result body for the Bing-family tools — flipping the validator now would change customer-visible behavior on mixed-tool conversations without giving the evaluator anything new to grade. For the three non-Bing restricted tools (azure_ai_search, azure_fabric, sharepoint_grounding) the converter already emits a real body but RAI sign-off on cloud-eval exposure is still pending; that gate lifts in Phase 2.

_ToolInputAccuracyEvaluator is also exported from the top-level azure.ai.evaluation namespace so consumers no longer need to reach into the private _evaluators._tool_input_accuracy submodule. The other three tool evaluators were already exposed there; this brings the four siblings in line.

Companion PRs:

  • Azure/azureml-assets#5126 — asset-side validator flips + TCS [STATUS] rubric (matching prompty + helpers).
  • Azure/azure-sdk-for-python#47396 — SDK converter branches for bing_custom_search + sharepoint_grounding, query/input fallback on AI Search / SharePoint / Fabric.
  • Vienna #2139056 — ACA-side status preservation + evaluator_classes map registration + pin bump to azure-ai-evaluation 1.17.1.

Design spec: areas/evaluations/restricted-tool-evals-enablement.md in Observability-Specs.
Follow-up redesign doc: areas/evaluations/tool-call-success-v3-redesign.md (covers recovery-blindness and a v3 split-into-two-evaluators direction).

Evaluator × tool support after this PR

| Tool | ToolCallAccuracy | ToolInputAccuracy | ToolCallSuccess | Groundedness | ToolOutputUtilization |
| --- | --- | --- | --- | --- | --- |
| bing_grounding | enabled | enabled | blocked (Phase 3) | blocked | blocked |
| bing_custom_search | enabled | enabled | blocked (Phase 3) | blocked | blocked |
| azure_ai_search | enabled | enabled | blocked (Phase 2) | blocked | blocked |
| azure_fabric (fabric_dataagent) | enabled | enabled | blocked (Phase 2) | blocked | blocked |
| sharepoint_grounding | enabled | enabled | blocked (Phase 2) | blocked | blocked |
| Customer function calls | unchanged | unchanged | unchanged + [STATUS] pass-through | unchanged | unchanged |

blocked = the input validator rejects the conversation with UnsupportedTool before grading. Phase 2 lifts the gate for the three non-Bing restricted tools (the SDK converter already emits a synthesized tool_result body for them; awaiting RAI sign-off). Phase 3 settles the Bing-family tools.
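
As a reading aid, the support matrix above can be encoded as a small lookup. This is only a sketch: the names `support_status` and `unblock_phase` are hypothetical and do not exist in the SDK, and the phase numbers simply restate what the table says for ToolCallSuccess.

```python
from typing import Optional

# Tool groupings as named in the PR description.
BING_FAMILY = {"bing_grounding", "bing_custom_search"}
NON_BING_RESTRICTED = {"azure_ai_search", "azure_fabric", "sharepoint_grounding"}
RESTRICTED_TOOLS = BING_FAMILY | NON_BING_RESTRICTED

# Evaluators that accept restricted tools after this PR.
ENABLED_ON_RESTRICTED = {"ToolCallAccuracy", "ToolInputAccuracy"}


def support_status(tool: str, evaluator: str) -> str:
    """Return 'enabled', 'blocked', or 'unchanged' as listed in the table."""
    if tool not in RESTRICTED_TOOLS:
        return "unchanged"  # customer function calls
    if evaluator in ENABLED_ON_RESTRICTED:
        return "enabled"
    return "blocked"


def unblock_phase(tool: str, evaluator: str) -> Optional[int]:
    """Phase the table names for ToolCallSuccess; None where no phase is listed."""
    if evaluator != "ToolCallSuccess" or tool not in RESTRICTED_TOOLS:
        return None
    return 3 if tool in BING_FAMILY else 2


print(support_status("azure_ai_search", "ToolCallSuccess"),
      unblock_phase("azure_ai_search", "ToolCallSuccess"))
```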

Changes

Validator flips (check_for_unsupported_tools True → False):

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py
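
A minimal sketch of what the flag flip changes, assuming a simplified validator. `ToolCallsValidatorSketch`, `UnsupportedToolError`, and the `tool_calls` input shape are illustrative stand-ins, not the SDK's real classes or payload format; only the `check_for_unsupported_tools` flag name and the `validate_eval_input` method name come from this PR.

```python
RESTRICTED_TOOLS = frozenset({
    "bing_grounding",
    "bing_custom_search",
    "azure_ai_search",
    "azure_fabric",
    "sharepoint_grounding",
})


class UnsupportedToolError(Exception):
    """Hypothetical stand-in for the SDK's UnsupportedTool rejection."""


class ToolCallsValidatorSketch:
    """Simplified validator: rejects restricted tools only when the flag is set."""

    def __init__(self, check_for_unsupported_tools: bool = True):
        self.check_for_unsupported_tools = check_for_unsupported_tools

    def validate_eval_input(self, eval_input: dict) -> bool:
        if self.check_for_unsupported_tools:
            for call in eval_input.get("tool_calls", []):
                name = call.get("name")
                if name in RESTRICTED_TOOLS:
                    raise UnsupportedToolError(f"{name} is not supported by this evaluator")
        return True


sample = {"tool_calls": [{"name": "bing_grounding", "arguments": {"query": "weather"}}]}

# TCA / TIA wiring after this PR: the flag is off, so the conversation is accepted.
lenient = ToolCallsValidatorSketch(check_for_unsupported_tools=False)
print(lenient.validate_eval_input(sample))
```

With the flag left at `True` (the TCS wiring in this PR), the same input raises before any grading happens.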

Top-level export:

  • _ToolInputAccuracyEvaluator added to azure.ai.evaluation.__all__, so users can write: from azure.ai.evaluation import _ToolInputAccuracyEvaluator

ToolCallSuccess (validator unchanged at check_for_unsupported_tools=True, prompty + helper updates only):

  • _get_tool_calls_results emits [STATUS] <value> inline on each formatted [TOOL_CALL] / [TOOL_RESULT] line when the source content block carries a status field, e.g. [TOOL_CALL] send_email(to="x@example.com") [STATUS] failed.
  • _format_status_suffix helper builds the annotation (returns "" when status is absent / empty / non-string, preserving the prior wire format byte-for-byte).
  • tool_call_success.prompty adds a [STATUS] bullet to the ERROR-CASES list and three illustrative examples: (a) failed status with bland payload, (b) completed status with error payload, (c) parallel calls with one failed.
  • The rubric treats [STATUS] failed | incomplete as an authoritative failure signal that overrides a contradictory-looking payload, treats [STATUS] completed as success while still applying payload-based rules (a tool can return an error inside a success envelope), and falls back to today's payload-only judgment when the annotation is absent.
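
The annotation behaviour described in the bullets above can be sketched as follows. This is an assumption-based illustration, not the code in _tool_call_success.py: `format_status_suffix` mirrors the documented contract of `_format_status_suffix`, while `format_tool_line` and the shape of `block` are hypothetical.

```python
from typing import Any


def format_status_suffix(status: Any) -> str:
    """Return ' [STATUS] <value>' for a non-empty string, otherwise ''.

    Absent, empty, and non-string statuses produce no suffix, so the line
    stays identical to the pre-change format. The value is forwarded verbatim.
    """
    if isinstance(status, str) and status:
        return f" [STATUS] {status}"
    return ""


def format_tool_line(tag: str, body: str, block: dict) -> str:
    """Hypothetical helper: build one [TOOL_CALL] / [TOOL_RESULT] line."""
    return f"[{tag}] {body}{format_status_suffix(block.get('status'))}"


line = format_tool_line("TOOL_CALL", 'send_email(to="x@example.com")', {"status": "failed"})
print(line)
```

Because the suffix is empty whenever the status is missing, callers that never populate status see no change in the formatted output.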

Tests:

  • tests/unittests/test_unsupported_tools_validation.py: asserts TCA and TIA accept all five restricted tools; asserts TCS still rejects them (its rubric depends on the tool output body); asserts the underlying ToolCallsValidator / ToolDefinitionsValidator classes still reject when check_for_unsupported_tools=True (behavior change is per-evaluator wiring only).
  • tests/unittests/test_tool_call_success_evaluator.py: 12 tests covering _format_status_suffix edge cases (None, empty, non-string, arbitrary string) and _get_tool_calls_results across present / absent / completed / mixed / parallel-tool-calls-in-one-assistant-message topologies.

Changelog + version:

  • CHANGELOG.md 1.17.1 entry: TCA + TIA restricted-tool enablement, _ToolInputAccuracyEvaluator top-level export, TCS [STATUS] pass-through (with explicit note that the TCS validator flip is deferred).
  • _version.py: 1.17.0 → 1.17.1.

Out of scope (deferred)

  • Phase 2: SDK converter changes for any Bing-family tool_result synthesis; matching asset-side and SDK-side validator flips for _ToolCallSuccessEvaluator on the three non-Bing restricted tools; validator flips + per-tool allowlists for GroundednessEvaluator and _ToolOutputUtilizationEvaluator on those three tools. Requires RAI sign-off on synthesized body content.
  • Phase 3: Body-consuming evaluator support for Bing-family tools (likely status-only rubric mode + [NO_RESULT_AVAILABLE] marker — see redesign doc §1.2).
  • in_progress status from Responses API: The grounding-call API enum includes in_progress alongside completed | incomplete | failed. The current rubric treats [STATUS] in_progress as "fall back to payload rules", which is fine for completed eval rows but worth revisiting if mid-flight calls start landing in the eval input.
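
The precedence the rubric applies to [STATUS] values, including the in_progress fallback noted above, can be pictured with a deterministic stand-in. This is a reading aid only: the real judgment is made by the LLM following the prompty, and both `tool_call_succeeded` and the `payload_indicates_error` flag (standing in for the rubric's payload-based rules) are hypothetical.

```python
from typing import Optional

# Statuses the rubric treats as authoritative failure signals.
FAILURE_STATUSES = {"failed", "incomplete"}


def tool_call_succeeded(status: Optional[str], payload_indicates_error: bool) -> bool:
    """Illustrate rubric precedence; not the evaluator's implementation."""
    if status in FAILURE_STATUSES:
        # Authoritative: overrides a payload that looks fine.
        return False
    # completed, in_progress, unrecognized, or absent: payload rules decide.
    return not payload_indicates_error


for status, bad_payload in [("failed", False), ("completed", True), ("in_progress", False), (None, False)]:
    print(status, bad_payload, tool_call_succeeded(status, bad_payload))
```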

Verification

  • All 38 impacted unit tests pass locally (pytest tests/unittests/test_tool_call_success_evaluator.py tests/unittests/test_unsupported_tools_validation.py).
  • flake8 --max-line-length=119 clean on changed files (3 pre-existing warnings in _tool_call_success.py were not introduced by this PR — verified with git stash comparison).
  • Existing TCS regression tests under tests/unittests/test_tool_call_success_evaluator.py for the deterministic Python short-circuit were removed; that earlier design (which lived briefly on this branch) was superseded by the LLM-rubric [STATUS] pass-through approach to match the asset side. The short-circuit had two structural issues that motivated the pivot: (a) it gave bypassed grades no rubric explanation, and (b) it forced the SDK and asset to maintain duplicate Python failure logic.

These three evaluators grade the agent's tool selection, input arguments,
and call status -- none consume the (redacted) tool output body -- so the
previous unconditional rejection of conversations containing built-in
restricted tools (bing_grounding, bing_custom_search, azure_ai_search,
azure_fabric, sharepoint_grounding) is now lifted.

Implementation:
- Set check_for_unsupported_tools=False on each evaluator's input validator
  in _tool_call_accuracy.py, _tool_input_accuracy.py, _tool_call_success.py.
- The underlying ToolDefinitionsValidator / ToolCallsValidator classes are
  unchanged; GroundednessEvaluator and ToolOutputUtilizationEvaluator still
  reject restricted tools because they require the tool output body.

Tests:
- New test_unsupported_tools_validation.py (26 tests) covers:
  * 15 parametrized cases: each of the 3 evaluators x 5 restricted tools,
    asserting validate_eval_input returns True for response= payloads.
  * 1 mixed-tools case.
  * 10 regression cases asserting the underlying validators still reject
    restricted tools when check_for_unsupported_tools=True.

Versioning:
- Bumped _version.py 1.17.0 -> 1.17.1.
- Added 1.17.1 (Unreleased) section to CHANGELOG.md under Features Added.
@mmkawale mmkawale requested a review from a team as a code owner June 5, 2026 17:08
Copilot AI review requested due to automatic review settings June 5, 2026 17:08
@github-actions github-actions Bot added the Community Contribution, customer-reported, and Evaluation labels Jun 5, 2026
github-actions Bot commented Jun 5, 2026

Contributor

Thank you for your contribution @mmkawale! We will review the pull request and get back to you soon.

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates tool-related evaluators to allow conversations that include restricted built-in tools by disabling unsupported-tool checks in their input validators, and adds regression tests to ensure the relaxed behavior is limited to those evaluators.

Changes:

  • Set check_for_unsupported_tools=False for ToolCallAccuracyEvaluator, _ToolInputAccuracyEvaluator, and _ToolCallSuccessEvaluator validators.
  • Added unit tests covering acceptance of restricted tools for those evaluators and continued rejection when validator flags are enabled.
  • Bumped package version and documented the behavior change in the changelog.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Disables unsupported-tool checking in ToolCallsValidator wiring. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py | Disables unsupported-tool checking in ToolDefinitionsValidator wiring. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/_tool_call_success.py | Disables unsupported-tool checking in ToolDefinitionsValidator wiring. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_unsupported_tools_validation.py | Adds regression tests ensuring restricted tools are accepted only where intended. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py | Bumps version to 1.17.1. |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Documents the new behavior under 1.17.1 (Unreleased). |

Comment on lines +83 to +84
# Should not raise EvaluationException; flag flip made this path legal.
assert evaluator._validator.validate_eval_input(eval_input) is True

Comment on lines +70 to +73
@pytest.mark.usefixtures("mock_model_config")
@pytest.mark.unittest
class TestRestrictedToolValidationLifted:
    """Validator should no longer reject restricted tools for these three evaluators."""

Comment on lines +59 to +67
def _restricted_tool_definition(tool_name: str):
    return {
        "name": tool_name,
        "description": f"Built-in {tool_name} tool.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
        },
    }

Comment on lines +34 to +40
RESTRICTED_TOOL_NAMES = [
    "bing_grounding",
    "bing_custom_search",
    "azure_ai_search",
    "azure_fabric",
    "sharepoint_grounding",
]

When any tool_call or tool_result in the response carries a known-failure status (failed, error, incomplete, cancelled/canceled), short-circuit _do_eval to return a deterministic fail result (score=0, _passed=False, _result='fail') without invoking the LLM. The evaluator's scoring contract is explicitly binary -- 'FALSE: at least one tool call failed' -- and the prompty rubric does not consider the status field, so it would otherwise grade only the (typically empty) result body and frequently mis-score failed conversations as passes.

Reuses the existing pre-flow short-circuit pattern (_is_intermediate_response / _return_not_applicable_result) for consistency. Status is only populated by upstream converters that preserve it; absent status, behavior is unchanged. Bumps to 1.17.1, adds CHANGELOG entry, and adds 19 focused unit tests.
… namespace

Brings _ToolInputAccuracyEvaluator in line with its three sibling tool evaluators (ToolCallAccuracyEvaluator, _ToolCallSuccessEvaluator, _ToolOutputUtilizationEvaluator) which are already exposed on the top-level package. Consumers (notably the Foundry evaluations service catalog) can now import it from azure.ai.evaluation directly instead of reaching into the private _evaluators._tool_input_accuracy submodule.
mmkawale pushed a commit to mmkawale/azure-sdk-for-python that referenced this pull request Jun 8, 2026
…ery/input fallback for AIS, SP, Fabric

break_tool_call_into_messages previously had no elif branch for bing_custom_search or sharepoint_grounding, so calls touching either tool were silently dropped before any evaluator could see them. The three status-only tool evaluators (ToolCallAccuracyEvaluator, _ToolInputAccuracyEvaluator, _ToolCallSuccessEvaluator) therefore returned NOT_APPLICABLE on those conversations even after the validator was loosened in PR Azure#47369.

Changes:

- bing_custom_search: arguments-only branch mirroring bing_grounding (emits a tool_call with the requesturl; no tool_result, since Bing-family results are redacted upstream for compliance).

- sharepoint_grounding: arguments + dumped output, mirroring azure_ai_search. Phase 2 will extend the Groundedness extractor to walk the documents structure already present on the tool_result.

- azure_ai_search, sharepoint_grounding, fabric_dataagent input branches: switched from a direct details[<tool>]["input"] dereference to .get("input") or .get("query"), with an empty-string fallback. Live agent traces emit the search term under 'query' for all three, which made the existing AIS and Fabric branches surface empty arguments to evaluators (a live bug, not just a Phase 1 prerequisite).

- Refreshed the stale March-2025 top-of-function comment to reflect the current set of supported built-ins.
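
The input/query fallback described in the changes above amounts to a chained lookup. A minimal sketch, assuming `details` is a plain dict keyed by tool name; `extract_search_input` is a hypothetical helper, and tolerating a missing tool key is an extra assumption not stated in the commit message.

```python
def extract_search_input(details: dict, tool_key: str) -> str:
    """Prefer 'input', fall back to 'query', then to an empty string."""
    # Assumption: a missing tool entry is treated like an empty one.
    tool_details = details.get(tool_key) or {}
    return tool_details.get("input") or tool_details.get("query") or ""


# Live traces reportedly carry the search term under 'query'.
print(extract_search_input({"azure_ai_search": {"query": "quarterly revenue"}}, "azure_ai_search"))
```

A direct `details[tool_key]["input"]` dereference cannot pick up the search term when it arrives under 'query', which is the gap the fallback closes.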

Tests:

Added 5 new tests in tests/converters/ai_agent_converter/test_ai_agent_converter_internals.py covering bing_custom_search, sharepoint_grounding (input key and output dump), and the query-key fallback for AIS, SP, and Fabric. The new tests construct ToolCall via a small _HybridDict helper instead of going through ToolDecoder, so they do not depend on the agents SDK RunStep* models that have moved between azure.ai.projects.models and azure.ai.agents.models packages.
…ough

Mirrors azureml-assets PR Azure#5126 design pivot.

Source (_tool_call_success.py):

- Reverted check_for_unsupported_tools True->False flip; TCS again rejects restricted-tool conversations (its rubric depends on the tool output body).

- Removed _FAILED_TOOL_STATUSES + _collect_failed_tool_statuses helper and the pre-LLM deterministic-fail short-circuit. Status interpretation is now an LLM-only concern.

- Added _format_status_suffix helper and wired it into _get_tool_calls_results so every [TOOL_CALL] / [TOOL_RESULT] line carries a [STATUS] <value> suffix when the source content block has a status field. Back-compat preserved: empty/None/non-string status emits ''; output is byte-identical to the prior format when status is absent.

Prompty (tool_call_success.prompty):

- Added a [STATUS] failed|error|incomplete|cancelled|canceled bullet to ERROR-CASES marking it an authoritative failure signal that overrides bland payload appearance.

- Added an explicit clause that [STATUS] is optional and that [STATUS] completed does not by itself imply success (payload rules still apply).

- Added 3 illustrative examples: bland-payload+failed-status, completed-status+error-payload, and a parallel-call topology with one failed.

Tests:

- Replaced test_tool_call_success_evaluator.py with status-passthrough coverage (12 tests on _format_status_suffix + _get_tool_calls_results topologies).

- Flipped test_tool_call_success_accepts_restricted_tool to test_tool_call_success_still_rejects_restricted_tool in test_unsupported_tools_validation.py and updated module docstring scope to TCA/TIA only.

Changelog: rewrote 1.17.1 entry to reflect TCA/TIA enablement + TCS [STATUS] pass-through (validator flip deferred to a later phase).

All 38 impacted unit tests pass.
@mmkawale mmkawale changed the title from "[Evaluation] Enable ToolCallAccuracy/Input/Success on restricted-tool conversations" to "[Evaluation] Enable ToolCallAccuracy / ToolInputAccuracy on restricted-tool conversations + ToolCallSuccess [STATUS] pass-through" Jun 9, 2026
The previous wording listed all five failure values (failed, error, incomplete, cancelled, canceled) as if any runtime emitted them, and claimed the annotation is case-insensitive. Per the Responses-API tool-call status enum (in_progress | completed | incomplete | failed), only 'failed' and 'incomplete' are ever emitted by the platform; the other three are reserved for non-Responses-API runtimes. Case-insensitivity was never enforced by _format_status_suffix (status is forwarded verbatim) and the API contract is lowercase regardless.

New wording: foregrounds 'failed' and 'incomplete' as the primary values, parenthesizes the other three as non-Responses-API future-proofing, separates the two failure causes (runtime caught a technical failure vs. call interrupted before completion -> incomplete), and drops the case-insensitivity claim. No behavior change in the helper; rubric language only.
…mplete only

Walked the runtime surface area: Responses API enum is in_progress | completed | incomplete | failed; Threads/v1 Agents API has 'cancelled' on runs but no SDK converter lifts run-status onto individual tool_call blocks; ACA trace converter maps OTel status_code to the Responses-API vocabulary (Ok -> completed, Error -> failed) rather than preserving 'cancelled'/'error' verbatim; tool-server gRPC StatusCodes are server-side only and never reach the eval row. No emitter today produces error | cancelled | canceled on a tool_call block, so listing them as recognized [STATUS] values overstates the spec and adds rubric noise for vocabulary the LLM will never see.

The _format_status_suffix helper stays permissive (still accepts any non-empty string for forward-compat); only the rubric wording is narrowed.

Keeps 'incomplete' as authoritative failure: it explicitly means the tool call did not produce a usable result (host timeout, parent-response cancellation, max_tokens cut-off mid-call), which matches the binary 'did the tool call succeed' contract. 'in_progress' is intentionally not addressed: it shouldn't appear in a completed eval row, and if it does the typically-empty payload will get judged correctly by the existing rules -- documented as a follow-up spec question.
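
The status mapping this commit message attributes to the ACA trace converter (Ok -> completed, Error -> failed) can be pictured as a small table lookup. This is a sketch under assumptions: `map_otel_status` is a hypothetical name, the string keys are illustrative, and returning None for any other status code (so that no [STATUS] annotation would be emitted) is a guess, not documented converter behaviour.

```python
from typing import Optional

# OTel span status -> Responses-API tool-call status, as described above.
OTEL_TO_TOOL_STATUS = {
    "Ok": "completed",
    "Error": "failed",
}


def map_otel_status(status_code: Optional[str]) -> Optional[str]:
    """Map a span status code; unmapped or missing codes yield None (assumption)."""
    return OTEL_TO_TOOL_STATUS.get(status_code)


print(map_otel_status("Error"))
```

Under this mapping only Responses-API vocabulary reaches a tool_call block, which is consistent with the point that 'error' and 'cancelled' are never seen by the rubric.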
Labels

  • Community Contribution: Community members are working on the issue
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Evaluation: Issues related to the client library for Azure AI Evaluation

3 participants