Enable ToolCallAccuracy / ToolInputAccuracy on restricted-tool conver… by mmkawale · Pull Request #47462 · Azure/azure-sdk-for-python

mmkawale · 2026-06-11T17:43:10Z

…sations and add [STATUS] pass-through for ToolCallSuccess

Three evaluators in azure-ai-evaluation previously rejected any conversation containing a built-in restricted tool (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding). Two of those evaluators -- ToolCallAccuracyEvaluator and _ToolInputAccuracyEvaluator -- only judge the agent's tool selection and input arguments and do not need the (redacted) tool output body, so the rejection was overly conservative. This change enables both on restricted-tool conversations. _ToolCallSuccessEvaluator continues to reject them because its rubric inspects the tool output body, but it gains a new mechanism -- [STATUS] pass-through -- so the LLM judge can correctly recognize runtime-reported failures on conversations that do reach it.

Changes

ToolCallAccuracy / ToolInputAccuracy:

Set check_for_unsupported_tools=False on the input validator in _tool_call_accuracy.py and _tool_input_accuracy.py. The underlying ToolDefinitionsValidator / ToolCallsValidator classes are unchanged; GroundednessEvaluator and ToolOutputUtilizationEvaluator still reject restricted tools because they require the tool output body.
Export _ToolInputAccuracyEvaluator from the azure.ai.evaluation top-level namespace, matching its three sibling tool evaluators (ToolCallAccuracyEvaluator, _ToolCallSuccessEvaluator, _ToolOutputUtilizationEvaluator). Consumers (notably the Foundry evaluations service catalog) can now import it directly instead of reaching into the private _evaluators._tool_input_accuracy submodule.
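
The effect of the flag can be illustrated with a small, self-contained sketch. The `SketchToolCallsValidator` class, its `validate` method, and the tool-call dictionary shape are hypothetical stand-ins and are not the SDK's `ToolCallsValidator`; only the `check_for_unsupported_tools` flag name and the five restricted tool names are taken from the description above.

```python
from typing import Any, Dict, List

# Restricted built-in tools named in the PR description.
RESTRICTED_TOOLS = frozenset({
    "bing_grounding",
    "bing_custom_search",
    "azure_ai_search",
    "azure_fabric",
    "sharepoint_grounding",
})


class SketchToolCallsValidator:
    """Hypothetical stand-in for the SDK validator (illustration only)."""

    def __init__(self, check_for_unsupported_tools: bool = True) -> None:
        self.check_for_unsupported_tools = check_for_unsupported_tools

    def validate(self, tool_calls: List[Dict[str, Any]]) -> bool:
        """Return True when accepted; raise ValueError on a restricted tool."""
        if self.check_for_unsupported_tools:
            for call in tool_calls:
                if call.get("name") in RESTRICTED_TOOLS:
                    raise ValueError(f"Unsupported tool: {call['name']}")
        return True


calls = [{"name": "sharepoint_grounding", "arguments": {"query": "Q3 plan"}}]

# Flag off (TCA/TIA after this change): the conversation is accepted.
accepted = SketchToolCallsValidator(check_for_unsupported_tools=False).validate(calls)

# Flag on (evaluators that need the tool output body): still rejected.
try:
    SketchToolCallsValidator(check_for_unsupported_tools=True).validate(calls)
    rejected = False
except ValueError:
    rejected = True
```

Keeping the check inside the validator and toggling it per evaluator means the shared validator classes stay unchanged, which is the approach the description states.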

ToolCallSuccess -- [STATUS] pass-through:

Added _format_status_suffix helper and wired it into _get_tool_calls_results so every [TOOL_CALL] / [TOOL_RESULT] line carries a [STATUS] <value> suffix when the source content block has a status field. Back-compat preserved: empty/None/non-string status emits the empty string, so output is byte-identical to the prior format when status is absent.
Prompty: added an ERROR-CASES bullet that names [STATUS] failed and [STATUS] incomplete as authoritative failure signals that override bland payload appearance, with two illustrative examples (bland-payload+failed-status and completed-status+error-payload). The bullet matches the Responses-API tool-call status enum (in_progress | completed | incomplete | failed) -- only 'failed' and 'incomplete' are listed as primary values because no current emitter (Responses API, Threads/v1 Agents, ACA trace converter, tool-server gRPC) produces error/cancelled/canceled on a tool_call block. The _format_status_suffix helper remains permissive (any non-empty string) for forward-compat; only the rubric wording is narrowed.
Prompty: added an explicit clause that [STATUS] is optional and that [STATUS] completed does not by itself imply success -- payload-based rules still apply.
Prompty: fixed invalid trailing commas in every few-shot EXAMPLE OUTPUT. Each example had a trailing comma after the only failed_tools field of properties, producing invalid JSON. Under gpt-4o + response_format=json_object this caused the model to disambiguate the trailing comma by nesting score/status inside properties (a syntactically-valid alternative), which broke the SDK's top-level score extractor and silently flipped passing evaluations to fail. Validated end-to-end on a SharePoint-grounded transcript: with the commas stripped, gpt-4o reliably emits the canonical shape with score/status as siblings of properties, and pass/fail rows are classified correctly.
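
The trailing-comma failure mode can be reproduced without a model. The sketch below assumes illustrative values and a hypothetical `extract_top_level_score` helper; only the field names (`score`, `status`, `properties`, `failed_tools`) come from the description, and the real prompty examples and SDK extractor may differ.

```python
import json
from typing import Any, Optional

# Shape of a few-shot example before the fix: a trailing comma follows the
# only field inside "properties". Values here are illustrative assumptions.
broken_example = '{"score": 1, "status": "pass", "properties": {"failed_tools": [],}}'

try:
    json.loads(broken_example)
    broken_is_valid = True
except json.JSONDecodeError:
    # Standard JSON does not allow trailing commas.
    broken_is_valid = False


def extract_top_level_score(raw: str) -> Optional[Any]:
    """Hypothetical extractor that reads `score` only from the top level."""
    return json.loads(raw).get("score")


# Canonical shape: score/status are siblings of properties.
canonical = '{"score": 1, "status": "pass", "properties": {"failed_tools": []}}'
# Valid JSON, but score/status are nested inside properties.
nested = '{"properties": {"failed_tools": [], "score": 1, "status": "pass"}}'

canonical_score = extract_top_level_score(canonical)
nested_score = extract_top_level_score(nested)
```

Under these assumptions the nested shape parses cleanly but yields no top-level score, which is consistent with the silent pass-to-fail flip described above.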

Tests:

New test_unsupported_tools_validation.py (26 tests): 15 parametrized cases (3 evaluators x 5 restricted tools) asserting validate_eval_input returns True for response= payloads, 1 mixed-tools case, 10 regression cases asserting the underlying validators still reject restricted tools when check_for_unsupported_tools=True.
Replaced test_tool_call_success_evaluator.py with status-passthrough coverage (12 tests on _format_status_suffix and _get_tool_calls_results topologies).
One test was flipped from test_tool_call_success_accepts_restricted_tool to test_tool_call_success_still_rejects_restricted_tool in test_unsupported_tools_validation.py, with the module docstring scope narrowed to TCA/TIA only.

Versioning:

Bumped _version.py 1.17.0 -> 1.17.1.
Added 1.17.1 (Unreleased) section to CHANGELOG.md under Features Added covering TCA/TIA enablement on restricted-tool conversations and TCS [STATUS] pass-through.

All 38 impacted unit tests pass.

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.

Copilot

Pull request overview

This PR updates azure-ai-evaluation tool evaluators to (1) allow ToolCallAccuracy and ToolInputAccuracy to run on conversations that include restricted built-in tools (since they don’t require tool output bodies), and (2) improve ToolCallSuccess grading by passing runtime tool-call status through into the rubric via [STATUS] ... annotations. It also exposes _ToolInputAccuracyEvaluator from the top-level package namespace, adds/updates unit tests, and bumps the package version.

Changes:

Lifted restricted-tool validation for ToolCallAccuracyEvaluator and _ToolInputAccuracyEvaluator by disabling unsupported-tool checks in their validators.
Added [STATUS] <value> suffix pass-through for ToolCallSuccess’s formatted [TOOL_CALL] / [TOOL_RESULT] lines and updated the prompty rubric/examples accordingly.
Exported _ToolInputAccuracyEvaluator from azure.ai.evaluation, added targeted unit tests, and bumped version/changelog.
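
As a rough illustration of the `[STATUS] <value>` suffix summarized above, the following sketch mirrors the stated back-compat rule (empty, None, or non-string status emits the empty string). The function names, the leading space in the suffix, and the `[TOOL_CALL]` line layout are assumptions made for the example and are not the SDK's `_format_status_suffix` or `_get_tool_calls_results`; a later commit in this PR removes the helper in favor of a Python short-circuit.

```python
from typing import Any, Dict


def format_status_suffix(status: Any) -> str:
    """Return ' [STATUS] <value>' for a non-empty string status, else ''."""
    if not isinstance(status, str) or not status:
        return ""
    return f" [STATUS] {status}"


def format_tool_call_line(block: Dict[str, Any]) -> str:
    """Format one tool-call block; the line layout is a hypothetical example."""
    line = f"[TOOL_CALL] {block['name']}({block['arguments']})"
    return line + format_status_suffix(block.get("status"))


with_status = format_tool_call_line(
    {"name": "get_weather", "arguments": "city='Oslo'", "status": "failed"}
)
without_status = format_tool_call_line(
    {"name": "get_weather", "arguments": "city='Oslo'"}
)
```

Because the suffix collapses to the empty string when no usable status is present, lines for status-free blocks are unchanged from the earlier format.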

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_unsupported_tools_validation.py | New regression tests covering restricted-tool acceptance for TCA/TIA and continued rejection for TCS, plus validator-level regression. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_success_evaluator.py | New unit tests covering `_format_status_suffix` and `[STATUS]` emission topology in `_get_tool_calls_results`. |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Adds 1.17.1 (Unreleased) entry documenting restricted-tool enablement, status pass-through, and export change. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py | Version bump to 1.17.1. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py | Disables unsupported-tool validation for ToolInputAccuracy evaluator inputs. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/tool_call_success.prompty | Updates rubric to account for `[STATUS]` and fixes JSON example formatting (trailing commas). |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/_tool_call_success.py | Implements `_format_status_suffix` and appends status suffix to formatted tool call/result lines. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Disables unsupported-tool validation for ToolCallAccuracy evaluator inputs. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/__init__.py | Exports `_ToolInputAccuracyEvaluator` and adds it to `__all__`. |

+@pytest.mark.usefixtures("mock_model_config")
+@pytest.mark.unittest
+class TestRestrictedToolValidationLifted:
+    """Validator should no longer reject restricted tools for these three evaluators."""
+


+from azure.ai.evaluation import ToolCallAccuracyEvaluator
+from azure.ai.evaluation._evaluators._tool_call_success import _ToolCallSuccessEvaluator
+from azure.ai.evaluation._evaluators._tool_input_accuracy import _ToolInputAccuracyEvaluator


+content block carries a ``status`` field. The prompty rubric is taught to treat
+these annotations as a strong (authoritative) failure signal when the status is
+in {failed, error, incomplete, cancelled, canceled}, and to fall back to
+payload-only judgment when ``status`` is absent.


…ython

Failed/incomplete tool_call or tool_result blocks now return a deterministic fail result without invoking the LLM judge; the prompty rubric is consulted only on the success path. Drops the [STATUS] suffix from the formatted LLM input (back-compat with pre-pass-through wire format). Adds the _collect_failed_tool_calls helper and the _return_short_circuit_failure_result method; removes _format_status_suffix; rewrites tests.
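
The short-circuit described in this commit message might look roughly like the sketch below. Everything here is an assumption for illustration: the block shape (`type`, `name`, `status`), the result dictionary, and the function names are hypothetical and are not the SDK's `_collect_failed_tool_calls` or `_return_short_circuit_failure_result`.

```python
from typing import Any, Callable, Dict, List

# Statuses treated as runtime-reported failures, per the PR description.
FAILURE_STATUSES = frozenset({"failed", "incomplete"})

Result = Dict[str, Any]


def collect_failed_tool_calls(blocks: List[Dict[str, Any]]) -> List[str]:
    """Return names of tool_call/tool_result blocks with a failure status."""
    failed: List[str] = []
    for block in blocks:
        if block.get("type") not in ("tool_call", "tool_result"):
            continue
        status = block.get("status")
        if isinstance(status, str) and status in FAILURE_STATUSES:
            name = str(block.get("name", "unknown"))
            if name not in failed:
                failed.append(name)
    return failed


def evaluate(blocks: List[Dict[str, Any]], llm_judge: Callable[[List[Dict[str, Any]]], Result]) -> Result:
    """Fail deterministically on a failure status; otherwise ask the judge."""
    failed_tools = collect_failed_tool_calls(blocks)
    if failed_tools:
        # Short-circuit: the LLM judge is never invoked on this path.
        return {"score": 0.0, "status": "fail", "failed_tools": failed_tools}
    return llm_judge(blocks)


judge_calls: List[List[Dict[str, Any]]] = []


def fake_judge(blocks: List[Dict[str, Any]]) -> Result:
    """Stand-in judge that records each invocation and always passes."""
    judge_calls.append(blocks)
    return {"score": 1.0, "status": "pass", "failed_tools": []}


failed_run = evaluate([{"type": "tool_call", "name": "get_weather", "status": "failed"}], fake_judge)
clean_run = evaluate([{"type": "tool_call", "name": "get_weather", "status": "completed"}], fake_judge)
```

In this sketch the judge runs once, for the `completed` block only, so a runtime-reported failure produces the same result on every run regardless of model behavior.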

The api-md-consistency CI check walks the package tree and tries to import every subfolder. The orphan autogen/ folder (which contains autogen/raiclient/) was missing its __init__.py, causing ModuleNotFoundError: No module named 'azure.ai.evaluation.autogen'. Adding an empty __init__.py makes the folder a real package without changing any behavior.

aprilk-ms · 2026-06-17T16:32:34Z

        """
        return super().__call__(*args, **kwargs)

+    def _return_short_circuit_failure_result(self, failed_tools: List[str]) -> Dict[str, Union[str, float]]:


Are these the only two evaluators that will need this logic? Could this be refactored into a common utility in the SDK?

Agreed on the pattern. We already keep evaluator-specific input/output translation logic in the evaluator module itself, so I followed the same approach for runtime status handling here. I kept it local for now because the status semantics are specific to ToolCallSuccess. If we see the same status behavior needed across multiple evaluators, I can extract it into a shared utility in a follow-up.

Copilot AI review requested due to automatic review settings June 11, 2026 17:43

mmkawale requested a review from a team as a code owner June 11, 2026 17:43

Copilot started reviewing on behalf of mmkawale June 11, 2026 17:43

github-actions Bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Jun 11, 2026

Copilot AI reviewed Jun 11, 2026

posaninagendra approved these changes Jun 11, 2026

mmkawale mentioned this pull request Jun 15, 2026

ToolCallSuccess: move runtime-status short-circuit from prompt into Python Azure/azureml-assets#5145

Draft

manaskawale added 4 commits June 15, 2026 13:24

ToolCallSuccess: rename tcid -> call_id to satisfy cspell

bc783a6

ToolCallSuccess: apply black formatting

470db6b

Add api.md and api.metadata.yml for api-md-consistency check

52de8eb

aprilk-ms reviewed Jun 17, 2026

Sync ToolCallSuccess runtime status handling with asset changes

35aee24

Enable ToolCallAccuracy / ToolInputAccuracy on restricted-tool conver… #47462
mmkawale wants to merge 7 commits into main from mk/enable-tool-evals

5 participants
