Enable ToolCallAccuracy / ToolInputAccuracy on restricted-tool conver…#47462
mmkawale wants to merge 7 commits into
…sations and add [STATUS] pass-through for ToolCallSuccess

Three evaluators in azure-ai-evaluation previously rejected any conversation containing a built-in restricted tool (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding). Two of those evaluators -- ToolCallAccuracyEvaluator and _ToolInputAccuracyEvaluator -- only judge the agent's tool selection and input arguments and do not need the (redacted) tool output body, so the rejection was overly conservative. This change enables both on restricted-tool conversations. _ToolCallSuccessEvaluator continues to reject them because its rubric inspects the tool output body, but it gains a new mechanism -- [STATUS] pass-through -- so the LLM judge can correctly recognize runtime-reported failures on conversations that *do* reach it.

Changes
-------

ToolCallAccuracy / ToolInputAccuracy:
- Set check_for_unsupported_tools=False on the input validator in _tool_call_accuracy.py and _tool_input_accuracy.py. The underlying ToolDefinitionsValidator / ToolCallsValidator classes are unchanged; GroundednessEvaluator and ToolOutputUtilizationEvaluator still reject restricted tools because they require the tool output body.
- Export _ToolInputAccuracyEvaluator from the azure.ai.evaluation top-level namespace, matching its three sibling tool evaluators (ToolCallAccuracyEvaluator, _ToolCallSuccessEvaluator, _ToolOutputUtilizationEvaluator). Consumers (notably the Foundry evaluations service catalog) can now import it directly instead of reaching into the private _evaluators._tool_input_accuracy submodule.

ToolCallSuccess -- [STATUS] pass-through:
- Added _format_status_suffix helper and wired it into _get_tool_calls_results so every [TOOL_CALL] / [TOOL_RESULT] line carries a [STATUS] <value> suffix when the source content block has a status field. Back-compat preserved: empty/None/non-string status emits the empty string, so output is byte-identical to the prior format when status is absent.
- Prompty: added an ERROR-CASES bullet that names [STATUS] failed and [STATUS] incomplete as authoritative failure signals that override bland payload appearance, with two illustrative examples (bland-payload+failed-status and completed-status+error-payload). The bullet matches the Responses-API tool-call status enum (in_progress | completed | incomplete | failed) -- only 'failed' and 'incomplete' are listed as primary values because no current emitter (Responses API, Threads/v1 Agents, ACA trace converter, tool-server gRPC) produces error/cancelled/canceled on a tool_call block. The _format_status_suffix helper remains permissive (any non-empty string) for forward-compat; only the rubric wording is narrowed.
- Prompty: added an explicit clause that [STATUS] is optional and that [STATUS] completed does not by itself imply success -- payload-based rules still apply.
- Prompty: fixed invalid trailing commas in every few-shot EXAMPLE OUTPUT. Each example had a trailing comma after the only failed_tools field of properties, producing invalid JSON. Under gpt-4o + response_format=json_object this caused the model to disambiguate the trailing comma by nesting score/status inside properties (a syntactically-valid alternative), which broke the SDK's top-level score extractor and silently flipped passing evaluations to fail. Validated end-to-end on a SharePoint-grounded transcript: with the commas stripped, gpt-4o reliably emits the canonical shape with score/status as siblings of properties, and pass/fail rows are classified correctly.

Tests:
- New test_unsupported_tools_validation.py (26 tests): 15 parametrized cases (3 evaluators x 5 restricted tools) asserting validate_eval_input returns True for response= payloads, 1 mixed-tools case, 10 regression cases asserting the underlying validators still reject restricted tools when check_for_unsupported_tools=True.
- Replaced test_tool_call_success_evaluator.py with status-passthrough coverage (12 tests on _format_status_suffix and _get_tool_calls_results topologies).
- One test was flipped from test_tool_call_success_accepts_restricted_tool to test_tool_call_success_still_rejects_restricted_tool in test_unsupported_tools_validation.py, with the module docstring scope narrowed to TCA/TIA only.

Versioning:
- Bumped _version.py 1.17.0 -> 1.17.1.
- Added 1.17.1 (Unreleased) section to CHANGELOG.md under Features Added covering TCA/TIA enablement on restricted-tool conversations and TCS [STATUS] pass-through.

All 38 impacted unit tests pass.
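
The trailing-comma fix described above can be reproduced with the standard library alone. The following is a minimal sketch, not code from the PR: the JSON payloads use illustrative values, and `top_level_score` is a hypothetical stand-in for the SDK's top-level score extractor, whose real implementation is not shown in this thread.

```python
import json

# Shape of a few-shot example before the fix: a trailing comma after the only
# field inside "properties". Values here are illustrative, not from the prompty.
broken = '{"score": 1, "status": "pass", "properties": {"failed_tools": [],}}'

try:
    json.loads(broken)
    trailing_comma_ok = True
except json.JSONDecodeError:
    # Strict JSON does not allow a trailing comma inside an object.
    trailing_comma_ok = False

# Canonical shape: score/status are siblings of properties.
canonical = json.loads(
    '{"score": 1, "status": "pass", "properties": {"failed_tools": []}}'
)

# Nested shape the commit message describes: valid JSON, but score/status
# have moved inside properties.
nested = json.loads(
    '{"properties": {"failed_tools": [], "score": 1, "status": "pass"}}'
)


def top_level_score(result: dict):
    """Hypothetical extractor that only looks at top-level keys."""
    return result.get("score")
```

With the nested shape, an extractor of this kind finds no score at the top level, which is consistent with the silent pass-to-fail flip the commit message reports.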
Pull request overview
This PR updates azure-ai-evaluation tool evaluators to (1) allow ToolCallAccuracy and ToolInputAccuracy to run on conversations that include restricted built-in tools (since they don’t require tool output bodies), and (2) improve ToolCallSuccess grading by passing runtime tool-call status through into the rubric via [STATUS] ... annotations. It also exposes _ToolInputAccuracyEvaluator from the top-level package namespace, adds/updates unit tests, and bumps the package version.
Changes:
- Lifted restricted-tool validation for `ToolCallAccuracyEvaluator` and `_ToolInputAccuracyEvaluator` by disabling unsupported-tool checks in their validators.
- Added `[STATUS] <value>` suffix pass-through for ToolCallSuccess’s formatted `[TOOL_CALL]`/`[TOOL_RESULT]` lines and updated the prompty rubric/examples accordingly.
- Exported `_ToolInputAccuracyEvaluator` from `azure.ai.evaluation`, added targeted unit tests, and bumped version/changelog.
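
The status pass-through summarized in this list can be sketched as follows. This is an illustrative reimplementation of the behavior the commit message describes (empty, `None`, or non-string status yields the empty string), not the SDK's `_format_status_suffix`. The function names, the leading space in the suffix, and the `[TOOL_CALL]` line layout are assumptions.

```python
from typing import Any


def format_status_suffix(status: Any) -> str:
    """Return ' [STATUS] <value>' for a non-empty string status, else ''.

    Hypothetical sketch: the leading space is an assumption about layout.
    """
    if not isinstance(status, str) or not status:
        return ""
    return f" [STATUS] {status}"


def format_tool_call_line(name: str, arguments: str, status: Any = None) -> str:
    """Build one formatted line; the layout is illustrative only."""
    return f"[TOOL_CALL] {name}({arguments}){format_status_suffix(status)}"
```

Because the suffix is empty when no usable status is present, a line built without a status is identical to one built before the suffix existed, which is the back-compat property the commit message calls out.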
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_unsupported_tools_validation.py | New regression tests covering restricted-tool acceptance for TCA/TIA and continued rejection for TCS, plus validator-level regression. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_success_evaluator.py | New unit tests covering _format_status_suffix and [STATUS] emission topology in _get_tool_calls_results. |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Adds 1.17.1 (Unreleased) entry documenting restricted-tool enablement, status pass-through, and export change. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py | Version bump to 1.17.1. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py | Disables unsupported-tool validation for ToolInputAccuracy evaluator inputs. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/tool_call_success.prompty | Updates rubric to account for [STATUS] and fixes JSON example formatting (trailing commas). |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_success/_tool_call_success.py | Implements _format_status_suffix and appends status suffix to formatted tool call/result lines. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Disables unsupported-tool validation for ToolCallAccuracy evaluator inputs. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/__init__.py | Exports _ToolInputAccuracyEvaluator and adds it to __all__. |
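
The flag-gated validation that the two "Disables unsupported-tool validation" rows refer to can be pictured with a toy validator. This is a sketch under stated assumptions, not the SDK's `ToolCallsValidator`: the class name, the `validate` method, matching on a `name` key, and raising `ValueError` are all hypothetical. Only the `check_for_unsupported_tools` flag name and the five restricted tool names come from the PR description.

```python
from typing import Any, Dict, List

# Restricted built-in tools listed in the PR description.
RESTRICTED_TOOLS = {
    "bing_grounding",
    "bing_custom_search",
    "azure_ai_search",
    "azure_fabric",
    "sharepoint_grounding",
}


class ToyToolCallsValidator:
    """Hypothetical stand-in for a validator with an opt-out flag."""

    def __init__(self, check_for_unsupported_tools: bool = True) -> None:
        self.check_for_unsupported_tools = check_for_unsupported_tools

    def validate(self, tool_calls: List[Dict[str, Any]]) -> bool:
        # Only reject restricted tools when the check is enabled.
        if self.check_for_unsupported_tools:
            for call in tool_calls:
                if call.get("name") in RESTRICTED_TOOLS:
                    raise ValueError(f"Unsupported tool: {call.get('name')}")
        return True
```

In this picture, an evaluator that only inspects tool selection and arguments constructs the validator with the flag off, while evaluators that need the tool output body keep the default.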
@pytest.mark.usefixtures("mock_model_config")
@pytest.mark.unittest
class TestRestrictedToolValidationLifted:
    """Validator should no longer reject restricted tools for these three evaluators."""

from azure.ai.evaluation import ToolCallAccuracyEvaluator
from azure.ai.evaluation._evaluators._tool_call_success import _ToolCallSuccessEvaluator
from azure.ai.evaluation._evaluators._tool_input_accuracy import _ToolInputAccuracyEvaluator
content block carries a ``status`` field. The prompty rubric is taught to treat
these annotations as a strong (authoritative) failure signal when the status is
in {failed, error, incomplete, cancelled, canceled}, and to fall back to
payload-only judgment when ``status`` is absent.
…ython Failed/incomplete tool_call or tool_result blocks now return a deterministic fail result without invoking the LLM judge; the prompty rubric is consulted only on the success path. Drops [STATUS] suffix from the formatted LLM input (back-compat with pre-pass-through wire format). Adds _collect_failed_tool_calls helper and _return_short_circuit_failure_result method; removes _format_status_suffix; rewrites tests.
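
The short-circuit described in this commit can be sketched roughly as below. This is not the PR's `_collect_failed_tool_calls` or `_return_short_circuit_failure_result`. The message and content-block layout, the `evaluate` wrapper, the `judge` callable, and the result keys are assumptions made for illustration; only the failed/incomplete statuses and the "skip the LLM judge on failure" behavior come from the commit message.

```python
from typing import Any, Callable, Dict, List

# Statuses treated as runtime-reported failures, per the commit message.
FAILURE_STATUSES = {"failed", "incomplete"}


def collect_failed_tool_calls(messages: List[Dict[str, Any]]) -> List[str]:
    """Return identifiers of tool_call/tool_result blocks with a failure status."""
    failed: List[str] = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if not isinstance(block, dict):
                continue
            status = block.get("status")
            if (
                block.get("type") in ("tool_call", "tool_result")
                and isinstance(status, str)
                and status in FAILURE_STATUSES
            ):
                # Identifier choice is an assumption about the block layout.
                failed.append(block.get("name") or block.get("tool_call_id") or "unknown")
    return failed


def evaluate(messages: List[Dict[str, Any]], judge: Callable[[List[Dict[str, Any]]], Dict[str, Any]]) -> Dict[str, Any]:
    """Return a deterministic fail result on failure; otherwise defer to the judge."""
    failed = collect_failed_tool_calls(messages)
    if failed:
        # Hypothetical result shape; the real keys are not shown in this thread.
        return {"score": 0.0, "status": "fail", "failed_tools": ", ".join(failed)}
    return judge(messages)
```

The design point is that a runtime-reported failure never depends on model output, so the judge is called only when every tool block is free of a failure status.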
The api-md-consistency CI check walks the package tree and tries to import every subfolder. The orphan autogen/ folder (which contains autogen/raiclient/) was missing its __init__.py, causing ModuleNotFoundError: No module named 'azure.ai.evaluation.autogen'. Adding an empty __init__.py makes the folder a real package without changing any behavior.
        """
        return super().__call__(*args, **kwargs)

    def _return_short_circuit_failure_result(self, failed_tools: List[str]) -> Dict[str, Union[str, float]]:
Are these the only 2 evaluators that will need this logic? Is there a possibility of refactoring this into a common util in the SDK?
Agreed on the pattern. We already keep evaluator-specific input/output translation logic in the evaluator module itself, so I followed the same approach for runtime status handling here. I kept it local for now because the status semantics are specific to ToolCallSuccess. If we see the same status behavior needed across multiple evaluators, I can extract it into a shared utility in a follow-up.