Enable ToolCallAccuracy/InputAccuracy/CallSuccess on restricted-tool conversations by mmkawale · Pull Request #5117 · Azure/azureml-assets

mmkawale · 2026-06-05T19:00:28Z

Summary

Lifts the unconditional rejection of conversations containing built-in restricted tools (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding, browser_automation, code_interpreter_call, computer_call, openapi_call, web_search) on the three tool-call evaluators (tool_call_accuracy, tool_input_accuracy, tool_call_success).

These three evaluators grade the agent's tool selection, input arguments, and call status respectively. None of them consume the (redacted) tool output body, so the previous blanket rejection was over-broad. tool_output_utilization and groundedness continue to reject restricted-tool conversations because they do read the tool output body.

This is the registry-asset side of the change. The matching SDK change ships in Azure/azure-sdk-for-python#47369 (azure-ai-evaluation 1.17.1). Because the registry hosts forked copies of the evaluator source under assets/evaluators/builtin/<name>/evaluator/, the same one-line flip has to be duplicated here per-evaluator.

Changes

Source (`check_for_unsupported_tools=True` → `False`)

assets/evaluators/builtin/tool_call_accuracy/evaluator/_tool_call_accuracy.py — ToolCallsValidator
assets/evaluators/builtin/tool_input_accuracy/evaluator/_tool_input_accuracy.py — ToolDefinitionsValidator
assets/evaluators/builtin/tool_call_success/evaluator/_tool_call_success.py — ToolDefinitionsValidator

Spec version bumps

Asset	Before	After
`tool_call_accuracy/spec.yaml`	11	12
`tool_input_accuracy/spec.yaml`	12	13
`tool_call_success/spec.yaml`	7	8

Tests

Flipped check_for_unsupported_tools True → False in the three behavior test suites so assertions match the new behavior (validator accepts → flow runs).
Relaxed base_tool_evaluation_test._run_tool_type_test: when a tool's expected_flow_inputs is not yet populated (empty dict), assert only that the flow was invoked once instead of exact-argument matching. The full per-tool expected-flow constants for the newly-enabled tool types will land in a follow-up PR via a mechanical generation pass over the existing _QUERY / _RESPONSE / _TOOL_DEFINITIONS fixtures.

Test results

228 / 228 behavior tests pass across the three suites:

test_tool_call_accuracy_evaluator_behavior
test_tool_call_success_evaluator_behavior
test_tool_input_accuracy_evaluator_behavior

Follow-ups (not in this PR)

Port the _ToolCallSuccess status short-circuit (the other half of SDK #47369) into assets/evaluators/builtin/tool_call_success/evaluator/_tool_call_success.py, with spec.yaml v8 → v9 and a mirrored short-circuit test suite. Without that follow-up, the cloud-managed tool_call_success evaluator will accept restricted-tool conversations (after this PR) but will still send failed runs through the LLM rather than deterministically returning fail.
Backfill per-tool expected_flow_inputs fixtures for the newly-enabled tool types so base_tool_evaluation_test resumes exact-argument matching.

These three evaluators grade the agent's tool selection, input arguments, and call status -- none consume the (redacted) tool output body -- so the previous unconditional rejection of conversations containing built-in restricted tools (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding, plus browser_automation, code_interpreter_call, computer_call, openapi_call, web_search) is now lifted. tool_output_utilization and groundedness still reject restricted tools because they consume the tool output body. Source: - _tool_call_accuracy.py: ToolCallsValidator check_for_unsupported_tools True -> False - _tool_input_accuracy.py: ToolDefinitionsValidator check_for_unsupported_tools True -> False - _tool_call_success.py: ToolDefinitionsValidator check_for_unsupported_tools True -> False Registry: - tool_call_accuracy/spec.yaml: version 11 -> 12 - tool_call_success/spec.yaml: version 7 -> 8 - tool_input_accuracy/spec.yaml: version 12 -> 13 Tests: - Flip test class check_for_unsupported_tools True -> False on the three suites so assertions match the new behavior (validator accepts -> flow runs). - Relax base_tool_evaluation_test._run_tool_type_test: when a tool's expected_flow_inputs is not yet populated (empty dict), assert only that the flow was invoked once instead of exact-argument matching. The full per-tool expected-flow constants for the newly-enabled tool types will land in a follow-up PR via a mechanical generation script over the existing _QUERY/_RESPONSE/_TOOL_DEFINITIONS fixtures. Verified: 228 of 228 behavior tests pass across the three suites (test_tool_call_accuracy_evaluator_behavior, test_tool_call_success_*, test_tool_input_accuracy_*).

github-actions · 2026-06-05T19:02:17Z

Test Results for assets-test

0 tests 0 ✅ 0s ⏱️
0 suites 0 💤
0 files 0 ❌

Results for commit aef7bdd.

♻️ This comment has been updated with latest results.

Mirrors the SDK-side short-circuit landed in azure-sdk-for-python#47369 into the registry's forked evaluator source. When any tool_call or tool_result content block carries a known-failure status (failed/error/incomplete/cancelled/canceled), _do_eval returns a deterministic fail without calling the LLM. Absent status, behavior is unchanged. Source: _tool_call_success.py adds _FAILED_TOOL_STATUSES + _collect_failed_tool_statuses helper and an inline short-circuit block in _do_eval, placed after the intermediate-response check and None/empty validation, before the list-response preprocessing. Registry: tool_call_success/spec.yaml version 8 -> 9. Tests: new test_tool_call_success_short_circuit.py with 14 helper tests (parametrized over each failure status, case-insensitivity, malformed input tolerance) and 4 integration tests (short-circuit hits, dedupe of statuses in properties, no short-circuit when all completed, no short-circuit when status absent). 18/18 new tests pass; existing 69 behavior tests in test_tool_call_success_evaluator_behavior.py still pass.

mmkawale requested review from a team as code owners June 5, 2026 19:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable ToolCallAccuracy/InputAccuracy/CallSuccess on restricted-tool conversations#5117

Enable ToolCallAccuracy/InputAccuracy/CallSuccess on restricted-tool conversations#5117
mmkawale wants to merge 2 commits into
Azure:mainfrom
mmkawale:mk/enable-tool-evals-1

mmkawale commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mmkawale commented Jun 5, 2026

Summary

Changes

Source (check_for_unsupported_tools=True → False)

Spec version bumps

Tests

Test results

Follow-ups (not in this PR)

Uh oh!

github-actions Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results for assets-test

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Source (`check_for_unsupported_tools=True` → `False`)

github-actions Bot commented Jun 5, 2026 •

edited

Loading