
Enable ToolCallAccuracy/InputAccuracy/CallSuccess on restricted-tool conversations #5117

Open
mmkawale wants to merge 2 commits into Azure:main from mmkawale:mk/enable-tool-evals-1

@mmkawale mmkawale commented Jun 5, 2026

Summary

Lifts the unconditional rejection of conversations containing built-in restricted tools (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding, browser_automation, code_interpreter_call, computer_call, openapi_call, web_search) on the three tool-call evaluators (tool_call_accuracy, tool_input_accuracy, tool_call_success).

These three evaluators grade the agent's tool selection, input arguments, and call status respectively. None of them consume the (redacted) tool output body, so the previous blanket rejection was over-broad. tool_output_utilization and groundedness continue to reject restricted-tool conversations because they do read the tool output body.

This is the registry-asset side of the change. The matching SDK change ships in Azure/azure-sdk-for-python#47369 (azure-ai-evaluation 1.17.1). Because the registry hosts forked copies of the evaluator source under assets/evaluators/builtin/<name>/evaluator/, the same one-line flip has to be duplicated here per-evaluator.

Changes

Source (check_for_unsupported_tools=True → False)

  • assets/evaluators/builtin/tool_call_accuracy/evaluator/_tool_call_accuracy.py: ToolCallsValidator
  • assets/evaluators/builtin/tool_input_accuracy/evaluator/_tool_input_accuracy.py: ToolDefinitionsValidator
  • assets/evaluators/builtin/tool_call_success/evaluator/_tool_call_success.py: ToolDefinitionsValidator
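
For readers unfamiliar with the flag, here is a minimal sketch of what flipping check_for_unsupported_tools is assumed to do. SketchToolValidator, its validate method, and the error message are hypothetical stand-ins, not the real ToolCallsValidator / ToolDefinitionsValidator API; only the flag name and the list of restricted tool types come from this PR.

```python
from typing import Any, Dict, List

# Restricted built-in tool types, as listed in this PR's summary.
RESTRICTED_TOOL_TYPES = frozenset({
    "bing_grounding", "bing_custom_search", "azure_ai_search", "azure_fabric",
    "sharepoint_grounding", "browser_automation", "code_interpreter_call",
    "computer_call", "openapi_call", "web_search",
})


class SketchToolValidator:
    """Hypothetical stand-in for the validators touched by this PR."""

    def __init__(self, check_for_unsupported_tools: bool = True) -> None:
        self.check_for_unsupported_tools = check_for_unsupported_tools

    def validate(self, tool_definitions: List[Dict[str, Any]]) -> None:
        # With the flag off, restricted tools pass through and the flow runs.
        if not self.check_for_unsupported_tools:
            return
        found = sorted(
            {t.get("type") for t in tool_definitions
             if t.get("type") in RESTRICTED_TOOL_TYPES}
        )
        if found:
            # Assumed error shape; the real validator may raise a different type.
            raise ValueError(f"Unsupported tool types: {', '.join(found)}")
```

Under this sketch, a validator built with check_for_unsupported_tools=False accepts a conversation that declares bing_grounding, while the previous True setting would reject it before the evaluator flow is invoked.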

Spec version bumps

| Asset | Before | After |
| --- | --- | --- |
| tool_call_accuracy/spec.yaml | 11 | 12 |
| tool_input_accuracy/spec.yaml | 12 | 13 |
| tool_call_success/spec.yaml | 7 | 8 |

Tests

  • Flipped check_for_unsupported_tools True → False in the three behavior test suites so assertions match the new behavior (validator accepts → flow runs).
  • Relaxed base_tool_evaluation_test._run_tool_type_test: when a tool's expected_flow_inputs is not yet populated (empty dict), assert only that the flow was invoked once instead of exact-argument matching. The full per-tool expected-flow constants for the newly-enabled tool types will land in a follow-up PR via a mechanical generation pass over the existing _QUERY / _RESPONSE / _TOOL_DEFINITIONS fixtures.
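
The relaxed assertion described above can be sketched roughly as follows. assert_flow_invocation is a hypothetical helper written for illustration; the actual body of _run_tool_type_test is not shown in this PR, and the sketch assumes the flow is a unittest.mock mock that receives its inputs as keyword arguments.

```python
from typing import Any, Dict
from unittest.mock import MagicMock


def assert_flow_invocation(flow_mock: MagicMock,
                           expected_flow_inputs: Dict[str, Any]) -> None:
    """Hypothetical helper mirroring the relaxed check in _run_tool_type_test."""
    if expected_flow_inputs:
        # Fixture is populated: keep exact-argument matching.
        flow_mock.assert_called_once_with(**expected_flow_inputs)
    else:
        # Fixture not yet backfilled (empty dict): only require one invocation.
        flow_mock.assert_called_once()
```

The trade-off is that an empty fixture no longer catches argument regressions for that tool type, which is why the backfill is tracked as a follow-up.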

Test results

228 / 228 behavior tests pass across the three suites:

  • test_tool_call_accuracy_evaluator_behavior
  • test_tool_call_success_evaluator_behavior
  • test_tool_input_accuracy_evaluator_behavior

Follow-ups (not in this PR)

  1. Port the _ToolCallSuccess status short-circuit (the other half of SDK #47369) into assets/evaluators/builtin/tool_call_success/evaluator/_tool_call_success.py, with spec.yaml v8 → v9 and a mirrored short-circuit test suite. Without that follow-up, the cloud-managed tool_call_success evaluator will accept restricted-tool conversations (after this PR) but will still send failed runs through the LLM rather than deterministically returning fail.
  2. Backfill per-tool expected_flow_inputs fixtures for the newly-enabled tool types so base_tool_evaluation_test resumes exact-argument matching.

These three evaluators grade the agent's tool selection, input arguments,
and call status -- none consume the (redacted) tool output body -- so the
previous unconditional rejection of conversations containing built-in
restricted tools (bing_grounding, bing_custom_search, azure_ai_search,
azure_fabric, sharepoint_grounding, plus browser_automation,
code_interpreter_call, computer_call, openapi_call, web_search) is now
lifted. tool_output_utilization and groundedness still reject restricted
tools because they consume the tool output body.

Source:
- _tool_call_accuracy.py: ToolCallsValidator check_for_unsupported_tools True -> False
- _tool_input_accuracy.py: ToolDefinitionsValidator check_for_unsupported_tools True -> False
- _tool_call_success.py:  ToolDefinitionsValidator check_for_unsupported_tools True -> False

Registry:
- tool_call_accuracy/spec.yaml: version 11 -> 12
- tool_call_success/spec.yaml:  version 7  -> 8
- tool_input_accuracy/spec.yaml: version 12 -> 13

Tests:
- Flip test class check_for_unsupported_tools True -> False on the three suites
  so assertions match the new behavior (validator accepts -> flow runs).
- Relax base_tool_evaluation_test._run_tool_type_test: when a tool's
  expected_flow_inputs is not yet populated (empty dict), assert only that
  the flow was invoked once instead of exact-argument matching. The full
  per-tool expected-flow constants for the newly-enabled tool types will
  land in a follow-up PR via a mechanical generation script over the
  existing _QUERY/_RESPONSE/_TOOL_DEFINITIONS fixtures.

Verified: 228 of 228 behavior tests pass across the three suites
(test_tool_call_accuracy_evaluator_behavior, test_tool_call_success_*,
test_tool_input_accuracy_*).
@mmkawale mmkawale requested review from a team as code owners June 5, 2026 19:00

github-actions Bot commented Jun 5, 2026

Test Results for assets-test

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit aef7bdd.

♻️ This comment has been updated with latest results.

Mirrors the SDK-side short-circuit that landed in azure-sdk-for-python#47369 into the registry's forked evaluator source. When any tool_call or tool_result content block carries a known-failure status (failed/error/incomplete/cancelled/canceled), _do_eval returns a deterministic fail without calling the LLM. When no status is present, behavior is unchanged.

Source: _tool_call_success.py adds _FAILED_TOOL_STATUSES + _collect_failed_tool_statuses helper and an inline short-circuit block in _do_eval, placed after the intermediate-response check and None/empty validation, before the list-response preprocessing.

Registry: tool_call_success/spec.yaml version 8 -> 9.

Tests: new test_tool_call_success_short_circuit.py with 14 helper tests (parametrized over each failure status, case-insensitivity, malformed input tolerance) and 4 integration tests (short-circuit hits, dedupe of statuses in properties, no short-circuit when all completed, no short-circuit when status absent). 18/18 new tests pass; existing 69 behavior tests in test_tool_call_success_evaluator_behavior.py still pass.