Enable ToolCallAccuracy/InputAccuracy/CallSuccess on restricted-tool conversations#5117
Open
mmkawale wants to merge 2 commits into
Open
Enable ToolCallAccuracy/InputAccuracy/CallSuccess on restricted-tool conversations#5117mmkawale wants to merge 2 commits into
mmkawale wants to merge 2 commits into
Conversation
These three evaluators grade the agent's tool selection, input arguments, and call status -- none consume the (redacted) tool output body -- so the previous unconditional rejection of conversations containing built-in restricted tools (bing_grounding, bing_custom_search, azure_ai_search, azure_fabric, sharepoint_grounding, plus browser_automation, code_interpreter_call, computer_call, openapi_call, web_search) is now lifted. tool_output_utilization and groundedness still reject restricted tools because they consume the tool output body. Source: - _tool_call_accuracy.py: ToolCallsValidator check_for_unsupported_tools True -> False - _tool_input_accuracy.py: ToolDefinitionsValidator check_for_unsupported_tools True -> False - _tool_call_success.py: ToolDefinitionsValidator check_for_unsupported_tools True -> False Registry: - tool_call_accuracy/spec.yaml: version 11 -> 12 - tool_call_success/spec.yaml: version 7 -> 8 - tool_input_accuracy/spec.yaml: version 12 -> 13 Tests: - Flip test class check_for_unsupported_tools True -> False on the three suites so assertions match the new behavior (validator accepts -> flow runs). - Relax base_tool_evaluation_test._run_tool_type_test: when a tool's expected_flow_inputs is not yet populated (empty dict), assert only that the flow was invoked once instead of exact-argument matching. The full per-tool expected-flow constants for the newly-enabled tool types will land in a follow-up PR via a mechanical generation script over the existing _QUERY/_RESPONSE/_TOOL_DEFINITIONS fixtures. Verified: 228 of 228 behavior tests pass across the three suites (test_tool_call_accuracy_evaluator_behavior, test_tool_call_success_*, test_tool_input_accuracy_*).
Test Results for assets-test0 tests 0 ✅ 0s ⏱️ Results for commit aef7bdd. ♻️ This comment has been updated with latest results. |
Mirrors the SDK-side short-circuit landed in azure-sdk-for-python#47369 into the registry's forked evaluator source. When any tool_call or tool_result content block carries a known-failure status (failed/error/incomplete/cancelled/canceled), _do_eval returns a deterministic fail without calling the LLM. Absent status, behavior is unchanged. Source: _tool_call_success.py adds _FAILED_TOOL_STATUSES + _collect_failed_tool_statuses helper and an inline short-circuit block in _do_eval, placed after the intermediate-response check and None/empty validation, before the list-response preprocessing. Registry: tool_call_success/spec.yaml version 8 -> 9. Tests: new test_tool_call_success_short_circuit.py with 14 helper tests (parametrized over each failure status, case-insensitivity, malformed input tolerance) and 4 integration tests (short-circuit hits, dedupe of statuses in properties, no short-circuit when all completed, no short-circuit when status absent). 18/18 new tests pass; existing 69 behavior tests in test_tool_call_success_evaluator_behavior.py still pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lifts the unconditional rejection of conversations containing built-in restricted tools (
bing_grounding,bing_custom_search,azure_ai_search,azure_fabric,sharepoint_grounding,browser_automation,code_interpreter_call,computer_call,openapi_call,web_search) on the three tool-call evaluators (tool_call_accuracy,tool_input_accuracy,tool_call_success).These three evaluators grade the agent's tool selection, input arguments, and call status respectively. None of them consume the (redacted) tool output body, so the previous blanket rejection was over-broad.
tool_output_utilizationandgroundednesscontinue to reject restricted-tool conversations because they do read the tool output body.This is the registry-asset side of the change. The matching SDK change ships in Azure/azure-sdk-for-python#47369 (
azure-ai-evaluation 1.17.1). Because the registry hosts forked copies of the evaluator source underassets/evaluators/builtin/<name>/evaluator/, the same one-line flip has to be duplicated here per-evaluator.Changes
Source (
check_for_unsupported_tools=True→False)assets/evaluators/builtin/tool_call_accuracy/evaluator/_tool_call_accuracy.py—ToolCallsValidatorassets/evaluators/builtin/tool_input_accuracy/evaluator/_tool_input_accuracy.py—ToolDefinitionsValidatorassets/evaluators/builtin/tool_call_success/evaluator/_tool_call_success.py—ToolDefinitionsValidatorSpec version bumps
tool_call_accuracy/spec.yamltool_input_accuracy/spec.yamltool_call_success/spec.yamlTests
check_for_unsupported_toolsTrue → Falsein the three behavior test suites so assertions match the new behavior (validator accepts → flow runs).base_tool_evaluation_test._run_tool_type_test: when a tool'sexpected_flow_inputsis not yet populated (empty dict), assert only that the flow was invoked once instead of exact-argument matching. The full per-tool expected-flow constants for the newly-enabled tool types will land in a follow-up PR via a mechanical generation pass over the existing_QUERY/_RESPONSE/_TOOL_DEFINITIONSfixtures.Test results
228 / 228 behavior tests pass across the three suites:
test_tool_call_accuracy_evaluator_behaviortest_tool_call_success_evaluator_behaviortest_tool_input_accuracy_evaluator_behaviorFollow-ups (not in this PR)
_ToolCallSuccessstatus short-circuit (the other half of SDK #47369) intoassets/evaluators/builtin/tool_call_success/evaluator/_tool_call_success.py, withspec.yamlv8 → v9 and a mirrored short-circuit test suite. Without that follow-up, the cloud-managedtool_call_successevaluator will accept restricted-tool conversations (after this PR) but will still send failed runs through the LLM rather than deterministically returningfail.expected_flow_inputsfixtures for the newly-enabled tool types sobase_tool_evaluation_testresumes exact-argument matching.