Commit 5445c22

Converter: add bing_custom_search + sharepoint_grounding branches; query/input fallback for AIS, SP, Fabric
break_tool_call_into_messages previously had no elif branch for bing_custom_search or sharepoint_grounding, so calls touching either tool were silently dropped before any evaluator could see them. The three status-only tool evaluators (ToolCallAccuracyEvaluator, _ToolInputAccuracyEvaluator, _ToolCallSuccessEvaluator) therefore returned NOT_APPLICABLE on those conversations even after the validator was loosened in PR #47369.

Changes:
- bing_custom_search: arguments-only branch mirroring bing_grounding (emits a tool_call with the requesturl; no tool_result, since Bing-family results are redacted upstream for compliance).
- sharepoint_grounding: arguments + dumped output, mirroring azure_ai_search. Phase 2 will extend the Groundedness extractor to walk the documents structure already present on the tool_result.
- azure_ai_search, sharepoint_grounding, fabric_dataagent input branches: switched from direct details[<tool>][input] dereference to .get(input) or .get(query) or empty-string fallback. Live agent traces emit the search term under 'query' for all three, which made the existing AIS and Fabric branches surface empty arguments to evaluators (a live bug, not just a Phase 1 prerequisite).
- Refreshed the stale March-2025 top-of-function comment to reflect the current set of supported built-ins.

Tests:
Added 5 new tests in tests/converters/ai_agent_converter/test_ai_agent_converter_internals.py covering bing_custom_search, sharepoint_grounding (input key and output dump), and the query-key fallback for AIS, SP, and Fabric. The new tests construct ToolCall via a small _HybridDict helper instead of going through ToolDecoder, so they do not depend on the agents SDK RunStep* models that have moved between azure.ai.projects.models and azure.ai.agents.models packages.
1 parent 24198a3 commit 5445c22
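
The query/input fallback described in the commit message can be sketched in isolation as follows. This is a minimal illustration, not the converter itself: the helper name `extract_search_input` is hypothetical, and the payload is assumed to be a plain mapping standing in for the per-tool sub-object.

```python
from typing import Any, Mapping


def extract_search_input(payload: Mapping[str, Any]) -> str:
    """Return the search term, preferring `input` and falling back to `query`.

    Hypothetical helper; the commit inlines this expression in each branch.
    """
    # `or` chaining falls through on a missing key, on None, and on "",
    # so an empty `input` does not mask a populated `query`.
    return payload.get("input") or payload.get("query") or ""


# Payload shape emitted when the runtime uses `input`.
print(extract_search_input({"input": "quarterly sales report"}))
# Payload shape seen in live traces, which use `query`.
print(extract_search_input({"query": "vacation policy"}))
# Neither key present: the result is the empty string rather than a KeyError.
print(repr(extract_search_input({"output": []})))
```

One consequence of the `or` chain worth noting: a nested default such as `payload.get("input", payload.get("query", ""))` would only cover a missing `input` key, whereas the `or` form also falls back when `input` is present but empty or `None`.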

3 files changed

Lines changed: 162 additions & 8 deletions

sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@
 - Enabled `ToolCallAccuracyEvaluator`, `_ToolInputAccuracyEvaluator`, and `_ToolCallSuccessEvaluator` to run on conversations that include built-in restricted tools (`bing_grounding`, `bing_custom_search`, `azure_ai_search`, `azure_fabric`, `sharepoint_grounding`). These three evaluators grade the agent's tool selection, input arguments, and call status — none of which require the (redacted) tool output body — so the previous unconditional rejection of conversations containing restricted tools is now lifted. Achieved by setting `check_for_unsupported_tools=False` on each evaluator's input validator. `GroundednessEvaluator` and `ToolOutputUtilizationEvaluator` continue to reject restricted tools because they consume the tool output body.
 - Exported `_ToolInputAccuracyEvaluator` from the top-level `azure.ai.evaluation` namespace so consumers no longer need to reach into the private `_evaluators._tool_input_accuracy` submodule. The other tool evaluators were already exposed there; this brings the four siblings in line.
 - `_ToolCallSuccessEvaluator` now deterministically returns `fail` (score `0`, `_passed=False`) without invoking the LLM when any `tool_call` or `tool_result` in the response carries a known-failure `status` (`failed`, `error`, `incomplete`, `cancelled`/`canceled`). This matches the evaluator's binary contract ("FALSE: at least one tool call failed") and prevents the prompty rubric -- which doesn't see the `status` field -- from mis-grading conversations whose only failure signal is the runtime-reported execution status. Behavior is unchanged for responses where no `status` is populated.
+- Extended `break_tool_call_into_messages` in `_converters/_models.py` with explicit branches for `bing_custom_search` (arguments-only, mirroring `bing_grounding` — Bing-family results stay redacted upstream) and `sharepoint_grounding` (arguments + dumped output, mirroring `azure_ai_search`). Both were silently dropped before because the converter had no `elif` branch for them, which meant the three status-only tool evaluators returned `NOT_APPLICABLE` on conversations that touched either tool. The `bing_grounding` and `bing_custom_search` request-side payloads continue to emit only the `requesturl`; the `sharepoint_grounding` result is dumped onto the `tool_result` so a future Groundedness / Tool Output Utilization extractor can read it.
+- Made the per-tool argument extraction in `break_tool_call_into_messages` resilient to the `query` vs `input` runtime drift observed on `azure_ai_search`, `sharepoint_grounding`, and `fabric_dataagent`. Each branch now reads `details["<tool>"].get("input") or details["<tool>"].get("query") or ""` instead of dereferencing `["input"]` directly, so live agent traces (which emit the search term under `query`) no longer surface as empty `arguments` to the evaluators. Behavior is unchanged when the runtime emits `input`.
 
 ## 1.17.0 (2026-06-03)
 
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py

Lines changed: 20 additions & 8 deletions
@@ -327,11 +327,12 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Mess
     # We will use this as our accumulator.
     messages: List[Message] = []
 
-    # As of March 17th, 2025, we only support custom functions due to built-in code interpreters and bing grounding
-    # tooling not reporting their function calls in the same way. Code interpreters don't include the tool call at
-    # all in most of the cases, and bing would only show the API URL, without arguments or results.
-    # Bing grounding would have "bing_grounding" in details with "requesturl" that will just be the API path with query.
-    # TODO: Work with AI Services to add converter support for BingGrounding and CodeInterpreter.
+    # In addition to custom functions, we support a handful of built-in tools whose runtime payload
+    # we have explicit branches for below (code_interpreter, file_search, bing_grounding,
+    # bing_custom_search, azure_ai_search, sharepoint_grounding, fabric_dataagent). Bing variants
+    # only carry the `requesturl` request side (results are redacted upstream for compliance), so
+    # they emit just the tool_call message; the others emit both call and result.
+    # Unknown built-in types are silently skipped by the trailing `return messages`.
     if hasattr(tool_call.details, _FUNCTION) or tool_call.details.get("function"):
         # This is the internals of the content object that will be included with the tool call.
         tool_call_id = tool_call.details.id
@@ -351,15 +352,22 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Mess
             arguments = {"input": tool_call.details.code_interpreter.input}
         elif tool_call.details["type"] == "bing_grounding":
             arguments = {"requesturl": tool_call.details["bing_grounding"]["requesturl"]}
+        elif tool_call.details["type"] == "bing_custom_search":
+            arguments = {"requesturl": tool_call.details["bing_custom_search"]["requesturl"]}
         elif tool_call.details["type"] == "file_search":
             options = tool_call.details["file_search"]["ranking_options"]
             arguments = {
                 "ranking_options": {"ranker": options["ranker"], "score_threshold": options["score_threshold"]}
             }
         elif tool_call.details["type"] == "azure_ai_search":
-            arguments = {"input": tool_call.details["azure_ai_search"]["input"]}
+            ais = tool_call.details["azure_ai_search"]
+            arguments = {"input": ais.get("input") or ais.get("query") or ""}
+        elif tool_call.details["type"] == "sharepoint_grounding":
+            sp = tool_call.details["sharepoint_grounding"]
+            arguments = {"input": sp.get("input") or sp.get("query") or ""}
         elif tool_call.details["type"] == "fabric_dataagent":
-            arguments = {"input": tool_call.details["fabric_dataagent"]["input"]}
+            fab = tool_call.details["fabric_dataagent"]
+            arguments = {"input": fab.get("input") or fab.get("query") or ""}
         else:
             # unsupported tool type, skip
             return messages
@@ -389,11 +397,15 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Mess
             if tool_call.details.type == _CODE_INTERPRETER:
                 output = [result.as_dict() for result in tool_call.details.code_interpreter.outputs]
             elif tool_call.details.type == _BING_GROUNDING:
-                return messages  # not supported yet from bing grounding tool
+                return messages  # results are redacted upstream for Bing; no tool_result to emit
+            elif tool_call.details.type == _BING_CUSTOM_SEARCH:
+                return messages  # results are redacted upstream for Bing; no tool_result to emit
             elif tool_call.details.type == _FILE_SEARCH:
                 output = [result.as_dict() for result in tool_call.details.file_search.results]
             elif tool_call.details.type == _AZURE_AI_SEARCH:
                 output = tool_call.details.azure_ai_search["output"]
+            elif tool_call.details.type == _SHAREPOINT_GROUNDING:
+                output = tool_call.details.sharepoint_grounding["output"]
             elif tool_call.details.type == _FABRIC_DATAAGENT:
                 output = tool_call.details.fabric_dataagent["output"]
         except:
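
The split the diff above draws between Bing-family tools (tool_call only) and the search-style tools (tool_call plus tool_result) can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions: `sketch_messages` and the message dictionaries are hypothetical, the real converter returns `AssistantMessage` / `ToolMessage` objects, and the `code_interpreter`, `file_search`, and custom-function branches are left out.

```python
from typing import Any, Dict, List

# Tool types whose results are redacted upstream, so only the call is emitted.
CALL_ONLY_TOOLS = {"bing_grounding", "bing_custom_search"}
# Tool types whose payload carries a search term plus an `output` field.
CALL_AND_RESULT_TOOLS = {"azure_ai_search", "sharepoint_grounding", "fabric_dataagent"}


def sketch_messages(details: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical, dict-based stand-in for the built-in tool branches."""
    tool_type = details["type"]
    payload = details.get(tool_type) or {}

    if tool_type in CALL_ONLY_TOOLS:
        arguments = {"requesturl": payload["requesturl"]}
    elif tool_type in CALL_AND_RESULT_TOOLS:
        arguments = {"input": payload.get("input") or payload.get("query") or ""}
    else:
        # Unknown built-in type: skip rather than fail.
        return []

    call = {
        "type": "tool_call",
        "tool_call_id": details["id"],
        "name": tool_type,
        "arguments": arguments,
    }
    if tool_type in CALL_ONLY_TOOLS:
        return [call]

    result = {
        "type": "tool_result",
        "tool_call_id": details["id"],
        "tool_result": payload.get("output"),
    }
    return [call, result]


bing = {"id": "call_1", "type": "bing_custom_search", "bing_custom_search": {"requesturl": "https://example.invalid/search?q=foo"}}
sharepoint = {"id": "call_2", "type": "sharepoint_grounding", "sharepoint_grounding": {"query": "vacation policy", "output": {"documents": []}}}

print(len(sketch_messages(bing)))        # one message: the call only
print(len(sketch_messages(sharepoint)))  # two messages: call and result
```

The message counts are what the status-only evaluators depend on: a tool type with no branch yields no messages at all, which is how the two tools ended up invisible before this commit.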

sdk/evaluation/azure-ai-evaluation/tests/converters/ai_agent_converter/test_ai_agent_converter_internals.py

Lines changed: 140 additions & 0 deletions
@@ -37,6 +37,42 @@
 from serialization_helper import ToolDecoder, ThreadRunDecoder
 
 
+class _HybridDict(dict):
+    """Dict subclass that also exposes its keys as attributes.
+
+    The converter (`break_tool_call_into_messages`) mixes subscript access on the request side
+    (`tool_call.details["type"]`, `tool_call.details["bing_grounding"]["requesturl"]`) with attribute
+    access on the result side (`tool_call.details.type`, `tool_call.details.azure_ai_search["output"]`).
+    The production code path uses typed runtime models (`RunStep*ToolCall`) that satisfy both shapes;
+    `_HybridDict` mimics that surface in unit tests without depending on the agents SDK models, which
+    have moved between packages and are not guaranteed to be importable in every test environment.
+    """
+
+    def __getattr__(self, name):
+        try:
+            return self[name]
+        except KeyError as e:
+            raise AttributeError(name) from e
+
+
+def _build_builtin_tool_call(call_id: str, tool_type: str, payload: dict) -> ToolCall:
+    """Construct a `ToolCall` for a built-in tool without going through `ToolDecoder`.
+
+    `payload` is the per-tool sub-object (e.g. `{"requesturl": "..."}` for Bing or
+    `{"input": "...", "output": {...}}` for SharePoint). The returned `ToolCall.details` is a
+    nested `_HybridDict` so both subscript and attribute access work.
+    """
+    details = _HybridDict(
+        {
+            "id": call_id,
+            "type": tool_type,
+            tool_type: _HybridDict(payload),
+        }
+    )
+    now = datetime.now()
+    return ToolCall(created=now, completed=now, details=details)
+
+
 class TestAIAgentConverter(unittest.TestCase):
     def test_is_agent_tool_call(self):
         # Test case where message is an agent tool call
@@ -200,6 +236,110 @@ def test_bing_grounding_tool_calls(self):
             tool_call_content["arguments"] == {"requesturl": "https://api.bing.microsoft.com/v7.0/search?q="}
         )
 
+    def test_bing_custom_search_tool_calls(self):
+        # bing_custom_search mirrors bing_grounding: arguments-only tool_call, no tool_result
+        # (results are redacted upstream for Bing-family tools).
+        # Built directly rather than via ToolDecoder so the test does not depend on the
+        # RunStepBingCustomSearchToolCall model being present in the installed agents SDK.
+        tool_call = _build_builtin_tool_call(
+            call_id="call_BCS123",
+            tool_type="bing_custom_search",
+            payload={"requesturl": "https://api.bing.microsoft.com/v7.0/custom/search?customconfig=abc&q=foo"},
+        )
+        messages = break_tool_call_into_messages(tool_call, "abc123")
+        self.assertTrue(len(messages) == 1)  # Bing variants emit no tool_result
+        self.assertTrue(isinstance(messages[0], AssistantMessage))
+        tool_call_content = messages[0].content[0]
+        self.assertTrue(tool_call_content["type"] == "tool_call")
+        self.assertTrue(tool_call_content["tool_call_id"] == "call_BCS123")
+        self.assertTrue(tool_call_content["name"] == "bing_custom_search")
+        self.assertTrue(
+            tool_call_content["arguments"]
+            == {"requesturl": "https://api.bing.microsoft.com/v7.0/custom/search?customconfig=abc&q=foo"}
+        )
+
+    def test_sharepoint_grounding_tool_calls(self):
+        # sharepoint_grounding mirrors azure_ai_search: arguments + dumped output.
+        # Exercises the `input` argument key on the request side.
+        tool_call = _build_builtin_tool_call(
+            call_id="call_SP123",
+            tool_type="sharepoint_grounding",
+            payload={
+                "input": "quarterly sales report",
+                "output": {
+                    "documents": [
+                        {
+                            "title": "Q3 Sales",
+                            "url": "https://contoso.sharepoint.com/Q3.docx",
+                            "content": "Q3 was up 12%",
+                        }
+                    ]
+                },
+            },
+        )
+        messages = break_tool_call_into_messages(tool_call, "abc123")
+        self.assertTrue(len(messages) == 2)
+        self.assertTrue(isinstance(messages[0], AssistantMessage))
+        tool_call_content = messages[0].content[0]
+        self.assertTrue(tool_call_content["type"] == "tool_call")
+        self.assertTrue(tool_call_content["tool_call_id"] == "call_SP123")
+        self.assertTrue(tool_call_content["name"] == "sharepoint_grounding")
+        self.assertTrue(tool_call_content["arguments"] == {"input": "quarterly sales report"})
+        self.assertTrue(isinstance(messages[1], ToolMessage))
+        self.assertTrue(messages[1].content[0]["type"] == "tool_result")
+        self.assertTrue(
+            messages[1].content[0]["tool_result"]
+            == {
+                "documents": [
+                    {
+                        "title": "Q3 Sales",
+                        "url": "https://contoso.sharepoint.com/Q3.docx",
+                        "content": "Q3 was up 12%",
+                    }
+                ]
+            }
+        )
+
+    def test_sharepoint_grounding_tool_calls_query_key_fallback(self):
+        # Live agent traces emit the search term under `query` instead of `input` for SharePoint.
+        # The converter must fall back to `query` so downstream evaluators see a non-empty argument.
+        tool_call = _build_builtin_tool_call(
+            call_id="call_SP456",
+            tool_type="sharepoint_grounding",
+            payload={"query": "vacation policy", "output": {"documents": []}},
+        )
+        messages = break_tool_call_into_messages(tool_call, "abc123")
+        self.assertTrue(len(messages) == 2)
+        tool_call_content = messages[0].content[0]
+        self.assertTrue(tool_call_content["arguments"] == {"input": "vacation policy"})
+
+    def test_azure_ai_search_tool_calls_query_key_fallback(self):
+        # Live agent traces emit the search term under `query` instead of `input` for Azure AI Search.
+        # The converter must fall back to `query` so downstream evaluators see a non-empty argument.
+        tool_call = _build_builtin_tool_call(
+            call_id="call_AIS789",
+            tool_type="azure_ai_search",
+            payload={"query": "refund policy", "output": []},
+        )
+        messages = break_tool_call_into_messages(tool_call, "abc123")
+        self.assertTrue(len(messages) == 2)
+        tool_call_content = messages[0].content[0]
+        self.assertTrue(tool_call_content["name"] == "azure_ai_search")
+        self.assertTrue(tool_call_content["arguments"] == {"input": "refund policy"})
+
+    def test_fabric_dataagent_tool_calls_query_key_fallback(self):
+        # Same `query` vs `input` drift for fabric_dataagent.
+        tool_call = _build_builtin_tool_call(
+            call_id="call_FAB012",
+            tool_type="fabric_dataagent",
+            payload={"query": "top customers by revenue", "output": {}},
+        )
+        messages = break_tool_call_into_messages(tool_call, "abc123")
+        self.assertTrue(len(messages) == 2)
+        tool_call_content = messages[0].content[0]
+        self.assertTrue(tool_call_content["name"] == "fabric_dataagent")
+        self.assertTrue(tool_call_content["arguments"] == {"input": "top customers by revenue"})
+
     def test_extract_tool_definitions(self):
         thread_run_data = """{
             "id": "run_zs3USbTw61ZpRk8bwBPP8Ue7",

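The dual access pattern that the `_HybridDict` docstring describes can be tried on its own, without the SDK's `ToolCall` type. The sketch below is a standalone illustration: the class is re-declared under the name `HybridDict`, and the sample payload is made up for the demonstration.

```python
class HybridDict(dict):
    """Dict whose keys can also be read as attributes (illustrative copy of the test helper)."""

    def __getattr__(self, name):
        # __getattr__ runs only after normal attribute lookup fails, so dict
        # methods such as .get() and .items() keep working unchanged.
        try:
            return self[name]
        except KeyError as e:
            raise AttributeError(name) from e


details = HybridDict(
    {
        "id": "call_demo",
        "type": "sharepoint_grounding",
        "sharepoint_grounding": HybridDict({"query": "vacation policy", "output": {"documents": []}}),
    }
)

# Request-side style: subscript access.
print(details["type"])
# Result-side style: attribute access on the same object.
print(details.sharepoint_grounding["output"])
# A missing key surfaces as a missing attribute, so hasattr() returns False
# instead of raising KeyError.
print(hasattr(details, "function"))
```

Translating `KeyError` into `AttributeError` is what lets checks like `hasattr(tool_call.details, ...)` behave sensibly on the stand-in. One caveat of this approach: a key that collides with a real `dict` attribute (for example `items` or `get`) stays reachable only through subscript access, because `__getattr__` is never consulted for names that already resolve.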