Converter: add bing_custom_search + sharepoint_grounding branches; query/input fallback for AIS, SP, Fabric

manaskawale · manaskawale · commit 5445c226de33 · 2026-06-08T12:02:31.000-07:00
break_tool_call_into_messages previously had no elif branch for bing_custom_search or sharepoint_grounding, so calls touching either tool were silently dropped before any evaluator could see them. The three status-only tool evaluators (ToolCallAccuracyEvaluator, _ToolInputAccuracyEvaluator, _ToolCallSuccessEvaluator) therefore returned NOT_APPLICABLE on those conversations even after the validator was loosened in PR #47369.

Changes:
- bing_custom_search: arguments-only branch mirroring bing_grounding (emits a tool_call with the requesturl; no tool_result, since Bing-family results are redacted upstream for compliance).
- sharepoint_grounding: arguments + dumped output, mirroring azure_ai_search. Phase 2 will extend the Groundedness extractor to walk the documents structure already present on the tool_result.
- azure_ai_search, sharepoint_grounding, fabric_dataagent input branches: switched from a direct details["<tool>"]["input"] dereference to .get("input") or .get("query") or an empty-string fallback. Live agent traces emit the search term under "query" for all three, which made the existing AIS and Fabric branches surface empty arguments to evaluators (a live bug, not just a Phase 1 prerequisite).
- Refreshed the stale March-2025 top-of-function comment to reflect the current set of supported built-ins.

Tests:
Added 5 new tests in tests/converters/ai_agent_converter/test_ai_agent_converter_internals.py covering bing_custom_search, sharepoint_grounding (input key and output dump), and the query-key fallback for AIS, SP, and Fabric. The new tests construct ToolCall via a small _HybridDict helper instead of going through ToolDecoder, so they do not depend on the agents SDK RunStep* models, which have moved between the azure.ai.projects.models and azure.ai.agents.models packages.
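
For reviewers, the fallback order used in the azure_ai_search, sharepoint_grounding, and fabric_dataagent branches can be sketched in isolation as below. This is a minimal illustration, not the converter code itself: `extract_search_term` is a hypothetical helper name that does not exist in the SDK, and the payloads are made-up examples shaped like the ones in the new tests.

```python
from typing import Any, Mapping


def extract_search_term(payload: Mapping[str, Any]) -> str:
    """Return the search term from a built-in tool's per-tool payload.

    Hypothetical helper for illustration only; it mirrors the
    `.get("input") or .get("query") or ""` expression described above.
    """
    # `or` treats a missing key (None) and an empty string alike, so an
    # empty "input" also falls through to "query", and then to "".
    return payload.get("input") or payload.get("query") or ""


# Payload that carries the term under "input".
print(extract_search_term({"input": "quarterly sales report", "output": {}}))

# Payload that carries the term under "query", as live traces are reported to do.
print(extract_search_term({"query": "vacation policy", "output": {}}))

# Neither key present: the result is an empty string rather than a KeyError.
print(repr(extract_search_term({"output": {}})))
```

One consequence of chaining with `or` rather than checking key presence is that an explicitly empty `input` is treated the same as a missing one, so `query` wins whenever `input` is empty. The emitted arguments are always keyed as `input` regardless of which key supplied the value.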
diff --git a/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md b/sdk/evaluation/azure-ai-evaluation/CHANGELOG.md
@@ -7,6 +7,8 @@
 - Enabled `ToolCallAccuracyEvaluator`, `_ToolInputAccuracyEvaluator`, and `_ToolCallSuccessEvaluator` to run on conversations that include built-in restricted tools (`bing_grounding`, `bing_custom_search`, `azure_ai_search`, `azure_fabric`, `sharepoint_grounding`). These three evaluators grade the agent's tool selection, input arguments, and call status — none of which require the (redacted) tool output body — so the previous unconditional rejection of conversations containing restricted tools is now lifted. Achieved by setting `check_for_unsupported_tools=False` on each evaluator's input validator. `GroundednessEvaluator` and `ToolOutputUtilizationEvaluator` continue to reject restricted tools because they consume the tool output body.
 - Exported `_ToolInputAccuracyEvaluator` from the top-level `azure.ai.evaluation` namespace so consumers no longer need to reach into the private `_evaluators._tool_input_accuracy` submodule. The other tool evaluators were already exposed there; this brings the four siblings in line.
 - `_ToolCallSuccessEvaluator` now deterministically returns `fail` (score `0`, `_passed=False`) without invoking the LLM when any `tool_call` or `tool_result` in the response carries a known-failure `status` (`failed`, `error`, `incomplete`, `cancelled`/`canceled`). This matches the evaluator's binary contract ("FALSE: at least one tool call failed") and prevents the prompty rubric -- which doesn't see the `status` field -- from mis-grading conversations whose only failure signal is the runtime-reported execution status. Behavior is unchanged for responses where no `status` is populated.
+- Extended `break_tool_call_into_messages` in `_converters/_models.py` with explicit branches for `bing_custom_search` (arguments-only, mirroring `bing_grounding` — Bing-family results stay redacted upstream) and `sharepoint_grounding` (arguments + dumped output, mirroring `azure_ai_search`). Both were silently dropped before because the converter had no `elif` branch for them, which meant the three status-only tool evaluators returned `NOT_APPLICABLE` on conversations that touched either tool. The `bing_grounding` and `bing_custom_search` request-side payloads continue to emit only the `requesturl`; the `sharepoint_grounding` result is dumped onto the `tool_result` so a future Groundedness / Tool Output Utilization extractor can read it.
+- Made the per-tool argument extraction in `break_tool_call_into_messages` resilient to the `query` vs `input` runtime drift observed on `azure_ai_search`, `sharepoint_grounding`, and `fabric_dataagent`. Each branch now reads `details["<tool>"].get("input") or details["<tool>"].get("query") or ""` instead of dereferencing `["input"]` directly, so live agent traces (which emit the search term under `query`) no longer surface as empty `arguments` to the evaluators. Behavior is unchanged when the runtime emits `input`.
 
 ## 1.17.0 (2026-06-03)
 
diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py
@@ -327,11 +327,12 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Mess
     # We will use this as our accumulator.
     messages: List[Message] = []
 
-    # As of March 17th, 2025, we only support custom functions due to built-in code interpreters and bing grounding
-    # tooling not reporting their function calls in the same way. Code interpreters don't include the tool call at
-    # all in most of the cases, and bing would only show the API URL, without arguments or results.
-    # Bing grounding would have "bing_grounding" in details with "requesturl" that will just be the API path with query.
-    # TODO: Work with AI Services to add converter support for BingGrounding and CodeInterpreter.
+    # In addition to custom functions, we support a handful of built-in tools whose runtime payload
+    # we have explicit branches for below (code_interpreter, file_search, bing_grounding,
+    # bing_custom_search, azure_ai_search, sharepoint_grounding, fabric_dataagent). Bing variants
+    # only carry the `requesturl` request side (results are redacted upstream for compliance), so
+    # they emit just the tool_call message; the others emit both call and result.
+    # Unknown built-in types are silently skipped by the trailing `return messages`.
     if hasattr(tool_call.details, _FUNCTION) or tool_call.details.get("function"):
         # This is the internals of the content object that will be included with the tool call.
         tool_call_id = tool_call.details.id
@@ -351,15 +352,22 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Mess
             arguments = {"input": tool_call.details.code_interpreter.input}
         elif tool_call.details["type"] == "bing_grounding":
             arguments = {"requesturl": tool_call.details["bing_grounding"]["requesturl"]}
+        elif tool_call.details["type"] == "bing_custom_search":
+            arguments = {"requesturl": tool_call.details["bing_custom_search"]["requesturl"]}
         elif tool_call.details["type"] == "file_search":
             options = tool_call.details["file_search"]["ranking_options"]
             arguments = {
                 "ranking_options": {"ranker": options["ranker"], "score_threshold": options["score_threshold"]}
             }
         elif tool_call.details["type"] == "azure_ai_search":
-            arguments = {"input": tool_call.details["azure_ai_search"]["input"]}
+            ais = tool_call.details["azure_ai_search"]
+            arguments = {"input": ais.get("input") or ais.get("query") or ""}
+        elif tool_call.details["type"] == "sharepoint_grounding":
+            sp = tool_call.details["sharepoint_grounding"]
+            arguments = {"input": sp.get("input") or sp.get("query") or ""}
         elif tool_call.details["type"] == "fabric_dataagent":
-            arguments = {"input": tool_call.details["fabric_dataagent"]["input"]}
+            fab = tool_call.details["fabric_dataagent"]
+            arguments = {"input": fab.get("input") or fab.get("query") or ""}
         else:
             # unsupported tool type, skip
             return messages
@@ -389,11 +397,15 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Mess
             if tool_call.details.type == _CODE_INTERPRETER:
                 output = [result.as_dict() for result in tool_call.details.code_interpreter.outputs]
             elif tool_call.details.type == _BING_GROUNDING:
-                return messages  # not supported yet from bing grounding tool
+                return messages  # results are redacted upstream for Bing; no tool_result to emit
+            elif tool_call.details.type == _BING_CUSTOM_SEARCH:
+                return messages  # results are redacted upstream for Bing; no tool_result to emit
             elif tool_call.details.type == _FILE_SEARCH:
                 output = [result.as_dict() for result in tool_call.details.file_search.results]
             elif tool_call.details.type == _AZURE_AI_SEARCH:
                 output = tool_call.details.azure_ai_search["output"]
+            elif tool_call.details.type == _SHAREPOINT_GROUNDING:
+                output = tool_call.details.sharepoint_grounding["output"]
             elif tool_call.details.type == _FABRIC_DATAAGENT:
                 output = tool_call.details.fabric_dataagent["output"]
         except:
diff --git a/sdk/evaluation/azure-ai-evaluation/tests/converters/ai_agent_converter/test_ai_agent_converter_internals.py b/sdk/evaluation/azure-ai-evaluation/tests/converters/ai_agent_converter/test_ai_agent_converter_internals.py
@@ -37,6 +37,42 @@
 from serialization_helper import ToolDecoder, ThreadRunDecoder
 
 
+class _HybridDict(dict):
+    """Dict subclass that also exposes its keys as attributes.
+
+    The converter (`break_tool_call_into_messages`) mixes subscript access on the request side
+    (`tool_call.details["type"]`, `tool_call.details["bing_grounding"]["requesturl"]`) with attribute
+    access on the result side (`tool_call.details.type`, `tool_call.details.azure_ai_search["output"]`).
+    The production code path uses typed runtime models (`RunStep*ToolCall`) that satisfy both shapes;
+    `_HybridDict` mimics that surface in unit tests without depending on the agents SDK models, which
+    have moved between packages and are not guaranteed to be importable in every test environment.
+    """
+
+    def __getattr__(self, name):
+        try:
+            return self[name]
+        except KeyError as e:
+            raise AttributeError(name) from e
+
+
+def _build_builtin_tool_call(call_id: str, tool_type: str, payload: dict) -> ToolCall:
+    """Construct a `ToolCall` for a built-in tool without going through `ToolDecoder`.
+
+    `payload` is the per-tool sub-object (e.g. `{"requesturl": "..."}` for Bing or
+    `{"input": "...", "output": {...}}` for SharePoint). The returned `ToolCall.details` is a
+    nested `_HybridDict` so both subscript and attribute access work.
+    """
+    details = _HybridDict(
+        {
+            "id": call_id,
+            "type": tool_type,
+            tool_type: _HybridDict(payload),
+        }
+    )
+    now = datetime.now()
+    return ToolCall(created=now, completed=now, details=details)
+
+
 class TestAIAgentConverter(unittest.TestCase):
     def test_is_agent_tool_call(self):
         # Test case where message is an agent tool call
@@ -200,6 +236,110 @@ def test_bing_grounding_tool_calls(self):
             tool_call_content["arguments"] == {"requesturl": "https://api.bing.microsoft.com/v7.0/search?q="}
         )
 
+    def test_bing_custom_search_tool_calls(self):
+        # bing_custom_search mirrors bing_grounding: arguments-only tool_call, no tool_result
+        # (results are redacted upstream for Bing-family tools).
+        # Built directly rather than via ToolDecoder so the test does not depend on the
+        # RunStepBingCustomSearchToolCall model being present in the installed agents SDK.
+        tool_call = _build_builtin_tool_call(
+            call_id="call_BCS123",
+            tool_type="bing_custom_search",
+            payload={"requesturl": "https://api.bing.microsoft.com/v7.0/custom/search?customconfig=abc&q=foo"},
+        )
+        messages = break_tool_call_into_messages(tool_call, "abc123")
+        self.assertTrue(len(messages) == 1)  # Bing variants emit no tool_result
+        self.assertTrue(isinstance(messages[0], AssistantMessage))
+        tool_call_content = messages[0].content[0]
+        self.assertTrue(tool_call_content["type"] == "tool_call")
+        self.assertTrue(tool_call_content["tool_call_id"] == "call_BCS123")
+        self.assertTrue(tool_call_content["name"] == "bing_custom_search")
+        self.assertTrue(
+            tool_call_content["arguments"]
+            == {"requesturl": "https://api.bing.microsoft.com/v7.0/custom/search?customconfig=abc&q=foo"}
+        )
+
+    def test_sharepoint_grounding_tool_calls(self):
+        # sharepoint_grounding mirrors azure_ai_search: arguments + dumped output.
+        # Exercises the `input` argument key on the request side.
+        tool_call = _build_builtin_tool_call(
+            call_id="call_SP123",
+            tool_type="sharepoint_grounding",
+            payload={
+                "input": "quarterly sales report",
+                "output": {
+                    "documents": [
+                        {
+                            "title": "Q3 Sales",
+                            "url": "https://contoso.sharepoint.com/Q3.docx",
+                            "content": "Q3 was up 12%",
+                        }
+                    ]
+                },
+            },
+        )
+        messages = break_tool_call_into_messages(tool_call, "abc123")
+        self.assertTrue(len(messages) == 2)
+        self.assertTrue(isinstance(messages[0], AssistantMessage))
+        tool_call_content = messages[0].content[0]
+        self.assertTrue(tool_call_content["type"] == "tool_call")
+        self.assertTrue(tool_call_content["tool_call_id"] == "call_SP123")
+        self.assertTrue(tool_call_content["name"] == "sharepoint_grounding")
+        self.assertTrue(tool_call_content["arguments"] == {"input": "quarterly sales report"})
+        self.assertTrue(isinstance(messages[1], ToolMessage))
+        self.assertTrue(messages[1].content[0]["type"] == "tool_result")
+        self.assertTrue(
+            messages[1].content[0]["tool_result"]
+            == {
+                "documents": [
+                    {
+                        "title": "Q3 Sales",
+                        "url": "https://contoso.sharepoint.com/Q3.docx",
+                        "content": "Q3 was up 12%",
+                    }
+                ]
+            }
+        )
+
+    def test_sharepoint_grounding_tool_calls_query_key_fallback(self):
+        # Live agent traces emit the search term under `query` instead of `input` for SharePoint.
+        # The converter must fall back to `query` so downstream evaluators see a non-empty argument.
+        tool_call = _build_builtin_tool_call(
+            call_id="call_SP456",
+            tool_type="sharepoint_grounding",
+            payload={"query": "vacation policy", "output": {"documents": []}},
+        )
+        messages = break_tool_call_into_messages(tool_call, "abc123")
+        self.assertTrue(len(messages) == 2)
+        tool_call_content = messages[0].content[0]
+        self.assertTrue(tool_call_content["arguments"] == {"input": "vacation policy"})
+
+    def test_azure_ai_search_tool_calls_query_key_fallback(self):
+        # Live agent traces emit the search term under `query` instead of `input` for Azure AI Search.
+        # The converter must fall back to `query` so downstream evaluators see a non-empty argument.
+        tool_call = _build_builtin_tool_call(
+            call_id="call_AIS789",
+            tool_type="azure_ai_search",
+            payload={"query": "refund policy", "output": []},
+        )
+        messages = break_tool_call_into_messages(tool_call, "abc123")
+        self.assertTrue(len(messages) == 2)
+        tool_call_content = messages[0].content[0]
+        self.assertTrue(tool_call_content["name"] == "azure_ai_search")
+        self.assertTrue(tool_call_content["arguments"] == {"input": "refund policy"})
+
+    def test_fabric_dataagent_tool_calls_query_key_fallback(self):
+        # Same `query` vs `input` drift for fabric_dataagent.
+        tool_call = _build_builtin_tool_call(
+            call_id="call_FAB012",
+            tool_type="fabric_dataagent",
+            payload={"query": "top customers by revenue", "output": {}},
+        )
+        messages = break_tool_call_into_messages(tool_call, "abc123")
+        self.assertTrue(len(messages) == 2)
+        tool_call_content = messages[0].content[0]
+        self.assertTrue(tool_call_content["name"] == "fabric_dataagent")
+        self.assertTrue(tool_call_content["arguments"] == {"input": "top customers by revenue"})
+
     def test_extract_tool_definitions(self):
         thread_run_data = """{
   "id": "run_zs3USbTw61ZpRk8bwBPP8Ue7",