
Commit 6039d5c

m7md7sien and Copilot authored
Update Tool Call Accuracy to output unified format (#46319)
* Update Tool Call Accuracy to output unified format
* Update tests
* reformatting
* Refactor not applicable result method calls
* Fix test assertions for new unified output format and apply black formatting (#46336)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971
* Rename tool_call_accuracy reasoning output to reason and update skipped properties handling (#46355)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1
* Fix tool call accuracy test for skipped output schema (#46356)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22
* Add back backward-compatible base result keys for tool call accuracy outputs (#46449)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/77f12326-0743-466c-9fda-8e4906364d4f
* Update documentation to state deprecation of the 'gpt_' prefix
* Rename `_result` value from `not_applicable` to `pass` in `_return_not_applicable_result` (#46500)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e94d600e-75a6-4b62-92cf-420fb1597e29
* Restore TODO comment above _return_not_applicable_result
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/1ac22d46-abad-4a51-9269-cc884c11835d
* Add TODO for pass in _return_not_applicable_result
* Add back gpt_ key for backward compatibility

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Co-authored-by: Copilot <copilot@github.com>
1 parent 874c95b commit 6039d5c

5 files changed

Lines changed: 108 additions & 53 deletions


sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py

Lines changed: 26 additions & 0 deletions
@@ -438,3 +438,29 @@ def _not_applicable_result(
             result[f"{self._result_key}_details"] = {}

         return result
+
+    # TODO: After all evaluator outputs are updated, we can remove the _not_applicable_result method and replace calls to it with _return_not_applicable_result, which returns a "skipped" status instead of "pass" to avoid confusion.
+    def _return_not_applicable_result(
+        self, error_message: str, threshold: Union[int, float]
+    ) -> Dict[str, Union[str, float, Dict, None]]:
+        """Return a result indicating that the tool call is not applicable for evaluation.
+
+        :param error_message: The error message indicating why the evaluation is not applicable.
+        :type error_message: str
+        :param threshold: The threshold value for the evaluation.
+        :type threshold: Union[int, float]
+        :return: A dictionary containing the result of the evaluation.
+        :rtype: Dict[str, Union[str, float, Dict, None]]
+        """
+        return {
+            f"{self._result_key}": None,
+            f"{self._result_key}_score": None,
+            # TODO: Return "not_applicable" instead of "pass" once the
+            # evaluation service accepts it as a valid result value.
+            f"{self._result_key}_result": "pass",
+            f"{self._result_key}_passed": None,
+            f"{self._result_key}_reason": f"Not applicable: {error_message}",
+            f"{self._result_key}_status": "skipped",
+            f"{self._result_key}_threshold": threshold,
+            f"{self._result_key}_properties": None,
+        }
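For orientation (not part of the diff): for ToolCallAccuracyEvaluator, whose result key is `tool_call_accuracy`, the new helper yields a dictionary of the following shape. A sketch with example values; the message is the evaluator's `_TOOL_DEFINITIONS_MISSING_MESSAGE` and the threshold is illustrative.

```python
# Sketch of the dict _return_not_applicable_result builds for an evaluator
# whose _result_key is "tool_call_accuracy" (example message and threshold).
expected = {
    "tool_call_accuracy": None,
    "tool_call_accuracy_score": None,
    # "pass" is a stopgap until the evaluation service accepts "not_applicable".
    "tool_call_accuracy_result": "pass",
    "tool_call_accuracy_passed": None,
    "tool_call_accuracy_reason": "Not applicable: Tool definitions for all tool calls must be provided.",
    "tool_call_accuracy_status": "skipped",
    "tool_call_accuracy_threshold": 3,
    "tool_call_accuracy_properties": None,
}
```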

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py

Lines changed: 32 additions & 17 deletions
@@ -66,10 +66,11 @@ class ToolCallAccuracyEvaluator(PromptyEvaluatorBase[Union[str, float]]):

     .. note::

-        The output field "details" has been renamed to "tool_call_accuracy_details" for clarity.
+        The output field "details" has been renamed to "tool_call_accuracy_properties" for clarity.

-        To align with our support of a diverse set of models, an output key without the `gpt_` prefix has been added.
-        To maintain backwards compatibility, the old key with the `gpt_` prefix is still be present in the output;
+        To align with our support of a diverse set of models,
+        an output key with a "_score" suffix instead of the `gpt_` prefix has been added.
+        To maintain backwards compatibility, the old key with the `gpt_` prefix is still present in the output;
         however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.

     """
@@ -86,7 +87,7 @@ class ToolCallAccuracyEvaluator(PromptyEvaluatorBase[Union[str, float]]):
     _TOOL_DEFINITIONS_MISSING_MESSAGE = "Tool definitions for all tool calls must be provided."
     _INVALID_SCORE_MESSAGE = "Tool call accuracy score must be between 1 and 5."

-    _LLM_SCORE_KEY = "tool_calls_success_level"
+    _LLM_SCORE_KEY = "score"

     _validator: ValidatorInterface
@@ -230,10 +231,9 @@ async def _do_eval(self, eval_input: Dict) -> Dict[str, Union[float, str]]:  # t

         # Check for intermediate response
         if _is_intermediate_response(eval_input.get("response")):
-            return self._not_applicable_result(
+            return self._return_not_applicable_result(
                 "Intermediate response. Please provide the agent's final response for evaluation.",
                 self.threshold,
-                has_details=True,
             )

         # Preprocess messages if they are lists
@@ -256,6 +256,12 @@ async def _do_eval(self, eval_input: Dict) -> Dict[str, Union[float, str]]:  # t
         prompty_output_dict = await self._flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)
         llm_output = prompty_output_dict.get("llm_output", prompty_output_dict)
         if isinstance(llm_output, dict):
+            # Handle skipped status from LLM
+            llm_status = llm_output.get("status", "completed")
+            if llm_status == "skipped":
+                reason = llm_output.get("reason", "")
+                return self._return_not_applicable_result(reason, self.threshold)
+
             score = llm_output.get(self._LLM_SCORE_KEY, None)
             if not score or not check_score_is_valid(
                 score,
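To make the new early return concrete, a small sketch (assumed payload; field names follow the prompty schema changed later in this commit): a skipped status bypasses score validation and formatting entirely.

```python
# A hypothetical llm_output payload reporting a skipped evaluation; the
# evaluator maps it straight to the unified not-applicable result shown
# in the first file instead of validating and formatting a score.
llm_output = {
    "reason": "No tool calls found in the response to evaluate.",
    "score": None,
    "status": "skipped",
    "properties": None,
}
if llm_output.get("status", "completed") == "skipped":
    print("skipped:", llm_output.get("reason", ""))
```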
@@ -271,23 +277,32 @@ async def _do_eval(self, eval_input: Dict) -> Dict[str, Union[float, str]]:  # t
             )

         # Format the output
-        reason = llm_output.get("chain_of_thought", "")
+        reason = llm_output.get("reason", "")
         score = float(score)
         score_result = "pass" if score >= self.threshold else "fail"
+        llm_properties = llm_output.get("properties", {}) or {}
+        llm_properties.update(
+            {
+                "prompt_tokens": prompty_output_dict.get("input_token_count", 0),
+                "completion_tokens": prompty_output_dict.get("output_token_count", 0),
+                "total_tokens": prompty_output_dict.get("total_token_count", 0),
+                "finish_reason": prompty_output_dict.get("finish_reason", ""),
+                "model": prompty_output_dict.get("model_id", ""),
+                "sample_input": prompty_output_dict.get("sample_input", ""),
+                "sample_output": prompty_output_dict.get("sample_output", ""),
+            }
+        )
         response_dict = {
             self._result_key: score,
+            # The "gpt_" prefixed key is maintained for backwards compatibility but is deprecated.
             f"gpt_{self._result_key}": score,
+            f"{self._result_key}_score": score,
             f"{self._result_key}_result": score_result,
-            f"{self._result_key}_threshold": self._threshold,
+            f"{self._result_key}_passed": score_result == "pass",
             f"{self._result_key}_reason": reason,
-            f"{self._result_key}_details": llm_output.get("details", {}),
-            f"{self._result_key}_prompt_tokens": prompty_output_dict.get("input_token_count", 0),
-            f"{self._result_key}_completion_tokens": prompty_output_dict.get("output_token_count", 0),
-            f"{self._result_key}_total_tokens": prompty_output_dict.get("total_token_count", 0),
-            f"{self._result_key}_finish_reason": prompty_output_dict.get("finish_reason", ""),
-            f"{self._result_key}_model": prompty_output_dict.get("model_id", ""),
-            f"{self._result_key}_sample_input": prompty_output_dict.get("sample_input", ""),
-            f"{self._result_key}_sample_output": prompty_output_dict.get("sample_output", ""),
+            f"{self._result_key}_status": "completed",
+            f"{self._result_key}_threshold": self._threshold,
+            f"{self._result_key}_properties": llm_properties,
         }
         return response_dict
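Taken together, the completed path now yields a unified result of the following shape. A sketch with illustrative values; the diagnostics that used to be top-level `_prompt_tokens`, `_model`, etc. keys now live under `_properties`, merged with the LLM-reported counts.

```python
# Sketch: unified output for a completed evaluation, score 5 against a
# threshold of 3 (illustrative values only).
response_dict = {
    "tool_call_accuracy": 5.0,
    "gpt_tool_call_accuracy": 5.0,  # deprecated, kept for backwards compatibility
    "tool_call_accuracy_score": 5.0,
    "tool_call_accuracy_result": "pass",
    "tool_call_accuracy_passed": True,
    "tool_call_accuracy_reason": "Let's think step by step: ...",
    "tool_call_accuracy_status": "completed",
    "tool_call_accuracy_threshold": 3,
    "tool_call_accuracy_properties": {
        "tool_calls_made_by_agent": 1,
        "correct_tool_calls_made_by_agent": 1,
        "prompt_tokens": 1200,
        "completion_tokens": 80,
        "total_tokens": 1280,
        "finish_reason": "stop",
        "model": "gpt-4o",
        "sample_input": "",
        "sample_output": "",
    },
}
```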

@@ -314,7 +329,7 @@ async def _real_call(self, **kwargs):
         eval_input = self._convert_kwargs_to_eval_input(**kwargs)
         if isinstance(eval_input, dict) and eval_input.get("error_message"):
             # If there is an error message, return not applicable result
-            return self._not_applicable_result(eval_input.get("error_message"), self.threshold, has_details=True)
+            return self._return_not_applicable_result(eval_input.get("error_message"), self.threshold)
         # Do the evaluation
         result = await self._do_eval(eval_input)
         # Return the result

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty

Lines changed: 18 additions & 5 deletions
@@ -54,6 +54,16 @@ Evaluate based on these factors:

 **Tool Assessment**: Focus solely on appropriate use of available tools, not on capabilities beyond what tools can provide.

+## Status: Skipped
+Before performing any evaluation, check for the following conditions. If ANY are true, return `status: "skipped"` immediately without scoring:
+1. **No tool calls to evaluate**: The TOOL CALLS TO BE EVALUATED section is empty (tool calls appearing only in the CONVERSATION section do not count).
+2. **Missing tool definitions**: Any tool call in TOOL CALLS TO BE EVALUATED references a tool that is not present in the TOOL DEFINITIONS.
+
+When skipped, return:
+```json
+{"reason": "<explain why evaluation was skipped>", "score": null, "status": "skipped", "properties": null}
+```
+

 # Ratings
 ## [Tool Call Accuracy: 1] (Irrelevant)
@@ -139,10 +149,13 @@ TOOL DEFINITIONS: {{tool_definitions}}

 # Tasks
 ## Please provide your evaluation for the assistant RESPONSE in relation to the user QUERY and tool definitions based on the Definitions and examples above.
-Your output should consist only of a JSON object, as provided in the examples, that has the following keys:
-- chain_of_thought: a string that explains your thought process to decide on the tool call accuracy level, based on the Chain of Thought structure. Start this string with 'Let's think step by step:'.
-- tool_calls_success_level: a integer value between 1 and 5 that represents the level of tool call success, based on the level definitions mentioned before. You need to be very precise when deciding on this level. Ensure you are correctly following the rating system based on the description of each level.
-- details: a dictionary that contains the following keys:
+Your output should consist only of a JSON object that has the following keys:
+- reason: a string that explains your thought process to decide on the tool call accuracy level, based on the Chain of Thought structure. Start this string with 'Let's think step by step:'. When status is "skipped", explain why the evaluation was skipped.
+- score: an integer value between 1 and 5 that represents the level of tool call success, based on the level definitions mentioned before. You need to be very precise when deciding on this level. Ensure you are correctly following the rating system based on the description of each level. Set to null when status is "skipped".
+- status: a string indicating the evaluation status. Must be one of:
+  - "completed": tool calls were present, tool definitions were available, and evaluation was performed.
+  - "skipped": evaluation was not performed because there were no tool calls to evaluate, or tool definitions were missing for the tool calls. When skipped, set score to null and properties to null.
+- properties: a dictionary that contains the following keys:
   - tool_calls_made_by_agent: total number of tool calls made by the agent
   - correct_tool_calls_made_by_agent: total number of correct tool calls made by the agent
   - per_tool_call_details: a list of dictionaries, each containing:
@@ -163,4 +176,4 @@ Your output should consist only of a JSON object, as provided in the examples, t
     - tool_name: name of the tool
     - missing_count: number of missing calls for this query

-# Output
+# Output
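For reference, a completed-status payload satisfying this schema might look as follows. A sketch; values are illustrative, and the `per_tool_call_details` entries are left empty because their full key list falls outside the lines shown in this diff.

```python
import json

# Parse and sanity-check a sample payload against the schema described above.
payload = json.loads("""
{
  "reason": "Let's think step by step: the agent called the right tool with correct parameters.",
  "score": 5,
  "status": "completed",
  "properties": {
    "tool_calls_made_by_agent": 1,
    "correct_tool_calls_made_by_agent": 1,
    "per_tool_call_details": []
  }
}
""")
assert payload["status"] in ("completed", "skipped")
assert payload["status"] == "skipped" or 1 <= payload["score"] <= 5
```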

sdk/evaluation/azure-ai-evaluation/tests/unittests/test_agent_evaluators.py

Lines changed: 3 additions & 3 deletions
@@ -66,9 +66,9 @@ def test_tool_call_accuracy_evaluator_missing_inputs(self, mock_model_config):
                 }
             ],
         )
-        assert (
-            result[ToolCallAccuracyEvaluator._RESULT_KEY] == ToolCallAccuracyEvaluator._DEFAULT_TOOL_CALL_ACCURACY_SCORE
-        )
+        assert result[f"{ToolCallAccuracyEvaluator._RESULT_KEY}_score"] is None
+        assert result[f"{ToolCallAccuracyEvaluator._RESULT_KEY}_result"] == "pass"
+        assert result[f"{ToolCallAccuracyEvaluator._RESULT_KEY}_status"] == "skipped"
         assert (
             "not applicable" in result[f"{ToolCallAccuracyEvaluator._RESULT_KEY}_reason"].lower()
             and ToolCallAccuracyEvaluator._TOOL_DEFINITIONS_MISSING_MESSAGE
