Commit b06269f

m7md7sien and Copilot authored
Standardize Output Schema for Evalautors (#46436)
* Update Tool Call Accuracy to output unified format
* Update tests
* reformatting
* Refactor not applicable result method calls
* Fix test assertions for new unified output format and apply black formatting (#46336)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Rename tool_call_accuracy reasoning output to reason and update skipped properties handling (#46355)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Fix tool call accuracy test for skipped output schema (#46356)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Standradize Output Scheme
* Add explicit _KEY_PREFIX/_RESULT_KEY
* add missing evaluators to init
* Align evaluator unit tests with new unified output schema
* Update recordings tag to solve e2e tests
* Run formatting
* Align evaluator unit tests with unified output schema and refresh recordings
* Restore legacy `_result` and bare evaluator-name keys for backward compat
* resolve conflict
* Refresh azure-ai-evaluation test recordings for standardized evaluator output schema
* Update multimodal test assertion for new schema and refresh recordings tag
* Remove unused label assignment in navigation efficiency
  Remove assignment of match_result to additional_properties_metrics['label']
* update _return_not_applicable_result
* Return "not_applicable" instead of "pass"
* update evaluators
* Fix error
* Add results back
* undo unrelated change
* undo key_prefix change
* Revert `_evaluate.py` changes from #46436 on `mohessie/standardize_output_schema` (#46835)
* Initial plan
* Revert _evaluate.py changes from PR 46436 by restoring file from main
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8462065c-c6cf-473a-9421-84eaf0a44b5b
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
  ---------
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* update tool_selection prompty
* Fix evaluation unit tests: replace `_KEY_PREFIX` with `_RESULT_KEY` across 7 test files (#46852)
* Initial plan
* Fix evaluation unit test failures: replace _KEY_PREFIX with _RESULT_KEY and align test expectations
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/b75cef24-3217-4d44-a0ad-51d690e90035
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
  ---------
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* reformatting
* Fix rouge KeyError and inject _passed key in base evaluator
  Two fixes for failing e2e tests on standardize_output_schema PR:
  1. _rouge.py: '*_result' keys were used to index binary_results dict, but _get_binary_result() returns '*_passed' keys. Fixes 6 test_math_evaluator_rouge_score tests that failed with KeyError.
  2. _base_eval.py: _real_call post-processing now auto-injects '*_passed' boolean keys (alongside '*_result' and '*_threshold') when only '*_score' is present. Fixes 6 multimodal content-safety tests expecting 103 output columns including new '_passed' fields.
* Fix key errors
* update test records
* Update recordings
* Fix result key assignment in base prompt evaluation
* Change 'reasoning' to 'reason' in evaluation prompt
* Update _document_retrieval.py
* Update task instruction from 'reasoning' to 'reason'
* update records
* Add ndcg_score to document retrieval results
* Align evaluator metric mapping for standardized single-metric outputs (#46900)
* Initial plan
* Align evaluator metric mappings with single-metric output schema
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/d4c1d0ce-2425-4757-a52b-9fcad1734be7
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
  ---------
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
1 parent d2f0225 commit b06269f

53 files changed

Lines changed: 1180 additions & 930 deletions

sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@
 
 ### Breaking Changes
 
+- Updated `EVALUATOR_NAME_METRICS_MAPPINGS` so `document_retrieval` and `rouge_score` report single primary metrics (`document_retrieval`, `rouge`), with previous sub-metrics now represented in each evaluator's `*_properties` payload.
+
 ### Bugs Fixed
 
 - `_TaskNavigationEfficiencyEvaluator` now accepts JSON-stringified `response` and `ground_truth` inputs (e.g., from data pipelines that serialize list/tuple inputs to strings). String inputs are parsed as JSON; on parse failure the original value is preserved so downstream validation surfaces the error as before.

sdk/evaluation/azure-ai-evaluation/assets.json

Lines changed: 2 additions & 2 deletions
@@ -2,5 +2,5 @@
   "AssetsRepo": "Azure/azure-sdk-assets",
   "AssetsRepoPrefixPath": "python",
   "TagPrefix": "python/evaluation/azure-ai-evaluation",
-  "Tag": "python/evaluation/azure-ai-evaluation_0748353c8d"
-}
+  "Tag": "python/evaluation/azure-ai-evaluation_f30e4bdde3"
+}

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_constants.py

Lines changed: 2 additions & 12 deletions
@@ -103,24 +103,14 @@ class _EvaluatorMetricMapping:
     EVALUATOR_NAME_METRICS_MAPPINGS = {
         "bleu_score": ["bleu"],
         "coherence": ["coherence"],
-        "document_retrieval": [
-            "xdcg@3",
-            "ndcg@3",
-            "fidelity",
-            "top1_relevance",
-            "top3_max_relevance",
-            "holes",
-            "holes_ratio",
-            "total_retrieved_documents",
-            "total_ground_truth_documents",
-        ],
+        "document_retrieval": ["document_retrieval"],
         "f1_score": ["f1_score"],
         "fluency": ["fluency"],
         "gleu_score": ["gleu"],
         "meteor_score": ["meteor"],
         "relevance": ["relevance"],
         "response_completeness": ["response_completeness"],
-        "rouge_score": ["rouge_f1_score", "rouge_precision", "rouge_recall"],
+        "rouge_score": ["rouge"],
         "groundedness_pro": ["groundedness_pro"],
         "similarity": ["similarity"],
         "intent_resolution": ["intent_resolution"],
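
Code that used to read sub-metrics such as `rouge_precision` as top-level columns would, after this change, presumably look inside the evaluator's `*_properties` payload instead. The sketch below illustrates that lookup. The helper `read_sub_metric`, the sample row, and the key names inside the properties dict are hypothetical; the diff shown here does not include the actual payload layout.

```python
from typing import Any, Dict, Optional

# Trimmed copy of the mapping after this commit (only the two changed entries).
EVALUATOR_NAME_METRICS_MAPPINGS = {
    "document_retrieval": ["document_retrieval"],
    "rouge_score": ["rouge"],
}


def read_sub_metric(row: Dict[str, Any], evaluator: str, sub_metric: str) -> Optional[Any]:
    """Look up a former sub-metric inside the evaluator's *_properties payload.

    Hypothetical helper: assumes the payload is a flat dict keyed by sub-metric name.
    """
    primary = EVALUATOR_NAME_METRICS_MAPPINGS[evaluator][0]
    properties = row.get(f"{primary}_properties") or {}
    return properties.get(sub_metric)


# Hypothetical result row; the property key name is an assumption for illustration.
row = {"rouge": 0.5, "rouge_properties": {"rouge_precision": 0.6}}
precision = read_sub_metric(row, "rouge_score", "rouge_precision")
```

Returning `None` for a missing payload or key keeps the caller from raising on rows produced by skipped evaluations, where `*_properties` may be absent or null.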

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_bleu/_bleu.py

Lines changed: 6 additions & 1 deletion
@@ -6,9 +6,9 @@
 from typing_extensions import overload, override
 
 from azure.ai.evaluation._common.utils import nltk_tokenize
+from azure.ai.evaluation._constants import EVALUATION_PASS_FAIL_MAPPING
 
 from azure.ai.evaluation._evaluators._common import EvaluatorBase
-from azure.ai.evaluation._constants import EVALUATION_PASS_FAIL_MAPPING
 
 
 class BleuScoreEvaluator(EvaluatorBase):
@@ -87,9 +87,14 @@ async def _do_eval(self, eval_input: Dict) -> Dict[str, float]:
             binary_result = score <= self._threshold
 
         return {
+            "bleu": score,
             "bleu_score": score,
+            "bleu_passed": binary_result,
             "bleu_result": EVALUATION_PASS_FAIL_MAPPING[binary_result],
+            "bleu_reason": None,
+            "bleu_status": "completed",
             "bleu_threshold": self._threshold,
+            "bleu_properties": None,
         }
 
     @overload  # type: ignore
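
The eight keys returned by the BLEU evaluator above match the key set that `_base_prompty_eval.py` emits later in this commit. A minimal sketch of that shape follows, assuming every evaluator uses the same `<prefix>_*` naming. The helper `unified_result` and the `PASS_FAIL` dict are hypothetical stand-ins, not part of the SDK; in particular, the `"fail"` string is an assumption, since only `"pass"` appears in this diff.

```python
from typing import Any, Dict, Optional

# Stand-in for EVALUATION_PASS_FAIL_MAPPING; the "fail" value is an assumption.
PASS_FAIL = {True: "pass", False: "fail"}


def unified_result(
    prefix: str,
    score: float,
    threshold: float,
    higher_is_better: bool = True,
    reason: Optional[str] = None,
    properties: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a result dict with the eight standardized keys for one metric."""
    passed = score >= threshold if higher_is_better else score <= threshold
    return {
        prefix: score,                      # bare metric name
        f"{prefix}_score": score,           # same value under the *_score key
        f"{prefix}_passed": passed,         # boolean verdict
        f"{prefix}_result": PASS_FAIL[passed],
        f"{prefix}_reason": reason,
        f"{prefix}_status": "completed",
        f"{prefix}_threshold": threshold,
        f"{prefix}_properties": properties,
    }


bleu = unified_result("bleu", score=0.4, threshold=0.5)
```

Carrying both `bleu_passed` (boolean) and `bleu_result` (string) lets consumers that already parse the string verdict keep working while newer ones read the boolean directly.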

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_coherence/coherence.prompty

Lines changed: 8 additions & 7 deletions
@@ -10,7 +10,7 @@ model:
     presence_penalty: 0
     frequency_penalty: 0
     response_format:
-      type: text
+      type: json_object
 
 inputs:
   query:
@@ -89,11 +89,12 @@ RESPONSE: {{response}}
 
 
 # Tasks
-## Please provide your assessment Score for the previous RESPONSE in relation to the QUERY based on the Definitions above. Your output should include the following information:
-- **ThoughtChain**: To improve the reasoning process, think step by step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and start your ThoughtChain with "Let's think step by step:".
-- **Explanation**: a very short explanation of why you think the input Data should get that Score.
-- **Score**: based on your previous analysis, provide your Score. The Score you give MUST be a integer score (i.e., "1", "2"...) based on the levels of the definitions.
+## Please provide your assessment for the previous RESPONSE in relation to the QUERY based on the Definitions above.
+Your output must be a valid JSON object with exactly these keys:
+- reason: a string explaining your thought process and assessment. Start with "Let's think step by step:". When status is "skipped", explain why evaluation was skipped.
+- score: an integer value between 1 and 5 based on the level definitions above. The score you give MUST be an integer score (i.e., 1, 2...) based on the levels of the definitions. Set to null when status is "skipped".
+- status: a string indicating the evaluation status. Must be one of:
+  - "completed": evaluation was performed normally.
+  - "skipped": evaluation was not performed because the QUERY or RESPONSE is empty or not provided. When skipped, set score to null.
 
-
-## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your Score</S2>.
 # Output
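
The revised prompt asks the model for a JSON object with exactly three keys. The sketch below shows what a conforming response could look like and how a caller might check it against the rules stated in the prompt. `validate_judge_output` is a hypothetical helper written for illustration; the SDK code in this commit parses the output more leniently and does not enforce these checks.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = {"reason", "score", "status"}


def validate_judge_output(raw: str) -> Dict[str, Any]:
    """Parse a judge response and check it against the prompt's stated contract."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError("expected a JSON object with exactly: reason, score, status")
    if data["status"] == "skipped":
        # The prompt requires a null score when evaluation is skipped.
        if data["score"] is not None:
            raise ValueError("score must be null when status is 'skipped'")
    elif data["status"] == "completed":
        score = data["score"]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError("score must be an integer between 1 and 5")
    else:
        raise ValueError("status must be 'completed' or 'skipped'")
    return data


completed = json.dumps(
    {"reason": "Let's think step by step: the answer is ordered and easy to follow.", "score": 4, "status": "completed"}
)
skipped = json.dumps({"reason": "The RESPONSE is empty.", "score": None, "status": "skipped"})

ok = validate_judge_output(completed)
skip = validate_judge_output(skipped)
```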

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_eval.py

Lines changed: 35 additions & 27 deletions
@@ -619,35 +619,43 @@ async def _real_call(self, **kwargs) -> Union[DoEvalResult[T_EvalValue], Aggrega
         for eval_input in eval_input_list:
             result = await self._do_eval(eval_input)
             # logic to determine threshold pass/fail
+            # if it wasn't computed in _do_eval
             try:
-                for key in list(result.keys()):
-                    if key.endswith("_score") and "rouge" not in key:
-                        score_value = result[key]
-                        base_key = key[:-6]  # Remove "_score" suffix
-                        result_key = f"{base_key}_result"
-                        threshold_key = f"{base_key}_threshold"
-                        threshold_value = (
-                            self._threshold.get(base_key) if isinstance(self._threshold, dict) else self._threshold
-                        )
-                        if not isinstance(threshold_value, (int, float)):
-                            raise EvaluationException(
-                                "Threshold value must be a number.",
-                                internal_message=str(threshold_value),
-                                target=ErrorTarget.EVALUATE,
-                                category=ErrorCategory.INVALID_VALUE,
+                keys = list(result.keys())
+                contains_result_key = any(key.endswith("_result") for key in keys)
+                contains_threshold_key = any(key.endswith("_threshold") for key in keys)
+                if not contains_result_key or not contains_threshold_key:
+                    for key in keys:
+                        if key.endswith("_score"):
+                            score_value = result[key]
+                            base_key = key[:-6]  # Remove "_score" suffix
+                            result_key = f"{base_key}_result"
+                            threshold_key = f"{base_key}_threshold"
+                            threshold_value = (
+                                self._threshold.get(base_key) if isinstance(self._threshold, dict) else self._threshold
                             )
-
-                        result[threshold_key] = threshold_value
-                        if self._higher_is_better:
-                            if float(score_value) >= threshold_value:
-                                result[result_key] = EVALUATION_PASS_FAIL_MAPPING[True]
-                            else:
-                                result[result_key] = EVALUATION_PASS_FAIL_MAPPING[False]
-                        else:
-                            if float(score_value) <= threshold_value:
-                                result[result_key] = EVALUATION_PASS_FAIL_MAPPING[True]
-                            else:
-                                result[result_key] = EVALUATION_PASS_FAIL_MAPPING[False]
+                            if not isinstance(threshold_value, (int, float)):
+                                raise EvaluationException(
+                                    "Threshold value must be a number.",
+                                    internal_message=str(threshold_value),
+                                    target=ErrorTarget.EVALUATE,
+                                    category=ErrorCategory.INVALID_VALUE,
+                                )
+
+                            if not contains_threshold_key:
+                                result[threshold_key] = threshold_value
+
+                            if not contains_result_key:
+                                if self._higher_is_better:
+                                    if float(score_value) >= threshold_value:
+                                        result[result_key] = EVALUATION_PASS_FAIL_MAPPING[True]
+                                    else:
+                                        result[result_key] = EVALUATION_PASS_FAIL_MAPPING[False]
+                                else:
+                                    if float(score_value) <= threshold_value:
+                                        result[result_key] = EVALUATION_PASS_FAIL_MAPPING[True]
+                                    else:
+                                        result[result_key] = EVALUATION_PASS_FAIL_MAPPING[False]
             except Exception as e:
                 logger.warning(f"Error calculating binary result: {e}")
             per_turn_results.append(result)
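
The new post-processing in `_real_call` only backfills `*_result` and `*_threshold` when `_do_eval` did not already supply them. The standalone sketch below reproduces that control flow outside the class so it can be run in isolation. `backfill_pass_fail` and `PASS_FAIL` are hypothetical names, and a plain `ValueError` stands in for the SDK's `EvaluationException`.

```python
from typing import Any, Dict, Union

# Stand-in for EVALUATION_PASS_FAIL_MAPPING; the "fail" value is an assumption.
PASS_FAIL = {True: "pass", False: "fail"}


def backfill_pass_fail(
    result: Dict[str, Any],
    threshold: Union[int, float, Dict[str, float]],
    higher_is_better: bool = True,
) -> Dict[str, Any]:
    """Add *_threshold / *_result keys for each *_score key, unless already present."""
    keys = list(result.keys())
    has_result = any(k.endswith("_result") for k in keys)
    has_threshold = any(k.endswith("_threshold") for k in keys)
    if has_result and has_threshold:
        return result  # the evaluator computed everything itself

    for key in keys:
        if not key.endswith("_score"):
            continue
        base = key[: -len("_score")]
        limit = threshold.get(base) if isinstance(threshold, dict) else threshold
        if not isinstance(limit, (int, float)):
            raise ValueError("Threshold value must be a number.")
        if not has_threshold:
            result[f"{base}_threshold"] = limit
        if not has_result:
            score = float(result[key])
            passed = score >= limit if higher_is_better else score <= limit
            result[f"{base}_result"] = PASS_FAIL[passed]
    return result


filled = backfill_pass_fail({"f1_score": 0.7}, threshold=0.5)
untouched = backfill_pass_fail(
    {"bleu_score": 0.1, "bleu_result": "fail", "bleu_threshold": 0.5}, threshold=0.9
)
```

One property of this design is worth noting: the presence checks look at any key in the dict, not per metric. In a result carrying several `*_score` keys, a single existing `*_result` key therefore suppresses result backfill for all of them.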

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py

Lines changed: 67 additions & 83 deletions
@@ -2,6 +2,7 @@
 # Copyright (c) Microsoft Corporation. All rights reserved.
 # ---------------------------------------------------------
 
+import json
 import math
 import re
 import os
@@ -201,7 +202,7 @@ async def _do_eval(self, eval_input: Dict) -> Dict[str, Union[float, str]]:  # t
 
         # Check for intermediate response
         if _is_intermediate_response(eval_input.get("response")):
-            return self._not_applicable_result(
+            return self._return_not_applicable_result(
                 "Intermediate response. Please provide the agent's final response for evaluation.",
                 self._threshold,
             )
@@ -216,59 +217,83 @@ async def _do_eval(self, eval_input: Dict) -> Dict[str, Union[float, str]]:  # t
         prompty_output_dict = await self._flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)
 
         score = math.nan
+        reason = ""
+        llm_properties = {}
+
         if prompty_output_dict:
             llm_output = prompty_output_dict.get("llm_output", "")
-            input_token_count = prompty_output_dict.get("input_token_count", 0)
-            output_token_count = prompty_output_dict.get("output_token_count", 0)
-            total_token_count = prompty_output_dict.get("total_token_count", 0)
-            finish_reason = prompty_output_dict.get("finish_reason", "")
-            model_id = prompty_output_dict.get("model_id", "")
-            sample_input = prompty_output_dict.get("sample_input", "")
-            sample_output = prompty_output_dict.get("sample_output", "")
-            # Parse out score and reason from evaluators known to possess them.
-            if self._result_key in PROMPT_BASED_REASON_EVALUATORS:
-                score, reason = parse_quality_evaluator_reason_score(llm_output)
-                binary_result = self._get_binary_result(score)
-                return {
-                    self._result_key: float(score),
-                    f"gpt_{self._result_key}": float(score),
-                    f"{self._result_key}_reason": reason,
-                    f"{self._result_key}_result": binary_result,
-                    f"{self._result_key}_threshold": self._threshold,
-                    f"{self._result_key}_prompt_tokens": input_token_count,
-                    f"{self._result_key}_completion_tokens": output_token_count,
-                    f"{self._result_key}_total_tokens": total_token_count,
-                    f"{self._result_key}_finish_reason": finish_reason,
-                    f"{self._result_key}_model": model_id,
-                    f"{self._result_key}_sample_input": sample_input,
-                    f"{self._result_key}_sample_output": sample_output,
-                }
-            match = re.search(r"\d", llm_output)
-            if match:
-                score = float(match.group())
-            binary_result = self._get_binary_result(score)
+
+            # Parse JSON output from LLM
+            parsed_output = None
+            if isinstance(llm_output, dict):
+                parsed_output = llm_output
+            elif isinstance(llm_output, str):
+                try:
+                    parsed_output = json.loads(llm_output)
+                except (json.JSONDecodeError, TypeError):
+                    parsed_output = None
+
+            if parsed_output and isinstance(parsed_output, dict):
+                # Handle skipped status from LLM
+                llm_status = parsed_output.get("status", "completed")
+                if llm_status == "skipped":
+                    skip_reason = parsed_output.get("reason", "")
+                    return self._return_not_applicable_result(skip_reason, self._threshold)
+
+                score = parsed_output.get("score", math.nan)
+                reason = parsed_output.get("reason", "")
+                llm_properties = parsed_output.get("properties", {}) or {}
+            else:
+                # Fallback: try to parse legacy XML format or extract digit
+                if isinstance(llm_output, str) and self._result_key in PROMPT_BASED_REASON_EVALUATORS:
+                    score, reason = parse_quality_evaluator_reason_score(llm_output)
+                elif isinstance(llm_output, str):
+                    match = re.search(r"\d", llm_output)
+                    if match:
+                        score = float(match.group())
+
+            score = float(score) if score is not None else math.nan
+            score_result = self._get_binary_result(score)
+
+            llm_properties.update(self._get_token_metadata(prompty_output_dict))
+
             return {
-                self._result_key: float(score),
-                f"gpt_{self._result_key}": float(score),
-                f"{self._result_key}_result": binary_result,
+                self._result_key: score,
+                f"{self._result_key}_score": score,
+                f"{self._result_key}_passed": score_result == "pass",
+                f"{self._result_key}_result": score_result,
+                f"{self._result_key}_reason": reason,
+                f"{self._result_key}_status": "completed",
                 f"{self._result_key}_threshold": self._threshold,
-                f"{self._result_key}_prompt_tokens": input_token_count,
-                f"{self._result_key}_completion_tokens": output_token_count,
-                f"{self._result_key}_total_tokens": total_token_count,
-                f"{self._result_key}_finish_reason": finish_reason,
-                f"{self._result_key}_model": model_id,
-                f"{self._result_key}_sample_input": sample_input,
-                f"{self._result_key}_sample_output": sample_output,
+                f"{self._result_key}_properties": llm_properties,
             }
 
-        binary_result = self._get_binary_result(score)
         raise EvaluationException(
             message="Evaluator returned invalid output.",
             blame=ErrorBlame.SYSTEM_ERROR,
             category=ErrorCategory.FAILED_EXECUTION,
             target=ErrorTarget.EVALUATE,
         )
 
+    @staticmethod
+    def _get_token_metadata(prompty_output: Dict) -> Dict:
+        """Extract token usage and model metadata from the prompty output dict.
+
+        :param prompty_output: The raw output dictionary from the prompty flow.
+        :type prompty_output: Dict
+        :return: A dictionary with token counts, finish reason, model, and sample I/O.
+        :rtype: Dict
+        """
+        return {
+            "prompt_tokens": prompty_output.get("input_token_count", 0),
+            "completion_tokens": prompty_output.get("output_token_count", 0),
+            "total_tokens": prompty_output.get("total_token_count", 0),
+            "finish_reason": prompty_output.get("finish_reason", ""),
+            "model": prompty_output.get("model_id", ""),
+            "sample_input": prompty_output.get("sample_input", ""),
+            "sample_output": prompty_output.get("sample_output", ""),
+        }
+
     @staticmethod
     def _get_built_in_tool_definition(tool_name: str):
         """Get the definition for the built-in tool."""
@@ -401,45 +426,6 @@ def _extract_needed_tool_definitions(
 
         return needed_tool_definitions
 
-    def _not_applicable_result(
-        self, error_message: str, threshold: Union[int, float], has_details: bool = False
-    ) -> Dict[str, Union[str, int, float, Dict]]:
-        """Return a result indicating that the evaluation is not applicable.
-
-        When evaluation cannot be performed (e.g., no tool calls, missing definitions),
-        this returns the threshold value as the score with a "pass" result.
-
-        :param error_message: The error message explaining why evaluation is not applicable.
-        :type error_message: str
-        :param threshold: The threshold value for the evaluator, used as the score.
-        :type threshold: Union[int, float]
-        :param has_details: Whether to include an empty details field in the result.
-        :type has_details: bool
-        :return: A dictionary containing the result of the evaluation.
-        :rtype: Dict[str, Union[str, float, Dict]]
-        """
-        # If no tool calls were made or tool call type is not supported, return threshold as score with pass result
-        result = {
-            self._result_key: threshold,
-            f"{self._result_key}_result": "pass",
-            f"{self._result_key}_threshold": threshold,
-            f"{self._result_key}_reason": f"Not applicable: {error_message}",
-            f"{self._result_key}_prompt_tokens": 0,
-            f"{self._result_key}_completion_tokens": 0,
-            f"{self._result_key}_total_tokens": 0,
-            f"{self._result_key}_finish_reason": "",
-            f"{self._result_key}_model": "",
-            f"{self._result_key}_sample_input": "",
-            f"{self._result_key}_sample_output": "",
-        }
-
-        # Add empty details field if requested
-        if has_details:
-            result[f"{self._result_key}_details"] = {}
-
-        return result
-
-    # TODO: After all evaluators output are updated, we can remove the _not_applicable_result method and replace calls to it with _return_not_applicable_result, which returns a "skipped" status instead of "pass" to avoid confusion.
     def _return_not_applicable_result(
         self, error_message: str, threshold: Union[int, float]
     ) -> Dict[str, Union[str, float, Dict, None]]:
@@ -455,10 +441,8 @@ def _return_not_applicable_result(
         return {
             f"{self._result_key}": None,
             f"{self._result_key}_score": None,
-            # TODO: Return "not_applicable" instead of "pass" once the
-            # evaluation service accepts it as a valid result value.
-            f"{self._result_key}_result": "pass",
             f"{self._result_key}_passed": None,
+            f"{self._result_key}_result": "not_applicable",
             f"{self._result_key}_reason": f"Not applicable: {error_message}",
             f"{self._result_key}_status": "skipped",
             f"{self._result_key}_threshold": threshold,