feat(evaluation): unify validators with azureml-assets#47526
Conversation
- add DEVELOPER role, EvaluationLevel, MessagesOrQueryResponseInputValidator + level utils - support actions/expected_actions aliases in TaskNavigationEfficiencyValidator - align check_for_unsupported_tools flags in tool_call/input/output evaluators
There was a problem hiding this comment.
Pull request overview
This PR updates azure-ai-evaluation’s internal evaluator input validation layer to better align with azureml-assets naming and behavior, while expanding supported conversation roles and adding utilities for evaluation-level handling.
Changes:
- Added
DEVELOPERmessage role support and introducedEvaluationLevelplus evaluation-level utility helpers. - Added
MessagesOrQueryResponseInputValidatorto support both multi-turn (messages) and single-turn (query/response) input shapes. - Added
actions/expected_actionsaliases for task navigation efficiency inputs, and alignedcheck_for_unsupported_toolsbehavior across tool-related evaluators/validators.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_output_utilization/_tool_output_utilization.py | Enables unsupported-tool checking for tool output utilization inputs. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_input_accuracy/_tool_input_accuracy.py | Adjusts unsupported-tool checking behavior for tool input accuracy validation. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Adjusts unsupported-tool checking behavior for tool call accuracy validation. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_validation_constants.py | Adds DEVELOPER role and introduces the EvaluationLevel enum. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_task_navigation_efficiency_validator.py | Adds normalization to accept actions/expected_actions aliases. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_messages_or_query_response_validator.py | New validator supporting either messages or query/response input formats. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_evaluation_level_utils.py | New helper utilities for resolving evaluation levels and reshaping message inputs. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_conversation_validator.py | Adds developer-role validation handling and minor error-message cleanup. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/init.py | Exposes new enums/validators/utilities from the validators package. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
…luated Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
| target=self.error_target, | ||
| ) | ||
| # The final assistant message must contain text | ||
| last_content = messages[-1].get("content", "") |
There was a problem hiding this comment.
Here we assume that the last message will have a role as assistant, but that may not be the case. Can we explicitly check that the last message's role is assistant before moving on to content check?
| self._validator = ToolCallsValidator( | ||
| error_target=ErrorTarget.TOOL_CALL_ACCURACY_EVALUATOR, | ||
| check_for_unsupported_tools=True, | ||
| check_for_unsupported_tools=False, |
There was a problem hiding this comment.
This is the same change I am making in my sdk pr: https://github.com/Azure/azure-sdk-for-python/pull/47462/changes#diff-f0cd98f94f077616907714246b399d03dcc97bde3cde5dbe0ff1dac8c5253869
| error_target=ErrorTarget.TOOL_OUTPUT_UTILIZATION_EVALUATOR, optional_tool_definitions=False | ||
| error_target=ErrorTarget.TOOL_OUTPUT_UTILIZATION_EVALUATOR, | ||
| optional_tool_definitions=False, | ||
| check_for_unsupported_tools=True, |
There was a problem hiding this comment.
In assets we pass this flag check_for_unsupported_tools correctly. It would be great to create a matrix for all the built in evals with the expected inputs and outputs along with the values for these flags.
| from ._tool_definitions_validator import ToolDefinitionsValidator | ||
|
|
||
|
|
||
| class MessagesOrQueryResponseInputValidator(ToolDefinitionsValidator): |
There was a problem hiding this comment.
Let's add unit tests for these new validators.
Description
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines