Route documented skip cases to status=skipped for 6 evaluators (relevance, task_adherence, tool_selection, response_completeness, intent_resolution, task_completion)#5110
Draft
imatiach-msft wants to merge 1 commit into
Conversation
Six evaluators (relevance, task_adherence, tool_selection, response_completeness, intent_resolution, task_completion) already define the _return_not_applicable_result helper added by PR Azure#5042 and have wired both the Python-side intermediate-response skip path and the LLM-prompt-driven 'Status: Skipped' path. However, their upstream validators raise EvaluationException(USER_ERROR) for exactly the inputs the prompts say should produce status='skipped' (empty/None query, response, conversation.messages, messages, tool_definitions, tool_calls). The row dies with status='error' before either skip path can run. This change adds a small per-evaluator helper _documented_skip_reason() that returns a human-readable reason string when the input matches a documented skip case. It is called: - at the top of each validate_eval_input override to suppress the USER_ERROR raise for these specific cases, and - at the top of each _do_eval to short-circuit to _return_not_applicable_result with status='skipped'. Non-documented validation failures (wrong type, malformed message dict, missing required fields) continue to raise as before. Spec.yaml version bumped by 1 for each affected evaluator. Reproduction (from EvalAcaLogs): EvalAcaLogs | where timestamp > ago(15d) | where Message has 'must be real number, not NoneType' or Message has 'math.isnan' or Message has 'TypeError' and Message has 'NoneType' | summarize Count=count() by bin(timestamp, 1d) Related: PR Azure#5042 (skipped status + _return_not_applicable_result helper), PR Azure#5107 (math.isnan(None) guard for 5 other evaluators). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Test Results for assets-test0 tests 0 ✅ 0s ⏱️ Results for commit 072b020. |
imatiach-msft
commented
Jun 4, 2026
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def _is_empty_input_value(value: Any) -> bool: |
Contributor
Author
There was a problem hiding this comment.
I see a lot of repetition in these utils across evaluators. I wonder if we should just move these to the azure-ai-evaluation sdk first before making this fix.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Route documented skip cases (empty/None query, response, conversation.messages, messages, tool_definitions, tool_calls) to
status="skipped"for six evaluators I own:relevancetask_adherencetool_selectionresponse_completenessintent_resolutiontask_completionBackground
PR #5042 introduced
status="skipped"/result="not_applicable"and the_return_not_applicable_result()helper. Each of these six evaluators already:_return_not_applicable_result()locally._is_intermediate_response()→_return_not_applicable_result()).llm_status == "skipped"→_return_not_applicable_result()).PR #5107 added a
math.isnan(None)guard for 5 other evaluators (groundedness/coherence/fluency/retrieval/similarity). My 6 don't have that exact pattern, so #5107 itself didn't help them.The bug
For my 6 evaluators, the
validate_eval_inputmethod (called by the basePromptyEvaluatorBase._real_callbefore_do_eval) raisesEvaluationException(USER_ERROR)for exactly the inputs the prompts say should returnstatus="skipped". The row dies withstatus="error"before either skip path can fire.Concrete raise sites:
_validate_input_messages_list"{input_name} string cannot be empty","{input_name} list cannot be empty"_validate_tool_definitions"Tool definitions input is required but not provided"'NoneType' object is not iterable(separate iteration bug — see note below)validate_eval_input(ToolCallsValidator)"Tool definitions input is required but not provided","Query is a required input"_do_evaldirectly"Both ground_truth and response must be provided..."This is the exact signal Kavitha's KQL on
EvalAcaLogsis finding ("must be real number, not NoneType","math.isnan","TypeError" + "NoneType").The fix
For each evaluator, add a small module-level helper:
It returns a human-readable reason if
eval_inputmatches a documented skip case (empty/None query, response, conversation.messages, messages, tool_definitions, tool_calls — varying per evaluator).It is called:
validate_eval_inputoverride — suppresses the USER_ERROR raise for these specific cases (returnsTrueimmediately so_do_evalcan run)._do_eval— short-circuits toself._return_not_applicable_result(reason, self._threshold).Non-documented validation failures (wrong type, malformed message dict, missing required fields) continue to raise as before.
Reproduction
In project
ilmat-0277, with literal-string"empty query"/"empty response"inputs, all 6 evaluators correctly returnstatus="skipped"today — proving the LLM-skip pipeline end-to-end works once the validators are bypassed. With actual""empty strings or missing fields, the row errors instead.KQL signal:
Version bumps
Each modified evaluator's
spec.yamlversion bumped by 1:Notes / follow-ups (NOT in this PR)
'NoneType' object is not iterable— whenmessages[i].contentisNone, downstream iteration in_do_eval_conversation_levelcrashes. The validator currently rejects this with USER_ERROR, but if it slips through (e.g., conversation-level path or service-level shape), it raises a rawTypeError. Should be a separate PR with explicit None-handling.tool_call_accuracyinconsistency — outside the scope of my 6, but worth flagging: it currently returnslabel="pass"whilestatus="skipped".tool_selectioncorrectly returnslabel="not_applicable". Different owner should fix.content: nullandcontent: ""inconversation_messagesdataKind before the evaluator code even runs) are independent of this PR.Related
status="skipped"+_return_not_applicable_resulthelpermath.isnan(None)guard for 5 other evaluators