Route documented skip cases to status=skipped for 6 evaluators (relevance, task_adherence, tool_selection, response_completeness, intent_resolution, task_completion) by imatiach-msft · Pull Request #5110 · Azure/azureml-assets

imatiach-msft · 2026-06-04T19:51:28Z

Summary

Route documented skip cases (empty/None query, response, conversation.messages, messages, tool_definitions, tool_calls) to status="skipped" for six evaluators I own:

relevance
task_adherence
tool_selection
response_completeness
intent_resolution
task_completion

Background

PR #5042 introduced status="skipped" / result="not_applicable" and the _return_not_applicable_result() helper. Each of these six evaluators already:

Defines _return_not_applicable_result() locally.
Has the Python-side intermediate-response skip path wired (_is_intermediate_response() → _return_not_applicable_result()).
Has the LLM-prompt-driven skip path wired (prompty says "return status: skipped on empty inputs"; Python wrapper catches llm_status == "skipped" → _return_not_applicable_result()).

PR #5107 added a math.isnan(None) guard for 5 other evaluators (groundedness/coherence/fluency/retrieval/similarity). My 6 don't have that exact pattern, so #5107 itself didn't help them.
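For context, the failure mode that #5107 guards against can be reproduced in plain Python: `math.isnan` raises `TypeError` when given `None`. The following is a minimal sketch only; `is_missing_score` is a hypothetical name and is not the helper from #5107.

```python
import math
from typing import Optional


def is_missing_score(score: Optional[float]) -> bool:
    """Hypothetical guard: treat None and NaN as 'no score' without raising."""
    # Check for None first, because math.isnan(None) raises TypeError.
    return score is None or math.isnan(score)


# The unguarded call fails on None.
try:
    math.isnan(None)
    unguarded_raised = False
except TypeError:
    unguarded_raised = True
```

The six evaluators in this PR never reach a call like this on the skip inputs, since validation raises earlier.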

The bug

For my 6 evaluators, the validate_eval_input method (called by the base PromptyEvaluatorBase._real_call before _do_eval) raises EvaluationException(USER_ERROR) for exactly the inputs the prompts say should return status="skipped". The row dies with status="error" before either skip path can fire.

Concrete raise sites:

| Evaluator | Raise site | Message |
| --- | --- | --- |
| relevance | `_validate_input_messages_list` | `"{input_name} string cannot be empty"`, `"{input_name} list cannot be empty"` |
| intent_resolution | same + `_validate_tool_definitions` | `"Tool definitions input is required but not provided"` |
| task_adherence | same | (same family) |
| task_completion | same | also `'NoneType' object is not iterable` (separate iteration bug, see note below) |
| tool_selection | `validate_eval_input` (ToolCallsValidator) | `"Tool definitions input is required but not provided"`, `"Query is a required input"` |
| response_completeness | `_do_eval` directly | `"Both ground_truth and response must be provided..."` |

This is the exact signal Kavitha's KQL on EvalAcaLogs is finding ("must be real number, not NoneType", "math.isnan", "TypeError" + "NoneType").
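The ordering problem can be illustrated with a simplified stand-in. Everything here is hypothetical: `SketchEvaluator`, its field checks, and `run_row` (which stands in for whatever outer layer turns an exception into a row with status="error") are assumptions for illustration, not the real `PromptyEvaluatorBase` code.

```python
from typing import Any, Dict


class EvaluationException(Exception):
    """Stand-in for the SDK exception; the real class carries more fields."""


class SketchEvaluator:
    """Hypothetical, simplified evaluator showing validate-before-eval ordering."""

    def validate_eval_input(self, eval_input: Dict[str, Any]) -> bool:
        # Pre-fix behavior: an empty query is treated as a user error.
        if not eval_input.get("query"):
            raise EvaluationException("query string cannot be empty")
        return True

    def _do_eval(self, eval_input: Dict[str, Any]) -> Dict[str, Any]:
        # The skip path lives here, so it only runs if validation passed.
        if not eval_input.get("query"):
            return {"status": "skipped", "result": "not_applicable"}
        return {"status": "completed"}

    def _real_call(self, eval_input: Dict[str, Any]) -> Dict[str, Any]:
        self.validate_eval_input(eval_input)  # raises before _do_eval
        return self._do_eval(eval_input)


def run_row(evaluator: SketchEvaluator, eval_input: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical outer runner that records a raised exception as an error row."""
    try:
        return evaluator._real_call(eval_input)
    except EvaluationException as exc:
        return {"status": "error", "message": str(exc)}


row = run_row(SketchEvaluator(), {"query": "", "response": "hi"})
```

In this sketch `row["status"]` is "error" even though `_do_eval` would have returned "skipped" for the same input.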

The fix

For each evaluator, add a small module-level helper:

def _documented_skip_reason(eval_input: Dict[str, Any]) -> Optional[str]:
    """Return a reason string when eval_input matches a documented skip case."""
    ...

It returns a human-readable reason if eval_input matches a documented skip case (empty/None query, response, conversation.messages, messages, tool_definitions, tool_calls — varying per evaluator).

It is called:

At the top of each validate_eval_input override — suppresses the USER_ERROR raise for these specific cases (returns True immediately so _do_eval can run).
At the top of each _do_eval — short-circuits to self._return_not_applicable_result(reason, self._threshold).

Non-documented validation failures (wrong type, malformed message dict, missing required fields) continue to raise as before.

Reproduction

In project ilmat-0277, with literal-string "empty query"/"empty response" inputs, all 6 evaluators correctly return status="skipped" today — proving the LLM-skip pipeline end-to-end works once the validators are bypassed. With actual "" empty strings or missing fields, the row errors instead.

KQL signal:

EvalAcaLogs
| where timestamp > ago(15d)
| where Message has "must be real number, not NoneType"
   or Message has "math.isnan"
   or Message has "TypeError" and Message has "NoneType"
| summarize Count=count() by bin(timestamp, 1d)
| order by timestamp desc

Version bumps

Each modified evaluator's spec.yaml version bumped by 1:

relevance: 10 → 11
intent_resolution: 7 → 8
task_adherence: 13 → 14
task_completion: 16 → 17
tool_selection: 10 → 11
response_completeness: 8 → 9

Notes / follow-ups (NOT in this PR)

task_completion 'NoneType' object is not iterable — when messages[i].content is None, downstream iteration in _do_eval_conversation_level crashes. The validator currently rejects this with USER_ERROR, but if it slips through (e.g., conversation-level path or service-level shape), it raises a raw TypeError. Should be a separate PR with explicit None-handling.
tool_call_accuracy inconsistency — outside the scope of my 6, but worth flagging: it currently returns label="pass" while status="skipped". tool_selection correctly returns label="not_applicable". A different owner should fix this.
Service-level validators (which reject content: null and content: "" in conversation_messages dataKind before the evaluator code even runs) are independent of this PR.
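For the first follow-up, explicit None-handling might look something like the sketch below. `iter_content_items` is a hypothetical helper, and the assumed message shape (a dict whose "content" is None, a string, or a list of items) is an assumption, not taken from `_do_eval_conversation_level`.

```python
from typing import Any, Dict, List


def iter_content_items(message: Dict[str, Any]) -> List[Any]:
    """Hypothetical helper: normalize message content so iteration never hits None."""
    content = message.get("content")
    if content is None:
        # Missing or null content yields nothing instead of raising TypeError.
        return []
    if isinstance(content, str):
        # Wrap plain strings so callers do not iterate character by character.
        return [content]
    return list(content)
```

Whether a None content should then be skipped or reported is a separate decision for that PR.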

Six evaluators (relevance, task_adherence, tool_selection, response_completeness, intent_resolution, task_completion) already define the _return_not_applicable_result helper added by PR Azure#5042 and have wired both the Python-side intermediate-response skip path and the LLM-prompt-driven 'Status: Skipped' path. However, their upstream validators raise EvaluationException(USER_ERROR) for exactly the inputs the prompts say should produce status='skipped' (empty/None query, response, conversation.messages, messages, tool_definitions, tool_calls). The row dies with status='error' before either skip path can run.

This change adds a small per-evaluator helper _documented_skip_reason() that returns a human-readable reason string when the input matches a documented skip case. It is called:
- at the top of each validate_eval_input override to suppress the USER_ERROR raise for these specific cases, and
- at the top of each _do_eval to short-circuit to _return_not_applicable_result with status='skipped'.

Non-documented validation failures (wrong type, malformed message dict, missing required fields) continue to raise as before. Spec.yaml version bumped by 1 for each affected evaluator.

Reproduction (from EvalAcaLogs):

EvalAcaLogs
| where timestamp > ago(15d)
| where Message has 'must be real number, not NoneType'
   or Message has 'math.isnan'
   or Message has 'TypeError' and Message has 'NoneType'
| summarize Count=count() by bin(timestamp, 1d)

Related: PR Azure#5042 (skipped status + _return_not_applicable_result helper), PR Azure#5107 (math.isnan(None) guard for 5 other evaluators).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-06-04T19:53:17Z

Test Results for assets-test

0 tests 0 ✅ 0s ⏱️
0 suites 0 💤
0 files 0 ❌

Results for commit 072b020.

imatiach-msft · 2026-06-04T20:08:24Z

 logger = logging.getLogger(__name__)


+def _is_empty_input_value(value: Any) -> bool:


I see a lot of repetition in these utils across evaluators. I wonder if we should just move these to the azure-ai-evaluation sdk first before making this fix.

imatiach-msft temporarily deployed to Testing June 4, 2026 19:53 — with GitHub Actions Inactive

imatiach-msft wants to merge 1 commit into Azure:main from imatiach-msft:fix-skipped-validators-six-evaluators
