Skip to content

Route documented skip cases to status=skipped for 6 evaluators (relevance, task_adherence, tool_selection, response_completeness, intent_resolution, task_completion)#5110

Draft
imatiach-msft wants to merge 1 commit into
Azure:mainfrom
imatiach-msft:fix-skipped-validators-six-evaluators
Draft

Route documented skip cases to status=skipped for 6 evaluators (relevance, task_adherence, tool_selection, response_completeness, intent_resolution, task_completion)#5110
imatiach-msft wants to merge 1 commit into
Azure:mainfrom
imatiach-msft:fix-skipped-validators-six-evaluators

Conversation

@imatiach-msft
Copy link
Copy Markdown
Contributor

Summary

Route documented skip cases (empty/None query, response, conversation.messages, messages, tool_definitions, tool_calls) to status="skipped" for six evaluators I own:

  • relevance
  • task_adherence
  • tool_selection
  • response_completeness
  • intent_resolution
  • task_completion

Background

PR #5042 introduced status="skipped" / result="not_applicable" and the _return_not_applicable_result() helper. Each of these six evaluators already:

  • Defines _return_not_applicable_result() locally.
  • Has the Python-side intermediate-response skip path wired (_is_intermediate_response()_return_not_applicable_result()).
  • Has the LLM-prompt-driven skip path wired (prompty says "return status: skipped on empty inputs"; Python wrapper catches llm_status == "skipped"_return_not_applicable_result()).

PR #5107 added a math.isnan(None) guard for 5 other evaluators (groundedness/coherence/fluency/retrieval/similarity). My 6 don't have that exact pattern, so #5107 itself didn't help them.

The bug

For my 6 evaluators, the validate_eval_input method (called by the base PromptyEvaluatorBase._real_call before _do_eval) raises EvaluationException(USER_ERROR) for exactly the inputs the prompts say should return status="skipped". The row dies with status="error" before either skip path can fire.

Concrete raise sites:

Evaluator Raise site Message
relevance _validate_input_messages_list "{input_name} string cannot be empty", "{input_name} list cannot be empty"
intent_resolution same + _validate_tool_definitions "Tool definitions input is required but not provided"
task_adherence same (same family)
task_completion same also 'NoneType' object is not iterable (separate iteration bug — see note below)
tool_selection validate_eval_input (ToolCallsValidator) "Tool definitions input is required but not provided", "Query is a required input"
response_completeness _do_eval directly "Both ground_truth and response must be provided..."

This is the exact signal Kavitha's KQL on EvalAcaLogs is finding ("must be real number, not NoneType", "math.isnan", "TypeError" + "NoneType").

The fix

For each evaluator, add a small module-level helper:

def _documented_skip_reason(eval_input: Dict[str, Any]) -> Optional[str]:
    """Return a reason string when eval_input matches a documented skip case."""
    ...

It returns a human-readable reason if eval_input matches a documented skip case (empty/None query, response, conversation.messages, messages, tool_definitions, tool_calls — varying per evaluator).

It is called:

  1. At the top of each validate_eval_input override — suppresses the USER_ERROR raise for these specific cases (returns True immediately so _do_eval can run).
  2. At the top of each _do_eval — short-circuits to self._return_not_applicable_result(reason, self._threshold).

Non-documented validation failures (wrong type, malformed message dict, missing required fields) continue to raise as before.

Reproduction

In project ilmat-0277, with literal-string "empty query"/"empty response" inputs, all 6 evaluators correctly return status="skipped" today — proving the LLM-skip pipeline end-to-end works once the validators are bypassed. With actual "" empty strings or missing fields, the row errors instead.

KQL signal:

EvalAcaLogs
| where timestamp > ago(15d)
| where Message has "must be real number, not NoneType"
   or Message has "math.isnan"
   or Message has "TypeError" and Message has "NoneType"
| summarize Count=count() by bin(timestamp, 1d)
| order by timestamp desc

Version bumps

Each modified evaluator's spec.yaml version bumped by 1:

  • relevance: 10 → 11
  • intent_resolution: 7 → 8
  • task_adherence: 13 → 14
  • task_completion: 16 → 17
  • tool_selection: 10 → 11
  • response_completeness: 8 → 9

Notes / follow-ups (NOT in this PR)

  1. task_completion 'NoneType' object is not iterable — when messages[i].content is None, downstream iteration in _do_eval_conversation_level crashes. The validator currently rejects this with USER_ERROR, but if it slips through (e.g., conversation-level path or service-level shape), it raises a raw TypeError. Should be a separate PR with explicit None-handling.
  2. tool_call_accuracy inconsistency — outside the scope of my 6, but worth flagging: it currently returns label="pass" while status="skipped". tool_selection correctly returns label="not_applicable". Different owner should fix.
  3. Service-level validators (which reject content: null and content: "" in conversation_messages dataKind before the evaluator code even runs) are independent of this PR.

Related

Six evaluators (relevance, task_adherence, tool_selection, response_completeness,
intent_resolution, task_completion) already define the _return_not_applicable_result
helper added by PR Azure#5042 and have wired both the Python-side intermediate-response
skip path and the LLM-prompt-driven 'Status: Skipped' path. However, their
upstream validators raise EvaluationException(USER_ERROR) for exactly the inputs
the prompts say should produce status='skipped' (empty/None query, response,
conversation.messages, messages, tool_definitions, tool_calls). The row dies
with status='error' before either skip path can run.

This change adds a small per-evaluator helper _documented_skip_reason() that
returns a human-readable reason string when the input matches a documented
skip case. It is called:

  - at the top of each validate_eval_input override to suppress the USER_ERROR
    raise for these specific cases, and
  - at the top of each _do_eval to short-circuit to _return_not_applicable_result
    with status='skipped'.

Non-documented validation failures (wrong type, malformed message dict, missing
required fields) continue to raise as before. Spec.yaml version bumped by 1 for
each affected evaluator.

Reproduction (from EvalAcaLogs):
  EvalAcaLogs
  | where timestamp > ago(15d)
  | where Message has 'must be real number, not NoneType'
     or Message has 'math.isnan'
     or Message has 'TypeError' and Message has 'NoneType'
  | summarize Count=count() by bin(timestamp, 1d)

Related: PR Azure#5042 (skipped status + _return_not_applicable_result helper),
         PR Azure#5107 (math.isnan(None) guard for 5 other evaluators).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 4, 2026

Test Results for assets-test

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 072b020.

logger = logging.getLogger(__name__)


def _is_empty_input_value(value: Any) -> bool:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a lot of repetition in these utils across evaluators. I wonder if we should just move these to the azure-ai-evaluation sdk first before making this fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant