Fix LLM evaluator skip-handling: status-first guard + pre-validation by AliMahmoudzadeh · Pull Request #5116 · Azure/azureml-assets

AliMahmoudzadeh · 2026-06-05T16:24:29Z

Summary

Two related defects across the LLM-based evaluators were identified through end-to-end testing against the np-canary Foundry project (using minimal_repro_dataset.jsonl and minimal_repro_dataset_empty.jsonl):

Skip-handling fragility — coherence / fluency / retrieval post-process the dict returned by _the_super_do_eval and then run math.isnan(result.get(self._result_key, 0)). PR Fix TypeError when evaluator returns None score for skipped/not-applicable results #5107 added the _score is not None guard that prevents the production TypeError: must be real number, not NoneType, but the structure is fragile: any future override that forgets the guard re-introduces the bug. task_completion (and relevance) avoid this entire class of defect by checking status == "skipped" before any numeric coercion runs.
No input pre-validation on retrieval — empty/missing query/context is silently sent to the LLM, which returns status: skipped, demoting the row to not_applicable instead of raising a clean USER_ERROR. (groundedness/coherence/fluency/relevance already have this via self._validator.)

Changes (3 files, +43 lines)

File	Change
`coherence/evaluator/_coherence.py`	Add status-first early-return in `_do_eval` before `math.isnan`
`fluency/evaluator/_fluency.py`	Same
`retrieval/evaluator/_retrieval.py`	Same + new `_validate_required_inputs` invoked at top of `_real_call`

The status-first pattern (the canonical example lives in task_completion._parse_prompty_output):

`python
result = await self._the_super_do_eval(eval_input)

Honor explicit skip status before any numeric validation. Structurally safer

than relying solely on the None-guard below.

if result.get(f"{self._result_key}_status") == "skipped":
return result
_score = result.get(self._result_key, 0)
if _score is not None and math.isnan(_score):
raise EvaluationException(...)
`

The None-guard from #5107 is retained as a backstop. The skipped path can no longer reach math.isnan at all.

Out of scope

relevance — already has both patterns inline. No change required.
groundedness — has the same outer post-check shape as coherence/fluency/retrieval and would benefit from the same W4 refactor; left out per agreed scope (this PR is the four highest-risk evaluators only).
The remaining ~14 LLM evaluators in the catalog need pre-validation as well; tracked separately.
Republishing fixed evaluators to the azureml registry (operational, not a code change).
Math-based evaluators (bleu/gleu/meteor/rouge/f1/task_navigation_efficiency) need typed input validation; tracked separately.

Validation

python scripts/validation/code_health.py -i assets/evaluators/builtin — clean.
pytest on test_coherence_evaluator_behavior.py + test_fluency_evaluator_behavior.py + test_relevance_evaluator_behavior.py — 199 passed.

Three LLM-based evaluators (coherence, fluency, retrieval) post-process the dict returned by `_the_super_do_eval` and then run `math.isnan` on the score. PR #5107 added a `_score is not None` guard to prevent `TypeError: must be real number, not NoneType` on rows where the base helper produced a skipped/not-applicable result. The None-guard works, but it is fragile: any future override that forgets the `is not None` check re-introduces the bug. `task_completion` avoids this entire failure class by checking `status == "skipped"` *before* any numeric coercion runs (gold-standard pattern). This change adopts that ordering for coherence, fluency, and retrieval: result = await self._the_super_do_eval(eval_input) if result.get(f"{self._result_key}_status") == "skipped": return result _score = result.get(self._result_key, 0) if _score is not None and math.isnan(_score): raise EvaluationException(...) The None-guard is retained as a backstop. Now `math.isnan` is structurally unreachable on the skipped path. Additionally, `retrieval` previously had no input pre-validation: empty or missing `query`/`context` would be silently sent to the LLM, which would return `status: skipped`, demoting the row to `not_applicable` with no proactive USER_ERROR signal. This change adds a `_validate_required_inputs` helper invoked at the top of `_real_call` that raises `EvaluationException(USER_ERROR, MISSING_FIELD)` when `conversation` is absent and either `query` or `context` is empty, matching the pattern already used by `groundedness`, `coherence`/`fluency`/`relevance` (via `self._validator`). No changes needed for `relevance`: its `_do_eval` already implements the status-first pattern inline, and `_real_call` already invokes `self._validator.validate_eval_input`. Behavior tests (199) for coherence, fluency, relevance all pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-06-05T16:26:56Z

Test Results for assets-test

139 tests 139 ✅ 4s ⏱️
2 suites 0 💤
2 files 0 ❌

Results for commit 68947a7.

♻️ This comment has been updated with latest results.

AliMahmoudzadeh requested review from a team as code owners June 5, 2026 16:24

AliMahmoudzadeh temporarily deployed to Testing June 5, 2026 16:25 — with GitHub Actions Inactive

AliMahmoudzadeh temporarily deployed to Testing June 5, 2026 16:26 — with GitHub Actions Inactive

AliMahmoudzadeh temporarily deployed to Testing June 5, 2026 16:29 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix LLM evaluator skip-handling: status-first guard + pre-validation#5116

Fix LLM evaluator skip-handling: status-first guard + pre-validation#5116
AliMahmoudzadeh wants to merge 1 commit into
mainfrom
fix/llm-evaluator-skip-handling-status-first

AliMahmoudzadeh commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

AliMahmoudzadeh commented Jun 5, 2026

Summary

Changes (3 files, +43 lines)

Honor explicit skip status before any numeric validation. Structurally safer

than relying solely on the None-guard below.

Out of scope

Validation

Uh oh!

github-actions Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results for assets-test

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Jun 5, 2026 •

edited

Loading