Fix LLM evaluator skip-handling: status-first guard + pre-validation#5116
Open
AliMahmoudzadeh wants to merge 1 commit into
Open
Fix LLM evaluator skip-handling: status-first guard + pre-validation#5116AliMahmoudzadeh wants to merge 1 commit into
AliMahmoudzadeh wants to merge 1 commit into
Conversation
Three LLM-based evaluators (coherence, fluency, retrieval) post-process the dict returned by `_the_super_do_eval` and then run `math.isnan` on the score. PR #5107 added a `_score is not None` guard to prevent `TypeError: must be real number, not NoneType` on rows where the base helper produced a skipped/not-applicable result. The None-guard works, but it is fragile: any future override that forgets the `is not None` check re-introduces the bug. `task_completion` avoids this entire failure class by checking `status == "skipped"` *before* any numeric coercion runs (gold-standard pattern). This change adopts that ordering for coherence, fluency, and retrieval: result = await self._the_super_do_eval(eval_input) if result.get(f"{self._result_key}_status") == "skipped": return result _score = result.get(self._result_key, 0) if _score is not None and math.isnan(_score): raise EvaluationException(...) The None-guard is retained as a backstop. Now `math.isnan` is structurally unreachable on the skipped path. Additionally, `retrieval` previously had no input pre-validation: empty or missing `query`/`context` would be silently sent to the LLM, which would return `status: skipped`, demoting the row to `not_applicable` with no proactive USER_ERROR signal. This change adds a `_validate_required_inputs` helper invoked at the top of `_real_call` that raises `EvaluationException(USER_ERROR, MISSING_FIELD)` when `conversation` is absent and either `query` or `context` is empty, matching the pattern already used by `groundedness`, `coherence`/`fluency`/`relevance` (via `self._validator`). No changes needed for `relevance`: its `_do_eval` already implements the status-first pattern inline, and `_real_call` already invokes `self._validator.validate_eval_input`. Behavior tests (199) for coherence, fluency, relevance all pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Test Results for assets-test139 tests 139 ✅ 4s ⏱️ Results for commit 68947a7. ♻️ This comment has been updated with latest results. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two related defects across the LLM-based evaluators were identified through end-to-end testing against the
np-canaryFoundry project (using minimal_repro_dataset.jsonl and minimal_repro_dataset_empty.jsonl):Skip-handling fragility —
coherence/fluency/retrievalpost-process the dict returned by_the_super_do_evaland then runmath.isnan(result.get(self._result_key, 0)). PR Fix TypeError when evaluator returns None score for skipped/not-applicable results #5107 added the_score is not Noneguard that prevents the productionTypeError: must be real number, not NoneType, but the structure is fragile: any future override that forgets the guard re-introduces the bug.task_completion(andrelevance) avoid this entire class of defect by checkingstatus == "skipped"before any numeric coercion runs.No input pre-validation on
retrieval— empty/missingquery/contextis silently sent to the LLM, which returnsstatus: skipped, demoting the row tonot_applicableinstead of raising a cleanUSER_ERROR. (groundedness/coherence/fluency/relevancealready have this viaself._validator.)Changes (3 files, +43 lines)
coherence/evaluator/_coherence.py_do_evalbeforemath.isnanfluency/evaluator/_fluency.pyretrieval/evaluator/_retrieval.py_validate_required_inputsinvoked at top of_real_callThe status-first pattern (the canonical example lives in
task_completion._parse_prompty_output):`python
result = await self._the_super_do_eval(eval_input)
Honor explicit skip status before any numeric validation. Structurally safer
than relying solely on the None-guard below.
if result.get(f"{self._result_key}_status") == "skipped":
return result
_score = result.get(self._result_key, 0)
if _score is not None and math.isnan(_score):
raise EvaluationException(...)
`
The None-guard from #5107 is retained as a backstop. The skipped path can no longer reach
math.isnanat all.Out of scope
relevance— already has both patterns inline. No change required.groundedness— has the same outer post-check shape ascoherence/fluency/retrievaland would benefit from the same W4 refactor; left out per agreed scope (this PR is the four highest-risk evaluators only).azuremlregistry (operational, not a code change).bleu/gleu/meteor/rouge/f1/task_navigation_efficiency) need typed input validation; tracked separately.Validation
python scripts/validation/code_health.py -i assets/evaluators/builtin— clean.pytestontest_coherence_evaluator_behavior.py+test_fluency_evaluator_behavior.py+test_relevance_evaluator_behavior.py— 199 passed.