2 changes: 1 addition & 1 deletion pyproject.toml
@@ -32,7 +32,7 @@ dev = [
     "langchain-openai>=0.0.5,<0.4",
     "langchain>=1,<2",
     "langgraph>=1,<2",
-    "autoevals>=0.0.130,<0.1",
+    "autoevals>=0.0.130,<0.3",
🟣 Pre-existing: create_evaluator_from_autoevals() in experiment.py:1046 passes evaluation.score directly to Evaluation(value=...) without a None guard; autoevals 0.2.0 formally declares Score.score: float | None = None (PR #48), making this path more likely to trigger. When score is None, it propagates silently through the unenforced type annotation, then is dropped from averages by the isinstance(evaluation.value, (int, float)) check at experiment.py:562-565, resulting in silent data loss.

Extended reasoning...

What the bug is and how it manifests

In langfuse/experiment.py:1046, create_evaluator_from_autoevals() wraps an autoevals evaluator and constructs a Langfuse Evaluation object. It does so with:

return Evaluation(
    name=evaluation.name,
    value=evaluation.score,   # <-- no None check
    comment=...,
    metadata=...,
)

In autoevals 0.2.0, the Score class declares score: float | None = None with the docstring: "If the score is None, the evaluation is considered to be skipped." (introduced in autoevals PR #48 — "Updates to track the fact that Scores can be null".) When an LLM-based scorer fails to parse a response or explicitly skips evaluation, it returns score=None.
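To make the contract concrete, here is a minimal stand-in mirroring the declared shape of Score in autoevals 0.2.0 (a sketch for illustration, not the real autoevals class):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Score:
    """Stand-in mirroring the shape declared by autoevals 0.2.0 (sketch only)."""
    name: str
    # "If the score is None, the evaluation is considered to be skipped."
    score: Optional[float] = None
    metadata: dict = field(default_factory=dict)

# A scorer that cannot parse the LLM response returns a skipped score
# instead of raising:
skipped = Score(name="Factuality")
print(skipped.score is None)  # True
```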

The specific code path

  1. autoevals_evaluator() returns a Score with .score = None.
  2. Evaluation(value=None) is constructed — Python does not enforce type annotations at runtime, so this succeeds silently (see experiment.py:185: value: Union[int, float, str, bool] with no validation, just self.value = value at line 205).
  3. The Evaluation object flows into ExperimentResult.format() at lines 562–565:
    if evaluation.name == eval_name and isinstance(evaluation.value, (int, float)):
        scores.append(evaluation.value)
    Since isinstance(None, (int, float)) is False, the score is silently dropped from the averages.
  4. Additionally, if create_score(value=None) is called via _create_score_for_scope, ScoreBody (which uses CreateScoreValue = Union[float, str]) raises a Pydantic ValidationError — but this is caught and only logged in client.py's except block, further hiding the failure from the user.
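Step 2 above relies on general Python behavior: type annotations are hints only and are never checked at runtime. A minimal stand-in (not the real Langfuse class) demonstrates that construction with None succeeds and that the averaging filter then excludes the value:

```python
from typing import Union

class Evaluation:
    """Minimal stand-in for the Langfuse Evaluation class (sketch only)."""
    value: Union[int, float, str, bool]  # hint only; never enforced

    def __init__(self, name, value):
        self.name = name
        self.value = value  # plain assignment, no validation

ev = Evaluation("Factuality", None)
print(ev.value is None)                    # True -- no exception was raised

# The isinstance filter used when averaging then drops it silently:
print(isinstance(ev.value, (int, float)))  # False
```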

Why existing code does not prevent it

Evaluation.__init__ has no runtime validation. The isinstance check in format() was designed to skip string/bool values, not to handle None — there is no warning or logging when a None score is silently excluded.

What the impact would be

Users employing LLM-based autoevals scorers (e.g., Factuality or ClosedQA) may experience silent omission of scores for items where the LLM evaluation call fails. Average scores reported in ExperimentResult will be computed over fewer items than expected, skewing results without any indication that some items were excluded.
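The effect on reported averages can be sketched as follows (summarize is a hypothetical helper written for this illustration, not a Langfuse API):

```python
def summarize(scores):
    """Average only the numeric scores, as the format() filter effectively does."""
    kept = [s for s in scores if isinstance(s, (int, float))]
    return sum(kept) / len(kept), len(kept), len(scores)

# Two of four items came back with score=None from the LLM scorer:
avg, evaluated, total = summarize([0.75, None, None, 0.25])
print(f"avg={avg} over {evaluated}/{total} items")  # avg=0.5 over 2/4 items
```

The reported average looks healthy; nothing tells the user it covers only half the dataset.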

How to fix it

Add a None guard in create_evaluator_from_autoevals():

if evaluation.score is None:
    return None  # or raise, or return a special sentinel
return Evaluation(
    name=evaluation.name,
    value=evaluation.score,
    ...
)

Alternatively, log a warning and skip score creation explicitly so users are aware when evaluations are skipped.
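A sketch of the warn-and-skip variant, using hypothetical stand-ins for the real Score and Evaluation types (names and signatures here are illustrative, not the actual Langfuse API):

```python
import logging
from dataclasses import dataclass
from typing import Optional, Union

logger = logging.getLogger(__name__)

@dataclass
class Score:                 # stand-in for autoevals' Score
    name: str
    score: Optional[float] = None

@dataclass
class Evaluation:            # stand-in for Langfuse's Evaluation
    name: str
    value: Union[int, float, str, bool]

def to_evaluation(score: Score) -> Optional[Evaluation]:
    """Convert, warning loudly instead of propagating a None value."""
    if score.score is None:
        logger.warning(
            "autoevals scorer %r returned no score (skipped evaluation); "
            "excluding it from results", score.name
        )
        return None
    return Evaluation(name=score.name, value=score.score)

print(to_evaluation(Score("Factuality", 0.9)))  # Evaluation(name='Factuality', value=0.9)
print(to_evaluation(Score("Factuality", None)))  # None, after logging a warning
```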

Step-by-step proof

  1. User calls create_evaluator_from_autoevals(Factuality()) to create a Langfuse evaluator.
  2. During an experiment run, the OpenAI call inside Factuality.eval_async() fails or returns unparseable output.
  3. autoevals 0.2.0 returns Score(name="Factuality", score=None, metadata=...) instead of raising.
  4. langfuse_evaluator constructs Evaluation(name="Factuality", value=None) — no exception.
  5. ExperimentResult.format() iterates evaluations, hits isinstance(None, (int, float)) == False, silently skips the item.
  6. The printed average score for "Factuality" is computed over N-k items where k items silently failed, with no warning to the user.
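The six steps above can be condensed into a minimal reproduction (stand-in classes, not the real langfuse/autoevals types):

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Score:                      # autoevals-style result
    name: str
    score: Optional[float] = None

@dataclass
class Evaluation:                 # Langfuse-style result
    name: str
    value: Union[int, float, str, bool]

def wrap(score: Score) -> Evaluation:
    # Mirrors the unguarded conversion described above: no None check.
    return Evaluation(name=score.name, value=score.score)

# Steps 2-3: one of three items fails inside the LLM scorer, returning None:
results = [Score("Factuality", 1.0), Score("Factuality", None), Score("Factuality", 0.0)]
evaluations = [wrap(s) for s in results]      # step 4: no exception raised

kept = [e.value for e in evaluations
        if isinstance(e.value, (int, float))]  # step 5: None silently skipped
avg = sum(kept) / len(kept)

# Step 6: the average covers 2 of 3 items, with no warning anywhere.
print(f"avg={avg}, computed over {len(kept)} of {len(results)} items")
# avg=0.5, computed over 2 of 3 items
```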

Pre-existing status

The verifier refutation notes that the phrase "track the fact that Scores can be null" in PR #48 implies null scores may have been possible even in 0.0.130, and that the langfuse wrapper was never updated to handle them. This is a valid point: the bug is pre-existing in the wrapper code, and this PR does not modify experiment.py. However, autoevals 0.2.0 formally types and documents the null-score path, making it more likely to occur in practice, so this version bump is a reasonable occasion to address it.

     "opentelemetry-instrumentation-threading>=0.59b0,<1",
     "tenacity>=9.1.4",
 ]
10 changes: 5 additions & 5 deletions uv.lock

Some generated files are not rendered by default.