
chore(deps-dev): bump autoevals from 0.0.130 to 0.2.0#1621

Open
dependabot[bot] wants to merge 1 commit into main from dependabot/uv/autoevals-0.2.0

Conversation


@dependabot dependabot bot commented on behalf of github Apr 10, 2026

Bumps autoevals from 0.0.130 to 0.2.0.

Release notes

Sourced from autoevals's releases.

autoevals Python v0.2.0

What's Changed

... (truncated)

Commits
  • a5854ee chore: Publish python via trusted publishing and unify release process (#183)
  • 398ded6 Add pnpm enforcement and config (#182)
  • 443f631 Update pnpm version and use frozen lockfile (#181)
  • 110e252 chore: Publish JS package via gha trusted publishing (#180)
  • 5b4b90c chore: Pin github actions to commit (#179)
  • c52da64 Bump to gpt5 models (#169)
  • 71e61dd Filter system messages (#177)
  • 0d428fb Trace injection in python to mirror the JS implementation (#175)
  • d99a37c Add models configuration object to init() (#164)
  • d78f4ab Fix MDX parsing by escaping curly braces in JSDoc comment (#174)
  • Additional commits viewable in compare view

Dependabot compatibility score

You can trigger a rebase of this PR by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Disclaimer: Experimental PR review

Greptile Summary

This PR bumps the dev-only autoevals dependency from 0.0.130 to 0.2.0 via uv.lock. The version is already within the >=0.0.130,<0.3 range declared in pyproject.toml, so no manifest change is needed. Since autoevals is a [dependency-groups] dev dependency, production builds are unaffected.

Confidence Score: 5/5

Safe to merge — dev-only dependency bump with no production impact.

autoevals is a dev-only dependency; production builds and the published package are unaffected. The version 0.2.0 is within the already-declared constraint. The only nuance is that null scores (newly possible in 0.2.0) are not guarded in create_evaluator_from_autoevals, but that is a pre-existing style gap and not introduced by this PR.

No files require special attention.

Important Files Changed

Filename Overview
pyproject.toml No change to pyproject.toml; the existing constraint >=0.0.130,<0.3 already accommodates 0.2.0.
uv.lock Lock file updated to pin autoevals to 0.2.0 with correct hashes; dependency set now lists chevron, jsonschema, polyleven, pyyaml (openai removed as a direct autoevals dependency).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[autoevals 0.2.0\ndev dependency] -->|create_evaluator_from_autoevals| B[langfuse/experiment.py]
    B --> C[autoevals_evaluator called\nwith input/output/expected]
    C --> D{evaluation.score}
    D -->|numeric / string / bool| E[Evaluation\nname=evaluation.name\nvalue=evaluation.score]
    D -->|None\nnew in 0.2.0| F[Evaluation value=None\ntype mismatch — not enforced at runtime]
    E --> G[Returned to caller]
    F --> G

Reviews (1): Last reviewed commit: "chore(deps-dev): bump autoevals from 0.0..."

Bumps [autoevals](https://github.com/braintrustdata/autoevals) from 0.0.130 to 0.2.0.
- [Release notes](https://github.com/braintrustdata/autoevals/releases)
- [Changelog](https://github.com/braintrustdata/autoevals/blob/main/CHANGELOG.md)
- [Commits](braintrustdata/autoevals@py-0.0.130...py-0.2.0)

---
updated-dependencies:
- dependency-name: autoevals
  dependency-version: 0.2.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot added dependencies Pull requests that update a dependency file python:uv Pull requests that update python:uv code labels Apr 10, 2026
    "langchain>=1,<2",
    "langgraph>=1,<2",
-   "autoevals>=0.0.130,<0.1",
+   "autoevals>=0.0.130,<0.3",

🟣 Pre-existing: create_evaluator_from_autoevals() in experiment.py:1046 passes evaluation.score directly to Evaluation(value=...) without a None guard; autoevals 0.2.0 formally declares Score.score: float | None = None (PR #48), making this path more likely to trigger. When score is None, it propagates silently through the unenforced type annotation, then is dropped from averages by the isinstance(evaluation.value, (int, float)) check at experiment.py:562-565, resulting in silent data loss.

Extended reasoning...

What the bug is and how it manifests

In langfuse/experiment.py:1046, create_evaluator_from_autoevals() wraps an autoevals evaluator and constructs a Langfuse Evaluation object. It does so with:

return Evaluation(
    name=evaluation.name,
    value=evaluation.score,   # <-- no None check
    comment=...,
    metadata=...,
)

In autoevals 0.2.0, the Score class declares score: float | None = None with the docstring: "If the score is None, the evaluation is considered to be skipped." (introduced in autoevals PR #48 — "Updates to track the fact that Scores can be null".) When an LLM-based scorer fails to parse a response or explicitly skips evaluation, it returns score=None.
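As a rough sketch of the shape described above (an illustrative approximation, not the actual autoevals source), the 0.2.0 `Score` can be modeled as:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-in only; see the autoevals repository for the real
# Score definition referenced in PR #48.
@dataclass
class Score:
    name: str
    score: Optional[float] = None  # None means the evaluation was skipped
    metadata: dict = field(default_factory=dict)

skipped = Score(name="Factuality")  # a skipped evaluation: score is None
```

The key point is that `score=None` is now a documented, type-checked state rather than an incidental one.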

The specific code path

  1. autoevals_evaluator() returns a Score with .score = None.
  2. Evaluation(value=None) is constructed — Python does not enforce type annotations at runtime, so this succeeds silently (see experiment.py:185: value: Union[int, float, str, bool] with no validation, just self.value = value at line 205).
  3. The Evaluation object flows into ExperimentResult.format() at lines 562–565:
    if evaluation.name == eval_name and isinstance(evaluation.value, (int, float)):
        scores.append(evaluation.value)
    isinstance(None, (int, float)) is False, so the score is silently dropped from averages.
  4. Additionally, if create_score(value=None) is called via _create_score_for_scope, ScoreBody (which uses CreateScoreValue = Union[float, str]) raises a Pydantic ValidationError — but this is caught and only logged in client.py's except block, further hiding the failure from the user.
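The silent-drop path in steps 1–3 can be reproduced with minimal stand-ins (hypothetical simplifications of the `Evaluation` class and the `format()` filter in langfuse/experiment.py):

```python
from dataclasses import dataclass
from typing import Union

# Stand-in for langfuse's Evaluation: the annotation is not enforced at
# runtime, so value=None is accepted without error.
@dataclass
class Evaluation:
    name: str
    value: Union[int, float, str, bool, None]

def average_score(evaluations, eval_name):
    # Mirrors the isinstance filter described above: None never matches
    # (int, float), so skipped scores vanish from the average.
    scores = [e.value for e in evaluations
              if e.name == eval_name and isinstance(e.value, (int, float))]
    return sum(scores) / len(scores) if scores else None

evals = [
    Evaluation("Factuality", 1.0),
    Evaluation("Factuality", None),  # skipped scorer: constructed without error
    Evaluation("Factuality", 0.0),
]
avg = average_score(evals, "Factuality")  # averaged over 2 of 3 items
```

Here the reported average covers only two of the three items, with no indication that the third was dropped.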

Why existing code does not prevent it

Evaluation.__init__ has no runtime validation. The isinstance check in format() was designed to skip string/bool values, not to handle None — there is no warning or logging when a None score is silently excluded.

What the impact would be

Users employing LLM-based autoevals scorers (e.g., Factuality, ClosedQA, etc.) may experience silent omission of scores for items where the LLM evaluation call fails. Average scores reported in ExperimentResult will be computed over fewer items than expected, skewing results upward without any indication that some items were excluded.

How to fix it

Add a None guard in create_evaluator_from_autoevals():

if evaluation.score is None:
    return None  # or raise, or return a special sentinel
return Evaluation(
    name=evaluation.name,
    value=evaluation.score,
    ...
)

Alternatively, log a warning and skip score creation explicitly so users are aware when evaluations are skipped.
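A minimal sketch of that warning-based alternative (the helper name and dict shape are hypothetical, not the langfuse API):

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

def evaluation_from_score(evaluation):
    """Hypothetical guard: map an autoevals Score-like object to a plain
    dict, skipping None scores with an explicit warning."""
    if evaluation.score is None:
        logger.warning(
            "autoevals scorer %r returned score=None (skipped); "
            "no evaluation will be recorded", evaluation.name,
        )
        return None
    return {"name": evaluation.name, "value": evaluation.score}

# A skipped score now surfaces in the logs instead of vanishing silently.
skipped = evaluation_from_score(SimpleNamespace(name="Factuality", score=None))
scored = evaluation_from_score(SimpleNamespace(name="Factuality", score=0.9))
```

The caller can then filter out the `None` returns deliberately, which keeps the averaging denominator honest and visible.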

Step-by-step proof

  1. User calls create_evaluator_from_autoevals(Factuality()) to create a Langfuse evaluator.
  2. During an experiment run, the OpenAI call inside Factuality.eval_async() fails or returns unparseable output.
  3. autoevals 0.2.0 returns Score(name="Factuality", score=None, metadata=...) instead of raising.
  4. langfuse_evaluator constructs Evaluation(name="Factuality", value=None) — no exception.
  5. ExperimentResult.format() iterates evaluations, hits isinstance(None, (int, float)) == False, silently skips the item.
  6. The printed average score for "Factuality" is computed over N-k items where k items silently failed, with no warning to the user.

Pre-existing status

The verifier refutation notes that the phrase "track the fact that Scores can be null" in PR #48 implies null scores may have been possible even in 0.0.130, and the langfuse wrapper was never updated to handle them. This is a valid point: the bug is pre-existing in the wrapper code, and this PR does not modify experiment.py. However, autoevals 0.2.0 formally types and documents the null-score path, making it more likely to occur in practice, so this is a reasonable time to address it.
