feat(ci): add RunnerContext and RegressionError for experiment GH action (#1635)

wochinge · claude · web-flow · commit 3166cb8bd666 · 2026-05-04T09:32:10.000+02:00
* feat(ci): add RunnerContext and RegressionError for experiment GH action

Adds the SDK-side primitives consumed by the upcoming
`langfuse/experiment-action` GitHub Action (LFE-9241):

- `RunnerContext` wraps `Langfuse.run_experiment` with action-injected
  defaults (data, dataset_version, name, run_name, metadata). Users can
  override any default at the call site; metadata is merged, with
  user-supplied keys winning on collision.
- `RegressionError` lets users signal a CI gate failure and optionally
  pass structured `metric`/`value`/`threshold` fields so the action can
  render a callout in the PR comment.
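
The metadata precedence in the first bullet can be sketched in isolation.
This is a minimal illustration, not the SDK code itself: the helper name
`merge_metadata` and the sample keys and values are hypothetical.

```python
from typing import Dict, Optional


def merge_metadata(
    defaults: Optional[Dict[str, str]],
    overrides: Optional[Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Merge context-level defaults with call-site metadata (hypothetical helper)."""
    # When neither side supplies metadata, pass None through unchanged.
    if defaults is None and overrides is None:
        return None
    # Later unpacking wins, so call-site keys replace defaults on collision.
    return {**(defaults or {}), **(overrides or {})}


# Sample values, assumed for illustration only.
action_defaults = {"sha": "abc123", "branch": "main"}
merged = merge_metadata(action_defaults, {"branch": "feature-x", "owner": "me"})
print(merged)
```

Here `branch` takes the call-site value while `sha` survives from the
defaults, which is the collision behavior the bullet describes.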

Both live in a dedicated `langfuse/ci.py` module so the CI surface stays
isolated from the general experiment API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(experiment): move RunnerContext and RegressionError into experiment module

Relocates the CI-action primitives from the standalone `langfuse/ci.py`
module into `langfuse/experiment.py` alongside the other experiment
types. Deletes `langfuse/ci.py` and renames the tests accordingly.

The public import paths (`from langfuse import RunnerContext,
RegressionError`) are unchanged.

`CompositeEvaluatorFunction` is imported under `TYPE_CHECKING` to avoid
a circular import with `langfuse.batch_evaluation`. The
signature-drift guard now resolves the forward reference via
`typing.get_type_hints(..., localns=...)`.
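
A minimal sketch of this pattern, assuming `decimal.Decimal` as a stand-in
for `CompositeEvaluatorFunction` and a hypothetical function `run`; it only
illustrates how a type-checking-only name is resolved through `localns` and
is not the guard used in the tests.

```python
import decimal
import typing
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Visible to type checkers only; never imported at runtime.
    from decimal import Decimal


def run(amount: Optional["Decimal"] = None) -> None:
    """Hypothetical function annotated with a type-checking-only name."""


# Without help, the string annotation cannot be resolved at runtime,
# because "Decimal" is not in the module globals.
try:
    typing.get_type_hints(run)
    resolved_without_localns = True
except NameError:
    resolved_without_localns = False

# Supplying the real class through localns resolves the forward reference.
hints = typing.get_type_hints(run, localns={"Decimal": decimal.Decimal})
print(resolved_without_localns, hints["amount"])
```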

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: rename test_runner_context.py to test_experiment.py

Mirrors the module name now that RunnerContext and RegressionError
live in `langfuse.experiment`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(experiment): tighten RunnerContext + RegressionError public surface

- RunnerContext no longer carries `name` or `run_name` as context-level
  defaults. `name` is now required on every `run_experiment` call
  (supports the action's directory-of-experiments mode where each
  script must name itself). `run_name` passes straight through to
  `Langfuse.run_experiment`.
- RegressionError gains three typed `@overload` signatures (minimal,
  free-form message, structured metric/value/threshold) so type
  checkers enforce that `metric` and `value` are supplied together.
  At runtime, partial structured input falls back to the default
  message instead of rendering misleading `None` placeholders in PR
  comments.
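
The runtime fallback in the second bullet can be sketched as a standalone
function. The name `format_regression_message` is hypothetical; the branch
order and message strings mirror the `RegressionError.__init__` body in the
diff.

```python
from typing import Optional


def format_regression_message(
    metric: Optional[str] = None,
    value: Optional[float] = None,
    threshold: Optional[float] = None,
    message: Optional[str] = None,
) -> str:
    """Pick the exception text (hypothetical standalone helper)."""
    # An explicit message always takes precedence.
    if message is not None:
        return message
    # The structured form is used only when metric and value are both present.
    if metric is not None and value is not None:
        return f"Regression on `{metric}`: {value} (threshold {threshold})"
    # Partial structured input falls back to the generic message.
    return "Experiment regression detected"


print(format_regression_message(metric="acc", value=0.7, threshold=0.9))
print(format_regression_message(metric="acc"))
```

Passing only `metric` yields the generic message rather than a string with
a `None` value in it.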

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/langfuse/__init__.py b/langfuse/__init__.py
@@ -8,7 +8,7 @@
     EvaluatorStats,
     MapperFunction,
 )
-from langfuse.experiment import Evaluation
+from langfuse.experiment import Evaluation, RegressionError, RunnerContext
 
 from ._client import client as _client_module
 from ._client.attributes import LangfuseOtelSpanAttributes
@@ -63,6 +63,8 @@
     "EvaluatorStats",
     "BatchEvaluationResumeToken",
     "BatchEvaluationResult",
+    "RunnerContext",
+    "RegressionError",
     "__version__",
     "is_default_export_span",
     "is_langfuse_span",
diff --git a/langfuse/experiment.py b/langfuse/experiment.py
@@ -6,7 +6,9 @@
 """
 
 import asyncio
+from datetime import datetime
 from typing import (
+    TYPE_CHECKING,
     Any,
     Awaitable,
     Dict,
@@ -15,12 +17,17 @@
     Protocol,
     TypedDict,
     Union,
+    overload,
 )
 
 from langfuse.api import DatasetItem
 from langfuse.logger import langfuse_logger as logger
 from langfuse.types import ExperimentScoreType
 
+if TYPE_CHECKING:
+    from langfuse._client.client import Langfuse
+    from langfuse.batch_evaluation import CompositeEvaluatorFunction
+
 
 class LocalExperimentItem(TypedDict, total=False):
     """Structure for local experiment data items (not from Langfuse datasets).
@@ -1049,3 +1056,152 @@ def langfuse_evaluator(
         )
 
     return langfuse_evaluator
+
+
+class RunnerContext:
+    """Wraps :meth:`Langfuse.run_experiment` with CI-injected defaults.
+
+    Intended for use with the ``langfuse/experiment-action`` GitHub Action
+    (https://github.com/langfuse/experiment-action). The action builds a
+    ``RunnerContext`` before invoking the user's ``experiment(context)``
+    function. Defaults set here (dataset, metadata tags) are applied when
+    the user omits them on the :meth:`run_experiment` call; users can
+    override any default by passing the corresponding argument explicitly.
+    """
+
+    def __init__(
+        self,
+        *,
+        client: "Langfuse",
+        data: Optional[ExperimentData] = None,
+        dataset_version: Optional[datetime] = None,
+        metadata: Optional[Dict[str, str]] = None,
+    ):
+        """Build a ``RunnerContext`` populated with defaults for ``run_experiment``.
+
+        Typically called by the ``langfuse/experiment-action`` GitHub Action,
+        not by end users directly. Every field except ``client`` is optional.
+        Only ``data`` must then be supplied on the :meth:`run_experiment` call;
+        ``dataset_version`` and ``metadata`` may stay ``None``.
+
+        Args:
+            client: Initialized Langfuse SDK client used to execute the
+                experiment. The action creates this from the
+                ``langfuse_public_key`` / ``langfuse_secret_key`` /
+                ``langfuse_base_url`` inputs.
+            data: Default dataset items to run the experiment on. Accepts
+                either ``List[LocalExperimentItem]`` or ``List[DatasetItem]``.
+                Injected by the action when ``dataset_name`` is configured.
+                If ``None``, the user must pass ``data=`` to
+                :meth:`run_experiment`.
+            dataset_version: Optional pinned dataset version. Injected by the
+                action when ``dataset_version`` is configured.
+            metadata: Default metadata attached to every experiment trace and
+                the dataset run. The action injects GitHub-sourced tags (SHA,
+                PR link, workflow run link, branch, GH user, etc.). Merged
+                with any ``metadata`` passed to :meth:`run_experiment`, with
+                user-supplied keys winning on collision.
+        """
+        self.client = client
+        self.data = data
+        self.dataset_version = dataset_version
+        self.metadata = metadata
+
+    def run_experiment(
+        self,
+        *,
+        name: str,
+        run_name: Optional[str] = None,
+        description: Optional[str] = None,
+        data: Optional[ExperimentData] = None,
+        task: TaskFunction,
+        evaluators: List[EvaluatorFunction] = [],
+        composite_evaluator: Optional["CompositeEvaluatorFunction"] = None,
+        run_evaluators: List[RunEvaluatorFunction] = [],
+        max_concurrency: int = 50,
+        metadata: Optional[Dict[str, str]] = None,
+        _dataset_version: Optional[datetime] = None,
+    ) -> ExperimentResult:
+        resolved_data = data if data is not None else self.data
+        if resolved_data is None:
+            raise ValueError(
+                "`data` must be provided either on the RunnerContext or the run_experiment call"
+            )
+
+        resolved_dataset_version = (
+            _dataset_version if _dataset_version is not None else self.dataset_version
+        )
+
+        merged_metadata: Optional[Dict[str, str]]
+        if self.metadata is None and metadata is None:
+            merged_metadata = None
+        else:
+            merged_metadata = {**(self.metadata or {}), **(metadata or {})}
+
+        return self.client.run_experiment(
+            name=name,
+            run_name=run_name,
+            description=description,
+            data=resolved_data,
+            task=task,
+            evaluators=evaluators,
+            composite_evaluator=composite_evaluator,
+            run_evaluators=run_evaluators,
+            max_concurrency=max_concurrency,
+            metadata=merged_metadata,
+            _dataset_version=resolved_dataset_version,
+        )
+
+
+class RegressionError(Exception):
+    """Raised by a user's ``experiment`` function to signal a CI gate failure.
+
+    Intended for use with the ``langfuse/experiment-action`` GitHub Action
+    (https://github.com/langfuse/experiment-action). The action catches this
+    exception and, when ``should_fail_on_error`` is enabled, fails the
+    workflow run and renders a callout in the PR comment using
+    ``metric``/``value``/``threshold`` if supplied, otherwise ``str(exc)``.
+
+    Callers choose one of three forms:
+
+    - ``RegressionError(result=r)`` — minimal, generic message.
+    - ``RegressionError(result=r, message="...")`` — free-form message.
+    - ``RegressionError(result=r, metric="acc", value=0.7, threshold=0.9)`` —
+      structured; ``metric`` and ``value`` must be provided together so the
+      action can render a targeted callout without ``None`` placeholders.
+    """
+
+    @overload
+    def __init__(self, *, result: ExperimentResult) -> None: ...
+    @overload
+    def __init__(self, *, result: ExperimentResult, message: str) -> None: ...
+    @overload
+    def __init__(
+        self,
+        *,
+        result: ExperimentResult,
+        metric: str,
+        value: float,
+        threshold: Optional[float] = None,
+        message: Optional[str] = None,
+    ) -> None: ...
+    def __init__(
+        self,
+        *,
+        result: ExperimentResult,
+        metric: Optional[str] = None,
+        value: Optional[float] = None,
+        threshold: Optional[float] = None,
+        message: Optional[str] = None,
+    ):
+        self.result = result
+        self.metric = metric
+        self.value = value
+        self.threshold = threshold
+        if message is not None:
+            formatted = message
+        elif metric is not None and value is not None:
+            formatted = f"Regression on `{metric}`: {value} (threshold {threshold})"
+        else:
+            formatted = "Experiment regression detected"
+        super().__init__(formatted)
diff --git a/tests/unit/test_experiment.py b/tests/unit/test_experiment.py