Commit 3166cb8

wochinge and claude authored
feat(ci): add RunnerContext and RegressionError for experiment GH action (#1635)
* feat(ci): add RunnerContext and RegressionError for experiment GH action

  Adds the SDK-side primitives consumed by the upcoming `langfuse/experiment-action` GitHub Action (LFE-9241):

  - `RunnerContext` wraps `Langfuse.run_experiment` with action-injected defaults (data, dataset_version, name, run_name, metadata). Users can override any default on the call site; metadata is merged with user-supplied keys winning on collision.
  - `RegressionError` lets users signal a CI gate failure and optionally pass structured `metric`/`value`/`threshold` fields so the action can render a callout in the PR comment.

  Both live in a dedicated `langfuse/ci.py` module so the CI surface stays isolated from the general experiment API.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(experiment): move RunnerContext and RegressionError into experiment module

  Relocates the CI-action primitives from the standalone `langfuse/ci.py` module into `langfuse/experiment.py` alongside the other experiment types. Deletes `langfuse/ci.py` and renames the tests accordingly. The public import paths (`from langfuse import RunnerContext, RegressionError`) are unchanged.

  `CompositeEvaluatorFunction` is imported under `TYPE_CHECKING` to avoid a circular import with `langfuse.batch_evaluation`. The signature-drift guard now resolves the forward reference via `typing.get_type_hints(..., localns=...)`.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: rename test_runner_context.py to test_experiment.py

  Mirrors the module name now that RunnerContext and RegressionError live in `langfuse.experiment`.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(experiment): tighten RunnerContext + RegressionError public surface

  - RunnerContext no longer carries `name` or `run_name` as context-level defaults. `name` is now required on every `run_experiment` call (supports the action's directory-of-experiments mode where each script must name itself). `run_name` passes straight through to `Langfuse.run_experiment`.
  - RegressionError gains three typed `@overload` signatures (minimal, free-form message, structured metric/value/threshold) so type checkers enforce that `metric` and `value` are supplied together. At runtime, partial structured input falls back to the default message instead of rendering misleading `None` placeholders in PR comments.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
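The forward-reference handling mentioned in the refactor note can be illustrated with a minimal sketch. It is not the SDK's actual signature-drift guard: `heavy_module`, `CompositeFn`, `RealCompositeFn`, `target`, and `wrapper` are hypothetical stand-ins. The sketch only shows why a name imported under `TYPE_CHECKING` has to be supplied through `localns` before `typing.get_type_hints` can resolve it at runtime.

```python
from typing import TYPE_CHECKING, Callable, Optional, get_type_hints

if TYPE_CHECKING:
    # Hypothetical module: only type checkers follow this import, so it
    # cannot create an import cycle at runtime.
    from heavy_module import CompositeFn

# Stand-in for the real type, as a test could import it from its home module.
RealCompositeFn = Callable[[int], float]


def target(*, name: str, composite: Optional[RealCompositeFn] = None) -> None:
    """Hypothetical function whose signature must be mirrored."""


def wrapper(*, name: str, composite: Optional["CompositeFn"] = None) -> None:
    """Hypothetical wrapper that names the type via a forward reference."""


def resolves_without_localns() -> bool:
    """Return True if the forward reference resolves from module globals alone."""
    try:
        get_type_hints(wrapper)
    except NameError:
        return False
    return True


def signatures_match() -> bool:
    """Compare resolved annotations, supplying the missing name via localns."""
    localns = {"CompositeFn": RealCompositeFn}
    return get_type_hints(wrapper, localns=localns) == get_type_hints(target)
```

Under these assumptions the unresolved name raises `NameError` when `localns` is omitted, while the comparison succeeds once the name is mapped to the real type.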
1 parent 5ef17a0 commit 3166cb8

3 files changed

Lines changed: 407 additions & 1 deletion

langfuse/__init__.py

Lines changed: 3 additions & 1 deletion
@@ -8,7 +8,7 @@
     EvaluatorStats,
     MapperFunction,
 )
-from langfuse.experiment import Evaluation
+from langfuse.experiment import Evaluation, RegressionError, RunnerContext
 
 from ._client import client as _client_module
 from ._client.attributes import LangfuseOtelSpanAttributes
@@ -63,6 +63,8 @@
     "EvaluatorStats",
     "BatchEvaluationResumeToken",
     "BatchEvaluationResult",
+    "RunnerContext",
+    "RegressionError",
     "__version__",
     "is_default_export_span",
     "is_langfuse_span",

langfuse/experiment.py

Lines changed: 156 additions & 0 deletions
@@ -6,7 +6,9 @@
 """
 
 import asyncio
+from datetime import datetime
 from typing import (
+    TYPE_CHECKING,
     Any,
     Awaitable,
     Dict,
@@ -15,12 +17,17 @@
     Protocol,
     TypedDict,
     Union,
+    overload,
 )
 
 from langfuse.api import DatasetItem
 from langfuse.logger import langfuse_logger as logger
 from langfuse.types import ExperimentScoreType
 
+if TYPE_CHECKING:
+    from langfuse._client.client import Langfuse
+    from langfuse.batch_evaluation import CompositeEvaluatorFunction
+
 
 class LocalExperimentItem(TypedDict, total=False):
     """Structure for local experiment data items (not from Langfuse datasets).
@@ -1049,3 +1056,152 @@ def langfuse_evaluator(
         )
 
     return langfuse_evaluator
+
+
+class RunnerContext:
+    """Wraps :meth:`Langfuse.run_experiment` with CI-injected defaults.
+
+    Intended for use with the ``langfuse/experiment-action`` GitHub Action
+    (https://github.com/langfuse/experiment-action). The action builds a
+    ``RunnerContext`` before invoking the user's ``experiment(context)``
+    function. Defaults set here (dataset, metadata tags) are applied when
+    the user omits them on the :meth:`run_experiment` call; users can
+    override any default by passing the corresponding argument explicitly.
+    """
+
+    def __init__(
+        self,
+        *,
+        client: "Langfuse",
+        data: Optional[ExperimentData] = None,
+        dataset_version: Optional[datetime] = None,
+        metadata: Optional[Dict[str, str]] = None,
+    ):
+        """Build a ``RunnerContext`` populated with defaults for ``run_experiment``.
+
+        Typically called by the ``langfuse/experiment-action`` GitHub Action,
+        not by end users directly. Every field except ``client`` is optional:
+        fields left as ``None`` simply mean the corresponding argument must be
+        supplied on the :meth:`run_experiment` call.
+
+        Args:
+            client: Initialized Langfuse SDK client used to execute the
+                experiment. The action creates this from the
+                ``langfuse_public_key`` / ``langfuse_secret_key`` /
+                ``langfuse_base_url`` inputs.
+            data: Default dataset items to run the experiment on. Accepts
+                either ``List[LocalExperimentItem]`` or ``List[DatasetItem]``.
+                Injected by the action when ``dataset_name`` is configured.
+                If ``None``, the user must pass ``data=`` to
+                :meth:`run_experiment`.
+            dataset_version: Optional pinned dataset version. Injected by the
+                action when ``dataset_version`` is configured.
+            metadata: Default metadata attached to every experiment trace and
+                the dataset run. The action injects GitHub-sourced tags (SHA,
+                PR link, workflow run link, branch, GH user, etc.). Merged
+                with any ``metadata`` passed to :meth:`run_experiment`, with
+                user-supplied keys winning on collision.
+        """
+        self.client = client
+        self.data = data
+        self.dataset_version = dataset_version
+        self.metadata = metadata
+
+    def run_experiment(
+        self,
+        *,
+        name: str,
+        run_name: Optional[str] = None,
+        description: Optional[str] = None,
+        data: Optional[ExperimentData] = None,
+        task: TaskFunction,
+        evaluators: List[EvaluatorFunction] = [],
+        composite_evaluator: Optional["CompositeEvaluatorFunction"] = None,
+        run_evaluators: List[RunEvaluatorFunction] = [],
+        max_concurrency: int = 50,
+        metadata: Optional[Dict[str, str]] = None,
+        _dataset_version: Optional[datetime] = None,
+    ) -> ExperimentResult:
+        resolved_data = data if data is not None else self.data
+        if resolved_data is None:
+            raise ValueError(
+                "`data` must be provided either on the RunnerContext or the run_experiment call"
+            )
+
+        resolved_dataset_version = (
+            _dataset_version if _dataset_version is not None else self.dataset_version
+        )
+
+        merged_metadata: Optional[Dict[str, str]]
+        if self.metadata is None and metadata is None:
+            merged_metadata = None
+        else:
+            merged_metadata = {**(self.metadata or {}), **(metadata or {})}
+
+        return self.client.run_experiment(
+            name=name,
+            run_name=run_name,
+            description=description,
+            data=resolved_data,
+            task=task,
+            evaluators=evaluators,
+            composite_evaluator=composite_evaluator,
+            run_evaluators=run_evaluators,
+            max_concurrency=max_concurrency,
+            metadata=merged_metadata,
+            _dataset_version=resolved_dataset_version,
+        )
+
+
+class RegressionError(Exception):
+    """Raised by a user's ``experiment`` function to signal a CI gate failure.
+
+    Intended for use with the ``langfuse/experiment-action`` GitHub Action
+    (https://github.com/langfuse/experiment-action). The action catches this
+    exception and, when ``should_fail_on_error`` is enabled, fails the
+    workflow run and renders a callout in the PR comment using
+    ``metric``/``value``/``threshold`` if supplied, otherwise ``str(exc)``.
+
+    Callers choose one of three forms:
+
+    - ``RegressionError(result=r)`` — minimal, generic message.
+    - ``RegressionError(result=r, message="...")`` — free-form message.
+    - ``RegressionError(result=r, metric="acc", value=0.7, threshold=0.9)`` —
+      structured; ``metric`` and ``value`` must be provided together so the
+      action can render a targeted callout without ``None`` placeholders.
+    """
+
+    @overload
+    def __init__(self, *, result: ExperimentResult) -> None: ...
+    @overload
+    def __init__(self, *, result: ExperimentResult, message: str) -> None: ...
+    @overload
+    def __init__(
+        self,
+        *,
+        result: ExperimentResult,
+        metric: str,
+        value: float,
+        threshold: Optional[float] = None,
+        message: Optional[str] = None,
+    ) -> None: ...
+    def __init__(
+        self,
+        *,
+        result: ExperimentResult,
+        metric: Optional[str] = None,
+        value: Optional[float] = None,
+        threshold: Optional[float] = None,
+        message: Optional[str] = None,
+    ):
+        self.result = result
+        self.metric = metric
+        self.value = value
+        self.threshold = threshold
+        if message is not None:
+            formatted = message
+        elif metric is not None and value is not None:
+            formatted = f"Regression on `{metric}`: {value} (threshold {threshold})"
+        else:
+            formatted = "Experiment regression detected"
+        super().__init__(formatted)
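The default-resolution rules in `RunnerContext.run_experiment` can be tried without a Langfuse account using the sketch below. It assumes a hypothetical `FakeClient` that simply echoes the keyword arguments it receives, and a trimmed `SketchRunnerContext` that copies only the `data` and `metadata` handling from the diff; neither is part of the SDK.

```python
from typing import Any, Callable, Dict, List, Optional


class FakeClient:
    """Hypothetical stand-in for the Langfuse client: echoes its kwargs."""

    def run_experiment(self, **kwargs: Any) -> Dict[str, Any]:
        return kwargs


class SketchRunnerContext:
    """Trimmed illustration of the data/metadata resolution shown in the diff."""

    def __init__(
        self,
        *,
        client: FakeClient,
        data: Optional[List[Dict[str, Any]]] = None,
        metadata: Optional[Dict[str, str]] = None,
    ) -> None:
        self.client = client
        self.data = data
        self.metadata = metadata

    def run_experiment(
        self,
        *,
        name: str,
        task: Callable[..., Any],
        data: Optional[List[Dict[str, Any]]] = None,
        metadata: Optional[Dict[str, str]] = None,
    ) -> Dict[str, Any]:
        # A call-site value takes precedence over the context default.
        resolved_data = data if data is not None else self.data
        if resolved_data is None:
            raise ValueError("`data` must be provided on the context or the call")

        # Later entries win in a dict merge, so user keys override context keys.
        merged: Optional[Dict[str, str]]
        if self.metadata is None and metadata is None:
            merged = None
        else:
            merged = {**(self.metadata or {}), **(metadata or {})}

        return self.client.run_experiment(
            name=name, task=task, data=resolved_data, metadata=merged
        )


context = SketchRunnerContext(
    client=FakeClient(),
    data=[{"input": "2+2"}],
    metadata={"branch": "main", "sha": "abc123"},
)
call = context.run_experiment(
    name="demo",
    task=lambda *, item, **kwargs: item["input"],
    metadata={"branch": "feature-x"},
)
```

Here `call["metadata"]` keeps the context's `sha` but carries the call-site `branch`, and `call["data"]` falls back to the context default because no `data=` was passed.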

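The message selection in `RegressionError.__init__` is easiest to follow by exercising each branch. The sketch below copies that branching into a hypothetical `SketchRegressionError` that accepts any object as `result`, so it runs without the SDK; the `gate` helper and its `accuracy`/`minimum` parameters are likewise assumptions made for illustration.

```python
from typing import Any, Optional


class SketchRegressionError(Exception):
    """Illustrative copy of the message-selection branches from the diff."""

    def __init__(
        self,
        *,
        result: Any,
        metric: Optional[str] = None,
        value: Optional[float] = None,
        threshold: Optional[float] = None,
        message: Optional[str] = None,
    ) -> None:
        self.result = result
        self.metric = metric
        self.value = value
        self.threshold = threshold
        if message is not None:
            # An explicit message always wins.
            formatted = message
        elif metric is not None and value is not None:
            # Structured form: needs both metric and value.
            formatted = f"Regression on `{metric}`: {value} (threshold {threshold})"
        else:
            # Partial or missing structured input falls back to the default.
            formatted = "Experiment regression detected"
        super().__init__(formatted)


def gate(accuracy: float, minimum: float, result: Any) -> None:
    """Hypothetical CI gate: raise when accuracy falls below the minimum."""
    if accuracy < minimum:
        raise SketchRegressionError(
            result=result, metric="accuracy", value=accuracy, threshold=minimum
        )


minimal = str(SketchRegressionError(result=None))
structured = str(
    SketchRegressionError(result=None, metric="acc", value=0.7, threshold=0.9)
)
partial = str(SketchRegressionError(result=None, metric="acc", threshold=0.9))
no_threshold = str(SketchRegressionError(result=None, metric="acc", value=0.7))
```

One detail that follows from the f-string as written: when `metric` and `value` are given but `threshold` is omitted, the formatted message contains the literal text `threshold None`, so callers who want a clean message may prefer to pass `threshold` or an explicit `message`.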