feat: add shared metric contract for scorer functions #950

SandyChapman wants to merge 1 commit into dev/0.3.0
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
```python
to_metric = getattr(self._defn.scorer_fn, "to_metric", None)
if callable(to_metric):
    metric = cast(ScorerFunctionMetric[ScorerConfig], to_metric()).bind_raw_config(
        config=self._defn.extra,
        sandbox=sandbox,
        target=expected,
    )
    metric_input = _metric_input_from_verify(
        response=response,
        metadata=meta,
    )
    result = await metric.compute_scores(metric_input)
    return _metric_result_to_verify_result(
        metric=metric,
        result=result,
        benchmark_name=self._defn.name,
        response=response,
    )
```
This is just illustrative of how the verify function could use the `Metric` version of the scorer.
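For context, here is a minimal sketch of what the two adapter helpers referenced above might look like. This is a guess at their shape only: the actual `_metric_input_from_verify` and `_metric_result_to_verify_result` in this PR are not shown here, and the `MetricInput`/`MetricResult`/`VerifyResult` field names below are assumptions.

```python
from typing import Any

def _metric_input_from_verify(response: str, metadata: dict[str, Any]) -> MetricInput:
    # Repackage the legacy verify-path arguments into the shared metric contract.
    # Field names are assumptions, not the PR's actual definition.
    return MetricInput(response=response, metadata=metadata)

def _metric_result_to_verify_result(
    metric: Metric,
    result: MetricResult,
    benchmark_name: str,
    response: str,
) -> VerifyResult:
    # Map each declared score name back onto the legacy verify result shape.
    return VerifyResult(
        benchmark=benchmark_name,
        response=response,
        scores={name: result.scores[name] for name in metric.score_names},
    )
```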
```diff
 class BenchmarkDefinition:
     name: str
-    dataset: str | Callable[[], list[dict]]
+    dataset: str | Callable[..., list[dict[str, Any]]]
```
There's a bit of diff noise in the PR as I address typechecking errors reported by ty and pyright.
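To illustrate the widened annotation: a dataset callable may now take arbitrary arguments and must return `list[dict[str, Any]]`. The loader below is a made-up example, not code from this PR, and `BenchmarkDefinition` may require fields beyond the two shown in the diff.

```python
from typing import Any

def load_arithmetic(split: str = "test", limit: int | None = None) -> list[dict[str, Any]]:
    # Any signature is accepted now that the annotation uses Callable[...].
    rows: list[dict[str, Any]] = [{"prompt": "2+2?", "expected": "4"}]
    return rows[:limit]

defn = BenchmarkDefinition(name="arithmetic", dataset=load_arithmetic)
```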
```python
) -> Callable[[ScorerCallable[ConfigT]], ScorerCallable[ConfigT]]: ...


def scorer(
```
The adjustments to the `scorer` decorator provide a couple of additional features:

- the ability to specify a schema (via a Pydantic `BaseModel`) that allows validation of inputted config objects;
- static type safety of the config type as well (such that `to_metric` will return a `ScorerFunctionMetric` with the generic parameter matching the passed config);
- structured definition of outputs, which is required for supporting the `to_metric` function call, as `score_names` is a required part of the `Metric` protocol;
- `metric_type`, an optional label for the metric which is also needed to refer to it (for instance) via an API call or in the DB. By default it's generated if not specified, but we provide the ability to manually specify it in case the code gets moved or a module is renamed (as the generated name uses the package name).

A sketch of what this could look like in use follows below.
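A minimal sketch of a decorated scorer using those features. The keyword names (`config_schema`, `score_names`, `metric_type`), the `ScorerInput` field names, and the `dict[str, float]` return type are assumptions inferred from this discussion, not the PR's confirmed signature:

```python
from pydantic import BaseModel
# `scorer` and `ScorerInput` are assumed to come from the package under review.

class ExactMatchConfig(BaseModel):
    case_sensitive: bool = False

@scorer(
    config_schema=ExactMatchConfig,    # assumed kwarg: validates raw config dicts
    score_names=["exact_match"],       # assumed kwarg: required by the Metric protocol
    metric_type="mypkg.exact_match",   # assumed kwarg: stable label; generated from the package name if omitted
)
def exact_match(inp: ScorerInput[ExactMatchConfig]) -> dict[str, float]:
    # `inp.config` is a validated ExactMatchConfig, not a bare dict.
    expected, actual = inp.target, inp.response  # field names are assumptions
    if not inp.config.case_sensitive:
        expected, actual = expected.lower(), actual.lower()
    return {"exact_match": float(expected == actual)}

# exact_match.to_metric() would then yield a ScorerFunctionMetric[ExactMatchConfig].
```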
`metric_type` should become required, since the generated default can introduce bugs.
```diff
 @dataclass
-class ScorerInput:
+class ScorerInput(Generic[ConfigT]):
```
Genericize `ScorerInput` to allow specifying a strongly typed config object; `dict` is still accepted.
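Illustratively, the generic parameter lets annotations pin the config type while a plain `dict` still works (type annotations only; `ExactMatchConfig` is the hypothetical Pydantic model from the sketch above):

```python
from typing import Any

# Strongly typed: checkers know inp.config is an ExactMatchConfig.
typed_inp: ScorerInput[ExactMatchConfig]

# Backwards-compatible: an untyped dict config for BYOB scorers.
untyped_inp: ScorerInput[dict[str, Any]]
```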
Force-pushed from 4b8f615 to dd8aeb8.
```python
model_config = ConfigDict(extra="forbid")

scores: list[ScoreOutputSpec] = Field(min_length=1)
annotations: list[AnnotationOutputSpec] = Field(default_factory=list)
```
TODO: check if we can have one type.
Force-pushed from dd8aeb8 to 62efcfa.
Expose `MetricInput` -> `MetricResult` types and adapt decorated scorers via `to_metric()` so Evaluator OSS scorers can share a runtime contract with platform integrations while preserving BYOB scorer compatibility.
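As a closing sketch of the shared contract this description refers to: only `score_names` and `compute_scores` are attested in this thread, so the rest of the shape below is an assumption, not the PR's actual definition.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class MetricInput:
    # Assumed fields; the PR's actual MetricInput may differ.
    response: str
    metadata: dict[str, Any] = field(default_factory=dict)

@dataclass
class MetricResult:
    # Assumed shape: one value per declared score name.
    scores: dict[str, float]

class Metric(Protocol):
    # score_names is required by the protocol, per the discussion above.
    score_names: list[str]

    async def compute_scores(self, metric_input: MetricInput) -> MetricResult: ...
```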