
feat: add shared metric contract for scorer functions #950

Draft

SandyChapman wants to merge 1 commit into dev/0.3.0 from schapman/feat/shared-metric-contract

Conversation

@SandyChapman

Expose MetricInput -> MetricResult types and adapt decorated scorers via to_metric() so Evaluator OSS scorers can share a runtime contract with platform integrations while preserving BYOB scorer compatibility.
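For orientation, here is a rough sketch of the shape such a contract could take. score_names and the async compute_scores call are visible in the diff later in this thread; every other member and field name below is an illustrative assumption, not the PR's actual definition.

```python
from typing import Any, Protocol


class MetricInput(Protocol):
    # Field names are assumptions; the PR defines the real MetricInput.
    response: str
    metadata: dict[str, Any]


class MetricResult(Protocol):
    # Assumed shape: per-score values keyed by score name.
    scores: dict[str, float]


class Metric(Protocol):
    # score_names and compute_scores appear in the quoted diff below;
    # metric_type is discussed later in the thread.
    score_names: list[str]
    metric_type: str

    async def compute_scores(self, metric_input: MetricInput) -> MetricResult: ...
```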

@SandyChapman requested a review from wprazuch on April 29, 2026 16:12

copy-pr-bot Bot commented Apr 29, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


coderabbitai Bot commented Apr 29, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 90a186e5-8aeb-4ff2-ab6f-3920a098e79a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions bot added the tests label on Apr 29, 2026
Comment on lines +262 to +279
to_metric = getattr(self._defn.scorer_fn, "to_metric", None)
if callable(to_metric):
    metric = cast(ScorerFunctionMetric[ScorerConfig], to_metric()).bind_raw_config(
        config=self._defn.extra,
        sandbox=sandbox,
        target=expected,
    )
    metric_input = _metric_input_from_verify(
        response=response,
        metadata=meta,
    )
    result = await metric.compute_scores(metric_input)
    return _metric_result_to_verify_result(
        metric=metric,
        result=result,
        benchmark_name=self._defn.name,
        response=response,
    )
Author

This is just illustrative of how the verify func could use the Metric version of the scorer.

 class BenchmarkDefinition:
     name: str
-    dataset: str | Callable[[], list[dict]]
+    dataset: str | Callable[..., list[dict[str, Any]]]
Author

There's a bit of diff noise in the PR as I address typechecking errors reported by ty and pyright.

) -> Callable[[ScorerCallable[ConfigT]], ScorerCallable[ConfigT]]: ...


def scorer(
Author

The adjustments to the scorer decorator provide a few additional features (a usage sketch follows the list):

  1. the ability to specify a schema (via a Pydantic BaseModel) that validates incoming config objects.
  2. static type safety for the Config type as well, such that the to_metric function returns a ScorerFunctionMetric whose generic parameter matches the passed config.
  3. a structured definition of outputs, which is required to support the to_metric call because score_names is a required part of the Metric protocol.
  4. metric_type, an optional label for the metric that is also needed to refer to it (for instance) via an API call or in the DB. By default it is generated if not specified, but it can be set manually in case the code gets moved or a module renamed (the generated name uses the package name).
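Here is a self-contained sketch of the behaviour described above. Everything in it is a stand-in: the decorator argument names (config_schema, outputs, metric_type), the ScorerInput fields, and the ScorerFunctionMetric shape are assumptions for illustration, not the PR's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

from pydantic import BaseModel

ConfigT = TypeVar("ConfigT", bound=BaseModel)


@dataclass
class ScorerInput(Generic[ConfigT]):
    # Stand-in for the PR's ScorerInput; field names are assumptions.
    response: str
    target: str
    config: ConfigT


class ScorerFunctionMetric(Generic[ConfigT]):
    # Stand-in for the adapter returned by to_metric().
    def __init__(self, fn: Callable, score_names: list[str], metric_type: str) -> None:
        self.fn = fn
        self.score_names = score_names
        self.metric_type = metric_type


def scorer(*, config_schema: type[ConfigT], outputs: list[str], metric_type: str | None = None):
    # Stand-in decorator illustrating points 1-4 above; not the PR's real signature.
    def wrap(fn):
        fn.config_schema = config_schema  # 1. schema used to validate incoming config
        fn.to_metric = lambda: ScorerFunctionMetric(  # 2./3. typed adapter exposing score_names
            fn,
            score_names=outputs,
            metric_type=metric_type or f"{fn.__module__}.{fn.__name__}",  # 4. generated default
        )
        return fn
    return wrap


class ExactMatchConfig(BaseModel):
    case_sensitive: bool = True


@scorer(config_schema=ExactMatchConfig, outputs=["exact_match"], metric_type="demo.exact_match")
def exact_match(inp: ScorerInput[ExactMatchConfig]) -> dict[str, float]:
    pred, ref = inp.response, inp.target
    if not inp.config.case_sensitive:
        pred, ref = pred.lower(), ref.lower()
    return {"exact_match": float(pred == ref)}
```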

Contributor

Suggestion: metric_type should become required, since the auto-generated default can introduce bugs.


 @dataclass
-class ScorerInput:
+class ScorerInput(Generic[ConfigT]):
Author

Genericize ScorerInput to allow specifying a strongly typed config object; a plain dict is still accepted.
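A tiny illustration of what that allows, reusing stand-in names (the actual field set and class layout may differ):

```python
from dataclasses import dataclass
from typing import Any, Generic, TypeVar

from pydantic import BaseModel

ConfigT = TypeVar("ConfigT")


@dataclass
class ScorerInput(Generic[ConfigT]):
    response: str    # assumed field name
    config: ConfigT  # a validated BaseModel or a plain dict


class MyConfig(BaseModel):
    threshold: float = 0.5


typed: ScorerInput[MyConfig] = ScorerInput(response="ok", config=MyConfig(threshold=0.8))
loose: ScorerInput[dict[str, Any]] = ScorerInput(response="ok", config={"threshold": 0.8})
```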

@SandyChapman force-pushed the schapman/feat/shared-metric-contract branch from 4b8f615 to dd8aeb8 on April 29, 2026 18:02
Comment thread on src/nemo_evaluator/scoring/metric.py (Outdated)
model_config = ConfigDict(extra="forbid")

scores: list[ScoreOutputSpec] = Field(min_length=1)
annotations: list[AnnotationOutputSpec] = Field(default_factory=list)
Contributor

todo: check if we can have one type here (rather than separate ScoreOutputSpec and AnnotationOutputSpec)

@SandyChapman force-pushed the schapman/feat/shared-metric-contract branch from dd8aeb8 to 62efcfa on April 30, 2026 16:07