feat(scoring): add metric abstractions for NEL/NMP interop by wprazuch · Pull Request #930 · NVIDIA-NeMo/Evaluator

wprazuch · 2026-04-23T09:49:06Z

Introduces src/nemo_evaluator/scoring/contracts.py (~460 LOC) — the shared contract layer between NEL and downstream metric providers (NMP's nemo_evaluator_sdk, third-party plugins).

This implements the reshape discussed in steps.md:

* DROPPED from Metric Protocol: metric(item, sample, trace) -> float | bool. Per Sandy Chapman + Voytek Prazuch, it is redundant with compute_scores. Concrete classes that keep it as a private helper still satisfy the Protocol structurally; consumers must not rely on it.
* CHANGED signature: compute_scores now takes a single MetricInput (aliased to ScorerInput, NEL's native BYOB input dataclass) instead of a pair of item/sample dicts. This unifies function-style and object-style runtime inputs.
* ADDED TemplateMetric base class: subclasses declare a Pydantic config and implement _score(MetricInput) -> float. The default compute_scores wraps _score in a single-score MetricResult; the default score_names returns [self.type]. This reduces per-metric boilerplate to ~20-30 LOC.
* ADDED @register_metric class decorator + get_metric / list_metrics lookup helpers. Classes are registered by their 'type' identifier (read from the Pydantic field default or a plain ClassVar attribute).
* ADDED metric_as_scorer(metric) bridge: adapts an object-style Metric to NEL's function-style Scorer callable, so object-style metrics can register in NEL's _SCORER_REGISTRY without glue code. When called inside an already running event loop, it runs the metric on a separate thread with a fresh event loop (notebook-safe).
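To illustrate how TemplateMetric and the registry helpers might fit together, here is a minimal stdlib-only sketch. It is not the code in contracts.py: plain dataclasses stand in for the Pydantic models, the MetricInput fields (response, target) are hypothetical, compute_scores is assumed to be a coroutine (the fresh-event-loop detail in the bridge suggests this), and score_names is assumed to be a method.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class MetricInput:
    """Hypothetical stand-in for ScorerInput; the real fields may differ."""
    response: str
    target: str


@dataclass
class MetricResult:
    """Hypothetical stand-in: maps each score name to its value."""
    scores: dict[str, float]


# Assumed registry layout: 'type' identifier -> metric class.
_REGISTRY: dict[str, type] = {}


def register_metric(cls: type) -> type:
    """Class decorator that indexes a metric class under its 'type'."""
    metric_type = getattr(cls, "type", None)
    if not isinstance(metric_type, str):
        raise TypeError(f"{cls.__name__} must define a string 'type'")
    _REGISTRY[metric_type] = cls
    return cls


def get_metric(name: str) -> type:
    return _REGISTRY[name]


def list_metrics() -> list[str]:
    return sorted(_REGISTRY)


class TemplateMetric:
    """Base class: subclasses set 'type' and implement _score only."""
    type: ClassVar[str]

    def _score(self, metric_input: MetricInput) -> float:
        raise NotImplementedError

    def score_names(self) -> list[str]:
        return [self.type]

    async def compute_scores(self, metric_input: MetricInput) -> MetricResult:
        # Wrap the single float in a one-entry result keyed by 'type'.
        return MetricResult(scores={self.type: float(self._score(metric_input))})


@register_metric
class ExactMatch(TemplateMetric):
    type: ClassVar[str] = "exact_match"

    def _score(self, metric_input: MetricInput) -> float:
        return float(metric_input.response.strip() == metric_input.target.strip())


metric = get_metric("exact_match")()
result = asyncio.run(metric.compute_scores(MetricInput(response="42 ", target="42")))
```

In this sketch the concrete metric is the decorator line plus a one-method class, which is the kind of reduction the ~20-30 LOC estimate refers to.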

ERD (narrative):

MetricInput (= ScorerInput)
  |-- consumed by --> Scorer: Callable[[MetricInput], dict]
  |-- consumed by --> Metric: Protocol(type, compute_scores, score_names)
MetricResult (= MetricOutput)
  |<-- returned by --- Metric.compute_scores
TemplateMetric (Pydantic BaseModel, implements Metric)
  |-- concrete base; users subclass for ~20-30 LOC metrics
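The relationship between Scorer and Metric in the ERD, and the metric_as_scorer bridge between them, can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the MetricInput fields are hypothetical, compute_scores is assumed to be a coroutine, MetricResult is assumed to expose a scores dict, and _run_coroutine_sync is a helper name invented for this example.

```python
from __future__ import annotations

import asyncio
import threading
from dataclasses import dataclass
from typing import Any, Callable, Protocol, runtime_checkable


@dataclass
class MetricInput:
    """Hypothetical stand-in for ScorerInput."""
    response: str
    target: str


@dataclass
class MetricResult:
    """Hypothetical stand-in for MetricOutput."""
    scores: dict[str, float]


@runtime_checkable
class Metric(Protocol):
    type: str

    async def compute_scores(self, metric_input: MetricInput) -> MetricResult: ...

    def score_names(self) -> list[str]: ...


Scorer = Callable[[MetricInput], dict]


def _run_coroutine_sync(coro: Any) -> Any:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread: run the coroutine directly.
        return asyncio.run(coro)

    # A loop is already running (for example in a notebook), so use a
    # worker thread, where asyncio.run creates a fresh event loop.
    outcome: dict[str, Any] = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def metric_as_scorer(metric: Metric) -> Scorer:
    """Adapt an object-style metric to a function-style scorer."""
    def scorer(metric_input: MetricInput) -> dict:
        result = _run_coroutine_sync(metric.compute_scores(metric_input))
        return dict(result.scores)
    return scorer


class LengthRatio:
    """Satisfies Metric structurally; no inheritance is needed."""
    type = "length_ratio"

    def score_names(self) -> list[str]:
        return [self.type]

    async def compute_scores(self, metric_input: MetricInput) -> MetricResult:
        ratio = len(metric_input.response) / max(len(metric_input.target), 1)
        return MetricResult(scores={self.type: ratio})


scorer = metric_as_scorer(LengthRatio())
```

Note that the worker thread blocks the caller until the metric finishes, so inside a running loop this trades concurrency for a synchronous call signature.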

Tests: 24 new tests in tests/test_scoring/test_contracts.py covering input/output aliases, Pydantic result types with NaN serialization, all four Protocols (Metric, CorpusMetric, MetricWithSecrets, MetricWithPreflight), TemplateMetric default + override, register_metric / get_metric / list_metrics, and the metric_as_scorer bridge for both single-score and multi-score metrics. All pass.
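For context on why NaN serialization needs its own tests: standard JSON has no NaN literal, and Python's json module emits a bare NaN token by default, which strict parsers reject. The description does not say how the Pydantic result types represent NaN, so the sketch below shows just one possible convention (NaN mapped to null and back) using only the standard library; scores_to_json and scores_from_json are hypothetical names.

```python
from __future__ import annotations

import json
import math


def scores_to_json(scores: dict[str, float]) -> str:
    """Serialize scores, writing NaN as null so the output is valid JSON."""
    cleaned = {
        name: (None if isinstance(value, float) and math.isnan(value) else value)
        for name, value in scores.items()
    }
    # allow_nan=False makes json.dumps raise if a NaN or infinity slips through.
    return json.dumps(cleaned, allow_nan=False)


def scores_from_json(payload: str) -> dict[str, float]:
    """Inverse of scores_to_json: null is read back as NaN."""
    return {
        name: (math.nan if value is None else float(value))
        for name, value in json.loads(payload).items()
    }


payload = scores_to_json({"accuracy": 1.0, "f1": float("nan")})
restored = scores_from_json(payload)
```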

Design rationale + migration plan for NMP Platform is in .claude/tasks/nel_nmp_integration/approach3_design.md.

No breaking changes in NEL — this is a new module next to existing scoring/ utilities. The breaking change lands on NMP SDK (a follow-up PR on NVIDIA-NeMo/Platform will import these contracts and adapt SDK's concrete metrics to the new compute_scores signature).

copy-pr-bot · 2026-04-23T09:49:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-04-23T09:49:12Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>

github-actions Bot added the tests label Apr 23, 2026

wprazuch force-pushed the wprazuch/metric-abstractions branch from a394a42 to c89d9b9 on April 23, 2026 11:46

wprazuch closed this May 4, 2026

wprazuch wants to merge 1 commit into dev/0.3.0 from wprazuch/metric-abstractions
