Commit a394a42

feat(scoring): add metric abstractions for NEL/NMP interop
Introduces src/nemo_evaluator/scoring/contracts.py (~460 LOC), the shared
contract layer between NEL and downstream metric providers (NMP's
nemo_evaluator_sdk, third-party plugins). This implements the reshape
discussed in steps.md:

* DROPPED from Metric Protocol: metric(item, sample, trace) -> float | bool
  Per Sandy Chapman + Voytek Prazuch: redundant with compute_scores.
  Concrete classes that keep it as a private helper still satisfy the
  Protocol structurally; consumers must not rely on it.

* CHANGED signature: compute_scores now takes a single MetricInput
  (aliased to ScorerInput, NEL's native BYOB input dataclass) instead of
  a pair of item/sample dicts. Unifies function-style and object-style
  runtime inputs.

* ADDED TemplateMetric base class: subclasses declare a Pydantic config
  and implement _score(MetricInput) -> float. Default compute_scores
  wraps _score in a single-score MetricResult; default score_names
  returns [self.type]. Reduces per-metric boilerplate to ~20-30 LOC.

* ADDED @register_metric class decorator + get_metric / list_metrics
  lookup helpers. Registers classes by their 'type' identifier (read from
  Pydantic field default or plain ClassVar attribute).

* ADDED metric_as_scorer(metric) bridge: adapts an object-style Metric to
  NEL's function-style Scorer callable, so object-style metrics can
  register in NEL's _SCORER_REGISTRY without glue code. Uses a thread
  with a fresh event loop when called inside an existing loop
  (notebook-safe).
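The "thread with a fresh event loop" behaviour described for metric_as_scorer can be sketched roughly as follows. This is an illustration only, not the code in contracts.py: it assumes compute_scores is a coroutine, uses a plain dict in place of MetricInput and MetricResult, and every name ending in `_sketch` or `Sketch` is hypothetical.

```python
import asyncio
import threading
from typing import Any, Callable, Coroutine, Dict


def run_coroutine_sync(make_coro: Callable[[], Coroutine[Any, Any, Any]]) -> Any:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(make_coro())

    # A loop is already running here (for example in a notebook cell), and
    # asyncio.run would raise. Run the coroutine on a fresh loop in a helper
    # thread and block until it finishes.
    outcome: Dict[str, Any] = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(make_coro())
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


class ExactMatchSketch:
    """Hypothetical object-style metric; assumes an async compute_scores."""

    type = "exact_match"

    async def compute_scores(self, metric_input: Dict[str, str]) -> Dict[str, float]:
        matched = metric_input["response"] == metric_input["target"]
        return {self.type: float(matched)}


def metric_as_scorer_sketch(metric: Any) -> Callable[[Dict[str, str]], Dict[str, float]]:
    """Adapt an object-style metric to a function-style scorer callable."""

    def scorer(metric_input: Dict[str, str]) -> Dict[str, float]:
        return run_coroutine_sync(lambda: metric.compute_scores(metric_input))

    return scorer
```

The helper thread blocks the caller until the score is ready, which keeps the scorer synchronous at the cost of stalling the outer loop for the duration of the call.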
ERD (narrative):

  MetricInput (= ScorerInput)
    |-- consumed by --> Scorer: Callable[[MetricInput], dict]
    |-- consumed by --> Metric: Protocol(type, compute_scores, score_names)

  MetricResult (= MetricOutput)
    |<-- returned by --- Metric.compute_scores

  TemplateMetric (Pydantic BaseModel, implements Metric)
    |-- concrete base; users subclass for ~20-30 LOC metrics

Tests: 24 new tests in tests/test_scoring/test_contracts.py covering
input/output aliases, Pydantic result types with NaN serialization, all
four Protocols (Metric, CorpusMetric, MetricWithSecrets,
MetricWithPreflight), TemplateMetric default + override, register_metric /
get_metric / list_metrics, and the metric_as_scorer bridge for both
single-score and multi-score metrics. All pass.

Design rationale + migration plan for NMP Platform is in
.claude/tasks/nel_nmp_integration/approach3_design.md.

No breaking changes in NEL: this is a new module next to existing
scoring/ utilities. The breaking change lands on NMP SDK (a follow-up PR
on NVIDIA-NeMo/Platform will import these contracts and adapt SDK's
concrete metrics to the new compute_scores signature).

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
1 parent d9337e6 commit a394a42

3 files changed

Lines changed: 831 additions & 0 deletions

src/nemo_evaluator/scoring/__init__.py

Lines changed: 21 additions & 0 deletions
@@ -22,6 +22,27 @@
 
 from typing import Callable
 
+from nemo_evaluator.scoring.contracts import (
+    CorpusMetric,
+    Metric,
+    MetricInput,
+    MetricOutput,
+    MetricResult,
+    MetricScore,
+    MetricWithPreflight,
+    MetricWithSecrets,
+    RubricScoreStat,
+    RubricScoreValue,
+    Scorer,
+    ScoreStats,
+    SecretRefLike,
+    SecretResolver,
+    TemplateMetric,
+    get_metric,
+    list_metrics,
+    metric_as_scorer,
+    register_metric,
+)
 from nemo_evaluator.scoring.judge import (
     JudgeScoringConfig,
     build_judge_prompt,
