feat(scoring): add metric abstractions for NEL/NMP interop #930
Closed
Introduces src/nemo_evaluator/scoring/contracts.py (~460 LOC) — the shared
contract layer between NEL and downstream metric providers (NMP's
nemo_evaluator_sdk, third-party plugins).
This implements the reshape discussed in steps.md:
* DROPPED from Metric Protocol: metric(item, sample, trace) -> float | bool
  Per Sandy Chapman + Voytek Prazuch: redundant with compute_scores.
  Concrete classes that keep it as a private helper still satisfy the
  Protocol structurally; consumers must not rely on it.
* CHANGED signature: compute_scores now takes a single MetricInput
  (aliased to ScorerInput, NEL's native BYOB input dataclass) instead
  of a pair of item/sample dicts. Unifies function-style and
  object-style runtime inputs.
* ADDED TemplateMetric base class: subclasses declare a Pydantic config
  and implement _score(MetricInput) -> float. Default compute_scores
  wraps _score in a single-score MetricResult; default score_names
  returns [self.type]. Reduces per-metric boilerplate to ~20-30 LOC.
* ADDED @register_metric class decorator + get_metric / list_metrics
  lookup helpers. Registers classes by their 'type' identifier (read
  from Pydantic field default or plain ClassVar attribute).
* ADDED metric_as_scorer(metric) bridge: adapts an object-style Metric
  to NEL's function-style Scorer callable, so object-style metrics can
  register in NEL's _SCORER_REGISTRY without glue code. Uses a thread
  with a fresh event loop when called inside an existing loop
  (notebook-safe).
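
For reviewers unfamiliar with the "thread with a fresh event loop" trick in the last bullet, here is a minimal stdlib-only sketch of how such a bridge could work. It is not the code from contracts.py. It assumes compute_scores is a coroutine and that its result can be turned into a plain dict; the names run_sync, metric_as_scorer_sketch and LengthMetric are hypothetical and exist only for this illustration.

```python
import asyncio
import threading
from typing import Any, Callable, Coroutine


def run_sync(make_coro: Callable[[], Coroutine[Any, Any, Any]]) -> Any:
    """Run a coroutine to completion from synchronous code (hypothetical helper)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread: asyncio.run is safe.
        return asyncio.run(make_coro())

    # A loop is already running (e.g. in a notebook). asyncio.run would raise
    # here, so run the coroutine in a worker thread that owns a fresh loop.
    box: dict = {}

    def worker() -> None:
        try:
            box["value"] = asyncio.run(make_coro())
        except BaseException as exc:  # re-raised in the calling thread below
            box["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join()  # blocks the caller until the score is ready
    if "error" in box:
        raise box["error"]
    return box["value"]


def metric_as_scorer_sketch(metric: Any) -> Callable[[Any], dict]:
    """Adapt an object with an async compute_scores into a plain callable."""

    def scorer(sample: Any) -> dict:
        return dict(run_sync(lambda: metric.compute_scores(sample)))

    return scorer


class LengthMetric:
    """Hypothetical metric used only to exercise the bridge."""

    type = "length"

    async def compute_scores(self, sample: str) -> dict:
        return {"length": float(len(sample))}
```

Note that in this sketch the calling thread blocks on thread.join(), so a scorer called from inside a running loop stalls that loop until the metric finishes. That is usually acceptable in a notebook, but less so in a busy async service.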
ERD (narrative):
MetricInput (= ScorerInput)
  |-- consumed by --> Scorer: Callable[[MetricInput], dict]
  |-- consumed by --> Metric: Protocol(type, compute_scores, score_names)
MetricResult (= MetricOutput)
  |<-- returned by --- Metric.compute_scores
TemplateMetric (Pydantic BaseModel, implements Metric)
  |-- concrete base; users subclass for ~20-30 LOC metrics
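
The relationships in the ERD can be sketched in code as follows. This is a dependency-free approximation, not the module itself: plain classes and dataclasses stand in for the Pydantic models, the MetricInput fields (response, target) and the MetricResult shape (a scores dict) are assumptions, compute_scores is assumed to be async, and the registry is a simplified stand-in for register_metric / get_metric / list_metrics.

```python
import asyncio
from dataclasses import dataclass
from typing import ClassVar, Protocol, runtime_checkable


@dataclass
class MetricInput:
    """Stand-in for ScorerInput; the field names are assumed for illustration."""
    response: str
    target: str


@dataclass
class MetricResult:
    """Stand-in for MetricOutput; one value per score name (assumed shape)."""
    scores: dict[str, float]


@runtime_checkable
class Metric(Protocol):
    """Structural contract: type, compute_scores, score_names."""
    type: str

    async def compute_scores(self, sample: MetricInput) -> MetricResult: ...
    def score_names(self) -> list[str]: ...


class TemplateMetric:
    """Base class: subclasses only implement _score."""
    type: ClassVar[str] = "template"

    def _score(self, sample: MetricInput) -> float:
        raise NotImplementedError

    async def compute_scores(self, sample: MetricInput) -> MetricResult:
        # Default: wrap the single float in a one-entry result keyed by type.
        return MetricResult(scores={self.type: float(self._score(sample))})

    def score_names(self) -> list[str]:
        return [self.type]


_METRICS: dict[str, type] = {}  # simplified registry keyed by 'type'


def register_metric(cls: type) -> type:
    _METRICS[cls.type] = cls
    return cls


def get_metric(name: str) -> type:
    return _METRICS[name]


def list_metrics() -> list[str]:
    return sorted(_METRICS)


@register_metric
class ExactMatch(TemplateMetric):
    """Hypothetical concrete metric: 1.0 when response equals target."""
    type: ClassVar[str] = "exact_match"

    def _score(self, sample: MetricInput) -> float:
        return float(sample.response.strip() == sample.target.strip())
```

Because Metric is a structural Protocol, ExactMatch satisfies it without inheriting from it, which is also why a class that keeps a private metric(...) helper would still conform.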
Tests: 24 new tests in tests/test_scoring/test_contracts.py covering
input/output aliases, Pydantic result types with NaN serialization, all
four Protocols (Metric, CorpusMetric, MetricWithSecrets,
MetricWithPreflight), TemplateMetric default + override, register_metric
/ get_metric / list_metrics, and the metric_as_scorer bridge for both
single-score and multi-score metrics. All pass.
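
On the NaN serialization point: this description does not say how the result types encode NaN, so the following is only one common approach, shown with the stdlib json module rather than Pydantic. The helper name scores_to_json is hypothetical. It maps NaN to null so the output stays valid JSON; by default json.dumps would emit the non-standard token NaN.

```python
import json
import math


def scores_to_json(scores: dict[str, float]) -> str:
    """Serialize scores, replacing NaN with null (assumed convention)."""
    cleaned = {
        name: (None if isinstance(value, float) and math.isnan(value) else value)
        for name, value in scores.items()
    }
    # allow_nan=False makes json.dumps raise if a NaN or infinity slips through.
    return json.dumps(cleaned, allow_nan=False)
```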
Design rationale + migration plan for NMP Platform is in
.claude/tasks/nel_nmp_integration/approach3_design.md.
No breaking changes in NEL — this is a new module next to existing
scoring/ utilities. The breaking change lands on NMP SDK (a follow-up PR
on NVIDIA-NeMo/Platform will import these contracts and adapt SDK's
concrete metrics to the new compute_scores signature).
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>