feat(scoring): add metric abstractions for NEL/NMP interop #930

Closed
wprazuch wants to merge 1 commit into dev/0.3.0 from wprazuch/metric-abstractions

Conversation

@wprazuch
Contributor

Introduces src/nemo_evaluator/scoring/contracts.py (~460 LOC) — the shared contract layer between NEL and downstream metric providers (NMP's nemo_evaluator_sdk, third-party plugins).

This implements the reshape discussed in steps.md:

  • DROPPED from the Metric Protocol: metric(item, sample, trace) -> float | bool. Per Sandy Chapman and Voytek Prazuch, this method was redundant with compute_scores. Concrete classes that keep it as a private helper still satisfy the Protocol structurally; consumers must not rely on it.

  • CHANGED signature: compute_scores now takes a single MetricInput (aliased to ScorerInput, NEL's native BYOB input dataclass) instead of a pair of item/sample dicts. This unifies function-style and object-style runtime inputs.

  • ADDED TemplateMetric base class: subclasses declare a Pydantic config and implement _score(MetricInput) -> float. Default compute_scores wraps _score in a single-score MetricResult; default score_names returns [self.type]. Reduces per-metric boilerplate to ~20-30 LOC.

  • ADDED @register_metric class decorator + get_metric / list_metrics lookup helpers. Registers classes by their 'type' identifier (read from Pydantic field default or plain ClassVar attribute).

  • ADDED metric_as_scorer(metric) bridge: adapts an object-style Metric to NEL's function-style Scorer callable, so object-style metrics can register in NEL's _SCORER_REGISTRY without glue code. Uses a thread with a fresh event loop when called inside an existing loop (notebook-safe).
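Taken together, the contract surface above can be sketched in plain Python. Everything below is illustrative: the MetricInput/MetricResult fields, the registry internals, and the ExactMatch metric are assumptions standing in for the real definitions in contracts.py (which uses Pydantic models rather than dataclasses).

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

# Illustrative stand-ins; the real types live in contracts.py and are
# Pydantic models. The field names here are assumptions.
@dataclass
class MetricInput:            # aliased to ScorerInput in the real module
    response: str
    reference: str

@dataclass
class MetricResult:           # aliased to MetricOutput in the real module
    scores: dict

@runtime_checkable
class Metric(Protocol):
    type: str
    def compute_scores(self, inp: MetricInput) -> MetricResult: ...
    def score_names(self) -> list: ...

_METRIC_REGISTRY: dict = {}

def register_metric(cls):
    """Class decorator: index a metric class by its 'type' identifier."""
    _METRIC_REGISTRY[cls.type] = cls
    return cls

def get_metric(name):
    return _METRIC_REGISTRY[name]

def list_metrics():
    return sorted(_METRIC_REGISTRY)

class TemplateMetric:
    """Concrete base: subclasses declare 'type' and implement _score only."""
    type = "template"

    def _score(self, inp: MetricInput) -> float:
        raise NotImplementedError

    def compute_scores(self, inp: MetricInput) -> MetricResult:
        # Default: wrap the single score under the metric's type name.
        return MetricResult(scores={self.type: self._score(inp)})

    def score_names(self):
        return [self.type]

@register_metric
class ExactMatch(TemplateMetric):   # hypothetical ~10-LOC metric
    type = "exact_match"

    def _score(self, inp: MetricInput) -> float:
        return float(inp.response.strip() == inp.reference.strip())

m = get_metric("exact_match")()
result = m.compute_scores(MetricInput(response="42", reference="42"))
print(result.scores)             # {'exact_match': 1.0}
print(isinstance(m, Metric))     # True -- structural, no inheritance needed
```

Note that ExactMatch never inherits from Metric; the Protocol is satisfied structurally, which is what lets third-party plugins implement the contract without importing a base class.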

ERD (narrative):

MetricInput (= ScorerInput)
|-- consumed by --> Scorer: Callable[[MetricInput], dict]
|-- consumed by --> Metric: Protocol(type, compute_scores, score_names)
MetricResult (= MetricOutput)
|<-- returned by --- Metric.compute_scores
TemplateMetric (Pydantic BaseModel, implements Metric)
|-- concrete base; users subclass for ~20-30 LOC metrics
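The metric_as_scorer bridge's event-loop handling can be sketched as follows. This is a minimal sketch under stated assumptions: the SimpleNamespace results stand in for MetricResult, both Echo classes are hypothetical, and the shipped adapter in contracts.py may differ in detail (e.g. if compute_scores is always synchronous, only the first branch applies).

```python
import asyncio
import threading
from types import SimpleNamespace

def metric_as_scorer(metric):
    """Adapt an object-style Metric into a function-style Scorer callable.

    Assumes compute_scores may return a result directly or a coroutine
    resolving to one.
    """
    def scorer(inp):
        result = metric.compute_scores(inp)
        if not asyncio.iscoroutine(result):
            return result.scores
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No loop running in this thread: drive the coroutine here.
            return asyncio.run(result).scores
        # A loop is already running (e.g. a Jupyter cell): run the
        # coroutine on a fresh loop in a worker thread instead of
        # re-entering the active loop.
        box = {}
        def _worker():
            box["out"] = asyncio.run(result)
        t = threading.Thread(target=_worker)
        t.start()
        t.join()
        return box["out"].scores
    return scorer

# Hypothetical metrics used only to exercise both branches:
class SyncEcho:
    type = "echo"
    def compute_scores(self, inp):
        return SimpleNamespace(scores={"echo": float(len(inp))})

class AsyncEcho:
    type = "async_echo"
    async def compute_scores(self, inp):
        return SimpleNamespace(scores={"async_echo": 1.0})

print(metric_as_scorer(SyncEcho())("abcd"))   # {'echo': 4.0}
print(metric_as_scorer(AsyncEcho())("x"))     # {'async_echo': 1.0}
```

The worker-thread branch is what makes the adapter notebook-safe: calling asyncio.run directly inside a running loop raises RuntimeError, whereas a fresh loop on its own thread completes without touching the notebook's loop.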

Tests: 24 new tests in tests/test_scoring/test_contracts.py covering input/output aliases, Pydantic result types with NaN serialization, all four Protocols (Metric, CorpusMetric, MetricWithSecrets, MetricWithPreflight), TemplateMetric default + override, register_metric / get_metric / list_metrics, and the metric_as_scorer bridge for both single-score and multi-score metrics. All pass.

Design rationale + migration plan for NMP Platform is in .claude/tasks/nel_nmp_integration/approach3_design.md.

No breaking changes in NEL — this is a new module next to existing scoring/ utilities. The breaking change lands on NMP SDK (a follow-up PR on NVIDIA-NeMo/Platform will import these contracts and adapt SDK's concrete metrics to the new compute_scores signature).

@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Important: review skipped (draft detected).

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@github-actions github-actions Bot added the tests label Apr 23, 2026
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@wprazuch wprazuch force-pushed the wprazuch/metric-abstractions branch from a394a42 to c89d9b9 Compare April 23, 2026 11:46
@wprazuch wprazuch closed this May 4, 2026