feat(evaluator): inline + hybrid metric bundlers; built-ins skip cloudpickle#438
feat(evaluator): inline + hybrid metric bundlers; built-ins skip cloudpickle#438SandyChapman wants to merge 2 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (31)
💤 Files with no reviewable changes (9)
✅ Files skipped from review due to trivial changes (5)
🚧 Files skipped from review as they are similar to previous changes (15)
📝 WalkthroughWalkthroughAdds inline metric bundling and default packager resolution, with SDK, schema, test, and docs updates. ChangesInline Metric Bundling and Default Packager Resolution
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (2)
plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/hybrid.py (1)
14-14: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winRemove postponed-annotation import to match project typing rules.
Line 14 enables string-based annotations via postponed evaluation.
Suggested change
-from __future__ import annotationsAs per coding guidelines,
**/*.py: Always prefer concrete type hints over string-based ones in Python code.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/hybrid.py` at line 14, Remove the postponed-annotation import from hybrid.py so the module follows the project’s typing rule of using concrete annotations instead of string-based ones. Update any type hints in the affected module, especially around the HybridMetricBundle definitions and related symbols, so they remain valid without future-annotations behavior and do not rely on deferred evaluation.Source: Coding guidelines
plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/inline.py (1)
15-15: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winRemove postponed-annotation import to match project typing rules.
Line 15 enables string-based annotations via postponed evaluation.
Suggested change
-from __future__ import annotationsAs per coding guidelines,
**/*.py: Always prefer concrete type hints over string-based ones in Python code.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/inline.py` at line 15, Remove the postponed-annotation import from inline.py so the module follows the project typing rule of using concrete type hints instead of string-based annotations. Update any type annotations in the nearby module-level declarations, functions, or classes referenced by inline.py so they remain valid without future-annotations, and keep the existing symbols in this file unchanged apart from typing syntax.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/evaluator/metrics/llm-as-a-judge.mdx`:
- Line 648: The stop sequence in the LLM-as-a-judge example is using Fern
template syntax instead of a literal token, so update the quoted token in the
example to a fully inlined string. Locate the `"stop"` entry in the documented
code snippet and replace the `{{...}}`-style placeholder with the actual literal
end-of-text token text, keeping the rest of the example unchanged.
In `@docs/evaluator/test_doc_examples.py`:
- Around line 117-124: The built-in submit test is too permissive because it
swallows any exception from evaluator.submit, which can hide unrelated
regressions. Update the test around evaluator.submit to patch
evaluator._executor.submit (or the HTTP client) so it raises a sentinel after
packager resolution, then assert that specific sentinel is raised instead of
using a broad except Exception: pass. Keep the existing
MetricBundlePackagerPolicyError check, and use the evaluator.submit and
evaluator._executor.submit symbols to locate the flow.
In `@plugins/nemo-evaluator/openapi/openapi.yaml`:
- Around line 1309-1316: The `payload` schema in the OpenAPI contract was
widened by adding `InlineMetricPayload` to the `oneOf`, which changes the stable
response shape for existing evaluate job endpoints. Update the `openapi.yaml`
definition around `payload` and its discriminator mapping to keep the v2
contract backward-compatible, either by removing the new union member from the
current schema or by moving this change into a versioned API contract/migration
path before releasing it.
In `@plugins/nemo-evaluator/src/nemo_evaluator/api/schemas.py`:
- Around line 55-57: InlineMetricPayload.metric is too permissive because
dict[str, Any] allows malformed inline metric payloads to pass validation.
Update the InlineMetricPayload schema in the nemo_evaluator/api/schemas.py
models to use the built-in metric discriminated union or a DTO that requires the
metric type discriminator, matching the MetricsUnion validation approach used
for reconstruction. Keep the field aligned with the metric schema types so
invalid durable jobs are rejected during submission instead of later in runtime.
In `@plugins/nemo-evaluator/tests/test_inline_bundle_execution.py`:
- Line 14: Remove the postponed annotations import from
test_inline_bundle_execution.py so the file uses concrete type hints instead of
string-based ones. Update the module header in the test file and keep the
existing direct type imports used by the test code; no other behavioral changes
are needed.
---
Nitpick comments:
In `@plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/hybrid.py`:
- Line 14: Remove the postponed-annotation import from hybrid.py so the module
follows the project’s typing rule of using concrete annotations instead of
string-based ones. Update any type hints in the affected module, especially
around the HybridMetricBundle definitions and related symbols, so they remain
valid without future-annotations behavior and do not rely on deferred
evaluation.
In `@plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/inline.py`:
- Line 15: Remove the postponed-annotation import from inline.py so the module
follows the project typing rule of using concrete type hints instead of
string-based annotations. Update any type annotations in the nearby module-level
declarations, functions, or classes referenced by inline.py so they remain valid
without future-annotations, and keep the existing symbols in this file unchanged
apart from typing syntax.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a2280022-917c-4f1d-97dc-bcd41b9e42b9
📒 Files selected for processing (31)
docs/evaluator/index.mdxdocs/evaluator/metrics/agent-configuration.mdxdocs/evaluator/metrics/agentic.mdxdocs/evaluator/metrics/job-management.mdxdocs/evaluator/metrics/llm-as-a-judge.mdxdocs/evaluator/metrics/manage-metrics.mdxdocs/evaluator/metrics/model-configuration.mdxdocs/evaluator/metrics/rag.mdxdocs/evaluator/metrics/remote.mdxdocs/evaluator/metrics/results.mdxdocs/evaluator/metrics/similarity.mdxdocs/evaluator/sdk-resources.mdxdocs/evaluator/test_doc_examples.pydocs/evaluator/tutorials/run-llm-judge-evaluation.mdxplugins/nemo-evaluator/openapi/openapi.yamlplugins/nemo-evaluator/src/nemo_evaluator/api/schemas.pyplugins/nemo-evaluator/src/nemo_evaluator/jobs/evaluate.pyplugins/nemo-evaluator/src/nemo_evaluator/metric_storage.pyplugins/nemo-evaluator/src/nemo_evaluator/sdk/_executor.pyplugins/nemo-evaluator/src/nemo_evaluator/sdk/metric_resources.pyplugins/nemo-evaluator/src/nemo_evaluator/sdk/resources.pyplugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/bundles.pyplugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/defaults.pyplugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/hybrid.pyplugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/inline.pyplugins/nemo-evaluator/tests/sdk/test_metric_sdk_resources.pyplugins/nemo-evaluator/tests/shared/metric_bundles/test_defaults.pyplugins/nemo-evaluator/tests/shared/metric_bundles/test_hybrid.pyplugins/nemo-evaluator/tests/shared/metric_bundles/test_inline.pyplugins/nemo-evaluator/tests/test_inline_bundle_execution.pyplugins/nemo-evaluator/tests/test_sdk.py
💤 Files with no reviewable changes (9)
- docs/evaluator/index.mdx
- docs/evaluator/metrics/model-configuration.mdx
- docs/evaluator/metrics/remote.mdx
- docs/evaluator/metrics/agent-configuration.mdx
- docs/evaluator/metrics/results.mdx
- docs/evaluator/tutorials/run-llm-judge-evaluation.mdx
- docs/evaluator/metrics/manage-metrics.mdx
- docs/evaluator/metrics/job-management.mdx
- docs/evaluator/metrics/agentic.mdx
|
- api/schemas.py: require a non-empty `type` on the inline wire payload so
malformed inline metrics are rejected at the API boundary instead of failing
at execution (mirrors the runtime InlineMetricPayload validator). The metric
body stays an open object; concrete shape is still validated on hydration.
- docs/evaluator/test_doc_examples.py: stub the executor with a sentinel in the
built-in-submit test instead of `except Exception: pass`, so the test proves
packaging resolved (no packager required) without swallowing unrelated errors.
- docs/evaluator/metrics/llm-as-a-judge.mdx: replace the `{{ end_of_text }}`
Fern template token with a literal `<end_of_text>` per docs guidelines.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Sandy Chapman <schapman@nvidia.com>
…dpickle Built-in evaluator metrics can now bundle as their own JSON configuration instead of a cloudpickle blob, so durable jobs no longer require an explicit metric_bundle_packager and avoid the Python-version coupling of cloudpickle payloads. Reconstruction is pure data validation via the MetricsUnion discriminated union — no arbitrary code runs on load. New bundlers (shared/metric_bundles/): - inline.py: InlineMetricPayload + InlineMetricBundlePackager (kind="inline"). - hybrid.py: HybridMetricBundlePackager — inline per metric, cloudpickle only for metrics that cannot be reconstructed from config. - defaults.py: resolve_default_metric_bundle_packager — selection policy. Default behavior: - Built-in metrics bundle inline everywhere (run/submit/create); no packager needed. - Local run() of a custom metric falls back to cloudpickle automatically (in-process; opt out via the executor's allow_cloudpickle_fallback=False). - Durable submit()/metric create() of a custom metric require an explicit packager — shipping arbitrary code to the shared service stays opt-in. Wire contract: api/schemas.py adds InlineMetricPayload to the MetricPayload discriminated union; openapi spec regenerated (make refresh-openapi). The hand-written evaluator SDK is updated accordingly; no Stainless regen needed. Docs: drop the now-unnecessary CloudpickleMetricBundlePackager from built-in submit() examples; keep it only in the custom-Python-metrics tutorial (points to Hybrid as the recommended packager). ModelRef example keeps config=RunConfigOnlineModel(). Fix two illustrative fragment blocks so all 261 doc code blocks compile, and update the offline contract test to the new behavior. Validated: 305 evaluator unit tests; offline contract test (8); deterministic submit/run examples and the LLM-judge tutorial run end-to-end against a live platform (packager-free durable submit completed against a real judge model). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Sandy Chapman <schapman@nvidia.com>
- api/schemas.py: require a non-empty `type` on the inline wire payload so
malformed inline metrics are rejected at the API boundary instead of failing
at execution (mirrors the runtime InlineMetricPayload validator). The metric
body stays an open object; concrete shape is still validated on hydration.
- docs/evaluator/test_doc_examples.py: stub the executor with a sentinel in the
built-in-submit test instead of `except Exception: pass`, so the test proves
packaging resolved (no packager required) without swallowing unrelated errors.
- docs/evaluator/metrics/llm-as-a-judge.mdx: replace the `{{ end_of_text }}`
Fern template token with a literal `<end_of_text>` per docs guidelines.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Sandy Chapman <schapman@nvidia.com>
1a72e1a to
4c7c598
Compare
Summary
Built-in evaluator metrics can now bundle as their own JSON configuration instead of a cloudpickle blob. Durable jobs no longer require an explicit
metric_bundle_packagerfor built-in metrics, and those metrics avoid the Python-version coupling that cloudpickle payloads impose on a remote runtime. Reconstruction is pure data validation via theMetricsUniondiscriminated union — no arbitrary code runs on load.This fills back the "submit a built-in metric by config" capability from the old v2 service.
What's new
shared/metric_bundles/inline.py—InlineMetricPayload+InlineMetricBundlePackager(kind="inline"); reconstructs the concrete metric from itstypeviaMetricsUnion.shared/metric_bundles/hybrid.py—HybridMetricBundlePackager: inline per metric, cloudpickle only for metrics that can't be reconstructed from config (so a mixed set keeps built-ins inline).shared/metric_bundles/defaults.py—resolve_default_metric_bundle_packager, the selection policy.Behavior / contract
run()/evaluate_benchmark()allow_cloudpickle_fallback=False)submit()/ metriccreate()CloudpickleMetricBundlePackager()requiredShipping arbitrary code to the shared service stays an explicit opt-in; local execution (your own process) doesn't.
Wire / SDK
api/schemas.pyaddsInlineMetricPayloadto theMetricPayloaddiscriminated union;openapi.yamlregenerated viamake refresh-openapi. The evaluator plugin's SDK is hand-written and updated here; the vendored Stainless SDK doesn't reference these schemas, so no SDK regen is needed.Docs
Removes the now-unnecessary
CloudpickleMetricBundlePackager()from built-insubmit()examples across the evaluator docs (these were added in #406 as a stopgap for the pre-inline behavior; this change makes them redundant). The packager is kept only in the custom-Python-metrics tutorial. TheModelRefexample keepsconfig=RunConfigOnlineModel(). Two illustrative fragment blocks were made valid Python so all 261 doc code blocks compile, and the offline contract test (docs/evaluator/test_doc_examples.py) is updated to the new behavior.Validation
ruff+tyclean;make docs-checkclean.run()/packager-freesubmit()examples completed with correct scores, and the LLM-judge tutorial ran end-to-end —FilesetRefdataset registered, local judge run, and the packager-free durablesubmit()completed againstmeta/llama-3.1-70b-instruct(exit_code 0).Notes
evaluator/benchmarks/*docs remain gated (gated-nav.yml) — they reference a benchmark API surface not yet ported into the plugin. Out of scope here; un-hide when that lands.🤖 Generated with Claude Code
Summary by CodeRabbit