feat(evaluator): inline + hybrid metric bundlers; built-ins skip cloudpickle #438

Open
SandyChapman wants to merge 2 commits into main from feat-inline-metric-bundler/schapman

@SandyChapman SandyChapman commented Jun 24, 2026

Contributor

Summary

Built-in evaluator metrics can now bundle as their own JSON configuration instead of a cloudpickle blob. Durable jobs no longer require an explicit metric_bundle_packager for built-in metrics, and those metrics avoid the Python-version coupling that cloudpickle payloads impose on a remote runtime. Reconstruction is pure data validation via the MetricsUnion discriminated union — no arbitrary code runs on load.

This restores the "submit a built-in metric by config" capability from the old v2 service.
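To illustrate why reconstructing from config involves no code execution, here is a minimal stdlib-only sketch of a discriminated-union lookup. It is not the PR's `MetricsUnion`; the metric classes, the `METRIC_TYPES` registry, and `reconstruct_metric` are hypothetical stand-ins. Only the `type` discriminator idea comes from the description above.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class ExactMatch:
    """Hypothetical built-in metric, selected by type="exact_match"."""
    case_sensitive: bool = True


@dataclass(frozen=True)
class RougeScore:
    """Hypothetical built-in metric, selected by type="rouge"."""
    variant: str = "rougeL"


# Stand-in for the discriminated union: the "type" value picks the class.
METRIC_TYPES = {"exact_match": ExactMatch, "rouge": RougeScore}


def reconstruct_metric(config: dict) -> Any:
    """Rebuild a metric from plain JSON config; only validation runs here."""
    metric_type = config.get("type")
    if not isinstance(metric_type, str) or not metric_type:
        raise ValueError("inline metric config needs a non-empty 'type'")
    cls = METRIC_TYPES.get(metric_type)
    if cls is None:
        raise ValueError(f"unknown metric type: {metric_type!r}")
    params = {k: v for k, v in config.items() if k != "type"}
    unknown = set(params) - {f.name for f in fields(cls)}
    if unknown:
        raise ValueError(f"unexpected fields for {metric_type!r}: {sorted(unknown)}")
    return cls(**params)


metric = reconstruct_metric({"type": "rouge", "variant": "rouge1"})
```

Because the payload is plain data checked against a closed set of known types, loading it cannot import or run user-supplied code, unlike unpickling.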

What's new

  • shared/metric_bundles/inline.py: InlineMetricPayload + InlineMetricBundlePackager (kind="inline"); reconstructs the concrete metric from its type via MetricsUnion.
  • shared/metric_bundles/hybrid.py: HybridMetricBundlePackager: inline per metric, cloudpickle only for metrics that can't be reconstructed from config (so a mixed set keeps built-ins inline).
  • shared/metric_bundles/defaults.py: resolve_default_metric_bundle_packager, the selection policy.
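The per-metric routing of the hybrid packager might look roughly like the sketch below. Everything here is illustrative: `BuiltinMetric`, `CustomMetric`, `BUILTIN_TYPES`, `bundle_hybrid`, and the `"cloudpickle"` kind string are assumptions, and the `opaque_dumps` callable stands in for a real cloudpickle serializer so the example needs only the standard library.

```python
from typing import Any, Callable, Optional

# Hypothetical names of built-in metric types that can be rebuilt from config.
BUILTIN_TYPES = {"exact_match", "rouge"}


class BuiltinMetric:
    """Hypothetical built-in metric that exposes its own JSON-able config."""

    def __init__(self, config: dict):
        self.config = config


class CustomMetric:
    """Hypothetical user-defined metric with arbitrary Python behavior."""

    def score(self, prediction: str, reference: str) -> float:
        return float(len(prediction) == len(reference))


def inline_config(metric: Any) -> Optional[dict]:
    """Return the metric's config if it can be rebuilt from data, else None."""
    config = getattr(metric, "config", None)
    if isinstance(config, dict) and config.get("type") in BUILTIN_TYPES:
        return config
    return None


def bundle_hybrid(metrics: list, opaque_dumps: Callable[[Any], bytes]) -> list:
    """Bundle each metric inline when possible; fall back to an opaque blob."""
    entries = []
    for metric in metrics:
        config = inline_config(metric)
        if config is not None:
            entries.append({"kind": "inline", "metric": config})
        else:
            entries.append({"kind": "cloudpickle", "blob": opaque_dumps(metric).hex()})
    return entries


entries = bundle_hybrid(
    [BuiltinMetric({"type": "rouge", "variant": "rougeL"}), CustomMetric()],
    # Placeholder serializer; a real packager would call cloudpickle here.
    opaque_dumps=lambda m: type(m).__name__.encode(),
)
```

Deciding per metric rather than per job is what lets a mixed set keep its built-ins as readable, version-independent config.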

Behavior / contract

| | built-in metric | custom metric |
| --- | --- | --- |
| local run() / evaluate_benchmark() | inline | cloudpickle (automatic, in-process; opt out via executor allow_cloudpickle_fallback=False) |
| durable submit() / metric create() | inline | explicit CloudpickleMetricBundlePackager() required |

Shipping arbitrary code to the shared service stays an explicit opt-in; local execution, which runs in your own process, needs no opt-in.
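The contract above can be read as a small decision function. The sketch below is one hedged interpretation, not the body of `resolve_default_metric_bundle_packager`: the function name, its parameters, and the string return values are hypothetical. Two behaviors are assumptions not stated in the description: an explicitly passed packager always wins, and a local custom metric with the fallback disabled raises an error.

```python
from typing import Optional


class PackagerPolicyError(Exception):
    """Stand-in for the policy error raised when no packager may be chosen."""


def choose_packager(
    *,
    all_builtin: bool,
    durable: bool,
    explicit: Optional[str] = None,
    allow_cloudpickle_fallback: bool = True,
) -> str:
    """Pick a packager kind following the behavior table (illustrative only)."""
    if explicit is not None:
        # Assumption: an explicitly supplied packager is always honored.
        return explicit
    if all_builtin:
        # Built-in metrics bundle inline on both local and durable paths.
        return "inline"
    if durable:
        # Custom code is never shipped to the shared service implicitly.
        raise PackagerPolicyError("custom metrics need an explicit packager for durable jobs")
    if allow_cloudpickle_fallback:
        # Local, in-process run: automatic cloudpickle fallback.
        return "cloudpickle"
    # Assumption: opting out of the fallback turns the local case into an error.
    raise PackagerPolicyError("cloudpickle fallback is disabled for this executor")


local_custom = choose_packager(all_builtin=False, durable=False)
durable_builtin = choose_packager(all_builtin=True, durable=True)
```

The asymmetry is deliberate: the same custom metric that runs locally without ceremony is rejected on the durable path until the caller names a packager.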

Wire / SDK

api/schemas.py adds InlineMetricPayload to the MetricPayload discriminated union; openapi.yaml regenerated via make refresh-openapi. The evaluator plugin's SDK is hand-written and updated here; the vendored Stainless SDK doesn't reference these schemas, so no SDK regen is needed.

Docs

Removes the now-unnecessary CloudpickleMetricBundlePackager() from built-in submit() examples across the evaluator docs (these were added in #406 as a stopgap for the pre-inline behavior; this change makes them redundant). The packager is kept only in the custom-Python-metrics tutorial. The ModelRef example keeps config=RunConfigOnlineModel(). Two illustrative fragment blocks were made valid Python so all 261 doc code blocks compile, and the offline contract test (docs/evaluator/test_doc_examples.py) is updated to the new behavior.

Validation

  • 305 evaluator unit tests; offline doc contract test (8).
  • ruff + ty clean; make docs-check clean.
  • Ran the actual doc snippets against a live local platform: deterministic run()/packager-free submit() examples completed with correct scores, and the LLM-judge tutorial ran end-to-end: FilesetRef dataset registered, local judge run, and the packager-free durable submit() completed against meta/llama-3.1-70b-instruct (exit_code 0).

Notes

  • evaluator/benchmarks/* docs remain gated (gated-nav.yml) — they reference a benchmark API surface not yet ported into the plugin. Out of scope here; un-hide when that lands.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added inline metric bundling with automatic default selection for built-in metrics, plus hybrid handling for mixed jobs.
    • Added an inline metric payload format to the metrics API (inline DTO + digest).
  • Bug Fixes
    • Updated submission and local execution so built-in metrics don’t require an explicit metric packager, while custom metrics follow stricter packager policy.
  • Documentation
    • Simplified many evaluator/metric examples and tutorials by removing manual cloudpickle packager configuration; improved LLM-as-a-Judge snippet parsing/inference examples.
  • Tests
    • Added and updated unit and end-to-end coverage for inline/hybrid/default packager resolution and execution.

@SandyChapman SandyChapman requested review from a team as code owners June 24, 2026 16:23
@github-actions github-actions Bot added the feat label Jun 24, 2026
coderabbitai Bot commented Jun 24, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d04dff9f-59f1-4522-a06c-e3c43be875fb

📥 Commits

Reviewing files that changed from the base of the PR and between 1a72e1a and 4c7c598.

📒 Files selected for processing (31)
  • docs/evaluator/index.mdx
  • docs/evaluator/metrics/agent-configuration.mdx
  • docs/evaluator/metrics/agentic.mdx
  • docs/evaluator/metrics/job-management.mdx
  • docs/evaluator/metrics/llm-as-a-judge.mdx
  • docs/evaluator/metrics/manage-metrics.mdx
  • docs/evaluator/metrics/model-configuration.mdx
  • docs/evaluator/metrics/rag.mdx
  • docs/evaluator/metrics/remote.mdx
  • docs/evaluator/metrics/results.mdx
  • docs/evaluator/metrics/similarity.mdx
  • docs/evaluator/sdk-resources.mdx
  • docs/evaluator/test_doc_examples.py
  • docs/evaluator/tutorials/run-llm-judge-evaluation.mdx
  • plugins/nemo-evaluator/openapi/openapi.yaml
  • plugins/nemo-evaluator/src/nemo_evaluator/api/schemas.py
  • plugins/nemo-evaluator/src/nemo_evaluator/jobs/evaluate.py
  • plugins/nemo-evaluator/src/nemo_evaluator/metric_storage.py
  • plugins/nemo-evaluator/src/nemo_evaluator/sdk/_executor.py
  • plugins/nemo-evaluator/src/nemo_evaluator/sdk/metric_resources.py
  • plugins/nemo-evaluator/src/nemo_evaluator/sdk/resources.py
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/bundles.py
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/defaults.py
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/hybrid.py
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/inline.py
  • plugins/nemo-evaluator/tests/sdk/test_metric_sdk_resources.py
  • plugins/nemo-evaluator/tests/shared/metric_bundles/test_defaults.py
  • plugins/nemo-evaluator/tests/shared/metric_bundles/test_hybrid.py
  • plugins/nemo-evaluator/tests/shared/metric_bundles/test_inline.py
  • plugins/nemo-evaluator/tests/test_inline_bundle_execution.py
  • plugins/nemo-evaluator/tests/test_sdk.py
💤 Files with no reviewable changes (9)
  • docs/evaluator/metrics/job-management.mdx
  • docs/evaluator/metrics/manage-metrics.mdx
  • docs/evaluator/metrics/model-configuration.mdx
  • docs/evaluator/metrics/results.mdx
  • docs/evaluator/index.mdx
  • docs/evaluator/metrics/remote.mdx
  • docs/evaluator/tutorials/run-llm-judge-evaluation.mdx
  • docs/evaluator/metrics/agent-configuration.mdx
  • docs/evaluator/metrics/agentic.mdx
✅ Files skipped from review due to trivial changes (5)
  • plugins/nemo-evaluator/src/nemo_evaluator/jobs/evaluate.py
  • plugins/nemo-evaluator/src/nemo_evaluator/metric_storage.py
  • docs/evaluator/metrics/llm-as-a-judge.mdx
  • docs/evaluator/metrics/similarity.mdx
  • docs/evaluator/sdk-resources.mdx
🚧 Files skipped from review as they are similar to previous changes (15)
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/defaults.py
  • plugins/nemo-evaluator/tests/shared/metric_bundles/test_hybrid.py
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/hybrid.py
  • plugins/nemo-evaluator/openapi/openapi.yaml
  • plugins/nemo-evaluator/tests/test_inline_bundle_execution.py
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/bundles.py
  • plugins/nemo-evaluator/src/nemo_evaluator/api/schemas.py
  • docs/evaluator/test_doc_examples.py
  • plugins/nemo-evaluator/src/nemo_evaluator/sdk/metric_resources.py
  • plugins/nemo-evaluator/tests/shared/metric_bundles/test_defaults.py
  • plugins/nemo-evaluator/src/nemo_evaluator/sdk/resources.py
  • plugins/nemo-evaluator/tests/sdk/test_metric_sdk_resources.py
  • plugins/nemo-evaluator/tests/test_sdk.py
  • plugins/nemo-evaluator/src/nemo_evaluator/sdk/_executor.py
  • docs/evaluator/metrics/rag.mdx

📝 Walkthrough

Walkthrough

Adds inline metric bundling and default packager resolution, with SDK, schema, test, and docs updates.

Changes

Inline Metric Bundling and Default Packager Resolution

Layer / File(s) Summary
Inline and hybrid packagers
plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/*
Adds inline payload handling, a hybrid packager, a packager policy error, and default packager selection.
Schema and registration wiring
plugins/nemo-evaluator/src/nemo_evaluator/api/schemas.py, plugins/nemo-evaluator/openapi/openapi.yaml, plugins/nemo-evaluator/src/nemo_evaluator/jobs/evaluate.py, plugins/nemo-evaluator/src/nemo_evaluator/metric_storage.py
Extends the metric payload schema with inline support and imports bundle modules for registration side effects.
SDK submit, storage, and local run resolution
plugins/nemo-evaluator/src/nemo_evaluator/sdk/resources.py, plugins/nemo-evaluator/src/nemo_evaluator/sdk/metric_resources.py, plugins/nemo-evaluator/src/nemo_evaluator/sdk/_executor.py
Routes submit, storage, and local execution paths through default packager resolution.
Packager and SDK tests
plugins/nemo-evaluator/tests/shared/metric_bundles/*, plugins/nemo-evaluator/tests/test_sdk.py, plugins/nemo-evaluator/tests/sdk/test_metric_sdk_resources.py, plugins/nemo-evaluator/tests/test_inline_bundle_execution.py, docs/evaluator/test_doc_examples.py
Adds coverage for inline payloads, hybrid routing, default selection, policy errors, and local execution.
Documentation examples
docs/evaluator/*.mdx
Removes explicit cloudpickle packager usage from evaluator docs and updates parser and inference examples in the LLM judge guide.

Possibly related PRs

Suggested reviewers

  • ngoncharenko
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 47.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title matches the main change: adding inline and hybrid metric bundlers so built-in metrics avoid cloudpickle. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (2)
plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/hybrid.py (1)

14-14: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Remove postponed-annotation import to match project typing rules.

Line 14 enables string-based annotations via postponed evaluation.

Suggested change
-from __future__ import annotations

As per coding guidelines, **/*.py: Always prefer concrete type hints over string-based ones in Python code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/hybrid.py` at
line 14, Remove the postponed-annotation import from hybrid.py so the module
follows the project’s typing rule of using concrete annotations instead of
string-based ones. Update any type hints in the affected module, especially
around the HybridMetricBundle definitions and related symbols, so they remain
valid without future-annotations behavior and do not rely on deferred
evaluation.

Source: Coding guidelines

plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/inline.py (1)

15-15: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Remove postponed-annotation import to match project typing rules.

Line 15 enables string-based annotations via postponed evaluation.

Suggested change
-from __future__ import annotations

As per coding guidelines, **/*.py: Always prefer concrete type hints over string-based ones in Python code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/inline.py` at
line 15, Remove the postponed-annotation import from inline.py so the module
follows the project typing rule of using concrete type hints instead of
string-based annotations. Update any type annotations in the nearby module-level
declarations, functions, or classes referenced by inline.py so they remain valid
without future-annotations, and keep the existing symbols in this file unchanged
apart from typing syntax.

Source: Coding guidelines
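For context on what these two nitpicks are about, the sketch below shows the effect of postponed annotation evaluation using only the standard library. It compiles the same function source with and without the `annotations` future flag, which is equivalent to adding or omitting `from __future__ import annotations` at the top of a module. The helper name `annotations_of` is hypothetical.

```python
import __future__

SOURCE = "def f(x: int) -> str:\n    return str(x)\n"


def annotations_of(source: str, postponed: bool) -> dict:
    """Compile `source` with or without postponed annotations and return f's annotations."""
    flags = __future__.annotations.compiler_flag if postponed else 0
    namespace = {}
    # dont_inherit=True keeps this module's own future flags out of the compile.
    exec(compile(source, "<sketch>", "exec", flags=flags, dont_inherit=True), namespace)
    return namespace["f"].__annotations__


postponed = annotations_of(SOURCE, postponed=True)   # annotation values are strings
concrete = annotations_of(SOURCE, postponed=False)   # annotation values are type objects
```

With the flag set, every annotation is stored as its source text, which is the string-based form the coding guideline asks contributors to avoid.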

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/evaluator/metrics/llm-as-a-judge.mdx`:
- Line 648: The stop sequence in the LLM-as-a-judge example is using Fern
template syntax instead of a literal token, so update the quoted token in the
example to a fully inlined string. Locate the `"stop"` entry in the documented
code snippet and replace the `{{...}}`-style placeholder with the actual literal
end-of-text token text, keeping the rest of the example unchanged.

In `@docs/evaluator/test_doc_examples.py`:
- Around line 117-124: The built-in submit test is too permissive because it
swallows any exception from evaluator.submit, which can hide unrelated
regressions. Update the test around evaluator.submit to patch
evaluator._executor.submit (or the HTTP client) so it raises a sentinel after
packager resolution, then assert that specific sentinel is raised instead of
using a broad except Exception: pass. Keep the existing
MetricBundlePackagerPolicyError check, and use the evaluator.submit and
evaluator._executor.submit symbols to locate the flow.

In `@plugins/nemo-evaluator/openapi/openapi.yaml`:
- Around line 1309-1316: The `payload` schema in the OpenAPI contract was
widened by adding `InlineMetricPayload` to the `oneOf`, which changes the stable
response shape for existing evaluate job endpoints. Update the `openapi.yaml`
definition around `payload` and its discriminator mapping to keep the v2
contract backward-compatible, either by removing the new union member from the
current schema or by moving this change into a versioned API contract/migration
path before releasing it.

In `@plugins/nemo-evaluator/src/nemo_evaluator/api/schemas.py`:
- Around line 55-57: InlineMetricPayload.metric is too permissive because
dict[str, Any] allows malformed inline metric payloads to pass validation.
Update the InlineMetricPayload schema in the nemo_evaluator/api/schemas.py
models to use the built-in metric discriminated union or a DTO that requires the
metric type discriminator, matching the MetricsUnion validation approach used
for reconstruction. Keep the field aligned with the metric schema types so
invalid durable jobs are rejected during submission instead of later in runtime.

In `@plugins/nemo-evaluator/tests/test_inline_bundle_execution.py`:
- Line 14: Remove the postponed annotations import from
test_inline_bundle_execution.py so the file uses concrete type hints instead of
string-based ones. Update the module header in the test file and keep the
existing direct type imports used by the test code; no other behavioral changes
are needed.

---

Nitpick comments:
In `@plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/hybrid.py`:
- Line 14: Remove the postponed-annotation import from hybrid.py so the module
follows the project’s typing rule of using concrete annotations instead of
string-based ones. Update any type hints in the affected module, especially
around the HybridMetricBundle definitions and related symbols, so they remain
valid without future-annotations behavior and do not rely on deferred
evaluation.

In `@plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/inline.py`:
- Line 15: Remove the postponed-annotation import from inline.py so the module
follows the project typing rule of using concrete type hints instead of
string-based annotations. Update any type annotations in the nearby module-level
declarations, functions, or classes referenced by inline.py so they remain valid
without future-annotations, and keep the existing symbols in this file unchanged
apart from typing syntax.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a2280022-917c-4f1d-97dc-bcd41b9e42b9

📥 Commits

Reviewing files that changed from the base of the PR and between 4a3a16e and 4bc8623.

📒 Files selected for processing (31)
  • docs/evaluator/index.mdx
  • docs/evaluator/metrics/agent-configuration.mdx
  • docs/evaluator/metrics/agentic.mdx
  • docs/evaluator/metrics/job-management.mdx
  • docs/evaluator/metrics/llm-as-a-judge.mdx
  • docs/evaluator/metrics/manage-metrics.mdx
  • docs/evaluator/metrics/model-configuration.mdx
  • docs/evaluator/metrics/rag.mdx
  • docs/evaluator/metrics/remote.mdx
  • docs/evaluator/metrics/results.mdx
  • docs/evaluator/metrics/similarity.mdx
  • docs/evaluator/sdk-resources.mdx
  • docs/evaluator/test_doc_examples.py
  • docs/evaluator/tutorials/run-llm-judge-evaluation.mdx
  • plugins/nemo-evaluator/openapi/openapi.yaml
  • plugins/nemo-evaluator/src/nemo_evaluator/api/schemas.py
  • plugins/nemo-evaluator/src/nemo_evaluator/jobs/evaluate.py
  • plugins/nemo-evaluator/src/nemo_evaluator/metric_storage.py
  • plugins/nemo-evaluator/src/nemo_evaluator/sdk/_executor.py
  • plugins/nemo-evaluator/src/nemo_evaluator/sdk/metric_resources.py
  • plugins/nemo-evaluator/src/nemo_evaluator/sdk/resources.py
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/bundles.py
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/defaults.py
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/hybrid.py
  • plugins/nemo-evaluator/src/nemo_evaluator/shared/metric_bundles/inline.py
  • plugins/nemo-evaluator/tests/sdk/test_metric_sdk_resources.py
  • plugins/nemo-evaluator/tests/shared/metric_bundles/test_defaults.py
  • plugins/nemo-evaluator/tests/shared/metric_bundles/test_hybrid.py
  • plugins/nemo-evaluator/tests/shared/metric_bundles/test_inline.py
  • plugins/nemo-evaluator/tests/test_inline_bundle_execution.py
  • plugins/nemo-evaluator/tests/test_sdk.py
💤 Files with no reviewable changes (9)
  • docs/evaluator/index.mdx
  • docs/evaluator/metrics/model-configuration.mdx
  • docs/evaluator/metrics/remote.mdx
  • docs/evaluator/metrics/agent-configuration.mdx
  • docs/evaluator/metrics/results.mdx
  • docs/evaluator/tutorials/run-llm-judge-evaluation.mdx
  • docs/evaluator/metrics/manage-metrics.mdx
  • docs/evaluator/metrics/job-management.mdx
  • docs/evaluator/metrics/agentic.mdx

Comment thread docs/evaluator/metrics/llm-as-a-judge.mdx Outdated
Comment thread docs/evaluator/test_doc_examples.py Outdated
Comment thread plugins/nemo-evaluator/openapi/openapi.yaml
Comment thread plugins/nemo-evaluator/src/nemo_evaluator/api/schemas.py
Comment thread plugins/nemo-evaluator/tests/test_inline_bundle_execution.py
github-actions Bot commented Jun 24, 2026

Contributor
| Suite | Lines Covered | Line Rate | Branch Rate |
| --- | --- | --- | --- |
| Unit Tests | 20908/27474 | 76.1% | 61.2% |
| Integration Tests | 12108/26243 | 46.1% | 19.5% |

SandyChapman added a commit that referenced this pull request Jun 24, 2026
- api/schemas.py: require a non-empty `type` on the inline wire payload so
  malformed inline metrics are rejected at the API boundary instead of failing
  at execution (mirrors the runtime InlineMetricPayload validator). The metric
  body stays an open object; concrete shape is still validated on hydration.
- docs/evaluator/test_doc_examples.py: stub the executor with a sentinel in the
  built-in-submit test instead of `except Exception: pass`, so the test proves
  packaging resolved (no packager required) without swallowing unrelated errors.
- docs/evaluator/metrics/llm-as-a-judge.mdx: replace the `{{ end_of_text }}`
  Fern template token with a literal `<end_of_text>` per docs guidelines.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Sandy Chapman <schapman@nvidia.com>
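The sentinel approach named in the commit message above can be sketched with `unittest.mock`. This is not the actual test from docs/evaluator/test_doc_examples.py: the `Evaluator`, `Executor`, and `PolicyError` classes below are hypothetical miniatures, and only the idea of stubbing `_executor.submit` with a sentinel instead of `except Exception: pass` comes from the review thread.

```python
from unittest import mock


class ReachedExecutor(Exception):
    """Sentinel: raised by the stub to prove packaging resolved first."""


class PolicyError(Exception):
    """Stand-in for MetricBundlePackagerPolicyError."""


class Executor:
    def submit(self, bundle: dict) -> None:
        raise RuntimeError("would contact the remote service")


class Evaluator:
    """Hypothetical miniature of the SDK entry point."""

    def __init__(self) -> None:
        self._executor = Executor()

    def submit(self, metric: dict, packager=None) -> None:
        # Packager resolution happens before anything reaches the executor.
        if packager is None and not metric.get("builtin"):
            raise PolicyError("custom metric needs an explicit packager")
        self._executor.submit({"kind": "inline", "metric": metric})


def builtin_submit_reaches_executor() -> bool:
    """Return True only if submit() got past packaging and hit the stub."""
    evaluator = Evaluator()
    with mock.patch.object(evaluator._executor, "submit", side_effect=ReachedExecutor):
        try:
            evaluator.submit({"type": "exact_match", "builtin": True})
        except ReachedExecutor:
            return True
    return False


resolved = builtin_submit_reaches_executor()
```

Catching only the sentinel means any other exception, such as a regression in packager resolution, still fails the test instead of being silently swallowed.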
@crookedstorm crookedstorm requested a review from a team as a code owner June 24, 2026 18:07
SandyChapman and others added 2 commits June 24, 2026 15:22
…dpickle

Built-in evaluator metrics can now bundle as their own JSON configuration
instead of a cloudpickle blob, so durable jobs no longer require an explicit
metric_bundle_packager and avoid the Python-version coupling of cloudpickle
payloads. Reconstruction is pure data validation via the MetricsUnion
discriminated union — no arbitrary code runs on load.

New bundlers (shared/metric_bundles/):
- inline.py: InlineMetricPayload + InlineMetricBundlePackager (kind="inline").
- hybrid.py: HybridMetricBundlePackager — inline per metric, cloudpickle only
  for metrics that cannot be reconstructed from config.
- defaults.py: resolve_default_metric_bundle_packager — selection policy.

Default behavior:
- Built-in metrics bundle inline everywhere (run/submit/create); no packager
  needed.
- Local run() of a custom metric falls back to cloudpickle automatically
  (in-process; opt out via the executor's allow_cloudpickle_fallback=False).
- Durable submit()/metric create() of a custom metric require an explicit
  packager — shipping arbitrary code to the shared service stays opt-in.

Wire contract: api/schemas.py adds InlineMetricPayload to the MetricPayload
discriminated union; openapi spec regenerated (make refresh-openapi). The
hand-written evaluator SDK is updated accordingly; no Stainless regen needed.

Docs: drop the now-unnecessary CloudpickleMetricBundlePackager from built-in
submit() examples; keep it only in the custom-Python-metrics tutorial (points
to Hybrid as the recommended packager). ModelRef example keeps
config=RunConfigOnlineModel(). Fix two illustrative fragment blocks so all 261
doc code blocks compile, and update the offline contract test to the new
behavior.

Validated: 305 evaluator unit tests; offline contract test (8); deterministic
submit/run examples and the LLM-judge tutorial run end-to-end against a live
platform (packager-free durable submit completed against a real judge model).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Sandy Chapman <schapman@nvidia.com>
- api/schemas.py: require a non-empty `type` on the inline wire payload so
  malformed inline metrics are rejected at the API boundary instead of failing
  at execution (mirrors the runtime InlineMetricPayload validator). The metric
  body stays an open object; concrete shape is still validated on hydration.
- docs/evaluator/test_doc_examples.py: stub the executor with a sentinel in the
  built-in-submit test instead of `except Exception: pass`, so the test proves
  packaging resolved (no packager required) without swallowing unrelated errors.
- docs/evaluator/metrics/llm-as-a-judge.mdx: replace the `{{ end_of_text }}`
  Fern template token with a literal `<end_of_text>` per docs guidelines.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Sandy Chapman <schapman@nvidia.com>
@SandyChapman SandyChapman force-pushed the feat-inline-metric-bundler/schapman branch from 1a72e1a to 4c7c598 Compare June 24, 2026 18:23
2 participants