Skip to content

Fix MLflow auto-export in evaluation skill template#1586

Merged
cjluo-nv merged 2 commits into
mainfrom
skill/mlflow-autoexport-fix
Jun 1, 2026
Merged

Fix MLflow auto-export in evaluation skill template#1586
cjluo-nv merged 2 commits into
mainfrom
skill/mlflow-autoexport-fix

Conversation

@cjluo-nv
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv commented Jun 1, 2026

What does this PR do?

Type of change: documentation

Fixes two issues in the evaluation skill that caused completed eval runs to silently not upload to MLflow (found while running a real GPQA eval — the run finished successfully but never appeared in MLflow):

  1. Missing auto-export trigger. NEL only auto-exports when execution.auto_export.destinations is set; the export.mlflow block alone just configures the exporter. The skill's shortcut said to "copy the export.mlflow block" without calling out the trigger, so a config could end up with the block but no upload.

  2. Interpolated export values break at submit time. With auto-export enabled, NEL resolves the export.mlflow block at submit time in a scope that does not include deployment / evaluation, so ${deployment.served_model_name} / ${evaluation.*} cross-references fail hard with Interpolation key '...' not found. (example_eval.yaml shipped this interpolated pattern with auto-export, so it would hit the error.)

Changes:

  • example_eval.yaml: mark the auto_export.destinations trigger as REQUIRED; switch the export.mlflow block to literal values (CHANGEME-served-model-name placeholders) with a comment explaining why cross-refs break. ${oc.env:USER} (env interpolation) is retained since it resolves fine.
  • SKILL.md: rewrite the MLflow shortcut step to require both the trigger and literal export values, and warn against ${deployment.*} / ${evaluation.*} cross-references.

Usage

N/A — documentation/skill-template only.

Testing

pre-commit run passes on both files (markdownlint + YAML format). Verified the literal-export approach uploads correctly by exporting a real run to MLflow.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (documentation)
  • Did you update Changelog?: N/A (skill docs)
  • Did you get Claude approval on this PR?: ❌ (pending)

Additional Information

Separate from #1583 (parallelism reference); both touch the evaluation skill but are independent.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Clarified MLflow auto-export: auto-export must be explicitly enabled for uploads and the MLflow export block is ignored otherwise.
    • Example config now uses literal experiment name, description, and tags (no cross-scope interpolation) and shows concrete sampling values.
    • Warned to keep sampling params consistent between tags and runtime params and to set the tracking URI.

The export.mlflow block alone does not upload runs: NEL only auto-exports
when execution.auto_export.destinations is set. And with auto-export on, NEL
resolves the export block at submit time in a scope without deployment /
evaluation, so ${deployment.*} / ${evaluation.*} cross-references fail with
"Interpolation key '...' not found".

- example_eval.yaml: mark the auto_export.destinations trigger REQUIRED;
  switch the export.mlflow block to literal values (CHANGEME placeholders)
  with a note on why interpolation cross-refs break at submit time.
- SKILL.md: rewrite the MLflow shortcut step to require both the trigger and
  literal export values, and warn against deployment/evaluation cross-refs
  (${oc.env:USER} is fine).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 98f135e0-2479-4d0c-8622-8d9e4e606b13

📥 Commits

Reviewing files that changed from the base of the PR and between e3d5530 and 5c2e28b.

📒 Files selected for processing (2)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
✅ Files skipped from review due to trivial changes (1)
  • .claude/skills/evaluation/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml

📝 Walkthrough

Walkthrough

Updated documentation and example YAML to clarify MLflow auto-export configuration. Documentation specifies the two required config pieces (auto_export trigger and export.mlflow block), prohibits cross-reference interpolation in export metadata, and reminds users to set tracking_uri. Example YAML now uses literal placeholder and parameter values instead of interpolating from deployment and evaluation config.

Changes

MLflow Auto-Export Documentation and Configuration

Layer / File(s) Summary
Documentation of auto-export defaults and requirements
.claude/skills/evaluation/SKILL.md
Step 1 instructions updated to explicitly document that MLflow auto-export is enabled by default, requires execution.auto_export.destinations: [mlflow] trigger and export.mlflow block with literal metadata values, forbids ${deployment.*}/${evaluation.*} interpolation, and reminds users to set tracking_uri in Step 4.
Example auto-export trigger configuration
.claude/skills/evaluation/recipes/examples/example_eval.yaml
auto_export block adjusted with comments clarifying it is the required trigger and explaining that export.mlflow is ignored unless auto-export is enabled.
Example MLflow export metadata with literal values
.claude/skills/evaluation/recipes/examples/example_eval.yaml
export.mlflow block re-templated to replace ${deployment.served_model_name} and ${evaluation.nemo_evaluator_config.config.params.*} interpolations with literal CHANGEME-served-model-name placeholder and literal sampling parameter values ('1.0', '0.95', '65536') in experiment_name, description, and tags fields.

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: fixing MLflow auto-export configuration in the evaluation skill template by addressing missing auto-export triggers and interpolation failures.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed PR modifies only documentation (SKILL.md) and configuration (example_eval.yaml) files; no Python code changes present, so security anti-pattern check does not apply.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch skill/mlflow-autoexport-fix

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
.claude/skills/evaluation/recipes/examples/example_eval.yaml (1)

106-106: ⚡ Quick win

Consider adding validation for CHANGEME placeholder values.

The CHANGEME-served-model-name placeholder is clear and prominent, but if users forget to replace it, MLflow will accept the run with incorrect metadata. While the documentation instructs users to substitute actual values, runtime validation could prevent this mistake.

Optional: Consider whether NEL or a pre-submit hook could validate that CHANGEME placeholders have been replaced, or add a more prominent warning in the YAML comments.

Also applies to: 114-114

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/recipes/examples/example_eval.yaml at line 106,
The experiment_name field currently uses the literal placeholder
"CHANGEME-served-model-name" which may be left unchanged; add runtime validation
to reject or warn when experiment_name contains the substring "CHANGEME" (or
exactly matches that pattern) before creating the MLflow run. Update the code
path that reads/uses experiment_name (where experiment_name is referenced) to
detect this placeholder and either throw an error or log a fatal message
instructing the user to replace it (and do the same check for the other
occurrence noted at the same file); optionally add a pre-submit hook or CI lint
rule to scan the YAML for "CHANGEME" placeholders.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In @.claude/skills/evaluation/recipes/examples/example_eval.yaml:
- Line 106: The experiment_name field currently uses the literal placeholder
"CHANGEME-served-model-name" which may be left unchanged; add runtime validation
to reject or warn when experiment_name contains the substring "CHANGEME" (or
exactly matches that pattern) before creating the MLflow run. Update the code
path that reads/uses experiment_name (where experiment_name is referenced) to
detect this placeholder and either throw an error or log a fatal message
instructing the user to replace it (and do the same check for the other
occurrence noted at the same file); optionally add a pre-submit hook or CI lint
rule to scan the YAML for "CHANGEME" placeholders.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1da619c2-ad96-43fa-811c-a1ee882a7ffa

📥 Commits

Reviewing files that changed from the base of the PR and between b3d83ba and e3d5530.

📒 Files selected for processing (2)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml

The export.mlflow temperature/top_p/max_new_tokens literals duplicate the
top-level evaluation params and can drift. They are the only queryable record
of sampling in MLflow (NEL does not log them as run params), so they are worth
keeping — but since they cannot be interpolated under auto-export, the comment
in example_eval.yaml and the SKILL.md guidance now require updating them in the
same edit as the params, else MLflow misreports the run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.64%. Comparing base (f99d83e) to head (5c2e28b).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1586      +/-   ##
==========================================
- Coverage   76.67%   76.64%   -0.04%     
==========================================
  Files         478      478              
  Lines       52393    52988     +595     
==========================================
+ Hits        40174    40614     +440     
- Misses      12219    12374     +155     
Flag Coverage Δ
unit 53.61% <ø> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@meenchen meenchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Docs-only fix to the evaluation skill template. The two changes are consistent and well-explained: (1) auto_export.destinations is annotated as the required trigger, and (2) the export.mlflow block switches from ${deployment.*} / ${evaluation.*} cross-refs (which fail at NEL submit time, where that scope isn't available) to CHANGEME-… literals plus ${oc.env:USER} (env-var interpolation, which still resolves). SKILL.md Step 2 shortcut is updated to match, including the drift-warning to keep the literal temperature / top_p / max_new_tokens tags in sync with the top-level params. Author reports verifying upload with a real run. No licensing, code, or architecture impact.

@cjluo-nv cjluo-nv merged commit 2b7668d into main Jun 1, 2026
33 checks passed
@cjluo-nv cjluo-nv deleted the skill/mlflow-autoexport-fix branch June 1, 2026 20:11
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-01 20:11 UTC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants