Fix MLflow auto-export in evaluation skill template#1586
Conversation
The export.mlflow block alone does not upload runs: NEL only auto-exports
when execution.auto_export.destinations is set. And with auto-export on, NEL
resolves the export block at submit time in a scope without deployment /
evaluation, so ${deployment.*} / ${evaluation.*} cross-references fail with
"Interpolation key '...' not found".
- example_eval.yaml: mark the auto_export.destinations trigger REQUIRED;
switch the export.mlflow block to literal values (CHANGEME placeholders)
with a note on why interpolation cross-refs break at submit time.
- SKILL.md: rewrite the MLflow shortcut step to require both the trigger and
literal export values, and warn against deployment/evaluation cross-refs
(${oc.env:USER} is fine).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughUpdated documentation and example YAML to clarify MLflow auto-export configuration. Documentation specifies the two required config pieces (auto_export trigger and export.mlflow block), prohibits cross-reference interpolation in export metadata, and reminds users to set tracking_uri. Example YAML now uses literal placeholder and parameter values instead of interpolating from deployment and evaluation config. ChangesMLflow Auto-Export Documentation and Configuration
🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
.claude/skills/evaluation/recipes/examples/example_eval.yaml (1)
106-106: ⚡ Quick winConsider adding validation for CHANGEME placeholder values.
The
CHANGEME-served-model-nameplaceholder is clear and prominent, but if users forget to replace it, MLflow will accept the run with incorrect metadata. While the documentation instructs users to substitute actual values, runtime validation could prevent this mistake.Optional: Consider whether NEL or a pre-submit hook could validate that CHANGEME placeholders have been replaced, or add a more prominent warning in the YAML comments.
Also applies to: 114-114
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/evaluation/recipes/examples/example_eval.yaml at line 106, The experiment_name field currently uses the literal placeholder "CHANGEME-served-model-name" which may be left unchanged; add runtime validation to reject or warn when experiment_name contains the substring "CHANGEME" (or exactly matches that pattern) before creating the MLflow run. Update the code path that reads/uses experiment_name (where experiment_name is referenced) to detect this placeholder and either throw an error or log a fatal message instructing the user to replace it (and do the same check for the other occurrence noted at the same file); optionally add a pre-submit hook or CI lint rule to scan the YAML for "CHANGEME" placeholders.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In @.claude/skills/evaluation/recipes/examples/example_eval.yaml:
- Line 106: The experiment_name field currently uses the literal placeholder
"CHANGEME-served-model-name" which may be left unchanged; add runtime validation
to reject or warn when experiment_name contains the substring "CHANGEME" (or
exactly matches that pattern) before creating the MLflow run. Update the code
path that reads/uses experiment_name (where experiment_name is referenced) to
detect this placeholder and either throw an error or log a fatal message
instructing the user to replace it (and do the same check for the other
occurrence noted at the same file); optionally add a pre-submit hook or CI lint
rule to scan the YAML for "CHANGEME" placeholders.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1da619c2-ad96-43fa-811c-a1ee882a7ffa
📒 Files selected for processing (2)
.claude/skills/evaluation/SKILL.md.claude/skills/evaluation/recipes/examples/example_eval.yaml
The export.mlflow temperature/top_p/max_new_tokens literals duplicate the top-level evaluation params and can drift. They are the only queryable record of sampling in MLflow (NEL does not log them as run params), so they are worth keeping — but since they cannot be interpolated under auto-export, the comment in example_eval.yaml and the SKILL.md guidance now require updating them in the same edit as the params, else MLflow misreports the run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1586 +/- ##
==========================================
- Coverage 76.67% 76.64% -0.04%
==========================================
Files 478 478
Lines 52393 52988 +595
==========================================
+ Hits 40174 40614 +440
- Misses 12219 12374 +155
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
meenchen
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Docs-only fix to the evaluation skill template. The two changes are consistent and well-explained: (1) auto_export.destinations is annotated as the required trigger, and (2) the export.mlflow block switches from ${deployment.*} / ${evaluation.*} cross-refs (which fail at NEL submit time, where that scope isn't available) to CHANGEME-… literals plus ${oc.env:USER} (env-var interpolation, which still resolves). SKILL.md Step 2 shortcut is updated to match, including the drift-warning to keep the literal temperature / top_p / max_new_tokens tags in sync with the top-level params. Author reports verifying upload with a real run. No licensing, code, or architecture impact.
|
What does this PR do?
Type of change: documentation
Fixes two issues in the evaluation skill that caused completed eval runs to silently not upload to MLflow (found while running a real GPQA eval — the run finished successfully but never appeared in MLflow):
Missing auto-export trigger. NEL only auto-exports when
execution.auto_export.destinationsis set; theexport.mlflowblock alone just configures the exporter. The skill's shortcut said to "copy theexport.mlflowblock" without calling out the trigger, so a config could end up with the block but no upload.Interpolated export values break at submit time. With auto-export enabled, NEL resolves the
export.mlflowblock at submit time in a scope that does not includedeployment/evaluation, so${deployment.served_model_name}/${evaluation.*}cross-references fail hard withInterpolation key '...' not found. (example_eval.yamlshipped this interpolated pattern with auto-export, so it would hit the error.)Changes:
example_eval.yaml: mark theauto_export.destinationstrigger as REQUIRED; switch theexport.mlflowblock to literal values (CHANGEME-served-model-nameplaceholders) with a comment explaining why cross-refs break.${oc.env:USER}(env interpolation) is retained since it resolves fine.SKILL.md: rewrite the MLflow shortcut step to require both the trigger and literal export values, and warn against${deployment.*}/${evaluation.*}cross-references.Usage
N/A — documentation/skill-template only.
Testing
pre-commit runpasses on both files (markdownlint + YAML format). Verified the literal-export approach uploads correctly by exporting a real run to MLflow.Before your PR is "Ready for review"
CONTRIBUTING.md: N/AAdditional Information
Separate from #1583 (parallelism reference); both touch the evaluation skill but are independent.
🤖 Generated with Claude Code
Summary by CodeRabbit