Refactor evaluation skill: vLLM cross-check, MLflow defaults, walltime cap (#1561)
### What does this PR do?
Type of change: documentation
Refactor of the `evaluation` skill (`.claude/skills/evaluation/`) with
several substantive rule additions plus a significant compression pass.
**Skill rules added / tightened:**
- **Cross-check `recipes.vllm.ai` + HF model card before composing the
vLLM command.** Both sources matter; conflicts are surfaced to the user
instead of being silently resolved. Adds a three-case triage for the
WebFetch caveat on JS-rendered variant tabs, which tripped the agent
several times during testing; the triage rules name each failure mode.
- **Single `deployment.command:` field** replaces separate
`tensor_parallel_size` / `data_parallel_size` / `extra_args` YAML
fields. NEL mounts the model at `/checkpoint`; Hydra interpolates
`${deployment.port}`.
- **vLLM defaults always included unless a recipe contradicts them**
(silence ≠ contradiction):
- `--max-num-batched-tokens 8192`
- `--enable-chunked-prefill`
- `--enable-expert-parallel` (MoE-only, detected via active-param suffix
or `num_experts`-like config field)
- `--max-num-seqs N` where `N = ceil(max_parallelism /
data_parallel_size)`, computed after Step 4 fills in `parallelism`.
- **Six-field evaluation params template:** `parallelism`,
`request_timeout`, `max_retries`, `max_new_tokens`, `temperature`,
`top_p`. No `top_k` / `presence_penalty` / `repetition_penalty` /
`min_p` at top level (task harnesses have their own defaults that
conflict). No per-task `max_new_tokens` overrides — one ceiling
everywhere.
- **`max_new_tokens` mandatory model-card lookup:** highest
card-recommended value wins; "card not yet checked + use generic
default" is explicitly forbidden (this was a real bug the user caught).
- **AA Index v2 (`recipes/tasks/aa/`)** is the default benchmark set for
quantized-checkpoint validation. "AA" / "Artificial Analysis" triggers
AA-only mode (no MMLU-Pro / AIME / LiveCodeBench unless asked).
- **MLflow auto-export on by default in shortcut path**, with
Hydra-interpolated `experiment_name`
(`${oc.env:USER}/${deployment.served_model_name}`), `description`
(embeds temperature / top_p / max_new_tokens), and `tags` (single-quoted
so that values are strings, as MLflow's tag type requires). Only
`tracking_uri` needs user input.
- **Walltime capped at 4h** in generated configs (longer walltimes lower
scheduler priority → longer queue). Skill suggests three alternatives
for over-4h runs.
- **Example template** updated to the new conventions; `--max-model-len`
fallback bumped 32K → 131K to cover AA-LCR.
- **`tau2_bench_telecom` recipe:** `parallelism` left as `???` with
rate-limit guidance (canary 32–128, cap 512).
- **`model-card-research.md`:** output-length extraction is now a
mandatory, top-level checklist item with a cross-reference to the
SKILL.md rule.
- **`env.example`:** added `JUDGE_API_KEY` entry for AIME.
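
The `--max-num-seqs` rule and the "defaults unless contradicted" rule above can be sketched in a few lines of Python. This is an illustration only, not the skill's implementation: the function names, the `contradicted` set, and the reading of `max_parallelism` as the largest `parallelism` value across tasks are all assumptions made for the example.

```python
import math

# Defaults named in the rules above. A value of None marks a boolean flag.
BASE_DEFAULTS = {
    "--max-num-batched-tokens": "8192",
    "--enable-chunked-prefill": None,
}


def max_num_seqs(max_parallelism: int, data_parallel_size: int) -> int:
    """N = ceil(max_parallelism / data_parallel_size)."""
    if max_parallelism < 1 or data_parallel_size < 1:
        raise ValueError("both arguments must be positive integers")
    return math.ceil(max_parallelism / data_parallel_size)


def default_flags(max_parallelism, data_parallel_size, is_moe,
                  contradicted=frozenset()):
    """Return default vLLM flags, dropping only those a recipe contradicts.

    `contradicted` is a hypothetical set of flag names that a recipe
    explicitly overrides; a recipe that is silent on a flag leaves it in.
    """
    flags = dict(BASE_DEFAULTS)
    if is_moe:
        flags["--enable-expert-parallel"] = None  # MoE-only default
    flags["--max-num-seqs"] = str(max_num_seqs(max_parallelism,
                                               data_parallel_size))
    return [
        name if value is None else f"{name} {value}"
        for name, value in flags.items()
        if name not in contradicted
    ]
```

For example, `parallelism` of 512 with `data_parallel_size` of 4 gives `--max-num-seqs 128`, while 100 with 8 rounds up to 13.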
**Compression pass:** SKILL.md compressed ~670 → ~330 lines while
preserving every rule. Long prose collapsed into tables and bullets;
duplicated workflow checklist at the bottom removed.
### Usage
The skill is invoked when users ask to evaluate a model. The shortcut
path now produces a config like:
```yaml
deployment:
  command: >-
    vllm serve /checkpoint
    --host 0.0.0.0
    --port ${deployment.port}
    --tensor-parallel-size 8
    ...
evaluation:
  nemo_evaluator_config:
    config:
      params:
        parallelism: ???          # Required; ask user
        request_timeout: 3600
        max_retries: 10
        max_new_tokens: 81920     # from model card (highest)
        temperature: 1.0
        top_p: 0.95
export:
  mlflow:
    tracking_uri: ???
    experiment_name: ${oc.env:USER}/${deployment.served_model_name}
    description: '...'
    tags:
      framework: vllm
      ...
```
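
As a rough illustration of the six-field `params` template and the `max_new_tokens` rule, the checks could look like the Python sketch below. The helper names `check_params` and `pick_max_new_tokens` are hypothetical and not part of the skill or of NEL; the sketch only encodes the rules stated in this description.

```python
REQUIRED = ("parallelism", "request_timeout", "max_retries",
            "max_new_tokens", "temperature", "top_p")
FORBIDDEN = ("top_k", "presence_penalty", "repetition_penalty", "min_p")


def check_params(params: dict) -> list:
    """Return a list of problems; an empty list means the template is met."""
    problems = [f"missing: {key}" for key in REQUIRED if key not in params]
    for key in params:
        if key in FORBIDDEN:
            problems.append(f"forbidden at top level: {key}")
        elif key not in REQUIRED:
            problems.append(f"unexpected: {key}")
    return problems


def pick_max_new_tokens(card_values: list) -> int:
    """Highest card-recommended value wins; no generic fallback is allowed."""
    if not card_values:
        raise ValueError("check the model card first; no generic default")
    return max(card_values)
```

Raising on an empty list mirrors the rule that an unchecked model card must not silently fall back to a generic default.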
### Testing
Manually exercised the skill end-to-end on four models during refactor
(Qwen3.5-122B-A10B-FP8, Qwen3.6-35B-A3B-FP8, Kimi-K2.6, GLM-5.1-NVFP4)
to validate that the cross-check rules surface conflicts, the MLflow
defaults populate correctly, the AA suite excludes the right tasks, and
the walltime cap holds. Test configs are not included in this PR
(session artifacts at repo root, gitignored locally).
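
A walltime-cap check of the kind exercised above might look like the following Python sketch. It assumes walltimes are written as `HH:MM:SS` (the `04:00:00` form); other scheduler formats, such as a leading day count, are not handled, and the function names are hypothetical.

```python
MAX_WALLTIME_SECONDS = 4 * 3600  # the 4h cap described in this PR


def walltime_seconds(walltime: str) -> int:
    """Parse an HH:MM:SS string into seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cap_walltime(requested: str) -> str:
    """Clamp a requested walltime to the 4h cap, returned as HH:MM:SS."""
    total = min(walltime_seconds(requested), MAX_WALLTIME_SECONDS)
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

A request above the cap is clamped rather than rejected here; the skill itself suggests alternatives for over-4h runs, which this sketch does not model.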
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅ — the skill is
editorial/operational; no runtime API changes. Existing configs continue
to work.
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A — skill documentation;
manually exercised on four models.
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A — skill documentation only.
- Did you get Claude approval on this PR?: ❌ — happy to run `/claude
review` if maintainers want it.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Documentation**
- Rewrote the evaluation workflow with a stricter end-to-end checklist,
workspace-reuse guidance, AA-only shortcut, mandatory validation steps,
iterative finalization loop, and a hard 04:00:00 walltime cap
- Enforced vLLM as a single deployment command and stricter
generation/config constraints, including exact top-level params and
model-card–derived max_new_tokens
- Updated env var guidance (JUDGE_API_KEY / INFERENCE_API_KEY), MLflow
export metadata, and registry/auth preflight flow
- Added several AA-task recipes, added new benchmark recipes, and
removed or consolidated legacy task docs
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>