Commit ed0a4b1

cjluo-nv and claude authored
Refactor evaluation skill: vLLM cross-check, MLflow defaults, walltime cap (#1561)
### What does this PR do?

Type of change: documentation

Refactor of the `evaluation` skill (`.claude/skills/evaluation/`) with several substantive rule additions plus a significant compression pass.

**Skill rules added / tightened:**

- **Cross-check `recipes.vllm.ai` + HF model card before composing the vLLM command.** Both sources matter; conflicts get surfaced to the user instead of silently picked. WebFetch caveat triage (3 cases) for JS-rendered variant tabs (this caveat bit the agent multiple times during testing; the triage rules name the failure modes).
- **Single `deployment.command:` field** replaces separate `tensor_parallel_size` / `data_parallel_size` / `extra_args` YAML fields. NEL mounts the model at `/checkpoint`; Hydra interpolates `${deployment.port}`.
- **vLLM defaults always included unless a recipe contradicts them** (silence ≠ contradiction):
  - `--max-num-batched-tokens 8192`
  - `--enable-chunked-prefill`
  - `--enable-expert-parallel` (MoE-only, detected via active-param suffix or `num_experts`-like config field)
  - `--max-num-seqs N` where `N = ceil(max_parallelism / data_parallel_size)`, computed after Step 4 fills in `parallelism`.
- **Six-field evaluation params template:** `parallelism`, `request_timeout`, `max_retries`, `max_new_tokens`, `temperature`, `top_p`. No `top_k` / `presence_penalty` / `repetition_penalty` / `min_p` at top level (task harnesses have their own defaults that conflict). No per-task `max_new_tokens` overrides — one ceiling everywhere.
- **`max_new_tokens` mandatory model-card lookup:** highest card-recommended value wins; "card not yet checked + use generic default" is explicitly forbidden (this was a real bug the user caught).
- **AA Index v2 (`recipes/tasks/aa/`)** is the default benchmark set for quantized-checkpoint validation. "AA" / "Artificial Analysis" triggers AA-only mode (no MMLU-Pro / AIME / LiveCodeBench unless asked).
- **MLflow auto-export on by default in shortcut path**, with Hydra-interpolated `experiment_name` (`${USER}/${served_model_name}`), `description` (embeds T / top_p / max_new_tokens), and `tags` (string-coerced via single quotes for MLflow's tag type requirement). Only `tracking_uri` needs user input.
- **Walltime capped at 4h** in generated configs (longer walltimes lower scheduler priority → longer queue). Skill suggests three alternatives for over-4h runs.
- **Example template** updated to the new conventions; `--max-model-len` fallback bumped 32K → 131K to cover AA-LCR.
- **`tau2_bench_telecom` recipe:** `parallelism` left as `???` with rate-limit guidance (canary 32–128, cap 512).
- **`model-card-research.md`:** output-length extraction is now a mandatory, top-level checklist item with a cross-reference to the SKILL.md rule.
- **`env.example`:** added `JUDGE_API_KEY` entry for AIME.

**Compression pass:** SKILL.md compressed ~670 → ~330 lines while preserving every rule. Long prose collapsed into tables and bullets; duplicated workflow checklist at the bottom removed.

### Usage

The skill is invoked when users ask to evaluate a model. The shortcut path now produces a config like:

```yaml
deployment:
  command: >-
    vllm serve /checkpoint
    --host 0.0.0.0
    --port ${deployment.port}
    --tensor-parallel-size 8
    ...
evaluation:
  nemo_evaluator_config:
    config:
      params:
        parallelism: ???            # Required — ask user
        request_timeout: 3600
        max_retries: 10
        max_new_tokens: 81920       # from model card (highest)
        temperature: 1.0
        top_p: 0.95
export:
  mlflow:
    tracking_uri: ???
    experiment_name: ${oc.env:USER}/${deployment.served_model_name}
    description: '...'
    tags:
      framework: vllm
      ...
```

### Testing

Manually exercised the skill end-to-end on four models during refactor (Qwen3.5-122B-A10B-FP8, Qwen3.6-35B-A3B-FP8, Kimi-K2.6, GLM-5.1-NVFP4) to validate that the cross-check rules surface conflicts, the MLflow defaults populate correctly, the AA suite excludes the right tasks, and the walltime cap holds. Test configs are not included in this PR (session artifacts at repo root, gitignored locally).

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ — the skill is editorial/operational; no runtime API changes. Existing configs continue to work.
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A — skill documentation; manually exercised on four models.
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A — skill documentation only.
- Did you get Claude approval on this PR?: ❌ — happy to run `/claude review` if maintainers want it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  - Rewrote the evaluation workflow with a stricter end-to-end checklist, workspace-reuse guidance, AA-only shortcut, mandatory validation steps, iterative finalization loop, and a hard 04:00:00 walltime cap
  - Enforced vLLM as a single deployment command and stricter generation/config constraints, including exact top-level params and model-card–derived max_new_tokens
  - Updated env var guidance (JUDGE_API_KEY / INFERENCE_API_KEY), MLflow export metadata, and registry/auth preflight flow
  - Added several AA-task recipes, added new benchmark recipes, and removed or consolidated legacy task docs

<!-- review_stack_entry_start -->

[![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/Model-Optimizer/pull/1561?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
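The `--max-num-seqs` formula and the always-on vLLM defaults listed in the commit message can be sketched in Python. This is only an illustration under stated assumptions, not code from the skill: the helper names `max_num_seqs` and `compose_vllm_command` and their parameters are hypothetical, only flags quoted in the commit message are emitted, and the "unless a recipe contradicts them" override logic is not modeled.

```python
import math


def max_num_seqs(max_parallelism: int, data_parallel_size: int) -> int:
    """N = ceil(max_parallelism / data_parallel_size), as stated in the commit message."""
    if max_parallelism < 1 or data_parallel_size < 1:
        raise ValueError("both arguments must be positive integers")
    return math.ceil(max_parallelism / data_parallel_size)


def compose_vllm_command(
    tensor_parallel_size: int,
    data_parallel_size: int,
    max_parallelism: int,
    is_moe: bool = False,
) -> str:
    """Hypothetical helper: build the single deployment.command string."""
    parts = [
        "vllm serve /checkpoint",
        "--host 0.0.0.0",
        "--port ${deployment.port}",  # left verbatim for Hydra to interpolate
        f"--tensor-parallel-size {tensor_parallel_size}",
        f"--data-parallel-size {data_parallel_size}",
        "--max-num-batched-tokens 8192",
        "--enable-chunked-prefill",
    ]
    if is_moe:
        # MoE-only default; how MoE is detected is outside this sketch.
        parts.append("--enable-expert-parallel")
    n = max_num_seqs(max_parallelism, data_parallel_size)
    parts.append(f"--max-num-seqs {n}")
    return " ".join(parts)


if __name__ == "__main__":
    print(compose_vllm_command(8, 1, max_parallelism=64, is_moe=True))
```

Because `--max-num-seqs` depends on `parallelism`, the sketch takes `max_parallelism` as an input rather than guessing it, mirroring the rule that the value is computed only after `parallelism` has been filled in.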
1 parent f99d83e commit ed0a4b1

23 files changed

Lines changed: 624 additions & 781 deletions

.claude/skills/evaluation/SKILL.md

Lines changed: 190 additions & 239 deletions
Large diffs are not rendered by default.

.claude/skills/evaluation/recipes/env.example

Lines changed: 12 additions & 4 deletions
@@ -18,11 +18,19 @@ NEMO_EVALUATOR_TRUST_PRE_CMD=1
 
 # --- Optional: task-specific keys ---
 
-# HLE, AA-LCR, and other judge-backed tasks
+# Judge / inference endpoints — two separate env vars by harness:
+#
+# JUDGE_API_KEY — used by simple-evals harness tasks (e.g. AIME 2025).
+# Typically the API key from build.nvidia.com.
+# INFERENCE_API_KEY — used by nemo-skills and tau2-bench harnesses for
+# judge / user-simulator endpoints (HLE, AA-LCR,
+# Tau2-Bench Telecom, etc.).
+#
+# The two keys can point to the same provider/credential — they're separate
+# env vars only because different eval harnesses look up different names.
+# Set both if you run tasks from both harness families.
 # JUDGE_API_KEY=
-
-# tau2_bench_telecom user simulator endpoint
-# USER_API_KEY=
+# INFERENCE_API_KEY=
 
 # terminal-bench-hard (AWS sandbox)
 # AWS_ACCESS_KEY_ID=
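The harness-to-key mapping described in the new `env.example` comments lends itself to a small preflight check. The following is a minimal Python sketch, assuming the three harness family names exactly as written in the comments; the `HARNESS_KEY` table and `missing_keys` function are hypothetical and not part of the skill.

```python
import os

# Mapping taken from the env.example comments in this diff.
HARNESS_KEY = {
    "simple-evals": "JUDGE_API_KEY",
    "nemo-skills": "INFERENCE_API_KEY",
    "tau2-bench": "INFERENCE_API_KEY",
}


def missing_keys(harnesses, environ):
    """Return sorted env var names needed by `harnesses` that are unset or empty."""
    required = {HARNESS_KEY[h] for h in harnesses}
    return sorted(name for name in required if not environ.get(name))


if __name__ == "__main__":
    # Check the current shell environment for a run that mixes both families.
    print(missing_keys(["simple-evals", "nemo-skills"], os.environ))
```

Passing the environment as an argument keeps the function easy to test with a plain dictionary.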
Lines changed: 57 additions & 70 deletions
@@ -1,15 +1,13 @@
-# Example: Quantization Validation Suite
+# Example: AA-aligned eval template (single-task starting point)
 #
-# A balanced set of benchmarks for validating quantized model quality.
-# Copy this file and customize for your needs.
-# Task references in recipes/tasks/ define benchmark requirements and YAML
-# fragments — the agent composes them into a runnable config like this one.
+# Minimal config to validate a quantized model end-to-end against one AA-style
+# benchmark. The agent extends this template by copying additional task
+# fragments from recipes/tasks/aa/*.md into the `evaluation.tasks` list — task
+# references define benchmark requirements and YAML fragments, this file
+# defines the deployment + operational conventions.
 #
-# Includes:
-# - MMLU-Pro (knowledge, completions)
-# - GPQA Diamond (reasoning, chat, 32 repeats)
-# - LiveCodeBench v6 (code, chat, 3 repeats)
-# - IFBench (instruction following, chat, 8 repeats)
+# Includes (default): gpqa_diamond_aa_v3 (simple-evals harness, n_samples=16).
+# Add more AA tasks per recipes/tasks/aa/ and the AA suite rule in SKILL.md.
 #
 # Usage:
 # nel run --config recipes/examples/example_eval.yaml \
@@ -22,10 +20,15 @@
 # For quantized checkpoints, do not add a vLLM quantization flag by default.
 # Recent vLLM reads ModelOpt quantization metadata from the checkpoint. Only add
 # an explicit flag if the model card, vLLM version, or dry-run error requires it.
-# -o 'deployment.extra_args=--max-model-len 32768 --trust-remote-code'
+#
+# Deployment uses a single `command:` field instead of separate
+# `tensor_parallel_size` / `data_parallel_size` / `extra_args` fields — the full
+# `vllm serve` invocation lives in the command string. NEL mounts the resolved
+# model (from checkpoint_path or hf_model_handle) at /checkpoint inside the
+# container, and Hydra interpolates ${deployment.port} at run time.
 #
 # Run a single task:
-# nel run --config ... -t ns_gpqa
+# nel run --config ... -t gpqa_diamond_aa_v3
 #
 # Canary (2 samples): use this before a full run to validate logs and tune
 # parallelism.
@@ -42,84 +45,68 @@ execution:
   walltime: "04:00:00"
   mounts:
     mount_home: false
+  auto_export:
+    destinations:
+      - mlflow
 deployment:
   env_vars:
     HF_TOKEN: host:HF_TOKEN
   checkpoint_path: ???
   hf_model_handle:
   served_model_name: ???
-  tensor_parallel_size: 1
-  data_parallel_size: 1
-  # For models with custom code, add: --trust-remote-code
-  extra_args: --max-model-len 32768
+  image: vllm/vllm-openai:v0.19.1
+  # For MoE models, add `--enable-expert-parallel` to the command.
+  # For models with custom code, add `--trust-remote-code` to the command.
+  # After filling in evaluation `parallelism` values (top-level + per-task),
+  # append `--max-num-seqs N` to the command where
+  # N = ceil(max_parallelism / data_parallel_size).
+  command: >-
+    vllm serve /checkpoint
+    --host 0.0.0.0
+    --port ${deployment.port}
+    --tensor-parallel-size 1
+    --data-parallel-size 1
+    --max-model-len 131072
+    --max-num-batched-tokens 8192
+    --enable-chunked-prefill
 evaluation:
   env_vars:
     HF_TOKEN: host:HF_TOKEN
   nemo_evaluator_config:
     config:
       params:
+        parallelism: ??? # Number of concurrent requests per benchmark
         request_timeout: 3600
         max_retries: 10
-        parallelism: 16
+        max_new_tokens: 65536 # 64K for reasoning models; use 16384 (16K) for non-reasoning; prefer model card value
+        temperature: 1.0 # from model card (reasoning mode); adjust per card
+        top_p: 0.95 # from model card (reasoning mode); adjust per card
     target:
       api_endpoint:
         api_key_name: DUMMY_API_KEY
   tasks:
-    # Knowledge (chat endpoint, short)
-    - name: nemo_skills.ns_mmlu_pro
-      nemo_evaluator_config:
-        config:
-          params:
-            extra:
-              num_repeats: 1
-              args: ++prompt_config=eval/aai/mcq-10choices-boxed ++inference.tokens_to_generate=null
-        target:
-          api_endpoint:
-            adapter_config:
-              params_to_remove:
-                - max_new_tokens
-                - max_completion_tokens
-
-    # Reasoning (chat endpoint, 32 repeats, short)
-    - name: ns_gpqa
+    # Reasoning (chat endpoint, 16 repeats, short)
+    - name: gpqa_diamond_aa_v3
+      container: nvcr.io/nvidia/eval-factory/simple-evals:26.03
       nemo_evaluator_config:
         config:
           params:
             extra:
-              args: ++prompt_config=eval/aai/mcq-4choices
-              num_repeats: 32
-        target:
-          api_endpoint:
-            adapter_config:
-              params_to_remove:
-                - max_new_tokens
-                - max_completion_tokens
+              n_samples: 16
 
-    # Code (chat endpoint, 3 repeats, medium)
-    - name: ns_livecodebench
-      nemo_evaluator_config:
-        config:
-          params:
-            extra:
-              dataset_split: test_v6_2408_2505
-              num_repeats: 3
-        target:
-          api_endpoint:
-            adapter_config:
-              params_to_remove:
-                - max_new_tokens
-                - max_completion_tokens
-
-    # Instruction following (chat endpoint, 8 repeats, super short)
-    - name: ns_ifbench
-      nemo_evaluator_config:
-        config:
-          params:
-            extra:
-              num_repeats: 8
-        target:
-          api_endpoint:
-            adapter_config:
-              params_to_remove:
-                - max_new_tokens
-                - max_completion_tokens
+export:
+  mlflow:
+    tracking_uri: ???
+    experiment_name: ${oc.env:USER}/${deployment.served_model_name}
+    description: '${oc.env:USER}/${deployment.served_model_name} | T=${evaluation.nemo_evaluator_config.config.params.temperature}, top_p=${evaluation.nemo_evaluator_config.config.params.top_p},
+      max_new_tokens=${evaluation.nemo_evaluator_config.config.params.max_new_tokens}'
+    log_logs: true
+    log_artifacts: true
+    only_required: false
+    skip_existing: false
+    tags:
+      framework: vllm
+      model: ${deployment.served_model_name}
+      temperature: '${evaluation.nemo_evaluator_config.config.params.temperature}'
+      top_p: '${evaluation.nemo_evaluator_config.config.params.top_p}'
+      max_new_tokens: '${evaluation.nemo_evaluator_config.config.params.max_new_tokens}'
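To make the `export.mlflow` block above more concrete, here is a rough Python sketch of what the interpolated metadata would look like once resolved. It is an assumption-laden illustration: `build_mlflow_metadata` is a hypothetical helper, the resolution is done with plain f-strings rather than Hydra, and the exact float formatting Hydra produces may differ. The point it shows is that every tag value ends up as a string, which is what the single quotes in the YAML are there to ensure.

```python
def build_mlflow_metadata(user: str, served_model_name: str, params: dict) -> dict:
    """Hypothetical helper mirroring the experiment_name / description / tags template."""
    temperature = params["temperature"]
    top_p = params["top_p"]
    max_new_tokens = params["max_new_tokens"]
    experiment = f"{user}/{served_model_name}"
    return {
        "experiment_name": experiment,
        "description": (
            f"{experiment} | T={temperature}, top_p={top_p}, "
            f"max_new_tokens={max_new_tokens}"
        ),
        # Tag values are coerced to strings, matching the quoted YAML values.
        "tags": {
            "framework": "vllm",
            "model": served_model_name,
            "temperature": str(temperature),
            "top_p": str(top_p),
            "max_new_tokens": str(max_new_tokens),
        },
    }


if __name__ == "__main__":
    # "alice" and "my-model" are placeholder values for illustration only.
    meta = build_mlflow_metadata(
        "alice",
        "my-model",
        {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 65536},
    )
    print(meta["description"])
```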
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# GPQA Diamond
+
+## Task Details
+
+- Reference: <https://docs.nvidia.com/nemo/evaluator/latest/evaluation/benchmarks/catalog/all/harnesses/simple_evals.html#simple-evals-gpqa-diamond-aa-v3>
+
+## Params
+
+## YAML Fragment
+
+Use this inside the top-level `evaluation.tasks` list:
+
+```yaml
+- name: gpqa_diamond_aa_v3
+  container: nvcr.io/nvidia/eval-factory/simple-evals:26.03
+  nemo_evaluator_config:
+    config:
+      params:
+        extra:
+          n_samples: 16
+```
+
+## Score Extraction from mlflow
+
+Result (0-100): `gpqa_diamond_score_micro_avg_of_N`
+
+N is the repeat count. If the repeat count is unknown, use the highest available `avg_of_N`.
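The "use the highest available `avg_of_N`" rule can be sketched as a small metric selector. This is a minimal Python illustration, assuming metrics arrive as a flat name-to-value dictionary; `select_avg_of_n` and the `{n}` placeholder convention are hypothetical, and the metric values in the usage example are invented. The same function also covers the hyphenated `avg-of-N` names used by the nemo-skills recipes in this commit.

```python
import re


def select_avg_of_n(metrics: dict, template: str, repeats=None):
    """Return (name, value) for the metric matching `template`.

    `template` holds a literal "{n}" where the repeat count goes, e.g.
    "gpqa_diamond_score_micro_avg_of_{n}". If `repeats` is given, that exact
    metric is looked up; otherwise the highest available N is chosen.
    """
    if repeats is not None:
        name = template.format(n=repeats)
        return name, metrics[name]
    head, tail = template.split("{n}")
    pattern = re.compile(re.escape(head) + r"(\d+)" + re.escape(tail))
    candidates = []
    for name in metrics:
        match = pattern.fullmatch(name)
        if match:
            # Compare N numerically so that 16 beats 4 (string order would not).
            candidates.append((int(match.group(1)), name))
    if not candidates:
        raise KeyError(f"no metric matches {template!r}")
    _, best = max(candidates)
    return best, metrics[best]


if __name__ == "__main__":
    # Invented values, for illustration only.
    sample = {
        "gpqa_diamond_score_micro_avg_of_4": 70.0,
        "gpqa_diamond_score_micro_avg_of_16": 71.5,
    }
    print(select_avg_of_n(sample, "gpqa_diamond_score_micro_avg_of_{n}"))
```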
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# HLE
+
+## Task Details
+
+- Reference: <https://docs.nvidia.com/nemo/evaluator/latest/evaluation/benchmarks/catalog/all/harnesses/nemo_skills.html#nemo-skills-ns-hle-aa>
+
+## Params
+
+This is the text-only HLE task with params aligned to Artificial Analysis Index
+v2. HLE is judge-scored and requires judge credentials.
+
+## YAML Fragment
+
+Use this inside the top-level `evaluation.tasks` list:
+
+```yaml
+- name: ns_hle_aa
+  container: nvcr.io/nvidia/eval-factory/nemo-skills:26.03
+  env_vars:
+    INFERENCE_API_KEY: host:INFERENCE_API_KEY
+  nemo_evaluator_config:
+    config:
+      params:
+        extra:
+          judge:
+            model_id: <hle_aa_judge_model_id>
+            url: <openai_compatible_judge_chat_completions_url>
+            api_key: INFERENCE_API_KEY
+```
+
+## Score Extraction from mlflow
+
+Result (0-100): `hle_pass_at_1_judge_correct`
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# IFBench
+
+## Task Details
+
+- Reference: <https://docs.nvidia.com/nemo/evaluator/latest/evaluation/benchmarks/catalog/all/harnesses/nemo_skills.html#nemo-skills-ns-ifbench>
+
+## Params
+
+## YAML Fragment
+
+Use this inside the top-level `evaluation.tasks` list:
+
+```yaml
+- name: ns_ifbench
+  container: nvcr.io/nvidia/eval-factory/nemo-skills:26.03
+  nemo_evaluator_config:
+    config:
+      params:
+        extra:
+          num_repeats: 5
+```
+
+## Score Extraction from mlflow
+
+Result (0-100): `ifbench_pass_at_1_avg-of-N_prompt_loose_accuracy`
+
+N is the repeat count. If the repeat count is unknown, use the highest available `avg-of-N`.
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+# LCR
+
+## Task Details
+
+- Reference: <https://docs.nvidia.com/nemo/evaluator/latest/evaluation/benchmarks/catalog/all/harnesses/nemo_skills.html#nemo-skills-ns-aa-lcr>
+
+## Params
+
+Recommended judge: use Qwen3 235B as an OpenAI-compatible equality-checker
+judge, and keep the same judge across comparable runs.
+
+AA-LCR needs long context: plan for roughly 120K input tokens plus 16K
+generation tokens. Set deployment `--max-model-len` to at least `131072`, and
+use a larger value when the model supports it.
+
+## YAML Fragment
+
+LCR has a deployment-side requirement (`--max-model-len 131072`) and a task
+block. Per SKILL.md Step 3, the deployment flag must live inside
+`deployment.command:` — not in the deprecated `extra_args` field.
+
+**Deployment requirement:** ensure the `vllm serve ...` invocation in
+`deployment.command` includes `--max-model-len 131072` (or higher).
+
+```yaml
+- name: ns_aa_lcr
+  container: nvcr.io/nvidia/eval-factory/nemo-skills:26.03
+  env_vars:
+    INFERENCE_API_KEY: host:INFERENCE_API_KEY
+  nemo_evaluator_config:
+    target:
+      api_endpoint:
+        adapter_config:
+          use_request_logging: false
+          use_response_logging: false
+    config:
+      params:
+        extra:
+          num_repeats: 16
+          judge:
+            model_id: <qwen3_235b_judge_model_id>
+            url: <openai_compatible_judge_chat_completions_url>
+            api_key: INFERENCE_API_KEY
+```
+
+## Score Extraction from mlflow
+
+Result (0-100): `aalcr_pass_at_1_avg-of-N_judge_correct`
+
+N is the repeat count. If the repeat count is unknown, use the highest available `avg-of-N`.
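The LCR deployment requirement (`--max-model-len 131072` or higher inside `deployment.command`) can be checked mechanically before a run. The sketch below is one possible way to do that in Python, not part of the recipe: `max_model_len` and `meets_lcr_requirement` are hypothetical helpers, and they assume the flag's value is written as a plain integer.

```python
import shlex


def max_model_len(command: str):
    """Return the integer passed to --max-model-len, or None if the flag is absent."""
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        # Space-separated form: --max-model-len 131072
        if token == "--max-model-len" and i + 1 < len(tokens):
            return int(tokens[i + 1])
        # Equals form: --max-model-len=131072
        if token.startswith("--max-model-len="):
            return int(token.split("=", 1)[1])
    return None


def meets_lcr_requirement(command: str, minimum: int = 131072) -> bool:
    """True when the command sets --max-model-len to at least `minimum`."""
    value = max_model_len(command)
    return value is not None and value >= minimum


if __name__ == "__main__":
    cmd = "vllm serve /checkpoint --port ${deployment.port} --max-model-len 131072"
    print(meets_lcr_requirement(cmd))
```

A missing flag is treated as a failure rather than as "use the model default", since the recipe asks for the flag to be present explicitly.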
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# MMMU-Pro
+
+## Task Details
+
+- Reference: <https://docs.nvidia.com/nemo/evaluator/latest/evaluation/benchmarks/catalog/all/harnesses/nemo_skills.html>
+
+## Params
+
+MMMU-Pro is a multimodal task. Use a multimodal-capable endpoint.
+
+## YAML Fragment
+
+Use this inside the top-level `evaluation.tasks` list:
+
+```yaml
+- name: ns_mmmu_pro
+  container: nvcr.io/nvidia/eval-factory/nemo-skills:26.03
+  nemo_evaluator_config:
+    config:
+      params:
+        extra:
+          num_repeats: 1
+```
+
+## Score Extraction from mlflow
+
+Result (0-100): `mmmu-pro_pass_at_1_symbolic_correct`
