Commit b8c45e8

Address review comments
Signed-off-by: Kai Xu <kaix@nvidia.com>
1 parent 99dac7d commit b8c45e8

5 files changed

Lines changed: 101 additions & 63 deletions

.claude/skills/evaluation/SKILL.md

Lines changed: 10 additions & 15 deletions
@@ -1,6 +1,6 @@
 ---
 name: evaluation
-description: Evaluate accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Use when user says "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel", or needs to measure how quantization affects model quality. Handles model deployment, config generation, and evaluation execution.
+description: Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq) or deploying/serving models (use deployment).
 license: Apache-2.0
 # Based on nel-assistant skill from NeMo Evaluator Launcher (commit f1fa073)
 # https://github.com/NVIDIA-NeMo/Evaluator/tree/f1fa073/packages/nemo-evaluator-launcher/.claude/skills/nel-assistant
@@ -21,7 +21,7 @@ If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md
 ```text
 Config Generation Progress:
 - [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
-- [ ] Step 1: Check if nel is installed
+- [ ] Step 1: Check if nel is installed and if user has existing config
 - [ ] Step 2: Build the base config file
 - [ ] Step 3: Configure model path and parameters
 - [ ] Step 4: Fill in remaining missing values
@@ -31,11 +31,11 @@ Config Generation Progress:
 - [ ] Step 8: Run the evaluation
 ```

-**Step 1: Check if nel is installed**
+**Step 1: Check prerequisites**

-Test that `nel` is installed with `nel --version`.
+Test that `nel` is installed with `nel --version`. If not, instruct the user to `pip install nemo-evaluator-launcher`.

-If not, instruct the user to `pip install nemo-evaluator-launcher`.
+If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing `???` values, quantization flags) before running.

 **Step 2: Build the base config file**

@@ -76,6 +76,8 @@ Prompt the user with "I'll ask you 5 questions to build the base config we'll ad

 DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.

+> **Note:** These categories come from NEL's `build-config` CLI. If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead.
+
 When you have all the answers, run the script to build the base config:

 ```bash
@@ -118,14 +120,7 @@ If no `hf_quant_config.json`, also check `config.json` for a `quantization_confi

 **Quantization-aware benchmark defaults:**

-When a quantized checkpoint is detected, recommend benchmarks sensitive to quantization accuracy loss:
-
-- **Always include**: MMLU (general knowledge — typically shows measurable accuracy loss from quantization)
-- **Recommended**: GSM8K (math reasoning — sensitive to precision loss), ARC-Challenge (reasoning)
-- **Good to add**: HumanEval (code generation — catches subtle degradation), Winogrande (commonsense)
-- **Less useful for quant comparison**: IFEval (instruction following — typically less affected, but worth including for aggressive quantization like FP4)
-
-Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed.
+When a quantized checkpoint is detected, read `references/quantization-benchmarks.md` for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include.

 Read `references/model-card-research.md` for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm.

@@ -191,7 +186,7 @@ Print the following commands to the user. Propose to execute them in order to co
 **Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands.

 ```bash
-# If using pre_cmd or post_cmd:
+# If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands):
 export NEMO_EVALUATOR_TRUST_PRE_CMD=1

 # If using nemo_skills.* tasks with self-deployment:
@@ -299,7 +294,7 @@ Now, copy this checklist and track your progress:
 ```text
 Config Generation Progress:
 - [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
-- [ ] Step 1: Check if nel is installed
+- [ ] Step 1: Check if nel is installed and if user has existing config
 - [ ] Step 2: Build the base config file
 - [ ] Step 3: Configure model path and parameters
 - [ ] Step 4: Fill in remaining missing values
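
Note on the revised Step 1: it suggests reviewing an existing config for leftover `???` values before skipping to Step 8. A minimal Python sketch of such a check is shown here. It is a line-based heuristic, not a YAML parser, and `find_unfilled` plus the sample keys are hypothetical names chosen for illustration; none of this is part of NEL.

```python
PLACEHOLDER = "???"  # marker the skill text uses for values still to be filled in


def find_unfilled(config_text):
    """Return (line_number, text) pairs for lines whose value is still '???'."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # whole-line comment, ignore
        # Drop a trailing comment so only the key/value part is inspected.
        body = stripped.split(" #", 1)[0].rstrip()
        if body.endswith(PLACEHOLDER):
            hits.append((lineno, body))
    return hits


# Hypothetical config fragment; the keys are illustrative only.
sample = """\
execution:
  hostname: ???
  account: my-account
  output_dir: ???  # set before running
# partition: ???
"""

unfilled = find_unfilled(sample)
for lineno, text in unfilled:
    print(f"line {lineno}: {text}")
```

A plain text scan keeps the sketch dependency-free. A real check would more likely load the file with a YAML library and walk the resulting structure.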
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# Quantization-Aware Benchmark Recommendations
+
+When evaluating a quantized checkpoint, prioritize benchmarks that are sensitive to precision loss.
+
+## Sensitivity ranking
+
+| Priority | Benchmarks | Why |
+|----------|-----------|-----|
+| **Always include** | MMLU | General knowledge — typically shows measurable accuracy loss from quantization |
+| **Recommended** | GSM8K, ARC-Challenge | Math reasoning and general reasoning — sensitive to precision loss |
+| **Good to add** | HumanEval, Winogrande | Code generation and commonsense — catches subtle degradation |
+| **Less useful for quant comparison** | IFEval | Instruction following — typically less affected, but worth including for aggressive quantization like FP4 |
+
+## Recommended sets by use case
+
+| Use case | Benchmarks |
+|----------|-----------|
+| Quick sanity check | MMLU |
+| Standard quant validation | MMLU, GSM8K, ARC-Challenge |
+| Thorough evaluation | MMLU, GSM8K, ARC-Challenge, HumanEval, Winogrande |
+| Code-focused model | HumanEval, MBPP, MMLU |
+| Reasoning model | GSM8K, MATH-500, GPQA, MMLU |
+
+## How to use
+
+Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed.
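
Note on the "How to use" rule: keeping the user's choice while mentioning missed benchmarks can be sketched as a lookup against the "Recommended sets by use case" table. This is only an illustration. The short dictionary keys (`quick`, `standard`, and so on) and the function name `missed_benchmarks` are hypothetical labels, not identifiers used by the skill or by NEL.

```python
# Benchmark sets copied from the "Recommended sets by use case" table.
# The short keys are made-up labels for this sketch.
RECOMMENDED_SETS = {
    "quick": ["MMLU"],
    "standard": ["MMLU", "GSM8K", "ARC-Challenge"],
    "thorough": ["MMLU", "GSM8K", "ARC-Challenge", "HumanEval", "Winogrande"],
    "code": ["HumanEval", "MBPP", "MMLU"],
    "reasoning": ["GSM8K", "MATH-500", "GPQA", "MMLU"],
}


def missed_benchmarks(chosen, use_case="standard"):
    """List recommended benchmarks the user did not pick (case-insensitive)."""
    chosen_norm = {name.lower() for name in chosen}
    return [b for b in RECOMMENDED_SETS[use_case] if b.lower() not in chosen_norm]


# The user asked for MMLU and GSM8K only; the standard set also has ARC-Challenge.
missing = missed_benchmarks(["mmlu", "gsm8k"])
print("Consider adding:", ", ".join(missing) if missing else "nothing")
```

The function only reports gaps and never alters the user's list, which matches the instruction to keep their choice.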
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+[
+  {
+    "name": "nemotron3-nano-bf16-reasoning",
+    "skills": ["evaluation"],
+    "query": "Help me evaluate Nemotron 3 Nano BF16 from NVIDIA",
+    "files": [],
+    "expected_behavior": [
+      "Verifies nel is installed by running 'nel --version'",
+      "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks) before generating the config",
+      "Runs 'nel skills build-config' with correct flags matching user answers: --execution slurm --deployment vllm --model-type reasoning --benchmarks standard code math_reasoning --export mlflow",
+      "Searches the web for the model card on HuggingFace and extracts model-specific settings",
+      "Sets correct HF handle: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
+      "Sets reasoning sampling params from model card: temperature=1.0, top_p=1.0",
+      "Configures reasoning toggle via params_to_add with chat_template_kwargs.enable_thinking (not via system prompt)",
+      "Disables reasoning for IFEval task using enable_thinking: false with use_system_prompt: false",
+      "Adds deployment.pre_cmd using curl (not wget) to download nano_v3_reasoning_parser.py from HuggingFace",
+      "Adds vLLM extra_args including --trust-remote-code, --reasoning-parser-plugin, --reasoning-parser nano_v3, --max-num-seqs 8",
+      "Pins vLLM image to v0.12.0 or later as required by model card",
+      "Adds target.api_endpoint.api_key_name: DUMMY_API_KEY for nemo_skills tasks with self-deployment",
+      "Fills in all ??? placeholders after asking the user for SLURM hostname, account, output_dir, MLflow tracking_uri, and experiment_name",
+      "Applies user-requested SLURM customizations: partition batch_short, walltime 00:20:00, MLflow tag scenario: demo",
+      "Presents task list and waits for user confirmation before proceeding",
+      "Configures request and response logging interceptors under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config using correct field names (max_logged_requests/max_logged_responses, not max_saved_*)",
+      "Handles dry-run failure for missing HF_TOKEN_FOR_GPQA_DIAMOND by offering to fix the config",
+      "Successfully submits test run with limit_samples=10 after dry-run passes",
+      "Provides monitoring commands (nel status, nel info --logs) and inspects server logs via SSH when asked"
+    ]
+  },
+  {
+    "name": "quantized-checkpoint-local-vllm",
+    "skills": ["evaluation"],
+    "query": "evaluate my FP8 quantized Llama checkpoint at ./llama-3.1-8b-fp8 on MMLU and GSM8K",
+    "files": [],
+    "expected_behavior": [
+      "Verifies nel is installed by running nel --version",
+      "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks)",
+      "Runs nel skills build-config with correct flags matching user answers",
+      "Sets deployment.checkpoint_path to ./llama-3.1-8b-fp8 and deployment.hf_model_handle to null",
+      "Auto-detects quantization format by reading ./llama-3.1-8b-fp8/hf_quant_config.json",
+      "Finds quant_algo=FP8 and adds --quantization modelopt to deployment.extra_args",
+      "Recommends accuracy-sensitive benchmarks from references/quantization-benchmarks.md",
+      "Searches web for Llama-3.1-8B model card and extracts sampling params, context length, TP settings",
+      "Fills in remaining missing values by asking user",
+      "Runs dry-run, then test with limit_samples=10, then full evaluation",
+      "Reports accuracy results per benchmark"
+    ]
+  },
+  {
+    "name": "slurm-quantized-model",
+    "skills": ["evaluation"],
+    "query": "Evaluate my quantized Llama-3.1-8B-FP8 checkpoint on mmlu and gsm8k on the SLURM cluster",
+    "files": [],
+    "expected_behavior": [
+      "Verifies nel is installed by running nel --version",
+      "Asks 5 base config questions with execution=slurm pre-selected based on user request",
+      "Runs nel skills build-config with --execution slurm --deployment vllm --benchmarks standard",
+      "Detects FP8 quantization from hf_quant_config.json and sets deployment.extra_args with --quantization modelopt",
+      "Reads references/quantization-benchmarks.md and recommends accuracy-sensitive benchmarks",
+      "Uses WebSearch to research model card for sampling params and context length",
+      "Fills in SLURM-specific values: hostname, account, partition from user input",
+      "Runs dry-run validation before full evaluation",
+      "Provides SSH-based log monitoring commands for SLURM execution"
+    ]
+  }
+]
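
Note on the quantization-detection expectations: two of the cases above expect the skill to read `hf_quant_config.json`, find `quant_algo=FP8`, and add `--quantization modelopt` to `deployment.extra_args`. A rough Python sketch of that flow follows, under stated assumptions. The diff names only the file, the `quant_algo` value, and the `quantization_config` key in `config.json`. The nesting under a `quantization` object, the `quant_method` fallback, and the function names are guesses made for illustration.

```python
import json
import tempfile
from pathlib import Path


def detect_quant_algo(checkpoint_dir):
    """Return the quantization algorithm name found in a checkpoint dir, or None."""
    ckpt = Path(checkpoint_dir)

    hf_quant = ckpt / "hf_quant_config.json"
    if hf_quant.is_file():
        data = json.loads(hf_quant.read_text())
        # Assumed layout: {"quantization": {"quant_algo": "FP8", ...}}
        return (data.get("quantization") or {}).get("quant_algo")

    config = ckpt / "config.json"
    if config.is_file():
        data = json.loads(config.read_text())
        quant = data.get("quantization_config") or {}
        # Assumed keys inside quantization_config; adjust to the real file.
        return quant.get("quant_algo") or quant.get("quant_method")

    return None


def extra_args_for(quant_algo):
    """Map a detected algorithm to extra serving flags (only FP8 is covered here)."""
    return ["--quantization", "modelopt"] if quant_algo == "FP8" else []


# Demo against a throwaway directory standing in for ./llama-3.1-8b-fp8.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hf_quant_config.json").write_text(
        json.dumps({"quantization": {"quant_algo": "FP8"}})
    )
    algo = detect_quant_algo(tmp)

print(algo, extra_args_for(algo))
```

Checking `hf_quant_config.json` first and falling back to `config.json` mirrors the order given in the SKILL.md hunk header. Mappings for formats other than FP8 are deliberately left out because the test cases do not specify them.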

.claude/skills/evaluation/tests/nemotron3-nano-bf16-reasoning.json

Lines changed: 0 additions & 26 deletions
This file was deleted.

.claude/skills/evaluation/tests/quantized-checkpoint-local-vllm.json

Lines changed: 0 additions & 22 deletions
This file was deleted.
