Commit bf4941b

Add Agent Evaluation skill for accuracy benchmarking
Add a Claude Code skill for evaluating LLM accuracy using NeMo Evaluator Launcher (NEL). Based on the upstream nel-assistant skill (commit f1fa073) with ModelOpt-specific additions:

- Auto-detect ModelOpt quantization format from hf_quant_config.json and set the correct vLLM/SGLang --quantization flag
- Quantization-aware benchmark defaults (recommend MMLU, GSM8K, ARC-Challenge for quantized models)
- Workspace management for multi-user environments (Step 0)
- Disable MD036/MD029 markdownlint rules for upstream NEL formatting

The skill guides users through NEL config generation, model card research, and evaluation execution (local and SLURM).

Signed-off-by: Kai Xu <kaix@nvidia.com>
1 parent 24ceba6 commit bf4941b

3 files changed

Lines changed: 414 additions & 0 deletions

File tree

.claude/skills/evaluation/SKILL.md

Lines changed: 386 additions & 0 deletions
@@ -0,0 +1,386 @@
1+
---
2+
name: evaluation
3+
description: Evaluate accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Use when user says "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel", or needs to measure how quantization affects model quality. Handles model deployment, config generation, and evaluation execution.
4+
license: Apache-2.0
5+
# Based on nel-assistant skill from NeMo Evaluator Launcher (commit f1fa073)
6+
# https://github.com/NVIDIA-NeMo/Evaluator/tree/f1fa073/packages/nemo-evaluator-launcher/.claude/skills/nel-assistant
7+
# Modifications: renamed to evaluation, added workspace management (Step 0),
8+
# auto-detect ModelOpt quantization format, quantization-aware benchmark defaults.
9+
---
10+
11+
## NeMo Evaluator Launcher Assistant
12+
13+
You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via an interactive workflow specified below.
14+
15+
### Workspace (multi-user / Slack bot)
16+
17+
If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.
18+
19+
### Workflow
20+
21+
```text
Config Generation Progress:
- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
- [ ] Step 1: Check if nel is installed
- [ ] Step 2: Build the base config file
- [ ] Step 3: Configure model path and parameters
- [ ] Step 4: Fill in remaining missing values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 8: Run the evaluation
```
33+
34+
**Step 1: Check if nel is installed**
35+
36+
Test that `nel` is installed with `nel --version`.
37+
38+
If not, instruct the user to `pip install nemo-evaluator-launcher`.
39+
40+
**Step 2: Build the base config file**
41+
42+
Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:
43+
44+
1. Execution:

   - Local
   - SLURM

2. Deployment:

   - None (External)
   - vLLM
   - SGLang
   - NIM
   - TRT-LLM

3. Auto-export:

   - None (auto-export disabled)
   - MLflow
   - wandb

4. Model type:

   - Base
   - Chat
   - Reasoning

5. Benchmarks:
   Allow for multiple choices in this question.
   1. Standard LLM Benchmarks (like MMLU, IFEval, GSM8K, ...)
   2. Code Evaluation (like HumanEval, MBPP, and LiveCodeBench)
   3. Math & Reasoning (like AIME, GPQA, MATH-500, ...)
   4. Safety & Security (like Garak and Safety Harness)
   5. Multilingual (like MMATH, Global MMLU, MMLU-Prox)
76+
77+
DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.
78+
79+
When you have all the answers, run the script to build the base config:
80+
81+
```bash
82+
nel skills build-config --execution <local|slurm> --deployment <none|vllm|sglang|nim|trtllm> --model_type <base|chat|reasoning> --benchmarks <standard|code|math_reasoning|safety|multilingual> [--export <none|mlflow|wandb>] [--output <OUTPUT>]
83+
```
84+
85+
Where `--output` depends on what the user provides:
86+
87+
- Omit: Uses current directory with auto-generated filename
88+
- Directory: Writes to that directory with auto-generated filename
89+
- File path (*.yaml): Writes to that specific file
90+
91+
It never overwrites existing files.
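
The command line above can be assembled mechanically from the five answers. The following is a minimal Python sketch, assuming only the flag names and allowed values shown in the usage line. The helper `build_config_command` is hypothetical and not part of `nel`, and passing several benchmark categories as space-separated values after a single `--benchmarks` flag is an assumption that should be checked against the CLI help.

```python
import shlex

# Allowed values copied from the usage line above.
ALLOWED = {
    "execution": {"local", "slurm"},
    "deployment": {"none", "vllm", "sglang", "nim", "trtllm"},
    "model_type": {"base", "chat", "reasoning"},
    "export": {"none", "mlflow", "wandb"},
}
ALLOWED_BENCHMARKS = {"standard", "code", "math_reasoning", "safety", "multilingual"}


def build_config_command(execution, deployment, model_type, benchmarks,
                         export=None, output=None):
    """Return the argv list for `nel skills build-config` (hypothetical helper)."""
    answers = {"execution": execution, "deployment": deployment,
               "model_type": model_type}
    if export is not None:
        answers["export"] = export
    for name, value in answers.items():
        if value not in ALLOWED[name]:
            raise ValueError(
                f"--{name} must be one of {sorted(ALLOWED[name])}, got {value!r}")
    unknown = set(benchmarks) - ALLOWED_BENCHMARKS
    if not benchmarks or unknown:
        raise ValueError(f"invalid benchmark selection: {sorted(unknown) or 'empty'}")

    argv = ["nel", "skills", "build-config",
            "--execution", execution,
            "--deployment", deployment,
            "--model_type", model_type,
            # Assumption: multiple categories follow one --benchmarks flag.
            "--benchmarks", *benchmarks]
    if export is not None:
        argv += ["--export", export]
    if output is not None:
        argv += ["--output", output]
    return argv


# Print a shell-quoted command for the user to review before running it.
print(shlex.join(build_config_command("slurm", "vllm", "chat", ["standard", "code"])))
```

Rejecting values outside the listed sets mirrors the rule that no other options are allowed for the five questions.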
92+
93+
**Step 3: Configure model path and parameters**
94+
95+
Ask for model path. Determine type:
96+
97+
- Checkpoint path (starts with `/` or `./`) → set `deployment.checkpoint_path: <path>` and `deployment.hf_model_handle: null`
98+
- HF handle (e.g., `org/model-name`) → set `deployment.hf_model_handle: <handle>` and `deployment.checkpoint_path: null`
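
The two cases above reduce to a prefix check. A minimal sketch, assuming the rule exactly as stated (a leading `/` or `./` marks a checkpoint path, anything else is treated as an HF handle); the function name `classify_model_path` is hypothetical, and other relative forms such as `../` are not covered by the rule and would need a follow-up question to the user.

```python
def classify_model_path(model):
    """Map a user-supplied model string to the two deployment fields."""
    if model.startswith(("/", "./")):
        # Local checkpoint: set the path and null out the HF handle.
        return {"checkpoint_path": model, "hf_model_handle": None}
    # Otherwise treat it as a HuggingFace handle such as org/model-name.
    return {"checkpoint_path": None, "hf_model_handle": model}
```

`None` corresponds to `null` in the YAML config, so exactly one of the two fields is set.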
99+
100+
**Auto-detect ModelOpt quantization format** (checkpoint paths only):
101+
102+
Check for `hf_quant_config.json` in the checkpoint directory:
103+
104+
```bash
105+
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null
106+
```
107+
108+
If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantization flag in `deployment.extra_args`:
109+
110+
| `quant_algo` | Flag to add |
111+
|-------------|-------------|
112+
| `FP8` | `--quantization modelopt` |
113+
| `W4A8_AWQ` | `--quantization modelopt` |
114+
| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |
115+
116+
If no `hf_quant_config.json`, the checkpoint is unquantized — no flag needed.
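
The detection logic can be sketched in Python as follows. The mapping is copied from the table above; the function name `detect_quant_flag` is hypothetical, and raising on a `quant_algo` that the table does not list is a design choice of this sketch rather than documented behavior.

```python
import json
from pathlib import Path

# Mapping copied from the table above.
QUANT_FLAGS = {
    "FP8": "--quantization modelopt",
    "W4A8_AWQ": "--quantization modelopt",
    "NVFP4": "--quantization modelopt_fp4",
    "NVFP4_AWQ": "--quantization modelopt_fp4",
}


def detect_quant_flag(checkpoint_path):
    """Return the extra_args flag for a ModelOpt checkpoint, or None if unquantized."""
    config_file = Path(checkpoint_path) / "hf_quant_config.json"
    if not config_file.is_file():
        return None  # no hf_quant_config.json: unquantized, no flag needed
    quant_algo = json.loads(config_file.read_text())["quantization"]["quant_algo"]
    if quant_algo not in QUANT_FLAGS:
        # Not covered by the table: surface it instead of guessing a flag.
        raise ValueError(f"unrecognized quant_algo: {quant_algo!r}")
    return QUANT_FLAGS[quant_algo]
```

The returned string is what would be appended to `deployment.extra_args`.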
117+
118+
**Quantization-aware benchmark defaults:**
119+
120+
When a quantized checkpoint is detected, recommend benchmarks sensitive to quantization accuracy loss:
121+
122+
- **Always include**: MMLU (general knowledge, most affected by quantization)
123+
- **Recommended**: GSM8K (math reasoning — sensitive to precision loss), ARC-Challenge (reasoning)
124+
- **Good to add**: HumanEval (code generation — catches subtle degradation), Winogrande (commonsense)
125+
- **Less useful for quant comparison**: IFEval (instruction following — rarely affected by quantization)
126+
127+
Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed.
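
One way to find the accuracy-sensitive benchmarks a user may have missed is a simple set difference against the tiers above. This is a sketch only: the names are the display names used in this section, not necessarily the task identifiers that `nel ls tasks` reports, and `missed_benchmarks` is a hypothetical helper.

```python
# Tiers copied from the recommendations above; IFEval is deliberately omitted
# because it is listed as less useful for quantization comparison.
QUANT_SENSITIVE = [
    ("MMLU", "always include"),
    ("GSM8K", "recommended"),
    ("ARC-Challenge", "recommended"),
    ("HumanEval", "good to add"),
    ("Winogrande", "good to add"),
]


def missed_benchmarks(user_choice):
    """List accuracy-sensitive benchmarks the user did not pick, in tier order."""
    chosen = {name.lower() for name in user_choice}
    return [(name, tier) for name, tier in QUANT_SENSITIVE
            if name.lower() not in chosen]
```

The user's own selection is never modified; the result is only used to mention what is missing.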
128+
129+
Use WebSearch to find the model card (HuggingFace, build.nvidia.com). Read the FULL text carefully; the details matter. Extract ALL relevant configurations:
130+
131+
- Sampling params (`temperature`, `top_p`)
- Context length (`deployment.extra_args: "--max-model-len <value>"`)
- TP/DP settings (to set them appropriately, use AskUserQuestion to find out how many GPUs the model will be deployed on)
- Reasoning config (if applicable):
  - reasoning on/off: use either:
    - `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) and no `adapter_config.params_to_add` (leave `params_to_add` unrelated to reasoning untouched)
    - `adapter_config.params_to_add` for payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`) and no `adapter_config.custom_system_prompt` and `adapter_config.use_system_prompt: false` (leave `custom_system_prompt` and `use_system_prompt` unrelated to reasoning untouched).
  - reasoning effort/budget (if it's configurable, use AskUserQuestion to find out what reasoning effort they want)
  - higher `max_new_tokens`
  - etc.
- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command)
- Deployment-specific vLLM/SGLang versions (by default we use the latest docker images, but you can control this with `deployment.image`, e.g. vLLM images newer than `vllm/vllm-openai:v0.11.0` no longer support the `rope-scaling` arg used by Qwen models)
- ARM64 / non-standard GPU compatibility: The default `vllm/vllm-openai` image only supports common GPU architectures. For ARM64 platforms or GPUs with non-standard compute capabilities (e.g., NVIDIA GB10 with sm_121), use NGC vLLM images instead:
  - Example: `deployment.image: nvcr.io/nvidia/vllm:26.01-py3`
  - AskUserQuestion about their GPU architecture if the model card doesn't specify deployment constraints
- Any preparation requirements (e.g., downloading reasoning parsers, custom plugins):
  - If the model card mentions downloading files (like reasoning parsers, custom plugins) before deployment, add `deployment.pre_cmd` with the download command
  - Use `curl` instead of `wget` as it's more widely available in Docker containers
  - Example: `pre_cmd: curl -L -o reasoning_parser.py https://huggingface.co/.../reasoning_parser.py`
  - When using `pip install` in `pre_cmd`, always use `--no-cache-dir` to avoid cross-device link errors in Docker containers (the pip cache and temp directories may be on different filesystems)
  - Example: `pre_cmd: pip3 install --no-cache-dir flash-attn --no-build-isolation`
- Any other model-specific requirements
153+
154+
Remember to check `evaluation.nemo_evaluator_config` and `evaluation.tasks.*.nemo_evaluator_config` overrides too for parameters to adjust (e.g. disabling reasoning)!
155+
156+
Present findings, explain each setting, ask user to confirm or adjust. If no model card found, ask user directly for the above configurations.
157+
158+
**Step 4: Fill in remaining missing values**
159+
160+
- Find all remaining `???` missing values in the config.
161+
- Ask the user only for values that couldn't be auto-discovered from the model card (e.g., SLURM hostname, account, output directory, MLflow/wandb tracking URI). Don't propose any defaults here. Let the user give you the values in plain text.
162+
- Ask the user if they want to change any other defaults e.g. execution partition or walltime (if running on SLURM) or add MLflow/wandb tags (if auto-export enabled).
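
Finding the remaining `???` placeholders can be done with a plain line scan, which avoids needing a YAML parser. A minimal sketch, assuming each placeholder appears as the value at the end of a `key: ???` line; `find_missing_values` is a hypothetical helper.

```python
def find_missing_values(config_text):
    """Return (line_number, line) pairs for lines whose value is still `???`."""
    missing = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        # Drop trailing comments so a `???` inside a comment is not reported.
        content = line.split(" #", 1)[0].rstrip()
        if content.endswith("???"):
            missing.append((number, content.strip()))
    return missing
```

Reporting line numbers makes it easy to show the user exactly which fields still need a value before asking for them.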
163+
164+
**Step 5: Confirm tasks (iterative)**
165+
166+
Show tasks in the current config. Loop until the user confirms the task list is final:
167+
168+
1. Tell the user: "Run `nel ls tasks` to see all available tasks".
169+
2. Ask if they want to add/remove tasks or add/remove/modify task-specific parameter overrides.
170+
Add per-task `nemo_evaluator_config` overrides as specified by the user, e.g.:
171+
172+
```yaml
tasks:
  - name: <task>
    nemo_evaluator_config:
      config:
        params:
          temperature: <value>
          max_new_tokens: <value>
          ...
```
182+
183+
3. Apply changes.
184+
4. Show updated list and ask: "Is the task list final, or do you want to make more changes?"
185+
186+
**Known Issues**
187+
188+
- NeMo-Skills workaround (self-deployment only): If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level:
189+
190+
```yaml
target:
  api_endpoint:
    api_key_name: DUMMY_API_KEY
```
195+
196+
For the None (External) deployment the `api_key_name` should be already defined. The `DUMMY_API_KEY` export is handled in Step 8.
197+
198+
**Step 6: Advanced - Multi-node**
199+
200+
There are two multi-node patterns. Ask the user which applies:
201+
202+
**Pattern A: Multi-instance (independent instances with HAProxy)**
203+
204+
Only if model >120B parameters or user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."
205+
206+
```yaml
execution:
  num_nodes: 4 # Total nodes
  num_instances: 4 # 4 independent instances → HAProxy auto-enabled
```
211+
212+
**Pattern B: Multi-node single instance (Ray TP/PP across nodes)**
213+
214+
When a single model is too large for one node and needs pipeline parallelism across nodes. Use `vllm_ray` deployment config:
215+
216+
```yaml
defaults:
  - deployment: vllm_ray # Built-in Ray cluster setup (replaces manual pre_cmd)

execution:
  num_nodes: 2 # Single instance spanning 2 nodes

deployment:
  tensor_parallel_size: 8
  pipeline_parallel_size: 2
```
227+
228+
**Pattern A+B combined: Multi-instance with multi-node instances**
229+
230+
For very large models needing both cross-node parallelism AND multiple instances:
231+
232+
```yaml
defaults:
  - deployment: vllm_ray

execution:
  num_nodes: 4 # Total nodes
  num_instances: 2 # 2 instances of 2 nodes each → HAProxy auto-enabled

deployment:
  tensor_parallel_size: 8
  pipeline_parallel_size: 2
```
244+
245+
**Common Confusions**
246+
247+
- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance.
248+
- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas).
249+
- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
250+
- `num_nodes` must be divisible by `num_instances`.
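
The arithmetic behind these rules is small enough to check in code. A minimal sketch using the config field names from this section; `summarize_parallelism` is a hypothetical helper.

```python
def summarize_parallelism(num_nodes, num_instances, data_parallel_size=1):
    """Derive nodes per instance and global DP replicas from the config values."""
    if num_nodes % num_instances != 0:
        raise ValueError("num_nodes must be divisible by num_instances")
    return {
        "nodes_per_instance": num_nodes // num_instances,
        "global_data_parallel": num_instances * data_parallel_size,
    }
```

For the combined A+B example (4 nodes, 2 instances) this gives 2 nodes per instance, and with `data_parallel_size` of 8 it reproduces the 16 replicas mentioned above.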
251+
252+
**Step 7: Advanced - Interceptors**
253+
254+
- Point the user to the interceptors documentation: <https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator/interceptors/index.html>.
255+
- DON'T provide any general information about what interceptors typically do in API frameworks without reading the docs. If the user asks about interceptors, only then read the webpage to provide precise information.
256+
- If the user asks you to configure an interceptor, read that interceptor's webpage and configure it using the field names from the `--overrides` syntax, but put the values in the YAML config under `evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config` (NOT under `target.api_endpoint.adapter_config`) instead of using CLI overrides.
  Defining an `interceptors` list would override the full chain of interceptors, which can have unintended consequences such as disabling default interceptors. For that reason, configure interceptors in the YAML config using the fields specified in the `CLI Configuration` section after the `--overrides` keyword.
258+
259+
**Documentation Errata**
260+
261+
- The docs may show incorrect parameter names for logging. Use `max_logged_requests` and `max_logged_responses` (NOT `max_saved_*` or `max_*`).
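
To make the target location concrete, the sketch below builds the nested structure that the YAML config would contain. It assumes the two logging parameters sit directly under `adapter_config`, which is inferred from the text above rather than confirmed against the docs; `set_by_path` is a hypothetical helper and the value `10` is a placeholder.

```python
def set_by_path(config, dotted_path, value):
    """Set a nested key such as 'a.b.c' in a dict, creating levels as needed."""
    *parents, leaf = dotted_path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Prefix taken from the Step 7 instructions: the full path, not the short one.
ADAPTER_PREFIX = (
    "evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config"
)

config = {}
set_by_path(config, f"{ADAPTER_PREFIX}.max_logged_requests", 10)   # placeholder value
set_by_path(config, f"{ADAPTER_PREFIX}.max_logged_responses", 10)  # placeholder value
```

Setting individual fields this way leaves any existing sibling keys in place, which matches the advice not to redefine the whole `interceptors` list.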
262+
263+
**Step 8: Run the evaluation**
264+
265+
Print the following commands to the user. Propose executing them in sequence to confirm that the config works as expected before the full run.
266+
267+
**Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands.
268+
269+
```bash
270+
# If using pre_cmd or post_cmd:
271+
export NEMO_EVALUATOR_TRUST_PRE_CMD=1
272+
273+
# If using nemo_skills.* tasks with self-deployment:
274+
export DUMMY_API_KEY=dummy
275+
```
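
Before running `nel`, it can help to check which required variables are still unset. A minimal sketch; `missing_env_vars` is a hypothetical helper, and the list of required names depends on the config (for example `HF_TOKEN`, `NGC_API_KEY`, or whatever `api_key_name` points to).

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]
```

If the returned list is non-empty, ask the user to add those names to the `.env` file rather than guessing values.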
276+
277+
1. **Dry-run** (validates config without running):
278+
279+
```bash
280+
nel run --config <config_path> --dry-run
281+
```
282+
283+
2. **Test with limited samples** (quick validation run):
284+
285+
```bash
286+
nel run --config <config_path> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
287+
```
288+
289+
3. **Re-run a single task** (useful for debugging or re-testing after config changes):
290+
291+
```bash
292+
nel run --config <config_path> -t <task_name>
293+
```
294+
295+
Combine with `-o` for limited samples: `nel run --config <config_path> -t <task_name> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10`
296+
297+
4. **Full evaluation** (production run):
298+
299+
```bash
300+
nel run --config <config_path>
301+
```
302+
303+
After the dry-run, check the output from `nel` for any problems with the config. If there are no problems, propose to first execute the test run with limited samples and then execute the full evaluation. If there are problems, resolve them before executing the full evaluation.
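
The dry-run, limited-sample, and full-run order can be captured as a list of commands to execute one at a time. This sketch only builds the argument lists from the commands shown above and does not run anything; `validation_sequence` is a hypothetical helper.

```python
LIMIT_OVERRIDE = "++evaluation.nemo_evaluator_config.config.params.limit_samples={n}"


def validation_sequence(config_path, limit_samples=10):
    """Return the three `nel run` invocations in the order they should be tried."""
    base = ["nel", "run", "--config", config_path]
    return [
        base + ["--dry-run"],                                   # 1. validate only
        base + ["-o", LIMIT_OVERRIDE.format(n=limit_samples)],  # 2. quick test run
        list(base),                                             # 3. full evaluation
    ]
```

Each list could be passed to `subprocess.run(cmd, check=True)` so that a failing step stops the sequence before the full evaluation starts.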
304+
305+
**Monitoring Progress**
306+
307+
After job submission, you can monitor progress using:
308+
309+
1. **Check job status:**
310+
311+
```bash
312+
nel status <invocation_id>
313+
nel info <invocation_id>
314+
```
315+
316+
2. **Stream logs** (Local execution only):
317+
318+
```bash
319+
nel logs <invocation_id>
320+
```
321+
322+
Note: `nel logs` is not supported for SLURM execution.
323+
324+
3. **Inspect logs via SSH** (SLURM workaround):
325+
326+
When `nel logs` is unavailable (SLURM), use SSH to inspect logs directly:
327+
328+
First, get log locations:
329+
330+
```bash
331+
nel info <invocation_id> --logs
332+
```
333+
334+
Then, use SSH to view logs:
335+
336+
**Check server deployment logs:**
337+
338+
```bash
339+
ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/server-<slurm_job_id>-*.log"
340+
```
341+
342+
Shows vLLM server startup, model loading, and deployment errors (e.g., missing wget/curl).
343+
344+
**Check evaluation client logs:**
345+
346+
```bash
347+
ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/client-<slurm_job_id>.log"
348+
```
349+
350+
Shows evaluation progress, task execution, and results.
351+
352+
**Check SLURM scheduler logs:**
353+
354+
```bash
355+
ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/slurm-<slurm_job_id>.log"
356+
```
357+
358+
Shows job scheduling, health checks, and overall execution flow.
359+
360+
**Search for errors:**
361+
362+
```bash
363+
ssh <username>@<hostname> "grep -i 'error\|warning\|failed' <log path from `nel info <invocation_id> --logs`>/*.log"
364+
```
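
If the log files have been copied locally, the same search can be done without `grep`. A minimal sketch that applies the same case-insensitive pattern as the command above; `problem_lines` is a hypothetical helper.

```python
import re

# Same alternatives as the grep pattern above, matched case-insensitively.
PROBLEM = re.compile(r"error|warning|failed", re.IGNORECASE)


def problem_lines(log_text):
    """Return (line_number, line) pairs that match the error pattern."""
    return [(number, line)
            for number, line in enumerate(log_text.splitlines(), start=1)
            if PROBLEM.search(line)]
```

Keeping the line numbers makes it easier to jump to the surrounding context in the original log.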
365+
366+
---
367+
368+
Direct users with issues to:
369+
370+
- **GitHub Issues:** <https://github.com/NVIDIA-NeMo/Evaluator/issues>
371+
- **GitHub Discussions:** <https://github.com/NVIDIA-NeMo/Evaluator/discussions>
372+
373+
Now, copy this checklist and track your progress:
374+
375+
```text
Config Generation Progress:
- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
- [ ] Step 1: Check if nel is installed
- [ ] Step 2: Build the base config file
- [ ] Step 3: Configure model path and parameters
- [ ] Step 4: Fill in remaining missing values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 8: Run the evaluation
```
