Commit a5eb3a6

Refactor and add modelopt path
Signed-off-by: Kai Xu <kaix@nvidia.com>
1 parent bf4941b commit a5eb3a6

15 files changed

Lines changed: 281 additions & 84 deletions

.claude/skills/evaluation/SKILL.md

Lines changed: 7 additions & 83 deletions
Original file line numberDiff line numberDiff line change
@@ -112,48 +112,22 @@ If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantiz
 | `FP8` | `--quantization modelopt` |
 | `W4A8_AWQ` | `--quantization modelopt` |
 | `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |
+| Other values | Try `--quantization modelopt`; consult vLLM/SGLang docs if unsure |

-If no `hf_quant_config.json`, the checkpoint is unquantized — no flag needed.
+If no `hf_quant_config.json`, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither is found, the checkpoint is unquantized — no flag needed.
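
The detection rule in this hunk can be sketched in Python as follows. This is a minimal illustration, not code from the commit: the function name `detect_quantization_flag` is hypothetical, and reading a `quant_algo` key from the `config.json` fallback is an assumption (the diff only says to look for `quant_method: "modelopt"` there). Unknown or missing algorithms fall back to `modelopt`, mirroring the "Other values" row.

```python
import json
from pathlib import Path
from typing import Optional

# Mapping copied from the table in the diff.
QUANT_FLAGS = {
    "FP8": "modelopt",
    "W4A8_AWQ": "modelopt",
    "NVFP4": "modelopt_fp4",
    "NVFP4_AWQ": "modelopt_fp4",
}


def detect_quantization_flag(checkpoint_dir: str) -> Optional[str]:
    """Return the value to pass to --quantization, or None if unquantized."""
    root = Path(checkpoint_dir)

    # 1. Preferred source: hf_quant_config.json -> quantization.quant_algo
    hf_quant = root / "hf_quant_config.json"
    if hf_quant.is_file():
        data = json.loads(hf_quant.read_text())
        algo = (data.get("quantization") or {}).get("quant_algo")
        # "Other values" row: try modelopt when the algorithm is not in the table.
        return QUANT_FLAGS.get(algo, "modelopt")

    # 2. Fallback: config.json with quantization_config.quant_method == "modelopt"
    config = root / "config.json"
    if config.is_file():
        quant_cfg = json.loads(config.read_text()).get("quantization_config") or {}
        if quant_cfg.get("quant_method") == "modelopt":
            # Assumption: a quant_algo key may be present here; default to modelopt.
            return QUANT_FLAGS.get(quant_cfg.get("quant_algo"), "modelopt")

    # 3. Neither file marks the checkpoint as quantized: no flag needed.
    return None
```

A caller would append `--quantization <value>` to `deployment.extra_args` only when the function returns a non-`None` value.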

 **Quantization-aware benchmark defaults:**

 When a quantized checkpoint is detected, recommend benchmarks sensitive to quantization accuracy loss:

-- **Always include**: MMLU (general knowledge, most affected by quantization)
+- **Always include**: MMLU (general knowledge — typically shows measurable accuracy loss from quantization)
 - **Recommended**: GSM8K (math reasoning — sensitive to precision loss), ARC-Challenge (reasoning)
 - **Good to add**: HumanEval (code generation — catches subtle degradation), Winogrande (commonsense)
-- **Less useful for quant comparison**: IFEval (instruction following — rarely affected by quantization)
+- **Less useful for quant comparison**: IFEval (instruction following — typically less affected, but worth including for aggressive quantization like FP4)

 Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed.
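
The "keep their choice but mention what they missed" rule amounts to a tiered set difference. A small sketch, assuming benchmark names are compared case-insensitively; the function name `missed_sensitive_benchmarks` and the plain display names are illustrative and are not actual `nel` task identifiers:

```python
# Tiers copied from the recommendation list in the diff.
SENSITIVE_BENCHMARKS = [
    ("Always include", "MMLU"),
    ("Recommended", "GSM8K"),
    ("Recommended", "ARC-Challenge"),
    ("Good to add", "HumanEval"),
    ("Good to add", "Winogrande"),
]


def missed_sensitive_benchmarks(selected):
    """Return (tier, name) pairs the user did not select, in tier order."""
    chosen = {name.lower() for name in selected}
    return [(tier, name) for tier, name in SENSITIVE_BENCHMARKS
            if name.lower() not in chosen]


# The user's own selection is kept as-is; the result is only a list of suggestions.
for tier, name in missed_sensitive_benchmarks(["mmlu", "IFEval"]):
    print(f"{tier}: {name}")
```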

-Use WebSearch to find model card (HuggingFace, build.nvidia.com). Read it carefully, the FULL text, the devil is in the details. Extract ALL relevant configurations:
-
-- Sampling params (`temperature`, `top_p`)
-- Context length (`deployment.extra_args: "--max-model-len <value>"`)
-- TP/DP settings (to set them appropriately, AskUserQuestion on how many GPUs the model will be deployed)
-- Reasoning config (if applicable):
-  - reasoning on/off: use either:
-    - `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) and no `adapter_config.params_to_add` (leave `params_to_add` unrelated to reasoning untouched)
-    - `adapter_config.params_to_add` for payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`) and no `adapter_config.custom_system_prompt` and `adapter_config.use_system_prompt: false` (leave `custom_system_prompt` and `use_system_prompt` unrelated to reasoning untouched).
-  - reasoning effort/budget (if it's configurable, AskUserQuestion what reasoning effort they want)
-  - higher `max_new_tokens`
-  - etc.
-- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command)
-- Deployment-specific vLLM/SGLang versions (by default we use latest docker images, but you can control it with `deployment.image` e.g. vLLM above `vllm/vllm-openai:v0.11.0` stopped supporting `rope-scaling` arg used by Qwen models)
-- ARM64 / non-standard GPU compatibility: The default `vllm/vllm-openai` image only supports common GPU architectures. For ARM64 platforms or GPUs with non-standard compute capabilities (e.g., NVIDIA GB10 with sm_121), use NGC vLLM images instead:
-  - Example: `deployment.image: nvcr.io/nvidia/vllm:26.01-py3`
-  - AskUserQuestion about their GPU architecture if the model card doesn't specify deployment constraints
-- Any preparation requirements (e.g., downloading reasoning parsers, custom plugins):
-  - If the model card mentions downloading files (like reasoning parsers, custom plugins) before deployment, add `deployment.pre_cmd` with the download command
-  - Use `curl` instead of `wget` as it's more widely available in Docker containers
-  - Example: `pre_cmd: curl -L -o reasoning_parser.py https://huggingface.co/.../reasoning_parser.py`
-  - When using `pip install` in `pre_cmd`, always use `--no-cache-dir` to avoid cross-device link errors in Docker containers (the pip cache and temp directories may be on different filesystems)
-  - Example: `pre_cmd: pip3 install --no-cache-dir flash-attn --no-build-isolation`
-- Any other model-specific requirements
-
-Remember to check `evaluation.nemo_evaluator_config` and `evaluation.tasks.*.nemo_evaluator_config` overrides too for parameters to adjust (e.g. disabling reasoning)!
-
-Present findings, explain each setting, ask user to confirm or adjust. If no model card found, ask user directly for the above configurations.
+Read `references/model-card-research.md` for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm.

 **Step 4: Fill in remaining missing values**

@@ -197,57 +171,7 @@ Show tasks in the current config. Loop until the user confirms the task list is

 **Step 6: Advanced - Multi-node**

-There are two multi-node patterns. Ask the user which applies:
-
-**Pattern A: Multi-instance (independent instances with HAProxy)**
-
-Only if model >120B parameters or user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."
-
-```yaml
-execution:
-  num_nodes: 4 # Total nodes
-  num_instances: 4 # 4 independent instances → HAProxy auto-enabled
-```
-
-**Pattern B: Multi-node single instance (Ray TP/PP across nodes)**
-
-When a single model is too large for one node and needs pipeline parallelism across nodes. Use `vllm_ray` deployment config:
-
-```yaml
-defaults:
-  - deployment: vllm_ray # Built-in Ray cluster setup (replaces manual pre_cmd)
-
-execution:
-  num_nodes: 2 # Single instance spanning 2 nodes
-
-deployment:
-  tensor_parallel_size: 8
-  pipeline_parallel_size: 2
-```
-
-**Pattern A+B combined: Multi-instance with multi-node instances**
-
-For very large models needing both cross-node parallelism AND multiple instances:
-
-```yaml
-defaults:
-  - deployment: vllm_ray
-
-execution:
-  num_nodes: 4 # Total nodes
-  num_instances: 2 # 2 instances of 2 nodes each → HAProxy auto-enabled
-
-deployment:
-  tensor_parallel_size: 8
-  pipeline_parallel_size: 2
-```
-
-**Common Confusions**
-
-- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance.
-- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas).
-- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
-- `num_nodes` must be divisible by `num_instances`.
+If the user needs multi-node evaluation (model >120B, or more throughput), read `references/multi-node.md` for the configuration patterns (HAProxy multi-instance, Ray TP/PP, or combined).
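
The arithmetic in the removed "Common Confusions" list (divisibility of `num_nodes` by `num_instances`, and global data parallelism as `num_instances x data_parallel_size`) can be checked with a short sketch. The `MultiNodeLayout` class is hypothetical and only mirrors the field names used in the YAML above; it is not part of `nel`:

```python
from dataclasses import dataclass


@dataclass
class MultiNodeLayout:
    """Illustrative container for the execution/deployment fields named in the diff."""
    num_nodes: int
    num_instances: int = 1
    data_parallel_size: int = 1

    def __post_init__(self):
        if self.num_nodes < 1 or self.num_instances < 1 or self.data_parallel_size < 1:
            raise ValueError("num_nodes, num_instances and data_parallel_size must be >= 1")
        # Rule from the diff: num_nodes must be divisible by num_instances.
        if self.num_nodes % self.num_instances != 0:
            raise ValueError("num_nodes must be divisible by num_instances")

    @property
    def nodes_per_instance(self) -> int:
        # Each instance spans an equal share of the nodes.
        return self.num_nodes // self.num_instances

    @property
    def global_data_parallel(self) -> int:
        # Rule from the diff: num_instances x data_parallel_size.
        return self.num_instances * self.data_parallel_size


# Pattern A+B from the diff: 4 nodes, 2 instances of 2 nodes each.
layout = MultiNodeLayout(num_nodes=4, num_instances=2, data_parallel_size=8)
print(layout.nodes_per_instance, layout.global_data_parallel)
```

With the values from the diff's own example (2 instances x 8 DP each), `global_data_parallel` comes out to 16 replicas.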

 **Step 7: Advanced - Interceptors**

@@ -374,7 +298,7 @@ Now, copy this checklist and track your progress:

 ```text
 Config Generation Progress:
-- [ ] Step 0: Check workspace (if multi-user)
+- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
 - [ ] Step 1: Check if nel is installed
 - [ ] Step 2: Build the base config file
 - [ ] Step 3: Configure model path and parameters
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate Qwen/Qwen3-0.6B on standard benchmarks locally",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions — user selects: execution=local, deployment=vllm, export=none, model_type=base, benchmarks=standard",
+    "Runs nel skills build-config --execution local --deployment vllm --model_type base --benchmarks standard",
+    "Sets deployment.hf_model_handle to Qwen/Qwen3-0.6B and deployment.checkpoint_path to null",
+    "No hf_quant_config.json since this is an HF hub model — no quantization flag needed",
+    "Searches web for Qwen3-0.6B model card to extract deployment settings",
+    "For local execution: no SLURM-specific config needed",
+    "Fills remaining ??? values",
+    "Shows task list for confirmation",
+    "Runs dry-run, then test with limit_samples=10, then full evaluation",
+    "Reports results"
+  ]
+}
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my model that's already running at http://myserver:8000",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions — user selects deployment=None (External) since model is already deployed",
+    "Runs nel skills build-config with --deployment none",
+    "Configures target.api_endpoint with the user's existing server URL",
+    "Does NOT start a deployment — uses the external endpoint directly",
+    "api_key_name should already be defined for external deployment",
+    "Asks user for model type (base/chat/reasoning) and benchmark selection",
+    "Fills remaining config values",
+    "Runs dry-run, test, then full evaluation against the external endpoint",
+    "Reports results"
+  ]
+}
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my model and configure request/response logging interceptors",
+  "files": [],
+  "expected_behavior": [
+    "Follows standard evaluation workflow through Step 6",
+    "In Step 7 (Interceptors): tells user to see the interceptors documentation URL",
+    "Does NOT provide general information about interceptors without reading the docs",
+    "If user asks to configure logging interceptor: reads the interceptor webpage",
+    "Configures interceptor under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config (NOT under target.api_endpoint.adapter_config)",
+    "Uses field names from CLI Configuration section after --overrides keyword",
+    "Does NOT define interceptors list directly (would override full chain with unintended consequences)",
+    "Uses correct field names: max_logged_requests and max_logged_responses (NOT max_saved_* or max_*)",
+    "Proceeds with evaluation after interceptor configuration"
+  ]
+}
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate a 405B parameter model, I have 4 nodes available",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions",
+    "In Step 6 (Advanced - Multi-node): asks which pattern applies",
+    "For a single 405B model: recommends Pattern B (multi-node single instance with Ray TP/PP)",
+    "Uses vllm_ray deployment config: defaults: [deployment: vllm_ray]",
+    "Sets execution.num_nodes: 2 or 4 depending on GPU memory",
+    "Configures deployment.tensor_parallel_size and pipeline_parallel_size",
+    "Explains: num_instances controls independent instances (with HAProxy), while this is single-instance across nodes",
+    "If user wants throughput AND cross-node: explains Pattern A+B combined",
+    "Notes: num_nodes must be divisible by num_instances",
+    "Proceeds with standard evaluation flow after multi-node config"
+  ]
+}
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my quantized model",
+  "files": [],
+  "expected_behavior": [
+    "Checks if nel is installed by running 'nel --version'",
+    "nel command not found or errors",
+    "Instructs user to install: pip install nemo-evaluator-launcher",
+    "Does NOT attempt to proceed without nel installed",
+    "After user installs, re-checks nel --version and proceeds with workflow"
+  ]
+}

.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 {
-  "skills": ["nel-assistant"],
+  "skills": ["evaluation"],
   "query": "Help me evaluate Nemotron 3 Nano BF16 from NVIDIA",
   "files": [],
   "expected_behavior": [
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate accuracy of my NVFP4 quantized model at ./llama-nvfp4",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions",
+    "Sets deployment.checkpoint_path to ./llama-nvfp4",
+    "Auto-detects quantization by reading ./llama-nvfp4/hf_quant_config.json",
+    "Finds quant_algo contains 'FP4' or 'NVFP4' and adds --quantization modelopt_fp4 to deployment.extra_args",
+    "Does NOT use --quantization modelopt (that's for FP8 only)",
+    "Recommends quantization-sensitive benchmarks: MMLU, GSM8K, ARC-Challenge",
+    "Mentions that NVFP4 inference requires Blackwell GPUs",
+    "Proceeds with standard evaluation flow"
+  ]
+}
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my FP8 quantized Llama checkpoint at ./llama-3.1-8b-fp8 on MMLU and GSM8K",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed by running 'nel --version'",
+    "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks)",
+    "Runs 'nel skills build-config' with correct flags matching user answers",
+    "Sets deployment.checkpoint_path to ./llama-3.1-8b-fp8 and deployment.hf_model_handle to null",
+    "Auto-detects quantization format by reading ./llama-3.1-8b-fp8/hf_quant_config.json",
+    "Finds quant_algo=FP8 and adds --quantization modelopt to deployment.extra_args",
+    "Recommends accuracy-sensitive benchmarks: MMLU (always), GSM8K (math reasoning), ARC-Challenge",
+    "Searches web for Llama-3.1-8B model card and extracts sampling params, context length, TP settings",
+    "Asks user for GPU count to set tensor_parallel_size",
+    "Fills in remaining ??? values by asking user",
+    "Shows task list and confirms with user",
+    "Runs dry-run first: nel run --config <config> --dry-run",
+    "Then test run with limit_samples=10",
+    "Then full evaluation",
+    "Reports accuracy results per benchmark"
+  ]
+}
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate QwQ-32B reasoning model with math benchmarks on SLURM using SGLang",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions — user selects: execution=slurm, deployment=sglang, model_type=reasoning, benchmarks=math_reasoning",
+    "Runs nel skills build-config with --execution slurm --deployment sglang --model_type reasoning --benchmarks math_reasoning",
+    "Searches web for QwQ-32B model card",
+    "Configures reasoning toggle: either via adapter_config.custom_system_prompt (/think, /no_think) or via adapter_config.params_to_add with chat_template_kwargs.enable_thinking",
+    "Sets higher max_new_tokens for reasoning (thinking tokens can be long)",
+    "Asks user about reasoning effort/budget if configurable",
+    "Configures SGLang-specific deployment settings",
+    "Asks user for SLURM hostname, account, partition, walltime",
+    "For nemo_skills.* tasks with self-deployment: adds target.api_endpoint.api_key_name: DUMMY_API_KEY",
+    "Disables reasoning for tasks where it's not needed (e.g., IFEval) using task-level overrides",
+    "Shows task list for confirmation",
+    "Exports DUMMY_API_KEY=dummy before running",
+    "Runs dry-run, test, then full evaluation"
+  ]
+}
