| Other values | Try `--quantization modelopt`; consult vLLM/SGLang docs if unsure |
-
If no `hf_quant_config.json`, the checkpoint is unquantized — no flag needed.
+
If no `hf_quant_config.json`, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither is found, the checkpoint is unquantized — no flag needed.
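The two-step check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `detect_quantization` and its return values are hypothetical, and only the file names and the `quantization_config` / `quant_method` keys come from the text.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quantization(checkpoint_dir: str) -> Optional[str]:
    """Return the file that signals quantization, or None if unquantized.

    Hypothetical helper: mirrors the order of checks described in the text.
    """
    root = Path(checkpoint_dir)

    # Step 1: a dedicated hf_quant_config.json marks a quantized checkpoint.
    if (root / "hf_quant_config.json").is_file():
        return "hf_quant_config.json"

    # Step 2: otherwise look for quantization_config.quant_method == "modelopt"
    # inside config.json.
    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text())
        quant = config.get("quantization_config") or {}
        if quant.get("quant_method") == "modelopt":
            return "config.json"

    # Neither signal found: treat the checkpoint as unquantized (no flag needed).
    return None
```

A `None` result corresponds to the "no flag needed" case; any other result means the quantization table above should be consulted for the appropriate flag.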
**Quantization-aware benchmark defaults:**
When a quantized checkpoint is detected, recommend benchmarks sensitive to quantization accuracy loss:
-
- **Always include**: MMLU (general knowledge, most affected by quantization)
+
- **Always include**: MMLU (general knowledge; typically shows measurable accuracy loss from quantization)
- **Less useful for quant comparison**: IFEval (instruction following; rarely affected by quantization)
+
- **Less useful for quant comparison**: IFEval (instruction following; typically less affected, but worth including for aggressive quantization like FP4)
Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed.
-
Use WebSearch to find the model card (HuggingFace, build.nvidia.com). Read the FULL text carefully; the devil is in the details. Extract ALL relevant configurations:
- TP/DP settings (to set them appropriately, AskUserQuestion about how many GPUs the model will be deployed on)
-
- Reasoning config (if applicable):
-
- reasoning on/off: use either:
-
- `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) with no reasoning-related `adapter_config.params_to_add` (leave any `params_to_add` entries unrelated to reasoning untouched)
-
- `adapter_config.params_to_add` as a payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`), with no `adapter_config.custom_system_prompt` and with `adapter_config.use_system_prompt: false` (leave any `custom_system_prompt` and `use_system_prompt` settings unrelated to reasoning untouched).
-
- reasoning effort/budget (if it's configurable, AskUserQuestion what reasoning effort they want)
-
- higher `max_new_tokens`
-
- etc.
-
- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command)
-
- Deployment-specific vLLM/SGLang versions (by default we use latest docker images, but you can control it with `deployment.image` e.g. vLLM above `vllm/vllm-openai:v0.11.0` stopped supporting `rope-scaling` arg used by Qwen models)
-
- ARM64 / non-standard GPU compatibility: The default `vllm/vllm-openai` image only supports common GPU architectures. For ARM64 platforms or GPUs with non-standard compute capabilities (e.g., NVIDIA GB10 with sm_121), use NGC vLLM images instead:
- AskUserQuestion about their GPU architecture if the model card doesn't specify deployment constraints
-
- Any preparation requirements (e.g., downloading reasoning parsers, custom plugins):
-
- If the model card mentions downloading files (like reasoning parsers, custom plugins) before deployment, add `deployment.pre_cmd` with the download command
-
- Use `curl` instead of `wget` as it's more widely available in Docker containers
- When using `pip install` in `pre_cmd`, always use `--no-cache-dir` to avoid cross-device link errors in Docker containers (the pip cache and temp directories may be on different filesystems)
Remember to check `evaluation.nemo_evaluator_config` and `evaluation.tasks.*.nemo_evaluator_config` overrides too for parameters to adjust (e.g. disabling reasoning)!
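The two mutually exclusive ways of toggling reasoning described in the checklist can be sketched as a small Python helper that builds the `adapter_config` fragment. This is an illustration only: the function name `reasoning_adapter_config` and the `style` values are hypothetical, the `/think` and `/no_think` prompts are the examples from the text (the actual strings depend on the model card), and setting `use_system_prompt` to `False` in the second style is one reading of the checklist.

```python
def reasoning_adapter_config(enable: bool, style: str) -> dict:
    """Build an adapter_config fragment that toggles reasoning.

    Hypothetical helper; the key names are taken from the checklist text.
    """
    if style == "system_prompt":
        # Option 1: steer reasoning through the system prompt only.
        return {"custom_system_prompt": "/think" if enable else "/no_think"}

    if style == "chat_template_kwargs":
        # Option 2: steer reasoning through a payload modifier only,
        # with the system prompt mechanism switched off.
        return {
            "use_system_prompt": False,
            "params_to_add": {
                "chat_template_kwargs": {"enable_thinking": enable},
            },
        }

    raise ValueError(f"unknown style: {style!r}")
```

The same fragment can be applied globally or per task, which is how reasoning would be disabled for a single task through a task-level override.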
-
-
Present findings, explain each setting, ask user to confirm or adjust. If no model card found, ask user directly for the above configurations.
+
Read `references/model-card-research.md` for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm.
**Step 4: Fill in remaining missing values**
@@ -197,57 +171,7 @@ Show tasks in the current config. Loop until the user confirms the task list is
**Step 6: Advanced - Multi-node**
-
There are two multi-node patterns. Ask the user which applies:
-
-
**Pattern A: Multi-instance (independent instances with HAProxy)**
-
-
Only if model >120B parameters or user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."
**Pattern A+B combined: Multi-instance with multi-node instances**
-
-
For very large models needing both cross-node parallelism AND multiple instances:
-
-
```yaml
defaults:
  - deployment: vllm_ray

execution:
  num_nodes: 4      # Total nodes
  num_instances: 2  # 2 instances of 2 nodes each → HAProxy auto-enabled

deployment:
  tensor_parallel_size: 8
  pipeline_parallel_size: 2
```
-
-
**Common Confusions**
-
-
- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance.
-
- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas).
-
- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
-
- `num_nodes` must be divisible by `num_instances`.
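The relationships in the list above reduce to simple arithmetic, sketched here in Python. The function name `plan_multi_instance` and the returned keys are hypothetical and exist only for illustration; the rules themselves (divisibility, and global data parallelism as the product of `num_instances` and `data_parallel_size`) come from the list.

```python
def plan_multi_instance(num_nodes: int, num_instances: int,
                        data_parallel_size: int = 1) -> dict:
    """Validate a multi-instance layout and derive its totals.

    Hypothetical helper illustrating the rules stated in the text.
    """
    if min(num_nodes, num_instances, data_parallel_size) < 1:
        raise ValueError("all sizes must be positive integers")

    # num_nodes must be divisible by num_instances.
    if num_nodes % num_instances != 0:
        raise ValueError(
            f"num_nodes={num_nodes} is not divisible by "
            f"num_instances={num_instances}"
        )

    return {
        # Each instance gets an equal share of the nodes.
        "nodes_per_instance": num_nodes // num_instances,
        # Global data parallelism = num_instances x data_parallel_size.
        "global_data_parallel": num_instances * data_parallel_size,
    }
```

For example, `plan_multi_instance(4, 2, 8)` yields 2 nodes per instance and 16 global replicas, matching the "2 instances x 8 DP each = 16 replicas" case above.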
+
If the user needs multi-node evaluation (model >120B, or more throughput), read `references/multi-node.md` for the configuration patterns (HAProxy multi-instance, Ray TP/PP, or combined).
**Step 7: Advanced - Interceptors**
@@ -374,7 +298,7 @@ Now, copy this checklist and track your progress:
"query": "evaluate my model and configure request/response logging interceptors",
+
"files": [],
+
"expected_behavior": [
+
"Follows standard evaluation workflow through Step 6",
+
"In Step 7 (Interceptors): tells user to see the interceptors documentation URL",
+
"Does NOT provide general information about interceptors without reading the docs",
+
"If user asks to configure logging interceptor: reads the interceptor webpage",
+
"Configures interceptor under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config (NOT under target.api_endpoint.adapter_config)",
+
"Uses field names from CLI Configuration section after --overrides keyword",
+
"Does NOT define interceptors list directly (would override full chain with unintended consequences)",
+
"Uses correct field names: max_logged_requests and max_logged_responses (NOT max_saved_* or max_*)",
+
"Proceeds with evaluation after interceptor configuration"
"query": "evaluate QwQ-32B reasoning model with math benchmarks on SLURM using SGLang",
+
"files": [],
+
"expected_behavior": [
+
"Verifies nel is installed",
+
"Asks 5 base config questions — user selects: execution=slurm, deployment=sglang, model_type=reasoning, benchmarks=math_reasoning",
+
"Runs nel skills build-config with --execution slurm --deployment sglang --model_type reasoning --benchmarks math_reasoning",
+
"Searches web for QwQ-32B model card",
+
"Configures reasoning toggle: either via adapter_config.custom_system_prompt (/think, /no_think) or via adapter_config.params_to_add with chat_template_kwargs.enable_thinking",
+
"Sets higher max_new_tokens for reasoning (thinking tokens can be long)",
+
"Asks user about reasoning effort/budget if configurable",
+
"Configures SGLang-specific deployment settings",
+
"Asks user for SLURM hostname, account, partition, walltime",
+
"For nemo_skills.* tasks with self-deployment: adds target.api_endpoint.api_key_name: DUMMY_API_KEY",
+
"Disables reasoning for tasks where it's not needed (e.g., IFEval) using task-level overrides",