Refactor and add modelopt path

kaix-nv · kaix-nv · commit a5eb3a6c34da · 2026-03-29T11:00:03.000-07:00
Signed-off-by: Kai Xu <kaix@nvidia.com>
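
The quantization detection rule that the SKILL.md hunk in this commit extends can be sketched roughly as follows. This is an illustrative sketch, not code from the repository: the function name `detect_quantization_flag` is hypothetical, the mapping contains only the table rows visible in the diff (the full table may have more), and returning `modelopt` for the `config.json` case is an assumption, since the diff does not say which flag applies there.

```python
import json
from pathlib import Path
from typing import Optional

# Only the rows visible in the diff; the full table in SKILL.md may list more.
FLAG_BY_ALGO = {
    "FP8": "modelopt",
    "W4A8_AWQ": "modelopt",
    "NVFP4": "modelopt_fp4",
    "NVFP4_AWQ": "modelopt_fp4",
}
# "Other values" row: try this first, then consult the vLLM/SGLang docs.
FALLBACK_FLAG = "modelopt"


def detect_quantization_flag(checkpoint_dir: str) -> Optional[str]:
    """Return a value for --quantization, or None if the checkpoint looks unquantized."""
    root = Path(checkpoint_dir)

    # 1. Preferred source: hf_quant_config.json -> quantization.quant_algo
    quant_cfg = root / "hf_quant_config.json"
    if quant_cfg.is_file():
        data = json.loads(quant_cfg.read_text())
        algo = (data.get("quantization") or {}).get("quant_algo")
        if isinstance(algo, str):
            return FLAG_BY_ALGO.get(algo, FALLBACK_FLAG)
        return FALLBACK_FLAG

    # 2. Otherwise: config.json -> quantization_config.quant_method == "modelopt"
    model_cfg = root / "config.json"
    if model_cfg.is_file():
        section = json.loads(model_cfg.read_text()).get("quantization_config") or {}
        if section.get("quant_method") == "modelopt":
            # Assumption: the diff does not name the flag for this case.
            return FALLBACK_FLAG

    # 3. Neither marker found: unquantized, no flag needed.
    return None
```

A caller would append `--quantization <value>` to `deployment.extra_args` only when the function returns a non-`None` value.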
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
@@ -112,48 +112,22 @@ If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantiz
 | `FP8` | `--quantization modelopt` |
 | `W4A8_AWQ` | `--quantization modelopt` |
 | `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |
+| Other values | Try `--quantization modelopt`; consult vLLM/SGLang docs if unsure |
 
-If no `hf_quant_config.json`, the checkpoint is unquantized — no flag needed.
+If no `hf_quant_config.json`, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither is found, the checkpoint is unquantized — no flag needed.
 
 **Quantization-aware benchmark defaults:**
 
 When a quantized checkpoint is detected, recommend benchmarks sensitive to quantization accuracy loss:
 
-- **Always include**: MMLU (general knowledge, most affected by quantization)
+- **Always include**: MMLU (general knowledge — typically shows measurable accuracy loss from quantization)
 - **Recommended**: GSM8K (math reasoning — sensitive to precision loss), ARC-Challenge (reasoning)
 - **Good to add**: HumanEval (code generation — catches subtle degradation), Winogrande (commonsense)
-- **Less useful for quant comparison**: IFEval (instruction following — rarely affected by quantization)
+- **Less useful for quant comparison**: IFEval (instruction following — typically less affected, but worth including for aggressive quantization like FP4)
 
 Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed.
 
-Use WebSearch to find model card (HuggingFace, build.nvidia.com). Read it carefully, the FULL text, the devil is in the details. Extract ALL relevant configurations:
-
-- Sampling params (`temperature`, `top_p`)
-- Context length (`deployment.extra_args: "--max-model-len <value>"`)
-- TP/DP settings (to set them appropriately, AskUserQuestion on how many GPUs the model will be deployed)
-- Reasoning config (if applicable):
-  - reasoning on/off: use either:
-    - `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) and no `adapter_config.params_to_add` (leave `params_to_add` unrelated to reasoning untouched)
-    - `adapter_config.params_to_add` for payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`) and no `adapter_config.custom_system_prompt` and `adapter_config.use_system_prompt: false` (leave `custom_system_prompt` and `use_system_prompt` unrelated to reasoning untouched).
-  - reasoning effort/budget (if it's configurable, AskUserQuestion what reasoning effort they want)
-  - higher `max_new_tokens`
-  - etc.
-- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command)
-- Deployment-specific vLLM/SGLang versions (by default we use latest docker images, but you can control it with `deployment.image` e.g. vLLM above `vllm/vllm-openai:v0.11.0` stopped supporting `rope-scaling` arg used by Qwen models)
-- ARM64 / non-standard GPU compatibility: The default `vllm/vllm-openai` image only supports common GPU architectures. For ARM64 platforms or GPUs with non-standard compute capabilities (e.g., NVIDIA GB10 with sm_121), use NGC vLLM images instead:
-  - Example: `deployment.image: nvcr.io/nvidia/vllm:26.01-py3`
-  - AskUserQuestion about their GPU architecture if the model card doesn't specify deployment constraints
-- Any preparation requirements (e.g., downloading reasoning parsers, custom plugins):
-  - If the model card mentions downloading files (like reasoning parsers, custom plugins) before deployment, add `deployment.pre_cmd` with the download command
-  - Use `curl` instead of `wget` as it's more widely available in Docker containers
-  - Example: `pre_cmd: curl -L -o reasoning_parser.py https://huggingface.co/.../reasoning_parser.py`
-  - When using `pip install` in `pre_cmd`, always use `--no-cache-dir` to avoid cross-device link errors in Docker containers (the pip cache and temp directories may be on different filesystems)
-  - Example: `pre_cmd: pip3 install --no-cache-dir flash-attn --no-build-isolation`
-- Any other model-specific requirements
-
-Remember to check `evaluation.nemo_evaluator_config` and `evaluation.tasks.*.nemo_evaluator_config` overrides too for parameters to adjust (e.g. disabling reasoning)!
-
-Present findings, explain each setting, ask user to confirm or adjust. If no model card found, ask user directly for the above configurations.
+Read `references/model-card-research.md` for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm.
 
 **Step 4: Fill in remaining missing values**
 
@@ -197,57 +171,7 @@ Show tasks in the current config. Loop until the user confirms the task list is
 
 **Step 6: Advanced - Multi-node**
 
-There are two multi-node patterns. Ask the user which applies:
-
-**Pattern A: Multi-instance (independent instances with HAProxy)**
-
-Only if model >120B parameters or user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."
-
-```yaml
-execution:
-    num_nodes: 4       # Total nodes
-    num_instances: 4   # 4 independent instances → HAProxy auto-enabled
-```
-
-**Pattern B: Multi-node single instance (Ray TP/PP across nodes)**
-
-When a single model is too large for one node and needs pipeline parallelism across nodes. Use `vllm_ray` deployment config:
-
-```yaml
-defaults:
-  - deployment: vllm_ray   # Built-in Ray cluster setup (replaces manual pre_cmd)
-
-execution:
-    num_nodes: 2           # Single instance spanning 2 nodes
-
-deployment:
-    tensor_parallel_size: 8
-    pipeline_parallel_size: 2
-```
-
-**Pattern A+B combined: Multi-instance with multi-node instances**
-
-For very large models needing both cross-node parallelism AND multiple instances:
-
-```yaml
-defaults:
-  - deployment: vllm_ray
-
-execution:
-    num_nodes: 4       # Total nodes
-    num_instances: 2   # 2 instances of 2 nodes each → HAProxy auto-enabled
-
-deployment:
-    tensor_parallel_size: 8
-    pipeline_parallel_size: 2
-```
-
-**Common Confusions**
-
-- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance.
-- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas).
-- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
-- `num_nodes` must be divisible by `num_instances`.
+If the user needs multi-node evaluation (model >120B, or more throughput), read `references/multi-node.md` for the configuration patterns (HAProxy multi-instance, Ray TP/PP, or combined).
 
 **Step 7: Advanced - Interceptors**
 
@@ -374,7 +298,7 @@ Now, copy this checklist and track your progress:
 
 ```text
 Config Generation Progress:
-- [ ] Step 0: Check workspace (if multi-user)
+- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
 - [ ] Step 1: Check if nel is installed
 - [ ] Step 2: Build the base config file
 - [ ] Step 3: Configure model path and parameters
diff --git a/.claude/skills/evaluation/evals/base-model-local-execution.json b/.claude/skills/evaluation/evals/base-model-local-execution.json
@@ -0,0 +1,18 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate Qwen/Qwen3-0.6B on standard benchmarks locally",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions — user selects: execution=local, deployment=vllm, export=none, model_type=base, benchmarks=standard",
+    "Runs nel skills build-config --execution local --deployment vllm --model_type base --benchmarks standard",
+    "Sets deployment.hf_model_handle to Qwen/Qwen3-0.6B and deployment.checkpoint_path to null",
+    "No hf_quant_config.json since this is an HF hub model — no quantization flag needed",
+    "Searches web for Qwen3-0.6B model card to extract deployment settings",
+    "For local execution: no SLURM-specific config needed",
+    "Fills remaining ??? values",
+    "Shows task list for confirmation",
+    "Runs dry-run, then test with limit_samples=10, then full evaluation",
+    "Reports results"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/external-deployment-eval.json b/.claude/skills/evaluation/evals/external-deployment-eval.json
@@ -0,0 +1,17 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my model that's already running at http://myserver:8000",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions — user selects deployment=None (External) since model is already deployed",
+    "Runs nel skills build-config with --deployment none",
+    "Configures target.api_endpoint with the user's existing server URL",
+    "Does NOT start a deployment — uses the external endpoint directly",
+    "api_key_name should already be defined for external deployment",
+    "Asks user for model type (base/chat/reasoning) and benchmark selection",
+    "Fills remaining config values",
+    "Runs dry-run, test, then full evaluation against the external endpoint",
+    "Reports results"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/interceptor-configuration.json b/.claude/skills/evaluation/evals/interceptor-configuration.json
@@ -0,0 +1,16 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my model and configure request/response logging interceptors",
+  "files": [],
+  "expected_behavior": [
+    "Follows standard evaluation workflow through Step 6",
+    "In Step 7 (Interceptors): tells user to see the interceptors documentation URL",
+    "Does NOT provide general information about interceptors without reading the docs",
+    "If user asks to configure logging interceptor: reads the interceptor webpage",
+    "Configures interceptor under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config (NOT under target.api_endpoint.adapter_config)",
+    "Uses field names from CLI Configuration section after --overrides keyword",
+    "Does NOT define interceptors list directly (would override full chain with unintended consequences)",
+    "Uses correct field names: max_logged_requests and max_logged_responses (NOT max_saved_* or max_*)",
+    "Proceeds with evaluation after interceptor configuration"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/multi-node-evaluation.json b/.claude/skills/evaluation/evals/multi-node-evaluation.json
@@ -0,0 +1,18 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate a 405B parameter model, I have 4 nodes available",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions",
+    "In Step 6 (Advanced - Multi-node): asks which pattern applies",
+    "For a single 405B model: recommends Pattern B (multi-node single instance with Ray TP/PP)",
+    "Uses vllm_ray deployment config: defaults: [deployment: vllm_ray]",
+    "Sets execution.num_nodes: 2 or 4 depending on GPU memory",
+    "Configures deployment.tensor_parallel_size and pipeline_parallel_size",
+    "Explains: num_instances controls independent instances (with HAProxy), while this is single-instance across nodes",
+    "If user wants throughput AND cross-node: explains Pattern A+B combined",
+    "Notes: num_nodes must be divisible by num_instances",
+    "Proceeds with standard evaluation flow after multi-node config"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/nel-not-installed.json b/.claude/skills/evaluation/evals/nel-not-installed.json
@@ -0,0 +1,12 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my quantized model",
+  "files": [],
+  "expected_behavior": [
+    "Checks if nel is installed by running 'nel --version'",
+    "nel command not found or errors",
+    "Instructs user to install: pip install nemo-evaluator-launcher",
+    "Does NOT attempt to proceed without nel installed",
+    "After user installs, re-checks nel --version and proceeds with workflow"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json b/.claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
@@ -1,5 +1,5 @@
 {
-  "skills": ["nel-assistant"],
+  "skills": ["evaluation"],
   "query": "Help me evaluate Nemotron 3 Nano BF16 from NVIDIA",
   "files": [],
   "expected_behavior": [
diff --git a/.claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json b/.claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json
@@ -0,0 +1,16 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate accuracy of my NVFP4 quantized model at ./llama-nvfp4",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions",
+    "Sets deployment.checkpoint_path to ./llama-nvfp4",
+    "Auto-detects quantization by reading ./llama-nvfp4/hf_quant_config.json",
+    "Finds quant_algo contains 'FP4' or 'NVFP4' and adds --quantization modelopt_fp4 to deployment.extra_args",
+    "Does NOT use --quantization modelopt (that's for FP8 and W4A8_AWQ)",
+    "Recommends quantization-sensitive benchmarks: MMLU, GSM8K, ARC-Challenge",
+    "Mentions that NVFP4 inference requires Blackwell GPUs",
+    "Proceeds with standard evaluation flow"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json b/.claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json
@@ -0,0 +1,22 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my FP8 quantized Llama checkpoint at ./llama-3.1-8b-fp8 on MMLU and GSM8K",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed by running 'nel --version'",
+    "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks)",
+    "Runs 'nel skills build-config' with correct flags matching user answers",
+    "Sets deployment.checkpoint_path to ./llama-3.1-8b-fp8 and deployment.hf_model_handle to null",
+    "Auto-detects quantization format by reading ./llama-3.1-8b-fp8/hf_quant_config.json",
+    "Finds quant_algo=FP8 and adds --quantization modelopt to deployment.extra_args",
+    "Recommends accuracy-sensitive benchmarks: MMLU (always), GSM8K (math reasoning), ARC-Challenge",
+    "Searches web for Llama-3.1-8B model card and extracts sampling params, context length, TP settings",
+    "Asks user for GPU count to set tensor_parallel_size",
+    "Fills in remaining ??? values by asking user",
+    "Shows task list and confirms with user",
+    "Runs dry-run first: nel run --config <config> --dry-run",
+    "Then test run with limit_samples=10",
+    "Then full evaluation",
+    "Reports accuracy results per benchmark"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/reasoning-model-sglang.json b/.claude/skills/evaluation/evals/reasoning-model-sglang.json
@@ -0,0 +1,21 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate QwQ-32B reasoning model with math benchmarks on SLURM using SGLang",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions — user selects: execution=slurm, deployment=sglang, model_type=reasoning, benchmarks=math_reasoning",
+    "Runs nel skills build-config with --execution slurm --deployment sglang --model_type reasoning --benchmarks math_reasoning",
+    "Searches web for QwQ-32B model card",
+    "Configures reasoning toggle: either via adapter_config.custom_system_prompt (/think, /no_think) or via adapter_config.params_to_add with chat_template_kwargs.enable_thinking",
+    "Sets higher max_new_tokens for reasoning (thinking tokens can be long)",
+    "Asks user about reasoning effort/budget if configurable",
+    "Configures SGLang-specific deployment settings",
+    "Asks user for SLURM hostname, account, partition, walltime",
+    "For nemo_skills.* tasks with self-deployment: adds target.api_endpoint.api_key_name: DUMMY_API_KEY",
+    "Disables reasoning for tasks where it's not needed (e.g., IFEval) using task-level overrides",
+    "Shows task list for confirmation",
+    "Exports DUMMY_API_KEY=dummy before running",
+    "Runs dry-run, test, then full evaluation"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/safety-multilingual-benchmarks.json b/.claude/skills/evaluation/evals/safety-multilingual-benchmarks.json
@@ -0,0 +1,17 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my chat model on safety and multilingual benchmarks",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions — user selects: model_type=chat, benchmarks=safety multilingual",
+    "Runs nel skills build-config with --model_type chat --benchmarks safety multilingual",
+    "Safety benchmarks include: Garak and Safety Harness",
+    "Multilingual benchmarks include: MMATH, Global MMLU, MMLU-Prox",
+    "Searches web for model card to extract chat-specific settings (system prompt, sampling params)",
+    "Configures chat template and system prompt if needed",
+    "Shows task list including both safety and multilingual tasks",
+    "Allows user to add/remove tasks in Step 5 confirmation loop",
+    "Proceeds with evaluation"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/wandb-export-code-benchmarks.json b/.claude/skills/evaluation/evals/wandb-export-code-benchmarks.json
@@ -0,0 +1,18 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate my model on code benchmarks and export results to Weights & Biases",
+  "files": [],
+  "expected_behavior": [
+    "Verifies nel is installed",
+    "Asks 5 base config questions — user selects: export=wandb, benchmarks=code",
+    "Runs nel skills build-config with --export wandb --benchmarks code",
+    "Code benchmarks include: HumanEval, MBPP, LiveCodeBench",
+    "Asks user for wandb tracking URI and project name",
+    "Fills in wandb-specific config values",
+    "Asks if user wants to add wandb tags",
+    "Shows task list for confirmation",
+    "Runs dry-run to validate config including wandb connection",
+    "Proceeds with test and full evaluation",
+    "Results are exported to wandb automatically"
+  ]
+}
diff --git a/.claude/skills/evaluation/evals/workspace-reuse-from-ptq.json b/.claude/skills/evaluation/evals/workspace-reuse-from-ptq.json
@@ -0,0 +1,15 @@
+{
+  "skills": ["evaluation"],
+  "query": "evaluate the model I just quantized",
+  "files": [],
+  "expected_behavior": [
+    "Checks if MODELOPT_WORKSPACE_ROOT is set",
+    "If set: reads skills/common/workspace-management.md",
+    "Lists existing workspaces and finds the one from prior PTQ step",
+    "Reuses the workspace to access the quantized checkpoint",
+    "Auto-detects quantization format from hf_quant_config.json in the checkpoint",
+    "Sets correct deployment.extra_args based on detected format (--quantization modelopt or modelopt_fp4)",
+    "Recommends quantization-sensitive benchmarks since this is a quantized model",
+    "Proceeds with standard evaluation workflow"
+  ]
+}
diff --git a/.claude/skills/evaluation/references/model-card-research.md b/.claude/skills/evaluation/references/model-card-research.md
@@ -0,0 +1,30 @@
+# Model Card Research
+
+Use WebSearch to find the model card (HuggingFace, build.nvidia.com). Read the FULL text carefully; the devil is in the details. Extract ALL relevant configurations:
+
+- Sampling params (`temperature`, `top_p`)
+- Context length (`deployment.extra_args: "--max-model-len <value>"`)
+- TP/DP settings (to set them appropriately, use AskUserQuestion to find out how many GPUs the model will be deployed on)
+- Reasoning config (if applicable):
+  - reasoning on/off: use either:
+    - `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) and no `adapter_config.params_to_add` (leave `params_to_add` unrelated to reasoning untouched)
+    - `adapter_config.params_to_add` for payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`) and no `adapter_config.custom_system_prompt` and `adapter_config.use_system_prompt: false` (leave `custom_system_prompt` and `use_system_prompt` unrelated to reasoning untouched).
+  - reasoning effort/budget (if it's configurable, AskUserQuestion what reasoning effort they want)
+  - higher `max_new_tokens`
+  - etc.
+- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command)
+- Deployment-specific vLLM/SGLang versions (by default we use latest docker images, but you can control it with `deployment.image` e.g. vLLM above `vllm/vllm-openai:v0.11.0` stopped supporting `rope-scaling` arg used by Qwen models)
+- ARM64 / non-standard GPU compatibility: The default `vllm/vllm-openai` image only supports common GPU architectures. For ARM64 platforms or GPUs with non-standard compute capabilities (e.g., NVIDIA GB10 with sm_121), use NGC vLLM images instead:
+  - Example: `deployment.image: nvcr.io/nvidia/vllm:26.01-py3`
+  - AskUserQuestion about their GPU architecture if the model card doesn't specify deployment constraints
+- Any preparation requirements (e.g., downloading reasoning parsers, custom plugins):
+  - If the model card mentions downloading files (like reasoning parsers, custom plugins) before deployment, add `deployment.pre_cmd` with the download command
+  - Use `curl` instead of `wget` as it's more widely available in Docker containers
+  - Example: `pre_cmd: curl -L -o reasoning_parser.py https://huggingface.co/.../reasoning_parser.py`
+  - When using `pip install` in `pre_cmd`, always use `--no-cache-dir` to avoid cross-device link errors in Docker containers (the pip cache and temp directories may be on different filesystems)
+  - Example: `pre_cmd: pip3 install --no-cache-dir flash-attn --no-build-isolation`
+- Any other model-specific requirements
+
+Remember to check `evaluation.nemo_evaluator_config` and `evaluation.tasks.*.nemo_evaluator_config` overrides too for parameters to adjust (e.g. disabling reasoning)!
+
+Present findings, explain each setting, ask user to confirm or adjust. If no model card found, ask user directly for the above configurations.
diff --git a/.claude/skills/evaluation/references/multi-node.md b/.claude/skills/evaluation/references/multi-node.md
