From f8096a67714b29238c81e3c593ded67f325ec03f Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Mon, 1 Jun 2026 16:52:37 -0700 Subject: [PATCH 1/7] [skills] Fix eval/deploy defaults for NVFP4-on-Blackwell + nemo-skills evals Concrete fixes from a Qwen3.5-9B NVFP4 PTQ -> deploy -> AA-eval run on B300/GB300 where each issue caused a real failure: - example_eval.yaml: add --served-model-name ${deployment.served_model_name}; without it vLLM registers the model as /checkpoint and every eval 404s. - evaluation SKILL: nemo-skills (ns_*) self-deployment needs DUMMY_API_KEY in evaluation.env_vars (a shell export does NOT reach the SLURM container); document the required host:/lit:/runtime: env-var value prefixes; note that execution.gres must match the node GPU count (else sbatch 'Requested node configuration is not available'). - deployment + evaluation SKILL: NVFP4 on Blackwell (sm_100/sm_103) requires vllm/vllm-openai:cu130-nightly; v0.19.1 and any cu129 build lack sm_103 FP4 kernels (engine init dies 'no kernel image'). Plus --mm-encoder-attn-backend TRITON_ATTN for multimodal on sm_103, and the raw-markdown recipes.vllm.ai fallback for hardware variants. - ptq launcher-guide: match gpus_per_node to node/QOS; EXTRA_PIP_DEPS must avoid shell metacharacters (use == pins, not >=/<). Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Zhiyu Cheng --- .claude/skills/deployment/SKILL.md | 12 ++++++++ .claude/skills/evaluation/SKILL.md | 28 ++++++++++++++----- .../recipes/examples/example_eval.yaml | 20 +++++++++++++ .../skills/ptq/references/launcher-guide.md | 12 ++++++++ 4 files changed, 65 insertions(+), 7 deletions(-) diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md index f14cc0b9822..c542f462ca2 100644 --- a/.claude/skills/deployment/SKILL.md +++ b/.claude/skills/deployment/SKILL.md @@ -125,6 +125,18 @@ python -m vllm.entrypoints.openai.api_server \ For NVFP4 checkpoints, use `--quantization modelopt_fp4`. 
+> **NVFP4 on Blackwell needs the CUDA-13 vLLM build.** On B200/B300/GB200/GB300 +> (compute capability sm_100/sm_103), use `vllm/vllm-openai:cu130-nightly-` +> (`-x86_64`, or `-aarch64` on Grace). The common `v0.19.1` / any `cu129` +> (CUDA 12.9) build has **no sm_103 FP4 kernels** — vLLM loads the checkpoint +> then dies at engine init with `CUDA error: no kernel image is available for +> execution on the device` (affects the `flashinfer` and `cutlass` NVFP4 +> backends; `marlin` separately fails on non-64-divisible layer dims). Verify the +> image via `recipes.vllm.ai//?hardware=b300` (JS-rendered — fetch the +> raw markdown at `github.com/vllm-project/recipes/blob/main//.md`). +> For multimodal models on sm_103, also pass `--mm-encoder-attn-backend +> TRITON_ATTN` (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x"). + #### SGLang ```bash diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index 4da78164a7f..74bc47d5436 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -136,12 +136,14 @@ deployment: <... rest of cross-checked flags ...> ``` -Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value. +Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--served-model-name ${deployment.served_model_name}` (**required** — without it vLLM registers the model under the path `/checkpoint`, and every eval request 404s with "The model `` does not exist."); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value. 
For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there. **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping. +> **NVFP4 on Blackwell needs the CUDA-13 build.** Serving an NVFP4 checkpoint on Blackwell (B200/B300/GB200/GB300, compute capability sm_100/sm_103) requires `vllm/vllm-openai:cu130-nightly-` (`-x86_64`, or `-aarch64` on Grace). The pinned `v0.19.1` and **all** `cu129` (CUDA 12.9) builds lack sm_103 FP4 kernels — the server loads the checkpoint then dies at engine init with `CUDA error: no kernel image is available for execution on the device` (true for the `flashinfer` *and* `cutlass` NVFP4 backends; `marlin` separately fails on non-64-divisible layer dims). This is the vLLM-recipe-recommended Blackwell image — confirm via `recipes.vllm.ai//?hardware=b300` (and since that page is JS-rendered, fetch the raw markdown at `github.com/vllm-project/recipes/blob/main//.md`). For **multimodal** models on sm_103, also add `--mm-encoder-attn-backend TRITON_ATTN` — the default CuTe ViT flash-attn kernel asserts "Only SM 10.x and 11.x are supported" on sm_103. + #### vLLM-backend defaults — always include unless the recipe *contradicts* Silence is not contradiction. Drop/override only when the recipe sets a different value for the same setting (e.g. recipe pins `--max-num-batched-tokens 16384` → use 16384). 
@@ -196,6 +198,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text. - **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown. - Ask about other defaults they may want to change (partition, walltime, MLflow tags). +- **`execution.gres`** — NEL defaults to `gpu:8`. Set it to the cluster's per-node GPU count (and what the QOS permits), and match `--data-parallel-size`/`--tensor-parallel-size` to it. A mismatch makes `sbatch` reject the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → set `gres: gpu:4`). Confirm the node GPU count with `sinfo -o '%P %G'` on the target cluster. **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference. @@ -229,15 +232,21 @@ Implications for the agent: **Tasks that call an external judge / user-simulator / scoring endpoint.** Treat this as a general pattern, not a fixed list — HLE, AA-LCR, and Tau2 need one today, but other benchmarks may too (check each task's recipe). Their `model_id` / `url` are **config, not secrets**: substitute the **literal** values the user keeps in `.env` (keys per the task's recipe + `recipes/env.example`) into the task's `` placeholders. Do **not** emit `${oc.env:...}` for these (it silently fails unless the var was exported with `set -a`). Only `api_key` stays an env-var *name* (e.g. `INFERENCE_API_KEY`), exported and read by the harness. 
-**Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level: +**Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks (`ns_*`) with self-deployment (vLLM/SGLang/NIM), you need **both** of these: ```yaml -target: - api_endpoint: - api_key_name: DUMMY_API_KEY +evaluation: + env_vars: + DUMMY_API_KEY: lit:dummy # MUST be set here — see below + nemo_evaluator_config: + target: + api_endpoint: + api_key_name: DUMMY_API_KEY ``` -External-deployment configs already define `api_key_name`. Export of `DUMMY_API_KEY` is handled in Step 8. +`api_key_name` only names the env var; the nemo-skills client **hard-fails if that var has no value inside the eval container** (`ValueError: api_key_env_var=DUMMY_API_KEY but the value is not set`). On SLURM, a shell `export DUMMY_API_KEY=dummy` (Step 8) does **NOT** propagate into the container — NEL only injects vars declared in `env_vars`. So declare `DUMMY_API_KEY: lit:dummy` under `evaluation.env_vars` (note the `lit:` prefix — see below). The shell export only helps for local/Docker runs. External-deployment configs already define `api_key_name`. + +**NEL env-var value prefixes (required):** every value in an `env_vars` map needs an explicit prefix — `host:VAR` (read from the submitting shell's env at submit time), `lit:value` (literal string), or `runtime:VAR` (read in the job at run time). A bare value (e.g. `DUMMY_API_KEY: dummy`) hard-errors: *"Env var value '…' must have an explicit prefix."* Use `lit:` for constants like `DUMMY_API_KEY` and `VLLM_*` backend selectors, `host:` for secrets like `HF_TOKEN` / `INFERENCE_API_KEY`. 
--- @@ -258,10 +267,13 @@ Default images: | Framework | Image | Registry | | --- | --- | --- | | vLLM | `vllm/vllm-openai:v0.19.1` (bump per recipe; never `:latest`) | DockerHub | +| vLLM (NVFP4 on Blackwell) | `vllm/vllm-openai:cu130-nightly-x86_64` (or `-aarch64`) | DockerHub | | SGLang | `lmsysorg/sglang:latest` | DockerHub | | TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC | | Eval tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC | +> NVFP4 checkpoints on Blackwell (sm_100/sm_103) need the `cu130-nightly` image — cu129/v0.19.1 lack sm_103 FP4 kernels (see the "NVFP4 on Blackwell" note in Step 3). + Public images → submit without preflight. Private/restricted → check credentials: ```bash @@ -284,8 +296,10 @@ set -a && source .env && set +a # If pre_cmd/post_cmd in config (review pre_cmd first — runs arbitrary commands): export NEMO_EVALUATOR_TRUST_PRE_CMD=1 -# If nemo_skills.* + self-deployment: +# If nemo_skills.* + self-deployment, for LOCAL/Docker runs only: export DUMMY_API_KEY=dummy +# On SLURM this shell export does NOT reach the container — instead declare +# `DUMMY_API_KEY: lit:dummy` under evaluation.env_vars (see Step 5). ``` **Step 8.1 — Dry-run** (config validation): diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml index c3a48d8c58b..0066e3afcaf 100644 --- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml +++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml @@ -43,6 +43,9 @@ execution: account: ??? output_dir: ??? walltime: "04:00:00" + # gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what + # the QOS allows) or sbatch fails "Requested node configuration is not available". + # gres: gpu:4 # e.g. GB300 nodes (4 GPUs); also drop --data-parallel-size to match mounts: mount_home: false auto_export: # REQUIRED trigger for auto-export. 
Without this, the @@ -55,6 +58,16 @@ deployment: hf_model_handle: served_model_name: ??? image: vllm/vllm-openai:v0.19.1 + # NVFP4 on Blackwell (B200/B300/GB200/GB300, sm_100/sm_103): the v0.19.1 and + # any cu129 builds have NO sm_103 FP4 kernels (deploy dies with CUDA + # "no kernel image is available"). Use the CUDA-13 nightly instead: + # image: vllm/vllm-openai:cu130-nightly-x86_64 # or -aarch64 on Grace + # (vLLM-recipe-recommended Blackwell image; see recipes.vllm.ai ?hardware=b300). + # For multimodal models on sm_103 also add `--mm-encoder-attn-backend TRITON_ATTN`. + # + # `--served-model-name ${deployment.served_model_name}` (in command below) is + # REQUIRED: without it vLLM registers the model as `/checkpoint` and every eval + # request 404s ("The model `` does not exist."). # For MoE models, add `--enable-expert-parallel` to the command. # For models with custom code, add `--trust-remote-code` to the command. # After filling in evaluation `parallelism` values (top-level + per-task), @@ -62,6 +75,7 @@ deployment: # N = ceil(max_parallelism / data_parallel_size). command: >- vllm serve /checkpoint + --served-model-name ${deployment.served_model_name} --host 0.0.0.0 --port ${deployment.port} --tensor-parallel-size 1 @@ -72,6 +86,12 @@ deployment: evaluation: env_vars: HF_TOKEN: host:HF_TOKEN + # nemo-skills tasks (ns_*) hard-require the served-endpoint api key env var + # (api_key_name below) to be set WITH A VALUE inside the eval container. + # A shell `export DUMMY_API_KEY=dummy` does NOT reach the SLURM container — + # it must be declared here. Omit it and ns_* tasks die with + # "api_key_env_var=DUMMY_API_KEY but the value is not set". 
+ DUMMY_API_KEY: lit:dummy nemo_evaluator_config: config: params: diff --git a/.claude/skills/ptq/references/launcher-guide.md b/.claude/skills/ptq/references/launcher-guide.md index 542c4ade5b1..9fff22b12a4 100644 --- a/.claude/skills/ptq/references/launcher-guide.md +++ b/.claude/skills/ptq/references/launcher-guide.md @@ -31,6 +31,18 @@ pipeline: gpus_per_node: ``` +> **Match `gpus_per_node` to the cluster's node GPU count / QOS minimum.** If it +> is below what the QOS requires (many clusters mandate a full node), `sbatch` +> rejects the job with `QOSMinGRES` or `Requested node configuration is not +> available`. e.g. GB300 nodes have 4 GPUs and require the full node → set +> `gpus_per_node: 4`; B300/B200 nodes have 8. Check with `sinfo -o '%P %G'`. + +> **`EXTRA_PIP_DEPS` must avoid shell metacharacters.** It is written into an +> unquoted `export` in the generated sbatch script, so a value like +> `transformers>=4.57,<4.58` is mangled by shell redirection (`>`/`<`) and +> silently dropped — the deps never install. Use exact pins instead, e.g. +> `EXTRA_PIP_DEPS: "transformers==5.5.0"`. + Extra `hf_ptq.py` flags can be passed via `args`: ```yaml From 0d7595da2a69097da93d6bbf66b3bb0fb6cee483 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Mon, 1 Jun 2026 18:08:59 -0700 Subject: [PATCH 2/7] [skills] evaluation: prefer checkpoint_path over hf_model_handle hf_model_handle is not reliably mounted at /checkpoint in current NEL: with only hf_model_handle set, `vllm serve /checkpoint` makes vLLM treat the literal '/checkpoint' as an HF repo id and the deploy dies with `HFValidationError: Repo id must use alphanumeric chars ... : '/checkpoint'`. Document preferring checkpoint_path (download the HF model to the cluster via snapshot_download first) in the evaluation SKILL Step 3 and example_eval.yaml. Hit while running a BF16 baseline (Qwen/Qwen3.5-9B) for an NVFP4 comparison. 
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/SKILL.md | 2 ++ .../evaluation/recipes/examples/example_eval.yaml | 13 ++++++++++--- 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index 74bc47d5436..962bf9a6292 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -77,6 +77,8 @@ nel skills build-config --execution <...> --deployment <...> --model_type <...> **Model path.** Checkpoint path (`/`, `./`, `../`, `~`, or exists on disk) → set `deployment.checkpoint_path`, leave `hf_model_handle: null`. Else HF handle (one `/`, not on disk) → set `deployment.hf_model_handle`, leave `checkpoint_path: null`. +> **Prefer `checkpoint_path` on SLURM — `hf_model_handle` is not reliably mounted at `/checkpoint` in current NEL.** With only `hf_model_handle` set, the `vllm serve /checkpoint` command finds nothing mounted there and vLLM treats the literal string `/checkpoint` as an HF repo id, so the deploy dies with `HFValidationError: Repo id must use alphanumeric chars … : '/checkpoint'`. To evaluate an un-staged HF model (e.g. a BF16 baseline for a quant comparison), first download it onto the cluster — `python -c "from huggingface_hub import snapshot_download; snapshot_download('/', local_dir='')"` — then set `checkpoint_path: ` (this is the path the NVFP4/quantized run already uses). + **Auto-detect ModelOpt quantization** (checkpoint paths). Check `config.json` for `quantization_config` (or legacy `hf_quant_config.json`): - **vLLM:** no `--quantization` flag by default — vLLM auto-detects from `quantization_config` / `hf_quant_config.json`. Add only when the card, vLLM version, or dry-run error requires it. 
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml index 0066e3afcaf..e068ca01f9b 100644 --- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml +++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml @@ -23,9 +23,16 @@ # # Deployment uses a single `command:` field instead of separate # `tensor_parallel_size` / `data_parallel_size` / `extra_args` fields — the full -# `vllm serve` invocation lives in the command string. NEL mounts the resolved -# model (from checkpoint_path or hf_model_handle) at /checkpoint inside the -# container, and Hydra interpolates ${deployment.port} at run time. +# `vllm serve` invocation lives in the command string. NEL mounts a +# `checkpoint_path` at /checkpoint inside the container, and Hydra interpolates +# ${deployment.port} at run time. +# +# PREFER `checkpoint_path` (a path already on the cluster) over `hf_model_handle`. +# `hf_model_handle` is NOT reliably mounted at /checkpoint in current NEL — the +# `vllm serve /checkpoint` command then makes vLLM treat "/checkpoint" as a HF +# repo id and the deploy dies with `HFValidationError: Repo id must use ...`. +# To eval an un-staged HF model, first download it to the cluster (e.g. +# `huggingface_hub.snapshot_download`) and point `checkpoint_path` at it. # # Run a single task: # nel run --config ... -t gpqa_diamond_aa_v3 From 2fa127ef573665000b4b31ea5c4a55a94f23f7cc Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Tue, 2 Jun 2026 11:31:12 -0700 Subject: [PATCH 3/7] [skills] address PR #1595 review feedback - Blackwell NVFP4 image: recommend pinned vllm/vllm-openai:v0.19.1-cu130 (matches the default image), bump to cu130-nightly- only if it lacks the model arch; note Qwen3.5-9B's qwen3_5 was verified on the nightly (v0.19.2rc1.dev134), v0.19.1-cu130 untested for it (cjluo-nv). 
- De-duplicate the Blackwell note: trim the evaluation SKILL Step 3 copy and point to example_eval.yaml for the full version (chadvoegele, cjluo-nv). - gres comment: 'match --tensor-parallel-size/--data-parallel-size to it' and refer to references/parallelism.md for DP/TP sizing (kaix-nv, chadvoegele). - Clarify TRITON_ATTN is a ViT sm_103 flash-attn workaround that may be unneeded on newer builds (cjluo-nv). - launcher-guide: note EXTRA_PIP_DEPS transformers version is model-specific, not the repo's library pin (Copilot). Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/SKILL.md | 4 ++-- .../recipes/examples/example_eval.yaml | 16 +++++++++++----- .claude/skills/ptq/references/launcher-guide.md | 6 ++++-- 3 files changed, 17 insertions(+), 9 deletions(-) diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index 962bf9a6292..3a47b115ebf 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -144,7 +144,7 @@ For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipelin **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping. -> **NVFP4 on Blackwell needs the CUDA-13 build.** Serving an NVFP4 checkpoint on Blackwell (B200/B300/GB200/GB300, compute capability sm_100/sm_103) requires `vllm/vllm-openai:cu130-nightly-` (`-x86_64`, or `-aarch64` on Grace). 
The pinned `v0.19.1` and **all** `cu129` (CUDA 12.9) builds lack sm_103 FP4 kernels — the server loads the checkpoint then dies at engine init with `CUDA error: no kernel image is available for execution on the device` (true for the `flashinfer` *and* `cutlass` NVFP4 backends; `marlin` separately fails on non-64-divisible layer dims). This is the vLLM-recipe-recommended Blackwell image — confirm via `recipes.vllm.ai//?hardware=b300` (and since that page is JS-rendered, fetch the raw markdown at `github.com/vllm-project/recipes/blob/main//.md`). For **multimodal** models on sm_103, also add `--mm-encoder-attn-backend TRITON_ATTN` — the default CuTe ViT flash-attn kernel asserts "Only SM 10.x and 11.x are supported" on sm_103. +> **NVFP4 on Blackwell needs a CUDA-13 vLLM build.** On B200/B300/GB200/GB300 (sm_100/sm_103) the pinned `v0.19.1` and all `cu129` builds lack sm_103 FP4 kernels — engine init dies with `CUDA error: no kernel image is available for execution on the device`. Use `vllm/vllm-openai:v0.19.1-cu130` (pinned, matches the default image), and bump to `cu130-nightly-` only if it lacks the model's arch (Qwen3.5-9B's `qwen3_5` needed the nightly). Multimodal on sm_103 also needs `--mm-encoder-attn-backend TRITON_ATTN` (ViT flash-attn workaround). Full note + the `recipes.vllm.ai ?hardware=b300` lookup are in `recipes/examples/example_eval.yaml`. 
#### vLLM-backend defaults — always include unless the recipe *contradicts* @@ -269,7 +269,7 @@ Default images: | Framework | Image | Registry | | --- | --- | --- | | vLLM | `vllm/vllm-openai:v0.19.1` (bump per recipe; never `:latest`) | DockerHub | -| vLLM (NVFP4 on Blackwell) | `vllm/vllm-openai:cu130-nightly-x86_64` (or `-aarch64`) | DockerHub | +| vLLM (NVFP4 on Blackwell) | `vllm/vllm-openai:v0.19.1-cu130` (bump to `cu130-nightly-` for new archs) | DockerHub | | SGLang | `lmsysorg/sglang:latest` | DockerHub | | TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC | | Eval tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC | diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml index e068ca01f9b..6035e04f765 100644 --- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml +++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml @@ -52,7 +52,7 @@ execution: walltime: "04:00:00" # gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what # the QOS allows) or sbatch fails "Requested node configuration is not available". - # gres: gpu:4 # e.g. GB300 nodes (4 GPUs); also drop --data-parallel-size to match + # gres: gpu:4 # e.g. GB300 (4 GPUs); match --tensor-parallel-size/--data-parallel-size to it (see references/parallelism.md) mounts: mount_home: false auto_export: # REQUIRED trigger for auto-export. Without this, the @@ -67,10 +67,16 @@ deployment: image: vllm/vllm-openai:v0.19.1 # NVFP4 on Blackwell (B200/B300/GB200/GB300, sm_100/sm_103): the v0.19.1 and # any cu129 builds have NO sm_103 FP4 kernels (deploy dies with CUDA - # "no kernel image is available"). Use the CUDA-13 nightly instead: - # image: vllm/vllm-openai:cu130-nightly-x86_64 # or -aarch64 on Grace - # (vLLM-recipe-recommended Blackwell image; see recipes.vllm.ai ?hardware=b300). - # For multimodal models on sm_103 also add `--mm-encoder-attn-backend TRITON_ATTN`. 
+ # "no kernel image is available"). Use a CUDA-13 build — prefer the pinned + # release matching this image, bump to nightly only if it lacks the arch: + # image: vllm/vllm-openai:v0.19.1-cu130 # reproducible; verify arch support + # image: vllm/vllm-openai:cu130-nightly-x86_64 # newest (-aarch64 on Grace) + # (Qwen3.5-9B's qwen3_5 arch needed the nightly = vLLM 0.19.2rc1.dev134; the + # pinned v0.19.1-cu130 is untested for it.) Confirm via recipes.vllm.ai + # ?hardware=b300 (JS-rendered; fetch raw markdown at + # github.com/vllm-project/recipes/blob/main//.md). + # Multimodal on sm_103: the ViT CuTe flash-attn asserts "Only SM 10.x/11.x"; + # workaround `--mm-encoder-attn-backend TRITON_ATTN` (may be unneeded on newer builds). # # `--served-model-name ${deployment.served_model_name}` (in command below) is # REQUIRED: without it vLLM registers the model as `/checkpoint` and every eval diff --git a/.claude/skills/ptq/references/launcher-guide.md b/.claude/skills/ptq/references/launcher-guide.md index 9fff22b12a4..1b5bb263818 100644 --- a/.claude/skills/ptq/references/launcher-guide.md +++ b/.claude/skills/ptq/references/launcher-guide.md @@ -40,8 +40,10 @@ pipeline: > **`EXTRA_PIP_DEPS` must avoid shell metacharacters.** It is written into an > unquoted `export` in the generated sbatch script, so a value like > `transformers>=4.57,<4.58` is mangled by shell redirection (`>`/`<`) and -> silently dropped — the deps never install. Use exact pins instead, e.g. -> `EXTRA_PIP_DEPS: "transformers==5.5.0"`. +> silently dropped — the deps never install. Use exact `==` pins (no `>`/`<`). +> The right version is **model-specific** — a brand-new architecture may need a +> newer transformers than the repo's library pin (e.g. Qwen3.5's `qwen3_5` needs +> `EXTRA_PIP_DEPS: "transformers==5.5.0"`); pick what the target model requires. 
Extra `hf_ptq.py` flags can be passed via `args`: From beaad18745b8376a2f6c84f599c385abb645dec3 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Tue, 2 Jun 2026 23:00:05 -0700 Subject: [PATCH 4/7] [skills] address PR #1595 second-pass review (cjluo-nv) - Blackwell image: simplify to 'B300/GB300 -> append -cu130 to the (multi-arch) image tag' (e.g. v0.19.1-cu130); keep a one-line nightly fallback for archs a pinned release predates (qwen3_5). Applied in eval + deployment skills. - gres: defer to NEL's internal/slurm/ execution configs (PR #1599) when present (they pre-fill gres/hostname/partition); keep the manual fallback. Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Zhiyu Cheng --- .claude/skills/deployment/SKILL.md | 19 ++++++------- .claude/skills/evaluation/SKILL.md | 4 +-- .../recipes/examples/example_eval.yaml | 27 +++++++++---------- 3 files changed, 24 insertions(+), 26 deletions(-) diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md index c542f462ca2..099ac7a6a41 100644 --- a/.claude/skills/deployment/SKILL.md +++ b/.claude/skills/deployment/SKILL.md @@ -125,17 +125,18 @@ python -m vllm.entrypoints.openai.api_server \ For NVFP4 checkpoints, use `--quantization modelopt_fp4`. -> **NVFP4 on Blackwell needs the CUDA-13 vLLM build.** On B200/B300/GB200/GB300 -> (compute capability sm_100/sm_103), use `vllm/vllm-openai:cu130-nightly-` -> (`-x86_64`, or `-aarch64` on Grace). The common `v0.19.1` / any `cu129` -> (CUDA 12.9) build has **no sm_103 FP4 kernels** — vLLM loads the checkpoint +> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag** +> (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). 
The +> default cu12 build has **no sm_103 FP4 kernel**, so vLLM loads the checkpoint > then dies at engine init with `CUDA error: no kernel image is available for > execution on the device` (affects the `flashinfer` and `cutlass` NVFP4 -> backends; `marlin` separately fails on non-64-divisible layer dims). Verify the -> image via `recipes.vllm.ai//?hardware=b300` (JS-rendered — fetch the -> raw markdown at `github.com/vllm-project/recipes/blob/main//.md`). -> For multimodal models on sm_103, also pass `--mm-encoder-attn-backend -> TRITON_ATTN` (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x"). +> backends; `marlin` separately fails on non-64-divisible layer dims). If a +> pinned release predates the model's arch, use `cu130-nightly-` instead +> (Qwen3.5-9B's `qwen3_5` needed it). Cross-check via +> `recipes.vllm.ai//?hardware=b300` (JS-rendered — fetch the raw +> markdown at `github.com/vllm-project/recipes/blob/main//.md`). For +> multimodal models on sm_103, also pass `--mm-encoder-attn-backend TRITON_ATTN` +> (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x"). #### SGLang diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index 3a47b115ebf..b4132acf40e 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -144,7 +144,7 @@ For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipelin **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping. 
-> **NVFP4 on Blackwell needs a CUDA-13 vLLM build.** On B200/B300/GB200/GB300 (sm_100/sm_103) the pinned `v0.19.1` and all `cu129` builds lack sm_103 FP4 kernels — engine init dies with `CUDA error: no kernel image is available for execution on the device`. Use `vllm/vllm-openai:v0.19.1-cu130` (pinned, matches the default image), and bump to `cu130-nightly-` only if it lacks the model's arch (Qwen3.5-9B's `qwen3_5` needed the nightly). Multimodal on sm_103 also needs `--mm-encoder-attn-backend TRITON_ATTN` (ViT flash-attn workaround). Full note + the `recipes.vllm.ai ?hardware=b300` lookup are in `recipes/examples/example_eval.yaml`. +> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag** (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The default cu12 build has no sm_103 FP4 kernel, so engine init dies with `CUDA error: no kernel image is available`. If a pinned release predates the model's arch, use `cu130-nightly-` (Qwen3.5-9B's `qwen3_5` needed it, vLLM 0.19.2rc1.dev134). Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`. Full note in `recipes/examples/example_eval.yaml`. #### vLLM-backend defaults — always include unless the recipe *contradicts* @@ -200,7 +200,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text. - **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown. - Ask about other defaults they may want to change (partition, walltime, MLflow tags). -- **`execution.gres`** — NEL defaults to `gpu:8`. 
Set it to the cluster's per-node GPU count (and what the QOS permits), and match `--data-parallel-size`/`--tensor-parallel-size` to it. A mismatch makes `sbatch` reject the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → set `gres: gpu:4`). Confirm the node GPU count with `sinfo -o '%P %G'` on the target cluster. +- **`execution.gres`** — if your NEL install ships an `internal/slurm/` execution config, prefer it (it pre-fills `gres`/hostname/partition/node-exclusivity). Otherwise NEL defaults to `gpu:8`; set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`), or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → `gres: gpu:4`; check with `sinfo -o '%P %G'`). **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference. diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml index 6035e04f765..185e80397c0 100644 --- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml +++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml @@ -50,9 +50,10 @@ execution: account: ??? output_dir: ??? walltime: "04:00:00" - # gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what - # the QOS allows) or sbatch fails "Requested node configuration is not available". - # gres: gpu:4 # e.g. GB300 (4 GPUs); match --tensor-parallel-size/--data-parallel-size to it (see references/parallelism.md) + # gres defaults to gpu:8. If your NEL install ships an internal/slurm/ + # execution config, prefer it (it pre-fills gres/hostname/partition). Otherwise + # set gres to the node's GPU count or sbatch fails "Requested node configuration + # is not available"; match --tensor/--data-parallel-size to it (references/parallelism.md). 
   mounts:
     mount_home: false
   auto_export: # REQUIRED trigger for auto-export. Without this, the
@@ -65,18 +66,14 @@ deployment:
   hf_model_handle:
   served_model_name: ???
   image: vllm/vllm-openai:v0.19.1
-  # NVFP4 on Blackwell (B200/B300/GB200/GB300, sm_100/sm_103): the v0.19.1 and
-  # any cu129 builds have NO sm_103 FP4 kernels (deploy dies with CUDA
-  # "no kernel image is available"). Use a CUDA-13 build — prefer the pinned
-  # release matching this image, bump to nightly only if it lacks the arch:
-  # image: vllm/vllm-openai:v0.19.1-cu130 # reproducible; verify arch support
-  # image: vllm/vllm-openai:cu130-nightly-x86_64 # newest (-aarch64 on Grace)
-  # (Qwen3.5-9B's qwen3_5 arch needed the nightly = vLLM 0.19.2rc1.dev134; the
-  # pinned v0.19.1-cu130 is untested for it.) Confirm via recipes.vllm.ai
-  # ?hardware=b300 (JS-rendered; fetch raw markdown at
-  # github.com/vllm-project/recipes/blob/main//.md).
-  # Multimodal on sm_103: the ViT CuTe flash-attn asserts "Only SM 10.x/11.x";
-  # workaround `--mm-encoder-attn-backend TRITON_ATTN` (may be unneeded on newer builds).
+  # NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the (multi-arch)
+  # image tag — the default cu12 build has no sm_103 FP4 kernel, so the deploy
+  # dies at engine init with CUDA "no kernel image is available". e.g.:
+  # image: vllm/vllm-openai:v0.19.1-cu130
+  # If a pinned release predates your model's arch, use the nightly instead
+  # (Qwen3.5-9B's qwen3_5 needed vllm/vllm-openai:cu130-nightly-x86_64, 0.19.2rc1.dev134).
+  # Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`
+  # (ViT flash-attn workaround; drop if the encoder loads without it).
   #
   # `--served-model-name ${deployment.served_model_name}` (in command below) is
   # REQUIRED: without it vLLM registers the model as `/checkpoint` and every eval

From 9ccf032d1e02436768a373fa005da4aa15153c2a Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng
Date: Tue, 2 Jun 2026 23:09:16 -0700
Subject: [PATCH 5/7] [skills] gres: defer to #1599's internal/slurm/, keep one-line fallback

Reduce the gres guidance to a single fallback note (the slurm/default
case), deferring the predefined per-cluster config path to PR #1599
(which pre-fills gres/hostname/partition). Fills the gap #1599's
fallback branch leaves (it does not mention gres for the
no-internal-package case).

Co-Authored-By: Claude Opus 4.8 (1M context)
Signed-off-by: Zhiyu Cheng
---
 .claude/skills/evaluation/SKILL.md | 2 +-
 .../skills/evaluation/recipes/examples/example_eval.yaml | 7 +++----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index b4132acf40e..5124c347204 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -200,7 +200,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
 - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
 - **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
 - Ask about other defaults they may want to change (partition, walltime, MLflow tags).
-- **`execution.gres`** — if your NEL install ships an `internal/slurm/` execution config, prefer it (it pre-fills `gres`/hostname/partition/node-exclusivity).
Otherwise NEL defaults to `gpu:8`; set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`), or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → `gres: gpu:4`; check with `sinfo -o '%P %G'`).
+- **`execution.gres`** — auto-set if you used a predefined `internal/slurm/` config (above). On the `slurm/default` fallback it's `gpu:8`, so set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`) or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. 4-GPU GB300 → `gres: gpu:4`; check with `sinfo -o '%P %G'`).

 **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.

diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
index 185e80397c0..6ef6a2e8ca0 100644
--- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml
+++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -50,10 +50,9 @@ execution:
   account: ???
   output_dir: ???
   walltime: "04:00:00"
-  # gres defaults to gpu:8. If your NEL install ships an internal/slurm/
-  # execution config, prefer it (it pre-fills gres/hostname/partition). Otherwise
-  # set gres to the node's GPU count or sbatch fails "Requested node configuration
-  # is not available"; match --tensor/--data-parallel-size to it (references/parallelism.md).
+  # gres: a predefined internal/slurm/ config (see SKILL Step 4) sets this.
+  # On the slurm/default fallback it's gpu:8 — set to the node's GPU count or sbatch
+  # fails "Requested node configuration is not available" (4-GPU GB300 -> gpu:4).
   mounts:
     mount_home: false
   auto_export: # REQUIRED trigger for auto-export.
Without this, the

From 96408be1019a9a381cf6a539f0cb2e059c3fedb4 Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng
Date: Tue, 2 Jun 2026 23:16:05 -0700
Subject: [PATCH 6/7] [skills] de-dup served-model-name rationale (single-source to example)

Per @chadvoegele: keep the 'why' (vLLM serves as /checkpoint -> 404)
only in example_eval.yaml; the SKILL Step-3 Conventions line now just
names the required flag and points to the example.

Co-Authored-By: Claude Opus 4.8 (1M context)
Signed-off-by: Zhiyu Cheng
---
 .claude/skills/evaluation/SKILL.md | 2 +-
 .claude/skills/evaluation/recipes/examples/example_eval.yaml | 5 ++---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 5124c347204..2c80930c6b4 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -138,7 +138,7 @@ deployment:
     <... rest of cross-checked flags ...>
 ```

-Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--served-model-name ${deployment.served_model_name}` (**required** — without it vLLM registers the model under the path `/checkpoint`, and every eval request 404s with "The model `` does not exist."); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
+Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--served-model-name ${deployment.served_model_name}` (**required**; see `example_eval.yaml` for why); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
 For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.

diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
index 6ef6a2e8ca0..e188d37137a 100644
--- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml
+++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -74,9 +74,8 @@ deployment:
   # Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`
   # (ViT flash-attn workaround; drop if the encoder loads without it).
   #
-  # `--served-model-name ${deployment.served_model_name}` (in command below) is
-  # REQUIRED: without it vLLM registers the model as `/checkpoint` and every eval
-  # request 404s ("The model `` does not exist.").
+  # `--served-model-name` (in command below) is REQUIRED — else vLLM serves the
+  # model as `/checkpoint` and eval requests 404 ("model does not exist").
   # For MoE models, add `--enable-expert-parallel` to the command.
   # For models with custom code, add `--trust-remote-code` to the command.
   # After filling in evaluation `parallelism` values (top-level + per-task),

From 81fed4276fe04c7c5cdb552f521481a5411d0909 Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng
Date: Tue, 2 Jun 2026 23:55:22 -0700
Subject: [PATCH 7/7] [skills] de-dup hf_model_handle rationale (single-source to example)

Keep the full 'why' (hf_model_handle not mounted -> HFValidationError,
stage via snapshot_download) only in example_eval.yaml; reduce the SKILL
Step-3 note to a concise pointer. Same single-sourcing pattern as
served-model-name.
Co-Authored-By: Claude Opus 4.8 (1M context)
Signed-off-by: Zhiyu Cheng
---
 .claude/skills/evaluation/SKILL.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 2c80930c6b4..f668ef223d5 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -77,7 +77,7 @@ nel skills build-config --execution <...> --deployment <...> --model_type <...>

 **Model path.** Checkpoint path (`/`, `./`, `../`, `~`, or exists on disk) → set `deployment.checkpoint_path`, leave `hf_model_handle: null`. Else HF handle (one `/`, not on disk) → set `deployment.hf_model_handle`, leave `checkpoint_path: null`.

-> **Prefer `checkpoint_path` on SLURM — `hf_model_handle` is not reliably mounted at `/checkpoint` in current NEL.** With only `hf_model_handle` set, the `vllm serve /checkpoint` command finds nothing mounted there and vLLM treats the literal string `/checkpoint` as an HF repo id, so the deploy dies with `HFValidationError: Repo id must use alphanumeric chars … : '/checkpoint'`. To evaluate an un-staged HF model (e.g. a BF16 baseline for a quant comparison), first download it onto the cluster — `python -c "from huggingface_hub import snapshot_download; snapshot_download('/', local_dir='')"` — then set `checkpoint_path: ` (this is the path the NVFP4/quantized run already uses).
+> **Prefer `checkpoint_path` over `hf_model_handle` on SLURM** — `hf_model_handle` isn't reliably mounted at `/checkpoint`, so the deploy dies with `HFValidationError`. To eval an un-staged HF model, stage it first (`huggingface_hub.snapshot_download`) and point `checkpoint_path` at it. See `example_eval.yaml` for why.

 **Auto-detect ModelOpt quantization** (checkpoint paths). Check `config.json` for `quantization_config` (or legacy `hf_quant_config.json`):