From f8096a67714b29238c81e3c593ded67f325ec03f Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng <zhiyuc@nvidia.com>
Date: Mon, 1 Jun 2026 16:52:37 -0700
Subject: [PATCH 1/7] [skills] Fix eval/deploy defaults for NVFP4-on-Blackwell
 + nemo-skills evals

Concrete fixes from a Qwen3.5-9B NVFP4 PTQ -> deploy -> AA-eval run on
B300/GB300 where each issue caused a real failure:

- example_eval.yaml: add --served-model-name ${deployment.served_model_name};
  without it vLLM registers the model as /checkpoint and every eval 404s.
- evaluation SKILL: nemo-skills (ns_*) self-deployment needs DUMMY_API_KEY in
  evaluation.env_vars (a shell export does NOT reach the SLURM container);
  document the required host:/lit:/runtime: env-var value prefixes; note that
  execution.gres must match the node GPU count (otherwise sbatch fails with
  'Requested node configuration is not available').
- deployment + evaluation SKILL: NVFP4 on Blackwell (sm_100/sm_103) requires
  vllm/vllm-openai:cu130-nightly; v0.19.1 and any cu129 build lack sm_103 FP4
  kernels (engine init dies with 'no kernel image is available'). Also
  document --mm-encoder-attn-backend TRITON_ATTN for multimodal models on
  sm_103, and the raw-markdown fallback for recipes.vllm.ai hardware variants.
- ptq launcher-guide: match gpus_per_node to node/QOS; EXTRA_PIP_DEPS must
  avoid shell metacharacters (use == pins, not >=/<).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
---
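Note on the env-var prefix rule: the evaluation SKILL change states that every
value in an `env_vars` map needs a `host:`, `lit:`, or `runtime:` prefix. A
minimal sketch of a pre-submit check for that rule follows. The function name
`find_unprefixed` and the sample map are hypothetical and only illustrate the
rule; this is not NEL's own validation code.

```python
# Hypothetical pre-submit check; not part of NEL.
ALLOWED_PREFIXES = ("host:", "lit:", "runtime:")


def find_unprefixed(env_vars: dict) -> list:
    """Return the names whose values lack a host:/lit:/runtime: prefix."""
    return [
        name
        for name, value in env_vars.items()
        if not str(value).startswith(ALLOWED_PREFIXES)
    ]


env_vars = {
    "HF_TOKEN": "host:HF_TOKEN",  # read from the submitting shell
    "DUMMY_API_KEY": "dummy",     # bare value: the SKILL says this hard-errors
}
bad = find_unprefixed(env_vars)
print(bad)  # ['DUMMY_API_KEY']
```

Changing the second entry to `lit:dummy` makes the list empty, which matches
the value this patch adds to example_eval.yaml.
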
 .claude/skills/deployment/SKILL.md            | 12 ++++++++
 .claude/skills/evaluation/SKILL.md            | 28 ++++++++++++++-----
 .../recipes/examples/example_eval.yaml        | 20 +++++++++++++
 .../skills/ptq/references/launcher-guide.md   | 12 ++++++++
 4 files changed, 65 insertions(+), 7 deletions(-)
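
Note on the EXTRA_PIP_DEPS rule: the launcher-guide change says the value is
written into an unquoted `export`, so `>` and `<` are parsed as redirections.
A minimal sketch of a check that flags such values before submission follows.
The helper name `unsafe_chars` and the character set are assumptions chosen
for illustration; the set is not an exhaustive list of shell metacharacters.

```python
# Illustrative (not exhaustive) set of characters that are special to the
# shell inside an unquoted word.
SHELL_META = set("<>|&;()$`\\\"'*? \t\n")


def unsafe_chars(value: str) -> list:
    """Return the sorted, de-duplicated shell metacharacters found in value."""
    return sorted({ch for ch in value if ch in SHELL_META})


print(unsafe_chars("transformers>=4.57,<4.58"))  # ['<', '>']
print(unsafe_chars("transformers==5.5.0"))       # []
```

An empty result only means none of the listed characters is present; it does
not prove the value is safe in every shell context.
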
diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md
index f14cc0b9822..c542f462ca2 100644
--- a/.claude/skills/deployment/SKILL.md
+++ b/.claude/skills/deployment/SKILL.md
@@ -125,6 +125,18 @@ python -m vllm.entrypoints.openai.api_server \
 
 For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
 
+> **NVFP4 on Blackwell needs the CUDA-13 vLLM build.** On B200/B300/GB200/GB300
+> (compute capability sm_100/sm_103), use `vllm/vllm-openai:cu130-nightly-<arch>`
+> (`-x86_64`, or `-aarch64` on Grace). The common `v0.19.1` / any `cu129`
+> (CUDA 12.9) build has **no sm_103 FP4 kernels** — vLLM loads the checkpoint
+> then dies at engine init with `CUDA error: no kernel image is available for
+> execution on the device` (affects the `flashinfer` and `cutlass` NVFP4
+> backends; `marlin` separately fails on non-64-divisible layer dims). Verify the
+> image via `recipes.vllm.ai/<org>/<model>?hardware=b300` (JS-rendered — fetch the
+> raw markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`).
+> For multimodal models on sm_103, also pass `--mm-encoder-attn-backend
+> TRITON_ATTN` (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x").
+
 #### SGLang
 
 ```bash
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 4da78164a7f..74bc47d5436 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -136,12 +136,14 @@ deployment:
     <... rest of cross-checked flags ...>
 ```
 
-Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
+Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--served-model-name ${deployment.served_model_name}` (**required** — without it vLLM registers the model under the path `/checkpoint`, and every eval request 404s with "The model `<served_model_name>` does not exist."); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
 
 For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.
 
 **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
 
+> **NVFP4 on Blackwell needs the CUDA-13 build.** Serving an NVFP4 checkpoint on Blackwell (B200/B300/GB200/GB300, compute capability sm_100/sm_103) requires `vllm/vllm-openai:cu130-nightly-<arch>` (`-x86_64`, or `-aarch64` on Grace). The pinned `v0.19.1` and **all** `cu129` (CUDA 12.9) builds lack sm_103 FP4 kernels — the server loads the checkpoint then dies at engine init with `CUDA error: no kernel image is available for execution on the device` (true for the `flashinfer` *and* `cutlass` NVFP4 backends; `marlin` separately fails on non-64-divisible layer dims). This is the vLLM-recipe-recommended Blackwell image — confirm via `recipes.vllm.ai/<org>/<model>?hardware=b300` (and since that page is JS-rendered, fetch the raw markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`). For **multimodal** models on sm_103, also add `--mm-encoder-attn-backend TRITON_ATTN` — the default CuTe ViT flash-attn kernel asserts "Only SM 10.x and 11.x are supported" on sm_103.
+
 #### vLLM-backend defaults — always include unless the recipe *contradicts*
 
 Silence is not contradiction. Drop/override only when the recipe sets a different value for the same setting (e.g. recipe pins `--max-num-batched-tokens 16384` → use 16384).
@@ -196,6 +198,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
 - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
 - **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
 - Ask about other defaults they may want to change (partition, walltime, MLflow tags).
+- **`execution.gres`** — NEL defaults to `gpu:8`. Set it to the cluster's per-node GPU count (and what the QOS permits), and match `--data-parallel-size`/`--tensor-parallel-size` to it. A mismatch makes `sbatch` reject the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → set `gres: gpu:4`). Confirm the node GPU count with `sinfo -o '%P %G'` on the target cluster.
 
 **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
 
@@ -229,15 +232,21 @@ Implications for the agent:
 
 **Tasks that call an external judge / user-simulator / scoring endpoint.** Treat this as a general pattern, not a fixed list — HLE, AA-LCR, and Tau2 need one today, but other benchmarks may too (check each task's recipe). Their `model_id` / `url` are **config, not secrets**: substitute the **literal** values the user keeps in `.env` (keys per the task's recipe + `recipes/env.example`) into the task's `<VAR>` placeholders. Do **not** emit `${oc.env:...}` for these (it silently fails unless the var was exported with `set -a`). Only `api_key` stays an env-var *name* (e.g. `INFERENCE_API_KEY`), exported and read by the harness.
 
-**Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level:
+**Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks (`ns_*`) with self-deployment (vLLM/SGLang/NIM), you need **both** of these:
 
 ```yaml
-target:
-  api_endpoint:
-    api_key_name: DUMMY_API_KEY
+evaluation:
+  env_vars:
+    DUMMY_API_KEY: lit:dummy   # MUST be set here — see below
+  nemo_evaluator_config:
+    target:
+      api_endpoint:
+        api_key_name: DUMMY_API_KEY
 ```
 
-External-deployment configs already define `api_key_name`. Export of `DUMMY_API_KEY` is handled in Step 8.
+`api_key_name` only names the env var; the nemo-skills client **hard-fails if that var has no value inside the eval container** (`ValueError: api_key_env_var=DUMMY_API_KEY but the value is not set`). On SLURM, a shell `export DUMMY_API_KEY=dummy` (Step 8) does **NOT** propagate into the container — NEL only injects vars declared in `env_vars`. So declare `DUMMY_API_KEY: lit:dummy` under `evaluation.env_vars` (note the `lit:` prefix — see below). The shell export only helps for local/Docker runs. External-deployment configs already define `api_key_name`.
+
+**NEL env-var value prefixes (required):** every value in an `env_vars` map needs an explicit prefix — `host:VAR` (read from the submitting shell's env at submit time), `lit:value` (literal string), or `runtime:VAR` (read in the job at run time). A bare value (e.g. `DUMMY_API_KEY: dummy`) hard-errors: *"Env var value '…' must have an explicit prefix."* Use `lit:` for constants like `DUMMY_API_KEY` and `VLLM_*` backend selectors, `host:` for secrets like `HF_TOKEN` / `INFERENCE_API_KEY`.
 
 ---
 
@@ -258,10 +267,13 @@ Default images:
 | Framework | Image | Registry |
 | --- | --- | --- |
 | vLLM | `vllm/vllm-openai:v0.19.1` (bump per recipe; never `:latest`) | DockerHub |
+| vLLM (NVFP4 on Blackwell) | `vllm/vllm-openai:cu130-nightly-x86_64` (or `-aarch64`) | DockerHub |
 | SGLang | `lmsysorg/sglang:latest` | DockerHub |
 | TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
 | Eval tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |
 
+> NVFP4 checkpoints on Blackwell (sm_100/sm_103) need the `cu130-nightly` image — cu129/v0.19.1 lack sm_103 FP4 kernels (see the "NVFP4 on Blackwell" note in Step 3).
+
 Public images → submit without preflight. Private/restricted → check credentials:
 
 ```bash
@@ -284,8 +296,10 @@ set -a && source .env && set +a
 
 # If pre_cmd/post_cmd in config (review pre_cmd first — runs arbitrary commands):
 export NEMO_EVALUATOR_TRUST_PRE_CMD=1
-# If nemo_skills.* + self-deployment:
+# If nemo_skills.* + self-deployment, for LOCAL/Docker runs only:
 export DUMMY_API_KEY=dummy
+# On SLURM this shell export does NOT reach the container — instead declare
+# `DUMMY_API_KEY: lit:dummy` under evaluation.env_vars (see Step 5).
 ```
 
 **Step 8.1 — Dry-run** (config validation):
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
index c3a48d8c58b..0066e3afcaf 100644
--- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml
+++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -43,6 +43,9 @@ execution:
   account: ???
   output_dir: ???
   walltime: "04:00:00"
+  # gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what
+  # the QOS allows) or sbatch fails "Requested node configuration is not available".
+  # gres: gpu:4   # e.g. GB300 nodes (4 GPUs); also drop --data-parallel-size to match
   mounts:
     mount_home: false
   auto_export:          # REQUIRED trigger for auto-export. Without this, the
@@ -55,6 +58,16 @@ deployment:
   hf_model_handle:
   served_model_name: ???
   image: vllm/vllm-openai:v0.19.1
+  # NVFP4 on Blackwell (B200/B300/GB200/GB300, sm_100/sm_103): the v0.19.1 and
+  # any cu129 builds have NO sm_103 FP4 kernels (deploy dies with CUDA
+  # "no kernel image is available"). Use the CUDA-13 nightly instead:
+  #   image: vllm/vllm-openai:cu130-nightly-x86_64    # or -aarch64 on Grace
+  # (vLLM-recipe-recommended Blackwell image; see recipes.vllm.ai ?hardware=b300).
+  # For multimodal models on sm_103 also add `--mm-encoder-attn-backend TRITON_ATTN`.
+  #
+  # `--served-model-name ${deployment.served_model_name}` (in command below) is
+  # REQUIRED: without it vLLM registers the model as `/checkpoint` and every eval
+  # request 404s ("The model `<served_model_name>` does not exist.").
   # For MoE models, add `--enable-expert-parallel` to the command.
   # For models with custom code, add `--trust-remote-code` to the command.
   # After filling in evaluation `parallelism` values (top-level + per-task),
@@ -62,6 +75,7 @@ deployment:
   # N = ceil(max_parallelism / data_parallel_size).
   command: >-
     vllm serve /checkpoint
+    --served-model-name ${deployment.served_model_name}
     --host 0.0.0.0
     --port ${deployment.port}
     --tensor-parallel-size 1
@@ -72,6 +86,12 @@ deployment:
 evaluation:
   env_vars:
     HF_TOKEN: host:HF_TOKEN
+    # nemo-skills tasks (ns_*) hard-require the served-endpoint api key env var
+    # (api_key_name below) to be set WITH A VALUE inside the eval container.
+    # A shell `export DUMMY_API_KEY=dummy` does NOT reach the SLURM container —
+    # it must be declared here. Omit it and ns_* tasks die with
+    # "api_key_env_var=DUMMY_API_KEY but the value is not set".
+    DUMMY_API_KEY: lit:dummy
   nemo_evaluator_config:
     config:
       params:
diff --git a/.claude/skills/ptq/references/launcher-guide.md b/.claude/skills/ptq/references/launcher-guide.md
index 542c4ade5b1..9fff22b12a4 100644
--- a/.claude/skills/ptq/references/launcher-guide.md
+++ b/.claude/skills/ptq/references/launcher-guide.md
@@ -31,6 +31,18 @@ pipeline:
       gpus_per_node: <num_gpus>
 ```
 
+> **Match `gpus_per_node` to the cluster's node GPU count / QOS minimum.** If it
+> is below what the QOS requires (many clusters mandate a full node), `sbatch`
+> rejects the job with `QOSMinGRES` or `Requested node configuration is not
+> available`. e.g. GB300 nodes have 4 GPUs and require the full node → set
+> `gpus_per_node: 4`; B300/B200 nodes have 8. Check with `sinfo -o '%P %G'`.
+
+> **`EXTRA_PIP_DEPS` must avoid shell metacharacters.** It is written into an
+> unquoted `export` in the generated sbatch script, so a value like
+> `transformers>=4.57,<4.58` is mangled by shell redirection (`>`/`<`) and
+> silently dropped — the deps never install. Use exact pins instead, e.g.
+> `EXTRA_PIP_DEPS: "transformers==5.5.0"`.
+
 Extra `hf_ptq.py` flags can be passed via `args`:
 
 ```yaml

From 0d7595da2a69097da93d6bbf66b3bb0fb6cee483 Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng <zhiyuc@nvidia.com>
Date: Mon, 1 Jun 2026 18:08:59 -0700
Subject: [PATCH 2/7] [skills] evaluation: prefer checkpoint_path over
 hf_model_handle

hf_model_handle is not reliably mounted at /checkpoint in current NEL: with
only hf_model_handle set, `vllm serve /checkpoint` makes vLLM treat the
literal '/checkpoint' as an HF repo id and the deploy dies with
`HFValidationError: Repo id must use alphanumeric chars ... : '/checkpoint'`.
Document preferring checkpoint_path (download the HF model to the cluster via
snapshot_download first) in the evaluation SKILL Step 3 and example_eval.yaml.

Hit while running a BF16 baseline (Qwen/Qwen3.5-9B) for an NVFP4 comparison.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
---
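Note on the model-path rule: the SKILL text touched here distinguishes a
checkpoint path (starts with `/`, `./`, `../`, `~`, or exists on disk) from an
HF handle (one `/`, not on disk). A minimal sketch of that decision follows.
The function name `classify_model_arg` is hypothetical, and raising on
anything that matches neither shape is an assumption of this sketch.

```python
import os


def classify_model_arg(model: str) -> str:
    """Return which deployment field a user-supplied model string belongs in."""
    looks_like_path = model.startswith(("/", "./", "../", "~"))
    if looks_like_path or os.path.exists(os.path.expanduser(model)):
        return "checkpoint_path"
    if model.count("/") == 1:
        # Per this patch: on SLURM, download the model to the cluster first
        # and then set checkpoint_path instead of relying on the handle.
        return "hf_model_handle"
    raise ValueError(f"cannot classify model argument: {model!r}")


print(classify_model_arg("/lustre/models/qwen-nvfp4"))  # checkpoint_path
```

A string such as `Qwen/Qwen3.5-9B` is classified as an HF handle unless a
directory with that relative path happens to exist in the working directory.
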
 .claude/skills/evaluation/SKILL.md                  |  2 ++
 .../evaluation/recipes/examples/example_eval.yaml   | 13 ++++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 74bc47d5436..962bf9a6292 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -77,6 +77,8 @@ nel skills build-config --execution <...> --deployment <...> --model_type <...>
 
 **Model path.** Checkpoint path (`/`, `./`, `../`, `~`, or exists on disk) → set `deployment.checkpoint_path`, leave `hf_model_handle: null`. Else HF handle (one `/`, not on disk) → set `deployment.hf_model_handle`, leave `checkpoint_path: null`.
 
+> **Prefer `checkpoint_path` on SLURM — `hf_model_handle` is not reliably mounted at `/checkpoint` in current NEL.** With only `hf_model_handle` set, the `vllm serve /checkpoint` command finds nothing mounted there and vLLM treats the literal string `/checkpoint` as an HF repo id, so the deploy dies with `HFValidationError: Repo id must use alphanumeric chars … : '/checkpoint'`. To evaluate an un-staged HF model (e.g. a BF16 baseline for a quant comparison), first download it onto the cluster — `python -c "from huggingface_hub import snapshot_download; snapshot_download('<org>/<model>', local_dir='<path>')"` — then set `checkpoint_path: <path>` (this is the path the NVFP4/quantized run already uses).
+
 **Auto-detect ModelOpt quantization** (checkpoint paths). Check `config.json` for `quantization_config` (or legacy `hf_quant_config.json`):
 
 - **vLLM:** no `--quantization` flag by default — vLLM auto-detects from `quantization_config` / `hf_quant_config.json`. Add only when the card, vLLM version, or dry-run error requires it.
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
index 0066e3afcaf..e068ca01f9b 100644
--- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml
+++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -23,9 +23,16 @@
 #
 # Deployment uses a single `command:` field instead of separate
 # `tensor_parallel_size` / `data_parallel_size` / `extra_args` fields — the full
-# `vllm serve` invocation lives in the command string. NEL mounts the resolved
-# model (from checkpoint_path or hf_model_handle) at /checkpoint inside the
-# container, and Hydra interpolates ${deployment.port} at run time.
+# `vllm serve` invocation lives in the command string. NEL mounts a
+# `checkpoint_path` at /checkpoint inside the container, and Hydra interpolates
+# ${deployment.port} at run time.
+#
+# PREFER `checkpoint_path` (a path already on the cluster) over `hf_model_handle`.
+# `hf_model_handle` is NOT reliably mounted at /checkpoint in current NEL — the
+# `vllm serve /checkpoint` command then makes vLLM treat "/checkpoint" as a HF
+# repo id and the deploy dies with `HFValidationError: Repo id must use ...`.
+# To eval an un-staged HF model, first download it to the cluster (e.g.
+# `huggingface_hub.snapshot_download`) and point `checkpoint_path` at it.
 #
 # Run a single task:
 #   nel run --config ... -t gpqa_diamond_aa_v3

From 2fa127ef573665000b4b31ea5c4a55a94f23f7cc Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng <zhiyuc@nvidia.com>
Date: Tue, 2 Jun 2026 11:31:12 -0700
Subject: [PATCH 3/7] [skills] address PR #1595 review feedback

- Blackwell NVFP4 image: recommend pinned vllm/vllm-openai:v0.19.1-cu130
  (matches the default image), bump to cu130-nightly-<arch> only if it lacks
  the model arch; note Qwen3.5-9B's qwen3_5 was verified on the nightly
  (v0.19.2rc1.dev134), v0.19.1-cu130 untested for it (cjluo-nv).
- De-duplicate the Blackwell note: trim the evaluation SKILL Step 3 copy and
  point to example_eval.yaml for the full version (chadvoegele, cjluo-nv).
- gres comment: 'match --tensor-parallel-size/--data-parallel-size to it' and
  refer to references/parallelism.md for DP/TP sizing (kaix-nv, chadvoegele).
- Clarify TRITON_ATTN is a ViT sm_103 flash-attn workaround that may be
  unneeded on newer builds (cjluo-nv).
- launcher-guide: note EXTRA_PIP_DEPS transformers version is model-specific,
  not the repo's library pin (Copilot).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
---
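Note on matching the parallel layout to gres: a minimal sketch of the check
implied by the gres comment follows. It assumes a single-node deployment with
no pipeline parallelism, so the GPUs requested are tensor-parallel size times
data-parallel size. The function names are hypothetical; see
references/parallelism.md for the actual sizing rules.

```python
def gpus_in_gres(gres: str) -> int:
    """Parse the GPU count from a SLURM gres string such as 'gpu:4'."""
    kind, _, count = gres.rpartition(":")
    if not kind.startswith("gpu") or not count.isdigit():
        raise ValueError(f"unrecognised gres value: {gres!r}")
    return int(count)


def layout_fits(gres: str, tensor_parallel: int, data_parallel: int) -> bool:
    """True if TP x DP does not exceed the GPUs requested via gres."""
    return tensor_parallel * data_parallel <= gpus_in_gres(gres)


print(layout_fits("gpu:4", tensor_parallel=1, data_parallel=8))  # False
print(layout_fits("gpu:4", tensor_parallel=1, data_parallel=4))  # True
```

This only compares the layout with the gres string. Whether the cluster
accepts that gres value is a separate question for `sinfo` and the QOS.
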
 .claude/skills/evaluation/SKILL.md               |  4 ++--
 .../recipes/examples/example_eval.yaml           | 16 +++++++++++-----
 .claude/skills/ptq/references/launcher-guide.md  |  6 ++++--
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 962bf9a6292..3a47b115ebf 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -144,7 +144,7 @@ For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipelin
 
 **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
 
-> **NVFP4 on Blackwell needs the CUDA-13 build.** Serving an NVFP4 checkpoint on Blackwell (B200/B300/GB200/GB300, compute capability sm_100/sm_103) requires `vllm/vllm-openai:cu130-nightly-<arch>` (`-x86_64`, or `-aarch64` on Grace). The pinned `v0.19.1` and **all** `cu129` (CUDA 12.9) builds lack sm_103 FP4 kernels — the server loads the checkpoint then dies at engine init with `CUDA error: no kernel image is available for execution on the device` (true for the `flashinfer` *and* `cutlass` NVFP4 backends; `marlin` separately fails on non-64-divisible layer dims). This is the vLLM-recipe-recommended Blackwell image — confirm via `recipes.vllm.ai/<org>/<model>?hardware=b300` (and since that page is JS-rendered, fetch the raw markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`). For **multimodal** models on sm_103, also add `--mm-encoder-attn-backend TRITON_ATTN` — the default CuTe ViT flash-attn kernel asserts "Only SM 10.x and 11.x are supported" on sm_103.
+> **NVFP4 on Blackwell needs a CUDA-13 vLLM build.** On B200/B300/GB200/GB300 (sm_100/sm_103) the pinned `v0.19.1` and all `cu129` builds lack sm_103 FP4 kernels — engine init dies with `CUDA error: no kernel image is available for execution on the device`. Use `vllm/vllm-openai:v0.19.1-cu130` (pinned, matches the default image), and bump to `cu130-nightly-<arch>` only if it lacks the model's arch (Qwen3.5-9B's `qwen3_5` needed the nightly). Multimodal on sm_103 also needs `--mm-encoder-attn-backend TRITON_ATTN` (ViT flash-attn workaround). Full note + the `recipes.vllm.ai ?hardware=b300` lookup are in `recipes/examples/example_eval.yaml`.
 
 #### vLLM-backend defaults — always include unless the recipe *contradicts*
 
@@ -269,7 +269,7 @@ Default images:
 | Framework | Image | Registry |
 | --- | --- | --- |
 | vLLM | `vllm/vllm-openai:v0.19.1` (bump per recipe; never `:latest`) | DockerHub |
-| vLLM (NVFP4 on Blackwell) | `vllm/vllm-openai:cu130-nightly-x86_64` (or `-aarch64`) | DockerHub |
+| vLLM (NVFP4 on Blackwell) | `vllm/vllm-openai:v0.19.1-cu130` (bump to `cu130-nightly-<arch>` for new archs) | DockerHub |
 | SGLang | `lmsysorg/sglang:latest` | DockerHub |
 | TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
 | Eval tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
index e068ca01f9b..6035e04f765 100644
--- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml
+++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -52,7 +52,7 @@ execution:
   walltime: "04:00:00"
   # gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what
   # the QOS allows) or sbatch fails "Requested node configuration is not available".
-  # gres: gpu:4   # e.g. GB300 nodes (4 GPUs); also drop --data-parallel-size to match
+  # gres: gpu:4   # e.g. GB300 (4 GPUs); match --tensor-parallel-size/--data-parallel-size to it (see references/parallelism.md)
   mounts:
     mount_home: false
   auto_export:          # REQUIRED trigger for auto-export. Without this, the
@@ -67,10 +67,16 @@ deployment:
   image: vllm/vllm-openai:v0.19.1
   # NVFP4 on Blackwell (B200/B300/GB200/GB300, sm_100/sm_103): the v0.19.1 and
   # any cu129 builds have NO sm_103 FP4 kernels (deploy dies with CUDA
-  # "no kernel image is available"). Use the CUDA-13 nightly instead:
-  #   image: vllm/vllm-openai:cu130-nightly-x86_64    # or -aarch64 on Grace
-  # (vLLM-recipe-recommended Blackwell image; see recipes.vllm.ai ?hardware=b300).
-  # For multimodal models on sm_103 also add `--mm-encoder-attn-backend TRITON_ATTN`.
+  # "no kernel image is available"). Use a CUDA-13 build — prefer the pinned
+  # release matching this image, bump to nightly only if it lacks the arch:
+  #   image: vllm/vllm-openai:v0.19.1-cu130          # reproducible; verify arch support
+  #   image: vllm/vllm-openai:cu130-nightly-x86_64   # newest (-aarch64 on Grace)
+  # (Qwen3.5-9B's qwen3_5 arch needed the nightly = vLLM 0.19.2rc1.dev134; the
+  # pinned v0.19.1-cu130 is untested for it.) Confirm via recipes.vllm.ai
+  # ?hardware=b300 (JS-rendered; fetch raw markdown at
+  # github.com/vllm-project/recipes/blob/main/<org>/<model>.md).
+  # Multimodal on sm_103: the ViT CuTe flash-attn asserts "Only SM 10.x/11.x";
+  # workaround `--mm-encoder-attn-backend TRITON_ATTN` (may be unneeded on newer builds).
   #
   # `--served-model-name ${deployment.served_model_name}` (in command below) is
   # REQUIRED: without it vLLM registers the model as `/checkpoint` and every eval
diff --git a/.claude/skills/ptq/references/launcher-guide.md b/.claude/skills/ptq/references/launcher-guide.md
index 9fff22b12a4..1b5bb263818 100644
--- a/.claude/skills/ptq/references/launcher-guide.md
+++ b/.claude/skills/ptq/references/launcher-guide.md
@@ -40,8 +40,10 @@ pipeline:
 > **`EXTRA_PIP_DEPS` must avoid shell metacharacters.** It is written into an
 > unquoted `export` in the generated sbatch script, so a value like
 > `transformers>=4.57,<4.58` is mangled by shell redirection (`>`/`<`) and
-> silently dropped — the deps never install. Use exact pins instead, e.g.
-> `EXTRA_PIP_DEPS: "transformers==5.5.0"`.
+> silently dropped — the deps never install. Use exact `==` pins (no `>`/`<`).
+> The right version is **model-specific** — a brand-new architecture may need a
+> newer transformers than the repo's library pin (e.g. Qwen3.5's `qwen3_5` needs
+> `EXTRA_PIP_DEPS: "transformers==5.5.0"`); pick what the target model requires.
 
 Extra `hf_ptq.py` flags can be passed via `args`:
 

From beaad18745b8376a2f6c84f599c385abb645dec3 Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng <zhiyuc@nvidia.com>
Date: Tue, 2 Jun 2026 23:00:05 -0700
Subject: [PATCH 4/7] [skills] address PR #1595 second-pass review (cjluo-nv)

- Blackwell image: simplify to 'B300/GB300 -> append -cu130 to the (multi-arch)
  image tag' (e.g. v0.19.1-cu130); keep a one-line nightly fallback for archs a
  pinned release predates (qwen3_5). Applied in eval + deployment skills.
- gres: defer to NEL's internal/slurm/<cluster> execution configs (PR #1599)
  when present (they pre-fill gres/hostname/partition); keep the manual fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
---
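Note on the "append -cu130" rule: a minimal sketch of the tag rewrite follows.
The helper name `cu130_tag` is hypothetical. It only edits the string and does
not check that the resulting tag exists on DockerHub.

```python
def cu130_tag(image: str) -> str:
    """Append '-cu130' to an image tag unless it already names a cu130 build."""
    repo, sep, tag = image.rpartition(":")
    # A '/' after the last ':' means the colon belonged to a registry port,
    # so the reference carries no tag at all.
    if not sep or "/" in tag:
        raise ValueError(f"image has no explicit tag: {image!r}")
    if "cu130" in tag:
        return image
    return f"{repo}:{tag}-cu130"


print(cu130_tag("vllm/vllm-openai:v0.19.1"))
# vllm/vllm-openai:v0.19.1-cu130
```

Nightly tags such as `cu130-nightly-x86_64` pass through unchanged, which
keeps the fallback for architectures that a pinned release predates.
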
 .claude/skills/deployment/SKILL.md            | 19 ++++++-------
 .claude/skills/evaluation/SKILL.md            |  4 +--
 .../recipes/examples/example_eval.yaml        | 27 +++++++++----------
 3 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md
index c542f462ca2..099ac7a6a41 100644
--- a/.claude/skills/deployment/SKILL.md
+++ b/.claude/skills/deployment/SKILL.md
@@ -125,17 +125,18 @@ python -m vllm.entrypoints.openai.api_server \
 
 For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
 
-> **NVFP4 on Blackwell needs the CUDA-13 vLLM build.** On B200/B300/GB200/GB300
-> (compute capability sm_100/sm_103), use `vllm/vllm-openai:cu130-nightly-<arch>`
-> (`-x86_64`, or `-aarch64` on Grace). The common `v0.19.1` / any `cu129`
-> (CUDA 12.9) build has **no sm_103 FP4 kernels** — vLLM loads the checkpoint
+> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag**
+> (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The
+> default cu12 build has **no sm_103 FP4 kernel**, so vLLM loads the checkpoint
 > then dies at engine init with `CUDA error: no kernel image is available for
 > execution on the device` (affects the `flashinfer` and `cutlass` NVFP4
-> backends; `marlin` separately fails on non-64-divisible layer dims). Verify the
-> image via `recipes.vllm.ai/<org>/<model>?hardware=b300` (JS-rendered — fetch the
-> raw markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`).
-> For multimodal models on sm_103, also pass `--mm-encoder-attn-backend
-> TRITON_ATTN` (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x").
+> backends; `marlin` separately fails on non-64-divisible layer dims). If a
+> pinned release predates the model's arch, use `cu130-nightly-<arch>` instead
+> (Qwen3.5-9B's `qwen3_5` needed it). Cross-check via
+> `recipes.vllm.ai/<org>/<model>?hardware=b300` (JS-rendered — fetch the raw
+> markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`). For
+> multimodal models on sm_103, also pass `--mm-encoder-attn-backend TRITON_ATTN`
+> (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x").
 
 #### SGLang
 
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 3a47b115ebf..b4132acf40e 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -144,7 +144,7 @@ For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipelin
 
 **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
 
-> **NVFP4 on Blackwell needs a CUDA-13 vLLM build.** On B200/B300/GB200/GB300 (sm_100/sm_103) the pinned `v0.19.1` and all `cu129` builds lack sm_103 FP4 kernels — engine init dies with `CUDA error: no kernel image is available for execution on the device`. Use `vllm/vllm-openai:v0.19.1-cu130` (pinned, matches the default image), and bump to `cu130-nightly-<arch>` only if it lacks the model's arch (Qwen3.5-9B's `qwen3_5` needed the nightly). Multimodal on sm_103 also needs `--mm-encoder-attn-backend TRITON_ATTN` (ViT flash-attn workaround). Full note + the `recipes.vllm.ai ?hardware=b300` lookup are in `recipes/examples/example_eval.yaml`.
+> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag** (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The default cu12 build has no sm_103 FP4 kernel, so engine init dies with `CUDA error: no kernel image is available`. If a pinned release predates the model's arch, use `cu130-nightly-<arch>` (Qwen3.5-9B's `qwen3_5` needed it, vLLM 0.19.2rc1.dev134). Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`. Full note in `recipes/examples/example_eval.yaml`.
 
 #### vLLM-backend defaults — always include unless the recipe *contradicts*
 
@@ -200,7 +200,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
 - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
 - **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
 - Ask about other defaults they may want to change (partition, walltime, MLflow tags).
-- **`execution.gres`** — NEL defaults to `gpu:8`. Set it to the cluster's per-node GPU count (and what the QOS permits), and match `--data-parallel-size`/`--tensor-parallel-size` to it. A mismatch makes `sbatch` reject the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → set `gres: gpu:4`). Confirm the node GPU count with `sinfo -o '%P %G'` on the target cluster.
+- **`execution.gres`** — if your NEL install ships an `internal/slurm/<cluster>` execution config, prefer it (it pre-fills `gres`/hostname/partition/node-exclusivity). Otherwise NEL defaults to `gpu:8`; set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`), or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → `gres: gpu:4`; check with `sinfo -o '%P %G'`).
 
 **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
 
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
index 6035e04f765..185e80397c0 100644
--- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml
+++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -50,9 +50,10 @@ execution:
   account: ???
   output_dir: ???
   walltime: "04:00:00"
-  # gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what
-  # the QOS allows) or sbatch fails "Requested node configuration is not available".
-  # gres: gpu:4   # e.g. GB300 (4 GPUs); match --tensor-parallel-size/--data-parallel-size to it (see references/parallelism.md)
+  # gres defaults to gpu:8. If your NEL install ships an internal/slurm/<cluster>
+  # execution config, prefer it (it pre-fills gres/hostname/partition). Otherwise
+  # set gres to the node's GPU count or sbatch fails "Requested node configuration
+  # is not available"; match --tensor/--data-parallel-size to it (references/parallelism.md).
   mounts:
     mount_home: false
   auto_export:          # REQUIRED trigger for auto-export. Without this, the
@@ -65,18 +66,14 @@ deployment:
   hf_model_handle:
   served_model_name: ???
   image: vllm/vllm-openai:v0.19.1
-  # NVFP4 on Blackwell (B200/B300/GB200/GB300, sm_100/sm_103): the v0.19.1 and
-  # any cu129 builds have NO sm_103 FP4 kernels (deploy dies with CUDA
-  # "no kernel image is available"). Use a CUDA-13 build — prefer the pinned
-  # release matching this image, bump to nightly only if it lacks the arch:
-  #   image: vllm/vllm-openai:v0.19.1-cu130          # reproducible; verify arch support
-  #   image: vllm/vllm-openai:cu130-nightly-x86_64   # newest (-aarch64 on Grace)
-  # (Qwen3.5-9B's qwen3_5 arch needed the nightly = vLLM 0.19.2rc1.dev134; the
-  # pinned v0.19.1-cu130 is untested for it.) Confirm via recipes.vllm.ai
-  # ?hardware=b300 (JS-rendered; fetch raw markdown at
-  # github.com/vllm-project/recipes/blob/main/<org>/<model>.md).
-  # Multimodal on sm_103: the ViT CuTe flash-attn asserts "Only SM 10.x/11.x";
-  # workaround `--mm-encoder-attn-backend TRITON_ATTN` (may be unneeded on newer builds).
+  # NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the (multi-arch)
+  # image tag — the default cu12 build has no sm_103 FP4 kernel, so the deploy
+  # dies at engine init with CUDA "no kernel image is available". e.g.:
+  #   image: vllm/vllm-openai:v0.19.1-cu130
+  # If a pinned release predates your model's arch, use the nightly instead
+  # (Qwen3.5-9B's qwen3_5 needed vllm/vllm-openai:cu130-nightly-x86_64, 0.19.2rc1.dev134).
+  # Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`
+  # (ViT flash-attn workaround; drop if the encoder loads without it).
   #
   # `--served-model-name ${deployment.served_model_name}` (in command below) is
   # REQUIRED: without it vLLM registers the model as `/checkpoint` and every eval

From 9ccf032d1e02436768a373fa005da4aa15153c2a Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng <zhiyuc@nvidia.com>
Date: Tue, 2 Jun 2026 23:09:16 -0700
Subject: [PATCH 5/7] [skills] gres: defer to #1599's internal/slurm/<cluster>,
 keep one-line fallback

Reduce the gres guidance to a single fallback note for the slurm/default case
and defer the predefined per-cluster config path to PR #1599, which pre-fills
gres/hostname/partition. This fills the gap left by #1599's fallback branch,
which does not mention gres for the no-internal-package case.
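
For illustration only, a minimal Python sketch of the fallback check: it parses
the text printed by sinfo -o '%P %G' and derives the gres value for one
partition. The function name and the sample output are hypothetical. It assumes
GRES entries shaped like gpu:<count> or gpu:<type>:<count>, optionally followed
by a parenthesized suffix, and it returns the first matching line only.

```python
import re
from typing import Optional


def gres_for_partition(sinfo_output: str, partition: str) -> Optional[str]:
    """Return 'gpu:<N>' for `partition`, or None if no GPU GRES is listed."""
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        # sinfo marks the default partition with a trailing '*'.
        name, gres = fields[0].rstrip("*"), fields[1]
        if name != partition:
            continue
        # Accept gpu:4, gpu:gb300:4, and gpu:4(S:0-1).
        match = re.search(r"gpu:(?:[^:,()\s]+:)?(\d+)", gres)
        if match:
            return f"gpu:{match.group(1)}"
    return None


# Hypothetical sinfo output for a cluster with 4-GPU nodes.
sample = "PARTITION GRES\nbatch* gpu:4\ncpu (null)\n"
print(gres_for_partition(sample, "batch"))
```

The returned string is what would go into execution.gres on the slurm/default
fallback. The parallel-size flags still need to be matched to it by hand.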

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
---
 .claude/skills/evaluation/SKILL.md                         | 2 +-
 .../skills/evaluation/recipes/examples/example_eval.yaml   | 7 +++----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index b4132acf40e..5124c347204 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -200,7 +200,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
 - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
 - **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
 - Ask about other defaults they may want to change (partition, walltime, MLflow tags).
-- **`execution.gres`** — if your NEL install ships an `internal/slurm/<cluster>` execution config, prefer it (it pre-fills `gres`/hostname/partition/node-exclusivity). Otherwise NEL defaults to `gpu:8`; set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`), or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. `gpu:8` on 4-GPU GB300 nodes → `gres: gpu:4`; check with `sinfo -o '%P %G'`).
+- **`execution.gres`** — auto-set if you used a predefined `internal/slurm/<cluster>` config (above). On the `slurm/default` fallback it's `gpu:8`, so set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`) or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. 4-GPU GB300 → `gres: gpu:4`; check with `sinfo -o '%P %G'`).
 
 **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
 
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
index 185e80397c0..6ef6a2e8ca0 100644
--- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml
+++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -50,10 +50,9 @@ execution:
   account: ???
   output_dir: ???
   walltime: "04:00:00"
-  # gres defaults to gpu:8. If your NEL install ships an internal/slurm/<cluster>
-  # execution config, prefer it (it pre-fills gres/hostname/partition). Otherwise
-  # set gres to the node's GPU count or sbatch fails "Requested node configuration
-  # is not available"; match --tensor/--data-parallel-size to it (references/parallelism.md).
+  # gres: a predefined internal/slurm/<cluster> config (see SKILL Step 4) sets this.
+  # On the slurm/default fallback it's gpu:8 — set to the node's GPU count or sbatch
+  # fails "Requested node configuration is not available" (4-GPU GB300 -> gpu:4).
   mounts:
     mount_home: false
   auto_export:          # REQUIRED trigger for auto-export. Without this, the

From 96408be1019a9a381cf6a539f0cb2e059c3fedb4 Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng <zhiyuc@nvidia.com>
Date: Tue, 2 Jun 2026 23:16:05 -0700
Subject: [PATCH 6/7] [skills] de-dup served-model-name rationale
 (single-source to example)

Per @chadvoegele: keep the 'why' (vLLM serves as /checkpoint -> 404) only in
example_eval.yaml; the SKILL Step-3 Conventions line now just names the required
flag and points to the example.
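
As a hedged illustration of the failure mode, the sketch below checks an
OpenAI-compatible /v1/models response body for the expected served name. The
helper name and both sample payloads are hypothetical. Only the data[].id
layout of the response is assumed.

```python
import json


def served_name_registered(models_response: str, served_model_name: str) -> bool:
    """True if `served_model_name` appears as an id in a /v1/models JSON body."""
    payload = json.loads(models_response)
    return any(m.get("id") == served_model_name for m in payload.get("data", []))


# Without --served-model-name the model is listed under its path.
without_flag = json.dumps({"object": "list", "data": [{"id": "/checkpoint"}]})
# With the flag it is listed under the configured name (hypothetical here).
with_flag = json.dumps({"object": "list", "data": [{"id": "my-model"}]})

print(served_name_registered(without_flag, "my-model"))
print(served_name_registered(with_flag, "my-model"))
```

A check like this, run against the live endpoint before the eval starts, would
surface the missing flag before any request 404s.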

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
---
 .claude/skills/evaluation/SKILL.md                           | 2 +-
 .claude/skills/evaluation/recipes/examples/example_eval.yaml | 5 ++---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 5124c347204..2c80930c6b4 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -138,7 +138,7 @@ deployment:
     <... rest of cross-checked flags ...>
 ```
 
-Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--served-model-name ${deployment.served_model_name}` (**required** — without it vLLM registers the model under the path `/checkpoint`, and every eval request 404s with "The model `<served_model_name>` does not exist."); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
+Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--served-model-name ${deployment.served_model_name}` (**required**; see `example_eval.yaml` for why); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
 
 For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.
 
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
index 6ef6a2e8ca0..e188d37137a 100644
--- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml
+++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -74,9 +74,8 @@ deployment:
   # Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`
   # (ViT flash-attn workaround; drop if the encoder loads without it).
   #
-  # `--served-model-name ${deployment.served_model_name}` (in command below) is
-  # REQUIRED: without it vLLM registers the model as `/checkpoint` and every eval
-  # request 404s ("The model `<served_model_name>` does not exist.").
+  # `--served-model-name` (in command below) is REQUIRED — else vLLM serves the
+  # model as `/checkpoint` and eval requests 404 ("model does not exist").
   # For MoE models, add `--enable-expert-parallel` to the command.
   # For models with custom code, add `--trust-remote-code` to the command.
   # After filling in evaluation `parallelism` values (top-level + per-task),

From 81fed4276fe04c7c5cdb552f521481a5411d0909 Mon Sep 17 00:00:00 2001
From: Zhiyu Cheng <zhiyuc@nvidia.com>
Date: Tue, 2 Jun 2026 23:55:22 -0700
Subject: [PATCH 7/7] [skills] de-dup hf_model_handle rationale (single-source
 to example)

Keep the full 'why' (hf_model_handle not mounted -> HFValidationError, stage via
snapshot_download) only in example_eval.yaml; reduce the SKILL Step-3 note to a
concise pointer. Same single-sourcing pattern as served-model-name.
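
A rough sketch of the Model path rule that this note sits under, for
illustration only. The function name and the returned dict are hypothetical.
The prefixes and the one-slash test follow the SKILL wording, and nothing here
calls NEL.

```python
import os


def classify_model_path(model: str) -> dict:
    """Map a user-supplied model string to the two deployment fields."""
    expanded = os.path.expanduser(model)
    looks_local = model.startswith(("/", "./", "../", "~")) or os.path.exists(expanded)
    if looks_local:
        # Local checkpoint: the preferred form on SLURM.
        return {"checkpoint_path": model, "hf_model_handle": None}
    if model.count("/") == 1:
        # HF handle: stage it with snapshot_download first, then switch to checkpoint_path.
        return {"checkpoint_path": None, "hf_model_handle": model}
    raise ValueError(f"cannot classify model string: {model!r}")


print(classify_model_path("/data/checkpoints/model-nvfp4"))
```

For a string classified as an HF handle, the note's advice still applies: stage
the model on the cluster and set checkpoint_path to the staged directory.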

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
---
 .claude/skills/evaluation/SKILL.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 2c80930c6b4..f668ef223d5 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -77,7 +77,7 @@ nel skills build-config --execution <...> --deployment <...> --model_type <...>
 
 **Model path.** Checkpoint path (`/`, `./`, `../`, `~`, or exists on disk) → set `deployment.checkpoint_path`, leave `hf_model_handle: null`. Else HF handle (one `/`, not on disk) → set `deployment.hf_model_handle`, leave `checkpoint_path: null`.
 
-> **Prefer `checkpoint_path` on SLURM — `hf_model_handle` is not reliably mounted at `/checkpoint` in current NEL.** With only `hf_model_handle` set, the `vllm serve /checkpoint` command finds nothing mounted there and vLLM treats the literal string `/checkpoint` as an HF repo id, so the deploy dies with `HFValidationError: Repo id must use alphanumeric chars … : '/checkpoint'`. To evaluate an un-staged HF model (e.g. a BF16 baseline for a quant comparison), first download it onto the cluster — `python -c "from huggingface_hub import snapshot_download; snapshot_download('<org>/<model>', local_dir='<path>')"` — then set `checkpoint_path: <path>` (this is the path the NVFP4/quantized run already uses).
+> **Prefer `checkpoint_path` over `hf_model_handle` on SLURM** — `hf_model_handle` isn't reliably mounted at `/checkpoint`, so the deploy dies with `HFValidationError`. To eval an un-staged HF model, stage it first (`huggingface_hub.snapshot_download`) and point `checkpoint_path` at it. See `example_eval.yaml` for why.
 
 **Auto-detect ModelOpt quantization** (checkpoint paths). Check `config.json` for `quantization_config` (or legacy `hf_quant_config.json`):