NVIDIA · Edwardf0t1 · Jun 3, 2026 · Jun 1, 2026 · Jun 2, 2026 · Jun 2, 2026
@@ -125,6 +125,19 @@ python -m vllm.entrypoints.openai.api_server \
 
 For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
 
+> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag**
+> (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The
+> default cu12 build has **no sm_103 FP4 kernel**, so vLLM loads the checkpoint
+> then dies at engine init with `CUDA error: no kernel image is available for
+> execution on the device` (affects the `flashinfer` and `cutlass` NVFP4
+> backends; `marlin` separately fails on non-64-divisible layer dims). If a pinned
+> release predates the model's architecture, use `cu130-nightly-<arch>` (`<arch>` =
+> host CPU, e.g. `x86_64`; Qwen3.5-9B's `qwen3_5` needed it). Cross-check via
+> `recipes.vllm.ai/<org>/<model>?hardware=b300` (JS-rendered — fetch the raw
+> markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`). For
+> multimodal models on sm_103, also pass `--mm-encoder-attn-backend TRITON_ATTN`
+> (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x").
+
 #### SGLang
 
 ```bash

@@ -77,6 +77,8 @@ nel skills build-config --execution <...> --deployment <...> --model_type <...>
 
 **Model path.** Checkpoint path (`/`, `./`, `../`, `~`, or exists on disk) → set `deployment.checkpoint_path`, leave `hf_model_handle: null`. Else HF handle (one `/`, not on disk) → set `deployment.hf_model_handle`, leave `checkpoint_path: null`.
 
+> **Prefer `checkpoint_path` over `hf_model_handle` on SLURM.** A model given via `hf_model_handle` isn't reliably mounted at `/checkpoint`, so `vllm serve /checkpoint` makes vLLM treat `/checkpoint` as an HF repo id and the deploy dies with `HFValidationError`. To eval an un-staged HF model, stage it first (`huggingface_hub.snapshot_download`) and point `checkpoint_path` at it. See `example_eval.yaml` for details.
+
 **Auto-detect ModelOpt quantization** (checkpoint paths). Check `config.json` for `quantization_config` (or legacy `hf_quant_config.json`):
 
 - **vLLM:** no `--quantization` flag by default — vLLM auto-detects from `quantization_config` / `hf_quant_config.json`. Add only when the card, vLLM version, or dry-run error requires it.
@@ -136,12 +138,14 @@ deployment:
     <... rest of cross-checked flags ...>
 ```
 
-Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
+Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--served-model-name ${deployment.served_model_name}` (**required**; see `example_eval.yaml` for why); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
 
 For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.
 
 **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
 
+> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag** (e.g. `vllm/vllm-openai:v0.19.1-cu130`; release tags are multi-arch). The default cu12 build has no sm_103 FP4 kernel, so engine init dies with `CUDA error: no kernel image is available`. If a pinned release predates the model's architecture, use `cu130-nightly-<arch>`, where `<arch>` is the host CPU architecture (Qwen3.5-9B's `qwen3_5` needed `cu130-nightly-x86_64`, vLLM 0.19.2rc1.dev134). Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`. Full note in `recipes/examples/example_eval.yaml`.
+
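The tag rule in the note above can be sketched as a small helper. This is only an illustration under stated assumptions: `nvfp4_image_tag` and its parameters are hypothetical names (not part of NEL or vLLM), and the caller supplies the GPU compute capability as a `(major, minor)` tuple instead of the helper querying the device.

```python
CU130_SUFFIX = "-cu130"


def nvfp4_image_tag(image: str, compute_capability: tuple, is_nvfp4: bool) -> str:
    """Append `-cu130` for NVFP4 checkpoints on sm_103 (B300/GB300).

    Hypothetical helper: it only encodes the rule from the note above and
    leaves every other image/GPU combination unchanged.
    """
    needs_cu130 = is_nvfp4 and tuple(compute_capability) == (10, 3)
    if needs_cu130 and not image.endswith(CU130_SUFFIX):
        return image + CU130_SUFFIX
    return image


print(nvfp4_image_tag("vllm/vllm-openai:v0.19.1", (10, 3), True))
# prints: vllm/vllm-openai:v0.19.1-cu130
```

The nightly fallback is deliberately left out, since whether a release predates a model architecture has to be judged per model.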
 #### vLLM-backend defaults — always include unless the recipe *contradicts*
 
 Silence is not contradiction. Drop/override only when the recipe sets a different value for the same setting (e.g. recipe pins `--max-num-batched-tokens 16384` → use 16384).
@@ -196,6 +200,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
 - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
 - **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
 - Ask about other defaults they may want to change (partition, walltime, MLflow tags).
+- **`execution.gres`** — auto-set if you used a predefined `internal/slurm/<cluster>` config (above). On the `slurm/default` fallback it's `gpu:8`, so set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`) or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. 4-GPU GB300 → `gres: gpu:4`; check with `sinfo -o '%P %G'`).
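
To compare `execution.gres` with what the nodes actually offer, the GRES column printed by `sinfo -o '%P %G'` can be parsed. A minimal sketch, assuming GRES strings of the common forms `gpu:4` or `gpu:<type>:4(...)`; the helper name `gpus_in_gres` and the sample strings are hypothetical, and a given cluster may format GRES differently.

```python
import re

# Matches "gpu:<count>" or "gpu:<type>:<count>"; a trailing "(S:...)" is ignored.
_GPU_GRES = re.compile(r"\bgpu(?::[^:,()]+)?:(\d+)(?![\w.])")


def gpus_in_gres(gres: str) -> int:
    """Sum the GPU counts found in one GRES string (0 if there are none)."""
    return sum(int(count) for count in _GPU_GRES.findall(gres))


# Hypothetical GRES values, as they might appear in the %G column.
for sample in ("gpu:4", "gpu:gb300:4(S:0-1)", "(null)"):
    print(sample, "->", f"gres: gpu:{gpus_in_gres(sample)}")
```

A result of 0 means the partition advertises no GPUs in that format, so the value should be checked by hand rather than written into the config.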
 
 **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
 
@@ -229,15 +234,21 @@ Implications for the agent:
 
 **Tasks that call an external judge / user-simulator / scoring endpoint.** Treat this as a general pattern, not a fixed list — HLE, AA-LCR, and Tau2 need one today, but other benchmarks may too (check each task's recipe). Their `model_id` / `url` are **config, not secrets**: substitute the **literal** values the user keeps in `.env` (keys per the task's recipe + `recipes/env.example`) into the task's `<VAR>` placeholders. Do **not** emit `${oc.env:...}` for these (it silently fails unless the var was exported with `set -a`). Only `api_key` stays an env-var *name* (e.g. `INFERENCE_API_KEY`), exported and read by the harness.
 
-**Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level:
+**Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks (`ns_*`) with self-deployment (vLLM/SGLang/NIM), you need **both** of these:
 
 ```yaml
-target:
-  api_endpoint:
-    api_key_name: DUMMY_API_KEY
+evaluation:
+  env_vars:
+    DUMMY_API_KEY: lit:dummy   # MUST be set here — see below
+  nemo_evaluator_config:
+    target:
+      api_endpoint:
+        api_key_name: DUMMY_API_KEY
 ```
 
-External-deployment configs already define `api_key_name`. Export of `DUMMY_API_KEY` is handled in Step 8.
+`api_key_name` only names the env var; the nemo-skills client **hard-fails if that var has no value inside the eval container** (`ValueError: api_key_env_var=DUMMY_API_KEY but the value is not set`). On SLURM, a shell `export DUMMY_API_KEY=dummy` (Step 8) does **NOT** propagate into the container — NEL only injects vars declared in `env_vars`. So declare `DUMMY_API_KEY: lit:dummy` under `evaluation.env_vars` (note the `lit:` prefix — see below). The shell export only helps for local/Docker runs. External-deployment configs already define `api_key_name`.
+
+**NEL env-var value prefixes (required):** every value in an `env_vars` map needs an explicit prefix — `host:VAR` (read from the submitting shell's env at submit time), `lit:value` (literal string), or `runtime:VAR` (read in the job at run time). A bare value (e.g. `DUMMY_API_KEY: dummy`) hard-errors: *"Env var value '…' must have an explicit prefix."* Use `lit:` for constants like `DUMMY_API_KEY` and `VLLM_*` backend selectors, `host:` for secrets like `HF_TOKEN` / `INFERENCE_API_KEY`.
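
The prefix rule above is easy to check before submitting. The sketch below is an assumption-labeled illustration, not NEL's own validator: `unprefixed_env_vars` is a hypothetical helper that only looks for the three prefixes named in this section.

```python
# Prefixes named in the rule above.
VALID_PREFIXES = ("host:", "lit:", "runtime:")


def unprefixed_env_vars(env_vars: dict) -> list:
    """Return the names whose values lack an explicit host:/lit:/runtime: prefix."""
    return [
        name
        for name, value in env_vars.items()
        if not str(value).startswith(VALID_PREFIXES)
    ]


env_vars = {
    "HF_TOKEN": "host:HF_TOKEN",   # secret read from the submitting shell
    "DUMMY_API_KEY": "dummy",      # bare value: would be rejected
}
print(unprefixed_env_vars(env_vars))
# prints: ['DUMMY_API_KEY']
```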
 
 ---
 
@@ -258,10 +269,13 @@ Default images:
 | Framework | Image | Registry |
 | --- | --- | --- |
 | vLLM | `vllm/vllm-openai:v0.19.1` (bump per recipe; never `:latest`) | DockerHub |
+| vLLM (NVFP4 on Blackwell sm_103) | `vllm/vllm-openai:v0.19.1-cu130` (use `cu130-nightly-<arch>` when the release predates the model architecture) | DockerHub |
 | SGLang | `lmsysorg/sglang:latest` | DockerHub |
 | TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
 | Eval tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |
 
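+> NVFP4 checkpoints on Blackwell B300/GB300 (sm_103) need a `-cu130` image (`v0.19.1-cu130`, or `cu130-nightly-<arch>` when the release predates the model architecture); the default cu12 `v0.19.1` build lacks sm_103 FP4 kernels (see the "NVFP4 on Blackwell" note in Step 3).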
+
 Public images → submit without preflight. Private/restricted → check credentials:
 
 ```bash
@@ -284,8 +298,10 @@ set -a && source .env && set +a
 
 # If pre_cmd/post_cmd in config (review pre_cmd first — runs arbitrary commands):
 export NEMO_EVALUATOR_TRUST_PRE_CMD=1
-# If nemo_skills.* + self-deployment:
+# If nemo_skills.* + self-deployment, for LOCAL/Docker runs only:
 export DUMMY_API_KEY=dummy
+# On SLURM this shell export does NOT reach the container — instead declare
+# `DUMMY_API_KEY: lit:dummy` under evaluation.env_vars (see Step 5).
 ```
 
 **Step 8.1 — Dry-run** (config validation):

@@ -23,9 +23,16 @@
 #
 # Deployment uses a single `command:` field instead of separate
 # `tensor_parallel_size` / `data_parallel_size` / `extra_args` fields — the full
-# `vllm serve` invocation lives in the command string. NEL mounts the resolved
-# model (from checkpoint_path or hf_model_handle) at /checkpoint inside the
-# container, and Hydra interpolates ${deployment.port} at run time.
+# `vllm serve` invocation lives in the command string. NEL mounts a
+# `checkpoint_path` at /checkpoint inside the container, and Hydra interpolates
+# ${deployment.port} at run time.
+#
+# PREFER `checkpoint_path` (a path already on the cluster) over `hf_model_handle`.
+# `hf_model_handle` is NOT reliably mounted at /checkpoint in current NEL — the
+# `vllm serve /checkpoint` command then makes vLLM treat "/checkpoint" as a HF
+# repo id and the deploy dies with `HFValidationError: Repo id must use ...`.
+# To eval an un-staged HF model, first download it to the cluster (e.g.
+# `huggingface_hub.snapshot_download`) and point `checkpoint_path` at it.
 #
 # Run a single task:
 #   nel run --config ... -t gpqa_diamond_aa_v3
@@ -43,6 +50,9 @@ execution:
   account: ???
   output_dir: ???
   walltime: "04:00:00"
+  # gres: a predefined internal/slurm/<cluster> config (see SKILL Step 4) sets this.
+  # On the slurm/default fallback it's gpu:8 — set to the node's GPU count or sbatch
+  # fails "Requested node configuration is not available" (4-GPU GB300 -> gpu:4).
   mounts:
     mount_home: false
   auto_export:          # REQUIRED trigger for auto-export. Without this, the
@@ -55,13 +65,25 @@ deployment:
   hf_model_handle:
   served_model_name: ???
   image: vllm/vllm-openai:v0.19.1
+  # NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the (multi-arch)
+  # image tag — the default cu12 build has no sm_103 FP4 kernel, so the deploy
+  # dies at engine init with CUDA "no kernel image is available". e.g.:
+  #   image: vllm/vllm-openai:v0.19.1-cu130
+  # If a pinned release predates your model's arch, use the nightly instead
+  # (Qwen3.5-9B's qwen3_5 needed vllm/vllm-openai:cu130-nightly-x86_64, 0.19.2rc1.dev134).
+  # Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`
+  # (ViT flash-attn workaround; drop if the encoder loads without it).
+  #
+  # `--served-model-name` (in command below) is REQUIRED — else vLLM serves the
+  # model as `/checkpoint` and eval requests 404 ("model does not exist").
   # For MoE models, add `--enable-expert-parallel` to the command.
   # For models with custom code, add `--trust-remote-code` to the command.
   # After filling in evaluation `parallelism` values (top-level + per-task),
   # append `--max-num-seqs N` to the command where
   # N = ceil(max_parallelism / data_parallel_size).
   command: >-
     vllm serve /checkpoint
+    --served-model-name ${deployment.served_model_name}
     --host 0.0.0.0
     --port ${deployment.port}
     --tensor-parallel-size 1
@@ -72,6 +94,12 @@ deployment:
 evaluation:
   env_vars:
     HF_TOKEN: host:HF_TOKEN
+    # nemo-skills tasks (ns_*) hard-require the served-endpoint api key env var
+    # (api_key_name below) to be set WITH A VALUE inside the eval container.
+    # A shell `export DUMMY_API_KEY=dummy` does NOT reach the SLURM container —
+    # it must be declared here. Omit it and ns_* tasks die with
+    # "api_key_env_var=DUMMY_API_KEY but the value is not set".
+    DUMMY_API_KEY: lit:dummy
   nemo_evaluator_config:
     config:
       params:

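The `--max-num-seqs` rule from the deployment comments in the example config, `N = ceil(max_parallelism / data_parallel_size)`, works out as follows. This is a sketch under one assumption: `max_parallelism` is taken to be the largest of the top-level and per-task `parallelism` values. The function name and sample numbers are hypothetical.

```python
import math


def max_num_seqs(parallelism_values, data_parallel_size: int) -> int:
    """N = ceil(max_parallelism / data_parallel_size)."""
    if data_parallel_size < 1:
        raise ValueError("data_parallel_size must be >= 1")
    return math.ceil(max(parallelism_values) / data_parallel_size)


# Hypothetical run: top-level parallelism 64, one task overrides it to 100,
# served with --data-parallel-size 8.
print(max_num_seqs([64, 100], 8))
# prints: 13
```

Rounding up matters here: 100 / 8 = 12.5, and truncating to 12 would leave some in-flight requests queued on each replica.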
@@ -31,6 +31,20 @@ pipeline:
       gpus_per_node: <num_gpus>
 ```
 
+> **Match `gpus_per_node` to the cluster's node GPU count / QOS minimum.** If it
+> is below what the QOS requires (many clusters mandate a full node), `sbatch`
+> rejects the job with `QOSMinGRES` or `Requested node configuration is not
+> available`. e.g. GB300 nodes have 4 GPUs and require the full node → set
+> `gpus_per_node: 4`; B300/B200 nodes have 8. Check with `sinfo -o '%P %G'`.
+
+> **`EXTRA_PIP_DEPS` must avoid shell metacharacters.** It is written into an
+> unquoted `export` in the generated sbatch script, so a value like
+> `transformers>=4.57,<4.58` is mangled by shell redirection (`>`/`<`) and
+> silently dropped — the deps never install. Use exact `==` pins (no `>`/`<`).
+> The right version is **model-specific** — a brand-new architecture may need a
+> newer transformers than the repo's library pin (e.g. Qwen3.5's `qwen3_5` needs
+> `EXTRA_PIP_DEPS: "transformers==5.5.0"`); pick what the target model requires.
+
 Extra `hf_ptq.py` flags can be passed via `args`:
 
 ```yaml