diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md
index f14cc0b9822..099ac7a6a41 100644
--- a/.claude/skills/deployment/SKILL.md
+++ b/.claude/skills/deployment/SKILL.md
@@ -125,6 +125,19 @@ python -m vllm.entrypoints.openai.api_server \
 
 For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
 
+> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag**
+> (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The
+> default cu12 build has **no sm_103 FP4 kernel**, so vLLM loads the checkpoint
+> then dies at engine init with `CUDA error: no kernel image is available for
+> execution on the device` (affects the `flashinfer` and `cutlass` NVFP4
+> backends; `marlin` separately fails on non-64-divisible layer dims). If a
+> pinned release predates the model's arch, use `cu130-nightly-` instead
+> (Qwen3.5-9B's `qwen3_5` needed it). Cross-check via
+> `recipes.vllm.ai//?hardware=b300` (JS-rendered — fetch the raw
+> markdown at `github.com/vllm-project/recipes/blob/main//.md`). For
+> multimodal models on sm_103, also pass `--mm-encoder-attn-backend TRITON_ATTN`
+> (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x").
+
 #### SGLang
 
 ```bash
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index 4da78164a7f..f668ef223d5 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -77,6 +77,8 @@ nel skills build-config --execution <...> --deployment <...> --model_type <...>
 
 **Model path.** Checkpoint path (`/`, `./`, `../`, `~`, or exists on disk) → set `deployment.checkpoint_path`, leave `hf_model_handle: null`. Else HF handle (one `/`, not on disk) → set `deployment.hf_model_handle`, leave `checkpoint_path: null`.
 
+> **Prefer `checkpoint_path` over `hf_model_handle` on SLURM** — `hf_model_handle` isn't reliably mounted at `/checkpoint`, so the deploy dies with `HFValidationError`. To eval an un-staged HF model, stage it first (`huggingface_hub.snapshot_download`) and point `checkpoint_path` at it. See `example_eval.yaml` for why.
+
 **Auto-detect ModelOpt quantization** (checkpoint paths). Check `config.json` for `quantization_config` (or legacy `hf_quant_config.json`):
 
 - **vLLM:** no `--quantization` flag by default — vLLM auto-detects from `quantization_config` / `hf_quant_config.json`. Add only when the card, vLLM version, or dry-run error requires it.
@@ -136,12 +138,14 @@ deployment:
     <... rest of cross-checked flags ...>
 ```
 
-Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
+Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--served-model-name ${deployment.served_model_name}` (**required**; see `example_eval.yaml` for why); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
 
 For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.
 
 **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
 
+> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag** (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The default cu12 build has no sm_103 FP4 kernel, so engine init dies with `CUDA error: no kernel image is available`. If a pinned release predates the model's arch, use `cu130-nightly-` (Qwen3.5-9B's `qwen3_5` needed it, vLLM 0.19.2rc1.dev134). Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`. Full note in `recipes/examples/example_eval.yaml`.
+
 #### vLLM-backend defaults — always include unless the recipe *contradicts*
 
 Silence is not contradiction. Drop/override only when the recipe sets a different value for the same setting (e.g. recipe pins `--max-num-batched-tokens 16384` → use 16384).
@@ -196,6 +200,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
 - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
 - **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
 - Ask about other defaults they may want to change (partition, walltime, MLflow tags).
+- **`execution.gres`** — auto-set if you used a predefined `internal/slurm/` config (above). On the `slurm/default` fallback it's `gpu:8`, so set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`) or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. 4-GPU GB300 → `gres: gpu:4`; check with `sinfo -o '%P %G'`).
 
 **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
 
@@ -229,15 +234,21 @@ Implications for the agent:
 
 **Tasks that call an external judge / user-simulator / scoring endpoint.** Treat this as a general pattern, not a fixed list — HLE, AA-LCR, and Tau2 need one today, but other benchmarks may too (check each task's recipe). Their `model_id` / `url` are **config, not secrets**: substitute the **literal** values the user keeps in `.env` (keys per the task's recipe + `recipes/env.example`) into the task's `` placeholders. Do **not** emit `${oc.env:...}` for these (it silently fails unless the var was exported with `set -a`). Only `api_key` stays an env-var *name* (e.g. `INFERENCE_API_KEY`), exported and read by the harness.
 
-**Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level:
+**Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks (`ns_*`) with self-deployment (vLLM/SGLang/NIM), you need **both** of these:
 
 ```yaml
-target:
-  api_endpoint:
-    api_key_name: DUMMY_API_KEY
+evaluation:
+  env_vars:
+    DUMMY_API_KEY: lit:dummy  # MUST be set here — see below
+  nemo_evaluator_config:
+    target:
+      api_endpoint:
+        api_key_name: DUMMY_API_KEY
 ```
 
-External-deployment configs already define `api_key_name`. Export of `DUMMY_API_KEY` is handled in Step 8.
+`api_key_name` only names the env var; the nemo-skills client **hard-fails if that var has no value inside the eval container** (`ValueError: api_key_env_var=DUMMY_API_KEY but the value is not set`). On SLURM, a shell `export DUMMY_API_KEY=dummy` (Step 8) does **NOT** propagate into the container — NEL only injects vars declared in `env_vars`. So declare `DUMMY_API_KEY: lit:dummy` under `evaluation.env_vars` (note the `lit:` prefix — see below). The shell export only helps for local/Docker runs. External-deployment configs already define `api_key_name`.
+
+**NEL env-var value prefixes (required):** every value in an `env_vars` map needs an explicit prefix — `host:VAR` (read from the submitting shell's env at submit time), `lit:value` (literal string), or `runtime:VAR` (read in the job at run time). A bare value (e.g. `DUMMY_API_KEY: dummy`) hard-errors: *"Env var value '…' must have an explicit prefix."* Use `lit:` for constants like `DUMMY_API_KEY` and `VLLM_*` backend selectors, `host:` for secrets like `HF_TOKEN` / `INFERENCE_API_KEY`.
 
 ---
 
@@ -258,10 +269,13 @@ Default images:
 | Framework | Image | Registry |
 | --- | --- | --- |
 | vLLM | `vllm/vllm-openai:v0.19.1` (bump per recipe; never `:latest`) | DockerHub |
+| vLLM (NVFP4 on Blackwell) | `vllm/vllm-openai:v0.19.1-cu130` (bump to `cu130-nightly-` for new archs) | DockerHub |
 | SGLang | `lmsysorg/sglang:latest` | DockerHub |
 | TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
 | Eval tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |
 
+> NVFP4 checkpoints on Blackwell (sm_100/sm_103) need the `cu130-nightly` image — cu129/v0.19.1 lack sm_103 FP4 kernels (see the "NVFP4 on Blackwell" note in Step 3).
+
 Public images → submit without preflight. Private/restricted → check credentials:
 
 ```bash
@@ -284,8 +298,10 @@ set -a && source .env && set +a
 # If pre_cmd/post_cmd in config (review pre_cmd first — runs arbitrary commands):
 export NEMO_EVALUATOR_TRUST_PRE_CMD=1
 
-# If nemo_skills.* + self-deployment:
+# If nemo_skills.* + self-deployment, for LOCAL/Docker runs only:
 export DUMMY_API_KEY=dummy
+# On SLURM this shell export does NOT reach the container — instead declare
+# `DUMMY_API_KEY: lit:dummy` under evaluation.env_vars (see Step 5).
 ```
 
 **Step 8.1 — Dry-run** (config validation):
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
index c3a48d8c58b..e188d37137a 100644
--- a/.claude/skills/evaluation/recipes/examples/example_eval.yaml
+++ b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -23,9 +23,16 @@
 #
 # Deployment uses a single `command:` field instead of separate
 # `tensor_parallel_size` / `data_parallel_size` / `extra_args` fields — the full
-# `vllm serve` invocation lives in the command string. NEL mounts the resolved
-# model (from checkpoint_path or hf_model_handle) at /checkpoint inside the
-# container, and Hydra interpolates ${deployment.port} at run time.
+# `vllm serve` invocation lives in the command string. NEL mounts a
+# `checkpoint_path` at /checkpoint inside the container, and Hydra interpolates
+# ${deployment.port} at run time.
+#
+# PREFER `checkpoint_path` (a path already on the cluster) over `hf_model_handle`.
+# `hf_model_handle` is NOT reliably mounted at /checkpoint in current NEL — the
+# `vllm serve /checkpoint` command then makes vLLM treat "/checkpoint" as a HF
+# repo id and the deploy dies with `HFValidationError: Repo id must use ...`.
+# To eval an un-staged HF model, first download it to the cluster (e.g.
+# `huggingface_hub.snapshot_download`) and point `checkpoint_path` at it.
 #
 # Run a single task:
 #   nel run --config ... -t gpqa_diamond_aa_v3
@@ -43,6 +50,9 @@ execution:
   account: ???
   output_dir: ???
   walltime: "04:00:00"
+  # gres: a predefined internal/slurm/ config (see SKILL Step 4) sets this.
+  # On the slurm/default fallback it's gpu:8 — set to the node's GPU count or sbatch
+  # fails "Requested node configuration is not available" (4-GPU GB300 -> gpu:4).
   mounts:
     mount_home: false
   auto_export:  # REQUIRED trigger for auto-export. Without this, the
@@ -55,6 +65,17 @@ deployment:
   hf_model_handle:
   served_model_name: ???
   image: vllm/vllm-openai:v0.19.1
+  # NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the (multi-arch)
+  # image tag — the default cu12 build has no sm_103 FP4 kernel, so the deploy
+  # dies at engine init with CUDA "no kernel image is available". e.g.:
+  #   image: vllm/vllm-openai:v0.19.1-cu130
+  # If a pinned release predates your model's arch, use the nightly instead
+  # (Qwen3.5-9B's qwen3_5 needed vllm/vllm-openai:cu130-nightly-x86_64, 0.19.2rc1.dev134).
+  # Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`
+  # (ViT flash-attn workaround; drop if the encoder loads without it).
+  #
+  # `--served-model-name` (in command below) is REQUIRED — else vLLM serves the
+  # model as `/checkpoint` and eval requests 404 ("model does not exist").
   # For MoE models, add `--enable-expert-parallel` to the command.
   # For models with custom code, add `--trust-remote-code` to the command.
   # After filling in evaluation `parallelism` values (top-level + per-task),
@@ -62,6 +83,7 @@ deployment:
   #   N = ceil(max_parallelism / data_parallel_size).
   command: >-
     vllm serve /checkpoint
+    --served-model-name ${deployment.served_model_name}
     --host 0.0.0.0
     --port ${deployment.port}
     --tensor-parallel-size 1
@@ -72,6 +94,12 @@ deployment:
 evaluation:
   env_vars:
     HF_TOKEN: host:HF_TOKEN
+    # nemo-skills tasks (ns_*) hard-require the served-endpoint api key env var
+    # (api_key_name below) to be set WITH A VALUE inside the eval container.
+    # A shell `export DUMMY_API_KEY=dummy` does NOT reach the SLURM container —
+    # it must be declared here. Omit it and ns_* tasks die with
+    # "api_key_env_var=DUMMY_API_KEY but the value is not set".
+    DUMMY_API_KEY: lit:dummy
   nemo_evaluator_config:
     config:
       params:
diff --git a/.claude/skills/ptq/references/launcher-guide.md b/.claude/skills/ptq/references/launcher-guide.md
index 542c4ade5b1..1b5bb263818 100644
--- a/.claude/skills/ptq/references/launcher-guide.md
+++ b/.claude/skills/ptq/references/launcher-guide.md
@@ -31,6 +31,20 @@ pipeline:
   gpus_per_node:
 ```
 
+> **Match `gpus_per_node` to the cluster's node GPU count / QOS minimum.** If it
+> is below what the QOS requires (many clusters mandate a full node), `sbatch`
+> rejects the job with `QOSMinGRES` or `Requested node configuration is not
+> available`. e.g. GB300 nodes have 4 GPUs and require the full node → set
+> `gpus_per_node: 4`; B300/B200 nodes have 8. Check with `sinfo -o '%P %G'`.
+
+> **`EXTRA_PIP_DEPS` must avoid shell metacharacters.** It is written into an
+> unquoted `export` in the generated sbatch script, so a value like
+> `transformers>=4.57,<4.58` is mangled by shell redirection (`>`/`<`) and
+> silently dropped — the deps never install. Use exact `==` pins (no `>`/`<`).
+> The right version is **model-specific** — a brand-new architecture may need a
+> newer transformers than the repo's library pin (e.g. Qwen3.5's `qwen3_5` needs
+> `EXTRA_PIP_DEPS: "transformers==5.5.0"`); pick what the target model requires.
+
 Extra `hf_ptq.py` flags can be passed via `args`:
 
 ```yaml