Skip to content
13 changes: 13 additions & 0 deletions .claude/skills/deployment/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,19 @@ python -m vllm.entrypoints.openai.api_server \

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.

> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag**
> (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The
> default cu12 build has **no sm_103 FP4 kernel**, so vLLM loads the checkpoint
> then dies at engine init with `CUDA error: no kernel image is available for

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This CUDA error: no kernel image is available for execution on the device, description and fix seem like a good candidate for the debugging playbook (/ shared memory) that I attempted in 1493. Perhaps we can restart the conversation on how to best develop shared memory for the agent?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 — this is a good playbook candidate. I didn't fold it into shared memory here to keep the PR scoped, but happy to restart the #1493 discussion and migrate the "no kernel image → cu130" finding into the debugging playbook.

> execution on the device` (affects the `flashinfer` and `cutlass` NVFP4
> backends; `marlin` separately fails on non-64-divisible layer dims). If a
> pinned release predates the model's arch, use `cu130-nightly-<arch>` instead
> (Qwen3.5-9B's `qwen3_5` needed it). Cross-check via
> `recipes.vllm.ai/<org>/<model>?hardware=b300` (JS-rendered — fetch the raw
> markdown at `github.com/vllm-project/recipes/blob/main/<org>/<model>.md`). For
> multimodal models on sm_103, also pass `--mm-encoder-attn-backend TRITON_ATTN`
> (the default CuTe ViT flash-attn asserts "Only SM 10.x and 11.x").

#### SGLang

```bash
Expand Down
30 changes: 23 additions & 7 deletions .claude/skills/evaluation/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,8 @@ nel skills build-config --execution <...> --deployment <...> --model_type <...>

**Model path.** Checkpoint path (`/`, `./`, `../`, `~`, or exists on disk) → set `deployment.checkpoint_path`, leave `hf_model_handle: null`. Else HF handle (one `/`, not on disk) → set `deployment.hf_model_handle`, leave `checkpoint_path: null`.

> **Prefer `checkpoint_path` over `hf_model_handle` on SLURM** — `hf_model_handle` isn't reliably mounted at `/checkpoint`, so the deploy dies with `HFValidationError`. To eval an un-staged HF model, stage it first (`huggingface_hub.snapshot_download`) and point `checkpoint_path` at it. See `example_eval.yaml` for why.

**Auto-detect ModelOpt quantization** (checkpoint paths). Check `config.json` for `quantization_config` (or legacy `hf_quant_config.json`):

- **vLLM:** no `--quantization` flag by default — vLLM auto-detects from `quantization_config` / `hf_quant_config.json`. Add only when the card, vLLM version, or dry-run error requires it.
Expand Down Expand Up @@ -136,12 +138,14 @@ deployment:
<... rest of cross-checked flags ...>
```

Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--served-model-name ${deployment.served_model_name}` (**required**; see `example_eval.yaml` for why); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.

For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.

**Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.

> **NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the image tag** (e.g. `vllm/vllm-openai:v0.19.1-cu130` — release tags are multi-arch). The default cu12 build has no sm_103 FP4 kernel, so engine init dies with `CUDA error: no kernel image is available`. If a pinned release predates the model's arch, use `cu130-nightly-<arch>` (Qwen3.5-9B's `qwen3_5` needed it, vLLM 0.19.2rc1.dev134). Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`. Full note in `recipes/examples/example_eval.yaml`.

#### vLLM-backend defaults — always include unless the recipe *contradicts*

Silence is not contradiction. Drop/override only when the recipe sets a different value for the same setting (e.g. recipe pins `--max-num-batched-tokens 16384` → use 16384).
Expand Down Expand Up @@ -196,6 +200,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
- Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
- **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
- Ask about other defaults they may want to change (partition, walltime, MLflow tags).
- **`execution.gres`** — auto-set if you used a predefined `internal/slurm/<cluster>` config (above). On the `slurm/default` fallback it's `gpu:8`, so set it to the node's GPU count (and match `--data-parallel-size`/`--tensor-parallel-size`) or `sbatch` rejects the job with *"Requested node configuration is not available"* (e.g. 4-GPU GB300 → `gres: gpu:4`; check with `sinfo -o '%P %G'`).

**Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.

Expand Down Expand Up @@ -229,15 +234,21 @@ Implications for the agent:

**Tasks that call an external judge / user-simulator / scoring endpoint.** Treat this as a general pattern, not a fixed list — HLE, AA-LCR, and Tau2 need one today, but other benchmarks may too (check each task's recipe). Their `model_id` / `url` are **config, not secrets**: substitute the **literal** values the user keeps in `.env` (keys per the task's recipe + `recipes/env.example`) into the task's `<VAR>` placeholders. Do **not** emit `${oc.env:...}` for these (it silently fails unless the var was exported with `set -a`). Only `api_key` stays an env-var *name* (e.g. `INFERENCE_API_KEY`), exported and read by the harness.

**Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level:
**Known issue — nemo-skills self-deployment:** If using `nemo_skills.*` tasks (`ns_*`) with self-deployment (vLLM/SGLang/NIM), you need **both** of these:

```yaml
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
evaluation:
env_vars:
DUMMY_API_KEY: lit:dummy # MUST be set here — see below
nemo_evaluator_config:
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
```

External-deployment configs already define `api_key_name`. Export of `DUMMY_API_KEY` is handled in Step 8.
`api_key_name` only names the env var; the nemo-skills client **hard-fails if that var has no value inside the eval container** (`ValueError: api_key_env_var=DUMMY_API_KEY but the value is not set`). On SLURM, a shell `export DUMMY_API_KEY=dummy` (Step 8) does **NOT** propagate into the container — NEL only injects vars declared in `env_vars`. So declare `DUMMY_API_KEY: lit:dummy` under `evaluation.env_vars` (note the `lit:` prefix — see below). The shell export only helps for local/Docker runs. External-deployment configs already define `api_key_name`.

**NEL env-var value prefixes (required):** every value in an `env_vars` map needs an explicit prefix — `host:VAR` (read from the submitting shell's env at submit time), `lit:value` (literal string), or `runtime:VAR` (read in the job at run time). A bare value (e.g. `DUMMY_API_KEY: dummy`) hard-errors: *"Env var value '…' must have an explicit prefix."* Use `lit:` for constants like `DUMMY_API_KEY` and `VLLM_*` backend selectors, `host:` for secrets like `HF_TOKEN` / `INFERENCE_API_KEY`.

---

Expand All @@ -258,10 +269,13 @@ Default images:
| Framework | Image | Registry |
| --- | --- | --- |
| vLLM | `vllm/vllm-openai:v0.19.1` (bump per recipe; never `:latest`) | DockerHub |
| vLLM (NVFP4 on Blackwell) | `vllm/vllm-openai:v0.19.1-cu130` (bump to `cu130-nightly-<arch>` for new archs) | DockerHub |
| SGLang | `lmsysorg/sglang:latest` | DockerHub |
| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
| Eval tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |

> NVFP4 checkpoints on Blackwell (sm_100/sm_103) need the `cu130-nightly` image — cu129/v0.19.1 lack sm_103 FP4 kernels (see the "NVFP4 on Blackwell" note in Step 3).

Public images → submit without preflight. Private/restricted → check credentials:

```bash
Expand All @@ -284,8 +298,10 @@ set -a && source .env && set +a

# If pre_cmd/post_cmd in config (review pre_cmd first — runs arbitrary commands):
export NEMO_EVALUATOR_TRUST_PRE_CMD=1
# If nemo_skills.* + self-deployment:
# If nemo_skills.* + self-deployment, for LOCAL/Docker runs only:
export DUMMY_API_KEY=dummy
# On SLURM this shell export does NOT reach the container — instead declare
# `DUMMY_API_KEY: lit:dummy` under evaluation.env_vars (see Step 5).
```

**Step 8.1 — Dry-run** (config validation):
Expand Down
34 changes: 31 additions & 3 deletions .claude/skills/evaluation/recipes/examples/example_eval.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -23,9 +23,16 @@
#
# Deployment uses a single `command:` field instead of separate
# `tensor_parallel_size` / `data_parallel_size` / `extra_args` fields — the full
# `vllm serve` invocation lives in the command string. NEL mounts the resolved
# model (from checkpoint_path or hf_model_handle) at /checkpoint inside the
# container, and Hydra interpolates ${deployment.port} at run time.
# `vllm serve` invocation lives in the command string. NEL mounts a
# `checkpoint_path` at /checkpoint inside the container, and Hydra interpolates
# ${deployment.port} at run time.
#
# PREFER `checkpoint_path` (a path already on the cluster) over `hf_model_handle`.
# `hf_model_handle` is NOT reliably mounted at /checkpoint in current NEL — the
# `vllm serve /checkpoint` command then makes vLLM treat "/checkpoint" as a HF
# repo id and the deploy dies with `HFValidationError: Repo id must use ...`.
# To eval an un-staged HF model, first download it to the cluster (e.g.
# `huggingface_hub.snapshot_download`) and point `checkpoint_path` at it.
#
# Run a single task:
# nel run --config ... -t gpqa_diamond_aa_v3
Expand All @@ -43,6 +50,9 @@ execution:
account: ???
output_dir: ???
walltime: "04:00:00"
# gres: a predefined internal/slurm/<cluster> config (see SKILL Step 4) sets this.
# On the slurm/default fallback it's gpu:8 — set to the node's GPU count or sbatch
# fails "Requested node configuration is not available" (4-GPU GB300 -> gpu:4).
mounts:
mount_home: false
auto_export: # REQUIRED trigger for auto-export. Without this, the
Expand All @@ -55,13 +65,25 @@ deployment:
hf_model_handle:
served_model_name: ???
image: vllm/vllm-openai:v0.19.1
# NVFP4 on Blackwell B300/GB300 (sm_103): append `-cu130` to the (multi-arch)
# image tag — the default cu12 build has no sm_103 FP4 kernel, so the deploy
# dies at engine init with CUDA "no kernel image is available". e.g.:
# image: vllm/vllm-openai:v0.19.1-cu130
# If a pinned release predates your model's arch, use the nightly instead
# (Qwen3.5-9B's qwen3_5 needed vllm/vllm-openai:cu130-nightly-x86_64, 0.19.2rc1.dev134).
# Multimodal on sm_103 may also need `--mm-encoder-attn-backend TRITON_ATTN`
# (ViT flash-attn workaround; drop if the encoder loads without it).
#
# `--served-model-name` (in command below) is REQUIRED — else vLLM serves the
# model as `/checkpoint` and eval requests 404 ("model does not exist").
# For MoE models, add `--enable-expert-parallel` to the command.
# For models with custom code, add `--trust-remote-code` to the command.
# After filling in evaluation `parallelism` values (top-level + per-task),
# append `--max-num-seqs N` to the command where
# N = ceil(max_parallelism / data_parallel_size).
command: >-
vllm serve /checkpoint
--served-model-name ${deployment.served_model_name}
--host 0.0.0.0
--port ${deployment.port}
--tensor-parallel-size 1
Expand All @@ -72,6 +94,12 @@ deployment:
evaluation:
env_vars:
HF_TOKEN: host:HF_TOKEN
# nemo-skills tasks (ns_*) hard-require the served-endpoint api key env var
# (api_key_name below) to be set WITH A VALUE inside the eval container.
# A shell `export DUMMY_API_KEY=dummy` does NOT reach the SLURM container —
# it must be declared here. Omit it and ns_* tasks die with
# "api_key_env_var=DUMMY_API_KEY but the value is not set".
DUMMY_API_KEY: lit:dummy
nemo_evaluator_config:
config:
params:
Expand Down
14 changes: 14 additions & 0 deletions .claude/skills/ptq/references/launcher-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,20 @@ pipeline:
gpus_per_node: <num_gpus>
```

> **Match `gpus_per_node` to the cluster's node GPU count / QOS minimum.** If it
> is below what the QOS requires (many clusters mandate a full node), `sbatch`
> rejects the job with `QOSMinGRES` or `Requested node configuration is not
> available`. e.g. GB300 nodes have 4 GPUs and require the full node → set
> `gpus_per_node: 4`; B300/B200 nodes have 8. Check with `sinfo -o '%P %G'`.

> **`EXTRA_PIP_DEPS` must avoid shell metacharacters.** It is written into an
> unquoted `export` in the generated sbatch script, so a value like
> `transformers>=4.57,<4.58` is mangled by shell redirection (`>`/`<`) and
> silently dropped — the deps never install. Use exact `==` pins (no `>`/`<`).
> The right version is **model-specific** — a brand-new architecture may need a
> newer transformers than the repo's library pin (e.g. Qwen3.5's `qwen3_5` needs
> `EXTRA_PIP_DEPS: "transformers==5.5.0"`); pick what the target model requires.

Extra `hf_ptq.py` flags can be passed via `args`:

```yaml
Expand Down
Loading