[skills] evaluation: prefer checkpoint_path over hf_model_handle

Edwardf0t1 · claude · Edwardf0t1 · commit c0a32a7a0e1f · 2026-06-01T18:08:59.000-07:00
hf_model_handle is not reliably mounted at /checkpoint in current NEL: with
only hf_model_handle set, `vllm serve /checkpoint` makes vLLM treat the
literal '/checkpoint' as an HF repo id and the deploy dies with
`HFValidationError: Repo id must use alphanumeric chars ... : '/checkpoint'`.
Document the preference for checkpoint_path (first download the HF model to
the cluster with snapshot_download) in Step 3 of the evaluation SKILL and in
example_eval.yaml.
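
A minimal sketch of the staging step, not the skill's own code. It assumes
huggingface_hub is installed on the cluster. The helper names
(stage_checkpoint, deployment_fields) and the target directory are
hypothetical.

```python
from pathlib import Path


def stage_checkpoint(repo_id: str, local_dir: str) -> str:
    """Download an HF model onto the cluster and return its absolute local path."""
    # Imported lazily so deployment_fields() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # snapshot_download returns the folder that holds the downloaded files.
    return str(Path(snapshot_download(repo_id, local_dir=local_dir)).resolve())


def deployment_fields(checkpoint_path: str) -> dict:
    """Values to set under `deployment:` once the model is on disk."""
    # Per the SKILL's model-path rule, hf_model_handle stays null
    # whenever checkpoint_path is set.
    return {"checkpoint_path": checkpoint_path, "hf_model_handle": None}


# Example usage (hypothetical directory; choose one the compute nodes can read):
#   path = stage_checkpoint("Qwen/Qwen3.5-9B", "/path/to/checkpoints/Qwen3.5-9B")
#   print(deployment_fields(path))
```

Keeping the download separate from the config step means the staged path can
be reused across runs without contacting the Hub again.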

Hit while running a BF16 baseline (Qwen/Qwen3.5-9B) for an NVFP4 comparison.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
@@ -77,6 +77,8 @@ nel skills build-config --execution <...> --deployment <...> --model_type <...>
 
 **Model path.** Checkpoint path (`/`, `./`, `../`, `~`, or exists on disk) → set `deployment.checkpoint_path`, leave `hf_model_handle: null`. Else HF handle (one `/`, not on disk) → set `deployment.hf_model_handle`, leave `checkpoint_path: null`.
 
+> **Prefer `checkpoint_path` on SLURM — `hf_model_handle` is not reliably mounted at `/checkpoint` in current NEL.** With only `hf_model_handle` set, the `vllm serve /checkpoint` command finds nothing mounted there and vLLM treats the literal string `/checkpoint` as an HF repo id, so the deploy dies with `HFValidationError: Repo id must use alphanumeric chars … : '/checkpoint'`. To evaluate an un-staged HF model (e.g. a BF16 baseline for a quant comparison), first download it onto the cluster — `python -c "from huggingface_hub import snapshot_download; snapshot_download('<org>/<model>', local_dir='<path>')"` — then set `checkpoint_path: <path>` (this is the path the NVFP4/quantized run already uses).
+
 **Auto-detect ModelOpt quantization** (checkpoint paths). Check `config.json` for `quantization_config` (or legacy `hf_quant_config.json`):
 
 - **vLLM:** no `--quantization` flag by default — vLLM auto-detects from `quantization_config` / `hf_quant_config.json`. Add only when the card, vLLM version, or dry-run error requires it.
diff --git a/.claude/skills/evaluation/recipes/examples/example_eval.yaml b/.claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -23,9 +23,16 @@
 #
 # Deployment uses a single `command:` field instead of separate
 # `tensor_parallel_size` / `data_parallel_size` / `extra_args` fields — the full
-# `vllm serve` invocation lives in the command string. NEL mounts the resolved
-# model (from checkpoint_path or hf_model_handle) at /checkpoint inside the
-# container, and Hydra interpolates ${deployment.port} at run time.
+# `vllm serve` invocation lives in the command string. NEL mounts a
+# `checkpoint_path` at /checkpoint inside the container, and Hydra interpolates
+# ${deployment.port} at run time.
+#
+# PREFER `checkpoint_path` (a path already on the cluster) over `hf_model_handle`.
+# `hf_model_handle` is NOT reliably mounted at /checkpoint in current NEL — the
+# `vllm serve /checkpoint` command then makes vLLM treat "/checkpoint" as an HF
+# repo id and the deploy dies with `HFValidationError: Repo id must use ...`.
+# To eval an un-staged HF model, first download it to the cluster (e.g.
+# `huggingface_hub.snapshot_download`) and point `checkpoint_path` at it.
 #
 # Run a single task:
 #   nel run --config ... -t gpqa_diamond_aa_v3