Commit c0a32a7

Edwardf0t1 and claude committed
[skills] evaluation: prefer checkpoint_path over hf_model_handle

hf_model_handle is not reliably mounted at /checkpoint in current NEL: with only hf_model_handle set, `vllm serve /checkpoint` makes vLLM treat the literal '/checkpoint' as an HF repo id and the deploy dies with `HFValidationError: Repo id must use alphanumeric chars ... : '/checkpoint'`.

Document preferring checkpoint_path (download the HF model to the cluster via snapshot_download first) in the evaluation SKILL Step 3 and example_eval.yaml.

Hit while running a BF16 baseline (Qwen/Qwen3.5-9B) for an NVFP4 comparison.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent f11770d commit c0a32a7

2 files changed

Lines changed: 12 additions & 3 deletions

.claude/skills/evaluation/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -77,6 +77,8 @@ nel skills build-config --execution <...> --deployment <...> --model_type <...>
 
 **Model path.** Checkpoint path (`/`, `./`, `../`, `~`, or exists on disk) → set `deployment.checkpoint_path`, leave `hf_model_handle: null`. Else HF handle (one `/`, not on disk) → set `deployment.hf_model_handle`, leave `checkpoint_path: null`.
 
+> **Prefer `checkpoint_path` on SLURM — `hf_model_handle` is not reliably mounted at `/checkpoint` in current NEL.** With only `hf_model_handle` set, the `vllm serve /checkpoint` command finds nothing mounted there and vLLM treats the literal string `/checkpoint` as an HF repo id, so the deploy dies with `HFValidationError: Repo id must use alphanumeric chars … : '/checkpoint'`. To evaluate an un-staged HF model (e.g. a BF16 baseline for a quant comparison), first download it onto the cluster — `python -c "from huggingface_hub import snapshot_download; snapshot_download('<org>/<model>', local_dir='<path>')"` — then set `checkpoint_path: <path>` (this is the path the NVFP4/quantized run already uses).
+
 **Auto-detect ModelOpt quantization** (checkpoint paths). Check `config.json` for `quantization_config` (or legacy `hf_quant_config.json`):
 
 - **vLLM:** no `--quantization` flag by default — vLLM auto-detects from `quantization_config` / `hf_quant_config.json`. Add only when the card, vLLM version, or dry-run error requires it.
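
Note on the hunk above: the **Model path** rule can be sketched as a small Python helper. This is an illustrative sketch only, not code from NEL or the skill. The names `classify_model_ref` and `PATH_PREFIXES` and the injectable `exists` parameter are hypothetical, and raising `ValueError` for a reference that matches neither branch is an assumption, since the skill text does not say what to do in that case.

```python
import os
from typing import Callable, Dict, Optional

# Prefixes the skill text treats as "this is a checkpoint path".
PATH_PREFIXES = ("/", "./", "../", "~")


def classify_model_ref(
    ref: str,
    exists: Callable[[str], bool] = os.path.exists,
) -> Dict[str, Optional[str]]:
    """Map a model reference to the two deployment fields (hypothetical helper)."""
    # A reference counts as a path if it has a path-like prefix or exists on disk.
    on_disk = exists(os.path.expanduser(ref))
    if ref.startswith(PATH_PREFIXES) or on_disk:
        return {"checkpoint_path": ref, "hf_model_handle": None}
    # Otherwise an HF handle has exactly one "/" (org/model) and is not on disk.
    if ref.count("/") == 1:
        return {"checkpoint_path": None, "hf_model_handle": ref}
    # Assumption: anything else is rejected rather than guessed at.
    raise ValueError(f"cannot classify model reference: {ref!r}")
```

Given the preference this commit documents, a result with `hf_model_handle` set would be a cue on SLURM to stage the model on the cluster first and rerun with a `checkpoint_path`.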

.claude/skills/evaluation/recipes/examples/example_eval.yaml

Lines changed: 10 additions & 3 deletions
@@ -23,9 +23,16 @@
 #
 # Deployment uses a single `command:` field instead of separate
 # `tensor_parallel_size` / `data_parallel_size` / `extra_args` fields — the full
-# `vllm serve` invocation lives in the command string. NEL mounts the resolved
-# model (from checkpoint_path or hf_model_handle) at /checkpoint inside the
-# container, and Hydra interpolates ${deployment.port} at run time.
+# `vllm serve` invocation lives in the command string. NEL mounts a
+# `checkpoint_path` at /checkpoint inside the container, and Hydra interpolates
+# ${deployment.port} at run time.
+#
+# PREFER `checkpoint_path` (a path already on the cluster) over `hf_model_handle`.
+# `hf_model_handle` is NOT reliably mounted at /checkpoint in current NEL — the
+# `vllm serve /checkpoint` command then makes vLLM treat "/checkpoint" as a HF
+# repo id and the deploy dies with `HFValidationError: Repo id must use ...`.
+# To eval an un-staged HF model, first download it to the cluster (e.g.
+# `huggingface_hub.snapshot_download`) and point `checkpoint_path` at it.
 #
 # Run a single task:
 # nel run --config ... -t gpqa_diamond_aa_v3
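
Note on the hunk above: the staging step the new comments describe (download first, then point `checkpoint_path` at the result) might look like the following Python sketch. `huggingface_hub.snapshot_download` and its `local_dir` argument are the call named in the commit. The wrapper `stage_checkpoint`, its injectable `download` parameter, and the `config.json` sanity check are hypothetical additions for illustration and are not part of NEL.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def stage_checkpoint(
    repo_id: str,
    local_dir: str,
    download: Optional[Callable[..., str]] = None,
) -> Dict[str, Optional[str]]:
    """Download an HF model to local_dir and return the deployment fields to set."""
    if download is None:
        # Imported lazily so the helper can be exercised without the package.
        from huggingface_hub import snapshot_download

        download = snapshot_download

    target = Path(local_dir).expanduser()
    # snapshot_download returns the folder that holds the downloaded files.
    staged = Path(download(repo_id, local_dir=str(target)))

    # Assumption: a usable checkpoint directory contains a config.json.
    if not (staged / "config.json").is_file():
        raise FileNotFoundError(f"no config.json under {staged}")

    # Use the on-disk path and leave the handle unset, as the skill recommends.
    return {"checkpoint_path": str(staged), "hf_model_handle": None}
```

Calling `stage_checkpoint("<org>/<model>", "<path>")` on a cluster node that can see `<path>` would return the two fields to copy into the `deployment` section of the config.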
