[skills] Fix eval/deploy defaults for NVFP4-on-Blackwell + nemo-skills evals #1595
Edwardf0t1 wants to merge 2 commits into
…s evals
Concrete fixes from a Qwen3.5-9B NVFP4 PTQ -> deploy -> AA-eval run on
B300/GB300 where each issue caused a real failure:
- example_eval.yaml: add --served-model-name ${deployment.served_model_name};
without it vLLM registers the model as /checkpoint and every eval 404s.
- evaluation SKILL: nemo-skills (ns_*) self-deployment needs DUMMY_API_KEY in
evaluation.env_vars (a shell export does NOT reach the SLURM container);
document the required host:/lit:/runtime: env-var value prefixes; note that
execution.gres must match the node GPU count (else sbatch 'Requested node
configuration is not available').
- deployment + evaluation SKILL: NVFP4 on Blackwell (sm_100/sm_103) requires
vllm/vllm-openai:cu130-nightly; v0.19.1 and any cu129 build lack sm_103 FP4
kernels (engine init dies 'no kernel image'). Plus --mm-encoder-attn-backend
TRITON_ATTN for multimodal on sm_103, and the raw-markdown recipes.vllm.ai
fallback for hardware variants.
- ptq launcher-guide: match gpus_per_node to node/QOS; EXTRA_PIP_DEPS must
avoid shell metacharacters (use == pins, not >=/<).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
No actionable comments were generated in the recent review. 🎉

Walkthrough

This PR adds documentation guidance across deployment, evaluation, and PTQ skills to help users properly configure Blackwell (sm_103) GPU deployments on SLURM clusters. Updates include cu130-nightly vLLM build requirements, resource allocation settings, environment variable setup patterns, and safe dependency injection practices with concrete example YAML configurations.
Pull request overview
This PR updates the agent-skill documentation and example configs under .claude/skills/** to prevent known failure modes when running NEL-based deployments/evals (notably NVFP4-on-Blackwell and nemo-skills self-deployment).
Changes:
- Update the evaluation skill and `example_eval.yaml` to require `--served-model-name` and document SLURM/container env-var injection requirements (including nemo-skills' `DUMMY_API_KEY`).
- Document Blackwell NVFP4 deployment requirements (CUDA-13 `cu130-nightly` images) in both deployment and evaluation skills, plus a multimodal-specific vLLM flag.
- Add PTQ launcher-guide notes about SLURM GPU resource matching and `EXTRA_PIP_DEPS` quoting/metacharacter pitfalls.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `.claude/skills/ptq/references/launcher-guide.md` | Adds guidance on `gpus_per_node` and `EXTRA_PIP_DEPS` pitfalls for launcher-generated SLURM scripts. |
| `.claude/skills/evaluation/SKILL.md` | Clarifies required vLLM serving conventions, Blackwell NVFP4 image requirements, and NEL env var rules. |
| `.claude/skills/evaluation/recipes/examples/example_eval.yaml` | Updates the example eval template with `--served-model-name` and required env vars for nemo-skills tasks. |
| `.claude/skills/deployment/SKILL.md` | Documents Blackwell NVFP4 requirements for vLLM image selection and related flags. |
> **`EXTRA_PIP_DEPS` must avoid shell metacharacters.** It is written into an
> unquoted `export` in the generated sbatch script, so a value like
> `transformers>=4.57,<4.58` is mangled by shell redirection (`>`/`<`) and
> silently dropped: the deps never install. Use exact pins instead, e.g.
> `EXTRA_PIP_DEPS: "transformers==5.5.0"`.
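A pre-flight check for this pitfall could look roughly like the sketch below. It is not part of the launcher: the `UNSAFE` character set and the `unsafe_chars` helper are hypothetical names chosen for illustration, and the set is an assumption about which characters are risky in an unquoted shell word.

```python
# Characters that an unquoted shell word would interpret rather than pass
# through literally. This set is an illustrative assumption, not the
# launcher's own validation logic.
UNSAFE = set("<>|&;()$`\\\"' \t\n*?[]{}~#!")


def unsafe_chars(extra_pip_deps: str) -> list[str]:
    """Return the distinct risky characters found in the value, sorted."""
    return sorted({ch for ch in extra_pip_deps if ch in UNSAFE})


# A version range contains the redirection operators; an exact pin does not.
print(unsafe_chars("transformers>=4.57,<4.58"))
print(unsafe_chars("transformers==5.5.0"))
```

A check like this would run before the launcher config is submitted, so a range specifier is rejected up front instead of being dropped inside the generated script.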
Codecov Report

✅ All modified and coverable lines are covered by tests.

Coverage Diff

| Metric | main | #1595 | +/- |
|---|---|---|---|
| Coverage | 55.66% | 73.23% | +17.57% |
| Files | 478 | 479 | +1 |
| Lines | 52367 | 52435 | +68 |
| Hits | 29148 | 38400 | +9252 |
| Misses | 23219 | 14035 | -9184 |
hf_model_handle is not reliably mounted at /checkpoint in current NEL: with only hf_model_handle set, `vllm serve /checkpoint` makes vLLM treat the literal '/checkpoint' as an HF repo id and the deploy dies with `HFValidationError: Repo id must use alphanumeric chars ... : '/checkpoint'`.
Document preferring checkpoint_path (download the HF model to the cluster via snapshot_download first) in the evaluation SKILL Step 3 and example_eval.yaml. Hit while running a BF16 baseline (Qwen/Qwen3.5-9B) for an NVFP4 comparison.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
walltime: "04:00:00"
# gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what
# the QOS allows) or sbatch fails "Requested node configuration is not available".
# gres: gpu:4 # e.g. GB300 nodes (4 GPUs); also drop --data-parallel-size to match
Should we replace "also drop --data-parallel-size to match" with "match --data-parallel-size/--tensor-parallel-size to it"?
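The relationship this comment asks to spell out can be sketched as a small consistency check. The helper names are hypothetical, and the rule that tensor-parallel size times data-parallel size should equal the requested GPU count is an assumption for a single-node deployment, not something the PR states.

```python
def gpus_from_gres(gres: str) -> int:
    """Parse the GPU count from a SLURM gres string like 'gpu:4' or 'gpu:b300:4'."""
    kind, *rest = gres.split(":")
    if kind != "gpu" or not rest or not rest[-1].isdigit():
        raise ValueError(f"unsupported gres value: {gres!r}")
    return int(rest[-1])


def parallelism_matches(gres: str, tensor_parallel: int, data_parallel: int) -> bool:
    # Assumption: a single-node deployment should use exactly the GPUs it
    # requests, so TP x DP is compared against the gres count.
    return tensor_parallel * data_parallel == gpus_from_gres(gres)


# A 4-GPU node: DP=4 with TP=1 fits, while the 8-GPU default layout does not.
print(parallelism_matches("gpu:4", tensor_parallel=1, data_parallel=4))
print(parallelism_matches("gpu:4", tensor_parallel=1, data_parallel=8))
```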
Do we want to add a warning message for the Blackwell-only context? Native FP4 GEMM kernels exist only on sm_100/sm_103. On Hopper/Ada/Ampere there is no native FP4 path, and NVFP4 falls back to the marlin weight-dequant backend (slower).
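One way such a warning could be driven is a lookup on the GPU's compute capability, sketched below. The architecture set comes only from the comment above (sm_100/sm_103); the function name and return labels are hypothetical.

```python
# Compute capabilities with native FP4 GEMM kernels, per the review comment
# (sm_100 and sm_103). Other architectures are assumed to take the fallback.
NATIVE_FP4_ARCHS = {(10, 0), (10, 3)}


def nvfp4_path(compute_capability: tuple[int, int]) -> str:
    """Classify which NVFP4 path a GPU with this (major, minor) would use."""
    if compute_capability in NATIVE_FP4_ARCHS:
        return "native"
    return "marlin-fallback"


print(nvfp4_path((10, 3)))  # Blackwell sm_103
print(nvfp4_path((9, 0)))   # Hopper sm_90
```

On a machine with PyTorch installed, the input tuple could come from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` pair.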
What does this PR do?
Type of change: documentation (agent skills under `.claude/skills/**`)

Fixes five agent-skill defaults/docs that each caused a real failure during a Qwen3.5-9B NVFP4 PTQ → deploy → Artificial-Analysis eval run on B300/GB300:

- `example_eval.yaml` missing `--served-model-name`: vLLM registered the model under the path `/checkpoint`, so every eval request 404'd (`The model '<served_model_name>' does not exist`). Added `--served-model-name ${deployment.served_model_name}` to the template command and documented it as a required convention in the evaluation SKILL.
- nemo-skills self-deployment needs `DUMMY_API_KEY`: `ns_*` tasks hard-fail (`api_key_env_var=DUMMY_API_KEY but the value is not set`) unless the api-key env var is set inside the container; a shell `export` does not reach the SLURM container. Added `DUMMY_API_KEY: lit:dummy` to the example's `evaluation.env_vars` and corrected the Step 5 known-issue + Step 8 note.
- Env-var values require `host:`/`lit:`/`runtime:` prefixes (a bare value hard-errors: `"Env var value '…' must have an explicit prefix"`).
- `execution.gres` / `gpus_per_node`: must match the cluster node GPU count and QOS, or `sbatch` rejects the job (`Requested node configuration is not available` on the eval side, `QOSMinGRES` on the PTQ side). Documented in the evaluation SKILL (Step 4) and the PTQ launcher guide.
- NVFP4 on Blackwell requires `cu130-nightly`: `v0.19.1` and any `cu129` (CUDA 12.9) vLLM build lack sm_103 FP4 kernels; the server loads the checkpoint then dies at engine init with `CUDA error: no kernel image is available` (affects `flashinfer` and `cutlass` NVFP4 backends; `marlin` separately fails on non-64-divisible layer dims). Documented in the deployment + evaluation SKILLs, including `--mm-encoder-attn-backend TRITON_ATTN` for multimodal models on sm_103 and the raw-markdown `recipes.vllm.ai` fallback for hardware variants.

Closely-related bonus (same launcher-guide file): `EXTRA_PIP_DEPS` must avoid shell `>`/`<` metacharacters (use `==` pins); they are mangled in the launcher's unquoted sbatch `export` and silently dropped.

Usage
N/A — agent-skill docs and a config template only; no library/API/code change.
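Although nothing here changes a library API, the env-var prefix rule described in this PR is easy to check by hand before submitting a job. The following is a minimal sketch: `unprefixed_env_vars` is a hypothetical helper, and the `HF_TOKEN` entry is an invented example value, not something taken from the PR.

```python
# Prefixes that the PR says every env-var value must carry.
REQUIRED_PREFIXES = ("host:", "lit:", "runtime:")


def unprefixed_env_vars(env_vars: dict[str, str]) -> list[str]:
    """Return the names whose values lack an explicit host:/lit:/runtime: prefix."""
    return [
        name
        for name, value in env_vars.items()
        if not str(value).startswith(REQUIRED_PREFIXES)
    ]


# 'lit:dummy' passes; a bare 'dummy' would be reported.
print(unprefixed_env_vars({"DUMMY_API_KEY": "lit:dummy", "HF_TOKEN": "host:HF_TOKEN"}))
print(unprefixed_env_vars({"DUMMY_API_KEY": "dummy"}))
```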
Testing
- `example_eval.yaml` validated with `yaml.safe_load`; confirmed the command includes `--served-model-name` and `evaluation.env_vars` includes `DUMMY_API_KEY: lit:dummy`.
- … applied, a Qwen3.5-9B NVFP4 checkpoint deploys on B300 (cu130-nightly) and the AA suite runs via NEL (GPQA scored; SciCode/IFBench/MMMU-Pro/HLE running).
- … `pre-commit` is not installed in this environment).
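The template validation described above could be approximated as follows. This is a sketch under assumptions: it takes a plain dict (as `yaml.safe_load` would return) rather than reading the file, and the `deployment.command` key is a guess at where the serve command lives; only `evaluation.env_vars` and the two expected values come from the PR text.

```python
def check_eval_config(cfg: dict) -> list[str]:
    """Return a list of problems found in an already-parsed eval config."""
    problems = []

    # Assumed location of the vLLM serve command within the config.
    command = cfg.get("deployment", {}).get("command", "")
    if "--served-model-name" not in command:
        problems.append("deployment command lacks --served-model-name")

    env_vars = cfg.get("evaluation", {}).get("env_vars", {})
    if env_vars.get("DUMMY_API_KEY") != "lit:dummy":
        problems.append("evaluation.env_vars lacks DUMMY_API_KEY: lit:dummy")

    return problems


example = {
    "deployment": {
        "command": "vllm serve /checkpoint --served-model-name ${deployment.served_model_name}"
    },
    "evaluation": {"env_vars": {"DUMMY_API_KEY": "lit:dummy"}},
}
print(check_eval_config(example))
```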
Before your PR is "Ready for review"
- … `CONTRIBUTING.md`: N/A
- … `/claude review`)

Additional Information

Skills touched: `evaluation` (`SKILL.md`, `recipes/examples/example_eval.yaml`), `deployment` (`SKILL.md`), `ptq` (`references/launcher-guide.md`).

🤖 Generated with Claude Code
Summary by CodeRabbit