[skills] Fix eval/deploy defaults for NVFP4-on-Blackwell + nemo-skills evals by Edwardf0t1 · Pull Request #1595 · NVIDIA/Model-Optimizer

Edwardf0t1 · 2026-06-01T23:53:26Z

What does this PR do?

Type of change: documentation (agent skills under .claude/skills/**)

Fixes five agent-skill defaults/docs that each caused a real failure during a
Qwen3.5-9B NVFP4 PTQ → deploy → Artificial-Analysis eval run on B300/GB300:

1. example_eval.yaml missing --served-model-name: vLLM registered the
   model under the path /checkpoint, so every eval request 404'd
   (The model '<served_model_name>' does not exist). Added
   --served-model-name ${deployment.served_model_name} to the template command
   and documented it as a required convention in the evaluation SKILL.
2. nemo-skills DUMMY_API_KEY: ns_* tasks hard-fail
   (api_key_env_var=DUMMY_API_KEY but the value is not set) unless the api-key
   env var is set inside the container; a shell export does not reach the
   SLURM container. Added DUMMY_API_KEY: lit:dummy to the example's
   evaluation.env_vars and corrected the Step 5 known-issue + Step 8 note.
3. NEL env-var value prefixes: documented the required
   host: / lit: / runtime: prefixes (a bare value hard-errors:
   "Env var value '…' must have an explicit prefix").
4. execution.gres / gpus_per_node: must match the cluster node GPU count
   and QOS, or sbatch rejects the job ("Requested node configuration is not
   available" on the eval side, QOSMinGRES on the PTQ side). Documented in the
   evaluation SKILL (Step 4) and the PTQ launcher guide.
5. NVFP4 on Blackwell needs cu130-nightly: v0.19.1 and any cu129
   (CUDA 12.9) vLLM build lack sm_103 FP4 kernels; the server loads the checkpoint
   then dies at engine init with CUDA error: no kernel image is available
   (affects flashinfer and cutlass NVFP4 backends; marlin separately fails
   on non-64-divisible layer dims). Documented in the deployment + evaluation
   SKILLs, including --mm-encoder-attn-backend TRITON_ATTN for multimodal models
   on sm_103 and the raw-markdown recipes.vllm.ai fallback for hardware variants.
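
The env-var prefix rule described above can be checked locally before a config is submitted. The following is a minimal sketch, not NEL's own validation: it assumes the three prefixes named in this PR are the complete set, and `check_env_vars` and the `BARE_VALUE` entry are hypothetical names used only for illustration.

```python
# Prefixes named in this PR description; assumed here to be the full set.
ALLOWED_PREFIXES = ("host:", "lit:", "runtime:")


def check_env_vars(env_vars: dict) -> list:
    """Return the names of env vars whose value lacks an explicit prefix."""
    return [
        name
        for name, value in env_vars.items()
        if not str(value).startswith(ALLOWED_PREFIXES)
    ]


if __name__ == "__main__":
    env_vars = {
        "DUMMY_API_KEY": "lit:dummy",  # entry added to the example config
        "BARE_VALUE": "dummy",         # hypothetical entry with no prefix
    }
    for name in check_env_vars(env_vars):
        print(f"{name}: value needs a host:/lit:/runtime: prefix")
```

Running such a check on the `evaluation.env_vars` mapping before launch would surface a bare value locally instead of at job start.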

Closely-related bonus (same launcher-guide file): EXTRA_PIP_DEPS must avoid
shell >/< metacharacters (use == pins) — they are mangled in the launcher's
unquoted sbatch export and silently dropped.
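
A pre-flight check for this pitfall could look like the sketch below. It is an illustration only: the character set is an assumption about what an unquoted shell assignment may mangle, and `unsafe_chars` is a hypothetical helper, not part of the launcher.

```python
import re

# Characters assumed to be risky in an unquoted `export NAME=value` line.
# The set is illustrative; extend it to match the shell in use.
SHELL_METACHARS = re.compile(r"[<>|&;$`]")


def unsafe_chars(value: str) -> list:
    """Return the sorted, de-duplicated shell metacharacters found in value."""
    return sorted(set(SHELL_METACHARS.findall(value)))


if __name__ == "__main__":
    for deps in ("transformers>=4.57,<4.58", "transformers==5.5.0"):
        found = unsafe_chars(deps)
        status = f"unsafe {found}" if found else "ok"
        print(f"{deps!r}: {status}")
```

An exact `==` pin passes the check because neither `=` nor `.` is treated specially in this position, while the range specifier is flagged for both `<` and `>`.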

Usage

N/A — agent-skill docs and a config template only; no library/API/code change.

Testing

- example_eval.yaml validated with yaml.safe_load; confirmed the command
  includes --served-model-name and evaluation.env_vars includes
  DUMMY_API_KEY: lit:dummy.
- Each fix was empirically validated end-to-end in this session: with all five
  applied, a Qwen3.5-9B NVFP4 checkpoint deploys on B300 (cu130-nightly) and the
  AA suite runs via NEL (GPQA scored; SciCode/IFBench/MMMU-Pro/HLE running).
- Trailing-whitespace / EOF-newline checked on all changed files (pre-commit
  is not installed in this environment).

Before your PR is "Ready for review"

Is this change backward compatible?: ✅ (docs/templates only)
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: N/A (documentation)
Did you update Changelog?: N/A (agent-skill docs, not a library feature/API/critical-bug change)
Did you get Claude approval on this PR?: ❌ (will run /claude review)

Additional Information

Skills touched: evaluation (SKILL.md, recipes/examples/example_eval.yaml),
deployment (SKILL.md), ptq (references/launcher-guide.md).

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation
- Clarified NVFP4-on-Blackwell guidance: required CUDA-13 vLLM images and multimodal flag for sm_103 FP4 support.
- Added verification steps for the correct container image and explicit vLLM serve options to ensure endpoints register.
- Improved SLURM guidance: prefer checkpoint_path, set per-node GPU counts (gres) to match topology, and align parallel sizes.
- Documented env var handling for evaluations (lit-style dummy API key) and warned about injected pip-deps quoting risks.

…s evals

Concrete fixes from a Qwen3.5-9B NVFP4 PTQ -> deploy -> AA-eval run on B300/GB300 where each issue caused a real failure:

- example_eval.yaml: add --served-model-name ${deployment.served_model_name}; without it vLLM registers the model as /checkpoint and every eval 404s.
- evaluation SKILL: nemo-skills (ns_*) self-deployment needs DUMMY_API_KEY in evaluation.env_vars (a shell export does NOT reach the SLURM container); document the required host:/lit:/runtime: env-var value prefixes; note that execution.gres must match the node GPU count (else sbatch 'Requested node configuration is not available').
- deployment + evaluation SKILL: NVFP4 on Blackwell (sm_100/sm_103) requires vllm/vllm-openai:cu130-nightly; v0.19.1 and any cu129 build lack sm_103 FP4 kernels (engine init dies 'no kernel image'). Plus --mm-encoder-attn-backend TRITON_ATTN for multimodal on sm_103, and the raw-markdown recipes.vllm.ai fallback for hardware variants.
- ptq launcher-guide: match gpus_per_node to node/QOS; EXTRA_PIP_DEPS must avoid shell metacharacters (use == pins, not >=/<).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

coderabbitai · 2026-06-01T23:53:36Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fa8274eb-1800-438d-8f3f-4340d6435c5c

📥 Commits

Reviewing files that changed from the base of the PR and between f11770d and c0a32a7.

📒 Files selected for processing (2)

.claude/skills/evaluation/SKILL.md
.claude/skills/evaluation/recipes/examples/example_eval.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

.claude/skills/evaluation/recipes/examples/example_eval.yaml

Walkthrough

This PR adds documentation guidance across deployment, evaluation, and PTQ skills to help users properly configure Blackwell (sm_103) GPU deployments on SLURM clusters. Updates include cu130-nightly vLLM build requirements, resource allocation settings, environment variable setup patterns, and safe dependency injection practices with concrete example YAML configurations.

Changes

Blackwell Deployment and SLURM Configuration Guidance

Layer / File(s)	Summary
Blackwell/NVFP4 vLLM build requirements `.claude/skills/deployment/SKILL.md`, `.claude/skills/evaluation/SKILL.md`, `.claude/skills/evaluation/recipes/examples/example_eval.yaml`	Documents requirement for CUDA-13 vLLM builds (`cu130-nightly`) on Blackwell sm_103 due to FP4 kernel availability, replacing older CUDA-12.9 (`cu129`) builds. Includes multimodal encoder attention backend flag (`--mm-encoder-attn-backend TRITON_ATTN`) guidance and image verification methods across general and step-specific deployment documentation.
SLURM resource allocation and environment variable patterns `.claude/skills/evaluation/SKILL.md`, `.claude/skills/evaluation/recipes/examples/example_eval.yaml`, `.claude/skills/ptq/references/launcher-guide.md`	Establishes guidance for `execution.gres` and `gpus_per_node` matching cluster per-node GPU counts to prevent SLURM allocation failures, prefers `deployment.checkpoint_path` over `hf_model_handle`, documents required environment variable prefix formats (`lit:`, `host:`, `runtime:`) for `DUMMY_API_KEY` inside evaluation containers, and warns that `EXTRA_PIP_DEPS` is injected into an unquoted `export` in generated sbatch scripts (avoid shell metacharacters).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1583: Related documentation changes touching evaluation guidance for vLLM parallelism/TP-DP sizing and concurrency knobs.

Suggested reviewers

kaix-nv
chadvoegele
meenchen

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main changes: fixing evaluation and deployment defaults for NVFP4 on Blackwell and nemo-skills evaluations. It matches the primary objectives of the PR.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	PR contains only documentation (.md) and YAML configuration files; no Python code changes reviewed. Security check not applicable.

Copilot

Pull request overview

This PR updates the agent-skill documentation and example configs under .claude/skills/** to prevent known failure modes when running NEL-based deployments/evals (notably NVFP4-on-Blackwell and nemo-skills self-deployment).

Changes:

- Update the evaluation skill and example_eval.yaml to require --served-model-name and document SLURM/container env-var injection requirements (including nemo-skills’ DUMMY_API_KEY).
- Document Blackwell NVFP4 deployment requirements (CUDA-13 cu130-nightly images) in both deployment and evaluation skills, plus a multimodal-specific vLLM flag.
- Add PTQ launcher-guide notes about SLURM GPU resource matching and EXTRA_PIP_DEPS quoting/metacharacter pitfalls.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
`.claude/skills/ptq/references/launcher-guide.md`	Adds guidance on `gpus_per_node` and `EXTRA_PIP_DEPS` pitfalls for launcher-generated SLURM scripts.
`.claude/skills/evaluation/SKILL.md`	Clarifies required vLLM serving conventions, Blackwell NVFP4 image requirements, and NEL env var rules.
`.claude/skills/evaluation/recipes/examples/example_eval.yaml`	Updates the example eval template with `--served-model-name` and required env vars for nemo-skills tasks.
`.claude/skills/deployment/SKILL.md`	Documents Blackwell NVFP4 requirements for vLLM image selection and related flags.

+> **`EXTRA_PIP_DEPS` must avoid shell metacharacters.** It is written into an
+> unquoted `export` in the generated sbatch script, so a value like
+> `transformers>=4.57,<4.58` is mangled by shell redirection (`>`/`<`) and
+> silently dropped — the deps never install. Use exact pins instead, e.g.
+> `EXTRA_PIP_DEPS: "transformers==5.5.0"`.


codecov · 2026-06-02T00:06:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.23%. Comparing base (905259f) to head (c0a32a7).

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1595       +/-   ##
===========================================
+ Coverage   55.66%   73.23%   +17.57%     
===========================================
  Files         478      479        +1     
  Lines       52367    52435       +68     
===========================================
+ Hits        29148    38400     +9252     
+ Misses      23219    14035     -9184

Flag	Coverage Δ
unit	`53.61% <ø> (-0.02%)`	⬇️

hf_model_handle is not reliably mounted at /checkpoint in current NEL: with only hf_model_handle set, `vllm serve /checkpoint` makes vLLM treat the literal '/checkpoint' as an HF repo id and the deploy dies with `HFValidationError: Repo id must use alphanumeric chars ... : '/checkpoint'`. Document preferring checkpoint_path (download the HF model to the cluster via snapshot_download first) in the evaluation SKILL Step 3 and example_eval.yaml. Hit while running a BF16 baseline (Qwen/Qwen3.5-9B) for an NVFP4 comparison. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

kaix-nv · 2026-06-02T01:10:43Z

  walltime: "04:00:00"
+  # gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what
+  # the QOS allows) or sbatch fails "Requested node configuration is not available".
+  # gres: gpu:4   # e.g. GB300 nodes (4 GPUs); also drop --data-parallel-size to match


Replace "also drop --data-parallel-size to match" with "match --data-parallel-size/--tensor-parallel-size to it"?
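
One possible reading of "match" here, stated as an assumption rather than as the skill's rule: for a single-node deployment, the product of the tensor-parallel and data-parallel sizes should equal the per-node GPU count requested through gres. The sketch below checks that; `gpus_from_gres` and `parallel_sizes_fit` are hypothetical helpers.

```python
def gpus_from_gres(gres: str) -> int:
    """Parse the GPU count from a gres string such as 'gpu:4' or 'gpu:<type>:4'."""
    kind, *rest = gres.split(":")
    if kind != "gpu" or not rest or not rest[-1].isdigit():
        raise ValueError(f"unrecognized gres value: {gres!r}")
    return int(rest[-1])


def parallel_sizes_fit(gres: str, tensor_parallel_size: int,
                       data_parallel_size: int) -> bool:
    """True when TP x DP uses exactly the GPUs requested (single-node assumption)."""
    return tensor_parallel_size * data_parallel_size == gpus_from_gres(gres)


if __name__ == "__main__":
    # A 4-GPU node, as in the GB300 example in the quoted diff.
    print(parallel_sizes_fit("gpu:4", tensor_parallel_size=1, data_parallel_size=4))
    print(parallel_sizes_fit("gpu:4", tensor_parallel_size=1, data_parallel_size=8))
```

Under this reading, lowering only `--data-parallel-size` is one valid adjustment, but any TP/DP pair whose product equals the GPU count also fits.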

kaix-nv · 2026-06-02T01:15:45Z

Do we want to add a warning msg for Blackwell-only context? Native FP4 GEMM kernels exist only on sm_100/sm_103. On Hopper/Ada/Ampere there's no native FP4 path and NVFP4 falls back to the marlin weight-dequant backend (slower).

coderabbitai Bot approved these changes Jun 1, 2026

Edwardf0t1 requested review from chadvoegele, Copilot and kaix-nv June 1, 2026 23:57

Copilot started reviewing on behalf of Edwardf0t1 June 1, 2026 23:58

Edwardf0t1 requested review from mxinO and shengliangxu June 1, 2026 23:58

Copilot AI reviewed Jun 1, 2026

kaix-nv reviewed Jun 2, 2026

kaix-nv approved these changes Jun 2, 2026

Edwardf0t1 wants to merge 2 commits into main from zhiyuc/skill-fixes-blackwell-nvfp4-eval
