[skills] Fix eval/deploy defaults for NVFP4-on-Blackwell + nemo-skills evals #1595

Open
Edwardf0t1 wants to merge 2 commits into main from zhiyuc/skill-fixes-blackwell-nvfp4-eval

Edwardf0t1 (Contributor) commented Jun 1, 2026

What does this PR do?

Type of change: documentation (agent skills under .claude/skills/**)

Fixes five agent-skill defaults/docs that each caused a real failure during a
Qwen3.5-9B NVFP4 PTQ → deploy → Artificial-Analysis eval run on B300/GB300:

  1. example_eval.yaml missing --served-model-name — vLLM registered the
    model under the path /checkpoint, so every eval request 404'd
    (The model '<served_model_name>' does not exist). Added
    --served-model-name ${deployment.served_model_name} to the template command
    and documented it as a required convention in the evaluation SKILL.
  2. nemo-skills DUMMY_API_KEY: ns_* tasks hard-fail
    (api_key_env_var=DUMMY_API_KEY but the value is not set) unless the api-key
    env var is set inside the container; a shell export does not reach the
    SLURM container. Added DUMMY_API_KEY: lit:dummy to the example's
    evaluation.env_vars and corrected the Step 5 known-issue + Step 8 note.
  3. NEL env-var value prefixes — documented the required
    host: / lit: / runtime: prefixes (a bare value hard-errors:
    "Env var value '…' must have an explicit prefix").
  4. execution.gres / gpus_per_node — must match the cluster node GPU count
    and QOS, or sbatch rejects the job ("Requested node configuration is not
    available" on the eval side, QOSMinGRES on the PTQ side). Documented in the
    evaluation SKILL (Step 4) and the PTQ launcher guide.
  5. NVFP4 on Blackwell needs cu130-nightly: v0.19.1 and any cu129
    (CUDA 12.9) vLLM build lack sm_103 FP4 kernels; the server loads the checkpoint
    then dies at engine init with CUDA error: no kernel image is available
    (affects flashinfer and cutlass NVFP4 backends; marlin separately fails
    on non-64-divisible layer dims). Documented in the deployment + evaluation
    SKILLs, including --mm-encoder-attn-backend TRITON_ATTN for multimodal models
    on sm_103 and the raw-markdown recipes.vllm.ai fallback for hardware variants.
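
The prefix rule in item 3 lends itself to a pre-flight check before a config is submitted. The following is a minimal sketch only: the helper name and the `SOME_OTHER_VAR` entry are hypothetical and not part of NEL, and it checks just the presence of one of the three documented prefixes, not what each prefix means.

```python
# Hypothetical pre-flight check; the helper name and sample keys are
# illustrative and not part of NEL.
ALLOWED_PREFIXES = ("host:", "lit:", "runtime:")


def unprefixed_env_vars(env_vars):
    """Return the names whose values lack an explicit host:/lit:/runtime: prefix."""
    return [
        name
        for name, value in env_vars.items()
        if not str(value).startswith(ALLOWED_PREFIXES)
    ]


env_vars = {
    "DUMMY_API_KEY": "lit:dummy",  # value taken from the example in this PR
    "SOME_OTHER_VAR": "dummy",     # hypothetical bare value that would be rejected
}
print(unprefixed_env_vars(env_vars))  # → ['SOME_OTHER_VAR']
```

An empty list means every value carries an explicit prefix; any name returned is a candidate for the "must have an explicit prefix" error quoted above.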

Closely-related bonus (same launcher-guide file): EXTRA_PIP_DEPS must avoid
shell >/< metacharacters (use == pins) — they are mangled in the launcher's
unquoted sbatch export and silently dropped.
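
One way to catch an unsafe `EXTRA_PIP_DEPS` value before it reaches the launcher is to ask whether the shell would need to quote it. This sketch relies on the standard-library `shlex.quote`, which returns its input unchanged only when every character is shell-safe; the helper name is hypothetical and the launcher itself does not perform this check as far as this PR states.

```python
import shlex


def is_shell_safe(value):
    """True if the value contains only characters the shell leaves alone.

    shlex.quote() returns its input untouched only when every character is
    shell-safe, so any difference signals characters the shell would interpret.
    """
    return shlex.quote(value) == value


print(is_shell_safe("transformers==5.5.0"))       # exact pin → True
print(is_shell_safe("transformers>=4.57,<4.58"))  # contains > and < → False
```

The check is conservative: it also flags spaces and an empty string, which is arguably the desired behavior for a value written into an unquoted `export`.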

Usage

N/A — agent-skill docs and a config template only; no library/API/code change.

Testing

  • example_eval.yaml validated with yaml.safe_load; confirmed the command
    includes --served-model-name and evaluation.env_vars includes
    DUMMY_API_KEY: lit:dummy.
  • Each fix was empirically validated end-to-end in this session: with all five
    applied, a Qwen3.5-9B NVFP4 checkpoint deploys on B300 (cu130-nightly) and the
    AA suite runs via NEL (GPQA scored; SciCode/IFBench/MMMU-Pro/HLE running).
  • Trailing-whitespace / EOF-newline checked on all changed files (pre-commit
    is not installed in this environment).
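
The command half of the first check can also be approximated with the standard library alone. The sketch below is illustrative: the command string is assembled from fragments quoted in this PR (the real template in example_eval.yaml may carry more flags), and the helper name is hypothetical. It only handles the space-separated `--flag value` form, not `--flag=value`.

```python
import shlex

# Command string assembled from fragments quoted in this PR; the real template
# in example_eval.yaml may carry additional flags.
command = "vllm serve /checkpoint --served-model-name ${deployment.served_model_name}"


def has_flag_with_value(command, flag):
    """True if `flag` appears in the command and is followed by a non-flag token."""
    tokens = shlex.split(command)
    for i, token in enumerate(tokens[:-1]):
        if token == flag and not tokens[i + 1].startswith("-"):
            return True
    return False


print(has_flag_with_value(command, "--served-model-name"))  # → True
```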

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (docs/templates only)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (documentation)
  • Did you update Changelog?: N/A (agent-skill docs, not a library feature/API/critical-bug change)
  • Did you get Claude approval on this PR?: ❌ (will run /claude review)

Additional Information

Skills touched: evaluation (SKILL.md, recipes/examples/example_eval.yaml),
deployment (SKILL.md), ptq (references/launcher-guide.md).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Clarified NVFP4-on-Blackwell guidance: required CUDA-13 vLLM images and multimodal flag for sm_103 FP4 support.
    • Added verification steps for the correct container image and explicit vLLM serve options to ensure endpoints register.
    • Improved SLURM guidance: prefer checkpoint_path, set per-node GPU counts (gres) to match topology, and align parallel sizes.
    • Documented env var handling for evaluations (lit-style dummy API key) and warned about injected pip-deps quoting risks.

…s evals

Concrete fixes from a Qwen3.5-9B NVFP4 PTQ -> deploy -> AA-eval run on
B300/GB300 where each issue caused a real failure:

- example_eval.yaml: add --served-model-name ${deployment.served_model_name};
  without it vLLM registers the model as /checkpoint and every eval 404s.
- evaluation SKILL: nemo-skills (ns_*) self-deployment needs DUMMY_API_KEY in
  evaluation.env_vars (a shell export does NOT reach the SLURM container);
  document the required host:/lit:/runtime: env-var value prefixes; note that
  execution.gres must match the node GPU count (else sbatch 'Requested node
  configuration is not available').
- deployment + evaluation SKILL: NVFP4 on Blackwell (sm_100/sm_103) requires
  vllm/vllm-openai:cu130-nightly; v0.19.1 and any cu129 build lack sm_103 FP4
  kernels (engine init dies 'no kernel image'). Plus --mm-encoder-attn-backend
  TRITON_ATTN for multimodal on sm_103, and the raw-markdown recipes.vllm.ai
  fallback for hardware variants.
- ptq launcher-guide: match gpus_per_node to node/QOS; EXTRA_PIP_DEPS must
  avoid shell metacharacters (use == pins, not >=/<).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

coderabbitai Bot (Contributor) commented Jun 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fa8274eb-1800-438d-8f3f-4340d6435c5c

📥 Commits

Reviewing files that changed from the base of the PR and between f11770d and c0a32a7.

📒 Files selected for processing (2)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .claude/skills/evaluation/recipes/examples/example_eval.yaml

📝 Walkthrough

This PR adds documentation guidance across deployment, evaluation, and PTQ skills to help users properly configure Blackwell (sm_103) GPU deployments on SLURM clusters. Updates include cu130-nightly vLLM build requirements, resource allocation settings, environment variable setup patterns, and safe dependency injection practices with concrete example YAML configurations.

Changes

Blackwell Deployment and SLURM Configuration Guidance

| Layer / File(s) | Summary |
| --- | --- |
| Blackwell/NVFP4 vLLM build requirements (.claude/skills/deployment/SKILL.md, .claude/skills/evaluation/SKILL.md, .claude/skills/evaluation/recipes/examples/example_eval.yaml) | Documents requirement for CUDA-13 vLLM builds (cu130-nightly) on Blackwell sm_103 due to FP4 kernel availability, replacing older CUDA-12.9 (cu129) builds. Includes multimodal encoder attention backend flag (--mm-encoder-attn-backend TRITON_ATTN) guidance and image verification methods across general and step-specific deployment documentation. |
| SLURM resource allocation and environment variable patterns (.claude/skills/evaluation/SKILL.md, .claude/skills/evaluation/recipes/examples/example_eval.yaml, .claude/skills/ptq/references/launcher-guide.md) | Establishes guidance for execution.gres and gpus_per_node matching cluster per-node GPU counts to prevent SLURM allocation failures, prefers deployment.checkpoint_path over hf_model_handle, documents required environment variable prefix formats (lit:, host:, runtime:) for DUMMY_API_KEY inside evaluation containers, and warns that EXTRA_PIP_DEPS is injected into an unquoted export in generated sbatch scripts (avoid shell metacharacters). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1583: Related documentation changes touching evaluation guidance for vLLM parallelism/TP-DP sizing and concurrency knobs.

Suggested reviewers

  • kaix-nv
  • chadvoegele
  • meenchen
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main changes: fixing evaluation and deployment defaults for NVFP4 on Blackwell and nemo-skills evaluations. It matches the primary objectives of the PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | PR contains only documentation (.md) and YAML configuration files; no Python code changes reviewed. Security check not applicable. |


Copilot AI (Contributor) left a comment


Pull request overview

This PR updates the agent-skill documentation and example configs under .claude/skills/** to prevent known failure modes when running NEL-based deployments/evals (notably NVFP4-on-Blackwell and nemo-skills self-deployment).

Changes:

  • Update the evaluation skill and example_eval.yaml to require --served-model-name and document SLURM/container env-var injection requirements (including nemo-skills’ DUMMY_API_KEY).
  • Document Blackwell NVFP4 deployment requirements (CUDA-13 cu130-nightly images) in both deployment and evaluation skills, plus a multimodal-specific vLLM flag.
  • Add PTQ launcher-guide notes about SLURM GPU resource matching and EXTRA_PIP_DEPS quoting/metacharacter pitfalls.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| .claude/skills/ptq/references/launcher-guide.md | Adds guidance on gpus_per_node and EXTRA_PIP_DEPS pitfalls for launcher-generated SLURM scripts. |
| .claude/skills/evaluation/SKILL.md | Clarifies required vLLM serving conventions, Blackwell NVFP4 image requirements, and NEL env var rules. |
| .claude/skills/evaluation/recipes/examples/example_eval.yaml | Updates the example eval template with --served-model-name and required env vars for nemo-skills tasks. |
| .claude/skills/deployment/SKILL.md | Documents Blackwell NVFP4 requirements for vLLM image selection and related flags. |

Comment on lines +40 to +44
> **`EXTRA_PIP_DEPS` must avoid shell metacharacters.** It is written into an
> unquoted `export` in the generated sbatch script, so a value like
> `transformers>=4.57,<4.58` is mangled by shell redirection (`>`/`<`) and
> silently dropped — the deps never install. Use exact pins instead, e.g.
> `EXTRA_PIP_DEPS: "transformers==5.5.0"`.

codecov Bot commented Jun 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.23%. Comparing base (905259f) to head (c0a32a7).

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1595       +/-   ##
===========================================
+ Coverage   55.66%   73.23%   +17.57%     
===========================================
  Files         478      479        +1     
  Lines       52367    52435       +68     
===========================================
+ Hits        29148    38400     +9252     
+ Misses      23219    14035     -9184     
Flag Coverage Δ
unit 53.61% <ø> (-0.02%) ⬇️


hf_model_handle is not reliably mounted at /checkpoint in current NEL: with
only hf_model_handle set, `vllm serve /checkpoint` makes vLLM treat the
literal '/checkpoint' as an HF repo id and the deploy dies with
`HFValidationError: Repo id must use alphanumeric chars ... : '/checkpoint'`.
Document preferring checkpoint_path (download the HF model to the cluster via
snapshot_download first) in the evaluation SKILL Step 3 and example_eval.yaml.

Hit while running a BF16 baseline (Qwen/Qwen3.5-9B) for an NVFP4 comparison.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
walltime: "04:00:00"
# gres defaults to gpu:8. Set it to the cluster's per-node GPU count (and what
# the QOS allows) or sbatch fails "Requested node configuration is not available".
# gres: gpu:4 # e.g. GB300 nodes (4 GPUs); also drop --data-parallel-size to match

Replace "also drop --data-parallel-size to match" with "match --data-parallel-size/--tensor-parallel-size to it"?
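
The relationship discussed here can be expressed as a small consistency check. This is a sketch under a stated assumption: a single-node deployment in which tensor-parallel size times data-parallel size should equal the per-node GPU count requested through gres. The helper names are hypothetical, and only the simple `gpu:N` form of gres shown in the snippet is handled.

```python
def gpus_in_gres(gres):
    """Extract the GPU count from a gres string such as 'gpu:4'."""
    return int(gres.rsplit(":", 1)[-1])


def parallelism_matches(gres, tensor_parallel, data_parallel):
    """True if TP x DP equals the per-node GPU count requested via gres.

    Assumes a single-node deployment that uses every allocated GPU.
    """
    return tensor_parallel * data_parallel == gpus_in_gres(gres)


print(parallelism_matches("gpu:8", 1, 8))  # default gres with DP=8 → True
print(parallelism_matches("gpu:4", 1, 8))  # 4-GPU node, DP left at 8 → False
print(parallelism_matches("gpu:4", 1, 4))  # 4-GPU node, DP lowered to 4 → True
```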

kaix-nv (Contributor) commented Jun 2, 2026

Do we want to add a warning msg for Blackwell-only context? Native FP4 GEMM kernels exist only on sm_100/sm_103. On Hopper/Ada/Ampere there's no native FP4 path and NVFP4 falls back to the marlin weight-dequant backend (slower).
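
A warning of the kind proposed above could be driven by a simple architecture lookup. The sketch below encodes only what this comment states (native FP4 GEMM on sm_100/sm_103, marlin weight-dequant fallback elsewhere); the function name and message wording are hypothetical, and `sm_90` is used as the Hopper example.

```python
# Architectures with native FP4 GEMM kernels, per the comment above.
NATIVE_FP4_ARCHS = {"sm_100", "sm_103"}


def nvfp4_warning(arch):
    """Return a warning string for non-Blackwell architectures, else None."""
    if arch in NATIVE_FP4_ARCHS:
        return None
    return (
        f"{arch}: no native FP4 path; NVFP4 falls back to the marlin "
        "weight-dequant backend (slower)"
    )


print(nvfp4_warning("sm_103"))  # Blackwell → None
print(nvfp4_warning("sm_90"))   # Hopper → fallback warning
```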
