[fix] Fix NVFP4 AWQ GQA prequant fusion#1520

Open
ShawRong wants to merge 1 commit into
NVIDIA:mainfrom
ShawRong:codex/awq-gqa-prequant-fusion

Conversation

@ShawRong

@ShawRong ShawRong commented May 19, 2026

Summary

  • Enable the existing grouped-head pre-quant-scale fusion path for NVFP4 AWQ export.
  • Add a unit regression test to ensure requantize_resmooth_fused_llm_layers() calls fuse_prequant_to_linear(..., fuse_grouped_heads=True) for nvfp4_awq checkpoints.

Motivation

fuse_prequant_to_linear() already supports GQA/MQA grouped-head fusion, but the unified HF export path called it without enabling that mode. For GQA/MQA models, this can leave o_proj.pre_quant_scale unfused: the scale has one entry per hidden-size channel, while the KV projection it would be folded into has fewer output channels, so the shapes do not match.

Native vLLM real-quant ModelOpt NVFP4 loading does not consume pre_quant_scale, so AWQ exports should fold these scales when possible.
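
To make the shape mismatch concrete, here is a minimal sketch of grouped-head averaging in plain Python. It is an illustration only, not the code in fuse_prequant_to_linear(): the helper names are hypothetical, and it assumes the usual layout in which consecutive query heads share one KV head and channels are ordered head by head.

```python
def group_average_scale(scale, num_heads, num_kv_heads, head_dim):
    """Reduce a per-channel scale of length num_heads * head_dim to
    num_kv_heads * head_dim by averaging over the query heads in each group.

    Assumption: query heads kv * n_rep .. kv * n_rep + n_rep - 1 share KV head kv.
    """
    assert len(scale) == num_heads * head_dim
    assert num_heads % num_kv_heads == 0
    n_rep = num_heads // num_kv_heads  # query heads per KV head
    averaged = []
    for kv in range(num_kv_heads):
        for d in range(head_dim):
            vals = [scale[(kv * n_rep + r) * head_dim + d] for r in range(n_rep)]
            averaged.append(sum(vals) / n_rep)
    return averaged


def expand_to_query_heads(kv_scale, num_heads, num_kv_heads, head_dim):
    """Repeat a KV-sized scale back to num_heads * head_dim channels."""
    n_rep = num_heads // num_kv_heads
    expanded = []
    for kv in range(num_kv_heads):
        head = kv_scale[kv * head_dim:(kv + 1) * head_dim]
        expanded.extend(head * n_rep)  # one copy per query head in the group
    return expanded


# 4 query heads, 2 KV heads, head_dim 2: 8 scale entries versus 4 KV output channels.
o_proj_scale = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kv_scale = group_average_scale(o_proj_scale, num_heads=4, num_kv_heads=2, head_dim=2)
```

Without the averaging step, the 8-entry scale cannot be multiplied into a projection with 4 output channels, which is the case the default call skips.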

Validation

  • uv run pytest tests/unit/torch/export/test_unified_export_hf.py -q

Summary by CodeRabbit

  • New Features

    • NVFP4-AWQ model export now enables grouped-head prequant fusion for improved quantization handling.
  • Tests

    • Added test validation for NVFP4-AWQ export behavior with grouped-head prequant fusion.

Signed-off-by: ShawRong <shawnrong1213@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented May 19, 2026

Walkthrough

This PR enables grouped-head pre-quant scale fusion for NVFP4 AWQ export by updating a function call to include fuse_grouped_heads=True and adds a unit test to validate the behavior.

Changes

NVFP4 Grouped-Head Prequant Fusion

Layer / File(s): Grouped-head prequant fusion implementation and validation (modelopt/torch/export/unified_export_hf.py, tests/unit/torch/export/test_unified_export_hf.py)
Summary: The NVFP4-AWQ path now calls fuse_prequant_to_linear with fuse_grouped_heads=True. A new unit test validates this behavior by monkeypatching the export helpers, invoking requantize_resmooth_fused_llm_layers, and asserting the fusion helper is called with the expected parameter.
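
For readers unfamiliar with this style of test, the pattern can be sketched with the standard library alone. Everything below is a hypothetical stand-in: the `export` namespace and `requantize_sketch` only mimic the shape of the call site and are not ModelOpt code or the test added in this PR.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the export module; the real helper does real work.
export = SimpleNamespace(
    fuse_prequant_to_linear=lambda model, fuse_grouped_heads=False: None
)


def requantize_sketch(model, quantization_format):
    """Mimics only the call site under test, not the real export function."""
    if quantization_format is not None and "nvfp4_awq" in quantization_format.lower():
        export.fuse_prequant_to_linear(model, fuse_grouped_heads=True)


def forwarded_kwargs(quantization_format):
    """Replace the helper with a mock and report the kwargs it received."""
    with mock.patch.object(export, "fuse_prequant_to_linear") as fake:
        requantize_sketch(object(), quantization_format)
    return fake.call_args.kwargs if fake.called else None
```

A test written this way confirms that the keyword argument reaches the helper, and nothing about what the helper does with it.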

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • cjluo-nv
  • meenchen
  • ChenhanYu
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns: ✅ Passed. No security issues detected. Verified no torch.load(weights_only=False), numpy.load(allow_pickle=True), hardcoded trust_remote_code, eval/exec, nosec comments, or non-permissive dependencies.
  • Title check: ✅ Passed. The title '[fix] Fix NVFP4 AWQ GQA prequant fusion' clearly and specifically describes the main change: enabling grouped-head pre-quant-scale fusion for NVFP4 AWQ export, which directly matches the PR objectives and code modifications.


@ShawRong ShawRong marked this pull request as ready for review May 19, 2026 13:28
@ShawRong ShawRong requested a review from a team as a code owner May 19, 2026 13:28
@ShawRong ShawRong requested a review from cjluo-nv May 19, 2026 13:28
Collaborator

@cjluo-nv cjluo-nv left a comment

Bot review — DM the bot to share feedback.

Small, focused change that flips on the existing fuse_grouped_heads path of fuse_prequant_to_linear for nvfp4_awq exports, plus a mock-based regression test. Code change itself is one line and looks correct: the grouped-head branch in fuse_prequant_to_linear already implements the GQA averaging math, and only nvfp4_awq is affected (the int4_awq/w4a8_awq paths don't go through this branch).

Reasons to nudge for a human look rather than approve:

  1. Behavior change for all NVFP4 AWQ consumers, not just vLLM. The PR description justifies folding pre_quant_scale into o_proj weights on the grounds that "Native vLLM real-quant ModelOpt NVFP4 loading does not consume pre_quant_scale." But requantize_resmooth_fused_llm_layers is the unified HF export path used by other backends (TRT-LLM, etc.) too. For GQA/MQA models, the new path replaces a per-channel o_proj.pre_quant_scale with the group-averaged scale folded into v_proj, which is a lossy approximation. Worth confirming a human is comfortable with that trade-off for non-vLLM consumers, or that those consumers handle the scaled weights correctly.

  2. Test is mock-only. test_nvfp4_awq_export_enables_grouped_head_prequant_fusion monkeypatches fuse_prequant_to_linear, is_moe, collect_shared_input_modules, and _fuse_shared_input_modules, so it just asserts the kwarg is forwarded. It doesn't exercise the actual GQA averaging math in fuse_prequant_to_linear or verify end-to-end output equivalence on a small GQA toy model. A real (even tiny) GQA module test would catch regressions in the fusion math, not just the call site.

  3. Minor: previous PRs in this area (e.g. PR #1382 fused-MoE fixes) flagged similar "silent change to fusion behavior" concerns; this is on the same surface.

License header on the new test file matches LICENSE_HEADER (2026) — no licensing concern.
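
As a rough idea of what a non-mock check could look like, here is a toy sketch in plain Python. It is not ModelOpt's implementation or test: all names and values are made up, it handles a single token with attention mixing omitted, and it does not model quantization, so it says nothing about the accuracy trade-off in point 1. It only checks the floating-point identity that a group-averaged fold relies on: fold the averaged scale into the v_proj rows, and compensate the o_proj columns by the ratio of the original scale to the averaged one.

```python
import math


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def repeat_kv(x_kv, num_kv_heads, n_rep, head_dim):
    """Copy each KV head n_rep times so it lines up with the query heads."""
    out = []
    for kv in range(num_kv_heads):
        out.extend(x_kv[kv * head_dim:(kv + 1) * head_dim] * n_rep)
    return out


def reference_output(w_v, w_o, scale, inp, num_kv_heads, n_rep, head_dim):
    """Unfused path: o_proj input is multiplied by its per-channel scale."""
    x = repeat_kv(matvec(w_v, inp), num_kv_heads, n_rep, head_dim)
    return matvec(w_o, [s * xi for s, xi in zip(scale, x)])


def fused_output(w_v, w_o, scale, inp, num_kv_heads, n_rep, head_dim):
    """Fused path: averaged scale in v_proj rows, residual ratio in o_proj columns."""
    avg = [
        sum(scale[(kv * n_rep + r) * head_dim + d] for r in range(n_rep)) / n_rep
        for kv in range(num_kv_heads)
        for d in range(head_dim)
    ]
    rep = repeat_kv(avg, num_kv_heads, n_rep, head_dim)
    w_v_fused = [[a * w for w in row] for a, row in zip(avg, w_v)]
    w_o_fused = [[w * s / r for w, s, r in zip(row, scale, rep)] for row in w_o]
    x = repeat_kv(matvec(w_v_fused, inp), num_kv_heads, n_rep, head_dim)
    return matvec(w_o_fused, x)


# Toy shapes: 2 KV heads, 2 query heads per KV head, head_dim 2, hidden size 3.
W_V = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5], [-2.0, 1.0, 0.75], [0.1, 0.2, 0.3]]
W_O = [
    [1.0, -2.0, 0.5, 0.25, 3.0, -1.0, 0.75, 2.0],
    [0.2, 0.4, -0.6, 0.8, -1.0, 1.2, -1.4, 1.6],
]
SCALE = [0.5, 1.0, 1.5, 2.0, 0.8, 1.2, 1.6, 2.4]
INPUT = [1.0, 2.0, -1.0]

ref = reference_output(W_V, W_O, SCALE, INPUT, 2, 2, 2)
fused = fused_output(W_V, W_O, SCALE, INPUT, 2, 2, 2)
outputs_match = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9) for a, b in zip(ref, fused)
)
```

A real regression test would run the same comparison through the actual fuse_prequant_to_linear on a tiny GQA module, with whatever tolerance the quantized path requires.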

@cjluo-nv cjluo-nv requested a review from meenchen May 19, 2026 20:45
@ShawRong ShawRong changed the title from "[codex] Fix NVFP4 AWQ GQA prequant fusion" to "[fix] Fix NVFP4 AWQ GQA prequant fusion" May 19, 2026
  # Fuse pre_quant_scale to the linear weights if possible
  if quantization_format is not None and "nvfp4_awq" in quantization_format.lower():
-     fuse_prequant_to_linear(model)
+     fuse_prequant_to_linear(model, fuse_grouped_heads=True)
Contributor

@ShawRong can you make this configurable instead of hardcoding it? I would like to keep the original behavior because fuse_grouped_heads=True will impact accuracy.
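
One possible shape for the request above, as a sketch only: the function name `maybe_fuse_prequant` and its parameters are hypothetical and do not exist in ModelOpt. The fusion helper is passed in as `fuse_fn` so the example stays self-contained. The flag defaults to False, which keeps the original behavior unless a caller opts in.

```python
from typing import Callable, Optional


def maybe_fuse_prequant(
    model,
    quantization_format: Optional[str],
    fuse_fn: Callable[..., None],
    fuse_grouped_heads: bool = False,  # opt-in; False preserves the old call
) -> bool:
    """Run fuse_fn for NVFP4 AWQ formats and forward the opt-in flag.

    Returns True if fusion was attempted, False otherwise.
    """
    if quantization_format is None or "nvfp4_awq" not in quantization_format.lower():
        return False
    fuse_fn(model, fuse_grouped_heads=fuse_grouped_heads)
    return True


# Usage with a recording stand-in for the real helper.
calls = []


def recording_fuse(model, fuse_grouped_heads=False):
    calls.append(fuse_grouped_heads)


default_result = maybe_fuse_prequant(object(), "nvfp4_awq", recording_fuse)
opt_in_result = maybe_fuse_prequant(
    object(), "NVFP4_AWQ", recording_fuse, fuse_grouped_heads=True
)
```

Where the flag would be surfaced (an export argument, a config field, or something else) is left open here.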
