[fix] Fix NVFP4 AWQ GQA prequant fusion by ShawRong · Pull Request #1520 · NVIDIA/Model-Optimizer

ShawRong · 2026-05-19T13:27:57Z

Summary

Enable the existing grouped-head pre-quant-scale fusion path for NVFP4 AWQ export.
Add a unit regression test to ensure requantize_resmooth_fused_llm_layers() calls fuse_prequant_to_linear(..., fuse_grouped_heads=True) for nvfp4_awq checkpoints.

Motivation

fuse_prequant_to_linear() already supports GQA/MQA grouped-head fusion, but the unified HF export path called it without enabling that mode. For GQA/MQA models, this can leave o_proj.pre_quant_scale unfused because hidden-size pre-quant scales do not match KV projection output channels.

Native vLLM real-quant ModelOpt NVFP4 loading does not consume pre_quant_scale, so AWQ exports should fold these scales when possible.
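To make the shape mismatch concrete, here is a minimal pure-Python sketch of grouped-head averaging. It is not the ModelOpt implementation: the helper names `group_average_scale` and `residual_ratio` are hypothetical, and the sketch assumes query head `h` reads KV head `h // n_rep` (consecutive grouping, as in common Hugging Face GQA layouts). The scale that o_proj expects has `num_heads * head_dim` entries, while v_proj only produces `num_kv_heads * head_dim` output channels, so the scale has to be reduced per KV group before it can be folded into v_proj.

```python
def group_average_scale(scale, num_heads, num_kv_heads, head_dim):
    """Average a per-channel scale over the query heads sharing each KV head.

    Assumption: query head h reads KV head h // n_rep (consecutive grouping).
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    if len(scale) != num_heads * head_dim:
        raise ValueError("scale must have num_heads * head_dim entries")
    n_rep = num_heads // num_kv_heads
    averaged = []
    for kv in range(num_kv_heads):
        for d in range(head_dim):
            # Sum channel d over the n_rep query heads that share this KV head.
            total = sum(scale[(kv * n_rep + r) * head_dim + d] for r in range(n_rep))
            averaged.append(total / n_rep)
    return averaged


def residual_ratio(scale, averaged, num_heads, num_kv_heads, head_dim):
    """Per-channel ratio between the original scale and the group average."""
    n_rep = num_heads // num_kv_heads
    return [
        scale[h * head_dim + d] / averaged[(h // n_rep) * head_dim + d]
        for h in range(num_heads)
        for d in range(head_dim)
    ]


# Toy shapes: 4 query heads, 2 KV heads, head_dim 2.
o_proj_scale = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
v_side_scale = group_average_scale(o_proj_scale, 4, 2, 2)  # 4 entries, one per v_proj channel
leftover = residual_ratio(o_proj_scale, v_side_scale, 4, 2, 2)  # 8 entries, not all 1.0
```

Wherever `leftover` differs from 1.0, the averaged scale does not reproduce the original per-channel scale exactly. How `fuse_prequant_to_linear` treats that residual is not shown in this thread.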

Validation

uv run pytest tests/unit/torch/export/test_unified_export_hf.py -q

Summary by CodeRabbit

New Features

- NVFP4-AWQ model export now enables grouped-head prequant fusion for improved quantization handling.

Tests

- Added test validation for NVFP4-AWQ export behavior with grouped-head prequant fusion.

Signed-off-by: ShawRong <shawnrong1213@gmail.com>

copy-pr-bot · 2026-05-19T13:28:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

coderabbitai · 2026-05-19T13:28:06Z

Walkthrough

This PR enables grouped-head pre-quant scale fusion for NVFP4 AWQ export by updating a function call to include fuse_grouped_heads=True and adds a unit test to validate the behavior.

Changes

NVFP4 Grouped-Head Prequant Fusion

Layer / File(s)	Summary
Grouped-head prequant fusion implementation and validation `modelopt/torch/export/unified_export_hf.py`, `tests/unit/torch/export/test_unified_export_hf.py`	The NVFP4-AWQ path now calls `fuse_prequant_to_linear` with `fuse_grouped_heads=True`. A new unit test validates this behavior by monkeypatching the export helpers, invoking `requantize_resmooth_fused_llm_layers`, and asserting the fusion helper is called with the expected parameter.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

cjluo-nv
meenchen
ChenhanYu

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	No security issues detected. Verified no torch.load(weights_only=False), numpy.load(allow_pickle=True), hardcoded trust_remote_code, eval/exec, nosec comments, or non-permissive dependencies.
Title check	✅ Passed	The title '[fix] Fix NVFP4 AWQ GQA prequant fusion' clearly and specifically describes the main change: enabling grouped-head pre-quant-scale fusion for NVFP4 AWQ export, which directly matches the PR objectives and code modifications.

cjluo-nv

Bot review — DM the bot to share feedback.

Small, focused change that flips on the existing fuse_grouped_heads path of fuse_prequant_to_linear for nvfp4_awq exports, plus a mock-based regression test. Code change itself is one line and looks correct: the grouped-head branch in fuse_prequant_to_linear already implements the GQA averaging math, and only nvfp4_awq is affected (the int4_awq/w4a8_awq paths don't go through this branch).

Reasons to nudge for a human look rather than approve:

- Behavior change for all NVFP4 AWQ consumers, not just vLLM. The PR description justifies folding pre_quant_scale into o_proj weights on the grounds that "Native vLLM real-quant ModelOpt NVFP4 loading does not consume pre_quant_scale." But requantize_resmooth_fused_llm_layers is the unified HF export path used by other backends (TRT-LLM, etc.) too. For GQA/MQA models, the new path replaces a per-channel o_proj.pre_quant_scale with the group-averaged scale folded into v_proj, which is a lossy approximation. Worth confirming a human is comfortable with that trade-off for non-vLLM consumers, or that those consumers handle the scaled weights correctly.
- Test is mock-only. test_nvfp4_awq_export_enables_grouped_head_prequant_fusion monkeypatches fuse_prequant_to_linear, is_moe, collect_shared_input_modules, and _fuse_shared_input_modules, so it just asserts the kwarg is forwarded. It doesn't exercise the actual GQA averaging math in fuse_prequant_to_linear or verify end-to-end output equivalence on a small GQA toy model. A real (even tiny) GQA module test would catch regressions in the fusion math, not just the call site.
- Minor: previous PRs in this area (e.g. PR #1382 fused-MoE fixes) flagged similar "silent change to fusion behavior" concerns; this is on the same surface.

License header on the new test file matches LICENSE_HEADER (2026) — no licensing concern.

meenchen · 2026-05-20T00:23:44Z

    # Fuse pre_quant_scale to the linear weights if possible
    if quantization_format is not None and "nvfp4_awq" in quantization_format.lower():
-        fuse_prequant_to_linear(model)
+        fuse_prequant_to_linear(model, fuse_grouped_heads=True)


@ShawRong can you make this configurable instead of hardcoding? I would like to keep the original behavior because fuse_grouped_heads=True will impact accuracy.
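One possible shape for the requested change, as a sketch only: the wrapper name `fuse_prequant_if_nvfp4_awq` and its `fuse_grouped_heads` parameter are hypothetical, and the fusion helper is passed in as `fuse_fn` so the example runs without ModelOpt installed. The default of `False` keeps the original behavior unless a caller opts in.

```python
from typing import Any, Callable, Optional


def fuse_prequant_if_nvfp4_awq(
    model: Any,
    quantization_format: Optional[str],
    fuse_fn: Callable[..., None],
    fuse_grouped_heads: bool = False,  # hypothetical opt-in; default keeps old behavior
) -> bool:
    """Call fuse_fn for NVFP4 AWQ formats and report whether it was called."""
    if quantization_format is not None and "nvfp4_awq" in quantization_format.lower():
        fuse_fn(model, fuse_grouped_heads=fuse_grouped_heads)
        return True
    return False


# Usage with a recording stand-in for the real fusion helper.
recorded = []


def fake_fuse(model, fuse_grouped_heads=False):
    recorded.append(fuse_grouped_heads)


fuse_prequant_if_nvfp4_awq(object(), "nvfp4_awq", fake_fuse)  # default path
fuse_prequant_if_nvfp4_awq(object(), "nvfp4_awq", fake_fuse, fuse_grouped_heads=True)  # opt-in
```

Where such a flag would actually live (export function argument, config field, or CLI option) is a design decision for the PR author and reviewers.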

Fix NVFP4 AWQ GQA prequant fusion

874d593

Signed-off-by: ShawRong <shawnrong1213@gmail.com>

ShawRong marked this pull request as ready for review May 19, 2026 13:28

ShawRong requested a review from a team as a code owner May 19, 2026 13:28

ShawRong requested a review from cjluo-nv May 19, 2026 13:28

cjluo-nv reviewed May 19, 2026

cjluo-nv requested a review from meenchen May 19, 2026 20:45

ShawRong changed the title from "[codex] Fix NVFP4 AWQ GQA prequant fusion" to "[fix] Fix NVFP4 AWQ GQA prequant fusion" May 19, 2026

meenchen reviewed May 20, 2026

[fix] Fix NVFP4 AWQ GQA prequant fusion #1520
ShawRong wants to merge 1 commit into NVIDIA:main from ShawRong:codex/awq-gqa-prequant-fusion

3 participants
