[fix] Fix NVFP4 AWQ GQA prequant fusion #1520
Signed-off-by: ShawRong <shawnrong1213@gmail.com>
📝 Walkthrough
This PR enables grouped-head pre-quant scale fusion for NVFP4 AWQ export by updating a function call to include
Changes: NVFP4 Grouped-Head Prequant Fusion
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ 6 passed
cjluo-nv
left a comment
Bot review — DM the bot to share feedback.
Small, focused change that flips on the existing `fuse_grouped_heads` path of `fuse_prequant_to_linear` for `nvfp4_awq` exports, plus a mock-based regression test. The code change itself is one line and looks correct: the grouped-head branch in `fuse_prequant_to_linear` already implements the GQA averaging math, and only `nvfp4_awq` is affected (the `int4_awq`/`w4a8_awq` paths don't go through this branch).
Reasons to nudge for a human look rather than approve:
- Behavior change for all NVFP4 AWQ consumers, not just vLLM. The PR description justifies folding `pre_quant_scale` into `o_proj` weights on the grounds that "Native vLLM real-quant ModelOpt NVFP4 loading does not consume `pre_quant_scale`." But `requantize_resmooth_fused_llm_layers` is the unified HF export path used by other backends (TRT-LLM, etc.) too. For GQA/MQA models, the new path replaces a per-channel `o_proj.pre_quant_scale` with the group-averaged scale folded into `v_proj`, which is a lossy approximation. Worth confirming a human is comfortable with that trade-off for non-vLLM consumers, or that those consumers handle the scaled weights correctly.
- Test is mock-only. `test_nvfp4_awq_export_enables_grouped_head_prequant_fusion` monkeypatches `fuse_prequant_to_linear`, `is_moe`, `collect_shared_input_modules`, and `_fuse_shared_input_modules`, so it just asserts the kwarg is forwarded. It doesn't exercise the actual GQA averaging math in `fuse_prequant_to_linear` or verify end-to-end output equivalence on a small GQA toy model. A real (even tiny) GQA module test would catch regressions in the fusion math, not just the call site.
- Minor: previous PRs in this area (e.g. PR #1382 fused-MoE fixes) flagged similar "silent change to fusion behavior" concerns; this is on the same surface.
License header on the new test file matches LICENSE_HEADER (2026) — no licensing concern.
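The group averaging that the review refers to can be illustrated with a minimal pure-Python sketch. It is not the `fuse_prequant_to_linear` implementation: the helper names `group_average_scale` and `expand_to_query_heads` are hypothetical, and the layout (channels ordered head by head, each KV head serving a run of consecutive query heads) is an assumption made for illustration.

```python
from typing import List


def group_average_scale(
    scale: List[float], num_q_heads: int, num_kv_heads: int, head_dim: int
) -> List[float]:
    """Average a hidden-size scale over the query heads that share each KV head.

    Assumes channels are laid out as [q_head][dim] and that each KV head
    serves `num_q_heads // num_kv_heads` consecutive query heads.
    """
    assert len(scale) == num_q_heads * head_dim
    assert num_q_heads % num_kv_heads == 0
    n_rep = num_q_heads // num_kv_heads
    averaged = []
    for g in range(num_kv_heads):
        for d in range(head_dim):
            vals = [scale[(g * n_rep + r) * head_dim + d] for r in range(n_rep)]
            averaged.append(sum(vals) / n_rep)
    return averaged  # length: num_kv_heads * head_dim


def expand_to_query_heads(
    kv_scale: List[float], num_q_heads: int, num_kv_heads: int, head_dim: int
) -> List[float]:
    """Repeat the per-KV-head scale so it lines up with the hidden-size channels."""
    n_rep = num_q_heads // num_kv_heads
    expanded = []
    for h in range(num_q_heads):
        g = h // n_rep
        expanded.extend(kv_scale[g * head_dim:(g + 1) * head_dim])
    return expanded


# 4 query heads, 2 KV heads, head_dim 2: the scale has 8 entries,
# while a KV projection only has 4 output channels.
original = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
averaged = group_average_scale(original, num_q_heads=4, num_kv_heads=2, head_dim=2)
expanded = expand_to_query_heads(averaged, num_q_heads=4, num_kv_heads=2, head_dim=2)
print(averaged)  # [2.0, 3.0, 6.0, 7.0]
print(expanded)  # [2.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0, 7.0]
```

In this toy case the expanded scale differs from the original per-channel scale whenever heads in the same group disagree, which is the approximation the reviewer is pointing at. With MHA (one query head per KV head) the average is a no-op.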
  # Fuse pre_quant_scale to the linear weights if possible
  if quantization_format is not None and "nvfp4_awq" in quantization_format.lower():
-     fuse_prequant_to_linear(model)
+     fuse_prequant_to_linear(model, fuse_grouped_heads=True)
@ShawRong can you make this configurable instead of hardcoding? I would like to keep the original behavior because fuse_grouped_heads=True will impact accuracy.
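One way the request could be met is sketched below, assuming an opt-in flag that defaults to the original behavior. The wrapper name `maybe_fuse_prequant` and its `fuse_grouped_heads` parameter are hypothetical, and the fusion function is injected as `fuse_fn` so the sketch stays self-contained; how the option would actually be threaded through the export API is not specified in this thread.

```python
from typing import Any, Callable, Optional


def maybe_fuse_prequant(
    model: Any,
    quantization_format: Optional[str],
    fuse_fn: Callable[..., None],
    fuse_grouped_heads: bool = False,  # hypothetical option; off by default
) -> bool:
    """Call fuse_fn for nvfp4_awq formats; return True if fusion was attempted."""
    if quantization_format is None or "nvfp4_awq" not in quantization_format.lower():
        return False
    if fuse_grouped_heads:
        # Opt-in path: also fuse GQA/MQA grouped heads.
        fuse_fn(model, fuse_grouped_heads=True)
    else:
        # Default path: same call as before the PR.
        fuse_fn(model)
    return True
```

With this shape, existing callers keep the pre-PR call unchanged, and a caller targeting a loader that does not consume `pre_quant_scale` would pass `fuse_grouped_heads=True` explicitly.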
Summary
`requantize_resmooth_fused_llm_layers()` calls `fuse_prequant_to_linear(..., fuse_grouped_heads=True)` for `nvfp4_awq` checkpoints.
Motivation
`fuse_prequant_to_linear()` already supports GQA/MQA grouped-head fusion, but the unified HF export path called it without enabling that mode. For GQA/MQA models, this can leave `o_proj.pre_quant_scale` unfused because hidden-size pre-quant scales do not match KV projection output channels.
Native vLLM real-quant ModelOpt NVFP4 loading does not consume `pre_quant_scale`, so AWQ exports should fold these scales when possible.
Validation
`uv run pytest tests/unit/torch/export/test_unified_export_hf.py -q`
Summary by CodeRabbit
New Features
Tests