
Commit 910dc49

Add Qwen3.6 W4A16 PTQ recipe (#1503)
## Summary by CodeRabbit

* **New Features**
  * Added a W4A16 PTQ recipe for Qwen3.5/Qwen3.6 with mixed-precision rules (NVFP4 for MLP projections, FP8 for attention layers and KV cache).
* **Updates**
  * PTQ workflow now respects supplied recipes and applies quantization rules to the full model (recipe-driven targeting enabled).

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
1 parent e4dc020 commit 910dc49

2 files changed

Lines changed: 133 additions & 6 deletions

File tree

examples/llm_ptq/hf_ptq.py

Lines changed: 11 additions & 6 deletions
@@ -542,12 +542,17 @@ def load_model(args: argparse.Namespace):
         ]

         # We only quantize the language model for VLMs other than the type supported above.
-        extracted_lm, extracted_model_type = extract_and_prepare_language_model_from_vl(
-            full_model
-        )
-        if extracted_lm is not None:
-            language_model = extracted_lm
-            model_type = extracted_model_type
+        # Recipe mode is the exception: in Qwen3.5/3.6-MoE VLMs, lm_head sits
+        # on the outer CausalLM, not the inner language backbone. A recipe that targets
+        # lm_head must therefore quantize against the full model and explicitly keep visual
+        # and MTP siblings disabled.
+        if args.recipe is None:
+            extracted_lm, extracted_model_type = extract_and_prepare_language_model_from_vl(
+                full_model
+            )
+            if extracted_lm is not None:
+                language_model = extracted_lm
+                model_type = extracted_model_type

     tokenizer = get_tokenizer(args.pyt_ckpt_path, trust_remote_code=args.trust_remote_code)
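The comment in this hunk explains why recipe mode skips language-model extraction. A minimal sketch of that reasoning follows, using plain strings in place of real modules. The quantizer names, the `model.language_model.` prefix, and the helper functions are hypothetical and only illustrate the idea; they are not taken from `hf_ptq.py` or from the Model Optimizer API.

```python
from fnmatch import fnmatchcase

# Hypothetical quantizer names for a VLM wrapper; real module paths may differ.
FULL_MODEL_QUANTIZERS = [
    "model.language_model.layers.0.mlp.gate_proj.weight_quantizer",
    "model.language_model.layers.0.self_attn.q_proj.weight_quantizer",
    "model.visual.blocks.0.attn.qkv.weight_quantizer",
    "lm_head.weight_quantizer",
]

# Assumed location of the inner language backbone inside the wrapper.
BACKBONE_PREFIX = "model.language_model."


def visible_quantizers(recipe, full_names=FULL_MODEL_QUANTIZERS):
    """Return the quantizer names reachable from the module that gets quantized."""
    if recipe is None:
        # Default path: only the extracted language backbone is quantized.
        return [
            name[len(BACKBONE_PREFIX):]
            for name in full_names
            if name.startswith(BACKBONE_PREFIX)
        ]
    # Recipe path: the full model is quantized, so lm_head is reachable.
    return list(full_names)


def matches(names, pattern):
    """Return the names selected by a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


LM_HEAD_RULE = "*lm_head*weight_quantizer"
print(matches(visible_quantizers(None), LM_HEAD_RULE))           # []
print(matches(visible_quantizers("recipe.yaml"), LM_HEAD_RULE))  # ['lm_head.weight_quantizer']
```

In this toy setup the `lm_head` rule selects nothing when only the backbone is visible, which is the situation the new `if args.recipe is None:` guard avoids. The flip side is that the visual tower also becomes visible, so the recipe has to disable it explicitly.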

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+metadata:
+  recipe_type: ptq
+  description: >
+    W4A16 PTQ recipe for Qwen3.5/Qwen3.6 models: W4A16 NVFP4 for dense MLP,
+    routed MoE, shared-expert MLP projections, and lm_head; FP8 for
+    self-attention and the large linear-attention projections; FP8 KV cache
+    with constant amax (fp8_cast behavior).
+
+quantize:
+  algorithm:
+    method: max
+    layerwise: false
+
+  quant_cfg:
+    - quantizer_name: '*'
+      enable: false
+
+    # W4A16 NVFP4 MLP projection targets. Matching the gate/up/down projection
+    # names covers dense MLPs, shared experts, and fused MoE expert quantizers
+    # such as gate_up_proj_weight_quantizers.N/down_proj_weight_quantizers.N.
+    - quantizer_name: '*mlp*gate_proj*weight_quantizer*'
+      enable: true
+      cfg: &nvfp4_cfg
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+        num_bits: e2m1
+    - quantizer_name: '*mlp*up_proj*weight_quantizer*'
+      enable: true
+      cfg: *nvfp4_cfg
+    - quantizer_name: '*mlp*down_proj*weight_quantizer*'
+      enable: true
+      cfg: *nvfp4_cfg
+    - quantizer_name: '*lm_head*weight_quantizer'
+      enable: true
+      cfg: *nvfp4_cfg
+
+    # FP8 self-attention projections.
+    - quantizer_name: '*self_attn*weight_quantizer'
+      enable: true
+      cfg: &fp8_cfg
+        num_bits: e4m3
+        axis:
+    - quantizer_name: '*self_attn*input_quantizer'
+      enable: true
+      cfg: *fp8_cfg
+
+    # FP8 large linear-attention projections. Keep in_proj_a, in_proj_b, and
+    # conv1d disabled to match the reference checkpoint policy.
+    - quantizer_name: '*linear_attn.in_proj_qkv*weight_quantizer'
+      enable: true
+      cfg: *fp8_cfg
+    - quantizer_name: '*linear_attn.in_proj_qkv*input_quantizer'
+      enable: true
+      cfg: *fp8_cfg
+    - quantizer_name: '*linear_attn.in_proj_z*weight_quantizer'
+      enable: true
+      cfg: *fp8_cfg
+    - quantizer_name: '*linear_attn.in_proj_z*input_quantizer'
+      enable: true
+      cfg: *fp8_cfg
+    - quantizer_name: '*linear_attn.out_proj*weight_quantizer'
+      enable: true
+      cfg: *fp8_cfg
+    - quantizer_name: '*linear_attn.out_proj*input_quantizer'
+      enable: true
+      cfg: *fp8_cfg
+
+    # FP8 KV cache with constant amax. This matches fp8_cast behavior and
+    # avoids exporting per-layer KV scale tensors.
+    - quantizer_name: '*[kv]_bmm_quantizer'
+      enable: true
+      cfg:
+        num_bits: e4m3
+        axis:
+        use_constant_amax: true
+
+    # Explicitly keep non-reference targets unquantized.
+    - quantizer_name: '*linear_attn.conv1d*'
+      enable: false
+    - quantizer_name: '*linear_attn.in_proj_a*'
+      enable: false
+    - quantizer_name: '*linear_attn.in_proj_b*'
+      enable: false
+    - quantizer_name: '*mlp.gate.*'
+      enable: false
+    - quantizer_name: '*mlp.shared_expert_gate.*'
+      enable: false
+    - quantizer_name: '*router*'
+      enable: false
+    - quantizer_name: '*block_sparse_moe.gate*'
+      enable: false
+    - quantizer_name: '*mixer.conv1d*'
+      enable: false
+    - quantizer_name: '*output_layer*'
+      enable: false
+    - quantizer_name: '*proj_out.*'
+      enable: false
+    - quantizer_name: 'output.*'
+      enable: false
+    - quantizer_name: '*visual*'
+      enable: false
+    - quantizer_name: '*vision_tower*'
+      enable: false
+    - quantizer_name: '*mtp*'
+      enable: false
+    - parent_class: 'nn.BatchNorm1d'
+      quantizer_name: '*'
+      enable: false
+    - parent_class: 'nn.BatchNorm2d'
+      quantizer_name: '*'
+      enable: false
+    - parent_class: 'nn.BatchNorm3d'
+      quantizer_name: '*'
+      enable: false
+    - parent_class: 'nn.LeakyReLU'
+      quantizer_name: '*'
+      enable: false
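The recipe opens with a catch-all `'*'` rule that disables everything, re-enables selected quantizers, and ends with a block of explicit disables. The sketch below shows how such an ordered rule list could resolve for a few quantizer names. It assumes shell-style wildcard matching (Python's `fnmatch`) and that a later matching rule overrides an earlier one. Both are assumptions about how the recipe is interpreted, not something this commit confirms. The rule list is abridged, and the `cfg` labels and module names are hypothetical.

```python
from fnmatch import fnmatchcase

# Abridged copy of the recipe's rules, in file order: (pattern, enable, cfg label).
# The cfg labels are hypothetical stand-ins for the YAML cfg mappings.
RULES = [
    ("*", False, None),
    ("*mlp*gate_proj*weight_quantizer*", True, "nvfp4"),
    ("*mlp*up_proj*weight_quantizer*", True, "nvfp4"),
    ("*mlp*down_proj*weight_quantizer*", True, "nvfp4"),
    ("*lm_head*weight_quantizer", True, "nvfp4"),
    ("*self_attn*weight_quantizer", True, "fp8"),
    ("*self_attn*input_quantizer", True, "fp8"),
    ("*[kv]_bmm_quantizer", True, "fp8_kv"),
    ("*mlp.gate.*", False, None),
    ("*visual*", False, None),
    ("*mtp*", False, None),
]


def resolve(name, rules=RULES):
    """Return (enabled, cfg label) for a quantizer name.

    Assumption: rules apply in order and the last matching rule wins.
    """
    state = (False, None)
    for pattern, enable, cfg in rules:
        if fnmatchcase(name, pattern):
            state = (enable, cfg)
    return state


# Hypothetical quantizer names, chosen to exercise each kind of rule.
for name in [
    "model.layers.0.mlp.gate_proj.weight_quantizer",
    "model.layers.0.mlp.experts.gate_up_proj_weight_quantizers.3",
    "model.layers.0.mlp.gate.weight_quantizer",
    "model.layers.3.self_attn.k_bmm_quantizer",
    "model.layers.3.self_attn.q_bmm_quantizer",
    "mtp.layers.0.self_attn.q_proj.weight_quantizer",
]:
    print(name, resolve(name))
```

Under these assumptions, the trailing `*` on the MLP patterns is what lets them reach fused MoE names ending in `weight_quantizers.N`, `[kv]` selects the K and V cache quantizers but not Q, and the `*mtp*` rule turns off a quantizer that the `self_attn` rule had enabled earlier in the list.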
