Commit 7e8d221

Move Qwen3.5/Qwen3.6 W4A16 recipe to huggingface/<model_type>/ptq/ convention
Relocate the Qwen3.5/3.6 W4A16 recipe from modelopt_recipes/models/ to the per-model_type layout under modelopt_recipes/huggingface/qwen3_5/ptq/ and modelopt_recipes/huggingface/qwen3_5_moe/ptq/, matching the convention used by the other model-specific recipes.

The two HuggingFace model_types share the same hybrid linear-attention + softmax-attention architecture (transformers qwen3_5 and qwen3_5_moe), so the recipe's quant_cfg list applies identically to both. Extract that list into a single QuantizerCfgListConfig snippet (nvfp4_mlp-fp8_attn-kv_fp8_cast.quant_cfg.yaml) and have both per-model_type recipe wrappers $import it, so there is one source of truth.

Inline the shared base_disable_all and default_disabled_quantizers units via $import, replace the inline NVFP4/FP8 cfg literals with the existing configs/numerics/{nvfp4,fp8} snippets, and replace the explicit FP8 KV constant-amax block with the existing configs/ptq/units/kv_fp8_cast unit.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
1 parent fe7a3b8 commit 7e8d221

4 files changed

Lines changed: 170 additions & 122 deletions

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Shared `quant_cfg` snippet for the Qwen3.5 family's
+# `nvfp4_mlp-fp8_attn-kv_fp8_cast` recipe. Imported by both
+# `huggingface/qwen3_5/ptq/nvfp4_mlp-fp8_attn-kv_fp8_cast.yaml` (dense
+# `qwen3_5`) and `huggingface/qwen3_5_moe/ptq/nvfp4_mlp-fp8_attn-kv_fp8_cast.yaml`
+# (MoE `qwen3_5_moe`); the two families share the hybrid linear-attention +
+# softmax-attention architecture, so the wildcard rules apply identically.
+# MoE-only patterns inside `default_disabled_quantizers`
+# (`*block_sparse_moe.gate*`, `*mlp.shared_expert_gate.*`, `*router*`) are
+# no-ops on dense.
+
+# modelopt-schema: modelopt.torch.quantization.config.QuantizerCfgListConfig
+imports:
+  base_disable_all: configs/ptq/units/base_disable_all
+  default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers
+  fp8: configs/numerics/fp8
+  kv_fp8_cast: configs/ptq/units/kv_fp8_cast
+  nvfp4: configs/numerics/nvfp4
+---
+- $import: base_disable_all
+
+# W4A16 NVFP4 on MLP projection targets. The gate/up/down projection patterns
+# cover dense MLPs, shared experts, and fused MoE expert quantizers
+# (e.g. gate_up_proj_weight_quantizers.N).
+- quantizer_name: '*mlp*gate_proj*weight_quantizer*'
+  cfg: {$import: nvfp4}
+- quantizer_name: '*mlp*up_proj*weight_quantizer*'
+  cfg: {$import: nvfp4}
+- quantizer_name: '*mlp*down_proj*weight_quantizer*'
+  cfg: {$import: nvfp4}
+
+# FP8 self-attention projections.
+- quantizer_name: '*self_attn*weight_quantizer'
+  cfg: {$import: fp8}
+- quantizer_name: '*self_attn*input_quantizer'
+  cfg: {$import: fp8}
+
+# FP8 large linear-attention projections. in_proj_a / in_proj_b / conv1d
+# remain disabled to match the reference checkpoint policy.
+- quantizer_name: '*linear_attn.in_proj_qkv*weight_quantizer'
+  cfg: {$import: fp8}
+- quantizer_name: '*linear_attn.in_proj_qkv*input_quantizer'
+  cfg: {$import: fp8}
+- quantizer_name: '*linear_attn.in_proj_z*weight_quantizer'
+  cfg: {$import: fp8}
+- quantizer_name: '*linear_attn.in_proj_z*input_quantizer'
+  cfg: {$import: fp8}
+- quantizer_name: '*linear_attn.out_proj*weight_quantizer'
+  cfg: {$import: fp8}
+- quantizer_name: '*linear_attn.out_proj*input_quantizer'
+  cfg: {$import: fp8}
+
+# FP8 KV cache with constant amax.
+- $import: kv_fp8_cast
+
+# Standard exclusions (BatchNorm, LeakyReLU, gates, routers, conv1d, output
+# heads, etc.). Includes `*lm_head*` disable, which is re-enabled below.
+- $import: default_disabled_quantizers
+
+# Qwen-specific exclusions: linear-attention sub-modules that are not in the
+# reference recipe, and any visual / MTP siblings on multimodal releases.
+- quantizer_name: '*linear_attn.in_proj_a*'
+  enable: false
+- quantizer_name: '*linear_attn.in_proj_b*'
+  enable: false
+- quantizer_name: '*visual*'
+  enable: false
+- quantizer_name: '*vision_tower*'
+  enable: false
+- quantizer_name: '*mtp*'
+  enable: false
+
+# Re-enable NVFP4 on lm_head weights. Must come after
+# default_disabled_quantizers, which disables `*lm_head*`.
+- quantizer_name: '*lm_head*weight_quantizer'
+  cfg: {$import: nvfp4}
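
The ordering comments in this snippet (disable everything first, re-enable `lm_head` weights last) suggest that rules are applied in file order and that a later match overrides an earlier one. The Python sketch below illustrates that reading only. It is not ModelOpt's matcher: the stdlib `fnmatch` module stands in for the wildcard matching, `"*"` is an assumed stand-in for what `base_disable_all` expands to, the rule list is abbreviated, and the module names passed to the hypothetical `resolve` helper are made up.

```python
from fnmatch import fnmatchcase

# Abbreviated, hypothetical rule list in file order: (pattern, setting).
RULES = [
    ("*", "disabled"),                              # assumed effect of base_disable_all
    ("*mlp*gate_proj*weight_quantizer*", "nvfp4"),
    ("*self_attn*weight_quantizer", "fp8"),
    ("*lm_head*", "disabled"),                      # as in default_disabled_quantizers
    ("*linear_attn.in_proj_a*", "disabled"),
    ("*lm_head*weight_quantizer", "nvfp4"),         # re-enable; must come last
]


def resolve(name, rules=RULES):
    """Return the setting of the last rule whose pattern matches `name`."""
    setting = None
    for pattern, value in rules:
        if fnmatchcase(name, pattern):
            setting = value  # a later match overrides an earlier one
    return setting


# Hypothetical quantizer names, for illustration only.
for name in (
    "lm_head.weight_quantizer",
    "lm_head.input_quantizer",
    "model.layers.0.mlp.gate_proj.weight_quantizer",
    "model.layers.0.linear_attn.in_proj_a.weight_quantizer",
):
    print(name, "->", resolve(name))
```

Under this assumption, moving the `*lm_head*weight_quantizer` rule above the `default_disabled_quantizers` import would leave the `lm_head` weights disabled, which is why the snippet keeps it at the end.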
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# NVFP4 MLP / FP8 attention / FP8 KV-cast PTQ recipe for HuggingFace `qwen3_5`
+# (dense) models. Covers Qwen3.5 and Qwen3.6 dense releases, which share the
+# `qwen3_5` model_type and hybrid linear-attention + softmax-attention
+# architecture. Shares its `quant_cfg` with the MoE counterpart at
+# `huggingface/qwen3_5_moe/ptq/nvfp4_mlp-fp8_attn-kv_fp8_cast.yaml`; the
+# snippet lives under
+# `huggingface/qwen3_5/ptq/nvfp4_mlp-fp8_attn-kv_fp8_cast.quant_cfg.yaml`.
+
+imports:
+  quant_cfg: huggingface/qwen3_5/ptq/nvfp4_mlp-fp8_attn-kv_fp8_cast.quant_cfg
+
+metadata:
+  recipe_type: ptq
+  description: >-
+    NVFP4 MLP / FP8 attention / FP8 KV-cast PTQ recipe for HuggingFace
+    `qwen3_5` (dense) models: NVFP4 for MLP projection weights and lm_head;
+    FP8 for self-attention and the large linear-attention projections; FP8 KV
+    cache with constant amax.
+quantize:
+  algorithm:
+    method: max
+    layerwise: false
+  quant_cfg:
+    - $import: quant_cfg
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# NVFP4 MLP / FP8 attention / FP8 KV-cast PTQ recipe for HuggingFace
+# `qwen3_5_moe` models. Covers Qwen3.5-MoE and Qwen3.6-MoE releases, which
+# share the `qwen3_5_moe` model_type and hybrid linear-attention +
+# softmax-attention MoE architecture. Shares its `quant_cfg` with the dense
+# counterpart at
+# `huggingface/qwen3_5/ptq/nvfp4_mlp-fp8_attn-kv_fp8_cast.yaml`; the snippet
+# lives under
+# `huggingface/qwen3_5/ptq/nvfp4_mlp-fp8_attn-kv_fp8_cast.quant_cfg.yaml`.
+
+imports:
+  quant_cfg: huggingface/qwen3_5/ptq/nvfp4_mlp-fp8_attn-kv_fp8_cast.quant_cfg
+
+metadata:
+  recipe_type: ptq
+  description: >-
+    NVFP4 MLP / FP8 attention / FP8 KV-cast PTQ recipe for HuggingFace
+    `qwen3_5_moe` models (Qwen3.5-MoE and Qwen3.6-MoE releases): NVFP4 for MoE
+    / shared-expert MLP projection weights and lm_head; FP8 for self-attention
+    and the large linear-attention projections; FP8 KV cache with constant
+    amax.
+quantize:
+  algorithm:
+    method: max
+    layerwise: false
+  quant_cfg:
+    - $import: quant_cfg
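
Both wrappers import the same `quant_cfg` snippet, so the dense and MoE recipes should expand to the same rule list. The Python sketch below shows that idea with a toy resolver. It is not ModelOpt's recipe loader: `SNIPPETS`, `expand`, the key `shared.quant_cfg`, and the placeholder numerics dict are all hypothetical, and the `imports:` alias table of the real files is skipped in favor of direct keys. It also assumes that a list-valued `$import` is spliced into the enclosing list.

```python
# Hypothetical in-memory stand-ins for the YAML snippets.
SNIPPETS = {
    "shared.quant_cfg": [
        {"quantizer_name": "*mlp*down_proj*weight_quantizer*",
         "cfg": {"$import": "nvfp4"}},
        {"quantizer_name": "*mtp*", "enable": False},
    ],
    "nvfp4": {"placeholder": "nvfp4 numerics"},  # not the real numerics config
}


def is_import(node):
    """True for a mapping of the form {"$import": key}."""
    return isinstance(node, dict) and set(node) == {"$import"}


def expand(node):
    """Recursively replace {"$import": key} nodes with the referenced snippet."""
    if is_import(node):
        return expand(SNIPPETS[node["$import"]])
    if isinstance(node, dict):
        return {key: expand(value) for key, value in node.items()}
    if isinstance(node, list):
        out = []
        for item in node:
            value = expand(item)
            if is_import(item) and isinstance(value, list):
                out.extend(value)  # splice a list-valued import in place
            else:
                out.append(value)
        return out
    return node


# Two wrappers that differ only in metadata import the same list.
dense = expand({"quant_cfg": [{"$import": "shared.quant_cfg"}]})
moe = expand({"quant_cfg": [{"$import": "shared.quant_cfg"}]})
print(dense == moe)
```

Keeping the list in one snippet means a rule change is made once and reaches both model_types, which is the "one source of truth" the commit message describes.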

modelopt_recipes/models/Qwen3.5-Qwen3.6/w4a16.yaml

Lines changed: 0 additions & 122 deletions
This file was deleted.
