
Commit a415667

Enable Qwen3.5-MoE PTQ (#897)
## What does this PR do?

**Type of change:** New model support

**Overview:** Add ModelOpt PTQ support for https://huggingface.co/Qwen/Qwen3.5-397B-A17B

## Usage

```shell
python3 hf_ptq.py --pyt_ckpt_path /home/omniml_data_3/models/Qwen3.5-397B-A17B \
    --qformat nvfp4_mlp_only \
    --export_path /home/omniml_data_3/zhiyuc/checkpoints/Qwen3.5-397B-A17B-NVFP4 \
    --trust_remote_code
```

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Not yet

## Summary by CodeRabbit

- **New Features**
  - Added Qwen3.5 Mixture-of-Experts model support in quantization workflows.
- **Bug Fixes**
  - Enhanced error diagnostics during model export with detailed module information.
  - Improved dataset tokenizer processing with proper truncation and length handling.
  - Fixed model export stability issue related to framework integration.

---------

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent a6cbcba commit a415667

File tree

7 files changed: +204 −30 lines

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions

```diff
@@ -16,6 +16,7 @@ NVIDIA Model Optimizer Changelog (Linux)
 - Add sparse attention optimization for transformer models (``modelopt.torch.sparsity.attention_sparsity``). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add support for rotating the input before quantization for RHT.
 - Add support for advanced weight scale search for NVFP4 quantization and its export path.
+- Enable PTQ workflow for Qwen3.5 MoE models.

 0.42 (2026-02-xx)
 ^^^^^^^^^^^^^^^^^
```

examples/llm_ptq/README.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -106,7 +106,7 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http
 | Llama-Nemotron Ultra ||||||
 | Gemma 3 | ✅<sup>2</sup> | - || - | - |
 | QWen 2, 2.5 <sup>4</sup> ||||||
-| QWen3 MOE, Next <sup>6</sup> || - | - | - ||
+| QWen3, 3.5 MOE, Next <sup>6</sup> || - | - | - ||
 | QwQ || - | - | - ||
 | DeepSeek V3, R1, V3.1, V3.2<sup>7</sup> | - | - | - | - ||
 | GLM-4.7<sup>8</sup> || - | - | - ||
@@ -402,6 +402,7 @@ print(llm_fp8.generate(["What's the age of the earth? "]))
 | QWen3 | FP4 ||| - |
 | QWen3 MoE | FP8 ||||
 | QWen3 MoE | FP4 || - | - |
+| QWen3.5 MoE | FP4 | - | - ||
 | QWen2.5 | FP8 ||||
 | QWen2.5 | FP4 ||| - |
 | QwQ-32B | FP8 ||||
```

examples/llm_ptq/hf_ptq.py

Lines changed: 6 additions & 3 deletions

```diff
@@ -650,16 +650,19 @@ def export_quantized(
         extra_state_dict=mtp_state_dict,
     )

-    # Copy custom model files (Python files and JSON configs) if trust_remote_code is used
-    copy_custom_model_files(args.pyt_ckpt_path, export_path, args.trust_remote_code)
-
     # Restore default padding and export the tokenizer as well.
     if tokenizer is not None:
         tokenizer.padding_side = default_padding_side
         if default_pad_token is not None:
             tokenizer.pad_token = default_pad_token
         tokenizer.save_pretrained(export_path)

+    # Copy custom model files (Python files and JSON configs) if trust_remote_code is used.
+    # This must run AFTER tokenizer.save_pretrained() so original tokenizer files
+    # from the source checkpoint take precedence over regenerated ones (which may
+    # differ in format due to newer transformers versions).
+    copy_custom_model_files(args.pyt_ckpt_path, export_path, args.trust_remote_code)
+
     end_time = time.time()
     print(
         f"Quantized model exported to: {export_path}. Total time used {end_time - start_time}s"
```
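The reordering above can be reproduced with a toy, stdlib-only sketch (hypothetical file contents; not the real `copy_custom_model_files` or `save_pretrained`): the last write to a path wins, so copying source-checkpoint files after the tokenizer export preserves the original files.

```python
import json
import shutil
import tempfile
from pathlib import Path


def simulate_export() -> str:
    """Return which version of tokenizer_config.json survives the export."""
    with tempfile.TemporaryDirectory() as tmp:
        src, export = Path(tmp) / "source_ckpt", Path(tmp) / "export"
        src.mkdir()
        export.mkdir()

        # Original tokenizer config shipped with the source checkpoint.
        (src / "tokenizer_config.json").write_text(json.dumps({"format": "original"}))

        # Step 1: tokenizer.save_pretrained(export) regenerates the file,
        # possibly in a newer format (simulated here as a plain write).
        (export / "tokenizer_config.json").write_text(json.dumps({"format": "regenerated"}))

        # Step 2: copying custom files AFTER the save lets the source version win.
        shutil.copy2(src / "tokenizer_config.json", export / "tokenizer_config.json")

        return json.loads((export / "tokenizer_config.json").read_text())["format"]


print(simulate_export())  # original
```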

modelopt/torch/export/layer_utils.py

Lines changed: 12 additions & 16 deletions
```diff
@@ -327,20 +327,12 @@ def is_mlp(module: nn.Module) -> bool:

 def is_moe(module: nn.Module) -> bool:
     """Returns whether the module is an MOE layer."""
-    return any(
-        key in type(module).__name__.lower()
-        for key in [
-            "MixtralSparseMoeBlock".lower(),
-            "ArcticMoE".lower(),
-            "DbrxFFN".lower(),
-            "MoELayer".lower(),
-            "PhimoeSparseMoeBlock".lower(),
-            "DeepseekMoE".lower(),
-            "Qwen2MoeSparseMoeBlock".lower(),
-            "Qwen3MoeSparseMoeBlock".lower(),
-            "Qwen3NextSparseMoeBlock".lower(),
-        ]
-    )
+    name = type(module).__name__.lower()
+    # Auto-detect common MoE patterns
+    if name.endswith("sparsemoeblock") or "moelayer" in name:
+        return True
+    # Explicit matches for non-standard naming
+    return any(key in name for key in ["arcticmoe", "deepseekmoe", "dbrxffn"])


 def is_quantlinear(module: nn.Module) -> bool:
```
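As a sanity check on the refactor, the new matching rules can be exercised standalone; this sketch operates on class-name strings instead of `nn.Module` instances:

```python
def is_moe_name(class_name: str) -> bool:
    """Mirror of the refactored is_moe() rules, applied to a bare class name."""
    name = class_name.lower()
    # Auto-detect common MoE patterns: a *SparseMoeBlock suffix or a MoELayer substring.
    if name.endswith("sparsemoeblock") or "moelayer" in name:
        return True
    # Explicit matches for non-standard naming.
    return any(key in name for key in ("arcticmoe", "deepseekmoe", "dbrxffn"))


# Every entry from the old hard-coded list still matches...
old_list = [
    "MixtralSparseMoeBlock", "ArcticMoE", "DbrxFFN", "MoELayer",
    "PhimoeSparseMoeBlock", "DeepseekMoE", "Qwen2MoeSparseMoeBlock",
    "Qwen3MoeSparseMoeBlock", "Qwen3NextSparseMoeBlock",
]
print(all(is_moe_name(n) for n in old_list))    # True
# ...and the new Qwen3.5 block is picked up with no further list edits:
print(is_moe_name("Qwen3_5MoeSparseMoeBlock"))  # True
print(is_moe_name("LlamaMLP"))                  # False
```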
```diff
@@ -1006,6 +998,7 @@ def module_match_name_list(module, name_list):
             "Qwen2MoeSparseMoeBlock",
             "Qwen3MoeSparseMoeBlock",
             "Qwen3NextSparseMoeBlock",
+            "Qwen3_5MoeSparseMoeBlock",
             "DeepseekMoE",
         ],
     ):
```
```diff
@@ -1141,7 +1134,10 @@ def set_expert_quantizer_amax(
     # Apply target amax to quantizers that need it
     for module, attr_name, quantizer in all_quantizers:
         # Check if quantizer needs amax (use property for consistency)
-        needs_amax = getattr(quantizer, "amax", None) is None
+        # Also treat zero amax as needing recalibration — a zero amax is never valid
+        # and indicates the quantizer wasn't activated during calibration
+        amax = getattr(quantizer, "amax", None)
+        needs_amax = amax is None or (isinstance(amax, torch.Tensor) and torch.all(amax == 0))

         # Skip dynamic quantizers for input quantizers
         if "input_quantizer" in attr_name and getattr(quantizer, "_dynamic", False):
```
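The intent of the amax change can be sketched torch-free (plain Python numbers stand in for tensors here; the real check additionally verifies `isinstance(amax, torch.Tensor)` before calling `torch.all`):

```python
def needs_amax(amax) -> bool:
    """Simplified sketch of the new check: a missing OR all-zero amax means the
    quantizer was never activated during calibration, so it needs a target amax."""
    if amax is None:
        return True
    values = amax if isinstance(amax, (list, tuple)) else [amax]
    return all(v == 0 for v in values)


print(needs_amax(None))        # True  (never set)
print(needs_amax(0.0))         # True  (zero is never a valid dynamic range)
print(needs_amax([0.0, 0.0]))  # True  (expert saw no calibration tokens)
print(needs_amax([0.0, 2.5]))  # False (a valid calibrated range is present)
```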
```diff
@@ -1747,7 +1743,7 @@ def _split_fused_qkv_weight_and_scaling(

     qkv_in = weight.shape[-1] if weight_dim > 1 else 1

-    num_kv_heads = num_kv_heads if num_kv_heads else num_heads
+    num_kv_heads = num_kv_heads or num_heads
     assert num_heads % num_kv_heads == 0, (
         f"num_heads({num_heads}) must be divisible by num_kv_heads({num_kv_heads}))."
     )
```

modelopt/torch/export/unified_export_hf.py

Lines changed: 66 additions & 8 deletions
```diff
@@ -589,7 +589,7 @@ def _process_quantized_modules(
     """
    fsdp_module_to_reshard = None

-    for _, sub_module in model.named_modules():
+    for name, sub_module in model.named_modules():
        # Optimization to perform resharding only once per decoder layer to avoid extra communication overhead
        if isinstance(sub_module, FSDPModule):
            # Every time we encounter a new FSDPModule, the previous decoder layer is fully processed.
@@ -610,8 +610,13 @@
                sub_module.unpack_weight()
            if get_quantization_format(sub_module) != QUANTIZATION_NONE:
                if is_quantlinear(sub_module):
-                    with fsdp2_aware_weight_update(model, sub_module, reshard=False):
-                        _export_quantized_weight(sub_module, dtype)
+                    try:
+                        with fsdp2_aware_weight_update(model, sub_module, reshard=False):
+                            _export_quantized_weight(sub_module, dtype)
+                    except AssertionError as e:
+                        raise AssertionError(
+                            f"Failed to export module '{name}' (type={type(sub_module).__name__}): {e}"
+                        ) from e
                elif (
                    "Llama4TextExperts" in type(sub_module).__name__
                    or "GptOssExperts" in type(sub_module).__name__
```
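The error-wrapping pattern introduced here, re-raising with the failing module's name while chaining the original assertion via `from e`, can be shown in isolation (hypothetical stand-in functions, not the real export path):

```python
def export_weight(name: str, shape: tuple) -> None:
    # Stand-in for _export_quantized_weight: fails on an internal assertion.
    assert len(shape) == 2, f"expected 2-D weight, got {len(shape)}-D"


def export_all(modules: dict) -> None:
    for name, shape in modules.items():
        try:
            export_weight(name, shape)
        except AssertionError as e:
            # Attach the module name and chain the original error for debugging.
            raise AssertionError(f"Failed to export module '{name}': {e}") from e


try:
    export_all({"model.layers.0.mlp.gate_proj": (4096, 4096, 1)})
except AssertionError as e:
    # The message now identifies which module failed, not just the bare assertion.
    print(e)
```

Without the wrapper, a failure deep inside a 397B-parameter export only reports the bare assertion, with no hint of which of thousands of modules triggered it.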
```diff
@@ -988,6 +993,50 @@ def _export_diffusers_checkpoint(
     print(f"Export complete. Saved to: {export_dir}")


+# TODO: Remove this workaround once HuggingFace fixes revert_weight_conversion to handle
+# scalar (0-d) tensors. The bug is in transformers' Chunk.convert() which calls
+# tensor.size(self.dim) on quantization scale buffers that are 0-d scalars, causing
+# IndexError. Confirmed still present in transformers 5.2.0.
+# See: transformers/core_model_loading.py, Chunk.convert()
+def _revert_weight_conversion_noop(model: Any, state_dict: dict) -> dict:
+    """No-op replacement for transformers' revert_weight_conversion."""
+    return state_dict
+
+
+def _try_patch_module(mod_path: str) -> tuple[Any, Any] | None:
+    """Try to patch revert_weight_conversion in a single module."""
+    import importlib
+
+    try:
+        mod = importlib.import_module(mod_path)
+        if hasattr(mod, "revert_weight_conversion"):
+            original = getattr(mod, "revert_weight_conversion")
+            setattr(mod, "revert_weight_conversion", _revert_weight_conversion_noop)
+            return (mod, original)
+    except (ImportError, AttributeError):
+        pass
+    return None
+
+
+def _patch_revert_weight_conversion() -> list[tuple[Any, Any]]:
+    """Patch revert_weight_conversion in transformers to avoid IndexError on scalar tensors."""
+    patches: list[tuple[Any, Any]] = []
+    for mod_path in [
+        "transformers.core_model_loading",
+        "transformers.modeling_utils",
+    ]:
+        result = _try_patch_module(mod_path)
+        if result is not None:
+            patches.append(result)
+    return patches
+
+
+def _unpatch_revert_weight_conversion(patches: list[tuple[Any, Any]]) -> None:
+    """Restore the original revert_weight_conversion functions."""
+    for mod, original in patches:
+        mod.revert_weight_conversion = original
+
+
 def export_hf_checkpoint(
     model: Any,
     dtype: torch.dtype | None = None,
```
```diff
@@ -1047,11 +1096,20 @@ def export_hf_checkpoint(
         model.hf_quantizer = None

     # Save model
-    model.save_pretrained(
-        export_dir,
-        state_dict={**post_state_dict, **(extra_state_dict or {})},
-        save_modelopt_state=save_modelopt_state,
-    )
+    # Temporarily disable revert_weight_conversion if available — it doesn't handle
+    # quantized state dicts (scalar scale tensors have 0 dimensions, causing IndexError).
+    # We must patch both the source module and the importing module since
+    # modeling_utils does `from core_model_loading import revert_weight_conversion`.
+    _patches = _patch_revert_weight_conversion()
+
+    try:
+        model.save_pretrained(
+            export_dir,
+            state_dict={**post_state_dict, **(extra_state_dict or {})},
+            save_modelopt_state=save_modelopt_state,
+        )
+    finally:
+        _unpatch_revert_weight_conversion(_patches)

     original_config = f"{export_dir}/config.json"
     config_data = {}
```
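The patch/unpatch dance above is the classic fix for `from module import name` aliasing: patching only the defining module leaves a stale reference in every importer. A stdlib-only sketch with two hypothetical modules (`pkg_core`, `pkg_user`; not transformers itself):

```python
import sys
import types

# `pkg_core` defines the function; `pkg_user` imported it by value
# (`from pkg_core import convert`), which creates an independent alias.
core = types.ModuleType("pkg_core")
core.convert = lambda d: {k: v * 2 for k, v in d.items()}
user = types.ModuleType("pkg_user")
user.convert = core.convert  # what `from pkg_core import convert` leaves behind
sys.modules["pkg_core"] = core
sys.modules["pkg_user"] = user


def patch_all(replacement):
    """Patch `convert` in every module holding a reference, saving the originals."""
    patches = []
    for mod_name in ("pkg_core", "pkg_user"):
        mod = sys.modules[mod_name]
        if hasattr(mod, "convert"):
            patches.append((mod, mod.convert))
            mod.convert = replacement
    return patches


def unpatch_all(patches):
    for mod, original in patches:
        mod.convert = original


def noop(d):
    return d


patches = patch_all(noop)
try:
    # Both aliases are now the no-op; patching only pkg_core would miss pkg_user.
    print(user.convert({"a": 1}))  # {'a': 1}
finally:
    unpatch_all(patches)
print(user.convert({"a": 1}))  # {'a': 2}
```

The `try`/`finally` mirrors the change to `export_hf_checkpoint`: the originals are restored even if `save_pretrained` raises.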

modelopt/torch/quantization/plugins/huggingface.py

Lines changed: 115 additions & 0 deletions
```diff
@@ -734,6 +734,107 @@ def forward(
         return next_states


+class _Qwen35MoeExpertModule(nn.Module):
+    """Container for a single Qwen3.5 MoE expert's linear layers.
+
+    Produces the naming pattern: experts.{id}.gate_proj.weight
+    (consistent with standard Qwen3 MoE per-expert module structure).
+    """
+
+    def __init__(self, hidden_dim: int, expert_dim: int):
+        super().__init__()
+        self.gate_proj = nn.Linear(hidden_dim, expert_dim, bias=False)
+        self.up_proj = nn.Linear(hidden_dim, expert_dim, bias=False)
+        self.down_proj = nn.Linear(expert_dim, hidden_dim, bias=False)
+
+
+class _QuantQwen35MoeExperts(QuantModule):
+    def _setup(self):
+        """Modify the Qwen3_5MoeExperts by using per-expert nn.Module containers.
+
+        This produces the naming pattern: experts.{id}.gate_proj.weight
+        (consistent with standard Qwen3 MoE).
+        """
+        from accelerate import init_empty_weights
+
+        dtype, device = self.gate_up_proj.dtype, self.gate_up_proj.device
+
+        def _copy_weight(module, weight):
+            module.to_empty(device=device)
+            with torch.no_grad():
+                module.weight.data = weight.detach().data.to(dtype=dtype, device=device)
+
+        expert_dim = self.intermediate_dim
+
+        with init_empty_weights():
+            expert_modules = nn.ModuleList(
+                [
+                    _Qwen35MoeExpertModule(self.hidden_dim, expert_dim)
+                    for _ in range(self.num_experts)
+                ]
+            )
+
+        for idx in range(self.num_experts):
+            # gate_up_proj shape: (num_experts, 2*intermediate_dim, hidden_dim)
+            # Already in (out_features, in_features) format, no transpose needed
+            _copy_weight(expert_modules[idx].gate_proj, self.gate_up_proj[idx, :expert_dim, :])
+            _copy_weight(expert_modules[idx].up_proj, self.gate_up_proj[idx, expert_dim:, :])
+            # down_proj shape: (num_experts, hidden_dim, intermediate_dim)
+            # Already in (out_features, in_features) format
+            _copy_weight(expert_modules[idx].down_proj, self.down_proj[idx])
+
+        delattr(self, "gate_up_proj")
+        delattr(self, "down_proj")
+        # Register expert modules directly as numbered children (like nn.ModuleList)
+        # so the naming pattern is: experts.{id}.gate_proj.weight (no extra nesting)
+        for idx in range(self.num_experts):
+            self.add_module(str(idx), expert_modules[idx])
+
+    def __len__(self):
+        """Support len() so the module is iterable like standard MoE experts."""
+        return self.num_experts
+
+    def __iter__(self):
+        """Support iteration over expert modules."""
+        for idx in range(self.num_experts):
+            yield getattr(self, str(idx))
+
+    def __getitem__(self, idx):
+        """Support indexing to get individual expert modules."""
+        return getattr(self, str(int(idx)))
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        top_k_index: torch.Tensor,
+        top_k_weights: torch.Tensor,
+    ) -> torch.Tensor:
+        final_hidden_states = torch.zeros_like(hidden_states)
+        with torch.no_grad():
+            expert_mask = torch.nn.functional.one_hot(top_k_index, num_classes=self.num_experts)
+            expert_mask = expert_mask.permute(2, 1, 0)
+            expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero()
+        for expert_idx in expert_hit:
+            expert_idx = expert_idx[0]
+            if expert_idx == self.num_experts:
+                continue
+            with torch.no_grad():
+                top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
+            current_state = hidden_states[token_idx]
+            expert = self[expert_idx]
+            gate = expert.gate_proj(current_state)
+            up = expert.up_proj(current_state)
+            current_hidden_states = self.act_fn(gate) * up
+            current_hidden_states = expert.down_proj(current_hidden_states)
+            current_hidden_states = (
+                current_hidden_states * top_k_weights[token_idx, top_k_pos, None]
+            )
+            final_hidden_states.index_add_(
+                0, token_idx, current_hidden_states.to(final_hidden_states.dtype)
+            )
+        return final_hidden_states
+
+
 class _QuantDbrxFFN(_QuantSparseMoe):
     @property
     def num_experts(self):
```
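The `_setup` above splits the fused `gate_up_proj` tensor of shape `(num_experts, 2*intermediate_dim, hidden_dim)` into per-expert `gate` and `up` halves along dim 1. A torch-free sketch of the slicing with toy dimensions (hypothetical values; nested lists stand in for the tensor):

```python
num_experts, intermediate_dim, hidden_dim = 2, 3, 4

# Toy stand-in for the fused tensor: each entry encodes (expert, row, col)
# so the slices are easy to verify by eye.
gate_up_proj = [
    [[e * 100 + r * 10 + c for c in range(hidden_dim)]
     for r in range(2 * intermediate_dim)]
    for e in range(num_experts)
]

for e in range(num_experts):
    gate = gate_up_proj[e][:intermediate_dim]  # first intermediate_dim rows
    up = gate_up_proj[e][intermediate_dim:]    # remaining intermediate_dim rows
    # Each half is already (out_features, in_features) = (intermediate_dim, hidden_dim),
    # so no transpose is needed before assigning to nn.Linear weights.
    assert len(gate) == len(up) == intermediate_dim
    assert len(gate[0]) == len(up[0]) == hidden_dim

# Expert 0: the first gate row is fused row 0; the first up row is fused
# row intermediate_dim.
print(gate_up_proj[0][0][0], gate_up_proj[0][intermediate_dim][0])  # 0 30
```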
```diff
@@ -882,6 +983,20 @@ def unpack_weight(self):
         pass


+try:
+    from transformers.models.qwen3_5_moe.modeling_qwen3_5_moe import Qwen3_5MoeExperts
+
+    # Qwen3_5MoeSparseMoeBlock registration is handled by register_sparse_moe_on_the_fly
+    # (auto-detected via gate.top_k + gate.num_experts + experts pattern).
+    # Only the fused expert weights need explicit registration.
+    if Qwen3_5MoeExperts not in QuantModuleRegistry:
+        QuantModuleRegistry.register({Qwen3_5MoeExperts: "hf.Qwen3_5MoeExperts"})(
+            _QuantQwen35MoeExperts
+        )
+except ImportError:
+    pass
+
+
 class _QuantGptOssExperts(_QuantFunctionalMixin):
     """Quantized wrapper for `transformers.GptOssExperts`.
```
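The guarded registration keeps the plugin importable when the model class is unavailable (e.g. an older transformers install). A generic sketch of the pattern with a hypothetical optional package and a plain-dict registry:

```python
registry = {}


def register(cls, name):
    """Idempotent registration, mirroring the `not in QuantModuleRegistry` guard."""
    if cls not in registry:
        registry[cls] = name


try:
    # Hypothetical optional dependency: on installs without it, the ImportError
    # is swallowed and this module still imports cleanly.
    from some_optional_pkg import OptionalExperts  # noqa: F401

    register(OptionalExperts, "hf.OptionalExperts")
except ImportError:
    pass

# The `not in` guard also makes repeated registration a no-op:
register(dict, "first")
register(dict, "second")
print(registry[dict])  # first
```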
modelopt/torch/utils/dataset_utils.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -298,7 +298,7 @@ def get_dataset_dataloader(
        An instance of dataloader.
    """
    assert tokenizer is not None, "Please provide a tokenizer."
-    # batch_encode_plus will modify the tokenizer in place, so we need to clone it.
+    # Tokenizer encoding may modify the tokenizer in place, so we need to clone it.
    tokenizer = copy.deepcopy(tokenizer)

    if tokenizer.padding_side != "left":
@@ -323,7 +323,7 @@
        )
        all_samples.extend(samples)

-    batch_encoded = tokenizer.batch_encode_plus(
+    batch_encoded = tokenizer(
        all_samples,
        return_tensors="pt",
        padding=True,
```
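The updated comment explains why the tokenizer is cloned a couple of lines below it: encoding can mutate tokenizer state, and the caller's tokenizer should stay untouched. A minimal stand-in demonstrating the `copy.deepcopy` isolation (a fake object, not the real HF tokenizer API):

```python
import copy


class FakeTokenizer:
    """Stand-in with mutable state; not the real HF tokenizer API."""

    def __init__(self):
        self.padding_side = "right"


original = FakeTokenizer()
local = copy.deepcopy(original)  # clone before any mutation
local.padding_side = "left"      # calibration batching needs left padding

# The caller's tokenizer is unaffected by the local change.
print(original.padding_side, local.padding_side)  # right left
```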
