Add Qwen3VL MCore Export support from PR 895 #1482
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,120 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """Custom mapping from Qwen3-VL Hugging Face models to Megatron Core models. | ||
|
|
||
| Qwen3-VL model structure differs from Qwen3: | ||
| - Language model weights are under `model.language_model.` prefix | ||
| - Visual encoder weights are under `model.visual.` prefix | ||
|
|
||
| This module handles the language model conversion for PTQ/QAT workflows. | ||
| Visual components are typically kept in full precision. | ||
|
|
||
| HuggingFace Qwen3-VL-8B structure: | ||
| - model.language_model.embed_tokens.weight | ||
| - model.language_model.layers.{L}.input_layernorm.weight | ||
| - model.language_model.layers.{L}.self_attn.q_proj.weight | ||
| - model.language_model.layers.{L}.self_attn.k_proj.weight | ||
| - model.language_model.layers.{L}.self_attn.v_proj.weight | ||
| - model.language_model.layers.{L}.self_attn.q_norm.weight | ||
| - model.language_model.layers.{L}.self_attn.k_norm.weight | ||
| - model.language_model.layers.{L}.self_attn.o_proj.weight | ||
| - model.language_model.layers.{L}.post_attention_layernorm.weight | ||
| - model.language_model.layers.{L}.mlp.gate_proj.weight | ||
| - model.language_model.layers.{L}.mlp.up_proj.weight | ||
| - model.language_model.layers.{L}.mlp.down_proj.weight | ||
| - model.language_model.norm.weight | ||
| - lm_head.weight | ||
| """ | ||
|
Comment on lines +16 to +33

[SUGGESTION] Scope-clarifying note: Qwen3-VL ships in two architectures [...]. The MoE variant cannot reuse [...]. Consider adding a one-line note to the module docstring (e.g. "Covers the dense Qwen3VL variant only; [...]").
||
|
|
||
| from .mcore_custom import ( | ||
| COL_ETP, | ||
| COL_TP, | ||
| REPLICATE, | ||
| ROW_ETP, | ||
| ROW_TP, | ||
| CustomModuleMapping, | ||
| GatedMLPMerging, | ||
| GatedMLPSlicing, | ||
| NameRemapping, | ||
| QKVMerging, | ||
| QKVSlicing, | ||
| ) | ||
|
|
||
|
Comment on lines +42 to +59

[SUGGESTION] Reconstructing each mapping via [...]. A more robust pattern is to deep-copy the original mapping and just rewrite the prefix:

```python
import copy  # needed for copy.deepcopy below


def _with_language_model_prefix(
    mapping: dict[str, CustomModuleMapping],
) -> dict[str, CustomModuleMapping]:
    result = {}
    for key, m in mapping.items():
        # Copy first so the original mapping object is never mutated.
        new_m = copy.deepcopy(m)
        if new_m.target_name_or_prefix.startswith("model."):
            new_m.target_name_or_prefix = (
                "model.language_model." + new_m.target_name_or_prefix[len("model.") :]
            )
        result[key] = new_m
    return result
```

This preserves [...]
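The prefix-rewriting idea in this suggestion can be tried in isolation. The sketch below is only an illustration under stated assumptions: `DemoMapping` is a hypothetical stand-in for `CustomModuleMapping` (its `func_kwargs` field is invented for the demo), and only the `target_name_or_prefix` attribute name is taken from the reviewer's snippet.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class DemoMapping:
    """Hypothetical stand-in for CustomModuleMapping (illustration only)."""

    target_name_or_prefix: str
    func_kwargs: dict = field(default_factory=dict)  # invented field for the demo


def with_language_model_prefix(mapping):
    """Return a copy of `mapping` with 'model.' rewritten to 'model.language_model.'."""
    result = {}
    for key, m in mapping.items():
        new_m = copy.deepcopy(m)  # leave the source mapping untouched
        if new_m.target_name_or_prefix.startswith("model."):
            new_m.target_name_or_prefix = (
                "model.language_model." + new_m.target_name_or_prefix[len("model."):]
            )
        result[key] = new_m
    return result


base = {
    "word_embeddings": DemoMapping("model.embed_tokens.", {"parallel": "col"}),
    "output_layer": DemoMapping("lm_head."),  # root-level key, must stay unprefixed
}
prefixed = with_language_model_prefix(base)

print(prefixed["word_embeddings"].target_name_or_prefix)  # model.language_model.embed_tokens.
print(prefixed["output_layer"].target_name_or_prefix)     # lm_head.
print(base["word_embeddings"].target_name_or_prefix)      # model.embed_tokens.
```

One caveat of this simple version: it is not idempotent. Applying it to a mapping that already carries the `model.language_model.` prefix would insert `language_model.` a second time, so it should run exactly once on the unprefixed base rules.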
||
| # Import rules: HuggingFace -> Megatron Core | ||
| qwen3vl_causal_lm_import: dict[str, CustomModuleMapping] = { | ||
| # Embeddings - note the language_model prefix | ||
| "word_embeddings": NameRemapping("model.language_model.embed_tokens.", COL_TP), | ||
| # Final layer norm | ||
| "final_layernorm": NameRemapping("model.language_model.norm.", REPLICATE), | ||
| # Output layer (lm_head is at root level, not under language_model) | ||
| "output_layer": NameRemapping("lm_head.", COL_TP), | ||
|
|
||
| # Attention - input layernorm | ||
| "input_layernorm": NameRemapping("model.language_model.layers.{}.input_layernorm.", REPLICATE), | ||
| # Attention - QKV projection (merged) | ||
| "linear_qkv": QKVMerging("model.language_model.layers.{}.self_attn.", COL_TP), | ||
| # Attention - output projection | ||
| "linear_proj": NameRemapping("model.language_model.layers.{}.self_attn.o_proj.", ROW_TP), | ||
| # Attention - Q/K layer norms (Qwen3 uses RMSNorm on Q and K) | ||
| "q_layernorm": NameRemapping("model.language_model.layers.{}.self_attn.q_norm.", REPLICATE), | ||
| "k_layernorm": NameRemapping("model.language_model.layers.{}.self_attn.k_norm.", REPLICATE), | ||
| # MLP - pre-MLP layernorm (post_attention_layernorm in HF) | ||
| "pre_mlp_layernorm": NameRemapping( | ||
| "model.language_model.layers.{}.post_attention_layernorm.", REPLICATE | ||
| ), | ||
| # MLP - gate_proj + up_proj merged into linear_fc1 | ||
| "linear_fc1": GatedMLPMerging("model.language_model.layers.{}.mlp.", COL_TP), | ||
| # MLP - down_proj as linear_fc2 | ||
| "linear_fc2": NameRemapping("model.language_model.layers.{}.mlp.down_proj.", ROW_TP), | ||
| # MoE support (for Qwen3-VL MoE variants like 30B-A3B) | ||
| "router": NameRemapping("model.language_model.layers.{}.mlp.gate.", REPLICATE), | ||
| "local_experts.linear_fc1": GatedMLPMerging( | ||
| "model.language_model.layers.{}.mlp.experts.{}.", COL_ETP | ||
| ), | ||
| "local_experts.linear_fc2": NameRemapping( | ||
| "model.language_model.layers.{}.mlp.experts.{}.down_proj.", ROW_ETP | ||
| ), | ||
| } | ||
|
|
||
| # Export rules: Megatron Core -> HuggingFace | ||
| qwen3vl_causal_lm_export: dict[str, CustomModuleMapping] = { | ||
| # Embeddings | ||
| "word_embeddings": NameRemapping("model.language_model.embed_tokens."), | ||
| # Final layer norm | ||
| "final_layernorm": NameRemapping("model.language_model.norm."), | ||
| # Output layer | ||
| "output_layer": NameRemapping("lm_head."), | ||
| # Attention - input layernorm | ||
| "input_layernorm": NameRemapping("model.language_model.layers.{}.input_layernorm."), | ||
| # Attention - QKV projection (sliced back to separate q/k/v) | ||
| "linear_qkv": QKVSlicing("model.language_model.layers.{}.self_attn."), | ||
| # Attention - output projection | ||
| "linear_proj": NameRemapping("model.language_model.layers.{}.self_attn.o_proj."), | ||
| # Attention - Q/K layer norms | ||
| "q_layernorm": NameRemapping("model.language_model.layers.{}.self_attn.q_norm."), | ||
| "k_layernorm": NameRemapping("model.language_model.layers.{}.self_attn.k_norm."), | ||
| # MLP - pre-MLP layernorm | ||
| "pre_mlp_layernorm": NameRemapping("model.language_model.layers.{}.post_attention_layernorm."), | ||
| # MLP - linear_fc1 sliced back to gate_proj + up_proj | ||
| "linear_fc1": GatedMLPSlicing("model.language_model.layers.{}.mlp."), | ||
| # MLP - down_proj | ||
| "linear_fc2": NameRemapping("model.language_model.layers.{}.mlp.down_proj."), | ||
| # MoE support | ||
| "router": NameRemapping("model.language_model.layers.{}.mlp.gate."), | ||
| "local_experts.linear_fc1": GatedMLPSlicing("model.language_model.layers.{}.mlp.experts.{}."), | ||
| "local_experts.linear_fc2": NameRemapping( | ||
| "model.language_model.layers.{}.mlp.experts.{}.down_proj." | ||
| ), | ||
| } | ||
|
|
||
[IMPORTANT Compatibility] The CHANGELOG says the new mapping "supports both dense and MoE variants", but `mcore_common.py` only registers `Qwen3VLForConditionalGeneration` (the dense architecture). The HF MoE class is `Qwen3VLMoeForConditionalGeneration` and is not added to `all_mcore_hf_export_mapping` / `all_mcore_hf_import_mapping`, so it will fall through to a `KeyError` on `self.rules = self.all_rules[self.arch]` (`unified_export_megatron.py:170`, `megatron_importer.py:106`).

Either (a) also register `"Qwen3VLMoeForConditionalGeneration": qwen3vl_causal_lm_export/import` in `mcore_common.py` (the inherited MoE rules from `qwen3_causal_lm_*` already cover `router` / `local_experts.*`, and the `model.language_model.` prefixing applies to them), or (b) drop the "and MoE variants" claim from the CHANGELOG to avoid promising untested support.
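The lookup failure and option (a) can be sketched in a few lines. This is a simplified illustration, not the actual contents of `mcore_common.py`: the registry name `all_export_mapping`, the helper `lookup_rules`, and the placeholder rule dictionary are assumptions made for the example; only the two architecture class names come from the comment.

```python
# Placeholder rules; the real dictionary holds CustomModuleMapping objects.
qwen3vl_causal_lm_export = {"word_embeddings": "model.language_model.embed_tokens."}

# Hypothetical registry keyed by HF architecture name. Only the dense class
# is registered, mirroring the situation described in the comment.
all_export_mapping = {
    "Qwen3VLForConditionalGeneration": qwen3vl_causal_lm_export,
}


def lookup_rules(registry, arch):
    """Return the rules for `arch`, or raise a descriptive error if unregistered."""
    try:
        return registry[arch]
    except KeyError:
        supported = ", ".join(sorted(registry))
        raise ValueError(
            f"Unsupported architecture {arch!r}; supported: {supported}"
        ) from None


# Option (a): register the MoE class so it resolves to the same rule set.
registry_with_moe = {
    **all_export_mapping,
    "Qwen3VLMoeForConditionalGeneration": qwen3vl_causal_lm_export,
}

moe_rules = lookup_rules(registry_with_moe, "Qwen3VLMoeForConditionalGeneration")
```

With only the dense class registered, the same lookup against `all_export_mapping` raises for the MoE class name. Wrapping the bare dictionary access in a descriptive error is a separate design choice from options (a) and (b), shown here only to make the failure easier to diagnose.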