Commit f0d2237
Add Qwen3VL MCore Export support from PR 895 (#1482)
# [Megatron Export] Add Qwen3-VL mcore ↔ HF weight mapping
> This PR is duplicated from [PR #895](#895).
> The original branch source is no longer available; this new branch carries the same changes forward.
## What does this PR do?
**New feature:** Add Qwen3-VL (Vision-Language) model support to the Megatron Core export/import plugin, enabling weight conversion between HuggingFace and mcore formats for PTQ/QAT/QAD workflows.
### Overview
Qwen3-VL has a different weight structure from Qwen3 text-only models:
- Language model weights are under the `model.language_model.` prefix (not `model.`)
- Visual encoder weights are under `model.visual.` prefix
- `lm_head` is at root level, not nested under `language_model`
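The naming difference listed above can be sketched in a few lines of Python. This is only an illustration of the prefix layout, not the plugin's code: `to_qwen3vl_key` and `component_of` are hypothetical helpers, and the tensor names in the usage lines are illustrative.

```python
# Hypothetical helpers illustrating the Qwen3 -> Qwen3-VL key layout.
# They are NOT part of modelopt; names are chosen for this sketch only.
TEXT_PREFIX = "model."
VL_LM_PREFIX = "model.language_model."
VL_VISUAL_PREFIX = "model.visual."


def to_qwen3vl_key(qwen3_key: str) -> str:
    """Rewrite a Qwen3 text-only HF weight name to its Qwen3-VL location."""
    if qwen3_key.startswith("lm_head."):
        # lm_head stays at the root level and is left unchanged.
        return qwen3_key
    if qwen3_key.startswith(TEXT_PREFIX):
        return VL_LM_PREFIX + qwen3_key[len(TEXT_PREFIX):]
    return qwen3_key


def component_of(vl_key: str) -> str:
    """Classify a Qwen3-VL HF weight name by the prefix it lives under."""
    if vl_key.startswith(VL_LM_PREFIX):
        return "language_model"
    if vl_key.startswith(VL_VISUAL_PREFIX):
        return "visual"
    if vl_key.startswith("lm_head."):
        return "lm_head"
    return "other"


# Illustrative tensor names (assumed, not read from a real checkpoint).
rewritten = to_qwen3vl_key("model.layers.0.self_attn.q_proj.weight")
unchanged = to_qwen3vl_key("lm_head.weight")
```

Keeping the rewrite as a pure string transformation is what lets the Qwen3-VL mapping reuse the Qwen3 dense rules without duplicating them.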
### What changed
| File | Change |
|---|---|
| `modelopt/torch/export/plugins/mcore_qwen3vl.py` | New plugin: derives Qwen3-VL mcore↔HF mapping by rewriting `model.*` → `model.language_model.*` on top of the existing Qwen3 dense rules; `lm_head.` is intentionally left unchanged |
| `modelopt/torch/export/plugins/mcore_common.py` | Registers `Qwen3VLForConditionalGeneration` in `all_mcore_hf_export_mapping` and `all_mcore_hf_import_mapping` |
| `modelopt/torch/export/plugins/hf_checkpoint_utils.py` | Generalized `load_multimodal_components` with a `prefixes` parameter; sharded checkpoints now scan all shards (not just the first) |
| `modelopt/torch/export/unified_export_megatron.py` | `save_pretrained`: added Qwen3-VL branch that copies `model.visual.*` vision-encoder weights from the original HF checkpoint into the exported directory, producing a complete, loadable checkpoint |
| `tests/_test_utils/torch/transformers_models.py` | Added `get_tiny_qwen3vl` / `create_tiny_qwen3vl_dir` helpers; Qwen3VL classes are lazy-imported inside the function to avoid collection failures on older transformers builds |
| `tests/gpu_megatron/torch/export/test_unified_export_megatron.py` | Integrated Qwen3-VL export/import tests into the existing `test_unified_export_megatron` / `test_unified_import_megatron` parametrized suites; removed standalone `test_mcore_qwen3vl.py` |
| `docs/source/deployment/3_unified_hf.rst` | Added Qwen3-VL (FP8 / NVFP4) to the deployment support matrix for TensorRT-LLM |
### Workflow coverage
| Step | Status | Files |
|---|---|---|
| 1. Quantize Qwen3-VL with `hf_ptq` | ✅ existing | n/a |
| 2. Export quantized mcore → HF | ✅ this PR | `plugins/mcore_qwen3vl.py` (weight name mapping), `unified_export_megatron.py` (export path) |
| 3. Vision-encoder weights merged into export dir | ✅ this PR | `plugins/hf_checkpoint_utils.py` (`load_multimodal_components` with `prefixes`), `unified_export_megatron.py` (calls it when `arch == "Qwen3VLForConditionalGeneration"`) |
| 4. Import HF checkpoint back to mcore | ✅ this PR | `plugins/mcore_qwen3vl.py` (same mapping, reverse direction), `unified_export_megatron.py` (import path) |
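Step 3 depends on finding every shard that holds vision-encoder tensors. As a rough sketch of that idea (not the actual `load_multimodal_components` implementation), the snippet below reads the standard HuggingFace `model.safetensors.index.json` file and groups tensor names by shard for a tuple of prefixes. The helper name `shards_for_prefixes` and the fake index contents are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def shards_for_prefixes(ckpt_dir, prefixes):
    """Return {shard filename: [tensor names]} for names matching any prefix.

    Hypothetical helper: it only inspects the HF shard index and loads
    no tensor data.
    """
    index_path = Path(ckpt_dir) / "model.safetensors.index.json"
    weight_map = json.loads(index_path.read_text())["weight_map"]
    hits = {}
    for name, shard in weight_map.items():
        if name.startswith(tuple(prefixes)):
            hits.setdefault(shard, []).append(name)
    return hits


# Demo on a fake two-shard index (tensor names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    fake_index = {
        "weight_map": {
            "model.visual.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
            "model.language_model.embed_tokens.weight": "model-00001-of-00002.safetensors",
            "model.visual.merger.norm.weight": "model-00002-of-00002.safetensors",
            "lm_head.weight": "model-00002-of-00002.safetensors",
        }
    }
    (Path(tmp) / "model.safetensors.index.json").write_text(json.dumps(fake_index))
    demo_hits = shards_for_prefixes(tmp, ("model.visual.",))
```

In the demo the visual tensors land in both shards, which is the case a first-shard-only lookup would miss.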
### Design notes
- **MoE not supported**: `Qwen3VLMoeForConditionalGeneration` stores expert weights as 3-D tensors (`mlp.experts.gate_up_proj`, `mlp.experts.down_proj`) that require a dedicated fused-expert mapping. A `NotImplementedError` comment in the plugin documents this explicitly.
- **`copy.deepcopy` on `func_kwargs`**: each mapping entry gets its own copy to prevent shared-dict mutation when both Qwen3 and Qwen3-VL rules are loaded.
- **`prefixes` parameter on `load_multimodal_components`**: backward-compatible default preserves existing LLaVA behaviour (`"multi_modal_projector"`, `"vision_model"`); Qwen3-VL callers pass `("model.visual.",)`.
- **Sharded checkpoint scan**: the old code only looked in the first shard. The Qwen3-VL vision encoder can span multiple shards, so all shards are now scanned.
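The shared-dict hazard behind the `copy.deepcopy` note can be reproduced with plain dictionaries. The rule table below is a hypothetical stand-in for the real mapping objects (the field names `hf_prefix` and `tag` are invented for this sketch); it only shows why a derived rule set needs its own `func_kwargs`.

```python
import copy

# Hypothetical stand-in for a table of Qwen3 mapping rules.
qwen3_rules = {
    "final_layernorm": {"hf_prefix": "model.norm.", "func_kwargs": {"tag": "text"}},
}


def derive_shallow(rules):
    """Copy only the outer rule dicts; func_kwargs stays shared with the source."""
    return {name: dict(rule) for name, rule in rules.items()}


def derive_deep(rules):
    """Copy each rule and give it an independent func_kwargs dict."""
    derived = {}
    for name, rule in rules.items():
        new_rule = dict(rule)
        new_rule["func_kwargs"] = copy.deepcopy(rule["func_kwargs"])
        derived[name] = new_rule
    return derived


deep = derive_deep(qwen3_rules)
deep["final_layernorm"]["func_kwargs"]["tag"] = "vl"
tag_after_deep_edit = qwen3_rules["final_layernorm"]["func_kwargs"]["tag"]

shallow = derive_shallow(qwen3_rules)
shallow["final_layernorm"]["func_kwargs"]["tag"] = "leaked"
tag_after_shallow_edit = qwen3_rules["final_layernorm"]["func_kwargs"]["tag"]
```

Editing the deep-copied rule leaves the Qwen3 table untouched, while the same edit through the shallow copy changes the original entry as well.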
## Usage
From the [Megatron-LM PR comment](NVIDIA/Megatron-LM#3444 (comment)):

> Qwen3VL is supported within [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), and pretraining and PEFT recipes for Qwen3VL are [here](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py) and the core code logic [here](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/models/qwen_vl).

Create `Megatron-LM/examples/post_training/modelopt/conf/Qwen/Qwen3-VL-8B-Instruct.sh`:
```bash
#!/bin/bash
# Qwen3-VL-8B-Instruct text-model config for Megatron-LM import/quantize.
#
# Text-model dimensions are identical to Qwen3-8B (4096 hidden, 36 layers,
# 32 heads, GQA=8). Differences: rope_theta=5000000, checkpoint path uses
# model.language_model.* prefix (handled by mcore_qwen3vl plugin).
# Fall back to the Hub model ID when no local checkpoint path is given.
if [ -z "${HF_MODEL_CKPT}" ]; then
    HF_MODEL_CKPT=Qwen/Qwen3-VL-8B-Instruct
    TOKENIZER_MODEL=Qwen/Qwen3-VL-8B-Instruct
else
    TOKENIZER_MODEL=${HF_MODEL_CKPT}
fi
MODEL_ARGS=" \
--save-interval 100000 \
--micro-batch-size 1 \
--bf16 \
--no-masked-softmax-fusion \
--disable-bias-linear \
--untie-embeddings-and-output-weights \
--position-embedding-type rope \
--no-rope-fusion \
--normalization RMSNorm \
--swiglu \
--num-layers 36 \
--hidden-size 4096 \
--ffn-hidden-size 12288 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--kv-channels 128 \
--qk-layernorm \
--seq-length 4096 \
--max-position-embeddings 262144 \
--tokenizer-type HuggingFaceTokenizer \
--make-vocab-size-divisible-by 1187 \
--use-mcore-models \
--rotary-percent 1.0 \
--rotary-base 5000000 \
--no-bias-swiglu-fusion \
"
```
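The attention-related flags in this config are tied together by simple arithmetic, which the following Python check spells out. The values are copied from the script above; the fused QKV row count assumes the query, key, and value projections are concatenated into a single weight, as the `QKVMerging` rule name suggests, and is not taken from the plugin source.

```python
# Values copied from the MODEL_ARGS flags above.
hidden_size = 4096
num_attention_heads = 32
num_query_groups = 8
kv_channels = 128

# With GQA, several query heads share one key/value group.
heads_per_group = num_attention_heads // num_query_groups

# Projection output sizes implied by the flags.
q_dim = num_attention_heads * kv_channels
kv_dim = num_query_groups * kv_channels

# Assumption: a fused QKV weight stacks Q, K, and V rows together.
fused_qkv_rows = q_dim + 2 * kv_dim

# For this model the per-head size equals hidden_size / num_attention_heads.
per_head_matches_hidden = hidden_size // num_attention_heads == kv_channels
```

A quick check like this helps catch a mistyped flag before a conversion run fails on a shape mismatch.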
Import Qwen3-VL from HuggingFace to MCore (local, requires GPUs):
```bash
MLM_MODEL_CFG=Qwen/Qwen3-VL-8B-Instruct \
HF_MODEL_CKPT=Qwen/Qwen3-VL-8B-Instruct \
MLM_MODEL_SAVE=/tmp/qwen3vl_mcore \
TP=1 \
bash Megatron-LM/examples/post_training/modelopt/convert.sh Qwen/Qwen3-VL-8B-Instruct
```
Quantize (PTQ via Megatron-LM path):
```bash
MLM_MODEL_CFG=Qwen/Qwen3-VL-8B-Instruct \
HF_MODEL_CKPT=Qwen/Qwen3-VL-8B-Instruct \
QUANT_CFG=NVFP4_DEFAULT_CFG \
TP=4 \
bash Megatron-LM/examples/post_training/modelopt/quantize.sh Qwen/Qwen3-VL-8B-Instruct
```
## Testing
- Verified round-trip import/export on Qwen3-VL-8B-Instruct using the example commands above
- Unit/GPU tests covering:
  - Registration in the global export/import mappings
  - Import mapping: dense keys, `model.language_model.` prefix, `lm_head.` at root, `QKVMerging`, `GatedMLPMerging`, `REPLICATE` for layernorms, TP sharding configs
  - Export mapping: `QKVSlicing`, `GatedMLPSlicing`, no `parallel_config`
  - Import/export symmetry: same mcore keys, matching HF prefixes
  - Qwen3-VL vs Qwen3 difference: same keys, VL adds the `language_model.` prefix, `lm_head` unchanged
## Before your PR is "Ready for review"
- Is this change backward compatible?: Yes, additive only
- Did you write any new necessary tests?: Yes, `tests/gpu_megatron/torch/export/test_unified_export_megatron.py`
- Did you add or update any necessary documentation?: Yes, see `docs/source/deployment/3_unified_hf.rst`
- Did you update Changelog?: Yes, see `CHANGELOG.rst`
## Additional Information
The companion Megatron-LM PR adds `Qwen3VLModel`, `Qwen3VLDataset`, and `pretrain_qwenvl.py`.
See: NVIDIA/Megatron-LM#3444
---------
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
Signed-off-by: hychiang <hungyuehc@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

1 parent 2b7668d commit f0d2237
8 files changed
Lines changed: 302 additions & 91 deletions
File tree
- docs/source/deployment
- modelopt/torch/export
  - plugins
- tests
  - _test_utils/torch
  - gpu_megatron/torch/export