Commit 24891f0
fix: megatron export correctness for TP>1 GQA, single-file MTP, and Hub remote code (#1209)
### What does this PR do?
Type of change: Bug fix
Three correctness fixes for the Megatron Core GPT export pipeline:
**1. `_qkv_slicing`: reshape failure with TP>1 on GQA models**
When tensor parallelism is enabled, the `linear_qkv` weight tensor
arriving in `_qkv_slicing` is already TP-sharded, so `weight.shape[0]`
equals `per_rank_qkv_dim * head_size`, not `qkv_total_dim * head_size`.
All five reshape/`arange` operations were using the global
`qkv_total_dim`, causing a runtime shape mismatch for any GQA model with
TP > 1. The fix derives `per_rank_qkv_dim` and `num_query_groups_local`
from the actual tensor shape, making the logic correct for any TP degree
(a no-op for TP=1).
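A minimal sketch of the shape-derived arithmetic (illustrative function and parameter names, not the actual helper in the export pipeline): because TP shards query groups evenly across ranks, the local layout falls out of the sharded weight's row count alone.

```python
def per_rank_qkv_layout(weight_rows, num_attention_heads, num_query_groups, head_size):
    """Derive the local (per-rank) QKV layout from the sharded weight's row count.

    weight_rows is weight.shape[0] of the (possibly TP-sharded) linear_qkv weight.
    """
    heads_per_group = num_attention_heads // num_query_groups
    # Each query group packs its Q heads plus one K head and one V head.
    rows_per_group = (heads_per_group + 2) * head_size
    # Derive local sizes from the actual tensor, not from global config.
    per_rank_qkv_dim = weight_rows // head_size
    num_query_groups_local = weight_rows // rows_per_group
    return per_rank_qkv_dim, num_query_groups_local
```

For `num_attention_heads=8`, `num_query_groups=2`, `head_size=64`, the global `qkv_total_dim` is 12; with TP=2 each rank holds 384 rows, yielding a local dim of 6 and one local query group, while TP=1 (768 rows) recovers the global values unchanged.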
**2. `_get_mtp_state_dict`: `EntryNotFoundError` for non-sharded
models**
`hf_hub_download("model.safetensors.index.json")` raises
`EntryNotFoundError` for small models that ship a single
`model.safetensors` rather than a sharded index. The function now
catches this and falls back to downloading/reading `model.safetensors`
directly, scanning its keys with `safe_open`. The same two-path logic
applies to local directories.
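A sketch of the two-path logic for a local directory (hypothetical function name and key-matching prefix; the Hub path is analogous, with `hf_hub_download` wrapped in a `try`/`except EntryNotFoundError`):

```python
import json
import os

def collect_mtp_keys(checkpoint_dir, mtp_marker="mtp"):
    """Gather MTP-related tensor names from a sharded or single-file checkpoint."""
    index_path = os.path.join(checkpoint_dir, "model.safetensors.index.json")
    if os.path.isfile(index_path):
        # Sharded checkpoint: the index's weight_map lists every tensor name.
        with open(index_path) as f:
            keys = json.load(f)["weight_map"].keys()
    else:
        # Single-file fallback: scan tensor names without loading any weights.
        from safetensors import safe_open  # deferred import
        with safe_open(os.path.join(checkpoint_dir, "model.safetensors"), framework="pt") as f:
            keys = f.keys()
    return [k for k in keys if mtp_marker in k]
```

Deferring the `safetensors` import to the fallback branch keeps the sharded path dependency-light; the real implementation may filter keys differently.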
**3. `copy_remote_code`: `ValueError` for Hub model IDs**
`copy_remote_code` only accepted local directory paths and raised
`ValueError` for HuggingFace Hub model IDs (e.g.
`"meta-llama/Llama-3.2-1B"`). The function now falls back to
`list_repo_files` + `hf_hub_download` to fetch and copy top-level `.py`
files (custom modeling code) when the path is not a local directory.
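A sketch of the resulting two-path behavior (hypothetical function name; `list_repo_files` and `hf_hub_download` are real `huggingface_hub` APIs):

```python
import os
import shutil

def copy_remote_code_sketch(model_name_or_path, export_dir):
    """Copy top-level .py files (custom modeling code) into export_dir."""
    os.makedirs(export_dir, exist_ok=True)
    if os.path.isdir(model_name_or_path):
        # Local checkout: copy top-level .py files directly.
        for name in os.listdir(model_name_or_path):
            if name.endswith(".py"):
                shutil.copy(os.path.join(model_name_or_path, name), export_dir)
    else:
        # Hub model ID: enumerate repo files and fetch only top-level .py files.
        from huggingface_hub import hf_hub_download, list_repo_files  # deferred import
        for name in list_repo_files(model_name_or_path):
            if name.endswith(".py") and "/" not in name:
                shutil.copy(hf_hub_download(model_name_or_path, name), export_dir)
```

Filtering on `"/" not in name` restricts the copy to repo-root files, mirroring the top-level-only behavior described above.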
### Usage
```python
# TP>1 GQA export now works (previously raised RuntimeError on reshape)
export_mcore_gpt_to_hf(gqa_model, "meta-llama/Llama-3.2-1B", export_dir="./out", dtype=torch.bfloat16)
# Models with a single model.safetensors now have their MTP weights exported
export_mcore_gpt_to_hf(model, "./small_model_dir", export_dir="./out", dtype=torch.bfloat16)
# Hub model IDs no longer raise ValueError in copy_remote_code
export_mcore_gpt_to_hf(model, "org/custom-model-with-remote-code", export_dir="./out", dtype=torch.bfloat16)
```
### Testing
New tests added in `tests/gpu_megatron/torch/export/`:
- `test_unified_export_megatron.py::test_qkv_slicing_gqa_tp2` —
FP8-quantized GQA model export with TP=2 (`num_query_groups=2 <
num_attention_heads=8`), exercises both the weight reshape and
per-channel weight-scale reshape paths.
- `test_unified_export_megatron.py::test_mtp_state_dict_single_safetensors` —
unit test verifying MTP weights are collected from a single
`model.safetensors` file.
- `test_unified_export_megatron.py::test_mtp_state_dict_index_file` —
unit test verifying MTP weights are collected from a sharded checkpoint.
- `test_unified_export_megatron.py::test_mtp_state_dict_no_mtp_keys` —
edge case: no MTP keys → empty dict, no side effects.
- `test_hf_checkpoint_utils.py` — four tests covering `copy_remote_code`
for local directories and Hub model IDs (with and without `.py` files).
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
### Additional Information
Fixes reported against Megatron export when running quantization with
TP>1, small non-sharded HF models, and HuggingFace Hub model IDs passed
to `export_mcore_gpt_to_hf`.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Export functionality now supports downloading code directly from
Hugging Face Hub model repositories in addition to local directories.
* **Bug Fixes**
* Improved safetensors loading with better error handling for missing
model entries and support for both single and sharded weight files.
* Enhanced tensor slicing behavior for multi-GPU distributed export
scenarios.
* **Tests**
* Added comprehensive test coverage for Hugging Face integration and
export functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
File tree: 4 files changed (+337, −35) — `modelopt/torch/export` (incl. `plugins`), `tests/gpu_megatron/torch/export`