
Commit 4b270f0

Support Mixed precision & Static MSE in MCore; Nemotron Super v3 NVFP4 recipe (#1521)
### What does this PR do?

Type of change: New features + Bug fixes

Mixed precision and MSE support in MCore PTQ

- Support mixed-precision export in MCore by detecting mixed-precision layers in the HF quant config.
- Restore the static quantizer during MCore checkpoint restore as `NVFP4QTensor` rather than `TensorQuantizer`, which can trigger max calibration. Max calibration should be skipped for static quantizers during restore. This fixes a bug during MCore export for MSE.
- Fix dynamic block quantizer detection when `block_sizes` is dict-backed.
- Add a YAML quantization recipe that roughly mirrors the Nemotron 3 Super NVFP4 `hf_quant_config.json`.

Export bug fixes

- Copy `.py` files properly from the original HF checkpoint (for the reasoning parser etc.).

## Super recipe

Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 `hf_quant_config.json`:

- MoE routed experts (`mixer.experts.<N>.{up,down}_proj`): NVFP4 W4A4 weight MSE, group_size 16
- MoE shared experts (`mixer.shared_experts.{up,down}_proj`): FP8 per-tensor
- Mamba mixer linears (`mixer.{in,out}_proj`): FP8 per-tensor
- KV cache: FP8
- Rest: not quantized

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing

Tested on Nemotron model

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added NVFP4 (4-bit) quantization checkpoint restore and export support for Megatron-Core models
  * Added tokenizer file export capability in model checkpoints
  * Extended quantization support for expert-parallel distributed training
  * Introduced new PTQ recipes for Nemotron-3-Super models with mixed-precision quantization
* **Bug Fixes**
  * Fixed FP8 and FP4 hardware compatibility detection on non-CUDA systems
  * Improved offline Hugging Face Hub access handling with better error messaging
  * Enhanced calibration validation for mixture-of-experts models
  * Fixed amplitude maximum validation for static block quantizers
* **Documentation**
  * Updated expert weight quantization configuration documentation

---------

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jenny Chen <jennifchen@nvidia.com>
1 parent 5eba879 commit 4b270f0

25 files changed

Lines changed: 1325 additions & 143 deletions


CHANGELOG.rst

Lines changed: 4 additions & 0 deletions
@@ -36,10 +36,14 @@ Changelog
 - The ``nemotron-sft-agentic-v2`` registered dataset (added in #1498) now uses only the ``search`` split. The previously configured ``interactive_agent`` and ``tool_calling`` splits contain content-level defects (heterogeneous schema and a malformed JSON row, respectively) that cause pyarrow's streaming JSON reader to fail deterministically.
 - Add shared Megatron-Core calibration forward loop: ``modelopt.torch.utils.plugins.megatron_calibration.get_megatron_calibration_forward_loop`` produces the ``forward_loop`` callable expected by ``mtq.quantize`` / ``mtp.prune``. Replaces the bespoke calibration loops in Megatron-LM and Megatron-Bridge for quantization and pruning with a single canonical implementation.
 - Add ``pack=True`` mode to ``get_dataset_dataloader`` (Megatron-LM pretraining-style global-stream document packing): all raw samples concatenated EOS-separated into one token stream, sliced into uniform ``max_sample_length`` rows. Used by the shared megatron calibration loop.
+- Support Megatron-Core checkpoint restore and export for MSE ``NVFP4StaticQuantizer``.
+- Add mixed-precision FP8 + NVFP4 export for Megatron-Core: per-layer ``quant_algo`` recorded under ``quantized_layers`` in ``hf_quant_config.json``, PP-aware ``kv_cache_dtype`` gather, fused-QKV exclude split into per-HF-name ``q/k/v_proj`` entries.
+- Add Nemotron-3-Super-120B-A12B PTQ recipes ``modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`` (MSE-mixed) and ``super-nvfp4-max-calib.yaml`` (max-calib mixed): NVFP4 W4A4 routed experts + FP8 per-tensor shared experts / Mamba in/out_proj + FP8 KV cache.
 - Add quantized ``nn.Embedding`` support. ``nn.Embedding`` is now registered in ``QuantModuleRegistry`` and exposes ``weight_quantizer`` (embedding table), ``output_quantizer`` (lookup activations), and a permanently disabled ``input_quantizer`` placeholder — embedding inputs are integer indices and cannot be fake-quantized, so direct ``enable*()`` calls raise. ``export_hf_checkpoint`` packs quantized embedding weights alongside Linear layers. Embedding quantizers are opt-in (``parent_class: nn.Embedding`` disabled by default).
 
 **Bug Fixes**
 
+- In Megatron-Core only do EP amax sync for routed expert weights if ``sync_expert_weight_amax=True``. Previously EP amax sync would sync routed expert weights across EP ranks even when ``sync_expert_weight_amax`` was False.
 - Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.
 
 0.44 (2026-05-14)
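
The changelog entry above says a per-layer ``quant_algo`` is recorded under ``quantized_layers`` in ``hf_quant_config.json``. The sketch below shows how a consumer might detect a mixed-precision checkpoint from such a config. It is not the exporter's code. The top-level ``quantization`` key, the layer names, and both helper functions are assumptions made for illustration; only ``quantized_layers``, ``quant_algo``, and the NVFP4/FP8 split come from the text above.

```python
# Hypothetical config fragment: layer names and the "quantization" wrapper key
# are illustrative assumptions, not copied from a real checkpoint.
EXAMPLE_HF_QUANT_CONFIG = {
    "quantization": {
        "quantized_layers": {
            "backbone.layers.1.mixer.experts.0.up_proj": {"quant_algo": "NVFP4", "group_size": 16},
            "backbone.layers.1.mixer.shared_experts.up_proj": {"quant_algo": "FP8"},
            "backbone.layers.0.mixer.in_proj": {"quant_algo": "FP8"},
        }
    }
}


def layer_algos(cfg: dict) -> dict[str, str]:
    """Map each quantized layer name to its recorded quant_algo."""
    layers = cfg.get("quantization", {}).get("quantized_layers", {})
    return {name: entry["quant_algo"] for name, entry in layers.items()}


def is_mixed_precision(cfg: dict) -> bool:
    """A config is mixed precision when layers use more than one algorithm."""
    return len(set(layer_algos(cfg).values())) > 1


if __name__ == "__main__":
    for name, algo in sorted(layer_algos(EXAMPLE_HF_QUANT_CONFIG).items()):
        print(f"{algo:6s} {name}")
    print("mixed precision:", is_mixed_precision(EXAMPLE_HF_QUANT_CONFIG))
```

Layers absent from ``quantized_layers`` are treated as unquantized here, which matches the "rest: not quantized" line of the recipe description but is otherwise an assumption about the file's semantics.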

examples/specdec_bench/specdec_bench/datasets/speed.py

Lines changed: 1 addition & 5 deletions
@@ -730,11 +730,7 @@ def _load_dataset(self, config_name_or_dataset_path: config_type | str) -> "Data
             # Strip HF metadata from the schema to avoid Feature parsing errors
             schema = table.schema
             if schema.metadata and b"huggingface" in schema.metadata:
-                new_meta = {
-                    k: v
-                    for k, v in schema.metadata.items()
-                    if k != b"huggingface"
-                }
+                new_meta = {k: v for k, v in schema.metadata.items() if k != b"huggingface"}
                 table = table.replace_schema_metadata(new_meta or None)
             dataset = HFDataset(table)
             if self.num_samples is not None and self.num_samples < len(dataset):
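
The hunk above only reflows the comprehension, but the ``new_meta or None`` idiom it feeds is easy to misread. Here is a minimal, pyarrow-free sketch of the same filtering on a plain bytes-keyed dict. The helper name ``strip_metadata_key`` is hypothetical, and the sketch assumes, as the diff implies, that passing ``None`` to ``replace_schema_metadata`` clears the metadata.

```python
def strip_metadata_key(metadata: dict[bytes, bytes], key: bytes = b"huggingface"):
    """Drop one key from schema-style metadata.

    Returns None instead of an empty dict, mirroring `new_meta or None`
    in the diff: an empty dict is falsy, so `or` falls through to None.
    """
    new_meta = {k: v for k, v in metadata.items() if k != key}
    return new_meta or None


if __name__ == "__main__":
    print(strip_metadata_key({b"huggingface": b"{}", b"pandas": b"{}"}))
    print(strip_metadata_key({b"huggingface": b"{}"}))
```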

modelopt/torch/export/plugins/hf_checkpoint_utils.py

Lines changed: 61 additions & 8 deletions
@@ -22,9 +22,21 @@
 
 import torch
 from huggingface_hub import snapshot_download
+from huggingface_hub.errors import LocalEntryNotFoundError
 from safetensors.torch import safe_open
 from tqdm import tqdm
 
+_HF_HUB_OFFLINE_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
+
+
+def _is_hf_hub_offline() -> bool:
+    return os.environ.get("HF_HUB_OFFLINE", "").strip().upper() in _HF_HUB_OFFLINE_TRUE_VALUES
+
+
+def _copy_python_files(source_dir: Path, save_dir: Path) -> None:
+    for py_file in source_dir.glob("*.py"):
+        shutil.copy2(py_file, save_dir / py_file.name)
+
 
 def copy_hf_ckpt_remote_code(
     pretrained_model_path: str | os.PathLike, save_directory: str | os.PathLike
@@ -36,7 +48,10 @@ def copy_hf_ckpt_remote_code(
     frameworks.
 
     If ``pretrained_model_path`` is a local directory, Python files are copied directly.
-    If it's a HF Hub model ID (e.g. ``nvidia/NVIDIA-Nemotron-Nano-12B-v2``), files are downloaded from the Hub.
+    If it's a HF Hub model ID (e.g. ``nvidia/NVIDIA-Nemotron-Nano-12B-v2``), the Hub
+    snapshot is resolved first and Python files are copied from that snapshot. When
+    ``HF_HUB_OFFLINE`` is set, the snapshot must already be available in the local
+    Hugging Face cache.
 
     Args:
         pretrained_model_path: Local path to the pretrained model or HuggingFace Hub model ID.
@@ -47,14 +62,28 @@ def copy_hf_ckpt_remote_code(
     save_dir.mkdir(parents=True, exist_ok=True)
 
     if hf_checkpoint_path.is_dir():
-        for py_file in hf_checkpoint_path.glob("*.py"):
-            shutil.copy2(py_file, save_dir / py_file.name)
+        _copy_python_files(hf_checkpoint_path, save_dir)
     else:
-        snapshot_download(
-            repo_id=str(pretrained_model_path),
-            local_dir=str(save_dir),
-            allow_patterns=["*.py"],
-        )
+        local_files_only = _is_hf_hub_offline()
+        try:
+            source_dir = Path(
+                snapshot_download(
+                    repo_id=str(pretrained_model_path),
+                    allow_patterns=["*.py"],
+                    local_files_only=local_files_only,
+                )
+            )
+        except LocalEntryNotFoundError as exc:
+            if local_files_only:
+                raise RuntimeError(
+                    f"Could not copy Python sidecar files for {pretrained_model_path!r} because "
+                    "HF_HUB_OFFLINE is enabled and the files are not available in the local "
+                    "Hugging Face cache. Populate the cache with the model's *.py files or pass "
+                    "a local pretrained model directory."
+                ) from exc
+            raise
+
+        _copy_python_files(source_dir, save_dir)
 
 
 def load_multimodal_components(
@@ -123,3 +152,27 @@ def load_multimodal_components(
 
     print(f"Successfully loaded {len(multimodal_state_dict)} multimodal tensors")
     return multimodal_state_dict
+
+
+def copy_non_safetensor_files_from_ckpt(src: str | os.PathLike, dst: str | os.PathLike):
+    """Copy every non-safetensors file from a local HF checkpoint dir verbatim.
+
+    Use as a baseline so tokenizer files, remote_code ``*.py``, README, LICENSE, etc.
+    are preserved from the source. The caller is expected to overwrite the files
+    modelopt owns (``config.json``, ``generation_config.json``, ``hf_quant_config.json``,
+    ``preprocessor_config.json``) after this step.
+
+    Args:
+        src: Source HF checkpoint directory. Must be a local path.
+        dst: Destination directory; created if missing.
+    """
+    if not os.path.isdir(src):
+        raise ValueError(f"Invalid source path: {src}. It should be a directory.")
+    os.makedirs(dst, exist_ok=True)
+    for entry in os.listdir(src):
+        sp = os.path.join(src, entry)
+        if not os.path.isfile(sp):
+            continue
+        if entry.endswith(".safetensors") or entry == "model.safetensors.index.json":
+            continue
+        shutil.copy2(sp, dst)
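
The two behaviors this file's diff introduces, the ``HF_HUB_OFFLINE`` truthiness check and the "copy everything except weights" filter, can be tried without a Hub connection or a real checkpoint. The sketch below re-implements both against a temporary directory. The function names ``is_offline``, ``copy_sidecar_files``, and ``demo`` and the file names are hypothetical stand-ins, not the module's API; the accepted truthy values and the skip rules are copied from the diff.

```python
import os
import shutil
import tempfile
from pathlib import Path

TRUE_VALUES = {"1", "ON", "YES", "TRUE"}  # same set as the diff's constant


def is_offline(env=None) -> bool:
    """Case-insensitive, whitespace-tolerant check of HF_HUB_OFFLINE."""
    env = os.environ if env is None else env
    return env.get("HF_HUB_OFFLINE", "").strip().upper() in TRUE_VALUES


def copy_sidecar_files(src, dst) -> list[str]:
    """Copy top-level files except *.safetensors and the safetensors index."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise ValueError(f"Invalid source path: {src}. It should be a directory.")
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if not entry.is_file():
            continue  # subdirectories are skipped, as in the diff
        if entry.name.endswith(".safetensors") or entry.name == "model.safetensors.index.json":
            continue
        shutil.copy2(entry, dst / entry.name)
        copied.append(entry.name)
    return copied


def demo() -> list[str]:
    """Build a fake checkpoint directory and return the names that get copied."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src"
        src.mkdir()
        for name in (
            "model-00001-of-00002.safetensors",
            "model.safetensors.index.json",
            "modeling_custom.py",
            "tokenizer.json",
        ):
            (src / name).write_text("x")
        (src / "subdir").mkdir()
        return copy_sidecar_files(src, Path(tmp) / "dst")


if __name__ == "__main__":
    print(is_offline({"HF_HUB_OFFLINE": " true "}))
    print(demo())
```

Unlike the function in the diff, this sketch returns the copied names so the filter is easy to inspect; that return value is an addition for illustration only.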

modelopt/torch/export/plugins/mcore_nemotron.py

Lines changed: 4 additions & 1 deletion
@@ -131,7 +131,10 @@
     "input_layernorm": NameRemapping("backbone.layers.{}.norm."),
     "linear_qkv": QKVSlicing("backbone.layers.{}.mixer."),
     "linear_proj": NameRemapping("backbone.layers.{}.mixer.o_proj."),
-    "core_attention": SelfAttentionScaling("backbone.layers.{}.mixer."),
+    "core_attention": SelfAttentionScaling(
+        "backbone.layers.{}.mixer.",
+        func_kwargs={"k_scale_name": "k_proj.k_scale", "v_scale_name": "v_proj.v_scale"},
+    ),
     # MLP
     "pre_mlp_layernorm": NameRemapping("backbone.layers.{}.norm."),
     "linear_fc1": NameRemapping("backbone.layers.{}.mixer.up_proj."),

modelopt/torch/export/quant_utils.py

Lines changed: 18 additions & 2 deletions
@@ -288,9 +288,25 @@ def _ensure_weight_quantizer_calibrated(
         module_name: Optional module name for better warning messages
     """
     if isinstance(weight_quantizer, NVFP4StaticQuantizer):
-        need_per_block = not hasattr(weight_quantizer, "_amax") or weight_quantizer._amax is None
+
+        def _amax_is_invalid(t: torch.Tensor | None) -> bool:
+            # MCore distcp may register but not fill amax — treat missing/non-finite/negative as recompute.
+            if t is None:
+                return True
+            t = t.detach()
+            if not torch.is_floating_point(t):
+                return False
+            return bool((~torch.isfinite(t) | (t < 0)).any().item())
+
+        need_per_block = (
+            not hasattr(weight_quantizer, "_amax")
+            or weight_quantizer._amax is None
+            or _amax_is_invalid(weight_quantizer._amax)
+        )
         need_global = (
-            not hasattr(weight_quantizer, "_global_amax") or weight_quantizer.global_amax is None
+            not hasattr(weight_quantizer, "_global_amax")
+            or weight_quantizer.global_amax is None
+            or _amax_is_invalid(weight_quantizer.global_amax)
         )
         if not (need_per_block or need_global):
             return
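
The validity rule added in this hunk can be exercised without PyTorch. Below is a torch-free analog over a plain list of floats, offered only as a sketch of the decision logic: a missing value, any non-finite entry, or any negative entry means the amax must be recomputed. The name ``amax_needs_recompute`` is hypothetical, and the diff's early return for non-floating-point tensors has no counterpart here.

```python
import math


def amax_needs_recompute(values) -> bool:
    """Return True when an amax buffer cannot be trusted.

    Mirrors the checks in the diff's `_amax_is_invalid`:
      * None            -> buffer was never filled
      * NaN or +/-inf   -> non-finite entry
      * negative entry  -> impossible for an absolute maximum
    Zero is accepted, since the diff only rejects values below zero.
    """
    if values is None:
        return True
    return any((not math.isfinite(v)) or v < 0 for v in values)


if __name__ == "__main__":
    for candidate in (None, [0.0, 2.5], [1.0, float("nan")], [float("inf")], [-0.5, 3.0]):
        print(candidate, "->", amax_needs_recompute(candidate))
```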
