NVIDIA · realAsma · May 12, 2026 · Apr 24, 2026 · Apr 24, 2026 · May 12, 2026
@@ -41,7 +41,7 @@ Changelog
 - Add support for vLLM fakequant reload using ModelOpt state for HF models. See `examples/vllm_serve/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vllm_serve#load-qatptq-model-and-serve-in-vllm-wip>`_ for more details.
 - [Early Testing] Add Claude Code PTQ skill (``.claude/skills/ptq/``) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution.
 - [Early Testing] Polish Claude Code evaluation skill (``.claude/skills/evaluation/``) for agent-assisted LLM accuracy benchmarking via NeMo Evaluator Launcher. Adds two companion skills vendored verbatim from `NVIDIA-NeMo/Evaluator <https://github.com/NVIDIA-NeMo/Evaluator>`_: ``launching-evals`` (run/check/debug/analyze NEL evaluations) and ``accessing-mlflow`` (query MLflow runs, compare metrics, fetch artifacts). Re-sync at a pinned upstream SHA via ``.claude/scripts/sync-upstream-skills.sh``. Also adds a shared ``skills/common/credentials.md`` covering HF / NGC / Docker token setup referenced by multiple skills. This feature is in early testing — use with caution.
-- Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8.yaml>`_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml>`_ for usage.
+- Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml>`_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml>`_ for usage.
 - Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (``modelopt.torch.quantization.src.conv``). When NVFP4 quantization is applied to an ``nn.Conv3d`` layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (``groups > 1``) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.
 - Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (scale Mul / K-transpose move before Q, Q→DQ insertion on softmax output) in :class:`FP8QuantExporter <modelopt.onnx.export.fp8_exporter.FP8QuantExporter>`, per-instance nested-attention-wrapper skipping in the HF plugin, and ``nn.LayerNorm`` registration in ``QuantModuleRegistry`` so BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See `examples/torch_onnx/torch_quant_to_onnx.py <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/torch_onnx/torch_quant_to_onnx.py>`_ for the general timm-model quantize→ONNX workflow.
 

@@ -495,6 +495,8 @@ General PTQ recipes are model-agnostic and apply to any supported architecture:
      - NVFP4 for MLP layers only, FP8 KV cache
    * - ``general/ptq/nvfp4_experts_only-kv_fp8``
      - NVFP4 for MoE expert layers only, FP8 KV cache
+   * - ``general/ptq/nvfp4_experts_only-kv_fp8_layerwise``
+     - NVFP4 for MoE expert layers only, FP8 KV cache, layerwise calibration
    * - ``general/ptq/nvfp4_omlp_only-kv_fp8``
      - NVFP4 for output projection + MLP layers, FP8 KV cache
 
@@ -657,6 +659,7 @@ The ``modelopt_recipes/`` package is organized as follows:
    |       +-- nvfp4_default-kv_nvfp4_cast.yaml
    |       +-- nvfp4_mlp_only-kv_fp8.yaml
    |       +-- nvfp4_experts_only-kv_fp8.yaml
+   |       +-- nvfp4_experts_only-kv_fp8_layerwise.yaml
    |       +-- nvfp4_omlp_only-kv_fp8.yaml
    +-- models/                     # Model-specific recipes
    |   +-- Step3.5-Flash/

@@ -988,6 +988,25 @@ def quantize_main(
     default_pad_token,
     device: torch.device,
 ):
+    # Load the recipe up front so we can detect layerwise calibration before batch-size probing.
+    recipe = None
+    if args.recipe is not None and not args.auto_quantize_bits:
+        print(f"Use recipe {args.recipe} for quantization")
+        recipe = load_recipe(args.recipe)
+        if not isinstance(recipe, ModelOptPTQRecipe):
+            raise TypeError(
+                f"Expected PTQ recipe, but got {type(recipe).__name__} from {args.recipe}"
+            )
+
+    def _is_layerwise(obj):
+        if isinstance(obj, ModelOptPTQRecipe):
+            return _is_layerwise(obj.quantize.algorithm)
+        if isinstance(obj, list):
+            return any(_is_layerwise(a) for a in obj)
+        return bool(getattr(obj, "layerwise", False))
+
+    is_layerwise = _is_layerwise(recipe)
+
     if args.batch_size == 0:
         # For VL models with image-text calibration, skip automatic batch size detection
         # since get_max_batch_size can't handle multimodal inputs
@@ -1001,6 +1020,11 @@ def quantize_main(
                 "Offline speculative decoding calibration enabled. Using default batch_size=1 for calibration."
             )
             args.batch_size = 1
+        # Layerwise calibration processes one layer at a time; auto batch-size probing runs a
+        # full-model forward which defeats the point and can OOM on very large models.
+        elif is_layerwise:
+            print("Layerwise calibration enabled. Using default batch_size=1 for calibration.")
+            args.batch_size = 1
         else:
             # Calibration/sparsification will actually take much more memory than regular inference
             # due to intermediate tensors for fake quantization. Setting sample_memory_usage_ratio
@@ -1064,12 +1088,7 @@ def quantize_main(
     else:
         # mono quantization
 
-        if args.recipe is not None:
-            print(f"Use recipe {args.recipe} for quantization")
-            recipe = load_recipe(args.recipe)
-            assert isinstance(recipe, ModelOptPTQRecipe), (
-                f"Expected PTQ recipe, but got {type(recipe).__name__} from {args.recipe}"
-            )
+        if recipe is not None:
             quant_cfg = recipe.quantize.model_dump()
 
         else:

@@ -154,7 +154,7 @@
 import warnings
 from typing import Any, Literal, cast
 
-from pydantic import ValidationInfo, field_validator, model_validator
+from pydantic import AliasChoices, ValidationInfo, field_validator, model_validator
 from typing_extensions import Required, TypedDict
 
 from modelopt.torch.opt.config import ModeloptBaseConfig, ModeloptField
@@ -588,6 +588,7 @@ class QuantizeAlgorithmConfig(ModeloptBaseConfig):
 
     layerwise: bool = ModeloptField(
         default=False,
+        validation_alias=AliasChoices("layerwise", "use_sequential"),
         title="Enable layerwise (layer-by-layer) calibration.",
         description=(
             "If True, the calibration algorithm is applied layer by layer. "

@@ -21,7 +21,7 @@ imports:
 
 metadata:
   recipe_type: ptq
-  description: NVFP4 static weight and dynamic activation for expert layers only (W4A4), FP8 KV cache, max layerwise calibration.
+  description: NVFP4 static weight and dynamic activation for expert layers only (W4A4), FP8 KV cache, max calibration.
 quantize:
   algorithm:
     method: max

@@ -0,0 +1,45 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+imports:
+  base_disable_all: configs/ptq/units/base_disable_all
+  default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers
+  nvfp4: configs/numerics/nvfp4
+  kv_fp8: configs/ptq/units/kv_fp8
+
+metadata:
+  recipe_type: ptq
+  description: NVFP4 static weight and dynamic activation for expert layers only (W4A4), FP8 KV cache, max layerwise calibration.
+quantize:
+  algorithm:
+    method: max
+    # Max calibration is fast and does not typically need checkpointing.
+    layerwise: true
+  quant_cfg:
+    - $import: base_disable_all
+    - quantizer_name: '*mlp.experts*weight_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*mlp.experts*input_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*block_sparse_moe*weight_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*block_sparse_moe*input_quantizer'
+      cfg:
+        $import: nvfp4
+    - $import: kv_fp8
+    - $import: default_disabled_quantizers
@@ -136,6 +136,7 @@ def test_load_recipe_builtin_description():
     "general/ptq/nvfp4_default-kv_nvfp4_cast",
     "general/ptq/nvfp4_default-kv_none-gptq",
     "general/ptq/nvfp4_experts_only-kv_fp8",
+    "general/ptq/nvfp4_experts_only-kv_fp8_layerwise",
     "general/ptq/nvfp4_mlp_only-kv_fp8",
     "general/ptq/nvfp4_omlp_only-kv_fp8",
 ]

@@ -25,6 +25,7 @@
     INT4_AWQ_CFG,
     NVFP4_DEFAULT_CFG,
     W4A8_AWQ_BETA_CFG,
+    MaxCalibConfig,
     QuantizeConfig,
     find_quant_cfg_entry_by_path,
     need_calibration,
@@ -525,3 +526,35 @@ def test_validate_quant_cfg_entries_accepts_valid_cfg(self):
             algorithm="max",
         )
         assert len(cfg.quant_cfg) == 2
+
+
+class TestLayerwiseUseSequentialAlias:
+    """`layerwise` accepts the legacy `use_sequential` name via validation_alias.
+
+    Old PTQ checkpoints serialized the field as `use_sequential` before #1251 renamed
+    it to `layerwise`. AliasChoices lets those checkpoints load without a migration
+    validator while still serializing under the current name.
+    """
+
+    def test_use_sequential_true_sets_layerwise(self):
+        cfg = MaxCalibConfig(use_sequential=True)
+        assert cfg.layerwise is True
+
+    def test_use_sequential_false_sets_layerwise(self):
+        cfg = MaxCalibConfig(use_sequential=False)
+        assert cfg.layerwise is False
+
+    def test_layerwise_name_still_accepted(self):
+        cfg = MaxCalibConfig(layerwise=True)
+        assert cfg.layerwise is True
+
+    def test_serializes_under_current_name(self):
+        """Dump must use `layerwise`, not the legacy alias."""
+        dumped = MaxCalibConfig(use_sequential=True).model_dump()
+        assert dumped["layerwise"] is True
+        assert "use_sequential" not in dumped
+
+    def test_unknown_field_still_rejected(self):
+        """extra='forbid' must still reject unrelated unknown fields."""
+        with pytest.raises(ValidationError):
+            MaxCalibConfig(not_a_real_field=True)