2 changes: 1 addition & 1 deletion CHANGELOG.rst
@@ -41,7 +41,7 @@ Changelog
- Add support for vLLM fakequant reload using ModelOpt state for HF models. See `examples/vllm_serve/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vllm_serve#load-qatptq-model-and-serve-in-vllm-wip>`_ for more details.
- [Early Testing] Add Claude Code PTQ skill (``.claude/skills/ptq/``) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution.
- [Early Testing] Polish Claude Code evaluation skill (``.claude/skills/evaluation/``) for agent-assisted LLM accuracy benchmarking via NeMo Evaluator Launcher. Adds two companion skills vendored verbatim from `NVIDIA-NeMo/Evaluator <https://github.com/NVIDIA-NeMo/Evaluator>`_: ``launching-evals`` (run/check/debug/analyze NEL evaluations) and ``accessing-mlflow`` (query MLflow runs, compare metrics, fetch artifacts). Re-sync at a pinned upstream SHA via ``.claude/scripts/sync-upstream-skills.sh``. Also adds a shared ``skills/common/credentials.md`` covering HF / NGC / Docker token setup referenced by multiple skills. This feature is in early testing — use with caution.
-- Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8.yaml>`_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml>`_ for usage.
+- Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml>`_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml>`_ for usage.
- Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (``modelopt.torch.quantization.src.conv``). When NVFP4 quantization is applied to an ``nn.Conv3d`` layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (``groups > 1``) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.
- Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (scale Mul / K-transpose move before Q, Q→DQ insertion on softmax output) in :class:`FP8QuantExporter <modelopt.onnx.export.fp8_exporter.FP8QuantExporter>`, per-instance nested-attention-wrapper skipping in the HF plugin, and ``nn.LayerNorm`` registration in ``QuantModuleRegistry`` so BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See `examples/torch_onnx/torch_quant_to_onnx.py <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/torch_onnx/torch_quant_to_onnx.py>`_ for the general timm-model quantize→ONNX workflow.

3 changes: 3 additions & 0 deletions docs/source/guides/10_recipes.rst
@@ -495,6 +495,8 @@ General PTQ recipes are model-agnostic and apply to any supported architecture:
      - NVFP4 for MLP layers only, FP8 KV cache
    * - ``general/ptq/nvfp4_experts_only-kv_fp8``
      - NVFP4 for MoE expert layers only, FP8 KV cache
+   * - ``general/ptq/nvfp4_experts_only-kv_fp8_layerwise``
+     - NVFP4 for MoE expert layers only, FP8 KV cache, layerwise calibration
    * - ``general/ptq/nvfp4_omlp_only-kv_fp8``
      - NVFP4 for output projection + MLP layers, FP8 KV cache

@@ -657,6 +659,7 @@ The ``modelopt_recipes/`` package is organized as follows:
 | +-- nvfp4_default-kv_nvfp4_cast.yaml
 | +-- nvfp4_mlp_only-kv_fp8.yaml
 | +-- nvfp4_experts_only-kv_fp8.yaml
+| +-- nvfp4_experts_only-kv_fp8_layerwise.yaml
 | +-- nvfp4_omlp_only-kv_fp8.yaml
 +-- models/ # Model-specific recipes
 | +-- Step3.5-Flash/
31 changes: 25 additions & 6 deletions examples/llm_ptq/hf_ptq.py
@@ -988,6 +988,25 @@ def quantize_main(
     default_pad_token,
     device: torch.device,
 ):
+    # Load the recipe up front so we can detect layerwise calibration before batch-size probing.
+    recipe = None
+    if args.recipe is not None and not args.auto_quantize_bits:
+        print(f"Use recipe {args.recipe} for quantization")
+        recipe = load_recipe(args.recipe)
+        if not isinstance(recipe, ModelOptPTQRecipe):
+            raise TypeError(
+                f"Expected PTQ recipe, but got {type(recipe).__name__} from {args.recipe}"
+            )
+
+    def _is_layerwise(obj):
+        if isinstance(obj, ModelOptPTQRecipe):
+            return _is_layerwise(obj.quantize.algorithm)
+        if isinstance(obj, list):
+            return any(_is_layerwise(a) for a in obj)
+        return bool(getattr(obj, "layerwise", False))
Comment on lines +1001 to +1006

⚠️ Potential issue | 🟠 Major

Handle dict-form algorithms in _is_layerwise.

At Line 973, getattr(obj, "layerwise", False) makes dict algorithms evaluate as non-layerwise. That can bypass the Line 990-994 guard and fall back to full-model batch probing.

Suggested fix
     def _is_layerwise(obj):
         if isinstance(obj, ModelOptPTQRecipe):
             return _is_layerwise(obj.quantize.algorithm)
+        if isinstance(obj, dict):
+            if "layerwise" in obj:
+                return bool(obj["layerwise"])
+            if "algorithm" in obj:
+                return _is_layerwise(obj["algorithm"])
+            return False
         if isinstance(obj, list):
             return any(_is_layerwise(a) for a in obj)
         return bool(getattr(obj, "layerwise", False))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 968 - 973, The helper _is_layerwise
currently treats dict-form algorithms as non-layerwise because getattr(obj,
"layerwise", False) returns False for dicts; update _is_layerwise to explicitly
handle dicts by checking if obj is a dict and returning True when
obj.get("layerwise") is truthy or when any of its values (or nested algorithm
entries) are layerwise (i.e., recurse into dict values similar to list
handling). Keep the existing branches for ModelOptPTQRecipe and list, and ensure
the final fallback checks dicts before using getattr to avoid misclassifying
dict algorithms and bypassing the layerwise guard.
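
The suggested fix can be checked in isolation with a small sketch. This is not the code from `hf_ptq.py`: `AlgoConfig` and `resolve_batch_size` are hypothetical stand-ins introduced only for illustration, and the `ModelOptPTQRecipe` branch is omitted so the example runs without ModelOpt installed.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class AlgoConfig:
    """Hypothetical stand-in for an object-form algorithm config."""

    method: str = "max"
    layerwise: bool = False


def is_layerwise(obj: Any) -> bool:
    """Return True if any algorithm entry requests layerwise calibration."""
    if isinstance(obj, dict):
        # Dict-form algorithm: read the key directly, or recurse into a nested entry.
        if "layerwise" in obj:
            return bool(obj["layerwise"])
        if "algorithm" in obj:
            return is_layerwise(obj["algorithm"])
        return False
    if isinstance(obj, (list, tuple)):
        return any(is_layerwise(a) for a in obj)
    # Object-form algorithm, None, or a plain string such as "max".
    return bool(getattr(obj, "layerwise", False))


def resolve_batch_size(requested: int, algorithm: Any) -> Optional[int]:
    """Pick batch_size=1 for layerwise runs when auto-detection (0) is requested.

    None means "fall through to full-model batch-size probing".
    """
    if requested != 0:
        return requested
    return 1 if is_layerwise(algorithm) else None
```

With the dict branch in place, `resolve_batch_size(0, {"method": "max", "layerwise": True})` yields 1, so the layerwise guard is taken instead of the full-model probe. Without it, the same input would fall through to `getattr` and be treated as non-layerwise.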


+    is_layerwise = _is_layerwise(recipe)
+
     if args.batch_size == 0:
         # For VL models with image-text calibration, skip automatic batch size detection
         # since get_max_batch_size can't handle multimodal inputs
@@ -1001,6 +1020,11 @@ def quantize_main(
                 "Offline speculative decoding calibration enabled. Using default batch_size=1 for calibration."
             )
             args.batch_size = 1
+        # Layerwise calibration processes one layer at a time; auto batch-size probing runs a
+        # full-model forward which defeats the point and can OOM on very large models.
+        elif is_layerwise:
+            print("Layerwise calibration enabled. Using default batch_size=1 for calibration.")
+            args.batch_size = 1
         else:
             # Calibration/sparsification will actually take much more memory than regular inference
             # due to intermediate tensors for fake quantization. Setting sample_memory_usage_ratio
@@ -1064,12 +1088,7 @@ def quantize_main(
     else:
         # mono quantization
 
-        if args.recipe is not None:
-            print(f"Use recipe {args.recipe} for quantization")
-            recipe = load_recipe(args.recipe)
-            assert isinstance(recipe, ModelOptPTQRecipe), (
-                f"Expected PTQ recipe, but got {type(recipe).__name__} from {args.recipe}"
-            )
+        if recipe is not None:
             quant_cfg = recipe.quantize.model_dump()
 
         else:
3 changes: 2 additions & 1 deletion modelopt/torch/quantization/config.py
@@ -154,7 +154,7 @@
 import warnings
 from typing import Any, Literal, cast
 
-from pydantic import ValidationInfo, field_validator, model_validator
+from pydantic import AliasChoices, ValidationInfo, field_validator, model_validator
 from typing_extensions import Required, TypedDict
 
 from modelopt.torch.opt.config import ModeloptBaseConfig, ModeloptField
@@ -588,6 +588,7 @@ class QuantizeAlgorithmConfig(ModeloptBaseConfig):
 
     layerwise: bool = ModeloptField(
         default=False,
+        validation_alias=AliasChoices("layerwise", "use_sequential"),
         title="Enable layerwise (layer-by-layer) calibration.",
         description=(
             "If True, the calibration algorithm is applied layer by layer. "
@@ -21,7 +21,7 @@ imports:
 
 metadata:
   recipe_type: ptq
-  description: NVFP4 static weight and dynamic activation for expert layers only (W4A4), FP8 KV cache, max layerwise calibration.
+  description: NVFP4 static weight and dynamic activation for expert layers only (W4A4), FP8 KV cache, max calibration.
 quantize:
   algorithm:
     method: max
@@ -0,0 +1,45 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+imports:
+  base_disable_all: configs/ptq/units/base_disable_all
+  default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers
+  nvfp4: configs/numerics/nvfp4
+  kv_fp8: configs/ptq/units/kv_fp8
+
+metadata:
+  recipe_type: ptq
+  description: NVFP4 static weight and dynamic activation for expert layers only (W4A4), FP8 KV cache, max layerwise calibration.
+quantize:
+  algorithm:
+    method: max
+    # Max calibration is fast and does not typically need checkpointing.
+    layerwise: true
+  quant_cfg:
+    - $import: base_disable_all
+    - quantizer_name: '*mlp.experts*weight_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*mlp.experts*input_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*block_sparse_moe*weight_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*block_sparse_moe*input_quantizer'
+      cfg:
+        $import: nvfp4
+    - $import: kv_fp8
+    - $import: default_disabled_quantizers
1 change: 1 addition & 0 deletions tests/unit/recipe/test_loader.py
@@ -136,6 +136,7 @@ def test_load_recipe_builtin_description():
     "general/ptq/nvfp4_default-kv_nvfp4_cast",
     "general/ptq/nvfp4_default-kv_none-gptq",
     "general/ptq/nvfp4_experts_only-kv_fp8",
+    "general/ptq/nvfp4_experts_only-kv_fp8_layerwise",
     "general/ptq/nvfp4_mlp_only-kv_fp8",
     "general/ptq/nvfp4_omlp_only-kv_fp8",
 ]
33 changes: 33 additions & 0 deletions tests/unit/torch/quantization/test_config_validation.py
@@ -25,6 +25,7 @@
     INT4_AWQ_CFG,
     NVFP4_DEFAULT_CFG,
     W4A8_AWQ_BETA_CFG,
+    MaxCalibConfig,
     QuantizeConfig,
     find_quant_cfg_entry_by_path,
     need_calibration,
@@ -525,3 +526,35 @@ def test_validate_quant_cfg_entries_accepts_valid_cfg(self):
             algorithm="max",
         )
         assert len(cfg.quant_cfg) == 2
+
+
+class TestLayerwiseUseSequentialAlias:
+    """`layerwise` accepts the legacy `use_sequential` name via validation_alias.
+
+    Old PTQ checkpoints serialized the field as `use_sequential` before #1251 renamed
+    it to `layerwise`. AliasChoices lets those checkpoints load without a migration
+    validator while still serializing under the current name.
+    """
+
+    def test_use_sequential_true_sets_layerwise(self):
+        cfg = MaxCalibConfig(use_sequential=True)
+        assert cfg.layerwise is True
+
+    def test_use_sequential_false_sets_layerwise(self):
+        cfg = MaxCalibConfig(use_sequential=False)
+        assert cfg.layerwise is False
+
+    def test_layerwise_name_still_accepted(self):
+        cfg = MaxCalibConfig(layerwise=True)
+        assert cfg.layerwise is True
+
+    def test_serializes_under_current_name(self):
+        """Dump must use `layerwise`, not the legacy alias."""
+        dumped = MaxCalibConfig(use_sequential=True).model_dump()
+        assert dumped["layerwise"] is True
+        assert "use_sequential" not in dumped
+
+    def test_unknown_field_still_rejected(self):
+        """extra='forbid' must still reject unrelated unknown fields."""
+        with pytest.raises(ValidationError):
+            MaxCalibConfig(not_a_real_field=True)
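
The alias behaviour these tests exercise can be reproduced outside ModelOpt with plain pydantic. The following is a minimal sketch, assuming pydantic v2 is installed. `CalibConfig` is a hypothetical stand-in rather than ModelOpt's `MaxCalibConfig`, and it uses pydantic's `Field` where the real code uses `ModeloptField`.

```python
from pydantic import AliasChoices, BaseModel, ConfigDict, Field, ValidationError


class CalibConfig(BaseModel):
    """Hypothetical stand-in for a calibration config; not ModelOpt's class."""

    # Reject unknown keys, mirroring the extra='forbid' behaviour the tests rely on.
    model_config = ConfigDict(extra="forbid")

    # Accept either the current field name or the legacy name on input.
    layerwise: bool = Field(
        default=False,
        validation_alias=AliasChoices("layerwise", "use_sequential"),
    )


legacy = CalibConfig(use_sequential=True)  # legacy key populates `layerwise`
current = CalibConfig(layerwise=True)  # current key still works
dumped = legacy.model_dump()  # serialized under the field name only

try:
    CalibConfig(not_a_real_field=True)
    unknown_rejected = False
except ValidationError:
    unknown_rejected = True
```

Listing `"layerwise"` first in `AliasChoices` matters: once a `validation_alias` is set, pydantic no longer accepts the bare field name on input unless it is one of the choices (or `populate_by_name` is enabled).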