
Commit ef8c769

Add FP8 MHA quantization support for HuggingFace ViT
Enables TensorRT attention-v2 fusion for HuggingFace ViT (and similar transformer vision models) when exported to ONNX with FP8 Q/DQ.

- fp8_exporter: rewrite the attention-scaling Mul and K Transpose to the Q side so DQ feeds MatMul directly, pre-transpose weight constants, and insert FP8 Q/DQ on Softmax outputs for MHA-v2 fusion. Scale dtype now matches the graph's float dtype to keep strongly-typed builds consistent.
- onnx/utils: fold Cast(FP16<->FP32) nodes that convert_float_to_float16 inserts around Q/DQ by rewriting scale initializers to FP16, so TRT fuses DQ into the downstream GEMM/MatMul kernel.
- torch/quantization/export_onnx: keep the FP8 Q/DQ scale in the native input dtype so no Cast is injected between the graph and Q/DQ.
- torch/quantization/nn: register nn.LayerNorm in QuantModuleRegistry so LayerNorm output quantizers are honored.
- torch/quantization/plugins/huggingface: skip attention wrappers whose children are also "*Attention" to avoid double-patching eager_attention_forward (e.g. ViTAttention vs ViTSelfAttention).

Example: examples/torch_onnx/vit_mha_quantization.py shows a ViT FP8 config (extends FP8_DEFAULT_CFG with a LayerNorm output quantizer, disabled input quantizers on LayerNorm-followed layers, and *_bmm_quantizer entries) plus accuracy and TRT-latency comparison against an FP16 baseline.

Measured on ViT-base-patch16-224 (RTX 6000 Ada, batch=1):
- Top-1 / top-5 on 5k ImageNet-val: 81.16% / 95.50% (FP16) vs 80.96% / 95.44% (torch FP8), a drop of 0.20% / 0.06%
- TRT latency: 0.721 ms (FP16) vs 0.646 ms (torch FP8), a 1.12x speedup

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
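The commit message describes the ViT FP8 config as FP8_DEFAULT_CFG plus a LayerNorm output quantizer, disabled input quantizers on LayerNorm-followed layers, and *_bmm_quantizer entries. A minimal sketch of what such a config dict could look like — the wildcard key names and the FP8_DEFAULT_CFG stub below are illustrative assumptions, not ModelOpt's actual values (see examples/torch_onnx/vit_mha_quantization.py for the real config):

```python
# Hypothetical sketch of the ViT FP8 MHA config described in the commit message.
# FP8_DEFAULT_CFG normally comes from modelopt.torch.quantization; a stub stands
# in here so the sketch is self-contained. num_bits=(4, 3) denotes FP8 E4M3.
import copy

FP8_DEFAULT_CFG = {  # stub standing in for modelopt's FP8_DEFAULT_CFG
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None},
        "*input_quantizer": {"num_bits": (4, 3), "axis": None},
    },
    "algorithm": "max",
}

VIT_FP8_MHA_CFG = copy.deepcopy(FP8_DEFAULT_CFG)
VIT_FP8_MHA_CFG["quant_cfg"].update(
    {
        # Quantize the LayerNorm output once, shared by all Q/K/V/FC consumers.
        "*layernorm*output_quantizer": {"num_bits": (4, 3), "axis": None},
        # The shared LayerNorm output QDQ replaces per-input QDQs downstream
        # (illustrative pattern for "LayerNorm-followed layers").
        "*attention*input_quantizer": {"enable": False},
        # FP8 Q/DQ around the attention BMMs (*_bmm_quantizer entries).
        "*q_bmm_quantizer": {"num_bits": (4, 3), "axis": None},
        "*k_bmm_quantizer": {"num_bits": (4, 3), "axis": None},
        "*v_bmm_quantizer": {"num_bits": (4, 3), "axis": None},
        "*softmax_quantizer": {"num_bits": (4, 3), "axis": None},
    }
)
```

The derived config would then be passed to the usual `mtq.quantize(model, VIT_FP8_MHA_CFG, calib_fn)` flow before ONNX export.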
1 parent 010b220 commit ef8c769

8 files changed

Lines changed: 424 additions & 41 deletions


CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ Changelog
 - [Early Testing] Add Claude Code PTQ skill (``.claude/skills/ptq/``) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution.
 - Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml>`_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml>`_ for usage.
 - Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (``modelopt.torch.quantization.src.conv``). When NVFP4 quantization is applied to an ``nn.Conv3d`` layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (``groups > 1``) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.
+- Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (scale Mul / K-transpose move before Q, Q→DQ insertion on softmax output) in :class:`FP8QuantExporter <modelopt.onnx.export.fp8_exporter.FP8QuantExporter>`, per-instance nested-attention-wrapper skipping in the HF plugin, and ``nn.LayerNorm`` registration in ``QuantModuleRegistry`` so BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See `examples/torch_onnx/torch_quant_to_onnx.py <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/torch_onnx/torch_quant_to_onnx.py>`_ for the general timm-model quantize→ONNX workflow.

 **Backward Breaking Changes**

modelopt/onnx/export/fp8_exporter.py

Lines changed: 281 additions & 9 deletions
Large diffs are not rendered by default.

modelopt/onnx/utils.py

Lines changed: 81 additions & 1 deletion
@@ -1415,6 +1415,74 @@ def _bypass_cast_node(model: onnx.ModelProto, node: onnx.NodeProto) -> None:
             consumer.input[i] = input_tensor


+_DQ_OPS = {"DequantizeLinear", "TRT_FP8DequantizeLinear"}
+_Q_OPS = {"QuantizeLinear", "TRT_FP8QuantizeLinear"}
+
+
+def _scale_fp32_to_fp16(scale_init: onnx.TensorProto) -> None:
+    """Convert a scalar Q/DQ scale initializer in-place from FP32 to FP16.
+
+    Warns if any non-zero scale saturates to 0/inf in FP16 (out of FP16 representable range).
+    """
+    if scale_init.data_type != onnx.TensorProto.FLOAT:
+        return
+    scale_data = np.frombuffer(scale_init.raw_data, dtype=np.float32)
+    if not scale_data.size:
+        scale_data = np.array(scale_init.float_data, dtype=np.float32)
+    fp16_data = scale_data.astype(np.float16)
+    if np.any(np.isinf(fp16_data)) or (
+        np.any(fp16_data == 0) and np.any(scale_data != 0)
+    ):
+        logger.warning(
+            f"Q/DQ scale '{scale_init.name}' overflows or underflows when cast to FP16"
+        )
+    scale_init.data_type = onnx.TensorProto.FLOAT16
+    scale_init.raw_data = fp16_data.tobytes()
+    del scale_init.float_data[:]
+
+
+def fold_q_fp16_to_fp32_casts(onnx_model: onnx.ModelProto) -> onnx.ModelProto:
+    """Remove ``Cast(FP16→FP32) → Q`` patterns inserted by ``convert_float_to_float16``.
+
+    The Q scale is rewritten to FP16 so Q consumes the FP16 graph directly. Skipped for
+    opsets below ``BASE_MIN_OPSET`` since FP16 Q scales require opset >= 19.
+    """
+    if get_opset_version(onnx_model) < BASE_MIN_OPSET:
+        logger.debug(
+            f"Skipping fold_q_fp16_to_fp32_casts: opset < {BASE_MIN_OPSET} (FP16 Q scale unsupported)"
+        )
+        return onnx_model
+
+    consumer_map: dict[str, list[onnx.NodeProto]] = {}
+    for node in onnx_model.graph.node:
+        for inp in node.input:
+            consumer_map.setdefault(inp, []).append(node)
+    initializers = {init.name: init for init in onnx_model.graph.initializer}
+
+    to_remove = []
+    for node in onnx_model.graph.node:
+        if node.op_type != "Cast":
+            continue
+        cast_to = next((a.i for a in node.attribute if a.name == "to"), None)
+        if cast_to != onnx.TensorProto.FLOAT:
+            continue
+        consumers = consumer_map.get(node.output[0], [])
+        if not consumers or not all(c.op_type in _Q_OPS for c in consumers):
+            continue
+
+        for q_node in consumers:
+            if len(q_node.input) >= 2 and q_node.input[1] in initializers:
+                _scale_fp32_to_fp16(initializers[q_node.input[1]])
+
+        _bypass_cast_node(onnx_model, node)
+        to_remove.append(node)
+
+    logger.debug(f"Folded {len(to_remove)} Cast(FP16->FP32) -> Q patterns")
+    for node in to_remove:
+        onnx_model.graph.node.remove(node)
+    return onnx_model
+
+
 def _is_foldable_constant_cast_pattern(model: onnx.ModelProto, node: onnx.NodeProto) -> bool:
     """Check if a Constant -> Cast pattern can be folded."""
     assert node.op_type == "Cast"
@@ -1523,7 +1591,12 @@ def fold_dq_fp32_to_fp16_casts(onnx_model: onnx.ModelProto) -> onnx.ModelProto:
     Returns:
         The ONNX model with Cast nodes removed and DQ outputs set to FP16.
     """
-    import numpy as np
+    if get_opset_version(onnx_model) < BASE_MIN_OPSET:
+        logger.debug(
+            f"Skipping fold_dq_fp32_to_fp16_casts: opset < {BASE_MIN_OPSET} "
+            "(FP16 DQ scale unsupported)"
+        )
+        return onnx_model

     dq_ops = {"DequantizeLinear", "TRT_FP8DequantizeLinear"}

@@ -1623,6 +1696,13 @@ def fold_qdq_scale_fp16_to_fp32_casts(onnx_model: onnx.ModelProto) -> onnx.Model
     Returns:
         The ONNX model with redundant scale-path casts removed.
     """
+    if get_opset_version(onnx_model) < BASE_MIN_OPSET:
+        logger.debug(
+            f"Skipping fold_qdq_scale_fp16_to_fp32_casts: opset < {BASE_MIN_OPSET} "
+            "(FP16 Q/DQ scale unsupported)"
+        )
+        return onnx_model
+
     qdq_ops = {
         "QuantizeLinear",
         "DequantizeLinear",

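The FP16 scale rewrite above warns when a scale saturates out of FP16's representable range. A standalone sketch of that saturation check, using only NumPy (no ONNX) and a hypothetical helper name:

```python
# Standalone sketch of the saturation check in _scale_fp32_to_fp16: an FP32
# Q/DQ scale is narrowed to FP16, and we flag values that overflow to inf
# (above ~65504, the FP16 max) or underflow to 0 (below the smallest FP16
# subnormal, ~6e-8) even though the FP32 value was non-zero.
import numpy as np

def fp16_saturates(scale_fp32: np.ndarray) -> bool:
    """Return True if any non-zero FP32 scale becomes 0 or inf in FP16."""
    fp16 = scale_fp32.astype(np.float16)
    overflow = np.any(np.isinf(fp16))
    underflow = np.any(fp16 == 0) and np.any(scale_fp32 != 0)
    return bool(overflow or underflow)

ok = np.array([0.0123], dtype=np.float32)         # well inside FP16 range
too_big = np.array([1.0e6], dtype=np.float32)     # > FP16 max -> inf
too_small = np.array([1.0e-9], dtype=np.float32)  # < FP16 subnormal min -> 0
```

FP8 Q/DQ scales calibrated from typical activation amax values land comfortably inside this range, which is why the fold is safe in practice and only warrants a warning rather than a hard error.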
modelopt/torch/_deploy/utils/torch_onnx.py

Lines changed: 6 additions & 0 deletions
@@ -48,6 +48,7 @@
     change_casts_to_fp16,
     check_model_uses_external_data,
     fold_dq_fp32_to_fp16_casts,
+    fold_q_fp16_to_fp32_casts,
     fold_qdq_scale_fp16_to_fp32_casts,
     get_input_names,
     get_input_shapes,
@@ -663,6 +664,11 @@ def get_onnx_bytes_and_metadata(

     onnx_opt_graph = remove_redundant_casts(onnx_opt_graph)

+    # Remove Cast nodes around Q/DQ for optimal TRT fusion
+    if is_fp8_quantized(model):
+        onnx_opt_graph = fold_q_fp16_to_fp32_casts(onnx_opt_graph)
+        onnx_opt_graph = fold_dq_fp32_to_fp16_casts(onnx_opt_graph)
+
     # TensorRT expects all scales to be postive
     onnx_opt_graph = replace_zero_scale_with_smallest_nonzero(onnx_opt_graph)

modelopt/torch/quantization/export_onnx.py

Lines changed: 14 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -216,71 +216,54 @@ def _fp8_quantize(
216216
g: "GraphContext",
217217
inputs: torch.Value,
218218
scale_inv: float,
219-
trt_high_precision_dtype: str,
220219
):
221220
"""Helper Function for Quantization."""
221+
# Emit the scale in the native input dtype so no Cast is inserted between the
222+
# graph and Q/DQ (Cast nodes block TRT from fusing DQ into the MatMul kernel).
222223
output_shape = sym_help._get_tensor_sizes(inputs)
223-
224-
# TRT StronglyType only supports FP16 QDQs
225-
# custom ops, so cast the input if needed.
226-
input_type = inputs.type().scalarType()
227-
assert trt_high_precision_dtype in (input_type, "Float"), (
228-
"TRT StronglyType requires both weights and amax to be in the BF16/FP16, or the QDQ in Float."
229-
)
230-
if trt_high_precision_dtype != input_type:
231-
inputs = g.op("Cast", inputs, to_i=onnx_dtype_map[trt_high_precision_dtype])
232-
233224
scale = g.op(
234225
"Constant",
235-
value_t=torch.tensor(scale_inv).to(torch_dtype_map[trt_high_precision_dtype]),
226+
value_t=torch.tensor(scale_inv).to(torch_dtype_map[inputs.type().scalarType()]),
236227
)
237-
q_op = g.op("trt::TRT_FP8QuantizeLinear", inputs, scale).setType(
228+
return g.op("trt::TRT_FP8QuantizeLinear", inputs, scale).setType(
238229
inputs.type().with_dtype(torch.uint8).with_sizes(output_shape)
239230
)
240-
return q_op
241231

242232

243233
def _fp8_dequantize(
244234
g: "GraphContext",
245235
inputs: torch.Value,
246236
scale_inv: float,
247-
trt_high_precision_dtype: str,
248237
otype: str | None = None,
249238
):
250239
"""Helper Function for Dequantization."""
251240
output_shape = sym_help._get_tensor_sizes(inputs)
252-
assert trt_high_precision_dtype in (otype, "Float"), (
253-
"TRT StronglyType requires both weights and amax to be in the BF16/FP16, or the QDQ in Float."
254-
)
255241
scale = g.op(
256242
"Constant",
257243
value_t=torch.tensor(scale_inv, dtype=torch_dtype_map[otype]), # type: ignore[index]
258244
)
259-
out = g.op("trt::TRT_FP8DequantizeLinear", inputs, scale).setType(
260-
inputs.type().with_dtype(torch_dtype_map[trt_high_precision_dtype]).with_sizes(output_shape)
245+
return g.op("trt::TRT_FP8DequantizeLinear", inputs, scale).setType(
246+
inputs.type().with_dtype(torch_dtype_map[otype]).with_sizes(output_shape) # type: ignore[index]
261247
)
262248

263-
# DQ outputs are currently constrained to FP32 due to a similar limitation in ORT
264-
# custom ops, so cast the output if needed.
265-
if trt_high_precision_dtype != otype:
266-
out = g.op("Cast", out, to_i=onnx_dtype_map[otype]) # type: ignore[index]
267-
return out
268-
269249

270250
def export_fp8(
271251
g: "GraphContext",
272252
inputs: torch.Value,
273253
amax: float,
274254
trt_high_precision_dtype: str | None,
275255
):
276-
"""Export quantized model to FP8 ONNX."""
256+
"""Export quantized model to FP8 ONNX.
257+
258+
``trt_high_precision_dtype`` is accepted for API compatibility but unused: Q/DQ now
259+
emit scales in the native input dtype, so no intermediate Cast is required.
260+
"""
261+
del trt_high_precision_dtype
277262
scale = 1.0 if amax is None else 448.0 / float(amax)
278263
otype = inputs.type().scalarType()
279-
if trt_high_precision_dtype is None:
280-
trt_high_precision_dtype = otype
281264

282-
q_tensor = _fp8_quantize(g, inputs, 1.0 / scale, trt_high_precision_dtype)
283-
return _fp8_dequantize(g, q_tensor, 1.0 / scale, trt_high_precision_dtype, otype)
265+
q_tensor = _fp8_quantize(g, inputs, 1.0 / scale)
266+
return _fp8_dequantize(g, q_tensor, 1.0 / scale, otype)
284267

285268

286269
def scaled_dot_product_attention(
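`export_fp8` above derives its scale from the calibrated amax as 448/amax, where 448 is the FP8 E4M3 maximum, and passes the reciprocal (amax/448) to Q/DQ as the ONNX scale. A pure-Python sketch of that scale math and the fake-quant roundtrip it implies — the rounding model below is simplified to uniform nearest-integer steps, not true E4M3 rounding:

```python
# Sketch of the scale math in export_fp8: scale = 448 / amax, and Q/DQ consume
# scale_inv = 1 / scale = amax / 448 as the ONNX scale, so a value at the
# calibrated amax maps exactly to the E4M3 maximum (448) and back.
E4M3_MAX = 448.0

def fp8_scale(amax=None):
    """Mirror of export_fp8's scale computation (scale = 1.0 when amax is None)."""
    return 1.0 if amax is None else E4M3_MAX / float(amax)

def fake_quant(x, amax):
    """Simplified Q/DQ roundtrip: divide by the ONNX scale, round, clip to the
    E4M3 range, multiply back. Real E4M3 rounding is coarser than this."""
    scale_inv = 1.0 / fp8_scale(amax)  # the ONNX Q/DQ scale, i.e. amax / 448
    q = max(-E4M3_MAX, min(E4M3_MAX, round(x / scale_inv)))
    return q * scale_inv

amax = 2.0
assert fp8_scale(amax) == 224.0      # 448 / 2
assert fake_quant(2.0, amax) == 2.0  # the amax value survives the roundtrip
assert fake_quant(10.0, amax) == 2.0  # values above amax clip to amax
```

Keeping the Constant scale in the input's native dtype (as the diff does) means this arithmetic appears in the graph without any interposed Cast, which is what lets TRT fuse the DQ into the consuming MatMul.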

modelopt/torch/quantization/nn/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
 from .modules.quant_batchnorm import *
 from .modules.quant_conv import *
 from .modules.quant_instancenorm import *
+from .modules.quant_layernorm import *
 from .modules.quant_linear import *
 from .modules.quant_module import *
 from .modules.quant_pooling import *
modelopt/torch/quantization/nn/modules/quant_layernorm.py

Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Registers ``torch.nn.LayerNorm`` with ``QuantInputBase`` so its output quantizer is
+honored during quantization. Required for FP8 attention fusion where a single LayerNorm
+output QDQ is shared across all downstream Q/K/V/FC consumers (instead of repeating it
+on each input), which enables TRT to fuse DQ into the attention MatMul kernels."""
+
+import torch.nn as nn
+
+from .quant_module import QuantInputBase, QuantModuleRegistry
+
+QuantModuleRegistry.register({nn.LayerNorm: "nn.LayerNorm"})(QuantInputBase)
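The one-line registration above follows ModelOpt's registry-decorator pattern: `register(mapping)` returns a decorator that records which quantized class replaces which original class. A minimal pure-Python sketch of that pattern — all class and method names below are illustrative stand-ins, not ModelOpt's real implementation:

```python
# Minimal sketch of the registry-decorator pattern behind
# QuantModuleRegistry.register({nn.LayerNorm: "nn.LayerNorm"})(QuantInputBase).
class _Registry:
    def __init__(self):
        self._map = {}

    def register(self, classes: dict):
        # classes maps original class -> registry name; the returned decorator
        # records the quantized class to substitute for each original class.
        def decorator(quant_cls):
            for orig_cls, name in classes.items():
                self._map[orig_cls] = (name, quant_cls)
            return quant_cls
        return decorator

    def lookup(self, module):
        """Return (name, quant_cls) for a module instance, or None."""
        return self._map.get(type(module))

registry = _Registry()

class LayerNorm:          # stand-in for torch.nn.LayerNorm
    pass

class QuantInputBase:     # stand-in for ModelOpt's quantized base class
    pass

# Same call shape as the real registration line above.
registry.register({LayerNorm: "nn.LayerNorm"})(QuantInputBase)
```

During quantization, the converter would look up each module's class in the registry and swap in the quantized counterpart, which is how the LayerNorm output quantizer from the config becomes active.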

modelopt/torch/quantization/plugins/huggingface.py

Lines changed: 15 additions & 0 deletions
@@ -286,9 +286,24 @@ def register_hf_attentions_on_the_fly(model):

     attention_cls = set()
     registered_attn_module = False
+
+    # Skip attention wrappers that contain a nested "Attention" child on this specific
+    # instance (e.g. ViTAttention wraps ViTSelfAttention). Patching both would
+    # double-quantize eager_attention_forward. Checked per-instance (not by class) so a
+    # class reused as both wrapper and leaf is not dropped everywhere. In a 3-level
+    # hierarchy (Outer → Middle → Inner), both Outer and Middle are treated as wrappers
+    # and only Inner is registered.
+    def _wraps_nested_attention(module):
+        return any(
+            child is not module and type(child).__name__.endswith("Attention")
+            for _, child in module.named_modules()
+        )
+
     for name, module in model.named_modules():
         # Only register attention classes that are from Huggingface transformers
         if type(module).__name__.endswith("Attention"):
+            if _wraps_nested_attention(module):
+                continue
             attention_type = _QuantAttention.get_attn_type(module)
             # Add modules to be registered only if they arent already registered
             if (
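The wrapper-skipping predicate in the diff above is easy to exercise in isolation. A self-contained sketch using a tiny stub for `torch.nn.Module.named_modules` (which yields `(name, module)` for the module itself and all descendants) — the `Module` stub and constructor are assumptions for illustration only:

```python
# Standalone sketch of the _wraps_nested_attention check: ViTAttention wraps
# ViTSelfAttention, so the outer wrapper is skipped and only the inner leaf
# attention module would be registered for quantization.
class Module:
    def __init__(self, **children):
        self._children = children

    def named_modules(self, prefix=""):
        # Mimics torch.nn.Module.named_modules: yields self, then descendants.
        yield prefix, self
        for name, child in self._children.items():
            yield from child.named_modules(f"{prefix}.{name}" if prefix else name)

class ViTSelfAttention(Module):
    pass

class ViTAttention(Module):
    pass

def wraps_nested_attention(module):
    # Same predicate as the diff: any strict descendant whose class name ends
    # with "Attention" marks this instance as a wrapper to skip.
    return any(
        child is not module and type(child).__name__.endswith("Attention")
        for _, child in module.named_modules()
    )

inner = ViTSelfAttention()
outer = ViTAttention(attention=inner)
```

Because the check walks `named_modules()` per instance rather than inspecting class names globally, a class that appears as both a wrapper and a leaf elsewhere in the model is only skipped where it actually wraps another attention module.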
