Deprecate AQT quantization in MaxText

sarunsingla11722 · sarunsingla11722 · commit 72ddb3ca535d · 2026-06-01T18:12:22.000Z
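
The condition under which the new deprecation warning in `types.py` fires can be sketched as a standalone predicate. This is only an illustration: `uses_deprecated_aqt` is a hypothetical helper name that does not exist in MaxText, and it assumes nothing beyond the two config fields (`quantization`, `use_qwix_quantization`) and the exclusions (`fp8`, `nanoo_fp8`, and values starting with `te_`) that appear in the diff below.

```python
def uses_deprecated_aqt(quantization: str, use_qwix_quantization: bool) -> bool:
    """Hypothetical helper mirroring the check this commit adds to types.py.

    Returns True when the configuration would trigger the AQT deprecation warning.
    """
    # No quantization requested, or Qwix is selected: nothing to warn about.
    if not quantization or use_qwix_quantization:
        return False
    # The check in the diff skips fp8, nanoo_fp8 and any value starting with "te_".
    if quantization in ("fp8", "nanoo_fp8") or quantization.startswith("te_"):
        return False
    return True


if __name__ == "__main__":
    for quant, qwix_flag in [("int8", False), ("int8", True), ("fp8", False), ("", False)]:
        print(f"quantization={quant!r}, use_qwix_quantization={qwix_flag}: "
              f"warns={uses_deprecated_aqt(quant, qwix_flag)}")
```

Under this reading, a config with `quantization: 'int8'` and `use_qwix_quantization` left at its default of `false` would log the warning, while the same config with `use_qwix_quantization=True` would not.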
diff --git a/docs/reference/architecture/architecture_overview.md b/docs/reference/architecture/architecture_overview.md
@@ -63,7 +63,7 @@ The table below summarizes some of the most critical parameters in base.yml and
 | dataset_type                     | input_pipeline.py           | Specifies the data loader backend ('tfds', 'grain', 'hf').                                                    |
 | enable_checkpointing             | checkpointing.py, train.py  | Enables or disables saving model state.                                                                       |
 | async_checkpointing              | checkpointing.py, train.py  | If True, saves checkpoints without blocking the training loop.                                                |
-| quantization                     | layers.py, optimizers.py    | Enables quantization, e.g., 'int8' for AQT or Qwix.                                                           |
+| quantization                     | layers.py, optimizers.py    | Enables quantization, e.g., 'int8' for Qwix or legacy AQT (deprecated).                                       |
 | compile_topology                 | train_compile.py            | Specifies the target hardware topology for AOT compilation.                                                   |
 
 ## Core architectural components
@@ -82,7 +82,7 @@ While the base model implementations are typically simple, MaxText is equipped t
 
 - Advanced attention mechanisms: The architecture is not limited to standard self-attention. It supports variants like Grouped-Query Attention (GQA), Multi-Query Attention (MQA) and Multi-headed Latent Attention (MLA). Since, like MoE, attention can be a performance hot-spot in transformers, attention is typically implemented in [Pallas](https://docs.jax.dev/en/latest/pallas/index.html) kernels, with Splash (Sparse, Flash) Attention being the default for training.
 
-- Quantization: The framework seamlessly integrates with Google's Accurate Quantized Training (AQT) and Qwix libraries. Quantization logic is applied at the layer level.
+- Quantization: The framework integrates with the Qwix library and with Google's Accurate Quantized Training (AQT) library, which is deprecated. Quantization logic is applied at the layer level.
 
 The modularity of this design is clearly demonstrated by third-party extensions. For instance, the NVIDIA maxtext-jaxpp fork was able to add support for pipeline parallelism by inserting jaxpp.pipeline_enter_stage hooks directly into the \_\_call\_\_ method of the Decoder class, a testament to the codebase's modularity and extensibility.
 
@@ -158,7 +158,7 @@ Performance can be further tuned by setting specific XLA flags in the configurat
 
 ### Quantization for throughput boost
 
-One of the most significant performance levers available in MaxText is the integration of Google's Accurate Quantized Training (AQT) and Qwix libraries. These enable training with reduced numerical precision, reducing memory requirements and often increasing FLOPS, while maintaining model quality and convergence characteristics that are very close to the full-precision baseline.
+One of the most significant performance levers available in MaxText is its integration with the Qwix library and with Google's Accurate Quantized Training (AQT) library, which is deprecated. These libraries enable training with reduced numerical precision, reducing memory requirements and often increasing FLOPS, while maintaining model quality and convergence characteristics that are very close to the full-precision baseline.
 
 Integration into MaxText is seamless for the user. Quantization can be enabled by simply setting, for example, `quantization: 'int8'` in the configuration file. This flag activates quantization-aware layers (defined in
 [`src/maxtext/layers/quantizations.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/layers/quantizations.py)) that are applied to the relevant dense layers within the model's Flax definition. The quantization library handles the complexities of simulating quantization during the forward and backward passes, allowing the model to learn weights that are robust to the reduced precision.
diff --git a/docs/reference/core_concepts/quantization.md b/docs/reference/core_concepts/quantization.md
@@ -20,7 +20,7 @@
 
 Quantization in deep learning is the process of reducing the precision of numbers used to represent a model's weights and/or activations. Instead of using higher-precision floating-point formats like 32-bit floats (`float32`) or 16-bit brain floats (`bfloat16`), quantization maps these values to lower-precision numerical formats, most commonly 8-bit integers (`int8`) or floats (`fp8`).
 
-MaxText supports quantization via both the [AQT](https://github.com/google/aqt) and [Qwix](https://github.com/google/qwix) libraries. Qwix is the recommended approach, providing a non-intrusive way to apply Quantized Training (QT).
+MaxText supports quantization via the [Qwix](https://github.com/google/qwix) library, which is the recommended approach and provides a non-intrusive way to apply Quantized Training (QT). The legacy [AQT](https://github.com/google/aqt) (Accurate Quantized Training) library still works but is deprecated and will be removed in a future release.
 
 ## Why use quantization?
 
@@ -40,7 +40,7 @@ The primary trade-off with quantization is between the model accuracy and comput
 - Impact on Gradients: Gradients during backpropagation can have very different, often wider, distributions than weights or activations, making them more sensitive to quantization errors.
 - Convergence Issues: The approximations introduced by quantization can sometimes hinder the model's ability to converge during training.
 
-To overcome the challenges of quantization, libraries like Google's Accurate Quantized Training (AQT) and its successor Qwix (used in MaxText) employ a suite of advanced techniques. These methods ensure that models can be trained with low-precision arithmetic without significant loss in accuracy and with stable convergence.
+To overcome the challenges of quantization, libraries like Google's Accurate Quantized Training (AQT, deprecated) and its successor Qwix (used in MaxText) employ a suite of advanced techniques. These methods ensure that models can be trained with low-precision arithmetic without significant loss in accuracy and with stable convergence.
 
 ## How Quantized Training (QT) works with Qwix
 
@@ -56,16 +56,16 @@ By integrating the quantization simulation directly into the training, the model
 
 ## Using Quantization in MaxText
 
-You can enable quantization in MaxText by setting flags in your configuration file (e.g., `base.yml`) or via the command line. MaxText supports two quantization libraries: Qwix (recommended) and AQT.
+You can enable quantization in MaxText by setting flags in your configuration file (e.g., `base.yml`) or via the command line. MaxText supports Qwix (recommended) and the legacy AQT library (deprecated).
 
 ### Configuration Flags
 
 The primary flags to control quantization are:
 
 - `use_qwix_quantization`: A boolean flag.
   - Set to `True` to enable quantization using the Qwix library.
-  - Set to `False` (or omit) to use the AQT library if `quantization` is set.
-- `quantization`: A string that specifies the type of quantization to apply. The accepted values depend on whether you are using Qwix or AQT.
+  - Set to `False` (or omit) to use the AQT library (deprecated) if `quantization` is set.
+- `quantization`: A string that specifies the type of quantization to apply. The accepted values depend on whether you are using Qwix or legacy AQT.
 - `quantization_calibration_method`: The calibration method for weights and activations (e.g., `"absmax"`). This is mainly for Qwix.
 
 ### Qwix Quantization (Recommended)
@@ -127,6 +127,9 @@ model = qwix.quantize_model(model, qwix.QtProvider(rule))
 
 ### AQT Quantization
 
+> [!WARNING]
+> **DEPRECATION NOTICE**: AQT quantization is deprecated and will be removed in a future release. Please migrate to Qwix by setting `use_qwix_quantization=True`.
+
 If `use_qwix_quantization` is `False` or not set, you can still apply quantization using the AQT library by setting the `quantization` flag. You can read more about AQT on this [Google Cloud blog](https://cloud.google.com/blog/products/compute/accurate-quantized-training-aqt-for-tpu-v5e).
 
 #### `quantization` values for AQT
diff --git a/docs/reference/models/supported_models_and_architectures.md b/docs/reference/models/supported_models_and_architectures.md
@@ -10,7 +10,7 @@ MaxText is an open-source, high-performance LLM framework written in Python/JAX.
 
 - **Supported Precisions**: FP32, BF16, INT8, and FP8.
 - **Ahead-of-Time Compilation (AOT)**: For faster model development/prototyping and earlier OOM detection.
-- **Quantization**: Via **Qwix** (recommended) and AQT. See Quantization [Guide](../reference/core_concepts/quantization.md).
+- **Quantization**: Via **Qwix** (recommended) and AQT (deprecated). See Quantization [Guide](../reference/core_concepts/quantization.md).
 - **Diagnostics**: Simple logging via `max_logging`, profiling in **XProf**, and visualization in **TensorBoard**.
 - **Multi-Token Prediction (MTP)**: Enables token efficient training with multi-token prediction.
 - **Elastic Training**: Fault-tolerant and dynamic scale-up/scale-down on Cloud TPUs with Pathways.
@@ -74,7 +74,7 @@ MaxText supports a wide range of parallelism strategies for scaling training and
 The following summarizes observed runtime efficiency and scaling behaviors of MaxText across different hardware and model types, based on published benchmarks and large-scale runs.
 
 - **High MFU**: MaxText targets high Model FLOPs Utilization across scales; exact numbers vary by model, hardware and config. See [**Performance Metrics → MFU**](../performance_metrics.md#performance-metrics) for the definition and how we calculate it.
-- **Quantization**: MaxText supports quantization via both the AQT and Qwix libraries. Qwix is the recommended approach, providing a non-intrusive way to apply various quantization techniques, including Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ).
+- **Quantization**: MaxText supports quantization via both the Qwix and AQT (deprecated) libraries. Qwix is the recommended approach, providing a non-intrusive way to apply various quantization techniques, including Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ).
 - **MoE**: The Mixture-of-Experts implementation features dropless routing with efficient kernels including Megablox, `jax.lax.ragged_dot`, and Tokamax Ragged Dot.
 - **Multi-Token Prediction (MTP)**: This feature improves training efficiency on DeepSeek-style models by adding an auxiliary loss based on predicting multiple future tokens.
 - **Long-Context Optimizations**: Implements various efficient attention mechanisms, including: Grouped-Query Attention (GQA), Sliding-Window Attention (SWA), Local–Global interleaved attention, Multi-Head Latent Attention (MLA). They reduce the KV-cache size, making it possible to handle long contexts efficiently.
diff --git a/src/maxtext/configs/base.yml b/src/maxtext/configs/base.yml
@@ -143,7 +143,7 @@ save_quantized_params_path: ""
 # when left as is, corresponds to training
 # accepted values are "inference"
 model_call_mode: ""
-use_qwix_quantization: false # whether to use qwix for quantization. if set to true, the model will be quantized using qwix.
+use_qwix_quantization: false # whether to use qwix for quantization. if set to true, the model will be quantized using qwix. setting this to true is strongly recommended: the aqt path used when it is false is deprecated and will be removed in a future release.
 use_manual_quantization: false # a flag if to use manual quantization for batch split. Only used if use_batch_split_schedule is true.
 # quantization calibration method used for weights and activations. supported methods can be found in https://github.com/google/qwix/blob/dc2a0770351c740e5ab3cce7c0efe9f7beacce9e/qwix/qconfig.py#l70-l80
 weight_quantization_calibration_method: "absmax"
diff --git a/src/maxtext/configs/types.py b/src/maxtext/configs/types.py
@@ -2571,6 +2571,14 @@ def get_num_target_devices():
     }
     self.num_slices = max_utils.get_num_slices(raw_keys_for_num_slices)
 
+    # Check for AQT deprecation warning
+    if self.quantization and not self.use_qwix_quantization:
+      if self.quantization not in ("fp8", "nanoo_fp8") and not self.quantization.startswith("te_"):
+        logger.warning(
+            "AQT quantization is deprecated and will be removed in a future release. "
+            "Please migrate to Qwix by setting use_qwix_quantization=True."
+        )
+
     # Default quantization sharding count to number of local devices if not set.
     if self.quantization_local_shard_count == -1:
       try:
diff --git a/src/maxtext/layers/quantizations.py b/src/maxtext/layers/quantizations.py
@@ -759,8 +759,8 @@ def get_fp8_full_qwix_rule_w_sparsity(config: Config):
 
 
 def get_quantization_rule(config: Config):
-
   """Returns a list of qwix.QtRule from `dtype`."""
+
   def make_qt_rule(dtype) -> list[qwix.QtRule]:
     return [
         qwix.QtRule(
diff --git a/tests/unit/quantizations_test.py b/tests/unit/quantizations_test.py
@@ -149,6 +149,7 @@ def test_configure_quantization_replicate_scale(self):
       quant = _configure_quantization(quant_str="int8", mode_str=quant_mode, replicate_scale=True)
       self.assertEqual(quant.replicate_scale, True)
 
+  @pytest.mark.cpu_only
   def test_configure_quantization_is_int8(self):
     for quant_mode in ["train", "serve", "convert"]:
       quant = _configure_quantization(quant_str="int8", mode_str=quant_mode)