Integrate AutoRound into Diffusers #13552
Open

xin3he wants to merge 7 commits into huggingface:main from xin3he:auto_round
Commits:

- 42d4fdc support auto_round (xin3he)
- e1714a9 add document and unit tests (xin3he)
- c0daf15 fix CI (xin3he)
- f04afa9 Merge branch 'main' into auto_round (xin3he)
- 677a26e Apply suggestions from code review (xin3he)
- bc46f4f update document and overwrite the default quantization_config with sp… (xin3he)
- fb2e4c2 add UT and fix bug (xin3he)
<!-- Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoRound

[AutoRound](https://github.com/intel/auto-round) is an advanced quantization toolkit. It achieves high accuracy at ultra-low bit widths (2-4 bits) with minimal tuning by leveraging sign-gradient descent, and it offers broad hardware compatibility. See our papers [SignRoundV1](https://arxiv.org/pdf/2309.05516) and [SignRoundV2](https://arxiv.org/abs/2512.04746) for more details.

Install `auto-round` (version ≥ 0.13.0):

```bash
pip install "auto-round>=0.13.0"
```
To use the Marlin kernel for faster CUDA inference, install `gptqmodel`:

```bash
pip install "gptqmodel>=5.8.0"
```
## Load a quantized model

Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. The method works with any model that loads via [Accelerate](https://hf.co/docs/accelerate/index) and has `torch.nn.Linear` layers.

```python
import torch
from diffusers import ZImageTransformer2DModel, ZImagePipeline, AutoRoundConfig

model_id = "INCModel/Z-Image-W4A16-AutoRound"

quantization_config = AutoRoundConfig(backend="marlin")
transformer = ZImageTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

pipe = ZImagePipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe("a cat holding a sign that says hello").images[0]
image.save("output.png")
```

> [!NOTE]
> AutoRound in Diffusers only supports loading *pre-quantized* models. To quantize a model from scratch, use the [AutoRound CLI or Python API](https://github.com/intel/auto-round) directly, then load the result with Diffusers.
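As a rough sizing intuition (back-of-the-envelope arithmetic, not measured numbers), W4A16 packs weights into 4 bits and keeps a small amount of per-group scale/zero-point metadata, so the quantized weight payload is roughly a quarter of its bf16 size. The parameter count and metadata widths below are illustrative assumptions:

```python
# Back-of-the-envelope W4A16 weight-storage estimate (illustrative only).
def w4_weight_bytes(n_params, group_size=128, scale_bytes=2, zero_bytes=2):
    packed = n_params * 4 // 8                                    # 4-bit packed weights
    meta = (n_params // group_size) * (scale_bytes + zero_bytes)  # per-group scale + zero point
    return packed + meta

n = 6_000_000_000                  # hypothetical 6B-parameter transformer
bf16 = n * 2                       # 2 bytes per parameter in bf16
ratio = w4_weight_bytes(n) / bf16  # ~0.27, i.e. roughly 3.8x smaller
```

Shrinking `group_size` (e.g. to 32) improves accuracy but grows the metadata term, which is the trade-off the configuration table below describes.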
## Backends

AutoRound supports multiple inference backends. The backend controls which kernel handles dequantization during the forward pass. Set the `backend` parameter in [`AutoRoundConfig`] to choose one:

| Backend | Value | Device | Requirements | Notes |
|---------|-------|--------|--------------|-------|
| **Auto** | `"auto"` | Any | — | Default. Automatically selects the best available backend. |
| **PyTorch** | `"torch"` | CPU / CUDA | — | Pure PyTorch implementation. Broadest compatibility. |
| **Triton** | `"tritonv2"` | CUDA | `triton` | Triton-based kernel for GPU inference. |
| **ExllamaV2** | `"exllamav2"` | CUDA | `gptqmodel>=5.8.0` | Good CUDA performance via the ExllamaV2 kernel. |
| **Marlin** | `"marlin"` | CUDA | `gptqmodel>=5.8.0` | Best CUDA performance via the Marlin kernel. |
```python
from diffusers import AutoRoundConfig

# Auto-select (default)
config = AutoRoundConfig()

# Explicit Triton backend for CUDA
config = AutoRoundConfig(backend="tritonv2")

# Marlin backend for best CUDA performance (requires gptqmodel>=5.8.0)
config = AutoRoundConfig(backend="marlin")

# ExllamaV2 backend for good CUDA performance (requires gptqmodel>=5.8.0)
config = AutoRoundConfig(backend="exllamav2")

# PyTorch backend for CPU/CUDA inference
config = AutoRoundConfig(backend="torch")
```
## Quantization configurations

AutoRound focuses on weight-only quantization. The primary configuration is W4A16 (4-bit weights, 16-bit activations), with flexibility in group size and symmetry:

| Configuration | `bits` | `group_size` | `sym` | Description |
|--------------|--------|-------------|-------|-------------|
| W4G128 asymmetric | `4` | `128` | `False` | Default. Good balance of accuracy and compression. |
| W4G128 symmetric | `4` | `128` | `True` | Faster dequantization, small accuracy trade-off. |
| W4G32 asymmetric | `4` | `32` | `False` | Higher accuracy at the cost of more metadata. |
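To make the table concrete, here is an illustrative group-wise 4-bit round trip. It is a sketch of the arithmetic behind `bits`, `group_size`, and `sym`, not AutoRound's actual kernels or packing format: `group_size` sets how many weights share one scale, and `sym` decides whether the zero point is fixed at mid-range or fitted per group.

```python
# Illustrative group-wise 4-bit quantize/dequantize round trip (a sketch,
# NOT AutoRound's actual kernel code or storage format).
import torch

def quantize_dequantize(w, bits=4, group_size=32, sym=False):
    qmax = 2**bits - 1                  # 15 quantization levels above zero for 4-bit
    groups = w.reshape(-1, group_size)  # one scale (+ zero point) per group
    if sym:
        # Symmetric: zero point fixed at mid-range, scale from max magnitude.
        scale = groups.abs().amax(dim=1, keepdim=True) / (qmax / 2)
        zero = torch.full_like(scale, (qmax + 1) / 2)
    else:
        # Asymmetric: scale and zero point fitted to each group's min/max.
        wmin = groups.amin(dim=1, keepdim=True)
        wmax = groups.amax(dim=1, keepdim=True)
        scale = (wmax - wmin) / qmax
        zero = -wmin / scale
    q = torch.clamp(torch.round(groups / scale + zero), 0, qmax)  # stored ints
    return ((q - zero) * scale).reshape(w.shape)                  # dequantized

torch.manual_seed(0)
w = torch.randn(256)
err_asym = float((w - quantize_dequantize(w, sym=False)).abs().max())
err_sym = float((w - quantize_dequantize(w, sym=True)).abs().max())
# Larger groups stretch one scale over more weights, typically raising error.
err_g128 = float((w - quantize_dequantize(w, group_size=128)).abs().max())
```

The asymmetric variant spends extra metadata on a fitted zero point per group, which is why it tends to be more accurate than the symmetric one at the same group size.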
## Save and load

<hfoptions id="save-and-load">
<hfoption id="save">

```python
from auto_round import AutoRound

autoround = AutoRound(
    tiny_z_image_model_path,
    num_inference_steps=3,
    guidance_scale=7.5,
    dataset="coco2014",
)
autoround.quantize_and_save("Z-Image-W4A16-AutoRound")
```
</hfoption>
<hfoption id="load">

```python
import torch
from diffusers import ZImageTransformer2DModel, ZImagePipeline

model_id = "INCModel/Z-Image-W4A16-AutoRound"

# The inference backend will be automatically selected.
pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe("a cat holding a sign that says hello").images[0]
image.save("output.png")
```

</hfoption>
</hfoptions>
## Resources

- [Pre-quantized AutoRound models on the Hub](https://huggingface.co/models?search=autoround)
The quantizer registration changes (hunk headers preserved from the diff):

```python
@@ -18,10 +18,12 @@
import warnings

from .autoround import AutoRoundQuantizer
from .bitsandbytes import BnB4BitDiffusersQuantizer, BnB8BitDiffusersQuantizer
from .gguf import GGUFQuantizer
from .modelopt import NVIDIAModelOptQuantizer
from .quantization_config import (
    AutoRoundConfig,
    BitsAndBytesConfig,
    GGUFQuantizationConfig,
    NVIDIAModelOptConfig,

@@ -41,6 +43,7 @@
    "quanto": QuantoQuantizer,
    "torchao": TorchAoHfQuantizer,
    "modelopt": NVIDIAModelOptQuantizer,
    "auto-round": AutoRoundQuantizer,
}

AUTO_QUANTIZATION_CONFIG_MAPPING = {

@@ -50,6 +53,7 @@
    "quanto": QuantoConfig,
    "torchao": TorchAoConfig,
    "modelopt": NVIDIAModelOptConfig,
    "auto-round": AutoRoundConfig,
}
```
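The effect of the new `AUTO_QUANTIZATION_CONFIG_MAPPING` entry is that a checkpoint whose serialized config declares `quant_method: "auto-round"` resolves to `AutoRoundConfig`. A minimal stand-alone sketch of that dispatch follows; the classes here are hypothetical simplified stand-ins, not the Diffusers internals:

```python
# Hypothetical, simplified quant-method dispatch. The real Diffusers mapping
# and config classes carry far more state than these stand-ins.
class AutoRoundConfig:
    quant_method = "auto-round"

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

AUTO_QUANTIZATION_CONFIG_MAPPING = {"auto-round": AutoRoundConfig}

def config_from_dict(d):
    # Look up the config class by the serialized quant_method key.
    method = d.pop("quant_method")
    try:
        cls = AUTO_QUANTIZATION_CONFIG_MAPPING[method]
    except KeyError:
        raise ValueError(f"Unknown quant_method: {method!r}")
    return cls(**d)

cfg = config_from_dict({"quant_method": "auto-round", "backend": "marlin", "bits": 4})
```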
```python
@@ -136,13 +140,26 @@ def merge_quantization_configs(
        )
    else:
        warning_msg = ""

    if isinstance(quantization_config, dict):
        existing_fields = set(quantization_config.keys())
        quantization_config = cls.from_dict(quantization_config)
    else:
        existing_fields = set(quantization_config.__dict__.keys())

    if isinstance(quantization_config, NVIDIAModelOptConfig):
        quantization_config.check_model_patching()

    if quantization_config_from_args is not None:
        # Only override fields that the user explicitly set.
        for key, value in quantization_config_from_args.__dict__.items():
            if key not in existing_fields:
                # Field does not exist in the model's quantization_config, add it.
                setattr(quantization_config, key, value)
                warning_msg += (
                    f" Field `{key}` from `quantization_config_from_args` is not present in the model's "
                    f"`quantization_config`. Adding it with value: {value!r}."
                )

    if warning_msg != "":
        warnings.warn(warning_msg)
```

> Reviewer comment (Member) on lines +144 to +147: Can you explain this change?
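The merge rule this hunk implements (a field from the user's `quantization_config` argument is only added when the checkpoint's config does not already define it) can be sketched as a standalone helper. This is a hypothetical dict-level simplification, not the Diffusers method:

```python
import warnings

def merge_configs(checkpoint_config: dict, user_config: dict) -> dict:
    """Hypothetical sketch of the merge rule above: checkpoint fields win;
    fields only present in the user's config are added with a warning."""
    merged = dict(checkpoint_config)
    msg = ""
    for key, value in user_config.items():
        if key not in merged:
            merged[key] = value
            msg += f" Field `{key}` is not in the model's quantization_config. Adding it with value: {value!r}."
    if msg:
        warnings.warn(msg)
    return merged

merged = merge_configs(
    {"bits": 4, "group_size": 128, "sym": False},  # from the checkpoint
    {"backend": "marlin", "bits": 8},              # from the user
)
# `bits` stays 4 (the checkpoint wins); `backend` is added from the user's config.
```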
```python
@@ -0,0 +1 @@
from .autoround_quantizer import AutoRoundQuantizer
```

128 changes: 128 additions & 0 deletions
src/diffusers/quantizers/autoround/autoround_quantizer.py

```python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from ...utils import (
    is_auto_round_available,
    is_torch_available,
    logging,
)
from ..base import DiffusersQuantizer


if TYPE_CHECKING:
    from ...models.modeling_utils import ModelMixin


if is_torch_available():
    import torch

logger = logging.get_logger(__name__)


class AutoRoundQuantizer(DiffusersQuantizer):
    r"""
    Diffusers Quantizer for AutoRound (https://github.com/intel/auto-round).

    AutoRound is a weight-only quantization method that uses sign gradient descent to jointly optimize
    rounding values and min-max ranges for weights. It supports W4A16 (4-bit weight, 16-bit activation)
    quantization for efficient inference.

    This quantizer only supports loading pre-quantized AutoRound models. On-the-fly quantization
    (calibration) is not supported through this interface.
    """

    # AutoRound requires data calibration — we only support loading pre-quantized checkpoints.
    requires_calibration = True
    required_packages = ["auto_round"]

    def __init__(self, quantization_config, **kwargs):
        super().__init__(quantization_config, **kwargs)

    def validate_environment(self, *args, **kwargs):
        """
        Validates that the auto-round library (>= 0.5) is installed and captures the device_map
        for later use during model conversion.
        """
        self.device_map = kwargs.get("device_map", None)
        if not is_auto_round_available():
            raise ImportError(
                "Loading an AutoRound quantized model requires the auto-round library "
                "(`pip install 'auto-round>=0.5'`)"
            )

    def _process_model_before_weight_loading(
        self,
        model: "ModelMixin",
        device_map,
        keep_in_fp32_modules: list[str] = [],
        **kwargs,
    ):
        """
        Replaces target nn.Linear layers with AutoRound's quantized QuantLinear layers before
        weights are loaded from the checkpoint.

        Uses `auto_round.inference.convert_model.convert_hf_model` which:
        - Inspects the model architecture and the quantization config (bits, group_size, sym, backend).
        - Replaces eligible nn.Linear modules with the appropriate QuantLinear variant
          (the packed-weight layer that stores qweight, scales, qzeros).
        - Returns the converted model and a set of used backend names.

        `infer_target_device` resolves the device_map into a single target device string
        that AutoRound uses to select the correct kernel backend (e.g. "cuda", "cpu").
        """
        from auto_round.inference.convert_model import convert_hf_model, infer_target_device

        if self.pre_quantized:
            target_device = infer_target_device(self.device_map)
            model, used_backends = convert_hf_model(model, target_device)
            self.used_backends = used_backends

    def _process_model_after_weight_loading(self, model, **kwargs):
        """
        Finalizes the model after all quantized weights (qweight, scales, qzeros, etc.) have
        been loaded into the QuantLinear layers.

        Uses `auto_round.inference.convert_model.post_init` which:
        - Performs backend-specific finalization (e.g. repacking weights into the kernel's
          expected memory layout, moving buffers to the correct device).
        - Freezes quantized parameters (requires_grad=False).
        - Prepares the model for inference.

        Raises ValueError if the model is not pre-quantized, since AutoRound does not support
        on-the-fly quantization through this loading path.
        """
        if self.pre_quantized:
            from auto_round.inference.convert_model import post_init

            post_init(model, self.used_backends)
        else:
            raise ValueError(
                "AutoRound quantizer in diffusers only supports pre-quantized models. "
                "Please provide a model that has already been quantized with AutoRound."
            )
        return model

    @property
    def is_trainable(self) -> bool:
        """AutoRound W4A16 pre-quantized models do not support training."""
        return False

    @property
    def is_serializable(self):
        """AutoRound quantized models can be serialized (the quantization config may be
        updated by the backend, e.g. for GPTQ/AWQ-compatible formats)."""
        return True
```