
Commit 232736b

sugunav14, Fridah-nv, and realAsma authored and committed
GPTQ official (#853)
## What does this PR do?

Implements the official version of GPTQ with a decoder-level sequential calibration flow.
Ref: https://github.com/IST-DASLab/FP-Quant/tree/master

**Type of change:** New feature

**Overview:**

1. Deletes the `gptq_lite` configuration. `gptq()` is a more generic implementation and resolves to the `gptq_lite` behavior when `use_sequential` is set to `False`.
2. Introduces a `GPTQHelper` class to handle Hessian-collection patching/unpatching, blockwise weight updates, Hessian initialization, `hessian_inverse` computation, etc.

## Usage

```bash
python hf_ptq.py --pyt_ckpt_path Qwen/Qwen3-8B --qformat nvfp4_gptq --kv_cache_qformat none \
    --dataset cnn_dailymail --batch_size 32 --calib_seq 512 --calib_size 512 \
    --export_path exported_model
```

## Testing

Measured perplexity and activation MSE on the following models.

## Before your PR is "*Ready for review*"

- **Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No

## Summary by CodeRabbit

- **New Features**
  - Added a new NVFP4 GPTQ quantization option for CLI/configuration.
- **Improvements**
  - Replaced GPTQ-lite with a full GPTQ calibration pipeline for more accurate, Hessian-aware blockwise updates and token-based sampling.
  - Added sequential layer-by-layer calibration support and automatic promotion of NVFP4 static quantizers.
  - Improved logging and timing for calibration runs.
- **Tests**
  - Expanded GPTQ tests, including export/roundtrip validation for quantized weights.

---

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: realAsma <86726418+realAsma@users.noreply.github.com>
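For reviewers unfamiliar with the algorithm, here is a minimal, self-contained sketch of the blockwise GPTQ update this PR implements, following the IST-DASLab reference linked above. The function name `gptq_blockwise_update` and the `quantize_fn` callback are illustrative only, not the actual `GPTQHelper` API:

```python
import torch


def gptq_blockwise_update(W, H, quantize_fn, block_size=128, perc_damp=0.01):
    """Illustrative GPTQ update (not the GPTQHelper API).

    W: (out_features, in_features) weight; H: (in, in) Hessian ~ 2/N * sum(x x^T);
    quantize_fn: fake-quantizes one weight column. Columns are quantized left to
    right, and each column's rounding error is folded into the not-yet-quantized
    columns via the inverse-Hessian Cholesky factor.
    """
    W, H = W.clone().float(), H.float()
    n = W.shape[1]
    # Dampen the Hessian diagonal for numerical stability (the perc_damp knob).
    H = H + perc_damp * torch.diag(H).mean() * torch.eye(n, device=H.device)
    # Upper-triangular Cholesky factor of H^{-1}.
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True
    )
    Q = torch.zeros_like(W)
    for i1 in range(0, n, block_size):
        i2 = min(i1 + block_size, n)
        W1 = W[:, i1:i2].clone()
        Err1 = torch.zeros_like(W1)
        Hinv1 = Hinv[i1:i2, i1:i2]
        for j in range(i2 - i1):
            w, d = W1[:, j], Hinv1[j, j]
            q = quantize_fn(w)
            Q[:, i1 + j] = q
            # Spread this column's rounding error over later columns in the block.
            err = (w - q) / d
            W1[:, j:] -= err.unsqueeze(1) @ Hinv1[j, j:].unsqueeze(0)
            Err1[:, j] = err
        # Propagate the block's accumulated error to all remaining columns.
        W[:, i2:] -= Err1 @ Hinv[i1:i2, i2:]
    return Q
```

In the PR itself, NVFP4 fake quantization plays the role of `quantize_fn`, and the Hessians come from the decoder-by-decoder sequential calibration flow.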
1 parent a1ca3f7 · commit 232736b

8 files changed

Lines changed: 443 additions & 410 deletions


.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
```diff
@@ -75,6 +75,7 @@ repos:
       # Instead, we should manually add the license header to those files *after* the original header.
       exclude: >
         (?x)^(
+            modelopt/torch/quantization/utils/calib_utils.py|
             modelopt/onnx/quantization/operators.py|
             modelopt/onnx/quantization/ort_patching.py|
             modelopt/torch/_deploy/utils/onnx_utils.py|
```

modelopt/torch/quantization/config.py

Lines changed: 8 additions & 18 deletions
```diff
@@ -1503,24 +1503,20 @@ class SVDQuantConfig(QuantizeAlgorithmConfig):
     )


-class GPTQLiteConfig(QuantizeAlgorithmConfig):
-    """The config for GPTQ lite.
+class GPTQCalibConfig(QuantizeAlgorithmConfig):
+    """The config for GPTQ quantization.

-    GPTQ lite is a variant of GPTQ that does not exactly follow the official GPTQ implementation.
-
-    GPTQ lite does not perform sequential quantization of layers. This means that the updated
-    activations are not used to process the next layer.
+    GPTQ minimizes the layer-wise quantization error by using second-order (Hessian) information
+    to perform blockwise weight updates that compensate for rounding loss. Layers are quantized
+    sequentially so that each layer's Hessian is computed from activations that already reflect
+    the quantization of preceding layers.

     The default values are taken from the official GPTQ implementation:
     https://github.com/IST-DASLab/FP-Quant/blob/d2e3092f968262c4de5fb050e1aef568a280dadd/src/quantization/gptq.py#L35
-
-    Note: This feature is currently experimental and may not translate to improved accuracy as expected.
-
-
     """

-    method: Literal["gptq_lite"] = ModeloptField("gptq_lite")
-    percdamp: float | None = ModeloptField(
+    method: Literal["gptq"] = ModeloptField("gptq")
+    perc_damp: float | None = ModeloptField(
         default=0.01,
         gt=0.0,
         le=1.0,
@@ -1533,12 +1529,6 @@ class GPTQLiteConfig(QuantizeAlgorithmConfig):
         description="""The block size for GPTQ weight update, which must be a multiple of the
             group_size used in the quantization.""",
     )
-    hessian_state_path: str | None = ModeloptField(
-        default=None,
-        title="Path to the Hessian state file.",
-        description="""The path to the Hessian state file. If hessian path exists, we load from
-            hessian file instead of recomputing them.""",
-    )


 QuantizeQuantCfgType = list[QuantizerCfgEntry]
```
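For context on the `perc_damp` field above and on where the Hessian comes from: the statistic GPTQ needs per linear layer is H ≈ (2/N) Σ x xᵀ over calibration tokens. Below is a hedged sketch of such an accumulator; the PR's `GPTQHelper` collects this by patching module forwards, and the class and method names here are illustrative assumptions only:

```python
import torch


class HessianAccumulator:
    """Illustrative running estimate of H = 2/N * sum(x x^T) for one linear layer.

    Names are assumptions for illustration; the PR gathers this inside
    GPTQHelper rather than through a standalone class like this.
    """

    def __init__(self, in_features: int, device: str = "cuda"):
        self.H = torch.zeros(in_features, in_features, device=device)
        self.n_tokens = 0

    @torch.no_grad()
    def update(self, x: torch.Tensor) -> None:
        # Flatten (batch, seq, hidden) -> (tokens, hidden).
        x = x.reshape(-1, x.shape[-1]).float()
        t = x.shape[0]
        # Rescale the old average, then add the new batch's contribution,
        # keeping H equal to 2/N * sum(x x^T) over all tokens seen so far.
        self.H *= self.n_tokens / (self.n_tokens + t)
        self.n_tokens += t
        self.H += (2.0 / self.n_tokens) * (x.t() @ x)
```

Dampening then adds `perc_damp * mean(diag(H))` to the diagonal before inversion, which is why the field is bounded to (0, 1].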

modelopt/torch/quantization/mode.py

Lines changed: 7 additions & 7 deletions
```diff
@@ -37,7 +37,7 @@
     AWQFullCalibConfig,
     AWQLiteCalibConfig,
     CompressConfig,
-    GPTQLiteConfig,
+    GPTQCalibConfig,
     LocalHessianCalibConfig,
     MaxCalibConfig,
     MseCalibConfig,
@@ -59,7 +59,7 @@
 )
 from .model_calib import (
     awq,
-    gptq_lite,
+    gptq,
     local_hessian_calibrate,
     max_calibrate,
     mse_calibrate,
@@ -240,8 +240,8 @@ def wrapped_calib_func(
     if sequential:
         if forward_loop is None:
             raise ValueError("forward_loop is required for calibration but got None.")
-        assert method in ["max"], (
-            f"Sequential calibration currently only supports max calibration, got {method}"
+        assert method in ["max", "gptq"], (
+            f"Sequential calibration currently only supports max and gptq calibration, got {method}"
         )
         # Wrap with sequential processing
         sequential_calibrate(
@@ -493,12 +493,12 @@ def restore(self) -> RestoreEntrypoint:


 @CalibrateModeRegistry.register_mode
-class GPTQLiteModeDescriptor(BaseCalibrateModeDescriptor):
+class GPTQModeDescriptor(BaseCalibrateModeDescriptor):
     """Mode for GPTQ calibration algorithm."""

     @property
     def config_class(self) -> type[QuantizeAlgorithmConfig]:
         """Specifies the config class for the mode."""
-        return GPTQLiteConfig
+        return GPTQCalibConfig

-    _calib_func = gptq_lite
+    _calib_func = gptq
```
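The mode registration above makes the new method reachable through the standard `mtq.quantize` entry point. A minimal usage sketch follows; the `"algorithm"` config-dict layout and the `block_size` field name are assumptions inferred from this diff rather than documented API, and the calibration data is a stand-in for the cnn_dailymail set used in the Usage section:

```python
import copy

import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

calib_texts = ["Example calibration sentence."] * 8  # stand-in calibration data


def calibrate_loop(m):
    # forward_loop is required here because "gptq" runs sequential calibration.
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(m.device)
        m(ids)


# Assumed wiring: "gptq" replaces "gptq_lite" as the algorithm name and takes
# perc_damp (formerly percdamp); block_size is a guess at the field whose
# description appears in the config.py diff.
config = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
config["algorithm"] = {"method": "gptq", "perc_damp": 0.01, "block_size": 128}

model = mtq.quantize(model, config, forward_loop=calibrate_loop)
```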
