6 changes: 3 additions & 3 deletions .pylintrc
@@ -63,9 +63,9 @@ ignore-patterns=^\.#
# (useful for modules/projects where namespaces are manipulated during runtime
# and thus existing member attributes cannot be deduced by static analysis). It
# supports qualified module names, as well as Unix pattern matching.
ignored-modules=auto_gptq,
exllama_kernels,
exllamav2_kernels,
ignored-modules=gptqmodel,
gptqmodel_exllama_kernels,
gptqmodel_exllamav2_kernels,
llmcompressor,
cutlass_mm,
pygraphviz,
4 changes: 2 additions & 2 deletions .spellcheck-en-custom.txt
@@ -6,7 +6,6 @@ AIU
Spyre
spyre
Args
AutoGPTQ
autoregressive
backpropagation
bmm
@@ -38,8 +37,9 @@ frac
gptq
GPTQ
GPTQArguments
GPTQModel
gptqmodel
graphviz
GPTQ
hyperparameters
Inductor
inferenced
2 changes: 1 addition & 1 deletion README.md
@@ -42,7 +42,7 @@ FMS Model Optimizer is a framework for developing reduced precision neural netwo
*Optional packages based on optimization functionality required:*

- **GPTQ** is a popular compression method for LLMs:
- [auto_gptq](https://pypi.org/project/auto-gptq/) or build from [source](https://github.com/AutoGPTQ/AutoGPTQ)
- [gptqmodel](https://pypi.org/project/gptqmodel/) or build from [source](https://github.com/ModelCloud/GPTQModel)
- If you want to experiment with **INT8** deployment in [QAT](./examples/QAT_INT8/) and [PTQ](./examples/PTQ_INT8/) examples:
- Nvidia GPU with compute capability > 8.0 (A100 family or higher)
- Option 1:
2 changes: 1 addition & 1 deletion docs/fms_mo_design.md
@@ -82,7 +82,7 @@ FMS Model Optimizer supports FP8 in two ways:

### GPTQ (weight-only compression, or sometimes referred to as W4A16)

For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `auto_gptq` package. See this [example](../examples/GPTQ/)
For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `gptqmodel` package. See this [example](../examples/GPTQ/)
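
As a rough illustration of the savings described above, the sketch below compares the memory footprint of one FP16 `Linear` weight with its 4-bit GPTQ counterpart (packed 4-bit weights plus per-group FP16 scales and packed 4-bit zero-points). The layer shape and `group_size` are illustrative placeholders, not taken from any specific model.

```python
# Back-of-the-envelope W4A16 memory estimate for a single Linear layer.
in_features, out_features = 4096, 4096
group_size = 128

fp16_bytes = in_features * out_features * 2              # 16-bit weights

packed_bytes = in_features * out_features // 2           # 4-bit weights, two per byte
n_groups = (in_features // group_size) * out_features    # one scale/zero per group per output column
scale_bytes = n_groups * 2                               # FP16 scales
zero_bytes = n_groups // 2                               # packed 4-bit zero-points

w4a16_bytes = packed_bytes + scale_bytes + zero_bytes
print(f"FP16 : {fp16_bytes / 2**20:6.1f} MiB")
print(f"W4A16: {w4a16_bytes / 2**20:6.1f} MiB (~{fp16_bytes / w4a16_bytes:.1f}x smaller)")
```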


## Specification
80 changes: 39 additions & 41 deletions examples/GPTQ/README.md
@@ -1,12 +1,12 @@
# Generative Pre-Trained Transformer Quantization (GPTQ) of LLAMA-3-8B Model


For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed (some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this "weight-only compression", or sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323) by leveraging `auto_gptq`, a third party library, to perform quantization.
For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed (some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this "weight-only compression", or sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323) by leveraging `gptqmodel`, a third party library, to perform quantization.

## Requirements

- [FMS Model Optimizer requirements](../../README.md#requirements)
- `auto-gptq` is needed for this example. Use `pip install auto-gptq` or [install from source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source)
- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
- Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
```
pip install lm-eval
@@ -32,7 +32,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
> - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead

2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `auto_gptq` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).
2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `gptqmodel` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).

```bash
python -m fms_mo.run_quant \
@@ -49,8 +49,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> - In GPTQ, `group_size` is a trade-off between accuracy and speed, but there is an additional constraint that `in_features` of the Linear layer to be quantized needs to be an **integer multiple** of `group_size`, i.e. some models may have to use smaller `group_size` than default.
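
Before launching a long quantization job, the divisibility constraint above can be checked with a short standalone sketch; it is not part of `fms_mo`, and the model id is only a placeholder (any Hugging Face causal LM works).

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

group_size = 128
# Loads the full model just for inspection; pick a smaller model if memory is a concern.
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-8b-code")

# List Linear layers whose in_features is not an integer multiple of group_size.
incompatible = [
    name
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear) and module.in_features % group_size != 0
]
print(incompatible or f"All Linear layers are compatible with group_size={group_size}")
```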

> [!TIP]
> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try install `auto-gptq` from [source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source).
> 2. If you need to work on a custom model that is not supported by AutoGPTQ, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#customize-model).
> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try installing `gptqmodel` from [source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file).
> 2. If you need to work on a custom model that is not supported by GPTQModel, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file#how-to-add-support-for-a-new-model).

3. **Inspect the GPTQ checkpoint**
```python
@@ -62,10 +62,10 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m

```
layer mem (MB)
dtype
torch.float16 224 109.051904
torch.float32 67 4203.757568
torch.int32 672 3521.904640
dtype
torch.bfloat16 67 2101.878784
torch.float16 224 109.051904
torch.int32 672 3521.904640
```
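
A standalone way to produce a per-dtype summary like the one above from a saved quantized checkpoint is sketched below; it assumes the checkpoint is a single `safetensors` file, and the path is a placeholder.

```python
from collections import defaultdict
from safetensors import safe_open

ckpt = "llama3-8b-gptq/model.safetensors"  # placeholder path to the quantized checkpoint

counts, mbytes = defaultdict(int), defaultdict(float)
with safe_open(ckpt, framework="pt") as f:
    for key in f.keys():
        t = f.get_tensor(key)
        counts[t.dtype] += 1
        mbytes[t.dtype] += t.numel() * t.element_size() / 1e6  # MB

for dtype in sorted(counts, key=str):
    print(f"{str(dtype):15s} {counts[dtype]:5d} {mbytes[dtype]:12.3f} MB")
```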

4. **Evaluate the quantized model**'s performance on a selected task using `lm-eval` library, the command below will run evaluation on [`lambada_openai`](https://huggingface.co/datasets/EleutherAI/lambada_openai) task and show the perplexity/accuracy at the end.
@@ -82,29 +82,23 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
## Example Test Results

- Unquantized Model
```bash
|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.7103|± |0.0063|
| | | |none | 5|perplexity|↓ |3.7915|± |0.0727|
```
|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.7103|± |0.0063|
| | | |none | 5|perplexity|↓ |3.7915|± |0.0727|

- Quantized model with the settings showed above (`desc_act` default to False.)
```bash
|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.4271 |± |0.0069|
| | | |none | 5|perplexity|↓ |39.2316|± |2.2090|
```

|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6365 |± |0.0067|
| | | |none | 5|perplexity|↓ |5.9307 |± |0.1830|

- Quantized model with `desc_act` set to `True` (could improve the model quality, but at the cost of inference speed.)
```bash
|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6193 |± |0.0068|
| | | |none | 5|perplexity|↓ |5.8879 |± |0.1546|
```
|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6193 |± |0.0068|
| | | |none | 5|perplexity|↓ |5.8879 |± |0.1546|

> [!NOTE]
> There is some randomness in generating the model and data, the resulting accuracy may vary ~$\pm$ 0.05.

@@ -114,21 +108,25 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
quantize_config = BaseQuantizeConfig(
bits=gptq_args.bits,
group_size=gptq_args.group_size,
desc_act=gptq_args.desc_act,
damp_percent=gptq_args.damp_percent)
from gptqmodel import GPTQModel, QuantizeConfig

quantize_config = QuantizeConfig(
bits=gptq_args.bits,
group_size=gptq_args.group_size,
desc_act=gptq_args.desc_act,
damp_percent=gptq_args.damp_percent,
)

```

2. Load the pre_trained model with `auto_gptq` class/wrapper. Tokenizer is optional because we already tokenized the data in a previous step.
2. Load the pre_trained model with `gptqmodel` class/wrapper. Tokenizer is optional because we already tokenized the data in a previous step.

```python
model = AutoGPTQForCausalLM.from_pretrained(
model_args.model_name_or_path,
quantize_config=quantize_config,
torch_dtype=model_args.torch_dtype)
model = GPTQModel.from_pretrained(
model_args.model_name_or_path,
quantize_config=quantize_config,
torch_dtype=model_args.torch_dtype,
)
```

3. Load the tokenized dataset from disk.
@@ -143,9 +141,9 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
```python
model.quantize(
data,
use_triton=gptq_args.use_triton,
backend=BACKEND.TRITON if gptq_args.use_triton else BACKEND.AUTO,
batch_size=gptq_args.batch_size,
cache_examples_on_gpu=gptq_args.cache_examples_on_gpu,
calibration_enable_gpu_cache=gptq_args.cache_examples_on_gpu,
)
```
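
Putting the walk-through together, a condensed end-to-end sketch is shown below. It only uses calls that already appear in the steps above; the model id, argument values, and data path are placeholders, and the `BACKEND` import path from the top-level `gptqmodel` package is an assumption to verify against your installed version.

```python
import torch
from datasets import load_from_disk
from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel import BACKEND  # import path assumed; adjust if your gptqmodel version differs

# Illustrative values; in the example script these come from GPTQArguments.
quantize_config = QuantizeConfig(bits=4, group_size=128, desc_act=False, damp_percent=0.01)

model = GPTQModel.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",        # placeholder model id
    quantize_config=quantize_config,
    torch_dtype=torch.float16,
)

data = load_from_disk("path_to_save_train")  # tokenized calibration data from step 1

model.quantize(
    data,
    backend=BACKEND.AUTO,
    batch_size=1,
    calibration_enable_gpu_cache=True,
)
```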

49 changes: 30 additions & 19 deletions fms_mo/custom_ext_kernels/utils.py
@@ -14,7 +14,7 @@


"""This file contains external kernel registrations, compilation, and packing functions.
Some functions may require additional packages, e.g. auto_gptq, cutlass (source clone)
Some functions may require additional packages, e.g. gptqmodel, cutlass (source clone)
"""

# pylint: disable=ungrouped-imports,unused-argument,c-extension-no-member
@@ -491,27 +491,29 @@ def create_test_tensors(Nbatch, M, N, K, ele_type, accum_type):


def exllama_ops_load_and_reg(qcfg=None, run_unit_test=False):
"""Register Exllama kernels borrowed from auto-gptq
"""Register Exllama kernels borrowed from gptqmodel
Args:
qcfg: dict. quant config
run_unit_test: bool. Run unit tests after Op registration. (if unit tests defined.)

NOTE:
1. need to install auto-gptq python package
1. need to install gptqmodel python package
2. Op registration signature changed drastically from torch 2.1 - 2.4. TODO: add 2.4 support

see https://github.com/AutoGPTQ/AutoGPTQ for installation instruction
see https://github.com/ModelCloud/GPTQModel for installation instructions
"""
if qcfg is None:
qcfg = {}
elif qcfg:
qcfg["AUTOGPTQ_AVAILABLE"] = False
qcfg["GPTQMODEL_AVAILABLE"] = False

namespace = "autogptq_gemm"
namespace = "gptqmodel_gemm"
# check before compile
if hasattr(torch.ops, namespace) and hasattr(torch.ops.autogptq_gemm, "exv1_i4f16"):
logger.info("Custom AutoGPTQ functions have been loaded already!")
qcfg["AUTOGPTQ_AVAILABLE"] = True
if hasattr(torch.ops, namespace) and hasattr(
torch.ops.gptqmodel_gemm, "exv1_i4f16"
):
logger.info("Custom GPTQModel functions have been loaded already!")
qcfg["GPTQMODEL_AVAILABLE"] = True
need_registration = False
else:
need_registration = (
@@ -521,14 +523,14 @@ def exllama_ops_load_and_reg(qcfg=None, run_unit_test=False):

if not need_registration:
logger.warning(
"Please check the installation of AutoGPTQ package."
"Please check the installation of GPTQModel package."
"External kernels cannot be used this time."
)
return

# Third Party
import exllama_kernels
import exllamav2_kernels
import gptqmodel_exllama_kernels
import gptqmodel_exllamav2_kernels

# Register op
@reg_op(f"{namespace}::exv1_i4f16")
Expand All @@ -545,7 +547,7 @@ def exv1_i4f16_impl(x, q4, q4_width):
(x.shape[0], q4_width), dtype=torch.float16, device=x.device
)

exllama_kernels.q4_matmul(x, q4, output)
gptqmodel_exllama_kernels.q4_matmul(x, q4, output)
return output.view(outshape)

# Abstract implementation
@@ -573,7 +575,9 @@ def exv2_i4f16_impl(x, q_handle, q4_width, force_cuda):
(x.shape[0], q4_width), dtype=torch.float16, device=x.device
)

exllamav2_kernels.gemm_half_q_half(x, q_handle, output, force_cuda)
gptqmodel_exllamav2_kernels.gemm_half_q_half(
x, q_handle, output, force_cuda
)
return output.view(outshape)

# Abstract implementation
@@ -609,7 +613,9 @@ def exv2_i4f16_fxinputs_impl(
(x.shape[0], q4_width), dtype=torch.float16, device=x.device
)

exllamav2_kernels.gemm_half_q_half(x, q_handle, output, force_cuda)
gptqmodel_exllamav2_kernels.gemm_half_q_half(
x, q_handle, output, force_cuda
)
return output.view(outshape)

# Abstract implementation
@@ -623,10 +629,11 @@ def exv2_i4f16_fxinputs_abstract(
)

logger.info(
f"New AutoGPTQ gemm functions have been loaded and registered to torch.ops.{namespace}."
f"New GPTQModel gemm functions have been loaded and registered to \
torch.ops.{namespace}."
)
if qcfg:
qcfg["AUTOGPTQ_AVAILABLE"] = True
qcfg["GPTQMODEL_AVAILABLE"] = True

if run_unit_test:
return NotImplemented
@@ -1171,10 +1178,14 @@ def swap_nnlinear_to_quantlinear(model, qconfig, prefix=None, qlinear2use=None):
QuantLinear = qlinear2use
elif exVer == 1:
# Third Party
from auto_gptq.nn_modules.qlinear.qlinear_exllama import QuantLinear
from gptqmodel.nn_modules.qlinear.exllama import (
ExllamaQuantLinear as QuantLinear,
)
else:
# Third Party
from auto_gptq.nn_modules.qlinear.qlinear_exllamav2 import QuantLinear
from gptqmodel.nn_modules.qlinear.exllamav2 import (
ExllamaV2QuantLinear as QuantLinear,
)

num_swapped = 0
for n, m in model.named_modules():
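For readers unfamiliar with the registration pattern used in `exllama_ops_load_and_reg` above (ops defined under a custom namespace so they become reachable as `torch.ops.<namespace>.<op>`), here is a tiny self-contained sketch of the same mechanism using `torch.library` and a dummy op. It does not need the GPTQModel extension kernels and is not part of this PR; the actual code uses its own `reg_op` helper and also registers abstract/meta implementations.

```python
import torch

# Define a custom op namespace, declare an op schema, and register a CPU implementation.
lib = torch.library.Library("demo_gemm", "DEF")
lib.define("scale_add(Tensor x, float s) -> Tensor")

def scale_add_cpu(x: torch.Tensor, s: float) -> torch.Tensor:
    return x * s + 1.0

lib.impl("scale_add", scale_add_cpu, "CPU")

# The op is now dispatched through torch.ops, analogous to torch.ops.gptqmodel_gemm.exv1_i4f16 above.
print(torch.ops.demo_gemm.scale_add(torch.ones(3), 2.0))  # tensor([3., 3., 3.])
```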
14 changes: 7 additions & 7 deletions fms_mo/fx/utils.py
@@ -41,9 +41,9 @@
# Local
from fms_mo.modules.linear import QLinearExv1WI4AF16, QLinearExv2WI4AF16

autogptq_available = True
gptqmodel_available = True
except ImportError:
autogptq_available = False
gptqmodel_available = False


MIN_BLOCK_SIZE = 5
@@ -91,7 +91,7 @@ def check_qclass_fallback_based_on_min_feat(
]
if cutlass_available:
qclass_has_constraints += [QLinearCutlassI8I32NT]
if autogptq_available:
if gptqmodel_available:
qclass_has_constraints += [QLinearExv1WI4AF16, QLinearExv2WI4AF16]

qclass = type(ref_module)
@@ -129,7 +129,7 @@ def lower_qmodel_to_ext_kernels(
1. user need to define a mapping thru qcfg["ext_kernel_mapping_mod"]
2. to make it simple, only swap user specified qclass, nothing else
3. move the module to GPU before swapping to accelerate scale/zp calculations
4. autogptq_post_init() must be done at model level, or OOM and incorrect results easily
4. gptq_post_init() must be done at model level, or OOM and incorrect results easily

Args:
mod (torch.nn.Module): model to be 'lowered'
@@ -156,7 +156,7 @@
qclass_must_start_from_cpu = None
using_gptq = False
if (
available_packages["auto_gptq"]
available_packages["gptqmodel"]
and available_packages["exllama_kernels"]
and available_packages["exllamav2_kernels"]
):
@@ -207,9 +207,9 @@

if using_gptq:
# Third Party
from auto_gptq.modeling._utils import autogptq_post_init
from gptqmodel.utils.model import hf_gptqmodel_post_init as gptq_post_init

mod_tmp = autogptq_post_init(mod_tmp, use_act_order=False) # see Note 4
mod_tmp = gptq_post_init(mod_tmp, use_act_order=False) # see Note 4

mod.to(currDev)
logger.info(mod)
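The import guard and `available_packages` checks in this file reduce to 'enable the GPTQ lowering path only when the optional packages are importable'. A minimal standalone version of that check is sketched below; the kernel module names follow the renaming in this PR and may differ depending on how `gptqmodel` was built.

```python
from importlib.util import find_spec

required = ("gptqmodel", "gptqmodel_exllama_kernels", "gptqmodel_exllamav2_kernels")
gptqmodel_available = all(find_spec(name) is not None for name in required)

if not gptqmodel_available:
    print("GPTQModel or its exllama kernels not found; external-kernel lowering will be skipped.")
```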