
Commit 9fef5b2

Merge pull request #102 from tharapalanivel/gptq_model
feat: GPTQModel Migration
2 parents 05bb442 + a6eed1d commit 9fef5b2

14 files changed

Lines changed: 123 additions & 114 deletions


.pylintrc

Lines changed: 3 additions & 3 deletions
@@ -63,9 +63,9 @@ ignore-patterns=^\.#
 # (useful for modules/projects where namespaces are manipulated during runtime
 # and thus existing member attributes cannot be deduced by static analysis). It
 # supports qualified module names, as well as Unix pattern matching.
-ignored-modules=auto_gptq,
-                exllama_kernels,
-                exllamav2_kernels,
+ignored-modules=gptqmodel,
+                gptqmodel_exllama_kernels,
+                gptqmodel_exllamav2_kernels,
                 llmcompressor,
                 cutlass_mm,
                 pygraphviz,

.spellcheck-en-custom.txt

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,6 @@ AIU
 Spyre
 spyre
 Args
-AutoGPTQ
 autoregressive
 backpropagation
 bmm
@@ -38,8 +37,9 @@ frac
 gptq
 GPTQ
 GPTQArguments
+GPTQModel
+gptqmodel
 graphviz
-GPTQ
 hyperparameters
 Inductor
 inferenced

README.md

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ FMS Model Optimizer is a framework for developing reduced precision neural netwo
 *Optional packages based on optimization functionality required:*
 
 - **GPTQ** is a popular compression method for LLMs:
-  - [auto_gptq](https://pypi.org/project/auto-gptq/) or build from [source](https://github.com/AutoGPTQ/AutoGPTQ)
+  - [gptqmodel](https://pypi.org/project/gptqmodel/) or build from [source](https://github.com/ModelCloud/GPTQModel)
 - If you want to experiment with **INT8** deployment in [QAT](./examples/QAT_INT8/) and [PTQ](./examples/PTQ_INT8/) examples:
   - Nvidia GPU with compute capability > 8.0 (A100 family or higher)
   - Option 1:

docs/fms_mo_design.md

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ FMS Model Optimizer supports FP8 in two ways:
 
 ### GPTQ (weight-only compression, or sometimes referred to as W4A16)
 
-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `auto_gptq` package. See this [example](../examples/GPTQ/)
+For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `gptqmodel` package. See this [example](../examples/GPTQ/)
 
 
 ## Specification
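A checkpoint compressed with the W4A16 flow described above can then be loaded and served directly through gptqmodel. A minimal sketch (not part of this commit), assuming gptqmodel's `GPTQModel.load`/`generate` convenience API and a placeholder checkpoint path:

```python
# Hedged sketch: load an already-quantized W4A16 checkpoint with gptqmodel and
# run one generation. The checkpoint path and prompt are placeholders; check
# the exact API surface against the gptqmodel release pinned by the repo.
from gptqmodel import GPTQModel

model = GPTQModel.load("path/to/llama-3-8b-gptq-w4a16")
output_ids = model.generate("The bottleneck of LLM inference is often")[0]
print(model.tokenizer.decode(output_ids))
```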

examples/GPTQ/README.md

Lines changed: 39 additions & 41 deletions
@@ -1,12 +1,12 @@
 # Generative Pre-Trained Transformer Quantization (GPTQ) of LLAMA-3-8B Model
 
 
-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed (some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this "weight-only compression", or sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323) by leveraging `auto_gptq`, a third party library, to perform quantization.
+For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed (some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this "weight-only compression", or sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323) by leveraging `gptqmodel`, a third party library, to perform quantization.
 
 ## Requirements
 
 - [FMS Model Optimizer requirements](../../README.md#requirements)
-- `auto-gptq` is needed for this example. Use `pip install auto-gptq` or [install from source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source)
+- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
 - Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
 ```
 pip install lm-eval
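A quick way to confirm the renamed dependency is importable after following the updated requirement above (not part of the original README):

```python
# Sanity check that gptqmodel is importable; a failure here usually means the
# package (or its compiled kernels) was not installed correctly.
import gptqmodel

print("gptqmodel", getattr(gptqmodel, "__version__", "installed"))
```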
@@ -32,7 +32,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 > - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
 > - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
 
-2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `auto_gptq` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).
+2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `gptqmodel` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).
 
 ```bash
 python -m fms_mo.run_quant \
@@ -49,8 +49,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 > - In GPTQ, `group_size` is a trade-off between accuracy and speed, but there is an additional constraint that `in_features` of the Linear layer to be quantized needs to be an **integer multiple** of `group_size`, i.e. some models may have to use smaller `group_size` than default.
 
 > [!TIP]
-> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try install `auto-gptq` from [source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source).
-> 2. If you need to work on a custom model that is not supported by AutoGPTQ, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#customize-model).
+> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try installing `gptqmodel` from [source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file).
+> 2. If you need to work on a custom model that is not supported by GPTQModel, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file#how-to-add-support-for-a-new-model).
 
 3. **Inspect the GPTQ checkpoint**
 ```python
@@ -62,10 +62,10 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 
 ```
                 layer     mem (MB)
-dtype
-torch.float16     224   109.051904
-torch.float32      67  4203.757568
-torch.int32       672  3521.904640
+dtype
+torch.bfloat16     67  2101.878784
+torch.float16     224   109.051904
+torch.int32       672  3521.904640
 ```
 
 4. **Evaluate the quantized model**'s performance on a selected task using `lm-eval` library, the command below will run evaluation on [`lambada_openai`](https://huggingface.co/datasets/EleutherAI/lambada_openai) task and show the perplexity/accuracy at the end.
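The per-dtype memory table in step 3 above can be reproduced with a short helper; a minimal sketch (not the snippet elided from this diff), assuming the quantized checkpoint is stored as a single `model.safetensors` file:

```python
# Hedged sketch: tally checkpoint memory by tensor dtype, approximating the
# layer-count / MB-per-dtype table shown in step 3. The path is a placeholder;
# sharded checkpoints would need to iterate over every shard file.
import pandas as pd
from safetensors.torch import load_file

state_dict = load_file("<path_to_quantized_model>/model.safetensors")
df = pd.DataFrame(
    [{"dtype": str(t.dtype), "mem (MB)": t.numel() * t.element_size() / 1e6} for t in state_dict.values()]
)
print(df.groupby("dtype").agg(layer=("dtype", "count"), mem_mb=("mem (MB)", "sum")))
```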
@@ -82,29 +82,23 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 ## Example Test Results
 
 - Unquantized Model
-```bash
-|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
-|------------|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
-| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.7103|± |0.0063|
-| | | |none | 5|perplexity|↓ |3.7915|± |0.0727|
-```
+|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|------------|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
+| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.7103|± |0.0063|
+| | | |none | 5|perplexity|↓ |3.7915|± |0.0727|
 
 - Quantized model with the settings showed above (`desc_act` default to False.)
-```bash
-|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
-|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
-| LLAMA3-8B |lambada_openai| 1|none | 5|acc ||0.4271 |± |0.0069|
-| | | |none | 5|perplexity||39.2316|± |2.2090|
-```
-
+|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
+| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6365 |± |0.0067|
+| | | |none | 5|perplexity|↓ |5.9307 |± |0.1830|
 
 - Quantized model with `desc_act` set to `True` (could improve the model quality, but at the cost of inference speed.)
-```bash
-|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
-|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
-| LLAMA3-8B |lambada_openai| 1|none | 5|acc ||0.6193 |± |0.0068|
-| | | |none | 5|perplexity||5.8879 |± |0.1546|
-```
+|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
+| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6193 |± |0.0068|
+| | | |none | 5|perplexity|↓ |5.8879 |± |0.1546|
+
 > [!NOTE]
 > There is some randomness in generating the model and data, the resulting accuracy may vary ~$\pm$ 0.05.
 
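For reference, the step 4 evaluation can also be driven from Python through lm-eval's `simple_evaluate` entry point; a hedged sketch (API as of the lm-eval 0.4.x series, checkpoint path is a placeholder):

```python
# Hedged sketch: run the lambada_openai 5-shot evaluation via lm-eval's Python
# API instead of its CLI.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<path_to_quantized_model>",
    tasks=["lambada_openai"],
    num_fewshot=5,
)
print(results["results"]["lambada_openai"])
```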
@@ -114,21 +108,25 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)
 
 ```python
-from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
-quantize_config = BaseQuantizeConfig(
-    bits=gptq_args.bits,
-    group_size=gptq_args.group_size,
-    desc_act=gptq_args.desc_act,
-    damp_percent=gptq_args.damp_percent)
+from gptqmodel import GPTQModel, QuantizeConfig
+
+quantize_config = QuantizeConfig(
+    bits=gptq_args.bits,
+    group_size=gptq_args.group_size,
+    desc_act=gptq_args.desc_act,
+    damp_percent=gptq_args.damp_percent,
+)
+
 ```
 
-2. Load the pre_trained model with `auto_gptq` class/wrapper. Tokenizer is optional because we already tokenized the data in a previous step.
+2. Load the pre_trained model with `gptqmodel` class/wrapper. Tokenizer is optional because we already tokenized the data in a previous step.
 
 ```python
-model = AutoGPTQForCausalLM.from_pretrained(
-    model_args.model_name_or_path,
-    quantize_config=quantize_config,
-    torch_dtype=model_args.torch_dtype)
+model = GPTQModel.from_pretrained(
+    model_args.model_name_or_path,
+    quantize_config=quantize_config,
+    torch_dtype=model_args.torch_dtype,
+)
 ```
 
 3. Load the tokenized dataset from disk.
@@ -143,9 +141,9 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 ```python
 model.quantize(
     data,
-    use_triton=gptq_args.use_triton,
+    backend=BACKEND.TRITON if gptq_args.use_triton else BACKEND.AUTO,
     batch_size=gptq_args.batch_size,
-    cache_examples_on_gpu=gptq_args.cache_examples_on_gpu,
+    calibration_enable_gpu_cache=gptq_args.cache_examples_on_gpu,
 )
 ```
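One detail worth noting about the `model.quantize(...)` call above: `BACKEND` is an enum provided by gptqmodel rather than an fms-mo symbol. A minimal sketch of the assumed import (the exact path may vary between gptqmodel releases):

```python
# Assumed import location for the BACKEND enum used in the quantize() call
# above; recent gptqmodel releases re-export it from the package root.
from gptqmodel import BACKEND

print(BACKEND.AUTO, BACKEND.TRITON)  # the two members referenced in this example
```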

fms_mo/custom_ext_kernels/utils.py

Lines changed: 30 additions & 19 deletions
@@ -14,7 +14,7 @@
 
 
 """This file contains external kernel registrations, compilation, and packing functions.
-Some functions may require additional packages, e.g. auto_gptq, cutlass (source clone)
+Some functions may require additional packages, e.g. gptqmodel, cutlass (source clone)
 """
 
 # pylint: disable=ungrouped-imports,unused-argument,c-extension-no-member
@@ -491,27 +491,29 @@ def create_test_tensors(Nbatch, M, N, K, ele_type, accum_type):
 
 
 def exllama_ops_load_and_reg(qcfg=None, run_unit_test=False):
-    """Register Exllama kernels borrowed from auto-gptq
+    """Register Exllama kernels borrowed from gptqmodel
     Args:
         qcfg: dict. quant config
         run_unit_test: bool. Run unit tests after Op registration. (if unit tests defined.)
 
     NOTE:
-        1. need to install auto-gptq python package
+        1. need to install gptqmodel python package
         2. Op registration signature changed drastically from torch 2.1 - 2.4. TODO: add 2.4 support
 
-        see https://github.com/AutoGPTQ/AutoGPTQ for installation instruction
+        see https://github.com/ModelCloud/GPTQModel for installation instructions
     """
     if qcfg is None:
         qcfg = {}
     elif qcfg:
-        qcfg["AUTOGPTQ_AVAILABLE"] = False
+        qcfg["GPTQMODEL_AVAILABLE"] = False
 
-    namespace = "autogptq_gemm"
+    namespace = "gptqmodel_gemm"
     # check before compile
-    if hasattr(torch.ops, namespace) and hasattr(torch.ops.autogptq_gemm, "exv1_i4f16"):
-        logger.info("Custom AutoGPTQ functions have been loaded already!")
-        qcfg["AUTOGPTQ_AVAILABLE"] = True
+    if hasattr(torch.ops, namespace) and hasattr(
+        torch.ops.gptqmodel_gemm, "exv1_i4f16"
+    ):
+        logger.info("Custom GPTQModel functions have been loaded already!")
+        qcfg["GPTQMODEL_AVAILABLE"] = True
         need_registration = False
     else:
         need_registration = (
@@ -521,14 +523,14 @@ def exllama_ops_load_and_reg(qcfg=None, run_unit_test=False):
 
     if not need_registration:
         logger.warning(
-            "Please check the installation of AutoGPTQ package."
+            "Please check the installation of GPTQModel package."
             "External kernels cannot be used this time."
         )
         return
 
     # Third Party
-    import exllama_kernels
-    import exllamav2_kernels
+    import gptqmodel_exllama_kernels
+    import gptqmodel_exllamav2_kernels
 
     # Register op
     @reg_op(f"{namespace}::exv1_i4f16")
@@ -545,7 +547,7 @@ def exv1_i4f16_impl(x, q4, q4_width):
             (x.shape[0], q4_width), dtype=torch.float16, device=x.device
         )
 
-        exllama_kernels.q4_matmul(x, q4, output)
+        gptqmodel_exllama_kernels.q4_matmul(x, q4, output)
         return output.view(outshape)
 
     # Abstract implementation
@@ -573,7 +575,9 @@ def exv2_i4f16_impl(x, q_handle, q4_width, force_cuda):
             (x.shape[0], q4_width), dtype=torch.float16, device=x.device
         )
 
-        exllamav2_kernels.gemm_half_q_half(x, q_handle, output, force_cuda)
+        gptqmodel_exllamav2_kernels.gemm_half_q_half(
+            x, q_handle, output, force_cuda
+        )
         return output.view(outshape)
 
     # Abstract implementation
@@ -609,7 +613,9 @@ def exv2_i4f16_fxinputs_impl(
             (x.shape[0], q4_width), dtype=torch.float16, device=x.device
         )
 
-        exllamav2_kernels.gemm_half_q_half(x, q_handle, output, force_cuda)
+        gptqmodel_exllamav2_kernels.gemm_half_q_half(
+            x, q_handle, output, force_cuda
+        )
         return output.view(outshape)
 
     # Abstract implementation
@@ -623,10 +629,11 @@ def exv2_i4f16_fxinputs_abstract(
         )
 
     logger.info(
-        f"New AutoGPTQ gemm functions have been loaded and registered to torch.ops.{namespace}."
+        f"New GPTQModel gemm functions have been loaded and registered to \
+            torch.ops.{namespace}."
     )
     if qcfg:
-        qcfg["AUTOGPTQ_AVAILABLE"] = True
+        qcfg["GPTQMODEL_AVAILABLE"] = True
 
     if run_unit_test:
         return NotImplemented
@@ -1171,10 +1178,14 @@ def swap_nnlinear_to_quantlinear(model, qconfig, prefix=None, qlinear2use=None):
         QuantLinear = qlinear2use
     elif exVer == 1:
         # Third Party
-        from auto_gptq.nn_modules.qlinear.qlinear_exllama import QuantLinear
+        from gptqmodel.nn_modules.qlinear.exllama import (
+            ExllamaQuantLinear as QuantLinear,
+        )
     else:
         # Third Party
-        from auto_gptq.nn_modules.qlinear.qlinear_exllamav2 import QuantLinear
+        from gptqmodel.nn_modules.qlinear.exllamav2 import (
+            ExllamaV2QuantLinear as QuantLinear,
+        )
 
     num_swapped = 0
     for n, m in model.named_modules():
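The registrations above follow the `torch.library` custom-op pattern: define a schema under the new `gptqmodel_gemm` namespace, attach a CUDA implementation that calls the compiled kernel, and add a shape-only abstract implementation for tracing. A hedged sketch of that pattern for the exv1 op, assuming the torch 2.1-2.3 API; the schema and kernel call mirror the diff, but this is not the repo's actual `reg_op` helper:

```python
# Hedged sketch of the op-registration pattern used above (torch 2.1-2.3 style).
import torch

lib = torch.library.Library("gptqmodel_gemm", "DEF")
lib.define("exv1_i4f16(Tensor x, int q4, int q4_width) -> Tensor")


def exv1_i4f16_cuda(x, q4, q4_width):
    # The real implementation calls the compiled kernel shipped with gptqmodel.
    import gptqmodel_exllama_kernels

    output = torch.empty((x.shape[0], q4_width), dtype=torch.float16, device=x.device)
    gptqmodel_exllama_kernels.q4_matmul(x, q4, output)
    return output


lib.impl("exv1_i4f16", exv1_i4f16_cuda, "CUDA")


@torch.library.impl_abstract("gptqmodel_gemm::exv1_i4f16")
def exv1_i4f16_abstract(x, q4, q4_width):
    # Shape/dtype-only "fake" implementation used when tracing or compiling.
    return torch.empty((x.shape[0], q4_width), dtype=torch.float16, device=x.device)
```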

fms_mo/fx/utils.py

Lines changed: 7 additions & 7 deletions
@@ -41,9 +41,9 @@
     # Local
     from fms_mo.modules.linear import QLinearExv1WI4AF16, QLinearExv2WI4AF16
 
-    autogptq_available = True
+    gptqmodel_available = True
 except ImportError:
-    autogptq_available = False
+    gptqmodel_available = False
 
 
 MIN_BLOCK_SIZE = 5
@@ -91,7 +91,7 @@ def check_qclass_fallback_based_on_min_feat(
     ]
     if cutlass_available:
         qclass_has_constraints += [QLinearCutlassI8I32NT]
-    if autogptq_available:
+    if gptqmodel_available:
         qclass_has_constraints += [QLinearExv1WI4AF16, QLinearExv2WI4AF16]
 
     qclass = type(ref_module)
@@ -129,7 +129,7 @@ def lower_qmodel_to_ext_kernels(
         1. user need to define a mapping thru qcfg["ext_kernel_mapping_mod"]
         2. to make it simple, only swap user specified qclass, nothing else
         3. move the module to GPU before swapping to accelerate scale/zp calculations
-        4. autogptq_post_init() must be done at model level, or OOM and incorrect results easily
+        4. gptq_post_init() must be done at model level, or OOM and incorrect results easily
 
     Args:
         mod (torch.nn.Module): model to be 'lowered'
@@ -156,7 +156,7 @@ def lower_qmodel_to_ext_kernels(
     qclass_must_start_from_cpu = None
     using_gptq = False
     if (
-        available_packages["auto_gptq"]
+        available_packages["gptqmodel"]
         and available_packages["exllama_kernels"]
         and available_packages["exllamav2_kernels"]
     ):
@@ -207,9 +207,9 @@ def lower_qmodel_to_ext_kernels(
 
     if using_gptq:
         # Third Party
-        from auto_gptq.modeling._utils import autogptq_post_init
+        from gptqmodel.utils.model import hf_gptqmodel_post_init as gptq_post_init
 
-        mod_tmp = autogptq_post_init(mod_tmp, use_act_order=False)  # see Note 4
+        mod_tmp = gptq_post_init(mod_tmp, use_act_order=False)  # see Note 4
 
     mod.to(currDev)
     logger.info(mod)
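The `gptqmodel_available` flag and the `available_packages` lookups above rely on a guarded-import pattern; a minimal sketch (the dictionary construction is an assumption for illustration, only the imported names and keys come from the diff):

```python
# Hedged sketch of the guarded imports behind gptqmodel_available and
# available_packages; not the repo's actual helper.
import importlib.util

available_packages = {
    name: importlib.util.find_spec(name) is not None
    for name in ("gptqmodel", "exllama_kernels", "exllamav2_kernels")
}

try:
    # These local QLinear wrappers only import cleanly when the GPTQ kernel
    # stack is present.
    from fms_mo.modules.linear import QLinearExv1WI4AF16, QLinearExv2WI4AF16  # noqa: F401

    gptqmodel_available = True
except ImportError:
    gptqmodel_available = False
```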

0 commit comments
