6 changes: 3 additions & 3 deletions .pylintrc
@@ -63,9 +63,9 @@ ignore-patterns=^\.#
# (useful for modules/projects where namespaces are manipulated during runtime
# and thus existing member attributes cannot be deduced by static analysis). It
# supports qualified module names, as well as Unix pattern matching.
ignored-modules=auto_gptq,
exllama_kernels,
exllamav2_kernels,
ignored-modules=gptqmodel,
gptqmodel_exllama_kernels,
gptqmodel_exllamav2_kernels,
llmcompressor,
cutlass_mm,
pygraphviz,
4 changes: 2 additions & 2 deletions .spellcheck-en-custom.txt
@@ -6,7 +6,6 @@ AIU
Spyre
spyre
Args
AutoGPTQ
autoregressive
backpropagation
bmm
@@ -38,8 +37,9 @@ frac
gptq
GPTQ
GPTQArguments
GPTQModel
gptqmodel
graphviz
GPTQ
hyperparameters
Inductor
inferenced
2 changes: 1 addition & 1 deletion README.md
@@ -42,7 +42,7 @@ FMS Model Optimizer is a framework for developing reduced precision neural netwo
*Optional packages based on optimization functionality required:*

- **GPTQ** is a popular compression method for LLMs:
- [auto_gptq](https://pypi.org/project/auto-gptq/) or build from [source](https://github.com/AutoGPTQ/AutoGPTQ)
- [gptqmodel](https://pypi.org/project/gptqmodel/) or build from [source](https://github.com/ModelCloud/GPTQModel)
- If you want to experiment with **INT8** deployment in [QAT](./examples/QAT_INT8/) and [PTQ](./examples/PTQ_INT8/) examples:
- Nvidia GPU with compute capability > 8.0 (A100 family or higher)
- Option 1:
2 changes: 1 addition & 1 deletion docs/fms_mo_design.md
@@ -82,7 +82,7 @@ FMS Model Optimizer supports FP8 in two ways:

### GPTQ (weight-only compression, or sometimes referred to as W4A16)

For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `auto_gptq` package. See this [example](../examples/GPTQ/)
For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `gptqmodel` package. See this [example](../examples/GPTQ/)
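
As a rough illustration of the savings described above, the sketch below compares the memory footprint of one FP16 `Linear` weight with its 4-bit GPTQ counterpart (packed 4-bit weights plus per-group FP16 scales and packed 4-bit zero-points). The layer shape and `group_size` are illustrative placeholders, not taken from any specific model.

```python
# Back-of-the-envelope W4A16 memory estimate for a single Linear layer.
in_features, out_features = 4096, 4096
group_size = 128

fp16_bytes = in_features * out_features * 2              # 16-bit weights

packed_bytes = in_features * out_features // 2           # 4-bit weights, two per byte
n_groups = (in_features // group_size) * out_features    # one scale/zero per group per output column
scale_bytes = n_groups * 2                               # FP16 scales
zero_bytes = n_groups // 2                               # packed 4-bit zero-points

w4a16_bytes = packed_bytes + scale_bytes + zero_bytes
print(f"FP16 : {fp16_bytes / 2**20:6.1f} MiB")
print(f"W4A16: {w4a16_bytes / 2**20:6.1f} MiB (~{fp16_bytes / w4a16_bytes:.1f}x smaller)")
```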


## Specification
80 changes: 39 additions & 41 deletions examples/GPTQ/README.md
@@ -1,12 +1,12 @@
# Generative Pre-Trained Transformer Quantization (GPTQ) of LLAMA-3-8B Model


For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed (some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this "weight-only compression", or sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323) by leveraging `auto_gptq`, a third party library, to perform quantization.
For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed (some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this "weight-only compression", or sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323) by leveraging `gptqmodel`, a third party library, to perform quantization.

## Requirements

- [FMS Model Optimizer requirements](../../README.md#requirements)
- `auto-gptq` is needed for this example. Use `pip install auto-gptq` or [install from source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source)
- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
- Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
```
pip install lm-eval
@@ -32,7 +32,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
> - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead

2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `auto_gptq` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).
2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `gptqmodel` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).

```bash
python -m fms_mo.run_quant \
@@ -49,8 +49,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> - In GPTQ, `group_size` is a trade-off between accuracy and speed, but there is an additional constraint that `in_features` of the Linear layer to be quantized needs to be an **integer multiple** of `group_size`, i.e. some models may have to use smaller `group_size` than default.
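
Before launching a long quantization job, the divisibility constraint above can be checked with a short standalone sketch; it is not part of `fms_mo`, and the model id is only a placeholder (any Hugging Face causal LM works).

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

group_size = 128
# Loads the full model just for inspection; pick a smaller model if memory is a concern.
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-8b-code")

# List Linear layers whose in_features is not an integer multiple of group_size.
incompatible = [
    name
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear) and module.in_features % group_size != 0
]
print(incompatible or f"All Linear layers are compatible with group_size={group_size}")
```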

> [!TIP]
> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try install `auto-gptq` from [source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source).
> 2. If you need to work on a custom model that is not supported by AutoGPTQ, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#customize-model).
> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try installing `gptqmodel` from [source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file).
> 2. If you need to work on a custom model that is not supported by GPTQModel, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file#how-to-add-support-for-a-new-model).

3. **Inspect the GPTQ checkpoint**
```python
@@ -62,10 +62,10 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m

```
layer mem (MB)
dtype
torch.float16 224 109.051904
torch.float32 67 4203.757568
torch.int32 672 3521.904640
dtype
torch.bfloat16 67 2101.878784
torch.float16 224 109.051904
torch.int32 672 3521.904640
```
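
A standalone way to produce a per-dtype summary like the one above from a saved quantized checkpoint is sketched below; it assumes the checkpoint is a single `safetensors` file, and the path is a placeholder.

```python
from collections import defaultdict
from safetensors import safe_open

ckpt = "llama3-8b-gptq/model.safetensors"  # placeholder path to the quantized checkpoint

counts, mbytes = defaultdict(int), defaultdict(float)
with safe_open(ckpt, framework="pt") as f:
    for key in f.keys():
        t = f.get_tensor(key)
        counts[t.dtype] += 1
        mbytes[t.dtype] += t.numel() * t.element_size() / 1e6  # MB

for dtype in sorted(counts, key=str):
    print(f"{str(dtype):15s} {counts[dtype]:5d} {mbytes[dtype]:12.3f} MB")
```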

4. **Evaluate the quantized model**'s performance on a selected task using `lm-eval` library, the command below will run evaluation on [`lambada_openai`](https://huggingface.co/datasets/EleutherAI/lambada_openai) task and show the perplexity/accuracy at the end.
@@ -82,29 +82,23 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
## Example Test Results

- Unquantized Model
```bash
|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.7103|± |0.0063|
| | | |none | 5|perplexity|↓ |3.7915|± |0.0727|
```
|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.7103|± |0.0063|
| | | |none | 5|perplexity|↓ |3.7915|± |0.0727|

- Quantized model with the settings showed above (`desc_act` default to False.)
```bash
|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.4271 |± |0.0069|
| | | |none | 5|perplexity|↓ |39.2316|± |2.2090|
```

|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6365 |± |0.0067|
| | | |none | 5|perplexity|↓ |5.9307 |± |0.1830|

- Quantized model with `desc_act` set to `True` (could improve the model quality, but at the cost of inference speed.)
```bash
|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6193 |± |0.0068|
| | | |none | 5|perplexity|↓ |5.8879 |± |0.1546|
```
|Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
| LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6193 |± |0.0068|
| | | |none | 5|perplexity|↓ |5.8879 |± |0.1546|

> [!NOTE]
> There is some randomness in generating the model and data, the resulting accuracy may vary ~$\pm$ 0.05.

@@ -114,21 +108,25 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
quantize_config = BaseQuantizeConfig(
bits=gptq_args.bits,
group_size=gptq_args.group_size,
desc_act=gptq_args.desc_act,
damp_percent=gptq_args.damp_percent)
from gptqmodel import GPTQModel, QuantizeConfig

quantize_config = QuantizeConfig(
bits=gptq_args.bits,
group_size=gptq_args.group_size,
desc_act=gptq_args.desc_act,
damp_percent=gptq_args.damp_percent,
)

```

2. Load the pre_trained model with `auto_gptq` class/wrapper. Tokenizer is optional because we already tokenized the data in a previous step.
2. Load the pre_trained model with `gptqmodel` class/wrapper. Tokenizer is optional because we already tokenized the data in a previous step.

```python
model = AutoGPTQForCausalLM.from_pretrained(
model_args.model_name_or_path,
quantize_config=quantize_config,
torch_dtype=model_args.torch_dtype)
model = GPTQModel.from_pretrained(
model_args.model_name_or_path,
quantize_config=quantize_config,
torch_dtype=model_args.torch_dtype,
)
```

3. Load the tokenized dataset from disk.
@@ -143,9 +141,9 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
```python
model.quantize(
data,
use_triton=gptq_args.use_triton,
backend=BACKEND.TRITON if gptq_args.use_triton else BACKEND.AUTO,
batch_size=gptq_args.batch_size,
cache_examples_on_gpu=gptq_args.cache_examples_on_gpu,
calibration_enable_gpu_cache=gptq_args.cache_examples_on_gpu,
)
```
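
Putting the walk-through together, a condensed end-to-end sketch is shown below. It only uses calls that already appear in the steps above; the model id, argument values, and data path are placeholders, and the `BACKEND` import path from the top-level `gptqmodel` package is an assumption to verify against your installed version.

```python
import torch
from datasets import load_from_disk
from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel import BACKEND  # import path assumed; adjust if your gptqmodel version differs

# Illustrative values; in the example script these come from GPTQArguments.
quantize_config = QuantizeConfig(bits=4, group_size=128, desc_act=False, damp_percent=0.01)

model = GPTQModel.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",        # placeholder model id
    quantize_config=quantize_config,
    torch_dtype=torch.float16,
)

data = load_from_disk("path_to_save_train")  # tokenized calibration data from step 1

model.quantize(
    data,
    backend=BACKEND.AUTO,
    batch_size=1,
    calibration_enable_gpu_cache=True,
)
```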

49 changes: 30 additions & 19 deletions fms_mo/custom_ext_kernels/utils.py
@@ -14,7 +14,7 @@


"""This file contains external kernel registrations, compilation, and packing functions.
Some functions may require additional packages, e.g. auto_gptq, cutlass (source clone)
Some functions may require additional packages, e.g. gptqmodel, cutlass (source clone)
"""

# pylint: disable=ungrouped-imports,unused-argument,c-extension-no-member
@@ -491,27 +491,29 @@ def create_test_tensors(Nbatch, M, N, K, ele_type, accum_type):


def exllama_ops_load_and_reg(qcfg=None, run_unit_test=False):
"""Register Exllama kernels borrowed from auto-gptq
"""Register Exllama kernels borrowed from gptqmodel
Args:
qcfg: dict. quant config
run_unit_test: bool. Run unit tests after Op registration. (if unit tests defined.)

NOTE:
1. need to install auto-gptq python package
1. need to install gptqmodel python package
2. Op registration signature changed drastically from torch 2.1 - 2.4. TODO: add 2.4 support

see https://github.com/AutoGPTQ/AutoGPTQ for installation instruction
see https://github.com/ModelCloud/GPTQModel for installation instructions
"""
if qcfg is None:
qcfg = {}
elif qcfg:
qcfg["AUTOGPTQ_AVAILABLE"] = False
qcfg["GPTQMODEL_AVAILABLE"] = False

namespace = "autogptq_gemm"
namespace = "gptqmodel_gemm"
# check before compile
if hasattr(torch.ops, namespace) and hasattr(torch.ops.autogptq_gemm, "exv1_i4f16"):
logger.info("Custom AutoGPTQ functions have been loaded already!")
qcfg["AUTOGPTQ_AVAILABLE"] = True
if hasattr(torch.ops, namespace) and hasattr(
torch.ops.gptqmodel_gemm, "exv1_i4f16"
):
logger.info("Custom GPTQModel functions have been loaded already!")
qcfg["GPTQMODEL_AVAILABLE"] = True
need_registration = False
else:
need_registration = (
@@ -521,14 +523,14 @@ def exllama_ops_load_and_reg(qcfg=None, run_unit_test=False):

if not need_registration:
logger.warning(
"Please check the installation of AutoGPTQ package."
"Please check the installation of GPTQModel package."
"External kernels cannot be used this time."
)
return

# Third Party
import exllama_kernels
import exllamav2_kernels
import gptqmodel_exllama_kernels
import gptqmodel_exllamav2_kernels

# Register op
@reg_op(f"{namespace}::exv1_i4f16")
Expand All @@ -545,7 +547,7 @@ def exv1_i4f16_impl(x, q4, q4_width):
(x.shape[0], q4_width), dtype=torch.float16, device=x.device
)

exllama_kernels.q4_matmul(x, q4, output)
gptqmodel_exllama_kernels.q4_matmul(x, q4, output)
return output.view(outshape)

# Abstract implementation
@@ -573,7 +575,9 @@ def exv2_i4f16_impl(x, q_handle, q4_width, force_cuda):
(x.shape[0], q4_width), dtype=torch.float16, device=x.device
)

exllamav2_kernels.gemm_half_q_half(x, q_handle, output, force_cuda)
gptqmodel_exllamav2_kernels.gemm_half_q_half(
x, q_handle, output, force_cuda
)
return output.view(outshape)

# Abstract implementation
@@ -609,7 +613,9 @@ def exv2_i4f16_fxinputs_impl(
(x.shape[0], q4_width), dtype=torch.float16, device=x.device
)

exllamav2_kernels.gemm_half_q_half(x, q_handle, output, force_cuda)
gptqmodel_exllamav2_kernels.gemm_half_q_half(
x, q_handle, output, force_cuda
)
return output.view(outshape)

# Abstract implementation
@@ -623,10 +629,11 @@ def exv2_i4f16_fxinputs_abstract(
)

logger.info(
f"New AutoGPTQ gemm functions have been loaded and registered to torch.ops.{namespace}."
f"New GPTQModel gemm functions have been loaded and registered to \
torch.ops.{namespace}."
)
if qcfg:
qcfg["AUTOGPTQ_AVAILABLE"] = True
qcfg["GPTQMODEL_AVAILABLE"] = True

if run_unit_test:
return NotImplemented
@@ -1171,10 +1178,14 @@ def swap_nnlinear_to_quantlinear(model, qconfig, prefix=None, qlinear2use=None):
QuantLinear = qlinear2use
elif exVer == 1:
# Third Party
from auto_gptq.nn_modules.qlinear.qlinear_exllama import QuantLinear
from gptqmodel.nn_modules.qlinear.exllama import (
ExllamaQuantLinear as QuantLinear,
)
else:
# Third Party
from auto_gptq.nn_modules.qlinear.qlinear_exllamav2 import QuantLinear
from gptqmodel.nn_modules.qlinear.exllamav2 import (
ExllamaV2QuantLinear as QuantLinear,
)

num_swapped = 0
for n, m in model.named_modules():
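For readers unfamiliar with the registration pattern used in `exllama_ops_load_and_reg` above (ops defined under a custom namespace so they become reachable as `torch.ops.<namespace>.<op>`), here is a tiny self-contained sketch of the same mechanism using `torch.library` and a dummy op. It does not need the GPTQModel extension kernels and is not part of this PR; the actual code uses its own `reg_op` helper and also registers abstract/meta implementations.

```python
import torch

# Define a custom op namespace, declare an op schema, and register a CPU implementation.
lib = torch.library.Library("demo_gemm", "DEF")
lib.define("scale_add(Tensor x, float s) -> Tensor")

def scale_add_cpu(x: torch.Tensor, s: float) -> torch.Tensor:
    return x * s + 1.0

lib.impl("scale_add", scale_add_cpu, "CPU")

# The op is now dispatched through torch.ops, analogous to torch.ops.gptqmodel_gemm.exv1_i4f16 above.
print(torch.ops.demo_gemm.scale_add(torch.ones(3), 2.0))  # tensor([3., 3., 3.])
```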
14 changes: 7 additions & 7 deletions fms_mo/fx/utils.py
@@ -41,9 +41,9 @@
# Local
from fms_mo.modules.linear import QLinearExv1WI4AF16, QLinearExv2WI4AF16

autogptq_available = True
gptqmodel_available = True
except ImportError:
autogptq_available = False
gptqmodel_available = False


MIN_BLOCK_SIZE = 5
@@ -91,7 +91,7 @@ def check_qclass_fallback_based_on_min_feat(
]
if cutlass_available:
qclass_has_constraints += [QLinearCutlassI8I32NT]
if autogptq_available:
if gptqmodel_available:
qclass_has_constraints += [QLinearExv1WI4AF16, QLinearExv2WI4AF16]

qclass = type(ref_module)
@@ -129,7 +129,7 @@ def lower_qmodel_to_ext_kernels(
1. user need to define a mapping thru qcfg["ext_kernel_mapping_mod"]
2. to make it simple, only swap user specified qclass, nothing else
3. move the module to GPU before swapping to accelerate scale/zp calculations
4. autogptq_post_init() must be done at model level, or OOM and incorrect results easily
4. gptq_post_init() must be done at model level, or OOM and incorrect results easily

Args:
mod (torch.nn.Module): model to be 'lowered'
@@ -156,7 +156,7 @@
qclass_must_start_from_cpu = None
using_gptq = False
if (
available_packages["auto_gptq"]
available_packages["gptqmodel"]
and available_packages["exllama_kernels"]
and available_packages["exllamav2_kernels"]
):
@@ -207,9 +207,9 @@

if using_gptq:
# Third Party
from auto_gptq.modeling._utils import autogptq_post_init
from gptqmodel.utils.model import hf_gptqmodel_post_init as gptq_post_init

mod_tmp = autogptq_post_init(mod_tmp, use_act_order=False) # see Note 4
mod_tmp = gptq_post_init(mod_tmp, use_act_order=False) # see Note 4

mod.to(currDev)
logger.info(mod)
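The import guard and `available_packages` checks in this file reduce to 'enable the GPTQ lowering path only when the optional packages are importable'. A minimal standalone version of that check is sketched below; the kernel module names follow the renaming in this PR and may differ depending on how `gptqmodel` was built.

```python
from importlib.util import find_spec

required = ("gptqmodel", "gptqmodel_exllama_kernels", "gptqmodel_exllamav2_kernels")
gptqmodel_available = all(find_spec(name) is not None for name in required)

if not gptqmodel_available:
    print("GPTQModel or its exllama kernels not found; external-kernel lowering will be skipped.")
```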