`docs/fms_mo_design.md` (+1 −1)
@@ -82,7 +82,7 @@ FMS Model Optimizer supports FP8 in two ways:
### GPTQ (weight-only compression, or sometimes referred to as W4A16)
-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed. (Some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this method simply by utilizing `auto_gptq` package. See this [example](../examples/GPTQ/)
+For generative LLMs, the inference bottleneck is often no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it at the GPU cache level right before performing an FP16 computation. This approach is powerful because it can reduce the number of GPUs needed to serve the model by 4x without sacrificing inference speed. (Some constraints may apply; for example, the batch size cannot exceed a certain number.) FMS Model Optimizer supports this method by utilizing the `gptqmodel` package. See this [example](../examples/GPTQ/).
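The 4x figure above follows from simple arithmetic on bits per weight. A back-of-envelope sketch (the 8B parameter count is illustrative, not taken from fms-mo):

```python
# Back-of-envelope check of the ~4x claim: weight memory scales with bits
# per weight, so FP16 -> INT4 shrinks the weights by a factor of 4.
# The 8B parameter count is illustrative only.

def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """GiB needed to hold just the model weights."""
    return n_params * bits_per_weight / 8 / 2**30

n_params = 8e9  # e.g. a LLAMA-3-8B-class model
fp16_gib = weight_memory_gib(n_params, 16)
int4_gib = weight_memory_gib(n_params, 4)
print(f"FP16: {fp16_gib:.1f} GiB, INT4: {int4_gib:.1f} GiB, "
      f"ratio: {fp16_gib / int4_gib:.0f}x")
```

Note that this counts weights only; activations, the KV cache, and quantization metadata (scales, zero points) add overhead on top, which is why the realized saving is close to, but not exactly, 4x.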
`examples/GPTQ/README.md` (+39 −41)
@@ -1,12 +1,12 @@
# Generative Pre-Trained Transformer Quantization (GPTQ) of LLAMA-3-8B Model
-For generative LLMs, very often the bottleneck of inference is no longer the computation itself but the data transfer. In such case, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and only decompress it at GPU cache-level right before performing an FP16 computation. This approach is very powerful because it could reduce the number of GPUs for serving the model by 4X without sacrificing inference speed (some constraints may apply, such as batch size cannot exceed a certain number.) FMS Model Optimizer supports this "weight-only compression", or sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323) by leveraging `auto_gptq`, a third party library, to perform quantization.
+For generative LLMs, the inference bottleneck is often no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it at the GPU cache level right before performing an FP16 computation. This approach is powerful because it can reduce the number of GPUs needed to serve the model by 4x without sacrificing inference speed (some constraints may apply; for example, the batch size cannot exceed a certain number). FMS Model Optimizer supports this "weight-only compression", sometimes referred to as W4A16 or [GPTQ](https://arxiv.org/pdf/2210.17323), by leveraging `gptqmodel`, a third-party library, to perform quantization.
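As a side note on what "W4A16" storage looks like on disk, 4-bit weights are typically packed several to a wider integer, which is also why GPTQ checkpoints contain `int32` tensors. The sketch below uses an assumed, simplified packing scheme for intuition only, not `gptqmodel`'s actual checkpoint layout:

```python
# Simplified illustration of 4-bit weight packing: eight unsigned 4-bit
# values fit in one 32-bit integer. Toy scheme for intuition only; not
# gptqmodel's actual layout.

def pack_int4(vals: list[int]) -> int:
    """Pack eight values in [0, 15] into a single 32-bit integer."""
    assert len(vals) == 8 and all(0 <= v < 16 for v in vals)
    packed = 0
    for i, v in enumerate(vals):
        packed |= v << (4 * i)  # nibble i occupies bits 4i..4i+3
    return packed

def unpack_int4(packed: int) -> list[int]:
    """Recover the eight 4-bit values from a packed 32-bit integer."""
    return [(packed >> (4 * i)) & 0xF for i in range(8)]

nibbles = [1, 0, 15, 7, 2, 9, 4, 11]
assert unpack_int4(pack_int4(nibbles)) == nibbles
```

In a real kernel this unpacking happens on the fly in GPU cache, right before the dequantized values enter an FP16 matmul.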
## Requirements
- [FMS Model Optimizer requirements](../../README.md#requirements)
-- `auto-gptq` is needed for this example. Use `pip install auto-gptq` or [install from source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source)
+- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
- Optionally for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
```
pip install lm-eval
```
@@ -32,7 +32,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
> - If you have trouble downloading Llama family of models from Hugging Face ([LLama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead
-2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `auto_gptq` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).
+2. **Quantize the model** using the data generated above. The following command will kick off the quantization job (by invoking `gptqmodel` under the hood). Additional acceptable arguments can be found in [GPTQArguments](../../fms_mo/training_args.py#L127).
```bash
python -m fms_mo.run_quant \
@@ -49,8 +49,8 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> - In GPTQ, `group_size` is a trade-off between accuracy and speed, but there is an additional constraint: the `in_features` of each Linear layer to be quantized needs to be an **integer multiple** of `group_size`, i.e. some models may have to use a smaller `group_size` than the default.
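To make the integer-multiple constraint concrete, here is a hypothetical helper (not part of fms-mo or gptqmodel) that finds the largest usable group size for a given layer width:

```python
# Hypothetical helper illustrating the constraint above: in_features of
# each quantized Linear must be divisible by group_size. Not part of
# fms-mo; for illustration only.

def largest_valid_group_size(in_features: int, preferred: int = 128) -> int:
    """Largest group size <= preferred that evenly divides in_features."""
    for g in range(min(preferred, in_features), 0, -1):
        if in_features % g == 0:
            return g
    return 1

# A 14336-wide layer can use the default 128 (14336 = 128 * 112),
# but a hypothetical 1000-wide layer would have to drop to 125.
print(largest_valid_group_size(14336), largest_valid_group_size(1000))
```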
> [!TIP]
-> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try install `auto-gptq` from [source](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#install-from-source).
-> 2. If you need to work on a custom model that is not supported by AutoGPTQ, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#customize-model).
+> 1. If you see error messages regarding `exllama_kernels` or `undefined symbol`, try installing `gptqmodel` from [source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file).
+> 2. If you need to work on a custom model that is not supported by GPTQModel, please add your class wrapper [here](../../fms_mo/utils/custom_gptq_models.py). Additional information [here](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file#how-to-add-support-for-a-new-model).
3. **Inspect the GPTQ checkpoint**
```python
@@ -62,10 +62,10 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
```
                 layer     mem (MB)
-dtype
-torch.float16     224    109.051904
-torch.float32      67   4203.757568
-torch.int32       672   3521.904640
+dtype
+torch.bfloat16     67   2101.878784
+torch.float16     224    109.051904
+torch.int32       672   3521.904640
```
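A summary like the table above can be reproduced from any checkpoint by grouping its tensors by dtype and summing their sizes. A minimal pure-Python sketch (the byte widths are the standard torch element sizes; the input list is a made-up stand-in for `model.state_dict()` contents):

```python
# Sketch of how a per-dtype memory summary is computed: group a
# checkpoint's tensors by dtype, count them, and sum their memory.
# The fake_ckpt input is invented for illustration.
from collections import defaultdict

BYTES_PER_ELEMENT = {"torch.bfloat16": 2, "torch.float16": 2, "torch.int32": 4}

def mem_by_dtype(tensors):
    """tensors: iterable of (dtype_name, numel) -> {dtype: (layer_count, MB)}."""
    stats = defaultdict(lambda: [0, 0])
    for dtype, numel in tensors:
        stats[dtype][0] += 1                               # tensor count
        stats[dtype][1] += numel * BYTES_PER_ELEMENT[dtype]  # total bytes
    return {d: (n, b / 1e6) for d, (n, b) in sorted(stats.items())}

fake_ckpt = [("torch.int32", 4_000_000)] * 3 + [("torch.float16", 500_000)]
print(mem_by_dtype(fake_ckpt))
# {'torch.float16': (1, 1.0), 'torch.int32': (3, 48.0)}
```

The large `torch.int32` footprint in the real table is the packed 4-bit weights; the small `torch.float16`/`torch.bfloat16` tensors are mostly quantization scales, norms, and embeddings left unquantized.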
4. **Evaluate the quantized model**'s performance on a selected task using the `lm-eval` library. The command below will run evaluation on the [`lambada_openai`](https://huggingface.co/datasets/EleutherAI/lambada_openai) task and show the perplexity/accuracy at the end.
@@ -82,29 +82,23 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
> There is some randomness in generating the model and data, so the resulting accuracy may vary by ~$\pm$ 0.05.
@@ -114,21 +108,25 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)
```python
-from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
-quantize_config = BaseQuantizeConfig(
-    bits=gptq_args.bits,
-    group_size=gptq_args.group_size,
-    desc_act=gptq_args.desc_act,
-    damp_percent=gptq_args.damp_percent)
+from gptqmodel import GPTQModel, QuantizeConfig
+
+quantize_config = QuantizeConfig(
+    bits=gptq_args.bits,
+    group_size=gptq_args.group_size,
+    desc_act=gptq_args.desc_act,
+    damp_percent=gptq_args.damp_percent,
+)
+
```
-2. Load the pre-trained model with the `auto_gptq` class/wrapper. The tokenizer is optional because we already tokenized the data in a previous step.
+2. Load the pre-trained model with the `gptqmodel` class/wrapper. The tokenizer is optional because we already tokenized the data in a previous step.
```python
-model = AutoGPTQForCausalLM.from_pretrained(
-    model_args.model_name_or_path,
-    quantize_config=quantize_config,
-    torch_dtype=model_args.torch_dtype)
+model = GPTQModel.from_pretrained(
+    model_args.model_name_or_path,
+    quantize_config=quantize_config,
+    torch_dtype=model_args.torch_dtype,
+)
```
3. Load the tokenized dataset from disk.
@@ -143,9 +141,9 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
```python
model.quantize(
data,
-    use_triton=gptq_args.use_triton,
+    backend=BACKEND.TRITON if gptq_args.use_triton else BACKEND.AUTO,