Commit 6dac25b
fix: enabled GPTQv2
Signed-off-by: omobayode.fagbohungbe <omobayode.fagbohungbe@ibm.com>
1 parent 1a86b4c

3 files changed: 76 additions & 36 deletions

examples/GPTQ/README.md (54 additions & 28 deletions)
@@ -7,6 +7,7 @@ For generative LLMs, very often the bottleneck of inference is no longer the com

 - [FMS Model Optimizer requirements](../../README.md#requirements)
 - `gptqmodel` is needed for this example. Use `pip install gptqmodel` or [install from source](https://github.com/ModelCloud/GPTQModel/tree/main?tab=readme-ov-file)
+  - It is advised to install from source if you plan to use `GPTQv2`
 - Optionally, for the evaluation section below, install [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness)
   ```
   pip install lm-eval
@@ -32,7 +33,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 > - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`
 > - If you have trouble downloading the Llama family of models from Hugging Face ([Llama models require access](https://www.llama.com/docs/getting-the-models/hugging-face/)), you can use `ibm-granite/granite-8b-code` instead

-2. **Quantize the model** using the data generated above, the following command will kick off the quantization job (by invoking `gptqmodel` under the hood.) Additional acceptable arguments can be found here in [GPTQArguments](../../fms_mo/training_args.py#L127).
+2. **Quantize the model** using the data generated above. The following command will kick off the `GPTQv1` quantization job (by invoking `gptqmodel` under the hood). Additional accepted arguments can be found in [GPTQArguments](../../fms_mo/training_args.py#L127).

    ```bash
    python -m fms_mo.run_quant \
@@ -41,9 +42,11 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
         --quant_method gptq \
         --output_dir Meta-Llama-3-8B-GPTQ \
         --bits 4 \
-        --group_size 128
+        --group_size 128 \
+        --v2_mem_device cpu
     ```
-    The model that can be found in the specified output directory (`Meta-Llama-3-8B-GPTQ` in our case) can be deployed and inferenced via `vLLM`.
+    The model in the specified output directory (`Meta-Llama-3-8B-GPTQ` in our case) can be deployed and served via `vLLM`. To enable `GPTQv2`, set the `quant_method` argument to `gptqv2`.

 > [!NOTE]
 > - In GPTQ, `group_size` is a trade-off between accuracy and speed, but there is an additional constraint: `in_features` of the Linear layer to be quantized needs to be an **integer multiple** of `group_size`, i.e. some models may have to use a smaller `group_size` than the default.
@@ -82,44 +85,67 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 ## Example Test Results

 - Unquantized Model
-
-  |Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
-  |------------|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
-  | LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.7103|± |0.0063|
-  | | | |none | 5|perplexity|↓ |3.7915|± |0.0727|
+
+  |Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+  |------------|--------------|------:|------|-----:|----------|---|-----:|---|-----:|
+  | LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.7103|± |0.0063|
+  | | | |none | 5|perplexity|↓ |3.7915|± |0.0727|

 - Quantized model with the settings shown above (`desc_act` defaults to False.)
-
-  |Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
-  |------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
-  | LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6365 |± |0.0067|
-  | | | |none | 5|perplexity|↓ |5.9307 |± |0.1830|
+  - `GPTQv1`
+
+    |Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+    |------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
+    | LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6365 |± |0.0067|
+    | | | |none | 5|perplexity|↓ |5.9307 |± |0.1830|
+
+  - `GPTQv2`
+
+    |Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+    |------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
+    | LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6817 |± |0.0065|
+    | | | |none | 5|perplexity|↓ |4.3994 |± |0.0995|

 - Quantized model with `desc_act` set to `True` (could improve the model quality, but at the cost of inference speed.)
-
-  |Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
-  |------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
-  | LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6193 |± |0.0068|
-  | | | |none | 5|perplexity|↓ |5.8879 |± |0.1546|
+  - `GPTQv1`
+
+    |Model | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+    |------------|--------------|------:|------|-----:|----------|---|------:|---|-----:|
+    | LLAMA3-8B |lambada_openai| 1|none | 5|acc |↑ |0.6193 |± |0.0068|
+    | | | |none | 5|perplexity|↓ |5.8879 |± |0.1546|

 > [!NOTE]
 > There is some randomness in generating the model and data; the resulting accuracy may vary by ~$\pm$ 0.05.


 ## Code Walk-through

-1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)
+1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py). Both `GPTQv1` and `GPTQv2` are supported.
+
+   - To use `GPTQv1`, set the `quant_method` parameter to `gptq` on the command line.

-    ```python
-    from gptqmodel import GPTQModel, QuantizeConfig
-
-    quantize_config = QuantizeConfig(
-        bits=gptq_args.bits,
-        group_size=gptq_args.group_size,
-        desc_act=gptq_args.desc_act,
-        damp_percent=gptq_args.damp_percent,
-    )
+     ```python
+     from gptqmodel import GPTQModel, QuantizeConfig
+
+     quantize_config = QuantizeConfig(
+         bits=gptq_args.bits,
+         group_size=gptq_args.group_size,
+         desc_act=gptq_args.desc_act,
+         damp_percent=gptq_args.damp_percent,
+     )
+     ```
+
+   - To use `GPTQv2`, set `quant_method` to `gptqv2` and choose the memory device (`cpu`, `cuda`, or `auto`) via the `v2_mem_device` argument on the command line; it is passed through as `gptq_args.v2_mem_device`.

+     ```python
+     from gptqmodel import GPTQModel, QuantizeConfig
+
+     quantize_config = QuantizeConfig(
+         bits=gptq_args.bits,
+         group_size=gptq_args.group_size,
+         desc_act=gptq_args.desc_act,
+         damp_percent=gptq_args.damp_percent,
+         v2=True,
+         v2_memory_device=gptq_args.v2_mem_device,
+     )
     ```

 2. Load the pre-trained model with the `gptqmodel` class/wrapper. Tokenizer is optional because we already tokenized the data in a previous step.
@@ -158,4 +184,4 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
     tokenizer.save_pretrained(output_dir)  # optional
     ```
 > [!NOTE]
-> 1. GPTQ of a 70B model usually takes ~4-10 hours on A100.
+> 1. GPTQ of a 70B model usually takes ~4-10 hours on an A100 with `GPTQv1`.
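The README's note that `in_features` must be an integer multiple of `group_size` can be checked up front, before launching a long quantization job. A minimal sketch (the helper name is illustrative and not part of `fms_mo`; 14336 is the MLP width commonly reported for Llama-3-8B):

```python
def group_size_ok(in_features: int, group_size: int) -> bool:
    """GPTQ requires in_features to be an integer multiple of group_size."""
    return in_features % group_size == 0

# 14336 = 112 * 128, so the default group_size of 128 divides it evenly
print(group_size_ok(14336, 128))   # True
print(group_size_ok(14336, 100))   # False: 14336 % 100 == 36
```

Running such a check over every `torch.nn.Linear` in the model before quantizing makes it obvious when a smaller `group_size` is needed.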

fms_mo/run_quant.py (18 additions & 7 deletions)
@@ -88,7 +88,7 @@ def quantize(
     logger.info(f"{fms_mo_args}\n{opt_args.quant_method}\n")

-    if opt_args.quant_method == "gptq":
+    if opt_args.quant_method == "gptq" or opt_args.quant_method == "gptqv2":
         if not available_packages["gptqmodel"]:
             raise ImportError(
                 "Quantization method has been selected as gptq but unable to use external library, "
@@ -138,12 +138,23 @@ def run_gptq(model_args, data_args, opt_args, gptq_args):
     logger = set_log_level(opt_args.log_level, "fms_mo.run_gptq")

-    quantize_config = QuantizeConfig(
-        bits=gptq_args.bits,
-        group_size=gptq_args.group_size,
-        desc_act=gptq_args.desc_act,
-        damp_percent=gptq_args.damp_percent,
-    )
+    if opt_args.quant_method == "gptq":
+        quantize_config = QuantizeConfig(
+            bits=gptq_args.bits,
+            group_size=gptq_args.group_size,
+            desc_act=gptq_args.desc_act,
+            damp_percent=gptq_args.damp_percent,
+        )
+    else:
+        quantize_config = QuantizeConfig(
+            bits=gptq_args.bits,
+            group_size=gptq_args.group_size,
+            desc_act=gptq_args.desc_act,
+            damp_percent=gptq_args.damp_percent,
+            v2=True,
+            v2_memory_device=gptq_args.v2_mem_device,
+        )

     # Add custom model_type mapping to gptqmodel LUT so GPTQModel can recognize them.
     for mtype, cls in custom_gptq_classes.items():
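The two `QuantizeConfig` branches above differ only in the v2-specific keywords. One possible way to avoid the duplication (a sketch, not the committed code; a plain dict stands in for the kwargs eventually passed to `gptqmodel`'s `QuantizeConfig`) is to build the shared arguments once and extend them conditionally:

```python
def build_quantize_kwargs(quant_method: str, bits: int, group_size: int,
                          desc_act: bool, damp_percent: float,
                          v2_mem_device: str = "cpu") -> dict:
    """Assemble QuantizeConfig kwargs; add the v2 fields only for gptqv2."""
    kwargs = {
        "bits": bits,
        "group_size": group_size,
        "desc_act": desc_act,
        "damp_percent": damp_percent,
    }
    if quant_method == "gptqv2":
        kwargs["v2"] = True
        kwargs["v2_memory_device"] = v2_mem_device
    return kwargs

# Usage sketch: quantize_config = QuantizeConfig(**build_quantize_kwargs(...))
print(sorted(build_quantize_kwargs("gptqv2", 4, 128, False, 0.01)))
```

This keeps the shared defaults in one place, so a future argument only has to be added once.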

fms_mo/training_args.py (4 additions & 1 deletion)
@@ -138,7 +138,7 @@ class OptArguments(TypeChecker):
     """Dataclass for optimization related arguments."""

     quant_method: str = field(
-        metadata={"choices": ["gptq", "fp8", "dq"], "help": "Quantization technique"}
+        metadata={"choices": ["gptq", "gptqv2", "fp8", "dq"], "help": "Quantization technique"}
     )
     output_dir: str = field(
         metadata={

@@ -224,6 +224,9 @@ class GPTQArguments(TypeChecker):
     use_cuda_fp16: bool = True
     autotune_warmup_after_quantized: bool = False
     cache_examples_on_gpu: bool = True
+    v2_mem_device: str = field(
+        default="cpu", metadata={"choices": ["auto", "cpu", "cuda"]}
+    )


 @dataclass
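The new `v2_mem_device` entry follows the same `dataclasses.field(metadata={"choices": ...})` pattern as the other arguments in this file. Note that `dataclasses` itself does not enforce the choices; the surrounding argument parsing does. A standalone sketch of the pattern (the class name here is illustrative, not the real `GPTQArguments`):

```python
from dataclasses import dataclass, field, fields

@dataclass
class GPTQArgsSketch:
    # Mirrors the field added to GPTQArguments in training_args.py:
    # the default lives in `default=`, the allowed values in metadata.
    v2_mem_device: str = field(
        default="cpu", metadata={"choices": ["auto", "cpu", "cuda"]}
    )

args = GPTQArgsSketch()
allowed = fields(GPTQArgsSketch)[0].metadata["choices"]
print(args.v2_mem_device, allowed)   # cpu ['auto', 'cpu', 'cuda']
```

Storing the choices in `metadata` lets an argument parser read them back at runtime while keeping the dataclass itself plain.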
