For generative LLMs, the bottleneck of inference is very often no longer the computation itself but the data transfer. In such cases, all we need is an efficient compression method to reduce the model size in memory, together with an efficient GPU kernel that can bring in the compressed data and decompress it only at GPU cache level, right before performing an FP16 computation. This approach is very powerful because it can reduce the number of GPUs needed to serve the model by 4X without sacrificing inference speed (some constraints may apply, e.g. the batch size cannot exceed a certain number). FMS Model Optimizer supports this "weight-only compression", sometimes referred to as W4A16 or GPTQ, by leveraging `gptqmodel`, a third-party library, to perform the quantization.
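To make the idea concrete, below is a rough PyTorch sketch of the compute flow, not the actual fused kernel: the function names, the unpacked int8 storage of the 4-bit values, and the zero-point handling are illustrative assumptions. The point is that weights stay in 4-bit form with one scale per group of input channels and are expanded to the activation dtype only right before an ordinary dense matmul (the real kernel does this inside GPU cache/shared memory and keeps everything in FP16).

```python
import torch

# Conceptual sketch only -- the real W4A16 path fuses unpack + dequantize + matmul in
# one GPU kernel. Here the 4-bit weight values are held in an int8 tensor for clarity
# (the actual checkpoint packs them into int32, see the checkpoint inspection step
# below), plus one scale/zero-point per group of `group_size` input channels.

def w4a16_linear(x, w_int4, scales, zeros, group_size=128):
    """Dequantize 'at the last moment', then run a plain dense matmul."""
    w = (w_int4.to(x.dtype) - zeros.repeat_interleave(group_size, dim=1)) \
        * scales.repeat_interleave(group_size, dim=1)
    return x @ w.t()

# Tiny shape demo (float32 on CPU for portability; the serving path keeps FP16 on GPU).
out_f, in_f, gs = 16, 256, 128
w_int4 = torch.randint(0, 16, (out_f, in_f), dtype=torch.int8)  # values fit in 4 bits
scales = torch.rand(out_f, in_f // gs)
zeros = torch.full((out_f, in_f // gs), 8.0)
x = torch.randn(4, in_f)
y = w4a16_linear(x, w_int4, scales, zeros, group_size=gs)       # -> shape (4, 16)
```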
- FMS Model Optimizer requirements
- `gptqmodel` is needed for this example. Use `pip install gptqmodel` or install from source.
  - It is advised to install from source if you plan to use `GPTQv2`.
- Optionally, for the evaluation section below, install `lm-eval`:

  ```bash
  pip install lm-eval
  ```
This end-to-end example utilizes the common set of interfaces provided by `fms_mo` for easily applying multiple quantization algorithms, with GPTQ being the focus of this example. The steps involved are:
- Convert the dataset into its tokenized form. An example of tokenization using LLAMA-3-8B's tokenizer is below.

  ```python
  from transformers import AutoTokenizer
  from fms_mo.utils.calib_data import get_tokenized_data

  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", use_fast=True)
  num_samples = 128
  seq_len = 2048
  get_tokenized_data("wiki", num_samples, seq_len, tokenizer, gptq_style=True, path_to_save='data')
  ```
  Note
  - Users should provide a tokenized data file based on their needs. This is just one example to demonstrate what data format `fms_mo` is expecting (a rough sketch of preparing your own data follows this note).
  - Tokenized data will be saved in `<path_to_save>_train` and `<path_to_save>_test`.
  - If you have trouble downloading the Llama family of models from Hugging Face (Llama models require access), you can use `ibm-granite/granite-8b-code` instead.
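  If you want to supply your own corpus, below is a rough sketch of producing an equivalent tokenized dataset yourself. It assumes, based on the code walk-through later on this page, that the calibration data is read back with `datasets.load_from_disk`; the column names mirror standard tokenizer output and may not match `fms_mo`'s expectations exactly, so `get_tokenized_data` remains the safer path.

  ```python
  from datasets import Dataset
  from transformers import AutoTokenizer

  # Assumption: the saved dataset is loaded back with datasets.load_from_disk (see the
  # code walk-through below). Column names follow standard tokenizer output; prefer
  # get_tokenized_data when in doubt about the exact format fms_mo expects.
  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", use_fast=True)
  my_texts = ["First calibration document ...", "Second calibration document ..."]  # your own corpus
  seq_len = 2048

  enc = tokenizer(my_texts, truncation=True, max_length=seq_len)
  calib = Dataset.from_dict({"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]})
  calib.save_to_disk("data_train")  # matches the <path_to_save>_train convention above
  ```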
- Quantize the model using the data generated above. The following command will kick off the `GPTQv1` quantization job (by invoking `gptqmodel` under the hood). Additional acceptable arguments can be found here in GPTQArguments.

  ```bash
  python -m fms_mo.run_quant \
      --model_name_or_path meta-llama/Meta-Llama-3-8B \
      --training_data_path data_train \
      --quant_method gptq \
      --output_dir Meta-Llama-3-8B-GPTQ \
      --bits 4 \
      --group_size 128
  ```

  The model found in the specified output directory (`Meta-Llama-3-8B-GPTQ` in our case) can be deployed and inferenced via `vLLM` (a minimal example is sketched after the note below). To enable `GPTQv2`, set the `quant_method` argument to `gptqv2`.
  Note
  - In GPTQ, `group_size` is a trade-off between accuracy and speed, but there is an additional constraint: the `in_features` of the Linear layer to be quantized needs to be an integer multiple of `group_size`, i.e. some models may have to use a smaller `group_size` than the default.
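  As a quick serving check, something like the following should work with `vLLM`. This is a minimal sketch; it assumes the quantized checkpoint directory is local and that vLLM picks up the GPTQ quantization config stored with the checkpoint.

  ```python
  from vllm import LLM, SamplingParams

  # Minimal sketch: vLLM reads the quantization config saved in the checkpoint directory;
  # quantization="gptq" can also be passed explicitly if auto-detection is not desired.
  llm = LLM(model="Meta-Llama-3-8B-GPTQ", dtype="float16")
  outputs = llm.generate(["The capital of France is"], SamplingParams(temperature=0.0, max_tokens=32))
  print(outputs[0].outputs[0].text)
  ```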
  Tip
  - Inspect the GPTQ checkpoint:

    ```python
    from fms_mo.utils.utils import checkpoint_summary

    checkpoint_summary("Meta-Llama-3-8B-GPTQ")
    ```
  We can see that most of the tensors are saved in INT32 format (INT4 is not natively supported by PyTorch, hence the weights are packed into INT32 instead). If you further print out the summary DataFrame (by adding the `show_details=True` flag), you will find that the layers remaining in `float16` or `float32` are `scales` and `layernorm`.

  ```
                  layer      mem (MB)
  dtype
  torch.bfloat16     67   2101.878784
  torch.float16     224    109.051904
  torch.int32       672   3521.904640
  ```
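  To make the INT32 numbers concrete, here is an illustrative packing sketch. The helper below is hypothetical and the exact bit layout used by `gptqmodel` may differ; it just shows why the 4-bit weights appear as `torch.int32` tensors and why the weight memory drops by roughly 4x compared to FP16.

  ```python
  import torch

  # Illustrative only -- gptqmodel's on-disk bit layout may differ, but the ratio is
  # the same: eight 4-bit weight values are stored in each 32-bit word.
  def pack_int4(vals: torch.Tensor) -> torch.Tensor:
      """Pack a (rows, 8*k) tensor of values in [0, 15] into (rows, k) int32 words."""
      vals = vals.to(torch.int32).reshape(vals.shape[0], -1, 8)
      packed = torch.zeros(vals.shape[:-1], dtype=torch.int32)
      for i in range(8):
          packed |= vals[..., i] << (4 * i)
      return packed

  w4 = torch.randint(0, 16, (4096, 4096))   # fake 4-bit weight values of one Linear layer
  packed = pack_int4(w4)                     # shape (4096, 512), dtype torch.int32
  print(packed.numel() * 4 / 2**20, "MB")    # ~8 MB, vs ~32 MB for the same layer in FP16
  ```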
- Evaluate the quantized model's performance on a selected task using the `lm-eval` library. The command below will run evaluation on the `lambada_openai` task and show the perplexity/accuracy at the end.

  ```bash
  lm_eval --model hf \
      --model_args pretrained="Meta-Llama-3-8B-GPTQ,dtype=float16,gptqmodel=True,enforce_eager=True" \
      --tasks lambada_openai \
      --num_fewshot 5 \
      --device cuda:0 \
      --batch_size auto
  ```
  - Unquantized Model

    | Model     | Tasks          | Version | Filter | n-shot | Metric     |   |  Value |   | Stderr |
    |-----------|----------------|--------:|--------|-------:|------------|---|-------:|---|-------:|
    | LLAMA3-8B | lambada_openai |       1 | none   |      5 | acc        | ↑ | 0.7103 | ± | 0.0063 |
    |           |                |         | none   |      5 | perplexity | ↓ | 3.7915 | ± | 0.0727 |
  - Quantized model with the settings shown above (`desc_act` defaults to False)

    - GPTQv1

      | Model     | Tasks          | Version | Filter | n-shot | Metric     |   |  Value |   | Stderr |
      |-----------|----------------|--------:|--------|-------:|------------|---|-------:|---|-------:|
      | LLAMA3-8B | lambada_openai |       1 | none   |      5 | acc        | ↑ | 0.6365 | ± | 0.0067 |
      |           |                |         | none   |      5 | perplexity | ↓ | 5.9307 | ± | 0.1830 |

    - GPTQv2

      | Model     | Tasks          | Version | Filter | n-shot | Metric     |   |  Value |   | Stderr |
      |-----------|----------------|--------:|--------|-------:|------------|---|-------:|---|-------:|
      | LLAMA3-8B | lambada_openai |       1 | none   |      5 | acc        | ↑ | 0.6817 | ± | 0.0065 |
      |           |                |         | none   |      5 | perplexity | ↓ | 4.3994 | ± | 0.0995 |
  - Quantized model with `desc_act` set to True (could improve the model quality, but at the cost of inference speed)

    - GPTQv1

      | Model     | Tasks          | Version | Filter | n-shot | Metric     |   |  Value |   | Stderr |
      |-----------|----------------|--------:|--------|-------:|------------|---|-------:|---|-------:|
      | LLAMA3-8B | lambada_openai |       1 | none   |      5 | acc        | ↑ | 0.6193 | ± | 0.0068 |
      |           |                |         | none   |      5 | perplexity | ↓ | 5.8879 | ± | 0.1546 |
  Note
  There is some randomness in generating the model and the data, so the resulting accuracy may vary by roughly ±0.05.
The following walks through what happens under the hood when `fms_mo.run_quant` is invoked with `--quant_method gptq` or `gptqv2`:

- Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found here. Both `GPTQv1` and `GPTQv2` are supported.

  - To use `GPTQv1`, set the parameter `quant_method` to `gptq` in the command line.
    ```python
    from gptqmodel import GPTQModel, QuantizeConfig

    quantize_config = QuantizeConfig(
        bits=gptq_args.bits,
        group_size=gptq_args.group_size,
        desc_act=gptq_args.desc_act,
        damp_percent=gptq_args.damp_percent,
    )
    ```
  - To use `GPTQv2`, simply set `quant_method` to `gptqv2` in the command line. Under the hood, two additional arguments will be added to QuantizeConfig, i.e. `v2=True` and `v2_memory_device='cpu'`.

    ```python
    from gptqmodel import GPTQModel, QuantizeConfig

    quantize_config = QuantizeConfig(
        bits=gptq_args.bits,
        group_size=gptq_args.group_size,
        desc_act=gptq_args.desc_act,
        damp_percent=gptq_args.damp_percent,
        v2=True,
        v2_memory_device='cpu',
    )
    ```
- Load the pre-trained model with the `gptqmodel` class/wrapper. The tokenizer is optional because we already tokenized the data in a previous step.

  ```python
  model = GPTQModel.from_pretrained(
      model_args.model_name_or_path,
      quantize_config=quantize_config,
      torch_dtype=model_args.torch_dtype,
  )
  ```
- Load the tokenized dataset from disk.

  ```python
  from datasets import load_from_disk

  data = load_from_disk(data_args.training_data_path)
  data = data.with_format("torch")
  ```
- Quantize the model.

  ```python
  from gptqmodel import BACKEND  # kernel backend enum (import path may vary by gptqmodel version)

  model.quantize(
      data,
      backend=BACKEND.TRITON if gptq_args.use_triton else BACKEND.AUTO,
      batch_size=gptq_args.batch_size,
      calibration_enable_gpu_cache=gptq_args.cache_examples_on_gpu,
  )
  ```
- Save the logs and the resulting quantized model.

  ```python
  logger.info(f"Saving quantized model and tokenizer to {output_dir}")
  model.save_quantized(output_dir, use_safetensors=True)
  tokenizer.save_pretrained(output_dir)  # optional
  ```
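  As a quick sanity check of the saved artifacts, the checkpoint can be loaded back with `gptqmodel`. This is a sketch only; the exact loader name depends on the installed `gptqmodel` version (recent releases expose `GPTQModel.load`, while older ones use `GPTQModel.from_quantized`).

  ```python
  from gptqmodel import GPTQModel

  # Reload the quantized checkpoint and generate a few tokens to verify it works.
  # Loader name may vary by gptqmodel version (GPTQModel.load vs. from_quantized).
  model = GPTQModel.load(output_dir)  # e.g. "Meta-Llama-3-8B-GPTQ"
  prompt = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
  print(tokenizer.decode(model.generate(**prompt)[0]))
  ```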
Note
- GPTQ of a 70B model usually takes ~4-10 hours on an A100 with `GPTQv1`.