There are two types of FP8 support in FMS Model Optimizer:

- mature FP8, which can generate a model that is ready for serving by vllm, and
- experimental FP8, which is simulation-only but offers more advanced quantization options.
This is an example of mature FP8, which under the hood leverages functionality from llm-compressor, a third-party library, to perform the FP8 quantization. An example of the experimental FP8 can be found here.
- Nvidia A100 family or higher
- The llm-compressor library can be installed using pip:

  ```bash
  pip install llmcompressor
  ```
- To evaluate the FP8 quantized model, the lm-eval and vllm libraries are also required:

  ```bash
  pip install vllm lm_eval
  ```
> [!CAUTION]
> vllm may require a specific PyTorch version that differs from the one installed in your current environment, and it may force-install it without asking. Make sure it is compatible with your setup, or create a new environment if needed.
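One way to avoid such conflicts is to keep the evaluation stack in its own virtual environment, as the caution above suggests. A minimal sketch (the environment name is illustrative):

```bash
# Illustrative: isolate vllm/lm-eval in a fresh environment so a
# force-installed PyTorch cannot break your existing setup
python -m venv fp8-eval-env
source fp8-eval-env/bin/activate
pip install vllm lm_eval
```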
This end-to-end example utilizes the common set of interfaces provided by fms_mo for easily applying multiple quantization algorithms; FP8 is the focus here. The steps involved are:
- FP8 quantization through the CLI. Other arguments can be found in FP8Arguments.

  ```bash
  python -m fms_mo.run_quant \
      --model_name_or_path "meta-llama/Meta-Llama-3-8B" \
      --quant_method fp8 \
      --torch_dtype bfloat16 \
      --output_dir "Meta-Llama-3-8B-FP8"
  ```
> [!NOTE]
> - The quantized model and tokenizer will be saved to `output_dir`, but some additional temporary storage space may be needed.
> - Runtime is ~1 min on an A100 (model download time not included).
> - If you have trouble downloading the Llama family of models from Hugging Face (Llama models require access), you can use `ibm-granite/granite-3.0-8b-instruct` instead.
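For example, the same command with the Granite model mentioned in the note above (the output directory name is illustrative):

```bash
python -m fms_mo.run_quant \
    --model_name_or_path "ibm-granite/granite-3.0-8b-instruct" \
    --quant_method fp8 \
    --torch_dtype bfloat16 \
    --output_dir "granite-3.0-8b-instruct-FP8"
```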
- Inspect the FP8 checkpoint:

  ```python
  from fms_mo.utils.utils import checkpoint_summary

  checkpoint_summary("Meta-Llama-3-8B-FP8")
  ```
  We can see that most of the tensors are saved in FP8 format (`torch.float8_e4m3fn`). If you further print out the summary DataFrame (by adding the `show_details=True` flag), you will find that the layers remaining in `bfloat16` are the `embeddings` and `lm_head`.

  ```text
                           mem (MB)
  dtype
  torch.bfloat16        2104.631296
  torch.float8_e4m3fn   6979.321856
  ```
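  For instance, the per-layer view can be requested as follows (a minimal sketch, assuming `show_details` is passed as a keyword argument):

  ```python
  # Per-layer breakdown of dtypes and sizes, as described above
  checkpoint_summary("Meta-Llama-3-8B-FP8", show_details=True)
  ```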
> [!NOTE]
> The FP16 model file size on storage is ~16.07 GB, while FP8 is ~8.6 GB, i.e. roughly a 1.9x reduction. The saving is slightly less than 2x because the embeddings and lm_head remain in bfloat16.
- Evaluate the quantized model's performance on a selected task using the `lm-eval` library. The command below runs evaluation on the `lambada_openai` task and reports perplexity and accuracy at the end.

  ```bash
  lm_eval --model vllm \
      --model_args pretrained="Meta-Llama-3-8B-FP8,add_bos_token=True,dtype=float16,enforce_eager=True" \
      --tasks lambada_openai \
      --device cuda:0 \
      --batch_size 1 \
      --num_fewshot 5
  ```
  - BF16 (not quantized) LLAMA3-8B model:

    | Tasks          | Version | Filter | n-shot | Metric     |   | Value  |   | Stderr |
    |----------------|--------:|--------|-------:|------------|---|-------:|---|-------:|
    | lambada_openai |       1 | none   |      5 | acc        | ↑ | 0.7120 | ± | 0.0287 |
    |                |         | none   |      5 | perplexity | ↓ | 3.8683 | ± | 0.3716 |
  - FP8 quantized LLAMA3-8B model:

    | Tasks          | Version | Filter | n-shot | Metric     |   | Value  |   | Stderr |
    |----------------|--------:|--------|-------:|------------|---|-------:|---|-------:|
    | lambada_openai |       1 | none   |      5 | acc        | ↑ | 0.7160 | ± | 0.0286 |
    |                |         | none   |      5 | perplexity | ↓ | 3.8915 | ± | 0.3727 |

  The FP8 model matches the BF16 baseline within the reported standard errors: accuracy 0.7160 vs 0.7120 and perplexity 3.8915 vs 3.8683.
Under the hood, the CLI performs the following steps:

- The non-quantized pre-trained model is loaded using the model wrapper from `llm-compressor`. The corresponding tokenizer is constructed as well.

  ```python
  from llmcompressor.modifiers.quantization import QuantizationModifier
  from llmcompressor.transformers import SparseAutoModelForCausalLM
  from llmcompressor import oneshot
  from transformers import AutoTokenizer  # needed for the tokenizer below

  model = SparseAutoModelForCausalLM.from_pretrained(
      model_args.model_name_or_path, torch_dtype=model_args.torch_dtype
  )
  tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
  ```
- The quantization setting is provided using `QuantizationModifier`; additional settings can be found in FP8Arguments.

  ```python
  recipe = QuantizationModifier(
      targets=fp8_args.targets,
      scheme=fp8_args.scheme,
      ignore=fp8_args.ignore,
  )
  ```
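  As a concrete illustration, a typical FP8 recipe quantizes all `Linear` layers except the `lm_head`. The values below follow common llm-compressor examples and are not necessarily the FP8Arguments defaults:

  ```python
  # Illustrative values: per-channel FP8 weights with dynamic per-token
  # activation scales; lm_head is skipped to preserve output quality
  recipe = QuantizationModifier(
      targets="Linear",
      scheme="FP8_DYNAMIC",
      ignore=["lm_head"],
  )
  ```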
- FP8 quantization is performed by calling the `oneshot` function.

  ```python
  oneshot(
      model=model,
      recipe=recipe,
      max_seq_length=data_args.max_seq_length,
      num_calibration_samples=data_args.num_calibration_samples,
  )
  ```
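  If the chosen scheme requires calibration data (e.g., static activation scales rather than dynamic ones), `oneshot` can also be given a dataset. A minimal sketch, assuming a dataset name that llm-compressor knows how to load:

  ```python
  # Illustrative: calibrate on a small sample of an open dataset
  oneshot(
      model=model,
      dataset="open_platypus",  # assumption: a dataset registered with llm-compressor
      recipe=recipe,
      max_seq_length=data_args.max_seq_length,
      num_calibration_samples=data_args.num_calibration_samples,
  )
  ```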
- The quantized model and the tokenizer are then saved to `output_dir`.

  ```python
  logger.info("Saving quantized model and tokenizer to {}".format(output_dir))
  model.save_pretrained(output_dir)
  tokenizer.save_pretrained(output_dir)
  ```
- Model evaluation is done using `lm_eval` through `vllm`. Accuracy and perplexity on the `lambada_openai` task will be reported.

  ```bash
  lm_eval --model vllm \
      --model_args pretrained="Meta-Llama-3-8B-FP8,add_bos_token=True,dtype=float16,enforce_eager=True" \
      --tasks lambada_openai \
      --device cuda:0 \
      --batch_size auto \
      --num_fewshot 5
  ```
> [!NOTE]
> Even though the A100 does not support FP8 computation, vllm can still utilize the compressed FP8 model and use FP16 computation to perform efficient inference.
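Since the checkpoint is serving-ready, it can also be loaded directly with vLLM's offline Python API. A minimal sketch (the prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint produced above; per the note, on GPUs without
# native FP8 compute such as the A100, vllm falls back to FP16 computation
llm = LLM(model="Meta-Llama-3-8B-FP8")
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=16, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```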