There are two types of FP8 support in FMS Model Optimizer:

- mature FP8, which can generate a model that is ready for serving by vllm, and
- experimental FP8, which is simulation-only but offers more advanced quantization options.
This is an example of mature FP8, which under the hood leverages functionality from llm-compressor, a third-party library, to perform the FP8 quantization. An example of the experimental FP8 can be found here.
- Nvidia A100 family or higher
- The llm-compressor library can be installed using pip:

  ```bash
  pip install llmcompressor
  ```
- To evaluate the FP8 quantized model, the lm-eval and vllm libraries are also required:

  ```bash
  pip install vllm lm_eval
  ```
> [!CAUTION]
> vllm may require a specific PyTorch version that differs from the one installed in your current environment, and it may force-install it without asking. Make sure it is compatible with your setup, or create a new environment if needed.
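One way to avoid such conflicts is to keep the evaluation stack in its own virtual environment, as the caution above suggests. A minimal sketch (the environment name is illustrative):

```bash
# Illustrative: isolate vllm/lm-eval in a fresh environment so a
# force-installed PyTorch cannot break your existing setup
python -m venv fp8-eval-env
source fp8-eval-env/bin/activate
pip install vllm lm_eval
```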
This end-to-end example utilizes the common set of interfaces provided by fms_mo for easily applying multiple quantization algorithms; FP8 is the focus here. The steps involved are:
- FP8 quantization through the CLI. Other arguments can be found in FP8Arguments.

  ```bash
  python -m fms_mo.run_quant \
      --model_name_or_path "meta-llama/Meta-Llama-3-8B" \
      --quant_method fp8 \
      --torch_dtype bfloat16 \
      --output_dir "Meta-Llama-3-8B-FP8"
  ```
> [!NOTE]
> - The quantized model and tokenizer will be saved to `output_dir`, but some additional temporary storage space may be needed.
> - Runtime is ~1 min on an A100 (model download time not included).
> - If you have trouble downloading the Llama family of models from Hugging Face (Llama models require access), you can use `ibm-granite/granite-3.0-8b-instruct` instead.
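For example, the same command with the Granite model mentioned in the note above (the output directory name is illustrative):

```bash
python -m fms_mo.run_quant \
    --model_name_or_path "ibm-granite/granite-3.0-8b-instruct" \
    --quant_method fp8 \
    --torch_dtype bfloat16 \
    --output_dir "granite-3.0-8b-instruct-FP8"
```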
- Inspect the FP8 checkpoint:

  ```python
  from fms_mo.utils.utils import checkpoint_summary

  checkpoint_summary("Meta-Llama-3-8B-FP8")
  ```
  We can see that most of the tensors are saved in FP8 format (`torch.float8_e4m3fn`). If you further print out the summary DataFrame (by adding the `show_details=True` flag), you will find that the layers remaining in `bfloat16` are the `embeddings` and `lm_head`.

  ```text
                           mem (MB)
  dtype
  torch.bfloat16        2104.631296
  torch.float8_e4m3fn   6979.321856
  ```
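  For instance, the per-layer view can be requested as follows (a minimal sketch, assuming `show_details` is passed as a keyword argument):

  ```python
  # Per-layer breakdown of dtypes and sizes, as described above
  checkpoint_summary("Meta-Llama-3-8B-FP8", show_details=True)
  ```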
> [!NOTE]
> The FP16 model file size on storage is ~16.07 GB, while FP8 is ~8.6 GB, i.e. roughly a 1.9x reduction. The saving is slightly less than 2x because the embeddings and lm_head remain in bfloat16.
- Evaluate the quantized model's performance on a selected task using the `lm-eval` library. The command below runs evaluation on the `lambada_openai` task and reports perplexity and accuracy at the end.

  ```bash
  lm_eval --model vllm \
      --model_args pretrained="Meta-Llama-3-8B-FP8,add_bos_token=True,dtype=float16,enforce_eager=True" \
      --tasks lambada_openai \
      --device cuda:0 \
      --batch_size 1 \
      --num_fewshot 5
  ```
  - BF16 (not quantized) LLAMA3-8B model:

    | Tasks          | Version | Filter | n-shot | Metric     |   | Value  |   | Stderr |
    |----------------|--------:|--------|-------:|------------|---|-------:|---|-------:|
    | lambada_openai |       1 | none   |      5 | acc        | ↑ | 0.7120 | ± | 0.0287 |
    |                |         | none   |      5 | perplexity | ↓ | 3.8683 | ± | 0.3716 |
  - FP8 quantized LLAMA3-8B model:

    | Tasks          | Version | Filter | n-shot | Metric     |   | Value  |   | Stderr |
    |----------------|--------:|--------|-------:|------------|---|-------:|---|-------:|
    | lambada_openai |       1 | none   |      5 | acc        | ↑ | 0.7160 | ± | 0.0286 |
    |                |         | none   |      5 | perplexity | ↓ | 3.8915 | ± | 0.3727 |

  The FP8 model matches the BF16 baseline within the reported standard errors: accuracy 0.7160 vs 0.7120 and perplexity 3.8915 vs 3.8683.
Under the hood, the CLI performs the following steps:

- The non-quantized pre-trained model is loaded using the model wrapper from `llm-compressor`. The corresponding tokenizer is constructed as well.

  ```python
  from llmcompressor.modifiers.quantization import QuantizationModifier
  from llmcompressor.transformers import SparseAutoModelForCausalLM
  from llmcompressor import oneshot
  from transformers import AutoTokenizer  # needed for the tokenizer below

  model = SparseAutoModelForCausalLM.from_pretrained(
      model_args.model_name_or_path, torch_dtype=model_args.torch_dtype
  )
  tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
  ```
- The quantization setting is provided using `QuantizationModifier`; additional settings can be found in FP8Arguments.

  ```python
  recipe = QuantizationModifier(
      targets=fp8_args.targets,
      scheme=fp8_args.scheme,
      ignore=fp8_args.ignore,
  )
  ```
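  As a concrete illustration, a typical FP8 recipe quantizes all `Linear` layers except the `lm_head`. The values below follow common llm-compressor examples and are not necessarily the FP8Arguments defaults:

  ```python
  # Illustrative values: per-channel FP8 weights with dynamic per-token
  # activation scales; lm_head is skipped to preserve output quality
  recipe = QuantizationModifier(
      targets="Linear",
      scheme="FP8_DYNAMIC",
      ignore=["lm_head"],
  )
  ```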
- FP8 quantization is performed by calling the `oneshot` function.

  ```python
  oneshot(
      model=model,
      recipe=recipe,
      max_seq_length=data_args.max_seq_length,
      num_calibration_samples=data_args.num_calibration_samples,
  )
  ```
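  If the chosen scheme requires calibration data (e.g., static activation scales rather than dynamic ones), `oneshot` can also be given a dataset. A minimal sketch, assuming a dataset name that llm-compressor knows how to load:

  ```python
  # Illustrative: calibrate on a small sample of an open dataset
  oneshot(
      model=model,
      dataset="open_platypus",  # assumption: a dataset registered with llm-compressor
      recipe=recipe,
      max_seq_length=data_args.max_seq_length,
      num_calibration_samples=data_args.num_calibration_samples,
  )
  ```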
- The quantized model and the tokenizer are then saved to `output_dir`.

  ```python
  logger.info("Saving quantized model and tokenizer to {}".format(output_dir))
  model.save_pretrained(output_dir)
  tokenizer.save_pretrained(output_dir)
  ```
- Model evaluation is done using `lm_eval` through `vllm`. Accuracy and perplexity on the `lambada_openai` task will be reported.

  ```bash
  lm_eval --model vllm \
      --model_args pretrained="Meta-Llama-3-8B-FP8,add_bos_token=True,dtype=float16,enforce_eager=True" \
      --tasks lambada_openai \
      --device cuda:0 \
      --batch_size auto \
      --num_fewshot 5
  ```
> [!NOTE]
> Even though the A100 does not support FP8 computation, vllm can still utilize the compressed FP8 model and use FP16 computation to perform efficient inference.
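Since the checkpoint is serving-ready, it can also be loaded directly with vLLM's offline Python API. A minimal sketch (the prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint produced above; per the note, on GPUs without
# native FP8 compute such as the A100, vllm falls back to FP16 computation
llm = LLM(model="Meta-Llama-3-8B-FP8")
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=16, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```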