FP8 QAT

Introduction

To enable developers to efficiently train ERNIE 4.5 models with minimal resource requirements, we developed an innovative FP8 Quantization-Aware Training method. Our approach delivers two significant advantages:

Training Resource Reduction
- Enables full-parameter SFT of 300B-parameter models using just 16 Hopper 80G GPUs, only 17% of the hardware resources required for traditional BF16 mixed-precision training.
- Maintains LLM performance with nearly no accuracy degradation.
Inference Acceleration
- Supports tensor-wise static W8A8 FP8 inference without the need for quantization calibration.
- Achieves 1.17x speedup compared to block-wise dynamic FP8 quantization inference.

Method

As shown in the figure below, we introduce a Hadamard matrix to ensure stable convergence in tensor-wise static FP8 quantization-aware training (QAT). To reduce computational overhead and support varying tensor shapes, a block-diagonal Hadamard matrix is used, with standard submatrices placed along the diagonal.
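To illustrate why a block-diagonal Hadamard matrix handles shapes that are not powers of two, the following is a minimal sketch in plain Python, not the ERNIE implementation. It assumes normalized Sylvester-type Hadamard blocks, applied to each contiguous block with a fast Walsh-Hadamard transform. The helper names (`fwht`, `block_hadamard`, `dot`), the block size of 4, and the sample values are all hypothetical. The sketch shows two properties that make the rotation useful before tensor-wise quantization: a single outlier is spread across its block, which lowers the maximum magnitude that one per-tensor scale must cover, and rotating both operands leaves their dot product unchanged.

```python
import math


def fwht(block):
    """Normalized fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(block)
    if n == 0 or n & (n - 1):
        raise ValueError("block length must be a power of two")
    out = list(block)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthogonal
    return [v * scale for v in out]


def block_hadamard(vec, block_size):
    """Multiply vec by a block-diagonal matrix of normalized Hadamard blocks."""
    if len(vec) % block_size:
        raise ValueError("vector length must be a multiple of block_size")
    out = []
    for start in range(0, len(vec), block_size):
        out.extend(fwht(vec[start:start + block_size]))
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical activation row with one large outlier, and a matching weight column.
x = [0.1, -0.2, 0.05, 8.0, 0.3, -0.1, 0.2, 0.15, 0.1, 0.0, -0.3, 0.25]
w = [0.5, -1.0, 0.25, 0.1, 0.4, 0.3, -0.2, 0.6, -0.7, 0.9, 0.05, -0.15]

# Length 12 is not a power of two, but it is a multiple of the block size 4.
x_rot = block_hadamard(x, 4)
w_rot = block_hadamard(w, 4)

# The outlier is spread over its block, so the largest magnitude shrinks.
print(max(abs(v) for v in x), max(abs(v) for v in x_rot))
# Rotating both operands preserves the dot product up to rounding error.
print(dot(x, w), dot(x_rot, w_rot))
```

Because each block is transformed independently, the cost scales with the block size rather than with the full hidden dimension, which is consistent with the overhead argument above.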

In LLM training, GPU memory is primarily consumed by model parameters, gradients, optimizer states, and intermediate activations. In our FP8 quantization-aware training (QAT) approach, model parameters are stored in FP8, while optimizer moments and gradients use BF16. Furthermore, all optimizer states are offloaded to pinned memory, significantly reducing GPU memory usage during training.
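The storage layout described above can be turned into a rough back-of-the-envelope estimate. The sketch below is an assumption-laden illustration, not a measurement: it assumes exactly 300 billion parameters, two Adam-style optimizer moments, decimal gigabytes, and "80G" read as 80 GB per GPU. It ignores activations, communication buffers, and any sharding overhead, and the function name `training_memory_gb` is hypothetical.

```python
GB = 10**9  # decimal gigabytes (assumption)


def training_memory_gb(num_params, num_moments=2):
    """Estimate memory per category, using the byte widths named in the text."""
    weights_fp8 = num_params * 1                   # FP8: 1 byte per parameter
    grads_bf16 = num_params * 2                    # BF16: 2 bytes per gradient
    moments_bf16 = num_params * 2 * num_moments    # BF16 moments, offloaded to host
    return {
        "gpu_weights": weights_fp8 / GB,
        "gpu_gradients": grads_bf16 / GB,
        "host_optimizer_moments": moments_bf16 / GB,
    }


usage = training_memory_gb(300 * 10**9)
gpu_total = usage["gpu_weights"] + usage["gpu_gradients"]
cluster_capacity = 16 * 80  # 16 GPUs, assumed 80 GB each

print(usage)
print(gpu_total, cluster_capacity)
```

Under these assumptions, weights and gradients together fit within the aggregate GPU memory of 16 devices, while the optimizer moments, the largest category, reside in host memory. The remaining GPU headroom is what is left for activations and other buffers.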
