Commit a24df31

Nvfp8 (#255)

* update FP8 training
* fix front

Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
1 parent d7d5fd8 commit a24df31

1 file changed

Lines changed: 10 additions & 12 deletions

blog/2025-11-19-fp8-rl.md

@@ -4,16 +4,14 @@ author: "InfiXAI Team, Ant Group AQ Team, SGLang RL Team, miles Team, slime Team
date: "November 20, 2025"
previewImg: /images/blog/fp8-rl/3_Megatron.png
---
+
+> TL;DR: We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training.

-> **TL;DR:**
->
-> **We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training.**
+SGLang RL Team and the slime community have conducted some interesting explorations around RL training stability and acceleration:

-The SGLang RL Team and the slime community have recently conducted some interesting explorations around RL training stability and acceleration:
+1. In terms of **training stability**, by [aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy), we achieve **strictly zero KL divergence** between the rollout and training processes on dense models, reaching perfect train–inference consistency.

-- In terms of **training stability**, by [aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy), we achieve **strictly zero KL divergence** between the rollout and training processes on dense models, reaching perfect train–inference consistency.
-
-- In terms of **training acceleration**, we introduce [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) into the RL sampling pipeline, which can significantly speed up sampling under suitable configurations.
+2. In terms of **training acceleration**, we introduce [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) into the RL sampling pipeline, which can significantly speed up sampling under suitable configurations.

Building on this, we now share a new progress that balances both stability and performance—**implementing an end-to-end FP8 pipeline for RL training and sampling**. FP8 RL training for Qwen3-4B and Qwen3-30B-A3B has been [fully supported in slime](https://github.com/THUDM/slime/tree/main/examples/low_precision) and is ready to use out of the box.

@@ -136,7 +134,7 @@ In practice, the memory savings and speedups brought by FP8 are often less signi
- **Performance degradation with small batch sizes**: When batch_size is small, FP8 training may fail to fully utilize GPU compute units and can even underperform BF16. The root cause is that FP8 introduces extra quantization and dequantization operations, which add CPU overhead. In Agentic RL scenarios, which typically use small batch sizes (e.g., batch_size=4), this issue is particularly pronounced—frequent CPU overhead can make FP8 training slower than BF16. (As shown below, GPU kernels are not densely scheduled; often the GPU has already finished the previous work but the next kernel launch is delayed because the system is CPU-bound.)

<p align="center">
-<img src="/images/blog/fp8-rl/4_cpu_bound.png" alt="CPU bound for FP8 training" width="80%" />
+<img src="/images/blog/fp8-rl/4_cpu_bound.png" alt="CPU bound for FP8 training" width="50%" />
</p>

> Figure: CPU-bound behavior in FP8 training
@@ -167,7 +165,7 @@ The **InfiXAI Team** has already successfully run full FP8 training on **pre-tra
When we directly switched from BF16 to FP8 and started training, we observed a striking phenomenon: compared with BF16 training, FP8 training has a significantly higher KL loss at the first step. As shown below, the initial KL loss for **FP8-TI** is significantly higher than that of BF16 training with FP8 inference (T denotes Training, I denotes Inference):

<p align="center">
-<img src="/images/blog/fp8-rl/5_KLloss.png" alt="Initial KL loss comparison" width="80%" />
+<img src="/images/blog/fp8-rl/5_KLloss.png" alt="Initial KL loss comparison" width="50%" />
</p>

### **Locating the Source of Error**
@@ -216,7 +214,7 @@ Based on these modes, we run two comparisons:
The figure below visualizes the error distribution of all GEMM outputs over one full forward + backward pass, in execution order:

<p align="center">
-<img src="/images/blog/fp8-rl/6_FP8_quant_error.png" alt="FP8 quantization error distribution" width="80%" />
+<img src="/images/blog/fp8-rl/6_FP8_quant_error.png" alt="FP8 quantization error distribution" width="50%" />
</p>

> The figure shows how GEMM output errors evolve over one full iteration.
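
To make the idea of a GEMM output error concrete, here is a minimal pure-Python sketch. It is not taken from the commit or from Transformer Engine. It assumes per-tensor scaling to the E4M3 range and relative error as the metric, and all function names are hypothetical. It rounds inputs to an E4M3-like grid, dequantizes them, and compares a dot product (a 1x1 GEMM) against the unquantized result.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(x):
    """Round a finite float to the nearest E4M3-representable value (saturating)."""
    if x == 0.0:
        return 0.0
    # x == mantissa * 2**exponent with 0.5 <= |mantissa| < 1
    mantissa, exponent = math.frexp(x)
    if exponent < -5:
        # Below the smallest normal value (2**-6) the grid has a fixed step of 2**-9.
        rounded = round(x * 512.0) / 512.0
    else:
        # 1 implicit bit + 3 stored mantissa bits = 4 significant bits.
        rounded = math.ldexp(round(mantissa * 16.0) / 16.0, exponent)
    return max(-E4M3_MAX, min(E4M3_MAX, rounded))


def fake_quantize(values):
    """Per-tensor scaled quantize -> dequantize round trip (assumed scheme)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = E4M3_MAX / amax  # map the largest magnitude onto the FP8 maximum
    return [round_to_e4m3(v * scale) / scale for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relative_error(a, b):
    """Relative error of a dot product computed from quantized inputs.

    Assumes the exact result is nonzero.
    """
    exact = dot(a, b)
    approx = dot(fake_quantize(a), fake_quantize(b))
    return abs(approx - exact) / abs(exact)


a = [0.1, 0.7, 0.33, 2.5]
b = [1.0, 0.5, 0.25, 0.125]
print(relative_error(a, b))
```

Real FP8 training typically uses finer-grained scaling (for example per-block or per-channel) and accumulates in higher precision, so this sketch only illustrates where rounding error enters, not its actual magnitude in practice.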
@@ -251,15 +249,15 @@ To validate these hypotheses, we modified Transformer Engine (TE) and designed t
The figure below shows KL-loss curves for the four cases. We see that Case 2, Case 3, and Case 4 (FP8-TI) have nearly identical KL loss at step 1, all significantly higher than Case 1:

<p align="center">
-<img src="/images/blog/fp8-rl/7_KLloss2.png" alt="KL-loss comparison under different cases" width="80%" />
+<img src="/images/blog/fp8-rl/7_KLloss2.png" alt="KL-loss comparison under different cases" width="50%" />
</p>

**Validating hypothesis 3 — TIS-clipfrac analysis**

We introduce **clipfrac** from **Truncated Importance Sampling (TIS)** to validate hypothesis 3. This metric reflects the degree of off-policy training, i.e., the consistency between the model used for training and for generating experience. Higher clipfrac generally indicates more severe train–inference inconsistency.
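
As a rough illustration of the metric described above, the sketch below computes a clipfrac from per-token log-probabilities. It is a minimal sketch under stated assumptions, not slime's implementation: the function name `tis_clipfrac` and the parameter `clip_max` are hypothetical, and only upper truncation of the importance ratio is modeled.

```python
import math


def tis_clipfrac(train_logprobs, rollout_logprobs, clip_max=2.0):
    """Return (truncated_weights, clipfrac) for per-token log-probabilities.

    The importance ratio for each token is pi_train / pi_rollout, computed
    in log space. Ratios above clip_max are truncated to clip_max, and
    clipfrac is the fraction of tokens that were truncated.
    clip_max=2.0 is an illustrative default, not a value from the source.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not train_logprobs:
        raise ValueError("need at least one token")

    weights = []
    clipped = 0
    for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs):
        ratio = math.exp(lp_train - lp_rollout)
        if ratio > clip_max:
            clipped += 1
            ratio = clip_max
        weights.append(ratio)
    return weights, clipped / len(weights)


# When training and rollout agree exactly, every ratio is 1 and nothing is clipped.
weights, clipfrac = tis_clipfrac([-1.2, -0.3, -2.0], [-1.2, -0.3, -2.0])
print(clipfrac)
```

The larger the gap between the training and rollout log-probabilities, the more ratios exceed the threshold, which is why the metric serves as a proxy for train–inference inconsistency.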

<p align="center">
-<img src="/images/blog/fp8-rl/8_TIS.png" alt="TIS-clipfrac comparison under different cases" width="80%" />
+<img src="/images/blog/fp8-rl/8_TIS.png" alt="TIS-clipfrac comparison under different cases" width="50%" />
</p>

From the figure we see that Case 2, Case 3, and Case 4 (FP8-TI) have clipfrac values of roughly the same order, all significantly lower than Case 1. This confirms:
