blog/2025-11-19-fp8-rl.md
Lines changed: 10 additions & 12 deletions
@@ -4,16 +4,14 @@ author: "InfiXAI Team, Ant Group AQ Team, SGLang RL Team, miles Team, slime Team
 date: "November 20, 2025"
 previewImg: /images/blog/fp8-rl/3_Megatron.png
 ---
+
+> TL;DR: We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training.

-> **TL;DR:**
->
-> **We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training.**
+SGLang RL Team and the slime community have conducted some interesting explorations around RL training stability and acceleration:

-The SGLang RL Team and the slime community have recently conducted some interesting explorations around RL training stability and acceleration:
+1. In terms of **training stability**, by [aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy), we achieve **strictly zero KL divergence** between the rollout and training processes on dense models, reaching perfect train–inference consistency.

-- In terms of **training stability**, by [aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy), we achieve **strictly zero KL divergence** between the rollout and training processes on dense models, reaching perfect train–inference consistency.
-
-- In terms of **training acceleration**, we introduce [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) into the RL sampling pipeline, which can significantly speed up sampling under suitable configurations.
+2. In terms of **training acceleration**, we introduce [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) into the RL sampling pipeline, which can significantly speed up sampling under suitable configurations.

 Building on this, we now share a new progress that balances both stability and performance—**implementing an end-to-end FP8 pipeline for RL training and sampling**. FP8 RL training for Qwen3-4B and Qwen3-30B-A3B has been [fully supported in slime](https://github.com/THUDM/slime/tree/main/examples/low_precision) and is ready to use out of the box.

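The "strictly zero KL divergence" between rollout and training mentioned in the list above can be checked per token. The following is a minimal sketch, not slime's actual implementation: it assumes both backends expose the log-probability they assign to each sampled token, and the function name `kl_estimates` is hypothetical. It returns two common Monte Carlo estimates of KL(rollout || train).

```python
import math


def kl_estimates(rollout_logprobs, train_logprobs):
    """Estimate KL(rollout || train) from log-probs of the sampled tokens.

    Both lists hold log p(token) for the same tokens, which were sampled
    by the rollout engine. Returns (k1, k3):
      k1 = mean(-log r)          unbiased, but can be negative
      k3 = mean(r - 1 - log r)   unbiased and non-negative
    where r = p_train(token) / p_rollout(token).
    """
    if len(rollout_logprobs) != len(train_logprobs):
        raise ValueError("log-prob lists must have the same length")
    if not rollout_logprobs:
        raise ValueError("need at least one token")

    k1 = 0.0
    k3 = 0.0
    for lp_rollout, lp_train in zip(rollout_logprobs, train_logprobs):
        log_ratio = lp_train - lp_rollout  # log r
        k1 += -log_ratio
        k3 += math.exp(log_ratio) - 1.0 - log_ratio
    n = len(rollout_logprobs)
    return k1 / n, k3 / n


# Bitwise-identical log-probs give exactly zero for both estimates.
aligned = kl_estimates([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5])

# A small mismatch between the two engines gives a positive k3.
mismatched = kl_estimates([-1.0, -2.0, -0.5], [-1.1, -1.9, -0.5])
```

Under these assumptions, "strictly zero" means the two log-prob lists agree exactly, so both estimates are 0.0 rather than merely small.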
@@ -136,7 +134,7 @@ In practice, the memory savings and speedups brought by FP8 are often less signi
 - **Performance degradation with small batch sizes**: When batch_size is small, FP8 training may fail to fully utilize GPU compute units and can even underperform BF16. The root cause is that FP8 introduces extra quantization and dequantization operations, which add CPU overhead. In Agentic RL scenarios, which typically use small batch sizes (e.g., batch_size=4), this issue is particularly pronounced—frequent CPU overhead can make FP8 training slower than BF16. (As shown below, GPU kernels are not densely scheduled; often the GPU has already finished the previous work but the next kernel launch is delayed because the system is CPU-bound.)

 <p align="center">
-<img src="/images/blog/fp8-rl/4_cpu_bound.png" alt="CPU bound for FP8 training" width="80%" />
+<img src="/images/blog/fp8-rl/4_cpu_bound.png" alt="CPU bound for FP8 training" width="50%" />
 </p>

 > Figure: CPU-bound behavior in FP8 training
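To make the "extra quantization and dequantization operations" concrete, here is a pure-Python sketch of a per-tensor FP8 (E4M3) round trip. It is an illustration under stated assumptions, not the kernels used in the post: the helper names are hypothetical, scaling is per-tensor, and subnormals and NaN handling are ignored. Each FP8 tensor needs an absolute-max reduction, a scale, and a cast, and each of these is an additional step to launch compared with BF16.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def round_to_e4m3(x):
    """Round x to 4 significant bits (1 implicit + 3 mantissa bits),
    saturating at +/-448. Simplified: subnormals are not modeled."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    return math.ldexp(round(mantissa * 16) / 16, exponent)


def quantize(values):
    """Per-tensor quantization: amax reduction, scale, then cast."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


tensor = [0.1, -0.5, 1.0, 3.0]
quantized, scale = quantize(tensor)
restored = dequantize(quantized, scale)
max_rel_error = max(abs(r - t) / abs(t) for r, t in zip(restored, tensor))
```

With 3 mantissa bits the relative rounding error of each element is at most 2^-4, which is the quantization error that the rest of the post tracks through training and rollout.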
@@ -167,7 +165,7 @@ The **InfiXAI Team** has already successfully run full FP8 training on **pre-tra
 When we directly switched from BF16 to FP8 and started training, we observed a striking phenomenon: compared with BF16 training, FP8 training has a significantly higher KL loss at the first step. As shown below, the initial KL loss for **FP8-TI** is significantly higher than that of BF16 training with FP8 inference (T denotes Training, I denotes Inference):

 <p align="center">
-<img src="/images/blog/fp8-rl/5_KLloss.png" alt="Initial KL loss comparison" width="80%" />
+<img src="/images/blog/fp8-rl/5_KLloss.png" alt="Initial KL loss comparison" width="50%" />
 </p>

 ### **Locating the Source of Error**
@@ -216,7 +214,7 @@ Based on these modes, we run two comparisons:
 The figure below visualizes the error distribution of all GEMM outputs over one full forward + backward pass, in execution order:
 > The figure shows how GEMM output errors evolve over one full iteration.
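A per-GEMM error of the kind plotted above can be computed by comparing each low-precision GEMM output against a higher-precision reference. The sketch below is an assumption about one reasonable metric (relative Frobenius-norm error); the post does not state which metric it uses. The `coarse` helper is a stand-in for FP8 rounding, not a real FP8 cast, and all names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain reference GEMM on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relative_error(reference, candidate):
    """Frobenius norm of (candidate - reference) divided by that of reference."""
    num = sum((c - r) ** 2 for rr, cr in zip(reference, candidate) for r, c in zip(rr, cr))
    den = sum(r ** 2 for rr in reference for r in rr)
    return math.sqrt(num / den) if den > 0 else math.sqrt(num)


def coarse(matrix, step=0.125):
    """Stand-in for low-precision inputs: snap every entry to a coarse grid."""
    return [[round(v / step) * step for v in row] for row in matrix]


a = [[0.11, -0.52], [1.03, 0.27]]
b = [[0.31, 0.77], [-0.48, 0.05]]

reference = matmul(a, b)
low_precision = matmul(coarse(a), coarse(b))

# Recording (name, error) pairs in execution order gives a trace that can be
# plotted across one forward + backward pass.
error_trace = [("example_gemm", relative_error(reference, low_precision))]
```

Appending one entry per GEMM call, in the order the calls execute, yields the kind of per-iteration error trace described in the text.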
@@ -251,15 +249,15 @@ To validate these hypotheses, we modified Transformer Engine (TE) and designed t
 The figure below shows KL-loss curves for the four cases. We see that Case 2, Case 3, and Case 4 (FP8-TI) have nearly identical KL loss at step 1, all significantly higher than Case 1:

 <p align="center">
-<img src="/images/blog/fp8-rl/7_KLloss2.png" alt="KL-loss comparison under different cases" width="80%" />
+<img src="/images/blog/fp8-rl/7_KLloss2.png" alt="KL-loss comparison under different cases" width="50%" />
 We introduce **clipfrac** from **Truncated Importance Sampling (TIS)** to validate hypothesis 3. This metric reflects the degree of off-policy training, i.e., the consistency between the model used for training and for generating experience. Higher clipfrac generally indicates more severe train–inference inconsistency.

 <p align="center">
-<img src="/images/blog/fp8-rl/8_TIS.png" alt="TIS-clipfrac comparison under different cases" width="80%" />
+<img src="/images/blog/fp8-rl/8_TIS.png" alt="TIS-clipfrac comparison under different cases" width="50%" />
 </p>

 From the figure we see that Case 2, Case 3, and Case 4 (FP8-TI) have clipfrac values of roughly the same order, all significantly lower than Case 1. This confirms: