Nvfp8 (#255)

zhaochenyang20 · zhaochen20 · web-flow · commit a24df31eefcc · 2025-11-19T17:46:40.000-08:00
* update FP8 training

* fix front

---------

Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com>
diff --git a/blog/2025-11-19-fp8-rl.md b/blog/2025-11-19-fp8-rl.md
@@ -4,16 +4,14 @@ author: "InfiXAI Team, Ant Group AQ Team, SGLang RL Team, miles Team, slime Team
 date: "November 20, 2025"
 previewImg: /images/blog/fp8-rl/3_Megatron.png
 ---
+ 
+> TL;DR: We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training.
 
-> **TL;DR:**
-> 
-> **We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training.**
+SGLang RL Team and the slime community have conducted some interesting explorations around RL training stability and acceleration:
 
-The SGLang RL Team and the slime community have recently conducted some interesting explorations around RL training stability and acceleration:
+1. In terms of **training stability**, by [aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy), we achieve **strictly zero KL divergence** between the rollout and training processes on dense models, reaching perfect train–inference consistency.
 
-- In terms of **training stability**, by [aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy), we achieve **strictly zero KL divergence** between the rollout and training processes on dense models, reaching perfect train–inference consistency.
-
-- In terms of **training acceleration**, we introduce [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) into the RL sampling pipeline, which can significantly speed up sampling under suitable configurations.
+2. In terms of **training acceleration**, we introduce [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) into the RL sampling pipeline, which can significantly speed up sampling under suitable configurations.
 
 Building on this, we now share new progress that balances both stability and performance: **implementing an end-to-end FP8 pipeline for RL training and sampling**. FP8 RL training for Qwen3-4B and Qwen3-30B-A3B is [fully supported in slime](https://github.com/THUDM/slime/tree/main/examples/low_precision) and is ready to use out of the box.
 
@@ -136,7 +134,7 @@ In practice, the memory savings and speedups brought by FP8 are often less signi
   - **Performance degradation with small batch sizes**: When batch_size is small, FP8 training may fail to fully utilize GPU compute units and can even underperform BF16. The root cause is that FP8 introduces extra quantization and dequantization operations, which add CPU overhead. In Agentic RL scenarios, which typically use small batch sizes (e.g., batch_size=4), this issue is particularly pronounced—frequent CPU overhead can make FP8 training slower than BF16. (As shown below, GPU kernels are not densely scheduled; often the GPU has already finished the previous work but the next kernel launch is delayed because the system is CPU-bound.)
 
 <p align="center">
-  <img src="/images/blog/fp8-rl/4_cpu_bound.png" alt="CPU bound for FP8 training" width="80%" />
+  <img src="/images/blog/fp8-rl/4_cpu_bound.png" alt="CPU bound for FP8 training" width="50%" />
 </p>
 
 > Figure: CPU-bound behavior in FP8 training
@@ -167,7 +165,7 @@ The **InfiXAI Team** has already successfully run full FP8 training on **pre-tra
 When we directly switched from BF16 to FP8 and started training, we observed a striking phenomenon: compared with BF16 training, FP8 training has a significantly higher KL loss at the first step. As shown below, the initial KL loss for **FP8-TI** is significantly higher than that of BF16 training with FP8 inference (T denotes Training, I denotes Inference):
 
 <p align="center">
-  <img src="/images/blog/fp8-rl/5_KLloss.png" alt="Initial KL loss comparison" width="80%" />
+  <img src="/images/blog/fp8-rl/5_KLloss.png" alt="Initial KL loss comparison" width="50%" />
 </p>
 
 ### **Locating the Source of Error**
@@ -216,7 +214,7 @@ Based on these modes, we run two comparisons:
 The figure below visualizes the error distribution of all GEMM outputs over one full forward + backward pass, in execution order:
 
 <p align="center">
-  <img src="/images/blog/fp8-rl/6_FP8_quant_error.png" alt="FP8 quantization error distribution" width="80%" />
+  <img src="/images/blog/fp8-rl/6_FP8_quant_error.png" alt="FP8 quantization error distribution" width="50%" />
 </p>
 
 > The figure shows how GEMM output errors evolve over one full iteration.
@@ -251,15 +249,15 @@ To validate these hypotheses, we modified Transformer Engine (TE) and designed t
 The figure below shows KL-loss curves for the four cases. We see that Case 2, Case 3, and Case 4 (FP8-TI) have nearly identical KL loss at step 1, all significantly higher than Case 1:
 
 <p align="center">
-  <img src="/images/blog/fp8-rl/7_KLloss2.png" alt="KL-loss comparison under different cases" width="80%" />
+  <img src="/images/blog/fp8-rl/7_KLloss2.png" alt="KL-loss comparison under different cases" width="50%" />
 </p>
 
 **Validating hypothesis 3 — TIS-clipfrac analysis**
 
 We introduce **clipfrac** from **Truncated Importance Sampling (TIS)** to validate hypothesis 3. This metric reflects the degree of off-policy training, i.e., the consistency between the model used for training and for generating experience. Higher clipfrac generally indicates more severe train–inference inconsistency.
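
As a rough illustration of the metric, the sketch below computes a clipfrac value from per-token log-probabilities. It assumes the common TIS formulation, where the importance ratio `exp(train_logprob - rollout_logprob)` is truncated from above at a cap and clipfrac is the fraction of tokens whose ratio exceeds that cap. The function name `tis_clipfrac` and the default `cap=2.0` are hypothetical choices for this example, not values taken from slime or from the experiments here.

```python
import math


def tis_clipfrac(train_logprobs, rollout_logprobs, cap=2.0):
    """Return (truncated_weights, clipfrac) for per-token log-probs.

    Assumption: the ratio is truncated only from above at `cap`;
    the thresholds used in the actual experiments may differ.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    if not train_logprobs:
        raise ValueError("need at least one token")

    # Importance ratio between the training policy and the rollout policy.
    ratios = [math.exp(t - r) for t, r in zip(train_logprobs, rollout_logprobs)]

    # Truncated importance weights that would multiply the per-token loss.
    truncated = [min(w, cap) for w in ratios]

    # Fraction of tokens whose ratio had to be truncated.
    clipped = sum(1 for w in ratios if w > cap)
    return truncated, clipped / len(ratios)
```

When the training and rollout engines assign identical log-probabilities, every ratio is 1 and this clipfrac is 0; the further the two engines drift apart, the more tokens cross the cap.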
 
 <p align="center">
-  <img src="/images/blog/fp8-rl/8_TIS.png" alt="TIS-clipfrac comparison under different cases" width="80%" />
+    <img src="/images/blog/fp8-rl/8_TIS.png" alt="TIS-clipfrac comparison under different cases" width="50%" />
 </p>
 
 From the figure we see that Case 2, Case 3, and Case 4 (FP8-TI) have clipfrac values of roughly the same order, all significantly lower than Case 1. This confirms: