---
title: 'Introducing Miles — RL Framework To Fire Up Large-Scale MoE Training'
author: "RadixArk Team"
date: "November 19, 2025"
previewImg: /images/blog/miles/miles.jpg
---

> A journey of a thousand miles begins with a single step.

We're excited to introduce Miles, an enterprise-facing reinforcement learning framework designed for large-scale MoE training and production workloads. This post is the first in a series of tech blogs.

Miles is forked from slime, the lightweight RL framework that has quietly powered many of today’s post-training pipelines and large MoE training runs. Building on slime’s foundation, Miles aims to deliver a smooth and controllable RL experience for teams that need reliability and scale in real-world deployments.

Miles is on GitHub: https://github.com/radixark/miles

## 🧠 Starting Point: slime - A Lightweight and Customizable RL Framework

Every mile of progress begins with one well-placed step, and for us that step is slime. A lightweight, customizable RL framework, slime has been growing in popularity across the community and has been battle-tested in large MoE training, where it was used to train GLM-4.6. slime comes with a few elegant design principles:

### Native performance

slime has native, structured support for SGLang and Megatron's full optimization stack, keeping pace with the fast evolution of inference and training frameworks.

### Clear, clean modularity

Its key components—Algorithm / Data / Rollout / Eval—are fully decoupled, letting users plug in new agent types, reward functions, or sampling strategies with minimal code changes.
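As an illustration of this kind of decoupling, a custom reward function can be slotted in through a small registry without touching the rest of the pipeline. The snippet below is a hypothetical sketch of the pattern, not slime's actual API; `register_reward` and `score_rollouts` are names we invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Registry mapping a reward name to a (prompt, completion) -> score function.
REWARD_REGISTRY: Dict[str, Callable[[str, str], float]] = {}

def register_reward(name: str):
    """Decorator that registers a reward function under a name."""
    def wrap(fn: Callable[[str, str], float]):
        REWARD_REGISTRY[name] = fn
        return fn
    return wrap

@register_reward("exact_match")
def exact_match(prompt: str, completion: str) -> float:
    # Toy reward: 1.0 if the completion echoes the prompt, else 0.0.
    return 1.0 if completion.strip() == prompt.strip() else 0.0

def score_rollouts(name: str, pairs: List[Tuple[str, str]]) -> List[float]:
    """Score a batch of rollouts with the named reward function."""
    fn = REWARD_REGISTRY[name]
    return [fn(p, c) for p, c in pairs]
```

Because the reward lives behind a registry lookup, swapping in a new one is a one-line change at the call site.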

### Model scientist-friendly

Every abstraction is readable and designed to be hackable. Algorithm researchers can modify importance sampling, rollout logic, or loss dynamics without touching low-level code. Inference-only and training-only debugging modes are provided for fast diagnosis of failing runs.

### Community-first

slime evolved through real-world feedback from the LMSYS and SGLang communities. It embodies what open collaboration across research and engineering can achieve.

## ⚙️ Momentum On the Way: What Was Recently Implemented

Miles builds on slime but focuses on new hardware (e.g., GB300), large-scale MoE RL, and production-grade stability. The following features have been added recently (most of them have also been upstreamed to slime):

### True On-Policy

Besides the existing determinism feature, where runs yield bitwise-identical, repeatable results, we further support [true on-policy](https://github.com/THUDM/slime/tree/main/examples/true_on_policy) training via an infrastructure approach.

- The mismatch between training and inference is reduced to exactly zero.
- To implement it, we use Flash Attention 3, DeepGEMM, the batch-invariant kernels from Thinking Machines Lab, and torch.compile. We also align numeric operation details between training and inference.

<img src="https://raw.githubusercontent.com/THUDM/slime/refs/heads/main/examples/true_on_policy/src/train_rollout_abs_diff.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 60%"></img>
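To make "exactly zero" concrete: true on-policy means the log-probability each sampled token received at rollout time is bitwise identical to the one the trainer recomputes, so the maximum absolute difference is 0.0 rather than merely small. A minimal, framework-free sketch of that check (the function names are ours, not Miles' API):

```python
import math
from typing import List

def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = math.log(sum(math.exp(x - m) for x in logits)) + m
    return [x - log_z for x in logits]

def max_abs_logprob_diff(train_logits: List[List[float]],
                         rollout_logits: List[List[float]],
                         tokens: List[int]) -> float:
    """Max |train - rollout| log-prob over the sampled tokens.

    Under true on-policy (batch-invariant kernels, aligned numerics)
    this is exactly 0.0, not just "close to zero"."""
    worst = 0.0
    for tl, rl, t in zip(train_logits, rollout_logits, tokens):
        worst = max(worst, abs(log_softmax(tl)[t] - log_softmax(rl)[t]))
    return worst
```

In a real run the two logit tensors come from the training and inference engines respectively; the plot above tracks exactly this kind of train-rollout absolute difference.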


### Memory Improvements

To fully utilize precious GPU memory for maximum performance without hitting OOM errors, we made updates such as the following:

- Propagate benign OOMs instead of crashing the run.
- Implement a memory margin to prevent OOMs caused by NCCL.
- Fix excessive memory usage and OOMs in FSDP.
- Support move-based and partial offloading.
- Reduce host peak memory usage.
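A toy sketch of the first idea, propagating a benign OOM: catch it at a step boundary and retry with smaller micro-batches instead of killing the whole run. This is our own illustration, not Miles' actual handling (which also covers the NCCL memory margin and offloading above); `OOMError` stands in for `torch.cuda.OutOfMemoryError`.

```python
class OOMError(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this sketch."""

def run_with_oom_backoff(step_fn, batch, min_size=1):
    """Run step_fn over `batch`, halving the micro-batch size on OOM.

    Simplification: assumes step_fn is safe to re-run per chunk."""
    size = len(batch)
    while size >= min_size:
        try:
            return [step_fn(batch[i:i + size])
                    for i in range(0, len(batch), size)]
        except OOMError:
            size //= 2  # back off and retry with smaller micro-batches
    raise OOMError("batch does not fit even at the minimum micro-batch size")
```

The point is that an OOM on a single over-sized step is recoverable state, not a fatal error, which matters a lot for multi-day production runs.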

### Speculative Training

In RL, freezing the draft model prevents it from following the target model's policy, which reduces the accept length and degrades the speedup. We therefore perform online SFT on the draft model throughout RL.

- Achieves a 25%+ rollout speedup vs. a frozen MTP draft, especially in the late training stage.
- Supports MTP with sequence packing + CP; loss masks with proper edge-case handling; LM head/embedding gradient isolation; and Megatron↔SGLang weight syncing.

<img src="https://raw.githubusercontent.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/refs/heads/main/rlhf/slime/spec/pic/overall-throughput.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 60%"></img>
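The metric at stake here is the accept length: how many of the draft model's proposed tokens the target model verifies before the first mismatch. A toy helper (ours, not a Miles API) makes the failure mode easy to see:

```python
def accept_length(draft_tokens, target_tokens):
    """Count the leading draft tokens the target model accepts.

    In speculative decoding, verification stops at the first token
    where the draft's proposal disagrees with the target's choice."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

# As the RL policy drifts away from a frozen draft model, fewer draft
# tokens match, the accept length shrinks, and the speculative speedup
# fades -- which is why Miles keeps training the draft model online.
```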

### Miscellaneous Updates

We have also enhanced the FSDP training backend; allowed deploying the rollout subsystem independently, outside the framework; added debug utilities such as more metrics, post-hoc analyzers, and enhanced profilers; gradually refactored the code; and provided a formal mathematics (Lean) example with SFT/RL scripts.

## 🚧 Towards the Future: Our Roadmap

For the future development of Miles, we will put more effort into supporting enterprise-grade RL training. This includes:

- Large-scale MoE RL examples on new hardware, e.g., GB300.
- Multi-modal training.
- Rollout accelerations:
  - Compatibility with SGLang spec v2 for better performance.
  - Advanced speculative training support, e.g., EAGLE3 and multiple spec layers.
- Resource allocation for balanced training & serving in large-scale async training.
- Elasticity to GPU failures.

## 🤝 Thanks to Our Community

Miles exists thanks to the slime authors and the broader (SGLang) RL community.

We invite researchers, startups, and enterprise teams alike to explore slime and Miles - whichever best fits your environment - and to join us in making reinforcement learning efficient and reliable. We'll listen to the community and actively work on Miles' future development, towards a production-ready training environment.