---
title: 'Introducing Miles — RL Framework To Fire Up Large-Scale MoE Training'
author: "RadixArk Team"
date: "November 19, 2025"
previewImg: /images/blog/miles/miles.jpg
---

> A journey of a thousand miles begins with a single step.

We're excited to introduce Miles, an enterprise-facing reinforcement learning framework designed for large-scale MoE training and production workloads. This post is the first in a series of tech blogs.

Miles is forked from slime, the lightweight RL framework that has quietly powered many of today’s post-training pipelines and large MoE training runs. Building on slime’s foundation, Miles aims to deliver a smooth and controllable RL experience for teams that need reliability and scale in real-world deployments.

Miles is on GitHub: https://github.com/radixark/miles

## 🧠 Starting Point: slime - A Lightweight and Customizable RL Framework

Every mile of progress begins with one well-placed step, and for us that step is slime. A lightweight, customizable RL framework, slime has been growing in popularity across the community and has been battle-tested in large MoE training, where it was used to train GLM-4.6. slime comes with a few elegant design principles:

### Native performance

slime has native, structured support for SGLang and Megatron's full optimization stack, keeping pace with the fast evolution of inference and training frameworks.

### Clear, clean modularity

Its key components—Algorithm / Data / Rollout / Eval—are fully decoupled, letting users plug in new agent types, reward functions, or sampling strategies with minimal code changes.
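As an illustration of this kind of decoupling, a custom reward function can be slotted in through a small registry without touching the rest of the pipeline. The snippet below is a hypothetical sketch of the pattern, not slime's actual API; `register_reward` and `score_rollouts` are names we invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Registry mapping a reward name to a (prompt, completion) -> score function.
REWARD_REGISTRY: Dict[str, Callable[[str, str], float]] = {}

def register_reward(name: str):
    """Decorator that registers a reward function under a name."""
    def wrap(fn: Callable[[str, str], float]):
        REWARD_REGISTRY[name] = fn
        return fn
    return wrap

@register_reward("exact_match")
def exact_match(prompt: str, completion: str) -> float:
    # Toy reward: 1.0 if the completion echoes the prompt, else 0.0.
    return 1.0 if completion.strip() == prompt.strip() else 0.0

def score_rollouts(name: str, pairs: List[Tuple[str, str]]) -> List[float]:
    """Score a batch of rollouts with the named reward function."""
    fn = REWARD_REGISTRY[name]
    return [fn(p, c) for p, c in pairs]
```

Because the reward lives behind a registry lookup, swapping in a new one is a one-line change at the call site.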

### Model scientist-friendly

Every abstraction is readable and designed to be hackable. Algorithm researchers can modify importance sampling, rollout logic, or loss dynamics without touching low-level code. Inference-only and training-only debugging modes are provided for fast diagnosis of failing runs.

### Community-first

slime evolved through real-world feedback from the LMSYS and SGLang communities. It embodies what open collaboration across research and engineering can achieve.

## ⚙️ Momentum On the Way: What Was Recently Implemented

Miles builds on slime but focuses on new hardware (e.g., GB300), large-scale MoE RL, and production-grade stability. The following features have been added recently (most of them have also been upstreamed to slime):

### True On-Policy

Besides the existing determinism feature, where runs yield bitwise-identical, repeatable results, we further support [true on-policy](https://github.com/THUDM/slime/tree/main/examples/true_on_policy) training via an infrastructure approach.

- The mismatch between training and inference is reduced to exactly zero.
- To implement it, we use Flash Attention 3, DeepGEMM, the batch-invariant kernels from Thinking Machines Lab, and torch.compile. We also align numeric operation details between training and inference.

<img src="https://raw.githubusercontent.com/THUDM/slime/refs/heads/main/examples/true_on_policy/src/train_rollout_abs_diff.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 60%"></img>
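To make "exactly zero" concrete: true on-policy means the log-probability each sampled token received at rollout time is bitwise identical to the one the trainer recomputes, so the maximum absolute difference is 0.0 rather than merely small. A minimal, framework-free sketch of that check (the function names are ours, not Miles' API):

```python
import math
from typing import List

def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = math.log(sum(math.exp(x - m) for x in logits)) + m
    return [x - log_z for x in logits]

def max_abs_logprob_diff(train_logits: List[List[float]],
                         rollout_logits: List[List[float]],
                         tokens: List[int]) -> float:
    """Max |train - rollout| log-prob over the sampled tokens.

    Under true on-policy (batch-invariant kernels, aligned numerics)
    this is exactly 0.0, not just "close to zero"."""
    worst = 0.0
    for tl, rl, t in zip(train_logits, rollout_logits, tokens):
        worst = max(worst, abs(log_softmax(tl)[t] - log_softmax(rl)[t]))
    return worst
```

In a real run the two logit tensors come from the training and inference engines respectively; the plot above tracks exactly this kind of train-rollout absolute difference.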


### Memory Improvements

To fully utilize precious GPU memory for maximum performance without hitting OOM errors, we made updates such as the following:

- Propagate benign OOMs instead of crashing the run.
- Implement a memory margin to prevent OOMs caused by NCCL.
- Fix excessive memory usage and OOMs in FSDP.
- Support move-based and partial offloading.
- Reduce host peak memory usage.
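A toy sketch of the first idea, propagating a benign OOM: catch it at a step boundary and retry with smaller micro-batches instead of killing the whole run. This is our own illustration, not Miles' actual handling (which also covers the NCCL memory margin and offloading above); `OOMError` stands in for `torch.cuda.OutOfMemoryError`.

```python
class OOMError(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this sketch."""

def run_with_oom_backoff(step_fn, batch, min_size=1):
    """Run step_fn over `batch`, halving the micro-batch size on OOM.

    Simplification: assumes step_fn is safe to re-run per chunk."""
    size = len(batch)
    while size >= min_size:
        try:
            return [step_fn(batch[i:i + size])
                    for i in range(0, len(batch), size)]
        except OOMError:
            size //= 2  # back off and retry with smaller micro-batches
    raise OOMError("batch does not fit even at the minimum micro-batch size")
```

The point is that an OOM on a single over-sized step is recoverable state, not a fatal error, which matters a lot for multi-day production runs.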

### Speculative Training

In RL, freezing the draft model prevents it from following the target model's policy, which reduces the accept length and degrades the speedup. We therefore perform online SFT on the draft model throughout RL.

- Achieves a 25%+ rollout speedup vs. a frozen MTP draft, especially in the late training stage.
- Supports MTP with sequence packing + CP; loss masks with proper edge-case handling; LM head/embedding gradient isolation; and Megatron↔SGLang weight syncing.

<img src="https://raw.githubusercontent.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/refs/heads/main/rlhf/slime/spec/pic/overall-throughput.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 60%"></img>
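The metric at stake here is the accept length: how many of the draft model's proposed tokens the target model verifies before the first mismatch. A toy helper (ours, not a Miles API) makes the failure mode easy to see:

```python
def accept_length(draft_tokens, target_tokens):
    """Count the leading draft tokens the target model accepts.

    In speculative decoding, verification stops at the first token
    where the draft's proposal disagrees with the target's choice."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

# As the RL policy drifts away from a frozen draft model, fewer draft
# tokens match, the accept length shrinks, and the speculative speedup
# fades -- which is why Miles keeps training the draft model online.
```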

### Miscellaneous Updates

We have also enhanced the FSDP training backend; allowed deploying the rollout subsystem independently, outside the framework; added debug utilities such as more metrics, post-hoc analyzers, and enhanced profilers; gradually refactored the code; and provided a formal mathematics (Lean) example with SFT/RL scripts.

## 🚧 Towards the Future: Our Roadmap

For the future development of Miles, we will put more effort into supporting enterprise-grade RL training. This includes:

- Large-scale MoE RL examples on new hardware, e.g., GB300.
- Multi-modal training.
- Rollout accelerations:
  - Compatibility with SGLang spec v2 for better performance.
  - Advanced speculative training support, e.g., EAGLE3 and multiple spec layers.
- Resource allocation for balanced training & serving in large-scale async training.
- Elasticity to GPU failures.

## 🤝 Thanks to Our Community

Miles exists thanks to the slime authors and the broader (SGLang) RL community.

We invite researchers, startups, and enterprise teams alike to explore slime and Miles - whichever best fits your environment - and to join us in making reinforcement learning efficient and reliable. We'll listen to the community and actively work on Miles' future development, towards a production-ready training environment.