---
title: 'Introducing Miles — RL Framework To Fire Up Large-Scale MoE Training'
author: "RadixArk Team"
date: "November 19, 2025"
previewImg: /images/blog/miles/miles.jpg
---

> A journey of a thousand miles begins with a single step.

We're excited to introduce Miles, an enterprise-facing reinforcement learning framework designed for large-scale MoE training and production workloads. This introductory post is the first in a series of tech blogs.

Miles is forked from slime, the lightweight RL framework that has quietly powered many of today’s post-training pipelines and large MoE training runs. Building on slime’s foundation, Miles aims to deliver a smooth and controllable RL experience for teams that need reliability and scale in real-world deployments.

The Miles repository is available on GitHub: https://github.com/radixark/miles

## 🧠 Starting Point: slime - A Lightweight and Customizable RL Framework

Every mile of progress begins with one well-placed step, and for us that step is slime. A very lightweight and customizable RL framework, slime has grown popular across the community. It has also been battle-tested in large MoE training, where it was used to train GLM-4.6. slime comes with a few elegant design principles:

### Natively performant

slime offers native, structured support for SGLang and Megatron's full optimization stack, keeping pace with the fast evolution of inference and training frameworks.

### Clear, clean modularity

Its key components—Algorithm / Data / Rollout / Eval—are fully decoupled, letting users plug in new agent types, reward functions, or sampling strategies with minimal code changes.
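
To illustrate the kind of decoupling this enables, here is a minimal sketch of swapping in a custom reward function. The `RewardFn` protocol, `rollout_step`, and `generate` are hypothetical illustrations of the pattern, not slime's actual API.

```python
from typing import Protocol


class RewardFn(Protocol):
    """Hypothetical plug-in point: score one rollout sample."""

    def __call__(self, prompt: str, response: str) -> float: ...


def length_penalized_reward(prompt: str, response: str) -> float:
    """Example custom reward: a toy correctness check minus a length penalty."""
    base = 1.0 if "</answer>" in response else 0.0  # toy correctness proxy
    return base - 0.001 * len(response)             # discourage rambling


def rollout_step(prompts: list[str], generate, reward_fn: RewardFn) -> list[dict]:
    """Generate responses and attach rewards; the trainer consumes the output."""
    samples = []
    for prompt in prompts:
        response = generate(prompt)                 # any inference backend
        samples.append(
            {"prompt": prompt, "response": response, "reward": reward_fn(prompt, response)}
        )
    return samples
```

Because the reward is just a callable handed to the rollout stage, trying a new one touches a single argument; the data, rollout, and algorithm components stay unchanged.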

### Model scientist-friendly

Every abstraction is readable and designed to be hackable. Algorithm researchers can modify importance sampling, rollout logic, or loss dynamics without touching low-level code. Inference-only and training-only debugging modes are provided for fast diagnosis of failing runs.
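
As one concrete example of the kind of change this design invites, here is a minimal sketch of a PPO-style clipped importance-sampling loss; the function is illustrative, not slime's actual implementation.

```python
import torch


def policy_loss(logp_new: torch.Tensor,
                logp_old: torch.Tensor,
                advantages: torch.Tensor,
                clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped importance-sampling loss (illustrative).

    A researcher could swap the token-level ratio for a sequence-level
    one, or change the clipping rule, by editing this one function.
    """
    ratio = torch.exp(logp_new - logp_old)          # importance weights
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```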

### Community-first

slime evolved through real-world feedback from the LMSYS and SGLang communities. It embodies what open collaboration across research and engineering can achieve.

## ⚙️ Momentum On the Way: What Has Recently Been Implemented

Miles builds on slime but focuses on new hardware (e.g., GB300), large-scale MoE RL, and production-grade stability. The following features have been added recently (we have also upstreamed most of them to slime):

### True On-Policy

Beyond the existing determinism feature, which makes runs bitwise identical and repeatable, we further support [true on-policy](https://github.com/THUDM/slime/tree/main/examples/true_on_policy) via an infrastructure-level approach.

- The mismatch between training and inference is reduced to exactly zero.
- To implement it, we use Flash Attention 3, DeepGEMM, batch-invariant kernels from Thinking Machines Lab, and torch.compile. We also align numeric operation details between training and inference. A verification sketch follows the figure below.

<img src="https://raw.githubusercontent.com/THUDM/slime/refs/heads/main/examples/true_on_policy/src/train_rollout_abs_diff.png" alt="Train vs. rollout absolute difference" style="display:block; margin: auto; width: 60%" />
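
To make the zero-mismatch claim concrete, the check below sketches how one might verify it: recompute the log-probabilities of the sampled rollout tokens with the training engine, then assert they match the values the inference engine reported, bitwise. The tensor names are assumptions for illustration; the linked example above shows the real setup.

```python
import torch


def assert_true_on_policy(train_logprobs: torch.Tensor,
                          rollout_logprobs: torch.Tensor) -> None:
    """Verify bitwise training/inference agreement on sampled tokens.

    `train_logprobs` are the rollout tokens' log-probs recomputed by the
    training engine; `rollout_logprobs` are the values recorded during
    sampling. True on-policy means the difference is exactly zero,
    not merely small.
    """
    diff = (train_logprobs - rollout_logprobs).abs().max().item()
    assert diff == 0.0, f"train/rollout logprob mismatch: {diff:.3e}"
```
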
### Memory Improvements

To fully utilize precious GPU memory for maximum performance without hitting OOM errors, we made updates such as the following:

- Propagate errors cleanly when a benign OOM occurs, and reserve a memory margin to prevent OOMs caused by NCCL buffers (see the sketch after this list).
- Fix excessive memory usage and OOMs in FSDP.
- Support move-based and partial offloading.
- Reduce peak host memory usage.
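
As one illustration of the memory-margin and benign-OOM ideas, the sketch below reserves a slack buffer before a step and falls back to splitting a batch when an OOM does occur. The 2 GiB margin and the helper names are assumptions for illustration, not Miles' actual implementation.

```python
import torch


def fits_with_margin(step_bytes: int, margin_bytes: int = 2 << 30) -> bool:
    """Check that a step fits in GPU memory with a safety margin.

    Reserving slack (an assumed 2 GiB here) leaves room for NCCL buffers
    and allocator fragmentation, which otherwise surface as OOMs that
    are hard to attribute.
    """
    free_bytes, _total = torch.cuda.mem_get_info()
    return step_bytes + margin_bytes <= free_bytes


def run_step_with_oom_fallback(step_fn, batch, split_fn):
    """Treat an OOM on one batch as benign: split and retry, don't crash."""
    try:
        return step_fn(batch)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()                    # release cached blocks
        left, right = split_fn(batch)               # retry at half the size
        return step_fn(left) + step_fn(right)
```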

### Speculative Training

In RL, freezing the draft model prevents it from following the target model's evolving policy, which reduces the acceptance length and degrades speedup, so we perform online SFT on the draft model throughout RL.

- Achieves a 25%+ rollout speedup vs. a frozen MTP draft, especially in the late stage of training.
- Supports MTP with sequence packing + CP, loss masks with proper edge-case handling, LM head/embedding gradient isolation, and Megatron↔SGLang weight syncing. A sketch of the online-SFT step follows the figure below.

<img src="https://raw.githubusercontent.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/refs/heads/main/rlhf/slime/spec/pic/overall-throughput.png" alt="Overall throughput" style="display:block; margin: auto; width: 60%" />
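
A minimal sketch of the online-SFT idea: after each RL update, fit the draft model to the tokens the freshly updated target policy actually sampled, so the acceptance length does not decay. The model and optimizer handles are hypothetical placeholders, and the loss shown is one common choice rather than Miles' exact recipe.

```python
import torch
import torch.nn.functional as F


def draft_online_sft_step(draft_model, input_ids, labels, loss_mask, optimizer):
    """One online-SFT step for the draft (MTP) model during RL.

    `labels` are the rollout tokens shifted by one position; `loss_mask`
    is 1 on response tokens and 0 on prompt/padding, so only generated
    tokens contribute. In practice, gradients into shared embeddings and
    the LM head would be isolated so the target model is unaffected.
    """
    logits = draft_model(input_ids)                 # [B, T, V], hypothetical call
    ce = F.cross_entropy(
        logits.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels)                               # per-token loss, [B, T]
    loss = (ce * loss_mask).sum() / loss_mask.sum().clamp(min=1)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```

Because the draft keeps tracking the target policy, the speculative accept rate stays high late in training, which is where the reported 25%+ rollout speedup shows up.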

### Miscellaneous Updates

We have also enhanced the FSDP training backend; made the rollout subsystem deployable independently, outside the framework; added debugging utilities such as more metrics, post-hoc analyzers, and improved profilers; and gradually refactored the code for further polish. A formal mathematics (Lean) example is provided with SFT/RL scripts.

## 🚧 Towards the Future: Our Roadmap

For the future development of Miles, we will invest more effort in supporting enterprise-grade RL training. This includes:

- Large-scale MoE RL examples on new hardware, e.g., GB300
- Multi-modal training
- Rollout acceleration
- Compatibility with SGLang spec v2 for better performance
- Advanced speculative training support, such as EAGLE3 and multiple spec layers
- Resource allocation that balances training and serving in large-scale async training
- Elasticity to GPU failures

## 🤝 Thanks to Our Community

Miles exists thanks to the slime authors and the broader (SGLang) RL community.

We invite researchers, startups, and enterprise teams alike to explore slime and Miles - whichever best fits your environment - and to join us in making reinforcement learning efficient and reliable. We will listen to the community and keep working actively on Miles' future development, toward a production-ready training environment.
