We introduced an **RDMA-based, peer-to-peer weight update** mechanism for RL workloads in SGLang as a supplement to the traditional NCCL broadcast method, compatible with all major open-source models. By using a source-side **CPU engine replica** and **P2P RDMA transfers** via Mooncake TransferEngine, we speed up weight transfer for the 1T-parameter Kimi-K2 by 7x (53 seconds -> 7.2 seconds), at the cost of one additional inference engine replica (32 GB of CPU memory) per training rank. These optimizations minimize redundant network traffic and allow inference servers to resume rollout significantly faster.
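The flow above can be sketched as follows. This is a minimal, simulated illustration of the idea only: the training side snapshots its weights into a CPU replica, and each inference engine pulls the registered buffers point-to-point instead of participating in a collective broadcast. All class and method names here (`CpuReplica`, `P2PTransfer`, `InferenceEngine`) are hypothetical stand-ins, not the actual SGLang or Mooncake TransferEngine APIs.

```python
import copy

class CpuReplica:
    """Hypothetical source-side CPU snapshot of the training weights
    (one replica per training rank, as described above)."""
    def __init__(self, weights):
        # Snapshot into host memory so training can proceed while
        # inference engines pull weights asynchronously.
        self.weights = copy.deepcopy(weights)

class InferenceEngine:
    """Hypothetical destination engine holding the rollout weights."""
    def __init__(self):
        self.weights = {}

class P2PTransfer:
    """Stand-in for an RDMA point-to-point transfer: each engine reads
    registered tensor buffers directly from the source replica, so no
    collective (broadcast) is needed."""
    @staticmethod
    def pull(src_replica, dst_engine):
        for name, buf in src_replica.weights.items():
            # Simulated one-sided read of a registered buffer.
            dst_engine.weights[name] = list(buf)

# Usage sketch: snapshot, then P2P pull per inference engine.
trained = {"layer0.weight": [0.1, 0.2], "layer0.bias": [0.0]}
replica = CpuReplica(trained)       # step 1: CPU engine replica
engine = InferenceEngine()
P2PTransfer.pull(replica, engine)   # step 2: P2P transfer, no broadcast
```

In the real system the "pull" is an RDMA read over registered GPU/host buffers, which is what removes the broadcast bottleneck; the sketch only mirrors the data flow.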