Benchmark for the Unsloth + SGLang backend that combines SGLang for inference with Unsloth for MoE training. Uses a **dedicated GPU split** where inference and training run on separate GPUs for zero sleep/wake overhead, with a **persistent training worker** that keeps the model loaded across steps.
---

## Architecture — Dedicated GPU Split (Default)
```
┌────────────────────────────────────────────────────────────────┐
│ 4-GPU Setup (Recommended)                                      │
│                                                                │
│ ┌─ GPUs 0, 2 (TP=2) ────────────┐  ┌─ GPU 1 ──────────────┐    │
│ │ SGLang Server                 │  │ Unsloth Training     │    │
│ │ • Always active (no sleep)    │  │ • Dedicated GPU      │    │
│ │ • TP=2 inference              │  │ • Persistent worker  │    │
│ │                               │  │   (model loaded once)│    │
│ │ ┌──────────┐ ┌────────────┐   │  │ • LoRA + Optimizer   │    │
│ │ │ TP=2     │ │ LoRA       │   │  │ • ART loss function  │    │
│ │ │ Model    │ │ Hot-reload │   │  │                      │    │
│ │ │ Shards   │ │ < 0.1s     │   │  │ GPU 3: idle          │    │
│ │ └──────────┘ └────────────┘   │  └──────────────────────┘    │
│ └───────────────────────────────┘                              │
│                                                                │
│ ✓ No sleep/wake overhead                                       │
│ ✓ SGLang stays active during training                          │
│ ✓ Persistent worker — model loaded once, reused across steps   │
│ ✓ TP must be power of 2 (vocab size constraint)                │
│ ✓ Generation is 70-90% of RL time → more inference GPUs = win  │
└────────────────────────────────────────────────────────────────┘
```
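The persistent-worker box above can be sketched as a long-lived process that pays the model-load cost once and then services training requests from a queue. This is a minimal stand-in with hypothetical names (`training_worker`, a dict in place of the real Unsloth model), not the backend's actual API:

```python
import multiprocessing as mp

def training_worker(requests, results):
    """Persistent worker: load the model once, then reuse it for
    every training step sent over the request queue."""
    model = {"loaded": True, "steps": 0}  # stand-in for the real (slow) model load
    while True:
        batch = requests.get()
        if batch is None:  # sentinel: shut the worker down
            break
        model["steps"] += 1  # stand-in for LoRA forward/backward + optimizer step
        results.put({"step": model["steps"], "batch_size": len(batch)})

if __name__ == "__main__":
    requests, results = mp.Queue(), mp.Queue()
    worker = mp.Process(target=training_worker, args=(requests, results))
    worker.start()
    for batch in ([1, 2], [3, 4, 5]):  # two RL steps, one model load
        requests.put(batch)
        print(results.get())
    requests.put(None)  # tear the worker down at the end of the run
    worker.join()
```

In the real setup the worker would also pin itself to the dedicated training GPU (e.g. via `CUDA_VISIBLE_DEVICES`) before loading the model.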

### Auto-Detected GPU Splits

TP must be a power of 2: tensor parallelism shards the vocabulary dimension, so the vocab size must be evenly divisible by the TP degree. Model vocab sizes like Qwen3's 151936 are divisible by 1, 2, 4, and 8 but NOT by 3.

| GPUs Available | Inference GPUs | TP Size | Training GPU | Mode |
| -------------- | -------------- | ------- | ------------ | ---- |
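The divisibility rule above can be sketched as a small selection helper; `pick_tp_size` is illustrative, not the backend's actual auto-detection code:

```python
def pick_tp_size(inference_gpus: int, vocab_size: int) -> int:
    """Largest power-of-2 TP degree that both fits the available
    inference GPUs and evenly shards the vocabulary."""
    tp = 1
    while tp * 2 <= inference_gpus and vocab_size % (tp * 2) == 0:
        tp *= 2
    return tp

# Qwen3's vocab (151936) is divisible by 1, 2, 4, and 8, but a TP of 3
# would leave a remainder -- so 3 inference GPUs still yield TP=2.
print(pick_tp_size(2, 151936))  # → 2
print(pick_tp_size(3, 151936))  # → 2
print(pick_tp_size(4, 151936))  # → 4
```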
|