Commit b18dc9d: Update README.md (1 parent: e579e5b)

File tree: 1 file changed (0 additions, 30 deletions)


benchmarks/sglang_benchmarks/README.md

Lines changed: 0 additions & 30 deletions
````diff
@@ -2,36 +2,6 @@

 Benchmark for the Unsloth + SGLang backend that combines SGLang for inference with Unsloth for MoE training. Uses a **dedicated GPU split** where inference and training run on separate GPUs for zero sleep/wake overhead, with a **persistent training worker** that keeps the model loaded across steps.

----
-
-## Architecture — Dedicated GPU Split (Default)
-
-```
-┌──────────────────────────────────────────────────────────────────┐
-│                    4-GPU Setup (Recommended)                     │
-│                                                                  │
-│  ┌─ GPUs 0, 2 (TP=2) ────────────┐  ┌─ GPU 1 ───────────────┐   │
-│  │  SGLang Server                │  │  Unsloth Training     │   │
-│  │  • Always active (no sleep)   │  │  • Dedicated GPU      │   │
-│  │  • TP=2 inference             │  │  • Persistent worker  │   │
-│  │                               │  │    (model loaded once)│   │
-│  │  ┌──────────┐  ┌────────────┐ │  │  • LoRA + Optimizer   │   │
-│  │  │  TP=2    │  │  LoRA      │ │  │  • ART loss function  │   │
-│  │  │  Model   │  │  Hot-reload│ │  │                       │   │
-│  │  │  Shards  │  │  < 0.1s    │ │  │  GPU 3: idle          │   │
-│  │  └──────────┘  └────────────┘ │  └───────────────────────┘   │
-│  └───────────────────────────────┘                              │
-│                                                                  │
-│  ✓ No sleep/wake overhead                                        │
-│  ✓ SGLang stays active during training                           │
-│  ✓ Persistent worker — model loaded once, reused across steps    │
-│  ✓ TP must be power of 2 (vocab size constraint)                 │
-│  ✓ Generation is 70-90% of RL time → more inference GPUs = win   │
-└──────────────────────────────────────────────────────────────────┘
-```
-
-### Auto-Detected GPU Splits
-
 TP must be a power of 2 (model vocab sizes like Qwen3's 151936 are divisible by 1, 2, 4, 8 but NOT 3).

 | GPUs Available | Inference GPUs | TP Size | Training GPU | Mode |
````
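The removed section's auto-detected GPU split follows two stated rules: one GPU is dedicated to training, and the tensor-parallel (TP) size for inference must be a power of 2 that divides the vocab size. As a rough illustration, here is a hypothetical helper (`pick_gpu_split` is not code from this repo, and the actual GPU-index assignment may differ) sketching that selection logic:

```python
def pick_gpu_split(num_gpus: int, vocab_size: int = 151936) -> dict:
    """Sketch of an auto-detected split: reserve one GPU for training,
    give the rest to inference with the largest valid power-of-2 TP.

    TP must be a power of 2 that divides the vocab size (Qwen3's
    151936 is divisible by 1, 2, 4, 8 but not 3), so with 4 GPUs only
    2 of the 3 non-training GPUs are usable for inference.
    """
    if num_gpus < 2:
        # Single GPU: inference and training must share it (sleep/wake mode).
        return {"inference_gpus": [0], "tp": 1, "training_gpu": 0,
                "mode": "shared"}

    inference_count = num_gpus - 1  # one GPU reserved for training
    tp = 1
    # Grow TP by doubling while it still fits and divides the vocab size.
    while tp * 2 <= inference_count and vocab_size % (tp * 2) == 0:
        tp *= 2

    return {
        "inference_gpus": list(range(tp)),  # first `tp` GPUs serve inference
        "training_gpu": num_gpus - 1,       # last GPU runs the training worker
        "tp": tp,
        "mode": "dedicated",
    }
```

With 4 GPUs this yields TP=2 with one dedicated training GPU and one idle GPU, matching the 4-GPU diagram above (which happens to place inference on GPUs 0 and 2 rather than 0 and 1).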

0 commit comments
