Benchmark for the Unsloth + SGLang backend that combines SGLang for inference with Unsloth for MoE training. Uses a **dedicated GPU split** where inference and training run on separate GPUs for zero sleep/wake overhead, with a **persistent training worker** that keeps the model loaded across steps.
---

## Architecture — Dedicated GPU Split (Default)
```
┌────────────────────────────────────────────────────────────────┐
│ 4-GPU Setup (Recommended)                                      │
│                                                                │
│ ┌─ GPUs 0, 2 (TP=2) ────────────┐  ┌─ GPU 1 ──────────────┐    │
│ │ SGLang Server                 │  │ Unsloth Training     │    │
│ │ • Always active (no sleep)    │  │ • Dedicated GPU      │    │
│ │ • TP=2 inference              │  │ • Persistent worker  │    │
│ │                               │  │   (model loaded once)│    │
│ │ ┌──────────┐ ┌────────────┐   │  │ • LoRA + Optimizer   │    │
│ │ │ TP=2     │ │ LoRA       │   │  │ • ART loss function  │    │
│ │ │ Model    │ │ Hot-reload │   │  │                      │    │
│ │ │ Shards   │ │ < 0.1s     │   │  │ GPU 3: idle          │    │
│ │ └──────────┘ └────────────┘   │  └──────────────────────┘    │
│ └───────────────────────────────┘                              │
│                                                                │
│ ✓ No sleep/wake overhead                                       │
│ ✓ SGLang stays active during training                          │
│ ✓ Persistent worker — model loaded once, reused across steps   │
│ ✓ TP must be power of 2 (vocab size constraint)                │
│ ✓ Generation is 70-90% of RL time → more inference GPUs = win  │
└────────────────────────────────────────────────────────────────┘
```
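The persistent-worker box above can be sketched as a long-lived process that pays the model-load cost once and then services training requests from a queue. This is a minimal stand-in with hypothetical names (`training_worker`, a dict in place of the real Unsloth model), not the backend's actual API:

```python
import multiprocessing as mp

def training_worker(requests, results):
    """Persistent worker: load the model once, then reuse it for
    every training step sent over the request queue."""
    model = {"loaded": True, "steps": 0}  # stand-in for the real (slow) model load
    while True:
        batch = requests.get()
        if batch is None:  # sentinel: shut the worker down
            break
        model["steps"] += 1  # stand-in for LoRA forward/backward + optimizer step
        results.put({"step": model["steps"], "batch_size": len(batch)})

if __name__ == "__main__":
    requests, results = mp.Queue(), mp.Queue()
    worker = mp.Process(target=training_worker, args=(requests, results))
    worker.start()
    for batch in ([1, 2], [3, 4, 5]):  # two RL steps, one model load
        requests.put(batch)
        print(results.get())
    requests.put(None)  # tear the worker down at the end of the run
    worker.join()
```

In the real setup the worker would also pin itself to the dedicated training GPU (e.g. via `CUDA_VISIBLE_DEVICES`) before loading the model.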

### Auto-Detected GPU Splits

TP must be a power of 2: tensor parallelism shards the vocabulary dimension, so the vocab size must be evenly divisible by the TP degree. Model vocab sizes like Qwen3's 151936 are divisible by 1, 2, 4, and 8 but NOT by 3.

| GPUs Available | Inference GPUs | TP Size | Training GPU | Mode |
| -------------- | -------------- | ------- | ------------ | ---- |
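The divisibility rule above can be sketched as a small selection helper; `pick_tp_size` is illustrative, not the backend's actual auto-detection code:

```python
def pick_tp_size(inference_gpus: int, vocab_size: int) -> int:
    """Largest power-of-2 TP degree that both fits the available
    inference GPUs and evenly shards the vocabulary."""
    tp = 1
    while tp * 2 <= inference_gpus and vocab_size % (tp * 2) == 0:
        tp *= 2
    return tp

# Qwen3's vocab (151936) is divisible by 1, 2, 4, and 8, but a TP of 3
# would leave a remainder -- so 3 inference GPUs still yield TP=2.
print(pick_tp_size(2, 151936))  # → 2
print(pick_tp_size(3, 151936))  # → 2
print(pick_tp_size(4, 151936))  # → 4
```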
|