|
| 1 | +# Parallelism: GPU topology (TP / DP / PP / EP) and concurrency (`parallelism` / `--max-num-seqs`) |
| 2 | + |
| 3 | +Two layers of decisions, made in order: |
| 4 | + |
| 5 | +1. **GPU topology** — how the model is laid out across the GPUs you have |
| 6 | + (`--tensor-parallel-size`, `--data-parallel-size`, `--pipeline-parallel-size`, |
| 7 | + `--enable-expert-parallel`). Decide this first; it determines how many |
| 8 | + independent replicas exist. |
| 9 | +2. **Concurrency** — how many requests are in flight (`parallelism`) and how many |
| 10 | + sequences each replica decodes at once (`--max-num-seqs`). Sized on top of the |
| 11 | + topology. |
| 12 | + |
| 13 | +Both layers affect **throughput only** — never scores. |
| 14 | + |
| 15 | +--- |
| 16 | + |
| 17 | +## Layer 1 — GPU topology: TP / DP / PP |
| 18 | + |
| 19 | +| Dim | What it shards | What it buys | Cost | |
| 20 | +| --- | --- | --- | --- | |
| 21 | +| **TP** (tensor parallel) | Each layer's weights + KV across GPUs **within one replica** | Lets a model that doesn't fit on one GPU run; splits KV so longer context fits | All-reduce **every layer** → latency + needs fast interconnect (NVLink). Keep **within one node**. | |
| 22 | +| **DP** (data parallel) | Nothing — **replicates** the (TP-sharded) model | Throughput: N independent replicas serve N× the concurrent requests | N× the weight memory (one full copy per replica) | |
| 23 | +| **PP** (pipeline parallel) | Contiguous **layer ranges** across GPUs | Fits a model too big for intra-node TP; cheaper cross-node than TP | Pipeline bubbles → lower utilization. Mostly for very large / multi-node (see `multi-node.md`). | |
| 24 | + |
| 25 | +**Decision procedure (single node, G GPUs):** |
| 26 | + |
| 27 | +1. **TP = the smallest value that makes the model fit with KV headroom.** |
| 28 | + Estimate weight memory ≈ `params × bytes_per_param`: |
| 29 | + - NVFP4 ≈ 0.5–0.6 B/param (incl. scales) · FP8 ≈ 1 B · BF16/FP16 ≈ 2 B. |
| 30 | + - Need `weights/TP + KV cache + activations + CUDA-graph/overhead` to fit in |
| 31 | + `GPU_mem × gpu_memory_utilization`. If it fits on one GPU → **TP = 1**. |
| 32 | + - Constraints: TP must **divide `num_attention_heads`** (and ideally |
| 33 | + `num_key_value_heads` for GQA, else KV heads get replicated and waste memory); |
| 34 | + use a **power of 2**; never span nodes with TP. |
| 35 | + - Smaller TP = less communication = higher efficiency. Don't over-shard "to be |
| 36 | + safe" — it slows decode. |
| 37 | +2. **DP = floor(G / (TP × PP))** — use the leftover GPUs as replicas. For |
| 38 | + throughput-bound evals (the common case), **maximize DP**: a model that fits on |
| 39 | + one GPU should run `TP=1, DP=G`, not `TP=G, DP=1`. |
| 40 | +3. **PP only if** the model can't fit even at the largest sensible intra-node TP, |
| 41 | + or you're going multi-node — then see `multi-node.md`. |
| 42 | + |
| 43 | +DP is the lever that grows **serving capacity** (Layer 2): more replicas → more |
| 44 | +concurrent sequences. |
| 45 | + |
| 46 | +> **Gotcha — bit-width sets the topology, not the model name.** The weight estimate |
| 47 | +> hinges on `bytes_per_param`, so **read the actual precision from `config.json` |
| 48 | +> (`quantization_config` / `quant_algo` / dtype) before sizing** — do not infer it |
| 49 | +> from the org/handle. Two checkpoints of the *same architecture at the same |
| 50 | +> bit-width* have the same footprint → the same TP/DP/EP, regardless of vendor or |
| 51 | +> quant scheme (INT4 vs NVFP4 differ only in kernel/quant-method flags, which vLLM |
| 52 | +> auto-detects — and a negligible effective-bit difference). The split only changes |
| 53 | +> when the bit-width changes the *size* (see the Kimi example below). |
| 54 | +
|
| 55 | +### Choosing the TP/DP split (when more than one layout fits) |
| 56 | + |
| 57 | +A fixed GPU count usually admits several valid splits — on 8 GPUs a MoE could run |
| 58 | +any factorization with `TP×DP=8`: `TP=1/DP=8`, `TP=2/DP=4`, `TP=4/DP=2`, or |
| 59 | +`TP=8/DP=1` (all with EP=8, see Layer 1b). They are **not** equally good. Default to |
| 60 | +**smallest TP, largest DP**, because: |
| 61 | + |
| 62 | +- **DP scales throughput ~linearly with no extra communication** — attention lanes |
| 63 | + are independent; only the MoE all-to-all couples ranks, and that's EP=`TP×DP` |
| 64 | + regardless of the split, so it's identical across them. |
| 65 | +- **TP adds an all-reduce on every attention layer** and scales sublinearly. Each |
| 66 | + step up in TP buys KV/weight room at an efficiency cost. |
| 67 | + |
| 68 | +Raise TP above 1 **only** to relieve a memory constraint DP cannot fix: |
| 69 | + |
| 70 | +1. **A single request's KV won't fit one replica's pool.** A request runs entirely |
| 71 | + on one replica, so its longest (context + generation) KV must fit in that |
| 72 | + replica's free HBM: TP=1 → one GPU's HBM, TP=2 → two GPUs' (KV is TP-sharded). |
| 73 | + Long-context tasks (AA-LCR ~120K, full 262K) on memory-tight models force this. |
| 74 | +2. **You need more concurrent seqs per replica than one GPU's KV allows** — i.e. |
| 75 | + you see preemption on TP=1 at your target per-replica `max-num-seqs`. TP=2 |
| 76 | + doubles the KV blocks per replica. |
| 77 | +3. **Weights don't fit one GPU** even after EP-sharding (dense models, or a MoE |
| 78 | + whose replicated attention + expert shard exceeds one GPU). |
| 79 | + |
| 80 | +If none bite, higher TP just wastes the extra KV and gives up replicas → net slower. |
| 81 | +**How to know which wins:** deploy the candidate and read the startup line |
| 82 | +`Maximum concurrency for <max-model-len> tokens per request: X.XX×`; if it's |
| 83 | +comfortably above `parallelism / DP` with zero preemption in the canary, the smaller |
| 84 | +TP wins. Step up TP only when the canary proves TP=1 is KV-bound. |
| 85 | + |
| 86 | +--- |
| 87 | + |
| 88 | +## Layer 1b — Expert parallelism (EP), MoE only |
| 89 | + |
| 90 | +`--enable-expert-parallel` changes how the **MoE expert (FFN) layers** are parallelized: |
| 91 | + |
| 92 | +- **Off (default):** every expert is tensor-sharded across the TP ranks (each rank |
| 93 | + holds a slice of *all* experts). |
| 94 | +- **On:** whole experts are **partitioned across ranks** (each rank owns a subset of |
| 95 | + experts). Less per-FFN communication and better expert batching — the efficient |
| 96 | + choice for many-expert MoE. |
| 97 | + |
| 98 | +### EP size is derived, not set |
| 99 | + |
| 100 | +`--enable-expert-parallel` is a **boolean** — there is **no `--expert-parallel-size`** |
| 101 | +in vLLM. The EP degree is always: |
| 102 | + |
| 103 | +```text |
| 104 | +EP = tensor_parallel_size × data_parallel_size (the full world size) |
| 105 | +``` |
| 106 | + |
| 107 | +So **EP equals TP only when DP=1.** On a fixed 8-GPU node every fitting split gives |
| 108 | +EP=8 — you don't tune EP, you tune the TP/DP split, which only changes the |
| 109 | +*attention* side: |
| 110 | + |
| 111 | +| Layout (8 GPUs) | EP | Attention | Best when | |
| 112 | +| --- | :--: | --- | --- | |
| 113 | +| `TP=1 DP=8 --enable-expert-parallel` | 8 | 8 replicas, comm-free | throughput; one request's KV fits 1 GPU (**default**) | |
| 114 | +| `TP=2 DP=4 --enable-expert-parallel` | 8 | 4 replicas, TP=2 | need ~2× per-replica KV pool (long ctx) | |
| 115 | +| `TP=4 DP=2 --enable-expert-parallel` | 8 | 2 replicas, TP=4 | ~4× per-replica KV pool, or weights too big for TP≤2, but still want >1 replica | |
| 116 | +| `TP=8 DP=1 --enable-expert-parallel` | 8 | 1 replica, TP=8 | trillion-scale weights / one huge KV pool | |
| 117 | + |
| 118 | +Going down the table trades replicas (throughput) for a bigger per-replica KV pool |
| 119 | +and more weight-fit room; the all-reduce cost rises with TP. Pick the **topmost row |
| 120 | +that satisfies the memory constraints** (the TP-up triggers above). |
| 121 | + |
| 122 | +### How DP-attention connects to the experts (the dataflow) |
| 123 | + |
| 124 | +The DP group and the EP group are the **same physical GPUs** — rank `r` is both DP |
| 125 | +lane `r` (a full attention replica, since TP=1) *and* the owner of expert-shard `r`. |
| 126 | +Per MoE decoder layer: |
| 127 | + |
| 128 | +1. **Attention** runs DP-local — each rank on its own tokens + its own KV, **no |
| 129 | + cross-rank comm**. |
| 130 | +2. **Router** picks top-k experts per token; those experts may live on any rank. |
| 131 | +3. **Dispatch all-to-all** sends each token's hidden vector to the rank that owns its |
| 132 | + expert(s). *This is the only coupling between the DP lanes.* |
| 133 | +4. **Experts** compute locally on the tokens they received (gathered from all ranks). |
| 134 | +5. **Combine all-to-all** returns outputs to each token's home rank → top-k weighted |
| 135 | + sum → next layer. |
| 136 | + |
| 137 | +Consequences: |
| 138 | + |
| 139 | +- **Comm profile differs from TP.** TP = all-reduce *every* layer (incl. attention); |
| 140 | + DP+EP = all-to-all *only at MoE layers*, attention is comm-free. Keep the EP |
| 141 | + all-to-all **intra-node (NVLink)** — cross-node EP is far slower (see `multi-node.md`). |
| 142 | +- **Experts still see a global batch** (tokens gathered from all DP lanes) → better |
| 143 | + utilization than TP-sharded experts. |
| 144 | +- **Routing is data-dependent → load can be uneven** across ranks; vLLM makes idle |
| 145 | + ranks run dummy forward passes while any rank is busy, so DP+EP works best with load |
| 146 | + spread evenly across replicas (normal under steady eval concurrency). |
| 147 | + |
| 148 | +### When to enable |
| 149 | + |
| 150 | +- **Yes — any MoE**, especially large / many-expert (DeepSeek, large Qwen MoE, GLM |
| 151 | + MoE). EP powers the standard high-throughput **DP-attention + EP-MoE** layout. |
| 152 | +- **No — dense models** (no experts). Also a **no-op at `TP=DP=1`** (nothing to |
| 153 | + distribute), so it's safe-but-pointless on a single GPU. |
| 154 | + |
| 155 | +**Detecting MoE:** handle suffix encoding active params (`-A10B`, `-A3B`, `-A22B`), |
| 156 | +`num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or the |
| 157 | +card. (`-A10B` etc. = *active* params of an MoE — a strong MoE signal.) |
| 158 | + |
| 159 | +> Cross-check `recipes.vllm.ai` for the family's validated TP/DP/EP layout and GPU |
| 160 | +> count, then adapt to your GPUs with the fit math (e.g. recipe TP=2 on 2×H200 → on |
| 161 | +> an 8-GPU node, TP=2/DP=4). |
| 162 | +
|
| 163 | +--- |
| 164 | + |
| 165 | +## Layer 2 — Concurrency knobs and how they relate |
| 166 | + |
| 167 | +- **`parallelism`** — requests the eval **client** keeps in flight *per benchmark*. |
| 168 | + Continuous batching holds `parallelism` open at all times, dispatching a new one |
| 169 | + the instant another finishes (a sliding window, not discrete "waves"). |
| 170 | +- **`--max-num-seqs`** — sequences a single vLLM **replica** decodes concurrently. |
| 171 | + Total server capacity: |
| 172 | + |
| 173 | + ```text |
| 174 | + serving_capacity = max-num-seqs × data_parallel_size × num_instances |
| 175 | + ``` |
| 176 | + |
| 177 | + (TP and PP shard *one* replica, so they don't add capacity; replicas = DP, times |
| 178 | + `num_instances` for HAProxy multi-instance — see `multi-node.md`.) |
| 179 | + |
| 180 | +Keep them matched: **`max-num-seqs = ceil(parallelism / (DP × num_instances))`**. |
| 181 | +If `parallelism` exceeds `serving_capacity`, the surplus just queues in vLLM — no |
| 182 | +speedup, and deep queues can trip `request_timeout`. |
| 183 | + |
| 184 | +## The binding constraint flips with run size |
| 185 | + |
| 186 | +`parallelism` is useful only up to the smaller of two ceilings: |
| 187 | + |
| 188 | +1. **Total requests for the task** = `dataset_size × repeats` (`repeats` = |
| 189 | + `n_samples` for simple-evals / tau2-bench, `num_repeats` for nemo-skills — see |
| 190 | + `quantization-benchmarks.md`). You can't have more in flight than exist. |
| 191 | +2. **Sustainable serving capacity** = `max-num-seqs × DP × num_instances`, bounded |
| 192 | + by KV-cache memory (below). |
| 193 | + |
| 194 | +| Situation | Set `parallelism` to | Why | |
| 195 | +| --- | --- | --- | |
| 196 | +| `total_requests ≤ serving_capacity` (small run) | `total_requests` (round up a little for uneven DP routing) | All requests dispatch at once → one wave → finishes in ~one generation-time. Higher is wasted. | |
| 197 | +| `total_requests ≫ serving_capacity` (large run) | `serving_capacity` (largest the GPUs sustain) | Throughput-bound: keep every decode slot full until the queue drains. Request count no longer matters. | |
| 198 | + |
| 199 | +So "set it higher" is right **only up to the request count**; past that you just |
| 200 | +over-reserve KV. |
| 201 | + |
| 202 | +## Sizing `--max-num-seqs` against KV cache |
| 203 | + |
| 204 | +For throughput-bound runs `max-num-seqs` is capped by KV memory, driven by |
| 205 | +**context length × concurrent sequences**. High `max_new_tokens` (e.g. 81920) makes |
| 206 | +each sequence's KV large, shrinking the sustainable batch. **Read the ceiling from |
| 207 | +vLLM's startup log** rather than guessing: |
| 208 | + |
| 209 | +- `# GPU blocks: N` — total KV blocks. |
| 210 | +- `Maximum concurrency for <max-model-len> tokens per request: X.XX×` — how many |
| 211 | + full-context sequences fit (more at shorter effective context). |
| 212 | + |
| 213 | +During the canary, watch: |
| 214 | + |
| 215 | +- **Preemption** (`Preempted N requests` / recompute) ⇒ `max-num-seqs` above what |
| 216 | + KV sustains; lower it (preemption wastes work). |
| 217 | +- **`GPU KV cache usage`** well below 100% with zero preemption ⇒ headroom; raise. |
| 218 | + |
| 219 | +Factors that **relax** the KV limit: small / low-precision weights (more HBM for |
| 220 | +KV), **KV-cache quantization** (`kv_cache_scheme` in `config.json`), and |
| 221 | +**hybrid / linear-attention** layers (near-constant state instead of growing KV). |
| 222 | + |
| 223 | +## Diminishing returns |
| 224 | + |
| 225 | +Decode throughput saturates HBM bandwidth at some batch size; beyond that knee, |
| 226 | +more sequences add latency without adding tokens/sec. Goal = **largest batch with |
| 227 | +~zero preemption**, not the max the config accepts. |
| 228 | + |
| 229 | +## Non-GPU caps |
| 230 | + |
| 231 | +- **Judge / user-sim tasks** (HLE, AA-LCR, Tau2-Bench Telecom): `parallelism` is |
| 232 | + often capped by the **judge's rate limit**, not the served model. Start |
| 233 | + conservative; raise only after judge logs are clean. Use a per-task `parallelism` |
| 234 | + override when its ceiling differs (e.g. Tau2 cap 512). |
| 235 | +- **Per-task overrides:** size `--max-num-seqs` off the **max** `parallelism` across |
| 236 | + the top-level and all per-task overrides. |
| 237 | + |
| 238 | +--- |
| 239 | + |
| 240 | +## Worked examples |
| 241 | + |
| 242 | +**Dense 9B NVFP4, 8×B200 (this skill's GPQA run).** Weights ~5–6 GB → fits one GPU |
| 243 | +with huge KV headroom → **TP=1, DP=8, no EP.** Concurrency: GPQA Diamond = 198 |
| 244 | +questions; `n_samples=1` → 198 requests (request-bound) → `parallelism=256`, |
| 245 | +`max-num-seqs=ceil(256/8)=32`. `n_samples=8` → 1,584 requests (capacity-bound) → |
| 246 | +start `parallelism=512` (`max-num-seqs=64`), then tune from vLLM's max-concurrency |
| 247 | +- preemption (toward 768–1024 if KV has headroom). |
| 248 | + |
| 249 | +**Dense ~70B BF16, 8×H100 (80 GB).** ~140 GB weights → won't fit one GPU; TP=2 |
| 250 | +(~70 GB/GPU + KV) fits → **TP=2, DP=4, no EP.** `serving_capacity = max-num-seqs × |
| 251 | +4`. |
| 252 | + |
| 253 | +**Large MoE ~235B-A22B, 8×H200.** MoE (the `-A22B` = active params) → enable EP. |
| 254 | +Throughput layout: **`--data-parallel-size 8 --enable-expert-parallel`** |
| 255 | +(DP-attention + EP-MoE, EP size 8), or TP=8 + EP if a single replica's attention/KV |
| 256 | +needs the full node. Pick per `recipes.vllm.ai` and the fit math. |
| 257 | + |
| 258 | +**Trillion-scale MoE (Kimi-K2-class, ~1T/32B, MLA), 8×B200 — bit-width flips the |
| 259 | +split.** Same architecture, same node; only the precision differs: |
| 260 | + |
| 261 | +- **FP8 (~1040 GB):** weight-bound. Experts alone are ~124 GB/GPU after EP-sharding; |
| 262 | + add replicated non-expert weights and TP=1 overflows 173 GB → **forced to |
| 263 | + `TP=8, DP=1, EP on`** (one replica across the node, ~43 GB/GPU KV — fine for MLA). |
| 264 | +- **4-bit — INT4 or NVFP4 (~520–572 GB):** experts drop to ~62–68 GB/GPU, leaving |
| 265 | + room to replicate attention 8× → **`TP=1, DP=8, EP on`** (8 lanes, max |
| 266 | + throughput); step to `TP=2/DP=4` only if a long-context canary shows KV preemption. |
| 267 | + |
| 268 | +INT4 and NVFP4 here are ~the same size → **same layout** — don't let the differing |
| 269 | +handles (`moonshotai/…` vs `nvidia/…-NVFP4`) suggest otherwise. The only real |
| 270 | +divider is FP8 (weight-bound, TP=8) vs 4-bit (DP-capable, TP=1). This is also why a |
| 271 | +4-bit Kimi that needed `TP=8/DP=1` on a tighter 8×H200/640 GB node can switch to |
| 272 | +`TP=1/DP=8` on the larger 8×B200 node — adapt the layout to the GPUs you actually |
| 273 | +have. |
0 commit comments