| 1 | +# Parallelism: topology (TP/DP/PP/EP) + concurrency (`parallelism` / `--max-num-seqs`) |
| 2 | + |
| 3 | +Two decisions, in order — both affect **throughput only, never scores**: |
| 4 | + |
| 5 | +1. **Topology**: how the model is laid out across GPUs (sets the replica count). |
| 6 | +2. **Concurrency**: requests in flight overall (`parallelism`) and per replica |
| 7 | + (`--max-num-seqs`), sized on top of the topology. |
| 8 | + |
| 9 | +## Layer 1 — topology (TP / DP / PP) |
| 10 | + |
| 11 | +- **TP** shards each layer (weights+KV) within one replica → fits a model too big for one GPU / |
| 12 | + splits KV for long context; costs an all-reduce **every layer** (keep intra-node). |
| 13 | +- **DP** replicates the model → N independent replicas = N× concurrency; N× weight memory. |
| 14 | +- **PP** shards layer ranges → very large / multi-node; pipeline bubbles. See `multi-node.md`. |
| 15 | + |
| 16 | +**Decide (single node, G GPUs):** |
| 17 | + |
| 18 | +1. **TP = smallest that fits** with KV headroom. Weights ≈ `params × bytes/param` |
| 19 | + (NVFP4 ≈0.5–0.6, FP8 ≈1, BF16 ≈2); need |
| 20 | + `weights/TP + KV + activations + overhead < GPU_mem × util`. Fits on one GPU → TP=1. |
| 21 | + TP must divide `num_attention_heads` (ideally `num_key_value_heads`), be a power of |
| 22 | + 2, and never cross nodes. |
| 23 | +2. **DP = floor(G / (TP×PP))** — maximize for throughput (a 1-GPU-fit model runs |
| 24 | + `TP=1,DP=G`, not `TP=G,DP=1`). |
| 25 | +3. **PP** only if it won't fit at max intra-node TP, or multi-node. |
| 26 | + |
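The decision steps above can be sketched in Python. This is a rough estimate under stated assumptions, not vLLM's own memory accounting: `pick_topology`, the single `kv_headroom_gb` reserve (standing in for KV + activations + overhead), the `util=0.9` default, and `num_heads=64` are all illustrative names and values, and PP is assumed to be 1.

```python
def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB: params (billions) x bytes per parameter."""
    return params_b * bytes_per_param


def pick_topology(params_b, bytes_per_param, gpus, gpu_mem_gb,
                  util=0.9, kv_headroom_gb=20.0, num_heads=64):
    """Return (TP, DP) using 'smallest TP that fits, then maximize DP'.

    kv_headroom_gb lumps KV + activations + overhead into one reserve
    (an assumption for illustration; measure the real value at startup).
    """
    total = weights_gb(params_b, bytes_per_param)
    tp = 1
    while tp <= gpus:
        fits = total / tp + kv_headroom_gb < gpu_mem_gb * util
        if fits and num_heads % tp == 0:
            return tp, gpus // tp  # PP assumed to be 1 on a single node
        tp *= 2  # power-of-two candidates only
    return None  # does not fit at max intra-node TP: consider PP / multi-node
```

The result is sensitive to the reserve you insist on: for a 70B BF16 model on 8×80 GB, a 20 GB reserve yields `(4, 2)` while a 1 GB reserve yields `(2, 4)`.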
| 27 | +> **Gotcha — bit-width sets the topology, not the model name.** Read precision from |
| 28 | +> `config.json` (`quantization_config`/`quant_algo`/dtype); don't infer from the |
| 29 | +> handle. Same arch + same bit-width → same TP/DP/EP regardless of vendor (INT4 vs |
| 30 | +> NVFP4 differ only in auto-detected kernel flags). The split changes only when |
| 31 | +> bit-width changes *size*. |
| 32 | +
|
| 33 | +**Choosing the TP/DP split** (e.g. on 8 GPUs: `1/8`, `2/4`, `4/2`, `8/1`, all EP=8): |
| 34 | +default **smallest TP, largest DP** — DP scales throughput ~linearly with no extra |
| 35 | +comm; TP adds an all-reduce per attention layer. Raise TP **only** to relieve memory |
| 36 | +DP can't: |
| 37 | + |
| 38 | +1. a single request's KV won't fit one replica's HBM (long context — AA-LCR ~120K / 262K); |
| 39 | +2. preemption at your target per-replica `--max-num-seqs` (TP=2 roughly doubles per-replica KV); |
| 40 | +3. weights don't fit one GPU even after EP-sharding. |
| 41 | + |
| 42 | +Otherwise higher TP buys KV you don't need and gives up replicas. **Verify:** if vLLM's startup |
| 43 | +`Maximum concurrency for <max-model-len> tokens` is ≳ `parallelism/DP` and the canary shows no |
| 44 | +preemption, the smaller TP wins. |
| 45 | + |
| 46 | +## Layer 1b — Expert parallelism (EP), MoE only |
| 47 | + |
| 48 | +`--enable-expert-parallel` is a **boolean** (no `--expert-parallel-size`); experts |
| 49 | +are partitioned across the whole world size: |
| 50 | + |
| 51 | +```text |
| 52 | +EP = tensor_parallel_size × data_parallel_size (EP = TP only when DP=1) |
| 53 | +``` |
| 54 | + |
| 55 | +So on a fixed node you don't tune EP — you tune the TP/DP split, which only changes |
| 56 | +the *attention* side: |
| 57 | + |
| 58 | +| Layout (8 GPUs, all EP=8) | Attention | Best when | |
| 59 | +| --- | --- | --- | |
| 60 | +| `TP=1 DP=8` | 8 replicas, comm-free | **default** — one request's KV fits 1 GPU | |
| 61 | +| `TP=2 DP=4` | 4 replicas | need ~2× per-replica KV (long ctx) | |
| 62 | +| `TP=4 DP=2` | 2 replicas | ~4× per-replica KV, or weights too big for TP≤2 | |
| 63 | +| `TP=8 DP=1` | 1 replica | trillion-scale weights / one huge KV pool | |
| 64 | + |
| 65 | +Down the table = more per-replica KV/weight room, fewer replicas, higher all-reduce |
| 66 | +cost; pick the **topmost row that fits**. |
| 67 | + |
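The rows of the table above can be generated rather than memorized. A minimal sketch, assuming power-of-two TP values and a single node; `ep_layouts` is an illustrative helper, not a vLLM API:

```python
def ep_layouts(gpus: int):
    """List (TP, DP, EP) splits for one node with expert parallelism enabled.

    EP is not tuned separately: it equals TP x DP, i.e. the whole world size.
    """
    layouts = []
    tp = 1
    while tp <= gpus:
        if gpus % tp == 0:
            dp = gpus // tp
            layouts.append((tp, dp, tp * dp))
        tp *= 2  # power-of-two TP candidates only
    return layouts
```

The list is ordered like the table (smallest TP first), so the first entry whose per-replica memory fits is the one to pick.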
| 68 | +**Dataflow (DP-attention + EP-MoE):** the DP and EP groups are the **same GPUs**. |
| 69 | +Attention is DP-local (no cross-rank comm); each MoE layer does a dispatch+combine |
| 70 | +**all-to-all** to route tokens to the rank owning their expert. So comm is all-to-all |
| 71 | +*only at MoE layers* (vs TP's per-layer all-reduce) — keep it **intra-node (NVLink)**. |
| 72 | +Data-dependent routing → uneven load; vLLM runs dummy passes on idle ranks, so spread |
| 73 | +load evenly. |
| 74 | + |
| 75 | +**Enable for any MoE** (detect via an `-A10B`/`-A3B`/`-A22B` handle suffix, or `num_experts` / |
| 76 | +`n_routed_experts` in `config.json`); **not for dense**; no-op at `TP=DP=1`. |
| 77 | +Cross-check `recipes.vllm.ai` for the validated layout, then adapt to your GPU count |
| 78 | +via the fit math. |
| 79 | + |
| 80 | +## Layer 2 — concurrency (`parallelism` / `--max-num-seqs`) |
| 81 | + |
| 82 | +- **`parallelism`** = requests the client keeps in flight *per benchmark*. |
| 83 | +- **`--max-num-seqs`** = sequences one replica decodes at once. |
| 84 | + |
| 85 | +```text |
| 86 | +serving_capacity = max-num-seqs × DP × num_instances |
| 87 | +max-num-seqs = ceil(parallelism / (DP × num_instances)) # keep matched |
| 88 | +``` |
| 89 | + |
| 90 | +(TP/PP don't add capacity; replicas = DP, × `num_instances` for HAProxy — see |
| 91 | +`multi-node.md`.) `parallelism` above capacity just queues in vLLM (and risks |
| 92 | +`request_timeout`). |
| 93 | + |
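The two formulas above translate directly into code. A small sketch; the function names are illustrative, and `num_instances` defaults to 1 for the single-deployment case:

```python
import math


def max_num_seqs(parallelism: int, dp: int, num_instances: int = 1) -> int:
    """Per-replica --max-num-seqs matched to the client's parallelism."""
    return math.ceil(parallelism / (dp * num_instances))


def serving_capacity(seqs_per_replica: int, dp: int, num_instances: int = 1) -> int:
    """Requests the whole deployment decodes at once (TP/PP add nothing)."""
    return seqs_per_replica * dp * num_instances
```

For example, `max_num_seqs(256, 8)` gives 32, and `serving_capacity(32, 8)` gives back 256, so client and server stay matched.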
| 94 | +**`parallelism` ceiling = the smaller of:** |
| 95 | + |
| 96 | +1. **total requests** = `dataset_size × repeats` (`n_samples` for simple-evals/tau2, |
| 97 | + `num_repeats` for nemo-skills) — can't have more in flight than exist; |
| 98 | +2. **preemption-free capacity at the task's context** (KV-bound; below). |
| 99 | + |
| 100 | +| Run | Set `parallelism` to | |
| 101 | +| --- | --- | |
| 102 | +| `total_requests ≤ capacity` (small) | `total_requests` (round up for uneven DP routing) → one wave | |
| 103 | +| `total_requests ≫ capacity` (large) | the **preemption-free** capacity at the task's context (often *below* nominal) | |
| 104 | + |
| 105 | +**Sizing `--max-num-seqs` vs KV:** KV use grows as `context × concurrent seqs`, so a high |
| 106 | +`max_new_tokens` shrinks the batch that fits. Read vLLM startup `# GPU blocks` / |
| 107 | +`Maximum concurrency for <max-model-len> tokens` (full-length floor — you fit more at |
| 108 | +shorter context). Canary: `Preempted N` → lower; KV usage ≪100% with no preemption → |
| 109 | +raise. **Relaxed by:** low-precision weights; **KV-cache quantization** — checkpoint |
| 110 | +`kv_cache_scheme` **or serve-time `--kv-cache-dtype fp8`** (`fp8_e4m3`/`fp8_e5m2`) in |
| 111 | +`deployment.command`, ~halving KV → ~2× concurrency/context (verify support; small |
| 112 | +accuracy effect); and **hybrid/linear-attention** (near-constant KV). |
| 113 | + |
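To get a feel for the KV bound before reading the startup log, a back-of-the-envelope estimate can help. The sketch below uses the standard per-token KV size for MHA/GQA attention (K and V per layer per KV head); it does not apply to MLA or hybrid/linear-attention models, and the model shape and 40 GB per-replica budget are hypothetical numbers chosen for illustration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV bytes one token occupies: K and V for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def seqs_that_fit(kv_budget_bytes, context_tokens, per_token_bytes):
    """Full-context sequences that fit in one replica's KV budget."""
    return kv_budget_bytes // (context_tokens * per_token_bytes)


# Hypothetical GQA model: 80 layers, 8 KV heads, head_dim 128.
bf16 = kv_bytes_per_token(80, 8, 128, 2)   # 16-bit KV cache
fp8 = kv_bytes_per_token(80, 8, 128, 1)    # FP8 KV cache halves the footprint
budget = 40 * 10**9                        # assumed 40 GB of KV per replica
print(seqs_that_fit(budget, 120_000, bf16), seqs_that_fit(budget, 120_000, fp8))
```

Under these assumptions the script prints `1 2`: one 120K-token sequence fits with a 16-bit cache and two with FP8, matching the "~2× concurrency" rule of thumb. Treat the vLLM startup line as the authoritative number.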
| 114 | +## Balanced sizing — bigger is NOT always faster (esp. long context) |
| 115 | + |
| 116 | +Past the KV-fit point throughput doesn't just plateau, it **regresses** — worst for |
| 117 | +long-context / long-output: |
| 118 | + |
| 119 | +1. **Preemption thrash** — over-admitted seqs get preempted; recomputing a ~120K |
| 120 | + prefill is huge wasted work, so a modest preemption-free concurrency finishes *sooner*. |
| 121 | +2. **Prefill/decode contention** — many long prefills split `--max-num-batched-tokens` |
| 122 | + and starve decode. |
| 123 | +3. **Timeout cascade** — too many in-flight → p99 > `request_timeout` → `max_retries` |
| 124 | + resubmissions pile on more load. |
| 125 | + |
| 126 | +Sustainable concurrency is **context-dependent** — a `parallelism` good for GPQA |
| 127 | +(short) thrashes AA-LCR (~120K). **Rule:** target ~**70–80% of the preemption-free |
| 128 | +KV-fit concurrency at the task's working context × DP**; give long-context/long-output |
| 129 | +tasks a **lower per-task override**; canary-tune up only while throughput↑, |
| 130 | +preemption≈0, p99 < `request_timeout`; **err low** for long context (too-small mildly |
| 131 | +underutilizes; too-large is *multiples* slower). |
| 132 | + |
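The sizing rule can be written as a starting-point calculator. This is a sketch, not a tuner: `kv_fit_per_replica` is the preemption-free concurrency you measured at the task's working context, and `headroom=0.75` is simply the midpoint of the 70–80% band (both names are illustrative).

```python
def target_parallelism(total_requests, kv_fit_per_replica, dp, headroom=0.75):
    """Starting parallelism: the smaller of total requests and ~75% of KV-fit x DP."""
    sustainable = int(kv_fit_per_replica * dp * headroom)
    return max(1, min(total_requests, sustainable))
```

With 64 preemption-free sequences per replica and DP=8, a 198-request run stays at 198 (one wave), while a 1584-request run starts at 384; canary-tune upward from there only while preemption stays near zero.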
| 133 | +## Suites — set `parallelism` per task, not per run |
| 134 | + |
| 135 | +Suite tasks hit **different bottlenecks** against one deployment; use a top-level |
| 136 | +default for short model-bound tasks and override the outliers: |
| 137 | + |
| 138 | +| Bottleneck | AA tasks | Cap by | |
| 139 | +| --- | --- | --- | |
| 140 | +| Model / GPU KV (short) | `gpqa_diamond_aa_v3`, `ns_ifbench` | top-level default (preemption-free KV-fit) | |
| 141 | +| Long-context KV (~120K) | `ns_aa_lcr` | **low** override (prefill thrash); MLA models fit far more seqs than GQA |
| 142 | +| Judge / user-sim rate limit | `ns_hle_aa`, `ns_aa_lcr`, `tau2_bench_telecom` | judge endpoint 429s, **not** the model | |
| 143 | +| Sandbox execution | `ns_scicode` | sandbox slots | |
| 144 | + |
| 145 | +- Judge/sandbox tasks bottleneck **before** the model — over-parallelizing yields |
| 146 | + 429s/retries, not speed; cap to the endpoint, tune by *its* errors. |
| 147 | +- `--max-num-seqs = ceil(max parallelism across tasks / (DP × num_instances))` (deployment |
| 148 | + must serve the busiest task) even if long-context tasks run lower. |
| 149 | +- Canary each class (model / judge / sandbox) separately. Endpoint/context-dependent |
| 150 | + tasks (`ns_aa_lcr`, `tau2_bench_telecom`) ship `parallelism: ???` to force a choice. |
| 151 | + |
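Sizing the deployment for a suite then reduces to taking the busiest task. A minimal sketch; the per-task `parallelism` values below are hypothetical placeholders, not recommended settings:

```python
import math


def deployment_max_num_seqs(task_parallelism, dp, num_instances=1):
    """--max-num-seqs sized for the busiest task in the suite."""
    busiest = max(task_parallelism.values())
    return math.ceil(busiest / (dp * num_instances))


# Hypothetical per-task overrides for one suite run.
suite = {"gpqa_diamond_aa_v3": 256, "ns_ifbench": 256, "ns_aa_lcr": 16}
print(deployment_max_num_seqs(suite, dp=8))
```

Here the short tasks set the server-side limit (32 per replica) while `ns_aa_lcr` simply runs well below it.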
| 152 | +## Worked examples (8×B200 unless noted) |
| 153 | + |
| 154 | +- **Dense 9B NVFP4** (~5–6 GB) → **TP=1/DP=8, no EP**. GPQA `n_samples=1` = 198 reqs |
| 155 | + (request-bound) → `parallelism=256`, `max-num-seqs=32`. `n_samples=8` = 1584 |
| 156 | + (capacity-bound) → start 512; tune up only while preemption≈0 (~82K reasoning output |
| 157 | + → knee may be <1024). |
| 158 | +- **Dense ~70B BF16, 8×H100/80GB** (~140 GB) → won't fit 1 GPU → **TP=2/DP=4, no EP** only if ~70 GB of weights per GPU leaves enough KV; otherwise **TP=4/DP=2**. |
| 159 | +- **Large MoE ~235B-A22B** → EP on; layout `DP=8 + EP` (or `TP=8 + EP` if one replica |
| 160 | + needs the full node for KV). |
| 161 | +- **Trillion-scale MoE (Kimi-class ~1T, MLA) — bit-width flips the split:** FP8 |
| 162 | + (~1040 GB) is weight-bound → forced **TP=8/DP=1/EP**; 4-bit INT4/NVFP4 (~520–572 GB) |
| 163 | + frees room → **TP=1/DP=8/EP**. INT4 ≈ NVFP4 → same layout (don't let `moonshotai/…` |
| 164 | + vs `nvidia/…-NVFP4` mislead) — same reason a 4-bit Kimi needing TP=8 on 8×H200/640GB |
| 165 | + switches to TP=1/DP=8 on 8×B200. |