Skip to content

Commit 7a1e110

Browse files
cjluo-nvclaude
andcommitted
Add parallelism sizing reference to evaluation skill
Add references/parallelism.md documenting how to choose deployment topology (TP / DP / PP, expert parallelism) and concurrency (parallelism / --max-num-seqs) for NEL evaluation runs: - GPU topology decision procedure: smallest-TP-then-max-DP, the TP-up triggers, and the TP/DP split tradeoff for a fixed world size. - Expert parallelism: EP = TP x DP (boolean flag, no direct EP size), the DP-attention + EP-MoE dataflow, and when to enable vs not. - Concurrency sizing: request-count vs serving-capacity ceiling, KV-driven --max-num-seqs, empirical tuning from vLLM logs. - Gotcha: bit-width (read from config.json), not the model name, sets the topology; worked examples incl. FP8-vs-4-bit Kimi. Wire SKILL.md to the new reference from the deployment-command, expert-parallel, evaluation-params, Step 4, and canary sections. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
1 parent ed0a4b1 commit 7a1e110

2 files changed

Lines changed: 280 additions & 4 deletions

File tree

.claude/skills/evaluation/SKILL.md

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -138,6 +138,8 @@ deployment:
138138
139139
Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
140140

141+
For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.
142+
141143
**Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
142144

143145
#### vLLM-backend defaults — always include unless the recipe *contradicts*
@@ -146,15 +148,15 @@ Silence is not contradiction. Drop/override only when the recipe sets a differen
146148

147149
- `--max-num-batched-tokens 8192` — caps per-step batched tokens; prevents long-prefill stalls.
148150
- `--enable-chunked-prefill` — interleaves long prefills with decode steps (required for AA-LCR's ~120K input). Modern vLLM defaults this on for many models; set explicitly to avoid drift.
149-
- `--enable-expert-parallel` — **MoE-only default.** Detect MoE from handle suffix (`-A10B`, `-A3B`, etc.), `num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models.
151+
- `--enable-expert-parallel` — **MoE-only default.** Detect MoE from handle suffix (`-A10B`, `-A3B`, etc.), `num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models. See `references/parallelism.md` for what EP does and the DP-attention + EP-MoE throughput pattern.
150152
- `--max-num-seqs N` — **omit at generation time** (top-level `parallelism` is `???`). Add this comment above `command:`:
151153

152154
```text
153155
# After filling in `parallelism` values (top-level + per-task overrides),
154156
# append `--max-num-seqs N` where N = ceil(max_parallelism / data_parallel_size).
155157
```
156158

157-
In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → `ceil(128/8)=16`. Too small → request queuing; too large → wasted KV reservation.
159+
In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → `ceil(128/8)=16`. Too small → request queuing; too large → wasted KV reservation. For how to choose the `parallelism` it derives from, read `references/parallelism.md`.
158160

159161
#### Evaluation params template (top-level params)
160162

@@ -164,7 +166,7 @@ The top-level `nemo_evaluator_config.config.params` must contain **exactly these
164166
nemo_evaluator_config:
165167
config:
166168
params:
167-
parallelism: ??? # Required — ask user in Step 4 (depends on cluster + judge rate limits)
169+
parallelism: ??? # Required — size per references/parallelism.md (bounded by total request count vs GPU serving capacity); ask user in Step 4 if still unclear
168170
request_timeout: 3600
169171
max_retries: 10
170172
max_new_tokens: 65536 # see rule below
@@ -192,6 +194,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
192194
### Step 4 — Fill remaining ??? values
193195

194196
- Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
197+
- **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
195198
- Ask about other defaults they may want to change (partition, walltime, MLflow tags).
196199

197200
**Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
@@ -305,7 +308,7 @@ nel info <id> --logs
305308
ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log"
306309
```
307310

308-
Canary each risky task class separately (judge-scored, code-execution, model-only). Start `parallelism` conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model.
311+
Canary each risky task class separately (judge-scored, code-execution, model-only). Start `parallelism` conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model. For capacity-bound runs, tune `parallelism`/`--max-num-seqs` here against vLLM's reported max concurrency + preemption — see `references/parallelism.md`.
309312

310313
Single-task rerun: `nel run --config <path> -t <task_name>` (combine with `-o ++...limit_samples=10` for canary).
311314

Lines changed: 273 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,273 @@
1+
# Parallelism: GPU topology (TP / DP / PP / EP) and concurrency (`parallelism` / `--max-num-seqs`)
2+
3+
Two layers of decisions, made in order:
4+
5+
1. **GPU topology** — how the model is laid out across the GPUs you have
6+
(`--tensor-parallel-size`, `--data-parallel-size`, `--pipeline-parallel-size`,
7+
`--enable-expert-parallel`). Decide this first; it determines how many
8+
independent replicas exist.
9+
2. **Concurrency** — how many requests are in flight (`parallelism`) and how many
10+
sequences each replica decodes at once (`--max-num-seqs`). Sized on top of the
11+
topology.
12+
13+
Both layers affect **throughput only** — never scores.
14+
15+
---
16+
17+
## Layer 1 — GPU topology: TP / DP / PP
18+
19+
| Dim | What it shards | What it buys | Cost |
20+
| --- | --- | --- | --- |
21+
| **TP** (tensor parallel) | Each layer's weights + KV across GPUs **within one replica** | Lets a model that doesn't fit on one GPU run; splits KV so longer context fits | All-reduce **every layer** → latency + needs fast interconnect (NVLink). Keep **within one node**. |
22+
| **DP** (data parallel) | Nothing — **replicates** the (TP-sharded) model | Throughput: N independent replicas serve N× the concurrent requests | N× the weight memory (one full copy per replica) |
23+
| **PP** (pipeline parallel) | Contiguous **layer ranges** across GPUs | Fits a model too big for intra-node TP; cheaper cross-node than TP | Pipeline bubbles → lower utilization. Mostly for very large / multi-node (see `multi-node.md`). |
24+
25+
**Decision procedure (single node, G GPUs):**
26+
27+
1. **TP = the smallest value that makes the model fit with KV headroom.**
28+
Estimate weight memory ≈ `params × bytes_per_param`:
29+
- NVFP4 ≈ 0.5–0.6 B/param (incl. scales) · FP8 ≈ 1 B · BF16/FP16 ≈ 2 B.
30+
- Need `weights/TP + KV cache + activations + CUDA-graph/overhead` to fit in
31+
`GPU_mem × gpu_memory_utilization`. If it fits on one GPU → **TP = 1**.
32+
- Constraints: TP must **divide `num_attention_heads`** (and ideally
33+
`num_key_value_heads` for GQA, else KV heads get replicated and waste memory);
34+
use a **power of 2**; never span nodes with TP.
35+
- Smaller TP = less communication = higher efficiency. Don't over-shard "to be
36+
safe" — it slows decode.
37+
2. **DP = floor(G / (TP × PP))** — use the leftover GPUs as replicas. For
38+
throughput-bound evals (the common case), **maximize DP**: a model that fits on
39+
one GPU should run `TP=1, DP=G`, not `TP=G, DP=1`.
40+
3. **PP only if** the model can't fit even at the largest sensible intra-node TP,
41+
or you're going multi-node — then see `multi-node.md`.
42+
43+
DP is the lever that grows **serving capacity** (Layer 2): more replicas → more
44+
concurrent sequences.
45+
46+
> **Gotcha — bit-width sets the topology, not the model name.** The weight estimate
47+
> hinges on `bytes_per_param`, so **read the actual precision from `config.json`
48+
> (`quantization_config` / `quant_algo` / dtype) before sizing** — do not infer it
49+
> from the org/handle. Two checkpoints of the *same architecture at the same
50+
> bit-width* have the same footprint → the same TP/DP/EP, regardless of vendor or
51+
> quant scheme (INT4 vs NVFP4 differ only in kernel/quant-method flags, which vLLM
52+
> auto-detects — and a negligible effective-bit difference). The split only changes
53+
> when the bit-width changes the *size* (see the Kimi example below).
54+
55+
### Choosing the TP/DP split (when more than one layout fits)
56+
57+
A fixed GPU count usually admits several valid splits — on 8 GPUs a MoE could run
58+
any factorization with `TP×DP=8`: `TP=1/DP=8`, `TP=2/DP=4`, `TP=4/DP=2`, or
59+
`TP=8/DP=1` (all with EP=8, see Layer 1b). They are **not** equally good. Default to
60+
**smallest TP, largest DP**, because:
61+
62+
- **DP scales throughput ~linearly with no extra communication** — attention lanes
63+
are independent; only the MoE all-to-all couples ranks, and that's EP=`TP×DP`
64+
regardless of the split, so it's identical across them.
65+
- **TP adds an all-reduce on every attention layer** and scales sublinearly. Each
66+
step up in TP buys KV/weight room at an efficiency cost.
67+
68+
Raise TP above 1 **only** to relieve a memory constraint DP cannot fix:
69+
70+
1. **A single request's KV won't fit one replica's pool.** A request runs entirely
71+
on one replica, so its longest (context + generation) KV must fit in that
72+
replica's free HBM: TP=1 → one GPU's HBM, TP=2 → two GPUs' (KV is TP-sharded).
73+
Long-context tasks (AA-LCR ~120K, full 262K) on memory-tight models force this.
74+
2. **You need more concurrent seqs per replica than one GPU's KV allows** — i.e.
75+
you see preemption on TP=1 at your target per-replica `max-num-seqs`. TP=2
76+
doubles the KV blocks per replica.
77+
3. **Weights don't fit one GPU** even after EP-sharding (dense models, or a MoE
78+
whose replicated attention + expert shard exceeds one GPU).
79+
80+
If none bite, higher TP just wastes the extra KV and gives up replicas → net slower.
81+
**How to know which wins:** deploy the candidate and read the startup line
82+
`Maximum concurrency for <max-model-len> tokens per request: X.XX×`; if it's
83+
comfortably above `parallelism / DP` with zero preemption in the canary, the smaller
84+
TP wins. Step up TP only when the canary proves TP=1 is KV-bound.
85+
86+
---
87+
88+
## Layer 1b — Expert parallelism (EP), MoE only
89+
90+
`--enable-expert-parallel` changes how the **MoE expert (FFN) layers** are parallelized:
91+
92+
- **Off (default):** every expert is tensor-sharded across the TP ranks (each rank
93+
holds a slice of *all* experts).
94+
- **On:** whole experts are **partitioned across ranks** (each rank owns a subset of
95+
experts). Less per-FFN communication and better expert batching — the efficient
96+
choice for many-expert MoE.
97+
98+
### EP size is derived, not set
99+
100+
`--enable-expert-parallel` is a **boolean** — there is **no `--expert-parallel-size`**
101+
in vLLM. The EP degree is always:
102+
103+
```text
104+
EP = tensor_parallel_size × data_parallel_size (the full world size)
105+
```
106+
107+
So **EP equals TP only when DP=1.** On a fixed 8-GPU node every fitting split gives
108+
EP=8 — you don't tune EP, you tune the TP/DP split, which only changes the
109+
*attention* side:
110+
111+
| Layout (8 GPUs) | EP | Attention | Best when |
112+
| --- | :--: | --- | --- |
113+
| `TP=1 DP=8 --enable-expert-parallel` | 8 | 8 replicas, comm-free | throughput; one request's KV fits 1 GPU (**default**) |
114+
| `TP=2 DP=4 --enable-expert-parallel` | 8 | 4 replicas, TP=2 | need ~2× per-replica KV pool (long ctx) |
115+
| `TP=4 DP=2 --enable-expert-parallel` | 8 | 2 replicas, TP=4 | ~4× per-replica KV pool, or weights too big for TP≤2, but still want >1 replica |
116+
| `TP=8 DP=1 --enable-expert-parallel` | 8 | 1 replica, TP=8 | trillion-scale weights / one huge KV pool |
117+
118+
Going down the table trades replicas (throughput) for a bigger per-replica KV pool
119+
and more weight-fit room; the all-reduce cost rises with TP. Pick the **topmost row
120+
that satisfies the memory constraints** (the TP-up triggers above).
121+
122+
### How DP-attention connects to the experts (the dataflow)
123+
124+
The DP group and the EP group are the **same physical GPUs** — rank `r` is both DP
125+
lane `r` (a full attention replica, since TP=1) *and* the owner of expert-shard `r`.
126+
Per MoE decoder layer:
127+
128+
1. **Attention** runs DP-local — each rank on its own tokens + its own KV, **no
129+
cross-rank comm**.
130+
2. **Router** picks top-k experts per token; those experts may live on any rank.
131+
3. **Dispatch all-to-all** sends each token's hidden vector to the rank that owns its
132+
expert(s). *This is the only coupling between the DP lanes.*
133+
4. **Experts** compute locally on the tokens they received (gathered from all ranks).
134+
5. **Combine all-to-all** returns outputs to each token's home rank → top-k weighted
135+
sum → next layer.
136+
137+
Consequences:
138+
139+
- **Comm profile differs from TP.** TP = all-reduce *every* layer (incl. attention);
140+
DP+EP = all-to-all *only at MoE layers*, attention is comm-free. Keep the EP
141+
all-to-all **intra-node (NVLink)** — cross-node EP is far slower (see `multi-node.md`).
142+
- **Experts still see a global batch** (tokens gathered from all DP lanes) → better
143+
utilization than TP-sharded experts.
144+
- **Routing is data-dependent → load can be uneven** across ranks; vLLM makes idle
145+
ranks run dummy forward passes while any rank is busy, so DP+EP works best with load
146+
spread evenly across replicas (normal under steady eval concurrency).
147+
148+
### When to enable
149+
150+
- **Yes — any MoE**, especially large / many-expert (DeepSeek, large Qwen MoE, GLM
151+
MoE). EP powers the standard high-throughput **DP-attention + EP-MoE** layout.
152+
- **No — dense models** (no experts). Also a **no-op at `TP=DP=1`** (nothing to
153+
distribute), so it's safe-but-pointless on a single GPU.
154+
155+
**Detecting MoE:** handle suffix encoding active params (`-A10B`, `-A3B`, `-A22B`),
156+
`num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or the
157+
card. (`-A10B` etc. = *active* params of an MoE — a strong MoE signal.)
158+
159+
> Cross-check `recipes.vllm.ai` for the family's validated TP/DP/EP layout and GPU
160+
> count, then adapt to your GPUs with the fit math (e.g. recipe TP=2 on 2×H200 → on
161+
> an 8-GPU node, TP=2/DP=4).
162+
163+
---
164+
165+
## Layer 2 — Concurrency knobs and how they relate
166+
167+
- **`parallelism`** — requests the eval **client** keeps in flight *per benchmark*.
168+
Continuous batching holds `parallelism` open at all times, dispatching a new one
169+
the instant another finishes (a sliding window, not discrete "waves").
170+
- **`--max-num-seqs`** — sequences a single vLLM **replica** decodes concurrently.
171+
Total server capacity:
172+
173+
```text
174+
serving_capacity = max-num-seqs × data_parallel_size × num_instances
175+
```
176+
177+
(TP and PP shard *one* replica, so they don't add capacity; replicas = DP, times
178+
`num_instances` for HAProxy multi-instance — see `multi-node.md`.)
179+
180+
Keep them matched: **`max-num-seqs = ceil(parallelism / (DP × num_instances))`**.
181+
If `parallelism` exceeds `serving_capacity`, the surplus just queues in vLLM — no
182+
speedup, and deep queues can trip `request_timeout`.
183+
184+
## The binding constraint flips with run size
185+
186+
`parallelism` is useful only up to the smaller of two ceilings:
187+
188+
1. **Total requests for the task** = `dataset_size × repeats` (`repeats` =
189+
`n_samples` for simple-evals / tau2-bench, `num_repeats` for nemo-skills — see
190+
`quantization-benchmarks.md`). You can't have more in flight than exist.
191+
2. **Sustainable serving capacity** = `max-num-seqs × DP × num_instances`, bounded
192+
by KV-cache memory (below).
193+
194+
| Situation | Set `parallelism` to | Why |
195+
| --- | --- | --- |
196+
| `total_requests ≤ serving_capacity` (small run) | `total_requests` (round up a little for uneven DP routing) | All requests dispatch at once → one wave → finishes in ~one generation-time. Higher is wasted. |
197+
| `total_requests ≫ serving_capacity` (large run) | `serving_capacity` (largest the GPUs sustain) | Throughput-bound: keep every decode slot full until the queue drains. Request count no longer matters. |
198+
199+
So "set it higher" is right **only up to the request count**; past that you just
200+
over-reserve KV.
201+
202+
## Sizing `--max-num-seqs` against KV cache
203+
204+
For throughput-bound runs `max-num-seqs` is capped by KV memory, driven by
205+
**context length × concurrent sequences**. High `max_new_tokens` (e.g. 81920) makes
206+
each sequence's KV large, shrinking the sustainable batch. **Read the ceiling from
207+
vLLM's startup log** rather than guessing:
208+
209+
- `# GPU blocks: N` — total KV blocks.
210+
- `Maximum concurrency for <max-model-len> tokens per request: X.XX×` — how many
211+
full-context sequences fit (more at shorter effective context).
212+
213+
During the canary, watch:
214+
215+
- **Preemption** (`Preempted N requests` / recompute) ⇒ `max-num-seqs` above what
216+
KV sustains; lower it (preemption wastes work).
217+
- **`GPU KV cache usage`** well below 100% with zero preemption ⇒ headroom; raise.
218+
219+
Factors that **relax** the KV limit: small / low-precision weights (more HBM for
220+
KV), **KV-cache quantization** (`kv_cache_scheme` in `config.json`), and
221+
**hybrid / linear-attention** layers (near-constant state instead of growing KV).
222+
223+
## Diminishing returns
224+
225+
Decode throughput saturates HBM bandwidth at some batch size; beyond that knee,
226+
more sequences add latency without adding tokens/sec. Goal = **largest batch with
227+
~zero preemption**, not the max the config accepts.
228+
229+
## Non-GPU caps
230+
231+
- **Judge / user-sim tasks** (HLE, AA-LCR, Tau2-Bench Telecom): `parallelism` is
232+
often capped by the **judge's rate limit**, not the served model. Start
233+
conservative; raise only after judge logs are clean. Use a per-task `parallelism`
234+
override when its ceiling differs (e.g. Tau2 cap 512).
235+
- **Per-task overrides:** size `--max-num-seqs` off the **max** `parallelism` across
236+
the top-level and all per-task overrides.
237+
238+
---
239+
240+
## Worked examples
241+
242+
**Dense 9B NVFP4, 8×B200 (this skill's GPQA run).** Weights ~5–6 GB → fits one GPU
243+
with huge KV headroom → **TP=1, DP=8, no EP.** Concurrency: GPQA Diamond = 198
244+
questions; `n_samples=1` → 198 requests (request-bound) → `parallelism=256`,
245+
`max-num-seqs=ceil(256/8)=32`. `n_samples=8` → 1,584 requests (capacity-bound) →
246+
start `parallelism=512` (`max-num-seqs=64`), then tune from vLLM's max-concurrency
247+
- preemption (toward 768–1024 if KV has headroom).
248+
249+
**Dense ~70B BF16, 8×H100 (80 GB).** ~140 GB weights → won't fit one GPU; TP=2
250+
(~70 GB/GPU + KV) fits → **TP=2, DP=4, no EP.** `serving_capacity = max-num-seqs ×
251+
4`.
252+
253+
**Large MoE ~235B-A22B, 8×H200.** MoE (the `-A22B` = active params) → enable EP.
254+
Throughput layout: **`--data-parallel-size 8 --enable-expert-parallel`**
255+
(DP-attention + EP-MoE, EP size 8), or TP=8 + EP if a single replica's attention/KV
256+
needs the full node. Pick per `recipes.vllm.ai` and the fit math.
257+
258+
**Trillion-scale MoE (Kimi-K2-class, ~1T/32B, MLA), 8×B200 — bit-width flips the
259+
split.** Same architecture, same node; only the precision differs:
260+
261+
- **FP8 (~1040 GB):** weight-bound. Experts alone are ~124 GB/GPU after EP-sharding;
262+
add replicated non-expert weights and TP=1 overflows 173 GB → **forced to
263+
`TP=8, DP=1, EP on`** (one replica across the node, ~43 GB/GPU KV — fine for MLA).
264+
- **4-bit — INT4 or NVFP4 (~520–572 GB):** experts drop to ~62–68 GB/GPU, leaving
265+
room to replicate attention 8× → **`TP=1, DP=8, EP on`** (8 lanes, max
266+
throughput); step to `TP=2/DP=4` only if a long-context canary shows KV preemption.
267+
268+
INT4 and NVFP4 here are ~the same size → **same layout** — don't let the differing
269+
handles (`moonshotai/…` vs `nvidia/…-NVFP4`) suggest otherwise. The only real
270+
divider is FP8 (weight-bound, TP=8) vs 4-bit (DP-capable, TP=1). This is also why a
271+
4-bit Kimi that needed `TP=8/DP=1` on a tighter 8×H200/640 GB node can switch to
272+
`TP=1/DP=8` on the larger 8×B200 node — adapt the layout to the GPUs you actually
273+
have.

0 commit comments

Comments
 (0)