Add parallelism sizing reference to evaluation skill

cjluo-nv · claude · cjluo-nv · commit 7a1e11023084 · 2026-06-01T11:28:00.000-07:00
Add references/parallelism.md documenting how to choose deployment
topology (TP / DP / PP, expert parallelism) and concurrency
(parallelism / --max-num-seqs) for NEL evaluation runs:

- GPU topology decision procedure: smallest-TP-then-max-DP, the
  TP-up triggers, and the TP/DP split tradeoff for a fixed world size.
- Expert parallelism: EP = TP x DP (boolean flag, no direct EP size),
  the DP-attention + EP-MoE dataflow, and when to enable vs not.
- Concurrency sizing: request-count vs serving-capacity ceiling,
  KV-driven --max-num-seqs, empirical tuning from vLLM logs.
- Gotcha: bit-width (read from config.json), not the model name,
  sets the topology; worked examples incl. FP8-vs-4-bit Kimi.

Wire SKILL.md to the new reference from the deployment-command,
expert-parallel, evaluation-params, Step 4, and canary sections.

Co-Authored-By: Claude Opus 4.8 (1M context) &lt;noreply@anthropic.com&gt;
Signed-off-by: Chenjie Luo &lt;chenjiel@nvidia.com&gt;
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
@@ -138,6 +138,8 @@ deployment:
 
 Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
 
+For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.
+
 **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
 
 #### vLLM-backend defaults — always include unless the recipe *contradicts*
@@ -146,15 +148,15 @@ Silence is not contradiction. Drop/override only when the recipe sets a differen
 
 - `--max-num-batched-tokens 8192` — caps per-step batched tokens; prevents long-prefill stalls.
 - `--enable-chunked-prefill` — interleaves long prefills with decode steps (required for AA-LCR's ~120K input). Modern vLLM defaults this on for many models; set explicitly to avoid drift.
-- `--enable-expert-parallel` — **MoE-only default.** Detect MoE from handle suffix (`-A10B`, `-A3B`, etc.), `num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models.
+- `--enable-expert-parallel` — **MoE-only default.** Detect MoE from handle suffix (`-A10B`, `-A3B`, etc.), `num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models. See `references/parallelism.md` for what EP does and the DP-attention + EP-MoE throughput pattern.
 - `--max-num-seqs N` — **omit at generation time** (top-level `parallelism` is `???`). Add this comment above `command:`:
 
   ```text
   # After filling in `parallelism` values (top-level + per-task overrides),
   # append `--max-num-seqs N` where N = ceil(max_parallelism / data_parallel_size).
   ```
 
-  In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → `ceil(128/8)=16`. Too small → request queuing; too large → wasted KV reservation.
+  In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → `ceil(128/8)=16`. Too small → request queuing; too large → wasted KV reservation. For how to choose the `parallelism` it derives from, read `references/parallelism.md`.
 
 #### Evaluation params template (top-level params)
 
@@ -164,7 +166,7 @@ The top-level `nemo_evaluator_config.config.params` must contain **exactly these
 nemo_evaluator_config:
   config:
     params:
-      parallelism: ???    # Required — ask user in Step 4 (depends on cluster + judge rate limits)
+      parallelism: ???    # Required — size per references/parallelism.md (bounded by total request count vs GPU serving capacity); ask user in Step 4 if still unclear
       request_timeout: 3600
       max_retries: 10
       max_new_tokens: 65536  # see rule below
@@ -192,6 +194,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
 ### Step 4 — Fill remaining ??? values
 
 - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
+- **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
 - Ask about other defaults they may want to change (partition, walltime, MLflow tags).
 
 **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
@@ -305,7 +308,7 @@ nel info <id> --logs
 ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log"
 ```
 
-Canary each risky task class separately (judge-scored, code-execution, model-only). Start `parallelism` conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model.
+Canary each risky task class separately (judge-scored, code-execution, model-only). Start `parallelism` conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model. For capacity-bound runs, tune `parallelism`/`--max-num-seqs` here against vLLM's reported max concurrency + preemption — see `references/parallelism.md`.
 
 Single-task rerun: `nel run --config <path> -t <task_name>` (combine with `-o ++...limit_samples=10` for canary).
 
diff --git a/.claude/skills/evaluation/references/parallelism.md b/.claude/skills/evaluation/references/parallelism.md
@@ -0,0 +1,273 @@
+# Parallelism: GPU topology (TP / DP / PP / EP) and concurrency (`parallelism` / `--max-num-seqs`)
+
+Two layers of decisions, made in order:
+
+1. **GPU topology** — how the model is laid out across the GPUs you have
+   (`--tensor-parallel-size`, `--data-parallel-size`, `--pipeline-parallel-size`,
+   `--enable-expert-parallel`). Decide this first; it determines how many
+   independent replicas exist.
+2. **Concurrency** — how many requests are in flight (`parallelism`) and how many
+   sequences each replica decodes at once (`--max-num-seqs`). Sized on top of the
+   topology.
+
+Both layers affect **throughput only** — never scores.
+
+---
+
+## Layer 1 — GPU topology: TP / DP / PP
+
+| Dim | What it shards | What it buys | Cost |
+| --- | --- | --- | --- |
+| **TP** (tensor parallel) | Each layer's weights + KV across GPUs **within one replica** | Lets a model that doesn't fit on one GPU run; splits KV so longer context fits | All-reduce **every layer** → latency + needs fast interconnect (NVLink). Keep **within one node**. |
+| **DP** (data parallel) | Nothing — **replicates** the (TP-sharded) model | Throughput: N independent replicas serve N× the concurrent requests | N× the weight memory (one full copy per replica) |
+| **PP** (pipeline parallel) | Contiguous **layer ranges** across GPUs | Fits a model too big for intra-node TP; cheaper cross-node than TP | Pipeline bubbles → lower utilization. Mostly for very large / multi-node (see `multi-node.md`). |
+
+**Decision procedure (single node, G GPUs):**
+
+1. **TP = the smallest value that makes the model fit with KV headroom.**
+   Estimate weight memory ≈ `params × bytes_per_param`:
+   - NVFP4 ≈ 0.5–0.6 B/param (incl. scales) · FP8 ≈ 1 B · BF16/FP16 ≈ 2 B.
+   - Need `weights/TP + KV cache + activations + CUDA-graph/overhead` to fit in
+     `GPU_mem × gpu_memory_utilization`. If it fits on one GPU → **TP = 1**.
+   - Constraints: TP must **divide `num_attention_heads`** (and ideally
+     `num_key_value_heads` for GQA, else KV heads get replicated and waste memory);
+     use a **power of 2**; never span nodes with TP.
+   - Smaller TP = less communication = higher efficiency. Don't over-shard "to be
+     safe" — it slows decode.
+2. **DP = floor(G / (TP × PP))** — use the leftover GPUs as replicas. For
+   throughput-bound evals (the common case), **maximize DP**: a model that fits on
+   one GPU should run `TP=1, DP=G`, not `TP=G, DP=1`.
+3. **PP only if** the model can't fit even at the largest sensible intra-node TP,
+   or you're going multi-node — then see `multi-node.md`.
+
+DP is the lever that grows **serving capacity** (Layer 2): more replicas → more
+concurrent sequences.
+
+> **Gotcha — bit-width sets the topology, not the model name.** The weight estimate
+> hinges on `bytes_per_param`, so **read the actual precision from `config.json`
+> (`quantization_config` / `quant_algo` / dtype) before sizing** — do not infer it
+> from the org/handle. Two checkpoints of the *same architecture at the same
+> bit-width* have the same footprint → the same TP/DP/EP, regardless of vendor or
+> quant scheme (INT4 vs NVFP4 differ only in kernel/quant-method flags, which vLLM
+> auto-detects — and a negligible effective-bit difference). The split only changes
+> when the bit-width changes the *size* (see the Kimi example below).
+
+### Choosing the TP/DP split (when more than one layout fits)
+
+A fixed GPU count usually admits several valid splits — on 8 GPUs a MoE could run
+any factorization with `TP×DP=8`: `TP=1/DP=8`, `TP=2/DP=4`, `TP=4/DP=2`, or
+`TP=8/DP=1` (all with EP=8, see Layer 1b). They are **not** equally good. Default to
+**smallest TP, largest DP**, because:
+
+- **DP scales throughput ~linearly with no extra communication** — attention lanes
+  are independent; only the MoE all-to-all couples ranks, and that's EP=`TP×DP`
+  regardless of the split, so it's identical across them.
+- **TP adds an all-reduce on every attention layer** and scales sublinearly. Each
+  step up in TP buys KV/weight room at an efficiency cost.
+
+Raise TP above 1 **only** to relieve a memory constraint DP cannot fix:
+
+1. **A single request's KV won't fit one replica's pool.** A request runs entirely
+   on one replica, so its longest (context + generation) KV must fit in that
+   replica's free HBM: TP=1 → one GPU's HBM, TP=2 → two GPUs' (KV is TP-sharded).
+   Long-context tasks (AA-LCR ~120K, full 262K) on memory-tight models force this.
+2. **You need more concurrent seqs per replica than one GPU's KV allows** — i.e.
+   you see preemption on TP=1 at your target per-replica `max-num-seqs`. TP=2
+   doubles the KV blocks per replica.
+3. **Weights don't fit one GPU** even after EP-sharding (dense models, or a MoE
+   whose replicated attention + expert shard exceeds one GPU).
+
+If none bite, higher TP just wastes the extra KV and gives up replicas → net slower.
+**How to know which wins:** deploy the candidate and read the startup line
+`Maximum concurrency for <max-model-len> tokens per request: X.XX×`; if it's
+comfortably above `parallelism / DP` with zero preemption in the canary, the smaller
+TP wins. Step up TP only when the canary proves TP=1 is KV-bound.
+
+---
+
+## Layer 1b — Expert parallelism (EP), MoE only
+
+`--enable-expert-parallel` changes how the **MoE expert (FFN) layers** are parallelized:
+
+- **Off (default):** every expert is tensor-sharded across the TP ranks (each rank
+  holds a slice of *all* experts).
+- **On:** whole experts are **partitioned across ranks** (each rank owns a subset of
+  experts). Less per-FFN communication and better expert batching — the efficient
+  choice for many-expert MoE.
+
+### EP size is derived, not set
+
+`--enable-expert-parallel` is a **boolean** — there is **no `--expert-parallel-size`**
+in vLLM. The EP degree is always:
+
+```text
+EP = tensor_parallel_size × data_parallel_size      (the full world size)
+```
+
+So **EP equals TP only when DP=1.** On a fixed 8-GPU node every fitting split gives
+EP=8 — you don't tune EP, you tune the TP/DP split, which only changes the
+*attention* side:
+
+| Layout (8 GPUs) | EP | Attention | Best when |
+| --- | :--: | --- | --- |
+| `TP=1 DP=8 --enable-expert-parallel` | 8 | 8 replicas, comm-free | throughput; one request's KV fits 1 GPU (**default**) |
+| `TP=2 DP=4 --enable-expert-parallel` | 8 | 4 replicas, TP=2 | need ~2× per-replica KV pool (long ctx) |
+| `TP=4 DP=2 --enable-expert-parallel` | 8 | 2 replicas, TP=4 | ~4× per-replica KV pool, or weights too big for TP≤2, but still want >1 replica |
+| `TP=8 DP=1 --enable-expert-parallel` | 8 | 1 replica, TP=8 | trillion-scale weights / one huge KV pool |
+
+Going down the table trades replicas (throughput) for a bigger per-replica KV pool
+and more weight-fit room; the all-reduce cost rises with TP. Pick the **topmost row
+that satisfies the memory constraints** (the TP-up triggers above).
+
+### How DP-attention connects to the experts (the dataflow)
+
+The DP group and the EP group are the **same physical GPUs** — rank `r` is both DP
+lane `r` (a full attention replica, since TP=1) *and* the owner of expert-shard `r`.
+Per MoE decoder layer:
+
+1. **Attention** runs DP-local — each rank on its own tokens + its own KV, **no
+   cross-rank comm**.
+2. **Router** picks top-k experts per token; those experts may live on any rank.
+3. **Dispatch all-to-all** sends each token's hidden vector to the rank that owns its
+   expert(s). *This is the only coupling between the DP lanes.*
+4. **Experts** compute locally on the tokens they received (gathered from all ranks).
+5. **Combine all-to-all** returns outputs to each token's home rank → top-k weighted
+   sum → next layer.
+
+Consequences:
+
+- **Comm profile differs from TP.** TP = all-reduce *every* layer (incl. attention);
+  DP+EP = all-to-all *only at MoE layers*, attention is comm-free. Keep the EP
+  all-to-all **intra-node (NVLink)** — cross-node EP is far slower (see `multi-node.md`).
+- **Experts still see a global batch** (tokens gathered from all DP lanes) → better
+  utilization than TP-sharded experts.
+- **Routing is data-dependent → load can be uneven** across ranks; vLLM makes idle
+  ranks run dummy forward passes while any rank is busy, so DP+EP works best with load
+  spread evenly across replicas (normal under steady eval concurrency).
+
+### When to enable
+
+- **Yes — any MoE**, especially large / many-expert (DeepSeek, large Qwen MoE, GLM
+  MoE). EP powers the standard high-throughput **DP-attention + EP-MoE** layout.
+- **No — dense models** (no experts). Also a **no-op at `TP=DP=1`** (nothing to
+  distribute), so it's safe-but-pointless on a single GPU.
+
+**Detecting MoE:** handle suffix encoding active params (`-A10B`, `-A3B`, `-A22B`),
+`num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or the
+card. (`-A10B` etc. = *active* params of an MoE — a strong MoE signal.)
+
+> Cross-check `recipes.vllm.ai` for the family's validated TP/DP/EP layout and GPU
+> count, then adapt to your GPUs with the fit math (e.g. recipe TP=2 on 2×H200 → on
+> an 8-GPU node, TP=2/DP=4).
+
+---
+
+## Layer 2 — Concurrency knobs and how they relate
+
+- **`parallelism`** — requests the eval **client** keeps in flight *per benchmark*.
+  Continuous batching holds `parallelism` open at all times, dispatching a new one
+  the instant another finishes (a sliding window, not discrete "waves").
+- **`--max-num-seqs`** — sequences a single vLLM **replica** decodes concurrently.
+  Total server capacity:
+
+  ```text
+  serving_capacity = max-num-seqs × data_parallel_size × num_instances
+  ```
+
+  (TP and PP shard *one* replica, so they don't add capacity; replicas = DP, times
+  `num_instances` for HAProxy multi-instance — see `multi-node.md`.)
+
+Keep them matched: **`max-num-seqs = ceil(parallelism / (DP × num_instances))`**.
+If `parallelism` exceeds `serving_capacity`, the surplus just queues in vLLM — no
+speedup, and deep queues can trip `request_timeout`.
+
+## The binding constraint flips with run size
+
+`parallelism` is useful only up to the smaller of two ceilings:
+
+1. **Total requests for the task** = `dataset_size × repeats` (`repeats` =
+   `n_samples` for simple-evals / tau2-bench, `num_repeats` for nemo-skills — see
+   `quantization-benchmarks.md`). You can't have more in flight than exist.
+2. **Sustainable serving capacity** = `max-num-seqs × DP × num_instances`, bounded
+   by KV-cache memory (below).
+
+| Situation | Set `parallelism` to | Why |
+| --- | --- | --- |
+| `total_requests ≤ serving_capacity` (small run) | `total_requests` (round up a little for uneven DP routing) | All requests dispatch at once → one wave → finishes in ~one generation-time. Higher is wasted. |
+| `total_requests ≫ serving_capacity` (large run) | `serving_capacity` (largest the GPUs sustain) | Throughput-bound: keep every decode slot full until the queue drains. Request count no longer matters. |
+
+So "set it higher" is right **only up to the request count**; past that you just
+over-reserve KV.
+
+## Sizing `--max-num-seqs` against KV cache
+
+For throughput-bound runs `max-num-seqs` is capped by KV memory, driven by
+**context length × concurrent sequences**. High `max_new_tokens` (e.g. 81920) makes
+each sequence's KV large, shrinking the sustainable batch. **Read the ceiling from
+vLLM's startup log** rather than guessing:
+
+- `# GPU blocks: N` — total KV blocks.
+- `Maximum concurrency for <max-model-len> tokens per request: X.XX×` — how many
+  full-context sequences fit (more at shorter effective context).
+
+During the canary, watch:
+
+- **Preemption** (`Preempted N requests` / recompute) ⇒ `max-num-seqs` above what
+  KV sustains; lower it (preemption wastes work).
+- **`GPU KV cache usage`** well below 100% with zero preemption ⇒ headroom; raise.
+
+Factors that **relax** the KV limit: small / low-precision weights (more HBM for
+KV), **KV-cache quantization** (`kv_cache_scheme` in `config.json`), and
+**hybrid / linear-attention** layers (near-constant state instead of growing KV).
+
+## Diminishing returns
+
+Decode throughput saturates HBM bandwidth at some batch size; beyond that knee,
+more sequences add latency without adding tokens/sec. Goal = **largest batch with
+~zero preemption**, not the max the config accepts.
+
+## Non-GPU caps
+
+- **Judge / user-sim tasks** (HLE, AA-LCR, Tau2-Bench Telecom): `parallelism` is
+  often capped by the **judge's rate limit**, not the served model. Start
+  conservative; raise only after judge logs are clean. Use a per-task `parallelism`
+  override when its ceiling differs (e.g. Tau2 cap 512).
+- **Per-task overrides:** size `--max-num-seqs` off the **max** `parallelism` across
+  the top-level and all per-task overrides.
+
+---
+
+## Worked examples
+
+**Dense 9B NVFP4, 8×B200 (this skill's GPQA run).** Weights ~5–6 GB → fits one GPU
+with huge KV headroom → **TP=1, DP=8, no EP.** Concurrency: GPQA Diamond = 198
+questions; `n_samples=1` → 198 requests (request-bound) → `parallelism=256`,
+`max-num-seqs=ceil(256/8)=32`. `n_samples=8` → 1,584 requests (capacity-bound) →
+start `parallelism=512` (`max-num-seqs=64`), then tune from vLLM's max-concurrency
+- preemption (toward 768–1024 if KV has headroom).
+
+**Dense ~70B BF16, 8×H100 (80 GB).** ~140 GB weights → won't fit one GPU; TP=2
+(~70 GB/GPU + KV) fits → **TP=2, DP=4, no EP.** `serving_capacity = max-num-seqs ×
+4`.
+
+**Large MoE ~235B-A22B, 8×H200.** MoE (the `-A22B` = active params) → enable EP.
+Throughput layout: **`--data-parallel-size 8 --enable-expert-parallel`**
+(DP-attention + EP-MoE, EP size 8), or TP=8 + EP if a single replica's attention/KV
+needs the full node. Pick per `recipes.vllm.ai` and the fit math.
+
+**Trillion-scale MoE (Kimi-K2-class, ~1T/32B, MLA), 8×B200 — bit-width flips the
+split.** Same architecture, same node; only the precision differs:
+
+- **FP8 (~1040 GB):** weight-bound. Experts alone are ~124 GB/GPU after EP-sharding;
+  add replicated non-expert weights and TP=1 overflows 173 GB → **forced to
+  `TP=8, DP=1, EP on`** (one replica across the node, ~43 GB/GPU KV — fine for MLA).
+- **4-bit — INT4 or NVFP4 (~520–572 GB):** experts drop to ~62–68 GB/GPU, leaving
+  room to replicate attention 8× → **`TP=1, DP=8, EP on`** (8 lanes, max
+  throughput); step to `TP=2/DP=4` only if a long-context canary shows KV preemption.
+
+INT4 and NVFP4 here are ~the same size → **same layout** — don't let the differing
+handles (`moonshotai/…` vs `nvidia/…-NVFP4`) suggest otherwise. The only real
+divider is FP8 (weight-bound, TP=8) vs 4-bit (DP-capable, TP=1). This is also why a
+4-bit Kimi that needed `TP=8/DP=1` on a tighter 8×H200/640 GB node can switch to
+`TP=1/DP=8` on the larger 8×B200 node — adapt the layout to the GPUs you actually
+have.