NVIDIA · cjluo-nv · Jun 1, 2026 · Jun 1, 2026 · Jun 1, 2026 · Jun 1, 2026
@@ -138,6 +138,8 @@ deployment:
 
 Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
 
+For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.
+
 **Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
 
 #### vLLM-backend defaults — always include unless the recipe *contradicts*
@@ -146,15 +148,15 @@ Silence is not contradiction. Drop/override only when the recipe sets a differen
 
 - `--max-num-batched-tokens 8192` — caps per-step batched tokens; prevents long-prefill stalls.
 - `--enable-chunked-prefill` — interleaves long prefills with decode steps (required for AA-LCR's ~120K input). Modern vLLM defaults this on for many models; set explicitly to avoid drift.
-- `--enable-expert-parallel` — **MoE-only default.** Detect MoE from handle suffix (`-A10B`, `-A3B`, etc.), `num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models.
+- `--enable-expert-parallel` — **MoE-only default.** Detect MoE from handle suffix (`-A10B`, `-A3B`, etc.), `num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models. See `references/parallelism.md` for what EP does and the DP-attention + EP-MoE throughput pattern.
 - `--max-num-seqs N` — **omit at generation time** (top-level `parallelism` is `???`). Add this comment above `command:`:
 
   ```text
   # After filling in `parallelism` values (top-level + per-task overrides),
   # append `--max-num-seqs N` where N = ceil(max_parallelism / data_parallel_size).
   ```
 
-  In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → `ceil(128/8)=16`. Too small → request queuing; too large → wasted KV reservation.
+  In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → `ceil(128/8)=16`. Too small → request queuing; too large → wasted KV reservation. For how to choose the `parallelism` it derives from, read `references/parallelism.md`.
 
 #### Evaluation params template (top-level params)
 
@@ -164,7 +166,7 @@ The top-level `nemo_evaluator_config.config.params` must contain **exactly these
 nemo_evaluator_config:
   config:
     params:
-      parallelism: ???    # Required — ask user in Step 4 (depends on cluster + judge rate limits)
+      parallelism: ???    # Required — size per references/parallelism.md (bounded by total request count vs GPU serving capacity); ask user in Step 4 if still unclear
       request_timeout: 3600
       max_retries: 10
       max_new_tokens: 65536  # see rule below
@@ -192,6 +194,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
 ### Step 4 — Fill remaining ??? values
 
 - Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
+- **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
 - Ask about other defaults they may want to change (partition, walltime, MLflow tags).
 
 **Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
@@ -305,7 +308,7 @@ nel info <id> --logs
 ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log"
 ```
 
-Canary each risky task class separately (judge-scored, code-execution, model-only). Start `parallelism` conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model.
+Canary each risky task class separately (judge-scored, code-execution, model-only). Start `parallelism` conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model. For capacity-bound runs, tune `parallelism`/`--max-num-seqs` here against vLLM's reported max concurrency + preemption — see `references/parallelism.md`.
 
 Single-task rerun: `nel run --config <path> -t <task_name>` (combine with `-o ++...limit_samples=10` for canary).
 

@@ -13,6 +13,19 @@ AA-LCR needs long context: plan for roughly 120K input tokens plus 16K
 generation tokens. Set deployment `--max-model-len` to at least `131072`, and
 use a larger value when the model supports it.
 
+**Parallelism — set this *lower* than the top-level default.** AA-LCR is the
+suite's most concurrency-sensitive task on two fronts at once. (1) *KV-bound:* each
+request carries ~120K input tokens, so its KV footprint is large and a high
+`parallelism` triggers preemption — and recomputing 120K-token prefills is hugely
+wasteful, so over-parallelizing here makes the run *slower*, not faster (see
+`references/parallelism.md`, "Balanced sizing"). (2) *Judge-bound:* the
+equality-checker endpoint rate-limits before your served model does. So give it an
+explicit per-task `parallelism` well below the model/GPU-bound tasks' value: start
+small (≈16–32 for GQA models; MLA models such as Kimi tolerate several× more) and
+raise only while preemption ≈ 0 and the judge shows no 429s. The field is left as
+`???`; after choosing a value, recompute the deployment's `--max-num-seqs` per
+SKILL.md Step 3 (sized off the *max* parallelism across all tasks).
+
 ## YAML Fragment
 
 LCR has a deployment-side requirement (`--max-model-len 131072`) and a task
@@ -35,6 +48,7 @@ block. Per SKILL.md Step 3, the deployment flag must live inside
           use_response_logging: false
     config:
       params:
+        parallelism: ???   # set LOWER than top-level: long-context (KV-bound) + judge-bound; see body above. Recompute --max-num-seqs after setting.
         extra:
           num_repeats: 16
           judge:

@@ -0,0 +1,165 @@
+# Parallelism: topology (TP/DP/PP/EP) + concurrency (`parallelism` / `--max-num-seqs`)
+
+Two decisions, in order — both affect **throughput only, never scores**:
+
+1. **Topology** — how the model is laid out across GPUs (sets the replica count).
+2. **Concurrency** — requests in flight (`parallelism`) and per replica
+   (`--max-num-seqs`), sized on top of the topology.
+
+## Layer 1 — topology (TP / DP / PP)
+
+- **TP** shards each layer (weights+KV) within one replica → fits a too-big model /
+  splits KV for long context; costs an all-reduce **every layer** (keep intra-node).
+- **DP** replicates the model → N independent replicas = N× concurrency; N× weight memory.
+- **PP** shards layer ranges → very large / multi-node; pipeline bubbles. See `multi-node.md`.
+
+**Decide (single node, G GPUs):**
+
+1. **TP = smallest that fits** with KV headroom. Weights ≈ `params × bytes/param`
+   (NVFP4 ≈0.5–0.6, FP8 ≈1, BF16 ≈2); need
+   `weights/TP + KV + activations + overhead < GPU_mem × util`. Fits on one GPU → TP=1.
+   TP must divide `num_attention_heads` (ideally `num_key_value_heads`), be a power of
+   2, and never cross nodes.
+2. **DP = floor(G / (TP×PP))** — maximize for throughput (a 1-GPU-fit model runs
+   `TP=1,DP=G`, not `TP=G,DP=1`).
+3. **PP** only if it won't fit at max intra-node TP, or multi-node.
+
+> **Gotcha — bit-width sets the topology, not the model name.** Read precision from
+> `config.json` (`quantization_config`/`quant_algo`/dtype); don't infer from the
+> handle. Same arch + same bit-width → same TP/DP/EP regardless of vendor (INT4 vs
+> NVFP4 differ only in auto-detected kernel flags). The split changes only when
+> bit-width changes *size*.
+
+**Choosing the TP/DP split** (e.g. on 8 GPUs: `1/8`, `2/4`, `4/2`, `8/1`, all EP=8):
+default **smallest TP, largest DP** — DP scales throughput ~linearly with no extra
+comm; TP adds an all-reduce per attention layer. Raise TP **only** to relieve memory
+DP can't:
+
+1. a single request's KV won't fit one replica's HBM (long context — AA-LCR ~120K / 262K);
+2. preemption at your target per-replica `max-num-seqs` (TP=2 doubles per-replica KV);
+3. weights don't fit one GPU even after EP-sharding.
+
+Else higher TP wastes KV and gives up replicas. **Verify:** vLLM startup
+`Maximum concurrency for <max-model-len> tokens` ≳ `parallelism/DP` with no canary
+preemption → smaller TP wins.
+
+## Layer 1b — Expert parallelism (EP), MoE only
+
+`--enable-expert-parallel` is a **boolean** (no `--expert-parallel-size`); experts
+are partitioned across the whole world size:
+
+```text
+EP = tensor_parallel_size × data_parallel_size    (EP = TP only when DP=1)
+```
+
+So on a fixed node you don't tune EP — you tune the TP/DP split, which only changes
+the *attention* side:
+
+| Layout (8 GPUs, all EP=8) | Attention | Best when |
+| --- | --- | --- |
+| `TP=1 DP=8` | 8 replicas, comm-free | **default** — one request's KV fits 1 GPU |
+| `TP=2 DP=4` | 4 replicas | need ~2× per-replica KV (long ctx) |
+| `TP=4 DP=2` | 2 replicas | ~4× per-replica KV, or weights too big for TP≤2 |
+| `TP=8 DP=1` | 1 replica | trillion-scale weights / one huge KV pool |
+
+Down the table = more per-replica KV/weight room, fewer replicas, higher all-reduce
+cost; pick the **topmost row that fits**.
+
+**Dataflow (DP-attention + EP-MoE):** the DP and EP groups are the **same GPUs**.
+Attention is DP-local (no cross-rank comm); each MoE layer does a dispatch+combine
+**all-to-all** to route tokens to the rank owning their expert. So comm is all-to-all
+*only at MoE layers* (vs TP's per-layer all-reduce) — keep it **intra-node (NVLink)**.
+Data-dependent routing → uneven load; vLLM runs dummy passes on idle ranks, so spread
+load evenly.
+
+**Enable for any MoE** (detect via `-A10B`/`-A3B`/`-A22B` handle, `num_experts` /
+`n_routed_experts` in `config.json`); **not for dense**; no-op at `TP=DP=1`.
+Cross-check `recipes.vllm.ai` for the validated layout, then adapt to your GPU count
+via the fit math.
+
+## Layer 2 — concurrency (`parallelism` / `--max-num-seqs`)
+
+- **`parallelism`** = requests the client keeps in flight *per benchmark*.
+- **`--max-num-seqs`** = sequences one replica decodes at once.
+
+```text
+serving_capacity = max-num-seqs × DP × num_instances
+max-num-seqs     = ceil(parallelism / (DP × num_instances))   # keep matched
+```
+
+(TP/PP don't add capacity; replicas = DP, × `num_instances` for HAProxy — see
+`multi-node.md`.) `parallelism` above capacity just queues in vLLM (and risks
+`request_timeout`).
+
+**`parallelism` ceiling = the smaller of:**
+
+1. **total requests** = `dataset_size × repeats` (`n_samples` for simple-evals/tau2,
+   `num_repeats` for nemo-skills) — can't have more in flight than exist;
+2. **preemption-free capacity at the task's context** (KV-bound; below).
+
+| Run | Set `parallelism` to |
+| --- | --- |
+| `total_requests ≤ capacity` (small) | `total_requests` (round up for uneven DP routing) → one wave |
+| `total_requests ≫ capacity` (large) | the **preemption-free** capacity at the task's context (often *below* nominal) |
+
+**Sizing `--max-num-seqs` vs KV** — capped by `context × concurrent seqs`; high
+`max_new_tokens` shrinks the batch. Read vLLM startup `# GPU blocks` /
+`Maximum concurrency for <max-model-len> tokens` (full-length floor — you fit more at
+shorter context). Canary: `Preempted N` → lower; KV usage ≪100% with no preemption →
+raise. **Relaxed by:** low-precision weights; **KV-cache quantization** — checkpoint
+`kv_cache_scheme` **or serve-time `--kv-cache-dtype fp8`** (`fp8_e4m3`/`fp8_e5m2`) in
+`deployment.command`, ~halving KV → ~2× concurrency/context (verify support; small
+accuracy effect); and **hybrid/linear-attention** (near-constant KV).
+
+## Balanced sizing — bigger is NOT always faster (esp. long context)
+
+Past the KV-fit point throughput doesn't just plateau, it **regresses** — worst for
+long-context / long-output:
+
+1. **Preemption thrash** — over-admitted seqs get preempted; recomputing a ~120K
+   prefill is huge wasted work, so a modest preemption-free concurrency finishes *sooner*.
+2. **Prefill/decode contention** — many long prefills split `--max-num-batched-tokens`
+   and starve decode.
+3. **Timeout cascade** — too many in-flight → p99 > `request_timeout` → `max_retries`
+   resubmissions pile on more load.
+
+Sustainable concurrency is **context-dependent** — a `parallelism` good for GPQA
+(short) thrashes AA-LCR (~120K). **Rule:** target ~**70–80% of the preemption-free
+KV-fit concurrency at the task's working context × DP**; give long-context/long-output
+tasks a **lower per-task override**; canary-tune up only while throughput↑,
+preemption≈0, p99 < `request_timeout`; **err low** for long context (too-small mildly
+underutilizes; too-large is *multiples* slower).
+
+## Suites — set `parallelism` per task, not per run
+
+Suite tasks hit **different bottlenecks** against one deployment; use a top-level
+default for short model-bound tasks and override the outliers:
+
+| Bottleneck | AA tasks | Cap by |
+| --- | --- | --- |
+| Model / GPU KV (short) | `gpqa_diamond_aa_v3`, `ns_ifbench` | top-level default (preemption-free KV-fit) |
+| Long-context KV (~120K) | `ns_aa_lcr` | **low** override — prefill thrash; MLA ≫ GQA |
+| Judge / user-sim rate limit | `ns_hle_aa`, `ns_aa_lcr`, `tau2_bench_telecom` | judge endpoint 429s, **not** the model |
+| Sandbox execution | `ns_scicode` | sandbox slots |
+
+- Judge/sandbox tasks bottleneck **before** the model — over-parallelizing yields
+  429s/retries, not speed; cap to the endpoint, tune by *its* errors.
+- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
+  busiest task) even if long-context tasks run lower.
- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
-  busiest task) even if long-context tasks run lower.
+- `--max-num-seqs = ceil(max parallelism across tasks / (DP × num_instances))` (deployment must serve the
+  busiest task) even if long-context tasks run lower.
- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
-  busiest task) even if long-context tasks run lower.
+- `--max-num-seqs = ceil(max parallelism across tasks / (DP × num_instances))` (deployment must serve the
+  busiest task) even if long-context tasks run lower.
+- Canary each class (model / judge / sandbox) separately. Endpoint/context-dependent
+  tasks (`ns_aa_lcr`, `tau2_bench_telecom`) ship `parallelism: ???` to force a choice.
+
+## Worked examples (8×B200)
+
+- **Dense 9B NVFP4** (~5–6 GB) → **TP=1/DP=8, no EP**. GPQA `n_samples=1` = 198 reqs
+  (request-bound) → `parallelism=256`, `max-num-seqs=32`. `n_samples=8` = 1584
+  (capacity-bound) → start 512; tune up only while preemption≈0 (~82K reasoning output
+  → knee may be <1024).
+- **Dense ~70B BF16, 8×H100/80GB** (~140 GB) → won't fit 1 GPU → **TP=2/DP=4, no EP**.
+- **Large MoE ~235B-A22B** → EP on; layout `DP=8 + EP` (or `TP=8 + EP` if one replica
+  needs the full node for KV).
+- **Trillion-scale MoE (Kimi-class ~1T, MLA) — bit-width flips the split:** FP8
+  (~1040 GB) is weight-bound → forced **TP=8/DP=1/EP**; 4-bit INT4/NVFP4 (~520–572 GB)
+  frees room → **TP=1/DP=8/EP**. INT4 ≈ NVFP4 → same layout (don't let `moonshotai/…`
+  vs `nvidia/…-NVFP4` mislead) — same reason a 4-bit Kimi needing TP=8 on 8×H200/640GB
+  switches to TP=1/DP=8 on 8×B200.