Commit 38c7843

cjluo-nv and claude authored
Add parallelism sizing reference to evaluation skill (#1583)
### What does this PR do?

Type of change: documentation

Adds `.claude/skills/evaluation/references/parallelism.md`, a reference for sizing both the **GPU topology** and the **request concurrency** of NEL evaluation runs, and wires the main `SKILL.md` to it. Pure docs for the agent-facing evaluation skill; no code or runtime behavior changes.

The new reference covers:

- **GPU topology (TP / DP / PP):** decision procedure (smallest-TP-then-max-DP), the TP-up triggers, and the TP/DP-split tradeoff for a fixed world size (e.g. all `TP×DP=8` factorizations on an 8-GPU node).
- **Expert parallelism (EP):** `--enable-expert-parallel` is a boolean and `EP = TP × DP` (no direct EP-size flag); the DP-attention + EP-MoE dataflow (per-layer dispatch/combine all-to-all); when to enable vs not.
- **Concurrency (`parallelism` / `--max-num-seqs`):** the request-count-vs-serving-capacity ceiling, KV-driven `--max-num-seqs`, and empirical tuning from vLLM startup logs + preemption.
- **Gotcha + worked examples:** bit-width read from `config.json` (not the model name) sets the topology; includes the FP8-vs-4-bit Kimi example.

`SKILL.md` gains pointers to the reference from the deployment-command, expert-parallel, evaluation-params, Step 4, and canary sections.

### Usage

N/A: documentation only (agent skill guidance).

### Testing

`pre-commit run` passes on both files (markdownlint included).

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (documentation)
- Did you update Changelog?: N/A (skill docs, not a library feature)
- Did you get Claude approval on this PR?: ❌ (pending)

### Additional Information

Commits are signed (`git commit -s -S`).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Expanded deployment guidance with clearer GPU topology, MoE/EP semantics, and bit‑width caveats.
  * Refined concurrency guidance: explicit rules for sizing top-level parallelism and computing max concurrent sequences, plus canary-run tuning to watch preemption and KV utilization.
  * Added task-level guidance for long‑context/judge‑bound jobs with recomputation advice and worked examples.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent f0d2237 commit 38c7843

3 files changed

Lines changed: 186 additions & 4 deletions

.claude/skills/evaluation/SKILL.md

Lines changed: 7 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -138,6 +138,8 @@ deployment:
138138
139139
Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.
140140

141+
For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.
142+
141143
**Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
142144

143145
#### vLLM-backend defaults — always include unless the recipe *contradicts*
@@ -146,15 +148,15 @@ Silence is not contradiction. Drop/override only when the recipe sets a differen
146148

147149
- `--max-num-batched-tokens 8192` — caps per-step batched tokens; prevents long-prefill stalls.
148150
- `--enable-chunked-prefill` — interleaves long prefills with decode steps (required for AA-LCR's ~120K input). Modern vLLM defaults this on for many models; set explicitly to avoid drift.
149-
- `--enable-expert-parallel` — **MoE-only default.** Detect MoE from handle suffix (`-A10B`, `-A3B`, etc.), `num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models.
151+
- `--enable-expert-parallel` — **MoE-only default.** Detect MoE from handle suffix (`-A10B`, `-A3B`, etc.), `num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models. See `references/parallelism.md` for what EP does and the DP-attention + EP-MoE throughput pattern.
150152
- `--max-num-seqs N` — **omit at generation time** (top-level `parallelism` is `???`). Add this comment above `command:`:
151153

152154
```text
153155
# After filling in `parallelism` values (top-level + per-task overrides),
154156
# append `--max-num-seqs N` where N = ceil(max_parallelism / data_parallel_size).
155157
```
156158

157-
In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → `ceil(128/8)=16`. Too small → request queuing; too large → wasted KV reservation.
159+
In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → `ceil(128/8)=16`. Too small → request queuing; too large → wasted KV reservation. For how to choose the `parallelism` it derives from, read `references/parallelism.md`.
158160

159161
#### Evaluation params template (top-level params)
160162

@@ -164,7 +166,7 @@ The top-level `nemo_evaluator_config.config.params` must contain **exactly these
164166
nemo_evaluator_config:
165167
config:
166168
params:
167-
parallelism: ??? # Required — ask user in Step 4 (depends on cluster + judge rate limits)
169+
parallelism: ??? # Required — size per references/parallelism.md (bounded by total request count vs GPU serving capacity); ask user in Step 4 if still unclear
168170
request_timeout: 3600
169171
max_retries: 10
170172
max_new_tokens: 65536 # see rule below
@@ -192,6 +194,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
192194
### Step 4 — Fill remaining ??? values
193195

194196
- Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
197+
- **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
195198
- Ask about other defaults they may want to change (partition, walltime, MLflow tags).
196199

197200
**Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
@@ -305,7 +308,7 @@ nel info <id> --logs
305308
ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log"
306309
```
307310

308-
Canary each risky task class separately (judge-scored, code-execution, model-only). Start `parallelism` conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model.
311+
Canary each risky task class separately (judge-scored, code-execution, model-only). Start `parallelism` conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model. For capacity-bound runs, tune `parallelism`/`--max-num-seqs` here against vLLM's reported max concurrency + preemption — see `references/parallelism.md`.
309312

310313
Single-task rerun: `nel run --config <path> -t <task_name>` (combine with `-o ++...limit_samples=10` for canary).
311314

.claude/skills/evaluation/recipes/tasks/aa/lcr.md

Lines changed: 14 additions & 0 deletions
@@ -13,6 +13,19 @@ AA-LCR needs long context: plan for roughly 120K input tokens plus 16K
1313
generation tokens. Set deployment `--max-model-len` to at least `131072`, and
1414
use a larger value when the model supports it.
1515

16+
**Parallelism — set this *lower* than the top-level default.** AA-LCR is the
17+
suite's most concurrency-sensitive task on two fronts at once. (1) *KV-bound:* each
18+
request carries ~120K input tokens, so its KV footprint is large and a high
19+
`parallelism` triggers preemption — and recomputing 120K-token prefills is hugely
20+
wasteful, so over-parallelizing here makes the run *slower*, not faster (see
21+
`references/parallelism.md`, "Balanced sizing"). (2) *Judge-bound:* the
22+
equality-checker endpoint rate-limits before your served model does. So give it an
23+
explicit per-task `parallelism` well below the model/GPU-bound tasks' value: start
24+
small (≈16–32 for GQA models; MLA models such as Kimi tolerate several× more) and
25+
raise only while preemption ≈ 0 and the judge shows no 429s. The field is left as
26+
`???`; after choosing a value, recompute the deployment's `--max-num-seqs` per
27+
SKILL.md Step 3 (sized off the *max* parallelism across all tasks).
28+
1629
## YAML Fragment
1730

1831
LCR has a deployment-side requirement (`--max-model-len 131072`) and a task
@@ -35,6 +48,7 @@ block. Per SKILL.md Step 3, the deployment flag must live inside
3548
use_response_logging: false
3649
config:
3750
params:
51+
parallelism: ??? # set LOWER than top-level: long-context (KV-bound) + judge-bound; see body above. Recompute --max-num-seqs after setting.
3852
extra:
3953
num_repeats: 16
4054
judge:
.claude/skills/evaluation/references/parallelism.md

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
1+
# Parallelism: topology (TP/DP/PP/EP) + concurrency (`parallelism` / `--max-num-seqs`)
2+
3+
Two decisions, in order — both affect **throughput only, never scores**:
4+
5+
1. **Topology** — how the model is laid out across GPUs (sets the replica count).
6+
2. **Concurrency** — requests in flight (`parallelism`) and per replica
7+
(`--max-num-seqs`), sized on top of the topology.
8+
9+
## Layer 1 — topology (TP / DP / PP)
10+
11+
- **TP** shards each layer (weights+KV) within one replica → fits a too-big model /
12+
splits KV for long context; costs an all-reduce **every layer** (keep intra-node).
13+
- **DP** replicates the model → N independent replicas = N× concurrency; N× weight memory.
14+
- **PP** shards layer ranges → very large / multi-node; pipeline bubbles. See `multi-node.md`.
15+
16+
**Decide (single node, G GPUs):**
17+
18+
1. **TP = smallest that fits** with KV headroom. Weights ≈ `params × bytes/param`
19+
(NVFP4 ≈0.5–0.6, FP8 ≈1, BF16 ≈2); need
20+
`weights/TP + KV + activations + overhead < GPU_mem × util`. Fits on one GPU → TP=1.
21+
TP must divide `num_attention_heads` (ideally `num_key_value_heads`), be a power of
22+
2, and never cross nodes.
23+
2. **DP = floor(G / (TP×PP))** — maximize for throughput (a 1-GPU-fit model runs
24+
`TP=1,DP=G`, not `TP=G,DP=1`).
25+
3. **PP** only if it won't fit at max intra-node TP, or multi-node.
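
The decision steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill: the function name `choose_layout`, the single `overhead_gb` term (which folds activations and overhead together), the default `util=0.9`, and all memory figures in the usage note are assumptions chosen for the example. It assumes PP=1 on a single node.

```python
def choose_layout(weights_gb, gpu_mem_gb, gpus, kv_gb, overhead_gb=0.0,
                  util=0.9, num_attention_heads=None):
    """Return (TP, DP) for one node with PP=1, or None if nothing fits.

    Walks TP = 1, 2, 4, ... and keeps the first value that satisfies
    weights/TP + KV + overhead < GPU_mem * util, then sets DP = G // TP.
    overhead_gb is a hypothetical lump sum for activations + overhead.
    """
    tp = 1
    while tp <= gpus:
        # TP must divide the attention head count when it is known.
        divides_heads = (num_attention_heads is None
                         or num_attention_heads % tp == 0)
        fits = weights_gb / tp + kv_gb + overhead_gb < gpu_mem_gb * util
        if divides_heads and fits:
            return tp, gpus // tp
        tp *= 2  # keep TP a power of two
    return None  # would need PP or more nodes
```

With illustrative numbers, `choose_layout(100, 80, 8, kv_gb=10, overhead_gb=5)` rejects TP=1 (115 GB against a 72 GB budget) and accepts TP=2 (65 GB), giving `(2, 4)`; a model that fits one GPU gets `(1, 8)`.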
26+
27+
> **Gotcha — bit-width sets the topology, not the model name.** Read precision from
28+
> `config.json` (`quantization_config`/`quant_algo`/dtype); don't infer from the
29+
> handle. Same arch + same bit-width → same TP/DP/EP regardless of vendor (INT4 vs
30+
> NVFP4 differ only in auto-detected kernel flags). The split changes only when
31+
> bit-width changes *size*.
32+
33+
**Choosing the TP/DP split** (e.g. on 8 GPUs: `1/8`, `2/4`, `4/2`, `8/1`, all EP=8):
34+
default **smallest TP, largest DP** — DP scales throughput ~linearly with no extra
35+
comm; TP adds an all-reduce per attention layer. Raise TP **only** to relieve memory
36+
DP can't:
37+
38+
1. a single request's KV won't fit one replica's HBM (long context — AA-LCR ~120K / 262K);
39+
2. preemption at your target per-replica `max-num-seqs` (TP=2 doubles per-replica KV);
40+
3. weights don't fit one GPU even after EP-sharding.
41+
42+
Else higher TP wastes KV and gives up replicas. **Verify:** vLLM startup
43+
`Maximum concurrency for <max-model-len> tokens` ≥ `parallelism/DP` with no canary
44+
preemption → smaller TP wins.
45+
46+
## Layer 1b — Expert parallelism (EP), MoE only
47+
48+
`--enable-expert-parallel` is a **boolean** (no `--expert-parallel-size`); experts
49+
are partitioned across the whole world size:
50+
51+
```text
52+
EP = tensor_parallel_size × data_parallel_size (EP = TP only when DP=1)
53+
```
54+
55+
So on a fixed node you don't tune EP — you tune the TP/DP split, which only changes
56+
the *attention* side:
57+
58+
| Layout (8 GPUs, all EP=8) | Attention | Best when |
59+
| --- | --- | --- |
60+
| `TP=1 DP=8` | 8 replicas, comm-free | **default** — one request's KV fits 1 GPU |
61+
| `TP=2 DP=4` | 4 replicas | need ~2× per-replica KV (long ctx) |
62+
| `TP=4 DP=2` | 2 replicas | ~4× per-replica KV, or weights too big for TP≤2 |
63+
| `TP=8 DP=1` | 1 replica | trillion-scale weights / one huge KV pool |
64+
65+
Down the table = more per-replica KV/weight room, fewer replicas, higher all-reduce
66+
cost; pick the **topmost row that fits**.
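
As a rough sketch of the table above, the snippet below enumerates the power-of-two TP/DP splits for a fixed world size and returns the first (topmost) one that passes a caller-supplied fit check. The helper names `tp_dp_layouts` and `topmost_fitting` and the `fits` predicate are hypothetical; the `ep` value assumes `--enable-expert-parallel` is set.

```python
def tp_dp_layouts(world_size):
    """List TP/DP splits in table order (smallest TP first).

    With expert parallelism enabled, EP spans the whole world size,
    so every row reports ep == tp * dp == world_size.
    """
    layouts = []
    tp = 1
    while tp <= world_size:
        if world_size % tp == 0:
            dp = world_size // tp
            layouts.append({"tp": tp, "dp": dp, "ep": tp * dp})
        tp *= 2
    return layouts


def topmost_fitting(world_size, fits):
    """Return the first layout for which fits(layout) is true, else None."""
    for layout in tp_dp_layouts(world_size):
        if fits(layout):
            return layout
    return None
```

For example, if a long-context task needs roughly twice the per-replica KV of a single GPU, a predicate such as `lambda l: l["tp"] >= 2` selects `TP=2 DP=4` on an 8-GPU node. A real fit check would come from the memory math and the vLLM startup log, not from this stand-in.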
67+
68+
**Dataflow (DP-attention + EP-MoE):** the DP and EP groups are the **same GPUs**.
69+
Attention is DP-local (no cross-rank comm); each MoE layer does a dispatch+combine
70+
**all-to-all** to route tokens to the rank owning their expert. So comm is all-to-all
71+
*only at MoE layers* (vs TP's per-layer all-reduce) — keep it **intra-node (NVLink)**.
72+
Data-dependent routing → uneven load; vLLM runs dummy passes on idle ranks, so spread
73+
load evenly.
74+
75+
**Enable for any MoE** (detect via `-A10B`/`-A3B`/`-A22B` handle, `num_experts` /
76+
`n_routed_experts` in `config.json`); **not for dense**; no-op at `TP=DP=1`.
77+
Cross-check `recipes.vllm.ai` for the validated layout, then adapt to your GPU count
78+
via the fit math.
79+
80+
## Layer 2 — concurrency (`parallelism` / `--max-num-seqs`)
81+
82+
- **`parallelism`** = requests the client keeps in flight *per benchmark*.
83+
- **`--max-num-seqs`** = sequences one replica decodes at once.
84+
85+
```text
86+
serving_capacity = max-num-seqs × DP × num_instances
87+
max-num-seqs = ceil(parallelism / (DP × num_instances)) # keep matched
88+
```
89+
90+
(TP/PP don't add capacity; replicas = DP, × `num_instances` for HAProxy — see
91+
`multi-node.md`.) `parallelism` above capacity just queues in vLLM (and risks
92+
`request_timeout`).
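
The two formulas above translate directly into Python. The function names are illustrative only; `num_instances` defaults to 1 for the single-deployment case.

```python
import math


def max_num_seqs(parallelism, dp, num_instances=1):
    """Per-replica --max-num-seqs that matches a client-side parallelism."""
    return math.ceil(parallelism / (dp * num_instances))


def serving_capacity(max_num_seqs_value, dp, num_instances=1):
    """Total sequences the deployment can decode at once."""
    return max_num_seqs_value * dp * num_instances


# parallelism=256 on DP=8 -> 32 per replica, capacity exactly 256.
print(max_num_seqs(256, 8), serving_capacity(32, 8))
# A non-multiple rounds up: parallelism=100 on DP=8 -> 13 per replica, capacity 104.
print(max_num_seqs(100, 8), serving_capacity(13, 8))
```

Rounding up keeps capacity at or above `parallelism`, so no request queues inside vLLM purely because of the sizing.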
93+
94+
**`parallelism` ceiling = the smaller of:**
95+
96+
1. **total requests** = `dataset_size × repeats` (`n_samples` for simple-evals/tau2,
97+
`num_repeats` for nemo-skills) — can't have more in flight than exist;
98+
2. **preemption-free capacity at the task's context** (KV-bound; below).
99+
100+
| Run | Set `parallelism` to |
101+
| --- | --- |
102+
| `total_requests ≤ capacity` (small) | `total_requests` (round up for uneven DP routing) → one wave |
103+
| `total_requests ≫ capacity` (large) | the **preemption-free** capacity at the task's context (often *below* nominal) |
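
A minimal sketch of the ceiling rule, assuming the preemption-free capacity has already been measured in a canary run. The function name is hypothetical, and the capacity of 512 in the usage note is borrowed from the starting value in the worked examples as an illustration, not a measured figure.

```python
def parallelism_ceiling(dataset_size, repeats, preemption_free_capacity):
    """Smaller of total requests and the preemption-free serving capacity."""
    total_requests = dataset_size * repeats
    return min(total_requests, preemption_free_capacity)
```

With 198 GPQA questions, `parallelism_ceiling(198, 1, 512)` returns 198 (request-bound, one wave) and `parallelism_ceiling(198, 8, 512)` returns 512 (capacity-bound). The sketch does not apply the round-up for uneven DP routing mentioned in the table; that adjustment is left to the operator.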
104+
105+
**Sizing `--max-num-seqs` vs KV** — capped by `context × concurrent seqs`; high
106+
`max_new_tokens` shrinks the batch. Read vLLM startup `# GPU blocks` /
107+
`Maximum concurrency for <max-model-len> tokens` (full-length floor — you fit more at
108+
shorter context). Canary: `Preempted N` → lower; KV usage ≪100% with no preemption →
109+
raise. **Relaxed by:** low-precision weights; **KV-cache quantization** — checkpoint
110+
`kv_cache_scheme` **or serve-time `--kv-cache-dtype fp8`** (`fp8_e4m3`/`fp8_e5m2`) in
111+
`deployment.command`, ~halving KV → ~2× concurrency/context (verify support; small
112+
accuracy effect); and **hybrid/linear-attention** (near-constant KV).
113+
114+
## Balanced sizing — bigger is NOT always faster (esp. long context)
115+
116+
Past the KV-fit point throughput doesn't just plateau, it **regresses** — worst for
117+
long-context / long-output:
118+
119+
1. **Preemption thrash** — over-admitted seqs get preempted; recomputing a ~120K
120+
prefill is huge wasted work, so a modest preemption-free concurrency finishes *sooner*.
121+
2. **Prefill/decode contention** — many long prefills split `--max-num-batched-tokens`
122+
and starve decode.
123+
3. **Timeout cascade** — too many in-flight → p99 > `request_timeout` → `max_retries`
124+
resubmissions pile on more load.
125+
126+
Sustainable concurrency is **context-dependent** — a `parallelism` good for GPQA
127+
(short) thrashes AA-LCR (~120K). **Rule:** target ~**70–80% of the preemption-free
128+
KV-fit concurrency at the task's working context × DP**; give long-context/long-output
129+
tasks a **lower per-task override**; canary-tune up only while throughput↑,
130+
preemption≈0, p99 < `request_timeout`; **err low** for long context (too-small mildly
131+
underutilizes; too-large is *multiples* slower).
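
The 70–80% rule can be written as a one-line calculation. This is a sketch under assumptions: `per_replica_kv_fit` stands for the preemption-free concurrency one replica sustains at the task's working context (read from a canary, not computed here), and the function name and the default of 75% are illustrative.

```python
def target_parallelism(per_replica_kv_fit, dp, percent=75):
    """Starting parallelism: a percentage of KV-fit concurrency times DP.

    Uses integer arithmetic so the result is an exact whole number.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return per_replica_kv_fit * dp * percent // 100
```

If a canary shows about 40 preemption-free sequences per replica on DP=8, `target_parallelism(40, 8)` suggests starting at 240 rather than the nominal 320; for long-context tasks, a lower `percent` matches the err-low advice.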
132+
133+
## Suites — set `parallelism` per task, not per run
134+
135+
Suite tasks hit **different bottlenecks** against one deployment; use a top-level
136+
default for short model-bound tasks and override the outliers:
137+
138+
| Bottleneck | AA tasks | Cap by |
139+
| --- | --- | --- |
140+
| Model / GPU KV (short) | `gpqa_diamond_aa_v3`, `ns_ifbench` | top-level default (preemption-free KV-fit) |
141+
| Long-context KV (~120K) | `ns_aa_lcr` | **low** override — prefill thrash; MLA ≫ GQA |
142+
| Judge / user-sim rate limit | `ns_hle_aa`, `ns_aa_lcr`, `tau2_bench_telecom` | judge endpoint 429s, **not** the model |
143+
| Sandbox execution | `ns_scicode` | sandbox slots |
144+
145+
- Judge/sandbox tasks bottleneck **before** the model — over-parallelizing yields
146+
429s/retries, not speed; cap to the endpoint, tune by *its* errors.
147+
- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
148+
busiest task) even if long-context tasks run lower.
149+
- Canary each class (model / judge / sandbox) separately. Endpoint/context-dependent
150+
tasks (`ns_aa_lcr`, `tau2_bench_telecom`) ship `parallelism: ???` to force a choice.
151+
152+
## Worked examples (8×B200)
153+
154+
- **Dense 9B NVFP4** (~5–6 GB) → **TP=1/DP=8, no EP**. GPQA `n_samples=1` = 198 reqs
155+
(request-bound) → `parallelism=256`, `max-num-seqs=32`. `n_samples=8` = 1584
156+
(capacity-bound) → start 512; tune up only while preemption≈0 (~82K reasoning output
157+
→ knee may be <1024).
158+
- **Dense ~70B BF16, 8×H100/80GB** (~140 GB) → won't fit 1 GPU → **TP=2/DP=4, no EP**.
159+
- **Large MoE ~235B-A22B** → EP on; layout `DP=8 + EP` (or `TP=8 + EP` if one replica
160+
needs the full node for KV).
161+
- **Trillion-scale MoE (Kimi-class ~1T, MLA) — bit-width flips the split:** FP8
162+
(~1040 GB) is weight-bound → forced **TP=8/DP=1/EP**; 4-bit INT4/NVFP4 (~520–572 GB)
163+
frees room → **TP=1/DP=8/EP**. INT4 ≈ NVFP4 → same layout (don't let `moonshotai/…`
164+
vs `nvidia/…-NVFP4` mislead) — same reason a 4-bit Kimi needing TP=8 on 8×H200/640GB
165+
switches to TP=1/DP=8 on 8×B200.
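
The weight sizes quoted in these examples follow from `params × bytes/param`. The sketch below reproduces a few of them; the helper name is hypothetical, and the parameter count of 1040 billion for the Kimi-class model is an assumption inferred from the ~1040 GB FP8 figure above, not a number stated in the source.

```python
def weights_gb(params_billions, bytes_per_param):
    """Approximate weight memory in GB (1 GB taken as 1e9 bytes)."""
    return round(params_billions * bytes_per_param, 1)


print(weights_gb(70, 2.0))     # dense ~70B at BF16
print(weights_gb(1040, 1.0))   # trillion-scale MoE at FP8
print(weights_gb(1040, 0.5))   # same model at 4-bit, low end
print(weights_gb(1040, 0.55))  # same model at 4-bit, with scale overhead
```

These are weights only; KV cache, activations, and runtime overhead still have to fit on top, which is why the fit check uses `GPU_mem × util` rather than raw HBM size.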
