Commit 386b498

cjluo-nv and claude committed
parallelism.md: add balanced-sizing strategy (bigger is not always faster)
Higher parallelism is not universally better: past the KV-fit point it regresses, worst for long-context / long-output tasks. Document the three mechanisms (preemption thrash on huge prefills, prefill/decode contention, latency->timeout->retry cascade), that sustainable concurrency is context-dependent, and a balanced rule: target ~70-80% of the preemption-free KV-fit concurrency at the task's working context, give long-context tasks a lower per-task parallelism override, tune by canary while throughput rises / preemption ~0 / p99 within request_timeout, and err low when unsure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
1 parent 3604bea commit 386b498

1 file changed

Lines changed: 49 additions & 11 deletions

.claude/skills/evaluation/references/parallelism.md

@@ -194,10 +194,11 @@ speedup, and deep queues can trip `request_timeout`.
194194
| Situation | Set `parallelism` to | Why |
195195
| --- | --- | --- |
196196
| `total_requests ≤ serving_capacity` (small run) | `total_requests` (round up a little for uneven DP routing) | All requests dispatch at once → one wave → finishes in ~one generation-time. Higher is wasted. |
197-
| `total_requests ≫ serving_capacity` (large run) | `serving_capacity` (largest the GPUs sustain) | Throughput-bound: keep every decode slot full until the queue drains. Request count no longer matters. |
197+
| `total_requests ≫ serving_capacity` (large run) | the **preemption-free** capacity at the *task's* context — often *below* nominal `serving_capacity` (see Balanced sizing) | Throughput-bound: keep decode slots full *without thrashing*. Request count no longer matters; KV headroom does. |
198198

199-
So "set it higher" is right **only up to the request count**; past that you just
200-
over-reserve KV.
199+
So "set it higher" is right **only up to the request count** for small runs; for
200+
large runs it's right **only up to the preemption-free point** — past that you don't
201+
just over-reserve KV, you *regress* (next section).
201202

202203
## Sizing `--max-num-seqs` against KV cache
203204

@@ -220,20 +221,56 @@ Factors that **relax** the KV limit: small / low-precision weights (more HBM for
220221
KV), **KV-cache quantization** (`kv_cache_scheme` in `config.json`), and
221222
**hybrid / linear-attention** layers (near-constant state instead of growing KV).
222223

223-
## Diminishing returns
224-
225-
Decode throughput saturates HBM bandwidth at some batch size; beyond that knee,
226-
more sequences add latency without adding tokens/sec. Goal = **largest batch with
227-
~zero preemption**, not the max the config accepts.
224+
## Balanced sizing: bigger is not always faster (especially long context)
225+
226+
Decode throughput first saturates HBM bandwidth (more sequences stop adding
227+
tokens/sec) and then, past the KV-fit point, **regresses** — and the regression is
228+
worst for long-context / long-output tasks. Three mechanisms:
229+
230+
1. **Preemption thrash.** When admitted sequences exceed what KV holds, vLLM preempts
231+
(recompute or swap). Recompute discards a partially-finished decode — and
232+
re-running a ~120K-token prefill is enormous wasted work. A modest,
233+
preemption-free concurrency finishes *sooner* than a high one that thrashes.
234+
2. **Prefill/decode contention.** Long inputs = huge prefills. With
235+
`--max-num-batched-tokens` fixed, many concurrent long prefills split that budget
236+
and starve decode — everything crawls.
237+
3. **Latency → timeout → retry cascade.** Too many in-flight requests shrink each
238+
one's compute share; p99 latency climbs past `request_timeout`, triggering
239+
`max_retries` resubmissions that pile *more* load onto an already-saturated server.
240+
241+
**Sustainable concurrency is context-dependent.** vLLM's startup
242+
`Maximum concurrency for <max-model-len> tokens` is the *full-length floor*; at a
243+
task's actual working length you fit more (short tasks) — but for long-context tasks
244+
only a handful. So a `parallelism` that's ideal for GPQA (short prompt) will thrash
245+
AA-LCR (~120K input). **Never inherit a short task's `parallelism` for a long one.**
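
As a rough illustration of why the same deployment fits many short sequences but only a handful of long ones, the sketch below divides a replica's KV token budget by a task's working context. The function name, the 1,000,000-token KV budget, and the task lengths are all hypothetical; it also assumes each sequence eventually holds KV for its full prompt plus generation, and ignores prefix caching, block rounding, and hybrid layers.

```python
def kv_fit_concurrency(kv_cache_tokens: int,
                       prompt_tokens: int,
                       expected_output_tokens: int) -> int:
    """Estimate how many sequences fit in one replica's KV cache at once.

    Assumption: every sequence eventually occupies KV for its whole
    working context (prompt + expected generation).
    """
    working_context = prompt_tokens + expected_output_tokens
    return kv_cache_tokens // working_context


# Hypothetical replica with room for 1,000,000 KV tokens.
KV_TOKENS = 1_000_000

# Hypothetical short task: 1K prompt + 7K generation.
short_fit = kv_fit_concurrency(KV_TOKENS, 1_000, 7_000)

# Hypothetical long-context task: ~120K prompt + 8K generation.
long_fit = kv_fit_concurrency(KV_TOKENS, 120_000, 8_000)

print(short_fit, long_fit)
```

With these made-up numbers the short task fits 125 sequences per replica and the long one fits 7, which is the gap the paragraph above warns about.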
246+
247+
**Balanced rule:**
248+
249+
- Target `parallelism` ≈ **70–80% of the preemption-free KV-fit concurrency at the
250+
task's working context** (prompt + expected generation) × DP — not the model's
251+
nominal max. The 20–30% margin absorbs length variance and uneven DP routing.
252+
- **Per-task override for long-context / long-output tasks** (AA-LCR, big
253+
`max_new_tokens` reasoning): set a *lower* `parallelism` under that task's `params`;
254+
don't let the higher top-level value apply.
255+
- **Tune empirically (canary), raising only while ALL THREE hold:** throughput
256+
(req/s) rises, preemption ≈ 0, and p99 latency stays within `request_timeout`. Stop
257+
at the first that breaks — that's the knee; back off ~20%.
258+
- **When unsure, err low for long context.** A slightly-too-small `parallelism` only
259+
mildly underutilizes the GPUs; a too-large one thrashes and can be *multiples*
260+
slower. Goal = **largest batch with ~zero preemption**, not the max the config accepts.
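
The balanced rule can be sketched as a starting value plus a canary check. This is a minimal illustration, not part of any tool: `CanaryResult`, `starting_parallelism`, and `pick_parallelism` are hypothetical names, the measurements are assumed to be collected by hand from canary runs, and "back off ~20%" is read here as 80% of the first setting that breaks a condition.

```python
import math
from dataclasses import dataclass


@dataclass
class CanaryResult:
    parallelism: int
    throughput_rps: float   # completed requests per second
    preemptions: int        # preemption count observed during the run
    p99_latency_s: float    # p99 request latency in seconds


def starting_parallelism(kv_fit_per_replica: int, dp: int,
                         margin: float = 0.75) -> int:
    """70-80% of the preemption-free KV-fit concurrency, times DP."""
    return max(1, math.floor(margin * kv_fit_per_replica * dp))


def pick_parallelism(results: list[CanaryResult],
                     request_timeout_s: float,
                     backoff: float = 0.8) -> int:
    """Walk canary runs in increasing parallelism; stop at the knee."""
    last_good = None
    for run in sorted(results, key=lambda r: r.parallelism):
        throughput_rose = (last_good is None
                           or run.throughput_rps > last_good.throughput_rps)
        no_preemption = run.preemptions == 0
        within_timeout = run.p99_latency_s <= request_timeout_s
        if not (throughput_rose and no_preemption and within_timeout):
            # First setting that breaks a condition: treat it as the knee
            # and back off ~20% from it (an assumed interpretation).
            return max(1, int(run.parallelism * backoff))
        last_good = run
    # Every run held all three conditions: keep the largest tested value.
    return last_good.parallelism if last_good else 1
```

For example, with 7 sequences fitting per replica and DP=8, `starting_parallelism(7, 8)` gives 42. If canaries at 32 and 64 are clean but 128 shows preemptions, `pick_parallelism` returns 102; that value was not itself tested, so it is worth one more canary run before committing to it.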
228261

229262
## Non-GPU caps
230263

231264
- **Judge / user-sim tasks** (HLE, AA-LCR, Tau2-Bench Telecom): `parallelism` is
232265
often capped by the **judge's rate limit**, not the served model. Start
233266
conservative; raise only after judge logs are clean. Use a per-task `parallelism`
234267
override when its ceiling differs (e.g. Tau2 cap 512).
268+
- **Context length is itself a per-task cap.** Long-context / long-output tasks need
269+
a *lower* `parallelism` than short ones on the same deployment — give them an
270+
explicit per-task override (see Balanced sizing), don't reuse the top-level value.
235271
- **Per-task overrides:** size `--max-num-seqs` off the **max** `parallelism` across
236-
the top-level and all per-task overrides.
272+
the top-level and all per-task overrides (the deployment must support the busiest
273+
task), even though long-context tasks themselves run at a lower `parallelism`.
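
The per-task override rule amounts to sizing the deployment for the busiest task. A minimal sketch, assuming `--max-num-seqs` is per replica and derived as `ceil(parallelism / DP)`; the function name and the task keys are hypothetical:

```python
import math


def max_num_seqs_for(top_level_parallelism: int,
                     per_task_overrides: dict[str, int],
                     dp: int) -> int:
    """Size --max-num-seqs off the largest parallelism any task will use."""
    busiest = max([top_level_parallelism, *per_task_overrides.values()])
    return math.ceil(busiest / dp)


# A long-context task overridden downward does not shrink the deployment.
print(max_num_seqs_for(256, {"long_context_task": 16}, dp=8))

# A task overridden upward does grow it.
print(max_num_seqs_for(256, {"high_ceiling_task": 512}, dp=8))
```

The first call yields 32 and the second 64: the lower override only throttles how many requests that task sends, while the server is still provisioned for the highest value.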
237274

238275
---
239276

@@ -243,8 +280,9 @@ more sequences add latency without adding tokens/sec. Goal = **largest batch wit
243280
with huge KV headroom → **TP=1, DP=8, no EP.** Concurrency: GPQA Diamond = 198
244281
questions; `n_samples=1` → 198 requests (request-bound) → `parallelism=256`,
245282
`max-num-seqs=ceil(256/8)=32`. `n_samples=8` → 1,584 requests (capacity-bound) →
246-
start `parallelism=512` (`max-num-seqs=64`), then tune from vLLM's max-concurrency
247-
- preemption (toward 768–1024 if KV has headroom).
283+
start `parallelism=512` (`max-num-seqs=64`), then tune up **only while preemption
284+
stays ≈ 0** — GPQA's reasoning outputs run to ~82K tokens, so the knee may sit well
285+
below 1024; watch the preemption counter rather than assuming KV headroom.
248286

249287
**Dense ~70B BF16, 8×H100 (80 GB).** ~140 GB weights → won't fit one GPU; TP=2
250288
(~70 GB/GPU + KV) fits → **TP=2, DP=4, no EP.** `serving_capacity = max-num-seqs ×
