parallelism.md: add balanced-sizing strategy (bigger is not always faster)
Higher parallelism is not universally better — past the KV-fit point it
regresses, worst for long-context / long-output tasks. Document the three
mechanisms (preemption thrash on huge prefills, prefill/decode contention,
latency->timeout->retry cascade), that sustainable concurrency is
context-dependent, and a balanced rule: target ~70-80% of the preemption-free
KV-fit concurrency at the task's working context, give long-context tasks a
lower per-task parallelism override, tune by canary while throughput rises /
preemption ~0 / p99 within request_timeout, and err low when unsure.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
.claude/skills/evaluation/references/parallelism.md: 49 additions & 11 deletions (+49 -11)
@@ -194,10 +194,11 @@ speedup, and deep queues can trip `request_timeout`.
 | Situation | Set `parallelism` to | Why |
 | --- | --- | --- |
 | `total_requests ≤ serving_capacity` (small run) | `total_requests` (round up a little for uneven DP routing) | All requests dispatch at once → one wave → finishes in ~one generation-time. Higher is wasted. |
-| `total_requests ≫ serving_capacity` (large run) | `serving_capacity` (largest the GPUs sustain) | Throughput-bound: keep every decode slot full until the queue drains. Request count no longer matters. |
+| `total_requests ≫ serving_capacity` (large run) | the **preemption-free** capacity at the *task's* context — often *below* nominal `serving_capacity` (see Balanced sizing) | Throughput-bound: keep decode slots full *without thrashing*. Request count no longer matters; KV headroom does. |
 
-So "set it higher" is right **only up to the request count**; past that you just
-over-reserve KV.
+So "set it higher" is right **only up to the request count** for small runs; for
+large runs it's right **only up to the preemption-free point** — past that you don't
+just over-reserve KV, you *regress* (next section).
 
 ## Sizing `--max-num-seqs` against KV cache
 
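Reviewer sketch for the table in this hunk: a minimal Python rendering of the two rows, under stated assumptions. The function name `choose_parallelism` and its arguments are hypothetical, and "round up a little" is interpreted here as rounding up to a multiple of the DP size, which the source does not specify.

```python
import math


def choose_parallelism(total_requests: int, preemption_free_capacity: int, dp: int = 1) -> int:
    """Pick a `parallelism` value following the two table rows.

    preemption_free_capacity is assumed to be measured at the task's
    working context, not the nominal serving capacity.
    """
    if total_requests <= 0 or preemption_free_capacity <= 0 or dp <= 0:
        raise ValueError("all arguments must be positive")
    if total_requests <= preemption_free_capacity:
        # Small run: one wave. Round up to a multiple of dp so uneven
        # routing across replicas does not leave a request waiting.
        return math.ceil(total_requests / dp) * dp
    # Large run: throughput-bound, so cap at the preemption-free point.
    return preemption_free_capacity


print(choose_parallelism(198, 512, dp=8))   # small run
print(choose_parallelism(5000, 512, dp=8))  # large run
```

The large-run branch deliberately ignores the request count, matching the "Request count no longer matters" cell.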
@@ -220,20 +221,56 @@ Factors that **relax** the KV limit: small / low-precision weights (more HBM for
 KV), **KV-cache quantization** (`kv_cache_scheme` in `config.json`), and
 **hybrid / linear-attention** layers (near-constant state instead of growing KV).
 
-## Diminishing returns
-
-Decode throughput saturates HBM bandwidth at some batch size; beyond that knee,
-more sequences add latency without adding tokens/sec. Goal = **largest batch with
-~zero preemption**, not the max the config accepts.
+## Balanced sizing: bigger is not always faster (especially long context)
+
+Decode throughput first saturates HBM bandwidth (more sequences stop adding
+tokens/sec) and then, past the KV-fit point, **regresses** — and the regression is
+worst for long-context / long-output tasks. Three mechanisms:
+
+1. **Preemption thrash.** When admitted sequences exceed what KV holds, vLLM preempts
+   (recompute or swap). Recompute discards a partially-finished decode — and
+   re-running a ~120K-token prefill is enormous wasted work. A modest,
+   preemption-free concurrency finishes *sooner* than a high one that thrashes.
+2. **Prefill/decode contention.** Long inputs = huge prefills. With
+   `--max-num-batched-tokens` fixed, many concurrent long prefills split that budget
+   and starve decode — everything crawls.
+3. **Latency → timeout → retry cascade.** Too many in-flight requests shrink each
+   one's compute share; p99 latency climbs past `request_timeout`, triggering
+   `max_retries` resubmissions that pile *more* load onto an already-saturated server.
+
+**Sustainable concurrency is context-dependent.** vLLM's startup
+`Maximum concurrency for <max-model-len> tokens` is the *full-length floor*; at a
+task's actual working length you fit more (short tasks) — but for long-context tasks
+only a handful. So a `parallelism` that's ideal for GPQA (short prompt) will thrash
+AA-LCR (~120K input). **Never inherit a short task's `parallelism` for a long one.**
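Reviewer sketch for the "context-dependent" paragraph: one way to turn the startup figure into a per-task estimate. It assumes KV usage grows linearly with sequence length (full attention, no sliding-window or hybrid layers) and treats the startup figure as a multiplier at `max-model-len`. The function name and the example numbers (a 4.0x figure at 131072 tokens, the two working lengths) are illustrative, not taken from the source.

```python
def kv_fit_concurrency(max_concurrency_at_full_len: float,
                       max_model_len: int,
                       working_len: int) -> int:
    """Estimate how many sequences of working_len tokens fit in KV cache.

    max_concurrency_at_full_len: the startup "Maximum concurrency" figure,
        i.e. how many full-length sequences fit at once.
    working_len: prompt + expected generation for the task.
    """
    if not 0 < working_len <= max_model_len:
        raise ValueError("working_len must be in (0, max_model_len]")
    # Total KV budget in tokens, recovered from the full-length floor.
    kv_token_budget = max_concurrency_at_full_len * max_model_len
    return int(kv_token_budget // working_len)


# Hypothetical deployment: 4.0 full-length sequences at 131072 tokens.
short_task = kv_fit_concurrency(4.0, 131072, working_len=4096)
long_task = kv_fit_concurrency(4.0, 131072, working_len=128000)
print(short_task, long_task)
```

Under these assumed numbers the short task fits 128 sequences per replica while the long one fits 4, which is the gap the paragraph warns about.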
+
+**Balanced rule:**
+
+- Target `parallelism` ≈ **70–80% of the preemption-free KV-fit concurrency at the
+  task's working context** (prompt + expected generation) × DP — not the model's
+  nominal max. The 20–30% margin absorbs length variance and uneven DP routing.
+- **Per-task override for long-context / long-output tasks** (AA-LCR, big
+  `max_new_tokens` reasoning): set a *lower* `parallelism` under that task's `params`;
+  don't let the higher top-level value apply.
+- **Tune empirically (canary), raising only while ALL THREE hold:** throughput
+  (req/s) rises, preemption ≈ 0, and p99 latency stays within `request_timeout`. Stop
+  at the first that breaks — that's the knee; back off ~20%.
+- **When unsure, err low for long context.** A slightly-too-small `parallelism` only
+  mildly underutilizes the GPUs; a too-large one thrashes and can be *multiples*
+  slower. Goal = **largest batch with ~zero preemption**, not the max the config accepts.
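Reviewer sketch for the balanced rule: the first and third bullets expressed as Python, under stated assumptions. `balanced_parallelism`, `CanaryRun`, and `pick_parallelism` are hypothetical names, the 0.75 margin is the midpoint of the stated 70-80% range, and "back off ~20%" is read conservatively as 80% of the first failing value, capped at the last value that passed. The source does not say which of those two the back-off applies to.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def balanced_parallelism(kv_fit_per_replica: int, dp: int = 1, margin: float = 0.75) -> int:
    """Target ~70-80% of the preemption-free KV-fit concurrency, times DP."""
    if not 0.0 < margin <= 1.0:
        raise ValueError("margin must be in (0, 1]")
    return max(1, int(kv_fit_per_replica * margin)) * dp


@dataclass
class CanaryRun:
    parallelism: int
    throughput_rps: float   # completed requests per second
    preemptions: int        # preemption events observed during the run
    p99_latency_s: float


def pick_parallelism(runs: Sequence[CanaryRun], request_timeout_s: float,
                     backoff: float = 0.8, max_preemptions: int = 0) -> Optional[int]:
    """Walk canaries in increasing parallelism; stop at the first broken condition."""
    last_good: Optional[CanaryRun] = None
    prev_throughput = 0.0
    for run in sorted(runs, key=lambda r: r.parallelism):
        healthy = (run.throughput_rps > prev_throughput
                   and run.preemptions <= max_preemptions
                   and run.p99_latency_s <= request_timeout_s)
        if not healthy:
            if last_good is None:
                return None  # even the smallest canary failed
            # Knee found: back off, but never above a value that passed.
            return min(last_good.parallelism, int(run.parallelism * backoff))
        last_good, prev_throughput = run, run.throughput_rps
    return last_good.parallelism if last_good else None


runs = [CanaryRun(8, 1.0, 0, 100.0), CanaryRun(16, 1.9, 0, 140.0),
        CanaryRun(32, 2.1, 12, 400.0)]
print(balanced_parallelism(4, dp=8), pick_parallelism(runs, request_timeout_s=300.0))
```

Capping at the last passing value follows the "err low" bullet: the result is always a concurrency that was actually observed to be healthy, or lower.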
 
 ## Non-GPU caps
 
 - **Judge / user-sim tasks** (HLE, AA-LCR, Tau2-Bench Telecom): `parallelism` is
   often capped by the **judge's rate limit**, not the served model. Start
   conservative; raise only after judge logs are clean. Use a per-task `parallelism`
   override when its ceiling differs (e.g. Tau2 cap 512).
+- **Context length is itself a per-task cap.** Long-context / long-output tasks need
+  a *lower* `parallelism` than short ones on the same deployment — give them an
+  explicit per-task override (see Balanced sizing), don't reuse the top-level value.
 - **Per-task overrides:** size `--max-num-seqs` off the **max** `parallelism` across
-  the top-level and all per-task overrides.
+  the top-level and all per-task overrides (the deployment must support the busiest
+  task), even though long-context tasks themselves run at a lower `parallelism`.
 
 ---
 
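Reviewer sketch for the "Per-task overrides" bullet: sizing off the busiest task rather than the top-level value. The function name, the task keys, and the example numbers are hypothetical. Dividing by the DP size assumes `--max-num-seqs` applies per replica and that requests spread evenly, so treat the result as a lower bound when routing is uneven.

```python
import math
from typing import Mapping


def required_max_num_seqs(top_level: int, per_task: Mapping[str, int], dp: int = 1) -> int:
    """Smallest per-replica --max-num-seqs that supports the busiest task."""
    if dp <= 0:
        raise ValueError("dp must be positive")
    # The deployment must serve the highest parallelism any task will use,
    # even if long-context tasks override to something much lower.
    peak = max([top_level, *per_task.values()])
    return math.ceil(peak / dp)


overrides = {"aa_lcr": 16, "tau2": 512}  # hypothetical per-task overrides
print(required_max_num_seqs(256, overrides, dp=8))
```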
@@ -243,8 +280,9 @@ more sequences add latency without adding tokens/sec. Goal = **largest batch wit
 with huge KV headroom → **TP=1, DP=8, no EP.** Concurrency: GPQA Diamond = 198