Commit d622241

[Blog] Benchmarking Prefill–Decode: fixed 1:3 as a strong default
Final edits
1 parent 4abb315 commit d622241

1 file changed: 43 additions & 43 deletions
@@ -1,5 +1,5 @@
 ---
-title: "Benchmarking Prefill–Decode: fixed 1:3 as a strong default"
+title: "Benchmarking Prefill–Decode ratios: fixed vs dynamic"
 date: 2025-09-25
 description: "TBA"
 slug: benchmarking-pd-ratios
@@ -8,11 +8,10 @@ categories:
 - Benchmarks
 ---
 
-# Benchmarking Prefill–Decode: fixed 1:3 as a strong default
+# Benchmarking Prefill–Decode ratios: fixed vs dynamic
 
-As demand for low-latency LLM inference grows, squeezing more useful work out of every GPU minute is critical.
-This benchmark evaluates how the Prefill–Decode worker disaggregatioh ratio affects performance across workload profiles and concurrency levels,
-and assess if dynamic ratio adjustment adds value.
+This benchmark investigates whether the Prefill–Decode worker ratio needs to be managed dynamically at runtime, or if a fixed split can deliver the same performance with simpler orchestration.
+We evaluate different ratios across workload profiles and concurrency levels to measure their impact on TTFT, ITL, and throughput, and to see whether fixing the ratio in advance is a practical alternative to dynamic adjustment.
 
 <img src="https://dstack.ai/static-assets/static-assets/images/benchmarking-pd-ratios.png" width="630" />
 
@@ -22,45 +21,49 @@ and assess if dynamic ratio adjustment adds value.
 
 ### What is Prefill–Decode disaggregation?
 
-DistServe ([Zhong et al., 2024 :material-arrow-top-right-thin:{ .external }](https://arxiv.org/pdf/2401.09670){:target="_blank"}) proposes prefill–decode disaggregation, separating the two phases of inference across dedicated workers.
-Prefill can be heavily batched—prompt tokens are processed in parallel—so it is compute-intensive. Decode is intrinsically sequential—one token per iteration with full KV-cache access—so it is memory- and bandwidth-intensive. Disaggregating these phases reduces cross-phase interference, allowing hardware to be provisioned for the dominant bottleneck and improving end-to-end service performance.
+LLM inference has two distinct phases: prefill and decode. Prefill processes all prompt tokens in parallel and is compute-intensive. Decode generates tokens one by one, repeatedly accessing the KV-cache, making it memory- and bandwidth-intensive. DistServe ([Zhong et al., 2024 :material-arrow-top-right-thin:{ .external }](https://arxiv.org/pdf/2401.09670){:target="_blank"}) introduced prefill–decode disaggregation to separate these phases across dedicated workers, reducing interference and enabling hardware to be allocated more efficiently.
 
-### What is the Prefill–Decode ratio?
+### What is the prefill–decode ratio?
 
-The optimal split between prefill and decode workers depends on service-level objectives (SLOs) and workload shape. DistServe shows that for input sequence length (ISL) = 512 and output sequence length (OSL) = 64, "2 prefill to 1 decode" meets both TTFT and TPOT targets. Beyond this illustrative case, however, DistServe does not systematically explore other Prefill–Decode ratios.
+The ratio of prefill to decode workers determines how much capacity is dedicated to each phase. DistServe showed that for a workload with ISL=512 and OSL=64, a 2:1 ratio met both TTFT and TPOT targets. But this example does not answer how the ratio should be chosen more generally, or whether it needs to change at runtime.
 
 !!! info "Reasoning model example"
-    In the DeepSeek deployment ([LMSYS, 2025 :material-arrow-top-right-thin:{ .external }](https://lmsys.org/blog/2025-05-05-large-scale-ep){:target="_blank"}), 3 nodes were allocated to prefill and 9 to decode. The decode-heavy split reflects reasoning workloads, where chains of thought push output lengths high. Allocating more capacity to decode reduces inter-token latency and keeps long responses streaming smoothly.
+    In the DeepSeek deployment ([LMSYS, 2025 :material-arrow-top-right-thin:{ .external }](https://lmsys.org/blog/2025-05-05-large-scale-ep){:target="_blank"}), the ratio was 1:3. This decode-leaning split reflects reasoning workloads, where long outputs dominate. Allocating more workers to decode reduces inter-token latency and keeps responses streaming smoothly.
 
-### Dynamic ratio adjustment
+### Dynamic ratio
 
-Dynamic allocation adjusts the split between prefill and decode workers at runtime. NVIDIA’s [SLA-based planner :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/dynamo/latest/architecture/sla_planner.html){:target="_blank"}
-estimates the workers needed to meet TTFT and ITL targets, while the [Load-based planner :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/dynamo/latest/architecture/load_planner.html){:target="_blank"}
-reallocates workers using KV-cache and queue signals. These planners describe how to move capacity between phases, but they do not prescribe a specific Prefill–Decode ratio.
+Dynamic approaches, such as NVIDIA’s [SLA-based :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/dynamo/latest/architecture/sla_planner.html){:target="_blank"}
+and [Load-based :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/dynamo/latest/architecture/load_planner.html){:target="_blank"} planners, adjust the ratio at runtime according to SLO targets or load. However, they do this in conjunction with auto-scaling, which increases orchestration complexity. This raises the question: does the prefill–decode ratio really need to be dynamic, or can a fixed ratio be chosen ahead of time and still provide robust performance?
 
 ## Benchmark purpose
 
-Prior art points to different “best” ratios depending on workload: DistServe’s 2:1 for short outputs, the SGLang DeepSeek example’s 1:3 for long outputs, and dynamic planners that adapt the split in real time. Building on these insights, this benchmark evaluates how the Prefill–Decode worker ratio affects performance across workload profiles and concurrency levels.
+The aim of this benchmark is to test whether the prefill–decode ratio must be adjusted dynamically at runtime, or if a fixed split can perform just as well.
 
-We measure TTFT, ITL, and throughput to understand how allocation choices influence both latency and efficiency—and to assess when dynamic ratio adjustment adds value versus when a fixed ratio suffices for a known workload.
+If a fixed ratio works across workload profiles and concurrency levels, it would mean the ratio can be chosen ahead of time, simplifying orchestration by removing the need for runtime ratio management.
 
-??? info "Why these metrics matter"
-
-    * **TTFT** (Time to First Token) captures perceived responsiveness—crucial for interactive experiences (e.g., support bots, code assistants).
-    * **ITL** (inter-token latency) captures streaming smoothness—critical for long, reasoning-style outputs.
-    * **Throughput** (tokens/sec) reflects cost efficiency. Prefill-heavy tasks (e.g., summarization of long docs) stress prefill; reasoning tasks stress decode. Maintaining high throughput ensures the under-stressed phase doesn’t leave GPUs idle.
+We evaluate different ratios across workload types (prefill-heavy, decode-heavy, balanced) and concurrency levels to see how each affects TTFT, ITL, and throughput.
 
 ## Methodology
 
-We ran a single-node study on 8xH200 GPUs, varying the number of prefill and decode workers to examine how the split shapes performance. We compared three prefill-decode ratios—3:1, 2:2, 1:3 both lower and higher request concurrency for three workload profiles:
+To test this, we benchmarked different fixed prefill–decode ratios under varying workload profiles and concurrency levels. The experiments were run on a single node with 8xH200 GPUs, using SGLang to serve the model.
+
+We compared three ratios—3:1, 2:2, and 1:3—at both low and high concurrency across three workload types:
 
 * **Prefill-heavy** (ISL > OSL) — e.g., summarization: long inputs, short outputs.
 * **Decode-heavy** (ISL < OSL) — e.g., reasoning: short inputs, long chains of thought.
 * **Balanced** (ISL ≈ OSL) — e.g., translation, paraphrasing.
 
-Lower concurrency highlights intrinsic trade-offs (prefill-leaning improves TTFT; decode-leaning improves ITL and throughput). Higher concurrency reveals the true bottleneck. In real deployments, success is meeting TTFT/ITL SLOs and sustaining throughput for cost efficiency, so we evaluate both.
+Lower concurrency highlights intrinsic trade-offs (prefill-leaning improves TTFT; decode-leaning improves ITL and throughput). Higher concurrency reveals the true bottleneck. In real deployments, success means meeting TTFT/ITL SLOs and sustaining throughput for cost efficiency, so we evaluate both.
+
+To evaluate performance, we measured TTFT, ITL, and throughput to capture both latency and efficiency.
 
-> A single-node design isolates the question at hand—does adjusting the prefill/decode split improve performance? If a benefit doesn’t manifest on one node, scaling out will typically amplify the same dynamics rather than change them.
+??? info "Why these metrics matter"
+
+    * **TTFT** (Time to First Token) captures perceived responsiveness—crucial for interactive experiences (e.g., support bots, code assistants).
+    * **ITL** (inter-token latency) captures streaming smoothness—critical for long, reasoning-style outputs.
+    * **Throughput** (tokens/sec) reflects cost efficiency. Prefill-heavy tasks (e.g., summarization of long docs) stress prefill; reasoning tasks stress decode. Maintaining high throughput ensures the under-stressed phase doesn’t leave GPUs idle.
+
+If a fixed ratio consistently performs well across these metrics, it would indicate that the ratio can be chosen ahead of time, without requiring runtime adjustment.
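The two latency metrics above can be derived directly from the per-token arrival timestamps of a streamed response. A minimal sketch, where the helper name and timestamp format are illustrative assumptions rather than part of the benchmark harness:

```python
def ttft_and_mean_itl(request_start, token_times):
    """Derive TTFT and mean inter-token latency (ITL) from arrival
    timestamps (seconds) of a streamed response.
    Hypothetical helper for illustration, not the benchmark harness."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # TTFT: delay until the first streamed token arrives
    ttft = token_times[0] - request_start
    # ITL: average gap between consecutive tokens after the first
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl
```

Prefill capacity dominates the first term (TTFT), while decode capacity dominates the gaps, which is why the worker ratio trades one off against the other.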

 ## Benchmark setup

@@ -79,11 +82,11 @@ At higher concurrency, 1:3 wins across all metrics. Because TTFT = prefill time
 
 In practice, summarization rarely has tight TTFT SLOs—users expect some delay after uploading long documents. Throughput and ITL dominate cost and experience, making 1:3 the recommended split for prefill-heavy workloads at both low and high concurrency.
 
-*TBA: Fig-1: ISL 2048, OSL 128, concurrency 32*
+<img src="https://dstack.ai/static-assets/static-assets/images/benchmarking-pd-ratios-fig-1.png" width="750" />
 
 > Metrics are normalized per chart: the best value for each metric is 100%; others are percentages of that maximum. Lower is better for ITL/TTFT; higher is better for Throughput.
 
-*TBA: Fig-2: ISL 2048, OSL 128, concurrency 128*
+<img src="https://dstack.ai/static-assets/static-assets/images/benchmarking-pd-ratios-fig-2.png" width="750" />
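The per-chart normalization described in the note above can be sketched as follows. It assumes the convention that the best value maps to 100% and other values are expressed relative to it; the exact scaling of the published charts may differ:

```python
def normalize_chart(values, higher_is_better):
    """Normalize one metric across ratios so the best value is 100%
    and every other value is a percentage of it.
    Illustrative sketch; the published charts' exact scaling may differ."""
    best = max(values.values()) if higher_is_better else min(values.values())
    return {ratio: v / best * 100.0 for ratio, v in values.items()}
```

For example, `normalize_chart({"3:1": 1200.0, "2:2": 1500.0, "1:3": 2000.0}, higher_is_better=True)` marks 1:3 as 100% and the others as fractions of it, which matches how the figures compare ratios within a single chart.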

 ## Finding 2: Decode-heavy workloads

@@ -93,11 +96,11 @@ At higher concurrency, 1:3 again leads across all metrics.
 
 For reasoning tasks, ITL is usually the tightest SLO—smooth, uninterrupted token streaming drives user experience. We recommend 1:3 for decode-heavy workloads at both low and high concurrency.
 
-*TBA: Fig-3: ISL 128, OSL 2048, concurrency 32*
+<img src="https://dstack.ai/static-assets/static-assets/images/benchmarking-pd-ratios-fig-3.png" width="750" />
 
 > Metrics normalized as above. Lower is better for ITL/TTFT; higher is better for Throughput.
 
-*TBA: Fig-4: ISL 128, OSL 2048, concurrency 128*
+<img src="https://dstack.ai/static-assets/static-assets/images/benchmarking-pd-ratios-fig-4.png" width="750" />
 
 ## Finding 3: Balanced workloads

@@ -107,35 +110,32 @@ At higher concurrency, 1:3 regains the lead across metrics, while 1:1 sees TTFT
 
 Since 1:1 becomes limiting under load, 1:3 is the safer default for balanced workloads—1:1 can offer slightly lower TTFT at light load, but 1:3 scales better and sustains higher throughput.
 
-*TBA: Fig-5: ISL 2048, OSL 2048, concurrency 32*
+<img src="https://dstack.ai/static-assets/static-assets/images/benchmarking-pd-ratios-fig-5.png" width="750" />
 
 > Metrics normalized as above. Lower is better for ITL/TTFT; higher is better for Throughput.
 
-*TBA: Fig-6: ISL 2048, OSL 2048, concurrency 128*
+<img src="https://dstack.ai/static-assets/static-assets/images/benchmarking-pd-ratios-fig-6.png" width="750" />
 
 ## Conclusion
 
-This study examined how the prefill/decode split shapes performance across workload profiles and load levels, and when dynamic adjustment is beneficial.
+Across all workload profiles and concurrency levels, a fixed ratio delivered robust performance.
+This suggests that while dynamic planners (e.g., SLA- and load-based) provide a flexible framework for worker allocation, in many cases a fixed ratio combined with standard autoscaling can achieve similar outcomes with simpler orchestration.
 
-1. A decode-leaning default performs robustly. Across profiles and loads, 1:3 consistently offered the strongest ITL and throughput, while keeping TTFT competitive when concurrency rises. For many known workload mixes, this reduces the need for dynamic rebalancing.
-2. Resilience under surges. The 1:3 split scales gracefully with concurrency, absorbing bursts without resorting to complex runtime adjustments.
-3. TTFT in context. 1:3 can show higher TTFT at low concurrency, but real-world expectations matter. Summarization users anticipate a delay after long uploads; reasoning users value smooth streaming most. For interactive chat with tight TTFT SLOs, techniques such as prefix caching and cache-aware routing can reduce prefill work and lower TTFT—often without changing the prefill/decode split.
-
-> Taken together, these results suggest that while dynamic planners (e.g., SLA- and load-based) provide a powerful framework to adapt capacity, in many production scenarios a simple, decode-leaning 1:3 baseline plus conventional autoscaling delivers excellent outcomes with less operational complexity.
+A fixed ratio therefore serves as a practical baseline for Prefill–Decode disaggregation. Dynamic adjustment remains valuable when workloads are highly unpredictable, but when profiles are understood, setting the ratio in advance can reduce operational complexity without sacrificing performance.
 
 ## Limitations
 
-This evaluation uses SGLang’s implementation of Prefill–Decode disaggregation. To strengthen generality, repeating the study with vLLM’s implementation would be valuable.
+1. This benchmark does not provide a method for determining the fixed ratio.
+2. The benchmark evaluated only a limited set of ratios: 3:1, 2:2, and 1:3.
+3. The benchmark does not directly validate whether dynamic ratio adjustment (e.g., NVIDIA’s planners) delivers better or worse performance compared with a fixed-ratio approach.
+4. The benchmark only considers tensor parallelism; it does not assess how other forms of model parallelism, such as data parallelism, interact with PD and affect latency/throughput trade-offs.
+
+Overall, more study is needed on how the optimal ratio should be chosen and which factors it depends on, so that a simple, robust framework can be established without overcomplicating orchestration.
 
 ## References
 
 * [DistServe :material-arrow-top-right-thin:{ .external }](https://arxiv.org/pdf/2401.09670){:target="_blank"}
 * [DeepSeek deployment on 96 H100 GPUs :material-arrow-top-right-thin:{ .external }](https://lmsys.org/blog/2025-05-05-large-scale-ep/){:target="_blank"}
 * [Dynamo disaggregated serving :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/dynamo/latest/architecture/disagg_serving.html#){:target="_blank"}
 * [SGLang PD disaggregation :material-arrow-top-right-thin:{ .external }](https://docs.sglang.ai/advanced_features/pd_disaggregation.html){:target="_blank"}
-* [vLLM disaggregated prefilling :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/v0.9.2/features/disagg_prefill.html){:target="_blank"}
-
-!!! info "What's next?"
-
-    * **KV-cache–aware routing & prefix caching with PD**: Quantify how cache-aware routing and prefix caching, combined with PD, reduce redundant prefill compute and improve TTFT.
-    * **PD with model parallelism**: Extend beyond tensor parallelism to assess how additional forms of model parallelism interact with PD and affect latency/throughput trade-offs.
+* [vLLM disaggregated prefilling :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/v0.9.2/features/disagg_prefill.html){:target="_blank"}
