---
title: "DeepSeek R1 inference performance: MI300X vs. H200"
date: 2025-03-18
description: "TBA"
slug: h200-mi300x-deepskeek-benchmark
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/h200-mi300x-deepskeek-benchmark-v2.png?raw=true
categories:
- Benchmarks
- AMD
- NVIDIA
---

# DeepSeek R1 inference performance: MI300X vs. H200

DeepSeek-R1, with its innovative architecture combining Multi-head Latent Attention (MLA) and DeepSeekMoE, presents unique challenges for inference workloads. As a reasoning-focused model, it generates intermediate chain-of-thought outputs, placing significant demands on memory capacity and bandwidth.

In this benchmark, we evaluate the performance of three inference backends—SGLang, vLLM, and TensorRT-LLM—on two hardware configurations: 8x NVIDIA H200 and 8x AMD MI300X. Our goal is to compare throughput, latency, and overall efficiency to determine the optimal backend and hardware pairing for DeepSeek-R1's demanding requirements.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h200-mi300x-deepskeek-benchmark-v2.png?raw=true" width="630"/>

This benchmark was made possible through the generous support of our partners at
[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} and
[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"},
who provided access to the necessary hardware.

<!-- more -->

## Benchmark setup

### Hardware configurations

1. AMD 8x MI300X
    * 2x Intel Xeon Platinum 8468, 48C/96T, 16GT/s, 105M Cache (350W)
    * 8x AMD MI300X GPU, 192GB, 750W
    * 32x 64GB DDR5, 4800MT/s
2. NVIDIA 8x H200 SXM5
    * 2x Intel Xeon Platinum 8570, 56C/112T, 20GT/s, 300M Cache (350W)
    * 8x NVIDIA H200 SXM5 GPU, 141GB, 700W
    * 32x 64GB DDR5, 5600MT/s

### Benchmark methodology

**Online inference**

We utilized SGLang's [`Deepseek-R1/bench_serving.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/main/Deepseek-R1/bench_serving.py){:target="_blank"}
script, modified to incorporate TensorRT-LLM.

Tests were conducted across multiple request concurrencies and output token lengths, with the input token length fixed at 3200.

| Request Concurrencies | Output Token Lengths | Prefix-Cached |
|------------------------|----------------------|----------------|
| 4, 8, 16, ..., 128     | 800                  | No             |
| 128                    | 1600, 3200, 6400     | No             |
| 128                    | 800                  | Yes            |

To test prefix caching, about 62.5% of each ~3200-token prompt (i.e., 2000 out of 3200 tokens) is a prefix repeated across multiple requests.
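
This workload can be reproduced with a small helper that shares a fixed prefix across requests (an illustrative sketch using random token IDs; the actual benchmark uses real tokenized prompts):

```python
import random

def build_prompts(num_requests, total_len=3200, prefix_len=2000, vocab_size=32000):
    """Build synthetic token-ID prompts where the first `prefix_len` tokens
    are shared across all requests, so a backend can reuse the cached prefix."""
    rng = random.Random(0)
    shared_prefix = [rng.randrange(vocab_size) for _ in range(prefix_len)]
    prompts = []
    for _ in range(num_requests):
        suffix = [rng.randrange(vocab_size) for _ in range(total_len - prefix_len)]
        prompts.append(shared_prefix + suffix)
    return prompts

prompts = build_prompts(num_requests=128)
# 2000 of 3200 tokens (62.5%) in each prompt form a shared, cacheable prefix.
```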

**Offline inference**

For offline inference, we used vLLM’s [`benchmark_throughput.py` :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py){:target="_blank"},
modified for SGLang. TensorRT-LLM was tested using a custom
[`benchmark_throughput_trt.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/blob/deepseek-r1-benchmark/Deepseek-R1/benchmark_throughput_trt.py){:target="_blank"}.
The benchmark examined performance across various batch sizes and output token lengths.

| Batch Sizes            | Output Token Lengths |
|------------------------|----------------------|
| 32, 64, 128, ..., 1024 | 800                  |
| 256, 512, 1024         | 1600                 |
| 256, 512, 1024         | 3200                 |
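
For context, the offline throughput figures reported below boil down to tokens generated divided by wall-clock time. A minimal sketch of that calculation (`generate_fn` is a hypothetical stand-in for a backend call such as vLLM's `LLM.generate`, not part of the benchmark scripts):

```python
import time

def offline_throughput(batch_size, output_len, generate_fn):
    """Run one offline batch and return throughput in output tokens/s."""
    start = time.perf_counter()
    generate_fn(batch_size, output_len)  # produces batch_size * output_len tokens
    elapsed = time.perf_counter() - start
    return batch_size * output_len / elapsed

# Example with a dummy backend that just burns a fixed amount of time:
tps = offline_throughput(256, 800, lambda bs, n: time.sleep(0.05))
```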

## Key observations

### Throughput and End-to-End Latency

**NVIDIA H200 performance**

* TensorRT-LLM outperformed both vLLM and SGLang, achieving the highest online throughput of 4176 tokens/s on H200.
* At concurrencies below 128, vLLM led in online throughput and end-to-end latency.
* In offline scenarios, H200 achieved the highest overall throughput of 6311 tokens/s with SGLang.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/online_throughput_vs_latency.png" />

**AMD MI300X performance**

* vLLM outperformed SGLang in both online and offline throughput and end-to-end latency.
* MI300X with vLLM achieved the highest overall throughput of 4574 tokens/s in online scenarios.
* At request concurrencies below 32, SGLang outperformed vLLM in online throughput and latency.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/offline_throughput_vs_latency.png" />

While MI300X's larger memory capacity and higher bandwidth should theoretically enable higher throughput at larger batch sizes, the results suggest that inference backends for MI300X may require further optimization to fully leverage its architectural advantages.
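
A back-of-the-envelope estimate shows why decode should favor MI300X on paper. DeepSeek-R1 activates roughly 37B of its 671B parameters per token; a bandwidth-only lower bound on time per output token follows from streaming those weights from HBM each step (a sketch assuming FP8 weights at 1 byte/parameter, peak HBM bandwidth, perfect 8-way parallelism, and ignoring KV-cache and interconnect traffic):

```python
ACTIVE_PARAMS = 37e9                 # ~37B of 671B params active per token (MoE)
BYTES_PER_PARAM = 1.0                # assuming FP8 weights
PEAK_HBM_BW = {"H200": 4.8e12, "MI300X": 5.3e12}   # bytes/s per GPU (peak)
NUM_GPUS = 8

def tpot_lower_bound_ms(gpu):
    """Bandwidth-only lower bound on time per output token during decode:
    each step must stream the active expert weights from HBM at least once.
    Real TPOT is higher (KV-cache reads, interconnect, compute are ignored)."""
    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
    return bytes_per_token / (PEAK_HBM_BW[gpu] * NUM_GPUS) * 1e3

# On paper, MI300X's extra bandwidth gives it the lower decode floor:
bounds = {gpu: tpot_lower_bound_ms(gpu) for gpu in PEAK_HBM_BW}
```

That the measured results do not reflect this gap is what points to software, not hardware, as the bottleneck.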

### Throughput and Latency vs. Output Token Length

**NVIDIA H200 performance**

* SGLang delivered slightly higher throughput and better latency as output token length increased in online scenarios.
* In offline scenarios, SGLang with H200 outperformed MI300X as output token length increased.

=== "Throughput"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/online-throughput-vs-output.png" />

=== "Latency"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/online-latency-vs-output.png" />

**AMD MI300X performance**

vLLM maintained the lead in both online and offline scenarios as output token length increased.

=== "Throughput"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/offline-throughput-vs-output-256.png" />

=== "Latency"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/offline-latency-vs-output-length-256.png" />

### Time to First Token (TTFT)

**NVIDIA H200 performance**

TensorRT-LLM maintained the lowest and most consistent TTFT up to concurrency 64.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/ttft-vs-concurrency.png" />

**AMD MI300X performance**

vLLM achieved the lowest TTFT at concurrency 128. Below 128, vLLM and SGLang had similar TTFT.

TTFT, being compute-intensive, highlights H200's advantage, aligning with [SemiAnalysis’s MI300X vs. H200 TFLOPS benchmark :material-arrow-top-right-thin:{ .external }](https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-benchmark-part-1-training/){:target="_blank"}.
However, at 128 concurrent requests, MI300X's memory capacity and bandwidth advantages become evident.

### Time Per Output Token (TPOT)

**NVIDIA H200 performance**

vLLM maintained the lowest TPOT across all request concurrencies.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/tpot-vs-concurrency.png" />

**AMD MI300X performance**

SGLang delivered the lowest TPOT up to concurrency 32. Beyond that, vLLM took the lead.

Given that TPOT is memory-bound, MI300X should have a stronger advantage with further optimizations.
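
For reference, the latency metrics in this section derive from per-request timestamps as follows (a minimal sketch; the field names are illustrative, not taken from the benchmark script):

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    sent_at: float          # request submission time (s)
    first_token_at: float   # arrival time of the first output token (s)
    finished_at: float      # arrival time of the last output token (s)
    output_tokens: int

def ttft(r):
    return r.first_token_at - r.sent_at

def tpot(r):
    # Decode time spread over the tokens after the first one.
    return (r.finished_at - r.first_token_at) / max(r.output_tokens - 1, 1)

def e2e_latency(r):
    return r.finished_at - r.sent_at

r = RequestTrace(sent_at=0.0, first_token_at=0.8, finished_at=8.8, output_tokens=801)
# TTFT = 0.8 s; TPOT = (8.8 - 0.8) / 800 = 10 ms/token; end-to-end latency = 8.8 s
```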

### TTFT vs. Output Token Length

**NVIDIA H200 performance**

* SGLang demonstrated stable TTFT across increasing output token lengths.
* vLLM and TensorRT-LLM showed significant increases in TTFT as output token length grew, likely due to KV cache memory pressure.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/ttft-vs-output-length.png" />

**AMD MI300X performance**

Both vLLM and SGLang demonstrated stable TTFT across increasing output token lengths, with vLLM maintaining lower TTFT.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/ttft-vs-output-length-no-h200-vllm.png" />

### TPOT vs. Output Token Length

**NVIDIA H200 performance**

SGLang and TensorRT-LLM demonstrated stable TPOT across increasing output token lengths.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/tpot-vs-output-length.png" />

vLLM maintained the lowest TPOT up to 3200 tokens but showed a sudden increase at 6400 tokens, likely due to memory pressure.

**AMD MI300X performance**

Both SGLang and vLLM demonstrated stable TPOT across increasing output token lengths, with vLLM maintaining the lowest TPOT.
180+
### Prefix caching
181+
182+
**NVIDIA H200 performance**
183+
184+
vLLM outperformed SGLang in online throughput, TTFT, and end-to-end latency with prefix caching enabled. However, vLLM's
185+
TPOT increased after prefix caching, which requires further investigation.
186+
187+
=== "Throughput"
188+
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-throughput-comparison.png" />
189+
=== "TTFT"
190+
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-ttft-comparison.png" />
191+
=== "TPOT"
192+
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-tpot-comparison.png" />
193+
=== "Latency"
194+
<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-end-to-end-latency-comparison.png" />
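
With 2000 of 3200 prompt tokens shared, an ideal prefix cache skips prefill work for the cached portion on every hit. A rough upper bound on the TTFT savings (a sketch assuming a perfect cache hit and prefill cost linear in prompt length; attention's quadratic term makes the real picture less clean):

```python
TOTAL_PROMPT = 3200
CACHED_PREFIX = 2000

# Fraction of prompt tokens whose prefill can be skipped on a cache hit,
# assuming prefill cost scales linearly with the number of new tokens.
savings = CACHED_PREFIX / TOTAL_PROMPT
print(f"Up to {savings:.1%} of prefill compute avoided per cached request")
```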
195+
196+
## Limitations
197+
198+
1. The offline benchmark results for TensorRT-LLM were obtained using the DeepSeek-R1 model engine built from the
199+
[`deepseek` branch :material-arrow-top-right-thin:{ .external }](https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek){:target="_blank"}.
200+
However, the TensorRT-LLM team recommends using the TorchFlow-based approach for deployment.
201+
2. The impact of dynamic batching on inference efficiency was not tested.
202+
3. vLLM's prefix caching support for MI300X is a work in progress and can be tracked [here :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/vllm/issues/457){:target="_blank"}.
203+
4. The inference backends are being optimized for the DeepSeek-R1 model. Given these continuous updates, the current
204+
results reflect only the performance tested at the time of the benchmark. Overall, performance for all backends is
205+
expected to improve as more optimizations are made by the backend teams.

## Source code

All source code and findings are available in
[our GitHub repo :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/deepseek-r1-benchmark/Deepseek-R1){:target="_blank"}.

## References

* [Unlock DeepSeek-R1 Inference Performance on AMD Instinct MI300X GPU :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html){:target="_blank"}
* [Deploy DeepSeek-R1 671B on 8x NVIDIA H200 with SGLang :material-arrow-top-right-thin:{ .external }](https://datacrunch.io/blog/deploy-deepseek-r1-on-8x-nvidia-h200){:target="_blank"}
* [vLLM Prefix Caching :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching){:target="_blank"}
* [SGLang Prefix Caching :material-arrow-top-right-thin:{ .external }](https://lmsys.org/blog/2024-01-17-sglang/){:target="_blank"}

## Acknowledgments

### Vultr

[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} provided access to 8x AMD MI300X GPUs. We are truly thankful for their support.

If you're looking for top-tier bare metal compute with AMD GPUs, we highly recommend Vultr. With `dstack`, provisioning and accessing compute is seamless and straightforward.

### Lambda

[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"} provided access to 8x NVIDIA H200 GPUs. We are truly thankful for their support.

Both Vultr and Lambda are natively supported and can be seamlessly integrated with `dstack`.
