---
title: "DeepSeek R1 inference performance: MI300X vs. H200"
date: 2025-03-18
description: "TBA"
slug: h200-mi300x-deepskeek-benchmark
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/h200-mi300x-deepskeek-benchmark-v2.png?raw=true
categories:
  - Benchmarks
  - AMD
  - NVIDIA
---
# DeepSeek R1 inference performance: MI300X vs. H200

DeepSeek-R1, with its innovative architecture combining Multi-head Latent Attention (MLA) and DeepSeekMoE, presents
unique challenges for inference workloads. As a reasoning-focused model, it generates intermediate chain-of-thought
outputs, placing significant demands on memory capacity and bandwidth.

In this benchmark, we evaluate the performance of three inference backends—SGLang, vLLM, and TensorRT-LLM—on two hardware
configurations: 8x NVIDIA H200 and 8x AMD MI300X. Our goal is to compare throughput, latency, and overall efficiency to
determine the optimal backend and hardware pairing for DeepSeek-R1's demanding requirements.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/h200-mi300x-deepskeek-benchmark-v2.png?raw=true" width="630"/>

This benchmark was made possible through the generous support of our partners at
[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} and
[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"},
who provided access to the necessary hardware.

<!-- more -->

## Benchmark setup

### Hardware configurations

1. AMD 8x MI300X
    * 2x Intel Xeon Platinum 8468, 48C/96T, 16GT/s, 105M Cache (350W)
    * 8x AMD MI300X GPU, 192GB, 750W
    * 32x 64GB DDR5, 4800MT/s
2. NVIDIA 8x H200 SXM5
    * 2x Intel Xeon Platinum 8570, 56C/112T, 20GT/s, 300M Cache (350W)
    * 8x NVIDIA H200 SXM5 GPU, 141GB, 700W
    * 32x 64GB DDR5, 5600MT/s

### Benchmark methodology

**Online inference**

We utilized SGLang's [`Deepseek-R1/bench_serving.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/main/Deepseek-R1/bench_serving.py){:target="_blank"}
script, modified to incorporate TensorRT-LLM.

Tests were conducted across multiple request concurrencies and output token lengths, with the input token length fixed at 3200.

| Request Concurrencies | Output Token Lengths | Prefix-Cached |
|-----------------------|----------------------|---------------|
| 4, 8, 16, ..., 128    | 800                  | No            |
| 128                   | 1600, 3200, 6400     | No            |
| 128                   | 800                  | Yes           |

To test prefix caching, about 62.5% of each ~3200-token prompt (2000 out of 3200 tokens) is a prefix repeated across multiple requests.
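
The shared-prefix workload can be sketched as follows. This is a minimal illustration of the idea, not the benchmark script itself; the vocabulary size and the use of raw token IDs are assumptions for the sketch:

```python
import random

PROMPT_LEN = 3200   # total tokens per prompt (from the setup above)
PREFIX_LEN = 2000   # tokens shared across requests (~62.5% of the prompt)

def make_prompts(num_requests: int, seed: int = 0) -> list[list[int]]:
    """Build synthetic token-ID prompts whose first PREFIX_LEN tokens are
    identical across all requests, so a prefix (radix) cache can reuse
    their KV entries after the first request is processed."""
    rng = random.Random(seed)
    prefix = [rng.randrange(32000) for _ in range(PREFIX_LEN)]
    prompts = []
    for _ in range(num_requests):
        suffix = [rng.randrange(32000) for _ in range(PROMPT_LEN - PREFIX_LEN)]
        prompts.append(prefix + suffix)
    return prompts

prompts = make_prompts(8)
shared = prompts[0][:PREFIX_LEN] == prompts[7][:PREFIX_LEN]  # prefixes match
```

With a backend that supports automatic prefix caching, only the first request pays the full prefill cost for the shared 2000 tokens; later requests prefill only their unique suffixes.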

**Offline inference**

For offline inference, we used vLLM’s [`benchmark_throughput.py` :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py){:target="_blank"},
modified for SGLang. TensorRT-LLM was tested using a custom
[`benchmark_throughput_trt.py` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/blob/deepseek-r1-benchmark/Deepseek-R1/benchmark_throughput_trt.py){:target="_blank"}.
The benchmark examined performance across various batch sizes and output token lengths.

| Batch Sizes            | Output Token Lengths |
|------------------------|----------------------|
| 32, 64, 128, ..., 1024 | 800                  |
| 256, 512, 1024         | 1600                 |
| 256, 512, 1024         | 3200                 |

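
Offline throughput figures of this kind are typically reported as total generated tokens divided by wall-clock time. A minimal sketch; the elapsed time below is made up purely for illustration:

```python
def offline_throughput(batch_size: int, output_len: int, elapsed_s: float) -> float:
    """Output-token throughput (tokens/s) for one offline batch:
    every request in the batch generates `output_len` tokens."""
    return batch_size * output_len / elapsed_s

# e.g. a batch of 1024 requests, 800 output tokens each, hypothetically
# finishing in 130 s of wall-clock time
tps = offline_throughput(1024, 800, 130.0)
```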
## Key observations

### Throughput and End-to-End Latency

**NVIDIA H200 performance**

* TensorRT-LLM outperformed both vLLM and SGLang on H200, achieving the highest online throughput of 4176 tokens/s.
* At concurrencies below 128, vLLM led in online throughput and end-to-end latency.
* In offline scenarios, H200 achieved the highest overall throughput of 6311 tokens/s with SGLang.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/online_throughput_vs_latency.png" />

**AMD MI300X performance**

* vLLM outperformed SGLang in both online and offline throughput as well as end-to-end latency.
* MI300X with vLLM achieved the highest overall online throughput of 4574 tokens/s.
* At request concurrencies below 32, SGLang outperformed vLLM in online throughput and latency.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/offline_throughput_vs_latency.png" />

While MI300X's larger memory capacity and higher bandwidth should theoretically enable higher throughput at larger batch
sizes, the results suggest that inference backends for MI300X may require further optimization to fully leverage its
architectural advantages.

### Throughput and Latency vs. Output Token Length

**NVIDIA H200 performance**

* SGLang delivered slightly higher throughput and better latency as output token length increased in online scenarios.
* In offline scenarios, SGLang on H200 outperformed MI300X as output token length increased.

=== "Throughput"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/online-throughput-vs-output.png" />

=== "Latency"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/online-latency-vs-output.png" />

**AMD MI300X performance**

vLLM maintained the lead in both online and offline scenarios as output token length increased.

=== "Throughput"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/offline-throughput-vs-output-256.png" />

=== "Latency"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/offline-latency-vs-output-length-256.png" />

### Time to First Token (TTFT)

**NVIDIA H200 performance**

TensorRT-LLM maintained the lowest and most consistent TTFT up to a concurrency of 64.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/ttft-vs-concurrency.png" />

**AMD MI300X performance**

vLLM achieved the lowest TTFT at concurrency 128. Below 128, vLLM and SGLang had similar TTFT.

TTFT, dominated by compute-bound prefill, highlights H200's advantage, aligning with [SemiAnalysis’s MI300X vs. H200 TFLOPS benchmark :material-arrow-top-right-thin:{ .external }](https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-benchmark-part-1-training/){:target="_blank"}.
However, at 128 concurrent requests, MI300X's memory capacity and bandwidth advantages become evident.
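
A rough back-of-the-envelope check of why prefill favors raw compute. All numbers here are assumptions for illustration (≈37B active parameters per token for DeepSeek-R1's MoE, ~2 FLOPs per active parameter per token, and a hypothetical aggregate 8-GPU FP8 rate), not measured values:

```python
def prefill_ttft_lower_bound(active_params: float, input_tokens: int,
                             aggregate_flops: float) -> float:
    """Compute-bound lower bound on TTFT for a single request:
    a forward pass costs roughly 2 FLOPs per active parameter per token."""
    total_flops = 2 * active_params * input_tokens
    return total_flops / aggregate_flops

# ~37e9 active params, 3200 input tokens, assumed 1e16 FLOP/s across 8 GPUs
ttft_s = prefill_ttft_lower_bound(37e9, 3200, 1e16)  # ≈ 0.024 s
```

Because this bound scales with the FLOP rate, the GPU with higher usable compute wins on TTFT until the system becomes queue- or memory-limited at high concurrency.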

### Time Per Output Token (TPOT)

**NVIDIA H200 performance**

vLLM maintained the lowest TPOT across all request concurrencies.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/tpot-vs-concurrency.png" />

**AMD MI300X performance**

SGLang delivered the lowest TPOT up to concurrency 32. Beyond that, vLLM took the lead.

Given that TPOT is memory-bound, MI300X should have a stronger advantage with further optimizations.
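
The memory-bound argument can be sketched with a simple bandwidth bound. The parameter footprint and bandwidth figures are assumptions (≈37 GB of active weights at FP8, vendor peak HBM bandwidths), not benchmark measurements:

```python
def decode_tpot_lower_bound(active_bytes: float, aggregate_bw: float) -> float:
    """Bandwidth-bound lower bound on time per output token: each decode
    step must stream at least the active weights from HBM once."""
    return active_bytes / aggregate_bw

# ≈37 GB of active weights (FP8), peak aggregate HBM bandwidth across 8 GPUs
tpot_mi300x = decode_tpot_lower_bound(37e9, 8 * 5.3e12)  # MI300X: ~5.3 TB/s each
tpot_h200   = decode_tpot_lower_bound(37e9, 8 * 4.8e12)  # H200:   ~4.8 TB/s each
```

Under these assumptions the MI300X bound is lower, which is why better-optimized kernels should let it pull ahead in decode-heavy workloads.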

### TTFT vs. Output Token Length

**NVIDIA H200 performance**

* SGLang demonstrated stable TTFT across increasing output token lengths.
* vLLM and TensorRT-LLM showed significant increases in TTFT as output token length grew, likely due to KV cache memory pressure.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/ttft-vs-output-length.png" />

**AMD MI300X performance**

Both vLLM and SGLang demonstrated stable TTFT across increasing output token lengths, with vLLM maintaining lower TTFT.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/ttft-vs-output-length-no-h200-vllm.png" />

### TPOT vs. Output Token Length

**NVIDIA H200 performance**

SGLang and TensorRT-LLM demonstrated stable TPOT across increasing output token lengths.

<img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/tpot-vs-output-length.png" />

vLLM maintained the lowest TPOT up to 3200 tokens but showed a sudden increase at 6400 tokens, likely due to memory pressure.

**AMD MI300X performance**

Both SGLang and vLLM demonstrated stable TPOT across increasing output token lengths, with vLLM maintaining the lowest TPOT.

### Prefix caching

**NVIDIA H200 performance**

vLLM outperformed SGLang in online throughput, TTFT, and end-to-end latency with prefix caching enabled. However, vLLM's
TPOT increased after prefix caching, which requires further investigation.

=== "Throughput"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-throughput-comparison.png" />
=== "TTFT"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-ttft-comparison.png" />
=== "TPOT"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-tpot-comparison.png" />
=== "Latency"
    <img src="https://github.com/dstackai/benchmarks/raw/deepseek-r1-benchmark/Deepseek-R1/images/prefix-cache-end-to-end-latency-comparison.png" />

## Limitations

1. The offline benchmark results for TensorRT-LLM were obtained using the DeepSeek-R1 model engine built from the
   [`deepseek` branch :material-arrow-top-right-thin:{ .external }](https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek){:target="_blank"}.
   However, the TensorRT-LLM team recommends the PyTorch-based workflow for deployment.
2. The impact of dynamic batching on inference efficiency was not tested.
3. vLLM's prefix caching support for MI300X is a work in progress and can be tracked [here :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/vllm/issues/457){:target="_blank"}.
4. The inference backends are being actively optimized for DeepSeek-R1. Given these continuous updates, the current
   results reflect only the performance at the time of the benchmark. Overall, performance for all backends is
   expected to improve as the backend teams land further optimizations.

## Source code

All source code and findings are available in
[our GitHub repo :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/benchmarks/tree/deepseek-r1-benchmark/Deepseek-R1){:target="_blank"}.

## References

* [Unlock DeepSeek-R1 Inference Performance on AMD Instinct MI300X GPU :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html){:target="_blank"}
* [Deploy DeepSeek-R1 671B on 8x NVIDIA H200 with SGLang :material-arrow-top-right-thin:{ .external }](https://datacrunch.io/blog/deploy-deepseek-r1-on-8x-nvidia-h200){:target="_blank"}
* [vLLM Prefix Caching :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching){:target="_blank"}
* [SGLang Prefix Caching :material-arrow-top-right-thin:{ .external }](https://lmsys.org/blog/2024-01-17-sglang/){:target="_blank"}

## Acknowledgments

### Vultr

[Vultr :material-arrow-top-right-thin:{ .external }](https://www.vultr.com/){:target="_blank"} provided access to 8x AMD MI300X GPUs. We are truly thankful for their support.

If you're looking for top-tier bare metal compute with AMD GPUs, we highly recommend Vultr. With `dstack`, provisioning
and accessing that compute is seamless and straightforward.

### Lambda

[Lambda :material-arrow-top-right-thin:{ .external }](https://lambdalabs.com/){:target="_blank"} provided access to 8x
NVIDIA H200 GPUs. We are truly thankful for their support.

Both Vultr and Lambda are natively supported and can be seamlessly integrated with `dstack`.