 ---
 title: "DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles"
-author: "The SGLang Team"
+author: "The SGLang and Miles Team"
 date: "April 25, 2026"
 previewImg: /images/blog/deepseek_v4/benchmark_vs_oss.png
 type: blog
@@ -73,10 +73,10 @@ Two optimizations drive the speedup:
 ### HiSparse: Turbocharging Sparse Attention with Hierarchical Memory
 Recently introduced, [HiSparse](https://www.lmsys.org/blog/2026-04-10-sglang-hisparse/) is a technique that **offloads inactive KV cache to CPU memory**, enabling larger batch sizes and higher throughput for sparse attention. HiSparse fits naturally with **C4 layers**: each step the indexer top-k touches only a small fraction of compressed positions, so **most C4 KV is inactive at any moment and can live on CPU**. C128 is dense (every position is touched) and SWA is already small (128 tokens), so neither benefits from offload. By using a CPU memory pool to extend just the C4 KV cache pool, we improve overall token capacity and throughput for long-context serving by **up to 3x**.
 
-<p align="center">
-  <img src="/images/blog/deepseek_v4/fig3_hisparse_arch.png" alt="HiSparse architecture" style="width: 58%; display: inline-block; margin: 0;"/>
-  <img src="/images/blog/deepseek_v4/fig_hisparse_peak_throughput.png" alt="HiSparse peak throughput" style="width: 40%; display: inline-block; margin: 0;"/>
-</p>
+<div style="display: flex; align-items: center; justify-content: center; gap: 2%; max-width: 100%; margin: 1em 0;">
+  <img src="/images/blog/deepseek_v4/fig3_hisparse_arch.png" alt="HiSparse architecture" style="width: 58%; height: auto; object-fit: contain; margin: 0;"/>
+  <img src="/images/blog/deepseek_v4/fig_hisparse_peak_throughput.png" alt="HiSparse peak throughput" style="width: 40%; height: auto; object-fit: contain; margin: 0;"/>
+</div>
 
 *Left: the GPU keeps only small device buffers for the active working set of the C4 KV cache, while a larger pinned CPU mirror stores the full-context KV cache. At each step, the HiSparse Coordinator swaps missed pages in from CPU and evicts inactive GPU pages using an LRU policy. Newly generated tokens are asynchronously backed up to the CPU mirror. Right: peak throughput for [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) on 2xB200, 200K-input / 20K-output, `swa_full_tokens_ratio=0.001`.*
 
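The coordinator behavior in the caption above — swap in missed pages, evict inactive GPU pages via LRU, back new tokens up to a CPU mirror — can be illustrated with a minimal sketch. All names here (`HiSparsePagePool` and its methods) are hypothetical for illustration, not SGLang APIs; real host-to-device copies and pinned memory are elided.

```python
from collections import OrderedDict

class HiSparsePagePool:
    """Toy LRU-managed GPU page pool with a CPU mirror of the full KV cache."""

    def __init__(self, gpu_pages: int):
        self.lru = OrderedDict()    # page_id -> GPU slot, least-recent first
        self.cpu_mirror = {}        # page_id -> KV copy (pinned host memory in practice)
        self.free_slots = list(range(gpu_pages))

    def write(self, page_id, kv):
        # Newly generated tokens land on GPU and are backed up to the CPU mirror
        # (asynchronously, in a real implementation).
        self.cpu_mirror[page_id] = kv
        self._place(page_id)

    def access(self, page_ids):
        # Before an attention step, ensure every page the indexer top-k selected
        # is resident on GPU; misses are swapped in from the CPU mirror.
        for pid in page_ids:
            if pid in self.lru:
                self.lru.move_to_end(pid)   # hit: mark as recently used
            else:
                self._place(pid)            # miss: CPU -> GPU copy in practice
        return [self.lru[pid] for pid in page_ids]

    def _place(self, page_id):
        if not self.free_slots:
            # Evict the least-recently-used page; its KV survives in the CPU mirror.
            _, slot = self.lru.popitem(last=False)
            self.free_slots.append(slot)
        self.lru[page_id] = self.free_slots.pop()
```

Because C4 attention touches only a small top-k of compressed positions per step, the GPU-resident working set stays small while the CPU mirror holds the full context, which is what lets the device pool shrink without losing any KV state.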
@@ -226,6 +226,6 @@ Thanks also to the individuals who contributed directly to this effort: Justin C
   author = {Ke Bao and Tom Chen and Mingyi Lu and Ying Sheng and Yusheng Su and Yihao Wang and Zhiqiang Xie and Ziyi Xu and Liangsheng Yin and Qiaolin Yu and Yueming Yuan and Baizhou Zhang and Banghua Zhu},
   title = {DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles},
   year = {2026},
-  url = {TODO}
+  url = {https://www.lmsys.org/blog/2026-04-25-deepseek-v4/}
 }
 ```