
Commit 19cec9d

minor author and citation fix (#333)
1 parent 251cd75 commit 19cec9d

1 file changed: blog/2026-04-25-deepseek-v4.md

Lines changed: 6 additions & 6 deletions
````diff
@@ -1,6 +1,6 @@
 ---
 title: "DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles"
-author: "The SGLang Team"
+author: "The SGLang and Miles Team"
 date: "April 25, 2026"
 previewImg: /images/blog/deepseek_v4/benchmark_vs_oss.png
 type: blog
@@ -73,10 +73,10 @@ Two optimizations drive the speedup:
 ### HiSparse: Turbocharging Sparse Attention with Hierarchical Memory
 Recently introduced, [HiSparse](https://www.lmsys.org/blog/2026-04-10-sglang-hisparse/) is a technique that **offloads inactive KV cache to CPU memory**, enabling larger batch sizes and higher throughput for sparse attention. HiSparse fits naturally with **C4 layers**: each step the indexer top-k touches only a small fraction of compressed positions, so **most C4 KV is inactive at any moment and can live on CPU**. C128 is dense (every position is touched) and SWA is already small (128 tokens), so neither benefits from offload. By using a CPU memory pool to extend just the C4 KV cache pool, we improve overall token capacity and throughput for long-context serving by **up to 3x**.
 
-<p align="center">
-<img src="/images/blog/deepseek_v4/fig3_hisparse_arch.png" alt="HiSparse architecture" style="width: 58%; display: inline-block; margin: 0;"/>
-<img src="/images/blog/deepseek_v4/fig_hisparse_peak_throughput.png" alt="HiSparse peak throughput" style="width: 40%; display: inline-block; margin: 0;"/>
-</p>
+<div style="display: flex; align-items: center; justify-content: center; gap: 2%; max-width: 100%; margin: 1em 0;">
+<img src="/images/blog/deepseek_v4/fig3_hisparse_arch.png" alt="HiSparse architecture" style="width: 58%; height: auto; object-fit: contain; margin: 0;"/>
+<img src="/images/blog/deepseek_v4/fig_hisparse_peak_throughput.png" alt="HiSparse peak throughput" style="width: 40%; height: auto; object-fit: contain; margin: 0;"/>
+</div>
 
 *Left: the GPU keeps only small device buffers for the active working set of the C4 KV cache, while a larger pinned CPU mirror stores the full-context KV cache. At each step, the HiSparse Coordinator swaps missed pages in from CPU and evicts inactive GPU pages using an LRU policy. Newly generated tokens are asynchronously backed up to the CPU mirror. Right: peak throughput for [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) on 2xB200, 200K-input / 20K-output, `swa_full_tokens_ratio=0.001`.*
 
@@ -226,6 +226,6 @@ Thanks also to the individuals who contributed directly to this effort: Justin C
 author = {Ke Bao and Tom Chen and Mingyi Lu and Ying Sheng and Yusheng Su and Yihao Wang and Zhiqiang Xie and Ziyi Xu and Liangsheng Yin and Qiaolin Yu and Yueming Yuan and Baizhou Zhang and Banghua Zhu},
 title = {DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles},
 year = {2026},
-url = {TODO}
+url = {https://www.lmsys.org/blog/2026-04-25-deepseek-v4/}
 }
 ```
````
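The figure caption in this diff describes the HiSparse page-management loop: a small GPU-resident working set of C4 KV-cache pages, a larger pinned CPU mirror holding the full context, LRU eviction of inactive GPU pages, and asynchronous backup of newly generated tokens. As a rough mental model only, the caching policy can be sketched in plain Python; the class and method names below are hypothetical and this is not SGLang's actual implementation (which manages pinned device/host buffers, not Python dicts).

```python
from collections import OrderedDict

class HiSparseLikePageCache:
    """Illustrative sketch of the HiSparse policy from the figure caption:
    a bounded GPU working set with LRU eviction, backed by a CPU mirror
    that always holds every page of the full context."""

    def __init__(self, gpu_capacity_pages):
        self.gpu_capacity = gpu_capacity_pages
        self.gpu_pages = OrderedDict()  # page_id -> data; insertion order tracks LRU
        self.cpu_mirror = {}            # full-context backup of every page
        self.swap_ins = 0               # pages copied CPU -> GPU on a miss

    def write(self, page_id, data):
        # Newly generated KV pages land on GPU and are (conceptually
        # asynchronously) backed up to the CPU mirror.
        self.cpu_mirror[page_id] = data
        self._place_on_gpu(page_id, data)

    def read(self, page_id):
        # The indexer top-k touched this page, so it must be GPU-resident:
        # on a hit, refresh its LRU position; on a miss, swap in from CPU.
        if page_id in self.gpu_pages:
            self.gpu_pages.move_to_end(page_id)
        else:
            self.swap_ins += 1
            self._place_on_gpu(page_id, self.cpu_mirror[page_id])
        return self.gpu_pages[page_id]

    def _place_on_gpu(self, page_id, data):
        self.gpu_pages[page_id] = data
        self.gpu_pages.move_to_end(page_id)
        while len(self.gpu_pages) > self.gpu_capacity:
            self.gpu_pages.popitem(last=False)  # evict least-recently-used page
```

Because only the small top-k working set must be GPU-resident at any step, the GPU capacity can be far below the full context length, which is what lets the CPU pool extend the effective C4 KV capacity in the blog post's serving numbers.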
