## Why sparse attention leaves performance on the table
Self-attention has become a major bottleneck in scaling LLMs to long contexts because of its quadratic compute and memory/IO cost. This has driven growing interest in efficient attention mechanisms. Among them, **sparse attention** is especially promising: by attending to only a selected subset of the KV cache, it retains strong modeling capability while avoiding the sharp increase in compute and I/O costs that dense attention faces as context grows.
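To make the mechanism concrete, the decode step of top-k sparse attention can be sketched as below. This is a minimal single-head NumPy sketch; the function name, shapes, and scoring are illustrative assumptions, not SGLang's API:

```python
import numpy as np

def topk_sparse_attention(q, k_cache, v_cache, k_top=64):
    """Single-head decode step that attends only to the top-k most
    relevant KV entries instead of the full cache (illustrative sketch)."""
    # q: (d,), k_cache: (seq_len, d), v_cache: (seq_len, d)
    scores = k_cache @ q / np.sqrt(q.shape[-1])     # relevance of each cached key
    k_top = min(k_top, scores.shape[0])
    idx = np.argpartition(scores, -k_top)[-k_top:]  # indices of the selected subset
    w = np.exp(scores[idx] - scores[idx].max())     # softmax over the subset only
    w /= w.sum()
    return w @ v_cache[idx]                         # (d,)
```

Only `k_top` KV rows participate in the softmax and the value reduction, so per-step compute and I/O stay bounded as the context grows; the full `k_cache` and `v_cache` must still be resident somewhere, however.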
However, sparse attention—typically top-k selection—does not eliminate the **memory capacity bottleneck**. In practice, the KV cache for the full context must remain in GPU HBM for fast access, even though only a small fraction of entries are active at any given decoding step. As a result, sparse attention is often capacity-bound rather than compute-bound, limiting the achievable batch size and overall throughput. As shown in the figure below, token-generation throughput of the baseline (sparse attention without HiSparse) plateaus early because the KV cache footprint quickly hits the GPU memory capacity limit.
By comparison, HiSparse achieves near-linear throughput scaling with increasing concurrency, reaching over 3× the baseline throughput at 256 concurrent requests. Note that at low concurrency, HiSparse introduces modest overhead, as the extra I/O from sparse KV loading outweighs the memory savings. The gains become pronounced as concurrency increases and memory pressure dominates.
<p style="text-align: center; color: #666; font-style: italic;"> Benchmark results for the <a href="https://huggingface.co/zai-org/GLM-5.1-FP8">GLM-5.1-FP8</a> model using 32k-input, 8k-output queries on a PD-colocated 8×H200 deployment. </p>
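A quick back-of-envelope calculation shows why full-context KV residency becomes the limiter. The model dimensions and memory budget below are illustrative assumptions for a 32k-context FP8 deployment, not measured GLM-5.1 figures:

```python
def kv_cache_gb(seq_len, layers, kv_heads, head_dim, bytes_per_elem=1):
    """Approximate per-request KV cache size in GB (2x for K and V;
    bytes_per_elem=1 assumes an FP8 KV cache). Dimensions are illustrative."""
    return 2 * seq_len * layers * kv_heads * head_dim * bytes_per_elem / 1e9

per_request_gb = kv_cache_gb(seq_len=32_768, layers=61, kv_heads=8, head_dim=128)
hbm_for_kv_gb = 80  # HBM left over for KV after weights/activations (assumed)
max_batch = int(hbm_for_kv_gb // per_request_gb)
# Even though top-k sparse attention touches only a fraction of these entries
# per step, every request's full cache still occupies HBM, capping max_batch.
```

Under these assumed numbers, each request pins roughly 4 GB of HBM regardless of how few entries each decoding step actually reads, which is why throughput plateaus once concurrency exhausts the budget.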
## Design of HiSparse
In line with our prior work, [HiCache](https://www.lmsys.org/blog/2025-09-10-sglang-hicache/), we propose HiSparse: a hierarchical memory system designed to overcome this limitation. HiSparse proactively offloads inactive KV cache entries to host memory, significantly reducing GPU memory pressure, while maintaining a hot device buffer on GPU HBM for frequently accessed KV regions to minimize data movement on the critical path. This enables much larger decoding batch sizes, improving throughput while scaling to longer contexts. The diagram below illustrates the HiSparse workflow. Although depicted in a prefill–decode disaggregated setup, the design applies equally to co-located instances.
The figure below illustrates the impact of hot-buffer sizing and eviction policy.
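One simple policy a hot buffer could use is LRU eviction at KV-page granularity. The sketch below is a stand-in to illustrate the hit/miss mechanics under that assumption; it is not HiSparse's actual implementation:

```python
from collections import OrderedDict

class HotBuffer:
    """Fixed-capacity device-side buffer of KV pages with LRU eviction
    (illustrative sketch, not HiSparse's real policy or data layout)."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()            # block_id -> KV page kept on HBM

    def access(self, block_id, fetch_from_host):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # hit: mark most-recently used
            return self.blocks[block_id], True
        page = fetch_from_host(block_id)       # miss: extra host-to-device I/O
        self.blocks[block_id] = page
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least-recently used page
        return page, False
```

A larger buffer raises the hit rate, keeping host-to-device transfers off the critical path, but consumes HBM that could otherwise hold more concurrent requests; that tension is the sizing trade-off at play here.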
## Benchmark
Below, we highlight results from sweeping various sequence configurations for the state-of-the-art open model GLM-5.1-FP8, achieving up to 5x throughput improvement in long-context scenarios. More detailed instructions are available [here](https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/hisparse_guide.md).
HiSparse currently supports model families that use [DeepSeek Sparse Attention (DSA)](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324), including DeepSeek-V3.2 and GLM-5.1. As an experimental feature, we expect to continue improving both performance and model coverage. HiSparse is designed for high-concurrency scenarios to maximize throughput; however, it also introduces some overhead due to the additional IO incurred by top-k cache misses.
We expect to reduce this overhead through better overlap, and believe it will be further mitigated by the higher CPU–GPU bandwidth of emerging platforms such as Grace Blackwell (GB) systems.
Looking ahead, following the direction of our earlier HiCache work, we plan to extend this hierarchical memory management approach to support a broader range of emerging architectures, including [hybrid models](https://github.com/sgl-project/sglang/pull/21206).
## Acknowledgements
We would like to thank the Alibaba Cloud TairKVCache team and the Ant Group SCT Inference team for their valuable contributions. We are also grateful to Shangming Cai, Teng Ma, and Xingyu Ling from Alibaba Cloud, and to Ziyi Xu from the SGLang community, for their generous support. We further thank Christos Kozyrakis and Kristopher Geda from Stanford, as well as the Baidu Baige AI Team, for their thoughtful feedback.