Commit d8c1612: HiSparse blog quick fix (#329)
1 parent bf8b444

blog/2026-04-10-sglang-hisparse.md (13 additions, 14 deletions)
---
title: "HiSparse: Turbocharging Sparse Attention with Hierarchical Memory"
author: "Zhiqiang Xie, Zhangheng Huang, Tingwei Huang"
date: "April 10, 2026"
previewImg: /images/blog/hisparse/hisparse_overview.png
---

## Why sparse attention leaves performance on the table

Self-attention has become a major bottleneck in scaling LLMs to long contexts because of its quadratic compute and memory/IO cost. This has driven growing interest in efficient attention mechanisms. Among them, **sparse attention** is especially promising: by attending to only a selected subset of KV caches, it retains strong modeling capability while avoiding the sharp increase in compute and I/O costs that regular attention faces as context grows.

However, sparse attention—typically top-k selection—does not eliminate the **memory capacity bottleneck**. In practice, the KV cache for the full context must remain in GPU HBM for fast access, even though only a small fraction of entries are active at any given decoding step. As a result, sparse attention is often capacity-bound rather than compute-bound, limiting the achievable batch size and overall throughput. As shown in the figure below, token-generation throughput of the baseline (sparse attention without HiSparse) plateaus early because the KV cache footprint quickly hits the GPU memory capacity limit.

By comparison, HiSparse achieves near-linear throughput scaling with increasing concurrency, reaching over 3× the baseline throughput at 256 concurrent requests. Note that at low concurrency, HiSparse introduces modest overhead, as the extra I/O from sparse KV loading outweighs the memory savings. The gains become pronounced as concurrency increases and memory pressure dominates.

<img src="/images/blog/hisparse/throughput_concurrency.png" style="width: 50vw; min-width: 300px;" />
<p style="text-align: center; color: #666; font-style: italic;"> Benchmark results for the <a href="https://huggingface.co/zai-org/GLM-5.1-FP8">GLM-5.1-FP8</a> model using 32k-input, 8k-output queries on a PD-colocated 8×H200 deployment. </p>
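The top-k selection step described above is simple to state. Below is a minimal NumPy sketch of one decoding step; shapes and names are illustrative, and real systems (e.g. DSA) score candidates with a lightweight indexer rather than the full keys:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """One decoding step of top-k sparse attention (toy version).

    A real system scores candidates with a cheap indexer; we reuse the full
    keys here for clarity. Note that K and V for the entire context must
    still be resident somewhere, which is the capacity bottleneck discussed
    in the text.
    """
    scores = K @ q                           # selection score per cached token
    idx = np.argpartition(scores, -k)[-k:]   # indices of the k highest-scoring tokens
    s = (K[idx] @ q) / np.sqrt(q.shape[-1])  # attend only over the selected subset
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[idx]
```

With `k` equal to the full context length this reduces to dense attention; the savings come from keeping `k` fixed as the context grows.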

## Design of HiSparse
In line with our prior work, [HiCache](https://www.lmsys.org/blog/2025-09-10-sglang-hicache/), we propose HiSparse: a hierarchical memory system designed to overcome this limitation. HiSparse proactively offloads inactive KV cache entries to host memory, significantly reducing GPU memory pressure, while maintaining a hot device buffer on GPU HBM for frequently accessed KV regions to minimize data movement on the critical path. This enables much larger decoding batch sizes, improving throughput while scaling to longer contexts. The diagram below illustrates the HiSparse workflow. Although depicted in a prefill–decode disaggregated setup, the design applies equally to co-located instances.

<img src="/images/blog/hisparse/hisparse_overview.png" style="width: 50vw; min-width: 300px;" />
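The hot device buffer behaves like a bounded cache over a host-resident KV store. A toy Python model, assuming an LRU eviction policy (illustrative only; not necessarily the policy HiSparse ships with):

```python
from collections import OrderedDict

class HotBuffer:
    """Toy model of a device-side hot buffer over a host KV store.

    Capacity is counted in KV entries; eviction is LRU for illustration.
    """
    def __init__(self, host_kv, capacity):
        self.host_kv = host_kv       # full KV cache, resident in host memory
        self.capacity = capacity
        self.device = OrderedDict()  # token id -> entry currently "on device"
        self.misses = 0              # each miss models a host->device transfer

    def gather(self, topk_ids):
        """Return KV entries for the selected tokens, swapping in misses."""
        out = []
        for t in topk_ids:
            if t in self.device:
                self.device.move_to_end(t)               # refresh recency
            else:
                self.misses += 1
                if len(self.device) >= self.capacity:
                    self.device.popitem(last=False)      # evict LRU entry
                self.device[t] = self.host_kv[t]
            out.append(self.device[t])
        return out
```

Inactive entries live in host memory; only misses pay a host-to-device transfer, so a well-sized buffer keeps most of the extra I/O off the critical path.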

### Efficient Swap-in Kernel

The figure below illustrates the impact of hot-buffer sizing and eviction policy.

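The swap-in step amounts to a batched gather from host memory into free device-buffer slots. A minimal NumPy sketch of the idea (array layout and names are our own, not the actual kernel's API):

```python
import numpy as np

def swap_in(host_kv, device_buf, miss_ids, free_slots):
    """Copy missed KV entries into the device buffer in one batched transfer.

    host_kv:    (num_tokens, head_dim) array standing in for host memory
    device_buf: (capacity, head_dim) array standing in for the GPU hot buffer
    Batching the copy amortizes per-transfer overhead, which is what an
    efficient swap-in kernel optimizes on real hardware.
    """
    staged = host_kv[np.asarray(miss_ids)]        # contiguous staging gather
    device_buf[np.asarray(free_slots)] = staged   # scatter into free slots
    return device_buf
```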
## Benchmark

Below, we highlight results from sweeping various sequence configurations for a state-of-the-art open model, GLM-5.1-FP8, achieving up to 5× throughput improvement in long-context scenarios. More detailed instructions can be found [here](https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/hisparse_guide.md).

<img src="/images/blog/hisparse/hisparse_sweep.png" style="width: 50vw; min-width: 300px;" />
```bash
# PD-disaggregation deployment (recommended) on two H20 nodes
# prefill instance:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-5.1-FP8" --trust-remote-code --watchdog-timeout 100000 \
    --chunked-prefill-size 65536 --max-running-requests 480 --mem-fraction-static 0.8 \
    --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
    --dist-init-addr 127.0.0.1:5757 --nnodes 1 --node-rank 0

# decode instance:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-5.1-FP8" --trust-remote-code --watchdog-timeout 100000 \
    --chunked-prefill-size 65536 --max-running-requests 480 --mem-fraction-static 0.85 \
    --disaggregation-mode decode --dist-init-addr 127.0.0.1:5757 \
    --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3 --nnodes 1 --node-rank 0 \
    --enable-hisparse \
    --hisparse-config '{"top_k": 2048, "device_buffer_size": 6144, "host_to_device_ratio": 10}'

# PD-colocation deployment on a single 8xH200 instance
python3 -m sglang.launch_server \
    ...
```
## Future Work

HiSparse currently supports model families that use [DeepSeek Sparse Attention (DSA)](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324), including DeepSeek-V3.2 and GLM-5.1. As an experimental feature, we expect to continue improving both performance and model coverage. HiSparse is designed for high-concurrency scenarios to maximize throughput; however, it also introduces some overhead due to the additional I/O incurred by top-k cache misses.

We expect to reduce this overhead through better overlap, and believe it will be further mitigated by the higher CPU–GPU bandwidth of emerging platforms such as Grace Blackwell (GB) systems.
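Such overlap can be pictured as a simple prefetch pipeline. A toy sketch, assuming the KV needed at step t+1 can be requested speculatively while step t computes (all names here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def decode_with_overlap(steps, fetch_kv, attend):
    """Toy pipeline overlapping KV swap-in for step t+1 with compute at step t.

    fetch_kv(t) loads the KV needed at step t (models the swap-in I/O);
    attend(t, kv) runs the attention compute for that step.
    """
    out = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(fetch_kv, 0)
        for t in range(steps):
            kv = fut.result()                       # wait for this step's KV
            if t + 1 < steps:
                fut = pool.submit(fetch_kv, t + 1)  # start next swap-in early
            out.append(attend(t, kv))
    return out
```

The transfer for the next step then hides behind the current step's compute instead of stalling the decode loop.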

Looking ahead, following the direction of our earlier HiCache work, we plan to extend this hierarchical memory management approach to support a broader range of emerging architectures, including [hybrid models](https://github.com/sgl-project/sglang/pull/21206).
8079

8180

82-
## Acknowledgement:
83-
We would like to thank the Alibaba Cloud TairKVCache team and the Ant Group SCT Inference team for their valuable contributions. We are also grateful to Shangming Cai, Teng Ma, and Xingyu Ling from Alibaba Cloud and Ziyi Xu from the SGLang community for their generous support. We also thank Christos Kozyrakis and Kristopher Geda from Stanford and the Baidu Baige AI Team for their helpful feedback.
81+
## Acknowledgements
82+
We would like to thank the Alibaba Cloud TairKVCache team and the Ant Group SCT Inference team for their valuable contributions. We are also grateful to Shangming Cai, Teng Ma, and Xingyu Ling from Alibaba Cloud, and to Ziyi Xu from the SGLang community, for their generous support. We further thank Christos Kozyrakis and Kristopher Geda from Stanford, as well as the Baidu Baige AI Team, for their thoughtful feedback.