Commit ceeec44

searchthe and cleoncheng authored
docs: add Stem news entry and benchmark results (#328)
Co-authored-by: cleoncheng <cleoncheng@tencent.com>
1 parent 8bd4dcb commit ceeec44

4 files changed

Lines changed: 14 additions & 4 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ A more accessible, comprehensive, and efficient toolkit for large model compress
 </p>

 ## 📣Latest News
+- [26/06/04] We have released **Stem**, a sparse attention algorithm that accelerates the **Prefill** stage of long-context LLMs by dynamically selecting top-k key blocks for block-sparse attention, significantly reducing latency while preserving generation quality. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/sparse_attention/stem.html)
 - [26/06/01] We have released **DFlare**, a block-diffusion speculative decoding framework with layer-wise fusion that achieves up to **5.52× end-to-end speedup**. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/dflare.html)
 - [26/05/27] We have released **D-Cut**, an adaptive verification depth pruning technique for speculative decoding. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/dcut.html)
 - [26/05/20] We support Distillation for full-precision HuggingFace models and **quantized QAT-style** models, as detailed in the [distillation documentation](https://angelslim.readthedocs.io/zh-cn/latest/features/distill/index.html).
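
Reviewer note: the Stem entry describes selecting top-k key blocks for block-sparse attention. The sketch below illustrates that idea in plain Python; it is not taken from the AngelSlim code. The function name `select_key_blocks` and the toy scores are hypothetical, the scoring step is left out entirely, and treating `top_k` as a budget on top of the always-kept blocks is an assumption. The `initial_blocks` and `window_size` names and their defaults of `4` follow the parameter table in `stem.md`.

```python
def select_key_blocks(block_scores, top_k, initial_blocks=4, window_size=4):
    """Return sorted indices of the key blocks kept for one query block.

    block_scores holds one relevance score per visible key block; how the
    scores are computed is outside the scope of this sketch.
    """
    n = len(block_scores)
    # Always keep the leading "sink" blocks and the trailing sliding-window blocks.
    keep = set(range(min(initial_blocks, n)))
    keep.update(range(max(0, n - window_size), n))
    # Assumption: the top-k budget is spent on the best of the remaining blocks.
    rest = [j for j in range(n) if j not in keep]
    rest.sort(key=lambda j: block_scores[j], reverse=True)
    keep.update(rest[:top_k])
    return sorted(keep)


# Hypothetical scores for 16 key blocks.
scores = [0.1, 0.0, 0.2, 0.1, 0.9, 0.3, 0.8, 0.05,
          0.4, 0.2, 0.1, 0.6, 0.0, 0.7, 0.2, 0.1]
kept = select_key_blocks(scores, top_k=2)
print(kept)  # [0, 1, 2, 3, 4, 6, 12, 13, 14, 15]
print(f"kept {len(kept)} of {len(scores)} key blocks")
```

With a fixed number of kept blocks, the fraction of key blocks attended to shrinks as the sequence grows, which is consistent with the docs recommending 4K+ tokens to see a speedup.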

README_cn.md

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@
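 </p>

 ## 📣Latest Updates
+- [26/06/04] We have released **Stem**, a sparse attention algorithm that accelerates the **Prefill** stage of long-context LLMs by dynamically selecting top-k key blocks at block granularity for block-sparse attention, substantially reducing latency with nearly lossless generation quality. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/sparse_attention/stem.html)
 - [26/06/01] We have released **DFlare**, a block-diffusion speculative decoding framework based on layer-wise fusion, with an end-to-end speedup of up to **5.52×**. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/dflare.html)
 - [26/05/27] We have released **D-Cut**, an adaptive verification depth pruning technique for speculative decoding. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/dcut.html)
 - [26/05/20] We now support model distillation for full-precision HuggingFace models and **QAT-quantized** models; see the [documentation](https://angelslim.readthedocs.io/zh-cn/latest/features/distill/index.html) for detailed steps. 🔥🔥🔥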
Binary file, 443 KB (preview not rendered)

docs/source/features/sparse_attention/stem.md

Lines changed: 12 additions & 4 deletions
@@ -49,7 +49,15 @@ $$\text{score}(Q_i, K_j) = \frac{Q_i \cdot K_j^T}{\sqrt{d} \cdot s \cdot n} + \l
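 | **HPC precision** | bf16 (dense prefill), fp8 (block-sparse prefill, varlen / paged) |
 | **Sequence length** | No upper limit; 4K+ tokens recommended to see a speedup |

-## 4. Quick Start
+## 4. Performance Evaluation
+
+We evaluated how well Stem preserves accuracy on long-context and agent-style tasks. Under the **FP8-W8A8 + Stem** configuration, the model's scores on LongBench v2, CL-bench, CL-bench Life, SWE-bench Verified, Terminal-Bench 2.0, ClawEval, and other benchmarks are essentially on par with the BF16 baseline, and some tasks (such as ClawEval) even improve slightly, confirming that Stem sparse attention substantially accelerates Prefill with almost no loss in model quality.
+
+:::{image} /assets/stem/benchmark.png
+:alt: Accuracy comparison of Stem across multiple benchmarks (BF16 vs FP8-W8A8+Stem).
+:::
+
+## 5. Quick Start

 Make sure AngelSlim is installed (`pip install -e .` or `uv sync`), then run the following from the project root:
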
@@ -105,7 +113,7 @@ python tools/run_stem.py \
 bash scripts/sparsity/run_stem.sh /path/to/Qwen3-8B prompt.txt stem
 ```

-## 5. Parameters
+## 6. Parameters

 | Parameter | Default | Description |
 |------|--------|------|
@@ -120,7 +128,7 @@ bash scripts/sparsity/run_stem.sh /path/to/Qwen3-8B prompt.txt stem
 | `initial_blocks` | `4` | Number of leading blocks that are always kept (sink tokens) |
 | `window_size` | `4` | Number of trailing blocks kept by the sliding window |

-## 6. Code Structure
+## 7. Code Structure

 ```
 angelslim/compressor/sparsity/
@@ -143,7 +151,7 @@ tools/run_stem.py # inference entry point
 scripts/sparsity/run_stem.sh # launch script
 ```

-## 7. Python API
+## 8. Python API

 ```python
 from angelslim.compressor.sparsity import StemInference
