docs: add Stem news entry and benchmark results (#328)

searchthe · cleoncheng · web-flow · commit ceeec4401c07 · 2026-06-04T15:33:32.000+08:00
Co-authored-by: cleoncheng <cleoncheng@tencent.com>
diff --git a/README.md b/README.md
@@ -22,6 +22,7 @@ A more accessible, comprehensive, and efficient toolkit for large model compress
 </p>
 
 ## 📣Latest News
+- [26/06/04] We have released **Stem**, a sparse attention algorithm that accelerates the **Prefill** stage of long-context LLMs by dynamically selecting top-k key blocks for block-sparse attention, significantly reducing latency while preserving generation quality. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/sparse_attention/stem.html)
 - [26/06/01] We have released **DFlare**, a block-diffusion speculative decoding framework with layer-wise fusion that achieves up to **5.52× end-to-end speedup**. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/dflare.html)
 - [26/05/27] We have released **D-Cut**, an adaptive verification depth pruning technique for speculative decoding. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/dcut.html)
 - [26/05/20] We support Distillation for full-precision HuggingFace models and **quantized QAT-style** models, as detailed in the [distillation documentation](https://angelslim.readthedocs.io/zh-cn/latest/features/distill/index.html). 
diff --git a/README_cn.md b/README_cn.md
@@ -22,6 +22,7 @@
 </p>
 
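 ## 📣Latest News
+- [26/06/04] We have released **Stem**, a sparse attention algorithm that dynamically selects the top-k key blocks at block granularity and runs block-sparse attention over them, accelerating the **Prefill** stage of long-context LLMs and substantially reducing latency with nearly lossless generation quality. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/sparse_attention/stem.html)
 - [26/06/01] We have released **DFlare**, a block-diffusion speculative decoding framework based on layer-wise fusion, with an end-to-end speedup of up to **5.52×**. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/dflare.html)
 - [26/05/27] We have released **D-Cut**, an adaptive verification depth pruning technique for speculative decoding. [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/dcut.html)
 - [26/05/20] We now support model distillation for full-precision Hugging Face models and **QAT-quantized** models; see the [documentation](https://angelslim.readthedocs.io/zh-cn/latest/features/distill/index.html) for detailed steps. 🔥🔥🔥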
diff --git a/docs/source/assets/stem/benchmark.png b/docs/source/assets/stem/benchmark.png
diff --git a/docs/source/features/sparse_attention/stem.md b/docs/source/features/sparse_attention/stem.md
@@ -49,7 +49,15 @@ $$\text{score}(Q_i, K_j) = \frac{Q_i \cdot K_j^T}{\sqrt{d} \cdot s \cdot n} + \l
 | **HPC precision** | bf16 (dense prefill), fp8 (block-sparse prefill, varlen / paged) |
 | **Sequence length** | No upper limit; 4K+ tokens recommended for the speedup to be noticeable |
 
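-## 4. Quick Start
+## 4. Performance Evaluation
+
+We evaluated how well Stem preserves accuracy on long-context and agent-style tasks. With the **FP8-W8A8 + Stem** configuration, the model scores roughly on par with the BF16 baseline on multiple benchmarks, including LongBench v2, CL-bench, CL-bench Life, SWE-bench Verified, Terminal-Bench 2.0, and ClawEval, and scores slightly higher on some tasks (such as ClawEval), which validates that Stem sparse attention substantially accelerates Prefill with almost no loss in model quality.
+
+:::{image} /assets/stem/benchmark.png
+:alt: Accuracy comparison of Stem across multiple benchmarks (BF16 vs FP8-W8A8+Stem).
+:::
+
+## 5. Quick Start
 
 Make sure AngelSlim is installed (`pip install -e .` or `uv sync`), then run from the project root: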
 
@@ -105,7 +113,7 @@ python tools/run_stem.py \
 bash scripts/sparsity/run_stem.sh /path/to/Qwen3-8B prompt.txt stem
 ```
 
-## 5. Parameters
+## 6. Parameters
 
 | Parameter | Default | Description |
 |------|--------|------|
@@ -120,7 +128,7 @@ bash scripts/sparsity/run_stem.sh /path/to/Qwen3-8B prompt.txt stem
 | `initial_blocks` | `4` | Number of leading blocks that are always kept (sink tokens) |
 | `window_size` | `4` | Number of trailing blocks kept by the sliding window |
 
-## 6. Code Structure
+## 7. Code Structure
 
 ```
 angelslim/compressor/sparsity/
@@ -143,7 +151,7 @@ tools/run_stem.py                            # inference entry point
 scripts/sparsity/run_stem.sh             # launch script
 ```
 
-## 7. Python API
+## 8. Python API
 
 ```python
 from angelslim.compressor.sparsity import StemInference