Update README.md and README_cn.md speculative decoding benchmarks

root · root · commit e207913ed65c · 2026-01-08T19:06:25.000+08:00
diff --git a/README.md b/README.md
@@ -353,11 +353,7 @@ We evaluated the Eagle3 model trained by AngelSlim on tasks including code gener
 
 #### 1.1 Qwen3 Series Models
 
-**vLLM v0.11.2 Benchmark Results**
-
-We report benchmark results of the Qwen3 series models using the Eagle3 speculative decoding algorithm across multiple evaluation suites, including **MT-bench**, **HumanEval**, **GSM8K**, and **Alpaca**.
-All experiments were conducted on a single NVIDIA H20 GPU with the configuration:
-**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**.
+We report benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
 
 <table>
   <thead>
@@ -493,15 +489,11 @@ All experiments were conducted on a single NVIDIA H20 GPU with the configuration
   </tbody>
 </table>
 
-#### 1.2 VLM & Audio Models
+#### 1.2 VLM Models
 
 ##### 1.2.1 Qwen3-VL Series Models
 
-vLLM v0.12.0 Benchmark Results
-
-We report benchmark results of the Qwen3-VL series models using the Eagle3 speculative decoding algorithm across multiple evaluation suites, including **MT-bench**, **HumanEval**, **GSM8K**, **Alpaca**, **MATH-500**, and multimodal understanding tasks, including **MMMU**, **MMStar**. All experiments were conducted on a single NVIDIA H20 GPU with the configuration:
-**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
-
+We report benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 
 <table><thead>
   <tr>
@@ -643,11 +635,7 @@ We report benchmark results of the Qwen3-VL series models using the Eagle3 specu
 
 ##### 1.2.2 HunyuanOCR Model
 
-vLLM v0.13.0 Benchmark Results
-
-We report benchmark results of the HunyuanOCR using the Eagle3 speculative decoding algorithm across **OCR-Bench**. All experiments were conducted on a single NVIDIA H20 GPU with the configuration:
-**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
-
+We report benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across OCR tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 
 <table><thead>
   <tr>
@@ -678,18 +666,17 @@ We report benchmark results of the HunyuanOCR using the Eagle3 speculative decod
 </tbody>
 </table>
 
-##### 1.2.3 Qwen2-Audio Model
+#### 1.3 Audio Models
 
-vLLM v0.12.0 Benchmark Results
+##### 1.3.1 Qwen2-Audio Model
 
-We report benchmark results of the HunyuanOCR using the Eagle3 speculative decoding algorithm across **[librispeech_dev](https://www.openslr.org/12)** dataset. All experiments were conducted on a single NVIDIA H20 GPU with the configuration:
-**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
+We report benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0), evaluated on the **[LibriSpeech](https://www.openslr.org/12)** dataset with a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 
 <table><thead>
   <tr>
     <th>Model</th>
     <th>Method</th>
-    <th colspan="2">librispeech_dev</th>
+    <th colspan="2">LibriSpeech</th>
   </tr></thead>
 <tbody>
   <tr>
@@ -713,6 +700,40 @@ We report benchmark results of the HunyuanOCR using the Eagle3 speculative decod
 </tbody>
 </table>
 
+##### 1.3.2 Fun-CosyVoice3 Model
+
+We report benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding, evaluated on the **[LibriTTS](https://www.openslr.org/60/)** dataset with a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+
+<table><thead>
+  <tr>
+    <th>Model</th>
+    <th>Method</th>
+    <th colspan="2"><a href="https://www.openslr.org/60/">LibriTTS</a></th>
+  </tr></thead>
+<tbody>
+  <tr>
+    <td></td>
+    <td></td>
+    <td>throughput (tokens/s)</td>
+    <td>accept length</td>
+  </tr>
+  <tr>
+    <td>Fun-CosyVoice3</td>
+    <td>Vanilla</td>
+    <td>-</td>
+    <td>1</td>
+  </tr>
+  <tr>
+    <td></td>
+    <td>Eagle3</td>
+    <td>-</td>
+    <td>1.96</td>
+  </tr>
+</tbody>
+</table>
+
+> Fun-CosyVoice3 is adapted for inference with the Transformers backend, so only the accept length is reported.
+
 ### 2. Quantization
 
 The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html)
diff --git a/README_cn.md b/README_cn.md
@@ -492,10 +492,11 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
   </tbody>
 </table>
 
-#### 1.2 多模态理解 & 语音模型
+#### 1.2 多模态理解模型
 
 ##### 1.2.1 Qwen3-VL系列模型
-我们使用(v0.12.0)评测了Qwen3-VL系列Eagle3模型在语言理解任务: **MT-bench**、 **HumanEval**、 **GSM8K**、**Alpaca**、**MATH-500** 和多模态理解任务: **MMMU**、**MMStar** 等数据集上的接收长度和吞吐。全部结果都是在单张H20上用以下设置测得：**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
+
+我们使用vLLM (v0.12.0)评测了Qwen3-VL系列Eagle3模型在语言理解任务和多模态理解任务上的接收长度和吞吐。全部结果都是在单张H20上用以下设置测得：**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
 
 <table><thead>
   <tr>
@@ -636,6 +637,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 </tbody></table>
 
 ##### 1.2.2 HunyuanOCR模型
+
 我们使用(v0.13.0)评测了HunyuanOCR Eagle3模型在 **OCR-Bench** 上的接收长度和吞吐。结果是在单张H20上用以下设置测得：**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
 
 
@@ -668,15 +670,17 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 </tbody>
 </table>
 
-##### 1.2.3 Qwen2-Audio模型
+#### 1.3 语音模型
 
-我们使用(v0.12.0)评测了Qwen2-Audio Eagle3模型在[librispeech_dev](https://www.openslr.org/12)数据集上的接收长度和吞吐。结果是在单张H20上用以下设置测得：**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
+##### 1.3.1 Qwen2-Audio模型
+
+我们使用vLLM (v0.12.0)评测了Qwen2-Audio Eagle3模型在[LibriSpeech](https://www.openslr.org/12)数据集上的接收长度和吞吐。结果是在单张H20上用以下设置测得：**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
 
 <table><thead>
   <tr>
     <th>Model</th>
     <th>Method</th>
-    <th colspan="2">librispeech_dev</th>
+    <th colspan="2">LibriSpeech</th>
   </tr></thead>
 <tbody>
   <tr>
@@ -700,6 +704,39 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 </tbody>
 </table>
 
+##### 1.3.2 Fun-CosyVoice3模型
+我们评测了Fun-CosyVoice3 Eagle3模型在[LibriTTS](https://www.openslr.org/60/)数据集上的接收长度。结果是在单张H20上用以下设置测得：**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
+
+<table><thead>
+  <tr>
+    <th>Model</th>
+    <th>Method</th>
+    <th colspan="2"><a href="https://www.openslr.org/60/">LibriTTS</a></th>
+  </tr></thead>
+<tbody>
+  <tr>
+    <td></td>
+    <td></td>
+    <td>throughput (tokens/s)</td>
+    <td>accept length</td>
+  </tr>
+  <tr>
+    <td>Fun-CosyVoice3</td>
+    <td>Vanilla</td>
+    <td>-</td>
+    <td>1</td>
+  </tr>
+  <tr>
+    <td></td>
+    <td>Eagle3</td>
+    <td>-</td>
+    <td>1.96</td>
+  </tr>
+</tbody>
+</table>
+
+> Fun-CosyVoice3 is adapted for inference with the Transformers backend, so only the accept length is reported.
+
 ### 2、量化
 
 下面只展示了部分模型的效果测试情况，完整Benchmark可以参考[Benchmark文档](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html)
diff --git a/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png b/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png