diff --git a/README.md b/README.md
index 3875893a..48f0d510 100644
--- a/README.md
+++ b/README.md
@@ -180,7 +180,7 @@ A more accessible, comprehensive, and efficient toolkit for large model compress
@@ -341,13 +341,19 @@ For more detaileds, please refer to the [Deployment Documentation](https://angel
 ### 1. Speculative Decoding
 
-#### 1.1 Qwen3 Series Models
+We evaluated the Eagle3 models trained with AngelSlim on code generation, mathematical reasoning, instruction following, text generation, and multimodal understanding tasks using vLLM. The speedup and accept length of the trained models with num_speculative_tokens set to 2 or 4 are shown below.
+

+[Figure: AngelSlim (Eagle3 speedup and accept length)]

+
-**vLLM v0.11.2 Benchmark Results**
+#### 1.1 Qwen3 Series Models
 
-We report benchmark results of the Qwen3 series models using the Eagle3 speculative decoding algorithm across multiple evaluation suites, including **MT-bench**, **HumanEval**, **GSM8K**, and **Alpaca**.
-All experiments were conducted on a single NVIDIA H20 GPU with the configuration:
-**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**.
+Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
@@ -379,7 +385,7 @@ All experiments were conducted on a single NVIDIA H20 GPU with the configuration
 <td>378.86</td><td>1</td>
 <td>378.38</td><td>1</td>
 <td>390.53</td><td>1</td>
-<td>318.05</td><td>1</td>
+<td>381.05</td><td>1</td>
 <td>Eagle3</td>
@@ -387,7 +393,7 @@ All experiments were conducted on a single NVIDIA H20 GPU with the configuration
 <td>653.29</td><td>2.19</td>
 <td>680.1</td><td>2.2</td>
 <td>621.44</td><td>2.17</td>
-<td>642.93</td><td>2.18</td>
+<td>642.93</td><td>2.17</td>
@@ -483,6 +489,251 @@ All experiments were conducted on a single NVIDIA H20 GPU with the configuration
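+#### 1.2 VLM Models
+
+##### 1.2.1 Qwen3-VL Series Models
+
+Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+
+<table>
+<tr><th rowspan="2">Model</th><th rowspan="2">Method</th><th colspan="2">GSM8K</th><th colspan="2">Alpaca</th><th colspan="2">HumanEval</th><th colspan="2">MT-bench</th><th colspan="2">MATH-500</th><th colspan="2">MMMU</th><th colspan="2">MMStar</th></tr>
+<tr><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th></tr>
+<tr><td rowspan="2">Qwen3-VL-2B-Instruct</td><td>Vanilla</td><td>348.55</td><td>1</td><td>350.9</td><td>1</td><td>346.07</td><td>1</td><td>346.31</td><td>1</td><td>82.96</td><td>1</td><td>83.27</td><td>1</td><td>81.63</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>511.52</td><td>2.11</td><td>560.55</td><td>2.26</td><td>826.01</td><td>3.39</td><td>555.22</td><td>2.29</td><td>163.09</td><td>2.57</td><td>154.18</td><td>2.55</td><td>139.73</td><td>2.31</td></tr>
+<tr><td rowspan="2">Qwen3-VL-4B-Instruct</td><td>Vanilla</td><td>212.87</td><td>1</td><td>213.24</td><td>1</td><td>211.69</td><td>1</td><td>212.1</td><td>1</td><td>67.96</td><td>1</td><td>65.88</td><td>1</td><td>67.75</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>415.29</td><td>2.57</td><td>372.89</td><td>2.26</td><td>459.37</td><td>2.82</td><td>382.33</td><td>2.34</td><td>141.87</td><td>2.72</td><td>104.44</td><td>2.05</td><td>107.07</td><td>2.1</td></tr>
+<tr><td rowspan="2">Qwen3-VL-30B-A3B-Instruct</td><td>Vanilla</td><td>179.94</td><td>1</td><td>184.6</td><td>1</td><td>168.68</td><td>1</td><td>180.57</td><td>1</td><td>31.08</td><td>1</td><td>31.51</td><td>1</td><td>30.93</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>281.93</td><td>2.82</td><td>241.42</td><td>2.13</td><td>223.05</td><td>2.57</td><td>240.47</td><td>2.19</td><td>75.31</td><td>2.79</td><td>48.47</td><td>1.78</td><td>52.57</td><td>1.94</td></tr>
+</table>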
+
+##### 1.2.2 HunyuanOCR Model
+
+Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across OCR tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+
+<table>
+<tr><th rowspan="2">Model</th><th rowspan="2">Method</th><th colspan="2">OCR-Bench-Internal</th></tr>
+<tr><th>throughput (tokens/s)</th><th>accept length</th></tr>
+<tr><td rowspan="2">Hunyuan-OCR</td><td>Vanilla</td><td>71.21</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>120.75</td><td>2.2</td></tr>
+</table>
+
+#### 1.3 Audio Models
+
+##### 1.3.1 Qwen2-Audio Model
+
+Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) on the **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+
+<table>
+<tr><th rowspan="2">Model</th><th rowspan="2">Method</th><th colspan="2">LibriSpeech</th></tr>
+<tr><th>throughput (tokens/s)</th><th>accept length</th></tr>
+<tr><td rowspan="2">Qwen2-Audio-7B-Instruct</td><td>Vanilla</td><td>78.76</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>146.66</td><td>3.51</td></tr>
+</table>
+
+##### 1.3.2 Fun-CosyVoice3 Model
+
+Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding on the **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+
+<table>
+<tr><th rowspan="2">Model</th><th rowspan="2">Method</th><th colspan="2">LibriTTS</th></tr>
+<tr><th>throughput (tokens/s)</th><th>accept length</th></tr>
+<tr><td rowspan="2">Fun-CosyVoice3</td><td>Vanilla</td><td>-</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>-</td><td>1.96</td></tr>
+</table>
+
+> Fun-CosyVoice3 is adapted for inference on the Transformers backend, so only accept length is reported.
+
 ### 2. Quantization
 
 The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html)
diff --git a/README_cn.md b/README_cn.md
index 8b9d262e..e5ab1c07 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -181,7 +181,7 @@
@@ -345,6 +345,15 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 ### 1. Speculative Decoding
 
+We used vLLM to evaluate the Eagle3 models trained with AngelSlim on code, math, instruction following, text generation, and multimodal understanding tasks. The speedup and accept length of the trained models with num_speculative_tokens set to 2 or 4 are shown below.
+

+[Figure: AngelSlim (Eagle3 speedup and accept length)]

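+
 #### 1.1 Qwen3 Series Models
 
 We used vLLM (v0.11.2) to evaluate the accept length and throughput of the Qwen3 series Eagle3 models on **MT-bench**, **HumanEval**, **GSM8K**, and **Alpaca**. All results were measured on a single H20 GPU with the following settings: **tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**.
@@ -379,7 +388,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 <td>378.86</td><td>1</td>
 <td>378.38</td><td>1</td>
 <td>390.53</td><td>1</td>
-<td>318.05</td><td>1</td>
+<td>381.05</td><td>1</td>
 <td>Eagle3</td>
@@ -387,7 +396,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 <td>653.29</td><td>2.19</td>
 <td>680.1</td><td>2.2</td>
 <td>621.44</td><td>2.17</td>
-<td>642.93</td><td>2.18</td>
+<td>642.93</td><td>2.17</td>
@@ -483,6 +492,251 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
+#### 1.2 Multimodal Understanding Models
+
+##### 1.2.1 Qwen3-VL Series Models
+
+We used vLLM (v0.12.0) to evaluate the accept length and throughput of the Qwen3-VL series Eagle3 models on language understanding and multimodal understanding tasks. All results were measured on a single H20 GPU with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
+
+<table>
+<tr><th rowspan="2">Model</th><th rowspan="2">Method</th><th colspan="2">GSM8K</th><th colspan="2">Alpaca</th><th colspan="2">HumanEval</th><th colspan="2">MT-bench</th><th colspan="2">MATH-500</th><th colspan="2">MMMU</th><th colspan="2">MMStar</th></tr>
+<tr><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th><th>throughput (tokens/s)</th><th>accept length</th></tr>
+<tr><td rowspan="2">Qwen3-VL-2B-Instruct</td><td>Vanilla</td><td>348.55</td><td>1</td><td>350.9</td><td>1</td><td>346.07</td><td>1</td><td>346.31</td><td>1</td><td>82.96</td><td>1</td><td>83.27</td><td>1</td><td>81.63</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>511.52</td><td>2.11</td><td>560.55</td><td>2.26</td><td>826.01</td><td>3.39</td><td>555.22</td><td>2.29</td><td>163.09</td><td>2.57</td><td>154.18</td><td>2.55</td><td>139.73</td><td>2.31</td></tr>
+<tr><td rowspan="2">Qwen3-VL-4B-Instruct</td><td>Vanilla</td><td>212.87</td><td>1</td><td>213.24</td><td>1</td><td>211.69</td><td>1</td><td>212.1</td><td>1</td><td>67.96</td><td>1</td><td>65.88</td><td>1</td><td>67.75</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>415.29</td><td>2.57</td><td>372.89</td><td>2.26</td><td>459.37</td><td>2.82</td><td>382.33</td><td>2.34</td><td>141.87</td><td>2.72</td><td>104.44</td><td>2.05</td><td>107.07</td><td>2.1</td></tr>
+<tr><td rowspan="2">Qwen3-VL-30B-A3B-Instruct</td><td>Vanilla</td><td>179.94</td><td>1</td><td>184.6</td><td>1</td><td>168.68</td><td>1</td><td>180.57</td><td>1</td><td>31.08</td><td>1</td><td>31.51</td><td>1</td><td>30.93</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>281.93</td><td>2.82</td><td>241.42</td><td>2.13</td><td>223.05</td><td>2.57</td><td>240.47</td><td>2.19</td><td>75.31</td><td>2.79</td><td>48.47</td><td>1.78</td><td>52.57</td><td>1.94</td></tr>
+</table>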
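+
+##### 1.2.2 HunyuanOCR Model
+
+We used vLLM (v0.13.0) to evaluate the accept length and throughput of the HunyuanOCR Eagle3 model on **OCR-Bench**. Results were measured on a single H20 GPU with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
+
+<table>
+<tr><th rowspan="2">Model</th><th rowspan="2">Method</th><th colspan="2">OCR-Bench-Internal</th></tr>
+<tr><th>throughput (tokens/s)</th><th>accept length</th></tr>
+<tr><td rowspan="2">Hunyuan-OCR</td><td>Vanilla</td><td>71.21</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>120.75</td><td>2.2</td></tr>
+</table>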
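+
+#### 1.3 Audio Models
+
+##### 1.3.1 Qwen2-Audio Model
+
+We used vLLM (v0.12.0) to evaluate the accept length and throughput of the Qwen2-Audio Eagle3 model on the [LibriSpeech](https://www.openslr.org/12) dataset. Results were measured on a single H20 GPU with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
+
+<table>
+<tr><th rowspan="2">Model</th><th rowspan="2">Method</th><th colspan="2">LibriSpeech</th></tr>
+<tr><th>throughput (tokens/s)</th><th>accept length</th></tr>
+<tr><td rowspan="2">Qwen2-Audio-7B-Instruct</td><td>Vanilla</td><td>78.76</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>146.66</td><td>3.51</td></tr>
+</table>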
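+
+##### 1.3.2 Fun-CosyVoice3 Model
+We evaluated the accept length of the Fun-CosyVoice3 Eagle3 model on the [LibriTTS](https://www.openslr.org/60/) dataset. Results were measured on a single H20 GPU with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
+
+<table>
+<tr><th rowspan="2">Model</th><th rowspan="2">Method</th><th colspan="2">LibriTTS</th></tr>
+<tr><th>throughput (tokens/s)</th><th>accept length</th></tr>
+<tr><td rowspan="2">Fun-CosyVoice3</td><td>Vanilla</td><td>-</td><td>1</td></tr>
+<tr><td>Eagle3</td><td>-</td><td>1.96</td></tr>
+</table>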
+
+> Fun-CosyVoice3 is adapted for inference on the Transformers backend, so only accept length is reported.
+
 ### 2. Quantization
 
 Only the test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html)
diff --git a/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png b/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png
new file mode 100644
index 00000000..3a9d3780
Binary files /dev/null and b/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png differ
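
As a reading aid for the throughput columns in the benchmark tables of this diff, the minimal Python sketch below derives a speedup ratio for a few rows. It assumes speedup means Eagle3 throughput divided by Vanilla throughput, which the README does not state explicitly. The names `THROUGHPUT` and `speedup` are hypothetical and introduced only for illustration; the numbers are copied from the tables.

```python
# Throughput pairs in tokens/s, copied from the benchmark tables in this diff.
# Key: (model, benchmark); value: (vanilla_throughput, eagle3_throughput).
# THROUGHPUT and speedup are illustrative names, not part of AngelSlim.
THROUGHPUT = {
    ("Qwen3-VL-2B-Instruct", "GSM8K"): (348.55, 511.52),
    ("Hunyuan-OCR", "OCR-Bench-Internal"): (71.21, 120.75),
    ("Qwen2-Audio-7B-Instruct", "LibriSpeech"): (78.76, 146.66),
}


def speedup(vanilla: float, eagle3: float) -> float:
    """Return the assumed speedup: Eagle3 throughput over Vanilla throughput."""
    if vanilla <= 0:
        raise ValueError("vanilla throughput must be positive")
    return eagle3 / vanilla


if __name__ == "__main__":
    for (model, benchmark), (vanilla, eagle3) in THROUGHPUT.items():
        ratio = speedup(vanilla, eagle3)
        print(f"{model} on {benchmark}: {ratio:.2f}x")
```

Rows without a throughput value, such as Fun-CosyVoice3, cannot be used with this helper because only accept length is reported for them.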