diff --git a/README.md b/README.md
index 48f0d510..a7e466d1 100644
--- a/README.md
+++ b/README.md
@@ -170,6 +170,7 @@ A more accessible, comprehensive, and efficient toolkit for large model compress
@@ -341,7 +342,7 @@ For more detaileds, please refer to the [Deployment Documentation](https://angel
 ### 1. Speculative Decoding
-We evaluated the Eagle3 model trained by AngelSlim on tasks including code generation, mathematical reasoning, instruction following, text generation, and multimodal understanding using vLLM. The inference acceleration and context length performance of our trained model under the settings of num_speculative_tokens = 2 or 4 are presented as follows.
+We evaluated the Eagle3 model trained by AngelSlim on tasks including code generation, mathematical reasoning, instruction following, text generation, and multimodal understanding using vLLM. The inference acceleration and context length performance of our trained model under the settings of num_speculative_tokens = 2 or 4 are presented as follows, with an accept length of 1.8–3.5 and a maximum speedup of 1.4–1.9×.

@@ -636,13 +637,11 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
 ##### 1.2.2 HunyuanOCR Model
 Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across OCR tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 | Model | Method | OCR-Bench-Internal | accept length |
@@ -652,13 +651,12 @@ Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.1
 | Hunyuan-OCR | Vanilla | 71.21 | 1 |
 | | Eagle3 | 120.75 | 2.2 |
@@ -686,13 +684,12 @@ Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.
-| Qwen2-Audio-7B-Instruct | Vanilla | 78.76 | 1 |
+| Qwen2_Audio | Vanilla | 78.76 | 1 |
 | | Eagle3 | 146.66 | 3.51 |
@@ -708,7 +705,7 @@ Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **
 | Model | Method | LibriTTS | accept length |
@@ -718,13 +715,12 @@ Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **
 | Fun-CosyVoice3 | Vanilla | - | 1 |
 | | Eagle3 | - | 1.96 |
@@ -732,7 +728,7 @@ Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **
-> Adapted for Transformers backend inference, only displays accept length.
+> Adapted for Transformers backend inference, only displays accept length. vLLM speedup ~1.6×, estimated from baseline LLM speedup.
 ### 2. Quantization
diff --git a/README_cn.md b/README_cn.md
index e5ab1c07..a47839de 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -171,6 +171,7 @@

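@@ -345,7 +346,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 ### 1. Speculative Sampling
-We used vLLM to evaluate the Eagle3 model trained by AngelSlim on tasks including code, math, instruction following, text generation, and multimodal understanding. The speedup and accept length of our trained model with num_speculative_tokens=2 or 4 are shown below.
+We used vLLM to evaluate the Eagle3 model trained by AngelSlim on tasks including code, math, instruction following, text generation, and multimodal understanding. The speedup and accept length of our trained model with num_speculative_tokens=2 or 4 are shown below, with an accept length of 1.8-3.5 and a maximum speedup of 1.4-1.9×.
+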

@@ -640,13 +642,11 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 We evaluated the HunyuanOCR Eagle3 model on **OCR-Bench** for accept length and throughput using vLLM (v0.13.0). The results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
 | Model | Method | OCR-Bench-Internal | accept length |
@@ -656,13 +656,12 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 | Hunyuan-OCR | Vanilla | 71.21 | 1 |
 | | Eagle3 | 120.75 | 2.2 |
@@ -690,13 +689,12 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
-| Qwen2-Audio-7B-Instruct | Vanilla | 78.76 | 1 |
+| Qwen2_Audio | Vanilla | 78.76 | 1 |
 | | Eagle3 | 146.66 | 3.51 |
@@ -711,7 +709,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 | Model | Method | LibriTTS | accept length |
@@ -721,13 +719,12 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 | Fun-CosyVoice3 | Vanilla | - | 1 |
 | | Eagle3 | - | 1.96 |
@@ -735,7 +732,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
-> Adapted for Transformers backend inference, only displays accept length.
+> Adapted for Transformers backend inference, only displays accept length. vLLM speedup ~1.6×, estimated from baseline LLM speedup.
 ### 2. Quantization
diff --git a/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png b/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png
index 3a9d3780..ccaf1bf0 100644
Binary files a/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png and b/docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png differ
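
The speedup range quoted in the README text (1.4–1.9×) can be cross-checked against the throughput figures this diff reports for Hunyuan-OCR and Qwen2-Audio. The following is a minimal sketch, assuming the numeric column next to each method is throughput and that speedup is the ratio of Eagle3 throughput to Vanilla throughput. The diff does not state the unit, and the dictionary layout and the `speedup` helper are illustrative names, not part of AngelSlim.

```python
# Throughput values copied from the benchmark tables in this diff.
# Assumption: these are throughput numbers; the unit is not given in the diff.
throughput = {
    "Hunyuan-OCR": {"vanilla": 71.21, "eagle3": 120.75},
    "Qwen2-Audio": {"vanilla": 78.76, "eagle3": 146.66},
}


def speedup(vanilla: float, eagle3: float) -> float:
    """Return the Eagle3 throughput relative to the vanilla baseline."""
    return eagle3 / vanilla


# Hypothetical helper usage: one ratio per model.
speedups = {model: speedup(**values) for model, values in throughput.items()}

for model, ratio in speedups.items():
    print(f"{model}: {ratio:.2f}x")
```

Under these assumptions the ratios come out at roughly 1.70× and 1.86×, both inside the quoted 1.4–1.9× range. Fun-CosyVoice3 is left out because the tables report no throughput for it.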