Commit e207913

root authored and committed
update README.md for speculative decoding
1 parent 2e480ca commit e207913

3 files changed

Lines changed: 84 additions & 26 deletions

README.md

Lines changed: 42 additions & 21 deletions
@@ -353,11 +353,7 @@ We evaluated the Eagle3 model trained by AngelSlim on tasks including code gener
 
 #### 1.1 Qwen3 Series Models
 
-**vLLM v0.11.2 Benchmark Results**
-
-We report benchmark results of the Qwen3 series models using the Eagle3 speculative decoding algorithm across multiple evaluation suites, including **MT-bench**, **HumanEval**, **GSM8K**, and **Alpaca**.
-All experiments were conducted on a single NVIDIA H20 GPU with the configuration:
-**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**.
+We report benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
 
 <table>
 <thead>
@@ -493,15 +489,11 @@ All experiments were conducted on a single NVIDIA H20 GPU with the configuration
 </tbody>
 </table>
 
-#### 1.2 VLM & Audio Models
+#### 1.2 VLM Models
 
 ##### 1.2.1 Qwen3-VL Series Models
 
-vLLM v0.12.0 Benchmark Results
-
-We report benchmark results of the Qwen3-VL series models using the Eagle3 speculative decoding algorithm across multiple evaluation suites, including **MT-bench**, **HumanEval**, **GSM8K**, **Alpaca**, **MATH-500**, and multimodal understanding tasks, including **MMMU**, **MMStar**. All experiments were conducted on a single NVIDIA H20 GPU with the configuration:
-**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
-
+We report benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 
 <table><thead>
 <tr>
@@ -643,11 +635,7 @@ We report benchmark results of the Qwen3-VL series models using the Eagle3 specu
 
 ##### 1.2.2 HunyuanOCR Model
 
-vLLM v0.13.0 Benchmark Results
-
-We report benchmark results of the HunyuanOCR using the Eagle3 speculative decoding algorithm across **OCR-Bench**. All experiments were conducted on a single NVIDIA H20 GPU with the configuration:
-**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
-
+We report benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across OCR tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 
 <table><thead>
 <tr>
@@ -678,18 +666,17 @@ We report benchmark results of the HunyuanOCR using the Eagle3 speculative decod
 </tbody>
 </table>
 
-##### 1.2.3 Qwen2-Audio Model
+#### 1.3 Audio Models
 
-vLLM v0.12.0 Benchmark Results
+##### 1.3.1 Qwen2-Audio Model
 
-We report benchmark results of the HunyuanOCR using the Eagle3 speculative decoding algorithm across **[librispeech_dev](https://www.openslr.org/12)** dataset. All experiments were conducted on a single NVIDIA H20 GPU with the configuration:
-**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
+We report benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) across **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 
 <table><thead>
 <tr>
 <th>Model</th>
 <th>Method</th>
-<th colspan="2">librispeech_dev</th>
+<th colspan="2">LibriSpeech</th>
 </tr></thead>
 <tbody>
 <tr>
@@ -713,6 +700,40 @@ We report benchmark results of the HunyuanOCR using the Eagle3 speculative decod
 </tbody>
 </table>
 
+#### 1.3.2 Fun-CosyVoice3 Model
+
+We report benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+
+<table><thead>
+<tr>
+<th>Model</th>
+<th>Method</th>
+<th colspan="2"><a href="https://www.openslr.org/60/">LibriTTS</a></th>
+</tr></thead>
+<tbody>
+<tr>
+<td></td>
+<td></td>
+<td>throughput (tokens/s)</td>
+<td>accept length</td>
+</tr>
+<tr>
+<td>Fun-CosyVoice3</td>
+<td>Vanilla</td>
+<td>-</td>
+<td>1</td>
+</tr>
+<tr>
+<td></td>
+<td>Eagle3</td>
+<td>-</td>
+<td>1.96</td>
+</tr>
+</tbody>
+</table>
+
+> Adapted for Transformers backend inference, only displays acceptance rate.
+
 ### 2. Quantization
 
 The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html)
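
Reading the `accept length` column in the speculative decoding tables of this diff: the sketch below assumes the usual EAGLE-style convention, in which each verification step emits the accepted draft tokens plus one token from the target model. The Vanilla rows reporting `1` are consistent with that convention, but the README does not define the metric, so treat this as an illustration only. The helper names `mean_accept_length` and `draft_acceptance_rate` are hypothetical and are not part of AngelSlim or vLLM.

```python
from typing import Sequence


def mean_accept_length(accepted_per_step: Sequence[int],
                       num_speculative_tokens: int) -> float:
    """Average number of tokens emitted per target-model forward pass.

    Assumption: every step emits the accepted draft tokens plus one token
    produced by the target model, so plain decoding scores exactly 1.0.
    """
    if not accepted_per_step:
        raise ValueError("need at least one decoding step")
    for n in accepted_per_step:
        if not 0 <= n <= num_speculative_tokens:
            raise ValueError(f"accepted count {n} is outside "
                             f"[0, {num_speculative_tokens}]")
    return sum(n + 1 for n in accepted_per_step) / len(accepted_per_step)


def draft_acceptance_rate(accept_length: float,
                          num_speculative_tokens: int) -> float:
    """Fraction of proposed draft tokens that were accepted on average."""
    return (accept_length - 1.0) / num_speculative_tokens


# Four hypothetical steps with num_speculative_tokens=4.
steps = [0, 1, 2, 1]
print(mean_accept_length(steps, 4))

# The Fun-CosyVoice3 row reports an accept length of 1.96 with 4 draft tokens.
print(round(draft_acceptance_rate(1.96, 4), 2))
```

Under this reading, an accept length of 1.96 with `num_speculative_tokens=4` means that just under one of the four drafted tokens is accepted per step on average, and the upper bound for the column would be 5.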

README_cn.md

Lines changed: 42 additions & 5 deletions
@@ -492,10 +492,11 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
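 </tbody>
 </table>
 
-#### 1.2 Multimodal Understanding & Speech Models
+#### 1.2 Multimodal Understanding Models
 
 ##### 1.2.1 Qwen3-VL Series Models
-Using vLLM (v0.12.0), we evaluated the accept length and throughput of the Qwen3-VL series Eagle3 models on language understanding tasks (**MT-bench**, **HumanEval**, **GSM8K**, **Alpaca**, **MATH-500**) and multimodal understanding tasks (**MMMU**, **MMStar**). All results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**
+
+Using vLLM (v0.12.0), we evaluated the accept length and throughput of the Qwen3-VL series Eagle3 models on language understanding and multimodal understanding tasks. All results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**
 
 <table><thead>
 <tr>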
@@ -636,6 +637,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 </tbody></table>
 
 ##### 1.2.2 HunyuanOCR Model
+
 Using vLLM (v0.13.0), we evaluated the accept length and throughput of the HunyuanOCR Eagle3 model on **OCR-Bench**. Results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**
 
 
@@ -668,15 +670,17 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 </tbody>
 </table>
 
-##### 1.2.3 Qwen2-Audio Model
+#### 1.3 Speech Models
 
-Using vLLM (v0.12.0), we evaluated the accept length and throughput of the Qwen2-Audio Eagle3 model on the [librispeech_dev](https://www.openslr.org/12) dataset. Results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**
+##### 1.3.1 Qwen2-Audio Model
+
+Using vLLM (v0.12.0), we evaluated the accept length and throughput of the Qwen2-Audio Eagle3 model on the [LibriSpeech](https://www.openslr.org/12) dataset. Results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**
 
 <table><thead>
 <tr>
 <th>Model</th>
 <th>Method</th>
-<th colspan="2">librispeech_dev</th>
+<th colspan="2">LibriSpeech</th>
 </tr></thead>
 <tbody>
 <tr>
@@ -700,6 +704,39 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 </tbody>
 </table>
 
+#### 1.3.2 Fun-CosyVoice3 Model
+We evaluated the accept length of the Fun-CosyVoice3 Eagle3 model on the [LibriTTS](https://www.openslr.org/60/) dataset. Results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**
+
+<table><thead>
+<tr>
+<th>Model</th>
+<th>Method</th>
+<th colspan="2"><a href="https://www.openslr.org/60/">LibriTTS</a></th>
+</tr></thead>
+<tbody>
+<tr>
+<td></td>
+<td></td>
+<td>throughput (tokens/s)</td>
+<td>accept length</td>
+</tr>
+<tr>
+<td>Fun-CosyVoice3</td>
+<td>Vanilla</td>
+<td>-</td>
+<td>1</td>
+</tr>
+<tr>
+<td></td>
+<td>Eagle3</td>
+<td>-</td>
+<td>1.96</td>
+</tr>
+</tbody>
+</table>
+
+> Adapted for Transformers backend inference, only displays acceptance rate.
+
 ### 2. Quantization
 
 Only results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html)
-3.23 KB