From 5774f961bf31e56dcafea31b77bea863ef5b3d4f Mon Sep 17 00:00:00 2001
From: root
Date: Tue, 13 Jan 2026 04:08:35 +0800
Subject: [PATCH 1/5] update vlm & audio eagle3 doc & readme

---
 README.md                                          |   2 +
 README_cn.md                                       |   2 +
 .../speculative_decoding/eagle/audio_eagle.md      | 190 ++++++++++++++
 .../eagle/audio_tts_eagle.md                       | 150 +++++++++++
 .../speculative_decoding/{ => eagle}/eagle.md      |   9 +-
 .../speculative_decoding/eagle/index.md            |  17 ++
 .../speculative_decoding/eagle/vlm_eagle.md        | 246 ++++++++++++++++++
 .../features/speculative_decoding/index.md         |   2 +-
 .../speculative_decoding/benchmarks.md             | 238 ++++++++++++++++-
 .../vllm_offline_eagle3_qwen2_audio_bench.py       |   4 +-
 tools/vllm_offline_eagle3_vlm_batch.py             |   9 +-
 11 files changed, 857 insertions(+), 12 deletions(-)
 create mode 100644 docs/source/features/speculative_decoding/eagle/audio_eagle.md
 create mode 100644 docs/source/features/speculative_decoding/eagle/audio_tts_eagle.md
 rename docs/source/features/speculative_decoding/{ => eagle}/eagle.md (96%)
 create mode 100644 docs/source/features/speculative_decoding/eagle/index.md
 create mode 100644 docs/source/features/speculative_decoding/eagle/vlm_eagle.md

diff --git a/README.md b/README.md
index 432844eb..1d097b91 100644
--- a/README.md
+++ b/README.md
@@ -233,6 +233,8 @@ bash scripts/speculative/train_eagle3_online.sh
 
 For detailed training configurations and vLLM performance benchmarks of Eagle3, please refer to the [Quick Start Guide for Speculative Sampling](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html#id5).
 
+Eagle3 training and deployment guides for LLM, VLM, and audio (ASR & TTS) models: [Qwen3 series](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [Qwen3-VL series & HunyuanOCR](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Qwen2Audio](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_eagle.html) | [Fun-CosyVoice3](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).
+
 #### 2.2 LLM/VLM Model Quantization
 
 After installing `AngelSlim`, you can launch static FP8 quantization for the Qwen3-1.7B model with the following one-command script:
diff --git a/README_cn.md b/README_cn.md
index 51fcdea3..372c3e80 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -234,6 +234,8 @@ bash scripts/speculative/train_eagle3_online.sh
 
 For detailed training configurations and vLLM performance benchmarks of `Eagle3`, see the speculative sampling [Quick Start Guide](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html#id5).
 
+Eagle3 training and deployment guides for LLM / VLM / Audio (ASR & TTS) models: [Qwen3 series](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [Qwen3-VL series & HunyuanOCR](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Qwen2Audio](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_eagle.html) | [Fun-CosyVoice3](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).
+
 #### 2.2 LLM/VLM Model Quantization
 
 After installing `AngelSlim`, you can get started quickly with the following script, which performs static `FP8` quantization of the `Qwen3-1.7B` model:
diff --git a/docs/source/features/speculative_decoding/eagle/audio_eagle.md b/docs/source/features/speculative_decoding/eagle/audio_eagle.md
new file mode 100644
index 00000000..d771a140
--- /dev/null
+++ b/docs/source/features/speculative_decoding/eagle/audio_eagle.md
@@ -0,0 +1,190 @@
+# EAGLE3 for Speech Understanding Models
+
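+[Eagle3](https://arxiv.org/pdf/2503.01840) is one of the most widely used and most effective speculative decoding algorithms.
+This project covers Eagle3 training and benchmarking, and open-sources the [Eagle3 weights](https://huggingface.co/collections/AngelSlim/eagle3) for Qwen2Audio.
+
+See [benchmarks](../../../performance/speculative_decoding/benchmarks.md) for the performance of our Qwen2Audio Eagle3 model;
+all numbers were measured with vLLM inference on a single H20 GPU.
+
+## 1. Supported Models
+- `Qwen2Audio`
+
+## 2. Data Preparation
+
+### 2.1 Data Format
+
+All data must be stored in jsonl files. The training data format is as follows:
+
+- Example: AngelSlim/dataset/librispeech_test/librispeech_eval_10_test.jsonl
+
+  ```json
+  {"id": 5910, "conversations": [{"role": "user", "content": [{"type": "audio", "audio": "./audios/1580-141083-0008.flac"}, {"type": "text", "text": "Detect the language and recognize the speech: <|en|>"}]}, {"role": "assistant", "content": [{"type": "text", "text": "THE PROOF WAS IN THREE LONG SLIPS I HAD LEFT THEM ALL TOGETHER"}]}]}
+  ```
+
+- Key fields:
+  - id: unique conversation ID
+  - conversations: messages in OpenAI chat format
+  - audio: path to the corresponding audio file
+
+### 2.2 Regenerating Training Data (Recommended)
+
+To obtain high-quality SFT data that matches the target model, we recommend regenerating the training data with the target model: save the model's outputs in a jsonl file and store the corresponding audio files in the same directory, using the format above.
+
+You can generate training data tailored to your own use case; the vLLM-based workflow below is provided as a reference:
+
+**Step 1: Start the vLLM server**
+
+First, start a vLLM server to serve the target model:
+
+```shell
+bash scripts/speculative/run_vllm_server.sh
+```
+
+**Server configuration:**
+- The script launches a vLLM inference service for the target base model
+- Wait until the server has started successfully before generating data
+- Adjust the parameters in the script (e.g., vLLM launch arguments, number of GPUs) to suit different target models
+
+**Step 2: Generate sampled data**
+
+Once the vLLM server is running, generate training data with `scripts/speculative/generate_data_for_target_model.sh`:
+
+```shell
+bash scripts/speculative/generate_data_for_target_model.sh
+```
+
+**What the script does:**
+- Queries the target base model through the vLLM server to sample responses for the input data
+- Writes the training dataset in `.jsonl` format
+- The output is used for subsequent online training of the Eagle model
+
+**Script parameters:**
+
+Configure the following parameters in the script before running it:
+
+- `DATA_NAME_OR_PATH`: HF name or local path of the input dataset
+- `OUTPUT_DIR`: output path for the generated dataset
+- `DATA_FORMAT`: format of the input dataset (sharegpt|ultrachat)
+- `DATA_SHARD_SIZE`: shard size of the generated dataset
+- `BASE_PORT`: port of the vLLM server
+
+**Notes:**
+- Make sure the vLLM server has started successfully and is running
+- Data generation can take a long time, depending on the number of samples and the model size
+
+
+## 3. Training the Eagle3 Model
+
+Qwen2Audio currently supports online training only. Online training suits setups with enough GPU memory, a moderately sized target model, and training contexts that are not extremely long.
+
+### 3.1 Online Training
+
+Train the Eagle3 model online with the following script:
+
+```shell
+bash scripts/speculative/qwen2_audio/train_eagle3_audio_online.sh
+```
+
+**Script parameters:**
+
+Configure the following parameters in the script before running it:
+
+- `TARGET_MODEL_NAME_OR_PATH`: HF name or local path of the target model
+- `DRAFT_MODEL_CONFIG_PATH`: path to the draft model config
+- `TRAIN_DATA_PATH`: path to the training data
+- `EVAL_DATA_PATH`: path to the validation data
+- `OUTPUT_DIR`: output path for the Eagle3 model
+- `MODEL_MAX_LENGTH`: maximum length of the training data
+- `CHAT_TEMPLATE_TYPE`: chat template type of the target model
+
+## 4. Benchmarking
+
+AngelSlim provides an Eagle3 benchmark script for Qwen2Audio on the vLLM backend to measure the speedup from speculative decoding.
+
+### 4.1 vLLM Benchmark
+
+> vLLM integration reference: [Support Eagle3 for Qwen2Audio](https://github.com/irisliu10/vllm/tree/eagle3_infer)
+
+#### 4.1.1 Basic Usage
+
+Run the speculative decoding benchmark with `tools/vllm_offline_eagle3_qwen2_audio_bench.py`:
+
+```shell
+python3 tools/vllm_offline_eagle3_qwen2_audio_bench.py \
+  --target_model ${BASE_MODEL_PATH} \
+  --draft_model ${EAGLE_MODEL_PATH} \
+  --output_file ${OUTPUT_FILE} \
+  --use_eagle
+```
+
+#### 4.1.2 Parameters
+
+**Model configuration:**
+- `--target_model`: path to the base (target) model (required)
+- `--draft_model`: path to the Eagle draft model (required with `--use_eagle`)
+
+**Benchmark configuration:**
+- `--test_data_path`: path to the test jsonl file; default: "dataset/librispeech_test/librispeech_eval_10_test.jsonl"
+- `--use_eagle`: run Eagle3 speculative inference; off by default
+- `--output_file`: path of the output results file
+- `--num_prompts`: number of test prompts; default 100
+
+**Generation parameters:**
+- `--temp`: sampling temperature; default 0
+- `--max_model_len`: maximum context length; default 16384
+- `--output_len`: maximum number of generated tokens; default 1024
+- `--max_num_seqs`: maximum number of sequences per iteration (batch size); default 1
+- `--num_spec_tokens`: number of speculative tokens proposed by the draft model per step; default 2
+
+**Hardware configuration:**
+- `--tp`: tensor parallel size; default 1
+
+**Other settings:**
+- `--seed`: random seed
+
+#### 4.1.3 Examples
+
+Test data layout: store all samples in a jsonl file, with the corresponding audio files in the same directory, for example:
+```
+└── librispeech_test
+    ├── librispeech_eval_10_test.jsonl
+    └── audios
+        ├── xxx.flac
+        └── xxx.flac
+```
+
+**Speculative decoding run:**
+```shell
+python3 tools/vllm_offline_eagle3_qwen2_audio_bench.py \
+  --target_model Qwen/Qwen2-Audio-7B-Instruct \
+  --draft_model "$EAGLE_DIR" \
+  --use_eagle \
+  --num_spec_tokens 4 \
+  --num_prompts 10 \
+  --temp 0 \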
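+  --max_num_seqs 1 \
+  --output_len 1024 \
+  --output_file "$OUTPUT_FILE"
+```
+
+**Baseline run:**
+```shell
+python3 tools/vllm_offline_eagle3_qwen2_audio_bench.py \
+  --target_model Qwen/Qwen2-Audio-7B-Instruct \
+  --num_prompts 10 \
+  --temp 0 \
+  --max_num_seqs 1 \
+  --output_len 1024 \
+  --output_file "$OUTPUT_FILE"
+```
+
+#### 4.1.4 Performance Report
+
+When the run finishes, the tool generates a performance report that includes:
+- A performance comparison between speculative decoding and the baseline
+- Speedup statistics
+- Generation quality metrics (if enabled)
+
+Results are written to the path given by `--output_file` for later analysis and comparison.
+
+Full vLLM benchmark results are available in [Benchmark](../../../performance/speculative_decoding/benchmarks.md).
\ No newline at end of file
diff --git a/docs/source/features/speculative_decoding/eagle/audio_tts_eagle.md b/docs/source/features/speculative_decoding/eagle/audio_tts_eagle.md
new file mode 100644
index 00000000..90558a8f
--- /dev/null
+++ b/docs/source/features/speculative_decoding/eagle/audio_tts_eagle.md
@@ -0,0 +1,150 @@
+# EAGLE3 for Speech Synthesis Models
+
+[Eagle3](https://arxiv.org/pdf/2503.01840) is one of the most widely used and most effective speculative decoding algorithms.
+This project covers Eagle3 training and benchmarking, and open-sources the [Eagle3 weights](https://huggingface.co/collections/AngelSlim/eagle3) for Fun-CosyVoice3.
+
+See [benchmarks](../../../performance/speculative_decoding/benchmarks.md) for the performance of our Fun-CosyVoice3 Eagle3 model.
+
+## 1. Supported Models
+- `Fun-CosyVoice3`
+
+## 2. Data Preparation
+
+### 2.1 Data Format
+
+All data must be stored in `jsonl` files. Reference examples:
+- Original training data: `AngelSlim/dataset/tts_fake_data/train.jsonl`
+- Regenerated training data: `AngelSlim/dataset/tts_fake_data/train_regenerate.jsonl`
+
+### 2.2 Original Training Data
+
+Fields in each line:
+- `text`: input text
+- `audio_path`: absolute path to the ground-truth audio
+- `instruct`: text description of the speaker
+- `instruct_audio_path`: absolute path to the speaker reference audio
+
+### 2.3 Regenerated Training Data (Recommended)
+
+To obtain high-quality SFT data that matches the target model, we recommend regenerating the training data with the target model and saving the speech tokens generated by its LLM in a `jsonl` file. Fields in each line:
+- `text`: input text
+- `audio_tokens`: generated speech tokens
+- `instruct`: text description of the speaker
+- `instruct_audio_path`: absolute path to the speaker reference audio
+
+
+## 3. Training the Eagle3 Model
+
+Only online training is currently supported.
+
+### 3.1 Online Training
+
+Train the Eagle3 model online with the following script:
+
+```shell
+bash scripts/speculative/train_eagle3_tts_online.sh
+```
+
+**Script parameters:**
+
+Configure the following parameters in the script before running it:
+
+- `TARGET_MODEL_NAME_OR_PATH`: HF name or local path of the target model
+- `DRAFT_MODEL_CONFIG_PATH`: path to the draft model config
+- `TRAIN_DATA_PATH`: path to the training data
+- `OUTPUT_DIR`: output path for the Eagle3 model
+- `MODEL_MAX_LENGTH`: maximum length of the training data
+
+
+## 4. Benchmarking
+
+AngelSlim provides an HF-backend Eagle3 benchmark script for measuring the gains from speculative decoding.
+
+### 4.1 HF Benchmark
+
+For `Fun-CosyVoice3`, only the average acceptance length can be measured, and only with the HF backend.
+
+#### 4.1.1 Basic Usage
+
+Run the speculative decoding benchmark with `tools/spec_benchmark.py`:
+
+```shell
+python3 tools/spec_benchmark.py \
+  --base-model-path ${BASE_MODEL_PATH} \
+  --eagle-model-path ${EAGLE_MODEL_PATH} \
+  --model-id ${MODEL_ID} \
+  --mode both
+```
+
+#### 4.1.2 Parameters
+
+**Model configuration:**
+- `--base-model-path`: path to the base model (required)
+- `--eagle-model-path`: path to the Eagle draft model (required)
+- `--model-id`: model identifier (required)
+
+**Benchmark configuration:**
+- `--bench-name`: benchmark dataset name, e.g. `tts_fake_data`
+- `--mode`: execution mode: `eagle` (speculative decoding only), `baseline` (baseline only), or `both`; default `both`
+- `--output-dir`: output directory for results
+
+**Generation parameters:**
+- `--temperature`: sampling temperature; default 1.0
+- `--max-new-token`: maximum number of new tokens; default 1024
+- `--total-token`: total number of nodes in the draft tree; default 60
+- `--depth`: depth of the draft tree; default 5
+- `--top-k`: top-k sampling; default 10
+- `--generate-audio`: synthesize the final audio
+
+**Hardware configuration:**
+- `--num-gpus-per-model`: number of GPUs per model; default 1
+- `--num-gpus-total`: total number of GPUs; default 1
+- `--max-gpu-memory`: maximum memory per GPU
+
+**Other settings:**
+- `--seed`: random seed; default 42
+- `--question-begin`: index of the first question (for debugging)
+- `--question-end`: index of the last question (for debugging)
+- `--no-metrics`: skip automatic metric computation
+
+**Notes:**
+- `--bench-name`: to add a custom test set, create a subdirectory under `AngelSlim/dataset`, add a `question.jsonl` file to it, and pass the subdirectory name as `--bench-name`; the framework reads that file automatically
+- `--temperature`: `Fun-CosyVoice3` tends to generate many repeated tokens at `temperature` 0, so keep the default when benchmarking
+
+#### 4.1.3 Examples
+
+**Full benchmark:**
+```shell
+python3 tools/spec_benchmark.py \
+  --base-model-path /path/to/base/model \
+  --eagle-model-path /path/to/eagle/model \
+  --model-id cosyvoice3 \
+  --mode both \
+  --output-dir ./results \
+  --deploy-backend pytorch_tts \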
+  --generate-audio
+```
+
+**Notes:**
+- With `--generate-audio`, `Fun-CosyVoice3` also requires the `cosyvoice` package. Install it as follows:
+  ```shell
+  git clone --recursive https://github.com/FunAudioLLM/CosyVoice
+  pip install hyperpyyaml omegaconf conformer diffusers hydra-core lightning gdown matplotlib wget x-transformers pyworld librosa
+  ```
+
+  Then run the benchmark with:
+  ```shell
+  export PYTHONPATH=/xxx/CosyVoice:/xxx/CosyVoice/third_party/Matcha-TTS:$PYTHONPATH
+  python3 tools/spec_benchmark.py [ARGS]
+  ```
+
+**Without audio generation:**
+```shell
+python3 tools/spec_benchmark.py \
+  --base-model-path /path/to/base/model \
+  --eagle-model-path /path/to/eagle/model \
+  --model-id cosyvoice3 \
+  --mode both \
+  --output-dir ./results \
+  --deploy-backend pytorch_tts
+```
\ No newline at end of file
diff --git a/docs/source/features/speculative_decoding/eagle.md b/docs/source/features/speculative_decoding/eagle/eagle.md
similarity index 96%
rename from docs/source/features/speculative_decoding/eagle.md
rename to docs/source/features/speculative_decoding/eagle/eagle.md
index 8770c8d2..22f02747 100644
--- a/docs/source/features/speculative_decoding/eagle.md
+++ b/docs/source/features/speculative_decoding/eagle/eagle.md
@@ -1,9 +1,10 @@
-# EAGLE
+# EAGLE3
+
 [Eagle3](https://arxiv.org/pdf/2503.01840) is one of the most widely used and most effective speculative decoding algorithms.
 This project covers Eagle3 training and benchmarking, and open-sources the [Eagle3 weights](https://huggingface.co/collections/AngelSlim/eagle3) for the Qwen3 and Hunyuan series.
 
-See [benchmarks](../../performance/speculative_decoding/benchmarks.md) for the performance of our Qwen3-series Eagle3 models;
-all numbers were measured with PyTorch inference on a single H20 GPU.
+See [benchmarks](../../../performance/speculative_decoding/benchmarks.md) for the performance of our Qwen3-series Eagle3 models;
+all numbers were measured with vLLM inference on a single H20 GPU.
 
 ## 1. Data Generation
@@ -237,4 +238,4 @@ python3 vllm_spec_benchmark.py \
   --max-num-seqs "$BATCH_SIZE"
 ```
 
-Full vLLM benchmark results are available in [Benchmark](../../performance/speculative_decoding/benchmarks.md).
\ No newline at end of file
+Full vLLM benchmark results are available in [Benchmark](../../../performance/speculative_decoding/benchmarks.md).
\ No newline at end of file
diff --git a/docs/source/features/speculative_decoding/eagle/index.md b/docs/source/features/speculative_decoding/eagle/index.md
new file mode 100644
index 00000000..91259941
--- /dev/null
+++ b/docs/source/features/speculative_decoding/eagle/index.md
@@ -0,0 +1,17 @@
+# EAGLE
+
+[Eagle3](https://arxiv.org/pdf/2503.01840) is one of the most widely used and most effective speculative decoding algorithms.
+This project covers Eagle3 training and benchmarking, and open-sources [Eagle3 weights](https://huggingface.co/collections/AngelSlim/eagle3) for models including Hunyuan, HunyuanOCR, Qwen3, Qwen3-VL, Qwen2Audio, and Fun-CosyVoice3.
+
+See [benchmarks](../../../performance/speculative_decoding/benchmarks.md) for the performance of our Eagle3 models;
+throughput numbers were measured with vLLM inference on a single H20 GPU (Fun-CosyVoice3 reports acceptance length only, measured with the HF backend).
+
+:::{toctree}
+:caption: Contents
+:maxdepth: 1
+
+eagle
+vlm_eagle
+audio_eagle
+audio_tts_eagle
+:::
diff --git a/docs/source/features/speculative_decoding/eagle/vlm_eagle.md b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md
new file mode 100644
index 00000000..90898769
--- /dev/null
+++ b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md
@@ -0,0 +1,246 @@
+# EAGLE3 for Vision-Language Models
+
+[Eagle3](https://arxiv.org/pdf/2503.01840) is one of the most widely used and most effective speculative decoding algorithms.
+This project covers Eagle3 training and benchmarking, and open-sources the [Eagle3 weights](https://huggingface.co/collections/AngelSlim/eagle3) for HunyuanOCR and the Qwen3-VL series.
+
+See [benchmarks](../../../performance/speculative_decoding/benchmarks.md) for the performance of our HunyuanOCR and Qwen3-VL Eagle3 models;
+all numbers were measured with vLLM inference on a single H20 GPU.
+## 1. Supported Models
+- `HunyuanOCR`
+- `Qwen3-VL`
+
+## 2. Data Preparation
+
+### 2.1 Data Format
+All data must be stored in jsonl files. The training data format is as follows:
+
+- Examples:
+
+  ```json
+  # Text to text
+  {"id": "0", "conversations": [{"role": "user", "content": [{"type": "text", "text": "xxx"}]}, {"role": "assistant", "content": [{"type": "text", "text": "xxx"}]}]}
+
+  # Image to text
+  {"id": "1", "conversations": [{"role": "user", "content": [{"type": "image", "image": "xxx/images/0001.jpeg"}, {"type": "text", "text": "xxx"}]}, {"role": "assistant", "content": [{"type": "text", "text": "xxxx"}]}]}
+  ```
+
+- Key fields:
+  - id: unique conversation ID
+  - conversations: messages in OpenAI chat format
+  - image: path to the corresponding image file
+
+
+### 2.2 Data Generation
+
+Data generation has two parts: 1) sampling data from the target model, and 2) generating the target model's hidden states offline for Eagle3 training.
+
+#### 2.2.1 Sampling Data from the Target Model
+
+This step is optional: skip it if you already have enough high-quality SFT data for the target model. If the training data does not match the target model, regenerate it with the target model.
+
+**Step 1: Start the vLLM server**
+
+First, start a vLLM server to serve the target model:
+
+```shell
+bash scripts/speculative/run_vllm_server.sh
+```
+
+**Server configuration:**
+- The script launches a vLLM inference service for the target base model
+- Wait until the server has started successfully before generating data
+- Adjust the parameters in the script (e.g., vLLM launch arguments, number of GPUs) to suit different target models
+
+**Step 2: Generate sampled data**
+
+Once the vLLM server is running, generate training data with `scripts/speculative/generate_data_for_target_model.sh`:
+
+```shell
+bash scripts/speculative/generate_data_for_target_model.sh
+```
+
+**What the script does:**
+- Queries the target base model through the vLLM server to sample responses for the input data
+- Writes the training dataset in `.jsonl` format
+- The output is used for subsequent online training of the Eagle model
+
+**Script parameters:**
+
+Configure the following parameters in the script before running it:
+
+- `DATA_NAME_OR_PATH`: HF name or local path of the input dataset
+- `OUTPUT_DIR`: output path for the generated dataset
+- `DATA_FORMAT`: format of the input dataset (sharegpt|ultrachat)
+- `DATA_SHARD_SIZE`: shard size of the generated dataset
+- `BASE_PORT`: port of the vLLM server
+
+**Notes:**
+- Make sure the vLLM server has started successfully and is running
+- Data generation can take a long time, depending on the number of samples and the model size
+
+
+#### 2.2.2 Generating Hidden States for Eagle3
+
+Hidden states can currently be generated only with the HF backend, using the following scripts:
+```shell
+# For HunyuanOCR
+bash scripts/speculative/hunyuan_ocr/generate_vlm_hidden_for_draft_model.sh
+# For Qwen3-VL series
+bash scripts/speculative/qwen3_vl/generate_vlm_hidden_for_draft_model.sh
+```
+
+**Script parameters:**
+
+Configure the following parameters in the script before running it:
+
+- `DATASET_PATH`: HF name or local path of the input dataset
+- `MODEL_NAME`: HF name or local path of the target model
+- `TARGET_BACKEND`: backend for the target model; only HF is currently supported
+- `MODEL_MAX_LENGTH`: context length of the generated data
+- `CHAT_TEMPLATE_TYPE`: chat template type of the target model; currently qwen3_vl or hunyuan_vl
+- `OUTPUT_DIR`: output path for the generated dataset
+
+
+## 3. Training the Eagle3 Model
+
+Two training modes are supported. Online training suits setups with enough GPU memory, a moderately sized target model, and training contexts that are not extremely long;
+offline training suits large target models, long-context training, and setups with ample disk space.
+
+### 3.1 Online Training
+
+Train the Eagle3 model online with the following script:
+
+```shell
+# For HunyuanOCR
+bash scripts/speculative/hunyuan_ocr/train_eagle3_vlm_online.sh
+# For Qwen3-VL series
+bash scripts/speculative/qwen3_vl/train_eagle3_vlm_online.sh
+```
+
+**Script parameters:**
+
+Configure the following parameters in the script before running it:
+
+- `TARGET_MODEL_NAME_OR_PATH`: HF name or local path of the target model
+- `DRAFT_MODEL_CONFIG_PATH`: path to the draft model config
+- `TRAIN_DATA_PATH`: path to the training data
+- `EVAL_DATA_PATH`: path to the validation data
+- `OUTPUT_DIR`: output path for the Eagle3 model
+- `MODEL_MAX_LENGTH`: maximum length of the training data
+- `CHAT_TEMPLATE_TYPE`: chat template type of the target model
+
+### 3.2 Offline Training
+
+Complete Section 2.2.2 (generating hidden states for Eagle3) before offline training.
+Train the Eagle3 model offline with the following script:
+
+```shell
+# For HunyuanOCR
+bash scripts/speculative/hunyuan_ocr/train_eagle3_vlm_offline.sh
+# For Qwen3-VL series
+bash scripts/speculative/qwen3_vl/train_eagle3_vlm_offline.sh
+```
+
+**Script parameters:**
+
+Configure the following parameters in the script before running it:
+
+- `TARGET_MODEL_NAME_OR_PATH`: HF name or local path of the target model
+- `DRAFT_MODEL_CONFIG_PATH`: path to the draft model config
+- `TRAIN_DATA_PATH`: path to the training data (.jsonl)
+- `TRAIN_HIDDEN_PATH`: path to the training hidden states
+- `EVAL_HIDDEN_PATH`: path to the validation hidden states
+- `OUTPUT_DIR`: output path for the Eagle3 model
+- `MODEL_MAX_LENGTH`: maximum length of the training data
+- `CHAT_TEMPLATE_TYPE`: chat template type of the target model
+- `LM_HEAD_KEY`: weight key of the target model's LM head, as listed in model.safetensors.index.json. It can be omitted when the key is lm_head.weight (the default) and must be set when it is model.embed_tokens.weight (e.g., models with tied word embeddings).
+- `RUN_NAME`: run name shown in wandb when `report_to` is set to wandb.
+
+
+## 4. Benchmarking
+
+AngelSlim provides Eagle3 benchmark scripts for HunyuanOCR and the Qwen3-VL series on the vLLM backend to measure the speedup from speculative decoding.
+
+### 4.1 vLLM Benchmark
+
+> vLLM integration reference: [Support Eagle3 for HunyuanOCR & Qwen3-VL](https://github.com/irisliu10/vllm/tree/eagle3_infer)
+
+#### 4.1.1 Basic Usage
+
+Run the speculative decoding benchmark with `tools/vllm_offline_eagle3_vlm_batch.py`:
+
+```shell
+python3 tools/vllm_offline_eagle3_vlm_batch.py \
+  --target_model ${BASE_MODEL_PATH} \
+  --draft_model ${EAGLE_MODEL_PATH} \
+  --output_file ${OUTPUT_FILE} \
+  --use_eagle
+```
+
+
+#### 4.1.2 Parameters
+
+**Model configuration:**
+- `--target_model`: path to the base (target) model (required)
+- `--draft_model`: path to the Eagle draft model (required with `--use_eagle`)
+
+**Benchmark configuration:**
+- `--dataset`: benchmark dataset; default `lmms-lab/textvqa`; options: `lmms-lab/textvqa`, `MMMU/MMMU`, `Lin-Chen/MMStar`, `opendatalab/OmniDocBench`
+- `--use_eagle`: run Eagle3 speculative inference; off by default
+- `--output_file`: path of the output results file
+- `--num_prompts`: number of test prompts; default 100
+
+**Generation parameters:**
+- `--temp`: sampling temperature; default 0
+- `--max_model_len`: maximum context length; default 32768
+- `--output_len`: maximum number of generated tokens; default 1024
+- `--max_num_seqs`: maximum number of sequences per iteration (batch size); default 1
+- `--num_spec_tokens`: number of speculative tokens proposed by the draft model per step; default 2
+
+**Hardware configuration:**
+- `--tp`: tensor parallel size; default 1
+
+**Other settings:**
+- `--seed`: random seed
+
+#### 4.1.3 Examples
+
+
+**Speculative decoding run:**
+```shell
+python3 tools/vllm_offline_eagle3_vlm_batch.py \
+  --target_model Qwen/Qwen3-VL-2B-Instruct \
+  --draft_model "$EAGLE_DIR" \
+  --use_eagle \
+  --num_spec_tokens 4 \
+  --dataset "$task" \
+  --num_prompts 80 \
+  --temp 0 \
+  --max_num_seqs 1 \
+  --output_len 1024 \
+  --output_file "$OUTPUT_FILE"
+```
+
+**Baseline run:**
+```shell
+python3 tools/vllm_offline_eagle3_vlm_batch.py \
+  --target_model Qwen/Qwen3-VL-2B-Instruct \
+  --num_spec_tokens 4 \
+  --dataset "$task" \
+  --num_prompts 80 \
+  --temp 0 \
+  --max_num_seqs 1 \
+  --output_len 1024 \
+  --output_file "$OUTPUT_FILE"
+```
+
+#### 4.1.4 Performance Report
+
+When the run finishes, the tool generates a performance report that includes:
+- A performance comparison between speculative decoding and the baseline
+- Speedup statistics
+- Generation quality metrics (if enabled)
+
+Results are written to the path given by `--output_file` for later analysis and comparison.
+
+Full vLLM benchmark results are available in [Benchmark](../../../performance/speculative_decoding/benchmarks.md).
\ No newline at end of file
diff --git a/docs/source/features/speculative_decoding/index.md b/docs/source/features/speculative_decoding/index.md
index 54cca113..79e0d157 100644
--- a/docs/source/features/speculative_decoding/index.md
+++ b/docs/source/features/speculative_decoding/index.md
@@ -6,6 +6,6 @@
 :caption: Contents
 :maxdepth: 1
 
-eagle
+eagle/index
 spec_exit
 :::
diff --git a/docs/source/performance/speculative_decoding/benchmarks.md b/docs/source/performance/speculative_decoding/benchmarks.md
index 8021449f..ea032c82 100644
--- a/docs/source/performance/speculative_decoding/benchmarks.md
+++ b/docs/source/performance/speculative_decoding/benchmarks.md
@@ -2,13 +2,13 @@
 
 ## Eagle3
 
-### Qwen3 Series Models
+### 1. Qwen3 Series Models
 
 | Model | Method | GSM8K | | Alpaca | | HumanEval | | MT-bench | | Mean | |
 | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
 | | | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** |
-| **Qwen3-1.7B** | Vanilla | 376.42 | 1 | 378.86 | 1 | 378.38 | 1 | 390.53 | 1 | 318.05 | 1 |
-| | Eagle3 | 616.9 | 2.13 | 653.29 | 2.19 | 680.1 | 2.2 | 621.44 | 2.17 | 642.93 | 2.18 |
+| **Qwen3-1.7B** | Vanilla | 376.42 | 1 | 378.86 | 1 | 378.38 | 1 | 390.53 | 1 | 381.05 | 1 |
+| | Eagle3 | 616.9 | 2.13 | 653.29 | 2.19 | 680.1 | 2.2 | 621.44 | 2.17 | 642.93 | 2.17 |
 | **Qwen3-4B** | Vanilla | 229.05 | 1 | 235.29 | 1 | 234.66 | 1 | 234.04 | 1 | 233.26 | 1 |
 | | Eagle3 | 389.35 | 2.07 | 395.97 | 2.1 | 377.84 | 2.08 | 384.6 | 2.07 | 386.94 | 2.08 |
 | **Qwen3-8B** | Vanilla | 149.63 | 1 | 149.93 | 1 | 153.85 | 1 | 153.81 | 1 | 151.81 | 1 |
@@ -18,4 +18,234 @@
 | **Qwen3-32B** | Vanilla | 43.39 | 1 | 43.38 | 1 | 43.19 | 1 | 43.3 | 1 | 43.32 | 1 |
 | | Eagle3 | 80.43 | 2.01 | 72.49 | 1.9 | 71.57 | 1.86 | 74.1 | 1.86 | 74.1 | 1.91 |
 | **Qwen3-30B-A3B** | Vanilla | 311.84 | 1 | 320.43 | 1 | 325.77 | 1 | 325.42 | 1 | 320.87 | 1 |
-| | Eagle3 | 453.97 | 2.1 | 432.45 | 2.04 | 428.81 | 2.02 | 437.06 | 2.01 | 438.07 | 2.04 |
\ No newline at end of file
+| | Eagle3 | 453.97 | 2.1 | 432.45 | 2.04 | 428.81 | 2.02 | 437.06 | 2.01 | 438.07 | 2.04 |
+
+### 2. VLM Models
+#### 2.1 Qwen3-VL Series Models
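+| Model | Method | **GSM8K** | | **Alpaca** | | **HumanEval** | | **MT-bench** | | **MATH-500** | | **MMMU** | | **MMStar** | |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| | | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** |
+| **Qwen3-VL-2B-Instruct** | Vanilla | 348.55 | 1 | 350.9 | 1 | 346.07 | 1 | 346.31 | 1 | 82.96 | 1 | 83.27 | 1 | 81.63 | 1 |
+| | Eagle3 | 511.52 | 2.11 | 560.55 | 2.26 | 826.01 | 3.39 | 555.22 | 2.29 | 163.09 | 2.57 | 154.18 | 2.55 | 139.73 | 2.31 |
+| **Qwen3-VL-4B-Instruct** | Vanilla | 212.87 | 1 | 213.24 | 1 | 211.69 | 1 | 212.1 | 1 | 67.96 | 1 | 65.88 | 1 | 67.75 | 1 |
+| | Eagle3 | 415.29 | 2.57 | 372.89 | 2.26 | 459.37 | 2.82 | 382.33 | 2.34 | 141.87 | 2.72 | 104.44 | 2.05 | 107.07 | 2.1 |
+| **Qwen3-VL-30B-A3B-Instruct** | Vanilla | 179.94 | 1 | 184.6 | 1 | 168.68 | 1 | 180.57 | 1 | 31.08 | 1 | 31.51 | 1 | 30.93 | 1 |
+| | Eagle3 | 281.93 | 2.82 | 241.42 | 2.13 | 223.05 | 2.57 | 240.47 | 2.19 | 75.31 | 2.79 | 48.47 | 1.78 | 52.57 | 1.94 |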
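Speedup is not listed in the table above, but it can be derived as the ratio of Eagle3 throughput to Vanilla throughput on the same benchmark. The sketch below does this for the Qwen3-VL-2B-Instruct rows; the numbers are copied from the table, and the helper `speedups` is an illustrative name, not part of AngelSlim's tools.

```python
# Throughput numbers (tokens/s) copied from the Qwen3-VL-2B-Instruct rows above.
vanilla = {
    "GSM8K": 348.55, "Alpaca": 350.9, "HumanEval": 346.07, "MT-bench": 346.31,
    "MATH-500": 82.96, "MMMU": 83.27, "MMStar": 81.63,
}
eagle3 = {
    "GSM8K": 511.52, "Alpaca": 560.55, "HumanEval": 826.01, "MT-bench": 555.22,
    "MATH-500": 163.09, "MMMU": 154.18, "MMStar": 139.73,
}


def speedups(base: dict[str, float], spec: dict[str, float]) -> dict[str, float]:
    """Return the spec/base throughput ratio per benchmark, rounded to 2 decimals."""
    return {name: round(spec[name] / base[name], 2) for name in base}


if __name__ == "__main__":
    for name, ratio in speedups(vanilla, eagle3).items():
        print(f"{name:>10}: {ratio:.2f}x")
```

On these rows the ratio ranges from about 1.47x (GSM8K) to 2.39x (HumanEval), and the two extremes line up with the lowest and highest accept lengths in the table (2.11 and 3.39).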
+
+#### 2.2 HunyuanOCR Model
+| Model | Method | **OmniDocBench** | |
+|---|---|---|---|
+| | | **throughput (tokens/s)** | **accept length** |
+| **Hunyuan-OCR** | Vanilla | 70.12 | 1 |
+| | Eagle3 | 108.1 | 2.08 |
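The accept-length columns are not defined on this page. In EAGLE-style evaluations the figure is usually the average number of tokens emitted per target-model verification step: the draft tokens the target model accepts, plus the one token it produces itself. That is why every Vanilla row shows 1. The sketch below assumes this convention; it is not taken from AngelSlim's benchmark scripts, and `accept_length` and its `num_spec_tokens` bound are hypothetical names chosen for illustration.

```python
from typing import Optional, Sequence


def accept_length(accepted_per_step: Sequence[int],
                  num_spec_tokens: Optional[int] = None) -> float:
    """Average number of tokens emitted per target-model verification step.

    accepted_per_step[i] is how many draft tokens the target model accepted
    in step i. Each step also emits one token from the target model itself,
    so plain decoding (no accepted drafts) scores exactly 1.0.
    num_spec_tokens, if given, is the upper bound on accepted drafts per step.
    """
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    for n in accepted_per_step:
        if n < 0 or (num_spec_tokens is not None and n > num_spec_tokens):
            raise ValueError(f"invalid accepted count: {n}")
    emitted = sum(n + 1 for n in accepted_per_step)  # +1 for the target's own token
    return emitted / len(accepted_per_step)


if __name__ == "__main__":
    # Four steps with a draft budget of 4 tokens: 8 tokens emitted in 4 steps.
    print(accept_length([1, 2, 0, 1], num_spec_tokens=4))  # 2.0
```

Under this convention, HunyuanOCR's 2.08 means each target forward pass emitted about two tokens on average. The measured throughput gain is lower (108.1 / 70.12, about 1.54x), partly because drafting itself costs time.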
+
+### 3. Audio Models
+
+#### 3.1 Qwen2-Audio Model
+| Model | Method | **LibriSpeech** | |
+|---|---|---|---|
+| | | **throughput (tokens/s)** | **accept length** |
+| **Qwen2_Audio** | Vanilla | 78.76 | 1 |
+| | Eagle3 | 146.66 | 3.51 |
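The LibriSpeech rows were produced from jsonl records in the format documented in `audio_eagle.md`. Each record has an `id`, OpenAI-style `conversations`, and `audio` entries whose paths are relative to the directory that holds the jsonl file. The sketch below sanity-checks one such record and resolves its audio paths before a run, assuming that layout. `audio_paths` is a hypothetical helper, not part of AngelSlim.

```python
import json
from pathlib import Path


def audio_paths(jsonl_line: str, jsonl_dir: str) -> list[Path]:
    """Return the audio files referenced by one jsonl record.

    Relative paths such as "./audios/x.flac" are resolved against the
    directory that contains the jsonl file.
    """
    record = json.loads(jsonl_line)
    missing = {"id", "conversations"} - set(record)
    if missing:
        raise ValueError(f"record is missing {sorted(missing)}")
    paths = []
    for message in record["conversations"]:
        content = message.get("content", [])
        if isinstance(content, str):  # plain-text messages carry no audio
            continue
        for item in content:
            if item.get("type") == "audio":
                paths.append(Path(jsonl_dir) / item["audio"])
    return paths


if __name__ == "__main__":
    line = json.dumps({"id": 5910, "conversations": [
        {"role": "user", "content": [
            {"type": "audio", "audio": "./audios/1580-141083-0008.flac"},
            {"type": "text", "text": "Detect the language and recognize the speech: <|en|>"}]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "THE PROOF WAS IN THREE LONG SLIPS I HAD LEFT THEM ALL TOGETHER"}]}]})
    for p in audio_paths(line, "dataset/librispeech_test"):
        print(p, "exists" if p.exists() else "missing")
```

Checking `Path.exists()` up front catches a misplaced `audios/` directory before the benchmark loads the model.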
+
+#### 3.2 Fun-CosyVoice3 Model
+| Model | Method | **LibriTTS** | |
+|---|---|---|---|
+| | | **throughput (tokens/s)** | **accept length** |
+| **Fun-CosyVoice3** | Vanilla | - | 1 |
+| | Eagle3 | - | 1.96 |
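Training data for the Fun-CosyVoice3 draft model uses one of the two jsonl layouts described in `audio_tts_eagle.md`. Original data carries an `audio_path`; regenerated data carries the target model's `audio_tokens`. The sketch below tells the two apart and reports missing fields. It assumes `audio_tokens` is stored as a list of integer token IDs, which the doc does not spell out. `record_kind` is a hypothetical helper.

```python
import json

COMMON_KEYS = {"text", "instruct", "instruct_audio_path"}
ORIGINAL_KEYS = COMMON_KEYS | {"audio_path"}       # ground-truth audio on disk
REGENERATED_KEYS = COMMON_KEYS | {"audio_tokens"}  # tokens sampled from the target model


def record_kind(jsonl_line: str) -> str:
    """Classify one TTS training record as "original" or "regenerated"."""
    record = json.loads(jsonl_line)
    keys = set(record)
    if REGENERATED_KEYS <= keys:
        tokens = record["audio_tokens"]
        # Assumption: speech tokens are stored as a list of integer IDs.
        if not isinstance(tokens, list) or not all(isinstance(t, int) for t in tokens):
            raise ValueError("audio_tokens should be a list of integer token IDs")
        return "regenerated"
    if ORIGINAL_KEYS <= keys:
        return "original"
    # Report whichever layout is closer to complete.
    missing = min(ORIGINAL_KEYS - keys, REGENERATED_KEYS - keys, key=len)
    raise ValueError(f"record is missing {sorted(missing)}")


if __name__ == "__main__":
    sample = {"text": "hello", "audio_tokens": [12, 7, 301],
              "instruct": "female, calm", "instruct_audio_path": "/data/spk.wav"}
    print(record_kind(json.dumps(sample)))  # regenerated
```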
\ No newline at end of file
diff --git a/tools/vllm_offline_eagle3_qwen2_audio_bench.py b/tools/vllm_offline_eagle3_qwen2_audio_bench.py
index 25d4ac2d..2edaebf9 100644
--- a/tools/vllm_offline_eagle3_qwen2_audio_bench.py
+++ b/tools/vllm_offline_eagle3_qwen2_audio_bench.py
@@ -116,7 +116,9 @@ def parse_args():
         "--num_spec_tokens", type=int, default=2, help="Number of speculative tokens"
     )
     parser.add_argument("--max_num_seqs", type=int, default=1)
-    parser.add_argument("--max_model_len", type=int, default=1024)
+    parser.add_argument(
+        "--max_model_len", type=int, default=16384, help="Maximum model length"
+    )
     parser.add_argument(
         "--num_prompts", type=int, default=100, help="Number of prompts to run"
     )
diff --git a/tools/vllm_offline_eagle3_vlm_batch.py b/tools/vllm_offline_eagle3_vlm_batch.py
index 32d1e0e0..1027213f 100755
--- a/tools/vllm_offline_eagle3_vlm_batch.py
+++ b/tools/vllm_offline_eagle3_vlm_batch.py
@@ -55,7 +55,9 @@ def parse_args():
     parser.add_argument(
         "--draft_model", type=str, default=None, help="Path to draft model"
     )
-    parser.add_argument("--dataset", type=str, default="textvqa", help="Dataset to use")
+    parser.add_argument(
+        "--dataset", type=str, default="lmms-lab/textvqa", help="Dataset to use"
+    )
     parser.add_argument(
         "--use_eagle",
         action="store_true",
@@ -74,6 +76,9 @@ def parse_args():
         "--num_spec_tokens", type=int, default=2, help="Number of speculative tokens"
     )
     parser.add_argument("--max_num_seqs", type=int, default=1)
+    parser.add_argument(
+        "--max_model_len", type=int, default=32768, help="Maximum model length"
+    )
     parser.add_argument(
         "--temp", type=float, default=0, help="Number of speculative tokens"
     )
@@ -196,7 +201,7 @@ def main():
         max_num_seqs=args.max_num_seqs,
         enforce_eager=True,
         disable_log_stats=False,
-        max_model_len=32768,
+        max_model_len=args.max_model_len,
         limit_mm_per_prompt={"image": 1},
         disable_chunked_mm_input=False,
     )

From c13b9dd52b3dbdd2a47e9722ced828813b86fcbd Mon Sep 17 00:00:00 2001
From: root
Date: Tue, 13 Jan 2026 10:25:04 +0800
Subject: [PATCH 2/5] add transformers descrip for qwen3_vl hidden states dump

---
 docs/source/features/speculative_decoding/eagle/vlm_eagle.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/features/speculative_decoding/eagle/vlm_eagle.md b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md
index 90898769..171481d1 100644
--- a/docs/source/features/speculative_decoding/eagle/vlm_eagle.md
+++ b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md
@@ -88,6 +88,7 @@ bash scripts/speculative/hunyuan_ocr/generate_vlm_hidden_for_draft_model.sh
 # For Qwen3-VL series
 bash scripts/speculative/qwen3_vl/generate_vlm_hidden_for_draft_model.sh
 ```
+> Note: generating hidden states for the Qwen3-VL series requires the latest transformers, installed from source: `pip install git+https://github.com/huggingface/transformers.git`
 
 **Script parameters:**
 

From 9e99c6be73238fbfe2c522b3dded1a1dfc718f4c Mon Sep 17 00:00:00 2001
From: root
Date: Tue, 13 Jan 2026 11:09:31 +0800
Subject: [PATCH 3/5] update readme & benchmarks

---
 README.md                                          |  22 +-
 README_cn.md                                       |  23 +-
 .../speculative_decoding/benchmarks.md             | 239 ++----------------
 3 files changed, 62 insertions(+), 222 deletions(-)

diff --git a/README.md b/README.md
index 1d097b91..32b97fc9 100644
--- a/README.md
+++ b/README.md
@@ -233,7 +233,7 @@ bash scripts/speculative/train_eagle3_online.sh
 
 For detailed training configurations and vLLM performance benchmarks of Eagle3, please refer to the [Quick Start Guide for Speculative Sampling](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html#id5).
 
-Eagle3 training and deployment guides for LLM, VLM, and audio (ASR & TTS) models: [Qwen3 series](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [Qwen3-VL series & HunyuanOCR](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Qwen2Audio](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_eagle.html) | [Fun-CosyVoice3](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).
+Eagle3 training and deployment guides for LLM, VLM, and audio (ASR & TTS) models: [LLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [VLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Audio(ASR)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_eagle.html) | [Audio(TTS)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).
 
 #### 2.2 LLM/VLM Model Quantization
 
@@ -500,7 +500,7 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
@@ -509,6 +509,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
@@ -528,6 +530,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
@@ -546,6 +550,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
@@ -563,6 +569,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
@@ -581,6 +589,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
@@ -598,6 +608,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
@@ -616,6 +628,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
@@ -633,6 +647,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
Model Method GSM8K AlpacaMATH-500 MMMU MMStarMean
accept length throughput (tokens/s) accept lengththroughput (tokens/s)accept length
Qwen3-VL-2B-Instruct1 81.63 1234.241
Eagle32.55 139.73 2.31415.762.5
Qwen3-VL-4B-Instruct1 67.75 1150.211
Eagle32.05 107.07 2.1283.322.41
Qwen3-VL-30B-A3B-Instruct1 30.93 1115.331
Eagle31.78 52.57 1.94166.172.32
@@ -686,7 +702,7 @@ Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.
 accept length
-Qwen2_Audio
+Qwen2-Audio
 Vanilla
 78.76
 1
diff --git a/README_cn.md b/README_cn.md
index 372c3e80..5b3adcc3 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -234,8 +234,7 @@ bash scripts/speculative/train_eagle3_online.sh
 
 For detailed training configurations and vLLM performance benchmarks of `Eagle3`, see the speculative sampling [Quick Start Guide](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html#id5).
 
-Eagle3 training and deployment guides for LLM / VLM / Audio (ASR & TTS) models: [Qwen3 series](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [Qwen3-VL series & HunyuanOCR](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Qwen2Audio](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_eagle.html) | [Fun-CosyVoice3](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).
-
+Eagle3 training and deployment guides for LLM / VLM / Audio (ASR & TTS) models: [LLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [VLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Audio(ASR)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_eagle.html) | [Audio(TTS)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).
 #### 2.2 LLM/VLM Model Quantization
 
 After installing `AngelSlim`, you can get started quickly with the following script, which performs static `FP8` quantization of the `Qwen3-1.7B` model:
@@ -504,7 +503,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
@@ -513,6 +512,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
@@ -532,6 +533,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
@@ -550,6 +553,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
@@ -567,6 +572,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
@@ -585,6 +592,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
@@ -602,6 +611,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
@@ -620,6 +631,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
@@ -637,6 +650,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
Model Method GSM8K AlpacaMATH-500 MMMU MMStarMean
accept length throughput (tokens/s) accept lengththroughput (tokens/s)accept length
Qwen3-VL-2B-Instruct1 81.63 1234.241
Eagle32.55 139.73 2.31415.762.5
Qwen3-VL-4B-Instruct1 67.75 1150.211
Eagle32.05 107.07 2.1283.322.41
Qwen3-VL-30B-A3B-Instruct1 30.93 1115.331
Eagle31.78 52.57 1.94166.172.32
@@ -691,7 +706,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta accept length - Qwen2_Audio + Qwen2-Audio Vanilla 78.76 1 diff --git a/docs/source/performance/speculative_decoding/benchmarks.md b/docs/source/performance/speculative_decoding/benchmarks.md index ea032c82..1ba1ae7c 100644 --- a/docs/source/performance/speculative_decoding/benchmarks.md +++ b/docs/source/performance/speculative_decoding/benchmarks.md @@ -23,229 +23,38 @@ ### 2. VLM Models #### 2.1 Qwen3-VL Series Models - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ModelMethodGSM8KAlpacaHumanEvalMT-benchMATH-500MMMUMMStar
throughput (tokens/s)accept lengththroughput (tokens/s)accept lengththroughput (tokens/s)accept lengththroughput (tokens/s)accept lengththroughput (tokens/s)accept lengththroughput (tokens/s)accept lengththroughput (tokens/s)accept length
Qwen3-VL-2B-InstructVanilla348.551350.91346.071346.31182.96183.27181.631
Eagle3511.522.11560.552.26826.013.39555.222.29163.092.57154.182.55139.732.31
Qwen3-VL-4B-InstructVanilla212.871213.241211.691212.1167.96165.88167.751
Eagle3415.292.57372.892.26459.372.82382.332.34141.872.72104.442.05107.072.1
Qwen3-VL-30B-A3B-InstructVanilla179.941184.61168.681180.57131.08131.51130.931
Eagle3281.932.82241.422.13223.052.57240.472.1975.312.7948.471.7852.571.94
+| Model | Method | **GSM8K** | | **Alpaca** | | **HumanEval** | | **MT-bench** | | **MATH-500** | | **MMMU** | | **MMStar** | | **Mean** | |
+|-------------------------------|---------|---------------------------|-------------------|---------------------------|-------------------|---------------------------|-------------------|---------------------------|-------------------|---------------------------|-------------------|---------------------------|-------------------|---------------------------|-------------------|---------------------------|-------------------|
+| | | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** | **throughput (tokens/s)** | **accept length** |
+| **Qwen3-VL-2B-Instruct** | Vanilla | 348.55 | 1 | 350.9 | 1 | 346.07 | 1 | 346.31 | 1 | 82.96 | 1 | 83.27 | 1 | 81.63 | 1 | 234.24 | 1 |
+| | Eagle3 | 511.52 | 2.11 | 560.55 | 2.26 | 826.01 | 3.39 | 555.22 | 2.29 | 163.09 | 2.57 | 154.18 | 2.55 | 139.73 | 2.31 | 415.76 | 2.5 |
+| **Qwen3-VL-4B-Instruct** | Vanilla | 212.87 | 1 | 213.24 | 1 | 211.69 | 1 | 212.1 | 1 | 67.96 | 1 | 65.88 | 1 | 67.75 | 1 | 150.21 | 1 |
+| | Eagle3 | 415.29 | 2.57 | 372.89 | 2.26 | 459.37 | 2.82 | 382.33 | 2.34 | 141.87 | 2.72 | 104.44 | 2.05 | 107.07 | 2.1 | 283.32 | 2.41 |
+| **Qwen3-VL-30B-A3B-Instruct** | Vanilla | 179.94 | 1 | 184.6 | 1 | 168.68 | 1 | 180.57 | 1 | 31.08 | 1 | 31.51 | 1 | 30.93 | 1 | 115.33 | 1 |
+| | Eagle3 | 281.93 | 2.82 | 241.42 | 2.13 | 223.05 | 2.57 | 240.47 | 2.19 | 75.31 | 2.79 | 48.47 | 1.78 | 52.57 | 1.94 | 166.17 | 2.32 |
#### 2.2 HunyuanOCR Model - - - - - - - - - - - - - - - - - - - - - - - - -
ModelMethodOmniDocBench
throughput (tokens/s)accept length
Hunyuan-OCRVanilla70.121
Eagle3108.12.08
+| Model | Method | OmniDocBench | |
+|-------------|---------|-----------------------|---------------|
+| | | **throughput (tokens/s)** | **accept length** |
+| **Hunyuan-OCR** | Vanilla | 70.12 | 1 |
+| | Eagle3 | 108.1 | 2.08 |
### 3. Audio Models
#### 3.1 Qwen2-Audio Model - - - - - - - - - - - - - - - - - - - - - - - - -
ModelMethodLibriSpeech
throughput (tokens/s)accept length
Qwen2_AudioVanilla78.761
Eagle3146.663.51
+| Model | Method | LibriSpeech | |
+|-----------------|---------|---------------------------|-------------------|
+| | | **throughput (tokens/s)** | **accept length** |
+| **Qwen2-Audio** | Vanilla | 78.76 | 1 |
+| | Eagle3 | 146.66 | 3.51 |
#### 3.2 Fun-CosyVoice3 Model - - - - - - - - - - - - - - - - - - - - - - - - -
ModelMethodLibriTTS
throughput (tokens/s)accept length
Fun-CosyVoice3Vanilla-1
Eagle3-1.96
\ No newline at end of file +| Model | Method | LibriTTS | | +|--------------------|---------|---------------------------|-------------------| +| | | **throughput (tokens/s)** | **accept length** | +| **Fun-CosyVoice3** | Vanilla | - | 1 | +| | Eagle3 | - | 1.96 | \ No newline at end of file From 06582ec70cf597f5f200c587eef5b3bbdfe801ac Mon Sep 17 00:00:00 2001 From: root Date: Tue, 13 Jan 2026 11:21:35 +0800 Subject: [PATCH 4/5] update vLLM PR --- .../eagle/{audio_eagle.md => audio_asr_eagle.md} | 2 +- docs/source/features/speculative_decoding/eagle/index.md | 2 +- docs/source/features/speculative_decoding/eagle/vlm_eagle.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) rename docs/source/features/speculative_decoding/eagle/{audio_eagle.md => audio_asr_eagle.md} (99%) diff --git a/docs/source/features/speculative_decoding/eagle/audio_eagle.md b/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md similarity index 99% rename from docs/source/features/speculative_decoding/eagle/audio_eagle.md rename to docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md index d771a140..65770831 100644 --- a/docs/source/features/speculative_decoding/eagle/audio_eagle.md +++ b/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md @@ -103,7 +103,7 @@ AngelSlim提供了Qwen2Audio模型vLLM backend的Eagle3基准测试脚本,用 ### 4.1 vLLM基准测试 -> vLLM 适配参考: [Support Eagle3 for Qwen2Audio](https://github.com/irisliu10/vllm/tree/eagle3_infer) +> vLLM 适配参考: [Support Eagle3 for Qwen2Audio](https://github.com/vllm-project/vllm/pull/32230) #### 4.1.1 基本用法 diff --git a/docs/source/features/speculative_decoding/eagle/index.md b/docs/source/features/speculative_decoding/eagle/index.md index 91259941..8aa746c4 100644 --- a/docs/source/features/speculative_decoding/eagle/index.md +++ b/docs/source/features/speculative_decoding/eagle/index.md @@ -12,6 +12,6 @@ eagle vlm_eagle -audio_eagle +audio_asr_eagle audio_tts_eagle ::: diff --git 
a/docs/source/features/speculative_decoding/eagle/vlm_eagle.md b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md index 171481d1..e90e68b8 100644 --- a/docs/source/features/speculative_decoding/eagle/vlm_eagle.md +++ b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md @@ -164,7 +164,7 @@ AngelSlim提供了HunyuanOCR和Qwen3-VL系列模型vLLM backend的Eagle3基准 ### 4.1 vLLM基准测试 -> vLLM 适配参考: [Support Eagle3 for HunyuanOCR & Qwen3-VL](https://github.com/irisliu10/vllm/tree/eagle3_infer) +> vLLM 适配参考: [Support Eagle3 for HunyuanOCR & Qwen3-VL](https://github.com/vllm-project/vllm/pull/32230) #### 4.1.1 基本用法 From ef3befc36dcd3df0175ede4de0413a5859e3e998 Mon Sep 17 00:00:00 2001 From: root Date: Tue, 13 Jan 2026 11:26:09 +0800 Subject: [PATCH 5/5] fix bug in readme --- README.md | 2 +- README_cn.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 32b97fc9..9d474c75 100644 --- a/README.md +++ b/README.md @@ -500,7 +500,7 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o - + diff --git a/README_cn.md b/README_cn.md index 5b3adcc3..1bef73ba 100644 --- a/README_cn.md +++ b/README_cn.md @@ -503,7 +503,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
Model Method GSM8K Alpaca
- +
Model Method GSM8K Alpaca