20 changes: 19 additions & 1 deletion README.md
@@ -233,6 +233,8 @@ bash scripts/speculative/train_eagle3_online.sh

For detailed training configurations and vLLM performance benchmarks of Eagle3, please refer to the [Quick Start Guide for Speculative Sampling](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html#id5).

Eagle3 training and deployment guides for multimodal models, covering LLM, VLM, and audio (ASR & TTS) models: [LLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [VLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Audio(ASR)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_eagle.html) | [Audio(TTS)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).

#### 2.2 LLM/VLM Model Quantization

After installing `AngelSlim`, you can launch static FP8 quantization for the Qwen3-1.7B model with the following one-command script:
@@ -507,6 +509,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
<th colspan="2">MATH-500</th>
<th colspan="2">MMMU</th>
<th colspan="2">MMStar</th>
<th>Mean</th>
<th></th>
</tr></thead>
<tbody>
<tr>
@@ -526,6 +530,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
<td>accept length</td>
<td>throughput (tokens/s)</td>
<td>accept length</td>
<td>throughput (tokens/s)</td>
<td>accept length</td>
</tr>
<tr>
<td rowspan="2">Qwen3-VL-2B-Instruct</td>
@@ -544,6 +550,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
<td>1</td>
<td>81.63</td>
<td>1</td>
<td>234.24</td>
<td>1</td>
</tr>
<tr>
<td>Eagle3</td>
@@ -561,6 +569,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
<td>2.55</td>
<td>139.73</td>
<td>2.31</td>
<td>415.76</td>
<td>2.5</td>
</tr>
<tr>
<td rowspan="2">Qwen3-VL-4B-Instruct</td>
@@ -579,6 +589,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
<td>1</td>
<td>67.75</td>
<td>1</td>
<td>150.21</td>
<td>1</td>
</tr>
<tr>
<td>Eagle3</td>
@@ -596,6 +608,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
<td>2.05</td>
<td>107.07</td>
<td>2.1</td>
<td>283.32</td>
<td>2.41</td>
</tr>
<tr>
<td rowspan="2">Qwen3-VL-30B-A3B-Instruct</td>
@@ -614,6 +628,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
<td>1</td>
<td>30.93</td>
<td>1</td>
<td>115.33</td>
<td>1</td>
</tr>
<tr>
<td>Eagle3</td>
@@ -631,6 +647,8 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
<td>1.78</td>
<td>52.57</td>
<td>1.94</td>
<td>166.17</td>
<td>2.32</td>
</tr>
</tbody></table>

@@ -684,7 +702,7 @@ Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.
<td>accept length</td>
</tr>
<tr>
<td rowspan="2">Qwen2_Audio</td>
<td rowspan="2">Qwen2-Audio</td>
<td>Vanilla</td>
<td>78.76</td>
<td>1</td>
19 changes: 18 additions & 1 deletion README_cn.md
@@ -234,6 +234,7 @@ bash scripts/speculative/train_eagle3_online.sh

For detailed training configurations and vLLM performance benchmarks of `Eagle3`, please refer to the speculative sampling [Quick Start Guide](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html#id5).

Eagle3 training and deployment guides for multimodal models, covering LLM / VLM / Audio (ASR & TTS) models: [LLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html) | [VLM](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html) | [Audio(ASR)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_eagle.html) | [Audio(TTS)](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html).
#### 2.2 LLM/VLM Model Quantization
After installing `AngelSlim`, you can get started quickly with the following script, which runs static `FP8` quantization of the `Qwen3-1.7B` model:

@@ -511,6 +512,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
<th colspan="2">MATH-500</th>
<th colspan="2">MMMU</th>
<th colspan="2">MMStar</th>
<th>Mean</th>
<th></th>
</tr></thead>
<tbody>
<tr>
@@ -530,6 +533,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
<td>accept length</td>
<td>throughput (tokens/s)</td>
<td>accept length</td>
<td>throughput (tokens/s)</td>
<td>accept length</td>
</tr>
<tr>
<td rowspan="2">Qwen3-VL-2B-Instruct</td>
@@ -548,6 +553,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
<td>1</td>
<td>81.63</td>
<td>1</td>
<td>234.24</td>
<td>1</td>
</tr>
<tr>
<td>Eagle3</td>
@@ -565,6 +572,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
<td>2.55</td>
<td>139.73</td>
<td>2.31</td>
<td>415.76</td>
<td>2.5</td>
</tr>
<tr>
<td rowspan="2">Qwen3-VL-4B-Instruct</td>
@@ -583,6 +592,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
<td>1</td>
<td>67.75</td>
<td>1</td>
<td>150.21</td>
<td>1</td>
</tr>
<tr>
<td>Eagle3</td>
@@ -600,6 +611,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
<td>2.05</td>
<td>107.07</td>
<td>2.1</td>
<td>283.32</td>
<td>2.41</td>
</tr>
<tr>
<td rowspan="2">Qwen3-VL-30B-A3B-Instruct</td>
@@ -618,6 +631,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
<td>1</td>
<td>30.93</td>
<td>1</td>
<td>115.33</td>
<td>1</td>
</tr>
<tr>
<td>Eagle3</td>
@@ -635,6 +650,8 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
<td>1.78</td>
<td>52.57</td>
<td>1.94</td>
<td>166.17</td>
<td>2.32</td>
</tr>
</tbody></table>

@@ -689,7 +706,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
<td>accept length</td>
</tr>
<tr>
<td rowspan="2">Qwen2_Audio</td>
<td rowspan="2">Qwen2-Audio</td>
<td>Vanilla</td>
<td>78.76</td>
<td>1</td>
190 changes: 190 additions & 0 deletions docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md
@@ -0,0 +1,190 @@
# EAGLE3 for Speech Understanding Models

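[Eagle3](https://arxiv.org/pdf/2503.01840) is currently among the most widely used and most effective speculative sampling algorithms.
This project covers Eagle3 training and benchmarking, and open-sources the [Eagle3 weights](https://huggingface.co/collections/AngelSlim/eagle3) for Qwen2Audio.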

The performance of our trained Qwen2Audio Eagle3 model is reported in the [benchmarks](../../../performance/speculative_decoding/benchmarks.md);
all figures were obtained with vLLM inference on a single H20 GPU.

## 1. Supported Models
- `Qwen2Audio`

## 2. Data Preparation

### 2.1 Data Format

All data must be stored in jsonl files. For the training data format, refer to:

- Example data: `AngelSlim/dataset/librispeech_test/librispeech_eval_10_test.jsonl`

```json
{"id": 5910, "conversations": [{"role": "user", "content": [{"type": "audio", "audio": "./audios/1580-141083-0008.flac"}, {"type": "text", "text": "Detect the language and recognize the speech: <|en|>"}]}, {"role": "assistant", "content": [{"type": "text", "text": "THE PROOF WAS IN THREE LONG SLIPS I HAD LEFT THEM ALL TOGETHER"}]}]}
```

- Key fields:
  - `id`: unique identifier of the conversation
  - `conversations`: the dialogue, in OpenAI chat format
  - `audio`: path to the corresponding audio file
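If you assemble training records yourself, a small shell helper can emit lines in exactly the format of the example above. This is a minimal sketch: `make_record` and `json_escape` are hypothetical names, not part of AngelSlim, and the helper assumes a numeric `id`, one audio clip per user turn, and single-line text (newlines and other control characters are not escaped).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of AngelSlim): emit one training record in the
# jsonl format shown above. Assumes a numeric id, one audio clip per user turn,
# and single-line text; newlines/control characters are NOT escaped.

# Escape backslashes and double quotes so a value can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Usage: make_record <id> <audio_path> <user_prompt> <assistant_transcript>
make_record() {
  local id="$1" audio prompt text
  audio="$(json_escape "$2")"
  prompt="$(json_escape "$3")"
  text="$(json_escape "$4")"
  printf '{"id": %s, "conversations": [{"role": "user", "content": [{"type": "audio", "audio": "%s"}, {"type": "text", "text": "%s"}]}, {"role": "assistant", "content": [{"type": "text", "text": "%s"}]}]}\n' \
    "$id" "$audio" "$prompt" "$text"
}

# Example: append one record to a training file.
# make_record 5910 ./audios/1580-141083-0008.flac \
#   "Detect the language and recognize the speech: <|en|>" \
#   "THE PROOF WAS IN THREE LONG SLIPS I HAD LEFT THEM ALL TOGETHER" >> train.jsonl
```

Building the line with `printf` keeps the field order and spacing identical to the sample record, which makes diffs against existing data easy to read.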

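### 2.2 Resampling Training Data (Recommended)

To obtain high-quality SFT data for the target model, we recommend re-sampling the training data with the target model itself: save the LLM's generated outputs in a jsonl file and store the corresponding audio files in the same directory, organized as described above.

You can generate training data tailored to your own application scenario; the vLLM-based data generation workflow below is provided as a reference:

**Step 1: Launch the vLLM server**

First, launch a vLLM server to provide model inference: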

```shell
bash scripts/speculative/run_vllm_server.sh
```

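**Server configuration notes:**
- The script launches a vLLM inference service for the target base model
- Make sure the server has started successfully before moving on to data generation
- You can adapt the vLLM server configuration to different target models by editing the script's parameters (e.g., vLLM launch arguments, number of GPUs)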
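Because data generation should only start once the server is up, it can help to poll the server first. The sketch below assumes the server exposes vLLM's OpenAI-compatible `/v1/models` endpoint on `localhost`, on port 8000 (vLLM's default) or whatever port your server script actually uses (e.g. `BASE_PORT`); `wait_for_vllm` is a hypothetical helper, not something `run_vllm_server.sh` provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a vLLM OpenAI-compatible server until it answers
# on /v1/models, or give up after a timeout. Port 8000 is vLLM's default;
# pass the port your server script actually uses (e.g. BASE_PORT).
# Usage: wait_for_vllm [port] [timeout_seconds]
wait_for_vllm() {
  local port="${1:-8000}" timeout="${2:-600}" waited=0
  until curl -sf "http://localhost:${port}/v1/models" > /dev/null; do
    if (( waited >= timeout )); then
      echo "vLLM server on port ${port} not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "vLLM server on port ${port} is ready"
}

# Example:
# wait_for_vllm 8000 600 && bash scripts/speculative/generate_data_for_target_model.sh
```

Loading a large target model can take minutes, so the default timeout is generous; `curl -f` treats HTTP errors as "not ready yet".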

**Step 2: Generate sampled data**

Once the vLLM server is running, generate the training data with the `scripts/speculative/generate_data_for_target_model.sh` script:

```shell
bash scripts/speculative/generate_data_for_target_model.sh
```

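**What the script does:**
- Calls the target base model through the vLLM server to sample responses for the input data
- Produces a training dataset in `.jsonl` format
- The resulting data is used for subsequent online training of the Eagle model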

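**Script parameters:**

Before use, configure the following parameters in the script:

- `DATA_NAME_OR_PATH`: HF name or local path of the input dataset
- `OUTPUT_DIR`: output path for the generated dataset
- `DATA_FORMAT`: format of the input dataset (sharegpt|ultrachat)
- `DATA_SHARD_SIZE`: size of each shard (subset) of the generated dataset
- `BASE_PORT`: port of the vLLM server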
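To make `DATA_SHARD_SIZE` concrete, the sketch below splits a jsonl file into shards of at most that many lines. It is only an illustration under that assumption: `shard_jsonl` and the `shard_NNNNN.jsonl` naming are hypothetical, and `generate_data_for_target_model.sh` may shard differently internally.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of DATA_SHARD_SIZE (not the script's own logic):
# split a jsonl file into shards of at most <shard_size> lines each.
# Usage: shard_jsonl <input.jsonl> <out_dir> <shard_size>
shard_jsonl() {
  local input="$1" out_dir="$2" shard_size="$3"
  if ! [[ "$shard_size" =~ ^[1-9][0-9]*$ ]]; then
    echo "shard size must be a positive integer: $shard_size" >&2
    return 1
  fi
  mkdir -p "$out_dir"
  awk -v n="$shard_size" -v dir="$out_dir" '
    BEGIN { cur = -1 }
    {
      idx = int((NR - 1) / n)          # 0-based shard index of this record
      if (idx != cur) {                # new shard: close the previous file
        if (cur >= 0) close(f)
        cur = idx
        f = sprintf("%s/shard_%05d.jsonl", dir, idx)
      }
      print > f
    }' "$input"
}

# Example: shard_jsonl train.jsonl shards 1000   # shards/shard_00000.jsonl, ...
```

Closing each finished shard keeps only one output file open at a time, so the sketch also works for datasets with many shards.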

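**Notes:**
- Make sure the vLLM server has started successfully and is running normally
- Data generation can take a long time, depending on the number of samples and the model size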


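## 3. Training the Eagle3 Model

Qwen2Audio currently supports the online training mode. Online training suits scenarios with sufficient GPU memory, a target model that is not too large, and training context lengths that need not be extremely long.

### 3.1 Online Training

Use the following script to train an Eagle3 model online: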

```shell
bash scripts/speculative/qwen2_audio/train_eagle3_audio_online.sh
```

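**Script parameters:**

Before use, configure the following parameters in the script:

- `TARGET_MODEL_NAME_OR_PATH`: HF name or local path of the target model
- `DRAFT_MODEL_CONFIG_PATH`: path to the draft model config
- `TRAIN_DATA_PATH`: path to the training data
- `EVAL_DATA_PATH`: path to the validation data
- `OUTPUT_DIR`: output path for the Eagle3 model
- `MODEL_MAX_LENGTH`: maximum length of the training data
- `CHAT_TEMPLATE_TYPE`: chat template type of the target model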
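Since a missing or empty variable tends to surface only after the target model has loaded, a quick pre-flight check can save time. This is a minimal sketch: `check_train_params` is hypothetical and not part of `train_eagle3_audio_online.sh`; it only checks that the variables listed above are set and that `MODEL_MAX_LENGTH` is a positive integer.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of train_eagle3_audio_online.sh):
# verify that every parameter listed above is set before training starts.
check_train_params() {
  local missing=0 name
  for name in TARGET_MODEL_NAME_OR_PATH DRAFT_MODEL_CONFIG_PATH TRAIN_DATA_PATH \
              EVAL_DATA_PATH OUTPUT_DIR MODEL_MAX_LENGTH CHAT_TEMPLATE_TYPE; do
    # ${!name} is bash indirect expansion: the value of the variable named $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required parameter: ${name}" >&2
      missing=1
    fi
  done
  # MODEL_MAX_LENGTH is a length, so it should be a positive integer.
  if [ -n "${MODEL_MAX_LENGTH:-}" ] && ! [[ "$MODEL_MAX_LENGTH" =~ ^[1-9][0-9]*$ ]]; then
    echo "MODEL_MAX_LENGTH must be a positive integer, got: ${MODEL_MAX_LENGTH}" >&2
    missing=1
  fi
  return "$missing"
}

# Example: check_train_params || exit 1
```

All problems are reported in one pass rather than stopping at the first, so a misconfigured script can be fixed in a single edit.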

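## 4. Benchmarks

AngelSlim provides an Eagle3 benchmark script for Qwen2Audio on the vLLM backend, used to measure the speedup from speculative sampling.

### 4.1 vLLM Benchmark

> vLLM integration reference: [Support Eagle3 for Qwen2Audio](https://github.com/vllm-project/vllm/pull/32230)

#### 4.1.1 Basic Usage

Use the `tools/vllm_offline_eagle3_qwen2_audio_bench.py` script to run the speculative sampling benchmark: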

```shell
python3 tools/vllm_offline_eagle3_qwen2_audio_bench.py \
--target_model ${BASE_MODEL_PATH} \
--draft_model ${EAGLE_MODEL_PATH} \
--output_file ${OUTPUT_FILE} \
--use_eagle
```

#### 4.1.2 Parameters

**Model configuration:**
- `--target_model`: path to the base model (required)
- `--draft_model`: path to the Eagle draft model (required when `--use_eagle` is set)

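**Benchmark configuration:**
- `--test_data_path`: path to the test jsonl file; default: "dataset/librispeech_test/librispeech_eval_10_test.jsonl"
- `--use_eagle`: run Eagle3 inference; default: False
- `--output_file`: path to the output results file
- `--num_prompts`: number of test cases; default: 100

**Generation parameters:**
- `--temp`: sampling temperature; default: 0
- `--max_model_len`: maximum context length; default: 16384
- `--output_len`: maximum number of generated tokens; default: 1024
- `--max_num_seqs`: maximum number of sequences per iteration; default: 1
- `--num_spec_tokens`: number of speculative tokens proposed by the draft model; default: 2

**Hardware configuration:**
- `--tp`: tensor parallel size; default: 1

**Other settings:**
- `--seed`: random seed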

#### 4.1.3 Examples

Test data layout: all data must be stored in a jsonl file, with the corresponding audio files in the same directory. Example directory structure:
```
└── librispeech_test
    ├── librispeech_eval_10_test.jsonl
    ├── audios
    │   ├── xxx.flac
    │   ├── xxx.flac
```
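Before benchmarking, it can be worth confirming that every audio path in the jsonl resolves to a file. The sketch below assumes each record sits on one line with `"audio": "<path>"` as in the data example, and that relative paths are resolved against the directory containing the jsonl file; `check_audio_paths` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical check: report audio paths in a jsonl file that do not exist.
# Assumes one record per line with "audio": "<path>" (as in the data example)
# and resolves relative paths against the jsonl file's own directory.
# Usage: check_audio_paths <data.jsonl>
check_audio_paths() {
  local jsonl="$1" base missing=0 path
  base="$(cd "$(dirname "$jsonl")" && pwd)"
  while IFS= read -r path; do
    case "$path" in
      /*) ;;                          # absolute path: use as-is
      *) path="$base/${path#./}" ;;   # relative path: resolve next to the jsonl
    esac
    if [ ! -f "$path" ]; then
      echo "missing audio: $path" >&2
      missing=1
    fi
  done < <(grep -o '"audio": "[^"]*"' "$jsonl" | sed 's/^"audio": "//; s/"$//')
  return "$missing"
}

# Example: check_audio_paths dataset/librispeech_test/librispeech_eval_10_test.jsonl
```

The pattern requires a colon after `"audio"`, so it skips the `"type": "audio"` entries and only extracts the path values.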

**Run speculative sampling:**
```shell
python3 tools/vllm_offline_eagle3_qwen2_audio_bench.py \
--target_model Qwen/Qwen2-Audio-7B-Instruct \
--draft_model "$EAGLE_DIR" \
--use_eagle \
--num_spec_tokens 4 \
--num_prompts 10 \
--temp 0 \
--max_num_seqs 1 \
--output_len 1024 \
--output_file "$OUTPUT_FILE"
```

**Baseline benchmark:**
```shell
python3 tools/vllm_offline_eagle3_qwen2_audio_bench.py \
--target_model Qwen/Qwen2-Audio-7B-Instruct \
--num_prompts 10 \
--temp 0 \
--max_num_seqs 1 \
--output_len 1024 \
--output_file "$OUTPUT_FILE"
```
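The two commands above can be combined into one sweep that runs the baseline once and Eagle3 at several `--num_spec_tokens` values. This sketch uses only the flags documented in 4.1.2; the output file names, `RESULT_DIR`, `DRY_RUN`, `EAGLE_DIR`'s default, and the helper names `run` and `run_sweep` are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sweep wrapper: one baseline run, then one Eagle3 run per
# --num_spec_tokens value. Output names and defaults below are placeholders.
TARGET_MODEL="${TARGET_MODEL:-Qwen/Qwen2-Audio-7B-Instruct}"
EAGLE_DIR="${EAGLE_DIR:-/path/to/eagle3}"
RESULT_DIR="${RESULT_DIR:-results}"
DRY_RUN="${DRY_RUN:-0}"  # 1 = print the commands instead of running them
BENCH=tools/vllm_offline_eagle3_qwen2_audio_bench.py

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Usage: run_sweep <num_spec_tokens>...
run_sweep() {
  local k
  # Settings shared by the baseline and Eagle3 runs, matching the examples above.
  local common=(--target_model "$TARGET_MODEL" --num_prompts 10 --temp 0
                --max_num_seqs 1 --output_len 1024)
  [ "$DRY_RUN" = 1 ] || mkdir -p "$RESULT_DIR"
  run python3 "$BENCH" "${common[@]}" --output_file "$RESULT_DIR/baseline.out"
  for k in "$@"; do
    run python3 "$BENCH" "${common[@]}" --draft_model "$EAGLE_DIR" --use_eagle \
      --num_spec_tokens "$k" --output_file "$RESULT_DIR/eagle3_k${k}.out"
  done
}

# Example: DRY_RUN=1 run_sweep 2 4 6   (drop DRY_RUN=1 to actually run)
```

Keeping the shared settings in one array guarantees the baseline and every Eagle3 run use identical generation parameters, so the results stay comparable.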

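#### 4.1.4 Performance Report

When the run completes, the tool automatically generates a performance report covering:
- Performance comparison between speculative sampling and the baseline model
- Speedup statistics
- Generation quality metrics (if enabled)

Results are saved to the specified output location for later analysis and comparison.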
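The speedup in such comparisons is typically the ratio of Eagle3 throughput to baseline throughput, measured under the same settings. The helper below is a sketch of that arithmetic only: it does not parse the tool's report, whose format is not documented here, and `speedup` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: speedup = Eagle3 throughput / baseline throughput.
# Both inputs are throughputs in the same unit (e.g. tokens/s), read by hand
# from the two reports; this does not parse the report files.
# Usage: speedup <eagle3_throughput> <baseline_throughput>
speedup() {
  awk -v spec="$1" -v base="$2" 'BEGIN {
    if (base <= 0) { print "baseline throughput must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.2fx\n", spec / base
  }'
}
```

For example, `speedup 150 100` prints `1.50x`.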

Full vLLM benchmark results are available in [Benchmark](../../../performance/speculative_decoding/benchmarks.md).