diff --git a/README.md b/README.md
index dc36c4e8..654785d4 100644
--- a/README.md
+++ b/README.md
@@ -371,7 +371,7 @@ We evaluated the Eagle3 model trained by AngelSlim on tasks including code gener
#### 1.1 Qwen3 Series Models
-Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
+Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
@@ -511,7 +511,7 @@ Benchmark results for Qwen3 series models using Eagle3 speculative decoding on v
##### 1.2.1 Qwen3-VL Series Models
-Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
@@ -668,7 +668,7 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
##### 1.2.2 HunyuanOCR Model
-Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across **[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0), evaluated on the **[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)** dataset, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
@@ -701,7 +701,7 @@ Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.1
##### 1.3.1 Qwen2-Audio Model
-Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) across **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0), evaluated on the **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
@@ -732,7 +732,7 @@ Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.
##### 1.3.2 Fun-CosyVoice3 Model
-Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding, evaluated on the **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
diff --git a/README_cn.md b/README_cn.md
index d4d67fab..5b2cbf0f 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -372,7 +372,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
#### 1.1 Qwen3系列模型
-我们使用vLLM(v0.11.2)评测了Qwen3系列Eagle3模型在**MT-bench**、 **HumanEval**、 **GSM8K**、**Alpaca**等数据集上的接收长度和吞吐。全部结果都是在单张H20上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**。
+我们使用vLLM(v0.11.2)评测了Qwen3系列Eagle3模型在**MT-bench**、 **HumanEval**、 **GSM8K**、**Alpaca**等数据集上的接收长度和吞吐。全部结果都是在单张GPU上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**。
@@ -512,7 +512,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
##### 1.2.1 Qwen3-VL系列模型
-我们使用(v0.12.0)评测了Qwen3-VL系列Eagle3模型在语言理解任务和多模态理解任务上的接收长度和吞吐。全部结果都是在单张H20上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
+我们使用vLLM(v0.12.0)评测了Qwen3-VL系列Eagle3模型在语言理解任务和多模态理解任务上的接收长度和吞吐。全部结果都是在单张GPU上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
@@ -669,7 +669,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
##### 1.2.2 HunyuanOCR模型
-我们使用(v0.13.0)评测了HunyuanOCR Eagle3模型在[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)上的接收长度和吞吐。结果是在单张H20上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
+我们使用vLLM(v0.13.0)评测了HunyuanOCR Eagle3模型在[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)上的接收长度和吞吐。结果是在单张GPU上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
@@ -702,7 +702,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
##### 1.3.1 Qwen2-Audio模型
-我们使用(v0.12.0)评测了Qwen2-Audio Eagle3模型在[LibriSpeech](https://www.openslr.org/12)数据集上的接收长度和吞吐。结果是在单张H20上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
+我们使用vLLM(v0.12.0)评测了Qwen2-Audio Eagle3模型在[LibriSpeech](https://www.openslr.org/12)数据集上的接收长度和吞吐。结果是在单张GPU上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
@@ -732,7 +732,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
##### 1.3.2 Fun-CosyVoice3模型
-我们评测了Fun-CosyVoice3 Eagle3模型在[LibriTTS](https://www.openslr.org/60/)数据集上的接收长度。结果是在单张H20上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
+我们评测了Fun-CosyVoice3 Eagle3模型在[LibriTTS](https://www.openslr.org/60/)数据集上的接收长度。结果是在单张GPU上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
diff --git a/angelslim/data/dataloader.py b/angelslim/data/dataloader.py
index 1e7b9160..a3dcc0a6 100644
--- a/angelslim/data/dataloader.py
+++ b/angelslim/data/dataloader.py
@@ -43,6 +43,7 @@ def create_data_loader(
inference_settings: Dict = None,
use_audio_in_video: bool = False,
model_name: str = None,
+ quantization_config=None,
) -> DataLoader:
"""
Create appropriate DataLoader based on data source
@@ -94,6 +95,7 @@ def create_data_loader(
data_source=data_source,
is_hf_dataset=not os.path.isfile(data_source),
model_name=model_name,
+ quantization_config=quantization_config,
)
elif data_type == "Text2ImageDataset":
dataset = Text2ImageDataset(
diff --git a/angelslim/data/multimodal_dataset.py b/angelslim/data/multimodal_dataset.py
index a94da609..752503bd 100644
--- a/angelslim/data/multimodal_dataset.py
+++ b/angelslim/data/multimodal_dataset.py
@@ -37,10 +37,12 @@ def __init__(
data_source: Union[str, Dict] = None,
is_hf_dataset: bool = False,
model_name: str = None,
+ quantization_config=None,
):
super().__init__(processor, device, max_length)
self.is_hf_dataset = is_hf_dataset
self.model_name = model_name
+ self.quant_algo = quantization_config.name if quantization_config else None
if is_hf_dataset:
self._load_hf_dataset(data_source, num_samples)
@@ -174,6 +176,13 @@ def _load_hf_dataset(self, dataset: str, num_samples: int):
def _process_and_append(self, messages: List[Dict], tools=None):
"""Process messages and append to dataset"""
+
+ # Pad to max_length only for int4 algorithms (GPTQ, GPTAQ, AWQ); otherwise pad to the longest sequence
+ if self.quant_algo and "int4_" in self.quant_algo:
+ padding = "max_length"
+ else:
+ padding = True
+
if self.model_name in ["Qwen3VL", "Qwen3VLMoE"]:
inputs = self.processor.apply_chat_template(
messages,
@@ -181,7 +190,7 @@ def _process_and_append(self, messages: List[Dict], tools=None):
tokenize=True,
add_generation_prompt=True,
return_dict=True,
- padding="max_length",
+ padding=padding,
truncation=True,
return_tensors="pt",
max_length=self.max_length,
@@ -196,7 +205,7 @@ def _process_and_append(self, messages: List[Dict], tools=None):
inputs = self.processor(
text=[text],
images=image_inputs,
- padding="max_length",
+ padding=padding,
truncation=True,
return_tensors="pt",
max_length=self.max_length,
@@ -214,7 +223,7 @@ def _process_and_append(self, messages: List[Dict], tools=None):
text=[text],
images=image_inputs,
videos=video_inputs,
- padding="max_length",
+ padding=padding,
truncation=True,
return_tensors="pt",
max_length=self.max_length,
diff --git a/angelslim/engine.py b/angelslim/engine.py
index 7ddcddf9..e22fe4be 100644
--- a/angelslim/engine.py
+++ b/angelslim/engine.py
@@ -149,6 +149,7 @@ def prepare_data(
inference_settings=None,
use_audio_in_video=False,
model_name=None,
+ quantization_config=None,
) -> Optional[Any]:
"""Prepare compression dataset"""
if custom_dataloader is not None:
@@ -174,6 +175,7 @@ def prepare_data(
inference_settings=inference_settings,
use_audio_in_video=use_audio_in_video,
model_name=model_name,
+ quantization_config=quantization_config,
)
self.max_seq_length = max_length
diff --git a/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md b/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md
index 65770831..fc101273 100644
--- a/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md
+++ b/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md
@@ -4,7 +4,7 @@
本项目包括Eagle3的训练以及benchmark测试,并开源了Qwen2Audio的[Eagle3权重](https://huggingface.co/collections/AngelSlim/eagle3)。
我们训练的Qwen2Audio Eagle3模型的表现可以参见基准测试[benchmarks](../../performance/speculative_decoding/benchmarks.md),
-其中全部数据都是在单张H20上使用vLLM推理获得。
+其中全部数据都是在单张GPU上使用vLLM推理获得。
## 1. 支持模型列表
- `Qwen2Audio`
diff --git a/docs/source/features/speculative_decoding/eagle/eagle.md b/docs/source/features/speculative_decoding/eagle/eagle.md
index 5c83f1c6..1bfa3116 100644
--- a/docs/source/features/speculative_decoding/eagle/eagle.md
+++ b/docs/source/features/speculative_decoding/eagle/eagle.md
@@ -4,7 +4,7 @@
本项目包括Eagle3的训练以及benchmark测试,并开源了Qwen3和Hunyuan系列的[Eagle3权重](https://huggingface.co/collections/AngelSlim/eagle3)。
我们训练的Qwen3系列Eagle3模型的表现可以参见基准测试[benchmarks](../../../performance/speculative_decoding/benchmarks.md),
-其中全部数据都是在单张H20上使用vLLM推理获得。
+其中全部数据都是在单张GPU上使用vLLM推理获得。
## 1. 数据生成
diff --git a/docs/source/features/speculative_decoding/eagle/index.md b/docs/source/features/speculative_decoding/eagle/index.md
index 8aa746c4..9777c082 100644
--- a/docs/source/features/speculative_decoding/eagle/index.md
+++ b/docs/source/features/speculative_decoding/eagle/index.md
@@ -4,7 +4,7 @@
本项目包括Eagle3的训练以及benchmark测试,并开源了Hunyuan、HunyuanOCR、Qwen3、Qwen3-VL、Qwen2Audio、Fun-CosyVoice3等模型的[Eagle3权重](https://huggingface.co/collections/AngelSlim/eagle3)。
我们训练的Eagle3模型的表现可以参见基准测试[benchmarks](../../../performance/speculative_decoding/benchmarks.md),
-其中全部数据都是在单张H20上使用vLLM推理获得。
+其中全部数据都是在单张GPU上使用vLLM推理获得。
:::{toctree}
:caption: Contents
diff --git a/docs/source/features/speculative_decoding/eagle/vlm_eagle.md b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md
index e90e68b8..9b747ce5 100644
--- a/docs/source/features/speculative_decoding/eagle/vlm_eagle.md
+++ b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md
@@ -4,7 +4,7 @@
本项目包括Eagle3的训练以及benchmark测试,并开源了HunyuanOCR和Qwen3-VL系列的[Eagle3权重](https://huggingface.co/collections/AngelSlim/eagle3)。
我们训练的HunyuanOCR和Qwen3-VL系列Eagle3模型的表现可以参见基准测试[benchmarks](../../../performance/speculative_decoding/benchmarks.md),
-其中全部数据都是在单张H20上使用vLLM推理获得。
+其中全部数据都是在单张GPU上使用vLLM推理获得。
## 1. 支持模型列表
- `HunyuanOCR`
- `Qwen3-VL`
@@ -88,7 +88,9 @@ bash scripts/speculative/hunyuan_ocr/generate_vlm_hidden_for_draft_model.sh
# For Qwen3-VL series
bash scripts/speculative/qwen3_vl/generate_vlm_hidden_for_draft_model.sh
```
-> 注意:qwen3_vl系列模型生成hidden states需要更新transformers库: `pip install git+https://github.com/huggingface/transformers.git`
+> 注意:qwen3_vl系列模型生成hidden states需要更新transformers>=5.0.0,
+ 或者cherry-pick: https://github.com/huggingface/transformers/pull/42609,
+ 否则抓取的hidden states不可用!!!
**脚本参数说明:**
diff --git a/tools/run.py b/tools/run.py
index 7e4e5196..5d2d6834 100644
--- a/tools/run.py
+++ b/tools/run.py
@@ -169,6 +169,7 @@ def run(config):
inference_settings=dataset_config.inference_settings,
use_audio_in_video=model_config.use_audio_in_video,
model_name=model_config.name,
+ quantization_config=compress_config.quantization,
)
# Step 5: Initialize compressor
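
The padding selection added to `_process_and_append` in `angelslim/data/multimodal_dataset.py` can be sketched in isolation as below. This is a minimal illustration rather than the project's implementation: the helper name `select_padding` and the sample algorithm names (`"fp8_static"`, `"int4_awq"`, `"int4_gptq"`) are assumptions made for the example. It also treats a missing algorithm name (`None`) as the non-int4 case, since `quantization_config` defaults to `None` in the new signatures.

```python
from typing import Optional, Union


def select_padding(quant_algo: Optional[str]) -> Union[bool, str]:
    """Return the `padding` argument to pass to the processor.

    Hypothetical helper: mirrors the branch in the diff, with an explicit
    None check so that a missing quantization config does not raise.
    """
    # Assumption: int4 algorithm names contain the substring "int4_".
    if quant_algo is not None and "int4_" in quant_algo:
        return "max_length"  # fixed-length padding for int4 calibration
    return True  # pad only to the longest sequence in the batch


# Illustrative algorithm names; these are not taken from the project.
for name in (None, "fp8_static", "int4_awq", "int4_gptq"):
    print(name, "->", select_padding(name))
```

Keeping the decision in one small function makes the `None` case easy to test separately from the processor calls.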