Tencent · irisliu10 · Feb 2, 2026 · Jan 21, 2026
diff --git a/README.md b/README.md
@@ -371,7 +371,7 @@ We evaluated the Eagle3 model trained by AngelSlim on tasks including code gener
 
 #### 1.1 Qwen3 Series Models
 
-Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
+Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
 
 <table>
   <thead>
@@ -511,7 +511,7 @@ Benchmark results for Qwen3 series models using Eagle3 speculative decoding on v
 
 ##### 1.2.1 Qwen3-VL Series Models
 
-Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 
 <table><thead>
   <tr>
@@ -668,7 +668,7 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o
 
 ##### 1.2.2 HunyuanOCR Model
 
-Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across **[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across **[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)** dataset, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 
 <table><thead>
   <tr>
@@ -701,7 +701,7 @@ Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.1
 
 ##### 1.3.1 Qwen2-Audio Model
 
-Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) across **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) across **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 
 <table><thead>
   <tr>
@@ -732,7 +732,7 @@ Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.
 
 ##### 1.3.2 Fun-CosyVoice3 Model
 
-Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
+Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
 
 <table><thead>
   <tr>

diff --git a/README_cn.md b/README_cn.md
@@ -372,7 +372,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 
 #### 1.1 Qwen3 Series Models
 
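-We used vLLM (v0.11.2) to evaluate the acceptance length and throughput of the Qwen3 series Eagle3 models on datasets including **MT-bench**, **HumanEval**, **GSM8K**, and **Alpaca**. All results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**.
+We used vLLM (v0.11.2) to evaluate the acceptance length and throughput of the Qwen3 series Eagle3 models on datasets including **MT-bench**, **HumanEval**, **GSM8K**, and **Alpaca**. All results were measured on a single GPU with the following settings: **tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**.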
 
 <table>
   <thead>
@@ -512,7 +512,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 
 ##### 1.2.1 Qwen3-VL Series Models
 
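-We used vLLM (v0.12.0) to evaluate the acceptance length and throughput of the Qwen3-VL series Eagle3 models on language understanding and multimodal understanding tasks. All results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
+We used vLLM (v0.12.0) to evaluate the acceptance length and throughput of the Qwen3-VL series Eagle3 models on language understanding and multimodal understanding tasks. All results were measured on a single GPU with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.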
 
 <table><thead>
   <tr>
@@ -669,7 +669,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 
 ##### 1.2.2 HunyuanOCR Model
 
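-We used vLLM (v0.13.0) to evaluate the acceptance length and throughput of the HunyuanOCR Eagle3 model on [OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench). The results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
+We used vLLM (v0.13.0) to evaluate the acceptance length and throughput of the HunyuanOCR Eagle3 model on [OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench). The results were measured on a single GPU with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.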
 
 <table><thead>
   <tr>
@@ -702,7 +702,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 
 ##### 1.3.1 Qwen2-Audio Model
 
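-We used vLLM (v0.12.0) to evaluate the acceptance length and throughput of the Qwen2-Audio Eagle3 model on the [LibriSpeech](https://www.openslr.org/12) dataset. The results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
+We used vLLM (v0.12.0) to evaluate the acceptance length and throughput of the Qwen2-Audio Eagle3 model on the [LibriSpeech](https://www.openslr.org/12) dataset. The results were measured on a single GPU with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.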
 
 <table><thead>
   <tr>
@@ -732,7 +732,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
 </table>
 
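 ##### 1.3.2 Fun-CosyVoice3 Model
-We evaluated the acceptance length of the Fun-CosyVoice3 Eagle3 model on the [LibriTTS](https://www.openslr.org/60/) dataset. The results were measured on a single H20 with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.
+We evaluated the acceptance length of the Fun-CosyVoice3 Eagle3 model on the [LibriTTS](https://www.openslr.org/60/) dataset. The results were measured on a single GPU with the following settings: **tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**.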
 
 <table><thead>
   <tr>

diff --git a/angelslim/data/dataloader.py b/angelslim/data/dataloader.py
@@ -43,6 +43,7 @@ def create_data_loader(
         inference_settings: Dict = None,
         use_audio_in_video: bool = False,
         model_name: str = None,
+        quantization_config: str = None,
     ) -> DataLoader:
         """
         Create appropriate DataLoader based on data source
@@ -94,6 +95,7 @@ def create_data_loader(
                 data_source=data_source,
                 is_hf_dataset=not os.path.isfile(data_source),
                 model_name=model_name,
+                quantization_config=quantization_config,
             )
         elif data_type == "Text2ImageDataset":
             dataset = Text2ImageDataset(

diff --git a/angelslim/data/multimodal_dataset.py b/angelslim/data/multimodal_dataset.py
@@ -37,10 +37,12 @@ def __init__(
         data_source: Union[str, Dict] = None,
         is_hf_dataset: bool = False,
         model_name: str = None,
+        quantization_config: str = None,
     ):
         super().__init__(processor, device, max_length)
         self.is_hf_dataset = is_hf_dataset
         self.model_name = model_name
+        self.quant_algo = quantization_config.name if quantization_config else None
 
         if is_hf_dataset:
             self._load_hf_dataset(data_source, num_samples)
@@ -174,14 +176,21 @@ def _load_hf_dataset(self, dataset: str, num_samples: int):
 
     def _process_and_append(self, messages: List[Dict], tools=None):
         """Process messages and append to dataset"""
+
+        # int4 GPTQ, GPTAQ and AWQ need max_length padding; quant_algo may be None
+        if self.quant_algo and "int4_" in self.quant_algo:
+            padding = "max_length"
+        else:
+            padding = True
+
         if self.model_name in ["Qwen3VL", "Qwen3VLMoE"]:
             inputs = self.processor.apply_chat_template(
                 messages,
                 tools=tools,
                 tokenize=True,
                 add_generation_prompt=True,
                 return_dict=True,
-                padding="max_length",
+                padding=padding,
                 truncation=True,
                 return_tensors="pt",
                 max_length=self.max_length,
@@ -196,7 +205,7 @@ def _process_and_append(self, messages: List[Dict], tools=None):
             inputs = self.processor(
                 text=[text],
                 images=image_inputs,
-                padding="max_length",
+                padding=padding,
                 truncation=True,
                 return_tensors="pt",
                 max_length=self.max_length,
@@ -214,7 +223,7 @@ def _process_and_append(self, messages: List[Dict], tools=None):
                 text=[text],
                 images=image_inputs,
                 videos=video_inputs,
-                padding="max_length",
+                padding=padding,
                 truncation=True,
                 return_tensors="pt",
                 max_length=self.max_length,
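
The padding choice added to `_process_and_append` can be looked at in isolation. The following is a minimal sketch, not the project's code: `select_padding` is a hypothetical helper, and the labels `int4_awq` and `fp8_static` are assumed examples of what `quantization_config.name` might hold. The sketch checks for `None` explicitly, because a substring test against `None` raises `TypeError`.

```python
from typing import Optional, Union


def select_padding(quant_algo: Optional[str]) -> Union[bool, str]:
    """Return the value to pass as `padding=` to the processor.

    "max_length" pads every sample to `max_length`; True pads only to the
    longest sequence in the batch (standard Hugging Face semantics).
    """
    # Guard against a missing quantization config before the substring test.
    if quant_algo is not None and "int4_" in quant_algo:
        return "max_length"
    return True


if __name__ == "__main__":
    # Algorithm labels below are illustrative assumptions, not a confirmed list.
    for algo in ("int4_awq", "fp8_static", None):
        print(algo, "->", select_padding(algo))
```

Keeping the decision in one small function would make it easy to unit-test the int4 and non-int4 paths without loading a processor.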

diff --git a/angelslim/engine.py b/angelslim/engine.py
@@ -149,6 +149,7 @@ def prepare_data(
         inference_settings=None,
         use_audio_in_video=False,
         model_name=None,
+        quantization_config=None,
     ) -> Optional[Any]:
         """Prepare compression dataset"""
         if custom_dataloader is not None:
@@ -174,6 +175,7 @@ def prepare_data(
             inference_settings=inference_settings,
             use_audio_in_video=use_audio_in_video,
             model_name=model_name,
+            quantization_config=quantization_config,
         )
         self.max_seq_length = max_length
 

diff --git a/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md b/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md
@@ -4,7 +4,7 @@
 This project covers Eagle3 training and benchmark testing, and open-sources [Eagle3 weights](https://huggingface.co/collections/AngelSlim/eagle3) for Qwen2Audio.
 
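 For the performance of our trained Qwen2Audio Eagle3 model, see the [benchmarks](../../performance/speculative_decoding/benchmarks.md);
-all results there were obtained with vLLM inference on a single H20.
+all results there were obtained with vLLM inference on a single GPU.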
 
 ## 1. Supported Models
 - `Qwen2Audio`

diff --git a/docs/source/features/speculative_decoding/eagle/eagle.md b/docs/source/features/speculative_decoding/eagle/eagle.md
@@ -4,7 +4,7 @@
 This project covers Eagle3 training and benchmark testing, and open-sources [Eagle3 weights](https://huggingface.co/collections/AngelSlim/eagle3) for the Qwen3 and Hunyuan series.
 
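 For the performance of our trained Qwen3 series Eagle3 models, see the [benchmarks](../../../performance/speculative_decoding/benchmarks.md);
-all results there were obtained with vLLM inference on a single H20.
+all results there were obtained with vLLM inference on a single GPU.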
 
 ## 1. Data Generation
 

diff --git a/docs/source/features/speculative_decoding/eagle/index.md b/docs/source/features/speculative_decoding/eagle/index.md
@@ -4,7 +4,7 @@
 This project covers Eagle3 training and benchmark testing, and open-sources [Eagle3 weights](https://huggingface.co/collections/AngelSlim/eagle3) for models including Hunyuan, HunyuanOCR, Qwen3, Qwen3-VL, Qwen2Audio, and Fun-CosyVoice3.
 
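 For the performance of our trained Eagle3 models, see the [benchmarks](../../../performance/speculative_decoding/benchmarks.md);
-all results there were obtained with vLLM inference on a single H20.
+all results there were obtained with vLLM inference on a single GPU.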
 
 :::{toctree}
 :caption: Contents

diff --git a/docs/source/features/speculative_decoding/eagle/vlm_eagle.md b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md
@@ -4,7 +4,7 @@
 This project covers Eagle3 training and benchmark testing, and open-sources [Eagle3 weights](https://huggingface.co/collections/AngelSlim/eagle3) for HunyuanOCR and the Qwen3-VL series.
 
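 For the performance of our trained HunyuanOCR and Qwen3-VL series Eagle3 models, see the [benchmarks](../../../performance/speculative_decoding/benchmarks.md);
-all results there were obtained with vLLM inference on a single H20.
+all results there were obtained with vLLM inference on a single GPU.
 ## 1. Supported Models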
 - `HunyuanOCR`
 - `Qwen3-VL`
@@ -88,7 +88,9 @@ bash scripts/speculative/hunyuan_ocr/generate_vlm_hidden_for_draft_model.sh
 # For Qwen3-VL series
 bash scripts/speculative/qwen3_vl/generate_vlm_hidden_for_draft_model.sh
 ```
-> Note: generating hidden states for qwen3_vl series models requires updating the transformers library: `pip install git+https://github.com/huggingface/transformers.git`
+> Note: generating hidden states for qwen3_vl series models requires transformers>=5.0.0,
+ or cherry-picking https://github.com/huggingface/transformers/pull/42609;
+ otherwise the captured hidden states are unusable!
 
 **Script parameters:**
 

diff --git a/tools/run.py b/tools/run.py
@@ -169,6 +169,7 @@ def run(config):
             inference_settings=dataset_config.inference_settings,
             use_audio_in_video=model_config.use_audio_in_video,
             model_name=model_config.name,
+            quantization_config=compress_config.quantization,
         )
 
     # Step 5: Initialize compressor
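
Taken together, the hunks above thread one optional keyword argument from `tools/run.py` through `prepare_data` and `create_data_loader` into the dataset constructor. The sketch below mirrors that plumbing with simplified stand-ins. `QuantizationConfig` and `DatasetSketch` are hypothetical classes, the real signatures take many more parameters, and the only detail taken from the diff is that the dataset reads `.name` from the config when one is given.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantizationConfig:
    """Hypothetical stand-in for `compress_config.quantization`."""
    name: str


class DatasetSketch:
    """Simplified stand-in for the multimodal dataset constructor."""

    def __init__(self, model_name: Optional[str] = None,
                 quantization_config: Optional[QuantizationConfig] = None):
        self.model_name = model_name
        # Mirrors the diff: keep only the algorithm name, or None when absent.
        self.quant_algo = quantization_config.name if quantization_config else None


def create_data_loader(model_name=None, quantization_config=None) -> DatasetSketch:
    # Forward the new keyword unchanged to the dataset.
    return DatasetSketch(model_name=model_name,
                         quantization_config=quantization_config)


def prepare_data(model_name=None, quantization_config=None) -> DatasetSketch:
    # The engine layer only passes the value through.
    return create_data_loader(model_name=model_name,
                              quantization_config=quantization_config)


if __name__ == "__main__":
    with_cfg = prepare_data("Qwen3VL", QuantizationConfig(name="int4_awq"))
    without_cfg = prepare_data("Qwen3VL")
    print(with_cfg.quant_algo, without_cfg.quant_algo)
```

Because the argument defaults to `None` at every layer, callers that do not pass a quantization config keep working, provided downstream code treats a `None` algorithm name as a valid case.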