diff --git a/README.md b/README.md index dc36c4e8..654785d4 100644 --- a/README.md +++ b/README.md @@ -371,7 +371,7 @@ We evaluated the Eagle3 model trained by AngelSlim on tasks including code gener #### 1.1 Qwen3 Series Models -Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**). +Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**). @@ -511,7 +511,7 @@ Benchmark results for Qwen3 series models using Eagle3 speculative decoding on v ##### 1.2.1 Qwen3-VL Series Models -Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**). +Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
@@ -668,7 +668,7 @@ Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding o ##### 1.2.2 HunyuanOCR Model -Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across **[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**). +Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across **[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)** dataset, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
@@ -701,7 +701,7 @@ Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.1 ##### 1.3.1 Qwen2-Audio Model -Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) across **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**). +Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) across **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
@@ -732,7 +732,7 @@ Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0. ##### 1.3.2 Fun-CosyVoice3 Model -Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**). +Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding across **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).
diff --git a/README_cn.md b/README_cn.md index d4d67fab..5b2cbf0f 100644 --- a/README_cn.md +++ b/README_cn.md @@ -372,7 +372,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta #### 1.1 Qwen3系列模型 -我们使用vLLM(v0.11.2)评测了Qwen3系列Eagle3模型在**MT-bench**、 **HumanEval**、 **GSM8K**、**Alpaca**等数据集上的接收长度和吞吐。全部结果都是在单张H20上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**。 +我们使用vLLM(v0.11.2)评测了Qwen3系列Eagle3模型在**MT-bench**、 **HumanEval**、 **GSM8K**、**Alpaca**等数据集上的接收长度和吞吐。全部结果都是在单张GPU上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**。
@@ -512,7 +512,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta ##### 1.2.1 Qwen3-VL系列模型 -我们使用(v0.12.0)评测了Qwen3-VL系列Eagle3模型在语言理解任务和多模态理解任务上的接收长度和吞吐。全部结果都是在单张H20上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。 +我们使用(v0.12.0)评测了Qwen3-VL系列Eagle3模型在语言理解任务和多模态理解任务上的接收长度和吞吐。全部结果都是在单张GPU上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
@@ -669,7 +669,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta ##### 1.2.2 HunyuanOCR模型 -我们使用(v0.13.0)评测了HunyuanOCR Eagle3模型在[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)上的接收长度和吞吐。结果是在单张H20上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。 +我们使用(v0.13.0)评测了HunyuanOCR Eagle3模型在[OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)上的接收长度和吞吐。结果是在单张GPU上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
@@ -702,7 +702,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta ##### 1.3.1 Qwen2-Audio模型 -我们使用(v0.12.0)评测了Qwen2-Audio Eagle3模型在[LibriSpeech](https://www.openslr.org/12)数据集上的接收长度和吞吐。结果是在单张H20上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。 +我们使用(v0.12.0)评测了Qwen2-Audio Eagle3模型在[LibriSpeech](https://www.openslr.org/12)数据集上的接收长度和吞吐。结果是在单张GPU上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。
@@ -732,7 +732,7 @@ bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --ta
##### 1.3.2 Fun-CosyVoice3模型 -我们评测了Fun-CosyVoice3 Eagle3模型在[LibriTTS](https://www.openslr.org/60/)数据集上的接收长度。结果是在单张H20上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。 +我们评测了Fun-CosyVoice3 Eagle3模型在[LibriTTS](https://www.openslr.org/60/)数据集上的接收长度。结果是在单张GPU上用以下设置测得:**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**。 diff --git a/angelslim/data/dataloader.py b/angelslim/data/dataloader.py index 1e7b9160..a3dcc0a6 100644 --- a/angelslim/data/dataloader.py +++ b/angelslim/data/dataloader.py @@ -43,6 +43,7 @@ def create_data_loader( inference_settings: Dict = None, use_audio_in_video: bool = False, model_name: str = None, + quantization_config: str = None, ) -> DataLoader: """ Create appropriate DataLoader based on data source @@ -94,6 +95,7 @@ def create_data_loader( data_source=data_source, is_hf_dataset=not os.path.isfile(data_source), model_name=model_name, + quantization_config=quantization_config, ) elif data_type == "Text2ImageDataset": dataset = Text2ImageDataset( diff --git a/angelslim/data/multimodal_dataset.py b/angelslim/data/multimodal_dataset.py index a94da609..752503bd 100644 --- a/angelslim/data/multimodal_dataset.py +++ b/angelslim/data/multimodal_dataset.py @@ -37,10 +37,12 @@ def __init__( data_source: Union[str, Dict] = None, is_hf_dataset: bool = False, model_name: str = None, + quantization_config: str = None, ): super().__init__(processor, device, max_length) self.is_hf_dataset = is_hf_dataset self.model_name = model_name + self.quant_algo = quantization_config.name if quantization_config else None if is_hf_dataset: self._load_hf_dataset(data_source, num_samples) @@ -174,6 +176,13 @@ def _load_hf_dataset(self, dataset: str, num_samples: int): def _process_and_append(self, messages: List[Dict], tools=None): """Process messages and append to dataset""" + + # max_length padding for int4 gptq, gptaq and awq + if "int4_" in self.quant_algo: + padding = "max_length" + else: + padding = True + if 
self.model_name in ["Qwen3VL", "Qwen3VLMoE"]: inputs = self.processor.apply_chat_template( messages, @@ -181,7 +190,7 @@ def _process_and_append(self, messages: List[Dict], tools=None): tokenize=True, add_generation_prompt=True, return_dict=True, - padding="max_length", + padding=padding, truncation=True, return_tensors="pt", max_length=self.max_length, @@ -196,7 +205,7 @@ def _process_and_append(self, messages: List[Dict], tools=None): inputs = self.processor( text=[text], images=image_inputs, - padding="max_length", + padding=padding, truncation=True, return_tensors="pt", max_length=self.max_length, @@ -214,7 +223,7 @@ def _process_and_append(self, messages: List[Dict], tools=None): text=[text], images=image_inputs, videos=video_inputs, - padding="max_length", + padding=padding, truncation=True, return_tensors="pt", max_length=self.max_length, diff --git a/angelslim/engine.py b/angelslim/engine.py index 7ddcddf9..e22fe4be 100644 --- a/angelslim/engine.py +++ b/angelslim/engine.py @@ -149,6 +149,7 @@ def prepare_data( inference_settings=None, use_audio_in_video=False, model_name=None, + quantization_config=None, ) -> Optional[Any]: """Prepare compression dataset""" if custom_dataloader is not None: @@ -174,6 +175,7 @@ def prepare_data( inference_settings=inference_settings, use_audio_in_video=use_audio_in_video, model_name=model_name, + quantization_config=quantization_config, ) self.max_seq_length = max_length diff --git a/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md b/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md index 65770831..fc101273 100644 --- a/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md +++ b/docs/source/features/speculative_decoding/eagle/audio_asr_eagle.md @@ -4,7 +4,7 @@ 本项目包括Eagle3的训练以及benchmark测试,并开源了Qwen2Audio的[Eagle3权重](https://huggingface.co/collections/AngelSlim/eagle3)。 我们训练的Qwen2Audio Eagle3模型的表现可以参见基准测试[benchmarks](../../performance/speculative_decoding/benchmarks.md), 
-其中全部数据都是在单张H20上使用vLLM推理获得。 +其中全部数据都是在单张GPU上使用vLLM推理获得。 ## 1. 支持模型列表 - `Qwen2Audio` diff --git a/docs/source/features/speculative_decoding/eagle/eagle.md b/docs/source/features/speculative_decoding/eagle/eagle.md index 5c83f1c6..1bfa3116 100644 --- a/docs/source/features/speculative_decoding/eagle/eagle.md +++ b/docs/source/features/speculative_decoding/eagle/eagle.md @@ -4,7 +4,7 @@ 本项目包括Eagle3的训练以及benchmark测试,并开源了Qwen3和Hunyuan系列的[Eagle3权重](https://huggingface.co/collections/AngelSlim/eagle3)。 我们训练的Qwen3系列Eagle3模型的表现可以参见基准测试[benchmarks](../../../performance/speculative_decoding/benchmarks.md), -其中全部数据都是在单张H20上使用vLLM推理获得。 +其中全部数据都是在单张GPU上使用vLLM推理获得。 ## 1. 数据生成 diff --git a/docs/source/features/speculative_decoding/eagle/index.md b/docs/source/features/speculative_decoding/eagle/index.md index 8aa746c4..9777c082 100644 --- a/docs/source/features/speculative_decoding/eagle/index.md +++ b/docs/source/features/speculative_decoding/eagle/index.md @@ -4,7 +4,7 @@ 本项目包括Eagle3的训练以及benchmark测试,并开源了Hunyuan、HunyuanOCR、Qwen3、Qwen3-VL、Qwen2Audio、Fun-CosyVoice3等模型的[Eagle3权重](https://huggingface.co/collections/AngelSlim/eagle3)。 我们训练的Eagle3模型的表现可以参见基准测试[benchmarks](../../../performance/speculative_decoding/benchmarks.md), -其中全部数据都是在单张H20上使用vLLM推理获得。 +其中全部数据都是在单张GPU上使用vLLM推理获得。 :::{toctree} :caption: Contents diff --git a/docs/source/features/speculative_decoding/eagle/vlm_eagle.md b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md index e90e68b8..9b747ce5 100644 --- a/docs/source/features/speculative_decoding/eagle/vlm_eagle.md +++ b/docs/source/features/speculative_decoding/eagle/vlm_eagle.md @@ -4,7 +4,7 @@ 本项目包括Eagle3的训练以及benchmark测试,并开源了HunyuanOCR和Qwen3-VL系列的[Eagle3权重](https://huggingface.co/collections/AngelSlim/eagle3)。 我们训练的HunyuanOCR和Qwen3-VL系列Eagle3模型的表现可以参见基准测试[benchmarks](../../../performance/speculative_decoding/benchmarks.md), -其中全部数据都是在单张H20上使用vLLM推理获得。 +其中全部数据都是在单张GPU上使用vLLM推理获得。 ## 1. 
支持模型列表 - `HunyuanOCR` - `Qwen3-VL` @@ -88,7 +88,9 @@ bash scripts/speculative/hunyuan_ocr/generate_vlm_hidden_for_draft_model.sh # For Qwen3-VL series bash scripts/speculative/qwen3_vl/generate_vlm_hidden_for_draft_model.sh ``` -> 注意:qwen3_vl系列模型生成hidden states需要更新transformers库: `pip install git+https://github.com/huggingface/transformers.git` +> 注意:qwen3_vl系列模型生成hidden states需要更新transformers>=5.0.0, + 或者cherry-pick: https://github.com/huggingface/transformers/pull/42609, + 否则抓取的hidden states不可用!!! **脚本参数说明:** diff --git a/tools/run.py b/tools/run.py index 7e4e5196..5d2d6834 100644 --- a/tools/run.py +++ b/tools/run.py @@ -169,6 +169,7 @@ def run(config): inference_settings=dataset_config.inference_settings, use_audio_in_video=model_config.use_audio_in_video, model_name=model_config.name, + quantization_config=compress_config.quantization, ) # Step 5: Initialize compressor
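
Review note on the `_process_and_append` hunk in `angelslim/data/multimodal_dataset.py`: `self.quant_algo` is set to `None` when no `quantization_config` is passed, and `"int4_" in None` raises `TypeError` in Python. The new parameter is also annotated as `str`, although the code reads its `.name` attribute. The sketch below shows one possible None-safe way to express the same padding choice. It is an illustration under stated assumptions, not the project's code: `QuantConfig` and `select_padding` are hypothetical names, and the algorithm names in the usage lines are examples only.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class QuantConfig:
    """Hypothetical stand-in for the quantization config object; only `.name` is used."""

    name: str


def select_padding(quantization_config: Optional[QuantConfig]) -> Union[bool, str]:
    """Choose the `padding` argument passed to the processor.

    Mirrors the diff: int4 algorithms pad to max_length, everything else
    uses dynamic padding (padding=True).
    """
    # Same extraction as in the diff: None when no config is supplied.
    quant_algo = quantization_config.name if quantization_config else None

    # Explicit None check, so a missing config falls through to padding=True
    # instead of raising TypeError on the `in` test.
    if quant_algo is not None and "int4_" in quant_algo:
        return "max_length"
    return True


# Illustrative algorithm names (assumed for this example).
print(select_padding(QuantConfig("int4_awq")))
print(select_padding(QuantConfig("fp8_static")))
print(select_padding(None))
```

Whether a missing config should fall back to `padding=True` or keep the previous `"max_length"` behavior is a design choice for the authors; the sketch assumes the former because it matches the `else` branch in the diff.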