Commit 2802302
SpecDec Bench: February Update (#875)
## What does this PR do?

**Type of change:** New feature

**Overview:**

- Addition of the SpecBench dataset
- Addition of the NVIDIA SPEED-Bench dataset, preprocessing scripts, and a custom metrics aggregator
- Addition of an example converting SpecBench Medusa to this framework
- Addition of initial TRTLLM AutoDeploy SpecDec support
- Updates to all frameworks for better performance (overlap/async scheduling, etc.)

## Summary by CodeRabbit

**New Features**

- Added SPEED-Bench dataset support with configurable throughput and qualitative configurations
- Introduced SpecBench metrics with acceptance-rate analysis and visualizations
- Added a progress bar during benchmark execution
- New model implementations for auto-deployment and Medusa-style speculative decoding
- Data preparation utility for benchmark datasets
- Enhanced metrics with per-category analysis and performance charts

**Documentation**

- Updated README with SPEED-Bench workflow and examples
- New porting guide for integrating custom benchmark runners

**Refactor**

- Streamlined model and runner interfaces for improved flexibility
- Consolidated dataset implementations and removed deprecated base classes

**Chores**

- Added required dependencies for data handling and visualizations

Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
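The release notes mention per-category acceptance-rate analysis. As a rough illustration of what such a metrics aggregator computes, here is a minimal sketch of averaging accepted draft tokens per decode step, grouped by category. The function and record fields are hypothetical, not the actual `specdec_bench` API:

```python
from collections import defaultdict


def aggregate_acceptance(records):
    """Aggregate speculative-decoding stats per category.

    Each record is a dict with a 'category' name and 'accepted_per_step',
    a list of accepted draft-token counts, one entry per decode step.
    Returns {category: mean accepted draft tokens per step}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        for accepted in rec["accepted_per_step"]:
            sums[rec["category"]] += accepted
            counts[rec["category"]] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


records = [
    {"category": "code", "accepted_per_step": [3, 2, 3, 1]},
    {"category": "math", "accepted_per_step": [1, 2]},
]
print(aggregate_acceptance(records))  # {'code': 2.25, 'math': 1.5}
```

A per-category breakdown like this is what makes charts such as "acceptance rate by task type" possible.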
1 parent c689ea1 · commit 2802302

28 files changed: +2303 −126 lines

.pre-commit-config.yaml

Lines changed: 3 additions & 0 deletions
```diff
@@ -24,7 +24,9 @@ repos:
     hooks:
       - id: ruff-check
         args: [--fix, --exit-non-zero-on-fix]
+        exclude: ^examples/specdec_bench/specdec_bench/datasets/speed\.py$
       - id: ruff-format
+        exclude: ^examples/specdec_bench/specdec_bench/datasets/speed\.py$

   - repo: https://github.com/pre-commit/mirrors-mypy
     rev: v1.17.1
@@ -93,6 +95,7 @@ repos:
           examples/llm_eval/modeling.py|
           examples/llm_qat/main.py|
           examples/llm_sparsity/weight_sparsity/finetune.py|
+          examples/specdec_bench/specdec_bench/models/specbench_medusa.py|
           examples/speculative_decoding/main.py|
           examples/speculative_decoding/medusa_utils.py|
           examples/speculative_decoding/server_generate.py|
```

examples/specdec_bench/README.md

Lines changed: 107 additions & 3 deletions
Original file line number / Diff line number / Diff line change

````diff
@@ -28,17 +28,121 @@ MTBench is available [here](https://huggingface.co/datasets/HuggingFaceH4/mt_ben
 Download `nvidia/gpt-oss-120b-Eagle3` to a local directory `/path/to/eagle`.

 ```bash
-python3 run.py --model_dir openai/gpt-oss-120b --tokenizer openai/gpt-oss-120b --draft_model_dir /path/to/eagle --mtbench question.jsonl --tp_size 1 --ep_size 1 --draft_length 3 --output_length 4096 --num_requests 80 --engine TRTLLM --concurrency 1 --postprocess gptoss
+python3 run.py \
+    --model_dir openai/gpt-oss-120b \
+    --tokenizer openai/gpt-oss-120b \
+    --draft_model_dir /path/to/eagle \
+    --mtbench question.jsonl \
+    --tp_size 1 \
+    --ep_size 1 \
+    --draft_length 3 \
+    --output_length 4096 \
+    --num_requests 80 \
+    --engine TRTLLM \
+    --concurrency 1 \
+    --postprocess gptoss
 ```

 ### Running Random ids on GPT OSS + Eagle3

 Download `nvidia/gpt-oss-120b-Eagle3` to a local directory `/path/to/eagle`.

 ```bash
-python3 run.py --model_dir openai/gpt-oss-120b --tokenizer openai/gpt-oss-120b --draft_model_dir /path/to/eagle --random_isl 1024 --tp_size 1 --ep_size 1 --draft_length 3 --output_length 4096 --num_requests 40 --engine TRTLLM --concurrency 1
+python3 run.py \
+    --model_dir openai/gpt-oss-120b \
+    --tokenizer openai/gpt-oss-120b \
+    --draft_model_dir /path/to/eagle \
+    --random_isl 1024 \
+    --tp_size 1 \
+    --ep_size 1 \
+    --draft_length 3 \
+    --output_length 4096 \
+    --num_requests 40 \
+    --engine TRTLLM \
+    --concurrency 1
+```
+
+### Running [SPEED-Bench](https://huggingface.co/datasets/nvidia/SPEED-Bench) on Llama 3.3 70B + Eagle 3
+
+1. Install the requirements file using `pip install -r requirements_speed.txt`.
+
+2. Prepare the data using the provided script:
+
+   ```bash
+   python3 prepare_data.py --dataset speed --config all
+   ```
+
+The data is saved to the `data/` directory, with each config type (qualitative, throughput_1k, ...) in its own subdirectory.
+
+#### License
+
+GOVERNING TERMS: This dataset is governed by the NVIDIA Evaluation Dataset License Agreement.
+
+ADDITIONAL INFORMATION: MIT for bigcode/humanevalpack, RUCAIBox/MMATH, RUCAIBox/BAMBOO, and EQ-Bench. Apache 2.0 for Writing Bench and Spec-Bench. CC BY 4.0 for FBK-MT/MCIF. MIT and Apache 2.0 for tianyang/repobench_python_v1.1, JetBrains-Research/lca-project-level-code-completion, and tianyang/repobench_java_v1.1.
+
+NOTICE: For each dataset a user elects to use, the user is responsible for checking whether the dataset license fits the intended purpose. The `prepare_data.py` script automatically fetches data from all the source datasets.
+
+Additional details are in the [HuggingFace dataset repository](https://huggingface.co/datasets/nvidia/SPEED-Bench).
+
+#### Qualitative split
+
+```bash
+python3 run.py \
+    --model_dir meta-llama/Llama-3.3-70B-Instruct \
+    --tokenizer meta-llama/Llama-3.3-70B-Instruct \
+    --draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
+    --dataset speed \
+    --dataset_path data/speed/qualitative \
+    --tp_size 8 \
+    --ep_size 1 \
+    --draft_length 3 \
+    --output_length 4096 \
+    --engine TRTLLM \
+    --concurrency 32 \
+    --show_progress
+```
+
+#### Throughput split

+```bash
+python3 run.py \
+    --model_dir meta-llama/Llama-3.3-70B-Instruct \
+    --tokenizer meta-llama/Llama-3.3-70B-Instruct \
+    --draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
+    --dataset speed \
+    --dataset_path data/speed/throughput_1k \
+    --tp_size 8 \
+    --ep_size 1 \
+    --draft_length 3 \
+    --output_length 4096 \
+    --engine TRTLLM \
+    --concurrency 32 \
+    --show_progress
+```
+
+For longer context (>8192 tokens), use the following configuration with TRTLLM:
+
+```yaml
+engine_args:
+  max_seq_len: 131072 # Model max context length (for Llama 3.3 70B)
+  enable_chunked_prefill: true
+```
+
+```bash
+python3 run.py \
+    --model_dir meta-llama/Llama-3.3-70B-Instruct \
+    --tokenizer meta-llama/Llama-3.3-70B-Instruct \
+    --draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
+    --dataset speed \
+    --dataset_path data/speed/throughput_16k \
+    --tp_size 8 \
+    --ep_size 1 \
+    --draft_length 3 \
+    --output_length 4096 \
+    --engine TRTLLM \
+    --concurrency 32 \
+    --show_progress \
+    --runtime_params runtime_args_long_context.yaml
 ```

 ## Notes
````
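All of the commands above pass `--draft_length 3`, meaning each target-model step verifies up to three drafted tokens and emits one token of its own on top of whatever is accepted. A back-of-the-envelope upper bound on the resulting speedup, ignoring the draft model's own cost (a simplification for intuition only, not the benchmark's actual metric):

```python
def ideal_speedup(mean_accepted: float) -> float:
    """Tokens emitted per target-model forward pass, relative to plain
    autoregressive decoding (1 token per pass). Ignores draft-model
    overhead, so this is an optimistic upper bound."""
    return 1.0 + mean_accepted


# If on average 2 of the 3 drafted tokens are accepted per step,
# each target pass yields 3 tokens instead of 1:
print(ideal_speedup(2.0))  # 3.0
```

This is why the acceptance-rate metrics reported by the benchmark matter: with `--draft_length 3` the yield per target pass ranges from 1 (nothing accepted) to 4 (all drafts accepted), and real end-to-end speedup is further reduced by the draft model's runtime.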

0 commit comments