
Commit ac7ed96

Author: Izzy Putterman (committed)

SpecDec Bench: April Update

Signed-off-by: Izzy Putterman <iputterman@nvidia.com>

1 parent: 952a62b

31 files changed: 1643 additions, 986 deletions

examples/specdec_bench/README.md

Lines changed: 74 additions & 3 deletions
@@ -3,8 +3,8 @@
 ## Installation
 
 This benchmark is meant to be a lightweight layer on top of an existing vLLM/SGLang/TRTLLM installation. For example, no install
-is required if one is running in the following dockers: `vllm/vllm-openai:v0.11.0` (vLLM), `lmsysorg/sglang:v0.5.4.post2` (SGLang), or
-`nvcr.io/nvidia/tensorrt-llm/release:1.2.0` (TRT-LLM).
+is required if one is running in the following dockers: `vllm/vllm-openai:v0.19.0` (vLLM), `lmsysorg/sglang:v0.5.10.post1` (SGLang), or
+`nvcr.io/nvidia/tensorrt-llm/release:1.3.0.rc10` (TRT-LLM).
 
 Next
@@ -145,9 +145,80 @@ python3 run.py \
     --runtime_params runtime_args_long_context.yaml
 ```
 
+## Running Sweeps
+
+A sweep runs multiple dataset/concurrency combinations in a single invocation, which is useful for
+throughput curves, ablations over concurrency levels, or multi-dataset evaluations.
+
+### Sweep config format
+
+Create a YAML file with a `runs` key (or a flat list). Each entry supports:
+
+| Key | Type | Description |
+|-----|------|-------------|
+| `dataset` | string (required) | Dataset name (same choices as `--dataset`) |
+| `dataset_path` | string | Path to dataset (can also be supplied via CLI) |
+| `random_isl` | int | Input sequence length for the `random` dataset |
+| `concurrency` | int or list | Concurrency level(s) to sweep over |
+| `num_requests` | int or list | Requests per concurrency level (list must match length of `concurrency`) |
+| `output_length` | int or list | Output token limit per concurrency level |
+| `temperature` | float or list | Sampling temperature per concurrency level |
+| `category` | string | Category filter (for datasets that support it) |
+
+`concurrency` accepts a single integer or a list. When it is a list, `num_requests`,
+`output_length`, and `temperature` can each be a matching-length list to set per-level
+values, or a single scalar to apply the same value to all levels.
+
+**Example `sweep_example.yaml`:**
+
+```yaml
+runs:
+  - dataset: speed
+    dataset_path: /data/speed/qualitative
+    concurrency: [32]
+    num_requests: 880
+    output_length: 4096
+  - dataset: speed
+    dataset_path: /data/speed/throughput_1k
+    concurrency: [1, 2, 4, 8, 16, 32, 64]
+    num_requests: [8, 16, 32, 32, 64, 128, 256]
+    output_length: 2048
+```
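The scalar-versus-list rule described above can be sketched as a small normalization step. This is a hypothetical helper for illustration only (`expand_run` is my own name, not part of the benchmark, and it handles just `num_requests` and `output_length`):

```python
def expand_run(run: dict) -> list[dict]:
    """Expand one sweep entry into per-concurrency-level runs.

    Scalars are broadcast across all concurrency levels; lists must
    match the length of `concurrency`.
    """
    conc = run["concurrency"]
    levels = conc if isinstance(conc, list) else [conc]

    def per_level(key):
        val = run.get(key)
        if isinstance(val, list):
            if len(val) != len(levels):
                raise ValueError(f"{key} must match length of concurrency")
            return val
        return [val] * len(levels)  # broadcast a scalar to every level

    reqs = per_level("num_requests")
    outs = per_level("output_length")
    return [
        {"dataset": run["dataset"], "concurrency": c,
         "num_requests": n, "output_length": o}
        for c, n, o in zip(levels, reqs, outs)
    ]
```

For the second example entry above, this would yield seven runs, one per concurrency level, each pairing the matching `num_requests` value with the shared `output_length` of 2048.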
+
+### Running a sweep
+
+Pass `--sweep_config` in place of the usual dataset flags:
+
+```bash
+python3 run.py \
+    --model_dir meta-llama/Llama-3.3-70B-Instruct \
+    --tokenizer meta-llama/Llama-3.3-70B-Instruct \
+    --draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
+    --tp_size 8 \
+    --ep_size 1 \
+    --draft_length 3 \
+    --engine TRTLLM \
+    --sweep_config sweep_example.yaml \
+    --sweep_output_root ./my_sweep_results
+```
+
+### Output structure
+
+Each run is saved to its own subdirectory under the sweep output root:
+
+```
+my_sweep_results/
+  000_speed_c32/   # first entry, concurrency=32
+  001_speed_c1/    # second entry, concurrency=1
+  001_speed_c2/    # second entry, concurrency=2
+  ...
+```
+
+If `--sweep_output_root` is not set, outputs go to `./sweep_outputs/<timestamp>/`.
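Going by the tree shown above, the subdirectory names appear to follow an `<entry-index>_<dataset>_c<concurrency>` pattern. A hypothetical reconstruction (`run_dir_name` is my own name, inferred from the example output, not a documented API):

```python
def run_dir_name(entry_index: int, dataset: str, concurrency: int) -> str:
    # Zero-padded sweep-entry index, dataset name, then the concurrency level.
    return f"{entry_index:03d}_{dataset}_c{concurrency}"
```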
+
 ## Notes
 
 The goal of this benchmark is to provide an easy way to configure, run, and compare speculative decoding implementations across frameworks in an apples-to-apples manner.
 This benchmark sends requests in a single-threaded fashion, so running large concurrency (>256) may result in Python async scheduling delays and skew metrics.
 If larger concurrency is needed, it is recommended to fully deploy the model using `vllm serve`, `python -m sglang.launch_server`, or `trtllm-serve` (for vLLM, SGLang, or TRTLLM respectively) and
-use a more robust benchmarking client like NVIDIA AI Perf.
+use a more robust benchmarking client like NVIDIA AI Perf.

examples/specdec_bench/SPECBENCH_PORTING.md

Lines changed: 2 additions & 14 deletions
@@ -5,13 +5,11 @@ This guide explains how to convert any `inference_*.py` runner from [Spec-Bench]
 ## Overview
 
 Spec-Bench inference runners follow a pattern where:
-
 1. A `*_forward()` function handles the speculative decoding logic
 2. The `run_eval()` function orchestrates evaluation with tokenized inputs
 3. Models are loaded in `__main__` and passed to `run_eval()`
 
 In contrast, `specdec_bench` uses a class-based approach where:
-
 1. Models inherit from the `Model` base class
 2. `__init__()` handles model loading
 3. `run()` is an async method that processes single requests
@@ -29,7 +27,6 @@ class Model:
             prompt_ids: list of token IDs (not a tensor!)
         Returns dict with:
         - output_ids: list of list of token chunks per step [[chunk1, chunk2, ...]]
-        - output_logits: optional logits (usually None)
         - token_times: list of timestamps per decoding step
         """
         raise NotImplementedError
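As a sketch of the contract described above, a subclass fills in `__init__` for model loading and an async `run()` that returns per-step token chunks and timestamps. `EchoModel` is a hypothetical toy model of my own, not one of the benchmark's runners, and the base class here is reduced to the parts shown in the docstring:

```python
import asyncio
import time


class Model:
    """Base class, reduced to the interface sketched above."""

    async def run(self, prompt_ids, request_id=0, turn_id=0):
        raise NotImplementedError


class EchoModel(Model):
    """Toy model: 'generates' by echoing the prompt in fixed-size chunks."""

    def __init__(self, chunk_size: int = 2):
        self.chunk_size = chunk_size  # stands in for real model loading

    async def run(self, prompt_ids, request_id=0, turn_id=0):
        chunks, times = [], []
        for start in range(0, len(prompt_ids), self.chunk_size):
            chunks.append(prompt_ids[start:start + self.chunk_size])
            times.append(time.perf_counter())  # one timestamp per decoding step
        # output_ids: outer list has one entry per sequence; inner list is
        # one token chunk per decoding step, matching the docstring above.
        return {"output_ids": [chunks], "token_times": times}


out = asyncio.run(EchoModel().run([1, 2, 3, 4, 5]))
```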
@@ -181,7 +178,7 @@ Convert the standalone `*_forward()` function to an internal method:
             turn_id: Turn identifier
 
         Returns:
-            dict with output_ids, output_logits, token_times
+            dict with output_ids and token_times
         """
         output_dict = {}
@@ -221,7 +218,6 @@ Convert the standalone `*_forward()` function to an internal method:
         reformatted_output_ids[0].append(generated_tokens[start:])
 
         output_dict['output_ids'] = reformatted_output_ids
-        output_dict['output_logits'] = None
         output_dict['token_times'] = timing
 
         return output_dict
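The reformatting step shown above splits the flat generated-token list into one chunk per decoding step. A minimal self-contained sketch, assuming the per-step accepted-token counts are available (`split_by_steps` and `accept_lengths` are illustrative names, not the guide's actual variables):

```python
def split_by_steps(generated_tokens, accept_lengths):
    """Split a flat token list into [[chunk_per_step, ...]] form."""
    chunks, start = [], 0
    for n in accept_lengths:
        chunks.append(generated_tokens[start:start + n])
        start += n
    if start < len(generated_tokens):
        # Trailing tokens after the last counted step, as in the
        # generated_tokens[start:] append in the snippet above.
        chunks.append(generated_tokens[start:])
    return [chunks]  # outer list: one entry per sequence
```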
@@ -261,7 +257,7 @@ from .specbench_<method> import SpecBench<Method>Model
 
 | Aspect | Spec-Bench | specdec_bench |
 |--------|-----------|---------------|
 | Input format | `inputs.input_ids` (tensor from tokenizer) | `prompt_ids` (list of ints) |
-| Output format | `(output_ids, new_token, steps, accept_lengths)` | `dict` with `output_ids`, `output_logits`, `token_times` |
+| Output format | `(output_ids, new_token, steps, accept_lengths)` | `dict` with `output_ids`, `token_times` |
 | Output IDs | Full sequence tensor | List of token chunks per step |
 | Timing | External (in `run_eval`) | Internal (in `run()`) |
 | Device | `device_map="auto"` | Explicit single device |
@@ -319,11 +315,3 @@ async def test():
 
 asyncio.run(test())
 ```
-
-Adjust the vicuna chat template to be in the tokenizer_config to be
-
-Insert to tokenizer_config (for vicuna)
-
-```json
-"chat_template": "{% set ns = namespace(system='') %}{% for m in messages %}{% if m['role'] == 'system' %}{% set ns.system = m['content'] %}{% endif %}{% endfor %}{{ ns.system | trim }}{% if ns.system | trim != '' %} {% endif %}{% for m in messages %}{% if m['role'] == 'user' %}USER: {{ m['content'] | trim }} ASSISTANT:{% elif m['role'] == 'assistant' %}{{ m['content'] | trim }}{% endif %}{% endfor %}"
-```

examples/specdec_bench/prepare_data.py

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ def prepare_data(args: argparse.Namespace) -> None:
         "--config",
         type=str,
         default="all",
-        choices=[*list(get_args(config_type)), "all"],
+        choices=[*get_args(config_type), "all"],
         help='SPEED-Bench configuration to prepare. Use "all" to prepare all configs. (default: %(default)s)',
     )
     parser.add_argument(
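The change works because `typing.get_args` already returns a tuple, which unpacks directly; wrapping it in `list(...)` was redundant. A minimal illustration with a stand-in `Literal` (the real `config_type` alias lives in `prepare_data.py` and its members may differ):

```python
from typing import Literal, get_args

# Stand-in for the benchmark's config_type alias (members are illustrative).
config_type = Literal["qualitative", "throughput_1k"]

# get_args returns the Literal's members as a tuple, so it unpacks directly.
choices = [*get_args(config_type), "all"]
```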
Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-datasets>=3.1.0
+datasets>=4.2.0
 rich>=14.2.0
 seaborn>=0.13.2
 tiktoken>=0.12.0
