This benchmark is meant to be a lightweight layer on top of an existing vLLM/SGLang/TRTLLM installation. For example, no install
is required if one is running in one of the following Docker images: vllm/vllm-openai:v0.11.0 (vLLM), lmsysorg/sglang:v0.5.4.post2 (SGLang), or
nvcr.io/nvidia/tensorrt-llm/release:1.2.0 (TRT-LLM).
Next, change into the benchmark directory:
cd examples/specdec_bench
Collect relevant metrics on acceptance rate, timing, and outputs for speculative decoding methods. Acceptance rate refers to the number of tokens generated on every iteration. For a standard autoregressive LLM, this number is just 1.
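To make the metric concrete, here is a minimal sketch of how a mean acceptance rate could be computed from per-iteration token counts. The function name `mean_acceptance_rate` and the input lists are hypothetical and are not part of run.py; the counts are made-up illustration values, not measured results.

```python
def mean_acceptance_rate(tokens_per_iteration):
    """Average number of tokens emitted per decoding iteration."""
    if not tokens_per_iteration:
        raise ValueError("need at least one iteration")
    return sum(tokens_per_iteration) / len(tokens_per_iteration)


# Plain autoregressive decoding emits exactly one token per iteration.
baseline = mean_acceptance_rate([1, 1, 1, 1])

# Illustrative speculative run with draft length 3: each iteration emits
# the target model's own token plus 0 to 3 accepted draft tokens.
speculative = mean_acceptance_rate([4, 2, 1, 3])

print(baseline, speculative)
```

Under this definition, a value above 1 means the draft model is saving target-model iterations, and the upper bound is the draft length plus one.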
A basic example run script is provided that benchmarks MTBench (a standard set of 160 prompts spanning 8 categories). MTBench is available here
Download nvidia/gpt-oss-120b-Eagle3 to a local directory /path/to/eagle.
python3 run.py \
--model_dir openai/gpt-oss-120b \
--tokenizer openai/gpt-oss-120b \
--draft_model_dir /path/to/eagle \
--mtbench question.jsonl \
--tp_size 1 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--num_requests 80 \
--engine TRTLLM \
--concurrency 1 \
--postprocess gptoss
Download nvidia/gpt-oss-120b-Eagle3 to a local directory /path/to/eagle.
python3 run.py \
--model_dir openai/gpt-oss-120b \
--tokenizer openai/gpt-oss-120b \
--draft_model_dir /path/to/eagle \
--random_isl 1024 \
--tp_size 1 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--num_requests 40 \
--engine TRTLLM \
--concurrency 1
Running SPEED-Bench on Llama 3.3 70B + Eagle 3
- Install the requirements file using
  pip install -r requirements.txt
- Prepare the data using the provided script:
  python3 prepare_data.py --dataset speed --config all
The data will be saved to the data/ directory, with each config type (qualitative, throughput_1k, ...) in its own directory.
GOVERNING TERMS: This dataset is governed by the NVIDIA Evaluation Dataset License Agreement.
ADDITIONAL INFORMATION: MIT for bigcode/humanevalpack, RUCAIBox/MMATH, RUCAIBox/BAMBOO and EQ-Bench. Apache 2.0 for Writing Bench and Spec-Bench. CC BY 4.0 for FBK-MT/MCIF. MIT and Apache 2.0 for tianyang/repobench_python_v1.1, JetBrains-Research/lca-project-level-code-completion and tianyang/repobench_java_v1.1.
NOTICE: For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The prepare_data.py script automatically fetches data from all the source datasets.
Additional details are in the Hugging Face dataset repository.
python3 run.py \
--model_dir meta-llama/Llama-3.3-70B-Instruct \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--dataset speed \
--dataset_path data/speed/qualitative \
--tp_size 8 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--engine TRTLLM \
--concurrency 32 \
--show_progress
python3 run.py \
--model_dir meta-llama/Llama-3.3-70B-Instruct \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--dataset speed \
--dataset_path data/speed/throughput_1k \
--tp_size 8 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--engine TRTLLM \
--concurrency 32 \
--show_progress
For longer context (>8192 tokens), please use the following configuration when using TRTLLM:
engine_args:
  max_seq_len: 131072 # Model max context length (for Llama 3.3 70B)
  enable_chunked_prefill: true
python3 run.py \
--model_dir meta-llama/Llama-3.3-70B-Instruct \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--dataset speed \
--dataset_path data/speed/throughput_16k \
--tp_size 8 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--engine TRTLLM \
--concurrency 32 \
--show_progress \
--runtime_params runtime_args_long_context.yaml
Each run.py invocation writes a result directory containing configuration.json,
timing.json, acceptance_rate.json, and (when applicable) mtbench.json / specbench.json.
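Before post-processing or uploading a run, it can help to confirm that the expected files are actually there. The following is a minimal sketch under stated assumptions: the helper `inspect_result_dir` is hypothetical and not shipped with the benchmark, and it checks only for the file names listed above without assuming anything about their JSON schemas.

```python
import json
import tempfile
from pathlib import Path

# File names taken from the text above; their contents are not inspected.
REQUIRED_FILES = ("configuration.json", "timing.json", "acceptance_rate.json")
OPTIONAL_FILES = ("mtbench.json", "specbench.json")


def inspect_result_dir(result_dir):
    """Return (missing required files, optional files that are present)."""
    root = Path(result_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    extras = [name for name in OPTIONAL_FILES if (root / name).is_file()]
    return missing, extras


# Self-contained demo against a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("configuration.json", "timing.json", "mtbench.json"):
        (Path(tmp) / name).write_text(json.dumps({}))
    demo_missing, demo_extras = inspect_result_dir(tmp)

print(demo_missing, demo_extras)
```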
upload_to_s3.py is a single-file, drop-in tool that uploads one run — or an entire sweep —
to any S3-compatible bucket:
python upload_to_s3.py /path/to/run_or_sweep_dir s3://your-bucket/some/prefix \
--endpoint https://your-s3-endpoint \
--key-id YOUR_KEY_ID \
--secret YOUR_SECRET
--endpoint, --key-id, and --secret default to the S3_ENDPOINT, S3_KEY_ID, and
S3_SECRET environment variables. Omit --endpoint (or set S3_ENDPOINT="") to use AWS S3's
default endpoint. Use --dry-run to preview the upload plan, and --skip-existing to skip
runs already present at the destination instead of failing.
The tool handles two directory layouts and mirrors them into S3:
- Flat: LOCAL_DIR/run_name/{configuration,timing,...}.json
- Sweep: LOCAL_DIR/sweep_name/run_name/{configuration,timing,...}.json
LOCAL_DIR's basename is preserved in the destination prefix, so re-uploads from the same
source land in the same place.
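The mapping from the two layouts to destination keys can be sketched as follows. This is an illustration only, not the code in upload_to_s3.py: the function `plan_upload` is hypothetical, the exact key format the tool produces is an assumption, and nothing is uploaded. It only shows how keeping LOCAL_DIR's basename in the prefix makes the destination deterministic.

```python
import tempfile
from pathlib import Path, PurePosixPath


def plan_upload(local_dir, dest_prefix):
    """Map each JSON file under local_dir to an assumed destination key.

    Covers both layouts: run_name/*.json and sweep_name/run_name/*.json.
    """
    root = Path(local_dir)
    plan = {}
    for path in sorted(root.rglob("*.json")):
        rel = path.relative_to(root)
        if len(rel.parts) not in (2, 3):  # skip files outside either layout
            continue
        # Keep the basename of local_dir so re-uploads land in the same place.
        key = PurePosixPath(dest_prefix, root.name, *rel.parts)
        plan[str(path)] = str(key)
    return plan


# Demo with one flat run and one sweep run in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "results"
    (local / "run_a").mkdir(parents=True)
    (local / "sweep_1" / "run_b").mkdir(parents=True)
    (local / "run_a" / "timing.json").write_text("{}")
    (local / "sweep_1" / "run_b" / "timing.json").write_text("{}")
    demo_keys = sorted(plan_upload(local, "some/prefix").values())

print(demo_keys)
```

For the real upload plan, the tool's own --dry-run output is the authoritative source.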
The goal of this benchmark is to provide an easy way to configure, run, and compare speculative decoding implementations across frameworks in an apples-to-apples manner.
This benchmark sends requests in a single-threaded fashion, so running with large concurrency (>256) may cause Python async scheduling delays that skew the metrics.
If larger concurrency is needed, it is recommended to fully deploy the model using vllm serve, python -m sglang.launch_server, or trtllm-serve (for vLLM, SGLang, or TRTLLM respectively) and
use a more robust benchmarking client like NVIDIA AI Perf.
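The single-threaded pattern described above can be illustrated with a small asyncio sketch. It is not the client used by run.py: `fake_request` and `run_with_concurrency` are hypothetical names, and the request is a sleep rather than a call to an inference engine. The point it shows is that every in-flight request shares one event loop thread, so the loop's own scheduling work grows with the concurrency limit and ends up inside the measured latencies.

```python
import asyncio


async def fake_request(i):
    # Stand-in for a real inference call; sleeps instead of hitting an engine.
    await asyncio.sleep(0.01)
    return i


async def run_with_concurrency(num_requests, concurrency):
    """Run num_requests coroutines with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def worker(i):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request(i)
            finally:
                in_flight -= 1

    # gather preserves submission order in the returned list.
    results = await asyncio.gather(*(worker(i) for i in range(num_requests)))
    return results, peak


results, peak = asyncio.run(run_with_concurrency(num_requests=20, concurrency=4))
print(len(results), peak)
```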