| sidebar-title | Profile with SPEED-Bench Dataset |
|---|
AIPerf supports benchmarking using SPEED-Bench (SPEculative Evaluation Dataset), a benchmark designed for evaluating speculative decoding across diverse semantic domains and input sequence lengths.
This guide covers profiling speculative-decoding-enabled inference servers using SPEED-Bench prompts and collecting server-side acceptance rate metrics per category.
The following datasets each load all categories combined into a single dataset:
| Dataset Name | Samples | Description |
|---|---|---|
| speed_bench_qualitative | 880 | All 11 semantic domains combined |
| speed_bench_throughput_1k | 1,536 | ~1K input tokens, all 3 entropy tiers |
| speed_bench_throughput_2k | 1,536 | ~2K input tokens, all 3 entropy tiers |
| speed_bench_throughput_8k | 1,536 | ~8K input tokens, all 3 entropy tiers |
| speed_bench_throughput_16k | 1,536 | ~16K input tokens, all 3 entropy tiers |
| speed_bench_throughput_32k | 1,536 | ~32K input tokens, all 3 entropy tiers |
For per-category acceptance rate measurement, each of the 11 qualitative domains is registered separately:
| Dataset Name | Category |
|---|---|
| speed_bench_coding | Code generation and programming |
| speed_bench_humanities | History, philosophy, liberal arts |
| speed_bench_math | Mathematical reasoning |
| speed_bench_multilingual | Tasks across 23 languages |
| speed_bench_qa | Question answering |
| speed_bench_rag | Retrieval-augmented generation |
| speed_bench_reasoning | Logical and analytical reasoning |
| speed_bench_roleplay | Creative roleplay and dialogue |
| speed_bench_stem | Science, technology, engineering |
| speed_bench_summarization | Text summarization |
| speed_bench_writing | Creative and technical writing |
Each throughput ISL bucket is also available filtered by entropy tier:
| Pattern | Tiers | Description |
|---|---|---|
| speed_bench_throughput_{ISL}_low_entropy | Code, sorting | Predictable output patterns |
| speed_bench_throughput_{ISL}_mixed | Needle-in-a-haystack, exams | Moderate unpredictability |
| speed_bench_throughput_{ISL}_high_entropy | Creative writing, dialogue | Highly unpredictable output |
Where {ISL} is one of: 1k, 2k, 8k, 16k, 32k.
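The pattern above expands to 15 dataset names (5 ISL buckets × 3 tiers). As a minimal sketch, the helper below (the function name `list_throughput_tier_datasets` is hypothetical, not part of AIPerf) prints every combination so the names can be copied into `--custom-dataset-type` or fed to a loop:

```bash
#!/usr/bin/env bash
# Print every per-tier throughput dataset name implied by the pattern
# speed_bench_throughput_{ISL}_{tier}. Hypothetical helper for illustration.
list_throughput_tier_datasets() {
  local isl tier
  for isl in 1k 2k 8k 16k 32k; do
    for tier in low_entropy mixed high_entropy; do
      echo "speed_bench_throughput_${isl}_${tier}"
    done
  done
}

list_throughput_tier_datasets
```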
NOTICE: This dataset is governed by the NVIDIA Evaluation Dataset License Agreement. For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The prepare data script below automatically fetches data from all the source datasets.
First, download and prepare the dataset with the following one-liner:
SPEED_BENCH_DIR="./datasets/speed-bench"
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 - --output_dir "$SPEED_BENCH_DIR"
This downloads all splits as JSONL files into the output directory. The prepare script also supports these options:
- --config: selects which config to prepare. It can be one of the splits in the dataset (e.g., qualitative, throughput_2k) or all to prepare all of the configs.
- --output_dir: selects a different output directory to download the dataset to.
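Before benchmarking, it can help to confirm that the prepare step produced the expected files. The sketch below counts records in a JSONL file, assuming one JSON object per non-empty line; the helper name `count_jsonl_records` is hypothetical, and the file name `qualitative.jsonl` is taken from the commands later in this guide. If the qualitative split was prepared completely, the count should match the 880 samples listed above.

```bash
#!/usr/bin/env bash
# Count records in a JSONL file, assuming one JSON object per non-empty line.
# Hypothetical helper for illustration; not part of AIPerf or the prepare script.
count_jsonl_records() {
  local file="$1"
  # grep -c prints the number of non-blank lines; "|| true" keeps a zero
  # count (grep exit status 1) from being treated as a failure.
  grep -c -v '^[[:space:]]*$' "$file" || true
}

# Only report when the file exists, so the snippet is safe to run anywhere.
SPEED_BENCH_DIR="${SPEED_BENCH_DIR:-./datasets/speed-bench}"
if [ -f "${SPEED_BENCH_DIR}/qualitative.jsonl" ]; then
  echo "qualitative: $(count_jsonl_records "${SPEED_BENCH_DIR}/qualitative.jsonl") records"
fi
```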
Launch an inference server with speculative decoding enabled. For example, with vLLM:
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-config '{"model": "meta-llama/Llama-3.2-1B-Instruct", "num_speculative_tokens": 5, "method": "draft_model"}'
Verify the server is ready:
curl -s localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"test"}],"max_tokens":1}'
AIPerf auto-discovers the Prometheus endpoint at {url}/metrics. If your server uses a different path, pass it explicitly with --server-metrics:
| Server Type | Metrics Path | Flag Needed |
|---|---|---|
| Standalone vLLM / SGLang | /metrics (default) | None (auto-discovered) |
| NIM-LLM containers | /v1/metrics | --server-metrics http://localhost:8000/v1/metrics |
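To check that the chosen metrics path actually exposes speculative decoding metrics, one option is to filter the Prometheus text output by metric name. The sketch below assumes the relevant metric names contain the substring "spec" (as SGLang's spec_accept_length does); exact names vary by server and version, so treat the pattern as a starting point. The helper name `filter_spec_metrics` is hypothetical.

```bash
#!/usr/bin/env bash
# Keep only sample lines whose metric name contains "spec" from Prometheus
# text read on stdin. Hypothetical helper; the "spec" substring is an assumption.
filter_spec_metrics() {
  # Drop "# HELP" / "# TYPE" comment lines, then match "spec" only in the
  # metric name (the part before any "{labels}" or space).
  grep -v '^#' | grep -i '^[^{ ]*spec' || true
}

# Usage against a live server (not executed here):
#   curl -s http://localhost:8000/metrics | filter_spec_metrics
```

An empty result suggests either that speculative decoding is not enabled on the server or that the metrics live under a different path or naming scheme.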
For standard (non-reasoning) models, use temperature=0 and a 4K output length cap:
aiperf profile \
--model meta-llama/Llama-3.1-8B-Instruct \
--endpoint-type chat \
--streaming \
--url localhost:8000 \
--custom-dataset-type speed_bench_coding \
--input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
--osl 4096 \
--extra-inputs temperature:0 \
--concurrency 16
Do not set ignore_eos; let the model stop naturally at its end-of-sequence token.
For reasoning models (e.g., DeepSeek-R1, QwQ), follow the model card's recommended settings for temperature, top_p, and output length. Reasoning models typically require higher output limits and specific sampling parameters.
To measure acceptance rates per category (matching the SPEED-Bench paper methodology), run each category separately. Each run collects speculative decoding metrics from the server's Prometheus endpoint.
aiperf profile \
--model meta/llama-3.1-8b-instruct \
--endpoint-type chat \
--streaming \
--url localhost:8000 \
--custom-dataset-type speed_bench_coding \
--input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
--server-metrics http://localhost:8000/metrics \
--osl 4096 \
--extra-inputs temperature:0 \
--concurrency 16 \
--output-artifact-dir ./artifacts/speed_bench_coding
Loop through all categories, then assemble results into a per-category matrix:
CATEGORIES="coding humanities math multilingual qa rag reasoning roleplay stem summarization writing"
MODEL="meta/llama-3.1-8b-instruct"
for cat in $CATEGORIES; do
  echo "=== Running category: $cat ==="
  aiperf profile \
    --model "$MODEL" \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type "speed_bench_${cat}" \
    --input-file "${SPEED_BENCH_DIR}/qualitative.jsonl" \
    --server-metrics http://localhost:8000/metrics \
    --osl 4096 \
    --extra-inputs temperature:0 \
    --concurrency 16 \
    --output-artifact-dir "./artifacts/speed_bench_${cat}"
done
# Assemble the matrix report
aiperf speed-bench-report ./artifacts/ --format both
This produces a CSV (speed_bench_report.csv) and console table:
SPEED-Bench Acceptance Length Report
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Model ┃ coding ┃ humanities ┃ math ┃ writing ┃ Overall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ meta/llama-3.1-8b-instruct │ 1.80 │ 1.84 │ 1.78 │ 1.76 │ 1.78 │
└────────────────────────────┴────────┴────────────┴──────┴─────────┴─────────┘
The report script computes acceptance length from vLLM counter metrics (accepted_tokens / num_drafts + 1) and also supports SGLang's direct spec_accept_length gauge.
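As a worked example of the counter-based formula, the sketch below computes acceptance length (accepted_tokens / num_drafts + 1) and acceptance rate (accepted / draft tokens) from raw counter values. The "+ 1" reflects the usual speculative decoding accounting, in which each verification step also emits one token from the target model. The function names and the sample counter values are illustrative assumptions, not AIPerf code or measured results.

```bash
#!/usr/bin/env bash
# Acceptance length = accepted_tokens / num_drafts + 1.
# Hypothetical helper mirroring the formula stated in this guide.
acceptance_length() {
  local accepted="$1" drafts="$2"
  awk -v a="$accepted" -v d="$drafts" 'BEGIN {
    if (d + 0 <= 0) { print "n/a"; exit }   # avoid division by zero
    printf "%.2f\n", a / d + 1
  }'
}

# Acceptance rate = accepted_tokens / draft_tokens.
acceptance_rate() {
  local accepted="$1" draft_tokens="$2"
  awk -v a="$accepted" -v t="$draft_tokens" 'BEGIN {
    if (t + 0 <= 0) { print "n/a"; exit }   # avoid division by zero
    printf "%.3f\n", a / t
  }'
}

# Illustrative counters: 1000 drafts of 5 tokens each, 800 tokens accepted.
acceptance_length 800 1000   # prints 1.80
acceptance_rate 800 5000     # prints 0.160
```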
Additional report metrics:
# Acceptance rate matrix (accepted / draft tokens)
aiperf speed-bench-report ./artifacts/ --metric accept_rate
# Throughput matrix (output tokens/sec per category)
aiperf speed-bench-report ./artifacts/ --metric throughput
To run all 880 prompts in a single benchmark (without per-category breakdown):
aiperf profile \
--model meta/llama-3.1-8b-instruct \
--endpoint-type chat \
--streaming \
--url localhost:8000 \
--custom-dataset-type speed_bench_qualitative \
--input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
--server-metrics http://localhost:8000/metrics \
--concurrency 16
The throughput splits benchmark end-to-end performance at fixed input sequence lengths:
aiperf profile \
--model meta/llama-3.1-8b-instruct \
--endpoint-type chat \
--streaming \
--url localhost:8000 \
--custom-dataset-type speed_bench_throughput_1k \
--input-file ${SPEED_BENCH_DIR}/throughput_1k.jsonl \
--server-metrics http://localhost:8000/metrics \
--concurrency 64 \
--benchmark-duration 120
Replace speed_bench_throughput_1k with any throughput variant (_2k, _8k, _16k, _32k) to test at different input lengths.
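To sweep every input length rather than editing the command by hand, a loop over the ISL buckets is one option. The sketch below is a dry run: it prints the command for each bucket instead of executing it. It assumes the prepared files follow the throughput_{ISL}.jsonl naming seen for throughput_1k.jsonl, and the helper name `throughput_cmd` is hypothetical.

```bash
#!/usr/bin/env bash
# Print (rather than run) the throughput benchmark command for one ISL bucket.
# Assumption: input files are named throughput_{ISL}.jsonl, as with throughput_1k.jsonl.
throughput_cmd() {
  local isl="$1"
  local dir="${SPEED_BENCH_DIR:-./datasets/speed-bench}"
  echo aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type "speed_bench_throughput_${isl}" \
    --input-file "${dir}/throughput_${isl}.jsonl" \
    --server-metrics http://localhost:8000/metrics \
    --concurrency 64 \
    --benchmark-duration 120
}

for isl in 1k 2k 8k 16k 32k; do
  throughput_cmd "$isl"   # dry run: prints the command for review
done
```

Dropping the leading `echo` inside the function turns the dry run into real benchmark runs.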
To isolate entropy effects on acceptance rate at a given ISL:
for tier in low_entropy mixed high_entropy; do
  echo "=== Running throughput_1k tier: $tier ==="
  aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type "speed_bench_throughput_1k_${tier}" \
    --input-file "${SPEED_BENCH_DIR}/throughput_1k.jsonl" \
    --server-metrics http://localhost:8000/metrics \
    --concurrency 64 \
    --benchmark-duration 60
done
Server metrics collection is enabled by default. To disable it:
aiperf profile \
--model meta/llama-3.1-8b-instruct \
--endpoint-type chat \
--streaming \
--url localhost:8000 \
--custom-dataset-type speed_bench_qualitative \
--input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
--no-server-metrics \
--concurrency 16