Profile with SPEED-Bench Dataset

AIPerf supports benchmarking using SPEED-Bench (SPEculative Evaluation Dataset), a benchmark designed for evaluating speculative decoding across diverse semantic domains and input sequence lengths.

This guide covers profiling speculative-decoding-enabled inference servers using SPEED-Bench prompts and collecting server-side acceptance rate metrics per category.

Available Dataset Variants

Aggregate Datasets

These load all categories combined in a single dataset:

Dataset Name	Samples	Description
`speed_bench_qualitative`	880	All 11 semantic domains combined
`speed_bench_throughput_1k`	1,536	~1K input tokens, all 3 entropy tiers
`speed_bench_throughput_2k`	1,536	~2K input tokens, all 3 entropy tiers
`speed_bench_throughput_8k`	1,536	~8K input tokens, all 3 entropy tiers
`speed_bench_throughput_16k`	1,536	~16K input tokens, all 3 entropy tiers
`speed_bench_throughput_32k`	1,536	~32K input tokens, all 3 entropy tiers

Per-Category Qualitative Datasets (80 prompts each)

For per-category acceptance rate measurement, each of the 11 qualitative domains is registered separately:

Dataset Name	Category
`speed_bench_coding`	Code generation and programming
`speed_bench_humanities`	History, philosophy, liberal arts
`speed_bench_math`	Mathematical reasoning
`speed_bench_multilingual`	Tasks across 23 languages
`speed_bench_qa`	Question answering
`speed_bench_rag`	Retrieval-augmented generation
`speed_bench_reasoning`	Logical and analytical reasoning
`speed_bench_roleplay`	Creative roleplay and dialogue
`speed_bench_stem`	Science, technology, engineering
`speed_bench_summarization`	Text summarization
`speed_bench_writing`	Creative and technical writing

Per-Entropy-Tier Throughput Datasets (512 prompts each)

Each throughput ISL bucket is also available filtered by entropy tier:

Pattern	Typical Tasks	Description
`speed_bench_throughput_{ISL}_low_entropy`	Code, sorting	Predictable output patterns
`speed_bench_throughput_{ISL}_mixed`	Needle-in-a-haystack, exams	Moderate unpredictability
`speed_bench_throughput_{ISL}_high_entropy`	Creative writing, dialogue	Highly unpredictable output

Where {ISL} is one of: 1k, 2k, 8k, 16k, 32k.
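The naming pattern above expands to 15 dataset names (5 ISL buckets times 3 entropy tiers). As a small sketch, the helper below (`list_tier_datasets` is a hypothetical name, not part of AIPerf) prints each of them, which can be handy when scripting sweeps:

```shell
# Print every per-entropy-tier throughput dataset name:
# 5 ISL buckets x 3 tiers = 15 names, following the pattern in the table above.
list_tier_datasets() {
  for isl in 1k 2k 8k 16k 32k; do
    for tier in low_entropy mixed high_entropy; do
      echo "speed_bench_throughput_${isl}_${tier}"
    done
  done
}

list_tier_datasets
```

Each printed name can be passed to --custom-dataset-type.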

Prepare the Dataset

NOTICE: This dataset is governed by the NVIDIA Evaluation Dataset License Agreement. For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The prepare data script below automatically fetches data from all the source datasets.

First, download and prepare the dataset with the following one-liner:

SPEED_BENCH_DIR="./datasets/speed-bench"
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 - --output_dir $SPEED_BENCH_DIR

This downloads all splits into the output directory as JSONL files. The prepare script also supports the following options:

--config: selects which config to prepare. Use one of the dataset splits (e.g., qualitative, throughput_2k), or all to prepare every config.
--output_dir: sets the directory the dataset is downloaded to.
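After the prepare step, it can be useful to confirm which split files actually landed in the output directory. The sketch below uses a hypothetical helper, `check_splits`, and assumes every split is written as `<config>.jsonl`; only qualitative.jsonl and throughput_1k.jsonl are named in this guide, so the other file names are an assumption:

```shell
# Report which expected split files are present and non-empty in a directory.
# Assumption: each split is stored as <config>.jsonl (only qualitative.jsonl and
# throughput_1k.jsonl appear in this guide; the rest follow the same pattern).
check_splits() {
  dir="$1"
  missing=0
  for split in qualitative throughput_1k throughput_2k throughput_8k throughput_16k throughput_32k; do
    file="${dir}/${split}.jsonl"
    if [ -s "$file" ]; then
      echo "ok       ${split} ($(wc -l < "$file" | tr -d ' ') lines)"
    else
      echo "missing  ${split}"
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_splits "$SPEED_BENCH_DIR"
```

If only a single config was prepared with --config, the other splits are expected to show as missing.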

Start a Server with Speculative Decoding

Launch an inference server with speculative decoding enabled. For example, with vLLM:

docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"model": "meta-llama/Llama-3.2-1B-Instruct", "num_speculative_tokens": 5, "method": "draft_model"}'

Verify the server is ready:

curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"test"}],"max_tokens":1}'
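Model loading can take a while, so a single readiness request may fail if it is sent too early. As a minimal sketch, a generic retry helper (`wait_until` is a hypothetical name) can poll until a probe command succeeds. The commented example assumes the server exposes /v1/models, as OpenAI-compatible servers typically do:

```shell
# Retry a command once per second until it succeeds or the timeout (in seconds) expires.
# Returns 0 on success, 1 if the timeout is reached first.
wait_until() {
  timeout_s="$1"
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Example (assumed endpoint): wait up to 5 minutes for the server to answer.
# wait_until 300 curl -sf localhost:8000/v1/models && echo "server ready"
```

The probe is passed as arguments, so the chat completions request shown above can be used in its place.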

Server Metrics Endpoint

AIPerf auto-discovers the Prometheus endpoint at {url}/metrics. If your server uses a different path, pass it explicitly with --server-metrics:

Server Type	Metrics Path	Flag Needed
Standalone vLLM / SGLang	`/metrics` (default)	None (auto-discovered)
NIM-LLM containers	`/v1/metrics`	`--server-metrics http://localhost:8000/v1/metrics`
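Before profiling, it may be worth checking that the metrics endpoint actually reports speculative decoding metrics. The sketch below filters Prometheus text for sample lines containing `spec_`. It matches SGLang's spec_accept_length gauge mentioned later in this guide, and assumes vLLM's counters carry a similar `spec_` substring; exact metric names vary by server and version. `spec_metric_lines` is a hypothetical helper name:

```shell
# Keep only speculative-decoding samples from Prometheus text on stdin,
# dropping "# HELP" / "# TYPE" comment lines.
# Assumption: the relevant metric names contain the substring "spec_".
spec_metric_lines() {
  grep -v '^#' | grep -i 'spec_'
}

# Example: curl -s localhost:8000/metrics | spec_metric_lines
```

Empty output suggests that speculative decoding is not enabled, or that the server uses a different metrics path or naming scheme.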

Recommended Defaults

Non-Reasoning Models

For standard (non-reasoning) models, use temperature=0 and a 4K output length cap:

aiperf profile \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type speed_bench_coding \
    --input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
    --osl 4096 \
    --extra-inputs temperature:0 \
    --concurrency 16

Do not set ignore_eos; let the model stop naturally at its end-of-sequence token.

Reasoning Models

For reasoning models (e.g., DeepSeek-R1, QwQ), follow the model card's recommended settings for temperature, top_p, and output length. Reasoning models typically require higher output limits and specific sampling parameters.

Per-Category Acceptance Rate Benchmarking

To measure acceptance rates per category (matching the SPEED-Bench paper methodology), run each category separately. Each run collects speculative decoding metrics from the server's Prometheus endpoint.

Single Category

aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type speed_bench_coding \
    --input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
    --server-metrics http://localhost:8000/metrics \
    --osl 4096 \
    --extra-inputs temperature:0 \
    --concurrency 16 \
    --output-artifact-dir ./artifacts/speed_bench_coding

All 11 Categories with Matrix Report

Loop through all categories, then assemble results into a per-category matrix:

CATEGORIES="coding humanities math multilingual qa rag reasoning roleplay stem summarization writing"
MODEL="meta/llama-3.1-8b-instruct"

for cat in $CATEGORIES; do
  echo "=== Running category: $cat ==="
  aiperf profile \
      --model "$MODEL" \
      --endpoint-type chat \
      --streaming \
      --url localhost:8000 \
      --custom-dataset-type speed_bench_${cat} \
      --input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
      --server-metrics http://localhost:8000/metrics \
      --osl 4096 \
      --extra-inputs temperature:0 \
      --concurrency 16 \
      --output-artifact-dir "./artifacts/speed_bench_${cat}"
done

# Assemble the matrix report
aiperf speed-bench-report ./artifacts/ --format both

This produces a CSV (speed_bench_report.csv) and a console table:

                         SPEED-Bench Acceptance Length Report
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Model                      ┃ coding ┃ humanities ┃ math ┃ writing ┃ Overall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ meta/llama-3.1-8b-instruct │   1.80 │       1.84 │ 1.78 │    1.76 │    1.78 │
└────────────────────────────┴────────┴────────────┴──────┴─────────┴─────────┘

The report script computes acceptance length from vLLM counter metrics as 1 + (accepted_tokens / num_drafts), and also supports SGLang's direct spec_accept_length gauge.
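To make the counter-based formula concrete, here is a small worked sketch. The counter values are made up for illustration, and `acceptance_length` is a hypothetical helper, not the report script itself. The added 1 accounts for the token the target model emits on every verification step in addition to the accepted draft tokens:

```shell
# Acceptance length = 1 + accepted_tokens / num_drafts
# Arguments: <accepted_tokens> <num_drafts>
acceptance_length() {
  awk -v accepted="$1" -v drafts="$2" 'BEGIN {
    if (drafts == 0) { print "n/a"; exit }   # no drafts recorded yet
    printf "%.2f\n", 1 + accepted / drafts
  }'
}

acceptance_length 800 1000   # prints 1.80
```

With 800 accepted tokens over 1,000 drafts, each verification step yields 1.8 tokens on average.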

Additional report metrics:

# Acceptance rate matrix (accepted / draft tokens)
aiperf speed-bench-report ./artifacts/ --metric accept_rate

# Throughput matrix (output tokens/sec per category)
aiperf speed-bench-report ./artifacts/ --metric throughput

Profile with Aggregate Qualitative Split

To run all 880 prompts in a single benchmark (without per-category breakdown):

aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type speed_bench_qualitative \
    --input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
    --server-metrics http://localhost:8000/metrics \
    --concurrency 16

Profile with Throughput Splits

The throughput splits benchmark end-to-end performance at fixed input sequence lengths:

aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type speed_bench_throughput_1k \
    --input-file ${SPEED_BENCH_DIR}/throughput_1k.jsonl \
    --server-metrics http://localhost:8000/metrics \
    --concurrency 64 \
    --benchmark-duration 120

Replace speed_bench_throughput_1k with any throughput variant (_2k, _8k, _16k, _32k) to test at different input lengths, and point --input-file at the matching JSONL file.
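When sweeping several input lengths, the dataset type and the input file need to change together. As a sketch, the hypothetical helper `throughput_flags` prints the matching pair of flags for one ISL bucket. It assumes each split is stored as throughput_<ISL>.jsonl in the prepare output directory:

```shell
# Print the dataset-type and input-file flags for one ISL bucket.
# Assumption: each split is stored as throughput_<ISL>.jsonl in the output directory.
throughput_flags() {
  echo "--custom-dataset-type speed_bench_throughput_$1 --input-file ${SPEED_BENCH_DIR:-./datasets/speed-bench}/throughput_$1.jsonl"
}

# Dry run: show the flags for every ISL bucket without launching a benchmark.
for isl in 1k 2k 8k 16k 32k; do
  throughput_flags "$isl"
done
```

The printed flags can be substituted into the aiperf profile command above in place of its --custom-dataset-type and --input-file lines.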

Per-Entropy-Tier Throughput

To isolate entropy effects on acceptance rate at a given ISL:

for tier in low_entropy mixed high_entropy; do
  echo "=== Running throughput_1k tier: $tier ==="
  aiperf profile \
      --model meta/llama-3.1-8b-instruct \
      --endpoint-type chat \
      --streaming \
      --url localhost:8000 \
      --custom-dataset-type "speed_bench_throughput_1k_${tier}" \
      --input-file ${SPEED_BENCH_DIR}/throughput_1k.jsonl \
      --server-metrics http://localhost:8000/metrics \
      --concurrency 64 \
      --benchmark-duration 60
done

Disable Server Metrics

Server metrics collection is enabled by default. To disable it:

aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type speed_bench_qualitative \
    --input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
    --no-server-metrics \
    --concurrency 16
