Profile with SPEED-Bench Dataset

AIPerf supports benchmarking using SPEED-Bench (SPEculative Evaluation Dataset), a benchmark designed for evaluating speculative decoding across diverse semantic domains and input sequence lengths.

This guide covers profiling speculative-decoding-enabled inference servers using SPEED-Bench prompts and collecting server-side acceptance rate metrics per category.


Available Dataset Variants

Aggregate Datasets

These load all categories combined in a single dataset:

| Dataset Name | Samples | Description |
|---|---|---|
| speed_bench_qualitative | 880 | All 11 semantic domains combined |
| speed_bench_throughput_1k | 1,536 | ~1K input tokens, all 3 entropy tiers |
| speed_bench_throughput_2k | 1,536 | ~2K input tokens, all 3 entropy tiers |
| speed_bench_throughput_8k | 1,536 | ~8K input tokens, all 3 entropy tiers |
| speed_bench_throughput_16k | 1,536 | ~16K input tokens, all 3 entropy tiers |
| speed_bench_throughput_32k | 1,536 | ~32K input tokens, all 3 entropy tiers |

Per-Category Qualitative Datasets (80 prompts each)

For per-category acceptance rate measurement, each of the 11 qualitative domains is registered separately:

| Dataset Name | Category |
|---|---|
| speed_bench_coding | Code generation and programming |
| speed_bench_humanities | History, philosophy, liberal arts |
| speed_bench_math | Mathematical reasoning |
| speed_bench_multilingual | Tasks across 23 languages |
| speed_bench_qa | Question answering |
| speed_bench_rag | Retrieval-augmented generation |
| speed_bench_reasoning | Logical and analytical reasoning |
| speed_bench_roleplay | Creative roleplay and dialogue |
| speed_bench_stem | Science, technology, engineering |
| speed_bench_summarization | Text summarization |
| speed_bench_writing | Creative and technical writing |

Per-Entropy-Tier Throughput Datasets (512 prompts each)

Each throughput input sequence length (ISL) bucket is also available filtered by entropy tier:

| Pattern | Tier Contents | Description |
|---|---|---|
| speed_bench_throughput_{ISL}_low_entropy | Code, sorting | Predictable output patterns |
| speed_bench_throughput_{ISL}_mixed | Needle-in-a-haystack, exams | Moderate unpredictability |
| speed_bench_throughput_{ISL}_high_entropy | Creative writing, dialogue | Highly unpredictable output |

Where {ISL} is one of: 1k, 2k, 8k, 16k, 32k.
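The naming pattern above expands to 15 dataset names (5 ISL buckets times 3 tiers). As a minimal sketch, the following helper prints them all; the function name is hypothetical and not part of AIPerf:

```shell
# Print every per-tier throughput dataset name (5 ISL buckets x 3 tiers).
# list_speed_bench_tier_datasets is an illustrative helper, not an AIPerf command.
list_speed_bench_tier_datasets() {
  for isl in 1k 2k 8k 16k 32k; do
    for tier in low_entropy mixed high_entropy; do
      echo "speed_bench_throughput_${isl}_${tier}"
    done
  done
}

list_speed_bench_tier_datasets
```

Each printed name can be passed to --custom-dataset-type.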


Prepare the Dataset

NOTICE: This dataset is governed by the NVIDIA Evaluation Dataset License Agreement. For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The prepare data script below automatically fetches data from all the source datasets.

First, download and prepare the dataset with the following one-liner:

SPEED_BENCH_DIR="./datasets/speed-bench"
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 - --output_dir "$SPEED_BENCH_DIR"

This downloads all splits into the output directory ($SPEED_BENCH_DIR) as JSONL files. The prepare script supports the following options:

  • --config: the config to prepare. Use one of the dataset splits (e.g., qualitative, throughput_2k), or all to prepare every config.
  • --output_dir: the directory to write the prepared dataset files to.
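To prepare a single config and sanity-check the result, something like the following sketch may help. Both function names are hypothetical. The check assumes one sample per line with a trailing newline after the last line, so the line count should match the sample counts listed earlier (for example, 880 for qualitative):

```shell
# Prepare one SPEED-Bench config instead of all of them.
# $1 = config name (e.g. qualitative, throughput_2k, or all), $2 = output directory.
prepare_speed_bench() {
  curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py \
    | python3 - --config "$1" --output_dir "$2"
}

# Compare the number of lines in a JSONL file with an expected sample count.
# Assumption: one sample per line, with a newline after the last line.
# $1 = path to JSONL file, $2 = expected sample count.
check_sample_count() {
  actual=$(wc -l < "$1" | tr -d ' ')
  if [ "$actual" -eq "$2" ]; then
    echo "ok: $1 has $actual samples"
  else
    echo "mismatch: $1 has $actual samples, expected $2" >&2
    return 1
  fi
}

# Usage:
#   prepare_speed_bench qualitative ./datasets/speed-bench
#   check_sample_count ./datasets/speed-bench/qualitative.jsonl 880
```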

Start a Server with Speculative Decoding

Launch an inference server with speculative decoding enabled. For example, with vLLM:

docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"model": "meta-llama/Llama-3.2-1B-Instruct", "num_speculative_tokens": 5, "method": "draft_model"}'

Verify the server is ready:

curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"test"}],"max_tokens":1}'
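Model loading can take a while, so a script may want to poll until the server answers. The sketch below is one way to do that. It assumes the server exposes the OpenAI-compatible /v1/models route; the function name, retry count, and 5-second interval are illustrative choices:

```shell
# Poll the server until it responds or the attempt budget runs out.
# $1 = base URL (e.g. http://localhost:8000), $2 = max attempts (default 60).
# Assumption: the server serves GET /v1/models once it is ready.
wait_for_server() {
  base_url="$1"
  max_attempts="${2:-60}"
  attempt=0
  until curl -sf -o /dev/null "${base_url}/v1/models"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "server at ${base_url} not ready after ${max_attempts} attempts" >&2
      return 1
    fi
    sleep 5
  done
  echo "server at ${base_url} is ready"
}

# Usage:
#   wait_for_server http://localhost:8000 && aiperf profile ...
```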

Server Metrics Endpoint

AIPerf auto-discovers the Prometheus endpoint at {url}/metrics. If your server uses a different path, pass it explicitly with --server-metrics:

| Server Type | Metrics Path | Flag Needed |
|---|---|---|
| Standalone vLLM / SGLang | /metrics (default) | None (auto-discovered) |
| NIM-LLM containers | /v1/metrics | --server-metrics http://localhost:8000/v1/metrics |
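Before a long run, it can be worth confirming that the metrics endpoint actually reports speculative decoding metrics. The filter below is a rough sketch: it assumes the relevant metric names contain the substring "spec", which may not hold for every server or version. The function name is hypothetical:

```shell
# Reduce Prometheus exposition text on stdin to speculative-decoding samples.
# Drops comment lines (# HELP / # TYPE) and keeps lines mentioning "spec".
# Assumption: spec-decode metric names contain "spec"; adjust the pattern if not.
spec_metric_lines() {
  grep -v '^#' | grep -i 'spec' || true
}

# Usage (use /v1/metrics instead for NIM-LLM containers):
#   curl -s http://localhost:8000/metrics | spec_metric_lines
```

Empty output suggests that speculative decoding is not enabled or that the path is wrong.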

Recommended Defaults

Non-Reasoning Models

For standard (non-reasoning) models, use temperature=0 and a 4K output length cap:

aiperf profile \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type speed_bench_coding \
    --input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
    --osl 4096 \
    --extra-inputs temperature:0 \
    --concurrency 16

Do not set ignore_eos — let the model stop naturally at its end-of-sequence token.

Reasoning Models

For reasoning models (e.g., DeepSeek-R1, QwQ), follow the model card's recommended settings for temperature, top_p, and output length. Reasoning models typically require higher output limits and specific sampling parameters.


Per-Category Acceptance Rate Benchmarking

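To measure acceptance rates per category (matching the SPEED-Bench paper methodology), run each category separately. Each run collects speculative decoding metrics from the server's Prometheus endpoint. The examples below use the model name meta/llama-3.1-8b-instruct; set --model to the name your server actually serves (the vLLM example above serves meta-llama/Llama-3.1-8B-Instruct).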

Single Category

aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type speed_bench_coding \
    --input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
    --server-metrics http://localhost:8000/metrics \
    --osl 4096 \
    --extra-inputs temperature:0 \
    --concurrency 16 \
    --output-artifact-dir ./artifacts/speed_bench_coding

All 11 Categories with Matrix Report

Loop through all categories, then assemble results into a per-category matrix:

CATEGORIES="coding humanities math multilingual qa rag reasoning roleplay stem summarization writing"
MODEL="meta/llama-3.1-8b-instruct"

for cat in $CATEGORIES; do
  echo "=== Running category: $cat ==="
  aiperf profile \
      --model "$MODEL" \
      --endpoint-type chat \
      --streaming \
      --url localhost:8000 \
      --custom-dataset-type speed_bench_${cat} \
      --input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
      --server-metrics http://localhost:8000/metrics \
      --osl 4096 \
      --extra-inputs temperature:0 \
      --concurrency 16 \
      --output-artifact-dir "./artifacts/speed_bench_${cat}"
done

# Assemble the matrix report
aiperf speed-bench-report ./artifacts/ --format both

This produces a CSV file (speed_bench_report.csv) and a console table:

                         SPEED-Bench Acceptance Length Report
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Model                      ┃ coding ┃ humanities ┃ math ┃ writing ┃ Overall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ meta/llama-3.1-8b-instruct │   1.80 │       1.84 │ 1.78 │    1.76 │    1.78 │
└────────────────────────────┴────────┴────────────┴──────┴─────────┴─────────┘

The report command computes acceptance length from vLLM counter metrics as (accepted_tokens / num_drafts) + 1, where the +1 accounts for the token the target model emits on every verification step. It also supports SGLang's direct spec_accept_length gauge.
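The counter-based formula can be worked through by hand. The sketch below uses made-up counter values and a hypothetical function name; it is not the report command's implementation:

```shell
# Acceptance length = accepted_tokens / num_drafts + 1.
# $1 = accepted tokens, $2 = number of drafts (both illustrative inputs).
acceptance_length() {
  awk -v a="$1" -v d="$2" 'BEGIN {
    if (d == 0) { print "nan"; exit 1 }   # no drafts recorded yet
    printf "%.2f\n", a / d + 1
  }'
}

# 800 accepted tokens over 1000 drafts: 0.8 accepted per draft, plus 1.
acceptance_length 800 1000   # prints 1.80
```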

Additional report metrics:

# Acceptance rate matrix (accepted / draft tokens)
aiperf speed-bench-report ./artifacts/ --metric accept_rate

# Throughput matrix (output tokens/sec per category)
aiperf speed-bench-report ./artifacts/ --metric throughput
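For comparison with acceptance length, acceptance rate divides accepted tokens by the number of draft tokens proposed. A minimal sketch with made-up numbers and a hypothetical function name:

```shell
# Acceptance rate = accepted tokens / draft tokens proposed.
# $1 = accepted tokens, $2 = draft tokens (both illustrative inputs).
acceptance_rate() {
  awk -v a="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "nan"; exit 1 }   # nothing proposed yet
    printf "%.2f\n", a / t
  }'
}

# 1000 drafts of 5 tokens each = 5000 draft tokens, 800 of them accepted.
acceptance_rate 800 5000   # prints 0.16
```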

Profile with Aggregate Qualitative Split

To run all 880 prompts in a single benchmark (without per-category breakdown):

aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type speed_bench_qualitative \
    --input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
    --server-metrics http://localhost:8000/metrics \
    --concurrency 16

Profile with Throughput Splits

The throughput splits benchmark end-to-end performance at fixed input sequence lengths:

aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type speed_bench_throughput_1k \
    --input-file ${SPEED_BENCH_DIR}/throughput_1k.jsonl \
    --server-metrics http://localhost:8000/metrics \
    --concurrency 64 \
    --benchmark-duration 120

To test at a different input length, replace speed_bench_throughput_1k with any other throughput variant (_2k, _8k, _16k, _32k) and point --input-file at the matching JSONL file (for example, throughput_2k.jsonl).
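To cover every input length in one go, the single run can be wrapped in a loop. This is a sketch under two assumptions: the prepare script writes one throughput_{ISL}.jsonl file per bucket, following the throughput_1k.jsonl pattern, and SPEED_BENCH_DIR is set as in the prepare step. The function name and artifact directory layout are illustrative:

```shell
# Run the throughput benchmark once per ISL bucket.
# $1 = model name as served by the inference server.
# Assumption: ${SPEED_BENCH_DIR}/throughput_<isl>.jsonl exists for each bucket.
run_throughput_sweep() {
  for isl in 1k 2k 8k 16k 32k; do
    echo "=== Running throughput_${isl} ===" >&2
    aiperf profile \
      --model "$1" \
      --endpoint-type chat \
      --streaming \
      --url localhost:8000 \
      --custom-dataset-type "speed_bench_throughput_${isl}" \
      --input-file "${SPEED_BENCH_DIR}/throughput_${isl}.jsonl" \
      --server-metrics http://localhost:8000/metrics \
      --concurrency 64 \
      --benchmark-duration 120 \
      --output-artifact-dir "./artifacts/speed_bench_throughput_${isl}"
  done
}

# Usage:
#   run_throughput_sweep meta/llama-3.1-8b-instruct
```

Writing each bucket to its own artifact directory keeps the results of one run from replacing another's.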

Per-Entropy-Tier Throughput

To isolate entropy effects on acceptance rate at a given ISL:

for tier in low_entropy mixed high_entropy; do
  echo "=== Running throughput_1k tier: $tier ==="
  aiperf profile \
      --model meta/llama-3.1-8b-instruct \
      --endpoint-type chat \
      --streaming \
      --url localhost:8000 \
      --custom-dataset-type "speed_bench_throughput_1k_${tier}" \
      --input-file ${SPEED_BENCH_DIR}/throughput_1k.jsonl \
      --server-metrics http://localhost:8000/metrics \
      --concurrency 64 \
      --benchmark-duration 60 \
      --output-artifact-dir "./artifacts/speed_bench_throughput_1k_${tier}"
done

Disable Server Metrics

Server metrics collection is enabled by default. To disable it:

aiperf profile \
    --model meta/llama-3.1-8b-instruct \
    --endpoint-type chat \
    --streaming \
    --url localhost:8000 \
    --custom-dataset-type speed_bench_qualitative \
    --input-file ${SPEED_BENCH_DIR}/qualitative.jsonl \
    --no-server-metrics \
    --concurrency 16