Caution
This guide is for MAX framework developers (Git repo developers).
If you're using MAX as an application developer (via the modular package),
you should instead use the max benchmark command line tool—see the more
detailed guide to benchmark MAX on
GPUs.
This guide describes how to benchmark the performance of a MAX model
server—measuring throughput, latency, and resource utilization—using
benchmark_serving.py. You can use this script to compare other serving
backends, namely vLLM, against MAX.
Key features:
- Tests any OpenAI-compatible HTTP endpoint
- Supports both chat and completion APIs
- Measures detailed latency metrics
- Works with hosted services
The
benchmark_serving.pyscript is adapted from vLLM, licensed under Apache 2.0. We forked this script to ensure consistency with vLLM's measurement methodology and extended it with features we found helpful, such as client-side GPU metric collection viamax.diagnostics.
You can benchmark any HTTP endpoint that implements OpenAI-compatible APIs as follows:
python benchmark_serving.py \
--model google/gemma-3-27b-it \
--backend modular \
--endpoint /v1/chat/completions \
--dataset-name sonnet \
--num-prompts 500 \
--sonnet-input-len 512 \
--output-lengths 256 \
--sonnet-prefix-len 200To see all the available options, run:
python benchmark_serving.py --helpFor more information, see the max benchmark
documentation.
When you want to save your benchmark configurations, you can define them in a
YAML file and pass it with the --config-file option.
The YAML file supports all the same properties as the benchmark_serving.py
arguments, except in the YAML file the properties must use snake_case
names instead of hyphenated names—for example, --num-prompts becomes
num_prompts. And all properties must be nested under a top level
benchmark_config key. For example:
benchmark_config:
model: google/gemma-3-27b-it
backend: modular
endpoint: /v1/chat/completions
dataset_name: sonnet
...To help you reproduce our own benchmarks, we've made some of our config files
available in the configs directory. For example, copy our
gemma-3-27b-sonnet-decode-heavy-prefix200.yaml
file from GitHub, and you can benchmark Gemma3-27B with this command:
max benchmark --config-file gemma-3-27b-sonnet-decode-heavy-prefix200.yamlResults are printed to the terminal but you can also save a JSON-formatted
file by adding the --save-result flag. It's saved to the local directory
with this naming convention:
{backend}-{request_rate}qps-{model_name}-{timestamp}.jsonYou can change the file name with --result-filename and change the directory
with --result-dir.
The output in the terminal should look similar to the following:
============ Serving Benchmark Result ============
Successful requests: 500
Failed requests: 0
Benchmark duration (s): 46.27
Total input tokens: 100895
Total generated tokens: 106511
Request throughput (req/s): 10.81
Input token throughput (tok/s): 2180.51
Output token throughput (tok/s): 2301.89
---------------Time to First Token----------------
Mean TTFT (ms): 15539.31
Median TTFT (ms): 15068.37
P99 TTFT (ms): 33034.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 34.23
Median TPOT (ms): 28.47
P99 TPOT (ms): 138.55
---------------Inter-token Latency----------------
Mean ITL (ms): 26.76
Median ITL (ms): 5.42
P99 ITL (ms): 228.45
-------------------Token Stats--------------------
Max input tokens: 933
Max output tokens: 806
Max total tokens: 1570
--------------------GPU Stats---------------------
GPU Utilization (%): 94.74
Peak GPU Memory Used (MiB): 37228.12
GPU Memory Available (MiB): 3216.25
==================================================- Request throughput: Number of complete requests processed per second
- Input token throughput: Number of input tokens processed per second
- Output token throughput: Number of tokens generated per second
- TTFT: Time to first token (TTFT), the time from request start to first token generation
- TPOT: Time per output token (TPOT), the average time taken to generate each output token
- ITL: Inter-token latency (ITL), the average time between consecutive token or token-chunk generations
- GPU utilization: Percentage of time during which at least one GPU kernel is being executed
- Peak GPU memory used: Peak memory usage during benchmark run
- Reduce batch size
- Check GPU memory availability:
nvidia-smiorrocm-smi
- Verify
HF_TOKENis set correctly - Ensure model access on Hugging Face