| sidebar-title | Custom Prompt Benchmarking |
|---|
Benchmark with prompts from your own file, sent exactly as specified without sampling or generation.
This tutorial uses the mooncake_trace dataset type with text_input field to send prompts exactly as-is.
The mooncake_trace dataset type with text_input provides:
- Exact Control: Send precisely the text you specify
- Deterministic Testing: Same file produces identical request sequence every time
- Production Replay: Use real user queries for realistic benchmarking
- Debugging: Isolate performance issues with specific prompts
This is different from random_pool which samples from a dataset.
Traces send each entry exactly once in order.
# Start vLLM server
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B \
--host 0.0.0.0 --port 8000 &# Wait for server to be ready
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }# Create an input file to use for benchmarking
# Text can be provided via text_input or input_length
# Output_length can optionally set the maximum number of tokens to generate in the response
cat > production_queries.jsonl << 'EOF'
{"text_input": "What is the capital of France?", "output_length": 20}
{"text_input": "Explain quantum computing in simple terms.", "output_length": 100}
{"text_input": "Write a Python function to calculate fibonacci numbers.", "output_length": 150}
{"text_input": "Summarize the main causes of World War II.", "output_length": 200}
{"text_input": "How do neural networks learn?", "output_length": 80}
EOF
# Run with exact text payloads
aiperf profile \
--model Qwen/Qwen3-0.6B \
--endpoint-type chat \
--endpoint /v1/chat/completions \
--streaming \
--url localhost:8000 \
--input-file production_queries.jsonl \
--custom-dataset-type mooncake_trace \
--concurrency 2 \
--warmup-request-count 1Sample Output (Successful Run):
INFO Starting AIPerf System
INFO Loaded 5 entries from production_queries.jsonl
INFO Using mooncake_trace dataset type
INFO AIPerf System is WARMING UP
Warming Up: 1/1 |████████████████████████| 100% [00:01<00:00]
INFO Warmup completed, starting profiling phase
INFO AIPerf System is PROFILING
Profiling: 5/5 |████████████████████████| 100% [00:12<00:00]
INFO Benchmark completed successfully
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-concurrency2/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms) │ 1234.56 │ 456.78 │ 2345.67 │ 2345.67 │ 1089.34 │
│ Time to First Token (ms) │ 45.67 │ 32.34 │ 67.89 │ 67.89 │ 43.12 │
│ Inter Token Latency (ms) │ 12.34 │ 9.87 │ 16.78 │ 16.78 │ 11.90 │
│ Output Token Count (tokens) │ 110.00 │ 20.00 │ 200.00 │ 200.00 │ 100.00 │
│ Request Throughput (req/s) │ 4.23 │ - │ - │ - │ - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘
JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency2/profile_export_aiperf.json
Key Points:
- Each line in the JSONL file becomes exactly one request
- Requests are sent in the order they appear in the file
- The
text_inputis sent exactly as specified
Perfect for:
- Regression testing (detecting performance changes)
- A/B testing different model configurations
- Debugging specific prompt performance
- Production workload replay
Not ideal for:
- Load testing with varied request patterns (use
random_poolinstead) - Scalability testing requiring many unique requests