A comprehensive benchmarking toolkit for measuring AI inference performance across NVIDIA, AMD, and Apple Silicon GPUs. Built by Petronella Technology Group to help organizations make data-driven decisions about GPU infrastructure for large language model workloads.
| Benchmark | Key Metrics | Why It Matters |
|---|---|---|
| Inference | Tokens/sec, time-to-first-token, latency p50/p95/p99 | Core LLM serving performance |
| Memory | Bandwidth (GB/s), model load time, peak vs allocated VRAM | Right-sizing GPU memory for your models |
| Multi-GPU Scaling | Scaling efficiency %, tensor vs pipeline parallel throughput | Whether adding GPUs actually helps |
| Power Efficiency | Tokens per watt, average/peak draw | Operating cost and datacenter planning |
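The inference metrics in the table above can be derived from raw timing samples. The following is a minimal sketch, not the toolkit's actual implementation: the function name `summarize_latencies` and its inputs are hypothetical, and it assumes you have already collected one latency sample per generated token.

```python
import statistics


def summarize_latencies(token_latencies_ms, ttft_ms):
    """Summarize one generation run (hypothetical helper, not part of the repo).

    token_latencies_ms: per-token decode latencies in milliseconds.
    ttft_ms: time from request start to the first token, in milliseconds.
    """
    if len(token_latencies_ms) < 2:
        raise ValueError("need at least two latency samples")

    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(token_latencies_ms, n=100, method="inclusive")
    total_seconds = sum(token_latencies_ms) / 1000.0

    return {
        "tokens_per_second": len(token_latencies_ms) / total_seconds,
        "time_to_first_token_ms": ttft_ms,
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "latency_p99_ms": cuts[98],
    }


if __name__ == "__main__":
    # Synthetic samples for illustration only: 95 fast tokens and 5 slow ones.
    samples = [8.0] * 95 + [14.0] * 5
    print(summarize_latencies(samples, ttft_ms=110.0))
```

Reporting p95 and p99 alongside the median matters because a small number of slow tokens can dominate perceived responsiveness while barely moving the average.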
# Clone and install
git clone https://github.com/capetron/ptg-gpu-bench.git
cd ptg-gpu-bench
pip install -r requirements.txt
# Run all benchmarks with auto-detected GPU
bash scripts/run_all.sh
# Run a single inference benchmark
python -m bench.inference_bench --model meta-llama/Llama-3.1-8B
# Benchmark a 70B model across 4 GPUs with pipeline parallelism
python -m bench.inference_bench --model meta-llama/Llama-3.1-70B --gpu-count 4 --parallel-mode pipeline
# Test multi-GPU scaling from 1 to 8 GPUs
python -m bench.multi_gpu_bench --model meta-llama/Llama-3.1-8B --max-gpus 8
# Measure power efficiency
python -m bench.power_efficiency --model meta-llama/Llama-3.1-8B --duration 120
# Compare two result sets
python scripts/compare.py results/h100_sxm_run1.json results/rtx_6000_pro_run1.json

- PyTorch + CUDA -- NVIDIA GPUs (H100, H200, A100, RTX 6000 Pro, and more)
- vLLM -- High-throughput serving engine for NVIDIA GPUs
- MLX -- Apple Silicon unified memory (M4 Ultra, M4 Max, M3 Ultra)
The benchmark suite auto-detects your hardware and selects the appropriate backend. You can also force a specific backend with --backend cuda, --backend vllm, or --backend mlx.
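One way such auto-detection could work is sketched below. This is an illustration under stated assumptions, not the suite's actual logic: the function `detect_backend`, its parameters, and the precedence rules (MLX on Apple Silicon, then vLLM over plain PyTorch when CUDA is available) are all hypothetical. The module and CUDA checks are injectable so the selection logic can be exercised on any machine.

```python
import importlib.util
import platform

VALID_BACKENDS = ("cuda", "vllm", "mlx")


def _module_installed(name):
    """True if a top-level module can be imported (without importing it)."""
    return importlib.util.find_spec(name) is not None


def _cuda_available():
    """True if PyTorch is installed and reports a usable CUDA device."""
    if not _module_installed("torch"):
        return False
    import torch
    return torch.cuda.is_available()


def detect_backend(requested=None, has_module=_module_installed,
                   cuda_available=_cuda_available, system=None, machine=None):
    """Pick a backend name; the precedence here is an assumption."""
    if requested is not None:
        # Mirrors forcing a backend with --backend.
        if requested not in VALID_BACKENDS:
            raise ValueError(f"unknown backend: {requested!r}")
        return requested

    system = system or platform.system()
    machine = machine or platform.machine()

    # Apple Silicon reports Darwin/arm64.
    if system == "Darwin" and machine == "arm64" and has_module("mlx"):
        return "mlx"
    if cuda_available():
        return "vllm" if has_module("vllm") else "cuda"
    raise RuntimeError("no supported backend found")


if __name__ == "__main__":
    try:
        print(detect_backend())
    except RuntimeError as exc:
        print(f"auto-detection failed: {exc}")
```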
Pre-built configurations for common GPU setups live in configs/. Each JSON file specifies the GPU name, memory, TDP, expected bandwidth, and recommended test parameters.
| Config | GPU | Memory | TDP | Use Case |
|---|---|---|---|---|
| rtx_6000_pro.json | RTX 6000 Pro Blackwell | 96 GB GDDR7 | 350W | Professional AI workstation |
| h100_sxm.json | H100 SXM5 | 80 GB HBM3 | 700W | Datacenter training and inference |
| h200_nvl.json | H200 NVL | 141 GB HBM3e | 700W | Large model inference |
| a100_80gb.json | A100 80GB | 80 GB HBM2e | 400W | Versatile datacenter GPU |
| apple_m4_ultra.json | M4 Ultra | 256 GB Unified | 75W | Power-efficient desktop AI |
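To illustrate how a config of this kind might be consumed, here is a minimal sketch that parses and validates one. The key names (`gpu_name`, `memory_gb`, `tdp_watts`, `expected_bandwidth_gbps`, `test_params`) are assumptions made for this example; check the actual files in configs/ for the real schema. The sample values come from the H100 SXM5 row above, and the bandwidth is left unset because the table does not list it.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GpuConfig:
    """Hypothetical in-memory form of a configs/*.json file."""
    gpu_name: str
    memory_gb: float
    tdp_watts: float
    expected_bandwidth_gbps: Optional[float] = None
    test_params: dict = field(default_factory=dict)


def parse_gpu_config(text: str) -> GpuConfig:
    """Parse a JSON config string, failing early on missing required keys."""
    raw = json.loads(text)
    missing = [k for k in ("gpu_name", "memory_gb", "tdp_watts") if k not in raw]
    if missing:
        raise KeyError(f"config is missing required keys: {missing}")
    return GpuConfig(
        gpu_name=raw["gpu_name"],
        memory_gb=float(raw["memory_gb"]),
        tdp_watts=float(raw["tdp_watts"]),
        expected_bandwidth_gbps=raw.get("expected_bandwidth_gbps"),
        test_params=dict(raw.get("test_params", {})),
    )


# Illustrative config text; key names are assumed, values are from the table.
SAMPLE = """
{
  "gpu_name": "H100 SXM5",
  "memory_gb": 80,
  "tdp_watts": 700,
  "expected_bandwidth_gbps": null,
  "test_params": {"batch_size": 1, "max_new_tokens": 512}
}
"""

if __name__ == "__main__":
    cfg = parse_gpu_config(SAMPLE)
    print(cfg.gpu_name, cfg.memory_gb, cfg.tdp_watts, cfg.test_params)
```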
Use a config to pre-fill GPU parameters:
python -m bench.inference_bench --config configs/h100_sxm.json --model meta-llama/Llama-3.1-70B

Inference throughput and latency:

| GPU | Tokens/sec | TTFT (ms) | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|---|
| H200 NVL (x2) | 142.3 | 89 | 7.0 | 8.4 | 11.2 |
| H100 SXM5 (x2) | 118.7 | 112 | 8.4 | 10.1 | 13.8 |
| RTX 6000 Pro (x4) | 96.4 | 156 | 10.4 | 13.2 | 18.1 |
| A100 80GB (x2) | 78.2 | 198 | 12.8 | 15.6 | 21.3 |
| M4 Ultra (MLX) | 41.6 | 245 | 24.0 | 28.1 | 34.7 |

Multi-GPU scaling, shown as speedup over a single GPU with scaling efficiency in parentheses:

| GPU | 1x | 2x | 4x | 8x |
|---|---|---|---|---|
| H100 SXM5 | 1.00x | 1.91x (96%) | 3.72x (93%) | 7.18x (90%) |
| RTX 6000 Pro | 1.00x | 1.84x (92%) | 3.48x (87%) | -- |
| A100 80GB | 1.00x | 1.87x (94%) | 3.58x (90%) | 6.82x (85%) |

Power efficiency, where tokens/watt is tokens/sec divided by average power draw:

| GPU | Tokens/sec | Avg Power (W) | Tokens/Watt |
|---|---|---|---|
| M4 Ultra (MLX) | 68.4 | 62 | 1.103 |
| H200 NVL | 312.8 | 485 | 0.645 |
| RTX 6000 Pro | 198.6 | 290 | 0.685 |
| H100 SXM5 | 287.4 | 520 | 0.553 |
| A100 80GB | 201.2 | 310 | 0.649 |
These are representative results. Your numbers will vary based on model quantization, batch size, system configuration, and thermal conditions. Run your own benchmarks for accurate comparisons.
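The derived columns in the scaling and power tables follow from simple ratios, which is useful when you want to compute the same figures from your own runs. The sketch below assumes scaling efficiency is speedup divided by GPU count and tokens per watt is tokens/sec divided by average power, which matches the values shown above. The function names are illustrative.

```python
def scaling_efficiency(speedup, gpu_count):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return speedup / gpu_count


def tokens_per_watt(tokens_per_second, avg_power_watts):
    """Throughput per watt of average draw (equivalently, tokens per joule)."""
    if avg_power_watts <= 0:
        raise ValueError("avg_power_watts must be positive")
    return tokens_per_second / avg_power_watts


if __name__ == "__main__":
    # H100 SXM5 at 4 GPUs: the table lists 3.72x (93%).
    print(f"{scaling_efficiency(3.72, 4):.0%}")
    # H100 SXM5 power row: the table lists 0.553 tokens/watt.
    print(f"{tokens_per_watt(287.4, 520):.3f}")
```

Because tokens/sec divided by watts is tokens per joule, this figure can be converted directly into energy cost per million tokens once you know your electricity price.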
All benchmarks produce structured JSON results in the results/ directory:
{
  "benchmark": "inference",
  "timestamp": "2026-04-14T12:00:00Z",
  "gpu": "NVIDIA H100 SXM5",
  "gpu_count": 2,
  "model": "meta-llama/Llama-3.1-70B",
  "backend": "vllm",
  "metrics": {
    "tokens_per_second": 118.7,
    "time_to_first_token_ms": 112,
    "latency_p50_ms": 8.4,
    "latency_p95_ms": 10.1,
    "latency_p99_ms": 13.8,
    "peak_memory_gb": 74.2,
    "total_tokens_generated": 11870
  },
  "config": {
    "batch_size": 1,
    "max_new_tokens": 512,
    "parallel_mode": "tensor",
    "quantization": null
  }
}

ptg-gpu-bench/
├── bench/
│ ├── __init__.py # Package init with version and utilities
│ ├── inference_bench.py # LLM inference benchmark
│ ├── memory_bench.py # GPU memory bandwidth test
│ ├── multi_gpu_bench.py # Multi-GPU scaling benchmark
│ ├── power_efficiency.py # Tokens per watt measurement
│ └── report.py # Generate comparison reports
├── configs/
│ ├── rtx_6000_pro.json # RTX 6000 Pro Blackwell
│ ├── h100_sxm.json # H100 SXM5
│ ├── h200_nvl.json # H200 NVL
│ ├── a100_80gb.json # A100 80GB
│ └── apple_m4_ultra.json # Apple M4 Ultra (MLX)
├── results/
│ └── README.md # Results format documentation
├── scripts/
│ ├── run_all.sh # Run full benchmark suite
│ └── compare.py # Compare two result sets
├── requirements.txt
├── LICENSE
└── README.md
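The tree above lists scripts/compare.py for comparing two result sets. As a rough idea of what such a comparison involves, here is a minimal sketch that works on the JSON result format shown earlier. It is not the repository's script: `load_result` and `compare_results` are hypothetical names, and reporting percent change per metric is an assumed design. The sample metrics are taken from the H100 SXM5 and RTX 6000 Pro rows of the inference table.

```python
import json


def load_result(path):
    """Load one result file from the results/ directory."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def compare_results(baseline, candidate):
    """Return {metric: percent change from baseline} for shared numeric metrics."""
    changes = {}
    for name, base_value in baseline["metrics"].items():
        cand_value = candidate["metrics"].get(name)
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (base_value, cand_value)
        )
        # Skip metrics missing from either side and avoid dividing by zero.
        if not both_numeric or base_value == 0:
            continue
        changes[name] = (cand_value - base_value) / base_value * 100.0
    return changes


if __name__ == "__main__":
    h100 = {"metrics": {"tokens_per_second": 118.7, "time_to_first_token_ms": 112,
                        "latency_p50_ms": 8.4, "latency_p95_ms": 10.1,
                        "latency_p99_ms": 13.8}}
    rtx = {"metrics": {"tokens_per_second": 96.4, "time_to_first_token_ms": 156,
                       "latency_p50_ms": 10.4, "latency_p95_ms": 13.2,
                       "latency_p99_ms": 18.1}}
    for metric, pct in compare_results(h100, rtx).items():
        print(f"{metric}: {pct:+.1f}%")
```

Note that the sign means different things per metric: a negative change is a regression for tokens_per_second but an improvement for the latency fields.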
We publish in-depth analysis of GPU hardware for AI workloads. These benchmarks complement our hardware advisory practice.
- AI Development Systems -- Choosing the right GPU configuration for your AI workload
- NVIDIA DGX Systems -- Enterprise-grade AI infrastructure
- DGX Station GB300 Power Efficiency -- 1.6 kW power budget analysis
- NVIDIA SXM Total Cost of Ownership -- SXM vs PCIe TCO breakdown
- Apple MLX for AI Development -- When unified memory makes sense
- RTX 6000 Pro Blackwell Multi-GPU vLLM -- Scaling vLLM across RTX 6000 Pro GPUs
- AI Services -- Full AI consulting and implementation services
Petronella Technology Group is a cybersecurity, compliance, and AI infrastructure firm based in Raleigh, North Carolina. We help organizations select, deploy, and optimize GPU infrastructure for AI workloads -- from single-workstation development environments to multi-node DGX clusters.
Our team holds CMMC-RP, CCNA, CWNE, and DFE certifications. We specialize in:
- AI Infrastructure -- GPU selection, cluster design, vLLM deployment, MLX optimization
- Hardware Advisory -- NVIDIA DGX, HGX, RTX workstations, Apple Silicon systems
- Performance Engineering -- Benchmarking, profiling, and tuning inference pipelines
- Cybersecurity and Compliance -- CMMC, HIPAA, SOC 2 for organizations running AI workloads
Visit petronellatech.com or call (919) 348-4912 to discuss your AI infrastructure needs.
We welcome contributions -- especially benchmark results from hardware we have not tested. Please open an issue or pull request with your results and hardware details.
MIT License. See LICENSE for details.