Petronella Technology Group GPU Benchmark Suite

A comprehensive benchmarking toolkit for measuring AI inference performance across NVIDIA, AMD, and Apple Silicon GPUs. Built by Petronella Technology Group to help organizations make data-driven decisions about GPU infrastructure for large language model workloads.

What This Measures

| Benchmark | Key Metrics | Why It Matters |
|---|---|---|
| Inference | Tokens/sec, time-to-first-token, latency p50/p95/p99 | Core LLM serving performance |
| Memory | Bandwidth (GB/s), model load time, peak vs allocated VRAM | Right-sizing GPU memory for your models |
| Multi-GPU Scaling | Scaling efficiency %, tensor vs pipeline parallel throughput | Whether adding GPUs actually helps |
| Power Efficiency | Tokens per watt, average/peak draw | Operating cost and datacenter planning |
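
As a rough illustration of how the inference metrics above can be derived from raw timings, the sketch below summarizes a list of per-token latencies. It is not the suite's implementation: the `summarize_run` function is hypothetical, the output keys are borrowed from the JSON shown under Output Format, and counting only decode tokens toward throughput is an assumption.

```python
import statistics


def summarize_run(token_latencies_ms, ttft_ms):
    """Summarize one generation run from per-token decode latencies (ms)."""
    if len(token_latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(token_latencies_ms, n=100, method="inclusive")
    decode_seconds = sum(token_latencies_ms) / 1000.0
    return {
        "time_to_first_token_ms": ttft_ms,
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "latency_p99_ms": cuts[98],
        # Assumption: throughput counts decode tokens only, excluding the first token.
        "tokens_per_second": len(token_latencies_ms) / decode_seconds,
    }


# Synthetic latencies of 1..101 ms, used purely to exercise the function.
summary = summarize_run([float(ms) for ms in range(1, 102)], ttft_ms=100.0)
print(summary)
```

Tail percentiles such as p99 are only meaningful with enough samples, so short runs are better judged on p50 and p95.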

Quick Start

# Clone and install
git clone https://github.com/capetron/ptg-gpu-bench.git
cd ptg-gpu-bench
pip install -r requirements.txt

# Run all benchmarks with auto-detected GPU
bash scripts/run_all.sh

# Run a single inference benchmark
python -m bench.inference_bench --model meta-llama/Llama-3.1-7B

# Benchmark a 70B model across 4 GPUs with pipeline parallelism
python -m bench.inference_bench --model meta-llama/Llama-3.1-70B --gpu-count 4 --parallel-mode pipeline

# Test multi-GPU scaling from 1 to 8 GPUs
python -m bench.multi_gpu_bench --model meta-llama/Llama-3.1-13B --max-gpus 8

# Measure power efficiency
python -m bench.power_efficiency --model meta-llama/Llama-3.1-7B --duration 120

# Compare two result sets
python scripts/compare.py results/h100_sxm_run1.json results/rtx_6000_pro_run1.json

Supported Backends

  • PyTorch + CUDA -- NVIDIA GPUs (H100, H200, A100, RTX 6000 Pro, and more)
  • vLLM -- High-throughput serving engine for NVIDIA GPUs
  • MLX -- Apple Silicon unified memory (M4 Ultra, M4 Max, M3 Ultra)

The benchmark suite auto-detects your hardware and selects the appropriate backend. You can also force a specific backend with --backend cuda, --backend vllm, or --backend mlx.
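
One way the auto-detection described above could work is sketched below. This is not the suite's actual logic: the `detect_backend` function, the probing order, and the choice to prefer vLLM over plain PyTorch when both are installed are assumptions made for illustration.

```python
import importlib.util
import platform

VALID_BACKENDS = ("cuda", "vllm", "mlx")


def detect_backend(forced=None):
    """Return a backend name, or None if no supported backend is found."""
    if forced is not None:
        # Mirrors the effect of passing --backend explicitly.
        if forced not in VALID_BACKENDS:
            raise ValueError(f"unknown backend: {forced!r}")
        return forced

    # Apple Silicon reports Darwin/arm64; use MLX only if the package is installed.
    on_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    if on_apple_silicon and importlib.util.find_spec("mlx") is not None:
        return "mlx"

    # Import torch lazily so the function still works on machines without it.
    if importlib.util.find_spec("torch") is not None:
        import torch

        if torch.cuda.is_available():
            # Assumption: prefer vLLM when it is installed alongside PyTorch.
            has_vllm = importlib.util.find_spec("vllm") is not None
            return "vllm" if has_vllm else "cuda"

    return None


print(detect_backend(forced="mlx"))
```

Checking for a package with `importlib.util.find_spec` avoids importing heavy libraries just to find out whether they are present.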

GPU Configuration Files

Pre-built configurations for common GPU setups live in configs/. Each JSON file specifies the GPU name, memory, TDP, expected bandwidth, and recommended test parameters.

| Config | GPU | Memory | TDP | Use Case |
|---|---|---|---|---|
| rtx_6000_pro.json | RTX 6000 Pro Blackwell | 96 GB GDDR7 | 350W | Professional AI workstation |
| h100_sxm.json | H100 SXM5 | 80 GB HBM3 | 700W | Datacenter training and inference |
| h200_nvl.json | H200 NVL | 141 GB HBM3e | 700W | Large model inference |
| a100_80gb.json | A100 80GB | 80 GB HBM2e | 400W | Versatile datacenter GPU |
| apple_m4_ultra.json | M4 Ultra | 256 GB Unified | 75W | Power-efficient desktop AI |

Use a config to pre-fill GPU parameters:

python -m bench.inference_bench --config configs/h100_sxm.json --model meta-llama/Llama-3.1-70B
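
The config schema is not shown in this README, so the sketch below only illustrates the general idea of pre-filling parameters from a config and letting explicit command-line values win. Every key name (`gpu_name`, `memory_gb`, `tdp_watts`, `recommended`) and the `resolve_params` helper are hypothetical; the values are taken from the table above and from the sample output later in this document.

```python
import json

# Hypothetical config contents; the real files in configs/ may use different keys.
EXAMPLE_CONFIG = """
{
  "gpu_name": "H100 SXM5",
  "memory_gb": 80,
  "tdp_watts": 700,
  "recommended": {"batch_size": 1, "max_new_tokens": 512}
}
"""


def resolve_params(config, overrides):
    """Start from the config's recommended parameters, then apply explicit overrides."""
    params = dict(config.get("recommended", {}))
    # A value of None means the option was not given on the command line.
    params.update({key: value for key, value in overrides.items() if value is not None})
    return params


config = json.loads(EXAMPLE_CONFIG)
params = resolve_params(config, {"batch_size": 4, "max_new_tokens": None})
print(params)  # {'batch_size': 4, 'max_new_tokens': 512}
```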

Example Results

Inference Throughput (Llama 3.1 70B, batch size 1)

| GPU | Tokens/sec | TTFT (ms) | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|---|
| H200 NVL (x2) | 142.3 | 89 | 7.0 | 8.4 | 11.2 |
| H100 SXM5 (x2) | 118.7 | 112 | 8.4 | 10.1 | 13.8 |
| RTX 6000 Pro (x4) | 96.4 | 156 | 10.4 | 13.2 | 18.1 |
| A100 80GB (x2) | 78.2 | 198 | 12.8 | 15.6 | 21.3 |
| M4 Ultra (MLX) | 41.6 | 245 | 24.0 | 28.1 | 34.7 |

Multi-GPU Scaling Efficiency (Llama 3.1 70B)

| GPU | 1x | 2x | 4x | 8x |
|---|---|---|---|---|
| H100 SXM5 | 1.00x | 1.91x (96%) | 3.72x (93%) | 7.18x (90%) |
| RTX 6000 Pro | 1.00x | 1.84x (92%) | 3.48x (87%) | -- |
| A100 80GB | 1.00x | 1.87x (94%) | 3.58x (90%) | 6.82x (85%) |
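
The percentages in the scaling table are consistent with the usual definition of scaling efficiency: speedup over the single-GPU run divided by the number of GPUs. The sketch below applies that definition to the H100 SXM5 row. The function names are illustrative and are not taken from `multi_gpu_bench.py`.

```python
def speedup(throughput_n, throughput_1):
    """Throughput on N GPUs relative to throughput on one GPU."""
    return throughput_n / throughput_1


def scaling_efficiency(speedup_value, gpu_count):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup_value / gpu_count


# Speedups from the H100 SXM5 row of the table above.
h100_speedups = {2: 1.91, 4: 3.72, 8: 7.18}
h100_efficiency = {n: scaling_efficiency(s, n) for n, s in h100_speedups.items()}

for gpu_count, efficiency in h100_efficiency.items():
    print(f"{gpu_count} GPUs: {efficiency * 100:.2f}% of linear scaling")
```

An efficiency that drops as GPUs are added, as in every row of the table, typically reflects growing communication overhead between devices.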

Power Efficiency (Llama 3.1 7B, batch size 1)

| GPU | Tokens/sec | Avg Power (W) | Tokens/Watt |
|---|---|---|---|
| M4 Ultra (MLX) | 68.4 | 62 | 1.103 |
| H200 NVL | 312.8 | 485 | 0.645 |
| RTX 6000 Pro | 198.6 | 290 | 0.685 |
| H100 SXM5 | 287.4 | 520 | 0.553 |
| A100 80GB | 201.2 | 310 | 0.649 |

These are representative results. Your numbers will vary based on model quantization, batch size, system configuration, and thermal conditions. Run your own benchmarks for accurate comparisons.
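
The Tokens/Watt column in the power table matches tokens per second divided by average power draw, which is the same quantity as tokens per joule. The sketch below reproduces the H100 SXM5 figure from a list of power samples. The `power_summary` function and the three sample values are hypothetical, chosen so that they average to the 520 W in the table; how the suite actually samples power is not described here.

```python
def power_summary(tokens_generated, duration_s, power_samples_w):
    """Combine a token count with power samples taken during the same run."""
    if not power_samples_w or duration_s <= 0:
        raise ValueError("need a positive duration and at least one power sample")
    avg_power_w = sum(power_samples_w) / len(power_samples_w)
    tokens_per_second = tokens_generated / duration_s
    return {
        "tokens_per_second": tokens_per_second,
        "avg_power_w": avg_power_w,
        "peak_power_w": max(power_samples_w),
        # (tokens / s) / (J / s) = tokens per joule
        "tokens_per_watt": tokens_per_second / avg_power_w,
    }


# 28,740 tokens in 100 s is 287.4 tokens/sec; the samples average 520 W.
h100 = power_summary(28740, 100.0, [500.0, 520.0, 540.0])
print(round(h100["tokens_per_watt"], 3))  # 0.553
```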

Output Format

All benchmarks produce structured JSON results in the results/ directory:

{
  "benchmark": "inference",
  "timestamp": "2026-04-14T12:00:00Z",
  "gpu": "NVIDIA H100 SXM5",
  "gpu_count": 2,
  "model": "meta-llama/Llama-3.1-70B",
  "backend": "vllm",
  "metrics": {
    "tokens_per_second": 118.7,
    "time_to_first_token_ms": 112,
    "latency_p50_ms": 8.4,
    "latency_p95_ms": 10.1,
    "latency_p99_ms": 13.8,
    "peak_memory_gb": 74.2,
    "total_tokens_generated": 11870
  },
  "config": {
    "batch_size": 1,
    "max_new_tokens": 512,
    "parallel_mode": "tensor",
    "quantization": null
  }
}
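
Because every result shares this structure, two runs can be compared with a few lines of Python. The sketch below is an illustration, not the contents of `scripts/compare.py`: the `compare_metrics` helper is hypothetical, and the two dictionaries reuse the H100 SXM5 and A100 80GB figures from the throughput table above instead of reading files from results/.

```python
def compare_metrics(baseline, candidate):
    """Relative change of each shared metric, candidate versus baseline."""
    base, cand = baseline["metrics"], candidate["metrics"]
    changes = {}
    for key in base.keys() & cand.keys():
        if base[key]:  # skip metrics whose baseline is zero or missing
            changes[key] = (cand[key] - base[key]) / base[key]
    return changes


h100_result = {
    "gpu": "NVIDIA H100 SXM5",
    "metrics": {
        "tokens_per_second": 118.7,
        "time_to_first_token_ms": 112,
        "latency_p50_ms": 8.4,
    },
}
a100_result = {
    "gpu": "A100 80GB",
    "metrics": {
        "tokens_per_second": 78.2,
        "time_to_first_token_ms": 198,
        "latency_p50_ms": 12.8,
    },
}

changes = compare_metrics(h100_result, a100_result)
for name in sorted(changes):
    print(f"{name}: {changes[name]:+.1%}")
```

For results stored on disk, each dictionary would come from `json.load` on the corresponding file.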

Project Structure

ptg-gpu-bench/
├── bench/
│   ├── __init__.py               # Package init with version and utilities
│   ├── inference_bench.py        # LLM inference benchmark
│   ├── memory_bench.py           # GPU memory bandwidth test
│   ├── multi_gpu_bench.py        # Multi-GPU scaling benchmark
│   ├── power_efficiency.py       # Tokens per watt measurement
│   └── report.py                 # Generate comparison reports
├── configs/
│   ├── rtx_6000_pro.json         # RTX 6000 Pro Blackwell
│   ├── h100_sxm.json             # H100 SXM5
│   ├── h200_nvl.json             # H200 NVL
│   ├── a100_80gb.json            # A100 80GB
│   └── apple_m4_ultra.json       # Apple M4 Ultra (MLX)
├── results/
│   └── README.md                 # Results format documentation
├── scripts/
│   ├── run_all.sh                # Run full benchmark suite
│   └── compare.py                # Compare two result sets
├── requirements.txt
├── LICENSE
└── README.md

Hardware Guides from Petronella Technology Group

We publish in-depth analysis of GPU hardware for AI workloads. These benchmarks complement our hardware advisory practice.

Who We Are

Petronella Technology Group is a cybersecurity, compliance, and AI infrastructure firm based in Raleigh, North Carolina. We help organizations select, deploy, and optimize GPU infrastructure for AI workloads -- from single-workstation development environments to multi-node DGX clusters.

Our team holds CMMC-RP, CCNA, CWNE, and DFE certifications. We specialize in:

  • AI Infrastructure -- GPU selection, cluster design, vLLM deployment, MLX optimization
  • Hardware Advisory -- NVIDIA DGX, HGX, RTX workstations, Apple Silicon systems
  • Performance Engineering -- Benchmarking, profiling, and tuning inference pipelines
  • Cybersecurity and Compliance -- CMMC, HIPAA, SOC 2 for organizations running AI workloads

Visit petronellatech.com or call (919) 348-4912 to discuss your AI infrastructure needs.

Contributing

We welcome contributions -- especially benchmark results from hardware we have not tested. Please open an issue or pull request with your results and hardware details.

License

MIT License. See LICENSE for details.
