Petronella Technology Group GPU Benchmark Suite

A benchmarking toolkit for measuring AI inference performance on NVIDIA and Apple Silicon GPUs. Petronella Technology Group built it to help organizations make data-driven decisions about GPU infrastructure for large language model (LLM) workloads.

What This Measures

Benchmark	Key Metrics	Why It Matters
Inference	Tokens/sec, time-to-first-token, latency p50/p95/p99	Core LLM serving performance
Memory	Bandwidth (GB/s), model load time, peak vs allocated VRAM	Right-sizing GPU memory for your models
Multi-GPU Scaling	Scaling efficiency %, tensor vs pipeline parallel throughput	Whether adding GPUs actually helps
Power Efficiency	Tokens per watt, average/peak draw	Operating cost and datacenter planning
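
To make the inference metrics concrete, here is a minimal sketch of how tokens/sec and the latency percentiles could be derived from per-token timings. The function name and the input shape (a list of decode latencies in milliseconds plus a separate time-to-first-token) are assumptions for illustration, not the suite's actual internals.

```python
import statistics

def summarize_latencies(token_latencies_ms, ttft_ms):
    """Summarize one generation run (hypothetical helper, not part of bench/)."""
    if len(token_latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(token_latencies_ms, n=100, method="inclusive")
    # The first token arrives at TTFT; each latency sample adds one more token.
    total_tokens = len(token_latencies_ms) + 1
    total_seconds = (ttft_ms + sum(token_latencies_ms)) / 1000.0
    return {
        "tokens_per_second": total_tokens / total_seconds,
        "time_to_first_token_ms": ttft_ms,
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "latency_p99_ms": cuts[98],
    }
```

The p95 and p99 values show tail behavior that an average hides, which is why they are reported alongside tokens/sec.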

Quick Start

# Clone and install
git clone https://github.com/capetron/ptg-gpu-bench.git
cd ptg-gpu-bench
pip install -r requirements.txt

# Run all benchmarks with auto-detected GPU
bash scripts/run_all.sh

# Run a single inference benchmark
python -m bench.inference_bench --model meta-llama/Llama-3.1-8B

# Benchmark a 70B model across 4 GPUs with pipeline parallelism
python -m bench.inference_bench --model meta-llama/Llama-3.1-70B --gpu-count 4 --parallel-mode pipeline

# Test multi-GPU scaling from 1 to 8 GPUs
python -m bench.multi_gpu_bench --model meta-llama/Llama-3.1-13B --max-gpus 8

# Measure power efficiency
python -m bench.power_efficiency --model meta-llama/Llama-3.1-8B --duration 120

# Compare two result sets
python scripts/compare.py results/h100_sxm_run1.json results/rtx_6000_pro_run1.json

Supported Backends

PyTorch + CUDA -- NVIDIA GPUs (H100, H200, A100, RTX 6000 Pro, and more)
vLLM -- High-throughput serving engine for NVIDIA GPUs
MLX -- Apple Silicon unified memory (M4 Ultra, M4 Max, M3 Ultra)

The benchmark suite auto-detects your hardware and selects the appropriate backend. You can also force a specific backend with --backend cuda, --backend vllm, or --backend mlx.
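
The detection logic can be pictured roughly as follows. This is a sketch under stated assumptions, not the suite's real implementation: the function names, the order of checks, and the choice never to auto-select vLLM are all hypothetical. The only library calls used are `platform.system()`, `platform.machine()`, and `torch.cuda.is_available()`.

```python
import importlib.util
import platform

VALID_BACKENDS = ("cuda", "vllm", "mlx")

def _cuda_available():
    """True if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

def pick_backend(forced=None, system=None, machine=None, cuda_available=None):
    """Pick a backend name; the keyword arguments exist so the logic can be tested."""
    if forced is not None:
        if forced not in VALID_BACKENDS:
            raise ValueError(f"unknown backend: {forced!r}")
        return forced
    system = system or platform.system()
    machine = machine or platform.machine()
    # Apple Silicon Macs report Darwin / arm64.
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    if cuda_available is None:
        cuda_available = _cuda_available()
    if cuda_available:
        return "cuda"
    raise RuntimeError("no supported GPU backend detected")
```

In this sketch vLLM is only used when forced, which mirrors passing --backend vllm on the command line.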

GPU Configuration Files

Pre-built configurations for common GPU setups live in configs/. Each JSON file specifies the GPU name, memory, TDP, expected bandwidth, and recommended test parameters.

Config	GPU	Memory	TDP	Use Case
`rtx_6000_pro.json`	RTX 6000 Pro Blackwell	96 GB GDDR7	350W	Professional AI workstation
`h100_sxm.json`	H100 SXM5	80 GB HBM3	700W	Datacenter training and inference
`h200_nvl.json`	H200 NVL	141 GB HBM3e	700W	Large model inference
`a100_80gb.json`	A100 80GB	80 GB HBM2e	400W	Versatile datacenter GPU
`apple_m4_ultra.json`	M4 Ultra	256 GB Unified	75W	Power-efficient desktop AI

Use a config to pre-fill GPU parameters:

python -m bench.inference_bench --config configs/h100_sxm.json --model meta-llama/Llama-3.1-70B
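
One practical use of a config is a quick memory sanity check before a run. The sketch below assumes hypothetical key names (`name`, `memory_gb`, `tdp_watts`); check the files in configs/ for the real schema. It only counts model weights and ignores KV cache and activations, so treat a "fits" result as a lower bound on what you need.

```python
import json
from dataclasses import dataclass

REQUIRED_KEYS = ("name", "memory_gb", "tdp_watts")  # assumed key names

@dataclass(frozen=True)
class GpuConfig:
    name: str
    memory_gb: float
    tdp_watts: float

def parse_gpu_config(text):
    """Parse a GPU config from JSON text, failing loudly on missing keys."""
    raw = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return GpuConfig(raw["name"], float(raw["memory_gb"]), float(raw["tdp_watts"]))

def weights_fit(config, params_billion, bytes_per_param=2.0, gpu_count=1):
    """Rough check: do the weights alone fit in total GPU memory?"""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights_gb = params_billion * bytes_per_param
    return weights_gb <= config.memory_gb * gpu_count

h100 = parse_gpu_config('{"name": "H100 SXM5", "memory_gb": 80, "tdp_watts": 700}')
print(weights_fit(h100, params_billion=70))               # False: 140 GB > 80 GB
print(weights_fit(h100, params_billion=70, gpu_count=2))  # True: 140 GB <= 160 GB
```

This is why a 70B model at 16-bit precision needs at least two 80 GB GPUs.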

Example Results

Inference Throughput (Llama 3.1 70B, batch size 1)

GPU	Tokens/sec	TTFT (ms)	p50 (ms)	p95 (ms)	p99 (ms)
H200 NVL (x2)	142.3	89	7.0	8.4	11.2
H100 SXM5 (x2)	118.7	112	8.4	10.1	13.8
RTX 6000 Pro (x4)	96.4	156	10.4	13.2	18.1
A100 80GB (x2)	78.2	198	12.8	15.6	21.3
M4 Ultra (MLX)	41.6	245	24.0	28.1	34.7
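
To translate these figures into what a user experiences, a rough estimate of single-request generation time is TTFT plus one decode step per remaining token. The sketch below assumes the p50 latency is a fair stand-in for the average per-token latency and uses 512 new tokens as an example length; the helper name is hypothetical.

```python
def estimate_generation_ms(ttft_ms, per_token_ms, new_tokens):
    """Estimate wall-clock time to generate `new_tokens` tokens for one request."""
    if new_tokens < 1:
        raise ValueError("new_tokens must be at least 1")
    # The first token costs TTFT; each later token costs one decode step.
    return ttft_ms + (new_tokens - 1) * per_token_ms

h100_ms = estimate_generation_ms(ttft_ms=112, per_token_ms=8.4, new_tokens=512)
m4_ms = estimate_generation_ms(ttft_ms=245, per_token_ms=24.0, new_tokens=512)
print(f"H100 SXM5 (x2): {h100_ms / 1000:.1f} s")  # 4.4 s
print(f"M4 Ultra (MLX): {m4_ms / 1000:.1f} s")    # 12.5 s
```

For long generations the per-token latency dominates, so TTFT matters most for short, interactive responses.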

Multi-GPU Scaling Efficiency (Llama 3.1 70B)

GPU	1x	2x	4x	8x
H100 SXM5	1.00x	1.91x (96%)	3.72x (93%)	7.18x (90%)
RTX 6000 Pro	1.00x	1.84x (92%)	3.48x (87%)	--
A100 80GB	1.00x	1.87x (94%)	3.58x (90%)	6.82x (85%)
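
The percentages in this table are speedup divided by GPU count. A short sketch, using the H100 SXM5 row as input (the function name is illustrative):

```python
def scaling_efficiency(speedup, gpu_count):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return speedup / gpu_count

# Speedups relative to a single GPU, taken from the H100 SXM5 row above.
H100_SPEEDUP = {1: 1.00, 2: 1.91, 4: 3.72, 8: 7.18}

for gpus, speedup in H100_SPEEDUP.items():
    efficiency = scaling_efficiency(speedup, gpus)
    print(f"{gpus}x GPUs: speedup {speedup:.2f}, efficiency {efficiency * 100:.2f}%")
```

Efficiency that falls as GPUs are added reflects growing communication overhead, so the marginal value of each extra GPU drops.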

Power Efficiency (Llama 3.1 8B, batch size 1)

GPU	Tokens/sec	Avg Power (W)	Tokens/Watt
M4 Ultra (MLX)	68.4	62	1.103
H200 NVL	312.8	485	0.645
RTX 6000 Pro	198.6	290	0.685
H100 SXM5	287.4	520	0.553
A100 80GB	201.2	310	0.649

These are representative results. Your numbers will vary based on model quantization, batch size, system configuration, and thermal conditions. Run your own benchmarks for accurate comparisons.
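
The Tokens/Watt column is tokens/sec divided by average power, which is the same as tokens per joule. The sketch below recomputes it from the table and adds an energy-per-million-tokens figure for cost planning; the function names and the kWh conversion are additions for illustration, not output of the suite.

```python
def tokens_per_watt(tokens_per_second, avg_power_w):
    """Throughput per watt of average draw (equivalently, tokens per joule)."""
    return tokens_per_second / avg_power_w

def kwh_per_million_tokens(tokens_per_second, avg_power_w):
    """Energy needed to generate one million tokens at steady state."""
    seconds = 1_000_000 / tokens_per_second
    joules = seconds * avg_power_w
    return joules / 3.6e6  # 1 kWh = 3.6e6 J

# (tokens/sec, average watts) copied from the table above.
RESULTS = {
    "M4 Ultra (MLX)": (68.4, 62),
    "H200 NVL": (312.8, 485),
    "RTX 6000 Pro": (198.6, 290),
    "H100 SXM5": (287.4, 520),
    "A100 80GB": (201.2, 310),
}

ranked = sorted(RESULTS, key=lambda name: tokens_per_watt(*RESULTS[name]), reverse=True)
for name in ranked:
    tps, watts = RESULTS[name]
    print(f"{name}: {tokens_per_watt(tps, watts):.3f} tokens/W, "
          f"{kwh_per_million_tokens(tps, watts):.3f} kWh per 1M tokens")
```

Multiply the kWh figure by your electricity rate to compare operating cost per million tokens. Note that the most efficient GPU is not the fastest one.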

Output Format

All benchmarks produce structured JSON results in the results/ directory:

{
  "benchmark": "inference",
  "timestamp": "2026-04-14T12:00:00Z",
  "gpu": "NVIDIA H100 SXM5",
  "gpu_count": 2,
  "model": "meta-llama/Llama-3.1-70B",
  "backend": "vllm",
  "metrics": {
    "tokens_per_second": 118.7,
    "time_to_first_token_ms": 112,
    "latency_p50_ms": 8.4,
    "latency_p95_ms": 10.1,
    "latency_p99_ms": 13.8,
    "peak_memory_gb": 74.2,
    "total_tokens_generated": 11870
  },
  "config": {
    "batch_size": 1,
    "max_new_tokens": 512,
    "parallel_mode": "tensor",
    "quantization": null
  }
}
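
Because results are plain JSON, they are easy to post-process. The following is a minimal sketch of comparing two result documents by percent change per metric. It is not the code in scripts/compare.py; the function name is hypothetical and it assumes only the `metrics` layout shown above.

```python
import json

def compare_metrics(baseline, candidate):
    """Percent change of each shared numeric metric, candidate vs baseline."""
    deltas = {}
    for key, base in baseline["metrics"].items():
        cand = candidate["metrics"].get(key)
        if not isinstance(base, (int, float)) or not isinstance(cand, (int, float)):
            continue  # skip missing or non-numeric metrics
        if base == 0:
            continue  # percent change is undefined for a zero baseline
        deltas[key] = (cand - base) / base * 100.0
    return deltas

h100 = json.loads('{"metrics": {"tokens_per_second": 118.7, "latency_p50_ms": 8.4}}')
h200 = json.loads('{"metrics": {"tokens_per_second": 142.3, "latency_p50_ms": 7.0}}')

for metric, pct in compare_metrics(h100, h200).items():
    print(f"{metric}: {pct:+.1f}%")
```

A positive change is an improvement for tokens_per_second and a regression for the latency metrics, so interpret the sign per metric.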

Project Structure

ptg-gpu-bench/
├── bench/
│   ├── __init__.py               # Package init with version and utilities
│   ├── inference_bench.py        # LLM inference benchmark
│   ├── memory_bench.py           # GPU memory bandwidth test
│   ├── multi_gpu_bench.py        # Multi-GPU scaling benchmark
│   ├── power_efficiency.py       # Tokens per watt measurement
│   └── report.py                 # Generate comparison reports
├── configs/
│   ├── rtx_6000_pro.json         # RTX 6000 Pro Blackwell
│   ├── h100_sxm.json             # H100 SXM5
│   ├── h200_nvl.json             # H200 NVL
│   ├── a100_80gb.json            # A100 80GB
│   └── apple_m4_ultra.json       # Apple M4 Ultra (MLX)
├── results/
│   └── README.md                 # Results format documentation
├── scripts/
│   ├── run_all.sh                # Run full benchmark suite
│   └── compare.py                # Compare two result sets
├── requirements.txt
├── LICENSE
└── README.md

Hardware Guides from Petronella Technology Group

We publish in-depth analysis of GPU hardware for AI workloads. These benchmarks complement our hardware advisory practice.

AI Development Systems -- Choosing the right GPU configuration for your AI workload
NVIDIA DGX Systems -- Enterprise-grade AI infrastructure
DGX Station GB300 Power Efficiency -- 1.6 kW power budget analysis
NVIDIA SXM Total Cost of Ownership -- SXM vs PCIe TCO breakdown
Apple MLX for AI Development -- When unified memory makes sense
RTX 6000 Pro Blackwell Multi-GPU vLLM -- Scaling vLLM across RTX 6000 Pro GPUs
AI Services -- Full AI consulting and implementation services

Who We Are

Petronella Technology Group is a cybersecurity, compliance, and AI infrastructure firm based in Raleigh, North Carolina. We help organizations select, deploy, and optimize GPU infrastructure for AI workloads -- from single-workstation development environments to multi-node DGX clusters.

Our team holds CMMC-RP, CCNA, CWNE, and DFE certifications. We specialize in:

AI Infrastructure -- GPU selection, cluster design, vLLM deployment, MLX optimization
Hardware Advisory -- NVIDIA DGX, HGX, RTX workstations, Apple Silicon systems
Performance Engineering -- Benchmarking, profiling, and tuning inference pipelines
Cybersecurity and Compliance -- CMMC, HIPAA, SOC 2 for organizations running AI workloads

Visit petronellatech.com or call (919) 348-4912 to discuss your AI infrastructure needs.

Contributing

We welcome contributions -- especially benchmark results from hardware we have not tested. Please open an issue or pull request with your results and hardware details.

License

MIT License. See LICENSE for details.
