LLM Gateway Benchmark Rig — full runbook

A start-to-finish guide for someone who has never used this before. It walks you through: build the cloud environment → install the software → run a stress test → read the results → check the audit data → shut things down to save money. Copy-paste the commands; every prompt is explained.

The rig benchmarks several LLM gateways head-to-head: each gateway runs on its own EC2 box, all forwarding to one shared mock upstream, with a dedicated load-generator box. You bring it up on demand and tear it down when done.

Just want to run it from the in-VPC control box? Use CONTROL-BOX-RUNBOOK.md — a strictly linear, copy-paste sequence (no decisions). This document is the full reference behind it.

Quick answers — every task is one command

Two blocks. Each row is a task → the one script that does it. Details for each are in the numbered sections below.

Block 1 — Environment lifecycle (build / stop / destroy)

I want to…	run this	what it does
Build the whole environment (boxes and software, one shot)	`PROVISION=1 ./deploy.sh`	interactive bring-up (pick account/region/key/VPC/SG) then auto-runs Ansible. (Or `./deploy.sh` then `cd ansible && ansible-playbook -i inventory.ini site.yml`.) → §1–3
Stop one or more machines (save cost)	`scripts/boxes.sh`	menu: pick account/region → list boxes → select one/many → stop. (Scriptable: `scripts/box.sh stop nexus bifrost`.) → §8
Start stopped machines	`scripts/box.sh start nexus`	then `scripts/gen-inventory.sh gw-bench ~/.ssh/<key>.pem us-east-1` (public IPs changed) → §8
Destroy the entire environment	`./down.sh`	deletes the whole CloudFormation stack (all boxes) → §8

Block 2 — Stress testing

I want to…	run this	what it does
Test one gateway, all tiers	`GATEWAY=bifrost scripts/bench/run-tiers.sh`	a full cycle per prompt-size tier (128/550/12.5k × non-SSE/SSE) → one `report.md` each → §5
Test every deployed gateway	`scripts/bench/run-all.sh`	run-tiers for each gateway found in the inventory → §5
Run the whole campaign hands-off (chosen gateways × N rounds)	`GATEWAYS="bifrost litellm kong portkey tensorzero" nohup scripts/bench/run-campaign.sh > ~/campaign.log 2>&1 &`	per gw: up→provision→run N rounds→purge jsonl→archive to `v<r>/<gw>/`→down, one box at a time; resumable; heartbeat → §5
Run a stress test (one full cycle)	`scripts/bench/bench.sh`	clean → setup → restart → health → cooldown → run → verify-audit → report → §5
Test a specific gateway	`GATEWAY=nexus scripts/bench/run-tiers.sh`	any of nexus/bifrost/litellm/kong/portkey/tensorzero → §5
Pick the prompt size	`PROFILE=nonstream-550 scripts/bench/bench.sh`	6 tiers — `nonstream-128`/`stream-128`, `nonstream-550`/`stream-550`, `nonstream`/`stream`; run them all, size decides the bottleneck → §5
Set the concurrency / load	`STAGES="200:60s" scripts/bench/bench.sh`	closed `N:dur` · open `@rate:dur` · ramp `@from-to:dur`; ladder is tier-aware → §5
Read the report	`cat results/<run>/report.md` · A/B diff: `COMPARE="off on" scripts/bench/report.sh`	per-stage RPS/ok%/p99/TTFT + validity gate → §6
Check the targets are healthy	`scripts/bench/health.sh`	every selected gateway must route to the mock (`chatcmpl-mock`) → §5
Check how much audit data was lost	`scripts/bench/verify-audit.sh`	per gateway: captured rows vs requests sent → LOSSLESS / INCOMPLETE N% → §7
Nexus: hooks ON vs OFF	`NEXUS_HOOKS=on scripts/bench/bench.sh` / `NEXUS_HOOKS=off …`	the headline comparison (content scanning on/off) → §5
Nexus: capture full bodies	`NEXUS_AUDIT_BODIES=on scripts/bench/bench.sh`	audit stores the full prompt + completion text (not just metadata) — the heaviest lossless-audit case → §5, §7
Clean a gateway's data (fresh start)	`scripts/bench/clean.sh`	TRUNCATE its traffic tables + flush Redis (verified empty) — `bench.sh` does this each cycle → §5
See every knob	—	the 6 run knobs are in `scripts/bench/README.md`; deploy/profile/loadtest-CLI in §10

The benchmark has no config file — you only ever set 6 inline knobs (GATEWAY, PROFILE, STAGES, RUN_ID, NEXUS_HOOKS, NEXUS_AUDIT_BODIES); every other value is a fixed rig constant or always-on policy hardcoded in scripts/bench/lib.sh. Only nexus has the two extra scenario knobs (NEXUS_HOOKS, NEXUS_AUDIT_BODIES, applied by setup.sh); other gateways run vanilla.

0. Before you start (one-time, on your laptop)

You need, on your control machine (your laptop):

AWS CLI, logged in to the account you want to deploy into. Check:
```
aws configure list-profiles      # shows your profiles
aws sts get-caller-identity      # shows the account your default profile points at
```
If you use a named profile for this account, note its name — deploy.sh will let you pick it from a menu.

Ansible, plus the collections the playbooks use:

ansible-galaxy collection install ansible.posix community.postgresql

This repo, cloned. All commands below run from its root.

You do not need to pre-create an SSH key, VPC, subnet, or security group — deploy.sh lists what exists and offers to create what's missing.

1. Configure (optional but recommended)

cp deploy.env.example deploy.env      # gitignored; your local settings

Open deploy.env and set AWS_PROFILE to the profile for your target account (everything else has sane defaults you can leave alone). If you skip this, deploy.sh will simply ask you interactively.

deploy.env also has BENCH_SSH_PUBLIC_KEYS: paste the contents of your public key file (~/.ssh/*.pub) there to get passwordless ssh ec2-user@<box> on every box (otherwise use -i <the .pem>).

2. Build the cloud environment

./deploy.sh

The script is fully interactive and checks everything before creating anything. It asks the following, in order; press Enter to accept the default marked (current) unless you know you need something else:

Prompt	What it means / what to pick
AWS profile	Which account to deploy into. Pick yours.
AWS region	Where to deploy. `us-east-1` default.
`deploy into account <id>?`	Safety gate — confirm it's the right account.
(vCPU quota check)	Confirms your account can fit the boxes; warns if not.
EC2 key pair	The SSH key for the boxes. Pick an existing one (those with a local `.pem ✓` are ready to use) or create a new one (it's saved to `~/.ssh/<name>.pem` and the path is printed).
VPC	Pick your default VPC, or create a dedicated one (a small CloudFormation network stack).
Subnet	Pick a public one, or create a new public subnet.
Security group	Create a new one (recommended — it opens exactly the ports the rig needs) or reuse an existing SG (it tells you which required ports are missing).
`deploy this?`	Final confirmation — shows the full plan (account, region, boxes, sizes).

When it finishes it prints the box public IPs and writes ansible/inventory.ini (the list of boxes Ansible + the benchmark use).

The set of boxes that gets built is the matrix. The default matrix is mock, nexus, bifrost, and loadtest. To include all gateways, set these in deploy.env (or your shell) before running deploy.sh:

DEPLOY_LITELLM=true DEPLOY_KONG=true DEPLOY_PORTKEY=true DEPLOY_TENSORZERO=true

3. Install the software on the boxes

cd ansible && ansible-playbook -i inventory.ini site.yml

This installs and configures every box (host-native, no containers): kernel tuning, PostgreSQL/Redis where needed, each gateway, and the load generator. It is idempotent — safe to re-run. It takes several minutes (the gateways download/build their bits).

Success looks like PLAY RECAP with failed=0 for every box, and each gateway's Smoke ... ok line (it confirms the gateway reaches the mock and gets chatcmpl-mock back). The Could not match supplied host pattern: litellm/kong/... lines are normal for gateways you didn't deploy — they're skipped, not failed.
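
If you save the playbook output to a log, the recap check can be scripted. This is a minimal sketch, assuming the standard Ansible recap line shape (`host : ok=… unreachable=… failed=…`). The helper name `recap_ok` and the log path in the comment are illustrative and not part of this repo.

```shell
# recap_ok LOGFILE: exit 0 only if the log contains recap host lines and none
# of them reports a failed or unreachable task. Hypothetical helper.
# Example: ansible-playbook -i inventory.ini site.yml | tee ~/site.log
#          recap_ok ~/site.log && echo "all boxes ok"
recap_ok() {
  awk '
    /failed=[0-9]+/ && /unreachable=[0-9]+/ {
      hosts++
      if ($0 !~ /failed=0( |$)/ || $0 !~ /unreachable=0( |$)/) bad++
    }
    END { exit !(hosts > 0 && bad == 0) }
  ' "$1"
}
```

The per-gateway Smoke lines still need checking separately; this only covers the recap counters.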

To re-apply just one box/role later:

ansible-playbook -i inventory.ini site.yml --tags <role> --limit <role>   # e.g. --tags nexus --limit nexus

4. The boxes (what each one is)

All gateways forward to the shared mock, which always answers with id == chatcmpl-mock — that string is the end-to-end "routing works" signal. Every gateway is driven at the SAME underlying model so the comparison is apples-to-apples; only the model-id string each gateway expects differs.

box	what it is	language	gateway port	model id clients send
mock	shared OpenAI-compatible upstream every gateway calls	Go	`:3062`	`mock-gpt-4o`
nexus	the gateway under test: ai-gateway + Hub + control-plane + cp-ui (nginx) + PG/Redis/NATS. Full audit pipeline, content hooks, virtual keys.	Go	`:3050` (API), `:443` (admin UI)	`mock-gpt-4o` (needs a virtual key)
bifrost	comparison gateway	Go	`:8080`	`mock-provider/mock-gpt-4o` (no auth)
litellm	comparison gateway	Python	`:4000`	`mock-gpt-4o`
kong	comparison gateway (AI Gateway plugin)	Lua/OpenResty	`:8000` (`:8001` admin)	`mock-gpt-4o`
portkey	comparison gateway	Node	`:8787`	`mock-gpt-4o`
tensorzero	comparison gateway	Rust	`:3000`	`tensorzero::model_name::mock-gpt-4o`
loadtest	the load generator (the project's own Go `loadtest` — the only load tool) + the benchmark scripts	Go	—	drives all the others over the private subnet

The benchmark scripts know each gateway's port / path / model / auth automatically (hardcoded in scripts/bench/lib.sh), so you just name a gateway — you don't type ports.

The Nexus admin UI

Open https://<nexus-public-ip>/ (HTTPS, not HTTP).
The cert is self-signed → click Advanced → Proceed. (HTTPS is required: the login uses browser WebCrypto, which only works over HTTPS/localhost.)
Login: admin@nexus.ai / nexus-demo.
The gateway API on :3050 is plain HTTP and is what the load test hits (with a virtual key, handled for you).

5. Run a stress test

Where you run it — and how it actually works

The suite (scripts/bench/) is an orchestrator: you run it on ONE machine and it drives the others over SSH. It does NOT generate load itself — per run it SSHes to the loadtest box and runs the loadtest generator there (traffic originates in-subnet, right next to the gateways — no internet hop), SSHes to the gateway box to clean / set knobs / restart / health-check, then pulls the results back to wherever you launched it (results/<run-id>/).

So you only need ONE machine that can SSH to the boxes and has this repo + a generated inventory. Pick one:

A control box in the same VPC (recommended). A small EC2 box holding the repo; it reaches the gateways over the private network (immune to a slow home connection). gen-inventory auto-writes private IPs there (override BENCH_INV_IP=public|private).
Your laptop — works too, but every SSH crosses the internet (public IPs).
The loadtest box itself — the suite is also copied to /opt/perf/scripts/bench/.

The control box deploys with the stack by default (DeployControl=true). To add one to a rig that doesn't have it (reproducible — lands in the rig's subnet/SG, discovered from CloudFormation, nothing hardcoded):

scripts/spin-control-box.sh           # reads STACK/REGION/KEY_NAME from deploy.env; idempotent

It boots with git + ansible-core. Then, one-time ON the control box (the script prints these):

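# FROM YOUR LAPTOP: copy the rig's private key to the control box (it is used to SSH the gateway boxes):
scp -i ~/.ssh/<key>.pem ~/.ssh/<key>.pem ec2-user@<control-public-ip>:~/.ssh/
# the rest runs ON the control box; first restrict the key's permissions:
chmod 600 ~/.ssh/<key>.pem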
git clone git@github.com:AlphaBitCore/llm-gateway-benchmark.git llm-gateway-benchmark && cd llm-gateway-benchmark
ansible-galaxy collection install -r ansible/requirements.yml   # AL2023-compatible, pinned
aws configure                                                   # creds to read the stack
scripts/gen-inventory.sh <stack> ~/.ssh/<key>.pem <region>      # auto-writes private IPs in-VPC
cd ansible && ansible all -m ping -o                            # verify connectivity

There is no config file to create — the benchmark's fixed values live in scripts/bench/lib.sh; you vary a run with inline knobs only.

The easy way — one command, a report per tier

GATEWAY=bifrost scripts/bench/run-tiers.sh   # one gateway, all 6 prompt-size tiers
scripts/bench/run-all.sh                     # every deployed gateway (auto-discovered from inventory)

run-tiers.sh runs a full cycle for each tier (128 / 550 / 12.5k × non-SSE/SSE) under its own RUN_ID, so results/<gw>-<tier>/report.md is a standalone per-tier report. REGEN_INVENTORY=1 refreshes the box IPs from CloudFormation first (after a stop/start). No IPs are hardcoded. It runs unattended (auto-yes on the per-cycle clean/preflight confirm; BENCH_ASSUME_YES=0 to be prompted).

Re-running? Move the finished results out of results/ first — otherwise the second round is SKIPPED. Each cycle writes to results/<run-id>/, and the run ID is deterministic — <gw>-<tier> (plus -hooks<h>-bodies<b> for nexus), no timestamp — so a second round reuses the exact same paths. To stay resumable after an interruption, run-tiers.sh skips any cycle whose results/<run-id>/report.md already exists (you'll see >> <run-id> already done — skip). That's what you want when resuming a half-finished suite; it's a trap when you wanted fresh numbers — the already-done cycles are silently skipped and you keep the old report. So before a clean re-run, archive (don't delete) the previous round:
mv results results.$(date +%Y%m%d-%H%M)     # archive the whole finished round, then re-run
# …or just the gateway you're redoing:
mkdir -p results.prev && mv results/nexus-* results.prev/   # then GATEWAY=nexus run-tiers.sh
# …or force ONE cycle: delete just its results/<run-id>/ and re-run (run-tiers redoes only that).
Reports under results/ are the deliverable — mv them aside, don't lose them.
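
Before launching a second round it can help to list which cycles would be treated as already done. The sketch below re-implements the rule described above (a cycle counts as done when `results/<run-id>/report.md` exists) for illustration only. It is not the code `run-tiers.sh` uses, and the function name `list_done_cycles` is hypothetical.

```shell
# list_done_cycles [RESULTS_DIR]: print each run ID that already has a report.md
# (these are the cycles a re-run would skip). RESULTS_DIR defaults to "results".
list_done_cycles() {
  dir="${1:-results}"
  for report in "$dir"/*/report.md; do
    [ -f "$report" ] || continue      # an unmatched glob stays literal; skip it
    basename "$(dirname "$report")"
  done
}
```

An empty listing means nothing will be skipped; any run ID printed keeps its old report unless you move it aside first.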

Long runs: detach so a dropped SSH doesn't kill it. A full suite takes many minutes, and the nexus matrix takes hours. Launch it in the background and tail the log, so that closing your laptop or losing wifi doesn't SIGHUP it midway:

nohup env GATEWAY=bifrost scripts/bench/run-tiers.sh > ~/bifrost-tiers.log 2>&1 &
tail -f ~/bifrost-tiers.log      # Ctrl-C just stops tailing; the run keeps going
# reconnect later:  tail -f ~/bifrost-tiers.log   (or: pgrep -af run-tiers)

(or use tmux/screen if installed). Reports still land in results/<gw>-<tier>/.

Stopping a run early — kill BOTH the orchestrator AND the remote generator. The orchestration (run-tiers.sh/bench.sh) runs on the control box, but the actual loadtest generator runs on the loadtest box (launched over a detached SSH). So pkill -f scripts/bench/ on the control box stops the orchestrator but leaves the generator hammering the gateway — and a second run launched on top then double-loads the gateway (corrupts results, can OOM/freeze it). Always also kill the generator on the loadtest box and confirm the gateway has no in-flight connections:
# on the control box:
pkill -f 'scripts/bench/'
# then the generator on the loadtest box (resolve its IP from ansible/inventory.ini):
ssh <loadtest-ip> "pkill -9 -f /usr/local/bin/loadtest"
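# verify the load actually stopped (established connections on the gateway's listen port should drain to 0;
# :3050 is the nexus API port, so substitute the port from §4 for another gateway):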
ssh <gateway-ip> "ss -tan | awk '\$1==\"ESTAB\" && \$4 ~ /:3050\$/{c++} END{print c+0}'"   # -> 0
# and confirm exactly one (or zero) run-tiers remains:
ssh <control-ip> "pgrep -af run-tiers"

Fully hands-off — every gateway × N rounds in one command

When you want the whole matrix unattended — every non-nexus gateway, repeated for run-to-run variance — scripts/bench/run-campaign.sh does it end-to-end. For each gateway it brings the box up + provisions, runs the full tier suite N times, deletes the per-request .jsonl after each round (they're the disk hog), archives the reports to v<r>/<gw>/, then brings the box down before the next one — one box at a time, so only mock + loadtest + the box under test are ever up.

GATEWAYS is required (no default — so the set under test is always spelled out in the command; nexus runs only if you name it).

# launch so an SSH logout can't kill it (the control box has lingering enabled):
systemd-run --user --unit=gw-campaign --working-directory="$PWD" \
  bash -lc 'GATEWAYS="bifrost litellm kong portkey tensorzero" scripts/bench/run-campaign.sh'
journalctl --user -u gw-campaign -f          # follow;  systemctl --user stop gw-campaign  to stop
# …or plain nohup (also survives logout thanks to lingering):
GATEWAYS="bifrost litellm kong portkey tensorzero" nohup scripts/bench/run-campaign.sh > ~/campaign.log 2>&1 &

knob	default	meaning
`GATEWAYS`	required (no default)	which gateways to test, e.g. `"bifrost litellm kong portkey tensorzero"` — nexus only if you name it
`SKIP`	(none)	gateways to drop from `GATEWAYS`, e.g. `SKIP="kong"`
`ROUNDS`	`2`	how many rounds → archived as `v1 … vN`
`ARCHIVE`	`~/results-archive`	reports land at `ARCHIVE/v<r>/<gw>/<tier>/report.md`
`KEEP_UP`	`0`	`1` = don't auto-down each gateway after its rounds
`CLEAN_LOADTEST_JSONL`	`1`	also purge `.jsonl` on the loadtest box over ssh
`FORCE`	`0`	`1` = start even if another run is live (DANGEROUS — double-loads the box)

It runs for hours — check it's alive / how far it got:

cat ~/results-archive/HEARTBEAT            # refreshed every 30s; a STALE mtime means it died
cat ~/results-archive/CAMPAIGN_STATUS.txt  # full timeline of up / round / archive / down
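
The staleness check on HEARTBEAT can be automated. This is a rough sketch, assuming a `find` that supports `-mmin` (GNU and BSD find both do; it is not in POSIX). The function name `heartbeat_stale` and the 2-minute default are illustrative choices, not values from the rig.

```shell
# heartbeat_stale FILE [MINUTES]: exit 0 if FILE is missing or was last modified
# more than roughly MINUTES minutes ago (default 2, i.e. several missed 30s beats).
heartbeat_stale() {
  file="$1"; minutes="${2:-2}"
  [ -e "$file" ] || return 0                      # no heartbeat at all counts as stale
  [ -n "$(find "$file" -mmin +"$minutes")" ]      # find prints the path only if it is old
}

# Example: heartbeat_stale ~/results-archive/HEARTBEAT && echo "campaign looks dead"
```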

Resumable. A round already in ARCHIVE/v<r>/<gw>/ is skipped; a gateway whose every round is archived is skipped without touching its box. Re-run after a crash/stop and it continues.
Disk-safe. Only report.md + summaries/csv/prom + monitor/ are kept (all small); the giant results-*.jsonl are deleted each round on both the control box and the loadtest box.
It won't start if another run-tiers/bench.sh is already running (would double-load the box; FORCE=1 overrides — don't).

The manual way — one cycle, full control

GATEWAY=nexus PROFILE=nonstream-550 STAGES="200:60s" scripts/bench/bench.sh

bench.sh runs a complete, repeatable cycle and confirms the target account/boxes before touching anything:

preflight (show account + target boxes) → confirm
  → clean    (wipe last round's data, verified empty)
  → setup    (nexus only: apply the hooks / bodies scenario knobs)
  → restart  (cold gateway processes)
  → health   (every gateway must route to the mock — gate)
  → cooldown (wait until the box is idle so the last round doesn't bleed in)
  → run      (the actual load) + monitor (server CPU/mem/disk/net)
  → verify-audit (per gateway: did its audit/log capture every request? — see §7)
  → report   (the numbers)

Choosing what to test

Set these inline — that's the whole list (there is no config file):

knob	what it does	example
`GATEWAY`	which gateway to drive	`GATEWAY=nexus`
`STAGES`	the load / concurrency (see below)	`STAGES="200:60s"`
`PROFILE`	request shape — a bundled tier name (a single `bench.sh` cycle; `run-tiers.sh` runs all 6)	`PROFILE=nonstream-550`
`RUN_ID`	names the results dir `results/<RUN_ID>/`	`RUN_ID=nexus-1`
`NEXUS_HOOKS`	nexus content scanning `off`/`on` (the headline comparison)	`NEXUS_HOOKS=on`
`NEXUS_AUDIT_BODIES`	nexus: `on` = the audit stores the full request+response bodies (the real prompt + completion) into `traffic_event_payload`, not just metadata (timestamp/model/tokens/latency/status). Bodies cost more (≈50 KB/request to copy+store), so `off` is the light default — turn `on` for the maximal "lossless even with full payloads" proof.	`NEXUS_AUDIT_BODIES=on`

`PROFILE` — the prompt-size tier (run all of them)

Prompt SIZE decides what you measure, so there are three size tiers, each in non-streaming (throughput) and streaming/SSE (TTFT) form. Run them all and report each separately — one size hides half the story (a big body makes every gateway JSON-parse-bound and masks routing-core differences; a tiny body exposes them). Sizes follow the published benchmark shapes so the numbers are comparable to other gateways' public results.

`PROFILE=`	in / out tokens	standard	what it measures
`nonstream-128` / `stream-128`	128 / 128	NVIDIA, vLLM	routing-core overhead — fixed per-request cost; the real RPS ceiling (tens of thousands)
`nonstream-550` / `stream-550`	550 / 150	LLMPerf, Anyscale	realistic chat — the headline number
`nonstream` / `stream`	~12.5k / 64	long-context / RAG	large-body path (JSON-parse / forward bound)

(nonstream-* = non-SSE → reports RPS; stream-* = SSE → reports TTFT.) PROFILE is one of these six bundled names → scripts/bench/profiles/<name>.json. The load generator itself is the standalone OSS nexus-loadtest (its own repo — see its README/DESIGN for the profile JSON schema and CLI).

Profile JSON schema (to write your own). A profile is declarative JSON. The bundled tiers look like this (scripts/bench/profiles/nonstream-128.json, trimmed):

{
  "defaults": {
    "protocol": "openai-chat",
    "target":   "http://localhost:3050/v1/chat/completions",
    "headers":  { "Authorization": "Bearer REPLACE_WITH_VK" },
    "model":    "mock-gpt-4o-mini",
    "max_tokens": 128
  },
  "warmup": "15s",
  "cache_mode": "bust",
  "correlation": { "uuid_in_prompt": true, "header": "x-request-id" },
  "thresholds": { "ttft_p95_ms": 0, "p95_ms": 0, "error_rate": 0.02 },
  "stages":    [ { "concurrency": 50, "duration": "30s" } ],
  "scenarios": [
    { "name": "overhead-128-nonstream", "weight": 100, "turns": 1, "stream": false,
      "max_tokens": 128, "content": { "mode": "sized", "approx_input_tokens": 128 } }
  ]
}

The rig OVERRIDES target, model, the virtual key (the Authorization header) and stages at run time: it points the generator at the gateway under test (resolved from the inventory) and applies the STAGES=… ladder. So in an external profile those fields are placeholders; what you actually control is the request shape:

field	meaning
`defaults.protocol`	wire protocol — `openai-chat` (others exist in the tool)
`defaults.max_tokens`	output-token cap (a scenario may override per-mix)
`warmup`	warm-up window discarded before measurement
`cache_mode`	`bust` = every request unique (defeats prompt caching) — keep it for fair numbers
`correlation`	inject a per-request UUID (prompt + header) for audit cross-checking
`thresholds`	pass/fail gates; `0` = report-only (don't gate)
`scenarios[]`	the traffic mix — each entry has `weight` (relative %), `turns`, `stream`, `max_tokens`
`scenarios[].content.mode`	`sized` (generate ~`approx_input_tokens` of input — sets prompt size), `pool` (random from a `prompts[]` list), or `scripted` (fixed dialogue)

`STAGES` — the concurrency / load control

Comma-separated stages, run in sequence. Three shapes (mixable):

`STAGES` entry	meaning
`200:60s`	closed-loop: 200 concurrent virtual users (VUs) for 60s — the max-throughput view
`@4000:60s`	open-loop: a fixed 4000 req/s arrival rate for 60s — the honest tail-latency vs offered load view
`@1000-8000:60s`	open-loop ramp: arrival rate climbs 1000→8000 req/s (walks the latency knee)
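
To sanity-check a STAGES string before a long run, the three shapes can be told apart with plain shell pattern matching. This is a loose shape check written for illustration; it is not the parser the load generator uses, and the function names are hypothetical.

```shell
# classify_stage ENTRY: print closed, open, or ramp for one STAGES entry;
# return 1 if the entry has no ":" separator at all.
classify_stage() {
  case "$1" in
    @*-*:*) echo ramp ;;      # @from-to:dur
    @*:*)   echo open ;;      # @rate:dur
    *:*)    echo closed ;;    # N:dur
    *)      echo "malformed: $1" >&2; return 1 ;;
  esac
}

# describe_stages "a,b,c": classify every comma-separated entry, in order
describe_stages() {
  old_ifs=$IFS; IFS=,
  for entry in $1; do
    printf '%s -> %s\n' "$entry" "$(classify_stage "$entry")"
  done
  IFS=$old_ifs
}
```

For example, `describe_stages "200:60s,@4000:60s,@1000-8000:60s"` labels the three entries closed, open, and ramp respectively.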

How to size the ramp. Climb concurrency geometrically until the gateway box CPU saturates and RPS stops rising (the knee), then 1–2 points past it — going further is wasted queuing (latency balloons, RPS flat). The knee's concurrency depends on the tier (RPS = concurrency ÷ per-request latency, and latency differs ~100× across tiers), so the ladder is tier-aware, not one-size-fits-all:

tier	closed-loop ramp	why
`-128`, `-550` (small / mid)	`50:30s,100:30s,200:30s,400:30s,800:60s,1200:60s,1600:60s`	low latency → knee is high; may not saturate even at 1600 VU → finish with the open-loop sweep below
`nonstream` / `stream` (~12.5k)	`50:30s,100:30s,200:60s,400:60s,800:60s`	high latency → knee is low (~200–400 VU ≈ saturation); 2000 VU here is pure queuing, wasted minutes

Open-loop sweep — the right way to find the small-tier ceiling and honest tails; extend the rates until error_rate climbs or the run goes INVALID:

STAGES="@5000:60s,@10000:60s,@20000:60s,@40000:60s,@60000:60s" PROFILE=nonstream-128 scripts/bench/bench.sh

Per-stage duration: 30s for sub-knee climbing stages (long enough to reach steady state + a stable throughput read); 60s at/near/past the knee and for every open-loop stage (stable p99/p99.9 — coordinated-omission correction needs a window). Warmup (15s) is already baked into the profile. Don't drop below 30s (noisy tails) or pile on >8 stages per run — geometric placement beats a linear every-200 ladder: same resolution near the knee, far less wall-clock.
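
The sizing rules above (geometric climb, 30s below the knee, 60s at or past it) can be turned into a small generator. This is a sketch under those stated rules only. `make_ladder` is a hypothetical helper, it uses pure doubling, and you still have to estimate the knee yourself.

```shell
# make_ladder START KNEE MAX: double the concurrency from START up to MAX;
# stages below KNEE get 30s, stages at or past KNEE get 60s.
make_ladder() {
  vus=$1; knee=$2; max=$3; out=""
  while [ "$vus" -le "$max" ]; do
    if [ "$vus" -lt "$knee" ]; then dur=30s; else dur=60s; fi
    out="${out:+$out,}$vus:$dur"      # append with a comma after the first entry
    vus=$((vus * 2))
  done
  echo "$out"
}

make_ladder 50 200 800    # prints 50:30s,100:30s,200:60s,400:60s,800:60s
```

With a knee of 200 this reproduces the ~12.5k-tier ladder from the table. The small-tier ladder in the table adds non-doubling points (1200, 1600) past 800, so it would need hand-editing. The result can be passed straight through, e.g. `STAGES="$(make_ladder 50 200 800)" scripts/bench/bench.sh`.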

The nexus-only comparisons — hooks off/on and bodies off/on

nexus has two cost dimensions no other gateway has, and they're the headline comparisons:

NEXUS_HOOKS — off = the bare gateway (no content scanning); on = every request is content-scanned (the compliance cost).
NEXUS_AUDIT_BODIES — off = audit stores metadata only; on = stores the full request+response bodies (the heaviest no-loss-audit case).

run-tiers.sh sweeps these automatically for GATEWAY=nexus: every tier runs across hooks{off,on} × bodies{off,on} (4 combos/tier), each with a RUN_ID suffixed -hooks<h>-bodies<b>, so any pair is a clean diff:

GATEWAY=nexus scripts/bench/run-tiers.sh        # full matrix (restrict with NEXUS_AUDIT_BODIES=off)

# For a long high-RPS nexus run, launch it detached:
nohup env GATEWAY=nexus scripts/bench/run-tiers.sh > ~/nexus-tiers.log 2>&1 &
# Every cycle already deep-cleans (stop ai-gateway+hub → `nats stream purge` NEXUS_EVENTS →
# wipe the spill spool → TRUNCATE with both down → restart), so a run never starts on the
# previous cycle's audit backlog. After each run, verify-audit waits 2 min and counts the
# audit rows that landed vs requests sent (§7) — at ~40k RPS nexus won't be at 100% within
# 2 min, and that ratio IS the honest figure (it's a measurement, not a pass/fail gate).
# (To PROVE lossless audit you need a SUSTAINABLE rate — a low tier / rate-capped open-loop
# where the pipeline keeps up; see §7. Also: GOMEMLIMIT must be set on the gw — see the
# nexus Ansible role — or the SSE/12k tiers OOM-freeze it.)

# hooks off vs on at the 550 chat tier, bodies held off:
COMPARE="nexus-550-nonstream-hooksoff-bodiesoff nexus-550-nonstream-hookson-bodiesoff" scripts/bench/report.sh
# bodies off vs on, hooks held off:
COMPARE="nexus-550-nonstream-hooksoff-bodiesoff nexus-550-nonstream-hooksoff-bodieson" scripts/bench/report.sh

Or one cycle by hand:

RUN_ID=off NEXUS_HOOKS=off scripts/bench/bench.sh
RUN_ID=on  NEXUS_HOOKS=on  scripts/bench/bench.sh
COMPARE="off on" scripts/bench/report.sh        # off-vs-on delta table

Fairness: compare like-for-like. bifrost (no audit/scan) lines up against nexus hooks-off + bodies-off (also bare) for the routing-core race; hooks-on / bodies-on show what those features cost and are reported separately. Never compare "nexus with audit on" against "a bare competitor". A Vectorscan validity gate runs automatically on hooks=on: it sends a PII probe and fails the run unless the response comes back redacted (a mis-built nexus binary can silently disable scanning, which would make hooks-on look artificially fast). If the gate fails, the nexus binary asset needs to be rebuilt with FAT_RUNTIME=OFF; escalate.

6. Read the results

The cycle prints a report and saves files under results/<run-id>/ on your control machine (pulled from the loadtest box's /var/log/perf-bench/<run-id>/):

file	what it is
`report.md`	the human summary (per-stage RPS, ok%, p50/p95/p99, TTFT)
`<gw>/summary-*.json`	machine-readable totals (used by `-compare`)
`<gw>/report-*.csv`, `.prom`	spreadsheet / Prometheus feeds
`<gw>/results-*.jsonl`	one line per request (latency breakdown, tokens, status)
`monitor/`	per-box CPU/mem/disk/net during the run + a nexus CPU profile

How to read it (important):

Generator health is a validity gate — check it FIRST. report.md shows each gateway as OK or INVALID. INVALID means the load generator (not the gateway) ran out of FDs/ports or dropped records — those numbers don't count. Re-run; never report an INVALID run.
Compare gateways by the TTFT delta, not absolute latency. Every gateway calls the same mock, so the mock's latency is a constant in every number. It cancels in the gateway-to-gateway (or hooks-off-vs-on) TTFT delta, and that delta is the gateway's own overhead. Absolute p99 is dominated by the mock.
ITL (inter-token latency) only matters for the streaming profiles (`stream-128`, `stream-550`, `stream`); it's 0 for the nonstream ones.

Before a second round, mv this round's results/ aside. Run IDs are deterministic (no timestamp), so run-tiers.sh treats a present results/<run-id>/report.md as "already done" and skips it — a re-run with the old results still in place keeps the stale report. See the archive snippet in §5.

Run-to-run regression check (non-zero exit on a regression — usable as a CI gate):

loadtest -compare results/<old>/<gw>/summary-*.json,results/<new>/<gw>/summary-*.json

7. Check the audit data (is the gateway lossless?)

A key claim is that Nexus's audit captures every request (and, with bodies-on, the full request/response payloads) even under load — without slowing down. The audit is asynchronous (gateway → NATS → Hub → PostgreSQL, plus an on-disk spool that a recovery sweeper replays), so you must let it fully drain before counting, or you'll undercount and think data was lost when it's just still flushing.

verify-audit.sh does this for you, for every gateway (fair — each gateway's own log table is checked the same way), and runs automatically inside bench.sh:

RUN_ID=<the run id> scripts/bench/verify-audit.sh

It waits 2 minutes for the async audit to settle, then for nexus reports two DISTINCT things (don't conflate them):

Pipeline loss: did the gw DROP any audit event it created? Measured against enqueued (the events the gw made), via its own dropped/poisoned counters AND the conservation check PG + NATS-backlog + spill ≈ enqueued. dropped=0 plus ≈100% of enqueued ⇒ lossless (every created event is in PG or durably queued). This is the headline claim.
Coverage: did every request get an audit event? Measured as enqueued vs sent. A value < 100% at the high tiers reflects failed/incomplete requests (nothing to audit), NOT loss. Read it against the stage ok%: low ok% explains low coverage; high ok% + low coverage = a real bug.

Why pipeline loss is measured vs enqueued, not sent: at high load many sent requests never complete (timeout/RST/5xx), so the gw legitimately makes fewer events than sent — counting that against the pipeline would falsely read as "lost data". For best-effort gateways (bifrost) there is no pipeline; their PG shortfall vs sent is a real by-design drop, shown plainly.
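
The two measurements can be reproduced by hand from the raw counts. The following is an illustrative calculation only, not what `verify-audit.sh` runs. The function name, the argument order, the output labels, and the `TOLERANCE_PCT` slack used for "≈" are all assumptions.

```shell
# audit_verdict SENT ENQUEUED PG BACKLOG SPILL DROPPED
# coverage  = enqueued / sent
# conserved = (PG rows + NATS backlog + spill) / enqueued
# TOLERANCE_PCT (default 0.5) is an assumed slack, not a value from the rig.
audit_verdict() {
  awk -v sent="$1" -v enq="$2" -v pg="$3" -v backlog="$4" -v spill="$5" \
      -v dropped="$6" -v tol="${TOLERANCE_PCT:-0.5}" 'BEGIN {
    coverage  = (sent > 0) ? 100 * enq / sent : 0
    conserved = (enq  > 0) ? 100 * (pg + backlog + spill) / enq : 0
    printf "coverage=%.1f%% conserved=%.1f%% dropped=%d\n", coverage, conserved, dropped
    if (dropped == 0 && conserved >= 100 - tol) print "pipeline: lossless"
    else                                          print "pipeline: loss detected"
  }'
}

# 100000 sent, 92000 events created, all of them in PG or durably queued:
audit_verdict 100000 92000 91500 400 100 0
```

In this example coverage is 92.0% while the pipeline is still lossless, which is the case the text describes: the missing 8% never became audit events, so it should be read against the stage ok% rather than counted as dropped data.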

With NEXUS_AUDIT_BODIES=on it also checks the request/response bodies landed (traffic_event_payload).

What the bodies flag does and doesn't gate. NEXUS_AUDIT_BODIES only gates capture of the user content (the full request prompt + the completion) — the heavy, optional part. Gateway-generated error responses are always recorded regardless of the flag (their synthetic error envelope carries no user content and is the most useful thing to have when a request fails — kept for traceability). So on a clean 100%-ok run, bodies-off truly captures no payloads; if requests error, you'll still see those error envelopes in the audit — by design.

To inspect the data manually, run psql locally on the nexus box:

PGPASSWORD=benchpass psql -h localhost -U bench -d gatewaybench -c "SELECT count(*) FROM traffic_event;"

(clean.sh truncates this before each run, so the count is that run's.)

Audit completeness vs. high RPS — read this before trusting an INCOMPLETE. No gateway lands its per-request log in real time at tens-of-thousands of RPS — the async pipeline (queue → store) is designed to absorb the burst and catch up afterwards, and at ~40k RPS it catches up only slowly (the backlog drains long after the load stops). So "captured == sent" is not a meaningful discriminator at the high tiers — everyone is INCOMPLETE there. Two consequences for how we run:

To actually verify a lossless claim, test where the pipeline can keep up: a low-RPS tier (or open-loop capped at a sustainable rate). The honest durability metric is the highest sustained rate at which the backlog stays bounded, not a 40k-RPS pass/fail. Above that, durability is just each gateway's stated guarantee (block/no-drop vs spill vs best-effort) — there's no end-to-end way to prove it at that load.

At the high tiers it's a measurement, not a gate: verify-audit always just waits 2 min and reports the ratio (cheap). At ~40k RPS that ratio will be < 100% — report throughput/latency as the headline and the audit ratio as context, and say so.

Because that backlog also outlives the run, a plain TRUNCATE between runs wouldn't stick: the gateway keeps flushing into the table right after. So every run deep-cleans, for every gateway fairly. clean.sh stops the gateway's services (dropping the in-memory buffer), purges any durable backlog, TRUNCATEs with services down, then restarts, so each run starts verified-empty. For nexus, the purge is a nats stream purge of the NEXUS_EVENTS stream plus a wipe of the spill spool; deleting the JetStream store dir does not work, because nats restores it on restart.
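The order of the deep-clean steps matters, so here is a dry-run sketch of that order for nexus. It is not the rig's clean.sh. Several things in it are assumptions: the function names, the service list (taken from the troubleshooting table), leaving NATS running during the purge, truncating both `traffic_event` and `traffic_event_payload`, the `-f` (force) flag on the purge, and the spool path (the `nexus_audit_spool_dir` default). With `DRY_RUN=1`, the default here, it only prints each step.

```shell
#!/usr/bin/env bash
# Sketch of the deep-clean ORDER only; not the rig's clean.sh.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

deep_clean_nexus() {
  # 1. Stop the gateway services so the in-memory buffer is dropped.
  #    Assumption: NATS stays up, because step 2 needs a running server.
  run sudo systemctl stop nexus-ai-gateway nexus-hub nexus-control-plane
  # 2. Purge the durable backlog through NATS itself, not by deleting files.
  run nats stream purge NEXUS_EVENTS -f
  # 3. Wipe the on-disk spill spool (path assumed from the group_vars default).
  run sudo find /var/lib/nexus/audit-spool -mindepth 1 -delete
  # 4. TRUNCATE while nothing can write to the tables.
  run env PGPASSWORD=benchpass psql -h localhost -U bench -d gatewaybench \
    -c 'TRUNCATE traffic_event, traffic_event_payload;'
  # 5. Restart the services on an empty store.
  run sudo systemctl start nexus-ai-gateway nexus-hub nexus-control-plane
}

deep_clean_nexus
```

Truncating before the services are stopped would let late flushes repopulate the table, which is why the TRUNCATE sits between stop and start.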

8. Save money — stop / start / tear down

Boxes cost money while running. You don't need them all up at once.

# interactive: pick an account/region, list the boxes, select one or more, stop/start/terminate
scripts/boxes.sh

# scriptable, by role (no menus):
scripts/box.sh state                  # list every box: role / state / public IP
scripts/box.sh stop nexus bifrost     # stop boxes (frees compute; disk kept; restart is fast)
scripts/box.sh start nexus            # start them again

# full teardown of everything (deletes the stack):
./down.sh

stop vs terminate. Stop frees the compute charge, keeps the disk, and you can restart in seconds — use this between tests. Terminate deletes the box; because the boxes are CloudFormation-managed, terminating one outside CFN drifts the stack, so for a clean full teardown use ./down.sh.

After stopping+starting a box its PUBLIC IP changes → re-generate the inventory:

scripts/gen-inventory.sh gw-bench ~/.ssh/<your-key>.pem us-east-1

(Private IPs are kept, so gateway↔mock traffic is unaffected; only SSH/targets need the refresh.)

9. Troubleshooting

symptom	cause / fix
`deploy.sh`: `unbound variable`	an old bash tripping over a stale checkout; the current scripts handle bash 3.2 (macOS), so re-pull if you see this.
ansible: `Could not match host pattern: kong/...`	normal — that gateway isn't deployed (`DEPLOY_*=false`). Not an error.
smoke fails / not `chatcmpl-mock`	the gateway can't reach the mock — check the mock box is up and the SG allows the port.
nexus services not running	re-run `--tags nexus --limit nexus`; check `systemctl status nexus-ai-gateway nexus-hub nexus-control-plane nats` on the box (note: the unit is `nexus-hub`, the binary is `nexus-hub`).
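`verify-audit` shows INCOMPLETE	either the audit truly dropped records under load, or the backlog hadn't fully drained within the 2-min settle. At the high-RPS tiers this is expected (§7); re-run at a low-RPS tier before treating it as loss.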
can't `ssh ec2-user@<box>`	use `-i ~/.ssh/<key>.pem`, or add your pubkey via `BENCH_SSH_PUBLIC_KEYS` in `deploy.env` and re-provision.

10. Parameter reference (every knob + how to set it)

Three layers, set in three places. Precedence everywhere: a value exported in your shell / on the command line WINS over the file default.
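A common bash idiom gives exactly this "shell value wins over file default" behavior. The sketch below illustrates the idiom only; whether the rig's scripts use it is an assumption, and `load_defaults` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Illustration of the precedence rule; the real scripts may implement it differently.
load_defaults() {
  # ${VAR:=default} assigns only when VAR is unset or empty,
  # so a value already exported by the caller is left alone.
  : "${REGION:=us-east-1}"
  : "${STACK:=gw-bench}"
  echo "REGION=$REGION STACK=$STACK"
}

# Nothing set in the shell: the file defaults apply.
( unset REGION STACK; load_defaults )
# REGION exported in the shell: it wins over the default.
( export REGION=eu-west-1; unset STACK; load_defaults )
```

In practice this is what makes `REGION=eu-west-1 ./deploy.sh` override a default without editing any file.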

10.1 Bring-up / provision — `deploy.env` (copy from `deploy.env.example`)

Read by ./deploy.sh. Anything you leave unset, deploy.sh asks for interactively.

variable	default	what it controls
`AWS_PROFILE`	(your default)	which AWS account to deploy into
`REGION`	`us-east-1`	AWS region
`STACK`	`gw-bench`	CloudFormation stack name
`NET_STACK`	`${STACK}-net`	name of the optional "create a VPC" network stack
`KEY_NAME`	`llm-bench-key`	EC2 key pair name
`KEY_FILE`	`~/.ssh/${KEY_NAME}.pem`	local private key path
`VPC_ID`	(pick/auto)	existing VPC; blank → choose/create at the prompt
`SUBNET_ID`	(pick/auto)	existing public subnet; blank → choose/create
`SECURITY_GROUP_ID`	(blank)	reuse an SG; blank → the stack creates one with the right ports
`ADMIN_CIDR`	(your IP/32)	who may SSH (:22). Blank → auto-detect your IP
`ACCESS_CIDR`	`0.0.0.0/0`	who may reach gateway/UI/mock ports
`GATEWAY_TYPE`	`c6i.4xlarge`	instance type for gateway boxes
`AUX_TYPE`	`c6i.4xlarge`	instance type for mock + loadtest boxes
`VOLUME_GIB`	`120`	root disk size
`VOLUME_IOPS`	`12000`	root disk IOPS (gp3; `>= 4x` the throughput MB/s)
`VOLUME_THROUGHPUT`	`1000`	root disk throughput MB/s (gp3, 125-2000), applied online post-launch by `scripts/set-ebs-throughput.sh` (deploy.sh runs it — EC2::Instance can't set gp3 Throughput inline). The volume is sized so it's never the bottleneck; the real cap is the instance EBS bandwidth (~593 MiB/s sustained / ~1250 burst on `c6i.4xlarge`) — for more sustained write, bump `GATEWAY_TYPE`, not this
`DEPLOY_MOCK/NEXUS/BIFROST/LOADTEST`	`true`	which boxes to build
`DEPLOY_LITELLM/KONG/PORTKEY/TENSORZERO`	`false`	the other gateways (set `true` to include)
`BENCH_SSH_PUBLIC_KEYS`	(empty)	your `ssh-…` pubkey(s), `;`-separated → passwordless login on every box
`ASSUME_YES`	`0`	`1` = skip all confirmation prompts (CI)
`PROVISION`	`0`	`1` = also run Ansible right after the boxes are up
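The two gp3 constraints in the table (throughput in 125-2000 MB/s, IOPS at least 4x the throughput) are easy to get wrong when changing one value and not the other. A minimal pre-flight check, assuming only those two rules, could look like this; `check_gp3` is a hypothetical helper, not part of deploy.sh.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for VOLUME_IOPS / VOLUME_THROUGHPUT.
# Encodes only the two rules stated in the table above.
check_gp3() {
  local iops=$1 tput=$2
  if [ "$tput" -lt 125 ] || [ "$tput" -gt 2000 ]; then
    echo "throughput $tput MB/s outside 125-2000"
    return 1
  fi
  if [ "$iops" -lt $((4 * tput)) ]; then
    echo "iops $iops < 4 x $tput = $((4 * tput))"
    return 1
  fi
  echo "ok: iops=$iops throughput=$tput"
}

check_gp3 12000 1000          # the defaults from the table
check_gp3 3000 1000 || true   # too few IOPS for 1000 MB/s
```

The defaults pass with room to spare: 1000 MB/s needs at least 4000 IOPS, and the table sets 12000.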

10.2 Per-run — inline env knobs (no config file)

The benchmark has no config file. Set any of these inline (GATEWAY=… PROFILE=… scripts/bench/run-tiers.sh); they default in scripts/bench/lib.sh. These six are the whole list:

variable	default	what it controls
`GATEWAY`	`bifrost`	which gateway to drive (nexus/bifrost/litellm/kong/portkey/tensorzero)
`STAGES`	`50:30s,100:30s,200:30s,400:30s,800:60s`	load/concurrency — closed `N:dur`, open `@rate:dur`, ramp `@from-to:dur` → §5
`PROFILE`	`nonstream-550`	request shape (single `bench.sh` cycle; `run-tiers.sh` runs all 6 bundled tiers) → §5
`RUN_ID`	timestamp	label for this run's output dir + compare
`NEXUS_HOOKS`	`off`	nexus content scanning `off`/`on` (`run-tiers.sh` sweeps both)
`NEXUS_AUDIT_BODIES`	`off`	nexus full request/response body capture `off`/`on` (`run-tiers.sh` sweeps both)

Everything else is a fixed rig constant or always-on policy, hardcoded in lib.sh — not a knob: per-gateway PORT_/PATH_/MODEL_/AUTH_, MOCK_MODEL/PORT, PG_*, AUDIT_TABLE_* / CLEAN_TABLES_*, REMOTE_OUT/LOCAL_OUT, the cooldown thresholds, the 2-min audit settle, deep-clean (always on, for fairness), and the nexus PII/redaction scan gate (always on). To change one of those, edit lib.sh.
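The three STAGES forms in the table (closed `N:dur`, open `@rate:dur`, ramp `@from-to:dur`) can be told apart mechanically. The sketch below is a hypothetical reader for that grammar as it is described above, useful for sanity-checking a spec before a long run. It is not the generator's own parser, and `describe_stages` is an invented name.

```shell
#!/usr/bin/env bash
# Hypothetical reader for a STAGES spec; grammar inferred from the table above.
describe_stages() {
  local spec=$1 stage load dur range
  local IFS=','                      # stages are comma-separated
  for stage in $spec; do
    load=${stage%%:*}                # text before the first ':'
    dur=${stage#*:}                  # text after the first ':'
    case $load in
      @*-*) range=${load#@}
            echo "open-loop ramp ${range%%-*} -> ${range#*-} for $dur" ;;
      @*)   echo "open-loop rate ${load#@} for $dur" ;;
      *)    echo "closed-loop concurrency $load for $dur" ;;
    esac
  done
}

describe_stages '50:30s,@2000:30s,@1000-5000:60s'
```

Run against the default `STAGES` value it prints five closed-loop lines, one per concurrency step.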

10.3 Request shape — the profile JSON (`scripts/bench/profiles/*.json`)

Fixes HOW each request looks (so every gateway is measured identically). target / model / auth header / stages are overridden per-run by run.sh; edit the rest to change the workload:

field	meaning
`defaults.max_tokens`	response size cap
`warmup`	leading window excluded from steady-state stats (e.g. `"10s"`)
`cache_mode`	`bust` = UUID-front cache-bust (the benchmark default)
`reuse_body`	marshal the body once, reuse zero-copy (cache-OFF runs only — removes the per-request 50 KB marshal that caps generator RPS)
`arrival`	open-loop inter-arrival: `uniform` (default) or `poisson` (bursty; produces harsher, more honest tail latencies)
`openloop_max_inflight`	cap on outstanding open-loop requests (default 50000)
`live_interval`	print a rolling RPS/p50/p99 line every interval (e.g. `"10s"`; `"0"` off)
`timeout` / `think_time`	per-request timeout / inter-turn pause
`capture_error_body`	store non-200 bodies in the JSONL
`thresholds`	`ttft_p95_ms` / `p95_ms` / `error_rate` SLOs + `abort_on_fail` (stop the sweep at the first breach)
`scenarios[]`	`weight`, `turns`, `stream`, `max_tokens`, `content.{mode,approx_input_tokens}`, and `checks` (contains / token-range response assertions)

10.4 The generator binary — `loadtest` flags

run.sh builds these for you; useful for an ad-hoc test: --config <profile.json> · --target <url> · --model <id> · --vk <token> · --stages '<spec>' · --out <dir> · --compare old.json,new.json · --regress-pct <N> (regression threshold for --compare).
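For an ad-hoc test it can help to see the flags assembled into one command line. The sketch below only prints such a line and executes nothing. The flags are the ones listed above; every value is a placeholder, the profile filename is an assumption (that a `PROFILE` name maps to a same-named JSON file), and `loadtest_cmd` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Prints an ad-hoc loadtest command line; nothing is executed.
# Every argument value below is a placeholder, not a rig default.
loadtest_cmd() {
  local config=$1 target=$2 model=$3 token=$4 stages=$5 out=$6
  printf "loadtest --config %s --target %s --model %s --vk %s --stages '%s' --out %s\n" \
    "$config" "$target" "$model" "$token" "$stages" "$out"
}

loadtest_cmd scripts/bench/profiles/nonstream-550.json \
  'http://<gateway-ip>:<port>/<path>' '<model-id>' '<token>' \
  '50:30s,100:30s' ./adhoc-out
```

The stages spec is quoted in the printed line so that it survives copy-paste into a shell unchanged.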

10.5 Provision-time tuning — `ansible/group_vars/all.yml` (advanced)

Applied by Ansible; change here + re-provision (these are the gateway's internals, not per-run knobs):

variable	default	what it controls
`nexus_gomemlimit_pct`	`70`	ai-gateway GOMEMLIMIT as a % of box RAM (must be explicit on bare EC2)
`nexus_upstream_max_idle_conns_per_host`	`2000`	gw→mock keepalive pool per host (avoids socket churn at high concurrency)
`nexus_audit_max_queued_records`	`100000`	gateway in-heap audit queue depth (burst absorbed before drops)
`nexus_audit_frame_max_bytes`	`262144`	audit publish frame size (256 KB)
`nexus_audit_spool_dir`	`/var/lib/nexus/audit-spool`	on-disk audit spill spool
`nexus_audit_spool_max_total_mb`	`51200`	on-disk spill buffer cap (50 GiB)
`ports.<gw>`	per gateway	gateway listen ports (must match the SG)
`pg_bench_db/user/password`	`gatewaybench`/`bench`/`benchpass`	standardized datastore creds

Audit loss-mode/codec/compression and Vectorscan scan-sizing are not set here — they ship as the gateway's code defaults (a stock deploy is zero perf-config). Override them via the gateway's own env only for an A/B baseline.
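To make `nexus_gomemlimit_pct` concrete, the sketch below turns a percentage and a RAM total into a GOMEMLIMIT value. It shows the arithmetic only and is not what the Ansible role necessarily does; `gomemlimit_mib` is a hypothetical name, and the 32 GiB figure is an illustrative round number.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of RAM -> a GOMEMLIMIT value in MiB.
# Usage: gomemlimit_mib PCT TOTAL_KIB
gomemlimit_mib() {
  local pct=$1 total_kib=$2
  # Integer math: KiB -> MiB, then take the percentage (rounds down).
  echo "$(( total_kib / 1024 * pct / 100 ))MiB"
}

# On a Linux box the total could be read from /proc/meminfo:
#   total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# Here: 70% of a nominal 32 GiB, expressed in KiB.
gomemlimit_mib 70 33554432
```

The reported MemTotal on a real instance is somewhat below the nominal size, so the computed limit lands slightly lower than this round-number example.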

Rules (do not break)

Never change a box by hand. Every change goes through CloudFormation (cloudformation/) or an Ansible role (ansible/roles/) and must be reproducible by re-running. Fix/re-apply with ansible-playbook -i inventory.ini site.yml --limit <host> --tags <role>; change ports/SG by editing the template + aws cloudformation update-stack.
Secrets are never committed. deploy.env holds your local config/credentials and is gitignored — only its *.example is committed. The benchmark itself has no config file (rig constants are in scripts/bench/lib.sh; run knobs are inline env vars).
