A start-to-finish guide for someone who has never used this before. It walks you through: build the cloud environment → install the software → run a stress test → read the results → check the audit data → shut things down to save money. Copy-paste the commands; every prompt is explained.
The rig benchmarks several LLM gateways head-to-head: each gateway runs on its own EC2 box, all forwarding to one shared mock upstream, with a dedicated load-generator box. You bring it up on demand and tear it down when done.
Just want to run it from the in-VPC control box? Use `CONTROL-BOX-RUNBOOK.md`: a strictly linear, copy-paste sequence (no decisions). This document is the full reference behind it.
Two blocks. Each row is a task → the one script that does it. Details for each are in the numbered sections below.
| I want to… | run this | what it does |
|---|---|---|
| Build the whole environment (boxes and software, one shot) | `PROVISION=1 ./deploy.sh` | interactive bring-up (pick account/region/key/VPC/SG) then auto-runs Ansible. (Or `./deploy.sh` then `cd ansible && ansible-playbook -i inventory.ini site.yml`.) → §1–3 |
| Stop one or more machines (save cost) | `scripts/boxes.sh` | menu: pick account/region → list boxes → select one/many → stop. (Scriptable: `scripts/box.sh stop nexus bifrost`.) → §8 |
| Start stopped machines | `scripts/box.sh start nexus` | then `scripts/gen-inventory.sh gw-bench ~/.ssh/<key>.pem us-east-1` (public IPs changed) → §8 |
| Destroy the entire environment | `./down.sh` | deletes the whole CloudFormation stack (all boxes) → §8 |
| I want to… | run this | what it does |
|---|---|---|
| Test one gateway, all tiers | `GATEWAY=bifrost scripts/bench/run-tiers.sh` | a full cycle per prompt-size tier (128/550/12.5k × non-SSE/SSE) → one `report.md` each → §5 |
| Test every deployed gateway | `scripts/bench/run-all.sh` | `run-tiers` for each gateway found in the inventory → §5 |
| Run the whole campaign hands-off (chosen gateways × N rounds) | `GATEWAYS="bifrost litellm kong portkey tensorzero" nohup scripts/bench/run-campaign.sh > ~/campaign.log 2>&1 &` | per gw: up→provision→run N rounds→purge jsonl→archive to `v<r>/<gw>/`→down, one box at a time; resumable; heartbeat → §5 |
| Run a stress test (one full cycle) | `scripts/bench/bench.sh` | clean → setup → restart → health → cooldown → run → verify-audit → report → §5 |
| Test a specific gateway | `GATEWAY=nexus scripts/bench/run-tiers.sh` | any of nexus/bifrost/litellm/kong/portkey/tensorzero → §5 |
| Pick the prompt size | `PROFILE=nonstream-550 scripts/bench/bench.sh` | 6 tiers: `nonstream-128`/`stream-128`, `nonstream-550`/`stream-550`, `nonstream`/`stream`; run them all, size decides the bottleneck → §5 |
| Set the concurrency / load | `STAGES="200:60s" scripts/bench/bench.sh` | closed `N:dur` · open `@rate:dur` · ramp `@from-to:dur`; ladder is tier-aware → §5 |
| Read the report | `cat results/<run>/report.md` · A/B diff: `COMPARE="off on" scripts/bench/report.sh` | per-stage RPS/ok%/p99/TTFT + validity gate → §6 |
| Check the targets are healthy | `scripts/bench/health.sh` | every selected gateway must route to the mock (`chatcmpl-mock`) → §5 |
| Check how much audit data was lost | `scripts/bench/verify-audit.sh` | per gateway: captured rows vs requests sent → LOSSLESS / INCOMPLETE N% → §7 |
| Nexus: hooks ON vs OFF | `NEXUS_HOOKS=on scripts/bench/bench.sh` / `NEXUS_HOOKS=off …` | the headline comparison (content scanning on/off) → §5 |
| Nexus: capture full bodies | `NEXUS_AUDIT_BODIES=on scripts/bench/bench.sh` | audit stores the full prompt + completion text (not just metadata), the heaviest lossless-audit case → §5, §7 |
| Clean a gateway's data (fresh start) | `scripts/bench/clean.sh` | TRUNCATE its traffic tables + flush Redis (verified empty); `bench.sh` does this each cycle → §5 |
| See every knob | n/a | the 6 run knobs are in `scripts/bench/README.md`; deploy/profile/loadtest-CLI in §10 |
The benchmark has no config file: you only ever set 6 inline knobs (`GATEWAY`, `PROFILE`, `STAGES`, `RUN_ID`, `NEXUS_HOOKS`, `NEXUS_AUDIT_BODIES`); every other value is a fixed rig constant or always-on policy hardcoded in `scripts/bench/lib.sh`. Only nexus has the two extra scenario knobs (`NEXUS_HOOKS`, `NEXUS_AUDIT_BODIES`, applied by `setup.sh`); other gateways run vanilla.
You need, on your control machine (your laptop):
- AWS CLI, logged in to the account you want to deploy into. Check:
  aws configure list-profiles # shows your profiles
  aws sts get-caller-identity # shows the account your default profile points at
  If you use a named profile for this account, note its name: `deploy.sh` will let you pick it from a menu.
- Ansible:
  ansible-galaxy collection install ansible.posix community.postgresql
- This repo, cloned. All commands below run from its root.
You do not need to pre-create an SSH key, VPC, subnet, or security group —
deploy.sh lists what exists and offers to create what's missing.
cp deploy.env.example deploy.env # gitignored; your local settings
Open `deploy.env` and set `AWS_PROFILE` to the profile for your target account
(everything else has sane defaults you can leave alone). If you skip this, deploy.sh
will simply ask you interactively.
deploy.env also has BENCH_SSH_PUBLIC_KEYS — paste your ~/.ssh/*.pub there to get
passwordless ssh ec2-user@<box> on every box (otherwise use -i <the .pem>).
./deploy.sh
It is fully interactive and checks everything before creating anything. You'll be asked, in order (just press Enter to take the marked (current) default unless you know otherwise):
| Prompt | What it means / what to pick |
|---|---|
| AWS profile | Which account to deploy into. Pick yours. |
| AWS region | Where to deploy. `us-east-1` default. |
| `deploy into account <id>?` | Safety gate: confirm it's the right account. |
| (vCPU quota check) | Confirms your account can fit the boxes; warns if not. |
| EC2 key pair | The SSH key for the boxes. Pick an existing one (those with a local .pem ✓ are ready to use) or create a new one (it's saved to `~/.ssh/<name>.pem` and the path is printed). |
| VPC | Pick your default VPC, or create a dedicated one (a small CloudFormation network stack). |
| Subnet | Pick a public one, or create a new public subnet. |
| Security group | Create a new one (recommended: it opens exactly the ports the rig needs) or reuse an existing SG (it tells you which required ports are missing). |
| `deploy this?` | Final confirmation: shows the full plan (account, region, boxes, sizes). |
When it finishes it prints the box public IPs and writes ansible/inventory.ini
(the list of boxes Ansible + the benchmark use).
Which boxes get built is the matrix. The default is mock nexus bifrost loadtest.
To include all gateways, set these in deploy.env (or your shell) before deploy.sh:
DEPLOY_LITELLM=true DEPLOY_KONG=true DEPLOY_PORTKEY=true DEPLOY_TENSORZERO=true
cd ansible && ansible-playbook -i inventory.ini site.yml
This installs and configures every box (host-native, no containers): kernel tuning, PostgreSQL/Redis where needed, each gateway, and the load generator. It is idempotent, so it is safe to re-run. It takes several minutes (the gateways download/build their bits).
Success looks like PLAY RECAP with failed=0 for every box, and each gateway's
Smoke ... ok line (it confirms the gateway reaches the mock and gets
chatcmpl-mock back). The Could not match supplied host pattern: litellm/kong/...
lines are normal for gateways you didn't deploy — they're skipped, not failed.
To re-apply just one box/role later:
ansible-playbook -i inventory.ini site.yml --tags <role> --limit <role> # e.g. --tags nexus --limit nexus
All gateways forward to the shared mock, which always answers with
id == chatcmpl-mock — that string is the end-to-end "routing works" signal.
Every gateway is driven at the SAME underlying model so the comparison is apples-to-apples;
only the model-id string each gateway expects differs.
| box | what it is | language | gateway port | model id clients send |
|---|---|---|---|---|
| mock | shared OpenAI-compatible upstream every gateway calls | Go | `:3062` | `mock-gpt-4o` |
| nexus | the gateway under test: ai-gateway + Hub + control-plane + cp-ui (nginx) + PG/Redis/NATS. Full audit pipeline, content hooks, virtual keys. | Go | `:3050` (API), `:443` (admin UI) | `mock-gpt-4o` (needs a virtual key) |
| bifrost | comparison gateway | Go | `:8080` | `mock-provider/mock-gpt-4o` (no auth) |
| litellm | comparison gateway | Python | `:4000` | `mock-gpt-4o` |
| kong | comparison gateway (AI Gateway plugin) | Lua/OpenResty | `:8000` (`:8001` admin) | `mock-gpt-4o` |
| portkey | comparison gateway | Node | `:8787` | `mock-gpt-4o` |
| tensorzero | comparison gateway | Rust | `:3000` | `tensorzero::model_name::mock-gpt-4o` |
| loadtest | the load generator (the project's own Go `loadtest`, the only load tool) + the benchmark scripts | Go | n/a | drives all the others over the private subnet |
The benchmark scripts know each gateway's port / path / model / auth automatically
(hardcoded in scripts/bench/lib.sh), so you just name a gateway — you don't type ports.
- Open `https://<nexus-public-ip>/` (HTTPS, not HTTP).
- The cert is self-signed → click Advanced → Proceed. (HTTPS is required: the login uses browser WebCrypto, which only works over HTTPS/localhost.)
- Login: `admin@nexus.ai` / `nexus-demo`.
- The gateway API on `:3050` is plain HTTP and is what the load test hits (with a virtual key, handled for you).
The suite (scripts/bench/) is an orchestrator: you run it on ONE machine and it
drives the others over SSH. It does NOT generate load itself — per run it SSHes to the
loadtest box and runs the loadtest generator there (traffic originates in-subnet,
right next to the gateways — no internet hop), SSHes to the gateway box to
clean / set knobs / restart / health-check, then pulls the results back to wherever you
launched it (results/<run-id>/).
So you only need ONE machine that can SSH to the boxes and has this repo + a generated inventory. Pick one:
- A control box in the same VPC (recommended). A small EC2 box holding the repo; it reaches the gateways over the private network (immune to a slow home connection). `gen-inventory` auto-writes private IPs there (override `BENCH_INV_IP=public|private`).
- Your laptop: works too, but every SSH crosses the internet (public IPs).
- The loadtest box itself: the suite is also copied to `/opt/perf/scripts/bench/`.
The control box deploys with the stack by default (DeployControl=true). To add one to a
rig that doesn't have it (reproducible — lands in the rig's subnet/SG, discovered from
CloudFormation, nothing hardcoded):
scripts/spin-control-box.sh # reads STACK/REGION/KEY_NAME from deploy.env; idempotent
It boots with git + ansible-core. Then, one-time ON the control box (the script prints these):
# bring the rig's private key over (used to SSH the gateway boxes):
scp -i ~/.ssh/<key>.pem ~/.ssh/<key>.pem ec2-user@<control-public-ip>:~/.ssh/ # then chmod 600
git clone git@github.com:AlphaBitCore/llm-gateway-benchmark.git llm-gateway-benchmark && cd llm-gateway-benchmark
ansible-galaxy collection install -r ansible/requirements.yml # AL2023-compatible, pinned
aws configure # creds to read the stack
scripts/gen-inventory.sh <stack> ~/.ssh/<key>.pem <region> # auto-writes private IPs in-VPC
cd ansible && ansible all -m ping -o # verify connectivity
There is no config file to create: the benchmark's fixed values live in
scripts/bench/lib.sh; you vary a run with inline knobs only.
GATEWAY=bifrost scripts/bench/run-tiers.sh # one gateway, all 6 prompt-size tiers
scripts/bench/run-all.sh # every deployed gateway (auto-discovered from inventory)
`run-tiers.sh` runs a full cycle for each tier (128 / 550 / 12.5k × non-SSE/SSE) under its
own RUN_ID, so results/<gw>-<tier>/report.md is a standalone per-tier report.
REGEN_INVENTORY=1 refreshes the box IPs from CloudFormation first (after a stop/start).
No IPs are hardcoded. It runs unattended (auto-yes on the per-cycle clean/preflight confirm;
BENCH_ASSUME_YES=0 to be prompted).
Re-running? Move the finished results out of `results/` first, otherwise the second round is SKIPPED. Each cycle writes to `results/<run-id>/`, and the run ID is deterministic (`<gw>-<tier>`, plus `-hooks<h>-bodies<b>` for nexus, no timestamp), so a second round reuses the exact same paths. To stay resumable after an interruption, `run-tiers.sh` skips any cycle whose `results/<run-id>/report.md` already exists (you'll see `>> <run-id> already done — skip`). That's what you want when resuming a half-finished suite; it's a trap when you wanted fresh numbers: the already-done cycles are silently skipped and you keep the old report. So before a clean re-run, archive (don't delete) the previous round:
mv results results.$(date +%Y%m%d-%H%M) # archive the whole finished round, then re-run
# …or just the gateway you're redoing:
mkdir -p results.prev && mv results/nexus-* results.prev/ # then GATEWAY=nexus run-tiers.sh
# …or force ONE cycle: delete just its results/<run-id>/ and re-run (run-tiers redoes only that).
Reports under `results/` are the deliverable: `mv` them aside, don't lose them.
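The skip rule described above can be sketched in a few lines of POSIX shell. This is an illustration only, not the code in `run-tiers.sh`: the helper names `run_id` and `already_done` and the `RESULTS_DIR` variable are hypothetical, and the tier label is passed in as-is because its exact spelling inside a run ID is decided by the rig.

```shell
#!/bin/sh
# Hypothetical helpers; the real logic lives in scripts/bench/run-tiers.sh.

# run_id GW TIER [HOOKS BODIES]
# Deterministic ID: <gw>-<tier>, plus -hooks<h>-bodies<b> when both are given.
run_id() (
  gw=$1; tier=$2; hooks=${3:-}; bodies=${4:-}
  if [ -n "$hooks" ] && [ -n "$bodies" ]; then
    printf '%s-%s-hooks%s-bodies%s\n' "$gw" "$tier" "$hooks" "$bodies"
  else
    printf '%s-%s\n' "$gw" "$tier"
  fi
)

# already_done RUN_ID
# Succeeds (exit 0) when <results dir>/<run-id>/report.md exists.
# RESULTS_DIR is an assumed override; it defaults to ./results.
already_done() {
  [ -f "${RESULTS_DIR:-results}/$1/report.md" ]
}
```

Because the ID carries no timestamp, `already_done "$(run_id bifrost nonstream-550)"` keeps succeeding until the old `results/` directory is moved aside, which is why a second round is skipped.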
Long runs — detach so a dropped SSH doesn't kill it. A full suite is many minutes (nexus's matrix, hours). Launch it in the background and tail the log, so closing your laptop / losing wifi doesn't SIGHUP it midway:
nohup env GATEWAY=bifrost scripts/bench/run-tiers.sh > ~/bifrost-tiers.log 2>&1 &
tail -f ~/bifrost-tiers.log # Ctrl-C just stops tailing; the run keeps going
# reconnect later: tail -f ~/bifrost-tiers.log (or: pgrep -af run-tiers)
Or use tmux/screen if installed. Reports still land in `results/<gw>-<tier>/`.
Stopping a run early: kill BOTH the orchestrator AND the remote generator. The orchestration (`run-tiers.sh` / `bench.sh`) runs on the control box, but the actual `loadtest` generator runs on the loadtest box (launched over a detached SSH). So `pkill -f scripts/bench/` on the control box stops the orchestrator but leaves the generator hammering the gateway, and a second run launched on top then double-loads the gateway (corrupts results, can OOM/freeze it). Always also kill the generator on the loadtest box and confirm the gateway has no in-flight connections:
# on the control box:
pkill -f 'scripts/bench/'
# then the generator on the loadtest box (resolve its IP from ansible/inventory.ini):
ssh <loadtest-ip> "pkill -9 -f /usr/local/bin/loadtest"
# verify the load actually stopped (gateway's listen port should drain to 0):
ssh <gateway-ip> "ss -tan | awk '\$1==\"ESTAB\" && \$4 ~ /:3050\$/{c++} END{print c+0}'" # -> 0
# and confirm exactly one (or zero) run-tiers remains:
ssh <control-ip> "pgrep -af run-tiers"
When you want the whole matrix unattended — every non-nexus gateway, repeated for run-to-run
variance — scripts/bench/run-campaign.sh does it end-to-end. For each gateway it brings the box
up + provisions, runs the full tier suite N times, deletes the per-request .jsonl after
each round (they're the disk hog), archives the reports to v<r>/<gw>/, then brings the box
down before the next one — one box at a time, so only mock + loadtest + the box under test
are ever up.
GATEWAYS is required (no default — so the set under test is always spelled out in the command;
nexus runs only if you name it).
# launch so an SSH logout can't kill it (the control box has lingering enabled):
systemd-run --user --unit=gw-campaign --working-directory="$PWD" \
bash -lc 'GATEWAYS="bifrost litellm kong portkey tensorzero" scripts/bench/run-campaign.sh'
journalctl --user -u gw-campaign -f # follow; systemctl --user stop gw-campaign to stop
# …or plain nohup (also survives logout thanks to lingering):
GATEWAYS="bifrost litellm kong portkey tensorzero" nohup scripts/bench/run-campaign.sh > ~/campaign.log 2>&1 &
| knob | default | meaning |
|---|---|---|
| `GATEWAYS` | required (no default) | which gateways to test, e.g. `"bifrost litellm kong portkey tensorzero"`; nexus only if you name it |
| `SKIP` | (none) | gateways to drop from `GATEWAYS`, e.g. `SKIP="kong"` |
| `ROUNDS` | `2` | how many rounds → archived as `v1` … `vN` |
| `ARCHIVE` | `~/results-archive` | reports land at `ARCHIVE/v<r>/<gw>/<tier>/report.md` |
| `KEEP_UP` | `0` | `1` = don't auto-down each gateway after its rounds |
| `CLEAN_LOADTEST_JSONL` | `1` | also purge `.jsonl` on the loadtest box over ssh |
| `FORCE` | `0` | `1` = start even if another run is live (DANGEROUS: double-loads the box) |
It runs for hours — check it's alive / how far it got:
cat ~/results-archive/HEARTBEAT # refreshed every 30s; a STALE mtime means it died
cat ~/results-archive/CAMPAIGN_STATUS.txt # full timeline of up / round / archive / down
- Resumable. A round already in `ARCHIVE/v<r>/<gw>/` is skipped; a gateway whose every round is archived is skipped without touching its box. Re-run after a crash/stop and it continues.
- Disk-safe. Only `report.md` + summaries/csv/prom + `monitor/` are kept (all small); the giant `results-*.jsonl` are deleted each round on both the control box and the loadtest box.
- It won't start if another `run-tiers` / `bench.sh` is already running (would double-load the box; `FORCE=1` overrides, but don't).
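Since a stale `HEARTBEAT` mtime is the signal that the campaign died, the check can be scripted rather than eyeballed. The sketch below is an assumption-laden illustration: `heartbeat_state` is a hypothetical helper (not shipped with the rig), the 2-minute default threshold is a guess chosen to sit comfortably above the 30s refresh, and it relies on `find -mmin`, which GNU and BSD `find` both provide but POSIX does not require.

```shell
#!/bin/sh
# heartbeat_state FILE [MAX_AGE_MIN]
# Prints "missing", "stale" or "alive". Hypothetical helper, not part of the rig.
heartbeat_state() (
  file=$1
  max_min=${2:-2}   # assumed threshold, in whole minutes
  [ -e "$file" ] || { echo missing; exit 0; }
  # find prints the path only when the file was last modified
  # more than max_min minutes ago (whole-minute granularity).
  if [ -n "$(find "$file" -mmin +"$max_min" 2>/dev/null)" ]; then
    echo stale
  else
    echo alive
  fi
)
```

For example, `heartbeat_state ~/results-archive/HEARTBEAT` could be run from cron or a shell loop, alerting when the answer is anything other than `alive`.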
GATEWAY=nexus PROFILE=nonstream-550 STAGES="200:60s" scripts/bench/bench.sh
`bench.sh` runs a complete, repeatable cycle and confirms the target account/boxes
before touching anything:
preflight (show account + target boxes) → confirm
→ clean (wipe last round's data, verified empty)
→ setup (nexus only: apply the hooks / bodies scenario knobs)
→ restart (cold gateway processes)
→ health (every gateway must route to the mock — gate)
→ cooldown (wait until the box is idle so the last round doesn't bleed in)
→ run (the actual load) + monitor (server CPU/mem/disk/net)
→ verify-audit (per gateway: did its audit/log capture every request? — see §7)
→ report (the numbers)
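The ordering above matters because an early gate (for example health) should stop the cycle before any load is generated. A minimal sketch of that "run steps in order, stop at the first failure" pattern follows. It is not the implementation of `bench.sh`: `run_cycle` is a hypothetical name, and treating every step as a hard gate is an assumption (the real script may let some steps, such as verify-audit, report without failing the run).

```shell
#!/bin/sh
# run_cycle STEP...
# Runs each named shell function in order and aborts at the first one that fails.
# Hypothetical sketch; the step names are whatever functions the caller defines.
run_cycle() {
  for step in "$@"; do
    printf '>> %s\n' "$step"
    "$step" || {
      printf '!! %s failed, aborting cycle\n' "$step" >&2
      return 1
    }
  done
}
```

With functions named after the stages, a call such as `run_cycle clean setup restart health cooldown run_load` would never reach `run_load` if `health` returned non-zero.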
Set these inline — that's the whole list (there is no config file):
| knob | what it does | example |
|---|---|---|
| `GATEWAY` | which gateway to drive | `GATEWAY=nexus` |
| `STAGES` | the load / concurrency (see below) | `STAGES="200:60s"` |
| `PROFILE` | request shape: a bundled tier name (a single `bench.sh` cycle; `run-tiers.sh` runs all 6) | `PROFILE=nonstream-550` |
| `RUN_ID` | names the results dir `results/<RUN_ID>/` | `RUN_ID=nexus-1` |
| `NEXUS_HOOKS` | nexus content scanning off/on (the headline comparison) | `NEXUS_HOOKS=on` |
| `NEXUS_AUDIT_BODIES` | nexus: `on` = the audit stores the full request+response bodies (the real prompt + completion) into `traffic_event_payload`, not just metadata (timestamp/model/tokens/latency/status). Bodies cost more (≈50 KB/request to copy+store), so `off` is the light default; turn on for the maximal "lossless even with full payloads" proof. | `NEXUS_AUDIT_BODIES=on` |
Prompt SIZE decides what you measure, so there are three size tiers, each in non-streaming (throughput) and streaming/SSE (TTFT) form. Run them all and report each separately — one size hides half the story (a big body makes every gateway JSON-parse-bound and masks routing-core differences; a tiny body exposes them). Sizes follow the published benchmark shapes so the numbers are comparable to other gateways' public results.
| `PROFILE=` | in / out tokens | standard | what it measures |
|---|---|---|---|
| `nonstream-128` / `stream-128` | 128 / 128 | NVIDIA, vLLM | routing-core overhead: fixed per-request cost; the real RPS ceiling (tens of thousands) |
| `nonstream-550` / `stream-550` | 550 / 150 | LLMPerf, Anyscale | realistic chat: the headline number |
| `nonstream` / `stream` | ~12.5k / 64 | long-context / RAG | large-body path (JSON-parse / forward bound) |
(nonstream-* = non-SSE → reports RPS; stream-* = SSE → reports TTFT.) PROFILE is one
of these six bundled names → scripts/bench/profiles/<name>.json. The load generator itself
is the standalone OSS nexus-loadtest (its own repo — see its README/DESIGN for the
profile JSON schema and CLI).
Profile JSON schema (to write your own). A profile is declarative JSON. The bundled tiers
look like this (scripts/bench/profiles/nonstream-128.json, trimmed):
{
"defaults": {
"protocol": "openai-chat",
"target": "http://localhost:3050/v1/chat/completions",
"headers": { "Authorization": "Bearer REPLACE_WITH_VK" },
"model": "mock-gpt-4o-mini",
"max_tokens": 128
},
"warmup": "15s",
"cache_mode": "bust",
"correlation": { "uuid_in_prompt": true, "header": "x-request-id" },
"thresholds": { "ttft_p95_ms": 0, "p95_ms": 0, "error_rate": 0.02 },
"stages": [ { "concurrency": 50, "duration": "30s" } ],
"scenarios": [
{ "name": "overhead-128-nonstream", "weight": 100, "turns": 1, "stream": false,
"max_tokens": 128, "content": { "mode": "sized", "approx_input_tokens": 128 } }
]
}
The rig OVERRIDES `target`, `model`, the vk (auth header) and `stages` at run time: it
points the generator at the gateway under test (resolved from the inventory) and applies the
STAGES=… ladder. So in an external profile those are placeholders; what you actually control
is the request shape:
| field | meaning |
|---|---|
| `defaults.protocol` | wire protocol: `openai-chat` (others exist in the tool) |
| `defaults.max_tokens` | output-token cap (a scenario may override per-mix) |
| `warmup` | warm-up window discarded before measurement |
| `cache_mode` | `bust` = every request unique (defeats prompt caching); keep it for fair numbers |
| `correlation` | inject a per-request UUID (prompt + header) for audit cross-checking |
| `thresholds` | pass/fail gates; `0` = report-only (don't gate) |
| `scenarios[]` | the traffic mix: each entry has `weight` (relative %), `turns`, `stream`, `max_tokens` |
| `scenarios[].content.mode` | `sized` (generate ~`approx_input_tokens` of input, which sets prompt size), `pool` (random from a `prompts[]` list), or `scripted` (fixed dialogue) |
Comma-separated stages, run in sequence. Three shapes (mixable):
| `STAGES` entry | meaning |
|---|---|
| `200:60s` | closed-loop: 200 concurrent virtual users (VUs) for 60s, the max-throughput view |
| `@4000:60s` | open-loop: a fixed 4000 req/s arrival rate for 60s, the honest tail-latency vs offered load view |
| `@1000-8000:60s` | open-loop ramp: arrival rate climbs 1000→8000 req/s (walks the latency knee) |
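To make the three entry shapes concrete, here is a small POSIX-shell sketch that classifies each comma-separated `STAGES` entry. It only mirrors the syntax in the table above; `describe_stage` and `describe_stages` are hypothetical helpers, and the real parsing is done by the rig and the load generator, which may accept or reject inputs differently.

```shell
#!/bin/sh
# describe_stage ENTRY
# Classifies one STAGES entry: N:dur (closed), @rate:dur (open), @from-to:dur (ramp).
describe_stage() (
  entry=$1
  dur=${entry##*:}    # text after the last colon, e.g. 60s
  load=${entry%:*}    # text before it, e.g. 200, @4000, @1000-8000
  case $load in
    @*-*) r=${load#@}
          echo "ramp ${r%-*}->${r#*-} req/s for $dur" ;;
    @*)   echo "open ${load#@} req/s for $dur" ;;
    *)    echo "closed $load VUs for $dur" ;;
  esac
)

# describe_stages "a,b,c"
# Prints one line per entry, in the order the stages would run.
describe_stages() (
  IFS=,
  set -f              # no globbing while the list is split on commas
  for e in $1; do describe_stage "$e"; done
)
```

For instance, `describe_stages "50:30s,@5000:60s"` prints a closed-loop line followed by an open-loop line, which is a quick way to sanity-check a ladder before launching a long run.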
How to size the ramp. Climb concurrency geometrically until the gateway box CPU saturates and RPS stops rising (the knee), then 1–2 points past it — going further is wasted queuing (latency balloons, RPS flat). The knee's concurrency depends on the tier (RPS = concurrency ÷ per-request latency, and latency differs ~100× across tiers), so the ladder is tier-aware, not one-size-fits-all:
| tier | closed-loop ramp | why |
|---|---|---|
| `*-128`, `*-550` (small / mid) | `50:30s,100:30s,200:30s,400:30s,800:60s,1200:60s,1600:60s` | low latency → knee is high; may not saturate even at 1600 VU → finish with the open-loop sweep below |
| `nonstream` / `stream` (~12.5k) | `50:30s,100:30s,200:60s,400:60s,800:60s` | high latency → knee is low (~200–400 VU ≈ saturation); 2000 VU here is pure queuing, wasted minutes |
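The reason the ladder is tier-aware follows directly from the relation quoted above, RPS = concurrency ÷ per-request latency. A minimal sketch of that arithmetic is below; `est_rps` is a hypothetical helper, and the latencies used in the usage note are illustrative placeholders, not measurements from this rig.

```shell
#!/bin/sh
# est_rps CONCURRENCY LATENCY_MS
# Closed-loop throughput estimate below the knee: RPS = VUs / latency (in seconds).
est_rps() {
  awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.0f\n", c / (ms / 1000) }'
}
```

With 200 VUs, a hypothetical 10 ms request gives `est_rps 200 10` = 20000 RPS, while a hypothetical 1000 ms request gives `est_rps 200 1000` = 200 RPS. The same concurrency lands at very different offered loads, so the knee sits at a different point on each tier's ladder. The estimate only holds while the gateway is not saturated; past the knee, added VUs raise latency rather than RPS.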
Open-loop sweep — the right way to find the small-tier ceiling and honest tails;
extend the rates until error_rate climbs or the run goes INVALID:
STAGES="@5000:60s,@10000:60s,@20000:60s,@40000:60s,@60000:60s" PROFILE=nonstream-128 scripts/bench/bench.sh
Per-stage duration: 30s for sub-knee climbing stages (long enough to reach steady state + a stable throughput read); 60s at/near/past the knee and for every open-loop stage (stable p99/p99.9; coordinated-omission correction needs a window). Warmup (15s) is already baked into the profile. Don't drop below 30s (noisy tails) or pile on >8 stages per run. Geometric placement beats a linear every-200 ladder: same resolution near the knee, far less wall-clock.
nexus has two cost dimensions no other gateway has, and they're the headline comparisons:
- `NEXUS_HOOKS`: `off` = the bare gateway (no content scanning); `on` = every request is content-scanned (the compliance cost).
- `NEXUS_AUDIT_BODIES`: `off` = audit stores metadata only; `on` = stores the full request+response bodies (the heaviest no-loss-audit case).
run-tiers.sh sweeps these automatically for GATEWAY=nexus: every tier runs across
hooks{off,on} × bodies{off,on} (4 combos/tier), each with a RUN_ID suffixed
-hooks<h>-bodies<b>, so any pair is a clean diff:
GATEWAY=nexus scripts/bench/run-tiers.sh # full matrix (restrict with NEXUS_AUDIT_BODIES=off)
# For a long high-RPS nexus run, launch it detached:
nohup env GATEWAY=nexus scripts/bench/run-tiers.sh > ~/nexus-tiers.log 2>&1 &
# Every cycle already deep-cleans (stop ai-gateway+hub → `nats stream purge` NEXUS_EVENTS →
# wipe the spill spool → TRUNCATE with both down → restart), so a run never starts on the
# previous cycle's audit backlog. After each run, verify-audit waits 2 min and counts the
# audit rows that landed vs requests sent (§7) — at ~40k RPS nexus won't be at 100% within
# 2 min, and that ratio IS the honest figure (it's a measurement, not a pass/fail gate).
# (To PROVE lossless audit you need a SUSTAINABLE rate — a low tier / rate-capped open-loop
# where the pipeline keeps up; see §7. Also: GOMEMLIMIT must be set on the gw — see the
# nexus Ansible role — or the SSE/12k tiers OOM-freeze it.)
# hooks off vs on at the 550 chat tier, bodies held off:
COMPARE="nexus-550-nonstream-hooksoff-bodiesoff nexus-550-nonstream-hookson-bodiesoff" scripts/bench/report.sh
# bodies off vs on, hooks held off:
COMPARE="nexus-550-nonstream-hooksoff-bodiesoff nexus-550-nonstream-hooksoff-bodieson" scripts/bench/report.sh
Or one cycle by hand:
RUN_ID=off NEXUS_HOOKS=off scripts/bench/bench.sh
RUN_ID=on NEXUS_HOOKS=on scripts/bench/bench.sh
COMPARE="off on" scripts/bench/report.sh # off-vs-on delta table
Fairness: compare like-for-like. bifrost (no audit/scan) lines up against nexus hooks-off + bodies-off (also bare) for the routing-core race; hooks-on / bodies-on show what those features cost, reported separately. Never mix "nexus with audit on" against "a bare competitor". A Vectorscan validity gate runs automatically on `hooks=on`: it sends a PII probe and fails the run unless the response comes back redacted (a mis-built nexus binary can silently disable scanning, which would make hooks-on look artificially fast). If it fails, the nexus binary asset needs rebuilding (`FAT_RUNTIME=OFF`); escalate.
The cycle prints a report and saves files under results/<run-id>/ on your control
machine (pulled from the loadtest box's /var/log/perf-bench/<run-id>/):
| file | what it is |
|---|---|
| `report.md` | the human summary (per-stage RPS, ok%, p50/p95/p99, TTFT) |
| `<gw>/summary-*.json` | machine-readable totals (used by `-compare`) |
| `<gw>/report-*.csv`, `*.prom` | spreadsheet / Prometheus feeds |
| `<gw>/results-*.jsonl` | one line per request (latency breakdown, tokens, status) |
| `monitor/` | per-box CPU/mem/disk/net during the run + a nexus CPU profile |
How to read it (important):
- Generator health is a validity gate, so check it FIRST. `report.md` shows each gateway as OK or INVALID. INVALID means the load generator (not the gateway) ran out of FDs/ports or dropped records, so those numbers don't count. Re-run; never report an INVALID run.
- Compare gateways by TTFT, not absolute latency. Every gateway calls the same mock, so the mock's latency is a constant in every number. It cancels in the gateway-to-gateway (or hooks-off-vs-on) TTFT delta, and that delta is the gateway's own overhead. Absolute p99 is dominated by the mock.
- ITL (inter-token latency) only matters for the streaming profiles (`stream-128`, `stream-550`, `stream`); it's 0 for the nonstream ones.
Before a second round, `mv` this round's `results/` aside. Run IDs are deterministic (no timestamp), so `run-tiers.sh` treats a present `results/<run-id>/report.md` as "already done" and skips it; a re-run with the old results still in place keeps the stale report. See the archive snippet in §5.
Run-to-run regression check (non-zero exit on a regression — usable as a CI gate):
loadtest -compare results/<old>/<gw>/summary-*.json,results/<new>/<gw>/summary-*.json
A key claim is that Nexus's audit captures every request (and, with bodies-on, the full request/response payloads) even under load, without slowing down. The audit is asynchronous (gateway → NATS → Hub → PostgreSQL, plus an on-disk spool that a recovery sweeper replays), so you must let it fully drain before counting, or you'll undercount and think data was lost when it's just still flushing.
verify-audit.sh does this for you, for every gateway (fair — each gateway's own
log table is checked the same way), and runs automatically inside bench.sh:
RUN_ID=<the run id> scripts/bench/verify-audit.sh
It waits 2 minutes for the async audit to settle, then for nexus reports two DISTINCT things (don't conflate them):
- Pipeline loss: did the gw DROP any audit event it created? Measured against `enqueued` (the events the gw made), via its own `dropped`/`poisoned` counters AND the conservation `PG + NATS-backlog + spill ≈ enqueued`. `dropped=0` + ≈100% of enqueued ⇒ lossless (every created event is in PG or durably queued). This is the headline claim.
- Coverage: did every request get an audit event? `enqueued` vs `sent`. < 100% at the high tiers is failed/incomplete requests (nothing to audit), NOT loss. Read it against the stage ok%: low ok% explains low coverage; high ok% + low coverage = a real bug.
Why pipeline loss is measured vs `enqueued`, not `sent`: at high load many sent requests never complete (timeout/RST/5xx), so the gw legitimately makes fewer events than sent; counting that against the pipeline would falsely read as "lost data". For best-effort gateways (bifrost) there is no pipeline; their PG shortfall vs sent is a real by-design drop, shown plainly.
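The two figures are easy to conflate, so here is a worked sketch of the arithmetic that keeps them apart. It is not the code in `verify-audit.sh`: `audit_summary` is a hypothetical helper, its argument order is invented for the example, and the 99.9% tolerance used for "≈ 100% of enqueued" is an assumption.

```shell
#!/bin/sh
# audit_summary SENT ENQUEUED PG BACKLOG SPILL DROPPED
# coverage  = enqueued / sent                    (did every request get an event?)
# accounted = (PG + backlog + spill) / enqueued  (conservation check on the pipeline)
# Hypothetical helper; the 99.9 tolerance is an assumed value.
audit_summary() {
  awk -v sent="$1" -v enq="$2" -v pg="$3" -v backlog="$4" -v spill="$5" -v dropped="$6" '
    BEGIN {
      if (sent <= 0 || enq <= 0) { print "no data"; exit }
      coverage  = 100 * enq / sent
      accounted = 100 * (pg + backlog + spill) / enq
      verdict = (dropped == 0 && accounted >= 99.9) ? "LOSSLESS" : "INCOMPLETE"
      printf "coverage=%.1f%% accounted=%.1f%% pipeline=%s\n", coverage, accounted, verdict
    }'
}
```

With made-up counts, `audit_summary 1000 900 850 40 10 0` reports 90% coverage but a lossless pipeline: 100 requests never produced an event (read that against ok%), while every event that was created is in PG or durably queued.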
With NEXUS_AUDIT_BODIES=on it also checks the request/response bodies landed
(traffic_event_payload).
What the bodies flag does and doesn't gate. `NEXUS_AUDIT_BODIES` only gates capture of the user content (the full request prompt + the completion), which is the heavy, optional part. Gateway-generated error responses are always recorded regardless of the flag (their synthetic error envelope carries no user content and is the most useful thing to have when a request fails, so it is kept for traceability). So on a clean 100%-ok run, bodies-off truly captures no payloads; if requests error, you'll still see those error envelopes in the audit, by design.
To look manually (on the nexus box), `psql` is local:
PGPASSWORD=benchpass psql -h localhost -U bench -d gatewaybench -c "SELECT count(*) FROM traffic_event;"
(`clean.sh` truncates this before each run, so the count is that run's.)
Audit completeness vs. high RPS — read this before trusting an INCOMPLETE. No gateway lands its per-request log in real time at tens-of-thousands of RPS — the async pipeline (queue → store) is designed to absorb the burst and catch up afterwards, and at ~40k RPS it catches up only slowly (the backlog drains long after the load stops). So "captured == sent" is not a meaningful discriminator at the high tiers — everyone is INCOMPLETE there. Two consequences for how we run:
- To actually verify a lossless claim, test where the pipeline can keep up: a low-RPS tier (or open-loop capped at a sustainable rate). The honest durability metric is the highest sustained rate at which the backlog stays bounded, not a 40k-RPS pass/fail. Above that, durability is just each gateway's stated guarantee (block/no-drop vs spill vs best-effort) — there's no end-to-end way to prove it at that load.
- At the high tiers it's a measurement, not a gate: verify-audit always just waits 2 min and reports the ratio (cheap). At ~40k RPS that ratio will be < 100% — report throughput/latency as the headline and the audit ratio as context, and say so.
Because that backlog also outlives the run, a plain TRUNCATE between runs wouldn't stick (the gateway keeps flushing into the table right after). So every run deep-cleans, for every gateway fairly: `clean.sh` stops the gateway's services (dropping the in-memory buffer), purges any durable backlog (nexus: `nats stream purge` the NEXUS_EVENTS stream + wipe the spill spool; deleting the JetStream store dir does not work, nats restores it on restart), TRUNCATEs with services down, then restarts, so each run starts verified-empty.
Boxes cost money while running. You don't need them all up at once.
# interactive: pick an account/region, list the boxes, select one or more, stop/start/terminate
scripts/boxes.sh
# scriptable, by role (no menus):
scripts/box.sh state # list every box: role / state / public IP
scripts/box.sh stop nexus bifrost # stop boxes (frees compute; disk kept; restart is fast)
scripts/box.sh start nexus # start them again
# full teardown of everything (deletes the stack):
./down.sh

stop vs terminate. Stop frees the compute charge and keeps the disk, and the box restarts in seconds, so use it between tests. Terminate deletes the box. Because the boxes are CloudFormation-managed, terminating one outside CFN drifts the stack, so for a clean full teardown use ./down.sh.
After stopping+starting a box its PUBLIC IP changes → re-generate the inventory:
scripts/gen-inventory.sh gw-bench ~/.ssh/<your-key>.pem us-east-1

(Private IPs are kept, so gateway↔mock traffic is unaffected; only SSH/targets need the refresh.)
| symptom | cause / fix |
|---|---|
| `deploy.sh: unbound variable` | you're on an ancient bash. The scripts handle bash 3.2 (macOS); re-pull if you see this. |
| `ansible: Could not match host pattern: kong/...` | normal: that gateway isn't deployed (`DEPLOY_*=false`). Not an error. |
| smoke fails / not `chatcmpl-mock` | the gateway can't reach the mock. Check the mock box is up and the SG allows the port. |
| nexus services not running | re-run `--tags nexus --limit nexus`; check `systemctl status nexus-ai-gateway nexus-hub nexus-control-plane nats` on the box (note: the unit is nexus-hub, the binary is nexus-hub). |
| `verify-audit` shows INCOMPLETE | either the audit truly dropped under load, or it hadn't fully drained. Re-run; it waits for drain. |
| can't `ssh ec2-user@<box>` | use `-i ~/.ssh/<key>.pem`, or add your pubkey via `BENCH_SSH_PUBLIC_KEYS` in `deploy.env` and re-provision. |
Three layers, set in three places. Precedence everywhere: a value exported in your shell / on the command line WINS over the file default.
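One common way to get "shell value wins over the default" in a shell script is the `${VAR:-default}` expansion. The sketch below illustrates that rule only; how `deploy.sh` and `lib.sh` actually implement it is an assumption, and `resolve_region` is a hypothetical function name.

```shell
# Sketch of the precedence rule (assumption: the real scripts may do this differently).
# ${REGION:-us-east-1} keeps a value already set in the environment and
# falls back to the default only when REGION is unset or empty.
resolve_region() {
  echo "${REGION:-us-east-1}"
}

( unset REGION; resolve_region )      # prints us-east-1 (the default)
( REGION=eu-west-1; resolve_region )  # prints eu-west-1 (the shell value wins)
```

The subshells keep each example from leaking its `REGION` value into the next one.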
Read by ./deploy.sh. Anything you leave unset, deploy.sh asks for interactively.
| variable | default | what it controls |
|---|---|---|
| `AWS_PROFILE` | (your default) | which AWS account to deploy into |
| `REGION` | `us-east-1` | AWS region |
| `STACK` | `gw-bench` | CloudFormation stack name |
| `NET_STACK` | `${STACK}-net` | name of the optional "create a VPC" network stack |
| `KEY_NAME` | `llm-bench-key` | EC2 key pair name |
| `KEY_FILE` | `~/.ssh/${KEY_NAME}.pem` | local private key path |
| `VPC_ID` | (pick/auto) | existing VPC; blank → choose/create at the prompt |
| `SUBNET_ID` | (pick/auto) | existing public subnet; blank → choose/create |
| `SECURITY_GROUP_ID` | (blank) | reuse an SG; blank → the stack creates one with the right ports |
| `ADMIN_CIDR` | (your IP/32) | who may SSH (:22). Blank → auto-detect your IP |
| `ACCESS_CIDR` | `0.0.0.0/0` | who may reach gateway/UI/mock ports |
| `GATEWAY_TYPE` | `c6i.4xlarge` | instance type for gateway boxes |
| `AUX_TYPE` | `c6i.4xlarge` | instance type for mock + loadtest boxes |
| `VOLUME_GIB` | `120` | root disk size (GiB) |
| `VOLUME_IOPS` | `12000` | root disk IOPS (gp3; >= 4x the throughput MB/s) |
| `VOLUME_THROUGHPUT` | `1000` | root disk throughput MB/s (gp3, 125-2000), applied online post-launch by `scripts/set-ebs-throughput.sh` (`deploy.sh` runs it, because EC2::Instance can't set gp3 Throughput inline). The volume is sized so it's never the bottleneck; the real cap is the instance EBS bandwidth (~593 MiB/s sustained / ~1250 burst on c6i.4xlarge). For more sustained write, bump `GATEWAY_TYPE`, not this |
| `DEPLOY_MOCK/NEXUS/BIFROST/LOADTEST` | `true` | which boxes to build |
| `DEPLOY_LITELLM/KONG/PORTKEY/TENSORZERO` | `false` | the other gateways (set `true` to include) |
| `BENCH_SSH_PUBLIC_KEYS` | (empty) | your `ssh-…` pubkey(s), `;`-separated → passwordless login on every box |
| `ASSUME_YES` | `0` | `1` = skip all confirmation prompts (CI) |
| `PROVISION` | `0` | `1` = also run Ansible right after the boxes are up |
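The `VOLUME_IOPS` row states a constraint (IOPS >= 4x the throughput in MB/s) that is easy to break when changing only one of the two values. A minimal pre-flight check might look like this; `check_gp3` is a hypothetical helper and is not something `deploy.sh` is known to run.

```shell
# Hypothetical pre-flight check (assumption: not part of deploy.sh).
# check_gp3 IOPS THROUGHPUT_MBPS -> succeeds when IOPS >= 4 x throughput.
check_gp3() {
  needed=$(( 4 * $2 ))
  if [ "$1" -ge "$needed" ]; then
    echo "ok: $1 IOPS >= $needed needed for $2 MB/s"
  else
    echo "too low: $1 IOPS < $needed needed for $2 MB/s" >&2
    return 1
  fi
}

# The table defaults: 12000 IOPS with 1000 MB/s throughput.
check_gp3 12000 1000   # prints: ok: 12000 IOPS >= 4000 needed for 1000 MB/s
```

With the defaults there is plenty of headroom; the check only matters if throughput is raised without touching IOPS.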
The benchmark has no config file. Set any of these inline
(GATEWAY=… PROFILE=… scripts/bench/run-tiers.sh); they default in scripts/bench/lib.sh.
These six are the whole list:
| variable | default | what it controls |
|---|---|---|
| `GATEWAY` | `bifrost` | which gateway to drive (nexus/bifrost/litellm/kong/portkey/tensorzero) |
| `STAGES` | `50:30s,100:30s,200:30s,400:30s,800:60s` | load/concurrency: closed `N:dur`, open `@rate:dur`, ramp `@from-to:dur` → §5 |
| `PROFILE` | `nonstream-550` | request shape (single `bench.sh` cycle; `run-tiers.sh` runs all 6 bundled tiers) → §5 |
| `RUN_ID` | timestamp | label for this run's output dir + compare |
| `NEXUS_HOOKS` | `off` | nexus content scanning off/on (`run-tiers.sh` sweeps both) |
| `NEXUS_AUDIT_BODIES` | `off` | nexus full request/response body capture off/on (`run-tiers.sh` sweeps both) |
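The three `STAGES` forms in the table can be told apart by their prefix alone. The sketch below classifies each comma-separated stage; it is an illustration of the syntax as described in the table, not the load generator's parser, and `describe_stages` is a hypothetical name.

```shell
# Hypothetical helper (assumption: the real parser lives in the load generator).
# describe_stages SPEC -> one line per stage: kind, load, duration.
describe_stages() {
  echo "$1" | tr ',' '\n' | while IFS=: read -r load dur; do
    case "$load" in
      @*-*) echo "ramp ${load#@} $dur" ;;    # @from-to:dur
      @*)   echo "open ${load#@} $dur" ;;    # @rate:dur
      *)    echo "closed $load $dur" ;;      # N:dur
    esac
  done
}

describe_stages '50:30s,@2000:60s,@1000-5000:120s'
# prints:
#   closed 50 30s
#   open 2000 60s
#   ramp 1000-5000 120s
```

Running it over a spec before a long campaign is a cheap way to confirm that a stage meant to be open-loop did not lose its `@`.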
Everything else is a fixed rig constant or always-on policy, hardcoded in lib.sh — not a
knob: per-gateway PORT_/PATH_/MODEL_/AUTH_, MOCK_MODEL/PORT, PG_*, AUDIT_TABLE_* /
CLEAN_TABLES_*, REMOTE_OUT/LOCAL_OUT, the cooldown thresholds, the 2-min audit settle,
deep-clean (always on, for fairness), and the nexus PII/redaction scan gate (always on). To
change one of those, edit lib.sh.
Fixes HOW each request looks (so every gateway is measured identically). target /
model / auth header / stages are overridden per-run by run.sh; edit the rest to
change the workload:
| field | meaning |
|---|---|
| `defaults.max_tokens` | response size cap |
| `warmup` | leading window excluded from steady-state stats (e.g. "10s") |
| `cache_mode` | `bust` = UUID-front cache-bust (the benchmark default) |
| `reuse_body` | marshal the body once, reuse zero-copy (cache-OFF runs only; removes the per-request 50 KB marshal that caps generator RPS) |
| `arrival` | open-loop inter-arrival: `uniform` (default) or `poisson` (bursty, harsher honest tails) |
| `openloop_max_inflight` | cap on outstanding open-loop requests (default 50000) |
| `live_interval` | print a rolling RPS/p50/p99 line every interval (e.g. "10s"; "0" off) |
| `timeout` / `think_time` | per-request timeout / inter-turn pause |
| `capture_error_body` | store non-200 bodies in the JSONL |
| `thresholds` | `ttft_p95_ms` / `p95_ms` / `error_rate` SLOs + `abort_on_fail` (stop the sweep at the first breach) |
| `scenarios[]` | `weight`, `turns`, `stream`, `max_tokens`, `content.{mode,approx_input_tokens}`, and `checks` (contains / token-range response assertions) |
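To see why `poisson` arrival is burstier than `uniform` at the same target rate, it helps to look at the gaps between requests. This sketch assumes the standard definitions (a constant gap of 1/rate for uniform, exponentially distributed gaps with mean 1/rate for a Poisson process); whether the generator computes them exactly this way is an assumption, and `gaps` is a hypothetical helper.

```shell
# Hypothetical helper (assumption: illustrates the two arrival models only).
# gaps MODE RATE N [SEED] -> N inter-arrival gaps in seconds for RATE requests/s.
gaps() {
  awk -v mode="$1" -v rate="$2" -v n="$3" -v seed="${4:-1}" 'BEGIN {
    srand(seed)
    for (i = 0; i < n; i++) {
      if (mode == "poisson") {
        u = rand()
        if (u == 0) u = 1e-12          # avoid log(0)
        printf "%.6f\n", -log(u) / rate   # exponential gap, mean 1/rate
      } else {
        printf "%.6f\n", 1 / rate         # constant gap
      }
    }
  }'
}

gaps uniform 100 3   # three identical gaps of 0.010000
gaps poisson 100 3   # three random gaps; values vary by awk implementation and seed
```

The Poisson gaps average out to the same rate but cluster, so some requests arrive nearly back to back, which is what stresses tail latency.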
run.sh builds these for you; useful for an ad-hoc test:
--config <profile.json> · --target <url> · --model <id> · --vk <token> ·
--stages '<spec>' · --out <dir> · --compare old.json,new.json ·
--regress-pct <N> (regression threshold for --compare).
Applied by Ansible; change here + re-provision (these are the gateway's internals, not per-run knobs):
| variable | default | what it controls |
|---|---|---|
| `nexus_gomemlimit_pct` | `70` | ai-gateway GOMEMLIMIT as a % of box RAM (must be explicit on bare EC2) |
| `nexus_upstream_max_idle_conns_per_host` | `2000` | gw→mock keepalive pool per host (avoids socket churn at high concurrency) |
| `nexus_audit_max_queued_records` | `100000` | gateway in-heap audit queue depth (burst absorbed before drops) |
| `nexus_audit_frame_max_bytes` | `262144` | audit publish frame size (256 KiB) |
| `nexus_audit_spool_dir` | `/var/lib/nexus/audit-spool` | on-disk audit spill spool |
| `nexus_audit_spool_max_total_mb` | `51200` | on-disk spill buffer cap (50 GiB) |
| `ports.<gw>` | per gateway | gateway listen ports (must match the SG) |
| `pg_bench_db/user/password` | `gatewaybench/bench/benchpass` | standardized datastore creds |
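As a worked example of the `nexus_gomemlimit_pct` row, the percentage translates into a concrete `GOMEMLIMIT` value as follows. This is a sketch of the arithmetic only; `gomemlimit_for` is a hypothetical helper, and how the Ansible role derives the value is not shown here.

```shell
# Hypothetical helper (assumption: the Ansible role may compute this differently).
# gomemlimit_for TOTAL_MIB PCT -> a GOMEMLIMIT value in whole MiB, rounded down.
gomemlimit_for() {
  echo "$(( $1 * $2 / 100 ))MiB"
}

# A box with 32768 MiB of RAM at the default 70%:
gomemlimit_for 32768 70   # prints 22937MiB

# On a Linux box the total can be read from /proc/meminfo (value is in kB):
# total_mib=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
```

The `MiB` suffix is one of the units the Go runtime accepts for `GOMEMLIMIT`.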
Audit loss-mode/codec/compression and Vectorscan scan-sizing are not set here — they ship as the gateway's code defaults (a stock deploy is zero perf-config). Override them via the gateway's own env only for an A/B baseline.
- Never change a box by hand. Every change goes through CloudFormation (`cloudformation/`) or an Ansible role (`ansible/roles/`) and must be reproducible by re-running. Fix/re-apply with `ansible-playbook -i inventory.ini site.yml --limit <host> --tags <role>`; change ports/SG by editing the template + `aws cloudformation update-stack`.
- Secrets are never committed. `deploy.env` holds your local config/credentials and is gitignored; only its `*.example` is committed. The benchmark itself has no config file (rig constants are in `scripts/bench/lib.sh`; run knobs are inline env vars).