LLM Gateway Benchmark Rig — full runbook

A start-to-finish guide for someone who has never used this before. It walks you through: build the cloud environment → install the software → run a stress test → read the results → check the audit data → shut things down to save money. Copy-paste the commands; every prompt is explained.

The rig benchmarks several LLM gateways head-to-head: each gateway runs on its own EC2 box, all forwarding to one shared mock upstream, with a dedicated load-generator box. You bring it up on demand and tear it down when done.

Just want to run it from the in-VPC control box? Use CONTROL-BOX-RUNBOOK.md — a strictly linear, copy-paste sequence (no decisions). This document is the full reference behind it.


Quick answers — every task is one command

Two blocks. Each row is a task → the one script that does it. Details for each are in the numbered sections below.

Block 1 — Environment lifecycle (build / stop / destroy)

| I want to… | run this | what it does |
|---|---|---|
| Build the whole environment (boxes and software, one shot) | PROVISION=1 ./deploy.sh | interactive bring-up (pick account/region/key/VPC/SG) then auto-runs Ansible. (Or ./deploy.sh then cd ansible && ansible-playbook -i inventory.ini site.yml.) → §1–3 |
| Stop one or more machines (save cost) | scripts/boxes.sh | menu: pick account/region → list boxes → select one/many → stop. (Scriptable: scripts/box.sh stop nexus bifrost.) → §8 |
| Start stopped machines | scripts/box.sh start nexus then scripts/gen-inventory.sh gw-bench ~/.ssh/<key>.pem us-east-1 | (public IPs changed) → §8 |
| Destroy the entire environment | ./down.sh | deletes the whole CloudFormation stack (all boxes) → §8 |

Block 2 — Stress testing

| I want to… | run this | what it does |
|---|---|---|
| Test one gateway, all tiers | GATEWAY=bifrost scripts/bench/run-tiers.sh | a full cycle per prompt-size tier (128/550/12.5k × non-SSE/SSE) → one report.md each → §5 |
| Test every deployed gateway | scripts/bench/run-all.sh | run-tiers for each gateway found in the inventory → §5 |
| Run the whole campaign hands-off (chosen gateways × N rounds) | GATEWAYS="bifrost litellm kong portkey tensorzero" nohup scripts/bench/run-campaign.sh > ~/campaign.log 2>&1 & | per gw: up → provision → run N rounds → purge jsonl → archive to v<r>/<gw>/ → down, one box at a time; resumable; heartbeat → §5 |
| Run a stress test (one full cycle) | scripts/bench/bench.sh | clean → setup → restart → health → cooldown → run → verify-audit → report → §5 |
| Test a specific gateway | GATEWAY=nexus scripts/bench/run-tiers.sh | any of nexus/bifrost/litellm/kong/portkey/tensorzero → §5 |
| Pick the prompt size | PROFILE=nonstream-550 scripts/bench/bench.sh | 6 tiers: nonstream-128/stream-128, nonstream-550/stream-550, nonstream/stream; run them all, size decides the bottleneck → §5 |
| Set the concurrency / load | STAGES="200:60s" scripts/bench/bench.sh | closed N:dur · open @rate:dur · ramp @from-to:dur; ladder is tier-aware → §5 |
| Read the report | cat results/<run>/report.md · A/B diff: COMPARE="off on" scripts/bench/report.sh | per-stage RPS/ok%/p99/TTFT + validity gate → §6 |
| Check the targets are healthy | scripts/bench/health.sh | every selected gateway must route to the mock (chatcmpl-mock) → §5 |
| Check how much audit data was lost | scripts/bench/verify-audit.sh | per gateway: captured rows vs requests sent → LOSSLESS / INCOMPLETE N% → §7 |
| Nexus: hooks ON vs OFF | NEXUS_HOOKS=on scripts/bench/bench.sh / NEXUS_HOOKS=off … | the headline comparison (content scanning on/off) → §5 |
| Nexus: capture full bodies | NEXUS_AUDIT_BODIES=on scripts/bench/bench.sh | audit stores the full prompt + completion text (not just metadata); the heaviest lossless-audit case → §5, §7 |
| Clean a gateway's data (fresh start) | scripts/bench/clean.sh | TRUNCATE its traffic tables + flush Redis (verified empty); bench.sh does this each cycle → §5 |
| See every knob | | the 6 run knobs are in scripts/bench/README.md; deploy/profile/loadtest-CLI in §10 |

The benchmark has no config file — you only ever set 6 inline knobs (GATEWAY, PROFILE, STAGES, RUN_ID, NEXUS_HOOKS, NEXUS_AUDIT_BODIES); every other value is a fixed rig constant or always-on policy hardcoded in scripts/bench/lib.sh. Only nexus has the two extra scenario knobs (NEXUS_HOOKS, NEXUS_AUDIT_BODIES, applied by setup.sh); other gateways run vanilla.


0. Before you start (one-time, on your laptop)

You need, on your control machine (your laptop):

  1. AWS CLI, logged in to the account you want to deploy into. Check:
    aws configure list-profiles      # shows your profiles
    aws sts get-caller-identity      # shows the account your default profile points at
    If you use a named profile for this account, note its name — deploy.sh will let you pick it from a menu.
  2. Ansible:
    ansible-galaxy collection install ansible.posix community.postgresql
  3. This repo, cloned. All commands below run from its root.

You do not need to pre-create an SSH key, VPC, subnet, or security group — deploy.sh lists what exists and offers to create what's missing.
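
A quick way to confirm the prerequisites above are installed is to check that each command is on PATH before running deploy.sh. The sketch below is only an illustration: the function name missing_tools is hypothetical and not part of the repo, and the tool list you pass is your own choice.

```shell
# Print the name of every listed command that is NOT on PATH.
# Prints nothing when all of them are found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, missing_tools aws ansible-playbook ansible-galaxy ssh git prints nothing when the laptop is ready, or one line per command that still needs installing.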


1. Configure (optional but recommended)

cp deploy.env.example deploy.env      # gitignored; your local settings

Open deploy.env and set AWS_PROFILE to the profile for your target account (everything else has sane defaults you can leave alone). If you skip this, deploy.sh will simply ask you interactively.

deploy.env also has BENCH_SSH_PUBLIC_KEYS — paste your ~/.ssh/*.pub there to get passwordless ssh ec2-user@<box> on every box (otherwise use -i <the .pem>).


2. Build the cloud environment

./deploy.sh

It is fully interactive and checks everything before creating anything. You'll be asked, in order (just press Enter to take the marked (current) default unless you know otherwise):

| Prompt | What it means / what to pick |
|---|---|
| AWS profile | Which account to deploy into. Pick yours. |
| AWS region | Where to deploy. us-east-1 is the default. |
| deploy into account <id>? | Safety gate: confirm it's the right account. |
| (vCPU quota check) | Confirms your account can fit the boxes; warns if not. |
| EC2 key pair | The SSH key for the boxes. Pick an existing one (those with a local .pem ✓ are ready to use) or create a new one (it's saved to ~/.ssh/<name>.pem and the path is printed). |
| VPC | Pick your default VPC, or create a dedicated one (a small CloudFormation network stack). |
| Subnet | Pick a public one, or create a new public subnet. |
| Security group | Create a new one (recommended: it opens exactly the ports the rig needs) or reuse an existing SG (it tells you which required ports are missing). |
| deploy this? | Final confirmation: shows the full plan (account, region, boxes, sizes). |

When it finishes it prints the box public IPs and writes ansible/inventory.ini (the list of boxes Ansible + the benchmark use).

Which boxes get built is the matrix. The default is mock nexus bifrost loadtest. To include all gateways, set these in deploy.env (or your shell) before deploy.sh:

DEPLOY_LITELLM=true DEPLOY_KONG=true DEPLOY_PORTKEY=true DEPLOY_TENSORZERO=true

3. Install the software on the boxes

cd ansible && ansible-playbook -i inventory.ini site.yml

This installs and configures every box (host-native, no containers): kernel tuning, PostgreSQL/Redis where needed, each gateway, and the load generator. It is idempotent — safe to re-run. It takes several minutes (the gateways download/build their bits).

Success looks like PLAY RECAP with failed=0 for every box, and each gateway's Smoke ... ok line (it confirms the gateway reaches the mock and gets chatcmpl-mock back). The Could not match supplied host pattern: litellm/kong/... lines are normal for gateways you didn't deploy — they're skipped, not failed.
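
If you save the playbook output to a log (for example by piping it through tee), the PLAY RECAP check described above can be automated. This is a minimal sketch that assumes the standard Ansible recap line format (host : ok=… unreachable=… failed=…); the function name recap_ok and the log path in the usage note are hypothetical.

```shell
# Read ansible-playbook output on stdin.
# Exit 0 only if a PLAY RECAP section exists and every host line in it
# shows failed=0 and unreachable=0; exit 1 otherwise.
recap_ok() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && /failed=/ {
      hosts++
      # Any non-zero counter starts with a digit 1-9.
      if ($0 ~ /failed=[1-9]/ || $0 ~ /unreachable=[1-9]/) bad++
    }
    END {
      if (hosts > 0 && bad == 0) exit 0
      exit 1
    }
  '
}
```

Usage: ansible-playbook -i inventory.ini site.yml | tee ~/site.log, then recap_ok < ~/site.log && echo "all boxes ok". It does not check the per-gateway Smoke lines, which you should still read by eye.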

To re-apply just one box/role later:

ansible-playbook -i inventory.ini site.yml --tags <role> --limit <role>   # e.g. --tags nexus --limit nexus

4. The boxes (what each one is)

All gateways forward to the shared mock, which always answers with id == chatcmpl-mock — that string is the end-to-end "routing works" signal. Every gateway is driven at the SAME underlying model so the comparison is apples-to-apples; only the model-id string each gateway expects differs.

| box | what it is | language | gateway port | model id clients send |
|---|---|---|---|---|
| mock | shared OpenAI-compatible upstream every gateway calls | Go | :3062 | mock-gpt-4o |
| nexus | the gateway under test: ai-gateway + Hub + control-plane + cp-ui (nginx) + PG/Redis/NATS. Full audit pipeline, content hooks, virtual keys. | Go | :3050 (API), :443 (admin UI) | mock-gpt-4o (needs a virtual key) |
| bifrost | comparison gateway | Go | :8080 | mock-provider/mock-gpt-4o (no auth) |
| litellm | comparison gateway | Python | :4000 | mock-gpt-4o |
| kong | comparison gateway (AI Gateway plugin) | Lua/OpenResty | :8000 (:8001 admin) | mock-gpt-4o |
| portkey | comparison gateway | Node | :8787 | mock-gpt-4o |
| tensorzero | comparison gateway | Rust | :3000 | tensorzero::model_name::mock-gpt-4o |
| loadtest | the load generator (the project's own Go loadtest, the only load tool) + the benchmark scripts | Go | drives all the others over the private subnet | |

The benchmark scripts know each gateway's port / path / model / auth automatically (hardcoded in scripts/bench/lib.sh), so you just name a gateway — you don't type ports.

The Nexus admin UI

  • Open https://<nexus-public-ip>/ (HTTPS, not HTTP).
  • The cert is self-signed → click Advanced → Proceed. (HTTPS is required: the login uses browser WebCrypto, which only works over HTTPS/localhost.)
  • Login: admin@nexus.ai / nexus-demo.
  • The gateway API on :3050 is plain HTTP and is what the load test hits (with a virtual key, handled for you).

5. Run a stress test

Where you run it — and how it actually works

The suite (scripts/bench/) is an orchestrator: you run it on ONE machine and it drives the others over SSH. It does NOT generate load itself — per run it SSHes to the loadtest box and runs the loadtest generator there (traffic originates in-subnet, right next to the gateways — no internet hop), SSHes to the gateway box to clean / set knobs / restart / health-check, then pulls the results back to wherever you launched it (results/<run-id>/).

So you only need ONE machine that can SSH to the boxes and has this repo + a generated inventory. Pick one:

  • A control box in the same VPC (recommended). A small EC2 box holding the repo; it reaches the gateways over the private network (immune to a slow home connection). gen-inventory auto-writes private IPs there (override BENCH_INV_IP=public|private).
  • Your laptop — works too, but every SSH crosses the internet (public IPs).
  • The loadtest box itself — the suite is also copied to /opt/perf/scripts/bench/.

The control box deploys with the stack by default (DeployControl=true). To add one to a rig that doesn't have it (reproducible — lands in the rig's subnet/SG, discovered from CloudFormation, nothing hardcoded):

scripts/spin-control-box.sh           # reads STACK/REGION/KEY_NAME from deploy.env; idempotent

It boots with git + ansible-core. Then, one-time ON the control box (the script prints these):

# bring the rig's private key over (used to SSH the gateway boxes):
scp -i ~/.ssh/<key>.pem ~/.ssh/<key>.pem ec2-user@<control-public-ip>:~/.ssh/   # then chmod 600
git clone git@github.com:AlphaBitCore/llm-gateway-benchmark.git llm-gateway-benchmark && cd llm-gateway-benchmark
ansible-galaxy collection install -r ansible/requirements.yml   # AL2023-compatible, pinned
aws configure                                                   # creds to read the stack
scripts/gen-inventory.sh <stack> ~/.ssh/<key>.pem <region>      # auto-writes private IPs in-VPC
cd ansible && ansible all -m ping -o                            # verify connectivity

There is no config file to create — the benchmark's fixed values live in scripts/bench/lib.sh; you vary a run with inline knobs only.

The easy way — one command, a report per tier

GATEWAY=bifrost scripts/bench/run-tiers.sh   # one gateway, all 6 prompt-size tiers
scripts/bench/run-all.sh                     # every deployed gateway (auto-discovered from inventory)

run-tiers.sh runs a full cycle for each tier (128 / 550 / 12.5k × non-SSE/SSE) under its own RUN_ID, so results/<gw>-<tier>/report.md is a standalone per-tier report. REGEN_INVENTORY=1 refreshes the box IPs from CloudFormation first (after a stop/start). No IPs are hardcoded. It runs unattended (auto-yes on the per-cycle clean/preflight confirm; BENCH_ASSUME_YES=0 to be prompted).

Re-running? Move the finished results out of results/ first, otherwise the second round is SKIPPED. Each cycle writes to results/<run-id>/, and the run ID is deterministic: <gw>-<tier> (plus -hooks<h>-bodies<b> for nexus), with no timestamp, so a second round reuses the exact same paths. To stay resumable after an interruption, run-tiers.sh skips any cycle whose results/<run-id>/report.md already exists (you'll see >> <run-id> already done — skip). That is what you want when resuming a half-finished suite, but it is a trap when you want fresh numbers: the already-done cycles are silently skipped and you keep the old report. So before a clean re-run, archive (don't delete) the previous round:

mv results results.$(date +%Y%m%d-%H%M)     # archive the whole finished round, then re-run
# …or just the gateway you're redoing:
mkdir -p results.prev && mv results/nexus-* results.prev/   # then GATEWAY=nexus run-tiers.sh
# …or force ONE cycle: delete just its results/<run-id>/ and re-run (run-tiers redoes only that).

Reports under results/ are the deliverable — mv them aside, don't lose them.
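
To see in advance which cycles a second round would skip, you can list the run IDs that already have a report. The sketch below relies only on the skip rule described above (a present results/<run-id>/report.md); the function name finished_runs is hypothetical and is not one of the rig's scripts.

```shell
# List the run IDs under a results directory that already contain report.md.
# These are the cycles run-tiers.sh would treat as "already done".
finished_runs() {
  results_dir=${1:-results}
  for report in "$results_dir"/*/report.md; do
    # An unmatched glob stays literal; skip it.
    [ -f "$report" ] || continue
    basename "$(dirname "$report")"
  done
}
```

Run finished_runs from the repo root before launching. If it prints anything and you want fresh numbers, archive those directories first.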

Long runs: detach so a dropped SSH doesn't kill it. A full suite takes many minutes (hours for nexus's matrix). Launch it in the background and tail the log, so closing your laptop or losing wifi doesn't SIGHUP it midway:

nohup env GATEWAY=bifrost scripts/bench/run-tiers.sh > ~/bifrost-tiers.log 2>&1 &
tail -f ~/bifrost-tiers.log      # Ctrl-C just stops tailing; the run keeps going
# reconnect later:  tail -f ~/bifrost-tiers.log   (or: pgrep -af run-tiers)

(or use tmux/screen if installed). Reports still land in results/<gw>-<tier>/.

Stopping a run early — kill BOTH the orchestrator AND the remote generator. The orchestration (run-tiers.sh/bench.sh) runs on the control box, but the actual loadtest generator runs on the loadtest box (launched over a detached SSH). So pkill -f scripts/bench/ on the control box stops the orchestrator but leaves the generator hammering the gateway — and a second run launched on top then double-loads the gateway (corrupts results, can OOM/freeze it). Always also kill the generator on the loadtest box and confirm the gateway has no in-flight connections:

# on the control box:
pkill -f 'scripts/bench/'
# then the generator on the loadtest box (resolve its IP from ansible/inventory.ini):
ssh <loadtest-ip> "pkill -9 -f /usr/local/bin/loadtest"
# verify the load actually stopped (gateway's listen port should drain to 0):
ssh <gateway-ip> "ss -tan | awk '\$1==\"ESTAB\" && \$4 ~ /:3050\$/{c++} END{print c+0}'"   # -> 0
# and confirm exactly one (or zero) run-tiers remains:
ssh <control-ip> "pgrep -af run-tiers"
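
The connection-count check above is easier to reuse as a small filter over ss -tan output. This is a sketch of the same awk logic with the port made a parameter; the function name count_estab is hypothetical.

```shell
# Read `ss -tan` output on stdin and print how many ESTABLISHED
# connections have the given LOCAL port (column 4 is Local Address:Port).
count_estab() {
  port=$1
  awk -v port="$port" '
    $1 == "ESTAB" && $4 ~ (":" port "$") { c++ }
    END { print c + 0 }
  '
}
```

Usage: ssh <gateway-ip> ss -tan | count_estab 3050 should print 0 once the load has stopped. Use the port of the gateway you were testing (see the boxes table in §4).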

Fully hands-off — every gateway × N rounds in one command

When you want the whole matrix unattended — every non-nexus gateway, repeated for run-to-run variance — scripts/bench/run-campaign.sh does it end-to-end. For each gateway it brings the box up + provisions, runs the full tier suite N times, deletes the per-request .jsonl after each round (they're the disk hog), archives the reports to v<r>/<gw>/, then brings the box down before the next one — one box at a time, so only mock + loadtest + the box under test are ever up.

GATEWAYS is required (no default — so the set under test is always spelled out in the command; nexus runs only if you name it).

# launch so an SSH logout can't kill it (the control box has lingering enabled):
systemd-run --user --unit=gw-campaign --working-directory="$PWD" \
  bash -lc 'GATEWAYS="bifrost litellm kong portkey tensorzero" scripts/bench/run-campaign.sh'
journalctl --user -u gw-campaign -f          # follow;  systemctl --user stop gw-campaign  to stop
# …or plain nohup (also survives logout thanks to lingering):
GATEWAYS="bifrost litellm kong portkey tensorzero" nohup scripts/bench/run-campaign.sh > ~/campaign.log 2>&1 &

| knob | default | meaning |
|---|---|---|
| GATEWAYS | required (no default) | which gateways to test, e.g. "bifrost litellm kong portkey tensorzero"; nexus only if you name it |
| SKIP | (none) | gateways to drop from GATEWAYS, e.g. SKIP="kong" |
| ROUNDS | 2 | how many rounds → archived as v1 … vN |
| ARCHIVE | ~/results-archive | reports land at ARCHIVE/v<r>/<gw>/<tier>/report.md |
| KEEP_UP | 0 | 1 = don't auto-down each gateway after its rounds |
| CLEAN_LOADTEST_JSONL | 1 | also purge .jsonl on the loadtest box over ssh |
| FORCE | 0 | 1 = start even if another run is live (DANGEROUS: double-loads the box) |

It runs for hours — check it's alive / how far it got:

cat ~/results-archive/HEARTBEAT            # refreshed every 30s; a STALE mtime means it died
cat ~/results-archive/CAMPAIGN_STATUS.txt  # full timeline of up / round / archive / down
  • Resumable. A round already in ARCHIVE/v<r>/<gw>/ is skipped; a gateway whose every round is archived is skipped without touching its box. Re-run after a crash/stop and it continues.
  • Disk-safe. Only report.md + summaries/csv/prom + monitor/ are kept (all small); the giant results-*.jsonl are deleted each round on both the control box and the loadtest box.
  • It won't start if another run-tiers/bench.sh is already running (would double-load the box; FORCE=1 overrides — don't).
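
The "stale mtime means it died" rule can be checked mechanically. The sketch below compares the HEARTBEAT file's modification time with the current time. The 90-second default is an assumption (three missed 30 s refreshes), and the function name heartbeat_fresh is hypothetical.

```shell
# heartbeat_fresh FILE [MAX_AGE_SECONDS] [NOW_EPOCH]
# Exit 0 if FILE was modified within MAX_AGE_SECONDS, 1 if stale or missing.
heartbeat_fresh() {
  file=$1
  max_age=${2:-90}                 # assumed threshold: 3 x the 30 s refresh
  now=${3:-$(date +%s)}            # overridable so the check can be tested
  [ -f "$file" ] || return 1
  # GNU stat (Linux) first, BSD stat (macOS) as a fallback.
  mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file") || return 1
  [ $((now - mtime)) -le "$max_age" ]
}
```

Usage: heartbeat_fresh ~/results-archive/HEARTBEAT || echo "campaign looks dead, check CAMPAIGN_STATUS.txt".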

The manual way — one cycle, full control

GATEWAY=nexus PROFILE=nonstream-550 STAGES="200:60s" scripts/bench/bench.sh

bench.sh runs a complete, repeatable cycle and confirms the target account/boxes before touching anything:

preflight (show account + target boxes) → confirm
  → clean    (wipe last round's data, verified empty)
  → setup    (nexus only: apply the hooks / bodies scenario knobs)
  → restart  (cold gateway processes)
  → health   (every gateway must route to the mock — gate)
  → cooldown (wait until the box is idle so the last round doesn't bleed in)
  → run      (the actual load) + monitor (server CPU/mem/disk/net)
  → verify-audit (per gateway: did its audit/log capture every request? — see §7)
  → report   (the numbers)

Choosing what to test

Set these inline — that's the whole list (there is no config file):

| knob | what it does | example |
|---|---|---|
| GATEWAY | which gateway to drive | GATEWAY=nexus |
| STAGES | the load / concurrency (see below) | STAGES="200:60s" |
| PROFILE | request shape: a bundled tier name (a single bench.sh cycle; run-tiers.sh runs all 6) | PROFILE=nonstream-550 |
| RUN_ID | names the results dir results/<RUN_ID>/ | RUN_ID=nexus-1 |
| NEXUS_HOOKS | nexus content scanning off/on (the headline comparison) | NEXUS_HOOKS=on |
| NEXUS_AUDIT_BODIES | nexus: on = the audit stores the full request+response bodies (the real prompt + completion) into traffic_event_payload, not just metadata (timestamp/model/tokens/latency/status). Bodies cost more (≈50 KB/request to copy+store), so off is the light default; turn on for the maximal "lossless even with full payloads" proof. | NEXUS_AUDIT_BODIES=on |

PROFILE — the prompt-size tier (run all of them)

Prompt SIZE decides what you measure, so there are three size tiers, each in non-streaming (throughput) and streaming/SSE (TTFT) form. Run them all and report each separately — one size hides half the story (a big body makes every gateway JSON-parse-bound and masks routing-core differences; a tiny body exposes them). Sizes follow the published benchmark shapes so the numbers are comparable to other gateways' public results.

| PROFILE= | in / out tokens | standard | what it measures |
|---|---|---|---|
| nonstream-128 / stream-128 | 128 / 128 | NVIDIA, vLLM | routing-core overhead: fixed per-request cost; the real RPS ceiling (tens of thousands) |
| nonstream-550 / stream-550 | 550 / 150 | LLMPerf, Anyscale | realistic chat: the headline number |
| nonstream / stream | ~12.5k / 64 | long-context / RAG | large-body path (JSON-parse / forward bound) |

(nonstream-* = non-SSE → reports RPS; stream-* = SSE → reports TTFT.) PROFILE is one of these six bundled names → scripts/bench/profiles/<name>.json. The load generator itself is the standalone OSS nexus-loadtest (its own repo — see its README/DESIGN for the profile JSON schema and CLI).

Profile JSON schema (to write your own). A profile is declarative JSON. The bundled tiers look like this (scripts/bench/profiles/nonstream-128.json, trimmed):

{
  "defaults": {
    "protocol": "openai-chat",
    "target":   "http://localhost:3050/v1/chat/completions",
    "headers":  { "Authorization": "Bearer REPLACE_WITH_VK" },
    "model":    "mock-gpt-4o-mini",
    "max_tokens": 128
  },
  "warmup": "15s",
  "cache_mode": "bust",
  "correlation": { "uuid_in_prompt": true, "header": "x-request-id" },
  "thresholds": { "ttft_p95_ms": 0, "p95_ms": 0, "error_rate": 0.02 },
  "stages":    [ { "concurrency": 50, "duration": "30s" } ],
  "scenarios": [
    { "name": "overhead-128-nonstream", "weight": 100, "turns": 1, "stream": false,
      "max_tokens": 128, "content": { "mode": "sized", "approx_input_tokens": 128 } }
  ]
}

The rig OVERRIDES target, model, the vk (auth header) and stages at run time — it points the generator at the gateway under test (resolved from the inventory) and applies the STAGES=… ladder. So in an external profile those are placeholders; what you actually control is the request shape:

| field | meaning |
|---|---|
| defaults.protocol | wire protocol: openai-chat (others exist in the tool) |
| defaults.max_tokens | output-token cap (a scenario may override per-mix) |
| warmup | warm-up window discarded before measurement |
| cache_mode | bust = every request unique (defeats prompt caching); keep it for fair numbers |
| correlation | inject a per-request UUID (prompt + header) for audit cross-checking |
| thresholds | pass/fail gates; 0 = report-only (don't gate) |
| scenarios[] | the traffic mix: each entry has weight (relative %), turns, stream, max_tokens |
| scenarios[].content.mode | sized (generate ~approx_input_tokens of input, which sets prompt size), pool (random from a prompts[] list), or scripted (fixed dialogue) |

STAGES — the concurrency / load control

Comma-separated stages, run in sequence. Three shapes (mixable):

| STAGES entry | meaning |
|---|---|
| 200:60s | closed-loop: 200 concurrent virtual users (VUs) for 60s; the max-throughput view |
| @4000:60s | open-loop: a fixed 4000 req/s arrival rate for 60s; the honest tail-latency vs offered-load view |
| @1000-8000:60s | open-loop ramp: arrival rate climbs 1000→8000 req/s (walks the latency knee) |
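
To double-check a STAGES string before a long run, the three shapes above can be told apart with plain shell pattern matching. This is an illustrative sketch, not the rig's own parser (which lives in the loadtest tool); the function name describe_stages is hypothetical.

```shell
# Print one human-readable line per comma-separated STAGES entry.
describe_stages() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r stage; do
    load=${stage%%:*}   # part before the first colon
    dur=${stage#*:}     # part after the first colon
    case $stage in
      @*-*:*) r=${load#@}
              echo "open-loop ramp: ${r%-*} to ${r#*-} req/s for $dur" ;;
      @*:*)   echo "open-loop: ${load#@} req/s for $dur" ;;
      *:*)    echo "closed-loop: $load VUs for $dur" ;;
      *)      echo "unrecognized: $stage" ;;
    esac
  done
}
```

For example, describe_stages "200:60s,@4000:60s,@1000-8000:60s" prints one closed-loop line, one open-loop line and one ramp line, in that order.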

How to size the ramp. Climb concurrency geometrically until the gateway box CPU saturates and RPS stops rising (the knee), then 1–2 points past it — going further is wasted queuing (latency balloons, RPS flat). The knee's concurrency depends on the tier (RPS = concurrency ÷ per-request latency, and latency differs ~100× across tiers), so the ladder is tier-aware, not one-size-fits-all:

| tier | closed-loop ramp | why |
|---|---|---|
| *-128, *-550 (small / mid) | 50:30s,100:30s,200:30s,400:30s,800:60s,1200:60s,1600:60s | low latency → knee is high; may not saturate even at 1600 VU → finish with the open-loop sweep below |
| nonstream / stream (~12.5k) | 50:30s,100:30s,200:60s,400:60s,800:60s | high latency → knee is low (~200–400 VU ≈ saturation); 2000 VU here is pure queuing, wasted minutes |
Open-loop sweep — the right way to find the small-tier ceiling and honest tails; extend the rates until error_rate climbs or the run goes INVALID:

STAGES="@5000:60s,@10000:60s,@20000:60s,@40000:60s,@60000:60s" PROFILE=nonstream-128 scripts/bench/bench.sh

Per-stage duration: 30s for sub-knee climbing stages (long enough to reach steady state + a stable throughput read); 60s at/near/past the knee and for every open-loop stage (stable p99/p99.9 — coordinated-omission correction needs a window). Warmup (15s) is already baked into the profile. Don't drop below 30s (noisy tails) or pile on >8 stages per run — geometric placement beats a linear every-200 ladder: same resolution near the knee, far less wall-clock.
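
To see how much wall-clock a ladder costs before launching it, sum the stage durations. This sketch assumes every duration is written in whole seconds with an s suffix, as in all the examples in this runbook. It counts measured stage time only (not warmup, clean, restart or cooldown), and ladder_seconds is a hypothetical helper name.

```shell
# ladder_seconds "50:30s,100:30s,..."
# Print the number of stages and the total measured seconds.
# Exits 1 if any entry's duration is not of the form <digits>s.
ladder_seconds() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F: '
    $2 !~ /^[0-9]+s$/ { bad = 1; next }
    { sub(/s$/, "", $2); total += $2; n++ }
    END {
      if (bad) exit 1
      printf "%d stages, %d s\n", n, total
    }
  '
}
```

The small/mid ladder above comes to "7 stages, 300 s" and the ~12.5k ladder to "5 stages, 240 s" of load per cycle.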

The nexus-only comparisons — hooks off/on and bodies off/on

nexus has two cost dimensions no other gateway has, and they're the headline comparisons:

  • NEXUS_HOOKS: off = the bare gateway (no content scanning); on = every request is content-scanned (the compliance cost).
  • NEXUS_AUDIT_BODIES: off = audit stores metadata only; on = stores the full request+response bodies (the heaviest no-loss-audit case).

run-tiers.sh sweeps these automatically for GATEWAY=nexus: every tier runs across hooks{off,on} × bodies{off,on} (4 combos/tier), each with a RUN_ID suffixed -hooks<h>-bodies<b>, so any pair is a clean diff:

GATEWAY=nexus scripts/bench/run-tiers.sh        # full matrix (restrict with NEXUS_AUDIT_BODIES=off)

# For a long high-RPS nexus run, launch it detached:
nohup env GATEWAY=nexus scripts/bench/run-tiers.sh > ~/nexus-tiers.log 2>&1 &
# Every cycle already deep-cleans (stop ai-gateway+hub → `nats stream purge` NEXUS_EVENTS →
# wipe the spill spool → TRUNCATE with both down → restart), so a run never starts on the
# previous cycle's audit backlog. After each run, verify-audit waits 2 min and counts the
# audit rows that landed vs requests sent (§7) — at ~40k RPS nexus won't be at 100% within
# 2 min, and that ratio IS the honest figure (it's a measurement, not a pass/fail gate).
# (To PROVE lossless audit you need a SUSTAINABLE rate — a low tier / rate-capped open-loop
# where the pipeline keeps up; see §7. Also: GOMEMLIMIT must be set on the gw — see the
# nexus Ansible role — or the SSE/12k tiers OOM-freeze it.)

# hooks off vs on at the 550 chat tier, bodies held off:
COMPARE="nexus-550-nonstream-hooksoff-bodiesoff nexus-550-nonstream-hookson-bodiesoff" scripts/bench/report.sh
# bodies off vs on, hooks held off:
COMPARE="nexus-550-nonstream-hooksoff-bodiesoff nexus-550-nonstream-hooksoff-bodieson" scripts/bench/report.sh

Or one cycle by hand:

RUN_ID=off NEXUS_HOOKS=off scripts/bench/bench.sh
RUN_ID=on  NEXUS_HOOKS=on  scripts/bench/bench.sh
COMPARE="off on" scripts/bench/report.sh        # off-vs-on delta table

Fairness: compare like-for-like. bifrost (no audit/scan) lines up against nexus hooks-off + bodies-off (also bare) for the routing-core race; hooks-on / bodies-on show what those features cost, reported separately. Never mix "nexus with audit on" against "a bare competitor". A Vectorscan validity gate runs automatically on hooks=on: it sends a PII probe and fails the run unless the response comes back redacted (a mis-built nexus binary can silently disable scanning, which would make hooks-on look artificially fast). If it fails, the nexus binary asset needs rebuilding with FAT_RUNTIME=OFF; escalate.


6. Read the results

The cycle prints a report and saves files under results/<run-id>/ on your control machine (pulled from the loadtest box's /var/log/perf-bench/<run-id>/):

| file | what it is |
|---|---|
| report.md | the human summary (per-stage RPS, ok%, p50/p95/p99, TTFT) |
| <gw>/summary-*.json | machine-readable totals (used by -compare) |
| <gw>/report-*.csv, *.prom | spreadsheet / Prometheus feeds |
| <gw>/results-*.jsonl | one line per request (latency breakdown, tokens, status) |
| monitor/ | per-box CPU/mem/disk/net during the run + a nexus CPU profile |

How to read it (important):

  1. Generator health is a validity gate — check it FIRST. report.md shows each gateway as OK or INVALID. INVALID means the load generator (not the gateway) ran out of FDs/ports or dropped records — those numbers don't count. Re-run; never report an INVALID run.
  2. Compare gateways by TTFT, not absolute latency. Every gateway calls the same mock, so the mock's latency is a constant in every number. It cancels in the gateway-to-gateway (or hooks-off-vs-on) TTFT delta — that delta is the gateway's own overhead. Absolute p99 is dominated by the mock.
  3. ITL (inter-token latency) only matters for the streaming profiles (stream-128, stream-550, stream); it's 0 for the nonstream-* profiles.

Before a second round, mv this round's results/ aside. Run IDs are deterministic (no timestamp), so run-tiers.sh treats a present results/<run-id>/report.md as "already done" and skips it — a re-run with the old results still in place keeps the stale report. See the archive snippet in §5.

Run-to-run regression check (non-zero exit on a regression — usable as a CI gate):

loadtest -compare results/<old>/<gw>/summary-*.json,results/<new>/<gw>/summary-*.json

7. Check the audit data (is the gateway lossless?)

A key claim is that Nexus's audit captures every request (and, with bodies-on, the full request/response payloads) even under load — without slowing down. The audit is asynchronous (gateway → NATS → Hub → PostgreSQL, plus an on-disk spool that a recovery sweeper replays), so you must let it fully drain before counting, or you'll undercount and think data was lost when it's just still flushing.

verify-audit.sh does this for you, for every gateway (fair — each gateway's own log table is checked the same way), and runs automatically inside bench.sh:

RUN_ID=<the run id> scripts/bench/verify-audit.sh

It waits 2 minutes for the async audit to settle, then for nexus reports two DISTINCT things (don't conflate them):

  • Pipeline loss: did the gw DROP any audit event it created? Measured against enqueued (the events the gw made), via its own dropped/poisoned counters AND the conservation check PG + NATS-backlog + spill ≈ enqueued. dropped=0 plus ≈100% of enqueued accounted for means lossless (every created event is in PG or durably queued). This is the headline claim.
  • Coverage: did every request get an audit event? enqueued vs sent. < 100% at the high tiers is failed/incomplete requests (nothing to audit), NOT loss. Read it against the stage ok%: low ok% explains low coverage; high ok% + low coverage = a real bug.

Why pipeline loss is measured vs enqueued, not sent: at high load many sent requests never complete (timeout/RST/5xx), so the gw legitimately makes fewer events than sent — counting that against the pipeline would falsely read as "lost data". For best-effort gateways (bifrost) there is no pipeline; their PG shortfall vs sent is a real by-design drop, shown plainly.
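
The two ratios are easy to mix up, so a small worked calculation may help. The sketch below computes coverage (enqueued ÷ sent) and conservation ((PG + NATS backlog + spill) ÷ enqueued) from six counters. It only illustrates the arithmetic described above and is not how verify-audit.sh is implemented; the function name audit_ratios, the argument order and the numbers in the usage note are all made up.

```shell
# audit_ratios SENT ENQUEUED PG_ROWS NATS_BACKLOG SPILL DROPPED
# coverage     = enqueued / sent                      (did requests yield events?)
# conservation = (pg + nats + spill) / enqueued       (are created events accounted for?)
audit_ratios() {
  awk -v sent="$1" -v enq="$2" -v pg="$3" -v nats="$4" -v spill="$5" -v dropped="$6" '
    BEGIN {
      if (sent <= 0 || enq <= 0) exit 1
      printf "coverage: %.1f%% (enqueued / sent)\n", 100 * enq / sent
      printf "conservation: %.1f%% ((PG + NATS backlog + spill) / enqueued)\n",
             100 * (pg + nats + spill) / enq
      printf "dropped: %d\n", dropped
    }
  '
}
```

With the invented inputs audit_ratios 1000000 950000 900000 40000 10000 0, coverage is 95.0% (5% of requests never produced an event, to be read against ok%), conservation is 100.0% and dropped is 0. That is the lossless-pipeline case even though PG alone holds fewer rows than were sent.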

With NEXUS_AUDIT_BODIES=on it also checks the request/response bodies landed (traffic_event_payload).

What the bodies flag does and doesn't gate. NEXUS_AUDIT_BODIES only gates capture of the user content (the full request prompt + the completion) — the heavy, optional part. Gateway-generated error responses are always recorded regardless of the flag (their synthetic error envelope carries no user content and is the most useful thing to have when a request fails — kept for traceability). So on a clean 100%-ok run, bodies-off truly captures no payloads; if requests error, you'll still see those error envelopes in the audit — by design.

To look manually (on the nexus box): psql is local —

PGPASSWORD=benchpass psql -h localhost -U bench -d gatewaybench -c "SELECT count(*) FROM traffic_event;"

(clean.sh truncates this before each run, so the count is that run's.)

Audit completeness vs. high RPS — read this before trusting an INCOMPLETE. No gateway lands its per-request log in real time at tens-of-thousands of RPS — the async pipeline (queue → store) is designed to absorb the burst and catch up afterwards, and at ~40k RPS it catches up only slowly (the backlog drains long after the load stops). So "captured == sent" is not a meaningful discriminator at the high tiers — everyone is INCOMPLETE there. Two consequences for how we run:

  • To actually verify a lossless claim, test where the pipeline can keep up: a low-RPS tier (or open-loop capped at a sustainable rate). The honest durability metric is the highest sustained rate at which the backlog stays bounded, not a 40k-RPS pass/fail. Above that, durability is just each gateway's stated guarantee (block/no-drop vs spill vs best-effort) — there's no end-to-end way to prove it at that load.
  • At the high tiers it's a measurement, not a gate: verify-audit always just waits 2 min and reports the ratio (cheap). At ~40k RPS that ratio will be < 100% — report throughput/latency as the headline and the audit ratio as context, and say so.

Because that backlog also outlives the run, a plain TRUNCATE between runs would not stick: the gateway keeps flushing into the table right after it. So every run deep-cleans, for every gateway alike. clean.sh stops the gateway's services (dropping the in-memory buffer), purges any durable backlog, TRUNCATEs while the services are down, then restarts them, so each run starts verified-empty. For nexus the purge step is nats stream purge on the NEXUS_EVENTS stream plus wiping the spill spool; deleting the JetStream store dir does not work, because nats restores it on restart.


8. Save money — stop / start / tear down

Boxes cost money while running. You don't need them all up at once.

# interactive: pick an account/region, list the boxes, select one or more, stop/start/terminate
scripts/boxes.sh

# scriptable, by role (no menus):
scripts/box.sh state                  # list every box: role / state / public IP
scripts/box.sh stop nexus bifrost     # stop boxes (frees compute; disk kept; restart is fast)
scripts/box.sh start nexus            # start them again

# full teardown of everything (deletes the stack):
./down.sh

stop vs terminate. Stop frees the compute charge, keeps the disk, and you can restart in seconds — use this between tests. Terminate deletes the box; because the boxes are CloudFormation-managed, terminating one outside CFN drifts the stack, so for a clean full teardown use ./down.sh.

After stopping+starting a box its PUBLIC IP changes → re-generate the inventory:

scripts/gen-inventory.sh gw-bench ~/.ssh/<your-key>.pem us-east-1

(Private IPs are kept, so gateway↔mock traffic is unaffected; only SSH/targets need the refresh.)


9. Troubleshooting

| symptom | cause / fix |
| --- | --- |
| deploy.sh: unbound variable | an ancient bash (e.g. the 3.2 that ships with macOS) tripped an older copy of the scripts; the current scripts handle bash 3.2, so re-pull if you see this. |
| ansible: Could not match host pattern: kong/... | normal: that gateway isn't deployed (DEPLOY_*=false). Not an error. |
| smoke fails / not chatcmpl-mock | the gateway can't reach the mock; check the mock box is up and the SG allows the port. |
| nexus services not running | re-run --tags nexus --limit nexus; check systemctl status nexus-ai-gateway nexus-hub nexus-control-plane nats on the box (note: the unit is nexus-hub, the binary is nexus-hub). |
| verify-audit shows INCOMPLETE | either the audit truly dropped under load, or the backlog had not finished draining within verify-audit's 2-min settle. Expected at the high tiers; to tell the two apart, re-run at a rate the pipeline can keep up with. |
| can't ssh ec2-user@<box> | use -i ~/.ssh/<key>.pem, or add your pubkey via BENCH_SSH_PUBLIC_KEYS in deploy.env and re-provision. |

10. Parameter reference (every knob + how to set it)

Three layers, set in three places. Precedence everywhere: a value exported in your shell / on the command line WINS over the file default.
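One common way to get "the environment wins over the file default" in bash is the ${VAR:=default} expansion. The sketch below illustrates that idiom with three defaults from the deploy.env table; load_defaults is a hypothetical stand-in, and the rig's actual loader may work differently.

```shell
#!/usr/bin/env bash
# Sketch of "environment wins over file default" using ${VAR:=default}.
# load_defaults is hypothetical, not the rig's actual loader.
load_defaults() {
  # ':=' assigns only when the variable is unset or empty, so anything
  # already set by the caller is left untouched.
  : "${REGION:=us-east-1}"
  : "${STACK:=gw-bench}"
  : "${NET_STACK:=${STACK}-net}"
  echo "REGION=$REGION STACK=$STACK NET_STACK=$NET_STACK"
}

# No overrides: the defaults apply.
( unset REGION STACK NET_STACK; load_defaults )
# Caller sets STACK first; the derived NET_STACK default follows it.
( unset REGION NET_STACK; STACK=my-bench; load_defaults )
```

The subshells only keep the two demonstrations from leaking variables into each other.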

10.1 Bring-up / provision — deploy.env (copy from deploy.env.example)

Read by ./deploy.sh. Anything you leave unset, deploy.sh asks for interactively.

| variable | default | what it controls |
| --- | --- | --- |
| AWS_PROFILE | (your default) | which AWS account to deploy into |
| REGION | us-east-1 | AWS region |
| STACK | gw-bench | CloudFormation stack name |
| NET_STACK | ${STACK}-net | name of the optional "create a VPC" network stack |
| KEY_NAME | llm-bench-key | EC2 key pair name |
| KEY_FILE | ~/.ssh/${KEY_NAME}.pem | local private key path |
| VPC_ID | (pick/auto) | existing VPC; blank → choose/create at the prompt |
| SUBNET_ID | (pick/auto) | existing public subnet; blank → choose/create |
| SECURITY_GROUP_ID | (blank) | reuse an SG; blank → the stack creates one with the right ports |
| ADMIN_CIDR | (your IP/32) | who may SSH (:22). Blank → auto-detect your IP |
| ACCESS_CIDR | 0.0.0.0/0 | who may reach gateway/UI/mock ports |
| GATEWAY_TYPE | c6i.4xlarge | instance type for gateway boxes |
| AUX_TYPE | c6i.4xlarge | instance type for mock + loadtest boxes |
| VOLUME_GIB | 120 | root disk size |
| VOLUME_IOPS | 12000 | root disk IOPS (gp3; >= 4x the throughput MB/s) |
| VOLUME_THROUGHPUT | 1000 | root disk throughput MB/s (gp3, 125-2000), applied online post-launch by scripts/set-ebs-throughput.sh (deploy.sh runs it; EC2::Instance can't set gp3 Throughput inline). The volume is sized so it's never the bottleneck; the real cap is the instance EBS bandwidth (~593 MiB/s sustained / ~1250 burst on c6i.4xlarge). For more sustained write, bump GATEWAY_TYPE, not this |
| DEPLOY_MOCK/NEXUS/BIFROST/LOADTEST | true | which boxes to build |
| DEPLOY_LITELLM/KONG/PORTKEY/TENSORZERO | false | the other gateways (set true to include) |
| BENCH_SSH_PUBLIC_KEYS | (empty) | your ssh-… pubkey(s), ;-separated → passwordless login on every box |
| ASSUME_YES | 0 | 1 = skip all confirmation prompts (CI) |
| PROVISION | 0 | 1 = also run Ansible right after the boxes are up |
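The two disk constraints in the table (throughput within 125-2000 MB/s, IOPS at least 4x the throughput) are easy to check before a deploy. The following is a minimal sketch; check_gp3 is a hypothetical helper and is not something deploy.sh is known to run.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the VOLUME_IOPS / VOLUME_THROUGHPUT pair,
# using only the two constraints stated in the table above.
check_gp3() {
  local iops=$1 tput=$2
  if [ "$tput" -lt 125 ] || [ "$tput" -gt 2000 ]; then
    echo "throughput $tput MB/s outside 125-2000"
    return 1
  fi
  local min_iops=$(( tput * 4 ))
  if [ "$iops" -lt "$min_iops" ]; then
    echo "iops $iops too low: need >= $min_iops for $tput MB/s"
    return 1
  fi
  echo "ok"
}

check_gp3 12000 1000          # the table defaults
check_gp3 3000 1000 || true   # too few IOPS for the requested throughput
```

With the defaults, 1000 MB/s needs at least 4000 IOPS, so 12000 leaves ample headroom.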

10.2 Per-run — inline env knobs (no config file)

The benchmark has no config file. Set any of these inline (GATEWAY=… PROFILE=… scripts/bench/run-tiers.sh); they default in scripts/bench/lib.sh. These six are the whole list:

| variable | default | what it controls |
| --- | --- | --- |
| GATEWAY | bifrost | which gateway to drive (nexus/bifrost/litellm/kong/portkey/tensorzero) |
| STAGES | 50:30s,100:30s,200:30s,400:30s,800:60s | load/concurrency: closed N:dur, open @rate:dur, ramp @from-to:dur → §5 |
| PROFILE | nonstream-550 | request shape (single bench.sh cycle; run-tiers.sh runs all 6 bundled tiers) → §5 |
| RUN_ID | timestamp | label for this run's output dir + compare |
| NEXUS_HOOKS | off | nexus content scanning off/on (run-tiers.sh sweeps both) |
| NEXUS_AUDIT_BODIES | off | nexus full request/response body capture off/on (run-tiers.sh sweeps both) |
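The three STAGES forms can be told apart by the leading @ and the dash in the load part. The sketch below classifies each comma-separated entry; classify_stage and describe_stages are hypothetical helpers, the real parsing lives in the load generator and may differ, and reading the closed-loop N as a concurrency level is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical classifier for one STAGES entry (load:duration).
classify_stage() {
  local spec=$1 load dur
  load=${spec%%:*}   # text before the first ':'
  dur=${spec#*:}     # text after the first ':'
  case $load in
    @*-*) load=${load#@}
          echo "ramp from=${load%%-*} to=${load#*-} dur=$dur" ;;
    @*)   echo "open rate=${load#@} dur=$dur" ;;
    *)    echo "closed concurrency=$load dur=$dur" ;;
  esac
}

# Split a full STAGES value on commas and classify each entry.
describe_stages() {
  local IFS=, stage
  for stage in $1; do
    classify_stage "$stage"
  done
}

# Example value mixing all three forms (illustrative, not a bundled default).
describe_stages '50:30s,@2000:60s,@1000-40000:120s'
```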

Everything else is a fixed rig constant or always-on policy, hardcoded in lib.sh — not a knob: per-gateway PORT_/PATH_/MODEL_/AUTH_, MOCK_MODEL/PORT, PG_*, AUDIT_TABLE_* / CLEAN_TABLES_*, REMOTE_OUT/LOCAL_OUT, the cooldown thresholds, the 2-min audit settle, deep-clean (always on, for fairness), and the nexus PII/redaction scan gate (always on). To change one of those, edit lib.sh.

10.3 Request shape — the profile JSON (scripts/bench/profiles/*.json)

Fixes HOW each request looks (so every gateway is measured identically). target / model / auth header / stages are overridden per-run by run.sh; edit the rest to change the workload:

| field | meaning |
| --- | --- |
| defaults.max_tokens | response size cap |
| warmup | leading window excluded from steady-state stats (e.g. "10s") |
| cache_mode | bust = UUID-front cache-bust (the benchmark default) |
| reuse_body | marshal the body once, reuse zero-copy (cache-OFF runs only; removes the per-request 50 KB marshal that caps generator RPS) |
| arrival | open-loop inter-arrival: uniform (default) or poisson (bursty, harsher honest tails) |
| openloop_max_inflight | cap on outstanding open-loop requests (default 50000) |
| live_interval | print a rolling RPS/p50/p99 line every interval (e.g. "10s"; "0" off) |
| timeout / think_time | per-request timeout / inter-turn pause |
| capture_error_body | store non-200 bodies in the JSONL |
| thresholds | ttft_p95_ms / p95_ms / error_rate SLOs + abort_on_fail (stop the sweep at the first breach) |
| scenarios[] | weight, turns, stream, max_tokens, content.{mode,approx_input_tokens}, and checks (contains / token-range response assertions) |

10.4 The generator binary — loadtest flags

run.sh builds these for you; useful for an ad-hoc test: --config <profile.json> · --target <url> · --model <id> · --vk <token> · --stages '<spec>' · --out <dir> · --compare old.json,new.json · --regress-pct <N> (regression threshold for --compare).
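For an ad-hoc run, the flags above can be assembled by hand. The sketch below only prints the command it would run. The binary name loadtest on PATH, the profile filename, the target URL, the model id and the token are all placeholder assumptions; substitute the values for your rig.

```shell
#!/usr/bin/env bash
# Sketch: assemble an ad-hoc generator command from the flags listed above.
# Every concrete value here is a placeholder, and "loadtest" as the binary
# name is an assumption.
build_loadtest_cmd() {
  local profile=$1 target=$2 stages=$3 out=$4
  local -a cmd=(loadtest
    --config "$profile"
    --target "$target"
    --model "${MODEL:-placeholder-model}"
    --vk "${VK:-placeholder-token}"
    --stages "$stages"
    --out "$out")
  # %q quotes each argument so the printed line can be pasted into a shell.
  printf '%q ' "${cmd[@]}"
  echo
}

build_loadtest_cmd scripts/bench/profiles/nonstream-550.json \
  http://10.0.0.10:8080/v1/chat/completions '@500:30s' /tmp/adhoc-run
```

Printing rather than executing keeps the sketch safe to try on a machine that has no generator installed.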

10.5 Provision-time tuning — ansible/group_vars/all.yml (advanced)

Applied by Ansible; change here + re-provision (these are the gateway's internals, not per-run knobs):

| variable | default | what it controls |
| --- | --- | --- |
| nexus_gomemlimit_pct | 70 | ai-gateway GOMEMLIMIT as a % of box RAM (must be explicit on bare EC2) |
| nexus_upstream_max_idle_conns_per_host | 2000 | gw→mock keepalive pool per host (avoids socket churn at high concurrency) |
| nexus_audit_max_queued_records | 100000 | gateway in-heap audit queue depth (burst absorbed before drops) |
| nexus_audit_frame_max_bytes | 262144 | audit publish frame size (256 KB) |
| nexus_audit_spool_dir | /var/lib/nexus/audit-spool | on-disk audit spill spool |
| nexus_audit_spool_max_total_mb | 51200 | on-disk spill buffer cap (50 GiB) |
| ports.<gw> | per gateway | gateway listen ports (must match the SG) |
| pg_bench_db/user/password | gatewaybench/bench/benchpass | standardized datastore creds |

Audit loss-mode/codec/compression and Vectorscan scan-sizing are not set here — they ship as the gateway's code defaults (a stock deploy is zero perf-config). Override them via the gateway's own env only for an A/B baseline.
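As a worked example of the nexus_gomemlimit_pct row, the sketch below shows the arithmetic of turning a percentage of box RAM into a byte value, which is a form GOMEMLIMIT accepts. gomemlimit_bytes is a hypothetical helper; the Ansible role may compute the value differently.

```shell
#!/usr/bin/env bash
# Sketch of the arithmetic behind a percent-of-RAM GOMEMLIMIT.
# gomemlimit_bytes is hypothetical, not taken from the Ansible role.
gomemlimit_bytes() {
  local mem_total_kib=$1 pct=$2
  echo $(( mem_total_kib * 1024 * pct / 100 ))
}

# On a Linux box, MemTotal in /proc/meminfo is reported in KiB.
if [ -r /proc/meminfo ]; then
  total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  echo "GOMEMLIMIT=$(gomemlimit_bytes "$total_kib" 70)"
fi

# Worked example: 32 GiB of RAM at the default 70%.
gomemlimit_bytes $(( 32 * 1024 * 1024 )) 70   # prints 24051816857
```

The kernel reports slightly less than the nominal RAM size, so the value computed on a real box will be a little below the worked example.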

Rules (do not break)

  • Never change a box by hand. Every change goes through CloudFormation (cloudformation/) or an Ansible role (ansible/roles/) and must be reproducible by re-running. Fix/re-apply with ansible-playbook -i inventory.ini site.yml --limit <host> --tags <role>; change ports/SG by editing the template + aws cloudformation update-stack.
  • Secrets are never committed. deploy.env holds your local config/credentials and is gitignored — only its *.example is committed. The benchmark itself has no config file (rig constants are in scripts/bench/lib.sh; run knobs are inline env vars).