A strictly linear guide: type each command in order, top to bottom. No branching, no
decisions. It assumes the rig stack (gw-bench) is already deployed and includes a control
box (it does by default — DeployControl=true). Everything runs on the control box.
Defaults used below — change only if your rig differs:
STACK=gw-bench · REGION=us-east-1 · key llm-bench-key · gateway under test bifrost.
# the control box's current public IP (it changes if the box is stopped/started)
aws ec2 describe-instances --region us-east-1 \
--filters "Name=tag:aws:cloudformation:stack-name,Values=gw-bench" \
"Name=tag:Role,Values=control" "Name=instance-state-name,Values=running" \
--query "Reservations[].Instances[].PublicIpAddress" --output text
# SSH in with the stack key (use the IP printed above)
ssh -i ~/.ssh/llm-bench-key.pem ec2-user@<CONTROL_PUBLIC_IP>
Connection closed by … port 22 right after a fresh deploy? The box is still running cloud-init (it regenerates SSH host keys and restarts sshd during init). This is not a security-group issue: you reached sshd, and a blocked port would time out. Wait ~1–2 min and retry. (A timeout instead means your IP isn't in the stack's AdminCidr; re-deploy with your IP.)
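The IP lookup and the wait-and-retry advice above can be wrapped in two small helpers. This is a minimal sketch, not something the repo ships: the function names control_ip and retry, the attempt count, and the delay are all assumptions. The aws query is the one shown above, and REGION / STACK fall back to the defaults listed at the top.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repo).

# Print the control box's current public IP, using the same query as above.
control_ip() {
  aws ec2 describe-instances --region "${REGION:-us-east-1}" \
    --filters "Name=tag:aws:cloudformation:stack-name,Values=${STACK:-gw-bench}" \
              "Name=tag:Role,Values=control" "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].PublicIpAddress" --output text
}

# retry ATTEMPTS DELAY_SECONDS CMD...
# Run CMD until it succeeds or the attempts run out.
retry() {
  local attempts delay i
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    if [ "$i" -lt "$attempts" ]; then
      echo "attempt $i/$attempts failed; retrying in ${delay}s" >&2
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (assumed values: 12 attempts, 10 s apart, to cover the cloud-init window):
#   retry 12 10 ssh -i ~/.ssh/llm-bench-key.pem -o ConnectTimeout=5 "ec2-user@$(control_ip)" true
```

The ConnectTimeout option keeps a blocked port from hanging each attempt, so a persistent timeout (the AdminCidr case) still surfaces quickly instead of being mistaken for cloud-init.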
Everything from here runs on the control box.
# clone the repo (skip if already cloned)
git clone git@github.com:AlphaBitCore/llm-gateway-benchmark.git ~/llm-gateway-benchmark
cd ~/llm-gateway-benchmark
# generate this box's SSH key + authorize it on every rig box (over SSM; ~30s)
scripts/control-ssh-setup.sh
# install Ansible collections + write the inventory (uses this box's own key)
scripts/control-bootstrap.sh
If you ever stop/start rig boxes, re-run scripts/control-bootstrap.sh to refresh the inventory (private IPs are stable, but it re-checks the stack).
cd ~/llm-gateway-benchmark/ansible
ansible-playbook site.yml --limit mock,bifrost,loadtest
This provisions the shared mock, the bifrost gateway, and the load generator. Each finishes
with a smoke check that proves it reaches the mock (chatcmpl-mock). All three must end
failed=0. (To also do nexus later: --limit mock,nexus,loadtest.)
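The failed=0 requirement can be checked mechanically instead of by eye. The sketch below assumes the standard ansible-playbook PLAY RECAP line format (host : ok=… changed=… unreachable=… failed=…); the function name recap_clean and the log path in the usage comment are hypothetical. It also treats unreachable>0 as a failure, which goes slightly beyond the failed=0 rule stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo).
# Reads ansible-playbook output on stdin. Succeeds only if at least one
# recap line was seen and no host reports failed>0 or unreachable>0.
recap_clean() {
  awk '
    /failed=[0-9]+/              { seen = 1 }
    /(failed|unreachable)=[1-9]/ { bad = 1 }
    END                          { exit !(seen && !bad) }
  '
}

# Example (the log path is an assumption):
#   ansible-playbook site.yml --limit mock,bifrost,loadtest | tee ~/provision.log
#   recap_clean < ~/provision.log && echo "all hosts ended failed=0"
```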
run-tiers.sh runs all 6 profiles for one gateway — the three prompt sizes
(128 / 550 / 12k) each in non-stream (throughput/RPS) and stream (TTFT) form — back to
back, one full cycle and one report per profile. Detached, so a dropped SSH won't kill it;
it runs unattended (each cycle's clean wipes data without prompting).
cd ~/llm-gateway-benchmark
nohup env GATEWAY=bifrost scripts/bench/run-tiers.sh > ~/bifrost-tiers.log 2>&1 &
# watch progress (all 6 profiles take tens of minutes; the 12k pair is the slow part)
tail -f ~/bifrost-tiers.log
It writes one report per profile:
| profile | report |
|---|---|
| nonstream-128 | results/bifrost-128-nonstream/report.md |
| stream-128 | results/bifrost-128-stream/report.md |
| nonstream-550 | results/bifrost-550-nonstream/report.md |
| stream-550 | results/bifrost-550-stream/report.md |
| nonstream (12k) | results/bifrost-12k-nonstream/report.md |
| stream (12k) | results/bifrost-12k-stream/report.md |
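Once the detached run has finished, a quick way to confirm that all six reports in the table exist is to walk the size × mode grid. This is a sketch under assumptions: the function name missing_reports is hypothetical, and the results/<gateway>-<size>-<mode>/report.md layout is taken from the bifrost rows above and only assumed to hold for other gateways.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo).
# missing_reports GATEWAY [RESULTS_DIR]
# Prints every expected report.md that is absent or empty.
# Returns 0 when all six are present, 1 otherwise.
missing_reports() {
  local gw root size mode path missing
  gw=$1
  root=${2:-results}
  missing=0
  for size in 128 550 12k; do
    for mode in nonstream stream; do
      path="$root/$gw-$size-$mode/report.md"
      if [ ! -s "$path" ]; then
        echo "$path"
        missing=1
      fi
    done
  done
  return "$missing"
}

# Example, run from ~/llm-gateway-benchmark:
#   missing_reports bifrost && echo "all 6 reports written"
```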
One profile only (faster; skips 128 and the slow 12k): use bench.sh, which runs a SINGLE profile (default nonstream-550):
BENCH_ASSUME_YES=1 GATEWAY=bifrost RUN_ID=bifrost-550 PROFILE=nonstream-550 \
nohup scripts/bench/bench.sh > ~/bifrost-550.log 2>&1 &
Rule of thumb: run-tiers.sh = all 6 profiles, bench.sh = one profile.
cd ~/llm-gateway-benchmark
ls -t results/ # one dir per profile (e.g. bifrost-550-nonstream)
# the headline number (550 non-stream throughput); repeat for the other 5 dirs above
cat results/bifrost-550-nonstream/report.md
Read each report top-down: first the generator-health validity gate. If the load generator itself was the bottleneck (FD/port exhaustion), the row is flagged and those numbers don't count (re-run). Then per-stage RPS / ok% / p99 (non-stream profiles) or TTFT (stream profiles).
cd ~/llm-gateway-benchmark
scripts/box.sh stop bifrost loadtest # keep mock + control if you'll test more soon
# or stop everything except the control box:
# scripts/box.sh stop mock bifrost loadtest
Stopped boxes keep their disk; restart with scripts/box.sh start <role> then re-run
step 1's scripts/control-bootstrap.sh (public IPs change, private IPs don't).
To test gateways one at a time (cost/quota discipline: only the mock + the box under test need
to run), don't keep all six up. scripts/gateway.sh manages a single gateway box's whole
lifecycle — stack create/terminate + Ansible provision + refreshing the control box's
inventory, bench targets, and passwordless SSH — so after it returns, step 3 just works.
The common move — finish one gateway, start the next — is one command:
cd ~/llm-gateway-benchmark
scripts/gateway.sh swap bifrost litellm # terminate bifrost, then create + provision litellm
Then benchmark the new one exactly as in step 3:
nohup env GATEWAY=litellm scripts/bench/run-tiers.sh > ~/litellm-tiers.log 2>&1 &
All the commands (gw ∈ nexus bifrost litellm kong portkey tensorzero), one example each:
cd ~/llm-gateway-benchmark
# DOWN one box — CFN-terminate it; frees its vCPU + cost (do this before bringing another up)
scripts/gateway.sh down bifrost
# UP one box — CFN-create it (if absent) + provision (common + datastore + role); also refreshes
# the control box's inventory / bench targets / passwordless SSH so step 3 just works
scripts/gateway.sh up litellm
# REDEPLOY one box — re-apply STACK + ANSIBLE in place, without terminating: pushes updated
# template / role / artifacts (e.g. a fresh gateway binary or config) to an already-up box
scripts/gateway.sh redeploy nexus
# REBUILD one box — terminate and recreate it FRESH, then provision (clean slate from scratch)
scripts/gateway.sh rebuild kong
# SWAP — down <old> then up <new> in one go (finish one gateway, start the next)
scripts/gateway.sh swap bifrost litellm
| command | what it does |
|---|---|
| gateway.sh down <gw> | CFN-terminate the box → frees its vCPU + cost |
| gateway.sh up <gw> | CFN-create the box (if absent) + provision it |
| gateway.sh redeploy <gw> | re-apply stack + Ansible in place (push updated template / role / artifacts to an already-up box; no terminate) |
| gateway.sh rebuild <gw> | terminate + recreate fresh, then provision (clean slate) |
| gateway.sh swap <old> <new> | down <old> then up <new> |
Every command ends by re-running control-ssh-setup.sh + gen-inventory.sh, so the new box's (possibly new) private IP lands in inventory.ini, host_targets.env, and ~/.ssh/config, and passwordless SSH is asserted; you never hand-edit IPs. The load generator reads its --target from the inventory at run time, so there's nothing static to update on the loadtest box.
down frees vCPU so the next up has quota (that's why swap goes down-then-up). Needs spare vCPU for up; if the account is full, down something first.
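To work through several gateways in a row, the swap-then-benchmark pair repeats once per gateway. The sketch below only prints that plan so it can be reviewed before anything is terminated. The names GATEWAYS, is_gateway, and plan_sequence are hypothetical, the gateway list is copied from the one above, and the per-gateway log name follows the ~/<gw>-tiers.log pattern used in step 3.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repo).

# Gateways accepted by scripts/gateway.sh, per the list above.
GATEWAYS="nexus bifrost litellm kong portkey tensorzero"

# Succeeds if $1 is one of the known gateway names.
is_gateway() {
  case " $GATEWAYS " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# plan_sequence CURRENT NEXT...
# Prints, in order, the swap + benchmark commands needed to move from the
# currently deployed gateway through each NEXT one. Runs nothing itself.
plan_sequence() {
  local prev gw
  prev=$1
  shift
  for gw in "$prev" "$@"; do
    is_gateway "$gw" || { echo "unknown gateway: $gw" >&2; return 1; }
  done
  for gw in "$@"; do
    echo "scripts/gateway.sh swap $prev $gw"
    echo "env GATEWAY=$gw scripts/bench/run-tiers.sh > ~/$gw-tiers.log 2>&1"
    prev=$gw
  done
}

# Example: bifrost is up now; benchmark litellm, then kong.
#   plan_sequence bifrost litellm kong
```

The printed benchmark commands run in the foreground on purpose: each swap has to wait for the previous run to finish. To keep the whole sequence alive across a dropped SSH session, the reviewed plan could be saved to a script and launched under nohup as a single job.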
- Another gateway: use scripts/gateway.sh swap <old> <new> (§6), then GATEWAY=<new> in step 3. Or, if the box is already provisioned, just change GATEWAY= in step 3.
- Nexus's content-scanning / full-audit scenarios: GATEWAY=nexus automatically sweeps NEXUS_HOOKS (scanning off/on) × NEXUS_AUDIT_BODIES (full bodies off/on) across all tiers.
The only knobs you ever set are GATEWAY, PROFILE, STAGES, RUN_ID, NEXUS_HOOKS,
NEXUS_AUDIT_BODIES — explained in ../scripts/bench/README.md.
Full methodology + A/B diffs: LOADTEST-RUNBOOK.md. Control-box
internals (IAM, key distribution, troubleshooting): CONTROL-BOX.md.
Already deployed and just pulled new code? What to re-apply after a git pull:
UPDATING-AN-EXISTING-RIG.md.