Benchmark runbook — run everything from the control box (copy-paste)

A strictly linear guide: type each command in order, top to bottom. No branching, no decisions. It assumes the rig stack (gw-bench) is already deployed and includes a control box (it does by default — DeployControl=true). Everything runs on the control box.

Defaults used below — change only if your rig differs: STACK=gw-bench · REGION=us-east-1 · key llm-bench-key · gateway under test bifrost.


0. Get onto the control box (run on YOUR LAPTOP, once)

# the control box's current public IP (it changes if the box is stopped/started)
aws ec2 describe-instances --region us-east-1 \
  --filters "Name=tag:aws:cloudformation:stack-name,Values=gw-bench" \
            "Name=tag:Role,Values=control" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" --output text
# SSH in with the stack key (use the IP printed above)
ssh -i ~/.ssh/llm-bench-key.pem ec2-user@<CONTROL_PUBLIC_IP>
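
If your rig differs from the defaults, the same lookup can be parameterized instead of edited by hand. A minimal sketch, assuming a hypothetical helper named control_ip (not part of the repo) that reads STACK and REGION from the environment and falls back to the defaults listed above:

```shell
# Print the control box's current public IP.
# control_ip is a hypothetical helper; STACK and REGION fall back to the
# runbook defaults (gw-bench, us-east-1) when unset.
control_ip() {
  aws ec2 describe-instances --region "${REGION:-us-east-1}" \
    --filters "Name=tag:aws:cloudformation:stack-name,Values=${STACK:-gw-bench}" \
              "Name=tag:Role,Values=control" "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].PublicIpAddress" --output text
}

# Usage (on your laptop):
#   ssh -i ~/.ssh/llm-bench-key.pem "ec2-user@$(control_ip)"
```

Because the IP is looked up at call time, the same line keeps working after the control box is stopped and started.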

Connection closed by … port 22 right after a fresh deploy? The box is still running cloud-init (it regenerates SSH host keys + restarts sshd during init) — not a security-group issue (you reached sshd; a blocked port would time out). Wait ~1–2 min and retry. (A timeout instead means your IP isn't in the stack's AdminCidr — re-deploy with your IP.)
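
The wait-and-retry advice above can be scripted. A minimal sketch, assuming a hypothetical helper named retry (not part of the repo) and a CONTROL_IP variable you set yourself. It cannot tell "connection closed" from "timed out", so if every attempt times out, check AdminCidr rather than waiting longer:

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 1 if every attempt fails.
# (Hypothetical helper; variable names are not local in plain sh.)
retry() {
  max=$1; delay=$2; shift 2
  n=1
  until "$@"; do
    [ "$n" -ge "$max" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Usage: up to 10 attempts, 15 s apart, running a no-op command over SSH.
#   retry 10 15 ssh -o ConnectTimeout=5 -i ~/.ssh/llm-bench-key.pem "ec2-user@$CONTROL_IP" true
```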

Everything from here runs on the control box.


1. One-time setup on the control box

# clone the repo (skip if already cloned)
git clone git@github.com:AlphaBitCore/llm-gateway-benchmark.git ~/llm-gateway-benchmark
cd ~/llm-gateway-benchmark
# generate this box's SSH key + authorize it on every rig box (over SSM; ~30s)
scripts/control-ssh-setup.sh
# install Ansible collections + write the inventory (uses this box's own key)
scripts/control-bootstrap.sh

If you ever stop/start rig boxes, re-run scripts/control-bootstrap.sh to refresh the inventory (private IPs are stable, but it re-checks the stack).


2. Install the gateway software on the boxes

cd ~/llm-gateway-benchmark/ansible
ansible-playbook site.yml --limit mock,bifrost,loadtest

This provisions the shared mock, the bifrost gateway, and the load generator. Each finishes with a smoke check that proves it reaches the mock (chatcmpl-mock). All three must end failed=0. (To also do nexus later: --limit mock,nexus,loadtest.)
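
The failed=0 check can be automated rather than eyeballed. A minimal sketch, assuming you saved the playbook output to a log file (the name ~/site.log and the helper recap_ok are hypothetical). It relies only on the per-host counters Ansible prints in its PLAY RECAP, and it also treats unreachable hosts as a failure:

```shell
# Succeeds only if stdin contains at least one PLAY RECAP line and every
# such line reports failed=0 and unreachable=0. (Hypothetical helper.)
recap_ok() {
  awk '
    /failed=[0-9]+/ {
      seen = 1
      if ($0 !~ /failed=0([^0-9]|$)/)      bad = 1
      if ($0 !~ /unreachable=0([^0-9]|$)/) bad = 1
    }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Usage:
#   ansible-playbook site.yml --limit mock,bifrost,loadtest | tee ~/site.log
#   recap_ok < ~/site.log && echo "all boxes ended failed=0"
```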


3. Run the benchmark — ALL profiles (bifrost)

run-tiers.sh runs all 6 profiles for one gateway — the three prompt sizes (128 / 550 / 12k) each in non-stream (throughput/RPS) and stream (TTFT) form — back to back, one full cycle and one report per profile. Detached, so a dropped SSH won't kill it; it runs unattended (each cycle's clean wipes data without prompting).

cd ~/llm-gateway-benchmark
nohup env GATEWAY=bifrost scripts/bench/run-tiers.sh > ~/bifrost-tiers.log 2>&1 &
# watch progress (all 6 profiles take tens of minutes — the 12k pair is the slow part)
tail -f ~/bifrost-tiers.log

It writes one report per profile:

| profile | report |
| --- | --- |
| nonstream-128 | results/bifrost-128-nonstream/report.md |
| stream-128 | results/bifrost-128-stream/report.md |
| nonstream-550 | results/bifrost-550-nonstream/report.md |
| stream-550 | results/bifrost-550-stream/report.md |
| nonstream (12k) | results/bifrost-12k-nonstream/report.md |
| stream (12k) | results/bifrost-12k-stream/report.md |
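
Since the run is detached, a quick way to see how far it has got is to check which of the six report paths above exist yet. A minimal sketch, assuming a hypothetical helper named missing_reports (not part of the repo) and the directory layout shown in the table:

```shell
# Print every expected report path that is still missing or empty.
# Usage: missing_reports <gateway> [results_dir]   (results_dir defaults to "results")
# (Hypothetical helper; paths follow the table above.)
missing_reports() {
  gw=$1
  root=${2:-results}
  for size in 128 550 12k; do
    for mode in nonstream stream; do
      f="$root/$gw-$size-$mode/report.md"
      [ -s "$f" ] || echo "$f"
    done
  done
  return 0
}

# Usage (from ~/llm-gateway-benchmark):
#   missing_reports bifrost        # no output means all six reports are written
```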

One profile only (faster — skips 128 and the slow 12k): use bench.sh, which runs a SINGLE profile (default nonstream-550):

BENCH_ASSUME_YES=1 GATEWAY=bifrost RUN_ID=bifrost-550 PROFILE=nonstream-550 \
  nohup scripts/bench/bench.sh > ~/bifrost-550.log 2>&1 &

Rule of thumb: run-tiers.sh = all 6 profiles, bench.sh = one profile.


4. Read the results

cd ~/llm-gateway-benchmark
ls -t results/                       # one dir per profile (e.g. bifrost-550-nonstream)
# the headline number (550 non-stream throughput); repeat for the other 5 dirs above
cat results/bifrost-550-nonstream/report.md

Read each report top-down: first the generator-health validity gate — if the load generator itself was the bottleneck (FD/port exhaustion), the row is flagged and those numbers don't count (re-run). Then per-stage RPS / ok% / p99 (non-stream profiles) or TTFT (stream profiles).


5. Stop the boxes when done (save cost)

cd ~/llm-gateway-benchmark
scripts/box.sh stop bifrost loadtest        # keep mock + control if you'll test more soon
# or stop everything except the control box:
# scripts/box.sh stop mock bifrost loadtest

Stopped boxes keep their disk; restart with scripts/box.sh start <role> then re-run step 1's scripts/control-bootstrap.sh (public IPs change, private IPs don't).


6. Switch to another gateway — one command (gateway.sh)

To test gateways one at a time (cost/quota discipline: only the mock + the box under test need to run), don't keep all six up. scripts/gateway.sh manages a single gateway box's whole lifecycle — stack create/terminate + Ansible provision + refreshing the control box's inventory, bench targets, and passwordless SSH — so after it returns, step 3 just works.

The common move — finish one gateway, start the next — is one command:

cd ~/llm-gateway-benchmark
scripts/gateway.sh swap bifrost litellm      # terminate bifrost, then create + provision litellm

Then benchmark the new one exactly as in step 3:

nohup env GATEWAY=litellm scripts/bench/run-tiers.sh > ~/litellm-tiers.log 2>&1 &

All the commands (gw ∈ nexus bifrost litellm kong portkey tensorzero) — one example each:

cd ~/llm-gateway-benchmark

# DOWN one box — CFN-terminate it; frees its vCPU + cost (do this before bringing another up)
scripts/gateway.sh down bifrost

# UP one box — CFN-create it (if absent) + provision (common + datastore + role); also refreshes
# the control box's inventory / bench targets / passwordless SSH so step 3 just works
scripts/gateway.sh up litellm

# REDEPLOY one box — re-apply STACK + ANSIBLE in place, without terminating: pushes updated
# template / role / artifacts (e.g. a fresh gateway binary or config) to an already-up box
scripts/gateway.sh redeploy nexus

# REBUILD one box — terminate and recreate it FRESH, then provision (clean slate from scratch)
scripts/gateway.sh rebuild kong

# SWAP — down <old> then up <new> in one go (finish one gateway, start the next)
scripts/gateway.sh swap bifrost litellm

| command | what it does |
| --- | --- |
| gateway.sh down <gw> | CFN-terminate the box → frees its vCPU + cost |
| gateway.sh up <gw> | CFN-create the box (if absent) + provision it |
| gateway.sh redeploy <gw> | re-apply stack + Ansible in place (push updated template / role / artifacts to an already-up box; no terminate) |
| gateway.sh rebuild <gw> | terminate + recreate fresh, then provision (clean slate) |
| gateway.sh swap <old> <new> | down <old> then up <new> |

Every command ends by re-running control-ssh-setup.sh + gen-inventory.sh, so the new box's (possibly new) private IP lands in inventory.ini, host_targets.env, and ~/.ssh/config, and passwordless SSH is asserted; you never hand-edit IPs. The load generator reads its --target from the inventory at run time, so there's nothing static to update on the loadtest box. down frees vCPU so the next up has quota, which is why swap goes down-then-up. up needs spare vCPU quota: if the account is at its limit, down another box first.
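
To work through several gateways unattended, the swap-then-benchmark pair can be looped. A minimal sketch, assuming run-tiers.sh blocks until all profiles finish when run in the foreground (the runbook only shows it detached). The helper bench_sequence and the RUN dry-run hook are hypothetical; only the gateway.sh swap and run-tiers.sh invocations come from this runbook:

```shell
# bench_sequence CURRENT NEXT...: starting from the gateway that is up now,
# swap to each NEXT gateway in turn and run all profiles against it.
# Set RUN=echo to print the commands instead of executing them (dry run).
# (Hypothetical helper; stops at the first command that fails.)
bench_sequence() {
  run=${RUN:-}
  current=$1
  shift
  for next in "$@"; do
    $run scripts/gateway.sh swap "$current" "$next" || return 1
    $run env GATEWAY="$next" scripts/bench/run-tiers.sh || return 1
    current=$next
  done
  return 0
}

# Usage (from ~/llm-gateway-benchmark, after bifrost has been benchmarked):
#   RUN=echo bench_sequence bifrost litellm kong    # preview the commands
#   bench_sequence bifrost litellm kong             # run them for real
```

Only one gateway box is up at any point, which matches the cost/quota discipline described above. Run it under nohup or a terminal multiplexer if the SSH session might drop.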


Change what you test (same flow, different value)

  • Another gateway: use scripts/gateway.sh swap <old> <new> (step 6), then set GATEWAY=<new> in step 3. Or, if the box is already provisioned, just change GATEWAY= in step 3.
  • Nexus's content-scanning / full-audit scenarios: GATEWAY=nexus automatically sweeps NEXUS_HOOKS (scanning off/on) × NEXUS_AUDIT_BODIES (full bodies off/on) across all tiers.

The only knobs you ever set are GATEWAY, PROFILE, STAGES, RUN_ID, NEXUS_HOOKS, NEXUS_AUDIT_BODIES — explained in ../scripts/bench/README.md. Full methodology + A/B diffs: LOADTEST-RUNBOOK.md. Control-box internals (IAM, key distribution, troubleshooting): CONTROL-BOX.md. Already deployed and just pulled new code? What to re-apply after a git pull: UPDATING-AN-EXISTING-RIG.md.