AlphaBitCore/llm-gateway-benchmark

LLM Gateway Benchmark Matrix — on-demand AWS rig

CloudFormation (infra) + Ansible (host-native install) for a fair, reproducible, multi-box benchmark of 6 LLM gateways against a shared mock upstream. On-demand: deploy to bring up, delete-stack to tear down. See ARCHITECTURE.md for the full picture.

Matrix (5 languages, each isolated on its own box)

| Box | Software | Lang | Port | Datastores |
|---|---|---|---|---|
| mock | nexus-mock-provider (prebuilt) | Go | 3062 | |
| nexus | Nexus (ai-gw+hub+cp+cp-ui/nginx) | Go | 3050 | PG+Redis+NATS |
| bifrost | Bifrost | Go | 8080 | PG+Redis |
| litellm | LiteLLM | Python | 4000 | PG+Redis |
| kong | Kong AI Gateway (ai-proxy) | Lua | 8000 | PG |
| portkey | Portkey | Node | 8787 | PG+Redis(idle) |
| tensorzero | TensorZero (obs off) | Rust | 3000 | none (ClickHouse-native, disabled) |
| loadtest | Go loadtest (the only load tool) | | | |

Prerequisites (control machine)

  • AWS CLI configured; an EC2 key pair.
  • ansible-core + collections: ansible-galaxy collection install ansible.posix community.postgresql
  • Gateway/mock/loadtest binaries + cp-ui ship prebuilt in artifacts/ — the roles copy them, nothing compiles on-box. Nexus's DB tooling (Prisma) ships as the self-contained artifacts/db-migrate asset and the OAuth login helper is vendored as scripts/nexus-auth.sh, so no Nexus source checkout is needed — the rig deploys from assets only.

1. Deploy infra

aws cloudformation deploy \
  --stack-name nexus-perf-matrix \
  --template-file cloudformation/perf-matrix-stack.yaml \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides KeyName=<your-key> AdminCidr=<your.ip>/32

(Defaults: gateways c6i.4xlarge, mock+loadtest c6i.4xlarge, AL2023 x86. For Graviton: GatewayInstanceType=c7g.4xlarge + the arm64 LatestAmiId.)
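
AdminCidr expects the control machine's public IP as a /32. A minimal sketch of deriving it, assuming outbound HTTPS is allowed and using AWS's public checkip.amazonaws.com endpoint; the helper name admin_cidr is hypothetical and not part of this repo:

```shell
# Hypothetical helper (not part of this repo): print this machine's public
# IPv4 address as a /32 CIDR, suitable for the AdminCidr parameter.
admin_cidr() {
  local ip
  # -f makes curl fail on HTTP errors; -sS keeps it quiet but shows errors.
  ip=$(curl -fsS https://checkip.amazonaws.com) || return 1
  # Strip any whitespace or newline from the response body.
  ip=$(printf '%s' "$ip" | tr -d '[:space:]')
  printf '%s/32\n' "$ip"
}

# Example use with the deploy command above:
#   --parameter-overrides KeyName=<your-key> AdminCidr="$(admin_cidr)"
```

If the control machine sits behind a NAT or VPN, the address printed is the egress address, which is the one the security group needs to allow.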

2. Provision (host-native, from control machine)

scripts/gen-inventory.sh nexus-perf-matrix ~/.ssh/<your-key>.pem <region>
cd ansible
ansible-playbook -i inventory.ini site.yml                 # everything
# or one gateway:  ansible-playbook -i inventory.ini site.yml --tags kong --limit kong

Each role finishes with a smoke check asserting the mock signature (id == chatcmpl-mock, prompt echoed, usage 9/1/10) — proof the gateway actually reaches the mock and not a real provider.
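
The signature check can be sketched as a small shell function. This is an illustration, not the roles' actual task: it assumes that "usage 9/1/10" maps to the OpenAI-style fields prompt_tokens, completion_tokens and total_tokens, and the function names are hypothetical.

```shell
# Sketch only, not the Ansible roles' real smoke check.
# Assumption: "usage 9/1/10" means prompt_tokens=9, completion_tokens=1,
# total_tokens=10 in an OpenAI-style response body.

# has_field BODY KEY VALUE: succeed if BODY contains "KEY": VALUE as a whole token.
has_field() {
  printf '%s' "$1" |
    grep -Eq "\"$2\"[[:space:]]*:[[:space:]]*$3([^0-9A-Za-z_]|\$)"
}

# is_mock_response BODY PROMPT: succeed only if BODY carries the mock signature.
is_mock_response() {
  has_field "$1" id '"chatcmpl-mock"' &&
    has_field "$1" prompt_tokens 9 &&
    has_field "$1" completion_tokens 1 &&
    has_field "$1" total_tokens 10 || return 1
  # The mock echoes the prompt, so the prompt text must appear in the body.
  case $1 in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Feed it the response body from whichever chat-completions endpoint the gateway exposes, plus the prompt that was sent. A response from a real provider fails on the id alone. The grep-based matching is deliberately crude; a JSON-aware tool would be more robust.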

Control machine = a laptop, or the in-VPC control box. The stack deploys a small c6i.xlarge control box by default (DeployControl=true) to run all of the above from inside the VPC. For a strictly linear copy-paste sequence (SSH in → setup → run → read results), follow docs/CONTROL-BOX-RUNBOOK.md; control-box internals (IAM, key distribution) are in docs/CONTROL-BOX.md.

3. Run the benchmark

GATEWAY=bifrost scripts/bench/run-tiers.sh   # one gateway, all 6 tiers → a report each
# or one cycle:  GATEWAY=bifrost PROFILE=nonstream-550 scripts/bench/bench.sh
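
To cover the whole matrix, the single-gateway command can be looped. A minimal sketch, assuming GATEWAY accepts each gateway box name from the matrix (only bifrost is shown above) and that runs go one at a time because the mock and loadtest boxes are shared; run_all_gateways is a hypothetical helper, not part of this repo:

```shell
# Hypothetical helper (not part of this repo): run every tier for each gateway,
# strictly one gateway at a time.
# Assumption: GATEWAY takes the box names from the matrix table.
run_all_gateways() {
  local runner=${1:-scripts/bench/run-tiers.sh}
  local gw
  for gw in nexus bifrost litellm kong portkey tensorzero; do
    # Stop at the first failure so a broken run is not silently skipped.
    GATEWAY=$gw "$runner" || { echo "run failed for $gw" >&2; return 1; }
  done
}
```

Calling run_all_gateways with no argument uses scripts/bench/run-tiers.sh; passing another command as the first argument swaps the runner, which is handy for a dry run.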

There is no config file: you set only a few inline knobs (GATEWAY, PROFILE, STAGES, NEXUS_HOOKS, NEXUS_AUDIT_BODIES, RUN_ID); everything else is fixed in lib.sh. Results land in results/<run-id>/report.md. Check generator health first (the report's validity gate): if the load generator was the bottleneck (file-descriptor or port exhaustion), the numbers don't count and the run must be repeated. Compare gateways by TTFT (time to first token) delta, since the shared mock's latency cancels. Knobs: scripts/bench/README.md · full guide: docs/LOADTEST-RUNBOOK.md.
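
The cancellation argument can be written out. This is a sketch under the simplifying assumption that latency is additive, with $m$ the mock's own time to first token and $o_g$ the overhead added by gateway $g$; these symbols are introduced here for illustration and do not appear in the report.

```latex
\begin{align*}
T_g &= m + o_g \\
T_g - T_h &= (m + o_g) - (m + o_h) = o_g - o_h
\end{align*}
```

Because every gateway talks to the same mock, $m$ drops out of the difference and only the gateways' own overhead remains. The additive model ignores run-to-run variation in the mock and the network, so very small deltas deserve caution.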

4. Tear down (on-demand)

aws cloudformation delete-stack --stack-name nexus-perf-matrix

Nothing persists except the IaC in git. Delete when idle to control cost.
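
delete-stack only requests deletion and returns before the resources are gone. A minimal sketch of blocking until teardown finishes, using the AWS CLI's stack-delete-complete waiter; teardown_and_wait is a hypothetical wrapper, and the repo's own down.sh may already cover this:

```shell
# Hypothetical wrapper (not part of this repo): request deletion, then block
# until CloudFormation reports the stack gone.
teardown_and_wait() {
  local stack=${1:-nexus-perf-matrix}
  aws cloudformation delete-stack --stack-name "$stack" || return 1
  # The waiter polls until deletion finishes and exits non-zero if the
  # delete fails or the waiter times out.
  aws cloudformation wait stack-delete-complete --stack-name "$stack"
}
```

A non-zero exit is the signal to check the stack's events in the console, since a stuck delete keeps billing for whatever was left behind.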

Fairness & methodology (baked in)

  • All host-native (no container overhead), each gateway isolated on its own box.
  • Standardized storage: PostgreSQL + Redis across the gateways (per the matrix, Kong uses PostgreSQL only; TensorZero is the documented exception: ClickHouse-native, run with observability off).
  • Same mock + same stages for all; each gateway addresses the mock per its own routing rules — see per-gateway gotchas below.
  • Nexus runs 100% durable audit (code default AI_GATEWAY_AUDIT_LOSS_MODE=spillblock, zero-loss — no env needed); when comparing RPS, note who persists what (Bifrost default drops ~99% of logs).
  • Not directly comparable to competitors' single-box published numbers — this is a fair head-to-head between these gateways (each isolated, shared remote mock).

Per-gateway gotchas (from the mock's CONFIGURE docs)

  • Bifrost: no OPENAI_API_KEY in env (or openai/* routes to real OpenAI); call the model as mock-provider/mock-gpt-4o.
  • LiteLLM: api_base includes /v1; may require a master key on the proxy.
  • Kong: the ai-proxy plugin is enabled (a bare reverse proxy is NOT an AI gateway and its RPS is meaningless).
  • Portkey: routes via request headers (x-portkey-provider + custom host).
  • TensorZero: OpenAI-compatible endpoint at /openai/v1/chat/completions.
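
The Bifrost gotcha lends itself to a pre-run guard. A minimal sketch, assuming it runs in the same environment that launches Bifrost; assert_no_openai_key is a hypothetical name and not part of this repo:

```shell
# Hypothetical pre-run guard (not part of this repo) for the Bifrost gotcha:
# refuse to continue while OPENAI_API_KEY is present in the environment.
assert_no_openai_key() {
  if [ -n "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is set: openai/* may route to real OpenAI" >&2
    return 1
  fi
}
```

The guard only catches the environment variable; the request still has to name the model as mock-provider/mock-gpt-4o, as noted in the list above.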

Layout

cloudformation/   perf-matrix-stack.yaml (8 benchmark boxes + optional control box, SG, IAM) · network-stack · validate-stack
ansible/          site.yml + group_vars + roles/ (common · datastore · mock · 6 gateways · loadtest)
deploy.sh · down.sh   (repo root: bring up / tear down)
scripts/          gen-inventory.sh · box.sh · nexus-configure.sh · spin-control-box.sh
                  control-ssh-setup.sh · control-bootstrap.sh (in-VPC control box)
scripts/bench/    load-test orchestration (clean/setup/restart/run/monitor/report) + profiles
artifacts/        prebuilt linux/amd64 binaries + cp-ui zip + prisma db-migrate
docs/             LOADTEST-RUNBOOK.md · CONTROL-BOX-RUNBOOK.md · CONTROL-BOX.md
