LLM Gateway Benchmark Matrix — on-demand AWS rig

CloudFormation (infra) + Ansible (host-native install) for a fair, reproducible, multi-box benchmark of 6 LLM gateways against a shared mock upstream. On-demand: deploy to bring up, delete-stack to tear down. See ARCHITECTURE.md for the full picture.

Matrix (5 languages, each isolated on its own box)

Box	Software	Lang	Port	Datastores
mock	nexus-mock-provider (prebuilt)	Go	3062	—
nexus	Nexus (ai-gw+hub+cp+cp-ui/nginx)	Go	3050	PG+Redis+NATS
bifrost	Bifrost	Go	8080	PG+Redis
litellm	LiteLLM	Python	4000	PG+Redis
kong	Kong AI Gateway (ai-proxy)	Lua	8000	PG
portkey	Portkey	Node	8787	PG+Redis(idle)
tensorzero	TensorZero (obs off)	Rust	3000	— (ClickHouse-native, disabled)
loadtest	Go loadtest (the only load tool)	—	—	—
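The ports in the matrix can be captured in a small lookup so that scripts and ad hoc checks do not hard-code them. This is a minimal sketch: gateway_port and gateway_base_url are hypothetical helpers, not part of this repo, and the plain http scheme is an assumption.

```shell
# Listen port for each box, copied from the matrix table.
gateway_port() {
  case "$1" in
    mock)       echo 3062 ;;
    nexus)      echo 3050 ;;
    bifrost)    echo 8080 ;;
    litellm)    echo 4000 ;;
    kong)       echo 8000 ;;
    portkey)    echo 8787 ;;
    tensorzero) echo 3000 ;;
    *) echo "unknown box: $1" >&2; return 1 ;;
  esac
}

# Base URL for a box. $1 is the host (whatever address the inventory reports),
# $2 is the box name. The http scheme is an assumption.
gateway_base_url() {
  port=$(gateway_port "$2") || return 1
  echo "http://$1:$port"
}
```

For example, gateway_base_url 10.0.0.5 litellm prints http://10.0.0.5:4000. The loadtest box has no listen port in the matrix, so the lookup rejects it.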

Prerequisites (control machine)

AWS CLI configured; an EC2 key pair.
ansible-core + collections: ansible-galaxy collection install ansible.posix community.postgresql
Gateway/mock/loadtest binaries + cp-ui ship prebuilt in artifacts/ — the roles copy them, nothing compiles on-box. Nexus's DB tooling (Prisma) ships as the self-contained artifacts/db-migrate asset and the OAuth login helper is vendored as scripts/nexus-auth.sh, so no Nexus source checkout is needed — the rig deploys from assets only.
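Before deploying, it can help to confirm that the control machine actually has the tools these steps call. A minimal sketch, assuming the tool list below covers what you need; missing_tools and check_prereqs are hypothetical helpers, not scripts shipped in this repo.

```shell
# Print each command from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the commands this README invokes on the control machine.
# The exact list is an assumption; extend it to match your setup.
check_prereqs() {
  missing=$(missing_tools aws ssh ansible-playbook ansible-galaxy)
  if [ -n "$missing" ]; then
    echo "missing:" $missing >&2
    return 1
  fi
  echo "all prerequisites found"
}
```

Run check_prereqs once on a fresh control machine. It does not verify that the Ansible collections are installed or that the AWS CLI has credentials.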

1. Deploy infra

aws cloudformation deploy \
  --stack-name nexus-perf-matrix \
  --template-file cloudformation/perf-matrix-stack.yaml \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides KeyName=<your-key> AdminCidr=<your.ip>/32

(Defaults: gateways on c6i.4xlarge, mock and loadtest on c6i.4xlarge, all AL2023 x86. For Graviton, override GatewayInstanceType=c7g.4xlarge and set LatestAmiId to the arm64 AMI.)
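AdminCidr expects a single-host CIDR, so a small guard can catch a malformed address before CloudFormation rejects it. A minimal sketch; to_admin_cidr is a hypothetical helper and is not part of the stack or its scripts.

```shell
# Turn a bare IPv4 address into the /32 form used for AdminCidr.
to_admin_cidr() {
  ip=$1
  # Require four dot-separated groups of 1 to 3 digits.
  printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || {
    echo "not an IPv4 address: $ip" >&2
    return 1
  }
  # Reject octets above 255 by splitting the address on dots.
  old_ifs=$IFS
  IFS=.
  for octet in $ip; do
    if [ "$octet" -gt 255 ]; then
      IFS=$old_ifs
      echo "octet out of range: $octet" >&2
      return 1
    fi
  done
  IFS=$old_ifs
  echo "$ip/32"
}
```

One way to use it is to_admin_cidr "$(curl -s https://checkip.amazonaws.com)", which asks AWS's IP echo service for your public address and passes the result to --parameter-overrides.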

2. Provision (host-native, from control machine)

scripts/gen-inventory.sh nexus-perf-matrix ~/.ssh/<your-key>.pem <region>
cd ansible
ansible-playbook -i inventory.ini site.yml                 # everything
# or one gateway:  ansible-playbook -i inventory.ini site.yml --tags kong --limit kong

Each role finishes with a smoke check asserting the mock signature (id == chatcmpl-mock, prompt echoed, usage 9/1/10) — proof the gateway actually reaches the mock and not a real provider.
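The same signature check can be repeated by hand against any response body. This is a rough sketch, not the assertion the roles run: is_mock_response is a hypothetical helper, and mapping 9/1/10 to the OpenAI-style fields prompt_tokens, completion_tokens and total_tokens is an assumption. It does not check that the prompt was echoed.

```shell
# Succeed only if the JSON body carries the mock signature.
# Usage field names are assumed to follow the OpenAI response shape.
is_mock_response() {
  body=$1
  printf '%s\n' "$body" | grep -Eq '"id"[[:space:]]*:[[:space:]]*"chatcmpl-mock"' || return 1
  printf '%s\n' "$body" | grep -Eq '"prompt_tokens"[[:space:]]*:[[:space:]]*9([^0-9]|$)' || return 1
  printf '%s\n' "$body" | grep -Eq '"completion_tokens"[[:space:]]*:[[:space:]]*1([^0-9]|$)' || return 1
  printf '%s\n' "$body" | grep -Eq '"total_tokens"[[:space:]]*:[[:space:]]*10([^0-9]|$)' || return 1
}
```

Pass it the body of a chat completion request sent through a gateway. A non-zero exit means the response did not come from the mock, or the gateway rewrote the fields being checked.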

Control machine = a laptop, or the in-VPC control box. The stack deploys a small c6i.xlarge control box by default (DeployControl=true) to run all of the above from inside the VPC. For a strictly linear copy-paste sequence (SSH in → setup → run → read results), follow docs/CONTROL-BOX-RUNBOOK.md; control-box internals (IAM, key distribution) are in docs/CONTROL-BOX.md.

3. Run the benchmark

GATEWAY=bifrost scripts/bench/run-tiers.sh   # one gateway, all 6 tiers → a report each
# or one cycle:  GATEWAY=bifrost PROFILE=nonstream-550 scripts/bench/bench.sh
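To sweep the whole matrix, the per-gateway command can be wrapped in a loop. A sketch under two assumptions: that GATEWAY accepts each box name from the matrix (only bifrost is shown above), and that a failed sweep should stop the loop. run_all_gateways and the RUNNER override are hypothetical and exist only so the loop can be dry-run.

```shell
# Run the tier sweep for each gateway in turn, stopping at the first failure.
# RUNNER is a hypothetical override; by default the README's script is used.
run_all_gateways() {
  runner=${RUNNER:-scripts/bench/run-tiers.sh}
  for gw in nexus bifrost litellm kong portkey tensorzero; do
    echo "== $gw =="
    GATEWAY=$gw $runner || return 1
  done
}
```

Running the gateways one after another keeps the shared mock and the load generator from serving two sweeps at once.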

There is no config file. You set only a few inline knobs (GATEWAY, PROFILE, STAGES, NEXUS_HOOKS, NEXUS_AUDIT_BODIES, RUN_ID); everything else is fixed in lib.sh. Results land in results/<run-id>/report.md. Check generator health first, since it is the report's validity gate: if the load generator was the bottleneck (FD or port exhaustion), the numbers don't count and the run must be repeated. Compare gateways by TTFT delta: every gateway talks to the same mock, so the mock's latency cancels out of the difference. Knobs: scripts/bench/README.md · full guide: docs/LOADTEST-RUNBOOK.md.
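The TTFT comparison is plain arithmetic. If each measurement is roughly mock latency plus gateway overhead, subtracting two measurements taken against the same mock leaves only the difference in overhead. A minimal sketch; ttft_delta_ms is a hypothetical helper, and it assumes you read the two TTFT values (in milliseconds) out of the reports yourself.

```shell
# Difference in TTFT (ms) between two gateways measured against the same mock.
# (mock + overhead_a) - (mock + overhead_b) = overhead_a - overhead_b
ttft_delta_ms() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a - b }'
}
```

With made-up inputs, ttft_delta_ms 12.50 10.25 prints 2.25, meaning the first gateway adds 2.25 ms more than the second. These inputs illustrate the calculation and are not measurements from this rig.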

4. Tear down (on-demand)

aws cloudformation delete-stack --stack-name nexus-perf-matrix

Nothing persists except the IaC in git. Delete when idle to control cost.
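delete-stack returns as soon as the deletion is requested, so a scripted teardown can block until CloudFormation reports completion. A minimal sketch using the AWS CLI's stack-delete-complete waiter; teardown_stack is a hypothetical wrapper and does not describe what down.sh does.

```shell
# Delete the stack, then wait until CloudFormation reports the deletion finished.
# The stack name defaults to the one used in the deploy step.
teardown_stack() {
  stack=${1:-nexus-perf-matrix}
  aws cloudformation delete-stack --stack-name "$stack" || return 1
  aws cloudformation wait stack-delete-complete --stack-name "$stack"
}
```

A non-zero exit from the waiter means the deletion failed or timed out, which is worth checking before assuming the boxes have stopped billing.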

Fairness & methodology (baked in)

All host-native (no container overhead), each gateway isolated on its own box.
Standardized storage: PostgreSQL + Redis everywhere (TensorZero is the documented exception — ClickHouse-native, run with observability off).
Same mock + same stages for all; each gateway addresses the mock per its own routing rules — see per-gateway gotchas below.
Nexus runs with 100% durable audit: the code default AI_GATEWAY_AUDIT_LOSS_MODE=spillblock is zero-loss, so no env var is needed. When comparing RPS, note what each gateway persists (by default Bifrost drops ~99% of logs).
Not directly comparable to competitors' single-box published numbers — this is a fair head-to-head between these gateways (each isolated, shared remote mock).

Per-gateway gotchas (from the mock's CONFIGURE docs)

Bifrost: no OPENAI_API_KEY in env (or openai/* routes to real OpenAI); call the model as mock-provider/mock-gpt-4o.
LiteLLM: api_base includes /v1; may require a master key on the proxy.
Kong: the ai-proxy plugin is enabled (a bare reverse proxy is NOT an AI gateway and its RPS is meaningless).
Portkey: routes via request headers (x-portkey-provider + custom host).
TensorZero: OpenAI-compatible endpoint at /openai/v1/chat/completions.
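Two of these gotchas are easy to guard against in a script. A sketch under stated assumptions: assert_no_openai_key and tensorzero_chat_url are hypothetical helpers, the http scheme is an assumption, and port 3000 comes from the matrix.

```shell
# Refuse to continue if OPENAI_API_KEY is present in the environment,
# because Bifrost would then route openai/* to real OpenAI instead of the mock.
assert_no_openai_key() {
  if [ -n "${OPENAI_API_KEY+set}" ]; then
    echo "OPENAI_API_KEY is set; unset it before starting Bifrost" >&2
    return 1
  fi
}

# Chat completions URL for TensorZero on a given host.
tensorzero_chat_url() {
  echo "http://$1:3000/openai/v1/chat/completions"
}
```

The guard also trips on a variable that is set but empty, on the cautious reading that "no OPENAI_API_KEY in env" means the variable should be absent entirely.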

Layout

cloudformation/   perf-matrix-stack.yaml (8 benchmark boxes + optional control box, SG, IAM) · network-stack · validate-stack
ansible/          site.yml + group_vars + roles/ (common · datastore · mock · 6 gateways · loadtest)
deploy.sh · down.sh   (repo root: bring up / tear down)
scripts/          gen-inventory.sh · box.sh · nexus-configure.sh · spin-control-box.sh
                  control-ssh-setup.sh · control-bootstrap.sh (in-VPC control box)
scripts/bench/    load-test orchestration (clean/setup/restart/run/monitor/report) + profiles
artifacts/        prebuilt linux/amd64 binaries + cp-ui zip + prisma db-migrate
docs/             LOADTEST-RUNBOOK.md · CONTROL-BOX-RUNBOOK.md · CONTROL-BOX.md
