# Benchmarks

Docs Home | Agent-Native | Benchmark Report | Framework Comparison | Ecosystem | API | Configuration | Examples

This page explains what benchmark scripts measure, what they do not measure, and how to interpret results safely.

## Benchmark Tracks

### 1) Agent Overhead

```bash
bun run bench:agent
```

Measures end-to-end path cost for the following (a timing sketch follows the list):

- direct parse (`processText`)
- AgentBridge tool path
- MCP `tools/call` handler path
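
A minimal sketch of the timing loop behind this track. The call sites at the bottom are hypothetical placeholders; the real `processText`, AgentBridge, and MCP entry points live in the bench script:

```ts
// Sketch of the per-path timing loop. The commented call sites at the
// bottom are placeholders, not the real Qirrel APIs.
type BenchFn = () => unknown | Promise<unknown>;

async function measure(name: string, fn: BenchFn, iterations = 10_000) {
  // Warm-up pass so JIT effects do not skew the samples.
  for (let i = 0; i < 100; i++) await fn();

  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await fn();
    samples.push(performance.now() - start);
  }

  const avgMs = samples.reduce((sum, s) => sum + s, 0) / samples.length;
  console.log(`${name}: ${avgMs.toFixed(4)} ms avg, ${(1000 / avgMs).toFixed(0)} ops/sec`);
  return samples;
}

// Hypothetical usage -- substitute the real entry points:
// await measure("direct parse", () => processText(input));
// await measure("AgentBridge", () => bridge.callTool("parse", { input }));
// await measure("MCP tools/call", () => mcpHandler(toolsCallRequest));
```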

### 2) Framework Dispatch Comparison

```bash
bun run bench:frameworks
```

Measures local tool-dispatch overhead with one shared handler implementation (sketched after the list):

- Direct handler baseline
- Qirrel AgentBridge
- Qirrel MCP handler
- LangChain `tool()` (if installed)
- AI SDK `tool()` (if installed)
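
The pattern is one handler, many wrappers. A minimal sketch of the LangChain leg, assuming `@langchain/core` and `zod` are installed; the tool name and schema here are illustrative, not the suite's actual definitions:

```ts
import { tool } from "@langchain/core/tools"; // optional dependency
import { z } from "zod";

// The single shared handler that every dispatch path reuses.
const sharedHandler = (input: { text: string }) => input.text.length;

// Baseline: call the handler directly, with no wrapper.
const direct = () => sharedHandler({ text: "hello" });

// LangChain path: the same handler behind tool(). Only the dispatch
// layer differs, so the timing delta is pure wrapper overhead.
const lengthTool = tool(
  async (input: { text: string }) => sharedHandler(input),
  {
    name: "text_length",
    description: "Return the length of the input text",
    schema: z.object({ text: z.string() }),
  },
);
const viaLangChain = () => lengthTool.invoke({ text: "hello" });
```

Because every leg calls the identical handler, any measured difference is attributable to the dispatch layer alone.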

### 3) Markdown Report Generation

```bash
bun run bench:report
```

Generates or updates the Markdown benchmark report.
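
Concretely, "generate/update" amounts to formatting the measured columns as Markdown and writing them to a file. A sketch under assumed names; the actual output file and column order are defined by the script:

```ts
import { appendFileSync } from "node:fs";

// Format one result as a Markdown table row matching the metric columns
// described below. Column order and output file name are assumptions.
function reportRow(name: string, opsPerSec: number, avgMs: number, p99Ms: number, vsDirect: number): string {
  return `| ${name} | ${opsPerSec.toFixed(0)} | ${avgMs.toFixed(4)} | ${p99Ms.toFixed(4)} | ${vsDirect.toFixed(2)}x |`;
}

// Placeholder values for illustration only -- not measured results.
appendFileSync("benchmark-report.md", reportRow("example path", 1_000, 1.0, 2.0, 1.25) + "\n");
```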

## Methodology Notes

- Benchmarks are local machine measurements, not universal truths.
- This suite isolates orchestration overhead; the framework comparison track makes no external model API calls.
- Optional frameworks are skipped if their dependencies are unavailable (see the sketch below).
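
The last point is typically implemented with a guarded dynamic import, roughly like this sketch (package name illustrative):

```ts
// Load an optional framework; return null (and skip its track) when the
// dependency is not installed.
async function loadOptional(pkg: string): Promise<unknown | null> {
  try {
    return await import(pkg);
  } catch {
    console.warn(`${pkg} not installed -- skipping its benchmark track`);
    return null;
  }
}

// Usage: only register the LangChain comparison when the import succeeds.
const langchainTools = await loadOptional("@langchain/core/tools");
if (langchainTools !== null) {
  // register the LangChain dispatch track here
}
```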

## How to Read Metrics

- ops/sec: higher is better.
- avg ms: lower is better.
- p99 ms: lower is better for tail-latency stability.
- vs direct: slowdown relative to the direct baseline (1.00x = equal to baseline; see the arithmetic sketch below).
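
In code, these four columns reduce to a little arithmetic over the raw per-call samples. A sketch using one common p99 definition; exact percentile math varies by harness:

```ts
// Reduce an array of per-call durations (ms) to the report's four columns.
function summarize(samples: number[], directAvgMs?: number) {
  const sorted = [...samples].sort((a, b) => a - b);
  const avgMs = samples.reduce((sum, s) => sum + s, 0) / samples.length;
  const p99Index = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.99));
  return {
    opsPerSec: 1000 / avgMs,                         // higher is better
    avgMs,                                           // lower is better
    p99Ms: sorted[p99Index],                         // tail latency; lower is better
    vsDirect: directAvgMs ? avgMs / directAvgMs : 1, // 1.00x = matches baseline
  };
}
```

Read `vsDirect` as a multiplier: 1.25x means the wrapped path averaged 25% more time per call than the direct baseline.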

## Reproducibility Checklist

1. Close resource-heavy apps.
2. Run each benchmark at least twice.
3. Compare runs only on the same runtime version and machine profile.
4. Commit results only when they are stable across runs (one way to check this follows).
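
One way to operationalize step 4, with an arbitrary 5% tolerance rather than any project-mandated threshold:

```ts
// Treat two runs as stable when their average latencies agree within a
// tolerance. The 5% default is an arbitrary suggestion, not a project rule.
function isStable(avgMsRunA: number, avgMsRunB: number, tolerance = 0.05): boolean {
  return Math.abs(avgMsRunA - avgMsRunB) / Math.min(avgMsRunA, avgMsRunB) <= tolerance;
}
```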

## Common Misinterpretations

- A wrapper that is faster in synthetic benchmarks does not automatically deliver better full-system latency.
- Cross-machine comparisons are usually invalid unless hardware and runtime are normalized.
- Throughput alone is not enough; weigh ergonomics, interoperability, and failure handling in selection decisions as well.