Performance benchmark for the Eliza agent framework using the TypeScript runtime (Bun).
By replacing the real LLM with a deterministic mock plugin that returns instant, fixed responses, this benchmark isolates and measures the framework itself:
- Latency: End-to-end message processing time (min/avg/median/p95/p99)
- Throughput: Messages per second (sequential and concurrent)
- Pipeline breakdown: Time in state composition, provider execution, model calls, action dispatch, evaluators, memory CRUD
- Resource usage: RSS memory (start/peak/delta)
- Scaling behavior: Performance vs provider count, conversation history size, concurrent load
- Startup time: Agent creation and initialization
- DB throughput: In-memory database read/write operations per second
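
The latency figures listed above (min/avg/median/p95/p99) can be derived from raw per-message timings. The following is a minimal sketch, assuming nearest-rank percentiles over millisecond samples. `LatencySummary`, `percentile`, and `summarize` are hypothetical names used for illustration; the harness's own `metrics.ts` may compute these values differently (for example, with interpolation).

```typescript
// Hypothetical summary shape; not taken from the benchmark's metrics.ts.
interface LatencySummary {
  min: number;
  avg: number;
  median: number;
  p95: number;
  p99: number;
}

// Nearest-rank percentile on an ascending-sorted, non-empty array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  // Integer-friendly form of ceil(p/100 * n) to avoid float rounding surprises.
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Summarize latency samples (milliseconds) without mutating the input.
function summarize(samplesMs: number[]): LatencySummary {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const total = sorted.reduce((sum, value) => sum + value, 0);
  return {
    min: percentile(sorted, 0),
    avg: total / sorted.length,
    median: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}
```

Nearest-rank is chosen here only because it is simple and always returns an observed sample; with small iteration counts, p95 and p99 will sit at or near the maximum.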
# Run default scenarios
./run.sh
# Run all scenarios (including stress tests)
./run.sh --all
# Run specific scenarios
./run.sh --scenarios=single-message,burst-100,startup-cold
# Just generate comparison from existing results
./run.sh --compare

# Run the TypeScript harness directly
cd typescript
bun install
bun run src/bench.ts
bun run src/bench.ts --all
bun run src/bench.ts --scenarios=single-message,startup-cold

Flags `--ts-only`, `--py-only`, and `--rs-only` are obsolete; `--py-only` and `--rs-only` exit with an error.
After running benchmarks, generate a side-by-side comparison:
bun run compare.ts

benchmarks/framework/
├── README.md                  # This file
├── PLAN.md                    # Detailed design document
├── run.sh                     # Orchestrator script
├── compare.ts                 # Comparison tool for result JSON files
├── shared/
│   ├── character.json         # Shared agent character definition
│   └── scenarios.json         # 20 test scenarios
├── typescript/
│   ├── package.json
│   └── src/
│       ├── bench.ts           # Benchmark harness
│       ├── mock-llm-plugin.ts # Mock LLM model handlers
│       └── metrics.ts         # Measurement utilities
└── results/                   # JSON output files

The harness uses a mock LLM plugin that:
- Registers handlers for `TEXT_SMALL`, `TEXT_LARGE`, `TEXT_EMBEDDING`, `TEXT_COMPLETION`
- Returns deterministic, pre-computed XML responses that pass the framework's validation pipeline
- Detects which template is being evaluated (shouldRespond vs message handler vs reply action) by inspecting the prompt
- Returns zero-latency responses (no artificial delay)
- shouldRespond returns `RESPOND` for all messages (agent name is always included in benchmark messages)
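
The prompt-inspection idea described above can be sketched as a small dispatcher. This is an illustration only: the marker strings, the `TemplateKind` type, the function names, and the XML shapes are all assumptions made for the example, and the actual `mock-llm-plugin.ts` may identify templates and format responses differently.

```typescript
// Hypothetical template categories, mirroring the three cases named in the text.
type TemplateKind = "shouldRespond" | "replyAction" | "messageHandler";

// ASSUMPTION: these marker substrings are placeholders, not the real template text.
const SHOULD_RESPOND_MARKER = "shouldRespond";
const REPLY_ACTION_MARKER = "REPLY";

// Classify a prompt by looking for a marker; fall back to the message handler.
function detectTemplate(prompt: string): TemplateKind {
  if (prompt.includes(SHOULD_RESPOND_MARKER)) return "shouldRespond";
  if (prompt.includes(REPLY_ACTION_MARKER)) return "replyAction";
  return "messageHandler";
}

// ASSUMPTION: placeholder XML; the real responses must match the framework's validators.
const CANNED_RESPONSES: Record<TemplateKind, string> = {
  shouldRespond: "<response><action>RESPOND</action></response>",
  replyAction: "<response><text>Fixed benchmark reply.</text></response>",
  messageHandler:
    "<response><actions>REPLY</actions><text>Fixed benchmark reply.</text></response>",
};

// Synchronous lookup: the same prompt always yields the same string.
function cannedResponse(prompt: string): string {
  return CANNED_RESPONSES[detectTemplate(prompt)];
}

// Async wrapper in the shape a model handler would likely have; resolves
// immediately, with no artificial delay.
function mockTextModel(prompt: string): Promise<string> {
  return Promise.resolve(cannedResponse(prompt));
}
```

Because the responses are pre-computed constants, any time measured around `mockTextModel` reflects framework overhead rather than model latency.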
| ID | Description | Messages | Notes |
|---|---|---|---|
| `single-message` | Baseline latency | 1 | 50 iterations |
| `conversation-10` | State growth | 10 | Sequential conversation |
| `conversation-100` | Large state | 100 | Generated messages |
| `burst-100` | Sequential throughput | 100 | As fast as possible |
| `burst-1000` | High throughput | 1000 | Stress test |
| `with-should-respond` | With name check | 5 | Agent name in messages |
| `with-should-respond-no-name` | LLM evaluation | 5 | No agent name |
| `with-actions` | Action execution | 3 | REPLY action |
| `provider-scaling-10/50/100` | Provider overhead | 1 | N dummy providers |
| `history-scaling-100/1K/10K` | Memory overhead | 1 | Pre-populated history |
| `concurrent-10/50` | Concurrent load | N | `asyncio.gather` / `Promise.all` |
| `db-write-throughput` | DB writes | 10K ops | In-memory adapter |
| `db-read-throughput` | DB reads | 10K ops | In-memory adapter |
| `startup-cold` | Initialization | 0 | 20 fresh inits |
| `multi-step` | Multi-step mode | 1 | Mock completes immediately |
| `minimal-bootstrap` | Minimal providers | 1 | CHARACTER only |

- Provider execution: TypeScript runs providers in parallel (`Promise.all`), Python and Rust run sequentially
- GC vs manual memory: Rust has no GC; TypeScript (V8) and Python have GC pauses
- Concurrency model: Rust uses `Arc<RwLock>`, TypeScript uses single-threaded event loop, Python uses cooperative async with GIL
- Serialization: Rust uses protobuf-backed types; TypeScript/Python use native JSON
- State caching: TypeScript and Python cache composed state; Rust does not
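
To make the provider-execution difference concrete, here is a minimal sketch contrasting sequential awaiting with `Promise.all`. The `Provider` type and the `makeProvider`, `runSequential`, and `runParallel` helpers are hypothetical and are not the framework's actual provider interface; the event log exists only to make the ordering visible.

```typescript
// Hypothetical provider shape for illustration only.
type Provider = { name: string; get: () => Promise<string> };

// Build a dummy provider that logs when it starts and ends.
function makeProvider(name: string, log: string[]): Provider {
  return {
    name,
    get: async () => {
      log.push(`start ${name}`);
      await Promise.resolve(); // yield once, standing in for async work
      log.push(`end ${name}`);
      return `${name}-value`;
    },
  };
}

// Sequential: each provider finishes before the next one starts.
async function runSequential(providers: Provider[]): Promise<string[]> {
  const results: string[] = [];
  for (const provider of providers) {
    results.push(await provider.get());
  }
  return results;
}

// Parallel: every provider is started up front, then all are awaited together.
// Promise.all preserves input order in its result array.
function runParallel(providers: Provider[]): Promise<string[]> {
  return Promise.all(providers.map((provider) => provider.get()));
}
```

With two providers `a` and `b`, the sequential log reads `start a, end a, start b, end b`, while the parallel log reads `start a, start b, end a, end b`. On a single-threaded event loop this overlaps waiting time, not CPU work, so the benefit depends on how much of each provider's time is spent awaiting.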