Name	Name	Last commit message	Last commit date
parent directory ..
results	results
shared	shared
typescript	typescript
README.md	README.md
compare.ts	compare.ts
run.sh	run.sh
visualize.py	visualize.py

Eliza Framework Benchmark

Performance benchmark for the Eliza agent framework using the TypeScript runtime (Bun).

What It Measures

By replacing the real LLM with a deterministic mock plugin that returns instant, fixed responses, this benchmark isolates and measures the framework itself:

Latency: End-to-end message processing time (min/avg/median/p95/p99)
Throughput: Messages per second (sequential and concurrent)
Pipeline breakdown: Time in state composition, provider execution, model calls, action dispatch, evaluators, memory CRUD
Resource usage: RSS memory (start/peak/delta)
Scaling behavior: Performance vs provider count, conversation history size, concurrent load
Startup time: Agent creation and initialization
DB throughput: In-memory database read/write operations per second

Quick Start

# Run default scenarios
./run.sh

# Run all scenarios (including stress tests)
./run.sh --all

# Run specific scenarios
./run.sh --scenarios=single-message,burst-100,startup-cold

# Just generate comparison from existing results
./run.sh --compare

TypeScript harness (direct)

cd typescript
bun install
bun run src/bench.ts
bun run src/bench.ts --all
bun run src/bench.ts --scenarios=single-message,startup-cold

Flags --ts-only, --py-only, and --rs-only are obsolete; --py-only / --rs-only exit with an error.

Comparison Report

After running benchmarks, generate a side-by-side comparison:

bun run compare.ts

Architecture

benchmarks/framework/
├── README.md               # This file
├── PLAN.md                 # Detailed design document
├── run.sh                  # Orchestrator script
├── compare.ts              # Comparison tool for result JSON files
├── shared/
│   ├── character.json      # Shared agent character definition
│   └── scenarios.json      # 20 test scenarios
├── typescript/
│   ├── package.json
│   └── src/
│       ├── bench.ts        # Benchmark harness
│       ├── mock-llm-plugin.ts  # Mock LLM model handlers
│       └── metrics.ts      # Measurement utilities
└── results/                # JSON output files

Mock LLM Plugin

The harness uses a mock LLM plugin that:

Registers handlers for TEXT_SMALL, TEXT_LARGE, TEXT_EMBEDDING, TEXT_COMPLETION
Returns deterministic, pre-computed XML responses that pass the framework's validation pipeline
Detects which template is being evaluated (shouldRespond vs message handler vs reply action) by inspecting the prompt
Returns zero-latency responses (no artificial delay)
shouldRespond returns RESPOND for all messages (agent name is always included in benchmark messages)

Scenarios (20 total)

ID	Description	Messages	Notes
`single-message`	Baseline latency	1	50 iterations
`conversation-10`	State growth	10	Sequential conversation
`conversation-100`	Large state	100	Generated messages
`burst-100`	Sequential throughput	100	As fast as possible
`burst-1000`	High throughput	1000	Stress test
`with-should-respond`	With name check	5	Agent name in messages
`with-should-respond-no-name`	LLM evaluation	5	No agent name
`with-actions`	Action execution	3	REPLY action
`provider-scaling-10/50/100`	Provider overhead	1	N dummy providers
`history-scaling-100/1K/10K`	Memory overhead	1	Pre-populated history
`concurrent-10/50`	Concurrent load	N	asyncio.gather / Promise.all
`db-write-throughput`	DB writes	10K ops	In-memory adapter
`db-read-throughput`	DB reads	10K ops	In-memory adapter
`startup-cold`	Initialization	0	20 fresh inits
`multi-step`	Multi-step mode	1	Mock completes immediately
`minimal-bootstrap`	Minimal providers	1	CHARACTER only

Key Architectural Differences Surfaced

Provider execution: TypeScript runs providers in parallel (Promise.all), Python and Rust run sequentially
GC vs manual memory: Rust has no GC; TypeScript (V8) and Python have GC pauses
Concurrency model: Rust uses Arc<RwLock>, TypeScript uses single-threaded event loop, Python uses cooperative async with GIL
Serialization: Rust uses protobuf-backed types; TypeScript/Python use native JSON
State caching: TypeScript and Python cache composed state; Rust does not

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Eliza Framework Benchmark

What It Measures

Quick Start

TypeScript harness (direct)

Comparison Report

Architecture

Mock LLM Plugin

Scenarios (20 total)

Key Architectural Differences Surfaced

FilesExpand file tree

framework

Directory actions

More options

Directory actions

More options

Latest commit

History

framework

Folders and files

parent directory

README.md

Eliza Framework Benchmark

What It Measures

Quick Start

TypeScript harness (direct)

Comparison Report

Architecture

Mock LLM Plugin

Scenarios (20 total)

Key Architectural Differences Surfaced