Evaluation suite for the MetaClaw Evolution Benchmark, which measures how well AI agents learn and adapt from multi-day interaction histories.
# Install (requires Python ≥ 3.10)
cd benchmark
pip install -e .
# Set required environment variables (see docs/scripts.md for the full list)
export BENCHMARK_BASE_URL=http://127.0.0.1:30000/v1 # OpenAI-compatible API base URL
export BENCHMARK_API_KEY=<your-api-key> # API key for the above endpoint
export BENCHMARK_MODEL=<your-model-id> # Model ID expected by the API server
# Validate the dataset (run from the project root: MetaClaw/)
metaclaw-bench check -p benchmark/data/metaclaw-bench/all_tests.json
# Run the full pipeline (infer → scoring → report)
metaclaw-bench run \
-i benchmark/data/metaclaw-bench/all_tests.json \
-o benchmark/results

BENCHMARK_MODEL is the model ID string that your API server uses (e.g.
gpt-5.2, Kimi-K2.5). It is injected into the openclaw agent config as the
primary model.
For ready-made experiment runner scripts (with proxy, skills, memory, and RL support) see docs/scripts.md.
The scripts under benchmark/scripts/ handle proxy lifecycle, port allocation,
and logging automatically.
Recommended: always set METACLAW_ROOT to an absolute path before running
any script. Relative paths are resolved from the current working directory and
may silently fail when the script is called from an unexpected location. The
same applies to METACLAW_SKILLS_DIR and METACLAW_API_KEY_SCRIPT.
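The absolute-path recommendation can be checked before a script is launched. The following is a minimal sketch and not part of the benchmark: the helper name `relative_path_vars` is hypothetical, and it inspects only the three variables named above.

```python
import os

# Variables this README recommends setting to absolute paths.
PATH_VARS = ("METACLAW_ROOT", "METACLAW_SKILLS_DIR", "METACLAW_API_KEY_SCRIPT")


def relative_path_vars(env=None):
    """Return the names of path variables that are set but not absolute.

    Hypothetical helper for illustration; unset or empty variables are ignored.
    """
    env = os.environ if env is None else env
    return [
        name
        for name in PATH_VARS
        if env.get(name) and not os.path.isabs(env[name])
    ]
```

Calling `relative_path_vars()` with no argument reads the current shell environment; a non-empty result means at least one path would be resolved against the current working directory.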
Concurrency: scripts that activate MetaClaw features (memory, skills, RL)
must run serially, because each test day depends on the state accumulated in
previous days. These scripts hardcode -w 1. baseline_run.py and
proxy_passthrough_run.py carry no cross-day state and default to -w 15
(parallel): different scenes are independent, and only rounds within a scene
are sequential.
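The scheduling rule for the stateless scripts can be pictured as follows. This is an illustrative sketch only, not the benchmark's implementation: `run_round`, `run_scene`, `run_all`, and the scene data are hypothetical stand-ins, and the real scripts take their worker count from -w.

```python
from concurrent.futures import ThreadPoolExecutor


def run_round(scene_id, round_id):
    """Hypothetical stand-in for one agent round; returns a label."""
    return f"{scene_id}:{round_id}"


def run_scene(scene_id, round_ids):
    """Rounds within a scene run strictly in order."""
    return [run_round(scene_id, r) for r in round_ids]


def run_all(scenes, workers):
    """Run independent scenes on a thread pool.

    With workers=1 the pool processes scenes one at a time, in submission
    order, which mirrors the serial mode used by the stateful scripts.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            scene_id: pool.submit(run_scene, scene_id, round_ids)
            for scene_id, round_ids in scenes.items()
        }
        return {scene_id: fut.result() for scene_id, fut in futures.items()}
```

In this sketch, parallelism applies only across scenes. Once state carries over from one test day to the next, the only safe setting is a single worker.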
Example: baseline (no proxy)
# 1. Required LLM credentials
export BENCHMARK_BASE_URL=https://api.openai.com/v1
export BENCHMARK_API_KEY=sk-...
export BENCHMARK_MODEL=gpt-5.2 # model ID expected by your API server
# 2. Strongly recommended: pin the project root to an absolute path
export METACLAW_ROOT=/absolute/path/to/MetaClaw
# 3. Run
python /absolute/path/to/MetaClaw/benchmark/scripts/baseline_run.py

Example: memory mode
export BENCHMARK_BASE_URL=https://api.openai.com/v1
export BENCHMARK_API_KEY=sk-...
export BENCHMARK_MODEL=gpt-5.2
export METACLAW_ROOT=/absolute/path/to/MetaClaw
# memory_run.py starts a proxy; no extra vars needed for memory-only mode
python /absolute/path/to/MetaClaw/benchmark/scripts/memory_run.py

Example: skills mode
export BENCHMARK_BASE_URL=https://api.openai.com/v1
export BENCHMARK_API_KEY=sk-...
export BENCHMARK_MODEL=gpt-5.2
export METACLAW_ROOT=/absolute/path/to/MetaClaw
export METACLAW_SKILLS_DIR=/absolute/path/to/your/skills # required for skills scripts
python /absolute/path/to/MetaClaw/benchmark/scripts/skills_only_run.py

Example: RL mode
# Skill evolver uses BENCHMARK_* vars
# export BENCHMARK_BASE_URL=https://api.openai.com/v1
# export BENCHMARK_API_KEY=sk-...
# export BENCHMARK_MODEL=gpt-5.2
# export METACLAW_ROOT=/absolute/path/to/MetaClaw
# Additional vars for RL training
export TINKER_KEY=<tinker-api-key>
export TINKER_MODEL=<model-id-for-rl>
export PRM_MODEL=<process-reward-model-id>
python /absolute/path/to/MetaClaw/benchmark/scripts/rl_only_run.py

For a ready-to-fill template, copy benchmark/scripts/_env_arg_example.sh.
Full variable reference: docs/env.md.
Per-script breakdown: docs/scripts.md.
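The per-mode requirements shown in the examples above can be summarized as a small preflight check. This is a sketch based only on the variables listed in this README: the mode names and the `missing_vars` helper are hypothetical, and docs/env.md remains the authoritative list. The RL example shows the BENCHMARK_* exports as commented-out lines, so treating them as required in that mode is an assumption.

```python
import os

# Variables exported by the examples in this README.
COMMON = ("BENCHMARK_BASE_URL", "BENCHMARK_API_KEY", "BENCHMARK_MODEL")

# Extra variables per mode, as listed in the examples.
# The mode names are illustrative labels, not CLI options.
EXTRA = {
    "baseline": (),
    "memory": (),
    "skills": ("METACLAW_SKILLS_DIR",),
    "rl": ("TINKER_KEY", "TINKER_MODEL", "PRM_MODEL"),
}


def missing_vars(mode, env=None):
    """Return required variables that are unset or empty for the given mode."""
    env = os.environ if env is None else env
    required = COMMON + EXTRA[mode]
    return [name for name in required if not env.get(name)]
```

Running `missing_vars("skills")` in the shell session that will launch skills_only_run.py gives a quick indication of whether anything from the example is still unset.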
benchmark/
├── data/
│   ├── metaclaw-bench/          # Full benchmark (30 days)
│   └── metaclaw-bench-small/    # Small subset (12 days)
├── docs/
│   ├── CLI.md                   # CLI reference (all commands and flags)
│   ├── scripts.md               # Experiment runner scripts guide
│   └── env.md                   # Environment variable reference
├── scripts/                     # Experiment runner scripts
│   ├── config/                  # YAML configs for each strategy
│   └── _env_arg_example.sh      # Environment variable template
├── src/                         # Core library
│   ├── cli.py                   # Entry point
│   ├── check/                   # Dataset validation
│   ├── infer/                   # Agent inference
│   ├── scoring/                 # Result scoring
│   ├── report/                  # Report generation
│   ├── run/                     # Full pipeline orchestration
│   └── clean/                   # Workspace cleanup
├── tests/                       # Unit tests
└── openclaw_customize/          # OpenClaw plugin extensions
| Command | Description |
|---|---|
| check | Validate dataset integrity (8 checks) |
| infer | Run agent inference on scenarios |
| scoring | Score inference results |
| report | Generate summary report |
| report-ratio | Compute compaction ratios between reports |
| run | Full pipeline: infer → scoring → report |
| clean | Remove temporary work directories |
See docs/CLI.md for full usage, all flags, and examples.
Pre-built runner scripts under scripts/ support various agent strategies:
| Script | Features |
|---|---|
| baseline_run.py | No proxy; direct API calls |
| proxy_passthrough_run.py | Proxy in passthrough mode (no enhancements) |
| skills_only_run.py | Proxy with pre-built skills |
| memory_run.py | Proxy with memory extraction/injection |
| rl_only_run.py | Proxy with RL training |
| rl_run.py | RL + skills |
| skills_memory_run.py | Skills + memory |
| rl_only_memory_run.py | RL + memory |
| madmax_memory_run.py | RL + skills + memory (all features) |
Each script reads environment variables from the shell; no path editing is needed. See docs/scripts.md for setup instructions, environment variable reference, and a per-script description.
pip install -e ".[dev]"
pytest -v tests/