Pre-built runner scripts under benchmark/scripts/ automate the full
experiment workflow: setting up a MetaClaw proxy, running the benchmark, and
collecting results. Each script is self-contained and can be run directly
with python benchmark/scripts/<script>.py from the project root.

All scripts read environment variables for credentials and paths. Copy and
fill in benchmark/scripts/_env_arg_example.sh, then source it before running:

    cp benchmark/scripts/_env_arg_example.sh ~/.metaclaw_bench_env.sh
    # Edit ~/.metaclaw_bench_env.sh with your values
    source ~/.metaclaw_bench_env.sh

| Variable | Description |
|---|---|
| BENCHMARK_BASE_URL | Base URL of the OpenAI-compatible API endpoint (e.g. https://api.openai.com/v1) |
| BENCHMARK_API_KEY | API key for the above endpoint |
| BENCHMARK_MODEL | Model ID as registered in your API server; used in openclaw.json as the agent model |

BENCHMARK_MODEL must exactly match the model ID your API server expects. It is injected into the openclaw config as "primary": "metaclaw-bench/${BENCHMARK_MODEL}".

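As a rough illustration of the checks implied above, the following sketch validates the three required variables and builds the model reference. The function names `load_benchmark_env` and `primary_model_ref` are hypothetical and are not taken from the MetaClaw scripts; only the variable names and the metaclaw-bench/ prefix come from the text.

```python
import os

# Variable names as listed in the table above.
REQUIRED_VARS = ("BENCHMARK_BASE_URL", "BENCHMARK_API_KEY", "BENCHMARK_MODEL")


def load_benchmark_env(env=None):
    """Return the required variables, raising if any is missing or empty.

    Hypothetical helper: the real scripts may validate differently.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def primary_model_ref(model_id):
    """Build the value described for the "primary" key of the openclaw config."""
    return f"metaclaw-bench/{model_id}"
```

Failing early on a missing variable tends to be easier to diagnose than an authentication error surfacing partway through a run.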
RL variables (needed only by the scripts that enable RL training):

| Variable | Description |
|---|---|
| TINKER_KEY | API key for the Tinker (RL training) service |
| TINKER_MODEL | Model ID used by the RL fine-tuning step |
| PRM_MODEL | Process Reward Model ID used for RL scoring |

Optional overrides:

| Variable | Default | Description |
|---|---|---|
| METACLAW_ROOT | Auto-detected from script location | Absolute path to the MetaClaw project root |
| METACLAW_BENCH_BIN | metaclaw-bench (resolved via PATH) | Explicit path to the metaclaw-bench binary |
| METACLAW_BIN | metaclaw (resolved via PATH) | Explicit path to the metaclaw binary (used by proxy_run.py) |
| METACLAW_API_KEY_SCRIPT | unset (skip) | Shell script to source for exporting env vars; useful when credentials are managed externally |
| METACLAW_SKILLS_DIR | <METACLAW_ROOT>/memory_data/skills | Directory containing pre-built skills (required for skills-based scripts) |
| METACLAW_BENCH_INPUT | per-script default (see table below) | Override the benchmark input file for all scripts; useful for switching to a smaller dataset without editing individual scripts |

All scripts auto-detect paths from METACLAW_ROOT (or the script's own
location when METACLAW_ROOT is unset). No path editing is needed unless
you want to override defaults.
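The resolution order described above could be sketched as follows. This is an illustration under stated assumptions, not the scripts' actual code: the helper names are hypothetical, and the fallback assumes the scripts live in <METACLAW_ROOT>/benchmark/scripts/, two directory levels below the root.

```python
import os
import shutil
from pathlib import Path


def resolve_root(script_path, env=None):
    """Prefer METACLAW_ROOT; otherwise derive the root from the script location."""
    env = os.environ if env is None else env
    root = env.get("METACLAW_ROOT")
    if root:
        return Path(root).resolve()
    # Assumption: <root>/benchmark/scripts/<script>.py, so the root is parents[2].
    return Path(script_path).resolve().parents[2]


def resolve_binary(env_var, default_name, env=None):
    """Use the explicit path if set; otherwise look the binary up on PATH."""
    env = os.environ if env is None else env
    explicit = env.get(env_var)
    if explicit:
        return explicit
    return shutil.which(default_name)  # None when the binary is not on PATH


def resolve_skills_dir(root, env=None):
    """Use METACLAW_SKILLS_DIR if set; otherwise <root>/memory_data/skills."""
    env = os.environ if env is None else env
    override = env.get("METACLAW_SKILLS_DIR")
    return Path(override) if override else Path(root) / "memory_data" / "skills"
```

For example, `resolve_binary("METACLAW_BIN", "metaclaw")` returns the explicit path when METACLAW_BIN is set and falls back to a PATH lookup otherwise.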
baseline_run.py runs the benchmark without the MetaClaw proxy. The agent
communicates directly with BENCHMARK_BASE_URL. Use this to establish the
baseline score.

    python benchmark/scripts/baseline_run.py

Config: none (no proxy launched).
Input: all_tests.json (standard, without metaclaw agent config).

proxy_passthrough_run.py starts the MetaClaw proxy with all enhancements disabled (no skills, no memory, no RL) and routes agent requests through it. Use it to validate the proxy plumbing or to measure the proxy's raw overhead with no feature active.

    python benchmark/scripts/proxy_passthrough_run.py

Config: scripts/config/dummy.yaml.
Requires: METACLAW_SKILLS_DIR (needed by the proxy config even with skills off).

skills_only_run.py runs through the proxy with pre-built skills injected into the agent context. No memory or RL.

    python benchmark/scripts/skills_only_run.py

Config: scripts/config/skills-only.yaml.
Requires: METACLAW_SKILLS_DIR.

memory_run.py runs through the proxy with memory extraction and injection
active. It generates memory_report.md alongside the regular benchmark report.

    python benchmark/scripts/memory_run.py

Config: scripts/config/memory.yaml.
Memory store: benchmark/memory_runs/<timestamp>/.

rl_only_run.py runs through the proxy with RL training triggered every
SCENE_PER_TRAIN completed test scenes. No skills or memory.

    python benchmark/scripts/rl_only_run.py

Config: scripts/config/rl-only.yaml.
Requires RL variables: TINKER_KEY, TINKER_MODEL, PRM_MODEL.

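The "every SCENE_PER_TRAIN completed scenes" rule amounts to a simple modulo check. The sketch below only illustrates that arithmetic; the function `should_train` is hypothetical, and the actual trigger logic lives inside MetaClaw and may differ.

```python
def should_train(completed_scenes, scene_per_train):
    """Return True when a training step is due after this many completed scenes.

    Hypothetical helper illustrating the cadence only.
    """
    if scene_per_train <= 0:
        raise ValueError("scene_per_train must be a positive integer")
    return completed_scenes > 0 and completed_scenes % scene_per_train == 0


# With an arbitrary example value of 3, training is due after scenes 3, 6 and 9.
due = [n for n in range(1, 11) if should_train(n, 3)]
print(due)  # [3, 6, 9]
```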
rl_run.py runs through the proxy with both RL training and pre-built skills active.

    python benchmark/scripts/rl_run.py

Config: scripts/config/rl.yaml.
Requires: METACLAW_SKILLS_DIR + RL variables.

skills_memory_run.py combines skills and memory; no RL.

    python benchmark/scripts/skills_memory_run.py

Config: scripts/config/skills-memory.yaml.
Requires: METACLAW_SKILLS_DIR.

rl_only_memory_run.py combines RL training and memory; no skills.

    python benchmark/scripts/rl_only_memory_run.py

Config: scripts/config/rl-only-memory.yaml.
Requires RL variables.

madmax_memory_run.py enables all features simultaneously: skills, memory, and RL.

    python benchmark/scripts/madmax_memory_run.py

Config: scripts/config/madmax.yaml.
Requires: METACLAW_SKILLS_DIR + RL variables.

proxy_run.py runs metaclaw start and tees its output to both the terminal and
the log. It is invoked automatically by the other runner scripts, and can also
be run directly when you need a standalone proxy.

    python benchmark/scripts/proxy_run.py

Pass the proxy config via METACLAW_CONFIG_FILE before launching:

    METACLAW_CONFIG_FILE=/path/to/config.yaml python benchmark/scripts/proxy_run.py

| Script | Proxy | Skills | Memory | RL | Workers | Default input file |
|---|---|---|---|---|---|---|
| baseline_run.py | ✗ | ✗ | ✗ | ✗ | parallel (-w 15) | all_tests.json |
| proxy_passthrough_run.py | ✓ | ✗ | ✗ | ✗ | parallel (-w 15) | all_tests_metaclaw.json |
| skills_only_run.py | ✓ | ✓ | ✗ | ✗ | serial (-w 1) | all_tests_metaclaw.json |
| memory_run.py | ✓ | ✗ | ✓ | ✗ | serial (-w 1) | all_tests_metaclaw.json |
| buffer_memory_run.py | ✓ | ? | ✓ (incremental) | ? | serial (-w 1) | all_tests_metaclaw.json |
| rl_only_run.py | ✓ | ✗ | ✗ | ✓ | serial (-w 1) | all_tests_metaclaw.json |
| rl_run.py | ✓ | ✓ | ✗ | ✓ | serial (-w 1) | all_tests_metaclaw.json |
| skills_memory_run.py | ✓ | ✓ | ✓ | ✗ | serial (-w 1) | all_tests_metaclaw.json |
| rl_only_memory_run.py | ✓ | ✗ | ✓ | ✓ | serial (-w 1) | all_tests_metaclaw.json |
| madmax_memory_run.py | ✓ | ✓ | ✓ | ✓ | serial (-w 1) | all_tests_metaclaw.json |

All default paths resolve under <METACLAW_ROOT>/benchmark/data/metaclaw-bench/.
Set METACLAW_BENCH_INPUT to an absolute path to override for every script at once,
e.g. to point at the smaller metaclaw-bench-small dataset.
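A minimal sketch of that input resolution, assuming a hypothetical helper named `resolve_input`; the file names and the data directory come from the text above, while the function and the constants are illustrative only.

```python
import os
from pathlib import Path

# Per the defaults described above: baseline_run.py reads the standard file,
# every other runner reads the variant with the metaclaw agent config.
STANDARD_INPUT = "all_tests.json"
METACLAW_INPUT = "all_tests_metaclaw.json"


def resolve_input(script_name, root, env=None):
    """Return METACLAW_BENCH_INPUT if set, else the script's default input."""
    env = os.environ if env is None else env
    override = env.get("METACLAW_BENCH_INPUT")
    if override:
        return Path(override)
    filename = STANDARD_INPUT if script_name == "baseline_run.py" else METACLAW_INPUT
    return Path(root) / "benchmark" / "data" / "metaclaw-bench" / filename
```

Because the override is read from the environment, exporting METACLAW_BENCH_INPUT once switches every script to the same dataset.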
Why the serial constraint? Any script that activates MetaClaw state (memory extracts, skill updates, RL checkpoints) relies on strict ordering: day N+1 must see the state accumulated through day N. Running tests concurrently would corrupt the accumulated state.
baseline_run.py and proxy_passthrough_run.py have no such dependency: for a plain LLM inference engine the rounds within a single scene are sequential, but different scenes (days) are fully independent, so parallel execution is safe.
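The rule above reduces to "serial whenever any stateful feature is on". The sketch below encodes it with hypothetical names (`worker_count`, `worker_args`); the -w values 1 and 15 are the ones listed in the table, and nothing else about the metaclaw-bench command line is assumed.

```python
# Features that accumulate state across days, per the explanation above.
STATEFUL_FEATURES = {"skills", "memory", "rl"}


def worker_count(enabled_features, parallel_workers=15):
    """Return 1 when any stateful feature is enabled, else the parallel count."""
    if STATEFUL_FEATURES & set(enabled_features):
        return 1
    return parallel_workers


def worker_args(enabled_features):
    """Build the -w argument pair for a run with the given features."""
    return ["-w", str(worker_count(enabled_features))]
```

For instance, `worker_args(["memory"])` yields `["-w", "1"]`, while an empty feature list (baseline or passthrough) yields `["-w", "15"]`.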
Each YAML file under scripts/config/ controls the MetaClaw proxy for a
specific experiment variant. ${VAR} placeholders are expanded from the
current environment at runtime. See the script descriptions above to match
scripts to their config files.
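One way to expand ${VAR} placeholders is Python's `string.Template`, shown here as a sketch. Whether MetaClaw uses this mechanism is not stated, so treat it as an assumption; the helper name `expand_placeholders` and the YAML keys in the usage example are hypothetical.

```python
import os
from string import Template


def expand_placeholders(text, env=None):
    """Replace ${VAR} placeholders with values from the environment.

    Raises KeyError when a referenced variable is not set, so a config with a
    missing credential fails before the proxy starts. Note that Template also
    treats a bare "$" specially; a literal dollar sign must be written "$$".
    """
    env = os.environ if env is None else env
    return Template(text).substitute(env)


# Hypothetical config fragment; the key names are illustrative only.
yaml_text = "api_base: ${BENCHMARK_BASE_URL}\nmodel: ${BENCHMARK_MODEL}\n"
expanded = expand_placeholders(
    yaml_text,
    {"BENCHMARK_BASE_URL": "https://api.openai.com/v1", "BENCHMARK_MODEL": "my-model"},
)
```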
The _env_arg_example.sh file contains a ready-to-fill template for all
required environment variables.

All scripts write a run log to:

    benchmark/logs/<script_name>/bench_run.log

If the file already exists, a numeric suffix is appended (_1, _2, …) so
previous logs are never overwritten.
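The non-overwriting behaviour can be sketched as below. The helper `next_log_path` is hypothetical, and the exact position of the suffix is an assumption: here it goes before the extension (bench_run_1.log), which the text does not confirm.

```python
from pathlib import Path


def next_log_path(log_dir, stem="bench_run", suffix=".log"):
    """Return the first log path in log_dir that does not exist yet.

    Assumption: the numeric suffix is placed before the file extension.
    """
    log_dir = Path(log_dir)
    candidate = log_dir / f"{stem}{suffix}"
    n = 1
    while candidate.exists():
        candidate = log_dir / f"{stem}_{n}{suffix}"
        n += 1
    return candidate
```

Calling it once per run and creating the file immediately keeps the check-then-create window small; it is not safe against two runs starting at the same instant.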