Command-line interface for running and evaluating the MetaClaw Evolution Benchmark.
metaclaw-bench <command> [options]
Or via Python module (run from the benchmark/ directory):
python -m src.cli <command> [options]
All relative paths are resolved against the project root (the directory
that contains benchmark/), not against benchmark/ itself.
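The path rule above can be illustrated with a minimal sketch. The helper name `resolve_against_root` and the example root `/home/user/MetaClaw` are hypothetical and not taken from the benchmark's code. The sketch only shows the stated behaviour: relative paths are anchored at the project root.

```python
from pathlib import Path


def resolve_against_root(path: str, project_root: Path) -> Path:
    """Resolve a CLI path argument against the project root.

    Hypothetical helper for illustration; the real CLI may differ.
    """
    p = Path(path)
    # Absolute paths are used as given; relative ones are anchored at the root,
    # not at benchmark/ and not at the current working directory.
    return p if p.is_absolute() else project_root / p


# Assumed project root: the directory that contains benchmark/
root = Path("/home/user/MetaClaw")
print(resolve_against_root("benchmark/data/metaclaw-bench/all_tests.json", root))
```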
Validate a benchmark dataset before running inference.
metaclaw-bench check -p <path/to/all_tests.json>
Options
| Flag | Required | Description |
|---|---|---|
| -p, --path | yes | Path to all_tests.json |
Checks performed (8 total)
| # | Checker | Description |
|---|---|---|
| 1 | AllTests Structure | Top-level fields and test array structure; unique agent ID |
| 2 | Basic Integrity | All referenced files exist on disk |
| 3 | ID Consistency | Session IDs unique; internal IDs match filenames |
| 4 | File Format | JSONL and questions.json files are valid JSON |
| 5 | Directory Structure | eval/ and sessions/ directories exist |
| 6 | Workspace Integrity | workspace_src contains required identity files |
| 7 | Session Format | Session JSONL first/second line roles are correct |
| 8 | Questions Integrity | Round types, feedback strings, and eval field structure |
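As a rough illustration of what a check such as File Format (#4) might involve, the sketch below validates JSONL text line by line. The function name `invalid_jsonl_lines` and the choice to skip blank lines are assumptions; the actual checker may apply different rules.

```python
import json


def invalid_jsonl_lines(text: str) -> list[int]:
    """Return 1-based numbers of non-empty lines that are not valid JSON."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(lineno)
    return bad
```

An empty result means every non-blank line parsed as JSON.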
Example
# Run from the project root (MetaClaw/)
metaclaw-bench check -p benchmark/data/metaclaw-bench/all_tests.json
Run the openclaw agent for each test scenario and save per-question results.
metaclaw-bench infer -i <input> -o <output> [options]
Options
| Flag | Required | Default | Description |
|---|---|---|---|
| -i, --input | yes | - | Path to all_tests.json, a directory of all_tests.json files, or a JSON list of paths |
| -o, --output | yes | - | Output directory |
| -w, --workers | no | 1 | Maximum concurrent tests |
| -n, --retry | no | 0 | Retries per failed question |
| --scene-per-train | no | disabled | Trigger metaclaw train-step every N scenes |
| --memory | no | off | Trigger POST /v1/memory/ingest after each scene |
| --memory-proxy-port | no | 30000 | MetaClaw proxy port for memory ingest |
Note: Tests always run serially (workers=1) when --scene-per-train or --memory is active, because these features require strict ordering.
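The serial-execution rule in the note can be sketched as follows. The function `effective_workers` and its parameter names are hypothetical and only mirror the flags in the table. It is not the CLI's actual implementation.

```python
from typing import Optional


def effective_workers(
    workers: int = 1,
    scene_per_train: Optional[int] = None,
    memory: bool = False,
) -> int:
    """Return the concurrency actually used for a run (illustrative only)."""
    # Training steps and memory ingest depend on scene order, so either
    # feature forces serial execution regardless of the requested worker count.
    if scene_per_train is not None or memory:
        return 1
    return max(1, workers)
```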
questions.json format
{
"id": "day01",
"desc": "Time format preference",
"rounds": [
{
"id": "r1",
"type": "file_check",
"question": "Save meeting notes to tasks/day01/meeting.json.",
"feedback": {
"correct": "Format is correct!",
"incorrect": "Please use ISO 8601 for time fields."
},
"eval": {
"command": "python scripts/check_meeting.py day01/meeting.json",
"expect_exit": 0,
"expect_stdout": "OK"
}
},
{
"id": "r2",
"type": "multi_choice",
"question": "Which time format did you use?",
"feedback": { "correct": "Correct!", "incorrect": "Review ISO 8601." },
"eval": {
"options": { "A": "ISO 8601", "B": "Unix timestamp", "C": "Plain text" },
"answer": ["A"]
}
}
]
}
Feedback injection: each round (except the first) receives the previous round's
feedback prepended as [Previous Feedback] <text>\n\n<question>. A standalone
feedback message is sent after the last round.
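The feedback injection described above can be sketched like this. The function names are hypothetical, and the round outcomes are passed in up front for simplicity, whereas a real run would only know each outcome after the agent answers. Sending the final standalone feedback as the bare feedback string is also an assumption.

```python
from typing import Optional


def build_prompt(question: str, previous_feedback: Optional[str]) -> str:
    """Prepend the previous round's feedback, if any, to the question."""
    if previous_feedback is None:
        return question  # the first round has no feedback to inject
    return f"[Previous Feedback] {previous_feedback}\n\n{question}"


def build_messages(rounds: list[dict], outcomes: list[bool]) -> list[str]:
    """Build the message sequence for one questions.json file.

    rounds:   round dicts with "question" and "feedback" keys
    outcomes: whether each round was answered correctly (simplification)
    """
    messages = []
    feedback = None
    for rnd, ok in zip(rounds, outcomes):
        messages.append(build_prompt(rnd["question"], feedback))
        feedback = rnd["feedback"]["correct" if ok else "incorrect"]
    if feedback is not None:
        # Assumption: the standalone message after the last round is the
        # plain feedback text.
        messages.append(feedback)
    return messages
```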
Example
metaclaw-bench infer \
-i benchmark/data/metaclaw-bench/all_tests.json \
-o /tmp/infer_out \
-n 1
Score inference results against correct answers.
metaclaw-bench scoring -i <input> -r <result_dir>
Options
| Flag | Required | Description |
|---|---|---|
| -i, --input | yes | Path to all_tests.json |
| -r, --result | yes | Directory to search recursively for infer_result.json files |
Scoring rules:
- file_check rounds: scored from the inline_score.passed field written during inference.
- multi_choice rounds: extracts \bbox{X} from the agent response and compares to eval.answer.
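A minimal sketch of the multi_choice rule follows. The function names are hypothetical. Three details are assumptions rather than documented behaviour: the last `\bbox{...}` occurrence wins, several letters may appear inside the braces, and the extracted set must match `eval.answer` exactly.

```python
import re

# Matches \bbox{...} and captures the content between the braces.
BBOX_RE = re.compile(r"\\bbox\{([^}]*)\}")


def extract_choices(response: str) -> list[str]:
    """Return the sorted option letters found in the last \\bbox{...}."""
    matches = BBOX_RE.findall(response)
    if not matches:
        return []  # no answer marker in the response
    # Assumption: the last occurrence is the final answer; letters may be
    # separated by commas or spaces, e.g. \bbox{A, C}.
    return sorted(set(re.findall(r"[A-Z]", matches[-1].upper())))


def score_multi_choice(response: str, answer: list[str]) -> bool:
    """Compare the extracted letters with eval.answer (exact set match)."""
    return extract_choices(response) == sorted(set(answer))
```

For the `r2` round shown earlier, a response ending in `\bbox{A}` would count as correct under these assumptions.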
Example
metaclaw-bench scoring \
-i benchmark/data/metaclaw-bench/all_tests.json \
-r /tmp/infer_out
Generate an accuracy and token-usage report from scoring results.
metaclaw-bench report -r <result_dir> [-c <compaction_results.json>] [-o <output_dir>]
Options
| Flag | Required | Description |
|---|---|---|
| -r, --result | yes | Directory containing scoring.json files |
| -c, --compaction | no | Path to compaction_results.json for token aggregation |
| -o, --output | no | Output directory for report.json and report.md; prints to terminal if omitted |
Example
metaclaw-bench report -r /tmp/infer_out -o /tmp/report_out
Full pipeline: infer → scoring → report.
metaclaw-bench run -i <input> -o <output> [options]
Accepts the same options as infer. If a compaction_results.json exists
alongside all_tests.json it is automatically picked up for the report step.
When multiple test sets are processed, a combined reports.md is written to
the output root.
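The automatic pickup of compaction_results.json can be pictured with the sketch below. The helper `find_compaction_file` is hypothetical, and "alongside" is assumed to mean "in the same directory as all_tests.json".

```python
from pathlib import Path
from typing import Optional


def find_compaction_file(all_tests_path: Path) -> Optional[Path]:
    """Return the sibling compaction_results.json if it exists, else None."""
    # Assumption: "alongside" means the same directory as all_tests.json.
    candidate = all_tests_path.parent / "compaction_results.json"
    return candidate if candidate.is_file() else None
```

When the result is not None, it plays the role of the -c argument in the report step.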
Options
| Flag | Required | Default | Description |
|---|---|---|---|
| -i, --input | yes | - | Path to all_tests.json or a directory of them |
| -o, --output | yes | - | Output directory |
| -w, --workers | no | 1 | Maximum concurrent tests |
| -n, --retry | no | 0 | Retries per failed question |
| --scene-per-train | no | disabled | Trigger metaclaw train-step every N scenes |
| --memory | no | off | Trigger memory ingest after each scene |
| --memory-proxy-port | no | 30000 | MetaClaw proxy port for memory ingest |
Example
# Run from the project root (MetaClaw/)
metaclaw-bench run \
-i benchmark/data/metaclaw-bench/all_tests.json \
-o /tmp/run_out
Remove work/ isolation directories created by infer.
metaclaw-bench clean -p <root_dir>
Options
| Flag | Required | Description |
|---|---|---|
| -p, --path | yes | Root directory to search recursively for work/ directories |
Example
metaclaw-bench clean -p benchmark/data/metaclaw-bench
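To preview what a recursive search for work/ directories would find before cleaning, something like the following dry-run sketch could be used. The function `find_work_dirs` is hypothetical and only lists directories; it deletes nothing and is not the CLI's implementation.

```python
from pathlib import Path


def find_work_dirs(root: Path) -> list[Path]:
    """List every directory named "work" under root (dry run, no deletion)."""
    # rglob("work") matches entries named exactly "work" at any depth;
    # regular files with that name are filtered out.
    return sorted(p for p in root.rglob("work") if p.is_dir())
```

Passing each returned path to `shutil.rmtree` would then remove it, which is the destructive step and should be done with care.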