
Commit 5d610e7

feat(bench): add DABStep adapter and SDK compat
* feat(bench): add DABStep adapter
* docs(api): refresh runtime docs
* docs(api): stabilize runtime summary
1 parent 9455ff9 commit 5d610e7

18 files changed

Lines changed: 593 additions & 82 deletions

bench/HARNESS.md

Lines changed: 1 addition & 1 deletion
@@ -202,7 +202,7 @@ in a `.venv`/Docker subprocess → parse its JSON report → `{resolved,score}`)
 copy of the process/venv/Docker/temp/report plumbing; commit0+appworld also share its
 stdin-piping runner (`runVenvScriptStdin`).
 - **Real, runnable with ZERO extra deps:** finsearchcomp (GitHub dataset + fixtures + LLM judge — the gate bench), hotpotqa + simpleqa + frames (HF/web QA + F1/LLM judge; `*_FIXTURES=1` offline), **aec-bench** (real GitHub task tree + fixtures; judge = the task's own `tests/verify.py` over python3 stdlib — **deterministic, graded per-field partial credit, no Docker, no LLM** → the candidate non-oracle correctable-middle-band bench for the open gate).
-- **Real code, needs an external harness/tools to run (fail loud with the exact install/Docker fix; never a fabricated score):** swe-bench + terminal-bench (`bench/.venv` + Docker), **commit0** (ISOLATED `bench/.venv-commit0` via `python3 -m venv bench/.venv-commit0 && bench/.venv-commit0/bin/pip install commit0 datasets` — its deps conflict with the shared `.venv`; override dir with `COMMIT0_VENV` — plus Docker; judge = official pytest harness, graded (passed+xfail)/total; the rollout prompt stages in-box (clones `commit-0/<repo>` @ `base_commit`, emits `git diff`); `COMMIT0_FIXTURES=1` for offline listing), **programbench** (`pip install programbench` + Docker on linux/amd64 + HF blobs; judge = official cleanroom eval, graded passed/total; `PROGRAMBENCH_FIXTURES=1` offline), **appworld** (`pip install appworld` + `appworld install` + `appworld download data`; judge = AppWorld's own `world.evaluate()`, graded passes/num_tests — NO committed fixture: task data exists only after `download data`, so loadTasks fails loud rather than fabricate a task), mind2web, cad-design + cadbench + cadgenbench (openscad/blender/build123d).
+- **Real code, needs an external harness/tools to run (fail loud with the exact install/Docker fix; never a fabricated score):** swe-bench + terminal-bench (`bench/.venv` + Docker), **commit0** (ISOLATED `bench/.venv-commit0` via `python3 -m venv bench/.venv-commit0 && bench/.venv-commit0/bin/pip install commit0 datasets` — its deps conflict with the shared `.venv`; override dir with `COMMIT0_VENV` — plus Docker; judge = official pytest harness, graded (passed+xfail)/total; the rollout prompt stages in-box (clones `commit-0/<repo>` @ `base_commit`, emits `git diff`); `COMMIT0_FIXTURES=1` for offline listing), **programbench** (`pip install programbench` + Docker on linux/amd64 + HF blobs; judge = official cleanroom eval, graded passed/total; `PROGRAMBENCH_FIXTURES=1` offline), **appworld** (`pip install appworld` + `appworld install` + `appworld download data`; judge = AppWorld's own `world.evaluate()`, graded passes/num_tests — NO committed fixture: task data exists only after `download data`, so loadTasks fails loud rather than fabricate a task), **dabstep** (`DABSTEP_DIR=/path/to/EnvCommons/DABStep` with the released `dataset.csv`, `splits/*.txt`, `files/*`, and `grade.py`; judge delegates to official `grade.py`; `DABSTEP_FIXTURES=1` only tests adapter plumbing and does not fabricate benchmark scores), mind2web, cad-design + cadbench + cadgenbench (openscad/blender/build123d).
 - **goldArtifact:** aec-bench returns the task's real `golden_pass.md` (verify-judge works fully offline). commit0 / programbench / appworld return `undefined` — the oracle is a git ref / stripped source / engine-bundled solution, not a portable string; judge correctness is proven by a real solve through the harness, not a synthetic gold (documented + fail-loud, not a fake).
 - **Absent (not built):** swe-gym, swe-bench-multimodal, and the rest of the survey set.
 Every unbuilt/scaffold adapter fails LOUD (throws with the integration step) rather than faking a score — no silent zeros in any corpus. Offline fixture tests: `benchmarks/{aec-bench,commit0,programbench,appworld}.test.mts` (`tsx --test`).
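
The new dabstep entry lists the assets expected under `DABSTEP_DIR`. A minimal pre-flight sketch of that check follows, in the fail-loud spirit of the harness. The helper names `missing_dabstep_assets` and `require_dabstep_dir` are hypothetical and not part of the commit; only the environment variable and the asset names come from the HARNESS.md entry above.

```python
import os
from pathlib import Path


def missing_dabstep_assets(root: Path) -> list[str]:
    """Return the released assets absent under root (an empty list means complete)."""
    missing = []
    # Single files named in the HARNESS.md entry.
    for name in ("dataset.csv", "grade.py"):
        if not (root / name).is_file():
            missing.append(name)
    # Glob-style entries: require at least one match.
    if not any((root / "splits").glob("*.txt")):
        missing.append("splits/*.txt")
    files_dir = root / "files"
    if not files_dir.is_dir() or not any(files_dir.iterdir()):
        missing.append("files/*")
    return missing


def require_dabstep_dir(env=os.environ) -> Path:
    """Resolve DABSTEP_DIR, or raise naming exactly what is missing."""
    raw = env.get("DABSTEP_DIR")
    if not raw:
        raise RuntimeError("DABSTEP_DIR is not set; point it at an EnvCommons/DABStep checkout")
    root = Path(raw)
    missing = missing_dabstep_assets(root)
    if missing:
        raise RuntimeError(f"DABSTEP_DIR={root} is missing: {', '.join(missing)}")
    return root
```

Raising with the list of missing assets, rather than returning an empty task list, keeps an incomplete checkout from being mistaken for a zero score.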

bench/fixtures/dabstep.json

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+[
+  {
+    "task_id": 1,
+    "instructions": "Using the payment files, answer this calibration task with the exact integer 42.",
+    "all_golds_by_task": [
+      {
+        "kind": "number",
+        "value": 42.0
+      }
+    ]
+  },
+  {
+    "task_id": 2,
+    "instructions": "Using the payment files, answer this calibration task with the card scheme nexpay.",
+    "all_golds_by_task": [
+      {
+        "kind": "scheme",
+        "value": "nexpay"
+      }
+    ]
+  }
+]
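
To make the fixture's shape concrete, here is a small sketch that parses the two calibration tasks and checks the keys they carry. The field names come from the fixture itself. `load_fixture_tasks` and `to_judge_payload` are hypothetical helpers. Mapping `all_golds_by_task` onto the `golds` list of the judge bridge's stdin payload is an assumption, since the adapter source is not shown in this diff.

```python
import json

# The two calibration tasks from bench/fixtures/dabstep.json, compacted.
FIXTURE_JSON = """
[
  {"task_id": 1,
   "instructions": "Using the payment files, answer this calibration task with the exact integer 42.",
   "all_golds_by_task": [{"kind": "number", "value": 42.0}]},
  {"task_id": 2,
   "instructions": "Using the payment files, answer this calibration task with the card scheme nexpay.",
   "all_golds_by_task": [{"kind": "scheme", "value": "nexpay"}]}
]
"""

REQUIRED_KEYS = ("task_id", "instructions", "all_golds_by_task")


def load_fixture_tasks(text: str) -> list[dict]:
    """Parse fixture JSON and reject tasks that lack the expected keys."""
    tasks = json.loads(text)
    for index, task in enumerate(tasks):
        missing = [key for key in REQUIRED_KEYS if key not in task]
        if missing:
            raise ValueError(f"task #{index} is missing {missing}")
        for gold in task["all_golds_by_task"]:
            if "kind" not in gold or "value" not in gold:
                raise ValueError(f"task {task['task_id']} has a gold without kind/value")
    return tasks


def to_judge_payload(task: dict, prediction: str) -> str:
    """Assumed mapping: the task's golds become the bridge's `golds` list."""
    return json.dumps({"prediction": prediction, "golds": task["all_golds_by_task"]})
```

For example, `to_judge_payload(load_fixture_tasks(FIXTURE_JSON)[1], "nexpay")` yields a JSON object with the `prediction` and `golds` keys that `dabstep_judge.py` reads from stdin.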

bench/package.json

Lines changed: 6 additions & 2 deletions
@@ -2,7 +2,7 @@
   "name": "@tangle-network/agent-bench",
   "version": "0.1.0",
   "type": "module",
-  "description": "The unified benchmark suite for agent-runtime agents: 18 adapters (commit0, enterpriseops-gym, trata-hedge, finsearchcomp, swe-bench, humaneval, …) behind one resolveAdapter registry, each with a real deterministic judge. Score any profile/skill/prompt change against them. Map: bench/HARNESS.md.",
+  "description": "The unified benchmark suite for agent-runtime agents: 19 adapters (commit0, enterpriseops-gym, trata-hedge, finsearchcomp, dabstep, swe-bench, humaneval, …) behind one resolveAdapter registry, each with a real deterministic judge. Score any profile/skill/prompt change against them. Map: bench/HARNESS.md.",
   "main": "src/index.ts",
   "types": "src/index.ts",
   "exports": {
@@ -18,7 +18,7 @@
   },
   "dependencies": {
     "@tangle-network/agent-eval": "^0.100.0",
-    "@tangle-network/agent-runtime": "^0.78.0",
+    "@tangle-network/agent-runtime": "^0.79.3",
     "@tangle-network/sandbox": "^0.9.3"
   },
   "devDependencies": {
@@ -27,6 +27,10 @@
   },
   "files": [
     "src",
+    "fixtures",
+    "scripts",
+    "tb_agents/*.py",
+    "steerers",
     "README.md"
   ],
   "publishConfig": {

bench/pnpm-lock.yaml

Lines changed: 76 additions & 49 deletions
Some generated files are not rendered by default.

bench/scripts/dabstep_judge.py

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+#!/usr/bin/env python3
+"""DABStep judge bridge.
+
+Reads {"prediction": str, "golds": list} from stdin and delegates scoring to
+the official DABStep grade.py module. This script owns no grading semantics.
+"""
+
+import argparse
+import importlib.util
+import json
+import sys
+from pathlib import Path
+
+
+def load_grade(grade_file: Path):
+    spec = importlib.util.spec_from_file_location("dabstep_grade", grade_file)
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"could not import DABStep grade file: {grade_file}")
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)
+    return module.grade
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Score one DABStep answer")
+    parser.add_argument("--grade-file", required=True)
+    args = parser.parse_args()
+
+    try:
+        payload = json.loads(sys.stdin.read())
+        prediction = payload["prediction"]
+        golds = payload["golds"]
+        correct = bool(load_grade(Path(args.grade_file))(prediction, golds))
+        print(json.dumps({"correct": correct, "score": 1.0 if correct else 0.0}))
+        return 0
+    except Exception as exc:
+        print(json.dumps({"error": str(exc)}))
+        return 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())

bench/src/adapters.ts

Lines changed: 2 additions & 0 deletions
@@ -11,6 +11,7 @@ import { createCadBenchAdapter } from './benchmarks/cadbench'
 import { createCadDesignAdapter } from './benchmarks/cad-design'
 import { createCadGenBenchAdapter } from './benchmarks/cadgenbench'
 import { createCommit0Adapter } from './benchmarks/commit0'
+import { createDabstepAdapter } from './benchmarks/dabstep'
 import { createEnterpriseOpsGymAdapter } from './benchmarks/enterpriseops-gym'
 import { createFinsearchcompAdapter } from './benchmarks/finsearchcomp'
 import { createFramesAdapter } from './benchmarks/frames'
@@ -32,6 +33,7 @@ export const ADAPTERS: Record<string, () => BenchmarkAdapter> = {
   // delegates to the benchmark's own harness and fails loud when it/Docker is absent.
   'aec-bench': createAecBenchAdapter,
   commit0: createCommit0Adapter,
+  dabstep: createDabstepAdapter,
   programbench: createProgrambenchAdapter,
   appworld: createAppWorldAdapter,
   // AppWorld's native interactive protocol — the worker is the in-engine ReAct
