Example benchmark wrappers

Reference implementations of BenchmarkAdapter for two public benchmarks. They are NOT bundled — they're intentionally shipped as source you read, copy, and adapt.

| Wrapper | What it does | Why it's an example, not core |
| --- | --- | --- |
| `gsm8k/` | Exact-match grading on the final numeric answer of GSM8K (Cobbe et al.) | The dataset isn't ours and isn't bundled. The wrapper points to a local JSONL via `AGENT_EVAL_GSM8K_PATH`. |
| `swebench-lite/` | Pass/fail grading via an external SWE-Bench grader command | The grader is a separate binary; the wrapper stubs the integration via `AGENT_EVAL_SWEBENCH_GRADER_CMD`. |
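
To make "exact-match grading on the final numeric answer" concrete, here is a minimal sketch of how such a grader could work. It is not the code in `gsm8k/`: the names `extractFinalNumber` and `gradeExactMatch` are hypothetical, and the sketch assumes reference solutions end with GSM8K's `#### <answer>` marker while model responses are free-form text whose last number is the answer.

```typescript
// Illustrative sketch only; helper names are hypothetical, not the wrapper's API.

/** Pull the final numeric answer out of free-form text, or null if there is none. */
function extractFinalNumber(text: string): number | null {
  // GSM8K reference solutions end with "#### <answer>"; prefer that marker when present.
  const marker = text.lastIndexOf("####");
  const tail = marker >= 0 ? text.slice(marker + 4) : text;
  // Signed integers or decimals, allowing thousands separators such as 1,234.
  const matches = tail.match(/-?\d[\d,]*(?:\.\d+)?/g);
  if (!matches) return null;
  const last = matches[matches.length - 1].replace(/,/g, "");
  const value = Number(last);
  return Number.isFinite(value) ? value : null;
}

/** Score 1 when the response's final number equals the reference's, else 0. */
function gradeExactMatch(
  reference: string,
  response: string,
): { score: number; raw: Record<string, unknown> } {
  const expected = extractFinalNumber(reference);
  // A reference without a number is a dataset problem, so fail loud instead of scoring.
  if (expected === null) throw new Error("Reference has no numeric answer.");
  const predicted = extractFinalNumber(response);
  const score = predicted !== null && predicted === expected ? 1 : 0;
  return { score, raw: { expected, predicted } };
}
```

A response with no number at all scores 0 rather than throwing, because that is a model failure, not a configuration failure.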

The novel benchmark we ship and own — the synthetic routing task — lives in src/benchmarks/routing/ and IS in the bundle.

Using these wrappers

Read and inline them. Copy the wrapper file into your project, then replace imports such as ../../../src/benchmarks/types and ../../../src/run-record with @tangle-network/agent-eval. These examples are repository source, not published npm subpaths.

What every BenchmarkAdapter exports

loadDataset(split: 'search' | 'dev' | 'holdout'): Promise<DatasetItem[]>
evaluate(item, response): Promise<{ score: number, raw: Record<string, unknown> }>
assignSplit(itemId: string): 'search' | 'dev' | 'holdout'

assignSplit uses deterministicSplit(itemId, BENCHMARK_SPLIT_SEED) — same item gets the same split everywhere. Don't change the seed; it's load-bearing for reproducibility.
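
The sketch below illustrates why a seeded, hash-based split is stable: the split depends only on the item ID and the seed. It is an assumption-laden illustration, not the implementation of `deterministicSplit`. The hash (FNV-1a), the 60/20/20 ratios, and the names `fnv1a32` and `sketchSplit` are all chosen here for demonstration.

```typescript
// Illustrative sketch only; NOT the library's deterministicSplit.
type Split = "search" | "dev" | "holdout";

/** 32-bit FNV-1a over UTF-16 code units (close enough to the byte version for a sketch). */
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash >>> 0;
}

/** Map an item ID to a split using only the ID and the seed. Ratios are assumed. */
function sketchSplit(itemId: string, seed: number): Split {
  const bucket = fnv1a32(`${seed}:${itemId}`) % 100;
  if (bucket < 60) return "search";
  if (bucket < 80) return "dev";
  return "holdout";
}
```

Because nothing but the ID and seed feeds the hash, the same item lands in the same split on every machine and every run. Changing the seed reshuffles every item, which is why the seed should stay fixed.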

Adding a new benchmark

1. Create examples/benchmarks/<your-benchmark>/index.ts.
2. Export loadDataset, evaluate, and assignSplit. Optionally, also export a typed Adapter class.
3. Use deterministicSplit from @tangle-network/agent-eval for split assignment.
4. Fail loud on missing config (env vars, paths). Never default to silent-pass.
5. Document config requirements in a per-benchmark README.
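
One way to apply the "fail loud" rule is sketched below. Everything in it is hypothetical: the helper names `requireEnv` and `parseJsonl`, the `SketchItem` shape, and the `MY_BENCH_DATA_PATH` variable are illustrations, not part of `@tangle-network/agent-eval`. The environment is passed in as a plain object so the snippet stays self-contained.

```typescript
// Illustrative sketch only; names and item shape are hypothetical.
type Env = Record<string, string | undefined>;

/** Return a required config value, or throw with an actionable message. */
function requireEnv(name: string, env: Env): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required config: set ${name} before running this benchmark.`);
  }
  return value;
}

interface SketchItem {
  id: string;
  question: string;
  answer: string;
}

/** Parse JSONL text into items, rejecting malformed rows and empty datasets. */
function parseJsonl(text: string): SketchItem[] {
  const items: SketchItem[] = [];
  text.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // tolerate blank lines
    const row = JSON.parse(line) as Partial<SketchItem> | null; // throws on invalid JSON
    if (
      row === null ||
      typeof row.id !== "string" ||
      typeof row.question !== "string" ||
      typeof row.answer !== "string"
    ) {
      throw new Error(`Line ${index + 1}: expected string fields id, question, answer.`);
    }
    items.push(row as SketchItem);
  });
  // An empty dataset would score as a vacuous pass, so treat it as an error.
  if (items.length === 0) throw new Error("Dataset is empty.");
  return items;
}
```

In a Node.js wrapper, a call such as `requireEnv("MY_BENCH_DATA_PATH", process.env)` at the top of loadDataset would stop the run before any item is graded.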

If your benchmark is novel and broadly useful, propose moving it into src/benchmarks/ as core surface (PR welcome). The bar is: novel rubric, reusable across projects, low maintenance burden.
