Reference implementations of BenchmarkAdapter for two public benchmarks. They are NOT bundled — they're intentionally shipped as source you read, copy, and adapt.
| Wrapper | What it does | Why it's an example, not core |
|---|---|---|
| `gsm8k/` | Exact-match grading on the final numeric answer of GSM8K (Cobbe et al.) | The dataset isn't ours and isn't bundled. The wrapper points to a local JSONL via `AGENT_EVAL_GSM8K_PATH`. |
| `swebench-lite/` | Pass/fail grading via an external SWE-Bench grader command | The grader is a separate binary; the wrapper stubs the integration via `AGENT_EVAL_SWEBENCH_GRADER_CMD`. |
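To make "exact-match grading on the final numeric answer" concrete, here is a minimal sketch. It is not the code in `gsm8k/`; the helper names (`lastNumber`, `referenceAnswer`, `gradeExactMatch`) are hypothetical. It relies on the fact that GSM8K reference solutions end with a `#### <answer>` line, and it assumes the model's answer is the last number in its response.

```typescript
// Illustrative sketch only; helper names are hypothetical, not the wrapper's API.

/** Return the last number found in `text`, or null if there is none. */
function lastNumber(text: string): number | null {
  // Drop thousands separators such as "1,200" before matching.
  const cleaned = text.replace(/(\d),(?=\d{3})/g, '$1');
  const matches = cleaned.match(/-?\d+(?:\.\d+)?/g);
  if (!matches) return null;
  return Number(matches[matches.length - 1]);
}

/** GSM8K reference solutions end with a line of the form "#### <answer>". */
function referenceAnswer(solution: string): number {
  const marker = solution.lastIndexOf('####');
  if (marker === -1) throw new Error('reference solution has no "####" marker');
  const value = lastNumber(solution.slice(marker + 4));
  if (value === null) throw new Error('no number after the "####" marker');
  return value;
}

/** Score 1 when the response's final number equals the reference, else 0. */
function gradeExactMatch(response: string, solution: string): number {
  const predicted = lastNumber(response);
  return predicted !== null && predicted === referenceAnswer(solution) ? 1 : 0;
}
```

A response with no number at all scores 0 rather than throwing, while a malformed reference solution throws, since that points to a dataset problem rather than a wrong answer.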
The novel benchmark we ship and own — the synthetic routing task — lives in src/benchmarks/routing/ and IS in the bundle.
Read the wrappers and inline them. Copy the wrapper file into your project, then replace
imports such as `../../../src/benchmarks/types` and `../../../src/run-record`
with `@tangle-network/agent-eval`. These examples are repository source, not
published npm subpaths.
- `loadDataset(split: 'search' | 'dev' | 'holdout'): Promise<DatasetItem[]>`
- `evaluate(item, response): Promise<{ score: number, raw: Record<string, unknown> }>`
- `assignSplit(itemId: string): 'search' | 'dev' | 'holdout'`

`assignSplit` uses `deterministicSplit(itemId, BENCHMARK_SPLIT_SEED)`, so the same item gets the same split everywhere. Don't change the seed; it's load-bearing for reproducibility.
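To illustrate why a seeded, hash-based split is reproducible, here is a minimal sketch. It is not the implementation of `deterministicSplit`: the hash (a 32-bit FNV-1a variant over UTF-16 code units), the `EXAMPLE_SEED` value, and the 60/20/20 proportions are all assumptions chosen for illustration.

```typescript
// Illustrative sketch only; not how deterministicSplit is implemented.
type Split = 'search' | 'dev' | 'holdout';

// Hypothetical seed, standing in for BENCHMARK_SPLIT_SEED.
const EXAMPLE_SEED = 'example-seed';

/** 32-bit FNV-1a hash, applied to UTF-16 code units instead of bytes. */
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // reinterpret as unsigned
}

/** Map an item ID to a split using only the ID and the seed. */
function exampleSplit(itemId: string, seed: string): Split {
  const bucket = fnv1a(`${seed}:${itemId}`) % 100;
  if (bucket < 60) return 'search'; // assumed proportion
  if (bucket < 80) return 'dev'; // assumed proportion
  return 'holdout';
}
```

Because the result depends only on the item ID and the seed, every machine and every run agrees on the assignment. Changing the seed reshuffles items across splits, which is why the seed must stay fixed.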
- Create `examples/benchmarks/<your-benchmark>/index.ts`.
- Export `loadDataset`, `evaluate`, `assignSplit`. Optionally a typed `Adapter` class.
- Use `deterministicSplit` from `@tangle-network/agent-eval` for split assignment.
- Fail loud on missing config (env vars, paths). Never default to silent-pass.
- Document config requirements in a per-benchmark README.
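The checklist above can be sketched as a small adapter module. Everything specific here is hypothetical: the `toy-math` benchmark, the `TOY_MATH_PATH` variable, the `ToyItem` shape, and the assumption that `response` is a string. A real wrapper would import `deterministicSplit` and the item types from `@tangle-network/agent-eval`; local stand-ins are used so the sketch is self-contained.

```typescript
// Illustrative sketch for a hypothetical "toy-math" benchmark.
type Split = 'search' | 'dev' | 'holdout';
interface ToyItem { id: string; prompt: string; expected: string }
type Env = Record<string, string | undefined>;

/** Fail loud: throw instead of falling back to a default when config is missing. */
function requireConfig(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`${name} is not set; refusing to run the benchmark without it`);
  }
  return value;
}

/** Stand-in for deterministicSplit: a pure function of the item ID. */
function assignSplit(itemId: string): Split {
  let sum = 0;
  for (let i = 0; i < itemId.length; i++) sum += itemId.charCodeAt(i);
  const bucket = sum % 3;
  if (bucket === 0) return 'search';
  if (bucket === 1) return 'dev';
  return 'holdout';
}

/** Build the adapter; `readText` is injected so the sketch needs no file-system import. */
function createToyAdapter(env: Env, readText: (path: string) => Promise<string>) {
  // Resolve config up front so a missing variable fails before any item is scored.
  const datasetPath = requireConfig(env, 'TOY_MATH_PATH');

  return {
    assignSplit,
    async loadDataset(split: Split): Promise<ToyItem[]> {
      const text = await readText(datasetPath);
      const items = text
        .split('\n')
        .filter((line) => line.trim() !== '')
        .map((line) => JSON.parse(line) as ToyItem); // one JSON object per line
      return items.filter((item) => assignSplit(item.id) === split);
    },
    async evaluate(
      item: ToyItem,
      response: string,
    ): Promise<{ score: number; raw: Record<string, unknown> }> {
      const matched = response.trim() === item.expected;
      return { score: matched ? 1 : 0, raw: { expected: item.expected, response } };
    },
  };
}
```

In a Node.js project you would call `createToyAdapter(process.env, ...)` with a reader built on the file-system API. Constructing the adapter without `TOY_MATH_PATH` throws immediately, which is the "fail loud" behavior the checklist asks for.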
If your benchmark is novel and broadly useful, propose moving it into src/benchmarks/ as core surface (PR welcome). The bar is: novel rubric, reusable across projects, low maintenance burden.