Bench Protocol v1 defines the runtime contract between BenchLocal and installable Bench Packs.
It covers:
- Bench Pack metadata
- the runtime entrypoint
- scenario metadata
- host context
- generation settings
- verifier endpoints
- progress events
- scenario results
Design principles:
- BenchLocal owns the shared desktop runtime
- Bench Packs own benchmark-specific behavior
- metadata is static and file-based
- runtime behavior is explicit and deterministic
- verifier dependencies are declared, not hardcoded
Each Bench Pack artifact must expose:
- benchlocal.pack.json
- dist/benchlocal/index.js
Optional runtime content:
- verification/
- README.md
- METHODOLOGY.md
File name:
benchlocal.pack.json
This file is the canonical Bench Pack metadata source.
Representative shape:
{
  "schemaVersion": 1,
  "protocolVersion": 1,
  "id": "bugfind-15",
  "name": "BugFind-15",
  "author": "stevibe",
  "version": "1.0.0",
  "description": "Execution-backed benchmark for bug finding and bug fixing.",
  "entry": "./dist/benchlocal/index.js",
  "samplingDefaults": {
    "temperature": 0
  },
  "capabilities": {
    "tools": false,
    "multiTurn": false,
    "streamingProgress": true,
    "verification": true
  },
  "verifiers": [
    {
      "id": "verifier",
      "transport": "http",
      "required": true,
      "defaultMode": "docker",
      "docker": {
        "buildContext": "./verification",
        "listenPort": 4010,
        "healthcheckPath": "/health"
      }
    }
  ]
}
Important fields:
- id
- name
- version
- entry
- capabilities
Common optional fields:
- author
- description
- repository
- theme
- samplingDefaults
- verifiers
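The important fields listed above lend themselves to a quick check before a pack is published. The following is a minimal sketch, not BenchLocal's own validator: the helper name `findMissingManifestFields` and the choice to treat exactly those five fields as required are assumptions made for illustration.

```typescript
// Hypothetical helper: not part of BenchLocal or @benchlocal/sdk.
// The required keys mirror the "Important fields" list in this document.
const REQUIRED_MANIFEST_FIELDS = ["id", "name", "version", "entry", "capabilities"] as const;

type ManifestLike = Record<string, unknown>;

// Returns the names of required fields that are absent or empty.
function findMissingManifestFields(manifest: ManifestLike): string[] {
  return REQUIRED_MANIFEST_FIELDS.filter((field) => {
    const value = manifest[field];
    return value === undefined || value === null || value === "";
  });
}

const exampleManifest: ManifestLike = {
  schemaVersion: 1,
  protocolVersion: 1,
  id: "bugfind-15",
  name: "BugFind-15",
  version: "1.0.0",
  entry: "./dist/benchlocal/index.js",
  capabilities: { tools: false, multiTurn: false, streamingProgress: true, verification: true },
};

// Complete manifest: nothing is reported.
const missingFields = findMissingManifestFields(exampleManifest);

// Same manifest with "entry" removed: only "entry" is reported.
const missingWithoutEntry = findMissingManifestFields({ ...exampleManifest, entry: undefined });
```

A check like this only confirms that the fields are present. It says nothing about whether `entry` points at a file that actually exists in the artifact.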
BenchLocal loads dist/benchlocal/index.js and expects a default export with this shape:
export interface BenchPackRuntime {
  manifest: BenchPackManifest;
  listScenarios(): Promise<ScenarioMeta[]>;
  prepare(context: HostContext): Promise<PreparedBenchPack>;
  scoreModelResults(results: ScenarioResult[]): BenchmarkScore;
}
export interface PreparedBenchPack {
  runScenario(input: ScenarioRunInput, emit: ProgressEmitter): Promise<ScenarioResult>;
  dispose(): Promise<void>;
}
prepare(context) is the point where the pack receives the resolved host state for a run session.
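To make the contract concrete, here is a minimal sketch of an object that satisfies the runtime shape. It is not a real Bench Pack. The local type definitions are simplified stand-ins for the protocol types, the `demo-pack` manifest and `demo-01` scenario are invented, and the unweighted-mean scoring rule is an assumption rather than anything the protocol prescribes.

```typescript
// Simplified local stand-ins; the real protocol types may carry more fields.
type ScenarioMeta = { id: string; title: string; category: string; description: string };
type ScenarioResult = {
  scenarioId: string;
  status: "pass" | "partial" | "fail";
  score?: number;
  summary: string;
  rawLog: string;
};
type BenchmarkScore = { totalScore: number; categories: Array<{ id: string; label: string; score: number }> };
type HostContext = Record<string, unknown>;
type ScenarioRunInput = { scenarioId: string };
type ProgressEmitter = (event: { type: string; scenarioId?: string }) => void;

interface PreparedBenchPack {
  runScenario(input: ScenarioRunInput, emit: ProgressEmitter): Promise<ScenarioResult>;
  dispose(): Promise<void>;
}

// In a real pack this object would be the default export of dist/benchlocal/index.js.
const runtime = {
  manifest: { id: "demo-pack", name: "Demo Pack", version: "0.0.1" },

  async listScenarios(): Promise<ScenarioMeta[]> {
    return [{ id: "demo-01", title: "Demo scenario", category: "demo", description: "Placeholder." }];
  },

  async prepare(_context: HostContext): Promise<PreparedBenchPack> {
    return {
      async runScenario(input: ScenarioRunInput, emit: ProgressEmitter): Promise<ScenarioResult> {
        emit({ type: "scenario_started", scenarioId: input.scenarioId });
        // A real pack would call the model and, if declared, the verifier here.
        const outcome: ScenarioResult = {
          scenarioId: input.scenarioId,
          status: "pass",
          score: 100,
          summary: "Placeholder pass.",
          rawLog: "",
        };
        emit({ type: "scenario_finished", scenarioId: input.scenarioId });
        return outcome;
      },
      async dispose(): Promise<void> {
        // Nothing to release in this sketch.
      },
    };
  },

  // Assumed rule: unweighted mean of per-scenario scores; a missing score counts as 0.
  scoreModelResults(results: ScenarioResult[]): BenchmarkScore {
    const total = results.reduce((sum, item) => sum + (item.score ?? 0), 0);
    const totalScore = results.length === 0 ? 0 : total / results.length;
    return { totalScore, categories: [{ id: "demo", label: "Demo", score: totalScore }] };
  },
};
```

Note that `scoreModelResults` is the only synchronous method in the contract. Keeping it a pure function of its input also makes it easy to unit-test without a host.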
listScenarios() returns the UI-visible metadata for each scenario.
Important fields:
- id
- title
- category
- description
- detailCards
detailCards power the structured scenario cards shown in the desktop UI, such as:
- What this tests
- Success case
- Failure case
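As an illustration, a single scenario entry might look like the sketch below. The protocol text names the fields but not their types, so the `DetailCard` structure (a title plus a body string) is an assumption, and the scenario itself is invented.

```typescript
// Assumed shape: only the field names come from the protocol text.
// The title/body structure of a detail card is hypothetical.
type DetailCard = { title: string; body: string };
type ScenarioMeta = {
  id: string;
  title: string;
  category: string;
  description: string;
  detailCards: DetailCard[];
};

// Invented example scenario, for illustration only.
const offByOneScenario: ScenarioMeta = {
  id: "example-01",
  title: "Off-by-one in loop bound",
  category: "logic",
  description: "Find and fix an off-by-one error in a pagination helper.",
  detailCards: [
    { title: "What this tests", body: "Whether the model locates a boundary error from a failing test." },
    { title: "Success case", body: "The patched function passes all verifier tests." },
    { title: "Failure case", body: "The patch leaves the bug in place or breaks other tests." },
  ],
};

// The card titles, in the order they would be displayed.
const cardTitles = offByOneScenario.detailCards.map((card) => card.title);
```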
BenchLocal provides a HostContext to prepare(context).
Key fields:
- benchPack: install and storage paths
- providers: resolved provider registry
- models: shared registered models
- secrets: resolved provider secrets from config or environment
- verifiers: resolved verifier endpoints and status
- logger: host logging bridge
Bench Packs should usually use the helpers from @benchlocal/sdk instead of reading the raw context manually.
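To show what reading the raw context involves, the sketch below looks up a verifier endpoint and fails fast when it is unavailable. Only the top-level names `verifiers` and `logger` come from this document. The nested shapes, the `healthy` flag, the example URL, and the helper `requireVerifierUrl` are all assumptions. In practice the SDK helpers are the better choice, since they track the real context shape.

```typescript
// Hypothetical, simplified view of HostContext. The nested field shapes
// are assumptions for illustration, not the protocol's definitions.
type ResolvedVerifier = { id: string; url: string; healthy: boolean };
type HostContextSketch = {
  verifiers: ResolvedVerifier[];
  logger: { info(message: string): void };
};

// Looks up a verifier by id and throws when it is absent or unhealthy.
function requireVerifierUrl(context: HostContextSketch, verifierId: string): string {
  const match = context.verifiers.filter((candidate) => candidate.id === verifierId)[0];
  if (!match) throw new Error(`Verifier "${verifierId}" is not resolved`);
  if (!match.healthy) throw new Error(`Verifier "${verifierId}" is not healthy`);
  context.logger.info(`Using verifier ${verifierId} at ${match.url}`);
  return match.url;
}

// Stand-in context with an invented host URL and an in-memory logger.
const logLines: string[] = [];
const sketchContext: HostContextSketch = {
  verifiers: [{ id: "verifier", url: "http://127.0.0.1:53127", healthy: true }],
  logger: { info: (message: string) => { logLines.push(message); } },
};

const verifierUrl = requireVerifierUrl(sketchContext, "verifier");
```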
Per-scenario generation settings arrive as:
type GenerationRequest = {
  temperature?: number;
  top_p?: number;
  top_k?: number;
  min_p?: number;
  repetition_penalty?: number;
  request_timeout_seconds?: number;
};
Behavior:
- if a field is present, the pack may forward it to the provider client
- if a field is omitted, the pack should not send that value
This allows:
- pack-level defaults from benchlocal.pack.json
- per-tab user overrides from BenchLocal
- omission of unsupported or unnecessary values
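The layering described above can be sketched as a small merge function. This is an illustration under assumptions, not BenchLocal's implementation: the helper name `resolveGeneration` is hypothetical, and the document does not say whether the host or the pack performs the merge.

```typescript
type GenerationRequest = {
  temperature?: number;
  top_p?: number;
  top_k?: number;
  min_p?: number;
  repetition_penalty?: number;
  request_timeout_seconds?: number;
};

const GENERATION_KEYS = [
  "temperature",
  "top_p",
  "top_k",
  "min_p",
  "repetition_penalty",
  "request_timeout_seconds",
] as const;

// Hypothetical helper: overrides win over defaults, and a key that is
// undefined in both is left out of the result entirely.
function resolveGeneration(defaults: GenerationRequest, overrides: GenerationRequest): GenerationRequest {
  const merged: GenerationRequest = {};
  for (const key of GENERATION_KEYS) {
    const value = overrides[key] !== undefined ? overrides[key] : defaults[key];
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

const packDefaults: GenerationRequest = { temperature: 0 };
const tabOverrides: GenerationRequest = { top_p: 0.9, top_k: undefined };

// Contains temperature and top_p only; top_k is absent rather than undefined.
const providerPayload = resolveGeneration(packDefaults, tabOverrides);
```

Dropping the key, rather than passing `undefined` through, matters because some provider clients serialize an explicit `undefined` or `null` differently from a missing field.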
Bench Packs emit deterministic progress events through emit.
Current event types:
- run_started
- scenario_started
- model_progress
- scenario_result
- scenario_finished
- run_finished
- run_error
BenchLocal stores these events for detached logs, status UI, and run history.
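A sketch of how a pack might emit a fixed event sequence for one scenario follows. The event names are taken from the list above. The payload fields (`scenarioId`, `message`), the ordering of `scenario_result` before `scenario_finished`, and the helper `emitScenarioLifecycle` are assumptions.

```typescript
// Event names come from the protocol's list; payload fields are assumed.
type ProgressEventType =
  | "run_started"
  | "scenario_started"
  | "model_progress"
  | "scenario_result"
  | "scenario_finished"
  | "run_finished"
  | "run_error";
type ProgressEvent = { type: ProgressEventType; scenarioId?: string; message?: string };
type ProgressEmitter = (event: ProgressEvent) => void;

// Emits the same ordered sequence on every call, so repeated runs produce
// identical event logs for the same scenario.
function emitScenarioLifecycle(scenarioId: string, emit: ProgressEmitter): void {
  emit({ type: "scenario_started", scenarioId });
  emit({ type: "model_progress", scenarioId, message: "Calling model" });
  emit({ type: "scenario_result", scenarioId, message: "pass" });
  emit({ type: "scenario_finished", scenarioId });
}

// A stand-in host that records events in arrival order.
const recordedEvents: ProgressEvent[] = [];
emitScenarioLifecycle("example-01", (progress) => { recordedEvents.push(progress); });

const recordedTypes = recordedEvents.map((entry) => entry.type).join(",");
```

Keeping timestamps and other volatile values out of the event payloads is one way to keep the stream deterministic; the host can add them when it stores the event.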
Each runScenario(...) call returns a ScenarioResult.
Representative shape:
type ScenarioResult = {
  scenarioId: string;
  status: "pass" | "partial" | "fail";
  score?: number;
  points?: number;
  summary: string;
  note?: string;
  rawLog: string;
  output?: ModelOutput;
  verifier?: VerifierResult;
  artifacts?: ArtifactRef[];
  timings?: {
    startedAt?: string;
    completedAt?: string;
    durationMs?: number;
  };
};
After a run completes, BenchLocal asks the pack to aggregate model-level results into a BenchmarkScore.
Representative shape:
type BenchmarkScore = {
  totalScore: number;
  categories: Array<{
    id: string;
    label: string;
    score: number;
    weight?: number;
  }>;
  summary?: string;
};
Verifier-dependent Bench Packs declare their verifier requirements in the manifest.
BenchLocal owns:
- verifier mode selection
- Docker lifecycle
- dynamic host port assignment
- health checks
- status reporting
Bench Packs own:
- verifier implementation
- verifier request and response contract
- use of the resolved verifier URL
listenPort is the internal verifier port inside the container. BenchLocal assigns the host port automatically.
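The practical consequence is that a pack should build verifier requests from the resolved URL and never from listenPort. The sketch below contrasts the two. The helper `joinUrl` is hypothetical, and the host address with port 53127 is an invented example of what BenchLocal might assign.

```typescript
// Values taken from the manifest example in this document.
const verifierDecl = { listenPort: 4010, healthcheckPath: "/health" };

// Hypothetical helper: joins a base URL and a path without producing
// duplicate or missing slashes.
function joinUrl(baseUrl: string, path: string): string {
  return baseUrl.replace(/\/+$/, "") + "/" + path.replace(/^\/+/, "");
}

// Invented example of a resolved endpoint. The real value is whatever
// BenchLocal assigned for the current session.
const resolvedBaseUrl = "http://127.0.0.1:53127/";

// Correct: derived from the resolved URL.
const healthUrl = joinUrl(resolvedBaseUrl, verifierDecl.healthcheckPath);

// Anti-pattern: assumes the host port equals the container's listenPort.
const hardcodedUrl = `http://127.0.0.1:${verifierDecl.listenPort}/health`;
```

The hardcoded form may appear to work on a developer machine where the ports happen to match, then fail once the host assigns a different port.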
The codebase still carries a few sidecar aliases for backward compatibility.
Public protocol terminology should use verifier and verifiers, not sidecar.