# @tangle-network/agent-eval

**A library for deciding whether an LLM-driven generator did its job.**

You hand it the thing the generator produced — a code scaffold, a patch, a tweet, a JSON config — and you get back a structured verdict: pass/fail, dimension scores, plain-English rationale. Built to catch the LLM failure modes that LLM-as-judge alone misses.

```ts
import { BuilderSession, SubprocessSandboxDriver, InMemoryTraceStore } from '@tangle-network/agent-eval'

const scaffoldDir = './generated-app' // wherever your generator wrote its output

const session = new BuilderSession(new InMemoryTraceStore(), { projectId: 'my-app' }, new SubprocessSandboxDriver())
await session.startChat()
const ship = await session.ship({
  harness: { setupCommand: 'pnpm install', testCommand: 'pnpm exec tsc --noEmit', cwd: scaffoldDir, timeoutMs: 180_000 },
})
console.log(ship.result.passed, ship.result.score)
```

## Who this is for

- You ship a code generator (scaffolder, patcher, refactor agent) and need to gate on whether its output actually works.
- You ship a content generator and need a quality signal beyond "the LLM said it's good".
- You want a release gate that fails on regressions you can name, not vibes.

If that's you, start with [`docs/concepts.md`](./docs/concepts.md) — a 5-minute mental model — then come back here.
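
The release-gate idea is mechanical once verdicts are structured. Here is a minimal sketch of that pattern (the `Verdict` shape and `gate` helper are illustrative, not part of this library's API):

```typescript
// Illustrative only: a gate that consumes pass/fail + score verdicts
// and names each regression instead of returning a bare boolean.
type Verdict = { passed: boolean; score: number; failureModes: string[] }

function gate(verdicts: Verdict[], threshold = 0.8): { ok: boolean; reasons: string[] } {
  const reasons = verdicts
    .filter(v => !v.passed || v.score < threshold)
    .flatMap(v => (v.failureModes.length > 0 ? v.failureModes : ['score below threshold']))
  return { ok: reasons.length === 0, reasons }
}

const verdict = gate([
  { passed: true, score: 0.91, failureModes: [] },
  { passed: false, score: 0.4, failureModes: ['hallucinated API'] },
])
console.log(verdict.ok, verdict.reasons) // fails, naming the regression
```

The point is the `reasons` array: a CI log that says `hallucinated API` is actionable in a way that a bare exit code is not.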

## Quickstart

### From any language: HTTP or RPC

The fastest path. agent-eval ships a CLI that runs as either an HTTP server or a stdio RPC binary. Drive it from Python, Rust, Go, anything.

```sh
npm i -g @tangle-network/agent-eval

# HTTP — long-running
agent-eval serve --port 5005

# stdio RPC — one-shot, batch
echo '{"rubricName":"anti-slop","content":"…"}' | agent-eval rpc judge
```

Python:

```sh
pip install tangle-agent-eval
```

```python
from tangle_agent_eval import Client

c = Client()
r = c.judge(content="our scaffold ships zero-copy IO", rubric_name="anti-slop")
print(r.composite, r.failure_modes)
```

See [`docs/wire-protocol.md`](./docs/wire-protocol.md) for the full surface.

### From TypeScript: import directly

In-process; no wire round-trip. Use this when your eval lives in the same Node process as your generator.

```sh
pnpm add @tangle-network/agent-eval
```

The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](./.claude/skills/agent-eval/SKILL.md#minimal-working-path-builder-of-builders).

## Two ways to read this repo

- **You're a human onboarding** — read [`docs/concepts.md`](./docs/concepts.md) for the mental model, then [`docs/wire-protocol.md`](./docs/wire-protocol.md) if you'll call from another language, or `SKILL.md` if you'll embed in TS.
- **You're an LLM agent writing integration code** — read `SKILL.md`. Every directive there encodes a shipped bug; skipping one reintroduces the bug class.

## What's in the box

| Module | What it does | Doc |
|---|---|---|
| `BuilderSession` | Three-layer eval orchestrator (builder → app-build → app-runtime) for code generators. | concepts.md §three-layer eval |
| `MultiLayerVerifier` | Pipeline of layers (install → typecheck → build → semantic). Skip-on-fail, weighted aggregate. | concepts.md §verifiers |
| `judges`, `createCustomJudge`, `createAntiSlopJudge` | LLM and deterministic judges. | SKILL.md |
| Wire protocol (`agent-eval serve` / `rpc`) | HTTP and stdio RPC interface for cross-language clients. | wire-protocol.md |
| `clients/python/` | First-party Python client (`tangle-agent-eval` on PyPI). Version-locked to npm. | clients/python/README.md |
| `BenchmarkRunner`, `executeScenario`, `ConvergenceTracker` | Multi-turn scenario execution plus cross-run tracking. | SKILL.md |
| `ExperimentTracker`, `PromptOptimizer`, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
| Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |

## Tech stack

- TypeScript strict, no semicolons, single quotes, 2-space indent
- `tsup` for bundling, `vitest` for tests
- `@tangle-network/tcloud` for LLM calls (judges, driver)
- `hono` + `@asteasolutions/zod-to-openapi` for the wire protocol

## Develop

```sh
pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi # writes dist/openapi.json from the wire schemas

# Run the server locally
node dist/cli.js serve --port 5005

# Python client tests (run pnpm build first)
cd clients/python && pip install -e ".[dev]" && pytest
```

## Release

`@tangle-network/agent-eval` (npm) and `tangle-agent-eval` (PyPI) ship from the same git tag in the same CI workflow. If either fails to publish, neither does. Versions are locked.

## Related

- [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)