Commit fa8fc94

feat: wire-protocol module + first-party Python client (#11)
* feat(wire): HTTP + stdio RPC + OpenAPI emission for cross-language clients

  Adds src/wire/ — Zod schemas as the single contract, pure handler functions, Hono HTTP server, stdio RPC for batch use, and OpenAPI 3.1 emission. CLI binary (agent-eval) wraps both transports.

  Schemas: JudgeRequest, JudgeResult, Rubric, RubricDimension, FailureMode, ListRubricsResponse, VersionResponse, ErrorResponse — all with .describe() field-level docs that flow through to the generated OpenAPI. Methods exposed: judge, listRubrics, version.

  Built-in rubric: anti-slop (voice/signal quality for technical-buyer audiences). Inline rubrics are also accepted via the same endpoint. Both transports route to identical handlers. The TS runtime is the source of truth; clients in other languages are generated from openapi.json.

  24 new tests, 576/576 pass.

* feat(python): tangle-agent-eval client — HTTP + subprocess, version-locked to npm

  clients/python/ — pip-installable as `tangle-agent-eval`, version-locked to @tangle-network/agent-eval. Thin transport adapter: every judgement runs in the Node runtime, marshalled over HTTP or stdio RPC. No Python-side eval logic — preventing drift by construction.

  API:
  - Client.judge(content=..., rubric_name="anti-slop") -> JudgeResult
  - Client.list_rubrics() -> ListRubricsResponse
  - Client.version() -> VersionResponse

  Auto-detects an HTTP server, falls back to subprocess. pydantic v2 models mirror the Zod schemas; a mutual-exclusion refinement (rubric_name XOR rubric) validates client-side before any transport fires.

  11/11 tests pass — including 4 real subprocess integration tests against the bundled CLI (no mocks).

* docs: educational entry points — concepts.md, wire-protocol.md, README rewrite

  Drew's directive: SKILL.md is dense by design (footgun bible) and overloaded as the sole onboarding doc. Split the audiences:

  - README.md: human entry point. What it is, who it's for, 30-second quickstart for both wire-protocol and in-process TS use.
  - docs/concepts.md (NEW): 5-minute mental model. Vocabulary table, three-layer eval explained, rubric + verifier basics, trace model.
  - docs/wire-protocol.md (NEW): full HTTP/RPC reference with request/response examples for every endpoint. Adding-a-method recipe.
  - SKILL.md: vocabulary section added at the top so agents have plain-English definitions of every term used in the directives.
  - CLAUDE.md: split-audience pointer instead of a single SKILL.md redirect.

  Principle: every term defined in plain English with one example. Every endpoint has copy-pasteable input/output. No jargon left undefined.

* ci: dual-publish workflow — npm + PyPI version-locked

  The verify job typechecks JS, runs JS tests, builds, emits OpenAPI, version-locks the npm and PyPI package versions, installs the Python client, runs its tests (including real subprocess integration against dist/cli.js), and uploads artifacts. publish-npm depends on verify; publish-pypi depends on publish-npm. If anything breaks, neither package ships. PyPI uses trusted publishing (OIDC); npm uses NPM_TOKEN.
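For orientation, a minimal sketch of the Python client surface described above. Method names and the `composite` / `failure_modes` fields come from this commit message and the README quickstart; everything else (the sample content, the comments) is illustrative rather than shipped code.

```python
from tangle_agent_eval import Client

client = Client()             # auto-detects a running HTTP server, else falls back to the subprocess transport
print(client.version())       # VersionResponse: the version is locked to the npm package
print(client.list_rubrics())  # ListRubricsResponse: includes the built-in anti-slop rubric

# rubric_name XOR rubric: passing both (or neither) is rejected client-side,
# before any transport fires.
result = client.judge(content="our scaffold ships zero-copy IO", rubric_name="anti-slop")
print(result.composite, result.failure_modes)
```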
1 parent eb90eb5 commit fa8fc94

29 files changed

Lines changed: 2793 additions & 58 deletions

.claude/skills/agent-eval/SKILL.md

Lines changed: 21 additions & 3 deletions
@@ -5,9 +5,27 @@ description: Trace-first evaluation framework for code-generator + LLM-in-the-lo
 
 # agent-eval — usage directives
 
-**One authoritative doc.** `README.md`, `CLAUDE.md`, inline JSDoc all
-point here. The rules below were paid for in real bugs; skip one and
-the bug class reappears.
+**You're an agent writing integration code? Read this whole file.** Each rule below was paid for in a shipped bug; skip one and the bug class reappears.
+
+**You're a human onboarding?** Read [`docs/concepts.md`](../../../docs/concepts.md) first — 5-minute mental model — then come back. The rest of this file is dense by design (it's a footgun bible, not a tutorial).
+
+## Vocabulary you need before reading the rules
+
+| Term | Plain English |
+|---|---|
+| **Artifact** | The thing being judged. Often a workdir; sometimes text. |
+| **Snapshot** | Frozen view of an artifact (`files: Record<path,string>`). What judges read. |
+| **Harness** | Description of how to run the artifact: setupCommand, testCommand, cwd, timeoutMs. |
+| **Sandbox driver** | The thing that actually runs commands. `SubprocessSandboxDriver` runs locally. |
+| **Layer** | One stage of a verifier pipeline (install / typecheck / build / semantic / …). |
+| **Judge** | A function that scores one artifact. Some are LLM-backed, some deterministic. |
+| **Rubric** | Data describing what a judge scores on, with weights. |
+| **Trace store** | Append-only log of spans. `BuilderSession` writes here. |
+| **Composite score** | 0..1 number combining all dimensions — the gate value. |
+| **Muffled gate** | A check that should fail loud but silently passes. The most expensive bug class — see Footgun 1 and §Common bug classes. |
+| **L0 / L1 / L2** | Three layers of code-generator eval: agent session / app-build / app-runtime. |
+
+If a term below isn't in this table or in `docs/concepts.md`, that's a bug — file an issue.
 
 ---
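The vocabulary table's most expensive entry is the muffled gate. A hypothetical consumer-side sketch of the difference, using the Python client's composite score (the 0.7 threshold is illustrative; field names follow the README quickstart):

```python
from tangle_agent_eval import Client

draft = "our scaffold ships zero-copy IO"   # whatever the generator produced
result = Client().judge(content=draft, rubric_name="anti-slop")

# Muffled gate: the composite is computed, logged, and never gates anything.
print("anti-slop composite:", result.composite)

# Loud gate: a failing verdict actually stops the pipeline.
if result.composite < 0.7:
    raise SystemExit(f"anti-slop gate failed at {result.composite:.2f}: {result.failure_modes}")
```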

.github/workflows/publish.yml

Lines changed: 93 additions & 4 deletions
@@ -7,7 +7,73 @@ on:
   workflow_dispatch:
 
 jobs:
-  publish:
+  # Verify both packages build and pass tests, in lockstep.
+  # If either fails, neither publishes.
+  verify:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: pnpm/action-setup@v4
+
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 22
+          cache: pnpm
+          registry-url: https://registry.npmjs.org
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Install JS deps
+        run: pnpm install --frozen-lockfile
+
+      - name: Typecheck JS
+        run: pnpm typecheck
+
+      - name: Test JS
+        run: pnpm test
+
+      - name: Build JS
+        run: pnpm build
+
+      - name: Emit OpenAPI spec
+        run: pnpm openapi
+
+      - name: Verify version lock between npm and PyPI packages
+        run: |
+          NPM_VERSION=$(node -p "require('./package.json').version")
+          PY_VERSION=$(grep -E '^version' clients/python/pyproject.toml | head -1 | sed -E 's/.*"([^"]+)".*/\1/')
+          if [ "$NPM_VERSION" != "$PY_VERSION" ]; then
+            echo "::error::Version mismatch: npm=$NPM_VERSION pypi=$PY_VERSION. Bump them together."
+            exit 1
+          fi
+          echo "Versions locked: $NPM_VERSION"
+
+      - name: Install Python client
+        working-directory: clients/python
+        run: pip install -e ".[dev]"
+
+      - name: Test Python client (incl. real subprocess integration)
+        working-directory: clients/python
+        run: pytest -v
+
+      - name: Upload OpenAPI artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: openapi
+          path: dist/openapi.json
+
+      - name: Upload Python build context
+        uses: actions/upload-artifact@v4
+        with:
+          name: python-source
+          path: clients/python
+
+  publish-npm:
+    needs: verify
+    if: startsWith(github.ref, 'refs/tags/v')
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -24,10 +90,33 @@ jobs:
           registry-url: https://registry.npmjs.org
 
       - run: pnpm install --frozen-lockfile
-      - run: pnpm typecheck
-      - run: pnpm test
       - run: pnpm build
-
      - run: pnpm publish --no-git-checks --access public
         env:
           NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
+
+  publish-pypi:
+    needs: [verify, publish-npm]
+    if: startsWith(github.ref, 'refs/tags/v')
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      id-token: write
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Install build tools
+        run: pip install build twine
+
+      - name: Build wheel + sdist
+        working-directory: clients/python
+        run: python -m build
+
+      - name: Publish to PyPI (trusted publishing)
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          packages-dir: clients/python/dist
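The version-lock step above has a natural local pre-flight. A hypothetical Python equivalent of the same check (paths and the regex mirror the workflow's shell; nothing here is shipped code):

```python
import json
import re
import sys
from pathlib import Path

# Read the npm version from package.json and the PyPI version from pyproject.toml.
npm_version = json.loads(Path("package.json").read_text())["version"]
match = re.search(r'^version\s*=\s*"([^"]+)"',
                  Path("clients/python/pyproject.toml").read_text(), re.M)
pypi_version = match.group(1) if match else None

if npm_version != pypi_version:
    sys.exit(f"Version mismatch: npm={npm_version} pypi={pypi_version}. Bump them together.")
print(f"Versions locked: {npm_version}")
```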

CLAUDE.md

Lines changed: 8 additions & 13 deletions
@@ -1,18 +1,13 @@
 # @tangle-network/agent-eval
 
-Claude Code agents working in this repo: the usage directives are in
-**[`.claude/skills/agent-eval/SKILL.md`](./.claude/skills/agent-eval/SKILL.md)**
-and are auto-discovered by Claude Code as the `/agent-eval` skill.
-
-That file is the sole source of truth for:
-- minimal builder-of-builders integration path
-- the seven muffled-gate footguns (from shipped bugs)
-- three-layer eval contract (`BuilderSession` → `app-build` → `app-runtime`)
-- regression tests every consumer should carry
-- "when to use what" index of the 100+ exports
-
-Do not duplicate content from SKILL.md here. Update SKILL.md; this file is
-a pointer.
+Two docs, two audiences:
+
+- **Humans onboarding** → [`docs/concepts.md`](./docs/concepts.md) (mental model, 5 min) and [`README.md`](./README.md) (entry points + quickstart).
+- **LLM agents writing integration code** → [`.claude/skills/agent-eval/SKILL.md`](./.claude/skills/agent-eval/SKILL.md). Auto-discovered by Claude Code as `/agent-eval`. Encodes shipped-bug directives that prevent regression — skipping one reintroduces the bug class.
+
+Wire-protocol consumers (any language other than TypeScript) → [`docs/wire-protocol.md`](./docs/wire-protocol.md) and [`clients/python/README.md`](./clients/python/README.md).
+
+Update the doc closest to the change. Don't duplicate content across docs; cross-link.
 
 ## Tech stack (unchanging)

README.md

Lines changed: 96 additions & 11 deletions
@@ -1,27 +1,112 @@
 # @tangle-network/agent-eval
 
-Trace-first evaluation framework for Tangle agents. Core (spans, pipelines, sandbox harness, OTLP export), trust (dataset, red-team, calibration, behavior DSL), builder-of-builders (three-layer eval, resumable sessions, meta-runtime correlation), and frontier (meta-eval correlation study, Process Reward Modeling, bisector).
+**A library for deciding whether an LLM-driven generator did its job.**
 
-## Install
+You hand it the thing the generator produced — a code scaffold, a patch, a tweet, a JSON config — and you get back a structured verdict: pass/fail, dimension scores, plain-English rationale. Built to catch the LLM failure modes that LLM-as-judge alone misses.
 
-```bash
+```ts
+import { BuilderSession, SubprocessSandboxDriver, InMemoryTraceStore } from '@tangle-network/agent-eval'
+
+const session = new BuilderSession(new InMemoryTraceStore(), { projectId: 'my-app' }, new SubprocessSandboxDriver())
+await session.startChat()
+const ship = await session.ship({
+  harness: { setupCommand: 'pnpm install', testCommand: 'pnpm exec tsc --noEmit', cwd: scaffoldDir, timeoutMs: 180_000 },
+})
+console.log(ship.result.passed, ship.result.score)
+```
+
+## Who this is for
+
+- You ship a code generator (scaffolder, patcher, refactor agent) and need to gate on whether its output actually works.
+- You ship a content generator and need quality signal beyond "the LLM said it's good".
+- You want a release gate that fails on regressions you can name, not vibes.
+
+If that's you, start with [`docs/concepts.md`](./docs/concepts.md) — 5-minute mental model — then come back here.
+
+## Quickstart
+
+### From any language: HTTP or RPC
+
+The fastest path. agent-eval ships a CLI that runs as either an HTTP server or a stdio RPC binary. Drive it from Python, Rust, Go, anything.
+
+```sh
+npm i -g @tangle-network/agent-eval
+
+# HTTP — long-running
+agent-eval serve --port 5005
+
+# stdio RPC — one-shot, batch
+echo '{"rubricName":"anti-slop","content":"…"}' | agent-eval rpc judge
+```
+
+Python:
+```sh
+pip install tangle-agent-eval
+```
+```python
+from tangle_agent_eval import Client
+c = Client()
+r = c.judge(content="our scaffold ships zero-copy IO", rubric_name="anti-slop")
+print(r.composite, r.failure_modes)
+```
+
+See [`docs/wire-protocol.md`](./docs/wire-protocol.md) for the full surface.
+
+### From TypeScript: import directly
+
+In-process; no wire round-trip. Use this when your eval lives in the same Node process as your generator.
+
+```sh
 pnpm add @tangle-network/agent-eval
 ```
 
-## Usage
+The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](./.claude/skills/agent-eval/SKILL.md#minimal-working-path-builder-of-builders).
+
+## Two ways to read this repo
+
+- **You're a human onboarding** — read [`docs/concepts.md`](./docs/concepts.md) for the mental model, then [`docs/wire-protocol.md`](./docs/wire-protocol.md) if you'll call from another language, or `SKILL.md` if you'll embed in TS.
+- **You're an LLM agent writing integration code** — read `SKILL.md`. Every directive there encodes a shipped bug; skipping one reintroduces the bug class.
 
-**[`.claude/skills/agent-eval/SKILL.md`](./.claude/skills/agent-eval/SKILL.md)** — single source of truth for every usage pattern. Covers: minimal builder-of-builders path, the seven muffled-gate footguns paid for in shipped bugs, the three-layer eval contract, regression tests worth writing, and "when to use what" for the 100+ exports.
+## What's in the box
 
-If you're an LLM or agent reading this, load the skill file before writing integration code — it encodes 10+ incident-driven directives that will save you from rediscovering them.
+| Module | What it does | Doc |
+|---|---|---|
+| `BuilderSession` | Three-layer eval orchestrator (builder → app-build → app-runtime) for code generators. | concepts.md §three-layer eval |
+| `MultiLayerVerifier` | Pipeline of layers (install → typecheck → build → semantic). Skip-on-fail, weighted aggregate. | concepts.md §verifiers |
+| `judges`, `createCustomJudge`, `createAntiSlopJudge` | LLM and deterministic judges. | SKILL.md |
+| Wire protocol (`agent-eval serve` / `rpc`) | HTTP and stdio RPC interface for cross-language clients. | wire-protocol.md |
+| `clients/python/` | First-party Python client (`tangle-agent-eval` on PyPI). Version-locked to npm. | clients/python/README.md |
+| `BenchmarkRunner`, `executeScenario`, `ConvergenceTracker` | Multi-turn scenario execution + cross-run tracking. | SKILL.md |
+| `ExperimentTracker`, `PromptOptimizer`, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
+| Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |
 
-## Dev
+## Tech stack
 
-```bash
-pnpm build # tsup
-pnpm test # vitest
-pnpm typecheck # tsc --noEmit
+- TypeScript strict, no semicolons, single quotes, 2-space indent
+- `tsup` for bundling, `vitest` for tests
+- `@tangle-network/tcloud` for LLM calls (judges, driver)
+- `hono` + `@asteasolutions/zod-to-openapi` for the wire protocol
+
+## Develop
+
+```sh
+pnpm install
+pnpm typecheck
+pnpm test
+pnpm build
+pnpm openapi # write dist/openapi.json from the wire schemas
+
+# Run the server locally
+node dist/cli.js serve --port 5005
+
+# Python client tests (require pnpm build first)
+cd clients/python && pip install -e ".[dev]" && pytest
 ```
 
+## Release
+
+`@tangle-network/agent-eval` (npm) and `tangle-agent-eval` (PyPI) ship from the same git tag in the same CI workflow. If either fails to publish, neither does. Versions are locked.
+
 ## Related
 
 - [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)
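The quickstart's one-shot `agent-eval rpc judge` line can also be driven programmatically. A hypothetical sketch of what the Python client's subprocess transport boils down to (request field names follow the quickstart's echo example; the exact response framing is an assumption):

```python
import json
import subprocess

request = {"rubricName": "anti-slop", "content": "our scaffold ships zero-copy IO"}
proc = subprocess.run(
    ["agent-eval", "rpc", "judge"],   # same binary the quickstart pipes into
    input=json.dumps(request),
    capture_output=True,
    text=True,
    check=True,
)
print(json.loads(proc.stdout))        # JudgeResult as JSON
```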

clients/python/.gitignore

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+.venv/
+__pycache__/
+*.egg-info/
+.pytest_cache/
+.ruff_cache/
+build/
+dist/
+*.pyc
