Commit fa8fc94

feat: wire-protocol module + first-party Python client (#11)
* feat(wire): HTTP + stdio RPC + OpenAPI emission for cross-language clients

  Adds src/wire/ — Zod schemas as the single contract, pure handler functions, Hono HTTP server, stdio RPC for batch use, and OpenAPI 3.1 emission. CLI binary (agent-eval) wraps both transports.

  Schemas: JudgeRequest, JudgeResult, Rubric, RubricDimension, FailureMode, ListRubricsResponse, VersionResponse, ErrorResponse — all with .describe() field-level docs that flow through to the generated OpenAPI. Methods exposed: judge, listRubrics, version.

  Built-in rubric: anti-slop (voice/signal quality for technical-buyer audiences). Inline rubrics are also accepted via the same endpoint. Both transports route to identical handlers. The TS runtime is the source of truth; clients in other languages are generated from openapi.json.

  24 new tests, 576/576 pass.

* feat(python): tangle-agent-eval client — HTTP + subprocess, version-locked to npm

  clients/python/ — pip-installable as `tangle-agent-eval`, version-locked to @tangle-network/agent-eval. Thin transport adapter: every judgement runs in the Node runtime, marshalled over HTTP or stdio RPC. No Python-side eval logic — preventing drift by construction.

  API:
  - Client.judge(content=..., rubric_name="anti-slop") -> JudgeResult
  - Client.list_rubrics() -> ListRubricsResponse
  - Client.version() -> VersionResponse

  Auto-detects an HTTP server, falls back to subprocess. pydantic v2 models mirror the Zod schemas; a mutual-exclusion refinement (rubric_name XOR rubric) validates client-side before any transport fires.

  11/11 tests pass — including 4 real subprocess integration tests against the bundled CLI (no mocks).

* docs: educational entry points — concepts.md, wire-protocol.md, README rewrite

  Drew's directive: SKILL.md is dense by design (footgun bible) and overloaded as the sole onboarding doc. Split the audiences:

  - README.md: human entry point. What it is, who it's for, 30-second quickstart for both wire-protocol and in-process TS use.
  - docs/concepts.md (NEW): 5-minute mental model. Vocabulary table, three-layer eval explained, rubric + verifier basics, trace model.
  - docs/wire-protocol.md (NEW): full HTTP/RPC reference with request/response examples for every endpoint. Adding-a-method recipe.
  - SKILL.md: vocabulary section added at the top so agents have plain-English definitions of every term used in the directives.
  - CLAUDE.md: split-audience pointer instead of a single SKILL.md redirect.

  Principle: every term defined in plain English with one example. Every endpoint has copy-pasteable input/output. No jargon left undefined.

* ci: dual-publish workflow — npm + PyPI version-locked

  The verify job typechecks JS, runs JS tests, builds, emits OpenAPI, version-locks the npm and PyPI package versions, installs the Python client, runs its tests (including real subprocess integration against dist/cli.js), and uploads artifacts. publish-npm depends on verify; publish-pypi depends on publish-npm. If anything breaks, neither package ships. PyPI uses trusted publishing (OIDC); npm uses NPM_TOKEN.
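For orientation, a minimal sketch of the Python client surface described above. Method names and the `composite` / `failure_modes` fields come from this commit message and the README quickstart; everything else (the sample content, the comments) is illustrative rather than shipped code.

```python
from tangle_agent_eval import Client

client = Client()             # auto-detects a running HTTP server, else falls back to the subprocess transport
print(client.version())       # VersionResponse: the version is locked to the npm package
print(client.list_rubrics())  # ListRubricsResponse: includes the built-in anti-slop rubric

# rubric_name XOR rubric: passing both (or neither) is rejected client-side,
# before any transport fires.
result = client.judge(content="our scaffold ships zero-copy IO", rubric_name="anti-slop")
print(result.composite, result.failure_modes)
```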
1 parent eb90eb5 commit fa8fc94

29 files changed

Lines changed: 2793 additions & 58 deletions

.claude/skills/agent-eval/SKILL.md

Lines changed: 21 additions & 3 deletions
@@ -5,9 +5,27 @@ description: Trace-first evaluation framework for code-generator + LLM-in-the-lo
 
 # agent-eval — usage directives
 
-**One authoritative doc.** `README.md`, `CLAUDE.md`, inline JSDoc all
-point here. The rules below were paid for in real bugs; skip one and
-the bug class reappears.
+**You're an agent writing integration code? Read this whole file.** Each rule below was paid for in a shipped bug; skip one and the bug class reappears.
+
+**You're a human onboarding?** Read [`docs/concepts.md`](../../../docs/concepts.md) first — 5-minute mental model — then come back. The rest of this file is dense by design (it's a footgun bible, not a tutorial).
+
+## Vocabulary you need before reading the rules
+
+| Term | Plain English |
+|---|---|
+| **Artifact** | The thing being judged. Often a workdir; sometimes text. |
+| **Snapshot** | Frozen view of an artifact (`files: Record<path,string>`). What judges read. |
+| **Harness** | Description of how to run the artifact: setupCommand, testCommand, cwd, timeoutMs. |
+| **Sandbox driver** | The thing that actually runs commands. `SubprocessSandboxDriver` runs locally. |
+| **Layer** | One stage of a verifier pipeline (install / typecheck / build / semantic / …). |
+| **Judge** | A function that scores one artifact. Some are LLM-backed, some deterministic. |
+| **Rubric** | Data describing what a judge scores on, with weights. |
+| **Trace store** | Append-only log of spans. `BuilderSession` writes here. |
+| **Composite score** | 0..1 number combining all dimensions — the gate value. |
+| **Muffled gate** | A check that should fail loud but silently passes. The most expensive bug class — see Footgun 1 and §Common bug classes. |
+| **L0 / L1 / L2** | Three layers of code-generator eval: agent session / app-build / app-runtime. |
+
+If a term below isn't in this table or in `docs/concepts.md`, that's a bug — file an issue.
 
 ---
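The vocabulary table's most expensive entry is the muffled gate. A hypothetical consumer-side sketch of the difference, using the Python client's composite score (the 0.7 threshold is illustrative; field names follow the README quickstart):

```python
from tangle_agent_eval import Client

draft = "our scaffold ships zero-copy IO"   # whatever the generator produced
result = Client().judge(content=draft, rubric_name="anti-slop")

# Muffled gate: the composite is computed, logged, and never gates anything.
print("anti-slop composite:", result.composite)

# Loud gate: a failing verdict actually stops the pipeline.
if result.composite < 0.7:
    raise SystemExit(f"anti-slop gate failed at {result.composite:.2f}: {result.failure_modes}")
```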

.github/workflows/publish.yml

Lines changed: 93 additions & 4 deletions
@@ -7,7 +7,73 @@ on:
   workflow_dispatch:
 
 jobs:
-  publish:
+  # Verify both packages build and pass tests, in lockstep.
+  # If either fails, neither publishes.
+  verify:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: pnpm/action-setup@v4
+
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 22
+          cache: pnpm
+          registry-url: https://registry.npmjs.org
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Install JS deps
+        run: pnpm install --frozen-lockfile
+
+      - name: Typecheck JS
+        run: pnpm typecheck
+
+      - name: Test JS
+        run: pnpm test
+
+      - name: Build JS
+        run: pnpm build
+
+      - name: Emit OpenAPI spec
+        run: pnpm openapi
+
+      - name: Verify version lock between npm and PyPI packages
+        run: |
+          NPM_VERSION=$(node -p "require('./package.json').version")
+          PY_VERSION=$(grep -E '^version' clients/python/pyproject.toml | head -1 | sed -E 's/.*"([^"]+)".*/\1/')
+          if [ "$NPM_VERSION" != "$PY_VERSION" ]; then
+            echo "::error::Version mismatch: npm=$NPM_VERSION pypi=$PY_VERSION. Bump them together."
+            exit 1
+          fi
+          echo "Versions locked: $NPM_VERSION"
+
+      - name: Install Python client
+        working-directory: clients/python
+        run: pip install -e ".[dev]"
+
+      - name: Test Python client (incl. real subprocess integration)
+        working-directory: clients/python
+        run: pytest -v
+
+      - name: Upload OpenAPI artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: openapi
+          path: dist/openapi.json
+
+      - name: Upload Python build context
+        uses: actions/upload-artifact@v4
+        with:
+          name: python-source
+          path: clients/python
+
+  publish-npm:
+    needs: verify
+    if: startsWith(github.ref, 'refs/tags/v')
     runs-on: ubuntu-latest
     permissions:
       contents: read
@@ -24,10 +90,33 @@ jobs:
           registry-url: https://registry.npmjs.org
 
       - run: pnpm install --frozen-lockfile
-      - run: pnpm typecheck
-      - run: pnpm test
       - run: pnpm build
-
      - run: pnpm publish --no-git-checks --access public
         env:
           NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
+
+  publish-pypi:
+    needs: [verify, publish-npm]
+    if: startsWith(github.ref, 'refs/tags/v')
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      id-token: write
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Install build tools
+        run: pip install build twine
+
+      - name: Build wheel + sdist
+        working-directory: clients/python
+        run: python -m build
+
+      - name: Publish to PyPI (trusted publishing)
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          packages-dir: clients/python/dist
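The version-lock step above has a natural local pre-flight. A hypothetical Python equivalent of the same check (paths and the regex mirror the workflow's shell; nothing here is shipped code):

```python
import json
import re
import sys
from pathlib import Path

# Read the npm version from package.json and the PyPI version from pyproject.toml.
npm_version = json.loads(Path("package.json").read_text())["version"]
match = re.search(r'^version\s*=\s*"([^"]+)"',
                  Path("clients/python/pyproject.toml").read_text(), re.M)
pypi_version = match.group(1) if match else None

if npm_version != pypi_version:
    sys.exit(f"Version mismatch: npm={npm_version} pypi={pypi_version}. Bump them together.")
print(f"Versions locked: {npm_version}")
```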

CLAUDE.md

Lines changed: 8 additions & 13 deletions
@@ -1,18 +1,13 @@
 # @tangle-network/agent-eval
 
-Claude Code agents working in this repo: the usage directives are in
-**[`.claude/skills/agent-eval/SKILL.md`](./.claude/skills/agent-eval/SKILL.md)**
-and are auto-discovered by Claude Code as the `/agent-eval` skill.
-
-That file is the sole source of truth for:
-- minimal builder-of-builders integration path
-- the seven muffled-gate footguns (from shipped bugs)
-- three-layer eval contract (`BuilderSession` → `app-build` → `app-runtime`)
-- regression tests every consumer should carry
-- "when to use what" index of the 100+ exports
-
-Do not duplicate content from SKILL.md here. Update SKILL.md; this file is
-a pointer.
+Two docs, two audiences:
+
+- **Humans onboarding** → [`docs/concepts.md`](./docs/concepts.md) (mental model, 5 min) and [`README.md`](./README.md) (entry points + quickstart).
+- **LLM agents writing integration code** → [`.claude/skills/agent-eval/SKILL.md`](./.claude/skills/agent-eval/SKILL.md). Auto-discovered by Claude Code as `/agent-eval`. Encodes shipped-bug directives that prevent regression — skipping one reintroduces the bug class.
+
+Wire-protocol consumers (any language other than TypeScript) → [`docs/wire-protocol.md`](./docs/wire-protocol.md) and [`clients/python/README.md`](./clients/python/README.md).
+
+Update the doc closest to the change. Don't duplicate content across docs; cross-link.
 
 ## Tech stack (unchanging)

README.md

Lines changed: 96 additions & 11 deletions
@@ -1,27 +1,112 @@
 # @tangle-network/agent-eval
 
-Trace-first evaluation framework for Tangle agents. Core (spans, pipelines, sandbox harness, OTLP export), trust (dataset, red-team, calibration, behavior DSL), builder-of-builders (three-layer eval, resumable sessions, meta-runtime correlation), and frontier (meta-eval correlation study, Process Reward Modeling, bisector).
+**A library for deciding whether an LLM-driven generator did its job.**
 
-## Install
+You hand it the thing the generator produced — a code scaffold, a patch, a tweet, a JSON config — and you get back a structured verdict: pass/fail, dimension scores, plain-English rationale. Built to catch the LLM failure modes that LLM-as-judge alone misses.
 
-```bash
+```ts
+import { BuilderSession, SubprocessSandboxDriver, InMemoryTraceStore } from '@tangle-network/agent-eval'
+
+const session = new BuilderSession(new InMemoryTraceStore(), { projectId: 'my-app' }, new SubprocessSandboxDriver())
+await session.startChat()
+const ship = await session.ship({
+  harness: { setupCommand: 'pnpm install', testCommand: 'pnpm exec tsc --noEmit', cwd: scaffoldDir, timeoutMs: 180_000 },
+})
+console.log(ship.result.passed, ship.result.score)
+```
+
+## Who this is for
+
+- You ship a code generator (scaffolder, patcher, refactor agent) and need to gate on whether its output actually works.
+- You ship a content generator and need quality signal beyond "the LLM said it's good".
+- You want a release gate that fails on regressions you can name, not vibes.
+
+If that's you, start with [`docs/concepts.md`](./docs/concepts.md) — 5-minute mental model — then come back here.
+
+## Quickstart
+
+### From any language: HTTP or RPC
+
+The fastest path. agent-eval ships a CLI that runs as either an HTTP server or a stdio RPC binary. Drive it from Python, Rust, Go, anything.
+
+```sh
+npm i -g @tangle-network/agent-eval
+
+# HTTP — long-running
+agent-eval serve --port 5005
+
+# stdio RPC — one-shot, batch
+echo '{"rubricName":"anti-slop","content":"…"}' | agent-eval rpc judge
+```
+
+Python:
+```sh
+pip install tangle-agent-eval
+```
+```python
+from tangle_agent_eval import Client
+c = Client()
+r = c.judge(content="our scaffold ships zero-copy IO", rubric_name="anti-slop")
+print(r.composite, r.failure_modes)
+```
+
+See [`docs/wire-protocol.md`](./docs/wire-protocol.md) for the full surface.
+
+### From TypeScript: import directly
+
+In-process; no wire round-trip. Use this when your eval lives in the same Node process as your generator.
+
+```sh
 pnpm add @tangle-network/agent-eval
 ```
 
-## Usage
+The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](./.claude/skills/agent-eval/SKILL.md#minimal-working-path-builder-of-builders).
+
+## Two ways to read this repo
+
+- **You're a human onboarding** — read [`docs/concepts.md`](./docs/concepts.md) for the mental model, then [`docs/wire-protocol.md`](./docs/wire-protocol.md) if you'll call from another language, or `SKILL.md` if you'll embed in TS.
+- **You're an LLM agent writing integration code** — read `SKILL.md`. Every directive there encodes a shipped bug; skipping one reintroduces the bug class.
 
-**[`.claude/skills/agent-eval/SKILL.md`](./.claude/skills/agent-eval/SKILL.md)** — single source of truth for every usage pattern. Covers: minimal builder-of-builders path, the seven muffled-gate footguns paid for in shipped bugs, the three-layer eval contract, regression tests worth writing, and "when to use what" for the 100+ exports.
+## What's in the box
 
-If you're an LLM or agent reading this, load the skill file before writing integration code — it encodes 10+ incident-driven directives that will save you from rediscovering them.
+| Module | What it does | Doc |
+|---|---|---|
+| `BuilderSession` | Three-layer eval orchestrator (builder → app-build → app-runtime) for code generators. | concepts.md §three-layer eval |
+| `MultiLayerVerifier` | Pipeline of layers (install → typecheck → build → semantic). Skip-on-fail, weighted aggregate. | concepts.md §verifiers |
+| `judges`, `createCustomJudge`, `createAntiSlopJudge` | LLM and deterministic judges. | SKILL.md |
+| Wire protocol (`agent-eval serve` / `rpc`) | HTTP and stdio RPC interface for cross-language clients. | wire-protocol.md |
+| `clients/python/` | First-party Python client (`tangle-agent-eval` on PyPI). Version-locked to npm. | clients/python/README.md |
+| `BenchmarkRunner`, `executeScenario`, `ConvergenceTracker` | Multi-turn scenario execution + cross-run tracking. | SKILL.md |
+| `ExperimentTracker`, `PromptOptimizer`, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
+| Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |
 
-## Dev
+## Tech stack
 
-```bash
-pnpm build # tsup
-pnpm test # vitest
-pnpm typecheck # tsc --noEmit
+- TypeScript strict, no semicolons, single quotes, 2-space indent
+- `tsup` for bundling, `vitest` for tests
+- `@tangle-network/tcloud` for LLM calls (judges, driver)
+- `hono` + `@asteasolutions/zod-to-openapi` for the wire protocol
+
+## Develop
+
+```sh
+pnpm install
+pnpm typecheck
+pnpm test
+pnpm build
+pnpm openapi # write dist/openapi.json from the wire schemas
+
+# Run the server locally
+node dist/cli.js serve --port 5005
+
+# Python client tests (require pnpm build first)
+cd clients/python && pip install -e ".[dev]" && pytest
 ```
 
+## Release
+
+`@tangle-network/agent-eval` (npm) and `tangle-agent-eval` (PyPI) ship from the same git tag in the same CI workflow. If either fails to publish, neither does. Versions are locked.
+
 ## Related
 
 - [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)
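The quickstart's one-shot `agent-eval rpc judge` line can also be driven programmatically. A hypothetical sketch of what the Python client's subprocess transport boils down to (request field names follow the quickstart's echo example; the exact response framing is an assumption):

```python
import json
import subprocess

request = {"rubricName": "anti-slop", "content": "our scaffold ships zero-copy IO"}
proc = subprocess.run(
    ["agent-eval", "rpc", "judge"],   # same binary the quickstart pipes into
    input=json.dumps(request),
    capture_output=True,
    text=True,
    check=True,
)
print(json.loads(proc.stdout))        # JudgeResult as JSON
```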

clients/python/.gitignore

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+.venv/
+__pycache__/
+*.egg-info/
+.pytest_cache/
+.ruff_cache/
+build/
+dist/
+*.pyc
