-This release tightens the public benchmark surface and lands the SKILL.md guidance that the v0.15 dispatch couldn't write.
+This release tightens the public benchmark surface and lands internal usage guidance that the v0.15 dispatch couldn't write.

 ### Moved
@@ -123,7 +149,7 @@ These are reference implementations of `BenchmarkAdapter`, not core surface. Con
 ### Added

 - `examples/benchmarks/README.md` documents how to use, copy, and extend the example wrappers.
-- `.claude/skills/agent-eval/SKILL.md` gains a "Production-rigor primitives (v0.16+)" section and a "Pitfalls" section with 13 footgun directives covering the v0.16 primitives. (Couldn't be written in v0.15 due to harness sandbox; landed in v0.17.)
+- Internal agent-eval usage guidance gains production-rigor and pitfalls sections covering the v0.16 primitives.

 ### Migration
@@ -218,8 +244,7 @@ optimization with held-out promotion gates.
 are additive.
 - All new public symbols carry JSDoc.
 - 87 new tests across 7 new test files. 571 total tests pass.
-- See `.claude/skills/agent-eval/SKILL.md` for usage directives and
-  pitfalls; `## Pitfalls` section added in this release.
+- See the package docs for usage directives and pitfalls.
clients/python/README.md: 3 additions & 2 deletions
@@ -27,7 +27,8 @@ That's the entire surface for content judging.
 ## Install

 ```sh
-pip install tangle-agent-eval
+cd clients/python
+pip install -e .
 ```

 To use it, **one of**:
@@ -140,7 +141,7 @@ All errors carry `.code` and `.details` (the structured payload from the server)
 ## Versioning

-This package is **version-locked** to the npm package. `tangle-agent-eval==0.19.0` ↔ `@tangle-network/agent-eval@0.19.0`. The two ship from the same git tag in the same CI workflow; if either fails to publish, neither does. Mismatched versions are a build-time error.
+This package is **version-locked** to the npm package. `tangle-agent-eval==0.20.9` ↔ `@tangle-network/agent-eval@0.20.9`. The two ship from the same git tag in the same CI workflow; if either fails to publish, neither does. Mismatched versions are a build-time error.

 `wire_version` is separate. It bumps only on breaking schema changes. Package versions can differ across releases as long as `wire_version` is the same.
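
The distinction between the package version and `wire_version` can be illustrated with a minimal sketch. Everything here is hypothetical and not part of the client's API: the `VersionInfo` container, the `is_wire_compatible` helper, and the `2.0.0` wire version are made up for illustration, and the sketch assumes "the same" means exact string equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionInfo:
    """Hypothetical container; field names are illustrative only."""
    package_version: str  # locked to the npm package version, e.g. "0.20.9"
    wire_version: str     # bumps only on breaking schema changes, e.g. "1.0.0"


def is_wire_compatible(client: VersionInfo, server: VersionInfo) -> bool:
    # Assumption: compatibility is exact equality of wire_version.
    # Package versions are deliberately ignored here.
    return client.wire_version == server.wire_version


# Package versions differ, wire version matches: still compatible.
older_client = VersionInfo(package_version="0.19.0", wire_version="1.0.0")
newer_server = VersionInfo(package_version="0.20.9", wire_version="1.0.0")

# Hypothetical breaking schema change on the server side.
breaking_server = VersionInfo(package_version="0.20.9", wire_version="2.0.0")

print(is_wire_compatible(older_client, newer_server))
print(is_wire_compatible(older_client, breaking_server))
```

Keeping the check to a single field mirrors the rule in the paragraph above: only a schema break, not a package release, should separate a client from a server.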
docs/concepts.md: 4 additions & 4 deletions
@@ -43,7 +43,7 @@ that can seed memory, replay scenarios, and optimization.
 | **Trace store** | The append-only log of every span/event during a run. Replay = read this back. |
 | **Composite score** | A 0..1 number combining all dimensions. The single number you gate on. |
 | **Rubric version** | A stable hash of the rubric. Scores from different rubric versions are not comparable. |
-| **Muffled gate** | A check that should fail loud but silently passes (e.g. `command || true`). The most expensive bug class in this codebase — see SKILL.md. |
+| **Muffled gate** | A check that should fail loud but silently passes (e.g. `command || true`). The most expensive bug class in this codebase. |
-Two rules that will save you bugs (paid for in real incidents — see SKILL.md):
+Two rules that will save you bugs:

 1. **Run both gates.** Build gates catch code that doesn't compile; structural assertions catch missing files. Run both unconditionally — they catch orthogonal failures.
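
The "run both gates" rule and the muffled-gate definition can be sketched as follows. This is a minimal illustration, not the package's implementation: `GateResult`, `build_gate`, `structural_gate`, and `run_gates` are hypothetical names, and the failing "build" is a stand-in command that exits with a non-zero status.

```python
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str = ""


def build_gate(argv: List[str]) -> GateResult:
    # Report the command's real exit status. The shell idiom
    # `command || true` would discard it and muffle the gate.
    proc = subprocess.run(argv, capture_output=True, text=True)
    return GateResult("build", proc.returncode == 0, f"exit code {proc.returncode}")


def structural_gate(root: Path, required: List[str]) -> GateResult:
    # Structural assertion: every required file must exist under root.
    missing = [rel for rel in required if not (root / rel).is_file()]
    return GateResult("structure", not missing, f"missing: {missing}")


def run_gates(gates: List[Callable[[], GateResult]]) -> List[GateResult]:
    # Run every gate unconditionally; no short-circuit after a failure.
    return [gate() for gate in gates]


def all_passed(results: List[GateResult]) -> bool:
    return all(result.passed for result in results)


results = run_gates([
    # Stand-in for a failing build command (hypothetical).
    lambda: build_gate([sys.executable, "-c", "raise SystemExit(3)"]),
    # Hypothetical required file that does not exist.
    lambda: structural_gate(Path("."), ["no-such-file.example"]),
])

for result in results:
    print(result.name, "PASS" if result.passed else "FAIL", result.detail)
```

Because the gates catch orthogonal failures, the sketch collects every result before deciding the overall outcome, so a build failure never hides a missing file.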
@@ -150,6 +150,6 @@ You don't need to build the trace tree by hand. `BuilderSession` does it for you
 - **Just want to score a string against a rubric?** → [wire-protocol.md](./wire-protocol.md) — HTTP/RPC interface, pluggable from any language.
 - **Need a reusable driver/worker/evaluator loop?** → [control-runtime.md](./control-runtime.md) — generic runtime plus coding, browser, computer-use, and research integration patterns.
 - **Want review feedback to become eval/optimization data?** → [feedback-trajectories.md](./feedback-trajectories.md) — turn feedback into datasets, optimizer rows, and preference memory.
-- **Building a code-generator eval?** → SKILL.md §Minimal working path — the `BuilderSession` recipe.
+- **Building a code-generator eval?** → Start with `BuilderSession`, `SandboxHarness`, and `MultiLayerVerifier`.
+- **Multi-layer verifier?** → Use [control-runtime.md](./control-runtime.md) and `MultiLayerVerifier` for ordered gates with dependencies.
 - **Adding a new judge or rubric?** → `src/wire/rubrics.ts` for the cross-language path; `src/anti-slop.ts` and `src/judges.ts` for the in-process path.
docs/wire-protocol.md: 3 additions & 3 deletions
@@ -96,13 +96,13 @@ GET /v1/version
 ```json
 {
   "package": "@tangle-network/agent-eval",
-  "version": "0.19.0",
+  "version": "0.20.9",
   "wireVersion": "1.0.0",
   "apiSurface": ["judge", "listRubrics", "version"]
 }
 ```

-`version` matches the npm/PyPI package version. `wireVersion` bumps independently — only on breaking request/response schema changes. Package versions can differ across releases as long as `wireVersion` matches.
+`version` matches the package version. `wireVersion` bumps independently — only on breaking request/response schema changes. Package versions can differ across releases as long as `wireVersion` matches.

 ### `GET /healthz` — liveness
@@ -176,7 +176,7 @@ Each invocation is one process — Node startup adds ~500 ms. For more than a fe
 ## Clients

-- **Python**: [`tangle-agent-eval`](../clients/python/README.md) on PyPI. Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
+- **Python**: source lives in [`clients/python`](../clients/python/README.md). Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
 - **TypeScript**: import directly from `@tangle-network/agent-eval` (no wire round-trip needed in-process).
 - **Rust / Go / Other**: generate from `dist/openapi.json`. PRs welcome to add an officially-maintained client.
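
The "auto-detects HTTP, falls back to subprocess" behaviour described for the Python client could look roughly like the sketch below. It is an assumption-laden illustration, not the client's actual code: `http_is_alive` and `choose_transport` are hypothetical names, and the sketch assumes the `GET /healthz` liveness endpoint answers with status 200 when the server is up.

```python
import urllib.request
from typing import Callable


def http_is_alive(base_url: str, timeout: float = 0.5) -> bool:
    # Probe the liveness endpoint; any connection error, timeout,
    # non-2xx response, or malformed URL counts as "not alive".
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        return False


def choose_transport(base_url: str, probe: Callable[[str], bool] = http_is_alive) -> str:
    # Prefer HTTP when the server responds; otherwise fall back to
    # spawning one subprocess per invocation.
    return "http" if probe(base_url) else "subprocess"
```

Passing the probe as a parameter keeps the selection logic testable without a running server, for example `choose_transport("http://localhost:8080", probe=lambda _: False)` selects the subprocess path (the URL is a placeholder).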
examples/benchmarks/README.md: 4 additions & 11 deletions
@@ -11,17 +11,10 @@ The novel benchmark we ship and own — the synthetic routing task — lives in
 ## Using these wrappers

-Two paths.
-
-**Option A — read and inline.** Copy the wrapper file into your project. Replace the import paths from `../../../src/benchmarks/types` and `../../../src/run-record` with `@tangle-network/agent-eval`. Done.
-
-**Option B — import from agent-eval source.** If your project sits in this monorepo (or you've cloned the repo), import directly: