Commit ef4cd4f

chore: harden agent-eval release surface (#27)
1 parent 1539bd4 commit ef4cd4f

23 files changed

Lines changed: 552 additions & 103 deletions

CHANGELOG.md

Lines changed: 30 additions & 5 deletions
@@ -1,5 +1,31 @@
 # Changelog

+## 0.20.9 — release hygiene and runtime failure fixes
+
+### Fixed
+
+- Initial `runAgentControlLoop` observe/validate failures now report the
+  actual observe/validate error even when trace start/end emission also fails.
+- Knowledge readiness recommended actions now honor non-blocking gap
+  acquisition modes such as `ask_user`, `search_web`, `query_connector`, and
+  `inspect_repo`.
+- Npm builds now generate `dist/openapi.json`, and the package exports
+  `@tangle-network/agent-eval/openapi.json`.
+- Npm and Python client versions are locked at `0.20.9`.
+
+### Added
+
+- `CallbackResearcher`, a concrete callback-backed implementation of the
+  stable `Researcher` interface for scripts, tests, and small integrations.
+- Public `@tangle-network/agent-eval/benchmarks` subpath for the supported
+  routing benchmark surface.
+- Root MIT `LICENSE`.
+
+### Changed
+
+- Raw TypeScript examples are no longer included in the npm package; they remain
+  repository examples to read, copy, and adapt.
+
 ## 0.20.2 — freshness-aware knowledge readiness

 ### Added
@@ -107,9 +133,9 @@
 - `runProposeReviewAsControlLoop` accepts a caller-provided verifier failure
   mapper for domain-specific failure classes.

-## 0.17.0 — surface cleanup + SKILL pitfalls
+## 0.17.0 — surface cleanup + usage-guidance pitfalls

-This release tightens the public benchmark surface and lands the SKILL.md guidance that the v0.15 dispatch couldn't write.
+This release tightens the public benchmark surface and lands internal usage guidance that the v0.15 dispatch couldn't write.

 ### Moved

@@ -123,7 +149,7 @@ These are reference implementations of `BenchmarkAdapter`, not core surface. Con
 ### Added

 - `examples/benchmarks/README.md` documents how to use, copy, and extend the example wrappers.
-- `.claude/skills/agent-eval/SKILL.md` gains a "Production-rigor primitives (v0.16+)" section and a "Pitfalls" section with 13 footgun directives covering the v0.16 primitives. (Couldn't be written in v0.15 due to harness sandbox; landed in v0.17.)
+- Internal agent-eval usage guidance gains production-rigor and pitfalls sections covering the v0.16 primitives.

 ### Migration

@@ -218,8 +244,7 @@ optimization with held-out promotion gates.
   are additive.
 - All new public symbols carry JSDoc.
 - 87 new tests across 7 new test files. 571 total tests pass.
-- See `.claude/skills/agent-eval/SKILL.md` for usage directives and
-  pitfalls; `## Pitfalls` section added in this release.
+- See the package docs for usage directives and pitfalls.

 ## 0.11.0

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Tangle Network
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md

Lines changed: 9 additions & 6 deletions
@@ -55,9 +55,9 @@ Package responsibilities:
   optimization, reporting.
 - Product app: domain state, tools, credentials, UI, storage, deployment, model
   gateway.
-- `agent-runtime`: production agent-loop/session runtime.
-- `agent-knowledge`: evidence stores, claim/page synthesis, retrieval, knowledge
-  readiness implementation.
+- `@tangle-network/agent-runtime`: production agent-loop/session runtime.
+- `@tangle-network/agent-knowledge`: evidence stores, claim/page synthesis,
+  retrieval, knowledge readiness implementation.

 ## Install

@@ -72,10 +72,12 @@ npm i -g @tangle-network/agent-eval
 agent-eval serve --port 5005
 ```

-Python client:
+Python client source lives in `clients/python`. Until the PyPI package is
+published, install it from the repo:

 ```sh
-pip install tangle-agent-eval
+cd clients/python
+pip install -e .
 ```

 ## Core Primitives
@@ -98,7 +100,8 @@ pip install tangle-agent-eval

 ## Examples

-Runnable examples live in [`examples/`](./examples):
+Runnable examples live in the repository's [`examples/`](./examples)
+directory. They are not part of the published npm package.

 - [`examples/same-sandbox-harness`](./examples/same-sandbox-harness) - run
   multiple eval passes against the same workspace.

clients/python/README.md

Lines changed: 3 additions & 2 deletions
@@ -27,7 +27,8 @@ That's the entire surface for content judging.
 ## Install

 ```sh
-pip install tangle-agent-eval
+cd clients/python
+pip install -e .
 ```

 To use it, **one of**:
@@ -140,7 +141,7 @@ All errors carry `.code` and `.details` (the structured payload from the server)

 ## Versioning

-This package is **version-locked** to the npm package. `tangle-agent-eval==0.19.0``@tangle-network/agent-eval@0.19.0`. The two ship from the same git tag in the same CI workflow; if either fails to publish, neither does. Mismatched versions are a build-time error.
+This package is **version-locked** to the npm package. `tangle-agent-eval==0.20.9``@tangle-network/agent-eval@0.20.9`. The two ship from the same git tag in the same CI workflow; if either fails to publish, neither does. Mismatched versions are a build-time error.

 `wire_version` is separate. It bumps only on breaking schema changes. Package versions can differ across releases as long as `wire_version` is the same.
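
The version-lock rule in the Versioning hunk above ("mismatched versions are a build-time error") could be enforced with a check along these lines. This is a minimal sketch, not the repository's actual CI step. The helper names and the idea of reading `__version__` with a regular expression are assumptions made for illustration; only the `version` field of `package.json` and the `__version__` assignment in `__init__.py` come from the diff.

```python
import json
import re

# Matches a module-level assignment such as: __version__ = "0.20.9"
VERSION_RE = re.compile(r'^__version__\s*=\s*"([^"]+)"', re.MULTILINE)


class VersionMismatchError(Exception):
    """Raised when the npm and Python package versions differ (hypothetical name)."""


def read_python_version(init_source: str) -> str:
    """Extract the version string from the text of an __init__.py file."""
    match = VERSION_RE.search(init_source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def read_npm_version(package_json_source: str) -> str:
    """Extract the version field from the text of a package.json file."""
    return json.loads(package_json_source)["version"]


def assert_version_lock(package_json_source: str, init_source: str) -> None:
    """Fail loudly if the two package versions are not identical."""
    npm_version = read_npm_version(package_json_source)
    py_version = read_python_version(init_source)
    if npm_version != py_version:
        raise VersionMismatchError(
            f"npm {npm_version} != python {py_version}"
        )


if __name__ == "__main__":
    assert_version_lock('{"version": "0.20.9"}', '__version__ = "0.20.9"\n')
    print("versions locked")
```

The functions take file contents rather than paths so the check can be exercised without touching the filesystem; a real build step would read the two files first.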

clients/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "tangle-agent-eval"
-version = "0.19.0"
+version = "0.20.9"
 description = "Python client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC."
 readme = "README.md"
 requires-python = ">=3.10"

clients/python/src/tangle_agent_eval/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@
     VersionResponse,
 )

-__version__ = "0.19.0"
+__version__ = "0.20.9"

 __all__ = [
     "Client",

docs/concepts.md

Lines changed: 4 additions & 4 deletions
@@ -43,7 +43,7 @@ that can seed memory, replay scenarios, and optimization.
 | **Trace store** | The append-only log of every span/event during a run. Replay = read this back. |
 | **Composite score** | A 0..1 number combining all dimensions. The single number you gate on. |
 | **Rubric version** | A stable hash of the rubric. Scores from different rubric versions are not comparable. |
-| **Muffled gate** | A check that should fail loud but silently passes (e.g. `command || true`). The most expensive bug class in this codebase — see SKILL.md. |
+| **Muffled gate** | A check that should fail loud but silently passes (e.g. `command || true`). The most expensive bug class in this codebase. |

 ## The feedback trajectory loop
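
The "muffled gate" row in the hunk above describes a check that silently passes, as in `command || true`. The sketch below contrasts a gate that propagates the exit status with one that swallows it. It is an illustration only: `run_gate` and `run_gate_muffled` are hypothetical names, not part of the agent-eval API.

```python
import subprocess
import sys


def run_gate(argv: list[str]) -> bool:
    """Run a gate command and report success only when it exits with status 0."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode == 0


def run_gate_muffled(argv: list[str]) -> bool:
    """Anti-pattern: the Python equivalent of `command || true`.

    The exit status is discarded, so a failing command still looks like a pass.
    """
    subprocess.run(argv, capture_output=True, text=True)
    return True


if __name__ == "__main__":
    # A child process that always exits non-zero stands in for a failing build.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print("loud gate passed:", run_gate(failing))
    print("muffled gate passed:", run_gate_muffled(failing))
```

The two functions differ only in whether the return code is inspected, which is why this bug class is easy to introduce and hard to spot in review.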

@@ -119,7 +119,7 @@ report.blendedScore // 0..1 — weighted aggregate
 report.layers // per-layer status, findings, duration
 ```

-Two rules that will save you bugs (paid for in real incidents — see SKILL.md):
+Two rules that will save you bugs:

 1. **Run both gates.** Build gates catch code that doesn't compile; structural assertions catch missing files. Run both unconditionally — they catch orthogonal failures.

@@ -150,6 +150,6 @@ You don't need to build the trace tree by hand. `BuilderSession` does it for you
 - **Just want to score a string against a rubric?**[wire-protocol.md](./wire-protocol.md) — HTTP/RPC interface, pluggable from any language.
 - **Need a reusable driver/worker/evaluator loop?**[control-runtime.md](./control-runtime.md) — generic runtime plus coding, browser, computer-use, and research integration patterns.
 - **Want review feedback to become eval/optimization data?**[feedback-trajectories.md](./feedback-trajectories.md) — turn feedback into datasets, optimizer rows, and preference memory.
-- **Building a code-generator eval?**SKILL.md §Minimal working path — the `BuilderSession` recipe.
-- **Multi-layer verifier?**SKILL.md §Verification pipeline.
+- **Building a code-generator eval?**Start with `BuilderSession`, `SandboxHarness`, and `MultiLayerVerifier`.
+- **Multi-layer verifier?**Use [control-runtime.md](./control-runtime.md) and `MultiLayerVerifier` for ordered gates with dependencies.
 - **Adding a new judge or rubric?**`src/wire/rubrics.ts` for the cross-language path; `src/anti-slop.ts` and `src/judges.ts` for the in-process path.

docs/knowledge-readiness.md

Lines changed: 2 additions & 2 deletions
@@ -2,8 +2,8 @@

 `agent-eval` owns the contract for deciding whether an agent had enough
 task-world context to run. It does not own web crawling, connector storage, wiki
-pages, credentials, or product policy. Those live in `agent-knowledge` and
-product repos.
+pages, credentials, or product policy. Those live in
+`@tangle-network/agent-knowledge` and product repos.

 The core loop is:

docs/wire-protocol.md

Lines changed: 3 additions & 3 deletions
@@ -96,13 +96,13 @@ GET /v1/version
 ```json
 {
   "package": "@tangle-network/agent-eval",
-  "version": "0.19.0",
+  "version": "0.20.9",
   "wireVersion": "1.0.0",
   "apiSurface": ["judge", "listRubrics", "version"]
 }
 ```

-`version` matches the npm/PyPI package version. `wireVersion` bumps independently — only on breaking request/response schema changes. Package versions can differ across releases as long as `wireVersion` matches.
+`version` matches the package version. `wireVersion` bumps independently — only on breaking request/response schema changes. Package versions can differ across releases as long as `wireVersion` matches.

 ### `GET /healthz` — liveness
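
The `wireVersion` rule in the hunk above (package versions may differ as long as `wireVersion` matches) can be expressed as a small client-side check on the `/v1/version` payload. This is a hedged sketch: `is_wire_compatible` and `EXPECTED_WIRE_VERSION` are hypothetical names, the exact-match comparison is an assumption drawn from the wording "as long as `wireVersion` matches", and the payloads are parsed from strings rather than fetched over HTTP.

```python
import json

# Wire version this hypothetical client was built against; "1.0.0" is the
# value shown in the sample response in the diff.
EXPECTED_WIRE_VERSION = "1.0.0"


def is_wire_compatible(payload_text: str, expected: str = EXPECTED_WIRE_VERSION) -> bool:
    """Return True when the server's wireVersion equals the expected one.

    The package `version` field is deliberately ignored: per the docs it can
    differ between client and server without breaking the wire contract.
    """
    payload = json.loads(payload_text)
    return payload.get("wireVersion") == expected


if __name__ == "__main__":
    sample = json.dumps(
        {
            "package": "@tangle-network/agent-eval",
            "version": "0.20.9",
            "wireVersion": "1.0.0",
            "apiSurface": ["judge", "listRubrics", "version"],
        }
    )
    print("compatible:", is_wire_compatible(sample))
```

A client that wanted looser matching (for example, comparing only a major component) would be making a policy choice the source does not state, so the sketch sticks to equality.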

@@ -176,7 +176,7 @@ Each invocation is one process — Node startup adds ~500 ms. For more than a fe

 ## Clients

-- **Python**: [`tangle-agent-eval`](../clients/python/README.md) on PyPI. Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
+- **Python**: source lives in [`clients/python`](../clients/python/README.md). Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
 - **TypeScript**: import directly from `@tangle-network/agent-eval` (no wire round-trip needed in-process).
 - **Rust / Go / Other**: generate from `dist/openapi.json`. PRs welcome to add an officially-maintained client.

examples/benchmarks/README.md

Lines changed: 4 additions & 11 deletions
@@ -11,17 +11,10 @@ The novel benchmark we ship and own — the synthetic routing task — lives in

 ## Using these wrappers

-Two paths.
-
-**Option A — read and inline.** Copy the wrapper file into your project. Replace the import paths from `../../../src/benchmarks/types` and `../../../src/run-record` with `@tangle-network/agent-eval`. Done.
-
-**Option B — import from agent-eval source.** If your project sits in this monorepo (or you've cloned the repo), import directly:
-
-```ts
-import * as gsm8k from '@tangle-network/agent-eval/examples/benchmarks/gsm8k'
-```
-
-This requires adding `examples/**/*.ts` to your TypeScript paths. Easier to just copy.
+Read and inline them. Copy the wrapper file into your project, then replace
+imports such as `../../../src/benchmarks/types` and `../../../src/run-record`
+with `@tangle-network/agent-eval`. These examples are repository source, not
+published npm subpaths.

 ## What every BenchmarkAdapter exports
