Commit d2ea910

chore: harden agent-eval audit gaps (#28)
1 parent ef4cd4f commit d2ea910

29 files changed

Lines changed: 598 additions & 75 deletions

.github/workflows/publish.yml

Lines changed: 13 additions & 4 deletions
@@ -35,20 +35,29 @@ jobs:
       - name: Test JS
         run: pnpm test

-      - name: Build JS
+      - name: Build JS and emit OpenAPI spec
         run: pnpm build

-      - name: Emit OpenAPI spec
-        run: pnpm openapi
-
       - name: Verify version lock between npm and PyPI packages
         run: |
           NPM_VERSION=$(node -p "require('./package.json').version")
           PY_VERSION=$(grep -E '^version' clients/python/pyproject.toml | head -1 | sed -E 's/.*"([^"]+)".*/\1/')
+          PY_RUNTIME_VERSION=$(python -c "import pathlib,re; text=pathlib.Path('clients/python/src/tangle_agent_eval/__init__.py').read_text(); match=re.search(r'__version__ = \"([^\"]+)\"', text); print(match.group(1) if match else '')")
           if [ "$NPM_VERSION" != "$PY_VERSION" ]; then
             echo "::error::Version mismatch: npm=$NPM_VERSION pypi=$PY_VERSION. Bump them together."
             exit 1
           fi
+          if [ -n "$PY_RUNTIME_VERSION" ] && [ "$NPM_VERSION" != "$PY_RUNTIME_VERSION" ]; then
+            echo "::error::Version mismatch: npm=$NPM_VERSION python_runtime=$PY_RUNTIME_VERSION. Bump them together."
+            exit 1
+          fi
+          if [[ "${GITHUB_REF:-}" == refs/tags/v* ]]; then
+            TAG_VERSION="${GITHUB_REF#refs/tags/v}"
+            if [ "$TAG_VERSION" != "$NPM_VERSION" ]; then
+              echo "::error::Tag/version mismatch: tag=$TAG_VERSION package=$NPM_VERSION."
+              exit 1
+            fi
+          fi
           echo "Versions locked: $NPM_VERSION"

       - name: Install Python client
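
The version-lock step above compares four sources: the npm version, the `pyproject.toml` version, the `__version__` fallback literal, and the release tag. The same rules can be sketched as a pure function, which may be easier to unit-test than inline shell. This is only an illustration: `VersionInputs` and `findVersionMismatches` are hypothetical names that do not appear in the repository, and unlike the shell step (which exits at the first mismatch) the sketch collects every mismatch.

```typescript
// Hypothetical inputs mirroring the shell variables in the workflow step.
interface VersionInputs {
  npmVersion: string
  pyVersion: string
  pyRuntimeVersion: string // empty string when no fallback literal was found
  gitRef?: string // e.g. "refs/tags/v0.20.10"; undefined outside CI
}

// Returns one message per mismatch; an empty array means the versions are locked.
function findVersionMismatches(v: VersionInputs): string[] {
  const errors: string[] = []
  if (v.npmVersion !== v.pyVersion) {
    errors.push(`npm=${v.npmVersion} pypi=${v.pyVersion}`)
  }
  // The runtime fallback is only compared when it could be extracted,
  // matching the `-n "$PY_RUNTIME_VERSION"` guard in the shell step.
  if (v.pyRuntimeVersion !== '' && v.npmVersion !== v.pyRuntimeVersion) {
    errors.push(`npm=${v.npmVersion} python_runtime=${v.pyRuntimeVersion}`)
  }
  // The tag check only applies to refs of the form refs/tags/v<version>.
  const tagPrefix = 'refs/tags/v'
  const ref = v.gitRef ?? ''
  if (ref.startsWith(tagPrefix)) {
    const tagVersion = ref.slice(tagPrefix.length)
    if (tagVersion !== v.npmVersion) {
      errors.push(`tag=${tagVersion} package=${v.npmVersion}`)
    }
  }
  return errors
}
```

On a branch build (`refs/heads/...`) the tag comparison is skipped entirely, so only the first two checks can fail there.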

CHANGELOG.md

Lines changed: 24 additions & 0 deletions
@@ -1,5 +1,29 @@
 # Changelog

+## 0.20.10 — hardening audit follow-up
+
+### Fixed
+
+- `hashRubric` now recursively sorts nested rubric fields before hashing, so
+  dimension, failure-mode, and win changes alter `rubricVersion`.
+- Wire judge handling now validates LLM output before returning it: finite
+  dimension scores, rationale, and known failure/win ids are enforced.
+- Control-runtime budgets reject invalid numeric config, and invalid action
+  costs are omitted from step telemetry instead of leaking `NaN`/`Infinity`.
+- Knowledge readiness now treats invalid `validUntil` timestamps as stale.
+- Trace-analyst regex search supports leading `(?i)` and stops scanning once
+  bounded match output is reached.
+- SWE-Bench Lite example wording now reflects the implemented external-grader
+  adapter, with quoted command parsing and timeout coverage.
+
+### Changed
+
+- Published package contents now include `CHANGELOG.md`.
+- Public docs now use GitHub URLs for repository-only examples and Python
+  client source.
+- Publish CI now checks npm, Python package, runtime fallback version, and tag
+  version agree before publishing.
+
 ## 0.20.9 — release hygiene and runtime failure fixes

 ### Fixed
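
The first changelog entry says `hashRubric` now sorts nested fields recursively before hashing. The `hashRubric` source is not part of this diff, so the following is a minimal sketch of the general technique (canonical JSON followed by a digest), not the repository's implementation. `canonicalize` and `hashRubricSketch` are hypothetical names, and SHA-256 is an assumed digest.

```typescript
import { createHash } from 'node:crypto'

// Recursively rebuild a JSON-like value with object keys in sorted order.
// Arrays keep their order, since element order is usually meaningful.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize)
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>
    const sorted: Record<string, unknown> = {}
    for (const key of Object.keys(source).sort()) {
      sorted[key] = canonicalize(source[key])
    }
    return sorted
  }
  return value
}

// Hypothetical stand-in for `hashRubric`: hash the canonical JSON text.
function hashRubricSketch(rubric: unknown): string {
  const canonical = JSON.stringify(canonicalize(rubric))
  return createHash('sha256').update(canonical).digest('hex')
}
```

With this shape, two rubrics that differ only in key order hash identically, while a change to any nested field (for example a dimension weight) changes the digest. One common way nested changes get lost, which may or may not be what happened here, is passing a top-level key list as the `JSON.stringify` replacer array: that array acts as an allowlist at every depth, so nested keys missing from it are dropped.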

README.md

Lines changed: 8 additions & 4 deletions
@@ -98,16 +98,20 @@ pip install -e .
 | `summaryTable`, `paretoChart`, `gainHistogram` | Report-ready structured outputs. |
 | `KnowledgeRequirement`, `KnowledgeBundle` | Shared contracts for knowledge readiness. |

+`NoopResearcher` is a fail-loud sentinel for wiring tests. Production systems
+should implement `Researcher` directly or use `CallbackResearcher`.
+
 ## Examples

-Runnable examples live in the repository's [`examples/`](./examples)
+Runnable examples live in the repository's
+[`examples/`](https://github.com/tangle-network/agent-eval/tree/main/examples)
 directory. They are not part of the published npm package.

-- [`examples/same-sandbox-harness`](./examples/same-sandbox-harness) - run
+- [`examples/same-sandbox-harness`](https://github.com/tangle-network/agent-eval/tree/main/examples/same-sandbox-harness) - run
   multiple eval passes against the same workspace.
-- [`examples/multi-shot-optimization`](./examples/multi-shot-optimization) -
+- [`examples/multi-shot-optimization`](https://github.com/tangle-network/agent-eval/tree/main/examples/multi-shot-optimization) -
   optimize full agent trajectories with held-out promotion.
-- [`examples/benchmarks`](./examples/benchmarks) - benchmark adapter shape and
+- [`examples/benchmarks`](https://github.com/tangle-network/agent-eval/tree/main/examples/benchmarks) - benchmark adapter shape and
   reference benchmark wrappers.

 The examples are intentionally kept outside the README so they can be expanded,

clients/python/README.md

Lines changed: 2 additions & 2 deletions
@@ -102,7 +102,7 @@ Return server + wire-protocol version. Match your `pip install` version to `vers

 ```python
 v = client.version()
-assert v.version.startswith("0.12")
+assert v.version.startswith("0.20")
 assert v.wire_version == "1.0.0"
 ```

@@ -141,7 +141,7 @@ All errors carry `.code` and `.details` (the structured payload from the server)

 ## Versioning

-This package is **version-locked** to the npm package. `tangle-agent-eval==0.20.9` ↔ `@tangle-network/agent-eval@0.20.9`. The two ship from the same git tag in the same CI workflow; if either fails to publish, neither does. Mismatched versions are a build-time error.
+This package is **version-locked** to the npm package. `tangle-agent-eval==0.20.10` ↔ `@tangle-network/agent-eval@0.20.10`. CI verifies the npm package, Python package, runtime `__version__`, and release tag all agree before publish. If one registry publish fails after the other succeeds, retry the failed publish from the same tag or supersede with the next patch release.

 `wire_version` is separate. It bumps only on breaking schema changes. Package versions can differ across releases as long as `wire_version` is the same.

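
The versioning paragraph separates two notions: the package version, which is locked across npm and PyPI, and the wire version, which only changes on breaking schema changes. A client-side guard following that rule might look like the sketch below. It assumes the `GET /v1/version` payload shape shown in the `docs/wire-protocol.md` diff (`wireVersion` in JSON, exposed as `wire_version` by the Python client). `VersionPayload` and `checkWireCompatibility` are hypothetical names, and exact string equality on the wire version is an assumption about how strict a client would want to be.

```typescript
// Shape of the GET /v1/version response body as documented in wire-protocol.md.
interface VersionPayload {
  package: string
  version: string
  wireVersion: string
  apiSurface: string[]
}

// Hypothetical guard: accept any package version, but require the wire
// version to match what this client was built against.
function checkWireCompatibility(raw: string, expectedWireVersion: string): VersionPayload {
  const parsed = JSON.parse(raw) as Partial<VersionPayload>
  if (typeof parsed.version !== 'string' || typeof parsed.wireVersion !== 'string') {
    throw new Error('version payload is missing `version` or `wireVersion`')
  }
  if (parsed.wireVersion !== expectedWireVersion) {
    throw new Error(
      `wire version mismatch: server=${parsed.wireVersion} client=${expectedWireVersion}`,
    )
  }
  return parsed as VersionPayload
}
```

Under this sketch a `0.20.9` client talking to a `0.20.10` server is accepted as long as both report wire version `1.0.0`.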

clients/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "tangle-agent-eval"
-version = "0.20.9"
+version = "0.20.10"
 description = "Python client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC."
 readme = "README.md"
 requires-python = ">=3.10"

clients/python/src/tangle_agent_eval/__init__.py

Lines changed: 6 additions & 1 deletion
@@ -21,6 +21,8 @@
 See README.md for the full guide.
 """

+from importlib.metadata import PackageNotFoundError, version
+
 from .client import Client
 from .errors import (
     AgentEvalError,
@@ -39,7 +41,10 @@
     VersionResponse,
 )

-__version__ = "0.20.9"
+try:
+    __version__ = version("tangle-agent-eval")
+except PackageNotFoundError:
+    __version__ = "0.20.10"

 __all__ = [
     "Client",

docs/wire-protocol.md

Lines changed: 2 additions & 2 deletions
@@ -96,7 +96,7 @@ GET /v1/version
 ```json
 {
   "package": "@tangle-network/agent-eval",
-  "version": "0.20.9",
+  "version": "0.20.10",
   "wireVersion": "1.0.0",
   "apiSurface": ["judge", "listRubrics", "version"]
 }
@@ -176,7 +176,7 @@ Each invocation is one process — Node startup adds ~500 ms. For more than a fe

 ## Clients

-- **Python**: source lives in [`clients/python`](../clients/python/README.md). Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
+- **Python**: source lives in [`clients/python`](https://github.com/tangle-network/agent-eval/tree/main/clients/python). Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
 - **TypeScript**: import directly from `@tangle-network/agent-eval` (no wire round-trip needed in-process).
 - **Rust / Go / Other**: generate from `dist/openapi.json`. PRs welcome to add an officially-maintained client.


examples/benchmarks/swebench-lite/index.ts

Lines changed: 87 additions & 15 deletions
@@ -1,10 +1,9 @@
 /**
  * SWE-Bench Lite wrapper — 30-instance subset.
  *
- * Status: STUB. The actual SWE-Bench harness needs a Docker host and
- * is too heavy to ship inside this package. We expose the contract
- * (loadDataset, evaluate, assignSplit) so consumers can plug in their
- * own grader without touching call sites.
+ * The official grader needs a Docker host and repository cache, so this
+ * wrapper keeps the package lightweight and delegates grading to a
+ * caller-provided executable.
  *
  * Wire-up paths in priority order:
  *
@@ -18,9 +17,8 @@
  *     JSON on stdout. Implementations can shell out to the
  *     official `swebench` runner here.
  *
- * If neither is set, every public method throws a clearly-marked
- * "not implemented" error. The stub fails LOUD; it never silently
- * scores zero.
+ * If the dataset or grader is not configured, public methods throw a
+ * clearly-marked setup error. This adapter never silently scores zero.
  */

 import { existsSync, readFileSync } from 'node:fs'
@@ -53,7 +51,7 @@ class SweBenchLiteAdapter
     if (!path) {
       throw new Error(
         'SWE-Bench Lite dataset not provided. Set AGENT_EVAL_SWEBENCH_PATH to a JSONL file ' +
-          'with the 30 lite instances. STUB: this wrapper does not bundle the dataset; ' +
+          'with the 30 lite instances. This wrapper does not bundle the dataset; ' +
           'see https://www.swebench.com/lite.html for the canonical source.',
       )
     }
@@ -71,12 +69,12 @@ class SweBenchLiteAdapter
         'SWE-Bench Lite grader not configured. Set AGENT_EVAL_SWEBENCH_GRADER_CMD to an ' +
           'executable that reads {instance_id, patch} JSON on stdin and writes ' +
           '{passed, fail_to_pass_passed, pass_to_pass_passed, log} JSON on stdout. ' +
-          'TODO(swebench-lite): bundle a default Docker-based runner once the SDK ' +
-          'stabilises (https://github.com/swe-bench/SWE-bench).',
+          'This wrapper intentionally delegates Docker-based grading to the configured command.',
       )
     }
     const stdinPayload = JSON.stringify({ instance_id: item.payload.instanceId, patch: response })
-    const result = await runGrader(cmd, stdinPayload)
+    const timeoutMs = parsePositiveInt(process.env.AGENT_EVAL_SWEBENCH_GRADER_TIMEOUT_MS, 300_000)
+    const result = await runGrader(cmd, stdinPayload, timeoutMs)
     let parsed: Record<string, unknown>
     try {
       parsed = JSON.parse(result.stdout) as Record<string, unknown>
@@ -115,7 +113,7 @@ function parseJsonl(path: string): SweBenchLiteItem[] {
     lineNo++
     const trimmed = line.trim()
     if (!trimmed) continue
-    const row = JSON.parse(trimmed) as Record<string, unknown>
+    const row = parseJsonRow(trimmed, lineNo)
     const instanceId = String(row.instance_id ?? row.instanceId ?? '')
     if (!instanceId) {
       throw new Error(`swebench-lite line ${lineNo} missing instance_id`)
@@ -149,16 +147,90 @@ function asStringArray(v: unknown): string[] {
   return []
 }

-function runGrader(cmd: string, stdin: string): Promise<{ stdout: string; stderr: string }> {
+function parseJsonRow(line: string, lineNo: number): Record<string, unknown> {
+  try {
+    return JSON.parse(line) as Record<string, unknown>
+  } catch (e) {
+    throw new Error(`swebench-lite JSONL parse error at line ${lineNo}: ${(e as Error).message}`)
+  }
+}
+
+export function parseSweBenchGraderCommand(cmd: string): string[] {
+  const parts: string[] = []
+  let current = ''
+  let quote: '"' | "'" | null = null
+  let escaping = false
+  for (const ch of cmd.trim()) {
+    if (escaping) {
+      current += ch
+      escaping = false
+      continue
+    }
+    if (ch === '\\') {
+      escaping = true
+      continue
+    }
+    if (quote) {
+      if (ch === quote) quote = null
+      else current += ch
+      continue
+    }
+    if (ch === '"' || ch === "'") {
+      quote = ch
+      continue
+    }
+    if (/\s/.test(ch)) {
+      if (current) {
+        parts.push(current)
+        current = ''
+      }
+      continue
+    }
+    current += ch
+  }
+  if (escaping) current += '\\'
+  if (quote) throw new Error(`SWE-Bench grader command has an unterminated ${quote} quote`)
+  if (current) parts.push(current)
+  if (parts.length === 0) throw new Error('SWE-Bench grader command is empty')
+  return parts
+}
+
+function parsePositiveInt(raw: string | undefined, fallback: number): number {
+  if (!raw) return fallback
+  const parsed = Number(raw)
+  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback
+}
+
+function runGrader(cmd: string, stdin: string, timeoutMs: number): Promise<{ stdout: string; stderr: string }> {
   return new Promise((resolve, reject) => {
-    const parts = cmd.split(/\s+/)
+    let parts: string[]
+    try {
+      parts = parseSweBenchGraderCommand(cmd)
+    } catch (e) {
+      reject(e)
+      return
+    }
     const child = spawn(parts[0]!, parts.slice(1), { stdio: ['pipe', 'pipe', 'pipe'] })
     let stdout = ''
     let stderr = ''
+    let settled = false
+    const timer = setTimeout(() => {
+      settled = true
+      child.kill('SIGTERM')
+      reject(new Error(`SWE-Bench grader timed out after ${timeoutMs}ms`))
+    }, timeoutMs)
     child.stdout.on('data', (b: Buffer) => (stdout += b.toString('utf8')))
     child.stderr.on('data', (b: Buffer) => (stderr += b.toString('utf8')))
-    child.on('error', reject)
+    child.on('error', (err) => {
+      if (settled) return
+      settled = true
+      clearTimeout(timer)
+      reject(err)
+    })
     child.on('close', (code) => {
+      if (settled) return
+      settled = true
+      clearTimeout(timer)
       if (code !== 0) {
         reject(new Error(`grader exited with code ${code}: ${stderr.slice(0, 400)}`))
         return
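
The adapter's setup error describes the grader contract: `{instance_id, patch}` JSON on stdin, `{passed, fail_to_pass_passed, pass_to_pass_passed, log}` JSON on stdout. A grader on the other side of that contract could be structured as in the sketch below. Everything here is illustrative: `GraderRequest`, `GraderResult`, and `handleGraderRequest` are hypothetical names, and the diff does not state the types of `fail_to_pass_passed` and `pass_to_pass_passed`, so string arrays of test ids are assumed.

```typescript
// Input shape, taken from the adapter's error message.
interface GraderRequest {
  instance_id: string
  patch: string
}

// Output shape. ASSUMPTION: the two *_passed fields are modeled as lists of
// test ids; the diff only names the fields, not their types.
interface GraderResult {
  passed: boolean
  fail_to_pass_passed: string[]
  pass_to_pass_passed: string[]
  log: string
}

// Hypothetical pure handler: validates the stdin text, delegates the actual
// test run to `runTests`, and returns the stdout text. A real grader would
// invoke the official SWE-bench harness inside `runTests`.
function handleGraderRequest(
  stdinText: string,
  runTests: (req: GraderRequest) => GraderResult,
): string {
  const req = JSON.parse(stdinText) as Partial<GraderRequest>
  if (typeof req.instance_id !== 'string' || typeof req.patch !== 'string') {
    throw new Error('grader input must be {instance_id, patch} JSON')
  }
  const result = runTests({ instance_id: req.instance_id, patch: req.patch })
  return JSON.stringify(result)
}
```

A script wrapping this handler would read all of stdin, print the returned string, and exit non-zero on error, since `runGrader` rejects on any non-zero exit code. It would be wired in through `AGENT_EVAL_SWEBENCH_GRADER_CMD`, for example `node "/opt/my graders/run.js"`, which `parseSweBenchGraderCommand` splits into `node` and `/opt/my graders/run.js`. As written, that parser treats a backslash as an escape even inside quotes, so a Windows-style path such as `C:\tools\grader.exe` would lose its backslashes; forward slashes avoid this.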

package.json

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
{
22
"name": "@tangle-network/agent-eval",
3-
"version": "0.20.9",
3+
"version": "0.20.10",
44
"description": "Trace-first evaluation infrastructure for agent systems: traces, harnesses, verifier pipelines, judges, datasets, gates, optimization, and reporting.",
55
"homepage": "https://github.com/tangle-network/agent-eval#readme",
66
"repository": {
@@ -48,19 +48,20 @@
4848
},
4949
"files": [
5050
"dist",
51-
"docs"
51+
"docs",
52+
"CHANGELOG.md"
5253
],
5354
"publishConfig": {
5455
"access": "public"
5556
},
5657
"scripts": {
57-
"build": "tsup && node dist/cli.js openapi --out dist/openapi.json",
58+
"build": "tsup && pnpm openapi",
5859
"dev": "tsup --watch",
5960
"prepare": "pnpm build",
6061
"test": "vitest run",
6162
"test:watch": "vitest",
6263
"typecheck": "tsc --noEmit",
63-
"openapi": "pnpm build"
64+
"openapi": "node dist/cli.js openapi --out dist/openapi.json"
6465
},
6566
"dependencies": {
6667
"@asteasolutions/zod-to-openapi": "^8.5.0",

src/benchmarks/index.ts

Lines changed: 2 additions & 2 deletions
@@ -10,8 +10,8 @@
  * Example wrappers (under `examples/benchmarks/`, NOT in the bundle):
  *   - `gsm8k` — exact-match math reasoning (HF mirror, dataset
  *     not bundled).
- *   - `swebench-lite` — 30-instance SWE-Bench subset (stub; needs an
- *     external grader).
+ *   - `swebench-lite` — 30-instance SWE-Bench subset via an external
+ *     grader command.
  *
  * The example wrappers are reference implementations of `BenchmarkAdapter`.
  * Read them, copy them, adapt them. They're intentionally not in the main
