tangle-network
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
Lines changed: 13 additions & 4 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 24 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 8 additions & 4 deletions
diff --git a/clients/python/README.md b/clients/python/README.md
Lines changed: 2 additions & 2 deletions
diff --git a/clients/python/pyproject.toml b/clients/python/pyproject.toml
Lines changed: 1 addition & 1 deletion
diff --git a/clients/python/src/tangle_agent_eval/__init__.py b/clients/python/src/tangle_agent_eval/__init__.py
Lines changed: 6 additions & 1 deletion
diff --git a/docs/wire-protocol.md b/docs/wire-protocol.md
Lines changed: 2 additions & 2 deletions
diff --git a/examples/benchmarks/swebench-lite/index.ts b/examples/benchmarks/swebench-lite/index.ts
Lines changed: 87 additions & 15 deletions
diff --git a/package.json b/package.json
Lines changed: 5 additions & 4 deletions
diff --git a/src/benchmarks/index.ts b/src/benchmarks/index.ts
Lines changed: 2 additions & 2 deletions
@@ -35,20 +35,29 @@ jobs:
       - name: Test JS
         run: pnpm test
 
-      - name: Build JS
+      - name: Build JS and emit OpenAPI spec
         run: pnpm build
 
-      - name: Emit OpenAPI spec
-        run: pnpm openapi
-
       - name: Verify version lock between npm and PyPI packages
         run: |
           NPM_VERSION=$(node -p "require('./package.json').version")
           PY_VERSION=$(grep -E '^version' clients/python/pyproject.toml | head -1 | sed -E 's/.*"([^"]+)".*/\1/')
+          PY_RUNTIME_VERSION=$(python -c "import pathlib,re; text=pathlib.Path('clients/python/src/tangle_agent_eval/__init__.py').read_text(); match=re.search(r'__version__ = \"([^\"]+)\"', text); print(match.group(1) if match else '')")
           if [ "$NPM_VERSION" != "$PY_VERSION" ]; then
             echo "::error::Version mismatch: npm=$NPM_VERSION pypi=$PY_VERSION. Bump them together."
             exit 1
           fi
+          if [ -n "$PY_RUNTIME_VERSION" ] && [ "$NPM_VERSION" != "$PY_RUNTIME_VERSION" ]; then
+            echo "::error::Version mismatch: npm=$NPM_VERSION python_runtime=$PY_RUNTIME_VERSION. Bump them together."
+            exit 1
+          fi
+          if [[ "${GITHUB_REF:-}" == refs/tags/v* ]]; then
+            TAG_VERSION="${GITHUB_REF#refs/tags/v}"
+            if [ "$TAG_VERSION" != "$NPM_VERSION" ]; then
+              echo "::error::Tag/version mismatch: tag=$TAG_VERSION package=$NPM_VERSION."
+              exit 1
+            fi
+          fi
           echo "Versions locked: $NPM_VERSION"
 
       - name: Install Python client
 
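The version-lock step in the hunk above compares four values: the npm version, the `pyproject.toml` version, the fallback `__version__` literal, and the release tag when the ref is a `v*` tag. As a reading aid, here is a minimal TypeScript sketch of the same rules. The names `VersionInputs` and `checkVersionLock` are hypothetical and do not exist in the repository. The real check is the shell script, which exits on the first mismatch, whereas this sketch collects every mismatch.

```typescript
// Hypothetical illustration of the rules in the workflow step above.
// The names below are assumptions made for this sketch only.
interface VersionInputs {
  npm: string // version field from package.json
  pyproject: string // version field from clients/python/pyproject.toml
  pyRuntime: string // fallback __version__ literal, '' when not found
  gitRef?: string // e.g. the value of GITHUB_REF
}

function checkVersionLock(v: VersionInputs): string[] {
  const errors: string[] = []
  if (v.npm !== v.pyproject) {
    errors.push(`Version mismatch: npm=${v.npm} pypi=${v.pyproject}`)
  }
  // Mirrors `[ -n "$PY_RUNTIME_VERSION" ]`: an empty value skips the check.
  if (v.pyRuntime !== '' && v.npm !== v.pyRuntime) {
    errors.push(`Version mismatch: npm=${v.npm} python_runtime=${v.pyRuntime}`)
  }
  // Mirrors the `refs/tags/v*` branch: only tag refs are compared.
  const tagPrefix = 'refs/tags/v'
  if (v.gitRef !== undefined && v.gitRef.startsWith(tagPrefix)) {
    const tag = v.gitRef.slice(tagPrefix.length)
    if (tag !== v.npm) {
      errors.push(`Tag/version mismatch: tag=${tag} package=${v.npm}`)
    }
  }
  return errors
}

const clean = checkVersionLock({
  npm: '0.20.10',
  pyproject: '0.20.10',
  pyRuntime: '0.20.10',
  gitRef: 'refs/tags/v0.20.10',
})
console.log(clean.length) // 0
```

Note that an empty runtime version passes silently, as it does in the shell step, so a regex that stops matching `__init__.py` would disable that comparison rather than fail the build.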
@@ -1,5 +1,29 @@
 # Changelog
 
+## 0.20.10 — hardening audit follow-up
+
+### Fixed
+
+- `hashRubric` now recursively sorts nested rubric fields before hashing, so
+  dimension, failure-mode, and win changes alter `rubricVersion`.
+- Wire judge handling now validates LLM output before returning it: finite
+  dimension scores, rationale, and known failure/win ids are enforced.
+- Control-runtime budgets reject invalid numeric config, and invalid action
+  costs are omitted from step telemetry instead of leaking `NaN`/`Infinity`.
+- Knowledge readiness now treats invalid `validUntil` timestamps as stale.
+- Trace-analyst regex search supports leading `(?i)` and stops scanning once
+  bounded match output is reached.
+- SWE-Bench Lite example wording now reflects the implemented external-grader
+  adapter, with quoted command parsing and timeout coverage.
+
+### Changed
+
+- Published package contents now include `CHANGELOG.md`.
+- Public docs now use GitHub URLs for repository-only examples and Python
+  client source.
+- Publish CI now checks npm, Python package, runtime fallback version, and tag
+  version agree before publishing.
+
 ## 0.20.9 — release hygiene and runtime failure fixes
 
 ### Fixed
 
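The first "Fixed" entry for 0.20.10 says `hashRubric` sorts nested fields before hashing. The sketch below shows the general technique, not the package's implementation. `stableStringify` and `hashValue` are hypothetical names, SHA-256 is an assumption, and the rubric shape is made up for the example. It sorts object keys at every depth so that key order never changes the hash, while any nested value change does.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical helper: serialize with object keys sorted at every depth.
// Array order is preserved, since reordering a list is a real change.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(',')
    return `{${body}}`
  }
  // JSON.stringify returns undefined for undefined input; treat it as null.
  return JSON.stringify(value) ?? 'null'
}

// Hypothetical helper: hash the canonical form (SHA-256 is an assumption).
function hashValue(value: unknown): string {
  return createHash('sha256').update(stableStringify(value)).digest('hex')
}

// Illustrative rubric-like objects; the field names are assumptions.
const a = { id: 'demo', dimensions: [{ name: 'accuracy', weight: 1 }] }
const b = { dimensions: [{ weight: 1, name: 'accuracy' }], id: 'demo' }
const c = { id: 'demo', dimensions: [{ name: 'accuracy', weight: 2 }] }

console.log(hashValue(a) === hashValue(b)) // true: only key order differs
console.log(hashValue(a) === hashValue(c)) // false: a nested value changed
```

A top-level-only sort would already make `a` and `b` serialize differently at the nested level, which is the class of bug the changelog entry describes.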
@@ -98,16 +98,20 @@ pip install -e .
 | `summaryTable`, `paretoChart`, `gainHistogram` | Report-ready structured outputs. |
 | `KnowledgeRequirement`, `KnowledgeBundle` | Shared contracts for knowledge readiness. |
 
+`NoopResearcher` is a fail-loud sentinel for wiring tests. Production systems
+should implement `Researcher` directly or use `CallbackResearcher`.
+
 ## Examples
 
-Runnable examples live in the repository's [`examples/`](./examples)
+Runnable examples live in the repository's
+[`examples/`](https://github.com/tangle-network/agent-eval/tree/main/examples)
 directory. They are not part of the published npm package.
 
-- [`examples/same-sandbox-harness`](./examples/same-sandbox-harness) - run
+- [`examples/same-sandbox-harness`](https://github.com/tangle-network/agent-eval/tree/main/examples/same-sandbox-harness) - run
   multiple eval passes against the same workspace.
-- [`examples/multi-shot-optimization`](./examples/multi-shot-optimization) -
+- [`examples/multi-shot-optimization`](https://github.com/tangle-network/agent-eval/tree/main/examples/multi-shot-optimization) -
   optimize full agent trajectories with held-out promotion.
-- [`examples/benchmarks`](./examples/benchmarks) - benchmark adapter shape and
+- [`examples/benchmarks`](https://github.com/tangle-network/agent-eval/tree/main/examples/benchmarks) - benchmark adapter shape and
   reference benchmark wrappers.
 
 The examples are intentionally kept outside the README so they can be expanded,
 
@@ -102,7 +102,7 @@ Return server + wire-protocol version. Match your `pip install` version to `vers
 
 ```python
 v = client.version()
-assert v.version.startswith("0.12")
+assert v.version.startswith("0.20")
 assert v.wire_version == "1.0.0"
 ```
 
@@ -141,7 +141,7 @@ All errors carry `.code` and `.details` (the structured payload from the server)
 
 ## Versioning
 
-This package is **version-locked** to the npm package. `tangle-agent-eval==0.20.9` ↔ `@tangle-network/agent-eval@0.20.9`. The two ship from the same git tag in the same CI workflow; if either fails to publish, neither does. Mismatched versions are a build-time error.
+This package is **version-locked** to the npm package. `tangle-agent-eval==0.20.10` ↔ `@tangle-network/agent-eval@0.20.10`. CI verifies the npm package, Python package, runtime `__version__`, and release tag all agree before publish. If one registry publish fails after the other succeeds, retry the failed publish from the same tag or supersede with the next patch release.
 
 `wire_version` is separate. It bumps only on breaking schema changes. Package versions can differ across releases as long as `wire_version` is the same.
 
 
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "tangle-agent-eval"
-version = "0.20.9"
+version = "0.20.10"
 description = "Python client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC."
 readme = "README.md"
 requires-python = ">=3.10"
 
@@ -21,6 +21,8 @@
 See README.md for the full guide.
 """
 
+from importlib.metadata import PackageNotFoundError, version
+
 from .client import Client
 from .errors import (
     AgentEvalError,
@@ -39,7 +41,10 @@
     VersionResponse,
 )
 
-__version__ = "0.20.9"
+try:
+    __version__ = version("tangle-agent-eval")
+except PackageNotFoundError:
+    __version__ = "0.20.10"
 
 __all__ = [
     "Client",
 
@@ -96,7 +96,7 @@ GET /v1/version
 ```json
 {
   "package": "@tangle-network/agent-eval",
-  "version": "0.20.9",
+  "version": "0.20.10",
   "wireVersion": "1.0.0",
   "apiSurface": ["judge", "listRubrics", "version"]
 }
@@ -176,7 +176,7 @@ Each invocation is one process — Node startup adds ~500 ms. For more than a fe
 
 ## Clients
 
-- **Python**: source lives in [`clients/python`](../clients/python/README.md). Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
+- **Python**: source lives in [`clients/python`](https://github.com/tangle-network/agent-eval/tree/main/clients/python). Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
 - **TypeScript**: import directly from `@tangle-network/agent-eval` (no wire round-trip needed in-process).
 - **Rust / Go / Other**: generate from `dist/openapi.json`. PRs welcome to add an officially-maintained client.
 
 
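For anyone writing one of the "Rust / Go / Other" clients mentioned above, the `GET /v1/version` body shown in this file is enough to gate on before sending requests. The following is a minimal TypeScript sketch under stated assumptions. `parseVersionResponse` and `supportsMethod` are hypothetical helpers, not part of the package, and the only facts used are the four fields in the example response.

```typescript
// Shape of the /v1/version body as shown in docs/wire-protocol.md.
interface VersionResponse {
  package: string
  version: string
  wireVersion: string
  apiSurface: string[]
}

// Hypothetical helper: validate the body instead of trusting a cast.
function parseVersionResponse(body: string): VersionResponse {
  const raw = JSON.parse(body) as Record<string, unknown>
  const { package: pkg, version, wireVersion, apiSurface } = raw
  if (typeof pkg !== 'string' || typeof version !== 'string' || typeof wireVersion !== 'string') {
    throw new Error('version response is missing package, version, or wireVersion')
  }
  if (!Array.isArray(apiSurface) || !apiSurface.every((s) => typeof s === 'string')) {
    throw new Error('version response has an invalid apiSurface')
  }
  return { package: pkg, version, wireVersion, apiSurface: apiSurface as string[] }
}

// Hypothetical helper: gate on wireVersion, not on the package version,
// because package versions may differ while the wire schema stays the same.
function supportsMethod(resp: VersionResponse, wireVersion: string, method: string): boolean {
  return resp.wireVersion === wireVersion && resp.apiSurface.includes(method)
}

const sample = parseVersionResponse(
  '{"package":"@tangle-network/agent-eval","version":"0.20.10","wireVersion":"1.0.0","apiSurface":["judge","listRubrics","version"]}',
)
console.log(supportsMethod(sample, '1.0.0', 'judge')) // true
```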
@@ -1,10 +1,9 @@
 /**
  * SWE-Bench Lite wrapper — 30-instance subset.
  *
- * Status: STUB. The actual SWE-Bench harness needs a Docker host and
- * is too heavy to ship inside this package. We expose the contract
- * (loadDataset, evaluate, assignSplit) so consumers can plug in their
- * own grader without touching call sites.
+ * The official grader needs a Docker host and repository cache, so this
+ * wrapper keeps the package lightweight and delegates grading to a
+ * caller-provided executable.
  *
  * Wire-up paths in priority order:
  *
@@ -18,9 +17,8 @@
  *      JSON on stdout. Implementations can shell out to the
  *      official `swebench` runner here.
  *
- * If neither is set, every public method throws a clearly-marked
- * "not implemented" error. The stub fails LOUD; it never silently
- * scores zero.
+ * If the dataset or grader is not configured, public methods throw a
+ * clearly-marked setup error. This adapter never silently scores zero.
  */
 
 import { existsSync, readFileSync } from 'node:fs'
@@ -53,7 +51,7 @@ class SweBenchLiteAdapter
     if (!path) {
       throw new Error(
         'SWE-Bench Lite dataset not provided. Set AGENT_EVAL_SWEBENCH_PATH to a JSONL file ' +
-          'with the 30 lite instances. STUB: this wrapper does not bundle the dataset; ' +
+          'with the 30 lite instances. This wrapper does not bundle the dataset; ' +
           'see https://www.swebench.com/lite.html for the canonical source.',
       )
     }
@@ -71,12 +69,12 @@ class SweBenchLiteAdapter
         'SWE-Bench Lite grader not configured. Set AGENT_EVAL_SWEBENCH_GRADER_CMD to an ' +
           'executable that reads {instance_id, patch} JSON on stdin and writes ' +
           '{passed, fail_to_pass_passed, pass_to_pass_passed, log} JSON on stdout. ' +
-          'TODO(swebench-lite): bundle a default Docker-based runner once the SDK ' +
-          'stabilises (https://github.com/swe-bench/SWE-bench).',
+          'This wrapper intentionally delegates Docker-based grading to the configured command.',
       )
     }
     const stdinPayload = JSON.stringify({ instance_id: item.payload.instanceId, patch: response })
-    const result = await runGrader(cmd, stdinPayload)
+    const timeoutMs = parsePositiveInt(process.env.AGENT_EVAL_SWEBENCH_GRADER_TIMEOUT_MS, 300_000)
+    const result = await runGrader(cmd, stdinPayload, timeoutMs)
     let parsed: Record<string, unknown>
     try {
       parsed = JSON.parse(result.stdout) as Record<string, unknown>
@@ -115,7 +113,7 @@ function parseJsonl(path: string): SweBenchLiteItem[] {
     lineNo++
     const trimmed = line.trim()
     if (!trimmed) continue
-    const row = JSON.parse(trimmed) as Record<string, unknown>
+    const row = parseJsonRow(trimmed, lineNo)
     const instanceId = String(row.instance_id ?? row.instanceId ?? '')
     if (!instanceId) {
       throw new Error(`swebench-lite line ${lineNo} missing instance_id`)
@@ -149,16 +147,90 @@ function asStringArray(v: unknown): string[] {
   return []
 }
 
-function runGrader(cmd: string, stdin: string): Promise<{ stdout: string; stderr: string }> {
+function parseJsonRow(line: string, lineNo: number): Record<string, unknown> {
+  try {
+    return JSON.parse(line) as Record<string, unknown>
+  } catch (e) {
+    throw new Error(`swebench-lite JSONL parse error at line ${lineNo}: ${(e as Error).message}`)
+  }
+}
+
+export function parseSweBenchGraderCommand(cmd: string): string[] {
+  const parts: string[] = []
+  let current = ''
+  let quote: '"' | "'" | null = null
+  let escaping = false
+  for (const ch of cmd.trim()) {
+    if (escaping) {
+      current += ch
+      escaping = false
+      continue
+    }
+    if (ch === '\\') {
+      escaping = true
+      continue
+    }
+    if (quote) {
+      if (ch === quote) quote = null
+      else current += ch
+      continue
+    }
+    if (ch === '"' || ch === "'") {
+      quote = ch
+      continue
+    }
+    if (/\s/.test(ch)) {
+      if (current) {
+        parts.push(current)
+        current = ''
+      }
+      continue
+    }
+    current += ch
+  }
+  if (escaping) current += '\\'
+  if (quote) throw new Error(`SWE-Bench grader command has an unterminated ${quote} quote`)
+  if (current) parts.push(current)
+  if (parts.length === 0) throw new Error('SWE-Bench grader command is empty')
+  return parts
+}
+
+function parsePositiveInt(raw: string | undefined, fallback: number): number {
+  if (!raw) return fallback
+  const parsed = Number(raw)
+  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback
+}
+
+function runGrader(cmd: string, stdin: string, timeoutMs: number): Promise<{ stdout: string; stderr: string }> {
   return new Promise((resolve, reject) => {
-    const parts = cmd.split(/\s+/)
+    let parts: string[]
+    try {
+      parts = parseSweBenchGraderCommand(cmd)
+    } catch (e) {
+      reject(e)
+      return
+    }
     const child = spawn(parts[0]!, parts.slice(1), { stdio: ['pipe', 'pipe', 'pipe'] })
     let stdout = ''
     let stderr = ''
+    let settled = false
+    const timer = setTimeout(() => {
+      settled = true
+      child.kill('SIGTERM')
+      reject(new Error(`SWE-Bench grader timed out after ${timeoutMs}ms`))
+    }, timeoutMs)
     child.stdout.on('data', (b: Buffer) => (stdout += b.toString('utf8')))
     child.stderr.on('data', (b: Buffer) => (stderr += b.toString('utf8')))
-    child.on('error', reject)
+    child.on('error', (err) => {
+      if (settled) return
+      settled = true
+      clearTimeout(timer)
+      reject(err)
+    })
     child.on('close', (code) => {
+      if (settled) return
+      settled = true
+      clearTimeout(timer)
       if (code !== 0) {
         reject(new Error(`grader exited with code ${code}: ${stderr.slice(0, 400)}`))
         return
 
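To make the quoting rules of `parseSweBenchGraderCommand` concrete, the block below reproduces the function from this diff (without `export`, so it runs standalone) and feeds it a few inputs. The example commands and paths are made up for illustration. Each expected result in the comments was traced by hand through the loop.

```typescript
// Reproduced from the diff above so the example is self-contained.
function parseSweBenchGraderCommand(cmd: string): string[] {
  const parts: string[] = []
  let current = ''
  let quote: '"' | "'" | null = null
  let escaping = false
  for (const ch of cmd.trim()) {
    if (escaping) {
      current += ch
      escaping = false
      continue
    }
    if (ch === '\\') {
      escaping = true
      continue
    }
    if (quote) {
      if (ch === quote) quote = null
      else current += ch
      continue
    }
    if (ch === '"' || ch === "'") {
      quote = ch
      continue
    }
    if (/\s/.test(ch)) {
      if (current) {
        parts.push(current)
        current = ''
      }
      continue
    }
    current += ch
  }
  if (escaping) current += '\\'
  if (quote) throw new Error(`SWE-Bench grader command has an unterminated ${quote} quote`)
  if (current) parts.push(current)
  if (parts.length === 0) throw new Error('SWE-Bench grader command is empty')
  return parts
}

// Quoted arguments keep their spaces.
const quoted = parseSweBenchGraderCommand('python grade.py --cache "/tmp/swe bench"')
console.log(quoted) // [ 'python', 'grade.py', '--cache', '/tmp/swe bench' ]

// A backslash escapes the next character, including a space.
const escaped = parseSweBenchGraderCommand('run\\ me --flag')
console.log(escaped) // [ 'run me', '--flag' ]

// Backslashes are consumed everywhere, so single Windows separators vanish.
const windows = parseSweBenchGraderCommand('C:\\tools\\grader.exe')
console.log(windows) // [ 'C:toolsgrader.exe' ]

// An empty quoted argument is dropped rather than passed as ''.
const emptyArg = parseSweBenchGraderCommand('cmd ""')
console.log(emptyArg) // [ 'cmd' ]
```

Two behaviours differ from a POSIX shell and are worth knowing when setting `AGENT_EVAL_SWEBENCH_GRADER_CMD`: backslashes are treated as escapes even inside single quotes, so Windows-style paths need doubled backslashes or forward slashes, and empty quoted arguments are discarded.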
@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.20.9",
+  "version": "0.20.10",
   "description": "Trace-first evaluation infrastructure for agent systems: traces, harnesses, verifier pipelines, judges, datasets, gates, optimization, and reporting.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {
@@ -48,19 +48,20 @@
   },
   "files": [
     "dist",
-    "docs"
+    "docs",
+    "CHANGELOG.md"
   ],
   "publishConfig": {
     "access": "public"
   },
   "scripts": {
-    "build": "tsup && node dist/cli.js openapi --out dist/openapi.json",
+    "build": "tsup && pnpm openapi",
     "dev": "tsup --watch",
     "prepare": "pnpm build",
     "test": "vitest run",
     "test:watch": "vitest",
     "typecheck": "tsc --noEmit",
-    "openapi": "pnpm build"
+    "openapi": "node dist/cli.js openapi --out dist/openapi.json"
   },
   "dependencies": {
     "@asteasolutions/zod-to-openapi": "^8.5.0",
 
@@ -10,8 +10,8 @@
  * Example wrappers (under `examples/benchmarks/`, NOT in the bundle):
  *   - `gsm8k`         — exact-match math reasoning (HF mirror, dataset
  *                       not bundled).
- *   - `swebench-lite` — 30-instance SWE-Bench subset (stub; needs an
- *                       external grader).
+ *   - `swebench-lite` — 30-instance SWE-Bench subset via an external
+ *                       grader command.
  *
  * The example wrappers are reference implementations of `BenchmarkAdapter`.
  * Read them, copy them, adapt them. They're intentionally not in the main