Commit d755bee

feat: 0.21.0 — capture integrity for launch-grade benchmark runs (#36)
Closes the layer-1 gap a downstream consumer surfaced: better post-run statistics don't help if the underlying data wasn't captured. 0.21 ships:

1. RawProviderSink: first-class HTTP-level capture
2. assertLlmRoute: fail-loud route guard
3. assertRunCaptured: run-completion integrity check
4. onRunComplete hooks + traceAnalystOnRunComplete: auto orchestration

Each piece is opt-in but composes cleanly: the matrix runner wires FileSystemRawProviderSink, calls assertLlmRoute({ requireExplicitBaseUrl, allowedBaseUrls }) at preflight, attaches traceAnalystOnRunComplete via TraceEmitterOptions.onRunComplete, and asserts a clean RunIntegrityReport before declaring the run complete. Result: a launch-decision-grade artifact without out-of-band glue.

RawProviderSink notes:
- InMemoryRawProviderSink, FileSystemRawProviderSink (NDJSON, rolls at 32MiB), NoopRawProviderSink ship in core
- Default redactor strips Authorization / X-Api-Key / Cookie headers and credential-shaped body fields (apiKey, bearer, password, secret, token)
- redactedFields array on every event records what was stripped
- Wired into callLlm: every retry attempt produces a request and either a response or an error event with attemptIndex
- Forensics-only: sink errors never crash the underlying LLM call

Verifier route guard:
- assertLlmRoute(opts, req) is pure (no I/O); safe to call from constructors and CI gates
- Throws structured LlmRouteAssertionError with code field for programmatic handling (no_explicit_base_url, base_url_blocked, base_url_not_allowed, no_auth, wrong_provider)

Integrity check:
- assertRunCaptured returns RunIntegrityReport with issue codes; caller decides throw vs mark-failed via throwIfRunIncomplete
- Pair with requireRawCoverageOfLlmSpans to catch the bug class where the structured span was emitted but raw HTTP capture was wired to a different sink

Run-complete hooks:
- TraceEmitterOptions.onRunComplete + addRunCompleteHook
- Errors are swallowed by default (auto-orchestration must not crash the underlying flow) and logged as 'log' events; opt into propagation via hookErrors: 'throw'
- traceAnalystOnRunComplete is the drop-in factory for the analyst case

Version lockstep:
- npm @tangle-network/agent-eval 0.21.0
- pypi tangle-agent-eval 0.21.0

867/867 tests passing (+30 new across 5 files: raw sink, route assertion, run integrity, hook lifecycle, llm raw capture).
1 parent c8f03bd commit d755bee
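
The redaction behaviour described in the commit message (strip credential headers and credential-shaped body fields, record the stripped paths in `redactedFields`) can be sketched as follows. This is a minimal illustration, not the package's implementation: `RedactableEvent`, `redactBody`, and `redactEvent` are hypothetical names, and the path format (`headers.X`, `body.a.b`) and case-insensitive key matching are assumptions.

```typescript
// Hypothetical sketch: none of these names are the package's real API.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface RedactableEvent {
  kind: "request" | "response" | "error";
  attemptIndex: number;
  headers: Record<string, string>;
  body: Json;
  redactedFields: string[];
}

// Header and body-key lists taken from the commit message; matching
// case-insensitively is an assumption.
const STRIPPED_HEADERS = new Set(["authorization", "x-api-key", "cookie"]);
const STRIPPED_BODY_KEYS = new Set(["apikey", "bearer", "password", "secret", "token"]);

// Recursively copy a JSON value, dropping credential-shaped keys and
// recording the path of each dropped key.
function redactBody(value: Json, path: string, redacted: string[]): Json {
  if (Array.isArray(value)) {
    return value.map((item, i) => redactBody(item, `${path}[${i}]`, redacted));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      const childPath = `${path}.${key}`;
      if (STRIPPED_BODY_KEYS.has(key.toLowerCase())) {
        redacted.push(childPath); // record the path, never the value
      } else {
        out[key] = redactBody(child, childPath, redacted);
      }
    }
    return out;
  }
  return value;
}

function redactEvent(input: Omit<RedactableEvent, "redactedFields">): RedactableEvent {
  const redactedFields: string[] = [];
  const headers: Record<string, string> = {};
  for (const [header, headerValue] of Object.entries(input.headers)) {
    if (STRIPPED_HEADERS.has(header.toLowerCase())) {
      redactedFields.push(`headers.${header}`);
    } else {
      headers[header] = headerValue;
    }
  }
  const body = redactBody(input.body, "body", redactedFields);
  return { ...input, headers, body, redactedFields };
}

const example = redactEvent({
  kind: "request",
  attemptIndex: 0,
  headers: { Authorization: "Bearer abc", "Content-Type": "application/json" },
  body: { model: "m", auth: { apiKey: "k" } },
});
console.log(example.redactedFields); // ["headers.Authorization", "body.auth.apiKey"]
```

Recording paths rather than values lets a reviewer confirm that a credential was present and removed without the capture file ever containing it.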

16 files changed

Lines changed: 1463 additions & 33 deletions

CHANGELOG.md

Lines changed: 65 additions & 27 deletions
@@ -1,42 +1,80 @@
 # Changelog
 
-## Unreleased
+## 0.21.0 — capture integrity + launch-grade reporting
+
+This release closes the layer-1 gap a downstream consumer surfaced: better
+post-run statistics don't help if the underlying data wasn't captured. 0.21
+adds first-class raw provider-event capture, a fail-loud route guard, a
+run-completion integrity check, and run-complete hooks (with a trace-analyst
+auto-execution helper) so a direct matrix run produces complete forensics
+without out-of-band glue.
 
 ### Added
 
-- `researchReport`, an executive research-report layer for coding-vertical
-  benchmark runs. Composes `summaryTable`, `paretoChart`, `gainHistogram`,
-  held-out gate decisions, and optional `failureClusterView` output into
+- **`RawProviderSink` (capture).** First-class persistence for HTTP-level
+  provider request / response / error payloads alongside the structured
+  `LlmSpan`. `InMemoryRawProviderSink`, `FileSystemRawProviderSink` (NDJSON,
+  rolls at 32 MiB), and `NoopRawProviderSink` ship in core. Default redactor
+  strips `Authorization` / `X-Api-Key` / `Cookie` headers and credential-shaped
+  body fields (`apiKey`, `bearer`, `password`, `secret`, `token`); redacted
+  paths are recorded on `event.redactedFields` so a reviewer can see what was
+  stripped without exposing values. Wired into `callLlm` via
+  `LlmClientOptions.rawSink` — every retry attempt produces a `request` and
+  either a `response` or `error` event with the attempt index attached.
+- **`assertLlmRoute` (route guard).** Pure function that throws
+  `LlmRouteAssertionError` when the configured client doesn't match the
+  caller's route requirements: `requireExplicitBaseUrl`, `allowedBaseUrls`,
+  `blockedBaseUrls`, `requireAuth`, `expectedProvider`. Designed for the
+  matrix-runner preflight — fail loud at the boundary instead of silently
+  falling back to the public/free-tier router.
+- **`assertRunCaptured` (integrity check).** Read-only check on
+  `(store, runId, expectations)` that returns a structured
+  `RunIntegrityReport` with issue codes (`missing_llm_spans`,
+  `missing_raw_events`, `orphan_llm_span`, `no_raw_sink`, `missing_outcome`,
+  …). Pair with the new `requireRawCoverageOfLlmSpans` to assert every
+  `LlmSpan` has a matching raw `request` event. Use directly or via
+  `throwIfRunIncomplete` for strict mode.
+- **`onRunComplete` hooks on `TraceEmitter`.** New
+  `TraceEmitterOptions.onRunComplete` array fires after `endRun` / `abortRun`
+  with full run context (run id, outcome, status, store, emitter). Errors are
+  swallowed and recorded as `log` events by default; opt into propagation via
+  `hookErrors: 'throw'`. `addRunCompleteHook` attaches hooks after construction.
+- **`traceAnalystOnRunComplete` factory.** Drop-in run-complete hook that
+  runs `analyzeTraces` after each run and persists the result. Resolves the
+  "trace analyst never ran on this matrix sweep" complaint by making
+  auto-execution declarative.
+- **`researchReport`** — executive research-report layer for coding-vertical
+  benchmark runs (originally landed in #34, elevated in #35). Composes
+  `summaryTable`, `paretoChart`, `gainHistogram`, held-out gate decisions,
+  and optional `failureClusterView` output into one structured artifact:
   promote / hold / equivalent / reject / needs-more-data guidance with
   rationale, risks, next actions, markdown, HTML, and JSON chart specs.
   - Decisions are made on paired evidence — never on marginal means alone.
-  - ROPE (Region of Practical Equivalence) supported via the `rope` option;
-    candidates whose paired-delta CI is fully inside the ROPE are returned
-    as `equivalent` rather than `hold`.
-  - Bayesian-bootstrap-style Pr(Δ>0) and Pr(Δ∈ROPE) summaries on the mean
-    paired delta (Rubin 1981 bootstrap-prior duality), reported per
-    candidate alongside the bootstrap CI on the median.
-  - Per-candidate minimum detectable paired effect at the configured power
-    and α via the new `pairedMde` primitive in `power-analysis`, so a
-    `needs_more_data` verdict is actionable.
-  - SHA-256 `runFingerprint` over the canonicalised input run set + an
-    optional `preregistrationHash` field so the report can cite a signed
-    `HypothesisManifest`.
-  - Soft floor `minPairs` (default 20) and a hard floor of 6 pairs
-    (`RESEARCH_REPORT_HARD_PAIR_FLOOR`) below which any paired call returns
-    `needs_more_data` regardless of the option.
-  - Embedded methodology section in the rendered markdown plus a standalone
-    [`docs/research-report-methodology.md`](./docs/research-report-methodology.md)
-    with assumptions, alternatives, when-not-to-apply, and citations
-    (Benjamini & Hochberg 1995; Wilcoxon 1945; Efron 1979; Rubin 1981;
-    Kruschke 2018).
-- `pairedMde` in `power-analysis`: closed-form minimum detectable paired
-  effect inverse to the paired-t / sign-rank power formula.
+  - ROPE (Region of Practical Equivalence) supported via the `rope` option.
+  - Bayesian-bootstrap-style `Pr(Δ>0)` and `Pr(Δ∈ROPE)` summaries (Rubin 1981).
+  - Per-candidate minimum detectable paired effect via `pairedMde`.
+  - SHA-256 `runFingerprint` and optional `preregistrationHash` linking a
+    signed `HypothesisManifest`.
+  - Embedded methodology + `docs/research-report-methodology.md` companion.
+- **`pairedMde`** in `power-analysis`: closed-form minimum detectable paired
+  effect (inverse to the paired-t / sign-rank power formula).
 
 ### Changed
 
-- `researchReport` is now async (uses Web Crypto via `hashJson` for the run
+- `researchReport` is async (uses Web Crypto via `hashJson` for the run
   fingerprint).
+- Default `researchReport.minPairs` is 20 (soft floor); hard floor of 6 is
+  enforced regardless via `RESEARCH_REPORT_HARD_PAIR_FLOOR`.
+
+### Wire-protocol consumers
+
+No wire-protocol changes. The new capture / integrity / hook primitives are
+TypeScript-only; cross-language consumers continue to use the existing RPC
+surface.
+
+### Python client
+
+Locked at `tangle-agent-eval==0.21.0` to match the npm package.
 
 ## 0.20.10 — hardening audit follow-up
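
The raw-coverage idea behind `assertRunCaptured` and `requireRawCoverageOfLlmSpans` (every LLM span should have a matching raw `request` event) can be illustrated with a small standalone check. This is a sketch under stated assumptions, not the library's code: the record shapes, the `spanId` link between a span and its raw events, and the reading of `orphan_llm_span` as "LLM span with no raw request" are all hypothetical. Only the issue-code strings come from the changelog.

```typescript
// Hypothetical record shapes; the real store and event types are not shown in this commit.
interface CapturedSpan {
  runId: string;
  spanId: string;
  kind: "llm" | "tool";
}

interface CapturedRawEvent {
  runId: string;
  spanId?: string; // assumed link from a raw event back to its span
  kind: "request" | "response" | "error";
}

interface IntegrityIssue {
  code: "missing_llm_spans" | "missing_raw_events" | "orphan_llm_span";
  detail: string;
}

interface IntegritySketchReport {
  runId: string;
  ok: boolean;
  issues: IntegrityIssue[];
}

// Read-only: inspects the captured records and returns a report; never throws.
function checkRunCaptured(
  runId: string,
  spans: CapturedSpan[],
  raw: CapturedRawEvent[],
): IntegritySketchReport {
  const issues: IntegrityIssue[] = [];
  const llmSpans = spans.filter((s) => s.runId === runId && s.kind === "llm");
  const requests = raw.filter((r) => r.runId === runId && r.kind === "request");

  if (llmSpans.length === 0) {
    issues.push({ code: "missing_llm_spans", detail: "run has no LLM spans" });
  }
  if (requests.length === 0) {
    issues.push({ code: "missing_raw_events", detail: "run has no raw request events" });
  }

  // Coverage: each LLM span needs at least one raw request that points at it.
  const covered = new Set(requests.map((r) => r.spanId));
  for (const span of llmSpans) {
    if (!covered.has(span.spanId)) {
      issues.push({ code: "orphan_llm_span", detail: `no raw request for span ${span.spanId}` });
    }
  }
  return { runId, ok: issues.length === 0, issues };
}

// Strict mode: the caller opts into throwing instead of inspecting the report.
function throwIfIncomplete(report: IntegritySketchReport): void {
  if (!report.ok) {
    const codes = report.issues.map((i) => i.code).join(", ");
    throw new Error(`run ${report.runId} incomplete: ${codes}`);
  }
}
```

Returning a report and leaving the throw to a separate helper mirrors the split the changelog describes, where the caller chooses between failing the run and marking it incomplete.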

clients/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "tangle-agent-eval"
-version = "0.20.10"
+version = "0.21.0"
 description = "Python client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC."
 readme = "README.md"
 requires-python = ">=3.10"

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.20.12",
+  "version": "0.21.0",
   "description": "Trace-first evaluation infrastructure for agent systems: traces, harnesses, verifier pipelines, judges, datasets, gates, optimization, and reporting.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {

src/index.ts

Lines changed: 3 additions & 0 deletions
@@ -554,13 +554,16 @@ export {
   stripFencedJson,
   LlmCallError,
   LlmClient,
+  assertLlmRoute,
+  LlmRouteAssertionError,
 } from './llm-client'
 export type {
   LlmMessage,
   LlmCallRequest,
   LlmCallResult,
   LlmUsage,
   LlmClientOptions,
+  LlmRouteRequirements,
 } from './llm-client'
 
 export {
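
The newly exported route guard is described as a pure function that throws an error carrying a `code` field. A standalone sketch of that pattern might look like the following. `RouteRequirements`, `ClientRoute`, `RouteCheckError`, and `checkRoute` are illustrative names, not the exported `LlmRouteRequirements` / `LlmRouteAssertionError` / `assertLlmRoute`. The shape of the client argument, exact-string URL matching, and the order of checks are assumptions. The option names and error codes are the ones listed in the commit message.

```typescript
// Sketch only: these types are hypothetical stand-ins for the package's exports.
type RouteErrorCode =
  | "no_explicit_base_url"
  | "base_url_blocked"
  | "base_url_not_allowed"
  | "no_auth"
  | "wrong_provider";

interface RouteRequirements {
  requireExplicitBaseUrl?: boolean;
  allowedBaseUrls?: string[];
  blockedBaseUrls?: string[];
  requireAuth?: boolean;
  expectedProvider?: string;
}

interface ClientRoute {
  baseUrl?: string;
  apiKey?: string;
  provider?: string;
}

class RouteCheckError extends Error {
  readonly code: RouteErrorCode;
  constructor(code: RouteErrorCode, message: string) {
    super(message);
    this.name = "RouteCheckError";
    this.code = code;
  }
}

// Pure: reads its arguments and either returns or throws; no I/O.
function checkRoute(reqs: RouteRequirements, route: ClientRoute): void {
  const fail = (code: RouteErrorCode, message: string): never => {
    throw new RouteCheckError(code, message);
  };
  const { baseUrl } = route;

  if (reqs.requireExplicitBaseUrl && !baseUrl) {
    fail("no_explicit_base_url", "client has no explicit base URL");
  }
  if (baseUrl !== undefined && reqs.blockedBaseUrls?.includes(baseUrl)) {
    fail("base_url_blocked", `base URL ${baseUrl} is blocked`);
  }
  if (reqs.allowedBaseUrls && (baseUrl === undefined || !reqs.allowedBaseUrls.includes(baseUrl))) {
    fail("base_url_not_allowed", `base URL ${baseUrl} is not in the allow list`);
  }
  if (reqs.requireAuth && !route.apiKey) {
    fail("no_auth", "client has no credentials configured");
  }
  if (reqs.expectedProvider !== undefined && route.provider !== reqs.expectedProvider) {
    fail("wrong_provider", `expected provider ${reqs.expectedProvider}, got ${route.provider}`);
  }
}
```

Because the check is pure, a preflight step or CI gate can call it before any request is sent and branch on `error.code` rather than parsing the message.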
