Releases: tangle-network/agent-eval
v0.23.0 — RL primitives + auto-research worked example
RL bridge primitives + worked examples + downstream integrations.
What's in 0.23
RL bridge — `@tangle-network/agent-eval/rl` (new subpath)
9 stable primitives:
- `run-record-adapters` — convert legacy optimization output (`TrialResult`, `VerificationReport`, `VariantAggregate`) → canonical `RunRecord[]`
- `verifiable-reward` — extract clean reward signals; distinguishes `'deterministic'` (compile/test/schema/sandbox) from `'probabilistic'` (judge)
- `preferences` — DPO/PPO/KTO `(chosen, rejected)` triples with three documented strategies
- `off-policy` — IPS, SNIPS, doubly-robust estimators (Dudík–Langford–Li 2011; Owen 2013 SE) (sketched after this list)
- `process-reward` — step-level credit assignment; PRM training data shape (Lightman et al. 2023)
- `contamination` — held-out perturbation probe via paired Wilcoxon
- `tournament` — Hunter's MM Bradley–Terry + online Elo
- `adversarial` — hill-climb scenario search
- `compute-curves` — `runComputeCurve`, `bestOfN`, `selfConsistency`, Pareto frontier (Snell et al. 2024)
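The off-policy estimators are standard enough to sketch from the citations alone. A minimal self-contained version, assuming a flattened record shape; the subpath's actual interface is not reproduced here:

```ts
// Hedged sketch of IPS, SNIPS, and doubly-robust value estimation
// (Dudík–Langford–Li 2011). The LoggedRecord shape is illustrative,
// not what the `off-policy` module actually exports.
interface LoggedRecord {
  reward: number;            // observed reward for the logged action
  behaviorProb: number;      // propensity of the logged action under the logging policy
  targetProb: number;        // probability of that action under the policy being evaluated
  modelRewardLogged: number; // q̂(x, a_logged): reward model at the logged action
  modelRewardTarget: number; // E_{a~π}[q̂(x, a)]: reward model averaged under the target
}

// Inverse propensity scoring: unbiased but high-variance.
const ips = (rs: LoggedRecord[]): number =>
  rs.reduce((s, r) => s + (r.targetProb / r.behaviorProb) * r.reward, 0) / rs.length;

// Self-normalised IPS (SNIPS): divide by the weight sum instead of n,
// trading a vanishing bias for much lower variance.
const snips = (rs: LoggedRecord[]): number => {
  let num = 0, den = 0;
  for (const r of rs) {
    const w = r.targetProb / r.behaviorProb;
    num += w * r.reward;
    den += w;
  }
  return num / den;
};

// Doubly robust: model baseline plus an importance-weighted correction
// on the model's residual; consistent if either component is correct.
const doublyRobust = (rs: LoggedRecord[]): number =>
  rs.reduce((s, r) => {
    const w = r.targetProb / r.behaviorProb;
    return s + r.modelRewardTarget + w * (r.reward - r.modelRewardLogged);
  }, 0) / rs.length;
```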
7 experimental primitives (interfaces marked experimental in barrel):
- `active-curriculum` — Neyman optimal allocation + Thompson sampling
- `reward-hacking` — 4-signal Goodhart watchdog (Krakovna/Skalse/Kim)
- `adaptation-eval` — k-shot adaptation curves
- `exporters` — DPO/GRPO/SFT/PRM/step-rewards JSONL (sketched after this list)
- `rl-campaign` — top-level orchestrator wrapping `runEvalCampaign` + RL bridge
- `auto-research` — `analyzeOptimizationResult`, the unification primitive
- `predictive-validity-researcher` — concrete `Researcher` interface implementation
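For orientation, the DPO export is the simplest of the five formats: one JSON object per line. A minimal sketch, assuming a `(prompt, chosen, rejected)` triple shape; the `exporters` module's real row schema may carry more fields:

```ts
import { createWriteStream } from "node:fs";

// Illustrative triple shape, not the module's actual schema.
interface PreferenceTriple {
  prompt: string;
  chosen: string;   // preferred completion
  rejected: string; // dispreferred completion
}

// Write DPO training rows as JSONL: one JSON object per line.
function writeDpoJsonl(triples: PreferenceTriple[], path: string): void {
  const out = createWriteStream(path);
  for (const t of triples) out.write(JSON.stringify(t) + "\n");
  out.end();
}
```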
`RunRecord.scenarioId` — canonical optional field
Populated automatically by `runEvalCampaign` and the optimization adapters. Closes the fragility flagged in the 0.23 audit.
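In shape terms (sibling fields elided; only `scenarioId` is what this release pins down):

```ts
interface RunRecord {
  // Canonical optional field, auto-populated by runEvalCampaign
  // and the optimization adapters as of 0.23.
  scenarioId?: string;
  // ...the rest of the canonical record, unchanged
}
```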
Worked examples
- `examples/auto-research-with-agent-builder/` — runnable demo of the closed loop. A synthetic agent-builder driver iterates 4 generations; score climbs 0.739 → 0.973.
- `examples/fine-tune-with-prime-rl/` — concrete prime-rl SFT integration. Filter `RunRecord[]` to high-quality runs, project via `toSftRows`, write a 15-line TOML, run `uv run sft @ ...`. ~150 LoC of glue; a sketch of the glue follows below.
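A hedged sketch of what that glue looks like. The filter field, threshold, and the `toSftRows` import path are assumptions; only the function name and the JSONL-then-TOML flow come from the example itself:

```ts
import { writeFileSync } from "node:fs";
// Import path assumed; the example ships its own wiring.
import { toSftRows } from "@tangle-network/agent-eval/rl";
import type { RunRecord } from "@tangle-network/agent-eval";

function exportSftJsonl(runs: RunRecord[], outPath: string, minReward = 0.9): void {
  // Keep only high-quality runs (field name and threshold are illustrative).
  const keep = runs.filter((r) => ((r as any).reward ?? 0) >= minReward);
  // Project canonical records into SFT rows, one JSON object per line.
  const rows = toSftRows(keep);
  writeFileSync(outPath, rows.map((row) => JSON.stringify(row)).join("\n") + "\n");
}
// Then point the 15-line TOML at the JSONL and launch prime-rl's SFT trainer.
```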
Architecture docs
- `docs/three-package-architecture.md` — agent-eval × agent-knowledge × agent-runtime contracts
- `docs/auto-research-loop-end-to-end.md` — composition pattern with explicit invariants
Downstream integrations (separate repos, all PRs open)
- agent-knowledge tangle-network/agent-knowledge#5 — clean bump, 12/12 tests pass
- agent-runtime tangle-network/agent-runtime#3 — clean bump + scenarioId backfill, 16/16 tests pass
- agent-builder tangle-network/agent-builder#130 — bump + RL bridge wired into `runAutoResearchCycle`. Every auto-research cycle now produces canonical `RunRecord[]`, preference triples, a reward-hacking verdict, and a sequential interim verdict on the events stream. 826/826 tests pass.
Numbers
- 1017 / 1017 tests passing on agent-eval main (+150 cumulative since 0.21)
- typecheck + build clean
- `dist/rl.{js,d.ts}` entry emits cleanly
Version lockstep
- npm: `@tangle-network/agent-eval@0.23.0`
- PyPI: `agent-eval-rpc==0.23.0`
References
Dudík/Langford/Li 2011 (DR), Owen 2013 (SNIPS), Hunter 2004 (BT MM), Lightman 2023 (PRM), Snell 2024 (test-time compute), plus the 0.21/0.22 foundational citations.
v0.22.0 — EvalCampaign + replay + anytime-valid + outcome calibration
`runEvalCampaign` + replay + anytime-valid sequential + outcome calibration. Four primitives that compound on top of the standardised campaign artifact. Each maps to a specific failure mode observed in production; together they convert agent-eval from a TS framework into research-grade evaluation infrastructure.
- `runEvalCampaign` — capture integrity by construction. Variants × scenarios × seeds → `RunRecord[]` + `RunIntegrityReport[]` + (optional) `researchReport`. Wires `assertLlmRoute` at preflight, builds `TraceStore` + `RawProviderSink` + `TraceEmitter` per run, asserts `requireRawCoverageOfLlmSpans` at run-end, runs the analyst on completion. Consumers stop wiring the integrity surface; the campaign owns it.
- Replay-from-raw-events — `ReplayCache.fromSink(sink)` + `createReplayFetch(cache)`. Every captured campaign is now a re-runnable artifact. Pass the replay fetch via `LlmClientOptions.fetch` and `callLlm` reads cached responses transparently. Drops eval R&D cost from "another full sweep" to CPU-bound. Use cases: post-hoc judging, determinism audits, free judge calibration. (The campaign → replay composition is sketched after this list.)
- Anytime-valid sequential evaluation — `pairedEvalueSequence` + `evaluateInterimReleaseConfidence`. Predictable plug-in betting martingale (Waudby-Smith & Ramdas 2024) + empirical Bernstein confidence sequence (Howard et al. 2021). Type-I error is bounded by α at every stopping time. Ship the moment evidence is decisive without inflating false-discovery rate across rolling looks. (A from-scratch sketch of the betting construction follows the diagram below.)
- Outcome calibration — `rubricPredictiveValidity({ runs, outcomes, outcomeMetrics })`. Joins canonical `RunRecord`s to a `DeploymentOutcomeStore` and ranks rubrics by |spearman| against deployment outcomes. Verdict bucketing: `load_bearing | informative | decorative`. Without this loop every rubric is faith-based; with it, rubrics earn their authority.
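The first two primitives compose directly. A minimal sketch of the campaign → replay loop, assuming `runEvalCampaign` exposes its raw-provider sink on the result and accepts a fetch override (those two plumbing details are assumptions; the function names are from this release):

```ts
import {
  runEvalCampaign,
  ReplayCache,
  createReplayFetch,
} from "@tangle-network/agent-eval";

async function rejudgeForFree() {
  const variants = [{ id: "baseline" }, { id: "candidate" }]; // illustrative
  const scenarios = [{ id: "checkout-flow" }];                // illustrative
  const seeds = [1, 2, 3];

  // Full sweep once: provider calls happen here, raw events are captured.
  const campaign = await runEvalCampaign({ variants, scenarios, seeds } as any);

  // Every captured campaign is a re-runnable artifact: build a replay
  // cache from the sink and route callLlm through it.
  const cache = ReplayCache.fromSink((campaign as any).sink); // result field assumed
  const replayFetch = createReplayFetch(cache);

  // Second pass (e.g. a new judge rubric) at CPU cost, zero provider calls.
  return runEvalCampaign({
    variants,
    scenarios,
    seeds,
    llm: { fetch: replayFetch }, // assumed plumbing to LlmClientOptions.fetch
  } as any);
}
```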
Compounding shape:
```
runEvalCampaign → standardised RunRecord + raw events
        ↓
Replay → free judge iteration, free determinism audits
        ↓                     ↘
   Sequential            Predictive validity → rubrics earn authority
        ↘                     ↙
Combined: ship-when-decisive, prove-by-replay, validate-by-revenue
```
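For intuition about the sequential leg, here is the betting construction from the citation, built from scratch. This is the published recipe (Waudby-Smith & Ramdas 2024), not the internals of `pairedEvalueSequence`:

```ts
// Predictable plug-in betting martingale for X_t ∈ [0, 1], testing
// H0: mean ≤ m. Under H0 the wealth is a nonnegative supermartingale,
// so by Ville's inequality P(∃t: wealth ≥ 1/α) ≤ α, valid at every
// stopping time: the basis of "ship the moment evidence is decisive".
function makeEProcess(m: number, alpha: number) {
  let wealth = 1;
  let n = 0, mean = 0, varSum = 0; // Welford running moments (past data only)
  return (x: number): { eValue: number; decisive: boolean } => {
    const t = n + 1;
    const variance = n > 1 ? varSum / (n - 1) : 0.25; // conservative default
    // Predictable plug-in bet size, truncated so the wealth factor
    // 1 + λ(x − m) stays positive for any x ∈ [0, 1].
    let lambda = Math.sqrt((2 * Math.log(2 / alpha)) / (variance * t * Math.log(t + 1)));
    lambda = Math.min(lambda, 0.75 / m);
    wealth *= 1 + lambda * (x - m); // bet on the mean exceeding m
    // Update running moments only after betting, keeping the bet predictable.
    n = t;
    const delta = x - mean;
    mean += delta / n;
    varSum += delta * (x - mean);
    return { eValue: wealth, decisive: wealth >= 1 / alpha };
  };
}
```

For a paired A/B read, feed x = 1 whenever the candidate beats the baseline on a scenario–seed pair and test against m = 0.5.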
Wired into: root barrel + optimization (campaign), traces (replay), reporting (sequential, predictive validity).
Tests: 910 / 910 (+43 dedicated cases); the type-I bound on the sequential primitive was verified under the null at α=0.05 across 100 synthetic series.
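That under-the-null check is easy to reproduce against `makeEProcess` from the sketch above (illustrative numbers, not the release's own test fixture):

```ts
// Monte Carlo estimate of the type-I rate: with the null exactly true
// (Bernoulli(m) draws), the fraction of series that ever go decisive
// should sit at or below α, up to Monte Carlo noise.
function typeOneRate(series = 100, steps = 1000, m = 0.5, alpha = 0.05): number {
  let falseRejects = 0;
  for (let s = 0; s < series; s++) {
    const step = makeEProcess(m, alpha);
    for (let t = 0; t < steps; t++) {
      const x = Math.random() < m ? 1 : 0;             // the null holds exactly
      if (step(x).decisive) { falseRejects++; break; } // first rejection ends the series
    }
  }
  return falseRejects / series;
}
```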
Docs: `SKILL.md` adds "Replay & sequential evaluation" and "Outcome calibration" sections with runnable examples and citations; the `docs/research-report-methodology.md` "out of scope" entries for sequential inference and outcome calibration are now "shipped in 0.22"; full release entry in `CHANGELOG.md`.
Honesty: a panel-style critique was put on record in this release thread — the math is published methodology packaged with care, not novel research; `rubricPredictiveValidity` is descriptive correlation against outcomes, not causal calibration; joint coverage of the e-value test plus the empirical Bernstein CS is informal. Each is documented as a known limitation with a follow-up path.
References:
- Howard, S. R., Ramdas, A., McAuliffe, J., Sekhon, J. (2021). Time-uniform, nonparametric, nonasymptotic confidence sequences. Annals of Statistics, 49(2), 1055–1080.
- Waudby-Smith, I., Ramdas, A. (2024). Estimating means of bounded random variables by betting. JRSS B, 86(1), 1–27.
- Plus the foundational citations from 0.21 (Benjamini & Hochberg 1995, Wilcoxon 1945, Efron 1979, Rubin 1981, Kruschke 2018).
Version lockstep: npm `@tangle-network/agent-eval@0.22.0` ↔ PyPI `agent-eval-rpc==0.22.0`.
PRs in this release: #42.