
Commit 6c124e7

feat: 0.23.0 — RL primitives bridge eval to policy training (#43)
* feat: 0.23.0 — RL primitives bridge eval to policy training

  Closes the integration gap between the pre-0.22 optimization stack (runMultiShotOptimization, runPromptEvolution) and the post-0.22 campaign artifact (RunRecord, raw events, replay, sequential, predictive validity), then ships the downstream RL eval primitives the package was missing: nine modules under @tangle-network/agent-eval/rl.

  1. run-record-adapters: TrialResult / VerificationReport / VariantAggregate -> canonical RunRecord. Existing optimization output becomes replayCache-able and rubricPredictiveValidity-scorable for free.
  2. verifiable-reward: extracts a clean reward signal from VerificationReport / RunRecord. Distinguishes 'deterministic' (compile/test/schema/sandbox) from 'probabilistic' (judge) sources; the seam every credible 2025-2026 frontier RL result on coding agents leans on.
  3. preferences: extractPreferences(runRecords) -> DPO/PPO/KTO (chosen, rejected) triples. Three documented strategies: paired-by-scenario-and-seed, paired-by-scenario, top-vs-bottom. toTRLFormat / toAnthropicFormat adapters.
  4. off-policy: IPS, SNIPS, and doubly-robust off-policy estimators (Dudik-Langford-Li 2011 for DR, Owen 2013 for the SNIPS SE). offPolicyEstimateAll runs all three side by side; agreement is a stronger signal than any one alone.
  5. process-reward: extractStepRewards over trace spans; prmTrainingPairs produces (prefix, chosen_step, rejected_step) in the canonical Lightman et al. / DeepSeek-R1 process-supervision shape. We ship the data extraction; gradient descent over a transformer is out of scope.
  6. contamination: held-out perturbation contamination probe via paired Wilcoxon. Stock perturbations: renameVariables, shuffleOrder, injectIrrelevantClause. Catches the SWE-Bench → SWE-Bench-Verified failure mode upstream.
  7. tournament: fitBradleyTerry (Hunter's MM), applyEloUpdate, buildPairwiseFromCampaign. Sample-efficient ranking for many-candidate sweeps.
  8. adversarial: adversarialScenarioSearch hill-climbs against failure indicators using caller-supplied mutation strategies. The simplest version of AdA / POET / auto-jailbreak loops on top of the campaign infrastructure.
  9. compute-curves: characterise candidates as curves across compute budgets, not points. runComputeCurve, bestOfN, selfConsistency, paretoFrontier. Required for honest cost-quality reporting in the o1 / scaling-law-aware era.

  Build / surface:
  - New build entry: dist/rl.{js,d.ts}
  - New package subpath: @tangle-network/agent-eval/rl
  - All RL primitives also re-exported from the root barrel
  - Default BradleyTerry smoothing raised from 0 to 0.1: Hunter's MM degenerates when a candidate has zero wins; 0.1 keeps the iteration well-conditioned without meaningfully biasing real win counts (see the sketch below this message).

  Tests:
  - 971 / 971 passing (+61 dedicated RL cases across 7 new files): rl-adapters, rl-verifiable-reward, rl-preferences, rl-off-policy, rl-process-reward, rl-contamination, rl-tournament, rl-adversarial-and-compute.
  - typecheck + build clean; the new dist/rl.{js,d.ts} entry emits.

  Docs:
  - SKILL.md: new "RL bridge" section with a quick reference + per-primitive when-to-use guidance.
  - README.md: Core Pieces gains 9 new rows; Import Paths advertises the new /rl subpath.
  - CHANGELOG.md: full 0.23.0 release entry with references.

  References:
  - Dudik, M., Langford, J., Li, L. (2011). Doubly Robust Policy Evaluation and Learning. ICML.
  - Owen, A. B. (2013). Monte Carlo Theory, Methods and Examples. Ch. 9, Importance Sampling.
  - Hunter, D. R. (2004). MM algorithms for generalized Bradley-Terry models. Annals of Statistics, 32(1), 384-406.
  - Lightman, H. et al. (2023). Let's Verify Step by Step. arXiv:2305.20050.
  - Snell, C. et al. (2024). Scaling LLM Test-Time Compute Optimally. arXiv:2408.03314.

  Migration: all primitives are additive. Existing consumers don't need to change.

  Caveats / out-of-scope (documented in CHANGELOG):
  - The DR Q-function is caller-supplied; we don't ship a learned trainer.
  - PRM gradient training is out of scope; we ship the data shape.
  - Contamination per-scenario q-values are heuristic; the load-bearing test is the global Wilcoxon.
  - prmTrainingPairs matches by step name + kind; production should use a token-level prefix hash.
  - Adversarial search is hill-climb only; LM-driven scenario synthesis is future work.

* feat(rl): close auto-research loop end-to-end + 7 new primitives + worked examples

  Audit-driven closure of the gaps the 0.23 panel critique identified. Stops adding primitives speculatively and pivots to integration, worked examples, and honest scoping.

  Worked examples (the load-bearing addition):
  - examples/auto-research-with-agent-builder/ — runnable demo: a synthetic agent-builder runner iterates 4 generations of prompt variants, each generation feeding analyzeOptimizationResult (preferences + reward-hacking + sequential interim verdict), with the next generation proposed via a deterministic mutator. Score progression: 0.739 -> 0.973 over 4 iterations on the synthetic environment. Real-driver mode (replace the synthetic runner with runForgeBuilderSim) is documented inline.
  - examples/fine-tune-with-prime-rl/ — concrete prime-rl SFT integration: ~150 LoC of TS that filters RunRecord[] to high-quality runs, projects via toSftRows to messages-list JSONL, and writes a 15-line prime-rl SFT TOML config. SFT was chosen as the first integration because it's the only agent-eval exporter that maps directly onto a prime-rl entrypoint (DPO/PRM go to TRL; offline GRPO requires a custom verifiers env). Honest scope in the README.

  Architecture docs:
  - docs/three-package-architecture.md — contracts between agent-eval, agent-knowledge, and agent-runtime: dependency direction (both consume agent-eval; agent-eval imports neither), shared interchange types (RunRecord, Scenario, KnowledgeBundle), known contract gaps tracked as follow-ups.
  - docs/auto-research-loop-end-to-end.md — the composition pattern with the explicit invariants every iteration must preserve (canonical RunRecord with scenarioId, capture wired by construction, stable comparator, deterministic mutator).

  New experimental RL primitives (all marked experimental in the barrel docstring):
  - active-curriculum.ts (Neyman 1934 + Thompson sampling)
  - reward-hacking.ts (4-signal Krakovna/Skalse/Kim hygiene check)
  - adaptation-eval.ts (k-shot adaptation curves + paired comparison)
  - exporters.ts (DPO/GRPO/SFT/PRM/step-rewards JSONL exporters)
  - rl-campaign.ts (top-level RL orchestrator wrapping runEvalCampaign)
  - auto-research.ts (analyzeOptimizationResult — bridges PromptEvolutionResult / MultiShotOptimizationResult to the RL bridge)
  - predictive-validity-researcher.ts (concrete Researcher impl; the interface had been a placeholder)

  Foundation work: RunRecord.scenarioId added as a canonical optional field. Closes the fragility flagged in the 0.23 audit (extractPreferences was fishing through outcome.raw.scenario_id). runEvalCampaign and trialsToRunRecords populate it canonically; legacy records fall back to the old convention.

  Tests: 1017 / 1017 passing (+46 over the 0.23 baseline). 7 dedicated test files for the new primitives cover happy path + edge cases: rl-active-curriculum.test.ts, rl-reward-hacking.test.ts, rl-adaptation-eval.test.ts, rl-exporters.test.ts, rl-predictive-validity-researcher.test.ts, rl-rl-campaign.test.ts, rl-auto-research.test.ts. typecheck + build clean; the dist/rl.{js,d.ts} entry emits.

  Honest framing: the 9 stable RL primitives (run-record-adapters, verifiable-reward, preferences, off-policy, process-reward, contamination, tournament, adversarial, compute-curves) are the load-bearing release. The 7 experimental primitives ship behind an explicit "experimental" marker in the barrel docstring; their interfaces are reasonable but may evolve as production consumers exercise them. No consumer is pulling on them today — that's the honest scope. The two worked examples are the primary deliverable: they prove the composition story end-to-end (the auto-research loop with the synthetic driver shows a real score climb; the prime-rl SFT export runs end-to-end on the synthetic data).

  Migration: all additive. agent-knowledge and agent-runtime should bump to 0.23 in follow-up PRs to those repos to consume the new surface.
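
The Bradley-Terry smoothing change called out above is easiest to see in the MM update itself. A self-contained toy sketch of Hunter's (2004) update with additive smoothing (illustrative only, not the package's `fitBradleyTerry` implementation):

```ts
type PairwiseWins = number[][] // wins[i][j] = times candidate i beat candidate j

function fitBradleyTerrySketch(wins: PairwiseWins, smoothing = 0.1, iters = 100): number[] {
  const n = wins.length
  // Smoothing adds a fractional win to every ordered pair, so no candidate has
  // zero total wins — the condition under which Hunter's MM degenerates.
  const w = wins.map((row, i) => row.map((c, j) => (i === j ? 0 : c + smoothing)))
  let p: number[] = new Array(n).fill(1 / n)
  for (let t = 0; t < iters; t++) {
    const next = p.map((pi, i) => {
      const winsI = w[i].reduce((a, b) => a + b, 0) // W_i: total (smoothed) wins of i
      let denom = 0
      for (let j = 0; j < n; j++) {
        if (j === i) continue
        denom += (w[i][j] + w[j][i]) / (pi + p[j]) // sum_j n_ij / (p_i + p_j)
      }
      return winsI / denom // MM update from Hunter (2004)
    })
    const z = next.reduce((a, b) => a + b, 0)
    p = next.map((x) => x / z) // renormalize so strengths sum to 1
  }
  return p // with smoothing = 0, a zero-win candidate collapses to exactly 0
}
```
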
1 parent d763d00 commit 6c124e7

49 files changed

Lines changed: 7348 additions & 6 deletions


.claude/skills/agent-eval/SKILL.md

Lines changed: 50 additions & 0 deletions
@@ -52,6 +52,15 @@ If a term below isn't in this table or in `docs/concepts.md`, that's a bug — f
| Re-run / re-judge / determinism-audit a past campaign for free | `ReplayCache` + `createReplayFetch` (§Replay & sequential evaluation) |
| Ship the moment evidence is decisive, with anytime-valid α control across rolling looks | `pairedEvalueSequence`, `evaluateInterimReleaseConfidence` (§Replay & sequential evaluation) |
| Tell load-bearing rubrics from decorative ones using deployment outcomes | `rubricPredictiveValidity` (§Outcome calibration) |
| Bridge legacy optimization output to canonical `RunRecord[]` | `trialToRunRecord`, `verificationReportToRunRecord` (§RL bridge — 0.23+) |
| Extract a clean reward signal for RL training (compile / test / schema vs judge) | `extractVerifiableReward`, `filterDeterministicallyRewarded` (§RL bridge — 0.23+) |
| Produce DPO / PPO / KTO `(chosen, rejected)` triples from `RunRecord[]` | `extractPreferences` (§RL bridge — 0.23+) |
| Estimate the value of a new policy on old trajectories without re-running | `inverseProbabilityWeighting`, `selfNormalizedImportanceWeighting`, `doublyRobust`, `offPolicyEstimateAll` (§RL bridge — 0.23+) |
| Step-level credit assignment / PRM training data | `extractStepRewards`, `prmTrainingPairs` (§RL bridge — 0.23+) |
| Detect benchmark contamination via held-out perturbations | `runContaminationProbe`, stock perturbations (§RL bridge — 0.23+) |
| Pairwise tournament ratings for many-candidate sweeps | `fitBradleyTerry`, `applyEloUpdate`, `buildPairwiseFromCampaign` (§RL bridge — 0.23+) |
| Active search for inputs the policy fails on | `adversarialScenarioSearch` (§RL bridge — 0.23+) |
| Characterize a candidate across compute budgets (`bestOfN`, self-consistency, curves) | `runComputeCurve`, `bestOfN`, `selfConsistency`, `paretoFrontier` (§RL bridge — 0.23+) |
| Capture every provider HTTP request/response for forensics | `RawProviderSink` + `LlmClientOptions.rawSink` (§Capture integrity Directive 1) |
| Fail loud if the eval would silently use the wrong route | `assertLlmRoute` (§Capture integrity Directive 2) |
| Assert at run-end that the artifact is complete | `assertRunCaptured` + `throwIfRunIncomplete` (§Capture integrity Directive 3) |
@@ -317,6 +326,47 @@ Fail closed; use `// muffle-ok: <reason>` for the rare exception.

---

## RL bridge — from eval to policy training (0.23+)

Imported from `@tangle-network/agent-eval/rl` (or the root barrel). Nine modules; each one converts a piece of agent-eval output into a shape an RL pipeline can consume, or implements a canonical RL eval methodology that the rest of the package didn't cover.

### Quick reference

```ts
import {
  trialsToRunRecords,              // bridge legacy optimization output
  extractVerifiableReward,         // clean reward signal (compile/test) vs judge
  extractPreferences,              // (chosen, rejected) triples for DPO/PPO/KTO
  offPolicyEstimateAll,            // IPS + SNIPS + DR side-by-side
  extractStepRewards,              // step-level credit assignment
  prmTrainingPairs,                // PRM training data
  runContaminationProbe,           // held-out perturbation contamination
  fitBradleyTerry, applyEloUpdate, // pairwise tournament ratings
  adversarialScenarioSearch,       // active failure-mode discovery
  runComputeCurve, bestOfN, selfConsistency, paretoFrontier, // compute-axis evaluation
} from '@tangle-network/agent-eval/rl'
```

### When you actually use each one

- **You ran an existing `runPromptEvolution` or `runMultiShotOptimization` sweep** — wrap with `trialsToRunRecords(trials, ctx)` so the output composes with `replayCache`, `pairedEvalueSequence`, `rubricPredictiveValidity`, and the rest of the 0.22 surface. Single line, zero behavior change.
- **You're training a policy with TRL / DPO / PPO / GRPO** — use `extractVerifiableReward` to separate deterministic rewards (compile/test/schema/sandbox) from probabilistic ones (judge), then `extractPreferences` to produce the `(chosen, rejected)` triples in the shape your trainer expects.
- **You changed a policy and want to evaluate it on yesterday's trajectories without re-running** — use `offPolicyEstimateAll` with token log-prob propensity scores. Run all three estimators (IPS, SNIPS, DR); agreement across estimators is much stronger than any single number. (The underlying math is sketched after this list.)
- **You want step-level credit assignment for long-horizon agents** — `extractStepRewards` over the trace spans of completed runs, `prmTrainingPairs` to produce the training data for a PRM, then plug into your favourite trainer (we don't ship gradient descent).
- **You're worried your benchmark scenarios leaked into training data** — `runContaminationProbe` with one of the stock perturbations (`renameVariables`, `shuffleOrder`, `injectIrrelevantClause`). Catches drift before the launch reviewer does.
- **You have ≥ 5 candidates running on shared scenarios** — `fitBradleyTerry` is more sample-efficient than running every candidate against a fixed comparator. Use `applyEloUpdate` for online ratings as new comparisons arrive.
- **You want to find the failure modes the curator didn't think of** — `adversarialScenarioSearch` hill-climbs against a failure indicator using caller-supplied mutation strategies. Pair with the contamination probe for two-sided robustness.
- **You want to characterise a candidate's capability vs cost rather than at one budget** — `runComputeCurve` at `{1×, 4×, 16×}` with `bestOfN` or `selfConsistency` as the per-budget evaluator, then `paretoFrontier` over (candidate, compute) tuples. (Both per-budget evaluators are sketched after this list.)
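
The off-policy bullet above leans on standard importance-sampling math. What the IPS and SNIPS estimators compute, as a self-contained sketch (textbook definitions per Owen 2013; the `LoggedTrajectory` shape is illustrative, not the package's types):

```ts
interface LoggedTrajectory {
  reward: number  // observed reward under the old (logging) policy
  logpOld: number // sum of token log-probs under the logging policy
  logpNew: number // sum of token log-probs under the candidate policy
}

function ips(trajs: LoggedTrajectory[]): number {
  // Unbiased but high-variance: mean of importance-weighted rewards,
  // with w = exp(logpNew - logpOld) per trajectory.
  const n = trajs.length
  return trajs.reduce((acc, t) => acc + Math.exp(t.logpNew - t.logpOld) * t.reward, 0) / n
}

function snips(trajs: LoggedTrajectory[]): number {
  // Self-normalized: divides by the weight sum instead of n. Slightly biased,
  // usually far lower variance when the weights are skewed.
  let num = 0, den = 0
  for (const t of trajs) {
    const w = Math.exp(t.logpNew - t.logpOld)
    num += w * t.reward
    den += w
  }
  return num / den
}
```
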
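Similarly for the compute-budget bullet: `bestOfN` and `selfConsistency` have standard definitions, sketched here on bare values (this is the idea, not the package's API — its evaluators operate on campaign output):

```ts
// bestOfN keeps the highest-scoring of N samples under a caller-supplied scorer.
function bestOfNSketch<T>(samples: T[], score: (s: T) => number): T {
  let best = samples[0]
  for (const s of samples.slice(1)) if (score(s) > score(best)) best = s
  return best
}

// selfConsistency majority-votes over N sampled answers; ties break
// toward the earliest answer seen.
function selfConsistencySketch(answers: string[]): string {
  const counts = new Map<string, number>()
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1)
  let winner = answers[0]
  for (const [a, c] of counts) if (c > (counts.get(winner) ?? 0)) winner = a
  return winner
}
```
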
### When NOT to use these

- The RL primitives don't replace `runEvalCampaign`. The campaign is the matrix runner with capture-integrity baked in; the RL primitives consume the campaign's `RunRecord[]` output. Keep the campaign as the entry point.
- `doublyRobust` requires a Q-function. We don't ship a learned Q-function trainer — pass a heuristic (running mean per scenario, sketched below), a regression fit you trained out-of-band, or `null` per-trajectory to fall back to IPS for that entry.
- `prmTrainingPairs` matches trajectories by `(span name, span kind)` prefix. Production use should replace this with a token-level prefix hash; the heuristic is good for early-stage PRM scaffolding.
- The contamination probe's per-scenario q-values use a heuristic pseudo-p — they're a display aid; the load-bearing test is the global Wilcoxon.
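
A sketch of the running-mean-per-scenario Q heuristic suggested above, plugged into the textbook doubly-robust estimator (Dudik-Langford-Li 2011). Field names are illustrative rather than the package's `RunRecord` schema, and a production pipeline would cross-fit the Q estimate on held-out runs instead of reusing the same data:

```ts
interface LoggedRun {
  scenarioId: string
  reward: number
  logpOld: number // sum of token log-probs under the logging policy
  logpNew: number // sum of token log-probs under the candidate policy
}

function doublyRobustSketch(runs: LoggedRun[]): number {
  // Heuristic Q: the mean observed reward per scenario stands in for both the
  // baseline value Q(x, pi_new) and the per-action correction term Q(x, a).
  const sums = new Map<string, { total: number; n: number }>()
  for (const r of runs) {
    const s = sums.get(r.scenarioId) ?? { total: 0, n: 0 }
    s.total += r.reward
    s.n += 1
    sums.set(r.scenarioId, s)
  }
  const qHat = (id: string) => { const s = sums.get(id)!; return s.total / s.n }

  let acc = 0
  for (const r of runs) {
    const w = Math.exp(r.logpNew - r.logpOld) // importance weight from token log-probs
    const q = qHat(r.scenarioId)
    acc += q + w * (r.reward - q) // DR term: baseline + importance-weighted residual
  }
  return acc / runs.length
}
```
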
---

## Replay & sequential evaluation (0.22+)

Once `runEvalCampaign` standardises the output (every run is a `RunRecord` plus a SHA-256-keyed raw-event log) two compounding capabilities open up:
