Commit 6c124e7
feat: 0.23.0 — RL primitives bridge eval to policy training (#43)
* feat: 0.23.0 — RL primitives bridge eval to policy training
Closes the integration gap between the pre-0.22 optimization stack
(runMultiShotOptimization, runPromptEvolution) and the post-0.22
campaign artifact (RunRecord, raw events, replay, sequential, predictive
validity), then ships the eight downstream RL eval primitives the
package was missing.
Nine modules under @tangle-network/agent-eval/rl:
1. run-record-adapters: TrialResult / VerificationReport /
VariantAggregate -> canonical RunRecord. Existing optimization
output becomes replayCache-able and rubricPredictiveValidity-
scorable for free.
2. verifiable-reward: extract clean reward signal from
VerificationReport / RunRecord. Distinguishes 'deterministic'
(compile/test/schema/sandbox) from 'probabilistic' (judge) sources.
   This is the seam that credible 2025-2026 frontier RL results on
   coding agents lean on.
3. preferences: extractPreferences(runRecords) -> DPO/PPO/KTO
(chosen, rejected) triples. Three documented strategies:
paired-by-scenario-and-seed, paired-by-scenario, top-vs-bottom.
   toTRLFormat / toAnthropicFormat adapters. (Usage sketched after
   this list.)
4. off-policy: IPS, SNIPS, doubly-robust off-policy estimators
(Dudik-Langford-Li 2011 for DR, Owen 2013 for SNIPS SE).
   offPolicyEstimateAll runs all three side by side; agreement among
   them is a stronger signal than any one estimator alone.
   (Estimators sketched after this list.)
5. process-reward: extractStepRewards over trace spans;
prmTrainingPairs produces (prefix, chosen_step, rejected_step) in
the canonical Lightman et al. / DeepSeek-R1 process supervision
shape. We ship the data extraction; gradient descent over a
   transformer is out of scope. (Pair shape sketched after this list.)
6. contamination: held-out perturbation contamination probe via
paired Wilcoxon. Stock perturbations: renameVariables,
shuffleOrder, injectIrrelevantClause. Catches the
SWE-Bench → SWE-Bench-Verified failure mode upstream.
7. tournament: fitBradleyTerry (Hunter's MM), applyEloUpdate,
buildPairwiseFromCampaign. Sample-efficient ranking for
many-candidate sweeps.
8. adversarial: adversarialScenarioSearch hill-climbs against
failure indicators using caller-supplied mutation strategies.
Simplest version of AdA / POET / auto-jailbreak loops on top of
the campaign infrastructure.
9. compute-curves: characterise candidates as curves across compute
budgets, not points. runComputeCurve, bestOfN, selfConsistency,
paretoFrontier. Required for honest cost-quality reporting in the
   o1 / scaling-law-aware era. (Best-of-n sketched after this list.)
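
To make items 1 and 3 concrete, a minimal usage sketch. trialsToRunRecords and extractPreferences are named in this PR; the options object, the TrialResult import path, and the row shape toTRLFormat emits are assumptions, not confirmed signatures:

```ts
import {
  trialsToRunRecords,
  extractPreferences,
  toTRLFormat,
} from "@tangle-network/agent-eval/rl";
import type { TrialResult } from "@tangle-network/agent-eval";

declare const trials: TrialResult[]; // output of the pre-0.22 optimization stack

// Item 1: lift legacy optimization output into canonical RunRecords.
const records = trialsToRunRecords(trials);

// Item 3: pair runs sharing a scenario and seed; the higher-scoring run
// becomes "chosen", the lower one "rejected".
const prefs = extractPreferences(records, {
  strategy: "paired-by-scenario-and-seed", // one of the three documented strategies
});

// Project into (prompt, chosen, rejected) rows for a TRL-style DPO pipeline.
const rows = toTRLFormat(prefs);
```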
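For item 4, the estimators themselves are standard, so this self-contained sketch shows what each one computes. It is not the package's API (the real surface is offPolicyEstimateAll); the pB/pT/qHat field names are assumptions:

```ts
// One logged sample: reward r, behavior propensity pB, target propensity pT,
// and a caller-supplied reward estimate qHat (the doubly-robust baseline).
interface LoggedSample { r: number; pB: number; pT: number; qHat: number }

// Plain inverse-propensity scoring: unbiased, high variance.
function ips(samples: LoggedSample[]): number {
  return samples.reduce((s, x) => s + (x.pT / x.pB) * x.r, 0) / samples.length;
}

// Self-normalized IPS: divide by the weight sum instead of n (Owen 2013, ch. 9).
function snips(samples: LoggedSample[]): number {
  const wSum = samples.reduce((s, x) => s + x.pT / x.pB, 0);
  return samples.reduce((s, x) => s + (x.pT / x.pB) * x.r, 0) / wSum;
}

// Doubly robust: model baseline plus importance-weighted residual
// (Dudik, Langford, Li 2011).
function doublyRobust(samples: LoggedSample[]): number {
  return (
    samples.reduce((s, x) => s + x.qHat + (x.pT / x.pB) * (x.r - x.qHat), 0) /
    samples.length
  );
}
```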
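For item 5, the pair shape described above looks roughly like this (field names inferred from the description, not confirmed):

```ts
// One process-supervision pair per divergence point between two runs of the
// same scenario, in the Lightman et al. (2023) shape.
interface PrmTrainingPair {
  prefix: string[];      // shared trace steps up to the divergence
  chosen_step: string;   // next step taken by the higher-reward run
  rejected_step: string; // next step taken by the lower-reward run
}
```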
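For item 9, one honest way to read a compute curve: fix a candidate, sweep a sampling budget n, and report a best-of-n statistic per budget. An illustrative resampling estimate, not the shipped bestOfN:

```ts
// Estimate the expected best-of-n score by resampling with replacement.
function bestOfNEstimate(scores: number[], n: number, trials = 1000): number {
  let acc = 0;
  for (let t = 0; t < trials; t++) {
    let best = -Infinity;
    for (let k = 0; k < n; k++) {
      best = Math.max(best, scores[Math.floor(Math.random() * scores.length)]);
    }
    acc += best;
  }
  return acc / trials;
}

// A compute curve is this statistic swept across budgets, not a single point.
const perSampleScores = [0.42, 0.61, 0.55, 0.88, 0.47, 0.73]; // illustrative
const curve = [1, 2, 4, 8].map((n) => ({
  n,
  score: bestOfNEstimate(perSampleScores, n),
}));
```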
Build / surface:
- New build entry: dist/rl.{js,d.ts}
- New package subpath: @tangle-network/agent-eval/rl
- All RL primitives also re-exported from the root barrel
- Default BradleyTerry smoothing raised from 0 to 0.1: Hunter's MM
  degenerates when a candidate has zero wins; 0.1 keeps the iteration
  well-conditioned without meaningfully biasing real win counts (see
  the sketch below).
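
Why the smoothing matters, as a sketch: in Hunter's MM update, a candidate with zero wins drives its strength to the boundary and the iteration degenerates; adding 0.1 to each directed win count keeps every update strictly positive. An illustrative reimplementation, not the shipped fitBradleyTerry:

```ts
// wins[i][j] = number of times candidate i beat candidate j.
function fitBT(wins: number[][], smoothing = 0.1, iters = 200): number[] {
  const n = wins.length;
  let theta: number[] = Array(n).fill(1);
  for (let t = 0; t < iters; t++) {
    const next = theta.slice();
    for (let i = 0; i < n; i++) {
      let w = 0;     // smoothed total wins for i
      let denom = 0; // MM denominator: sum over j of n_ij / (theta_i + theta_j)
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        const wij = wins[i][j] + smoothing;
        const wji = wins[j][i] + smoothing;
        w += wij;
        denom += (wij + wji) / (theta[i] + theta[j]);
      }
      next[i] = w / denom; // Hunter (2004) MM update
    }
    const sum = next.reduce((a, b) => a + b, 0);
    theta = next.map((x) => (n * x) / sum); // renormalize: strengths are scale-free
  }
  return theta;
}
```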
Tests:
- 971 / 971 passing (+61 dedicated RL cases across 8 new files):
rl-adapters, rl-verifiable-reward, rl-preferences, rl-off-policy,
rl-process-reward, rl-contamination, rl-tournament,
rl-adversarial-and-compute.
- typecheck + build clean; new dist/rl.{js,d.ts} entry emits.
Docs:
- SKILL.md: new "RL bridge" section with quick-reference + per-
primitive when-to-use guidance.
- README.md: Core Pieces gains 9 new rows; Import Paths advertises
the new /rl subpath.
- CHANGELOG.md: full 0.23.0 release entry with references.
References:
- Dudik, M., Langford, J., Li, L. (2011). Doubly Robust Policy
Evaluation and Learning. ICML.
- Owen, A. B. (2013). Monte Carlo Theory, Methods and Examples.
Ch. 9 - Importance Sampling.
- Hunter, D. R. (2004). MM algorithms for generalized Bradley-Terry
models. Annals of Statistics, 32(1), 384-406.
- Lightman, H. et al. (2023). Let's Verify Step by Step.
arXiv:2305.20050.
- Snell, C. et al. (2024). Scaling LLM Test-Time Compute Optimally.
arXiv:2408.03314.
Migration: all primitives are additive. Existing consumers don't need
to change.
Caveats / out-of-scope (documented in CHANGELOG):
- DR Q-function is caller-supplied; we don't ship a learned trainer.
- PRM gradient training is out of scope; we ship the data shape.
- Contamination per-scenario q-values are heuristic; load-bearing
test is the global Wilcoxon.
- prmTrainingPairs matches by step name + kind; production should
use a token-level prefix hash.
- Adversarial search is hill-climb only; LM-driven scenario
synthesis is future work.
* feat(rl): close auto-research loop end-to-end + 7 new primitives + worked examples
Audit-driven closure of the gaps the 0.23 panel critique identified.
Stops adding primitives speculatively and pivots to integration / worked
examples / honest scoping.
Worked examples (the load-bearing addition):
examples/auto-research-with-agent-builder/
Runnable demo: synthetic agent-builder runner iterates 4 generations
of prompt variants, with each gen feeding analyzeOptimizationResult
(preferences + reward-hacking + sequential interim verdict) and the
next gen proposed via a deterministic mutator. Score progression:
0.739 -> 0.973 over 4 iterations on the synthetic environment. Real-
driver mode (replace synthetic runner with runForgeBuilderSim) is
documented inline.
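
Condensed shape of that loop. analyzeOptimizationResult and PromptEvolutionResult are named in this PR; the runner and mutator here are declared stand-ins for the example's synthetic pieces, and the import paths are assumed:

```ts
import { analyzeOptimizationResult } from "@tangle-network/agent-eval/rl";
import type { PromptEvolutionResult } from "@tangle-network/agent-eval";

// Stand-ins for the example's synthetic pieces.
declare const seedVariants: string[];
declare function runGeneration(variants: string[]): Promise<PromptEvolutionResult>;
declare function proposeNext(
  analysis: ReturnType<typeof analyzeOptimizationResult>,
): string[]; // deterministic mutator

let variants = seedVariants;
for (let gen = 0; gen < 4; gen++) {
  const result = await runGeneration(variants);
  // Each generation feeds the bridge: preferences + reward-hacking hygiene
  // + sequential interim verdict.
  const analysis = analyzeOptimizationResult(result);
  variants = proposeNext(analysis);
}
```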
examples/fine-tune-with-prime-rl/
Concrete prime-rl SFT integration. ~150 LoC TS that filters
RunRecord[] to high-quality runs, projects via toSftRows to messages-
list JSONL, writes a 15-line prime-rl SFT TOML config. SFT chosen as
the first integration because it's the only agent-eval exporter that
maps directly onto a prime-rl entrypoint (DPO/PRM go to TRL; offline
GRPO requires a custom verifiers env). Honest scope in README.
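
Condensed shape of that integration. toSftRows is the exporter named in this PR; the score field and the 0.9 cutoff are illustrative assumptions, not the example's exact filter:

```ts
import { writeFileSync } from "node:fs";
import { toSftRows } from "@tangle-network/agent-eval/rl";
import type { RunRecord } from "@tangle-network/agent-eval";

declare const runRecords: RunRecord[];

// 1. Keep only high-quality runs (score field and cutoff are illustrative).
const good = runRecords.filter((r) => ((r as any).score ?? 0) >= 0.9);

// 2. Project to messages-list rows and write one JSON object per line.
const rows = toSftRows(good);
writeFileSync(
  "sft.jsonl",
  rows.map((row) => JSON.stringify(row)).join("\n") + "\n",
);
```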
Architecture docs:
docs/three-package-architecture.md
Contracts between agent-eval, agent-knowledge, agent-runtime.
Dependency direction (both consume agent-eval; agent-eval imports
neither), shared interchange types (RunRecord, Scenario,
KnowledgeBundle), known contract gaps tracked as follow-ups.
docs/auto-research-loop-end-to-end.md
Composition pattern with the explicit invariants every iteration
must preserve (canonical RunRecord with scenarioId, capture wired
by construction, stable comparator, deterministic mutator).
New experimental RL primitives (all marked experimental in barrel docstring):
active-curriculum.ts (Neyman 1934 + Thompson sampling)
reward-hacking.ts (4-signal Krakovna/Skalse/Kim hygiene check)
adaptation-eval.ts (k-shot adaptation curves + paired comparison)
exporters.ts (DPO/GRPO/SFT/PRM/step-rewards JSONL exporters)
rl-campaign.ts (top-level RL orchestrator wrapping runEvalCampaign)
auto-research.ts (analyzeOptimizationResult — bridges
PromptEvolutionResult/MultiShotOptimizationResult to the RL bridge)
predictive-validity-researcher.ts (concrete Researcher impl;
interface had been a placeholder)
Foundation work:
RunRecord.scenarioId added as canonical optional field. Closes the
fragility flagged in the 0.23 audit (extractPreferences was fishing
through outcome.raw.scenario_id). runEvalCampaign and trialsToRunRecords
populate it canonically; legacy records fall back to the old convention.
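
A minimal sketch of that canonical-then-legacy resolution (the raw-payload path is the pre-0.23 convention described above; the cast reflects that legacy records are untyped there):

```ts
import type { RunRecord } from "@tangle-network/agent-eval";

function resolveScenarioId(record: RunRecord): string | undefined {
  // Prefer the canonical field added in this release; fall back to the
  // legacy convention of stashing the id in the raw outcome payload.
  const legacy = (record as any)?.outcome?.raw?.scenario_id as string | undefined;
  return record.scenarioId ?? legacy;
}
```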
Tests:
1017 / 1017 passing (+46 over 0.23 baseline). 7 dedicated test files for
the new primitives covering happy path + edge cases:
rl-active-curriculum.test.ts
rl-reward-hacking.test.ts
rl-adaptation-eval.test.ts
rl-exporters.test.ts
rl-predictive-validity-researcher.test.ts
rl-rl-campaign.test.ts
rl-auto-research.test.ts
typecheck + build clean. dist/rl.{js,d.ts} entry emits.
Honest framing:
The 9 stable RL primitives (run-record-adapters, verifiable-reward,
preferences, off-policy, process-reward, contamination, tournament,
adversarial, compute-curves) are the load-bearing release.
The 7 experimental primitives ship behind an explicit "experimental"
marker in the barrel docstring. Their interfaces are reasonable but
may evolve as production consumers exercise them. No consumer is
pulling on them today — that's the honest scope.
The two worked examples are the primary deliverable: they prove the
composition story end-to-end (auto-research loop with synthetic
driver shows real score climb; prime-rl SFT export runs end-to-end
on the synthetic data).
Migration: all additive. agent-knowledge and agent-runtime should bump
to 0.23 in follow-up PRs to those repos to consume the new surface.
49 files changed: 7,348 additions, 6 deletions
File tree:
- .claude/skills/agent-eval
- clients/python
  - src/agent_eval_rpc
- docs
- examples
  - auto-research-with-agent-builder
  - fine-tune-with-prime-rl
- src
  - rl
- tests