@tangle-network/agent-eval

Substrate for self-improving agents. Trace what runs, verify the result, turn outcomes into preferences and rewards, mutate prompts and policies under anytime-valid evidence, and ship only when the improvement is decisive.

real product task
  -> observe / act (your runtime)
  -> trace + verifier pipeline (capture integrity)
  -> RunRecord (canonical eval artifact)
       -> judge calibration · paired stats · sequential α
       -> preferences · verifiable rewards · process rewards
       -> GEPA / reflective mutation · auto-research · active curriculum
       -> release gate · replay · contamination probe · tournament rating
  -> next iteration

agent-eval does not own product state, credentials, UI, storage, model routing, browser drivers, sandbox policy, or deployment. Products own those. This package owns the loop that closes evaluation → preference → mutation → redeploy, with capture integrity and statistically rigorous evidence at every step.

It ships as a TypeScript library (npm) with a generated Python client (PyPI), both speaking the same wire protocol. MIT, self-hostable, no SaaS dependency.

Install

pnpm add @tangle-network/agent-eval
# or, from Python:
pip install agent-eval-rpc

Quick Start — the control loop

import {
  objectiveEval,
  runAgentControlLoop,
} from '@tangle-network/agent-eval/control'

const result = await runAgentControlLoop({
  intent: task.prompt,
  budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },

  observe() {
    return product.readState(task.id)
  },

  validate({ state }) {
    return [
      objectiveEval({
        id: 'build-passes',
        passed: state.build.exitCode === 0,
        severity: 'critical',
        metadata: state.build,
      }),
      objectiveEval({
        id: 'preview-serves',
        passed: state.preview.httpStatus === 200,
        severity: 'critical',
      }),
    ]
  },

  decide({ evals }) {
    const failed = evals.filter((e) => !e.passed)
    if (failed.length === 0) {
      return { type: 'stop', pass: true, reason: 'all gates passed' }
    }
    return {
      type: 'continue',
      action: { type: 'repair', failed: failed.map((e) => e.id) },
      reason: 'repair failed gates',
    }
  },

  act(action) {
    return product.runAgentStep(task.id, action)
  },
})

await product.storeEvalResult(task.id, result)

Same loop shape in production, replay, benchmark, and optimization. Swap the dependencies behind observe() and act(), never the eval contract.
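To make the swap concrete, here is a minimal, self-contained sketch of that same observe → validate → decide → act shape driven by recorded fixtures instead of a live product. The names (`runLoop`, `Deps`) are illustrative, not this package's API:

```typescript
// Illustrative only: a hand-rolled loop with the same shape as the library's
// control loop, exercised against recorded fixtures instead of a live product.
type Eval = { id: string; passed: boolean }
type Deps<S> = {
  observe: () => S
  validate: (state: S) => Eval[]
  act: (failedIds: string[]) => void
  maxSteps: number
}

function runLoop<S>(deps: Deps<S>): { pass: boolean; steps: number } {
  for (let step = 1; step <= deps.maxSteps; step++) {
    const evals = deps.validate(deps.observe())
    const failed = evals.filter((e) => !e.passed).map((e) => e.id)
    if (failed.length === 0) return { pass: true, steps: step }
    deps.act(failed) // live: product.runAgentStep; replay: advance a fixture cursor
  }
  return { pass: false, steps: deps.maxSteps }
}

// Replay mode: observe/act read from a recorded state sequence.
const recorded = [{ exitCode: 1 }, { exitCode: 0 }]
let cursor = 0
const result = runLoop({
  observe: () => recorded[cursor],
  validate: (s) => [{ id: 'build-passes', passed: s.exitCode === 0 }],
  act: () => { cursor++ },
  maxSteps: 8,
})
// result → { pass: true, steps: 2 }
```

Only `observe` and `act` change between modes; the `validate` gates and stop condition stay identical, which is what keeps results comparable across production, replay, and benchmark runs.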

Production loop — close the eval → prod → eval cycle (0.25.0)

Static prompts decay. Yesterday's FTC rule flips today; yesterday's tool quirk becomes today's incident. The production agents that win are the ones that continuously re-train against live failure modes.

runProductionLoop is the orchestration layer that wires the existing eval substrate into a self-improvement cron:

import {
  runProductionLoop,
  httpGithubClient,
  FileSystemFeedbackTrajectoryStore,
} from '@tangle-network/agent-eval'
import { FileSystemTraceStore } from '@tangle-network/agent-eval/traces'

const result = await runProductionLoop({
  runId: `weekly-${new Date().toISOString().slice(0, 10)}`,
  target: 'tax-agent',

  // 1. Where production traces + feedback land. Wire the HTTP ingestion
  //    endpoints (POST /v1/traces/ingest, POST /v1/feedback) from your
  //    runtime; the same store reads them here.
  traceStore: new FileSystemTraceStore({ dir: 'data/prod-traces' }),
  feedbackStore: new FileSystemFeedbackTrajectoryStore({ dir: 'data/prod-feedback' }),

  // 2. Cluster threshold: act on failure groups ≥ 20 runs or ≥ 5% of corpus.
  cluster: { minClusterSize: 20, minSeverityRatio: 0.05, maxClustersPerCycle: 1 },

  // 3. Evolve: seed = current prompt, gate against holdout scenarios.
  evolve: {
    baselinePrompt: currentSystemPrompt,
    holdoutScenarios: productionShapeScenarios,
    runner,                            // your agent driver
    scorer,                            // calibrated judge or rubric
    mutator,                           // GEPA-style or addendum-style mutator
    gate: {
      baselineKey: 'baseline',
      minProductiveRuns: 5,
      pairedDeltaThreshold: 0.03,      // require ≥ 0.03 paired score delta on holdout
      overfitGapThreshold: 0.10,
    },
  },

  // 4. Ship: when the gate passes, open a PR with the new prompt.
  ship: {
    client: httpGithubClient({ token: process.env.GITHUB_TOKEN! }),
    repo: { owner: 'tangle-network', name: 'tax-agent' },
    branchPrefix: 'eval/auto-improve',
    promptFilePath: 'prompts/tax-agent-system.txt',
    reviewers: ['drew'],
  },

  cron: { cadence: 'weekly' },         // surface-only; consumer schedules
})

console.log(result.decision)            // 'pr_opened' | 'gate_failed' | 'no_actionable_failures' | ...
console.log(result.pullRequest?.prUrl)  // populated when a PR was opened

The primitive runs one cycle. Schedule it with workflow_dispatch + cron in GitHub Actions. It is idempotent + replayable: same runId → same plan. Gate failures are fail-closed — a candidate that beats baseline on search but overfits on holdout never lands.

Full runnable demo (synthetic traces, no credentials) in examples/production-loop.
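The README names `POST /v1/traces/ingest` and `POST /v1/feedback` as the ingestion endpoints; the payload shape below is an assumption for illustration, not the documented wire schema:

```typescript
// Illustrative only: the endpoint path comes from this README; the payload
// fields (runId, target, spans) are assumed, not the documented schema.
type TracePayload = {
  runId: string
  target: string
  spans: Array<{ spanId: string; kind: string; startedAt: string }>
}

function buildTracePayload(runId: string, target: string): TracePayload {
  return {
    runId,
    target,
    spans: [{ spanId: 'span-1', kind: 'llm', startedAt: new Date().toISOString() }],
  }
}

// How a runtime might post a trace (defined here, not invoked):
async function ingest(baseUrl: string, payload: TracePayload): Promise<void> {
  await fetch(`${baseUrl}/v1/traces/ingest`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(payload),
  })
}

const payload = buildTracePayload('weekly-2025-01-06', 'tax-agent')
```

Whatever shape you send, the same `FileSystemTraceStore` that backs ingestion is the one `runProductionLoop` reads from, so traces written in production are the corpus the next cycle clusters over.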

Self-improvement loop

Eval doesn't end at "pass/fail." Outcomes become training signal, mutation proposals, and curriculum updates — all from the same RunRecord produced by the control loop.

import { runEvalCampaign } from '@tangle-network/agent-eval'
import {
  extractPreferences,
  extractVerifiableReward,
  filterDeterministicallyRewarded,
  offPolicyEstimateAll,
  analyzeOptimizationResult,
} from '@tangle-network/agent-eval/rl'

// 1. Run a matrix of variants × scenarios with capture integrity by construction.
const campaign = await runEvalCampaign({ variants, scenarios, run })

// 2. Convert outcomes into RL signal.
const rewards = extractVerifiableReward(campaign.runs)          // compile/test/schema
const prefs   = extractPreferences(campaign.runs)               // (chosen, rejected) triples
const clean   = filterDeterministicallyRewarded(rewards)        // judge-noise free

// 3. Estimate a candidate policy's value without re-running.
const ope = offPolicyEstimateAll(campaign.runs, candidatePolicy)  // IPS + SNIPS + DR

// 4. Or close the loop end-to-end: score → reflect → mutate → re-run.
const next = await analyzeOptimizationResult(campaign, { researcher })
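As background for step 3, the IPS and SNIPS estimators can be written down in a few lines. This is a generic sketch of the math, not this package's implementation; `LoggedStep` and its fields are illustrative names:

```typescript
// Generic off-policy estimators, not this package's implementation.
// Each logged step carries the behavior policy's probability of the action it
// took, the candidate policy's probability of that same action, and the reward.
type LoggedStep = { pBehavior: number; pCandidate: number; reward: number }

// Inverse propensity scoring: unbiased, but high-variance when weights blow up.
function ips(steps: LoggedStep[]): number {
  const sum = steps.reduce((s, x) => s + (x.pCandidate / x.pBehavior) * x.reward, 0)
  return sum / steps.length
}

// Self-normalized IPS: trades a little bias for much lower variance.
function snips(steps: LoggedStep[]): number {
  let num = 0
  let den = 0
  for (const x of steps) {
    const w = x.pCandidate / x.pBehavior
    num += w * x.reward
    den += w
  }
  return den === 0 ? 0 : num / den
}

const logs: LoggedStep[] = [
  { pBehavior: 0.5, pCandidate: 0.8, reward: 1 },
  { pBehavior: 0.5, pCandidate: 0.2, reward: 0 },
]
// ips(logs)   → (1.6·1 + 0.4·0) / 2 = 0.8
// snips(logs) → 1.6 / (1.6 + 0.4)   = 0.8
```

The doubly-robust (DR) variant additionally subtracts a learned reward model's prediction to cut variance further, which is why estimating all three and comparing them is a useful sanity check.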

| Step | Primitive | Subpath |
| --- | --- | --- |
| Eval matrix with integrity | runEvalCampaign | / |
| Deterministic re-judge / audit | ReplayCache, createReplayFetch | / |
| Anytime-valid α across rolling looks | pairedEvalueSequence | /reporting |
| Judge quality vs gold | calibrateJudge (κ, Pearson, MAE, bias probes) | / |
| (chosen, rejected) for DPO/KTO/PPO | extractPreferences | /rl |
| Verifiable reward signal | extractVerifiableReward | /rl |
| Step-level / PRM training data | extractStepRewards, prmTrainingPairs | /rl |
| Estimate policy value off-policy | offPolicyEstimateAll (IPS + SNIPS + DR) | /rl |
| GEPA / reflective prompt mutation | buildReflectionPrompt, parseReflectionResponse, Ax-GEPA SteeringOptimizer | /, /optimization |
| Auto-research (read runs → propose) | analyzeOptimizationResult, PredictiveValidityResearcher | /rl |
| Active curriculum (variance / Thompson) | allocateCurriculum | /rl |
| Tournament ratings (Bradley-Terry + Elo) | fitBradleyTerry, applyEloUpdate | /rl |
| Adversarial scenario search | adversarialScenarioSearch | /rl |
| Contamination probe (held-out perturb) | runContaminationProbe | /rl |
| Reward hacking signatures | detectRewardHacking | /rl |
| Compute curves (best-of-N, self-consist, Pareto) | runComputeCurve, bestOfN, selfConsistency, paretoFrontier | /rl |
| Knowledge gap separated from reasoning gap | scoreKnowledgeReadiness | / |
| Release gate (paired evidence + holdouts) | evaluateReleaseConfidence, HeldOutGate | /reporting |
| Launch report (decision-grade) | renderReleaseReport, researchReport | /reporting |
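The anytime-valid machinery behind the sequential-α row can be illustrated with a simple betting e-process. This is a generic sketch, assuming paired score deltas bounded in [-1, 1], and is not this package's implementation:

```typescript
// Generic betting e-process for H0: E[delta] <= 0, with deltas in [-1, 1].
// Under H0, each factor (1 + lambda * delta) has expectation <= 1, so the
// running product is a nonnegative supermartingale; by Ville's inequality,
// rejecting whenever it exceeds 1/alpha controls type-I error at EVERY look,
// no matter how often you peek.
function evalueSequence(deltas: number[], lambda = 0.5): number[] {
  let e = 1
  return deltas.map((d) => (e *= 1 + lambda * d))
}

function firstDecisiveLook(evalues: number[], alpha = 0.05): number {
  return evalues.findIndex((e) => e >= 1 / alpha) // -1 if never decisive
}

// Candidate beats baseline by +0.2 on every paired scenario:
const evalues = evalueSequence(Array(40).fill(0.2))
const look = firstDecisiveLook(evalues)
// look === 31 — the 32nd paired delta pushes the e-process past 1/alpha = 20
```

This is why rolling looks at a campaign are safe here: unlike a fixed-n t-test, the evidence threshold never needs a multiple-comparisons correction for early peeking.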

Import Paths

| Subpath | Use for |
| --- | --- |
| @tangle-network/agent-eval/control | observe → validate → decide → act, action policy, propose/review loops |
| @tangle-network/agent-eval/traces | trace stores, emitters, TraceAnalyst, replay |
| @tangle-network/agent-eval/optimization | feedback trajectories, multi-shot, prompt evolution, GEPA, EvalCampaign |
| @tangle-network/agent-eval/reporting | release confidence, paired stats, sequential e-values, launch reports |
| @tangle-network/agent-eval/rl | adapters, verifiable rewards, preferences, OPE, PRM, contamination, tournaments, adversarial, compute curves, auto-research |
| @tangle-network/agent-eval/wire | HTTP/RPC server + schemas (same protocol the Python client speaks) |
| @tangle-network/agent-eval/benchmarks | benchmark adapter contracts and reference wrappers |

The root export remains available for convenience; new code should prefer focused subpaths. Anything under /rl should be imported from /rl — root re-export is retained only for backward compatibility and will be narrowed in 0.25.

API stability

Public exports are tagged with JSDoc stability markers so consumers can see status at the call site (IDE hover, language server, declaration files).

| Tag | Meaning |
| --- | --- |
| @stable | API frozen at this major. Breaking changes require a major bump. |
| @experimental | Interface may evolve before becoming @stable. Pin the patch version if you depend on it. |
| @internal | Not part of the public contract. Use the documented subpath instead. |

The /rl subpath is the most active surface. See src/rl/index.ts for the current stable/experimental breakdown.

Capture integrity (0.21+)

Launch-grade benchmark runs need four things that are easy to forget in glue code: (1) raw HTTP capture alongside the structured spans so a reviewer can verify which route answered, (2) a preflight assertion that the configured client points at the intended provider, (3) a run-end assertion that the expected events were actually written, and (4) auto-execution of the trace analyst as part of the run lifecycle.

import {
  TraceEmitter, FileSystemRawProviderSink, callLlm, assertLlmRoute,
  assertRunCaptured, throwIfRunIncomplete,
} from '@tangle-network/agent-eval'
import { traceAnalystOnRunComplete } from '@tangle-network/agent-eval/traces'

const sink = new FileSystemRawProviderSink({ dir: `${workDir}/raw-events` })
assertLlmRoute(llmOpts, { requireExplicitBaseUrl: true, allowedBaseUrls, requireAuth: true })

const emitter = new TraceEmitter(store, {
  onRunComplete: [traceAnalystOnRunComplete({ analyze: analystOpts, save })],
})
await emitter.startRun(/* ... */)
// LLM calls flow through callLlm with `{ rawSink: sink, traceContext: { runId, spanId } }`.
await emitter.endRun({ pass, score })

throwIfRunIncomplete(await assertRunCaptured(store, emitter.runId, {
  llmSpansMin: 1, rawSink: sink, requireRawCoverageOfLlmSpans: true, requireOutcome: true,
}))

Directives, rationale, and shipped-bug context are in SKILL.md § Capture integrity.

Examples

Each example has its own README with what it demonstrates, expected output, and runtime. See examples/.

Docs

Read in this order:

  1. Concepts — mental model, 5 min
  2. Product Eval Adoption
  3. Control Runtime
  4. Feedback Trajectories
  5. Multi-Shot Optimization
  6. Trace Analysis
  7. Knowledge Readiness
  8. Integration Launch Gates
  9. Wire Protocol — required for non-TypeScript consumers

CLI / Wire Protocol

npm i -g @tangle-network/agent-eval
agent-eval serve --port 5005

Python:

pip install agent-eval-rpc
from agent_eval_rpc import Client
client = Client()  # auto-detects HTTP server, falls back to subprocess
# inside an async function:
score = await client.judge(content=output, rubric_name="anti-slop")

TypeScript is the source of truth. Python is a thin transport client over the generated OpenAPI schema. Schema drift is made impossible at release time by version-locked CI.

Development

pnpm install
pnpm typecheck
pnpm test
pnpm lint        # biome
pnpm build       # tsup + openapi.json

Related Packages

Together: agent-runtime is where the agent runs; agent-knowledge is what it knows; agent-integrations is what it can do; agent-eval is how it gets better.

License

MIT

About

Domain-agnostic evaluation framework for Tangle agent apps
