This page explains the main agent-eval primitives in plain English first,
then shows when to use each one.
LLM agents can write code, drafts, research, plans, and actions. The hard part is knowing whether they actually did a good job, whether they should keep trying, and whether a change made them better or worse.
agent-eval gives you reusable tools for that:
- Judges grade one output.
- Verifiers run several checks in order.
- Control loops let an agent keep working until it passes, gets blocked, or hits a budget.
- Feedback trajectories turn normal user approvals/rejections into training and eval data.
- Datasets and holdouts keep examples organized so you do not overfit.
- Optimizers and mutation loops try prompt/signature/code variants and keep the ones that really improve.
- Traces and telemetry show what happened, step by step.
| Problem | Use | Why |
|---|---|---|
| “Did this single answer/draft pass?” | Judge or rubric | Fast quality signal for one artifact. |
| “Does generated code actually work?” | BuilderSession, MultiLayerVerifier, sandbox harness |
Build/test/runtime gates catch failures judges miss. |
| “Should the agent keep trying?” | runAgentControlLoop |
Budgeted observe -> validate -> decide -> act runtime. |
| “The agent should propose, verify, review, and revise.” | runProposeReviewAsControlLoop |
Reusable preset over the generic control loop. |
| “Human feedback should become reusable eval data.” | FeedbackTrajectory |
Captures approvals, rejections, edits, choices, metrics, and policy blocks. |
| “Can this action run, or does it need approval?” | evaluateActionPolicy |
Generic preflight for side effects, budgets, and required evidence. |
| “I need train/dev/test/holdout examples.” | Dataset plus feedback trajectory conversion |
Stable splits and contamination control. |
| “Which prompt or signature wins?” | runImprovementLoop, steering optimizers |
Runs variants on scenarios and compares scores. |
| “Improve a multi-turn agent over real task traces.” | runImprovementLoop |
GEPA-style trajectory optimization with ASI and held-out promotion. |
| “Improve prompts, then code if prompts plateau.” | runPromptEvolution, composite mutator, code mutator |
Bounded evolution with telemetry and lineage. |
| “Find why a regression happened.” | bisector, traces, run records | Narrows changes and preserves evidence. |
| “Expose evals to another language.” | Wire protocol and Python client | HTTP/RPC boundary for non-TypeScript apps. |
Use this shape when the product needs to keep pushing work forward instead of only answering once:
user intent
-> product adapter observes typed state
-> runAgentControlLoop decides the next driver action
-> worker executes in the real product environment
-> validators and judges grade the new state
-> FeedbackTrajectory records user/environment/reviewer signal
-> datasets and optimizers replay the same adapter
The important part is that production and eval do not use different loops. The adapter can swap dependencies (real user session, replay fixture, sandbox), but the state shape, validators, actions, budgets, and stop policies should stay the same. That is what makes benchmark gains transfer to real usage.
Use when you have a coding, browser, computer-use, research, or documentation agent.
- Represent the current task state.
- Validate that state with objective checks first and judges second.
- Use
runAgentControlLoopto decide the next action. - Record user feedback as
FeedbackTrajectory. - Convert trajectories into datasets and optimizer rows.
Result:
normal agent usage -> labeled examples -> replay/eval -> optimization
Implementation ownership:
- Put reusable loop mechanics, trace schemas, budget accounting, split
assignment, and optimizer row conversion in
agent-eval. - Put product state readers, action executors, approval policy, credentials, workspace paths, and UI-specific storage in the downstream repo.
- Promote a product adapter into
agent-evalonly after at least two products need the same adapter shape.
Use when an agent writes or patches a repo.
- Use
BuilderSessionorMultiLayerVerifier. - Always run static gates like typecheck/build/tests.
- Add semantic judges only after build gates pass.
- Store traces and run records for regression debugging.
Result:
generated code -> build/test/runtime gates -> score -> ship or revise
Use when you want Ax/GEPA-style improvement.
- For variable-length agent tasks, use
runImprovementLoop. - Build search/dev/test/holdout splits from the real product loop.
- Score full trajectories, not just final text.
- Emit actionable side information for failures the mutator can fix.
- Promote only
promotedVariant, never a rejectedsearchBestVariant. - Keep run records with prompt hash, model, config, cost, and commit.
Result:
candidate variant -> repeated evals -> statistical comparison -> promotion gate
Do not optimize a toy harness if users run a different product loop. Build
training examples from FeedbackTrajectory and controlRunToFeedbackTrajectory,
then replay them through the same product adapter used at runtime. Train on
train, tune on dev, report on test, and keep holdout untouched until a
promotion decision.
Use when operator or reviewer interaction should create labels.
Capture:
- approve/reject
- select A/B/C
- edit/rewrite
- rank/rate
- comment
- metric outcome
- policy block or budget block
Store as FeedbackTrajectory, then derive:
- preference memory for the next run
- dataset scenarios for regression
- optimizer rows for prompt/signature/code changes
- holdout examples to detect overfitting
| Area | Key exports | Best for | Notes |
|---|---|---|---|
| Judging | createCustomJudge, semantic judges, anti-slop, wire rubrics |
Content, voice, semantic quality | Pair with objective checks when possible. |
| Verification | MultiLayerVerifier, JudgeRunner, sandbox harness |
Code and multi-step gates | Do not let semantic judges override failed builds. |
| Control | runAgentControlLoop, objectiveEval, subjectiveEval |
Long-running agent tasks | Supports budgets, cost, stop policies, trace spans. |
| Propose/review | runProposeReview, runProposeReviewAsControlLoop |
Iterative artifact repair | Good for code, docs, plans, briefs. |
| Feedback data | FeedbackTrajectory, stores, converters |
Human/environment labels | Domain adapters live in downstream repos. |
| Action policy | evaluateActionPolicy |
Approval/budget preflight | Blocks or labels actions before act(). |
| Datasets | Dataset, holdout tools, canaries |
Train/dev/test/holdout corpora | Keeps optimization honest. |
| Optimization | runImprovementLoop, steering optimizers |
Prompt/signature comparison | Use held-out gates before promotion. |
| Evolution | prompt/code mutators, sandbox pool, telemetry | Autoresearch and mutation loops | Use budgets and lineage; do not run unbounded. |
| Telemetry | TraceStore, OTLP, file sinks |
Audit and replay | Treat traces as evidence, not just logs. |
| Reporting | summaries, pareto, cost tracker | Decision support | Useful for PRs, launch gates, research notes. |
- Prefer deterministic checks before LLM judges.
- Keep holdout data out of optimization.
- Record model, prompt hash, config hash, commit, and cost for every serious run.
- Budget every loop: steps, wall time, and dollars.
- Treat external side effects as downstream policy. The runtime can stop loops, but your adapter decides what requires approval.
- Store user feedback with enough context to replay it later.
- Run harnesses and judges in the same sandbox when they need shared files, logs, screenshots, or browser state. Use separate sandboxes for parallel variants or destructive checks.
- Wrap one real product workflow in
runAgentControlLoop. - Add objective validators that can fail without an LLM.
- Emit traces and convert completed runs into feedback trajectories.
- Capture explicit user/reviewer feedback on attempts.
- Replay those trajectories as train/dev/test/holdout scenarios.
- Run prompt/signature optimization against that replay adapter.
- Promote only when held-out trajectories and product telemetry both improve.
Domain-specific adapters should usually stay in downstream repos until they prove reusable:
- browser site-specific actions
- repo-specific coding commands
- workspace-specific storage paths
Core should provide shapes, stores, runners, scoring, traces, and converters. Downstream integrations provide domain state, policy, tools, and storage.
examples/same-sandbox-harness: one workdir for install/build/test plus evidence inspection.examples/multi-shot-optimization: full-trajectory optimization with a holdout gate.examples/benchmarks: benchmark adapter contracts and reference wrappers.