This guide is for teams adding @tangle-network/agent-eval to a real agent
product. The package supplies evaluation contracts and runtime primitives. Your
product supplies the actual workflow adapter, state, credentials, tools, UI, and
storage.
Use the same loop for production, replay, and optimization:
```
real user task
  -> product adapter observes state
  -> validators and judges grade state
  -> control loop decides next action
  -> product agent acts in the real environment
  -> trace + feedback trajectory are stored
  -> datasets and optimizers replay the same adapter
```

If production and eval use different loops, benchmark gains will not transfer.
The product owns:
- task state and domain models
- credentials, tenant policy, approval, and side-effect rules
- browser, sandbox, CLI, connector, or voice drivers
- database and trace persistence
- user/reviewer feedback collection
- deployment and live canary routing
- model gateway configuration
agent-eval owns:
- trace, run, dataset, feedback, and score contracts
- control-loop mechanics
- verifier and judge orchestration
- failure taxonomy
- paired statistics and holdout gates
- optimizer inputs and promotion reports
Start with a small adapter that mirrors one real workflow.
```ts
interface ProductEvalAdapter<TState, TAction> {
  observe(taskId: string): Promise<TState>
  validate(state: TState): Promise<ControlEvalResult[]>
  decide(input: {
    state: TState
    evals: ControlEvalResult[]
    history: unknown[]
  }): Promise<TAction | 'stop'>
  act(taskId: string, action: TAction): Promise<void>
}
```

Keep the adapter product-owned until at least two products need the same shape.
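A minimal sketch of how a product might drive this adapter in its own loop. The step cap and stop handling are illustrative assumptions, not agent-eval's control-loop API; they only show the observe/validate/decide/act contract in motion.

```ts
// Illustrative product-side driver for ProductEvalAdapter (assumed helper,
// not part of @tangle-network/agent-eval).
async function runTask<TState, TAction>(
  adapter: ProductEvalAdapter<TState, TAction>,
  taskId: string,
  maxSteps = 20, // hypothetical safety cap
): Promise<void> {
  const history: unknown[] = []
  for (let step = 0; step < maxSteps; step++) {
    const state = await adapter.observe(taskId)
    const evals = await adapter.validate(state)
    const action = await adapter.decide({ state, evals, history })
    if (action === 'stop') return
    await adapter.act(taskId, action)
    history.push({ step, action, evals })
  }
}
```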
Use deterministic checks before judges.
- State validity: schema, required files, required DB rows, required connections.
- Runtime gates: install, build, typecheck, tests, serve, deploy smoke.
- Policy gates: approvals, side effects, budget, credentials, data freshness.
- Behavior gates: browser flows, API calls, generated app preview, voice transcript checks.
- Semantic judges: intent fit, quality, completeness, safety, professional correctness.
Semantic judges should never turn a failed build into a pass.
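A sketch of that ordering, assuming hypothetical product-side gate and judge callbacks; how the results map into ControlEvalResult is product-specific.

```ts
// Hypothetical product-side shapes; names are illustrative, not agent-eval API.
interface GateResult {
  name: string
  passed: boolean
  detail?: string
}

type Check = (taskId: string) => Promise<GateResult>

// Deterministic gates run first; semantic judges only run once every gate
// passes, so a judge can never turn a failed build into a pass.
async function gradeTask(
  taskId: string,
  deterministicGates: Check[], // state validity, runtime, policy, behavior
  semanticJudges: Check[], // intent fit, quality, completeness, safety
): Promise<GateResult[]> {
  const gateResults: GateResult[] = []
  for (const gate of deterministicGates) {
    gateResults.push(await gate(taskId))
  }
  if (gateResults.some((r) => !r.passed)) return gateResults

  const judgeResults = await Promise.all(semanticJudges.map((j) => j(taskId)))
  return [...gateResults, ...judgeResults]
}
```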
Every serious run should record:
- task id and scenario id
- git commit
- model and provider
- prompt/config hashes
- tool calls and retrieval spans
- build/test/deploy output
- cost, latency, and token use
- user/reviewer feedback
- final outcome and failure class
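A product-owned record type covering those fields might look like the sketch below; every field name is illustrative, while FeedbackTrajectory and RunRecord remain the agent-eval contracts downstream.

```ts
// Illustrative product-owned trace record; the fields simply mirror the
// checklist above and are not an agent-eval type.
interface ProductRunTrace {
  taskId: string
  scenarioId: string
  commitSha: string
  model: string
  provider: string
  promptHash: string
  configHash: string
  toolCalls: unknown[]
  retrievalSpans: unknown[]
  runtimeOutput: { build?: string; test?: string; deploy?: string }
  costUsd: number
  latencyMs: number
  tokenUsage: { input: number; output: number }
  feedback?: { source: 'user' | 'reviewer'; rating: number; note?: string }
  outcome: 'success' | 'failure'
  failureClass?: string
}
```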
Convert runs into FeedbackTrajectory records so normal product usage becomes
replayable eval data.
```
production run -> feedback trajectory -> dataset scenario -> optimizer row
```

For promotion-grade runs, also project the completed control result into a strict RunRecord:
```ts
const record = controlRunToRunRecord(controlResult, {
  experimentId,
  candidateId,
  seed,
  model: 'gpt-4o-2024-11-20',
  promptHash,
  configHash,
  commitSha,
  splitTag: 'holdout',
  tokenUsage,
})
```

Use four splits:
- train: optimizer search.
- dev: tuning and threshold selection.
- test: normal reporting.
- holdout: promotion-only gate.
The low-level RunRecord schema uses search | dev | holdout; map train
and normal non-holdout test/report rows to search when producing promotion
tables.
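A small sketch of that mapping, assuming a product-side helper; only the splitTag values come from the RunRecord schema, the function itself is illustrative.

```ts
// Illustrative product-side helper; only 'search' | 'dev' | 'holdout' come
// from the low-level RunRecord schema described above.
type ProductSplit = 'train' | 'dev' | 'test' | 'holdout'
type RunRecordSplit = 'search' | 'dev' | 'holdout'

function toRunRecordSplitTag(split: ProductSplit): RunRecordSplit {
  switch (split) {
    case 'train':
    case 'test': // normal non-holdout reporting rows collapse into search
      return 'search'
    case 'dev':
      return 'dev'
    case 'holdout':
      return 'holdout'
  }
}
```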
Do not inspect or tune against holdout failures during optimization. If a holdout failure reveals a real product bug, fix the bug and rotate the holdout set with a signed note.
Use runMultiShotOptimization() when the system is a multi-step agent, not a
single prompt.
Good optimization targets:
- system prompt
- tool descriptions
- retrieval policy
- data acquisition policy
- user-question policy
- evaluator threshold
- agent topology
- scaffold/template choice
Bad optimization targets:
- hidden holdout examples
- production credentials
- brittle string checks that do not match user value
- fake workflows that do not call the product adapter
Use actionable side information so the optimizer knows whether a failure belongs to prompt, tools, retrieval, data acquisition, sandbox, evaluator, or product runtime.
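A sketch of that side information; the component labels mirror the list above, while the record shape itself is a product-side assumption.

```ts
// Illustrative failure side information attached to each graded run.
// The component values mirror the taxonomy above; the shape is not an
// agent-eval contract.
type FailureComponent =
  | 'prompt'
  | 'tools'
  | 'retrieval'
  | 'data-acquisition'
  | 'sandbox'
  | 'evaluator'
  | 'product-runtime'

interface FailureSideInfo {
  component: FailureComponent
  evidence: string // e.g. failing check name, stack trace, missing connection id
  retriable: boolean
}
```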
A launch or promotion should require:
- enough runs for the target risk level
- paired improvement over the current baseline
- no critical regression on test
- holdout pass or explicit rejection
- cost and latency within budget
- no unresolved canary or contamination failures
- trace evidence for representative successes and failures
- TraceAnalyst findings for failure-heavy or regression-heavy corpora
- human-readable report with failure clusters and next actions
evaluateReleaseConfidence() and the paired statistics helpers provide the
decision data. The product decides the business threshold.
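A minimal product-side sketch of that business decision. Every field here is a hypothetical product-owned input assembled from evaluateReleaseConfidence(), the paired statistics helpers, and the checklist above; only the criteria mirror this guide.

```ts
// Hypothetical promotion evidence gathered by the product; not an agent-eval type.
interface PromotionEvidence {
  runs: number
  minRuns: number
  pairedImprovementOverBaseline: boolean
  criticalRegressionOnTest: boolean
  holdoutPassed: boolean
  withinCostAndLatencyBudget: boolean
  openCanaryOrContaminationFailures: number
  hasTraceEvidence: boolean
  hasFailureClusterReport: boolean
}

// The product decides these thresholds; this only encodes the checklist.
function canPromote(e: PromotionEvidence): boolean {
  return (
    e.runs >= e.minRuns &&
    e.pairedImprovementOverBaseline &&
    !e.criticalRegressionOnTest &&
    e.holdoutPassed &&
    e.withinCostAndLatencyBudget &&
    e.openCanaryOrContaminationFailures === 0 &&
    e.hasTraceEvidence &&
    e.hasFailureClusterReport
  )
}
```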
For generated-app products, use sandbox/build/test/serve/browser validators; add intent and semantic concept judges only after the generated app runs.
For browser agents, record browser steps, screenshots, network errors, console errors, and final state, and use deterministic DOM/API assertions before visual or semantic judges.
For professional-domain research agents, use domain fixtures, jurisdiction/date metadata, retrieval spans, and professional judges, and fail missing or stale evidence separately from bad reasoning.
Use @tangle-network/agent-integrations manifests as readiness inputs. Gate
missing connections, missing scopes, approval-required writes, and stale tokens
before blaming the agent prompt.
For generated apps and sandbox agents, also run the Integration Launch Gates. The eval should prove that app code calls integrations through the integration bridge, not through provider SDKs with raw OAuth tokens.
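One illustrative way to back that gate with a deterministic check, assuming a product-side static scan over generated source; the forbidden SDK list and bridge module name are hypothetical placeholders.

```ts
// Hypothetical static gate over generated app source; the package names below
// are product assumptions, not agent-eval API.
const FORBIDDEN_PROVIDER_SDKS = ['googleapis', '@slack/web-api', 'stripe']
const BRIDGE_MODULE = '@tangle-network/agent-integrations' // assumed bridge entry point

function checkIntegrationBridgeUsage(sourceFiles: Map<string, string>): string[] {
  const violations: string[] = []
  for (const [path, code] of sourceFiles) {
    for (const pkg of FORBIDDEN_PROVIDER_SDKS) {
      if (code.includes(`from '${pkg}'`) || code.includes(`require('${pkg}')`)) {
        violations.push(`${path}: imports provider SDK ${pkg} directly`)
      }
    }
  }
  const usesBridge = [...sourceFiles.values()].some((code) => code.includes(BRIDGE_MODULE))
  if (!usesBridge) violations.push('no generated file imports the integration bridge')
  return violations
}
```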
For voice agents, record the transcript, timing, interruptions, tool calls, and task outcome. Judge conversation quality separately from tool success and policy compliance.
Avoid these anti-patterns:
- Evaluating only final prose for an agent that actually builds, browses, or calls tools.
- Letting an LLM judge override failed tests.
- Optimizing on examples that users will never hit.
- Recording traces as logs but never converting them to datasets.
- Calling every failure a prompt failure when context, data, auth, or runtime readiness was missing.
- Shipping reports without run ids, commits, model ids, or evidence links.