This guide is for teams adding @tangle-network/agent-eval to a real agent
product. The package supplies evaluation contracts and runtime primitives. Your
product supplies the actual workflow adapter, state, credentials, tools, UI, and
storage.
Use the same loop for production, replay, and optimization:
```
real user task
  -> product adapter observes state
  -> validators and judges grade state
  -> control loop decides next action
  -> product agent acts in the real environment
  -> trace + feedback trajectory are stored
  -> datasets and optimizers replay the same adapter
```

If production and eval use different loops, benchmark gains will not transfer.
The product owns:
- task state and domain models
- credentials, tenant policy, approval, and side-effect rules
- browser, sandbox, CLI, connector, or voice drivers
- database and trace persistence
- user/reviewer feedback collection
- deployment and live canary routing
- model gateway configuration
agent-eval owns:
- trace, run, dataset, feedback, and score contracts
- control-loop mechanics
- verifier and judge orchestration
- failure taxonomy
- paired statistics and holdout gates
- optimizer inputs and promotion reports
Start with a small adapter that mirrors one real workflow.
```ts
interface ProductEvalAdapter<TState, TAction> {
  observe(taskId: string): Promise<TState>
  validate(state: TState): Promise<ControlEvalResult[]>
  decide(input: {
    state: TState
    evals: ControlEvalResult[]
    history: unknown[]
  }): Promise<TAction | 'stop'>
  act(taskId: string, action: TAction): Promise<void>
}
```

Keep the adapter product-owned until at least two products need the same shape.
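A minimal sketch of how a product might drive this adapter in its own loop. The step cap and stop handling are illustrative assumptions, not agent-eval's control-loop API; they only show the observe/validate/decide/act contract in motion.

```ts
// Illustrative product-side driver for ProductEvalAdapter (assumed helper,
// not part of @tangle-network/agent-eval).
async function runTask<TState, TAction>(
  adapter: ProductEvalAdapter<TState, TAction>,
  taskId: string,
  maxSteps = 20, // hypothetical safety cap
): Promise<void> {
  const history: unknown[] = []
  for (let step = 0; step < maxSteps; step++) {
    const state = await adapter.observe(taskId)
    const evals = await adapter.validate(state)
    const action = await adapter.decide({ state, evals, history })
    if (action === 'stop') return
    await adapter.act(taskId, action)
    history.push({ step, action, evals })
  }
}
```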
Use deterministic checks before judges.
- State validity: schema, required files, required DB rows, required connections.
- Runtime gates: install, build, typecheck, tests, serve, deploy smoke.
- Policy gates: approvals, side effects, budget, credentials, data freshness.
- Behavior gates: browser flows, API calls, generated app preview, voice transcript checks.
- Semantic judges: intent fit, quality, completeness, safety, professional correctness.
Semantic judges should never turn a failed build into a pass.
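A sketch of that ordering, assuming hypothetical product-side gate and judge callbacks; how the results map into ControlEvalResult is product-specific.

```ts
// Hypothetical product-side shapes; names are illustrative, not agent-eval API.
interface GateResult {
  name: string
  passed: boolean
  detail?: string
}

type Check = (taskId: string) => Promise<GateResult>

// Deterministic gates run first; semantic judges only run once every gate
// passes, so a judge can never turn a failed build into a pass.
async function gradeTask(
  taskId: string,
  deterministicGates: Check[], // state validity, runtime, policy, behavior
  semanticJudges: Check[], // intent fit, quality, completeness, safety
): Promise<GateResult[]> {
  const gateResults: GateResult[] = []
  for (const gate of deterministicGates) {
    gateResults.push(await gate(taskId))
  }
  if (gateResults.some((r) => !r.passed)) return gateResults

  const judgeResults = await Promise.all(semanticJudges.map((j) => j(taskId)))
  return [...gateResults, ...judgeResults]
}
```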
Every serious run should record:
- task id and scenario id
- git commit
- model and provider
- prompt/config hashes
- tool calls and retrieval spans
- build/test/deploy output
- cost, latency, and token use
- user/reviewer feedback
- final outcome and failure class
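A product-owned record type covering those fields might look like the sketch below; every field name is illustrative, while FeedbackTrajectory and RunRecord remain the agent-eval contracts downstream.

```ts
// Illustrative product-owned trace record; the fields simply mirror the
// checklist above and are not an agent-eval type.
interface ProductRunTrace {
  taskId: string
  scenarioId: string
  commitSha: string
  model: string
  provider: string
  promptHash: string
  configHash: string
  toolCalls: unknown[]
  retrievalSpans: unknown[]
  runtimeOutput: { build?: string; test?: string; deploy?: string }
  costUsd: number
  latencyMs: number
  tokenUsage: { input: number; output: number }
  feedback?: { source: 'user' | 'reviewer'; rating: number; note?: string }
  outcome: 'success' | 'failure'
  failureClass?: string
}
```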
Convert runs into FeedbackTrajectory records so normal product usage becomes
replayable eval data.
```
production run -> feedback trajectory -> dataset scenario -> optimizer row
```

For promotion-grade runs, also project the completed control result into a strict RunRecord:
```ts
const record = controlRunToRunRecord(controlResult, {
  experimentId,
  candidateId,
  seed,
  model: 'gpt-4o-2024-11-20',
  promptHash,
  configHash,
  commitSha,
  splitTag: 'holdout',
  tokenUsage,
})
```

Use four splits:
- train: optimizer search.
- dev: tuning and threshold selection.
- test: normal reporting.
- holdout: promotion-only gate.
The low-level RunRecord schema uses search | dev | holdout; map train
and normal non-holdout test/report rows to search when producing promotion
tables.
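A small sketch of that mapping, assuming a product-side helper; only the splitTag values come from the RunRecord schema, the function itself is illustrative.

```ts
// Illustrative product-side helper; only 'search' | 'dev' | 'holdout' come
// from the low-level RunRecord schema described above.
type ProductSplit = 'train' | 'dev' | 'test' | 'holdout'
type RunRecordSplit = 'search' | 'dev' | 'holdout'

function toRunRecordSplitTag(split: ProductSplit): RunRecordSplit {
  switch (split) {
    case 'train':
    case 'test': // normal non-holdout reporting rows collapse into search
      return 'search'
    case 'dev':
      return 'dev'
    case 'holdout':
      return 'holdout'
  }
}
```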
Do not inspect or tune against holdout failures during optimization. If a holdout failure reveals a real product bug, fix the bug and rotate the holdout set with a signed note.
Use runMultiShotOptimization() when the system is a multi-step agent, not a
single prompt.
Good optimization targets:
- system prompt
- tool descriptions
- retrieval policy
- data acquisition policy
- user-question policy
- evaluator threshold
- agent topology
- scaffold/template choice
Bad optimization targets:
- hidden holdout examples
- production credentials
- brittle string checks that do not match user value
- fake workflows that do not call the product adapter
Use actionable side information so the optimizer knows whether a failure belongs to prompt, tools, retrieval, data acquisition, sandbox, evaluator, or product runtime.
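A sketch of that side information; the component labels mirror the list above, while the record shape itself is a product-side assumption.

```ts
// Illustrative failure side information attached to each graded run.
// The component values mirror the taxonomy above; the shape is not an
// agent-eval contract.
type FailureComponent =
  | 'prompt'
  | 'tools'
  | 'retrieval'
  | 'data-acquisition'
  | 'sandbox'
  | 'evaluator'
  | 'product-runtime'

interface FailureSideInfo {
  component: FailureComponent
  evidence: string // e.g. failing check name, stack trace, missing connection id
  retriable: boolean
}
```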
A launch or promotion should require:
- enough runs for the target risk level
- paired improvement over the current baseline
- no critical regression on test
- holdout pass or explicit rejection
- cost and latency within budget
- no unresolved canary or contamination failures
- trace evidence for representative successes and failures
- TraceAnalyst findings for failure-heavy or regression-heavy corpora
- human-readable report with failure clusters and next actions
evaluateReleaseConfidence() and the paired statistics helpers provide the
decision data. The product decides the business threshold.
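A minimal product-side sketch of that business decision. Every field here is a hypothetical product-owned input assembled from evaluateReleaseConfidence(), the paired statistics helpers, and the checklist above; only the criteria mirror this guide.

```ts
// Hypothetical promotion evidence gathered by the product; not an agent-eval type.
interface PromotionEvidence {
  runs: number
  minRuns: number
  pairedImprovementOverBaseline: boolean
  criticalRegressionOnTest: boolean
  holdoutPassed: boolean
  withinCostAndLatencyBudget: boolean
  openCanaryOrContaminationFailures: number
  hasTraceEvidence: boolean
  hasFailureClusterReport: boolean
}

// The product decides these thresholds; this only encodes the checklist.
function canPromote(e: PromotionEvidence): boolean {
  return (
    e.runs >= e.minRuns &&
    e.pairedImprovementOverBaseline &&
    !e.criticalRegressionOnTest &&
    e.holdoutPassed &&
    e.withinCostAndLatencyBudget &&
    e.openCanaryOrContaminationFailures === 0 &&
    e.hasTraceEvidence &&
    e.hasFailureClusterReport
  )
}
```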
For generated-app products, use sandbox/build/test/serve/browser validators; add intent and semantic concept judges only after the generated app runs.
For browser agents, record browser steps, screenshots, network errors, console errors, and final state, and use deterministic DOM/API assertions before visual or semantic judges.
For professional-domain research agents, use domain fixtures, jurisdiction/date metadata, retrieval spans, and professional judges, and fail missing or stale evidence separately from bad reasoning.
Use @tangle-network/agent-integrations manifests as readiness inputs. Gate
missing connections, missing scopes, approval-required writes, and stale tokens
before blaming the agent prompt.
For generated apps and sandbox agents, also run the Integration Launch Gates. The eval should prove that app code calls integrations through the integration bridge, not through provider SDKs with raw OAuth tokens.
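One illustrative way to back that gate with a deterministic check, assuming a product-side static scan over generated source; the forbidden SDK list and bridge module name are hypothetical placeholders.

```ts
// Hypothetical static gate over generated app source; the package names below
// are product assumptions, not agent-eval API.
const FORBIDDEN_PROVIDER_SDKS = ['googleapis', '@slack/web-api', 'stripe']
const BRIDGE_MODULE = '@tangle-network/agent-integrations' // assumed bridge entry point

function checkIntegrationBridgeUsage(sourceFiles: Map<string, string>): string[] {
  const violations: string[] = []
  for (const [path, code] of sourceFiles) {
    for (const pkg of FORBIDDEN_PROVIDER_SDKS) {
      if (code.includes(`from '${pkg}'`) || code.includes(`require('${pkg}')`)) {
        violations.push(`${path}: imports provider SDK ${pkg} directly`)
      }
    }
  }
  const usesBridge = [...sourceFiles.values()].some((code) => code.includes(BRIDGE_MODULE))
  if (!usesBridge) violations.push('no generated file imports the integration bridge')
  return violations
}
```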
For voice agents, record the transcript, timing, interruptions, tool calls, and task outcome. Judge conversation quality separately from tool success and policy compliance.
Avoid these anti-patterns:
- Evaluating only final prose for an agent that actually builds, browses, or calls tools.
- Letting an LLM judge override failed tests.
- Optimizing on examples that users will never hit.
- Recording traces as logs but never converting them to datasets.
- Calling every failure a prompt failure when context, data, auth, or runtime readiness was missing.
- Shipping reports without run ids, commits, model ids, or evidence links.