This document outlines what your AI agent backend needs to expose for EvalView testing. EvalView supports three tiers: basic (just text response), metadata (response + cost/tokens), and full streaming (JSONL event stream with tool calls).
Minimum to get started:
- Your agent must respond to POST requests:
  curl -X POST http://localhost:3000/api/chat \
    -H "Content-Type: application/json" \
    -d '{"message": "Test query"}'
- Response must include the agent's answer:
  {"response": "Agent answer here..."}

That's it! You can now test output quality and latency.
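
As a rough illustration of the minimum contract above, here is a sketch of a Level 1 endpoint built on Node's built-in `http` module. `runAgent` and `handleChat` are hypothetical names, and `runAgent` is only a stand-in for your real agent logic. The route and port mirror the curl example.

```javascript
const http = require("node:http");

// Hypothetical stand-in for your real agent logic.
async function runAgent(message) {
  return `Echo: ${message}`;
}

// Parse the raw request body and build the Level 1 payload: {"response": "..."}.
async function handleChat(rawBody) {
  let payload;
  try {
    payload = JSON.parse(rawBody);
  } catch {
    return { status: 400, body: { error: "Invalid JSON" } };
  }
  if (!payload || typeof payload.message !== "string") {
    return { status: 400, body: { error: "Missing 'message' field" } };
  }
  const answer = await runAgent(payload.message);
  return { status: 200, body: { response: answer } };
}

const server = http.createServer((req, res) => {
  if (req.method !== "POST" || req.url !== "/api/chat") {
    res.writeHead(404).end();
    return;
  }
  let raw = "";
  req.setEncoding("utf8");
  req.on("data", (chunk) => { raw += chunk; });
  req.on("end", async () => {
    const { status, body } = await handleChat(raw);
    res.writeHead(status, { "Content-Type": "application/json" });
    res.end(JSON.stringify(body));
  });
});

// Start it with: server.listen(3000);
```

Keeping the payload logic in `handleChat` separate from the HTTP wiring makes it easy to unit test without opening a port.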
To add cost tracking (10 more minutes):
Add metadata to your response:
{
"response": "Agent answer...",
"metadata": {
"cost": 0.05,
"tokens": {"input": 100, "output": 500}
}
}

For full tool tracking (20 more minutes):
Stream JSONL events (see Level 3 below).
EvalView is a general-purpose testing framework for AI agents. It works with any agent that:
- Accepts queries via HTTP API
- Returns responses (streaming or non-streaming)
- Can emit structured execution data (optional but recommended)
EvalView supports agents at different levels of sophistication:
| Level | What You Provide | What Gets Tested | Setup Time |
|---|---|---|---|
| Level 1: Basic | Just text response | Output quality, latency | 5 minutes |
| Level 2: Metadata | Response + cost/tokens | Everything except tool sequence | 15 minutes |
| Level 3: Streaming | Full event stream | Everything (tools, cost, sequence) | 30 minutes |
Level 1: Basic
What you need:
POST /api/chat
{"message": "What is AAPL stock price?"}
Response:
{"response": "AAPL is trading at $266.25..."}

What gets tested:
- ✅ Output quality (contains expected keywords)
- ✅ Latency (response time)
- ❌ Cost (will show $0 with warning)
- ❌ Tools called
- ❌ Sequence correctness
Best for: Quick start, proof of concept, simple agents
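
To make the two Level 1 checks concrete, the sketch below shows one plausible way to test for expected keywords and to measure latency. It is not EvalView's implementation. The helper names are hypothetical, and the case-insensitive matching is an assumption.

```javascript
// Assumption: keyword matching is case-insensitive and requires every keyword.
function containsAllKeywords(responseText, keywords) {
  const haystack = responseText.toLowerCase();
  return keywords.every((keyword) => haystack.includes(keyword.toLowerCase()));
}

// Run an async call and report how long it took in milliseconds.
async function timed(fn) {
  const start = performance.now();
  const value = await fn();
  return { value, latencyMs: performance.now() - start };
}
```

For example, `containsAllKeywords("AAPL is trading at $266.25...", ["aapl", "trading"])` is true, while adding `"dividend"` to the keyword list makes it false.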
Level 2: Metadata
What you need:
POST /api/chat
{"message": "Analyze AAPL stock"}
Response:
{
"response": "AAPL analysis...",
"metadata": {
"cost": 0.05,
"tokens": {"input": 100, "output": 500},
"steps": ["fetch_data", "analyze", "synthesize"]
}
}

What gets tested:
- ✅ Output quality
- ✅ Latency
- ✅ Cost (from metadata)
- ✅ Tools called (basic list)
- ⚠️ Sequence (order only, not parameters)
Best for: Most production agents, good balance of effort vs. coverage
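
One way to assemble the Level 2 payload is to record steps and usage while the agent runs and serialize them at the end. The `RunTracker` class below is a hypothetical helper, not part of EvalView. It assumes you already know the cost of each LLM call.

```javascript
// Hypothetical helper that accumulates the metadata fields shown above.
class RunTracker {
  constructor() {
    this.steps = [];
    this.tokens = { input: 0, output: 0 };
    this.cost = 0;
  }

  // Record a step name in execution order.
  recordStep(name) {
    this.steps.push(name);
  }

  // Add the token counts and cost (USD) of one LLM call.
  recordUsage({ input, output, cost }) {
    this.tokens.input += input;
    this.tokens.output += output;
    this.cost += cost;
  }

  // Build the Level 2 response body.
  toResponse(answer) {
    return {
      response: answer,
      metadata: {
        cost: this.cost,
        tokens: { ...this.tokens },
        steps: [...this.steps],
      },
    };
  }
}
```

A handler would create one tracker per request, call `recordStep` and `recordUsage` as the agent works, and return `tracker.toResponse(answer)` as the JSON body.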
Level 3: Streaming
What you need: JSONL event stream (see below)
What gets tested: Everything with full fidelity
Best for: Complex multi-step agents, orchestrators, production systems
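
A JSONL stream is one compact JSON object per line, each terminated by a newline. The sketch below shows one way to write such events to any writable stream, such as an HTTP response. `formatEvent` and `emitExampleRun` are hypothetical names, and the event payloads are illustrative values only.

```javascript
// Serialize one event as a single JSONL line: compact JSON followed by "\n".
function formatEvent(type, data) {
  return JSON.stringify({ type, data }) + "\n";
}

// Write the events of one illustrative run to a writable stream.
function emitExampleRun(stream) {
  stream.write(formatEvent("tool_call", { name: "analyzeStock", args: { symbol: "AAPL" } }));
  stream.write(formatEvent("tool_result", { result: "...", success: true }));
  stream.write(formatEvent("usage", { input_tokens: 1000, output_tokens: 500 }));
  // End the stream with the final event so the consumer knows the run is done.
  stream.end(formatEvent("message_complete", { content: "Final response..." }));
}
```

In a Node HTTP handler, passing the response object as `stream` sends each event as soon as it is written.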
Your agent must expose an HTTP endpoint that:
- Accepts POST requests with JSON payload
- Returns a response (sync or streaming)
- Completes within a reasonable time (30-120 seconds recommended)
Example request:
POST /api/chat
{
"message": "Analyze AAPL stock performance",
"userId": "test-user"
}

Two options are supported.

Option 1: stream JSON Lines events for rich execution tracking:
{"type": "tool_call", "data": {"name": "analyzeStock", "args": {"symbol": "AAPL"}}}
{"type": "tool_result", "data": {"result": "...", "success": true}}
{"type": "usage", "data": {"input_tokens": 1000, "output_tokens": 500}}
{"type": "message_complete", "data": {"content": "Final response..."}}

Option 2: return the complete response in a single JSON object:
{
"response": "Complete agent response text...",
"metadata": {
"cost": 0.05,
"tokens": 1500
}
}

For comprehensive test coverage, emit these event types:
When your agent calls a tool/function:
{"type": "tool_call", "data": {
"name": "analyzeStock",
"args": {"symbol": "AAPL"}
}}

After tool execution:
{"type": "tool_result", "data": {
"result": "Stock analysis data...",
"success": true,
"error": null
}}

If you have step descriptions:
{"type": "step_narration", "data": {
"text": "Analyzing stock fundamentals",
"toolName": "analyzeStock"
}}

For cost tracking:
{"type": "usage", "data": {
"input_tokens": 1000,
"output_tokens": 500,
"cached_tokens": 100
}}

Complete response:
{"type": "message_complete", "data": {
"content": "Complete agent response..."
}}

Problem: A backend that keeps refining indefinitely blocks tests.
Example Bad Pattern:
// ❌ DON'T DO THIS
while (quality < threshold) {
result = await refine(result);
// No limit - can loop forever!
}

Solution: Add max iteration limits
// ✅ DO THIS
const MAX_REFINEMENTS = 3;
let refinements = 0;
while (quality < threshold && refinements < MAX_REFINEMENTS) {
result = await refine(result);
refinements++;
}

- Set max iterations: 3-5 refinements maximum
- Add time limits: Stop if taking > 30 seconds
- Check for improvements: Stop if quality isn't increasing
- Return partial results: Better to return "good enough" than timeout
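
The "check for improvements" rule can be sketched as a stall detector over successive quality scores. `hasStalled`, `refineWhileImproving`, and the `minGain` threshold are hypothetical. `refine` and `score` stand for your agent's own functions.

```javascript
// True when the latest score improved on the previous one by less than minGain.
function hasStalled(scores, minGain = 0.01) {
  if (scores.length < 2) return false;
  const [previous, current] = scores.slice(-2);
  return current - previous < minGain;
}

// Refine until the iteration cap is hit or quality stops improving.
async function refineWhileImproving(initial, { refine, score, maxRefinements = 3 }) {
  let result = initial;
  const scores = [await score(result)];
  for (let i = 0; i < maxRefinements; i++) {
    const candidate = await refine(result);
    scores.push(await score(candidate));
    if (hasStalled(scores)) break; // keep the last result that clearly improved
    result = candidate;
  }
  return result;
}
```

This version discards a candidate whose gain is below the threshold, which returns a "good enough" result instead of spending more iterations on marginal improvements.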
Example:
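// initialExecution, shouldRefine, and refine are your agent's own functions;
// stream is the writable response stream used for JSONL events.
async function executeWithRefinement(query, stream, maxRefinements = 3) {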
let result = await initialExecution(query);
let refinements = 0;
const startTime = Date.now();
const MAX_TIME = 30000; // 30 seconds
while (
shouldRefine(result) &&
refinements < maxRefinements &&
(Date.now() - startTime) < MAX_TIME
) {
result = await refine(result);
refinements++;
// Emit progress
await stream.write(JSON.stringify({
type: "status",
data: {text: `Refinement ${refinements}/${maxRefinements}`}
}) + "\n");
}
return result;
}

| Agent Type | Recommended Timeout |
|---|---|
| Simple Q&A | 10-30 seconds |
| Multi-step analysis | 30-90 seconds |
| Complex orchestration | 60-120 seconds |
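
One possible way to enforce a timeout from the table is to race the agent call against a timer and fall back to a partial result. `withTimeout` is a hypothetical helper. It does not cancel the underlying work, which would need separate handling (for example an `AbortController` passed to your own calls).

```javascript
// Resolve with the task's result, or with fallback() if timeoutMs elapses first.
function withTimeout(task, timeoutMs, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback()), timeoutMs);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([task, timeout]).finally(() => clearTimeout(timer));
}
```

For a simple Q&A agent this might be called as `withTimeout(runAgent(query), 30000, () => partialResult)`, where `runAgent` and `partialResult` are your own names.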
If your agent needs more time:
- Reduce complexity
- Limit refinement iterations
- Parallelize tool calls
- Cache expensive operations
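
The last two suggestions can be sketched as follows: run independent tool calls concurrently with `Promise.all`, and memoize expensive async calls by their arguments. `memoizeAsync`, `gatherStockData`, `fetchPrice`, and `fetchNews` are hypothetical names used only for illustration.

```javascript
// Cache the promise returned by fn, keyed by its JSON-serialized arguments.
function memoizeAsync(fn) {
  const cache = new Map();
  return (...args) => {
    const key = JSON.stringify(args);
    if (!cache.has(key)) {
      const pending = Promise.resolve(fn(...args));
      pending.catch(() => cache.delete(key)); // do not keep failed calls cached
      cache.set(key, pending);
    }
    return cache.get(key);
  };
}

// Run two independent tool calls concurrently instead of one after the other.
async function gatherStockData(symbol, tools) {
  const [price, news] = await Promise.all([
    tools.fetchPrice(symbol),
    tools.fetchNews(symbol),
  ]);
  return { price, news };
}
```

With sequential awaits the total time is the sum of both calls. With `Promise.all` it is roughly the slower of the two.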
To enable cost evaluation, emit usage events:
{"type": "usage", "data": {
"input_tokens": 1000,
"output_tokens": 500,
"cached_tokens": 100,
"model": "gpt-4"
}}

The model field is optional.

EvalView will:
- Sum tokens across all steps
- Calculate costs using built-in pricing
- Compare against test thresholds
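
To illustrate the three steps above, the sketch below sums `usage` events and prices them. It is not EvalView's code. The `PRICING` table uses made-up rates and a made-up model name, and `cached_tokens` are summed but not priced because this document does not say how they are billed.

```javascript
// Hypothetical USD prices per 1 million tokens; not EvalView's built-in pricing.
const PRICING = {
  "example-model": { input: 3.0, output: 15.0 },
};

// Sum tokens across all usage events and compute an estimated cost.
function summarizeUsage(events, model = "example-model") {
  const totals = { input_tokens: 0, output_tokens: 0, cached_tokens: 0 };
  for (const event of events) {
    if (event.type !== "usage") continue;
    for (const key of Object.keys(totals)) {
      totals[key] += event.data[key] ?? 0;
    }
  }
  const price = PRICING[model];
  const cost =
    (totals.input_tokens * price.input + totals.output_tokens * price.output) / 1e6;
  return { ...totals, cost };
}

// Compare the estimated cost against a test threshold in USD.
function withinBudget(summary, maxCost) {
  return summary.cost <= maxCost;
}
```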
Before running EvalView tests:
- Test response time:
  time curl -X POST http://localhost:3000/api/chat \
    -H "Content-Type: application/json" \
    -d '{"message": "Test query"}'
  Should complete in < 60 seconds.
- Check event format:
  curl -X POST http://localhost:3000/api/chat \
    -H "Content-Type: application/json" \
    -d '{"message": "Test query"}' \
    | jq -R 'fromjson? | .type'
  Should show event types.
- Monitor logs: Check for infinite loops or stuck operations.
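
If `jq` is not available, the event-format check can be approximated in Node. The sketch below parses captured stream output line by line, skips lines that are not JSON objects (similar in spirit to `fromjson?`), and reports the event types. `parseJsonl` and `checkStream` are hypothetical names.

```javascript
// Parse JSONL text into an array of event objects, ignoring unparseable lines.
function parseJsonl(text) {
  const events = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed = JSON.parse(line);
      if (parsed && typeof parsed === "object") events.push(parsed);
    } catch {
      // Not valid JSON; skip it.
    }
  }
  return events;
}

// Summarize a captured stream: event types in order, and whether it ended cleanly.
function checkStream(text) {
  const types = parseJsonl(text).map((event) => event.type);
  return {
    types,
    endsWithMessageComplete: types[types.length - 1] === "message_complete",
  };
}
```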
Cause: Backend stuck in refinement loop
Fix: Add max refinement limit (3-5 iterations)

Cause: Not emitting usage events
Fix: Emit {"type": "usage", ...} after each LLM call

Cause: Not emitting tool_call/tool_result events
Fix: Emit events when tools execute

Cause: Streaming not properly closed
Fix: Always emit final message_complete event
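
For the last fix, one way to guarantee a final `message_complete` event is to emit it from a `finally` block so it is sent even when the agent throws. `streamRun` and `writeEvent` are hypothetical names, and placing the error text in `content` is an assumption, not a documented EvalView convention.

```javascript
// Write one JSONL event to the stream.
function writeEvent(stream, type, data) {
  stream.write(JSON.stringify({ type, data }) + "\n");
}

// Run the agent and always finish the stream with message_complete.
async function streamRun(stream, runAgent) {
  let content = "";
  try {
    // runAgent receives an emit callback for intermediate events.
    content = await runAgent((type, data) => writeEvent(stream, type, data));
  } catch (err) {
    content = `Agent failed: ${err.message}`;
  } finally {
    writeEvent(stream, "message_complete", { content });
    stream.end();
  }
}
```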
If you're using the TapeScope backend:
- Fix the refinement loop in your orchestrator:
  - Limit "TinyLLM refinement decision" to max 3-5 iterations
  - Add timeout after 30 seconds
  - Return results even if quality < threshold
- Emit streaming events in your API route:
  - When calling `analyzeStock`, emit `tool_call` event
  - After tool completes, emit `tool_result` event
  - At the end, emit `message_complete` event
- Track token usage:
  - Sum tokens from all LLM calls
  - Emit `usage` event with totals

- See `examples/` directory for reference implementations
- Check `evalview/adapters/` for adapter code
- File issues at: https://github.com/hidai25/eval-view/issues

- Adapters — All supported adapters and configuration options
- Trace Specification — Execution trace format produced from API events
- Cost Tracking — How token usage events enable cost monitoring
- Troubleshooting — Common API connection and parsing issues
- Getting Started — Install EvalView and run your first test