Backend Requirements — How to Make Your AI Agent Compatible with EvalView

This document outlines what your AI agent backend needs to expose for EvalView testing. EvalView supports three tiers: basic (just text response), metadata (response + cost/tokens), and full streaming (JSONL event stream with tool calls).

Quick Start (5 Minutes)

Minimum to get started:

Your agent must respond to POST requests:

curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Test query"}'

Response must include the agent's answer:
```
{"response": "Agent answer here..."}
```
That's it! You can now test output quality and latency.

To add cost tracking (10 more minutes):

Add metadata to your response:

{
  "response": "Agent answer...",
  "metadata": {
    "cost": 0.05,
    "tokens": {"input": 100, "output": 500}
  }
}

For full tool tracking (20 more minutes):

Stream JSONL events (see Level 3 below).

Overview

EvalView is a general-purpose testing framework for AI agents. It works with any agent that:

Accepts queries via HTTP API
Returns responses (streaming or non-streaming)
Can emit structured execution data (optional but recommended)

Three-Tier Support Model

EvalView supports agents at different levels of sophistication:

Level	What You Provide	What Gets Tested	Setup Time
Level 1: Basic	Just text response	Output quality, latency	5 minutes
Level 2: Metadata	Response + cost/tokens	Everything except tool sequence	15 minutes
Level 3: Streaming	Full event stream	Everything (tools, cost, sequence)	30 minutes

Level 1: Basic Agent (Minimum Viable)

What you need:

POST /api/chat
{"message": "What is AAPL stock price?"}

Response:
{"response": "AAPL is trading at $266.25..."}

What gets tested:

✅ Output quality (contains expected keywords)
✅ Latency (response time)
❌ Cost (will show $0 with warning)
❌ Tools called
❌ Sequence correctness

Best for: Quick start, proof of concept, simple agents

Level 2: Agent with Metadata (Recommended)

What you need:

POST /api/chat
{"message": "Analyze AAPL stock"}

Response:
{
  "response": "AAPL analysis...",
  "metadata": {
    "cost": 0.05,
    "tokens": {"input": 100, "output": 500},
    "steps": ["fetch_data", "analyze", "synthesize"]
  }
}

What gets tested:

✅ Output quality
✅ Latency
✅ Cost (from metadata)
✅ Tools called (basic list)
⚠️ Sequence (order only, not parameters)

Best for: Most production agents, good balance of effort vs. coverage

Level 3: Streaming Agent (Full Featured)

What you need: JSONL event stream (see below)

What gets tested: Everything with full fidelity

Best for: Complex multi-step agents, orchestrators, production systems

Core Requirements

1. HTTP API Endpoint

Your agent must expose an HTTP endpoint that:

Accepts POST requests with JSON payload
Returns a response (sync or streaming)
Completes within reasonable time (30-120 seconds recommended)

Example request:

POST /api/chat
{
  "message": "Analyze AAPL stock performance",
  "userId": "test-user"
}

2. Response Format

Two options supported:

Option A: JSONL Streaming (Recommended)

Stream JSON Lines events for rich execution tracking:

{"type": "tool_call", "data": {"name": "analyzeStock", "args": {"symbol": "AAPL"}}}
{"type": "tool_result", "data": {"result": "...", "success": true}}
{"type": "usage", "data": {"input_tokens": 1000, "output_tokens": 500}}
{"type": "message_complete", "data": {"content": "Final response..."}}

Option B: Simple JSON Response

Return complete response in single JSON:

{
  "response": "Complete agent response text...",
  "metadata": {
    "cost": 0.05,
    "tokens": 1500
  }
}

Recommended: Structured Event Streaming

For comprehensive test coverage, emit these event types:

Tool Call Events

When your agent calls a tool/function:

{"type": "tool_call", "data": {
  "name": "analyzeStock",
  "args": {"symbol": "AAPL"}
}}

Tool Result Events

After tool execution:

{"type": "tool_result", "data": {
  "result": "Stock analysis data...",
  "success": true,
  "error": null
}}

Step Narration (Alternative)

If you have step descriptions:

{"type": "step_narration", "data": {
  "text": "Analyzing stock fundamentals",
  "toolName": "analyzeStock"
}}

Usage/Cost Events

For cost tracking:

{"type": "usage", "data": {
  "input_tokens": 1000,
  "output_tokens": 500,
  "cached_tokens": 100
}}

Final Message

Complete response:

{"type": "message_complete", "data": {
  "content": "Complete agent response..."
}}

Critical: Avoid Infinite Loops

Problem: Your backend refining indefinitely blocks tests.

Example Bad Pattern:

// ❌ DON'T DO THIS
while (quality < threshold) {
  result = await refine(result);
  // No limit - can loop forever!
}

Solution: Add max iteration limits

// ✅ DO THIS
const MAX_REFINEMENTS = 3;
let refinements = 0;

while (quality < threshold && refinements < MAX_REFINEMENTS) {
  result = await refine(result);
  refinements++;
}

Recommended Refinement Strategy

Set max iterations: 3-5 refinements maximum
Add time limits: Stop if taking > 30 seconds
Check for improvements: Stop if quality isn't increasing
Return partial results: Better to return "good enough" than timeout

Example:

async function executeWithRefinement(query, maxRefinements = 3) {
  let result = await initialExecution(query);
  let refinements = 0;
  const startTime = Date.now();
  const MAX_TIME = 30000; // 30 seconds

  while (
    shouldRefine(result) &&
    refinements < maxRefinements &&
    (Date.now() - startTime) < MAX_TIME
  ) {
    result = await refine(result);
    refinements++;

    // Emit progress
    await stream.write(JSON.stringify({
      type: "status",
      data: {text: `Refinement ${refinements}/${maxRefinements}`}
    }) + "\n");
  }

  return result;
}

Response Time Guidelines

Agent Type	Recommended Timeout
Simple Q&A	10-30 seconds
Multi-step analysis	30-90 seconds
Complex orchestration	60-120 seconds

If your agent needs more time:

Reduce complexity
Limit refinement iterations
Parallelize tool calls
Cache expensive operations

Cost & Token Tracking

To enable cost evaluation, emit usage events:

{"type": "usage", "data": {
  "input_tokens": 1000,
  "output_tokens": 500,
  "cached_tokens": 100,
  "model": "gpt-4" // optional
}}

EvalView will:

Sum tokens across all steps
Calculate costs using built-in pricing
Compare against test thresholds

Testing Your Backend

Before running EvalView tests:

Test response time:

time curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Test query"}'

Should complete in < 60 seconds.

Check event format:

curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Test query"}' \
  | jq -R 'fromjson? | .type'

Should show event types.

Monitor logs: Check for infinite loops or stuck operations.

Common Issues

Issue: Tests timeout

Cause: Backend stuck in refinement loop Fix: Add max refinement limit (3-5 iterations)

Issue: No cost tracking

Cause: Not emitting usage events Fix: Emit {"type": "usage", ...} after each LLM call

Issue: No tool calls captured

Cause: Not emitting tool_call/tool_result events Fix: Emit events when tools execute

Issue: Empty or incomplete responses

Cause: Streaming not properly closed Fix: Always emit final message_complete event

TapeScope-Specific Notes

If you're using TapeScope backend:

Fix refinement loop in your orchestrator:
- Limit "TinyLLM refinement decision" to max 3-5 iterations
- Add timeout after 30 seconds
- Return results even if quality < threshold
Emit streaming events in your API route:
- When calling analyzeStock, emit tool_call event
- After tool completes, emit tool_result event
- At the end, emit message_complete event
Track token usage:
- Sum tokens from all LLM calls
- Emit usage event with totals

Questions?

See examples/ directory for reference implementations
Check evalview/adapters/ for adapter code
File issues at: https://github.com/hidai25/eval-view/issues

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backend Requirements — How to Make Your AI Agent Compatible with EvalView

Quick Start (5 Minutes)

Overview

Three-Tier Support Model

Level 1: Basic Agent (Minimum Viable)

Level 2: Agent with Metadata (Recommended)

Level 3: Streaming Agent (Full Featured)

Core Requirements

1. HTTP API Endpoint

2. Response Format

Option A: JSONL Streaming (Recommended)

Option B: Simple JSON Response

Recommended: Structured Event Streaming

Tool Call Events

Tool Result Events

Step Narration (Alternative)

Usage/Cost Events

Final Message

Critical: Avoid Infinite Loops

Recommended Refinement Strategy

Response Time Guidelines

Cost & Token Tracking

Testing Your Backend

Common Issues

Issue: Tests timeout

Issue: No cost tracking

Issue: No tool calls captured

Issue: Empty or incomplete responses

TapeScope-Specific Notes

Questions?

Related Documentation

FilesExpand file tree

BACKEND_REQUIREMENTS.md

Latest commit

History

BACKEND_REQUIREMENTS.md

File metadata and controls

Backend Requirements — How to Make Your AI Agent Compatible with EvalView

Quick Start (5 Minutes)

Overview

Three-Tier Support Model

Level 1: Basic Agent (Minimum Viable)

Level 2: Agent with Metadata (Recommended)

Level 3: Streaming Agent (Full Featured)

Core Requirements

1. HTTP API Endpoint

2. Response Format

Option A: JSONL Streaming (Recommended)

Option B: Simple JSON Response

Recommended: Structured Event Streaming

Tool Call Events

Tool Result Events

Step Narration (Alternative)

Usage/Cost Events

Final Message

Critical: Avoid Infinite Loops

Recommended Refinement Strategy

Response Time Guidelines

Cost & Token Tracking

Testing Your Backend

Common Issues

Issue: Tests timeout

Issue: No cost tracking

Issue: No tool calls captured

Issue: Empty or incomplete responses

TapeScope-Specific Notes

Questions?

Related Documentation