| title | Evals quickstart |
|---|---|
| subtitle | Get started with AI agent testing in 5 minutes |
| slug | observability/evals-quickstart |
This quickstart guide helps you set up automated testing for your AI assistants and squads. In just a few minutes, you'll create mock conversations, define expected behaviors, and validate that your agents work correctly before they reach production.
Evals is Vapi's AI agent testing framework that enables you to systematically test assistants and squads using mock conversations with automated validation. Test your agents by:
- Creating mock conversations - Define user messages and expected assistant responses
- Validating behavior - Use exact match, regex patterns, or AI-powered judging
- Testing tool calls - Verify function calls with specific arguments
- Running automated tests - Execute tests and receive detailed pass/fail results
- Debugging failures - Review full conversation transcripts with evaluation details
Evals help you maintain quality and catch issues early:
- Pre-deployment testing - Validate new assistant configurations before going live
- Regression testing - Ensure prompt or tool changes don't break existing behaviors
- Conversation flow validation - Test multi-turn interactions and complex scenarios
- Tool calling verification - Validate function calls with correct arguments
- Squad handoff testing - Ensure smooth transitions between squad members
- CI/CD integration - Automate quality gates in your deployment pipeline
An evaluation suite for an appointment booking assistant that tests:
- Greeting and initial response validation
- Tool call execution with specific arguments
- Response pattern matching with regex
- Semantic validation using AI judges
- Multi-turn conversation flows
Define a mock conversation to test your assistant's greeting behavior.
1. Log in to [dashboard.vapi.ai](https://dashboard.vapi.ai)
2. Click on **Evals** in the left sidebar (under Observability)
3. Click **Create Evaluation**

<Step title="Configure basic settings">
1. **Name**: Enter "Greeting Test"
2. **Description**: Add "Verify assistant greets users appropriately"
3. **Type**: Automatically set to "chat.mockConversation"
</Step>
<Step title="Add conversation turns">
1. Click **Add Message**
2. Select **User** message type
3. Enter content: "Hello"
4. Click **Add Message** again
5. Select **Assistant** message type
6. Click **Enable Evaluation** toggle
7. Select **Exact Match** as judge type
8. Enter expected content: "Hello! How can I help you today?"
9. Click **Save Evaluation**
</Step>
</Steps>
<Tip>
Your evaluation is now saved. You can run it against any assistant or squad.
</Tip>
Response:
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"orgId": "org-123",
"type": "chat.mockConversation",
"name": "Greeting Test",
"description": "Verify assistant greets users appropriately",
"messages": [...],
"createdAt": "2024-01-15T09:30:00Z",
"updatedAt": "2024-01-15T09:30:00Z"
}

Save the returned `id` - you'll need it to run the evaluation.
For complete API details, see Create Eval.
**Message structure:** Each conversation turn has a `role` (user, assistant, system, or tool). Assistant messages with `judgePlan` define what to validate.

Execute the evaluation against your assistant or squad.
1. Navigate to **Evals** in the sidebar
2. Click on "Greeting Test" from your evaluations list

<Step title="Select target and run">
1. In the evaluation detail page, find the **Run Test** section
2. Select **Assistant** or **Squad** as the target type
3. Choose your assistant/squad from the dropdown
4. Click **Run Evaluation**
5. Watch real-time progress as the test executes
</Step>
<Step title="View results">
Results appear automatically when the test completes:
- ✅ **Green checkmark** indicates evaluation passed
- ❌ **Red X** indicates evaluation failed
- Click **View Details** to see full conversation transcript
</Step>
</Steps>
curl -X POST "https://api.vapi.ai/eval/run" \
-H "Authorization: Bearer $VAPI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"evalId": "550e8400-e29b-41d4-a716-446655440000",
"target": {
"type": "assistant",
"assistantId": "your-assistant-id"
}
}'

Response:
{
"id": "eval-run-123",
"evalId": "550e8400-e29b-41d4-a716-446655440000",
"orgId": "org-123",
"status": "queued",
"createdAt": "2024-01-15T09:35:00Z",
"updatedAt": "2024-01-15T09:35:00Z"
}

Check results:
curl -X GET "https://api.vapi.ai/eval/run/eval-run-123" \
-H "Authorization: Bearer $VAPI_API_KEY"

For complete API details, see Create Eval Run and Get Eval Run.
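A new run starts in the `queued` state, so a script typically has to poll Get Eval Run until `status` becomes `ended`. The following is a minimal Python sketch of that loop, not an official client. The names `wait_for_run` and `fetch_run`, the polling interval, and the timeout are all assumptions made for illustration; in practice `fetch_run` would wrap the `GET /eval/run/{id}` request shown above and return the parsed JSON body.

```python
import time
from typing import Callable


def wait_for_run(
    fetch_run: Callable[[str], dict],
    run_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the run reports status "ended", or raise TimeoutError.

    fetch_run is a hypothetical callable that returns the run as a dict.
    """
    waited = 0.0
    while True:
        run = fetch_run(run_id)
        if run.get("status") == "ended":
            return run
        if waited >= timeout_s:
            raise TimeoutError(f"run {run_id} did not end within {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `fetch_run` and `sleep` as parameters keeps the loop testable without network access or real delays.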
You can also run evaluations with transient assistant or squad configurations by providing `assistant` or `squad` objects instead of IDs in the target.

Learn to interpret evaluation results and identify issues.
When all checks pass, you'll see:
{
"id": "eval-run-123",
"evalId": "550e8400-e29b-41d4-a716-446655440000",
"status": "ended",
"endedReason": "mockConversation.done",
"results": [
{
"status": "pass",
"messages": [
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"content": "Hello! How can I help you today?",
"judge": {
"status": "pass"
}
}
]
}
]
}

Pass criteria:

- `status` is "ended"
- `endedReason` is "mockConversation.done"
- `results[0].status` is "pass"
- All `judge.status` values are "pass"
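The pass criteria can be checked programmatically once the run has been fetched. This is a small Python sketch that assumes the run has already been parsed into a dict with the shape shown in the response above; `run_passed` is a hypothetical helper name, not part of any Vapi SDK.

```python
def run_passed(run: dict) -> bool:
    """Return True only if the run meets all four pass criteria."""
    results = run.get("results") or []
    if run.get("status") != "ended":
        return False
    if run.get("endedReason") != "mockConversation.done":
        return False
    if not results or results[0].get("status") != "pass":
        return False
    # Every message that carries a judge verdict must have passed.
    return all(
        msg["judge"].get("status") == "pass"
        for result in results
        for msg in result.get("messages", [])
        if "judge" in msg
    )
```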
When validation fails, you'll see details:
{
"status": "ended",
"endedReason": "mockConversation.done",
"results": [
{
"status": "fail",
"messages": [
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"content": "Hi there! What can I do for you?",
"judge": {
"status": "fail",
"failureReason": "Expected exact match: 'Hello! How can I help you today?' but got: 'Hi there! What can I do for you?'"
}
}
]
}
]
}

Failure indicators:

- `results[0].status` is "fail"
- `judge.status` is "fail"
- `judge.failureReason` explains why validation failed
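When a run fails, the quickest path to the cause is usually the `failureReason` on each failed judge. The Python sketch below gathers those reasons from a parsed run dict shaped like the failed response above; `failure_reasons` is a hypothetical helper, and the fallback text for a missing `failureReason` is an assumption.

```python
def failure_reasons(run: dict) -> list[str]:
    """Collect one line per failed judge, with message index and role."""
    reasons = []
    for result in run.get("results", []):
        for index, msg in enumerate(result.get("messages", [])):
            judge = msg.get("judge") or {}
            if judge.get("status") == "fail":
                reason = judge.get("failureReason", "no failureReason provided")
                reasons.append(f"message {index} ({msg.get('role')}): {reason}")
    return reasons
```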
Validate that your assistant calls functions with correct arguments.
Test appointment booking with exact argument matching:
1. Create new evaluation: "Appointment Booking Test"
2. Add user message: "Book me an appointment for next Monday at 2pm"
3. Add assistant message with evaluation enabled
4. Select **Exact Match** judge type
5. Click **Add Tool Call**
6. Enter function name: "bookAppointment"
7. Add arguments:
   - `date`: "2025-01-20"
   - `time`: "14:00"
8. Add tool response message:
   - Type: **Tool**
   - Content: `{"status": "success", "confirmationId": "APT-12345"}`
9. Add final assistant message to verify confirmation
10. Save evaluation

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Appointment Booking Test",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "Book me an appointment for next Monday at 2pm"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "toolCalls": [{
            "name": "bookAppointment",
            "arguments": {
              "date": "2025-01-20",
              "time": "14:00"
            }
          }]
        }
      },
      {
        "role": "tool",
        "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "regex",
          "content": ".*confirmed.*APT-12345.*"
        }
      }
    ]
  }'
```

For API details, see Create Eval.
Exact match - Full validation:
{
"judgePlan": {
"type": "exact",
"toolCalls": [
{
"name": "bookAppointment",
"arguments": {
"date": "2025-01-20",
"time": "14:00"
}
}
]
}
}

Validates both function name AND all argument values exactly.
Partial match - Name only:
{
"judgePlan": {
"type": "regex",
"toolCalls": [
{
"name": "bookAppointment"
}
]
}
}

Validates only that the function was called (arguments can vary).
Multiple tool calls:
{
"judgePlan": {
"type": "exact",
"toolCalls": [
{
"name": "checkAvailability",
"arguments": { "date": "2025-01-20" }
},
{
"name": "bookAppointment",
"arguments": { "date": "2025-01-20", "time": "14:00" }
}
]
}
}

Validates multiple function calls in sequence.
Tool calls are validated in the order they're defined. Use `type: "exact"` for strict validation or `type: "regex"` for flexible validation.

When responses vary slightly (like names, dates, or IDs), use regex patterns for flexible matching.
Greeting variations:
{
"judgePlan": {
"type": "regex",
"content": "^(Hello|Hi|Hey),? (I can|I'll|let me) help.*"
}
}

Matches: "Hello, I can help...", "Hi I'll help...", "Hey let me help..."
Responses with variables:
{
"judgePlan": {
"type": "regex",
"content": ".*appointment.*confirmed.*[A-Z]{3}-[0-9]{5}.*"
}
}

Matches any confirmation message with appointment ID format.
Date patterns:
{
"judgePlan": {
"type": "regex",
"content": ".*scheduled for (Monday|Tuesday|Wednesday|Thursday|Friday).*"
}
}

Matches responses mentioning weekdays.
Case-insensitive matching:
{
"judgePlan": {
"type": "regex",
"content": "(?i)booking confirmed"
}
}

The `(?i)` flag makes matching case-insensitive.
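Before saving a regex judge, it can help to try the pattern locally against a few sample responses. The sketch below does this with Python's `re` module for the four patterns above. Two assumptions apply: the engine that evaluates `judgePlan` patterns may differ from Python's in edge cases, and the source does not say whether a pattern must match the whole response or only part of it, so the sketch uses a partial search. `PATTERNS` and `matches` are illustrative names.

```python
import re

# The four patterns from the examples above, keyed by hypothetical labels.
PATTERNS = {
    "greeting": r"^(Hello|Hi|Hey),? (I can|I'll|let me) help.*",
    "confirmation": r".*appointment.*confirmed.*[A-Z]{3}-[0-9]{5}.*",
    "weekday": r".*scheduled for (Monday|Tuesday|Wednesday|Thursday|Friday).*",
    "case_insensitive": r"(?i)booking confirmed",
}


def matches(name: str, response: str) -> bool:
    """Return True if the named pattern is found in the response."""
    return re.search(PATTERNS[name], response) is not None
```

For example, `matches("greeting", "Hi I'll help")` is true, while `matches("weekday", "You are scheduled for Saturday")` is false because the pattern lists only Monday through Friday.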
For complex validation criteria beyond pattern matching, use AI-powered judges to evaluate responses semantically.
{
"role": "assistant",
"judgePlan": {
"type": "ai",
"model": {
"provider": "openai",
"model": "gpt-4o",
"messages": [
{
"role": "system",
"content": "Your evaluation prompt here"
}
]
}
}
}

Template structure:
You are an LLM-Judge. Evaluate ONLY the last assistant message in the mock conversation: {{messages[-1]}}.
Include the full conversation history for context: {{messages}}
Decision rule:
- PASS if ALL "pass criteria" are satisfied AND NONE of the "fail criteria" are triggered.
- Otherwise FAIL.
Pass criteria:
- [Specific requirement 1]
- [Specific requirement 2]
Fail criteria (any one triggers FAIL):
- [Specific failure condition 1]
- [Specific failure condition 2]
Output format: respond with exactly one word: pass or fail
- No explanations
- No punctuation
- No additional text
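If you write many AI judges, generating the prompt from lists of criteria keeps the wording consistent. The following Python sketch fills the template above and wraps it in a `judgePlan` of type `ai`. The function name `build_ai_judge_plan` and its parameters are hypothetical; the default provider and model are simply the ones used in this guide's examples.

```python
def build_ai_judge_plan(
    pass_criteria: list[str],
    fail_criteria: list[str],
    provider: str = "openai",
    model: str = "gpt-4o",
) -> dict:
    """Assemble an AI judgePlan whose system prompt follows the template."""
    lines = [
        "You are an LLM-Judge. Evaluate ONLY the last assistant message "
        "in the mock conversation: {{messages[-1]}}.",
        "Include the full conversation history for context: {{messages}}",
        "Decision rule:",
        '- PASS if ALL "pass criteria" are satisfied AND NONE of the '
        '"fail criteria" are triggered.',
        "- Otherwise FAIL.",
        "Pass criteria:",
        *[f"- {item}" for item in pass_criteria],
        "Fail criteria (any one triggers FAIL):",
        *[f"- {item}" for item in fail_criteria],
        "Output format: respond with exactly one word: pass or fail",
        "- No explanations",
        "- No punctuation",
        "- No additional text",
    ]
    return {
        "type": "ai",
        "model": {
            "provider": provider,
            "model": model,
            "messages": [{"role": "system", "content": "\n".join(lines)}],
        },
    }
```

The returned dict can be placed under `judgePlan` on an assistant message before the eval is serialized to JSON.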
- Best for general-purpose evaluation
- **Models:** claude-3-5-sonnet-20241022, claude-3-opus-20240229. Best for nuanced evaluation
- **Models:** gemini-1.5-pro, gemini-1.5-flash. Best for multilingual content
- **Models:** llama-3.1-70b-versatile, mixtral-8x7b-32768. Best for fast evaluation

Custom LLM:
{
"model": {
"provider": "custom-llm",
"model": "your-model-name",
"url": "https://your-api-endpoint.com/chat/completions",
"messages": [...]
}
}

Define what happens after an evaluation passes or fails using `continuePlan`.
Stop the test immediately if a critical check fails:
{
"role": "assistant",
"judgePlan": {
"type": "exact",
"content": "I can help you with that."
},
"continuePlan": {
"exitOnFailureEnabled": true
}
}

Use case: Skip expensive subsequent tests when initial validation fails.
Provide fallback responses to continue testing even when validation fails:
{
"role": "assistant",
"judgePlan": {
"type": "exact",
"content": "I've processed your request."
},
"continuePlan": {
"exitOnFailureEnabled": false,
"contentOverride": "Let me rephrase that...",
"toolCallsOverride": [
{
"name": "retryProcessing",
"arguments": { "retry": "true" }
}
]
}
}

Use case: Test error recovery paths or force specific tool calls for subsequent validation.
1. Create evaluation with multiple conversation turns
2. For each assistant message with critical validation:
   - Enable evaluation
   - Configure judge plan (exact, regex, or AI)
   - Toggle **Exit on Failure** to stop test early
3. For non-critical checks, leave **Exit on Failure** disabled

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Multi-Step with Control",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "I want to book an appointment"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "content": "I can help you book an appointment."
        },
        "continuePlan": {
          "exitOnFailureEnabled": true
        }
      },
      {
        "role": "user",
        "content": "Monday at 2pm"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "toolCalls": [{"name": "bookAppointment"}]
        },
        "continuePlan": {
          "exitOnFailureEnabled": false,
          "contentOverride": "Booking confirmed for Monday at 2pm.",
          "toolCallsOverride": [{
            "name": "bookAppointment",
            "arguments": {"date": "2025-01-20", "time": "14:00"}
          }]
        }
      }
    ]
  }'
```

If `exitOnFailureEnabled` is `true` and validation fails, the test stops immediately. Subsequent conversation turns are not executed. Use this for critical checkpoints.

Validate multi-turn interactions that simulate real user conversations.
Create a comprehensive test:

1. **Turn 1 - Initial request:**
- User: "I need to schedule an appointment"
- Assistant evaluation: AI judge checking acknowledgment
2. **Turn 2 - Provide details:**
- User: "Next Monday at 2pm"
- Assistant evaluation: Exact match on tool call `bookAppointment`
3. **Turn 3 - Tool response:**
- Tool: `{"status": "success", "confirmationId": "APT-12345"}`
4. **Turn 4 - Confirmation:**
- Assistant evaluation: Regex matching confirmation with ID
5. **Turn 5 - Follow-up:**
- User: "Can I get that via email?"
- Assistant evaluation: Exact match on tool call `sendEmail`
For API details, see Create Eval.
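The five turns above can also be assembled as a request body in code. This Python sketch builds one possible payload for Create Eval. Several details are assumptions: the eval name, the wording of the AI judge prompt, the regex for the confirmation turn (reused from the earlier booking example), and the use of name-only `toolCalls` entries, since the outline does not specify arguments. `assistant_turn` is a hypothetical helper.

```python
import json


def assistant_turn(judge_plan: dict) -> dict:
    """An assistant message whose reply is checked by the given judgePlan."""
    return {"role": "assistant", "judgePlan": judge_plan}


# Assumed prompt wording for the Turn 1 acknowledgment check.
ai_acknowledgment = {
    "type": "ai",
    "model": {
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [{
            "role": "system",
            "content": "PASS if the response acknowledges the scheduling "
                       "request. Otherwise FAIL. Output: pass or fail",
        }],
    },
}

booking_flow = {
    "name": "Complete Booking Flow",  # hypothetical name
    "type": "chat.mockConversation",
    "messages": [
        {"role": "user", "content": "I need to schedule an appointment"},
        assistant_turn(ai_acknowledgment),
        {"role": "user", "content": "Next Monday at 2pm"},
        assistant_turn({"type": "exact", "toolCalls": [{"name": "bookAppointment"}]}),
        {"role": "tool", "content": json.dumps(
            {"status": "success", "confirmationId": "APT-12345"})},
        assistant_turn({"type": "regex", "content": ".*confirmed.*APT-12345.*"}),
        {"role": "user", "content": "Can I get that via email?"},
        assistant_turn({"type": "exact", "toolCalls": [{"name": "sendEmail"}]}),
    ],
}

# Serialized body, ready to send as the POST /eval request data.
payload = json.dumps(booking_flow, indent=2)
```

Note that the tool message content is itself a JSON string, which is why it is serialized separately with `json.dumps`.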
Inject system prompts mid-conversation to test dynamic behavior changes:
{
"messages": [
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"judgePlan": {
"type": "regex",
"content": ".*help.*"
}
},
{
"role": "system",
"content": "You are now in urgent mode. Prioritize speed."
},
{
"role": "user",
"content": "I need immediate help"
},
{
"role": "assistant",
"judgePlan": {
"type": "ai",
"model": {
"provider": "openai",
"model": "gpt-4o",
"messages": [
{
"role": "system",
"content": "PASS if response shows urgency. FAIL if response is casual. Output: pass or fail"
}
]
}
}
}
]
}

List, update, and organize your evaluation suite.
1. Navigate to **Evals** in the sidebar
2. View all evaluations in a table with:
   - Name and description
   - Created date
   - Last run status
   - Actions (Edit, Run, Delete)
3. Use search to filter by name
4. Sort by date or status

```bash
curl -X GET "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY"
```

Response:
{
"results": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"name": "Greeting Test",
"description": "Verify assistant greets users appropriately",
"type": "chat.mockConversation",
"createdAt": "2024-01-15T09:30:00Z",
"updatedAt": "2024-01-15T09:30:00Z"
}
],
"page": 1,
"total": 1
}

For API details, see List Evals.
1. Navigate to **Evals** and click on an evaluation
2. Click **Edit** button
3. Modify conversation turns, judge plans, or settings
4. Click **Save Changes**
5. Previous test runs remain unchanged

```bash
curl -X PATCH "https://api.vapi.ai/eval/550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Updated Greeting Test",
    "description": "Enhanced greeting validation",
    "messages": [
      {
        "role": "user",
        "content": "Hi there"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "regex",
          "content": "^(Hello|Hi|Hey).*"
        }
      }
    ]
  }'
```

For API details, see Update Eval.
1. Navigate to **Evals**
2. Click on an evaluation
3. Click **Delete** button
4. Confirm deletion

<Warning>
Deleting an evaluation does NOT delete its run history. Past run results remain accessible.
</Warning>
For API details, see Delete Eval.
1. Navigate to **Evals**
2. Click on an evaluation
3. View **Runs** tab showing:
   - Run timestamp
   - Target (assistant/squad)
   - Status (pass/fail)
   - Duration
4. Click any run to view detailed results

**List all runs:**

```bash
curl -X GET "https://api.vapi.ai/eval/run" \
  -H "Authorization: Bearer $VAPI_API_KEY"
```

Filter by eval ID:
curl -X GET "https://api.vapi.ai/eval/run?evalId=550e8400-e29b-41d4-a716-446655440000" \
-H "Authorization: Bearer $VAPI_API_KEY"

For API details, see List Eval Runs.
{
"id": "eval-run-123",
"evalId": "550e8400-e29b-41d4-a716-446655440000",
"orgId": "org-123",
"status": "ended",
"endedReason": "mockConversation.done",
"createdAt": "2024-01-15T09:35:00Z",
"updatedAt": "2024-01-15T09:35:45Z",
"results": [
{
"status": "pass",
"messages": [
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"content": "Hello! How can I help you today?",
"judge": {
"status": "pass"
}
}
]
}
],
"target": {
"type": "assistant",
"assistantId": "your-assistant-id"
}
}

Indicators of success:

- ✅ `status` is "ended"
- ✅ `endedReason` is "mockConversation.done"
- ✅ `results[0].status` is "pass"
- ✅ All `judge.status` values are "pass"
{
"id": "eval-run-124",
"status": "ended",
"endedReason": "mockConversation.done",
"results": [
{
"status": "fail",
"messages": [
{
"role": "user",
"content": "Book an appointment for Monday at 2pm"
},
{
"role": "assistant",
"content": "Sure, let me help you with that.",
"toolCalls": [
{
"name": "bookAppointment",
"arguments": {
"date": "2025-01-20",
"time": "2:00 PM"
}
}
],
"judge": {
"status": "fail",
"failureReason": "Tool call arguments mismatch. Expected time: '14:00' but got: '2:00 PM'"
}
}
]
}
]
}

Indicators of failure:

- ❌ `results[0].status` is "fail"
- ❌ `judge.status` is "fail"
- ❌ `judge.failureReason` provides specific details
Combine exact, regex, and AI judges for comprehensive testing:
{
"messages": [
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"judgePlan": {
"type": "exact",
"content": "Hello! How can I help you?"
}
},
{
"role": "user",
"content": "Book appointment for Monday"
},
{
"role": "assistant",
"judgePlan": {
"type": "regex",
"content": ".*(Monday|next week).*"
}
},
{
"role": "user",
"content": "Thanks for your help"
},
{
"role": "assistant",
"judgePlan": {
"type": "ai",
"model": {
"provider": "openai",
"model": "gpt-4o",
"messages": [
{
"role": "system",
"content": "PASS if response is polite and acknowledges thanks. Output: pass or fail"
}
]
}
}
}
]
}

Validate smooth transitions between squad members:
{
"name": "Squad Handoff Test",
"messages": [
{
"role": "user",
"content": "I need technical support"
},
{
"role": "assistant",
"judgePlan": {
"type": "exact",
"toolCalls": [
{
"name": "transferToSquadMember",
"arguments": {
"destination": "technical-support-agent"
}
}
]
}
}
],
"target": {
"type": "squad",
"squadId": "your-squad-id"
}
}

Organize related tests for systematic validation:
{
"name": "Greeting Regression Suite",
"tests": [
"Greeting Test - Formal",
"Greeting Test - Casual",
"Greeting Test - Multilingual"
]
}

Run multiple evals sequentially to validate all greeting scenarios.
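Running a suite sequentially and turning the outcome into a CI quality gate could look like the Python sketch below. It is an illustration under assumptions: the suite is represented as a list of eval IDs rather than names, and `run_eval` is a hypothetical callable that creates a run for one eval, waits for it to end, and returns the parsed run dict. Neither `run_suite` nor `exit_code` is part of the Vapi API.

```python
from typing import Callable


def run_suite(eval_ids: list[str], run_eval: Callable[[str], dict]) -> dict[str, bool]:
    """Run each eval in order and record whether its first result passed."""
    outcomes = {}
    for eval_id in eval_ids:
        run = run_eval(eval_id)
        results = run.get("results") or []
        outcomes[eval_id] = (
            run.get("status") == "ended"
            and bool(results)
            and results[0].get("status") == "pass"
        )
    return outcomes


def exit_code(outcomes: dict[str, bool]) -> int:
    """0 when every eval passed, 1 otherwise, for use as a CI quality gate."""
    return 0 if outcomes and all(outcomes.values()) else 1
```

A pipeline step could call `sys.exit(exit_code(run_suite(ids, run_eval)))` so that any failing eval blocks the deployment. An empty suite is treated as a failure here so that a misconfigured job does not pass silently.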
| Issue | Solution |
|---|---|
| Eval always fails | Verify exact match strings character-by-character. Consider using regex for flexibility |
| AI judge inconsistent | Make pass/fail criteria more specific and binary. Test with known examples |
| Tool calls not matching | Check argument types (string vs number). Ensure exact spelling of function names |
| Run stuck in "running" | Verify assistant configuration. Check for errors in assistant's tools or prompts |
| Timeout errors | Reduce conversation length or simplify evaluations. Check assistant response times |
| Regex not matching | Test regex patterns separately. Remember to escape special characters like `.` or `?` |
| Empty results array | Check `endedReason` field. Assistant may have encountered an error before completion |
| Missing judge results | Verify `judgePlan` is properly configured in assistant messages |
"mockConversation.done" not reached:
- Check `endedReason` for actual error (e.g., "assistant-error", "pipeline-error-openai-llm-failed")
- Verify assistant configuration (model, voice, tools)
- Check API key validity and rate limits
Judge validation fails unexpectedly:
- Review actual vs expected output in `failureReason`
- For exact match: Check for extra spaces, punctuation, or case differences
- For regex: Test pattern with online regex validators
- For AI judge: Verify prompt clarity and binary pass/fail logic
Tool calls not validated:
- Ensure tool is properly configured in assistant
- Check argument types match exactly (string "14:00" vs number 14)
- Verify tool function names are spelled correctly
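When an exact tool-call check fails, comparing expected and actual arguments key by key usually reveals the cause quickly, including the type mismatches mentioned above. The Python sketch below is a local debugging aid, not a reproduction of how Vapi compares arguments; `diff_arguments` is a hypothetical helper that takes the expected `arguments` from your `judgePlan` and the actual `arguments` from the run result.

```python
def diff_arguments(expected: dict, actual: dict) -> list[str]:
    """Describe each argument that is missing, unexpected, or different."""
    problems = []
    for key in sorted(expected.keys() | actual.keys()):
        if key not in actual:
            problems.append(f"{key}: missing (expected {expected[key]!r})")
        elif key not in expected:
            problems.append(f"{key}: unexpected (got {actual[key]!r})")
        elif type(expected[key]) is not type(actual[key]):
            problems.append(
                f"{key}: type mismatch (expected {type(expected[key]).__name__}, "
                f"got {type(actual[key]).__name__})"
            )
        elif expected[key] != actual[key]:
            problems.append(f"{key}: expected {expected[key]!r}, got {actual[key]!r}")
    return problems
```

For the failed run shown earlier, comparing `{"date": "2025-01-20", "time": "14:00"}` with `{"date": "2025-01-20", "time": "2:00 PM"}` reports only the `time` argument.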
Create and configure assistants to test

Build custom tools and validate their behavior

<Card title="Eval API reference" icon="book" href="/api-reference/eval/create">
  Complete API documentation for evals
</Card>

Need assistance? We're here to help: