title Evals quickstart
subtitle Get started with AI agent testing in 5 minutes
slug observability/evals-quickstart

Overview

This quickstart guide will help you set up automated testing for your AI assistants and squads. In just a few minutes, you'll create mock conversations, define expected behaviors, and validate that your agents work correctly before they reach production.

<iframe style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border: 0;" src="https://www.tella.tv/video/cmgu6muyb002m0bktda8r6nou/embed?b=0&title=0&a=1&loop=0&t=0&muted=0&wt=1" allowfullscreen allowtransparency></iframe>

What are Evals?

Evals is Vapi's AI agent testing framework that enables you to systematically test assistants and squads using mock conversations with automated validation. Test your agents by:

  1. Creating mock conversations - Define user messages and expected assistant responses
  2. Validating behavior - Use exact match, regex patterns, or AI-powered judging
  3. Testing tool calls - Verify function calls with specific arguments
  4. Running automated tests - Execute tests and receive detailed pass/fail results
  5. Debugging failures - Review full conversation transcripts with evaluation details

When are Evals useful?

Evals help you maintain quality and catch issues early:

  • Pre-deployment testing - Validate new assistant configurations before going live
  • Regression testing - Ensure prompt or tool changes don't break existing behaviors
  • Conversation flow validation - Test multi-turn interactions and complex scenarios
  • Tool calling verification - Validate function calls with correct arguments
  • Squad handoff testing - Ensure smooth transitions between squad members
  • CI/CD integration - Automate quality gates in your deployment pipeline

What you'll build

An evaluation suite for an appointment booking assistant that tests:

  • Greeting and initial response validation
  • Tool call execution with specific arguments
  • Response pattern matching with regex
  • Semantic validation using AI judges
  • Multi-turn conversation flows

Prerequisites

  • Sign up at [dashboard.vapi.ai](https://dashboard.vapi.ai)
  • Get your API key from **API Keys** in the sidebar

You'll also need an existing assistant or squad to test. You can create one in the Dashboard or use the API.

Step 1: Create your first evaluation

Define a mock conversation to test your assistant's greeting behavior.

    1. Log in to [dashboard.vapi.ai](https://dashboard.vapi.ai)
    2. Click on **Evals** in the left sidebar (under Observability)
    3. Click **Create Evaluation**
  <Step title="Configure basic settings">
    1. **Name**: Enter "Greeting Test"
    2. **Description**: Add "Verify assistant greets users appropriately"
    3. **Type**: Automatically set to "chat.mockConversation"
  </Step>

  <Step title="Add conversation turns">
    1. Click **Add Message**
    2. Select **User** message type
    3. Enter content: "Hello"
    4. Click **Add Message** again
    5. Select **Assistant** message type
    6. Click **Enable Evaluation** toggle
    7. Select **Exact Match** as judge type
    8. Enter expected content: "Hello! How can I help you today?"
    9. Click **Save Evaluation**
  </Step>
</Steps>

<Tip>
Your evaluation is now saved. You can run it against any assistant or squad.
</Tip>
```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Greeting Test",
    "description": "Verify assistant greets users appropriately",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "content": "Hello! How can I help you today?"
        }
      }
    ]
  }'
```

Response:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "orgId": "org-123",
  "type": "chat.mockConversation",
  "name": "Greeting Test",
  "description": "Verify assistant greets users appropriately",
  "messages": [...],
  "createdAt": "2024-01-15T09:30:00Z",
  "updatedAt": "2024-01-15T09:30:00Z"
}

Save the returned id - you'll need it to run the evaluation.

For complete API details, see Create Eval.

**Message structure:** Each conversation turn has a `role` (user, assistant, system, or tool). Assistant messages with `judgePlan` define what to validate.
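
The same request body can also be assembled in code. The sketch below is a minimal illustration in Python, not an official SDK: `user_turn`, `assistant_turn`, and `build_eval` are hypothetical helper names, and the only fields used are the ones that appear in the curl request above.

```python
import json


def user_turn(content):
    """A mock user message."""
    return {"role": "user", "content": content}


def assistant_turn(judge_plan):
    """An assistant message whose reply is validated by judge_plan."""
    return {"role": "assistant", "judgePlan": judge_plan}


def build_eval(name, description, messages):
    """Assemble a request body with the fields shown in the curl example."""
    return {
        "name": name,
        "description": description,
        "type": "chat.mockConversation",
        "messages": messages,
    }


greeting_eval = build_eval(
    "Greeting Test",
    "Verify assistant greets users appropriately",
    [
        user_turn("Hello"),
        assistant_turn({"type": "exact", "content": "Hello! How can I help you today?"}),
    ],
)

# Serialize the body, for example to pass it to curl with -d.
print(json.dumps(greeting_eval, indent=2))
```

Keeping the turns as small helper calls makes it easier to reuse the same user messages across several evaluations.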

Step 2: Run your evaluation

Execute the evaluation against your assistant or squad.

    1. Navigate to **Evals** in the sidebar
    2. Click on "Greeting Test" from your evaluations list
  <Step title="Select target and run">
    1. In the evaluation detail page, find the **Run Test** section
    2. Select **Assistant** or **Squad** as the target type
    3. Choose your assistant/squad from the dropdown
    4. Click **Run Evaluation**
    5. Watch real-time progress as the test executes
  </Step>

  <Step title="View results">
    Results appear automatically when the test completes:
    - ✅ **Green checkmark** indicates evaluation passed
    - ❌ **Red X** indicates evaluation failed
    - Click **View Details** to see full conversation transcript
  </Step>
</Steps>
**Create an eval run:**
curl -X POST "https://api.vapi.ai/eval/run" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "evalId": "550e8400-e29b-41d4-a716-446655440000",
    "target": {
      "type": "assistant",
      "assistantId": "your-assistant-id"
    }
  }'

Response:

{
  "id": "eval-run-123",
  "evalId": "550e8400-e29b-41d4-a716-446655440000",
  "orgId": "org-123",
  "status": "queued",
  "createdAt": "2024-01-15T09:35:00Z",
  "updatedAt": "2024-01-15T09:35:00Z"
}

Check results:

curl -X GET "https://api.vapi.ai/eval/run/eval-run-123" \
  -H "Authorization: Bearer $VAPI_API_KEY"

For complete API details, see Create Eval Run and Get Eval Run.

You can also run evaluations with transient assistant or squad configurations by providing `assistant` or `squad` objects instead of IDs in the target.
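
Because a new run starts with status "queued", a script has to poll the run until it reports "ended". The following is a minimal sketch of that loop, assuming you supply your own `fetch_run` callable (for example a wrapper around the `GET /eval/run/{id}` request above). The function name, the poll interval, and the timeout are illustrative choices, not values from the API.

```python
import time


def wait_for_run(fetch_run, run_id, poll_seconds=2.0, timeout_seconds=120.0, sleep=time.sleep):
    """Poll until the run's status is "ended" or the timeout elapses.

    fetch_run is any callable that takes a run ID and returns the run as a dict.
    The sleep parameter exists so tests can skip real waiting.
    """
    waited = 0.0
    while True:
        run = fetch_run(run_id)
        if run.get("status") == "ended":
            return run
        if waited >= timeout_seconds:
            raise TimeoutError(f"run {run_id} still {run.get('status')!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with a fake fetcher that reports "queued", then "running", then "ended".
_states = iter(["queued", "running", "ended"])


def fake_fetch(run_id):
    return {"id": run_id, "status": next(_states)}


final_run = wait_for_run(fake_fetch, "eval-run-123", sleep=lambda seconds: None)
print(final_run["status"])  # ended
```

Injecting `fetch_run` keeps the polling logic independent of any particular HTTP client.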

Step 3: Understand test results

Learn to interpret evaluation results and identify issues.

Successful evaluation

When all checks pass, you'll see:

{
  "id": "eval-run-123",
  "evalId": "550e8400-e29b-41d4-a716-446655440000",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "pass",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hello! How can I help you today?",
          "judge": {
            "status": "pass"
          }
        }
      ]
    }
  ]
}

Pass criteria:

  • status is "ended"
  • endedReason is "mockConversation.done"
  • results[0].status is "pass"
  • All judge.status values are "pass"

Failed evaluation

When validation fails, you'll see details:

{
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "fail",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hi there! What can I do for you?",
          "judge": {
            "status": "fail",
            "failureReason": "Expected exact match: 'Hello! How can I help you today?' but got: 'Hi there! What can I do for you?'"
          }
        }
      ]
    }
  ]
}

Failure indicators:

  • results[0].status is "fail"
  • judge.status is "fail"
  • judge.failureReason explains why validation failed
If `endedReason` is not "mockConversation.done", the test encountered an error (like "assistant-error" or "pipeline-error-openai-llm-failed"). Check your assistant configuration.
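
The pass criteria and failure indicators above can be checked in code once you have the run as a dict. This is a hedged sketch: `run_passed` and `failure_reasons` are hypothetical helper names, and they rely only on the fields shown in the example responses. Unlike the criteria list, which looks at `results[0]`, the sketch checks every entry in `results`.

```python
def run_passed(run):
    """Return True only if the run ended normally and every judged message passed."""
    if run.get("status") != "ended":
        return False
    if run.get("endedReason") != "mockConversation.done":
        return False
    results = run.get("results") or []
    if not results:
        return False
    for result in results:
        if result.get("status") != "pass":
            return False
        for message in result.get("messages", []):
            judge = message.get("judge")
            if judge is not None and judge.get("status") != "pass":
                return False
    return True


def failure_reasons(run):
    """Collect judge.failureReason from every judged message that failed."""
    reasons = []
    for result in run.get("results") or []:
        for message in result.get("messages", []):
            judge = message.get("judge") or {}
            if judge.get("status") == "fail":
                reasons.append(judge.get("failureReason", "(no failureReason given)"))
    return reasons


failed_run = {
    "status": "ended",
    "endedReason": "mockConversation.done",
    "results": [
        {
            "status": "fail",
            "messages": [
                {"role": "user", "content": "Hello"},
                {
                    "role": "assistant",
                    "content": "Hi there! What can I do for you?",
                    "judge": {"status": "fail", "failureReason": "Expected exact match"},
                },
            ],
        }
    ],
}

print(run_passed(failed_run))       # False
print(failure_reasons(failed_run))  # ['Expected exact match']
```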

Step 4: Test tool/function calls

Validate that your assistant calls functions with correct arguments.

Basic tool call validation

Test appointment booking with exact argument matching:

1. Create new evaluation: "Appointment Booking Test"
2. Add user message: "Book me an appointment for next Monday at 2pm"
3. Add assistant message with evaluation enabled
4. Select **Exact Match** judge type
5. Click **Add Tool Call**
6. Enter function name: "bookAppointment"
7. Add arguments:
   - `date`: "2025-01-20"
   - `time`: "14:00"
8. Add tool response message:
   - Type: **Tool**
   - Content: `{"status": "success", "confirmationId": "APT-12345"}`
9. Add final assistant message to verify confirmation
10. Save evaluation

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Appointment Booking Test",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "Book me an appointment for next Monday at 2pm"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "toolCalls": [{
            "name": "bookAppointment",
            "arguments": {
              "date": "2025-01-20",
              "time": "14:00"
            }
          }]
        }
      },
      {
        "role": "tool",
        "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "regex",
          "content": ".*confirmed.*APT-12345.*"
        }
      }
    ]
  }'
```

For API details, see Create Eval.

Tool call validation modes

Exact match - Full validation:

{
  "judgePlan": {
    "type": "exact",
    "toolCalls": [
      {
        "name": "bookAppointment",
        "arguments": {
          "date": "2025-01-20",
          "time": "14:00"
        }
      }
    ]
  }
}

Validates both function name AND all argument values exactly.

Partial match - Name only:

{
  "judgePlan": {
    "type": "regex",
    "toolCalls": [
      {
        "name": "bookAppointment"
      }
    ]
  }
}

Validates only that the function was called (arguments can vary).

Multiple tool calls:

{
  "judgePlan": {
    "type": "exact",
    "toolCalls": [
      {
        "name": "checkAvailability",
        "arguments": { "date": "2025-01-20" }
      },
      {
        "name": "bookAppointment",
        "arguments": { "date": "2025-01-20", "time": "14:00" }
      }
    ]
  }
}

Validates multiple function calls in sequence.

Tool calls are validated in the order they're defined. Use `type: "exact"` for strict validation or `type: "regex"` for flexible validation.
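
To reason about which mode a test needs, it can help to try the comparison locally against a recorded response. The sketch below mirrors the behavior described above (calls compared in order, with or without arguments). It is an illustration only: `tool_calls_match` is a hypothetical helper, and how the platform compares tool calls internally is an assumption here, not something this guide specifies.

```python
def tool_calls_match(expected, actual, check_arguments=True):
    """Compare expected and actual tool calls position by position.

    With check_arguments=False only the function names are compared,
    which corresponds to the name-only mode described above.
    """
    if len(expected) != len(actual):
        return False
    for want, got in zip(expected, actual):
        if want["name"] != got.get("name"):
            return False
        if check_arguments and "arguments" in want:
            if want["arguments"] != got.get("arguments"):
                return False
    return True


expected_calls = [
    {"name": "bookAppointment", "arguments": {"date": "2025-01-20", "time": "14:00"}}
]
actual_calls = [
    {"name": "bookAppointment", "arguments": {"date": "2025-01-20", "time": "2:00 PM"}}
]

print(tool_calls_match(expected_calls, actual_calls))                         # False
print(tool_calls_match(expected_calls, actual_calls, check_arguments=False))  # True
```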

Step 5: Use regex for flexible validation

When responses vary slightly (like names, dates, or IDs), use regex patterns for flexible matching.

Common regex patterns

Greeting variations:

{
  "judgePlan": {
    "type": "regex",
    "content": "^(Hello|Hi|Hey),? (I can|I'll|let me) help.*"
  }
}

Matches: "Hello, I can help...", "Hi I'll help...", "Hey let me help..."

Responses with variables:

{
  "judgePlan": {
    "type": "regex",
    "content": ".*appointment.*confirmed.*[A-Z]{3}-[0-9]{5}.*"
  }
}

Matches any confirmation message with appointment ID format.

Date patterns:

{
  "judgePlan": {
    "type": "regex",
    "content": ".*scheduled for (Monday|Tuesday|Wednesday|Thursday|Friday).*"
  }
}

Matches responses mentioning weekdays.

Case-insensitive matching:

{
  "judgePlan": {
    "type": "regex",
    "content": "(?i)booking confirmed"
  }
}

The (?i) flag makes matching case-insensitive.

Example: Flexible booking confirmation

1. Add assistant message with evaluation enabled
2. Select **Regex** as judge type
3. Enter pattern: `.*appointment.*(confirmed|booked).*\d{1,2}:\d{2}.*`
4. This matches various confirmation phrasings with time mentions

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Flexible Booking Test",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "I need to schedule an appointment"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "regex",
          "content": ".*(schedule|book|set up).*appointment.*"
        }
      }
    ]
  }'
```

**Regex tips:**
- Use `.*` to match any characters
- Use `(option1|option2)` for alternatives
- Use `\d` for digits, `\s` for whitespace
- Use `.*?` for non-greedy matching
- Test your patterns with sample responses first
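
One way to test patterns against sample responses before saving an evaluation is a small local check. The sketch below uses Python's `re` module with search semantics. Both are assumptions: the regex engine and matching mode that Vapi applies may differ, so treat a local match as a sanity check rather than a guarantee. `check_pattern` is a hypothetical helper name.

```python
import re


def check_pattern(pattern, should_match, should_not_match=()):
    """Return the samples that behave differently from what was expected."""
    compiled = re.compile(pattern)
    problems = []
    for text in should_match:
        if compiled.search(text) is None:
            problems.append(("expected match", text))
    for text in should_not_match:
        if compiled.search(text) is not None:
            problems.append(("unexpected match", text))
    return problems


greeting = r"^(Hello|Hi|Hey),? (I can|I'll|let me) help.*"
confirmation = r".*appointment.*confirmed.*[A-Z]{3}-[0-9]{5}.*"

greeting_problems = check_pattern(
    greeting,
    should_match=["Hello, I can help you book that.", "Hi I'll help with that.", "Hey let me help."],
    should_not_match=["Good morning, how may I assist?"],
)
confirmation_problems = check_pattern(
    confirmation,
    should_match=["Your appointment is confirmed. Reference: APT-12345"],
    should_not_match=["Your appointment is confirmed."],
)

print(greeting_problems)      # []
print(confirmation_problems)  # []
```

An empty list means every sample behaved as expected. Adding a few responses that should not match helps catch patterns that are too permissive.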

Step 6: Use AI judge for semantic validation

For complex validation criteria beyond pattern matching, use AI-powered judges to evaluate responses semantically.

AI judge structure

{
  "role": "assistant",
  "judgePlan": {
    "type": "ai",
    "model": {
      "provider": "openai",
      "model": "gpt-4o",
      "messages": [
        {
          "role": "system",
          "content": "Your evaluation prompt here"
        }
      ]
    }
  }
}

Writing effective judge prompts

Template structure:

You are an LLM-Judge. Evaluate ONLY the last assistant message in the mock conversation: {{messages[-1]}}.

Include the full conversation history for context: {{messages}}

Decision rule:
- PASS if ALL "pass criteria" are satisfied AND NONE of the "fail criteria" are triggered.
- Otherwise FAIL.

Pass criteria:
- [Specific requirement 1]
- [Specific requirement 2]

Fail criteria (any one triggers FAIL):
- [Specific failure condition 1]
- [Specific failure condition 2]

Output format: respond with exactly one word: pass or fail
- No explanations
- No punctuation
- No additional text

**Template variables:**
- `{{messages}}` - The entire conversation history (all messages exchanged)
- `{{messages[-1]}}` - The last assistant message only
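
If several evaluations share the template above, the prompt can be generated from lists of criteria. This is a minimal sketch under that assumption. `build_judge_prompt` and `ai_judge_plan` are hypothetical helper names, and the `{{messages}}` placeholders are left untouched in the output so they can be filled in when the evaluation runs.

```python
def build_judge_prompt(pass_criteria, fail_criteria):
    """Fill the judge template with concrete pass and fail criteria."""
    lines = [
        "You are an LLM-Judge. Evaluate ONLY the last assistant message "
        "in the mock conversation: {{messages[-1]}}.",
        "",
        "Include the full conversation history for context: {{messages}}",
        "",
        "Decision rule:",
        '- PASS if ALL "pass criteria" are satisfied AND NONE of the "fail criteria" are triggered.',
        "- Otherwise FAIL.",
        "",
        "Pass criteria:",
    ]
    lines += [f"- {item}" for item in pass_criteria]
    lines += ["", "Fail criteria (any one triggers FAIL):"]
    lines += [f"- {item}" for item in fail_criteria]
    lines += ["", "Output format: respond with exactly one word: pass or fail"]
    return "\n".join(lines)


def ai_judge_plan(prompt, provider="openai", model="gpt-4o"):
    """Wrap a prompt in the judgePlan structure shown in "AI judge structure"."""
    return {
        "type": "ai",
        "model": {
            "provider": provider,
            "model": model,
            "messages": [{"role": "system", "content": prompt}],
        },
    }


prompt = build_judge_prompt(
    pass_criteria=["Response acknowledges the user request", "Tone is professional and friendly"],
    fail_criteria=["Response is rude or dismissive"],
)
plan = ai_judge_plan(prompt)
print(prompt)
```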

Example: Evaluate helpfulness and tone

1. Add assistant message with evaluation enabled
2. Select **AI Judge** as judge type
3. Choose provider: **OpenAI**
4. Select model: **gpt-4o**
5. Enter evaluation prompt (see template above)
6. Customize pass/fail criteria for your use case

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Helpfulness Test",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "I need help with my account"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "ai",
          "model": {
            "provider": "openai",
            "model": "gpt-4o",
            "messages": [{
              "role": "system",
              "content": "You are an LLM-Judge. Evaluate ONLY the last assistant message: {{messages[-1]}}.\n\nInclude context: {{messages}}\n\nDecision rule:\n- PASS if ALL pass criteria are met AND NO fail criteria are triggered.\n- Otherwise FAIL.\n\nPass criteria:\n- Response acknowledges the user request\n- Response offers specific help or next steps\n- Tone is professional and friendly\n\nFail criteria (any triggers FAIL):\n- Response is rude or dismissive\n- Response ignores the user request\n- Response provides no actionable information\n\nOutput format: respond with exactly one word: pass or fail"
            }]
          }
        }
      }
    ]
  }'
```

Supported AI judge providers

  • OpenAI - **Models:** gpt-4o, gpt-4-turbo, gpt-3.5-turbo. Best for general-purpose evaluation
  • Anthropic - **Models:** claude-3-5-sonnet-20241022, claude-3-opus-20240229. Best for nuanced evaluation
  • Google - **Models:** gemini-1.5-pro, gemini-1.5-flash. Best for multilingual content
  • Groq - **Models:** llama-3.1-70b-versatile, mixtral-8x7b-32768. Best for fast evaluation

Custom LLM:

{
  "model": {
    "provider": "custom-llm",
    "model": "your-model-name",
    "url": "https://your-api-endpoint.com/chat/completions",
    "messages": [...]
  }
}

AI judge best practices

**Tips for reliable AI judging:**
- Be specific with pass/fail criteria (avoid ambiguous requirements)
- Use "ALL pass criteria must be met" logic
- Use "ANY fail criteria triggers fail" logic
- Include conversation context with `{{messages}}` syntax
- Request exact "pass" or "fail" output (no explanations)
- Test criteria with known good/bad responses before production
- Use consistent evaluation standards across similar tests

Step 7: Control flow with Continue Plan

Define what happens after an evaluation passes or fails using continuePlan.

Exit on failure

Stop the test immediately if a critical check fails:

{
  "role": "assistant",
  "judgePlan": {
    "type": "exact",
    "content": "I can help you with that."
  },
  "continuePlan": {
    "exitOnFailureEnabled": true
  }
}

Use case: Skip expensive subsequent tests when initial validation fails.

Override responses on failure

Provide fallback responses to continue testing even when validation fails:

{
  "role": "assistant",
  "judgePlan": {
    "type": "exact",
    "content": "I've processed your request."
  },
  "continuePlan": {
    "exitOnFailureEnabled": false,
    "contentOverride": "Let me rephrase that...",
    "toolCallsOverride": [
      {
        "name": "retryProcessing",
        "arguments": { "retry": "true" }
      }
    ]
  }
}

Use case: Test error recovery paths or force specific tool calls for subsequent validation.

Example: Multi-step with exit control

1. Create evaluation with multiple conversation turns
2. For each assistant message with critical validation:
   - Enable evaluation
   - Configure judge plan (exact, regex, or AI)
   - Toggle **Exit on Failure** to stop test early
3. For non-critical checks, leave **Exit on Failure** disabled

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Multi-Step with Control",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "I want to book an appointment"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "content": "I can help you book an appointment."
        },
        "continuePlan": {
          "exitOnFailureEnabled": true
        }
      },
      {
        "role": "user",
        "content": "Monday at 2pm"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "toolCalls": [{"name": "bookAppointment"}]
        },
        "continuePlan": {
          "exitOnFailureEnabled": false,
          "contentOverride": "Booking confirmed for Monday at 2pm.",
          "toolCallsOverride": [{
            "name": "bookAppointment",
            "arguments": {"date": "2025-01-20", "time": "14:00"}
          }]
        }
      }
    ]
  }'
```

If `exitOnFailureEnabled` is `true` and validation fails, the test stops immediately. Subsequent conversation turns are not executed. Use this for critical checkpoints.
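
The effect of `exitOnFailureEnabled` on which turns run can be pictured with a small simulation. This sketch only models the rule stated in this step (a failed turn with exit-on-failure enabled stops the test). It is not how the platform is implemented, and `executed_turns` is a hypothetical name.

```python
def executed_turns(outcomes):
    """Return the indexes of judged turns that run.

    outcomes is a list of (passed, exit_on_failure_enabled) pairs,
    one per judged assistant turn, in conversation order.
    """
    ran = []
    for index, (passed, exit_on_failure) in enumerate(outcomes):
        ran.append(index)
        if not passed and exit_on_failure:
            break  # critical checkpoint failed: later turns are skipped
    return ran


# First turn is critical and fails, so the second turn never runs.
print(executed_turns([(False, True), (True, False)]))  # [0]

# A failure on a non-critical turn does not stop the test.
print(executed_turns([(True, True), (False, False), (True, False)]))  # [0, 1, 2]
```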

Step 8: Test complete conversation flows

Validate multi-turn interactions that simulate real user conversations.

Complete booking flow example

Create a comprehensive test:
1. **Turn 1 - Initial request:**
   - User: "I need to schedule an appointment"
   - Assistant evaluation: AI judge checking acknowledgment

2. **Turn 2 - Provide details:**
   - User: "Next Monday at 2pm"
   - Assistant evaluation: Exact match on tool call `bookAppointment`

3. **Turn 3 - Tool response:**
   - Tool: `{"status": "success", "confirmationId": "APT-12345"}`

4. **Turn 4 - Confirmation:**
   - Assistant evaluation: Regex matching confirmation with ID

5. **Turn 5 - Follow-up:**
   - User: "Can I get that via email?"
   - Assistant evaluation: Exact match on tool call `sendEmail`

```bash
curl -X POST "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Complete Booking Flow",
    "description": "Test full appointment booking conversation",
    "type": "chat.mockConversation",
    "messages": [
      {
        "role": "user",
        "content": "I need to schedule an appointment"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "ai",
          "model": {
            "provider": "openai",
            "model": "gpt-4o",
            "messages": [{
              "role": "system",
              "content": "Evaluate: {{messages[-1]}}\n\nPASS if:\n- Response acknowledges appointment request\n- Response asks for details or preferences\n\nFAIL if:\n- Response is dismissive\n- Response ignores request\n\nOutput: pass or fail"
            }]
          }
        }
      },
      {
        "role": "user",
        "content": "Next Monday at 2pm"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "toolCalls": [{
            "name": "bookAppointment",
            "arguments": {
              "date": "2025-01-20",
              "time": "14:00"
            }
          }]
        }
      },
      {
        "role": "tool",
        "content": "{\"status\": \"success\", \"confirmationId\": \"APT-12345\"}"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "regex",
          "content": ".*confirmed.*APT-12345.*"
        }
      },
      {
        "role": "user",
        "content": "Can I get that via email?"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "exact",
          "toolCalls": [{
            "name": "sendEmail"
          }]
        }
      }
    ]
  }'
```

For API details, see Create Eval.

System message injection

Inject system prompts mid-conversation to test dynamic behavior changes:

{
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*help.*"
      }
    },
    {
      "role": "system",
      "content": "You are now in urgent mode. Prioritize speed."
    },
    {
      "role": "user",
      "content": "I need immediate help"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response shows urgency. FAIL if response is casual. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

**Multi-turn testing tips:**
- Keep conversations focused (5-10 turns for most tests)
- Use exit-on-failure for early turns to save time
- Test one primary flow per evaluation
- Mix judge types (exact, regex, AI) for comprehensive validation
- Include tool responses to simulate real interactions
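
Longer conversations are easier to get wrong, so a quick structural check before submitting can save a failed run. The sketch below is one possible linter, built on assumptions: the allowed roles and judge types are the ones named in this guide, and it treats tool content as a JSON string because every example here does. `lint_messages` is a hypothetical helper, not part of any Vapi tooling.

```python
import json

ALLOWED_ROLES = {"user", "assistant", "system", "tool"}
JUDGE_TYPES = {"exact", "regex", "ai"}


def lint_messages(messages):
    """Return a list of human-readable problems found in a messages array."""
    problems = []
    judged = 0
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {index}: unknown role {role!r}")
            continue
        if role == "assistant":
            plan = message.get("judgePlan")
            if plan is None:
                continue  # assistant turn without validation
            judged += 1
            if plan.get("type") not in JUDGE_TYPES:
                problems.append(f"message {index}: unknown judge type {plan.get('type')!r}")
        elif not message.get("content"):
            problems.append(f"message {index}: {role} message has no content")
        elif role == "tool":
            try:
                json.loads(message["content"])
            except (TypeError, ValueError):
                problems.append(f"message {index}: tool content is not a JSON string")
    if judged == 0:
        problems.append("no assistant message has a judgePlan, so nothing is validated")
    return problems


conversation = [
    {"role": "user", "content": "Next Monday at 2pm"},
    {"role": "assistant", "judgePlan": {"type": "exact", "toolCalls": [{"name": "bookAppointment"}]}},
    {"role": "tool", "content": "{\"status\": \"success\"}"},
    {"role": "assistant", "judgePlan": {"type": "regex", "content": ".*confirmed.*"}},
]
print(lint_messages(conversation))  # []
```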

Step 9: Manage evaluations

List, update, and organize your evaluation suite.

List all evaluations

1. Navigate to **Evals** in the sidebar
2. View all evaluations in a table with:
   - Name and description
   - Created date
   - Last run status
   - Actions (Edit, Run, Delete)
3. Use search to filter by name
4. Sort by date or status

```bash
curl -X GET "https://api.vapi.ai/eval" \
  -H "Authorization: Bearer $VAPI_API_KEY"
```

Response:

{
  "results": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "Greeting Test",
      "description": "Verify assistant greets users appropriately",
      "type": "chat.mockConversation",
      "createdAt": "2024-01-15T09:30:00Z",
      "updatedAt": "2024-01-15T09:30:00Z"
    }
  ],
  "page": 1,
  "total": 1
}

For API details, see List Evals.
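
Since runs are created by eval ID, a common next step is looking up an ID by name in the list response. A minimal sketch, assuming the response has the `results` array shown above and that you have already parsed it into a dict. `find_eval_ids` is a hypothetical helper, and it only inspects the page it is given, not later pages.

```python
def find_eval_ids(list_response, name):
    """Return the IDs of all evals on this page whose name matches exactly."""
    return [
        item["id"]
        for item in list_response.get("results", [])
        if item.get("name") == name
    ]


list_response = {
    "results": [
        {
            "id": "550e8400-e29b-41d4-a716-446655440000",
            "name": "Greeting Test",
            "type": "chat.mockConversation",
        }
    ],
    "page": 1,
    "total": 1,
}

print(find_eval_ids(list_response, "Greeting Test"))
# ['550e8400-e29b-41d4-a716-446655440000']
```

The function returns a list rather than a single ID because nothing in the response shown here indicates that names are unique.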

Update an evaluation

1. Navigate to **Evals** and click on an evaluation
2. Click **Edit** button
3. Modify conversation turns, judge plans, or settings
4. Click **Save Changes**
5. Previous test runs remain unchanged

```bash
curl -X PATCH "https://api.vapi.ai/eval/550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer $VAPI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Updated Greeting Test",
    "description": "Enhanced greeting validation",
    "messages": [
      {
        "role": "user",
        "content": "Hi there"
      },
      {
        "role": "assistant",
        "judgePlan": {
          "type": "regex",
          "content": "^(Hello|Hi|Hey).*"
        }
      }
    ]
  }'
```

For API details, see Update Eval.

Delete an evaluation

1. Navigate to **Evals**
2. Click on an evaluation
3. Click **Delete** button
4. Confirm deletion

<Warning>
Deleting an evaluation does NOT delete its run history. Past run results remain accessible.
</Warning>

```bash
curl -X DELETE "https://api.vapi.ai/eval/550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer $VAPI_API_KEY"
```

For API details, see Delete Eval.

View run history

1. Navigate to **Evals**
2. Click on an evaluation
3. View **Runs** tab showing:
   - Run timestamp
   - Target (assistant/squad)
   - Status (pass/fail)
   - Duration
4. Click any run to view detailed results

**List all runs:**

```bash
curl -X GET "https://api.vapi.ai/eval/run" \
  -H "Authorization: Bearer $VAPI_API_KEY"
```

Filter by eval ID:

curl -X GET "https://api.vapi.ai/eval/run?evalId=550e8400-e29b-41d4-a716-446655440000" \
  -H "Authorization: Bearer $VAPI_API_KEY"

For API details, see List Eval Runs.
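
Once you have a list of run objects, a short summary makes trends easier to see than reading each run. The sketch below assumes each run has the fields shown in Step 3 (`status`, `endedReason`, `results`). The shape of the list response itself is not shown in this guide, so extracting the run objects from it is left to the caller. `summarize_runs` is a hypothetical name.

```python
def summarize_runs(runs):
    """Count runs by outcome: pass, fail, error, or pending."""
    summary = {"pass": 0, "fail": 0, "error": 0, "pending": 0}
    for run in runs:
        results = run.get("results") or []
        if run.get("status") != "ended":
            summary["pending"] += 1  # still queued or running
        elif run.get("endedReason") != "mockConversation.done":
            summary["error"] += 1  # ended before the conversation completed
        elif results and all(r.get("status") == "pass" for r in results):
            summary["pass"] += 1
        else:
            summary["fail"] += 1
    return summary


runs = [
    {"status": "ended", "endedReason": "mockConversation.done", "results": [{"status": "pass"}]},
    {"status": "ended", "endedReason": "mockConversation.done", "results": [{"status": "fail"}]},
    {"status": "ended", "endedReason": "assistant-error", "results": []},
    {"status": "queued"},
]

print(summarize_runs(runs))
# {'pass': 1, 'fail': 1, 'error': 1, 'pending': 1}
```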

Expected output

Successful run

{
  "id": "eval-run-123",
  "evalId": "550e8400-e29b-41d4-a716-446655440000",
  "orgId": "org-123",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "createdAt": "2024-01-15T09:35:00Z",
  "updatedAt": "2024-01-15T09:35:45Z",
  "results": [
    {
      "status": "pass",
      "messages": [
        {
          "role": "user",
          "content": "Hello"
        },
        {
          "role": "assistant",
          "content": "Hello! How can I help you today?",
          "judge": {
            "status": "pass"
          }
        }
      ]
    }
  ],
  "target": {
    "type": "assistant",
    "assistantId": "your-assistant-id"
  }
}

Indicators of success:

  • status is "ended"
  • endedReason is "mockConversation.done"
  • results[0].status is "pass"
  • ✅ All judge.status values are "pass"

Failed run

{
  "id": "eval-run-124",
  "status": "ended",
  "endedReason": "mockConversation.done",
  "results": [
    {
      "status": "fail",
      "messages": [
        {
          "role": "user",
          "content": "Book an appointment for Monday at 2pm"
        },
        {
          "role": "assistant",
          "content": "Sure, let me help you with that.",
          "toolCalls": [
            {
              "name": "bookAppointment",
              "arguments": {
                "date": "2025-01-20",
                "time": "2:00 PM"
              }
            }
          ],
          "judge": {
            "status": "fail",
            "failureReason": "Tool call arguments mismatch. Expected time: '14:00' but got: '2:00 PM'"
          }
        }
      ]
    }
  ]
}

Indicators of failure:

  • results[0].status is "fail"
  • judge.status is "fail"
  • judge.failureReason provides specific details
Full conversation transcripts show both expected and actual values, making debugging straightforward.

Common patterns

Multiple validation types in one eval

Combine exact, regex, and AI judges for comprehensive testing:

{
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "content": "Hello! How can I help you?"
      }
    },
    {
      "role": "user",
      "content": "Book appointment for Monday"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "regex",
        "content": ".*(Monday|next week).*"
      }
    },
    {
      "role": "user",
      "content": "Thanks for your help"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "ai",
        "model": {
          "provider": "openai",
          "model": "gpt-4o",
          "messages": [
            {
              "role": "system",
              "content": "PASS if response is polite and acknowledges thanks. Output: pass or fail"
            }
          ]
        }
      }
    }
  ]
}

Test squad handoffs

Validate smooth transitions between squad members:

{
  "name": "Squad Handoff Test",
  "messages": [
    {
      "role": "user",
      "content": "I need technical support"
    },
    {
      "role": "assistant",
      "judgePlan": {
        "type": "exact",
        "toolCalls": [
          {
            "name": "transferToSquadMember",
            "arguments": {
              "destination": "technical-support-agent"
            }
          }
        ]
      }
    }
  ],
  "target": {
    "type": "squad",
    "squadId": "your-squad-id"
  }
}

Regression test suite

Organize related tests for systematic validation:

{
  "name": "Greeting Regression Suite",
  "tests": [
    "Greeting Test - Formal",
    "Greeting Test - Casual",
    "Greeting Test - Multilingual"
  ]
}

Run multiple evals sequentially to validate all greeting scenarios.
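
Running a suite sequentially can be sketched as a loop over eval IDs that records a pass or fail per eval and turns the outcome into an exit code for CI. This is an illustration under assumptions: `run_eval` is a callable you provide that creates a run, waits for it to end, and returns the finished run dict. Neither `run_suite` nor `exit_code` is part of the Vapi API.

```python
def run_suite(eval_ids, run_eval):
    """Run each eval in order and map its ID to True (passed) or False."""
    outcomes = {}
    for eval_id in eval_ids:
        run = run_eval(eval_id)
        results = run.get("results") or []
        outcomes[eval_id] = (
            run.get("endedReason") == "mockConversation.done"
            and bool(results)
            and all(r.get("status") == "pass" for r in results)
        )
    return outcomes


def exit_code(outcomes):
    """0 if every eval passed, 1 otherwise (including an empty suite)."""
    return 0 if outcomes and all(outcomes.values()) else 1


# Demo with canned results instead of real API calls.
canned = {
    "greeting-formal": {"endedReason": "mockConversation.done", "results": [{"status": "pass"}]},
    "greeting-casual": {"endedReason": "mockConversation.done", "results": [{"status": "fail"}]},
}

outcomes = run_suite(["greeting-formal", "greeting-casual"], lambda eval_id: canned[eval_id])
print(outcomes)             # {'greeting-formal': True, 'greeting-casual': False}
print(exit_code(outcomes))  # 1
```

Treating an empty suite as a failure is a deliberate choice here, so that a misconfigured pipeline does not pass silently.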

Troubleshooting

| Issue | Solution |
| --- | --- |
| Eval always fails | Verify exact match strings character-by-character. Consider using regex for flexibility |
| AI judge inconsistent | Make pass/fail criteria more specific and binary. Test with known examples |
| Tool calls not matching | Check argument types (string vs number). Ensure exact spelling of function names |
| Run stuck in "running" | Verify assistant configuration. Check for errors in assistant's tools or prompts |
| Timeout errors | Reduce conversation length or simplify evaluations. Check assistant response times |
| Regex not matching | Test regex patterns separately. Remember to escape special characters like `.` or `?` |
| Empty results array | Check endedReason field. Assistant may have encountered an error before completion |
| Missing judge results | Verify judgePlan is properly configured in assistant messages |

Common errors

"mockConversation.done" not reached:

  • Check endedReason for actual error (e.g., "assistant-error", "pipeline-error-openai-llm-failed")
  • Verify assistant configuration (model, voice, tools)
  • Check API key validity and rate limits

Judge validation fails unexpectedly:

  • Review actual vs expected output in failureReason
  • For exact match: Check for extra spaces, punctuation, or case differences
  • For regex: Test pattern with online regex validators
  • For AI judge: Verify prompt clarity and binary pass/fail logic

Tool calls not validated:

  • Ensure tool is properly configured in assistant
  • Check argument types match exactly (string "14:00" vs number 14)
  • Verify tool function names are spelled correctly
If you see `endedReason: "assistant-error"`, your assistant configuration has issues. Test the assistant manually first before running evals.
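
Type mismatches such as the string "14:00" versus the number 14 are easy to miss by eye. The sketch below compares expected and actual tool call arguments and reports type and value differences separately. It is a local debugging aid under the assumption that you have both argument objects as dicts. `argument_mismatches` is a hypothetical helper name.

```python
def argument_mismatches(expected, actual):
    """Describe every difference between expected and actual arguments."""
    problems = []
    for key, want in expected.items():
        if key not in actual:
            problems.append(f"{key}: missing")
            continue
        got = actual[key]
        if type(want) is not type(got):
            problems.append(
                f"{key}: expected {type(want).__name__} {want!r}, "
                f"got {type(got).__name__} {got!r}"
            )
        elif want != got:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    for key in actual:
        if key not in expected:
            problems.append(f"{key}: unexpected argument")
    return problems


expected_args = {"date": "2025-01-20", "time": "14:00"}
actual_args = {"date": "2025-01-20", "time": 14}

print(argument_mismatches(expected_args, actual_args))
# ["time: expected str '14:00', got int 14"]
```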

Next steps

Learn testing patterns, best practices, and CI/CD integration

Create and configure assistants to test

Build custom tools and validate their behavior

<Card title="Eval API reference" icon="book" href="/api-reference/eval/create">
  Complete API documentation for evals
</Card>

Tips for success

**Best practices for reliable testing:**
- Start simple with exact matches, then add complexity
- One behavior per evaluation turn keeps tests focused
- Use descriptive names that explain what's being tested
- Test both happy paths and edge cases
- Version control your evals alongside assistant configs
- Run critical tests first to fail fast
- Review failure reasons promptly and iterate
- Document why each test exists (use descriptions)

Get help

Need assistance? We're here to help: