@@ -0,0 +1,357 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "bf5280e2",
"metadata": {},
"source": [
"# Evaluate Semantic Kernel AI (ChatCompletion) Agents in Azure AI Foundry"
]
},
{
"cell_type": "markdown",
"id": "0330c099",
"metadata": {},
"source": [
"## Objective\n",
"\n",
"This sample demonstrates how to evaluate Semantic Kernel AI ChatCompletionAgents in Azure AI Foundry. It provides a step-by-step guide to set up the environment, create an agent, and evaluate its performance."
]
},
{
"cell_type": "markdown",
"id": "b364c694",
"metadata": {},
"source": [
"## Time\n",
"You can expect to complete this sample in approximately 20 minutes."
]
},
{
"cell_type": "markdown",
"id": "919c6017",
"metadata": {},
"source": [
"## Prerequisites\n",
"### Packages\n",
"- `semantic-kernel` installed (`pip install semantic-kernel`)\n",
"- `azure-ai-evaluation` SDK installed\n",
"- An Azure OpenAI resource with a deployment configured\n",
"\n",
"Before running the sample:\n",
"```bash\n",
"pip install semantic-kernel azure-ai-projects azure-identity azure-ai-evaluation\n",
"```\n",
"\n",
"### Environment Variables\n",
"- For **AzureChatCompletion** (Semantic Kernel Agent):\n",
"  - **`api_key`** – Azure OpenAI API key used by the agent.\n",
"  - **`chat_deployment_name`** – Name of the deployed chat model (e.g., `gpt-35-turbo`) used by the agent.\n",
"  - **`endpoint`** – Azure OpenAI endpoint URL (e.g., `https://<your-resource>.openai.azure.com/`).\n",
"- For **LLM Evaluation**:\n",
"  - **`AZURE_OPENAI_ENDPOINT`** – Azure OpenAI endpoint to be used by the evaluation LLM.\n",
"  - **`AZURE_OPENAI_API_KEY`** – Azure OpenAI API key for evaluation.\n",
"  - **`AZURE_OPENAI_API_VERSION`** – API version (e.g., `2024-05-01-preview`) for the evaluation LLM.\n",
"  - **`MODEL_DEPLOYMENT_NAME`** – Deployment name of the model used for evaluation*, as found under the \"Name\" column in the \"Models + endpoints\" tab in your Azure AI Foundry project*.\n",
"- For **Azure AI Foundry** (Bonus):\n",
"  - **`AZURE_SUBSCRIPTION_ID`** – Your Azure subscription ID where the AI Foundry project is hosted.\n",
"  - **`PROJECT_NAME`** – Name of the Azure AI Foundry project.\n",
"  - **`RESOURCE_GROUP_NAME`** – Resource group containing your AI Foundry project."
]
},
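{
"cell_type": "markdown",
"id": "e1a7c0d2",
"metadata": {},
"source": [
"The evaluation cells later in this notebook read several of the variables above through `os.environ[...]`, which raises `KeyError` when a name is missing. The cell below is a minimal sketch, not part of the original sample, that reports missing names up front. The helper `missing_env_vars` is hypothetical, and the two lists simply mirror the **LLM Evaluation** and **Azure AI Foundry** bullets above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b9f2c6e",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Names copied from the \"LLM Evaluation\" and \"Azure AI Foundry (Bonus)\" bullets above.\n",
"EVALUATION_VARS = [\n",
"    \"AZURE_OPENAI_ENDPOINT\",\n",
"    \"AZURE_OPENAI_API_KEY\",\n",
"    \"AZURE_OPENAI_API_VERSION\",\n",
"    \"MODEL_DEPLOYMENT_NAME\",\n",
"]\n",
"FOUNDRY_VARS = [\"AZURE_SUBSCRIPTION_ID\", \"PROJECT_NAME\", \"RESOURCE_GROUP_NAME\"]\n",
"\n",
"\n",
"def missing_env_vars(names):\n",
"    \"\"\"Hypothetical helper: return the names that are unset or empty.\"\"\"\n",
"    return [name for name in names if not os.environ.get(name)]\n",
"\n",
"\n",
"print(\"Missing evaluation variables:\", missing_env_vars(EVALUATION_VARS))\n",
"print(\"Missing Foundry variables (bonus step only):\", missing_env_vars(FOUNDRY_VARS))"
]
},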
{
"cell_type": "markdown",
"id": "ba1d6576",
"metadata": {},
"source": [
"### Create an AzureChatCompletion service - [reference](https://learn.microsoft.com/en-us/semantic-kernel/concepts/ai-services/chat-completion/?tabs=csharp-AzureOpenAI%2Cpython-AzureOpenAI%2Cjava-AzureOpenAI&pivots=programming-language-python)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7dc6ce40",
"metadata": {},
"outputs": [],
"source": [
"from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion\n",
"\n",
"# You can do the following if you have set the necessary environment variables or created a .env file\n",
"chat_completion_service = AzureChatCompletion(service_id=\"my-service-id\")"
]
},
{
"cell_type": "markdown",
"id": "ef319288",
"metadata": {},
"source": [
"### Create a ChatCompletionAgent - [reference](https://learn.microsoft.com/en-us/semantic-kernel/frameworks/agent/agent-types/chat-completion-agent?pivots=programming-language-python)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76781359",
"metadata": {},
"outputs": [],
"source": [
"from semantic_kernel.functions import kernel_function\n",
"from typing import Annotated\n",
"\n",
"\n",
"# This is a sample plugin that provides tools\n",
"class MenuPlugin:\n",
"    \"\"\"A sample Menu Plugin used for the concept sample.\"\"\"\n",
"\n",
"    @kernel_function(description=\"Provides a list of specials from the menu.\")\n",
"    def get_specials(self) -> Annotated[str, \"Returns the specials from the menu.\"]:\n",
"        return \"\"\"\n",
"        Special Soup: Clam Chowder\n",
"        Special Salad: Cobb Salad\n",
"        Special Drink: Chai Tea\n",
"        \"\"\"\n",
"\n",
"    @kernel_function(description=\"Provides the price of the requested menu item.\")\n",
"    def get_item_price(\n",
"        self, menu_item: Annotated[str, \"The name of the menu item.\"]\n",
"    ) -> Annotated[str, \"Returns the price of the menu item.\"]:\n",
"        _ = menu_item  # This is just to simulate a function that uses the input.\n",
"        return \"$9.99\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6abead3",
"metadata": {},
"outputs": [],
"source": [
"from semantic_kernel.agents import ChatCompletionAgent\n",
"\n",
"# Create the agent by directly providing the chat completion service\n",
"agent = ChatCompletionAgent(\n",
"    service=chat_completion_service,\n",
"    name=\"Chef\",\n",
"    instructions=\"Answer questions about the menu.\",\n",
"    plugins=[MenuPlugin()],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b7b9ba3",
"metadata": {},
"outputs": [],
"source": [
"thread = None\n",
"\n",
"user_inputs = [\n",
"    \"Hello\",\n",
"    \"What is the special drink today?\",\n",
"    \"What does that cost?\",\n",
"    \"Thank you\",\n",
"]\n",
"\n",
"for user_input in user_inputs:\n",
"    response = await agent.get_response(messages=user_input, thread=thread)\n",
"    print(f\"## User: {user_input}\")\n",
"    print(f\"## {response.name}: {response}\\n\")\n",
"    thread = response.thread"
]
},
{
"cell_type": "markdown",
"id": "2586d3e5",
"metadata": {},
"source": [
"### Converter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fcd6ac41",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.evaluation import SKAgentConverter\n",
"\n",
"# Get the available turn indices for the thread,\n",
"# useful for selecting a specific turn for evaluation\n",
"turn_indices = await SKAgentConverter._get_thread_turn_indices(thread=thread)\n",
"print(f\"Available turn indices: {turn_indices}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1d4ae12",
"metadata": {},
"outputs": [],
"source": [
"converter = SKAgentConverter()\n",
"\n",
"# Get the evaluation data for a single agent run (one turn of the thread)\n",
"evaluation_data_single_run = await converter.convert(\n",
"    thread=thread,\n",
"    turn_index=2,  # Specify the turn index you want to evaluate\n",
"    agent=agent,  # Pass it to include the instructions and plugins in the evaluation data\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7813b5eb",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"file_name = \"evaluation_data.jsonl\"\n",
"# Save the agent thread data to a JSONL file (all turns)\n",
"evaluation_data = await converter.prepare_evaluation_data(threads=[thread], filename=file_name, agent=agent)\n",
"# print(json.dumps(evaluation_data, indent=4))\n",
"len(evaluation_data) # number of turns in the thread"
]
},
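{
"cell_type": "markdown",
"id": "7c2d9a41",
"metadata": {},
"source": [
"The previous cell saved the converted thread to `evaluation_data.jsonl`. The cell below is a minimal sketch, not part of the original sample, for checking which fields each saved row carries before running the evaluators. It assumes the previous cell ran successfully, that `file_name` still points at that file, and that each line holds one JSON object; the exact keys depend on the converter version, so none are hard-coded here."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f6b3e85",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Read the JSONL file back: one JSON document per non-empty line.\n",
"# `file_name` is the variable defined in the previous cell.\n",
"with open(file_name, encoding=\"utf-8\") as f:\n",
"    rows = [json.loads(line) for line in f if line.strip()]\n",
"\n",
"# List the top-level fields of every saved turn.\n",
"for i, row in enumerate(rows):\n",
"    print(f\"row {i}: keys = {sorted(row)}\")"
]
},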
{
"cell_type": "markdown",
"id": "8bf87cab",
"metadata": {},
"source": [
"### Setting up the evaluators\n",
"\n",
"We will use the following evaluators to assess different aspects of agent quality:\n",
"\n",
"- [Intent resolution](https://aka.ms/intentresolution-sample): measures the extent to which an agent identifies the correct intent from a user query. Scale: integer 1-5. Higher is better.\n",
"- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): evaluates the agent’s ability to select the appropriate tools and to process the correct parameters from previous steps. Scale: float 0-1. Higher is better.\n",
"- [Task adherence](https://aka.ms/taskadherence-sample): measures the extent to which an agent’s final response adheres to the task based on its system message and a user query. Scale: integer 1-5. Higher is better.\n"
]
},
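{
"cell_type": "markdown",
"id": "5d8e1b9a",
"metadata": {},
"source": [
"Because the evaluators above report on different scales (1-5 and 0-1), comparing them side by side can be easier after mapping each score onto 0-1. The cell below is a sketch of one such mapping, a plain linear rescale. It is not something the `azure-ai-evaluation` SDK does for you, and `normalize_score` is a hypothetical helper introduced only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c94a7f20",
"metadata": {},
"outputs": [],
"source": [
"def normalize_score(score, low, high):\n",
"    \"\"\"Hypothetical helper: linearly map a score from [low, high] onto [0, 1].\"\"\"\n",
"    if not low <= score <= high:\n",
"        raise ValueError(f\"score {score} is outside [{low}, {high}]\")\n",
"    return (score - low) / (high - low)\n",
"\n",
"\n",
"print(normalize_score(4, 1, 5))  # 1-5 scale (intent resolution, task adherence): prints 0.75\n",
"print(normalize_score(0.5, 0, 1))  # 0-1 scale (tool call accuracy): prints 0.5"
]
},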
{
"cell_type": "code",
"execution_count": null,
"id": "e6ee09df",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pprint import pprint\n",
"\n",
"from azure.ai.evaluation import (\n",
"    ToolCallAccuracyEvaluator,\n",
"    AzureOpenAIModelConfiguration,\n",
"    IntentResolutionEvaluator,\n",
"    TaskAdherenceEvaluator,\n",
")\n",
"\n",
"model_config = AzureOpenAIModelConfiguration(\n",
"    azure_endpoint=os.environ[\"AZURE_OPENAI_ENDPOINT\"],\n",
"    api_key=os.environ[\"AZURE_OPENAI_API_KEY\"],\n",
"    api_version=os.environ[\"AZURE_OPENAI_API_VERSION\"],\n",
"    azure_deployment=os.environ[\"MODEL_DEPLOYMENT_NAME\"],\n",
")\n",
"\n",
"intent_resolution = IntentResolutionEvaluator(model_config=model_config)\n",
"\n",
"tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)\n",
"\n",
"task_adherence = TaskAdherenceEvaluator(model_config=model_config)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80bd50ff",
"metadata": {},
"outputs": [],
"source": [
"# Test a single evaluation run\n",
"evaluator = ToolCallAccuracyEvaluator(model_config=model_config)\n",
"\n",
"# evaluation_data_single_run.keys() # query, response, tool_definitions\n",
"res = evaluator(**evaluation_data_single_run)\n",
"print(json.dumps(res, indent=4))"
]
},
{
"cell_type": "markdown",
"id": "06bab561",
"metadata": {},
"source": [
"#### Bonus - run on the previously saved file for all turns"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0530c0d",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.evaluation import evaluate\n",
"\n",
"response = evaluate(\n",
"    data=file_name,\n",
"    evaluators={\n",
"        \"tool_call_accuracy\": tool_call_accuracy,\n",
"        \"intent_resolution\": intent_resolution,\n",
"        \"task_adherence\": task_adherence,\n",
"    },\n",
"    azure_ai_project={\n",
"        \"subscription_id\": os.environ[\"AZURE_SUBSCRIPTION_ID\"],\n",
"        \"project_name\": os.environ[\"PROJECT_NAME\"],\n",
"        \"resource_group_name\": os.environ[\"RESOURCE_GROUP_NAME\"],\n",
"    },\n",
")\n",
"\n",
"# Use print rather than pprint so the URL is shown without surrounding quotes\n",
"print(f'AI Foundry URL: {response.get(\"studio_url\")}')"
]
},
{
"cell_type": "markdown",
"id": "ac38d924",
"metadata": {},
"source": [
"## Inspect results on Azure AI Foundry\n",
"\n",
"Open the AI Foundry URL printed above to view the evaluation scores and the reasoning behind them in Azure AI Foundry. The visualization helps you quickly identify issues with your agent so you can fix and improve it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "225ae69a",
"metadata": {},
"outputs": [],
"source": [
"# alternatively, you can use the following to get the evaluation results in memory\n",
"\n",
"# average scores across all runs\n",
"pprint(response[\"metrics\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -19,7 +19,7 @@ A general AI agent workflow typically contains a linear workflow of intent resol
- [Task adherence](https://aka.ms/taskadherence-sample): measures the extent to which an agent’s final response adheres to the task based on its system message and a user query.
- [Response Completeness](https://aka.ms/rescompleteness-sample): measures the extent to which an agent or RAG response is complete (does not miss critical information) compared to the ground truth.
- [End-to-end Azure AI agent evaluation](https://aka.ms/e2e-agent-eval-sample): create an agent using Azure AI Agent Service and seamlessly evaluate its thread and run data, via converter support.

- [End-to-end SK Chat Completion Agent evaluation](Evaluate_SK_Chat_Completion_Agent.ipynb): create an SK Chat Completion Agent and evaluate its thread and run data, via converter support.
### Objective

This tutorial provides a step-by-step guide on how to evaluate AI agents using quality evaluators. By the end of this tutorial, you should be able to: