Azure-Samples · kdestin · Aug 11, 2025 · Jul 2, 2025 · Jul 2, 2025 · Aug 7, 2025
@@ -0,0 +1,357 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "bf5280e2",
+   "metadata": {},
+   "source": [
+    "# Evaluate Semantic Kernel AI (ChatCompletion) Agents in Azure AI Foundry"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0330c099",
+   "metadata": {},
+   "source": [
+    "## Objective\n",
+    "\n",
+    "This sample demonstrates how to evaluate Semantic Kernel AI ChatCompletionAgents in Azure AI Foundry. It provides a step-by-step guide to set up the environment, create an agent, and evaluate its performance."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b364c694",
+   "metadata": {},
+   "source": [
+    "## Time\n",
+    "You can expect to complete this sample in approximately 20 minutes."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "919c6017",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "### Packages\n",
+    "- `semantic-kernel` installed (`pip install semantic-kernel`)\n",
+    "- `azure-ai-evaluation` SDK installed\n",
+    "- An Azure OpenAI resource with a deployment configured\n",
+    "\n",
+    "Before running the sample:\n",
+    "```bash\n",
+    "pip install semantic-kernel azure-ai-projects azure-identity azure-ai-evaluation\n",
+    "```\n",
+    "\n",
+    "### Environment Variables\n",
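+    "- For **AzureChatCompletion** (Semantic Kernel agent), which reads its settings from the environment or a `.env` file:\n",
+    "  - **`AZURE_OPENAI_API_KEY`** – Azure OpenAI API key used by the agent (constructor argument `api_key`).\n",
+    "  - **`AZURE_OPENAI_CHAT_DEPLOYMENT_NAME`** – Name of the deployed chat model (e.g., `gpt-35-turbo`) used by the agent (constructor argument `chat_deployment_name`).\n",
+    "  - **`AZURE_OPENAI_ENDPOINT`** – Azure OpenAI endpoint URL (e.g., `https://<your-resource>.openai.azure.com/`) (constructor argument `endpoint`).\n",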
+    "- For **LLM Evaluation**:\n",
+    "  - **`AZURE_OPENAI_ENDPOINT`** – Azure OpenAI endpoint to be used by the evaluation LLM.\n",
+    "  - **`AZURE_OPENAI_API_KEY`** – Azure OpenAI API key for evaluation.\n",
+    "  - **`AZURE_OPENAI_API_VERSION`** – API version (e.g., `2024-05-01-preview`) for the evaluation LLM.\n",
+    "  - **`MODEL_DEPLOYMENT_NAME`** – Deployment name of the model used for evaluation, as found under the \"Name\" column in the \"Models + endpoints\" tab of your Azure AI Foundry project.\n",
+    "- For **Azure AI Foundry** (bonus):\n",
+    "  - **`AZURE_SUBSCRIPTION_ID`** – Your Azure subscription ID where the AI Foundry project is hosted.\n",
+    "  - **`PROJECT_NAME`** – Name of the Azure AI Foundry project.\n",
+    "  - **`RESOURCE_GROUP_NAME`** – Resource group containing your AI Foundry project."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ba1d6576",
+   "metadata": {},
+   "source": [
+    "### Create an AzureChatCompletion service - [reference](https://learn.microsoft.com/en-us/semantic-kernel/concepts/ai-services/chat-completion/?tabs=csharp-AzureOpenAI%2Cpython-AzureOpenAI%2Cjava-AzureOpenAI&pivots=programming-language-python)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7dc6ce40",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion\n",
+    "\n",
+    "# You can do the following if you have set the necessary environment variables or created a .env file\n",
+    "chat_completion_service = AzureChatCompletion(service_id=\"my-service-id\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ef319288",
+   "metadata": {},
+   "source": [
+    "### Create a ChatCompletionAgent - [reference](https://learn.microsoft.com/en-us/semantic-kernel/frameworks/agent/agent-types/chat-completion-agent?pivots=programming-language-python)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "76781359",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from semantic_kernel.functions import kernel_function\n",
+    "from typing import Annotated\n",
+    "\n",
+    "\n",
+    "# This is a sample plugin that provides tools\n",
+    "class MenuPlugin:\n",
+    "    \"\"\"A sample Menu Plugin used for the concept sample.\"\"\"\n",
+    "\n",
+    "    @kernel_function(description=\"Provides a list of specials from the menu.\")\n",
+    "    def get_specials(self) -> Annotated[str, \"Returns the specials from the menu.\"]:\n",
+    "        return \"\"\"\n",
+    "        Special Soup: Clam Chowder\n",
+    "        Special Salad: Cobb Salad\n",
+    "        Special Drink: Chai Tea\n",
+    "        \"\"\"\n",
+    "\n",
+    "    @kernel_function(description=\"Provides the price of the requested menu item.\")\n",
+    "    def get_item_price(\n",
+    "        self, menu_item: Annotated[str, \"The name of the menu item.\"]\n",
+    "    ) -> Annotated[str, \"Returns the price of the menu item.\"]:\n",
+    "        _ = menu_item  # This is just to simulate a function that uses the input.\n",
+    "        return \"$9.99\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d6abead3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from semantic_kernel.agents import ChatCompletionAgent\n",
+    "\n",
+    "# Create the agent by directly providing the chat completion service\n",
+    "agent = ChatCompletionAgent(\n",
+    "    service=chat_completion_service,\n",
+    "    name=\"Chef\",\n",
+    "    instructions=\"Answer questions about the menu.\",\n",
+    "    plugins=[MenuPlugin()],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3b7b9ba3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "thread = None\n",
+    "\n",
+    "user_inputs = [\n",
+    "    \"Hello\",\n",
+    "    \"What is the special drink today?\",\n",
+    "    \"What does that cost?\",\n",
+    "    \"Thank you\",\n",
+    "]\n",
+    "\n",
+    "for user_input in user_inputs:\n",
+    "    response = await agent.get_response(messages=user_input, thread=thread)\n",
+    "    print(f\"## User: {user_input}\")\n",
+    "    print(f\"## {response.name}: {response}\\n\")\n",
+    "    thread = response.thread"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2586d3e5",
+   "metadata": {},
+   "source": [
+    "### Converter"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fcd6ac41",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from azure.ai.evaluation import SKAgentConverter\n",
+    "\n",
+    "# Get the available turn indices for the thread;\n",
+    "# useful for selecting a specific turn for evaluation\n",
+    "turn_indices = await SKAgentConverter._get_thread_turn_indices(thread=thread)\n",
+    "print(f\"Available turn indices: {turn_indices}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d1d4ae12",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "converter = SKAgentConverter()\n",
+    "\n",
+    "# Get the evaluation data for a single agent run\n",
+    "evaluation_data_single_run = await converter.convert(\n",
+    "    thread=thread,\n",
+    "    turn_index=2,  # Specify the turn index you want to evaluate\n",
+    "    agent=agent,  # Pass it to include the instructions and plugins in the evaluation data\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7813b5eb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "file_name = \"evaluation_data.jsonl\"\n",
+    "# Save the agent thread data to a JSONL file (all turns)\n",
+    "evaluation_data = await converter.prepare_evaluation_data(threads=[thread], filename=file_name, agent=agent)\n",
+    "# print(json.dumps(evaluation_data, indent=4))\n",
+    "len(evaluation_data)  # number of turns in the thread"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8bf87cab",
+   "metadata": {},
+   "source": [
+    "### Setting up evaluator\n",
+    "\n",
+    "We select the following evaluators to assess the aspects most relevant to agent quality:\n",
+    "\n",
+    "- [Intent resolution](https://aka.ms/intentresolution-sample): measures the extent to which an agent identifies the correct intent from a user query. Scale: integer 1-5. Higher is better.\n",
+    "- [Tool call accuracy](https://aka.ms/toolcallaccuracy-sample): evaluates the agent’s ability to select the appropriate tools and pass the correct parameters from previous steps. Scale: float 0-1. Higher is better.\n",
+    "- [Task adherence](https://aka.ms/taskadherence-sample): measures the extent to which an agent’s final response adheres to the task based on its system message and a user query. Scale: integer 1-5. Higher is better.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e6ee09df",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from pprint import pprint\n",
+    "\n",
+    "from azure.ai.evaluation import (\n",
+    "    ToolCallAccuracyEvaluator,\n",
+    "    AzureOpenAIModelConfiguration,\n",
+    "    IntentResolutionEvaluator,\n",
+    "    TaskAdherenceEvaluator,\n",
+    ")\n",
+    "\n",
+    "model_config = AzureOpenAIModelConfiguration(\n",
+    "    azure_endpoint=os.environ[\"AZURE_OPENAI_ENDPOINT\"],\n",
+    "    api_key=os.environ[\"AZURE_OPENAI_API_KEY\"],\n",
+    "    api_version=os.environ[\"AZURE_OPENAI_API_VERSION\"],\n",
+    "    azure_deployment=os.environ[\"MODEL_DEPLOYMENT_NAME\"],\n",
+    ")\n",
+    "\n",
+    "intent_resolution = IntentResolutionEvaluator(model_config=model_config)\n",
+    "\n",
+    "tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)\n",
+    "\n",
+    "task_adherence = TaskAdherenceEvaluator(model_config=model_config)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "80bd50ff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Test a single evaluation run\n",
+    "evaluator = ToolCallAccuracyEvaluator(model_config=model_config)\n",
+    "\n",
+    "# evaluation_data_single_run.keys() # query, response, tool_definitions\n",
+    "res = evaluator(**evaluation_data_single_run)\n",
+    "print(json.dumps(res, indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "06bab561",
+   "metadata": {},
+   "source": [
+    "#### Bonus - run on the previously saved file for all turns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c0530c0d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from azure.ai.evaluation import evaluate\n",
+    "\n",
+    "response = evaluate(\n",
+    "    data=file_name,\n",
+    "    evaluators={\n",
+    "        \"tool_call_accuracy\": tool_call_accuracy,\n",
+    "        \"intent_resolution\": intent_resolution,\n",
+    "        \"task_adherence\": task_adherence,\n",
+    "    },\n",
+    "    azure_ai_project={\n",
+    "        \"subscription_id\": os.environ[\"AZURE_SUBSCRIPTION_ID\"],\n",
+    "        \"project_name\": os.environ[\"PROJECT_NAME\"],\n",
+    "        \"resource_group_name\": os.environ[\"RESOURCE_GROUP_NAME\"],\n",
+    "    },\n",
+    ")\n",
+    "\n",
+    "print(f'AI Foundry URL: {response.get(\"studio_url\")}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ac38d924",
+   "metadata": {},
+   "source": [
+    "## Inspect results on Azure AI Foundry\n",
+    "\n",
+    "Open the AI Foundry URL printed above to view the evaluation results in Azure AI Foundry. The scores and the evaluators' reasoning help you quickly identify issues with your agent and decide what to fix."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "225ae69a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Alternatively, inspect the evaluation results in memory\n",
+    "\n",
+    "# average scores across all runs\n",
+    "pprint(response[\"metrics\"])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -19,7 +19,7 @@ A general AI agent workflow typically contains a linear workflow of intent resol
 - [Task adherence](https://aka.ms/taskadherence-sample): measures the extent of which an agent’s final response adheres to the task based on its system message and a user query.
 - [Response Completeness](https://aka.ms/rescompleteness-sample): measures the extent of which an agent or RAG response is complete (does not miss critical information) compared to the ground truth.
 - [End-to-end Azure AI agent evaluation](https://aka.ms/e2e-agent-eval-sample): create an agent using Azure AI Agent Service and seamlessly evaluate its thread and run data, via converter support.
-
+- [End-to-end SK Chat Completion Agent evaluation](Evaluate_SK_Chat_Completion_Agent.ipynb): create an SK Chat Completion Agent and evaluate its thread and run data, via converter support.
 ### Objective
 
 This tutorial provides a step-by-step guide on how to evaluate AI agents using quality evaluators. By the end of this tutorial, you should be able to: