Evaluation Samples: add multi-turn conversation evaluation samples S1-S4 (#47034)

kseager · Copilot · web-flow · commit 5f88edbdc2a2 · 2026-05-26T10:11:34.000-07:00
* feat(samples): add multi-turn conversation evaluation sample (S1)

Add sample_multiturn_conversation_evaluation.py demonstrating:
- Custom data source config with messages/tool_definitions schema
- Conversation-level evaluators (customer_satisfaction, task_completion, coherence, groundedness)
- Dataset upload and evaluation run with evaluation_level=conversation
- Polling for results
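
The polling step can be sketched on its own as a small helper with a timeout. This is only an illustration: `wait_for_run`, `fetch_status`, and the `TERMINAL_STATUSES` set are hypothetical names that are not part of the sample, and treating "canceled" as a terminal status is an assumption.

```python
import time
from typing import Callable

# Assumed set of terminal run states; the sample itself checks only
# "completed" and "failed".
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_run(
    fetch_status: Callable[[], str],
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Run still '{status}' after {timeout_seconds} seconds")
        time.sleep(interval_seconds)
```

In the sample this could be called as `wait_for_run(lambda: client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id).status)`, which bounds the wait instead of looping indefinitely.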

Includes sample JSONL with 3 conversations: basic multi-turn, tool-calling, and extended support dialog.
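
A minimal sketch of a local pre-upload check for rows of this JSONL, assuming the shape declared in the sample's item_schema (a required `messages` array and an optional `tool_definitions` array). `validate_conversation_row` is a hypothetical helper, and the non-empty and per-message `role` checks are extra assumptions beyond what the schema states.

```python
import json
from typing import Any


def validate_conversation_row(line: str) -> dict[str, Any]:
    """Parse one JSONL line and check the fields the sample's item_schema declares."""
    row = json.loads(line)
    if not isinstance(row, dict):
        raise ValueError("each JSONL line must be a JSON object")
    messages = row.get("messages")
    # Assumption: an empty conversation is not useful to evaluate.
    if not isinstance(messages, list) or not messages:
        raise ValueError("'messages' must be a non-empty array")
    for message in messages:
        # Assumption: every message carries a 'role', as in the sample data.
        if not isinstance(message, dict) or "role" not in message:
            raise ValueError("every message needs a 'role'")
    if "tool_definitions" in row and not isinstance(row["tool_definitions"], list):
        raise ValueError("'tool_definitions', when present, must be an array")
    return row
```

Running this over each line of sample_data_multiturn_conversations.jsonl before upload would surface malformed rows locally rather than during the evaluation run.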

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(samples): add S2, S3, S4 multi-turn evaluation samples

- S2: sample_multiturn_trace_evaluation_by_id.py
  Evaluate traces by conversation_id or trace_id
- S3: sample_multiturn_trace_evaluation_agent_filter.py
  Evaluate traces by agent name/version/id with optional smart filtering
- S4: sample_multiturn_conversation_simulation.py
  Simulate multi-turn conversations against an agent and evaluate
- Data: sample_data_simulation_scenarios.jsonl (3 seed scenarios for S4)

All samples use 4 conversation-level evaluators: customer_satisfaction,
task_completion, coherence, groundedness.
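
Since the four criteria differ only in name and evaluator, the repeated blocks could be generated from a table. The sketch below builds plain dictionaries with the same keys the samples pass to `TestingCriterionAzureAIEvaluator`; `CONVERSATION_EVALUATORS` and `build_testing_criteria` are hypothetical names, and whether plain dictionaries are accepted directly by `client.evals.create` is an assumption not confirmed here.

```python
from typing import Any

# (criterion name, built-in evaluator name) pairs as they appear in the samples.
CONVERSATION_EVALUATORS = [
    ("customer_satisfaction", "builtin.customer_satisfaction"),
    ("task_completion", "builtin.task_completion"),
    ("conversation_coherence", "builtin.coherence"),
    ("groundedness", "builtin.groundedness"),
]


def build_testing_criteria(model_deployment_name: str) -> list[dict[str, Any]]:
    """Return one criterion dict per evaluator, all mapped to {{item.messages}}."""
    return [
        {
            "type": "azure_ai_evaluator",
            "name": name,
            "evaluator_name": evaluator_name,
            "initialization_parameters": {"model": model_deployment_name},
            "data_mapping": {"messages": "{{item.messages}}"},
        }
        for name, evaluator_name in CONVERSATION_EVALUATORS
    ]
```

To stay with the typed form used in the samples, each entry can be expanded as keyword arguments: `[TestingCriterionAzureAIEvaluator(**c) for c in build_testing_criteria(model_deployment_name)]`.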

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(samples): narrow dataset.id type for pyright

The .id property on upload_file() returns Optional[str], which pyright
flags when passed to SourceFileID(id=...) which expects str. Split the
chain and add an assert to narrow the type.
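
The narrowing pattern can be illustrated in isolation. `UploadedDataset` below is a hypothetical stand-in for whatever upload_file() returns, and `require_dataset_id` is a hypothetical helper. Note that this sketch swaps the commit's `assert` for an explicit `raise`, an alternative that also narrows the type but keeps the check when Python runs with `-O`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UploadedDataset:  # hypothetical stand-in for the object upload_file() returns
    id: Optional[str] = None


def require_dataset_id(dataset: UploadedDataset) -> str:
    """Return dataset.id as a plain str, raising if the upload produced no ID."""
    dataset_id = dataset.id  # Optional[str] at this point
    if dataset_id is None:
        raise ValueError("Dataset upload returned no ID")
    return dataset_id  # narrowed to str for the type checker
```

Binding the attribute to a local variable first is what lets the checker carry the narrowed type forward, which mirrors the "split the chain" step described above.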

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* update to release version

* update to model

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
diff --git a/sdk/ai/azure-ai-projects/samples/evaluations/data_folder/sample_data_multiturn_conversations.jsonl b/sdk/ai/azure-ai-projects/samples/evaluations/data_folder/sample_data_multiturn_conversations.jsonl
@@ -0,0 +1,3 @@
+{"messages": [{"role": "system", "content": "You are a helpful travel assistant."}, {"role": "user", "content": "I need to book a flight to Paris."}, {"role": "assistant", "content": "I'd be happy to help you book a flight to Paris. What dates are you looking to travel?"}, {"role": "user", "content": "Next Friday, returning Sunday."}, {"role": "assistant", "content": "I found several options for flights departing next Friday and returning Sunday. The best value is a direct flight on Air France for $450 round trip. Would you like me to book this for you?"}]}
+{"messages": [{"role": "user", "content": "What's the weather in Paris?"}, {"role": "assistant", "content": null, "tool_calls": [{"id": "call_123", "type": "function", "function": {"name": "get_weather", "arguments": "{\"location\": \"Paris\"}"}}]}, {"role": "tool", "tool_call_id": "call_123", "content": "{\"temperature\": 18, \"condition\": \"sunny\"}"}, {"role": "assistant", "content": "The weather in Paris is currently sunny with a temperature of 18\u00b0C (64\u00b0F)."}], "tool_definitions": [{"name": "get_weather", "description": "Get current weather for a location", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}}]}
+{"messages": [{"role": "system", "content": "You are a customer service agent for a software company."}, {"role": "user", "content": "I can't log into my account. I've tried resetting my password but the email never arrives."}, {"role": "assistant", "content": "I'm sorry to hear you're having trouble logging in. Let me help you with that. Can you confirm the email address associated with your account?"}, {"role": "user", "content": "It's john.doe@example.com"}, {"role": "assistant", "content": "Thank you. I can see that the reset emails are being sent to john.doe@example.com but they appear to be bouncing. It looks like there may be a typo in the email on file. Would you like me to update it?"}, {"role": "user", "content": "Oh yes, it should be john.doe@example.org"}, {"role": "assistant", "content": "I've updated your email to john.doe@example.org and sent a new password reset link. You should receive it within the next few minutes. Is there anything else I can help you with?"}]}
diff --git a/sdk/ai/azure-ai-projects/samples/evaluations/data_folder/sample_data_simulation_scenarios.jsonl b/sdk/ai/azure-ai-projects/samples/evaluations/data_folder/sample_data_simulation_scenarios.jsonl
@@ -0,0 +1,3 @@
+{"id": "scenario-1-greeting", "test_case_description": "User starts with a casual greeting; agent should respond warmly and offer help.", "desired_num_turns": 3}
+{"id": "scenario-2-weather-followup", "test_case_description": "User asks about the weather in a major city, then asks a follow-up about whether to bring an umbrella.", "desired_num_turns": 4}
+{"id": "scenario-3-store-hours", "test_case_description": "User asks if a store is open, then progressively narrows down to curbside-pickup hours, then places an order.", "desired_num_turns": 5}
diff --git a/sdk/ai/azure-ai-projects/samples/evaluations/sample_multiturn_conversation_evaluation.py b/sdk/ai/azure-ai-projects/samples/evaluations/sample_multiturn_conversation_evaluation.py
@@ -0,0 +1,176 @@
+# pylint: disable=line-too-long,useless-suppression
+# ------------------------------------
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+# ------------------------------------
+
+"""
+DESCRIPTION:
+    Given an AIProjectClient, this sample demonstrates how to create and run a
+    multi-turn conversation evaluation using the synchronous `openai.evals.*`
+    methods. Multi-turn evaluations assess complete conversations—including
+    tool-calling exchanges—using conversation-level metrics such as customer
+    satisfaction, task completion, coherence, and groundedness.
+
+    This sample uses a JSONL dataset where each row contains a ``messages``
+    array (and optional ``tool_definitions``). It shows how to:
+      - Define a ``custom`` data source config with the conversation schema.
+      - Select conversation-level evaluators with ``{{item.messages}}`` mapping.
+      - Upload conversation data, create an evaluation, and run it.
+      - Poll for completion and print results.
+
+USAGE:
+    python sample_multiturn_conversation_evaluation.py
+
+    Before running the sample:
+
+    pip install "azure-ai-projects>=2.2.0" python-dotenv
+
+    Set these environment variables with your own values:
+    1) FOUNDRY_PROJECT_ENDPOINT - Required. The Azure AI Project endpoint, as found in the overview page of your
+       Microsoft Foundry project. It has the form: https://<account_name>.services.ai.azure.com/api/projects/<project_name>.
+    2) FOUNDRY_MODEL_NAME - Required. The name of the model deployment to use for AI-assisted evaluators.
+"""
+
+import os
+import time
+from pprint import pprint
+from dotenv import load_dotenv
+from openai.types.evals.create_eval_jsonl_run_data_source_param import (
+    CreateEvalJSONLRunDataSourceParam,
+    SourceFileID,
+)
+from openai.types.eval_create_params import DataSourceConfigCustom
+from azure.identity import DefaultAzureCredential
+from azure.ai.projects import AIProjectClient
+from azure.ai.projects.models import TestingCriterionAzureAIEvaluator
+
+load_dotenv()
+
+endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
+model_deployment_name = os.environ["FOUNDRY_MODEL_NAME"]
+
+# Path to the multi-turn conversation data file
+script_dir = os.path.dirname(os.path.abspath(__file__))
+data_folder = os.environ.get("DATA_FOLDER", os.path.join(script_dir, "data_folder"))
+data_file = os.path.join(data_folder, "sample_data_multiturn_conversations.jsonl")
+
+with (
+    DefaultAzureCredential() as credential,
+    AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
+    project_client.get_openai_client() as client,
+):
+    # Define the data source config for multi-turn conversations.
+    # The item_schema declares the "messages" array and optional "tool_definitions".
+    # Set include_sample_schema to False since conversation evaluators use
+    # {{item.messages}} mapping rather than per-turn sample fields.
+    data_source_config = DataSourceConfigCustom(
+        type="custom",
+        item_schema={
+            "type": "object",
+            "properties": {
+                "messages": {"type": "array"},
+                "tool_definitions": {"type": "array"},
+            },
+            "required": ["messages"],
+        },
+        include_sample_schema=False,
+    )
+
+    # Define conversation-level evaluators.
+    # All evaluators map to {{item.messages}} to assess the full conversation.
+    testing_criteria = [
+        TestingCriterionAzureAIEvaluator(
+            type="azure_ai_evaluator",
+            name="customer_satisfaction",
+            evaluator_name="builtin.customer_satisfaction",
+            initialization_parameters={"model": model_deployment_name},
+            data_mapping={"messages": "{{item.messages}}"},
+        ),
+        TestingCriterionAzureAIEvaluator(
+            type="azure_ai_evaluator",
+            name="task_completion",
+            evaluator_name="builtin.task_completion",
+            initialization_parameters={"model": model_deployment_name},
+            data_mapping={"messages": "{{item.messages}}"},
+        ),
+        TestingCriterionAzureAIEvaluator(
+            type="azure_ai_evaluator",
+            name="conversation_coherence",
+            evaluator_name="builtin.coherence",
+            initialization_parameters={"model": model_deployment_name},
+            data_mapping={"messages": "{{item.messages}}"},
+        ),
+        TestingCriterionAzureAIEvaluator(
+            type="azure_ai_evaluator",
+            name="groundedness",
+            evaluator_name="builtin.groundedness",
+            initialization_parameters={"model": model_deployment_name},
+            data_mapping={"messages": "{{item.messages}}"},
+        ),
+    ]
+
+    print("Creating multi-turn conversation evaluation")
+    eval_object = client.evals.create(
+        name="Multi-turn Conversation Evaluation",
+        data_source_config=data_source_config,
+        testing_criteria=testing_criteria,
+    )
+    print(f"Evaluation created (id: {eval_object.id})")
+
+    # Upload the conversation dataset
+    try:
+        dataset = project_client.datasets.upload_file(
+            name="multiturn-conversation-data",
+            version="1",
+            file_path=data_file,
+        )
+        assert dataset.id is not None, "Dataset upload returned no ID"
+        data_id: str = dataset.id
+        print(f"Dataset uploaded (id: {data_id})")
+    except Exception:
+        # Dataset already exists — use the existing URI
+        account = endpoint.split("/")[2].split(".")[0]
+        project = endpoint.rstrip("/").split("/")[-1]
+        data_id = f"azureai://accounts/{account}/projects/{project}/data/multiturn-conversation-data/versions/1"
+        print(f"Using existing dataset (id: {data_id})")
+
+    # Create a run with evaluation_level set to "conversation"
+    # so evaluators score each conversation as a whole.
+    eval_run = client.evals.runs.create(
+        eval_id=eval_object.id,
+        name="multiturn-conversation-run",
+        data_source=CreateEvalJSONLRunDataSourceParam(
+            type="jsonl",
+            source=SourceFileID(
+                type="file_id",
+                id=data_id,
+            ),
+        ),
+        extra_body={"evaluation_level": "conversation"},
+    )
+    print(f"Evaluation run created (id: {eval_run.id})")
+
+    while True:
+        run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)
+        if run.status in ("completed", "failed", "canceled"):
+            break
+        print(f"Waiting for eval run to complete... current status: {run.status}")
+        time.sleep(5)
+
+    if run.status == "completed":
+        print("\n✓ Evaluation run completed successfully!")
+        print(f"Result Counts: {run.result_counts}")
+
+        output_items = list(client.evals.runs.output_items.list(run_id=run.id, eval_id=eval_object.id))
+        print(f"\nOUTPUT ITEMS (Total: {len(output_items)})")
+        print(f"{'-'*60}")
+        pprint(output_items)
+        print(f"{'-'*60}")
+
+        print(f"\nEval Run Report URL: {run.report_url}")
+    else:
+        print(f"\n✗ Evaluation run failed: {run.error}")
+
+    client.evals.delete(eval_id=eval_object.id)
+    print("Evaluation deleted")
diff --git a/sdk/ai/azure-ai-projects/samples/evaluations/sample_multiturn_conversation_simulation.py b/sdk/ai/azure-ai-projects/samples/evaluations/sample_multiturn_conversation_simulation.py
@@ -0,0 +1,212 @@
+# pylint: disable=line-too-long,useless-suppression
+# ------------------------------------
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+# ------------------------------------
+
+"""
+DESCRIPTION:
+    Given an AIProjectClient, this sample demonstrates how to run a conversation
+    simulation evaluation against a Foundry agent. The service generates multi-turn
+    conversations by simulating a user interacting with your agent based on seed
+    scenarios, then evaluates the generated conversations with conversation-level
+    metrics.
+
+    This is Scenario 4 of multi-turn evaluations: you provide seed scenarios
+    (each describing a test case), and the service generates full conversations
+    by replaying simulated user turns against your agent. The generated
+    conversations are then scored by conversation-level evaluators.
+
+    Key concepts:
+      - data_source type is "azure_ai_target_completions" with
+        item_generation_params.type = "conversation_gen_preview"
+      - num_conversations is per seed scenario (e.g., 2 conversations × 3 scenarios = 6 total)
+      - max_turns controls the maximum exchanges per conversation
+      - The seed scenarios source is at the data_source root level
+
+USAGE:
+    python sample_multiturn_conversation_simulation.py
+
+    Before running the sample:
+
+    pip install "azure-ai-projects>=2.0.0" python-dotenv
+
+    Set these environment variables with your own values:
+    1) FOUNDRY_PROJECT_ENDPOINT - Required. The Azure AI Project endpoint.
+    2) FOUNDRY_MODEL_NAME - Required. The model deployment name for the simulator
+       and AI-assisted evaluators.
+    3) FOUNDRY_AGENT_NAME - Required. The name of the Foundry agent to simulate against.
+"""
+
+import os
+import time
+from pprint import pprint
+from dotenv import load_dotenv
+from openai.types.eval_create_params import DataSourceConfigCustom
+from azure.identity import DefaultAzureCredential
+from azure.ai.projects import AIProjectClient
+from azure.ai.projects.models import TestingCriterionAzureAIEvaluator, PromptAgentDefinition
+
+load_dotenv()
+
+endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
+model_deployment_name = os.environ["FOUNDRY_MODEL_NAME"]
+agent_name = os.environ["FOUNDRY_AGENT_NAME"]
+
+# Path to the simulation scenarios data file
+script_dir = os.path.dirname(os.path.abspath(__file__))
+data_folder = os.environ.get("DATA_FOLDER", os.path.join(script_dir, "data_folder"))
+scenarios_file = os.path.join(data_folder, "sample_data_simulation_scenarios.jsonl")
+
+with (
+    DefaultAzureCredential() as credential,
+    AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
+    project_client.get_openai_client() as client,
+):
+    # Create (or update) an agent to simulate against
+    agent = project_client.agents.create_version(
+        agent_name=agent_name,
+        definition=PromptAgentDefinition(
+            model=model_deployment_name,
+            instructions="You are a helpful customer service agent. Be empathetic and solution-oriented.",
+        ),
+    )
+    print(f"Agent created (name: {agent.name}, version: {agent.version})")
+
+    # Simulation uses the same "custom" eval group type as dataset evaluation (S1),
+    # since the generated conversations follow the same messages schema.
+    data_source_config = DataSourceConfigCustom(
+        type="custom",
+        item_schema={
+            "type": "object",
+            "properties": {
+                "messages": {"type": "array"},
+            },
+            "required": ["messages"],
+        },
+        include_sample_schema=False,
+    )
+
+    # Conversation-level evaluators
+    testing_criteria = [
+        TestingCriterionAzureAIEvaluator(
+            type="azure_ai_evaluator",
+            name="customer_satisfaction",
+            evaluator_name="builtin.customer_satisfaction",
+            initialization_parameters={"model": model_deployment_name},
+            data_mapping={"messages": "{{item.messages}}"},
+        ),
+        TestingCriterionAzureAIEvaluator(
+            type="azure_ai_evaluator",
+            name="task_completion",
+            evaluator_name="builtin.task_completion",
+            initialization_parameters={"model": model_deployment_name},
+            data_mapping={"messages": "{{item.messages}}"},
+        ),
+        TestingCriterionAzureAIEvaluator(
+            type="azure_ai_evaluator",
+            name="conversation_coherence",
+            evaluator_name="builtin.coherence",
+            initialization_parameters={"model": model_deployment_name},
+            data_mapping={"messages": "{{item.messages}}"},
+        ),
+        TestingCriterionAzureAIEvaluator(
+            type="azure_ai_evaluator",
+            name="groundedness",
+            evaluator_name="builtin.groundedness",
+            initialization_parameters={"model": model_deployment_name},
+            data_mapping={"messages": "{{item.messages}}"},
+        ),
+    ]
+
+    print("Creating simulation evaluation group")
+    eval_object = client.evals.create(
+        name="Multi-turn Conversation Simulation",
+        data_source_config=data_source_config,
+        testing_criteria=testing_criteria,
+    )
+    print(f"Evaluation created (id: {eval_object.id})")
+
+    # Upload the simulation scenarios dataset
+    try:
+        dataset = project_client.datasets.upload_file(
+            name="simulation-scenarios",
+            version="1",
+            file_path=scenarios_file,
+        )
+        assert dataset.id is not None, "Dataset upload returned no ID"
+        scenarios_id: str = dataset.id
+        print(f"Scenarios dataset uploaded (id: {scenarios_id})")
+    except Exception:
+        # Dataset already exists — use the existing URI
+        scenarios_id = f"azureai://accounts/{endpoint.split('/')[2].split('.')[0]}/projects/{endpoint.rstrip('/').split('/')[-1]}/data/simulation-scenarios/versions/1"
+        print(f"Using existing scenarios dataset (id: {scenarios_id})")
+
+    # Create a simulation run
+    # - source: the seed scenarios dataset (each row is a test case)
+    # - target: the agent to simulate against
+    # - item_generation_params: controls conversation generation
+    #   - num_conversations: conversations to generate per seed scenario
+    #   - max_turns: maximum exchanges per conversation
+    #   - data_mapping: maps JSONL field names to simulation parameters
+    eval_run = client.evals.runs.create(
+        eval_id=eval_object.id,
+        name="conversation-simulation-run",
+        data_source={
+            "type": "azure_ai_target_completions",
+            "source": {
+                "type": "file_id",
+                "id": scenarios_id,
+            },
+            "target": {
+                "type": "azure_ai_agent",
+                "name": agent.name,
+                "version": agent.version,
+            },
+            "item_generation_params": {
+                "type": "conversation_gen_preview",
+                "model": model_deployment_name,
+                "num_conversations": 2,
+                "max_turns": 5,
+                "sampling_params": {
+                    "temperature": 0.7,
+                    "top_p": 1.0,
+                    "max_completion_tokens": 800,
+                },
+                "data_mapping": {
+                    "test_case_description": "test_case_description",
+                    "id": "id",
+                    "desired_num_turns": "desired_num_turns",
+                },
+            },
+        },  # type: ignore
+        extra_body={"evaluation_level": "conversation"},
+    )
+    print(f"Simulation run created (id: {eval_run.id})")
+    print("Simulation runs are slow (3-8 min). Polling...")
+
+    while True:
+        run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)
+        if run.status in ("completed", "failed", "canceled"):
+            break
+        print(f"Waiting for simulation to complete... current status: {run.status}")
+        time.sleep(10)
+
+    if run.status == "completed":
+        print("\n✓ Simulation run completed successfully!")
+        print(f"Result Counts: {run.result_counts}")
+        # With 3 seed scenarios and num_conversations=2, expect 6 total conversations
+        print(f"Expected: {3 * 2} conversations (3 scenarios × 2 per scenario)")
+
+        output_items = list(client.evals.runs.output_items.list(run_id=run.id, eval_id=eval_object.id))
+        print(f"\nOUTPUT ITEMS (Total: {len(output_items)})")
+        print(f"{'-'*60}")
+        pprint(output_items)
+        print(f"{'-'*60}")
+
+        print(f"\nEval Run Report URL: {run.report_url}")
+    else:
+        print(f"\n✗ Simulation run failed: {run.error}")
+
+    client.evals.delete(eval_id=eval_object.id)
+    print("Evaluation deleted")
diff --git a/sdk/ai/azure-ai-projects/samples/evaluations/sample_multiturn_trace_evaluation_agent_filter.py b/sdk/ai/azure-ai-projects/samples/evaluations/sample_multiturn_trace_evaluation_agent_filter.py
diff --git a/sdk/ai/azure-ai-projects/samples/evaluations/sample_multiturn_trace_evaluation_by_id.py b/sdk/ai/azure-ai-projects/samples/evaluations/sample_multiturn_trace_evaluation_by_id.py
