Commit 5f88edb

kseager and Copilot authored
Evaluation Samples: add multi-turn conversation evaluation samples S1-S4 (#47034)
* feat(samples): add multi-turn conversation evaluation sample (S1)

  Add sample_multiturn_conversation_evaluation.py demonstrating:
  - Custom data source config with messages/tool_definitions schema
  - Conversation-level evaluators (customer_satisfaction, task_completion, coherence, groundedness)
  - Dataset upload and evaluation run with evaluation_level=conversation
  - Polling for results

  Includes sample JSONL with 3 conversations: basic multi-turn, tool-calling, and extended support dialog.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(samples): add S2, S3, S4 multi-turn evaluation samples

  - S2: sample_multiturn_trace_evaluation_by_id.py
    Evaluate traces by conversation_id or trace_id
  - S3: sample_multiturn_trace_evaluation_agent_filter.py
    Evaluate traces by agent name/version/id with optional smart filtering
  - S4: sample_multiturn_conversation_simulation.py
    Simulate multi-turn conversations against an agent and evaluate
  - Data: sample_data_simulation_scenarios.jsonl (3 seed scenarios for S4)

  All samples use 4 conversation-level evaluators: customer_satisfaction, task_completion, coherence, groundedness.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(samples): narrow dataset.id type for pyright

  The .id property on upload_file() returns Optional[str], which pyright flags when passed to SourceFileID(id=...) which expects str. Split the chain and add an assert to narrow the type.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* update to release version

* update to model

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent e6e06be commit 5f88edb

6 files changed

Lines changed: 743 additions & 0 deletions
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{"messages": [{"role": "system", "content": "You are a helpful travel assistant."}, {"role": "user", "content": "I need to book a flight to Paris."}, {"role": "assistant", "content": "I'd be happy to help you book a flight to Paris. What dates are you looking to travel?"}, {"role": "user", "content": "Next Friday, returning Sunday."}, {"role": "assistant", "content": "I found several options for flights departing next Friday and returning Sunday. The best value is a direct flight on Air France for $450 round trip. Would you like me to book this for you?"}]}
{"messages": [{"role": "user", "content": "What's the weather in Paris?"}, {"role": "assistant", "content": null, "tool_calls": [{"id": "call_123", "type": "function", "function": {"name": "get_weather", "arguments": "{\"location\": \"Paris\"}"}}]}, {"role": "tool", "tool_call_id": "call_123", "content": "{\"temperature\": 18, \"condition\": \"sunny\"}"}, {"role": "assistant", "content": "The weather in Paris is currently sunny with a temperature of 18\u00b0C (64\u00b0F)."}], "tool_definitions": [{"name": "get_weather", "description": "Get current weather for a location", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}}}]}
{"messages": [{"role": "system", "content": "You are a customer service agent for a software company."}, {"role": "user", "content": "I can't log into my account. I've tried resetting my password but the email never arrives."}, {"role": "assistant", "content": "I'm sorry to hear you're having trouble logging in. Let me help you with that. Can you confirm the email address associated with your account?"}, {"role": "user", "content": "It's john.doe@example.com"}, {"role": "assistant", "content": "Thank you. I can see that the reset emails are being sent to john.doe@example.com but they appear to be bouncing. It looks like there may be a typo in the email on file. Would you like me to update it?"}, {"role": "user", "content": "Oh yes, it should be john.doe@example.org"}, {"role": "assistant", "content": "I've updated your email to john.doe@example.org and sent a new password reset link. You should receive it within the next few minutes. Is there anything else I can help you with?"}]}
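Each row above carries a required `messages` array and an optional `tool_definitions` array. A minimal sketch of a pre-upload check for rows of that shape follows; the function name `validate_conversation_row` is hypothetical, and the extra requirement that every message has a `role` is an assumption drawn from the sample rows, not something the evaluation service is known to enforce.

```python
import json


def validate_conversation_row(line: str) -> list[str]:
    """Return the problems found in one JSONL row (an empty list means it looks valid)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]

    problems = []
    messages = row.get("messages")
    if not isinstance(messages, list):
        problems.append("'messages' is required and must be an array")
    else:
        # Assumption: every message is an object with a 'role' key, as in the sample data.
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or "role" not in msg:
                problems.append(f"messages[{i}] has no 'role'")
    if "tool_definitions" in row and not isinstance(row["tool_definitions"], list):
        problems.append("'tool_definitions' must be an array when present")
    return problems


good_row = '{"messages": [{"role": "user", "content": "I need to book a flight to Paris."}]}'
bad_row = '{"tool_definitions": {}}'
print(validate_conversation_row(good_row))
print(validate_conversation_row(bad_row))
```

Running a check like this over every line of the file before uploading surfaces malformed rows locally rather than as a failed evaluation run.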
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{"id": "scenario-1-greeting", "test_case_description": "User starts with a casual greeting; agent should respond warmly and offer help.", "desired_num_turns": 3}
{"id": "scenario-2-weather-followup", "test_case_description": "User asks about the weather in a major city, then asks a follow-up about whether to bring an umbrella.", "desired_num_turns": 4}
{"id": "scenario-3-store-hours", "test_case_description": "User asks if a store is open, then progressively narrows down to curbside-pickup hours, then places an order.", "desired_num_turns": 5}
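These rows are the seed scenarios for the S4 simulation sample, which generates `num_conversations` conversations per scenario. A small sketch of how the expected total could be derived from the file contents instead of hard-coding it follows; the helper name `expected_conversations` is hypothetical, and the inline JSONL is a shortened copy of the rows above.

```python
import json

# Shortened copy of the seed scenarios (descriptions omitted for brevity).
SCENARIOS_JSONL = """\
{"id": "scenario-1-greeting", "desired_num_turns": 3}
{"id": "scenario-2-weather-followup", "desired_num_turns": 4}
{"id": "scenario-3-store-hours", "desired_num_turns": 5}
"""


def expected_conversations(jsonl_text: str, num_conversations: int) -> int:
    """Count the non-empty JSONL rows and multiply by the per-scenario setting."""
    scenarios = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return len(scenarios) * num_conversations


print(expected_conversations(SCENARIOS_JSONL, num_conversations=2))  # 3 scenarios x 2 = 6
```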
Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
# pylint: disable=line-too-long,useless-suppression
# ------------------------------------
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# ------------------------------------

"""
DESCRIPTION:
    Given an AIProjectClient, this sample demonstrates how to create and run a
    multi-turn conversation evaluation using the synchronous `openai.evals.*`
    methods. Multi-turn evaluations assess complete conversations, including
    tool-calling exchanges, using conversation-level metrics such as customer
    satisfaction, task completion, coherence, and groundedness.

    This sample uses a JSONL dataset where each row contains a ``messages``
    array (and optional ``tool_definitions``). It shows how to:
    - Define a ``custom`` data source config with the conversation schema.
    - Select conversation-level evaluators with ``{{item.messages}}`` mapping.
    - Upload conversation data, create an evaluation, and run it.
    - Poll for completion and print results.

USAGE:
    python sample_multiturn_conversation_evaluation.py

    Before running the sample:

    pip install "azure-ai-projects>=2.2.0" python-dotenv

    Set these environment variables with your own values:
    1) FOUNDRY_PROJECT_ENDPOINT - Required. The Azure AI Project endpoint, as found in the overview page of your
       Microsoft Foundry project. It has the form: https://<account_name>.services.ai.azure.com/api/projects/<project_name>.
    2) FOUNDRY_MODEL_NAME - Required. The name of the model deployment to use for AI-assisted evaluators.
"""

import os
import time
from pprint import pprint
from dotenv import load_dotenv
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
    CreateEvalJSONLRunDataSourceParam,
    SourceFileID,
)
from openai.types.eval_create_params import DataSourceConfigCustom
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import TestingCriterionAzureAIEvaluator

load_dotenv()

endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
model_deployment_name = os.environ["FOUNDRY_MODEL_NAME"]

# Path to the multi-turn conversation data file
script_dir = os.path.dirname(os.path.abspath(__file__))
data_folder = os.environ.get("DATA_FOLDER", os.path.join(script_dir, "data_folder"))
data_file = os.path.join(data_folder, "sample_data_multiturn_conversations.jsonl")

with (
    DefaultAzureCredential() as credential,
    AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
    project_client.get_openai_client() as client,
):
    # Define the data source config for multi-turn conversations.
    # The item_schema declares the "messages" array and optional "tool_definitions".
    # Set include_sample_schema to False since conversation evaluators use
    # {{item.messages}} mapping rather than per-turn sample fields.
    data_source_config = DataSourceConfigCustom(
        type="custom",
        item_schema={
            "type": "object",
            "properties": {
                "messages": {"type": "array"},
                "tool_definitions": {"type": "array"},
            },
            "required": ["messages"],
        },
        include_sample_schema=False,
    )

    # Define conversation-level evaluators.
    # All evaluators map to {{item.messages}} to assess the full conversation.
    testing_criteria = [
        TestingCriterionAzureAIEvaluator(
            type="azure_ai_evaluator",
            name="customer_satisfaction",
            evaluator_name="builtin.customer_satisfaction",
            initialization_parameters={"model": model_deployment_name},
            data_mapping={"messages": "{{item.messages}}"},
        ),
        TestingCriterionAzureAIEvaluator(
            type="azure_ai_evaluator",
            name="task_completion",
            evaluator_name="builtin.task_completion",
            initialization_parameters={"model": model_deployment_name},
            data_mapping={"messages": "{{item.messages}}"},
        ),
        TestingCriterionAzureAIEvaluator(
            type="azure_ai_evaluator",
            name="conversation_coherence",
            evaluator_name="builtin.coherence",
            initialization_parameters={"model": model_deployment_name},
            data_mapping={"messages": "{{item.messages}}"},
        ),
        TestingCriterionAzureAIEvaluator(
            type="azure_ai_evaluator",
            name="groundedness",
            evaluator_name="builtin.groundedness",
            initialization_parameters={"model": model_deployment_name},
            data_mapping={"messages": "{{item.messages}}"},
        ),
    ]

    print("Creating multi-turn conversation evaluation")
    eval_object = client.evals.create(
        name="Multi-turn Conversation Evaluation",
        data_source_config=data_source_config,
        testing_criteria=testing_criteria,
    )
    print(f"Evaluation created (id: {eval_object.id})")

    # Upload the conversation dataset
    try:
        dataset = project_client.datasets.upload_file(
            name="multiturn-conversation-data",
            version="1",
            file_path=data_file,
        )
        assert dataset.id is not None, "Dataset upload returned no ID"
        data_id: str = dataset.id
        print(f"Dataset uploaded (id: {data_id})")
    except Exception:
        # Dataset already exists; use the existing URI
        account = endpoint.split("/")[2].split(".")[0]
        project = endpoint.rstrip("/").split("/")[-1]
        data_id = f"azureai://accounts/{account}/projects/{project}/data/multiturn-conversation-data/versions/1"
        print(f"Using existing dataset (id: {data_id})")

    # Create a run with evaluation_level set to "conversation"
    # so evaluators score each conversation as a whole.
    eval_run = client.evals.runs.create(
        eval_id=eval_object.id,
        name="multiturn-conversation-run",
        data_source=CreateEvalJSONLRunDataSourceParam(
            type="jsonl",
            source=SourceFileID(
                type="file_id",
                id=data_id,
            ),
        ),
        extra_body={"evaluation_level": "conversation"},
    )
    print(f"Evaluation run created (id: {eval_run.id})")

    while True:
        run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)
        # Treat "canceled" as terminal too, so a canceled run does not poll forever.
        if run.status in ("completed", "failed", "canceled"):
            break
        print(f"Waiting for eval run to complete... current status: {run.status}")
        time.sleep(5)

    if run.status == "completed":
        print("\n✓ Evaluation run completed successfully!")
        print(f"Result Counts: {run.result_counts}")

        output_items = list(client.evals.runs.output_items.list(run_id=run.id, eval_id=eval_object.id))
        print(f"\nOUTPUT ITEMS (Total: {len(output_items)})")
        print(f"{'-'*60}")
        pprint(output_items)
        print(f"{'-'*60}")

        print(f"\nEval Run Report URL: {run.report_url}")
    else:
        print(f"\n✗ Evaluation run {run.status}: {run.error}")

    client.evals.delete(eval_id=eval_object.id)
    print("Evaluation deleted")
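The sample polls the run status in an unbounded loop. A minimal sketch of the same pattern with an overall timeout follows. The helper name `wait_for_terminal_status`, the default interval and timeout, and the choice to treat "canceled" as terminal are assumptions for illustration and are not part of the sample. The status source is passed in as a callable so the sketch runs without any Azure or OpenAI client.

```python
import time
from typing import Callable

# Assumption: these are the statuses after which a run will not change again.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_terminal_status(
    fetch_status: Callable[[], str],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still '{status}' after {timeout} seconds")
        sleep(poll_interval)


# Usage with a fake status sequence and a no-op sleep, so it returns immediately.
fake_statuses = iter(["queued", "in_progress", "completed"])
final = wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda _seconds: None)
print(final)  # completed
```

In the sample, `fetch_status` could be a lambda that calls `client.evals.runs.retrieve(...)` and returns its `status`.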
Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
# pylint: disable=line-too-long,useless-suppression
# ------------------------------------
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# ------------------------------------

"""
DESCRIPTION:
    Given an AIProjectClient, this sample demonstrates how to run a conversation
    simulation evaluation against a Foundry agent. The service generates multi-turn
    conversations by simulating a user interacting with your agent based on seed
    scenarios, then evaluates the generated conversations with conversation-level
    metrics.

    This is Scenario 4 of multi-turn evaluations: you provide seed scenarios
    (each describing a test case), and the service generates full conversations
    by replaying simulated user turns against your agent. The generated
    conversations are then scored by conversation-level evaluators.

    Key concepts:
    - data_source type is "azure_ai_target_completions" with
      item_generation_params.type = "conversation_gen_preview"
    - num_conversations is per seed scenario (e.g., 2 conversations × 3 scenarios = 6 total)
    - max_turns controls the maximum exchanges per conversation
    - The seed scenarios source is at the data_source root level

USAGE:
    python sample_multiturn_conversation_simulation.py

    Before running the sample:

    pip install "azure-ai-projects>=2.0.0" python-dotenv

    Set these environment variables with your own values:
    1) FOUNDRY_PROJECT_ENDPOINT - Required. The Azure AI Project endpoint.
    2) FOUNDRY_MODEL_NAME - Required. The model deployment name for the simulator
       and AI-assisted evaluators.
    3) FOUNDRY_AGENT_NAME - Required. The name of the Foundry agent to simulate against.
"""

import os
import time
from pprint import pprint
from dotenv import load_dotenv
from openai.types.eval_create_params import DataSourceConfigCustom
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import TestingCriterionAzureAIEvaluator, PromptAgentDefinition

load_dotenv()

endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
model_deployment_name = os.environ["FOUNDRY_MODEL_NAME"]
# FOUNDRY_AGENT_NAME is documented as required, so fail fast if it is missing
# instead of passing an empty agent name to the service.
agent_name = os.environ["FOUNDRY_AGENT_NAME"]

# Path to the simulation scenarios data file
script_dir = os.path.dirname(os.path.abspath(__file__))
data_folder = os.environ.get("DATA_FOLDER", os.path.join(script_dir, "data_folder"))
scenarios_file = os.path.join(data_folder, "sample_data_simulation_scenarios.jsonl")

with (
    DefaultAzureCredential() as credential,
    AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
    project_client.get_openai_client() as client,
):
    # Create (or update) an agent to simulate against
    agent = project_client.agents.create_version(
        agent_name=agent_name,
        definition=PromptAgentDefinition(
            model=model_deployment_name,
            instructions="You are a helpful customer service agent. Be empathetic and solution-oriented.",
        ),
    )
    print(f"Agent created (name: {agent.name}, version: {agent.version})")

    # Simulation uses the same "custom" eval group type as dataset evaluation (S1),
    # since the generated conversations follow the same messages schema.
    data_source_config = DataSourceConfigCustom(
        type="custom",
        item_schema={
            "type": "object",
            "properties": {
                "messages": {"type": "array"},
            },
            "required": ["messages"],
        },
        include_sample_schema=False,
    )

    # Conversation-level evaluators
    testing_criteria = [
        TestingCriterionAzureAIEvaluator(
            type="azure_ai_evaluator",
            name="customer_satisfaction",
            evaluator_name="builtin.customer_satisfaction",
            initialization_parameters={"model": model_deployment_name},
            data_mapping={"messages": "{{item.messages}}"},
        ),
        TestingCriterionAzureAIEvaluator(
            type="azure_ai_evaluator",
            name="task_completion",
            evaluator_name="builtin.task_completion",
            initialization_parameters={"model": model_deployment_name},
            data_mapping={"messages": "{{item.messages}}"},
        ),
        TestingCriterionAzureAIEvaluator(
            type="azure_ai_evaluator",
            name="conversation_coherence",
            evaluator_name="builtin.coherence",
            initialization_parameters={"model": model_deployment_name},
            data_mapping={"messages": "{{item.messages}}"},
        ),
        TestingCriterionAzureAIEvaluator(
            type="azure_ai_evaluator",
            name="groundedness",
            evaluator_name="builtin.groundedness",
            initialization_parameters={"model": model_deployment_name},
            data_mapping={"messages": "{{item.messages}}"},
        ),
    ]

    print("Creating simulation evaluation group")
    eval_object = client.evals.create(
        name="Multi-turn Conversation Simulation",
        data_source_config=data_source_config,
        testing_criteria=testing_criteria,
    )
    print(f"Evaluation created (id: {eval_object.id})")

    # Upload the simulation scenarios dataset
    try:
        dataset = project_client.datasets.upload_file(
            name="simulation-scenarios",
            version="1",
            file_path=scenarios_file,
        )
        assert dataset.id is not None, "Dataset upload returned no ID"
        scenarios_id: str = dataset.id
        print(f"Scenarios dataset uploaded (id: {scenarios_id})")
    except Exception:
        # Dataset already exists; use the existing URI
        account = endpoint.split("/")[2].split(".")[0]
        project = endpoint.rstrip("/").split("/")[-1]
        scenarios_id = f"azureai://accounts/{account}/projects/{project}/data/simulation-scenarios/versions/1"
        print(f"Using existing scenarios dataset (id: {scenarios_id})")

    # Create a simulation run
    # - source: the seed scenarios dataset (each row is a test case)
    # - target: the agent to simulate against
    # - item_generation_params: controls conversation generation
    #   - num_conversations: conversations to generate per seed scenario
    #   - max_turns: maximum exchanges per conversation
    #   - data_mapping: maps JSONL field names to simulation parameters
    eval_run = client.evals.runs.create(
        eval_id=eval_object.id,
        name="conversation-simulation-run",
        data_source={
            "type": "azure_ai_target_completions",
            "source": {
                "type": "file_id",
                "id": scenarios_id,
            },
            "target": {
                "type": "azure_ai_agent",
                "name": agent.name,
                "version": agent.version,
            },
            "item_generation_params": {
                "type": "conversation_gen_preview",
                "model": model_deployment_name,
                "num_conversations": 2,
                "max_turns": 5,
                "sampling_params": {
                    "temperature": 0.7,
                    "top_p": 1.0,
                    "max_completion_tokens": 800,
                },
                "data_mapping": {
                    "test_case_description": "test_case_description",
                    "id": "id",
                    "desired_num_turns": "desired_num_turns",
                },
            },
        },  # type: ignore
        extra_body={"evaluation_level": "conversation"},
    )
    print(f"Simulation run created (id: {eval_run.id})")
    print("Simulation runs are slow (3-8 min). Polling...")

    while True:
        run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)
        # Treat "canceled" as terminal too, so a canceled run does not poll forever.
        if run.status in ("completed", "failed", "canceled"):
            break
        print(f"Waiting for simulation to complete... current status: {run.status}")
        time.sleep(10)

    if run.status == "completed":
        print("\n✓ Simulation run completed successfully!")
        print(f"Result Counts: {run.result_counts}")
        # With 3 seed scenarios and num_conversations=2, expect 6 total conversations
        print(f"Expected: {3 * 2} conversations (3 scenarios × 2 per scenario)")

        output_items = list(client.evals.runs.output_items.list(run_id=run.id, eval_id=eval_object.id))
        print(f"\nOUTPUT ITEMS (Total: {len(output_items)})")
        print(f"{'-'*60}")
        pprint(output_items)
        print(f"{'-'*60}")

        print(f"\nEval Run Report URL: {run.report_url}")
    else:
        print(f"\n✗ Simulation run {run.status}: {run.error}")

    client.evals.delete(eval_id=eval_object.id)
    print("Evaluation deleted")
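Both samples rebuild the dataset URI from the project endpoint by splitting the string on "/" and ".". A sketch of the same derivation using `urllib.parse` follows; the helper name `dataset_uri` and the example account and project names are hypothetical, and the URI layout is copied from the fallback branch in the samples rather than from any documented contract.

```python
from urllib.parse import urlparse


def dataset_uri(endpoint: str, name: str, version: str) -> str:
    """Build the azureai:// dataset URI used by the samples' fallback branch."""
    parsed = urlparse(endpoint)
    if not parsed.hostname:
        raise ValueError(f"endpoint has no host: {endpoint!r}")
    # The account is the first DNS label; the project is the last path segment.
    account = parsed.hostname.split(".")[0]
    project = parsed.path.rstrip("/").split("/")[-1]
    if not project:
        raise ValueError(f"endpoint has no project segment: {endpoint!r}")
    return f"azureai://accounts/{account}/projects/{project}/data/{name}/versions/{version}"


# Hypothetical endpoint in the documented form
# https://<account_name>.services.ai.azure.com/api/projects/<project_name>
example = dataset_uri(
    "https://myaccount.services.ai.azure.com/api/projects/myproject",
    name="simulation-scenarios",
    version="1",
)
print(example)
```

Parsing the URL explicitly makes a malformed endpoint fail with a clear error instead of silently producing a wrong URI.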
