- Package Name: azure-ai-projects
- Package Version: 2.0.0b3
- Operating System: Windows
- Python Version: 3.12.10
Describe the bug
When running agent evaluations, Application Insights shows two separate, disconnected trace hierarchies:
- Agent execution traces with operation names like "Working - Processing request - Instructions" containing gen_ai.* attributes (agent.id, agent.name, response.id)
- Evaluation result traces (Operation ID: 00000000000000000000000000000000) with operation name "gen_ai.evaluation.result"
There is no trace context propagation between agent runs and their evaluation results, making it difficult to correlate which evaluation results correspond to which agent execution in Application Insights queries.
The only workaround right now is to manually correlate using `response.id` from the agent traces. However, these IDs are buried in custom properties and require complex KQL joins.
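For reference, the manual correlation looks roughly like the sketch below. The table and custom-property names here are assumptions (they depend on how telemetry is exported to Application Insights) and are shown only to illustrate the complexity of the join:

```kusto
// Hypothetical KQL: correlate agent spans with evaluation results via the
// response id buried in custom properties. Table/property names are assumed.
let agent_runs = dependencies
    | where name startswith "Working"
    | extend response_id = tostring(customDimensions["gen_ai.response.id"]);
let eval_results = dependencies
    | where name == "gen_ai.evaluation.result"
    | extend response_id = tostring(customDimensions["gen_ai.response.id"]);
agent_runs
| join kind=inner eval_results on response_id
```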
To Reproduce
Steps to reproduce the behavior:
- Create an agent evaluation using AIProjectClient.agents.get() (or create new agent) and openai_client.evals.create()
- Run the evaluation with openai_client.evals.runs.create()
- Observe that evaluation result traces have a different trace hierarchy with no parent-child relationship to agent execution traces
import os
import time
from pprint import pprint
from typing import Union

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from openai.types.eval_create_params import DataSourceConfigCustom
from openai.types.evals.run_create_response import RunCreateResponse
from openai.types.evals.run_retrieve_response import RunRetrieveResponse

endpoint = os.environ.get("AZURE_AI_PROJECT_ENDPOINT", "")
model_deployment_name = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "")

with (
    DefaultAzureCredential() as credential,
    AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
    project_client.get_openai_client() as openai_client,
):
    agent = project_client.agents.get(agent_name="")  # Enter agent name
    print(f"Agent retrieved (id: {agent.id}, name: {agent.name})")

    data_source_config = DataSourceConfigCustom(
        type="custom",
        item_schema={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
        include_sample_schema=True,
    )

    testing_criteria = [
        # System Evaluation
        # 1. Task Completion
        {
            "type": "azure_ai_evaluator",
            "name": "task_completion",
            "evaluator_name": "builtin.task_completion",
            "initialization_parameters": {
                "deployment_name": f"{model_deployment_name}",
                # "is_reasoning_model": True  # if you use an AOAI reasoning model
            },
            "data_mapping": {
                "query": "{{item.query}}",
                "response": "{{sample.output_text}}",
            },
        }
    ]

    eval_object = openai_client.evals.create(
        name="Agent Evaluation",
        data_source_config=data_source_config,
        testing_criteria=testing_criteria,  # type: ignore
    )
    print(f"Evaluation created (id: {eval_object.id}, name: {eval_object.name})")

    data_source = {
        "type": "azure_ai_target_completions",
        "source": {
            "type": "file_content",
            "content": [
                {"item": {"query": ""}},  # Enter Query here
            ],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {"type": "message", "role": "user", "content": {"type": "input_text", "text": "{{item.query}}"}}
            ],
        },
        "target": {
            "type": "azure_ai_agent",
            "name": agent.name,
        },
    }

    agent_eval_run: Union[RunCreateResponse, RunRetrieveResponse] = openai_client.evals.runs.create(
        eval_id=eval_object.id, name=f"Evaluation Run for Agent {agent.name}", data_source=data_source  # type: ignore
    )
    print(f"Evaluation run created (id: {agent_eval_run.id})")

    while agent_eval_run.status not in ["completed", "failed"]:
        agent_eval_run = openai_client.evals.runs.retrieve(run_id=agent_eval_run.id, eval_id=eval_object.id)
        print(f"Waiting for eval run to complete... current status: {agent_eval_run.status}")
        time.sleep(5)

    if agent_eval_run.status == "completed":
        print("\n✓ Evaluation run completed successfully!")
        print(f"Result Counts: {agent_eval_run.result_counts}")
        output_items = list(
            openai_client.evals.runs.output_items.list(run_id=agent_eval_run.id, eval_id=eval_object.id)
        )
        print(f"\nOUTPUT ITEMS (Total: {len(output_items)})")
        print(f"{'-'*60}")
        pprint(output_items)
        print(f"{'-'*60}")
    else:
        print("\n✗ Evaluation run failed.")
Expected behavior
Option 1 (Preferred): Use proper OpenTelemetry trace context propagation
- Evaluation result spans should be children of the agent run span.
- Trace ID and Parent Span ID should link evaluation results to the original agent execution.
- This follows OpenTelemetry semantic conventions for distributed tracing.

Option 2 (Alternative): Add agent trace context to evaluation custom properties
- Add a custom property to `gen_ai.evaluation.result` spans: `gen_ai.evaluation.agent_trace_id`, the trace ID from the agent run being evaluated.
Screenshots
