
Observability Gap in Agent Run Trace and Eval Results Trace #44646

@m-gheini

Description
  • Package Name: azure-ai-projects
  • Package Version: 2.0.0b3
  • Operating System: Windows
  • Python Version: 3.12.10

Describe the bug
When running agent evaluations, Application Insights shows two separate, disconnected trace hierarchies:

  1. Agent execution traces with operation names like "Working - Processing request - Instructions", containing gen_ai.* attributes (agent.id, agent.name, response.id)
  2. Evaluation result traces (Operation ID: 00000000000000000000000000000000) with the operation name "gen_ai.evaluation.result"

There is no trace context propagation between agent runs and their evaluation results, making it difficult to correlate which evaluation results correspond to which agent execution in Application Insights queries.

The only workaround right now is to manually correlate using the response.id from the agent traces. However, these IDs are buried in custom properties and correlating them requires complex KQL joins.
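For reference, a rough sketch of that manual correlation using azure-monitor-query against a workspace-based Application Insights resource. The table name (AppDependencies), the property keys, and the join on gen_ai.response.id are assumptions for illustration and will need to be adjusted to the actual telemetry schema:

    from datetime import timedelta

    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import LogsQueryClient

    # KQL that tries to join evaluation result rows to agent run rows on a
    # shared gen_ai.response.id custom property (assumed to exist on both).
    CORRELATION_QUERY = """
    let agent_runs = AppDependencies
        | where Properties has "gen_ai.response.id"
        | project AgentOperationId = OperationId,
                  ResponseId = tostring(Properties["gen_ai.response.id"]),
                  AgentName  = tostring(Properties["gen_ai.agent.name"]);
    let eval_results = AppDependencies
        | where Name == "gen_ai.evaluation.result"
        | project EvalOperationId = OperationId,
                  ResponseId = tostring(Properties["gen_ai.response.id"]);
    agent_runs
    | join kind=inner eval_results on ResponseId
    """

    logs_client = LogsQueryClient(DefaultAzureCredential())
    result = logs_client.query_workspace(
        workspace_id="<log-analytics-workspace-id>",  # placeholder
        query=CORRELATION_QUERY,
        timespan=timedelta(days=1),
    )
    for table in result.tables:
        for row in table.rows:
            print(row)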

To Reproduce
Steps to reproduce the behavior:

  1. Create an agent evaluation using AIProjectClient.agents.get() (or create a new agent) and openai_client.evals.create()
  2. Run the evaluation with openai_client.evals.runs.create()
  3. Observe that evaluation result traces have a different trace hierarchy with no parent-child relationship to agent execution traces
    # Imports needed to run this snippet (module paths for the openai eval
    # types may vary slightly between openai package versions).
    import os
    import time
    from pprint import pprint
    from typing import Union

    from azure.identity import DefaultAzureCredential
    from azure.ai.projects import AIProjectClient
    from openai.types.eval_create_params import DataSourceConfigCustom
    from openai.types.evals.run_create_response import RunCreateResponse
    from openai.types.evals.run_retrieve_response import RunRetrieveResponse

    endpoint = os.environ.get("AZURE_AI_PROJECT_ENDPOINT", "")
    model_deployment_name = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "")

    with (
        DefaultAzureCredential() as credential,
        AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
        project_client.get_openai_client() as openai_client,
    ):
        agent = project_client.agents.get(agent_name="") # Enter agent name
        print(f"Agent retrieved (id: {agent.id}, name: {agent.name})")
        data_source_config = DataSourceConfigCustom(
            type="custom",
            item_schema={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
            include_sample_schema=True,
        )
        testing_criteria = [
            # System Evaluation
            # 1. Task Completion
            {
                "type": "azure_ai_evaluator",
                "name": "task_completion",
                "evaluator_name": "builtin.task_completion",
                "initialization_parameters": {
                    "deployment_name": f"{model_deployment_name}",
                    # "is_reasoning_model": True # if you use an AOAI reasoning model
                },
                "data_mapping": {
                    "query": "{{item.query}}",
                    "response": "{{sample.output_text}}",
                },
            }
        ]
        eval_object = openai_client.evals.create(
            name="Agent Evaluation",
            data_source_config=data_source_config,
            testing_criteria=testing_criteria,  # type: ignore
        )
        print(f"Evaluation created (id: {eval_object.id}, name: {eval_object.name})")

        data_source = {
            "type": "azure_ai_target_completions",
            "source": {
                "type": "file_content",
                "content": [
                    {"item": {"query": ""}}, # Enter Query here
                ],
            },
            "input_messages": {
                "type": "template",
                "template": [
                    {"type": "message", "role": "user", "content": {"type": "input_text", "text": "{{item.query}}"}}
                ],
            },
            "target": {
                "type": "azure_ai_agent",
                "name": agent.name,
            },
        }

        agent_eval_run: Union[RunCreateResponse, RunRetrieveResponse] = openai_client.evals.runs.create(
            eval_id=eval_object.id, name=f"Evaluation Run for Agent {agent.name}", data_source=data_source  # type: ignore
        )
        print(f"Evaluation run created (id: {agent_eval_run.id})")
        # [END agent_evaluation_basic]

        while agent_eval_run.status not in ["completed", "failed"]:
            agent_eval_run = openai_client.evals.runs.retrieve(run_id=agent_eval_run.id, eval_id=eval_object.id)
            print(f"Waiting for eval run to complete... current status: {agent_eval_run.status}")
            time.sleep(5)

        if agent_eval_run.status == "completed":
            print("\n✓ Evaluation run completed successfully!")
            print(f"Result Counts: {agent_eval_run.result_counts}")

            output_items = list(
                openai_client.evals.runs.output_items.list(run_id=agent_eval_run.id, eval_id=eval_object.id)
            )
            print(f"\nOUTPUT ITEMS (Total: {len(output_items)})")
            print(f"{'-'*60}")
            pprint(output_items)
            print(f"{'-'*60}")
        else:
            print("\n✗ Evaluation run failed.")

Expected behavior
Option 1 (Preferred): Use proper OpenTelemetry trace context propagation

  • Evaluation result spans should be children of the agent run span
  • The trace ID and parent span ID should link evaluation results to the original agent execution
  • This follows OpenTelemetry semantic conventions for distributed tracing
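As a rough illustration of what Option 1 implies (not how the SDK behaves today), a plain OpenTelemetry sketch in which the evaluation span is started while the agent span is current, so both share a trace ID and the evaluation span gets the agent span as its parent; span names and attribute values are made up for the example:

    from opentelemetry import trace

    tracer = trace.get_tracer("agent-eval-illustration")

    with tracer.start_as_current_span("invoke_agent") as agent_span:
        # Attributes of the kind emitted today on agent execution spans.
        agent_span.set_attribute("gen_ai.agent.id", "asst_example")      # placeholder value
        agent_span.set_attribute("gen_ai.response.id", "resp_example")   # placeholder value

        # If the evaluation result span were started under the agent span's
        # context, it would inherit the trace ID and parent automatically.
        with tracer.start_as_current_span("gen_ai.evaluation.result") as eval_span:
            eval_span.set_attribute("gen_ai.evaluation.name", "task_completion")
            assert (
                eval_span.get_span_context().trace_id
                == agent_span.get_span_context().trace_id
            )

With that relationship in place, the evaluation result would show up under the agent run's end-to-end transaction in Application Insights instead of under the all-zero operation ID.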

Option 2 (Alternative): Add agent trace context to evaluation custom properties

  • Emit gen_ai.evaluation.result spans with a custom property gen_ai.evaluation.agent_trace_id that carries the trace ID of the agent run being evaluated
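A minimal sketch of Option 2, again with plain OpenTelemetry; gen_ai.evaluation.agent_trace_id is the attribute proposed above, not an existing convention, and the way the agent trace ID is captured here is only illustrative:

    from opentelemetry import trace

    tracer = trace.get_tracer("agent-eval-illustration")

    with tracer.start_as_current_span("invoke_agent") as agent_span:
        # Capture the agent run's trace ID while its span is active.
        agent_trace_id = trace.format_trace_id(agent_span.get_span_context().trace_id)

    # Later, when the evaluation result span is emitted (possibly in a
    # different trace), stamp the agent trace ID on it as a custom property
    # so a simple KQL equality filter can correlate the two.
    with tracer.start_as_current_span("gen_ai.evaluation.result") as eval_span:
        eval_span.set_attribute("gen_ai.evaluation.agent_trace_id", agent_trace_id)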

Screenshots

(screenshot attached in the original issue)

Labels

  • AI Projects
  • Evaluation: Issues related to the client library for Azure AI Evaluation
  • Service Attention: This issue is the responsibility of the Azure service team
  • customer-reported: Reported by a GitHub user external to the Azure organization
  • needs-team-attention: Needs attention from the Azure service team or SDK team
  • question: The issue doesn't require a change to the product in order to be resolved
