
Observability Gap in Agent Run Trace and Eval Results Trace #44646

@m-gheini

Description
  • Package Name: azure-ai-projects
  • Package Version: 2.0.0b3
  • Operating System: Windows
  • Python Version: 3.12.10

Describe the bug
When running agent evaluations, Application Insights shows two separate, disconnected trace hierarchies:

  1. Agent execution traces with operation names like "Working - Processing request - Instructions", containing gen_ai.* attributes (agent.id, agent.name, response.id)
  2. Evaluation result traces (Operation ID: 00000000000000000000000000000000) with the operation name "gen_ai.evaluation.result"

There is no trace context propagation between agent runs and their evaluation results, making it difficult to correlate which evaluation results correspond to which agent execution in Application Insights queries.

The only workaround right now is to manually correlate using the response.id from the agent traces. However, these IDs are buried in custom properties and correlating them requires complex KQL joins.
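For reference, a rough sketch of that manual correlation using azure-monitor-query against a workspace-based Application Insights resource. The table name (AppDependencies), the property keys, and the join on gen_ai.response.id are assumptions for illustration and will need to be adjusted to the actual telemetry schema:

    from datetime import timedelta

    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import LogsQueryClient

    # KQL that tries to join evaluation result rows to agent run rows on a
    # shared gen_ai.response.id custom property (assumed to exist on both).
    CORRELATION_QUERY = """
    let agent_runs = AppDependencies
        | where Properties has "gen_ai.response.id"
        | project AgentOperationId = OperationId,
                  ResponseId = tostring(Properties["gen_ai.response.id"]),
                  AgentName  = tostring(Properties["gen_ai.agent.name"]);
    let eval_results = AppDependencies
        | where Name == "gen_ai.evaluation.result"
        | project EvalOperationId = OperationId,
                  ResponseId = tostring(Properties["gen_ai.response.id"]);
    agent_runs
    | join kind=inner eval_results on ResponseId
    """

    logs_client = LogsQueryClient(DefaultAzureCredential())
    result = logs_client.query_workspace(
        workspace_id="<log-analytics-workspace-id>",  # placeholder
        query=CORRELATION_QUERY,
        timespan=timedelta(days=1),
    )
    for table in result.tables:
        for row in table.rows:
            print(row)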

To Reproduce
Steps to reproduce the behavior:

  1. Create an agent evaluation using AIProjectClient.agents.get() (or create a new agent) and openai_client.evals.create()
  2. Run the evaluation with openai_client.evals.runs.create()
  3. Observe that evaluation result traces have a different trace hierarchy with no parent-child relationship to agent execution traces
    # Imports needed to run this snippet (module paths for the openai eval
    # types may vary slightly between openai package versions).
    import os
    import time
    from pprint import pprint
    from typing import Union

    from azure.identity import DefaultAzureCredential
    from azure.ai.projects import AIProjectClient
    from openai.types.eval_create_params import DataSourceConfigCustom
    from openai.types.evals.run_create_response import RunCreateResponse
    from openai.types.evals.run_retrieve_response import RunRetrieveResponse

    endpoint = os.environ.get("AZURE_AI_PROJECT_ENDPOINT", "")
    model_deployment_name = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "")

    with (
        DefaultAzureCredential() as credential,
        AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
        project_client.get_openai_client() as openai_client,
    ):
        agent = project_client.agents.get(agent_name="") # Enter agent name
        print(f"Agent retrieved (id: {agent.id}, name: {agent.name})")
        data_source_config = DataSourceConfigCustom(
            type="custom",
            item_schema={"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
            include_sample_schema=True,
        )
        testing_criteria = [
            # System Evaluation
            # 1. Task Completion
            {
                "type": "azure_ai_evaluator",
                "name": "task_completion",
                "evaluator_name": "builtin.task_completion",
                "initialization_parameters": {
                    "deployment_name": f"{model_deployment_name}",
                    # "is_reasoning_model": True # if you use an AOAI reasoning model
                },
                "data_mapping": {
                    "query": "{{item.query}}",
                    "response": "{{sample.output_text}}",
                },
            }
        ]
        eval_object = openai_client.evals.create(
            name="Agent Evaluation",
            data_source_config=data_source_config,
            testing_criteria=testing_criteria,  # type: ignore
        )
        print(f"Evaluation created (id: {eval_object.id}, name: {eval_object.name})")

        data_source = {
            "type": "azure_ai_target_completions",
            "source": {
                "type": "file_content",
                "content": [
                    {"item": {"query": ""}}, # Enter Query here
                ],
            },
            "input_messages": {
                "type": "template",
                "template": [
                    {"type": "message", "role": "user", "content": {"type": "input_text", "text": "{{item.query}}"}}
                ],
            },
            "target": {
                "type": "azure_ai_agent",
                "name": agent.name,
            },
        }

        agent_eval_run: Union[RunCreateResponse, RunRetrieveResponse] = openai_client.evals.runs.create(
            eval_id=eval_object.id, name=f"Evaluation Run for Agent {agent.name}", data_source=data_source  # type: ignore
        )
        print(f"Evaluation run created (id: {agent_eval_run.id})")
        # [END agent_evaluation_basic]

        while agent_eval_run.status not in ["completed", "failed"]:
            agent_eval_run = openai_client.evals.runs.retrieve(run_id=agent_eval_run.id, eval_id=eval_object.id)
            print(f"Waiting for eval run to complete... current status: {agent_eval_run.status}")
            time.sleep(5)

        if agent_eval_run.status == "completed":
            print("\n✓ Evaluation run completed successfully!")
            print(f"Result Counts: {agent_eval_run.result_counts}")

            output_items = list(
                openai_client.evals.runs.output_items.list(run_id=agent_eval_run.id, eval_id=eval_object.id)
            )
            print(f"\nOUTPUT ITEMS (Total: {len(output_items)})")
            print(f"{'-'*60}")
            pprint(output_items)
            print(f"{'-'*60}")
        else:
            print("\n✗ Evaluation run failed.")

Expected behavior
Option 1 (Preferred): Use proper OpenTelemetry trace context propagation

  • Evaluation result spans should be children of the agent run span
  • The trace ID and parent span ID should link evaluation results to the original agent execution
  • This follows OpenTelemetry semantic conventions for distributed tracing
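As a rough illustration of what Option 1 implies (not how the SDK behaves today), a plain OpenTelemetry sketch in which the evaluation span is started while the agent span is current, so both share a trace ID and the evaluation span gets the agent span as its parent; span names and attribute values are made up for the example:

    from opentelemetry import trace

    tracer = trace.get_tracer("agent-eval-illustration")

    with tracer.start_as_current_span("invoke_agent") as agent_span:
        # Attributes of the kind emitted today on agent execution spans.
        agent_span.set_attribute("gen_ai.agent.id", "asst_example")      # placeholder value
        agent_span.set_attribute("gen_ai.response.id", "resp_example")   # placeholder value

        # If the evaluation result span were started under the agent span's
        # context, it would inherit the trace ID and parent automatically.
        with tracer.start_as_current_span("gen_ai.evaluation.result") as eval_span:
            eval_span.set_attribute("gen_ai.evaluation.name", "task_completion")
            assert (
                eval_span.get_span_context().trace_id
                == agent_span.get_span_context().trace_id
            )

With that relationship in place, the evaluation result would show up under the agent run's end-to-end transaction in Application Insights instead of under the all-zero operation ID.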

Option 2 (Alternative): Add agent trace context to evaluation custom properties

  • Emit gen_ai.evaluation.result spans with a custom property gen_ai.evaluation.agent_trace_id that carries the trace ID of the agent run being evaluated
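A minimal sketch of Option 2, again with plain OpenTelemetry; gen_ai.evaluation.agent_trace_id is the attribute proposed above, not an existing convention, and the way the agent trace ID is captured here is only illustrative:

    from opentelemetry import trace

    tracer = trace.get_tracer("agent-eval-illustration")

    with tracer.start_as_current_span("invoke_agent") as agent_span:
        # Capture the agent run's trace ID while its span is active.
        agent_trace_id = trace.format_trace_id(agent_span.get_span_context().trace_id)

    # Later, when the evaluation result span is emitted (possibly in a
    # different trace), stamp the agent trace ID on it as a custom property
    # so a simple KQL equality filter can correlate the two.
    with tracer.start_as_current_span("gen_ai.evaluation.result") as eval_span:
        eval_span.set_attribute("gen_ai.evaluation.agent_trace_id", agent_trace_id)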

Screenshots

(screenshot attached in the original issue)

Labels

  • AI Projects
  • Evaluation: Issues related to the client library for Azure AI Evaluation
  • Service Attention: This issue is the responsibility of the Azure service team
  • customer-reported: Reported by a GitHub user external to the Azure organization
  • needs-team-attention: Needs attention from the Azure service team or SDK team
  • question: The issue doesn't require a change to the product in order to be resolved
