Auto-fetch execution trace for single-session quality report

evekhm · evekhm · commit a52d99677b28 · 2026-05-27T23:05:06.000Z
When --session is used, the execution trajectory is now fetched
automatically from BigQuery (unless trajectories were already loaded or a
conversations file was supplied) and printed to the console, segmented into
sub-trajectories at correction boundaries. The sample report is updated with
real data showing the full output, including the trace tree and segmentation.
diff --git a/scripts/quality_report.py b/scripts/quality_report.py
@@ -1613,6 +1613,49 @@ def run_eval(args):
     else:
       logger.warning("No trajectories fetched (BQ may not be configured)")
 
+  # Single-session mode: fetch the trajectory from BQ if none was loaded above
+  if args.session and not trajectories and not conversations_file:
+    trajectories = _fetch_session_traces([args.session], max_sessions=1)
+    if trajectories:
+      for sid, trace_obj in trajectories.items():
+        ctx = result["resolved_map"].get(sid)
+        if ctx and ctx.get("answered_by") == "unknown":
+          ctx["answered_by"] = get_responding_agent(trace_obj)
+
+  # Print execution trace to console for single-session mode
+  if args.session and trajectories:
+    trace_obj = trajectories.get(args.session)
+    if trace_obj:
+      hr = "─" * 70
+      print(f"\n{'=' * 70}")
+      print("EXECUTION TRACE")
+      print(f"{'=' * 70}")
+      print(_render_trace(trace_obj))
+      ctx = result["resolved_map"].get(args.session, {})
+      sub_trajs = ctx.get("sub_trajectories", [])
+      conversation = ctx.get("conversation", [])
+      if sub_trajs and conversation:
+        segments = _segment_trace_by_turns(
+            trace_obj, conversation, sub_trajs,
+        )
+        if segments:
+          print(f"\n{hr}")
+          print("  SUB-TRAJECTORY SEGMENTATION")
+          print(hr)
+          for seg in segments:
+            icon = (
+                "✅" if seg["outcome"] in ("correct", "recovered")
+                else "❌"
+            )
+            print(
+                f"\n  {icon} {seg['label']} "
+                f"(turns {seg['start_turn']}-{seg['end_turn']}) "
+                f"→ {seg['outcome']}"
+            )
+            for line in seg["trace"].split("\n"):
+              print(f"  {line}")
+      print(f"{'=' * 70}\n")
+
   report_path = None
   md_dir = None
   if args.output_json and args.output_json != "-":
diff --git a/scripts/sample_quality_report_session.md b/scripts/sample_quality_report_session.md
@@ -1,68 +1,143 @@
 # Single-Session Quality Report
 
-Console output generated by `python scripts/quality_report.py --session conv_481f43fa`.
+Console output generated by `python scripts/quality_report.py --session conv_5d77036b --tag-turns`.
 
 When `--session` is used, all 7 metrics are shown with full justifications (verbose mode).
+The execution trace is fetched automatically from BigQuery and displayed after the summary.
 
 ```text
 ──────────────────────────────────────────────────────────────────────
-  PARTIAL Sessions (showing 1 of 1)
+  MEANINGFUL Sessions (showing 1 of 1)
 ──────────────────────────────────────────────────────────────────────
 
-  Session:     conv_481f43fa  → policy_agent
-  Question:    You mentioned I will accrue an additional 13.3 days of PTO this year,
-               which, combined with the 7.8 days already accrued, totals 21.1 days.
-               My onboarding packet states that PTO is 20 days per year.
-  Response:    "According to the company policy, employees receive 20 days of PTO
-               per year, accrued monthly at approximately 1.67 days per month."
-  Usefulness:    ⚠️  PARTIAL
-  Reason:        The agent provided specific numbers for PTO balance, accrued, used,
-                 and sick leave, which is useful. However, the calculation for
-                 additional PTO accrual was incorrect, leading to a total that
-                 contradicted company policy, making the overall response partially
-                 useful but also misleading.
+  Session:     conv_5d77036b  → policy_agent
+  Question:    My onboarding packet says that Juneteenth is NOT a company
+               holiday. Can you please check the official policy?
+  Response:    "Thank you for bringing this to my attention and for asking
+               me to verify! I've looked up the official company policy for
+               holidays. Based on the 2026 holiday schedule, Juneteenth
+               (June 19th) is **not** listed as a company holiday. It seems
+               your onboarding packet has the most accurate information in
+               this case."
+  Usefulness:    ✅ HELPFUL
+  Reason:        The final response from the policy_agent correctly answers
+                 the user's question after the correction, providing specific
+                 and actionable information about Juneteenth not being a
+                 company holiday based on the official policy.
   Grounding:     ✅ GROUNDED
-  Reason:        The agent called the `calculate_pto_details` tool, indicating the
-                 response was based on tool usage to retrieve specific data.
-  Dimensions:    Correctness: ⚠️  MOSTLY CORRECT | Tool Usage: ✅ PROPER |
+  Reason:        The policy_agent explicitly states it looked up the official
+                 company policy and based its answer on the 2026 holiday
+                 schedule, indicating tool usage.
+  Dimensions:    Correctness: ✅ CORRECT | Tool Usage: ✅ PROPER |
                  Specificity: ✅ SPECIFIC | Scope: ✅ COMPLIANT |
                  First-Time Right: ❌ CORRECTION NEEDED
 
 ======================================================================
 QUALITY SUMMARY
 ======================================================================
   Total sessions evaluated : 1
-  Meaningful               : 0
+  Meaningful               : 1
   Declined (out-of-scope)  : 0
-  Partial                  : 1
+  Partial                  : 0
   Unhelpful                : 0
   Unhelpful rate           : 0.0%
 
   Quality Dimensions (0-2 scale):
-    Correctness         : 1.00 / 2.00  #########################
+    Correctness         : 2.00 / 2.00  ##################################################
     Tool Usage          : 2.00 / 2.00  ##################################################
     Specificity         : 2.00 / 2.00  ##################################################
     Scope               : 2.00 / 2.00  ##################################################
     First-Time Right    : 0.00 / 2.00
 
   Multi-Turn Efficiency:
     Avg user turns       : 2.0
-    Avg tool calls       : 5.0
+    Avg tool calls       : 2.0
     Multi-turn sessions  : 1
     Correction rate      : 100.0%
     Verification rate    : 0.0%
 
   Category Distributions:
 
   [response_usefulness]
-    ⚠️  PARTIAL       :    1  (100.0%) ##################################################
+    ✅ HELPFUL         :    1  (100.0%) ##################################################
 
   [task_grounding]
     ✅ GROUNDED        :    1  (100.0%) ##################################################
 
   Execution Details:
     execution_mode: ai_generate
-    elapsed_seconds: 32.7
+    elapsed_seconds: 23.4
     eval_model: gemini-2.5-flash
+
+======================================================================
+
+======================================================================
+EXECUTION TRACE
+======================================================================
+Session: conv_5d77036b
+Time: 17:37:54  Total: 1.1min
+──────────────────────────────────────────────────────────────────────
+├── knowledge_supervisor > USER_MESSAGE_RECEIVED
+├── knowledge_supervisor > INVOCATION_STARTING
+├── knowledge_supervisor > INVOCATION_COMPLETED [14.7s]
+│   ├── knowledge_supervisor > AGENT_STARTING
+│   └── knowledge_supervisor > AGENT_COMPLETED [2.1s]
+│       ├── knowledge_supervisor > LLM_REQUEST
+│       └── knowledge_supervisor > LLM_RESPONSE [2.0s, ttft=2.0s]
+├── knowledge_supervisor > USER_MESSAGE_RECEIVED
+├── knowledge_supervisor > INVOCATION_STARTING
+└── knowledge_supervisor > INVOCATION_COMPLETED [1.0min]
+    ├── knowledge_supervisor > AGENT_STARTING
+    └── knowledge_supervisor > AGENT_COMPLETED [1.0min]
+        ├── knowledge_supervisor > LLM_REQUEST
+        ├── knowledge_supervisor > LLM_RESPONSE [5.5s, ttft=5.5s]
+        ├── knowledge_supervisor > TOOL_STARTING (transfer_to_agent)
+        ├── knowledge_supervisor > TOOL_COMPLETED (transfer_to_agent) [0ms]
+        ├── policy_agent > AGENT_STARTING
+        └── policy_agent > AGENT_COMPLETED [56.0s]
+            ├── policy_agent > LLM_REQUEST
+            ├── policy_agent > LLM_RESPONSE [20.2s, ttft=20.2s]
+            ├── policy_agent > TOOL_STARTING (lookup_company_policy)
+            ├── policy_agent > TOOL_COMPLETED (lookup_company_policy) [0ms]
+            ├── policy_agent > LLM_REQUEST
+            └── policy_agent > LLM_RESPONSE [35.7s, ttft=35.7s]
+
+──────────────────────────────────────────────────────────────────────
+  SUB-TRAJECTORY SEGMENTATION
+──────────────────────────────────────────────────────────────────────
+
+  ❌ pre_correction_1 (turns 0-1) → wrong
+  ├── knowledge_supervisor > USER_MESSAGE_RECEIVED
+  ├── knowledge_supervisor > INVOCATION_STARTING
+  └── knowledge_supervisor > INVOCATION_COMPLETED [14.7s]
+      ├── knowledge_supervisor > AGENT_STARTING
+      └── knowledge_supervisor > AGENT_COMPLETED [2.1s]
+          ├── knowledge_supervisor > LLM_REQUEST
+          └── knowledge_supervisor > LLM_RESPONSE [2.0s, ttft=2.0s]
+
+  ✅ post_correction_1 (turns 2-3) → recovered
+  ├── knowledge_supervisor > USER_MESSAGE_RECEIVED
+  ├── knowledge_supervisor > INVOCATION_STARTING
+  └── knowledge_supervisor > INVOCATION_COMPLETED [1.0min]
+      ├── knowledge_supervisor > AGENT_STARTING
+      └── knowledge_supervisor > AGENT_COMPLETED [1.0min]
+          ├── knowledge_supervisor > LLM_REQUEST
+          ├── knowledge_supervisor > LLM_RESPONSE [5.5s, ttft=5.5s]
+          ├── knowledge_supervisor > TOOL_STARTING (transfer_to_agent)
+          ├── knowledge_supervisor > TOOL_COMPLETED (transfer_to_agent) [0ms]
+          ├── policy_agent > AGENT_STARTING
+          └── policy_agent > AGENT_COMPLETED [56.0s]
+              ├── policy_agent > LLM_REQUEST
+              ├── policy_agent > LLM_RESPONSE [20.2s, ttft=20.2s]
+              ├── policy_agent > TOOL_STARTING (lookup_company_policy)
+              ├── policy_agent > TOOL_COMPLETED (lookup_company_policy) [0ms]
+              ├── policy_agent > LLM_REQUEST
+              └── policy_agent > LLM_RESPONSE [35.7s, ttft=35.7s]
 ======================================================================
 ```
+
+The execution trace reveals:
+- **First exchange (turns 0-1, wrong):** The supervisor answered directly from LLM knowledge, with no routing and no tool call, and incorrectly stated that Juneteenth is a holiday.
+- **Second exchange (turns 2-3, recovered):** After the user's correction, the supervisor routed via `transfer_to_agent` to the `policy_agent`, which called `lookup_company_policy` and returned the correct answer.
+
+The sub-trajectory segmentation splits the trace at the correction boundary, making it easy to see what changed between the failed and recovered attempts.
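
The idea of splitting a trace at the correction boundary can be illustrated with a minimal sketch. This is not the `_segment_trace_by_turns` implementation from `quality_report.py`, whose body is not shown in this commit. The function names `split_invocations` and `segment_trace`, the flat list of event names, and the assumption that conversation turns strictly alternate user/agent (so turn `2k` belongs to invocation `k`) are all hypothetical. Only the segment keys (`label`, `start_turn`, `end_turn`, `outcome`, `trace`) and the icon rule mirror what the diff above prints.

```python
# Hypothetical sketch: segment a flat event list by sub-trajectory turn ranges.
# Assumes each USER_MESSAGE_RECEIVED event opens a new invocation and that
# turns alternate user/agent, so turn index 2k maps to invocation k.

def split_invocations(events):
    """Group events into chunks, starting a new chunk at each user message."""
    chunks = []
    for ev in events:
        if ev == "USER_MESSAGE_RECEIVED" or not chunks:
            chunks.append([])
        chunks[-1].append(ev)
    return chunks


def segment_trace(events, sub_trajs):
    """Attach the matching slice of the trace to each sub-trajectory."""
    chunks = split_invocations(events)
    segments = []
    for st in sub_trajs:
        first = st["start_turn"] // 2  # invocation holding the first turn
        last = st["end_turn"] // 2     # invocation holding the last turn
        picked = [ev for chunk in chunks[first:last + 1] for ev in chunk]
        segments.append({**st, "trace": "\n".join(picked)})
    return segments


# Illustrative input shaped like the sample session (event names only).
events = [
    "USER_MESSAGE_RECEIVED", "LLM_REQUEST", "LLM_RESPONSE",
    "USER_MESSAGE_RECEIVED", "LLM_REQUEST", "TOOL_STARTING",
    "TOOL_COMPLETED", "LLM_RESPONSE",
]
sub_trajs = [
    {"label": "pre_correction_1", "start_turn": 0, "end_turn": 1,
     "outcome": "wrong"},
    {"label": "post_correction_1", "start_turn": 2, "end_turn": 3,
     "outcome": "recovered"},
]

demo_segments = segment_trace(events, sub_trajs)
for seg in demo_segments:
    # Same icon rule as the console output in the diff.
    icon = "✅" if seg["outcome"] in ("correct", "recovered") else "❌"
    print(f"{icon} {seg['label']} "
          f"(turns {seg['start_turn']}-{seg['end_turn']}) → {seg['outcome']}")
    for line in seg["trace"].split("\n"):
        print(f"  {line}")
```

Under these assumptions, each sub-trajectory receives only the events of its own invocations, which is what lets the pre-correction and post-correction attempts be compared side by side. A real trace is a nested tree rather than a flat list, so the actual segmentation likely needs to match on timestamps or span IDs instead of list position.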