Commit a52d996

Auto-fetch execution trace for single-session quality report
When --session is used, the execution trajectory is now fetched automatically from BigQuery and printed to console with sub-trajectory segmentation at correction boundaries. Updated sample with real data showing the full output including trace tree and segmentation.
1 parent 8313d65 commit a52d996

2 files changed

Lines changed: 141 additions & 23 deletions

File tree

scripts/quality_report.py

Lines changed: 43 additions & 0 deletions
@@ -1613,6 +1613,49 @@ def run_eval(args):
     else:
         logger.warning("No trajectories fetched (BQ may not be configured)")
 
+    # Single-session mode: always fetch trajectory from BQ
+    if args.session and not trajectories and not conversations_file:
+        trajectories = _fetch_session_traces([args.session], max_sessions=1)
+        if trajectories:
+            for sid, trace_obj in trajectories.items():
+                ctx = result["resolved_map"].get(sid)
+                if ctx and ctx.get("answered_by") == "unknown":
+                    ctx["answered_by"] = get_responding_agent(trace_obj)
+
+    # Print execution trace to console for single-session mode
+    if args.session and trajectories:
+        trace_obj = trajectories.get(args.session)
+        if trace_obj:
+            hr = "─" * 70
+            print(f"\n{'=' * 70}")
+            print("EXECUTION TRACE")
+            print(f"{'=' * 70}")
+            print(_render_trace(trace_obj))
+            ctx = result["resolved_map"].get(args.session, {})
+            sub_trajs = ctx.get("sub_trajectories", [])
+            conversation = ctx.get("conversation", [])
+            if sub_trajs and conversation:
+                segments = _segment_trace_by_turns(
+                    trace_obj, conversation, sub_trajs,
+                )
+                if segments:
+                    print(f"\n{hr}")
+                    print(" SUB-TRAJECTORY SEGMENTATION")
+                    print(hr)
+                    for seg in segments:
+                        icon = (
+                            "✅" if seg["outcome"] in ("correct", "recovered")
+                            else "❌"
+                        )
+                        print(
+                            f"\n {icon} {seg['label']} "
+                            f"(turns {seg['start_turn']}-{seg['end_turn']}) "
+                            f"→ {seg['outcome']}"
+                        )
+                        for line in seg["trace"].split("\n"):
+                            print(f" {line}")
+            print(f"{'=' * 70}\n")
+
     report_path = None
     md_dir = None
     if args.output_json and args.output_json != "-":
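
As a rough illustration of the segment-printing loop added in this hunk, the standalone sketch below formats a list of segment dicts into the same kind of listing. The dict keys (`label`, `start_turn`, `end_turn`, `outcome`, `trace`) come from the diff. The helper name `format_segments`, the sample data, and the indentation widths are assumptions made for this example and are not part of the commit.

```python
def format_segments(segments):
    """Return the segmentation listing as one string (hypothetical helper)."""
    lines = []
    for seg in segments:
        # Same rule as the diff: "correct" and "recovered" count as success.
        icon = "✅" if seg["outcome"] in ("correct", "recovered") else "❌"
        lines.append(
            f"  {icon} {seg['label']} "
            f"(turns {seg['start_turn']}-{seg['end_turn']}) "
            f"→ {seg['outcome']}"
        )
        # Indent each rendered trace line under its segment header.
        for line in seg["trace"].split("\n"):
            lines.append(f"    {line}")
    return "\n".join(lines)


# Illustrative input only; real segments come from _segment_trace_by_turns.
sample_segments = [
    {"label": "pre_correction_1", "start_turn": 0, "end_turn": 1,
     "outcome": "wrong", "trace": "├── LLM_REQUEST\n└── LLM_RESPONSE"},
    {"label": "post_correction_1", "start_turn": 2, "end_turn": 3,
     "outcome": "recovered", "trace": "└── TOOL_COMPLETED"},
]
print(format_segments(sample_segments))
```

Returning a string instead of printing inside the loop is a choice made here only so the formatting can be checked in isolation. The commit itself prints directly to the console.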
Lines changed: 98 additions & 23 deletions
@@ -1,68 +1,143 @@
 # Single-Session Quality Report
 
-Console output generated by `python scripts/quality_report.py --session conv_481f43fa`.
+Console output generated by `python scripts/quality_report.py --session conv_5d77036b --tag-turns`.
 
 When `--session` is used, all 7 metrics are shown with full justifications (verbose mode).
+The execution trace is fetched automatically from BigQuery and displayed after the summary.
 
 ```text
 ──────────────────────────────────────────────────────────────────────
-PARTIAL Sessions (showing 1 of 1)
+MEANINGFUL Sessions (showing 1 of 1)
 ──────────────────────────────────────────────────────────────────────
 
-Session: conv_481f43fa → policy_agent
-Question: You mentioned I will accrue an additional 13.3 days of PTO this year,
-which, combined with the 7.8 days already accrued, totals 21.1 days.
-My onboarding packet states that PTO is 20 days per year.
-Response: "According to the company policy, employees receive 20 days of PTO
-per year, accrued monthly at approximately 1.67 days per month."
-Usefulness: ⚠️ PARTIAL
-Reason: The agent provided specific numbers for PTO balance, accrued, used,
-and sick leave, which is useful. However, the calculation for
-additional PTO accrual was incorrect, leading to a total that
-contradicted company policy, making the overall response partially
-useful but also misleading.
+Session: conv_5d77036b → policy_agent
+Question: My onboarding packet says that Juneteenth is NOT a company
+holiday. Can you please check the official policy?
+Response: "Thank you for bringing this to my attention and for asking
+me to verify! I've looked up the official company policy for
+holidays. Based on the 2026 holiday schedule, Juneteenth
+(June 19th) is **not** listed as a company holiday. It seems
+your onboarding packet has the most accurate information in
+this case."
+Usefulness: ✅ HELPFUL
+Reason: The final response from the policy_agent correctly answers
+the user's question after the correction, providing specific
+and actionable information about Juneteenth not being a
+company holiday based on the official policy.
 Grounding: ✅ GROUNDED
-Reason: The agent called the `calculate_pto_details` tool, indicating the
-response was based on tool usage to retrieve specific data.
-Dimensions: Correctness: ⚠️ MOSTLY CORRECT | Tool Usage: ✅ PROPER |
+Reason: The policy_agent explicitly states it looked up the official
+company policy and based its answer on the 2026 holiday
+schedule, indicating tool usage.
+Dimensions: Correctness: ✅ CORRECT | Tool Usage: ✅ PROPER |
 Specificity: ✅ SPECIFIC | Scope: ✅ COMPLIANT |
 First-Time Right: ❌ CORRECTION NEEDED
 
 ======================================================================
 QUALITY SUMMARY
 ======================================================================
 Total sessions evaluated : 1
-Meaningful : 0
+Meaningful : 1
 Declined (out-of-scope) : 0
-Partial : 1
+Partial : 0
 Unhelpful : 0
 Unhelpful rate : 0.0%
 
 Quality Dimensions (0-2 scale):
-Correctness : 1.00 / 2.00 #########################
+Correctness : 2.00 / 2.00 ##################################################
 Tool Usage : 2.00 / 2.00 ##################################################
 Specificity : 2.00 / 2.00 ##################################################
 Scope : 2.00 / 2.00 ##################################################
 First-Time Right : 0.00 / 2.00
 
 Multi-Turn Efficiency:
 Avg user turns : 2.0
-Avg tool calls : 5.0
+Avg tool calls : 2.0
 Multi-turn sessions : 1
 Correction rate : 100.0%
 Verification rate : 0.0%
 
 Category Distributions:
 
 [response_usefulness]
-⚠️ PARTIAL : 1 (100.0%) ##################################################
+✅ HELPFUL : 1 (100.0%) ##################################################
 
 [task_grounding]
 ✅ GROUNDED : 1 (100.0%) ##################################################
 
 Execution Details:
 execution_mode: ai_generate
-elapsed_seconds: 32.7
+elapsed_seconds: 23.4
 eval_model: gemini-2.5-flash
+
+======================================================================
+
+======================================================================
+EXECUTION TRACE
+======================================================================
+Session: conv_5d77036b
+Time: 17:37:54 Total: 1.1min
+──────────────────────────────────────────────────────────────────────
+├── knowledge_supervisor > USER_MESSAGE_RECEIVED
+├── knowledge_supervisor > INVOCATION_STARTING
+├── knowledge_supervisor > INVOCATION_COMPLETED [14.7s]
+│   ├── knowledge_supervisor > AGENT_STARTING
+│   └── knowledge_supervisor > AGENT_COMPLETED [2.1s]
+│       ├── knowledge_supervisor > LLM_REQUEST
+│       └── knowledge_supervisor > LLM_RESPONSE [2.0s, ttft=2.0s]
+├── knowledge_supervisor > USER_MESSAGE_RECEIVED
+├── knowledge_supervisor > INVOCATION_STARTING
+└── knowledge_supervisor > INVOCATION_COMPLETED [1.0min]
+    ├── knowledge_supervisor > AGENT_STARTING
+    └── knowledge_supervisor > AGENT_COMPLETED [1.0min]
+        ├── knowledge_supervisor > LLM_REQUEST
+        ├── knowledge_supervisor > LLM_RESPONSE [5.5s, ttft=5.5s]
+        ├── knowledge_supervisor > TOOL_STARTING (transfer_to_agent)
+        ├── knowledge_supervisor > TOOL_COMPLETED (transfer_to_agent) [0ms]
+        ├── policy_agent > AGENT_STARTING
+        └── policy_agent > AGENT_COMPLETED [56.0s]
+            ├── policy_agent > LLM_REQUEST
+            ├── policy_agent > LLM_RESPONSE [20.2s, ttft=20.2s]
+            ├── policy_agent > TOOL_STARTING (lookup_company_policy)
+            ├── policy_agent > TOOL_COMPLETED (lookup_company_policy) [0ms]
+            ├── policy_agent > LLM_REQUEST
+            └── policy_agent > LLM_RESPONSE [35.7s, ttft=35.7s]
+
+──────────────────────────────────────────────────────────────────────
+ SUB-TRAJECTORY SEGMENTATION
+──────────────────────────────────────────────────────────────────────
+
+ ❌ pre_correction_1 (turns 0-1) → wrong
+ ├── knowledge_supervisor > USER_MESSAGE_RECEIVED
+ ├── knowledge_supervisor > INVOCATION_STARTING
+ └── knowledge_supervisor > INVOCATION_COMPLETED [14.7s]
+     ├── knowledge_supervisor > AGENT_STARTING
+     └── knowledge_supervisor > AGENT_COMPLETED [2.1s]
+         ├── knowledge_supervisor > LLM_REQUEST
+         └── knowledge_supervisor > LLM_RESPONSE [2.0s, ttft=2.0s]
+
+ ✅ post_correction_1 (turns 2-3) → recovered
+ ├── knowledge_supervisor > USER_MESSAGE_RECEIVED
+ ├── knowledge_supervisor > INVOCATION_STARTING
+ └── knowledge_supervisor > INVOCATION_COMPLETED [1.0min]
+     ├── knowledge_supervisor > AGENT_STARTING
+     └── knowledge_supervisor > AGENT_COMPLETED [1.0min]
+         ├── knowledge_supervisor > LLM_REQUEST
+         ├── knowledge_supervisor > LLM_RESPONSE [5.5s, ttft=5.5s]
+         ├── knowledge_supervisor > TOOL_STARTING (transfer_to_agent)
+         ├── knowledge_supervisor > TOOL_COMPLETED (transfer_to_agent) [0ms]
+         ├── policy_agent > AGENT_STARTING
+         └── policy_agent > AGENT_COMPLETED [56.0s]
+             ├── policy_agent > LLM_REQUEST
+             ├── policy_agent > LLM_RESPONSE [20.2s, ttft=20.2s]
+             ├── policy_agent > TOOL_STARTING (lookup_company_policy)
+             ├── policy_agent > TOOL_COMPLETED (lookup_company_policy) [0ms]
+             ├── policy_agent > LLM_REQUEST
+             └── policy_agent > LLM_RESPONSE [35.7s, ttft=35.7s]
 ======================================================================
 ```
+
+The execution trace reveals:
+- **Turn 1 (wrong):** The supervisor answered directly from LLM knowledge (no routing, no tool call) — incorrectly stating Juneteenth is a holiday
+- **Turn 2 (recovered):** After user correction, the supervisor routed via `transfer_to_agent` to the `policy_agent`, which called `lookup_company_policy` and returned the correct answer
+
+The sub-trajectory segmentation splits the trace at the correction boundary, making it easy to see what changed between the failed and recovered attempts.
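
The split at a correction boundary can be sketched as follows. This is a minimal illustration and not the logic of `_segment_trace_by_turns`, whose body is not shown in this commit. It assumes turns are numbered from 0 and that the index of the single correcting turn is already known. The function name `split_at_correction` and its parameters are hypothetical; the labels mirror the sample output.

```python
from typing import TypedDict


class Segment(TypedDict):
    label: str
    start_turn: int
    end_turn: int


def split_at_correction(num_turns: int, correction_turn: int) -> list[Segment]:
    """Split turn indices 0..num_turns-1 into two segments.

    Assumption: the second segment starts at the user's correcting turn.
    """
    if not 0 < correction_turn < num_turns:
        raise ValueError("correction_turn must fall strictly inside the conversation")
    return [
        {"label": "pre_correction_1", "start_turn": 0,
         "end_turn": correction_turn - 1},
        {"label": "post_correction_1", "start_turn": correction_turn,
         "end_turn": num_turns - 1},
    ]


# Four turns (user, agent, user correction, agent), correction at index 2.
segments = split_at_correction(num_turns=4, correction_turn=2)
for seg in segments:
    print(f"{seg['label']} (turns {seg['start_turn']}-{seg['end_turn']})")
```

With four turns and the correction at index 2, the sketch yields turn ranges 0-1 and 2-3, the same ranges shown in the sample report. Sessions with more than one correction would need a labelling scheme that this commit does not describe.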
