|
1 | 1 | # Single-Session Quality Report |
2 | 2 |
|
3 | | -Console output generated by `python scripts/quality_report.py --session conv_481f43fa`. |
| 3 | +Console output generated by `python scripts/quality_report.py --session conv_5d77036b --tag-turns`. |
4 | 4 |
|
5 | 5 | When `--session` is used, all 7 metrics are shown with full justifications (verbose mode). |
| 6 | +The execution trace is fetched automatically from BigQuery and displayed after the summary. |
6 | 7 |
|
7 | 8 | ```text |
8 | 9 | ────────────────────────────────────────────────────────────────────── |
9 | | - PARTIAL Sessions (showing 1 of 1) |
| 10 | + MEANINGFUL Sessions (showing 1 of 1) |
10 | 11 | ────────────────────────────────────────────────────────────────────── |
11 | 12 |
|
12 | | - Session: conv_481f43fa → policy_agent |
13 | | - Question: You mentioned I will accrue an additional 13.3 days of PTO this year, |
14 | | - which, combined with the 7.8 days already accrued, totals 21.1 days. |
15 | | - My onboarding packet states that PTO is 20 days per year. |
16 | | - Response: "According to the company policy, employees receive 20 days of PTO |
17 | | - per year, accrued monthly at approximately 1.67 days per month." |
18 | | - Usefulness: ⚠️ PARTIAL |
19 | | - Reason: The agent provided specific numbers for PTO balance, accrued, used, |
20 | | - and sick leave, which is useful. However, the calculation for |
21 | | - additional PTO accrual was incorrect, leading to a total that |
22 | | - contradicted company policy, making the overall response partially |
23 | | - useful but also misleading. |
| 13 | + Session: conv_5d77036b → policy_agent |
| 14 | + Question: My onboarding packet says that Juneteenth is NOT a company |
| 15 | + holiday. Can you please check the official policy? |
| 16 | + Response: "Thank you for bringing this to my attention and for asking |
| 17 | + me to verify! I've looked up the official company policy for |
| 18 | + holidays. Based on the 2026 holiday schedule, Juneteenth |
| 19 | + (June 19th) is **not** listed as a company holiday. It seems |
| 20 | + your onboarding packet has the most accurate information in |
| 21 | + this case." |
| 22 | + Usefulness: ✅ HELPFUL |
| 23 | + Reason: The final response from the policy_agent correctly answers |
| 24 | + the user's question after the correction, providing specific |
| 25 | + and actionable information about Juneteenth not being a |
| 26 | + company holiday based on the official policy. |
24 | 27 | Grounding: ✅ GROUNDED |
25 | | - Reason: The agent called the `calculate_pto_details` tool, indicating the |
26 | | - response was based on tool usage to retrieve specific data. |
27 | | - Dimensions: Correctness: ⚠️ MOSTLY CORRECT | Tool Usage: ✅ PROPER | |
| 28 | + Reason: The policy_agent explicitly states it looked up the official |
| 29 | + company policy and based its answer on the 2026 holiday |
| 30 | + schedule, indicating tool usage. |
| 31 | + Dimensions: Correctness: ✅ CORRECT | Tool Usage: ✅ PROPER | |
28 | 32 | Specificity: ✅ SPECIFIC | Scope: ✅ COMPLIANT | |
29 | 33 | First-Time Right: ❌ CORRECTION NEEDED |
30 | 34 |
|
31 | 35 | ====================================================================== |
32 | 36 | QUALITY SUMMARY |
33 | 37 | ====================================================================== |
34 | 38 | Total sessions evaluated : 1 |
35 | | - Meaningful : 0 |
| 39 | + Meaningful : 1 |
36 | 40 | Declined (out-of-scope) : 0 |
37 | | - Partial : 1 |
| 41 | + Partial : 0 |
38 | 42 | Unhelpful : 0 |
39 | 43 | Unhelpful rate : 0.0% |
40 | 44 |
|
41 | 45 | Quality Dimensions (0-2 scale): |
42 | | - Correctness : 1.00 / 2.00 ######################### |
| 46 | + Correctness : 2.00 / 2.00 ################################################## |
43 | 47 | Tool Usage : 2.00 / 2.00 ################################################## |
44 | 48 | Specificity : 2.00 / 2.00 ################################################## |
45 | 49 | Scope : 2.00 / 2.00 ################################################## |
46 | 50 | First-Time Right : 0.00 / 2.00 |
47 | 51 |
|
48 | 52 | Multi-Turn Efficiency: |
49 | 53 | Avg user turns : 2.0 |
50 | | - Avg tool calls : 5.0 |
| 54 | + Avg tool calls : 2.0 |
51 | 55 | Multi-turn sessions : 1 |
52 | 56 | Correction rate : 100.0% |
53 | 57 | Verification rate : 0.0% |
54 | 58 |
|
55 | 59 | Category Distributions: |
56 | 60 |
|
57 | 61 | [response_usefulness] |
58 | | - ⚠️ PARTIAL : 1 (100.0%) ################################################## |
| 62 | + ✅ HELPFUL : 1 (100.0%) ################################################## |
59 | 63 |
|
60 | 64 | [task_grounding] |
61 | 65 | ✅ GROUNDED : 1 (100.0%) ################################################## |
62 | 66 |
|
63 | 67 | Execution Details: |
64 | 68 | execution_mode: ai_generate |
65 | | - elapsed_seconds: 32.7 |
| 69 | + elapsed_seconds: 23.4 |
66 | 70 | eval_model: gemini-2.5-flash |
| 71 | +
|
| 72 | +====================================================================== |
| 73 | +
|
| 74 | +====================================================================== |
| 75 | +EXECUTION TRACE |
| 76 | +====================================================================== |
| 77 | +Session: conv_5d77036b |
| 78 | +Time: 17:37:54 Total: 1.1min |
| 79 | +────────────────────────────────────────────────────────────────────── |
| 80 | +├── knowledge_supervisor > USER_MESSAGE_RECEIVED |
| 81 | +├── knowledge_supervisor > INVOCATION_STARTING |
| 82 | +├── knowledge_supervisor > INVOCATION_COMPLETED [14.7s] |
| 83 | +│ ├── knowledge_supervisor > AGENT_STARTING |
| 84 | +│ └── knowledge_supervisor > AGENT_COMPLETED [2.1s] |
| 85 | +│ ├── knowledge_supervisor > LLM_REQUEST |
| 86 | +│ └── knowledge_supervisor > LLM_RESPONSE [2.0s, ttft=2.0s] |
| 87 | +├── knowledge_supervisor > USER_MESSAGE_RECEIVED |
| 88 | +├── knowledge_supervisor > INVOCATION_STARTING |
| 89 | +└── knowledge_supervisor > INVOCATION_COMPLETED [1.0min] |
| 90 | + ├── knowledge_supervisor > AGENT_STARTING |
| 91 | + └── knowledge_supervisor > AGENT_COMPLETED [1.0min] |
| 92 | + ├── knowledge_supervisor > LLM_REQUEST |
| 93 | + ├── knowledge_supervisor > LLM_RESPONSE [5.5s, ttft=5.5s] |
| 94 | + ├── knowledge_supervisor > TOOL_STARTING (transfer_to_agent) |
| 95 | + ├── knowledge_supervisor > TOOL_COMPLETED (transfer_to_agent) [0ms] |
| 96 | + ├── policy_agent > AGENT_STARTING |
| 97 | + └── policy_agent > AGENT_COMPLETED [56.0s] |
| 98 | + ├── policy_agent > LLM_REQUEST |
| 99 | + ├── policy_agent > LLM_RESPONSE [20.2s, ttft=20.2s] |
| 100 | + ├── policy_agent > TOOL_STARTING (lookup_company_policy) |
| 101 | + ├── policy_agent > TOOL_COMPLETED (lookup_company_policy) [0ms] |
| 102 | + ├── policy_agent > LLM_REQUEST |
| 103 | + └── policy_agent > LLM_RESPONSE [35.7s, ttft=35.7s] |
| 104 | +
|
| 105 | +────────────────────────────────────────────────────────────────────── |
| 106 | + SUB-TRAJECTORY SEGMENTATION |
| 107 | +────────────────────────────────────────────────────────────────────── |
| 108 | +
|
| 109 | + ❌ pre_correction_1 (turns 0-1) → wrong |
| 110 | + ├── knowledge_supervisor > USER_MESSAGE_RECEIVED |
| 111 | + ├── knowledge_supervisor > INVOCATION_STARTING |
| 112 | + └── knowledge_supervisor > INVOCATION_COMPLETED [14.7s] |
| 113 | + ├── knowledge_supervisor > AGENT_STARTING |
| 114 | + └── knowledge_supervisor > AGENT_COMPLETED [2.1s] |
| 115 | + ├── knowledge_supervisor > LLM_REQUEST |
| 116 | + └── knowledge_supervisor > LLM_RESPONSE [2.0s, ttft=2.0s] |
| 117 | +
|
| 118 | + ✅ post_correction_1 (turns 2-3) → recovered |
| 119 | + ├── knowledge_supervisor > USER_MESSAGE_RECEIVED |
| 120 | + ├── knowledge_supervisor > INVOCATION_STARTING |
| 121 | + └── knowledge_supervisor > INVOCATION_COMPLETED [1.0min] |
| 122 | + ├── knowledge_supervisor > AGENT_STARTING |
| 123 | + └── knowledge_supervisor > AGENT_COMPLETED [1.0min] |
| 124 | + ├── knowledge_supervisor > LLM_REQUEST |
| 125 | + ├── knowledge_supervisor > LLM_RESPONSE [5.5s, ttft=5.5s] |
| 126 | + ├── knowledge_supervisor > TOOL_STARTING (transfer_to_agent) |
| 127 | + ├── knowledge_supervisor > TOOL_COMPLETED (transfer_to_agent) [0ms] |
| 128 | + ├── policy_agent > AGENT_STARTING |
| 129 | + └── policy_agent > AGENT_COMPLETED [56.0s] |
| 130 | + ├── policy_agent > LLM_REQUEST |
| 131 | + ├── policy_agent > LLM_RESPONSE [20.2s, ttft=20.2s] |
| 132 | + ├── policy_agent > TOOL_STARTING (lookup_company_policy) |
| 133 | + ├── policy_agent > TOOL_COMPLETED (lookup_company_policy) [0ms] |
| 134 | + ├── policy_agent > LLM_REQUEST |
| 135 | + └── policy_agent > LLM_RESPONSE [35.7s, ttft=35.7s] |
67 | 136 | ====================================================================== |
68 | 137 | ``` |
| 138 | + |
| 139 | +The execution trace reveals: |
| 140 | +- **Turn 1 (wrong):** The supervisor answered directly from LLM knowledge, with no routing and no tool call, and incorrectly stated that Juneteenth is a holiday.
| 141 | +- **Turn 2 (recovered):** After the user's correction, the supervisor routed the request via `transfer_to_agent` to the `policy_agent`, which called `lookup_company_policy` and returned the correct answer.
| 142 | + |
| 143 | +The sub-trajectory segmentation splits the trace at the correction boundary, making it easy to see what changed between the failed and recovered attempts. |
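
The split at the correction boundary described above can be illustrated with a minimal sketch. This is not the code in `scripts/quality_report.py`. The `Turn` and `Segment` types, the `is_correction` flag, and the `segment_at_corrections` function are hypothetical names chosen for illustration. The sketch also assumes that each segment before a correction is labelled `wrong` and the final one `recovered`, whereas the real tool presumably derives those outcomes from the evaluation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    index: int                   # position of the turn in the session
    role: str                    # "user" or "agent"
    is_correction: bool = False  # hypothetical tag: user is correcting the agent


@dataclass
class Segment:
    label: str
    start: int    # index of the first turn in the segment
    end: int      # index of the last turn in the segment
    outcome: str  # assumed label, not taken from the real tool


def segment_at_corrections(turns: List[Turn]) -> List[Segment]:
    """Split a session into sub-trajectories at each user correction."""
    segments: List[Segment] = []
    start = 0  # list position where the current segment begins
    count = 0  # number of corrections seen so far
    for pos, turn in enumerate(turns):
        if turn.role == "user" and turn.is_correction and pos > start:
            count += 1
            # Everything before the correction is the failed attempt.
            segments.append(
                Segment(f"pre_correction_{count}",
                        turns[start].index, turns[pos - 1].index, "wrong")
            )
            start = pos
    if count > 0:
        # The remaining turns, starting at the last correction.
        segments.append(
            Segment(f"post_correction_{count}",
                    turns[start].index, turns[-1].index, "recovered")
        )
    return segments


# Four turns shaped like the session above: question, answer, correction, answer.
session = [
    Turn(0, "user"),
    Turn(1, "agent"),
    Turn(2, "user", is_correction=True),
    Turn(3, "agent"),
]
for seg in segment_at_corrections(session):
    print(f"{seg.label} (turns {seg.start}-{seg.end}) -> {seg.outcome}")
```

A session with no tagged correction yields an empty list in this sketch, so there is nothing to segment and the full trace would be shown on its own.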