Commit b830575
Add quality dimensions, multi-turn metrics, and streamlined report format
Add 5 new LLM-as-judge quality dimensions to the quality report:
correctness, tool_usage, specificity, scope_compliance, first_time_right.
Each session is scored 0-2 on every dimension, and the per-dimension
scores are averaged across sessions into a Quality Dimensions table.
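
As a rough illustration of the averaging step (a minimal sketch, not the code in this commit): the dimension names come from the message above, while `average_dimensions` and the per-session dict layout are assumptions made for the example.

```python
from statistics import mean

# Dimension names as listed in the commit message.
DIMENSIONS = (
    "correctness",
    "tool_usage",
    "specificity",
    "scope_compliance",
    "first_time_right",
)


def average_dimensions(sessions):
    """Average each 0-2 dimension score across sessions.

    `sessions` is assumed to be a list of dicts mapping dimension name
    to an integer score (hypothetical layout). Missing scores are skipped.
    """
    averages = {}
    for dim in DIMENSIONS:
        scores = [s[dim] for s in sessions if s.get(dim) is not None]
        averages[dim] = round(mean(scores), 2) if scores else None
    return averages


sessions = [
    {"correctness": 2, "tool_usage": 1, "specificity": 2,
     "scope_compliance": 2, "first_time_right": 1},
    {"correctness": 1, "tool_usage": 2, "specificity": 0,
     "scope_compliance": 2, "first_time_right": 0},
]
averages = average_dimensions(sessions)
```

Returning `None` for a dimension with no scores keeps "not judged" distinct from an average of 0.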
Add multi-turn efficiency metrics (avg user turns, avg tool calls,
multi-turn session count) extracted from trace spans.
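
The three efficiency metrics could be derived along these lines. This is a sketch under stated assumptions: the span shape (`session_id`, `kind`), the `kind` values, and the rule that "multi-turn" means more than one user turn are all hypothetical and may differ from the real trace schema.

```python
from collections import defaultdict


def multi_turn_metrics(spans):
    """Compute avg user turns, avg tool calls, and multi-turn session count.

    Each span is assumed to be a dict with a "session_id" and a "kind"
    of either "user_turn" or "tool_call" (hypothetical schema).
    """
    per_session = defaultdict(lambda: {"user_turn": 0, "tool_call": 0})
    for span in spans:
        counts = per_session[span["session_id"]]
        if span["kind"] in counts:  # ignore span kinds we do not track
            counts[span["kind"]] += 1

    n = len(per_session)
    if n == 0:
        return {"avg_user_turns": 0.0, "avg_tool_calls": 0.0,
                "multi_turn_sessions": 0}
    return {
        "avg_user_turns": sum(c["user_turn"] for c in per_session.values()) / n,
        "avg_tool_calls": sum(c["tool_call"] for c in per_session.values()) / n,
        # Assumption: a session is multi-turn if it has more than one user turn.
        "multi_turn_sessions": sum(
            1 for c in per_session.values() if c["user_turn"] > 1
        ),
    }


spans = [
    {"session_id": "a", "kind": "user_turn"},
    {"session_id": "a", "kind": "tool_call"},
    {"session_id": "a", "kind": "user_turn"},
    {"session_id": "b", "kind": "user_turn"},
    {"session_id": "b", "kind": "tool_call"},
    {"session_id": "b", "kind": "tool_call"},
]
metrics = multi_turn_metrics(spans)
```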
Streamline report output for readability:
- Quality Dimensions table includes "What it measures" descriptions
  and a color-coded rating legend
- Category distributions only show primary metrics (response_usefulness,
  task_grounding) since dimension averages already summarize the rest
- Per-session details use a compact one-line scorecard for dimensions
  instead of verbose multi-line blocks
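
A compact one-line scorecard might look like the following. The exact format used in the report is not shown in this commit message, so the `name: score/2` layout and the `scorecard_line` helper are illustrative assumptions.

```python
def scorecard_line(scores):
    """Render per-dimension scores as a single line.

    `scores` is assumed to map dimension name to a 0-2 score; the
    separator and "n/2" notation are hypothetical formatting choices.
    """
    return " | ".join(f"{name}: {score}/2" for name, score in scores.items())


line = scorecard_line({"correctness": 2, "tool_usage": 1, "specificity": 0})
```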
Add 12 new tests for the helper functions.
Update README metrics documentation.
Regenerate sample quality report from live data.
1 parent 31fe562 commit b830575
4 files changed
Lines changed: 717 additions & 170 deletions
File tree
- scripts
- tests