Commit b830575

Add quality dimensions, multi-turn metrics, and streamlined report format
Add 5 new LLM-as-judge quality dimensions to the quality report: correctness, tool_usage, specificity, scope_compliance, first_time_right. Each session is scored 0-2 and averaged into a Quality Dimensions table.

Add multi-turn efficiency metrics (avg user turns, avg tool calls, multi-turn session count) extracted from trace spans.

Streamline report output for readability:
- Quality Dimensions table includes "What it measures" descriptions and a color-coded rating legend
- Category distributions only show primary metrics (response_usefulness, task_grounding) since dimension averages already summarize the rest
- Per-session details use a compact one-line scorecard for dimensions instead of verbose multi-line blocks

Add 12 new tests for the helper functions. Update README metrics documentation. Regenerate sample quality report from live data.
1 parent 31fe562 commit b830575
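
The commit message says each session is scored 0-2 per dimension and the scores are averaged into the Quality Dimensions table. A minimal sketch of that averaging step follows. It is not the code from this commit: the function name `average_dimensions`, the input shape (one dict per session), and the choice to skip missing scores are all assumptions made for illustration.

```python
# Hypothetical sketch: average 0-2 dimension scores across sessions.
# The dimension names come from the commit message; everything else
# (function name, input shape, handling of missing scores) is assumed.

DIMENSIONS = (
    "correctness",
    "tool_usage",
    "specificity",
    "scope_compliance",
    "first_time_right",
)


def average_dimensions(session_scores):
    """Return {dimension: mean score} over all sessions.

    session_scores is a list of dicts mapping a dimension name to 0, 1, or 2.
    Sessions that lack a score for a dimension are skipped for that dimension.
    A dimension with no scores at all maps to None.
    """
    averages = {}
    for dim in DIMENSIONS:
        values = [s[dim] for s in session_scores if s.get(dim) is not None]
        for value in values:
            if value not in (0, 1, 2):
                raise ValueError(f"{dim} score must be 0, 1, or 2, got {value!r}")
        averages[dim] = sum(values) / len(values) if values else None
    return averages


# Made-up example data for two sessions.
sessions = [
    {"correctness": 2, "tool_usage": 2, "specificity": 1,
     "scope_compliance": 2, "first_time_right": 2},
    {"correctness": 1, "tool_usage": 0, "specificity": 1,
     "scope_compliance": 2, "first_time_right": 1},
]

for name, avg in average_dimensions(sessions).items():
    print(f"{name}: {avg:.2f}")
```

Skipping missing scores rather than counting them as 0 keeps a judge failure on one session from dragging an average down; whether the actual script does this is not stated in the commit.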

4 files changed

Lines changed: 717 additions & 170 deletions

scripts/README.md

Lines changed: 25 additions & 5 deletions
@@ -101,18 +101,38 @@ These filters can be combined (e.g. `--app-name my_agent --session-ids-file ids.
 
 ### Metrics
 
-The evaluation uses two categorical metrics:
+The evaluation scores each session on **7 dimensions** using LLM-as-a-judge.
 
-- **response_usefulness** - Whether the agent's response provides a genuinely
-useful answer. Categories: `meaningful`, `declined`, `unhelpful`, `partial`.
+**Primary metrics** classify each session:
 
-- **task_grounding** - Whether the response is grounded in tool-retrieved data
-or fabricated. Categories: `grounded`, `ungrounded`, `no_tool_needed`.
+| Metric | Categories | What it measures |
+|--------|------------|------------------|
+| `response_usefulness` | `meaningful`, `declined`, `unhelpful`, `partial` | Whether the response provides a genuinely useful answer |
+| `task_grounding` | `grounded`, `ungrounded`, `no_tool_needed` | Whether the response is based on tool-retrieved data or fabricated |
 
 The **`declined`** category is always available — the LLM judge can classify
 polite refusals of out-of-scope questions as correct behavior rather than
 marking them as `unhelpful`.
 
+**Quality dimensions** score each session 0-2 and are averaged across all
+sessions to produce the Quality Dimensions table in the report:
+
+| Dimension | 2 (best) | 1 (middle) | 0 (worst) |
+|-----------|----------|------------|-----------|
+| `correctness` | All facts accurate | Minor inaccuracy | Wrong facts or hallucinations |
+| `tool_usage` | Tools used properly | Partial tool use | No tool use when needed |
+| `specificity` | Specific numbers, dates, limits | Missing some details | Vague or generic |
+| `scope_compliance` | Correctly handled scope | Unnecessary caveats | Wrong scope decision |
+| `first_time_right` | Correct on first try | Needed clarification | User had to correct |
+
+**Multi-turn efficiency** metrics are extracted from trace spans:
+
+| Metric | Description |
+|--------|-------------|
+| Avg user turns | Average number of user messages per session |
+| Avg tool calls | Average number of tool calls per session |
+| Multi-turn sessions | Sessions with more than one user message |
+
 ### Scope-Aware Evaluation (`--config`)
 
 For more accurate scope evaluation, provide a config file that tells the
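
The multi-turn efficiency metrics added to the README in this diff can be illustrated with a short sketch. The span schema below (dicts with `session_id` and `type` keys, and the type values `user_message` and `tool_call`) is hypothetical, as is the function name `multi_turn_metrics`; the commit does not show how the real trace spans are structured or parsed.

```python
# Hypothetical sketch: derive the three multi-turn metrics from trace spans.
# The span layout and type labels are assumptions for illustration only.
from collections import defaultdict


def multi_turn_metrics(spans):
    """Compute avg user turns, avg tool calls, and multi-turn session count."""
    user_turns = defaultdict(int)
    tool_calls = defaultdict(int)
    session_ids = set()

    for span in spans:
        sid = span["session_id"]
        session_ids.add(sid)
        if span["type"] == "user_message":
            user_turns[sid] += 1
        elif span["type"] == "tool_call":
            tool_calls[sid] += 1

    total = len(session_ids)
    if total == 0:
        return {"avg_user_turns": 0.0, "avg_tool_calls": 0.0,
                "multi_turn_sessions": 0}

    return {
        "avg_user_turns": sum(user_turns.values()) / total,
        "avg_tool_calls": sum(tool_calls.values()) / total,
        # A session is multi-turn when it has more than one user message.
        "multi_turn_sessions": sum(
            1 for sid in session_ids if user_turns[sid] > 1
        ),
    }


# Made-up example: session s1 has two user turns, s2 has one.
example_spans = [
    {"session_id": "s1", "type": "user_message"},
    {"session_id": "s1", "type": "tool_call"},
    {"session_id": "s1", "type": "tool_call"},
    {"session_id": "s1", "type": "user_message"},
    {"session_id": "s1", "type": "tool_call"},
    {"session_id": "s2", "type": "user_message"},
    {"session_id": "s2", "type": "agent_response"},
]

print(multi_turn_metrics(example_spans))
```

Averages here are taken over every session that appears in the spans, including sessions with no tool calls, which matches the README's "per session" wording.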
