Commit c2afc72
Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering (#174)
* Add quality dimensions, multi-turn metrics, and streamlined report format
Add 5 new LLM-as-judge quality dimensions to the quality report:
correctness, tool_usage, specificity, scope_compliance, first_time_right.
Each session is scored 0-2 and averaged into a Quality Dimensions table.
Add multi-turn efficiency metrics (avg user turns, avg tool calls,
multi-turn session count) extracted from trace spans.
Streamline report output for readability:
- Quality Dimensions table includes "What it measures" descriptions
and a color-coded rating legend
- Category distributions only show primary metrics (response_usefulness,
task_grounding) since dimension averages already summarize the rest
- Per-session details use a compact one-line scorecard for dimensions
instead of verbose multi-line blocks
Add 12 new tests for the helper functions.
Update README metrics documentation.
Regenerate sample quality report from live data.
* Add correction/verification inference and multi-turn conversation extraction
Extract full multi-turn conversations from trace spans and use an LLM to
classify user follow-ups as corrections, verifications, or normal
follow-ups. Surface correction_rate and verify_rate in both console
output and markdown reports to match the metrics available in
knowledge_supervisor's multiturn_quality_report.
* Improve scope-aware eval accuracy with conditional declined category and ground truth support
The LLM judge was misclassifying in-scope failures as "declined" because
the declined category was always present regardless of whether scope
context existed. This led to inflated correctness scores for agents that
failed to answer in-scope questions.
Key changes:
- Make the "declined" metric category conditional: only include it when
the agent config actually defines out-of-scope topics (has_scope flag).
Without scope context the judge has no basis for that category.
- Add ground_truth config field to _build_scope_context so the judge can
verify factual accuracy against known-correct policy data.
- Add in_scope_topics to scope context so the judge can distinguish
"topic is out of scope" (declined) from "agent failed to answer an
in-scope topic" (unhelpful).
- Clarify unhelpful definition: explicitly lists failure patterns like
"I don't have that information" for in-scope topics.
- Support config_path="none" in _load_agent_config to explicitly disable
scope context and skip auto-discovery.
- Thread config_path through generate_quality_report for programmatic use.
- Update --agent-config CLI help to document the "none" option.
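
As a rough illustration of the conditional "declined" category described above, the sketch below only offers the judge that category when the config lists out-of-scope topics. The function name, the `out_of_scope_topics` key, and the other category names are assumptions for illustration and are not taken from quality_report.py.

```python
from typing import Optional


def build_usefulness_categories(agent_config: Optional[dict]) -> list:
    """Return judge categories; 'declined' is offered only with scope context."""
    # Hypothetical base categories for the usefulness metric.
    categories = ["meaningful", "partial", "unhelpful"]
    # Assumed config key: a list of topics the agent must decline.
    out_of_scope = (agent_config or {}).get("out_of_scope_topics") or []
    has_scope = bool(out_of_scope)
    if has_scope:
        # Only now does the judge have a basis for choosing "declined".
        categories.append("declined")
    return categories


print(build_usefulness_categories(None))
print(build_usefulness_categories({"out_of_scope_topics": ["legal advice"]}))
```

Without scope context an in-scope failure can then only land in "unhelpful", which is the misclassification this change targets.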
* Add scope config example, remove in_scope_topics, update README
- Add scripts/eval/data/agent_context.example.json with scope_decisions
for scope-aware evaluation. Users copy to agent_context.json and
customize with their agent's out-of-scope topics.
- Remove in_scope_topics from scope context: scope_decisions alone is
sufficient. Anything not listed as out_of_scope is implicitly in scope.
- Remove hardcoded domain-specific topic names (benefits, PTO, etc.)
from the LLM judge prompt to keep it agent-agnostic.
- Update README with all new features: dimension drilldowns,
single-session evaluation (--session), quality dimensions table,
multi-turn metrics, declined category behavior, and sample output
links.
- Add quality_metrics.json (metric definitions) and
sample_quality_report_session.md (single-session verbose output).
* Remove ground truth from scope context, update sample reports with fresh eval run
* Add --conversations-file for local JSON scoring without BigQuery
Support evaluating traffic-generator conversations directly from a JSON file,
bypassing BigQuery entirely. Adds run_evaluation_from_conversations() and
generate_quality_report_from_conversations() public APIs, plus CLI flag.
Also enhances _write_md_report with TOC, section descriptions, metric
explanations, and a report_dir parameter for custom output locations.
Includes issue for concurrent session scoring (asyncio.gather + semaphore).
* Update quality report format (as a goal)
* Add per-category sample control for report sections
Support --samples with per-category overrides (e.g. unhelpful=10,partial=5,low=3)
in addition to single-number and 'all' modes. Refactor _print_eval_results and
_write_md_report to use centralized _parse_samples/_get_sample_limit helpers
with sensible defaults per category.
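
A minimal sketch of how the three --samples modes could be parsed. The helpers below are stand-ins and not the actual _parse_samples/_get_sample_limit implementations; the default limits are invented for the example.

```python
import math

# Hypothetical per-category defaults; the real values are not stated here.
DEFAULT_LIMITS = {"unhelpful": 5, "partial": 3, "low": 3}


def parse_samples(spec: str) -> dict:
    """Parse 'all', a single number, or 'cat=N,cat=M' into a limits dict."""
    spec = spec.strip()
    if spec.lower() == "all":
        return {"*": math.inf}  # "*" applies to every category
    if "=" not in spec:
        return {"*": int(spec)}
    limits = {}
    for part in spec.split(","):
        name, _, value = part.partition("=")
        limits[name.strip()] = int(value)
    return limits


def get_sample_limit(limits: dict, category: str) -> float:
    """Explicit override first, then the global value, then the default."""
    if category in limits:
        return limits[category]
    if "*" in limits:
        return limits["*"]
    return DEFAULT_LIMITS.get(category, 3)


limits = parse_samples("unhelpful=10,partial=5,low=3")
print(get_sample_limit(limits, "unhelpful"))
```

Categories left out of a per-category spec fall back to the defaults, so `--samples unhelpful=10` leaves the other sections at their usual size.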
* Make correction inference concurrent with asyncio.gather
Run _infer_corrections calls concurrently with asyncio.gather (worker threads),
bounded by a semaphore (max 10 concurrent), instead of a sequential loop.
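
The pattern can be sketched as follows, assuming the per-session call is blocking and is offloaded with asyncio.to_thread (Python 3.9+). `infer_corrections_stub` is a placeholder for the real LLM call, not the actual _infer_corrections.

```python
import asyncio

MAX_CONCURRENT = 10  # matches the limit stated in the commit message


def infer_corrections_stub(session_id: str) -> dict:
    """Placeholder for a blocking LLM call that classifies follow-ups."""
    return {"session_id": session_id, "corrections": 0}


async def infer_all(session_ids):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded(session_id):
        # At most MAX_CONCURRENT calls hold the semaphore at any time.
        async with semaphore:
            return await asyncio.to_thread(infer_corrections_stub, session_id)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(s) for s in session_ids))


results = asyncio.run(infer_all([f"s{i}" for i in range(25)]))
print(len(results))
```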
* Allow --help/-h without env vars in quality_report.sh
Short-circuit help flags before env var validation so users can
view usage without setting PROJECT_ID, DATASET_ID, etc.
* Add dynamic low-dimension sections and report metadata
Add _md_find_low_dimension_sessions and _md_write_low_dimension_section
helpers to generate per-dimension subsections for all 5 quality dimensions
dynamically. Move low-dimension sections after Declined Sessions.
Restructure TOC with <!-- TOC --> wrapper. Add CLI command reproduction
in Summary section. Compute report layout variables upfront.
* Fix help text, enable --tag-turns for BQ, and fix --env loading
- --limit: clarify it evaluates the N most recent sessions
- --trajectory-samples: remove misleading "Requires BQ credentials"
- --tag-turns: remove conversations-file restriction, enable for BQ path
by passing tag_turns through run_evaluation with concurrent tagging
- --env=path syntax now works in shell script (not just --env path)
- Error message now says 'export VAR=...' and mentions --env flag
* Add correction analysis with trajectory segmentation, routing failure detection, and parroting detection
- Embed segmented execution trajectories in Correction Analysis section
(before/after correction with outcome labels: recovered, parroted, not_recovered)
- Detect routing failures (agent answered from LLM knowledge without tools)
and surface as Routing Failures subsection
- Fix answered_by defaulting to "policy_agent" — now "unknown" with BQ backfill
- Tighten correction inference prompt to distinguish genuine recovery from parroting
- Add --eval-config flag for external metric/prompt definitions (eval_config.json)
- Nest Correction Analysis under Sample Sessions with dynamic heading levels
- Update sample report and README
* Add parroting penalty to eval and auto-discover eval_config.json
Penalize parroted recoveries (agent echoes user's correction without
re-verifying via tool) as unhelpful in both the usefulness definition
and the unhelpful category.
Refactor quality_report.py to auto-discover eval/eval_config.json
from the repo root, removing 290 lines of hardcoded metric
definitions in favor of the config-driven approach.
* Add per_session_context support for golden eval
Thread per_session_context through quality_report.py into
classify_sessions_via_api() so golden eval expected answers
can be injected into the judge prompt per session.
* Auto-fetch execution trace for single-session quality report
When --session is used, the execution trajectory is now fetched
automatically from BigQuery and printed to console with sub-trajectory
segmentation at correction boundaries. Updated sample with real data
showing the full output including trace tree and segmentation.
* Address PR #174 review: fix dimension scoring, TOOL_ERROR counting, and module constants
- H1: Skip parse errors and unknown categories in _compute_dimension_averages
instead of scoring them as 0 (which dragged averages down)
- H2: Default unknown categories to ❓ instead of ✅ in scorecard icons
- H3: Count TOOL_ERROR spans as tool attempts in _count_trace_metrics
- L3: Lift _SCORECARD_ICONS to module level (was duplicated in function)
- L7: Extract _PRIMARY_METRICS constant, replace 5 inline references
- M2: _compute_multiturn_stats returns stable shape on empty input
- Update tests: add parse_error attr to _FakeMetric, test TOOL_ERROR
counting, test parse error/unknown category skipping, fix empty map
assertion
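
To illustrate the H1 fix, here is a sketch of averaging that skips parse errors and unknown categories instead of counting them as 0. The `MetricResult` class and the category names other than `correct` are assumptions; the real _compute_dimension_averages may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical category-to-score map for one dimension (0-2 scale).
SCORES = {"correct": 2, "partially_correct": 1, "incorrect": 0}


@dataclass
class MetricResult:
    category: Optional[str]
    parse_error: bool = False


def dimension_average(results) -> Optional[float]:
    """Average only results that parsed and map to a known category."""
    scored = [
        SCORES[r.category]
        for r in results
        if not r.parse_error and r.category in SCORES
    ]
    if not scored:
        return None  # nothing scored: report no value rather than 0.0
    return sum(scored) / len(scored)


sample = [
    MetricResult("correct"),
    MetricResult("incorrect"),
    MetricResult(None, parse_error=True),  # skipped
    MetricResult("banana"),                # unknown category, skipped
]
print(dimension_average(sample))
```

Counting the two skipped results as 0 would have given 0.5 instead of 1.0 for this sample.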
* Fix custom_tags/custom_labels path mismatch and add version filtering
- TraceFilter.to_sql_conditions: query $.custom_tags.* instead of
$.labels.* to match BigQueryLoggerConfig.custom_tags write path
- TraceFilter.from_cli_args: add custom_labels parameter
- run_evaluation: add custom_labels parameter, thread through to
TraceFilter for version-aware session filtering
* Add --label CLI flag, surface active filters in Execution Details
- Add --label KEY=VALUE (repeatable) to filter by custom_tags set via
BigQueryLoggerConfig.custom_tags (version, env, experiment_id, etc.)
- Surface app_name and labels in report.details so they appear in both
console and markdown Execution Details sections
- Expand CLI epilog with filtering section, scope-aware eval example
showing full agent_context.json format, and combined filter examples
- Update README with Custom Labels section (end-to-end: agent emits →
BQ stores → quality report filters), expanded --agent-context docs,
and complete Execution Details description
- Update sample reports with complete Execution Details fields
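
A minimal sketch of a repeatable --label KEY=VALUE flag turned into conditions on the $.custom_tags JSON path. The column name `attributes`, the helper names, and the generated SQL shape are assumptions; TraceFilter.to_sql_conditions may build its conditions differently.

```python
import argparse


def parse_labels(pairs):
    """Turn ['version=v2', 'env=prod'] into {'version': 'v2', 'env': 'prod'}."""
    labels = {}
    for pair in pairs or []:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        labels[key] = value
    return labels


def label_conditions(labels, column="attributes"):
    """Build parameterized conditions plus their query parameter values."""
    conditions, params = [], {}
    for i, (key, value) in enumerate(sorted(labels.items())):
        # Keys are interpolated into the JSON path, so restrict their charset.
        if not key.replace("_", "").isalnum():
            raise ValueError(f"unsupported label key: {key!r}")
        name = f"label_{i}"
        conditions.append(f"JSON_VALUE({column}, '$.custom_tags.{key}') = @{name}")
        params[name] = value
    return conditions, params


parser = argparse.ArgumentParser()
parser.add_argument("--label", action="append", default=[], metavar="KEY=VALUE")
args = parser.parse_args(["--label", "version=v2", "--label", "env=prod"])
conditions, params = label_conditions(parse_labels(args.label))
print(" AND ".join(conditions))
```

Passing values as named query parameters keeps user-supplied label values out of the SQL text.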
* Address PR #174 review: tool_usage no_tool_needed, --dimensions flag, and hardening
Review feedback (@caohy1988):
- P1: add `no_tool_needed` category to `tool_usage` (scores 2) so correctly
declined / no-tool-needed sessions are not penalised as a Tool Usage failure;
neutral ➖ scorecard icon; regression tests.
- B2: add `--dimensions {full,primary}` flag to cut LLM-judge cost ~3.5x
(primary scores only the 2 primary metrics); document cost in README + help.
Default stays `full` for backward compatibility.
- D2: document that `first_time_right` is primarily a multi-turn signal.
- D3/L6: comment the deliberately-divergent middle-category names and the
shared `correct` category.
- M5: print per-dimension descriptions in console output.
- L1: render markdown report metadata as a bullet list instead of
trailing-double-space hard breaks (fixes `git diff --check`).
Independent review hardening:
- Gate the dimension block on a new `_has_dimension_data()` helper across all
three output paths; JSON `dimension_averages` is now empty (not all-zero)
when dimensions were not scored, so consumers don't read unscored dimensions
as 0.0 / failing.
- Anchor the judge prompt so a missing in-scope tool lookup stays `none` and is
not mislabeled `no_tool_needed` (inverse-of-P1 inflation guard).
- Make `_DIMENSION_LOW_CATEGORIES` fail-safe if a dimension has no score-0
category (no StopIteration at import).
Tests: 81 pass (+5).
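
The fail-safe lookup for a dimension's score-0 category can be sketched with a default argument to `next`. The mapping below is illustrative only, and `hypothetical_dim` is an invented dimension that has no score-0 category.

```python
# Illustrative category-to-score maps, not the real metric definitions.
DIMENSION_SCORES = {
    "correctness": {"correct": 2, "partially_correct": 1, "incorrect": 0},
    "hypothetical_dim": {"good": 2, "ok": 1},  # no score-0 category
}

# next(..., None) yields None when no category scores 0; a bare next()
# would raise StopIteration while the module is being imported.
LOW_CATEGORIES = {
    dim: next((cat for cat, score in cats.items() if score == 0), None)
    for dim, cats in DIMENSION_SCORES.items()
}

print(LOW_CATEGORIES)
```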
* Add eval-spec grounding (scope + golden Q&A) and failure-cause taxonomy
Introduces an optional eval spec (`eval/data/eval_spec.json`, or `--eval-spec`)
that grounds quality scoring:
- `scope` (free text) — defines what the agent handles; out-of-scope is the
complement, so a polite decline is scored `declined` (replaces the brittle
scope_decisions topic list).
- `golden_qa` — expected answers matched per-question by embedding similarity
(`--golden-threshold`); injects ground truth into the judge and emits a
`golden_eval_summary`. Matching now lives in the SDK (`match_golden_qa`),
not in downstream projects.
- `tools` — capability description used by the judge's new `failure_attribution`
metric.
Adds a 3-way failure taxonomy: every `unhelpful` session is attributed to a
`skill_gap` (evolution-fixable), `knowledge_gap` (a fact missing from the data),
or `tool_gap` (no tool/capability), and a new `addressable_meaningful_rate`
reports quality on questions the agent can actually answer. Surfaced in the
console, markdown (with per-class "add a fact" / "build a tool" sections), and
JSON.
Also: `tool_usage.no_tool_needed` (scores 2, not a failure), `--dimensions
{full,primary}` cost control, `_has_dimension_data` so unscored dimensions
aren't reported as 0.0. Tests: 94 pass.
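
As an illustration of matching a question to golden_qa by embedding similarity, the sketch below uses cosine similarity over toy two-dimensional vectors. It is not the SDK's `match_golden_qa`; the entry layout, the function names, and the threshold value are assumptions.

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_golden(question_vec, golden, threshold):
    """Return the most similar golden entry at or above threshold, else None."""
    best, best_score = None, threshold
    for entry in golden:
        score = cosine(question_vec, entry["embedding"])
        if score >= best_score:
            best, best_score = entry, score
    return best


# Toy embeddings; real ones would come from an embedding model.
golden = [
    {"question": "q1", "answer": "a1", "embedding": [1.0, 0.0]},
    {"question": "q2", "answer": "a2", "embedding": [0.0, 1.0]},
]

hit = match_golden([0.9, 0.1], golden, threshold=0.8)
miss = match_golden([1.0, 1.0], golden, threshold=0.8)
print(hit["answer"] if hit else None, miss)
```

A question below the threshold gets no expected answer injected, so it is judged without ground truth.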
* Apply autoformat (isort + pyink) to fix format check
* quality_report: warn when scoring without golden_qa ground truth
response_usefulness and task_grounding are LLM estimates when no golden_qa is
supplied -- they can mislabel verbose, tool-grounded answers as
ungrounded/unhelpful. Emit a clear warning pointing users to --eval-spec with a
golden_qa list so correctness is graded against expected answers
(summary.golden_eval_summary). The golden-Q&A grounding itself is unchanged and
remains the reliable path.
* scripts/README: document adding evals + all new quality-report features
The quality-report usage section listed filters and report flags but omitted
the most important workflow — adding evals (ground truth) — and several
features introduced in this PR. Now documented:
- Adding evals: eval-spec (scope/tools/ground_truth/golden_qa) + scoring a
local conversations file (--conversations-file, no BigQuery) with the JSON
format and recommended workflow
- Failure-cause taxonomy (skill_gap / knowledge_gap / tool_gap) and the
'tools' eval-spec field that drives it; routing-failure and parroting notes
- --dimensions full|primary, --tag-turns, --trajectory-samples, --concurrency,
--golden-threshold, --eval-config, --env, per-category --samples
- golden_eval_summary and the no-ground-truth warning
- Reorganized the usage quick-reference into grouped, complete examples
* quality_report: address review (B1, B2, H1-H3, M2, L1-L2, L4)
B1 — golden Q&A no longer a no-op on the BigQuery path: run_evaluation now
matches golden_qa from resolved questions, returns golden_metadata (so the
existing _inject_golden_summary fires) and generate_quality_report injects it
too; warns that the server-side judge can't take per-session expected answers
(expected-answer grounding stays on --conversations-file).
B2 — failure-cause taxonomy is gated on a new _has_failure_attribution_data
predicate (needs failure_attribution, or both tool_usage and correctness), in
console, markdown, and JSON; with --dimensions primary the breakdown no longer
defaults every failure to skill_gap.
H1 — metric count corrected to 8 (incl. failure_attribution) and cost ~4x in
--dimensions help, README, and metrics section.
H2 — dropped the no-longer-supported --agent-context flag from sample_quality_report.md;
fixed the residual agent-context docstring (also L1).
H3 — added a TraceFilter regression test asserting the $.custom_tags JSON path.
M2 — thread eval_config through generate_quality_report_from_conversations.
L2 — drop the unreachable TypeError in get_a2a_response.
L4 — lift the golden threshold to _DEFAULT_GOLDEN_THRESHOLD.
Tests: 99 pass (5 new). pyink + isort clean.
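
The B2 gating rule stated above (failure_attribution scored, or both tool_usage and correctness scored) can be sketched as a small predicate. The signature is an assumption; the real _has_failure_attribution_data presumably inspects the report data rather than a set of metric names.

```python
def has_failure_attribution_data(scored_metrics: set) -> bool:
    """True when the failure-cause breakdown has enough data to be trusted."""
    if "failure_attribution" in scored_metrics:
        return True
    # Fallback: both tool_usage and correctness must have been scored.
    return {"tool_usage", "correctness"} <= scored_metrics


# With --dimensions primary only the two primary metrics are scored,
# so the breakdown is suppressed rather than defaulting to skill_gap.
print(has_failure_attribution_data({"response_usefulness", "task_grounding"}))
```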
---------
Co-authored-by: Haiyuan Cao <haiyuan@google.com>

1 parent 2d15a61 commit c2afc72
12 files changed
Lines changed: 5649 additions & 562 deletions
File tree
- issues
- scripts
- eval
- data
- src/bigquery_agent_analytics
- tests