Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering#174
Conversation
caohy1988
left a comment
There was a problem hiding this comment.
Comprehensive review — PR #174 head b830575
Checked locally. Solid feature addition with good test coverage (91 tests pass, 12 new). Most of the code is clean. A handful of design questions worth resolving plus one familiar systematic issue.
The PR is 1 commit behind origin/main (PR #156 squash-merge f4aa037), with PR #156's 6 unsquashed commits still in the branch's history. Mergeable status unknown until rebase.
What's in the PR (vs PR #156 baseline)
Single new commit b830575 on top of PR #156. Files actually changed (excluding inherited from PR #156):
scripts/README.md | 30 +- (new metrics docs)
scripts/quality_report.py | 557 +++++ (5 new LLM-judge metrics, dimension averaging, multi-turn stats, MD refactor)
scripts/sample_quality_report.md | 142 +/- (regenerated)
tests/test_quality_report_helpers.py | 158 + (3 new test classes, 12 tests)
tests/test_latency_report_helpers.py is unchanged. scripts/latency_report.py is unchanged.
Blockers
B1 — Rebase needed; PR base is the unsquashed PR #156 branch tip
mergeable: UNKNOWN, mergeStateStatus: UNKNOWN. PR #156 was squash-merged into main as f4aa037. PR #174's branch still has the 6 original PR #156 commits unsquashed (8564042 → 31fe562). After rebase onto current origin/main, those 6 commits should be subsumed by the squash.
Suggested:
git fetch origin main
git rebase origin/main
# The 6 PR #156 commits should be auto-dropped because their changes are now in
# the squash commit on main. If git keeps them, drop them via interactive rebase.
# Conflicts most likely in scripts/quality_report.py and scripts/sample_quality_report.md.
git push --force-with-leaseAfter the rebase, the diff against origin/main should be roughly +717/-170 (the 4 files above), matching git diff 31fe562..HEAD.
B2 — 5× LLM-judge cost increase per evaluation run, no opt-out
The PR adds 5 new categorical LLM-as-judge metrics. Each session is now evaluated on 7 metrics instead of 2. For --limit 100, that's ~700 LLM calls instead of ~200 — a 3.5× cost increase with no flag to disable the new dimensions. For --limit 1000 (some users run weekly bulk evals), the difference is ~2000 calls.
PR description doesn't mention cost. The README says "evaluation scores each session on 7 dimensions" but doesn't quantify cost. Two reasonable fixes:
- (a) Add a
--dimensions {full,primary,off}flag (defaultfull,primarykeeps onlyresponse_usefulness+task_grounding,offskips eval entirely) - (b) Document the cost implication in README with rough numbers ("Adds ~5 LLM calls per session, costs roughly $X per 100 sessions at default model")
Without one of these, an operator turning on --config and --limit 500 weekly is suddenly paying 3.5× more without realizing why.
Design questions (worth resolving before merge)
D1 — scope_compliance and response_usefulness.declined measure the same thing
Looking at the two metrics:
response_usefulness.declined= "outside scope, correctly declined" (a correct outcome)scope_compliance.compliant= "correctly answered in-scope OR correctly declined out-of-scope"scope_compliance.non_compliant= "answered out-of-scope OR refused in-scope"
These are highly correlated. A session marked declined in response_usefulness will essentially always be compliant in scope_compliance. A session marked unhelpful (refused in-scope) should be non_compliant.
Two metrics paying the same LLM-judge cost to measure the same property is wasteful. Either:
- Drop
scope_complianceand rely on the existingresponse_usefulnesscategories to cover scope correctness - Keep
scope_complianceand removedeclinedfromresponse_usefulness(probably worse —declinedis heavily relied on downstream) - Reframe
scope_complianceto capture somethingresponse_usefulnessdoesn't (e.g., scope creep within a single response — partially-in-scope answers)
The third option is what I'd recommend — narrow scope_compliance to detect scope mixing within a single response (e.g., "answered a different question than asked"), which response_usefulness doesn't capture.
D2 — first_time_right is duplicative with correctness for single-turn sessions
The metric definition says: "For single-turn conversations, evaluate the only response." But for single-turn sessions, "was the first response right" is exactly correctness. The metric only adds signal for multi-turn sessions where the user had to correct the agent.
Since _count_trace_metrics already counts user_turns, the PR could conditionally include first_time_right only in the metric set for multi-turn sessions — saving LLM calls for the majority of sessions that are single-turn. Or document that first_time_right is intended for multi-turn analysis and may be redundant for single-turn.
D3 — Inconsistent middle-category names across _DIMENSION_SCORES
_DIMENSION_SCORES = {
"correctness": {"correct": 2, "mostly_correct": 1, "incorrect": 0},
"tool_usage": {"proper": 2, "partial": 1, "none": 0},
"specificity": {"specific": 2, "somewhat_specific": 1, "vague": 0},
"scope_compliance": {"compliant": 2, "partially_compliant": 1, "non_compliant": 0},
"first_time_right": {"correct": 2, "clarification_needed": 1, "correction_needed": 0},
}Five different middle-category names (mostly_correct, partial, somewhat_specific, partially_compliant, clarification_needed). Each is contextually appropriate, but the LLM judge prompt gets the full per-dimension vocabulary anyway, so the variety helps the judge. Reasonable choice — just worth a comment in the code explaining why the names diverge (so a future maintainer doesn't "normalize" them).
Also: tool_usage.partial collides namespace-wise with response_usefulness.partial and the visible-label collision shows up in the per-session compact scorecard. Worth aliasing the visible label.
High-priority
H1 — Parse errors silently score 0 in _compute_dimension_averages
quality_report.py:1056 (approximately):
score = score_map.get(mr.category, 0)
dim_totals[mr.metric_name].append(score)If the LLM-judge returns a category that isn't in the score map (parse error, garbled response, an unexpected category name), the metric is scored 0 — which is the same as the "worst" category. This pulls the dimension average down without distinguishing "judge failed" from "agent failed."
Fix: skip parse errors in the average:
if mr.parse_error or mr.category not in score_map:
continue # don't penalize the agent for judge failures
score = score_map[mr.category]
dim_totals[mr.metric_name].append(score)Then report parse-error count separately in the dimension table:
| Correctness | 1.70 / 2.00 (n=18, 2 parse errors) | 🟢 | ... |
Without this, a noisy LLM judge silently drags every dimension average down.
H2 — Unknown categories default to green checkmark in _md_dimension_scorecard
quality_report.py:797:
icon = _SCORECARD_ICONS.get(mr.category, "✅")✅ is the green check. If the LLM returns an unexpected category (parse error, weird value), the scorecard renders green — actively misleading. Default should be neutral or warning:
icon = _SCORECARD_ICONS.get(mr.category, "❓") # ❓ unknownSame bug pattern as H1 — silently treating "unknown" as "good."
H3 — _count_trace_metrics only counts TOOL_COMPLETED, missing TOOL_ERROR
elif span.event_type == "TOOL_COMPLETED":
tool_calls += 1A tool call that errored out (TOOL_ERROR) isn't counted. So the reported "avg tool calls" undercounts the agent's actual tool usage. For sessions where tools fail, the metric understates effort.
Either:
- Count both
TOOL_COMPLETEDandTOOL_ERROR(intent: "total tool attempts") - Split into
tool_calls_successful/tool_calls_failed(better signal, more useful)
The test test_tool_starting_not_counted covers the "STARTING shouldn't double-count" case but doesn't pin down the error-handling intent. Add a test for TOOL_ERROR.
Medium
M1 — Test plan checkboxes still unchecked
PR description has two unchecked items:
- Review Quality Dimensions table renders correctly on GitHub
- Verify JSON output includes new dimension_averages and multi-turn fields
The Quality Dimensions table renders fine in sample_quality_report.md (I checked — emoji + table cells display properly). The JSON output does include dimension_averages (verified via _build_json_output at line 1687). Just need to tick those boxes or address whatever the gating concern was.
M2 — _compute_multiturn_stats is inconsistent about return shape
def _compute_multiturn_stats(resolved_map):
...
if not total:
return {}
return {
"avg_user_turns": ...,
"avg_tool_calls": ...,
"multi_turn_sessions": ...,
}Empty input → {}; non-empty → 3-key dict. Callers must handle both:
mt_stats = _compute_multiturn_stats(resolved_map)
if mt_stats:
print(...)This works but is brittle. Consider always returning the 3-key dict with zeros so the JSON shape is stable:
return {
"avg_user_turns": round(sum(user_turns) / total, 1) if total else 0,
"avg_tool_calls": round(sum(tool_calls) / total, 1) if total else 0,
"multi_turn_sessions": sum(1 for t in user_turns if t > 1),
}Then JSON consumers don't need to special-case empty runs.
M3 — **mt_stats flattening in JSON summary creates surprises
quality_report.py:1702:
return {
"summary": {
...
"dimension_averages": dim_avgs,
**mt_stats,
},
...
}Adding any field to _compute_multiturn_stats automatically appears in summary — fine when intentional, surprising otherwise. Prefer:
"summary": {
...
"dimension_averages": dim_avgs,
"multi_turn": mt_stats,
}Cleaner JSON schema, more discoverable for downstream consumers.
M4 — README and sample report duplicate the "scored 0-2 on five dimensions" phrasing
scripts/README.md:130: "score each session 0-2 and are averaged across all sessions to produce the Quality Dimensions table"
scripts/sample_quality_report.md:25: "Each session is scored 0-2 on five dimensions. Scores are averaged across all sessions."
Similar but redundant. Keep one source (or link the README to the sample for the long form).
M5 — _DIMENSION_DESCRIPTIONS only in markdown, not console
The "What it measures" column is included in the markdown report but the console output (_print_eval_results) skips it:
print(f" {label:<20s}: {avg:.2f} / 2.00 {bar}")A user running quality_report.sh without --report won't know what each dimension measures. Add the description to the console output too (one-line per dimension):
desc = _DIMENSION_DESCRIPTIONS.get(dim, "")
print(f" {label:<20s}: {avg:.2f} / 2.00 {bar}")
if desc:
print(f" {'':<20s} ↳ {desc}")Low / nits
L1 — Sample markdown trailing-whitespace regression (same as PR #156)
$ git diff --check 31fe562..HEAD
scripts/sample_quality_report.md:5: trailing whitespace.
+**Generated:** 2026-05-19 05:37:33
scripts/sample_quality_report.md:8: trailing whitespace.
+**Location:** us-central1
scripts/sample_quality_report.md:9: trailing whitespace.
+**Eval model:** gemini-2.5-flash
scripts/sample_quality_report.md:10: trailing whitespace.
+**Sessions:** 20
Four lines flagged. Same root cause as PR #156: _write_md_report emits f"**Generated:** {timestamp} " (two trailing spaces for GFM hard-breaks). The PR #156 approval flagged this as an "Option A vs B" recommendation; neither has been applied. Once the systematic fix lands in _write_md_report it stops recurring.
L2 — _DIMENSION_NAMES = list(_DIMENSION_SCORES.keys()) relies on dict insertion order
Python 3.7+ guarantees insertion order, so this works. But a comment noting "order matters for report rendering" would help a future maintainer who reorders _DIMENSION_SCORES for some other reason.
L3 — _md_dimension_scorecard builds _SCORECARD_ICONS inside the function
quality_report.py:780-797: the dict is rebuilt on every call. Lift to a module-level constant near _DIMENSION_SCORES and _DIMENSION_DESCRIPTIONS for consistency and zero per-call allocation.
L4 — Test gap: parse-error scoring in _compute_dimension_averages
The 4 new TestComputeDimensionAverages tests cover happy path, all-perfect, empty, and missing-dimensions. None covers an unexpected category name (the case in H1). Add:
def test_unknown_category_treated_as_zero(self):
# Documents current behavior; revisit if H1 changes it
sessions = [
_FakeSession("s1", [_FakeMetric("correctness", "garbage")]),
]
avgs = _compute_dimension_averages(_FakeReport(sessions))
assert avgs["correctness"] == 0After H1 is fixed, the test asserts the skip-instead-of-zero behavior.
L5 — _count_trace_metrics test coverage gap
test_tool_starting_not_counted confirms TOOL_STARTING doesn't count, but no test confirms what happens with TOOL_ERROR (H3). Add it as part of the H3 fix.
L6 — correct is a category name in two dimensions
correctness.correct and first_time_right.correct both use the bare name "correct". This is fine because the metric_name disambiguates them in the category-distributions output and in _DIMENSION_SCORES (keyed by metric_name → category map). But a reader skimming the code sees "correct" in two score maps and might wonder if they collide. Inline comment would help.
L7 — _PRIMARY_METRICS defined twice
_PRIMARY_METRICS = {"response_usefulness", "task_grounding"} appears as a local set at _write_md_report (line ~1497). The same constraint is hardcoded twice more in _print_eval_results (lines if mr.metric_name not in ("response_usefulness", "task_grounding"):). Lift to a module-level constant.
What looks great
- The Quality Dimensions table with rating legend (🟢 / 🟡 / 🔴 thresholds at 1.50 / 1.00) is a clean visual abstraction — at a glance you see which dimensions are problem areas. Better than a raw "average 0.90" with no rating context.
- The compact one-line scorecard for dimensions in per-session output (
Correctness ✅ | Tool Usage ❌ | Specificity ⚠️ | Scope ✅ | First-Time Right ✅) is the right call vs verbose multi-line blocks. Keeps the per-session view readable while still surfacing all 5 dimensions. - Section refactor with
_md_write_session_section(used for Unhelpful/Declined/Partial Sessions) consolidates ~80 lines of duplicated markdown-writing into one helper. Right cleanup. _count_trace_metricsis a clean extraction — testable pure function with focused responsibility. Good test coverage of the happy paths.- Per-dimension category vocabularies (
correct/mostly_correct/incorrect,proper/partial/none, etc.) are well-chosen — the LLM judge gets vocabulary that fits each dimension instead of forcing everything intogood/medium/bad. dimension_averagesflowing into_build_json_outputkeeps the JSON output complete. Downstream consumers don't have to recompute averages.- README documentation update is thorough — the "2 (best) / 1 (middle) / 0 (worst)" table per dimension is exactly what an operator wants to see when interpreting the report.
- Test coverage — 12 new tests covering 3 helper functions, including edge cases (empty, single-turn, missing keys). Good test discipline carried over from PR #156.
Verdict
REQUEST_CHANGES on:
- B1 rebase (one cleanup) — needed to confirm mergeability
- B2 cost transparency — 5× LLM-judge cost increase without a flag or doc is the kind of surprise that erodes operator trust; add either a
--dimensions {full,primary,off}flag or explicit cost docs in README - D1 / D2 design questions —
scope_complianceis largely redundant withresponse_usefulness.declined;first_time_rightis duplicative withcorrectnessfor single-turn sessions. Either reshape these metrics or document why both pay the LLM-judge cost - H1 / H2 / H3 correctness issues — parse errors silently scoring 0, unknown categories defaulting to green check,
TOOL_ERRORnot counted
M1-M5 and L1-L7 are improvement-level — bundle into the same push or skip.
Substantive design is good. The dimension averaging, multi-turn extraction, and markdown refactor all do the right things. The main concerns are operational (cost) and semantic (overlapping metrics) — both are addressable.
b830575 to
766363f
Compare
|
[P1] On current head The sample report demonstrates the bad outcome: Suggested fix: add a positive category such as |
…rmat Add 5 new LLM-as-judge quality dimensions to the quality report: correctness, tool_usage, specificity, scope_compliance, first_time_right. Each session is scored 0-2 and averaged into a Quality Dimensions table. Add multi-turn efficiency metrics (avg user turns, avg tool calls, multi-turn session count) extracted from trace spans. Streamline report output for readability: - Quality Dimensions table includes "What it measures" descriptions and a color-coded rating legend - Category distributions only show primary metrics (response_usefulness, task_grounding) since dimension averages already summarize the rest - Per-session details use a compact one-line scorecard for dimensions instead of verbose multi-line blocks Add 12 new tests for the helper functions. Update README metrics documentation. Regenerate sample quality report from live data.
…raction Extract full multi-turn conversations from trace spans and use LLM to classify user follow-ups as corrections, verifications, or normal follow-ups. Surface correction_rate and verify_rate in both console output and markdown reports to match the metrics available in knowledge_supervisor's multiturn_quality_report.
…and ground truth support The LLM judge was misclassifying in-scope failures as "declined" because the declined category was always present regardless of whether scope context existed. This led to inflated correctness scores for agents that failed to answer in-scope questions. Key changes: - Make the "declined" metric category conditional: only include it when the agent config actually defines out-of-scope topics (has_scope flag). Without scope context the judge has no basis for that category. - Add ground_truth config field to _build_scope_context so the judge can verify factual accuracy against known-correct policy data. - Add in_scope_topics to scope context so the judge can distinguish "topic is out of scope" (declined) from "agent failed to answer an in-scope topic" (unhelpful). - Clarify unhelpful definition: explicitly lists failure patterns like "I don't have that information" for in-scope topics. - Support config_path="none" in _load_agent_config to explicitly disable scope context and skip auto-discovery. - Thread config_path through generate_quality_report for programmatic use. - Update --agent-config CLI help to document the "none" option.
- Add scripts/eval/data/agent_context.example.json with scope_decisions for scope-aware evaluation. Users copy to agent_context.json and customize with their agent's out-of-scope topics. - Remove in_scope_topics from scope context: scope_decisions alone is sufficient. Anything not listed as out_of_scope is implicitly in scope. - Remove hardcoded domain-specific topic names (benefits, PTO, etc.) from the LLM judge prompt to keep it agent-agnostic. - Update README with all new features: dimension drilldowns, single-session evaluation (--session), quality dimensions table, multi-turn metrics, declined category behavior, and sample output links. - Add quality_metrics.json (metric definitions) and sample_quality_report_session.md (single-session verbose output).
Support evaluating traffic-generator conversations directly from a JSON file, bypassing BigQuery entirely. Adds run_evaluation_from_conversations() and generate_quality_report_from_conversations() public APIs, plus CLI flag. Also enhances _write_md_report with TOC, section descriptions, metric explanations, and a report_dir parameter for custom output locations. Includes issue for concurrent session scoring (asyncio.gather + semaphore).
Support --samples with per-category overrides (e.g. unhelpful=10,partial=5,low=3) in addition to single-number and 'all' modes. Refactor _print_eval_results and _write_md_report to use centralized _parse_samples/_get_sample_limit helpers with sensible defaults per category.
Run _infer_corrections calls in parallel using asyncio threads with a semaphore (max 10 concurrent) instead of sequential loop.
Short-circuit help flags before env var validation so users can view usage without setting PROJECT_ID, DATASET_ID, etc.
Add _md_find_low_dimension_sessions and _md_write_low_dimension_section helpers to generate per-dimension subsections for all 5 quality dimensions dynamically. Move low-dimension sections after Declined Sessions. Restructure TOC with <!-- TOC --> wrapper. Add CLI command reproduction in Summary section. Compute report layout variables upfront.
- --limit: clarify it evaluates the N most recent sessions - --trajectory-samples: remove misleading "Requires BQ credentials" - --tag-turns: remove conversations-file restriction, enable for BQ path by passing tag_turns through run_evaluation with concurrent tagging - --env=path syntax now works in shell script (not just --env path) - Error message now says 'export VAR=...' and mentions --env flag
… detection, and parroting detection - Embed segmented execution trajectories in Correction Analysis section (before/after correction with outcome labels: recovered, parroted, not_recovered) - Detect routing failures (agent answered from LLM knowledge without tools) and surface as Routing Failures subsection - Fix answered_by defaulting to "policy_agent" — now "unknown" with BQ backfill - Tighten correction inference prompt to distinguish genuine recovery from parroting - Add --eval-config flag for external metric/prompt definitions (eval_config.json) - Nest Correction Analysis under Sample Sessions with dynamic heading levels - Update sample report and README
Penalize parroted recoveries (agent echoes user's correction without re-verifying via tool) as unhelpful in both the usefulness definition and the unhelpful category. Refactor quality_report.py to auto-discover eval/eval_config.json from the repo root, removing 290 lines of hardcoded metric definitions in favor of the config-driven approach.
Thread per_session_context through quality_report.py into classify_sessions_via_api() so golden eval expected answers can be injected into the judge prompt per session.
When --session is used, the execution trajectory is now fetched automatically from BigQuery and printed to console with sub-trajectory segmentation at correction boundaries. Updated sample with real data showing the full output including trace tree and segmentation.
…L_ERROR counting, and module constants - H1: Skip parse errors and unknown categories in _compute_dimension_averages instead of scoring them as 0 (which inflated averages downward) - H2: Default unknown categories to ❓ instead of ✅ in scorecard icons - H3: Count TOOL_ERROR spans as tool attempts in _count_trace_metrics - L3: Lift _SCORECARD_ICONS to module level (was duplicated in function) - L7: Extract _PRIMARY_METRICS constant, replace 5 inline references - M2: _compute_multiturn_stats returns stable shape on empty input - Update tests: add parse_error attr to _FakeMetric, test TOOL_ERROR counting, test parse error/unknown category skipping, fix empty map assertion
a52d996 to
041b978
Compare
Review Feedback AddressedAll substantive items from the review have been addressed in the latest push: High priority:
Medium priority:
Low priority:
Design:
Branch:
Tests added: |
- TraceFilter.to_sql_conditions: query $.custom_tags.* instead of $.labels.* to match BigQueryLoggerConfig.custom_tags write path - TraceFilter.from_cli_args: add custom_labels parameter - run_evaluation: add custom_labels parameter, thread through to TraceFilter for version-aware session filtering
- Add --label KEY=VALUE (repeatable) to filter by custom_tags set via BigQueryLoggerConfig.custom_tags (version, env, experiment_id, etc.) - Surface app_name and labels in report.details so they appear in both console and markdown Execution Details sections - Expand CLI epilog with filtering section, scope-aware eval example showing full agent_context.json format, and combined filter examples - Update README with Custom Labels section (end-to-end: agent emits → BQ stores → quality report filters), expanded --agent-context docs, and complete Execution Details description - Update sample reports with complete Execution Details fields
… --dimensions flag, and hardening Review feedback (@caohy1988): - P1: add `no_tool_needed` category to `tool_usage` (scores 2) so correctly declined / no-tool-needed sessions are not penalised as a Tool Usage failure; neutral ➖ scorecard icon; regression tests. - B2: add `--dimensions {full,primary}` flag to cut LLM-judge cost ~3.5x (primary scores only the 2 primary metrics); document cost in README + help. Default stays `full` for backward compatibility. - D2: document that `first_time_right` is primarily a multi-turn signal. - D3/L6: comment the deliberately-divergent middle-category names and the shared `correct` category. - M5: print per-dimension descriptions in console output. - L1: render markdown report metadata as a bullet list instead of trailing-double-space hard breaks (fixes `git diff --check`). Independent review hardening: - Gate the dimension block on a new `_has_dimension_data()` helper across all three output paths; JSON `dimension_averages` is now empty (not all-zero) when dimensions were not scored, so consumers don't read unscored dimensions as 0.0 / failing. - Anchor the judge prompt so a missing in-scope tool lookup stays `none` and is not mislabeled `no_tool_needed` (inverse-of-P1 inflation guard). - Make `_DIMENSION_LOW_CATEGORIES` fail-safe if a dimension has no score-0 category (no StopIteration at import). Tests: 81 pass (+5).
Follow-up round addressed (commit
|
Introduces an optional eval spec (`eval/data/eval_spec.json`, or `--eval-spec`)
that grounds quality scoring:
- `scope` (free text) — defines what the agent handles; out-of-scope is the
complement, so a polite decline is scored `declined` (replaces the brittle
scope_decisions topic list).
- `golden_qa` — expected answers matched per-question by embedding similarity
(`--golden-threshold`); injects ground truth into the judge and emits a
`golden_eval_summary`. Matching now lives in the SDK (`match_golden_qa`),
not in downstream projects.
- `tools` — capability description used by the judge's new `failure_attribution`
metric.
Adds a 3-way failure taxonomy: every `unhelpful` session is attributed to a
`skill_gap` (evolution-fixable), `knowledge_gap` (a fact missing from the data),
or `tool_gap` (no tool/capability), and a new `addressable_meaningful_rate`
reports quality on questions the agent can actually answer. Surfaced in the
console, markdown (with per-class "add a fact" / "build a tool" sections), and
JSON.
Also: `tool_usage.no_tool_needed` (scores 2, not a failure), `--dimensions
{full,primary}` cost control, `_has_dimension_data` so unscored dimensions
aren't reported as 0.0. Tests: 94 pass.
|
REady for re-review |
response_usefulness and task_grounding are LLM estimates when no golden_qa is supplied -- they can mislabel verbose, tool-grounded answers as ungrounded/unhelpful. Emit a clear warning pointing users to --eval-spec with a golden_qa list so correctness is graded against expected answers (summary.golden_eval_summary). The golden-Q&A grounding itself is unchanged and remains the reliable path.
Quality dimensions + eval-spec grounding for the BigQuery Agent Analytics quality report.This PR upgrades the LLM-as-judge quality_report from a single "usefulness" verdict into a richer, more trustworthy evaluation. Its headline feature is eval-spec grounding: you supply one optional JSON
|
The quality-report usage section listed filters and report flags but omitted the most important workflow — adding evals (ground truth) — and several features introduced in this PR. Now documented: - Adding evals: eval-spec (scope/tools/ground_truth/golden_qa) + scoring a local conversations file (--conversations-file, no BigQuery) with the JSON format and recommended workflow - Failure-cause taxonomy (skill_gap / knowledge_gap / tool_gap) and the 'tools' eval-spec field that drives it; routing-failure and parroting notes - --dimensions full|primary, --tag-turns, --trajectory-samples, --concurrency, --golden-threshold, --eval-config, --env, per-category --samples - golden_eval_summary and the no-ground-truth warning - Reorganized the usage quick-reference into grouped, complete examples
caohy1988
left a comment
There was a problem hiding this comment.
Fresh re-review — PR #174 head 6595417
Locally checked out, 94 helper tests pass (pytest tests/test_quality_report_helpers.py), git diff --check is clean for the PR's own scripts/src/tests. Solid progression — prior rounds' H1/H2/H3/P1/M2/L3/L7 fixes all landed correctly, the --dimensions flag, _has_dimension_data gating, and the custom_tags JSON-path fix are real wins. Two new blockers surfaced on a fresh pass through; the rest are tightening.
PR is 17 commits behind origin/main. mergeable: MERGEABLE per the GH API, no conflicts, but the rebase has aged since 2026-05-27.
Blockers
B1 — Golden Q&A silently does nothing on the BigQuery path
match_golden_qa is only called from run_evaluation_from_conversations (line 1465). run_evaluation (the BQ path) loads the eval spec, uses scope/ground_truth in the judge prompt, but never matches golden_qa to session questions. Confirmed locally:
run_evaluation (BQ path) mentions match_golden_qa: False
run_evaluation (BQ path) mentions per_session_context: False
run_evaluation_from_conversations (local path): True / True
So this command:
./scripts/quality_report.sh --eval-spec eval/data/eval_spec.json --report --limit 50silently ignores the golden_qa block. No warning, no golden_eval_summary in JSON, no per-session "expected answer" injected into the judge. The PR description headlines golden Q&A as "the trustworthy headline for regression testing" and the README says "the most important way to use the quality report" — but on what is presumably the most common path (score this week's BQ traffic), the feature is a no-op.
The _resolved_map already contains question per session, so wiring is straightforward — between report = client.evaluate_categorical(...) and the return, build question_by_sid = {sid: ctx["question"] for sid, ctx in resolved_map.items()}, call match_golden_qa, then _inject_golden_summary later. Or, if it's deliberately local-only, the BQ path should log a warning when it sees golden_qa in the spec (mirror the existing "No golden_qa" warning at line 1237).
B2 — Failure-cause taxonomy is misleading whenever failure_attribution / tool_usage / correctness weren't scored
_failure_breakdown_from_report and _classify_failures always run, regardless of which metrics were actually scored. With --dimensions primary, the deterministic fallback in _failure_class sees tool=None, correctness=None, attribution=None for every unhelpful session and unconditionally returns "skill_gap". Confirmed locally with 4 unhelpful + 6 meaningful sessions, primary-mode metrics:
counts: {'skill_gap': 4, 'knowledge_gap': 0, 'tool_gap': 0}
The report then prints Failure causes: skill=4 (evolution) knowledge=0 (add data) tool=0 (build tool) and the markdown summary shows Skill gaps (evolution fixes) | 4. A user reading that output reasonably concludes "no knowledge gaps, no tool gaps, just evolution work" — but it's a function of metrics not being scored, not of agent quality. Same class of silent-zero-default bug as the H1/H2 issues this PR already fixed.
Cleanest fix: introduce a _has_failure_attribution_data(report) predicate (analogous to _has_dimension_data) that's true iff at least one session has failure_attribution OR both tool_usage and correctness. Gate the failure-breakdown blocks in _print_eval_results (line 2431), _write_md_report (line 3424), and _classify_failures write-out on it. JSON summary.skill_gap/knowledge_gap/tool_gap should be absent (or null) when ungated, not silently N/0/0.
Note: the same issue bites the eval-spec example. The example's tools field is what lets the judge tell knowledge from tool gaps, but the metric defaults to running even without a tools field — see D1 below.
High
H1 — Off-by-one: PR description and --dimensions help say "7 metrics", actual is 8
eval_config.json defines 8 metrics: response_usefulness, task_grounding, correctness, tool_usage, specificity, scope_compliance, first_time_right, failure_attribution. Several places undercount:
quality_report.py:4013—--dimensionshelp:"'full' (default) scores all 7 metrics"→ should be 8.quality_report.py:4017—"about 3.5x cheaper (2 LLM calls/session instead of 7)"→ actually 4× (2 instead of 8).scripts/README.md:78,122,271,337,506— five "7 metrics / 7 dimensions" claims.scripts/sample_quality_report_session.md:5— "all 7 metrics".- PR description itself: "evaluation scores each session on 7 dimensions".
Cost claim ($x cheaper) is a B2-adjacent operator-trust issue — quoting 3.5× when the real ratio is 4× is in the noise, but consistently saying 7 when there are 8 makes a careful reader doubt the rest of the numbers.
H2 — Stale --agent-context flag in sample_quality_report.md
scripts/sample_quality_report.md:29:
Markdown report generated by `./scripts/quality_report.sh --report --limit 20
... --agent-context agent_context.json`.
--agent-context was removed from argparse this round (replaced by --eval-spec). Verified python3 quality_report.py --help no longer lists it. Anyone copy-pasting the command line from the sample report gets unrecognized arguments: --agent-context. scripts/quality_report.py:562 also has a residual docstring mention ("same pattern as agent-context auto-discovery").
Re-render the sample with the current command line (and, while there, with --eval-spec set so the sample actually showcases golden_eval_summary — currently the sample report contains zero occurrences of "golden" or "eval-spec").
H3 — custom_tags JSON-path fix has no regression test
src/bigquery_agent_analytics/trace.py:613 flips '$.labels.' → '$.custom_tags.' — verified against producer code (producers/src/bigquery_agent_analytics_tracing/claude_code.py:447 and friends write custom_tags). The fix is one line and clearly correct, but there's not a single test in tests/ that mentions custom_labels or custom_tags:
$ grep -rnE "custom_labels|custom_tags" tests/
(no matches)
This is exactly the kind of "silent broken filter" — --label version=v1 returns an empty result set, no error, no clue — that the lack of tests allowed the first time. Suggested guardrail:
def test_custom_labels_uses_custom_tags_json_path():
tf = TraceFilter(custom_labels={"version": "v1"})
sql, _ = tf.to_sql_clauses() # or whatever the method is
assert "$.custom_tags." in sql
assert "$.labels." not in sqlA single string-assertion test would catch any future regression of the same kind.
Medium
M1 — failure_attribution runs (and burns 1 judge call/session) without an eval_spec.tools field
failure_attribution is scope_aware: true, so the scope-and-tools blurb is appended only when those fields exist. Without eval_spec.tools, the judge gets the metric definition but no tool inventory — meaning it cannot meaningfully distinguish knowledge_gap (tool covers topic, fact missing) from tool_gap (no tool for topic). The metric still runs unconditionally — adding 1 LLM call/session and producing low-signal results that feed straight into _classify_failures.
Options:
- Skip the
failure_attributionmetric wheneval_spec.toolsis empty (drop frommetricslist at filter time, the same way--dimensions primaryalready filters). - Or: log a warning when the spec has
golden_qa/scopebut lackstools, calling out that failure attribution will be noisy.
M2 — generate_quality_report_from_conversations doesn't expose eval_config
run_evaluation_from_conversations (line 1401) accepts eval_config=None but the public wrapper generate_quality_report_from_conversations (line 1514) doesn't pass it through. So a programmatic caller (evolve.py, score_and_compare.py) can't override metric definitions for the local-conversations path the way the CLI can. Trivial fix: thread eval_config through the signature.
M3 — **mt_stats flattening still propagates into summary
_build_json_output:3901 keeps the previous-round **mt_stats flatten:
"summary": {
...
"dimension_averages": (...),
**mt_stats,
},That's still the prior M3 ergonomics concern — adding any field to _compute_multiturn_stats automatically appears at the top level of summary, with no schema. Wrapping under "multi_turn" would be cleaner; PR rejected this earlier on the basis that score_conversations.py reads summary['correction_rate'] directly. Fair, but at least the new fields (avg_corrections, avg_verifications) could be nested without breaking that consumer. Not blocking.
M4 — Conversation extraction loses per-turn agent attribution
_extract_conversation returns [{"role": "user"|"agent", "text": str}]. The original spans have span.agent, which gets dropped. For multi-agent sessions where different turns are answered by different sub-agents, the conversation block in the markdown report says **agent:** for every reply without telling you which one. Worth preserving on the dict for the markdown render (and the JSON export).
Low / nits
L1 — _load_eval_spec docstring references "agent-context"
quality_report.py:562 — "same pattern as agent-context auto-discovery". Same root as H2: leftover from before --agent-context got renamed.
L2 — get_a2a_response exception catches an impossible TypeError
quality_report.py:750 — except (json.JSONDecodeError, TypeError):. The branch is gated on isinstance(c, str), so json.loads(c) never raises TypeError. Cosmetic; drop TypeError.
L3 — _failure_class deterministic fallback overrides judge "not_a_failure"
If the judge says failure_attribution = "not_a_failure" for an unhelpful session, the code treats that as an attribution miss and runs the deterministic split (knowledge vs skill). Sometimes that's right (response_usefulness judge mislabeled a fine response as unhelpful); sometimes it's wrong (judge correctly recognized the response was fine but the deterministic split now invents a gap). Worth a comment at line 3722 documenting the choice, or a third branch that drops these sessions from the breakdown.
L4 — Magic number 0.92 for golden-Q&A threshold appears in two places
Default in match_golden_qa (line 374) and in argparse (line 4108). Lift to a _DEFAULT_GOLDEN_THRESHOLD = 0.92 module constant so changing the default doesn't require touching two places.
L5 — _DIMENSION_NAMES insertion-order comment is good; one nit
The comment "order matters for rendering" (line 2126) is good. But there's no test that pins down the rendering order — if someone reorders _DIMENSION_SCORES, the markdown table reorders silently. A render-order test on _compute_dimension_averages or _md_dimension_scorecard would catch this.
What looks great
_has_dimension_datapredicate + three-path gating is exactly the right shape. JSON now correctly emits{}for unscored dimensions; console and markdown both gate on the same predicate. Direct response to the prior-round JSON-path inconsistency and clean.--dimensions {full,primary}with cost calc in the help text is the right cost-transparency move. The filter rebuilds a new dict ({**eval_config, "metrics": [...]}) so the cache isn't mutated — verified.tool_usage.no_tool_needed+_SCORECARD_ICONS["no_tool_needed"] = "➖"correctly resolves prior-round P1. The judge prompt's "DECISIVE TEST" (line 81 of eval_config) guards against the inverse failure (mislabeling a real tool-skip asno_tool_needed). Regression teststest_tool_usage_no_tool_needed_scores_full+..._does_not_drag_averageare tight._DIMENSION_LOW_CATEGORIESfail-safe (usingnext(..., None)and filtering outNone) preventsStopIterationat import time if a future dimension lacks a score-0 category. Small, defensive, right call.- Eval-spec docstring + comments at line 2095-2109 explaining why middle-category names diverge per dimension and why
correctis duplicated across two dimensions answers the prior-round D3 ask. A future maintainer will not "normalize" them. _diagnose_correction_traceis a clean small heuristic — when the agent never routed AND never tool-called, it surfaces "Routing Failures" as actionable feedback. Useful diagnostic output, not just more numbers.per_session_contextplumbing incategorical_evaluator.pyis minimal (added param, defensiveif … in dict) — doesn't break any existing caller, doesn't change the prompt when no context is supplied.custom_tagsJSON-path fix intrace.pyis a real bug fix (verified path matches what producers write). One-liner, correct.- 94 tests pass. Prior-round test additions are all tight regression guards (
test_parse_error_skipped,test_unknown_category_skipped,test_tool_error_counted,test_tool_usage_no_tool_needed_*,test_has_dimension_data_*). - README "Adding evals" section is the right framing for the headline feature. The recommended-workflow snippet (eval-spec +
--conversations-file) is exactly right for an evolution loop. (If B1 lands, the BQ-path workflow becomes equally trustworthy.)
Verdict
REQUEST_CHANGES on B1 and B2, plus H1/H2/H3 cleanup:
- B1 (golden Q&A wiring on the BQ path) — the marquee feature should work on the most common path, or fail loudly when it doesn't.
- B2 (failure-cause taxonomy needs a
_has_failure_attribution_datagate) — same pattern as the existing dimension gate; one helper + three call-site changes. - H1 (8 vs 7 metric count + 4× vs 3.5× cost) — small, mechanical doc/help update.
- H2 (sample report's stale
--agent-context) — re-render with current flags, ideally with--eval-specset so the sample shows the headline feature. - H3 (
custom_tagstest) — one string-assertion test on the SQL clause.
M1–M4 and L1–L5 bundle in the same push or skip — your call.
Substantive design is in great shape. Failure-cause taxonomy, multi-turn segmentation, routing-failure detection, scope-grounded scoring, and the eval-spec architecture all look right; the remaining work is making the BQ path and the failure breakdown as trustworthy as the rest.
Follow-up validation — current head
|
B1 — golden Q&A no longer a no-op on the BigQuery path: run_evaluation now matches golden_qa from resolved questions, returns golden_metadata (so the existing _inject_golden_summary fires) and generate_quality_report injects it too; warns that the server-side judge can't take per-session expected answers (expected-answer grounding stays on --conversations-file). B2 — failure-cause taxonomy is gated on a new _has_failure_attribution_data predicate (needs failure_attribution, or both tool_usage and correctness), in console, markdown, and JSON; with --dimensions primary the breakdown no longer defaults every failure to skill_gap. H1 — metric count corrected to 8 (incl. failure_attribution) and cost ~4x in --dimensions help, README, and metrics section. H2 — removed the removed --agent-context flag from sample_quality_report.md; fixed the residual agent-context docstring (also L1). H3 — added a TraceFilter regression test asserting the $.custom_tags JSON path. M2 — thread eval_config through generate_quality_report_from_conversations. L2 — drop the unreachable TypeError in get_a2a_response. L4 — lift the golden threshold to _DEFAULT_GOLDEN_THRESHOLD. Tests: 99 pass (5 new). pyink + isort clean.
Review feedback addressed (commit
|
|
Thanks for the thorough re-review @caohy1988 — all blockers and High items are addressed in B1 — Golden Q&A no-op on the BigQuery path ✅ B2 — Failure taxonomy misleading when those metrics weren't scored ✅ H1 — "7 metrics" → 8 ✅ Help text ( Also covered M2 / L1–L2 / L4 from the same commit. Verification: |
caohy1988
left a comment
There was a problem hiding this comment.
Re-review — PR #174 head fa5938144 (merge of current main + fix commit ac47775)
Checked out locally. Both June 5 blockers and all three High items are verified fixed. 99 helper tests pass, pyink/isort clean on PR-touched files, git diff --check clean, mergeable: MERGEABLE with current main merged in (incl. the #300 rename).
Verified RESOLVED
- B1 — Golden Q&A on the BigQuery path.
run_evaluationnow buildsquestion_by_sidfromresolved_map, callsmatch_golden_qa(quality_report.py:1316), threadsgolden_thresholdthrough the signature, returnsgolden_metadata(:1389), and_inject_golden_summaryruns on the BQ JSON output (:1429) andgenerate_quality_report(:2059). The design boundary — expected-answer judge injection stays conversations-only because the server-side AI.GENERATE judge can't take per-session context — is documented in a comment block and, critically, logged as a WARNING at runtime (:1320-1325) so the partial behavior is no longer silent.scope/ground_truthground both paths. This is exactly the shape I asked for. - B2 — Failure-taxonomy gating.
_has_failure_attribution_data(:3771) with a docstring that explains the failure mode it prevents, gating all three output paths: console (:2473), markdown (:3464), JSON (:3973).--dimensions primaryno longer fabricatesskill=N knowledge=0 tool=0. - H1 — metric count. README (
:78,122,271,339,356,508) and the--dimensionshelp text (:4080) all corrected to 8 metrics / ~4x. - H2 — stale
--agent-context. Zero references remain anywhere underscripts/. - H3 —
custom_tagsSQL regression test.test_custom_labels_uses_custom_tags_json_path(tests/test_quality_report_helpers.py:1207) asserts$.custom_tags.in the generated WHERE clause. - M2 —
eval_confignow threads throughgenerate_quality_report_from_conversations. - L4 —
_DEFAULT_GOLDEN_THRESHOLDmodule constant replaces the duplicated0.92.
One straggler (non-blocking)
scripts/sample_quality_report_session.md:5 still says "all 7 metrics" — the README got the 7→8 sweep but this file was missed. One-line fix.
Optional follow-up (M1 from June 5, unaddressed — fine as a separate issue)
failure_attribution still runs (1 judge call/session) even when eval_spec.tools is absent, where the judge can't meaningfully distinguish knowledge_gap from tool_gap. Now that B2's gate exists the output is honest, so this is purely a cost/noise optimization — happy to see it land as a follow-up issue rather than another round here.
Verdict — APPROVE
Three rounds in, every substantive finding is fixed and regression-guarded. Please fix the one "7 metrics" line in sample_quality_report_session.md before (or at) merge.
c2afc72
into
GoogleCloudPlatform:main
Motivation
The original quality report gave two binary metrics — response_usefulness (helpful/unhelpful) and task_grounding (grounded/not). Useful as a pass/fail scorecard, but they don't tell you why a session failed, whether the answer is actually correct, or what to fix. When an agent shows a 35% unhelpful rate, the real questions are:
This PR turns the quality report from a scorecard into a diagnostic that can drive automated skill evolution — including ground-truth correctness via golden Q&A, a failure-cause taxonomy, execution-trace rendering, and version-aware filtering.
What's New
1. Five quality dimensions (0–2)
Each session is scored on 5 orthogonal dimensions: Correctness, Tool Usage, Specificity, Scope Compliance, First-Time Right. Averaged across sessions, surfaced in console + markdown with color-coded ratings, with low-scoring sessions grouped per dimension.
--dimensions a,b,cselects which to score.2. Eval-spec grounding: scope + golden Q&A (ground-truth correctness)
--eval-spec eval_spec.jsonis the single source of evaluation ground truth:{ "scope": "what the agent should/should not answer (out-of-scope -> decline)", "tools": "what the tools can/can't do (used to tell knowledge gaps from tool gaps)", "ground_truth": "authoritative facts for correctness", "golden_qa": [ {"question": "...", "expected_answer": "...", "topic": "..."}, {"question": "...", "expected_behavior": "decline", "topic": "out_of_scope"} ] }golden_eval_summary(matched/unmatched split by usefulness,matched_meaningful_rate, and the list of golden-matched questions the agent got wrong) — the trustworthy headline for regression testing.declinedoutcome: a correct decline of an out-of-scope question scores as a positive, not a failure.golden_qais supplied, the report warns that usefulness/grounding are LLM estimates without ground truth (they can mislabel verbose, tool-grounded answers) and points to--eval-spec.(This supersedes the earlier
--agent-contextscope-only flag; scope now lives in the eval spec.)3. Failure-cause taxonomy (who fixes it)
Failures are classified into skill_gap (had the tool + data, misbehaved → a skill/prompt fix), knowledge_gap (tool used correctly, fact missing → add data), and tool_gap (no tool/data source, or a personal-data/action request → build a tool). The
toolsfield in the eval spec lets the judge tell a knowledge gap from a tool gap. Surfaced in the report as "skill (evolution) / knowledge (add data) / tool (build tool)" — directly routable.4. Multi-turn conversation analysis
Extracts full multi-turn conversations from trace spans; classifies each user turn (
CORRECTION,VERIFY,SPECIFICS,SCOPE,FOLLOWUP,END); computes correction/verification rates, avg user turns, avg tool calls; segments conversations at correction boundaries.5. Correction analysis + trajectory segmentation
For corrected sessions: shows what the agent claimed, what the user corrected, and whether it recovered; fetches the BigQuery execution trace and segments it at correction boundaries with before/after timing. Detects routing failures (answered from LLM knowledge without routing to a specialist) and parroting (echoed the user's correction without re-verifying via a tool).
6. Execution-trace rendering
--trajectory-samples Nfetches N traces (prioritizing unhelpful + correction sessions) and renders the full routing tree with tool calls, latency, and TTFT per span.--session <id>auto-fetches the trace (no extra flags).7. Parroting penalty
Parroted "recoveries" (echoing the user's correction without a tool call) are penalized as unhelpful, preventing inflated scores from agents that just agree with the user.
8. Local evaluation without BigQuery
--conversations-file traffic.jsonscores conversations from a local JSON file via the Gemini API — no BQ credentials — with the same report format. (This is what lets an evolution loop score candidate skills offline.)9. Version-aware filtering + labels
--label key=valuefilters BQ sessions bycustom_tags(e.g.agent_version=v1) and surfaces active filters in the report's Execution Details. Fixes thecustom_tags/custom_labelsJSON-path mismatch ($.labels.*→$.custom_tags.*) so version filtering actually works — essential for comparing skill versions without cross-contamination.10. External eval config + per-category sampling
--eval-config eval_config.jsonexternalizes metric definitions, judge prompts, and rubrics (auto-discovered from the repo).--samples unhelpful=10,partial=5,low=3controls how many examples each report section shows.11. Per-session context for golden eval
per_session_contextonclassify_sessions_via_api()injects per-session context (e.g. expected answers) into the judge prompt — the programmatic hook behind golden-Q&A grounding.12. Performance
Correction inference runs concurrently with
asyncio.gather(~10x on 200+ sessions).How this enables skill evolution
These form the evaluation backbone of an automated skill-evolution loop:
--conversations-file+--eval-specgolden Q&A score evolved candidates against ground truth offline (golden_eval_summary).--label agent_version=…) + structured JSON (--output-json) feed automated candidate comparison.Review feedback addressed
Round 1 (@caohy1988):
_compute_dimension_averagesscored unknown categories as 0, dragging averages downparse_error+ categories not in the score map_md_dimension_scorecarddefaulted unknown categories to ✅_count_trace_metricsmissedTOOL_ERRORTOOL_COMPLETED+TOOL_ERROR_compute_multiturn_stats({})→{}caused KeyErrorssummary.correction_rate)_SCORECARD_ICONSduplicated("response_usefulness","task_grounding")repeated 5+×_PRIMARY_METRICSscope_compliancevstask_groundingoverlapeval_config.jsonorigin/mainRound 2:
tool_usagehad no "tool not needed" outcomeno_tool_needed(scores 2 — a correct outcome)--dimensionsto select dimensionscustom_tagspath + added--labelNew tests for all code-level fixes (e.g.
test_parse_error_skipped,test_unknown_category_skipped,test_tool_error_counted); 76 helper tests total.Files changed
scripts/quality_report.pysrc/bigquery_agent_analytics/categorical_evaluator.pyper_session_contextsrc/bigquery_agent_analytics/trace.pycustom_tagspath fix supportscripts/quality_report.sh--envloading,--helpwithout envscripts/eval/eval_config.jsonscripts/eval/data/eval_spec.example.json{scope, tools, ground_truth, golden_qa}tests/test_quality_report_helpers.pyscripts/sample_quality_report*.md,scripts/README.mdUsage
Test plan
--conversations-file traffic.json --eval-spec eval_spec.json→golden_eval_summarymatches ground truth--session <id> --tag-turns--limit 20 --report --tag-turns --trajectory-samples 3--label agent_version=v1golden_qa--dimensionsselects a subset;--helpworks without env vars