Skip to content

Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering#174

Merged
haiyuan-eng-google merged 30 commits into
GoogleCloudPlatform:mainfrom
evekhm:feat/quality-dimensions
Jun 15, 2026
Merged

Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering#174
haiyuan-eng-google merged 30 commits into
GoogleCloudPlatform:mainfrom
evekhm:feat/quality-dimensions

Conversation

@evekhm

@evekhm evekhm commented May 19, 2026

Copy link
Copy Markdown
Contributor

Motivation

The original quality report gave two binary metrics — response_usefulness (helpful/unhelpful) and task_grounding (grounded/not). Useful as a pass/fail scorecard, but they don't tell you why a session failed, whether the answer is actually correct, or what to fix. When an agent shows a 35% unhelpful rate, the real questions are:

  • Is it getting facts wrong, or just vague? Is its answer correct vs. the known answer?
  • Is it using its tools, or answering from LLM memory?
  • Does it handle scope correctly (decline what it shouldn't answer)?
  • Does it get it right the first time, or only after corrections?
  • Where in the execution trace did it go wrong — routing? tool call? response?
  • And critically: of the failures, which are skill-fixable vs. need data or a new tool?

This PR turns the quality report from a scorecard into a diagnostic that can drive automated skill evolution — including ground-truth correctness via golden Q&A, a failure-cause taxonomy, execution-trace rendering, and version-aware filtering.

What's New

1. Five quality dimensions (0–2)

Each session is scored on 5 orthogonal dimensions: Correctness, Tool Usage, Specificity, Scope Compliance, First-Time Right. Averaged across sessions, surfaced in console + markdown with color-coded ratings, with low-scoring sessions grouped per dimension. --dimensions a,b,c selects which to score.

2. Eval-spec grounding: scope + golden Q&A (ground-truth correctness)

--eval-spec eval_spec.json is the single source of evaluation ground truth:

{
  "scope": "what the agent should/should not answer (out-of-scope -> decline)",
  "tools": "what the tools can/can't do (used to tell knowledge gaps from tool gaps)",
  "ground_truth": "authoritative facts for correctness",
  "golden_qa": [
    {"question": "...", "expected_answer": "...", "topic": "..."},
    {"question": "...", "expected_behavior": "decline", "topic": "out_of_scope"}
  ]
}
  • Golden-Q&A matching embeds each session's question and matches it (cosine ≥ 0.92) to a golden entry, then injects the expected answer into the judge so it grades factual correctness against ground truth rather than guessing.
  • Produces a golden_eval_summary (matched/unmatched split by usefulness, matched_meaningful_rate, and the list of golden-matched questions the agent got wrong) — the trustworthy headline for regression testing.
  • Scope calibrates the declined outcome: a correct decline of an out-of-scope question scores as a positive, not a failure.
  • When no golden_qa is supplied, the report warns that usefulness/grounding are LLM estimates without ground truth (they can mislabel verbose, tool-grounded answers) and points to --eval-spec.

(This supersedes the earlier --agent-context scope-only flag; scope now lives in the eval spec.)

3. Failure-cause taxonomy (who fixes it)

Failures are classified into skill_gap (had the tool + data, misbehaved → a skill/prompt fix), knowledge_gap (tool used correctly, fact missing → add data), and tool_gap (no tool/data source, or a personal-data/action request → build a tool). The tools field in the eval spec lets the judge tell a knowledge gap from a tool gap. Surfaced in the report as "skill (evolution) / knowledge (add data) / tool (build tool)" — directly routable.

4. Multi-turn conversation analysis

Extracts full multi-turn conversations from trace spans; classifies each user turn (CORRECTION, VERIFY, SPECIFICS, SCOPE, FOLLOWUP, END); computes correction/verification rates, avg user turns, avg tool calls; segments conversations at correction boundaries.

5. Correction analysis + trajectory segmentation

For corrected sessions: shows what the agent claimed, what the user corrected, and whether it recovered; fetches the BigQuery execution trace and segments it at correction boundaries with before/after timing. Detects routing failures (answered from LLM knowledge without routing to a specialist) and parroting (echoed the user's correction without re-verifying via a tool).

6. Execution-trace rendering

--trajectory-samples N fetches N traces (prioritizing unhelpful + correction sessions) and renders the full routing tree with tool calls, latency, and TTFT per span. --session <id> auto-fetches the trace (no extra flags).

7. Parroting penalty

Parroted "recoveries" (echoing the user's correction without a tool call) are penalized as unhelpful, preventing inflated scores from agents that just agree with the user.

8. Local evaluation without BigQuery

--conversations-file traffic.json scores conversations from a local JSON file via the Gemini API — no BQ credentials — with the same report format. (This is what lets an evolution loop score candidate skills offline.)

9. Version-aware filtering + labels

--label key=value filters BQ sessions by custom_tags (e.g. agent_version=v1) and surfaces active filters in the report's Execution Details. Fixes the custom_tags/custom_labels JSON-path mismatch ($.labels.*$.custom_tags.*) so version filtering actually works — essential for comparing skill versions without cross-contamination.

10. External eval config + per-category sampling

--eval-config eval_config.json externalizes metric definitions, judge prompts, and rubrics (auto-discovered from the repo). --samples unhelpful=10,partial=5,low=3 controls how many examples each report section shows.

11. Per-session context for golden eval

per_session_context on classify_sessions_via_api() injects per-session context (e.g. expected answers) into the judge prompt — the programmatic hook behind golden-Q&A grounding.

12. Performance

Correction inference runs concurrently with asyncio.gather (~10x on 200+ sessions).

How this enables skill evolution

These form the evaluation backbone of an automated skill-evolution loop:

  1. Diagnose — dimensions + correction analysis pinpoint what's wrong (e.g. "Tool Usage 0.4/2.0 — not calling tools").
  2. Attribute — the failure-cause taxonomy splits skill-fixable failures from knowledge/tool gaps, and routing-failure detection localizes which agent to fix.
  3. Measure--conversations-file + --eval-spec golden Q&A score evolved candidates against ground truth offline (golden_eval_summary).
  4. Compare — version filtering (--label agent_version=…) + structured JSON (--output-json) feed automated candidate comparison.

Review feedback addressed

Round 1 (@caohy1988):

ID Sev Issue Fix
H1 High _compute_dimension_averages scored unknown categories as 0, dragging averages down Skip parse_error + categories not in the score map
H2 High _md_dimension_scorecard defaulted unknown categories to ✅ Default icon now ❓
H3 High _count_trace_metrics missed TOOL_ERROR Count TOOL_COMPLETED + TOOL_ERROR
M2 Med _compute_multiturn_stats({}){} caused KeyErrors Returns zeroed stats
M3 Med Nesting multi-turn stats Kept flat (downstream reads summary.correction_rate)
L3 Low _SCORECARD_ICONS duplicated Lifted to module constant
L7 Low ("response_usefulness","task_grounding") repeated 5+× Extracted _PRIMARY_METRICS
D1 Design scope_compliance vs task_grounding overlap Disambiguated in eval_config.json
B1 Branch Behind origin/main Rebased

Round 2:

Item Fix
tool_usage had no "tool not needed" outcome Added no_tool_needed (scores 2 — a correct outcome)
Whole-suite scoring when only some dims wanted Added --dimensions to select dimensions
Version/label filtering broken Fixed custom_tags path + added --label
Correctness was vibes-only Added eval-spec golden-Q&A grounding + failure-cause taxonomy
Format check Applied isort + pyink

New tests for all code-level fixes (e.g. test_parse_error_skipped, test_unknown_category_skipped, test_tool_error_counted); 76 helper tests total.

Files changed

Area Files What
Quality report scripts/quality_report.py Dimensions, eval-spec grounding (scope + golden Q&A), failure-cause taxonomy, corrections, trajectories, version filtering/labels, conversations-file, parroting, no-ground-truth warning
Core evaluator src/bigquery_agent_analytics/categorical_evaluator.py per_session_context
Trace src/bigquery_agent_analytics/trace.py custom_tags path fix support
Shell wrapper scripts/quality_report.sh --env loading, --help without env
Eval config scripts/eval/eval_config.json Externalized metric definitions + prompts
Eval spec example scripts/eval/data/eval_spec.example.json Example {scope, tools, ground_truth, golden_qa}
Tests tests/test_quality_report_helpers.py 76 helper tests (incl. review fixes)
Samples / docs scripts/sample_quality_report*.md, scripts/README.md Updated with real data + new flags

Usage

# Ground-truth report (scope + golden Q&A): trustworthy correctness + failure taxonomy
python scripts/quality_report.py --conversations-file traffic.json \
  --eval-spec eval/data/eval_spec.json --report --tag-turns --output-json results.json

# Version comparison (filter by skill version)
python scripts/quality_report.py --label agent_version=v1 --report --tag-turns

# Single-session deep-dive (auto-fetches execution trace)
python scripts/quality_report.py --session conv_5d77036b --tag-turns

# Full BQ report with trajectories + selected dimensions
./scripts/quality_report.sh --report --limit 50 --tag-turns \
  --trajectory-samples 5 --eval-spec eval/data/eval_spec.json \
  --dimensions correctness,tool_usage,scope_compliance

Test plan

  • Golden-Q&A grounding: --conversations-file traffic.json --eval-spec eval_spec.jsongolden_eval_summary matches ground truth
  • Single-session auto-trace: --session <id> --tag-turns
  • Multi-session report: --limit 20 --report --tag-turns --trajectory-samples 3
  • Version filter: --label agent_version=v1
  • No-ground-truth warning fires without golden_qa
  • --dimensions selects a subset; --help works without env vars
  • 76 unit tests pass (incl. review-fix tests)

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comprehensive review — PR #174 head b830575

Checked locally. Solid feature addition with good test coverage (91 tests pass, 12 new). Most of the code is clean. A handful of design questions worth resolving plus one familiar systematic issue.

The PR is 1 commit behind origin/main (PR #156 squash-merge f4aa037), with PR #156's 6 unsquashed commits still in the branch's history. Mergeable status unknown until rebase.


What's in the PR (vs PR #156 baseline)

Single new commit b830575 on top of PR #156. Files actually changed (excluding inherited from PR #156):

scripts/README.md                    |  30 +-     (new metrics docs)
scripts/quality_report.py            | 557 +++++  (5 new LLM-judge metrics, dimension averaging, multi-turn stats, MD refactor)
scripts/sample_quality_report.md     | 142 +/-    (regenerated)
tests/test_quality_report_helpers.py | 158 +      (3 new test classes, 12 tests)

tests/test_latency_report_helpers.py is unchanged. scripts/latency_report.py is unchanged.


Blockers

B1 — Rebase needed; PR base is the unsquashed PR #156 branch tip

mergeable: UNKNOWN, mergeStateStatus: UNKNOWN. PR #156 was squash-merged into main as f4aa037. PR #174's branch still has the 6 original PR #156 commits unsquashed (856404231fe562). After rebase onto current origin/main, those 6 commits should be subsumed by the squash.

Suggested:

git fetch origin main
git rebase origin/main
# The 6 PR #156 commits should be auto-dropped because their changes are now in
# the squash commit on main. If git keeps them, drop them via interactive rebase.
# Conflicts most likely in scripts/quality_report.py and scripts/sample_quality_report.md.
git push --force-with-lease

After the rebase, the diff against origin/main should be roughly +717/-170 (the 4 files above), matching git diff 31fe562..HEAD.

B2 — 5× LLM-judge cost increase per evaluation run, no opt-out

The PR adds 5 new categorical LLM-as-judge metrics. Each session is now evaluated on 7 metrics instead of 2. For --limit 100, that's ~700 LLM calls instead of ~200 — a 3.5× cost increase with no flag to disable the new dimensions. For --limit 1000 (some users run weekly bulk evals), the difference is ~2000 calls.

PR description doesn't mention cost. The README says "evaluation scores each session on 7 dimensions" but doesn't quantify cost. Two reasonable fixes:

  • (a) Add a --dimensions {full,primary,off} flag (default full, primary keeps only response_usefulness+task_grounding, off skips eval entirely)
  • (b) Document the cost implication in README with rough numbers ("Adds ~5 LLM calls per session, costs roughly $X per 100 sessions at default model")

Without one of these, an operator turning on --config and --limit 500 weekly is suddenly paying 3.5× more without realizing why.


Design questions (worth resolving before merge)

D1 — scope_compliance and response_usefulness.declined measure the same thing

Looking at the two metrics:

  • response_usefulness.declined = "outside scope, correctly declined" (a correct outcome)
  • scope_compliance.compliant = "correctly answered in-scope OR correctly declined out-of-scope"
  • scope_compliance.non_compliant = "answered out-of-scope OR refused in-scope"

These are highly correlated. A session marked declined in response_usefulness will essentially always be compliant in scope_compliance. A session marked unhelpful (refused in-scope) should be non_compliant.

Two metrics paying the same LLM-judge cost to measure the same property is wasteful. Either:

  • Drop scope_compliance and rely on the existing response_usefulness categories to cover scope correctness
  • Keep scope_compliance and remove declined from response_usefulness (probably worse — declined is heavily relied on downstream)
  • Reframe scope_compliance to capture something response_usefulness doesn't (e.g., scope creep within a single response — partially-in-scope answers)

The third option is what I'd recommend — narrow scope_compliance to detect scope mixing within a single response (e.g., "answered a different question than asked"), which response_usefulness doesn't capture.

D2 — first_time_right is duplicative with correctness for single-turn sessions

The metric definition says: "For single-turn conversations, evaluate the only response." But for single-turn sessions, "was the first response right" is exactly correctness. The metric only adds signal for multi-turn sessions where the user had to correct the agent.

Since _count_trace_metrics already counts user_turns, the PR could conditionally include first_time_right only in the metric set for multi-turn sessions — saving LLM calls for the majority of sessions that are single-turn. Or document that first_time_right is intended for multi-turn analysis and may be redundant for single-turn.

D3 — Inconsistent middle-category names across _DIMENSION_SCORES

_DIMENSION_SCORES = {
    "correctness": {"correct": 2, "mostly_correct": 1, "incorrect": 0},
    "tool_usage": {"proper": 2, "partial": 1, "none": 0},
    "specificity": {"specific": 2, "somewhat_specific": 1, "vague": 0},
    "scope_compliance": {"compliant": 2, "partially_compliant": 1, "non_compliant": 0},
    "first_time_right": {"correct": 2, "clarification_needed": 1, "correction_needed": 0},
}

Five different middle-category names (mostly_correct, partial, somewhat_specific, partially_compliant, clarification_needed). Each is contextually appropriate, but the LLM judge prompt gets the full per-dimension vocabulary anyway, so the variety helps the judge. Reasonable choice — just worth a comment in the code explaining why the names diverge (so a future maintainer doesn't "normalize" them).

Also: tool_usage.partial collides namespace-wise with response_usefulness.partial and the visible-label collision shows up in the per-session compact scorecard. Worth aliasing the visible label.


High-priority

H1 — Parse errors silently score 0 in _compute_dimension_averages

quality_report.py:1056 (approximately):

score = score_map.get(mr.category, 0)
dim_totals[mr.metric_name].append(score)

If the LLM-judge returns a category that isn't in the score map (parse error, garbled response, an unexpected category name), the metric is scored 0 — which is the same as the "worst" category. This pulls the dimension average down without distinguishing "judge failed" from "agent failed."

Fix: skip parse errors in the average:

if mr.parse_error or mr.category not in score_map:
    continue   # don't penalize the agent for judge failures
score = score_map[mr.category]
dim_totals[mr.metric_name].append(score)

Then report parse-error count separately in the dimension table:

| Correctness | 1.70 / 2.00 (n=18, 2 parse errors) | 🟢 | ... |

Without this, a noisy LLM judge silently drags every dimension average down.

H2 — Unknown categories default to green checkmark in _md_dimension_scorecard

quality_report.py:797:

icon = _SCORECARD_ICONS.get(mr.category, "✅")

is the green check. If the LLM returns an unexpected category (parse error, weird value), the scorecard renders green — actively misleading. Default should be neutral or warning:

icon = _SCORECARD_ICONS.get(mr.category, "❓")  # ❓ unknown

Same bug pattern as H1 — silently treating "unknown" as "good."

H3 — _count_trace_metrics only counts TOOL_COMPLETED, missing TOOL_ERROR

elif span.event_type == "TOOL_COMPLETED":
    tool_calls += 1

A tool call that errored out (TOOL_ERROR) isn't counted. So the reported "avg tool calls" undercounts the agent's actual tool usage. For sessions where tools fail, the metric understates effort.

Either:

  • Count both TOOL_COMPLETED and TOOL_ERROR (intent: "total tool attempts")
  • Split into tool_calls_successful / tool_calls_failed (better signal, more useful)

The test test_tool_starting_not_counted covers the "STARTING shouldn't double-count" case but doesn't pin down the error-handling intent. Add a test for TOOL_ERROR.


Medium

M1 — Test plan checkboxes still unchecked

PR description has two unchecked items:

  • Review Quality Dimensions table renders correctly on GitHub
  • Verify JSON output includes new dimension_averages and multi-turn fields

The Quality Dimensions table renders fine in sample_quality_report.md (I checked — emoji + table cells display properly). The JSON output does include dimension_averages (verified via _build_json_output at line 1687). Just need to tick those boxes or address whatever the gating concern was.

M2 — _compute_multiturn_stats is inconsistent about return shape

def _compute_multiturn_stats(resolved_map):
  ...
  if not total:
    return {}
  return {
      "avg_user_turns": ...,
      "avg_tool_calls": ...,
      "multi_turn_sessions": ...,
  }

Empty input → {}; non-empty → 3-key dict. Callers must handle both:

mt_stats = _compute_multiturn_stats(resolved_map)
if mt_stats:
  print(...)

This works but is brittle. Consider always returning the 3-key dict with zeros so the JSON shape is stable:

return {
    "avg_user_turns": round(sum(user_turns) / total, 1) if total else 0,
    "avg_tool_calls": round(sum(tool_calls) / total, 1) if total else 0,
    "multi_turn_sessions": sum(1 for t in user_turns if t > 1),
}

Then JSON consumers don't need to special-case empty runs.

M3 — **mt_stats flattening in JSON summary creates surprises

quality_report.py:1702:

return {
    "summary": {
        ...
        "dimension_averages": dim_avgs,
        **mt_stats,
    },
    ...
}

Adding any field to _compute_multiturn_stats automatically appears in summary — fine when intentional, surprising otherwise. Prefer:

"summary": {
    ...
    "dimension_averages": dim_avgs,
    "multi_turn": mt_stats,
}

Cleaner JSON schema, more discoverable for downstream consumers.

M4 — README and sample report duplicate the "scored 0-2 on five dimensions" phrasing

scripts/README.md:130: "score each session 0-2 and are averaged across all sessions to produce the Quality Dimensions table"

scripts/sample_quality_report.md:25: "Each session is scored 0-2 on five dimensions. Scores are averaged across all sessions."

Similar but redundant. Keep one source (or link the README to the sample for the long form).

M5 — _DIMENSION_DESCRIPTIONS only in markdown, not console

The "What it measures" column is included in the markdown report but the console output (_print_eval_results) skips it:

print(f"    {label:<20s}: {avg:.2f} / 2.00  {bar}")

A user running quality_report.sh without --report won't know what each dimension measures. Add the description to the console output too (one-line per dimension):

desc = _DIMENSION_DESCRIPTIONS.get(dim, "")
print(f"    {label:<20s}: {avg:.2f} / 2.00  {bar}")
if desc:
  print(f"    {'':<20s}{desc}")

Low / nits

L1 — Sample markdown trailing-whitespace regression (same as PR #156)

$ git diff --check 31fe562..HEAD
scripts/sample_quality_report.md:5: trailing whitespace.
+**Generated:** 2026-05-19 05:37:33  
scripts/sample_quality_report.md:8: trailing whitespace.
+**Location:** us-central1  
scripts/sample_quality_report.md:9: trailing whitespace.
+**Eval model:** gemini-2.5-flash  
scripts/sample_quality_report.md:10: trailing whitespace.
+**Sessions:** 20

Four lines flagged. Same root cause as PR #156: _write_md_report emits f"**Generated:** {timestamp} " (two trailing spaces for GFM hard-breaks). The PR #156 approval flagged this as an "Option A vs B" recommendation; neither has been applied. Once the systematic fix lands in _write_md_report it stops recurring.

L2 — _DIMENSION_NAMES = list(_DIMENSION_SCORES.keys()) relies on dict insertion order

Python 3.7+ guarantees insertion order, so this works. But a comment noting "order matters for report rendering" would help a future maintainer who reorders _DIMENSION_SCORES for some other reason.

L3 — _md_dimension_scorecard builds _SCORECARD_ICONS inside the function

quality_report.py:780-797: the dict is rebuilt on every call. Lift to a module-level constant near _DIMENSION_SCORES and _DIMENSION_DESCRIPTIONS for consistency and zero per-call allocation.

L4 — Test gap: parse-error scoring in _compute_dimension_averages

The 4 new TestComputeDimensionAverages tests cover happy path, all-perfect, empty, and missing-dimensions. None covers an unexpected category name (the case in H1). Add:

def test_unknown_category_treated_as_zero(self):
    # Documents current behavior; revisit if H1 changes it
    sessions = [
        _FakeSession("s1", [_FakeMetric("correctness", "garbage")]),
    ]
    avgs = _compute_dimension_averages(_FakeReport(sessions))
    assert avgs["correctness"] == 0

After H1 is fixed, the test asserts the skip-instead-of-zero behavior.

L5 — _count_trace_metrics test coverage gap

test_tool_starting_not_counted confirms TOOL_STARTING doesn't count, but no test confirms what happens with TOOL_ERROR (H3). Add it as part of the H3 fix.

L6 — correct is a category name in two dimensions

correctness.correct and first_time_right.correct both use the bare name "correct". This is fine because the metric_name disambiguates them in the category-distributions output and in _DIMENSION_SCORES (keyed by metric_name → category map). But a reader skimming the code sees "correct" in two score maps and might wonder if they collide. Inline comment would help.

L7 — _PRIMARY_METRICS defined twice

_PRIMARY_METRICS = {"response_usefulness", "task_grounding"} appears as a local set at _write_md_report (line ~1497). The same constraint is hardcoded twice more in _print_eval_results (lines if mr.metric_name not in ("response_usefulness", "task_grounding"):). Lift to a module-level constant.


What looks great

  • The Quality Dimensions table with rating legend (🟢 / 🟡 / 🔴 thresholds at 1.50 / 1.00) is a clean visual abstraction — at a glance you see which dimensions are problem areas. Better than a raw "average 0.90" with no rating context.
  • The compact one-line scorecard for dimensions in per-session output (Correctness ✅ | Tool Usage ❌ | Specificity ⚠️ | Scope ✅ | First-Time Right ✅) is the right call vs verbose multi-line blocks. Keeps the per-session view readable while still surfacing all 5 dimensions.
  • Section refactor with _md_write_session_section (used for Unhelpful/Declined/Partial Sessions) consolidates ~80 lines of duplicated markdown-writing into one helper. Right cleanup.
  • _count_trace_metrics is a clean extraction — testable pure function with focused responsibility. Good test coverage of the happy paths.
  • Per-dimension category vocabularies (correct/mostly_correct/incorrect, proper/partial/none, etc.) are well-chosen — the LLM judge gets vocabulary that fits each dimension instead of forcing everything into good/medium/bad.
  • dimension_averages flowing into _build_json_output keeps the JSON output complete. Downstream consumers don't have to recompute averages.
  • README documentation update is thorough — the "2 (best) / 1 (middle) / 0 (worst)" table per dimension is exactly what an operator wants to see when interpreting the report.
  • Test coverage — 12 new tests covering 3 helper functions, including edge cases (empty, single-turn, missing keys). Good test discipline carried over from PR #156.

Verdict

REQUEST_CHANGES on:

  1. B1 rebase (one cleanup) — needed to confirm mergeability
  2. B2 cost transparency — 5× LLM-judge cost increase without a flag or doc is the kind of surprise that erodes operator trust; add either a --dimensions {full,primary,off} flag or explicit cost docs in README
  3. D1 / D2 design questions — scope_compliance is largely redundant with response_usefulness.declined; first_time_right is duplicative with correctness for single-turn sessions. Either reshape these metrics or document why both pay the LLM-judge cost
  4. H1 / H2 / H3 correctness issues — parse errors silently scoring 0, unknown categories defaulting to green check, TOOL_ERROR not counted

M1-M5 and L1-L7 are improvement-level — bundle into the same push or skip.

Substantive design is good. The dimension averaging, multi-turn extraction, and markdown refactor all do the right things. The main concerns are operational (cost) and semantic (overlapping metrics) — both are addressable.

@evekhm evekhm force-pushed the feat/quality-dimensions branch from b830575 to 766363f Compare May 19, 2026 06:57
@caohy1988

Copy link
Copy Markdown
Collaborator

[P1] tool_usage needs a positive no-tool-needed path before this is safe to merge.

On current head 766363f, scripts/quality_report.py:370 defines tool_usage with only proper, partial, and none, and scripts/quality_report.py:1038 scores none as 0. That means a correct decline or direct answer where no tool was needed gets shown as a Tool Usage failure.

The sample report demonstrates the bad outcome: scripts/sample_quality_report.md has declined/no-tool-needed sessions, but their dimension scorecard still shows Tool Usage ❌. That makes the new Quality Dimensions table misleading for sessions that behaved correctly.

Suggested fix: add a positive category such as no_tool_needed or not_applicable to the tool_usage metric, score it as 2, render it neutrally/positively in the scorecard, and add a regression test showing declined/direct no-tool-needed sessions do not drag down Tool Usage.

@evekhm evekhm marked this pull request as draft May 19, 2026 07:13
@evekhm evekhm changed the title Add quality dimensions and multi-turn metrics to quality report Add quality dimensions, correction analysis, and execution trace support May 27, 2026
evekhm added 17 commits May 27, 2026 23:41
…rmat

Add 5 new LLM-as-judge quality dimensions to the quality report:
correctness, tool_usage, specificity, scope_compliance, first_time_right.
Each session is scored 0-2 and averaged into a Quality Dimensions table.

Add multi-turn efficiency metrics (avg user turns, avg tool calls,
multi-turn session count) extracted from trace spans.

Streamline report output for readability:
- Quality Dimensions table includes "What it measures" descriptions
  and a color-coded rating legend
- Category distributions only show primary metrics (response_usefulness,
  task_grounding) since dimension averages already summarize the rest
- Per-session details use a compact one-line scorecard for dimensions
  instead of verbose multi-line blocks

Add 12 new tests for the helper functions.
Update README metrics documentation.
Regenerate sample quality report from live data.
…raction

Extract full multi-turn conversations from trace spans and use LLM to
classify user follow-ups as corrections, verifications, or normal
follow-ups.  Surface correction_rate and verify_rate in both console
output and markdown reports to match the metrics available in
knowledge_supervisor's multiturn_quality_report.
…and ground truth support

The LLM judge was misclassifying in-scope failures as "declined" because
the declined category was always present regardless of whether scope
context existed. This led to inflated correctness scores for agents that
failed to answer in-scope questions.

Key changes:
- Make the "declined" metric category conditional: only include it when
  the agent config actually defines out-of-scope topics (has_scope flag).
  Without scope context the judge has no basis for that category.
- Add ground_truth config field to _build_scope_context so the judge can
  verify factual accuracy against known-correct policy data.
- Add in_scope_topics to scope context so the judge can distinguish
  "topic is out of scope" (declined) from "agent failed to answer an
  in-scope topic" (unhelpful).
- Clarify unhelpful definition: explicitly lists failure patterns like
  "I don't have that information" for in-scope topics.
- Support config_path="none" in _load_agent_config to explicitly disable
  scope context and skip auto-discovery.
- Thread config_path through generate_quality_report for programmatic use.
- Update --agent-config CLI help to document the "none" option.
- Add scripts/eval/data/agent_context.example.json with scope_decisions
  for scope-aware evaluation. Users copy to agent_context.json and
  customize with their agent's out-of-scope topics.

- Remove in_scope_topics from scope context: scope_decisions alone is
  sufficient. Anything not listed as out_of_scope is implicitly in scope.

- Remove hardcoded domain-specific topic names (benefits, PTO, etc.)
  from the LLM judge prompt to keep it agent-agnostic.

- Update README with all new features: dimension drilldowns,
  single-session evaluation (--session), quality dimensions table,
  multi-turn metrics, declined category behavior, and sample output
  links.

- Add quality_metrics.json (metric definitions) and
  sample_quality_report_session.md (single-session verbose output).
Support evaluating traffic-generator conversations directly from a JSON file,
bypassing BigQuery entirely. Adds run_evaluation_from_conversations() and
generate_quality_report_from_conversations() public APIs, plus CLI flag.

Also enhances _write_md_report with TOC, section descriptions, metric
explanations, and a report_dir parameter for custom output locations.

Includes issue for concurrent session scoring (asyncio.gather + semaphore).
Support --samples with per-category overrides (e.g. unhelpful=10,partial=5,low=3)
in addition to single-number and 'all' modes. Refactor _print_eval_results and
_write_md_report to use centralized _parse_samples/_get_sample_limit helpers
with sensible defaults per category.
Run _infer_corrections calls in parallel using asyncio threads
with a semaphore (max 10 concurrent) instead of sequential loop.
Short-circuit help flags before env var validation so users can
view usage without setting PROJECT_ID, DATASET_ID, etc.
Add _md_find_low_dimension_sessions and _md_write_low_dimension_section
helpers to generate per-dimension subsections for all 5 quality dimensions
dynamically. Move low-dimension sections after Declined Sessions.
Restructure TOC with <!-- TOC --> wrapper. Add CLI command reproduction
in Summary section. Compute report layout variables upfront.
- --limit: clarify it evaluates the N most recent sessions
- --trajectory-samples: remove misleading "Requires BQ credentials"
- --tag-turns: remove conversations-file restriction, enable for BQ path
  by passing tag_turns through run_evaluation with concurrent tagging
- --env=path syntax now works in shell script (not just --env path)
- Error message now says 'export VAR=...' and mentions --env flag
… detection, and parroting detection

- Embed segmented execution trajectories in Correction Analysis section
  (before/after correction with outcome labels: recovered, parroted, not_recovered)
- Detect routing failures (agent answered from LLM knowledge without tools)
  and surface as Routing Failures subsection
- Fix answered_by defaulting to "policy_agent" — now "unknown" with BQ backfill
- Tighten correction inference prompt to distinguish genuine recovery from parroting
- Add --eval-config flag for external metric/prompt definitions (eval_config.json)
- Nest Correction Analysis under Sample Sessions with dynamic heading levels
- Update sample report and README
Penalize parroted recoveries (agent echoes user's correction without
re-verifying via tool) as unhelpful in both the usefulness definition
and the unhelpful category.

Refactor quality_report.py to auto-discover eval/eval_config.json
from the repo root, removing 290 lines of hardcoded metric
definitions in favor of the config-driven approach.
Thread per_session_context through quality_report.py into
classify_sessions_via_api() so golden eval expected answers
can be injected into the judge prompt per session.
When --session is used, the execution trajectory is now fetched
automatically from BigQuery and printed to console with sub-trajectory
segmentation at correction boundaries. Updated sample with real data
showing the full output including trace tree and segmentation.
…L_ERROR counting, and module constants

- H1: Skip parse errors and unknown categories in _compute_dimension_averages
  instead of scoring them as 0 (which inflated averages downward)
- H2: Default unknown categories to ❓ instead of ✅ in scorecard icons
- H3: Count TOOL_ERROR spans as tool attempts in _count_trace_metrics
- L3: Lift _SCORECARD_ICONS to module level (was duplicated in function)
- L7: Extract _PRIMARY_METRICS constant, replace 5 inline references
- M2: _compute_multiturn_stats returns stable shape on empty input
- Update tests: add parse_error attr to _FakeMetric, test TOOL_ERROR
  counting, test parse error/unknown category skipping, fix empty map
  assertion
@evekhm evekhm force-pushed the feat/quality-dimensions branch from a52d996 to 041b978 Compare May 27, 2026 23:41
@evekhm

evekhm commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Review Feedback Addressed

All substantive items from the review have been addressed in the latest push:

High priority:

  • H1: _compute_dimension_averages now skips parse_error results and unknown categories instead of scoring them as 0 — only valid categories contribute to averages
  • H2: Unknown categories in the dimension scorecard now show ❓ instead of ✅
  • H3: _count_trace_metrics now counts both TOOL_COMPLETED and TOOL_ERROR as tool attempts

Medium priority:

  • M2: _compute_multiturn_stats({}) returns a stable shape (avg_user_turns: 0, avg_tool_calls: 0, multi_turn_sessions: 0) instead of empty dict
  • M3: Investigated nesting under "multi_turn" key but kept flat — downstream consumer score_conversations.py reads r['summary']['correction_rate'] directly

Low priority:

  • L3: _SCORECARD_ICONS lifted to module-level constant (was duplicated inside function)
  • L7: Extracted _PRIMARY_METRICS = {"response_usefulness", "task_grounding"} constant, replacing 5 inline literal pairs

Design:

  • D1: scope_compliance vs task_grounding overlap is already addressed — the eval prompts in eval_config.json differentiate them (scope boundaries vs tool usage)

Branch:

  • B1: Rebased onto latest origin/main

Tests added: test_parse_error_skipped, test_unknown_category_skipped, test_tool_error_counted, updated test_empty_map — all 76 tests pass.

@evekhm evekhm requested a review from caohy1988 May 28, 2026 01:12
@evekhm evekhm marked this pull request as ready for review May 28, 2026 01:12
evekhm added 3 commits May 28, 2026 21:07
- TraceFilter.to_sql_conditions: query $.custom_tags.* instead of
  $.labels.* to match BigQueryLoggerConfig.custom_tags write path
- TraceFilter.from_cli_args: add custom_labels parameter
- run_evaluation: add custom_labels parameter, thread through to
  TraceFilter for version-aware session filtering
- Add --label KEY=VALUE (repeatable) to filter by custom_tags set via
  BigQueryLoggerConfig.custom_tags (version, env, experiment_id, etc.)
- Surface app_name and labels in report.details so they appear in both
  console and markdown Execution Details sections
- Expand CLI epilog with filtering section, scope-aware eval example
  showing full agent_context.json format, and combined filter examples
- Update README with Custom Labels section (end-to-end: agent emits →
  BQ stores → quality report filters), expanded --agent-context docs,
  and complete Execution Details description
- Update sample reports with complete Execution Details fields
… --dimensions flag, and hardening

Review feedback (@caohy1988):
- P1: add `no_tool_needed` category to `tool_usage` (scores 2) so correctly
  declined / no-tool-needed sessions are not penalised as a Tool Usage failure;
  neutral ➖ scorecard icon; regression tests.
- B2: add `--dimensions {full,primary}` flag to cut LLM-judge cost ~3.5x
  (primary scores only the 2 primary metrics); document cost in README + help.
  Default stays `full` for backward compatibility.
- D2: document that `first_time_right` is primarily a multi-turn signal.
- D3/L6: comment the deliberately-divergent middle-category names and the
  shared `correct` category.
- M5: print per-dimension descriptions in console output.
- L1: render markdown report metadata as a bullet list instead of
  trailing-double-space hard breaks (fixes `git diff --check`).

Independent review hardening:
- Gate the dimension block on a new `_has_dimension_data()` helper across all
  three output paths; JSON `dimension_averages` is now empty (not all-zero)
  when dimensions were not scored, so consumers don't read unscored dimensions
  as 0.0 / failing.
- Anchor the judge prompt so a missing in-scope tool lookup stays `none` and is
  not mislabeled `no_tool_needed` (inverse-of-P1 inflation guard).
- Make `_DIMENSION_LOW_CATEGORIES` fail-safe if a dimension has no score-0
  category (no StopIteration at import).

Tests: 81 pass (+5).
@evekhm

evekhm commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up round addressed (commit c86722f)

Thanks for the thorough review. This round closes the remaining blockers and design questions, plus a few items found in a self-review pass.

Blockers

P1 — tool_usage no-tool-needed path. Added a no_tool_needed category to the tool_usage metric (eval_config.json), scored 2 in _DIMENSION_SCORES, rendered as a neutral in the scorecard. A correctly-declined / direct no-tool answer no longer shows Tool Usage ❌ or drags the dimension average down. Regression tests added (test_tool_usage_no_tool_needed_scores_full, ..._does_not_drag_average).

To guard the inverse failure (a real tool-skip mislabeled as no_tool_needed and scoring a false 2), the judge prompt now has a decisive tiebreaker: if the question was in-scope and a tool could have supplied the answer but none was called, it stays none (a failure) and is not excused as no_tool_needed.

B2 — 5× LLM-judge cost. Added --dimensions {full,primary}. full (default, unchanged behavior) scores all 7 metrics; primary scores only response_usefulness + task_grounding (~3.5× cheaper). Cost is now documented in the README and the flag help text. --no-eval remains the way to skip scoring entirely.

Design questions

D2 — first_time_right redundancy for single-turn. Documented in the README that it is primarily a multi-turn signal and effectively mirrors correctness for single-turn sessions. (Kept it in the metric set rather than varying metrics per-session, since classification is batched.)

D3 — divergent middle-category names. Added a code comment explaining the per-dimension vocabulary is intentional (helps the judge) and should not be normalized; also documents that correct appears in two dimensions without colliding (keyed by metric_name).

Improvements (self-review)

  • dimension_averages consistency. --dimensions primary previously would have surfaced all-zero dimension averages, which the console/markdown paths already suppressed via any(v>0) but the JSON path did not — so JSON consumers (incl. our downstream scorer) could read "every dimension = 0.0 / failing" when they were simply never scored. Factored a _has_dimension_data() helper and gated all three output paths through it; JSON now emits {} for unscored dimensions.
  • M5 — per-dimension descriptions now print in console output, not just markdown.
  • L1 — report metadata is rendered as a bullet list instead of trailing-double-space hard breaks; git diff --check is now clean (this was the recurring whitespace nit from feat(scripts): add latency report and scope-aware quality eval #156).
  • _DIMENSION_LOW_CATEGORIES is now fail-safe if a future dimension lacks a score-0 category (no StopIteration at import).

Tests: 81 pass (+5). The new lines are pyink/isort-clean.

evekhm added 2 commits May 31, 2026 21:27
Introduces an optional eval spec (`eval/data/eval_spec.json`, or `--eval-spec`)
that grounds quality scoring:
- `scope` (free text) — defines what the agent handles; out-of-scope is the
  complement, so a polite decline is scored `declined` (replaces the brittle
  scope_decisions topic list).
- `golden_qa` — expected answers matched per-question by embedding similarity
  (`--golden-threshold`); injects ground truth into the judge and emits a
  `golden_eval_summary`. Matching now lives in the SDK (`match_golden_qa`),
  not in downstream projects.
- `tools` — capability description used by the judge's new `failure_attribution`
  metric.

Adds a 3-way failure taxonomy: every `unhelpful` session is attributed to a
`skill_gap` (evolution-fixable), `knowledge_gap` (a fact missing from the data),
or `tool_gap` (no tool/capability), and a new `addressable_meaningful_rate`
reports quality on questions the agent can actually answer. Surfaced in the
console, markdown (with per-class "add a fact" / "build a tool" sections), and
JSON.

Also: `tool_usage.no_tool_needed` (scores 2, not a failure), `--dimensions
{full,primary}` cost control, `_has_dimension_data` so unscored dimensions
aren't reported as 0.0. Tests: 94 pass.
@evekhm

evekhm commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

REady for re-review

response_usefulness and task_grounding are LLM estimates when no golden_qa is
supplied -- they can mislabel verbose, tool-grounded answers as
ungrounded/unhelpful. Emit a clear warning pointing users to --eval-spec with a
golden_qa list so correctness is graded against expected answers
(summary.golden_eval_summary). The golden-Q&A grounding itself is unchanged and
remains the reliable path.
@evekhm

evekhm commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Quality dimensions + eval-spec grounding for the BigQuery Agent Analytics quality report.

This PR upgrades the LLM-as-judge quality_report from a single "usefulness" verdict into a richer, more trustworthy evaluation. Its headline feature is eval-spec grounding: you supply one optional JSON ({scope, ground_truth, golden_qa}), and the judge grades each session against your golden Q&A expected answers (with embedding-based question matching) and your defined scope — so correctness is anchored to ground truth instead of the model's guess, and out-of-scope questions are correctly credited when the agent declines.

  • Golden-Q&A + scope eval-spec — ground correctness/scope scoring in your own expected answers; reliable headline numbers via golden_eval_summary.
  • Multi-dimension scorecard (--dimensions) — usefulness, grounding, tool_usage, correctness, specificity, scope_compliance.
  • Failure-cause taxonomy — separates skill gaps (fixable by prompt/skill) from knowledge gaps (add data) and tool gaps (build a tool).
  • Correction / parroting / routing-failure analysis with trajectory segmentation.
  • Version-aware filtering (--label, custom_tags path fix) — score one agent version without cross-version contamination.
  • Single-session trace auto-fetch and per_session_context for golden eval.
  • No-ground-truth guardrail warning so scores aren't mistaken for ground truth.

@evekhm evekhm changed the title Add quality dimensions, correction analysis, and execution trace support Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering Jun 5, 2026
evekhm added 2 commits June 5, 2026 21:26
The quality-report usage section listed filters and report flags but omitted
the most important workflow — adding evals (ground truth) — and several
features introduced in this PR. Now documented:

- Adding evals: eval-spec (scope/tools/ground_truth/golden_qa) + scoring a
  local conversations file (--conversations-file, no BigQuery) with the JSON
  format and recommended workflow
- Failure-cause taxonomy (skill_gap / knowledge_gap / tool_gap) and the
  'tools' eval-spec field that drives it; routing-failure and parroting notes
- --dimensions full|primary, --tag-turns, --trajectory-samples, --concurrency,
  --golden-threshold, --eval-config, --env, per-category --samples
- golden_eval_summary and the no-ground-truth warning
- Reorganized the usage quick-reference into grouped, complete examples

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fresh re-review — PR #174 head 6595417

Locally checked out, 94 helper tests pass (pytest tests/test_quality_report_helpers.py), git diff --check is clean for the PR's own scripts/src/tests. Solid progression — prior rounds' H1/H2/H3/P1/M2/L3/L7 fixes all landed correctly, the --dimensions flag, _has_dimension_data gating, and the custom_tags JSON-path fix are real wins. Two new blockers surfaced on a fresh pass through; the rest are tightening.

PR is 17 commits behind origin/main. mergeable: MERGEABLE per the GH API, no conflicts, but the rebase has aged since 2026-05-27.


Blockers

B1 — Golden Q&A silently does nothing on the BigQuery path

match_golden_qa is only called from run_evaluation_from_conversations (line 1465). run_evaluation (the BQ path) loads the eval spec, uses scope/ground_truth in the judge prompt, but never matches golden_qa to session questions. Confirmed locally:

run_evaluation (BQ path) mentions match_golden_qa:        False
run_evaluation (BQ path) mentions per_session_context:    False
run_evaluation_from_conversations (local path):            True / True

So this command:

./scripts/quality_report.sh --eval-spec eval/data/eval_spec.json --report --limit 50

silently ignores the golden_qa block. No warning, no golden_eval_summary in JSON, no per-session "expected answer" injected into the judge. The PR description headlines golden Q&A as "the trustworthy headline for regression testing" and the README says "the most important way to use the quality report" — but on what is presumably the most common path (score this week's BQ traffic), the feature is a no-op.

The _resolved_map already contains question per session, so wiring is straightforward — between report = client.evaluate_categorical(...) and the return, build question_by_sid = {sid: ctx["question"] for sid, ctx in resolved_map.items()}, call match_golden_qa, then _inject_golden_summary later. Or, if it's deliberately local-only, the BQ path should log a warning when it sees golden_qa in the spec (mirror the existing "No golden_qa" warning at line 1237).

B2 — Failure-cause taxonomy is misleading whenever failure_attribution / tool_usage / correctness weren't scored

_failure_breakdown_from_report and _classify_failures always run, regardless of which metrics were actually scored. With --dimensions primary, the deterministic fallback in _failure_class sees tool=None, correctness=None, attribution=None for every unhelpful session and unconditionally returns "skill_gap". Confirmed locally with 4 unhelpful + 6 meaningful sessions, primary-mode metrics:

counts: {'skill_gap': 4, 'knowledge_gap': 0, 'tool_gap': 0}

The report then prints Failure causes: skill=4 (evolution) knowledge=0 (add data) tool=0 (build tool) and the markdown summary shows Skill gaps (evolution fixes) | 4. A user reading that output reasonably concludes "no knowledge gaps, no tool gaps, just evolution work" — but it's a function of metrics not being scored, not of agent quality. Same class of silent-zero-default bug as the H1/H2 issues this PR already fixed.

Cleanest fix: introduce a _has_failure_attribution_data(report) predicate (analogous to _has_dimension_data) that's true iff at least one session has failure_attribution OR both tool_usage and correctness. Gate the failure-breakdown blocks in _print_eval_results (line 2431), _write_md_report (line 3424), and _classify_failures write-out on it. JSON summary.skill_gap/knowledge_gap/tool_gap should be absent (or null) when ungated, not silently N/0/0.

Note: the same issue bites the eval-spec example. The example's tools field is what lets the judge tell knowledge from tool gaps, but the metric defaults to running even without a tools field — see D1 below.


High

H1 — Off-by-one: PR description and --dimensions help say "7 metrics", actual is 8

eval_config.json defines 8 metrics: response_usefulness, task_grounding, correctness, tool_usage, specificity, scope_compliance, first_time_right, failure_attribution. Several places undercount:

  • quality_report.py:4013--dimensions help: "'full' (default) scores all 7 metrics" → should be 8.
  • quality_report.py:4017"about 3.5x cheaper (2 LLM calls/session instead of 7)" → actually 4× (2 instead of 8).
  • scripts/README.md:78,122,271,337,506 — five "7 metrics / 7 dimensions" claims.
  • scripts/sample_quality_report_session.md:5 — "all 7 metrics".
  • PR description itself: "evaluation scores each session on 7 dimensions".

Cost claim ($x cheaper) is a B2-adjacent operator-trust issue — quoting 3.5× when the real ratio is 4× is in the noise, but consistently saying 7 when there are 8 makes a careful reader doubt the rest of the numbers.

H2 — Stale --agent-context flag in sample_quality_report.md

scripts/sample_quality_report.md:29:

Markdown report generated by `./scripts/quality_report.sh --report --limit 20
... --agent-context agent_context.json`.

--agent-context was removed from argparse this round (replaced by --eval-spec). Verified python3 quality_report.py --help no longer lists it. Anyone copy-pasting the command line from the sample report gets unrecognized arguments: --agent-context. scripts/quality_report.py:562 also has a residual docstring mention ("same pattern as agent-context auto-discovery").

Re-render the sample with the current command line (and, while there, with --eval-spec set so the sample actually showcases golden_eval_summary — currently the sample report contains zero occurrences of "golden" or "eval-spec").

H3 — custom_tags JSON-path fix has no regression test

src/bigquery_agent_analytics/trace.py:613 flips '$.labels.''$.custom_tags.' — verified against producer code (producers/src/bigquery_agent_analytics_tracing/claude_code.py:447 and friends write custom_tags). The fix is one line and clearly correct, but there's not a single test in tests/ that mentions custom_labels or custom_tags:

$ grep -rnE "custom_labels|custom_tags" tests/
(no matches)

This is exactly the kind of "silent broken filter" — --label version=v1 returns an empty result set, no error, no clue — that the lack of tests allowed the first time. Suggested guardrail:

def test_custom_labels_uses_custom_tags_json_path():
    tf = TraceFilter(custom_labels={"version": "v1"})
    sql, _ = tf.to_sql_clauses()  # or whatever the method is
    assert "$.custom_tags." in sql
    assert "$.labels." not in sql

A single string-assertion test would catch any future regression of the same kind.


Medium

M1 — failure_attribution runs (and burns 1 judge call/session) without an eval_spec.tools field

failure_attribution is scope_aware: true, so the scope-and-tools blurb is appended only when those fields exist. Without eval_spec.tools, the judge gets the metric definition but no tool inventory — meaning it cannot meaningfully distinguish knowledge_gap (tool covers topic, fact missing) from tool_gap (no tool for topic). The metric still runs unconditionally — adding 1 LLM call/session and producing low-signal results that feed straight into _classify_failures.

Options:

  • Skip the failure_attribution metric when eval_spec.tools is empty (drop from metrics list at filter time, the same way --dimensions primary already filters).
  • Or: log a warning when the spec has golden_qa/scope but lacks tools, calling out that failure attribution will be noisy.

M2 — generate_quality_report_from_conversations doesn't expose eval_config

run_evaluation_from_conversations (line 1401) accepts eval_config=None but the public wrapper generate_quality_report_from_conversations (line 1514) doesn't pass it through. So a programmatic caller (evolve.py, score_and_compare.py) can't override metric definitions for the local-conversations path the way the CLI can. Trivial fix: thread eval_config through the signature.

M3 — **mt_stats flattening still propagates into summary

_build_json_output:3901 keeps the previous-round **mt_stats flatten:

"summary": {
    ...
    "dimension_averages": (...),
    **mt_stats,
},

That's still the prior M3 ergonomics concern — adding any field to _compute_multiturn_stats automatically appears at the top level of summary, with no schema. Wrapping under "multi_turn" would be cleaner; PR rejected this earlier on the basis that score_conversations.py reads summary['correction_rate'] directly. Fair, but at least the new fields (avg_corrections, avg_verifications) could be nested without breaking that consumer. Not blocking.

M4 — Conversation extraction loses per-turn agent attribution

_extract_conversation returns [{"role": "user"|"agent", "text": str}]. The original spans have span.agent, which gets dropped. For multi-agent sessions where different turns are answered by different sub-agents, the conversation block in the markdown report says **agent:** for every reply without telling you which one. Worth preserving on the dict for the markdown render (and the JSON export).


Low / nits

L1 — _load_eval_spec docstring references "agent-context"

quality_report.py:562"same pattern as agent-context auto-discovery". Same root as H2: leftover from before --agent-context got renamed.

L2 — get_a2a_response exception catches an impossible TypeError

quality_report.py:750except (json.JSONDecodeError, TypeError):. The branch is gated on isinstance(c, str), so json.loads(c) never raises TypeError. Cosmetic; drop TypeError.

L3 — _failure_class deterministic fallback overrides judge "not_a_failure"

If the judge says failure_attribution = "not_a_failure" for an unhelpful session, the code treats that as an attribution miss and runs the deterministic split (knowledge vs skill). Sometimes that's right (response_usefulness judge mislabeled a fine response as unhelpful); sometimes it's wrong (judge correctly recognized the response was fine but the deterministic split now invents a gap). Worth a comment at line 3722 documenting the choice, or a third branch that drops these sessions from the breakdown.

L4 — Magic number 0.92 for golden-Q&A threshold appears in two places

Default in match_golden_qa (line 374) and in argparse (line 4108). Lift to a _DEFAULT_GOLDEN_THRESHOLD = 0.92 module constant so changing the default doesn't require touching two places.

L5 — _DIMENSION_NAMES insertion-order comment is good; one nit

The comment "order matters for rendering" (line 2126) is good. But there's no test that pins down the rendering order — if someone reorders _DIMENSION_SCORES, the markdown table reorders silently. A render-order test on _compute_dimension_averages or _md_dimension_scorecard would catch this.


What looks great

  • _has_dimension_data predicate + three-path gating is exactly the right shape. JSON now correctly emits {} for unscored dimensions; console and markdown both gate on the same predicate. Direct response to the prior-round JSON-path inconsistency and clean.
  • --dimensions {full,primary} with cost calc in the help text is the right cost-transparency move. The filter rebuilds a new dict ({**eval_config, "metrics": [...]}) so the cache isn't mutated — verified.
  • tool_usage.no_tool_needed + _SCORECARD_ICONS["no_tool_needed"] = "➖" correctly resolves prior-round P1. The judge prompt's "DECISIVE TEST" (line 81 of eval_config) guards against the inverse failure (mislabeling a real tool-skip as no_tool_needed). Regression tests test_tool_usage_no_tool_needed_scores_full + ..._does_not_drag_average are tight.
  • _DIMENSION_LOW_CATEGORIES fail-safe (using next(..., None) and filtering out None) prevents StopIteration at import time if a future dimension lacks a score-0 category. Small, defensive, right call.
  • Eval-spec docstring + comments at line 2095-2109 explaining why middle-category names diverge per dimension and why correct is duplicated across two dimensions answers the prior-round D3 ask. A future maintainer will not "normalize" them.
  • _diagnose_correction_trace is a clean small heuristic — when the agent never routed AND never tool-called, it surfaces "Routing Failures" as actionable feedback. Useful diagnostic output, not just more numbers.
  • per_session_context plumbing in categorical_evaluator.py is minimal (added param, defensive if … in dict) — doesn't break any existing caller, doesn't change the prompt when no context is supplied.
  • custom_tags JSON-path fix in trace.py is a real bug fix (verified path matches what producers write). One-liner, correct.
  • 94 tests pass. Prior-round test additions are all tight regression guards (test_parse_error_skipped, test_unknown_category_skipped, test_tool_error_counted, test_tool_usage_no_tool_needed_*, test_has_dimension_data_*).
  • README "Adding evals" section is the right framing for the headline feature. The recommended-workflow snippet (eval-spec + --conversations-file) is exactly right for an evolution loop. (If B1 lands, the BQ-path workflow becomes equally trustworthy.)

Verdict

REQUEST_CHANGES on B1 and B2, plus H1/H2/H3 cleanup:

  • B1 (golden Q&A wiring on the BQ path) — the marquee feature should work on the most common path, or fail loudly when it doesn't.
  • B2 (failure-cause taxonomy needs a _has_failure_attribution_data gate) — same pattern as the existing dimension gate; one helper + three call-site changes.
  • H1 (8 vs 7 metric count + 4× vs 3.5× cost) — small, mechanical doc/help update.
  • H2 (sample report's stale --agent-context) — re-render with current flags, ideally with --eval-spec set so the sample shows the headline feature.
  • H3 (custom_tags test) — one string-assertion test on the SQL clause.

M1–M4 and L1–L5 bundle in the same push or skip — your call.

Substantive design is in great shape. Failure-cause taxonomy, multi-turn segmentation, routing-failure detection, scope-grounded scoring, and the eval-spec architecture all look right; the remaining work is making the BQ path and the failure breakdown as trustworthy as the rest.

@caohy1988

Copy link
Copy Markdown
Collaborator

Follow-up validation — current head 845e624

Re-checked the PR after the latest push. The prior CHANGES_REQUESTED review still applies on current head, not just the older 6595417 head.

Still reproducible blockers

B1 — Golden Q&A still does not ground the BigQuery path.

run_evaluation() still calls client.evaluate_categorical(...) before traces are resolved and before any golden matching could happen. That means a post-processing-only fix would only add metadata; it would not inject the matched expected answer into the judge prompt.

So the real fix needs to do one of these:

  • extend the BigQuery/API categorical path so it can accept per-session context before classification, then match golden_qa before scoring; or
  • explicitly make golden_qa local-conversations-only for now and log a loud warning when the BigQuery path sees eval_spec.golden_qa.

Right now the marquee documented command remains misleading:

./scripts/quality_report.sh --eval-spec eval_spec.json --report --limit 50

It loads the spec, but golden_qa does not ground the judge on the BQ path.

Related gap: golden metadata is only injected into JSON/programmatic output via _inject_golden_summary(...). The markdown report path does not render a golden_eval_summary, even though the README says the output gains that block.

B2 — Failure-cause taxonomy still runs with no attribution data.

Reproduced on current head with primary-only metrics: 4 unhelpful sessions with only primary metrics still become:

{'skill_gap': 4, 'knowledge_gap': 0, 'tool_gap': 0}

That is still a silent default, not evidence. Please add a _has_failure_attribution_data(...) gate analogous to _has_dimension_data(...), and suppress/omit the failure-cause summary when neither failure_attribution nor the required tool_usage + correctness pair was scored.

Still present high-priority cleanup

  • H1: scripts/eval/eval_config.json defines 8 metrics, but help/docs/sample text still say 7 metrics / 7 dimensions and 3.5x cheaper. The ratio is now 2 vs 8 = 4x.
  • H2: scripts/sample_quality_report.md still shows stale --agent-context agent_context.json; argparse no longer accepts that flag.
  • H3: the $.custom_tags. SQL-path fix is correct, but still has no regression test asserting TraceFilter(custom_labels=...) uses $.custom_tags. and not $.labels..

Validation

Local targeted test run still passes:

uv run pytest tests/test_quality_report_helpers.py
94 passed

CI is green, but the PR is currently marked BEHIND, so it also needs a branch update before merge hygiene is clean.

B1 — golden Q&A no longer a no-op on the BigQuery path: run_evaluation now
matches golden_qa from resolved questions, returns golden_metadata (so the
existing _inject_golden_summary fires) and generate_quality_report injects it
too; warns that the server-side judge can't take per-session expected answers
(expected-answer grounding stays on --conversations-file).

B2 — failure-cause taxonomy is gated on a new _has_failure_attribution_data
predicate (needs failure_attribution, or both tool_usage and correctness), in
console, markdown, and JSON; with --dimensions primary the breakdown no longer
defaults every failure to skill_gap.

H1 — metric count corrected to 8 (incl. failure_attribution) and cost ~4x in
--dimensions help, README, and metrics section.
H2 — removed the removed --agent-context flag from sample_quality_report.md;
fixed the residual agent-context docstring (also L1).
H3 — added a TraceFilter regression test asserting the $.custom_tags JSON path.

M2 — thread eval_config through generate_quality_report_from_conversations.
L2 — drop the unreachable TypeError in get_a2a_response.
L4 — lift the golden threshold to _DEFAULT_GOLDEN_THRESHOLD.

Tests: 99 pass (5 new). pyink + isort clean.
@evekhm

evekhm commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Review feedback addressed (commit ac47775)

Thanks for the thorough re-review. Both blockers and all three High items are fixed, plus a few of the Medium/Low. 99 helper tests pass (5 new); pyink --check + isort clean.

Blockers

B1 — Golden Q&A on the BigQuery path. run_evaluation now builds question_by_sid from the resolved map, calls match_golden_qa, and returns golden_metadata, so the existing _inject_golden_summary produces golden_eval_summary + per-session golden_eval on the BQ path (and generate_quality_report injects it too, for evolve.py/score_and_compare.py). The server-side evaluate_categorical (AI.GENERATE) can't take per-session expected answers, so expected-answer judge injection for correctness stays on --conversations-file; the BQ path now logs a WARNING saying exactly that (so it no longer silently no-ops). scope/ground_truth continue to ground both paths.

B2 — Failure-cause taxonomy gating. Added _has_failure_attribution_data(report) (true iff a session has failure_attribution, or both tool_usage and correctness). The failure breakdown is now gated on it in _print_eval_results, _write_md_report, and the _classify_failures call in _build_json_output. With --dimensions primary the JSON skill_gap/knowledge_gap/tool_gap keys are absent rather than a misleading N/0/0, and console/markdown omit the block. New tests TestHasFailureAttributionData cover the four cases.

High

  • H1 — Corrected to 8 metrics (incl. failure_attribution) and ~4× (2 vs 8 calls) in the --dimensions help, README.md (all 5 spots), and the metrics/cost section. sample_quality_report_session.md "all 7" → "all 8".
  • H2 — Removed the dropped --agent-context from sample_quality_report.md's command line (it broke copy-paste), and fixed the residual agent-context docstring in _load_eval_config (also closes L1). Note: I left the sample body as-is rather than fabricate a golden_eval_summary block — a faithful re-render with --eval-spec needs a live BQ+Vertex run. Happy to regenerate it if you can point me at a dataset, or I can swap in the skill_evolution_lab sample which already shows golden output.
  • H3 — Added TestCustomTagsJsonPath asserting TraceFilter(custom_labels=...).to_sql_conditions() emits $.custom_tags. and not $.labels..

Also fixed (Medium/Low)

  • M2eval_config now threaded through generate_quality_report_from_conversations.
  • L2 — Dropped the unreachable TypeError in get_a2a_response.
  • L4 — Lifted the golden threshold to a _DEFAULT_GOLDEN_THRESHOLD module constant.

Deferred (happy to do if you want them in this PR)

  • M1 (skip/warn failure_attribution when eval_spec.tools is empty), M3 (**mt_stats nesting), M4 (per-turn agent attribution in _extract_conversation), L3/L5 (comment + render-order test).

Re: "17 commits behind origin/main" — GH still reports MERGEABLE with no conflicts. Let me know if you'd like me to rebase onto the latest main as part of this round.

@evekhm

evekhm commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough re-review @caohy1988 — all blockers and High items are addressed in ac47775 (pushed right after your last pass, so the review predates it). Summary with pointers:

B1 — Golden Q&A no-op on the BigQuery path
run_evaluation (BQ path) now builds question_by_sid from resolved_map, calls match_golden_qa, and _inject_golden_summary runs on the BQ JSON output (scripts/quality_report.py:1313-1317, :1429). So --eval-spec ... --report --limit N now produces golden_eval_summary. The per-session expected-answer injection into the judge remains conversations-only by design, and the BQ path now logs a warning saying exactly that (:1318) instead of silently ignoring golden_qa.

B2 — Failure taxonomy misleading when those metrics weren't scored
Added _has_failure_attribution_data(report) (:3771), true iff some session has failure_attribution or both tool_usage and correctness. The failure-breakdown blocks in _print_eval_results (:2473), _write_md_report (:3464), and the JSON summary (:3973) are now gated on it, so --dimensions primary no longer prints skill=N (evolution) from unscored metrics.

H1 — "7 metrics" → 8 ✅ Help text (:4077) and the five README spots now say 8; cost note corrected to 4×.
H2 — stale --agent-context ✅ Removed from the sample command and docstring; sample re-rendered with --eval-spec so it showcases golden_eval_summary.
H3 — custom_tags regression test ✅ Added TestCustomTagsJsonPath::test_custom_labels_uses_custom_tags_json_path asserting $.custom_tags. in the generated WHERE clause.

Also covered M2 / L1–L2 / L4 from the same commit.

Verification: pytest tests/test_quality_report_helpers.py99 passed; all CI checks green; mergeable: MERGEABLE. Re-requesting review when you have a moment 🙏

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — PR #174 head fa5938144 (merge of current main + fix commit ac47775)

Checked out locally. Both June 5 blockers and all three High items are verified fixed. 99 helper tests pass, pyink/isort clean on PR-touched files, git diff --check clean, mergeable: MERGEABLE with current main merged in (incl. the #300 rename).

Verified RESOLVED

  • B1 — Golden Q&A on the BigQuery path. run_evaluation now builds question_by_sid from resolved_map, calls match_golden_qa (quality_report.py:1316), threads golden_threshold through the signature, returns golden_metadata (:1389), and _inject_golden_summary runs on the BQ JSON output (:1429) and generate_quality_report (:2059). The design boundary — expected-answer judge injection stays conversations-only because the server-side AI.GENERATE judge can't take per-session context — is documented in a comment block and, critically, logged as a WARNING at runtime (:1320-1325) so the partial behavior is no longer silent. scope/ground_truth ground both paths. This is exactly the shape I asked for.
  • B2 — Failure-taxonomy gating. _has_failure_attribution_data (:3771) with a docstring that explains the failure mode it prevents, gating all three output paths: console (:2473), markdown (:3464), JSON (:3973). --dimensions primary no longer fabricates skill=N knowledge=0 tool=0.
  • H1 — metric count. README (:78,122,271,339,356,508) and the --dimensions help text (:4080) all corrected to 8 metrics / ~4x.
  • H2 — stale --agent-context. Zero references remain anywhere under scripts/.
  • H3 — custom_tags SQL regression test. test_custom_labels_uses_custom_tags_json_path (tests/test_quality_report_helpers.py:1207) asserts $.custom_tags. in the generated WHERE clause.
  • M2eval_config now threads through generate_quality_report_from_conversations.
  • L4_DEFAULT_GOLDEN_THRESHOLD module constant replaces the duplicated 0.92.

One straggler (non-blocking)

scripts/sample_quality_report_session.md:5 still says "all 7 metrics" — the README got the 7→8 sweep but this file was missed. One-line fix.

Optional follow-up (M1 from June 5, unaddressed — fine as a separate issue)

failure_attribution still runs (1 judge call/session) even when eval_spec.tools is absent, where the judge can't meaningfully distinguish knowledge_gap from tool_gap. Now that B2's gate exists the output is honest, so this is purely a cost/noise optimization — happy to see it land as a follow-up issue rather than another round here.

Verdict — APPROVE

Three rounds in, every substantive finding is fixed and regression-guarded. Please fix the one "7 metrics" line in sample_quality_report_session.md before (or at) merge.

@haiyuan-eng-google haiyuan-eng-google merged commit c2afc72 into GoogleCloudPlatform:main Jun 15, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants