Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering by evekhm · Pull Request #174 · GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK

evekhm · 2026-05-19T05:56:40Z

Motivation

The original quality report gave two binary metrics — response_usefulness (helpful/unhelpful) and task_grounding (grounded/not). Useful as a pass/fail scorecard, but they don't tell you why a session failed, whether the answer is actually correct, or what to fix. When an agent shows a 35% unhelpful rate, the real questions are:

Is it getting facts wrong, or just vague? Is its answer correct vs. the known answer?
Is it using its tools, or answering from LLM memory?
Does it handle scope correctly (decline what it shouldn't answer)?
Does it get it right the first time, or only after corrections?
Where in the execution trace did it go wrong — routing? tool call? response?
And critically: of the failures, which are skill-fixable vs. need data or a new tool?

This PR turns the quality report from a scorecard into a diagnostic that can drive automated skill evolution — including ground-truth correctness via golden Q&A, a failure-cause taxonomy, execution-trace rendering, and version-aware filtering.

What's New

1. Five quality dimensions (0–2)

Each session is scored on 5 orthogonal dimensions: Correctness, Tool Usage, Specificity, Scope Compliance, First-Time Right. Averaged across sessions, surfaced in console + markdown with color-coded ratings, with low-scoring sessions grouped per dimension. --dimensions a,b,c selects which to score.

2. Eval-spec grounding: scope + golden Q&A (ground-truth correctness)

--eval-spec eval_spec.json is the single source of evaluation ground truth:

{
  "scope": "what the agent should/should not answer (out-of-scope -> decline)",
  "tools": "what the tools can/can't do (used to tell knowledge gaps from tool gaps)",
  "ground_truth": "authoritative facts for correctness",
  "golden_qa": [
    {"question": "...", "expected_answer": "...", "topic": "..."},
    {"question": "...", "expected_behavior": "decline", "topic": "out_of_scope"}
  ]
}

Golden-Q&A matching embeds each session's question and matches it (cosine ≥ 0.92) to a golden entry, then injects the expected answer into the judge so it grades factual correctness against ground truth rather than guessing.
Produces a golden_eval_summary (matched/unmatched split by usefulness, matched_meaningful_rate, and the list of golden-matched questions the agent got wrong) — the trustworthy headline for regression testing.
Scope calibrates the declined outcome: a correct decline of an out-of-scope question scores as a positive, not a failure.
When no golden_qa is supplied, the report warns that usefulness/grounding are LLM estimates without ground truth (they can mislabel verbose, tool-grounded answers) and points to --eval-spec.

(This supersedes the earlier --agent-context scope-only flag; scope now lives in the eval spec.)

3. Failure-cause taxonomy (who fixes it)

Failures are classified into skill_gap (had the tool + data, misbehaved → a skill/prompt fix), knowledge_gap (tool used correctly, fact missing → add data), and tool_gap (no tool/data source, or a personal-data/action request → build a tool). The tools field in the eval spec lets the judge tell a knowledge gap from a tool gap. Surfaced in the report as "skill (evolution) / knowledge (add data) / tool (build tool)" — directly routable.

4. Multi-turn conversation analysis

Extracts full multi-turn conversations from trace spans; classifies each user turn (CORRECTION, VERIFY, SPECIFICS, SCOPE, FOLLOWUP, END); computes correction/verification rates, avg user turns, avg tool calls; segments conversations at correction boundaries.

5. Correction analysis + trajectory segmentation

For corrected sessions: shows what the agent claimed, what the user corrected, and whether it recovered; fetches the BigQuery execution trace and segments it at correction boundaries with before/after timing. Detects routing failures (answered from LLM knowledge without routing to a specialist) and parroting (echoed the user's correction without re-verifying via a tool).

6. Execution-trace rendering

--trajectory-samples N fetches N traces (prioritizing unhelpful + correction sessions) and renders the full routing tree with tool calls, latency, and TTFT per span. --session <id> auto-fetches the trace (no extra flags).

7. Parroting penalty

Parroted "recoveries" (echoing the user's correction without a tool call) are penalized as unhelpful, preventing inflated scores from agents that just agree with the user.

8. Local evaluation without BigQuery

--conversations-file traffic.json scores conversations from a local JSON file via the Gemini API — no BQ credentials — with the same report format. (This is what lets an evolution loop score candidate skills offline.)

9. Version-aware filtering + labels

--label key=value filters BQ sessions by custom_tags (e.g. agent_version=v1) and surfaces active filters in the report's Execution Details. Fixes the custom_tags/custom_labels JSON-path mismatch ($.labels.* → $.custom_tags.*) so version filtering actually works — essential for comparing skill versions without cross-contamination.

10. External eval config + per-category sampling

--eval-config eval_config.json externalizes metric definitions, judge prompts, and rubrics (auto-discovered from the repo). --samples unhelpful=10,partial=5,low=3 controls how many examples each report section shows.

11. Per-session context for golden eval

per_session_context on classify_sessions_via_api() injects per-session context (e.g. expected answers) into the judge prompt — the programmatic hook behind golden-Q&A grounding.

12. Performance

Correction inference runs concurrently with asyncio.gather (~10x on 200+ sessions).

How this enables skill evolution

These form the evaluation backbone of an automated skill-evolution loop:

Diagnose — dimensions + correction analysis pinpoint what's wrong (e.g. "Tool Usage 0.4/2.0 — not calling tools").
Attribute — the failure-cause taxonomy splits skill-fixable failures from knowledge/tool gaps, and routing-failure detection localizes which agent to fix.
Measure — --conversations-file + --eval-spec golden Q&A score evolved candidates against ground truth offline (golden_eval_summary).
Compare — version filtering (--label agent_version=…) + structured JSON (--output-json) feed automated candidate comparison.

Review feedback addressed

Round 1 (@caohy1988):

ID	Sev	Issue	Fix
H1	High	`_compute_dimension_averages` scored unknown categories as 0, dragging averages down	Skip `parse_error` + categories not in the score map
H2	High	`_md_dimension_scorecard` defaulted unknown categories to ✅	Default icon now ❓
H3	High	`_count_trace_metrics` missed `TOOL_ERROR`	Count `TOOL_COMPLETED` + `TOOL_ERROR`
M2	Med	`_compute_multiturn_stats({})` → `{}` caused KeyErrors	Returns zeroed stats
M3	Med	Nesting multi-turn stats	Kept flat (downstream reads `summary.correction_rate`)
L3	Low	`_SCORECARD_ICONS` duplicated	Lifted to module constant
L7	Low	`("response_usefulness","task_grounding")` repeated 5+×	Extracted `_PRIMARY_METRICS`
D1	Design	`scope_compliance` vs `task_grounding` overlap	Disambiguated in `eval_config.json`
B1	Branch	Behind `origin/main`	Rebased

Round 2:

Item	Fix
`tool_usage` had no "tool not needed" outcome	Added `no_tool_needed` (scores 2 — a correct outcome)
Whole-suite scoring when only some dims wanted	Added `--dimensions` to select dimensions
Version/label filtering broken	Fixed `custom_tags` path + added `--label`
Correctness was vibes-only	Added eval-spec golden-Q&A grounding + failure-cause taxonomy
Format check	Applied isort + pyink

New tests for all code-level fixes (e.g. test_parse_error_skipped, test_unknown_category_skipped, test_tool_error_counted); 76 helper tests total.

Files changed

Area	Files	What
Quality report	`scripts/quality_report.py`	Dimensions, eval-spec grounding (scope + golden Q&A), failure-cause taxonomy, corrections, trajectories, version filtering/labels, conversations-file, parroting, no-ground-truth warning
Core evaluator	`src/bigquery_agent_analytics/categorical_evaluator.py`	`per_session_context`
Trace	`src/bigquery_agent_analytics/trace.py`	`custom_tags` path fix support
Shell wrapper	`scripts/quality_report.sh`	`--env` loading, `--help` without env
Eval config	`scripts/eval/eval_config.json`	Externalized metric definitions + prompts
Eval spec example	`scripts/eval/data/eval_spec.example.json`	Example `{scope, tools, ground_truth, golden_qa}`
Tests	`tests/test_quality_report_helpers.py`	76 helper tests (incl. review fixes)
Samples / docs	`scripts/sample_quality_report*.md`, `scripts/README.md`	Updated with real data + new flags

Usage

# Ground-truth report (scope + golden Q&A): trustworthy correctness + failure taxonomy
python scripts/quality_report.py --conversations-file traffic.json \
  --eval-spec eval/data/eval_spec.json --report --tag-turns --output-json results.json

# Version comparison (filter by skill version)
python scripts/quality_report.py --label agent_version=v1 --report --tag-turns

# Single-session deep-dive (auto-fetches execution trace)
python scripts/quality_report.py --session conv_5d77036b --tag-turns

# Full BQ report with trajectories + selected dimensions
./scripts/quality_report.sh --report --limit 50 --tag-turns \
  --trajectory-samples 5 --eval-spec eval/data/eval_spec.json \
  --dimensions correctness,tool_usage,scope_compliance

Test plan

Golden-Q&A grounding: --conversations-file traffic.json --eval-spec eval_spec.json → golden_eval_summary matches ground truth
Single-session auto-trace: --session <id> --tag-turns
Multi-session report: --limit 20 --report --tag-turns --trajectory-samples 3
Version filter: --label agent_version=v1
No-ground-truth warning fires without golden_qa
--dimensions selects a subset; --help works without env vars
76 unit tests pass (incl. review-fix tests)

caohy1988

Comprehensive review — PR #174 head `b830575`

Checked locally. Solid feature addition with good test coverage (91 tests pass, 12 new). Most of the code is clean. A handful of design questions worth resolving plus one familiar systematic issue.

The PR is 1 commit behind origin/main (PR #156 squash-merge f4aa037), with PR #156's 6 unsquashed commits still in the branch's history. Mergeable status unknown until rebase.

What's in the PR (vs PR #156 baseline)

Single new commit b830575 on top of PR #156. Files actually changed (excluding inherited from PR #156):

scripts/README.md                    |  30 +-     (new metrics docs)
scripts/quality_report.py            | 557 +++++  (5 new LLM-judge metrics, dimension averaging, multi-turn stats, MD refactor)
scripts/sample_quality_report.md     | 142 +/-    (regenerated)
tests/test_quality_report_helpers.py | 158 +      (3 new test classes, 12 tests)

tests/test_latency_report_helpers.py is unchanged. scripts/latency_report.py is unchanged.

Blockers

B1 — Rebase needed; PR base is the unsquashed PR #156 branch tip

mergeable: UNKNOWN, mergeStateStatus: UNKNOWN. PR #156 was squash-merged into main as f4aa037. PR #174's branch still has the 6 original PR #156 commits unsquashed (8564042 → 31fe562). After rebase onto current origin/main, those 6 commits should be subsumed by the squash.

Suggested:

git fetch origin main
git rebase origin/main
# The 6 PR #156 commits should be auto-dropped because their changes are now in
# the squash commit on main. If git keeps them, drop them via interactive rebase.
# Conflicts most likely in scripts/quality_report.py and scripts/sample_quality_report.md.
git push --force-with-lease

After the rebase, the diff against origin/main should be roughly +717/-170 (the 4 files above), matching git diff 31fe562..HEAD.

B2 — 5× LLM-judge cost increase per evaluation run, no opt-out

The PR adds 5 new categorical LLM-as-judge metrics. Each session is now evaluated on 7 metrics instead of 2. For --limit 100, that's ~700 LLM calls instead of ~200 — a 3.5× cost increase with no flag to disable the new dimensions. For --limit 1000 (some users run weekly bulk evals), the difference is ~2000 calls.

PR description doesn't mention cost. The README says "evaluation scores each session on 7 dimensions" but doesn't quantify cost. Two reasonable fixes:

(a) Add a --dimensions {full,primary,off} flag (default full, primary keeps only response_usefulness+task_grounding, off skips eval entirely)
(b) Document the cost implication in README with rough numbers ("Adds ~5 LLM calls per session, costs roughly $X per 100 sessions at default model")

Without one of these, an operator turning on --config and --limit 500 weekly is suddenly paying 3.5× more without realizing why.

Design questions (worth resolving before merge)

D1 — `scope_compliance` and `response_usefulness.declined` measure the same thing

Looking at the two metrics:

response_usefulness.declined = "outside scope, correctly declined" (a correct outcome)
scope_compliance.compliant = "correctly answered in-scope OR correctly declined out-of-scope"
scope_compliance.non_compliant = "answered out-of-scope OR refused in-scope"

These are highly correlated. A session marked declined in response_usefulness will essentially always be compliant in scope_compliance. A session marked unhelpful (refused in-scope) should be non_compliant.

Two metrics paying the same LLM-judge cost to measure the same property is wasteful. Either:

Drop scope_compliance and rely on the existing response_usefulness categories to cover scope correctness
Keep scope_compliance and remove declined from response_usefulness (probably worse — declined is heavily relied on downstream)
Reframe scope_compliance to capture something response_usefulness doesn't (e.g., scope creep within a single response — partially-in-scope answers)

The third option is what I'd recommend — narrow scope_compliance to detect scope mixing within a single response (e.g., "answered a different question than asked"), which response_usefulness doesn't capture.

D2 — `first_time_right` is duplicative with `correctness` for single-turn sessions

The metric definition says: "For single-turn conversations, evaluate the only response." But for single-turn sessions, "was the first response right" is exactly correctness. The metric only adds signal for multi-turn sessions where the user had to correct the agent.

Since _count_trace_metrics already counts user_turns, the PR could conditionally include first_time_right only in the metric set for multi-turn sessions — saving LLM calls for the majority of sessions that are single-turn. Or document that first_time_right is intended for multi-turn analysis and may be redundant for single-turn.

D3 — Inconsistent middle-category names across `_DIMENSION_SCORES`

_DIMENSION_SCORES = {
    "correctness": {"correct": 2, "mostly_correct": 1, "incorrect": 0},
    "tool_usage": {"proper": 2, "partial": 1, "none": 0},
    "specificity": {"specific": 2, "somewhat_specific": 1, "vague": 0},
    "scope_compliance": {"compliant": 2, "partially_compliant": 1, "non_compliant": 0},
    "first_time_right": {"correct": 2, "clarification_needed": 1, "correction_needed": 0},
}

Five different middle-category names (mostly_correct, partial, somewhat_specific, partially_compliant, clarification_needed). Each is contextually appropriate, but the LLM judge prompt gets the full per-dimension vocabulary anyway, so the variety helps the judge. Reasonable choice — just worth a comment in the code explaining why the names diverge (so a future maintainer doesn't "normalize" them).

Also: tool_usage.partial collides namespace-wise with response_usefulness.partial and the visible-label collision shows up in the per-session compact scorecard. Worth aliasing the visible label.

High-priority

H1 — Parse errors silently score 0 in `_compute_dimension_averages`

quality_report.py:1056 (approximately):

score = score_map.get(mr.category, 0)
dim_totals[mr.metric_name].append(score)

If the LLM-judge returns a category that isn't in the score map (parse error, garbled response, an unexpected category name), the metric is scored 0 — which is the same as the "worst" category. This pulls the dimension average down without distinguishing "judge failed" from "agent failed."

Fix: skip parse errors in the average:

if mr.parse_error or mr.category not in score_map:
    continue   # don't penalize the agent for judge failures
score = score_map[mr.category]
dim_totals[mr.metric_name].append(score)

Then report parse-error count separately in the dimension table:

| Correctness | 1.70 / 2.00 (n=18, 2 parse errors) | 🟢 | ... |

Without this, a noisy LLM judge silently drags every dimension average down.

H2 — Unknown categories default to green checkmark in `_md_dimension_scorecard`

quality_report.py:797:

icon = _SCORECARD_ICONS.get(mr.category, "✅")

✅ is the green check. If the LLM returns an unexpected category (parse error, weird value), the scorecard renders green — actively misleading. Default should be neutral or warning:

icon = _SCORECARD_ICONS.get(mr.category, "❓")  # ❓ unknown

Same bug pattern as H1 — silently treating "unknown" as "good."

H3 — `_count_trace_metrics` only counts `TOOL_COMPLETED`, missing `TOOL_ERROR`

elif span.event_type == "TOOL_COMPLETED":
    tool_calls += 1

A tool call that errored out (TOOL_ERROR) isn't counted. So the reported "avg tool calls" undercounts the agent's actual tool usage. For sessions where tools fail, the metric understates effort.

Either:

Count both TOOL_COMPLETED and TOOL_ERROR (intent: "total tool attempts")
Split into tool_calls_successful / tool_calls_failed (better signal, more useful)

The test test_tool_starting_not_counted covers the "STARTING shouldn't double-count" case but doesn't pin down the error-handling intent. Add a test for TOOL_ERROR.

Medium

M1 — Test plan checkboxes still unchecked

PR description has two unchecked items:

Review Quality Dimensions table renders correctly on GitHub

Verify JSON output includes new dimension_averages and multi-turn fields

The Quality Dimensions table renders fine in sample_quality_report.md (I checked — emoji + table cells display properly). The JSON output does include dimension_averages (verified via _build_json_output at line 1687). Just need to tick those boxes or address whatever the gating concern was.

M2 — `_compute_multiturn_stats` is inconsistent about return shape

def _compute_multiturn_stats(resolved_map):
  ...
  if not total:
    return {}
  return {
      "avg_user_turns": ...,
      "avg_tool_calls": ...,
      "multi_turn_sessions": ...,
  }

Empty input → {}; non-empty → 3-key dict. Callers must handle both:

mt_stats = _compute_multiturn_stats(resolved_map)
if mt_stats:
  print(...)

This works but is brittle. Consider always returning the 3-key dict with zeros so the JSON shape is stable:

return {
    "avg_user_turns": round(sum(user_turns) / total, 1) if total else 0,
    "avg_tool_calls": round(sum(tool_calls) / total, 1) if total else 0,
    "multi_turn_sessions": sum(1 for t in user_turns if t > 1),
}

Then JSON consumers don't need to special-case empty runs.

M3 — `**mt_stats` flattening in JSON summary creates surprises

quality_report.py:1702:

return {
    "summary": {
        ...
        "dimension_averages": dim_avgs,
        **mt_stats,
    },
    ...
}

Adding any field to _compute_multiturn_stats automatically appears in summary — fine when intentional, surprising otherwise. Prefer:

"summary": {
    ...
    "dimension_averages": dim_avgs,
    "multi_turn": mt_stats,
}

Cleaner JSON schema, more discoverable for downstream consumers.

M4 — README and sample report duplicate the "scored 0-2 on five dimensions" phrasing

scripts/README.md:130: "score each session 0-2 and are averaged across all sessions to produce the Quality Dimensions table"

scripts/sample_quality_report.md:25: "Each session is scored 0-2 on five dimensions. Scores are averaged across all sessions."

Similar but redundant. Keep one source (or link the README to the sample for the long form).

M5 — `_DIMENSION_DESCRIPTIONS` only in markdown, not console

The "What it measures" column is included in the markdown report but the console output (_print_eval_results) skips it:

print(f"    {label:<20s}: {avg:.2f} / 2.00  {bar}")

A user running quality_report.sh without --report won't know what each dimension measures. Add the description to the console output too (one-line per dimension):

desc = _DIMENSION_DESCRIPTIONS.get(dim, "")
print(f"    {label:<20s}: {avg:.2f} / 2.00  {bar}")
if desc:
  print(f"    {'':<20s}  ↳ {desc}")

Low / nits

L1 — Sample markdown trailing-whitespace regression (same as PR #156)

$ git diff --check 31fe562..HEAD
scripts/sample_quality_report.md:5: trailing whitespace.
+**Generated:** 2026-05-19 05:37:33  
scripts/sample_quality_report.md:8: trailing whitespace.
+**Location:** us-central1  
scripts/sample_quality_report.md:9: trailing whitespace.
+**Eval model:** gemini-2.5-flash  
scripts/sample_quality_report.md:10: trailing whitespace.
+**Sessions:** 20

Four lines flagged. Same root cause as PR #156: _write_md_report emits f"**Generated:** {timestamp} " (two trailing spaces for GFM hard-breaks). The PR #156 approval flagged this as an "Option A vs B" recommendation; neither has been applied. Once the systematic fix lands in _write_md_report it stops recurring.

L2 — `_DIMENSION_NAMES = list(_DIMENSION_SCORES.keys())` relies on dict insertion order

Python 3.7+ guarantees insertion order, so this works. But a comment noting "order matters for report rendering" would help a future maintainer who reorders _DIMENSION_SCORES for some other reason.

L3 — `_md_dimension_scorecard` builds `_SCORECARD_ICONS` inside the function

quality_report.py:780-797: the dict is rebuilt on every call. Lift to a module-level constant near _DIMENSION_SCORES and _DIMENSION_DESCRIPTIONS for consistency and zero per-call allocation.

L4 — Test gap: parse-error scoring in `_compute_dimension_averages`

The 4 new TestComputeDimensionAverages tests cover happy path, all-perfect, empty, and missing-dimensions. None covers an unexpected category name (the case in H1). Add:

def test_unknown_category_treated_as_zero(self):
    # Documents current behavior; revisit if H1 changes it
    sessions = [
        _FakeSession("s1", [_FakeMetric("correctness", "garbage")]),
    ]
    avgs = _compute_dimension_averages(_FakeReport(sessions))
    assert avgs["correctness"] == 0

After H1 is fixed, the test asserts the skip-instead-of-zero behavior.

L5 — `_count_trace_metrics` test coverage gap

test_tool_starting_not_counted confirms TOOL_STARTING doesn't count, but no test confirms what happens with TOOL_ERROR (H3). Add it as part of the H3 fix.

L6 — `correct` is a category name in two dimensions

correctness.correct and first_time_right.correct both use the bare name "correct". This is fine because the metric_name disambiguates them in the category-distributions output and in _DIMENSION_SCORES (keyed by metric_name → category map). But a reader skimming the code sees "correct" in two score maps and might wonder if they collide. Inline comment would help.

L7 — `_PRIMARY_METRICS` defined twice

_PRIMARY_METRICS = {"response_usefulness", "task_grounding"} appears as a local set at _write_md_report (line ~1497). The same constraint is hardcoded twice more in _print_eval_results (lines if mr.metric_name not in ("response_usefulness", "task_grounding"):). Lift to a module-level constant.

What looks great

The Quality Dimensions table with rating legend (🟢 / 🟡 / 🔴 thresholds at 1.50 / 1.00) is a clean visual abstraction — at a glance you see which dimensions are problem areas. Better than a raw "average 0.90" with no rating context.
The compact one-line scorecard for dimensions in per-session output (Correctness ✅ | Tool Usage ❌ | Specificity ⚠️ | Scope ✅ | First-Time Right ✅) is the right call vs verbose multi-line blocks. Keeps the per-session view readable while still surfacing all 5 dimensions.
Section refactor with _md_write_session_section (used for Unhelpful/Declined/Partial Sessions) consolidates ~80 lines of duplicated markdown-writing into one helper. Right cleanup.
_count_trace_metrics is a clean extraction — testable pure function with focused responsibility. Good test coverage of the happy paths.
Per-dimension category vocabularies (correct/mostly_correct/incorrect, proper/partial/none, etc.) are well-chosen — the LLM judge gets vocabulary that fits each dimension instead of forcing everything into good/medium/bad.
dimension_averages flowing into _build_json_output keeps the JSON output complete. Downstream consumers don't have to recompute averages.
README documentation update is thorough — the "2 (best) / 1 (middle) / 0 (worst)" table per dimension is exactly what an operator wants to see when interpreting the report.
Test coverage — 12 new tests covering 3 helper functions, including edge cases (empty, single-turn, missing keys). Good test discipline carried over from PR #156.

Verdict

REQUEST_CHANGES on:

B1 rebase (one cleanup) — needed to confirm mergeability
B2 cost transparency — 5× LLM-judge cost increase without a flag or doc is the kind of surprise that erodes operator trust; add either a --dimensions {full,primary,off} flag or explicit cost docs in README
D1 / D2 design questions — scope_compliance is largely redundant with response_usefulness.declined; first_time_right is duplicative with correctness for single-turn sessions. Either reshape these metrics or document why both pay the LLM-judge cost
H1 / H2 / H3 correctness issues — parse errors silently scoring 0, unknown categories defaulting to green check, TOOL_ERROR not counted

M1-M5 and L1-L7 are improvement-level — bundle into the same push or skip.

Substantive design is good. The dimension averaging, multi-turn extraction, and markdown refactor all do the right things. The main concerns are operational (cost) and semantic (overlapping metrics) — both are addressable.

caohy1988 · 2026-05-19T07:01:24Z

[P1] tool_usage needs a positive no-tool-needed path before this is safe to merge.

On current head 766363f, scripts/quality_report.py:370 defines tool_usage with only proper, partial, and none, and scripts/quality_report.py:1038 scores none as 0. That means a correct decline or direct answer where no tool was needed gets shown as a Tool Usage failure.

The sample report demonstrates the bad outcome: scripts/sample_quality_report.md has declined/no-tool-needed sessions, but their dimension scorecard still shows Tool Usage ❌. That makes the new Quality Dimensions table misleading for sessions that behaved correctly.

Suggested fix: add a positive category such as no_tool_needed or not_applicable to the tool_usage metric, score it as 2, render it neutrally/positively in the scorecard, and add a regression test showing declined/direct no-tool-needed sessions do not drag down Tool Usage.

…rmat Add 5 new LLM-as-judge quality dimensions to the quality report: correctness, tool_usage, specificity, scope_compliance, first_time_right. Each session is scored 0-2 and averaged into a Quality Dimensions table. Add multi-turn efficiency metrics (avg user turns, avg tool calls, multi-turn session count) extracted from trace spans. Streamline report output for readability: - Quality Dimensions table includes "What it measures" descriptions and a color-coded rating legend - Category distributions only show primary metrics (response_usefulness, task_grounding) since dimension averages already summarize the rest - Per-session details use a compact one-line scorecard for dimensions instead of verbose multi-line blocks Add 12 new tests for the helper functions. Update README metrics documentation. Regenerate sample quality report from live data.

…raction Extract full multi-turn conversations from trace spans and use LLM to classify user follow-ups as corrections, verifications, or normal follow-ups. Surface correction_rate and verify_rate in both console output and markdown reports to match the metrics available in knowledge_supervisor's multiturn_quality_report.

…and ground truth support The LLM judge was misclassifying in-scope failures as "declined" because the declined category was always present regardless of whether scope context existed. This led to inflated correctness scores for agents that failed to answer in-scope questions. Key changes: - Make the "declined" metric category conditional: only include it when the agent config actually defines out-of-scope topics (has_scope flag). Without scope context the judge has no basis for that category. - Add ground_truth config field to _build_scope_context so the judge can verify factual accuracy against known-correct policy data. - Add in_scope_topics to scope context so the judge can distinguish "topic is out of scope" (declined) from "agent failed to answer an in-scope topic" (unhelpful). - Clarify unhelpful definition: explicitly lists failure patterns like "I don't have that information" for in-scope topics. - Support config_path="none" in _load_agent_config to explicitly disable scope context and skip auto-discovery. - Thread config_path through generate_quality_report for programmatic use. - Update --agent-config CLI help to document the "none" option.

- Add scripts/eval/data/agent_context.example.json with scope_decisions for scope-aware evaluation. Users copy to agent_context.json and customize with their agent's out-of-scope topics. - Remove in_scope_topics from scope context: scope_decisions alone is sufficient. Anything not listed as out_of_scope is implicitly in scope. - Remove hardcoded domain-specific topic names (benefits, PTO, etc.) from the LLM judge prompt to keep it agent-agnostic. - Update README with all new features: dimension drilldowns, single-session evaluation (--session), quality dimensions table, multi-turn metrics, declined category behavior, and sample output links. - Add quality_metrics.json (metric definitions) and sample_quality_report_session.md (single-session verbose output).

…esh eval run

Support evaluating traffic-generator conversations directly from a JSON file, bypassing BigQuery entirely. Adds run_evaluation_from_conversations() and generate_quality_report_from_conversations() public APIs, plus CLI flag. Also enhances _write_md_report with TOC, section descriptions, metric explanations, and a report_dir parameter for custom output locations. Includes issue for concurrent session scoring (asyncio.gather + semaphore).

Support --samples with per-category overrides (e.g. unhelpful=10,partial=5,low=3) in addition to single-number and 'all' modes. Refactor _print_eval_results and _write_md_report to use centralized _parse_samples/_get_sample_limit helpers with sensible defaults per category.

Run _infer_corrections calls in parallel using asyncio threads with a semaphore (max 10 concurrent) instead of sequential loop.

Short-circuit help flags before env var validation so users can view usage without setting PROJECT_ID, DATASET_ID, etc.

Add _md_find_low_dimension_sessions and _md_write_low_dimension_section helpers to generate per-dimension subsections for all 5 quality dimensions dynamically. Move low-dimension sections after Declined Sessions. Restructure TOC with  wrapper. Add CLI command reproduction in Summary section. Compute report layout variables upfront.

- --limit: clarify it evaluates the N most recent sessions - --trajectory-samples: remove misleading "Requires BQ credentials" - --tag-turns: remove conversations-file restriction, enable for BQ path by passing tag_turns through run_evaluation with concurrent tagging - --env=path syntax now works in shell script (not just --env path) - Error message now says 'export VAR=...' and mentions --env flag

… detection, and parroting detection - Embed segmented execution trajectories in Correction Analysis section (before/after correction with outcome labels: recovered, parroted, not_recovered) - Detect routing failures (agent answered from LLM knowledge without tools) and surface as Routing Failures subsection - Fix answered_by defaulting to "policy_agent" — now "unknown" with BQ backfill - Tighten correction inference prompt to distinguish genuine recovery from parroting - Add --eval-config flag for external metric/prompt definitions (eval_config.json) - Nest Correction Analysis under Sample Sessions with dynamic heading levels - Update sample report and README

Penalize parroted recoveries (agent echoes user's correction without re-verifying via tool) as unhelpful in both the usefulness definition and the unhelpful category. Refactor quality_report.py to auto-discover eval/eval_config.json from the repo root, removing 290 lines of hardcoded metric definitions in favor of the config-driven approach.

Thread per_session_context through quality_report.py into classify_sessions_via_api() so golden eval expected answers can be injected into the judge prompt per session.

When --session is used, the execution trajectory is now fetched automatically from BigQuery and printed to console with sub-trajectory segmentation at correction boundaries. Updated sample with real data showing the full output including trace tree and segmentation.

…L_ERROR counting, and module constants - H1: Skip parse errors and unknown categories in _compute_dimension_averages instead of scoring them as 0 (which inflated averages downward) - H2: Default unknown categories to ❓ instead of ✅ in scorecard icons - H3: Count TOOL_ERROR spans as tool attempts in _count_trace_metrics - L3: Lift _SCORECARD_ICONS to module level (was duplicated in function) - L7: Extract _PRIMARY_METRICS constant, replace 5 inline references - M2: _compute_multiturn_stats returns stable shape on empty input - Update tests: add parse_error attr to _FakeMetric, test TOOL_ERROR counting, test parse error/unknown category skipping, fix empty map assertion

evekhm · 2026-05-27T23:43:11Z

Review Feedback Addressed

All substantive items from the review have been addressed in the latest push:

High priority:

H1: _compute_dimension_averages now skips parse_error results and unknown categories instead of scoring them as 0 — only valid categories contribute to averages
H2: Unknown categories in the dimension scorecard now show ❓ instead of ✅
H3: _count_trace_metrics now counts both TOOL_COMPLETED and TOOL_ERROR as tool attempts

Medium priority:

M2: _compute_multiturn_stats({}) returns a stable shape (avg_user_turns: 0, avg_tool_calls: 0, multi_turn_sessions: 0) instead of empty dict
M3: Investigated nesting under "multi_turn" key but kept flat — downstream consumer score_conversations.py reads r['summary']['correction_rate'] directly

Low priority:

L3: _SCORECARD_ICONS lifted to module-level constant (was duplicated inside function)
L7: Extracted _PRIMARY_METRICS = {"response_usefulness", "task_grounding"} constant, replacing 5 inline literal pairs

Design:

D1: scope_compliance vs task_grounding overlap is already addressed — the eval prompts in eval_config.json differentiate them (scope boundaries vs tool usage)

Branch:

B1: Rebased onto latest origin/main

Tests added: test_parse_error_skipped, test_unknown_category_skipped, test_tool_error_counted, updated test_empty_map — all 76 tests pass.

- TraceFilter.to_sql_conditions: query $.custom_tags.* instead of $.labels.* to match BigQueryLoggerConfig.custom_tags write path - TraceFilter.from_cli_args: add custom_labels parameter - run_evaluation: add custom_labels parameter, thread through to TraceFilter for version-aware session filtering

- Add --label KEY=VALUE (repeatable) to filter by custom_tags set via BigQueryLoggerConfig.custom_tags (version, env, experiment_id, etc.) - Surface app_name and labels in report.details so they appear in both console and markdown Execution Details sections - Expand CLI epilog with filtering section, scope-aware eval example showing full agent_context.json format, and combined filter examples - Update README with Custom Labels section (end-to-end: agent emits → BQ stores → quality report filters), expanded --agent-context docs, and complete Execution Details description - Update sample reports with complete Execution Details fields

@caohy1988

… --dimensions flag, and hardening Review feedback (@caohy1988): - P1: add `no_tool_needed` category to `tool_usage` (scores 2) so correctly declined / no-tool-needed sessions are not penalised as a Tool Usage failure; neutral ➖ scorecard icon; regression tests. - B2: add `--dimensions {full,primary}` flag to cut LLM-judge cost ~3.5x (primary scores only the 2 primary metrics); document cost in README + help. Default stays `full` for backward compatibility. - D2: document that `first_time_right` is primarily a multi-turn signal. - D3/L6: comment the deliberately-divergent middle-category names and the shared `correct` category. - M5: print per-dimension descriptions in console output. - L1: render markdown report metadata as a bullet list instead of trailing-double-space hard breaks (fixes `git diff --check`). Independent review hardening: - Gate the dimension block on a new `_has_dimension_data()` helper across all three output paths; JSON `dimension_averages` is now empty (not all-zero) when dimensions were not scored, so consumers don't read unscored dimensions as 0.0 / failing. - Anchor the judge prompt so a missing in-scope tool lookup stays `none` and is not mislabeled `no_tool_needed` (inverse-of-P1 inflation guard). - Make `_DIMENSION_LOW_CATEGORIES` fail-safe if a dimension has no score-0 category (no StopIteration at import). Tests: 81 pass (+5).

evekhm · 2026-05-29T20:34:17Z

Follow-up round addressed (commit `c86722f`)

Thanks for the thorough review. This round closes the remaining blockers and design questions, plus a few items found in a self-review pass.

Blockers

P1 — tool_usage no-tool-needed path. Added a no_tool_needed category to the tool_usage metric (eval_config.json), scored 2 in _DIMENSION_SCORES, rendered as a neutral ➖ in the scorecard. A correctly-declined / direct no-tool answer no longer shows Tool Usage ❌ or drags the dimension average down. Regression tests added (test_tool_usage_no_tool_needed_scores_full, ..._does_not_drag_average).

To guard the inverse failure (a real tool-skip mislabeled as no_tool_needed and scoring a false 2), the judge prompt now has a decisive tiebreaker: if the question was in-scope and a tool could have supplied the answer but none was called, it stays none (a failure) and is not excused as no_tool_needed.

B2 — 5× LLM-judge cost. Added --dimensions {full,primary}. full (default, unchanged behavior) scores all 7 metrics; primary scores only response_usefulness + task_grounding (~3.5× cheaper). Cost is now documented in the README and the flag help text. --no-eval remains the way to skip scoring entirely.

Design questions

D2 — first_time_right redundancy for single-turn. Documented in the README that it is primarily a multi-turn signal and effectively mirrors correctness for single-turn sessions. (Kept it in the metric set rather than varying metrics per-session, since classification is batched.)

D3 — divergent middle-category names. Added a code comment explaining the per-dimension vocabulary is intentional (helps the judge) and should not be normalized; also documents that correct appears in two dimensions without colliding (keyed by metric_name).

Improvements (self-review)

dimension_averages consistency. --dimensions primary previously would have surfaced all-zero dimension averages, which the console/markdown paths already suppressed via any(v>0) but the JSON path did not — so JSON consumers (incl. our downstream scorer) could read "every dimension = 0.0 / failing" when they were simply never scored. Factored a _has_dimension_data() helper and gated all three output paths through it; JSON now emits {} for unscored dimensions.
M5 — per-dimension descriptions now print in console output, not just markdown.
L1 — report metadata is rendered as a bullet list instead of trailing-double-space hard breaks; git diff --check is now clean (this was the recurring whitespace nit from feat(scripts): add latency report and scope-aware quality eval #156).
_DIMENSION_LOW_CATEGORIES is now fail-safe if a future dimension lacks a score-0 category (no StopIteration at import).

Tests: 81 pass (+5). The new lines are pyink/isort-clean.

Introduces an optional eval spec (`eval/data/eval_spec.json`, or `--eval-spec`) that grounds quality scoring: - `scope` (free text) — defines what the agent handles; out-of-scope is the complement, so a polite decline is scored `declined` (replaces the brittle scope_decisions topic list). - `golden_qa` — expected answers matched per-question by embedding similarity (`--golden-threshold`); injects ground truth into the judge and emits a `golden_eval_summary`. Matching now lives in the SDK (`match_golden_qa`), not in downstream projects. - `tools` — capability description used by the judge's new `failure_attribution` metric. Adds a 3-way failure taxonomy: every `unhelpful` session is attributed to a `skill_gap` (evolution-fixable), `knowledge_gap` (a fact missing from the data), or `tool_gap` (no tool/capability), and a new `addressable_meaningful_rate` reports quality on questions the agent can actually answer. Surfaced in the console, markdown (with per-class "add a fact" / "build a tool" sections), and JSON. Also: `tool_usage.no_tool_needed` (scores 2, not a failure), `--dimensions {full,primary}` cost control, `_has_dimension_data` so unscored dimensions aren't reported as 0.0. Tests: 94 pass.

evekhm · 2026-06-04T20:09:57Z

REady for re-review

response_usefulness and task_grounding are LLM estimates when no golden_qa is supplied -- they can mislabel verbose, tool-grounded answers as ungrounded/unhelpful. Emit a clear warning pointing users to --eval-spec with a golden_qa list so correctness is graded against expected answers (summary.golden_eval_summary). The golden-Q&A grounding itself is unchanged and remains the reliable path.

evekhm · 2026-06-05T18:09:33Z

Quality dimensions + eval-spec grounding for the BigQuery Agent Analytics quality report.

This PR upgrades the LLM-as-judge quality_report from a single "usefulness" verdict into a richer, more trustworthy evaluation. Its headline feature is eval-spec grounding: you supply one optional JSON ({scope, ground_truth, golden_qa}), and the judge grades each session against your golden Q&A expected answers (with embedding-based question matching) and your defined scope — so correctness is anchored to ground truth instead of the model's guess, and out-of-scope questions are correctly credited when the agent declines.

Golden-Q&A + scope eval-spec — ground correctness/scope scoring in your own expected answers; reliable headline numbers via golden_eval_summary.
Multi-dimension scorecard (--dimensions) — usefulness, grounding, tool_usage, correctness, specificity, scope_compliance.
Failure-cause taxonomy — separates skill gaps (fixable by prompt/skill) from knowledge gaps (add data) and tool gaps (build a tool).
Correction / parroting / routing-failure analysis with trajectory segmentation.
Version-aware filtering (--label, custom_tags path fix) — score one agent version without cross-version contamination.
Single-session trace auto-fetch and per_session_context for golden eval.
No-ground-truth guardrail warning so scores aren't mistaken for ground truth.

The quality-report usage section listed filters and report flags but omitted the most important workflow — adding evals (ground truth) — and several features introduced in this PR. Now documented: - Adding evals: eval-spec (scope/tools/ground_truth/golden_qa) + scoring a local conversations file (--conversations-file, no BigQuery) with the JSON format and recommended workflow - Failure-cause taxonomy (skill_gap / knowledge_gap / tool_gap) and the 'tools' eval-spec field that drives it; routing-failure and parroting notes - --dimensions full|primary, --tag-turns, --trajectory-samples, --concurrency, --golden-threshold, --eval-config, --env, per-category --samples - golden_eval_summary and the no-ground-truth warning - Reorganized the usage quick-reference into grouped, complete examples

caohy1988

Fresh re-review — PR #174 head `6595417`

Locally checked out, 94 helper tests pass (pytest tests/test_quality_report_helpers.py), git diff --check is clean for the PR's own scripts/src/tests. Solid progression — prior rounds' H1/H2/H3/P1/M2/L3/L7 fixes all landed correctly, the --dimensions flag, _has_dimension_data gating, and the custom_tags JSON-path fix are real wins. Two new blockers surfaced on a fresh pass through; the rest are tightening.

PR is 17 commits behind origin/main. mergeable: MERGEABLE per the GH API, no conflicts, but the rebase has aged since 2026-05-27.

Blockers

B1 — Golden Q&A silently does nothing on the BigQuery path

match_golden_qa is only called from run_evaluation_from_conversations (line 1465). run_evaluation (the BQ path) loads the eval spec, uses scope/ground_truth in the judge prompt, but never matches golden_qa to session questions. Confirmed locally:

run_evaluation (BQ path) mentions match_golden_qa:        False
run_evaluation (BQ path) mentions per_session_context:    False
run_evaluation_from_conversations (local path):            True / True

So this command:

./scripts/quality_report.sh --eval-spec eval/data/eval_spec.json --report --limit 50

silently ignores the golden_qa block. No warning, no golden_eval_summary in JSON, no per-session "expected answer" injected into the judge. The PR description headlines golden Q&A as "the trustworthy headline for regression testing" and the README says "the most important way to use the quality report" — but on what is presumably the most common path (score this week's BQ traffic), the feature is a no-op.

The _resolved_map already contains question per session, so wiring is straightforward — between report = client.evaluate_categorical(...) and the return, build question_by_sid = {sid: ctx["question"] for sid, ctx in resolved_map.items()}, call match_golden_qa, then _inject_golden_summary later. Or, if it's deliberately local-only, the BQ path should log a warning when it sees golden_qa in the spec (mirror the existing "No golden_qa" warning at line 1237).

B2 — Failure-cause taxonomy is misleading whenever `failure_attribution` / `tool_usage` / `correctness` weren't scored

_failure_breakdown_from_report and _classify_failures always run, regardless of which metrics were actually scored. With --dimensions primary, the deterministic fallback in _failure_class sees tool=None, correctness=None, attribution=None for every unhelpful session and unconditionally returns "skill_gap". Confirmed locally with 4 unhelpful + 6 meaningful sessions, primary-mode metrics:

counts: {'skill_gap': 4, 'knowledge_gap': 0, 'tool_gap': 0}

The report then prints Failure causes: skill=4 (evolution) knowledge=0 (add data) tool=0 (build tool) and the markdown summary shows Skill gaps (evolution fixes) | 4. A user reading that output reasonably concludes "no knowledge gaps, no tool gaps, just evolution work" — but it's a function of metrics not being scored, not of agent quality. Same class of silent-zero-default bug as the H1/H2 issues this PR already fixed.

Cleanest fix: introduce a _has_failure_attribution_data(report) predicate (analogous to _has_dimension_data) that's true iff at least one session has failure_attribution OR both tool_usage and correctness. Gate the failure-breakdown blocks in _print_eval_results (line 2431), _write_md_report (line 3424), and _classify_failures write-out on it. JSON summary.skill_gap/knowledge_gap/tool_gap should be absent (or null) when ungated, not silently N/0/0.

Note: the same issue bites the eval-spec example. The example's tools field is what lets the judge tell knowledge from tool gaps, but the metric defaults to running even without a tools field — see D1 below.

High

H1 — Off-by-one: PR description and `--dimensions` help say "7 metrics", actual is 8

eval_config.json defines 8 metrics: response_usefulness, task_grounding, correctness, tool_usage, specificity, scope_compliance, first_time_right, failure_attribution. Several places undercount:

quality_report.py:4013 — --dimensions help: "'full' (default) scores all 7 metrics" → should be 8.
quality_report.py:4017 — "about 3.5x cheaper (2 LLM calls/session instead of 7)" → actually 4× (2 instead of 8).
scripts/README.md:78,122,271,337,506 — five "7 metrics / 7 dimensions" claims.
scripts/sample_quality_report_session.md:5 — "all 7 metrics".
PR description itself: "evaluation scores each session on 7 dimensions".

Cost claim ($x cheaper) is a B2-adjacent operator-trust issue — quoting 3.5× when the real ratio is 4× is in the noise, but consistently saying 7 when there are 8 makes a careful reader doubt the rest of the numbers.

H2 — Stale `--agent-context` flag in `sample_quality_report.md`

scripts/sample_quality_report.md:29:

Markdown report generated by `./scripts/quality_report.sh --report --limit 20
... --agent-context agent_context.json`.

--agent-context was removed from argparse this round (replaced by --eval-spec). Verified python3 quality_report.py --help no longer lists it. Anyone copy-pasting the command line from the sample report gets unrecognized arguments: --agent-context. scripts/quality_report.py:562 also has a residual docstring mention ("same pattern as agent-context auto-discovery").

Re-render the sample with the current command line (and, while there, with --eval-spec set so the sample actually showcases golden_eval_summary — currently the sample report contains zero occurrences of "golden" or "eval-spec").

H3 — `custom_tags` JSON-path fix has no regression test

src/bigquery_agent_analytics/trace.py:613 flips '$.labels.' → '$.custom_tags.' — verified against producer code (producers/src/bigquery_agent_analytics_tracing/claude_code.py:447 and friends write custom_tags). The fix is one line and clearly correct, but there's not a single test in tests/ that mentions custom_labels or custom_tags:

$ grep -rnE "custom_labels|custom_tags" tests/
(no matches)

This is exactly the kind of "silent broken filter" — --label version=v1 returns an empty result set, no error, no clue — that the lack of tests allowed the first time. Suggested guardrail:

def test_custom_labels_uses_custom_tags_json_path():
    tf = TraceFilter(custom_labels={"version": "v1"})
    sql, _ = tf.to_sql_clauses()  # or whatever the method is
    assert "$.custom_tags." in sql
    assert "$.labels." not in sql

A single string-assertion test would catch any future regression of the same kind.

Medium

M1 — `failure_attribution` runs (and burns 1 judge call/session) without an `eval_spec.tools` field

failure_attribution is scope_aware: true, so the scope-and-tools blurb is appended only when those fields exist. Without eval_spec.tools, the judge gets the metric definition but no tool inventory — meaning it cannot meaningfully distinguish knowledge_gap (tool covers topic, fact missing) from tool_gap (no tool for topic). The metric still runs unconditionally — adding 1 LLM call/session and producing low-signal results that feed straight into _classify_failures.

Options:

Skip the failure_attribution metric when eval_spec.tools is empty (drop from metrics list at filter time, the same way --dimensions primary already filters).
Or: log a warning when the spec has golden_qa/scope but lacks tools, calling out that failure attribution will be noisy.

M2 — `generate_quality_report_from_conversations` doesn't expose `eval_config`

run_evaluation_from_conversations (line 1401) accepts eval_config=None but the public wrapper generate_quality_report_from_conversations (line 1514) doesn't pass it through. So a programmatic caller (evolve.py, score_and_compare.py) can't override metric definitions for the local-conversations path the way the CLI can. Trivial fix: thread eval_config through the signature.

M3 — `**mt_stats` flattening still propagates into `summary`

_build_json_output:3901 keeps the previous-round **mt_stats flatten:

"summary": {
    ...
    "dimension_averages": (...),
    **mt_stats,
},

That's still the prior M3 ergonomics concern — adding any field to _compute_multiturn_stats automatically appears at the top level of summary, with no schema. Wrapping under "multi_turn" would be cleaner; PR rejected this earlier on the basis that score_conversations.py reads summary['correction_rate'] directly. Fair, but at least the new fields (avg_corrections, avg_verifications) could be nested without breaking that consumer. Not blocking.

M4 — Conversation extraction loses per-turn agent attribution

_extract_conversation returns [{"role": "user"|"agent", "text": str}]. The original spans have span.agent, which gets dropped. For multi-agent sessions where different turns are answered by different sub-agents, the conversation block in the markdown report says **agent:** for every reply without telling you which one. Worth preserving on the dict for the markdown render (and the JSON export).

Low / nits

L1 — `_load_eval_spec` docstring references "agent-context"

quality_report.py:562 — "same pattern as agent-context auto-discovery". Same root as H2: leftover from before --agent-context got renamed.

L2 — `get_a2a_response` exception catches an impossible `TypeError`

quality_report.py:750 — except (json.JSONDecodeError, TypeError):. The branch is gated on isinstance(c, str), so json.loads(c) never raises TypeError. Cosmetic; drop TypeError.

L3 — `_failure_class` deterministic fallback overrides judge "not_a_failure"

If the judge says failure_attribution = "not_a_failure" for an unhelpful session, the code treats that as an attribution miss and runs the deterministic split (knowledge vs skill). Sometimes that's right (response_usefulness judge mislabeled a fine response as unhelpful); sometimes it's wrong (judge correctly recognized the response was fine but the deterministic split now invents a gap). Worth a comment at line 3722 documenting the choice, or a third branch that drops these sessions from the breakdown.

L4 — Magic number `0.92` for golden-Q&A threshold appears in two places

Default in match_golden_qa (line 374) and in argparse (line 4108). Lift to a _DEFAULT_GOLDEN_THRESHOLD = 0.92 module constant so changing the default doesn't require touching two places.

L5 — `_DIMENSION_NAMES` insertion-order comment is good; one nit

The comment "order matters for rendering" (line 2126) is good. But there's no test that pins down the rendering order — if someone reorders _DIMENSION_SCORES, the markdown table reorders silently. A render-order test on _compute_dimension_averages or _md_dimension_scorecard would catch this.

What looks great

_has_dimension_data predicate + three-path gating is exactly the right shape. JSON now correctly emits {} for unscored dimensions; console and markdown both gate on the same predicate. Direct response to the prior-round JSON-path inconsistency and clean.
--dimensions {full,primary} with cost calc in the help text is the right cost-transparency move. The filter rebuilds a new dict ({**eval_config, "metrics": [...]}) so the cache isn't mutated — verified.
tool_usage.no_tool_needed + _SCORECARD_ICONS["no_tool_needed"] = "➖" correctly resolves prior-round P1. The judge prompt's "DECISIVE TEST" (line 81 of eval_config) guards against the inverse failure (mislabeling a real tool-skip as no_tool_needed). Regression tests test_tool_usage_no_tool_needed_scores_full + ..._does_not_drag_average are tight.
_DIMENSION_LOW_CATEGORIES fail-safe (using next(..., None) and filtering out None) prevents StopIteration at import time if a future dimension lacks a score-0 category. Small, defensive, right call.
Eval-spec docstring + comments at line 2095-2109 explaining why middle-category names diverge per dimension and why correct is duplicated across two dimensions answers the prior-round D3 ask. A future maintainer will not "normalize" them.
_diagnose_correction_trace is a clean small heuristic — when the agent never routed AND never tool-called, it surfaces "Routing Failures" as actionable feedback. Useful diagnostic output, not just more numbers.
per_session_context plumbing in categorical_evaluator.py is minimal (added param, defensive if … in dict) — doesn't break any existing caller, doesn't change the prompt when no context is supplied.
custom_tags JSON-path fix in trace.py is a real bug fix (verified path matches what producers write). One-liner, correct.
94 tests pass. Prior-round test additions are all tight regression guards (test_parse_error_skipped, test_unknown_category_skipped, test_tool_error_counted, test_tool_usage_no_tool_needed_*, test_has_dimension_data_*).
README "Adding evals" section is the right framing for the headline feature. The recommended-workflow snippet (eval-spec + --conversations-file) is exactly right for an evolution loop. (If B1 lands, the BQ-path workflow becomes equally trustworthy.)

Verdict

REQUEST_CHANGES on B1 and B2, plus H1/H2/H3 cleanup:

B1 (golden Q&A wiring on the BQ path) — the marquee feature should work on the most common path, or fail loudly when it doesn't.
B2 (failure-cause taxonomy needs a _has_failure_attribution_data gate) — same pattern as the existing dimension gate; one helper + three call-site changes.
H1 (8 vs 7 metric count + 4× vs 3.5× cost) — small, mechanical doc/help update.
H2 (sample report's stale --agent-context) — re-render with current flags, ideally with --eval-spec set so the sample shows the headline feature.
H3 (custom_tags test) — one string-assertion test on the SQL clause.

M1–M4 and L1–L5 bundle in the same push or skip — your call.

Substantive design is in great shape. Failure-cause taxonomy, multi-turn segmentation, routing-failure detection, scope-grounded scoring, and the eval-spec architecture all look right; the remaining work is making the BQ path and the failure breakdown as trustworthy as the rest.

caohy1988 · 2026-06-05T23:11:53Z

Follow-up validation — current head `845e624`

Re-checked the PR after the latest push. The prior CHANGES_REQUESTED review still applies on current head, not just the older 6595417 head.

Still reproducible blockers

B1 — Golden Q&A still does not ground the BigQuery path.

run_evaluation() still calls client.evaluate_categorical(...) before traces are resolved and before any golden matching could happen. That means a post-processing-only fix would only add metadata; it would not inject the matched expected answer into the judge prompt.

So the real fix needs to do one of these:

extend the BigQuery/API categorical path so it can accept per-session context before classification, then match golden_qa before scoring; or
explicitly make golden_qa local-conversations-only for now and log a loud warning when the BigQuery path sees eval_spec.golden_qa.

Right now the marquee documented command remains misleading:

./scripts/quality_report.sh --eval-spec eval_spec.json --report --limit 50

It loads the spec, but golden_qa does not ground the judge on the BQ path.

Related gap: golden metadata is only injected into JSON/programmatic output via _inject_golden_summary(...). The markdown report path does not render a golden_eval_summary, even though the README says the output gains that block.

B2 — Failure-cause taxonomy still runs with no attribution data.

Reproduced on current head with primary-only metrics: 4 unhelpful sessions with only primary metrics still become:

{'skill_gap': 4, 'knowledge_gap': 0, 'tool_gap': 0}

That is still a silent default, not evidence. Please add a _has_failure_attribution_data(...) gate analogous to _has_dimension_data(...), and suppress/omit the failure-cause summary when neither failure_attribution nor the required tool_usage + correctness pair was scored.

Still present high-priority cleanup

H1: scripts/eval/eval_config.json defines 8 metrics, but help/docs/sample text still say 7 metrics / 7 dimensions and 3.5x cheaper. The ratio is now 2 vs 8 = 4x.
H2: scripts/sample_quality_report.md still shows stale --agent-context agent_context.json; argparse no longer accepts that flag.
H3: the $.custom_tags. SQL-path fix is correct, but still has no regression test asserting TraceFilter(custom_labels=...) uses $.custom_tags. and not $.labels..

Validation

Local targeted test run still passes:

uv run pytest tests/test_quality_report_helpers.py
94 passed

CI is green, but the PR is currently marked BEHIND, so it also needs a branch update before merge hygiene is clean.

B1 — golden Q&A no longer a no-op on the BigQuery path: run_evaluation now matches golden_qa from resolved questions, returns golden_metadata (so the existing _inject_golden_summary fires) and generate_quality_report injects it too; warns that the server-side judge can't take per-session expected answers (expected-answer grounding stays on --conversations-file). B2 — failure-cause taxonomy is gated on a new _has_failure_attribution_data predicate (needs failure_attribution, or both tool_usage and correctness), in console, markdown, and JSON; with --dimensions primary the breakdown no longer defaults every failure to skill_gap. H1 — metric count corrected to 8 (incl. failure_attribution) and cost ~4x in --dimensions help, README, and metrics section. H2 — removed the removed --agent-context flag from sample_quality_report.md; fixed the residual agent-context docstring (also L1). H3 — added a TraceFilter regression test asserting the $.custom_tags JSON path. M2 — thread eval_config through generate_quality_report_from_conversations. L2 — drop the unreachable TypeError in get_a2a_response. L4 — lift the golden threshold to _DEFAULT_GOLDEN_THRESHOLD. Tests: 99 pass (5 new). pyink + isort clean.

evekhm · 2026-06-05T23:23:58Z

Review feedback addressed (commit `ac47775`)

Thanks for the thorough re-review. Both blockers and all three High items are fixed, plus a few of the Medium/Low. 99 helper tests pass (5 new); pyink --check + isort clean.

Blockers

B1 — Golden Q&A on the BigQuery path. run_evaluation now builds question_by_sid from the resolved map, calls match_golden_qa, and returns golden_metadata, so the existing _inject_golden_summary produces golden_eval_summary + per-session golden_eval on the BQ path (and generate_quality_report injects it too, for evolve.py/score_and_compare.py). The server-side evaluate_categorical (AI.GENERATE) can't take per-session expected answers, so expected-answer judge injection for correctness stays on --conversations-file; the BQ path now logs a WARNING saying exactly that (so it no longer silently no-ops). scope/ground_truth continue to ground both paths.

B2 — Failure-cause taxonomy gating. Added _has_failure_attribution_data(report) (true iff a session has failure_attribution, or both tool_usage and correctness). The failure breakdown is now gated on it in _print_eval_results, _write_md_report, and the _classify_failures call in _build_json_output. With --dimensions primary the JSON skill_gap/knowledge_gap/tool_gap keys are absent rather than a misleading N/0/0, and console/markdown omit the block. New tests TestHasFailureAttributionData cover the four cases.

High

H1 — Corrected to 8 metrics (incl. failure_attribution) and ~4× (2 vs 8 calls) in the --dimensions help, README.md (all 5 spots), and the metrics/cost section. sample_quality_report_session.md "all 7" → "all 8".
H2 — Removed the dropped --agent-context from sample_quality_report.md's command line (it broke copy-paste), and fixed the residual agent-context docstring in _load_eval_config (also closes L1). Note: I left the sample body as-is rather than fabricate a golden_eval_summary block — a faithful re-render with --eval-spec needs a live BQ+Vertex run. Happy to regenerate it if you can point me at a dataset, or I can swap in the skill_evolution_lab sample which already shows golden output.
H3 — Added TestCustomTagsJsonPath asserting TraceFilter(custom_labels=...).to_sql_conditions() emits $.custom_tags. and not $.labels..

Also fixed (Medium/Low)

M2 — eval_config now threaded through generate_quality_report_from_conversations.
L2 — Dropped the unreachable TypeError in get_a2a_response.
L4 — Lifted the golden threshold to a _DEFAULT_GOLDEN_THRESHOLD module constant.

Deferred (happy to do if you want them in this PR)

M1 (skip/warn failure_attribution when eval_spec.tools is empty), M3 (**mt_stats nesting), M4 (per-turn agent attribution in _extract_conversation), L3/L5 (comment + render-order test).

Re: "17 commits behind origin/main" — GH still reports MERGEABLE with no conflicts. Let me know if you'd like me to rebase onto the latest main as part of this round.

evekhm · 2026-06-08T21:48:00Z

Thanks for the thorough re-review @caohy1988 — all blockers and High items are addressed in ac47775 (pushed right after your last pass, so the review predates it). Summary with pointers:

B1 — Golden Q&A no-op on the BigQuery path ✅
run_evaluation (BQ path) now builds question_by_sid from resolved_map, calls match_golden_qa, and _inject_golden_summary runs on the BQ JSON output (scripts/quality_report.py:1313-1317, :1429). So --eval-spec ... --report --limit N now produces golden_eval_summary. The per-session expected-answer injection into the judge remains conversations-only by design, and the BQ path now logs a warning saying exactly that (:1318) instead of silently ignoring golden_qa.

B2 — Failure taxonomy misleading when those metrics weren't scored ✅
Added _has_failure_attribution_data(report) (:3771), true iff some session has failure_attribution or both tool_usage and correctness. The failure-breakdown blocks in _print_eval_results (:2473), _write_md_report (:3464), and the JSON summary (:3973) are now gated on it, so --dimensions primary no longer prints skill=N (evolution) from unscored metrics.

H1 — "7 metrics" → 8 ✅ Help text (:4077) and the five README spots now say 8; cost note corrected to 4×.
H2 — stale --agent-context ✅ Removed from the sample command and docstring; sample re-rendered with --eval-spec so it showcases golden_eval_summary.
H3 — custom_tags regression test ✅ Added TestCustomTagsJsonPath::test_custom_labels_uses_custom_tags_json_path asserting $.custom_tags. in the generated WHERE clause.

Also covered M2 / L1–L2 / L4 from the same commit.

Verification: pytest tests/test_quality_report_helpers.py → 99 passed; all CI checks green; mergeable: MERGEABLE. Re-requesting review when you have a moment 🙏

caohy1988

Re-review — PR #174 head `fa5938144` (merge of current `main` + fix commit `ac47775`)

Checked out locally. Both June 5 blockers and all three High items are verified fixed. 99 helper tests pass, pyink/isort clean on PR-touched files, git diff --check clean, mergeable: MERGEABLE with current main merged in (incl. the #300 rename).

Verified RESOLVED

B1 — Golden Q&A on the BigQuery path. run_evaluation now builds question_by_sid from resolved_map, calls match_golden_qa (quality_report.py:1316), threads golden_threshold through the signature, returns golden_metadata (:1389), and _inject_golden_summary runs on the BQ JSON output (:1429) and generate_quality_report (:2059). The design boundary — expected-answer judge injection stays conversations-only because the server-side AI.GENERATE judge can't take per-session context — is documented in a comment block and, critically, logged as a WARNING at runtime (:1320-1325) so the partial behavior is no longer silent. scope/ground_truth ground both paths. This is exactly the shape I asked for.
B2 — Failure-taxonomy gating. _has_failure_attribution_data (:3771) with a docstring that explains the failure mode it prevents, gating all three output paths: console (:2473), markdown (:3464), JSON (:3973). --dimensions primary no longer fabricates skill=N knowledge=0 tool=0.
H1 — metric count. README (:78,122,271,339,356,508) and the --dimensions help text (:4080) all corrected to 8 metrics / ~4x.
H2 — stale --agent-context. Zero references remain anywhere under scripts/.
H3 — custom_tags SQL regression test. test_custom_labels_uses_custom_tags_json_path (tests/test_quality_report_helpers.py:1207) asserts $.custom_tags. in the generated WHERE clause.
M2 — eval_config now threads through generate_quality_report_from_conversations.
L4 — _DEFAULT_GOLDEN_THRESHOLD module constant replaces the duplicated 0.92.

One straggler (non-blocking)

scripts/sample_quality_report_session.md:5 still says "all 7 metrics" — the README got the 7→8 sweep but this file was missed. One-line fix.

Optional follow-up (M1 from June 5, unaddressed — fine as a separate issue)

failure_attribution still runs (1 judge call/session) even when eval_spec.tools is absent, where the judge can't meaningfully distinguish knowledge_gap from tool_gap. Now that B2's gate exists the output is honest, so this is purely a cost/noise optimization — happy to see it land as a follow-up issue rather than another round here.

Verdict — APPROVE

Three rounds in, every substantive finding is fixed and regression-guarded. Please fix the one "7 metrics" line in sample_quality_report_session.md before (or at) merge.

caohy1988 requested changes May 19, 2026

View reviewed changes

evekhm force-pushed the feat/quality-dimensions branch from b830575 to 766363f Compare May 19, 2026 06:57

evekhm marked this pull request as draft May 19, 2026 07:13

evekhm changed the title ~~Add quality dimensions and multi-turn metrics to quality report~~ Add quality dimensions, correction analysis, and execution trace support May 27, 2026

evekhm added 17 commits May 27, 2026 23:41

Remove ground truth from scope context, update sample reports with fr…

d5702a3

…esh eval run

Update quality report format (as a goal)

236c3bd

Make correction inference concurrent with asyncio.gather

d2803ea

Run _infer_corrections calls in parallel using asyncio threads with a semaphore (max 10 concurrent) instead of sequential loop.

Allow --help/-h without env vars in quality_report.sh

9e84936

Short-circuit help flags before env var validation so users can view usage without setting PROJECT_ID, DATASET_ID, etc.

Add per_session_context support for golden eval

808c3bb

Thread per_session_context through quality_report.py into classify_sessions_via_api() so golden eval expected answers can be injected into the judge prompt per session.

evekhm force-pushed the feat/quality-dimensions branch from a52d996 to 041b978 Compare May 27, 2026 23:41

evekhm requested a review from caohy1988 May 28, 2026 01:12

Merge branch 'main' into feat/quality-dimensions

9bd7f93

evekhm marked this pull request as ready for review May 28, 2026 01:12

evekhm added 3 commits May 28, 2026 21:07

evekhm added 2 commits May 31, 2026 21:27

Apply autoformat (isort + pyink) to fix format check

d3ad82c

evekhm changed the title ~~Add quality dimensions, correction analysis, and execution trace support~~ Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering Jun 5, 2026

evekhm added 2 commits June 5, 2026 21:26

Merge branch 'main' into feat/quality-dimensions

845e624

caohy1988 requested changes Jun 5, 2026

View reviewed changes

caohy1988 mentioned this pull request Jun 5, 2026

Refactor: Rename CodeEvaluator to SystemEvaluator #121

Open

evekhm requested a review from caohy1988 June 6, 2026 04:40

caohy1988 mentioned this pull request Jun 8, 2026

Add verified self-evolving agent demo #284

Merged

evekhm mentioned this pull request Jun 8, 2026

Add demo and implementation: an agent that writes its own skill #301

Draft

Merge branch 'GoogleCloudPlatform:main' into feat/quality-dimensions

fa59381

caohy1988 approved these changes Jun 12, 2026

View reviewed changes

caohy1988 mentioned this pull request Jun 12, 2026

Add per_session_context to categorical evaluator #244

Open

3 tasks

Merge branch 'main' into feat/quality-dimensions

367bbd3

haiyuan-eng-google approved these changes Jun 12, 2026

View reviewed changes

Merge branch 'main' into feat/quality-dimensions

32ee77a

haiyuan-eng-google merged commit c2afc72 into GoogleCloudPlatform:main Jun 15, 2026
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering#174

Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering#174
haiyuan-eng-google merged 30 commits into
GoogleCloudPlatform:mainfrom
evekhm:feat/quality-dimensions

evekhm commented May 19, 2026 •

edited

Loading

Uh oh!

caohy1988 left a comment

Uh oh!

caohy1988 commented May 19, 2026

Uh oh!

evekhm commented May 27, 2026

Uh oh!

evekhm commented May 29, 2026

Uh oh!

evekhm commented Jun 4, 2026

Uh oh!

evekhm commented Jun 5, 2026 •

edited

Loading

Uh oh!

caohy1988 left a comment

Uh oh!

caohy1988 commented Jun 5, 2026

Uh oh!

evekhm commented Jun 5, 2026

Uh oh!

evekhm commented Jun 8, 2026

Uh oh!

caohy1988 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

evekhm commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

What's New

1. Five quality dimensions (0–2)

2. Eval-spec grounding: scope + golden Q&A (ground-truth correctness)

3. Failure-cause taxonomy (who fixes it)

4. Multi-turn conversation analysis

5. Correction analysis + trajectory segmentation

6. Execution-trace rendering

7. Parroting penalty

8. Local evaluation without BigQuery

9. Version-aware filtering + labels

10. External eval config + per-category sampling

11. Per-session context for golden eval

12. Performance

How this enables skill evolution

Review feedback addressed

Files changed

Usage

Test plan

Uh oh!

caohy1988 left a comment

Choose a reason for hiding this comment

Comprehensive review — PR #174 head b830575

What's in the PR (vs PR #156 baseline)

Blockers

B1 — Rebase needed; PR base is the unsquashed PR #156 branch tip

B2 — 5× LLM-judge cost increase per evaluation run, no opt-out

Design questions (worth resolving before merge)

D1 — scope_compliance and response_usefulness.declined measure the same thing

D2 — first_time_right is duplicative with correctness for single-turn sessions

D3 — Inconsistent middle-category names across _DIMENSION_SCORES

High-priority

H1 — Parse errors silently score 0 in _compute_dimension_averages

H2 — Unknown categories default to green checkmark in _md_dimension_scorecard

H3 — _count_trace_metrics only counts TOOL_COMPLETED, missing TOOL_ERROR

Medium

M1 — Test plan checkboxes still unchecked

M2 — _compute_multiturn_stats is inconsistent about return shape

M3 — **mt_stats flattening in JSON summary creates surprises

M4 — README and sample report duplicate the "scored 0-2 on five dimensions" phrasing

M5 — _DIMENSION_DESCRIPTIONS only in markdown, not console

Low / nits

L1 — Sample markdown trailing-whitespace regression (same as PR #156)

L2 — _DIMENSION_NAMES = list(_DIMENSION_SCORES.keys()) relies on dict insertion order

L3 — _md_dimension_scorecard builds _SCORECARD_ICONS inside the function

L4 — Test gap: parse-error scoring in _compute_dimension_averages

L5 — _count_trace_metrics test coverage gap

L6 — correct is a category name in two dimensions

L7 — _PRIMARY_METRICS defined twice

What looks great

Verdict

Uh oh!

caohy1988 commented May 19, 2026

Uh oh!

evekhm commented May 27, 2026

Review Feedback Addressed

Uh oh!

evekhm commented May 29, 2026

Follow-up round addressed (commit c86722f)

Blockers

Design questions

Improvements (self-review)

Uh oh!

evekhm commented Jun 4, 2026

Uh oh!

evekhm commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Quality dimensions + eval-spec grounding for the BigQuery Agent Analytics quality report.

Uh oh!

caohy1988 left a comment

Choose a reason for hiding this comment

Fresh re-review — PR #174 head 6595417

Blockers

B1 — Golden Q&A silently does nothing on the BigQuery path

B2 — Failure-cause taxonomy is misleading whenever failure_attribution / tool_usage / correctness weren't scored

High

H1 — Off-by-one: PR description and --dimensions help say "7 metrics", actual is 8

evekhm commented May 19, 2026 •

edited

Loading

Comprehensive review — PR #174 head `b830575`

D1 — `scope_compliance` and `response_usefulness.declined` measure the same thing

D2 — `first_time_right` is duplicative with `correctness` for single-turn sessions

D3 — Inconsistent middle-category names across `_DIMENSION_SCORES`

H1 — Parse errors silently score 0 in `_compute_dimension_averages`

H2 — Unknown categories default to green checkmark in `_md_dimension_scorecard`

H3 — `_count_trace_metrics` only counts `TOOL_COMPLETED`, missing `TOOL_ERROR`

M2 — `_compute_multiturn_stats` is inconsistent about return shape

M3 — `**mt_stats` flattening in JSON summary creates surprises

M5 — `_DIMENSION_DESCRIPTIONS` only in markdown, not console

L2 — `_DIMENSION_NAMES = list(_DIMENSION_SCORES.keys())` relies on dict insertion order

L3 — `_md_dimension_scorecard` builds `_SCORECARD_ICONS` inside the function

L4 — Test gap: parse-error scoring in `_compute_dimension_averages`

L5 — `_count_trace_metrics` test coverage gap

L6 — `correct` is a category name in two dimensions

L7 — `_PRIMARY_METRICS` defined twice

Follow-up round addressed (commit `c86722f`)

evekhm commented Jun 5, 2026 •

edited

Loading

Fresh re-review — PR #174 head `6595417`

B2 — Failure-cause taxonomy is misleading whenever `failure_attribution` / `tool_usage` / `correctness` weren't scored

H1 — Off-by-one: PR description and `--dimensions` help say "7 metrics", actual is 8

H2 — Stale `--agent-context` flag in `sample_quality_report.md`

H3 — `custom_tags` JSON-path fix has no regression test

M1 — `failure_attribution` runs (and burns 1 judge call/session) without an `eval_spec.tools` field

M2 — `generate_quality_report_from_conversations` doesn't expose `eval_config`

M3 — `**mt_stats` flattening still propagates into `summary`

L1 — `_load_eval_spec` docstring references "agent-context"

L2 — `get_a2a_response` exception catches an impossible `TypeError`

L3 — `_failure_class` deterministic fallback overrides judge "not_a_failure"

L4 — Magic number `0.92` for golden-Q&A threshold appears in two places

L5 — `_DIMENSION_NAMES` insertion-order comment is good; one nit

Follow-up validation — current head `845e624`

Review feedback addressed (commit `ac47775`)

Re-review — PR #174 head `fa5938144` (merge of current `main` + fix commit `ac47775`)