Commit c2afc72

Quality report: dimensions, correction analysis, execution traces, golden-Q&A grounding, and version filtering (#174)
* Add quality dimensions, multi-turn metrics, and streamlined report format

Add 5 new LLM-as-judge quality dimensions to the quality report: correctness, tool_usage, specificity, scope_compliance, first_time_right. Each session is scored 0-2 and averaged into a Quality Dimensions table.

Add multi-turn efficiency metrics (avg user turns, avg tool calls, multi-turn session count) extracted from trace spans.

Streamline report output for readability:
- Quality Dimensions table includes "What it measures" descriptions and a color-coded rating legend
- Category distributions only show primary metrics (response_usefulness, task_grounding) since dimension averages already summarize the rest
- Per-session details use a compact one-line scorecard for dimensions instead of verbose multi-line blocks

Add 12 new tests for the helper functions. Update README metrics documentation. Regenerate sample quality report from live data.

* Add correction/verification inference and multi-turn conversation extraction

Extract full multi-turn conversations from trace spans and use an LLM to classify user follow-ups as corrections, verifications, or normal follow-ups. Surface correction_rate and verify_rate in both console output and markdown reports to match the metrics available in knowledge_supervisor's multiturn_quality_report.

* Improve scope-aware eval accuracy with conditional declined category and ground truth support

The LLM judge was misclassifying in-scope failures as "declined" because the declined category was always present regardless of whether scope context existed. This led to inflated correctness scores for agents that failed to answer in-scope questions.

Key changes:
- Make the "declined" metric category conditional: only include it when the agent config actually defines out-of-scope topics (has_scope flag). Without scope context the judge has no basis for that category.
- Add ground_truth config field to _build_scope_context so the judge can verify factual accuracy against known-correct policy data.
- Add in_scope_topics to scope context so the judge can distinguish "topic is out of scope" (declined) from "agent failed to answer an in-scope topic" (unhelpful).
- Clarify unhelpful definition: explicitly list failure patterns like "I don't have that information" for in-scope topics.
- Support config_path="none" in _load_agent_config to explicitly disable scope context and skip auto-discovery.
- Thread config_path through generate_quality_report for programmatic use.
- Update --agent-config CLI help to document the "none" option.

* Add scope config example, remove in_scope_topics, update README

- Add scripts/eval/data/agent_context.example.json with scope_decisions for scope-aware evaluation. Users copy to agent_context.json and customize with their agent's out-of-scope topics.
- Remove in_scope_topics from scope context: scope_decisions alone is sufficient. Anything not listed as out_of_scope is implicitly in scope.
- Remove hardcoded domain-specific topic names (benefits, PTO, etc.) from the LLM judge prompt to keep it agent-agnostic.
- Update README with all new features: dimension drilldowns, single-session evaluation (--session), quality dimensions table, multi-turn metrics, declined category behavior, and sample output links.
- Add quality_metrics.json (metric definitions) and sample_quality_report_session.md (single-session verbose output).

* Remove ground truth from scope context, update sample reports with fresh eval run

* Add --conversations-file for local JSON scoring without BigQuery

Support evaluating traffic-generator conversations directly from a JSON file, bypassing BigQuery entirely. Adds run_evaluation_from_conversations() and generate_quality_report_from_conversations() public APIs, plus a CLI flag.

Also enhances _write_md_report with TOC, section descriptions, metric explanations, and a report_dir parameter for custom output locations. Includes issue for concurrent session scoring (asyncio.gather + semaphore).

* Update quality report format (as a goal)

* Add per-category sample control for report sections

Support --samples with per-category overrides (e.g. unhelpful=10,partial=5,low=3) in addition to single-number and 'all' modes. Refactor _print_eval_results and _write_md_report to use centralized _parse_samples/_get_sample_limit helpers with sensible defaults per category.

* Make correction inference concurrent with asyncio.gather

Run _infer_corrections calls in parallel using asyncio threads with a semaphore (max 10 concurrent) instead of a sequential loop.

* Allow --help/-h without env vars in quality_report.sh

Short-circuit help flags before env var validation so users can view usage without setting PROJECT_ID, DATASET_ID, etc.

* Add dynamic low-dimension sections and report metadata

Add _md_find_low_dimension_sessions and _md_write_low_dimension_section helpers to generate per-dimension subsections for all 5 quality dimensions dynamically. Move low-dimension sections after Declined Sessions. Restructure TOC with <!-- TOC --> wrapper. Add CLI command reproduction in Summary section. Compute report layout variables upfront.

* Fix help text, enable --tag-turns for BQ, and fix --env loading

- --limit: clarify it evaluates the N most recent sessions
- --trajectory-samples: remove misleading "Requires BQ credentials"
- --tag-turns: remove conversations-file restriction, enable for BQ path by passing tag_turns through run_evaluation with concurrent tagging
- --env=path syntax now works in shell script (not just --env path)
- Error message now says 'export VAR=...' and mentions --env flag

* Add correction analysis with trajectory segmentation, routing failure detection, and parroting detection

- Embed segmented execution trajectories in Correction Analysis section (before/after correction with outcome labels: recovered, parroted, not_recovered)
- Detect routing failures (agent answered from LLM knowledge without tools) and surface as Routing Failures subsection
- Fix answered_by defaulting to "policy_agent": now "unknown" with BQ backfill
- Tighten correction inference prompt to distinguish genuine recovery from parroting
- Add --eval-config flag for external metric/prompt definitions (eval_config.json)
- Nest Correction Analysis under Sample Sessions with dynamic heading levels
- Update sample report and README

* Add parroting penalty to eval and auto-discover eval_config.json

Penalize parroted recoveries (agent echoes user's correction without re-verifying via tool) as unhelpful in both the usefulness definition and the unhelpful category. Refactor quality_report.py to auto-discover eval/eval_config.json from the repo root, removing 290 lines of hardcoded metric definitions in favor of the config-driven approach.

* Add per_session_context support for golden eval

Thread per_session_context through quality_report.py into classify_sessions_via_api() so golden eval expected answers can be injected into the judge prompt per session.

* Auto-fetch execution trace for single-session quality report

When --session is used, the execution trajectory is now fetched automatically from BigQuery and printed to console with sub-trajectory segmentation at correction boundaries. Updated sample with real data showing the full output including trace tree and segmentation.

* Address PR #174 review: fix dimension scoring, TOOL_ERROR counting, and module constants

- H1: Skip parse errors and unknown categories in _compute_dimension_averages instead of scoring them as 0 (which dragged averages down)
- H2: Default unknown categories to ❓ instead of ✅ in scorecard icons
- H3: Count TOOL_ERROR spans as tool attempts in _count_trace_metrics
- L3: Lift _SCORECARD_ICONS to module level (was duplicated in function)
- L7: Extract _PRIMARY_METRICS constant, replace 5 inline references
- M2: _compute_multiturn_stats returns stable shape on empty input
- Update tests: add parse_error attr to _FakeMetric, test TOOL_ERROR counting, test parse error/unknown category skipping, fix empty map assertion

* Fix custom_tags/custom_labels path mismatch and add version filtering

- TraceFilter.to_sql_conditions: query $.custom_tags.* instead of $.labels.* to match BigQueryLoggerConfig.custom_tags write path
- TraceFilter.from_cli_args: add custom_labels parameter
- run_evaluation: add custom_labels parameter, thread through to TraceFilter for version-aware session filtering

* Add --label CLI flag, surface active filters in Execution Details

- Add --label KEY=VALUE (repeatable) to filter by custom_tags set via BigQueryLoggerConfig.custom_tags (version, env, experiment_id, etc.)
- Surface app_name and labels in report.details so they appear in both console and markdown Execution Details sections
- Expand CLI epilog with filtering section, scope-aware eval example showing full agent_context.json format, and combined filter examples
- Update README with Custom Labels section (end-to-end: agent emits → BQ stores → quality report filters), expanded --agent-context docs, and complete Execution Details description
- Update sample reports with complete Execution Details fields

* Address PR #174 review: tool_usage no_tool_needed, --dimensions flag, and hardening

Review feedback (@caohy1988):
- P1: add `no_tool_needed` category to `tool_usage` (scores 2) so correctly declined / no-tool-needed sessions are not penalised as a Tool Usage failure; neutral ➖ scorecard icon; regression tests.
- B2: add `--dimensions {full,primary}` flag to cut LLM-judge cost ~3.5x (primary scores only the 2 primary metrics); document cost in README + help. Default stays `full` for backward compatibility.
- D2: document that `first_time_right` is primarily a multi-turn signal.
- D3/L6: comment the deliberately-divergent middle-category names and the shared `correct` category.
- M5: print per-dimension descriptions in console output.
- L1: render markdown report metadata as a bullet list instead of trailing-double-space hard breaks (fixes `git diff --check`).

Independent review hardening:
- Gate the dimension block on a new `_has_dimension_data()` helper across all three output paths; JSON `dimension_averages` is now empty (not all-zero) when dimensions were not scored, so consumers don't read unscored dimensions as 0.0 / failing.
- Anchor the judge prompt so a missing in-scope tool lookup stays `none` and is not mislabeled `no_tool_needed` (inverse-of-P1 inflation guard).
- Make `_DIMENSION_LOW_CATEGORIES` fail-safe if a dimension has no score-0 category (no StopIteration at import).

Tests: 81 pass (+5).

* Add eval-spec grounding (scope + golden Q&A) and failure-cause taxonomy

Introduces an optional eval spec (`eval/data/eval_spec.json`, or `--eval-spec`) that grounds quality scoring:
- `scope` (free text): defines what the agent handles; out-of-scope is the complement, so a polite decline is scored `declined` (replaces the brittle scope_decisions topic list).
- `golden_qa`: expected answers matched per-question by embedding similarity (`--golden-threshold`); injects ground truth into the judge and emits a `golden_eval_summary`. Matching now lives in the SDK (`match_golden_qa`), not in downstream projects.
- `tools`: capability description used by the judge's new `failure_attribution` metric.

Adds a 3-way failure taxonomy: every `unhelpful` session is attributed to a `skill_gap` (evolution-fixable), `knowledge_gap` (a fact missing from the data), or `tool_gap` (no tool/capability), and a new `addressable_meaningful_rate` reports quality on questions the agent can actually answer. Surfaced in the console, markdown (with per-class "add a fact" / "build a tool" sections), and JSON.

Also: `tool_usage.no_tool_needed` (scores 2, not a failure), `--dimensions {full,primary}` cost control, `_has_dimension_data` so unscored dimensions aren't reported as 0.0.

Tests: 94 pass.

* Apply autoformat (isort + pyink) to fix format check

* quality_report: warn when scoring without golden_qa ground truth

response_usefulness and task_grounding are LLM estimates when no golden_qa is supplied -- they can mislabel verbose, tool-grounded answers as ungrounded/unhelpful. Emit a clear warning pointing users to --eval-spec with a golden_qa list so correctness is graded against expected answers (summary.golden_eval_summary). The golden-Q&A grounding itself is unchanged and remains the reliable path.

* scripts/README: document adding evals + all new quality-report features

The quality-report usage section listed filters and report flags but omitted the most important workflow, adding evals (ground truth), and several features introduced in this PR. Now documented:
- Adding evals: eval-spec (scope/tools/ground_truth/golden_qa) + scoring a local conversations file (--conversations-file, no BigQuery) with the JSON format and recommended workflow
- Failure-cause taxonomy (skill_gap / knowledge_gap / tool_gap) and the 'tools' eval-spec field that drives it; routing-failure and parroting notes
- --dimensions full|primary, --tag-turns, --trajectory-samples, --concurrency, --golden-threshold, --eval-config, --env, per-category --samples
- golden_eval_summary and the no-ground-truth warning
- Reorganized the usage quick-reference into grouped, complete examples

* quality_report: address review (B1, B2, H1-H3, M2, L1-L2, L4)

- B1: golden Q&A is no longer a no-op on the BigQuery path: run_evaluation now matches golden_qa from resolved questions, returns golden_metadata (so the existing _inject_golden_summary fires) and generate_quality_report injects it too; warns that the server-side judge can't take per-session expected answers (expected-answer grounding stays on --conversations-file).
- B2: failure-cause taxonomy is gated on a new _has_failure_attribution_data predicate (needs failure_attribution, or both tool_usage and correctness), in console, markdown, and JSON; with --dimensions primary the breakdown no longer defaults every failure to skill_gap.
- H1: metric count corrected to 8 (incl. failure_attribution) and cost ~4x in --dimensions help, README, and metrics section.
- H2: dropped the already-removed --agent-context flag from sample_quality_report.md; fixed the residual agent-context docstring (also L1).
- H3: added a TraceFilter regression test asserting the $.custom_tags JSON path.
- M2: thread eval_config through generate_quality_report_from_conversations.
- L2: drop the unreachable TypeError in get_a2a_response.
- L4: lift the golden threshold to _DEFAULT_GOLDEN_THRESHOLD.

Tests: 99 pass (5 new). pyink + isort clean.

---------

Co-authored-by: Haiyuan Cao <haiyuan@google.com>
1 parent 2d15a61 commit c2afc72
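
The dimension-averaging behaviour described in the commit message (0-2 scores per session, parse errors and unknown categories skipped rather than counted as 0, and an empty result when nothing was scored) could look roughly like the sketch below. This is an illustration only: the `Metric` class, the `SCORES` table, and `compute_dimension_averages` are hypothetical names, and the only mapping the commit message states explicitly is that `no_tool_needed` scores 2.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical 0-2 score table. Only `no_tool_needed -> 2` is stated in the
# commit message; the other values are assumptions for illustration.
SCORES = {
    "tool_usage": {"proper": 2, "partial": 1, "none": 0, "no_tool_needed": 2},
    "correctness": {"correct": 2, "mostly_correct": 1, "incorrect": 0},
}


@dataclass
class Metric:
    """One judge verdict for one dimension of one session (hypothetical shape)."""
    name: str
    category: Optional[str]
    parse_error: bool = False


def compute_dimension_averages(sessions):
    """Average scores per dimension, skipping unscorable verdicts."""
    totals, counts = {}, {}
    for metrics in sessions:
        for m in metrics:
            score = SCORES.get(m.name, {}).get(m.category)
            # Skip parse errors and unknown categories instead of scoring 0.
            if m.parse_error or score is None:
                continue
            totals[m.name] = totals.get(m.name, 0) + score
            counts[m.name] = counts.get(m.name, 0) + 1
    # Dimensions that were never scored are omitted, not reported as 0.0.
    return {name: totals[name] / counts[name] for name in totals}


sessions = [
    [Metric("tool_usage", "proper"), Metric("correctness", "correct")],
    [Metric("tool_usage", "no_tool_needed"), Metric("correctness", "incorrect")],
    [Metric("tool_usage", None, parse_error=True), Metric("correctness", "bogus")],
]
averages = compute_dimension_averages(sessions)
print(averages)
```

The third session contributes nothing to either average, which is the point of the review fix: an unparseable verdict says nothing about the agent and should not pull the score down.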

12 files changed

Lines changed: 5649 additions & 562 deletions

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -14,13 +14,13 @@ env/
 .adk/
 uv.lock
 .env
+/.idea/
 
 # Script outputs
 scripts/reports/
 
 # Example run artifacts
 examples/*/reports/
-examples/*/reports_*/
 examples/*/trials_*/
 scripts/**/*.log
 examples/**/*.log
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# classify_sessions_via_api and _infer_corrections should run concurrently

**Labels:** `enhancement`, `performance`

## Problem

`classify_sessions_via_api` in `categorical_evaluator.py:831` processes sessions sequentially:

```python
for sid, transcript in transcripts.items():
    response = await client.aio.models.generate_content(...)
```

Additionally, `_infer_corrections` in `quality_report.py` is called per-session in a loop inside `_build_resolved_map_from_conversations` and `run_evaluation` (lines 908-920).

For 205 multi-turn sessions this results in **410 sequential Gemini API calls** (two per session, ~7-8s per session, so ~25 minutes total). Each call is independent, so there's no reason they can't run concurrently.

## Benchmarks

| Sessions | Sequential (current) | Expected with concurrency=10 |
|----------|----------------------|------------------------------|
| 5        | 38.8s                | ~4s                          |
| 205      | ~25min               | ~2.5min                      |

## Proposed fix

### 1. `classify_sessions_via_api`: add semaphore-bounded concurrency

```python
import asyncio

async def classify_sessions_via_api(transcripts, config, endpoint, concurrency=10):
    # Cap the number of in-flight API calls.
    semaphore = asyncio.Semaphore(concurrency)

    async def _classify_one(sid, transcript):
        async with semaphore:
            # existing per-session logic (lines 860-895)
            ...

    tasks = [_classify_one(sid, t) for sid, t in transcripts.items()]
    # gather returns results in the same order as `tasks`.
    results = await asyncio.gather(*tasks)
    return list(results)
```

### 2. `_infer_corrections`: batch with gather

In `_build_resolved_map_from_conversations` and `run_evaluation`, collect all multi-turn sessions and infer corrections concurrently:

```python
import asyncio

async def _infer_corrections_batch(sessions, model, concurrency=10):
    semaphore = asyncio.Semaphore(concurrency)

    async def _infer_one(conv):
        async with semaphore:
            # Assumes _infer_corrections is a blocking (sync) call: run it in a
            # worker thread, since calling it inline would stall the event loop
            # and serialize the whole batch.
            return await asyncio.to_thread(_infer_corrections, conv, model)

    return await asyncio.gather(*[_infer_one(s) for s in sessions])
```

### 3. Wire `--concurrency` flag

The `score_conversations.py` CLI already has a `--concurrency` flag (currently ignored). Pass it through to both functions.

## Files to change

- `src/bigquery_agent_analytics/categorical_evaluator.py`: `classify_sessions_via_api`
- `scripts/quality_report.py`: `_infer_corrections` batching, `_build_resolved_map_from_conversations`, `run_evaluation`

## Notes

- Default concurrency of 10 should be safe for Gemini API rate limits
- The `client.aio.models.generate_content` API is already async; it just needs gather
- Backwards compatible: sequential behavior is preserved with `concurrency=1`
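
To make the semaphore-plus-gather pattern from the issue concrete, here is a small self-contained sketch that can be run without any API access. It is an illustration under assumptions: `slow_call` is a hypothetical stand-in for a blocking per-session call, and `run_bounded` is not a function from the repository. It also records the peak number of in-flight calls so the concurrency cap can be observed.

```python
import asyncio
import time


def slow_call(item):
    """Hypothetical stand-in for a blocking per-session API call."""
    time.sleep(0.05)
    return item * 2


async def run_bounded(items, concurrency=10):
    """Run slow_call over items with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def _one(item):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                # Off-load the blocking call so other tasks can proceed.
                return await asyncio.to_thread(slow_call, item)
            finally:
                in_flight -= 1

    # gather preserves input order regardless of completion order.
    results = await asyncio.gather(*[_one(i) for i in items])
    return list(results), peak


results, peak = asyncio.run(run_bounded(range(8), concurrency=3))
sequential_results, sequential_peak = asyncio.run(run_bounded(range(3), concurrency=1))
print(results, peak, sequential_peak)
```

With `concurrency=1` the peak never exceeds one call at a time, which matches the backwards-compatibility note above.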

scripts/README.md

Lines changed: 420 additions & 54 deletions
Large diffs are not rendered by default.
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
{
  "scope": "This assistant answers questions about company HR policies using its lookup tools: PTO and time off, sick leave, remote work, expenses and reimbursements, benefits (medical, dental, vision, 401k), parental leave, and company holidays. It is OUT OF SCOPE for salary and compensation (bonuses, severance, salary bands), stock/equity, promotions and performance reviews, IT support, office and facilities, training budgets, employee relations (harassment, grievances), code of conduct, internal mobility, and dress code. For any out-of-scope topic the agent should politely decline rather than guess.",
  "tools": "lookup_company_policy(topic) returns static policy text for: PTO, sick leave, remote work, expenses, benefits, holidays ONLY (returns 'topic not found' for anything else). No tool can read an individual employee's personal/account data or perform actions (submit, enroll, file). Used by the judge's failure_attribution metric to tell a knowledge gap (covered topic, missing fact) from a tool gap (no data source, or a personal-data / action request).",
  "ground_truth": "PTO: 20 days/year, accrued monthly (~1.67/mo), max 5 days rollover.\nSICK LEAVE: 10 days/year, no rollover.\nBENEFITS: 401k match 4% of salary, vested after 1 year.\nHOLIDAYS: 11 paid holidays/year; Juneteenth and Veterans Day are NOT company holidays.",
  "golden_qa": [
    {
      "question": "How many PTO days do I get per year?",
      "expected_answer": "20 days per year, accrued monthly at ~1.67 days/month.",
      "topic": "pto"
    },
    {
      "question": "How does the 401k match work?",
      "expected_answer": "The company matches 4% of salary, vested after 1 year.",
      "topic": "benefits"
    },
    {
      "question": "What are the salary bands for senior engineers?",
      "expected_behavior": "decline",
      "topic": "out_of_scope"
    }
  ]
}
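
As a rough illustration of how a `golden_qa` list like the one above might be matched against an incoming user question, the sketch below picks the most similar golden question and accepts it only above a threshold. It is not the SDK's `match_golden_qa`: the commit message says matching uses embedding similarity, whereas this sketch substitutes a simple bag-of-words cosine so it runs without a model. The names `similarity` and `match_golden` and the `0.6` threshold are assumptions for illustration.

```python
import json
import math
import re
from collections import Counter

# Two entries copied from the eval spec above.
SPEC_JSON = """
{
  "golden_qa": [
    {"question": "How many PTO days do I get per year?",
     "expected_answer": "20 days per year, accrued monthly at ~1.67 days/month.",
     "topic": "pto"},
    {"question": "What are the salary bands for senior engineers?",
     "expected_behavior": "decline",
     "topic": "out_of_scope"}
  ]
}
"""


def _vector(text):
    """Bag-of-words token counts (stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = _vector(a), _vector(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def match_golden(question, golden_qa, threshold=0.6):
    """Return the best-matching golden entry, or None if below threshold."""
    best = max(
        golden_qa, key=lambda e: similarity(question, e["question"]), default=None
    )
    if best is None or similarity(question, best["question"]) < threshold:
        return None
    return best


golden = json.loads(SPEC_JSON)["golden_qa"]
hit = match_golden("how many PTO days do I get each year", golden)
miss = match_golden("Where is the cafeteria?", golden)
print(hit["topic"] if hit else None, miss)
```

A matched entry carries either an `expected_answer` or an `expected_behavior`, so a caller would need to handle both shapes before injecting ground truth into the judge prompt.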

scripts/eval/eval_config.json

Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
{
  "metrics": [
    {
      "name": "response_usefulness",
      "definition": "Whether the agent final response provides a genuinely useful, substantive answer to the user question. A response that apologizes, says it cannot help, returns no data, provides only generic filler, or loops without resolving the question is NOT useful. If the conversation contains a user correction and the agent merely repeated or acknowledged the correction without independently verifying it (e.g. re-querying a tool, citing a new source), the response is NOT useful -- the user did the agent's work.",
      "categories": [
        {
          "name": "meaningful",
          "definition": "The response directly and substantively addresses the user question with specific, actionable information."
        },
        {
          "name": "unhelpful",
          "definition": "The response does NOT meaningfully answer the user question. This includes: (1) The agent said 'I don't have that information', gave generic advice, or directed the user elsewhere instead of using its tools. (2) The agent apologized without answering. (3) Empty data results or generic filler text. (4) The agent looped without resolution. (5) The agent only became correct after the user provided the right answer and the agent repeated it without independent verification (e.g. re-querying a tool)."
        },
        {
          "name": "partial",
          "definition": "The response partially addresses the question but is incomplete, missing key details, or only tangentially relevant."
        }
      ],
      "required": true,
      "scope_aware": true,
      "declined_category": {
        "name": "declined",
        "definition": "The TOPIC of the question is explicitly listed as out of scope (see AGENT SCOPE CONTEXT above) and the agent correctly declined. Use this ONLY when the topic itself is out of scope -- NOT when the agent simply failed to find an answer for an in-scope topic.",
        "insert_after": "meaningful"
      },
      "scope_suffix": " UNLESS the question is outside the agent's defined scope, in which case a polite decline IS a correct and meaningful response."
    },
    {
      "name": "task_grounding",
      "definition": "Whether the agent response is grounded in actual data retrieved from its tools, or is fabricated / hallucinated general knowledge.",
      "categories": [
        {
          "name": "grounded",
          "definition": "The response is clearly based on data retrieved from the agent tools (search results, database lookups, API calls)."
        },
        {
          "name": "ungrounded",
          "definition": "The response appears to be fabricated or based on the LLM general knowledge rather than actual tool results. The tool may have returned empty data and the agent filled in anyway."
        },
        {
          "name": "no_tool_needed",
          "definition": "The question did not require tool usage and a direct LLM response was appropriate."
        }
      ],
      "required": true
    },
    {
      "name": "correctness",
      "definition": "Whether the facts stated in the agent response are accurate. Evaluate based on the information the agent retrieved from its tools and whether it was conveyed faithfully.",
      "categories": [
        {
          "name": "correct",
          "definition": "All facts stated by the agent are accurate and consistent with the tool results retrieved."
        },
        {
          "name": "mostly_correct",
          "definition": "The response is mostly correct but contains a minor inaccuracy, omission, or imprecise wording."
        },
        {
          "name": "incorrect",
          "definition": "The response contains wrong facts, hallucinated information, or claims contradicted by the tool results."
        }
      ],
      "required": true
    },
    {
      "name": "tool_usage",
      "definition": "Whether the agent used its available tools correctly to answer the question, rather than relying on general knowledge.",
      "categories": [
        {
          "name": "proper",
          "definition": "The agent used its tools and based the answer on the tool results. Tools were called with appropriate parameters."
        },
        {
          "name": "partial",
          "definition": "The agent partially used tools, or tool usage was unclear or incomplete. Some information may not be tool-derived."
        },
        {
          "name": "none",
          "definition": "The agent answered from general knowledge without looking up information via tools, even though tools were available and the question warranted their use. DECISIVE TEST: if the question was in-scope and a tool could have supplied the answer, but the trace shows no relevant tool call, this is `none` (a failure) -- do NOT use `no_tool_needed` to excuse a missing lookup."
        },
        {
          "name": "no_tool_needed",
          "definition": "The question genuinely required no tool lookup -- e.g. a greeting, a meta/clarification turn, or an out-of-scope topic the agent correctly declined. Not using a tool was the CORRECT behavior here, so this is a positive outcome, not a failure. Use this ONLY when no tool was needed; if the question was an in-scope data lookup the agent should have performed, use `none` instead."
        }
      ],
      "required": true
    },
    {
      "name": "specificity",
      "definition": "Whether the agent response provides specific, concrete details (numbers, dates, dollar amounts, limits) rather than vague or generic statements.",
      "categories": [
        {
          "name": "specific",
          "definition": "The response includes specific and complete details: exact numbers, percentages, dollar amounts, dates, or limits."
        },
        {
          "name": "somewhat_specific",
          "definition": "The response is somewhat specific but missing some key details that would make it fully actionable."
        },
        {
          "name": "vague",
          "definition": "The response is vague, generic, or missing key specifics that the user needs to act on the information."
        }
      ],
      "required": true
    },
    {
      "name": "scope_compliance",
      "definition": "Whether the agent correctly handled the scope of the question. An agent should answer in-scope questions and politely decline out-of-scope ones.",
      "categories": [
        {
          "name": "compliant",
          "definition": "The agent correctly answered an in-scope question OR correctly declined an out-of-scope question."
        },
        {
          "name": "partially_compliant",
          "definition": "The agent answered but with unnecessary caveats, excessive hedging, or was partially out of scope."
        },
        {
          "name": "non_compliant",
          "definition": "The agent tried to answer an out-of-scope question it should have declined, OR refused to answer an in-scope question it should have handled."
        }
      ],
      "required": true,
      "scope_aware": true
    },
    {
      "name": "first_time_right",
      "definition": "Whether the agent's FIRST response in the conversation was satisfactory, without needing user corrections or follow-ups to fix errors. For single-turn conversations, evaluate the only response. For multi-turn, focus on whether the first substantive answer was correct.",
      "categories": [
        {
          "name": "correct",
          "definition": "The first response was correct and complete. No correction or significant clarification was needed from the user."
        },
        {
          "name": "clarification_needed",
          "definition": "The first response was mostly right but needed minor clarification or a follow-up to be fully useful."
        },
        {
          "name": "correction_needed",
          "definition": "The first response was wrong, vague, or incomplete enough that the user had to push back or correct the agent."
        }
      ],
      "required": true
    },
    {
      "name": "failure_attribution",
      "definition": "ROOT CAUSE of a failure: when the agent did NOT give a useful answer, why? Use the AGENT TOOLS / CAPABILITIES context above to decide which fixer is responsible. If the response WAS useful (a substantive answer or a correct decline of an out-of-scope topic), return not_a_failure.",
      "categories": [
        {
          "name": "not_a_failure",
          "definition": "The response was useful -- a substantive answer, or a correct polite decline of a genuinely out-of-scope topic. No failure to attribute."
        },
        {
          "name": "skill_gap",
          "definition": "The agent HAD the means to answer but behaved wrong: it failed to route to the right sub-agent, did not call an available tool, echoed/parroted the user's correction without re-verifying, or stated facts that contradict its tools. The tool and data needed were available -- this is fixable by improving the agent's instructions (skill)."
        },
        {
          "name": "knowledge_gap",
          "definition": "The agent correctly used a tool that DOES cover this topic, but the SPECIFIC fact requested was not present in the data the tool returned (the data source is incomplete on this detail). Fixable by a human adding the missing fact to the existing data source -- not by changing instructions."
        },
        {
          "name": "tool_gap",
          "definition": "No tool or capability could even attempt this request. Either (a) the question is about a topic that NONE of the listed tools has any data source for, or (b) it needs the individual user's personal/account data (their actual balance, enrollment status) or an ACTION (submit, file, enroll) that no tool provides. Fixable only by an engineer building a new tool or data source -- not by skill evolution or by adding a fact."
        }
      ],
      "required": true,
      "scope_aware": true
    }
  ]
}
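
The `scope_aware`, `declined_category`, `insert_after`, and `scope_suffix` fields suggest that the `declined` category is only added to a metric when scope context exists. The sketch below shows one way a loader might resolve a metric entry under that reading. It is an assumption, not the repository's implementation: `resolve_metric` is a hypothetical helper, the metric definitions are shortened, and appending `scope_suffix` to the metric definition is a guess at how that field is used.

```python
import copy

# Abbreviated copy of the response_usefulness entry; definitions shortened.
RESPONSE_USEFULNESS = {
    "name": "response_usefulness",
    "definition": "Whether the final response is useful.",
    "categories": [
        {"name": "meaningful", "definition": "Substantive answer."},
        {"name": "unhelpful", "definition": "Does not answer."},
        {"name": "partial", "definition": "Incomplete answer."},
    ],
    "scope_aware": True,
    "declined_category": {
        "name": "declined",
        "definition": "Out-of-scope topic, correctly declined.",
        "insert_after": "meaningful",
    },
    "scope_suffix": " UNLESS the question is outside the agent's defined scope.",
}


def resolve_metric(metric, has_scope):
    """Return a copy of `metric` with scope-only fields applied or dropped."""
    resolved = copy.deepcopy(metric)
    declined = resolved.pop("declined_category", None)
    suffix = resolved.pop("scope_suffix", "")
    if not (has_scope and resolved.get("scope_aware")):
        # No scope context: the judge never sees a `declined` option.
        return resolved
    resolved["definition"] += suffix
    if declined:
        names = [c["name"] for c in resolved["categories"]]
        anchor = declined.get("insert_after")
        # Fall back to appending if the anchor category is missing.
        position = names.index(anchor) + 1 if anchor in names else len(names)
        resolved["categories"].insert(
            position,
            {"name": declined["name"], "definition": declined["definition"]},
        )
    return resolved


def category_names(metric):
    return [c["name"] for c in metric["categories"]]


scoped = resolve_metric(RESPONSE_USEFULNESS, has_scope=True)
unscoped = resolve_metric(RESPONSE_USEFULNESS, has_scope=False)
print(category_names(scoped))
print(category_names(unscoped))
```

Working on a deep copy keeps the loaded config reusable, so the same entry can be resolved with and without scope in one process.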
