Merged
30 commits
e4e154f
Add quality dimensions, multi-turn metrics, and streamlined report fo…
evekhm May 19, 2026
cf514c9
Add correction/verification inference and multi-turn conversation ext…
evekhm May 19, 2026
f94026c
Improve scope-aware eval accuracy with conditional declined category …
evekhm May 20, 2026
e3f5d9c
Add scope config example, remove in_scope_topics, update README
evekhm May 20, 2026
d5702a3
Remove ground truth from scope context, update sample reports with fr…
evekhm May 20, 2026
f981661
Add --conversations-file for local JSON scoring without BigQuery
evekhm May 21, 2026
236c3bd
Update quality report format (as a goal)
evekhm May 21, 2026
1ff6e2d
Add per-category sample control for report sections
evekhm May 22, 2026
d2803ea
Make correction inference concurrent with asyncio.gather
evekhm May 22, 2026
9e84936
Allow --help/-h without env vars in quality_report.sh
evekhm May 22, 2026
e708e36
Add dynamic low-dimension sections and report metadata
evekhm May 22, 2026
a2fd41e
Fix help text, enable --tag-turns for BQ, and fix --env loading
evekhm May 22, 2026
387ee89
Add correction analysis with trajectory segmentation, routing failure…
evekhm May 24, 2026
e4af59f
Add parroting penalty to eval and auto-discover eval_config.json
evekhm May 26, 2026
808c3bb
Add per_session_context support for golden eval
evekhm May 26, 2026
6369ae9
Auto-fetch execution trace for single-session quality report
evekhm May 27, 2026
041b978
Address PR #174 review: fix dimension scoring, TOOL_ERROR counting, a…
evekhm May 27, 2026
9bd7f93
Merge branch 'main' into feat/quality-dimensions
evekhm May 28, 2026
a1f9bb7
Fix custom_tags/custom_labels path mismatch and add version filtering
evekhm May 28, 2026
1606db3
Add --label CLI flag, surface active filters in Execution Details
evekhm May 28, 2026
c86722f
Address PR #174 review: tool_usage no_tool_needed, --dimensions flag,…
evekhm May 29, 2026
520a4e5
Add eval-spec grounding (scope + golden Q&A) and failure-cause taxonomy
evekhm May 31, 2026
d3ad82c
Apply autoformat (isort + pyink) to fix format check
evekhm Jun 2, 2026
fa702a1
quality_report: warn when scoring without golden_qa ground truth
evekhm Jun 5, 2026
6595417
scripts/README: document adding evals + all new quality-report features
evekhm Jun 5, 2026
845e624
Merge branch 'main' into feat/quality-dimensions
evekhm Jun 5, 2026
ac47775
quality_report: address review (B1, B2, H1-H3, M2, L1-L2, L4)
evekhm Jun 5, 2026
fa59381
Merge branch 'GoogleCloudPlatform:main' into feat/quality-dimensions
evekhm Jun 8, 2026
367bbd3
Merge branch 'main' into feat/quality-dimensions
haiyuan-eng-google Jun 12, 2026
32ee77a
Merge branch 'main' into feat/quality-dimensions
haiyuan-eng-google Jun 15, 2026
2 changes: 1 addition & 1 deletion .gitignore
@@ -14,13 +14,13 @@ env/
.adk/
uv.lock
.env
/.idea/

# Script outputs
scripts/reports/

# Example run artifacts
examples/*/reports/
examples/*/reports_*/
examples/*/trials_*/
scripts/**/*.log
examples/**/*.log
71 changes: 71 additions & 0 deletions issues/concurrent_classify_sessions.md
@@ -0,0 +1,71 @@
# classify_sessions_via_api and _infer_corrections should run concurrently

**Labels:** `enhancement`, `performance`

## Problem

`classify_sessions_via_api` in `categorical_evaluator.py:831` processes sessions sequentially:

```python
for sid, transcript in transcripts.items():
    response = await client.aio.models.generate_content(...)
```

Additionally, `_infer_corrections` in `quality_report.py` is called per-session in a loop inside `_build_resolved_map_from_conversations` and `run_evaluation` (lines 908-920).

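For 205 multi-turn sessions this results in **410 sequential Gemini API calls**, two per session. At roughly 7-8s per session (consistent with the 38.8s measured for 5 sessions below), that is about 25 minutes in total. Each call is independent, so there is no reason they cannot run concurrently.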

## Benchmarks

| Sessions | Sequential (current) | Expected with concurrency=10 |
|----------|---------------------|-------------------------------|
| 5 | 38.8s | ~4s |
| 205 | ~25min | ~2.5min |

## Proposed fix

### 1. `classify_sessions_via_api` — add semaphore-bounded concurrency

```python
import asyncio

async def classify_sessions_via_api(transcripts, config, endpoint, concurrency=10):
    # Cap the number of in-flight API calls at `concurrency`.
    semaphore = asyncio.Semaphore(concurrency)

    async def _classify_one(sid, transcript):
        async with semaphore:
            # existing per-session logic (lines 860-895)
            ...

    tasks = [_classify_one(sid, t) for sid, t in transcripts.items()]
    # gather preserves input order, so results line up with transcripts.
    results = await asyncio.gather(*tasks)
    return list(results)
```
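
To see the semaphore pattern above in isolation, here is a minimal, self-contained sketch. `fake_generate` is a hypothetical stand-in for the Gemini call and `classify_all` is an illustrative name; neither exists in the codebase or the SDK. The sketch records the peak number of in-flight calls so the bound can be checked.

```python
import asyncio


async def fake_generate(sid: str, state: dict) -> dict:
    """Hypothetical stand-in for the per-session model call."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)  # simulate network latency
    state["in_flight"] -= 1
    return {"session_id": sid, "label": "ok"}


async def classify_all(transcripts: dict, concurrency: int = 10):
    """Run one call per session with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    state = {"in_flight": 0, "peak": 0}

    async def _one(sid: str) -> dict:
        async with semaphore:
            return await fake_generate(sid, state)

    # gather returns results in the order the awaitables were passed in.
    results = await asyncio.gather(*(_one(sid) for sid in transcripts))
    return list(results), state["peak"]


def run_demo(n_sessions: int = 25, concurrency: int = 10):
    transcripts = {f"s{i}": "..." for i in range(n_sessions)}
    return asyncio.run(classify_all(transcripts, concurrency))
```

Calling `run_demo(25, 10)` returns the 25 results in input order together with the observed peak, which should never exceed the configured limit. With `concurrency=1` the same code degrades to sequential execution.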

### 2. `_infer_corrections` — batch with gather

In `_build_resolved_map_from_conversations` and `run_evaluation`, collect all multi-turn sessions and infer corrections concurrently:

```python
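import asyncio

async def _infer_corrections_batch(sessions, model, concurrency=10):
    semaphore = asyncio.Semaphore(concurrency)

    async def _infer_one(conv):
        async with semaphore:
            # Assumes _infer_corrections is a coroutine; without `await`, gather
            # would collect un-awaited coroutine objects. If it is a blocking
            # function, use asyncio.to_thread(_infer_corrections, conv, model).
            return await _infer_corrections(conv, model)

    return await asyncio.gather(*[_infer_one(s) for s in sessions])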
```
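
One caveat for the batch helper: `asyncio.gather` only overlaps work that yields to the event loop. This issue does not say whether `_infer_corrections` is a coroutine or a plain blocking function. If it is blocking, one option is to push each call onto a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch below uses a hypothetical `blocking_infer` stand-in, not the real function.

```python
import asyncio
import time


def blocking_infer(conv: dict, model: str) -> dict:
    """Hypothetical synchronous stand-in for a blocking inference call."""
    time.sleep(0.05)  # simulate a blocking network round trip
    return {"session_id": conv["session_id"], "model": model, "corrections": []}


async def infer_batch(sessions: list, model: str, concurrency: int = 10) -> list:
    semaphore = asyncio.Semaphore(concurrency)

    async def _one(conv: dict) -> dict:
        async with semaphore:
            # Run the blocking call in a worker thread so the loop stays free.
            return await asyncio.to_thread(blocking_infer, conv, model)

    return await asyncio.gather(*(_one(s) for s in sessions))


def timed_run(n: int = 10, concurrency: int = 10):
    """Return the results and the wall-clock time for n fake sessions."""
    sessions = [{"session_id": f"s{i}"} for i in range(n)]
    start = time.perf_counter()
    out = asyncio.run(infer_batch(sessions, "demo-model", concurrency))
    return out, time.perf_counter() - start
```

The effective parallelism here is also capped by the default thread pool size, so the semaphore acts as an upper bound rather than a guarantee.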

### 3. Wire `--concurrency` flag

The `score_conversations.py` CLI already has a `--concurrency` flag (currently ignored). Pass it through to both functions.
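
A rough sketch of the wiring, using only the standard `argparse` module. The parser layout, the `positive_int` validator, and the `concurrency_kwargs` helper are all hypothetical; the real `score_conversations.py` parser may define the flag differently. The default of 10 mirrors the value proposed in this issue.

```python
import argparse


def positive_int(value: str) -> int:
    """Reject zero or negative values, which would stall a semaphore."""
    n = int(value)
    if n < 1:
        raise argparse.ArgumentTypeError("concurrency must be >= 1")
    return n


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument(
        "--concurrency",
        type=positive_int,
        default=10,
        help="maximum number of in-flight API calls (1 = sequential)",
    )
    return parser


def concurrency_kwargs(argv: list) -> dict:
    """Parse argv and return the keyword argument to forward downstream."""
    args = build_parser().parse_args(argv)
    return {"concurrency": args.concurrency}
```

The returned dict could then be splatted into both call sites, for example `classify_sessions_via_api(..., **concurrency_kwargs(argv))`, so the flag has a single source of truth.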

## Files to change

- `src/bigquery_agent_analytics/categorical_evaluator.py` — `classify_sessions_via_api`
- `scripts/quality_report.py` — `_infer_corrections` batching, `_build_resolved_map_from_conversations`, `run_evaluation`

## Notes

- Default concurrency of 10 should be safe for Gemini API rate limits
- The `client.aio.models.generate_content` API is already async — just needs gather
- Backwards compatible — sequential behavior preserved with `concurrency=1`
474 changes: 420 additions & 54 deletions scripts/README.md

Large diffs are not rendered by default.

22 changes: 22 additions & 0 deletions scripts/eval/data/eval_spec.example.json
@@ -0,0 +1,22 @@
{
  "scope": "This assistant answers questions about company HR policies using its lookup tools: PTO and time off, sick leave, remote work, expenses and reimbursements, benefits (medical, dental, vision, 401k), parental leave, and company holidays. It is OUT OF SCOPE for salary and compensation (bonuses, severance, salary bands), stock/equity, promotions and performance reviews, IT support, office and facilities, training budgets, employee relations (harassment, grievances), code of conduct, internal mobility, and dress code. For any out-of-scope topic the agent should politely decline rather than guess.",
  "tools": "lookup_company_policy(topic) returns static policy text for: PTO, sick leave, remote work, expenses, benefits, holidays ONLY (returns 'topic not found' for anything else). No tool can read an individual employee's personal/account data or perform actions (submit, enroll, file). Used by the judge's failure_attribution metric to tell a knowledge gap (covered topic, missing fact) from a tool gap (no data source, or a personal-data / action request).",
  "ground_truth": "PTO: 20 days/year, accrued monthly (~1.67/mo), max 5 days rollover.\nSICK LEAVE: 10 days/year, no rollover.\nBENEFITS: 401k match 4% of salary, vested after 1 year.\nHOLIDAYS: 11 paid holidays/year; Juneteenth and Veterans Day are NOT company holidays.",
  "golden_qa": [
    {
      "question": "How many PTO days do I get per year?",
      "expected_answer": "20 days per year, accrued monthly at ~1.67 days/month.",
      "topic": "pto"
    },
    {
      "question": "How does the 401k match work?",
      "expected_answer": "The company matches 4% of salary, vested after 1 year.",
      "topic": "benefits"
    },
    {
      "question": "What are the salary bands for senior engineers?",
      "expected_behavior": "decline",
      "topic": "out_of_scope"
    }
  ]
}
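
In the example spec above, each `golden_qa` entry carries either an `expected_answer` or an `expected_behavior`, never both. Assuming that is the intended rule (the diff does not state it, and `validate_golden_qa` is a hypothetical helper, not part of the PR), a small pre-flight check might look like this:

```python
import json


def validate_golden_qa(spec: dict) -> list:
    """Return a list of problems; an empty list means the entries look consistent."""
    problems = []
    for i, qa in enumerate(spec.get("golden_qa", [])):
        if not qa.get("question"):
            problems.append(f"golden_qa[{i}]: missing 'question'")
        # Assumed rule: exactly one of expected_answer / expected_behavior.
        if ("expected_answer" in qa) == ("expected_behavior" in qa):
            problems.append(
                f"golden_qa[{i}]: set exactly one of "
                "'expected_answer' or 'expected_behavior'"
            )
    return problems


# Trimmed copy of two entries from the example spec.
SAMPLE = json.loads("""
{
  "golden_qa": [
    {"question": "How many PTO days do I get per year?",
     "expected_answer": "20 days per year, accrued monthly at ~1.67 days/month.",
     "topic": "pto"},
    {"question": "What are the salary bands for senior engineers?",
     "expected_behavior": "decline",
     "topic": "out_of_scope"}
  ]
}
""")
```

Running `validate_golden_qa(SAMPLE)` on the trimmed sample reports no problems, while an entry with neither field set is flagged.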
173 changes: 173 additions & 0 deletions scripts/eval/eval_config.json
@@ -0,0 +1,173 @@
{
"metrics": [
{
"name": "response_usefulness",
"definition": "Whether the agent final response provides a genuinely useful, substantive answer to the user question. A response that apologizes, says it cannot help, returns no data, provides only generic filler, or loops without resolving the question is NOT useful. If the conversation contains a user correction and the agent merely repeated or acknowledged the correction without independently verifying it (e.g. re-querying a tool, citing a new source), the response is NOT useful — the user did the agent's work.",
"categories": [
{
"name": "meaningful",
"definition": "The response directly and substantively addresses the user question with specific, actionable information."
},
{
"name": "unhelpful",
"definition": "The response does NOT meaningfully answer the user question. This includes: (1) The agent said 'I don't have that information', gave generic advice, or directed the user elsewhere instead of using its tools. (2) The agent apologized without answering. (3) Empty data results or generic filler text. (4) The agent looped without resolution. (5) The agent only became correct after the user provided the right answer and the agent repeated it without independent verification (e.g. re-querying a tool)."
},
{
"name": "partial",
"definition": "The response partially addresses the question but is incomplete, missing key details, or only tangentially relevant."
}
],
"required": true,
"scope_aware": true,
"declined_category": {
"name": "declined",
"definition": "The TOPIC of the question is explicitly listed as out of scope (see AGENT SCOPE CONTEXT above) and the agent correctly declined. Use this ONLY when the topic itself is out of scope -- NOT when the agent simply failed to find an answer for an in-scope topic.",
"insert_after": "meaningful"
},
"scope_suffix": " UNLESS the question is outside the agent's defined scope, in which case a polite decline IS a correct and meaningful response."
},
{
"name": "task_grounding",
"definition": "Whether the agent response is grounded in actual data retrieved from its tools, or is fabricated / hallucinated general knowledge.",
"categories": [
{
"name": "grounded",
"definition": "The response is clearly based on data retrieved from the agent tools (search results, database lookups, API calls)."
},
{
"name": "ungrounded",
"definition": "The response appears to be fabricated or based on the LLM general knowledge rather than actual tool results. The tool may have returned empty data and the agent filled in anyway."
},
{
"name": "no_tool_needed",
"definition": "The question did not require tool usage and a direct LLM response was appropriate."
}
],
"required": true
},
{
"name": "correctness",
"definition": "Whether the facts stated in the agent response are accurate. Evaluate based on the information the agent retrieved from its tools and whether it was conveyed faithfully.",
"categories": [
{
"name": "correct",
"definition": "All facts stated by the agent are accurate and consistent with the tool results retrieved."
},
{
"name": "mostly_correct",
"definition": "The response is mostly correct but contains a minor inaccuracy, omission, or imprecise wording."
},
{
"name": "incorrect",
"definition": "The response contains wrong facts, hallucinated information, or claims contradicted by the tool results."
}
],
"required": true
},
{
"name": "tool_usage",
"definition": "Whether the agent used its available tools correctly to answer the question, rather than relying on general knowledge.",
"categories": [
{
"name": "proper",
"definition": "The agent used its tools and based the answer on the tool results. Tools were called with appropriate parameters."
},
{
"name": "partial",
"definition": "The agent partially used tools, or tool usage was unclear or incomplete. Some information may not be tool-derived."
},
{
"name": "none",
"definition": "The agent answered from general knowledge without looking up information via tools, even though tools were available and the question warranted their use. DECISIVE TEST: if the question was in-scope and a tool could have supplied the answer, but the trace shows no relevant tool call, this is `none` (a failure) -- do NOT use `no_tool_needed` to excuse a missing lookup."
},
{
"name": "no_tool_needed",
"definition": "The question genuinely required no tool lookup -- e.g. a greeting, a meta/clarification turn, or an out-of-scope topic the agent correctly declined. Not using a tool was the CORRECT behavior here, so this is a positive outcome, not a failure. Use this ONLY when no tool was needed; if the question was an in-scope data lookup the agent should have performed, use `none` instead."
}
],
"required": true
},
{
"name": "specificity",
"definition": "Whether the agent response provides specific, concrete details (numbers, dates, dollar amounts, limits) rather than vague or generic statements.",
"categories": [
{
"name": "specific",
"definition": "The response includes specific and complete details: exact numbers, percentages, dollar amounts, dates, or limits."
},
{
"name": "somewhat_specific",
"definition": "The response is somewhat specific but missing some key details that would make it fully actionable."
},
{
"name": "vague",
"definition": "The response is vague, generic, or missing key specifics that the user needs to act on the information."
}
],
"required": true
},
{
"name": "scope_compliance",
"definition": "Whether the agent correctly handled the scope of the question. An agent should answer in-scope questions and politely decline out-of-scope ones.",
"categories": [
{
"name": "compliant",
"definition": "The agent correctly answered an in-scope question OR correctly declined an out-of-scope question."
},
{
"name": "partially_compliant",
"definition": "The agent answered but with unnecessary caveats, excessive hedging, or was partially out of scope."
},
{
"name": "non_compliant",
"definition": "The agent tried to answer an out-of-scope question it should have declined, OR refused to answer an in-scope question it should have handled."
}
],
"required": true,
"scope_aware": true
},
{
"name": "first_time_right",
"definition": "Whether the agent's FIRST response in the conversation was satisfactory, without needing user corrections or follow-ups to fix errors. For single-turn conversations, evaluate the only response. For multi-turn, focus on whether the first substantive answer was correct.",
"categories": [
{
"name": "correct",
"definition": "The first response was correct and complete. No correction or significant clarification was needed from the user."
},
{
"name": "clarification_needed",
"definition": "The first response was mostly right but needed minor clarification or a follow-up to be fully useful."
},
{
"name": "correction_needed",
"definition": "The first response was wrong, vague, or incomplete enough that the user had to push back or correct the agent."
}
],
"required": true
},
{
"name": "failure_attribution",
"definition": "ROOT CAUSE of a failure: when the agent did NOT give a useful answer, why? Use the AGENT TOOLS / CAPABILITIES context above to decide which fixer is responsible. If the response WAS useful (a substantive answer or a correct decline of an out-of-scope topic), return not_a_failure.",
"categories": [
{
"name": "not_a_failure",
"definition": "The response was useful -- a substantive answer, or a correct polite decline of a genuinely out-of-scope topic. No failure to attribute."
},
{
"name": "skill_gap",
"definition": "The agent HAD the means to answer but behaved wrong: it failed to route to the right sub-agent, did not call an available tool, echoed/parroted the user's correction without re-verifying, or stated facts that contradict its tools. The tool and data needed were available -- this is fixable by improving the agent's instructions (skill)."
},
{
"name": "knowledge_gap",
"definition": "The agent correctly used a tool that DOES cover this topic, but the SPECIFIC fact requested was not present in the data the tool returned (the data source is incomplete on this detail). Fixable by a human adding the missing fact to the existing data source -- not by changing instructions."
},
{
"name": "tool_gap",
"definition": "No tool or capability could even attempt this request. Either (a) the question is about a topic that NONE of the listed tools has any data source for, or (b) it needs the individual user's personal/account data (their actual balance, enrollment status) or an ACTION (submit, file, enroll) that no tool provides. Fixable only by an engineer building a new tool or data source -- not by skill evolution or by adding a fact."
}
],
"required": true,
"scope_aware": true
}
]
}
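
The `declined_category`, `insert_after`, and `scope_suffix` keys on `response_usefulness` suggest that scope-aware metrics are expanded at load time only when a scope is configured. The diff does not show the evaluator code, so the following is a guess at those semantics rather than the actual implementation; `effective_metric` is a hypothetical name.

```python
import copy


def effective_metric(metric: dict, has_scope: bool) -> dict:
    """Return a copy of `metric` with scope-only pieces applied (assumed semantics)."""
    out = copy.deepcopy(metric)
    declined = out.pop("declined_category", None)
    suffix = out.pop("scope_suffix", "")
    if not (has_scope and out.get("scope_aware")):
        return out  # without scope context, the extra category is dropped

    out["definition"] += suffix
    if declined:
        names = [c["name"] for c in out["categories"]]
        anchor = declined.get("insert_after")
        # Fall back to appending if the anchor category is not found.
        pos = names.index(anchor) + 1 if anchor in names else len(names)
        out["categories"].insert(
            pos, {"name": declined["name"], "definition": declined["definition"]}
        )
    return out


# Trimmed stand-in for the response_usefulness metric above.
DEMO_METRIC = {
    "name": "response_usefulness",
    "definition": "Whether the response is useful.",
    "categories": [
        {"name": "meaningful", "definition": "..."},
        {"name": "unhelpful", "definition": "..."},
        {"name": "partial", "definition": "..."},
    ],
    "scope_aware": True,
    "declined_category": {
        "name": "declined",
        "definition": "...",
        "insert_after": "meaningful",
    },
    "scope_suffix": " UNLESS the question is out of scope.",
}
```

Under this reading, `effective_metric(DEMO_METRIC, has_scope=True)` yields the categories `meaningful`, `declined`, `unhelpful`, `partial`, while `has_scope=False` leaves the original three and an unmodified definition.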