GoogleCloudPlatform · haiyuan-eng-google · Jun 15, 2026 · May 19, 2026 · May 19, 2026 · May 20, 2026
diff --git a/.gitignore b/.gitignore
@@ -14,13 +14,13 @@ env/
 .adk/
 uv.lock
 .env
+/.idea/
 
 # Script outputs
 scripts/reports/
 
 # Example run artifacts
 examples/*/reports/
-examples/*/reports_*/
 examples/*/trials_*/
 scripts/**/*.log
 examples/**/*.log

diff --git a/issues/concurrent_classify_sessions.md b/issues/concurrent_classify_sessions.md
@@ -0,0 +1,71 @@
+# classify_sessions_via_api and _infer_corrections should run concurrently
+
+**Labels:** `enhancement`, `performance`
+
+## Problem
+
+`classify_sessions_via_api` in `categorical_evaluator.py:831` processes sessions sequentially:
+
+```python
+for sid, transcript in transcripts.items():
+    response = await client.aio.models.generate_content(...)
+```
+
+Additionally, `_infer_corrections` in `quality_report.py` is called per-session in a loop inside `_build_resolved_map_from_conversations` and `run_evaluation` (lines 908-920).
+
+For 205 multi-turn sessions this results in **410 sequential Gemini API calls** (two per session; ~7-8s per session = ~25 minutes total). Each call is independent, so there's no reason they can't run concurrently.
+
+## Benchmarks
+
+| Sessions | Sequential (current) | Expected with concurrency=10 |
+|----------|---------------------|-------------------------------|
+| 5 | 38.8s | ~4s |
+| 205 | ~25min | ~2.5min |
+
+## Proposed fix
+
+### 1. `classify_sessions_via_api` — add semaphore-bounded concurrency
+
+```python
+async def classify_sessions_via_api(transcripts, config, endpoint, concurrency=10):
+    semaphore = asyncio.Semaphore(concurrency)
+
+    async def _classify_one(sid, transcript):
+        async with semaphore:
+            # existing per-session logic (lines 860-895)
+            ...
+
+    tasks = [_classify_one(sid, t) for sid, t in transcripts.items()]
+    results = await asyncio.gather(*tasks)
+    return list(results)
+```
+
+### 2. `_infer_corrections` — batch with gather
+
+In `_build_resolved_map_from_conversations` and `run_evaluation`, collect all multi-turn sessions and infer corrections concurrently:
+
+```python
+async def _infer_corrections_batch(sessions, model, concurrency=10):
+    semaphore = asyncio.Semaphore(concurrency)
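+    # Assumes _infer_corrections is synchronous; if it is async, await it directly.
+    async def _infer_one(conv):
+        async with semaphore:
+            return await asyncio.to_thread(_infer_corrections, conv, model)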
+
+    return await asyncio.gather(*[_infer_one(s) for s in sessions])
+```
+
+### 3. Wire `--concurrency` flag
+
+The `score_conversations.py` CLI already has a `--concurrency` flag (currently ignored). Pass it through to both functions.
+
+## Files to change
+
+- `src/bigquery_agent_analytics/categorical_evaluator.py` — `classify_sessions_via_api`
+- `scripts/quality_report.py` — `_infer_corrections` batching, `_build_resolved_map_from_conversations`, `run_evaluation`
+
+## Notes
+
+- Default concurrency of 10 should be safe for Gemini API rate limits
+- The `client.aio.models.generate_content` API is already async — just needs gather
+- Backwards compatible — sequential behavior preserved with `concurrency=1`
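Reviewer note: the semaphore-plus-gather pattern proposed above can be tried in isolation with the sketch below. It is a minimal illustration, not the project's code: `bounded_gather`, `FakeAsyncClassifier`, and `fake_sync_inference` are hypothetical stand-ins for the real Gemini calls, and the sleep durations are arbitrary. It handles both an awaitable worker and a blocking one, because a blocking call inside a coroutine would serialize the whole batch no matter how `gather` is used.

```python
import asyncio
import inspect
import time


async def bounded_gather(items, worker, concurrency=10):
    """Run worker(item) for every item with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def _run_one(item):
        async with semaphore:
            if inspect.iscoroutinefunction(worker):
                return await worker(item)
            # Blocking callables run in a thread so the event loop stays free.
            return await asyncio.to_thread(worker, item)

    # gather preserves input order, so results line up with items.
    return await asyncio.gather(*(_run_one(item) for item in items))


def fake_sync_inference(session_id):
    """Hypothetical blocking call standing in for a synchronous helper."""
    time.sleep(0.02)
    return f"corrections:{session_id}"


class FakeAsyncClassifier:
    """Hypothetical async call that records how many requests overlap."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def classify(self, session_id):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.02)
        self.in_flight -= 1
        return f"label:{session_id}"


def run_demo(n_sessions=25, concurrency=10):
    session_ids = [f"s{n}" for n in range(n_sessions)]
    classifier = FakeAsyncClassifier()
    labels = asyncio.run(
        bounded_gather(session_ids, classifier.classify, concurrency)
    )
    corrections = asyncio.run(
        bounded_gather(session_ids, fake_sync_inference, concurrency)
    )
    return labels, corrections, classifier.peak
```

Calling `run_demo()` returns the two ordered result lists plus the peak number of overlapping classifier calls, which should never exceed the `concurrency` argument. Passing `concurrency=1` reproduces the sequential behavior.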
diff --git a/scripts/README.md b/scripts/README.md
diff --git a/scripts/eval/data/eval_spec.example.json b/scripts/eval/data/eval_spec.example.json
@@ -0,0 +1,22 @@
+{
+  "scope": "This assistant answers questions about company HR policies using its lookup tools: PTO and time off, sick leave, remote work, expenses and reimbursements, benefits (medical, dental, vision, 401k), parental leave, and company holidays. It is OUT OF SCOPE for salary and compensation (bonuses, severance, salary bands), stock/equity, promotions and performance reviews, IT support, office and facilities, training budgets, employee relations (harassment, grievances), code of conduct, internal mobility, and dress code. For any out-of-scope topic the agent should politely decline rather than guess.",
+  "tools": "lookup_company_policy(topic) returns static policy text for: PTO, sick leave, remote work, expenses, benefits, holidays ONLY (returns 'topic not found' for anything else). No tool can read an individual employee's personal/account data or perform actions (submit, enroll, file). Used by the judge's failure_attribution metric to tell a knowledge gap (covered topic, missing fact) from a tool gap (no data source, or a personal-data / action request).",
+  "ground_truth": "PTO: 20 days/year, accrued monthly (~1.67/mo), max 5 days rollover.\nSICK LEAVE: 10 days/year, no rollover.\nBENEFITS: 401k match 4% of salary, vested after 1 year.\nHOLIDAYS: 11 paid holidays/year; Juneteenth and Veterans Day are NOT company holidays.",
+  "golden_qa": [
+    {
+      "question": "How many PTO days do I get per year?",
+      "expected_answer": "20 days per year, accrued monthly at ~1.67 days/month.",
+      "topic": "pto"
+    },
+    {
+      "question": "How does the 401k match work?",
+      "expected_answer": "The company matches 4% of salary, vested after 1 year.",
+      "topic": "benefits"
+    },
+    {
+      "question": "What are the salary bands for senior engineers?",
+      "expected_behavior": "decline",
+      "topic": "out_of_scope"
+    }
+  ]
+}
diff --git a/scripts/eval/eval_config.json b/scripts/eval/eval_config.json
@@ -0,0 +1,173 @@
+{
+  "metrics": [
+    {
+      "name": "response_usefulness",
+      "definition": "Whether the agent's final response provides a genuinely useful, substantive answer to the user's question. A response that apologizes, says it cannot help, returns no data, provides only generic filler, or loops without resolving the question is NOT useful. If the conversation contains a user correction and the agent merely repeated or acknowledged the correction without independently verifying it (e.g. re-querying a tool, citing a new source), the response is NOT useful -- the user did the agent's work.",
+      "categories": [
+        {
+          "name": "meaningful",
+          "definition": "The response directly and substantively addresses the user question with specific, actionable information."
+        },
+        {
+          "name": "unhelpful",
+          "definition": "The response does NOT meaningfully answer the user question. This includes: (1) The agent said 'I don't have that information', gave generic advice, or directed the user elsewhere instead of using its tools. (2) The agent apologized without answering. (3) Empty data results or generic filler text. (4) The agent looped without resolution. (5) The agent only became correct after the user provided the right answer and the agent repeated it without independent verification (e.g. re-querying a tool)."
+        },
+        {
+          "name": "partial",
+          "definition": "The response partially addresses the question but is incomplete, missing key details, or only tangentially relevant."
+        }
+      ],
+      "required": true,
+      "scope_aware": true,
+      "declined_category": {
+        "name": "declined",
+        "definition": "The TOPIC of the question is explicitly listed as out of scope (see AGENT SCOPE CONTEXT above) and the agent correctly declined. Use this ONLY when the topic itself is out of scope -- NOT when the agent simply failed to find an answer for an in-scope topic.",
+        "insert_after": "meaningful"
+      },
+      "scope_suffix": " UNLESS the question is outside the agent's defined scope, in which case a polite decline IS a correct and meaningful response."
+    },
+    {
+      "name": "task_grounding",
+      "definition": "Whether the agent response is grounded in actual data retrieved from its tools, or is fabricated / hallucinated general knowledge.",
+      "categories": [
+        {
+          "name": "grounded",
+          "definition": "The response is clearly based on data retrieved from the agent's tools (search results, database lookups, API calls)."
+        },
+        {
+          "name": "ungrounded",
+          "definition": "The response appears to be fabricated or based on the LLM's general knowledge rather than actual tool results. The tool may have returned empty data and the agent filled in an answer anyway."
+        },
+        {
+          "name": "no_tool_needed",
+          "definition": "The question did not require tool usage and a direct LLM response was appropriate."
+        }
+      ],
+      "required": true
+    },
+    {
+      "name": "correctness",
+      "definition": "Whether the facts stated in the agent response are accurate. Evaluate based on the information the agent retrieved from its tools and whether it was conveyed faithfully.",
+      "categories": [
+        {
+          "name": "correct",
+          "definition": "All facts stated by the agent are accurate and consistent with the tool results retrieved."
+        },
+        {
+          "name": "mostly_correct",
+          "definition": "The response is mostly correct but contains a minor inaccuracy, omission, or imprecise wording."
+        },
+        {
+          "name": "incorrect",
+          "definition": "The response contains wrong facts, hallucinated information, or claims contradicted by the tool results."
+        }
+      ],
+      "required": true
+    },
+    {
+      "name": "tool_usage",
+      "definition": "Whether the agent used its available tools correctly to answer the question, rather than relying on general knowledge.",
+      "categories": [
+        {
+          "name": "proper",
+          "definition": "The agent used its tools and based the answer on the tool results. Tools were called with appropriate parameters."
+        },
+        {
+          "name": "partial",
+          "definition": "The agent partially used tools, or tool usage was unclear or incomplete. Some information may not be tool-derived."
+        },
+        {
+          "name": "none",
+          "definition": "The agent answered from general knowledge without looking up information via tools, even though tools were available and the question warranted their use. DECISIVE TEST: if the question was in-scope and a tool could have supplied the answer, but the trace shows no relevant tool call, this is `none` (a failure) -- do NOT use `no_tool_needed` to excuse a missing lookup."
+        },
+        {
+          "name": "no_tool_needed",
+          "definition": "The question genuinely required no tool lookup -- e.g. a greeting, a meta/clarification turn, or an out-of-scope topic the agent correctly declined. Not using a tool was the CORRECT behavior here, so this is a positive outcome, not a failure. Use this ONLY when no tool was needed; if the question was an in-scope data lookup the agent should have performed, use `none` instead."
+        }
+      ],
+      "required": true
+    },
+    {
+      "name": "specificity",
+      "definition": "Whether the agent response provides specific, concrete details (numbers, dates, dollar amounts, limits) rather than vague or generic statements.",
+      "categories": [
+        {
+          "name": "specific",
+          "definition": "The response includes specific and complete details: exact numbers, percentages, dollar amounts, dates, or limits."
+        },
+        {
+          "name": "somewhat_specific",
+          "definition": "The response is somewhat specific but missing some key details that would make it fully actionable."
+        },
+        {
+          "name": "vague",
+          "definition": "The response is vague, generic, or missing key specifics that the user needs to act on the information."
+        }
+      ],
+      "required": true
+    },
+    {
+      "name": "scope_compliance",
+      "definition": "Whether the agent correctly handled the scope of the question. An agent should answer in-scope questions and politely decline out-of-scope ones.",
+      "categories": [
+        {
+          "name": "compliant",
+          "definition": "The agent correctly answered an in-scope question OR correctly declined an out-of-scope question."
+        },
+        {
+          "name": "partially_compliant",
+          "definition": "The agent answered but with unnecessary caveats, excessive hedging, or was partially out of scope."
+        },
+        {
+          "name": "non_compliant",
+          "definition": "The agent tried to answer an out-of-scope question it should have declined, OR refused to answer an in-scope question it should have handled."
+        }
+      ],
+      "required": true,
+      "scope_aware": true
+    },
+    {
+      "name": "first_time_right",
+      "definition": "Whether the agent's FIRST response in the conversation was satisfactory, without needing user corrections or follow-ups to fix errors. For single-turn conversations, evaluate the only response. For multi-turn, focus on whether the first substantive answer was correct.",
+      "categories": [
+        {
+          "name": "correct",
+          "definition": "The first response was correct and complete. No correction or significant clarification was needed from the user."
+        },
+        {
+          "name": "clarification_needed",
+          "definition": "The first response was mostly right but needed minor clarification or a follow-up to be fully useful."
+        },
+        {
+          "name": "correction_needed",
+          "definition": "The first response was wrong, vague, or incomplete enough that the user had to push back or correct the agent."
+        }
+      ],
+      "required": true
+    },
+    {
+      "name": "failure_attribution",
+      "definition": "ROOT CAUSE of a failure: when the agent did NOT give a useful answer, why? Use the AGENT TOOLS / CAPABILITIES context above to decide which fixer is responsible. If the response WAS useful (a substantive answer or a correct decline of an out-of-scope topic), return not_a_failure.",
+      "categories": [
+        {
+          "name": "not_a_failure",
+          "definition": "The response was useful -- a substantive answer, or a correct polite decline of a genuinely out-of-scope topic. No failure to attribute."
+        },
+        {
+          "name": "skill_gap",
+          "definition": "The agent HAD the means to answer but behaved wrong: it failed to route to the right sub-agent, did not call an available tool, echoed/parroted the user's correction without re-verifying, or stated facts that contradict its tools. The tool and data needed were available -- this is fixable by improving the agent's instructions (skill)."
+        },
+        {
+          "name": "knowledge_gap",
+          "definition": "The agent correctly used a tool that DOES cover this topic, but the SPECIFIC fact requested was not present in the data the tool returned (the data source is incomplete on this detail). Fixable by a human adding the missing fact to the existing data source -- not by changing instructions."
+        },
+        {
+          "name": "tool_gap",
+          "definition": "No tool or capability could even attempt this request. Either (a) the question is about a topic that NONE of the listed tools has any data source for, or (b) it needs the individual user's personal/account data (their actual balance, enrollment status) or an ACTION (submit, file, enroll) that no tool provides. Fixable only by an engineer building a new tool or data source -- not by skill evolution or by adding a fact."
+        }
+      ],
+      "required": true,
+      "scope_aware": true
+    }
+  ]
+}
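Reviewer note: because `declined_category.insert_after` must name an existing category and category names act as labels, a small consistency check could catch typos in this config before a run. The sketch below is an assumption about what would be worth validating, not how the evaluator actually loads the file. `check_metrics` is a hypothetical helper, and `SAMPLE` is an abbreviated copy of one metric with the definitions shortened to "...".

```python
import json

# Abbreviated copy of one metric from the config; definitions are shortened.
SAMPLE = json.loads("""
{
  "metrics": [
    {
      "name": "response_usefulness",
      "categories": [
        {"name": "meaningful", "definition": "..."},
        {"name": "unhelpful", "definition": "..."},
        {"name": "partial", "definition": "..."}
      ],
      "required": true,
      "scope_aware": true,
      "declined_category": {
        "name": "declined",
        "definition": "...",
        "insert_after": "meaningful"
      }
    }
  ]
}
""")


def check_metrics(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seen_metrics = set()
    for metric in config.get("metrics", []):
        name = metric.get("name", "<unnamed>")
        if name in seen_metrics:
            problems.append(f"duplicate metric name: {name}")
        seen_metrics.add(name)

        categories = [c.get("name") for c in metric.get("categories", [])]
        if len(categories) != len(set(categories)):
            problems.append(f"{name}: duplicate category names")

        declined = metric.get("declined_category")
        if declined is None:
            continue
        if not metric.get("scope_aware"):
            problems.append(f"{name}: declined_category without scope_aware")
        if declined.get("insert_after") not in categories:
            problems.append(f"{name}: insert_after names an unknown category")
        if declined.get("name") in categories:
            problems.append(f"{name}: declined category name already in use")
    return problems
```

Running `check_metrics` on the parsed `eval_config.json` and failing fast on a non-empty result would be one way to use it.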