A well-designed agent should learn from its own mistakes. This demo implements that paradigm: a continuous self-improvement cycle where the agent's real-world failures become the training data for its next version. It is powered by the BigQuery Agent Analytics SDK and Vertex AI Prompt Registry. Prompts are stored, versioned, and optimized in Vertex AI.
For a guided walkthrough, see the Demo Narration.
The agent used in this demo is a company policy Q&A assistant, built with Google ADK and the BigQuery Agent Analytics Plugin.
It's deliberately simple: a single LLM agent with just two tools:
- lookup_company_policy(topic): retrieves detailed policy data on PTO, sick leave, remote work, expenses, benefits, and holidays.
- get_current_date(): returns today's date and day of the week, so the agent can answer date-relative questions like "Is next Friday a holiday?"
The agent's job is to answer employee questions — "How many PTO days do I get?", "What's the meal reimbursement limit?", "When is the next company holiday?", and so on.
The V1 prompt is intentionally flawed. It tells the agent to "answer from the knowledge above" — a short, incomplete summary baked into the prompt — and to say "I don't know, contact HR" for anything not listed. The result: the agent ignores its own tools, even though those tools have all the answers. Users get vague deflections instead of useful information.
| Flaw | Effect |
|---|---|
| "Answer from knowledge above" | Agent ignores its tools entirely |
| No expense/holiday info in prompt | Agent says "I don't know" instead of looking it up |
| Vague "competitive benefits" | Agent deflects or hallucinates benefit details |
| No date handling guidance | Agent cannot resolve "next Friday" |
The tools have all the data. The flaw is that the prompt discourages the agent from using them. By running the self-improvement cycle, the system detects these failures, generates correct answers using a teacher agent, optimizes the prompt through the Vertex AI Prompt Optimizer, and produces a new version that actually uses the tools. The agent fixes itself.
Note: observed model behavior with V1. The model does not fail uniformly across all topics. For topics mentioned in V1's inline knowledge (PTO, sick leave, remote work), the model often calls lookup_company_policy anyway, even though the prompt says "answer from the knowledge above." The inline mention acts as a signal that the topic is valid, which encourages the model to explore available tools for more detail. For topics not mentioned in the prompt (expenses, holidays, parental leave), the explicit fallback instruction ("tell the user you do not have that information and suggest they contact HR") overrides tool exploration. The model obeys the refusal rule because nothing in the prompt hints that the tool could answer the question. This means the V1 failures are concentrated on topics absent from the prompt, not on all topics. The improvement cycle discovers these gaps through synthetic traffic and fixes them by rewriting the prompt to always use tools first.
When you design eval cases for an agent, you are guessing what users will ask. You cover the happy paths, maybe some edge cases, but you cannot anticipate every real question. The agent ships, users interact with it, and some of those interactions fail in ways your tests never predicted.
This demo shows how to close that gap using four components:
- BigQueryAgentAnalyticsPlugin captures every real agent session (questions, tool calls, responses) into BigQuery automatically.
- SDK quality_report.py (the SDK's evaluation script) reads those logged sessions back from BigQuery, evaluates quality using an LLM judge, and produces structured reports that drive automated improvement.
- SDK SystemEvaluator (the SDK's deterministic evaluator) checks operational metrics (latency, token efficiency, and turn count) on the same sessions. No LLM calls needed, just math on the data already in BigQuery. This ensures the improved prompt doesn't trade quality for cost.
- Vertex AI Prompt Registry stores and versions the agent's prompt in the cloud. The Vertex AI Prompt Optimizer generates improved prompts using synthetic ground truth from a teacher model.
The full cycle:
- GENERATE SYNTHETIC TRAFFIC: Gemini produces diverse user questions to test the agent beyond anticipated scenarios.
- RUN TRAFFIC THROUGH AGENT: Process traffic through the agent and log every trace/session into BigQuery.
- EVALUATE SESSION QUALITY: SDK scripts read logged sessions; an LLM judge scores them for usefulness and grounding.
- IMPROVE PROMPT: The core optimization stage consists of four sub-steps:
  - Extract: Failed cases are moved into the golden eval set to raise the performance bar.
  - Teacher Agent: Generates ground truth by re-answering failed questions with tool-mandated logic.
  - Optimize: Vertex AI Prompt Optimizer generates a new candidate prompt.
  - Validate (Regression Gate): The candidate is tested against the full golden eval set.
- MEASURE IMPROVEMENT: Verify the improved prompt against fresh traffic to quantify the quality jump.
At each evaluation step (3 and 5), the SDK's deterministic
SystemEvaluator also checks latency, token efficiency, and turn count.
Step 3 establishes the operational baseline; Step 5 shows the
before/after comparison to verify the improved prompt didn't regress
on cost or performance. No extra agent runs — just math on the session
data already in BigQuery.
By default, the script runs a single cycle and stops. This is the safe default -- each cycle makes dozens of Gemini API calls, and running multiple cycles unintentionally can lead to unexpected costs.
To run multiple improvement cycles, use --auto --cycles N. The
--auto flag enables auto-cycling, which runs up to N cycles and
stops early once quality meets the quality_threshold setting in
config.json (default: 0.95, i.e. 95% meaningful).
Why 95% and not 100%? LLM output is non-deterministic. At N=100 traffic, a single stochastic misfire causes a 1% drop. Setting the threshold to 100% leads to cycles that fight random variance rather than fix systematic gaps. The 95% default means: stop when real failures are gone, don't chase noise. If the improvement step finds quality already at or above the threshold, it skips the optimizer entirely and the cycle moves on. If no new prompt version is produced, the measurement step (Step 5) is also skipped -- there is nothing to compare.
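The early-stop rule can be sketched in a few lines. This is a minimal illustration, not the script's actual code: the function names are hypothetical, and only quality_threshold, its 0.95 default, and the "meaningful" label come from this document.

```python
def quality_score(labels):
    """Fraction of sessions the judge labeled 'meaningful'."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "meaningful") / len(labels)


def should_stop(labels, quality_threshold=0.95):
    """Return True once quality is at or above the threshold."""
    return quality_score(labels) >= quality_threshold


# At N=100, one stochastic misfire costs exactly one percentage point.
one_misfire = ["meaningful"] * 99 + ["unhelpful"]

print(quality_score(one_misfire))      # 0.99
print(should_stop(one_misfire))        # True with the 0.95 default
print(should_stop(one_misfire, 1.0))   # False: a 100% bar would keep cycling
```

With a threshold of 1.0 the run above would trigger another cycle even though the only failure is random variance.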
The hero moment: quality typically climbs from ~60% to ~100% in a single cycle
(results vary due to non-deterministic LLM output). With the default
N=10 traffic, the improvement step typically succeeds on the first
optimizer attempt. At higher traffic volumes (--traffic-count 100),
the system discovers more failures but max_failure_extract: "auto"
applies category-aware selection to extract a representative subset
(~12 cases from ~42 failures in a typical run), keeping the regression
gate strict but manageable. Use --auto --cycles 3 for higher-N runs
to give the optimizer multiple cycles to converge if needed.
Static eval suites go stale. Users ask questions you never anticipated. The plugin captures every real interaction, and the SDK's quality evaluation scores them automatically. The improvement step reads those scores, generates ground truth via a teacher agent, optimizes the prompt with the Vertex AI Prompt Optimizer, and extracts the failures into the golden eval set so that any regression on them is caught by the gate.
Each cycle, the golden eval set grows with cases sourced from actual failures. Over time, your tests reflect what users actually ask, not what you imagined they would ask.
config.json # Declarative config: agent module, prompt storage,
# model, eval paths, optimizer settings
agent/
agent.py # ADK agent (company policy Q&A assistant)
# Reads prompt from Vertex AI Prompt Registry
prompts.py # V1 seed prompt (used by setup/reset only)
tools.py # lookup_company_policy, get_current_date
eval/
eval_cases.json # Golden eval set (regression gate, grows each cycle)
generate_traffic.py # Generates synthetic user traffic via Gemini
run_eval.py # Runs eval/traffic cases via ADK InMemoryRunner
operational_metrics.py # Deterministic metrics gate (latency, tokens, turns)
agent_improvement/ # Reusable improvement module (works with any ADK agent)
config.py # ImprovementConfig dataclass
config_loader.py # Loads config.json, builds ImprovementConfig
improver_agent.py # LoopAgent + LlmAgent with tool-based workflow
eval_runner.py # Run eval cases + LLM judge
prompt_adapter.py # PromptAdapter ABC + VertexPromptAdapter +
# PythonFilePromptAdapter
tool_introspection.py # Auto-extract tool signatures from agent tools
prompts.py # Default judge/improver prompt templates
run_improvement.py # Entry point: loads config.json, runs improvement
setup_vertex.py # Creates/resets Vertex AI prompt (called by setup.sh)
reports/ # Generated reports, eval results, ground truth
run_cycle.sh # Orchestrator: traffic -> eval -> quality -> improve
setup.sh # One-time setup (auth, deps, BigQuery, Vertex AI prompt)
reset.sh # Reset to V1 prompt, prompts.py, and 3 golden cases
show_prompt.sh # Display current prompt from Vertex AI (curl + jq)
All agent-specific settings live in a single declarative config file:
{
  "app_name": "company_info_agent",
  "agent_module": "agent.agent",
  "prompts_path": "agent/prompts.py",
  "prompt_variable": "CURRENT_PROMPT",
  "version_variable": "CURRENT_VERSION",
  "eval_cases_path": "eval/eval_cases.json",
  "traffic_generator": "eval/generate_traffic.py",
  "model_id": "gemini-2.5-flash",
  "optimizer_max_iterations": 3,
  "prompt_storage": "vertex",
  "vertex_prompt_id": "1234567890",
  "use_vertex_optimizer": true,
  "teacher_model_id": null
}
To point the cycle at a different agent, create a config.json for it
and pass --agent-config /path/to/config.json.
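As a rough sketch of how such a file could map onto a dataclass: the field names and defaults below follow this README's configuration reference, but the load_config helper and its validation behavior are assumptions, not the actual config_loader.py.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional, Union


@dataclass
class ImprovementConfig:
    # Required fields (no default documented).
    app_name: str
    agent_module: str
    prompts_path: str
    eval_cases_path: str
    traffic_generator: str
    # Optional fields with the documented defaults.
    prompt_variable: str = "CURRENT_PROMPT"
    version_variable: str = "CURRENT_VERSION"
    model_id: str = "gemini-2.5-flash"
    optimizer_max_iterations: int = 3
    prompt_storage: str = "python_file"
    vertex_prompt_id: str = ""
    vertex_project: Optional[str] = None  # assumption: None means "resolve from gcloud"
    vertex_location: str = "us-central1"
    use_vertex_optimizer: bool = False
    teacher_model_id: Optional[str] = None
    max_failure_extract: Union[int, str, None] = None


def load_config(text: str) -> ImprovementConfig:
    """Parse JSON text and reject keys the dataclass does not know (hypothetical helper)."""
    raw = json.loads(text)
    known = {f.name for f in fields(ImprovementConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return ImprovementConfig(**raw)


example = """{
  "app_name": "company_info_agent",
  "agent_module": "agent.agent",
  "prompts_path": "agent/prompts.py",
  "eval_cases_path": "eval/eval_cases.json",
  "traffic_generator": "eval/generate_traffic.py",
  "prompt_storage": "vertex",
  "use_vertex_optimizer": true
}"""

config = load_config(example)
print(config.prompt_storage, config.model_id)
```

A missing required key surfaces as a TypeError from the dataclass constructor, which is usually enough for a small demo config.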
The agent's prompt is stored in the Vertex AI Prompt Registry, not in a local file. This gives you:
- Cloud-native versioning: each improvement creates a new version
- Audit trail: full history of prompt changes with metadata
- API access: read/write via vertexai.Client().prompts
- Local mirroring: each update is also written to agent/prompts.py so changes are visible in git diff
The VertexPromptAdapter handles all reads and writes. On startup,
agent.py fetches the current prompt from the registry via the
VERTEX_PROMPT_ID environment variable. The improvement cycle writes
new versions back through the same adapter and mirrors them to
agent/prompts.py via the PythonFilePromptAdapter.
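The write-then-mirror pattern described above can be sketched without any cloud calls. Everything here is illustrative: the read/write method names are assumptions about the PromptAdapter interface, and InMemoryPromptAdapter is a hypothetical stand-in for both the registry-backed and file-backed adapters.

```python
from abc import ABC, abstractmethod


class PromptAdapter(ABC):
    """Hypothetical interface: read the active prompt, write a new version."""

    @abstractmethod
    def read(self) -> str: ...

    @abstractmethod
    def write(self, prompt: str) -> int: ...


class InMemoryPromptAdapter(PromptAdapter):
    """Stand-in storage that keeps every version in a list."""

    def __init__(self, seed: str):
        self.versions = [seed]

    def read(self) -> str:
        return self.versions[-1]

    def write(self, prompt: str) -> int:
        self.versions.append(prompt)
        return len(self.versions)  # 1-based version number


class MirroringAdapter(PromptAdapter):
    """Reads from the primary store and copies each write to a mirror."""

    def __init__(self, primary: PromptAdapter, mirror: PromptAdapter):
        self.primary = primary
        self.mirror = mirror

    def read(self) -> str:
        return self.primary.read()

    def write(self, prompt: str) -> int:
        version = self.primary.write(prompt)
        self.mirror.write(prompt)
        return version


registry = InMemoryPromptAdapter("V1 prompt")
local_file = InMemoryPromptAdapter("V1 prompt")
adapter = MirroringAdapter(registry, local_file)
new_version = adapter.write("V2 prompt")
print(new_version, adapter.read(), local_file.read())
```

Keeping the primary store as the single source for reads means a failed mirror write cannot change which prompt the agent loads.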
To inspect the current prompt from the command line:
./show_prompt.sh # Display current prompt text
./show_prompt.sh --versions # List all versions
setup.sh creates the initial prompt resource automatically.
reset.sh deletes it and creates a fresh one at V1, and restores
agent/prompts.py to its original state.
The cycle displays the current prompt at the start and end of each run so you can see exactly what changed.
When the cycle identifies failed sessions, it uses the Vertex AI Prompt Optimizer to generate improved prompts:
- Identify failures: Extract sessions scored as "unhelpful" or "partial" from the quality report.
- Generate ground truth: A "teacher agent" (same tools, better prompt) re-answers each failed question to produce what the correct response should have been. See below for details.
- Optimize: Feed the current prompt + (question, bad_response, ground_truth) triples to the Vertex AI Prompt Optimizer in target_response mode.
- Validate: Test the optimized prompt against the full golden eval set before accepting it.
The optimizer also receives the agent's tool signatures, auto-extracted
from the Python functions by tool_introspection.py using inspect --
function name, parameter types, and full docstrings. These are appended
to the prompt as plain text so the optimizer knows what tools exist and
what arguments they accept. This is how the V2 prompt ends up with
explicit topic-to-tool mappings: the optimizer saw the tool's signature,
saw the teacher successfully calling it with specific arguments, and
generated routing instructions accordingly. If the optimizer's output
strips the tool references (which it tends to do), they are
re-appended automatically.
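A minimal sketch of signature extraction with the standard inspect module is shown below. The helper names and the "Available tools:" marker are assumptions for illustration, and the tool body is a stub rather than the demo's real lookup.

```python
import inspect


def describe_tool(func) -> str:
    """Render a tool's name, typed signature, and docstring as plain text."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or "(no docstring)"
    return f"{func.__name__}{signature}\n{doc}"


def append_tool_reference(prompt: str, tools) -> str:
    """Append tool descriptions unless the prompt already carries them."""
    marker = "Available tools:"  # hypothetical marker text
    if marker in prompt:
        return prompt
    described = "\n\n".join(describe_tool(tool) for tool in tools)
    return f"{prompt}\n\n{marker}\n{described}"


def lookup_company_policy(topic: str) -> dict:
    """Look up policy details for a topic such as 'expenses' or 'holidays'."""
    return {"topic": topic}  # stub body; the real tool returns policy data


optimized = "Use tools first, then answer from the tool results."
with_tools = append_tool_reference(optimized, [lookup_company_policy])
print(with_tools)
```

Calling append_tool_reference a second time leaves the prompt unchanged, which is the property needed when the optimizer sometimes keeps the tool section and sometimes strips it.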
This replaces raw "ask Gemini to rewrite the prompt" with a structured optimization pipeline backed by real failure data.
The Vertex AI Prompt Optimizer needs labeled examples — pairs of (input, expected_output) — to learn from. This is the same principle as supervised learning in ML: you can't improve a model without showing it what "correct" looks like.
But where do the expected outputs come from? You don't have hand-written reference answers for every possible user question, especially not for questions discovered from synthetic traffic that you never anticipated. Writing golden answers manually doesn't scale — and the whole point is to handle questions you didn't predict.
The solution is the teacher agent. It borrows a concept from knowledge distillation in ML, where a "teacher" model generates training data for a "student" model. Here the teacher isn't a bigger model — it's the same model with the same tools, just with a different prompt:
You are an expert assistant. For EVERY question, you MUST call
the available tools to look up the answer. NEVER say 'I don't
know' or defer the user elsewhere. ALWAYS use the tools first, then answer
based on the tool results. Be specific and thorough.
The teacher's job is narrow: produce correct, tool-grounded answers to questions the target agent failed on. It's not a replacement for the target agent — it's a data generator. Think of it as an oracle that knows how to use the tools correctly, but has no domain-specific personality, formatting, or routing logic.
The key insight: the V1 agent fails not because the tools are broken
or the model is incapable, but because the V1 prompt actively
discourages tool use. The teacher prompt removes that barrier.
The teacher calls lookup_company_policy("expenses") and gets a
correct answer; the target agent with V1 never tries.
The full flow:
Failed sessions from quality report
|
v
Teacher agent re-answers each failed question
(same tools, same model, tool-first prompt)
|
v
Produces labeled triples:
(question, bad_response, ground_truth)
|
v
Vertex AI Prompt Optimizer
(target_response mode — learns from the triples)
|
v
Optimized prompt that steers the target agent
toward tool-grounded answers
The teacher's answers are saved to
reports/run_YYYYMMDD_HHMMSS/ground_truth_latest.json for inspection.
Each entry contains the original question, the bad response from the
target agent, and the teacher's ground truth answer.
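The shape of one ground truth entry can be sketched as follows. The three field names come from the triple described above; the session dictionary keys, the build_ground_truth helper, and the fake teacher are assumptions made so the example runs without any model call.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class GroundTruthEntry:
    question: str
    bad_response: str
    ground_truth: str


def build_ground_truth(failed_sessions, teacher: Callable[[str], str]):
    """Re-answer each failed question with the teacher and keep all three parts."""
    return [
        GroundTruthEntry(
            question=session["question"],
            bad_response=session["response"],
            ground_truth=teacher(session["question"]),
        )
        for session in failed_sessions
    ]


def fake_teacher(question: str) -> str:
    # Placeholder: a real teacher would run the agent with the tool-first prompt.
    return f"[teacher answer for: {question}]"


failed = [
    {
        "question": "What's the meal reimbursement limit?",
        "response": "I don't know, contact HR.",
    }
]
entries = build_ground_truth(failed, fake_teacher)
payload = json.dumps([asdict(entry) for entry in entries], indent=2)
print(payload)
```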
This is the natural question: if the teacher works, why not deploy it?
The teacher prompt is generic — "always use tools, be thorough." It works for producing correct answers but it lacks everything a production agent needs:
- Topic routing: A complex agent with 10+ tools needs to know which tool to call for which question. "Use tools" doesn't tell the agent to call lookup_company_policy("benefits") when someone asks about their 401k.
- Response style: The teacher gives verbose, unstructured answers. A production prompt defines formatting, tone, and what to include or omit.
- Edge case handling: The teacher doesn't know about policy exceptions, date-relative logic, or when to combine multiple tool calls.
- Domain vocabulary: The teacher doesn't know that "WFH" means remote work, or that "time off" maps to PTO.
The optimizer reads the ground truth examples and produces a prompt that is both correct (uses tools) and tailored (knows the domain mappings, response format, and edge cases). The teacher generates the training data; the optimizer generates the production prompt.
In this demo, the agent is simple enough that the distinction is subtle — the teacher's generic prompt happens to work well for 2 tools and 6 topics. In a real system with complex tool routing, multi-step workflows, and nuanced response requirements, the gap between "generic tool-first" and "optimized domain-specific" is significant.
This demo uses two distinct sets of questions:
- Golden eval set (eval_cases.json): The regression gate. These cases must always pass. The set starts with 3 cases that V1 handles correctly and grows each cycle as failed synthetic cases are extracted into it.
- Synthetic traffic: Generated fresh each cycle by Gemini. These simulate diverse, unpredictable user questions that differ from the golden set. They are the source of new failures that drive improvement.
Step 1: Generate Synthetic Traffic -- generate_traffic.py calls
Gemini to produce diverse, realistic employee questions, intentionally
different from the golden eval set.
Step 2: Run Traffic -- run_eval.py sends questions to the agent
using ADK's InMemoryRunner. Sessions are logged to BigQuery via the
BigQueryAgentAnalyticsPlugin.
Step 3: Evaluate Quality -- The SDK's quality_report.py reads
sessions from BigQuery and scores each one on response_usefulness
(meaningful/partial/unhelpful) and task_grounding (grounded/ungrounded).
The SDK's SystemEvaluator also runs deterministic checks on the same
sessions — latency, token efficiency, and turn count — to establish
an operational baseline.
Step 4: Improve Prompt -- An ADK LoopAgent wrapping an LlmAgent with six tools:
LoopAgent("prompt_improver", max_iterations=3)
+-- LlmAgent("prompt_engineer")
tools: read_quality_report, read_current_prompt,
generate_candidate, test_candidate,
write_prompt, exit_loop
The generate_candidate tool uses the Vertex AI Prompt Optimizer with
synthetic ground truth from a teacher agent. The test_candidate tool
runs the full golden eval set. The write_prompt tool persists the
validated prompt to the Vertex AI Prompt Registry.
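The control flow of this step, stripped of ADK and Vertex AI, looks roughly like the loop below. It is a plain-Python sketch, not the LoopAgent implementation: generate_candidate and passes_case are injected callables standing in for the optimizer and the golden-set evaluation, and the 50-character minimum mirrors the length check this demo applies to candidate prompts.

```python
from typing import Callable, Optional, Sequence

MIN_PROMPT_LENGTH = 50  # candidates shorter than this are rejected


def improve_prompt(
    current_prompt: str,
    golden_cases: Sequence[dict],
    generate_candidate: Callable[[str, int], str],
    passes_case: Callable[[str, dict], bool],
    max_iterations: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes every golden case, else None."""
    for attempt in range(1, max_iterations + 1):
        candidate = generate_candidate(current_prompt, attempt)
        if len(candidate) < MIN_PROMPT_LENGTH:
            continue  # too short to be a usable prompt; try again
        if all(passes_case(candidate, case) for case in golden_cases):
            return candidate  # the caller persists it as the new version
    return None  # keep the current prompt


def fake_generate(prompt: str, attempt: int) -> str:
    if attempt == 1:
        return "Too short."
    return "Always call lookup_company_policy before answering a policy question."


def fake_passes(candidate: str, case: dict) -> bool:
    return "lookup_company_policy" in candidate


golden = [{"id": "case_1", "question": "How many PTO days do I get?"}]
result = improve_prompt("V1 prompt text", golden, fake_generate, fake_passes)
print(result)
```

Returning None instead of the best partial candidate keeps the gate strict: a prompt that fails even one golden case is never written.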
Step 5: Measure Improvement -- Fresh synthetic traffic is generated and scored against the improved prompt via BigQuery. The deterministic evaluators then compare V1 and V2 sessions side by side:
| Metric | What it checks | Default budget |
|---|---|---|
| latency | Average response time per session | 10,000 ms |
| token_efficiency | Total tokens consumed per session | 50,000 tokens |
| turn_count | Number of conversational turns | 10 turns |
This verifies the improved prompt didn't trade quality for cost — a
prompt that makes the agent chattier or triggers more retries would
show up here even if the quality score is 100%. The data is already in
BigQuery from Steps 2 and 5; no additional agent runs are needed. See
eval/operational_metrics.py.
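A deterministic budget check of this kind needs no model calls. The sketch below assumes each session has already been reduced to one number per metric; the session values are made-up illustration data, and the helper names are not taken from the SDK or from operational_metrics.py.

```python
from statistics import mean

# Default budgets from the table above.
DEFAULT_BUDGETS = {"latency": 10_000, "token_efficiency": 50_000, "turn_count": 10}


def check_session(session, budgets=DEFAULT_BUDGETS):
    """Return pass/fail per metric for a single session."""
    return {metric: session[metric] <= limit for metric, limit in budgets.items()}


def average_delta(before, after, metric):
    """Change in the metric's average from the baseline run to the new run."""
    return mean(s[metric] for s in after) - mean(s[metric] for s in before)


# Illustrative numbers only, not measured results.
v1_sessions = [
    {"latency": 1200, "token_efficiency": 3000, "turn_count": 1},
    {"latency": 1800, "token_efficiency": 5000, "turn_count": 1},
]
v2_sessions = [
    {"latency": 2000, "token_efficiency": 6000, "turn_count": 1},
    {"latency": 12000, "token_efficiency": 9000, "turn_count": 2},
]

print(check_session(v2_sessions[1]))                  # latency exceeds its budget
print(average_delta(v1_sessions, v2_sessions, "latency"))  # 5500
```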
- Golden eval gate: Candidate prompts must pass ALL golden cases. Rejected if any fail, retried up to 3 times.
- Eval case extraction: Failed synthetic cases are added to the golden set before improvement, raising the bar each cycle. The max_failure_extract config controls how many cases are extracted (see Scaling extraction below).
- Question deduplication: Extracted cases are deduplicated by both ID and question text.
- Length check: Prompts shorter than 50 characters are rejected.
- Retry with backoff: Quality report step retries for BigQuery write propagation delays.
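Deduplicating by both ID and question text can be sketched as below. The case-insensitive, whitespace-collapsing normalization is an assumption; the source only states that both keys are checked, and the helper names are hypothetical.

```python
def normalize(question: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(question.lower().split())


def merge_failures(golden, failures):
    """Append failures to the golden set, skipping repeated IDs or questions."""
    seen_ids = {case["id"] for case in golden}
    seen_questions = {normalize(case["question"]) for case in golden}
    merged = list(golden)
    for case in failures:
        key = normalize(case["question"])
        if case["id"] in seen_ids or key in seen_questions:
            continue
        merged.append(case)
        seen_ids.add(case["id"])
        seen_questions.add(key)
    return merged


golden = [{"id": "g1", "question": "How many PTO days do I get?"}]
failures = [
    {"id": "t1", "question": "how many PTO days do I get? "},        # same text
    {"id": "t2", "question": "What's the meal reimbursement limit?"},
    {"id": "t2", "question": "When is the next company holiday?"},   # same ID
]
merged = merge_failures(golden, failures)
print([case["id"] for case in merged])  # ['g1', 't2']
```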
At the default traffic volume (N=10), the system typically discovers 3-5 failures, all of which are extracted into the golden eval set. The resulting regression gate (3 original + 3-5 extracted = 6-8 cases) is small enough for the optimizer to satisfy in one pass.
At higher volumes (--traffic-count 100), the system discovers
30-43 failures. Extracting all of them creates a regression gate of
40+ cases, which is often too strict for the optimizer to satisfy
in a single attempt. Many of these failures are redundant — 15
might be "benefits" questions that all fail the same way.
The max_failure_extract config field controls this:
| Value | Behavior |
|---|---|
| null (default) | Extract all failures: every unhelpful or partial session becomes a golden eval case. This is the right choice for the small-N demo (N=10) where there are only 3-5 failures. At higher traffic volumes it can overwhelm the optimizer (see below). |
| "auto" | Two-tier category-aware selection. Tier 1: one failure per category (breadth). Tier 2: fill proportionally from heaviest categories. Budget = 2 × number of failing categories. For 6 categories, that's ~12 cases. |
| Integer (e.g. 10) | Hard cap with category-aware selection. Same two-tier logic. |
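One way to realize the two-tier selection is sketched below. This is an interpretation of the description in the table, not the demo's code: the "category" key on each case and the floor-plus-leftover rule used for the proportional fill are assumptions.

```python
from collections import Counter, defaultdict


def select_failures(failures, budget="auto"):
    """Pick a representative subset: breadth first, then proportional fill."""
    by_category = defaultdict(list)
    for case in failures:
        by_category[case["category"]].append(case)
    if budget == "auto":
        budget = 2 * len(by_category)
    # Heaviest categories first; sorted() is stable, so ties keep input order.
    ordered = sorted(by_category, key=lambda c: len(by_category[c]), reverse=True)

    # Tier 1: one failure per category, capped by the budget.
    selected = [by_category[c][0] for c in ordered][:budget]

    # Tier 2: share remaining slots in proportion to each category's leftovers.
    remaining = {c: by_category[c][1:] for c in ordered}
    slots = budget - len(selected)
    total_left = sum(len(cases) for cases in remaining.values())
    if slots > 0 and total_left > 0:
        quota = {
            c: min(len(remaining[c]), slots * len(remaining[c]) // total_left)
            for c in ordered
        }
        leftover = slots - sum(quota.values())
        for c in ordered:  # hand rounding leftovers to the heaviest categories
            if leftover == 0:
                break
            if quota[c] < len(remaining[c]):
                quota[c] += 1
                leftover -= 1
        for c in ordered:
            selected.extend(remaining[c][: quota[c]])
    return selected


failures = (
    [{"id": f"b{i}", "category": "benefits"} for i in range(15)]
    + [{"id": f"e{i}", "category": "expenses"} for i in range(4)]
    + [{"id": f"h{i}", "category": "holidays"} for i in range(2)]
)
picked = select_failures(failures)
counts = Counter(case["category"] for case in picked)
print(len(picked), dict(counts))
```

With 21 failures across 3 categories, the "auto" budget is 6: every category is represented once, and the extra slots go to the category with the most failures.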
Example config for high-traffic runs:
{
  "max_failure_extract": "auto"
}
- Python 3.10+
- Google Cloud project with billing enabled
- gcloud CLI authenticated (gcloud auth application-default login)
The setup script enables the required APIs automatically:
- BigQuery API (bigquery.googleapis.com)
- Vertex AI API (aiplatform.googleapis.com)
Your authenticated user or service account needs these IAM roles:
| Role | Why |
|---|---|
| roles/bigquery.dataEditor | Create datasets, write agent session data |
| roles/bigquery.jobUser | Run BigQuery queries for evaluation |
| roles/aiplatform.user | Call Gemini models and Vertex AI Prompt APIs |
Set your GCP project:
export PROJECT_ID=my-project-id
./setup.sh
This installs dependencies (google-cloud-aiplatform, google-adk,
google-genai, etc.), verifies credentials, creates the BigQuery
dataset, and creates the initial V1 prompt in the Vertex AI Prompt
Registry. Improved prompts are mirrored to agent/prompts.py for
git tracking.
# Single improvement cycle (default — safe for experimentation)
./run_cycle.sh
# Auto-cycle: run up to 3 cycles, stop early when quality meets threshold (95%)
./run_cycle.sh --auto --cycles 3
# Exactly 3 cycles (no early stopping)
./run_cycle.sh --cycles 3
# Eval only (no improvement step)
./run_cycle.sh --eval-only
# Customize traffic volume
./run_cycle.sh --auto --cycles 3 --traffic-count 20
# Scaled run (N=100)
./run_cycle.sh --auto --cycles 5 --traffic-count 100
# Use a different agent's config
./run_cycle.sh --agent-config /path/to/other/config.json
The scaled run combines all the flags:
| Flag | What it does |
|---|---|
| --auto | Stop early when quality meets quality_threshold (default 95%) |
| --cycles 5 | Run up to 5 improvement cycles |
| --traffic-count 100 | Generate 100 synthetic questions per cycle (default: 10) |
All output is automatically logged to reports/run_YYYYMMDD_HHMMSS/run.log
(ANSI colour codes stripped for readability). Each run gets its own
timestamped subdirectory under reports/, so previous runs are preserved.
Cost note: Each cycle makes ~50-80 Gemini API calls (more with higher
--traffic-count). Running ./run_cycle.sh with no flags is always safe (1 cycle). Use --auto --cycles N only when you intentionally want multiple iterations.
The quality_report.sh wrapper can be run independently. Use
--env to point at the right .env file (otherwise it loads the
repo root .env which may target a different dataset):
# From anywhere — explicit .env
../../scripts/quality_report.sh \
--env .env \
--app-name company_info_agent \
--time-period all --limit 100
The --env flag is also available on quality_report.py directly.
Each run creates a timestamped subdirectory under reports/:
reports/
run_20260430_174920/ # one directory per run
run.log # full console output (ANSI stripped)
synthetic_traffic_cycle_1.json # generated questions (Step 1)
latest_eval_results.json # session IDs + responses (Step 2)
expected_session_ids_cycle_1.json # copy of eval results for BQ lookup
quality_report_cycle_1.json # LLM judge scores (Step 3)
operational_metrics_cycle_1_baseline.json # latency/tokens/turns (Step 3)
ground_truth_latest.json # teacher agent answers (Step 4)
synthetic_traffic_cycle_1_fresh.json # fresh questions (Step 5)
expected_session_ids_cycle_1_fresh.json # fresh session IDs (Step 5)
quality_report_cycle_1_after.json # post-improvement scores (Step 5)
operational_metrics_cycle_1.json # before/after comparison (Step 5)
run_20260430_183045/ # next run — previous runs are preserved
...
Previous runs are never deleted. reset.sh only resets the prompt
and golden eval set, not the reports directory.
# Browse runs
ls reports/
# Quality report JSON (consumed by the improver)
cat reports/run_*/quality_report_cycle_1.json | python3 -m json.tool | head -20
# Full console log
less reports/run_20260430_174920/run.log
# See new eval cases extracted from failures
cat eval/eval_cases.json
To start over, reset the prompt and golden eval set to their initial state. Previous run reports are preserved.
./reset.sh
This restores the V1 prompt in Vertex AI, resets eval_cases.json to
the original 3 golden cases, and removes generated synthetic traffic
files. The reports/ directory (with timestamped run subdirectories)
is not deleted.
The agent_improvement module is reusable. To apply it to a different
agent:
- Create a config.json with your agent's settings
- Ensure your agent module exports create_agent(prompt) -> Agent, AGENT_TOOLS, root_agent, and bq_logging_plugin
- Run: ./run_cycle.sh --agent-config /path/to/your/config.json
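Before pointing the cycle at a new agent, it can help to confirm the module exposes the four names listed above. The check below is a hypothetical pre-flight helper, not part of the agent_improvement module; it only tests that the attributes exist, not that they have the right types.

```python
import importlib
from types import SimpleNamespace

REQUIRED_EXPORTS = ("create_agent", "AGENT_TOOLS", "root_agent", "bq_logging_plugin")


def missing_exports(module) -> list:
    """Return the required names the module does not define."""
    return [name for name in REQUIRED_EXPORTS if not hasattr(module, name)]


def load_agent_module(module_path: str):
    """Import the module named in config.json and fail fast on missing exports."""
    module = importlib.import_module(module_path)
    missing = missing_exports(module)
    if missing:
        raise ImportError(f"{module_path} is missing: {', '.join(missing)}")
    return module


# Stand-in object so the example runs without a real agent package.
incomplete = SimpleNamespace(create_agent=lambda prompt: None, AGENT_TOOLS=[])
print(missing_exports(incomplete))  # ['root_agent', 'bq_logging_plugin']
```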
| Field | Default | Description |
|---|---|---|
| app_name | required | Agent name for BQ filtering |
| agent_module | required | Python module path (e.g. agent.agent) |
| prompts_path | required | Path to prompts.py (for V1 seed text and local mirroring) |
| prompt_variable | CURRENT_PROMPT | Variable name in prompts.py holding the active prompt |
| version_variable | CURRENT_VERSION | Variable name in prompts.py holding the version number |
| eval_cases_path | required | Path to golden eval set JSON |
| traffic_generator | required | Path to traffic generation script |
| model_id | gemini-2.5-flash | Gemini model for agent and judge |
| optimizer_max_iterations | 3 | Max Vertex AI Prompt Optimizer iterations per improvement step |
| prompt_storage | python_file | vertex or python_file |
| vertex_prompt_id | "" | Vertex AI prompt ID (auto-filled by setup) |
| vertex_project | from gcloud | GCP project for Vertex AI (defaults to env) |
| vertex_location | us-central1 | Vertex AI region |
| use_vertex_optimizer | false | Use Vertex AI Prompt Optimizer |
| teacher_model_id | null | Model for teacher agent (null = same as model_id) |
| max_failure_extract | null | Max failed cases to extract per cycle. null = extract all failures (best for the small-N demo where N<=20). "auto" = two-tier category-aware selection (~2x categories). Integer = hard cap with category-aware selection. See Scaling extraction. |
| Variable | Default | Description |
|---|---|---|
| PROJECT_ID | from gcloud | Google Cloud project ID |
| DATASET_ID | agent_logs | BigQuery dataset for session logs |
| DATASET_LOCATION | us-central1 | BigQuery dataset location |
| TABLE_ID | agent_events | BigQuery table name |
| DEMO_MODEL_ID | gemini-2.5-flash | Model for the demo agent |
| VERTEX_PROMPT_ID | from setup | Vertex AI prompt resource ID |
Each improvement cycle makes Gemini API calls for traffic generation, agent execution, quality evaluation, and prompt optimization. The Vertex AI Prompt Optimizer also runs the teacher model to generate ground truth for failed sessions. A typical single cycle uses ~50-80 Gemini calls; a 3-cycle run uses ~200-300.
Golden eval set growth: The golden eval set grows each cycle as
failed synthetic cases are extracted into it. Each improvement attempt
validates the candidate prompt against the full golden set (N agent
calls + N judge calls per attempt, up to optimizer_max_iterations retries).
After several cycles, the golden set can reach 20+ cases, increasing
both cost and runtime of the validation step. For long-running
deployments, consider periodically pruning redundant golden cases.
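The validation cost described above is simple arithmetic: one agent call plus one judge call per golden case, repeated for each attempt. The helper name below is hypothetical; the formula follows directly from that description.

```python
def validation_calls(golden_size: int, attempts: int = 3) -> int:
    """Worst-case model calls spent validating candidates in one improvement step."""
    agent_calls = golden_size * attempts
    judge_calls = golden_size * attempts
    return agent_calls + judge_calls


print(validation_calls(8))    # 48 calls for an 8-case gate, 3 attempts
print(validation_calls(20))   # 120 calls once the golden set reaches 20 cases
```

Pruning redundant golden cases reduces this number linearly, which is why it matters more as the set grows.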
- Your Agent Events Table Is Also a Test Suite -- Using the SDK's SystemEvaluator and categorical-eval CLI to gate PRs against production traces. Covers the same deterministic evaluators (latency, token efficiency, turn count, error rate) this demo uses in Steps 3 and 5.
- BigQuery Agent Analytics: From Logs to Graphs -- Visualizing agent session traces as interactive graphs. Shows how the BigQueryAgentAnalyticsPlugin captures the data that powers this improvement cycle.
- Sentiment analysis integration: Extend the quality evaluation to detect sentiment dips in agent responses. Use sentiment scores as an additional signal for the improvement cycle, identifying not just factually wrong answers but also responses that leave users frustrated or confused. Feed sentiment-flagged sessions to the optimizer alongside usefulness failures.

