Add verified self-evolving agent demo by sunpan9209 · Pull Request #284 · GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK

sunpan9209 · 2026-06-04T18:21:36Z

Summary

Add a self-evolving ADK agent demo that runs baseline traffic, analyzes BigQuery trace data with the SDK, generates an improved prompt, reruns the agent, and compares acceptance gates.
Include setup/reset/run scripts, deterministic basketball analytics fixtures, prompt diff artifacts, comparison reports, and a tracked live verification summary.
Preserve complete comparison artifacts even when a generated candidate fails a gate, so demo failures remain inspectable.

Verification

Live GCP-backed end-to-end run: quality 100% -> 100%, avg tokens 3512.5 -> 1419.5 (-59.6%), avg tool calls 3.0 -> 1.0, broad lookup calls 4 -> 0.
python3 -m pyink --config pyproject.toml examples/nba_self_evolving_demo
python3 -m isort examples/nba_self_evolving_demo
.venv313/bin/python -m compileall examples/nba_self_evolving_demo
bash -n examples/nba_self_evolving_demo/setup.sh examples/nba_self_evolving_demo/reset.sh examples/nba_self_evolving_demo/run_e2e_demo.sh
CLI help checks for run_agent.py, analyze_and_evolve.py, and compare_runs.py

caohy1988

Fresh review — PR #284 head `90f2818`

Locally checked out. Lint clean (pyink --check, isort --check-only, bash -n all pass; one git diff --check warning noted below). PR adds a 19-file, 2236-line, addition-only examples/nba_self_evolving_demo/ directory with an ADK agent + tools, a BigQuery trace analyzer, a prompt-evolution generator, a comparator with acceptance gates, three shell scripts, and verification artifacts. No prior reviews, no existing test coverage.

The shape is clean and the verification numbers (3512 → 1419 avg tokens, 4→0 broad lookups, quality flat at 100%) are reproducible from the wired pipeline. The findings below are: a handful of correctness/robustness bugs in the pipeline glue, several places where the demo can silently misrepresent what happened, documentation/dependency hygiene, and a handful of small polish items.

Blocker

B1 — Baseline `run_agent.py` is not run with `--allow-failures`, so a single non-deterministic V1 hiccup aborts the whole demo before analysis ever runs

run_e2e_demo.sh:60-63:

echo "[1/5] Run baseline V1 agent..."
"$PYTHON_BIN" run_agent.py \
  --label baseline \
  --output-dir "$REPORTS_DIR"

vs. step [3/5] which explicitly passes --allow-failures. run_agent.py:194-195 exits 1 when any case fails quality_passed, and set -euo pipefail halts the script. quality_passed requires both bool(response_text.strip()) and expected_tool_used — so a single V1 session where the LLM happens to call only lookup_nba_reference (a model whim, not a contract violation) kills the demo with no analysis, no comparison, no artifacts, and a misleading "ERROR" trail.

This bites the PR description's own promise: "Preserve complete comparison artifacts even when a generated candidate fails a gate, so demo failures remain inspectable." — that's true only after step [3/5]. Steps [1/5] and [4/5] (compare_runs.py exits 1 on failed gates) terminate set -e and leave a partial state.

Fix: pass --allow-failures to baseline too, and let analyze_and_evolve.py decide whether to promote based on quality["pass_rate"] >= 1.0 (line 284). That gate already exists — making baseline-failures fatal is just redundant.

Same concern for the comparator step: compare_runs.py:237-238 exits 1 when gates fail. Step [4/5] then halts set -e. The script never reaches echo "[5/5] Done.". Either accept this (artifacts are written first, line 192-206, so they do exist) and document it, or split compare_runs exit code from the "did it pass gates" signal (e.g. always exit 0, surface gate status via JSON only, and let the script print a summary).

High

H1 — Generator fallback path produces a hand-crafted prompt but is presented as if the model generated it

analyze_and_evolve.py:196-199:

except Exception as exc:
  fallback = _fallback_candidate()
  fallback["changes_summary"] += f" Generator fallback reason: {exc}"
  return fallback

_fallback_candidate() returns a fully-tuned, hand-crafted V2 prompt that includes the precise routing rules ("Player comparison: call compare_players", "Call lookup_nba_reference only when…") needed to pass every gate in compare_runs.py. If the genai client errors (quota, auth, transient 503, JSON validation, length-too-short), the demo silently substitutes that prompt and proceeds to step [3/5]. The final markdown report still says "The generated V2 prompt was created from the baseline trace summary, tool counts, …" — even when no model ran.

The fallback note ("Generator fallback reason: ...") is appended to changes_summary and surfaces in candidate_prompt.json / comparison.md. But it's easy to miss — and a user who never reads JSON might believe they witnessed self-evolution when they actually got the bundled fallback.

Two ways to fix:

Loud-fail mode (recommended for a demo whose point is "the agent evolves itself"): drop the fallback. If the generator fails, exit nonzero with the exception. The demo's claim is testable.
Loud-warn mode: keep the fallback for robustness, but print a prominent WARNING: generator failed; using bundled fallback prompt. The demo is no longer demonstrating self-evolution. to stderr, and set evolution["promoted_via_fallback"] = True so the JSON downstream can be filtered.

Either way, the fallback prompt and the generated prompt should be visually distinguishable in comparison.md (e.g. a "Source: model" vs "Source: bundled fallback" line at the top of the report).

H2 — `summarize()` empty-input return is missing `avg_llm_calls`; populated input includes it

analytics/session_metrics.py:212-222 (empty branch):

if not rows:
  return {
      "sessions": 0,
      "avg_total_tokens": 0.0,
      "avg_input_tokens": 0.0,
      "avg_output_tokens": 0.0,
      "avg_tool_calls": 0.0,
      "total_broad_lookup_calls": 0,
      "sessions_with_broad_lookup": 0,
      "broad_lookup_session_rate": 0.0,
      "total_tool_errors": 0,
  }

vs. populated branch (:229-240) which includes "avg_llm_calls". Same finding pattern as the M2 issue from PR #174 — the return shape is unstable across input sizes. Caller code paths that do summary["avg_llm_calls"] will KeyError whenever rows=[]. Currently no in-repo consumer hits this (verified: grep avg_llm_calls analyze_and_evolve.py compare_runs.py is empty), so it's a latent footgun rather than an immediate bug.

Fix: keep the keyset identical, set zeros in the empty branch.

H3 — Partial-results blind spot: `fetch_session_metrics` returns fewer rows than `session_ids` after retry exhaustion, and downstream silently computes averages on the incomplete subset

analytics/session_metrics.py:171-180:

rows: list[dict[str, Any]] = []
for attempt in range(1, attempts + 1):
  if wait_seconds and attempt > 1:
    time.sleep(wait_seconds)
  rows = [dict(r) for r in client.query(query, job_config=job_config).result()]
  if len(rows) >= len(set(session_ids)):
    break
return rows

If only 2 of 4 baseline sessions have landed after attempts=6 × wait_seconds=15 (90 seconds total), rows has 2 entries and the call returns. analyze_and_evolve.py:264-268 only raises when not rows — it does not check for the partial case. summarize() then divides by 2, producing an avg_total_tokens that pretends those are the whole baseline. The downstream gates then compare a 2-session "baseline" vs a 4-session "evolved" — apples and oranges.

Fix: at the call site (analyze_and_evolve.py:259-268), assert len(rows) == len(session_ids) or print a clear warning and exit. The streaming-insert race is the most common real-world failure mode for this kind of pipeline; the demo shouldn't silently hide it.

H4 — Token-extraction triple-coalesce silently returns 0 when none of the three schemas match

analytics/session_metrics.py:132-160 reads tokens via:

attributes.$.usage_metadata.prompt_token_count (ADK BQAA plugin recent schema)
content.$.usage.prompt (older schema)
attributes.$.input_tokens (alternate path)
falls through to 0 if none of those JSON paths exist

If the BQAA plugin's emitted attribute schema changes in a fourth way (it has changed before — see issue #94 history), total_tokens summed across sessions becomes 0 silently, the gates compare 0 vs 0 (which fails tokens_reduced), and the demo reports false failure.

Fixes:

Add a post-query sanity check: if sum(total_tokens) == 0 across all rows but event_count > 0, log a WARNING: token extraction produced zero tokens; the BQAA plugin's attribute schema may have changed. Inspect a session with LLM_RESPONSE events under \{config.table_ref}`.`
Optionally, in the SDK-evaluators path (which the PR already calls via run_sdk_evaluators), cross-check the SDK's reported avg-observed token count against the BQ-side aggregation. If they diverge, surface that.

H5 — `compare_runs._pct_delta(0, X) == 0.0` masks regressions when the baseline metric was zero

compare_runs.py:46-49:

def _pct_delta(before: float, after: float) -> float:
  if before == 0:
    return 0.0
  return round((after - before) / before, 4)

If baseline had 0 tool errors and evolved has 5, _pct_delta returns 0.0. That's then displayed as +0.0% in comparison.md and feeds the tokens_reduced gate as <= -0.05 → False (correctly). The math doesn't break a gate today, but the reported delta is wrong: "we went from 0 to 5 errors" should not render as "0%".

Fix: return float("inf") (or None and special-case rendering) when before == 0 and after > 0. Or print the raw delta alongside the percent for the metrics where percent is meaningless.

Medium

M1 — `CodeEvaluator` is the deprecated alias; the demo should use `SystemEvaluator`

analytics/session_metrics.py:252:

from bigquery_agent_analytics.evaluators import CodeEvaluator

Per the active PR #121, CodeEvaluator = SystemEvaluator is a backward-compat alias with CHANGELOG.md saying "deprecated but supported." A new example landing today should use the preferred name (SystemEvaluator) so it doesn't have to be migrated again next quarter. Also update README.md:16 ("SDK CodeEvaluator + trace SQL") and VERIFICATION.md:41 ("SDK CodeEvaluator checks for…") to match.

M2 — `agent/agent.py` mutates `os.environ` at import time

agent.py:53-55:

os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID or ""
os.environ["GOOGLE_CLOUD_LOCATION"] = AGENT_LOCATION
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "true"

Importing agent.agent rewrites three process-wide env vars unconditionally. PROJECT_ID or "" actively clears GOOGLE_CLOUD_PROJECT if PROJECT_ID couldn't be resolved — silently masking the user's actual configuration. The demo runs single-process so this is mostly cosmetic, but importing a module shouldn't have global side effects.

Fix: move these into a _configure_environment() helper that's only called from create_agent (or from a top-level main() somewhere). Don't clear with "" when the lookup failed — raise a clear error.

M3 — `bq_logging_plugin` is instantiated at module import; fails confusingly when `PROJECT_ID` is `None`

agent.py:42-44:

PROJECT_ID = os.getenv("PROJECT_ID") or os.getenv("GOOGLE_CLOUD_PROJECT")
if not PROJECT_ID:
  PROJECT_ID = _auth_project

If ADC also has no default project, PROJECT_ID stays None. Then BigQueryAgentAnalyticsPlugin(project_id=None, ...) constructs successfully (the plugin tolerates None in its constructor), but the first write will fail with a less-helpful error from the BQ client deep in plugin internals.

Fix: add an explicit guard near the top of agent.py:

if not PROJECT_ID:
  raise RuntimeError(
      "Could not resolve PROJECT_ID from .env, GOOGLE_CLOUD_PROJECT, or ADC. "
      "Run ./setup.sh or `gcloud config set project YOUR_PROJECT_ID`."
  )

Same applies in analytics/session_metrics.load_config:62-64 (which does have a guard — good); the agent module should match.

M4 — `BigQueryLoggerConfig(batch_size=1, shutdown_timeout=15.0)` is the slowest path; intentional?

agent.py:80-85:

config=BigQueryLoggerConfig(
    enabled=True,
    max_content_length=500 * 1024,
    batch_size=1,
    shutdown_timeout=15.0,
),

batch_size=1 forces a Storage Write API append per event. For a 4-case demo with ~3.5K tokens per session, you're likely emitting 20–40 events per session × 4 sessions × 2 runs = ~200+ individual append RPCs. Most demos would use the default batching to amortize, then rely on the shutdown_timeout to flush remaining buffered events.

If batch_size=1 is deliberate (to make individual rows show up in BigQuery faster for the streaming-insert race the analyzer worries about), say so in a code comment. Otherwise use default batching and rely on the 15s shutdown timeout to flush.

M5 — `analyze_and_evolve.py` promotion heuristic has two hardcoded magic numbers

analyze_and_evolve.py:285-288:

should_promote = (
    current_state["version"] == "v1"
    and quality["pass_rate"] >= 1.0
    and (
        summary["broad_lookup_session_rate"] >= 0.5
        or summary["avg_total_tokens"] > args.token_budget
    )
)

>= 1.0 and >= 0.5 are magic. The first is "every single session must pass quality" — a strict bar that, combined with B1, makes a single bad session abort the demo entirely. The second ("at least half of sessions used the broad tool") is a threshold the user can't tune without editing source.

Promote both to CLI flags with the current values as defaults (--min-quality-pass-rate 1.0, --min-broad-lookup-rate 0.5), and document them in README.

M6 — `analyze_and_evolve._observations` thresholds duplicated, not centralized

:70-85 hardcodes 0.5 (broad lookup) and 2.0 (avg tool calls) and token_budget for the observation strings. The same 0.5 repeats in should_promote (:287). These three thresholds are demo-tuning knobs that should live in one constants block (or CLI flags as M5) so a future maintainer doesn't have to grep two files to retune.

M7 — `run_e2e_demo.sh:39` prepends `$REPO_ROOT/src` to `PYTHONPATH` even after `setup.sh` already ran `pip install -e .`

export PYTHONPATH="$REPO_ROOT/src:$SCRIPT_DIR:${PYTHONPATH:-}"

If setup.sh succeeded, the SDK is on site-packages and the $REPO_ROOT/src prepend is redundant. If setup.sh wasn't run, the demo will fail earlier on missing google-adk / google-genai. So $REPO_ROOT/src is only useful if someone runs the script from a checkout without pip install, which the README explicitly tells them not to. Keep $SCRIPT_DIR (needed for from agent.agent import …); drop $REPO_ROOT/src.

Low / nits

L1 — `VERIFICATION.md` has a trailing blank line at EOF

git diff --check:

examples/nba_self_evolving_demo/VERIFICATION.md:95: new blank line at EOF.

Strip the trailing blank. Same lint hygiene CI catches in other PRs.

L2 — `BigQueryLoggerConfig.max_content_length = 500 * 1024` (500 KiB) is way larger than necessary

NBA tool responses are < 5 KiB. Demo-level concern only, but 500 KiB is the kind of number that looks like a copy-paste from a different example. If the goal is "never truncate," set it to a comfortable round number (50 KiB) or drop the override entirely and use the default.

L3 — `_fallback_candidate.changes_summary` reads "Fallback prompt generated from the detected broad-tool overuse pattern" even when broad-tool overuse wasn't detected

_fallback_candidate() returns the same string regardless of what observations actually flagged. If the generator failed for a non-broad-tool reason (token over budget, say), the fallback's changes_summary still claims it's responding to broad-tool overuse. Make the summary echo observations so the rationale matches the observed signal.

L4 — `analyze_and_evolve._generate_candidate_prompt` hardcodes `len(improved) < 120` as the "too short" threshold

:190-191. Why 120? A comment explaining the choice (or moving to a constant) would help.

L5 — `run_agent.py:88` quality only checks `expected_tool_used`, not `avoid_tool_used`

quality_passed = bool(response_text.strip()) and expected_tool_used

That's a deliberate design choice (broad-tool overuse is tracked separately via avoid_tool_used and broad_lookup_calls), but the local definition of "quality" is then "did the agent call the right tool at all?" — which can be True even when the agent also called lookup_nba_reference. This is the case for every V1 session per VERIFICATION.md (broad lookup calls = 4 out of 4, yet quality = 100%). Worth a one-line comment in run_agent.py explaining that quality_passed is intentionally the broader "did it answer correctly" signal and that avoid_tool overuse is the separate signal the demo evolves against.

L6 — `compare_runs._read_candidate_summary` silently returns "" on any exception

:52-61:

try:
  with open(path) as f:
    data = json.load(f)
  return str(data.get("changes_summary", ""))
except Exception:
  return ""

The bare except Exception swallows json.JSONDecodeError and FileNotFoundError. The "" return then makes comparison.md fall back to the generic blurb at :87-91. If candidate_prompt.json was malformed (bug, generator output corrupt, etc.), the user never finds out. Catch only FileNotFoundError here; let JSON errors surface.

L7 — `setup.sh` overwrites `.env` on every invocation

:93-105. Re-running ./setup.sh blows away any manual edit to .env. Probably fine for a demo, but worth a comment in the README:

Re-running setup.sh regenerates .env from current environment variables. To customize without losing the change, override via env vars before running setup (e.g. NBA_AGENT_MODEL=gemini-2.5-pro ./setup.sh).

L8 — `setup.sh` dataset-creation step is not location-idempotent

:86-91:

if ! bq show "${PROJECT_ID}:${NBA_DATASET_ID}" &>/dev/null 2>&1; then
  ...
  bq mk --dataset --location="$DATASET_LOCATION" ...
fi

If the dataset exists in a different location than the one in $DATASET_LOCATION, the bq show succeeds and no warning is issued. The subsequent BQAA plugin writes will then go to a dataset in the wrong region (or fail). Add a bq show --format=json | jq -r .location cross-check, or surface a warning if the existing dataset's location doesn't match.

L9 — `examples/README.md` description for the demo is one long sentence (56 lines wide)

examples/README.md:55 is 600+ characters on a single line in the markdown table. GitHub renders it fine, but the source is hard to skim/edit. Wrap to 80–100 chars per line, or break into bullets.

L10 — Cost claim "typically well under $1" appears twice without supporting math

README.md:92-96 and setup.sh:27-29. Based on VERIFICATION.md's actual numbers (~25K total tokens across the run at gemini-2.5-flash rates), real cost is in the $0.01–0.05 range. The conservative "well under $1" cushion is fine, but consider linking to Vertex AI pricing or showing the actual cost in VERIFICATION.md so a stakeholder can verify.

L11 — No unit tests for any of the new helpers

summarize, _pct_delta, _observations, _failure_class-style logic in _load_eval_contract, the gate booleans in compare_runs — all reasonable to test offline without BQ. Three or four pure-function tests under tests/test_nba_self_evolving_demo.py would prevent regressions on H2/H5 and the magic-number thresholds. Pattern would mirror tests/test_quality_report_helpers.py from the BQAA scripts.

L12 — Mermaid diagrams in `README.md` and `VERIFICATION.md` are nice; minor consistency

Three mermaid diagrams: TD/LR variants, sometimes with flowchart, sometimes with flowchart TD. Consistent direction (all TD or all LR) reads better, especially in VERIFICATION.md where the bar-chart "Run flow" feels visually closest to the README's pipeline diagram.

L13 — Productionization roadmap mermaid uses `Scheduler → Cloud Run Job → … → Canary traffic → C` cycle

README.md:170-178. The cycle goes back to C "Analyze recent BigQuery traces" from H "Canary traffic". That's the right shape, but the diagram has no termination — readers might wonder when human approval is bypassed. Either keep as-is (steady-state loop) or add an explicit "rollback to last accepted" edge from H back to F for completeness.

What looks great

The 5-step pipeline shape is the right shape: V1 baseline → trace analyze + generate → V2 rerun → before/after gates → JSON+MD artifacts. Self-contained, no shared state across runs.
run_agent.py:67-82 event consumption correctly walks event.content.parts and accumulates both text and tool names. The use of nonlocal response_text inside _consume() keeps the per-case scope clean.
Acceptance gates in compare_runs.py:172-179 are explicit, named, and individually testable. The result["gates"] dict that flows into both JSON and Markdown is good practice.
Concurrent eval execution via asyncio.Semaphore (run_agent.py:116-117) caps in-flight ADK runners — important because the InMemoryRunner with a single BQAA plugin has shared state.
Idempotent setup script structure: API enable, ADC check, dataset create only if missing, .env regeneration. The order matches what an operator would do by hand.
shutdown_timeout=15.0 on the plugin config + the attempts=6, wait_seconds=15 retry in the analyzer correctly hedges against BigQuery streaming insert latency. The shape is right; H3 just adds the missing partial-results check.
_write_prompt_diff using difflib.unified_diff is the right tool, and the resulting prompt_diff.md is the cleanest single-artifact "what changed in the prompt" view I've seen for an example like this.
VERIFICATION.md as a tracked file (vs the ignored reports/) is a great pattern — it pins a known-good live run to the source tree so reviewers can sanity-check claimed numbers without running anything.
Lint-clean PR (pyink, isort, bash -n, compileall). Only finding is the trailing blank-line at EOF flagged above.

Verdict — REQUEST_CHANGES on B1; H1–H5 + M1–M3 are tightening that materially improves trust

B1 (baseline --allow-failures) is the blocker — without it, one model whim aborts the demo before any of the actual self-evolution machinery runs.

H1 (fallback prompt presented as model output), H3 (partial-rows blind spot), H4 (token-extraction silent zero) are the three "the demo can silently misrepresent itself" findings — each cheap to fix, each meaningfully improves the trust users put in the verification numbers.

H2, H5, M1–M7, L1–L13 are tightening — bundle in or skip.

The PR is a clean, self-contained example with a real story and reproducible verification numbers. Once B1 + H1/H3/H4 land, this is mergeable, and the productionization roadmap section makes it a useful jumping-off point for the next demo iteration.

sunpan9209 · 2026-06-10T00:20:09Z

Updated the PR at head fdf14fc.

What changed:

Repositioned the example as self_evolving_agent_demo, a lightweight metric-driven self-evolution companion to agent_improvement_cycle.
Removed the domain-first visible naming from the folder, README, setup output, environment variables, dataset default, and demo index entry.
Addressed the blocker by allowing baseline result writing even when a V1 case fails, so analysis and artifacts can still run.
Removed the bundled prompt fallback; prompt generation now fails loudly instead of silently promoting a hand-authored candidate.
Added complete BigQuery row checks and zero-token schema sanity checks before metrics are trusted.
Stabilized empty metric summaries, fixed zero-baseline percent delta reporting, and made comparison gate failure an explicit JSON result instead of an unconditional shell abort.
Added setup dataset-location validation, clearer project resolution, smaller logged-content cap, and documented/tunable evolution thresholds.
Added focused offline tests for the helper behavior called out in review.

Verification:

Live renamed end-to-end run: quality 100% -> 100%, avg tokens 3640.2 -> 1479.8 (-59.4%), avg tool calls 2.5 -> 1.0, broad lookup calls 4 -> 0, candidate source model.
python3 -m pyink --config pyproject.toml examples/self_evolving_agent_demo tests/test_self_evolving_agent_demo.py
python3 -m isort examples/self_evolving_agent_demo tests/test_self_evolving_agent_demo.py
.venv313/bin/python -m compileall examples/self_evolving_agent_demo
bash -n examples/self_evolving_agent_demo/setup.sh examples/self_evolving_agent_demo/reset.sh examples/self_evolving_agent_demo/run_e2e_demo.sh
python3 -m pytest tests/test_self_evolving_agent_demo.py
CLI help checks for the three demo entrypoints

caohy1988

Re-review — PR #284 head `fdf14fc`

Checked out locally and verified every June 8 finding against the new head. All of them are fixed — the blocker, all five Highs, all seven Mediums I'd flagged as material, and most of the Lows — plus the rename to self_evolving_agent_demo is consistent end-to-end. The 5 new offline tests pass, pyink/isort clean, git diff --check clean, and the fresh live verification (run_20260609_171547) reports candidate source: model — directly closing the trust gap from H1.

Verified RESOLVED

B1 — baseline step now passes --allow-failures (run_e2e_demo.sh:63); a single V1 model whim no longer aborts the demo before analysis. Comparison gate failure is now an explicit JSON result with an opt-in --fail-on-gate-failure flag (compare_runs.py:261) instead of an unconditional set -e abort — artifacts always land.
H1 — the bundled fallback prompt is deleted. Generator failure now raises SystemExit("Prompt generation failed; no fallback prompt was promoted...") (analyze_and_evolve.py:308-312), and the candidate carries source: "model" which flows into candidate_prompt.json and renders as "Candidate source: model" in comparison.md. The demo can no longer silently present canned output as self-evolution.
H2 — summarize([]) now returns the full keyset including avg_llm_calls: 0.0 (session_metrics.py:240-252). Pinned by test_summarize_empty_rows_has_stable_shape.
H3 — require_complete_session_metrics (session_metrics.py:185-199) raises with the missing session IDs, called from both analyze_and_evolve.py:261 and compare_runs.py:47. Pinned by test_require_complete_session_metrics_rejects_missing_rows.
H4 — zero-token schema sanity check raises RuntimeError pointing the operator at LLM_RESPONSE rows in the configured table (session_metrics.py:205-209). Pinned by test_require_complete_session_metrics_rejects_zero_token_schema.
H5 — _pct_delta(0, X>0) returns None, rendered as n/a (compare_runs.py:53-61). Pinned by test_pct_delta_marks_zero_baseline_growth_as_not_applicable.
M1 — SystemEvaluator preferred with a CodeEvaluator as SystemEvaluator fallback import (session_metrics.py:285-287) — neat shape that works on both pre- and post-#121 SDK versions.
M2/M3 — env mutation moved into _configure_environment() called from create_agent; unresolvable PROJECT_ID raises a clear RuntimeError at the top of agent.py instead of silently clearing GOOGLE_CLOUD_PROJECT.
M4 — batch_size=1 now carries its intent comment ("rows visible quickly for this one-shot demo"). L2 — content cap reduced to 50 KiB.
M5/M6 — promotion thresholds exposed as --min-quality-pass-rate and --min-broad-lookup-rate flags and threaded into _observations. Pinned by test_observations_use_configured_thresholds.
M7 — PYTHONPATH no longer prepends $REPO_ROOT/src.
L1 — git diff --check clean. L8 — setup.sh:86-101 now cross-checks the existing dataset's location and hard-errors on mismatch. L11 — tests/test_self_evolving_agent_demo.py adds 5 offline tests that map one-to-one onto the review findings; all pass.

Rename consistency — verified

nba_self_evolving_demo → self_evolving_agent_demo, env vars NBA_* → SELF_EVOLVING_*, lookup_nba_reference → lookup_basketball_reference — and the broad-tool name is consistent across the prompt, the eval cases' avoid_tool, and the SQL COUNTIF in session_metrics.py:132 (the demo's core signal, so this mattered). The basketball fixture content stays, which is fine — the branding is neutral. examples/README.md row updated. VERIFICATION.md regenerated from a fresh post-rename live run.

Verdict — APPROVE

This is a model response to review: every finding addressed at the right layer, each behavioral fix locked by a test named after the failure mode it prevents, and a fresh live verification that demonstrates the H1 fix (candidate source: model) rather than just claiming it. Nothing blocking remains.

caohy1988 requested changes Jun 8, 2026

View reviewed changes

sunpan9209 force-pushed the main branch 2 times, most recently from eaecfd8 to fdf14fc Compare June 10, 2026 00:19

sunpan9209 requested a review from caohy1988 June 10, 2026 01:24

caohy1988 approved these changes Jun 12, 2026

View reviewed changes

Add verified self-evolving agent demo

ca61b75

sunpan9209 force-pushed the main branch from fdf14fc to ca61b75 Compare June 13, 2026 00:09

sunpan9209 merged commit 2d15a61 into GoogleCloudPlatform:main Jun 13, 2026
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add verified self-evolving agent demo#284

Add verified self-evolving agent demo#284
sunpan9209 merged 1 commit into
GoogleCloudPlatform:mainfrom
sunpan9209:main

sunpan9209 commented Jun 4, 2026

Uh oh!

caohy1988 left a comment

Uh oh!

sunpan9209 commented Jun 10, 2026

Uh oh!

caohy1988 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sunpan9209 commented Jun 4, 2026

Summary

Verification

Uh oh!

caohy1988 left a comment

Choose a reason for hiding this comment

Fresh review — PR #284 head 90f2818

Blocker

B1 — Baseline run_agent.py is not run with --allow-failures, so a single non-deterministic V1 hiccup aborts the whole demo before analysis ever runs

High

H1 — Generator fallback path produces a hand-crafted prompt but is presented as if the model generated it

H2 — summarize() empty-input return is missing avg_llm_calls; populated input includes it

H3 — Partial-results blind spot: fetch_session_metrics returns fewer rows than session_ids after retry exhaustion, and downstream silently computes averages on the incomplete subset

H4 — Token-extraction triple-coalesce silently returns 0 when none of the three schemas match

H5 — compare_runs._pct_delta(0, X) == 0.0 masks regressions when the baseline metric was zero

Medium

M1 — CodeEvaluator is the deprecated alias; the demo should use SystemEvaluator

M2 — agent/agent.py mutates os.environ at import time

M3 — bq_logging_plugin is instantiated at module import; fails confusingly when PROJECT_ID is None

M4 — BigQueryLoggerConfig(batch_size=1, shutdown_timeout=15.0) is the slowest path; intentional?

M5 — analyze_and_evolve.py promotion heuristic has two hardcoded magic numbers

M6 — analyze_and_evolve._observations thresholds duplicated, not centralized

M7 — run_e2e_demo.sh:39 prepends $REPO_ROOT/src to PYTHONPATH even after setup.sh already ran pip install -e .

Low / nits

L1 — VERIFICATION.md has a trailing blank line at EOF

L2 — BigQueryLoggerConfig.max_content_length = 500 * 1024 (500 KiB) is way larger than necessary

L3 — _fallback_candidate.changes_summary reads "Fallback prompt generated from the detected broad-tool overuse pattern" even when broad-tool overuse wasn't detected

L4 — analyze_and_evolve._generate_candidate_prompt hardcodes len(improved) < 120 as the "too short" threshold

L5 — run_agent.py:88 quality only checks expected_tool_used, not avoid_tool_used

L6 — compare_runs._read_candidate_summary silently returns "" on any exception

L7 — setup.sh overwrites .env on every invocation

L8 — setup.sh dataset-creation step is not location-idempotent

L9 — examples/README.md description for the demo is one long sentence (56 lines wide)

L10 — Cost claim "typically well under $1" appears twice without supporting math

L11 — No unit tests for any of the new helpers

L12 — Mermaid diagrams in README.md and VERIFICATION.md are nice; minor consistency

L13 — Productionization roadmap mermaid uses Scheduler → Cloud Run Job → … → Canary traffic → C cycle

What looks great

Verdict — REQUEST_CHANGES on B1; H1–H5 + M1–M3 are tightening that materially improves trust

Uh oh!

sunpan9209 commented Jun 10, 2026

Uh oh!

caohy1988 left a comment

Choose a reason for hiding this comment

Re-review — PR #284 head fdf14fc

Verified RESOLVED

Rename consistency — verified

Verdict — APPROVE

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fresh review — PR #284 head `90f2818`

B1 — Baseline `run_agent.py` is not run with `--allow-failures`, so a single non-deterministic V1 hiccup aborts the whole demo before analysis ever runs

H2 — `summarize()` empty-input return is missing `avg_llm_calls`; populated input includes it

H3 — Partial-results blind spot: `fetch_session_metrics` returns fewer rows than `session_ids` after retry exhaustion, and downstream silently computes averages on the incomplete subset

H5 — `compare_runs._pct_delta(0, X) == 0.0` masks regressions when the baseline metric was zero

M1 — `CodeEvaluator` is the deprecated alias; the demo should use `SystemEvaluator`

M2 — `agent/agent.py` mutates `os.environ` at import time

M3 — `bq_logging_plugin` is instantiated at module import; fails confusingly when `PROJECT_ID` is `None`

M4 — `BigQueryLoggerConfig(batch_size=1, shutdown_timeout=15.0)` is the slowest path; intentional?

M5 — `analyze_and_evolve.py` promotion heuristic has two hardcoded magic numbers

M6 — `analyze_and_evolve._observations` thresholds duplicated, not centralized

M7 — `run_e2e_demo.sh:39` prepends `$REPO_ROOT/src` to `PYTHONPATH` even after `setup.sh` already ran `pip install -e .`

L1 — `VERIFICATION.md` has a trailing blank line at EOF

L2 — `BigQueryLoggerConfig.max_content_length = 500 * 1024` (500 KiB) is way larger than necessary

L3 — `_fallback_candidate.changes_summary` reads "Fallback prompt generated from the detected broad-tool overuse pattern" even when broad-tool overuse wasn't detected

L4 — `analyze_and_evolve._generate_candidate_prompt` hardcodes `len(improved) < 120` as the "too short" threshold

L5 — `run_agent.py:88` quality only checks `expected_tool_used`, not `avoid_tool_used`

L6 — `compare_runs._read_candidate_summary` silently returns "" on any exception

L7 — `setup.sh` overwrites `.env` on every invocation

L8 — `setup.sh` dataset-creation step is not location-idempotent

L9 — `examples/README.md` description for the demo is one long sentence (56 lines wide)

L12 — Mermaid diagrams in `README.md` and `VERIFICATION.md` are nice; minor consistency

L13 — Productionization roadmap mermaid uses `Scheduler → Cloud Run Job → … → Canary traffic → C` cycle

Re-review — PR #284 head `fdf14fc`