Skip to content

Add verified self-evolving agent demo#284

Merged
sunpan9209 merged 1 commit into
GoogleCloudPlatform:mainfrom
sunpan9209:main
Jun 13, 2026
Merged

Add verified self-evolving agent demo#284
sunpan9209 merged 1 commit into
GoogleCloudPlatform:mainfrom
sunpan9209:main

Conversation

@sunpan9209

Copy link
Copy Markdown
Collaborator

Summary

  • Add a self-evolving ADK agent demo that runs baseline traffic, analyzes BigQuery trace data with the SDK, generates an improved prompt, reruns the agent, and compares acceptance gates.
  • Include setup/reset/run scripts, deterministic basketball analytics fixtures, prompt diff artifacts, comparison reports, and a tracked live verification summary.
  • Preserve complete comparison artifacts even when a generated candidate fails a gate, so demo failures remain inspectable.

Verification

  • Live GCP-backed end-to-end run: quality 100% -> 100%, avg tokens 3512.5 -> 1419.5 (-59.6%), avg tool calls 3.0 -> 1.0, broad lookup calls 4 -> 0.
  • python3 -m pyink --config pyproject.toml examples/nba_self_evolving_demo
  • python3 -m isort examples/nba_self_evolving_demo
  • .venv313/bin/python -m compileall examples/nba_self_evolving_demo
  • bash -n examples/nba_self_evolving_demo/setup.sh examples/nba_self_evolving_demo/reset.sh examples/nba_self_evolving_demo/run_e2e_demo.sh
  • CLI help checks for run_agent.py, analyze_and_evolve.py, and compare_runs.py

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fresh review — PR #284 head 90f2818

Locally checked out. Lint clean (pyink --check, isort --check-only, bash -n all pass; one git diff --check warning noted below). PR adds a 19-file, 2236-line, addition-only examples/nba_self_evolving_demo/ directory with an ADK agent + tools, a BigQuery trace analyzer, a prompt-evolution generator, a comparator with acceptance gates, three shell scripts, and verification artifacts. No prior reviews, no existing test coverage.

The shape is clean and the verification numbers (3512 → 1419 avg tokens, 4→0 broad lookups, quality flat at 100%) are reproducible from the wired pipeline. The findings below are: a handful of correctness/robustness bugs in the pipeline glue, several places where the demo can silently misrepresent what happened, documentation/dependency hygiene, and a handful of small polish items.


Blocker

B1 — Baseline run_agent.py is not run with --allow-failures, so a single non-deterministic V1 hiccup aborts the whole demo before analysis ever runs

run_e2e_demo.sh:60-63:

echo "[1/5] Run baseline V1 agent..."
"$PYTHON_BIN" run_agent.py \
  --label baseline \
  --output-dir "$REPORTS_DIR"

vs. step [3/5] which explicitly passes --allow-failures. run_agent.py:194-195 exits 1 when any case fails quality_passed, and set -euo pipefail halts the script. quality_passed requires both bool(response_text.strip()) and expected_tool_used — so a single V1 session where the LLM happens to call only lookup_nba_reference (a model whim, not a contract violation) kills the demo with no analysis, no comparison, no artifacts, and a misleading "ERROR" trail.

This bites the PR description's own promise: "Preserve complete comparison artifacts even when a generated candidate fails a gate, so demo failures remain inspectable." — that's true only after step [3/5]. Steps [1/5] and [4/5] (compare_runs.py exits 1 on failed gates) terminate set -e and leave a partial state.

Fix: pass --allow-failures to baseline too, and let analyze_and_evolve.py decide whether to promote based on quality["pass_rate"] >= 1.0 (line 284). That gate already exists — making baseline-failures fatal is just redundant.

Same concern for the comparator step: compare_runs.py:237-238 exits 1 when gates fail. Step [4/5] then halts set -e. The script never reaches echo "[5/5] Done.". Either accept this (artifacts are written first, line 192-206, so they do exist) and document it, or split compare_runs exit code from the "did it pass gates" signal (e.g. always exit 0, surface gate status via JSON only, and let the script print a summary).


High

H1 — Generator fallback path produces a hand-crafted prompt but is presented as if the model generated it

analyze_and_evolve.py:196-199:

except Exception as exc:
  fallback = _fallback_candidate()
  fallback["changes_summary"] += f" Generator fallback reason: {exc}"
  return fallback

_fallback_candidate() returns a fully-tuned, hand-crafted V2 prompt that includes the precise routing rules ("Player comparison: call compare_players", "Call lookup_nba_reference only when…") needed to pass every gate in compare_runs.py. If the genai client errors (quota, auth, transient 503, JSON validation, length-too-short), the demo silently substitutes that prompt and proceeds to step [3/5]. The final markdown report still says "The generated V2 prompt was created from the baseline trace summary, tool counts, …" — even when no model ran.

The fallback note ("Generator fallback reason: ...") is appended to changes_summary and surfaces in candidate_prompt.json / comparison.md. But it's easy to miss — and a user who never reads JSON might believe they witnessed self-evolution when they actually got the bundled fallback.

Two ways to fix:

  • Loud-fail mode (recommended for a demo whose point is "the agent evolves itself"): drop the fallback. If the generator fails, exit nonzero with the exception. The demo's claim is testable.
  • Loud-warn mode: keep the fallback for robustness, but print a prominent WARNING: generator failed; using bundled fallback prompt. The demo is no longer demonstrating self-evolution. to stderr, and set evolution["promoted_via_fallback"] = True so the JSON downstream can be filtered.

Either way, the fallback prompt and the generated prompt should be visually distinguishable in comparison.md (e.g. a "Source: model" vs "Source: bundled fallback" line at the top of the report).

H2 — summarize() empty-input return is missing avg_llm_calls; populated input includes it

analytics/session_metrics.py:212-222 (empty branch):

if not rows:
  return {
      "sessions": 0,
      "avg_total_tokens": 0.0,
      "avg_input_tokens": 0.0,
      "avg_output_tokens": 0.0,
      "avg_tool_calls": 0.0,
      "total_broad_lookup_calls": 0,
      "sessions_with_broad_lookup": 0,
      "broad_lookup_session_rate": 0.0,
      "total_tool_errors": 0,
  }

vs. populated branch (:229-240) which includes "avg_llm_calls". Same finding pattern as the M2 issue from PR #174 — the return shape is unstable across input sizes. Caller code paths that do summary["avg_llm_calls"] will KeyError whenever rows=[]. Currently no in-repo consumer hits this (verified: grep avg_llm_calls analyze_and_evolve.py compare_runs.py is empty), so it's a latent footgun rather than an immediate bug.

Fix: keep the keyset identical, set zeros in the empty branch.

H3 — Partial-results blind spot: fetch_session_metrics returns fewer rows than session_ids after retry exhaustion, and downstream silently computes averages on the incomplete subset

analytics/session_metrics.py:171-180:

rows: list[dict[str, Any]] = []
for attempt in range(1, attempts + 1):
  if wait_seconds and attempt > 1:
    time.sleep(wait_seconds)
  rows = [dict(r) for r in client.query(query, job_config=job_config).result()]
  if len(rows) >= len(set(session_ids)):
    break
return rows

If only 2 of 4 baseline sessions have landed after attempts=6 × wait_seconds=15 (90 seconds total), rows has 2 entries and the call returns. analyze_and_evolve.py:264-268 only raises when not rows — it does not check for the partial case. summarize() then divides by 2, producing an avg_total_tokens that pretends those are the whole baseline. The downstream gates then compare a 2-session "baseline" vs a 4-session "evolved" — apples and oranges.

Fix: at the call site (analyze_and_evolve.py:259-268), assert len(rows) == len(session_ids) or print a clear warning and exit. The streaming-insert race is the most common real-world failure mode for this kind of pipeline; the demo shouldn't silently hide it.

H4 — Token-extraction triple-coalesce silently returns 0 when none of the three schemas match

analytics/session_metrics.py:132-160 reads tokens via:

  1. attributes.$.usage_metadata.prompt_token_count (ADK BQAA plugin recent schema)
  2. content.$.usage.prompt (older schema)
  3. attributes.$.input_tokens (alternate path)
  4. falls through to 0 if none of those JSON paths exist

If the BQAA plugin's emitted attribute schema changes in a fourth way (it has changed before — see issue #94 history), total_tokens summed across sessions becomes 0 silently, the gates compare 0 vs 0 (which fails tokens_reduced), and the demo reports false failure.

Fixes:

  • Add a post-query sanity check: if sum(total_tokens) == 0 across all rows but event_count > 0, log a WARNING: token extraction produced zero tokens; the BQAA plugin's attribute schema may have changed. Inspect a session with LLM_RESPONSE events under \{config.table_ref}`.`
  • Optionally, in the SDK-evaluators path (which the PR already calls via run_sdk_evaluators), cross-check the SDK's reported avg-observed token count against the BQ-side aggregation. If they diverge, surface that.

H5 — compare_runs._pct_delta(0, X) == 0.0 masks regressions when the baseline metric was zero

compare_runs.py:46-49:

def _pct_delta(before: float, after: float) -> float:
  if before == 0:
    return 0.0
  return round((after - before) / before, 4)

If baseline had 0 tool errors and evolved has 5, _pct_delta returns 0.0. That's then displayed as +0.0% in comparison.md and feeds the tokens_reduced gate as <= -0.05 → False (correctly). The math doesn't break a gate today, but the reported delta is wrong: "we went from 0 to 5 errors" should not render as "0%".

Fix: return float("inf") (or None and special-case rendering) when before == 0 and after > 0. Or print the raw delta alongside the percent for the metrics where percent is meaningless.


Medium

M1 — CodeEvaluator is the deprecated alias; the demo should use SystemEvaluator

analytics/session_metrics.py:252:

from bigquery_agent_analytics.evaluators import CodeEvaluator

Per the active PR #121, CodeEvaluator = SystemEvaluator is a backward-compat alias with CHANGELOG.md saying "deprecated but supported." A new example landing today should use the preferred name (SystemEvaluator) so it doesn't have to be migrated again next quarter. Also update README.md:16 ("SDK CodeEvaluator + trace SQL") and VERIFICATION.md:41 ("SDK CodeEvaluator checks for…") to match.

M2 — agent/agent.py mutates os.environ at import time

agent.py:53-55:

os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID or ""
os.environ["GOOGLE_CLOUD_LOCATION"] = AGENT_LOCATION
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "true"

Importing agent.agent rewrites three process-wide env vars unconditionally. PROJECT_ID or "" actively clears GOOGLE_CLOUD_PROJECT if PROJECT_ID couldn't be resolved — silently masking the user's actual configuration. The demo runs single-process so this is mostly cosmetic, but importing a module shouldn't have global side effects.

Fix: move these into a _configure_environment() helper that's only called from create_agent (or from a top-level main() somewhere). Don't clear with "" when the lookup failed — raise a clear error.

M3 — bq_logging_plugin is instantiated at module import; fails confusingly when PROJECT_ID is None

agent.py:42-44:

PROJECT_ID = os.getenv("PROJECT_ID") or os.getenv("GOOGLE_CLOUD_PROJECT")
if not PROJECT_ID:
  PROJECT_ID = _auth_project

If ADC also has no default project, PROJECT_ID stays None. Then BigQueryAgentAnalyticsPlugin(project_id=None, ...) constructs successfully (the plugin tolerates None in its constructor), but the first write will fail with a less-helpful error from the BQ client deep in plugin internals.

Fix: add an explicit guard near the top of agent.py:

if not PROJECT_ID:
  raise RuntimeError(
      "Could not resolve PROJECT_ID from .env, GOOGLE_CLOUD_PROJECT, or ADC. "
      "Run ./setup.sh or `gcloud config set project YOUR_PROJECT_ID`."
  )

Same applies in analytics/session_metrics.load_config:62-64 (which does have a guard — good); the agent module should match.

M4 — BigQueryLoggerConfig(batch_size=1, shutdown_timeout=15.0) is the slowest path; intentional?

agent.py:80-85:

config=BigQueryLoggerConfig(
    enabled=True,
    max_content_length=500 * 1024,
    batch_size=1,
    shutdown_timeout=15.0,
),

batch_size=1 forces a Storage Write API append per event. For a 4-case demo with ~3.5K tokens per session, you're likely emitting 20–40 events per session × 4 sessions × 2 runs = ~200+ individual append RPCs. Most demos would use the default batching to amortize, then rely on the shutdown_timeout to flush remaining buffered events.

If batch_size=1 is deliberate (to make individual rows show up in BigQuery faster for the streaming-insert race the analyzer worries about), say so in a code comment. Otherwise use default batching and rely on the 15s shutdown timeout to flush.

M5 — analyze_and_evolve.py promotion heuristic has two hardcoded magic numbers

analyze_and_evolve.py:285-288:

should_promote = (
    current_state["version"] == "v1"
    and quality["pass_rate"] >= 1.0
    and (
        summary["broad_lookup_session_rate"] >= 0.5
        or summary["avg_total_tokens"] > args.token_budget
    )
)

>= 1.0 and >= 0.5 are magic. The first is "every single session must pass quality" — a strict bar that, combined with B1, makes a single bad session abort the demo entirely. The second ("at least half of sessions used the broad tool") is a threshold the user can't tune without editing source.

Promote both to CLI flags with the current values as defaults (--min-quality-pass-rate 1.0, --min-broad-lookup-rate 0.5), and document them in README.

M6 — analyze_and_evolve._observations thresholds duplicated, not centralized

:70-85 hardcodes 0.5 (broad lookup) and 2.0 (avg tool calls) and token_budget for the observation strings. The same 0.5 repeats in should_promote (:287). These three thresholds are demo-tuning knobs that should live in one constants block (or CLI flags as M5) so a future maintainer doesn't have to grep two files to retune.

M7 — run_e2e_demo.sh:39 prepends $REPO_ROOT/src to PYTHONPATH even after setup.sh already ran pip install -e .

export PYTHONPATH="$REPO_ROOT/src:$SCRIPT_DIR:${PYTHONPATH:-}"

If setup.sh succeeded, the SDK is on site-packages and the $REPO_ROOT/src prepend is redundant. If setup.sh wasn't run, the demo will fail earlier on missing google-adk / google-genai. So $REPO_ROOT/src is only useful if someone runs the script from a checkout without pip install, which the README explicitly tells them not to. Keep $SCRIPT_DIR (needed for from agent.agent import …); drop $REPO_ROOT/src.


Low / nits

L1 — VERIFICATION.md has a trailing blank line at EOF

git diff --check:

examples/nba_self_evolving_demo/VERIFICATION.md:95: new blank line at EOF.

Strip the trailing blank. Same lint hygiene CI catches in other PRs.

L2 — BigQueryLoggerConfig.max_content_length = 500 * 1024 (500 KiB) is way larger than necessary

NBA tool responses are < 5 KiB. Demo-level concern only, but 500 KiB is the kind of number that looks like a copy-paste from a different example. If the goal is "never truncate," set it to a comfortable round number (50 KiB) or drop the override entirely and use the default.

L3 — _fallback_candidate.changes_summary reads "Fallback prompt generated from the detected broad-tool overuse pattern" even when broad-tool overuse wasn't detected

_fallback_candidate() returns the same string regardless of what observations actually flagged. If the generator failed for a non-broad-tool reason (token over budget, say), the fallback's changes_summary still claims it's responding to broad-tool overuse. Make the summary echo observations so the rationale matches the observed signal.

L4 — analyze_and_evolve._generate_candidate_prompt hardcodes len(improved) < 120 as the "too short" threshold

:190-191. Why 120? A comment explaining the choice (or moving to a constant) would help.

L5 — run_agent.py:88 quality only checks expected_tool_used, not avoid_tool_used

quality_passed = bool(response_text.strip()) and expected_tool_used

That's a deliberate design choice (broad-tool overuse is tracked separately via avoid_tool_used and broad_lookup_calls), but the local definition of "quality" is then "did the agent call the right tool at all?" — which can be True even when the agent also called lookup_nba_reference. This is the case for every V1 session per VERIFICATION.md (broad lookup calls = 4 out of 4, yet quality = 100%). Worth a one-line comment in run_agent.py explaining that quality_passed is intentionally the broader "did it answer correctly" signal and that avoid_tool overuse is the separate signal the demo evolves against.

L6 — compare_runs._read_candidate_summary silently returns "" on any exception

:52-61:

try:
  with open(path) as f:
    data = json.load(f)
  return str(data.get("changes_summary", ""))
except Exception:
  return ""

The bare except Exception swallows json.JSONDecodeError and FileNotFoundError. The "" return then makes comparison.md fall back to the generic blurb at :87-91. If candidate_prompt.json was malformed (bug, generator output corrupt, etc.), the user never finds out. Catch only FileNotFoundError here; let JSON errors surface.

L7 — setup.sh overwrites .env on every invocation

:93-105. Re-running ./setup.sh blows away any manual edit to .env. Probably fine for a demo, but worth a comment in the README:

Re-running setup.sh regenerates .env from current environment variables. To customize without losing the change, override via env vars before running setup (e.g. NBA_AGENT_MODEL=gemini-2.5-pro ./setup.sh).

L8 — setup.sh dataset-creation step is not location-idempotent

:86-91:

if ! bq show "${PROJECT_ID}:${NBA_DATASET_ID}" &>/dev/null 2>&1; then
  ...
  bq mk --dataset --location="$DATASET_LOCATION" ...
fi

If the dataset exists in a different location than the one in $DATASET_LOCATION, the bq show succeeds and no warning is issued. The subsequent BQAA plugin writes will then go to a dataset in the wrong region (or fail). Add a bq show --format=json | jq -r .location cross-check, or surface a warning if the existing dataset's location doesn't match.

L9 — examples/README.md description for the demo is one long sentence (56 lines wide)

examples/README.md:55 is 600+ characters on a single line in the markdown table. GitHub renders it fine, but the source is hard to skim/edit. Wrap to 80–100 chars per line, or break into bullets.

L10 — Cost claim "typically well under $1" appears twice without supporting math

README.md:92-96 and setup.sh:27-29. Based on VERIFICATION.md's actual numbers (~25K total tokens across the run at gemini-2.5-flash rates), real cost is in the $0.01–0.05 range. The conservative "well under $1" cushion is fine, but consider linking to Vertex AI pricing or showing the actual cost in VERIFICATION.md so a stakeholder can verify.

L11 — No unit tests for any of the new helpers

summarize, _pct_delta, _observations, _failure_class-style logic in _load_eval_contract, the gate booleans in compare_runs — all reasonable to test offline without BQ. Three or four pure-function tests under tests/test_nba_self_evolving_demo.py would prevent regressions on H2/H5 and the magic-number thresholds. Pattern would mirror tests/test_quality_report_helpers.py from the BQAA scripts.

L12 — Mermaid diagrams in README.md and VERIFICATION.md are nice; minor consistency

Three mermaid diagrams: TD/LR variants, sometimes with flowchart, sometimes with flowchart TD. Consistent direction (all TD or all LR) reads better, especially in VERIFICATION.md where the bar-chart "Run flow" feels visually closest to the README's pipeline diagram.

L13 — Productionization roadmap mermaid uses Scheduler → Cloud Run Job → … → Canary traffic → C cycle

README.md:170-178. The cycle goes back to C "Analyze recent BigQuery traces" from H "Canary traffic". That's the right shape, but the diagram has no termination — readers might wonder when human approval is bypassed. Either keep as-is (steady-state loop) or add an explicit "rollback to last accepted" edge from H back to F for completeness.


What looks great

  • The 5-step pipeline shape is the right shape: V1 baseline → trace analyze + generate → V2 rerun → before/after gates → JSON+MD artifacts. Self-contained, no shared state across runs.
  • run_agent.py:67-82 event consumption correctly walks event.content.parts and accumulates both text and tool names. The use of nonlocal response_text inside _consume() keeps the per-case scope clean.
  • Acceptance gates in compare_runs.py:172-179 are explicit, named, and individually testable. The result["gates"] dict that flows into both JSON and Markdown is good practice.
  • Concurrent eval execution via asyncio.Semaphore (run_agent.py:116-117) caps in-flight ADK runners — important because the InMemoryRunner with a single BQAA plugin has shared state.
  • Idempotent setup script structure: API enable, ADC check, dataset create only if missing, .env regeneration. The order matches what an operator would do by hand.
  • shutdown_timeout=15.0 on the plugin config + the attempts=6, wait_seconds=15 retry in the analyzer correctly hedges against BigQuery streaming insert latency. The shape is right; H3 just adds the missing partial-results check.
  • _write_prompt_diff using difflib.unified_diff is the right tool, and the resulting prompt_diff.md is the cleanest single-artifact "what changed in the prompt" view I've seen for an example like this.
  • VERIFICATION.md as a tracked file (vs the ignored reports/) is a great pattern — it pins a known-good live run to the source tree so reviewers can sanity-check claimed numbers without running anything.
  • Lint-clean PR (pyink, isort, bash -n, compileall). Only finding is the trailing blank-line at EOF flagged above.

Verdict — REQUEST_CHANGES on B1; H1–H5 + M1–M3 are tightening that materially improves trust

B1 (baseline --allow-failures) is the blocker — without it, one model whim aborts the demo before any of the actual self-evolution machinery runs.

H1 (fallback prompt presented as model output), H3 (partial-rows blind spot), H4 (token-extraction silent zero) are the three "the demo can silently misrepresent itself" findings — each cheap to fix, each meaningfully improves the trust users put in the verification numbers.

H2, H5, M1–M7, L1–L13 are tightening — bundle in or skip.

The PR is a clean, self-contained example with a real story and reproducible verification numbers. Once B1 + H1/H3/H4 land, this is mergeable, and the productionization roadmap section makes it a useful jumping-off point for the next demo iteration.

@sunpan9209 sunpan9209 force-pushed the main branch 2 times, most recently from eaecfd8 to fdf14fc Compare June 10, 2026 00:19
@sunpan9209

Copy link
Copy Markdown
Collaborator Author

Updated the PR at head fdf14fc.

What changed:

  • Repositioned the example as self_evolving_agent_demo, a lightweight metric-driven self-evolution companion to agent_improvement_cycle.
  • Removed the domain-first visible naming from the folder, README, setup output, environment variables, dataset default, and demo index entry.
  • Addressed the blocker by allowing baseline result writing even when a V1 case fails, so analysis and artifacts can still run.
  • Removed the bundled prompt fallback; prompt generation now fails loudly instead of silently promoting a hand-authored candidate.
  • Added complete BigQuery row checks and zero-token schema sanity checks before metrics are trusted.
  • Stabilized empty metric summaries, fixed zero-baseline percent delta reporting, and made comparison gate failure an explicit JSON result instead of an unconditional shell abort.
  • Added setup dataset-location validation, clearer project resolution, smaller logged-content cap, and documented/tunable evolution thresholds.
  • Added focused offline tests for the helper behavior called out in review.

Verification:

  • Live renamed end-to-end run: quality 100% -> 100%, avg tokens 3640.2 -> 1479.8 (-59.4%), avg tool calls 2.5 -> 1.0, broad lookup calls 4 -> 0, candidate source model.
  • python3 -m pyink --config pyproject.toml examples/self_evolving_agent_demo tests/test_self_evolving_agent_demo.py
  • python3 -m isort examples/self_evolving_agent_demo tests/test_self_evolving_agent_demo.py
  • .venv313/bin/python -m compileall examples/self_evolving_agent_demo
  • bash -n examples/self_evolving_agent_demo/setup.sh examples/self_evolving_agent_demo/reset.sh examples/self_evolving_agent_demo/run_e2e_demo.sh
  • python3 -m pytest tests/test_self_evolving_agent_demo.py
  • CLI help checks for the three demo entrypoints

@sunpan9209 sunpan9209 requested a review from caohy1988 June 10, 2026 01:24

@caohy1988 caohy1988 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — PR #284 head fdf14fc

Checked out locally and verified every June 8 finding against the new head. All of them are fixed — the blocker, all five Highs, all seven Mediums I'd flagged as material, and most of the Lows — plus the rename to self_evolving_agent_demo is consistent end-to-end. The 5 new offline tests pass, pyink/isort clean, git diff --check clean, and the fresh live verification (run_20260609_171547) reports candidate source: model — directly closing the trust gap from H1.

Verified RESOLVED

  • B1 — baseline step now passes --allow-failures (run_e2e_demo.sh:63); a single V1 model whim no longer aborts the demo before analysis. Comparison gate failure is now an explicit JSON result with an opt-in --fail-on-gate-failure flag (compare_runs.py:261) instead of an unconditional set -e abort — artifacts always land.
  • H1 — the bundled fallback prompt is deleted. Generator failure now raises SystemExit("Prompt generation failed; no fallback prompt was promoted...") (analyze_and_evolve.py:308-312), and the candidate carries source: "model" which flows into candidate_prompt.json and renders as "Candidate source: model" in comparison.md. The demo can no longer silently present canned output as self-evolution.
  • H2summarize([]) now returns the full keyset including avg_llm_calls: 0.0 (session_metrics.py:240-252). Pinned by test_summarize_empty_rows_has_stable_shape.
  • H3require_complete_session_metrics (session_metrics.py:185-199) raises with the missing session IDs, called from both analyze_and_evolve.py:261 and compare_runs.py:47. Pinned by test_require_complete_session_metrics_rejects_missing_rows.
  • H4 — zero-token schema sanity check raises RuntimeError pointing the operator at LLM_RESPONSE rows in the configured table (session_metrics.py:205-209). Pinned by test_require_complete_session_metrics_rejects_zero_token_schema.
  • H5_pct_delta(0, X>0) returns None, rendered as n/a (compare_runs.py:53-61). Pinned by test_pct_delta_marks_zero_baseline_growth_as_not_applicable.
  • M1SystemEvaluator preferred with a CodeEvaluator as SystemEvaluator fallback import (session_metrics.py:285-287) — neat shape that works on both pre- and post-#121 SDK versions.
  • M2/M3 — env mutation moved into _configure_environment() called from create_agent; unresolvable PROJECT_ID raises a clear RuntimeError at the top of agent.py instead of silently clearing GOOGLE_CLOUD_PROJECT.
  • M4batch_size=1 now carries its intent comment ("rows visible quickly for this one-shot demo"). L2 — content cap reduced to 50 KiB.
  • M5/M6 — promotion thresholds exposed as --min-quality-pass-rate and --min-broad-lookup-rate flags and threaded into _observations. Pinned by test_observations_use_configured_thresholds.
  • M7PYTHONPATH no longer prepends $REPO_ROOT/src.
  • L1git diff --check clean. L8setup.sh:86-101 now cross-checks the existing dataset's location and hard-errors on mismatch. L11tests/test_self_evolving_agent_demo.py adds 5 offline tests that map one-to-one onto the review findings; all pass.

Rename consistency — verified

nba_self_evolving_demoself_evolving_agent_demo, env vars NBA_*SELF_EVOLVING_*, lookup_nba_referencelookup_basketball_reference — and the broad-tool name is consistent across the prompt, the eval cases' avoid_tool, and the SQL COUNTIF in session_metrics.py:132 (the demo's core signal, so this mattered). The basketball fixture content stays, which is fine — the branding is neutral. examples/README.md row updated. VERIFICATION.md regenerated from a fresh post-rename live run.

Verdict — APPROVE

This is a model response to review: every finding addressed at the right layer, each behavioral fix locked by a test named after the failure mode it prevents, and a fresh live verification that demonstrates the H1 fix (candidate source: model) rather than just claiming it. Nothing blocking remains.

@sunpan9209 sunpan9209 merged commit 2d15a61 into GoogleCloudPlatform:main Jun 13, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants