Add verified self-evolving agent demo#284
Conversation
caohy1988
left a comment
There was a problem hiding this comment.
Fresh review — PR #284 head 90f2818
Locally checked out. Lint clean (pyink --check, isort --check-only, bash -n all pass; one git diff --check warning noted below). PR adds a 19-file, 2236-line, addition-only examples/nba_self_evolving_demo/ directory with an ADK agent + tools, a BigQuery trace analyzer, a prompt-evolution generator, a comparator with acceptance gates, three shell scripts, and verification artifacts. No prior reviews, no existing test coverage.
The shape is clean and the verification numbers (3512 → 1419 avg tokens, 4→0 broad lookups, quality flat at 100%) are reproducible from the wired pipeline. The findings below are: a handful of correctness/robustness bugs in the pipeline glue, several places where the demo can silently misrepresent what happened, documentation/dependency hygiene, and a handful of small polish items.
Blocker
B1 — Baseline run_agent.py is not run with --allow-failures, so a single non-deterministic V1 hiccup aborts the whole demo before analysis ever runs
run_e2e_demo.sh:60-63:
echo "[1/5] Run baseline V1 agent..."
"$PYTHON_BIN" run_agent.py \
--label baseline \
--output-dir "$REPORTS_DIR"vs. step [3/5] which explicitly passes --allow-failures. run_agent.py:194-195 exits 1 when any case fails quality_passed, and set -euo pipefail halts the script. quality_passed requires both bool(response_text.strip()) and expected_tool_used — so a single V1 session where the LLM happens to call only lookup_nba_reference (a model whim, not a contract violation) kills the demo with no analysis, no comparison, no artifacts, and a misleading "ERROR" trail.
This bites the PR description's own promise: "Preserve complete comparison artifacts even when a generated candidate fails a gate, so demo failures remain inspectable." — that's true only after step [3/5]. Steps [1/5] and [4/5] (compare_runs.py exits 1 on failed gates) terminate set -e and leave a partial state.
Fix: pass --allow-failures to baseline too, and let analyze_and_evolve.py decide whether to promote based on quality["pass_rate"] >= 1.0 (line 284). That gate already exists — making baseline-failures fatal is just redundant.
Same concern for the comparator step: compare_runs.py:237-238 exits 1 when gates fail. Step [4/5] then halts set -e. The script never reaches echo "[5/5] Done.". Either accept this (artifacts are written first, line 192-206, so they do exist) and document it, or split compare_runs exit code from the "did it pass gates" signal (e.g. always exit 0, surface gate status via JSON only, and let the script print a summary).
High
H1 — Generator fallback path produces a hand-crafted prompt but is presented as if the model generated it
analyze_and_evolve.py:196-199:
except Exception as exc:
fallback = _fallback_candidate()
fallback["changes_summary"] += f" Generator fallback reason: {exc}"
return fallback_fallback_candidate() returns a fully-tuned, hand-crafted V2 prompt that includes the precise routing rules ("Player comparison: call compare_players", "Call lookup_nba_reference only when…") needed to pass every gate in compare_runs.py. If the genai client errors (quota, auth, transient 503, JSON validation, length-too-short), the demo silently substitutes that prompt and proceeds to step [3/5]. The final markdown report still says "The generated V2 prompt was created from the baseline trace summary, tool counts, …" — even when no model ran.
The fallback note ("Generator fallback reason: ...") is appended to changes_summary and surfaces in candidate_prompt.json / comparison.md. But it's easy to miss — and a user who never reads JSON might believe they witnessed self-evolution when they actually got the bundled fallback.
Two ways to fix:
- Loud-fail mode (recommended for a demo whose point is "the agent evolves itself"): drop the fallback. If the generator fails, exit nonzero with the exception. The demo's claim is testable.
- Loud-warn mode: keep the fallback for robustness, but print a prominent
WARNING: generator failed; using bundled fallback prompt. The demo is no longer demonstrating self-evolution.to stderr, and setevolution["promoted_via_fallback"] = Trueso the JSON downstream can be filtered.
Either way, the fallback prompt and the generated prompt should be visually distinguishable in comparison.md (e.g. a "Source: model" vs "Source: bundled fallback" line at the top of the report).
H2 — summarize() empty-input return is missing avg_llm_calls; populated input includes it
analytics/session_metrics.py:212-222 (empty branch):
if not rows:
return {
"sessions": 0,
"avg_total_tokens": 0.0,
"avg_input_tokens": 0.0,
"avg_output_tokens": 0.0,
"avg_tool_calls": 0.0,
"total_broad_lookup_calls": 0,
"sessions_with_broad_lookup": 0,
"broad_lookup_session_rate": 0.0,
"total_tool_errors": 0,
}vs. populated branch (:229-240) which includes "avg_llm_calls". Same finding pattern as the M2 issue from PR #174 — the return shape is unstable across input sizes. Caller code paths that do summary["avg_llm_calls"] will KeyError whenever rows=[]. Currently no in-repo consumer hits this (verified: grep avg_llm_calls analyze_and_evolve.py compare_runs.py is empty), so it's a latent footgun rather than an immediate bug.
Fix: keep the keyset identical, set zeros in the empty branch.
H3 — Partial-results blind spot: fetch_session_metrics returns fewer rows than session_ids after retry exhaustion, and downstream silently computes averages on the incomplete subset
analytics/session_metrics.py:171-180:
rows: list[dict[str, Any]] = []
for attempt in range(1, attempts + 1):
if wait_seconds and attempt > 1:
time.sleep(wait_seconds)
rows = [dict(r) for r in client.query(query, job_config=job_config).result()]
if len(rows) >= len(set(session_ids)):
break
return rowsIf only 2 of 4 baseline sessions have landed after attempts=6 × wait_seconds=15 (90 seconds total), rows has 2 entries and the call returns. analyze_and_evolve.py:264-268 only raises when not rows — it does not check for the partial case. summarize() then divides by 2, producing an avg_total_tokens that pretends those are the whole baseline. The downstream gates then compare a 2-session "baseline" vs a 4-session "evolved" — apples and oranges.
Fix: at the call site (analyze_and_evolve.py:259-268), assert len(rows) == len(session_ids) or print a clear warning and exit. The streaming-insert race is the most common real-world failure mode for this kind of pipeline; the demo shouldn't silently hide it.
H4 — Token-extraction triple-coalesce silently returns 0 when none of the three schemas match
analytics/session_metrics.py:132-160 reads tokens via:
attributes.$.usage_metadata.prompt_token_count(ADK BQAA plugin recent schema)content.$.usage.prompt(older schema)attributes.$.input_tokens(alternate path)- falls through to 0 if none of those JSON paths exist
If the BQAA plugin's emitted attribute schema changes in a fourth way (it has changed before — see issue #94 history), total_tokens summed across sessions becomes 0 silently, the gates compare 0 vs 0 (which fails tokens_reduced), and the demo reports false failure.
Fixes:
- Add a post-query sanity check: if
sum(total_tokens) == 0across all rows butevent_count > 0, log aWARNING: token extraction produced zero tokens; the BQAA plugin's attribute schema may have changed. Inspect a session with LLM_RESPONSE events under \{config.table_ref}`.` - Optionally, in the SDK-evaluators path (which the PR already calls via
run_sdk_evaluators), cross-check the SDK's reported avg-observed token count against the BQ-side aggregation. If they diverge, surface that.
H5 — compare_runs._pct_delta(0, X) == 0.0 masks regressions when the baseline metric was zero
compare_runs.py:46-49:
def _pct_delta(before: float, after: float) -> float:
if before == 0:
return 0.0
return round((after - before) / before, 4)If baseline had 0 tool errors and evolved has 5, _pct_delta returns 0.0. That's then displayed as +0.0% in comparison.md and feeds the tokens_reduced gate as <= -0.05 → False (correctly). The math doesn't break a gate today, but the reported delta is wrong: "we went from 0 to 5 errors" should not render as "0%".
Fix: return float("inf") (or None and special-case rendering) when before == 0 and after > 0. Or print the raw delta alongside the percent for the metrics where percent is meaningless.
Medium
M1 — CodeEvaluator is the deprecated alias; the demo should use SystemEvaluator
analytics/session_metrics.py:252:
from bigquery_agent_analytics.evaluators import CodeEvaluatorPer the active PR #121, CodeEvaluator = SystemEvaluator is a backward-compat alias with CHANGELOG.md saying "deprecated but supported." A new example landing today should use the preferred name (SystemEvaluator) so it doesn't have to be migrated again next quarter. Also update README.md:16 ("SDK CodeEvaluator + trace SQL") and VERIFICATION.md:41 ("SDK CodeEvaluator checks for…") to match.
M2 — agent/agent.py mutates os.environ at import time
agent.py:53-55:
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID or ""
os.environ["GOOGLE_CLOUD_LOCATION"] = AGENT_LOCATION
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "true"Importing agent.agent rewrites three process-wide env vars unconditionally. PROJECT_ID or "" actively clears GOOGLE_CLOUD_PROJECT if PROJECT_ID couldn't be resolved — silently masking the user's actual configuration. The demo runs single-process so this is mostly cosmetic, but importing a module shouldn't have global side effects.
Fix: move these into a _configure_environment() helper that's only called from create_agent (or from a top-level main() somewhere). Don't clear with "" when the lookup failed — raise a clear error.
M3 — bq_logging_plugin is instantiated at module import; fails confusingly when PROJECT_ID is None
agent.py:42-44:
PROJECT_ID = os.getenv("PROJECT_ID") or os.getenv("GOOGLE_CLOUD_PROJECT")
if not PROJECT_ID:
PROJECT_ID = _auth_projectIf ADC also has no default project, PROJECT_ID stays None. Then BigQueryAgentAnalyticsPlugin(project_id=None, ...) constructs successfully (the plugin tolerates None in its constructor), but the first write will fail with a less-helpful error from the BQ client deep in plugin internals.
Fix: add an explicit guard near the top of agent.py:
if not PROJECT_ID:
raise RuntimeError(
"Could not resolve PROJECT_ID from .env, GOOGLE_CLOUD_PROJECT, or ADC. "
"Run ./setup.sh or `gcloud config set project YOUR_PROJECT_ID`."
)Same applies in analytics/session_metrics.load_config:62-64 (which does have a guard — good); the agent module should match.
M4 — BigQueryLoggerConfig(batch_size=1, shutdown_timeout=15.0) is the slowest path; intentional?
agent.py:80-85:
config=BigQueryLoggerConfig(
enabled=True,
max_content_length=500 * 1024,
batch_size=1,
shutdown_timeout=15.0,
),batch_size=1 forces a Storage Write API append per event. For a 4-case demo with ~3.5K tokens per session, you're likely emitting 20–40 events per session × 4 sessions × 2 runs = ~200+ individual append RPCs. Most demos would use the default batching to amortize, then rely on the shutdown_timeout to flush remaining buffered events.
If batch_size=1 is deliberate (to make individual rows show up in BigQuery faster for the streaming-insert race the analyzer worries about), say so in a code comment. Otherwise use default batching and rely on the 15s shutdown timeout to flush.
M5 — analyze_and_evolve.py promotion heuristic has two hardcoded magic numbers
analyze_and_evolve.py:285-288:
should_promote = (
current_state["version"] == "v1"
and quality["pass_rate"] >= 1.0
and (
summary["broad_lookup_session_rate"] >= 0.5
or summary["avg_total_tokens"] > args.token_budget
)
)>= 1.0 and >= 0.5 are magic. The first is "every single session must pass quality" — a strict bar that, combined with B1, makes a single bad session abort the demo entirely. The second ("at least half of sessions used the broad tool") is a threshold the user can't tune without editing source.
Promote both to CLI flags with the current values as defaults (--min-quality-pass-rate 1.0, --min-broad-lookup-rate 0.5), and document them in README.
M6 — analyze_and_evolve._observations thresholds duplicated, not centralized
:70-85 hardcodes 0.5 (broad lookup) and 2.0 (avg tool calls) and token_budget for the observation strings. The same 0.5 repeats in should_promote (:287). These three thresholds are demo-tuning knobs that should live in one constants block (or CLI flags as M5) so a future maintainer doesn't have to grep two files to retune.
M7 — run_e2e_demo.sh:39 prepends $REPO_ROOT/src to PYTHONPATH even after setup.sh already ran pip install -e .
export PYTHONPATH="$REPO_ROOT/src:$SCRIPT_DIR:${PYTHONPATH:-}"If setup.sh succeeded, the SDK is on site-packages and the $REPO_ROOT/src prepend is redundant. If setup.sh wasn't run, the demo will fail earlier on missing google-adk / google-genai. So $REPO_ROOT/src is only useful if someone runs the script from a checkout without pip install, which the README explicitly tells them not to. Keep $SCRIPT_DIR (needed for from agent.agent import …); drop $REPO_ROOT/src.
Low / nits
L1 — VERIFICATION.md has a trailing blank line at EOF
git diff --check:
examples/nba_self_evolving_demo/VERIFICATION.md:95: new blank line at EOF.
Strip the trailing blank. Same lint hygiene CI catches in other PRs.
L2 — BigQueryLoggerConfig.max_content_length = 500 * 1024 (500 KiB) is way larger than necessary
NBA tool responses are < 5 KiB. Demo-level concern only, but 500 KiB is the kind of number that looks like a copy-paste from a different example. If the goal is "never truncate," set it to a comfortable round number (50 KiB) or drop the override entirely and use the default.
L3 — _fallback_candidate.changes_summary reads "Fallback prompt generated from the detected broad-tool overuse pattern" even when broad-tool overuse wasn't detected
_fallback_candidate() returns the same string regardless of what observations actually flagged. If the generator failed for a non-broad-tool reason (token over budget, say), the fallback's changes_summary still claims it's responding to broad-tool overuse. Make the summary echo observations so the rationale matches the observed signal.
L4 — analyze_and_evolve._generate_candidate_prompt hardcodes len(improved) < 120 as the "too short" threshold
:190-191. Why 120? A comment explaining the choice (or moving to a constant) would help.
L5 — run_agent.py:88 quality only checks expected_tool_used, not avoid_tool_used
quality_passed = bool(response_text.strip()) and expected_tool_usedThat's a deliberate design choice (broad-tool overuse is tracked separately via avoid_tool_used and broad_lookup_calls), but the local definition of "quality" is then "did the agent call the right tool at all?" — which can be True even when the agent also called lookup_nba_reference. This is the case for every V1 session per VERIFICATION.md (broad lookup calls = 4 out of 4, yet quality = 100%). Worth a one-line comment in run_agent.py explaining that quality_passed is intentionally the broader "did it answer correctly" signal and that avoid_tool overuse is the separate signal the demo evolves against.
L6 — compare_runs._read_candidate_summary silently returns "" on any exception
:52-61:
try:
with open(path) as f:
data = json.load(f)
return str(data.get("changes_summary", ""))
except Exception:
return ""The bare except Exception swallows json.JSONDecodeError and FileNotFoundError. The "" return then makes comparison.md fall back to the generic blurb at :87-91. If candidate_prompt.json was malformed (bug, generator output corrupt, etc.), the user never finds out. Catch only FileNotFoundError here; let JSON errors surface.
L7 — setup.sh overwrites .env on every invocation
:93-105. Re-running ./setup.sh blows away any manual edit to .env. Probably fine for a demo, but worth a comment in the README:
Re-running
setup.shregenerates.envfrom current environment variables. To customize without losing the change, override via env vars before running setup (e.g.NBA_AGENT_MODEL=gemini-2.5-pro ./setup.sh).
L8 — setup.sh dataset-creation step is not location-idempotent
:86-91:
if ! bq show "${PROJECT_ID}:${NBA_DATASET_ID}" &>/dev/null 2>&1; then
...
bq mk --dataset --location="$DATASET_LOCATION" ...
fiIf the dataset exists in a different location than the one in $DATASET_LOCATION, the bq show succeeds and no warning is issued. The subsequent BQAA plugin writes will then go to a dataset in the wrong region (or fail). Add a bq show --format=json | jq -r .location cross-check, or surface a warning if the existing dataset's location doesn't match.
L9 — examples/README.md description for the demo is one long sentence (56 lines wide)
examples/README.md:55 is 600+ characters on a single line in the markdown table. GitHub renders it fine, but the source is hard to skim/edit. Wrap to 80–100 chars per line, or break into bullets.
L10 — Cost claim "typically well under $1" appears twice without supporting math
README.md:92-96 and setup.sh:27-29. Based on VERIFICATION.md's actual numbers (~25K total tokens across the run at gemini-2.5-flash rates), real cost is in the $0.01–0.05 range. The conservative "well under $1" cushion is fine, but consider linking to Vertex AI pricing or showing the actual cost in VERIFICATION.md so a stakeholder can verify.
L11 — No unit tests for any of the new helpers
summarize, _pct_delta, _observations, _failure_class-style logic in _load_eval_contract, the gate booleans in compare_runs — all reasonable to test offline without BQ. Three or four pure-function tests under tests/test_nba_self_evolving_demo.py would prevent regressions on H2/H5 and the magic-number thresholds. Pattern would mirror tests/test_quality_report_helpers.py from the BQAA scripts.
L12 — Mermaid diagrams in README.md and VERIFICATION.md are nice; minor consistency
Three mermaid diagrams: TD/LR variants, sometimes with flowchart, sometimes with flowchart TD. Consistent direction (all TD or all LR) reads better, especially in VERIFICATION.md where the bar-chart "Run flow" feels visually closest to the README's pipeline diagram.
L13 — Productionization roadmap mermaid uses Scheduler → Cloud Run Job → … → Canary traffic → C cycle
README.md:170-178. The cycle goes back to C "Analyze recent BigQuery traces" from H "Canary traffic". That's the right shape, but the diagram has no termination — readers might wonder when human approval is bypassed. Either keep as-is (steady-state loop) or add an explicit "rollback to last accepted" edge from H back to F for completeness.
What looks great
- The 5-step pipeline shape is the right shape: V1 baseline → trace analyze + generate → V2 rerun → before/after gates → JSON+MD artifacts. Self-contained, no shared state across runs.
run_agent.py:67-82event consumption correctly walksevent.content.partsand accumulates both text and tool names. The use ofnonlocal response_textinside_consume()keeps the per-case scope clean.- Acceptance gates in
compare_runs.py:172-179are explicit, named, and individually testable. Theresult["gates"]dict that flows into both JSON and Markdown is good practice. - Concurrent eval execution via
asyncio.Semaphore(run_agent.py:116-117) caps in-flight ADK runners — important because the InMemoryRunner with a single BQAA plugin has shared state. - Idempotent setup script structure: API enable, ADC check, dataset create only if missing, .env regeneration. The order matches what an operator would do by hand.
shutdown_timeout=15.0on the plugin config + theattempts=6, wait_seconds=15retry in the analyzer correctly hedges against BigQuery streaming insert latency. The shape is right; H3 just adds the missing partial-results check._write_prompt_diffusingdifflib.unified_diffis the right tool, and the resultingprompt_diff.mdis the cleanest single-artifact "what changed in the prompt" view I've seen for an example like this.- VERIFICATION.md as a tracked file (vs the ignored
reports/) is a great pattern — it pins a known-good live run to the source tree so reviewers can sanity-check claimed numbers without running anything. - Lint-clean PR (
pyink,isort,bash -n, compileall). Only finding is the trailing blank-line at EOF flagged above.
Verdict — REQUEST_CHANGES on B1; H1–H5 + M1–M3 are tightening that materially improves trust
B1 (baseline --allow-failures) is the blocker — without it, one model whim aborts the demo before any of the actual self-evolution machinery runs.
H1 (fallback prompt presented as model output), H3 (partial-rows blind spot), H4 (token-extraction silent zero) are the three "the demo can silently misrepresent itself" findings — each cheap to fix, each meaningfully improves the trust users put in the verification numbers.
H2, H5, M1–M7, L1–L13 are tightening — bundle in or skip.
The PR is a clean, self-contained example with a real story and reproducible verification numbers. Once B1 + H1/H3/H4 land, this is mergeable, and the productionization roadmap section makes it a useful jumping-off point for the next demo iteration.
eaecfd8 to
fdf14fc
Compare
|
Updated the PR at head What changed:
Verification:
|
caohy1988
left a comment
There was a problem hiding this comment.
Re-review — PR #284 head fdf14fc
Checked out locally and verified every June 8 finding against the new head. All of them are fixed — the blocker, all five Highs, all seven Mediums I'd flagged as material, and most of the Lows — plus the rename to self_evolving_agent_demo is consistent end-to-end. The 5 new offline tests pass, pyink/isort clean, git diff --check clean, and the fresh live verification (run_20260609_171547) reports candidate source: model — directly closing the trust gap from H1.
Verified RESOLVED
- B1 — baseline step now passes
--allow-failures(run_e2e_demo.sh:63); a single V1 model whim no longer aborts the demo before analysis. Comparison gate failure is now an explicit JSON result with an opt-in--fail-on-gate-failureflag (compare_runs.py:261) instead of an unconditionalset -eabort — artifacts always land. - H1 — the bundled fallback prompt is deleted. Generator failure now raises
SystemExit("Prompt generation failed; no fallback prompt was promoted...")(analyze_and_evolve.py:308-312), and the candidate carriessource: "model"which flows intocandidate_prompt.jsonand renders as "Candidate source:model" incomparison.md. The demo can no longer silently present canned output as self-evolution. - H2 —
summarize([])now returns the full keyset includingavg_llm_calls: 0.0(session_metrics.py:240-252). Pinned bytest_summarize_empty_rows_has_stable_shape. - H3 —
require_complete_session_metrics(session_metrics.py:185-199) raises with the missing session IDs, called from bothanalyze_and_evolve.py:261andcompare_runs.py:47. Pinned bytest_require_complete_session_metrics_rejects_missing_rows. - H4 — zero-token schema sanity check raises
RuntimeErrorpointing the operator at LLM_RESPONSE rows in the configured table (session_metrics.py:205-209). Pinned bytest_require_complete_session_metrics_rejects_zero_token_schema. - H5 —
_pct_delta(0, X>0)returnsNone, rendered asn/a(compare_runs.py:53-61). Pinned bytest_pct_delta_marks_zero_baseline_growth_as_not_applicable. - M1 —
SystemEvaluatorpreferred with aCodeEvaluator as SystemEvaluatorfallback import (session_metrics.py:285-287) — neat shape that works on both pre- and post-#121 SDK versions. - M2/M3 — env mutation moved into
_configure_environment()called fromcreate_agent; unresolvablePROJECT_IDraises a clearRuntimeErrorat the top ofagent.pyinstead of silently clearingGOOGLE_CLOUD_PROJECT. - M4 —
batch_size=1now carries its intent comment ("rows visible quickly for this one-shot demo"). L2 — content cap reduced to 50 KiB. - M5/M6 — promotion thresholds exposed as
--min-quality-pass-rateand--min-broad-lookup-rateflags and threaded into_observations. Pinned bytest_observations_use_configured_thresholds. - M7 —
PYTHONPATHno longer prepends$REPO_ROOT/src. - L1 —
git diff --checkclean. L8 —setup.sh:86-101now cross-checks the existing dataset's location and hard-errors on mismatch. L11 —tests/test_self_evolving_agent_demo.pyadds 5 offline tests that map one-to-one onto the review findings; all pass.
Rename consistency — verified
nba_self_evolving_demo → self_evolving_agent_demo, env vars NBA_* → SELF_EVOLVING_*, lookup_nba_reference → lookup_basketball_reference — and the broad-tool name is consistent across the prompt, the eval cases' avoid_tool, and the SQL COUNTIF in session_metrics.py:132 (the demo's core signal, so this mattered). The basketball fixture content stays, which is fine — the branding is neutral. examples/README.md row updated. VERIFICATION.md regenerated from a fresh post-rename live run.
Verdict — APPROVE
This is a model response to review: every finding addressed at the right layer, each behavioral fix locked by a test named after the failure mode it prevents, and a fresh live verification that demonstrates the H1 fix (candidate source: model) rather than just claiming it. Nothing blocking remains.
Summary
Verification