diff --git a/examples/README.md b/examples/README.md
Lines changed: 1 addition & 0 deletions
diff --git a/examples/self_evolving_agent_demo/.gitignore b/examples/self_evolving_agent_demo/.gitignore
Lines changed: 5 additions & 0 deletions
diff --git a/examples/self_evolving_agent_demo/DEMO_NARRATION.md b/examples/self_evolving_agent_demo/DEMO_NARRATION.md
Lines changed: 28 additions & 0 deletions
diff --git a/examples/self_evolving_agent_demo/README.md b/examples/self_evolving_agent_demo/README.md
Lines changed: 212 additions & 0 deletions
diff --git a/examples/self_evolving_agent_demo/VERIFICATION.md b/examples/self_evolving_agent_demo/VERIFICATION.md
Lines changed: 101 additions & 0 deletions
diff --git a/examples/self_evolving_agent_demo/agent/__init__.py b/examples/self_evolving_agent_demo/agent/__init__.py
Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ artifacts that demonstrate SDK capabilities.
 |-----------|-------------|
 | [context_graph/](context_graph/) | Context Graph extraction: a runnable ADK agent + BQ AA plugin, the ontology-driven artifact pipeline (MAKO reference config), and the scheduled Cloud Run + Cloud Scheduler deploy. The advanced explicit-ontology path; for the primary one-artifact path see the [codelab](../docs/codelabs/periodic_materialization.md). |
 | [agent_improvement_cycle/](agent_improvement_cycle/) | LoopAgent-driven prompt improvement cycle |
+| [self_evolving_agent_demo/](self_evolving_agent_demo/) | Metric-driven self-evolution demo for a single ADK agent. Uses trace signals to generate and gate a bounded prompt evolution. |
 | [decision_lineage_demo/](decision_lineage_demo/) | Decision-lineage property graph (issue #98): live ADK media-planner agent + BQ AA Plugin running across 6 campaign sessions → SDK `build_context_graph(use_ai_generate=True, include_decisions=True)` → six GQL blocks pasted into BigQuery Studio (one renders an interactive graph diagram, one is a portfolio roll-up) |
 
 ## Reference Artifacts
 
@@ -0,0 +1,5 @@
+.env
+prompt_state.json
+reports/
+__pycache__/
+*/__pycache__/
@@ -0,0 +1,28 @@
+# Self-Evolving Agent Demo Narration
+
+## 30-second version
+
+This demo starts with a basketball analytics agent that answers correctly but
+wastes work. It logs every run to BigQuery through the analytics
+plugin. The SDK reads the traces, finds that the agent keeps calling a
+broad reference tool and spending excess tokens, generates a tighter V2
+prompt, reruns the same questions, and proves that quality stayed flat
+while token and tool usage dropped.
+
+## Walkthrough
+
+1. Run `./setup.sh`.
+2. Run `./run_e2e_demo.sh`.
+3. Watch the V1 run call broad and narrow sample tools.
+4. Watch `analyze_and_evolve.py` print the SDK-backed finding:
+   broad reference lookups were used on narrow tasks.
+5. Open `prompt_diff.md` to inspect the exact V1 -> generated V2 diff.
+6. Watch the V2 run use narrow tools directly.
+7. Open `comparison.md` for the final quality/token/tool diff.
+
+## Demo Message
+
+The important idea is not "save tokens" in isolation. The agent uses
+its own production-shaped traces as feedback. Token tracking gives the
+loop a measurable signal, but the goal is a self-evolving agent that
+gets cheaper or cleaner without losing answer quality.
@@ -0,0 +1,212 @@
+# Self-Evolving Agent Demo
+
+This demo shows a single ADK agent improving from its own logged
+behavior. The agent answers basketball analytics questions using deterministic
+fixture tools. V1 is intentionally wasteful: it loads broad basketball
+reference context and writes long scouting reports even when a narrow
+tool can answer the question. The BigQuery Agent Analytics Plugin logs
+the sessions to BigQuery, and the SDK reads those traces back to find a
+concrete improvement opportunity. The demo generates V2 during the run,
+then activates it only when the baseline answers already pass quality
+checks and the trace analysis shows broad-tool / token waste.
+
+```mermaid
+flowchart TD
+  A["Run sample agent V1"] --> B["Plugin logs agent_events to BigQuery"]
+  B --> C["SDK deterministic evaluators + trace SQL"]
+  C --> D["Find broad lookup and token waste"]
+  D --> E["Generate bounded V2 prompt"]
+  E --> F["Run same sample eval questions"]
+  F --> G["Show prompt diff + metric diff"]
+```
+
+The point is self-evolution. Token tracking is the measurement signal,
+not the product promise.
+
+This is a lightweight companion to `examples/agent_improvement_cycle/`.
+That demo shows a production-facing quality-improvement loop with
+Prompt Registry and Prompt Optimizer. This demo is intentionally smaller:
+it focuses on operational trace signals such as tool overuse and token
+waste, then gates a single generated prompt evolution against before/after
+metrics.
+
+## What Improves
+
+V1 behavior:
+
+- Calls `lookup_basketball_reference` before narrow tools.
+- Often calls more than one tool for a one-question task.
+- Produces long sectioned scouting reports.
+
+Generated V2 behavior:
+
+- Is created at runtime by a prompt generator from the SDK trace
+  summary, tool counts, quality summary, and available tool signatures.
+- Should use the cheapest sufficient narrow tool.
+- Should avoid `lookup_basketball_reference` unless no narrow tool fits.
+- Should give a short answer with decisive stats and a recommendation.
+
+The acceptance gate is:
+
+```mermaid
+flowchart TD
+  A["Generated V2"] --> B{"Quality not worse?"}
+  B -- no --> R["Reject"]
+  B -- yes --> C{"Avg tokens lower?"}
+  C -- no --> R
+  C -- yes --> D{"Broad lookup reduced?"}
+  D -- no --> R
+  D -- yes --> E{"No tool errors?"}
+  E -- no --> R
+  E -- yes --> P["Accept evolved prompt"]
+```
+
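The gate above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `compare_runs.py`: the `RunMetrics` class and its field names are hypothetical, the gate names are borrowed from the table in `VERIFICATION.md`, and the choice to check tool errors on the candidate run only is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Aggregate metrics for one eval run (hypothetical field names)."""

    quality_pass_rate: float  # fraction of eval cases that passed, 0.0 to 1.0
    avg_total_tokens: float
    broad_lookup_calls: int
    tool_errors: int


def evaluate_gates(baseline: RunMetrics, candidate: RunMetrics) -> dict[str, bool]:
    """Return one pass/fail flag per gate, in flowchart order."""
    return {
        # "Quality not worse?": the candidate must at least match the baseline.
        "quality_not_regressed": candidate.quality_pass_rate
        >= baseline.quality_pass_rate,
        # "Avg tokens lower?": a strict reduction is required.
        "tokens_reduced": candidate.avg_total_tokens < baseline.avg_total_tokens,
        # "Broad lookup reduced?": strictly fewer broad reference calls.
        "broad_lookup_reduced": candidate.broad_lookup_calls
        < baseline.broad_lookup_calls,
        # "No tool errors?": assumed to mean zero errors in the candidate run.
        "tool_errors_clear": candidate.tool_errors == 0,
    }


def accept(baseline: RunMetrics, candidate: RunMetrics) -> bool:
    """Accept the evolved prompt only when every gate passes."""
    return all(evaluate_gates(baseline, candidate).values())


v1 = RunMetrics(quality_pass_rate=1.0, avg_total_tokens=3640.2,
                broad_lookup_calls=4, tool_errors=0)
v2 = RunMetrics(quality_pass_rate=1.0, avg_total_tokens=1479.8,
                broad_lookup_calls=0, tool_errors=0)
print(evaluate_gates(v1, v2))
print("accepted:", accept(v1, v2))
```

Returning the per-gate flags rather than a single boolean keeps a rejection explainable: a report can show which gate failed instead of only the final verdict.
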
+## Run It
+
+Prerequisites:
+
+- Python 3.10+
+- `gcloud` and `bq` CLIs
+- Application Default Credentials
+- A Google Cloud project with billing enabled
+- IAM roles: BigQuery Data Editor, BigQuery Job User, and Vertex AI User
+
+Setup:
+
+```bash
+./setup.sh
+```
+
+If your default `python3` is older than 3.10, run with:
+
+```bash
+PYTHON_BIN=python3.11 ./setup.sh
+PYTHON_BIN=python3.11 ./run_e2e_demo.sh
+```
+
+Run the end-to-end demo:
+
+```bash
+./run_e2e_demo.sh
+```
+
+Reset local prompt state and reports:
+
+```bash
+./reset.sh
+```
+
+The expected cost of one default run is typically well under `$1`: four V1
+agent sessions, one small prompt-generation call, four generated-V2
+agent sessions, small BigQuery reads, and SDK deterministic evaluators.
+The demo does not deploy Cloud Run, Cloud Scheduler, Workflows, or any
+long-running infrastructure.
+
+## Outputs
+
+Each run writes a timestamped directory under `reports/`:
+
+```text
+reports/run_<timestamp>/
+├── latest_eval_results_baseline.json  # V1 answers + session IDs
+├── candidate_prompt.json              # model-generated V2 prompt
+├── prompt_diff.md                     # exact V1 -> generated V2 diff
+├── self_evolution_analysis.json       # SDK-backed evolution decision
+├── latest_eval_results_evolved.json   # V2 answers + session IDs
+├── comparison.json                    # before/after gates
+└── comparison.md                      # readable metric diff report
+```
+
+For the main story, open these two files after a run:
+
+- `prompt_diff.md` — shows the exact prompt changes generated from
+  the trace/token signal.
+- `comparison.md` — shows quality, token, tool-call, and broad-lookup
+  deltas between agent V1 and generated V2.
+
+The tracked `VERIFICATION.md` file records the latest live end-to-end
+verification result for this demo.
+
+The raw traces land in:
+
+```text
+<PROJECT_ID>.self_evolving_agent_demo.agent_events
+```
+
+Override with:
+
+```bash
+export SELF_EVOLVING_DATASET_ID=my_dataset
+export SELF_EVOLVING_TABLE_ID=agent_events
+export SELF_EVOLVING_AGENT_MODEL=gemini-2.5-flash
+export SELF_EVOLVING_PROMPT_GENERATOR_MODEL=gemini-2.5-flash
+export DATASET_LOCATION=us-central1
+```
+
+Re-running `setup.sh` regenerates `.env` from the current environment.
+To customize a setting persistently, pass it as an environment variable
+when running setup, for example:
+
+```bash
+SELF_EVOLVING_AGENT_MODEL=gemini-2.5-pro ./setup.sh
+```
+
+Evolution thresholds can be tuned with:
+
+```bash
+python analyze_and_evolve.py \
+  --min-quality-pass-rate 1.0 \
+  --min-broad-lookup-rate 0.5 \
+  --max-avg-tool-calls 2.0
+```
+
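One plausible reading of these flags is sketched below: evolve only when baseline quality already meets the minimum and the traces show waste, either through a high broad-lookup rate or a high average tool-call count. The exact semantics inside `analyze_and_evolve.py` are not shown in this README, so the combination logic, the default values (copied from the command above), and the `should_evolve` helper are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parse the three threshold flags; defaults here are illustrative only."""
    parser = argparse.ArgumentParser(description="Illustrative threshold parsing.")
    parser.add_argument("--min-quality-pass-rate", type=float, default=1.0)
    parser.add_argument("--min-broad-lookup-rate", type=float, default=0.5)
    parser.add_argument("--max-avg-tool-calls", type=float, default=2.0)
    return parser


def should_evolve(
    quality_pass_rate: float,
    broad_lookup_rate: float,
    avg_tool_calls: float,
    args: argparse.Namespace,
) -> bool:
    """Assumed trigger: quality is already good AND the traces show waste."""
    quality_ok = quality_pass_rate >= args.min_quality_pass_rate
    waste_seen = (
        broad_lookup_rate >= args.min_broad_lookup_rate
        or avg_tool_calls > args.max_avg_tool_calls
    )
    return quality_ok and waste_seen


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--min-broad-lookup-rate", "0.5"])

# Baseline signals of the kind reported in VERIFICATION.md:
# 100% quality, broad lookups in 4 of 4 sessions, 2.5 tool calls on average.
print(should_evolve(1.0, 4 / 4, 2.5, args))
```
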
+## File Map
+
+```text
+examples/self_evolving_agent_demo/
+├── README.md
+├── DEMO_NARRATION.md
+├── VERIFICATION.md
+├── setup.sh
+├── reset.sh
+├── run_e2e_demo.sh
+├── run_agent.py
+├── analyze_and_evolve.py
+├── compare_runs.py
+├── agent/
+│   ├── agent.py
+│   ├── prompts.py
+│   ├── prompt_store.py
+│   └── tools.py
+├── analytics/
+│   └── session_metrics.py
+└── eval/
+    └── eval_cases.json
+```
+
+## Productionization Roadmap
+
+The demo is intentionally one-shot. A production self-evolving loop
+would add durable orchestration, approvals, and rollout controls:
+
+```mermaid
+flowchart LR
+  A["Scheduler"] --> B["Cloud Run Job"]
+  B --> C["Analyze recent BigQuery traces"]
+  C --> D["Generate prompt or skill candidate"]
+  D --> E["Regression eval gate"]
+  E --> F["Human approval or policy gate"]
+  F --> G["Prompt Registry / config rollout"]
+  G --> H["Canary traffic"]
+  H --> C
+```
+
+Recommended next steps:
+
+- Store accepted and rejected candidates in BigQuery.
+- Add prompt registry support for managed version history.
+- Add a human approval step before production rollout.
+- Add canary routing and automatic rollback if quality or cost
+  regressions appear.
+- Extend the candidate generator from full-prompt generation to bounded
+  prompt/skill patch optimization.
@@ -0,0 +1,101 @@
+# Live Verification
+
+Last verified: 2026-06-09, America/Los_Angeles
+
+Run id: `run_20260609_171547`
+
+Command:
+
+```bash
+PYTHON_BIN=/path/to/python3.10+ ./run_e2e_demo.sh
+```
+
+Raw local artifacts were written to:
+
+```text
+reports/run_20260609_171547/
+```
+
+The raw `reports/` directory remains ignored because it is per-run output.
+This file records the live end-to-end result that should be stable enough
+to keep with the demo source.
+
+## What Ran
+
+```mermaid
+flowchart LR
+  A["ADK sample agent V1"] --> B["BigQuery analytics plugin"]
+  B --> C["BigQuery trace table"]
+  C --> D["SDK evaluators + trace SQL"]
+  D --> E["Gemini prompt generator"]
+  E --> F["Generated V2 prompt"]
+  F --> G["ADK sample agent V2"]
+  G --> H["Before/after gate report"]
+```
+
+The live run exercised:
+
+- ADK agent execution with Gemini.
+- BigQuery Agent Analytics Plugin trace logging.
+- BigQuery trace readback from
+  `rag-chatbot-485501.self_evolving_agent_demo.agent_events`.
+- SDK deterministic evaluator checks for token efficiency, cost, turn count,
+  and error rate.
+- Runtime generation of a replacement V2 prompt.
+- Evolved-agent rerun against the same deterministic sample eval set.
+- Before/after comparison gates.
+
+## Generated Change
+
+The generated V2 prompt changed the agent from broad-first behavior to a
+narrowest-sufficient-tool policy:
+
+- Player comparison -> `compare_players`.
+- Team comparison -> `compare_teams`.
+- Named-player scoring/profile/quick-read -> `get_player_stats`.
+- Named-team strategy/strengths/profile/late-game offense ->
+  `get_team_profile`.
+- `lookup_basketball_reference` only for broad, league-wide, or unsupported
+  ambiguous questions.
+
+It also changed the answer style from a long fixed scouting-report format
+to at most four bullets or 120 words.
+
+Candidate source: `model`.
+
+## Metrics
+
+| Metric | V1 | Generated V2 | Delta |
+|---|---:|---:|---:|
+| Quality pass rate | 100% | 100% | +0% |
+| Avg total tokens | 3640.2 | 1479.8 | -59.4% |
+| Avg tool calls | 2.5 | 1.0 | -60.0% |
+| Broad lookup calls | 4 | 0 | -4 |
+| Tool errors | 0 | 0 | +0 |
+
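The percentage deltas in the table can be rechecked with a short calculation. Recomputing from the displayed one-decimal averages gives roughly -59.3% for tokens, so the reported -59.4% presumably comes from unrounded averages. With four sessions per run, averages of 3640.25 and 1479.75 would be consistent with both the displayed values and the reported delta; those unrounded figures are an inference, not values taken from the run artifacts. The `pct_delta` helper is hypothetical.

```python
def pct_delta(before: float, after: float) -> str:
    """Relative change from before to after, as a signed one-decimal percent."""
    return f"{(after - before) / before * 100:+.1f}%"


# From the rounded averages displayed in the table.
print(pct_delta(3640.2, 1479.8))    # -59.3%

# From the assumed unrounded averages (inference, see the note above).
print(pct_delta(3640.25, 1479.75))  # -59.4%

# Average tool calls: 2.5 -> 1.0.
print(pct_delta(2.5, 1.0))          # -60.0%
```
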
+## Gates
+
+| Gate | Result |
+|---|---:|
+| `quality_not_regressed` | PASS |
+| `tokens_reduced` | PASS |
+| `broad_lookup_reduced` | PASS |
+| `tool_errors_clear` | PASS |
+
+Final result: PASS.
+
+## Baseline SDK Signals
+
+The SDK-backed analysis observed the following V1 signals before generating
+the V2 prompt:
+
+- Sessions: 4.
+- Avg total tokens: 3640.2.
+- Avg tool calls: 2.5.
+- Broad lookup sessions: 4/4.
+- Quality pass rate: 100%.
+- Cost evaluator average observed value: 0.0015.
+
+The default one-run cost remains well under `$1`: the run uses four V1
+agent sessions, one prompt-generation call, four generated-V2 sessions,
+and small BigQuery reads.
@@ -0,0 +1 @@
+"""Agent package for the self-evolving agent demo."""