GoogleCloudPlatform
diff --git a/examples/README.md b/examples/README.md
Lines changed: 1 addition & 0 deletions
diff --git a/examples/nba_self_evolving_demo/.gitignore b/examples/nba_self_evolving_demo/.gitignore
Lines changed: 5 additions & 0 deletions
diff --git a/examples/nba_self_evolving_demo/DEMO_NARRATION.md b/examples/nba_self_evolving_demo/DEMO_NARRATION.md
Lines changed: 28 additions & 0 deletions
diff --git a/examples/nba_self_evolving_demo/README.md b/examples/nba_self_evolving_demo/README.md
Lines changed: 188 additions & 0 deletions
diff --git a/examples/nba_self_evolving_demo/VERIFICATION.md b/examples/nba_self_evolving_demo/VERIFICATION.md
Lines changed: 95 additions & 0 deletions
diff --git a/examples/nba_self_evolving_demo/agent/__init__.py b/examples/nba_self_evolving_demo/agent/__init__.py
Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ artifacts that demonstrate SDK capabilities.
 |-----------|-------------|
 | [context_graph/](context_graph/) | Context Graph extraction: a runnable ADK agent + BQ AA plugin, the ontology-driven artifact pipeline (MAKO reference config), and the scheduled Cloud Run + Cloud Scheduler deploy. The advanced explicit-ontology path; for the primary one-artifact path see the [codelab](../docs/codelabs/periodic_materialization.md). |
 | [agent_improvement_cycle/](agent_improvement_cycle/) | LoopAgent-driven prompt improvement cycle |
+| [nba_self_evolving_demo/](nba_self_evolving_demo/) | Single-agent self-evolution demo: NBA ADK agent logs traces to BigQuery, SDK analysis finds broad-tool/token waste, and a bounded prompt evolution is accepted only after before/after gates pass |
 | [decision_lineage_demo/](decision_lineage_demo/) | Decision-lineage property graph (issue #98): live ADK media-planner agent + BQ AA Plugin running across 6 campaign sessions → SDK `build_context_graph(use_ai_generate=True, include_decisions=True)` → six GQL blocks pasted into BigQuery Studio (one renders an interactive graph diagram, one is a portfolio roll-up) |
 
 ## Reference Artifacts
 
@@ -0,0 +1,5 @@
+.env
+prompt_state.json
+reports/
+__pycache__/
+*/__pycache__/
@@ -0,0 +1,28 @@
+# NBA Self-Evolving Agent Demo Narration
+
+## 30-second version
+
+This demo starts with an NBA analytics agent that answers correctly but
+wastes work. It logs every run to BigQuery through the analytics
+plugin. The SDK reads the traces, finds that the agent keeps calling a
+broad reference tool and spending excess tokens, generates a tighter V2
+prompt, reruns the same questions, and proves that quality stayed flat
+while token and tool usage dropped.
+
+## Walkthrough
+
+1. Run `./setup.sh`.
+2. Run `./run_e2e_demo.sh`.
+3. Watch the V1 run call broad and narrow NBA tools.
+4. Watch `analyze_and_evolve.py` print the SDK-backed finding:
+   broad reference lookups were used on narrow tasks.
+5. Open `prompt_diff.md` to inspect the exact V1 -> generated V2 diff.
+6. Watch the V2 run use narrow tools directly.
+7. Open `comparison.md` for the final quality/token/tool diff.
+
+## Demo Message
+
+The important idea is not "save tokens" in isolation. The agent uses
+its own production-shaped traces as feedback. Token tracking gives the
+loop a measurable signal, but the goal is a self-evolving agent that
+gets cheaper or cleaner without losing answer quality.
@@ -0,0 +1,188 @@
+# NBA Self-Evolving Agent Demo
+
+This demo shows a single ADK agent improving from its own logged
+behavior. The agent answers NBA analytics questions using deterministic
+fixture tools. V1 is intentionally wasteful: it loads broad NBA
+reference context and writes long scouting reports even when a narrow
+tool can answer the question. The BigQuery Agent Analytics Plugin logs
+the sessions to BigQuery, and the SDK reads those traces back to find a
+concrete improvement opportunity. The demo generates V2 during the run,
+then activates it only when the baseline answers already pass quality
+checks and the trace analysis shows broad-tool / token waste.
+
+```mermaid
+flowchart TD
+  A["Run NBA agent V1"] --> B["Plugin logs agent_events to BigQuery"]
+  B --> C["SDK CodeEvaluator + trace SQL"]
+  C --> D["Find broad lookup and token waste"]
+  D --> E["Generate bounded V2 prompt"]
+  E --> F["Run same NBA eval questions"]
+  F --> G["Show prompt diff + metric diff"]
+```
+
+The point is self-evolution. Token tracking is the measurement signal,
+not the product promise.
+
+## What Improves
+
+V1 behavior:
+
+- Calls `lookup_nba_reference` before narrow tools.
+- Often calls more than one tool for a one-question task.
+- Produces long sectioned scouting reports.
+
+Generated V2 behavior:
+
+- Is created at runtime by a prompt generator from the SDK trace
+  summary, tool counts, quality summary, and available tool signatures.
+- Should use the cheapest sufficient narrow tool.
+- Should avoid `lookup_nba_reference` unless no narrow tool fits.
+- Should give a short answer with decisive stats and a recommendation.
+
+The acceptance gate is:
+
+```mermaid
+flowchart TD
+  A["Generated V2"] --> B{"Quality not worse?"}
+  B -- no --> R["Reject"]
+  B -- yes --> C{"Avg tokens lower?"}
+  C -- no --> R
+  C -- yes --> D{"Broad lookup reduced?"}
+  D -- no --> R
+  D -- yes --> E{"No tool errors?"}
+  E -- no --> R
+  E -- yes --> P["Accept evolved prompt"]
+```
+
+## Run It
+
+Prerequisites:
+
+- Python 3.10+
+- `gcloud` and `bq` CLIs
+- Application Default Credentials
+- A Google Cloud project with billing enabled
+- IAM roles: BigQuery Data Editor, BigQuery Job User, and Vertex AI User
+
+Setup:
+
+```bash
+./setup.sh
+```
+
+If your default `python3` is older than 3.10, run with:
+
+```bash
+PYTHON_BIN=python3.11 ./setup.sh
+PYTHON_BIN=python3.11 ./run_e2e_demo.sh
+```
+
+Run the end-to-end demo:
+
+```bash
+./run_e2e_demo.sh
+```
+
+Reset local prompt state and reports:
+
+```bash
+./reset.sh
+```
+
+Expected cost for one default run is typically well under `$1`: four V1
+agent sessions, one small prompt-generation call, four generated-V2
+agent sessions, small BigQuery reads, and deterministic SDK evaluators.
+The demo does not deploy Cloud Run, Cloud Scheduler, Workflows, or any
+other long-running infrastructure.
+
+## Outputs
+
+Each run writes a timestamped directory under `reports/`:
+
+```text
+reports/run_<timestamp>/
+├── latest_eval_results_baseline.json  # V1 answers + session IDs
+├── candidate_prompt.json              # model-generated V2 prompt
+├── prompt_diff.md                     # exact V1 -> generated V2 diff
+├── self_evolution_analysis.json       # SDK-backed evolution decision
+├── latest_eval_results_evolved.json   # V2 answers + session IDs
+├── comparison.json                    # before/after gates
+└── comparison.md                      # readable metric diff report
+```
+
+For the main story, open these two files after a run:
+
+- `prompt_diff.md` — shows the exact prompt changes generated from
+  the trace/token signal.
+- `comparison.md` — shows quality, token, tool-call, and broad-lookup
+  deltas between agent V1 and generated V2.
+
+The tracked `VERIFICATION.md` file records the latest live end-to-end
+verification result for this demo.
+
+The raw traces land in:
+
+```text
+<PROJECT_ID>.nba_self_evolving_demo.agent_events
+```
+
+Override with:
+
+```bash
+export NBA_DATASET_ID=my_dataset
+export NBA_TABLE_ID=agent_events
+export NBA_AGENT_MODEL=gemini-2.5-flash
+export NBA_PROMPT_GENERATOR_MODEL=gemini-2.5-flash
+export DATASET_LOCATION=us-central1
+```
+
+## File Map
+
+```text
+examples/nba_self_evolving_demo/
+├── README.md
+├── DEMO_NARRATION.md
+├── VERIFICATION.md
+├── setup.sh
+├── reset.sh
+├── run_e2e_demo.sh
+├── run_agent.py
+├── analyze_and_evolve.py
+├── compare_runs.py
+├── agent/
+│   ├── agent.py
+│   ├── prompts.py
+│   ├── prompt_store.py
+│   └── tools.py
+├── analytics/
+│   └── session_metrics.py
+└── eval/
+    └── eval_cases.json
+```
+
+## Productionization Roadmap
+
+The demo is intentionally one-shot. A production self-evolving loop
+would add durable orchestration, approvals, and rollout controls:
+
+```mermaid
+flowchart LR
+  A["Scheduler"] --> B["Cloud Run Job"]
+  B --> C["Analyze recent BigQuery traces"]
+  C --> D["Generate prompt or skill candidate"]
+  D --> E["Regression eval gate"]
+  E --> F["Human approval or policy gate"]
+  F --> G["Prompt Registry / config rollout"]
+  G --> H["Canary traffic"]
+  H --> C
+```
+
+Recommended next steps:
+
+- Store accepted and rejected candidates in BigQuery.
+- Add prompt registry support for managed version history.
+- Add a human approval step before production rollout.
+- Add canary routing and automatic rollback if quality or cost
+  regressions appear.
+- Extend the candidate generator from full-prompt generation to bounded
+  prompt/skill patch optimization.
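The acceptance gate flowchart in this README can be read as four boolean checks that must all pass. The following is a minimal sketch under that reading, not the contents of `compare_runs.py`: `RunMetrics`, `evaluate_gates`, and `accept` are hypothetical names, the gate keys are borrowed from the table in `VERIFICATION.md`, and the exact comparison operators (strict `<` for tokens and broad lookups) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Aggregate metrics for one eval run (hypothetical shape)."""

    quality_pass_rate: float  # fraction of eval cases that passed, 0.0-1.0
    avg_total_tokens: float
    broad_lookup_calls: int
    tool_errors: int


def evaluate_gates(baseline: RunMetrics, candidate: RunMetrics) -> dict[str, bool]:
    """Return one pass/fail flag per gate in the flowchart."""
    return {
        "quality_not_regressed": candidate.quality_pass_rate >= baseline.quality_pass_rate,
        "tokens_reduced": candidate.avg_total_tokens < baseline.avg_total_tokens,
        "broad_lookup_reduced": candidate.broad_lookup_calls < baseline.broad_lookup_calls,
        "tool_errors_clear": candidate.tool_errors == 0,
    }


def accept(baseline: RunMetrics, candidate: RunMetrics) -> bool:
    """Accept the evolved prompt only if every gate passes."""
    return all(evaluate_gates(baseline, candidate).values())


# Values taken from the metrics table in VERIFICATION.md.
v1 = RunMetrics(quality_pass_rate=1.0, avg_total_tokens=3512.5,
                broad_lookup_calls=4, tool_errors=0)
v2 = RunMetrics(quality_pass_rate=1.0, avg_total_tokens=1419.5,
                broad_lookup_calls=0, tool_errors=0)

for gate, passed in evaluate_gates(v1, v2).items():
    print(f"{gate}: {'PASS' if passed else 'FAIL'}")
print("accepted:", accept(v1, v2))
```

Keeping each gate as a named flag, rather than a single combined boolean, makes it straightforward to report which check rejected a candidate.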
@@ -0,0 +1,95 @@
+# Live Verification
+
+Last verified: 2026-06-04, America/Los_Angeles
+
+Run id: `run_20260604_105058`
+
+Command:
+
+```bash
+PYTHON_BIN=/path/to/python3.10+ ./run_e2e_demo.sh
+```
+
+Raw local artifacts were written to:
+
+```text
+reports/run_20260604_105058/
+```
+
+The raw `reports/` directory remains ignored because it is per-run output.
+This file records the live end-to-end result that should be stable enough
+to keep with the demo source.
+
+## What Ran
+
+```mermaid
+flowchart LR
+  A["ADK NBA agent V1"] --> B["BigQuery analytics plugin"]
+  B --> C["BigQuery trace table"]
+  C --> D["SDK evaluators + trace SQL"]
+  D --> E["Gemini prompt generator"]
+  E --> F["Generated V2 prompt"]
+  F --> G["ADK NBA agent V2"]
+  G --> H["Before/after gate report"]
+```
+
+The live run exercised:
+
+- ADK agent execution with Gemini.
+- BigQuery Agent Analytics Plugin trace logging.
+- BigQuery trace readback.
+- SDK `CodeEvaluator` checks for token efficiency, cost, turn count, and
+  error rate.
+- Runtime generation of a replacement V2 prompt.
+- Evolved-agent rerun against the same deterministic NBA eval set.
+- Before/after comparison gates.
+
+## Generated Change
+
+The generated V2 prompt changed the agent from broad-first behavior to a
+narrowest-sufficient-tool policy:
+
+- Player comparison -> `compare_players`.
+- Team comparison -> `compare_teams`.
+- Named-player scoring/profile/quick-read -> `get_player_stats`.
+- Named-team strategy/strengths/profile/late-game offense ->
+  `get_team_profile`.
+- `lookup_nba_reference` only for broad, league-wide, or unsupported
+  ambiguous questions.
+
+It also changed the answer style from a long fixed scouting-report format
+to at most four bullets or 120 words.
+
+## Metrics
+
+| Metric | V1 | Generated V2 | Delta |
+|---|---:|---:|---:|
+| Quality pass rate | 100% | 100% | +0% |
+| Avg total tokens | 3512.5 | 1419.5 | -59.6% |
+| Avg tool calls | 3.0 | 1.0 | -66.7% |
+| Broad lookup calls | 4 | 0 | -4 |
+| Tool errors | 0 | 0 | +0 |
+
+## Gates
+
+| Gate | Result |
+|---|---:|
+| `quality_not_regressed` | PASS |
+| `tokens_reduced` | PASS |
+| `broad_lookup_reduced` | PASS |
+| `tool_errors_clear` | PASS |
+
+Final result: PASS.
+
+## Baseline SDK Signals
+
+The SDK-backed analysis observed the following V1 signals before generating
+the V2 prompt:
+
+- Sessions: 4.
+- Avg total tokens: 3512.5.
+- Avg tool calls: 3.0.
+- Broad lookup sessions: 4/4.
+- Quality pass rate: 100%.
+- Cost evaluator average observed value: 0.0014.
+
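The percentage deltas in the Metrics table of `VERIFICATION.md` are consistent with a plain relative-change formula. The sketch below assumes that formula; `pct_delta` is a hypothetical helper, not a function from the demo source, and the inputs are the values reported in the table.

```python
def pct_delta(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100.0


# Values from the Metrics table (V1 -> generated V2).
tokens_delta = pct_delta(3512.5, 1419.5)
tool_calls_delta = pct_delta(3.0, 1.0)

print(f"Avg total tokens: {tokens_delta:+.1f}%")    # -59.6%
print(f"Avg tool calls:   {tool_calls_delta:+.1f}%")  # -66.7%
```

Broad lookup calls are reported as an absolute difference (4 to 0 gives -4) because a relative change is less informative for small integer counts.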
@@ -0,0 +1 @@
+"""NBA self-evolving demo agent package."""