Commit eaecfd8

Add verified self-evolving agent demo
1 parent 73bfd9c commit eaecfd8

19 files changed

Lines changed: 2236 additions & 0 deletions

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ artifacts that demonstrate SDK capabilities.
 |-----------|-------------|
 | [context_graph/](context_graph/) | Context Graph extraction: a runnable ADK agent + BQ AA plugin, the ontology-driven artifact pipeline (MAKO reference config), and the scheduled Cloud Run + Cloud Scheduler deploy. The advanced explicit-ontology path; for the primary one-artifact path see the [codelab](../docs/codelabs/periodic_materialization.md). |
 | [agent_improvement_cycle/](agent_improvement_cycle/) | LoopAgent-driven prompt improvement cycle |
+| [nba_self_evolving_demo/](nba_self_evolving_demo/) | Single-agent self-evolution demo: NBA ADK agent logs traces to BigQuery, SDK analysis finds broad-tool/token waste, and a bounded prompt evolution is accepted only after before/after gates pass |
 | [decision_lineage_demo/](decision_lineage_demo/) | Decision-lineage property graph (issue #98): live ADK media-planner agent + BQ AA Plugin running across 6 campaign sessions → SDK `build_context_graph(use_ai_generate=True, include_decisions=True)` → six GQL blocks pasted into BigQuery Studio (one renders an interactive graph diagram, one is a portfolio roll-up) |

 ## Reference Artifacts
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
.env
prompt_state.json
reports/
__pycache__/
*/__pycache__/
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# NBA Self-Evolving Agent Demo Narration

## 30-second version

This demo starts with an NBA analytics agent that answers correctly but
wastes work. It logs every run to BigQuery through the analytics
plugin. The SDK reads the traces, finds that the agent keeps calling a
broad reference tool and spending excess tokens, generates a tighter V2
prompt, reruns the same questions, and proves that quality stayed flat
while token and tool usage dropped.

## Walkthrough

1. Run `./setup.sh`.
2. Run `./run_e2e_demo.sh`.
3. Watch the V1 run call broad and narrow NBA tools.
4. Watch `analyze_and_evolve.py` print the SDK-backed finding:
   broad reference lookups were used on narrow tasks.
5. Open `prompt_diff.md` to inspect the exact V1 -> generated V2 diff.
6. Watch the V2 run use narrow tools directly.
7. Open `comparison.md` for the final quality/token/tool diff.

## Demo Message

The important idea is not "save tokens" in isolation. The agent uses
its own production-shaped traces as feedback. Token tracking gives the
loop a measurable signal, but the goal is a self-evolving agent that
gets cheaper or cleaner without losing answer quality.
Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
# NBA Self-Evolving Agent Demo

This demo shows a single ADK agent improving from its own logged
behavior. The agent answers NBA analytics questions using deterministic
fixture tools. V1 is intentionally wasteful: it loads broad NBA
reference context and writes long scouting reports even when a narrow
tool can answer the question. The BigQuery Agent Analytics Plugin logs
the sessions to BigQuery, and the SDK reads those traces back to find a
concrete improvement opportunity. The demo generates V2 during the run,
then activates it only when the baseline answers already pass quality
checks and the trace analysis shows broad-tool / token waste.

```mermaid
flowchart TD
    A["Run NBA agent V1"] --> B["Plugin logs agent_events to BigQuery"]
    B --> C["SDK CodeEvaluator + trace SQL"]
    C --> D["Find broad lookup and token waste"]
    D --> E["Generate bounded V2 prompt"]
    E --> F["Run same NBA eval questions"]
    F --> G["Show prompt diff + metric diff"]
```

The point is self-evolution. Token tracking is the measurement signal,
not the product promise.

## What Improves

V1 behavior:

- Calls `lookup_nba_reference` before narrow tools.
- Often calls more than one tool for a one-question task.
- Produces long sectioned scouting reports.

Generated V2 behavior:

- Is created at runtime by a prompt generator from the SDK trace
  summary, tool counts, quality summary, and available tool signatures.
- Should use the cheapest sufficient narrow tool.
- Should avoid `lookup_nba_reference` unless no narrow tool fits.
- Should give a short answer with decisive stats and a recommendation.

The acceptance gate is:

```mermaid
flowchart TD
    A["Generated V2"] --> B{"Quality not worse?"}
    B -- no --> R["Reject"]
    B -- yes --> C{"Avg tokens lower?"}
    C -- no --> R
    C -- yes --> D{"Broad lookup reduced?"}
    D -- no --> R
    D -- yes --> E{"No tool errors?"}
    E -- no --> R
    E -- yes --> P["Accept evolved prompt"]
```
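
The gate can be read as a conjunction of four checks. The following is a minimal sketch of that reading, not the logic in `compare_runs.py`, which is not shown here and may differ. The `RunMetrics` class and its field names are hypothetical; the dictionary keys mirror the gate names recorded in `VERIFICATION.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical per-run summary; field names are illustrative only."""

    quality_pass_rate: float  # fraction of eval cases that pass, 0.0 to 1.0
    avg_total_tokens: float
    broad_lookup_calls: int
    tool_errors: int


def evaluate_gates(baseline: RunMetrics, candidate: RunMetrics) -> dict[str, bool]:
    """Return one boolean per gate, comparing the candidate to the baseline."""
    return {
        "quality_not_regressed": candidate.quality_pass_rate >= baseline.quality_pass_rate,
        "tokens_reduced": candidate.avg_total_tokens < baseline.avg_total_tokens,
        "broad_lookup_reduced": candidate.broad_lookup_calls < baseline.broad_lookup_calls,
        "tool_errors_clear": candidate.tool_errors == 0,
    }


def accept(baseline: RunMetrics, candidate: RunMetrics) -> bool:
    """Accept the evolved prompt only when every gate passes."""
    return all(evaluate_gates(baseline, candidate).values())
```

Returning the per-gate dictionary rather than a single boolean keeps a rejected candidate explainable: the report can name the gate that failed.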

## Run It

Prerequisites:

- Python 3.10+
- `gcloud` and `bq` CLIs
- Application Default Credentials
- A Google Cloud project with billing enabled
- IAM: BigQuery data editor/job user and Vertex AI user

Setup:

```bash
./setup.sh
```

If your default `python3` is older than 3.10, run with:

```bash
PYTHON_BIN=python3.11 ./setup.sh
PYTHON_BIN=python3.11 ./run_e2e_demo.sh
```

Run the end-to-end demo:

```bash
./run_e2e_demo.sh
```

Reset local prompt state and reports:

```bash
./reset.sh
```

Expected default one-run cost is typically well under `$1`: four V1
agent sessions, one small prompt-generation call, four generated-V2
agent sessions, small BigQuery reads, and SDK deterministic evaluators.
The demo does not deploy Cloud Run,
Scheduler, Workflows, or any long-running infrastructure.

## Outputs

Each run writes a timestamped directory under `reports/`:

```text
reports/run_<timestamp>/
├── latest_eval_results_baseline.json   # V1 answers + session IDs
├── candidate_prompt.json               # model-generated V2 prompt
├── prompt_diff.md                      # exact V1 -> generated V2 diff
├── self_evolution_analysis.json        # SDK-backed evolution decision
├── latest_eval_results_evolved.json    # V2 answers + session IDs
├── comparison.json                     # before/after gates
└── comparison.md                       # readable metric diff report
```

For the main story, open these two files after a run:

- `prompt_diff.md`: shows the exact prompt changes generated from
  the trace/token signal.
- `comparison.md`: shows quality, token, tool-call, and broad-lookup
  deltas between agent V1 and generated V2.

The tracked `VERIFICATION.md` file records the latest live end-to-end
verification result for this demo.

The raw traces land in:

```text
<PROJECT_ID>.nba_self_evolving_demo.agent_events
```

Override with:

```bash
export NBA_DATASET_ID=my_dataset
export NBA_TABLE_ID=agent_events
export NBA_AGENT_MODEL=gemini-2.5-flash
export NBA_PROMPT_GENERATOR_MODEL=gemini-2.5-flash
export DATASET_LOCATION=us-central1
```

## File Map

```text
examples/nba_self_evolving_demo/
├── README.md
├── DEMO_NARRATION.md
├── VERIFICATION.md
├── setup.sh
├── reset.sh
├── run_e2e_demo.sh
├── run_agent.py
├── analyze_and_evolve.py
├── compare_runs.py
├── agent/
│   ├── agent.py
│   ├── prompts.py
│   ├── prompt_store.py
│   └── tools.py
├── analytics/
│   └── session_metrics.py
└── eval/
    └── eval_cases.json
```

## Productionization Roadmap

The demo is intentionally one-shot. A production self-evolving loop
would add durable orchestration, approvals, and rollout controls:

```mermaid
flowchart LR
    A["Scheduler"] --> B["Cloud Run Job"]
    B --> C["Analyze recent BigQuery traces"]
    C --> D["Generate prompt or skill candidate"]
    D --> E["Regression eval gate"]
    E --> F["Human approval or policy gate"]
    F --> G["Prompt Registry / config rollout"]
    G --> H["Canary traffic"]
    H --> C
```

Recommended next steps:

- Store accepted and rejected candidates in BigQuery.
- Add prompt registry support for managed version history.
- Add a human approval step before production rollout.
- Add canary routing and automatic rollback if quality or cost
  regressions appear.
- Extend the candidate generator from full-prompt generation to bounded
  prompt/skill patch optimization.
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# Live Verification

Last verified: 2026-06-04, America/Los_Angeles

Run id: `run_20260604_105058`

Command:

```bash
PYTHON_BIN=/path/to/python3.10+ ./run_e2e_demo.sh
```

Raw local artifacts were written to:

```text
reports/run_20260604_105058/
```

The raw `reports/` directory remains ignored because it is per-run output.
This file records the live end-to-end result that should be stable enough
to keep with the demo source.

## What Ran

```mermaid
flowchart LR
    A["ADK NBA agent V1"] --> B["BigQuery analytics plugin"]
    B --> C["BigQuery trace table"]
    C --> D["SDK evaluators + trace SQL"]
    D --> E["Gemini prompt generator"]
    E --> F["Generated V2 prompt"]
    F --> G["ADK NBA agent V2"]
    G --> H["Before/after gate report"]
```

The live run exercised:

- ADK agent execution with Gemini.
- BigQuery Agent Analytics Plugin trace logging.
- BigQuery trace readback.
- SDK `CodeEvaluator` checks for token efficiency, cost, turn count, and
  error rate.
- Runtime generation of a replacement V2 prompt.
- Evolved-agent rerun against the same deterministic NBA eval set.
- Before/after comparison gates.

## Generated Change

The generated V2 prompt changed the agent from broad-first behavior to a
narrowest-sufficient-tool policy:

- Player comparison -> `compare_players`.
- Team comparison -> `compare_teams`.
- Named-player scoring/profile/quick-read -> `get_player_stats`.
- Named-team strategy/strengths/profile/late-game offense ->
  `get_team_profile`.
- `lookup_nba_reference` only for broad, league-wide, or unsupported
ambiguous questions.
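
In the demo this policy lives in the prompt and the model does the routing. Purely as an illustration, the same policy can be written as a lookup table with the broad tool as the fallback. The intent labels below are hypothetical; only the tool names come from the list above.

```python
# Hypothetical intent labels mapped to the narrow tools named in the policy.
NARROW_TOOL_BY_INTENT = {
    "player_comparison": "compare_players",
    "team_comparison": "compare_teams",
    "player_profile": "get_player_stats",
    "team_profile": "get_team_profile",
}

# Fallback for broad, league-wide, or unsupported questions.
BROAD_TOOL = "lookup_nba_reference"


def choose_tool(intent: str) -> str:
    """Return the narrowest sufficient tool, falling back to the broad one."""
    return NARROW_TOOL_BY_INTENT.get(intent, BROAD_TOOL)
```

For example, `choose_tool("team_comparison")` returns `compare_teams`, while an unrecognized label such as `"league_overview"` falls through to `lookup_nba_reference`.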

It also changed the answer style from a long fixed scouting-report format
to at most four bullets or 120 words.

## Metrics

| Metric | V1 | Generated V2 | Delta |
|---|---:|---:|---:|
| Quality pass rate | 100% | 100% | +0% |
| Avg total tokens | 3512.5 | 1419.5 | -59.6% |
| Avg tool calls | 3.0 | 1.0 | -66.7% |
| Broad lookup calls | 4 | 0 | -4 |
| Tool errors | 0 | 0 | +0 |
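
The percentage deltas for the averaged rows follow from relative change against V1, while the count rows (broad lookup calls, tool errors) report absolute differences. The sketch below reproduces the two percentage figures; `percent_delta` is a hypothetical helper, not a function from the demo.

```python
def percent_delta(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100.0


# Values taken from the metrics table above.
print(f"Avg total tokens: {percent_delta(3512.5, 1419.5):+.1f}%")  # -59.6%
print(f"Avg tool calls:   {percent_delta(3.0, 1.0):+.1f}%")        # -66.7%
```

The zero-baseline guard is why a row such as tool errors (0 to 0) is better reported as an absolute difference.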

## Gates

| Gate | Result |
|---|---:|
| `quality_not_regressed` | PASS |
| `tokens_reduced` | PASS |
| `broad_lookup_reduced` | PASS |
| `tool_errors_clear` | PASS |

Final result: PASS.

## Baseline SDK Signals

The SDK-backed analysis observed the following V1 signals before generating
the V2 prompt:

- Sessions: 4.
- Avg total tokens: 3512.5.
- Avg tool calls: 3.0.
- Broad lookup sessions: 4/4.
- Quality pass rate: 100%.
- Cost evaluator average observed value: 0.0014.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
"""NBA self-evolving demo agent package."""
