Commit fdf14fc

Add verified self-evolving agent demo
1 parent 73bfd9c commit fdf14fc

20 files changed

Lines changed: 2454 additions & 0 deletions

examples/README.md

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ artifacts that demonstrate SDK capabilities.
|-----------|-------------|
| [context_graph/](context_graph/) | Context Graph extraction: a runnable ADK agent + BQ AA plugin, the ontology-driven artifact pipeline (MAKO reference config), and the scheduled Cloud Run + Cloud Scheduler deploy. The advanced explicit-ontology path; for the primary one-artifact path see the [codelab](../docs/codelabs/periodic_materialization.md). |
| [agent_improvement_cycle/](agent_improvement_cycle/) | LoopAgent-driven prompt improvement cycle |
| [self_evolving_agent_demo/](self_evolving_agent_demo/) | Metric-driven self-evolution demo for a single ADK agent. Uses trace signals to generate and gate a bounded prompt evolution. |
| [decision_lineage_demo/](decision_lineage_demo/) | Decision-lineage property graph (issue #98): live ADK media-planner agent + BQ AA Plugin running across 6 campaign sessions → SDK `build_context_graph(use_ai_generate=True, include_decisions=True)` → six GQL blocks pasted into BigQuery Studio (one renders an interactive graph diagram, one is a portfolio roll-up) |

## Reference Artifacts
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
.env
prompt_state.json
reports/
__pycache__/
*/__pycache__/
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Self-Evolving Agent Demo Narration

## 30-second version

This demo starts with a basketball analytics agent that answers correctly but
wastes work. It logs every run to BigQuery through the analytics
plugin. The SDK reads the traces, finds that the agent keeps calling a
broad reference tool and spending excess tokens, generates a tighter V2
prompt, reruns the same questions, and proves that quality stayed flat
while token and tool usage dropped.

## Walkthrough

1. Run `./setup.sh`.
2. Run `./run_e2e_demo.sh`.
3. Watch the V1 run call broad and narrow sample tools.
4. Watch `analyze_and_evolve.py` print the SDK-backed finding:
   broad reference lookups were used on narrow tasks.
5. Open `prompt_diff.md` to inspect the exact V1 -> generated V2 diff.
6. Watch the V2 run use narrow tools directly.
7. Open `comparison.md` for the final quality/token/tool diff.

## Demo Message

The important idea is not "save tokens" in isolation. The agent uses
its own production-shaped traces as feedback. Token tracking gives the
loop a measurable signal, but the goal is a self-evolving agent that
gets cheaper or cleaner without losing answer quality.
Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
# Self-Evolving Agent Demo

This demo shows a single ADK agent improving from its own logged
behavior. The agent answers basketball analytics questions using deterministic
fixture tools. V1 is intentionally wasteful: it loads broad basketball
reference context and writes long scouting reports even when a narrow
tool can answer the question. The BigQuery Agent Analytics Plugin logs
the sessions to BigQuery, and the SDK reads those traces back to find a
concrete improvement opportunity. The demo generates V2 during the run,
then activates it only when the baseline answers already pass quality
checks and the trace analysis shows broad-tool / token waste.

```mermaid
flowchart TD
    A["Run sample agent V1"] --> B["Plugin logs agent_events to BigQuery"]
    B --> C["SDK deterministic evaluators + trace SQL"]
    C --> D["Find broad lookup and token waste"]
    D --> E["Generate bounded V2 prompt"]
    E --> F["Run same sample eval questions"]
    F --> G["Show prompt diff + metric diff"]
```

The point is self-evolution. Token tracking is the measurement signal,
not the product promise.

This is a lightweight companion to `examples/agent_improvement_cycle/`.
That demo shows a production-facing quality-improvement loop with
Prompt Registry and Prompt Optimizer. This demo is intentionally smaller:
it focuses on operational trace signals such as tool overuse and token
waste, then gates a single generated prompt evolution against before/after
metrics.

## What Improves

V1 behavior:

- Calls `lookup_basketball_reference` before narrow tools.
- Often calls more than one tool for a one-question task.
- Produces long sectioned scouting reports.

Generated V2 behavior:

- Is created at runtime by a prompt generator from the SDK trace
  summary, tool counts, quality summary, and available tool signatures.
- Should use the cheapest sufficient narrow tool.
- Should avoid `lookup_basketball_reference` unless no narrow tool fits.
- Should give a short answer with decisive stats and a recommendation.

The acceptance gate is:

```mermaid
flowchart TD
    A["Generated V2"] --> B{"Quality not worse?"}
    B -- no --> R["Reject"]
    B -- yes --> C{"Avg tokens lower?"}
    C -- no --> R
    C -- yes --> D{"Broad lookup reduced?"}
    D -- no --> R
    D -- yes --> E{"No tool errors?"}
    E -- no --> R
    E -- yes --> P["Accept evolved prompt"]
```
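
The gate in the flowchart can be read as four boolean checks that must all pass. The sketch below is one possible reading, not the demo's actual code: the `RunMetrics` fields and the exact comparisons (strict `<` for reductions, zero V2 tool errors) are assumptions made for illustration. The gate names match the ones reported in `VERIFICATION.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Aggregate metrics for one eval run (hypothetical field names)."""

    quality_pass_rate: float  # fraction of eval cases that pass, 0.0 to 1.0
    avg_total_tokens: float
    broad_lookup_calls: int
    tool_errors: int


def evaluate_gates(v1: RunMetrics, v2: RunMetrics) -> dict[str, bool]:
    """Return one pass/fail entry per gate, in flowchart order."""
    return {
        # Assumption: "not worse" means V2 is at least as good as V1.
        "quality_not_regressed": v2.quality_pass_rate >= v1.quality_pass_rate,
        # Assumption: reductions must be strict, not merely equal.
        "tokens_reduced": v2.avg_total_tokens < v1.avg_total_tokens,
        "broad_lookup_reduced": v2.broad_lookup_calls < v1.broad_lookup_calls,
        # Assumption: the gate looks only at the V2 run's tool errors.
        "tool_errors_clear": v2.tool_errors == 0,
    }


def accept(v1: RunMetrics, v2: RunMetrics) -> bool:
    """Accept the evolved prompt only when every gate passes."""
    return all(evaluate_gates(v1, v2).values())


# Values taken from the metrics table in VERIFICATION.md.
v1 = RunMetrics(quality_pass_rate=1.0, avg_total_tokens=3640.2,
                broad_lookup_calls=4, tool_errors=0)
v2 = RunMetrics(quality_pass_rate=1.0, avg_total_tokens=1479.8,
                broad_lookup_calls=0, tool_errors=0)
print(evaluate_gates(v1, v2))
print("accepted:", accept(v1, v2))
```

Returning a dictionary of named results, rather than a single boolean, keeps a rejected candidate explainable: the report can show which gate failed.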

## Run It

Prerequisites:

- Python 3.10+
- `gcloud` and `bq` CLIs
- Application Default Credentials
- A Google Cloud project with billing enabled
- IAM roles: BigQuery Data Editor, BigQuery Job User, and Vertex AI User

Setup:

```bash
./setup.sh
```

If your default `python3` is older than 3.10, run with:

```bash
PYTHON_BIN=python3.11 ./setup.sh
PYTHON_BIN=python3.11 ./run_e2e_demo.sh
```

Run the end-to-end demo:

```bash
./run_e2e_demo.sh
```

Reset local prompt state and reports:

```bash
./reset.sh
```

Expected default one-run cost is typically well under `$1`: four V1
agent sessions, one small prompt-generation call, four generated-V2
agent sessions, small BigQuery reads, and SDK deterministic evaluators.
The demo does not deploy Cloud Run, Cloud Scheduler, Workflows, or any
long-running infrastructure.

## Outputs

Each run writes a timestamped directory under `reports/`:

```text
reports/run_<timestamp>/
├── latest_eval_results_baseline.json   # V1 answers + session IDs
├── candidate_prompt.json               # model-generated V2 prompt
├── prompt_diff.md                      # exact V1 -> generated V2 diff
├── self_evolution_analysis.json        # SDK-backed evolution decision
├── latest_eval_results_evolved.json    # V2 answers + session IDs
├── comparison.json                     # before/after gates
└── comparison.md                       # readable metric diff report
```

For the main story, open these two files after a run:

- `prompt_diff.md`: shows the exact prompt changes generated from
  the trace/token signal.
- `comparison.md`: shows quality, token, tool-call, and broad-lookup
  deltas between agent V1 and generated V2.

The tracked `VERIFICATION.md` file records the latest live end-to-end
verification result for this demo.

The raw traces land in:

```text
<PROJECT_ID>.self_evolving_agent_demo.agent_events
```

Override the dataset, table, models, or dataset location with:

```bash
export SELF_EVOLVING_DATASET_ID=my_dataset
export SELF_EVOLVING_TABLE_ID=agent_events
export SELF_EVOLVING_AGENT_MODEL=gemini-2.5-flash
export SELF_EVOLVING_PROMPT_GENERATOR_MODEL=gemini-2.5-flash
export DATASET_LOCATION=us-central1
```
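
As an illustration of how the dataset and table overrides map onto the trace table name, here is a minimal sketch. It is not the demo's own configuration code: `TraceTable` and `resolve_trace_table` are hypothetical names, and the fallback values are inferred from the default table path shown above.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceTable:
    """Location of the trace table (hypothetical helper type)."""

    project_id: str
    dataset_id: str
    table_id: str

    @property
    def fqn(self) -> str:
        """Fully qualified name in <project>.<dataset>.<table> form."""
        return f"{self.project_id}.{self.dataset_id}.{self.table_id}"


def resolve_trace_table(project_id: str,
                        env: Mapping[str, str] = os.environ) -> TraceTable:
    """Apply the SELF_EVOLVING_* overrides on top of assumed defaults."""
    return TraceTable(
        project_id=project_id,
        # Assumed default, inferred from the documented table path.
        dataset_id=env.get("SELF_EVOLVING_DATASET_ID",
                           "self_evolving_agent_demo"),
        # Assumed default, inferred from the documented table path.
        table_id=env.get("SELF_EVOLVING_TABLE_ID", "agent_events"),
    )


# "my-project" is a placeholder project ID.
print(resolve_trace_table("my-project", env={}).fqn)
print(resolve_trace_table(
    "my-project", env={"SELF_EVOLVING_DATASET_ID": "my_dataset"}).fqn)
```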

Re-running `setup.sh` regenerates `.env` from the current environment.
To customize a setting persistently, pass it as an environment variable
when running setup, for example:

```bash
SELF_EVOLVING_AGENT_MODEL=gemini-2.5-pro ./setup.sh
```

Evolution thresholds can be tuned with:

```bash
python analyze_and_evolve.py \
  --min-quality-pass-rate 1.0 \
  --min-broad-lookup-rate 0.5 \
  --max-avg-tool-calls 2.0
```
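
How these three flags combine is not spelled out here. One plausible reading, offered strictly as an assumption, is that evolution triggers when baseline quality meets its minimum and the traces show waste, either through a high broad-lookup rate or through too many tool calls per session. The names below are hypothetical, and the threshold values are simply the ones from the command above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Threshold values mirroring the CLI flags (hypothetical container)."""

    min_quality_pass_rate: float = 1.0
    min_broad_lookup_rate: float = 0.5
    max_avg_tool_calls: float = 2.0


def should_evolve(quality_pass_rate: float,
                  broad_lookup_rate: float,
                  avg_tool_calls: float,
                  t: Thresholds = Thresholds()) -> bool:
    """Decide whether to generate a V2 prompt.

    Assumed semantics, not confirmed by the demo source: quality must be
    good enough AND at least one waste signal must be present.
    """
    quality_ok = quality_pass_rate >= t.min_quality_pass_rate
    broad_waste = broad_lookup_rate >= t.min_broad_lookup_rate
    tool_waste = avg_tool_calls > t.max_avg_tool_calls
    return quality_ok and (broad_waste or tool_waste)


# V1 signals reported in VERIFICATION.md: 100% quality, broad lookups in
# 4 of 4 sessions, 2.5 tool calls per session on average.
print(should_evolve(quality_pass_rate=1.0,
                    broad_lookup_rate=4 / 4,
                    avg_tool_calls=2.5))
```

Under this reading, a baseline that fails quality checks is never evolved for efficiency, which matches the README's statement that V2 activates only when baseline answers already pass.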

## File Map

```text
examples/self_evolving_agent_demo/
├── README.md
├── DEMO_NARRATION.md
├── VERIFICATION.md
├── setup.sh
├── reset.sh
├── run_e2e_demo.sh
├── run_agent.py
├── analyze_and_evolve.py
├── compare_runs.py
├── agent/
│   ├── agent.py
│   ├── prompts.py
│   ├── prompt_store.py
│   └── tools.py
├── analytics/
│   └── session_metrics.py
└── eval/
    └── eval_cases.json
```

## Productionization Roadmap

The demo is intentionally one-shot. A production self-evolving loop
would add durable orchestration, approvals, and rollout controls:

```mermaid
flowchart LR
    A["Scheduler"] --> B["Cloud Run Job"]
    B --> C["Analyze recent BigQuery traces"]
    C --> D["Generate prompt or skill candidate"]
    D --> E["Regression eval gate"]
    E --> F["Human approval or policy gate"]
    F --> G["Prompt Registry / config rollout"]
    G --> H["Canary traffic"]
    H --> C
```

Recommended next steps:

- Store accepted and rejected candidates in BigQuery.
- Add prompt registry support for managed version history.
- Add a human approval step before production rollout.
- Add canary routing and automatic rollback if quality or cost
  regressions appear.
- Extend the candidate generator from full-prompt generation to bounded
  prompt/skill patch optimization.
Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
# Live Verification

Last verified: 2026-06-09, America/Los_Angeles

Run id: `run_20260609_171547`

Command:

```bash
PYTHON_BIN=/path/to/python3.10+ ./run_e2e_demo.sh
```

Raw local artifacts were written to:

```text
reports/run_20260609_171547/
```

The raw `reports/` directory remains ignored because it is per-run output.
This file records the live end-to-end result that should be stable enough
to keep with the demo source.

## What Ran

```mermaid
flowchart LR
    A["ADK sample agent V1"] --> B["BigQuery analytics plugin"]
    B --> C["BigQuery trace table"]
    C --> D["SDK evaluators + trace SQL"]
    D --> E["Gemini prompt generator"]
    E --> F["Generated V2 prompt"]
    F --> G["ADK sample agent V2"]
    G --> H["Before/after gate report"]
```

The live run exercised:

- ADK agent execution with Gemini.
- BigQuery Agent Analytics Plugin trace logging.
- BigQuery trace readback from
  `rag-chatbot-485501.self_evolving_agent_demo.agent_events`.
- SDK deterministic evaluator checks for token efficiency, cost, turn count,
  and error rate.
- Runtime generation of a replacement V2 prompt.
- Evolved-agent rerun against the same deterministic sample eval set.
- Before/after comparison gates.

## Generated Change

The generated V2 prompt changed the agent from broad-first behavior to a
narrowest-sufficient-tool policy:

- Player comparison -> `compare_players`.
- Team comparison -> `compare_teams`.
- Named-player scoring/profile/quick-read -> `get_player_stats`.
- Named-team strategy/strengths/profile/late-game offense ->
  `get_team_profile`.
- `lookup_basketball_reference` only for broad, league-wide, or unsupported
  ambiguous questions.

Candidate source: `model`.

The generated prompt also changed the answer style from a long fixed
scouting-report format to at most four bullets or 120 words.

## Metrics

| Metric | V1 | Generated V2 | Delta |
|---|---:|---:|---:|
| Quality pass rate | 100% | 100% | +0% |
| Avg total tokens | 3640.2 | 1479.8 | -59.4% |
| Avg tool calls | 2.5 | 1.0 | -60.0% |
| Broad lookup calls | 4 | 0 | -4 |
| Tool errors | 0 | 0 | +0 |
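
The percentage deltas in the table appear to be relative changes against the V1 value. The sketch below shows that arithmetic; `pct_delta` is a hypothetical helper, not a function from the demo. Recomputing from the rounded averages shown in the table gives about -59.3% for tokens rather than -59.4%, presumably because the report derives its delta from unrounded averages.

```python
def pct_delta(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined when before is 0")
    return (after - before) / before * 100.0


# Avg tool calls: 2.5 -> 1.0
print(f"{pct_delta(2.5, 1.0):+.1f}%")        # -60.0%

# Avg total tokens, using the rounded table values: 3640.2 -> 1479.8
print(f"{pct_delta(3640.2, 1479.8):+.1f}%")  # -59.3%
```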

## Gates

| Gate | Result |
|---|---:|
| `quality_not_regressed` | PASS |
| `tokens_reduced` | PASS |
| `broad_lookup_reduced` | PASS |
| `tool_errors_clear` | PASS |

Final result: PASS.

## Baseline SDK Signals

The SDK-backed analysis observed the following V1 signals before generating
the V2 prompt:

- Sessions: 4.
- Avg total tokens: 3640.2.
- Avg tool calls: 2.5.
- Broad lookup sessions: 4/4.
- Quality pass rate: 100%.
- Cost evaluator average observed value: 0.0015.
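
To make the tool-related signals concrete, the sketch below derives an average tool-call count and a broad-lookup session count from per-session tool-call lists. The input shape, the `summarize_sessions` name, and the sample sessions are made up for illustration; they are not the real traces or the SDK's API. The sample was chosen only so that it lands on the same 2.5 average and 4/4 broad-lookup figures.

```python
BROAD_TOOL = "lookup_basketball_reference"


def summarize_sessions(tool_calls_by_session: dict[str, list[str]]) -> dict[str, float]:
    """Aggregate tool usage across sessions (hypothetical helper)."""
    n = len(tool_calls_by_session)
    if n == 0:
        raise ValueError("no sessions to summarize")
    total_calls = sum(len(calls) for calls in tool_calls_by_session.values())
    # A session counts as "broad" if it called the broad tool at least once.
    broad_sessions = sum(
        BROAD_TOOL in calls for calls in tool_calls_by_session.values()
    )
    return {
        "sessions": n,
        "avg_tool_calls": total_calls / n,
        "broad_lookup_sessions": broad_sessions,
        "broad_lookup_rate": broad_sessions / n,
    }


# Illustrative data only: four invented sessions, ten tool calls in total.
sample = {
    "s1": [BROAD_TOOL, "get_player_stats"],
    "s2": [BROAD_TOOL, "compare_players", "get_player_stats"],
    "s3": [BROAD_TOOL, "get_team_profile"],
    "s4": [BROAD_TOOL, "compare_teams", "get_team_profile"],
}
print(summarize_sessions(sample))
```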

The default one-run cost remains well under `$1`: the run uses four V1
agent sessions, one prompt-generation call, four generated-V2 sessions,
and small BigQuery reads.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
"""self-evolving agent demo agent package."""
