Merged
1 change: 1 addition & 0 deletions examples/README.md
@@ -53,6 +53,7 @@ artifacts that demonstrate SDK capabilities.
|-----------|-------------|
| [context_graph/](context_graph/) | Agent Context Graph: extract decision traces from your agent's context graph — a runnable ADK agent + BQ AA plugin streaming events, the codelab artifacts ([codelab/](context_graph/codelab/)), and the scheduled Cloud Run + Cloud Scheduler deploy ([periodic_materialization/](context_graph/periodic_materialization/)). Start with the [codelab](../docs/codelabs/periodic_materialization.md). |
| [agent_improvement_cycle/](agent_improvement_cycle/) | LoopAgent-driven prompt improvement cycle |
| [self_evolving_agent_demo/](self_evolving_agent_demo/) | Metric-driven self-evolution demo for a single ADK agent. Uses trace signals to generate and gate a bounded prompt evolution. |
| [decision_lineage_demo/](decision_lineage_demo/) | Decision-lineage property graph (issue #98): live ADK media-planner agent + BQ AA Plugin running across 6 campaign sessions → SDK `build_context_graph(use_ai_generate=True, include_decisions=True)` → six GQL blocks pasted into BigQuery Studio (one renders an interactive graph diagram, one is a portfolio roll-up) |

## Reference Artifacts
5 changes: 5 additions & 0 deletions examples/self_evolving_agent_demo/.gitignore
@@ -0,0 +1,5 @@
.env
prompt_state.json
reports/
__pycache__/
*/__pycache__/
28 changes: 28 additions & 0 deletions examples/self_evolving_agent_demo/DEMO_NARRATION.md
@@ -0,0 +1,28 @@
# Self-Evolving Agent Demo Narration

## 30-second version

This demo starts with a basketball analytics agent that answers correctly but
wastes work. It logs every run to BigQuery through the analytics
plugin. The SDK reads the traces, finds that the agent keeps calling a
broad reference tool and spending excess tokens, generates a tighter V2
prompt, reruns the same questions, and proves that quality stayed flat
while token and tool usage dropped.

## Walkthrough

1. Run `./setup.sh`.
2. Run `./run_e2e_demo.sh`.
3. Watch the V1 run call broad and narrow sample tools.
4. Watch `analyze_and_evolve.py` print the SDK-backed finding:
broad reference lookups were used on narrow tasks.
5. Open `prompt_diff.md` to inspect the exact V1 -> generated V2 diff.
6. Watch the V2 run use narrow tools directly.
7. Open `comparison.md` for the final quality/token/tool diff.

## Demo Message

The important idea is not "save tokens" in isolation. The agent uses
its own production-shaped traces as feedback. Token tracking gives the
loop a measurable signal, but the goal is a self-evolving agent that
gets cheaper or cleaner without losing answer quality.
212 changes: 212 additions & 0 deletions examples/self_evolving_agent_demo/README.md
@@ -0,0 +1,212 @@
# Self-Evolving Agent Demo

This demo shows a single ADK agent improving from its own logged
behavior. The agent answers basketball analytics questions using deterministic
fixture tools. V1 is intentionally wasteful: it loads broad basketball
reference context and writes long scouting reports even when a narrow
tool can answer the question. The BigQuery Agent Analytics Plugin logs
the sessions to BigQuery, and the SDK reads those traces back to find a
concrete improvement opportunity. The demo generates V2 during the run,
then activates it only when the baseline answers already pass quality
checks and the trace analysis shows broad-tool / token waste.

```mermaid
flowchart TD
A["Run sample agent V1"] --> B["Plugin logs agent_events to BigQuery"]
B --> C["SDK deterministic evaluators + trace SQL"]
C --> D["Find broad lookup and token waste"]
D --> E["Generate bounded V2 prompt"]
E --> F["Run same sample eval questions"]
F --> G["Show prompt diff + metric diff"]
```

The point is self-evolution. Token tracking is the measurement signal,
not the product promise.

This is a lightweight companion to `examples/agent_improvement_cycle/`.
That demo shows a production-facing quality-improvement loop with
Prompt Registry and Prompt Optimizer. This demo is intentionally smaller:
it focuses on operational trace signals such as tool overuse and token
waste, then gates a single generated prompt evolution against before/after
metrics.

## What Improves

V1 behavior:

- Calls `lookup_basketball_reference` before narrow tools.
- Often calls more than one tool for a one-question task.
- Produces long sectioned scouting reports.

Generated V2 behavior:

- Is created at runtime by a prompt generator from the SDK trace
summary, tool counts, quality summary, and available tool signatures.
- Should use the cheapest sufficient narrow tool.
- Should avoid `lookup_basketball_reference` unless no narrow tool fits.
- Should give a short answer with decisive stats and a recommendation.

The acceptance gate is:

```mermaid
flowchart TD
A["Generated V2"] --> B{"Quality not worse?"}
B -- no --> R["Reject"]
B -- yes --> C{"Avg tokens lower?"}
C -- no --> R
C -- yes --> D{"Broad lookup reduced?"}
D -- no --> R
D -- yes --> E{"No tool errors?"}
E -- no --> R
E -- yes --> P["Accept evolved prompt"]
```
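
The gate can be read as four independent checks that must all pass. The following is a minimal sketch, not the demo's actual implementation: the `RunMetrics` record, its field names, and the `acceptance_gates` helper are hypothetical, and the gate names simply mirror the ones listed in `VERIFICATION.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical per-run summary; the field names are assumptions."""

    quality_pass_rate: float  # fraction of eval cases that passed, 0.0 to 1.0
    avg_total_tokens: float
    broad_lookup_calls: int
    tool_errors: int


def acceptance_gates(v1: RunMetrics, v2: RunMetrics) -> dict[str, bool]:
    """Evaluate every gate from the flowchart and report each result."""
    return {
        "quality_not_regressed": v2.quality_pass_rate >= v1.quality_pass_rate,
        "tokens_reduced": v2.avg_total_tokens < v1.avg_total_tokens,
        "broad_lookup_reduced": v2.broad_lookup_calls < v1.broad_lookup_calls,
        "tool_errors_clear": v2.tool_errors == 0,
    }


def accept(v1: RunMetrics, v2: RunMetrics) -> bool:
    """Accept the evolved prompt only if every gate passes."""
    return all(acceptance_gates(v1, v2).values())
```

Returning every gate result, rather than stopping at the first failure, keeps the report useful when a candidate is rejected for more than one reason.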

## Run It

Prerequisites:

- Python 3.10+
- `gcloud` and `bq` CLIs
- Application Default Credentials
- A Google Cloud project with billing enabled
- IAM roles: BigQuery Data Editor, BigQuery Job User, and Vertex AI User

Setup:

```bash
./setup.sh
```

If your default `python3` is older than 3.10, run with:

```bash
PYTHON_BIN=python3.11 ./setup.sh
PYTHON_BIN=python3.11 ./run_e2e_demo.sh
```

Run the end-to-end demo:

```bash
./run_e2e_demo.sh
```

Reset local prompt state and reports:

```bash
./reset.sh
```

A default run typically costs well under `$1`: four V1 agent sessions,
one small prompt-generation call, four generated-V2 agent sessions,
small BigQuery reads, and SDK deterministic evaluators. The demo does
not deploy Cloud Run, Cloud Scheduler, Workflows, or any long-running
infrastructure.

## Outputs

Each run writes a timestamped directory under `reports/`:

```text
reports/run_<timestamp>/
├── latest_eval_results_baseline.json # V1 answers + session IDs
├── candidate_prompt.json # model-generated V2 prompt
├── prompt_diff.md # exact V1 -> generated V2 diff
├── self_evolution_analysis.json # SDK-backed evolution decision
├── latest_eval_results_evolved.json # V2 answers + session IDs
├── comparison.json # before/after gates
└── comparison.md # readable metric diff report
```
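
To inspect the newest run programmatically, a small helper can pick the latest `run_<timestamp>` directory and load its `comparison.json`. This is a sketch under two assumptions: the timestamps sort lexicographically (as in names like `run_20260609_171547`), and the helper names `latest_run_dir` and `load_comparison` are hypothetical rather than part of the demo. The schema of `comparison.json` is deliberately not assumed.

```python
import json
from pathlib import Path
from typing import Any, Optional


def latest_run_dir(reports_root: Path) -> Optional[Path]:
    """Return the newest reports/run_<timestamp>/ directory, or None."""
    # Assumes timestamped names sort in chronological order as strings.
    runs = sorted(p for p in reports_root.glob("run_*") if p.is_dir())
    return runs[-1] if runs else None


def load_comparison(run_dir: Path) -> Any:
    """Parse comparison.json from a run directory without assuming its schema."""
    with (run_dir / "comparison.json").open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `load_comparison(latest_run_dir(Path("reports")))` would return the parsed gate report of the most recent run, provided at least one run exists.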

For the main story, open these two files after a run:

- `prompt_diff.md` — shows the exact prompt changes generated from
the trace/token signal.
- `comparison.md` — shows quality, token, tool-call, and broad-lookup
deltas between agent V1 and generated V2.

The tracked `VERIFICATION.md` file records the latest live end-to-end
verification result for this demo.

The raw traces land in:

```text
<PROJECT_ID>.self_evolving_agent_demo.agent_events
```

Override with:

```bash
export SELF_EVOLVING_DATASET_ID=my_dataset
export SELF_EVOLVING_TABLE_ID=agent_events
export SELF_EVOLVING_AGENT_MODEL=gemini-2.5-flash
export SELF_EVOLVING_PROMPT_GENERATOR_MODEL=gemini-2.5-flash
export DATASET_LOCATION=us-central1
```
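
As an illustration of how these overrides could resolve to a table reference, the sketch below builds the fully qualified table ID from the environment. The helper name `trace_table_id` is hypothetical, and the fallback values are taken from the default table path shown above; the demo scripts may resolve these settings differently.

```python
import os
from typing import Mapping, Optional


def trace_table_id(project_id: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Build <project>.<dataset>.<table>, honoring the override variables."""
    env = os.environ if env is None else env
    # Fallbacks mirror the default trace table path documented above.
    dataset = env.get("SELF_EVOLVING_DATASET_ID", "self_evolving_agent_demo")
    table = env.get("SELF_EVOLVING_TABLE_ID", "agent_events")
    return f"{project_id}.{dataset}.{table}"
```

Passing `env` explicitly keeps the helper easy to test without mutating the process environment.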

Re-running `setup.sh` regenerates `.env` from the current environment.
To customize a setting persistently, pass it as an environment variable
when running setup, for example:

```bash
SELF_EVOLVING_AGENT_MODEL=gemini-2.5-pro ./setup.sh
```

Evolution thresholds can be tuned with:

```bash
python analyze_and_evolve.py \
--min-quality-pass-rate 1.0 \
--min-broad-lookup-rate 0.5 \
--max-avg-tool-calls 2.0
```
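
One plausible reading of these thresholds is sketched below. The flag names come from the command above, but everything else is an assumption: the defaults merely echo the example values, and the way the signals combine (quality must pass, and at least one waste signal must fire) is a guess at the logic rather than a description of `analyze_and_evolve.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parse the three threshold flags; defaults echo the example command."""
    parser = argparse.ArgumentParser(description="Hypothetical threshold sketch")
    parser.add_argument("--min-quality-pass-rate", type=float, default=1.0)
    parser.add_argument("--min-broad-lookup-rate", type=float, default=0.5)
    parser.add_argument("--max-avg-tool-calls", type=float, default=2.0)
    return parser


def should_evolve(
    args: argparse.Namespace,
    quality_pass_rate: float,
    broad_lookup_rate: float,
    avg_tool_calls: float,
) -> bool:
    """Assumed rule: evolve only a passing baseline that shows tool waste."""
    quality_ok = quality_pass_rate >= args.min_quality_pass_rate
    wasteful = (
        broad_lookup_rate >= args.min_broad_lookup_rate
        or avg_tool_calls > args.max_avg_tool_calls
    )
    return quality_ok and wasteful
```

Under this reading, raising `--min-broad-lookup-rate` or `--max-avg-tool-calls` makes the demo less eager to generate a V2 prompt.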

## File Map

```text
examples/self_evolving_agent_demo/
├── README.md
├── DEMO_NARRATION.md
├── VERIFICATION.md
├── setup.sh
├── reset.sh
├── run_e2e_demo.sh
├── run_agent.py
├── analyze_and_evolve.py
├── compare_runs.py
├── agent/
│ ├── agent.py
│ ├── prompts.py
│ ├── prompt_store.py
│ └── tools.py
├── analytics/
│ └── session_metrics.py
└── eval/
└── eval_cases.json
```

## Productionization Roadmap

The demo is intentionally one-shot. A production self-evolving loop
would add durable orchestration, approvals, and rollout controls:

```mermaid
flowchart LR
A["Scheduler"] --> B["Cloud Run Job"]
B --> C["Analyze recent BigQuery traces"]
C --> D["Generate prompt or skill candidate"]
D --> E["Regression eval gate"]
E --> F["Human approval or policy gate"]
F --> G["Prompt Registry / config rollout"]
G --> H["Canary traffic"]
H --> C
```

Recommended next steps:

- Store accepted and rejected candidates in BigQuery.
- Add prompt registry support for managed version history.
- Add a human approval step before production rollout.
- Add canary routing and automatic rollback if quality or cost
regressions appear.
- Extend the candidate generator from full-prompt generation to bounded
prompt/skill patch optimization.
101 changes: 101 additions & 0 deletions examples/self_evolving_agent_demo/VERIFICATION.md
@@ -0,0 +1,101 @@
# Live Verification

Last verified: 2026-06-09, America/Los_Angeles

Run id: `run_20260609_171547`

Command:

```bash
PYTHON_BIN=/path/to/python3.10+ ./run_e2e_demo.sh
```

Raw local artifacts were written to:

```text
reports/run_20260609_171547/
```

The raw `reports/` directory stays git-ignored because it is per-run output.
This file records the live end-to-end result, which is stable enough to
keep with the demo source.

## What Ran

```mermaid
flowchart LR
A["ADK sample agent V1"] --> B["BigQuery analytics plugin"]
B --> C["BigQuery trace table"]
C --> D["SDK evaluators + trace SQL"]
D --> E["Gemini prompt generator"]
E --> F["Generated V2 prompt"]
F --> G["ADK sample agent V2"]
G --> H["Before/after gate report"]
```

The live run exercised:

- ADK agent execution with Gemini.
- BigQuery Agent Analytics Plugin trace logging.
- BigQuery trace readback from
`rag-chatbot-485501.self_evolving_agent_demo.agent_events`.
- SDK deterministic evaluator checks for token efficiency, cost, turn count,
and error rate.
- Runtime generation of a replacement V2 prompt.
- Evolved-agent rerun against the same deterministic sample eval set.
- Before/after comparison gates.

## Generated Change

The generated V2 prompt changed the agent from broad-first behavior to a
narrowest-sufficient-tool policy:

- Player comparison -> `compare_players`.
- Team comparison -> `compare_teams`.
- Named-player scoring/profile/quick-read -> `get_player_stats`.
- Named-team strategy/strengths/profile/late-game offense ->
`get_team_profile`.
- `lookup_basketball_reference` only for broad, league-wide, or unsupported
ambiguous questions.
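
The generated policy lives in prompt text, not in code, but its precedence can be made explicit as a lookup table with a broad-tool fallback. In this sketch the tool names come from the list above, while the question-kind labels and the `pick_tool` helper are hypothetical.

```python
BROAD_TOOL = "lookup_basketball_reference"

# Question-kind labels are illustrative; only the tool names are from the demo.
NARROW_TOOL_BY_QUESTION = {
    "player_comparison": "compare_players",
    "team_comparison": "compare_teams",
    "player_profile": "get_player_stats",
    "team_profile": "get_team_profile",
}


def pick_tool(question_kind: str) -> str:
    """Prefer the narrow tool for a known kind; otherwise use the broad lookup."""
    return NARROW_TOOL_BY_QUESTION.get(question_kind, BROAD_TOOL)
```

Any question kind without a narrow match, such as a league-wide query, falls through to the broad reference tool.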

Candidate source: `model`.

It also changed the answer style from a long fixed scouting-report format
to at most four bullets or 120 words.

## Metrics

| Metric | V1 | Generated V2 | Delta |
|---|---:|---:|---:|
| Quality pass rate | 100% | 100% | +0% |
| Avg total tokens | 3640.2 | 1479.8 | -59.4% |
| Avg tool calls | 2.5 | 1.0 | -60.0% |
| Broad lookup calls | 4 | 0 | -4 |
| Tool errors | 0 | 0 | +0 |
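
The percentage deltas follow the usual relative-change formula. A minimal sketch, assuming a hypothetical `percent_delta` helper that rounds to one decimal place:

```python
def percent_delta(before: float, after: float) -> float:
    """Relative change from before to after, in percent, rounded to 0.1."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return round((after - before) / before * 100, 1)
```

For the tool-call row, `percent_delta(2.5, 1.0)` gives `-60.0`. Recomputing a delta from the rounded averages in the table can differ from the reported value in the last digit, presumably because the report works from unrounded per-session values.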

## Gates

| Gate | Result |
|---|---:|
| `quality_not_regressed` | PASS |
| `tokens_reduced` | PASS |
| `broad_lookup_reduced` | PASS |
| `tool_errors_clear` | PASS |

Final result: PASS.

## Baseline SDK Signals

The SDK-backed analysis observed the following V1 signals before generating
the V2 prompt:

- Sessions: 4.
- Avg total tokens: 3640.2.
- Avg tool calls: 2.5.
- Broad lookup sessions: 4/4.
- Quality pass rate: 100%.
- Cost evaluator average observed value: 0.0015.

The default one-run cost remains well under `$1`: the run uses four V1
agent sessions, one prompt-generation call, four generated-V2 sessions,
and small BigQuery reads.
1 change: 1 addition & 0 deletions examples/self_evolving_agent_demo/agent/__init__.py
@@ -0,0 +1 @@
"""Agent package for the self-evolving agent demo."""