| 1 | +# Self-Evolving Agent Demo |
| 2 | + |
| 3 | +This demo shows a single ADK agent improving from its own logged |
| 4 | +behavior. The agent answers basketball analytics questions using deterministic |
| 5 | +fixture tools. V1 is intentionally wasteful: it loads broad basketball |
| 6 | +reference context and writes long scouting reports even when a narrow |
| 7 | +tool can answer the question. The BigQuery Agent Analytics Plugin logs |
| 8 | +the sessions to BigQuery, and the SDK reads those traces back to find a |
| 9 | +concrete improvement opportunity. The demo generates V2 during the run, |
| 10 | +then activates it only when the baseline answers already pass quality |
| 11 | +checks and the trace analysis shows broad-tool / token waste. |
| 12 | + |
| 13 | +```mermaid |
| 14 | +flowchart TD |
| 15 | + A["Run sample agent V1"] --> B["Plugin logs agent_events to BigQuery"] |
| 16 | + B --> C["SDK deterministic evaluators + trace SQL"] |
| 17 | + C --> D["Find broad lookup and token waste"] |
| 18 | + D --> E["Generate bounded V2 prompt"] |
| 19 | + E --> F["Run same sample eval questions"] |
| 20 | + F --> G["Show prompt diff + metric diff"] |
| 21 | +``` |
| 22 | + |
| 23 | +The point is self-evolution. Token tracking is the measurement signal, |
| 24 | +not the product promise. |
| 25 | + |
| 26 | +This is a lightweight companion to `examples/agent_improvement_cycle/`. |
| 27 | +That demo shows a production-facing quality-improvement loop with |
| 28 | +Prompt Registry and Prompt Optimizer. This demo is intentionally smaller: |
| 29 | +it focuses on operational trace signals such as tool overuse and token |
| 30 | +waste, then gates a single generated prompt evolution against before/after |
| 31 | +metrics. |
| 32 | + |
| 33 | +## What Improves |
| 34 | + |
| 35 | +V1 behavior: |
| 36 | + |
| 37 | +- Calls `lookup_basketball_reference` before narrow tools. |
| 38 | +- Often calls more than one tool for a single-question task. |
| 39 | +- Produces long sectioned scouting reports. |
| 40 | + |
| 41 | +Generated V2 behavior: |
| 42 | + |
| 43 | +- Is created at runtime by a prompt generator from the SDK trace |
| 44 | + summary, tool counts, quality summary, and available tool signatures. |
| 45 | +- Should use the cheapest sufficient narrow tool. |
| 46 | +- Should avoid `lookup_basketball_reference` unless no narrow tool fits. |
| 47 | +- Should give a short answer with decisive stats and a recommendation. |
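The V1 and V2 behaviors above can be reduced to two per-run numbers: how often the broad lookup is used and how many tools a session calls on average. The following is a minimal sketch of that reduction, not the demo's own code. Only `lookup_basketball_reference` comes from this README. The function name, the session layout, and the narrow tool names (`get_player_stats`, `get_team_record`) are hypothetical.

```python
BROAD_TOOL = "lookup_basketball_reference"  # the broad tool named in this README


def summarize_tool_usage(sessions: dict[str, list[str]]) -> dict[str, float]:
    """Summarize tool usage for sessions given as {session_id: [tool names called]}."""
    total = len(sessions)
    if total == 0:
        return {"broad_lookup_rate": 0.0, "avg_tool_calls": 0.0}
    # A session counts as "broad" if it called the broad lookup at least once.
    broad = sum(1 for calls in sessions.values() if BROAD_TOOL in calls)
    tool_calls = sum(len(calls) for calls in sessions.values())
    return {
        "broad_lookup_rate": broad / total,
        "avg_tool_calls": tool_calls / total,
    }


# Hypothetical narrow tool names, for illustration only.
v1_sessions = {
    "s1": [BROAD_TOOL, "get_player_stats"],
    "s2": [BROAD_TOOL, "get_team_record"],
}
v2_sessions = {"s1": ["get_player_stats"], "s2": ["get_team_record"]}

print(summarize_tool_usage(v1_sessions))
print(summarize_tool_usage(v2_sessions))
```

In this toy data, V1 has a broad lookup rate of 1.0 with 2.0 tool calls per session, and V2 has 0.0 and 1.0.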
| 48 | + |
| 49 | +The acceptance gate is: |
| 50 | + |
| 51 | +```mermaid |
| 52 | +flowchart TD |
| 53 | + A["Generated V2"] --> B{"Quality not worse?"} |
| 54 | + B -- no --> R["Reject"] |
| 55 | + B -- yes --> C{"Avg tokens lower?"} |
| 56 | + C -- no --> R |
| 57 | + C -- yes --> D{"Broad lookup reduced?"} |
| 58 | + D -- no --> R |
| 59 | + D -- yes --> E{"No tool errors?"} |
| 60 | + E -- no --> R |
| 61 | + E -- yes --> P["Accept evolved prompt"] |
| 62 | +``` |
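The gate in the flowchart can be sketched as a chain of early returns. This is an illustration under assumptions rather than the logic in `compare_runs.py`. The `RunMetrics` fields and the `accept_candidate` name are hypothetical, the tool-error check is assumed to apply to the V2 run, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Aggregate metrics for one eval run (hypothetical field names)."""

    quality_pass_rate: float  # fraction of eval cases passing quality checks
    avg_tokens: float  # mean total tokens per session
    broad_lookup_calls: int  # calls to the broad lookup tool
    tool_errors: int  # tool calls that ended in an error


def accept_candidate(v1: RunMetrics, v2: RunMetrics) -> tuple[bool, str]:
    """Apply the four gates in flowchart order; return (accepted, reason)."""
    if v2.quality_pass_rate < v1.quality_pass_rate:
        return False, "quality regressed"
    if v2.avg_tokens >= v1.avg_tokens:
        return False, "average tokens not lower"
    if v2.broad_lookup_calls >= v1.broad_lookup_calls:
        return False, "broad lookup not reduced"
    if v2.tool_errors > 0:
        return False, "tool errors present"
    return True, "accept evolved prompt"


# Illustrative numbers only, not measured results.
baseline = RunMetrics(quality_pass_rate=1.0, avg_tokens=5200.0,
                      broad_lookup_calls=4, tool_errors=0)
candidate = RunMetrics(quality_pass_rate=1.0, avg_tokens=1800.0,
                       broad_lookup_calls=0, tool_errors=0)
print(accept_candidate(baseline, candidate))
```

Evaluating the gates in a fixed order means a rejection always reports the first failed check, which keeps the reason easy to read in a report.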
| 63 | + |
| 64 | +## Run It |
| 65 | + |
| 66 | +Prerequisites: |
| 67 | + |
| 68 | +- Python 3.10+ |
| 69 | +- `gcloud` and `bq` CLIs |
| 70 | +- Application Default Credentials |
| 71 | +- A Google Cloud project with billing enabled |
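| 72 | +- IAM roles: BigQuery Data Editor (`roles/bigquery.dataEditor`), BigQuery Job User (`roles/bigquery.jobUser`), and Vertex AI User (`roles/aiplatform.user`) |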
| 73 | + |
| 74 | +Setup: |
| 75 | + |
| 76 | +```bash |
| 77 | +./setup.sh |
| 78 | +``` |
| 79 | + |
| 80 | +If your default `python3` is older than 3.10, point both scripts at a newer interpreter with `PYTHON_BIN`: |
| 81 | + |
| 82 | +```bash |
| 83 | +PYTHON_BIN=python3.11 ./setup.sh |
| 84 | +PYTHON_BIN=python3.11 ./run_e2e_demo.sh |
| 85 | +``` |
| 86 | + |
| 87 | +Run the end-to-end demo: |
| 88 | + |
| 89 | +```bash |
| 90 | +./run_e2e_demo.sh |
| 91 | +``` |
| 92 | + |
| 93 | +Reset local prompt state and reports: |
| 94 | + |
| 95 | +```bash |
| 96 | +./reset.sh |
| 97 | +``` |
| 98 | + |
| 99 | +A default run typically costs well under `$1`: four V1 |
| 100 | +agent sessions, one small prompt-generation call, four generated-V2 |
| 101 | +agent sessions, small BigQuery reads, and SDK deterministic evaluators. |
| 102 | +The demo does not deploy Cloud Run, Cloud Scheduler, Workflows, or any |
| 103 | +long-running infrastructure. |
| 104 | + |
| 105 | +## Outputs |
| 106 | + |
| 107 | +Each run writes a timestamped directory under `reports/`: |
| 108 | + |
| 109 | +```text |
| 110 | +reports/run_<timestamp>/ |
| 111 | +├── latest_eval_results_baseline.json # V1 answers + session IDs |
| 112 | +├── candidate_prompt.json # model-generated V2 prompt |
| 113 | +├── prompt_diff.md # exact V1 -> generated V2 diff |
| 114 | +├── self_evolution_analysis.json # SDK-backed evolution decision |
| 115 | +├── latest_eval_results_evolved.json # V2 answers + session IDs |
| 116 | +├── comparison.json # before/after gates |
| 117 | +└── comparison.md # readable metric diff report |
| 118 | +``` |
| 119 | + |
| 120 | +For the main story, open these two files after a run: |
| 121 | + |
| 122 | +- `prompt_diff.md`: shows the exact prompt changes generated from |
| 123 | + the trace/token signal. |
| 124 | +- `comparison.md`: shows quality, token, tool-call, and broad-lookup |
| 125 | + deltas between agent V1 and generated V2. |
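To jump straight to those two files, a small helper can locate the most recent run directory. This is a minimal sketch and not part of the demo. It assumes run directories match `run_*` under `reports/`, and it picks the newest by modification time because the timestamp format is not specified here. Both function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(reports_root: Path) -> Optional[Path]:
    """Return the most recently modified run_* directory, or None if there is none."""
    runs = [p for p in reports_root.glob("run_*") if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def main_story_files(run_dir: Path) -> list[Path]:
    """Paths of the two reports that carry the main story."""
    return [run_dir / name for name in ("prompt_diff.md", "comparison.md")]


if __name__ == "__main__":
    run_dir = latest_run_dir(Path("reports"))
    if run_dir is None:
        print("No runs found under reports/")
    else:
        for path in main_story_files(run_dir):
            status = "ok" if path.exists() else "missing"
            print(f"{path} [{status}]")
```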
| 126 | + |
| 127 | +The tracked `VERIFICATION.md` file records the latest live end-to-end |
| 128 | +verification result for this demo. |
| 129 | + |
| 130 | +The raw traces land in: |
| 131 | + |
| 132 | +```text |
| 133 | +<PROJECT_ID>.self_evolving_agent_demo.agent_events |
| 134 | +``` |
| 135 | + |
| 136 | +Override the dataset, table, models, and dataset location with: |
| 137 | + |
| 138 | +```bash |
| 139 | +export SELF_EVOLVING_DATASET_ID=my_dataset |
| 140 | +export SELF_EVOLVING_TABLE_ID=agent_events |
| 141 | +export SELF_EVOLVING_AGENT_MODEL=gemini-2.5-flash |
| 142 | +export SELF_EVOLVING_PROMPT_GENERATOR_MODEL=gemini-2.5-flash |
| 143 | +export DATASET_LOCATION=us-central1 |
| 144 | +``` |
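As a rough illustration of how the dataset and table overrides combine into the fully qualified table name, the sketch below resolves the ID from environment variables. The variable names and default values come from this README. The `resolve_trace_table` helper is hypothetical, and whether the demo reads these variables directly or through `.env` is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_trace_table(project_id: str,
                        env: Optional[Mapping[str, str]] = None) -> str:
    """Build <project>.<dataset>.<table>, falling back to the documented defaults."""
    env = os.environ if env is None else env
    dataset = env.get("SELF_EVOLVING_DATASET_ID", "self_evolving_agent_demo")
    table = env.get("SELF_EVOLVING_TABLE_ID", "agent_events")
    return f"{project_id}.{dataset}.{table}"


# "my-project" is a placeholder project ID.
print(resolve_trace_table("my-project", env={}))
print(resolve_trace_table("my-project", env={"SELF_EVOLVING_DATASET_ID": "my_dataset"}))
```

The first call prints `my-project.self_evolving_agent_demo.agent_events`. The second prints `my-project.my_dataset.agent_events`.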
| 145 | + |
| 146 | +Re-running `setup.sh` regenerates `.env` from the current environment. |
| 147 | +To customize a setting persistently, pass it as an environment variable |
| 148 | +when running setup, for example: |
| 149 | + |
| 150 | +```bash |
| 151 | +SELF_EVOLVING_AGENT_MODEL=gemini-2.5-pro ./setup.sh |
| 152 | +``` |
| 153 | + |
| 154 | +Evolution thresholds can be tuned with: |
| 155 | + |
| 156 | +```bash |
| 157 | +python analyze_and_evolve.py \ |
| 158 | + --min-quality-pass-rate 1.0 \ |
| 159 | + --min-broad-lookup-rate 0.5 \ |
| 160 | + --max-avg-tool-calls 2.0 |
| 161 | +``` |
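One plausible reading of these flags is sketched below. The flag names come from the command above, but their exact semantics are an assumption here: evolution triggers only when baseline quality meets the minimum pass rate and at least one waste signal crosses its threshold. The defaults reuse the example values, and `should_evolve` is a hypothetical name, not the function in `analyze_and_evolve.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the three threshold flags (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Sketch of the evolution trigger")
    parser.add_argument("--min-quality-pass-rate", type=float, default=1.0)
    parser.add_argument("--min-broad-lookup-rate", type=float, default=0.5)
    parser.add_argument("--max-avg-tool-calls", type=float, default=2.0)
    return parser


def should_evolve(quality_pass_rate: float, broad_lookup_rate: float,
                  avg_tool_calls: float, args: argparse.Namespace) -> bool:
    """Evolve only if the baseline is good enough AND a waste signal is present."""
    quality_ok = quality_pass_rate >= args.min_quality_pass_rate
    waste_seen = (
        broad_lookup_rate >= args.min_broad_lookup_rate
        or avg_tool_calls > args.max_avg_tool_calls
    )
    return quality_ok and waste_seen


args = build_parser().parse_args(["--min-broad-lookup-rate", "0.5"])
# Illustrative baseline: all answers pass, 75% of sessions use the broad lookup.
print(should_evolve(1.0, 0.75, 2.5, args))
```

Under this reading, raising `--min-broad-lookup-rate` makes the demo less eager to generate a V2 prompt, while lowering `--min-quality-pass-rate` lets it evolve from an imperfect baseline.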
| 162 | + |
| 163 | +## File Map |
| 164 | + |
| 165 | +```text |
| 166 | +examples/self_evolving_agent_demo/ |
| 167 | +├── README.md |
| 168 | +├── DEMO_NARRATION.md |
| 169 | +├── VERIFICATION.md |
| 170 | +├── setup.sh |
| 171 | +├── reset.sh |
| 172 | +├── run_e2e_demo.sh |
| 173 | +├── run_agent.py |
| 174 | +├── analyze_and_evolve.py |
| 175 | +├── compare_runs.py |
| 176 | +├── agent/ |
| 177 | +│ ├── agent.py |
| 178 | +│ ├── prompts.py |
| 179 | +│ ├── prompt_store.py |
| 180 | +│ └── tools.py |
| 181 | +├── analytics/ |
| 182 | +│ └── session_metrics.py |
| 183 | +└── eval/ |
| 184 | + └── eval_cases.json |
| 185 | +``` |
| 186 | + |
| 187 | +## Productionization Roadmap |
| 188 | + |
| 189 | +The demo is intentionally one-shot. A production self-evolving loop |
| 190 | +would add durable orchestration, approvals, and rollout controls: |
| 191 | + |
| 192 | +```mermaid |
| 193 | +flowchart LR |
| 194 | + A["Scheduler"] --> B["Cloud Run Job"] |
| 195 | + B --> C["Analyze recent BigQuery traces"] |
| 196 | + C --> D["Generate prompt or skill candidate"] |
| 197 | + D --> E["Regression eval gate"] |
| 198 | + E --> F["Human approval or policy gate"] |
| 199 | + F --> G["Prompt Registry / config rollout"] |
| 200 | + G --> H["Canary traffic"] |
| 201 | + H --> C |
| 202 | +``` |
| 203 | + |
| 204 | +Recommended next steps: |
| 205 | + |
| 206 | +- Store accepted and rejected candidates in BigQuery. |
| 207 | +- Add prompt registry support for managed version history. |
| 208 | +- Add a human approval step before production rollout. |
| 209 | +- Add canary routing and automatic rollback if quality or cost |
| 210 | + regressions appear. |
| 211 | +- Extend the candidate generator from full-prompt generation to bounded |
| 212 | + prompt/skill patch optimization. |