|
| 1 | +# NBA Self-Evolving Agent Demo |
| 2 | + |
| 3 | +This demo shows a single ADK agent improving from its own logged |
| 4 | +behavior. The agent answers NBA analytics questions using deterministic |
| 5 | +fixture tools. V1 is intentionally wasteful: it loads broad NBA |
| 6 | +reference context and writes long scouting reports even when a narrow |
| 7 | +tool can answer the question. The BigQuery Agent Analytics Plugin logs |
| 8 | +the sessions to BigQuery, and the SDK reads those traces back to find a |
| 9 | +concrete improvement opportunity. The demo generates V2 during the run, |
| 10 | +then activates it only when the baseline answers already pass quality |
| 11 | +checks and the trace analysis shows broad-tool / token waste. |
| 12 | + |
| 13 | +```mermaid |
| 14 | +flowchart TD |
| 15 | + A["Run NBA agent V1"] --> B["Plugin logs agent_events to BigQuery"] |
| 16 | + B --> C["SDK CodeEvaluator + trace SQL"] |
| 17 | + C --> D["Find broad lookup and token waste"] |
| 18 | + D --> E["Generate bounded V2 prompt"] |
| 19 | + E --> F["Run same NBA eval questions"] |
| 20 | + F --> G["Show prompt diff + metric diff"] |
| 21 | +``` |
| 22 | + |
| 23 | +The point is self-evolution. Token tracking is the measurement signal, |
| 24 | +not the product promise. |
| 25 | + |
| 26 | +## What Improves |
| 27 | + |
| 28 | +V1 behavior: |
| 29 | + |
| 30 | +- Calls `lookup_nba_reference` before narrow tools. |
| 31 | +- Often calls more than one tool for a one-question task. |
| 32 | +- Produces long sectioned scouting reports. |
| 33 | + |
| 34 | +Generated V2 behavior: |
| 35 | + |
| 36 | +- Is created at runtime by a prompt generator from the SDK trace |
| 37 | + summary, tool counts, quality summary, and available tool signatures. |
| 38 | +- Should use the cheapest sufficient narrow tool. |
| 39 | +- Should avoid `lookup_nba_reference` unless no narrow tool fits. |
| 40 | +- Should give a short answer with decisive stats and a recommendation. |
| 41 | + |
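The V1 patterns listed above could be counted from logged tool-call sequences roughly as follows. This is a minimal sketch, not the demo's actual analysis code: the `summarize_tool_use` helper, the shape of its input, and the `get_player_stats` tool name in the usage note are all hypothetical. Only `lookup_nba_reference` comes from this README.

```python
# Hypothetical helper: counts the wasteful V1 patterns in per-session tool calls.
BROAD_TOOL = "lookup_nba_reference"  # the broad tool named in this README


def summarize_tool_use(sessions: dict[str, list[str]]) -> dict[str, int]:
    """Summarize tool usage given {session_id: [tool names in call order]}."""
    broad_first = 0  # sessions whose first call is the broad lookup
    multi_tool = 0   # sessions that called more than one tool
    for calls in sessions.values():
        if calls and calls[0] == BROAD_TOOL:
            broad_first += 1
        if len(calls) > 1:
            multi_tool += 1
    return {
        "sessions": len(sessions),
        "broad_first": broad_first,
        "multi_tool": multi_tool,
    }
```

For example, `summarize_tool_use({"s1": ["lookup_nba_reference", "get_player_stats"], "s2": ["get_player_stats"]})` reports one broad-first session and one multi-tool session out of two.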
| 42 | +The acceptance gate is: |
| 43 | + |
| 44 | +```mermaid |
| 45 | +flowchart TD |
| 46 | + A["Generated V2"] --> B{"Quality not worse?"} |
| 47 | + B -- no --> R["Reject"] |
| 48 | + B -- yes --> C{"Avg tokens lower?"} |
| 49 | + C -- no --> R |
| 50 | + C -- yes --> D{"Broad lookup reduced?"} |
| 51 | + D -- no --> R |
| 52 | + D -- yes --> E{"No tool errors?"} |
| 53 | + E -- no --> R |
| 54 | + E -- yes --> P["Accept evolved prompt"] |
| 55 | +``` |
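The gate in the flowchart above can be read as four ordered checks. The sketch below is one possible encoding, assuming a hypothetical `RunMetrics` summary per prompt version; the field names, and the choice to check tool errors on the candidate run only, are illustrative assumptions rather than the demo's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical per-version summary; field names are illustrative."""

    quality_score: float     # higher is better
    avg_tokens: float        # mean tokens per session
    broad_lookup_calls: int  # calls to lookup_nba_reference
    tool_errors: int         # failed tool invocations


def accept_candidate(
    baseline: RunMetrics, candidate: RunMetrics
) -> tuple[bool, str]:
    """Apply the four gates in flowchart order; return (accepted, reason)."""
    if candidate.quality_score < baseline.quality_score:
        return False, "quality regressed"
    if candidate.avg_tokens >= baseline.avg_tokens:
        return False, "average tokens not lower"
    if candidate.broad_lookup_calls >= baseline.broad_lookup_calls:
        return False, "broad lookup not reduced"
    if candidate.tool_errors > 0:  # assumption: only candidate errors count
        return False, "tool errors present"
    return True, "accepted"
```

Returning the reason alongside the decision keeps a rejected candidate explainable in a report, which fits the demo's emphasis on showing why a prompt was or was not promoted.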
| 56 | + |
| 57 | +## Run It |
| 58 | + |
| 59 | +Prerequisites: |
| 60 | + |
| 61 | +- Python 3.10+ |
| 62 | +- `gcloud` and `bq` CLIs |
| 63 | +- Application Default Credentials |
| 64 | +- A Google Cloud project with billing enabled |
| 65 | +- IAM roles: BigQuery Data Editor, BigQuery Job User, and Vertex AI User |
| 66 | + |
| 67 | +Setup: |
| 68 | + |
| 69 | +```bash |
| 70 | +./setup.sh |
| 71 | +``` |
| 72 | + |
| 73 | +If your default `python3` is older than 3.10, point both scripts at a newer interpreter: |
| 74 | + |
| 75 | +```bash |
| 76 | +PYTHON_BIN=python3.11 ./setup.sh |
| 77 | +PYTHON_BIN=python3.11 ./run_e2e_demo.sh |
| 78 | +``` |
| 79 | + |
| 80 | +Run the end-to-end demo: |
| 81 | + |
| 82 | +```bash |
| 83 | +./run_e2e_demo.sh |
| 84 | +``` |
| 85 | + |
| 86 | +Reset local prompt state and reports: |
| 87 | + |
| 88 | +```bash |
| 89 | +./reset.sh |
| 90 | +``` |
| 91 | + |
| 92 | +A default run typically costs well under `$1`. It covers four V1 |
| 93 | +agent sessions, one small prompt-generation call, four generated-V2 |
| 94 | +agent sessions, small BigQuery reads, and the SDK's deterministic |
| 95 | +evaluators. The demo does not deploy Cloud Run, Scheduler, Workflows, |
| 96 | +or any long-running infrastructure. |
| 97 | + |
| 98 | +## Outputs |
| 99 | + |
| 100 | +Each run writes a timestamped directory under `reports/`: |
| 101 | + |
| 102 | +```text |
| 103 | +reports/run_<timestamp>/ |
| 104 | +├── latest_eval_results_baseline.json # V1 answers + session IDs |
| 105 | +├── candidate_prompt.json # model-generated V2 prompt |
| 106 | +├── prompt_diff.md # exact V1 -> generated V2 diff |
| 107 | +├── self_evolution_analysis.json # SDK-backed evolution decision |
| 108 | +├── latest_eval_results_evolved.json # V2 answers + session IDs |
| 109 | +├── comparison.json # before/after gates |
| 110 | +└── comparison.md # readable metric diff report |
| 111 | +``` |
| 112 | + |
| 113 | +For the main story, open these two files after a run: |
| 114 | + |
| 115 | +- `prompt_diff.md` — shows the exact prompt changes generated from |
| 116 | + the trace/token signal. |
| 117 | +- `comparison.md` — shows quality, token, tool-call, and broad-lookup |
| 118 | + deltas between agent V1 and generated V2. |
| 119 | + |
| 120 | +The tracked `VERIFICATION.md` file records the latest live end-to-end |
| 121 | +verification result for this demo. |
| 122 | + |
| 123 | +The raw traces land in: |
| 124 | + |
| 125 | +```text |
| 126 | +<PROJECT_ID>.nba_self_evolving_demo.agent_events |
| 127 | +``` |
| 128 | + |
| 129 | +Override the dataset, table, models, and dataset location with: |
| 130 | + |
| 131 | +```bash |
| 132 | +export NBA_DATASET_ID=my_dataset |
| 133 | +export NBA_TABLE_ID=agent_events |
| 134 | +export NBA_AGENT_MODEL=gemini-2.5-flash |
| 135 | +export NBA_PROMPT_GENERATOR_MODEL=gemini-2.5-flash |
| 136 | +export DATASET_LOCATION=us-central1 |
| 137 | +``` |
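To illustrate how the dataset and table overrides combine with the default trace location, here is a minimal sketch. The `table_ref` helper is hypothetical and not part of the demo. The fallback values are taken from the default table path shown above, and the demo's scripts may resolve these variables differently.

```python
import os
from typing import Mapping, Optional


def table_ref(project_id: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Build the fully qualified trace table name (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumed defaults, matching the default table path in this README.
    dataset = env.get("NBA_DATASET_ID", "nba_self_evolving_demo")
    table = env.get("NBA_TABLE_ID", "agent_events")
    return f"{project_id}.{dataset}.{table}"
```

With `NBA_DATASET_ID=my_dataset` exported, `table_ref("my-project")` would resolve to `my-project.my_dataset.agent_events`.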
| 138 | + |
| 139 | +## File Map |
| 140 | + |
| 141 | +```text |
| 142 | +examples/nba_self_evolving_demo/ |
| 143 | +├── README.md |
| 144 | +├── DEMO_NARRATION.md |
| 145 | +├── VERIFICATION.md |
| 146 | +├── setup.sh |
| 147 | +├── reset.sh |
| 148 | +├── run_e2e_demo.sh |
| 149 | +├── run_agent.py |
| 150 | +├── analyze_and_evolve.py |
| 151 | +├── compare_runs.py |
| 152 | +├── agent/ |
| 153 | +│ ├── agent.py |
| 154 | +│ ├── prompts.py |
| 155 | +│ ├── prompt_store.py |
| 156 | +│ └── tools.py |
| 157 | +├── analytics/ |
| 158 | +│ └── session_metrics.py |
| 159 | +└── eval/ |
| 160 | + └── eval_cases.json |
| 161 | +``` |
| 162 | + |
| 163 | +## Productionization Roadmap |
| 164 | + |
| 165 | +The demo is intentionally one-shot. A production self-evolving loop |
| 166 | +would add durable orchestration, approvals, and rollout controls: |
| 167 | + |
| 168 | +```mermaid |
| 169 | +flowchart LR |
| 170 | + A["Scheduler"] --> B["Cloud Run Job"] |
| 171 | + B --> C["Analyze recent BigQuery traces"] |
| 172 | + C --> D["Generate prompt or skill candidate"] |
| 173 | + D --> E["Regression eval gate"] |
| 174 | + E --> F["Human approval or policy gate"] |
| 175 | + F --> G["Prompt Registry / config rollout"] |
| 176 | + G --> H["Canary traffic"] |
| 177 | + H --> C |
| 178 | +``` |
| 179 | + |
| 180 | +Recommended next steps: |
| 181 | + |
| 182 | +- Store accepted and rejected candidates in BigQuery. |
| 183 | +- Add prompt registry support for managed version history. |
| 184 | +- Add a human approval step before production rollout. |
| 185 | +- Add canary routing and automatic rollback if quality or cost |
| 186 | + regressions appear. |
| 187 | +- Extend the candidate generator from full-prompt generation to bounded |
| 188 | + prompt/skill patch optimization. |