diff --git a/examples/README.md b/examples/README.md index 35d0b887..66c32526 100644 --- a/examples/README.md +++ b/examples/README.md @@ -53,6 +53,7 @@ artifacts that demonstrate SDK capabilities. |-----------|-------------| | [context_graph/](context_graph/) | Agent Context Graph: extract decision traces from your agent's context graph — a runnable ADK agent + BQ AA plugin streaming events, the codelab artifacts ([codelab/](context_graph/codelab/)), and the scheduled Cloud Run + Cloud Scheduler deploy ([periodic_materialization/](context_graph/periodic_materialization/)). Start with the [codelab](../docs/codelabs/periodic_materialization.md). | | [agent_improvement_cycle/](agent_improvement_cycle/) | LoopAgent-driven prompt improvement cycle | +| [self_evolving_agent_demo/](self_evolving_agent_demo/) | Metric-driven self-evolution demo for a single ADK agent. Uses trace signals to generate and gate a bounded prompt evolution. | | [decision_lineage_demo/](decision_lineage_demo/) | Decision-lineage property graph (issue #98): live ADK media-planner agent + BQ AA Plugin running across 6 campaign sessions → SDK `build_context_graph(use_ai_generate=True, include_decisions=True)` → six GQL blocks pasted into BigQuery Studio (one renders an interactive graph diagram, one is a portfolio roll-up) | ## Reference Artifacts diff --git a/examples/self_evolving_agent_demo/.gitignore b/examples/self_evolving_agent_demo/.gitignore new file mode 100644 index 00000000..b4de0192 --- /dev/null +++ b/examples/self_evolving_agent_demo/.gitignore @@ -0,0 +1,5 @@ +.env +prompt_state.json +reports/ +__pycache__/ +*/__pycache__/ diff --git a/examples/self_evolving_agent_demo/DEMO_NARRATION.md b/examples/self_evolving_agent_demo/DEMO_NARRATION.md new file mode 100644 index 00000000..1a09698e --- /dev/null +++ b/examples/self_evolving_agent_demo/DEMO_NARRATION.md @@ -0,0 +1,28 @@ +# Self-Evolving Agent Demo Narration + +## 30-second version + +This demo starts with a basketball analytics 
agent that answers correctly but +wastes work. It logs every run to BigQuery through the analytics +plugin. The SDK reads the traces, finds that the agent keeps calling a +broad reference tool and spending excess tokens, generates a tighter V2 +prompt, reruns the same questions, and proves that quality stayed flat +while token and tool usage dropped. + +## Walkthrough + +1. Run `./setup.sh`. +2. Run `./run_e2e_demo.sh`. +3. Watch the V1 run call broad and narrow sample tools. +4. Watch `analyze_and_evolve.py` print the SDK-backed finding: + broad reference lookups were used on narrow tasks. +5. Open `prompt_diff.md` to inspect the exact V1 -> generated V2 diff. +6. Watch the V2 run use narrow tools directly. +7. Open `comparison.md` for the final quality/token/tool diff. + +## Demo Message + +The important idea is not "save tokens" in isolation. The agent uses +its own production-shaped traces as feedback. Token tracking gives the +loop a measurable signal, but the goal is a self-evolving agent that +gets cheaper or cleaner without losing answer quality. diff --git a/examples/self_evolving_agent_demo/README.md b/examples/self_evolving_agent_demo/README.md new file mode 100644 index 00000000..3f4f8283 --- /dev/null +++ b/examples/self_evolving_agent_demo/README.md @@ -0,0 +1,212 @@ +# Self-Evolving Agent Demo + +This demo shows a single ADK agent improving from its own logged +behavior. The agent answers basketball analytics questions using deterministic +fixture tools. V1 is intentionally wasteful: it loads broad basketball +reference context and writes long scouting reports even when a narrow +tool can answer the question. The BigQuery Agent Analytics Plugin logs +the sessions to BigQuery, and the SDK reads those traces back to find a +concrete improvement opportunity. The demo generates V2 during the run, +then activates it only when the baseline answers already pass quality +checks and the trace analysis shows broad-tool / token waste. 
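The activation condition in the paragraph above (baseline quality already passes, and the traces show broad-tool or token waste) can be sketched roughly as follows. This is an illustration only, not the logic in `analyze_and_evolve.py`: the `BaselineSummary` fields, the `should_evolve` name, the threshold defaults, and the way the two waste signals are combined are all assumptions made for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineSummary:
  """Hypothetical roll-up of the V1 run; field names are illustrative."""

  quality_pass_rate: float  # fraction of eval cases that passed
  broad_lookup_rate: float  # fraction of sessions that called the broad tool
  avg_tool_calls: float  # mean tool calls per session


def should_evolve(
    summary: BaselineSummary,
    *,
    min_quality_pass_rate: float = 1.0,  # assumed default
    min_broad_lookup_rate: float = 0.5,  # assumed default
    max_avg_tool_calls: float = 2.0,  # assumed default
) -> bool:
  """Return True when the baseline is correct but visibly wasteful."""
  quality_ok = summary.quality_pass_rate >= min_quality_pass_rate
  # Assumption: either waste signal alone is enough to trigger evolution.
  waste_seen = (
      summary.broad_lookup_rate >= min_broad_lookup_rate
      or summary.avg_tool_calls > max_avg_tool_calls
  )
  return quality_ok and waste_seen
```

With this shape, a baseline that fails its quality checks never triggers prompt generation, so the loop does not try to optimize cost on top of wrong answers.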
+ +```mermaid +flowchart TD + A["Run sample agent V1"] --> B["Plugin logs agent_events to BigQuery"] + B --> C["SDK deterministic evaluators + trace SQL"] + C --> D["Find broad lookup and token waste"] + D --> E["Generate bounded V2 prompt"] + E --> F["Run same sample eval questions"] + F --> G["Show prompt diff + metric diff"] +``` + +The point is self-evolution. Token tracking is the measurement signal, +not the product promise. + +This is a lightweight companion to `examples/agent_improvement_cycle/`. +That demo shows a production-facing quality-improvement loop with +Prompt Registry and Prompt Optimizer. This demo is intentionally smaller: +it focuses on operational trace signals such as tool overuse and token +waste, then gates a single generated prompt evolution against before/after +metrics. + +## What Improves + +V1 behavior: + +- Calls `lookup_basketball_reference` before narrow tools. +- Often calls more than one tool for a one-question task. +- Produces long sectioned scouting reports. + +Generated V2 behavior: + +- Is created at runtime by a prompt generator from the SDK trace + summary, tool counts, quality summary, and available tool signatures. +- Should use the cheapest sufficient narrow tool. +- Should avoid `lookup_basketball_reference` unless no narrow tool fits. +- Should give a short answer with decisive stats and a recommendation. 
+ +The acceptance gate is: + +```mermaid +flowchart TD + A["Generated V2"] --> B{"Quality not worse?"} + B -- no --> R["Reject"] + B -- yes --> C{"Avg tokens lower?"} + C -- no --> R + C -- yes --> D{"Broad lookup reduced?"} + D -- no --> R + D -- yes --> E{"No tool errors?"} + E -- no --> R + E -- yes --> P["Accept evolved prompt"] +``` + +## Run It + +Prerequisites: + +- Python 3.10+ +- `gcloud` and `bq` CLIs +- Application Default Credentials +- A Google Cloud project with billing enabled +- IAM: BigQuery data editor/job user and Vertex AI user + +Setup: + +```bash +./setup.sh +``` + +If your default `python3` is older than 3.10, run with: + +```bash +PYTHON_BIN=python3.11 ./setup.sh +PYTHON_BIN=python3.11 ./run_e2e_demo.sh +``` + +Run the end-to-end demo: + +```bash +./run_e2e_demo.sh +``` + +Reset local prompt state and reports: + +```bash +./reset.sh +``` + +Expected default one-run cost is typically well under `$1`: four V1 +agent sessions, one small prompt-generation call, four generated-V2 +agent sessions, small BigQuery reads, and SDK deterministic evaluators. +The demo does not deploy Cloud Run, +Scheduler, Workflows, or any long-running infrastructure. + +## Outputs + +Each run writes a timestamped directory under `reports/`: + +```text +reports/run_/ +├── latest_eval_results_baseline.json # V1 answers + session IDs +├── candidate_prompt.json # model-generated V2 prompt +├── prompt_diff.md # exact V1 -> generated V2 diff +├── self_evolution_analysis.json # SDK-backed evolution decision +├── latest_eval_results_evolved.json # V2 answers + session IDs +├── comparison.json # before/after gates +└── comparison.md # readable metric diff report +``` + +For the main story, open these two files after a run: + +- `prompt_diff.md` — shows the exact prompt changes generated from + the trace/token signal. +- `comparison.md` — shows quality, token, tool-call, and broad-lookup + deltas between agent V1 and generated V2. 
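The before/after gates recorded in `comparison.json` follow the acceptance flowchart earlier in this README. A minimal sketch of that gate is shown here, assuming a hypothetical `RunMetrics` container and an `evaluate_gates` helper that are not part of the demo source; the gate names mirror the ones listed in `VERIFICATION.md`, but the comparison rules are this sketch's reading of the flowchart.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
  """Hypothetical per-run summary; field names are illustrative."""

  quality_pass_rate: float
  avg_total_tokens: float
  broad_lookup_calls: int
  tool_errors: int


def evaluate_gates(baseline: RunMetrics, evolved: RunMetrics) -> dict[str, bool]:
  """Evaluate each acceptance gate for the evolved run against the baseline."""
  return {
      "quality_not_regressed": (
          evolved.quality_pass_rate >= baseline.quality_pass_rate
      ),
      "tokens_reduced": evolved.avg_total_tokens < baseline.avg_total_tokens,
      "broad_lookup_reduced": (
          evolved.broad_lookup_calls < baseline.broad_lookup_calls
      ),
      # Assumption: "no tool errors" means zero errors in the evolved run.
      "tool_errors_clear": evolved.tool_errors == 0,
  }


def accept(gates: dict[str, bool]) -> bool:
  """Accept the evolved prompt only when every gate passes."""
  return all(gates.values())


# Figures taken from the metrics table in VERIFICATION.md.
v1 = RunMetrics(1.0, 3640.2, 4, 0)
v2 = RunMetrics(1.0, 1479.8, 0, 0)
gates = evaluate_gates(v1, v2)
```

Keeping each gate as a named boolean makes a rejection explainable: the report can state which gate failed instead of only printing a final verdict.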
+ +The tracked `VERIFICATION.md` file records the latest live end-to-end +verification result for this demo. + +The raw traces land in: + +```text +.self_evolving_agent_demo.agent_events +``` + +Override with: + +```bash +export SELF_EVOLVING_DATASET_ID=my_dataset +export SELF_EVOLVING_TABLE_ID=agent_events +export SELF_EVOLVING_AGENT_MODEL=gemini-2.5-flash +export SELF_EVOLVING_PROMPT_GENERATOR_MODEL=gemini-2.5-flash +export DATASET_LOCATION=us-central1 +``` + +Re-running `setup.sh` regenerates `.env` from the current environment. +To customize a setting persistently, pass it as an environment variable +when running setup, for example: + +```bash +SELF_EVOLVING_AGENT_MODEL=gemini-2.5-pro ./setup.sh +``` + +Evolution thresholds can be tuned with: + +```bash +python analyze_and_evolve.py \ + --min-quality-pass-rate 1.0 \ + --min-broad-lookup-rate 0.5 \ + --max-avg-tool-calls 2.0 +``` + +## File Map + +```text +examples/self_evolving_agent_demo/ +├── README.md +├── DEMO_NARRATION.md +├── VERIFICATION.md +├── setup.sh +├── reset.sh +├── run_e2e_demo.sh +├── run_agent.py +├── analyze_and_evolve.py +├── compare_runs.py +├── agent/ +│ ├── agent.py +│ ├── prompts.py +│ ├── prompt_store.py +│ └── tools.py +├── analytics/ +│ └── session_metrics.py +└── eval/ + └── eval_cases.json +``` + +## Productionization Roadmap + +The demo is intentionally one-shot. A production self-evolving loop +would add durable orchestration, approvals, and rollout controls: + +```mermaid +flowchart LR + A["Scheduler"] --> B["Cloud Run Job"] + B --> C["Analyze recent BigQuery traces"] + C --> D["Generate prompt or skill candidate"] + D --> E["Regression eval gate"] + E --> F["Human approval or policy gate"] + F --> G["Prompt Registry / config rollout"] + G --> H["Canary traffic"] + H --> C +``` + +Recommended next steps: + +- Store accepted and rejected candidates in BigQuery. +- Add prompt registry support for managed version history. +- Add a human approval step before production rollout. 
+- Add canary routing and automatic rollback if quality or cost + regressions appear. +- Extend the candidate generator from full-prompt generation to bounded + prompt/skill patch optimization. diff --git a/examples/self_evolving_agent_demo/VERIFICATION.md b/examples/self_evolving_agent_demo/VERIFICATION.md new file mode 100644 index 00000000..64fdd27f --- /dev/null +++ b/examples/self_evolving_agent_demo/VERIFICATION.md @@ -0,0 +1,101 @@ +# Live Verification + +Last verified: 2026-06-09, America/Los_Angeles + +Run id: `run_20260609_171547` + +Command: + +```bash +PYTHON_BIN=/path/to/python3.10+ ./run_e2e_demo.sh +``` + +Raw local artifacts were written to: + +```text +reports/run_20260609_171547/ +``` + +The raw `reports/` directory remains ignored because it is per-run output. +This file records the live end-to-end result that should be stable enough +to keep with the demo source. + +## What Ran + +```mermaid +flowchart LR + A["ADK sample agent V1"] --> B["BigQuery analytics plugin"] + B --> C["BigQuery trace table"] + C --> D["SDK evaluators + trace SQL"] + D --> E["Gemini prompt generator"] + E --> F["Generated V2 prompt"] + F --> G["ADK sample agent V2"] + G --> H["Before/after gate report"] +``` + +The live run exercised: + +- ADK agent execution with Gemini. +- BigQuery Agent Analytics Plugin trace logging. +- BigQuery trace readback from + `rag-chatbot-485501.self_evolving_agent_demo.agent_events`. +- SDK deterministic evaluator checks for token efficiency, cost, turn count, + and error rate. +- Runtime generation of a replacement V2 prompt. +- Evolved-agent rerun against the same deterministic sample eval set. +- Before/after comparison gates. + +## Generated Change + +The generated V2 prompt changed the agent from broad-first behavior to a +narrowest-sufficient-tool policy: + +- Player comparison -> `compare_players`. +- Team comparison -> `compare_teams`. +- Named-player scoring/profile/quick-read -> `get_player_stats`. 
+- Named-team strategy/strengths/profile/late-game offense -> + `get_team_profile`. +- `lookup_basketball_reference` only for broad, league-wide, or unsupported + ambiguous questions. + +Candidate source: `model`. + +It also changed the answer style from a long fixed scouting-report format +to at most four bullets or 120 words. + +## Metrics + +| Metric | V1 | Generated V2 | Delta | +|---|---:|---:|---:| +| Quality pass rate | 100% | 100% | +0% | +| Avg total tokens | 3640.2 | 1479.8 | -59.4% | +| Avg tool calls | 2.5 | 1.0 | -60.0% | +| Broad lookup calls | 4 | 0 | -4 | +| Tool errors | 0 | 0 | +0 | + +## Gates + +| Gate | Result | +|---|---:| +| `quality_not_regressed` | PASS | +| `tokens_reduced` | PASS | +| `broad_lookup_reduced` | PASS | +| `tool_errors_clear` | PASS | + +Final result: PASS. + +## Baseline SDK Signals + +The SDK-backed analysis observed the following V1 signals before generating +the V2 prompt: + +- Sessions: 4. +- Avg total tokens: 3640.2. +- Avg tool calls: 2.5. +- Broad lookup sessions: 4/4. +- Quality pass rate: 100%. +- Cost evaluator average observed value: 0.0015. + +The default one-run cost remains well under `$1`: the run uses four V1 +agent sessions, one prompt-generation call, four generated-V2 sessions, +and small BigQuery reads. diff --git a/examples/self_evolving_agent_demo/agent/__init__.py b/examples/self_evolving_agent_demo/agent/__init__.py new file mode 100644 index 00000000..be4ae66a --- /dev/null +++ b/examples/self_evolving_agent_demo/agent/__init__.py @@ -0,0 +1 @@ +"""self-evolving agent demo agent package.""" diff --git a/examples/self_evolving_agent_demo/agent/agent.py b/examples/self_evolving_agent_demo/agent/agent.py new file mode 100644 index 00000000..cf505a25 --- /dev/null +++ b/examples/self_evolving_agent_demo/agent/agent.py @@ -0,0 +1,100 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""ADK sample analytics agent used by the self-evolving demo.""" + +from __future__ import annotations + +import os + +from dotenv import load_dotenv +from google.adk.agents import Agent +from google.adk.models import Gemini +from google.adk.plugins.bigquery_agent_analytics_plugin import BigQueryAgentAnalyticsPlugin +from google.adk.plugins.bigquery_agent_analytics_plugin import BigQueryLoggerConfig +import google.auth +from google.genai import types + +from .prompt_store import read_prompt +from .tools import DEMO_TOOLS + +_DEMO_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) +_ENV_PATH = os.path.join(_DEMO_DIR, ".env") +if os.path.exists(_ENV_PATH): + load_dotenv(dotenv_path=_ENV_PATH) + +try: + _, _auth_project = google.auth.default() +except Exception: + _auth_project = None + +PROJECT_ID = os.getenv("PROJECT_ID") or os.getenv("GOOGLE_CLOUD_PROJECT") +if not PROJECT_ID: + PROJECT_ID = _auth_project +if not PROJECT_ID: + raise RuntimeError( + "Could not resolve PROJECT_ID from .env, GOOGLE_CLOUD_PROJECT, or ADC. " + "Run ./setup.sh or `gcloud config set project YOUR_PROJECT_ID`." 
+ ) + +DATASET_LOCATION = os.getenv("DATASET_LOCATION", "us-central1") +DATASET_ID = os.getenv("SELF_EVOLVING_DATASET_ID", "self_evolving_agent_demo") +TABLE_ID = os.getenv("SELF_EVOLVING_TABLE_ID", "agent_events") +MODEL_ID = os.getenv("SELF_EVOLVING_AGENT_MODEL", "gemini-2.5-flash") +AGENT_LOCATION = os.getenv("SELF_EVOLVING_AGENT_LOCATION", "us-central1") +APP_NAME = "self_evolving_agent" + + +def _configure_environment() -> None: + """Configure Vertex AI environment variables required by ADK Gemini.""" + os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID + os.environ["GOOGLE_CLOUD_LOCATION"] = AGENT_LOCATION + os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "true" + + +def create_agent(prompt: str, model_id: str | None = None) -> Agent: + """Create the sample agent with the supplied system prompt.""" + _configure_environment() + return Agent( + name=APP_NAME, + model=Gemini( + model=model_id or MODEL_ID, + retry_options=types.HttpRetryOptions(attempts=3), + ), + description=( + "Basketball analytics assistant with deterministic fixture tools." + ), + instruction=prompt, + tools=DEMO_TOOLS, + ) + + +_prompt, PROMPT_VERSION = read_prompt() +root_agent = create_agent(_prompt) + +bq_logging_plugin = BigQueryAgentAnalyticsPlugin( + project_id=PROJECT_ID, + dataset_id=DATASET_ID, + table_id=TABLE_ID, + location=DATASET_LOCATION, + config=BigQueryLoggerConfig( + enabled=True, + max_content_length=50 * 1024, + # Small batches make rows visible quickly for this one-shot demo. + batch_size=1, + shutdown_timeout=15.0, + ), +) + +app = root_agent diff --git a/examples/self_evolving_agent_demo/agent/prompt_store.py b/examples/self_evolving_agent_demo/agent/prompt_store.py new file mode 100644 index 00000000..a9ad5838 --- /dev/null +++ b/examples/self_evolving_agent_demo/agent/prompt_store.py @@ -0,0 +1,101 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Tiny local prompt registry for the demo. + +The tracked source stays immutable during a run. The active prompt +version is stored in ``prompt_state.json``, which is ignored by Git and +created by setup/reset/evolution scripts. +""" + +from __future__ import annotations + +import argparse +from datetime import datetime +from datetime import timezone +import json +import os +from typing import Any + +from .prompts import V1_PROMPT + +_DEMO_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) +STATE_PATH = os.path.join(_DEMO_DIR, "prompt_state.json") + + +def _state(version: str, prompt: str, rationale: str) -> dict[str, Any]: + return { + "version": version, + "prompt": prompt, + "rationale": rationale, + "updated_at": datetime.now(timezone.utc).isoformat(), + } + + +def read_state() -> dict[str, Any]: + """Read the current prompt state, falling back to V1.""" + if not os.path.exists(STATE_PATH): + return _state("v1", V1_PROMPT, "Default V1 prompt.") + with open(STATE_PATH) as f: + data = json.load(f) + version = str(data.get("version", "v1")).lower() + prompt = str(data.get("prompt") or V1_PROMPT) + return { + "version": version, + "prompt": prompt, + "rationale": str(data.get("rationale", "")), + "updated_at": str(data.get("updated_at", "")), + } + + +def read_prompt() -> tuple[str, str]: + """Return ``(prompt, version)`` for agent construction.""" + state = read_state() + return state["prompt"], state["version"] + + +def write_prompt(version: str, prompt: str, rationale: str) -> dict[str, Any]: + """Persist prompt text as the active demo 
prompt version.""" + normalized = version.strip().lower() + if normalized not in {"v1", "v2", "candidate"}: + raise ValueError(f"Unsupported prompt version: {version!r}") + if not prompt.strip(): + raise ValueError("Prompt text must not be empty.") + state = _state(normalized, prompt.strip(), rationale) + with open(STATE_PATH, "w") as f: + json.dump(state, f, indent=2) + f.write("\n") + return state + + +def reset_state() -> dict[str, Any]: + """Reset the demo to the intentionally inefficient V1 prompt.""" + return write_prompt("v1", V1_PROMPT, "Reset to baseline V1 prompt.") + + +def main() -> None: + parser = argparse.ArgumentParser(description="Manage demo prompt state.") + parser.add_argument("action", choices=["show", "reset"]) + args = parser.parse_args() + + if args.action == "reset": + state = reset_state() + else: + state = read_state() + + print(json.dumps(state, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/examples/self_evolving_agent_demo/agent/prompts.py b/examples/self_evolving_agent_demo/agent/prompts.py new file mode 100644 index 00000000..780ce577 --- /dev/null +++ b/examples/self_evolving_agent_demo/agent/prompts.py @@ -0,0 +1,44 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Baseline prompt for the self-evolving agent demo. 
+ +The demo starts with V1, which is intentionally wasteful: it asks the +agent to load broad reference context and write long analyst notes even +when a narrow tool can answer the question. V2 is generated at runtime +from SDK trace analysis and stored in ``prompt_state.json``. +""" + +V1_PROMPT = """\ +You are Courtside Scout, a basketball analytics assistant. + +You must be exhaustive. For every user question, first call +`lookup_basketball_reference(query)` using the full user question so you have +league-wide context. Then call any narrow tool that could possibly be +relevant. If a player appears, call `get_player_stats`. If a team +appears, call `get_team_profile`. If the user compares two players, +also call `compare_players`. If the user compares two teams, also call +`compare_teams`. + +Write a scouting-report style answer with these sections: +1. Context +2. Numbers +3. Reasoning +4. Caveats +5. Recommendation + +Use six to eight bullets. Mention that the data is a synthetic demo +fixture and that a live production agent would verify against a +licensed stats feed. +""" diff --git a/examples/self_evolving_agent_demo/agent/tools.py b/examples/self_evolving_agent_demo/agent/tools.py new file mode 100644 index 00000000..35d9f980 --- /dev/null +++ b/examples/self_evolving_agent_demo/agent/tools.py @@ -0,0 +1,318 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Deterministic basketball fixture tools for the self-evolving demo. 
+ +The data below is intentionally synthetic. The point of the demo is the +agent evolution loop and trace analytics, not live sports accuracy. +""" + +from __future__ import annotations + +from typing import Any + +SEASON = "2025-26-demo" + +PLAYERS: dict[str, dict[str, Any]] = { + "nikola jokic": { + "player": "Nikola Jokic", + "team": "Denver Nuggets", + "ppg": 26.4, + "rpg": 12.4, + "apg": 9.1, + "ts_pct": 0.662, + "usage_pct": 28.8, + "assist_rate": 43.0, + "strength": "elite half-court creation through post play and passing", + }, + "joel embiid": { + "player": "Joel Embiid", + "team": "Philadelphia 76ers", + "ppg": 31.8, + "rpg": 10.9, + "apg": 5.6, + "ts_pct": 0.646, + "usage_pct": 35.1, + "assist_rate": 28.4, + "strength": "dominant scoring pressure, foul generation, and rim defense", + }, + "shai gilgeous-alexander": { + "player": "Shai Gilgeous-Alexander", + "team": "Oklahoma City Thunder", + "ppg": 30.6, + "rpg": 5.8, + "apg": 6.4, + "ts_pct": 0.635, + "usage_pct": 32.4, + "assist_rate": 29.9, + "strength": "paint pressure, midrange scoring, and low-turnover creation", + }, + "luka doncic": { + "player": "Luka Doncic", + "team": "Dallas Mavericks", + "ppg": 29.8, + "rpg": 8.7, + "apg": 9.4, + "ts_pct": 0.612, + "usage_pct": 34.8, + "assist_rate": 45.5, + "strength": "pick-and-roll control and skip-pass creation", + }, + "jayson tatum": { + "player": "Jayson Tatum", + "team": "Boston Celtics", + "ppg": 27.1, + "rpg": 8.2, + "apg": 4.9, + "ts_pct": 0.604, + "usage_pct": 30.6, + "assist_rate": 22.5, + "strength": "two-way wing scoring with switchable defense", + }, + "anthony edwards": { + "player": "Anthony Edwards", + "team": "Minnesota Timberwolves", + "ppg": 27.8, + "rpg": 5.5, + "apg": 5.1, + "ts_pct": 0.589, + "usage_pct": 31.8, + "assist_rate": 24.7, + "strength": "rim pressure, transition force, and late-clock shot making", + }, +} + +TEAMS: dict[str, dict[str, Any]] = { + "denver nuggets": { + "team": "Denver Nuggets", + "wins": 55, + "losses": 
27, + "off_rating": 119.1, + "def_rating": 113.6, + "net_rating": 5.5, + "pace": 97.2, + "profile": "methodical half-court offense built around Jokic actions", + "late_game_edge": "high-value two-man actions and elite decision making", + }, + "oklahoma city thunder": { + "team": "Oklahoma City Thunder", + "wins": 60, + "losses": 22, + "off_rating": 118.4, + "def_rating": 109.2, + "net_rating": 9.2, + "pace": 100.5, + "profile": "drive-heavy offense with aggressive point-of-attack defense", + "late_game_edge": "Shai isolation plus five-out spacing", + }, + "boston celtics": { + "team": "Boston Celtics", + "wins": 58, + "losses": 24, + "off_rating": 120.2, + "def_rating": 111.1, + "net_rating": 9.1, + "pace": 98.9, + "profile": "spacing, three-point volume, and switchable wing size", + "late_game_edge": "multiple creators around elite spacing", + }, + "dallas mavericks": { + "team": "Dallas Mavericks", + "wins": 50, + "losses": 32, + "off_rating": 117.2, + "def_rating": 114.5, + "net_rating": 2.7, + "pace": 99.4, + "profile": "pick-and-roll creation and corner spacing", + "late_game_edge": "Doncic advantage creation against switches", + }, + "minnesota timberwolves": { + "team": "Minnesota Timberwolves", + "wins": 53, + "losses": 29, + "off_rating": 115.8, + "def_rating": 109.8, + "net_rating": 6.0, + "pace": 98.0, + "profile": "rim protection, size, and Edwards downhill creation", + "late_game_edge": "defense-to-offense swings and Edwards shot pressure", + }, +} + +PLAYER_ALIASES = { + "jokic": "nikola jokic", + "nikola": "nikola jokic", + "embiid": "joel embiid", + "joel": "joel embiid", + "shai": "shai gilgeous-alexander", + "sga": "shai gilgeous-alexander", + "gilgeous-alexander": "shai gilgeous-alexander", + "luka": "luka doncic", + "doncic": "luka doncic", + "tatum": "jayson tatum", + "jayson": "jayson tatum", + "edwards": "anthony edwards", + "anthony edwards": "anthony edwards", +} + +TEAM_ALIASES = { + "nuggets": "denver nuggets", + "denver": "denver 
nuggets", + "thunder": "oklahoma city thunder", + "okc": "oklahoma city thunder", + "celtics": "boston celtics", + "boston": "boston celtics", + "mavericks": "dallas mavericks", + "mavs": "dallas mavericks", + "dallas": "dallas mavericks", + "timberwolves": "minnesota timberwolves", + "wolves": "minnesota timberwolves", + "minnesota": "minnesota timberwolves", +} + + +def _resolve_player(name: str) -> str: + key = name.lower().strip() + if key in PLAYERS: + return key + for alias, canonical in PLAYER_ALIASES.items(): + if alias in key: + return canonical + raise ValueError(f"Unknown demo player: {name}") + + +def _resolve_team(name: str) -> str: + key = name.lower().strip() + if key in TEAMS: + return key + for alias, canonical in TEAM_ALIASES.items(): + if alias in key: + return canonical + raise ValueError(f"Unknown demo team: {name}") + + +def lookup_basketball_reference(query: str) -> dict[str, Any]: + """Return a broad basketball reference packet for ambiguous questions. + + This tool is intentionally verbose. V1 overuses it, which makes the + SDK token analysis find a concrete optimization opportunity. + """ + return { + "query": query, + "season": SEASON, + "usage_note": ( + "Broad reference packet. Prefer narrow tools for player, team, " + "and comparison questions when possible." 
+ ), + "league_principles": [ + "Net rating estimates team strength better than wins alone.", + "True shooting percentage helps compare scoring efficiency.", + "Usage rate indicates how much offense a player carries.", + "Assist rate and turnover context matter for primary creators.", + "Pace changes counting stats and should be considered in team reads.", + "Late-game offense rewards shot creation, spacing, and low turnovers.", + "Playoff defense values rim protection and switchable point-of-attack size.", + "Synthetic demo fixtures are stable so trace comparisons are repeatable.", + ], + "teams": list(TEAMS.values()), + "players": list(PLAYERS.values()), + "common_matchup_lenses": [ + "creation burden", + "efficiency", + "rim pressure", + "spacing environment", + "defensive matchup flexibility", + "late-clock reliability", + "transition creation", + "bench context", + ], + } + + +def get_player_stats(player: str, season: str = SEASON) -> dict[str, Any]: + """Return compact stats, strengths, and scoring profile for one player.""" + data = dict(PLAYERS[_resolve_player(player)]) + data["season"] = season + return data + + +def get_team_profile(team: str, season: str = SEASON) -> dict[str, Any]: + """Return team profile, strengths, and late-game strategy data.""" + data = dict(TEAMS[_resolve_team(team)]) + data["season"] = season + return data + + +def compare_players( + player_a: str, + player_b: str, + season: str = SEASON, +) -> dict[str, Any]: + """Compare two demo players with a compact recommendation.""" + left = get_player_stats(player_a, season) + right = get_player_stats(player_b, season) + left_score = ( + left["ppg"] * 0.35 + + left["apg"] * 0.30 + + left["ts_pct"] * 20 + + left["assist_rate"] * 0.10 + ) + right_score = ( + right["ppg"] * 0.35 + + right["apg"] * 0.30 + + right["ts_pct"] * 20 + + right["assist_rate"] * 0.10 + ) + winner = left if left_score >= right_score else right + return { + "season": season, + "player_a": left, + "player_b": right, + 
"recommended": winner["player"], + "reason": ( + f"{winner['player']} has the stronger creation profile for this " + "question because of scoring efficiency plus playmaking load." + ), + } + + +def compare_teams( + team_a: str, + team_b: str, + season: str = SEASON, +) -> dict[str, Any]: + """Compare two demo teams with a compact recommendation.""" + left = get_team_profile(team_a, season) + right = get_team_profile(team_b, season) + winner = left if left["net_rating"] >= right["net_rating"] else right + return { + "season": season, + "team_a": left, + "team_b": right, + "recommended": winner["team"], + "reason": ( + f"{winner['team']} has the better demo profile by net rating " + "and role clarity." + ), + } + + +DEMO_TOOLS = [ + lookup_basketball_reference, + get_player_stats, + get_team_profile, + compare_players, + compare_teams, +] diff --git a/examples/self_evolving_agent_demo/analytics/__init__.py b/examples/self_evolving_agent_demo/analytics/__init__.py new file mode 100644 index 00000000..fe4bc6bd --- /dev/null +++ b/examples/self_evolving_agent_demo/analytics/__init__.py @@ -0,0 +1 @@ +"""Analytics helpers for the self-evolving agent demo.""" diff --git a/examples/self_evolving_agent_demo/analytics/session_metrics.py b/examples/self_evolving_agent_demo/analytics/session_metrics.py new file mode 100644 index 00000000..0fa6a2ab --- /dev/null +++ b/examples/self_evolving_agent_demo/analytics/session_metrics.py @@ -0,0 +1,322 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""Session metric helpers backed by BigQuery and SDK evaluators.""" + +from __future__ import annotations + +from dataclasses import dataclass +import json +import os +import time +from typing import Any + +_DEMO_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) +_ENV_PATH = os.path.join(_DEMO_DIR, ".env") + + +@dataclass(frozen=True) +class DemoConfig: + project_id: str + dataset_id: str + table_id: str + location: str + + @property + def table_ref(self) -> str: + return f"{self.project_id}.{self.dataset_id}.{self.table_id}" + + +def load_config() -> DemoConfig: + """Load demo BigQuery configuration from ``.env`` and ADC.""" + if os.path.exists(_ENV_PATH): + try: + from dotenv import load_dotenv + + load_dotenv(dotenv_path=_ENV_PATH) + except ImportError: + pass + try: + import google.auth + + _, auth_project = google.auth.default() + except Exception: + auth_project = None + project_id = ( + os.getenv("PROJECT_ID") + or os.getenv("GOOGLE_CLOUD_PROJECT") + or auth_project + ) + if not project_id: + raise RuntimeError( + "PROJECT_ID is not set and no default Google Cloud project was found." 
+ ) + return DemoConfig( + project_id=project_id, + dataset_id=os.getenv( + "SELF_EVOLVING_DATASET_ID", "self_evolving_agent_demo" + ), + table_id=os.getenv("SELF_EVOLVING_TABLE_ID", "agent_events"), + location=os.getenv("DATASET_LOCATION", "us-central1"), + ) + + +def load_session_ids(path: str) -> list[str]: + """Load non-empty session IDs from a run-agent result file.""" + with open(path) as f: + data = json.load(f) + if isinstance(data, dict): + data = data.get("sessions", []) + return [str(r["session_id"]) for r in data if r.get("session_id")] + + +def load_quality_summary(path: str) -> dict[str, Any]: + """Summarize deterministic quality fields from run-agent results.""" + with open(path) as f: + rows = json.load(f) + if isinstance(rows, dict): + rows = rows.get("sessions", []) + total = len(rows) + passed = sum(1 for r in rows if r.get("quality_passed")) + expected_tool_used = sum(1 for r in rows if r.get("expected_tool_used")) + avoid_tool_used = sum(1 for r in rows if r.get("avoid_tool_used")) + return { + "total": total, + "passed": passed, + "pass_rate": passed / total if total else 0.0, + "expected_tool_used": expected_tool_used, + "avoid_tool_used": avoid_tool_used, + } + + +def _bq_client(config: DemoConfig) -> Any: + from google.cloud import bigquery + + return bigquery.Client(project=config.project_id, location=config.location) + + +def fetch_session_metrics( + session_ids: list[str], + *, + attempts: int = 1, + wait_seconds: int = 0, +) -> list[dict[str, Any]]: + """Fetch per-session token/tool metrics from the raw event table.""" + if not session_ids: + return [] + from google.cloud import bigquery + + config = load_config() + client = _bq_client(config) + query = f""" + SELECT + session_id, + COUNT(*) AS event_count, + COUNTIF(event_type = 'LLM_REQUEST') AS llm_calls, + COUNTIF(event_type = 'LLM_RESPONSE') AS llm_responses, + COUNTIF(event_type = 'TOOL_STARTING') AS tool_calls, + COUNTIF(event_type = 'TOOL_ERROR') AS tool_errors, + COUNTIF( + 
event_type = 'TOOL_STARTING' + AND JSON_VALUE(content, '$.tool') = 'lookup_basketball_reference' + ) AS broad_lookup_calls, + SUM(COALESCE( + SAFE_CAST(JSON_VALUE( + attributes, '$.usage_metadata.prompt_token_count' + ) AS INT64), + SAFE_CAST(JSON_VALUE(content, '$.usage.prompt') AS INT64), + SAFE_CAST(JSON_VALUE(attributes, '$.input_tokens') AS INT64), + 0 + )) AS input_tokens, + SUM(COALESCE( + SAFE_CAST(JSON_VALUE( + attributes, '$.usage_metadata.candidates_token_count' + ) AS INT64), + SAFE_CAST(JSON_VALUE(content, '$.usage.completion') AS INT64), + SAFE_CAST(JSON_VALUE(attributes, '$.output_tokens') AS INT64), + 0 + )) AS output_tokens, + SUM(COALESCE( + SAFE_CAST(JSON_VALUE( + attributes, '$.usage_metadata.total_token_count' + ) AS INT64), + SAFE_CAST(JSON_VALUE(content, '$.usage.total') AS INT64), + COALESCE( + SAFE_CAST(JSON_VALUE(attributes, '$.input_tokens') AS INT64), + 0 + ) + COALESCE( + SAFE_CAST(JSON_VALUE(attributes, '$.output_tokens') AS INT64), + 0 + ) + )) AS total_tokens + FROM `{config.table_ref}` + WHERE session_id IN UNNEST(@session_ids) + GROUP BY session_id + ORDER BY session_id + """ + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ArrayQueryParameter("session_ids", "STRING", session_ids), + ] + ) + rows: list[dict[str, Any]] = [] + for attempt in range(1, attempts + 1): + if wait_seconds and attempt > 1: + time.sleep(wait_seconds) + rows = [ + dict(r) for r in client.query(query, job_config=job_config).result() + ] + if len(rows) >= len(set(session_ids)): + break + return rows + + +def require_complete_session_metrics( + rows: list[dict[str, Any]], + session_ids: list[str], + *, + label: str, +) -> None: + """Validate that BigQuery returned complete and usable metric rows.""" + expected = {str(session_id) for session_id in session_ids if session_id} + observed = {str(row.get("session_id", "")) for row in rows} + missing = sorted(expected - observed) + if missing: + raise RuntimeError( + f"Only found 
{len(observed)}/{len(expected)} {label} sessions in " + "BigQuery after retries. Missing session IDs: " + ", ".join(missing) + ) + + total_events = sum(int(row.get("event_count") or 0) for row in rows) + total_tokens = sum(float(row.get("total_tokens") or 0) for row in rows) + if total_events and total_tokens == 0: + config = load_config() + raise RuntimeError( + "Token extraction produced zero total tokens even though trace events " + f"exist. The analytics plugin schema may have changed; inspect " + f"LLM_RESPONSE rows in `{config.table_ref}`." + ) + + +def fetch_tool_counts(session_ids: list[str]) -> list[dict[str, Any]]: + """Fetch aggregate tool-call counts for the selected sessions.""" + if not session_ids: + return [] + from google.cloud import bigquery + + config = load_config() + client = _bq_client(config) + query = f""" + SELECT + JSON_VALUE(content, '$.tool') AS tool_name, + COUNT(*) AS calls + FROM `{config.table_ref}` + WHERE session_id IN UNNEST(@session_ids) + AND event_type = 'TOOL_STARTING' + GROUP BY tool_name + ORDER BY calls DESC, tool_name + """ + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ArrayQueryParameter("session_ids", "STRING", session_ids), + ] + ) + return [dict(r) for r in client.query(query, job_config=job_config).result()] + + +def summarize(rows: list[dict[str, Any]]) -> dict[str, Any]: + """Aggregate per-session metrics into a compact summary.""" + if not rows: + return { + "sessions": 0, + "avg_total_tokens": 0.0, + "avg_input_tokens": 0.0, + "avg_output_tokens": 0.0, + "avg_tool_calls": 0.0, + "avg_llm_calls": 0.0, + "total_broad_lookup_calls": 0, + "sessions_with_broad_lookup": 0, + "broad_lookup_session_rate": 0.0, + "total_tool_errors": 0, + } + count = len(rows) + + def total(name: str) -> float: + return sum(float(r.get(name) or 0) for r in rows) + + broad_sessions = sum(1 for r in rows if int(r.get("broad_lookup_calls") or 0)) + return { + "sessions": count, + "avg_total_tokens": 
round(total("total_tokens") / count, 1), + "avg_input_tokens": round(total("input_tokens") / count, 1), + "avg_output_tokens": round(total("output_tokens") / count, 1), + "avg_tool_calls": round(total("tool_calls") / count, 1), + "avg_llm_calls": round(total("llm_calls") / count, 1), + "total_broad_lookup_calls": int(total("broad_lookup_calls")), + "sessions_with_broad_lookup": broad_sessions, + "broad_lookup_session_rate": round(broad_sessions / count, 3), + "total_tool_errors": int(total("tool_errors")), + } + + +def run_sdk_evaluators( + session_ids: list[str], + *, + token_budget: int, + max_cost_usd: float, + max_turns: int, +) -> dict[str, Any]: + """Run SDK deterministic evaluator gates over the selected sessions.""" + from bigquery_agent_analytics import Client + from bigquery_agent_analytics.trace import TraceFilter + + try: + from bigquery_agent_analytics.evaluators import SystemEvaluator + except ImportError: + from bigquery_agent_analytics.evaluators import CodeEvaluator as SystemEvaluator + + config = load_config() + client = Client( + project_id=config.project_id, + dataset_id=config.dataset_id, + table_id=config.table_id, + location=config.location, + ) + filters = TraceFilter(session_ids=session_ids) + evaluators = { + "token_efficiency": SystemEvaluator.token_efficiency( + max_tokens=token_budget + ), + "cost": SystemEvaluator.cost_per_session(max_cost_usd=max_cost_usd), + "turn_count": SystemEvaluator.turn_count(max_turns=max_turns), + "error_rate": SystemEvaluator.error_rate(max_error_rate=0.0), + } + reports = {} + for name, evaluator in evaluators.items(): + report = client.evaluate(evaluator=evaluator, filters=filters) + observed = [] + for session_score in report.session_scores: + for detail in session_score.details.values(): + if isinstance(detail, dict) and detail.get("observed") is not None: + observed.append(detail["observed"]) + reports[name] = { + "total_sessions": report.total_sessions, + "passed_sessions": report.passed_sessions, + 
"failed_sessions": report.failed_sessions, + "pass_rate": report.pass_rate, + "avg_observed": ( + round(sum(observed) / len(observed), 4) if observed else None + ), + } + return reports diff --git a/examples/self_evolving_agent_demo/analyze_and_evolve.py b/examples/self_evolving_agent_demo/analyze_and_evolve.py new file mode 100755 index 00000000..546a9b44 --- /dev/null +++ b/examples/self_evolving_agent_demo/analyze_and_evolve.py @@ -0,0 +1,377 @@ +#!/usr/bin/env python3 +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Analyze baseline traces and promote an evolved prompt when warranted.""" + +from __future__ import annotations + +import argparse +import difflib +import json +import os +import sys +from typing import Any + +_DEMO_DIR = os.path.dirname(os.path.abspath(__file__)) +if _DEMO_DIR not in sys.path: + sys.path.insert(0, _DEMO_DIR) + +from agent.prompt_store import read_state +from agent.prompt_store import write_prompt +from agent.tools import DEMO_TOOLS +from analytics.session_metrics import fetch_session_metrics +from analytics.session_metrics import fetch_tool_counts +from analytics.session_metrics import load_quality_summary +from analytics.session_metrics import load_session_ids +from analytics.session_metrics import require_complete_session_metrics +from analytics.session_metrics import run_sdk_evaluators +from analytics.session_metrics import summarize + +DEFAULT_MIN_BROAD_LOOKUP_RATE = 0.5 +DEFAULT_MAX_AVG_TOOL_CALLS = 2.0 +MIN_GENERATED_PROMPT_CHARS = 120 + + +def _tool_signatures() -> str: + lines = [] + for tool in DEMO_TOOLS: + name = getattr(tool, "__name__", "unknown") + doc = (getattr(tool, "__doc__", "") or "").strip().partition("\n")[0] + lines.append(f"- {name}: {doc}") + return "\n".join(lines) + + +def _load_eval_contract(path: str) -> list[dict[str, str]]: + """Load the deterministic routing contract from run-agent results.""" + with open(path) as f: + rows = json.load(f) + if isinstance(rows, dict): + rows = rows.get("sessions", []) + contract = [] + for row in rows: + contract.append( + { + "case_id": str(row.get("case_id", "")), + "question": str(row.get("question", "")), + "expected_tool": str(row.get("expected_tool", "")), + "avoid_tool": str(row.get("avoid_tool", "")), + } + ) + return contract + + +def _observations( + summary: dict[str, Any], + *, + token_budget: int, + min_broad_lookup_rate: float, + max_avg_tool_calls: float, +) -> list[str]: + obs = [] + if summary["avg_total_tokens"] > token_budget: + obs.append("Average total tokens 
are above the configured session budget.") + if summary["broad_lookup_session_rate"] >= min_broad_lookup_rate: + obs.append( + "Most sessions used the broad basketball reference tool even though each " + "eval case has a narrow tool path." + ) + if summary["avg_tool_calls"] > max_avg_tool_calls: + obs.append( + "Average tool calls are high for one-question single-turn tasks." + ) + if not obs: + obs.append("No clear token or tool-use hotspot was detected.") + return obs + + +def _generate_candidate_prompt( + *, + current_prompt: str, + observations: list[str], + summary: dict[str, Any], + tool_counts: list[dict[str, Any]], + quality: dict[str, Any], + eval_contract: list[dict[str, str]], + model_id: str, +) -> dict[str, str]: + """Generate an improved prompt from trace analysis.""" + prompt = f"""\ +You are improving an ADK basketball analytics agent prompt from its own trace +analytics. Generate a complete replacement system prompt. + +Current prompt: +``` +{current_prompt} +``` + +Available tools: +{_tool_signatures()} + +SDK trace summary: +{json.dumps(summary, indent=2)} + +Tool counts: +{json.dumps(tool_counts, indent=2)} + +Deterministic quality summary: +{json.dumps(quality, indent=2)} + +Deterministic routing contract from the eval run: +{json.dumps(eval_contract, indent=2)} + +Observed issues: +{json.dumps(observations, indent=2)} + +Requirements for the improved prompt: +- Keep the same agent role and basketball analytics task. +- Remove the broad-first behavior that caused lookup_basketball_reference overuse. +- Instruct the agent to choose the narrowest sufficient tool. +- Preserve every expected_tool / avoid_tool pair in the routing contract. +- Treat a named-team strategy, strengths, profile, or late-game offense + question as a single-team question that calls get_team_profile. +- Treat a named-player scoring, strengths, profile, or quick-read question + as a single-player question that calls get_player_stats. 
+- Use lookup_basketball_reference only for league-wide or unsupported ambiguous + questions where no narrow player, team, or comparison tool fits. +- Remove the fixed five-section scouting-report format from the old prompt. +- Keep final answers to at most four bullets or 120 words. +- Preserve answer quality and tool grounding. +- Keep final answers concise. +- Do not mention trace analytics, BigQuery, SDKs, prompts, or optimization to users. + +Return JSON with exactly: +{{ + "improved_prompt": "full replacement system prompt", + "changes_summary": "one sentence explaining the improvement" +}} +""" + from google import genai + from google.genai.types import GenerateContentConfig + + client = genai.Client() + response = client.models.generate_content( + model=model_id, + contents=prompt, + config=GenerateContentConfig( + temperature=0.2, + response_mime_type="application/json", + ), + ) + data = json.loads(response.text or "{}") + improved = str(data.get("improved_prompt", "")).strip() + changes = str(data.get("changes_summary", "")).strip() + # A complete system prompt should at least include role and routing guidance. 
+ if len(improved) < MIN_GENERATED_PROMPT_CHARS: + raise ValueError("Generated prompt was too short.") + return { + "source": "model", + "improved_prompt": improved, + "changes_summary": changes or "Generated from SDK trace analysis.", + } + + +def _write_prompt_diff( + *, + output_dir: str, + before_prompt: str, + after_prompt: str, + observations: list[str], + changes_summary: str, +) -> str: + """Write a human-readable V1 -> generated V2 prompt diff.""" + diff_lines = list( + difflib.unified_diff( + before_prompt.splitlines(), + after_prompt.splitlines(), + fromfile="agent_v1_prompt", + tofile="generated_agent_v2_prompt", + lineterm="", + ) + ) + path = os.path.join(output_dir, "prompt_diff.md") + with open(path, "w") as f: + f.write("# Prompt Diff: Agent V1 -> Generated V2\n\n") + f.write("## Trace Signal\n\n") + for obs in observations: + f.write(f"- {obs}\n") + f.write("\n## Generated Improvement\n\n") + f.write(f"{changes_summary}\n\n") + f.write("## Unified Diff\n\n") + f.write("```diff\n") + f.write("\n".join(diff_lines)) + f.write("\n```\n") + return path + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Analyze demo sessions and evolve the active prompt." 
+ ) + parser.add_argument("--sessions", required=True) + parser.add_argument( + "--output-dir", default=os.path.join(_DEMO_DIR, "reports") + ) + parser.add_argument("--token-budget", type=int, default=12000) + parser.add_argument("--max-cost-usd", type=float, default=0.05) + parser.add_argument("--max-turns", type=int, default=4) + parser.add_argument("--min-quality-pass-rate", type=float, default=1.0) + parser.add_argument( + "--min-broad-lookup-rate", + type=float, + default=DEFAULT_MIN_BROAD_LOOKUP_RATE, + ) + parser.add_argument( + "--max-avg-tool-calls", + type=float, + default=DEFAULT_MAX_AVG_TOOL_CALLS, + ) + parser.add_argument( + "--generator-model", + default=os.getenv( + "SELF_EVOLVING_PROMPT_GENERATOR_MODEL", "gemini-2.5-flash" + ), + ) + parser.add_argument("--wait-seconds", type=int, default=15) + parser.add_argument("--attempts", type=int, default=6) + args = parser.parse_args() + + os.makedirs(args.output_dir, exist_ok=True) + session_ids = load_session_ids(args.sessions) + if not session_ids: + raise SystemExit(f"No session IDs found in {args.sessions}") + + rows = fetch_session_metrics( + session_ids, + attempts=args.attempts, + wait_seconds=args.wait_seconds, + ) + try: + require_complete_session_metrics(rows, session_ids, label="baseline") + except RuntimeError as exc: + raise SystemExit(str(exc)) from exc + + summary = summarize(rows) + tool_counts = fetch_tool_counts(session_ids) + quality = load_quality_summary(args.sessions) + eval_contract = _load_eval_contract(args.sessions) + sdk_reports = run_sdk_evaluators( + session_ids, + token_budget=args.token_budget, + max_cost_usd=args.max_cost_usd, + max_turns=args.max_turns, + ) + observations = _observations( + summary, + token_budget=args.token_budget, + min_broad_lookup_rate=args.min_broad_lookup_rate, + max_avg_tool_calls=args.max_avg_tool_calls, + ) + current_state = read_state() + should_promote = ( + current_state["version"] == "v1" + and quality["pass_rate"] >= args.min_quality_pass_rate 
+ and ( + summary["broad_lookup_session_rate"] >= args.min_broad_lookup_rate + or summary["avg_total_tokens"] > args.token_budget + ) + ) + + evolution = { + "from_version": current_state["version"], + "to_version": current_state["version"], + "promoted": False, + "rationale": "No candidate prompt generated.", + } + if should_promote: + try: + candidate = _generate_candidate_prompt( + current_prompt=current_state["prompt"], + observations=observations, + summary=summary, + tool_counts=tool_counts, + quality=quality, + eval_contract=eval_contract, + model_id=args.generator_model, + ) + except Exception as exc: + raise SystemExit( + "Prompt generation failed; no fallback prompt was promoted. " + f"Original error: {exc}" + ) from exc + candidate_path = os.path.join(args.output_dir, "candidate_prompt.json") + with open(candidate_path, "w") as f: + json.dump(candidate, f, indent=2) + f.write("\n") + prompt_diff_path = _write_prompt_diff( + output_dir=args.output_dir, + before_prompt=current_state["prompt"], + after_prompt=candidate["improved_prompt"], + observations=observations, + changes_summary=candidate["changes_summary"], + ) + rationale = ( + "Generated V2 from SDK trace analysis because baseline quality met " + "the configured gate and an operational waste signal was detected." 
+ ) + write_prompt("v2", candidate["improved_prompt"], rationale) + evolution = { + "from_version": "v1", + "to_version": "v2", + "promoted": True, + "rationale": rationale, + "candidate_path": candidate_path, + "prompt_diff_path": prompt_diff_path, + "changes_summary": candidate["changes_summary"], + "candidate_source": candidate.get("source", "model"), + "generator_model": args.generator_model, + } + + report = { + "quality": quality, + "session_summary": summary, + "tool_counts": tool_counts, + "sdk_evaluator_reports": sdk_reports, + "observations": observations, + "evolution": evolution, + } + output_path = os.path.join(args.output_dir, "self_evolution_analysis.json") + with open(output_path, "w") as f: + json.dump(report, f, indent=2) + f.write("\n") + + print("") + print(" SDK-backed self-evolution analysis") + print(" ----------------------------------") + print(f" Sessions: {summary['sessions']}") + print(f" Avg total tokens: {summary['avg_total_tokens']}") + print(f" Avg tool calls: {summary['avg_tool_calls']}") + print( + " Broad lookup sessions: " + f"{summary['sessions_with_broad_lookup']}/{summary['sessions']}" + ) + print(f" Quality pass rate: {quality['pass_rate']:.0%}") + print( + f" Evolution: {evolution['from_version']} -> {evolution['to_version']}" + ) + print(f" Promoted: {evolution['promoted']}") + if evolution.get("candidate_path"): + print(f" Candidate prompt: {evolution['candidate_path']}") + if evolution.get("prompt_diff_path"): + print(f" Prompt diff: {evolution['prompt_diff_path']}") + print(f" Report: {output_path}") + + +if __name__ == "__main__": + main() diff --git a/examples/self_evolving_agent_demo/compare_runs.py b/examples/self_evolving_agent_demo/compare_runs.py new file mode 100755 index 00000000..41413f25 --- /dev/null +++ b/examples/self_evolving_agent_demo/compare_runs.py @@ -0,0 +1,266 @@ +#!/usr/bin/env python3 +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not 
use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Compare baseline and evolved demo runs.""" + +from __future__ import annotations + +import argparse +import json +import os +import sys +from typing import Any + +_DEMO_DIR = os.path.dirname(os.path.abspath(__file__)) +if _DEMO_DIR not in sys.path: + sys.path.insert(0, _DEMO_DIR) + +from analytics.session_metrics import fetch_session_metrics +from analytics.session_metrics import load_quality_summary +from analytics.session_metrics import load_session_ids +from analytics.session_metrics import require_complete_session_metrics +from analytics.session_metrics import summarize + + +def _load( + path: str, attempts: int, wait_seconds: int, label: str +) -> tuple[dict, dict]: + ids = load_session_ids(path) + rows = fetch_session_metrics( + ids, + attempts=attempts, + wait_seconds=wait_seconds, + ) + try: + require_complete_session_metrics(rows, ids, label=label) + except RuntimeError as exc: + raise SystemExit(str(exc)) from exc + return summarize(rows), load_quality_summary(path) + + +def _pct_delta(before: float, after: float) -> float | None: + if before == 0: + return 0.0 if after == 0 else None + return round((after - before) / before, 4) + + +def _format_pct_delta(value: float | None) -> str: + if value is None: + return "n/a" + return f"{value:+.1%}" + + +def _read_candidate_metadata(output_dir: str) -> dict[str, str]: + path = os.path.join(output_dir, "candidate_prompt.json") + if not os.path.exists(path): + return {} + with open(path) as f: + data = json.load(f) + return { + "changes_summary": 
str(data.get("changes_summary", "")), + "source": str(data.get("source", "")), + } + + +def _write_markdown_report( + *, + output_path: str, + result: dict[str, Any], +) -> str: + """Write a concise operator-facing before/after report.""" + output_dir = os.path.dirname(output_path) + path = os.path.join(output_dir, "comparison.md") + before_quality = result["before"]["quality"] + after_quality = result["after"]["quality"] + before_metrics = result["before"]["metrics"] + after_metrics = result["after"]["metrics"] + deltas = result["deltas"] + candidate_metadata = _read_candidate_metadata(output_dir) + candidate_summary = candidate_metadata.get("changes_summary", "") + candidate_source = candidate_metadata.get("source", "") + prompt_diff_path = os.path.join(output_dir, "prompt_diff.md") + has_prompt_diff = os.path.exists(prompt_diff_path) + + with open(path, "w") as f: + f.write("# Agent V1 -> Generated V2 Comparison\n\n") + f.write("## What Trace Analysis Changed\n\n") + if candidate_summary: + f.write(f"{candidate_summary}\n\n") + else: + f.write( + "The generated V2 prompt was created from the baseline trace " + "summary, tool counts, quality summary, and available tool " + "signatures.\n\n" + ) + if candidate_source: + f.write(f"Candidate source: `{candidate_source}`.\n\n") + if has_prompt_diff: + f.write("See `prompt_diff.md` for the exact prompt-level diff.\n\n") + + f.write("## Before / After Metrics\n\n") + f.write("| Metric | V1 | Generated V2 | Delta |\n") + f.write("|---|---:|---:|---:|\n") + f.write( + "| Quality pass rate | " + f"{before_quality['pass_rate']:.0%} | " + f"{after_quality['pass_rate']:.0%} | " + f"{after_quality['pass_rate'] - before_quality['pass_rate']:+.0%} |\n" + ) + f.write( + "| Avg total tokens | " + f"{before_metrics['avg_total_tokens']} | " + f"{after_metrics['avg_total_tokens']} | " + f"{_format_pct_delta(deltas['avg_total_tokens_pct'])} |\n" + ) + f.write( + "| Avg tool calls | " + f"{before_metrics['avg_tool_calls']} | " + 
f"{after_metrics['avg_tool_calls']} | " + f"{_format_pct_delta(deltas['avg_tool_calls_pct'])} |\n" + ) + f.write( + "| Broad lookup calls | " + f"{before_metrics['total_broad_lookup_calls']} | " + f"{after_metrics['total_broad_lookup_calls']} | " + f"{deltas['broad_lookup_calls']:+d} |\n" + ) + f.write( + "| Tool errors | " + f"{before_metrics['total_tool_errors']} | " + f"{after_metrics['total_tool_errors']} | " + f"{after_metrics['total_tool_errors'] - before_metrics['total_tool_errors']:+d} |\n" + ) + + f.write("\n## Acceptance Gates\n\n") + for name, passed in result["gates"].items(): + f.write(f"- `{name}`: {passed}\n") + + f.write("\n## Why This Demonstrates Self-Evolution\n\n") + f.write( + "The demo does not just compare two static prompts. It uses the " + "baseline BigQuery traces to identify broad-tool overuse and token " + "waste, generates a replacement prompt from that evidence, reruns " + "the agent, then records whether the generated V2 preserved quality " + "while reducing the measured waste.\n" + ) + return path + + +def main() -> None: + parser = argparse.ArgumentParser(description="Compare two demo runs.") + parser.add_argument("--before", required=True) + parser.add_argument("--after", required=True) + parser.add_argument("--output", default=None) + parser.add_argument("--min-token-reduction", type=float, default=0.05) + parser.add_argument("--wait-seconds", type=int, default=15) + parser.add_argument("--attempts", type=int, default=6) + parser.add_argument( + "--fail-on-gate-failure", + action="store_true", + help="Exit nonzero when acceptance gates fail.", + ) + args = parser.parse_args() + + before_summary, before_quality = _load( + args.before, args.attempts, args.wait_seconds, "baseline" + ) + after_summary, after_quality = _load( + args.after, args.attempts, args.wait_seconds, "evolved" + ) + token_delta = _pct_delta( + before_summary["avg_total_tokens"], + after_summary["avg_total_tokens"], + ) + tool_delta = _pct_delta( + 
before_summary["avg_tool_calls"], + after_summary["avg_tool_calls"], + ) + broad_delta = ( + after_summary["total_broad_lookup_calls"] + - before_summary["total_broad_lookup_calls"] + ) + gates = { + "quality_not_regressed": ( + after_quality["pass_rate"] >= before_quality["pass_rate"] + ), + "tokens_reduced": ( + token_delta is not None and token_delta <= -args.min_token_reduction + ), + "broad_lookup_reduced": broad_delta < 0, + "tool_errors_clear": after_summary["total_tool_errors"] == 0, + } + result: dict[str, Any] = { + "before": {"quality": before_quality, "metrics": before_summary}, + "after": {"quality": after_quality, "metrics": after_summary}, + "deltas": { + "avg_total_tokens_pct": token_delta, + "avg_tool_calls_pct": tool_delta, + "broad_lookup_calls": broad_delta, + }, + "gates": gates, + "passed": all(gates.values()), + } + + if args.output: + os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True) + markdown_path = _write_markdown_report( + output_path=args.output, + result=result, + ) + result["artifacts"] = { + "markdown_report": markdown_path, + "prompt_diff": os.path.join( + os.path.dirname(args.output), "prompt_diff.md" + ), + } + with open(args.output, "w") as f: + json.dump(result, f, indent=2) + f.write("\n") + + print("") + print(" Before/after self-evolution report") + print(" ----------------------------------") + print( + f" Quality pass rate: {before_quality['pass_rate']:.0%}" + f" -> {after_quality['pass_rate']:.0%}" + ) + print( + f" Avg total tokens: {before_summary['avg_total_tokens']}" + f" -> {after_summary['avg_total_tokens']}" + f" ({_format_pct_delta(token_delta)})" + ) + print( + f" Avg tool calls: {before_summary['avg_tool_calls']}" + f" -> {after_summary['avg_tool_calls']}" + f" ({_format_pct_delta(tool_delta)})" + ) + print( + " Broad lookup calls: " + f"{before_summary['total_broad_lookup_calls']}" + f" -> {after_summary['total_broad_lookup_calls']}" + ) + print(" Gates:") + for name, passed in gates.items(): + print(f" 
{name}: {passed}") + if args.output: + print(f" Report: {args.output}") + print(f" Markdown: {markdown_path}") + + if args.fail_on_gate_failure and not result["passed"]: + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/examples/self_evolving_agent_demo/eval/eval_cases.json b/examples/self_evolving_agent_demo/eval/eval_cases.json new file mode 100644 index 00000000..c3528642 --- /dev/null +++ b/examples/self_evolving_agent_demo/eval/eval_cases.json @@ -0,0 +1,28 @@ +{ + "eval_cases": [ + { + "id": "player_compare_jokic_embiid", + "question": "For half-court offense, who is the better hub: Nikola Jokic or Joel Embiid?", + "expected_tool": "compare_players", + "avoid_tool": "lookup_basketball_reference" + }, + { + "id": "team_compare_celtics_thunder", + "question": "Which team has the stronger playoff profile, the Celtics or the Thunder?", + "expected_tool": "compare_teams", + "avoid_tool": "lookup_basketball_reference" + }, + { + "id": "single_player_shai", + "question": "Give me a quick read on Shai Gilgeous-Alexander's scoring profile.", + "expected_tool": "get_player_stats", + "avoid_tool": "lookup_basketball_reference" + }, + { + "id": "single_team_nuggets", + "question": "How should Denver build its late-game offense around the Nuggets' strengths?", + "expected_tool": "get_team_profile", + "avoid_tool": "lookup_basketball_reference" + } + ] +} diff --git a/examples/self_evolving_agent_demo/reset.sh b/examples/self_evolving_agent_demo/reset.sh new file mode 100755 index 00000000..c7e1bd1c --- /dev/null +++ b/examples/self_evolving_agent_demo/reset.sh @@ -0,0 +1,26 @@ +#!/usr/bin/env bash +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PYTHON_BIN="${PYTHON_BIN:-python3}" + +cd "$SCRIPT_DIR" +"$PYTHON_BIN" -m agent.prompt_store reset >/dev/null +rm -rf "$SCRIPT_DIR/reports" + +echo "Demo state reset to V1. Reports were removed." +echo "BigQuery data was left intact. Use setup.sh to recreate .env if needed." diff --git a/examples/self_evolving_agent_demo/run_agent.py b/examples/self_evolving_agent_demo/run_agent.py new file mode 100755 index 00000000..09550a83 --- /dev/null +++ b/examples/self_evolving_agent_demo/run_agent.py @@ -0,0 +1,201 @@ +#!/usr/bin/env python3 +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Run demo eval questions through the ADK agent with BigQuery logging.""" + +from __future__ import annotations + +import argparse +import asyncio +import json +import os +import sys +from typing import Any + +_DEMO_DIR = os.path.dirname(os.path.abspath(__file__)) +if _DEMO_DIR not in sys.path: + sys.path.insert(0, _DEMO_DIR) + + +def _load_cases(path: str) -> list[dict[str, Any]]: + with open(path) as f: + return json.load(f)["eval_cases"] + + +def _part_text(part: Any) -> str: + text = getattr(part, "text", None) + return text or "" + + +def _part_function_name(part: Any) -> str | None: + function_call = getattr(part, "function_call", None) + if not function_call: + return None + return getattr(function_call, "name", None) + + +async def _run_case( + runner: Any, + case: dict[str, Any], + *, + user_id: str, + timeout_seconds: int, +) -> dict[str, Any]: + from google.genai.types import Content + from google.genai.types import Part + + session = await runner.session_service.create_session( + app_name=runner.app_name, + user_id=user_id, + ) + user_message = Content(role="user", parts=[Part(text=case["question"])]) + response_text = "" + tools_called: list[str] = [] + + async def _consume() -> None: + nonlocal response_text + async for event in runner.run_async( + user_id=user_id, + session_id=session.id, + new_message=user_message, + ): + if not event.content or not event.content.parts: + continue + for part in event.content.parts: + response_text += _part_text(part) + tool_name = _part_function_name(part) + if tool_name: + tools_called.append(tool_name) + + await asyncio.wait_for(_consume(), timeout=timeout_seconds) + + expected_tool = case.get("expected_tool", "") + avoid_tool = case.get("avoid_tool", "") + expected_tool_used = expected_tool in tools_called if expected_tool else True + avoid_tool_used = avoid_tool in tools_called if avoid_tool else False + # Quality checks answerability; avoid-tool overuse is the separate + # efficiency signal that drives this 
demo's evolution. + quality_passed = bool(response_text.strip()) and expected_tool_used + return { + "case_id": case["id"], + "question": case["question"], + "expected_tool": expected_tool, + "avoid_tool": avoid_tool, + "tools_called": tools_called, + "expected_tool_used": expected_tool_used, + "avoid_tool_used": avoid_tool_used, + "quality_passed": quality_passed, + "response": response_text.strip(), + "session_id": session.id, + } + + +async def _run_all(args: argparse.Namespace) -> list[dict[str, Any]]: + from agent.agent import APP_NAME + from agent.agent import bq_logging_plugin + from agent.agent import PROMPT_VERSION + from agent.agent import root_agent + from google.adk.runners import InMemoryRunner + + cases = _load_cases(args.eval_cases) + runner = InMemoryRunner( + agent=root_agent, + app_name=APP_NAME, + plugins=[bq_logging_plugin], + ) + semaphore = asyncio.Semaphore(args.max_concurrency) + + async def _guarded(i: int, case: dict[str, Any]) -> dict[str, Any]: + async with semaphore: + print(f" [{i}/{len(cases)}] {case['id']}: {case['question']}") + try: + result = await _run_case( + runner, + case, + user_id=f"{args.label}_user", + timeout_seconds=args.timeout, + ) + except Exception as exc: + result = { + "case_id": case["id"], + "question": case["question"], + "expected_tool": case.get("expected_tool", ""), + "avoid_tool": case.get("avoid_tool", ""), + "tools_called": [], + "expected_tool_used": False, + "avoid_tool_used": False, + "quality_passed": False, + "response": f"ERROR: {type(exc).__name__}: {exc}", + "session_id": "", + } + result["label"] = args.label + result["prompt_version"] = PROMPT_VERSION + answer = result["response"].replace("\n", " ").strip() + if len(answer) > 180: + answer = answer[:180] + "..."
+ print(f" tools: {', '.join(result['tools_called']) or 'none'}") + print(f" pass: {result['quality_passed']}") + print(f" ans: {answer}") + return result + + return list( + await asyncio.gather( + *[_guarded(i, case) for i, case in enumerate(cases, 1)] + ) + ) + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Run self-evolving agent demo eval traffic." + ) + parser.add_argument( + "--eval-cases", + default=os.path.join(_DEMO_DIR, "eval", "eval_cases.json"), + ) + parser.add_argument( + "--output-dir", default=os.path.join(_DEMO_DIR, "reports") + ) + parser.add_argument("--label", default="baseline") + parser.add_argument("--max-concurrency", type=int, default=2) + parser.add_argument("--timeout", type=int, default=180) + parser.add_argument( + "--allow-failures", + action="store_true", + help="Write results without exiting nonzero on quality failures.", + ) + args = parser.parse_args() + + os.makedirs(args.output_dir, exist_ok=True) + results = asyncio.run(_run_all(args)) + + labeled_path = os.path.join( + args.output_dir, f"latest_eval_results_{args.label}.json" + ) + latest_path = os.path.join(args.output_dir, "latest_eval_results.json") + for path in (labeled_path, latest_path): + with open(path, "w") as f: + json.dump(results, f, indent=2) + f.write("\n") + print("") + print(f" Results saved to: {labeled_path}") + + failures = sum(1 for r in results if not r.get("quality_passed")) + if failures and not args.allow_failures: + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/examples/self_evolving_agent_demo/run_e2e_demo.sh b/examples/self_evolving_agent_demo/run_e2e_demo.sh new file mode 100755 index 00000000..02bc921b --- /dev/null +++ b/examples/self_evolving_agent_demo/run_e2e_demo.sh @@ -0,0 +1,99 @@ +#!/usr/bin/env bash +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PYTHON_BIN="${PYTHON_BIN:-python3}" + +if ! "$PYTHON_BIN" - <<'PY' >/dev/null; then +import sys +raise SystemExit(0 if sys.version_info >= (3, 10) else 1) +PY + echo "ERROR: Python 3.10+ is required. Set PYTHON_BIN to a 3.10+ interpreter." >&2 + exit 1 +fi + +if [[ -f "$SCRIPT_DIR/.env" ]]; then + set -a + source "$SCRIPT_DIR/.env" + set +a +else + echo "ERROR: .env not found. Run ./setup.sh first." >&2 + exit 1 +fi + +export PYTHONPATH="$SCRIPT_DIR:${PYTHONPATH:-}" +export TOKEN_BUDGET="${TOKEN_BUDGET:-12000}" +export MAX_COST_USD="${MAX_COST_USD:-0.05}" +export SELF_EVOLVING_PROMPT_GENERATOR_MODEL="${SELF_EVOLVING_PROMPT_GENERATOR_MODEL:-gemini-2.5-flash}" + +RUN_ID="$(date +%Y%m%d_%H%M%S)" +REPORTS_DIR="$SCRIPT_DIR/reports/run_${RUN_ID}" +mkdir -p "$REPORTS_DIR" + +echo "" +echo "============================================" +echo " Self-Evolving Agent Demo" +echo "============================================" +echo "" +echo "Reports: $REPORTS_DIR" +echo "Estimated one-run cloud cost: typically well under \$1 with defaults." +echo "" + +cd "$SCRIPT_DIR" +"$PYTHON_BIN" -m agent.prompt_store reset >/dev/null + +echo "[1/5] Run baseline V1 agent..." +"$PYTHON_BIN" run_agent.py \ + --label baseline \ + --output-dir "$REPORTS_DIR" \ + --allow-failures + +echo "" +echo "[2/5] Analyze traces and generate evolved prompt..." 
+"$PYTHON_BIN" analyze_and_evolve.py \ + --sessions "$REPORTS_DIR/latest_eval_results_baseline.json" \ + --output-dir "$REPORTS_DIR" \ + --token-budget "$TOKEN_BUDGET" \ + --max-cost-usd "$MAX_COST_USD" \ + --generator-model "$SELF_EVOLVING_PROMPT_GENERATOR_MODEL" + +echo "" +echo "[3/5] Run evolved agent..." +"$PYTHON_BIN" run_agent.py \ + --label evolved \ + --output-dir "$REPORTS_DIR" \ + --allow-failures + +echo "" +echo "[4/5] Compare before and after..." +"$PYTHON_BIN" compare_runs.py \ + --before "$REPORTS_DIR/latest_eval_results_baseline.json" \ + --after "$REPORTS_DIR/latest_eval_results_evolved.json" \ + --output "$REPORTS_DIR/comparison.json" + +echo "" +echo "[5/5] Done." +echo "" +echo "Key artifacts:" +echo " $REPORTS_DIR/latest_eval_results_baseline.json" +echo " $REPORTS_DIR/candidate_prompt.json" +echo " $REPORTS_DIR/prompt_diff.md" +echo " $REPORTS_DIR/self_evolution_analysis.json" +echo " $REPORTS_DIR/latest_eval_results_evolved.json" +echo " $REPORTS_DIR/comparison.json" +echo " $REPORTS_DIR/comparison.md" +echo "" diff --git a/examples/self_evolving_agent_demo/setup.sh b/examples/self_evolving_agent_demo/setup.sh new file mode 100755 index 00000000..7f12f355 --- /dev/null +++ b/examples/self_evolving_agent_demo/setup.sh @@ -0,0 +1,124 @@ +#!/usr/bin/env bash +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" +ENV_FILE="$SCRIPT_DIR/.env" + +echo "" +echo "============================================" +echo " Self-Evolving Agent Demo - Setup" +echo "============================================" +echo "" +echo "Estimated one-run cloud cost: typically well under \$1 for the" +echo "default four-question demo. Setup itself only enables APIs, installs" +echo "local dependencies, and creates a small BigQuery dataset." +echo "" + +PYTHON_BIN="${PYTHON_BIN:-python3}" +if ! command -v "$PYTHON_BIN" &>/dev/null; then + echo "ERROR: $PYTHON_BIN is required." >&2 + exit 1 +fi +if ! "$PYTHON_BIN" - <<'PY' >/dev/null; then +import sys +raise SystemExit(0 if sys.version_info >= (3, 10) else 1) +PY + echo "ERROR: Python 3.10+ is required. Set PYTHON_BIN to a 3.10+ interpreter." >&2 + exit 1 +fi +if ! command -v gcloud &>/dev/null; then + echo "ERROR: gcloud CLI is required." >&2 + exit 1 +fi +if ! command -v bq &>/dev/null; then + echo "ERROR: bq CLI is required." >&2 + exit 1 +fi + +PROJECT_ID="${PROJECT_ID:-$(gcloud config get-value project 2>/dev/null || true)}" +if [[ -z "$PROJECT_ID" ]]; then + echo "ERROR: No project set. Export PROJECT_ID or run:" >&2 + echo " gcloud config set project YOUR_PROJECT_ID" >&2 + exit 1 +fi +echo "Project: $PROJECT_ID" + +if ! gcloud auth application-default print-access-token &>/dev/null 2>&1; then + echo "Application default credentials not found. Starting login..." + gcloud auth application-default login +fi + +echo "" +echo "Enabling required APIs..." +gcloud services enable bigquery.googleapis.com --project="$PROJECT_ID" >/dev/null +gcloud services enable aiplatform.googleapis.com --project="$PROJECT_ID" >/dev/null +echo "APIs enabled." + +echo "" +echo "Installing local package dependencies..." +"$PYTHON_BIN" -m pip install -e "$REPO_ROOT[improvement]" --quiet +echo "Dependencies installed." 
+ +DATASET_LOCATION="${DATASET_LOCATION:-${BQ_LOCATION:-us-central1}}" +SELF_EVOLVING_DATASET_ID="${SELF_EVOLVING_DATASET_ID:-self_evolving_agent_demo}" +SELF_EVOLVING_TABLE_ID="${SELF_EVOLVING_TABLE_ID:-agent_events}" +SELF_EVOLVING_AGENT_MODEL="${SELF_EVOLVING_AGENT_MODEL:-gemini-2.5-flash}" +SELF_EVOLVING_PROMPT_GENERATOR_MODEL="${SELF_EVOLVING_PROMPT_GENERATOR_MODEL:-gemini-2.5-flash}" +SELF_EVOLVING_AGENT_LOCATION="${SELF_EVOLVING_AGENT_LOCATION:-us-central1}" +TOKEN_BUDGET="${TOKEN_BUDGET:-12000}" +MAX_COST_USD="${MAX_COST_USD:-0.05}" + +if ! bq show "${PROJECT_ID}:${SELF_EVOLVING_DATASET_ID}" &>/dev/null 2>&1; then + echo "" + echo "Creating BigQuery dataset: ${SELF_EVOLVING_DATASET_ID} (${DATASET_LOCATION})" + bq mk --dataset --location="$DATASET_LOCATION" \ + "${PROJECT_ID}:${SELF_EVOLVING_DATASET_ID}" >/dev/null +else + EXISTING_LOCATION="$( + bq show --format=prettyjson "${PROJECT_ID}:${SELF_EVOLVING_DATASET_ID}" \ + | "$PYTHON_BIN" -c 'import json, sys; print(json.load(sys.stdin).get("location", ""))' + )" + if [[ "$(tr '[:upper:]' '[:lower:]' <<<"$EXISTING_LOCATION")" != "$(tr '[:upper:]' '[:lower:]' <<<"$DATASET_LOCATION")" ]]; then + echo "ERROR: Dataset ${SELF_EVOLVING_DATASET_ID} exists in ${EXISTING_LOCATION}," >&2 + echo "but DATASET_LOCATION is ${DATASET_LOCATION}. Use a matching location or a new dataset ID." >&2 + exit 1 + fi +fi + +cat > "$ENV_FILE" </dev/null + +echo "" +echo "Setup complete." +echo "Run:" +echo " cd $SCRIPT_DIR" +echo " ./run_e2e_demo.sh" diff --git a/tests/test_self_evolving_agent_demo.py b/tests/test_self_evolving_agent_demo.py new file mode 100644 index 00000000..05a008db --- /dev/null +++ b/tests/test_self_evolving_agent_demo.py @@ -0,0 +1,99 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Tests for pure helpers in the self-evolving agent demo.""" + +from __future__ import annotations + +from pathlib import Path +import sys + +import pytest + +_DEMO_DIR = ( + Path(__file__).resolve().parents[1] + / "examples" + / "self_evolving_agent_demo" +) +sys.path.insert(0, str(_DEMO_DIR)) + +from analytics.session_metrics import require_complete_session_metrics # noqa: E402 +from analytics.session_metrics import summarize # noqa: E402 +import analyze_and_evolve # noqa: E402 +import compare_runs # noqa: E402 + + +def test_summarize_empty_rows_has_stable_shape(): + summary = summarize([]) + + assert summary == { + "sessions": 0, + "avg_total_tokens": 0.0, + "avg_input_tokens": 0.0, + "avg_output_tokens": 0.0, + "avg_tool_calls": 0.0, + "avg_llm_calls": 0.0, + "total_broad_lookup_calls": 0, + "sessions_with_broad_lookup": 0, + "broad_lookup_session_rate": 0.0, + "total_tool_errors": 0, + } + + +def test_require_complete_session_metrics_rejects_missing_rows(): + rows = [{"session_id": "s1", "event_count": 2, "total_tokens": 100}] + + with pytest.raises(RuntimeError, match="Only found 1/2 baseline sessions"): + require_complete_session_metrics(rows, ["s1", "s2"], label="baseline") + + +def test_require_complete_session_metrics_rejects_zero_token_schema( + monkeypatch: pytest.MonkeyPatch, +): + monkeypatch.setenv("PROJECT_ID", "demo-project") + rows = [{"session_id": "s1", "event_count": 2, "total_tokens": 0}] + + with pytest.raises(RuntimeError, match="Token extraction produced zero"): + require_complete_session_metrics(rows, ["s1"], label="baseline") 
+ + +def test_pct_delta_marks_zero_baseline_growth_as_not_applicable(): + assert compare_runs._pct_delta(0, 0) == 0.0 + assert compare_runs._pct_delta(0, 5) is None + assert compare_runs._format_pct_delta(None) == "n/a" + assert compare_runs._format_pct_delta(-0.25) == "-25.0%" + + +def test_observations_use_configured_thresholds(): + summary = { + "avg_total_tokens": 1500, + "broad_lookup_session_rate": 0.5, + "avg_tool_calls": 3.0, + } + + observations = analyze_and_evolve._observations( + summary, + token_budget=1000, + min_broad_lookup_rate=0.5, + max_avg_tool_calls=2.0, + ) + + assert observations == [ + "Average total tokens are above the configured session budget.", + ( + "Most sessions used the broad basketball reference tool even though " + "each eval case has a narrow tool path." + ), + "Average tool calls are high for one-question single-turn tasks.", + ]
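The `test_pct_delta_marks_zero_baseline_growth_as_not_applicable` test pins down how percentage deltas should behave when the baseline is zero, but the `compare_runs.py` helpers themselves are not part of this excerpt. The sketch below is one implementation that would satisfy those four assertions. The names `pct_delta` and `format_pct_delta` are hypothetical stand-ins for the private `_pct_delta` and `_format_pct_delta`, and the argument order (`before`, `after`) and the unsigned formatting of positive values are assumptions, not confirmed by the diff.

```python
from __future__ import annotations


def pct_delta(before: float, after: float) -> float | None:
    """Return the relative change from before to after as a fraction.

    Hypothetical stand-in for compare_runs._pct_delta; only the
    zero-baseline behavior is confirmed by the tests.
    """
    if before == 0:
        # 0 -> 0 is "no change"; 0 -> nonzero has no finite percentage,
        # so it is reported as None rather than infinity.
        return 0.0 if after == 0 else None
    return (after - before) / before


def format_pct_delta(delta: float | None) -> str:
    """Render a fractional delta as a percentage, or "n/a" when undefined."""
    if delta is None:
        return "n/a"
    return f"{delta * 100:.1f}%"


if __name__ == "__main__":
    # A drop from 200 to 150 average tokens is a 25% reduction.
    print(format_pct_delta(pct_delta(200, 150)))
    # Growth from a zero baseline cannot be expressed as a percentage.
    print(format_pct_delta(pct_delta(0, 5)))
```

Returning `None` instead of `float("inf")` keeps the comparison report readable and forces callers to handle the undefined case explicitly.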