42 changes: 42 additions & 0 deletions examples/README.md
Original file line number Diff line number Diff line change
@@ -52,8 +52,50 @@ artifacts that demonstrate SDK capabilities.
| Directory | Description |
|-----------|-------------|
| [agent_improvement_cycle/](agent_improvement_cycle/) | LoopAgent-driven prompt improvement cycle |
| [skill_evolution_lab/](skill_evolution_lab/) | An agent that rewrites its own versioned `SKILL.md` from its conversation traces (no teacher model): flawed V0 → `evolve_skill()` → tool-first V1, golden-Q&A scored, with the anti-parroting rule and Skill Registry versioning. See the dedicated section below. |
| [decision_lineage_demo/](decision_lineage_demo/) | Decision-lineage property graph (issue #98): live ADK media-planner agent + BQ AA Plugin running across 6 campaign sessions → SDK `build_context_graph(use_ai_generate=True, include_decisions=True)` → six GQL blocks pasted into BigQuery Studio (one renders an interactive graph diagram, one is a portfolio roll-up) |

### Skill Evolution Lab — a self-improving agent

[`skill_evolution_lab/`](skill_evolution_lab/) is the runnable companion to the
blog post *"Your Agent Can Learn From Its Own Conversations."* One company-policy Q&A agent
reads its own conversation traces — successes and failures — and extracts a
structured, versioned `SKILL.md`. No teacher model, no managed optimizer.

- **The flaw with headroom.** V0 is a deliberately flawed skill (a few facts
baked in plus *"answer only from the above, else contact HR"*) that suppresses
a tool which already knows every answer. Only the skill is wrong — the model,
tools, and questions stay fixed across V0 and V1, so any delta is attributable
to the skill.
- **The engine, imported not copied.** `analyze_and_evolve.py` imports the SDK's
reusable [`scripts/skill_evolution.py`](../../scripts/skill_evolution.py) (the
same `evolve_skill()` the quality lab uses): it partitions scored
conversations, runs a fleet of parallel analysts, and consolidates recurring
rules into a new skill version.
- **Ground-truth scoring.** Quality is graded against a golden Q&A answer key
(`eval/eval_spec.json`) via [`scripts/quality_report.py`](../../scripts/quality_report.py)
(`--eval-spec`), rather than by an ungrounded "usefulness" estimate.
- **The anti-parroting rule.** The question sets include multi-turn cases in which
the user asserts a *wrong* correction. A good agent re-verifies with its tool and
holds the right figure instead of caving. The engine detects parroting
(`--tag-turns`) and learns a "re-verify, don't just agree" rule.
- **Skill Registry versioning.** The evolved skill is mirrored to the Gemini
Enterprise Agent Platform Skill Registry as a new immutable revision
(V0 = revision 1, V1 = revision 2); `reset.sh` reverts both the local copy and
the registry to V0.

```bash
cd skill_evolution_lab
./setup.sh YOUR_PROJECT_ID us-central1 # writes .env, resets to V0
./run_e2e_demo.sh # V0 -> evolve -> V1 -> compare, restore V0
```

A verified run (gemini-3-flash-preview, golden-grounded, held-out): **V0 23.8% →
V1 100%** overall; corrections (anti-parroting) **33.3% → 100%**; evolved skill
2.5KB. See the example's [README](skill_evolution_lab/README.md),
[DEMO_NARRATION](skill_evolution_lab/DEMO_NARRATION.md), and
[VERIFICATION](skill_evolution_lab/VERIFICATION.md).

## Reference Artifacts

| File | Description |
4 changes: 4 additions & 0 deletions examples/skill_evolution_lab/.gitignore
@@ -0,0 +1,4 @@
.env
runs/
__pycache__/
*.pyc
613 changes: 613 additions & 0 deletions examples/skill_evolution_lab/DEMO_NARRATION.md

Large diffs are not rendered by default.

112 changes: 112 additions & 0 deletions examples/skill_evolution_lab/README.md
@@ -0,0 +1,112 @@
# Skill Evolution Lab

An agent that **rewrites its own skill** from its conversation traces — no
teacher model, no managed optimizer. One company-policy Q&A agent starts with a
deliberately flawed `SKILL.md`, generates traffic, and the SDK's evolution
engine reads the failing trajectories and produces a small, tool-first V1 skill.
The skill is versioned in the **Gemini Enterprise Agent Platform Skill
Registry** (V0 = revision 1, V1 = revision 2).

This is the runnable companion to the blog post *"Your Agent Can Learn From Its
Own Conversations."* See [`DEMO_NARRATION.md`](DEMO_NARRATION.md) for the full
story and [`VERIFICATION.md`](VERIFICATION.md) for a recorded end-to-end run.

## What it shows

- **Self-improvement from traces.** The engine
(`scripts/skill_evolution.py`, imported here — not copied) partitions scored
conversations into successes/failures, runs a fleet of parallel analysts, and
consolidates recurring rules into a versioned `SKILL.md`.
- **Ground-truth scoring.** Quality is graded against a golden Q&A answer key
(`eval/eval_spec.json`) via the SDK's `quality_report.py`, rather than by an
ungrounded "usefulness" estimate.
- **The anti-parroting rule.** The question sets include multi-turn cases in which
the user asserts a *wrong* correction. A good agent re-verifies with its tool and
holds the right figure instead of caving. The engine detects parroting and learns
a "re-verify, don't just agree" rule. (See [DEMO_NARRATION.md](DEMO_NARRATION.md#corrections-are-not-answers-the-anti-parroting-rule).)
- **Skill Registry versioning.** The evolved skill is mirrored to the registry
as a new immutable revision; `reset.sh` reverts both the local copy and the
registry to V0.
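
The partition, analyze, and consolidate flow in the first bullet can be sketched roughly as below. This is an illustration only, not the SDK's `evolve_skill()`: the function names, the record shape, the `0.5` pass threshold, and the `min_support` cutoff are all hypothetical, and the analyst is a stub standing in for an LLM call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical record shape: {"question": str, "score": float, "used_tool": bool}
PASS_THRESHOLD = 0.5  # assumed cutoff, not taken from the SDK


def partition(conversations):
    """Split scored conversations into (successes, failures)."""
    successes = [c for c in conversations if c["score"] >= PASS_THRESHOLD]
    failures = [c for c in conversations if c["score"] < PASS_THRESHOLD]
    return successes, failures


def analyze(conversation):
    """Stub analyst: propose one rule for a failing conversation.

    A real analyst would be an LLM call that reads the whole trajectory.
    """
    if not conversation["used_tool"]:
        return "Use the policy tool before declining to answer."
    return "Quote the exact figure returned by the tool."


def consolidate(rules, min_support=2):
    """Keep only rules that were proposed at least `min_support` times."""
    counts = Counter(rules)
    return [rule for rule, n in counts.most_common() if n >= min_support]


def evolve_sketch(conversations):
    """Partition, analyze failures in parallel, then keep recurring rules."""
    _, failures = partition(conversations)
    with ThreadPoolExecutor(max_workers=4) as pool:
        proposed = list(pool.map(analyze, failures))
    return consolidate(proposed)
```

Requiring a rule to recur across several analysts is one simple way to filter out one-off observations; the real engine's consolidation step may work differently.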

## Layout

```text
skill_evolution_lab/
agent/
agent.py # genai agent factory: SKILL.md (instruction) + tools
tools.py # lookup_company_policy + get_current_date (the data)
skill_registry.py # REST client for the Skill Registry (create/update/...)
skills/
SKILL.md # working copy (starts as the flawed V0)
SKILL.v0.md # immutable flawed V0 baseline (used by reset)
eval/
eval_spec.json # scope + golden Q&A answer key (ground truth)
questions_evolve.json # questions the skill evolves from
questions_test.json # held-out questions for the V0->V1 number
questions_corrections.json # anti-parroting cases (teach)
questions_corrections_heldout.json # anti-parroting cases (held-out)
run_agent.py # runs questions through the agent -> conversations JSON
analyze_and_evolve.py # scored report -> evolve_skill() -> V1 (+ registry)
compare_runs.py # V0 vs V1 golden-grounded correctness + parroting
registry_cli.py # create/update/delete/inspect registry revisions
run_e2e_demo.sh # the whole cycle, one command
setup.sh / reset.sh # write .env / revert to V0 (local + registry)
sample_run/ # a committed end-to-end run (scored reports, evolved
# skill, RESULT) + README explaining each artifact
```

A complete recorded run lives in [`sample_run/`](sample_run/) — the scored V0/V1
reports, the evolved skill, and `RESULT.md` — so you can read the exact inputs and
outputs (and what each file means) without running anything. Live runs write to
`runs/<timestamp>_<model>/` (git-ignored).

## Prerequisites

- A GCP project with Vertex AI enabled, and the `roles/aiplatform.user` role on
the account you authenticate with.
- `gcloud auth application-default login`.
- [`uv`](https://github.com/astral-sh/uv) (used to run with the repo's deps).
- Gemini 3.x models are served from the Vertex `global` endpoint (handled
automatically); the Skill Registry is regional (`us-central1` by default).

## Run it

```bash
cd examples/skill_evolution_lab
./setup.sh YOUR_PROJECT_ID us-central1 # writes .env, resets to V0
./run_e2e_demo.sh # V0 -> evolve -> V1 -> compare
```

The run deploys the flawed V0, generates and scores traffic on the evolve and
held-out test sets, evolves a tool-first V1 skill, re-scores the held-out set,
prints the V0→V1 comparison, and restores V0. Artifacts land in
`runs/<timestamp>_<model>/` (git-ignored), with `RESULT.md` as the summary.

### With the Skill Registry

```bash
WITH_REGISTRY=1 SKILL_ID=skill-lab-policy ./setup.sh YOUR_PROJECT_ID us-central1
WITH_REGISTRY=1 SKILL_ID=skill-lab-policy ./run_e2e_demo.sh
WITH_REGISTRY=1 SKILL_ID=skill-lab-policy ./reset.sh # revert local + registry
```

Inspect revisions any time: `uv run python registry_cli.py revisions
--skill-id skill-lab-policy`.

### Model overrides

```bash
AGENT_MODEL=gemini-3.1-pro-preview ANALYST_MODEL=gemini-3.1-pro-preview ./run_e2e_demo.sh
```

`AGENT_MODEL` is the agent under test; `ANALYST_MODEL` runs the evolution
analysts/consolidator; `JUDGE_MODEL` (default `gemini-2.5-flash`, regional)
scores. The model, tools, and questions are fixed across V0 and V1 — only the
skill changes — so any delta is attributable to the skill.
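
The three role variables can be thought of as a small lookup with per-role fallbacks. The sketch below is an assumption about how such resolution might look, not the lab's actual code: `resolve_models` is a hypothetical helper, and only the `JUDGE_MODEL` default is stated in this README. The other two fallbacks are simply the models from the recorded run in `VERIFICATION.md`.

```python
import os

# Only the JUDGE_MODEL default is documented above. The AGENT_MODEL and
# ANALYST_MODEL fallbacks are placeholders (the models used in the recorded
# run), not confirmed defaults of the scripts.
ASSUMED_DEFAULTS = {
    "AGENT_MODEL": "gemini-3-flash-preview",
    "ANALYST_MODEL": "gemini-3.1-pro-preview",
    "JUDGE_MODEL": "gemini-2.5-flash",
}


def resolve_models(env=None):
    """Return the model for each role, preferring environment overrides."""
    env = os.environ if env is None else env
    return {
        role: env.get(role) or default
        for role, default in ASSUMED_DEFAULTS.items()
    }
```

Calling `resolve_models({"AGENT_MODEL": "gemini-3.1-pro-preview"})` would override only the agent under test and leave the analyst and judge on their fallbacks.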

## How it relates to the research

The engine follows [Trace2Skill](https://arxiv.org/abs/2603.25158) (parallel
analysts + inductive consolidation, held-out validation) and
[AutoSkill](https://arxiv.org/abs/2603.01145) (versioned skill evolution as a
semantic merge). It is the same `evolve_skill()` the knowledge-supervisor
quality lab imports from this SDK.
108 changes: 108 additions & 0 deletions examples/skill_evolution_lab/VERIFICATION.md
@@ -0,0 +1,108 @@
# Verification — recorded end-to-end run

A full `./run_e2e_demo.sh` run of this example, captured so the result is
reproducible and the numbers in [`DEMO_NARRATION.md`](DEMO_NARRATION.md) are
backed by an actual run (not aspirational).

## Configuration

| Setting | Value |
| --- | --- |
| Agent under test | `gemini-3-flash-preview` (Vertex `global`) |
| Evolution analysts/consolidator | `gemini-3.1-pro-preview` (Vertex `global`) |
| Judge (scoring) | `gemini-2.5-flash` (`us-central1`) |
| Ground truth | `eval/eval_spec.json` golden Q&A (matched at cosine ≥ 0.92) |
| Evolve set | `questions_evolve.json` (28) + `questions_corrections.json` (5) |
| Held-out test set | `questions_test.json` (18) + `questions_corrections_heldout.json` (3) |
| Date | 2026-06-05 |

The agent model, tools, and questions are identical for V0 and V1 — **only the
skill file changes** — so the delta is attributable to the skill.
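
The "matched at cosine ≥ 0.92" entry in the configuration table can be illustrated with a small threshold matcher. This is a sketch under assumptions: it presumes a question embedding is compared against embeddings of the golden entries, the embedding model is left out entirely (vectors are passed in), and `match_golden` and the entry ids are hypothetical names rather than anything from `quality_report.py`.

```python
import math

MATCH_THRESHOLD = 0.92  # the threshold quoted in the configuration table


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_golden(query_vec, golden):
    """Return the id of the most similar golden entry, or None below threshold.

    `golden` maps a hypothetical entry id to its embedding vector.
    """
    best_id, best_sim = None, -1.0
    for entry_id, vec in golden.items():
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_id, best_sim = entry_id, sim
    return best_id if best_sim >= MATCH_THRESHOLD else None
```

A question that matches no golden entry above the threshold returns `None`, which is one way the set of golden-matched questions could differ between runs.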

## Result (held-out set, golden-grounded correctness)

| Metric | V0 (flawed) | V1 (evolved) | Delta |
| --- | --- | --- | --- |
| Overall | 23.8% (5/21) | 100.0% (21/21) | +76.2pp |
| Single-turn | 22.2% (4/18) | 100.0% (18/18) | +77.8pp |
| Corrections (anti-parrot) | 33.3% (1/3) | 100.0% (3/3) | +66.7pp |
| Tool-grounded answers | 6/21 | 18/21 | — |
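
The percentages and percentage-point deltas in the table follow directly from the raw counts. The snippet below is a quick arithmetic check, not part of the lab; `pct` and `deltas` are illustrative helper names.

```python
def pct(correct, total):
    """Percentage correct, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


# (correct, total) for V0 and V1, copied from the result table.
RESULTS = {
    "Overall": ((5, 21), (21, 21)),
    "Single-turn": ((4, 18), (18, 18)),
    "Corrections": ((1, 3), (3, 3)),
}


def deltas(results):
    """Map each metric to (V0 %, V1 %, delta in percentage points)."""
    out = {}
    for name, (v0, v1) in results.items():
        p0, p1 = pct(*v0), pct(*v1)
        out[name] = (p0, p1, round(p1 - p0, 1))
    return out
```

The single-turn and correction counts also add up to the overall row: 4 + 1 = 5 of 18 + 3 = 21 for V0.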

Parroted sub-trajectories: V0 = 0, V1 = 0. In this run the flawed V0 *declined*
on the correction cases ("I don't have that, contact HR") rather than caving to
the user's wrong number, so there was no parroting to learn from. The engine
instead learned the tool-first rule, which subsumes the correction cases. The
explicit `PARROTING` detection/learning machinery (in `quality_report.py` and
`skill_evolution.py`) remains as a safety net against the opposite failure:
learning to agree with a confident, wrong user.
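
The three outcomes on a correction case (holding the verified figure, parroting the user's wrong figure, or declining) can be told apart with a simple check. This is a hypothetical heuristic for illustration, not the `PARROTING` detection in `quality_report.py`: `classify_correction` is an invented name, the user's claimed value of 25 in the usage note is made up, and plain substring matching would need tightening for real answers.

```python
def classify_correction(answer, golden_value, user_claim):
    """Label the agent's reply to a wrong user 'correction'.

    Returns "held" if the reply contains the golden value (even when it also
    quotes the user's claim to refute it), "parroted" if it contains only the
    user's wrong claim, and "declined" if it contains neither.
    """
    if golden_value in answer:
        return "held"
    if user_claim in answer:
        return "parroted"
    return "declined"
```

With the 20-day PTO figure as the golden value and a user insisting on 25, "PTO is 20 days, not 25." is labelled `held`, "You're right, it's 25 days." is labelled `parroted`, and the V0-style "I don't have that, contact HR." is labelled `declined`.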

## Evolution internals (from the run log)

```text
Trajectories: 6 successes, 27 failures
Collected 29 patches (19 passed the quality gate)
Generating 3 candidate(s)...
Selected median-size candidate (2519 chars)
```

No `score_fn` was used; the engine returns the median-size viable candidate and
the held-out re-score is the proof. Run with a `score_fn` for best-of-N
selection.
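
The two selection modes described above can be sketched as follows. This is a minimal illustration that assumes candidates are plain skill strings; `select_candidate` is a hypothetical name, and the real engine's handling of ties and even candidate counts may differ (this sketch takes the upper median).

```python
def select_candidate(candidates, score_fn=None):
    """Pick one candidate skill text.

    With a `score_fn`, return the highest-scoring candidate (best-of-N).
    Without one, return the median-size candidate by character count.
    """
    if not candidates:
        raise ValueError("no viable candidates")
    if score_fn is not None:
        return max(candidates, key=score_fn)
    by_size = sorted(candidates, key=len)
    return by_size[len(by_size) // 2]
```

Picking the median by size avoids both the tersest and the most bloated rewrite when there is no score to rank them by.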

## The evolved V1 skill (675B → 2519B)

The engine rewrote the flawed "answer only from the baked summary, else contact
HR" prompt into a small, legible, tool-first skill. Notably it learned a
**"Premature HR Deflection"** anti-pattern and a tool-first fallback rule:

```markdown
---
name: company-policy
description: Answers employee questions about company policies.
metadata:
  version: "1"
  author: skill-evolution
  evolvable: true
  evolved_from: "0"
---

You are a helpful company information assistant.

## Knowledge Base
You have the following knowledge about company policies:
- **PTO:** 20 days per year, accrued monthly. Up to 5 unused days roll over. ...
- **Sick leave:** 10 days per year, does not roll over. (For specific details ...
use your tools to search the policy database).
- **Remote work:** Up to 3 days per week with manager approval. ...
- **Benefits:** ... For exact monetary limits, match percentages, or session
limits, use your tools to search or advise the user to check the Benefits Handbook.
- **Expenses and Travel:** ... There is a daily meal reimbursement limit on
business travel (use tools to find the exact amount).
- **Flex time / Work hours:** Employees may adjust their daily start and end times ...

## Instructions
- **Tool Use & Fallback:** If a user asks about a company policy or detail not
explicitly listed in your provided knowledge above ..., you MUST first use your
available tools to search for the information. Only tell the user you do not
have the information ... if your tool search yields no relevant results.
- **Policy Evaluation:** When a user asks if a specific amount or scenario is
allowed ..., explicitly compare their request to the policy limits ...

## Anti-Patterns
- **Premature HR Deflection:** Do not immediately tell the user you lack
information or direct them to HR for policy topics not listed in your static
knowledge. You must always attempt to use your available tools first.
```

## Reproduce

```bash
cd examples/skill_evolution_lab
./setup.sh YOUR_PROJECT_ID us-central1
./run_e2e_demo.sh
```

Exact numbers vary run-to-run (LLM nondeterminism, golden-match set), but the
direction is stable: the flawed V0 defers/declines on topics it has a tool for,
and the evolved V1 uses the tool and answers correctly, including when the user
asserts a wrong "correction".
15 changes: 15 additions & 0 deletions examples/skill_evolution_lab/agent/__init__.py
@@ -0,0 +1,15 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Self-contained policy/benefits agent for the skill-evolution lab."""
74 changes: 74 additions & 0 deletions examples/skill_evolution_lab/agent/agent.py
@@ -0,0 +1,74 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Agent factory for the skill-evolution lab.

The agent is deliberately minimal so the demo is legible: a single Gemini
model whose *system instruction is the SKILL.md body* plus two Python tools
(automatic function calling). Swapping the skill file is the only thing that
changes between V0 and V1 -- the model, tools, and questions stay fixed, so any
quality delta is attributable to the skill.

Gemini 3.x models are served from the Vertex AI ``global`` endpoint; 2.5
models are regional. ``make_client`` routes automatically based on the model
name.
"""

from __future__ import annotations

import os
import re

from google import genai
from google.genai import types

from .tools import AGENT_TOOLS

_FRONTMATTER_RE = re.compile(r"^---\n.*?\n---\n", re.DOTALL)


def skill_instruction(skill_text: str) -> str:
    """Return the SKILL.md body (YAML frontmatter stripped) for use as the
    system instruction."""
    return _FRONTMATTER_RE.sub("", skill_text, count=1).strip()


def model_location(model: str) -> str:
    """Vertex location for a model: 'global' for Gemini 3.x, else regional."""
    if model.startswith("gemini-3"):
        return "global"
    return os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")


def make_client(model: str, project: str | None = None) -> genai.Client:
    """Build a Vertex AI google-genai client routed to the right endpoint."""
    project = (
        project or os.getenv("GOOGLE_CLOUD_PROJECT") or os.getenv("PROJECT_ID")
    )
    return genai.Client(
        vertexai=True, project=project, location=model_location(model)
    )


def build_config(skill_text: str) -> types.GenerateContentConfig:
    """Build the generation config: skill as system instruction + tools.

    Temperature 0 keeps the demo as repeatable as possible (outputs can still
    vary slightly between runs). Automatic function calling is left enabled
    (the default) so the SDK executes the Python tools and loops.
    """
    return types.GenerateContentConfig(
        system_instruction=skill_instruction(skill_text),
        tools=list(AGENT_TOOLS),
        temperature=0.0,
    )