GoogleCloudPlatform · evekhm · Jun 8, 2026
diff --git a/examples/README.md b/examples/README.md
@@ -52,8 +52,50 @@ artifacts that demonstrate SDK capabilities.
 | Directory | Description |
 |-----------|-------------|
 | [agent_improvement_cycle/](agent_improvement_cycle/) | LoopAgent-driven prompt improvement cycle |
+| [skill_evolution_lab/](skill_evolution_lab/) | An agent that rewrites its own versioned `SKILL.md` from its conversation traces (no teacher model): flawed V0 → `evolve_skill()` → tool-first V1, golden-Q&A scored, with the anti-parroting rule and Skill Registry versioning. See the dedicated section below. |
 | [decision_lineage_demo/](decision_lineage_demo/) | Decision-lineage property graph (issue #98): live ADK media-planner agent + BQ AA Plugin running across 6 campaign sessions → SDK `build_context_graph(use_ai_generate=True, include_decisions=True)` → six GQL blocks pasted into BigQuery Studio (one renders an interactive graph diagram, one is a portfolio roll-up) |
 
+### Skill Evolution Lab — a self-improving agent
+
+[`skill_evolution_lab/`](skill_evolution_lab/) is the runnable companion to the
+blog post *"Your Agent Can Learn From Its Own Conversations."* One company-policy Q&A agent
+reads its own conversation traces — successes and failures — and extracts a
+structured, versioned `SKILL.md`. No teacher model, no managed optimizer.
+
+- **The flaw with headroom.** V0 is a deliberately flawed skill (a few facts
+  baked in plus *"answer only from the above, else contact HR"*) that suppresses
+  a tool which already knows every answer. Only the skill is wrong — the model,
+  tools, and questions stay fixed across V0 and V1, so any delta is attributable
+  to the skill.
+- **The engine, imported not copied.** `analyze_and_evolve.py` imports the SDK's
+  reusable [`scripts/skill_evolution.py`](../scripts/skill_evolution.py) (the
+  same `evolve_skill()` the quality lab uses): it partitions scored
+  conversations, runs a fleet of parallel analysts, and consolidates recurring
+  rules into a new skill version.
+- **Ground-truth scoring.** Quality is graded against a golden Q&A answer key
+  (`eval/eval_spec.json`) via [`scripts/quality_report.py`](../scripts/quality_report.py)
+  (`--eval-spec`), not a no-ground-truth "usefulness" guess.
+- **The anti-parroting rule.** Some multi-turn cases have the user assert a *wrong*
+  correction; a good agent re-verifies with its tool and holds the right figure
+  instead of caving. The engine detects parroting (`--tag-turns`) and learns a
+  "re-verify, don't just agree" rule.
+- **Skill Registry versioning.** The evolved skill is mirrored to the Gemini
+  Enterprise Agent Platform Skill Registry as a new immutable revision
+  (V0 = revision 1, V1 = revision 2); `reset.sh` reverts both the local copy and
+  the registry to V0.
+
+```bash
+cd skill_evolution_lab
+./setup.sh YOUR_PROJECT_ID us-central1   # writes .env, resets to V0
+./run_e2e_demo.sh                        # V0 -> evolve -> V1 -> compare, restore V0
+```
+
+A verified run (gemini-3-flash-preview, golden-grounded, held-out): **V0 23.8% →
+V1 100%** overall; corrections (anti-parroting) **33.3% → 100%**; evolved skill
+2.5KB. See the example's [README](skill_evolution_lab/README.md),
+[DEMO_NARRATION](skill_evolution_lab/DEMO_NARRATION.md), and
+[VERIFICATION](skill_evolution_lab/VERIFICATION.md).
+
 ## Reference Artifacts
 
 | File | Description |

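Note (outside the patch): the anti-parroting bullet above describes a three-way outcome when a user asserts a wrong correction. The diff does not show how `--tag-turns` detects parroting, so the sketch below is only a toy stand-in: the function name, the labels, and the substring matching are all assumptions for illustration, not the SDK's logic. The 20-day PTO figure matches the skill text quoted in VERIFICATION.md; the user's "25 days" claim is made up.

```python
def classify_correction(final_answer: str, golden: str, user_claim: str) -> str:
  """Label the agent's reply to a wrong user correction (toy heuristic)."""
  answer = final_answer.lower()
  if golden.lower() in answer:
    # The agent restated the golden figure, so it held its ground even if
    # it also quoted the user's wrong figure while rejecting it.
    return "held"
  if user_claim.lower() in answer:
    # Only the user's wrong figure appears: the agent caved.
    return "parroted"
  # Neither figure appears, e.g. the agent deflected instead of answering.
  return "other"


# Hypothetical turn: the user insists PTO is 25 days; the golden answer is 20.
replies = [
    "I re-checked the policy tool: PTO is 20 days per year, not 25 days.",
    "You're right, sorry about that. PTO is 25 days per year.",
    "I don't have that information, please contact HR.",
]
labels = [classify_correction(r, "20 days", "25 days") for r in replies]
print(labels)
```

A real detector would need semantic matching rather than substrings (for example, "20 days" is also a substring of "120 days"), which is presumably why the lab grades against a golden answer key with a judge model.
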
diff --git a/examples/skill_evolution_lab/.gitignore b/examples/skill_evolution_lab/.gitignore
@@ -0,0 +1,4 @@
+.env
+runs/
+__pycache__/
+*.pyc
diff --git a/examples/skill_evolution_lab/DEMO_NARRATION.md b/examples/skill_evolution_lab/DEMO_NARRATION.md
diff --git a/examples/skill_evolution_lab/README.md b/examples/skill_evolution_lab/README.md
@@ -0,0 +1,112 @@
+# Skill Evolution Lab
+
+An agent that **rewrites its own skill** from its conversation traces — no
+teacher model, no managed optimizer. One company-policy Q&A agent starts with a
+deliberately flawed `SKILL.md`, generates traffic, and the SDK's evolution
+engine reads the failing trajectories and produces a small, tool-first V1 skill.
+The skill is versioned in the **Gemini Enterprise Agent Platform Skill
+Registry** (V0 = revision 1, V1 = revision 2).
+
+This is the runnable companion to the blog post *"Your Agent Can Learn From Its
+Own Conversations."* See [`DEMO_NARRATION.md`](DEMO_NARRATION.md) for the full
+story and [`VERIFICATION.md`](VERIFICATION.md) for a recorded end-to-end run.
+
+## What it shows
+
+- **Self-improvement from traces.** The engine
+  (`scripts/skill_evolution.py`, imported here — not copied) partitions scored
+  conversations into successes/failures, runs a fleet of parallel analysts, and
+  consolidates recurring rules into a versioned `SKILL.md`.
+- **Ground-truth scoring.** Quality is graded against a golden Q&A answer key
+  (`eval/eval_spec.json`) via the SDK's `quality_report.py`, not a
+  no-ground-truth "usefulness" guess.
+- **The anti-parroting rule.** Some multi-turn cases have the user assert a
+  *wrong* correction. A good agent re-verifies with its tool and holds the right figure
+  instead of caving. The engine detects parroting and learns a "re-verify, don't
+  just agree" rule. (See [DEMO_NARRATION.md](DEMO_NARRATION.md#corrections-are-not-answers-the-anti-parroting-rule).)
+- **Skill Registry versioning.** The evolved skill is mirrored to the registry
+  as a new immutable revision; `reset.sh` reverts both the local copy and the
+  registry to V0.
+
+## Layout
+
+```text
+skill_evolution_lab/
+  agent/
+    agent.py           # genai agent factory: SKILL.md (instruction) + tools
+    tools.py           # lookup_company_policy + get_current_date (the data)
+    skill_registry.py  # REST client for the Skill Registry (create/update/...)
+  skills/
+    SKILL.md           # working copy (starts as the flawed V0)
+    SKILL.v0.md        # immutable flawed V0 baseline (used by reset)
+  eval/
+    eval_spec.json                  # scope + golden Q&A answer key (ground truth)
+    questions_evolve.json           # questions the skill evolves from
+    questions_test.json             # held-out questions for the V0->V1 number
+    questions_corrections.json      # anti-parroting cases (teach)
+    questions_corrections_heldout.json  # anti-parroting cases (held-out)
+  run_agent.py          # runs questions through the agent -> conversations JSON
+  analyze_and_evolve.py # scored report -> evolve_skill() -> V1 (+ registry)
+  compare_runs.py       # V0 vs V1 golden-grounded correctness + parroting
+  registry_cli.py       # create/update/delete/inspect registry revisions
+  run_e2e_demo.sh       # the whole cycle, one command
+  setup.sh / reset.sh   # write .env / revert to V0 (local + registry)
+  sample_run/           # a committed end-to-end run (scored reports, evolved
+                        #   skill, RESULT) + README explaining each artifact
+```
+
+A complete recorded run lives in [`sample_run/`](sample_run/) — the scored V0/V1
+reports, the evolved skill, and `RESULT.md` — so you can read the exact inputs and
+outputs (and what each file means) without running anything. Live runs write to
+`runs/<timestamp>_<model>/` (git-ignored).
+
+## Prerequisites
+
+- A GCP project with Vertex AI enabled, and the `roles/aiplatform.user` role on it.
+- Application Default Credentials (`gcloud auth application-default login`).
+- [`uv`](https://github.com/astral-sh/uv) (used to run with the repo's deps).
+- Gemini 3.x models are served from the Vertex `global` endpoint (handled
+  automatically); the Skill Registry is regional (`us-central1` by default).
+
+## Run it
+
+```bash
+cd examples/skill_evolution_lab
+./setup.sh YOUR_PROJECT_ID us-central1      # writes .env, resets to V0
+./run_e2e_demo.sh                           # V0 -> evolve -> V1 -> compare
+```
+
+The run deploys the flawed V0, generates and scores traffic on the evolve and
+held-out test sets, evolves a tool-first V1 skill, re-scores the held-out set,
+prints the V0→V1 comparison, and restores V0. Artifacts land in
+`runs/<timestamp>_<model>/` (git-ignored), with `RESULT.md` as the summary.
+
+### With the Skill Registry
+
+```bash
+WITH_REGISTRY=1 SKILL_ID=skill-lab-policy ./setup.sh YOUR_PROJECT_ID us-central1
+WITH_REGISTRY=1 SKILL_ID=skill-lab-policy ./run_e2e_demo.sh
+WITH_REGISTRY=1 SKILL_ID=skill-lab-policy ./reset.sh   # revert local + registry
+```
+
+Inspect revisions at any time:
+`uv run python registry_cli.py revisions --skill-id skill-lab-policy`.
+
+### Model overrides
+
+```bash
+AGENT_MODEL=gemini-3.1-pro-preview ANALYST_MODEL=gemini-3.1-pro-preview ./run_e2e_demo.sh
+```
+
+`AGENT_MODEL` is the agent under test; `ANALYST_MODEL` runs the evolution
+analysts/consolidator; `JUDGE_MODEL` (default `gemini-2.5-flash`, regional)
+scores. The model, tools, and questions are fixed across V0 and V1 — only the
+skill changes — so any delta is attributable to the skill.
+
+## How it relates to the research
+
+The engine follows [Trace2Skill](https://arxiv.org/abs/2603.25158) (parallel
+analysts + inductive consolidation, held-out validation) and
+[AutoSkill](https://arxiv.org/abs/2603.01145) (versioned skill evolution as a
+semantic merge). It is the same `evolve_skill()` the knowledge-supervisor
+quality lab imports from this SDK.
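
Note (outside the patch): the "What it shows" section says the engine partitions scored conversations into successes and failures, then consolidates recurring rules from parallel analysts. The sketch below illustrates only that control flow under loud assumptions: `ScoredConversation`, `partition`, `recurring_rules`, and `min_support` are hypothetical names, and exact string equality stands in for the LLM-based analysts and consolidator. It is not the `evolve_skill()` implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ScoredConversation:
  """Hypothetical record: one conversation plus its golden-graded verdict."""
  question: str
  correct: bool


def partition(conversations):
  """Split scored conversations into (successes, failures)."""
  successes = [c for c in conversations if c.correct]
  failures = [c for c in conversations if not c.correct]
  return successes, failures


def recurring_rules(analyst_patches, min_support=2):
  """Keep rules proposed by at least `min_support` independent analysts."""
  support = Counter()
  for patch in analyst_patches:
    # Count each rule once per analyst so a single verbose analyst cannot
    # make a rule look recurring on its own.
    support.update(set(patch))
  return sorted(rule for rule, n in support.items() if n >= min_support)


conversations = [
    ScoredConversation("How many PTO days do I get?", True),
    ScoredConversation("What is the daily meal limit?", False),
    ScoredConversation("Is there a 401k match?", False),
]
successes, failures = partition(conversations)

# Each inner list is one (made-up) analyst's proposed rules.
patches = [
    ["use the lookup tool before deflecting to HR", "cite the policy name"],
    ["use the lookup tool before deflecting to HR"],
    ["compare the request to the policy limit"],
]
rules = recurring_rules(patches)
print(len(successes), len(failures), rules)
```

Requiring support from more than one analyst is one simple way to keep one-off observations out of the consolidated skill; the real engine's quality gate may work quite differently.
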
diff --git a/examples/skill_evolution_lab/VERIFICATION.md b/examples/skill_evolution_lab/VERIFICATION.md
@@ -0,0 +1,108 @@
+# Verification — recorded end-to-end run
+
+This is a full `./run_e2e_demo.sh` run of this example, captured so the result
+can be re-checked and the numbers in [`DEMO_NARRATION.md`](DEMO_NARRATION.md)
+are backed by an actual run rather than aspirational.
+
+## Configuration
+
+| Setting | Value |
+| --- | --- |
+| Agent under test | `gemini-3-flash-preview` (Vertex `global`) |
+| Evolution analysts/consolidator | `gemini-3.1-pro-preview` (Vertex `global`) |
+| Judge (scoring) | `gemini-2.5-flash` (`us-central1`) |
+| Ground truth | `eval/eval_spec.json` golden Q&A (matched at cosine ≥ 0.92) |
+| Evolve set | `questions_evolve.json` (28) + `questions_corrections.json` (5) |
+| Held-out test set | `questions_test.json` (18) + `questions_corrections_heldout.json` (3) |
+| Date | 2026-06-05 |
+
+The agent model, tools, and questions are identical for V0 and V1 — **only the
+skill file changes** — so the delta is attributable to the skill.
+
+## Result (held-out set, golden-grounded correctness)
+
+| Metric | V0 (flawed) | V1 (evolved) | Delta |
+| --- | --- | --- | --- |
+| Overall | 23.8% (5/21) | 100.0% (21/21) | +76.2pp |
+| Single-turn | 22.2% (4/18) | 100.0% (18/18) | +77.8pp |
+| Corrections (anti-parrot) | 33.3% (1/3) | 100.0% (3/3) | +66.7pp |
+| Tool-grounded answers | 6/21 | 18/21 | — |
+
+Parroted sub-trajectories: V0 = 0, V1 = 0. In this run the flawed V0 *declined*
+on the correction cases ("I don't have that, contact HR") rather than caving to
+the user's wrong number. The engine therefore learned the tool-first rule,
+which also covers the correction cases. The explicit `PARROTING`
+detection/learning machinery (in `quality_report.py` and `skill_evolution.py`)
+is the safety net against the opposite failure: learning to agree with a
+confident, wrong user.
+
+## Evolution internals (from the run log)
+
+```text
+Trajectories: 6 successes, 27 failures
+Collected 29 patches (19 passed the quality gate)
+Generating 3 candidate(s)...
+Selected median-size candidate (2519 chars)
+```
+
+No `score_fn` was used, so the engine returned the median-size viable candidate;
+the held-out re-score above is the evidence that it works. Pass a `score_fn` for
+best-of-N selection.
+
+## The evolved V1 skill (675B → 2519B)
+
+The engine rewrote the flawed "answer only from the baked summary, else contact
+HR" prompt into a small, legible, tool-first skill. Notably it learned a
+**"Premature HR Deflection"** anti-pattern and a tool-first fallback rule:
+
+```markdown
+---
+name: company-policy
+description: Answers employee questions about company policies.
+metadata:
+  version: "1"
+  author: skill-evolution
+  evolvable: true
+  evolved_from: "0"
+---
+
+You are a helpful company information assistant.
+
+## Knowledge Base
+You have the following knowledge about company policies:
+- **PTO:** 20 days per year, accrued monthly. Up to 5 unused days roll over. ...
+- **Sick leave:** 10 days per year, does not roll over. (For specific details ...
+  use your tools to search the policy database).
+- **Remote work:** Up to 3 days per week with manager approval. ...
+- **Benefits:** ... For exact monetary limits, match percentages, or session
+  limits, use your tools to search or advise the user to check the Benefits Handbook.
+- **Expenses and Travel:** ... There is a daily meal reimbursement limit on
+  business travel (use tools to find the exact amount).
+- **Flex time / Work hours:** Employees may adjust their daily start and end times ...
+
+## Instructions
+- **Tool Use & Fallback:** If a user asks about a company policy or detail not
+  explicitly listed in your provided knowledge above ..., you MUST first use your
+  available tools to search for the information. Only tell the user you do not
+  have the information ... if your tool search yields no relevant results.
+- **Policy Evaluation:** When a user asks if a specific amount or scenario is
+  allowed ..., explicitly compare their request to the policy limits ...
+
+## Anti-Patterns
+- **Premature HR Deflection:** Do not immediately tell the user you lack
+  information or direct them to HR for policy topics not listed in your static
+  knowledge. You must always attempt to use your available tools first.
+```
+
+## Reproduce
+
+```bash
+cd examples/skill_evolution_lab
+./setup.sh YOUR_PROJECT_ID us-central1
+./run_e2e_demo.sh
+```
+
+Exact numbers vary run-to-run (LLM nondeterminism, golden-match set), but the
+direction is stable: the flawed V0 defers/declines on topics it has a tool for,
+and the evolved V1 uses the tool and answers correctly, including when the user
+asserts a wrong "correction".
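
Note (outside the patch): the percentages and deltas in the held-out results table can be re-derived from the raw counts it reports. The snippet below does only that arithmetic; the helper name `pct` and the `RESULTS` layout are illustrative and do not come from `compare_runs.py`.

```python
def pct(correct: int, total: int) -> float:
  """Correctness as a percentage, rounded to one decimal like the table."""
  return round(100.0 * correct / total, 1)


# ((V0 correct, V0 total), (V1 correct, V1 total)), copied from the table.
RESULTS = {
    "Overall": ((5, 21), (21, 21)),
    "Single-turn": ((4, 18), (18, 18)),
    "Corrections (anti-parrot)": ((1, 3), (3, 3)),
}

rows = {}
for metric, ((v0_ok, v0_n), (v1_ok, v1_n)) in RESULTS.items():
  v0, v1 = pct(v0_ok, v0_n), pct(v1_ok, v1_n)
  # Delta in percentage points, computed from the rounded percentages.
  rows[metric] = (v0, v1, round(v1 - v0, 1))
  print(f"{metric}: {v0}% -> {v1}% ({rows[metric][2]:+}pp)")

# The two subsets should add up to the overall row.
subset_v0_correct = RESULTS["Single-turn"][0][0] + RESULTS["Corrections (anti-parrot)"][0][0]
subset_total = RESULTS["Single-turn"][0][1] + RESULTS["Corrections (anti-parrot)"][0][1]
```

The subset counts (4 + 1 correct out of 18 + 3) add up to the overall 5/21, so the table is internally consistent.
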
diff --git a/examples/skill_evolution_lab/agent/__init__.py b/examples/skill_evolution_lab/agent/__init__.py
@@ -0,0 +1,15 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Self-contained policy/benefits agent for the skill-evolution lab."""
diff --git a/examples/skill_evolution_lab/agent/agent.py b/examples/skill_evolution_lab/agent/agent.py
@@ -0,0 +1,74 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Agent factory for the skill-evolution lab.
+
+The agent is deliberately minimal so the demo is legible: a single Gemini
+model whose *system instruction is the SKILL.md body* plus two Python tools
+(automatic function calling). Swapping the skill file is the only thing that
+changes between V0 and V1 -- the model, tools, and questions stay fixed, so any
+quality delta is attributable to the skill.
+
+Gemini 3.x models are served from the Vertex AI ``global`` endpoint; 2.5
+models are regional. ``make_client`` routes automatically based on the model
+name.
+"""
+
+from __future__ import annotations
+
+import os
+import re
+
+from google import genai
+from google.genai import types
+
+from .tools import AGENT_TOOLS
+
+_FRONTMATTER_RE = re.compile(r"^---\n.*?\n---\n", re.DOTALL)
+
+
+def skill_instruction(skill_text: str) -> str:
+  """Return the SKILL.md body (YAML frontmatter stripped) for use as the
+  system instruction."""
+  return _FRONTMATTER_RE.sub("", skill_text, count=1).strip()
+
+
+def model_location(model: str) -> str:
+  """Vertex location for a model: 'global' for Gemini 3.x, else regional."""
+  if model.startswith("gemini-3"):
+    return "global"
+  return os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")
+
+
+def make_client(model: str, project: str | None = None) -> genai.Client:
+  """Build a Vertex AI google-genai client routed to the right endpoint."""
+  project = (
+      project or os.getenv("GOOGLE_CLOUD_PROJECT") or os.getenv("PROJECT_ID")
+  )
+  return genai.Client(
+      vertexai=True, project=project, location=model_location(model)
+  )
+
+
+def build_config(skill_text: str) -> types.GenerateContentConfig:
+  """Build the generation config: skill as system instruction + tools.
+
+  Temperature 0 keeps the demo near-deterministic. Automatic function calling is
+  left enabled (the default) so the SDK executes the Python tools and loops.
+  """
+  return types.GenerateContentConfig(
+      system_instruction=skill_instruction(skill_text),
+      tools=list(AGENT_TOOLS),
+      temperature=0.0,
+  )
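
Note (outside the patch): the two pure helpers in `agent.py` can be sanity-checked without Vertex credentials. The sketch below copies `skill_instruction` and `model_location` from the diff so it runs without `google-genai` installed; the sample skill text is made up, shaped like the V1 skill shown in VERIFICATION.md.

```python
import os
import re

# Copied from agent.py so this snippet is self-contained.
_FRONTMATTER_RE = re.compile(r"^---\n.*?\n---\n", re.DOTALL)


def skill_instruction(skill_text: str) -> str:
  """Return the SKILL.md body with the leading YAML frontmatter removed."""
  return _FRONTMATTER_RE.sub("", skill_text, count=1).strip()


def model_location(model: str) -> str:
  """'global' for Gemini 3.x model names, else the configured region."""
  if model.startswith("gemini-3"):
    return "global"
  return os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")


# Hypothetical skill file used only for this check.
SAMPLE_SKILL = """---
name: company-policy
metadata:
  version: "1"
---

You are a helpful company information assistant.
"""

body = skill_instruction(SAMPLE_SKILL)
print(body)
print(model_location("gemini-3-flash-preview"))
```

Because the pattern is anchored at the start of the text and applied once, a `---` horizontal rule later in the skill body is left alone, and text without frontmatter passes through unchanged apart from whitespace stripping.
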