Add demo and implementation: an agent that writes its own skill #301

Draft
evekhm wants to merge 2 commits into GoogleCloudPlatform:main from evekhm:feat/skill-evolution-lab

@evekhm evekhm commented Jun 8, 2026

Contributor

What

A runnable, self-contained example — examples/skill_evolution_lab/ — of an agent that rewrites its own versioned SKILL.md from its conversation traces, with no teacher model, plus the reusable engine that powers it (scripts/skill_evolution.py).

A company-policy Q&A agent ships with a deliberately flawed V0 skill ("answer only from the baked summary, else contact HR") that suppresses a tool which already knows every answer. The engine reads the scored traces — successes and failures — and evolves a tool-first V1. Same model, same tool, same questions across V0/V1, so any quality delta is attributable to the skill.

Contents

  • scripts/skill_evolution.py — the evolution engine. Partitions scored conversations into successes/failures, runs a parallel analyst fleet (one analyst per trajectory, each on a frozen copy of the skill), and consolidates recurring rules into a new version (best-of-N, size compaction, anti-parroting reclassification). Standalone — consumes a scored report dict and returns skill text, so it composes with the scorer without importing it.
  • examples/skill_evolution_lab/ — the agent (agent/), a golden-Q&A eval spec (eval/eval_spec.json), evolve + disjoint held-out question sets, a one-command run_e2e_demo.sh, a committed sample_run/, DEMO_NARRATION.md, and VERIFICATION.md.
  • tests/test_skill_evolution.py — 19 engine unit tests.
  • examples/README.md — a "Skill Evolution Lab" section.

Verified

gemini-3-flash-preview, golden-grounded, on a held-out set (see VERIFICATION.md):

| Metric | V0 (flawed) | V1 (evolved) |
| --- | --- | --- |
| Overall correctness | 23.8% | 100% |
| Corrections (anti-parroting) | 33.3% | 100% |
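For readers who want to see how two such numbers could come out of tagged turns, here is a small sketch. The turn shape (`correct`, `tags`) and the `correction` tag name are assumptions rather than the actual quality_report.py schema, and the sample turns are made-up toy data, not the results of this run.

```python
def correctness(turns, tag=None):
    """Percent of turns marked correct, optionally limited to one tag.

    Returns None when no turn matches, to avoid dividing by zero.
    """
    selected = [t for t in turns if tag is None or tag in t["tags"]]
    if not selected:
        return None
    return 100.0 * sum(1 for t in selected if t["correct"]) / len(selected)


# Toy data for illustration only.
toy_turns = [
    {"correct": True, "tags": []},
    {"correct": False, "tags": ["correction"]},
    {"correct": True, "tags": ["correction"]},
    {"correct": False, "tags": []},
]

overall = correctness(toy_turns)                        # all four turns
corrections = correctness(toy_turns, tag="correction")  # tagged turns only
```

Scoring the tagged subset separately is what lets a correction-handling regression show up even when overall correctness looks healthy.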

Evolved skill: ~2.5KB, tool-first, with a learned "when corrected, re-verify with a tool — don't just agree" rule.

Dependency / draft status

Draft: the demo's scorer step (run_e2e_demo.sh → quality_report.py) uses the --eval-spec / golden-Q&A / --tag-turns features added in #174. The example code and engine are complete and tested here, but the end-to-end demo should be run only after #174 lands. Opening as a draft for early review of the example and engine.

A runnable, self-contained example of an agent that rewrites its own versioned
SKILL.md from its conversation traces — no teacher model. A deliberately flawed
V0 skill ("answer only from the baked summary, else contact HR") suppresses a
tool that already knows every answer; the engine reads the scored traces and
evolves a tool-first V1.

- scripts/skill_evolution.py — reusable evolution engine: partitions scored
  conversations into successes/failures, runs a parallel analyst fleet on frozen
  copies of the skill, and consolidates recurring rules into a new version
  (best-of-N, compaction, anti-parroting reclassification). Standalone: consumes
  a scored report dict, returns skill text.
- examples/skill_evolution_lab/ — the company-policy Q&A agent, golden Q&A eval
  spec, evolve/held-out question sets, one-command run_e2e_demo.sh, a committed
  sample_run/, DEMO_NARRATION.md, and VERIFICATION.md.
- tests/test_skill_evolution.py — 19 engine unit tests.

Verified (gemini-3-flash-preview, golden-grounded, held-out): V0 23.8% -> V1
100% overall; corrections (anti-parroting) 33.3% -> 100%.

Note: the demo's scorer step uses quality_report.py's eval-spec / golden-Q&A /
--tag-turns features from GoogleCloudPlatform#174; this PR should land after that one.
@evekhm evekhm changed the title Add skill_evolution_lab example + skill evolution engine Agent that writes it own skill based on conversations traces Jun 8, 2026
@evekhm evekhm changed the title Agent that writes it own skill based on conversations traces Add demo for an agent that writes it own skill based on conversations traces Jun 8, 2026
@evekhm evekhm changed the title Add demo for an agent that writes it own skill based on conversations traces Add demo for an agent that writes it own Skills based on conversations traces Jun 8, 2026
@evekhm evekhm changed the title Add demo for an agent that writes it own Skills based on conversations traces Add example: an agent that writes its own skill Jun 8, 2026
Add a 'Skill Evolution' section (table row + full section) covering
scripts/skill_evolution.py: what it does (parallel analyst fleet + inductive
consolidation, Trace2Skill/AutoSkill), the pipeline, CLI usage and flags
(--report/--skill/-o, --candidates, --max-chars, --analyst-mode), the
evolve_skill() Python API and knobs, the expected report input, prerequisites
(Vertex/ADC), and output (version bump + evolved_from). Links to the
skill_evolution_lab example for an end-to-end run.
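To make the flag list above concrete, here is a hedged sketch of how such a command line and the version bump might be wired with argparse. Only the flag names come from the commit message; the types, defaults, help strings, the `build_parser` and `bump_metadata` helpers, and the integer `version` field are all assumptions, so check the README section for the real behavior.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags named in the commit message."""
    parser = argparse.ArgumentParser(prog="skill_evolution.py")
    parser.add_argument("--report", required=True, help="scored report (assumed: a JSON path)")
    parser.add_argument("--skill", required=True, help="current SKILL.md")
    parser.add_argument("-o", dest="output", required=True, help="where to write the evolved skill")
    # Defaults below are placeholders, not the engine's real defaults.
    parser.add_argument("--candidates", type=int, default=3)
    parser.add_argument("--max-chars", type=int, default=4000)
    parser.add_argument("--analyst-mode", default=None)
    return parser


def bump_metadata(meta):
    """Return new metadata with the version incremented and its parent recorded.

    Assumes `version` is an integer; the real format may differ.
    """
    previous = meta.get("version", 0)
    return {**meta, "version": previous + 1, "evolved_from": previous}


args = build_parser().parse_args(
    ["--report", "report.json", "--skill", "SKILL.md", "-o", "SKILL_v1.md", "--max-chars", "2500"]
)
new_meta = bump_metadata({"name": "policy-qa", "version": 0})
```

Returning a new dict instead of mutating the input keeps the previous version's metadata intact, which fits the idea of analysts working from a frozen copy of the skill.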
@evekhm evekhm changed the title Add example: an agent that writes its own skill Add demo and implementation: an agent that writes its own skill Jun 8, 2026