Add demo and implementation: an agent that writes its own skill #301
Draft
evekhm wants to merge 2 commits into
A runnable, self-contained example of an agent that rewrites its own versioned
SKILL.md from its conversation traces — no teacher model. A deliberately flawed
V0 skill ("answer only from the baked summary, else contact HR") suppresses a
tool that already knows every answer; the engine reads the scored traces and
evolves a tool-first V1.
- scripts/skill_evolution.py — reusable evolution engine: partitions scored
conversations into successes/failures, runs a parallel analyst fleet on frozen
copies of the skill, and consolidates recurring rules into a new version
(best-of-N, compaction, anti-parroting reclassification). Standalone: consumes
a scored report dict, returns skill text.
- examples/skill_evolution_lab/ — the company-policy Q&A agent, golden Q&A eval
spec, evolve/held-out question sets, one-command run_e2e_demo.sh, a committed
sample_run/, DEMO_NARRATION.md, and VERIFICATION.md.
- tests/test_skill_evolution.py — 19 engine unit tests.
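As a rough illustration of the engine flow listed above (partition, parallel analysts on a frozen skill, consolidation of recurring rules), here is a minimal sketch. It is not the code in scripts/skill_evolution.py: the report shape (`score`, `used_tool`), the 0.5 threshold, the rule-based stand-in analyst, the `min_support` cutoff, and every function name are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 0.5  # assumed pass/fail cutoff, not taken from the engine


def partition(conversations, threshold=THRESHOLD):
    """Split scored conversations into (successes, failures)."""
    successes = [c for c in conversations if c["score"] >= threshold]
    failures = [c for c in conversations if c["score"] < threshold]
    return successes, failures


def analyze(conversation, frozen_skill):
    """Stand-in analyst: returns proposed rules for one trajectory.

    A real analyst would prompt a model with the trace and frozen_skill;
    this hypothetical version uses a fixed heuristic so the sketch runs offline.
    """
    if conversation["score"] < THRESHOLD and not conversation["used_tool"]:
        return ["Call the policy tool before answering."]
    return []


def consolidate(proposals, min_support=2):
    """Keep only rules proposed by at least min_support analysts."""
    counts = Counter(rule for rules in proposals for rule in rules)
    return [rule for rule, n in counts.most_common() if n >= min_support]


def evolve_sketch(conversations, skill_text):
    """Return new skill text: the old text plus consolidated rules."""
    successes, failures = partition(conversations)
    # One analyst per trajectory; all of them read the same unchanged skill text.
    with ThreadPoolExecutor(max_workers=4) as pool:
        proposals = list(
            pool.map(lambda c: analyze(c, skill_text), successes + failures)
        )
    rules = consolidate(proposals)
    return skill_text + "".join(f"\n- {rule}" for rule in rules)
```

Requiring a rule to recur across several trajectories is one simple way to keep a single odd conversation from rewriting the skill.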
Verified (gemini-3-flash-preview, golden-grounded, held-out): V0 23.8% -> V1
100% overall; corrections (anti-parroting) 33.3% -> 100%.
Note: the demo's scorer step uses quality_report.py's eval-spec / golden-Q&A /
--tag-turns features from GoogleCloudPlatform#174; this PR should land after that one.
Add a 'Skill Evolution' section (table row + full section) covering scripts/skill_evolution.py: what it does (parallel analyst fleet + inductive consolidation, Trace2Skill/AutoSkill), the pipeline, CLI usage and flags (--report/--skill/-o, --candidates, --max-chars, --analyst-mode), the evolve_skill() Python API and knobs, the expected report input, prerequisites (Vertex/ADC), and output (version bump + evolved_from). Links to the skill_evolution_lab example for an end-to-end run.
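To make the "version bump + evolved_from" output concrete, here is a minimal sketch. The front-matter layout (a `version: N` line and an `evolved_from: N` line) and the helper name `bump_version` are assumptions for illustration; the actual SKILL.md metadata format used by scripts/skill_evolution.py may differ.

```python
import re


def bump_version(skill_text):
    """Increment an assumed `version: N` line and record the parent version."""
    match = re.search(r"^version:[ \t]*(\d+)[ \t]*$", skill_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'version: N' line found")
    old = int(match.group(1))
    # Drop a stale evolved_from line left over from an earlier evolution.
    text = re.sub(r"^evolved_from:.*\n?", "", skill_text, flags=re.MULTILINE)
    new_lines = f"version: {old + 1}\nevolved_from: {old}"
    return re.sub(
        r"^version:[ \t]*\d+[ \t]*$", new_lines, text, count=1, flags=re.MULTILINE
    )
```

Applied to a skill whose metadata says `version: 0`, this yields `version: 1` followed by `evolved_from: 0`, with the rest of the text unchanged.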
What
A runnable, self-contained example, examples/skill_evolution_lab/, of an agent that rewrites its own versioned SKILL.md from its conversation traces, with no teacher model, plus the reusable engine that powers it (scripts/skill_evolution.py).
A company-policy Q&A agent ships with a deliberately flawed V0 skill ("answer only from the baked summary, else contact HR") that suppresses a tool which already knows every answer. The engine reads the scored traces, both successes and failures, and evolves a tool-first V1. Same model, same tool, same questions across V0/V1, so any quality delta is attributable to the skill.
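The controlled comparison can be sketched as follows. This is an illustrative check, not part of the demo: the result shape (question id mapped to pass/fail) and the function names are assumptions, and the data in the usage note is toy data, not the measured results.

```python
def pass_rate(results):
    """Fraction of questions that passed; results maps question id -> bool."""
    return sum(results.values()) / len(results)


def skill_delta(v0_results, v1_results):
    """Pass-rate change from V0 to V1, refusing mismatched question sets."""
    if v0_results.keys() != v1_results.keys():
        # Different questions would confound the comparison.
        raise ValueError("V0 and V1 must be scored on the same questions")
    return pass_rate(v1_results) - pass_rate(v0_results)
```

For example, with toy results `{"q1": False, "q2": True, "q3": False, "q4": False}` for V0 and all four questions passing for V1, `skill_delta` returns 0.75.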
Contents
- scripts/skill_evolution.py: the evolution engine. Partitions scored conversations into successes/failures, runs a parallel analyst fleet (one analyst per trajectory, each on a frozen copy of the skill), and consolidates recurring rules into a new version (best-of-N, size compaction, anti-parroting reclassification). Standalone: it consumes a scored report dict and returns skill text, so it composes with the scorer without importing it.
- examples/skill_evolution_lab/: the agent (agent/), a golden-Q&A eval spec (eval/eval_spec.json), evolve and disjoint held-out question sets, a one-command run_e2e_demo.sh, a committed sample_run/, DEMO_NARRATION.md, and VERIFICATION.md.
- tests/test_skill_evolution.py: 19 engine unit tests.
- examples/README.md: a "Skill Evolution Lab" section.
Verified
gemini-3-flash-preview, golden-grounded, on a held-out set (see VERIFICATION.md): overall 23.8% (V0) -> 100% (V1); corrections (anti-parroting) 33.3% -> 100%.
Evolved skill: ~2.5KB, tool-first, with a learned "when corrected, re-verify with a tool; don't just agree" rule.
Dependency / draft status