Add demo and implementation: an agent that writes its own skill by evekhm · Pull Request #301 · GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK

evekhm · 2026-06-08T17:50:23Z

What

A runnable, self-contained example — examples/skill_evolution_lab/ — of an agent that rewrites its own versioned SKILL.md from its conversation traces, with no teacher model, plus the reusable engine that powers it (scripts/skill_evolution.py).

A company-policy Q&A agent ships with a deliberately flawed V0 skill ("answer only from the baked summary, else contact HR") that suppresses a tool which already knows every answer. The engine reads the scored traces — successes and failures — and evolves a tool-first V1. Same model, same tool, same questions across V0/V1, so any quality delta is attributable to the skill.

scripts/skill_evolution.py — the evolution engine. Partitions scored conversations into successes/failures, runs a parallel analyst fleet (one analyst per trajectory, each on a frozen copy of the skill), and consolidates recurring rules into a new version (best-of-N, size compaction, anti-parroting reclassification). Standalone — consumes a scored report dict and returns skill text, so it composes with the scorer without importing it.
examples/skill_evolution_lab/ — the agent (agent/), a golden-Q&A eval spec (eval/eval_spec.json), evolve + disjoint held-out question sets, a one-command run_e2e_demo.sh, a committed sample_run/, DEMO_NARRATION.md, and VERIFICATION.md.
tests/test_skill_evolution.py — 19 engine unit tests.
examples/README.md — a "Skill Evolution Lab" section.

Verified

gemini-3-flash-preview, golden-grounded, on a held-out set (see VERIFICATION.md):

Metric	V0 (flawed)	V1 (evolved)
Overall correctness	23.8%	100%
Corrections (anti-parroting)	33.3%	100%

Evolved skill: ~2.5KB, tool-first, with a learned "when corrected, re-verify with a tool — don't just agree" rule.

Dependency / draft status

Draft: the demo's scorer step (run_e2e_demo.sh → quality_report.py) uses the --eval-spec / golden-Q&A / --tag-turns features added in #174. The example code and engine are complete and tested here, but the end-to-end demo should be run after #174 lands. Opening as draft for early review of the example + engine.

A runnable, self-contained example of an agent that rewrites its own versioned SKILL.md from its conversation traces — no teacher model. A deliberately flawed V0 skill ("answer only from the baked summary, else contact HR") suppresses a tool that already knows every answer; the engine reads the scored traces and evolves a tool-first V1. - scripts/skill_evolution.py — reusable evolution engine: partitions scored conversations into successes/failures, runs a parallel analyst fleet on frozen copies of the skill, and consolidates recurring rules into a new version (best-of-N, compaction, anti-parroting reclassification). Standalone: consumes a scored report dict, returns skill text. - examples/skill_evolution_lab/ — the company-policy Q&A agent, golden Q&A eval spec, evolve/held-out question sets, one-command run_e2e_demo.sh, a committed sample_run/, DEMO_NARRATION.md, and VERIFICATION.md. - tests/test_skill_evolution.py — 19 engine unit tests. Verified (gemini-3-flash-preview, golden-grounded, held-out): V0 23.8% -> V1 100% overall; corrections (anti-parroting) 33.3% -> 100%. Note: the demo's scorer step uses quality_report.py's eval-spec / golden-Q&A / --tag-turns features from GoogleCloudPlatform#174; this PR should land after that one.

Add a 'Skill Evolution' section (table row + full section) covering scripts/skill_evolution.py: what it does (parallel analyst fleet + inductive consolidation, Trace2Skill/AutoSkill), the pipeline, CLI usage and flags (--report/--skill/-o, --candidates, --max-chars, --analyst-mode), the evolve_skill() Python API and knobs, the expected report input, prerequisites (Vertex/ADC), and output (version bump + evolved_from). Links to the skill_evolution_lab example for an end-to-end run.

evekhm changed the title ~~Add skill_evolution_lab example + skill evolution engine~~ Agent that writes it own skill based on conversations traces Jun 8, 2026

evekhm changed the title ~~Agent that writes it own skill based on conversations traces~~ Add demo for an agent that writes it own skill based on conversations traces Jun 8, 2026

evekhm changed the title ~~Add demo for an agent that writes it own skill based on conversations traces~~ Add demo for an agent that writes it own Skills based on conversations traces Jun 8, 2026

evekhm changed the title ~~Add demo for an agent that writes it own Skills based on conversations traces~~ Add example: an agent that writes its own skill Jun 8, 2026

evekhm changed the title ~~Add example: an agent that writes its own skill~~ Add demo and implementation: an agent that writes its own skill Jun 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add demo and implementation: an agent that writes its own skill#301

Add demo and implementation: an agent that writes its own skill#301
evekhm wants to merge 2 commits into
GoogleCloudPlatform:mainfrom
evekhm:feat/skill-evolution-lab

evekhm commented Jun 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

evekhm commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Contents

Verified

Dependency / draft status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

evekhm commented Jun 8, 2026 •

edited

Loading