This folder is a complete, recorded run of ./run_e2e_demo.sh on
gemini-3.5-flash (the default agent model), committed so you can inspect the
exact inputs and outputs of the skill-evolution loop without running anything.
(Live runs go to runs/<timestamp>/, which is git-ignored; this is a curated
copy of one.)
The headline result for this run (see RESULT.md): held-out correctness
V0 18.2% → V1 100% (55/55), grounding (tool-call share) 7% → 96%,
evolved skill 2.9 KB. (Held-out set: 50 single-turn + 5 anti-parroting.)
The loop runs in five steps. The model, tool, and questions are identical for V0
and V1 — only the SKILL.md changes — so any delta is attributable to the skill.
- V0 traffic (evolve set). The flawed V0 skill answers the evolve questions. → v0_evolve_traffic.json: raw conversations, one per session, each shaped {session_id, question, conversation[], final_response, tool_calls, ...}, which is the schema that quality_report.py --conversations-file consumes.
- Score V0 (evolve set). quality_report.py --eval-spec eval_spec.json --tag-turns grades each conversation against the golden Q&A and tags corrections. → v0_evolve_report.json: the engine's input. Each session has metrics.response_usefulness.category (meaningful / unhelpful / partial / declined), golden_eval (matched, expected_answer, similarity), and sub_trajectories (correction outcomes: recovered / parroted / not_recovered). summary.golden_eval_summary.matched_meaningful_rate is the headline metric.
- V0 baseline (held-out). Same two steps on the disjoint held-out test set. → v0_test_traffic.json, v0_test_report.json: the honest baseline, on questions the engine never trains on.
- Evolve. evolve_skill() partitions the V0 evolve report into successes / failures, runs the analyst fleet, consolidates (best-of-N), and writes a new skill. → v1_skill.md: the evolved skill (version: "1", evolved_from: "0"). It is tool-first: it lists which topics to look up with tools, forbids premature HR deflection, and bakes in no specific data values.
- V1 result + compare (held-out). Deploy V1, re-run the held-out set, score, and compare. → v1_test_traffic.json, v1_test_report.json: V1, scored the same way as V0. → RESULT.md / RESULT.json: V0 vs V1 on overall, single-turn, anti-parroting, and parroted-sub-trajectory counts.
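To make the report fields concrete, here is a minimal sketch of how a list of report sessions could be tallied by category and split into successes and failures. It is not the engine's code. The success rule (meaningful and golden-matched) is an assumption, the helper names are hypothetical, and the sample sessions are made up for illustration rather than taken from the recorded run. How the session list is stored inside the report file is not specified here, so the functions take the list directly.

```python
from collections import Counter


def tally_categories(sessions):
    """Count sessions per metrics.response_usefulness.category."""
    return Counter(
        s["metrics"]["response_usefulness"]["category"] for s in sessions
    )


def partition_sessions(sessions):
    """Split sessions into (successes, failures).

    Assumption: a success is a session that is both 'meaningful' and
    golden-matched. The criterion evolve_skill() really uses may differ.
    """
    successes, failures = [], []
    for s in sessions:
        category = s["metrics"]["response_usefulness"]["category"]
        matched = bool(s.get("golden_eval", {}).get("matched"))
        if category == "meaningful" and matched:
            successes.append(s)
        else:
            failures.append(s)
    return successes, failures


# Hypothetical sessions, shaped like the fields described above.
SAMPLE_SESSIONS = [
    {
        "session_id": "s1",
        "metrics": {"response_usefulness": {"category": "meaningful"}},
        "golden_eval": {"matched": True},
    },
    {
        "session_id": "s2",
        "metrics": {"response_usefulness": {"category": "unhelpful"}},
        "golden_eval": {"matched": False},
    },
    {
        "session_id": "s3",
        "metrics": {"response_usefulness": {"category": "declined"}},
        "golden_eval": {"matched": False},
    },
]

if __name__ == "__main__":
    ok, bad = partition_sessions(SAMPLE_SESSIONS)
    print(dict(tally_categories(SAMPLE_SESSIONS)))
    print(f"successes={len(ok)} failures={len(bad)}")
```

To try it on the recorded files, load a report with json.load and pass in its session list, once you have checked which key holds it.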
The same held-out question, V0 vs V1 (from v0_test_report.json and
v1_test_report.json):
Q: "How much does the company contribute to my HSA for family coverage?"
V0: category=unhelpful tool_calls=0 golden_matched=false
"I do not have that information. Please contact HR for details regarding
HSA contributions."
V1: category=meaningful tool_calls=1 golden_matched=true
"For family coverage, the company contributes $1,500 per year to your
Health Savings Account (HSA)."
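A side-by-side like the one above can be pulled from two lists of report sessions with a few lines of Python. This is a sketch under assumptions: that each report session carries the question text, and that tool_calls is either a count or a list of calls. The helper names and the lookup_benefits tool name are hypothetical.

```python
def find_session(sessions, question):
    """Return the first session whose question matches exactly, else None."""
    for s in sessions:
        if s.get("question") == question:
            return s
    return None


def tool_call_count(session):
    """Accept tool_calls as either an int count or a list of calls (assumed)."""
    calls = session.get("tool_calls", 0)
    return calls if isinstance(calls, int) else len(calls or [])


def summarize(session):
    """Reduce one session to the fields shown in the comparison."""
    return {
        "category": session["metrics"]["response_usefulness"]["category"],
        "tool_calls": tool_call_count(session),
    }


def compare_question(v0_sessions, v1_sessions, question):
    """Summarize the same question under V0 and V1 (None if not found)."""
    v0 = find_session(v0_sessions, question)
    v1 = find_session(v1_sessions, question)
    return {
        "V0": summarize(v0) if v0 else None,
        "V1": summarize(v1) if v1 else None,
    }


QUESTION = "How much does the company contribute to my HSA for family coverage?"

# Hand-written stand-ins for sessions from the two held-out reports.
V0_SESSIONS = [
    {
        "question": QUESTION,
        "tool_calls": 0,
        "metrics": {"response_usefulness": {"category": "unhelpful"}},
    }
]
V1_SESSIONS = [
    {
        "question": QUESTION,
        "tool_calls": [{"name": "lookup_benefits"}],  # hypothetical tool name
        "metrics": {"response_usefulness": {"category": "meaningful"}},
    }
]

if __name__ == "__main__":
    print(compare_question(V0_SESSIONS, V1_SESSIONS, QUESTION))
```

Exact string matching on the question is the simplest join and works only if both reports store the question verbatim. Joining on a shared question ID would be more robust if the reports have one.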
cd .. # examples/skill_evolution_lab
./setup.sh YOUR_PROJECT_ID us-central1
./run_e2e_demo.sh

Numbers vary run-to-run (LLM nondeterminism, golden-match set), but the direction is stable: V0 defers/declines on topics it has a tool for; V1 uses the tool.