AgentToolkit
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 6 additions & 1 deletion
diff --git a/.secrets.baseline b/.secrets.baseline
Lines changed: 6 additions & 6 deletions
diff --git a/explorations/agent-wiki/README.md b/explorations/agent-wiki/README.md
Lines changed: 68 additions & 0 deletions
diff --git a/explorations/agent-wiki/docs/design.md b/explorations/agent-wiki/docs/design.md
Lines changed: 263 additions & 0 deletions
@@ -49,7 +49,12 @@ repos:
       - id: detect-secrets
         name: detect secrets
         args: ['--baseline', '.secrets.baseline']
-        exclude: package.lock.json
+        # Narrowly skip only the generated example-wiki tree and the schema doc's
+        # worked examples: their 12-hex guideline content-hashes and session UUIDs
+        # are identifiers, not secrets. Hand-written code/docs/harness under
+        # explorations/agent-wiki/ stay scanned. (Mirrors the ruff/mypy scoping,
+        # which excludes only explorations/agent-wiki/wikis/.)
+        exclude: 'package.lock.json|^explorations/agent-wiki/wikis/|^explorations/agent-wiki/docs/schema\.md$'
 
   # Plugin render-equality gate — fails if platform-integrations/ has drifted
   # from plugin-source/. Runs whenever plugin-source/ or the rendered tree
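A quick way to sanity-check the new detect-secrets `exclude` pattern is to run it against a few representative paths. The sketch below assumes pre-commit's documented behaviour of treating `exclude` as a Python regular expression searched against the repo-relative path. The helper name `is_excluded` and the sample file names are illustrative, not taken from the tree.

```python
import re

# The exclude pattern from the detect-secrets hook in this diff, copied verbatim.
EXCLUDE = re.compile(
    r"package.lock.json"
    r"|^explorations/agent-wiki/wikis/"
    r"|^explorations/agent-wiki/docs/schema\.md$"
)


def is_excluded(path: str) -> bool:
    """Return True if the hook would skip this repo-relative path."""
    return EXCLUDE.search(path) is not None


# Illustrative paths (hypothetical file names under the directories named above).
for sample in (
    "explorations/agent-wiki/wikis/wiki-twobatch/AGENTS.md",
    "explorations/agent-wiki/docs/schema.md",
    "explorations/agent-wiki/docs/design.md",
    "package-lock.json",
):
    print(f"{sample}: {'skipped' if is_excluded(sample) else 'scanned'}")
```

The unescaped dots in `package.lock.json` match any character, which is why that alternative also covers `package-lock.json`.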
 
@@ -1,9 +1,9 @@
 {
   "exclude": {
-    "files": "^.secrets.baseline$|package-lock\\.json$",
+    "files": "^.secrets.baseline$|package-lock\\.json$|^explorations/agent\\-wiki/wikis/|^explorations/agent\\-wiki/docs/schema\\.md$",
     "lines": null
   },
-  "generated_at": "2026-04-29T16:14:59Z",
+  "generated_at": "2026-06-10T06:41:48Z",
   "plugins_used": [
     {
       "name": "AWSKeyDetector"
@@ -156,11 +156,11 @@
     "sandbox/README.md": [
       {
         "hashed_secret": "b792a28a35da9b44fa0ee8a53002e9c238afb1bd",
+        "is_secret": false,
         "is_verified": false,
-        "line_number": 67,
+        "line_number": 68,
         "type": "Secret Keyword",
-        "verified_result": null,
-        "is_secret": false
+        "verified_result": null
       }
     ],
     "sandbox/sample.env": [
@@ -223,4 +223,4 @@
     "file": null,
     "hash": null
   }
-}
+}
@@ -0,0 +1,68 @@
+# agent-wiki
+
+An exploration in turning agent trajectories into a **reusable, evidence-grounded
+wiki** that future agents consult before acting — and the experiments measuring
+whether it actually helps.
+
+The core idea: after an agent finishes a task, distill its trajectory into wiki
+pages — episodic **summaries**, atomic **guidelines**, themed **cluster** pages,
+and executable **skills** — each linked back to the trajectory that produced it.
+A future agent, pointed at the wiki's `AGENTS.md`, retrieves the pages relevant
+to its task and applies them instead of re-deriving the recipe.
+
+## Layout
+
+```
+explorations/agent-wiki/
+├── skills/            the agent-wiki skill family + the build_agent_wiki.py builder
+│   ├── agent-wiki-summarize/             trajectory → episodic summary
+│   ├── agent-wiki-extract-guidelines/    trajectory → atomic guidelines
+│   ├── agent-wiki-synthesize-skill/      trajectory → executable SKILL.md
+│   ├── agent-wiki-consolidate-guidelines/ atomics → themed cluster pages
+│   ├── agent-wiki-tasks/                 cross-session task-comparison pages
+│   ├── agent-wiki-consult/               retrieval-time entry point
+│   ├── agent-wiki-ingest/                end-to-end orchestrator (all of the above)
+│   └── scripts/build_agent_wiki.py       deterministic builder (render-*/catalog)
+├── docs/
+│   ├── design.md      design & rationale
+│   └── schema.md      on-disk page/index schema
+└── experiments/       the empirical evidence (see RESULTS-SUMMARY.md)
+    ├── RESULTS-SUMMARY.md
+    ├── twobatch-*.md  the comparison reports (wiki vs no-wiki; skills vs guidelines; …)
+    ├── pruned-index-hypothesis.md
+    ├── metrics/       per-trial metric rollups (.jsonl)
+    └── harness/       comparison scripts (re-runnable) + the A/B runner (reference)
+```
+
+The example **wikis** built by these skills (`wiki-twobatch` / `-skills` /
+`-both` / `-pruned`) are shipped in a companion PR to keep this one focused on
+reviewable code — they are ~10k lines of generated output. They land under
+`explorations/agent-wiki/wikis/` once that PR merges.
+
+## Reading order
+
+1. **`docs/design.md`** — what the wiki is and why it's shaped this way.
+2. **`experiments/RESULTS-SUMMARY.md`** — the running tape of findings
+   (wiki cuts cost ~20% at equal accuracy; skills beat guidelines; pointer
+   wording is load-bearing; composition matters more than wiki size).
+3. **`skills/agent-wiki-ingest/SKILL.md`** — how a batch of traces becomes a
+   wiki in one pass.
+4. **The example wikis** (companion PR): open `AGENTS.md` in a built
+   `wiki-twobatch-skills/`, then `_index.jsonl`, then any page, to see a real
+   wiki end-to-end.
+
+## Scope of this exploration
+
+The example wikis (companion PR) are **benchmark-derived** (a synthetic 16-task
+file-format corpus). The raw per-trial sandbox transcripts and any wikis built
+from internal trajectory corpora are intentionally **not** included — only the
+metric rollups, the narrative reports, and the benchmark-derived wikis. Source
+links in wiki frontmatter are shown in the generic form
+`trajectories/<session-id>.json`.
+
+The skills here are a **standalone reference copy**, runnable via
+`explorations/agent-wiki/skills/scripts/build_agent_wiki.py`; they are not wired
+into any plugin loader in this tree. The experiment **harness** ships the
+re-runnable comparison scripts; the sandbox A/B runner
+(`experiments/harness/experiment_wiki_consult.py`) is reference-only — it needs
+project-level sandbox assets not included here.
@@ -0,0 +1,263 @@
+# Agent-wiki: design & rationale
+
+*A durable, evidence-grounded knowledge layer mined from an agent's own
+trajectories, consulted by future agents at recall-time.*
+
+This doc explains **why** the agent-wiki is shaped the way it is, **what**
+its pieces are, **how** a raw trace becomes a recallable page, and **what
+the experiments show**. It is the canonical design statement. For the
+operational contract it links to the recall recipe
+([`_default_agents.md`](../skills/scripts/_default_agents.md),
+copied into every wiki as `AGENTS.md`); for the measurements it links to the
+empirical log
+([`experiments/RESULTS-SUMMARY.md`](../experiments/RESULTS-SUMMARY.md)).
+
+---
+
+## 1. The problem
+
+Coding agents start every session cold. An agent that spent twenty tool
+calls last week discovering that a Debian container has no `pip` and
+PEP 668 blocks `pip install` will spend twenty tool calls rediscovering it
+next week. The knowledge a session produces dies with the session.
+
+The usual fixes don't hold up:
+
+- **Hand-authored runbooks** drift from reality and carry no provenance —
+  you can't tell whether a rule still reflects how the tool behaves, or who
+  decided it.
+- **Raw trajectory stores** keep everything but generalize nothing. They're
+  too bulky to load at recall-time, and a future agent has to re-derive the
+  lesson from a transcript instead of reading it.
+- **Generic long-term memory** (embed-everything vector stores) is lossy and
+  unauditable: a retrieved snippet has no chain back to the moment it was
+  true.
+
+The goal: a **knowledge layer the agent earns from its own work** — small
+enough to consult cheaply, general enough to apply to unseen-but-related
+tasks, and auditable down to the transcript that produced each claim.
+
+## 2. The core idea
+
+Build a **wiki from agent traces**. Each completed trajectory is distilled
+into pages; every page links back to the session it came from. Future agents
+**consult the wiki once they know the task they're about to do** — after the
+user's request is understood and the task family is clear, before writing
+code.
+
+```
+ past sessions            the wiki                  future session
+┌──────────────┐      ┌──────────────────┐       ┌──────────────────┐
+│ trajectory A │─┐    │ summaries/       │       │ user states task │
+│ trajectory B │─┼──▶ │ guidelines/      │ ◀─────│ agent reads      │
+│ trajectory C │─┘    │ skills/  tasks/  │consult│ _index.jsonl,    │
+└──────────────┘ dist.│ _index.jsonl     │       │ applies the rule │
+        ▲             └──────────────────┘       └──────────────────┘
+        └── provenance ──┘
+   (each wiki page links back to the trajectory it was distilled from)
+```
+
+The wiki is **not** a transcript archive and **not** a session-start
+preload. It's a curated index of distilled lessons, sorted in recall
+preference order, that an agent pulls from on demand.
+
+## 3. Design principles
+
+Each decision below earns its place; the *why* is the point.
+
+### Provenance is mandatory
+
+Every page is traceable, in a couple of clicks, to the raw transcript that
+produced it:
+
+```
+guideline.md
+  ↓ related_summary:
+summaries/<session_id>.md
+  ↓ sources:
+trajectories/<session_id>.json
+  ↓ source.transcript_path
+~/.../<session_id>.jsonl   (the raw trace)
+```
+
+Why: a recommendation is only trustworthy if you can audit where it came
+from and revise it when the underlying tool behavior changes. Provenance is
+what separates this from a generic memory store. Cluster pages aggregate
+their members' provenance rather than replacing it.
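The provenance chain above can be walked mechanically. The following is a minimal sketch, assuming each page's frontmatter has already been parsed into a dict and that each hop key holds a single path. In the real schema `sources:` may well be a list. The function names and the in-memory `pages` mapping are hypothetical; only the hop keys come from the chain shown above.

```python
# Hop keys taken from the chain in the design doc: guideline -> summary ->
# trajectory -> raw transcript.
HOPS = ("related_summary", "sources", "source.transcript_path")


def lookup(meta: dict, dotted_key: str):
    """Resolve a dotted key such as 'source.transcript_path' in nested dicts."""
    value = meta
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def provenance_chain(pages: dict, start: str) -> list:
    """Follow every hop from a guideline page down to the raw transcript path.

    `pages` maps a path to its parsed metadata (a hypothetical in-memory
    stand-in for reading frontmatter and JSON off disk).
    """
    chain = [start]
    current = start
    for key in HOPS:
        meta = pages.get(current)
        if meta is None:
            raise KeyError(f"broken provenance: {current!r} not found")
        target = lookup(meta, key)
        if target is None:
            raise KeyError(f"broken provenance: {current!r} lacks {key!r}")
        chain.append(target)
        current = target
    return chain
```

A page whose chain raises `KeyError` is one that can no longer be audited, which is the condition the "provenance is mandatory" principle is meant to rule out.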
+
+### Page kinds, and a retrieval preference order
+
+The wiki has five page kinds, and `_index.jsonl` sorts them in **recall
+preference order**:
+
+| Kind | What it is | Why it exists |
+|---|---|---|
+| **cluster** | Themed aggregator over ≥2 atomic guidelines | One consolidated rule instead of N near-duplicate hits |
+| **skill** | Callable workflow page + sibling scripts | Directly *executable* — no interpretation needed |
+| **guideline** (atomic) | One rule, free-text, trigger-tagged | The base unit; a single distilled lesson |
+| **task / subtask** | Cross-session comparison / per-session workstream | Analysis surface, not recall-time advice |
+| **summary** | Episodic record of one session | The provenance anchor every other page links to |
+
+Sort order is `cluster → skill → guideline → task`, so the most
+consolidated and most directly-actionable artifacts surface first. The exact
+retrieval recipe (parse task → read `_index.jsonl` → filter by tag/trigger →
+prefer clusters → read top 2–5) lives in the recall contract; see
+[`_default_agents.md`](../skills/scripts/_default_agents.md).
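The retrieval recipe (read `_index.jsonl`, filter by tag, prefer clusters, read the top few) can be sketched in a few lines. This is an illustration only: the entry fields `kind`, `tags`, and `path` are assumed names, not the documented schema, and the authoritative recipe remains the recall contract in `_default_agents.md`.

```python
import json

# Recall preference order as stated in the design doc.
KIND_RANK = {"cluster": 0, "skill": 1, "guideline": 2, "task": 3}


def select_pages(index_lines, task_tags, limit=5):
    """Return up to `limit` index entries sharing a tag with the task.

    Entries are ordered by recall preference; the sort is stable, so entries
    of the same kind keep their original index order.
    """
    wanted = set(task_tags)
    entries = [json.loads(line) for line in index_lines if line.strip()]
    hits = [e for e in entries if wanted & set(e.get("tags", []))]
    hits.sort(key=lambda e: KIND_RANK.get(e.get("kind"), len(KIND_RANK)))
    return hits[:limit]
```

Unknown kinds sort last rather than raising, so an index carrying page kinds this sketch does not know about still yields usable results.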
+
+### Procedural over declarative where possible
+
+A **guideline** tells a future agent *what to do* ("when pip's module dir is
+missing, don't trust `ensurepip`"). A **skill** is a structured workflow page
+the agent can *execute* — Overview / When-To-Use / Workflow / optional
+sibling scripts it runs via Bash.
+
+Skills are **recall-preferred over guidelines** because they remove an
+interpretation step: the agent reads the SKILL.md and runs the recipe
+instead of reconstructing it from advice. §5 shows skills also win on cost.
+
+### Consolidation + delete-on-promote
+
+Two cross-trajectory moves keep the recall surface small and non-redundant:
+
+- **Consolidation** clusters ≥2 atomic guidelines that share a real *rule*
+  (not merely a topic) into a `__cluster.md` aggregator. Members stay on
+  disk with a `superseded_by:` backref — provenance is preserved.
+- **Delete-on-promote** (`--archive-covered`): when a skill is synthesized
+  (or a cluster created), the atomics it subsumes are **soft-archived** to
+  `_archived/`. They leave the recall index but stay auditable on disk; the
+  `_audit.log` records the move.
+
+Why: §5's central empirical finding is that **recall quality degrades as the
+index grows** — a smaller, non-redundant index helps even on tasks where no
+page matches. Consolidation and pruning are how the wiki stays small as it
+accumulates traces.
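Soft-archiving amounts to a move plus an audit record. The sketch below shows one way it could look, assuming a JSON-lines `_audit.log` and a flat `_archived/` directory. The record fields, the `covered_by` parameter, and the function name are hypothetical and are not claimed to match what `build_agent_wiki.py --archive-covered` writes.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def soft_archive(wiki_root: Path, page: Path, covered_by: str) -> Path:
    """Move `page` into `_archived/` and append one record to `_audit.log`.

    The page stays on disk (auditable) but leaves the live tree that the
    recall index is built from.
    """
    archive_dir = wiki_root / "_archived"
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / page.name
    shutil.move(str(page), str(target))

    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "action": "archive",
        "page": page.relative_to(wiki_root).as_posix(),
        "covered_by": covered_by,  # the skill or cluster that subsumes the page
    }
    with (wiki_root / "_audit.log").open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return target
```

Appending to the log rather than rewriting it keeps earlier moves intact, so the archive history can be replayed in order.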
+
+### Recall-time discipline
+
+Consult **once you know the task or sub-task** — not at session start (too
+vague to match), not as a last resort when stuck (too late). And the
+**pointer wording is load-bearing**: a strong-imperative instruction to
+consult the wiki gets followed; a soft "you may want to check" gets skipped
+(§5, the A/B sweep). The pointer lives in the workspace `CLAUDE.md` /
+`AGENTS.md`; placement and wording both matter.
+
+## 4. How a trace becomes a recallable page
+
+The build pipeline is a sequence of LLM passes, each piping structured JSON
+to a deterministic builder
+([`build_agent_wiki.py`](../skills/scripts/build_agent_wiki.py))
+that writes the page and maintains the indexes:
+
+```
+raw trace ─┬─[convert]──▶ normalized JSON
+           │
+           ├─[summarize]─────────▶ summaries/<sid>.md        render-summary
+           ├─[extract-guidelines]▶ guidelines/<slug>__<gid>.md  render-guidelines
+           ├─[synthesize-skill]──▶ skills/<slug>/SKILL.md     render-skill --archive-covered
+           │                                                  (per trace, above)
+           ├─[consolidate]───────▶ guidelines/<slug>__cluster.md  render-cluster
+           │                                                  (once, cross-corpus)
+           └─[catalog]───────────▶ _index.jsonl, indexes, backrefs
+```
+
+| Stage | Skill | Builder subcommand | Scope |
+|---|---|---|---|
+| Convert | (bob-trace-converter / `normalize_stream_json_transcripts.py`) | — | per trace |
+| Summarize | [`agent-wiki-summarize`](../skills/agent-wiki-summarize/SKILL.md) | `render-summary` | per trace |
+| Extract guidelines | [`agent-wiki-extract-guidelines`](../skills/agent-wiki-extract-guidelines/SKILL.md) | `render-guidelines` | per trace |
+| Synthesize skill | [`agent-wiki-synthesize-skill`](../skills/agent-wiki-synthesize-skill/SKILL.md) | `render-skill` | per trace |
+| Consolidate | [`agent-wiki-consolidate-guidelines`](../skills/agent-wiki-consolidate-guidelines/SKILL.md) | `render-cluster` | **cross-corpus, once** |
+| Catalog | (any) | `catalog` | bookkeeping |
+
+**Order matters.** `synthesize-skill` runs *before* `consolidate` so skills
+claim recipe-level territory first (and archive the atomics they cover);
+consolidation then clusters only the surviving atomics. This matches the
+consolidate skill's own rule — don't propose a cluster overlapping a skill's
+territory.
+
+**`catalog` renders; `consolidate` proposes.** A sharp edge worth
+internalizing: `catalog` only *materializes* clusters already declared in
+`_config.yaml` and refreshes indexes/backrefs. It never *proposes* new
+clusters. Consolidation is the LLM pass that proposes them. Running `catalog`
+and expecting clusters to appear is a mistake — they won't unless
+consolidation declared them first.
+
+### The one-pass entry point
+
+[`agent-wiki-ingest`](../skills/agent-wiki-ingest/SKILL.md)
+orchestrates the whole pipeline end-to-end (convert → bootstrap → summarize
+→ extract → synthesize → consolidate → catalog) via subagent fan-out:
+summarize runs in parallel (independent file writes), extract and synthesize
+run sequentially (they mutate shared index/config state), consolidation runs
+once. It exists specifically so the **consolidation pass is never silently
+skipped** when ingesting a batch — the failure mode that motivated it.
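The scheduling described above (parallel summarize, sequential extract and synthesize, one consolidation pass, then catalog) can be sketched with a thread pool. The stage callables are hypothetical placeholders for the LLM passes, and running extract then synthesize per trace is an assumption about the ordering within the sequential phase. The real orchestrator fans out to subagents rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def ingest(traces, summarize, extract, synthesize, consolidate, catalog):
    """Run the ingest stages in the order the design doc describes."""
    # Summaries are independent file writes, so they can run concurrently.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(summarize, traces))

    # Extract and synthesize mutate shared index/config state: keep them serial.
    for trace in traces:
        extract(trace)
        synthesize(trace)

    # Consolidation is cross-corpus and runs exactly once, after every skill
    # has had the chance to claim (and archive) the atomics it covers.
    consolidate()

    # Catalog only materializes what is already declared and refreshes indexes.
    catalog()
    return summaries
```

Keeping `consolidate()` as an unconditional step in the same function is the point of the design: a batch cannot finish ingesting without it.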
+
+### Build patterns
+
+The same corpus can be turned into a wiki three ways, varying *when* the
+wiki is built and *what* the agent sees during each trial (see
+[`RESULTS-SUMMARY.md` §3–4](../experiments/RESULTS-SUMMARY.md)):
+
+- **Open-loop** — trials run against a fixed external wiki; the new wiki is a
+  study log built from observing them.
+- **Closed-loop** — trials mount the wiki being built; it grows trial-by-trial,
+  so trial N+1 sees what trial N spawned. The only pattern with real
+  intra-wiki recall data.
+- **Retroactive** — the wiki stays empty during all trials, then is built in
+  one batch afterward. Cleanest pure-recipe corpus.
+
+The three real-task themes emerge in **all three** patterns — consolidation
+is robust to build order.
+
+## 5. Evidence
+
+All experiments use the same 16-task corpus, the `claude_md_strong` pointer,
+and 3 trials per task. `total_cost_usd` is the ground-truth cost metric (cache
+reads bill at ~10% of regular input, so raw token sums overstate cost). Full tables and
+methodology: [`experiments/RESULTS-SUMMARY.md`](../experiments/RESULTS-SUMMARY.md).
+
+| Finding | Result | Source |
+|---|---|---|
+| **Wiki vs no wiki** | −20% cost, −38% duration, −43% tool calls, accuracy unchanged (96%) | [twobatch-comparison](../experiments/twobatch-comparison.md) |
+| **Pointer wording is load-bearing** | strong-imperative CLAUDE.md 3/3 reads; soft phrasing 1/3 | [RESULTS-SUMMARY §1](../experiments/RESULTS-SUMMARY.md#1-agentsmd-ab-sweep-the-original) |
+| **Build pattern is robust** | same 3 clusters emerge open-/closed-/retroactive | [RESULTS-SUMMARY §3–4](../experiments/RESULTS-SUMMARY.md#34-build-pattern-comparison-closed-loop-vs-retroactive) |
+| **Skills > guidelines** | skills-only $0.146 vs guidelines $0.17 (−14%), accuracy 98% vs 96% | [twobatch-skills-comparison](../experiments/twobatch-skills-comparison.md) |
+| **Composition is non-additive** | skills+guidelines costs +22% vs skills, +5% vs guidelines | [twobatch-fourway-comparison](../experiments/twobatch-fourway-comparison.md) |
+| **Composition > size; skills-only still cheapest** | delete-on-promote (corrected index): −3% vs both, +18% vs skills | [twobatch-fiveway-comparison](../experiments/twobatch-fiveway-comparison.md) |
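The percentages in the table can be cross-checked against each other with a little arithmetic. In the sketch below only the two dollar figures and the percentages come from the table. The combined-wiki and pruned-wiki costs are derived estimates under the assumption that all rows share the same per-arm baselines; they are not numbers reported in the source.

```python
def pct_change(new: float, base: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


SKILLS = 0.146      # skills-only cost per the table (USD)
GUIDELINES = 0.17   # guidelines-only cost per the table (USD)

# Skills vs guidelines: the table reports -14%.
print(round(pct_change(SKILLS, GUIDELINES)))  # -14

# Derived estimate: skills+guidelines at +22% over skills-only.
both_implied = SKILLS * 1.22
# The table reports +5% vs guidelines for the same arm.
print(round(pct_change(both_implied, GUIDELINES)))  # 5

# Derived estimate: delete-on-promote at -3% vs the combined wiki.
pruned_implied = both_implied * 0.97
# The table reports +18% vs skills-only for the same arm.
print(round(pct_change(pruned_implied, SKILLS)))  # 18
```

Under that assumption the rows agree with one another to within rounding.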
+
+The throughline across these:
+
+- **The wiki materially reduces cost at equal accuracy.** Savings come
+  mainly from fewer tool calls and shorter responses, not from reading fewer
+  input bytes — the agent reads *more* wiki bytes but acts more directly.
+- **A smaller recall surface helps even when nothing matches.** The
+  skills-only arm beat guidelines-only on tasks where *no skill matched*
+  (e.g. t2-imports −39%) — evidence that index noise itself costs, which is
+  why consolidation and delete-on-promote exist.
+- **Don't stack page kinds.** Skills + guidelines together is the worst
+  populated wiki, and pruning the redundant atomics doesn't recover the gap.
+  Pick procedural-first; let consolidation + archive keep the rest lean.
+
+## 6. Open questions / limitations
+
+From [`RESULTS-SUMMARY.md`](../experiments/RESULTS-SUMMARY.md)'s open
+questions — live, not yet resolved:
+
+- **Statistical power.** Headline numbers rest on 3 trials/task; per-task
+  confidence intervals are wide, especially on the two observed regressions
+  (wav-info, imports).
+- **True transfer.** All experiments reuse the same tasks for build and recall.
+  A real transfer test (build from tasks Y, recall on task X where X ∈
+  family(Y), X ∉ Y) would test whether clusters *generalize* rather than
+  memorize.
+- **Scale.** 16 tasks is small. Does the cost-reduction percentage hold,
+  grow, or saturate at 50+ tasks and a larger index?
+- **Why composition regresses.** The skills+guidelines penalty is
+  output-token-driven, not read-count-driven; trace-level inspection has not
+  yet explained why the agent "says more" when both kinds are present.
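The transfer test described under "True transfer" (build from tasks Y, recall on a task X in the same family but outside Y) corresponds to a leave-one-out split within each task family. The sketch below is one possible way to enumerate such splits. The family mapping and the task names used with it are hypothetical, since the source does not define families for the 16-task corpus.

```python
def transfer_splits(families):
    """Yield (family, held_out_task, build_tasks) leave-one-out splits.

    `families` maps a family name to its task ids. A family with a single
    task yields nothing, because there would be nothing left to build from.
    """
    for family, tasks in families.items():
        for held_out in tasks:
            build_tasks = [t for t in tasks if t != held_out]
            if build_tasks:
                yield family, held_out, build_tasks
```

Each yielded split satisfies the condition stated in the bullet: the recall task belongs to the family of the build set but is not itself part of it.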
+
+## See also
+
+- [`schema.md`](schema.md) — the on-disk schema reference: directory layout, per-kind frontmatter, links, and the promotion/archival lifecycle.
+- [`_default_agents.md`](../skills/scripts/_default_agents.md) — the recall contract copied into every wiki as `AGENTS.md` (page kinds, retrieval recipe, provenance chain).
+- [`experiments/RESULTS-SUMMARY.md`](../experiments/RESULTS-SUMMARY.md) — the full empirical log.
+- The `agent-wiki-*` skills under [`skills/`](../skills/) and the builder [`build_agent_wiki.py`](../skills/scripts/build_agent_wiki.py).