Commit 3e26154

explore(agent-wiki): trajectory-derived wiki — skills, builder, experiments (#268)
* explore(agent-wiki): self-contained, public-safe agent-wiki exploration

  Adds explorations/agent-wiki/ — the agent-wiki skill family, builder,
  design + schema docs, the wiki-helps experiment reports, and
  benchmark-derived example wikis, all under one tree suitable for a public PR.

  Contents:
  - skills/       7 agent-wiki skills + build_agent_wiki.py (reference copy,
                  not plugin-wired)
  - docs/         design.md + schema.md
  - experiments/  RESULTS-SUMMARY + twobatch comparison reports +
                  pruned-index-hypothesis; metrics/ rollups (no raw
                  transcripts); harness/ runner + compare scripts
  - wikis/        wiki-terminalbench-bob + the twobatch arms
                  (base / skills / both / pruned-corrected)

  Public-safety scrub:
  - Excluded all raw per-trial sandbox transcripts (kept only metric rollups +
    narrative reports).
  - Excluded wikis built from internal corpora (procedural-design,
    consult-meta, iterative, retroactive, simple-claude, test-paired, claude)
    and the build-pattern comparison that ran on them; §3-4 of RESULTS-SUMMARY
    reduced to a portable-finding note.
  - Rewrote all source-path frontmatter to the generic
    trajectories/<session-id>.json form; genericized internal example names
    and the benchmark-data dir convention in skills/docs.
  - Leak gate (benchmark-data / internal corpus + wiki names / org paths)
    passes with zero hits across the tree.

  Branched off main; diff touches only explorations/agent-wiki/. Builder
  catalog + comparison scripts verified runnable from the new location.

* explore(agent-wiki): drop wiki-terminalbench-bob example

  Removes the terminal-bench example wiki from the exploration. Repoints the
  README reading-order + layout to wiki-twobatch-skills, fixes the docs that
  attributed worked examples to it (schema.md now points at the wiki-twobatch
  arms; example index rows retagged), and corrects stale relative links the
  docs carried from the original tree (../plugin-source → ../skills,
  ../WIKIS.md removed, ../experiments/wiki-build-comparison.md →
  RESULTS-SUMMARY §3–4, design.md/schema.md cross-links to renamed filenames).
  Skill example paths (consult, ingest) repointed off the removed wiki.

  Remaining wikis: wiki-twobatch {base, skills, both, pruned}. All intra-doc
  relative links resolve; leak gate clean.

* fix(explorations): make CI green for the agent-wiki exploration

  CI (ruff, mypy, detect-secrets) was scanning explorations/agent-wiki/ as
  project source — the first content under explorations/ to carry .py files
  and high-entropy identifiers. Fixes, scoped so generated example artifacts
  are treated like the already-excluded plugin-source/ and examples/ trees:
  - ruff: lint + format fixes in the harness scripts + builder; exclude the
    generated wiki scripts (explorations/agent-wiki/wikis/) via extend-exclude.
  - mypy: add explorations/agent-wiki/wikis/ to exclude; add file-local
    `# mypy: ignore-errors` to the exploration harness + the builder (a
    verbatim copy of the mypy-excluded plugin-source/ original).
  - detect-secrets: exclude explorations/agent-wiki/ in the pre-commit hook
    and .secrets.baseline — the 53 findings are 12-hex guideline content
    hashes and session-id UUIDs, not secrets.

  No example-wiki content changed (scripts keep their original names).

  Fixes failing CI checks: check-formatting, check-linting, check-typing,
  tekton/pr-code-checks/code-detect-secrets.

* explore(agent-wiki): move example wikis to a follow-up PR

  Drops explorations/agent-wiki/wikis/ (253 generated files, ~10k lines) from
  this PR so the diff is the reviewable surface — skills, builder, docs, and
  the experiment reports/harness (~34 files). The example wikis are
  machine-generated output; bundling them buried the code and appears to have
  made CodeRabbit skip deep review (summary only, zero inline findings). The
  wikis land in a stacked follow-up PR.

  README/docs still reference wikis/wiki-twobatch-* by path; those links
  resolve once the follow-up merges. Root-config excludes
  (ruff/mypy/detect-secrets) are kept — the detect-secrets exclude still
  covers example content hashes in docs/schema.md, and the wiki excludes
  become live again when the follow-up lands.

* fix(agent-wiki): address PR review findings

  P1 — fresh catalog bootstrap crash: cmd_catalog now creates summaries/,
  guidelines/, tasks/, skills/ before any index writer runs. A `catalog` on a
  bare wiki-root no longer FileNotFounds on summaries/index.md.

  P1 — skill docs referenced non-existent paths: repointed all 23
  build_agent_wiki.py invocations and the normalizer reference from
  plugin-source/… and scripts/… to the in-tree
  explorations/agent-wiki/skills/scripts/ and …/experiments/harness/ paths
  (across the 7 skills + _default_agents.md).

  P1 — harness reproducibility: experiment_wiki_consult.py is marked
  REFERENCE ONLY (it needs project-level sandbox assets — docker image, demo
  workspace, hint plugin, _format_samples — not shipped here); the tasks-file
  path now resolves to the checked-in harness/wiki_consult_tasks.yaml.
  README's "reproduce" wording split into re-runnable compare scripts vs the
  reference-only A/B runner.

  P2 — render-cluster --archive-members broke member links: archive members
  BEFORE rendering the cluster page, and resolve each member to its real
  location — sibling in guidelines/, or ../_archived/<name>.md when archived.
  Links and titles now resolve in both modes.

  P2 — README described moved-out wikis: the example wikis live in the
  companion PR; README layout/reading-order/scope updated accordingly.

  Also: stripped trailing EOF blank lines in twobatch-comparison.md and
  twobatch-skills-comparison.md (git diff --check).

* fix(agent-wiki): address CodeRabbit review on the split-down diff

  CodeRabbit re-reviewed the focused (code-only) PR and flagged 7 items; 3
  were already fixed by the prior commit (REPO_ROOT, tasks_file path,
  build-script path — CodeRabbit confirmed resolved). The remaining 4:
  - [major] _format_samples import: wrap the deferred import in a clear
    RuntimeError explaining it's a project-level sandbox asset absent from
    this reference-only runner, instead of a bare ImportError.
  - [minor] median was durs[n//2] — wrong for even-length trial lists; now
    averages the two middle values for even n (default --trials 3 unaffected).
  - [minor] typo "byes" -> "bytes" in RESULTS-SUMMARY.md.
  - [minor] _default_agents.md Structure tree: add the per-section index.md
    entries (summaries/guidelines/skills/tasks) the catalog regenerates.

* fix(agent-wiki): address review feedback from visahak

  1. Harness REPO_ROOT resolved to explorations/agent-wiki (parents[2])
     instead of the repo root, so the reference A/B runner couldn't find
     project assets (demo/workspace, platform-integrations/,
     tests/e2e/_wiki_hint_plugin). The script moved from tests/e2e/ (where
     parents[2] was the root) down two levels to experiments/harness/;
     REPO_ROOT is now parents[4] (the real repo root), matching the documented
     "run from the full project" usage.
  2. detect-secrets exclude was over-broad (^explorations/agent-wiki/),
     disabling the secret gate over all hand-written code/docs/harness there.
     Narrowed to only the generated example-wiki tree and the schema doc's
     worked examples (^explorations/agent-wiki/wikis/ + docs/schema.md) — the
     only paths whose 12-hex content hashes / session UUIDs trip the
     high-entropy detector. This mirrors the ruff/mypy scoping (wikis/ only).
     Applied in both .pre-commit-config.yaml and .secrets.baseline.

* chore(agent-wiki): remove REVIEW-FINDINGS.md working note

  Accidentally added in the prior commit by `git add -A`; this is a local
  review-notes scratch file, not part of the exploration.

* fix(agent-wiki): harness --out-root tolerant of absolute paths

  experiment_wiki_consult.py rendered the summary footer with
  runs_path.relative_to(REPO_ROOT) / transcripts_dir.relative_to(REPO_ROOT),
  which raised ValueError at the very end of a run when --out-root pointed at
  an absolute path outside the repo. Added a _display_path() helper that
  returns the repo-relative form when the path is under REPO_ROOT and the
  absolute path otherwise. In-repo out-roots still render relative; external
  ones no longer crash.

  (The other open finding in the review notes — over-broad detect-secrets
  exclude — was already narrowed to wikis/ + docs/schema.md in d0e0850.)
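
Three of the harness fixes described in the commit message are small enough to sketch. The code below is an illustration under assumptions, not the harness's actual code: the function names `median` and `display_path` and the `repo/` prefix are hypothetical, and the real `_display_path()` and median logic may differ in detail (for example, whether the durations are already sorted).

```python
from pathlib import PurePath, PurePosixPath


def median(durs: list[float]) -> float:
    """Median that averages the two middle values for even-length input."""
    if not durs:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(durs)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]  # odd length: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even length: mean of the two


def display_path(path: PurePath, repo_root: PurePath) -> str:
    """Repo-relative form when `path` is under `repo_root`, else `path` as given."""
    try:
        return str(path.relative_to(repo_root))
    except ValueError:  # relative_to raises when path is outside repo_root
        return str(path)


# parents[] indexing for the two script locations named in the commit message.
# "repo" is a stand-in for the checkout directory.
old_script = PurePosixPath("repo/tests/e2e/experiment_wiki_consult.py")
new_script = PurePosixPath(
    "repo/explorations/agent-wiki/experiments/harness/experiment_wiki_consult.py"
)
assert old_script.parents[2] == PurePosixPath("repo")
assert new_script.parents[2] == PurePosixPath("repo/explorations/agent-wiki")
assert new_script.parents[4] == PurePosixPath("repo")
```

With an odd trial count such as the default `--trials 3`, `ordered[mid]` is already the median, which is why the commit message notes the default was unaffected.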
1 parent 6de3712 commit 3e26154

34 files changed

Lines changed: 8314 additions & 7 deletions

.pre-commit-config.yaml

Lines changed: 6 additions & 1 deletion
@@ -49,7 +49,12 @@ repos:
 - id: detect-secrets
   name: detect secrets
   args: ['--baseline', '.secrets.baseline']
-  exclude: package.lock.json
+  # Narrowly skip only the generated example-wiki tree and the schema doc's
+  # worked examples: their 12-hex guideline content-hashes and session UUIDs
+  # are identifiers, not secrets. Hand-written code/docs/harness under
+  # explorations/agent-wiki/ stay scanned. (Mirrors the ruff/mypy scoping,
+  # which excludes only explorations/agent-wiki/wikis/.)
+  exclude: 'package.lock.json|^explorations/agent-wiki/wikis/|^explorations/agent-wiki/docs/schema\.md$'

 # Plugin render-equality gate — fails if platform-integrations/ has drifted
 # from plugin-source/. Runs whenever plugin-source/ or the rendered tree
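
To see which paths the narrowed `exclude` pattern actually skips, it can be exercised directly. This is a minimal sketch, assuming the hook applies the pattern as a Python regular expression searched against the repo-relative path (which is how pre-commit treats `exclude`, to my understanding); the sample paths are illustrative.

```python
import re

# The exclude value from the hunk above, copied verbatim.
EXCLUDE = re.compile(
    r"package.lock.json"
    r"|^explorations/agent-wiki/wikis/"
    r"|^explorations/agent-wiki/docs/schema\.md$"
)


def is_excluded(path: str) -> bool:
    """True when the secret scan would skip this repo-relative path."""
    return EXCLUDE.search(path) is not None


samples = [
    "explorations/agent-wiki/wikis/wiki-twobatch-skills/AGENTS.md",  # generated wiki
    "explorations/agent-wiki/docs/schema.md",                        # worked examples
    "explorations/agent-wiki/docs/design.md",                        # hand-written doc
    "explorations/agent-wiki/skills/scripts/build_agent_wiki.py",    # hand-written code
    "package-lock.json",
]
for sample in samples:
    print(f"{'skip' if is_excluded(sample) else 'scan':4}  {sample}")
```

Note that the unescaped dots in `package.lock.json` match any character, so the first alternative also matches `package-lock.json`.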

.secrets.baseline

Lines changed: 6 additions & 6 deletions
@@ -1,9 +1,9 @@
 {
   "exclude": {
-    "files": "^.secrets.baseline$|package-lock\\.json$",
+    "files": "^.secrets.baseline$|package-lock\\.json$|^explorations/agent\\-wiki/wikis/|^explorations/agent\\-wiki/docs/schema\\.md$",
     "lines": null
   },
-  "generated_at": "2026-04-29T16:14:59Z",
+  "generated_at": "2026-06-10T06:41:48Z",
   "plugins_used": [
     {
       "name": "AWSKeyDetector"
@@ -156,11 +156,11 @@
     "sandbox/README.md": [
       {
         "hashed_secret": "b792a28a35da9b44fa0ee8a53002e9c238afb1bd",
+        "is_secret": false,
         "is_verified": false,
-        "line_number": 67,
+        "line_number": 68,
         "type": "Secret Keyword",
-        "verified_result": null,
-        "is_secret": false
+        "verified_result": null
       }
     ],
     "sandbox/sample.env": [
@@ -223,4 +223,4 @@
     "file": null,
     "hash": null
   }
-}
+}

explorations/agent-wiki/README.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# agent-wiki

An exploration in turning agent trajectories into a **reusable, evidence-grounded
wiki** that future agents consult before acting — and the experiments measuring
whether it actually helps.

The core idea: after an agent finishes a task, distill its trajectory into wiki
pages — episodic **summaries**, atomic **guidelines**, themed **cluster** pages,
and executable **skills** — each linked back to the trajectory that produced it.
A future agent, pointed at the wiki's `AGENTS.md`, retrieves the pages relevant
to its task and applies them instead of re-deriving the recipe.

## Layout

```
explorations/agent-wiki/
├── skills/                                 the agent-wiki skill family + the build_agent_wiki.py builder
│   ├── agent-wiki-summarize/               trajectory → episodic summary
│   ├── agent-wiki-extract-guidelines/      trajectory → atomic guidelines
│   ├── agent-wiki-synthesize-skill/        trajectory → executable SKILL.md
│   ├── agent-wiki-consolidate-guidelines/  atomics → themed cluster pages
│   ├── agent-wiki-tasks/                   cross-session task-comparison pages
│   ├── agent-wiki-consult/                 retrieval-time entry point
│   ├── agent-wiki-ingest/                  end-to-end orchestrator (all of the above)
│   └── scripts/build_agent_wiki.py         deterministic builder (render-*/catalog)
├── docs/
│   ├── design.md                           design & rationale
│   └── schema.md                           on-disk page/index schema
└── experiments/                            the empirical evidence (see RESULTS-SUMMARY.md)
    ├── RESULTS-SUMMARY.md
    ├── twobatch-*.md                       the comparison reports (wiki vs no-wiki; skills vs guidelines; …)
    ├── pruned-index-hypothesis.md
    ├── metrics/                            per-trial metric rollups (.jsonl)
    └── harness/                            comparison scripts (re-runnable) + the A/B runner (reference)
```

The example **wikis** built by these skills (`wiki-twobatch` / `-skills` /
`-both` / `-pruned`) are shipped in a companion PR to keep this one focused on
reviewable code — they are ~10k lines of generated output. They land under
`explorations/agent-wiki/wikis/` once that PR merges.

## Reading order

1. **`docs/design.md`** — what the wiki is and why it's shaped this way.
2. **`experiments/RESULTS-SUMMARY.md`** — the running tape of findings
   (wiki cuts cost ~20% at equal accuracy; skills beat guidelines; pointer
   wording is load-bearing; composition matters more than wiki size).
3. **`skills/agent-wiki-ingest/SKILL.md`** — how a batch of traces becomes a
   wiki in one pass.
4. **The example wikis** (companion PR) — open a built `wiki-twobatch-skills/`'s
   `AGENTS.md`, then `_index.jsonl`, then any page, to see a real wiki
   end-to-end.

## Scope of this exploration

The example wikis (companion PR) are **benchmark-derived** (a synthetic 16-task
file-format corpus). The raw per-trial sandbox transcripts and any wikis built
from internal trajectory corpora are intentionally **not** included — only the
metric rollups, the narrative reports, and the benchmark-derived wikis. Source
links in wiki frontmatter are shown in the generic form
`trajectories/<session-id>.json`.

The skills here are a **standalone reference copy**, runnable via
`explorations/agent-wiki/skills/scripts/build_agent_wiki.py`; they are not wired
into any plugin loader in this tree. The experiment **harness** ships the
re-runnable comparison scripts; the sandbox A/B runner
(`experiments/harness/experiment_wiki_consult.py`) is reference-only — it needs
project-level sandbox assets not included here.

explorations/agent-wiki/docs/design.md

Lines changed: 263 additions & 0 deletions
@@ -0,0 +1,263 @@
# Agent-wiki: design & rationale

*A durable, evidence-grounded knowledge layer mined from an agent's own
trajectories, consulted by future agents at recall-time.*

This doc explains **why** the agent-wiki is shaped the way it is, **what**
its pieces are, **how** a raw trace becomes a recallable page, and **what
the experiments show**. It is the canonical design statement; for the
operational contracts it links to the recall recipe
([`_default_agents.md`](../skills/scripts/_default_agents.md),
copied into every wiki as `AGENTS.md`), and the empirical log
([`experiments/RESULTS-SUMMARY.md`](../experiments/RESULTS-SUMMARY.md)).

---

## 1. The problem

Coding agents start every session cold. An agent that spent twenty tool
calls last week discovering that a Debian container has no `pip` and
PEP-668 blocks `pip install` will spend twenty tool calls rediscovering it
next week. The knowledge a session produces dies with the session.

The usual fixes don't hold up:

- **Hand-authored runbooks** drift from reality and carry no provenance —
  you can't tell whether a rule still reflects how the tool behaves, or who
  decided it.
- **Raw trajectory stores** keep everything but generalize nothing. They're
  too bulky to load at recall-time, and a future agent has to re-derive the
  lesson from a transcript instead of reading it.
- **Generic long-term memory** (embed-everything vector stores) is lossy and
  unauditable: a retrieved snippet has no chain back to the moment it was
  true.

The goal: a **knowledge layer the agent earns from its own work** — small
enough to consult cheaply, general enough to apply to unseen-but-related
tasks, and auditable down to the transcript that produced each claim.

## 2. The core idea

Build a **wiki from agent traces**. Each completed trajectory is distilled
into pages; every page links back to the session it came from. Future agents
**consult the wiki once they know the task they're about to do** — after the
user's request is understood and the task family is clear, before writing
code.

```
 past sessions              the wiki                  future session
┌──────────────┐      ┌──────────────────┐       ┌──────────────────┐
│ trajectory A │─┐    │ summaries/       │       │ user states task │
│ trajectory B │─┼──▶ │ guidelines/      │ ◀─────│ agent reads      │
│ trajectory C │─┘    │ skills/ tasks/   │consult│ _index.jsonl,    │
└──────────────┘ dist.│ _index.jsonl     │       │ applies the rule │
        ▲             └──────────────────┘       └──────────────────┘
        └── provenance ──┘
   (each wiki page links back to the trajectory it was distilled from)
```

The wiki is **not** a transcript archive and **not** a session-start
preload. It's a curated, recall-preferred index of distilled lessons that an
agent pulls from on demand.

## 3. Design principles

Each decision below earns its place; the *why* is the point.

### Provenance is mandatory

Every page is traceable, in a couple of clicks, to the raw transcript that
produced it:

```
guideline.md
  ↓ related_summary:
summaries/<session_id>.md
  ↓ sources:
trajectories/<session_id>.json
  ↓ source.transcript_path
~/.../<session_id>.jsonl (the raw trace)
```

Why: a recommendation is only trustworthy if you can audit where it came
from and revise it when the underlying tool behavior changes. Provenance is
what separates this from a generic memory store. Cluster pages aggregate
their members' provenance rather than replacing it.

### Page kinds, and a retrieval preference order

The wiki has five page kinds, and `_index.jsonl` sorts them in **recall
preference order**:

| Kind | What it is | Why it exists |
|---|---|---|
| **cluster** | Themed aggregator over ≥2 atomic guidelines | One consolidated rule instead of N near-duplicate hits |
| **skill** | Callable workflow page + sibling scripts | Directly *executable* — no interpretation needed |
| **guideline** (atomic) | One rule, free-text, trigger-tagged | The base unit; a single distilled lesson |
| **task / subtask** | Cross-session comparison / per-session workstream | Analysis surface, not recall-time advice |
| **summary** | Episodic record of one session | The provenance anchor every other page links to |

Sort order is `cluster → skill → guideline → task`, so the most
consolidated and most directly-actionable artifacts surface first. The exact
retrieval recipe (parse task → read `_index.jsonl` → filter by tag/trigger →
prefer clusters → read top 2–5) lives in the recall contract; see
[`_default_agents.md`](../skills/scripts/_default_agents.md).
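
The retrieval recipe above (read `_index.jsonl`, filter by tag, prefer clusters, read the top few) can be sketched in a few lines. This is an illustration only: the row fields `kind`, `tags`, and `path`, the example rows, and the tie-break by tag overlap are assumptions made for the sketch; the actual index schema is defined in `schema.md` and the actual recipe in `_default_agents.md`.

```python
import json

# Recall preference order stated above: cluster → skill → guideline → task.
KIND_RANK = {"cluster": 0, "skill": 1, "guideline": 2, "task": 3}


def select_pages(index_lines, task_tags, limit=5):
    """Filter index rows by tag overlap, prefer higher-ranked kinds, keep the top few."""
    wanted = set(task_tags)
    hits = []
    for line in index_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)  # one JSON object per line
        overlap = wanted & set(row.get("tags", []))
        if overlap and row.get("kind") in KIND_RANK:
            # Sort key: kind rank first, then larger tag overlap, then path.
            hits.append((KIND_RANK[row["kind"]], -len(overlap), row["path"]))
    hits.sort()
    return [path for _, _, path in hits[:limit]]


# Hypothetical index rows, for illustration only.
example_index = [
    '{"kind": "guideline", "tags": ["pip", "debian"], "path": "guidelines/pip-missing__0123456789ab.md"}',
    '{"kind": "cluster", "tags": ["pip"], "path": "guidelines/pip__cluster.md"}',
    '{"kind": "skill", "tags": ["wav"], "path": "skills/wav-info/SKILL.md"}',
    '{"kind": "task", "tags": ["pip"], "path": "tasks/pip-install.md"}',
]
print(select_pages(example_index, ["pip", "debian"]))
```

In this sketch the cluster page comes first even though the atomic guideline matches more tags, which mirrors the "prefer clusters" step of the recipe.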

### Procedural over declarative where possible

A **guideline** tells a future agent *what to do* ("when pip's module dir is
missing, don't trust `ensurepip`"). A **skill** is a structured workflow page
the agent can *execute* — Overview / When-To-Use / Workflow / optional
sibling scripts it runs via Bash.

Skills are **recall-preferred over guidelines** because they remove an
interpretation step: the agent reads the SKILL.md and runs the recipe
instead of reconstructing it from advice. §5 shows skills also win on cost.

### Consolidation + delete-on-promote

Two cross-trajectory moves keep the recall surface small and non-redundant:

- **Consolidation** clusters ≥2 atomic guidelines that share a real *rule*
  (not merely a topic) into a `__cluster.md` aggregator. Members stay on
  disk with a `superseded_by:` backref — provenance is preserved.
- **Delete-on-promote** (`--archive-covered`): when a skill is synthesized
  (or a cluster created), the atomics it subsumes are **soft-archived** to
  `_archived/`. They leave the recall index but stay auditable on disk; the
  `_audit.log` records the move.

Why: §5's central empirical finding is that **recall quality degrades as the
index grows** — a smaller, non-redundant index helps even on tasks where no
page matches. Consolidation and pruning are how the wiki stays small as it
accumulates traces.
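
The soft-archive move described under delete-on-promote amounts to "move the file, log the move". The sketch below shows that shape only; it is not the builder's implementation. The function name `soft_archive`, the placement of `_audit.log` at the wiki root, and the tab-separated log line are assumptions made for illustration.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def soft_archive(wiki_root: Path, guideline_name: str, reason: str) -> Path:
    """Move guidelines/<name> to _archived/<name> and append a line to _audit.log."""
    src = wiki_root / "guidelines" / guideline_name
    archive_dir = wiki_root / "_archived"
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / guideline_name
    shutil.move(str(src), str(dest))  # the page stays on disk, outside guidelines/
    stamp = datetime.now(timezone.utc).isoformat()
    with (wiki_root / "_audit.log").open("a", encoding="utf-8") as log:
        # Hypothetical log format: timestamp, action, page, reason.
        log.write(f"{stamp}\tarchive\t{guideline_name}\t{reason}\n")
    return dest


# Usage against a throwaway wiki root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guidelines").mkdir()
    (root / "guidelines" / "demo__0123456789ab.md").write_text("rule", encoding="utf-8")
    archived = soft_archive(root, "demo__0123456789ab.md", "covered by a skill")
    print(archived.relative_to(root))
```

A later `catalog` run would then rebuild the recall index without the archived page, while the file and its audit line remain available for review.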

### Recall-time discipline

Consult **once you know the task or sub-task** — not at session start (too
vague to match), not as a last resort when stuck (too late). And the
**pointer wording is load-bearing**: a strong-imperative instruction to
consult the wiki gets followed; a soft "you may want to check" gets skipped
(§5, the A/B sweep). The pointer lives in the workspace `CLAUDE.md` /
`AGENTS.md`; placement and wording both matter.

## 4. How a trace becomes a recallable page

The build pipeline is a sequence of LLM passes, each piping structured JSON
to a deterministic builder
([`build_agent_wiki.py`](../skills/scripts/build_agent_wiki.py))
that writes the page and maintains the indexes:

```
raw trace ─┬─[convert]──▶ normalized JSON

           ├─[summarize]─────────▶ summaries/<sid>.md             render-summary
           ├─[extract-guidelines]▶ guidelines/<slug>__<gid>.md    render-guidelines
           ├─[synthesize-skill]──▶ skills/<slug>/SKILL.md         render-skill --archive-covered
           │                       (per trace, above)
           ├─[consolidate]───────▶ guidelines/<slug>__cluster.md  render-cluster
           │                       (once, cross-corpus)
           └─[catalog]───────────▶ _index.jsonl, indexes, backrefs
```

| Stage | Skill | Builder subcommand | Scope |
|---|---|---|---|
| Convert | (bob-trace-converter / `normalize_stream_json_transcripts.py`) | | per trace |
| Summarize | [`agent-wiki-summarize`](../skills/agent-wiki-summarize/SKILL.md) | `render-summary` | per trace |
| Extract guidelines | [`agent-wiki-extract-guidelines`](../skills/agent-wiki-extract-guidelines/SKILL.md) | `render-guidelines` | per trace |
| Synthesize skill | [`agent-wiki-synthesize-skill`](../skills/agent-wiki-synthesize-skill/SKILL.md) | `render-skill` | per trace |
| Consolidate | [`agent-wiki-consolidate-guidelines`](../skills/agent-wiki-consolidate-guidelines/SKILL.md) | `render-cluster` | **cross-corpus, once** |
| Catalog | (any) | `catalog` | bookkeeping |

**Order matters.** `synthesize-skill` runs *before* `consolidate` so skills
claim recipe-level territory first (and archive the atomics they cover);
consolidation then clusters only the surviving atomics. This matches the
consolidate skill's own rule — don't propose a cluster overlapping a skill's
territory.

**`catalog` renders; `consolidate` proposes.** A sharp edge worth
internalizing: `catalog` only *materializes* clusters already declared in
`_config.yaml` and refreshes indexes/backrefs. It never *proposes* new
clusters. Consolidation is the LLM pass that proposes them. Running `catalog`
and expecting clusters to appear is a mistake — they won't unless
consolidation declared them first.

### The one-pass entry point

[`agent-wiki-ingest`](../skills/agent-wiki-ingest/SKILL.md)
orchestrates the whole pipeline end-to-end (convert → bootstrap → summarize
→ extract → synthesize → consolidate → catalog) via subagent fan-out:
summarize runs in parallel (independent file writes), extract and synthesize
run sequentially (they mutate shared index/config state), consolidation runs
once. It exists specifically so the **consolidation pass is never silently
skipped** when ingesting a batch — the failure mode that motivated it.
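
The scheduling rule in that paragraph (parallel where writes are independent, sequential where state is shared, consolidation exactly once) can be sketched with the standard library. This is a toy model under assumptions: the real orchestrator fans out to subagents rather than threads, the stage callables here are placeholders, and whether extract and synthesize interleave per trace or run as two full passes is not specified above, so the per-trace interleaving is a guess.

```python
from concurrent.futures import ThreadPoolExecutor


def ingest(traces, summarize, extract, synthesize, consolidate, catalog):
    """Run the stages in the order and concurrency the paragraph above describes."""
    # Summaries write independent files, so they may run concurrently.
    with ThreadPoolExecutor() as pool:
        list(pool.map(summarize, traces))  # list() surfaces any worker exception
    # Extract and synthesize mutate shared index/config state: one trace at a time.
    for trace in traces:
        extract(trace)
        synthesize(trace)
    consolidate()  # cross-corpus, exactly once, never skipped
    catalog()      # bookkeeping last, so indexes reflect the declared clusters


# Usage with placeholder stages that only record what ran.
events = []
ingest(
    ["trace-a", "trace-b"],
    summarize=lambda t: events.append(("summarize", t)),
    extract=lambda t: events.append(("extract", t)),
    synthesize=lambda t: events.append(("synthesize", t)),
    consolidate=lambda: events.append(("consolidate",)),
    catalog=lambda: events.append(("catalog",)),
)
print(events)
```

Making `consolidate()` an unconditional step of the single entry point is the design point: a caller cannot ingest a batch and forget the cross-corpus pass.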

### Build patterns

The same corpus can be turned into a wiki three ways, varying *when* the
wiki is built and *what* the agent sees during each trial (see
[`RESULTS-SUMMARY.md` §3–4](../experiments/RESULTS-SUMMARY.md)):

- **Open-loop** — trials run against a fixed external wiki; the new wiki is a
  study log built from observing them.
- **Closed-loop** — trials mount the wiki being built; it grows trial-by-trial,
  so trial N+1 sees what trial N spawned. The only pattern with real
  intra-wiki recall data.
- **Retroactive** — the wiki stays empty during all trials, then is built in
  one batch afterward. Cleanest pure-recipe corpus.

The three real-task themes emerge in **all three** patterns — consolidation
is robust to build order.

## 5. Evidence

All experiments use the same 16-task corpus, `claude_md_strong` pointer,
3 trials/task. `total_cost_usd` is the ground-truth cost metric (cache reads
bill at ~10% of regular input, so raw token sums overcount). Full tables and
methodology: [`experiments/RESULTS-SUMMARY.md`](../experiments/RESULTS-SUMMARY.md).

| Finding | Result | Source |
|---|---|---|
| **Wiki vs no wiki** | −20% cost, −38% duration, −43% tool calls, accuracy unchanged (96%) | [twobatch-comparison](../experiments/twobatch-comparison.md) |
| **Pointer wording is load-bearing** | strong-imperative CLAUDE.md 3/3 reads; soft phrasing 1/3 | [RESULTS-SUMMARY §1](../experiments/RESULTS-SUMMARY.md#1-agentsmd-ab-sweep-the-original) |
| **Build pattern is robust** | same 3 clusters emerge open-/closed-/retroactive | [RESULTS-SUMMARY §3–4](../experiments/RESULTS-SUMMARY.md#34-build-pattern-comparison-closed-loop-vs-retroactive) |
| **Skills > guidelines** | skills-only $0.146 vs guidelines $0.17 (−14%), accuracy 98% vs 96% | [twobatch-skills-comparison](../experiments/twobatch-skills-comparison.md) |
| **Composition is non-additive** | skills+guidelines costs +22% vs skills, +5% vs guidelines | [twobatch-fourway-comparison](../experiments/twobatch-fourway-comparison.md) |
| **Composition > size; skills-only still cheapest** | delete-on-promote (corrected index): −3% vs both, +18% vs skills | [twobatch-fiveway-comparison](../experiments/twobatch-fiveway-comparison.md) |

The throughline across these:

- **The wiki materially reduces cost at equal accuracy.** Savings come
  mainly from fewer tool calls and shorter responses, not from reading fewer
  input bytes — the agent reads *more* wiki bytes but acts more directly.
- **A smaller recall surface helps even when nothing matches.** The
  skills-only arm beat guidelines-only on tasks where *no skill matched*
  (e.g. t2-imports −39%) — evidence that index noise itself costs, which is
  why consolidation and delete-on-promote exist.
- **Don't stack page kinds.** Skills + guidelines together is the worst
  populated wiki, and pruning the redundant atomics doesn't recover the gap.
  Pick procedural-first; let consolidation + archive keep the rest lean.
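
Two pieces of arithmetic sit behind the evidence table: the relative deltas, and why raw token sums overstate cost when cache reads bill at roughly 10% of regular input. The sketch below works both. The two dollar figures come from the "Skills > guidelines" row; the token counts and the helper names are hypothetical, chosen only to show the effect.

```python
def pct_change(new: float, baseline: float) -> float:
    """Relative change of `new` against `baseline`, in percent."""
    return (new - baseline) / baseline * 100


def billable_input_tokens(regular: int, cache_read: int, cache_rate: float = 0.10) -> float:
    """Input tokens weighted by price: cache reads count at ~10% of regular input."""
    return regular + cache_rate * cache_read


# From the table: skills-only $0.146 vs guidelines-only $0.17 per task.
print(round(pct_change(0.146, 0.17)))  # -14

# Hypothetical token counts: a trial that is mostly cache reads.
regular, cache_read = 10_000, 90_000
raw_sum = regular + cache_read
weighted = billable_input_tokens(regular, cache_read)
print(raw_sum, round(weighted))  # the raw sum is several times the billed-equivalent
```

This is why the comparisons above use `total_cost_usd` rather than summed token counts: two arms with the same raw token total can differ widely in cost depending on how much of the input was served from cache.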

## 6. Open questions / limitations

From [`RESULTS-SUMMARY.md`](../experiments/RESULTS-SUMMARY.md)'s open
questions — live, not yet resolved:

- **Statistical power.** Headline numbers rest on 3 trials/task; per-task
  confidence intervals are wide, especially on the two observed regressions
  (wav-info, imports).
- **True transfer.** All experiments reuse the same task in build and recall.
  A real transfer test (build from tasks Y, recall on task X where X ∈
  family(Y), X ∉ Y) would test whether clusters *generalize* rather than
  memorize.
- **Scale.** 16 tasks is small. Does the cost-reduction percentage hold,
  grow, or saturate at 50+ tasks and a larger index?
- **Why composition regresses.** The skills+guidelines penalty is
  output-token-driven, not read-count-driven — trace-level inspection of why
  the agent "says more" when both kinds are present is unresolved.

## See also

- [`schema.md`](schema.md) — the on-disk schema reference: directory layout, per-kind frontmatter, links, and the promotion/archival lifecycle.
- [`_default_agents.md`](../skills/scripts/_default_agents.md) — the recall contract copied into every wiki as `AGENTS.md` (page kinds, retrieval recipe, provenance chain).
- [`experiments/RESULTS-SUMMARY.md`](../experiments/RESULTS-SUMMARY.md) — the full empirical log.
- The `agent-wiki-*` skills under [`skills/`](../skills/) and the builder [`build_agent_wiki.py`](../skills/scripts/build_agent_wiki.py).
