1 change: 1 addition & 0 deletions explorations/agent-wiki/README.md
@@ -18,6 +18,7 @@ explorations/agent-wiki/
│ ├── agent-wiki-summarize/ trajectory → episodic summary
│ ├── agent-wiki-extract-guidelines/ trajectory → atomic guidelines
│ ├── agent-wiki-synthesize-skill/ trajectory → executable SKILL.md
│ ├── agent-wiki-compare-outcomes/ success vs failed trajectories → contrastive guidelines
│ ├── agent-wiki-consolidate-guidelines/ atomics → themed cluster pages
│ ├── agent-wiki-tasks/ cross-session task-comparison pages
│ ├── agent-wiki-consult/ retrieval-time entry point
25 changes: 21 additions & 4 deletions explorations/agent-wiki/docs/design.md
@@ -154,6 +154,9 @@ raw trace ─┬─[convert]──▶ normalized JSON
├─[extract-guidelines]▶ guidelines/<slug>__<gid>.md render-guidelines
├─[synthesize-skill]──▶ skills/<slug>/SKILL.md render-skill --archive-covered
│ (per trace, above)
├─[compare-outcomes]──▶ guidelines/<slug>__<gid>.md render-guidelines
│ (cross-corpus, conditional:
│ only with success/failure contrast)
├─[consolidate]───────▶ guidelines/<slug>__cluster.md render-cluster
│ (once, cross-corpus)
└─[catalog]───────────▶ _index.jsonl, indexes, backrefs
@@ -165,6 +168,7 @@ raw trace ─┬─[convert]──▶ normalized JSON
| Summarize | [`agent-wiki-summarize`](../skills/agent-wiki-summarize/SKILL.md) | `render-summary` | per trace |
| Extract guidelines | [`agent-wiki-extract-guidelines`](../skills/agent-wiki-extract-guidelines/SKILL.md) | `render-guidelines` | per trace |
| Synthesize skill | [`agent-wiki-synthesize-skill`](../skills/agent-wiki-synthesize-skill/SKILL.md) | `render-skill` | per trace |
| Compare outcomes | [`agent-wiki-compare-outcomes`](../skills/agent-wiki-compare-outcomes/SKILL.md) | `render-guidelines` | **cross-corpus, conditional** |
| Consolidate | [`agent-wiki-consolidate-guidelines`](../skills/agent-wiki-consolidate-guidelines/SKILL.md) | `render-cluster` | **cross-corpus, once** |
| Catalog | (any) | `catalog` | bookkeeping |

@@ -174,6 +178,17 @@ consolidation then clusters only the surviving atomics. This matches the
consolidate skill's own rule — don't propose a cluster overlapping a skill's
territory.

**Learning from contrast, not just from one trace.** The per-trace passes
(summarize / extract / synthesize) each mine one trajectory. `compare-outcomes`
is different: it contrasts *successful vs failed* runs of the same (or similar)
task and promotes a **contrastive guideline** only when a rule is backed by a
failed path, a successful path, and concrete trajectory evidence (task wording,
observed tool/API calls, transcript/doc snippets). It can LLM-judge
success/failure from the normalized transcript, so it doesn't depend on
benchmark-specific outcome labels. It's **conditional** — it runs only when the
corpus actually contains a success/failure contrast, and runs *before*
consolidate so its contrastive atoms can join clusters.

**`catalog` renders; `consolidate` proposes.** A sharp edge worth
internalizing: `catalog` only *materializes* clusters already declared in
`_config.yaml` and refreshes indexes/backrefs. It never *proposes* new
@@ -185,11 +200,13 @@ consolidation declared them first.

[`agent-wiki-ingest`](../skills/agent-wiki-ingest/SKILL.md)
orchestrates the whole pipeline end-to-end (convert → bootstrap → summarize
→ extract → synthesize → compare-outcomes → consolidate → catalog) via subagent fan-out:
summarize runs in parallel (independent file writes), extract and synthesize
run sequentially (they mutate shared index/config state), compare-outcomes
runs once over the corpus when there's a success/failure contrast (else it's
skipped), and consolidation runs once. The ingest skill exists specifically so
the **consolidation pass is never silently skipped** when ingesting a batch,
the failure mode that motivated it.
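
The stage ordering described here can be sketched in a few lines. This is a minimal illustration, not the ingest skill's implementation: `ingest`, `make_stage`, and the string labels are hypothetical stand-ins for subagent calls, convert and bootstrap are omitted, and running extract over every trace before synthesize is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Stage = Callable[[str], str]


def ingest(traces: List[str], stages: Dict[str, Stage], has_contrast: bool) -> List[str]:
    """Return 'stage:target' labels in the order the stages were applied."""
    log: List[str] = []

    # Summaries write independent files, so they can fan out in parallel.
    with ThreadPoolExecutor() as pool:
        log.extend(pool.map(stages["summarize"], traces))

    # Extract and synthesize mutate shared index/config state: one at a time.
    for name in ("extract", "synthesize"):
        for trace in traces:
            log.append(stages[name](trace))

    # Cross-corpus passes run once; compare-outcomes only when a contrast exists.
    if has_contrast:
        log.append(stages["compare-outcomes"]("corpus"))
    log.append(stages["consolidate"]("corpus"))
    log.append(stages["catalog"]("corpus"))
    return log


def make_stage(name: str) -> Stage:
    # Hypothetical stand-in for a subagent call; it only labels its input.
    return lambda target: f"{name}:{target}"


STAGE_NAMES = [
    "summarize", "extract", "synthesize",
    "compare-outcomes", "consolidate", "catalog",
]
stages = {name: make_stage(name) for name in STAGE_NAMES}
```

Consolidate is unconditional in this sketch, which mirrors the stated goal that the consolidation pass is never skipped.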

### Build patterns

134 changes: 134 additions & 0 deletions explorations/agent-wiki/skills/agent-wiki-compare-outcomes/SKILL.md
@@ -0,0 +1,134 @@
---
name: agent-wiki-compare-outcomes
description: Compare successful and failed normalized agent trajectories to derive evidence-backed agent-wiki guidelines. Use when Codex has multiple runs for the same or similar task, evaluator outcomes, failed/successful variants, benchmark trajectories, or wants to learn rules from contrasts rather than from one trajectory alone.
---

# Agent Wiki — Compare Outcomes

## Overview

Use this pass after summarize/extract/synthesize when there are multiple
trajectories that can be judged as successful or failed. It can judge
outcomes with an LLM from the normalized transcript, so it does not
depend on benchmark-specific success/failure labels. It derives
**contrastive guidelines**: rules supported by a failed path, a
successful path, and concrete evidence drawn from task wording, tool/API
documentation, observed tool/API calls, the transcript, and, when available,
failure snippets and an LLM success/failure judgment.

This pass exists to avoid hand-authored domain knowledge. Do not write a rule
just because you know the benchmark or application. Write a rule only when the
input trajectories contain the evidence.

## Workflow

### Step 1: Build an Evidence Pack

Run the bundled script over normalized trajectory JSON files:

```bash
uv run python explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py \
  --input <normalized-dir-or-json> \
  --out-json <analysis.json> \
  --out-md <analysis.md> \
  --judge-outcomes always
```

Pass `--input` multiple times to compare several experiment arms.

The script groups traces by `metadata.task_id` when present; otherwise it uses
a normalized task request. For each group it compares successful and failed
runs, then extracts:

- task request text;
- stored outcome and failure snippets when present;
- LLM-judged outcome when `--judge-outcomes missing` or
`--judge-outcomes always` is set;
- observed tool/API calls from `stats.top_tools`, code snippets, and source
`api_calls.jsonl` when available;
- tool/API descriptions shown in the trajectory transcript.

Judging modes:

- `--judge-outcomes never`: use only stored `outcome.success`.
- `--judge-outcomes missing`: judge only traces without stored outcomes.
- `--judge-outcomes always`: ignore stored success labels and use the LLM
judgment for all traces.
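
As a rough sketch of how the three modes differ for a single trace (the function name `resolve_outcome` and the `judge` callback are hypothetical, not the script's API):

```python
from typing import Callable, Optional


def resolve_outcome(
    mode: str,
    stored: Optional[bool],
    judge: Callable[[], bool],
) -> Optional[bool]:
    """Pick the success label for one trace under a --judge-outcomes mode."""
    if mode == "never":
        return stored  # may be None when nothing was stored
    if mode == "missing":
        return stored if stored is not None else judge()
    if mode == "always":
        return judge()  # the stored label is ignored
    raise ValueError(f"unknown judge mode: {mode!r}")
```

Passing the judge as a callback keeps the LLM call lazy, so `never` and a satisfied `missing` never invoke it.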

Prefer `--judge-outcomes always` when the available stored labels come from a
benchmark evaluator or another dataset-specific schema. Use stored outcomes
only when they are trusted, dataset-neutral annotations you are comfortable
using as ground truth.

Use `--judge-include-failures` when generic failure reports or evaluator
snippets are available and you want the LLM to interpret them. This does not
require benchmark-specific code; the snippets are passed as opaque evidence.
Without failure snippets or ground truth, an LLM can still identify obvious
tool errors, step-limit failures, missing finalization, or apparent success,
but it may not detect silent semantic mismatches.

### Step 2: Inspect Candidate Rules

Read the generated Markdown. A candidate is promotable only if it has:

- at least one failed trajectory and one successful trajectory in the same
group;
- a task-action tool/API or workflow difference between them, not just
authentication, documentation lookup, or finalization calls;
- a comparison between plausible alternatives in the same tool namespace,
unless the transcript evidence clearly supports a cross-namespace workflow
rule;
- evidence that the successful tool/API is more semantically aligned with the
current task wording, or that the transcript/failure evidence names the
failed side effect;
- source trajectory IDs for both sides.
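
The checklist above could be encoded as a simple gate. The sketch below is illustrative only: the `Candidate` fields, the `NON_TASK_CALLS` names, and the `namespace.method` naming convention are all assumptions, and the semantic-alignment criterion is reduced to "some evidence was recorded", which a real reviewer would still have to read.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical names for calls that do not count as task actions.
NON_TASK_CALLS: Set[str] = {"auth.login", "docs.lookup", "task.finalize"}


@dataclass
class Candidate:
    success_ids: List[str]
    failure_ids: List[str]
    success_only_calls: Set[str]
    failure_only_calls: Set[str]
    evidence: List[str] = field(default_factory=list)
    cross_namespace_supported: bool = False


def namespace(call: str) -> str:
    # Assumes "namespace.method" call names.
    return call.split(".", 1)[0]


def missing_requirements(c: Candidate) -> List[str]:
    """List unmet promotion criteria; an empty list means promotable."""
    missing: List[str] = []
    if not (c.success_ids and c.failure_ids):
        missing.append("source trajectory IDs for both sides")
    good = c.success_only_calls - NON_TASK_CALLS
    bad = c.failure_only_calls - NON_TASK_CALLS
    if not (good or bad):
        missing.append("task-action difference")
    elif good and bad and not c.cross_namespace_supported:
        shared = {namespace(x) for x in good} & {namespace(x) for x in bad}
        if not shared:
            missing.append("same-namespace alternatives")
    if not c.evidence:
        missing.append("wording or failure evidence")
    return missing
```

A candidate with a non-empty result would stay a hypothesis rather than being promoted.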

If the evidence is incomplete, keep it as a hypothesis. Hypotheses are useful
for evaluation notes but should not be promoted into future-agent instructions.

### Step 3: Promote Carefully

When a candidate is strong, render it as a guideline with provenance:

```json
{
  "entities": [
    {
      "type": "guideline",
      "title": "Choose record source from task wording",
      "content": "Apply this rule only when the live choice is between the observed successful and failed APIs, or between APIs with the same documented meanings. Prefer the successful source when the request matches its observed documentation. Do not apply this rule when the request explicitly uses failed-side terms; inspect the failed-side source instead. Do not generalize this rule to other record families or unrelated APIs unless a separate contrast includes those APIs.",
      "rationale": "In the contrasted trajectories, failed runs used a feed endpoint for a task about the user's own transactions, while the successful run used the documented account-owned transaction endpoint.",
      "trigger": "Use only when choosing between the observed successful and failed APIs and the task wording aligns with the successful-side documentation; skip when the task explicitly mentions failed-side terms or asks about a different record family.",
      "session_id": "<comparison-id>",
      "agent": "agent-wiki-compare-outcomes",
      "tags": ["contrastive", "tool-selection", "data-source-routing"],
      "normalized_path": "<analysis.json>"
    }
  ]
}
```

Pipe through the normal helper:

```bash
cat /tmp/contrastive-guideline.json | uv run python explorations/agent-wiki/skills/scripts/build_agent_wiki.py --wiki-root <wiki-root> render-guidelines
uv run python explorations/agent-wiki/skills/scripts/build_agent_wiki.py --wiki-root <wiki-root> catalog
```

## Guardrails

- Do not derive rules from outcome labels or private evaluator data alone.
Outcome labels and LLM judgments can identify which side failed, but the
proposed future behavior must come from trajectory-visible task wording,
observed calls, or observed documentation.
- Do not invent tool/API names. A concrete name must appear in a call or in
retrieved documentation.
- Prefer generic rule wording first, with tool-specific examples under
evidence. The wiki can specialize only where the evidence supports it.
- Keep triggers narrow. Name the observed successful and failed API pair, add
the successful-side positive terms, and add explicit counter-scope for
failed-side terms and unrelated record families.
- Record counterexamples: if a failed run and a successful run used the same
tool, note that this pass did not identify a source-selection rule.
- Keep confidence explicit. High confidence requires at least one success, one
failure, and a clear successful-only vs failed-only behavior difference.
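
Two of these guardrails lend themselves to mechanical checks. The sketch below assumes tool/API names are compared as exact strings and that anything short of high confidence is labeled `"low"`; both function names and the two-level scale are illustrative, not part of the skill.

```python
from typing import Iterable, List, Set


def ungrounded_names(
    rule_names: Iterable[str],
    observed_calls: Set[str],
    documented: Set[str],
) -> List[str]:
    """Return rule-cited names that appear in neither a call nor retrieved docs."""
    seen = observed_calls | documented
    return sorted(name for name in rule_names if name not in seen)


def confidence(
    successes: int,
    failures: int,
    success_only: Set[str],
    failure_only: Set[str],
) -> str:
    """High only with both outcomes and a one-sided behavior on each side."""
    if successes >= 1 and failures >= 1 and success_only and failure_only:
        return "high"
    return "low"  # assumed label for everything that is not high
```

When both sides used the same tool, `success_only` and `failure_only` are empty, so the result stays `"low"`, which matches the counterexample guardrail.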