596 changes: 596 additions & 0 deletions .agents/evals/EVAL_PLAN.md

Large diffs are not rendered by default.

149 changes: 149 additions & 0 deletions .agents/evals/INDEX.md
@@ -0,0 +1,149 @@
# SE-2 Agent Skill Evals

A/B benchmark of SE-2 agent skills: does giving Claude a SKILL.md file improve implementation quality? Tested across 4 iterations covering all 3 tiers (10 skills, 60 independently graded runs: 40 in iteration 3 and 20 in iteration 4, per the tables below).

## Final Results

### Tier 1 (Iteration 3) — High Capability Uplift

| Skill | With Skill (5 runs) | Without Skill (5 runs) | Delta |
|-------|---------------------|------------------------|-------|
| drizzle-neon | 100% | 10% | +90pp |
| x402 | 100% | 38% | +62pp |
| eip-5792 | 88% | 50% | +38pp |
| ponder | 100% | 68% | +32pp |
| **Overall** | **97%** | **42%** | **+55pp** |

### Tier 2+3 (Iteration 4) — Mostly Encoded Preference

| Skill | Tier | With Skill | Without Skill | Delta |
|-------|------|-----------|---------------|-------|
| eip-712 | 2 | 100% (2 runs) | 80% (2 runs) | +20pp |
| siwe | 2 | 100% (2 runs) | 85% (2 runs) | +15pp |
| erc-20 | 2 | 100% (2 runs) | 95% (2 runs) | +5pp |
| erc-721 | 2 | 85% (2 runs) | 90% (2 runs) | -5pp |
| defi-protocol-templates | 3 | 100% (1 run) | 100% (1 run) | 0pp |
| solidity-security | 3 | 90% (1 run) | 100% (1 run) | -10pp |
| **Overall** | | **96%** | **90%** | **+6pp** |

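The Overall row appears consistent with a run-weighted mean rather than a simple average of the six per-skill rows (a simple average of the Without Skill column would give about 92%, not 90%). A minimal sketch of that arithmetic, assuming each percentage is the mean pass rate over the stated number of runs; the `rows` list and `weighted_mean` helper are illustrative and not part of the eval tooling:

```python
# (skill, runs per configuration, with-skill %, without-skill %), copied from the table above.
rows = [
    ("eip-712", 2, 100, 80),
    ("siwe", 2, 100, 85),
    ("erc-20", 2, 100, 95),
    ("erc-721", 2, 85, 90),
    ("defi-protocol-templates", 1, 100, 100),
    ("solidity-security", 1, 90, 100),
]


def weighted_mean(values, weights):
    """Mean of `values` where values[i] counts weights[i] times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


runs = [r[1] for r in rows]
with_skill = weighted_mean([r[2] for r in rows], runs)
without_skill = weighted_mean([r[3] for r in rows], runs)
delta_pp = with_skill - without_skill  # difference in percentage points

print(f"with={with_skill:.0f}% without={without_skill:.0f}% delta={delta_pp:+.0f}pp")
# prints: with=96% without=90% delta=+6pp
```
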
## Directory Structure

```
.agents/evals/
├── INDEX.md  <- you are here
├── blog-post.md  <- narrative writeup covering all 3 iterations
└── combined-workspace/  <- all eval data lives here
    ├── x402-evals.json  <- x402 assertion definitions (evals.json format)
    ├── iteration-1/  <- 1 run per config, independent grading
    │   ├── benchmark.json  <- aggregated results (generated by aggregate_benchmark.py)
    │   ├── feedback.json  <- reviewer feedback from eval viewer UI
    │   └── eval-*/  <- one dir per skill eval
    │       ├── eval_metadata.json  <- eval ID, prompt, assertions
    │       ├── with_skill/
    │       │   ├── grading.json  <- pass/fail per assertion with evidence
    │       │   ├── timing.json  <- tokens, duration
    │       │   └── outputs/
    │       │       └── summary.md  <- human-readable grading summary
    │       └── without_skill/
    │           └── [same structure]
    ├── iteration-2/  <- 3 runs per config, self-graded (biased; see ANALYSIS.md)
    │   ├── benchmark.json  <- generated by aggregate_benchmark.py
    │   ├── benchmark.md  <- generated by aggregate_benchmark.py
    │   ├── ANALYSIS.md  <- self-grading bias discovery (key finding)
    │   └── eval-*/
    │       ├── eval_metadata.json
    │       ├── with_skill/run-{1..3}/
    │       │   ├── grading.json
    │       │   ├── timing.json
    │       │   └── outputs/summary.md
    │       └── without_skill/run-{1..3}/
    │           └── [same structure]
    └── iteration-3/  <- 5 runs per config, independent grading (authoritative)
        ├── benchmark.json  <- generated by aggregate_benchmark.py
        ├── benchmark.md  <- generated by aggregate_benchmark.py
        ├── PLAN.md  <- 2-phase pipeline design, bias controls
        └── eval-*/
            ├── eval_metadata.json
            ├── with_skill/run-{1..5}/
            │   ├── grading.json
            │   ├── timing.json
            │   └── outputs/summary.md
            └── without_skill/run-{1..5}/
                └── [same structure]
```
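
A minimal sketch of walking this layout to collect per-run pass rates, assuming the iteration-2/iteration-3 shape (`eval-*/<configuration>/run-N/grading.json`) and the `grading.json` schema listed under Data File Schemas. The `collect_pass_rates` helper is hypothetical; it is not taken from `aggregate_benchmark.py`, whose actual logic is not shown here.

```python
import json
from pathlib import Path


def collect_pass_rates(iteration_dir):
    """Return {(eval_name, configuration): [pass_rate, ...]} for one iteration directory."""
    results = {}
    # Assumed layout: eval-*/<with_skill|without_skill>/run-N/grading.json
    for grading_path in sorted(Path(iteration_dir).glob("eval-*/*/run-*/grading.json")):
        run_dir = grading_path.parent
        configuration = run_dir.parent.name       # with_skill or without_skill
        eval_name = run_dir.parent.parent.name    # e.g. eval-x402-api-monetization
        summary = json.loads(grading_path.read_text())["summary"]
        results.setdefault((eval_name, configuration), []).append(summary["pass_rate"])
    return results


if __name__ == "__main__":
    rates = collect_pass_rates(".agents/evals/combined-workspace/iteration-3")
    for (eval_name, configuration), values in sorted(rates.items()):
        print(eval_name, configuration, values)
```

Iteration 1 has no `run-N` level, so the glob pattern would need to drop that segment there.
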

## Skills Tested

4 Tier 1 skills (highest predicted Capability Uplift):

| Skill | Eval ID | Prompt |
|-------|---------|--------|
| drizzle-neon | eval-drizzle-db-integration | "I need a database for my dApp to store user profiles with wallet addresses..." |
| x402 | eval-x402-api-monetization | "I want to monetize an API endpoint in my SE-2 dApp with micropayments..." |
| ponder | eval-ponder-event-indexing | "I want to index my contract events so I can query historical data with GraphQL..." |
| eip-5792 | eval-eip5792-batch-txns | "I want to batch multiple contract calls into a single transaction..." |

## Iteration History

| Iteration | Runs | Grading | Key Learning |
|-----------|------|---------|--------------|
| 1 | 8 (1 per config) | Independent grader agent | Skills show +60pp avg delta. Small sample size. |
| 2 | 24 (3 per config) | Self-graded (executor grades own work) | **Self-grading bias discovered**: without_skill jumped from 40% to 100%. Time/token data still valid. See `iteration-2/ANALYSIS.md`. |
| 3 | 40 (5 per config) | Independent grader, AGENTS.md stripped for baseline | **Authoritative results**: 97% vs 42%, +55pp delta. Near-zero variance. |

## Data File Schemas

### eval_metadata.json
```json
{"eval_id": 0, "eval_name": "...", "prompt": "...", "assertions": [{"id": "...", "description": "..."}]}
```

### grading.json
```json
{
  "expectations": [{"text": "assertion text", "passed": true, "evidence": "..."}],
  "summary": {"passed": 8, "failed": 2, "total": 10, "pass_rate": 0.8}
}
```
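
The `summary` fields in the example are consistent with simple counts over `expectations` (8 passed of 10 gives `pass_rate` 0.8). A minimal sketch under that assumption; `summarize` is a hypothetical helper, and returning 0.0 for an empty list is a choice made here rather than documented behaviour:

```python
def summarize(expectations):
    """Build a `summary` object from a list of graded expectations."""
    total = len(expectations)
    passed = sum(1 for e in expectations if e["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Assumption: pass_rate is passed / total; 0.0 when nothing was graded.
        "pass_rate": passed / total if total else 0.0,
    }


# Ten illustrative expectations, the first eight passing.
expectations = [
    {"text": f"assertion {i}", "passed": i < 8, "evidence": "..."} for i in range(10)
]
print(summarize(expectations))
# prints: {'passed': 8, 'failed': 2, 'total': 10, 'pass_rate': 0.8}
```
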

### timing.json
```json
{"total_tokens": 39805, "duration_ms": 184300, "total_duration_seconds": 184.3}
```

### benchmark.json (generated by `aggregate_benchmark.py`)
```json
{
  "metadata": {"skill_name": "...", "executor_model": "claude-opus-4-6", "runs_per_configuration": 5},
  "runs": [{
    "eval_name": "...", "configuration": "with_skill|without_skill", "run_number": 1,
    "result": {"pass_rate": 1.0, "passed": 10, "failed": 0, "total": 10, "time_seconds": 184.3, "tokens": 39805},
    "expectations": [...], "notes": [...]
  }],
  "run_summary": {"with_skill": {"pass_rate": {"mean": 1.0, "stddev": 0.0}}, "without_skill": {...}, "delta": {...}}
}
```
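
A rough sketch of how `run_summary` could be derived from `runs`. Several details are assumptions: the schema above elides the shape of `delta`, so the single `pass_rate` field here is a guess, and whether `stddev` is a population or sample standard deviation is not stated (population, `statistics.pstdev`, is used here). `summarize_runs` is a hypothetical name and this is not the code of `aggregate_benchmark.py`.

```python
from statistics import mean, pstdev


def summarize_runs(runs):
    """Aggregate per-run pass rates into a run_summary-shaped dict."""
    by_config = {}
    for run in runs:
        by_config.setdefault(run["configuration"], []).append(run["result"]["pass_rate"])
    summary = {
        config: {"pass_rate": {"mean": mean(rates), "stddev": pstdev(rates)}}
        for config, rates in by_config.items()
    }
    # Assumed delta shape: difference of mean pass rates as a fraction
    # (multiply by 100 to read it as percentage points).
    # Requires at least one run for each of the two configurations.
    summary["delta"] = {
        "pass_rate": summary["with_skill"]["pass_rate"]["mean"]
        - summary["without_skill"]["pass_rate"]["mean"]
    }
    return summary


# Illustrative runs only; these are not values from the benchmark.
runs = [
    {"configuration": "with_skill", "run_number": 1, "result": {"pass_rate": 1.0}},
    {"configuration": "with_skill", "run_number": 2, "result": {"pass_rate": 1.0}},
    {"configuration": "without_skill", "run_number": 1, "result": {"pass_rate": 0.5}},
    {"configuration": "without_skill", "run_number": 2, "result": {"pass_rate": 0.25}},
]
summary = summarize_runs(runs)
print(summary)
```
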

### feedback.json (generated by eval viewer UI)
```json
{"reviews": [{"run_id": "...", "feedback": "...", "timestamp": "..."}], "status": "reviewed"}
```

## Viewer

```bash
python3 ~/.claude/plugins/cache/claude-plugins-official/skill-creator/205b6e0b3036/skills/skill-creator/eval-viewer/generate_review.py \
  .agents/evals/combined-workspace/iteration-3 \
  --skill-name "SE-2 Tier 1 Skills" \
  --benchmark .agents/evals/combined-workspace/iteration-3/benchmark.json
```

The viewer opens at http://localhost:3117. Use `--port <N>` to change the port, or `--static /tmp/report.html` for a static export.

## Key Docs

| File | What it covers |
|------|---------------|
| `blog-post.md` | Full narrative of all 3 iterations including the self-grading bias discovery and final results |
| `combined-workspace/iteration-2/ANALYSIS.md` | Technical deep-dive on self-grading bias: 60pp gap, why it happens, what metrics remain reliable |
| `combined-workspace/iteration-3/PLAN.md` | 2-phase pipeline design, AGENTS.md context contamination fix, why 5 runs |