scaffold-eth · technophile-04 · Mar 5, 2026
diff --git a/.agents/evals/EVAL_PLAN.md b/.agents/evals/EVAL_PLAN.md
diff --git a/.agents/evals/INDEX.md b/.agents/evals/INDEX.md
@@ -0,0 +1,149 @@
+# SE-2 Agent Skill Evals
+
+A/B benchmark of SE-2 agent skills: does giving Claude a SKILL.md file improve implementation quality? Tested across 4 iterations covering all 3 tiers (10 skills, 60 independently-graded runs).
+
+## Final Results
+
+### Tier 1 (Iteration 3) — High Capability Uplift
+
+| Skill | With Skill (5 runs) | Without Skill (5 runs) | Delta |
+|-------|---------------------|------------------------|-------|
+| drizzle-neon | 100% | 10% | +90pp |
+| x402 | 100% | 38% | +62pp |
+| eip-5792 | 88% | 50% | +38pp |
+| ponder | 100% | 68% | +32pp |
+| **Overall** | **97%** | **42%** | **+55pp** |
+
+### Tier 2+3 (Iteration 4) — Mostly Encoded Preference
+
+| Skill | Tier | With Skill | Without Skill | Delta |
+|-------|------|-----------|---------------|-------|
+| eip-712 | 2 | 100% (2 runs) | 80% (2 runs) | +20pp |
+| siwe | 2 | 100% (2 runs) | 85% (2 runs) | +15pp |
+| erc-20 | 2 | 100% (2 runs) | 95% (2 runs) | +5pp |
+| erc-721 | 2 | 85% (2 runs) | 90% (2 runs) | -5pp |
+| defi-protocol-templates | 3 | 100% (1 run) | 100% (1 run) | 0pp |
+| solidity-security | 3 | 90% (1 run) | 100% (1 run) | -10pp |
+| **Overall** | | **96%** | **90%** | **+6pp** |
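The Overall row above appears to be weighted by run count rather than a plain average of the six rows (an unweighted mean of the Without Skill column would round to 92%, not 90%). A minimal sketch of that arithmetic, assuming run-weighting is how the figures were aggregated; the names `TIER_2_3` and `weighted_mean` are illustrative and not part of the eval tooling.

```python
# (skill, runs per configuration, with-skill pass rate %, without-skill pass rate %)
# Values are copied from the Tier 2+3 table.
TIER_2_3 = [
    ("eip-712", 2, 100, 80),
    ("siwe", 2, 100, 85),
    ("erc-20", 2, 100, 95),
    ("erc-721", 2, 85, 90),
    ("defi-protocol-templates", 1, 100, 100),
    ("solidity-security", 1, 90, 100),
]


def weighted_mean(rows, column):
    """Average one pass-rate column, weighting each skill by its run count."""
    total_runs = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_runs


with_skill = weighted_mean(TIER_2_3, 2)
without_skill = weighted_mean(TIER_2_3, 3)
# The difference of two percentages is expressed in percentage points (pp).
delta_pp = with_skill - without_skill

print(f"with skill: {with_skill:.0f}%, without skill: {without_skill:.0f}%, delta: {delta_pp:+.0f}pp")
```

Under that assumption the sketch reproduces the 96%, 90%, and +6pp figures in the Overall row.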
+
+## Directory Structure
+
+```
+.agents/evals/
+├── INDEX.md                          <- you are here
+├── blog-post.md                      <- narrative writeup covering all 3 iterations
+│
+└── combined-workspace/               <- all eval data lives here
+    ├── x402-evals.json               <- x402 assertion definitions (evals.json format)
+    │
+    ├── iteration-1/                  <- 1 run per config, independent grading
+    │   ├── benchmark.json            <- aggregated results (generated by aggregate_benchmark.py)
+    │   ├── feedback.json             <- reviewer feedback from eval viewer UI
+    │   └── eval-*/                   <- one dir per skill eval
+    │       ├── eval_metadata.json    <- eval ID, prompt, assertions
+    │       ├── with_skill/
+    │       │   ├── grading.json      <- pass/fail per assertion with evidence
+    │       │   ├── timing.json       <- tokens, duration
+    │       │   └── outputs/
+    │       │       └── summary.md    <- human-readable grading summary
+    │       └── without_skill/
+    │           └── [same structure]
+    │
+    ├── iteration-2/                  <- 3 runs per config, self-graded (biased — see ANALYSIS.md)
+    │   ├── benchmark.json            <- generated by aggregate_benchmark.py
+    │   ├── benchmark.md              <- generated by aggregate_benchmark.py
+    │   ├── ANALYSIS.md               <- self-grading bias discovery (key finding)
+    │   └── eval-*/
+    │       ├── eval_metadata.json
+    │       ├── with_skill/run-{1..3}/
+    │       │   ├── grading.json
+    │       │   ├── timing.json
+    │       │   └── outputs/summary.md
+    │       └── without_skill/run-{1..3}/
+    │           └── [same structure]
+    │
+    └── iteration-3/                  <- 5 runs per config, independent grading (authoritative)
+        ├── benchmark.json            <- generated by aggregate_benchmark.py
+        ├── benchmark.md              <- generated by aggregate_benchmark.py
+        ├── PLAN.md                   <- 2-phase pipeline design, bias controls
+        └── eval-*/
+            ├── eval_metadata.json
+            ├── with_skill/run-{1..5}/
+            │   ├── grading.json
+            │   ├── timing.json
+            │   └── outputs/summary.md
+            └── without_skill/run-{1..5}/
+                └── [same structure]
+```
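Because iteration-1 stores `grading.json` directly under `with_skill/` while iterations 2 and 3 add a `run-N/` level, a script that reads the results has to handle both layouts. The following is a rough sketch of one way to do that with `pathlib`; `find_grading_files` is a hypothetical helper, not part of `aggregate_benchmark.py`, and labelling the single iteration-1 run as `run-1` is an assumption made for illustration.

```python
from pathlib import Path

CONFIGURATIONS = ("with_skill", "without_skill")


def find_grading_files(iteration_dir):
    """Return (eval_dir, configuration, run_label, path) for each grading.json."""
    root = Path(iteration_dir)
    found = []
    for path in sorted(root.rglob("grading.json")):
        parts = path.relative_to(root).parts
        # Expected shapes:
        #   eval-*/<configuration>/grading.json          (iteration-1)
        #   eval-*/<configuration>/run-N/grading.json    (iterations 2 and 3)
        if len(parts) not in (3, 4):
            continue
        if not parts[0].startswith("eval-") or parts[1] not in CONFIGURATIONS:
            continue
        run_label = parts[2] if len(parts) == 4 else "run-1"
        found.append((parts[0], parts[1], run_label, path))
    return found


# Example call (path taken from the tree above):
# rows = find_grading_files(".agents/evals/combined-workspace/iteration-3")
```

Files that do not sit under an `eval-*/with_skill` or `eval-*/without_skill` directory are skipped rather than treated as errors.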
+
+## Skills Tested
+
+4 Tier 1 skills (highest predicted Capability Uplift):
+
+| Skill | Eval ID | Prompt |
+|-------|---------|--------|
+| drizzle-neon | eval-drizzle-db-integration | "I need a database for my dApp to store user profiles with wallet addresses..." |
+| x402 | eval-x402-api-monetization | "I want to monetize an API endpoint in my SE-2 dApp with micropayments..." |
+| ponder | eval-ponder-event-indexing | "I want to index my contract events so I can query historical data with GraphQL..." |
+| eip-5792 | eval-eip5792-batch-txns | "I want to batch multiple contract calls into a single transaction..." |
+
+## Iteration History
+
+| Iteration | Runs | Grading | Key Learning |
+|-----------|------|---------|--------------|
+| 1 | 8 (1 per config) | Independent grader agent | Skills show a +60pp average delta, but the sample size is small. |
+| 2 | 24 (3 per config) | Self-graded (executor grades own work) | **Self-grading bias discovered**: without_skill jumped from 40% to 100%. Time/token data still valid. See `iteration-2/ANALYSIS.md`. |
+| 3 | 40 (5 per config) | Independent grader, AGENTS.md stripped for baseline | **Authoritative results**: 97% vs 42%, +55pp delta. Near-zero variance. |
+| 4 | 20 (1-2 per config) | Independent grader | Tier 2+3 skills: 96% vs 90%, +6pp delta. Mostly encoded preference. |
+
+## Data File Schemas
+
+### eval_metadata.json
+```json
+{"eval_id": 0, "eval_name": "...", "prompt": "...", "assertions": [{"id": "...", "description": "..."}]}
+```
+
+### grading.json
+```json
+{
+  "expectations": [{"text": "assertion text", "passed": true, "evidence": "..."}],
+  "summary": {"passed": 8, "failed": 2, "total": 10, "pass_rate": 0.8}
+}
+```
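The `summary` object looks derivable from the `expectations` list: the example values are consistent with `pass_rate = passed / total` (8 / 10 = 0.8). A small sketch for recomputing it, assuming that relationship holds; `summarize` is a hypothetical name, and returning 0.0 for an empty list is an arbitrary choice made here.

```python
def summarize(expectations):
    """Rebuild the summary block from a list of graded expectations."""
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Guard against division by zero when nothing was graded.
        "pass_rate": passed / total if total else 0.0,
    }


# Ten illustrative expectations: eight passing, two failing.
example = [
    {"text": f"assertion {i}", "passed": i < 8, "evidence": "..."}
    for i in range(10)
]
summary = summarize(example)
```

Comparing the recomputed summary with the stored one is a cheap consistency check on a grader's output.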
+
+### timing.json
+```json
+{"total_tokens": 39805, "duration_ms": 184300, "total_duration_seconds": 184.3}
+```
+
+### benchmark.json (generated by `aggregate_benchmark.py`)
+```json
+{
+  "metadata": {"skill_name": "...", "executor_model": "claude-opus-4-6", "runs_per_configuration": 5},
+  "runs": [{"eval_name": "...", "configuration": "with_skill|without_skill", "run_number": 1, "result": {"pass_rate": 1.0, "passed": 10, "failed": 0, "total": 10, "time_seconds": 184.3, "tokens": 39805}, "expectations": [...], "notes": [...]}],
+  "run_summary": {"with_skill": {"pass_rate": {"mean": 1.0, "stddev": 0.0}}, "without_skill": {...}, "delta": {...}}
+}
+```
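To illustrate how `run_summary` relates to `runs`, here is a sketch that groups runs by `configuration` and computes the mean and standard deviation of `pass_rate`. It is not the actual `aggregate_benchmark.py` logic: the use of population standard deviation (`statistics.pstdev`) is an assumption, and because the `delta` object is elided in the schema above, the delta is returned as a separate number by the hypothetical helper `pass_rate_delta_pp`.

```python
from statistics import mean, pstdev


def summarize_runs(runs):
    """Group runs by configuration and compute pass-rate mean and stddev."""
    rates = {}
    for run in runs:
        rates.setdefault(run["configuration"], []).append(run["result"]["pass_rate"])
    return {
        config: {"pass_rate": {"mean": mean(values), "stddev": pstdev(values)}}
        for config, values in rates.items()
    }


def pass_rate_delta_pp(summary):
    """Difference in mean pass rate, in percentage points."""
    with_mean = summary["with_skill"]["pass_rate"]["mean"]
    without_mean = summary["without_skill"]["pass_rate"]["mean"]
    return (with_mean - without_mean) * 100


# Illustrative runs only; these are not taken from the real benchmark data.
example_runs = [
    {"configuration": "with_skill", "run_number": 1, "result": {"pass_rate": 1.0}},
    {"configuration": "with_skill", "run_number": 2, "result": {"pass_rate": 1.0}},
    {"configuration": "without_skill", "run_number": 1, "result": {"pass_rate": 0.5}},
    {"configuration": "without_skill", "run_number": 2, "result": {"pass_rate": 0.25}},
]
example_summary = summarize_runs(example_runs)
example_delta = pass_rate_delta_pp(example_summary)
```

Population standard deviation is used so that a configuration with a single run yields 0.0 instead of raising an error; a sample standard deviation would be an equally plausible choice.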
+
+### feedback.json (generated by eval viewer UI)
+```json
+{"reviews": [{"run_id": "...", "feedback": "...", "timestamp": "..."}], "status": "reviewed"}
+```
+
+## Viewer
+
+```bash
+python3 ~/.claude/plugins/cache/claude-plugins-official/skill-creator/205b6e0b3036/skills/skill-creator/eval-viewer/generate_review.py \
+  .agents/evals/combined-workspace/iteration-3 \
+  --skill-name "SE-2 Tier 1 Skills" \
+  --benchmark .agents/evals/combined-workspace/iteration-3/benchmark.json
+```
+
+The viewer opens at http://localhost:3117. Use `--port <N>` to change the port, or `--static /tmp/report.html` to export a static report instead.
+
+## Key Docs
+
+| File | What it covers |
+|------|---------------|
+| `blog-post.md` | Full narrative of all 3 iterations including the self-grading bias discovery and final results |
+| `combined-workspace/iteration-2/ANALYSIS.md` | Technical deep-dive on self-grading bias: 60pp gap, why it happens, what metrics remain reliable |
+| `combined-workspace/iteration-3/PLAN.md` | 2-phase pipeline design, AGENTS.md context contamination fix, why 5 runs |