Trim tier 2+3 skills to eval-proven content (83% reduction) by technophile-04 · Pull Request #1245 · scaffold-eth/scaffold-eth-2

technophile-04 · 2026-03-14T13:20:54Z

Summary

Based on iteration-4 A/B eval data (20 runs, independent grading), trimmed 6 tier 2+3 skills to only content the model actually gets wrong without them
Total reduction: 2,123 lines → 365 lines (83%)
Follow-up: removed erc-20 and skills with negative/zero delta entirely

Iteration 4 Findings

Tier 2+3 skills show dramatically smaller uplift than tier 1:

Metric	Tier 1 (Iter 3)	Tier 2+3 (Iter 4)
With Skill pass rate	97%	96%
Without Skill pass rate	42%	90%
Delta	+55pp	+6pp
Skills tested	4	6
Total runs	40	20

Per-skill results

Skill	Tier	With Skill	Without Skill	Delta	Verdict
eip-712	2	100%	80%	+20pp	Keep — shared utility module + `as const` pattern
siwe	2	100%	85%	+15pp	Keep — viem SIWE vs `siwe` npm pkg, domain validation
erc-20	2	100%	95%	+5pp	Removed — within noise
erc-721	2	85%	90%	-5pp	Removed — no value demonstrated
defi-protocol-templates	3	100%	100%	0pp	Removed — zero delta
solidity-security	3	90%	100%	-10pp	Removed — negative delta

Key insight

The model already knows ERC-20, ERC-721, DeFi staking, and Solidity security well enough. These skills encode preference, not capability. Only eip-712 and siwe show real discriminating assertions (shared utility module, as const, viem SIWE, domain validation).

What was kept (trimmed)

Skill	Lines	What's left
eip-712	230→60	Shared utility module pattern, `as const` requirement
siwe	547→100	"Use viem SIWE not siwe npm pkg", domain validation, lazy session getter, hasSeenWalletConnected ref

What was removed

erc-20, erc-721: Removed entirely (delta within noise or negative)
defi-protocol-templates, solidity-security: Removed entirely (0pp or -10pp delta)
From kept skills: everything the model already knows — basic patterns, OZ v5 usage, well-known addresses, full code examples for flows the model implements correctly on its own

Test plan

Eval data in iteration-4/ANALYSIS.md on the base branch
Each kept section maps to a discriminating assertion (passes with skill, fails without)
Each removed section/skill maps to a non-discriminating assertion (passes regardless)
Cross-iteration comparison validates tier classification

🤖 Generated with Claude Code

Based on iteration-4 eval data (20 runs, independent grading), removed content the model already knows and kept only what it actually gets wrong without the skill: - eip-712 (230→60 lines): Keep shared utility module, as const pattern - siwe (547→100 lines): Keep viem-not-siwe-pkg, domain validation, lazy session getter, hasSeenWalletConnected ref - erc-721 (211→65 lines): Keep _safeMint reentrancy, SVG stack-too-deep, attributes array, IPFS trailing slash - erc-20 (159→65 lines): Keep ERC20Capped, SafeERC20, fee-on-transfer - defi-protocol-templates (443→40 lines): Convert to checklist (0pp delta) - solidity-security (533→35 lines): Convert to checklist (-10pp delta) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

damianmarti · 2026-03-17T17:52:25Z

Great analysis!

When I tested the ERC-20 skill, I checked the resulting code with and without the skill, and using the skill was better. You can find the comment here scaffold-eth/create-eth-extensions#118 (comment)

Now it seems like the ERC-20 skill is not needed anymore. Is this for something that changed from when I did the test or something else?

technophile-04 · 2026-03-18T14:18:16Z

@damianmarti Great catch! You were right, the ERC-20 skill was providing real value that our eval wasn't measuring.

I re-ran the eval with assertions based on your specific findings (named imports, _update override, ERC20Permit, dynamic decimals) and found:

Assertion	With Skill (3 runs)	Without Skill (3 runs)	Delta
Named imports (`import {ERC20}` not `import "..."`)	3/3	0/3	+100pp
`_update` override (required for OZ v5 compilation)	3/3	0/3	+100pp
ERC20Capped (vs manual cap)	3/3	3/3	0pp
ERC20Permit	0/3	0/3	0pp

So instead of restoring the ERC-20 skill, we're adding a unified OpenZeppelin skill inspired by OZ's official skill. It teaches:

Library-first development (prefer ERC20Capped over manual cap)
Pattern discovery from installed source (read the actual node_modules to find import syntax, virtual functions, constructor params)
Version-safe development (don't assume patterns from memory)

This covers ERC-20, ERC-721, and any future OZ contract in one skill. The ERC-721 skill stays for NFT-specific pitfalls (_safeMint reentrancy, SVG stack-too-deep, metadata attributes).

Eval data for iterations 6-7 pushed to PR #1241.

damianmarti · 2026-03-18T14:23:15Z

@technophile-04 great! thanks!!

damianmarti · 2026-03-18T14:26:51Z

@technophile-04 maybe add something for ERC20Permit? (with skill is 0/3 now)

technophile-04 · 2026-03-18T14:34:58Z

@damianmarti Good point — I shouldn't have included ERC20Permit in that table since it's 0/3 on both sides (not discriminating). The table should have only shown the assertions that actually differentiated.

On whether to add ERC20Permit guidance: when we tested with a prompt that explicitly said "gasless approvals", both with-skill and without-skill got it right (5/5). So the model knows ERC20Permit — it just doesn't add it unprompted.

The question is: should the OZ skill teach "always include ERC20Permit for new ERC-20 tokens"? Happy to add that if you think it's the right default.

Really sorry if it feel a bit AI'E my agent just directly posted a comment, but lol keeping it since look good

damianmarti · 2026-03-18T14:52:19Z

@damianmarti Good point — I shouldn't have included ERC20Permit in that table since it's 0/3 on both sides (not discriminating). The table should have only shown the assertions that actually differentiated.

On whether to add ERC20Permit guidance: when we tested with a prompt that explicitly said "gasless approvals", both with-skill and without-skill got it right (5/5). So the model knows ERC20Permit — it just doesn't add it unprompted.

The question is: should the OZ skill teach "always include ERC20Permit for new ERC-20 tokens"? Happy to add that if you think it's the right default.

mmmmm, I don´t know really. Maybe leaving as an explicit feature to add is ok.

Really sorry if it feel a bit AI'E my agent just directly posted a comment, but lol keeping it since look good

:-)

technophile-04 · 2026-03-19T02:19:02Z

mmmmm, I don´t know really. Maybe leaving as an explicit feature to add is ok.

Yup then will keep it as is 🙌

rin-st

Looking good to me, but see the comment regarding erc-721

The model reads "mirror import style from source" but doesn't actually check the source files. Adding a concrete example fixes this — the model needs to SEE the pattern, not be told to discover it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

technophile-04 · 2026-03-20T07:45:35Z

Thanks @rin-st and @damianmarti for testing!!

technophile-04 force-pushed the trim-tier2-tier3-skills branch from 0f53c21 to 638a2a8 Compare March 14, 2026 15:14

technophile-04 changed the base branch from tinker-with-evaluation to main March 14, 2026 15:14

technophile-04 added 2 commits March 15, 2026 14:23

rmeove skill with negative or 0pp

4f60cea

remove erc-20 as well

c3b0838

technophile-04 mentioned this pull request Mar 16, 2026

Add A/B evaluation framework for SE-2 agent skills #1241

Closed

technophile-04 added 2 commits March 17, 2026 11:01

tweak 721 a bit

9b1049a

tweak siwe

1d51f6c

technophile-04 requested review from carletex and rin-st March 17, 2026 08:30

add oz skill

7448f5f

rin-st approved these changes Mar 19, 2026

View reviewed changes

Comment thread .agents/skills/openzeppelin/SKILL.md

technophile-04 and others added 3 commits March 20, 2026 12:34

add reference to OZ skill in erc-721

7b65503

add the ERC721 gotch regarding the _update ovveride

2be5a91

rin-st approved these changes Mar 20, 2026

View reviewed changes

technophile-04 merged commit 9ac8e8b into main Mar 20, 2026
1 check passed

technophile-04 deleted the trim-tier2-tier3-skills branch March 20, 2026 14:29

technophile-04 mentioned this pull request Mar 20, 2026

backmerge 20-03-26 scaffold-eth/create-eth#405

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trim tier 2+3 skills to eval-proven content (83% reduction)#1245

Trim tier 2+3 skills to eval-proven content (83% reduction)#1245
technophile-04 merged 9 commits into
mainfrom
trim-tier2-tier3-skills

technophile-04 commented Mar 14, 2026 •

edited

Loading

Uh oh!

damianmarti commented Mar 17, 2026

Uh oh!

technophile-04 commented Mar 18, 2026 •

edited

Loading

Uh oh!

damianmarti commented Mar 18, 2026

Uh oh!

damianmarti commented Mar 18, 2026

Uh oh!

technophile-04 commented Mar 18, 2026 •

edited

Loading

Uh oh!

damianmarti commented Mar 18, 2026

Uh oh!

technophile-04 commented Mar 19, 2026

Uh oh!

rin-st left a comment

Uh oh!

Uh oh!

technophile-04 commented Mar 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

technophile-04 commented Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Iteration 4 Findings

Per-skill results

Key insight

What was kept (trimmed)

What was removed

Test plan

Uh oh!

damianmarti commented Mar 17, 2026

Uh oh!

technophile-04 commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

damianmarti commented Mar 18, 2026

Uh oh!

damianmarti commented Mar 18, 2026

Uh oh!

technophile-04 commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

damianmarti commented Mar 18, 2026

Uh oh!

technophile-04 commented Mar 19, 2026

Uh oh!

rin-st left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

technophile-04 commented Mar 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

technophile-04 commented Mar 14, 2026 •

edited

Loading

technophile-04 commented Mar 18, 2026 •

edited

Loading

technophile-04 commented Mar 18, 2026 •

edited

Loading