Skip to content

Trim tier 2+3 skills to eval-proven content (83% reduction)#1245

Merged
technophile-04 merged 9 commits into
mainfrom
trim-tier2-tier3-skills
Mar 20, 2026
Merged

Trim tier 2+3 skills to eval-proven content (83% reduction)#1245
technophile-04 merged 9 commits into
mainfrom
trim-tier2-tier3-skills

Conversation

@technophile-04
Copy link
Copy Markdown
Collaborator

@technophile-04 technophile-04 commented Mar 14, 2026

Summary

  • Based on iteration-4 A/B eval data (20 runs, independent grading), trimmed 6 tier 2+3 skills to only content the model actually gets wrong without them
  • Total reduction: 2,123 lines → 365 lines (83%)
  • Follow-up: removed erc-20 and skills with negative/zero delta entirely

Iteration 4 Findings

Tier 2+3 skills show dramatically smaller uplift than tier 1:

Metric Tier 1 (Iter 3) Tier 2+3 (Iter 4)
With Skill pass rate 97% 96%
Without Skill pass rate 42% 90%
Delta +55pp +6pp
Skills tested 4 6
Total runs 40 20

Per-skill results

Skill Tier With Skill Without Skill Delta Verdict
eip-712 2 100% 80% +20pp Keep — shared utility module + as const pattern
siwe 2 100% 85% +15pp Keep — viem SIWE vs siwe npm pkg, domain validation
erc-20 2 100% 95% +5pp Removed — within noise
erc-721 2 85% 90% -5pp Removed — no value demonstrated
defi-protocol-templates 3 100% 100% 0pp Removed — zero delta
solidity-security 3 90% 100% -10pp Removed — negative delta

Key insight

The model already knows ERC-20, ERC-721, DeFi staking, and Solidity security well enough. These skills encode preference, not capability. Only eip-712 and siwe show real discriminating assertions (shared utility module, as const, viem SIWE, domain validation).

What was kept (trimmed)

Skill Lines What's left
eip-712 230→60 Shared utility module pattern, as const requirement
siwe 547→100 "Use viem SIWE not siwe npm pkg", domain validation, lazy session getter, hasSeenWalletConnected ref

What was removed

  • erc-20, erc-721: Removed entirely (delta within noise or negative)
  • defi-protocol-templates, solidity-security: Removed entirely (0pp or -10pp delta)
  • From kept skills: everything the model already knows — basic patterns, OZ v5 usage, well-known addresses, full code examples for flows the model implements correctly on its own

Test plan

  • Eval data in iteration-4/ANALYSIS.md on the base branch
  • Each kept section maps to a discriminating assertion (passes with skill, fails without)
  • Each removed section/skill maps to a non-discriminating assertion (passes regardless)
  • Cross-iteration comparison validates tier classification

🤖 Generated with Claude Code

Based on iteration-4 eval data (20 runs, independent grading), removed
content the model already knows and kept only what it actually gets wrong
without the skill:

- eip-712 (230→60 lines): Keep shared utility module, as const pattern
- siwe (547→100 lines): Keep viem-not-siwe-pkg, domain validation, lazy
  session getter, hasSeenWalletConnected ref
- erc-721 (211→65 lines): Keep _safeMint reentrancy, SVG stack-too-deep,
  attributes array, IPFS trailing slash
- erc-20 (159→65 lines): Keep ERC20Capped, SafeERC20, fee-on-transfer
- defi-protocol-templates (443→40 lines): Convert to checklist (0pp delta)
- solidity-security (533→35 lines): Convert to checklist (-10pp delta)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@technophile-04 technophile-04 force-pushed the trim-tier2-tier3-skills branch from 0f53c21 to 638a2a8 Compare March 14, 2026 15:14
@technophile-04 technophile-04 changed the base branch from tinker-with-evaluation to main March 14, 2026 15:14
@damianmarti
Copy link
Copy Markdown
Member

Great analysis!

When I tested the ERC-20 skill, I checked the resulting code with and without the skill, and using the skill was better. You can find the comment here scaffold-eth/create-eth-extensions#118 (comment)

Now it seems like the ERC-20 skill is not needed anymore. Is this for something that changed from when I did the test or something else?

@technophile-04
Copy link
Copy Markdown
Collaborator Author

technophile-04 commented Mar 18, 2026

@damianmarti Great catch! You were right, the ERC-20 skill was providing real value that our eval wasn't measuring.

I re-ran the eval with assertions based on your specific findings (named imports, _update override, ERC20Permit, dynamic decimals) and found:

Assertion With Skill (3 runs) Without Skill (3 runs) Delta
Named imports (import {ERC20} not import "...") 3/3 0/3 +100pp
_update override (required for OZ v5 compilation) 3/3 0/3 +100pp
ERC20Capped (vs manual cap) 3/3 3/3 0pp
ERC20Permit 0/3 0/3 0pp

So instead of restoring the ERC-20 skill, we're adding a unified OpenZeppelin skill inspired by OZ's official skill. It teaches:

  • Library-first development (prefer ERC20Capped over manual cap)
  • Pattern discovery from installed source (read the actual node_modules to find import syntax, virtual functions, constructor params)
  • Version-safe development (don't assume patterns from memory)

This covers ERC-20, ERC-721, and any future OZ contract in one skill. The ERC-721 skill stays for NFT-specific pitfalls (_safeMint reentrancy, SVG stack-too-deep, metadata attributes).

Eval data for iterations 6-7 pushed to PR #1241.

@damianmarti
Copy link
Copy Markdown
Member

@technophile-04 great! thanks!!

@damianmarti
Copy link
Copy Markdown
Member

@technophile-04 maybe add something for ERC20Permit? (with skill is 0/3 now)

@technophile-04
Copy link
Copy Markdown
Collaborator Author

technophile-04 commented Mar 18, 2026

@damianmarti Good point — I shouldn't have included ERC20Permit in that table since it's 0/3 on both sides (not discriminating). The table should have only shown the assertions that actually differentiated.

On whether to add ERC20Permit guidance: when we tested with a prompt that explicitly said "gasless approvals", both with-skill and without-skill got it right (5/5). So the model knows ERC20Permit — it just doesn't add it unprompted.

The question is: should the OZ skill teach "always include ERC20Permit for new ERC-20 tokens"? Happy to add that if you think it's the right default.

Really sorry if it feel a bit AI'E my agent just directly posted a comment, but lol keeping it since look good

@damianmarti
Copy link
Copy Markdown
Member

@damianmarti Good point — I shouldn't have included ERC20Permit in that table since it's 0/3 on both sides (not discriminating). The table should have only shown the assertions that actually differentiated.

On whether to add ERC20Permit guidance: when we tested with a prompt that explicitly said "gasless approvals", both with-skill and without-skill got it right (5/5). So the model knows ERC20Permit — it just doesn't add it unprompted.

The question is: should the OZ skill teach "always include ERC20Permit for new ERC-20 tokens"? Happy to add that if you think it's the right default.

mmmmm, I don´t know really. Maybe leaving as an explicit feature to add is ok.

Really sorry if it feel a bit AI'E my agent just directly posted a comment, but lol keeping it since look good

:-)

@technophile-04
Copy link
Copy Markdown
Collaborator Author

mmmmm, I don´t know really. Maybe leaving as an explicit feature to add is ok.

Yup then will keep it as is 🙌

Copy link
Copy Markdown
Member

@rin-st rin-st left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good to me, but see the comment regarding erc-721

Comment thread .agents/skills/openzeppelin/SKILL.md
technophile-04 and others added 3 commits March 20, 2026 12:34
The model reads "mirror import style from source" but doesn't actually
check the source files. Adding a concrete example fixes this — the model
needs to SEE the pattern, not be told to discover it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@technophile-04
Copy link
Copy Markdown
Collaborator Author

Thanks @rin-st and @damianmarti for testing!!

@technophile-04 technophile-04 merged commit 9ac8e8b into main Mar 20, 2026
1 check passed
@technophile-04 technophile-04 deleted the trim-tier2-tier3-skills branch March 20, 2026 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants