| 1 | +# Eval Learnings: nodejs-cli-best-practices Skill |
| 2 | + |
| 3 | +This document captures findings from two rounds of eval runs comparing the `nodejs-cli-best-practices` skill against a baseline (no skill). Results are stored in `../nodejs-cli-best-practices-workspace/`. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Iteration Summary |
| 8 | + |
| 9 | +### Iteration 1 — Annotated fixture confound |
| 10 | + |
| 11 | +**Fixture used:** `evals/fixtures/bad-cli.js` (with inline `// §X.X VIOLATION:` comments) |
| 12 | + |
| 13 | +**Result:** All 6 runs (3 evals × 2 configurations: `with_skill` and `without_skill`) scored 100%. |
| 14 | + |
| 15 | +**Root cause:** The fixture file contained comments that explicitly named the violation and its section number, e.g.: |
| 16 | + |
| 17 | +```js |
| 18 | +// §4.4 VIOLATION: Hardcoded node path |
| 19 | +#!/usr/local/bin/node |
| 20 | +``` |
| 21 | + |
| 22 | +This gave the baseline agent the same information the skill provides, making the comparison meaningless. Both agents read the comments and produced equally correct audits. |
| 23 | + |
| 24 | +**Fix:** Created `bad-cli-clean.js`, which contains the same violations but no annotation comments, so it reads like natural developer code. |
| 25 | + |
| 26 | +--- |
| 27 | + |
| 28 | +### Iteration 2 — Clean fixture, harder assertions |
| 29 | + |
| 30 | +**Fixture used:** `evals/fixtures/bad-cli-clean.js` (no annotation hints) |
| 31 | + |
| 32 | +**Result:** All 6 runs still scored 100% on content assertions. |
| 33 | + |
| 34 | +**Root cause:** The base model (Claude) has most likely seen the public `nodejs-cli-apps-best-practices` repository during training. The skill provides structured guidance, but the base model already knows these practices well enough to satisfy all content-based assertions. |
| 35 | + |
| 36 | +**Key finding:** The skill's value shows up in **efficiency metrics**, not pass rates. |
| 37 | + |
| 38 | +--- |
| 39 | + |
| 40 | +## Iteration 2 Benchmark Data |
| 41 | + |
| 42 | +| Eval | Config | Tokens | Duration | |
| 43 | +|------|--------|--------|----------| |
| 44 | +| Audit (bad-cli-clean.js + bad-package.json) | with_skill | 23,897 | 98.6s | |
| 45 | +| Audit (bad-cli-clean.js + bad-package.json) | without_skill | 24,468 | 106.1s | |
| 46 | +| New CLI guide (dependency scanner) | with_skill | 21,551 | 90.4s | |
| 47 | +| New CLI guide (dependency scanner) | without_skill | 36,838 | 130.0s | |
| 48 | +| Error handling guide | with_skill | 19,900 | 65.3s | |
| 49 | +| Error handling guide | without_skill | 21,087 | 64.8s | |
| 50 | + |
| 51 | +**Aggregate (mean ± sample standard deviation across the 3 evals, one run per configuration):** |
| 52 | + |
| 53 | +| Metric | With Skill | Without Skill | Delta | |
| 54 | +|--------|------------|---------------|-------| |
| 55 | +| Pass Rate | 100% ± 0% | 100% ± 0% | 0 | |
| 56 | +| Time | 84.8s ± 17.4s | 100.3s ± 33.0s | **−15.5s** | |
| 57 | +| Tokens | 21,783 ± 2,009 | 27,464 ± 8,292 | **−5,682** | |
| 58 | + |
| 59 | +The most striking result is **eval-2 (new CLI guide)**: the skill reduced token usage by 41% (36,838 → 21,551) and wall-clock time by 30% (130.0s → 90.4s). The likely explanation is that, with the skill, the agent had a clear framework to follow rather than deriving structure from scratch. |
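The aggregate rows and the per-eval reductions can be recomputed from the benchmark table. The sketch below assumes the ± values are sample standard deviations (n − 1), which reproduces the reported figures to within rounding. `RUNS`, `summarize`, and `percent_reduction` are illustrative names, not part of the eval harness.

```python
from statistics import mean, stdev

# (tokens, seconds) per eval, copied from the Iteration 2 benchmark table.
RUNS = {
    "audit": {"with_skill": (23_897, 98.6), "without_skill": (24_468, 106.1)},
    "new_cli_guide": {"with_skill": (21_551, 90.4), "without_skill": (36_838, 130.0)},
    "error_handling": {"with_skill": (19_900, 65.3), "without_skill": (21_087, 64.8)},
}


def summarize(config: str, index: int) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of one metric for one config.

    index 0 selects tokens, index 1 selects seconds.
    """
    values = [runs[config][index] for runs in RUNS.values()]
    return mean(values), stdev(values)


def percent_reduction(eval_name: str, index: int) -> float:
    """Percent by which with_skill is lower than without_skill for one eval.

    A negative value means the with_skill run was higher (worse).
    """
    with_skill = RUNS[eval_name]["with_skill"][index]
    without_skill = RUNS[eval_name]["without_skill"][index]
    return 100 * (without_skill - with_skill) / without_skill


if __name__ == "__main__":
    for label, index in (("tokens", 0), ("seconds", 1)):
        for config in ("with_skill", "without_skill"):
            avg, sd = summarize(config, index)
            print(f"{label:8} {config:14} {avg:10.1f} ± {sd:.1f}")
    for name in RUNS:
        tokens = percent_reduction(name, 0)
        seconds = percent_reduction(name, 1)
        print(f"{name:15} token reduction {tokens:5.1f}%  time reduction {seconds:5.1f}%")
```

Printing the per-eval reductions next to the aggregates makes it easy to see how much of the average is driven by a single eval.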
| 60 | + |
| 61 | +--- |
| 62 | + |
| 63 | +## Key Learnings |
| 64 | + |
| 65 | +### 1. Fixtures must not contain the answer |
| 66 | + |
| 67 | +Any comment, label, or annotation in a fixture file that names a violation will be read by both agents equally. The baseline agent is not disadvantaged — it can read comments too. For audit-style evals, fixtures must look like natural, unattributed developer code. |
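A fixture can be checked for leaked hints before it is used in an eval. The sketch below is a heuristic, not part of the existing harness: it assumes leaks look like the Iteration 1 example (a `//` or `/*` comment containing a `§X.X` reference or the word `VIOLATION`), and `find_annotation_leaks` is a hypothetical helper name.

```python
import re

# Assumed marker shape, based on the Iteration 1 example:
#   // §4.4 VIOLATION: Hardcoded node path
# Any line with a comment opener followed by a §X.X reference or the word
# "violation" is flagged. This is a line-based heuristic, not a JS parser.
LEAK_PATTERN = re.compile(r"(//|/\*).*(§\s*\d+\.\d+|\bviolation\b)", re.IGNORECASE)


def find_annotation_leaks(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_text) for each line that hints at a violation."""
    leaks = []
    for number, line in enumerate(source.splitlines(), start=1):
        if LEAK_PATTERN.search(line):
            leaks.append((number, line.strip()))
    return leaks
```

Running this over the text of `evals/fixtures/bad-cli-clean.js` and requiring an empty result would catch the Iteration 1 confound before any eval is run.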
| 68 | + |
| 69 | +### 2. Content assertions are non-discriminating for well-known topics |
| 70 | + |
| 71 | +When a skill encodes knowledge that is already in the model's training data (a public, widely-cited repository), content-based assertions will pass regardless of whether the skill is used. The skill's measurable impact shifts to: |
| 72 | + |
| 73 | +- **Format consistency** — the skill enforces specific output templates (audit report structure, §X.X section numbering, ✅/⚠️/❌/➖ status icons) |
| 74 | +- **Efficiency** — the skill saves the agent from deriving structure from scratch, reducing tokens and latency |
| 75 | +- **Reliability** — the skill makes the output format predictable across different prompts and contexts |
| 76 | + |
| 77 | +### 3. The skill genuinely helps with format adherence |
| 78 | + |
| 79 | +The with-skill audit outputs consistently used the exact §X.X numbering, the per-section table format, and the High/Medium/Low priority groupings specified in the skill template. Without-skill outputs also produced good content, but format varied more across runs. |
| 80 | + |
| 81 | +### 4. Efficiency gains are real and measurable |
| 82 | + |
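| 83 | +Averaged across the three evals, the skill reduced token usage by 5,682 tokens (21%) and wall-clock time by 15.5 seconds (15%). Most of that gain comes from eval-2: the audit and error-handling evals each saved under 6% of tokens, and the error-handling run was 0.5s slower with the skill. With one run per configuration these figures are indicative rather than conclusive, but for high-volume usage savings of this size add up. |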
| 84 | + |
| 85 | +--- |
| 86 | + |
| 87 | +## Recommended Next Steps |
| 88 | + |
| 89 | +### Add format-compliance assertions |
| 90 | + |
| 91 | +The current assertions only check for content presence. The next iteration should add assertions that test format adherence, that is, things the base model is unlikely to produce without the skill template: |
| 92 | + |
| 93 | +- `Output uses §X.X section numbering format (e.g., §6.4, §4.4)` |
| 94 | +- `Output uses ✅/⚠️/❌/➖ status icons in a per-section table` |
| 95 | +- `Output has exactly three priority tiers: High, Medium, and Low` |
| 96 | +- `Output for the guide mode includes a checklist section with unchecked items` |
| 97 | + |
| 98 | +These will likely discriminate between with-skill and without-skill runs because the format is specific to this skill's template. |
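The four assertions above could be approximated with simple text checks. This is a minimal sketch under stated assumptions, not the harness's grading logic: it assumes the output is Markdown, that table rows start with `|`, and that priority tiers appear as headings such as `## High Priority`. All function names are hypothetical.

```python
import re

# Base code points for the four status icons; the warning sign is matched
# without its optional variation selector so both renderings are accepted.
STATUS_ICONS = ("\u2705", "\u26a0", "\u274c", "\u2796")  # ✅ ⚠ ❌ ➖


def uses_section_numbering(output: str) -> bool:
    """True if the output contains at least one §X.X reference, e.g. §6.4."""
    return re.search(r"§\d+\.\d+", output) is not None


def uses_status_icons_in_table(output: str) -> bool:
    """True if at least one Markdown table row contains a status icon."""
    rows = [line for line in output.splitlines() if line.lstrip().startswith("|")]
    return any(icon in row for row in rows for icon in STATUS_ICONS)


def has_three_priority_tiers(output: str) -> bool:
    """True if the priority headings are exactly High, Medium, and Low.

    Assumes headings of the form "## High Priority" (any heading level).
    """
    tiers = re.findall(
        r"^#+[ \t]*(\w+)[ \t]+priority\b", output, flags=re.IGNORECASE | re.MULTILINE
    )
    return sorted(tier.lower() for tier in tiers) == ["high", "low", "medium"]


def has_unchecked_checklist(output: str) -> bool:
    """True if the output contains at least one unchecked task item ("- [ ]")."""
    return re.search(r"^\s*[-*] \[ \]", output, flags=re.MULTILINE) is not None
```

Deterministic checks like these are cheap to run on every output, and anything they cannot express (for example, whether the table is organized per section) can still be left to a model-graded assertion.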
| 99 | + |
| 100 | +### Test with genuinely obscure practices |
| 101 | + |
| 102 | +Several of the 37 practices are not widely known outside this repository — for example: |
| 103 | + |
| 104 | +- §3.4 configuration precedence with `cosmiconfig` |
| 105 | +- §2.2 `npm-shrinkwrap.json` vs `package-lock.json` distinction |
| 106 | +- §1.3 `conf`/`configstore` for XDG-compliant config storage |
| 107 | +- §6.5 pre-populated bug report URLs with embedded version and platform |
| 108 | + |
| 109 | +Assertions that test for these by name (not just the concept) should discriminate better, because here the skill reference supplies information the base model might not surface on its own. |
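A by-name assertion can be sketched as a token match against the output. The mapping below takes its names from the practices listed above; the structure and function names are illustrative assumptions. §6.5 is omitted because it names no specific package to match.

```python
import re

# Section -> names of which at least one must be mentioned (illustrative mapping).
EXPECTED_NAMES = {
    "§3.4": ["cosmiconfig"],
    "§2.2": ["npm-shrinkwrap.json"],
    "§1.3": ["conf", "configstore"],
}


def mentions_name(output: str, name: str) -> bool:
    """True if `name` appears as a standalone token, not inside a longer word.

    The lookarounds stop "conf" from matching inside "config" or "configstore".
    """
    pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
    return re.search(pattern, output) is not None


def missing_sections(output: str) -> list[str]:
    """Return the sections for which none of the expected names is mentioned."""
    return [
        section
        for section, names in EXPECTED_NAMES.items()
        if not any(mentions_name(output, name) for name in names)
    ]
```

An output that only talks about "a config file" or "a lockfile" would be reported as missing all three sections, which is the distinction a concept-level assertion cannot make.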
| 110 | + |
| 111 | +### Expand the eval set |
| 112 | + |
| 113 | +The current 3 evals are sufficient for fast iteration but limited for statistical confidence. Before publishing the skill, expand to 8-10 evals covering: |
| 114 | + |
| 115 | +- Audit of a different fixture (e.g., a TypeScript CLI, an ESM-only CLI) |
| 116 | +- Guide for a different use case (e.g., a file watcher CLI, a config migration tool) |
| 117 | +- Edge cases: CI environment detection, Windows-specific cross-platform issues |
| 118 | +- Mixed-mode: auditing a mostly-good CLI to test that the skill correctly reports passes, not just failures |
| 119 | + |
| 120 | +### Run description optimization |
| 121 | + |
| 122 | +The `scripts/run_loop.py` in the skill-creator plugin can optimize the SKILL.md `description` field for better triggering accuracy. This should be done after the skill content is stable: |
| 123 | + |
| 124 | +```bash |
| 125 | +SKILL_CREATOR=~/.claude/plugins/cache/claude-plugins-official/skill-creator/<version>/skills/skill-creator |
| 126 | + |
| 127 | +cd "$SKILL_CREATOR" && python3 -m scripts.run_loop \ |
| 128 | + --eval-set /path/to/trigger-evals.json \ |
| 129 | + --skill-path skills/nodejs-cli-best-practices/SKILL.md \ |
| 130 | + --model claude-sonnet-4-6 \ |
| 131 | + --max-iterations 5 \ |
| 132 | + --verbose |
| 133 | +``` |
| 134 | + |
| 135 | +### Consider a blind comparison |
| 136 | + |
| 137 | +For a more rigorous quality comparison between with-skill and without-skill outputs, use the skill-creator's blind comparison system (`agents/comparator.md`). An independent agent judges the two outputs without knowing which is which. This captures qualitative differences (format, specificity, organization) that pass-rate metrics cannot. |
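The core of any blind comparison is hiding which output came from which configuration. The sketch below illustrates that step only; it is not the skill-creator comparator's implementation, and `make_blind_pair` and `unblind` are hypothetical helpers.

```python
import random


def make_blind_pair(
    with_skill: str, without_skill: str, rng: random.Random
) -> tuple[dict[str, str], dict[str, str]]:
    """Shuffle two outputs into anonymous slots "A" and "B".

    Returns (blinded, key). `blinded` maps slot -> output text and is what the
    judge sees. `key` maps slot -> config name and stays hidden until the
    verdict has been recorded.
    """
    entries = [("with_skill", with_skill), ("without_skill", without_skill)]
    rng.shuffle(entries)
    blinded = {slot: text for slot, (_, text) in zip("AB", entries)}
    key = {slot: config for slot, (config, _) in zip("AB", entries)}
    return blinded, key


def unblind(verdict: str, key: dict[str, str]) -> str:
    """Translate the judge's verdict ("A" or "B") back to a config name."""
    return key[verdict]
```

Passing an explicit `random.Random` instance keeps the slot assignment reproducible per eval while still varying across evals, so the judge cannot learn that the with-skill output always appears in the same slot.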