Commit 4ee733e ("feat: adding Agent Skills for Node.js CLI Best Practices"): 8 files changed, 1338 additions.
# Eval Learnings: nodejs-cli-best-practices Skill

This document captures findings from two rounds of eval runs comparing the `nodejs-cli-best-practices` skill against a baseline (no skill). Results are stored in `../nodejs-cli-best-practices-workspace/`.

---

## Iteration Summary

### Iteration 1 — Annotated fixture confound

**Fixture used:** `evals/fixtures/bad-cli.js` (with inline `// §X.X VIOLATION:` comments)

**Result:** All 6 runs (3 evals × 2 configurations: with_skill and without_skill) scored 100%.

**Root cause:** The fixture file contained comments that explicitly named each violation and its section number, e.g.:

```js
// §4.4 VIOLATION: Hardcoded node path
#!/usr/local/bin/node
```

This gave the baseline agent the same information the skill provides, making the comparison meaningless. Both agents read the comments and produced equally correct audits.

**Fix:** Created `bad-cli-clean.js` — identical violations, zero annotation comments. It looks like natural developer code.
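For contrast, a clean fixture plants the same kind of defects without naming them. The following is an illustrative sketch, not the actual contents of `bad-cli-clean.js`:

```javascript
#!/usr/local/bin/node
// Reads names from argv and greets each one. Nothing here says
// "VIOLATION", yet the hardcoded interpreter path above and the
// missing non-zero exit code below are both planted defects.
const names = process.argv.slice(2);
if (names.length === 0) {
  console.error('usage: greet <name>...'); // falls through without process.exit(1)
}
for (const name of names) {
  console.log(`Hello, ${name}!`);
}
```

Both agents now see only ordinary-looking code, so any difference in audit quality comes from the skill, not from reading hints.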

---

### Iteration 2 — Clean fixture, harder assertions

**Fixture used:** `evals/fixtures/bad-cli-clean.js` (no annotation hints)

**Result:** All 6 runs still scored 100% on content assertions.

**Root cause:** Claude is trained on the nodejs-cli-apps-best-practices repository content. The skill provides structured guidance, but the base model already knows these practices well enough to satisfy all content-based assertions.

**Key finding:** The skill's value shows up in **efficiency metrics**, not pass rates.

---

## Iteration 2 Benchmark Data

| Eval | Config | Tokens | Duration |
|------|--------|--------|----------|
| Audit (bad-cli-clean.js + bad-package.json) | with_skill | 23,897 | 98.6s |
| Audit (bad-cli-clean.js + bad-package.json) | without_skill | 24,468 | 106.1s |
| New CLI guide (dependency scanner) | with_skill | 21,551 | 90.4s |
| New CLI guide (dependency scanner) | without_skill | 36,838 | 130.0s |
| Error handling guide | with_skill | 19,900 | 65.3s |
| Error handling guide | without_skill | 21,087 | 64.8s |

**Aggregate (across all 3 evals):**

| Metric | With Skill | Without Skill | Delta |
|--------|------------|---------------|-------|
| Pass Rate | 100% ± 0% | 100% ± 0% | 0 |
| Time | 84.8s ± 17.4s | 100.3s ± 33.0s | **−15.5s** |
| Tokens | 21,783 ± 2,009 | 27,464 ± 8,292 | **−5,682** |

The most striking result is **eval-2 (new CLI guide)**: the skill reduced token usage by 41% and wall-clock time by 31%. With the skill, the agent had a clear framework to follow rather than deriving structure from scratch.

---

## Key Learnings

### 1. Fixtures must not contain the answer

Any comment, label, or annotation in a fixture file that names a violation will be read by both agents equally. The baseline agent is not disadvantaged — it can read comments too. For audit-style evals, fixtures must look like natural, unattributed developer code.

### 2. Content assertions are non-discriminating for well-known topics

When a skill encodes knowledge that is already in the model's training data (a public, widely cited repository), content-based assertions will pass regardless of whether the skill is used. The skill's measurable impact shifts to:

- **Format consistency** — the skill enforces specific output templates (audit report structure, §X.X section numbering, ✅/⚠️/❌/➖ status icons)
- **Efficiency** — the skill saves the agent from deriving structure from scratch, reducing tokens and latency
- **Reliability** — the skill makes the output format predictable across different prompts and contexts

### 3. The skill genuinely helps with format adherence

The with-skill audit outputs consistently used the exact §X.X numbering, the per-section table format, and the High/Medium/Low priority groupings specified in the skill template. Without-skill outputs also produced good content, but their format varied more across runs.

### 4. Efficiency gains are real and measurable

Across all three evals, the skill reduced token usage by an average of 5,682 tokens (21%) and wall-clock time by 15.5 seconds (15%). For high-volume usage, this compounds significantly.

---

## Recommended Next Steps

### Add format-compliance assertions

The current assertions only check for content presence. The next iteration should add assertions that test format adherence — things the base model won't produce without the skill template:

- `Output uses §X.X section numbering format (e.g., §6.4, §4.4)`
- `Output uses ✅/⚠️/❌/➖ status icons in a per-section table`
- `Output has exactly three priority tiers: High, Medium, and Low`
- `Output for the guide mode includes a checklist section with unchecked items`

These will likely discriminate between with-skill and without-skill runs because the format is specific to this skill's template.
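A rough sketch of what such format checks could reduce to mechanically (the helper names are hypothetical; the actual assertion syntax depends on the eval harness):

```javascript
// Hypothetical format-compliance predicates; the real harness may
// express these as natural-language assertions instead of regexes.
const usesSectionNumbers = (out) => /§\d+\.\d+/.test(out);
const usesStatusIcons = (out) => /[✅⚠️❌➖]/.test(out);
const usesPriorityTiers = (out) =>
  ['High', 'Medium', 'Low'].every((tier) => out.includes(tier));

// A fragment shaped like the skill's audit template passes all three.
const sample = [
  '| 6.4 | Exit codes | ❌ | process.exit() called without a code |',
  '**High priority**: §6.4 exit codes',
  '**Medium priority**: ...',
  '**Low priority / nice to have**: ...',
].join('\n');
```

Checks like these are cheap to run on raw output and directly target the template, so a without-skill run that produces correct content in a different shape will fail them.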

### Test with genuinely obscure practices

Several of the 37 practices are not widely known outside this repository — for example:

- §3.4 configuration precedence with `cosmiconfig`
- §2.2 `npm-shrinkwrap.json` vs `package-lock.json` distinction
- §1.3 `conf`/`configstore` for XDG-compliant config storage
- §6.5 pre-populated bug report URLs with embedded version and platform

Adding assertions that specifically test for these by name (not just the concept) will create harder discrimination challenges where the skill reference genuinely provides information the base model might not surface.
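To illustrate why §3.4 is a discriminating target: the conventional precedence it concerns (CLI flags over environment variables over config file over defaults) is usually delegated to `cosmiconfig`, but the rule itself can be shown in plain Node. This is an illustrative hand-rolled sketch with hypothetical keys, not the package's API:

```javascript
// Merge order: defaults < config file < environment < CLI flags.
// Later sources win; this ordering is the point of config precedence.
function resolveConfig({ defaults = {}, file = {}, env = {}, flags = {} }) {
  return { ...defaults, ...file, ...env, ...flags };
}

const config = resolveConfig({
  defaults: { color: true, retries: 3 },
  file: { retries: 5 },   // e.g. parsed from a .myclirc (hypothetical file)
  env: { color: false },  // e.g. MYCLI_COLOR=0 (hypothetical variable)
  flags: {},              // nothing passed on the command line
});
// config is { color: false, retries: 5 }
```

An assertion that demands the `cosmiconfig` package by name, rather than this general merging idea, is what would separate the skill from baseline knowledge.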

### Expand the eval set

The current 3 evals are sufficient for fast iteration but limited for statistical confidence. Before publishing the skill, expand to 8–10 evals covering:

- Audit of a different fixture (e.g., a TypeScript CLI, an ESM-only CLI)
- Guide for a different use case (e.g., a file watcher CLI, a config migration tool)
- Edge cases: CI environment detection, Windows-specific cross-platform issues
- Mixed-mode: auditing a mostly-good CLI to test that the skill correctly reports passes, not just failures

### Run description optimization

The `scripts/run_loop.py` script in the skill-creator plugin can optimize the SKILL.md `description` field for better triggering accuracy. This should be done after the skill content is stable:

```bash
SKILL_CREATOR=~/.claude/plugins/cache/claude-plugins-official/skill-creator/<version>/skills/skill-creator

cd "$SKILL_CREATOR" && python3 -m scripts.run_loop \
  --eval-set /path/to/trigger-evals.json \
  --skill-path skills/nodejs-cli-best-practices/SKILL.md \
  --model claude-sonnet-4-6 \
  --max-iterations 5 \
  --verbose
```

### Consider a blind comparison

For a more rigorous quality comparison between with-skill and without-skill outputs, use the skill-creator's blind comparison system (`agents/comparator.md`). An independent agent judges the two outputs without knowing which is which. This captures qualitative differences (format, specificity, organization) that pass-rate metrics cannot.
---
name: nodejs-cli-best-practices
description: Guide and audit Node.js CLI application development against 37 established best practices covering UX, distribution, interoperability, accessibility, testing, error handling, development setup, analytics, versioning, and security. Use this skill whenever building, extending, reviewing, or scaffolding a Node.js command-line tool — including when adding commands, flags, argument parsing, error handling, color output, STDIN support, configuration precedence, exit codes, --version flags, npm distribution, or any other CLI feature. If someone is working on Node.js CLI code, this skill applies even if they don't explicitly ask about best practices. Also trigger when someone asks "how should I implement X in my CLI" or "what's the right way to do Y in a Node.js CLI".
---

This skill operates in two modes — **audit** for reviewing existing CLI code and **development guide** for building new features or tools.

Read `references/best-practices.md` for the complete reference of all 37 practices, including code examples and recommended packages. Use it as your source of truth in both modes.

## Determining the mode

- **Audit mode**: user provides existing CLI code or files to review → produce a structured audit report
- **Development guide mode**: user is asking how to build or implement something → proactively surface relevant practices with examples
- **Both**: user says "review and help me improve" → audit first, then provide concrete guidance for each failing practice

---

## Audit mode

Systematically compare the provided code against the practices in `references/best-practices.md`. Focus on what is verifiable from the code. Use judgment to mark practices as not applicable (➖) when there's genuinely no way to assess them from static analysis (e.g., Docker support when the project shows no distribution intent).

**Prioritize by user impact** — a missing exit code (§6.4) or absent `--version` flag (§9.1) affects every user and every CI pipeline; a missing Docker image (§4.1) is low priority for most projects.

### Audit report format

````
## Node.js CLI Best Practices Audit

### Summary
✅ N practices followed ⚠️ N need attention ❌ N not implemented ➖ N not applicable

---

### 1. Command Line Experience
| # | Practice | Status | Finding |
|---|----------|--------|---------|
| 1.1 | Respect POSIX args | ✅ | Uses yargs with proper short/long aliases |
| 1.2 | Build empathic CLIs | ❌ | No interactive fallback when required args absent |
...

[Repeat for each section with applicable practices]

---

### Priority recommendations

**High priority** (user-facing or CI-breaking):
- **§6.4 Exit codes** — `process.exit()` called without a code
```js
// Fix
process.exit(1); // on error
process.exit(0); // on success
```

**Medium priority**:
- ...

**Low priority / nice to have**:
- ...
````

Keep findings concise and tied to specific code patterns. Always include a concrete fix with code for each ❌.

---

## Development guide mode

When building a new CLI feature or tool, don't wait to be asked — surface relevant best practices immediately based on what the user is building.

**For a "building a CLI from scratch" request**, organize guidance by development phase:

1. **Project setup** (§7.1, §7.3, §4.4, §2.2): bin object, shebang, files field, shrinkwrap
2. **Argument design** (§1.1, §1.2, §1.7, §3.4): POSIX compliance, empathic fallbacks, zero-config, config precedence
3. **I/O and interoperability** (§3.1, §3.2, §4.2): STDIN, structured output, graceful degradation
4. **Error handling** (§6.1–§6.5): trackable codes, actionable messages, debug mode, exit codes
5. **UX polish** (§1.4, §1.5, §1.6): colors, rich interactions, hyperlinks
6. **Versioning** (§9.1–§9.7): `--version` flag, semver, changelog
7. **Security** (§10.1): argument injection
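As one concrete instance of the I/O phase, here is a minimal sketch of STDIN support in the spirit of §3.1. The helper name is hypothetical and this is not code from the reference, just one plausible shape:

```javascript
// Collect piped STDIN as a single string; resolves when the pipe closes.
function readStdin(stream = process.stdin) {
  return new Promise((resolve, reject) => {
    let data = '';
    stream.setEncoding('utf8');
    stream.on('data', (chunk) => { data += chunk; });
    stream.on('end', () => resolve(data));
    stream.on('error', reject);
  });
}

// `mycli file.txt` reads the file; `cat file.txt | mycli` reads the pipe.
const isPiped = !process.stdin.isTTY;
```

A CLI following this practice falls back to the pipe when no file argument is given, so it composes with other tools.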

**For a targeted feature request** (e.g., "add error handling", "add a --json flag"), surface only the directly relevant practices.

### Development guidance format

```
## Node.js CLI best practices for [feature/topic]

### Checklist
- [ ] §X.Y Practice name — one-line explanation of why it matters

### Implementation
[Concrete, copy-pasteable code]

### Recommended packages
- `package-name` — when and why to use it
  npm install package-name

### Common mistakes
- What not to do → why it breaks → what to do instead
```

---

## Quick reference: which section applies

| Topic | Sections |
|-------|----------|
| Argument parsing / flags | §1.1, §1.7, §3.4 |
| Prompts / interactivity | §1.2, §1.5 |
| Colors / styling | §1.4, §4.2 |
| STDIN / piping | §3.1 |
| JSON / structured output | §3.2, §4.2 |
| Cross-platform issues | §3.3, §7.2 |
| Error messages | §6.1, §6.2 |
| Exit codes | §6.4 |
| Debug / verbose mode | §6.3 |
| `--version` flag | §9.1, §9.3 |
| package.json setup | §7.1, §7.3, §9.3 |
| npm publishing / distribution | §2.1, §2.2, §9.6 |
| Security / user input | §10.1 |
| Configuration persistence | §1.3, §2.3 |
| Node.js version targeting | §4.3 |
| Analytics / telemetry | §8.1 |
