@@ -81,13 +81,153 @@ node bin/skill-lint.js lint skills/my-skill -f github-actions
8181# Check if skill loads correctly
8282node bin/skill-lint.js check skills/my-skill
8383
84- # Analyze skill and suggest trigger keywords (NEW!)
84+ # Analyze skill and suggest trigger keywords
8585node bin/skill-lint.js analyze skills/my-skill
8686
87+ # Run comprehensive harness audit with statistical analysis (NEW!)
88+ node bin/skill-lint.js audit skills/my-skill
89+
8790# Generate config file
8891node bin/skill-lint.js init
8992```
9093
94+ ## Harness Audit Command
95+
96+ **Run comprehensive statistical analysis of harness performance** across multiple iterations, with baseline comparisons and detailed reports.
97+
98+ ### Quick Start
99+
100+ ```bash
101+ # Basic audit (single run)
102+ node bin/skill-lint.js audit ../skills/my-skill
103+
104+ # Statistical audit (10 iterations for confidence)
105+ node bin/skill-lint.js audit ../skills/my-skill --iterations 10
106+
107+ # Generate markdown report
108+ node bin/skill-lint.js audit ../skills/my-skill --format markdown --output reports/audit.md
109+
110+ # Generate HTML report
111+ node bin/skill-lint.js audit ../skills/my-skill --format html --output reports/audit.html
112+
113+ # Compare with baseline
114+ node bin/skill-lint.js audit ../skills/my-skill --baseline baselines/previous-audit.json
115+ ```
116+
117+ ### What It Measures
118+
119+ The audit command runs the harness validator once per iteration (a single run by default, more with `--iterations`) and reports:
120+
121+ 1. **Statistical Analysis**
122+    - Mean, median, standard deviation, and min/max for accuracy, latency, and token usage
123+    - 95% confidence intervals
124+    - Variance analysis for reliability assessment
125+
126+ 2. **Quality Assessment**
127+    - Letter grade (A-F) based on performance
128+    - Quality score (0-100)
129+    - Specific issues and recommendations
130+    - Pass/fail status
131+
132+ 3. **Cost Tracking**
133+    - Total token usage across all iterations
134+    - Estimated cost (Claude Sonnet 4.6 pricing)
135+    - Cost per successful test
136+
137+ 4. **Baseline Comparison** (optional)
138+    - Compare against historical performance
139+    - Track accuracy improvements/regressions
140+    - Monitor latency and token efficiency changes
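
To make the statistics listed above concrete, here is a minimal sketch of how per-iteration values could be summarized. The `summarize` helper is hypothetical and is not taken from the tool's source; it assumes the sample standard deviation (dividing by n - 1) and a normal approximation (z = 1.96) for the 95% interval, and the audit command may compute these differently.

```javascript
// Hypothetical helper: summarize one metric across iterations.
// Assumptions: sample standard deviation (n - 1) and a normal
// approximation (z = 1.96) for the 95% confidence interval.
function summarize(values) {
  if (values.length === 0) {
    throw new Error("summarize() needs at least one value");
  }
  const n = values.length;
  const sorted = [...values].sort((a, b) => a - b);
  const mean = values.reduce((sum, v) => sum + v, 0) / n;

  // Median: middle element, or the average of the two middle elements.
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

  // With a single iteration there is no spread to estimate.
  const variance =
    n > 1 ? values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (n - 1) : 0;
  const stdDev = Math.sqrt(variance);

  const halfWidth = (1.96 * stdDev) / Math.sqrt(n);
  return {
    mean,
    median,
    stdDev,
    min: sorted[0],
    max: sorted[n - 1],
    ci95: [mean - halfWidth, mean + halfWidth],
  };
}

// Made-up latency values (ms) for five iterations.
const latencies = [1800, 2000, 2100, 2200, 2500];
console.log(summarize(latencies));
```

With only a handful of iterations, a normal approximation understates the interval width; a t-distribution would be the more conservative choice.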
141+
142+ ### Options
143+
144+ | Option | Description | Default |
145+ |--------|-------------|---------|
146+ | `-i, --iterations <number>` | Number of iterations to run | `1` |
147+ | `-f, --format <format>` | Output format: `text`, `markdown`, `html`, `json` | `text` |
148+ | `-o, --output <path>` | Save report to file | - |
149+ | `--baseline <path>` | Compare against historical baseline (JSON) | - |
150+ | `--confidence <level>` | Confidence level for statistical tests | `0.95` |
151+ | `-b, --benchmark` | Include performance benchmarking | `false` |
152+
153+ ### Example Output
154+
155+ ```
156+ ═══════════════════════════════════════════════════════════════════
157+ HARNESS AUDIT REPORT: ui5-lint
158+ ═══════════════════════════════════════════════════════════════════
159+
160+ Summary
161+ Skill: ui5-lint
162+ Iterations: 5
163+ Total Duration: 62.34s
164+ Timestamp: 2026-05-28T10:30:00.000Z
165+
166+ Aggregated Metrics
167+ Total Tests: 45
168+ Passed: 38
169+ Failed: 7
170+ Overall Accuracy: 84.4%
171+ Total Tokens: 20,450
172+ Total Cost: $0.1841
173+
174+ Statistical Analysis
175+
176+ Accuracy:
177+ Mean: 84.4%
178+ Median: 85.0%
179+ Std Dev: 3.2%
180+ Range: [80.0%, 88.0%]
181+ 95% CI: [82.1%, 86.7%]
182+
183+ Latency:
184+ Mean: 2134ms
185+ Median: 2100ms
186+ Std Dev: 245ms
187+ Range: [1800ms, 2500ms]
188+
189+ Token Usage:
190+ Mean: 4090
191+ Median: 4050
192+ Std Dev: 180
193+ Range: [3800, 4350]
194+
195+ ✅ Quality Assessment
196+ Grade: B
197+ Score: 85/100
198+ Status: ✅ PASSED
199+
200+ Recommendations:
201+ 💡 Consider adding more specific trigger keywords for higher accuracy
202+ 💡 Skill performs consistently across iterations (low variance)
203+
204+ Baseline Comparison
205+ Accuracy: +4.2%
206+ Latency: -340ms
207+ Tokens: -120
208+ Overall: ✅ IMPROVED
209+
210+ ═══════════════════════════════════════════════════════════════════
211+ ```
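
The cost-per-test and baseline figures in a report like this are simple arithmetic. The sketch below reproduces them from the sample numbers; the helper names and field names (`accuracy`, `latencyMs`, `tokens`) are hypothetical, and the baseline values are back-computed from the deltas shown in the sample rather than taken from a real baseline file.

```javascript
// Hypothetical helpers; field names are illustrative assumptions,
// not the tool's actual report schema.

// Change in each metric relative to the baseline (current minus baseline).
function compareToBaseline(current, baseline) {
  return {
    accuracy: current.accuracy - baseline.accuracy, // positive = better
    latencyMs: current.latencyMs - baseline.latencyMs, // negative = faster
    tokens: current.tokens - baseline.tokens, // negative = fewer tokens
  };
}

// Total cost divided by the number of passing tests; null if none passed.
function costPerPassedTest(totalCost, passed) {
  return passed > 0 ? totalCost / passed : null;
}

// Current values from the sample report; baseline values back-computed
// from the sample deltas (+4.2%, -340ms, -120).
const current = { accuracy: 84.4, latencyMs: 2134, tokens: 4090 };
const baseline = { accuracy: 80.2, latencyMs: 2474, tokens: 4210 };

const delta = compareToBaseline(current, baseline);
console.log(delta.accuracy.toFixed(1), delta.latencyMs, delta.tokens);
console.log(costPerPassedTest(0.1841, 38).toFixed(4));
```

How the tool combines the individual deltas into the overall verdict is not documented here, so the sketch leaves that step out.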
212+
213+ ### Use Cases
214+
215+ | Scenario | Command | Duration |
216+ |----------|---------|----------|
217+ | **Quick validation** | `audit skill` | ~1-2 min |
218+ | **Pre-release check** | `audit skill -i 5` | ~5-10 min |
219+ | **Statistical confidence** | `audit skill -i 10` | ~10-20 min |
220+ | **Track improvements** | `audit skill --baseline previous.json` | ~1-2 min |
221+ | **Generate report** | `audit skill -f html -o report.html` | ~1-2 min |
222+
223+ ### Best Practices
224+
225+ 1. **During Development**: Use a single iteration (`audit skill`) for quick feedback
226+ 2. **Before Release**: Run 5-10 iterations for statistical confidence
227+ 3. **Track Progress**: Save JSON baselines and compare them over time
228+ 4. **CI/CD Integration**: Use the JSON format for automated quality gates
229+ 5. **Documentation**: Generate HTML reports for stakeholder reviews
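
As an illustration of the CI/CD item above, a quality gate could be a small Node script that reads the JSON report and fails the build when thresholds are missed. This is only a sketch: the report shape (`accuracy.mean`, `quality.score`), the `checkQualityGate` function, and the threshold names are assumptions, so check the actual JSON output before adapting it.

```javascript
// Hypothetical CI gate. The report shape used here
// ({ accuracy: { mean }, quality: { score } }) is an assumption for
// illustration, not the documented output schema.
function checkQualityGate(report, thresholds) {
  const failures = [];
  if (report.accuracy.mean < thresholds.minAccuracy) {
    failures.push(
      `accuracy ${report.accuracy.mean}% is below ${thresholds.minAccuracy}%`
    );
  }
  if (report.quality.score < thresholds.minScore) {
    failures.push(
      `quality score ${report.quality.score} is below ${thresholds.minScore}`
    );
  }
  return failures;
}

// In CI the report would be loaded from the saved file, for example
// JSON.parse(fs.readFileSync("reports/audit.json", "utf8")).
const sampleReport = { accuracy: { mean: 84.4 }, quality: { score: 85 } };
const failures = checkQualityGate(sampleReport, { minAccuracy: 80, minScore: 70 });

if (failures.length > 0) {
  console.error(failures.join("\n"));
  process.exitCode = 1; // a non-zero exit code fails the pipeline step
} else {
  console.log("Quality gate passed");
}
```

Returning a list of failures instead of a boolean lets the CI log show every missed threshold in one run.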
230+
91231## Automatic Keyword Extraction
92232
93233**No more manual trigger-cases.json creation!** The `analyze` command reads your skill and suggests trigger keywords automatically.