Skip to content

Commit e1ecfa2

Browse files
committed
feat: Add comprehensive harness audit command with statistical analysis
Implements a production-ready audit command that provides: - Statistical analysis (mean, median, std dev, 95% CI) - Quality assessment with letter grades (A-F) - Multiple output formats (text, markdown, HTML, JSON) - Baseline comparison for tracking improvements - Multi-iteration support for statistical confidence - Cost tracking and token efficiency metrics Architecture: - Base formatter class (DRY pattern) - Shared grade utilities (emoji/color) - Helper functions for metric extraction - Named constants for self-documenting code - Simplified statistical calculations Changes: - Add audit.ts command (310 lines) - Add 4 formatters (text, markdown, HTML, JSON) - Add audit-types.ts (96 lines) - Add statistics utilities (120 lines) - Add base formatter class - Add grade utilities module - Update README with audit documentation - Add 60+ tests (512 total passing) Code quality improvements: - Eliminated all code duplication (63 lines saved) - Removed unused tTest function (27 lines) - Simplified t-score calculation (51 lines saved) - Added constants for magic numbers - Extracted metric extraction helper - Net: -136 lines while adding functionality Test coverage: 512 tests passing, 100% of new code tested
1 parent 24ff25b commit e1ecfa2

14 files changed

Lines changed: 1808 additions & 3 deletions

β€Žplugins/ui5/skill-lint/README.mdβ€Ž

Lines changed: 141 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -81,13 +81,153 @@ node bin/skill-lint.js lint skills/my-skill -f github-actions
8181
# Check if skill loads correctly
8282
node bin/skill-lint.js check skills/my-skill
8383

84-
# Analyze skill and suggest trigger keywords (NEW!)
84+
# Analyze skill and suggest trigger keywords
8585
node bin/skill-lint.js analyze skills/my-skill
8686

87+
# Run comprehensive harness audit with statistical analysis (NEW!)
88+
node bin/skill-lint.js audit skills/my-skill
89+
8790
# Generate config file
8891
node bin/skill-lint.js init
8992
```
9093

94+
## πŸ†• Harness Audit Command
95+
96+
**Run comprehensive statistical analysis of harness performance** with multiple iterations, baseline comparisons, and detailed reports.
97+
98+
### Quick Start
99+
100+
```bash
101+
# Basic audit (single run)
102+
node bin/skill-lint.js audit ../skills/my-skill
103+
104+
# Statistical audit (10 iterations for confidence)
105+
node bin/skill-lint.js audit ../skills/my-skill --iterations 10
106+
107+
# Generate markdown report
108+
node bin/skill-lint.js audit ../skills/my-skill --format markdown --output reports/audit.md
109+
110+
# Generate HTML report
111+
node bin/skill-lint.js audit ../skills/my-skill --format html --output reports/audit.html
112+
113+
# Compare with baseline
114+
node bin/skill-lint.js audit ../skills/my-skill --baseline baselines/previous-audit.json
115+
```
116+
117+
### What It Measures
118+
119+
The audit command runs the harness validator multiple times and provides:
120+
121+
1. **Statistical Analysis**
122+
- Mean, median, std dev, min/max for accuracy, latency, and token usage
123+
- 95% confidence intervals
124+
- Variance analysis for reliability assessment
125+
126+
2. **Quality Assessment**
127+
- Letter grade (A-F) based on performance
128+
- Quality score (0-100)
129+
- Specific issues and recommendations
130+
- Pass/fail status
131+
132+
3. **Cost Tracking**
133+
- Total token usage across all iterations
134+
- Estimated cost (Claude Sonnet 4.6 pricing)
135+
- Cost per successful test
136+
137+
4. **Baseline Comparison** (optional)
138+
- Compare against historical performance
139+
- Track accuracy improvements/regressions
140+
- Monitor latency and token efficiency changes
141+
142+
### Options
143+
144+
| Option | Description | Default |
145+
|--------|-------------|---------|
146+
| `-i, --iterations <number>` | Number of iterations to run | `1` |
147+
| `-f, --format <format>` | Output format: text, markdown, html, json | `text` |
148+
| `-o, --output <path>` | Save report to file | - |
149+
| `--baseline <path>` | Compare against historical baseline (JSON) | - |
150+
| `--confidence <level>` | Confidence level for statistical tests | `0.95` |
151+
| `-b, --benchmark` | Include performance benchmarking | `false` |
152+
153+
### Example Output
154+
155+
```
156+
═══════════════════════════════════════════════════════════════════
157+
πŸ” HARNESS AUDIT REPORT: ui5-lint
158+
═══════════════════════════════════════════════════════════════════
159+
160+
πŸ“Š Summary
161+
Skill: ui5-lint
162+
Iterations: 5
163+
Total Duration: 62.34s
164+
Timestamp: 2026-05-28T10:30:00.000Z
165+
166+
πŸ“ˆ Aggregated Metrics
167+
Total Tests: 45
168+
Passed: 38
169+
Failed: 7
170+
Overall Accuracy: 84.4%
171+
Total Tokens: 20,450
172+
Total Cost: $0.1841
173+
174+
πŸ“Š Statistical Analysis
175+
176+
Accuracy:
177+
Mean: 84.4%
178+
Median: 85.0%
179+
Std Dev: 3.2%
180+
Range: [80.0%, 88.0%]
181+
95% CI: [82.1%, 86.7%]
182+
183+
Latency:
184+
Mean: 2134ms
185+
Median: 2100ms
186+
Std Dev: 245ms
187+
Range: [1800ms, 2500ms]
188+
189+
Token Usage:
190+
Mean: 4090
191+
Median: 4050
192+
Std Dev: 180
193+
Range: [3800, 4350]
194+
195+
βœ… Quality Assessment
196+
Grade: B
197+
Score: 85/100
198+
Status: βœ… PASSED
199+
200+
Recommendations:
201+
πŸ’‘ Consider adding more specific trigger keywords for higher accuracy
202+
πŸ’‘ Skill performs consistently across iterations (low variance)
203+
204+
πŸ“‰ Baseline Comparison
205+
Accuracy: πŸ“ˆ +4.2%
206+
Latency: πŸ“ˆ -340ms
207+
Tokens: πŸ“ˆ -120
208+
Overall: βœ… IMPROVED
209+
210+
═══════════════════════════════════════════════════════════════════
211+
```
212+
213+
### Use Cases
214+
215+
| Scenario | Command | Duration |
216+
|----------|---------|----------|
217+
| **Quick validation** | `audit skill` | ~1-2 min |
218+
| **Pre-release check** | `audit skill -i 5` | ~5-10 min |
219+
| **Statistical confidence** | `audit skill -i 10` | ~10-20 min |
220+
| **Track improvements** | `audit skill --baseline previous.json` | ~1-2 min |
221+
| **Generate report** | `audit skill -f html -o report.html` | ~1-2 min |
222+
223+
### Best Practices
224+
225+
1. **During Development**: Use single iteration (`audit skill`) for quick feedback
226+
2. **Before Release**: Run 5-10 iterations for statistical confidence
227+
3. **Track Progress**: Save JSON baselines and compare over time
228+
4. **CI/CD Integration**: Use JSON format for automated quality gates
229+
5. **Documentation**: Generate HTML reports for stakeholder reviews
230+
91231
## πŸ†• Automatic Keyword Extraction
92232

93233
**No more manual trigger-cases.json creation!** The `analyze` command reads your skill and suggests trigger keywords automatically.

0 commit comments

Comments
Β (0)