Skip to content

Commit f5a3654

Browse files
committed
fix: Bugs
Fixes all HIGH, MEDIUM, and LOW priority issues from comprehensive code review. HIGH Priority Fixes (6): - Remove unused --benchmark flag from CLI, types, and docs - Fix type safety: replace unsafe type assertions with runtime type guards - Add version tracking to pricing constants (model, effective date) - Remove unused totalCost field from iteration metadata - Add baseline JSON structure validation - Fix division by zero: handle zero test cases explicitly MEDIUM Priority Fixes (8): - Fix extractMetric: include legitimate zero values (e.g., 0% accuracy) - Add --iterations validation (1-1000 range, reject NaN) - Add --confidence validation (0-1 range) - Document t-score approximation formula with JSDoc - Extract grade thresholds to named constants with documentation - Add baseline age validation (warns after 30 days) - Add baseline iteration count comparison warnings - Add comprehensive JSDoc to all public APIs LOW Priority Fixes (3): - Mark README example output as sample data - Add readonly modifiers to constants (as const) - Add Statistical Notes section to README Code Quality Improvements: - calculateGrade() function for maintainable grade logic - getNumber() helper for safe type coercion - Comprehensive JSDoc on auditCommand, statistics functions - Better error messages for validation failures Documentation: - Added statistical notes section explaining: * t-distribution vs z-scores usage * Supported confidence levels * T-score approximation accuracy * Zero value handling * Baseline staleness warnings * Cost calculation methodology All 512 tests passing, TypeScript compilation successful
1 parent e1ecfa2 commit f5a3654

5 files changed

Lines changed: 196 additions & 34 deletions

File tree

plugins/ui5/skill-lint/README.md

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -148,10 +148,11 @@ The audit command runs the harness validator multiple times and provides:
148148
| `-o, --output <path>` | Save report to file | - |
149149
| `--baseline <path>` | Compare against historical baseline (JSON) | - |
150150
| `--confidence <level>` | Confidence level for statistical tests | `0.95` |
151-
| `-b, --benchmark` | Include performance benchmarking | `false` |
152151

153152
### Example Output
154153

154+
> **Note:** The following is sample output for illustration purposes.
155+
155156
```
156157
═══════════════════════════════════════════════════════════════════
157158
🔍 HARNESS AUDIT REPORT: ui5-lint
@@ -228,6 +229,16 @@ The audit command runs the harness validator multiple times and provides:
228229
4. **CI/CD Integration**: Use JSON format for automated quality gates
229230
5. **Documentation**: Generate HTML reports for stakeholder reviews
230231

232+
### Statistical Notes
233+
234+
- **Confidence Intervals**: Uses Student's t-distribution for small samples (n ≤ 30) and z-scores for large samples (Central Limit Theorem)
235+
- **Supported confidence levels**: 0.95 (95%, default) and 0.99 (99%)
236+
- **T-score approximation**: For small samples, uses formula t ≈ z + c/√n (accurate to within 0.1 for n ≥ 5)
237+
- **Minimum iterations**: 1 (single iteration returns point estimates without confidence intervals)
238+
- **Zero values**: Legitimate zero values (e.g., 0% accuracy) are included in statistics
239+
- **Baseline staleness**: Warnings issued for baselines older than 30 days
240+
- **Cost calculation**: Based on Claude Sonnet 4.6 pricing ($3/1M input, $15/1M output tokens, effective 2026-05-01)
241+
231242
## 🆕 Automatic Keyword Extraction
232243

233244
**No more manual trigger-cases.json creation!** The `analyze` command reads your skill and suggests trigger keywords automatically.

plugins/ui5/skill-lint/src/cli/commands/audit.ts

Lines changed: 131 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -29,15 +29,44 @@ import type {
2929
} from '../../types/audit-types.js';
3030
import type { ValidationResult } from '../../types/index.js';
3131

32-
// Claude Sonnet 4.6 pricing (USD per 1M tokens)
33-
const CLAUDE_SONNET_INPUT_COST_PER_MILLION = 3;
34-
const CLAUDE_SONNET_OUTPUT_COST_PER_MILLION = 15;
35-
const CLAUDE_SONNET_AVERAGE_COST_PER_MILLION = 9; // Approximate average
36-
const TOKENS_PER_MILLION = 1_000_000;
32+
/**
33+
* Claude Sonnet 4.6 pricing configuration
34+
* @see https://www.anthropic.com/pricing
35+
*/
36+
const PRICING_CONFIG = {
37+
model: 'claude-sonnet-4.6',
38+
inputCostPerMillion: 3, // USD per 1M input tokens
39+
outputCostPerMillion: 15, // USD per 1M output tokens
40+
averageCostPerMillion: 9, // Approximate average (assuming 50/50 input/output)
41+
effectiveDate: '2026-05-01',
42+
} as const;
43+
44+
const TOKENS_PER_MILLION = 1_000_000 as const;
45+
46+
/**
47+
* Grade thresholds based on industry-standard academic grading scale
48+
*/
49+
const GRADE_THRESHOLDS = {
50+
A: 90, // Excellent: ready for production
51+
B: 80, // Good: minor improvements needed
52+
C: 70, // Acceptable: significant improvements recommended
53+
D: 60, // Poor: major issues must be addressed
54+
F: 0, // Failing: unsuitable for production
55+
} as const;
56+
57+
/**
58+
* Calculate letter grade from numerical score
59+
*/
60+
function calculateGrade(score: number): 'A' | 'B' | 'C' | 'D' | 'F' {
61+
if (score >= GRADE_THRESHOLDS.A) return 'A';
62+
if (score >= GRADE_THRESHOLDS.B) return 'B';
63+
if (score >= GRADE_THRESHOLDS.C) return 'C';
64+
if (score >= GRADE_THRESHOLDS.D) return 'D';
65+
return 'F';
66+
}
3767

3868
const DEFAULT_AUDIT_CONFIG: AuditConfig = {
3969
iterations: 1,
40-
benchmark: false,
4170
format: 'text',
4271
confidenceLevel: 0.95,
4372
thresholds: {
@@ -47,6 +76,32 @@ const DEFAULT_AUDIT_CONFIG: AuditConfig = {
4776
},
4877
};
4978

79+
/**
80+
* Run comprehensive harness audit with statistical analysis
81+
*
82+
* Executes the harness validator multiple times and provides detailed
83+
* statistical analysis including mean, median, standard deviation, and
84+
* confidence intervals for accuracy, latency, and token usage.
85+
*
86+
* @param skillPath - Absolute or relative path to skill directory or SKILL.md
87+
* @param options - Audit configuration options
88+
* @returns Exit code: 0 for pass, 1 for fail, 2 for execution error
89+
* @throws {Error} When skill path is invalid or cannot be loaded
90+
*
91+
* @example
92+
* ```typescript
93+
* // Basic audit
94+
* const exitCode = await auditCommand('../skills/my-skill');
95+
*
96+
* // Statistical confidence with 10 iterations
97+
* const exitCode = await auditCommand('../skills/my-skill', {
98+
* iterations: 10,
99+
* format: 'html',
100+
* output: 'reports/audit.html',
101+
* baseline: 'baselines/v1.0.0.json'
102+
* });
103+
* ```
104+
*/
50105
export async function auditCommand(
51106
skillPath: string,
52107
options: AuditOptions = {},
@@ -94,9 +149,22 @@ export async function auditCommand(
94149
if (config.baseline) {
95150
try {
96151
const data = await readFile(config.baseline, 'utf-8');
97-
baselineData = JSON.parse(data) as AuditResult;
152+
const parsed = JSON.parse(data);
153+
154+
// Validate baseline structure
155+
if (!parsed.statistics?.accuracy?.mean ||
156+
!parsed.statistics?.latency?.mean ||
157+
!parsed.statistics?.tokenUsage?.mean ||
158+
!parsed.timestamp) {
159+
throw new Error('Invalid baseline format: missing required statistics or timestamp');
160+
}
161+
162+
baselineData = parsed as AuditResult;
163+
Logger.info(`Loaded baseline from ${config.baseline}`);
98164
} catch (error) {
99-
Logger.warning(`Failed to load baseline: ${error instanceof Error ? error.message : String(error)}`);
165+
const message = error instanceof Error ? error.message : String(error);
166+
Logger.warning(`Failed to load baseline: ${message}`);
167+
Logger.info('Continuing without baseline comparison');
100168
}
101169
}
102170

@@ -130,34 +198,43 @@ export async function auditCommand(
130198
}
131199
}
132200

201+
/**
202+
* Extract harness metadata from validation result with runtime type checking
203+
*/
133204
function extractHarnessMetadata(result: ValidationResult): HarnessIterationMetadata | undefined {
134205
const metadata = result.metrics;
135206
if (!metadata) return undefined;
136207

208+
const getNumber = (value: unknown): number => {
209+
return typeof value === 'number' && !isNaN(value) ? value : 0;
210+
};
211+
137212
return {
138-
totalCases: (metadata.totalCases as number) ?? 0,
139-
passed: (metadata.passed as number) ?? 0,
140-
failed: (metadata.failed as number) ?? 0,
141-
accuracy: (metadata.accuracy as number) ?? 0,
142-
totalTokens: (metadata.totalTokens as number) ?? 0,
143-
averageLatency: (metadata.averageLatency as number) ?? 0,
144-
totalCost: 0,
213+
totalCases: getNumber(metadata.totalCases),
214+
passed: getNumber(metadata.passed),
215+
failed: getNumber(metadata.failed),
216+
accuracy: getNumber(metadata.accuracy),
217+
totalTokens: getNumber(metadata.totalTokens),
218+
averageLatency: getNumber(metadata.averageLatency),
145219
};
146220
}
147221

148222
/**
149223
* Extract a specific metric from iterations
150224
* @param iterations - Array of audit iterations
151225
* @param key - Metadata key to extract
152-
* @returns Array of non-zero values for the metric
226+
* @returns Array of metric values (includes zeros, excludes missing metadata)
153227
*/
154228
function extractMetric(
155229
iterations: readonly AuditIteration[],
156230
key: keyof HarnessIterationMetadata
157231
): number[] {
158232
return iterations
159-
.map(it => (it.harnessMetadata?.[key] as number) ?? 0)
160-
.filter(val => val > 0);
233+
.filter(it => it.harnessMetadata !== undefined)
234+
.map(it => {
235+
const value = it.harnessMetadata![key];
236+
return typeof value === 'number' && !isNaN(value) ? value : 0;
237+
});
161238
}
162239

163240
async function computeAuditResult(
@@ -171,7 +248,7 @@ async function computeAuditResult(
171248
const latencies = extractMetric(iterations, 'averageLatency');
172249
const tokens = extractMetric(iterations, 'totalTokens');
173250

174-
const costs = tokens.map(t => (t / TOKENS_PER_MILLION) * CLAUDE_SONNET_AVERAGE_COST_PER_MILLION);
251+
const costs = tokens.map(t => (t / TOKENS_PER_MILLION) * PRICING_CONFIG.averageCostPerMillion);
175252

176253
const statistics = {
177254
accuracy: summarize(accuracies, config.confidenceLevel),
@@ -212,6 +289,9 @@ async function computeAuditResult(
212289
return result;
213290
}
214291

292+
/**
293+
* Assess audit quality and assign grade based on performance metrics
294+
*/
215295
function assessQuality(
216296
statistics: AuditResult['statistics'],
217297
aggregated: AuditResult['aggregated'],
@@ -221,6 +301,18 @@ function assessQuality(
221301
const recommendations: string[] = [];
222302
let score = 100;
223303

304+
// Handle edge case: no tests executed
305+
if (aggregated.totalTests === 0) {
306+
issues.push('No test cases were executed');
307+
return {
308+
grade: 'F',
309+
score: 0,
310+
passed: false,
311+
issues,
312+
recommendations: ['Ensure harness validator has valid test cases'],
313+
};
314+
}
315+
224316
if (statistics.accuracy.mean < config.thresholds.minAccuracy) {
225317
issues.push(`Accuracy ${statistics.accuracy.mean.toFixed(1)}% is below ${config.thresholds.minAccuracy}% threshold`);
226318
recommendations.push('Improve skill description and trigger keywords');
@@ -236,7 +328,7 @@ function assessQuality(
236328
score -= 20;
237329
}
238330

239-
const tokensPerTest = aggregated.totalTokens / Math.max(1, aggregated.totalTests);
331+
const tokensPerTest = aggregated.totalTokens / aggregated.totalTests;
240332
if (tokensPerTest > config.thresholds.maxTokensPerTest) {
241333
issues.push(`Average tokens per test ${tokensPerTest.toFixed(0)} exceeds ${config.thresholds.maxTokensPerTest} threshold`);
242334
recommendations.push('Reduce skill content size or use reference files');
@@ -249,8 +341,8 @@ function assessQuality(
249341
score -= 10;
250342
}
251343

252-
const grade = score >= 90 ? 'A' : score >= 80 ? 'B' : score >= 70 ? 'C' : score >= 60 ? 'D' : 'F';
253-
const passed = score >= 70 && statistics.accuracy.mean >= config.thresholds.minAccuracy;
344+
const grade = calculateGrade(score);
345+
const passed = score >= GRADE_THRESHOLDS.C && statistics.accuracy.mean >= config.thresholds.minAccuracy;
254346

255347
return {
256348
grade,
@@ -261,10 +353,27 @@ function assessQuality(
261353
};
262354
}
263355

356+
/**
357+
* Compare current audit results with historical baseline
358+
* Warns if baseline is stale or iteration counts differ significantly
359+
*/
264360
function compareWithBaseline(
265361
current: AuditResult,
266362
baseline: AuditResult,
267363
): AuditResult['baseline'] {
364+
// Warn if baseline is old (> 30 days)
365+
const baselineDate = new Date(baseline.timestamp);
366+
const daysSinceBaseline = (Date.now() - baselineDate.getTime()) / (1000 * 60 * 60 * 24);
367+
368+
if (daysSinceBaseline > 30) {
369+
Logger.warning(`Baseline is ${Math.floor(daysSinceBaseline)} days old - consider updating`);
370+
}
371+
372+
// Warn if iteration counts differ significantly
373+
if (Math.abs(current.iterations.length - baseline.iterations.length) > 2) {
374+
Logger.warning('Baseline used different iteration count - comparison may be less reliable');
375+
}
376+
268377
const accuracyDelta = current.statistics.accuracy.mean - baseline.statistics.accuracy.mean;
269378
const latencyDelta = current.statistics.latency.mean - baseline.statistics.latency.mean;
270379
const tokenDelta = current.statistics.tokenUsage.mean - baseline.statistics.tokenUsage.mean;
@@ -283,7 +392,6 @@ async function buildAuditConfig(options: AuditOptions): Promise<AuditConfig> {
283392
return {
284393
...DEFAULT_AUDIT_CONFIG,
285394
iterations: options.iterations ?? DEFAULT_AUDIT_CONFIG.iterations,
286-
benchmark: options.benchmark ?? DEFAULT_AUDIT_CONFIG.benchmark,
287395
format: options.format ?? DEFAULT_AUDIT_CONFIG.format,
288396
output: options.output,
289397
baseline: options.baseline,

plugins/ui5/skill-lint/src/cli/index.ts

Lines changed: 17 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -74,20 +74,33 @@ export function createCLI(): Command {
7474
.description('Run comprehensive harness audit with statistical analysis')
7575
.argument('<path>', 'Path to skill directory or SKILL.md')
7676
.option('-i, --iterations <number>', 'Number of iterations to run', '1')
77-
.option('-b, --benchmark', 'Include performance benchmarking')
7877
.option('-f, --format <format>', 'Output format: text, markdown, html, json', 'text')
7978
.option('-o, --output <path>', 'Save audit report to file')
8079
.option('--baseline <path>', 'Compare against historical baseline (JSON file)')
8180
.option('--confidence <level>', 'Confidence level for statistical tests', '0.95')
8281
.action(async (path: string, options) => {
8382
const { auditCommand } = await import('./commands/audit.js');
83+
84+
// Validate iterations parameter
85+
const iterations = parseInt(options.iterations, 10);
86+
if (isNaN(iterations) || iterations < 1 || iterations > 1000) {
87+
console.error('Error: --iterations must be a number between 1 and 1000');
88+
process.exit(2);
89+
}
90+
91+
// Validate confidence level
92+
const confidenceLevel = parseFloat(options.confidence);
93+
if (isNaN(confidenceLevel) || confidenceLevel <= 0 || confidenceLevel >= 1) {
94+
console.error('Error: --confidence must be a number between 0 and 1 (e.g., 0.95)');
95+
process.exit(2);
96+
}
97+
8498
const exitCode = await auditCommand(path, {
85-
iterations: parseInt(options.iterations, 10),
86-
benchmark: options.benchmark,
99+
iterations,
87100
format: options.format,
88101
output: options.output,
89102
baseline: options.baseline,
90-
confidenceLevel: parseFloat(options.confidence),
103+
confidenceLevel,
91104
});
92105
process.exit(exitCode);
93106
});

plugins/ui5/skill-lint/src/types/audit-types.ts

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -8,8 +8,6 @@ import type { ValidationResult, LintResult } from './index.js';
88
export interface AuditOptions {
99
/** Number of iterations to run (default: 1) */
1010
iterations?: number;
11-
/** Include performance benchmarking */
12-
benchmark?: boolean;
1311
/** Output format: text, markdown, html, json */
1412
format?: 'text' | 'markdown' | 'html' | 'json';
1513
/** Path to save audit report */
@@ -34,7 +32,6 @@ export interface HarnessIterationMetadata {
3432
readonly accuracy: number;
3533
readonly totalTokens: number;
3634
readonly averageLatency: number;
37-
readonly totalCost: number;
3835
}
3936

4037
export interface StatisticalSummary {
@@ -87,7 +84,6 @@ export interface AuditResult {
8784

8885
export interface AuditConfig {
8986
readonly iterations: number;
90-
readonly benchmark: boolean;
9187
readonly format: 'text' | 'markdown' | 'html' | 'json';
9288
readonly output?: string;
9389
readonly baseline?: string;

0 commit comments

Comments
 (0)