
Commit 302dfd2

docs: fix API signatures and expand coverage across all 10 skill docs
- Fix evaluate(), loadFromFile(), compareAgainstGolden() signatures
- Fix monitorLatency(), calculateTrajectoryCost(), checkBudget() signatures
- Fix validateTrajectory(), JudgeCalibrator, JudgeConfig APIs
- Fix GateEngine.evaluate(), RunComparator.compare() signatures
- Add missing classes: CostTracker, LatencyTracker, GoldenCurator, SuiteRunner
- Add missing APIs: verifyResult, batchCompare, analyzeOptimization
- Add model price table, budget presets, latency presets, gate presets
- Add MCP tool references and full CI integration patterns
- Remove non-existent CLI flags and inaccurate API examples
1 parent 592a41e commit 302dfd2

10 files changed

Lines changed: 682 additions & 202 deletions

skills/cost-tracking/skill.md

Lines changed: 67 additions & 31 deletions
````diff
@@ -2,7 +2,7 @@
 
 ## What It Is
 
-Cost tracking calculates per-task and per-trajectory expenses, including LLM API costs, tool invocation costs, and judge evaluation costs. It enforces budgets and provides cost optimization insights.
+Cost tracking calculates per-task and per-trajectory expenses, including LLM API costs, tool invocation costs, and judge evaluation costs. It enforces budgets with 3-tier alert thresholds (50% log, 75% notify, 90% block) and provides cost optimization insights.
 
 ## Why It Matters
 
@@ -13,52 +13,67 @@ Cost tracking calculates per-task and per-trajectory expenses, including LLM API
 
 ## How to Use It
 
-### Track Costs
+### CLI: Eval with Budget
 
 ```bash
 npx agent-eval-harness eval trajectories/*.jsonl \
   --budget 10.00 \
   --output results/
 ```
 
-### Cost Breakdown
+### Calculate Trajectory Cost
 
 ```typescript
-import { calculateTrajectoryCost } from '@reaatech/agent-eval-harness';
+import { calculateTrajectoryCost, DEFAULT_PRICING } from '@reaatech/agent-eval-harness';
 
-const pricing = {
-  'claude-opus': { input: 15.00, output: 75.00 },
-  'gpt-4-turbo': { input: 10.00, output: 30.00 },
-};
+// Uses built-in pricing for 8 models (claude-opus, claude-sonnet, claude-haiku,
+// gpt-4-turbo, gpt-4, gpt-4-mini, gemini-pro, gemini-flash)
+const cost = calculateTrajectoryCost(trajectory, 'claude-opus');
 
-const breakdown = await calculateTrajectoryCost('trajectories/run.jsonl', pricing);
-
-console.log(`Total Cost: $${breakdown.total_cost}`);
-console.log(`LLM Calls: $${breakdown.llm_calls}`);
-console.log(`Tool Invocations: $${breakdown.tool_invocations}`);
-console.log(`Judge Evaluations: $${breakdown.judge_evaluations}`);
+console.log(`Total: $${formatCost(cost.total_cost)}`);
+console.log(`LLM Calls: $${formatCost(cost.llm_calls)}`);
+console.log(`Tool Invocations: $${formatCost(cost.tool_invocations)}`);
+console.log(`Per-turn breakdown:`, cost.per_turn);
 ```
 
-### Budget Alerts
+### Budget Enforcement
 
 ```typescript
-import { checkBudget, createBudget } from '@reaatech/agent-eval-harness';
-
-const budget = createBudget({
-  per_task: 0.05,
-  per_trajectory: 1.00,
-  daily: 100.00,
-  alerts: [
-    { threshold: 0.5, action: 'log' },
-    { threshold: 0.75, action: 'notify' },
-    { threshold: 0.9, action: 'block' },
-  ],
-});
-
-const status = await checkBudget(currentSpend, budget);
-if (!status.within_budget) {
-  console.warn(`Budget exceeded: ${status.percentage}% used`);
+import { checkBudget, createBudget, CostTracker } from '@reaatech/agent-eval-harness';
+
+// 3 budget presets: strict, moderate, lenient
+const budget = createBudget('moderate');
+
+// checkBudget(cost: CostBreakdown, budget: BudgetConfig, thresholds?)
+const status = checkBudget(cost, budget);
+
+if (!status.withinBudget) {
+  console.warn(`Budget exceeded: ${status.usagePercentage}% used`);
 }
+
+// Track cumulative costs
+const tracker = new CostTracker({ per_trajectory: 1.00, daily: 100.00 });
+tracker.recordCost(cost);
+console.log(`Daily total: $${formatCost(tracker.getDailyTotal())}`);
+```
+
+### Cost Reporting
+
+```typescript
+import {
+  generateCostReport,
+  exportToCsv,
+  exportToJson,
+  generateSummary,
+  formatCost,
+} from '@reaatech/agent-eval-harness';
+
+const report = generateCostReport(trajectories);
+console.log(formatCost(report.totalCost));
+
+const csv = exportToCsv(report);
+const json = exportToJson(report);
+const summary = generateSummary(report);
 ```
 
 ## Key Metrics
@@ -73,6 +88,27 @@ if (!status.within_budget) {
 | `tool_cost` | Tool invocation costs | USD |
 | `judge_cost` | LLM judge costs | USD |
 
+## Supported Models (DEFAULT_PRICING)
+
+| Model | Input ($/M tokens) | Output ($/M tokens) |
+|-------|-------------------|---------------------|
+| claude-opus | $15.00 | $75.00 |
+| claude-sonnet | $3.00 | $15.00 |
+| claude-haiku | $0.25 | $1.25 |
+| gpt-4-turbo | $10.00 | $30.00 |
+| gpt-4 | $30.00 | $60.00 |
+| gpt-4-mini | $0.15 | $0.60 |
+| gemini-pro | $2.50 | $7.50 |
+| gemini-flash | $0.50 | $1.50 |
+
+## Budget Presets
+
+| Preset | Per Task | Per Trajectory | Daily |
+|--------|----------|----------------|-------|
+| `strict` | $0.02 | $0.50 | $50.00 |
+| `moderate` | $0.05 | $1.00 | $100.00 |
+| `lenient` | $0.10 | $2.00 | $250.00 |
+
 ## Best Practices
 
 1. **Set budget limits** — Define per-task, per-trajectory, and daily budgets
````
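
For orientation on the tables added above, here is a back-of-the-envelope sketch of how per-million-token prices translate into per-call dollar amounts and how those amounts relate to the budget presets. The `estimateCallCost` helper and the token counts are illustrative assumptions for this sketch, not part of the harness API; the prices mirror the DEFAULT_PRICING table.

```typescript
// Hypothetical helper for illustration only; prices mirror DEFAULT_PRICING above.
interface ModelPrice {
  input: number;  // USD per 1M input tokens
  output: number; // USD per 1M output tokens
}

const CLAUDE_SONNET: ModelPrice = { input: 3.0, output: 15.0 };

function estimateCallCost(price: ModelPrice, inputTokens: number, outputTokens: number): number {
  // Prices are quoted per million tokens, so scale the raw counts down first.
  return (inputTokens / 1_000_000) * price.input + (outputTokens / 1_000_000) * price.output;
}

// A single 2,000-token-in / 800-token-out turn on claude-sonnet:
// (0.002 * $3.00) + (0.0008 * $15.00) = $0.006 + $0.012 = $0.018
const turnCost = estimateCallCost(CLAUDE_SONNET, 2_000, 800);
console.log(turnCost.toFixed(4)); // "0.0180"

// Against the `moderate` preset ($0.05 per task), roughly two such turns fit
// before the 90% "block" alert threshold would trip.
```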

skills/eval-gating/skill.md

Lines changed: 131 additions & 32 deletions
````diff
@@ -2,7 +2,7 @@
 
 ## What It Is
 
-Eval gating uses evaluation results to make pass/fail decisions in CI/CD pipelines. It checks metrics against thresholds and baselines, blocking deployments when quality standards aren't met.
+Eval gating uses evaluation results to make pass/fail decisions in CI/CD pipelines. It checks metrics against thresholds and baselines using 4 gate types (threshold, baseline-comparison, regression, custom) with 6 comparison operators. Blocks deployments when quality standards aren't met.
 
 ## Why It Matters
 
@@ -13,16 +13,15 @@ Eval gating uses evaluation results to make pass/fail decisions in CI/CD pipelin
 
 ## How to Use It
 
-### Run Gate Evaluation
+### CLI: Run Gate Check
 
 ```bash
-npx agent-eval-harness gate \
-  --results results/eval-123.json \
-  --gates gates.yaml \
-  --baseline results/baseline.json
+npx agent-eval-harness gate results/results.json \
+  --preset standard \
+  --exit-code
 ```
 
-### Gate Configuration
+### Gate Configuration (YAML)
 
 ```yaml
 # gates.yaml
@@ -64,43 +63,143 @@ gates:
     threshold: 0.85
 ```
 
+### Gate Presets
+
+Three named presets for quick setup:
+
+| Preset | Overall Quality | Cost | Latency P99 | Tool Correctness | Faithfulness |
+|--------|----------------|------|-------------|------------------|--------------|
+| **standard** | >= 0.80 | <= $0.05 | <= 5000ms | >= 0.95 | >= 0.85 |
+| **strict** | >= 0.90 | <= $0.03 | <= 3000ms | >= 0.98 | >= 0.90 |
+| **lenient** | >= 0.70 | <= $0.10 | <= 10000ms | >= 0.85 | >= 0.75 |
+
 ### Programmatic Gate Evaluation
 
 ```typescript
-import { createGateEngine } from '@reaatech/agent-eval-harness';
-
-const engine = createGateEngine([
-  { name: 'quality', metric: 'overall_score', operator: '>=', threshold: 0.80 },
-  { name: 'cost', metric: 'avg_cost_per_task', operator: '<=', threshold: 0.05 },
+import {
+  createGateEngine,
+  getStandardPreset,
+  getStrictPreset,
+  getLenientPreset,
+  CIIntegration,
+} from '@reaatech/agent-eval-harness';
+
+// Use a preset
+const presets = getStandardPreset();
+const engine = createGateEngine(presets.gates);
+
+// Or build custom gates
+const engine2 = createGateEngine([
+  { name: 'quality', type: 'threshold', metric: 'overall_score',
+    operator: '>=', threshold: 0.80 },
+  { name: 'cost', type: 'threshold', metric: 'avg_cost_per_task',
+    operator: '<=', threshold: 0.05 },
 ]);
 
-const result = await engine.evaluate(aggregatedResults);
+// evaluate() is synchronous
+const summary = engine.evaluate(aggregatedResults);
 
-if (result.passed) {
-  console.log('All gates passed');
+if (summary.overallPassed) {
+  console.log('All gates passed');
   process.exit(0);
 } else {
-  console.log('Gates failed:');
-  for (const failure of result.failures) {
-    console.log(`  - ${failure.gate}: ${failure.actual} (expected ${failure.expected})`);
+  console.log('Gates failed:');
+  for (const r of summary.results.filter(r => !r.passed)) {
+    console.log(`  ${r.name}: ${r.actualValue} (threshold: ${r.threshold})`);
   }
   process.exit(1);
 }
 ```
 
+### Custom Gate Factories
+
+```typescript
+import {
+  createOverallQualityGate,
+  createCostGate,
+  createLatencyGate,
+  createFaithfulnessGate,
+  createToolCorrectnessGate,
+  createNoRegressionGate,
+  createPassRateGate,
+  createSLAViolationsGate,
+  createImprovementGate,
+  createSignificanceGate,
+  createMetricRegressionGate,
+} from '@reaatech/agent-eval-harness';
+
+const gates = [
+  createOverallQualityGate(0.85),
+  createCostGate(0.05),
+  createLatencyGate(5000),
+  createNoRegressionGate(baselineResults, 'overall_score'),
+];
+
+const engine = createGateEngine(gates);
+```
+
 ### CI Integration
 
+```typescript
+import {
+  CIIntegration,
+  writeJUnitReport,
+  outputGitHubAnnotations,
+  setGitHubOutput,
+  exportForCI,
+} from '@reaatech/agent-eval-harness';
+
+const summary = engine.evaluate(results);
+
+// GitHub Annotations for PR
+const annotations = CIIntegration.generateGitHubAnnotations(summary);
+annotations.forEach(a => console.log(a));
+
+// JUnit XML for test reporters
+writeJUnitReport(summary, './reports/gates.xml');
+
+// GitHub Actions step outputs
+setGitHubOutput(summary);
+
+// Get CI exit code (0 = pass, 1 = failure)
+const exitCode = CIIntegration.getExitCode(summary);
+process.exit(exitCode);
+
+// Full CI export (annotations + JUnit + outputs + env vars)
+exportForCI(summary, './reports/', process.env);
+```
+
+### GitHub Actions Workflow
+
 ```yaml
-# .github/workflows/ci.yml
-- name: Run evaluation
-  run: npx agent-eval-harness eval trajectories/*.jsonl --output results/
-
-- name: Check gates
-  run: |
-    npx agent-eval-harness gate \
-      --results results/eval.json \
-      --gates gates.yaml \
-      --baseline results/baseline.json
+name: Agent Evaluation
+on:
+  pull_request:
+    branches: [main]
+jobs:
+  evaluate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Run evaluation
+        run: |
+          npx agent-eval-harness eval trajectories/*.jsonl \
+            --config eval-config.yaml \
+            --output results/
+
+      - name: Check gates
+        run: |
+          npx agent-eval-harness gate results/results.json \
+            --preset standard \
+            --exit-code
+
+      - name: Upload results
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: eval-results
+          path: results/
 ```
 
 ## Key Metrics
@@ -115,10 +214,10 @@ if (result.passed) {
 
 ## Gate Types
 
-1. **Threshold Gates** — Check metric against fixed threshold
-2. **Baseline Gates** — Compare against previous run
-3. **Statistical Gates** — Require statistical significance
-4. **Composite Gates** — Combine multiple metrics
+1. **Threshold Gates** — Check metric against fixed value with comparison operators (`>=`, `<=`, `>`, `<`, `==`, `!=`)
+2. **Baseline-Comparison Gates** — Compare against previous run with regression/improvement detection
+3. **Regression Gates** — Detect specific metric regressions from a baseline
+4. **Custom Gates** — Arbitrary evaluation functions returning pass/fail
 
 ## Best Practices
````
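
The updated Gate Types list documents custom gates, but the diff only shows threshold gates in code. As a rough sketch of the idea, the example below wires an arbitrary pass/fail function into `createGateEngine`; the `evaluate` callback field and the shape of the results object are assumptions (check the package's exported types), while the synchronous `engine.evaluate()` and the `overallPassed` summary field come from the diff above.

```typescript
import { createGateEngine } from '@reaatech/agent-eval-harness';

// Stand-in aggregated results; a real object would come from the eval run.
const aggregatedResults: any = {
  tasks: [{ cost: 0.01 }, { cost: 0.02 }, { cost: 0.09 }],
};

// Pass only if no single task cost more than 5x the average task cost.
// ASSUMPTION: custom gates accept an `evaluate` callback returning a boolean.
const noExpensiveOutliersGate: any = {
  name: 'no-expensive-outliers',
  type: 'custom',
  evaluate: (results: any) => {
    const costs: number[] = (results.tasks ?? []).map((t: any) => t.cost);
    if (costs.length === 0) return true;
    const avg = costs.reduce((a: number, b: number) => a + b, 0) / costs.length;
    return costs.every((c) => c <= 5 * avg);
  },
};

const engine = createGateEngine([noExpensiveOutliersGate]);
const summary = engine.evaluate(aggregatedResults); // synchronous, per the docs above
if (!summary.overallPassed) process.exit(1);
```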
