
Commit 4de0643

docs(skills): add evaluation skill documentation
1 parent 18068b9 commit 4de0643

10 files changed

Lines changed: 903 additions & 0 deletions


skills/cost-tracking/skill.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# Skill: Cost Tracking

## What It Is

Cost tracking calculates per-task and per-trajectory expenses, including LLM API costs, tool invocation costs, and judge evaluation costs. It enforces budgets and provides cost optimization insights.

## Why It Matters

- **Budget Control**: Prevent runaway API costs
- **Cost Optimization**: Identify expensive patterns
- **ROI Analysis**: Measure cost vs. quality tradeoffs
- **Alerting**: Get notified before budgets are exceeded

## How to Use It

### Track Costs

```bash
npx agent-eval-harness eval trajectories/*.jsonl \
  --budget 10.00 \
  --output results/
```

### Cost Breakdown

```typescript
import { calculateTrajectoryCost } from 'agent-eval-harness';

// Rates look like USD per 1M tokens; confirm the unit the harness expects.
const pricing = {
  'claude-opus': { input: 15.00, output: 75.00 },
  'gpt-4-turbo': { input: 10.00, output: 30.00 },
};

const breakdown = await calculateTrajectoryCost('trajectories/run.jsonl', pricing);

// Total first, then the three cost categories that make it up.
console.log(`Total Cost: $${breakdown.total_cost}`);
console.log(`LLM Calls: $${breakdown.llm_calls}`);
console.log(`Tool Invocations: $${breakdown.tool_invocations}`);
console.log(`Judge Evaluations: $${breakdown.judge_evaluations}`);
```
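
The pricing table above pairs an input rate and an output rate with each model. As a rough sketch of how such rates could turn into a dollar figure for one LLM call, the snippet below assumes the rates are USD per one million tokens and that each call records its token counts. `ModelPricing`, `TokenUsage`, and `llmCallCost` are hypothetical names for illustration, not part of the harness's documented API.

```typescript
// Hypothetical types: the harness's real trajectory schema may differ.
interface ModelPricing {
  input: number;  // assumed: USD per 1M input tokens
  output: number; // assumed: USD per 1M output tokens
}

interface TokenUsage {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

const TOKENS_PER_UNIT = 1_000_000;

// Cost of one LLM call. Throws when the model has no pricing entry so that
// a missing price is never silently counted as zero.
function llmCallCost(usage: TokenUsage, pricing: Record<string, ModelPricing>): number {
  const rates = pricing[usage.model];
  if (rates === undefined) {
    throw new Error(`No pricing configured for model "${usage.model}"`);
  }
  return (
    (usage.inputTokens / TOKENS_PER_UNIT) * rates.input +
    (usage.outputTokens / TOKENS_PER_UNIT) * rates.output
  );
}

const examplePricing: Record<string, ModelPricing> = {
  'claude-opus': { input: 15.0, output: 75.0 },
};

// 2,000 input tokens and 500 output tokens: 0.03 + 0.0375 USD.
const exampleCost = llmCallCost(
  { model: 'claude-opus', inputTokens: 2_000, outputTokens: 500 },
  examplePricing,
);
console.log(exampleCost.toFixed(4)); // 0.0675
```

Failing loudly on an unknown model is a deliberate choice in this sketch: an unpriced model would otherwise make a trajectory look cheaper than it was.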

### Budget Alerts

```typescript
import { checkBudget, createBudget } from 'agent-eval-harness';

const budget = createBudget({
  per_task: 0.05,
  per_trajectory: 1.00,
  daily: 100.00,
  // Alert thresholds are fractions of the budget (0.5 = 50% used).
  alerts: [
    { threshold: 0.5, action: 'log' },
    { threshold: 0.75, action: 'notify' },
    { threshold: 0.9, action: 'block' },
  ],
});

// Amount spent so far, in USD (placeholder value for illustration).
const currentSpend = 42.50;

const status = await checkBudget(currentSpend, budget);
if (!status.within_budget) {
  console.warn(`Budget exceeded: ${status.percentage}% used`);
}
```
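
The alert list above maps budget fractions to actions. One plausible way to resolve it, sketched below, is to pick the action attached to the highest threshold that current spend has reached. `resolveAlert` and its result shape are assumptions made for this example; the harness's `checkBudget` may behave differently.

```typescript
type AlertAction = 'log' | 'notify' | 'block';

interface BudgetAlert {
  threshold: number; // assumed: fraction of the limit, e.g. 0.75 = 75%
  action: AlertAction;
}

interface AlertStatus {
  percentage: number;         // spend as a percentage of the limit
  withinBudget: boolean;      // false once spend exceeds the limit
  action: AlertAction | null; // action of the highest threshold reached
}

// Hypothetical helper: resolves which alert (if any) applies to the spend.
function resolveAlert(spend: number, limit: number, alerts: BudgetAlert[]): AlertStatus {
  if (limit <= 0) {
    throw new Error('Budget limit must be positive');
  }
  const used = spend / limit;
  let triggered: BudgetAlert | null = null;
  for (const alert of alerts) {
    const reached = used >= alert.threshold;
    if (reached && (triggered === null || alert.threshold > triggered.threshold)) {
      triggered = alert;
    }
  }
  return {
    percentage: used * 100,
    withinBudget: spend <= limit,
    action: triggered === null ? null : triggered.action,
  };
}

const exampleAlerts: BudgetAlert[] = [
  { threshold: 0.5, action: 'log' },
  { threshold: 0.75, action: 'notify' },
  { threshold: 0.9, action: 'block' },
];

// 80 USD of a 100 USD daily budget passes the 0.5 and 0.75 thresholds.
console.log(resolveAlert(80, 100, exampleAlerts).action); // notify
```

In this sketch the `block` action fires at 90% usage, before the limit is actually exceeded, which matches the goal of being alerted ahead of an overrun.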

## Key Metrics

| Metric | Description | Unit |
|--------|-------------|------|
| `total_cost` | Total evaluation cost | USD |
| `cost_per_task` | Average cost per task | USD |
| `cost_per_trajectory` | Average cost per trajectory | USD |
| `budget_percentage` | Budget utilization | % |
| `llm_cost` | LLM API costs | USD |
| `tool_cost` | Tool invocation costs | USD |
| `judge_cost` | LLM judge costs | USD |
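
To make the relationship between these metrics concrete, the sketch below derives most of them from a list of per-task cost records. The `TaskCost` record shape and the `summarizeCosts` helper are hypothetical; `cost_per_trajectory` is left out because it depends on how tasks are grouped into trajectories, which this document does not specify.

```typescript
// Hypothetical per-task record; the harness's real result schema may differ.
interface TaskCost {
  llm: number;   // USD spent on LLM API calls
  tool: number;  // USD spent on tool invocations
  judge: number; // USD spent on LLM-judge evaluations
}

interface CostSummary {
  total_cost: number;
  cost_per_task: number;
  budget_percentage: number;
  llm_cost: number;
  tool_cost: number;
  judge_cost: number;
}

function summarizeCosts(tasks: TaskCost[], budget: number): CostSummary {
  let llm_cost = 0;
  let tool_cost = 0;
  let judge_cost = 0;
  for (const task of tasks) {
    llm_cost += task.llm;
    tool_cost += task.tool;
    judge_cost += task.judge;
  }
  const total_cost = llm_cost + tool_cost + judge_cost;
  return {
    total_cost,
    // Guard against division by zero for an empty run or an unset budget.
    cost_per_task: tasks.length > 0 ? total_cost / tasks.length : 0,
    budget_percentage: budget > 0 ? (total_cost / budget) * 100 : 0,
    llm_cost,
    tool_cost,
    judge_cost,
  };
}

const summary = summarizeCosts(
  [
    { llm: 0.5, tool: 0.25, judge: 0.25 },
    { llm: 1.5, tool: 0.25, judge: 0.25 },
  ],
  12, // budget in USD
);
console.log(summary.total_cost, summary.cost_per_task, summary.budget_percentage); // 3 1.5 25
```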

## Best Practices

1. **Set budget limits**: Define per-task, per-trajectory, and daily budgets
2. **Track all costs**: Include LLM, tools, and judge evaluations
3. **Monitor trends**: Watch for cost increases over time
4. **Optimize judge usage**: Use cheaper models for simple evaluations
5. **Set alerts**: Get notified at 50%, 75%, and 90% budget usage

## Common Pitfalls

- **Ignoring judge costs**: LLM-as-judge can be expensive at scale
- **No budget limits**: Costs can spiral without enforcement
- **Missing token counts**: Always track input and output tokens
- **No cost breakdown**: Understand where costs come from

## Related Skills

- [Trajectory Evaluation](../trajectory-eval/skill.md)
- [LLM Judge](../llm-judge-calibrated/skill.md)
- [Eval Gating](../eval-gating/skill.md)

skills/eval-gating/skill.md

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
# Skill: Eval Gating

## What It Is

Eval gating uses evaluation results to make pass/fail decisions in CI/CD pipelines. It checks metrics against thresholds and baselines, blocking deployments when quality standards aren't met.

## Why It Matters

- **Quality Gates**: Prevent regressions from reaching production
- **Automated Decisions**: Remove manual quality review bottlenecks
- **Fast Feedback**: Catch issues before merge
- **Consistent Standards**: Apply the same criteria to every change

## How to Use It

### Run Gate Evaluation

```bash
npx agent-eval-harness gate \
  --results results/eval-123.json \
  --gates gates.yaml \
  --baseline results/baseline.json
```

### Gate Configuration

```yaml
# gates.yaml
gates:
  - name: overall-quality
    type: threshold
    metric: overall_score
    operator: ">="
    threshold: 0.80

  - name: cost-per-task
    type: threshold
    metric: avg_cost_per_task
    operator: "<="
    threshold: 0.05

  - name: latency-p99
    type: threshold
    metric: latency_p99_ms
    operator: "<="
    threshold: 5000

  - name: no-regression
    type: baseline-comparison
    baseline: results/baseline.json
    metric: overall_score
    allow_regression: false

  - name: tool-correctness
    type: threshold
    metric: tool_correctness_rate
    operator: ">="
    threshold: 0.95

  - name: faithfulness
    type: threshold
    metric: avg_faithfulness_score
    operator: ">="
    threshold: 0.85
```
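
The threshold gates in this configuration each compare one metric against a fixed bound. The sketch below shows one way such a check could work, under the assumption that results arrive as a flat map from metric name to number. `evaluateThresholdGates` and its types are illustrative names, not the harness's implementation, and only the two operators used in the configuration are handled.

```typescript
type Operator = '>=' | '<=';

interface ThresholdGate {
  name: string;
  metric: string;
  operator: Operator;
  threshold: number;
}

interface GateFailure {
  gate: string;
  actual: number | null; // null when the metric is absent from the results
  expected: string;
}

interface GateOutcome {
  passed: boolean;
  failures: GateFailure[];
}

// Hypothetical evaluator: every gate must hold for the outcome to pass.
function evaluateThresholdGates(
  gates: ThresholdGate[],
  metrics: Record<string, number>,
): GateOutcome {
  const failures: GateFailure[] = [];
  for (const gate of gates) {
    const actual = metrics[gate.metric];
    const expected = `${gate.operator} ${gate.threshold}`;
    if (actual === undefined || Number.isNaN(actual)) {
      // A missing metric fails the gate instead of passing silently.
      failures.push({ gate: gate.name, actual: null, expected });
      continue;
    }
    const ok = gate.operator === '>=' ? actual >= gate.threshold : actual <= gate.threshold;
    if (!ok) {
      failures.push({ gate: gate.name, actual, expected });
    }
  }
  return { passed: failures.length === 0, failures };
}

const outcome = evaluateThresholdGates(
  [
    { name: 'overall-quality', metric: 'overall_score', operator: '>=', threshold: 0.8 },
    { name: 'cost-per-task', metric: 'avg_cost_per_task', operator: '<=', threshold: 0.05 },
  ],
  { overall_score: 0.83, avg_cost_per_task: 0.07 },
);
// The quality gate holds, but the cost gate fails because 0.07 > 0.05.
console.log(outcome.passed, outcome.failures.map((f) => f.gate));
```

Treating a missing metric as a failure is a conservative choice for this sketch: a typo in a metric name then blocks the pipeline rather than letting a change through unchecked.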

### Programmatic Gate Evaluation

```typescript
import { readFileSync } from 'node:fs';
import { createGateEngine } from 'agent-eval-harness';

const engine = createGateEngine([
  { name: 'quality', metric: 'overall_score', operator: '>=', threshold: 0.80 },
  { name: 'cost', metric: 'avg_cost_per_task', operator: '<=', threshold: 0.05 },
]);

// Assumed here: the aggregated metrics are the JSON written by a prior `eval` run.
const aggregatedResults = JSON.parse(readFileSync('results/eval-123.json', 'utf8'));

const result = await engine.evaluate(aggregatedResults);

// Exit non-zero on failure so a CI job running this script is marked failed.
if (result.passed) {
  console.log('✅ All gates passed');
  process.exit(0);
} else {
  console.log('❌ Gates failed:');
  for (const failure of result.failures) {
    console.log(`  - ${failure.gate}: ${failure.actual} (expected ${failure.expected})`);
  }
  process.exit(1);
}
```

### CI Integration

```yaml
# .github/workflows/ci.yml
- name: Run evaluation
  run: npx agent-eval-harness eval trajectories/*.jsonl --output results/

- name: Check gates
  run: |
    npx agent-eval-harness gate \
      --results results/eval.json \
      --gates gates.yaml \
      --baseline results/baseline.json
```

## Key Metrics

| Metric | Description | Typical Threshold |
|--------|-------------|-------------------|
| `overall_score` | Combined quality | >= 0.80 |
| `avg_cost_per_task` | Average task cost | <= 0.05 USD |
| `latency_p99_ms` | 99th percentile latency | <= 5000 ms |
| `tool_correctness_rate` | Tool usage accuracy | >= 0.95 |
| `avg_faithfulness_score` | Context adherence | >= 0.85 |

## Gate Types

1. **Threshold Gates**: Check metric against fixed threshold
2. **Baseline Gates**: Compare against previous run
3. **Statistical Gates**: Require statistical significance
4. **Composite Gates**: Combine multiple metrics
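
As an illustration of the baseline gate type, the sketch below compares one metric between the current run and a baseline run and fails on any regression beyond a small tolerance. The `higherIsBetter` and `tolerance` fields are hypothetical additions for this example; the configuration shown earlier only exposes `allow_regression`, which corresponds to a tolerance of zero here.

```typescript
interface BaselineGate {
  name: string;
  metric: string;
  higherIsBetter: boolean; // hypothetical: false for cost or latency metrics
  tolerance: number;       // hypothetical: absolute change treated as noise
}

function passesBaselineGate(
  gate: BaselineGate,
  current: Record<string, number>,
  baseline: Record<string, number>,
): boolean {
  const now = current[gate.metric];
  const before = baseline[gate.metric];
  if (now === undefined || before === undefined) {
    return false; // cannot show "no regression" without both values
  }
  // A positive delta means the metric moved in the "better" direction.
  const delta = gate.higherIsBetter ? now - before : before - now;
  return delta >= -gate.tolerance;
}

const baselineRun = { overall_score: 0.84, latency_p99_ms: 4000 };
const currentRun = { overall_score: 0.83, latency_p99_ms: 4200 };

const strictGate: BaselineGate = {
  name: 'no-regression',
  metric: 'overall_score',
  higherIsBetter: true,
  tolerance: 0,
};
const lenientGate: BaselineGate = { ...strictGate, tolerance: 0.02 };

console.log(passesBaselineGate(strictGate, currentRun, baselineRun));  // false
console.log(passesBaselineGate(lenientGate, currentRun, baselineRun)); // true
```

A fixed tolerance is only a crude stand-in for the statistical gate type; a real significance check would need per-sample scores rather than a single aggregate per run.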

## Best Practices

1. **Start conservative**: Set thresholds you're confident about
2. **Use multiple gates**: Cover quality, cost, and performance
3. **Update baselines**: When quality improves, raise the bar
4. **Monitor false positives**: Adjust thresholds if gates are too strict
5. **Document rationale**: Explain why each gate exists

## Common Pitfalls

- **Too many gates**: Start with critical metrics only
- **Unrealistic thresholds**: Set achievable targets
- **No baseline**: Always have something to compare against
- **Ignoring trends**: Consider directional changes

## Related Skills

- [Regression Suites](../regression-suites/skill.md)
- [Golden Trajectories](../golden-trajectories/skill.md)
- [Cost Tracking](../cost-tracking/skill.md)
- [Latency Budgets](../latency-budgets/skill.md)
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# Skill: Faithfulness Scoring

## What It Is

Faithfulness scoring measures whether an agent's response is grounded in and consistent with the provided context. It detects hallucination, fabrication, and context drift.

## Why It Matters

- **Hallucination Detection**: Catch fabricated information
- **Context Adherence**: Ensure responses use provided information
- **Trust Building**: Users need reliable, accurate responses
- **Safety**: Prevent spreading misinformation

## How to Use It

### Score Faithfulness

```bash
npx agent-eval-harness judge faithfulness \
  --context "The user's account is associated with email john@example.com. Their subscription expires on 2026-05-01." \
  --response "I've sent the password reset to john@example.com" \
  --model claude-opus
```

### Programmatic Evaluation

```typescript
import { JudgeEngine } from 'agent-eval-harness';

const engine = new JudgeEngine({
  model: 'claude-opus',
  calibration: { enabled: true },
});

// Judge a single context/response pair; call this per item to cover a batch.
const result = await engine.judge({
  type: 'faithfulness',
  context: "Account email: john@example.com",
  response: "I've emailed john@example.com",
});

console.log(`Score: ${result.score} - ${result.explanation}`);
```

## Key Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| `faithfulness_score` | Context adherence | >0.85 |
| `hallucination_rate` | Fabricated information | <0.05 |
| `context_usage` | Proper context utilization | >0.90 |

## Scoring Criteria

1. **Factual Accuracy**: All claims match the context
2. **No Fabrication**: No invented details
3. **Complete Usage**: Relevant context is used
4. **No Contradiction**: Response doesn't contradict context
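
One way to picture how these criteria could feed a numeric score is claim-level tallying: a judge labels each claim in the response, and the score is the share of claims the context supports. The sketch below assumes that approach. The verdict labels and `summarizeVerdicts` are hypothetical, the harness may score differently, and criterion 3 (complete usage) is not covered because it needs the context rather than the response's claims.

```typescript
// Hypothetical verdict labels a judge might assign to each claim in a response.
type ClaimVerdict = 'supported' | 'fabricated' | 'contradicted';

interface FaithfulnessSummary {
  score: number;              // share of claims supported by the context
  fabricatedClaims: number;   // claims with no basis in the context
  contradictedClaims: number; // claims that conflict with the context
}

function summarizeVerdicts(verdicts: ClaimVerdict[]): FaithfulnessSummary {
  let supported = 0;
  let fabricatedClaims = 0;
  let contradictedClaims = 0;
  for (const verdict of verdicts) {
    if (verdict === 'supported') {
      supported += 1;
    } else if (verdict === 'fabricated') {
      fabricatedClaims += 1;
    } else {
      contradictedClaims += 1;
    }
  }
  return {
    // A response with no checkable claims is treated as fully faithful here.
    score: verdicts.length === 0 ? 1 : supported / verdicts.length,
    fabricatedClaims,
    contradictedClaims,
  };
}

// Three claims match the context and one is invented.
const verdictSummary = summarizeVerdicts(['supported', 'supported', 'supported', 'fabricated']);
console.log(verdictSummary.score, verdictSummary.fabricatedClaims); // 0.75 1
```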

## Best Practices

1. **Use calibrated judges**: Align with human assessment
2. **Set high thresholds**: Faithfulness is critical for trust
3. **Review failures**: Analyze hallucination patterns
4. **Combine with other metrics**: Faithfulness alone isn't sufficient

## Common Pitfalls

- **Low thresholds**: Faithfulness should be near-perfect
- **Ignoring edge cases**: Check boundary conditions
- **No calibration**: Raw scores may be inflated
- **Single judge**: Use consensus for critical applications
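
For the single-judge pitfall, a simple consensus rule is to take the median of several judges' scores and flag the item for human review when the judges disagree widely. The sketch below assumes scores in the range 0 to 1. `consensusScore` and the default `maxSpread` of 0.2 are illustrative choices, not values taken from the harness.

```typescript
interface Consensus {
  median: number;
  spread: number;       // highest score minus lowest score across judges
  needsReview: boolean; // true when judges disagree by more than maxSpread
}

// Hypothetical helper: combines independent judge scores for one response.
function consensusScore(scores: number[], maxSpread = 0.2): Consensus {
  if (scores.length === 0) {
    throw new Error('At least one judge score is required');
  }
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const spread = sorted[sorted.length - 1] - sorted[0];
  return { median, spread, needsReview: spread > maxSpread };
}

// Two judges agree the response is faithful; one strongly disagrees.
const consensus = consensusScore([0.9, 0.5, 1.0]);
console.log(consensus.median, consensus.spread, consensus.needsReview); // 0.9 0.5 true
```

The median is used instead of the mean so that one outlier judge cannot drag the consensus score far from the majority view.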

## Related Skills

- [Tool-Use Validation](../tool-use-validation/skill.md)
- [Relevance Scoring](../relevance-scoring/skill.md)
- [LLM Judge](../llm-judge-calibrated/skill.md)
