reaatech
diff --git a/skills/cost-tracking/skill.md b/skills/cost-tracking/skill.md
Lines changed: 95 additions & 0 deletions
diff --git a/skills/eval-gating/skill.md b/skills/eval-gating/skill.md
Lines changed: 143 additions & 0 deletions
diff --git a/skills/faithfulness-scoring/skill.md b/skills/faithfulness-scoring/skill.md
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,95 @@
+# Skill: Cost Tracking
+
+## What It Is
+
+Cost tracking calculates per-task and per-trajectory expenses, including LLM API costs, tool invocation costs, and judge evaluation costs. It enforces budgets and provides cost optimization insights.
+
+## Why It Matters
+
+- **Budget Control** — Prevent runaway API costs
+- **Cost Optimization** — Identify expensive patterns
+- **ROI Analysis** — Measure cost vs. quality tradeoffs
+- **Alerting** — Get notified before budgets are exceeded
+
+## How to Use It
+
+### Track Costs
+
+```bash
+npx agent-eval-harness eval trajectories/*.jsonl \
+  --budget 10.00 \
+  --output results/
+```
+
+### Cost Breakdown
+
+```typescript
+import { calculateTrajectoryCost } from 'agent-eval-harness';
+
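+// Per-model prices. The unit is assumed to be USD per 1M tokens; confirm against your provider's rate card.
+const pricing = {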
+  'claude-opus': { input: 15.00, output: 75.00 },
+  'gpt-4-turbo': { input: 10.00, output: 30.00 },
+};
+
+const breakdown = await calculateTrajectoryCost('trajectories/run.jsonl', pricing);
+
+console.log(`Total Cost: $${breakdown.total_cost}`);
+console.log(`LLM Calls: $${breakdown.llm_calls}`);
+console.log(`Tool Invocations: $${breakdown.tool_invocations}`);
+console.log(`Judge Evaluations: $${breakdown.judge_evaluations}`);
+```
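As a rough illustration of where a per-call LLM cost could come from, the sketch below multiplies token counts by per-model prices. It assumes the pricing figures are USD per one million tokens, which the example does not confirm. `ModelPricing`, `TokenUsage`, and `llmCallCost` are hypothetical names for illustration and are not part of `agent-eval-harness`.

```typescript
// Hypothetical types; the unit (USD per 1M tokens) is an assumption.
interface ModelPricing {
  input: number;  // USD per 1,000,000 input tokens (assumed)
  output: number; // USD per 1,000,000 output tokens (assumed)
}

interface TokenUsage {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

const TOKENS_PER_PRICING_UNIT = 1_000_000;

function llmCallCost(usage: TokenUsage, pricing: Record<string, ModelPricing>): number {
  const price = pricing[usage.model];
  if (price === undefined) {
    // Fail loudly rather than silently counting an unpriced model as free.
    throw new Error(`No pricing configured for model "${usage.model}"`);
  }
  return (
    (usage.inputTokens / TOKENS_PER_PRICING_UNIT) * price.input +
    (usage.outputTokens / TOKENS_PER_PRICING_UNIT) * price.output
  );
}

const examplePricing: Record<string, ModelPricing> = {
  'claude-opus': { input: 15.0, output: 75.0 },
};

// 2,000 input tokens and 500 output tokens on 'claude-opus':
// about 0.0675 USD (0.03 for input plus 0.0375 for output).
const exampleCost = llmCallCost(
  { model: 'claude-opus', inputTokens: 2000, outputTokens: 500 },
  examplePricing,
);
```

Throwing on an unknown model is a deliberate choice in this sketch: a missing price would otherwise understate the total without any visible signal.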
+
+### Budget Alerts
+
+```typescript
+import { checkBudget, createBudget } from 'agent-eval-harness';
+
+const budget = createBudget({
+  per_task: 0.05,
+  per_trajectory: 1.00,
+  daily: 100.00,
+  alerts: [
+    { threshold: 0.5, action: 'log' },
+    { threshold: 0.75, action: 'notify' },
+    { threshold: 0.9, action: 'block' },
+  ],
+});
+
+// `currentSpend` is the amount spent so far, in USD, supplied by the caller.
+const status = await checkBudget(currentSpend, budget);
+if (!status.within_budget) {
+  console.warn(`Budget exceeded: ${status.percentage}% used`);
+}
+```
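One plausible reading of the `alerts` array is that the highest threshold reached decides the action. The sketch below implements that reading as a standalone function. `AlertRule` and `resolveAlert` are hypothetical names, and the actual behavior of `checkBudget` may differ.

```typescript
// Hypothetical shapes mirroring the `alerts` array in the example above.
type AlertAction = 'log' | 'notify' | 'block';

interface AlertRule {
  threshold: number; // fraction of the budget, e.g. 0.75 for 75%
  action: AlertAction;
}

// Returns the action of the highest threshold that has been reached,
// or null when spend is below every threshold.
function resolveAlert(spend: number, limit: number, rules: AlertRule[]): AlertAction | null {
  if (limit <= 0) {
    throw new Error('Budget limit must be positive');
  }
  const utilization = spend / limit;
  let triggered: AlertRule | null = null;
  for (const rule of rules) {
    const reached = utilization >= rule.threshold;
    if (reached && (triggered === null || rule.threshold > triggered.threshold)) {
      triggered = rule;
    }
  }
  return triggered === null ? null : triggered.action;
}

const alertRules: AlertRule[] = [
  { threshold: 0.5, action: 'log' },
  { threshold: 0.75, action: 'notify' },
  { threshold: 0.9, action: 'block' },
];

const quietAction = resolveAlert(20, 100, alertRules);  // null: 20% used
const notifyAction = resolveAlert(80, 100, alertRules); // 'notify': 80% used
const blockAction = resolveAlert(95, 100, alertRules);  // 'block': 95% used
```

Picking the highest reached threshold, rather than the first match, keeps the result independent of the order in which rules are listed.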
+
+## Key Metrics
+
+| Metric | Description | Unit |
+|--------|-------------|------|
+| `total_cost` | Total evaluation cost | USD |
+| `cost_per_task` | Average cost per task | USD |
+| `cost_per_trajectory` | Average cost per trajectory | USD |
+| `budget_percentage` | Budget utilization | % |
+| `llm_cost` | LLM API costs | USD |
+| `tool_cost` | Tool invocation costs | USD |
+| `judge_cost` | LLM judge costs | USD |
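The metrics in the table above can be derived from per-task cost records. The following is a minimal sketch under assumed data shapes: `TaskCost`, `CostSummary`, and `summarizeCosts` are hypothetical and only reuse the metric names from the table.

```typescript
// Hypothetical per-task record; cost field names follow the metrics table.
interface TaskCost {
  trajectoryId: string;
  llm_cost: number;
  tool_cost: number;
  judge_cost: number;
}

interface CostSummary {
  total_cost: number;
  cost_per_task: number;
  cost_per_trajectory: number;
  budget_percentage: number;
}

function summarizeCosts(tasks: TaskCost[], budgetUsd: number): CostSummary {
  if (tasks.length === 0 || budgetUsd <= 0) {
    throw new Error('Need at least one task and a positive budget');
  }
  let total = 0;
  const seenTrajectories: Record<string, true> = {};
  for (const task of tasks) {
    total += task.llm_cost + task.tool_cost + task.judge_cost;
    seenTrajectories[task.trajectoryId] = true;
  }
  const trajectoryCount = Object.keys(seenTrajectories).length;
  return {
    total_cost: total,
    cost_per_task: total / tasks.length,
    cost_per_trajectory: total / trajectoryCount,
    budget_percentage: (total / budgetUsd) * 100,
  };
}

// Three tasks across two trajectories, measured against a 12 USD budget.
const costSummary = summarizeCosts(
  [
    { trajectoryId: 'run-a', llm_cost: 0.5, tool_cost: 0.25, judge_cost: 0.25 },
    { trajectoryId: 'run-a', llm_cost: 1.0, tool_cost: 0.0, judge_cost: 0.5 },
    { trajectoryId: 'run-b', llm_cost: 0.25, tool_cost: 0.0, judge_cost: 0.25 },
  ],
  12,
);
// total_cost 3, cost_per_task 1, cost_per_trajectory 1.5, budget_percentage 25
```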
+
+## Best Practices
+
+1. **Set budget limits** — Define per-task, per-trajectory, and daily budgets
+2. **Track all costs** — Include LLM, tools, and judge evaluations
+3. **Monitor trends** — Watch for cost increases over time
+4. **Optimize judge usage** — Use cheaper models for simple evaluations
+5. **Set alerts** — Get notified at 50%, 75%, and 90% budget usage
+
+## Common Pitfalls
+
+- **Ignoring judge costs** — LLM-as-judge can be expensive at scale
+- **No budget limits** — Costs can spiral without enforcement
+- **Missing token counts** — Always track input and output tokens
+- **No cost breakdown** — Understand where costs come from
+
+## Related Skills
+
+- [Trajectory Evaluation](../trajectory-eval/skill.md)
+- [LLM Judge](../llm-judge-calibrated/skill.md)
+- [Eval Gating](../eval-gating/skill.md)
@@ -0,0 +1,143 @@
+# Skill: Eval Gating
+
+## What It Is
+
+Eval gating uses evaluation results to make pass/fail decisions in CI/CD pipelines. It checks metrics against thresholds and baselines, blocking deployments when quality standards aren't met.
+
+## Why It Matters
+
+- **Quality Gates** — Prevent regressions from reaching production
+- **Automated Decisions** — Remove manual quality review bottlenecks
+- **Fast Feedback** — Catch issues before merge
+- **Consistent Standards** — Apply the same criteria to every change
+
+## How to Use It
+
+### Run Gate Evaluation
+
+```bash
+npx agent-eval-harness gate \
+  --results results/eval-123.json \
+  --gates gates.yaml \
+  --baseline results/baseline.json
+```
+
+### Gate Configuration
+
+```yaml
+# gates.yaml
+gates:
+  - name: overall-quality
+    type: threshold
+    metric: overall_score
+    operator: ">="
+    threshold: 0.80
+
+  - name: cost-per-task
+    type: threshold
+    metric: avg_cost_per_task
+    operator: "<="
+    threshold: 0.05
+
+  - name: latency-p99
+    type: threshold
+    metric: latency_p99_ms
+    operator: "<="
+    threshold: 5000
+
+  - name: no-regression
+    type: baseline-comparison
+    baseline: results/baseline.json
+    metric: overall_score
+    allow_regression: false
+
+  - name: tool-correctness
+    type: threshold
+    metric: tool_correctness_rate
+    operator: ">="
+    threshold: 0.95
+
+  - name: faithfulness
+    type: threshold
+    metric: avg_faithfulness_score
+    operator: ">="
+    threshold: 0.85
+```
+
+### Programmatic Gate Evaluation
+
+```typescript
+import { createGateEngine } from 'agent-eval-harness';
+
+const engine = createGateEngine([
+  { name: 'quality', metric: 'overall_score', operator: '>=', threshold: 0.80 },
+  { name: 'cost', metric: 'avg_cost_per_task', operator: '<=', threshold: 0.05 },
+]);
+
+const result = await engine.evaluate(aggregatedResults);
+
+if (result.passed) {
+  console.log('✅ All gates passed');
+  process.exit(0);
+} else {
+  console.log('❌ Gates failed:');
+  for (const failure of result.failures) {
+    console.log(`  - ${failure.gate}: ${failure.actual} (expected ${failure.expected})`);
+  }
+  process.exit(1);
+}
+```
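To make the pass/fail logic concrete, here is a standalone sketch of how threshold gates could be checked against a flat map of metrics. It does not use `agent-eval-harness`; `ThresholdGate`, `GateFailure`, `compare`, and `evaluateGates` are hypothetical names, and treating a missing metric as a failure is an assumption of this sketch.

```typescript
// Hypothetical shapes modelled on the gate objects in the example above.
type Operator = '>=' | '<=' | '>' | '<';

interface ThresholdGate {
  name: string;
  metric: string;
  operator: Operator;
  threshold: number;
}

interface GateFailure {
  gate: string;
  actual: number | undefined;
  expected: string;
}

function compare(actual: number, operator: Operator, threshold: number): boolean {
  switch (operator) {
    case '>=': return actual >= threshold;
    case '<=': return actual <= threshold;
    case '>': return actual > threshold;
    case '<': return actual < threshold;
  }
}

function evaluateGates(gates: ThresholdGate[], metrics: Record<string, number>) {
  const failures: GateFailure[] = [];
  for (const gate of gates) {
    const actual: number | undefined = metrics[gate.metric];
    // A missing metric fails the gate instead of passing silently.
    if (actual === undefined || !compare(actual, gate.operator, gate.threshold)) {
      failures.push({ gate: gate.name, actual, expected: `${gate.operator} ${gate.threshold}` });
    }
  }
  return { passed: failures.length === 0, failures };
}

const gateResult = evaluateGates(
  [
    { name: 'quality', metric: 'overall_score', operator: '>=', threshold: 0.8 },
    { name: 'cost', metric: 'avg_cost_per_task', operator: '<=', threshold: 0.05 },
  ],
  { overall_score: 0.83, avg_cost_per_task: 0.07 },
);
// gateResult.passed is false; the only failure is the 'cost' gate.
```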
+
+### CI Integration
+
+```yaml
+# .github/workflows/ci.yml
+- name: Run evaluation
+  run: npx agent-eval-harness eval trajectories/*.jsonl --output results/
+
+- name: Check gates
+  run: |
+    npx agent-eval-harness gate \
+      --results results/eval.json \
+      --gates gates.yaml \
+      --baseline results/baseline.json
+```
+
+## Key Metrics
+
+| Metric | Description | Typical Threshold |
+|--------|-------------|-------------------|
+| `overall_score` | Combined quality | >= 0.80 |
+| `avg_cost_per_task` | Average task cost | <= $0.05 |
+| `latency_p99_ms` | 99th percentile latency | <= 5000ms |
+| `tool_correctness_rate` | Tool usage accuracy | >= 0.95 |
+| `avg_faithfulness_score` | Context adherence | >= 0.85 |
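The 99th percentile latency in the table above depends on which percentile definition is used. The sketch below uses the nearest-rank method, which is one common choice; whether the harness uses this or an interpolating definition is not stated. `percentile` and the sample data are illustrative only.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. Other definitions interpolate.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('No samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = samples.slice().sort((a, b) => a - b);
  // Multiply before dividing so integer percentiles avoid rounding drift.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Ten made-up task latencies in milliseconds, with one slow outlier.
const latenciesMs = [120, 340, 95, 5100, 410, 230, 180, 275, 150, 300];

const p50 = percentile(latenciesMs, 50); // 230
const p99 = percentile(latenciesMs, 99); // 5100

// With the typical threshold of 5000 ms, this run would fail the latency gate.
const latencyGatePassed = p99 <= 5000; // false
```

With only ten samples the p99 is simply the maximum, so a single slow task decides the gate. Larger sample sizes make the percentile more stable.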
+
+## Gate Types
+
+1. **Threshold Gates** — Check metric against fixed threshold
+2. **Baseline Gates** — Compare against previous run
+3. **Statistical Gates** — Require statistical significance
+4. **Composite Gates** — Combine multiple metrics
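A baseline gate differs from a threshold gate in that the reference value comes from an earlier run. The sketch below shows one way such a comparison could work. `BaselineGate`, `higherIsBetter`, `tolerance`, and `passesBaseline` are hypothetical and do not correspond to fields in the gate configuration shown earlier.

```typescript
// Hypothetical baseline gate. `higherIsBetter` and `tolerance` are
// illustrative options, not fields from the gates.yaml example.
interface BaselineGate {
  metric: string;
  higherIsBetter: boolean;
  tolerance: number; // absolute slack allowed before a change counts as a regression
}

function passesBaseline(gate: BaselineGate, current: number, baseline: number): boolean {
  const delta = current - baseline;
  // For lower-is-better metrics (cost, latency) an increase is the regression.
  const regression = gate.higherIsBetter ? -delta : delta;
  return regression <= gate.tolerance;
}

const strictScoreGate: BaselineGate = { metric: 'overall_score', higherIsBetter: true, tolerance: 0 };
const latencyGate: BaselineGate = { metric: 'latency_p99_ms', higherIsBetter: false, tolerance: 250 };

const scoreDropped = passesBaseline(strictScoreGate, 0.8, 0.82);   // false: score fell
const scoreImproved = passesBaseline(strictScoreGate, 0.85, 0.82); // true: score rose
const latencyDrifted = passesBaseline(latencyGate, 4200, 4000);    // true: +200 ms is within 250 ms
```

A zero tolerance corresponds to a strict no-regression policy. A small positive tolerance can absorb run-to-run noise, at the cost of letting slow drift through.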
+
+## Best Practices
+
+1. **Start conservative** — Set thresholds you're confident about
+2. **Use multiple gates** — Cover quality, cost, and performance
+3. **Update baselines** — When quality improves, raise the bar
+4. **Monitor false positives** — Adjust thresholds if gates are too strict
+5. **Document rationale** — Explain why each gate exists
+
+## Common Pitfalls
+
+- **Too many gates** — Start with critical metrics only
+- **Unrealistic thresholds** — Set achievable targets
+- **No baseline** — Always have something to compare against
+- **Ignoring trends** — Consider directional changes
+
+## Related Skills
+
+- [Regression Suites](../regression-suites/skill.md)
+- [Golden Trajectories](../golden-trajectories/skill.md)
+- [Cost Tracking](../cost-tracking/skill.md)
+- [Latency Budgets](../latency-budgets/skill.md)
@@ -0,0 +1,77 @@
+# Skill: Faithfulness Scoring
+
+## What It Is
+
+Faithfulness scoring measures whether an agent's response is grounded in and consistent with the provided context. It detects hallucination, fabrication, and context drift.
+
+## Why It Matters
+
+- **Hallucination Detection** — Catch fabricated information
+- **Context Adherence** — Ensure responses use provided information
+- **Trust Building** — Users need reliable, accurate responses
+- **Safety** — Prevent spreading misinformation
+
+## How to Use It
+
+### Score Faithfulness
+
+```bash
+npx agent-eval-harness judge faithfulness \
+  --context "The user's account is associated with email john@example.com. Their subscription expires on 2026-05-01." \
+  --response "I've sent the password reset to john@example.com" \
+  --model claude-opus
+```
+
+### Batch Evaluation
+
+```typescript
+import { JudgeEngine } from 'agent-eval-harness';
+
+const engine = new JudgeEngine({
+  model: 'claude-opus',
+  calibration: { enabled: true },
+});
+
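+// Each case pairs the context the agent was given with its response.
+const cases = [
+  { context: 'Account email: john@example.com', response: "I've emailed john@example.com" },
+  { context: 'Subscription expires on 2026-05-01.', response: 'Your subscription expires on 2026-06-01.' },
+];
+
+// Judge the cases one at a time with the same engine.
+for (const { context, response } of cases) {
+  const result = await engine.judge({ type: 'faithfulness', context, response });
+  console.log(`Score: ${result.score} - ${result.explanation}`);
+}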
+```
+
+## Key Metrics
+
+| Metric | Description | Target |
+|--------|-------------|--------|
+| `faithfulness_score` | Context adherence | >0.85 |
+| `hallucination_rate` | Fabricated information | <0.05 |
+| `context_usage` | Proper context utilization | >0.90 |
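The targets in the table above apply to aggregates over many judged responses. As a minimal sketch, assuming each judged response carries a score in [0, 1] and a flag for fabricated content, the aggregates could be computed as follows. `JudgedResponse` and `summarizeFaithfulness` are hypothetical, and the document does not define how `hallucination_rate` is actually derived.

```typescript
// Hypothetical record for one judged response.
interface JudgedResponse {
  score: number;         // faithfulness score in [0, 1]
  hallucinated: boolean; // true if the judge flagged any fabricated claim
}

function summarizeFaithfulness(results: JudgedResponse[]) {
  if (results.length === 0) throw new Error('No judged responses');
  let scoreSum = 0;
  let hallucinatedCount = 0;
  for (const r of results) {
    scoreSum += r.score;
    if (r.hallucinated) hallucinatedCount += 1;
  }
  const faithfulness_score = scoreSum / results.length;
  const hallucination_rate = hallucinatedCount / results.length;
  return {
    faithfulness_score,
    hallucination_rate,
    // Targets taken from the table: score > 0.85, hallucination rate < 0.05.
    meetsTargets: faithfulness_score > 0.85 && hallucination_rate < 0.05,
  };
}

const cleanRun = summarizeFaithfulness([
  { score: 1, hallucinated: false },
  { score: 1, hallucinated: false },
  { score: 0.75, hallucinated: false },
  { score: 0.75, hallucinated: false },
]);
// faithfulness_score 0.875, hallucination_rate 0, meetsTargets true
```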
+
+## Scoring Criteria
+
+1. **Factual Accuracy** — All claims match the context
+2. **No Fabrication** — No invented details
+3. **Complete Usage** — Relevant context is used
+4. **No Contradiction** — Response doesn't contradict context
+
+## Best Practices
+
+1. **Use calibrated judges** — Align with human assessment
+2. **Set high thresholds** — Faithfulness is critical for trust
+3. **Review failures** — Analyze hallucination patterns
+4. **Combine with other metrics** — Faithfulness alone isn't sufficient
+
+## Common Pitfalls
+
+- **Low thresholds** — Faithfulness should be near-perfect
+- **Ignoring edge cases** — Check boundary conditions
+- **No calibration** — Raw scores may be inflated
+- **Single judge** — Use consensus for critical applications
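For the consensus approach mentioned in the last pitfall, one simple option is to take the median of several judges' scores and track how far they disagree. This is a sketch of that idea only. `consensusScore` and `judgeSpread` are hypothetical helpers, and the harness may combine judges differently.

```typescript
// Median of several judges' scores; less sensitive to one outlier judge
// than the mean.
function consensusScore(scores: number[]): number {
  if (scores.length === 0) throw new Error('Need at least one judge score');
  const sorted = scores.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Gap between the most and least generous judge; a large gap suggests
// the response deserves human review.
function judgeSpread(scores: number[]): number {
  if (scores.length === 0) throw new Error('Need at least one judge score');
  return Math.max(...scores) - Math.min(...scores);
}

const threeJudges = [0.9, 0.4, 0.85];
const medianOfThree = consensusScore(threeJudges); // 0.85
const spreadOfThree = judgeSpread(threeJudges);    // 0.5

const medianOfTwo = consensusScore([0.5, 1.0]);    // 0.75
```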
+
+## Related Skills
+
+- [Tool-Use Validation](../tool-use-validation/skill.md)
+- [Relevance Scoring](../relevance-scoring/skill.md)
+- [LLM Judge](../llm-judge-calibrated/skill.md)